From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <199712010225.MAA16617@solwarra.gbrmpa.gov.au>
Subject: Re: 32bit memcpy function? _NEW_ Tried FPU memcpy (problem with CWSDPMI)
To: xmerm05 AT manes DOT vse DOT cz (Michal Mertl)
Date: Mon, 1 Dec 1997 12:25:19 +1000 (EST)
Cc: djgpp AT delorie DOT com
In-Reply-To: from "Michal Mertl" at Nov 27, 97 06:17:05 pm
Content-Type: text
Precedence: bulk

> Other thing is that I tried to write memcpy using 64bit FPU registers as
> someone here suggested. It's about _20% faster_!!

If you know what the src values are and know they won't produce errors,
you can speed the code up even more by using the normal FP values, i.e.:

    fldl  src
    fldl  src + 8
    fldl  src + 16
    ...
    fxch  st8, st0
    fstpl dest + ...
    fstpl dest + ...
    etc...

Which is 3 cycles per iteration...

> _LoopPoint:
> fildq (%%eax,%%ecx)
> fistpq (%%ebx,%%ecx)

Have you tried unrolling this more? The fistpq right after the fildq
(IIRC) causes a stall, which can be prevented by unrolling:

    fildq  src
    fildq  src + 8
    ...
    fildq  src + 56
    fxch   st8, st0
    fistpq dest
    fistpq dest + 56
    etc

Note: the rest of your code becomes simpler too, as you don't have to
worry about adding registers to attain offsets etc...

> Interesting thing is that is run only 10-12% faster with cwsdpmi r3 and r4 but
> with pmode (1.2), cwsdpr0 (both r3 and r4), qdpmi (1.1 form QEMM 8.0) run the
> cpu code faster. The normal memcpy is about the same.

Using proper fld/fstp instructions you can do something like 64-byte
moves in around 24 (I think) cycles (not considering cache hits). I used
it to clear memory buffers (such as floating point Z-buffers), which was
very fast in software. It just means keeping a small 64-byte zeroed
memory region which can be used to fld/fstp at the frame buffer memory
location...

Leathal.
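
For reference, the unrolled fildq/fistpq idea and the zeroed-block clear described above could be sketched roughly as follows, using GCC inline assembly in C. This is a minimal sketch under assumptions, not the code from this thread: the names fpu_copy64, fpu_copy_blocks, fpu_clear_blocks and zero_block are hypothetical, the buffer length is assumed to be a multiple of 64, and no timing claim is made. Instead of fxch, the sketch stores in reverse order, because the x87 register stack is last-in, first-out. On non-x86 targets it falls back to plain memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy one 64-byte block: eight 64-bit integer loads, then eight stores.
 * Grouping the loads ahead of the stores is the unrolling suggested above.
 * fildq/fistpq round-trip every 64-bit pattern exactly, because the x87
 * extended format has a 64-bit significand. */
static void fpu_copy64(void *dst, const void *src)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile(
        "fildq   0(%1)\n\t"
        "fildq   8(%1)\n\t"
        "fildq  16(%1)\n\t"
        "fildq  24(%1)\n\t"
        "fildq  32(%1)\n\t"
        "fildq  40(%1)\n\t"
        "fildq  48(%1)\n\t"
        "fildq  56(%1)\n\t"   /* st(0) now holds the value from src + 56 */
        "fistpq 56(%0)\n\t"   /* so pop in reverse order of loading      */
        "fistpq 48(%0)\n\t"
        "fistpq 40(%0)\n\t"
        "fistpq 32(%0)\n\t"
        "fistpq 24(%0)\n\t"
        "fistpq 16(%0)\n\t"
        "fistpq  8(%0)\n\t"
        "fistpq  0(%0)\n\t"
        :
        : "r"(dst), "r"(src)
        : "memory", "st", "st(1)", "st(2)", "st(3)",
          "st(4)", "st(5)", "st(6)", "st(7)");
#else
    /* Fallback for targets without an x87 FPU (assumption: any non-x86). */
    unsigned char tmp[64];
    memcpy(tmp, src, sizeof tmp);
    memcpy(dst, tmp, sizeof tmp);
#endif
}

/* Copy n bytes in 64-byte blocks; n must be a multiple of 64. */
static void fpu_copy_blocks(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    assert(n % 64 == 0);
    for (i = 0; i < n; i += 64)
        fpu_copy64(d + i, s + i);
}

/* Clear n bytes by repeatedly copying from a small zeroed 64-byte region,
 * as described for Z-buffer clears; n must be a multiple of 64. */
static void fpu_clear_blocks(void *dst, size_t n)
{
    static const unsigned char zero_block[64]; /* zero-initialised */
    unsigned char *d = dst;
    size_t i;

    assert(n % 64 == 0);
    for (i = 0; i < n; i += 64)
        fpu_copy64(d + i, zero_block);
}
```

The integer forms are used here because they are safe for arbitrary data. The fldl/fstpl variant only suits data known to hold well-behaved doubles, since loading a signalling NaN pattern as a double changes its bits.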