From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <199712010225.MAA16617@solwarra.gbrmpa.gov.au>
Subject: Re: 32bit memcpy function? _NEW_ Tried FPU memcpy (problem with CWSDPMI)
To: xmerm05 AT manes DOT vse DOT cz (Michal Mertl)
Date: Mon, 1 Dec 1997 12:25:19 +1000 (EST)
Cc: djgpp AT delorie DOT com
In-Reply-To: <Pine.ULT.3.95.971127181022.831A-100000@dec5.vse.cz> from "Michal Mertl" at Nov 27, 97 06:17:05 pm
Content-Type: text
Precedence: bulk

> Other thing is that I tried to write memcpy using 64bit FPU registers as
> someone here suggested. It's about _20% faster_!!

If you know what the src values are and know they won't raise FP exceptions,
you can speed the code up even more by using the normal FP loads and stores
(fld/fstp) instead of the integer ones, i.e.:

	fldl	src
	fldl	src + 8
	fldl	src + 16
		...
	fxch	st7, st0
	fstpl	dest + ...
	fstpl	dest + ...

etc...

Which is 3 cycles per iteration...
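
On the caveat about the src values: as far as I know, the bit patterns to
watch for are signaling NaNs. fld quiets them (or faults, if the invalid
exception is unmasked), so what fstp writes back no longer matches the
source. fild/fistp doesn't have this problem, since any 64-bit integer fits
exactly in the 64-bit mantissa. Below is a minimal C sketch of a pre-check.
It assumes IEEE-754 doubles, and both function names are made up for
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the 64-bit pattern, read as an IEEE-754 double, is a
   signaling NaN: exponent all ones, fraction nonzero, quiet bit clear. */
static int is_signaling_nan_bits(uint64_t bits)
{
    const uint64_t exp_mask  = UINT64_C(0x7FF0000000000000);
    const uint64_t frac_mask = UINT64_C(0x000FFFFFFFFFFFFF);
    const uint64_t quiet_bit = UINT64_C(0x0008000000000000);

    return (bits & exp_mask) == exp_mask
        && (bits & frac_mask) != 0
        && (bits & quiet_bit) == 0;
}

/* Hypothetical helper: nonzero if none of the count qwords is a
   signaling NaN, i.e. the buffer looks safe for an fld/fstp copy. */
static int safe_for_fld_copy(const uint64_t *qwords, size_t count)
{
    size_t i;

    assert(qwords != NULL || count == 0);
    for (i = 0; i < count; i++)
        if (is_signaling_nan_bits(qwords[i]))
            return 0;
    return 1;
}
```

Scanning the buffer first obviously costs more than the copy saves, so this
is only useful as a debug check or to decide once whether a given kind of
data can go through the fld/fstp path.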
 
> _LoopPoint:
>         fildq    (%%eax,%%ecx)
>         fistpq   (%%ebx,%%ecx)

Have you tried unrolling this more? The fistpq right after the fildq
(IIRC) causes a stall, which you can avoid by unrolling the loop so that
all the loads come before the stores:

	fildq   src
	fildq   src + 8
	   ...
	fildq   src + 56
	fxch    st7, st0
	fistpq  dest
	   ...
	fistpq  dest + 56

etc
	
Note: the rest of your code becomes simpler too, since the offsets are
constants and you don't have to add registers to compute them...
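
To show the loop shape I mean, here is a rough C sketch: eight loads at
fixed offsets, then eight stores, with one pointer bump per 64-byte block
and a plain copy for the tail. It is only an analogue of the structure; the
real thing would have the fildq/fistpq sequence where the two inner loops
are, and the name copy_blocks64 is made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_BYTES = 64 };

/* Hypothetical name. Copies n bytes from src to dst; regions must not overlap. */
static void *copy_blocks64(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t q[8];          /* stands in for the eight FPU stack slots */
    int i;

    assert((dst != NULL && src != NULL) || n == 0);

    while (n >= BLOCK_BYTES) {
        for (i = 0; i < 8; i++)             /* loads: src, src + 8, ..., src + 56 */
            memcpy(&q[i], s + 8 * i, 8);
        for (i = 0; i < 8; i++)             /* stores: dest, dest + 8, ..., dest + 56 */
            memcpy(d + 8 * i, &q[i], 8);
        s += BLOCK_BYTES;                   /* one pointer update per block */
        d += BLOCK_BYTES;
        n -= BLOCK_BYTES;
    }
    if (n > 0)
        memcpy(d, s, n);                    /* tail shorter than one block */
    return dst;
}
```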

> Interesting thing is that it runs only 10-12% faster with cwsdpmi r3 and r4 but
> with pmode (1.2), cwsdpr0 (both r3 and r4), qdpmi (1.1 from QEMM 8.0) run the
> cpu code faster. The normal memcpy is about the same.

Using proper fld/fstp instructions you can do something like 64 byte moves
in around 24 (I think) cycles (not counting cache misses). I used it to clear
memory buffers (such as floating point Z-buffers), which was very fast for
software rendering. It just means keeping a small 64-byte zeroed memory
region which you fld from and then fstp to the frame buffer memory location...
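
In C terms the clear looks roughly like the sketch below. The names are
made up, memcpy stands in for the eight fld/fstp pairs, and it assumes
all-bits-zero is 0.0 (true for IEEE-754 floats and doubles):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ZERO_BLOCK_BYTES = 64 };

/* The small zeroed region that every block of the buffer is filled from. */
static const unsigned char zero_block[ZERO_BLOCK_BYTES] = { 0 };

/* Hypothetical name. Clears n bytes at buf, one 64-byte block at a time. */
static void clear_with_zero_block(void *buf, size_t n)
{
    unsigned char *p = buf;

    assert(buf != NULL || n == 0);

    while (n >= ZERO_BLOCK_BYTES) {
        memcpy(p, zero_block, ZERO_BLOCK_BYTES);    /* one 64-byte move */
        p += ZERO_BLOCK_BYTES;
        n -= ZERO_BLOCK_BYTES;
    }
    if (n > 0)
        memcpy(p, zero_block, n);                   /* partial last block */
}
```

For a Z-buffer you would call it as clear_with_zero_block(zbuf, sizeof zbuf)
once per frame. Since the source is always the same 64 bytes, it stays in
cache and only the stores touch new memory.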

Leathal.