Mail Archives: djgpp/1997/03/04/21:27:20

From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <199703050217.MAA15402@solwarra.gbrmpa.gov.au>
Subject: Re: Allegro perspective-correct .. (fpu memcopy)
To: nikki AT gameboutique DOT co (nikki)
Date: Wed, 5 Mar 1997 12:16:59 +1000 (EST)
Cc: djgpp AT delorie DOT com
In-Reply-To: <5fhtio$rqm@flex.uunet.pipex.com> from "nikki" at Mar 4, 97 07:35:52 pm

> well, for the benefit of the djgpp community as a whole here's the result.
> first the standard fpu memcopy which i use. this is 2 cycles faster than the
> fastest i've ever seen anywhere else (the agner fog articles) and is 100%
> accurate :
 
> asm volatile ("1:\n\t"
>               "fildq (%%esi)\n\t"             // load first qword  1 NP (2,3)
>               "fildq 8(%%esi)\n\t"            // load second qword 2 NP (3,4)
>               "addl $16,%%esi\n\t"            // update esi        3 uv
>               "addl $16,%%edi\n\t"            // update edi        3 uv
>               "fistpq -8(%%edi)\n\t"          // save 2nd qword    4 NP (-9)
>               "fistpq -16(%%edi)\n\t"         // save 1st qword   10 NP (-15)
>               "decl %%ecx\n\t"                // dec ecx          16 uv
>               "jnz 1b"                        // (loop)           16  v
>              : "+S" (scr_buf), "+D" (videoptr), "+c" (no_to_move)
>              :
>              : "memory", "cc", "st", "st(1)" );
 
> as you can see, the slow part is the fist which takes a fat 6NP :( but it
> still manages 16 bytes in 16 cycles with 1/2 the normal write misses and
> associated cache penalties.

Hmmm...you realise that if you extend this code to use all 8 FPU registers,
you can speed it up even more, performing only 2 cache loads per loop. You
can also remove the addl's by using indexed addressing to save another cycle
each loop...
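
Something like this is roughly what I mean by indexed addressing. It is only
a sketch and I haven't timed it: the name fpu_copy16, the operand names and
the negative-index trick are my own choices for illustration, and it uses
current GCC operand syntax so the compiler picks the registers. It still
moves two qwords per pass; using all 8 registers would follow the same
pattern.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `blocks` 16-byte blocks from src to dst
 * through the FPU. One index register runs from -(blocks*16) up to 0,
 * so the add that advances it also sets the zero flag for the loop
 * test; no separate pointer updates or counter are needed. */
void fpu_copy16(void *dst, const void *src, size_t blocks)
{
    if (blocks == 0)
        return;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    const char *s_end = (const char *)src + blocks * 16;
    char *d_end = (char *)dst + blocks * 16;
    ptrdiff_t i = -(ptrdiff_t)(blocks * 16);

    __asm__ volatile(
        "1:\n\t"
        "fildq (%[s],%[i])\n\t"     /* load first qword            */
        "fildq 8(%[s],%[i])\n\t"    /* load second qword           */
        "fistpq 8(%[d],%[i])\n\t"   /* store second qword (top)    */
        "fistpq (%[d],%[i])\n\t"    /* store first qword           */
        "add $16,%[i]\n\t"          /* advance index, sets ZF at 0 */
        "jnz 1b"
        : [i] "+r" (i)
        : [s] "r" (s_end), [d] "r" (d_end)
        : "memory", "cc", "st", "st(1)");
#else
    /* Assumption: on non-x86 targets just fall back to memcpy. */
    memcpy(dst, src, blocks * 16);
#endif
}
```

Call it as fpu_copy16(videoptr, scr_buf, no_to_move) with no_to_move
counted in 16-byte blocks, same as the loop above.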
 
> now the fast (and theoretically not so accurate) version i came up with.
> replace the fild and fist with fld and fst and set the flags as eli 
> described above. the result is an 8 cycle loop - twice as fast in fact.
> the disadvantage is that this is a 'lossy' form of moving data about. there
> are some sequences of numbers which cause errors and these show quite visibly
> if you're using a blitz to screen for instance. my suggestion therefore is to
> only use this for 24bit screen displays and to +-1 from the values that cause
> fpu errors so that this never happens. the result is something that's visually
> indistinguishable from what you want but twice as fast. (and 4 times faster
> than the rep stos versions) so my question really is - does anyone know which
> sequences cause fpu errors so i can avoid them? :) perhaps leath would know?

I haven't played with this at all actually, because I haven't needed to fully
optimise yet. But I will have a look at it tonight and see how I go. Have
you tried putting the FPU into double precision mode before doing this? If
you do that, the values should be stored as loaded, and no conversion should
occur. If you are using the FPU in extended precision, it might be causing
problems with the 64-80-64 bit conversion process. Reducing the precision
would probably help by avoiding that conversion...and it shouldn't run any
slower because you're still moving 8 bytes at a time...
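
If you want to try the double precision idea: the precision control field
is bits 8-9 of the FPU control word, and binary 10 selects double precision.
A minimal sketch, assuming GCC inline asm; the function and macro names are
made up for illustration, and I haven't checked whether this actually cures
the lossy cases.

```c
#include <stdint.h>

#define X87_PC_MASK   0x0300u  /* precision control field, bits 8-9 */
#define X87_PC_DOUBLE 0x0200u  /* binary 10 = double (53-bit) precision */

/* Return cw with only the precision control field changed to double. */
uint16_t x87_cw_with_double_precision(uint16_t cw)
{
    return (uint16_t)((cw & ~X87_PC_MASK) | X87_PC_DOUBLE);
}

/* Load a control word into the FPU (no-op on non-x86 targets). */
void x87_load_cw(uint16_t cw)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile("fldcw %0" : : "m" (cw));
#else
    (void)cw;
#endif
}

/* Switch the FPU to double precision and return the previous control
 * word so the caller can restore it after the copy loop. */
uint16_t x87_set_double_precision(void)
{
    uint16_t old_cw = 0x037F;  /* control word value after finit */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile("fnstcw %0" : "=m" (old_cw));
#endif
    x87_load_cw(x87_cw_with_double_precision(old_cw));
    return old_cw;
}
```

Save the value returned by x87_set_double_precision() before the copy and
pass it to x87_load_cw() afterwards, so the rest of the program keeps
whatever precision it was using.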

Leathal.
