Message-ID: <336DCFDB.7C54@silesia.top.pl>
Date: Mon, 05 May 1997 14:17:31 +0200
From: Michal <wapex AT silesia DOT top DOT pl>
MIME-Version: 1.0
To: djgpp AT delorie DOT com
Subject: Re: Alignment
Content-Type: text/plain; charset=iso-8859-2
Content-Transfer-Encoding: 7bit
Precedence: bulk

Leath Muller wrote:
> 
> No - you're wrong... :)  The fdiv, sqrt, fmul, fadd and fsub are all affected
> by moving the FPU into single precision mode...
> 
You're saying that if I have my FPU in double precision mode and execute,
for example, -fmul %st(1)-, the FPU is switched into single precision and,
after executing the fmul, back to double precision? If that is right, why do
we have different precisions, when all operations are really single?
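For reference, a minimal sketch of how the precision mode is selected on
the x87. The mode is a field in the FPU control word; it is set once and
stays in effect until the control word is changed again, and it controls
the rounding of fadd, fsub, fmul, fdiv and fsqrt results. The macro and
function names below are made up for illustration, and the inline
assembly in the closing comment assumes GCC on x86 and is untested here.

```c
#include <assert.h>

/* Precision-control (PC) field of the x87 control word: bits 8 and 9. */
#define FPU_PC_MASK     0x0300u
#define FPU_PC_SINGLE   0x0000u  /* 24-bit significand */
#define FPU_PC_DOUBLE   0x0200u  /* 53-bit significand */
#define FPU_PC_EXTENDED 0x0300u  /* 64-bit significand */

/* Return a copy of control word cw with its PC field replaced by pc.
   (Hypothetical helper; it only does the bit manipulation.) */
unsigned short fpu_with_precision(unsigned short cw, unsigned short pc)
{
    /* pc must lie inside the PC field; the value 0x0100 is reserved. */
    assert((pc & ~FPU_PC_MASK) == 0 && pc != 0x0100u);
    return (unsigned short)((cw & ~FPU_PC_MASK) | pc);
}

/* Assumed usage with GCC inline assembly on x86:
       unsigned short cw;
       __asm__ volatile ("fnstcw %0" : "=m" (cw));
       cw = fpu_with_precision(cw, FPU_PC_SINGLE);
       __asm__ volatile ("fldcw %0" : : "m" (cw));
*/
```

Starting from the control word 0x037F that the FPU has after finit,
selecting single precision gives 0x007F and double precision gives 0x027F.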
> I also get the impression then that you're texturing 8 pixels, lighting 8
> pixels, texturing 8 pixels, lighting... etc ... Basically, this is _really_
> bad for cache coherency - you're better off texturing the complete scanline
> and then lighting the complete scanline. 
No, I'm doing both at the same time.
> I moved to this way with using a
> temporary offscreen memory buffer of 2560 bytes (I do stuff in true colour).
> Write the texture stuff to the offscreen memory (which in my inner loop
> never left the 8k cache area per line), and then do your lighting from there...
> 
I don't think that it would be faster. You would need a buffer to store
1/z for every 8 pixels, unless you do the division a second time. And some
pixels of the scanline could be out of cache by the time they are written
for the second time. The only advantage I see is more registers for
both texturing and lighting, but you need more instructions: writing
to the 1/z buffer, a second address calculation, a second loop and stuff
like that.
> If you're wondering, I had my perspective correct, sub-pixel accurate true
> colour light-sourced, gouraud shaded engine running at 16 cycles per pixel.
Mine draws about 7.8 million pixels per second writing to the LFB (ViRGE)
on my P120 in 8-bit color. That's 15.5 cycles per pixel, but that includes
cache misses. I've never calculated it so accurately, but I think it
would be something like 12-13 clocks per pixel without them. Your result
is quite good, I mean in clocks/pixel. In 24-bit color your inner loop must
be slowed down dramatically by cache misses: you have 3-bytes-per-pixel
textures and 3 times more memory to address. I think it would be better
to do it in 16-bit color, use 1-byte-per-pixel textures, and organize your
palette so that the high byte of every color (in the palette) is the
brightness and the low byte is the real color value (taken from the
texture). Adapted to that, my inner loop would in theory have the same
speed, but it would have more cache misses. I've never coded for 16-bit
color, never even tried.
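One way to read the palette idea above, as a rough sketch only: a lookup
table indexed by (brightness << 8) | texel that returns a ready 16-bit
pixel. The RGB565 pixel layout, the linear brightness scaling and all
names here are assumptions made for illustration, not a description of
any existing engine.

```c
#include <assert.h>
#include <stdint.h>

/* 256 brightness levels x 256 palette colors, one 16-bit pixel each.
   That is 128 KB, far larger than an 8 KB data cache. */
static uint16_t shade_table[256 * 256];

/* Pack 8-bit R, G, B into a 5-6-5 pixel (assumed video mode layout). */
static uint16_t pack_rgb565(unsigned r, unsigned g, unsigned b)
{
    assert(r < 256 && g < 256 && b < 256);
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* palette_rgb: 256 entries of 3 bytes each (R, G, B), values 0..255. */
void build_shade_table(const uint8_t *palette_rgb)
{
    unsigned level, texel;
    for (level = 0; level < 256; level++) {
        for (texel = 0; texel < 256; texel++) {
            /* Scale each channel linearly by the brightness level. */
            unsigned r = palette_rgb[texel * 3 + 0] * level / 255;
            unsigned g = palette_rgb[texel * 3 + 1] * level / 255;
            unsigned b = palette_rgb[texel * 3 + 2] * level / 255;
            shade_table[(level << 8) | texel] = pack_rgb565(r, g, b);
        }
    }
}

/* Inner-loop step: high byte = brightness, low byte = texture value. */
uint16_t shade_pixel(uint8_t texel, uint8_t level)
{
    return shade_table[((unsigned)level << 8) | texel];
}
```

With this layout the per-pixel work is one table read, at the price of a
table much bigger than the cache, which fits the remark about extra cache
misses.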
> With MMX registers, I could get it running in 9 cycles per pixel... which
> is faster than Quake and looks a whole lot better...
I don't have MMX, so I can't say, but I don't think it can give such a
speed-up. You can't use MMX and the FPU at the same time, so you would have
to write a non-FPU inner loop. The whole point of overlapping FPU code with
CPU code would be lost. Also, MMX has no div instruction (as far as I
know), so you would have to use the CPU div. Maybe it can be done by
saving the FPU registers in some buffer and then loading the MMX registers,
or something like that.
Inner loop speed is not everything. Try to create a whole engine like in
QUAKE; that's the really difficult task.

PS
What about my first question, 'How to align in DJGPP'? My doubles are NOT
aligned on an 8-byte boundary like they should be.
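A possible approach to the alignment question, sketched under
assumptions: for static data, request the alignment with GCC's aligned
attribute (this only helps if the assembler and linker in use honor
8-byte alignment, which is not verified here for DJGPP); for heap data,
over-allocate and round the pointer up by hand. The function names are
hypothetical, and uintptr_t from <stdint.h> may need to be replaced by
unsigned long on older compilers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Static data: ask GCC for 8-byte alignment explicitly. */
static double coeffs[4] __attribute__((aligned(8)));

double *aligned_coeffs(void)
{
    return coeffs;
}

/* Round p up to the next multiple of align (align: a power of two). */
void *align_up(void *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    assert(align != 0 && (align & (align - 1)) == 0);
    return (void *)((addr + (align - 1)) & ~(uintptr_t)(align - 1));
}

/* Heap data: allocate 7 spare bytes and align by hand.
   *raw receives the pointer that must later be passed to free(). */
double *alloc_aligned_doubles(size_t count, void **raw)
{
    *raw = malloc(count * sizeof(double) + 7);
    return *raw ? (double *)align_up(*raw, 8) : NULL;
}
```

Neither technique covers local variables on the stack; as far as I know
that depends on how the compiler keeps the stack pointer aligned.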

Sorry my answer is so late, but I had problems with my internet provider.