
Message-Id: <m0y3Kdx-000S2kC@inti.gov.ar>
Comments: Authenticated sender is <salvador AT natacha DOT inti DOT gov DOT ar>
From: "Salvador Eduardo Tropea (SET)" <salvador AT inti DOT gov DOT ar>
Organization: INTI
To: Till Harbaum <harbaum AT ibr DOT cs DOT tu-bs DOT de>, djgpp AT delorie DOT com
Date: Fri, 13 Feb 1998 16:03:17 +0000
MIME-Version: 1.0
Subject: Re: cmpl takes 14 clk cycles on a Pentium ???
In-reply-to: <yks3ehneuyg.fsf@flens.ibr.cs.tu-bs.de>

Till Harbaum <harbaum AT ibr DOT cs DOT tu-bs DOT de> wrote:
> Nate Eldredge <eldredge AT ap DOT net> writes:
> 
> > 
> > At 04:46  2/12/1998 -0500, Mario Deschenes wrote:
> > >Hi everyone,
> > >
> > >  I'm using RDTSC to profile a routine and I got something strange.  My
> > >routine looks like:
> > [snipped]
> > This is somewhat of a shot in the dark, since I am no hardware guru. (Btw,
> > you know that `align X' aligns to the nearest 2^X boundary, right?) What
> > seems most likely to me is some kind of caching or prefetch issue. Perhaps
> > when the target of the jump is close enough, it is already in a cache and is
> > fetched faster. But when it's farther away, a new chunk has to be fetched
> > from real memory, which is slower.
> > 
> I think this is the right idea.
I agree with Nate and you: the problem is in the alignment.
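
For reference, here is a minimal sketch of this kind of RDTSC measurement under
DJGPP/GCC (not Mario's original routine, just an illustration, assuming a
Pentium or newer CPU; the count it prints still depends on alignment and cache
state, which is the whole point of this thread):

#include <stdio.h>

/* Read the 64-bit time-stamp counter (opcode 0x0F 0x31 = RDTSC). */
static inline unsigned long long rdtsc(void)
{
    unsigned long lo, hi;
    __asm__ __volatile__ (".byte 0x0f, 0x31" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}

/* Dummy routine standing in for the code being profiled. */
static void routine(void)
{
    volatile int x = 0;
    int i;
    for (i = 0; i < 1000; i++)
        x += i;
}

int main(void)
{
    unsigned long long t0, t1, overhead;

    /* First measure the cost of the two rdtsc() calls themselves. */
    t0 = rdtsc();
    t1 = rdtsc();
    overhead = t1 - t0;

    t0 = rdtsc();
    routine();
    t1 = rdtsc();

    printf("routine took about %lu cycles\n",
           (unsigned long)(t1 - t0 - overhead));
    return 0;
}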

> Most modern CPUs do some kind of
> burst read ahead of their code.
In some cases it's even simpler than that: some CPUs have a wider bus inside. For 
example, a simple Cyrix 5x86 uses a 128-bit bus inside the CPU to join the L1 
cache and the pipeline.

> This means: If the cpu reads an
> instruction word it initiates a burst transfer from ram to cache and
> reads some data it will likely need in the future (the 68040 for
> example always reads 4 quadwords, even if it doesn't need them).
Even more: L1 and L2 entries aren't one byte or one CPU word; in general they 
are groups of CPU words (a 32-byte line is common). Why? Because these cached 
bytes are "tagged", that is, the cache keeps flags to know when to discard the 
bytes and whether the bytes are synchronized. You can't have these bits for each 
byte in the cache (a real waste), so you use them for a whole group of bytes.
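
A rough way to see the line-fill effect (the 32-byte line size and the buffer
size below are just assumptions for a Pentium-class machine, not something from
this thread): walk a large buffer once touching every byte, then again touching
one byte per line; the sparse walk pays a whole line fill for every single read.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE (4 * 1024 * 1024)   /* larger than L1 + L2 so lines must be refilled */
#define LINE 32                  /* assumed cache line size in bytes */

int main(void)
{
    unsigned char *buf = malloc(SIZE);
    volatile unsigned long sum = 0;
    clock_t t0, t1;
    long i;

    if (buf == NULL)
        return 1;
    memset(buf, 1, SIZE);

    /* Dense walk: every byte is read, so one line fill serves LINE reads. */
    t0 = clock();
    for (i = 0; i < SIZE; i++)
        sum += buf[i];
    t1 = clock();
    printf("dense walk:  %ld ticks for %d reads\n", (long)(t1 - t0), SIZE);

    /* Sparse walk: one byte per line, so every read pays for a full line fill. */
    t0 = clock();
    for (i = 0; i < SIZE; i += LINE)
        sum += buf[i];
    t1 = clock();
    printf("sparse walk: %ld ticks for %d reads\n", (long)(t1 - t0), SIZE / LINE);

    free(buf);
    return 0;
}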
 
> The behaviour of the code also depends on the implementation
> of the branch prediction unit of the CPU (which covers a big area
> of the Pentium die, I think, so it should be very good :-).
Yes, but the P5 gets around a 95% hit rate on its predictions and the P6 around 
97%, so for small loops executed many times the P5 and P6 are VERY good at 
predicting.
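
A rough illustration of why the hit rate matters (this doesn't measure the real
P5/P6 accuracy, it just times a branch the predictor always gets right against
one driven by random data the predictor cannot learn):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      1000000
#define REPEAT 50

static unsigned char data[N];

int main(void)
{
    volatile long sum = 0;
    clock_t t0, t1;
    long i, r;

    for (i = 0; i < N; i++)
        data[i] = (unsigned char)(rand() & 1);   /* random 0/1 pattern */

    /* Branch that is always taken: the predictor gets it right every time. */
    t0 = clock();
    for (r = 0; r < REPEAT; r++)
        for (i = 0; i < N; i++)
            if (data[i] <= 1)
                sum++;
    t1 = clock();
    printf("predictable branch:   %ld ticks\n", (long)(t1 - t0));

    /* Branch taken at random about half the time: many mispredictions. */
    t0 = clock();
    for (r = 0; r < REPEAT; r++)
        for (i = 0; i < N; i++)
            if (data[i])
                sum++;
    t1 = clock();
    printf("unpredictable branch: %ld ticks\n", (long)(t1 - t0));

    return 0;
}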
 
> Switch off all caching and see if you still measure those differences.
The alignment problems will remain.

The hardest problem is that on some CPUs the alignment is VERY important. I 
was getting erratic frame rates in my plasma routines until I figured out 
that some code needed 64-bit alignment (8 bytes) and that normally djgpp can't 
provide it (I did it by tricking the specs and the linker configuration). 
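
A common portable workaround, shown only as a sketch (it is NOT the
specs/linker trick described above), is to over-allocate a buffer and round the
pointer up to the next 8-byte boundary by hand:

#include <stdio.h>
#include <stdlib.h>

#define ALIGNMENT 8UL   /* 64 bits */

int main(void)
{
    /* Over-allocate, then round the address up to the next 8-byte boundary. */
    void *raw = malloc(1024 + ALIGNMENT - 1);
    unsigned long addr;
    double *aligned;

    if (raw == NULL)
        return 1;

    addr = ((unsigned long)raw + ALIGNMENT - 1) & ~(ALIGNMENT - 1);
    aligned = (double *)addr;

    printf("raw pointer = %p, aligned pointer = %p\n", raw, (void *)aligned);

    /* ... put the alignment-sensitive data at `aligned' ... */

    free(raw);   /* free the original pointer, not the adjusted one */
    return 0;
}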

SET
------------------------------------ 0 --------------------------------
Visit my home page: http://set-soft.home.ml.org/
or
http://www.geocities.com/SiliconValley/Vista/6552/
Salvador Eduardo Tropea (SET). (Electronics Engineer)
Alternative e-mail: set-sot AT usa DOT net - ICQ: 2951574
Address: Curapaligue 2124, Caseros, 3 de Febrero
Buenos Aires, (1678), ARGENTINA
TE: +(541) 759 0013
