Mail Archives: djgpp/1998/02/13/05:15:34

From: Till Harbaum <harbaum AT ibr DOT cs DOT tu-bs DOT de>
Newsgroups: comp.os.msdos.djgpp
Subject: Re: cmpl takes 14 clk cycles on a Pentium ???
Date: 13 Feb 1998 10:05:11 +0100
Organization: TU Braunschweig, Informatik (Bueltenweg), Germany
Lines: 30
Distribution: world
Message-ID: <yks3ehneuyg.fsf@flens.ibr.cs.tu-bs.de>
References: <199802130328 DOT TAA12256 AT adit DOT ap DOT net>
NNTP-Posting-Host: flens.ibr.cs.tu-bs.de
Mime-Version: 1.0 (generated by tm-edit 7.106)
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

Nate Eldredge <eldredge AT ap DOT net> writes:

> 
> At 04:46  2/12/1998 -0500, Mario Deschenes wrote:
> >Hi everyone,
> >
> >  I'm using RDTSC to profile a routine and I got something strange.  My
> >routine looks like:
> [snipped]
> This is somewhat of a shot in the dark, since I am no hardware guru. (Btw,
> you know that `align X' aligns to the nearest 2^X boundary, right?) What
> seems most likely to me is some kind of caching or prefetch issue. Perhaps
> when the target of the jump is close enough, it is already in a cache and is
> fetched faster. But when it's farther away, a new chunk has to be fetched
> from real memory, which is slower.
> 
I think this is the right idea. Most modern CPUs do some kind of
burst read-ahead of their code. This means: when the CPU fetches an
instruction word, it initiates a burst transfer from RAM to cache and
reads some data it will likely need in the future (the 68040, for
example, always reads 4 quadwords, even if it doesn't need them).

The behaviour of the code also depends on the implementation
of the branch prediction unit of the CPU (which covers a big area
of the Pentium die, I think, so it should be very good :-).

Switch off all caching and see whether you still measure those differences.
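
If switching the caches off is inconvenient, a rough alternative is to time
the same routine several times in a row and compare the first (cold) run
with the later (warm) ones. The following is only a sketch under
assumptions: routine_under_test() is a hypothetical stand-in for the code
being profiled, read_ticks(), time_once() and time_min() are names made up
here, __rdtsc() is the GCC/Clang intrinsic from <x86intrin.h>, and the
clock() fallback exists only so the sketch compiles on non-x86 targets.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>              /* __rdtsc(): GCC/Clang intrinsic */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>                   /* assumption: coarse fallback only */
static uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

/* Written by the workload so the compiler cannot drop the loop. */
static volatile uint32_t sink;

/* Hypothetical stand-in for the routine being profiled. */
static void routine_under_test(void)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < 64; i++)
        acc += i;
    sink = acc;
}

/* Time a single call of fn in tick-counter units. */
static uint64_t time_once(void (*fn)(void))
{
    uint64_t start = read_ticks();
    fn();
    return read_ticks() - start;
}

/* Call fn `runs` times, store every sample in out[], return the minimum.
 * out[0] is the cold run; later entries run with code and data that are
 * probably already cached and with a trained branch predictor. */
static uint64_t time_min(void (*fn)(void), uint64_t *out, size_t runs)
{
    if (runs == 0)
        return 0;

    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < runs; i++) {
        out[i] = time_once(fn);
        if (out[i] < best)
            best = out[i];
    }
    return best;
}
```

Call it as time_min(routine_under_test, samples, 8) and print the samples.
If samples[0] is much larger than the rest, the extra cycles most likely
come from cache fills or branch mispredictions rather than from the
instruction itself.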

Ciao,
  Till
