Mail Archives: pgcc/2000/02/03/10:56:11

Date: Thu, 3 Feb 2000 13:19:55 +0100
From: Jan Hubicka <hubicka AT atrey DOT karlin DOT mff DOT cuni DOT cz>
To: pgcc AT delorie DOT com
Subject: Re: pgcc and egcs alignment -- function, basic block and string
Message-ID: <20000203131955.D12247@atrey.karlin.mff.cuni.cz>
References: <20000130211158 DOT D641 AT cerebro DOT laendle> <Pine DOT LNX DOT 4 DOT 21 DOT 0002022017450 DOT 16833-100000 AT hq DOT alert DOT sk>
Mime-Version: 1.0
X-Mailer: Mutt 1.0i
In-Reply-To: <Pine.LNX.4.21.0002022017450.16833-100000@hq.alert.sk>; from mandos@hq.alert.sk on Wed, Feb 02, 2000 at 08:29:26PM +0100
Reply-To: pgcc AT delorie DOT com
Errors-To: dj-admin AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com

> On Sun, 30 Jan 2000, Marc Lehmann wrote:
> 
> > > 10% is really a lot, inside a loop, which takes (about) 25 * 35 cycles.
> > 
> > That's very much. I doubt it really is the three nops, but...
> 
> Well, AFAIK the K6 family (especially the K6-1) is pretty sensitive to
> insns split across a cache line boundary. Such cases slow down
> instruction decoding. Considering the importance of decoder
> performance on the K6 and the loop length (only 25-35 cycles, as was said),
> and assuming some longer insns were split this way, a 10% difference
> is IMHO possible.
I've measured speedups of more than 10% in a number of loops with a patch adding
.p2align 5,,<opcode+modrm length> before each instruction.
I have made the patch against egcs. It is not in the mainline (I will try to
resubmit an updated version soon), but you may find it in the mailing list
archives (July or August).
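
For readers unfamiliar with the three-argument form: in GNU as, `.p2align 5,,N` pads to the next 32-byte boundary only when that takes at most N bytes, and otherwise emits nothing. A minimal sketch of that rule follows; the function name `p2align_pad` and the 3-byte opcode+mod/rm length are hypothetical, chosen only for illustration, and this is not the patch itself.

```python
def p2align_pad(addr, power, max_skip):
    """Padding bytes `.p2align power,,max_skip` would emit at address addr.

    Models the documented GNU as behaviour: if reaching the boundary
    needs more than max_skip bytes, no alignment is done at all.
    """
    size = 1 << power
    pad = -addr % size  # bytes needed to reach the next size-byte boundary
    return pad if pad <= max_skip else 0


# Hypothetical instruction whose opcode + mod/rm bytes total 3.
for addr in (30, 20, 32):
    print(addr, p2align_pad(addr, 5, 3))
```

With these inputs, only the instruction starting at offset 30 (2 bytes before the cache line end) is padded; the one at offset 20 would need 12 bytes and is left alone.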

The penalties are not clear (even to the AMD folks), but they are believed
to be the following:
insn opcode crossing a cache line boundary (32 bytes) - 1 cycle + insn becoming vector decoded (minimally 2 cycles + lost parallelism)
insn opcode crossing an ifetch buffer boundary (16 bytes) - 1 cycle at least
insn mod/rm byte separated by a cache line boundary - 1 cycle + lost parallelism in case the insn ought to be scheduled to the first decoder
insn mod/rm byte separated by an ifetch buffer boundary - lost parallelism in case the insn ought to be scheduled to the first decoder

This is not official. Even AMD's K6 emulator handles these situations
incorrectly, and probably no one knows how it really works.
The penalties for the first case are especially extreme. In the other cases,
padding with nops may or may not be worthwhile. Reordering insns or moving the
whole loop body helps in all cases, but it is out of reach of gcc's optimizers.
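
The four suspected cases listed above can be told apart from an instruction's start offset and opcode length. The sketch below is only an illustration under stated assumptions: the names `crosses` and `classify` are hypothetical, the mod/rm byte is assumed to follow the last opcode byte directly, "separated" is read as separated from the opcode, and the 16-byte ifetch windows are assumed to be aligned. It reports which case applies and does not model the cycle counts, which the message itself calls unofficial.

```python
CACHE_LINE = 32  # bytes, as stated in the message
IFETCH = 16      # bytes, as stated in the message


def crosses(first, last, size):
    """True when bytes first..last (inclusive) span a size-byte boundary."""
    return first // size != last // size


def classify(addr, opcode_len, has_modrm=True):
    """List the suspected penalty cases for one instruction.

    addr       -- offset of the first opcode byte
    opcode_len -- number of opcode bytes (assumption: prefixes counted here)
    has_modrm  -- assumption: the mod/rm byte directly follows the opcode
    """
    cases = []
    last_opc = addr + opcode_len - 1
    if crosses(addr, last_opc, CACHE_LINE):
        cases.append("opcode crosses cache line")
    elif crosses(addr, last_opc, IFETCH):
        cases.append("opcode crosses ifetch buffer")
    if has_modrm:
        modrm = last_opc + 1
        if crosses(last_opc, modrm, CACHE_LINE):
            cases.append("mod/rm split by cache line")
        elif crosses(last_opc, modrm, IFETCH):
            cases.append("mod/rm split by ifetch buffer")
    return cases


for addr, length in ((31, 2), (15, 2), (31, 1), (15, 1), (0, 1)):
    print(addr, length, classify(addr, length))
```

A cache line boundary is also an ifetch boundary under the alignment assumption, so the `elif` branches report only the more severe case.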

Does anyone know how the situation looks for the PPro? I thought that only
the ifetch buffers matter and that they are misaligned (so when a long insn
crosses the end of the current ifetch buffer, the next one starts at the start
of that insn), so the .p2align strategy doesn't work there, or am I mistaken?
> 
> BTW: On my K6-2, I get best performance when loops and functions are
> aligned to 8 byte boundary. But this (as well as cache line end issues)
> deserves more testing, so I will do so during weekend.
> 

I've just restarted my work on the K6 support for egcs (and on cleaning up
the code and looking for common bits with Athlon, which I need for my contract),
so please keep me informed.

Honza
> Have a nice day
> 
> ------------------------------------------------------------------------------
> Martin Ockajak a.k.a. Mandos  <mandos AT hq DOT alert DOT sk>  http://hq.alert.sk/~mandos
> "The goal of Computer Science is to build something that will last at
> least until we've finished building it."
