Date: Thu, 3 Feb 2000 13:19:55 +0100
From: Jan Hubicka
To: pgcc AT delorie DOT com
Subject: Re: pgcc and egcs alignment -- function, basic block and string
Message-ID: <20000203131955.D12247@atrey.karlin.mff.cuni.cz>
References: <20000130211158 DOT D641 AT cerebro DOT laendle>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: ; from mandos@hq.alert.sk on Wed, Feb 02, 2000 at 08:29:26PM +0100
Reply-To: pgcc AT delorie DOT com
Errors-To: dj-admin AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

> On Sun, 30 Jan 2000, Marc Lehmann wrote:
>
> > > 10% is really a lot, inside a loop, which takes (about) 25 * 35 cycles.
> >
> > That's very much. I doubt it really is the three nops, but...
>
> Well, AFAIK K6 family (especially K6-1) is pretty sensitive to
> splitting insns over cache line boundary. Such cases slow down the
> decoding of instruction. Considering importance of decoders'
> performance on K6 and loop length (only 25-35 cycles as being said)
> and assuming some longer insns was split this way, 10% difference
> is IMHO possible.

I've measured speedups of more than 10% in a number of loops with a patch
adding .p2align 5,, before each instruction. I have made the patch for egcs.
It is not in the mainline (I will retry submitting an updated version
soon), but you may find it in the mailing list archives (July or August).

The penalties are not clear (even to the AMD folks), but they are believed
to be the following:

  insn opcode crossing cache line boundary (32 bytes)
    - 1 cycle + insn becoming vector decoded (minimally 2 cycles + lost
      parallelism)
  insn opcode crossing ifetch buffer (16 bytes)
    - 1 cycle at least
  insn mod/rm byte separated by cache line boundary
    - 1 cycle + lost parallelism in case insn ought to be scheduled to
      first decoder
  insn mod/rm byte separated by ifetch buffer
    - lost parallelism in case insn ought to be scheduled to first decoder

This is not official.
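
As a rough illustration of the four layout cases listed above, here is a
minimal C sketch that reports which case an instruction falls into, given
its start offset and opcode length. The 32-byte and 16-byte sizes are the
ones quoted in the list. Everything else is an assumption made for the
example: the names split_kind, spans_boundary and classify are hypothetical
and are not taken from the egcs patch, the mod/rm byte is assumed to follow
the opcode bytes directly, and the unofficial cycle costs are not modeled.

```c
#include <assert.h>

#define CACHE_LINE 32  /* cache line size quoted in the list above */
#define IFETCH_BUF 16  /* ifetch buffer size quoted in the list above */

/* Hypothetical labels for the four cases, in the order they are listed. */
enum split_kind {
    SPLIT_NONE,
    OPCODE_CROSSES_LINE,    /* opcode bytes straddle a 32-byte boundary */
    OPCODE_CROSSES_IFETCH,  /* opcode bytes straddle a 16-byte boundary */
    MODRM_AFTER_LINE,       /* mod/rm is the first byte of a cache line */
    MODRM_AFTER_IFETCH      /* mod/rm is the first byte of an ifetch buffer */
};

/* Nonzero when bytes [start, start + len) straddle a multiple of boundary. */
static int spans_boundary(unsigned start, unsigned len, unsigned boundary)
{
    return start / boundary != (start + len - 1) / boundary;
}

/* Classify one instruction; addr is its offset, opcode_len counts the
   opcode bytes (at least 1), has_modrm says whether a mod/rm byte follows.
   The 32-byte test comes first because every 32-byte boundary is also a
   16-byte boundary. */
static enum split_kind classify(unsigned addr, unsigned opcode_len,
                                int has_modrm)
{
    assert(opcode_len >= 1);

    if (spans_boundary(addr, opcode_len, CACHE_LINE))
        return OPCODE_CROSSES_LINE;
    if (spans_boundary(addr, opcode_len, IFETCH_BUF))
        return OPCODE_CROSSES_IFETCH;

    if (has_modrm) {
        unsigned modrm = addr + opcode_len;  /* assumed mod/rm position */
        if (modrm % CACHE_LINE == 0)
            return MODRM_AFTER_LINE;
        if (modrm % IFETCH_BUF == 0)
            return MODRM_AFTER_IFETCH;
    }
    return SPLIT_NONE;
}
```

For instance, a two-byte opcode at offset 31 falls into the first case,
while a two-byte opcode at offset 30 followed by a mod/rm byte falls into
the third, since the mod/rm byte lands at offset 32.
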
Even AMD's K6 emulator handles these situations incorrectly, and probably
no one knows how it really works.

The penalties for the first case are especially extreme. In the other cases
padding with nops may or may not be worthwhile. Reordering insns or moving
the whole loop body helps in all cases, but it is out of reach of gcc's
optimizers.

Does anyone know how the situation looks for PPro? I thought that only the
ifetch buffers matter and that they are misaligned (so when a long insn
crosses the end of the current ifetch, the next one starts at the start of
that insn), so the .p2align strategy doesn't work there, or am I mistaken?

>
> BTW: On my K6-2, I get best performance when loops and functions are
> aligned to 8 byte boundary. But this (as well as cache line end issues)
> deserves more testing, so I will do so during weekend.
>

I've just restarted my work on the K6 support for egcs (and on cleaning up
the code and looking for common bits with Athlon that I need for my
contract), so please keep me informed.

Honza

> Have a nice day
>
> ------------------------------------------------------------------------------
> Martin Ockajak a.k.a. Mandos http://hq.alert.sk/~mandos
> "The goal of Computer Science is to build something that will last at
> least until we've finished building it."