Date: Tue, 9 May 2000 13:15:01 +0200
From: Jan Hubicka
To: pgcc AT delorie DOT com
Subject: Re: pgcc and egcs alignment -- function, basic block and string
Message-ID: <20000509131501.B27958@atrey.karlin.mff.cuni.cz>
References: <20000130211158 DOT D641 AT cerebro DOT laendle> <20000203131955 DOT D12247 AT atrey DOT karlin DOT mff DOT cuni DOT cz> <389C6000 DOT 5B79248 AT neuss DOT netsurf DOT de> <3917AF5A DOT FF5C82B2 AT neuss DOT netsurf DOT de>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-2
Content-Transfer-Encoding: 8bit
X-Mailer: Mutt 1.0i
In-Reply-To: <3917AF5A.FF5C82B2@neuss.netsurf.de>; from w.formann@netsurf213.neuss.netsurf.de on Tue, May 09, 2000 at 08:25:30AM +0200
Reply-To: pgcc AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

> Jan,
>
> seems to be the same with Athlon, at least with this one
> vendor_id : AuthenticAMD
> cpu family : 6
> model : 2
> model name : AMD Athlon(tm) Processor
> stepping : 1
> cpu MHz : 698.660058
>
> here again, I got some speedups when I rearranged the code to have no
> instructions crossing any 16byte border.

OK. I will ask AMD about this issue. Alex from AMD claims that the Athlon
doesn't have such problems. It is quite possible that the speedups are
caused by some accidental change elsewhere...

Honza
>
> Wolfgang
>
> Wolfgang Formann wrote:
> >
> > Jan Hubicka wrote:
> > >
> > > > On Sun, 30 Jan 2000, Marc Lehmann wrote:
> > > > >
> > > > > 10% is really a lot, inside a loop, which takes (about) 25 * 35 cycles.
> > > > >
> > > > > That's very much. I doubt it really is the three nops, but...
> > > >
> > > > Well, AFAIK the K6 family (especially K6-1) is pretty sensitive to
> > > > splitting insns over a cache line boundary. Such cases slow down the
> > > > decoding of instructions.
> > > > Considering the importance of the decoders'
> > > > performance on K6 and the loop length (only 25-35 cycles, as was said)
> > > > and assuming some longer insns were split this way, a 10% difference
> > > > is IMHO possible.
> > > I've measured more than 10% speedups in a number of loops with a patch adding
> > > .p2align 5,, before each instruction.
> > > I have made a patch to egcs. It is not in the mainline (I will re-try to
> > > submit an updated version soon), but you may find it in the mailing list
> > > archives (July or August)
> > >
> > > The penalties are not clear (even to the AMD folks), but they are believed
> > > to be the following:
> > > insn opcode crossing cache line boundary (32 bytes) - 1 cycle + insn becoming vector decoded (minimally 2 cycles + lost parallelism)
> > > insn opcode crossing ifetch buffer (16 bytes) - 1 cycle at least
> > > insn mod/rm byte separated by cache line boundary - 1 cycle + lost parallelism in case insn ought to be scheduled to first decoder
> > > insn mod/rm byte separated by ifetch buffer - lost parallelism in case insn ought to be scheduled to first decoder
> >
> > This seems to be right, so after hacking one more day, I get another ~10%
> > of improvement. Altogether crypt586.pl is improved from the original
> > 13780 to 18912 crypts/second on my good old K6-I/233 :-)
> >
> > But there is still a large number of question marks!
> > Thanks!
> >
> > >
> > > This is not official. Even AMD's K6 emulator is incorrect in handling these
> > > situations and probably no-one knows how it really works.
> > > Especially the penalties for the first case are extreme. In other cases padding
> > > by nops may or may not be worthwhile. Reordering insns/moving the whole loop
> > > body helps in all cases, but it is out of reach of gcc's optimizers.
> > >
> > > Does anyone know how the situation looks for PPro?
> > > I thought that only
> > > ifetch buffers matter and that they are misaligned (so when a long insn
> > > is crossing the end of the current ifetch, the next one starts at the start of
> > > that insn), so the .p2align strategy doesn't work there, or am I mistaken?
> > > >
> > > > BTW: On my K6-2, I get best performance when loops and functions are
> > > > aligned to an 8 byte boundary. But this (as well as cache line end issues)
> > > > deserves more testing, so I will do so during the weekend.
> > > >
> > >
> > > I've just re-started my work on the K6 support for egcs (and cleaning up
> > > the code and looking for common bits with Athlon I need for my contract)
> > > so please keep me informed.
> > >
> > > Honza
> > > > Have a nice day
> > > >
> > > > ------------------------------------------------------------------------------
> > > > Martin Ockajak a.k.a. Mandos http://hq.alert.sk/~mandos
> > > > "The goal of Computer Science is to build something that will last at
> > > > least until we've finished building it."
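
The boundary conditions discussed in this thread (an insn straddling a
16-byte ifetch buffer or a 32-byte cache line) can be sketched in C as a
simple check. This is only an illustration under stated assumptions: the
function and constant names are hypothetical, the sizes are the ones quoted
in the thread for K6, and the code models only whether an insn crosses a
boundary, not the cycle penalties, which the thread itself calls unofficial.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes as quoted in the thread for K6 (assumed, not official). */
enum { IFETCH_BYTES = 16, CACHE_LINE_BYTES = 32 };

/* Hypothetical helper: true if the byte range [addr, addr + len)
   spans a multiple of `boundary`. `boundary` must be a power of two
   and `len` must be at least 1. */
static bool crosses_boundary(uintptr_t addr, size_t len, size_t boundary)
{
    assert(len > 0);
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

    uintptr_t first_block = addr / boundary;             /* block of first byte */
    uintptr_t last_block = (addr + len - 1) / boundary;  /* block of last byte  */
    return first_block != last_block;
}

/* Hypothetical helper: number of padding bytes (e.g. nops) to place
   before the insn so that it starts on the next boundary instead of
   straddling it. Returns 0 if the insn does not straddle.
   Assumes len <= boundary. */
static size_t pad_to_avoid_split(uintptr_t addr, size_t len, size_t boundary)
{
    if (!crosses_boundary(addr, len, boundary))
        return 0;
    return boundary - (size_t)(addr % boundary);
}
```

For example, a 3-byte insn at offset 14 occupies bytes 14, 15 and 16, so it
crosses the 16-byte ifetch boundary but not the 32-byte cache line; two bytes
of padding would move it to offset 16.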