Mail Archives: pgcc/1998/04/15/00:11:09

X-pop3-spooler: POP3MAIL 2.1.0 b 3 961213 -bs-
Delivered-To: pcg AT goof DOT com
From: ak AT stuttgart DOT netsurf DOT de (Andreas Kaiser)
To: beastium-list AT Desk DOT nl
Subject: Re: [performance] newer binutils / pgcc / K6
Date: Wed, 15 Apr 1998 00:08:49 GMT
Organization: Ananke
Message-ID: <3537f9b9.782528@mail1.stuttgart.netsurf.de>
References: <Pine DOT LNX DOT 3 DOT 96 DOT 980414020447 DOT 12999B-100000 AT goliath DOT csn DOT tu-chemnitz DOT de>
In-Reply-To: <Pine.LNX.3.96.980414020447.12999B-100000@goliath.csn.tu-chemnitz.de>
X-Mailer: Forte Agent 1.5/16.451
MIME-Version: 1.0
Sender: Marc Lehmann <pcg AT goof DOT com>
Status: RO
Lines: 57

On Tue, 14 Apr 1998 02:30:18 +0200 (CEST), Ronald Wahl <rwahl AT gmx DOT net> wrote:

>further testing I found out that it is a code alignment issue. If I use
>-malign-loops=2 the tests run nearly at the same speed as with the older
>versions of binutils (gas). Some tests are a bit slower but not much
>(--> see my appended nbench results). Other alignments will cause
>slowdowns.

I've encountered the same effect in a program I use for benchmarking. Even
though AMD suggests aligning branch targets (because, like all other x86 CPUs
except the non-MMX Pentium, the K6 won't fetch across a cache line boundary),
the code was faster without alignment. My personal interpretation: the
16-entry branch target instruction cache of the K6 appears to be direct-mapped
by A2..A5. When many branch targets are aligned to 16 bytes (so A2,A3 = 0),
they can use only 4 of the 16 entries.
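
For illustration, a minimal gas sketch of the alignment in question (32-bit
code; hypothetical, not from Ronald's benchmark, and assuming the old i386
option semantics where -malign-loops=N means a 2^N-byte boundary):

    # -malign-loops=2 corresponds to .p2align 2, i.e. a 4-byte boundary:
    # A2..A5 of the loop head stay free, so under the interpretation above
    # branch targets can spread over all 16 BTC entries instead of only
    # entries 0, 4, 8 and 12.
            .text
            .globl  count_down
            .p2align 2              # 4 bytes (compare .p2align 4 = 16 bytes)
    count_down:
            decl    %ecx
            jnz     count_down      # branch target, indexed by A2..A5
            ret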

However, it can also result from another accidental side effect: when the
part of the opcode that is required for instruction length detection is split
across two cache lines, the instruction becomes microcoded and thus slow (a
predecoder problem). Aligning instructions (shifting code around) may
accidentally enlarge or reduce such effects in critical parts of a program.
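
As a hypothetical illustration of such a split (assuming a 32-byte line
purely for the sake of the example):

            .text
            .p2align 5              # force a 32-byte line boundary here
            .skip   31, 0x90        # 31 bytes of NOP filler, offsets 0..30
            movl    %eax, %ebx      # 2-byte insn starting at offset 31: its
                                    # two bytes land in different lines, so
                                    # the length-determining bytes are split
            ret
    # One more NOP before the movl (or a lengthened prior insn) would push
    # it to offset 32 and keep it within a single line.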

	Other stuff which affects many x86 CPUs, including the K6:

A few months ago, I looked into plain GCC 2.7.2 (EMX) and found a code/data
mix that is prone to systematic cache misses. OK, plain GCC is old and never
knew about split-cache x86s, so I downloaded PGCC (OS/2), and to my surprise
it was exactly the same.

Mixing code and data in the same cache line may lead to a ping-pong effect,
where the lines are frequently flushed and reloaded from L2 (for x86 CPUs,
the same cache line is *never* included in both I- and D-cache). Especially
awful is a switch table immediately following the jump that uses it. This
holds for all x86 CPUs with split caches except the AMD K5 (where a
D-miss/I-hit data read is handled uncached instead), but the K6 is affected
more, because in this respect its cache line size is 64 bytes, not 32.

This is easily avoided by putting const data into the data section, a
separate const section (for non-a.out formats), or at least a separate
subsection (for the old a.out format). The effect on Perl is quite noticeable.
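
A minimal gas sketch of the fix (hypothetical code, assuming an ELF-style
.rodata section; on old a.out one would use a data subsection instead, as
noted above):

            .text
            .globl  dispatch
    dispatch:
            movl    4(%esp), %eax           # switch argument, assumed 0..3
            jmp     *table(,%eax,4)         # indirect jump through the table

            .section .rodata                # table lives in a data line, not
    table:                                  # in the code line holding the jmp
            .long   case0, case1, case2, case3

            .text
    case0:  movl    $10, %eax
            ret
    case1:  movl    $11, %eax
            ret
    case2:  movl    $12, %eax
            ret
    case3:  movl    $13, %eax
            ret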

Another optimization is worth trying: avoid [ESI] without a displacement,
because such an instruction becomes microcoded (once again the predecoder
gets confused, because this addressing mode has the same opcode as 16-bit
absolute). Avoiding [ESI] and the opcode split mentioned above (by inserting
NOPs or lengthening a prior insn), however, can only be done by the assembler.
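
A short sketch of the two encodings (assuming gas keeps an explicitly written
zero displacement, which it normally does):

            movl    (%esi), %eax    # mod=00 r/m=110: same ModRM pattern as
                                    # 16-bit absolute, reportedly microcoded
                                    # on the K6
            movl    0(%esi), %eax   # same load with an explicit disp8 of 0:
                                    # one byte longer, but a plain encoding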

Just in case it is still found in PGCC (I didn't check, but it is regularly
used in plain GCC): after partial register operations, like "or $0x01,%ah"
instead of "or $0x0100,%eax", the next insn using %eax may get stalled until
the parts are recombined. This affects many x86s, especially the P6 (the
decoder stalls, so it's a very large penalty) and the K6 (data dependency
stall). Even the old 486 stalled for one clock. Only the Pentium and K5 don't
stall (the ever-astonishing K5 is able to collect the parts of a single
operand from 3 different sources w/o penalty ;-).
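
A small sketch of the two variants (hypothetical surrounding code):

            orb     $0x01, %ah      # partial write to %ah ...
            movl    %eax, %edx      # ... the next use of %eax may stall until
                                    # the halves are recombined

            orl     $0x0100, %eax   # same effect on bit 8, but a full-register
            movl    %eax, %edx      # write, so no partial-register stall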

		Gruss, Andreas
