Mail Archives: pgcc/1999/08/12/02:50:34.1

Message-ID: <19990808155531.34641@atrey.karlin.mff.cuni.cz>
Date: Sun, 8 Aug 1999 15:55:31 +0200
From: Jan Hubicka <hubicka AT atrey DOT karlin DOT mff DOT cuni DOT cz>
To: pgcc AT delorie DOT com
Subject: Re: optimizing for k6
References: <Pine DOT GSO DOT 4 DOT 10 DOT 9908051303340 DOT 29067-100000 AT legolas DOT mdh DOT se> <37AC0114 DOT F3BC458A AT neuss DOT netsurf DOT de>
Mime-Version: 1.0
X-Mailer: Mutt 0.84
In-Reply-To: <37AC0114.F3BC458A@neuss.netsurf.de>; from Wolfgang Formann on Sat, Aug 07, 1999 at 11:49:08AM +0200
Reply-To: pgcc AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com

> Henrik Berglund SdU wrote:
> > 
> > ftp://ftp.sinica.edu.tw/pub/doc/cpu/www.amd.com/K6/k6docs/pdf/21828a.pdf
> > 
> > -----------------------------------------------------------------------------
> > Henrik DOT Berglund AT mds DOT mdh DOT se
> > http://www.mds.mdh.se/~adb94hbd/
> 
> This is a long-known document; it does help somewhat in optimizing. But the
> information is just too incomplete to get really good optimizations.
> 
> There are also a lot of mistakes in that document. I had a little
> discussion with AMD technical support, but they did not help :-(
> AMD Technical Support wrote:
I am just working on the K6 support for the new ia32 backend. You are right
that the document is quite bad. It recommends things that hurt
and fails to tell you about details that really help. But AMD technical
support is quite kind about answering specific questions about the optimizations.

The most important optimizations for the K6 seem to be alignment changes
(the K6 requires pretty weird alignment before every instruction with a 2-byte or longer
opcode, which is also not mentioned in the docs) and instruction selection
(some pretty common instructions are vector decoded, and the manual
fails to document that). Probably the most important for gcc
were inc/dec in the long form with a nonmemory operand, neg patterns, shift patterns
and setcc.
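
To make the instruction families named above concrete, here is a small
sketch of C constructs that x86 compilers commonly translate into inc/dec,
neg, shifts and setcc. The function names are made up for illustration, and
the instruction a given compiler actually emits depends on its version and
flags, so the comments describe typical output only. Compiling with `gcc -S`
shows what was really chosen.

```c
/* Illustrative only: source patterns that commonly lead to the
   instruction families discussed above. Function names are hypothetical. */

/* Typically INC (or ADD with an immediate of 1). */
int next_index(int i)
{
    return i + 1;
}

/* Typically NEG. */
int negate(int x)
{
    return -x;
}

/* Typically SHL/SAL with the count in CL or as an immediate.
   n must be smaller than the width of unsigned. */
unsigned scale(unsigned x, unsigned n)
{
    return x << n;
}

/* Typically CMP followed by SETcc (here SETL) and a zero-extension. */
int is_less(int a, int b)
{
    return a < b;
}
```

Calling these from a small main() and comparing the assembly produced for
different target settings is one way to see which encodings a backend picks.
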
I've implemented lots of other stuff and the results are pretty good IMO.
The byte benchmark optimized for 386 using the old backend says 4.23/2.51
(integer/fp index), the new backend 4.55/2.41, visual ZC++ 4.40/2.42, and my current result is
4.89/2.61.
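
For a rough sense of scale, the indices above can be turned into relative
changes. This is only arithmetic on the four quoted pairs; the helper names
are invented here, and treating the old backend as the baseline is an
assumption made for the comparison.

```c
#include <stdio.h>

/* Relative change of a benchmark index against a baseline, in percent. */
double percent_gain(double baseline, double result)
{
    return (result / baseline - 1.0) * 100.0;
}

/* Prints the change of each quoted configuration relative to the old
   backend (4.23 integer, 2.51 fp), which is assumed as the baseline. */
void print_gains(void)
{
    static const struct {
        const char *name;
        double integer_index;
        double fp_index;
    } rows[] = {
        { "new backend",    4.55, 2.41 },
        { "visual ZC++",    4.40, 2.42 },
        { "current result", 4.89, 2.61 },
    };
    size_t i;

    for (i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-15s integer %+5.1f%%  fp %+5.1f%%\n",
               rows[i].name,
               percent_gain(4.23, rows[i].integer_index),
               percent_gain(2.51, rows[i].fp_index));
}
```

By this measure the current result is roughly 15.6% above the old backend on
the integer index and roughly 4% above it on the fp index.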

Maybe I can write some sort of document describing the most interesting surprises I've found
while playing with the optimizations.

Honza
> > 
> > >Return-Path: <w DOT formann AT neuss DOT netsurf DOT de>
> > >Sender: wolfi AT neuss DOT netsurf DOT de
> > >Date: Fri, 12 Mar 1999 19:10:15 +0100
> > >From: Wolfgang Formann <w DOT formann AT neuss DOT netsurf DOT de>
> > >To: AMD Technical Support <blikefet AT pedigree DOT amd DOT com>
> > >Subject: Re: Some question to your literature, maybe a typo?
> > >References: <3 DOT 0 DOT 32 DOT 19990303153034 DOT 0074931c AT pedigree DOT amd DOT com>
> > >
> > 
> > Hi,
> > 
> > it is the last update of the document. I think you must try it.
> > 
> > Kind regards
> > 
> > Bernard
> > 
> > >AMD Technical Support wrote:
> > >>
> > >> >Return-Path: <euro DOT lit AT amd DOT com>
> > >> >X-Sender: support2 AT pedigree
> > >> >Date: Thu, 25 Feb 1999 06:39:16 +0100
> > >> >To: blikefet AT pedigree DOT amd DOT com
> > >> >From: Wolfgang Formann <w DOT formann AT neuss DOT netsurf DOT de> (by way of CPA <euro DOT lit AT amd DOT com>)
> > >> >Subject: Some question to your literature, maybe a typo?
> > >> >
> > >> >I just downloaded the document http://www.amd.com/K6/k6docs/pdf/21828a.pdf.
> > >> >The table in Chapter 4, pages 37 to 40, says that all the shift operations
> > >> >like SHIFT mreg16/32,imm8; SHIFT mreg16/32, 1; SHIFT mreg16/32, CL; where
> > >> >SHIFT can be replaced by SAR, SHL/SAL and SHR, are executed as RISC86(tm)
> > >> >Opcode alu. This RISC86(tm) operation is explained on page 24 as
> > >> >`alu - either of the integer execution units`.
> > >> >
> > >> >Whereas in chapter 3 on page 12, this document lists some (all?) operations
> > >> >which can be performed in the Integer Y execution unit. In the list of
> > >> >operations '(ADD, AND, CMP, OR, SUB and XOR)' there is none of the SHIFT's
> > >> >mentioned.
> > >> >
> > >> >By trying it out (I think) I found that chapter 3 is right and the table
> > >> >in chapter 4 has typos.
> > >> >
> > >> >My question: Is there any updated version of this document available or
> > >> >do I have to try out all the other opcodes not listed in chapter 3, but
> > >> >marked as 'alu' in the table in chapter 4 (like mov, movzx)?
> > >> >
> > >> >Thank you
> > >>
> > >> Hi,
> > >>
> > >> the latest version of the document is on our website.
> > >
> > >so, it still seems to have different information on the same instruction :-(
> > >
> > >Is there any additional information available, not shown on your web page?
> > >
> > >Thanks again!
> > >
> > >>
> > >> Kind regards
> > >> Bernard Likefett
> > >> AMD Technical Support
> > >
> > >
> > Bernard Likefett
> > AMD Technical Support
> > 
> > Please include all previous emails
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Advanced Micro Devices _______
> > AMD House \____ | Advanced
> > Frimley Business Park /| | | Micro
> > Frimley, Camberley | |___| | Devices
> > Surrey |____/ \|
> > GU16 5SL
> > United Kingdom
> > 
> > EMail id euro DOT tech AT amd DOT com Our Web site is http://www.amd.com
> > Phone +44 (0)1276 803299 Fax +44 (0)1276 803298
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Another thing in that manual is the nice table labeled 'Instruction
> Dispatch and Execution Timing' starting at page 35. Just a few
> questions:
> How many internal cycles do all these vector operations take?
> What internal execution units are used?
> 
> Well, there is no answer, so you have to try them out. The only thing
> you can be sure of is that you should always use opcodes which can get
> decoded in parallel. These are the ones marked with 'short', since it
> seems that the bottleneck of that CPU is the decoder.
> 
> The next thing is the nice tables in the chapter labeled 'Code Sample
> Analysis'. Did you really understand them? I tried to optimize some
> real code and took these tables as input, but I failed :-( My processor
> seems to behave very differently. I did not find out what was wrong.
> So it seems to me that a lot of information in this document is
> only for marketing purposes; there are too few details and too much
> wrong information to really help to optimize the code.
> 
> Wolfgang
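
Since the quoted advice comes down to "try it out", here is a minimal sketch
of how such a measurement might be set up in C. The idea is to time two
independent shift chains against two independent add chains: if shifts can
run in only one integer unit (as chapter 3 of the manual is quoted as
implying), the shift loop would be expected to take longer. All names are
invented for this sketch, clock() is coarse, and an optimizing compiler may
rewrite these loops, so the generated assembly should be checked. A
hand-written assembly loop would be more trustworthy.

```c
#include <time.h>

typedef unsigned (*kernel_fn)(unsigned long);

/* Two independent chains of shifts (each followed by an OR). */
unsigned two_shift_chains(unsigned long iters)
{
    unsigned a = 1u, b = 2u;
    unsigned long i;

    for (i = 0; i < iters; i++) {
        a = (a << 1) | 1u;
        b = (b << 1) | 1u;
    }
    return a + b;   /* returned so the work is not discarded */
}

/* Two independent chains of adds, used as the comparison case. */
unsigned two_add_chains(unsigned long iters)
{
    unsigned a = 0u, b = 0u;
    unsigned long i;

    for (i = 0; i < iters; i++) {
        a += 1u;
        b += 2u;
    }
    return a + b;
}

/* Runs fn once and returns the CPU time it used, in seconds.
   The kernel's result is stored through sink. */
double seconds_for(kernel_fn fn, unsigned long iters, unsigned *sink)
{
    clock_t start = clock();
    *sink = fn(iters);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling seconds_for() on both kernels with a large iteration count and
comparing the two times gives a first hint. Repeating the run several times
and taking the minimum helps against scheduling noise.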

-- 
                       OK. Lets make a signature file.
+-------------------------------------------------------------------------+
|        Jan Hubicka (Jan Hubi\v{c}ka in TeX) hubicka AT freesoft DOT cz         |
|         Czech free software foundation: http://www.freesoft.cz          |
|AA project - the new way for computer graphics - http://www.ta.jcu.cz/aa |
|  homepage: http://www.paru.cas.cz/~hubicka/, games koules, Xonix, fast  |
|  fractal zoomer XaoS, index of Czech GNU/Linux/UN*X documentation etc.  | 
+-------------------------------------------------------------------------+
