Mail Archives: pgcc/1999/08/14/15:14:40

Message-ID: <19990814183125.24893@atrey.karlin.mff.cuni.cz>

Date: Sat, 14 Aug 1999 18:31:25 +0200

From: Jan Hubicka <hubicka AT atrey DOT karlin DOT mff DOT cuni DOT cz>

To: pgcc AT delorie DOT com

Subject: Re: optimizing for k6

References: <3 DOT 0 DOT 32 DOT 19990814040832 DOT 01181ec0 AT pop DOT xs4all DOT nl>

Mime-Version: 1.0

X-Mailer: Mutt 0.84

In-Reply-To: <3.0.32.19990814040832.01181ec0@pop.xs4all.nl>; from Vincent Diepeveen on Sat, Aug 14, 1999 at 04:08:34AM +0100

Reply-To: pgcc AT delorie DOT com

X-Mailing-List: pgcc AT delorie DOT com

X-Unsubscribes-To: listserv AT delorie DOT com

> 
> Don't use GCC for that experiment. GCC is the worst in producing code
> using 8 bits dataformat. 
Well, that's true, basically because it is unable to use the high parts of
registers (that may change in the future once David's SUBREG patches get in).
Otherwise it does approximately the same job as for 32-bit values.
Because I was propagating the code in the sched2 pass, I was getting exactly
the same code as for 8 bit, with only direct replacements like addb -> addl
etc.
If the K6 were noticeably worse at 32-bit than at 8-bit operations, I think
I would have measured the difference in my benchmarks.
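
A minimal sketch of the kind of loop pair one could use to compare the two
widths. The function names are hypothetical and not taken from any program
discussed in this thread. Compiling it with `gcc -O2 -S` and comparing the
two loop bodies shows whether the compiler emits the same instruction
sequence apart from the operand size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add a constant to every element using 8-bit arithmetic.
   The result wraps modulo 256. */
static void add_bytes(uint8_t *v, size_t n, uint8_t k)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        v[i] = (uint8_t)(v[i] + k);
}

/* The same loop with 32-bit elements.
   The result wraps modulo 2^32. */
static void add_words(uint32_t *v, size_t n, uint32_t k)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        v[i] += k;
}
```

Note that the two versions are interchangeable only while the values stay
below 256; once an 8-bit element overflows, the results differ.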
> My conclusion was that gcc optimized my code relatively
> worse when there were 8 bits things to do.
This is most probably true.
> 
> But about the speed difference. Programming in assembler you clearly
> notice the speed difference. benchmark reference: rebel program. www.rebel.nl
I will check it out soon.
> 
> Although my program was way faster when i had a combined 8 bits/32 bits
> datastructure (so 8 bits code too), i chose for getting 32 bits completely
> in order to get rid of possible casting faults from my side. Let's
> keep it simple&easy...
Isn't that mainly because the memory consumed by your program decreased
when you changed your data structure? The K6 is very sensitive to memory
use, because it has quite small caches and refills are more costly than on
the Intel CPU family.
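
To illustrate the footprint argument, here is a rough sketch. The record
layout is invented for illustration and is not Vincent's actual data
structure, and the cache line size is a parameter you supply, not a figure
taken from this thread. It counts how many cache lines an array of records
occupies when the fields are 8 bits wide versus 32 bits wide:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record whose fields all fit in 8 bits. */
struct rec_narrow { uint8_t kind, file, rank, flags; };

/* The same record with every field widened to 32 bits. */
struct rec_wide { uint32_t kind, file, rank, flags; };

/* Number of cache lines occupied by n contiguous records of rec_size
   bytes, assuming the array starts on a line boundary. */
static size_t lines_touched(size_t n, size_t rec_size, size_t line_size)
{
    assert(line_size > 0);
    return (n * rec_size + line_size - 1) / line_size;
}
```

For example, `lines_touched(1024, sizeof(struct rec_wide), 32)` is four
times `lines_touched(1024, sizeof(struct rec_narrow), 32)`, so a full scan
of the wide array needs four times as many line refills.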
This can be seen quite well in the alignment issues. When you use 16-byte
alignment, even on a small program, you get a large slowdown (so the result
is worse than no alignment at all), even though aligned code ought to load
into the caches and decode better.
I've implemented alignment according to the AMD recommendations
(keeping loop starts at least two instructions away from the cache line
boundary and keeping predecode information outside cache lines), and the
performance hit seems to be 8-10% on average code.
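
A small sketch of why selective padding costs less code size than blanket
16-byte alignment. This is only one possible reading of the "away from the
cache line boundary" rule, not the actual GCC patch: the helper names are
hypothetical, and the line size and the minimum byte count are parameters
you have to supply yourself.

```c
#include <assert.h>
#include <stdint.h>

/* Padding needed to always align a loop start to `align` bytes. */
static unsigned pad_always(uintptr_t addr, unsigned align)
{
    assert(align > 0);
    return (unsigned)((align - addr % align) % align);
}

/* Padding needed only when fewer than `min_bytes` remain before the next
   cache line boundary; otherwise the loop start is left where it is. */
static unsigned pad_if_near_boundary(uintptr_t addr, unsigned line_size,
                                     unsigned min_bytes)
{
    assert(line_size > 0 && min_bytes <= line_size);
    unsigned room = line_size - (unsigned)(addr % line_size);
    return room >= min_bytes ? 0 : room;  /* else skip to the next line */
}
```

With an assumed 32-byte line and 8 bytes for the first two instructions, a
loop at offset 17 gets 15 bytes of padding from `pad_always(17, 16)` but
none from `pad_if_near_boundary(17, 32, 8)`.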

Honza
> 
> I still don't regret that decision. Especially not if i look to
> how many KBs my source code is growing every month.
> 
> >Honza
>  
> >> >Honza
> >> >> Greetings,
> >> >> Vincent
> >> >> 
> >> >> /At 11:49 AM 8/7/99 +0200, you wrote:
> >> >> >Henrik Berglund SdU wrote:
> >> >> >> 
> >> >> >> ftp://ftp.sinica.edu.tw/pub/doc/cpu/www.amd.com/K6/k6docs/pdf/21828a.pdf
> >> >> >> 
> >> >> >> -----------------------------------------------------------------------------
> >> >> >> Henrik DOT Berglund AT mds DOT mdh DOT se
> >> >> >> http://www.mds.mdh.se/~adb94hbd/
> >> >> >
> >> >> >This is a long known document, it does some help in optimizing. But the
> >> >> >information is just too incomplete to get really good optimizations.
> >> >> >
> >> >> >There is also a lot of mistakes in that document. I had a little
> >> >> >discussion with AMD technical support, but they did not help :-(
> >> >> >AMD Technical Support wrote:
> >> >> >> 
> >> >> >> >Return-Path: <w DOT formann AT neuss DOT netsurf DOT de>
> >> >> >> >Sender: wolfi AT neuss DOT netsurf DOT de
> >> >> >> >Date: Fri, 12 Mar 1999 19:10:15 +0100
> >> >> >> >From: Wolfgang Formann <w DOT formann AT neuss DOT netsurf DOT de>
> >> >> >> >To: AMD Technical Support <blikefet AT pedigree DOT amd DOT com>
> >> >> >> >Subject: Re: Some question to your literature, maybe a typo?
> >> >> >> >References: <3 DOT 0 DOT 32 DOT 19990303153034 DOT 0074931c AT pedigree DOT amd DOT com>
> >> >> >> >
> >> >> >> 
> >> >> >> Hi,
> >> >> >> 
> >> >> >> it is the last update of the document. I think you must try it.
> >> >> >> 
> >> >> >> Kind regards
> >> >> >> 
> >> >> >> Bernard
> >> >> >> 
> >> >> >> >AMD Technical Support wrote:
> >> >> >> >>
> >> >> >> >> >Return-Path: <euro DOT lit AT amd DOT com>
> >> >> >> >> >X-Sender: support2 AT pedigree
> >> >> >> >> >Date: Thu, 25 Feb 1999 06:39:16 +0100
> >> >> >> >> >To: blikefet AT pedigree DOT amd DOT com
> >> >> >> >> >From: Wolfgang Formann <w DOT formann AT neuss DOT netsurf DOT de>
> >> >> >> >> >(by way of CPA <euro DOT lit AT amd DOT com>)
> >> >> >> >> >Subject: Some question to your literature, maybe a typo?
> >> >> >> >> >
> >> >> >> >> >I just downloaded the document
> >> >> >> >> >http://www.amd.com/K6/k6docs/pdf/21828a.pdf.
> >> >> >> >> >The table in Chapter 4, pages 37 to 40, says that all the shift
> >> >> >> >> >operations like SHIFT mreg16/32, imm8; SHIFT mreg16/32, 1;
> >> >> >> >> >SHIFT mreg16/32, CL; where SHIFT can be replaced by SAR, SHL/SAL
> >> >> >> >> >and SHR, are executed as RISC86(tm) Opcode alu. This RISC86(tm)
> >> >> >> >> >operation is explained on page 24 as
> >> >> >> >> >`alu - either of the integer execution units`.
> >> >> >> >> >
> >> >> >> >> >Whereas in chapter 3 on page 12, this document lists some (all?)
> >> >> >> >> >operations which can be performed in the Integer Y execution
> >> >> >> >> >unit. In the list of operations '(ADD, AND, CMP, OR, SUB and
> >> >> >> >> >XOR)' there is none of the SHIFT's mentioned.
> >> >> >> >> >
> >> >> >> >> >By trying it out (I think) I found that chapter 3 is right and
> >> >> >> >> >the table in chapter 4 has typos.
> >> >> >> >> >
> >> >> >> >> >My question: Is there any updated version of this document
> >> >> >> >> >available or do I have to try out all the other opcodes not
> >> >> >> >> >listed in chapter 3, but marked as 'alu' in the table in
> >> >> >> >> >chapter 4 (like mov, movzx)?
> >> >> >> >> >
> >> >> >> >> >Thank you
> >> >> >> >>
> >> >> >> >> Hi,
> >> >> >> >>
> >> >> >> >> the latest version of the document is on the our webside.
> >> >> >> >
> >> >> >> >so, it still seems to have different information on the same
> >> >> >> >instruction :-(
> >> >> >> >
> >> >> >> >Is there any additional information available, not shown on your
> >> >> >> >web page?
> >> >> >> >
> >> >> >> >Thanks again!
> >> >> >> >
> >> >> >> >>
> >> >> >> >> Kind regards
> >> >> >> >> Bernard Likefett
> >> >> >> >> AMD Technical Support
> >> >> >> >
> >> >> >> >
> >> >> >> Bernard Likefett
> >> >> >> AMD Technical Support
> >> >> >> 
> >> >> >> Please included all previous emails
> >> >> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> >> >> Advanced Micro Devices _______
> >> >> >> AMD House \____ | Advanced
> >> >> >> Frimley Business Park /| | | Micro
> >> >> >> Frimley, Camberley | |___| | Devices
> >> >> >> Surrey |____/ \|
> >> >> >> GU16 5SL
> >> >> >> United Kingdom
> >> >> >> 
> >> >> >> EMail id euro DOT tech AT amd DOT com Our Web site is http://www.amd.com
> >> >> >> Phone +44 (0)1276 803299 Fax +44 (0)1276 803298
> >> >> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> >> >
> >> >> >Another thing in that manual is the nice table labeled 'Instruction
> >> >> >Dispatch and Execution Timing' starting at page 35. Just a few
> >> >> >questions:
> >> >> >How many internal cycles do all these vector operations take?
> >> >> >What internal execution units are used?
> >> >> >
> >> >> >Well, there is no answer, so you have to try them out. The only thing
> >> >> >you can be sure of, is that you should always use opcodes which can get
> >> >> >decoded in parallel, these are the ones marked with 'short' since it
> >> >> >seems that the bottleneck of that CPU is the decoder.
> >> >> >
> >> >> >The next thing is the nice tables in the chapter labeled 'Code Sample
> >> >> >Analysis'. Did you really understand them? I tried to optimize some
> >> >> >real code and took these tables as input, but I failed :-( My processor
> >> >> >seems to behave very different. I did not find out what was wrong.
> >> >> >So it seems to me, that a lot of information in this document is
> >> >> >only for marketing purposes, there are too few details and too many
> >> >> >wrong informations to really help to optimize the code.
> >> >> >
> >> >> >Wolfgang
> >> >> >
> >> >> >

-- 
                       OK. Lets make a signature file.
+-------------------------------------------------------------------------+
|        Jan Hubicka (Jan Hubi\v{c}ka in TeX) hubicka AT freesoft DOT cz         |
|         Czech free software foundation: http://www.freesoft.cz          |
|AA project - the new way for computer graphics - http://www.ta.jcu.cz/aa |
|  homepage: http://www.paru.cas.cz/~hubicka/, games koules, Xonix, fast  |
|  fractal zoomer XaoS, index of Czech GNU/Linux/UN*X documentation etc.  | 
+-------------------------------------------------------------------------+
