Mail Archives: pgcc/1999/08/14/23:21:05

Message-Id: <3.0.32.19990815040026.01181400@pop.xs4all.nl>
X-Sender: diep AT pop DOT xs4all DOT nl
X-Mailer: Windows Eudora Pro Version 3.0 (32)
Date: Sun, 15 Aug 1999 04:00:27 +0100
To: pgcc AT delorie DOT com
From: Vincent Diepeveen <diep AT xs4all DOT nl>
Subject: Re: optimizing for k6
Mime-Version: 1.0
Reply-To: pgcc AT delorie DOT com

At 06:31 PM 8/14/99 +0200, you wrote:
>> 
>> Don't use GCC for that experiment. GCC is the worst in producing code
>> using 8 bits dataformat. 
>Well, thats true basically because it is unable to use HI parts of registers
>(that can change in future once David's SUBREG patches gets in)
>Otherwise it does approx. same job as for 32bit values.
>Because I was propagating the code in sched2 pass, I was getting exactly
>same code as for 8 bit, only direct replacements like addb -> addl
>etc.
>If K6 were noticeably worse for 32 bit than for 8 bit I think I would have
>to measure the difference in my benchmarks.
>> My conclusion was that gcc optimized my code relatively
>> worse when there were 8 bits things to do.
>This is most probably true.
>> 
>> But about the speed difference. Programming in assembler you clearly
>> notice the speed difference. benchmark reference: rebel program.
www.rebel.nl
>I will check it out soon..
>> 
>> Although my program was way faster when i had a combined 8 bits/32 bits
>> datastructure (so 8 bits code too), i chose for getting 32 bits completely
>> in order to get rid of possible casting faults from my side. Let's
>> keep it simple&easy...
>Isn't that mainly because the memory consumed by your program decreased
>when you changed your data structure? The K6 is very sensitive to memory use,
>because it has quite small caches and refills are more costly than on the
>Intel CPU family.

I first thought so, yes; later I figured out that this was not the only
reason.

Secondly, I thought that removing the 8-bit code would not matter that
much, as a lot of partial register stalls would go away with it... but
that didn't speed things up.
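
To make the memory question concrete, here is a minimal sketch of the two kinds of layout being compared. The record and field names (entry_mixed, entry_wide, piece, color, score) are hypothetical and are not taken from my program; the sketch only shows how widening every field to 32 bits changes the footprint of a table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record using a combined 8-bit/32-bit layout. */
struct entry_mixed {
    uint8_t  piece;   /* small values that fit in 8 bits */
    uint8_t  color;
    uint32_t score;
};

/* The same record with every field widened to 32 bits. */
struct entry_wide {
    uint32_t piece;
    uint32_t color;
    uint32_t score;
};

/* Bytes occupied by a table of n records of each kind. */
size_t mixed_table_bytes(size_t n) { return n * sizeof(struct entry_mixed); }
size_t wide_table_bytes(size_t n)  { return n * sizeof(struct entry_wide); }

/* Sum the 8-bit field; each byte is widened before the add. */
uint32_t sum_pieces_mixed(const struct entry_mixed *t, size_t n)
{
    uint32_t sum = 0;
    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += t[i].piece;
    return sum;
}

/* Sum the 32-bit field; no widening is needed. */
uint32_t sum_pieces_wide(const struct entry_wide *t, size_t n)
{
    uint32_t sum = 0;
    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += t[i].piece;
    return sum;
}
```

On a typical 32-bit x86 ABI the mixed record usually takes 8 bytes (two bytes of data, two of padding, then the 32-bit field) against 12 for the wide one, so the all-32-bit table is larger. Whether that difference shows up in run time depends on the compiler and the cache behaviour, so it has to be measured.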

>This can be quite well seen on the alignment issues. When you use 16 byte

All my code is 32-bit aligned. I made the size of every 8-bit array a
multiple of 4, or of 4 times some power of two. How the 8-bit code itself
gets aligned I don't know, of course. If (p)gcc handles that badly, that's
not my problem. I think doing the above is about the most I can do in C,
as I'm not using inline assembler.
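
A minimal sketch of that size rounding in plain C, with no inline assembler. The names round_up, ROUND_UP4 and piece_table are made up for illustration; this only pads array sizes, and where the compiler actually places the array is still up to (p)gcc.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align; align must be a power of two. */
size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Compile-time variant for array declarations: next multiple of 4. */
#define ROUND_UP4(n) (((n) + 3u) & ~3u)

/* Hypothetical 8-bit table: 13 entries are needed, 16 are declared. */
unsigned char piece_table[ROUND_UP4(13)];

/* Size of the padded table in bytes. */
size_t piece_table_size(void)
{
    return sizeof piece_table;
}
```

With sizes padded like this, a 32-bit field or another array declared right after an 8-bit array does not start at an odd offset merely because of the array's length.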

>alignment even on small program, you get large speed decrease (so result
>is worse that no alignment at all). Even when aligned code ought to load
>better to caches and decodes...
>I've implemented alignments according to AMD recommendations
>(keeping loops starts at least two instructions away from the cache line boundary
>and keeping predecode information outside cache lines) and performance hit
>seems to be 8-10% on average code...

I didn't do that, of course, as I only program in C.

>Honza
>> 
>> I still don't regret that decision. Especially not if i look to
>> how many KBs my source code is growing every month.
>> 
>> >Honza
>>  
>> >> >Honza
>> >> >> Greetings,
>> >> >> Vincent
>> >> >> 
>> >> >> /At 11:49 AM 8/7/99 +0200, you wrote:
>> >> >> >Henrik Berglund SdU wrote:
>> >> >> >> 
>> >> >> >> ftp://ftp.sinica.edu.tw/pub/doc/cpu/www.amd.com/K6/k6docs/pdf/21828a.pdf
>> >> >> >> 
>> >> >> >> -----------------------------------------------------------------------------
>> >> >> >> Henrik DOT Berglund AT mds DOT mdh DOT se
>> >> >> >> http://www.mds.mdh.se/~adb94hbd/
>> >> >> >
>> >> >> >This is a long known document, it does some help in optimizing. But the
>> >> >> >information is just too incomplete to get really good optimizations.
>> >> >> >
>> >> >> >There is also a lot of mistakes in that document. I had a little discussion
>> >> >> >with AMD technical support, but they did not help :-(
>> >> >> >AMD Technical Support wrote:
>> >> >> >> 
>> >> >> >> >Return-Path: <w DOT formann AT neuss DOT netsurf DOT de>
>> >> >> >> >Sender: wolfi AT neuss DOT netsurf DOT de
>> >> >> >> >Date: Fri, 12 Mar 1999 19:10:15 +0100
>> >> >> >> >From: Wolfgang Formann <w DOT formann AT neuss DOT netsurf DOT de>
>> >> >> >> >To: AMD Technical Support <blikefet AT pedigree DOT amd DOT com>
>> >> >> >> >Subject: Re: Some question to your literature, maybe a typo?
>> >> >> >> >References: <3 DOT 0 DOT 32 DOT 19990303153034 DOT 0074931c AT pedigree DOT amd DOT com>
>> >> >> >> >
>> >> >> >> 
>> >> >> >> Hi,
>> >> >> >> 
>> >> >> >> it is the last update of the document. I think you must try it.
>> >> >> >> 
>> >> >> >> Kind regards
>> >> >> >> 
>> >> >> >> Bernard
>> >> >> >> 
>> >> >> >> >AMD Technical Support wrote:
>> >> >> >> >>
>> >> >> >> >> >Return-Path: <euro DOT lit AT amd DOT com>
>> >> >> >> >> >X-Sender: support2 AT pedigree
>> >> >> >> >> >Date: Thu, 25 Feb 1999 06:39:16 +0100
>> >> >> >> >> >To: blikefet AT pedigree DOT amd DOT com
>> >> >> >> >> >From: Wolfgang Formann <w DOT formann AT neuss DOT netsurf DOT de> (by way of CPA <euro DOT lit AT amd DOT com>)
>> >> >> >> >> >Subject: Some question to your literature, maybe a typo?
>> >> >> >> >> >
>> >> >> >> >> >I just downloaded the document http://www.amd.com/K6/k6docs/pdf/21828a.pdf.
>> >> >> >> >> >The table in Chapter 4, pages 37 to 40 says that all the shift operations
>> >> >> >> >> >like SHIFT mreg16/32, imm8; SHIFT mreg16/32, 1; SHIFT mreg16/32, CL; where
>> >> >> >> >> >SHIFT can be replaced by SAR, SHL/SAL and SHR, are executed as RISC86(tm)
>> >> >> >> >> >Opcode alu. This RISC86(tm) operation is explained on page 24 as
>> >> >> >> >> >`alu - either of the integer execution units`.
>> >> >> >> >> >
>> >> >> >> >> >Whereas in chapter 3 on page 12, this document lists some (all?) operations
>> >> >> >> >> >which can be performed in the Integer Y execution unit. In the list of
>> >> >> >> >> >operations '(ADD, AND, CMP, OR, SUB and XOR)' there is none of the SHIFT's
>> >> >> >> >> >mentioned.
>> >> >> >> >> >
>> >> >> >> >> >By trying it out (I think) I found that chapter 3 is right and the table
>> >> >> >> >> >in chapter 4 has typos.
>> >> >> >> >> >
>> >> >> >> >> >My question: Is there any updated version of this document available or
>> >> >> >> >> >do I have to try out all the other opcodes not listed in chapter 3, but
>> >> >> >> >> >marked as 'alu' in the table in chapter 4 (like mov, movzx)?
>> >> >> >> >> >
>> >> >> >> >> >Thank you
>> >> >> >> >>
>> >> >> >> >> Hi,
>> >> >> >> >>
>> >> >> >> >> the latest version of the document is on the our webside.
>> >> >> >> >
>> >> >> >> >so, it still seems to have different information on the same instruction :-(
>> >> >> >> >
>> >> >> >> >Is there any additional information available, not shown on your web page?
>> >> >> >> >
>> >> >> >> >Thanks again!
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> Kind regards
>> >> >> >> >> Bernard Likefett
>> >> >> >> >> AMD Technical Support
>> >> >> >> >
>> >> >> >> >
>> >> >> >> Bernard Likefett
>> >> >> >> AMD Technical Support
>> >> >> >> 
>> >> >> >> Please included all previous emails
>> >> >> >>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >> >> >> Advanced Micro Devices _______
>> >> >> >> AMD House \____ | Advanced
>> >> >> >> Frimley Business Park /| | | Micro
>> >> >> >> Frimley, Camberley | |___| | Devices
>> >> >> >> Surrey |____/ \|
>> >> >> >> GU16 5SL
>> >> >> >> United Kingdom
>> >> >> >> 
>> >> >> >> EMail id euro DOT tech AT amd DOT com Our Web site is http://www.amd.com
>> >> >> >> Phone +44 (0)1276 803299 Fax +44 (0)1276 803298
>> >> >> >>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >> >> >
>> >> >> >Another thing in that manual is the nice table labeled 'Instruction
>> >> >> >Dispatch and Execution Timing' starting at page 35. Just a few
>> >> >> >questions:
>> >> >> >How many internal cycles do all these vector operations take?
>> >> >> >What internal execution units are used?
>> >> >> >
>> >> >> >Well, there is no answer, so you have to try them out. The only thing
>> >> >> >you can be sure of, is that you should always use opcodes which can get
>> >> >> >decoded in parallel, these are the ones marked with 'short' since it
>> >> >> >seems that the bottleneck of that CPU is the decoder.
>> >> >> >
>> >> >> >The next thing is the nice tables in the chapter labeled 'Code Sample
>> >> >> >Analysis'. Did you really understand them? I tried to optimize some
>> >> >> >real code and took these tables as input, but I failed :-( My processor
>> >> >> >seems to behave very different. I did not find out what was wrong.
>> >> >> >So it seems to me, that a lot of information in this document is
>> >> >> >only for marketing purposes, there are too few details and too many
>> >> >> >wrong informations to really help to optimize the code.
>> >> >> >
>> >> >> >Wolfgang
>> >> >> >
>> >> >> >
>> >> >
>> >> >-- 
>> >> >                       OK. Lets make a signature file.
>> >> >+-------------------------------------------------------------------------+
>> >> >|        Jan Hubicka (Jan Hubi\v{c}ka in TeX) hubicka AT freesoft DOT cz         |
>> >> >|         Czech free software foundation: http://www.freesoft.cz          |
>> >> >|AA project - the new way for computer graphics - http://www.ta.jcu.cz/aa |
>> >> >|  homepage: http://www.paru.cas.cz/~hubicka/, games koules, Xonix, fast  |
>> >> >|  fractal zoomer XaoS, index of Czech GNU/Linux/UN*X documentation etc.  |
>> >> >+-------------------------------------------------------------------------+
>> >> >
>> >> >
>> >