Message-Id: <3.0.32.19990815040026.01181400@pop.xs4all.nl>
X-Sender: diep AT pop DOT xs4all DOT nl
X-Mailer: Windows Eudora Pro Version 3.0 (32)
Date: Sun, 15 Aug 1999 04:00:27 +0100
To: pgcc AT delorie DOT com
From: Vincent Diepeveen
Subject: Re: optimizing for k6
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Reply-To: pgcc AT delorie DOT com

At 06:31 PM 8/14/99 +0200, you wrote:
>>
>> Don't use GCC for that experiment. GCC is the worst in producing code
>> using 8 bits dataformat.
>Well, that's true basically because it is unable to use HI parts of registers
>(that can change in future once David's SUBREG patches get in)
>Otherwise it does approx. the same job as for 32bit values.
>Because I was propagating the code in sched2 pass, I was getting exactly
>the same code as for 8 bit, only direct replacements like addb -> addl
>etc.
>If K6 were noticeably worse for 32 bit than for 8 bit I think I would have
>to measure the difference in my benchmarks.
>> My conclusion was that gcc optimized my code relatively
>> worse when there were 8 bits things to do.
>This is most probably true.
>>
>> But about the speed difference. Programming in assembler you clearly
>> notice the speed difference. benchmark reference: rebel program. www.rebel.nl
>I will check it out soon..
>>
>> Although my program was way faster when i had a combined 8 bits/32 bits
>> datastructure (so 8 bits code too), i chose for getting 32 bits completely
>> in order to get rid of possible casting faults from my side. Let's
>> keep it simple&easy...
>Isn't that mainly because the memory consumed by your program decreased
>when you changed your datastructure? K6 is very sensitive about memory,
>because it has quite small caches and refills are more costly than on the
>Intel CPU family.

I first thought so yeah, later i figured out that this was not the only
reason. Secondly i thought it would not matter that much removing the
8 bits code, as a lot of partial register stalls would go with it...
...that didn't speed up though.

>This can be quite well seen on the alignment issues. When you use 16 byte

All my code is 32 bits aligned. I gave all 8 bits arrays a size that is
a multiple of 4, or 2^n times that. How the 8 bits code is getting aligned
i don't know of course. If (p)gcc handles that badly, that's not my problem.
I think doing the above is about the maximum i can do in C, as i'm not
using inline assembler.

>alignment even on small program, you get large speed decrease (so result
>is worse than no alignment at all). Even when aligned code ought to load
>better to caches and decodes...
>I've implemented alignments according to AMD recommendations
>(keeping loops starts at least two instructions away from the cache line boundary
>and keeping predecode information outside cache lines) and performance hit
>seems to be 8-10% on average code...

I didn't do that of course, as i only program in C.

>Honza
>>
>> I still don't regret that decision. Especially not if i look to
>> how many KBs my source code is growing every month.
>>
>> >Honza
>> >>
>> >> >Honza
>> >> >>
>> >> >> Greetings,
>> >> >> Vincent
>> >> >>
>> >> >> /At 11:49 AM 8/7/99 +0200, you wrote:
>> >> >> >Henrik Berglund SdU wrote:
>> >> >> >>
>> >> >> >> ftp://ftp.sinica.edu.tw/pub/doc/cpu/www.amd.com/K6/k6docs/pdf/21828a.pdf
>> >> >> >>
>> >> >> >> -----------------------------------------------------------------------------
>> >> >> >> Henrik DOT Berglund AT mds DOT mdh DOT se
>> >> >> >> http://www.mds.mdh.se/~adb94hbd/
>> >> >> >
>> >> >> >This is a long known document, it does some help in optimizing. But the
>> >> >> >information is just too incomplete to get really good optimizations.
>> >> >> >
>> >> >> >There are also a lot of mistakes in that document.
>> >> >> >I had a little discussion
>> >> >> >with AMD technical support, but they did not help :-(
>> >> >> >AMD Technical Support wrote:
>> >> >> >>
>> >> >> >> >Return-Path:
>> >> >> >> >Sender: wolfi AT neuss DOT netsurf DOT de
>> >> >> >> >Date: Fri, 12 Mar 1999 19:10:15 +0100
>> >> >> >> >From: Wolfgang Formann
>> >> >> >> >To: AMD Technical Support
>> >> >> >> >Subject: Re: Some question to your literature, maybe a typo?
>> >> >> >> >References: <3 DOT 0 DOT 32 DOT 19990303153034 DOT 0074931c AT pedigree DOT amd DOT com>
>> >> >> >> >
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> it is the last update of the document. I think you must try it.
>> >> >> >>
>> >> >> >> Kind regards
>> >> >> >>
>> >> >> >> Bernard
>> >> >> >>
>> >> >> >> >AMD Technical Support wrote:
>> >> >> >> >>
>> >> >> >> >> >Return-Path:
>> >> >> >> >> >X-Sender: support2 AT pedigree
>> >> >> >> >> >Date: Thu, 25 Feb 1999 06:39:16 +0100
>> >> >> >> >> >To: blikefet AT pedigree DOT amd DOT com
>> >> >> >> >> >From: Wolfgang Formann (by way of CPA)
>> >> >> >> >> >Subject: Some question to your literature, maybe a typo?
>> >> >> >> >> >
>> >> >> >> >> >I just downloaded the document
>> >> >> >> >> >http://www.amd.com/K6/k6docs/pdf/21828a.pdf.
>> >> >> >> >> >The table in Chapter 4, pages 37 to 40, says that all the shift operations
>> >> >> >> >> >like SHIFT mreg16/32, imm8; SHIFT mreg16/32, 1; SHIFT mreg16/32, CL; where
>> >> >> >> >> >SHIFT can be replaced by SAR, SHL/SAL and SHR, are executed as RISC86(tm)
>> >> >> >> >> >Opcode alu. This RISC86(tm) operation is explained on page 24 as
>> >> >> >> >> >`alu - either of the integer execution units`.
>> >> >> >> >> >
>> >> >> >> >> >Whereas in chapter 3 on page 12, this document lists some (all?) operations
>> >> >> >> >> >which can be performed in the Integer Y execution unit.
>> >> >> >> >> >In the list of
>> >> >> >> >> >operations '(ADD, AND, CMP, OR, SUB and XOR)' none of the SHIFTs
>> >> >> >> >> >is mentioned.
>> >> >> >> >> >
>> >> >> >> >> >By trying it out (I think) I found that chapter 3 is right and the table
>> >> >> >> >> >in chapter 4 has typos.
>> >> >> >> >> >
>> >> >> >> >> >My question: Is there any updated version of this document available or
>> >> >> >> >> >do I have to try out all the other opcodes not listed in chapter 3, but
>> >> >> >> >> >marked as 'alu' in the table in chapter 4 (like mov, movzx)?
>> >> >> >> >> >
>> >> >> >> >> >Thank you
>> >> >> >> >>
>> >> >> >> >> Hi,
>> >> >> >> >>
>> >> >> >> >> the latest version of the document is on our website.
>> >> >> >> >
>> >> >> >> >so, it still seems to have different information on the same instruction :-(
>> >> >> >> >
>> >> >> >> >Is there any additional information available, not shown on your web page?
>> >> >> >> >
>> >> >> >> >Thanks again!
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> Kind regards
>> >> >> >> >> Bernard Likefett
>> >> >> >> >> AMD Technical Support
>> >> >> >> >
>> >> >> >> >
>> >> >> >> Bernard Likefett
>> >> >> >> AMD Technical Support
>> >> >> >>
>> >> >> >> Please include all previous emails
>> >> >> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >> >> >> Advanced Micro Devices            _______
>> >> >> >> AMD House                         \____  |    Advanced
>> >> >> >> Frimley Business Park             /|   | |    Micro
>> >> >> >> Frimley, Camberley               | |___| |    Devices
>> >> >> >> Surrey                           |____/ \|
>> >> >> >> GU16 5SL
>> >> >> >> United Kingdom
>> >> >> >>
>> >> >> >> EMail id euro DOT tech AT amd DOT com   Our Web site is http://www.amd.com
>> >> >> >> Phone +44 (0)1276 803299   Fax +44 (0)1276 803298
>> >> >> >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >> >> >
>> >> >> >Another thing in that manual is the nice table labeled 'Instruction
>> >> >> >Dispatch and Execution Timing' starting at page 35. Just a few
>> >> >> >questions:
>> >> >> >How many internal cycles do all these vector operations take?
>> >> >> >What internal execution units are used?
>> >> >> >
>> >> >> >Well, there is no answer, so you have to try them out. The only thing
>> >> >> >you can be sure of is that you should always use opcodes which can get
>> >> >> >decoded in parallel; these are the ones marked with 'short', since it
>> >> >> >seems that the bottleneck of that CPU is the decoder.
>> >> >> >
>> >> >> >The next thing is the nice tables in the chapter labeled 'Code Sample
>> >> >> >Analysis'. Did you really understand them? I tried to optimize some
>> >> >> >real code and took these tables as input, but I failed :-( My processor
>> >> >> >seems to behave very differently. I did not find out what was wrong.
>> >> >> >So it seems to me that a lot of information in this document is
>> >> >> >only for marketing purposes; there are too few details and too much
>> >> >> >wrong information to really help to optimize the code.
>> >> >> >
>> >> >> >Wolfgang
>> >> >> >
>> >> >
>> >> >--
>> >> > OK. Lets make a signature file.
>> >> >+-------------------------------------------------------------------------+
>> >> >| Jan Hubicka (Jan Hubi\v{c}ka in TeX) hubicka AT freesoft DOT cz         |
>> >> >| Czech free software foundation: http://www.freesoft.cz                  |
>> >> >|AA project - the new way for computer graphics - http://www.ta.jcu.cz/aa |
>> >> >| homepage: http://www.paru.cas.cz/~hubicka/, games koules, Xonix, fast   |
>> >> >| fractal zoomer XaoS, index of Czech GNU/Linux/UN*X documentation etc.   |
>> >> >+-------------------------------------------------------------------------+
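
A minimal sketch in C of the layout trade-off discussed in this thread. It is not code from any program mentioned here. The names `TABLE_SIZE`, `table8`, `table32`, `fill_tables`, `sum8` and `sum32` are hypothetical, and the table size of 64 is an assumption chosen only to be a multiple of 4. The sketch holds the same data once as 8 bits entries and once as 32 bits entries, so the two variants can be timed or their generated assembler compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64   /* hypothetical size, kept a multiple of 4 */

/* Mixed layout: 8 bits entries, so 64 entries occupy 64 bytes. */
static uint8_t  table8[TABLE_SIZE];

/* All-32-bits layout: the same 64 entries occupy 256 bytes, four times
   the cache footprint, but no narrowing casts are needed anywhere. */
static uint32_t table32[TABLE_SIZE];

/* Fill both tables with identical small values that fit in 8 bits. */
static void fill_tables(void)
{
    assert(TABLE_SIZE % 4 == 0);
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        table8[i]  = (uint8_t)(i % 13);
        table32[i] = (uint32_t)(i % 13);
    }
}

/* Sum the 8 bits table; every element is widened when it is loaded. */
static uint32_t sum8(void)
{
    uint32_t total = 0;
    for (size_t i = 0; i < TABLE_SIZE; i++)
        total += table8[i];
    return total;
}

/* Sum the 32 bits table; loads and adds use full-width operands only. */
static uint32_t sum32(void)
{
    uint32_t total = 0;
    for (size_t i = 0; i < TABLE_SIZE; i++)
        total += table32[i];
    return total;
}
```

Calling `fill_tables()` and then comparing `sum8()` with `sum32()` gives the same result from both layouts. Which one runs faster depends on the compiler and the CPU, which is the open question in the mails above.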