Date: Wed, 11 Jan 95 11:48 MST
From: mat AT ardi DOT com (Mat Hostetter)
To: gbm AT ii DOT pw DOT edu DOT pl (Grzegorz B. Mazur)
Cc: djgpp AT sun DOT soe DOT clarkson DOT edu
Subject: Re: deadly optimization
References: <9501111448 DOT AA22316 AT viki DOT ii DOT pw DOT edu DOT pl>

>>>>> "Grzegorz" == Grzegorz B Mazur writes:

    Grzegorz> I am really impressed with optimization discussion on
    Grzegorz> the list. I would like to point out that on 486 and
    Grzegorz> Pentium simple ADD (or adding two registers to form the
    Grzegorz> address) is not ANY slower than simple move. You can't
    Grzegorz> get any nanosecond by substituting say mov al, [table +
    Grzegorz> edx] with "mov dl, x; mov al, [edx]", and the code using
    Grzegorz> such trick is quite risky (non-portable, non-debuggable,
    Grzegorz> etc.).

You have to do a "move" regardless, so there is no point in comparing
the speed of a move to that of an add. You can't just "add" the input
byte from memory to your address register in one instruction, since
the input is a byte and the address register is not. But you're right
that using an index register in an addressing mode effectively gives
you a free add on the Pentium (1 extra cycle on the i486).

You didn't show the full code in the first case. In order to use %edx
as an index, it has to have zero in the high three bytes. You can set
this up with movzbl (non-pairable on the Pentium!) or by clearing the
high three bytes outside the loop. However, storing 0 in them isn't
much simpler than storing a pointer in them (which works as long as
the table is 256-byte aligned, so that the low byte of the pointer is
zero). Zeroing the high three bytes is actually the technique I use
when I need to scale the input byte by 2, 4, or 8, for some other code
that both translates bytes->longs, etc. and writes to the SVGA window
(and so it needs to be in assembly).

The choices would be:

            Grzegorz                            Mat
         --------------                 --------------------
        xorl %edx,%edx                  movl $my_pointer,%edx
loop:   movb (%esi),%dl         loop:   movb (%esi),%dl
        movb (%eax,%edx),%bl            movb (%edx),%bl
        movb %bl,(%edi)                 movb %bl,(%edi)
        ...                             ...

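For readers who don't read AT&T syntax, the two loops can be sketched
in C roughly as follows. This is only an illustration under stated
assumptions, not the code from either poster: the names `table`,
`translate_indexed` and `translate_aligned` are made up, the table is
assumed to be 256-byte aligned (via C11 `_Alignas`), and the
pointer/integer round trip assumes a flat address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256-entry translation table. The 256-byte alignment is an
   assumption that the second variant depends on: the low byte of the
   table's address must be zero. */
static _Alignas(256) unsigned char table[256];

/* Left-hand loop: the zero-extended input byte is an index off a base. */
static void translate_indexed(unsigned char *dst, const unsigned char *src,
                              size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = table[src[i]];
}

/* Right-hand loop: the input byte replaces the low byte of the table's
   address, which is what "movb (%esi),%dl" does to %edx. */
static void translate_aligned(unsigned char *dst, const unsigned char *src,
                              size_t n)
{
    uintptr_t base = (uintptr_t)table;

    assert((base & 0xFF) == 0);   /* the trick is wrong without alignment */
    for (size_t i = 0; i < n; i++)
        dst[i] = *(const unsigned char *)(base | src[i]);
}
```

Both functions produce the same output for any input; a C compiler is
of course free to generate the same instructions for either, so the
distinction only matters in hand-written assembly.
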
I believe they are both the same speed on the Pentium, but the second
loop is smaller and faster on the i486. The first loop does not
require any particular alignment from your translation array. If the
base of the array is fixed, and you can replace %eax with a constant
address, e.g.:

        movb _my_array(%edx),%bl

then the two are indeed the same speed on both chips, and the former
requires no wasteful alignment.

If you're going to bother to code in assembly, you might as well do it
right and write the fastest possible code. The major win for hand
coding this loop is the instruction scheduling, since gcc does not yet
have any clue about instruction scheduling on the Pentium.

    Grzegorz> With such a beast any code handcrafting is a waste of
    Grzegorz> time. Note how much time you spend writing inline code,
    Grzegorz> and count how much time it can save during execution...

I mostly agree; writing in assembly is *almost* always a lose. It
takes more time to write, which means you spend less time optimizing
other parts of your program and adding functionality. It's harder to
debug and non-portable. Exceptions should only be made for
*exceptionally* time-critical inner loops in programs where
performance matters, and even then I write C versions of the same
thing and #ifdef i386 the assembly.

The code I bothered to write in assembly can be used by Executor to
translate the entire screen many times per second, so it has to be
*fast*. However, by comparison, our dynamically compiling 68040
emulator is entirely written in C! Many CPU emulators you see out
there are written in assembly, but they use stupid algorithms which
let them get beaten by better emulators written in C.

-Mat
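
The "C version plus #ifdef" arrangement mentioned above might look
something like the sketch below. Everything named here is an
assumption for illustration: `translate_bytes`, `translate_bytes_asm`
and the `HAVE_TRANSLATE_ASM` switch are hypothetical, and `__i386__`
is the spelling current gcc predefines on 32-bit x86, used here in
place of the bare `i386` in the text.

```c
#include <stddef.h>

#if defined(__i386__) && defined(HAVE_TRANSLATE_ASM)
/* Hypothetical hand-scheduled routine, assumed to be written separately
   in assembly and linked in only on i386 builds. */
extern void translate_bytes_asm(unsigned char *dst, const unsigned char *src,
                                const unsigned char *table, size_t n);
#define translate_bytes translate_bytes_asm
#else
/* Portable C version: serves as both the reference implementation and
   the fallback on every other target. */
static void translate_bytes(unsigned char *dst, const unsigned char *src,
                            const unsigned char *table, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = table[src[i]];
}
#endif
```

Callers use `translate_bytes` either way, so the assembly stays an
optional speedup and the C version remains available for debugging
and for other platforms.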