Date: Wed, 11 Jan 95 11:48 MST
From: mat AT ardi DOT com (Mat Hostetter)
To: gbm AT ii DOT pw DOT edu DOT pl (Grzegorz B. Mazur)
Cc: djgpp AT sun DOT soe DOT clarkson DOT edu
Subject: Re: deadly optimization
References: <9501111448 DOT AA22316 AT viki DOT ii DOT pw DOT edu DOT pl>

>>>>> "Grzegorz" == Grzegorz B Mazur <gbm AT ii DOT pw DOT edu DOT pl> writes:

    Grzegorz> I am really impressed with optimization discussion on
    Grzegorz> the list. I would like to point out that on 486 and
    Grzegorz> Pentium simple ADD (or adding two registers to form the
    Grzegorz> address) is not ANY slower than simple move. You can't
    Grzegorz> get any nanosecond by substituting say mov al, [table +
    Grzegorz> edx] with "mov dl, x; mov al, [edx]", and the code using
    Grzegorz> such trick is quite risky (non-portable, non-debuggable,
    Grzegorz> etc.).

You have to do a "move" regardless, so there is no point in comparing
the speed of a move to that of an add.  You can't just "add" the input
byte from memory to your address register in one instruction, since
the input is a byte and the address register is not.  But you're right
in that using an index register in an addressing mode effectively
gives you a free add on the Pentium (1 extra cycle on the i486).

You didn't show the full code in the first case.  In order to use %edx
as an index, it has to have zero in the high three bytes.  You can set
this up with movzbl (non-pairable on the Pentium!) or by clearing the
high three bytes outside the loop.  However, storing 0 in them isn't
much simpler than storing a pointer in them.  Zeroing the high three
bytes is actually the technique I use when I need to scale the input
byte by 2, 4, or 8 for some other code that both translates
bytes->longs, etc. and writes to the SVGA window (and so it needs to
be in assembly).  The choices would be:

        Grzegorz			Mat
	--------------			--------------------
	xorl %edx,%edx			movl $my_pointer,%edx
loop:	movb (%esi),%dl		loop:	movb (%esi),%dl
	movb (%eax,%edx),%bl		movb (%edx),%bl
	movb %bl,(%edi)			movb %bl,(%edi)
	...				...

I believe they are both the same speed on the Pentium, but the second
loop is smaller and faster on the i486.  The first loop does not
require any particular alignment from your translation array, whereas
the second needs the array on a 256-byte boundary so that the low byte
of %edx starts out as zero.  If the base of the array is fixed, and
you can replace %eax with a constant address, e.g.:

	movb _my_array(%edx),%bl

then the two are indeed the same speed on both chips, and the former
requires no wasteful alignment.
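
For readers who want to see the two addressing schemes side by side,
here is a minimal C sketch of the same idea.  It is an illustration,
not code from the original thread: the names xlat_table,
translate_indexed and translate_lowbyte are hypothetical, and the
second function assumes a flat address space in which a uintptr_t can
be converted back to a valid pointer (implementation-defined in C).

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256-entry translation table.  The 256-byte alignment makes
   the low byte of its address zero, which the second scheme relies on. */
static alignas(256) uint8_t xlat_table[256];

/* Scheme 1: the input byte is zero-extended and used as an index
   (base + index).  No alignment is required of the table. */
static void translate_indexed(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = xlat_table[src[i]];
}

/* Scheme 2: the input byte replaces the low byte of the table address,
   mirroring "movb (%esi),%dl; movb (%edx),%bl". */
static void translate_lowbyte(const uint8_t *src, uint8_t *dst, size_t n)
{
    uintptr_t base = (uintptr_t)xlat_table;

    /* Assumption: the table really is 256-byte aligned. */
    assert((base & 0xFF) == 0);

    for (size_t i = 0; i < n; i++) {
        uintptr_t addr = (base & ~(uintptr_t)0xFF) | src[i];
        dst[i] = *(const uint8_t *)addr;
    }
}
```

Both functions produce the same output; the second simply trades the
index addition for an alignment constraint on the table.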

If you're going to bother to code in assembly, you might as well do it
right and write the fastest possible code.  The major win for hand
coding this loop is the instruction scheduling, since gcc does not yet
have any clue about instruction scheduling on the Pentium.

    Grzegorz> With such a beast any code handcrafting is a waste of
    Grzegorz> time. Note how much time you spend writing inline code,
    Grzegorz> and count how much time it can save during execution...

I mostly agree; writing in assembly is *almost* always a lose.  It
takes more time to write, which means you spend less time optimizing
other parts of your program and adding functionality.  It's harder to
debug and non-portable.

Exceptions should only be made for *exceptionally* time-critical inner
loops for programs where performance matters, and even then I write C
versions of the same thing and #ifdef i386... the assembly.  The code
I bothered to write in assembly can be used by Executor to translate
the entire screen many times per second, so it has to be *fast*.
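
The "C version plus #ifdef'd assembly" arrangement might look roughly
like the sketch below.  This is a hypothetical illustration, not
Executor's code: translate_byte and translate_block are made-up names,
the guard uses gcc's __i386__ macro, and the one-instruction asm only
shows where hand-written code would go.  A real hand-coded version
would cover the whole loop, since instruction scheduling is where the
win is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: translate one byte through a 256-entry table. */
static inline uint8_t translate_byte(const uint8_t *table, uint8_t in)
{
#if defined(__GNUC__) && defined(__i386__)
    /* i386-only path, using gcc extended asm. */
    uint32_t index = in;   /* zero-extended so it can serve as an index */
    uint8_t out;
    __asm__ ("movb (%1,%2),%0"
             : "=q" (out)                 /* byte-addressable register */
             : "r" (table), "r" (index)
             : "memory");                 /* the asm reads from memory */
    return out;
#else
    /* Portable C fallback, used on every other target. */
    return table[in];
#endif
}

/* Translate n bytes from src to dst through the given table. */
static void translate_block(const uint8_t *table, const uint8_t *src,
                            uint8_t *dst, size_t n)
{
    assert(table != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = translate_byte(table, src[i]);
}
```

Keeping the C path around means the program still builds and runs
elsewhere, and it gives you something to test the assembly against.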

However, by comparison, our dynamically compiling 68040 emulator is
entirely written in C!  Many CPU emulators you see out there are
written in assembly, but they use stupid algorithms which let them get
beat by better emulators written in C.

-Mat