Message-Id: <m0zvMvM-000S22C@inti.gov.ar>
Comments: Authenticated sender is <salvador AT natacha DOT inti DOT gov DOT ar>
From: "Salvador Eduardo Tropea (SET)" <salvador AT inti DOT gov DOT ar>
Organization: INTI
To: vcarlos35 AT juno DOT com, djgpp AT delorie DOT com
Date: Wed, 30 Dec 1998 11:52:00 +0000
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-transfer-encoding: 7BIT
Subject: Re: pairable instructions much faster than the string operation
In-reply-to: <19981230.090441.5903.0.vcarlos35@juno.com>
X-mailer: Pegasus Mail for Windows (v2.54)
Reply-To: djgpp AT delorie DOT com

vcarlos35 AT juno DOT com wrote:

> 
> On Wed, 30 Dec 1998 13:15:25 +0100 Christian Hofrichter
> <ChristianHofrichter AT gmx DOT de> writes:
> >For a long time I believed that string operations (rep stosl; rep
> >movsl) were the fastest way to write to memory blocks, until I heard
> >that a Pentium can execute two instructions simultaneously. So I
> >realized that there are better methods to move memory blocks!
> >
> >"rep stosl": takes 3 clock cycles per dword on a Pentium
> >
> >
> >asm("1:\n\t"
> >       "movl   %%eax,(%%ebx)\n\t"   /* store; pairable in U-pipe */
> >       "addl   $4,%%ebx\n\t"        /* pairable in V-pipe */
> >       "decl   %%ecx\n\t"           /* pairable in U-pipe */
> >       "jnz    1b"                  /* pairable in V-pipe */
> >       :
> >       : "a"(55 /* any value */), "c"((40*1024*1024)>>2), "b"(memory)
> >       : "%ecx", "%ebx");
> >This takes only 2 clock cycles per iteration!
> >
> >
> >To test that, I allocated a buffer of 40 MB. First I used memset; it
> >took 690000 microseconds to fill the memory block.
> >Then I wrote it in assembler (just to be sure) with stosl and it took
> >the same time (how surprising).
> >And then I wrote the code above, and now it took only approximately
> >426000 microseconds to fill the memory block!
> >That is approximately the same ratio as 3 clock cycles to 2 clock
> >cycles.
> >
> >So how about a new optimization switch in djgpp, called pairable
> >instructions? After all, it can often double the speed of the
> >program. It can also be used to improve graphics performance, can't it?
> >
> 
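
For anyone who wants to repeat this kind of comparison, here is a minimal
sketch of a timing harness in portable C. It is not the code used for the
numbers quoted above: fill_loop, fill_memset and time_fill_us are made-up
names, and clock() is assumed as the timer, which measures CPU time and may
be much coarser than the timer behind the quoted figures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical common signature for the fill routines under test. */
typedef void (*fill_fn)(uint32_t *dst, uint32_t value, size_t count);

/* Plain C version of the store / add / dec / jnz loop:
   one 32-bit store per iteration. */
static void fill_loop(uint32_t *dst, uint32_t value, size_t count)
{
    while (count--)
        *dst++ = value;
}

/* memset-based fill. memset writes bytes, so only the low byte of
   `value` is used and it is repeated in every byte of the buffer. */
static void fill_memset(uint32_t *dst, uint32_t value, size_t count)
{
    memset(dst, (int)(value & 0xffu), count * sizeof *dst);
}

/* Run one fill routine and return the elapsed CPU time in microseconds. */
static double time_fill_us(fill_fn fn, uint32_t *dst, uint32_t value,
                           size_t count)
{
    clock_t start, end;

    assert(fn != NULL && dst != NULL);
    start = clock();
    fn(dst, value, count);
    end = clock();
    return (double)(end - start) * 1e6 / (double)CLOCKS_PER_SEC;
}
```

To mirror the test described above, allocate 40*1024*1024 bytes with malloc,
call time_fill_us once with each routine using a count of (40*1024*1024)>>2,
and print the two results. Read the buffer afterwards so the compiler cannot
drop the stores.
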
> AFAIK, having a compiler automatically pair instructions (especially one
> such as gcc which runs on a wide variety of platforms) would pretty much
> be an impossible task. 

Why?

> Instruction pairing rules are complicated and dependent on the CPU to 
> a great extent. 

Assembler code generation *is* complicated and dependent on the CPU.

> For example, on a 6th generation CPU, your
> code is not optimal because the increased register dependencies make
> it difficult for the out-of-order core to extract maximum parallelism
> from your code. Additionally, you have to worry about increased
> aggregate opcode size and mispredicted branches.

So what? These tasks belong to the optimizer, and an optimizer that doesn't
do them is just crap. RISC processors need this kind of scheduling, and most
of the code for them is written in C, not assembler.

On the djgpp side, the only thing we can do is provide some conditionally
compiled code in libc to speed up things like memset/memcpy. The PGCC group
is working on a gcc that generates code optimized for Pentiums.
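
A minimal sketch of what such conditionally compiled code could look like.
This is not djgpp's actual libc source: FILL_USE_PAIRED_LOOP and fill_dwords
are made-up names used only for illustration. The inline asm path is compiled
only by GCC on 32-bit x86 when that macro is defined; everything else falls
back to a plain C loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `count` 32-bit words at `dst` with `value`.
   FILL_USE_PAIRED_LOOP is a hypothetical build-time macro, not a real
   djgpp or gcc option. */
static void fill_dwords(uint32_t *dst, uint32_t value, size_t count)
{
    assert(dst != NULL || count == 0);

#if defined(FILL_USE_PAIRED_LOOP) && defined(__GNUC__) && defined(__i386__)
    if (count == 0)
        return;                 /* the loop below runs at least once */
    __asm__ volatile(
        "1:\n\t"
        "movl   %2,(%0)\n\t"    /* store value at dst */
        "addl   $4,%0\n\t"      /* advance dst by one dword */
        "decl   %1\n\t"         /* one dword less to go */
        "jnz    1b"             /* loop until count reaches zero */
        : "+r"(dst), "+r"(count) /* both are read and modified */
        : "r"(value)
        : "memory", "cc");
#else
    /* Portable fallback: let the compiler schedule the loop. */
    while (count--)
        *dst++ = value;
#endif
}
```

Building with -DFILL_USE_PAIRED_LOOP selects the hand-written loop on a
matching target; without it the same source still compiles everywhere, which
is the point of doing this with conditional compilation.
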

SET
------------------------------------ 0 --------------------------------
Visit my home page: http://set-soft.home.ml.org/
or
http://www.geocities.com/SiliconValley/Vista/6552/
Salvador Eduardo Tropea (SET). (Electronics Engineer)
Alternative e-mail: set-soft AT usa DOT net set AT computer DOT org
ICQ: 2951574
Address: Curapaligue 2124, Caseros, 3 de Febrero
Buenos Aires, (1678), ARGENTINA
TE: +(541) 759 0013