Message-Id:
Comments: Authenticated sender is
From: "Salvador Eduardo Tropea (SET)"
Organization: INTI
To: vcarlos35 AT juno DOT com, djgpp AT delorie DOT com
Date: Wed, 30 Dec 1998 11:52:00 +0000
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-transfer-encoding: 7BIT
Subject: Re: pairable instructions much faster than the string operation
In-reply-to: <19981230.090441.5903.0.vcarlos35@juno.com>
X-mailer: Pegasus Mail for Windows (v2.54)
Reply-To: djgpp AT delorie DOT com

vcarlos35 AT juno DOT com wrote:
>
> On Wed, 30 Dec 1998 13:15:25 +0100 Christian Hofrichter writes:
> >For a long time I believed that string operations (rep stosl; rep
> >movsl) were the fastest methods to write to memory blocks until I
> >heard that a Pentium can execute two instructions simultaneously.
> >So I realized that there are better methods to move memory blocks !
> >
> >" rep stosl " : takes 3 clock cycles on a Pentium
> >
> >
> >asm("1:\n\t"
> >    "movl (%%ebx),%%eax\n\t"   /*pairable in U-pipe */
> >    "addl $4,%%ebx\n\t"        /*pairable in V-pipe */
> >    "decl %%ecx\n\t"           /*pairable in U-pipe */
> >    "jnz 1b":                  /*pairable in V-pipe */
> >    :"a"(55 /*any value*/),"c"((40*1024*1024)>>2),"b"(memory)
> >    :"%ecx","%ebx");
> >This takes only 2 clock cycles !
> >
> >
> >To test that, I allocated a buffer of 40 Mb. First I used memset, it
> >took 690000 microseconds to fill the memory-block.
> >Then I wrote it in assembler ( just to be sure) with stosl and it took
> >the same time (how surprising ).
> >And then I wrote the code above and now it took only approximately
> >426000 microseconds to fill the memory-block !!
> >That is approximately the same ratio as 3 clock cycles to 2 clock
> >cycles.
> >
> >So how about a new optimization-switch in djgpp, called pairable
> >instructions ? After all it can often double the speed of the
> >program. It can also be used to improve graphic-performance, can't it ?
> >
>
> AFAIK, having a compiler automatically pair instructions (especially one
> such as gcc which runs on a wide variety of platforms) would pretty much
> be an impossible task.

Why?

> Instruction pairing rules are complicated and dependent on the CPU to
> a great extent.

Assembler code generation *is* complicated and dependent on the CPU.

> For example, on a 6th generation CPU, your
> code is not optimal because the increased register dependencies make
> it difficult for the out-of-order core to extract maximum parallelism
> from your code. Additionally, you have to worry about increased
> aggregate opcode size and mispredicted branches.

So what? These tasks belong to the optimizer, and an optimizer that
doesn't do them is just crap. RISC processors need it, and most code is
written in C, not assembler.

From the side of djgpp the only thing we can do is provide some
conditionally compiled code in libc to speed up things like
memset/memcpy. The PGCC group works to make a gcc that generates code
optimized for Pentiums.

SET
------------------------------------ 0 --------------------------------
Visit my home page: http://set-soft.home.ml.org/
or
http://www.geocities.com/SiliconValley/Vista/6552/
Salvador Eduardo Tropea (SET). (Electronics Engineer)
Alternative e-mail: set-soft AT usa DOT net set AT computer DOT org
ICQ: 2951574
Address: Curapaligue 2124, Caseros, 3 de Febrero
Buenos Aires, (1678), ARGENTINA
TE: +(541) 759 0013
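
As a rough illustration of the "conditionally compiled code in libc"
idea from the reply, here is a minimal sketch in plain C. The macro
FILL_UNROLLED and the function fill_words are hypothetical names made
up for this example; they are not part of djgpp's libc or of gcc.
Whether the unrolled variant is actually faster depends on the CPU and
the compiler, so it should be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FILL_UNROLLED is a hypothetical build switch (e.g. -DFILL_UNROLLED),
   not a real djgpp or gcc option. */
#ifdef FILL_UNROLLED

/* Unrolled variant: four independent stores per iteration, so the
   counter update and branch are amortized over four words. */
void fill_words(uint32_t *dst, uint32_t value, size_t count)
{
    size_t i = 0;

    assert(dst != NULL || count == 0);
    for (; i + 4 <= count; i += 4) {
        dst[i]     = value;
        dst[i + 1] = value;
        dst[i + 2] = value;
        dst[i + 3] = value;
    }
    /* Store the remaining 0..3 words. */
    for (; i < count; i++)
        dst[i] = value;
}

#else

/* Default variant: a plain loop; instruction selection and scheduling
   are left entirely to the compiler's optimizer. */
void fill_words(uint32_t *dst, uint32_t value, size_t count)
{
    size_t i;

    assert(dst != NULL || count == 0);
    for (i = 0; i < count; i++)
        dst[i] = value;
}

#endif
```

Both variants have the same interface and the same result, so callers
do not change; only the build flag selects which body gets compiled.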