Mail Archives: djgpp/1996/03/16/18:17:45
In article <31483E82 DOT 7EEB AT i-link DOT net>,
Brad Burgan <bradtech AT i-link DOT net> wrote:
>From what I have read, that would not be a good thing. Intel has been designing
>their chips more and more towards the simple operands and have not paid much
>attention to the string operands, in order to make it easier for compilers
>to operate. I will write a test program and run on 386/486/Pent and see if
>MOVS is faster than MOV and J?, but I think that might be a little off topic, so
>if someone could email me where to send this report? It will do test in 16-bit
>and 32-bit.
With all this talk about memory copy speeds, I decided to write up
a little program on my P90 to see what memory copy algorithm was fastest.
(I use EMX under OS/2, but from what I understand this should compile
under DJGPP just fine.)
The results I get are quite surprising, and not at all what I would have
expected given Intel's document on optimization. According to Intel,
the fastest algorithm for moving memory should be the "ld ld st st" method.
(See the code at the end of this message for a look at how it works.)
In fact, using the floating point unit for the transfer was the fastest
method, even though Intel says "Moving a floating point memory to memory should
be done by integer moves instead of doing fld-fstp." The other surprise
was that there wasn't much difference between 64-bit-aligned and 32-bit-aligned
block moves. In fact the "ld ld st st" method was faster for 32-bit-aligned
blocks.
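For anyone who doesn't read AT&T assembly, the difference between the
"ld st" and "ld ld st st" orderings can be sketched in plain C. This is only
an illustration: the names copy_ldst and copy_ldldstst are made up here, and
an optimizing compiler is free to reorder the loads and stores, so it should
not be read as a reproduction of the timed routines.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */

/* "ld st": each element is loaded and then stored straight away. */
void copy_ldst(const int *src, int *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        int a = src[i];          /* ld */
        dst[i] = a;              /* st */
    }
}

/* "ld ld st st": two loads are issued back to back, then two stores. */
void copy_ldldstst(const int *src, int *dst, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int a = src[i];          /* ld */
        int b = src[i + 1];      /* ld */
        dst[i] = a;              /* st */
        dst[i + 1] = b;          /* st */
    }
    if (i < n)                   /* odd count: one element left over */
        dst[i] = src[i];
}
```

Both functions copy n ints from src to dst; only the order of the memory
operations inside the loop differs.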
The results and the code are below. I hope that someone will look at it
to make sure I did it right. I'd also like to see results from a 386, 486,
and a Pentium Pro.
------------------------------------------------------------------------------
64K aligned blocks using _tmalloc() (MB/sec)
rep movsl     ld st         ld ld st st   C code        fildq fistpq
--------------------------------------------------------------------
32.206119     34.782609     31.796502     26.041667     38.986355
33.500838     33.557047     32.948929     26.773762     41.067762
32.206119     32.467532     31.847134     26.007802     39.062500
33.167496     34.782609     32.679739     26.560425     40.733198
34.782609     32.520325     34.013605     27.137042     42.918455
--------------------------------------------------------------------
33.2          33.6          32.6          26.5          40.8
4-byte aligned blocks using malloc() (MB/sec)
rep movsl     ld st         ld ld st st   C code        fildq fistpq
--------------------------------------------------------------------
31.695721     33.500838     33.003300     26.350461     37.174721
31.250000     31.055901     32.679739     26.212320     37.243948
34.602076     33.167496     36.166365     27.932961     43.290043
32.840722     33.670034     34.246575     27.210884     39.761431
31.695721     33.500838     33.333333     26.455026     38.314176
--------------------------------------------------------------------
32.4          33.0          33.9          26.8          39.2
-------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>   /* malloc() */
#include <memory.h>
#include <time.h>
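static inline void copy1(int *p1, int *p2, int n)
{
    /* String move: copy n dwords from (%esi) to (%edi).
       All three registers are modified, so they are in/out operands. */
    asm volatile("rep\n\t"
                 "movsl"
                 : "+S" (p1), "+D" (p2), "+c" (n)
                 :
                 : "memory");
}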
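static inline void copy2(int *p1, int *p2, int n)
{
    /* "ld st": one load and one store per iteration, from the top down. */
    asm volatile("decl %[n]\n\t"
                 "jl 1f\n"
                 "0:\n\t"
                 "movl (%[src],%[n],4),%%ebx\n\t"
                 "movl %%ebx,(%[dst],%[n],4)\n\t"
                 "decl %[n]\n\t"
                 "jge 0b\n"
                 "1:"
                 : [n] "+r" (n)
                 : [src] "r" (p1), [dst] "r" (p2)
                 : "ebx", "memory", "cc");
}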
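static inline void copy3(int *p1, int *p2, int n)
{
    /* "ld ld st st": two loads followed by two stores per iteration. */
    asm volatile("testl $1,%[n]\n\t"      /* odd count? */
                 "jz 0f\n\t"
                 "decl %[n]\n\t"          /* copy the last dword first */
                 "movl (%[src],%[n],4),%%ebx\n\t"
                 "movl %%ebx,(%[dst],%[n],4)\n"
                 "0:\n\t"
                 "shrl $1,%[n]\n\t"       /* n = number of dword pairs */
                 "decl %[n]\n\t"
                 "jl 2f\n"
                 "1:\n\t"
                 "movl (%[src],%[n],8),%%ebx\n\t"
                 "movl 4(%[src],%[n],8),%%eax\n\t"
                 "movl %%ebx,(%[dst],%[n],8)\n\t"
                 "movl %%eax,4(%[dst],%[n],8)\n\t"
                 "decl %[n]\n\t"
                 "jge 1b\n"
                 "2:"
                 : [n] "+c" (n)
                 : [src] "S" (p1), [dst] "D" (p2)
                 : "ebx", "eax", "memory", "cc");
}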
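static inline void copy4(int *p1, int *p2, int n)
{
    register int i = n;
    register int *pp1 = p1;
    register int *pp2 = p2;

    /* Odd count: copy the last element first. */
    if (i & 1) {
        i--;
        pp2[i] = pp1[i];
    }
    /* Copy the rest two elements at a time, from the top down. */
    i = (i >> 1);
    while (i) {
        i--;
        pp2[i*2] = pp1[i*2];
        pp2[i*2+1] = pp1[i*2+1];
    }
}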
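static inline void copy5(int *p1, int *p2, int n)
{
    /* FPU copy: move 8 bytes per iteration as a 64-bit integer
       through the x87 stack (fildq pushes, fistpq pops). */
    asm volatile("testl $1,%[n]\n\t"      /* odd count? */
                 "jz 0f\n\t"
                 "decl %[n]\n\t"          /* copy the last dword first */
                 "movl (%[src],%[n],4),%%ebx\n\t"
                 "movl %%ebx,(%[dst],%[n],4)\n"
                 "0:\n\t"
                 "shrl $1,%[n]\n\t"       /* n = number of 8-byte blocks */
                 "decl %[n]\n\t"
                 "jl 2f\n"
                 "1:\n\t"
                 "fildq (%[src],%[n],8)\n\t"
                 "fistpq (%[dst],%[n],8)\n\t"
                 "decl %[n]\n\t"
                 "jge 1b\n"
                 "2:"
                 : [n] "+c" (n)
                 : [src] "S" (p1), [dst] "D" (p2)
                 : "ebx", "memory", "cc");
}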
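int main(void)
{
    int *p1 = (int *)malloc(65536);
    int *p2 = (int *)malloc(65536);
    int n, i;
    clock_t clock0, clock1, clock2, clock3, clock4, clock5;

    if (p1 == NULL || p2 == NULL)
        return 1;
    /* Print the buffer addresses so their alignment can be checked. */
    printf("%p %p\n", (void *)p1, (void *)p2);
    /* Fill both 64K buffers; n is left holding the element count. */
    for (n = 0; n < (int)(65536/sizeof(int)); n++)
        p1[n] = p2[n] = n;
    /* Each test copies 64K 200*1024/64 = 3200 times, i.e. 200 MB. */
    clock0 = clock();
    for (i = 0; i < 200*1024/64; i++) copy1(p1, p2, n);
    clock1 = clock();
    for (i = 0; i < 200*1024/64; i++) copy2(p2, p1, n);
    clock2 = clock();
    for (i = 0; i < 200*1024/64; i++) copy3(p1, p2, n);
    clock3 = clock();
    for (i = 0; i < 200*1024/64; i++) copy4(p1, p2, n);
    clock4 = clock();
    for (i = 0; i < 200*1024/64; i++) copy5(p1, p2, n);
    clock5 = clock();
    /* MB/sec = 200 MB divided by the elapsed seconds of each test. */
    printf("%f %f %f %f %f \n", 1.0*CLOCKS_PER_SEC/(clock1-clock0)*200,
           1.0*CLOCKS_PER_SEC/(clock2-clock1)*200,
           1.0*CLOCKS_PER_SEC/(clock3-clock2)*200,
           1.0*CLOCKS_PER_SEC/(clock4-clock3)*200,
           1.0*CLOCKS_PER_SEC/(clock5-clock4)*200);
    return 0;
}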
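
The factor of 200 in the printf() comes from the loop counts: each test
copies a 64K buffer 200*1024/64 = 3200 times, which is 200 MB in total, so
200 divided by the elapsed seconds gives MB/sec. A small helper makes that
arithmetic explicit. The name mb_per_sec and its parameters are assumptions
for illustration, not part of the program above, and the zero check is an
added guard for machines fast enough that clock() does not advance during
a test.

```c
#include <time.h>

/* Hypothetical helper: throughput in MB/sec for `megabytes` copied
   between two clock() readings. Returns 0.0 when the interval is too
   short to measure. */
double mb_per_sec(double megabytes, clock_t start, clock_t end)
{
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;

    if (seconds <= 0.0)
        return 0.0;              /* clock() did not advance */
    return megabytes / seconds;
}
```

With it, the first column would be printed as mb_per_sec(200.0, clock0,
clock1). For example, the 32.206119 figure in the first table corresponds
to an elapsed time of about 6.21 seconds.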
--
Eric Korpela | An object at rest can never be
korpela AT ssl DOT berkeley DOT edu | stopped.
More info: http://www.cs.indiana.edu/finger/mofo.ssl.berkeley.edu/korpela/w