Mail Archives: djgpp/1996/03/16/18:17:45

Xref: news2.mv.net comp.os.msdos.djgpp:1884

From: korpela AT albert DOT ssl DOT berkeley DOT edu (Eric J. Korpela)

Newsgroups: comp.os.msdos.djgpp

Subject: Block Moves (Re: ASM code & Random)

Date: 16 Mar 1996 21:50:20 GMT

Organization: Cal Berkeley-- Space Sciences Lab

Lines: 190

Message-ID: <4ifd2s$t3u@agate.berkeley.edu>

References: <1996Mar5 DOT 164831 AT zipi DOT fi DOT upm DOT es> <Pine DOT SGI DOT 3 DOT 91 DOT 960310091342 DOT 26365F-100000 AT tower DOT york DOT ac DOT uk> <4i2f8q$51p AT mack DOT rt66 DOT com> <31483E82 DOT 7EEB AT i-link DOT net>

NNTP-Posting-Host: albert.ssl.berkeley.edu

Keywords: GCC pentium optimization

To: djgpp AT delorie DOT com

DJ-Gateway: from newsgroup comp.os.msdos.djgpp

In article <31483E82 DOT 7EEB AT i-link DOT net>,
Brad Burgan  <bradtech AT i-link DOT net> wrote:
>From what I have read, that would not be a good thing.  Intel has been designing
>their chips more and more towards the simple operands and have not paid much
>attention to the string operands, in order to make it easier for compilers
>to operate.  I will write a test program and run on 386/486/Pent and see if
>MOVS is faster than MOV and J?, but I think that might be a little off topic, so
>if someone could email me where to send this report? It will do test in 16-bit
>and 32-bit.


With all this talk about memory copy speeds, I decided to write up
a little program on my P90 to see what memory copy algorithm was fastest.
(I use EMX under OS/2, but from what I understand this should compile
under DJGPP just fine.)

The results I get are quite surprising, and not at all what I would have
expected given Intel's document on optimization.  According to Intel,
the fastest algorithm for moving memory should be the "ld ld st st" method.
(See the code at the end of this message for a look at how it works.)
In fact, using the floating point unit for the transfer was the fastest
method.  According to Intel, "Moving a floating point memory to memory should
be done by integer moves instead of doing fld-fstp."  The other surprise
was that there wasn't much difference between 64 bit aligned and 32 bit aligned
block moves.  In fact the "ld ld st st" method was faster for 32 bit aligned
blocks.
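
For readers who have not met the names, the difference between the "ld st"
and "ld ld st st" orderings can be sketched in plain C.  This is only an
illustration and not part of the benchmark: the function names are made up
here, and a compiler is free to reorder the loads and stores, which is why
the timed versions are written in assembly.

```c
#include <stddef.h>

/* "ld st": one load followed by one store for each word.
   (Hypothetical name, for illustration only.) */
static void copy_ld_st(const int *src, int *dst, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++) {
    int a = src[i];       /* ld */
    dst[i] = a;           /* st */
  }
}

/* "ld ld st st": two loads, then two stores, so each iteration moves
   two words.  A leftover odd word is copied on its own at the end.
   (Hypothetical name, for illustration only.) */
static void copy_ld_ld_st_st(const int *src, int *dst, size_t n)
{
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    int a = src[i];       /* ld */
    int b = src[i + 1];   /* ld */
    dst[i] = a;           /* st */
    dst[i + 1] = b;       /* st */
  }
  if (i < n)
    dst[i] = src[i];
}
```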

The results and the code are below.  I hope that someone will look at it
to make sure I did it right.  I'd also like to see results from a 386, 486,
and a Pentium Pro.
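
One way to check that a copy routine "does it right" is to compare its
output against the source with memcmp() for both even and odd word counts,
since several of the routines treat an odd trailing word specially.  The
sketch below is an assumption about how such a check might look, not
something from the benchmark itself: check_copy() and copy_words_ref() are
hypothetical names, and copy_words_ref() simply mirrors the structure of the
C version so there is something portable to run the check on.

```c
#include <stdlib.h>
#include <string.h>

/* A routine that copies n ints from src to dst, using the same
   (p1, p2, n) argument order as the copy routines in this message. */
typedef void (*copy_fn)(int *src, int *dst, int n);

/* Hypothetical stand-in: peel off an odd trailing word, then move
   two words per iteration from the top of the block down. */
static void copy_words_ref(int *src, int *dst, int n)
{
  int i = n;
  if (i & 1) {
    i--;
    dst[i] = src[i];
  }
  i >>= 1;
  while (i) {
    i--;
    dst[i*2]   = src[i*2];
    dst[i*2+1] = src[i*2+1];
  }
}

/* Returns 1 if fn copied exactly n words and left the guard word after
   the destination untouched, 0 if not, -1 if allocation failed. */
static int check_copy(copy_fn fn, int n)
{
  int *src = malloc((n + 1) * sizeof(int));
  int *dst = malloc((n + 1) * sizeof(int));
  int i, ok;

  if (!src || !dst) {
    free(src);
    free(dst);
    return -1;
  }
  for (i = 0; i <= n; i++) {
    src[i] = i * 7 + 1;   /* recognizable pattern */
    dst[i] = -1;          /* poison the destination and the guard word */
  }
  fn(src, dst, n);
  ok = memcmp(src, dst, n * sizeof(int)) == 0 && dst[n] == -1;
  free(src);
  free(dst);
  return ok;
}
```

Calling check_copy() with, say, n = 0, 1, 7 and 16384 covers the empty,
odd and even cases as well as the 64K block used for timing.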

------------------------------------------------------------------------------

64K aligned blocks using _tmalloc() (MB/sec)
rep movsl   ld st       ld ld st st C code      fildq fistpq
------------------------------------------------------------
32.206119   34.782609   31.796502   26.041667   38.986355
33.500838   33.557047   32.948929   26.773762   41.067762
32.206119   32.467532   31.847134   26.007802   39.062500
33.167496   34.782609   32.679739   26.560425   40.733198
34.782609   32.520325   34.013605   27.137042   42.918455
------------------------------------------------------------
33.2        33.6        32.6        26.5        40.8


4 byte aligned blocks using malloc() (MB/sec)
rep movsl   ld st       ld ld st st C code      fildq fistpq
------------------------------------------------------------
31.695721   33.500838   33.003300   26.350461   37.174721
31.250000   31.055901   32.679739   26.212320   37.243948
34.602076   33.167496   36.166365   27.932961   43.290043
32.840722   33.670034   34.246575   27.210884   39.761431
31.695721   33.500838   33.333333   26.455026   38.314176
------------------------------------------------------------
32.4        33.0        33.9        26.8        39.2

-------------------------------------------------------------------------------

#include <stdio.h>
#include <stdlib.h>   /* malloc() */
#include <time.h>     /* clock(), CLOCKS_PER_SEC */

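/* "rep movsl": copy n 32-bit words from p1 to p2 with the string-move
   instruction (source in %esi, destination in %edi, count in %ecx). */
inline void copy1(int *p1, int *p2, int n) 
{
  asm("
    rep
    movsl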
    "
    :
    : "S" (p1), "D" (p2), "c" (n));
}

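/* "ld st": one load and one store per iteration, working from the last
   word of the block down to the first. */
inline void copy2(int *p1, int *p2, int n)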
{
  asm("
    dec %2
    jl 1f
0:
    movl (%0,%2,4),%%ebx
    movl %%ebx,(%1,%2,4)
    dec %2
    jge 0b
1:
    "
    :
    : "r" (p1), "r" (p2), "r" (n)
    : "ebx");
}

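/* "ld ld st st": copy an odd trailing word first, then two loads followed
   by two stores per iteration (8 bytes at a time). */
inline void copy3(int *p1, int *p2, int n)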
{
  asm("
    test $1,%2
    jz 0f
    dec %2
    movl (%0,%2,4),%%ebx
    movl %%ebx,(%1,%2,4)
0:  
    shrl %2
    dec %2
    jl 2f
1:
    movl (%0,%2,8),%%ebx
    movl 4(%0,%2,8),%%eax
    movl %%ebx,(%1,%2,8)
    movl %%eax,4(%1,%2,8)
    dec %2
    jge 1b
2:  
    "
    :
    : "S" (p1), "D" (p2), "c" (n)
    : "ebx","eax");
}

/* "C code": the same two-words-per-iteration scheme, left to the compiler. */
inline void copy4(int *p1,int *p2, int n)
{
  register int i =n;
  register int *pp1 =p1;
  register int *pp2 =p2;

  if (i & 1) {
   i--;
   pp2[i]=pp1[i];
  }
  i=(i>>1);
  while (i) {
    i--;
    pp2[i*2]=pp1[i*2];
    pp2[i*2+1]=pp1[i*2+1];
  }
}

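/* "fildq fistpq": copy an odd trailing word first, then move 8 bytes per
   iteration through the FPU as a 64-bit integer load and store. */
inline void copy5(int *p1,int *p2, int n)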
{
  asm("
    test $1,%2
    jz 0f
    dec %2
    movl (%0,%2,4),%%ebx
    movl %%ebx,(%1,%2,4)
0:  
    shrl %2
    dec %2
    jl 2f
1:
    fildq (%0,%2,8)
    fistpq (%1,%2,8)
    dec %2
    jge 1b
2:  
    "
    :
    : "S" (p1), "D" (p2), "c" (n)
    : "ebx");
}

int main(void)
{
  int *p1=(int *)malloc(65536);
  int *p2=(int *)malloc(65536);
  int n,i,clock0,clock1,clock2,clock3,clock4,clock5;

  printf("%x %x\n",(int)p1,(int)p2);

  /* Fill both blocks; n ends up as the number of ints in 64K. */
  for (n=0;n<(int)(65536/sizeof(int));n++) p1[n]=p2[n]=n;

  clock0=clock();
  for (i=0;i<200*1024/64;i++) copy1(p1,p2,n);
  clock1=clock();
  for (i=0;i<200*1024/64;i++) copy2(p2,p1,n);
  clock2=clock();
  for (i=0;i<200*1024/64;i++) copy3(p1,p2,n);
  clock3=clock();
  for (i=0;i<200*1024/64;i++) copy4(p1,p2,n);
  clock4=clock();
  for (i=0;i<200*1024/64;i++) copy5(p1,p2,n);
  clock5=clock();
  printf("%f %f %f %f %f \n",1.0*CLOCKS_PER_SEC/(clock1-clock0)*200,
                             1.0*CLOCKS_PER_SEC/(clock2-clock1)*200,
                             1.0*CLOCKS_PER_SEC/(clock3-clock2)*200,
                             1.0*CLOCKS_PER_SEC/(clock4-clock3)*200,
                             1.0*CLOCKS_PER_SEC/(clock5-clock4)*200);
  return 0;
}
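
The figures printed by main() come from simple arithmetic: each timed loop
makes 200*1024/64 = 3200 passes over a 64K block, which is 200 MB in total,
and the printed value is 200 divided by the elapsed time in seconds.  A
small sketch of that calculation follows.  The helper names total_mb() and
mb_per_sec() are assumptions made for illustration and are not part of the
program above; unlike the program, the sketch returns 0 rather than dividing
by zero when no clock ticks have elapsed.

```c
#include <time.h>

/* Megabytes moved by `passes` copies of a block of `block_kb` kilobytes. */
static double total_mb(long passes, long block_kb)
{
  return (double)passes * (double)block_kb / 1024.0;
}

/* Throughput in MB/sec from two clock() readings.
   Returns 0.0 when no ticks elapsed, to avoid dividing by zero. */
static double mb_per_sec(double megabytes, clock_t start, clock_t end)
{
  double ticks = (double)(end - start);
  if (ticks <= 0.0)
    return 0.0;
  return megabytes * (double)CLOCKS_PER_SEC / ticks;
}
```

By this arithmetic a reading of 40 MB/sec corresponds to a timed loop of
5 seconds.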

-- 
Eric Korpela                        |  An object at rest can never be
korpela AT ssl DOT berkeley DOT edu            |  stopped.
<a href="http://www.cs.indiana.edu/finger/mofo.ssl.berkeley.edu/korpela/w">
Click here for more info.</a>
