www.delorie.com/archives/browse.cgi   search  
Mail Archives: djgpp/1997/02/26/12:39:54

Date: Wed, 26 Feb 1997 11:30:06 -0600 (CST)
From: "Jesse W. Bennett" <jesse AT lenny DOT dseg DOT ti DOT com>
Reply-To: Jesse Bennett <jbennett AT ti DOT com>
To: Eli Zaretskii <eliz AT is DOT elta DOT co DOT il>
cc: Jesse Bennett <jbennett AT ti DOT com>, djgpp AT delorie DOT com
Subject: Re: Netlib code [was Re: flops...]
In-Reply-To: <Pine.SUN.3.91.970223115026.1447C-100000@is>
Message-ID: <Pine.LNX.3.91.970226105830.29585A-100000@lenny.dseg.ti.com>
MIME-Version: 1.0

On Sun, 23 Feb 1997, Eli Zaretskii wrote:

> On 20 Feb 1997, Jesse Bennett wrote:

[snip]

> > I have done some benchmarking of the above matrix multiply code on a
> > DEC Alpha using the native DEC compilers.  The "pointer based" C
> > implementation was fastest, followed by the F77 code and the "array
> > based" C code.  Hence my comments.
> 
> Try it with gcc.  In most cases, it converts the array-based code to 
> pointer-based automatically, as far as I could see, and in many cases it 
> does that better than you would.

Hi Eli,

I tried this on a Linux box with gcc 2.6.3 and 2.7.2 and the results were
encouraging, but the pointer based code was still slightly faster.  When 
I looked at the generated assembly I could see that the array based 
implementation was making better use of the x86 CISC instruction set but 
the innermost instruction loop appears to have some unnecessary memory 
references (I say "appears" because I am not very familiar with the 
x86).  The test code was:

void gemm( int m, int n, int k, double **a, double **b, double **c )
{

/* C = AB + C */

   int i, j, l;
   double temp;

   for( i=0; i<m; i++ )
      for( l=0; l<k; l++ )
      {
         temp = a[i][l];
	 for( j=0; j<n; j++ )
	    c[i][j] += temp * b[l][j];
      }     
}

compiled with gcc -O2 -S gemm.c

The generated assembly for the inner loop is:

L13:
        movl (%edi),%edx
        movl (%esi),%eax
        fld %st(0)
        fmull (%eax,%ecx,8)
        faddl (%edx,%ecx,8)
        fstpl (%edx,%ecx,8)
        incl %ecx
        cmpl %ecx,12(%ebp)
        jg L13

It is not clear to me why the edx and eax registers are being reloaded 
each iteration.  Is there something about the addressing mode used by the 
fxxx instructions that is modifying these registers?  I have also tried 
(without success) to eliminate the memory reference for the loop 
counter.  It seems to me (bear in mind that I know zip about x86 
assembly) that a much better implementation would have been:

        movl (%edi),%edx
        movl (%esi),%eax
        movl 12(%ebp),%ebx
L13:
        fld %st(0)
        fmull (%eax,%ecx,8)
        faddl (%edx,%ecx,8)
        fstpl (%edx,%ecx,8)
        incl %ecx
        cmpl %ecx,%ebx
        jg L13

requiring only 3 memory references in the inner loop (executed O(n^3)
times) instead of 6. 

Am I missing something here?

Best Regards,
Jesse

- Raw text -


  webmaster     delorie software   privacy  
  Copyright © 2019   by DJ Delorie     Updated Jul 2019