From: Dave Love
Newsgroups: comp.os.msdos.djgpp
Subject: Re: Netlib code [was Re: flops...]
Date: 28 Feb 1997 15:30:52 +0000
Organization: Daresbury Laboratory, Warrington WA4 4AD, UK
Message-ID:
References:
NNTP-Posting-Host: djlvig.dl.ac.uk
Lines: 80
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

>>>>> "Jesse" == Jesse W Bennett writes:

 Jesse> void gemm( int m, int n, int k, double **a, double **b, double **c )
 Jesse> {
 Jesse>   /* C = AB + C */
 Jesse>   int i, j, l;
 Jesse>   double temp;

 Jesse>   for( i=0; i<m; i++ )
 Jesse>     for( l=0; l<k; l++ )
 Jesse>     {
 Jesse>       temp = a[i][l];
 Jesse>       for( j=0; j<n; j++ )
 Jesse>         c[i][j] += temp * b[l][j];
 Jesse>     }
 Jesse> }

 Jesse> compiled with gcc -O2 -S gemm.c

 Jesse> The generated assembly for the inner loop is:

 Jesse> L13:
 Jesse>         movl (%edi),%edx
 Jesse>         movl (%esi),%eax
 Jesse>         fld %st(0)
 Jesse>         fmull (%eax,%ecx,8)
 Jesse>         faddl (%edx,%ecx,8)
 Jesse>         fstpl (%edx,%ecx,8)
 Jesse>         incl %ecx
 Jesse>         cmpl %ecx,12(%ebp)
 Jesse>         jg L13

 Jesse> It is not clear to me why the edx and eax registers are being reloaded
 Jesse> each iteration.

I can't show DJGPP G77 output at present, but I assume the generated code
would be the same as this.  (On 586 and especially on ppro, the speed
will actually be determined by how your double words happen to get
aligned, sigh.)

$ cat a.f
      subroutine gemm(m, n, k, a, b, c)
      integer i,m,n,k,l,j
      double precision a(n,m), b(n,m), c(n,m)
      do i=1,m                  ! poor for illustration only
         do l=1,k
            do j=1,n
               c(j,i) = c(j,i) + a(l,i)*b(j,l)
            end do
         end do
      end do
      end
$ g77 -S -O2 -v a.f
g77 version 0.5.19.1
 gcc -S -O2 -v -xf77 a.f
Reading specs from /usr/lib/gcc-lib/i486-unknown-linux/2.7.2.1.f.1/specs
gcc version 2.7.2.1.f.1
 /usr/lib/gcc-lib/i486-unknown-linux/2.7.2.1.f.1/f771 a.f -fset-g77-defaults -quiet -dumpbase a.f -O2 -version -fversion -o a.s
GNU F77 version 2.7.2.1.f.1 (i386 Linux/ELF) compiled by GNU C version 2.7.2.1.f.1.
GNU Fortran Front End version 0.5.19.1 compiled: Feb 1 1997 19:51:03
$ more +/L13 a.s
...skipping
        addl 24(%ebp),%eax
        .align 4
.L13:
        movl -24(%ebp),%edi
        fldl (%edi)
        fmull (%eax)
        faddl (%edx)
        fstpl (%edx)
        addl $8,%eax
        addl $8,%edx
        decl %ecx
        jns .L13
.L8:
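
On the question of why the C inner loop reloads %edx and %eax each time: one possible explanation, offered here as an assumption and not something established in this thread, is that with double ** arguments the compiler may be unable to prove that the store to c[i][j] leaves the row pointers c[i] and b[l] unchanged, so it fetches them again on every pass. A minimal sketch of the usual workaround is to copy the row pointers into locals before the inner loop. The name gemm_hoisted and the local names ai, bl and ci are hypothetical; the arithmetic is the same C = AB + C as in the quoted routine.

```c
/* Sketch only: C = A*B + C with A m-by-k, B k-by-n, C m-by-n,
 * each stored as an array of row pointers, as in the quoted gemm().
 * gemm_hoisted, ai, bl and ci are hypothetical names. */
void gemm_hoisted(int m, int n, int k, double **a, double **b, double **c)
{
    int i, j, l;

    for (i = 0; i < m; i++) {
        const double *ai = a[i];   /* row i of A, loaded once per i */
        double *ci = c[i];         /* row i of C, loaded once per i */

        for (l = 0; l < k; l++) {
            const double temp = ai[l];
            const double *bl = b[l];   /* row l of B, loaded once per l */

            /* The inner loop now indexes only through locals, so the
             * compiler is free to keep ci and bl in registers. */
            for (j = 0; j < n; j++)
                ci[j] += temp * bl[j];
        }
    }
}
```

Whether this actually removes the two movl instructions depends on the compiler version and flags, so it is worth checking the -S output again after making the change.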