Mail Archives: djgpp/1996/06/14/08:19:13

Xref: news2.mv.net comp.os.msdos.djgpp:4955
From: brennan AT mack DOT rt66 DOT com (Brennan "Bas" Underwood)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: Speed optimization: memcpy() or for loop ??
Date: 13 Jun 1996 14:02:09 -0600
Organization: None, eh?
Lines: 40
Message-ID: <4pps41$dnp@mack.rt66.com>
References: <4pmlrp$p7u AT crc-news DOT doc DOT ca> <4pmscu$nrt AT rs18 DOT hrz DOT th-darmstadt DOT de>
NNTP-Posting-Host: mack.rt66.com
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

In article <4pmscu$nrt AT rs18 DOT hrz DOT th-darmstadt DOT de>,
Alexander Lehmann <lehmann AT mathematik DOT th-darmstadt DOT de> wrote:
>Richard Young (richard DOT young AT crc DOT doc DOT ca) wrote:
>: A question for the optimization experts:
>: For moving data, is it faster to use
>: a) memcpy(x,y,n*sizeof(x[0])) 
>: or 
>: b) for (i = 0; i < n; i++) x[i] = y[i];
>: or are they basically the same speed.
>: With C++ is it better code practice to use b) over a)?
>
>(a) uses the function dj_movedata, which will use the repeat
>instruction to copy 4 byte values, which should be pretty fast.
>
>(b) requires a lot of address calculations, unless the compiler is
>very smart (I don't think so), but it can be sped up a bit at least
>(assuming that x and y are of type foo):

It's smart enough to use i as an offset to x and y, but that'll cost at least
1 cycle/instruction on 486 and below. 
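
For reference, here are the two variants from the original question written out as functions. This is only a sketch: the function names copy_with_memcpy and copy_with_loop are made up for illustration, the element type is assumed to be int, and the two regions are assumed not to overlap (memcpy requires that).

```c
#include <stddef.h>
#include <string.h>

/* (a) hand the whole block to the library routine */
void copy_with_memcpy(int *x, const int *y, size_t n)
{
    memcpy(x, y, n * sizeof x[0]);
}

/* (b) element-by-element loop; the compiler can use i as an
   offset into both arrays instead of recomputing addresses */
void copy_with_loop(int *x, const int *y, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        x[i] = y[i];
}
```

Both leave the same bytes in x; which one is faster depends on the compiler, the optimization flags and the CPU, so time them on your own target.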

The fastest way to move dword-aligned memory on the 486 and below is rep movsl.
On Pentium and later you can beat it under the right circumstances, but it's very
difficult. I saw one trick that used many pushes/pops after setting up
esp, but it still didn't show a major gain over rep movsl. rep movsl does one
dword/cycle ('cause it uses both pipes internally). Very hard to beat.
Check out http://www.rt66.com/~brennan/djgpp/bgtia.html for a couple of
rep movsl inline assembly macros, *iff* you are doing dword-aligned stuff.
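
A minimal sketch of what such a rep movsl wrapper can look like with GCC extended inline assembly. This is not the macro from the page above; the name copy_dwords is hypothetical, the count is in dwords (not bytes), and on anything that is not GCC-compatible x86 the sketch simply falls back to memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `count` 32-bit words from src to dst (regions must not overlap). */
void copy_dwords(uint32_t *dst, const uint32_t *src, size_t count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* edi/rdi = destination, esi/rsi = source, ecx/rcx = dword count.
       cld makes sure the copy runs upward in memory. */
    __asm__ __volatile__(
        "cld\n\t"
        "rep movsl"
        : "+D"(dst), "+S"(src), "+c"(count)
        :
        : "memory", "cc");
#else
    /* Assumption: no x86 inline asm available, so use the library. */
    memcpy(dst, src, count * sizeof *dst);
#endif
}
```

The "+D", "+S" and "+c" constraints tell the compiler that all three registers are read and modified by the instruction, and the "memory" clobber keeps it from caching the destination contents across the asm.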

You *can* get a major increase by doing a tight loop of loads/stores with
the FPU, since it can work with 8-byte long longs, but you'll be in for
an interesting time if you happen to load any of the FPU error bit patterns,
e.g. Not-a-Number, Divide by Zero, or something to that effect.
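
A rough sketch of the FPU idea, assuming GCC-style inline assembly on x86; the name copy_qwords_fpu is hypothetical and nothing here is tuned. It uses the integer load/store pair (fildll/fistpll), which as far as I know round-trips every 64-bit pattern exactly. Pulling raw data through the floating-point load (fldl) instead is where NaN-like patterns can get altered or raise an exception. Non-x86 builds fall back to memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `count` 64-bit words through the x87 stack, 8 bytes per load/store. */
void copy_qwords_fpu(uint64_t *dst, const uint64_t *src, size_t count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    size_t i;

    for (i = 0; i < count; i++) {
        /* fildll pushes a 64-bit integer onto the FPU stack,
           fistpll pops it back out to memory. */
        __asm__ __volatile__(
            "fildll %1\n\t"
            "fistpll %0"
            : "=m"(dst[i])
            : "m"(src[i])
            : "st");
    }
#else
    /* Assumption: no x87 available, so use the library. */
    memcpy(dst, src, count * sizeof *dst);
#endif
}
```

Whether this beats rep movsl on a given chip is something to measure, not assume.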

Read comp.lang.asm.x86; some really good performance coders hang out there.


--Brennan
-- 
brennan AT rt66 DOT com  |  "He say you Brade Runna!"
