Date: Wed, 22 Mar 1995 15:07:24 +0900
From: raraki AT human DOT waseda DOT ac DOT jp (Ryuichiro Araki)
To: A DOT APPLEYARD AT fs2 DOT mt DOT umist DOT ac DOT uk, turnbull AT shako DOT sk DOT tsukuba DOT ac DOT jp
Subject: Re: A quick way to copy n bytes
Cc: DJGPP AT sun DOT soe DOT clarkson DOT edu

>>>>> Stephen Turnbull <turnbull AT shako DOT sk DOT tsukuba DOT ac DOT jp> writes:

>   /*-----*//* fast move s[0:n-1]=t[0:n-1] */
>   void str_cpy(void*s,void*t,int n){
>   asm("pushl %esi"); asm("pushl %edi"); asm("cld");
>   asm("movl 8(%ebp),%edi"); asm("movl 12(%ebp),%esi");
>   asm("movl 16(%ebp),%ecx"); asm("rep"); asm("movsb"); asm("popl %edi");
>   asm("popl %esi");}
>   /*-----*/
>   /* This has given me good service and should run a bit quicker than a C */
>   /* version, as it uses the `rep' repeat instruction */
>
>This looks remarkably like memcpy.s in the standard DJGPP library, but
>it doesn't take advantage of a couple of optimizations included in the
>DJGPP distribution version.  Why are we reinventing the wheel?

When optimizations are enabled, gcc outputs the following inline code for 
memcpy(void *dest, const void *src, size_t cnt) if cnt is a constant:
-----------------------------------------------------------------------
#include <memory.h>
#define COUNT 47

void foo(void){
    char dest[COUNT], src[COUNT];

    memcpy(dest, src, COUNT);
}

	.file	"constant.c"
gcc2_compiled.:
___gnu_compiled_c:
.text
	.align 4
.globl _foo
_foo:
	subl $96,%esp
	pushl %edi
	pushl %esi
	leal 56(%esp),%edi
	leal 8(%esp),%esi
	cld
	movl $11,%ecx
	rep
	movsl
	movsw
	movsb
	popl %esi
	popl %edi
	addl $96,%esp
	ret
-----------------------------------------------------------------------
I think this output is smart enough: rep movsl copies 44 of the 47 bytes
as eleven 4-byte moves, then movsw and movsb copy the remaining 3.
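
The split of the constant count can be sketched in C as follows.  This
is only an illustration of the arithmetic (47 = 11*4 + 2 + 1), not what
gcc runs internally; the struct move_plan and the function split_count
are hypothetical names made up for this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: how a constant byte count splits into
   string-move instructions, as in the listing above. */
struct move_plan {
    size_t dwords;  /* repetitions of movsl (4 bytes each) */
    size_t words;   /* 0 or 1 movsw (2 bytes)              */
    size_t bytes;   /* 0 or 1 movsb (1 byte)               */
};

struct move_plan split_count(size_t cnt)
{
    struct move_plan p;

    p.dwords = cnt / 4;             /* loaded into %ecx for rep movsl */
    p.words  = (cnt & 2) ? 1 : 0;   /* bit 1 set: one movsw           */
    p.bytes  = (cnt & 1) ? 1 : 0;   /* bit 0 set: one movsb           */
    return p;
}
```

For cnt = 47, split_count gives dwords = 11, words = 1, bytes = 1, which
matches the movl $11,%ecx / rep movsl / movsw / movsb sequence.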

But if cnt is not a constant but a variable:
-----------------------------------------------------------------------
#include <memory.h>
#define COUNT 47

void foo(void){
    char dest[COUNT], src[COUNT];
    int cnt = COUNT;

    memcpy(dest, src, cnt);
}

	.file	"variable.c"
gcc2_compiled.:
___gnu_compiled_c:
.text
	.align 4
.globl _foo
_foo:
	subl $96,%esp
	leal 48(%esp),%edx
	movl %esp,%eax
	pushl $47
	pushl %eax
	pushl %edx
	call _memcpy
	addl $12,%esp
	addl $96,%esp
	ret
-----------------------------------------------------------------------
memcpy() is no longer compiled as inline code; gcc emits a call to the
library routine instead.

This means that optimizing the memory/string functions in the standard
library is still an effective way to improve the performance of
executables built with djgpp, even though the current version of gcc can
generate smart inline code for such functions when the number of bytes
to be processed is a constant.

The following code is my own memcpy() written in gas:
---------------------------------------------------------------
.data
.text

.globl	_memcpy

	.align 4,144
_memcpy:
	pushl	%esi
	movl	%edi,%edx

	movl	 8(%esp),%edi	/* dst */
	movl	12(%esp),%esi	/* src */
	movl	16(%esp),%ecx	/* cnt */

	movl	%ecx,%eax	/* DWORD move */
	shrl	$2,%ecx		/* ecx / 4    */
	andl	$3,%eax		/* eax % 4    */
	
	cld
	rep
	movsl

	movl	%eax,%ecx	/* copy remainder */
	rep
	movsb
	
	popl	%esi
	movl	%edx,%edi
	movl	 4(%esp),%eax	/* return value */
	ret
---------------------------------------------------------------
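
For readers who do not read gas, the same strategy can be sketched in C.
This is a rough rendering under stated assumptions, not a replacement
for the assembly: the name copy_dwords_then_bytes is hypothetical, and
each 4-byte move is written as four byte stores to stay portable, so it
shows the structure of the routine rather than its speed.

```c
#include <stddef.h>

/* Hypothetical C rendering of the gas routine above; the name
   copy_dwords_then_bytes is an assumption made for this sketch. */
void *copy_dwords_then_bytes(void *dst, const void *src, size_t cnt)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t dwords = cnt >> 2;   /* shrl $2,%ecx : number of 4-byte moves */
    size_t rest   = cnt & 3;    /* andl $3,%eax : leftover 0..3 bytes    */

    while (dwords--) {          /* stands in for "rep movsl" */
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
        d[3] = s[3];
        d += 4;
        s += 4;
    }
    while (rest--)              /* stands in for "rep movsb" */
        *d++ = *s++;

    return dst;                 /* like memcpy(), return the destination */
}
```

As in the assembly, the copy always runs forward from the lowest
address, and the return value is the original destination pointer.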

This code uses movsl, and is thus somewhat faster than the original memcpy.s.
The drawback is that it might not work correctly if the objects overlap.
But in ANSI C, memcpy() does not guarantee correct behavior with
overlapping objects.  In such a case, one should use memmove() instead
(I suppose DJ's original memcpy.s is in fact memmove() code).
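
To make the overlap problem concrete, here is a small sketch.  The
function forward_copy is a hypothetical byte-wise stand-in for a copy
that always runs forward, as rep movs does after cld; it is not the
library memcpy().  Applied to a destination that starts inside the
source, it reads bytes it has already overwritten, while the standard
memmove() gives the expected result.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical forward-only copy: always walks from low to high
   addresses, like "cld; rep movsb". */
void forward_copy(char *dst, const char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Copy 6 bytes from buf to buf + 2, so source and destination overlap.
   If use_memmove is nonzero the standard memmove() is used, otherwise
   the forward-only copy above.  buf must hold at least 8 bytes. */
void shift_right_by_two(char *buf, int use_memmove)
{
    if (use_memmove)
        memmove(buf + 2, buf, 6);
    else
        forward_copy(buf + 2, buf, 6);
}
```

Starting from the string "abcdefgh", the memmove() path leaves
"ababcdef", whereas the forward-only copy leaves "abababab" because the
first two bytes are propagated over and over.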

I once sent such memory/string function sources written in gas to DJ,
long ago (possibly in fall 1991), but he didn't seem to like them.

If anybody wants the gas sources of my memory/string functions (14
functions in all), please let me know.

    ----
    raraki(Ryuichiro Araki)
    raraki AT human DOT waseda DOT ac DOT jp