Date: Wed, 22 Mar 95 00:12 MST
From: mat AT ardi DOT com (Mat Hostetter)
To: raraki AT human DOT waseda DOT ac DOT jp (Ryuichiro Araki)
Cc: A DOT APPLEYARD AT fs2 DOT mt DOT umist DOT ac DOT uk,
    turnbull AT shako DOT sk DOT tsukuba DOT ac DOT jp,
    DJGPP AT sun DOT soe DOT clarkson DOT edu
Subject: Re: A quick way to copy n bytes
References: <199503220607 DOT PAA27191 AT wutc DOT human DOT waseda DOT ac DOT jp>

raraki writes:

> The following code is my own memcpy() written in gas:
>
> ---------------------------------------------------------------
> 	.data
> 	.text
> 	.globl	_memcpy
> 	.align	4,144
> _memcpy:
> 	pushl	%esi
> 	movl	%edi,%edx
> 	movl	8(%esp),%edi	/* dst */
> 	movl	12(%esp),%esi	/* src */
> 	movl	16(%esp),%ecx	/* cnt */
> 	movl	%ecx,%eax	/* DWORD move */
> 	shrl	$2,%ecx		/* ecx / 4 */
> 	andl	$3,%eax		/* eax % 4 */
> 	cld
> 	rep
> 	movsl
> 	movl	%eax,%ecx	/* copy remainder */
> 	rep
> 	movsb
> 	popl	%esi
> 	movl	%edx,%edi
> 	movl	4(%esp),%eax	/* return value */
> 	ret

This is much better, but what if %esi and %edi are not aligned mod 4?
Every single transfer might then involve both an unaligned load and an
unaligned store, which is slow.  I fixed this in the memcpy and
movedata for the current V2 alpha.  They do movsb's until either %esi
or %edi is long-aligned before doing movsl's (and hopefully both are
aligned by then).  The code checks for small moves right away and just
uses movsb for them, skipping the alignment overhead.

For what it's worth, I also modified memset to do aligned stosl's when
possible.

-Mat
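
A portable C sketch of the strategy Mat describes (not the actual V2
alpha code, which is written with movsb/movsl): tiny moves skip the
alignment preamble entirely; larger moves copy single bytes until the
destination is long-aligned, then copy 32-bit words, then finish the
remainder byte by byte.  The function name and the small-move cutoff
of 16 bytes are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of an alignment-aware memcpy.  The threshold (16) is an
   arbitrary illustrative cutoff, not the one used in the V2 alpha. */
void *aligned_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n >= 16) {
        /* Byte moves until dst is 4-byte aligned (like movsb's). */
        while (((uintptr_t)d & 3) != 0 && n > 0) {
            *d++ = *s++;
            n--;
        }
        /* 4-byte moves (like rep movsl).  memcpy of a fixed 4 bytes
           is the portable way to do a possibly-unaligned src load;
           compilers turn it into a single word move. */
        while (n >= 4) {
            memcpy(d, s, 4);
            d += 4;
            s += 4;
            n -= 4;
        }
    }
    /* Remaining bytes (like the trailing rep movsb). */
    while (n-- > 0)
        *d++ = *s++;
    return dst;
}
```

Note that if src and dst start at different offsets mod 4, the word
loop still does unaligned loads on src; as the message says, aligning
the destination at least makes the stores aligned, and "hopefully"
both pointers share the same misalignment so everything lines up.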