Date: Fri, 13 Jan 1995 17:30:05 EST
From: THE MASKED PROGRAMMER
To: djgpp AT sun DOT soe DOT clarkson DOT edu
Cc: badcoe AT bsa DOT bristol DOT ac DOT uk

Hi,

I asked a rather confused question about data alignment, PM and RM, and
got some useful replies. Some people expressed an interest in the exact
situation, so I'm supplying it here.

This is the code that I wanted to run even faster by aligning the 256x256
table on a 64K boundary. It applies 'semi-sophisticated' shading to
coloured VGA images in real time (but it's too slow, as I always thought
it would be). In this case the data at %esi (from) contains words where
the first byte is the colour and the second is the shade-table to process
it with:

#ifndef _TOSCRTRN_INL_
#define _TOSCRTRN_INL_

static inline void toscrtrn(const void *from, const void *to,
                            const int length, const unsigned long time,
                            const void *table, const void *shades)
{
  asm("
      .align 4, 0x90
    toscrtrn_%=_4:
      xor %%eax, %%eax;

      lodsw;
      movb (%%ebx,%%eax), %%dl;

      lodsw;
      movb (%%ebx,%%eax), %%dh;

      shl $0x10, %%edx;

      lodsw;
      movb (%%ebx,%%eax), %%dl;

      lodsw;
      movb (%%ebx,%%eax), %%dh;

      movl %%edx, %%eax;
      ror $0x10, %%eax;
      stosl;                     # write to screen

      loop toscrtrn_%=_4;        # repeat ecx times
      "
      : // no output
      : "b" (table), "c" (length), "S" (from), "D" (to)
      : "eax", "ecx", "edx", "esi", "edi");
}

#endif // _TOSCRTRN_INL_

Several people say that it is not, as I originally thought, a matter of
the difference between PM and RM, but rather a loader feature which means
that it will not align data beyond about 512K boundaries.

I've now found a way to align the data on a 64K boundary: declare the
array as twice the size I need and start the table at the 64K boundary
that must lie within it (sandmann AT new-orleans DOT NeoSoft DOT com and
I both had this idea). The acceleration, however, is marginal.
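For anyone wanting to try the double-size trick, here is a minimal sketch in C. It is not the code actually used for the timings in this post; the names table_storage, TABLE_SIZE and aligned_table are made up for illustration, and it assumes a flat address space where a pointer converts cleanly to an integer (as it does under DJGPP).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 0x10000u   /* 256 x 256 one-byte entries = 64K */

/* Hypothetical backing store: twice the table size, so a 64K boundary
   with a full 64K of room after it must fall somewhere inside. */
static unsigned char table_storage[2 * TABLE_SIZE];

/* Return the first 64K-aligned address inside table_storage. */
static unsigned char *aligned_table(void)
{
    uintptr_t start   = (uintptr_t)table_storage;
    /* Round up to the next multiple of 64K (no change if already aligned). */
    uintptr_t aligned = (start + (TABLE_SIZE - 1)) & ~(uintptr_t)(TABLE_SIZE - 1);

    /* The skipped prefix is always shorter than one table. */
    assert(aligned - start < TABLE_SIZE);
    return table_storage + (aligned - start);
}
```

The table is then filled and used through the pointer that aligned_table() returns; the low 16 bits of that address are zero, which is what lets a 16-bit offset be dropped straight into the bottom half of a register holding the table address.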
With 64K alignment the code reads:

  asm("
    toscrtrn_%=_4:
      lodsw;
      movb (%%eax), %%dl;        # This is the operation that I thought would
                                 # be faster than previously

      lodsw;
      movb (%%eax), %%dh;

      shl $0x10, %%edx;

      lodsw;
      movb (%%eax), %%dl;

      lodsw;
      movb (%%eax), %%dh;

      xchg %%edx, %%eax;
      ror $0x10, %%eax;
      stosl;                     # write to screen
      xchg %%edx, %%eax;

      loop toscrtrn_%=_4;        # repeat ecx times
      "
      : // no output
      : "a" (table), "c" (length), "S" (from), "D" (to)
      : "eax", "ecx", "edx", "esi", "edi");

The only (possibly slight) improvement on that which I have thought of is
to read two offsets at the same time with a lodsl, but the word swapping
then involved would probably outweigh the speed-up.

Anyway, I guess I'll just have to re-think the approach and process only
parts of the screen at a time (or just do the whole thing more slowly).

(The current timing is c. 0.07 of a second for 320x200 bytes on a 40 MHz
386.)

Thanks to those who helped,

Badders
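As a cross-check on the assembly, the lookup that both loops perform can be written in plain C, along with a sketch of the lodsl idea (one 32-bit read supplying two table offsets). This is only an illustration: the function names shade_reference and shade_pairs are hypothetical, and the layout is taken from the description above, so on a little-endian x86 the colour is the low byte of each 16-bit index and the shade-table number is the high byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference loop: each 16-bit source word indexes the 64K table directly
   (low byte = colour, high byte = shade-table number). */
static void shade_reference(const uint16_t *from, uint8_t *to,
                            size_t pixels, const uint8_t *table)
{
    assert(from != NULL && to != NULL && table != NULL);
    for (size_t i = 0; i < pixels; i++)
        to[i] = table[from[i]];
}

/* The lodsl idea: fetch two offsets with one 32-bit read, then split them.
   When the 32-bit values are read from the same memory as the word stream,
   this assumes a little-endian machine, so the first word is the low half. */
static void shade_pairs(const uint32_t *from, uint8_t *to,
                        size_t pairs, const uint8_t *table)
{
    assert(from != NULL && to != NULL && table != NULL);
    for (size_t i = 0; i < pairs; i++) {
        uint32_t two = from[i];
        to[2 * i]     = table[two & 0xFFFFu];  /* first offset  */
        to[2 * i + 1] = table[two >> 16];      /* second offset */
    }
}
```

The mask and shift in shade_pairs are the "word swapping" cost mentioned above; whether they cost less than the second lodsw they replace would have to be measured on the target machine.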