From: "David Jonsson" <David DOT Jonsson AT ellemtel DOT se>
To: <pgcc AT delorie DOT com>
Subject: SSI/KNI support (was RE: Intel/Cygnus)
Date: Fri, 5 Mar 1999 12:57:34 +0100
Message-ID: <000001be66ff$5c17a660$3bd16482@ellemtel.se>
MIME-Version: 1.0
Content-Type: text/plain; charset="Windows-1252"
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
Importance: Normal
In-Reply-To: <19990304152121.42144@insula.local>
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.0810.800
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by delorie.com id GAA03055
Reply-To: pgcc AT delorie DOT com

> > ----------
> > From: 	Philipp Rumpf[SMTP:PRUMPF AT JCSBS DOT LANOBIS DOT DE]
> > Sent: 	Thursday, March 04, 1999 4:21:21 PM
> > To: 	pgcc AT delorie DOT com
> > Subject: 	Re: Intel/Cygnus
> > 
> > This is far from trivial. The C syntax needs to be abandoned if the optimization
> > is to be transparent to the programmer, see SWAR http://shay.ecn.purdue.edu/~swar/
> 
> I cannot see what is so difficult about it[1] ... I think it is just a
> special case of loop unrolling.
> 
> char *p;
> int i;
> 
> for(i=0; i<4; i++)
> 	p[i] |= 0x80;
> 
> should become a 32-bit OR ... once we can do that, the rest of SIMD
> should be trivial[2]

What you write is trivial if it is allowed. I am no compiler expert, but I don't think a compiler is allowed to unroll that loop and merge it into one 32-bit OR unless it can prove the accesses are independent. It isn't obvious that p and i are independent, or that p[0], p[1] etc. are.
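
To make the question concrete, here is a minimal sketch in C of the two forms being compared. The function names are made up for illustration and are not from any library. The word version goes through memcpy so that it assumes nothing about the alignment of p; whether a particular compiler does this rewrite by itself is not something the sketch shows.

```c
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time form, as in the quoted loop. */
static void set_high_bits_bytewise(unsigned char *p)
{
    int i;

    for (i = 0; i < 4; i++)
        p[i] |= 0x80;
}

/* Hand-merged form: one 32-bit OR over the same four bytes.
 * memcpy is used instead of a pointer cast so that no alignment
 * or aliasing assumption is made about p. */
static void set_high_bits_word(unsigned char *p)
{
    uint32_t w;

    memcpy(&w, p, sizeof w);
    w |= 0x80808080u;        /* 0x80 in each of the four bytes */
    memcpy(p, &w, sizeof w);
}
```

Both functions leave the same four bytes behind on any byte order, since the mask is the same in every byte. The open point is only whether the compiler may assume p does not point at i.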

> > Another approach is to use a macro-like addition to ordinary compilers.
> > This is what Apple has done with AltiVec, which is more promising than MMX
> > or KNI/SSI, http://developer.apple.com/hardware/altivec/model.html
> 
> Intel is doing something very similar in their compilers, they even give the
> compiler intrinsics or whatever they call them in the instruction set reference ...

Like libmmx below?
 
> The macro approach has additional advantages though, I really would not like to get
> 11 bits precision for a normal float though I probably would not mind sometimes.

That is often enough, for example for sound processing or simple geometry.

> [2] - Well, it could be a bit difficult to ensure a float * is 128-bit aligned ...

Just align all memory on 128-bit boundaries when compiling. Or what about a new type like the one in Randy Fisher's libmmx, http://min.ecn.purdue.edu/~rfisher/Research/Libmmx/libmmx.html, widened to 128 bits:

typedef union {
        /* C has no "long long long long", so the 128 bits are two quadwords */
        long long               q[2];   /* 2 Quadword (64-bit) values */
        unsigned long long      uq[2];  /* 2 Unsigned Quadword */
        int                     d[4];   /* 4 Doubleword (32-bit) values */
        unsigned int            ud[4];  /* 4 Unsigned Doubleword */
        short                   w[8];   /* 8 Word (16-bit) values */
        unsigned short          uw[8];  /* 8 Unsigned Word */
        char                    b[16];  /* 16 Byte (8-bit) values */
        unsigned char           ub[16]; /* 16 Unsigned Byte */
        float                   s[4];   /* 4 Single-precision (32-bit) values */
} __attribute__ ((aligned (16))) ssi_t; /* On a 16-byte (128-bit) boundary */
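
As a rough sketch of how such an aligned type could be used and checked, assuming GCC's aligned attribute: the names vec128_t, is_16_byte_aligned and add4 below are hypothetical and are not part of libmmx. The add is plain C, so it only shows the data layout, not an actual SSI instruction.

```c
#include <stdint.h>

/* Hypothetical 128-bit type, cut down to two views of the same 16 bytes. */
typedef union {
        float           s[4];   /* 4 Single-precision (32-bit) values */
        unsigned char   ub[16]; /* 16 Unsigned Byte */
} __attribute__ ((aligned (16))) vec128_t;

/* Returns nonzero if ptr lies on a 16-byte (128-bit) boundary. */
static int is_16_byte_aligned(const void *ptr)
{
        return ((uintptr_t)ptr & 15u) == 0;
}

/* Element-wise add of four floats, the kind of operation a packed
 * single-precision add would do in one instruction. */
static void add4(vec128_t *dst, const vec128_t *src)
{
        int i;

        for (i = 0; i < 4; i++)
                dst->s[i] += src->s[i];
}
```

Any variable declared as vec128_t then gets the alignment from the type, so the float * problem only remains for memory that arrives through an ordinary pointer.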


He also defines macros that make it possible to write things like paddd_m2r(variable, mm0).
I asked him in December whether he would support SSI, but he said he had his hands full with MMX. I hope all the extra instruction sets like 3DNow!, MMX and SSI can live in the same .h file.

David