Subject: Re: Speed tuning programs
To: eliz AT is DOT elta DOT co DOT il, mcastle AT umr DOT edu
Cc: djgpp AT sun DOT soe DOT clarkson DOT edu
Date: Mon, 15 Aug 1994 11:52:13 +0100
From: Olly Betts

Thanks to everyone who has made suggestions. I've concatenated replies to
Eli Zaretskii and Mike Castle. Mike first:

>If the amount of CPU time grows faster than the amount of I/O
>time, I suspect that the DJGPP version will eventually out
>perform the Borland version.

The processing as it reads the data in and builds the network is O(n),
where n is the size of the input data. It then munges the network to make
it smaller, using various reduction rules, again O(n) [probably], and then
builds a matrix and solves it, which is O(m^2), where m is the size of the
reduced network. However, it only needs to build a matrix if the network
is too complicated to solve using the reduction rules alone, so m depends
more on how intertwined the network is than on n.

The network reduction and "un-reduction" and the matrix building/solution
are quite FP intensive, but the machine has a coprocessor, so this
shouldn't be affected by the compiler.

>Have you tried profiling (with either version) to find bottle
>necks?  How centralizing is your IO?  Do you read everything in
>at once, then process, or read/process/write/read/process/write?

I haven't tried profiling the code recently, so it might be worth doing
again. However, this would probably just reduce the times for both
versions.

The data is read in and processed a little (parsed and some simple trig),
then added to the network (malloc a block and add some links). Then the
network reduction is done, the matrix is (possibly) built and solved, and
the network is un-reduced. Then the output files are written out.

>If you could work the IO into larger chunks so there are fewer
>RM/PM switches, that might help the djgpp version.

When does the RM/PM switch occur?
Presumably it's only when DOS actually writes to the hardware, so just
using setvbuf to increase the file buffer size would reduce the amount of
RM/PM switching. I tried using setvbuf with a buffer size of 4096 and then
8192 on all input and output files and got no speed change at all. The
line I used was:

   if (fh) setvbuf( fh, NULL, _IOFBF, SIZE );

>Another thing to consider is the start up.  For DJGPP programs,
>you have a slightly larger overhead than Borland programs would
>(initializing PM, loading the 32-bit code, jumping to it, and so
>on).  Did you time with an external program, or an internal
>function?

The timing is done internally, so the initialisation time is not included.
I use time(), which seems to have approx. 1/20 of a second resolution, and
has the advantage of being ANSI and hence working on most platforms.

Eli Zaretskii wrote:

> 1) Try using -O3, which adds additional optimisations.  GCC
>info file has a list of optimization options which you might also
>consider (I think, loop-unrolling isn't included even in -O3, so
>you might try it, too).

"cc1 -version -O2" and "cc1 -version -O3" give me the same output, so I
think there's no difference. -O3 -funroll-loops makes no difference to the
output timings anyway.

> 2) The running time is quite short (5-6 seconds), so the
>overhead of loading go32 and protected mode set-up before your
>program gets control can be significant in relative figures.  Are
>these running times characteristic of your typical application?
>If so, what difference does 1 second make?  If not, try running for
>longer time, and see if you get better performance *ratio*.

As I said above, the timing is internal and so doesn't include the
start-up time. The running time is typical for a reasonably sized data set
on a fast machine (486DX2 66MHz). With a slower machine and/or a larger
data set, the difference will be more noticeable, and I'd like the program
to be usable on most machines.
I do have some much larger data sets somewhere - I'll dig them out and do
some more tests.

> 3) You didn't mention if you use some software disk cache
>like SmartDrive.  Do you?  If you do, you can forget about the claims
>that ``DJGPP is a bit weak if you're doing lots of I/O stuff'' (but see
>4 below), because the cache all but eliminates I/O performance problems.

There is no software disk cache running, as the machine has a fairly good
caching disk controller card, so you don't gain anything. (I'm told this
by someone who tried it on a similar machine.)

> 4) On a similar machine, 7 seconds is what takes to DJGPP-compiled
>program to read a 4-megabyte file (with SmartDrive installed), so if you
>read multi-megabyte files like this, then yes, you might have I/O
>bottleneck.  I think, this is somewhat due to the fact that DJGPP
>read()'s files in chunks of 4KB, which might be sub-optimal when reading
>LARGE files.  Otherwise, your problem is *not* I/O.

There are 32 input files read in, with a total size of 54316 bytes. 4
output files are produced, with a total size of 158695 bytes. Pretty small
really.

> 5) If your program was originally born for Borland C, then it
>might have some algorithmic problems which are necessary when writing
>16-bit programs.  You might consider going through your code and searching
>for such things, which may boost performance with GCC.

It was originally written for Norcroft C running under RISC OS (a 32-bit
environment), and then ported to Borland C. I can't think of anything
explicitly 16-bit, except a translation table of shorts: bit masks which
determine what each input character can be interpreted as.

Olly