Subject: Re: Speed tuning programs
To: eliz AT is DOT elta DOT co DOT il, mcastle AT umr DOT edu
Cc: djgpp AT sun DOT soe DOT clarkson DOT edu
Date: Mon, 15 Aug 1994 11:52:13 +0100
From: Olly Betts

Thanks to everyone who has made suggestions. I've concatenated replies to
Eli Zaretskii and Mike Castle. Mike first:

>If the amount of CPU time grows faster than the amount of I/O
>time, I suspect that the DJGPP version will eventually out
>perform the Borland version.

The processing as it reads the data in and builds the network is O(n),
where n is the size of the input data. It then munges the network to make
it smaller, using various reduction rules, again O(n) [probably], and then
builds a matrix and solves it, which is O(m^2), where m is the size of the
reduced network. However, it only needs to build a matrix if the network
is too complicated to solve using the reduction rules alone, so m depends
more on how intertwined the network is than on n.

The network reduction and "un-reduction" and the matrix building/solution
are quite FP intensive, but the machine has a coprocessor, so this
shouldn't be affected by the compiler.

>Have you tried profiling (with either version) to find bottle
>necks?  How centralizing is your IO?  Do you read everything in
>at once, then process, or read/process/write/read/process/write?

I haven't tried profiling the code recently, so it might be worth doing
again. However, this would probably just reduce the times for both
versions.

The data is read in and processed a little (parsed and some simple trig),
then added to the network (malloc a block and add some links). Then the
network reduction is done, the matrix is (possibly) built and solved, and
the network is un-reduced. Then the output files are written out.

>If you could work the IO into larger chunks so there are fewer
>RM/PM switches, that might help the djgpp version.

When does the RM/PM switch occur?
Presumably it's only when DOS actually writes to the hardware, so just
using setvbuf to increase the file buffer size would reduce the amount of
RM/PM switching. I tried using setvbuf with a buffer size of 4096 and then
8192 on all input and output files and got no speed change at all. The
line I used was:

   if (fh) setvbuf( fh, NULL, _IOFBF, SIZE );

>Another thing to consider is the start up.  For DJGPP programs,
>you have a slightly larger overhead than Borland programs would
>(initializing PM, loading the 32-bit code, jumping to it, and so
>on).  Did you time with an external program, or an internal
>function?

The timing is done internally, so the initialisation time is not included.
I use time(), which seems to have approx. 1/20 of a second resolution, and
has the advantage of being ANSI and hence working on most platforms.

Eli Zaretskii wrote:

> 1) Try using -O3, which adds additional optimisations.  GCC
>info file has a list of optimization options which you might also
>consider (I think, loop-unrolling isn't included even in -O3, so
>you might try it, too).

"cc1 -version -O2" and "cc1 -version -O3" give me the same output, so I
think there's no difference. -O3 -funroll-loops makes no difference to the
output timings anyway.

> 2) The running time is quite short (5-6 seconds), so the
>overhead of loading go32 and protected mode set-up before your
>program gets control can be significant in relative figures.  Are
>these running times characteristic of your typical application?
>If so, what difference does 1 second make?  If not, try running for
>longer time, and see if you get better performance *ratio*.

As I said above, the timing is internal and so doesn't include the
start-up time. The running time is typical for a reasonably sized data set
on a fast machine (486DX2 66MHz). With a slower machine and/or a larger
data set, the difference will be more noticeable, and I'd like the program
to be usable on most machines.
I do have some much larger data sets somewhere - I'll dig them out and do
some more tests.

> 3) You didn't mention if you use some software disk cache
>like SmartDrive.  Do you?  If you do, you can forget about the claims
>that ``DJGPP is a bit weak if you're doing lots of I/O stuff'' (but see
>4 below), because the cache all but eliminates I/O performance problems.

There is no software disk cache running, as the machine has a fairly good
caching disk controller card, so you don't gain anything. (I'm told this
by someone who tried it on a similar machine.)

> 4) On a similar machine, 7 seconds is what takes to DJGPP-compiled
>program to read a 4-megabyte file (with SmartDrive installed), so if you
>read multi-megabyte files like this, then yes, you might have I/O
>bottleneck.  I think, this is somewhat due to the fact that DJGPP
>read()'s files in chunks of 4KB, which might be sub-optimal when reading
>LARGE files.  Otherwise, your problem is *not* I/O.

There are 32 input files read in, with a total size of 54316 bytes. 4
output files are produced, with a total size of 158695 bytes. Pretty small
really.

> 5) If your program was originally born for Borland C, then it
>might have some algorithmic problems which are necessary when writing
>16-bit programs.  You might consider going through your code and searching
>for such things, which may boost performance with GCC.

It was originally written for Norcroft C running under RISC OS (a 32-bit
environment), and then ported to Borland C. I can't think of anything
explicitly 16-bit, except a translation table of shorts: bit masks which
determine what each input character can be interpreted as.

Olly