-O2: This takes a little longer to compile, but the speed-up is pretty big. -O3 is also available, but it goes nuts with the inlining of functions, and that can blow out your cache pretty well. Give it a try and time it both ways.
-m486: Pick which machine you are targeting. Code built for one will still work if it's run on the other. Use -m486 for Pentium and up, too.
-fomit-frame-pointer: Use this if you will not be debugging or profiling. It gives gcc another register to play with (ebp), which can make all the difference in tight loops.
-funroll-loops: I used to think -O3 would turn this on, but it doesn't. Don't turn this on just for the hell of it, though. Time the code before and after. It speeds up loops on 486s but won't have as much effect on Pentiums and up, and the extra code size may have cache side effects. In my code, I usually turn this on for the tight graphics loops.
-ffast-math: Try this flag if you are doing a lot of floating-point and you don't need accuracy to the last bit (few programs really do; the ones that do are usually scientific programs). Also causes sqrt() calls to be inlined.
-S: This option makes gcc write the assembly code it would otherwise feed to its assembler into a .s file. Look at this. Find out exactly what is being generated.
-fforce-addr: Force all memory locations to be copied into a register before doing arithmetic on them.
-fstrength-reduce: A loop optimizer. Don't use it unless your version of gcc is known to handle it correctly. You can check which version you have by typing 'gcc -v' at the command line.
-funroll-all-loops: Also turns on -frerun-cse-after-loop. These days, cache locality is everything, so this option is rarely useful. If you stick to -funroll-loops, you'll get a good compromise.
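Since several of the flags above come with the advice to time the code before and after, here is a minimal sketch of a timing harness using the standard clock() function. The routine being timed (fill_gradient) and the harness name (time_fill) are made up for illustration; substitute your own inner loop. The idea is to build the same file twice, for example once with -O2 and once with -O2 -funroll-loops, and compare the numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical inner loop to benchmark: fills a buffer with a gradient. */
static void fill_gradient(unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)(i & 0xFF);
}

/* Returns the CPU time, in seconds, spent on `reps` calls of
   fill_gradient. Hypothetical helper, not part of any library. */
static double time_fill(unsigned char *buf, size_t len, int reps)
{
    clock_t start, stop;
    int r;

    assert(buf != NULL && reps > 0);

    start = clock();
    for (r = 0; r < reps; r++)
        fill_gradient(buf, len);
    stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Use enough repetitions that the run lasts well over a second, because clock() is fairly coarse. Also make sure the program uses the result of the loop (print a byte of the buffer, for instance), or the optimizer may decide the work is dead code and drop it.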
__djgpp_nearptr_enable(): WARNING! This command turns off all memory protection! You could blow things up badly! Of course, if you're used to a complete lack of memory protection (like in real-mode DOS programming), you'll live.
_control87(). However, some FPUs do better at double, some are better at single, and some automatically convert everything to double. This might or might not be worth messing with. You could also try using _detect_80387(), an undocumented function that returns non-zero if an FPU is present, to determine whether to switch to fixed-point or something.
outports, you can try using CWSDPR0. It runs your app at ring 0, which speeds up port accesses. The drawback: No virtual memory. But if you're going for performance, disk swaps would kill you anyway. It also locks all memory, which is nice for when you want interrupt handlers and don't want to deal with locking every byte they touch. You can use stubedit to force your binary to load it instead of CWSDPMI.EXE. However, this won't help you in Windows or OS/2 DOS boxes, which provide their own DPMI.
ints and 8-bit chars don't slow it down, just shorts. This is because DJGPP runs your code in a 32-bit segment, so it must issue an operand-size override prefix (which stalls the pipeline) to specify that the register width differs from the segment width.
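As a rough illustration of the point about shorts, the sketch below shows the same loop written with 16-bit locals and with 32-bit ints. Both function names are made up for this example. The int version is the one the advice above favors on a 32-bit target; how much it matters depends on the CPU, so time it.

```c
#include <assert.h>

/* 16-bit counter and accumulator: on a 32-bit target the compiler may
   need size-override prefixes to work on these. */
static short sum_bytes_short(const unsigned char *p, short n)
{
    short i, total = 0;
    for (i = 0; i < n; i++)
        total = (short)(total + p[i]);
    return total;
}

/* Same loop with 32-bit ints for the counter and accumulator. */
static int sum_bytes_int(const unsigned char *p, int n)
{
    int i, total = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        total += p[i];
    return total;
}
```

If the data itself has to be stored as 16-bit values (in a file format, say), one option is to keep shorts in the arrays and use ints only for the loop counters and temporaries.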
memcpy, try to give it fixed-length copies to do. This lets DJGPP convert it to an inline rep movsl, which saves you the overhead of a function call and some other calculations memcpy does. It will stick on extra movsw's as necessary; the length doesn't have to be a multiple of a longword. It will not necessarily align the destination in this case.
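A small sketch of what "fixed-length" means here, assuming a hypothetical 320-byte screen row. ROW_BYTES and both function names are invented for the example. Whether the compiler actually expands the first copy inline depends on the gcc version and flags, so check the -S output to be sure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROW_BYTES 320  /* hypothetical: one row of a 320-pixel, 8-bit screen */

/* The length is a compile-time constant, so the compiler is free to
   expand the copy inline instead of calling memcpy. */
static void copy_row_fixed(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, ROW_BYTES);
}

/* The length is only known at run time, so this usually ends up as a
   real call to memcpy. */
static void copy_row_variable(unsigned char *dst, const unsigned char *src,
                              size_t n)
{
    assert(n <= ROW_BYTES);  /* stay inside the row buffers */
    memcpy(dst, src, n);
}
```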
for (i = len; i; i--) instead of for (i = 0; i < len; i++). Otherwise len must either be kept in a register or loaded from memory every time. -fstrength-reduce supposedly would make this unnecessary, but it sometimes produces incorrect code in this version of GCC, so it's disabled by default.
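Here are the two loop shapes side by side, as a minimal sketch with made-up function names. Note that the count-down version visits the elements in reverse order, which is fine for a sum but matters if the order of the work is significant.

```c
#include <assert.h>

/* Counting up: i is compared against len on every iteration. */
static long sum_up(const int *v, int len)
{
    long total = 0;
    int i;
    for (i = 0; i < len; i++)
        total += v[i];
    return total;
}

/* Counting down: the loop test is just "is i non-zero?", so len is
   only read once, at the start. */
static long sum_down(const int *v, int len)
{
    long total = 0;
    int i;

    assert(len >= 0);  /* a negative len would never count down to zero */
    for (i = len; i; i--)
        total += v[i - 1];
    return total;
}
```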
inline keyword to cause any function to be inlined (as fast as macros) whether you're doing C++ or C. You have to call it from the same source file to get the inlining, but you can still call it from an external object.
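A minimal sketch of an inlined helper, with hypothetical names (clamp_byte, brighten). One assumption to flag loudly: this uses static inline, which behaves the same across gcc versions and C dialects but, unlike the plain inline described above, is not visible to other object files. The meaning of a plain, non-static inline differs between older GNU C and C99, so check your compiler's documentation before relying on it.

```c
#include <assert.h>

/* Small helper that is a good candidate for inlining: clamps a value
   to the 0..255 range of an 8-bit pixel. */
static inline int clamp_byte(int x)
{
    return x < 0 ? 0 : (x > 255 ? 255 : x);
}

/* Caller in the same source file, so the compiler can paste the body
   of clamp_byte in place of the call when optimizing. */
static int brighten(int pixel, int amount)
{
    assert(pixel >= 0 && pixel <= 255);
    return clamp_byte(pixel + amount);
}
```

As far as I know, gcc only inlines when optimization is turned on, so build with at least -O and look at the -S output to confirm the call is gone.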