DJGPP 2.0 optimization

Geez, what a huge topic. But here are a few tips I've found:

Compiler flags

Use -O2. This takes longer to compile (but not much) and the speed difference is pretty big over -O0 or -O1. -O3 is also available, but it goes nuts with the inlining of functions, and that can blow out your cache pretty well. Give it a try and time it both ways.
Use -m386 or -m486, depending on which machine you are targeting. Code built with either flag still runs on the other CPU; the flag only changes how the code is tuned. Use -m486 for Pentium and up, too.
Use -fomit-frame-pointer if and only if you will not be using:
- Windows or OS/2
- A debugger
- A profiler
But if you can live without those 3 things, this option gives djgpp another register to play with, which can make all the difference in tight loops. (I plan to bundle 2 .exe's in my game, one for Windows DOS sessions, one not.)
-funroll-loops. I used to think -O3 would turn this on, but it doesn't. Do not just turn this on for the hell of it, though. Time the code before and after. It speeds up loops on 486's but won't have as much effect on Pentiums and up. And the extra code size may have cache side effects. But in my code, I usually turn this on for the tight graphics loops.
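To get a feel for what unrolling does to a loop, here is a hand-written sketch of the transformation. It is only an illustration: the function names are made up for this example, and the code gcc actually generates with -funroll-loops will look different.

```c
/* Plain loop: one add and one loop test per element. */
long sum_simple(const int *a, int n)
{
    long total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* The same loop unrolled by hand, four elements per pass.
 * Fewer loop tests and jumps, but noticeably more code. */
long sum_unrolled4(const int *a, int n)
{
    long total = 0;
    int i = 0;

    /* Main body: runs while at least four elements remain. */
    for (; i + 3 < n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }

    /* Clean-up loop for the 0 to 3 leftover elements. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Both functions return the same result; the second one just trades code size for fewer branches, which is exactly the trade-off to time before and after.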
-S. This option makes gcc stop after compiling and write the assembly code it would have fed to its assembler into a .s file. Look at this. Find out exactly what is being generated.
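Several of the flags above come with the advice to time the code both ways. A minimal timing sketch using only the standard clock() function might look like this. The names busy_work and time_calls are hypothetical, and busy_work is just a placeholder for whatever routine you actually care about.

```c
#include <time.h>

/* Written to at the end of the workload so the compiler
 * cannot throw the whole loop away at higher -O levels. */
static volatile unsigned long sink;

/* Placeholder workload; swap in the routine you want to measure. */
void busy_work(void)
{
    unsigned long i, total = 0;

    for (i = 0; i < 100000UL; i++)
        total += i * i;
    sink = total;
}

/* Calls fn() reps times and returns the elapsed CPU time in seconds. */
double time_calls(void (*fn)(void), int reps)
{
    clock_t start, end;
    int i;

    start = clock();
    for (i = reps; i; i--)
        fn();
    end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Build the same file once per set of flags and compare the numbers from something like time_calls(busy_work, 1000). clock() can be coarse, so use enough repetitions that the run takes at least a second or so.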

Other things

__djgpp_nearptr_enable(). WARNING! This call turns off all memory protection! You could blow things up bad! Of course, if you're used to a complete lack of memory protection, you'll live.
The point of this call is to let you write directly to low DOS memory, like the VGA buffer, through ordinary pointers. Way, way faster than dosmemput().
A decent compromise is to use far pointers (the _farpeek and _farpoke family of functions in <sys/farptr.h>), which take one extra cycle per access but keep memory protected.
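Here is a rough sketch of both approaches for a 320x200 VGA screen at 0xA0000. The helper names (get_vga, fill_screen, put_pixel) and the fake_vga array are made up for this example; the array is only a stand-in so the code also builds with compilers other than DJGPP. It assumes the card is already in mode 13h.

```c
#include <string.h>

#ifdef __DJGPP__
#include <go32.h>          /* _dos_ds */
#include <sys/farptr.h>    /* _farpokeb */
#include <sys/nearptr.h>   /* __djgpp_nearptr_enable, __djgpp_conventional_base */
#endif

#define VGA_WIDTH 320
#define VGA_SIZE  64000    /* 320 x 200 bytes */

/* Stand-in for video memory when not building with DJGPP. */
static unsigned char fake_vga[VGA_SIZE];

/* Near-pointer route: returns an ordinary pointer to video memory,
 * or NULL if near pointers could not be enabled. Under DJGPP the
 * base can move, so recompute the pointer instead of caching it
 * across memory allocations. */
unsigned char *get_vga(void)
{
#ifdef __DJGPP__
    if (!__djgpp_nearptr_enable())
        return NULL;
    return (unsigned char *)(0xA0000 + __djgpp_conventional_base);
#else
    return fake_vga;
#endif
}

/* With a near pointer, the screen is just memory. */
void fill_screen(unsigned char *vga, unsigned char color)
{
    memset(vga, color, VGA_SIZE);
}

/* Far-pointer route: slightly slower per access, protection stays on. */
void put_pixel(int x, int y, unsigned char color)
{
#ifdef __DJGPP__
    _farpokeb(_dos_ds, 0xA0000 + y * VGA_WIDTH + x, color);
#else
    fake_vga[y * VGA_WIDTH + x] = color;
#endif
}
```

When you are done with near pointers, __djgpp_nearptr_disable() turns the protection back on.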
Try to avoid 16-bit variables in performance-critical code. Using the 16-bit versions of the registers costs 1 or more extra cycles on 386/486/P5 (and it's even worse on the P6!). Stick to 32-bit ints and 8-bit chars. Chars don't slow it down, just shorts: DJGPP runs your code in a 32-bit segment, so every 16-bit operation needs an operand-size override prefix (which stalls the pipeline) to say that the register width differs from the segment width.
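As a small illustration of the advice above, here are two versions of the same byte-summing loop. The function names are made up for this example, and whether the 16-bit version really ends up slower depends on what the compiler does with it, so check the -S output and time it.

```c
/* 16-bit counter and accumulator: the compiler may have to use
 * 16-bit register operations for i and total. */
unsigned int sum_bytes_16(const unsigned char *p, unsigned short n)
{
    unsigned short i, total = 0;

    for (i = 0; i < n; i++)
        total += p[i];      /* note: wraps around at 65536 */
    return total;
}

/* Preferred: 32-bit counter and accumulator, 8-bit data. */
unsigned int sum_bytes(const unsigned char *p, int n)
{
    unsigned int total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += p[i];
    return total;
}
```

If the data itself has to be 16 bits wide, it is usually still possible to keep the loop counters and running totals as plain ints.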
If your code uses a lot of outports, you can try using CWSDPR0. It runs your app at ring 0, which speeds up port accesses. The drawback: No virtual memory. But if you're going for performance, disk swaps would kill you anyway. It also locks all memory, which is nice for when you want interrupt handlers and don't want to deal with locking every byte they touch. You can use stubedit to force your binary to load it instead of CWSDPMI.EXE. However, this won't help you in Windows or OS/2 DOS boxes.
If you need to use memcpy, try to give it fixed-length copies to do. This lets DJGPP convert it to an inline rep movsl, which saves you the overhead of a function call and some other calcs memcpy does.
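A sketch of the fixed-length idea: the size passed to memcpy is a compile-time constant, so the compiler has a chance to expand the copy inline. ROW_BYTES, copy_row and copy_rows are names invented for this example, and whether the call actually gets inlined depends on the compiler version and flags, so check the -S output.

```c
#include <string.h>

#define ROW_BYTES 320   /* one line of a 320-pixel-wide, 8-bit screen */

/* The length is a constant, so the compiler knows it up front. */
void copy_row(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, ROW_BYTES);
}

/* Copy several rows as a series of fixed-length copies. */
void copy_rows(unsigned char *dst, const unsigned char *src, int rows)
{
    int r;

    for (r = rows; r; r--) {
        copy_row(dst, src);
        dst += ROW_BYTES;
        src += ROW_BYTES;
    }
}
```

Compare this with a single memcpy(dst, src, rows * ROW_BYTES), where the length is only known at run time.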
Try to use
```
for (i = len; i; i--)
```
instead of
```
for (i = 0; i < len; i++)
```
The count-down version only has to test i against zero. Otherwise len must either be kept in a register or loaded from memory on every iteration.
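Counting down does not have to mean walking the data backwards. In this sketch (function names are made up, and len is assumed to be zero or positive) the counter runs down while the pointer still moves forward through the buffer.

```c
/* Counting up: i is compared against len on every pass. */
void fill_up(unsigned char *dst, unsigned char value, int len)
{
    int i;

    for (i = 0; i < len; i++)
        dst[i] = value;
}

/* Counting down: the loop only tests i against zero, and the
 * pointer increment keeps the writes in forward order.
 * Assumes len >= 0. */
void fill_down(unsigned char *dst, unsigned char value, int len)
{
    int i;

    for (i = len; i; i--)
        *dst++ = value;
}
```

Both functions leave the buffer in the same state, so the count-down form can be dropped in wherever the loop body does not need the value of i itself.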
Use inline assembly for your critical loops. See Brennan's Guide to Inline Assembly with DJGPP2.
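For a taste of what that looks like, here is a minimal sketch of a byte fill done with rep stosb. The function name fill_bytes is made up, the constraint syntax assumes a reasonably recent gcc (older releases may need the operands written differently), and the plain C fallback is only there so the file still builds on other compilers and CPUs.

```c
#include <stddef.h>

/* Fill count bytes at dst with value. */
void fill_bytes(void *dst, unsigned char value, size_t count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* D = destination pointer, c = count, a = byte to store.
     * dst and count are read-write because rep stosb changes both. */
    __asm__ __volatile__ (
        "cld\n\t"
        "rep stosb"
        : "+D" (dst), "+c" (count)
        : "a" (value)
        : "memory", "cc");
#else
    unsigned char *p = dst;

    while (count--)
        *p++ = value;
#endif
}
```

Time it against memset before keeping it; hand-written assembly is not automatically faster than what the compiler and library already do.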
Page provided by brennan@rt66.com