Basic Optimization Techniques


Table of Contents

Fast Multiplication and Division
Loop Unrolling and Flipping
Strength Reduction
Duff's Device
Writing Clean Loops

Fast Multiplication and Division

One of the often used optimizations is the binary shift. In C, the operators for binary shifting is << and >>. Shifting to the left causes numbers to multiply by the power of two that you shifted them. Shifting to the right is identical except it divides by the power of two.
Powers of two
2
4
8
16
32
64
128
256
...
For instance, let say we are going to plot a pixel in mode 13h. To plot in mode 13h, you copy the color to the screen offset x + y * 320. 320 is the same as 64 + 256. So instead of:

 

screen[y * 320 + x] = color;
we can:

 

screen[((y << 8) + (y << 6)) + x] = color;
This is much faster then multiplication because binary shifts are simple operations, as opposed to mutplications, which are complex operations.

Loop Unrolling and Flipping

When you put code in a loop, you have to optimize it more then most other code, because any slow operation is going to be multiplied many times. Two ways to do this is loop unrolling and fliping. Loop unrolling means repeating lines of code inside a loop. For instance,
for(i=0;i<=20;i++)
{
 ships[i].alive = TRUE;
}
With unrolling:
new_index = 0;
for(i=0;i<=20;i+=2)
{
 ships[new_index++].alive = TRUE;
 ships[new_index++].alive = TRUE;
}
This code actually goes faster, even though it is longer. That leads to one important rule:
Shorter code is not always faster.
The reason this is faster is because we only test if we're done every other operation. If you think about it, the test is actually a complex operation, consisting of a jump and a conditional jump. The CPU always starts processing the instructions before they are executed, and when you use a jump it makes the CPU get rid of that information. A conditional jump is even worse, because the CPU might get rid of the partially processed information and have to reprocess it because the jump was not taken.

Loop fliping allows us to get rid of one of those jump instructions for each time around, and a conditional jump the first time. It is taking a loop that tests at the top and changing it so that it tests at the top.

 Without flipping:

new_index = 0;
for(i=0;i<=20;i+=2)
{
 ships[new_index++].alive = TRUE;
 ships[new_index++].alive = TRUE;
}
With flipping
new_index = 0;
i=0
do
{
 ships[new_index++].alive = TRUE;
 ships[new_index++].alive = TRUE;
 i+=2;
} while(i<=20);
The version that uses fliping and unrolling is a good amount faster then the original.

Do not unroll to far or it will slow you way down! Usually, seven or eight is the max. Take a look at the example below.

for (i=0;i<=400;i++)
{       
        for(k=0;k<=400;k++)
        {
         men(i*5,k*5).lives--;
        }
}
How could we speed this up? For one thing, we can flip both loops. Another thing we can do is called Strength Reduction . The idea is to take big, complex operations like multiplication, division, and modulus, and turn them into cheap operations like addition, OR, AND, and shifting. In this case, we can turn the i * 5 and k * 5 into addition by just incrementing them by five. Also, on the 486, an ADD cost one cycle, while an IMUL costs between 13 and 42 cycles. While this is a extreme example, we will save at least 1920000 cycles with just the strength reduction!
After all these optimizations:
index=0;
i=0;
do
{       
   k=0;
   do
   {
     men(index+=5,k).lives--;
     men(index+=5,k).lives--;
     k+=10;
   } while(i<=2000);
   i+=5;
} while(i<=2000);

Duff's Device and Unrolling Loops

Take a look at this:
index = 0;
for(i=0;i<size;i+=4)
{
 result = result * index++ + 30;
 result = result * index++ + 30;
 result = result * index++ + 30;
 result = result * index++ + 30;
}
What's wrong with this code? Well, besides the fact that it uses multiplication where it could use addition, it can only compute multiples of 3. How can we fix this? There is something called Duff's Device, which you may or may not be familar with. Duff's Device intertwines a while loop with a switch case statement, as is shown below.(From the Jargon File, ver 3.9)
        register n = (count + 7) / 8;      /* count > 0 assumed */

        switch (count % 8)
        {
        case 0:        do {  *to = *from++;
        case 7:              *to = *from++;
        case 6:              *to = *from++;
        case 5:              *to = *from++;
        case 4:              *to = *from++;
        case 3:              *to = *from++;
        case 2:              *to = *from++;
        case 1:              *to = *from++;
                           } while (--n &rt; 0);
The problem with that is a lot of compilers gagged on the code, and many programmers had trouble understanding it. Below is the portable version which originally appeared in More Tricks of the Game Programming Guru's.
index = 0;

switch(size & 0x3)
{
 case 3:
         result = result * index++ + 30;
 case 2:
         result = result * index++ + 30;
 case 1:
         result = result * index++ + 30;
}

if (index < size)
{
        for(i=0;i < size;i+=4)
        {
         result = result * index++ + 30;
         result = result * index++ + 30;
         result = result * index++ + 30;
         result = result * index++ + 30;
        }
}
This code only allows you to unroll loop a power of two times. If you use %(mod) (which was used in the original) instead of &(and) you can unroll your loop any amount of times. The problem is mod is a lot slower then and. Also, earlier in the section on strength reduction, I said you could replace mod's with and's. Well, I just did.

Writing Clean Loops

Programming Law - Anything you can put outside of a loop, put outside.

 Keep garbage out of your loops. This may seem obvious, but anything you can put outside of a loop, do put outside. Take a look at the example below from More tricks of the Game Programming Guru's:

void cdecl ScaleBlit(UBYTE *bitmap, int x0, int y0,
        int x1, int y1)
{
        int u, v, x, y;

        for ( x = x0; x < x1; x++ )
        {
                for ( y = y0; y < y1; y++ )
                {
                        u = width * (x - x0) / (x1 - x0);
                        v = height * (y - y0) / (y1 - y0);
                        screen[x + y * 320] = bitmap[u + v * width];
                }
        }
}
You may be wondering what is wrong with this code. Well, besides the lack of binary shifting, it recomputes the y and x values for each pixel. You see, this guy put garbage in his inner loop, and is part-author of a game programming book! To give him proper credit, this was not his final version. But, this still teaches you to always check for garbage in your loop. On a wider scale, this also teaches you to check for possible optimizations anywhere and everywhere.

 


Copyright 1996, David J Berube.
You can reach me at Form1@aol.com.
1