Assembler level comparisson

At first: I'm not a compiler or assembler guru so perhaps some stuff is wrong, but I have large experience in assembler coding and know enough about CPU architectures.
I taked the simplest benchmark from the suite. That's the numeric sort algorithm. Then I selected the compiler that gave better performance in my computer: egcs 2.91.60 (1.1.1) using -O3 -fomit-frame-pointer -ffast-math and Pentium aligment.
The difference in speed between both compilers is very small (under 10%). So I expected to find just small differences. Ten percent is a really small difference, specially if you think that there are some amount of error in the meassurements.
I generated the assembler for the two small routines involved and then searched both routines in the MSVC 5 code. Lamentably I didn't have the assembler generated by MSVC so I was forced to use Hiew to find the routines. It was relative hard (a 187 Kb file without debug information) but doing it I saw some interesting things.
That's the C code:
static void NumHeapSort(farlong *array,ulong bottom,ulong top)) { ulong temp; /* Used to exchange elements */ ulong i; /* Loop index */ /* ** First, build a heap in the array */ for(i=(top/2L); i>0; --i) NumSift(array,i,top); /* ** Repeatedly extract maximum from heap and place it at the ** end of the array. When we get done, we'll have a sorted ** array. */ for(i=top; i>0; --i) { NumSift(array,bottom,i); temp=*array; /* Perform exchange */ *array=*(array+i); *(array+i)=temp; } return; } static void NumSift(farlong *array,ulong i,ulong j) { unsigned long k; long temp; /* Used for exchange */ while((i+i)<=j) { k=i+i; if(k<j) if(array[k]<array[k+1L]) ++k; if(array[i]<array[k]) { temp=array[k]; array[k]=array[i]; array[i]=temp; i=k; } else i=j+1; } return; } As you can see both routines are small and with a lot of room for optimizations. I think they are excellent to test the optimizers.
Here is the code generated by egcs: (AT&T syntax, line debug info removed)
_NumHeapSort: pushl %ebp pushl %edi pushl %esi pushl %ebx movl 28(%esp),%esi movl 20(%esp),%edi movl %esi,%ebx movl 24(%esp),%ebp shrl $1,%ebx je L41 L43: pushl %esi pushl %ebx pushl %edi call _NumSift addl $12,%esp decl %ebx jnz L43 L41: movl %esi,%ebx testl %ebx,%ebx je L39 leal (%edi,%ebx,4),%esi .p2align 2 L48: pushl %ebx pushl %ebp pushl %edi call _NumSift movl (%esi),%eax movl (%edi),%edx movl %eax,(%edi) addl $12,%esp movl %edx,(%esi) addl $-4,%esi decl %ebx jnz L48 L39: popl %ebx popl %esi popl %edi popl %ebp ret .p2align 2 _NumSift: pushl %edi pushl %esi pushl %ebx movl 20(%esp),%ebx movl 16(%esp),%esi movl 24(%esp),%edi jmp L51 .p2align 2 L53: movl %edx,%ecx cmpl %edi,%edx jae L54 movl 4(%esi,%edx,4),%eax cmpl %eax,(%esi,%edx,4) jge L54 leal 1(%edx),%ecx L54: movl (%esi,%ebx,4),%eax movl (%esi,%ecx,4),%edx cmpl %edx,%eax jge L56 movl %eax,(%esi,%ecx,4) movl %edx,(%esi,%ebx,4) movl %ecx,%ebx jmp L51 .p2align 2 L56: leal 1(%edi),%ebx L51: movl %ebx,%edx addl %ebx,%edx cmpl %edi,%edx jbe L53 popl %ebx popl %esi popl %edi ret I don't know what you think about it, but in my opinion that's excellent code for a C compiler. Let's see the code from MSVC 5: (Intel syntax, no labels) .00403CB0: 53 push ebx .00403CB1: 56 push esi .00403CB2: 8B74240C mov esi,[esp][0000C] .00403CB6: 57 push edi .00403CB7: 8B7C2418 mov edi,[esp][00018] .00403CBB: 8BDF mov ebx,edi .00403CBD: D1EB shr ebx,1 .00403CBF: 740E je .000403CCF -------- (1) .00403CC1: 57 push edi .00403CC2: 53 push ebx .00403CC3: 56 push esi .00403CC4: E857000000 call .000403D20 -------- (2) .00403CC9: 83C40C add esp,00C ;" " .00403CCC: 4B dec ebx .00403CCD: 75F2 jne .000403CC1 -------- (3) .00403CCF: 85FF test edi,edi .00403CD1: 761C jbe .000403CEF -------- (4) .00403CD3: 8B5C2414 mov ebx,[esp][00014] .00403CD7: 57 push edi .00403CD8: 53 push ebx .00403CD9: 56 push esi .00403CDA: E841000000 call .000403D20 -------- (5) .00403CDF: 8B0CBE mov ecx,[esi][edi]*4 .00403CE2: 8B06 mov eax,[esi] .00403CE4: 890E mov [esi],ecx .00403CE6: 83C40C add esp,00C ;" " .00403CE9: 8904BE mov [esi][edi]*4,eax .00403CEC: 4F dec edi .00403CED: 75E8 jne .000403CD7 -------- (1) .00403CEF: 5F pop edi .00403CF0: 5E pop esi .00403CF1: 5B pop ebx ... .00403D20: 8B542408 mov edx,[esp][00008] .00403D24: 53 push ebx .00403D25: 8B5C2410 mov ebx,[esp][00010] .00403D29: 8D0412 lea eax,[edx][edx] .00403D2C: 3BC3 cmp eax,ebx .00403D2E: 7736 ja .000403D66 -------- (1) .00403D30: 8B4C2408 mov ecx,[esp][00008] .00403D34: 57 push edi .00403D35: 56 push esi .00403D36: 3BC3 cmp eax,ebx .00403D38: 730C jae .000403D46 -------- (2) .00403D3A: 8B3481 mov esi,[ecx][eax]*4 .00403D3D: 8B7C8104 mov edi,[ecx][eax]*4[00004] .00403D41: 3BF7 cmp esi,edi .00403D43: 7D01 jge .000403D46 -------- (3) .00403D45: 40 inc eax .00403D46: 8B3491 mov esi,[ecx][edx]*4 .00403D49: 8B3C81 mov edi,[ecx][eax]*4 .00403D4C: 3BF7 cmp esi,edi .00403D4E: 7D0A jge .000403D5A -------- (4) .00403D50: 893481 mov [ecx][eax]*4,esi .00403D53: 893C91 mov [ecx][edx]*4,edi .00403D56: 8BD0 mov edx,eax .00403D58: EB03 jmps .000403D5D -------- (1) .00403D5A: 8D5301 lea edx,[ebx][00001] .00403D5D: 8D0412 lea eax,[edx][edx] .00403D60: 3BC3 cmp eax,ebx .00403D62: 76D4 jbe .000403D38 -------- (2) .00403D64: 5E pop esi .00403D65: 5F pop edi .00403D66: 5B pop ebx .00403D67: C3 retn Now let me show you some details:
  1. Aligment of funtions: egcs aligned the functions very well in this case, but was just a coincidence! The default settings for x86 (at least for djgpp) is to align functions using .p2align 2 that's 32 bits aligment. Now take a look to the MSVC 5 addresses: 00403CB0 and 00403D20. Both are aligned to 32 bytes boundaries, that's 256 bits.
    When I moved the routines a little bit I lost 10% of performance. Looks like 32 bits aligment is good enough for Pentium CPUs but not for AMD CPUs and I think isn't enough for P6 CPUs.
    I tested with both routines aligned to 256 bits, 32 bits (xxxx4) and 32 bits (xxxxC). The three combinations gives 2.67 in a Pentium 233 MMX CPU (benchmarks as a 240 MHz clasic Pentium CPU), but when the code was aligned like this xxxx88 and xxxxDC the K6-2 CPU dropped from 4.11 to 3.68. (Note: 370 MHz P5 equivalent to 331 MHz, this CPU is really faster than P5).
    Looking in the MSVC 5 code (I did it while disassembled the code to find the routines) I saw MSVC 5 aligned all the code of the benchmark (not all the runtime) to 16 or 32 bytes boundaries. I couldn't find why some routines are aligned to 32 and others to 16. Note: I'm not silly, isn't a coincidence, some parts have over than 16 bytes wasted for aligment.
    My experience optimizing routines by hand for a Cyrix 5x86 120 MHz CPU tells me that 128 bits aligment is required. In my opinion that's because the pipeline bus of 5x86 CPU is 128 bits wide, additionally a cache line is 32 bytes. So in my opinion the optimal aligment is somewhere between 128 and 256 bits. MSVC people thinks the same ;-).

  2. Jumps aligment: The default for egcs (at least for djgpp) is to use .p2align 2 (that's 32 bits aligments) for jumps and entry points of loops. I did tests with 0, 4 bytes and 8 bytes.
    In the Pentium MMX CPU the best results are obtained with 0 and 4 bytes (using 0 the code doesn't get accidentally aligned, is really missaligned), in any case the difference is under 2.5% so isn't important.
    In the K6-2 the best performance is for 8 bytes, but the difference is less than 2%. Using 4 bytes is really dangerous because (as I experimented) some loop could be located in an xxxxC address and that's the worst case for K6 (also for Cyrix 5x86).
    If you look in the MSVC code it doesn't align jumps or entry points of loops. In this particular case that's better than using 4 bytes because reduces the probability of hiting an xxxxC address.
    Aligning here introduces dead code and doesn't help the CPU too much.

  3. Inlining: Here egcs didn't inlined NumSift inside NumHeapSort just because the body is declared after. Inlining this routine the code performs worst. I think that's because NumSift is too big for inlining, not sure. The fact is that egcs generates worst code and makes a worst use of the registers. MSVC didn't inlined functions.

  4. Code used to align: MSVC uses int 3 between functions, that's much more safe than using just nop because it will generate an exception instead of executing the wrong function if you use a corrupted pointer to function.

  5. Code proximity: Cache designers knows very well it, if you are executing a chunk of code the probability that you will execute another that's contiguous is much bigger than the probability for other blocks of memory. EGCS doesn't know it and I never expected it, but I got amazed when I discovered that MSVC knows it and re-arranges functions so the called function is close to the caller function. The routines are moved.

  6. Prolog/epilog: Compare it: .00403CB0: 53 push ebx .00403CB1: 56 push esi .00403CB2: 8B74240C mov esi,[esp][0000C] .00403CB6: 57 push edi .00403CB7: 8B7C2418 mov edi,[esp][00018] .00403CBB: 8BDF mov ebx,edi .00403CBD: D1EB shr ebx,1 with: pushl %ebp pushl %edi pushl %esi pushl %ebx movl 28(%esp),%esi movl 20(%esp),%edi movl %esi,%ebx movl 24(%esp),%ebp shrl $1,%ebx Can you see the diference? MSVC is interleaving the push/clobber instructions. Why? I can see two reasons:
    1. because in this way both pipes of the Pentium will be filled. Even when the code from MSVC can be optimized is clearly better than the egcs version.
    2. if the code of the function includes a conditional that makes it end prematurely MSVC code will save some push and pops.
    I think egcs will follow the same rules.

  7. Keeping the execution units busy: Looks like egcs doesn't pay attention to what instructions can be paired: movl (%esi),%eax movl (%edi),%edx movl %eax,(%edi) addl $12,%esp [1] movl %edx,(%esi) [2] addl $-4,%esi decl %ebx jnz L48 The stack adjust (addl $12,%esp) could be moved to make the loop faster and save one cycle. Just exchange [1] and [2] and you get an speed up. MSVC is much more smart with this details. According to Intel a compiler can give between 5% and 25% of extra performance exploiting it.

  8. Better choice of registers and indexes: Compare the following loop: => EDI is preloaded with the base of the array => EBX is preloaded with top and is the loop variable leal (%edi,%ebx,4),%esi => ESI is &array[top], I'll call it arr2 .p2align 2 L48: pushl %ebx pushl %ebp pushl %edi call _NumSift movl (%esi),%eax => temp2=*arr2 movl (%edi),%edx => temp=*array; movl %eax,(%edi) => *array=temp2 => *array=*arr2 addl $12,%esp => Stack adjust call _NumSift movl %edx,(%esi) => *arr2=temp addl $-4,%esi => arr2-- decl %ebx => i-- jnz L48 with: => EDI preloaded with top, is the loop variable => ESI preloaded with array .00403CD3: 8B5C2414 mov ebx,[esp][00014] => EBX=bottom .00403CD7: 57 push edi .00403CD8: 53 push ebx .00403CD9: 56 push esi .00403CDA: E841000000 call .000403D20 -------- (5) .00403CDF: 8B0CBE mov ecx,[esi][edi]*4 => temp2=array[i] .00403CE2: 8B06 mov eax,[esi] => temp=*array .00403CE4: 890E mov [esi],ecx => *array=temp2 => *array=array[i] .00403CE6: 83C40C add esp,00C ;" " => Stack adjust .00403CE9: 8904BE mov [esi][edi]*4,eax => array[i]=temp .00403CEC: 4F dec edi => i-- .00403CED: 75E8 jne .000403CD7 -------- (1) To refresh your mind here is the C code: for(i=top; i>0; --i) { NumSift(array,bottom,i); temp=*array; /* Perform exchange */ *array=*(array+i); *(array+i)=temp; } EGCS really did the wrong thing here, in fact de-optimized the code. MSVC code is clear and is just the C code arranged to be well executed by the Pentium processor. I think egcs uses the extra register for the array[top] stuff just because the registers choice was the wrong one. MSVC choose the right registers and fully exploits the complex addresing modes of 386 CPUs.

Conclutions: The aligment policies of gcc/egcs should be re-thinked and adjusted. Currently isn't very good and, in some cases, produces disasters for the K6 CPU.
GCC/EGCS should take care about how to reorder code to exploit the Pentium superscalar features.
The optimization algorithms should be tuned to avoid silly code as the showed in 8. I think the main problem with egcs/gcc is how it selects what registers to use.

Copyright © 1999 by Salvador E. Tropea.
If you want to use it for some publication just let me know.
HTML created with Setedit, my own text editor ;-).
1