First of all: I'm not a compiler or assembler guru, so perhaps some of this is
wrong, but I have extensive experience coding in assembler and know enough
about CPU architectures.
I took the simplest benchmark from the suite: the numeric sort algorithm.
Then I selected the compiler that gave the best performance on my computer:
egcs 2.91.60 (1.1.1) using -O3 -fomit-frame-pointer -ffast-math and Pentium
alignment.
The difference in speed between the two compilers is very small (under 10%),
so I expected to find only small differences. Ten percent is a really small
margin, especially if you consider that there is some amount of error in the
measurements.
I generated the assembler for the two small routines involved and then
searched for both routines in the MSVC 5 code. Unfortunately I didn't have the
assembler generated by MSVC, so I was forced to use Hiew to find the routines.
It was relatively hard (a 187 Kb file without debug information), but doing it
I noticed some interesting things.
Here is the C code:
static
void NumHeapSort(farlong *array,ulong bottom,ulong top)
{
ulong temp; /* Used to exchange elements */
ulong i; /* Loop index */
/*
** First, build a heap in the array
*/
for(i=(top/2L); i>0; --i)
NumSift(array,i,top);
/*
** Repeatedly extract maximum from heap and place it at the
** end of the array. When we get done, we'll have a sorted
** array.
*/
for(i=top; i>0; --i)
{ NumSift(array,bottom,i);
temp=*array; /* Perform exchange */
*array=*(array+i);
*(array+i)=temp;
}
return;
}
static
void NumSift(farlong *array,ulong i,ulong j)
{
unsigned long k;
long temp; /* Used for exchange */
while((i+i)<=j)
{
k=i+i;
if(k<j)
if(array[k]<array[k+1]) ++k;
if(array[i]<array[k])
{ temp=array[k]; array[k]=array[i];
array[i]=temp; i=k; }
else i=j+1;
}
return;
}
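(If you want to play with the routines outside the suite, here is a minimal
hedged driver; the typedefs and the NumSift prototype are my guess at what
the suite's headers provide on a 32-bit target, placed before the two
routines, with main() placed after them:)

#include <stdio.h>

/* Assumed to appear before the two routines: */
typedef long farlong;            /* assumption: plain long outside 16-bit DOS */
typedef unsigned long ulong;     /* assumption: the suite's ulong */
static void NumSift(farlong *array,ulong i,ulong j);

int main(void)
{
farlong a[8]={5,1,4,2,8,3,7,6};
ulong n;
NumHeapSort(a,0,7);              /* sorts a[0..7] in place */
for(n=0;n<8;n++) printf("%ld ",a[n]);
printf("\n");
return 0;
}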
As you can see, both routines are small, with a lot of room for optimization.
I think they are excellent for testing the optimizers.
Here is the code generated by egcs: (AT&T syntax, line debug info removed)
_NumHeapSort:
pushl %ebp
pushl %edi
pushl %esi
pushl %ebx
movl 28(%esp),%esi
movl 20(%esp),%edi
movl %esi,%ebx
movl 24(%esp),%ebp
shrl $1,%ebx
je L41
L43:
pushl %esi
pushl %ebx
pushl %edi
call _NumSift
addl $12,%esp
decl %ebx
jnz L43
L41:
movl %esi,%ebx
testl %ebx,%ebx
je L39
leal (%edi,%ebx,4),%esi
.p2align 2
L48:
pushl %ebx
pushl %ebp
pushl %edi
call _NumSift
movl (%esi),%eax
movl (%edi),%edx
movl %eax,(%edi)
addl $12,%esp
movl %edx,(%esi)
addl $-4,%esi
decl %ebx
jnz L48
L39:
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.p2align 2
_NumSift:
pushl %edi
pushl %esi
pushl %ebx
movl 20(%esp),%ebx
movl 16(%esp),%esi
movl 24(%esp),%edi
jmp L51
.p2align 2
L53:
movl %edx,%ecx
cmpl %edi,%edx
jae L54
movl 4(%esi,%edx,4),%eax
cmpl %eax,(%esi,%edx,4)
jge L54
leal 1(%edx),%ecx
L54:
movl (%esi,%ebx,4),%eax
movl (%esi,%ecx,4),%edx
cmpl %edx,%eax
jge L56
movl %eax,(%esi,%ecx,4)
movl %edx,(%esi,%ebx,4)
movl %ecx,%ebx
jmp L51
.p2align 2
L56:
leal 1(%edi),%ebx
L51:
movl %ebx,%edx
addl %ebx,%edx
cmpl %edi,%edx
jbe L53
popl %ebx
popl %esi
popl %edi
ret
I don't know what you think about it, but in my opinion that's excellent code
for a C compiler. Let's see the code from MSVC 5: (Intel syntax, no labels)
.00403CB0: 53 push ebx
.00403CB1: 56 push esi
.00403CB2: 8B74240C mov esi,[esp][0000C]
.00403CB6: 57 push edi
.00403CB7: 8B7C2418 mov edi,[esp][00018]
.00403CBB: 8BDF mov ebx,edi
.00403CBD: D1EB shr ebx,1
.00403CBF: 740E je .000403CCF -------- (1)
.00403CC1: 57 push edi
.00403CC2: 53 push ebx
.00403CC3: 56 push esi
.00403CC4: E857000000 call .000403D20 -------- (2)
.00403CC9: 83C40C add esp,00C ;""
.00403CCC: 4B dec ebx
.00403CCD: 75F2 jne .000403CC1 -------- (3)
.00403CCF: 85FF test edi,edi
.00403CD1: 761C jbe .000403CEF -------- (4)
.00403CD3: 8B5C2414 mov ebx,[esp][00014]
.00403CD7: 57 push edi
.00403CD8: 53 push ebx
.00403CD9: 56 push esi
.00403CDA: E841000000 call .000403D20 -------- (5)
.00403CDF: 8B0CBE mov ecx,[esi][edi]*4
.00403CE2: 8B06 mov eax,[esi]
.00403CE4: 890E mov [esi],ecx
.00403CE6: 83C40C add esp,00C ;""
.00403CE9: 8904BE mov [esi][edi]*4,eax
.00403CEC: 4F dec edi
.00403CED: 75E8 jne .000403CD7 -------- (1)
.00403CEF: 5F pop edi
.00403CF0: 5E pop esi
.00403CF1: 5B pop ebx
...
.00403D20: 8B542408 mov edx,[esp][00008]
.00403D24: 53 push ebx
.00403D25: 8B5C2410 mov ebx,[esp][00010]
.00403D29: 8D0412 lea eax,[edx][edx]
.00403D2C: 3BC3 cmp eax,ebx
.00403D2E: 7736 ja .000403D66 -------- (1)
.00403D30: 8B4C2408 mov ecx,[esp][00008]
.00403D34: 57 push edi
.00403D35: 56 push esi
.00403D36: 3BC3 cmp eax,ebx
.00403D38: 730C jae .000403D46 -------- (2)
.00403D3A: 8B3481 mov esi,[ecx][eax]*4
.00403D3D: 8B7C8104 mov edi,[ecx][eax]*4[00004]
.00403D41: 3BF7 cmp esi,edi
.00403D43: 7D01 jge .000403D46 -------- (3)
.00403D45: 40 inc eax
.00403D46: 8B3491 mov esi,[ecx][edx]*4
.00403D49: 8B3C81 mov edi,[ecx][eax]*4
.00403D4C: 3BF7 cmp esi,edi
.00403D4E: 7D0A jge .000403D5A -------- (4)
.00403D50: 893481 mov [ecx][eax]*4,esi
.00403D53: 893C91 mov [ecx][edx]*4,edi
.00403D56: 8BD0 mov edx,eax
.00403D58: EB03 jmps .000403D5D -------- (1)
.00403D5A: 8D5301 lea edx,[ebx][00001]
.00403D5D: 8D0412 lea eax,[edx][edx]
.00403D60: 3BC3 cmp eax,ebx
.00403D62: 76D4 jbe .000403D38 -------- (2)
.00403D64: 5E pop esi
.00403D65: 5F pop edi
.00403D66: 5B pop ebx
.00403D67: C3 retn
Now let me show you some details:
Alignment of functions: egcs aligned the functions very well in this case,
but it was just a coincidence! The default setting for x86 (at least for
djgpp) is to align functions using .p2align 2, that's 32-bit alignment.
Now take a look at the MSVC 5 addresses: 00403CB0 and 00403D20. The first is
aligned to a 16-byte boundary and the second to a 32-byte boundary (128 and
256 bits).
When I moved the routines around a little I lost 10% of performance. It looks
like 32-bit alignment is good enough for Pentium CPUs but not for AMD CPUs,
and I think it isn't enough for P6 CPUs either.
I tested with both routines aligned to 256 bits, 32 bits (xxxx4) and 32 bits
(xxxxC). The three configurations give 2.67 on a Pentium 233 MMX CPU (it
benchmarks as a 240 MHz classic Pentium), but when the code was aligned like
xxxx88 and xxxxDC the K6-2 dropped from 4.11 to 3.68. (Note: that's the
equivalent of a 370 MHz P5 dropping to 331 MHz; this CPU is really faster
than a P5 at the same clock.)
Looking at the MSVC 5 code (I did it while disassembling the code to find the
routines) I saw that MSVC 5 aligned all the code of the benchmark (not all
the runtime) to 16- or 32-byte boundaries. I couldn't figure out why some
routines are aligned to 32 and others to 16. Note: I'm not silly, it isn't a
coincidence; some parts have more than 16 bytes wasted for alignment.
My experience optimizing routines by hand for a Cyrix 5x86 120 MHz CPU tells
me that 128-bit alignment is required there. In my opinion that's because the
pipeline bus of the 5x86 CPU is 128 bits wide; additionally, a cache line is
32 bytes. So in my opinion the optimal alignment is somewhere between 128 and
256 bits. The MSVC people seem to think the same ;-).
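(If you want to experiment with function alignment from C, here is a hedged
sketch; I believe newer gcc accepts the aligned attribute on functions, but I
don't think egcs 1.1.1 does, so treat the whole thing as an assumption to
verify against your compiler's docs:)

/* Hedged sketch: ask the compiler for a 32-byte entry point instead
** of relying on the .p2align 2 default. The body is a dummy; the
** alignment request is the point.
*/
__attribute__((aligned(32)))
long sum_longs(long *p,unsigned long n)
{
long s=0;
while(n--) s+=p[n];
return s;
}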
Alignment of jumps: The default for egcs (at least for djgpp) is to use
.p2align 2 (that's 32-bit alignment) for jumps and entry points of loops. I
did tests with 0, 4 bytes and 8 bytes.
On the Pentium MMX CPU the best results are obtained with 0 and 4 bytes
(using 0 the code doesn't get accidentally aligned, it's really misaligned);
in any case the difference is under 2.5%, so it isn't important.
On the K6-2 the best performance is with 8 bytes, but the difference is less
than 2%. Using 4 bytes is really dangerous because (as I found experimentally)
some loop could land at an xxxxC address, and that's the worst case for the
K6 (also for the Cyrix 5x86).
If you look at the MSVC code, it doesn't align jumps or entry points of
loops. In this particular case that's better than using 4 bytes because it
reduces the probability of hitting an xxxxC address.
Aligning here introduces dead code and doesn't help the CPU much.
Inlining: Here egcs didn't inline NumSift inside NumHeapSort just because
the body is defined after the caller. When I forced the routine to be inlined
the code performed worse. I think that's because NumSift is too big for
inlining, but I'm not sure. The fact is that egcs generated worse code and
made worse use of the registers. MSVC didn't inline the functions either.
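(A minimal sketch of the ordering issue, with hypothetical functions of my
own, assuming the gcc 2.x behavior where -O3's automatic inlining only
considers a static function whose body precedes the call site:)

/* twice() is defined before f(), so -O3 (-finline-functions) can
** consider inlining it; if the definition came after f(), an
** egcs-era compiler would just emit a normal call, as it did with
** NumSift here.
*/
static long twice(long x) { return x+x; }

long f(long v) { return twice(v)+1; }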
Code used to align: MSVC uses int 3 between functions; that's much safer
than using plain nop because if you jump through a corrupted function pointer
you get an exception instead of silently executing the wrong function.
Code proximity: Cache designers know this very well: if you are executing a
chunk of code, the probability that you will execute another chunk that's
contiguous is much higher than the probability for other blocks of memory.
EGCS doesn't exploit this and I never expected it to, but I was amazed to
discover that MSVC knows it and rearranges functions so the called function
is close to the caller. The routines really are moved.
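(With newer gcc you can force this kind of proximity by hand; a hedged
sketch using the section attribute, which I believe gcc supports on
functions, though certainly not something I tested on egcs 1.1.1, and the
.text.hot section name is my assumption:)

/* Hedged sketch: placing a hot caller and callee in the same named
** section so the linker keeps them adjacent in memory.
*/
__attribute__((section(".text.hot")))
static long helper(long x)
{ return x+x; }

__attribute__((section(".text.hot")))
long driver(long v)
{ return helper(v)+1; }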
Prolog/epilog: Compare this:
.00403CB0: 53 push ebx
.00403CB1: 56 push esi
.00403CB2: 8B74240C mov esi,[esp][0000C]
.00403CB6: 57 push edi
.00403CB7: 8B7C2418 mov edi,[esp][00018]
.00403CBB: 8BDF mov ebx,edi
.00403CBD: D1EB shr ebx,1
with:
pushl %ebp
pushl %edi
pushl %esi
pushl %ebx
movl 28(%esp),%esi
movl 20(%esp),%edi
movl %esi,%ebx
movl 24(%esp),%ebp
shrl $1,%ebx
Can you see the difference? MSVC is interleaving the push/clobber
instructions. Why? I can see two reasons:
1) This way both pipes of the Pentium get filled. Even though the MSVC code
could still be optimized, it's clearly better than the egcs version.
2) If the code of the function includes a conditional that makes it return
prematurely, the MSVC code will save some pushes and pops.
I think egcs should follow the same rules.
Keeping the execution units busy: It looks like egcs doesn't pay
attention to which instructions can be paired:
movl (%esi),%eax
movl (%edi),%edx
movl %eax,(%edi)
addl $12,%esp [1]
movl %edx,(%esi) [2]
addl $-4,%esi
decl %ebx
jnz L48
The stack adjust (addl $12,%esp) could be moved to make the loop faster and
save one cycle: just exchange [1] and [2] and you get a speedup. MSVC is much
smarter about these details. According to Intel, a compiler can gain between
5% and 25% extra performance by exploiting pairing.
Better choice of registers and indexes: Compare the following
loop:
=> EDI is preloaded with the base of the array
=> EBX is preloaded with top and is the loop variable
leal (%edi,%ebx,4),%esi => ESI is &array[top], I'll call it arr2
.p2align 2
L48:
pushl %ebx
pushl %ebp
pushl %edi
call _NumSift
movl (%esi),%eax => temp2=*arr2
movl (%edi),%edx => temp=*array;
movl %eax,(%edi) => *array=temp2 => *array=*arr2
addl $12,%esp => Stack adjust call _NumSift
movl %edx,(%esi) => *arr2=temp
addl $-4,%esi => arr2--
decl %ebx => i--
jnz L48
with:
=> EDI is preloaded with top and is the loop variable
=> ESI is preloaded with array
.00403CD3: 8B5C2414 mov ebx,[esp][00014] => EBX=bottom
.00403CD7: 57 push edi
.00403CD8: 53 push ebx
.00403CD9: 56 push esi
.00403CDA: E841000000 call .000403D20 -------- (5)
.00403CDF: 8B0CBE mov ecx,[esi][edi]*4 => temp2=array[i]
.00403CE2: 8B06 mov eax,[esi] => temp=*array
.00403CE4: 890E mov [esi],ecx => *array=temp2 => *array=array[i]
.00403CE6: 83C40C add esp,00C ;"" => Stack adjust
.00403CE9: 8904BE mov [esi][edi]*4,eax => array[i]=temp
.00403CEC: 4F dec edi => i--
.00403CED: 75E8 jne .000403CD7 -------- (1)
To refresh your memory, here is the C code:
for(i=top; i>0; --i)
{ NumSift(array,bottom,i);
temp=*array; /* Perform exchange */
*array=*(array+i);
*(array+i)=temp;
}
EGCS really did the wrong thing here; in fact it de-optimized the code. The
MSVC code is clear: it's just the C code arranged to execute well on the
Pentium processor. I think egcs only uses the extra register for the
array[top] pointer because its register choice was wrong in the first place.
MSVC chose the right registers and fully exploits the complex addressing
modes of 386-class CPUs.
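(To make the difference concrete, here is my hedged reconstruction, in C, of
what each compiler effectively executes; arr2 is the name I invented in the
annotations above, and both fragments assume NumHeapSort's local variables
are in scope:)

/* What egcs effectively runs: a second moving pointer (strength
** reduction of array+i) that burns an extra register.
*/
farlong *arr2=array+top;
for(i=top;i>0;--i)
{ NumSift(array,bottom,i);
temp=*array;
*array=*arr2;
*arr2=temp;
--arr2;
}

/* What MSVC runs: the plain indexed form, one register fewer, using
** the [esi+edi*4] addressing mode directly.
*/
for(i=top;i>0;--i)
{ NumSift(array,bottom,i);
temp=*array;
*array=array[i];
array[i]=temp;
}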