Wednesday, December 16, 2009

On moving register

One of the great *cough* things of the IA32 architecture is that it offers multiple ways to encode identical instructions. For instance, the mov instruction is implemented by several opcodes - to support multiple variations of the instruction, for instance mov reg, [mem], mov reg, imm , etc.).
Specifically, opcode 89h (mov Ev, Gv) and opcode 8Bh (mov Gv, Ev) are used to encode mov [mem], reg and mov reg, [mem], respectively. Both are also used to encode mov reg, reg. So which one should we use?

With Microsoft dev tools (MASM, CL), opcode 8B was chosen. This means that mov ebp, esp, this classic compiler idiom used in function prologues, is implemented as 8B EC; the alternate form being 89 E5 by the way.

Yesterday, I was wondering, "Why the heck did MS decide to implement move from register to register this way in their compiler suite? Did they have some Intel guidance here? Did they flip a coin? ...". Anyways, I could not believe this choice was innocent.

I first had a quick look at Intel's optimization manual but couldn't find a hint in it - I may have missed that though.

So I decided to code a simple program to benchmark both instructions. I focused on "mov ebp, esp" since it's what initially this whole thing...

__declspec(naked) __int64 test1(void)
{
__asm
{
rdtsc
push edx
push eax

mov edx, ebp
mov eax, esp

mov ecx, COUNT
dummy:
_emit 0x8B // mov ebp, esp
_emit 0xEC
add esp, 0x1234
dec ecx
jg dummy

mov ebp, edx
mov esp, eax

rdtsc
sub eax, [esp]
sbb edx, [esp+4]
add esp, 8

retn
}
}

test1's counterpart, test2, is exactly the same routine except that the mov is implemented as 89 E5. I ran both of these 10 times with COUNT set to 100 millions. The output below shows the routine duration in CPU tick count for test1 (t1), test2 (t2), and the delta (t2-t1).

266427630, 239098970 -> -27328660
198251880, 236396490 -> 38144610
197485280, 236592800 -> 39107520
198359910, 240731530 -> 42371620
198169630, 236203160 -> 38033530
199283790, 237357990 -> 38074200
198334600, 238307940 -> 39973340
201726060, 237601250 -> 35875190
199196030, 235864980 -> 36668950
197668180, 236878930 -> 39210750
198189520, 236734830 -> 38545310
198769940, 236796360 -> 38026420
198463880, 236996720 -> 38532840
197608520, 235117620 -> 37509100
198994300, 237364120 -> 38369820
198324810, 238987550 -> 40662740
197842630, 236818990 -> 38976360
197126140, 236928430 -> 39802290
198637450, 237262350 -> 38624900
197569520, 235503910 -> 37934390

Omitting the first line (performance likely impacted by the CPU caching the memory area containing both routines), it appears that the first form is about 20% faster than the second. And this is not not stack-register specific as I had the same result with mov eax, edx for instance.

So, using opcode 8B for mov reg, reg seems like a good choice - in practice, I have no clue if this makes a difference during execution of a real program.

I still don't know why this choice is what it is though... if anyone (somebody who worked on a real-world C compiler for instance) has a clue, please leave a comment.

Update: I've just noticed NASM generates the alternate form 89 E5.

As for the guesswork: an assembler, to decide which opcode could be used to implement the desired instruction, can use different approach. A simple one could be to go through the opcodes in sequence (00 though FF, then 0F 00 through 0F FF, etc.), and use the first opcode encountered that fits. In this case, that would be 89. And now that I'm writing this, I realize that the simple x86 assembler I wrote in the past uses this method, and therefore should generate 89 E5 like NASM does. The fact that Microsoft tools do not makes me think it is a deliberate design decision...