Wednesday, December 16, 2009

On moving register

One of the great *cough* things about the IA-32 architecture is that it offers multiple ways to encode identical instructions. For instance, the mov instruction is implemented by several opcodes to support its multiple variations (mov reg, [mem], mov reg, imm, etc.).
Specifically, opcode 89h (mov Ev, Gv) and opcode 8Bh (mov Gv, Ev) are used to encode mov [mem], reg and mov reg, [mem], respectively. Both are also used to encode mov reg, reg. So which one should we use?

With Microsoft dev tools (MASM, CL), opcode 8B was chosen. This means that mov ebp, esp, the classic compiler idiom used in function prologues, is encoded as 8B EC; the alternate form would be 89 E5, by the way.
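As a quick sanity check of those byte values: for register-to-register forms, the ModRM byte is C0h | reg<<3 | r/m, with the reg field holding the destination for 8B and the source for 89. A few lines of Python reproduce the two forms (a toy sketch, not a real assembler):

```python
# register numbers as used in the ModRM byte
EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI = range(8)

def mov_8b(dst, src):
    # opcode 8B (mov Gv, Ev): reg field = destination, r/m field = source
    return [0x8B, 0xC0 | (dst << 3) | src]

def mov_89(dst, src):
    # opcode 89 (mov Ev, Gv): reg field = source, r/m field = destination
    return [0x89, 0xC0 | (src << 3) | dst]

print(['%02X' % b for b in mov_8b(EBP, ESP)])   # ['8B', 'EC']
print(['%02X' % b for b in mov_89(EBP, ESP)])   # ['89', 'E5']
```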

Yesterday, I was wondering, "Why the heck did MS decide to implement move from register to register this way in their compiler suite? Did they have some Intel guidance here? Did they flip a coin? ...". Anyways, I could not believe this choice was innocent.

I first had a quick look at Intel's optimization manual but couldn't find a hint in it - I may have missed that though.

So I decided to write a simple program to benchmark both encodings. I focused on "mov ebp, esp" since it's what started this whole thing...

__declspec(naked) __int64 test1(void)
{
    __asm
    {
        rdtsc               // read start tick count
        push edx
        push eax

        mov edx, ebp        // save ebp/esp
        mov eax, esp

        mov ecx, COUNT
    dummy:
        _emit 0x8B          // mov ebp, esp
        _emit 0xEC
        add esp, 0x1234
        dec ecx
        jg dummy

        mov ebp, edx        // restore ebp/esp
        mov esp, eax

        rdtsc               // read end tick count
        sub eax, [esp]      // subtract the start count pushed earlier
        sbb edx, [esp+4]
        add esp, 8
        ret
    }
}

test1's counterpart, test2, is exactly the same routine except that the mov is encoded as 89 E5. I ran both of them 20 times with COUNT set to 100 million. The output below shows the routine duration in CPU ticks for test1 (t1) and test2 (t2), and the delta (t2-t1).

266427630, 239098970 -> -27328660
198251880, 236396490 -> 38144610
197485280, 236592800 -> 39107520
198359910, 240731530 -> 42371620
198169630, 236203160 -> 38033530
199283790, 237357990 -> 38074200
198334600, 238307940 -> 39973340
201726060, 237601250 -> 35875190
199196030, 235864980 -> 36668950
197668180, 236878930 -> 39210750
198189520, 236734830 -> 38545310
198769940, 236796360 -> 38026420
198463880, 236996720 -> 38532840
197608520, 235117620 -> 37509100
198994300, 237364120 -> 38369820
198324810, 238987550 -> 40662740
197842630, 236818990 -> 38976360
197126140, 236928430 -> 39802290
198637450, 237262350 -> 38624900
197569520, 235503910 -> 37934390

Omitting the first line (the first run pays for bringing both routines into the CPU cache), it appears that the first form is about 20% faster than the second. And this is not stack-register specific: I had the same result with mov eax, edx for instance.
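For what it's worth, the ~20% figure can be recomputed from the table above (first row dropped):

```python
# (t1, t2) tick counts from the runs above, first row dropped
runs = [
    (198251880, 236396490), (197485280, 236592800), (198359910, 240731530),
    (198169630, 236203160), (199283790, 237357990), (198334600, 238307940),
    (201726060, 237601250), (199196030, 235864980), (197668180, 236878930),
    (198189520, 236734830), (198769940, 236796360), (198463880, 236996720),
    (197608520, 235117620), (198994300, 237364120), (198324810, 238987550),
    (197842630, 236818990), (197126140, 236928430), (198637450, 237262350),
    (197569520, 235503910),
]
t1 = sum(r[0] for r in runs) / float(len(runs))
t2 = sum(r[1] for r in runs) / float(len(runs))
print('%.1f%%' % (100.0 * (t2 - t1) / t1))   # the 89 form takes roughly 20% longer
```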

So, using opcode 8B for mov reg, reg seems like a good choice - in practice, though, I have no clue whether this makes any measurable difference when running a real program.

I still don't know why this choice is what it is though... if anyone (somebody who worked on a real-world C compiler for instance) has a clue, please leave a comment.

Update: I've just noticed NASM generates the alternate form 89 E5.

As for the guesswork: to decide which opcode to use for a given instruction, an assembler can take different approaches. A simple one is to go through the opcodes in sequence (00 through FF, then 0F 00 through 0F FF, etc.) and use the first one that fits. In this case, that would be 89. And now that I'm writing this, I realize that the simple x86 assembler I wrote in the past uses this method, and therefore should generate 89 E5 like NASM does. The fact that the Microsoft tools do not makes me think it is a deliberate design decision...
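To illustrate, here is a toy sketch of that first-fit strategy over a hypothetical, heavily truncated opcode table (the table contents are mine, for illustration only):

```python
# hypothetical slice of an opcode table, kept in numeric order
TABLE = [
    (0x88, 'mov Eb, Gb'),
    (0x89, 'mov Ev, Gv'),   # first form able to encode mov reg, reg
    (0x8A, 'mov Gb, Eb'),
    (0x8B, 'mov Gv, Ev'),   # also fits, but comes later in the scan
]

def pick(forms):
    # first-fit: walk the table in order, use the first matching opcode
    for opcode, form in TABLE:
        if form in forms:
            return opcode
    return None

# mov reg, reg matches both 'mov Ev, Gv' and 'mov Gv, Ev'
print('%02X' % pick(['mov Ev, Gv', 'mov Gv, Ev']))   # 89
```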

Saturday, November 21, 2009

Why are GDT descriptors so messed up?

Ever wondered why a GDT descriptor had such a fragmented format? Like anybody born in the 80's, I have.

Here is a 64-bit, standard, non-system, generic GDT segment descriptor:

The base address is fragmented into 2 pieces (low 24 bits, high 8 bits), as is the segment limit (low 16 bits, high 8 bits). Why so?

The answer is, you guessed it, backward compatibility. And the culprit is the 80286. This processor, introduced in 1982, was the first to support Protected Mode (PM). That PM was not exactly the one we know and use nowadays though; it was simpler - sort of a version 0.1, if you will.

The 286 manual here shows us the encoding of a standard GDT descriptor - check figure 6-3. It is 8 bytes in size, but the upper 2 bytes "must be set to 0 for compatibility with iAPX 386". Interesting: so even then, they were envisaging some PM extensions... The meaningful data is contained in the low 6 bytes, forming a non-fragmented descriptor:
- bytes 0-1: 16-bit limit
- bytes 2-4: 24-bit base
- byte 5: flags

So wait. Does it mean a segment can be at max 64Kb? And a 24-bit base means that only 16Mb of memory are physically addressable, right? - you say. Well, yes and yes. It's old-PM, remember.

The new PM version introduced by the 386 brought its lot of improvements, among which:
  • Real 32-bit addressing: an extra byte (the 8th) was added for the base, making it 32 bits long
  • The possibility of 4Gb segments (as opposed to 64Kb...): 4 bits were added to the limit field, making it 20 bits long. Since a 20-bit limit only covers 1Mb, 4 extra attribute bits were also added, including the G(ranularity) bit (pos. 23): when set, the limit is counted in 4-Kb blocks instead of bytes. Therefore, one sets the limit to 0xFFFFF with G=1 to get a 4Gb segment (like all modern OSes do).
Which explains why GDT descriptors seem so messed up...
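To make the reassembly concrete, here is a short Python sketch stitching the fragmented fields back together (field positions as described above, treating the descriptor as a 64-bit integer with byte 0 as the least significant byte; 0x00CF9A000000FFFF is the classic flat 4Gb code descriptor):

```python
def gdt_base(d):
    # low 24 bits in bytes 2-4, high 8 bits in byte 7
    return ((d >> 16) & 0xFFFFFF) | (((d >> 56) & 0xFF) << 24)

def gdt_limit(d):
    # low 16 bits in bytes 0-1, high 4 bits in the low nibble of byte 6
    return (d & 0xFFFF) | (((d >> 48) & 0xF) << 16)

def gdt_g_bit(d):
    # granularity bit: bit 23 of the high dword, i.e. bit 55 of the descriptor
    return (d >> 55) & 1

flat_code = 0x00CF9A000000FFFF   # flat 4Gb code segment: base 0, limit 0xFFFFF, G=1
print(hex(gdt_base(flat_code)), hex(gdt_limit(flat_code)), gdt_g_bit(flat_code))
```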

Another interesting bit introduced in the extra 4 attribute bits is the D/B bit (pos. 22). This bit indicates the default operand size of the segment; setting it to 1 means 32-bit. It was of course 0 in the 286 "6-byte" descriptors. Just one more element showing how the 386, implementing what the 286 lacked (including the paging unit), was the real cornerstone and became a standard.

If you want to know more about this, check out the Wikipedia PM history section as well as the 286 manual mentioned earlier. Also, an interesting piece of trivia about the 286 is how its inability to switch back from protected mode to real mode gave a few guys at Microsoft some very hard (and fun) time!

Friday, November 6, 2009

Detecting simple hypervisors

This was tested only on Intel VT-x but it might work with AMD-V as well. In between a couple of blue screens, I realized there exist many simple ways to detect primitive HVM rootkits - i.e., the ones that don't implement all the code required to achieve super-stealth.

Let me give you a simple example. CPUID is a peculiar instruction: it's the only instruction executable outside ring 0 that unconditionally triggers a VM-exit (Vol. 3B:253669, p. 21-2). If you trace over CPUID in your favorite debugger, the VM-exit happens anyway: the CPU enters VMX root mode, and the VMM is called to handle the offending instruction.

The lazy coder would write something like:

// emulate CPUID on behalf of the guest
if (Reason == EXIT_REASON_CPUID)
{
    mov eax, GuestEax
    mov ecx, GuestEcx
    cpuid
    mov GuestEax, eax
    mov GuestEbx, ebx
    mov GuestEcx, ecx
    mov GuestEdx, edx

    // update instruction pointer (CPUID is 2 bytes long)
    GuestEip += 2
}


Well, that's not enough. Suppose the guest set the trap flag (TF) before executing CPUID: on bare metal, the single-step debug exception would be raised right after CPUID; with the handler above, the pending single-step is lost, and the debug exception only fires after the instruction that follows... and there you have one simple way to detect poorly coded HVMs. One way to fix this on the hypervisor side would be to inject a vectored event (INT1) on VM-entry.

This piece of assembly code implements the above trick (if we may call it so):

; pin the thread to the first CPU
call GetCurrentProcess
push 1
push eax
call SetProcessAffinityMask

; install an SEH handler
push offset seh
push dword ptr fs:[0]
mov fs:[0], esp

; set the trap flag, then execute CPUID
pushfd
or dword ptr [esp], 100h
popfd
cpuid
traphere:
nop

; only reached if no debug exception fired at all
mov eax, 1
jmp done

; SEH handler: where was the debug exception raised?
seh:
mov eax, [esp+0Ch]                  ; CONTEXT record
cmp dword ptr [eax+0B8h], traphere  ; CONTEXT.Eip == traphere?
setne al                            ; 0: bare metal, 1: HVM detected
movzx eax, al

done:
add eax, 30h                        ; '0' or '1'
push eax                            ; the pushed dword doubles as a terminated string
push esp
call crt_printf

push eax
call ExitProcess

(On SMP systems, change the thread affinity to execute the test on other CPUs.)

There are other similar ways to detect a hypervisor from user-mode - again, assuming its implementation is lacking.

Thursday, July 16, 2009

New Mebroot domains for the summer

The latest variant of Mebroot's driver uses an obfuscated domain-generation algorithm to connect to its C&C servers. Here are the domains for this summer; you might want to block them for whatever reason.

The first domain, week-based, is tried first (after the domain hardcoded in the driver itself, though); the second, day-based, is used as a failsafe. On top of '.com', the TLDs '.net' and '.biz' are tried as well.


Monday, March 9, 2009

ProjectEuler problem 144

Project Euler is a website offering math challenges solvable by programming - usually a few tens of lines of code are enough. Last Friday, after more than a year of Euler abstinence, I decided to have a look at it again, and ran across this cool problem (#144, check it out here).

So the goal is to find the number of times the laser beam will hit the ellipse-shaped cell before exiting through the input cavity.
First, let's formulate the problem with simple trigonometry. Have a look at the schema below:

The laser beam comes from A and hits the ellipse at point O. We can calculate the slope of the beam, m0 = (yO-yA)/(xO-xA); we can also calculate the slope of the tangent to the ellipse at O, m1 = -4*xO/yO (differentiate 4*x^2+y^2 = 100). We need to calculate the slope m2 of the reflected beam and deduce B in order to iterate the process (find the next point of impact).

It appears clear that, calling a0, a1, and a2 the angles of the incident beam, the tangent, and the reflected beam (a being the angle of incidence):
a = a0-a1
a = a1+PI-a2

tan(a) = tan(a0-a1)
= (tan(a0)-tan(a1))/(1+tan(a0)*tan(a1))
= (m0-m1)/(1+m0*m1)

tan(a) = tan(a1+PI-a2)
= tan(a1-a2)
= (tan(a1)-tan(a2))/(1+tan(a1)*tan(a2))
= (m1-m2)/(1+m1*m2)

Let's set X = (m0-m1)/(1+m0*m1)
X = (m1-m2)/(1+m1*m2)
-> m2 = (m1-X)/(1+X*m1)

We now have the slope m2 of the reflected beam (OB). With the coordinates of O (xO,yO), we have the equation of the line:
y = a*x+b, with a=m2 and b=yO-m2*xO

The last step is to calculate the coordinates of B. We have:
- the equation of the ellipse: 4*x^2+y^2 = 100
- the equation of (OB): y = m2*x+b

It boils down to solving the quadratic:
x^2*(4+a^2) + x*(2*a*b) + (b^2-100) = 0

And then we get the coordinates of B. After 300-and-something iterations (nah, I wouldn't spoil your Euler fun) you'll find out the reflected beam falls onto the exit hole. Problem solved!
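For the record, the iteration above translates to a few lines of Python. This is a sketch following this post's derivation: the entry point A = (0.0, 10.1) and the first impact point (1.4, -9.6) come from the problem statement, and the next impact's x is obtained from the sum of the quadratic's roots, one of which is the current x (which avoids picking the wrong root and dividing by small numbers).

```python
from math import fabs

def count_hits():
    # beam enters at A = (0.0, 10.1); first impact at (1.4, -9.6)
    x, y = 1.4, -9.6
    m = (y - 10.1) / (x - 0.0)          # slope m0 of the incoming beam
    hits = 1
    while True:
        m1 = -4.0 * x / y               # tangent slope at the impact point
        X = (m - m1) / (1.0 + m * m1)   # tan of the incidence angle
        m2 = (m1 - X) / (1.0 + X * m1)  # slope of the reflected beam
        b = y - m2 * x
        # 4x^2 + (m2*x+b)^2 = 100: one root is x, use the sum of roots
        xn = -2.0 * m2 * b / (4.0 + m2 * m2) - x
        yn = m2 * xn + b
        if yn > 0 and fabs(xn) <= 0.01: # exits through the hole at the top
            return hits
        hits += 1
        x, y, m = xn, yn, m2
```

No output printed here - call count_hits() yourself if you want the answer spoiled.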

Wednesday, March 4, 2009

Sudoku solver

Nothing new or original here.
This Sudoku solver just aims to show how compact Python code can be...
The core of this piece of code is less than 40 lines of code. And we could do much better than that.
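The snippet itself isn't embedded in this post; as an illustration, here is a sketch of what such a compact solver can look like - a plain backtracker, around 20 lines (names and structure are mine, not necessarily the original's):

```python
def solve(puzzle):
    # puzzle: 81-char string, '0' (or '.') marks an empty cell
    g = [int(c) if c.isdigit() else 0 for c in puzzle]

    def ok(i, v):
        # can value v go in cell i without clashing in row, column or box?
        r, c = divmod(i, 9)
        for j in range(81):
            if g[j] == v and (j // 9 == r or j % 9 == c or
                              (j // 27 == i // 27 and j % 9 // 3 == c // 3)):
                return False
        return True

    def rec():
        # backtrack over the first empty cell
        if 0 not in g:
            return True
        i = g.index(0)
        for v in range(1, 10):
            if ok(i, v):
                g[i] = v
                if rec():
                    return True
                g[i] = 0
        return False

    return ''.join(map(str, g)) if rec() else None
```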

Saturday, February 14, 2009

Real-time stock quotes Python API

I've put up a handy script on my Google code area right here.

This Python module works as a very simple interface to get stock quotes from the French stock website Boursorama. You can get securities from any trading place, provided you know their ISIN of course.

The cool stuff about it is that it gives you real-time quotes (yes, you avoid the 15/20-min delay Yahoo and co. give you).

Here's how to use it:

>>> from pybourso import get_stock
>>> e = get_stock('US38259P5089')
>>> print e['name'], e['value'], e['variation']

GOOGLE INC, 357.66, -1.49

Check out the source to see all that's returned by get_stock(). Enjoy!

Monday, February 9, 2009


csheap

csheap is a customizable dynamic memory allocator written in C. It was originally written as a replacement for the standard libc malloc and its Windows runtime counterpart.

I've put it up on Google Code here. Enjoy.

Thursday, January 8, 2009

pushfd and popfd are slow...

...damn, damn slow!
Use lahf/sahf instead (bearing in mind they only transfer the low byte of EFLAGS - SF, ZF, AF, PF, CF - so the overflow flag is not covered).
A few numbers later on - if I have time.