Monday, November 26, 2012
JEB should be out late December/early Jan. Check out www.android-decompiler.com to get all the details.
Monday, November 28, 2011
Last night was fun. I found a USB stick I hadn't used in years and couldn't resist browsing its contents. My excitement rose when I retrieved – among various other juicy and less juicy pieces of data, mostly dating back to the college era – an archive containing many old C programs. I stumbled upon this piece of code /rambling_begins I wrote at a time when my friends and I were venerating Brian Greene et al. and our bible was The Elegant Universe. Eventually, my friends went to study physics but did not major in cosmology. Applied physics pays the bills. String theory physicists are starving to death nowadays! /rambling_over.
So the idea was to visually represent lemniscates.
Lemniscates and their derivatives are fascinating mathematical objects. (Their construction was appealing because it resembles the ellipse's.)
Remember that an ellipse is defined by its two centers C0, C1 and the set of points P such that dist(C0,P)+dist(C1,P) is constant:
Image (c) Wikipedia
A lemniscate is also defined by its two centers C0, C1 and a set of points P, but this time d=dist(C0,P)*dist(C1,P) must be constant. The most famous lemniscate is Bernoulli's, where d is set to (dist(C0,C1)/2)^2. This produces … the infinity symbol:
Generalized, these lemniscates are formally called Cassini’s ovals, and we can intuitively guess that greater values of d generate what look like ellipses, whereas smaller values of d generate pairs of pseudo-circles:
Back in the day, I didn't know these objects had real names – not to mention their use in electrostatic field analysis. I was experimenting and simply called them generalized lemniscates. Anyway, below are a few screenshots with 2 centers, 3 centers, 6 centers, and a couple of zoomed-in areas.
The trick to get these cool gradients is to rely on the RGB implementation used by most programming frameworks. The (Red, Green, Blue) tuple is represented by 3 bytes stored in a 32-bit dword. (The high byte can be the alpha channel, but is irrelevant in the case of raw colors.) We define d=dist(C0,P)*dist(C1,P)*…*dist(Cn,P) and render each point P as a pixel whose RGB value is directly proportional to d (ie, a number between 0 and 0FFFFFFh).
The yellow-ish separations mean that the Blue byte is about to be incremented. Of course, this depends on the number of centers and how close they are to each other: these initial conditions determine the range of values d can take within the window. With more than 10 centers, very close to each other, it takes a while to get to Blue=1; instead we can see increments of Green (at least the first 4 are clearly marked out).
I’ve attached the source and binary so you can experiment yourself. (Sorry for the Linux folks out there, it’s Windows C…)
Thursday, November 3, 2011
Edit: Also, I've decided to join the Twitter crowd. You can follow me @NicolasFalliere. Not a lot of tweets at the moment, I'm still in the discovery phase :)
Tuesday, January 19, 2010
The old tables (NT, 2K, XP, etc.) were generated based on the MetaSploit System Call Tables page, so thanks to them for that.
Edit: here's a diff with Vista SP0 (a diff with SP2 would be more valuable, will do eventually):
Sunday, January 10, 2010
Quick reminder: Windows runs in protected (or long/IA64) mode with paging enabled, with or without Physical Address Extension (PAE). This post focuses on a 32-bit kernel with PAE disabled, for clarity.
The PDE/PTE structures for 4Kb pages are (from Intel manual):
The PDE structure for 4Mb pages is:
The conversion to a physical address is done internally by the hardware; no instruction exposes a virtual-to-physical address conversion to software. When the operating system uses paging, the key to the conversion lies in CR3. This register contains the physical address of the page-directory entries table.
In order to do a virtual-to-physical conversion programmatically, one needs to know where the page translation tables are located in virtual memory. When created, these pages are referenced only by their physical addresses, stored in the page-directory or page-table entries (see pictures above). This chicken-and-egg problem is solved by Windows (and most likely other x86 OSs) by reserving a range in the kernel address space for all lowest-level page-description pages (PTEs for 4Kb pages, PDEs for 4Mb pages).
Let's examine how MmGetPhysicalAddress is implemented in the simplest version of a Windows XP SP3 kernel (32-bit, PAE disabled):
.text:0042E046 ; unsigned __int64 __stdcall MmGetPhysicalAddress(unsigned int BaseAddress)
.text:0042E046 BaseAddress = dword ptr 8
.text:0042E046 mov edi, edi
.text:0042E048 push ebp
.text:0042E049 mov ebp, esp
.text:0042E04B push esi
.text:0042E04C mov esi, [ebp+BaseAddress]
.text:0042E04F mov eax, esi
.text:0042E051 shr eax, 14h
.text:0042E054 and eax, 0FFCh
.text:0042E059 mov ecx, [eax-3FD00000h]
.text:0042E05F mov eax, ecx
.text:0042E061 and ax, 81h
.text:0042E065 cmp al, 81h ;present? page size?
.text:0042E067 jnz short 4Kbpage
.text:0042E069 mov eax, esi ;4Mb page
.text:0042E06B shr eax, 0Ch
.text:0042E06E and eax, 3FFh
.text:0042E073 shr ecx, 0Ch
.text:0042E076 add eax, ecx
.text:0042E078 convert:
.text:0042E078 xor ecx, ecx
.text:0042E07A shld ecx, eax, 0Ch
.text:0042E07E shl eax, 0Ch
.text:0042E081 and esi, 0FFFh
.text:0042E087 add eax, esi
.text:0042E089 mov edx, ecx ;0
.text:0042E08B done:
.text:0042E08B pop esi
.text:0042E08C pop ebp
.text:0042E08D retn 4
.text:0042E090 4Kbpage:
.text:0042E090 test cl, 1 ;present?
.text:0042E093 jz short error
.text:0042E095 mov eax, esi
.text:0042E097 shr eax, 0Ah
.text:0042E09A and eax, 3FFFFCh
.text:0042E09F sub eax, 40000000h
.text:0042E0A4 mov eax, [eax]
.text:0042E0A6 test al, 1 ;PTE present?
.text:0042E0A8 jz short error
.text:0042E0AA shr eax, 0Ch
.text:0042E0AD jmp short convert
.text:0042E0AF error:
.text:0042E0AF xor eax, eax
.text:0042E0B1 xor edx, edx
.text:0042E0B3 jmp short done
.text:0042E0B3 _MmGetPhysicalAddress@4 endp
Remember the VA is decomposed into 3 or 2 parts:
- 3 parts for 4Kb pages: [10 bits: PDE index | 10 bits: PTE index | 12 bits: page offset]
- 2 parts for 4Mb pages: [10 bits: PDE index | 22 bits: page offset]
The code first gets the PDE index*4, which is the PDE offset relative to CR3, since PDE entries are 4 bytes long. This value is added to -3FD00000. The PDE offset being in [0,FFC], the result will be in [C0300000,C0300FFC]. Now, CR3 points to the first PDE, which means that physical_to_virtual(CR3)=C0300000, for all processes. The range [C0300000,C0301000[ contains the 0x400 PDEs.
A comparison then checks whether the page is present (bit 0); the function returns 0 if it is not. The comparison also checks bit 7: if set, this bit indicates a 4Mb page, and a different, simpler code branch is executed.
For a 4Kb PDE, the top 20 bits of the VA (PDE index and PTE index combined) are extracted and multiplied by 4, giving a PTE offset in [0,3FFFFC]. The offset is added to -40000000. The resulting value is in [C0000000, C03FFFFC]. This means the PTEs are in the range [C0000000, C0400000[. It's important to understand that this range is "reserved"; only a handful of these pages are actually mapped to physical ones, as explained below.
(The physical address is then calculated by extracting the page offset part of the VA (bottom 12 or 22 bits) and adding it to the physical address of the lowest-level page in the translation hierarchy.)
What's interesting in this scheme is the address range used. Let's consider the 3-level hierarchy (4Kb pages). CR3 "points" to C0300000, ie the first PDE is at C0300000. The PTEs go from C0000000 to C0400000: the PDE range overlaps the PTE range! And not just anywhere: exactly at the three-quarter mark of the range, which makes sense since the last quarter of a 32-bit address space also starts at C0000000. This is not random, of course: the PDEs are themselves referenced as PTEs, which is what lets the processor access the [C0000000, C0400000[ range!
This may seem a bit obscure. It's funny how explaining 30 lines of smart assembly can be so tricky... The thing to remember is that the CPU offers NO facility for software to do virtual-to-physical conversion. But for the kernel to access the pages that allow the CPU to do this conversion internally, they must be accessible in virtual memory; and to be accessible in virtual memory, they must be referenced by themselves. This self-reference mechanism allows the implementation of MmGetPhysicalAddress.
Quick lab experiment. Fire up WinDbg, local kernel debugging:
lkd> !process 0 0 System
PROCESS 81bcc830 SessionId: none Cid: 0004 Peb: 00000000 ParentCid: 0000
DirBase: 00039000 ObjectTable: e1000cc0 HandleCount: 244.
DirBase is the physical address of the first PDE (loaded into CR3).
We can confirm that by checking the field in the associated EPROCESS structure:
lkd> dt _EPROCESS Pcb.DirectoryTableBase 81bcc830
+0x000 Pcb :
+0x018 DirectoryTableBase :  0x39000
Now, let's get the physical address of C0300000. We use !vtop, with the PFN of the directory base for the process (39000 >> 12 = 39):
lkd> !vtop 39 c0300000
Pdi 300 Pti 300
c0300000 00039000 pfn(00039)
The result is 39000, which confirms that C0300000 maps the PDEs.
You can check it for other processes, for instance WinDbg itself:
lkd> !process 0 0 windbg.exe
PROCESS 81ad8410 SessionId: 0 Cid: 049c Peb: 7ffd7000 ParentCid: 028c
DirBase: 00ae9000 ObjectTable: e1c446b0 HandleCount: 614.
lkd> !vtop ae9 c0300000
Pdi 300 Pti 300
c0300000 00ae9000 pfn(00ae9)
Wednesday, December 16, 2009
Specifically, opcode 89h (mov Ev, Gv) and opcode 8Bh (mov Gv, Ev) are used to encode mov [mem], reg and mov reg, [mem], respectively. Both are also used to encode mov reg, reg. So which one should we use?
With Microsoft dev tools (MASM, CL), opcode 8B was chosen. This means that mov ebp, esp, the classic compiler idiom used in function prologues, is encoded as 8B EC; the alternate form being 89 E5, by the way.
Yesterday, I was wondering: "Why the heck did MS decide to encode the register-to-register move this way in their compiler suite? Did they have some Intel guidance here? Did they flip a coin? ..." Anyway, I could not believe this choice was innocent.
I first had a quick look at Intel's optimization manual but couldn't find a hint in it - I may have missed that though.
So I decided to code a simple program to benchmark both instructions. I focused on "mov ebp, esp" since it's what started this whole thing...
__declspec(naked) __int64 test1(void) {
    __asm {
        rdtsc                   ; start tick count in edx:eax
        push edx
        push eax
        mov edx, ebp            ; save ebp
        mov eax, esp            ; save esp
        mov ecx, COUNT          ; COUNT = 100000000
again:
        _emit 0x8B              ; mov ebp, esp, encoded as 8B EC
        _emit 0xEC
        add esp, 0x1234
        mov ebp, edx            ; restore ebp
        mov esp, eax            ; restore esp
        dec ecx
        jnz again
        rdtsc                   ; end tick count
        sub eax, [esp]          ; delta = end - start, in edx:eax
        sbb edx, [esp+4]
        add esp, 8
        ret
    }
}
test1's counterpart, test2, is exactly the same routine except that the mov is implemented as 89 E5. I ran both of these 20 times with COUNT set to 100 million. The output below shows the routine duration in CPU tick count for test1 (t1), test2 (t2), and the delta (t2-t1).
266427630, 239098970 -> -27328660
198251880, 236396490 -> 38144610
197485280, 236592800 -> 39107520
198359910, 240731530 -> 42371620
198169630, 236203160 -> 38033530
199283790, 237357990 -> 38074200
198334600, 238307940 -> 39973340
201726060, 237601250 -> 35875190
199196030, 235864980 -> 36668950
197668180, 236878930 -> 39210750
198189520, 236734830 -> 38545310
198769940, 236796360 -> 38026420
198463880, 236996720 -> 38532840
197608520, 235117620 -> 37509100
198994300, 237364120 -> 38369820
198324810, 238987550 -> 40662740
197842630, 236818990 -> 38976360
197126140, 236928430 -> 39802290
198637450, 237262350 -> 38624900
197569520, 235503910 -> 37934390
Omitting the first line (performance likely impacted by the CPU caching the memory area containing both routines), it appears that the first form is about 20% faster than the second. And this is not stack-register specific, as I had the same result with mov eax, edx for instance.
So, using opcode 8B for mov reg, reg seems like a good choice - in practice, I have no clue if this makes a difference during execution of a real program.
I still don't know why this choice is what it is though... if anyone (somebody who worked on a real-world C compiler for instance) has a clue, please leave a comment.
Update: I've just noticed NASM generates the alternate form 89 E5.
As for the guesswork: to decide which opcode should encode the desired instruction, an assembler can use different approaches. A simple one is to go through the opcodes in sequence (00 through FF, then 0F 00 through 0F FF, etc.), and use the first opcode encountered that fits. In this case, that would be 89. And now that I'm writing this, I realize that the simple x86 assembler I wrote in the past uses this method, and therefore should generate 89 E5 like NASM does. The fact that Microsoft tools do not makes me think it is a deliberate design decision...
Saturday, November 21, 2009
Here is a 64-bit, standard, non-system, generic GDT segment descriptor:
The base address is fragmented into 2 pieces (low 24 bits, high 8 bits), as is the segment limit (low 16 bits, high 8 bits). Why so?
The answer is, you guessed it, backward compatibility. And the guilty party is the 80286. This processor was introduced in 1982, and was the first to support Protected Mode (PM). That PM was not exactly the one we know and use nowadays, though; it was simpler, sort of a version 0.1 if you will.
The 286 manual here shows us the encoding of a standard GDT descriptor - check figure 6-3. Its size is 8 bytes, but the upper 2 bytes "must be set to 0 for compatibility with iAPX 386". Interesting; so even then, they were envisaging PM extensions... The meaningful data is contained in the low 6 bytes, forming a non-fragmented descriptor:
- bytes 0-1: 16-bit limit
- bytes 2-4: 24-bit base
- byte 5: flags
So wait, you say. Does that mean a segment can be at most 64Kb? And a 24-bit base means that only 16Mb of memory are physically addressable, right? Well, yes and yes. It's old-PM, remember.
The new version introduced by the 386 has its lot of improvements, among which:
- Real 32-bit addressing: an extra byte for the base (8th position) was added, making it 32 bits long
- The possibility of 4Gb segments (as opposed to 64Kb...): 4 bits were added to the limit field, making it 20 bits. Since 20 bits can only express 1Mb, 4 extra attribute bits were also added, including the G bit (pos.23). Setting the G(ranularity) bit means the limit field counts 4Kb blocks instead of bytes. Therefore, one sets the limit to 0xFFFFF with G=1 to get a 4Gb segment (like all modern OSes do).
Another interesting bit introduced in the extra 4 attribute bits is the D/B bit (pos.22). This bit indicates the default operand size of the segment; setting it to 1 means 32-bit. It was of course set to 0 for the 286 "6-byte" descriptors. Just one more element showing how the 386 was the real cornerstone: it implemented the things the 286 lacked (including the paging unit) and became the standard.
If you want to know more about this, check out the Wikipedia PM history section as well as the 286 manual mentioned earlier. Also, an interesting bit of trivia on the 286 is how the inability to switch back from protected mode to real mode gave a few guys at Microsoft a very hard (and fun) time!