MiscellaneouZ

Monday, November 26, 2012

New venture - Android decompiler

I'm currently working on JEB, an interactive Android app decompiler for security analysts and reverse engineers.

JEB should be out late December/early Jan. Check out www.android-decompiler.com to get all the details.

Monday, November 28, 2011

Visualizing lemniscates and Cassinian curves

Last night was fun. I found a USB stick I haven't used in years and couldn’t resist browsing its content. My excitement rose when I retrieved - among various other juicy and less juicy pieces of data, mostly dating back from the college era – an archive containing many old C programs. I stumbled upon this piece of code /rambling_begins I wrote at a time my friends and I were venerating Brian Greene et al. and our bible was the Elegant Universe. Eventually, my friends went to study physics but did not major in cosmology. Applied physics pay the bills. String theory physicists are starving to death nowadays! /rambling_over.

So the idea was to visually represent lemniscates.

Lemniscates and its derivatives are fascinating mathematical objects. (Their construction was appealing because it resembled the ellipse's.)

Remember that an ellipse is defined by its two centers C0, C1 and the set of points P such that dist(C0,P)+dist(C1,P) is constant:

Image (c) Wikipedia

A lemniscate is also defined by its two centers C0, C1 and a set of points P, but this time d=dist(C0,P)*dist(C1,P) must be constant. The most famous lemniscate is Bernouilli’s, where d is set to sqrt(dist(C0,C1)/2). This produces … the infinite sign:

Image (c) Mathcurve.com

Generalized, these lemniscates are formally called Cassini’s ovals, and we can intuitively guess that greater values of d generate what look like ellipses, whereas smaller values of d generate pairs of pseudo-circles:

Image (c) Mathcurve.com

Now the last step was to generalize that to 3, 4 and more centers. The formal name of these beauties is Cassinian curves, and they look like that:

Image (c) Mathcurve.com

Back in the days, I didn’t know these objects had real names – not to mention their use in electrostatic fields analysis. I was experimenting and simply called them generalized lemnicates. Anyway, below are a few of screenshots with 2 centers, 3 centers, 6 centers, and a couple of zoomed-in areas.

The trick to get these cool gradients is to rely on the RGB implementation used by most programming frameworks. The (Red, Green, Blue) tuple is represented by 3 bytes stored in a 32-bit dword. (The high-byte can be the alpha channel, but is irrelevant in the case of raw colors.) If we define d=dist(C0,P)*dist(C1,P)*…*dist(Cn,P) and represent d as a colored pixel whose RGB value is directly proportional to d (ie, a number between 0 and 0FFFFFFh).

The yellow-ish separations mean that the Blue byte is about to be incremented. Of course, this depends on the number of centers and how close they are to each other: these initial conditions yield the range of value d can have within the window. With more 10 centers, very close to each other, it takes a while to get to Blue=1, therefore we can see increments of Green (at least the first 4 are clearly marked out).

I’ve attached the source and binary so you can experiment yourself. (Sorry for the Linux folks out there, it’s Windows C…)

Thursday, November 3, 2011

Raasta

I haven't posted in a very long while, mainly due to time and professional constraints. This will be fixed eventually.

Meanwhile, this atypical post is meant to let my (many, many) readers know that I recently released my first Android app: Raasta is a powerful, battery-friendly GPS tracker designed to fully operate in areas without data reception. You can record your progress and geolocate yourself during hikes, roadtrips or explorations. It's packed with features, including editable landmarks, custom GPS periods, KML exports, GMap overlay, etc.

Check out the full description and download the app here: https://market.android.com/details?id=com.pnfsoftware.raasta

Best of all: it's entirely free!

Edit: Also, I've decided to join the Twitter crowd. You can follow me @NicolasFalliere. Not a lot of tweets at the moment, I'm still in the discovery phase :)

Tuesday, January 19, 2010

Windows 7 and XP SP3 system call tables

A short post to let you know that I've uploaded system call tables for Windows Seven SP0 and Windows XP SP3 here:

The Python and C versions can easily be integrated into projects.

The old tables (NT, 2K, XP, etc.) were generated based on the MetaSploit System Call Tables page, so thanks to them for that.

Edit: here's a diff with Vista SP0 (a diff with SP2 would be more valuable, will do eventually):

Syscalls added:

NtAllocateReserveObject
NtAlpcRevokeSecurityContext
NtCreateKeyTransacted
NtCreateProfileEx
NtCreateUserProcess
NtDisableLastKnownGood
NtDrawText
NtEnableLastKnownGood
NtEnumerateTransactionObject
NtNotifyChangeSession
NtOpenKeyEx
NtOpenKeyTransacted
NtOpenKeyTransactedEx
NtQuerySecurityAttributesToken
NtQuerySystemInformationEx
NtQueueApcThreadEx
NtReplacePartitionUnit
NtSerializeBoot
NtSetIoCompletionEx
NtSetTimerEx
NtUmsThreadYield

Syscalls removed:

NtPullTransaction
NtGetMUILicenseInfo
NtClearMUILicenseInfo
NtRequestWakeupLatency
NtRollbackSavepointTransaction
NtClearAllSavepointsTransaction
NtClearSavepointTransaction
NtRequestDeviceWakeup
NtSavepointComplete
NtStartTm
NtCancelDeviceWakeupRequest
NtMarshallTransaction
NtListTransactions
NtSavepointTransaction

Seems like there's plenty to investigate (NtCreateUserProcess, NtSerializeBoot, NtReplacePartitionUnit, NtQuerySystemInformationEx, NtQueueApcThreadEx, ...).

Sunday, January 10, 2010

MmGetPhysicalAddress implementation tricks

MmGetPhysicalAddress is a kernel-exported API that allows converting a virtual address to a physical one.

Quick reminder: Windows runs in protected (or long/IA64) mode with paging enabled, plus/minus physical address extensions enabled. The post focus on a 32-bit kernel with PAE disabled for enhanced clarity.

The PDE/PTE structures for 4Kb pages are (from Intel manual):

The PDE structure for 4Mb pages is:

The conversion to a physical address is done internally by the hardware. No instruction implements a physical-to-virtual address conversion. When the operating system uses paging, the key to physical conversion lies in CR3. This register contains the physical address to the page-directory entries table.

In order to do a virtual-to-physical converion programmatically, one needs to know where the page translation tables are located in virtual memory. When created, these pages are referenced only by their physical addresses, stored in the page-directory or page-table entries (see pictures above). This chicken&egg problem is solved by Windows (and most likely other x86 OSs) by reserving a range in the kernel address space for all lowest-level page-description pages (PTEs for 4Kb pages, PDEs for 4Mb pages).

Let's examine how MmGetPhysicalAddress is implemented in the simplest version of a Windows XP SP3 kernel (32-bit, PAE disabled):

.text:0042E046 ; unsigned __int64 __stdcall MmGetPhysicalAddress(unsigned int BaseAddress)
.text:0042E046
.text:0042E046 BaseAddress     = dword ptr  8
.text:0042E046
.text:0042E046                 mov     edi, edi
.text:0042E048                 push    ebp
.text:0042E049                 mov     ebp, esp
.text:0042E04B                 push    esi
.text:0042E04C                 mov     esi, [ebp+BaseAddress]
.text:0042E04F                 mov     eax, esi
.text:0042E051                 shr     eax, 14h
.text:0042E054                 and     eax, 0FFCh
.text:0042E059                 mov     ecx, [eax-3FD00000h]
.text:0042E05F                 mov     eax, ecx
.text:0042E061                 and     ax, 81h
.text:0042E065                 cmp     al, 81h ;present? page size?
.text:0042E067                 jnz     short 4Kbpage
.text:0042E067
.text:0042E069                 mov     eax, esi ;4Mb page
.text:0042E06B                 shr     eax, 0Ch
.text:0042E06E                 and     eax, 3FFh
.text:0042E073                 shr     ecx, 0Ch
.text:0042E076                 add     eax, ecx
.text:0042E076
.text:0042E078
.text:0042E078 convert:
.text:0042E078                 xor     ecx, ecx
.text:0042E07A                 shld    ecx, eax, 0Ch
.text:0042E07E                 shl     eax, 0Ch
.text:0042E081                 and     esi, 0FFFh
.text:0042E087                 add     eax, esi
.text:0042E089                 mov     edx, ecx ;0
.text:0042E089
.text:0042E08B
.text:0042E08B done:
.text:0042E08B                 pop     esi
.text:0042E08C                 pop     ebp
.text:0042E08D                 retn    4
.text:0042E090
.text:0042E090 4Kbpage:
.text:0042E090                 test    cl, 1 ;present?
.text:0042E093                 jz      short error
.text:0042E093
.text:0042E095                 mov     eax, esi
.text:0042E097                 shr     eax, 0Ah
.text:0042E09A                 and     eax, 3FFFFCh
.text:0042E09F                 sub     eax, 40000000h
.text:0042E0A4                 mov     eax, [eax]
.text:0042E0A6                 test    al, 1 ;PTE present?
.text:0042E0A8                 jz      short error
.text:0042E0A8
.text:0042E0AA                 shr     eax, 0Ch
.text:0042E0AD                 jmp     short convert
.text:0042E0AF
.text:0042E0AF error:
.text:0042E0AF                 xor     eax, eax
.text:0042E0B1                 xor     edx, edx
.text:0042E0B3                 jmp     short done
.text:0042E0B3
.text:0042E0B3 _MmGetPhysicalAddress@4 endp

Remember the VA is decomposed into 3 or 2 parts:
- 3 parts for 4Kb pages: [10bits=PDE index / 10bits=PTE index / 12bit=Page offset]
- 2 parts for 4Mb pages: [10bits=PDE index / 22bits=Page offset]

The code first gets the PDE index*4, which is the PDE offset relative to CR3 since PDE entries are 4-byte long. This value is added to -3FD00000. The PDE offset being in [0,FFC], the result will be in [C0300000,C0300FFC]. Now, the first PDE is pointed by CR3. Which means that physical_to_virtual(CR3)=C0300000, for all processes. The range [C0300000,C0301000[ contains the 0x400 PDEs.

A comparison then checks if the page is present or not (bit0) and the function returns 0 if the page is not present. The comparison also checks for bit7; if set, this bit indicates a 4Mb page and a different, simpler code branch is executed.

For a 4Kb PDE, the next 10 bits of the VA are extracted, then made a PTE offset in [0,FFC]. The offset is added to -40000000. The resulting value is in [C0000000, C03FFFFC]. This means the PTEs are in the range [C0000000, C0400000[. It's important to understand that this range is "reserved"; only a handful of these pages are actually mapped to physical ones, as explained below.

(The physical address is then calculated by extracting the page offset part of the VA (bottom 12 or 22 bits) and adding it to the physical address of the lowest-level page in the translation hierarchy.)

What's interesting in this scheme is the address range used. Let's consider a 3-level hierarchy (4Kb pages). CR3 "points" to C0300000, ie the first PDE is at C0300000. The PTEs go from C0000000 to C0400000: The PDE range overlaps the PTE range! And not anywhere, exactly at the 3/4th of this range; which makes sense since the 3/4th of a 32-bit address space also start at C0000000. This is not random of course: the PDEs are themselves referenced by the PTEs to allow the processor to access the [C0000000, C0400000[ range!

This may seem a bit obscure, plus my explanations here are pretty poor. It's funny how explaining 30 lines of smart assembly can be so tricky... The thing to remember to understand this is that the CPU offers NO facility to do physical to virtual conversion. But to let access the kernel access the pages that allow the CPU to do this conversion internally, they must be accessible in virtual memory. And to be accessible in virtual memory, they must be referenced by themselves. This self-reference mechanism allows the implemention of MmGetPhysicalAddress.

Quick lab experiment. Fire up WinDbg, local kernel debugging:

lkd> !process 0 0 System
PROCESS 81bcc830 SessionId: none Cid: 0004 Peb: 00000000 ParentCid: 0000
DirBase: 00039000 ObjectTable: e1000cc0 HandleCount: 244.
Image: System

DirBase is the physical address of the first PDE (loaded into CR3).
We can confirm that by check the field in the associated EPROCESS structure:

lkd> dt _EPROCESS Pcb.DirectoryTableBase 81bcc830
+0x000 Pcb :
+0x018 DirectoryTableBase : [2] 0x39000

Now, let's get the physical address of C0300000. We use !vtop, with the PFN page for the process (39000 >> 12):

lkd> !vtop 39 c0300000
Pdi 300 Pti 300
c0300000 00039000 pfn(00039)

The result is 39000, which confirms that C0300000 maps the PDEs.

You can check it for other processes, for instance WinDbg itself:

lkd> !process 0 0 windbg.exe
PROCESS 81ad8410 SessionId: 0 Cid: 049c Peb: 7ffd7000 ParentCid: 028c
DirBase: 00ae9000 ObjectTable: e1c446b0 HandleCount: 614.
Image: windbg.exe

lkd> !vtop ae9 c0300000
Pdi 300 Pti 300
c0300000 00ae9000 pfn(00ae9)

Wednesday, December 16, 2009

On moving register

One of the great *cough* things of the IA32 architecture is that it offers multiple ways to encode identical instructions. For instance, the mov instruction is implemented by several opcodes - to support multiple variations of the instruction, for instance mov reg, [mem], mov reg, imm , etc.).
Specifically, opcode 89h (mov Ev, Gv) and opcode 8Bh (mov Gv, Ev) are used to encode mov [mem], reg and mov reg, [mem], respectively. Both are also used to encode mov reg, reg. So which one should we use?

With Microsoft dev tools (MASM, CL), opcode 8B was chosen. This means that mov ebp, esp, this classic compiler idiom used in function prologues, is implemented as 8B EC; the alternate form being 89 E5 by the way.

Yesterday, I was wondering, "Why the heck did MS decide to implement move from register to register this way in their compiler suite? Did they have some Intel guidance here? Did they flip a coin? ...". Anyways, I could not believe this choice was innocent.

I first had a quick look at Intel's optimization manual but couldn't find a hint in it - I may have missed that though.

So I decided to code a simple program to benchmark both instructions. I focused on "mov ebp, esp" since it's what initially this whole thing...


__declspec(naked) __int64 test1(void)
{
__asm
{
rdtsc
push    edx
push    eax

mov     edx, ebp
mov     eax, esp

mov     ecx, COUNT
dummy:
_emit   0x8B    // mov ebp, esp
_emit   0xEC
add     esp, 0x1234
dec     ecx
jg      dummy

mov     ebp, edx
mov     esp, eax

rdtsc
sub     eax, [esp]
sbb     edx, [esp+4]
add     esp, 8

retn
}
}

test1's counterpart, test2, is exactly the same routine except that the mov is implemented as 89 E5. I ran both of these 10 times with COUNT set to 100 millions. The output below shows the routine duration in CPU tick count for test1 (t1), test2 (t2), and the delta (t2-t1).


266427630, 239098970 -> -27328660
198251880, 236396490 -> 38144610
197485280, 236592800 -> 39107520
198359910, 240731530 -> 42371620
198169630, 236203160 -> 38033530
199283790, 237357990 -> 38074200
198334600, 238307940 -> 39973340
201726060, 237601250 -> 35875190
199196030, 235864980 -> 36668950
197668180, 236878930 -> 39210750
198189520, 236734830 -> 38545310
198769940, 236796360 -> 38026420
198463880, 236996720 -> 38532840
197608520, 235117620 -> 37509100
198994300, 237364120 -> 38369820
198324810, 238987550 -> 40662740
197842630, 236818990 -> 38976360
197126140, 236928430 -> 39802290
198637450, 237262350 -> 38624900
197569520, 235503910 -> 37934390

Omitting the first line (performance likely impacted by the CPU caching the memory area containing both routines), it appears that the first form is about 20% faster than the second. And this is not not stack-register specific as I had the same result with mov eax, edx for instance.

So, using opcode 8B for mov reg, reg seems like a good choice - in practice, I have no clue if this makes a difference during execution of a real program.

I still don't know why this choice is what it is though... if anyone (somebody who worked on a real-world C compiler for instance) has a clue, please leave a comment.

Update: I've just noticed NASM generates the alternate form 89 E5.

As for the guesswork: an assembler, to decide which opcode could be used to implement the desired instruction, can use different approach. A simple one could be to go through the opcodes in sequence (00 though FF, then 0F 00 through 0F FF, etc.), and use the first opcode encountered that fits. In this case, that would be 89. And now that I'm writing this, I realize that the simple x86 assembler I wrote in the past uses this method, and therefore should generate 89 E5 like NASM does. The fact that Microsoft tools do not makes me think it is a deliberate design decision...

Saturday, November 21, 2009

Why are GDT descriptors so messed up?

Ever wondered why a GDT descriptor had such a fragmented format? Like anybody born in the 80's, I have.

Here is a 64-bit, standard, non-system, generic GDT segment descriptor:

The base address is fragmented into 2 pieces (low 24 bits, high 8 bits), as is the segment limit (low 16 bits, high 8 bits). Why so?

The answer is, you guessed, backward compatibility. And the guilty is the 80286. This processor was introduced in 1982, and was the first to support Protected Mode (PM). This PM was not exactly the one we know and use nowadays though; it was simpler, sort of a version 0.1 is you will.

The 286 manual here shows us the encoding of a standard GDT descriptor - check figure 6-3. Its size is 8 bytes, but the upper 2 bytes "must be set to 0 for compatibility with iAPX 386". Interesting; so even then, they were envisaging some PM extensions... The meaningful data is contained in the low 6 bytes, formatting a non-fragmented descriptor:
- bytes 0-1: 16-bit limit
- bytes 2-4: 24-bit base
- byte 5: flags

So wait. Does it mean a segment can be at max 64Kb? And a 24-bit base means that only 16Mb of memory are physically addressable, right? - you say. Well, yes and yes. It's old-PM, remember.

The new version introduced by the 386 has its lot of improvements, among which:

Real 32-bit addressing: an extra byte for the base (8th position) was added, making it 32-bit long
The possibility to have 4Gb long segments (as opposed to 64Kb...): 4 bits were added to the limit field, making it 20 bits. Since 20 bits can only address 1Mb, 4 extra attribute bits were added, including the G bit (pos.23). Setting the G(ranularity) bit means the limit field indicates the last addressable 4-Kb block. Therefore, one sets the limit to 0xFFFFF with G=1 to have a 4Gb segment (like all modern OS do).

Which explains why GDT descriptors seem so messed up...

Another interesting bit introduced in the extra 4-bit of the attributes field is the D/B bit (pos.22). This bit indicates the default operand-size of the segment, and setting it to 1 means it's 32-bit. It was of course set to 0 for the 286 "6-byte" descriptors. Just one more element that just shows how the 386 was the real cornerstone, implementing the things that lacked in the 286 (including the paging unit), and became a standard.

If you want to know more about this, check out the Wikipedia PM history section as well as the 286 manual mentioned earlier. Also, a very interesting trivia on the 286 is how the inability to switch back from protected-mode to real-more gave a few guys at Microsoft some very hard (and fun) time!

MiscellaneouZ

Monday, November 26, 2012

New venture - Android decompiler

Monday, November 28, 2011

Visualizing lemniscates and Cassinian curves

Thursday, November 3, 2011

Raasta

Tuesday, January 19, 2010

Windows 7 and XP SP3 system call tables

Sunday, January 10, 2010

MmGetPhysicalAddress implementation tricks

Wednesday, December 16, 2009

On moving register

Saturday, November 21, 2009

Why are GDT descriptors so messed up?

Blog Archive

Labels

About Me

Blog Corner (in French)