Sunday, January 10, 2010

MmGetPhysicalAddress implementation tricks

MmGetPhysicalAddress is a kernel-exported API that allows converting a virtual address to a physical one.

Quick reminder: Windows runs in protected mode (or long mode on x64) with paging enabled, with or without Physical Address Extension (PAE). This post focuses on a 32-bit kernel with PAE disabled, for clarity.

The PDE/PTE structures for 4Kb pages are (from Intel manual):



The PDE structure for 4Mb pages is:



The conversion to a physical address is done internally by the hardware. No instruction exposes a virtual-to-physical address conversion to software. When the operating system uses paging, the key to the conversion lies in CR3: this register contains the physical address of the page directory (the table of PDEs).

In order to do a virtual-to-physical conversion programmatically, one needs to know where the page translation tables are located in virtual memory. When created, these pages are referenced only by their physical addresses, stored in the page-directory or page-table entries (see pictures above). This chicken-and-egg problem is solved by Windows (and most likely other x86 OSs) by reserving a range in the kernel address space for all lowest-level page-description pages (PTEs for 4Kb pages, PDEs for 4Mb pages).

Let's examine how MmGetPhysicalAddress is implemented in the simplest version of a Windows XP SP3 kernel (32-bit, PAE disabled):

.text:0042E046 ; unsigned __int64 __stdcall MmGetPhysicalAddress(unsigned int BaseAddress)
.text:0042E046
.text:0042E046 BaseAddress = dword ptr 8
.text:0042E046
.text:0042E046 mov edi, edi
.text:0042E048 push ebp
.text:0042E049 mov ebp, esp
.text:0042E04B push esi
.text:0042E04C mov esi, [ebp+BaseAddress]
.text:0042E04F mov eax, esi
.text:0042E051 shr eax, 14h
.text:0042E054 and eax, 0FFCh
.text:0042E059 mov ecx, [eax-3FD00000h]
.text:0042E05F mov eax, ecx
.text:0042E061 and ax, 81h
.text:0042E065 cmp al, 81h ;present? page size?
.text:0042E067 jnz short 4Kbpage
.text:0042E067
.text:0042E069 mov eax, esi ;4Mb page
.text:0042E06B shr eax, 0Ch
.text:0042E06E and eax, 3FFh
.text:0042E073 shr ecx, 0Ch
.text:0042E076 add eax, ecx
.text:0042E076
.text:0042E078
.text:0042E078 convert:
.text:0042E078 xor ecx, ecx
.text:0042E07A shld ecx, eax, 0Ch
.text:0042E07E shl eax, 0Ch
.text:0042E081 and esi, 0FFFh
.text:0042E087 add eax, esi
.text:0042E089 mov edx, ecx ;0
.text:0042E089
.text:0042E08B
.text:0042E08B done:
.text:0042E08B pop esi
.text:0042E08C pop ebp
.text:0042E08D retn 4
.text:0042E090
.text:0042E090 4Kbpage:
.text:0042E090 test cl, 1 ;present?
.text:0042E093 jz short error
.text:0042E093
.text:0042E095 mov eax, esi
.text:0042E097 shr eax, 0Ah
.text:0042E09A and eax, 3FFFFCh
.text:0042E09F sub eax, 40000000h
.text:0042E0A4 mov eax, [eax]
.text:0042E0A6 test al, 1 ;PTE present?
.text:0042E0A8 jz short error
.text:0042E0A8
.text:0042E0AA shr eax, 0Ch
.text:0042E0AD jmp short convert
.text:0042E0AF
.text:0042E0AF error:
.text:0042E0AF xor eax, eax
.text:0042E0B1 xor edx, edx
.text:0042E0B3 jmp short done
.text:0042E0B3
.text:0042E0B3 _MmGetPhysicalAddress@4 endp


Remember the VA is decomposed into 3 or 2 parts:
- 3 parts for 4Kb pages: [10bits=PDE index / 10bits=PTE index / 12bits=Page offset]
- 2 parts for 4Mb pages: [10bits=PDE index / 22bits=Page offset]
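
To make the decomposition concrete, here is a minimal C sketch of the field extraction for a 32-bit, non-PAE virtual address. The helper names are mine (hypothetical), not anything exported by the kernel.

```c
#include <stdint.h>

/* Field extraction for a 32-bit virtual address, non-PAE paging.
 * Helper names are illustrative only. */
static inline uint32_t pde_index(uint32_t va) { return va >> 22; }            /* top 10 bits    */
static inline uint32_t pte_index(uint32_t va) { return (va >> 12) & 0x3FFu; } /* middle 10 bits */
static inline uint32_t offset_4k(uint32_t va) { return va & 0xFFFu; }         /* low 12 bits    */
static inline uint32_t offset_4m(uint32_t va) { return va & 0x3FFFFFu; }      /* low 22 bits    */
```

For instance, C0300000 decomposes into PDE index 0x300, PTE index 0x300 and offset 0.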

The code first computes PDE index*4, which is the byte offset of the PDE from the start of the page directory (the address held in CR3), since PDEs are 4 bytes long. This value is added to -3FD00000, which is C0300000 as an unsigned 32-bit value. The PDE offset being in [0,FFC], the result is in [C0300000,C0300FFC]. Now, the first PDE is the one CR3 points to, which means that physical_to_virtual(CR3)=C0300000, for all processes. The range [C0300000,C0301000[ contains the 0x400 PDEs.
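
The first three instructions after the prologue can be written in C roughly as follows. This is a sketch of the arithmetic only; `pde_va` and `PDE_BASE` are names I made up for the illustration.

```c
#include <stdint.h>

/* -0x3FD00000 interpreted as an unsigned 32-bit value. */
#define PDE_BASE 0xC0300000u

/* Virtual address of the PDE that describes 'va' (32-bit, non-PAE). */
static inline uint32_t pde_va(uint32_t va)
{
    uint32_t offset = (va >> 20) & 0xFFCu;  /* shr eax,14h / and eax,0FFCh: PDE index * 4 */
    return PDE_BASE + offset;               /* [eax-3FD00000h] */
}
```

Shifting by 20 and masking with FFC is just a shortcut for (va >> 22) * 4.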

A comparison then checks whether the page is present (bit 0); the function returns 0 if it is not. The same comparison also checks bit 7; if set, this bit indicates a 4Mb page and a different, simpler code branch is executed.
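
The combined test on bits 0 and 7 could be expressed like this in C. The enum and macro names are hypothetical; only the bit positions come from the listing and the Intel manual.

```c
#include <stdint.h>

#define PDE_PRESENT    0x01u  /* bit 0: P  */
#define PDE_LARGE_PAGE 0x80u  /* bit 7: PS */

enum pde_kind { PDE_NOT_PRESENT, PDE_4K, PDE_4M };

/* Same decision as "and ax,81h / cmp al,81h" followed by "test cl,1". */
static inline enum pde_kind classify_pde(uint32_t pde)
{
    const uint32_t both = PDE_PRESENT | PDE_LARGE_PAGE;

    if ((pde & both) == both)
        return PDE_4M;          /* present and large: 4Mb branch */
    if (pde & PDE_PRESENT)
        return PDE_4K;          /* present, points to a page table */
    return PDE_NOT_PRESENT;     /* the function returns 0 */
}
```

Note that a PDE with bit 7 set but bit 0 clear ends up in the "not present" case, exactly as in the assembly.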

For a 4Kb PDE, the top 20 bits of the VA (PDE index and PTE index together) are extracted and multiplied by 4, giving a PTE offset in [0,3FFFFC]. The offset is added to -40000000, i.e. C0000000. The resulting value is in [C0000000, C03FFFFC]. This means the PTEs are in the range [C0000000, C0400000[. It's important to understand that this range is "reserved"; only a handful of these pages are actually mapped to physical ones, as explained below.
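
The same arithmetic in C, as a small sketch (`pte_va` and `PTE_BASE` are illustrative names, not kernel symbols):

```c
#include <stdint.h>

/* -0x40000000 interpreted as an unsigned 32-bit value. */
#define PTE_BASE 0xC0000000u

/* Virtual address of the PTE that describes 'va' (32-bit, non-PAE, 4Kb page). */
static inline uint32_t pte_va(uint32_t va)
{
    uint32_t offset = (va >> 10) & 0x3FFFFCu;  /* shr eax,0Ah / and eax,3FFFFCh: (va >> 12) * 4 */
    return PTE_BASE + offset;                  /* sub eax,40000000h */
}
```

The PTEs of the whole 4Gb address space are laid out as one flat array of 0x100000 entries starting at C0000000.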

(The physical address is then calculated by extracting the page offset part of the VA (bottom 12 or 22 bits) and adding it to the physical address of the lowest-level page in the translation hierarchy.)
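
Putting the pieces together, the whole routine can be approximated in C. This is my reading of the listing, not the actual kernel source. Since user-mode code cannot read C0300000, memory access goes through a hypothetical `read32` callback, and `fake_read32` is a made-up stand-in holding two entries: a self-referencing PDE with PFN 0x39 (the PFN matches the System process shown later in the post, the flag bits are assumed) and an invented 4Mb PDE.

```c
#include <stdint.h>

/* Hypothetical accessor: returns the dword stored at a kernel virtual address. */
typedef uint32_t (*read32_fn)(uint32_t va);

/* Mirrors the listing: returns 0 when the PDE or PTE is not present. */
uint64_t get_physical_address(uint32_t va, read32_fn read32)
{
    uint32_t pde = read32(0xC0300000u + ((va >> 20) & 0xFFCu));
    uint32_t pfn;

    if ((pde & 0x81u) == 0x81u) {
        /* 4Mb page: base PFN from the PDE plus the PTE-index bits of the VA */
        pfn = (pde >> 12) + ((va >> 12) & 0x3FFu);
    } else {
        if (!(pde & 1u))
            return 0;                           /* PDE not present */
        uint32_t pte = read32(0xC0000000u + ((va >> 10) & 0x3FFFFCu));
        if (!(pte & 1u))
            return 0;                           /* PTE not present */
        pfn = pte >> 12;
    }
    /* "convert": PFN * 4096 + low 12 bits of the VA, as a 64-bit value */
    return ((uint64_t)pfn << 12) + (va & 0xFFFu);
}

/* Fake memory for trying the walk in user mode (values are assumptions). */
uint32_t fake_read32(uint32_t va)
{
    switch (va) {
    case 0xC0300C00u: return 0x00039063u;  /* PDE 0x300: self-map, PFN 0x39   */
    case 0xC0300800u: return 0x000001E3u;  /* PDE 0x200: 4Mb page, physical 0 */
    default:          return 0;            /* everything else: not present    */
    }
}
```

With this fake memory, `get_physical_address(0xC0300123, fake_read32)` yields 0x39123 and `get_physical_address(0x80123456, fake_read32)` yields 0x123456.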

What's interesting in this scheme is the address range used. Let's consider the 4Kb-page hierarchy (page directory, page table, page). CR3 "points" to C0300000, i.e. the first PDE is at C0300000. The PTEs go from C0000000 to C0400000: the PDE range overlaps the PTE range! And not just anywhere, but exactly three quarters of the way into this range, which makes sense since the three-quarter mark of a 32-bit address space is also C0000000. This is not random of course: the page directory doubles as the page table for the [C0000000, C0400000[ range, so the PDEs are themselves the PTEs that let the processor access that range, page directory included!
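
The overlap can be checked with plain arithmetic. The sketch below reuses the two address computations from the listing (function names are mine) to show where the PTEs and the PDE of the table range itself land.

```c
#include <stdint.h>

/* Address computations taken from the listing; names are illustrative. */
static inline uint32_t pde_va(uint32_t va) { return 0xC0300000u + ((va >> 20) & 0xFFCu); }
static inline uint32_t pte_va(uint32_t va) { return 0xC0000000u + ((va >> 10) & 0x3FFFFCu); }

/* Returns 1 when the PTE describing 'va' and the PDE describing 'va'
 * are the very same dword, which is the self-reference property. */
static inline int pte_is_pde(uint32_t va)
{
    return pte_va(va) == pde_va(va);
}
```

The PTE that maps C0000000 sits at C0300000, i.e. it is the first PDE, and for C0300000 itself both computations give C0300C00, which is PDE number 0x300. For an ordinary address such as 00401000 the two differ.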

This may seem a bit obscure, plus my explanations here are pretty poor. It's funny how explaining 30 lines of smart assembly can be so tricky... The thing to remember is that the CPU offers NO facility for software to do a virtual-to-physical conversion. To let the kernel access the pages that the CPU uses to do this conversion internally, those pages must be accessible in virtual memory. And to be accessible in virtual memory, they must be referenced by themselves. This self-reference mechanism allows the implementation of MmGetPhysicalAddress.

Quick lab experiment. Fire up WinDbg, local kernel debugging:

lkd> !process 0 0 System
PROCESS 81bcc830 SessionId: none Cid: 0004 Peb: 00000000 ParentCid: 0000
DirBase: 00039000 ObjectTable: e1000cc0 HandleCount: 244.
Image: System

DirBase is the physical address of the first PDE (loaded into CR3).
We can confirm that by checking the field in the associated EPROCESS structure:

lkd> dt _EPROCESS Pcb.DirectoryTableBase 81bcc830
+0x000 Pcb :
+0x018 DirectoryTableBase : [2] 0x39000

Now, let's get the physical address of C0300000. We use !vtop, with the PFN of the process's page directory (39000 >> 12 = 39):

lkd> !vtop 39 c0300000
Pdi 300 Pti 300
c0300000 00039000 pfn(00039)

The result is 39000, which confirms that C0300000 maps the PDEs.

You can check it for other processes, for instance WinDbg itself:

lkd> !process 0 0 windbg.exe
PROCESS 81ad8410 SessionId: 0 Cid: 049c Peb: 7ffd7000 ParentCid: 028c
DirBase: 00ae9000 ObjectTable: e1c446b0 HandleCount: 614.
Image: windbg.exe

lkd> !vtop ae9 c0300000
Pdi 300 Pti 300
c0300000 00ae9000 pfn(00ae9)
