Sunday, September 14, 2008

Rar(e) juicy bits

I recently spent time working on the Rar algorithm. Though the compression routines are not disclosed, Eugene Roshal freely provides the sources for the command line version of its decompressor. It’s the only real pieces of information available out there – the other documents you may find concern the archive file format itself.

A few context elements first: Several versions of the Rar program exist, along with different major versions of the algorithm behind it. The one we're interested in is version 2.9, the algorithm in use in Rar version 3. Version 2.9 compresses blocks of data using either a Lempel/Ziv or a PPM (Prediction by Partial Matching) algorithm.

The interesting thing is that special symbols can be found in the compressed data stream to notify the presence of VM bytecode. Yes, you read right: v29 makes use of a virtual machine. The VM code is interpreted after decompression, and is used for unfiltering the data.

To achieve better compression ratios, data is usually filtered/modified by a compressor program - ie, a filter's goal is to decrease the entropy of the data to be compressed. A very common example is when compressing x86 binary files: relative jumps and calls (0xE8, 0xE9) can be replaced by their fake, absolute equivalents. A jmp $+5 (E8 00 00 00 00 ) at offset 0x100 would be replaced by E8 05 01 00 00. This filter is used by various executable packers (ex: AsProtect), as well as compressors (ex: SevenZip, Rar) or installers (ex: InnoSetup).

Version 2.9 makes use of several filters, for images, videos, audio files, binary files, etc. However, in this very case, the unfiltering routine is encoded in the compressed stream itself, with the use of the VM bytecode. That’s where v29 potentially allows more flexibility than other compressor programs.

Available VM opcodes are the following (check out vmrar.hpp):
VM_MOV, VM_CMP, VM_ADD, VM_SUB, VM_JZ, VM_JNZ, VM_INC, VM_DEC, VM_JMP, VM_XOR, VM_AND, VM_OR, VM_TEST, VM_JS, VM_JNS, VM_JB, VM_JBE, VM_JA, VM_JAE, VM_PUSH, VM_POP, VM_CALL, VM_RET, VM_NOT, VM_SHL, VM_SHR, VM_SAR, VM_NEG, VM_PUSHA,VM_POPA, VM_PUSHF,VM_POPF, VM_MOVZX,VM_MOVSX,VM_XCHG, VM_MUL, VM_DIV, VM_ADC, VM_SBB, VM_PRINT, VM_MOVB, VM_MOVD, VM_CMPB, VM_CMPD, VM_ADDB, VM_ADDD, VM_SUBB, VM_SUBD, VM_INCB, VM_INCD, VM_DECB, VM_DECD, VM_NEGB, VM_NEGD, VM_STANDARD

Looks very similar to x86 opcodes, isn't it? If you go through this whole list, you may wonder what VM_STANDARD stands for? Well, without this opcode, reversing standard filters would be incredibly slooooow! VM_STANDARD instructs the VM that a standard filter is used, and that a hard-coded decoding routine can be executed to speed the process.

VM_STANDARD is not found in the bytecode stream per se, but is generated by the program when the bytecode stream exhibits characteristics such as particular length and crc:

VM_StandardFilters RarVM::IsStandardFilter(byte *Code,int CodeSize)
{
struct StandardFilterSignature
{
int Length;
uint CRC;
VM_StandardFilters Type;
} StdList[]={
53, 0xad576887, VMSF_E8,
57, 0x3cd7e57e, VMSF_E8E9,
120, 0x3769893f, VMSF_ITANIUM,
29, 0x0e06077d, VMSF_DELTA,
149, 0x1c2c5dc8, VMSF_RGB,
216, 0xbc85e701, VMSF_AUDIO,
40, 0x46b9c560, VMSF_UPCASE
};
uint CodeCRC=CRC(0xffffffff,Code,CodeSize)...


Here’s an example: The stream is being decompressed; the current symbol is 0x101, which instructs the decompressor that VM bytecode is to be read (check out unpack.cpp). The bytecode is processed; it contains VM initialization values, as well as a VM opcode sequence. Now, when the filter is to be applied, the opcodes could be processed one by one. Emulating hundreds of VM_MOV, VM_CMP and others is a computationally expensive task. Therefore, what the decompressor does first is to check the opcode sequence length and CRC against a list of known VM routines. These routines are for commonly used filters. If one of these standard routines is matched, then a VM_STANDARD opcode is generated, its operand being the VMSF_Xxx constant referring to the hardcoded routine executed instead of bytecode emulation.

So far, 6 standard filters routines are available:
- x86 jumps and/or call filters (VMSF_E8, VMSF_E8E9)
- Intel Itanium filter (VMSF_ITANIUM)
- RGB filter (VMSF_RGB)
- Audio files filters (VMSF_AUDIO)
- Image filters (VMSF_DELTA)
- Case filters (VMSF_UPCASE)

This choice of having a VM to run the filters offers extension possibilities yet to be seen, maybe via the use of plugins in later versions of Rar? One could then create a special filter, designed to achieve very good compression ratios with particular files. Not to mention that the filters can be organized in a filter stack, reused, and applied one after each other, etc. Quite powerful.

Last thing: how do you decide which filters are applied to your compressed data? Well, you quite don’t. The compressor will decide for you (let me remind you that they’re all closed source). Hmm... going back to Unrar source code, it appears cmddata.cpp contains the code responsible for handling command line parameters of Unrar. In fact, cmddata.cpp contains command line options for Rar and Unrar (it’s very likely that the sources are a subset of Rar+Unrar sources, and the compilation of a specific program is decided via a pre-processor #define).

You may notice that the switch –mc allows the user to force-enable/disable the use of a particular filter:

case 'M':
switch(etoupper(Switch[1]))
{
case 'C':
// ...cut...
switch(etoupper(*(Str++)))
{
case 'T': Type=FILTER_PPM; break;
case 'E': Type=FILTER_E8; break;
case 'D': Type=FILTER_DELTA; break;
...

PPM is a compression mode; it has the role of a dummy filter here. Let’s fire Rar command-line now, to see if these options are advertised:

$ rar -h
Usage: rar command -switch1 -switchN archive files ...
(cut)
mcX Set advanced compression parameters

What does "advanced compression parameters" mean? It seems RarLab is not quite ready to let their users decide what filters they would like to use, let alone code their own filters via the use of the VM.

If you want to know more, well, check out the Unrar source code, more specifically unpack.cpp/hpp, rarvm.cpp/hpp and cmddata.cpp.

No comments: