Wednesday, September 27, 2006
Information Context (a.k.a Code/Data Duality)
For example, categorize the following four pieces of information as either code or data:
1) push 0x6F6C6C65
2) "hello" (without the quotes)
Some common answers might be:
1) Code (x86 instruction)
2) Data (string)
3) Data (integer)
4) Code or Data (unknown)
Really, all four are encoded internally (at least on an intel architecture) the same way, so they're all code and they're all data. Inside the metal box, they all have the same binary representation. The key is to know the context in which they're used. If you thought #1 was code, this is probably based on past experience (e.g. having seen x86 assembly before). If the four ascii bytes 'h' 'e' 'l' 'l' 'o' were in a code segment, then it is possible that they were instructions, however it is also that they were data. This is because code and data can be interspersed even in a code segment (as is often done by compilers.)
Note: Feel free to skip the following paragraph if you're not really interested in the theory behind why this is possible. The application of information context is in the rest of the post.
The reason code and data can be interspersed is often attributed to the Von Neumann architecture, which stores both code and data in a common memory area. Other computing architectures, such as the Harvard architecture have seperate memory storage areas for code and data. However, this is an implementation issue, as information context (or lack thereof) can also be seen in Turing machines. In a Universal Turing Machine (UTM), you can't distinguish between code and data if you encode both the instructions to the UTM and the data to the UTM with the same set (or subset) of tape symbols. Both the instructions for the UTM and the data for the UTM sit on the tape, and if they're encoded with the same set (or subset) of tape symbols, then just by looking at any given set of symbols, there isn't enough information about how the symbols are used to determine if they're instructions (e.g. move the tape head) or data. This type of UTM would be more along the lines of a Von Neumann machine. A UTM which was hard-coded (via the transition function) to use different symbols for instructions and data would be more along the lines of a Harvard machine.
One area where information context greatly determines understanding of code and data is in reverse engineering, specifically executable packers/compressors/encryption/etc. (Note: executable packers/compressors/encryption/etc. are really examples of self-modifying code, so this concept extends to self-modifying code as well). Here's an abstract view of how common block-based executable encryption works (e.g. ASPack, etc.):
| Original executable | --- Packer ---> | Encrypted data |
The encrypted data is then combined with a small decryption stub resulting in an encrypted executable that looks something like:
| Decryption stub | Encrypted data |
When the encrypted executable is run, the decryption stub decrypts the encrypted data, yielding something that can look like:
| Decryption stub | Encrypted data | Original executable |
(Note: this is an abstract view, there are of course implementation details such as import tables, etc. It's also possible to have the encrypted data decrypted in place).
The encrypted data is really code, just in a different representation. The same concept applies when you zip / gzip / bzip / etc. an executable from disk.
Another example where the duality between code and data comes into play is with exploits. Take for example SEH overwrites. An SEH overwrite is a mechanism for gaining control of execution (typically during a stack based buffer overflow) without using the return pointer. SEH stands for Structured Exception Handling, an error/exception handling mechanism used on the Windows operating system. One of the structures used to implement SEH sits on the stack, and can be overwritten during a stack-based buffer overflow. The structure consists of two pointers, and looks something like this:
| Pointer to previous SEH structure | Pointer to handler code |
Now in theory, a pointer, regardless of what it points to, is considered data. The value of a pointer is the address of something in memory, hence data. During an exploit that gains control by overwriting the SEH structure, the pointer to the previous SEH structure is typically overwritten with a jump instruction (often a jump to shellcode.) The pointer to handler code is typically overwritten with the address of a set of pop+pop+ret instructions. The SEH structure now looks like:
| Jump (to shellcode) | Pointer to pop+pop+ret |
The next thing that happens is to trigger an exception so that the pop+pop+ret executes. Due to the way the stack is set up when the exection handling code (now pop+pop+ret) is called, the ret transfers control to the jump (to shellcode). The code/data duality comes into play since this was originally a pointer (to the previous SEH structure) but is now code (a jump). In the context of the operating system, it treats this as data, but in the context of a pop+pop+ret it is treated as code.
Information context goes beyond just code/data duality. Take for example string searches (a.k.a. keyword searches) which are commonly used in digital forensic examinations to generate new leads to investigate. Assume a suspect is accused of stealing classified documents. Searching a suspect's media (hard disk, usb drive, digital camera, etc.) for the string "Classified" (amongst other things) could seem prudent. However a hit on the word "Classified" does not necessarily suggest guilt, but it does warrant further investigation. For instance the hit for "Classified" could come from a classified document, or the hit could come from a cached HTML page for "Classified Ads". Again the key is to understand the context in which the hit comes from.
To summarize this post in a single sentence: Information context can be vital in understanding/analyzing what you're looking at.