### Monday, June 26, 2006

## Base+Offset notation (or why we start counting with zero)

Every now and again, I get the question about why we starting counting things such as arrays, offsets, etc. with zero (0) and not one (1). The answer is simple, when specifying a data structure, we normally specify the byte (or whatever unit) offset for the start of a field for a specific data structure. Think how a computer would get to a specific element of a list. Let's say we want the third element in the list (the offset is 2). The start of the list is the first element. We move to the next element, which is at offset 1, and then we move to the next element (the third item) which is offset 2. We now read the contents of that element. When we start counting with offsets, the first byte is zero bytes from the start (since it's the first byte). Here is what it looks like graphically:

For example, the FAT boot sector starts out like this: (Note: this was taken from http://support.microsoft.com/kb/q140418/)

0 1 2 3 4 5 6Let's say we want to get to element C (the third element, offset 2). If "A" is the start, then it is zero bytes from the start. That means, that A's offset is 0. There is a caret (^) beneath "A" to show the current position. Next we move forward one element (offset = 1) to the second element, which looks like:

A B C D E F G

^ (offset from A = 0)

0 1 2 3 4 5 6Finally, we move forward one more element (offset = 2) to the third element, which looks like:

A B C D E F G

^ (offset from A = 1)

0 1 2 3 4 5 6

A B C D E F G

^ (offset from A = 2)

For example, the FAT boot sector starts out like this: (Note: this was taken from http://support.microsoft.com/kb/q140418/)

Field Offset LengthNotice that the "Bytes per Sector" field (which tells us the size of a sector, usually 512) is at offset 11. This means that this field starts at offset 11 (a.k.a. 11 bytes from the start, the 12th byte, etc.) where the offset of the first byte is 0 (the start).

----- ------ ------

Bytes Per Sector 11 2

Sectors Per Cluster 13 1

Reserved Sectors 14 2

FATs 16 1

Root Entries 17 2

Small Sectors 19 2

Media Descriptor 21 1

Sectors Per FAT 22 2

Sectors Per Track 24 2

Heads 26 2

Hidden Sectors 28 4

Large Sectors 32 4

### Saturday, June 24, 2006

## Argument for MD5

So, there has been a lot of talk over the past few years about using MD5 hash sums in digital forensics, due to the fact that some collisions have been found for MD5.

First, a hash algorithm/function has the following properties:

1) The algorithm takes in a variable sized input data and transforms the data into a fixed size output (usually smaller).

2) The algorithm is one-way, meaning to calculate a hash sum of the input data is relatively easy, while to reverse the process, to calculate the original input data given the hash sum is hard (a.k.a. computationally infeasible)

3) The algorithm is collision-free, meaning that different inputs to the algorithm will yield different outputs.

A nice write-up from RSA labs about hash functions can be found at http://www.rsasecurity.com/rsalabs/node.asp?id=2176.

Now the third property is somewhat of an issue. Due to the pigeon hole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle) a hash function with a fixed output size is never truly collision-free. Here is the rationale:

- Assume a hash function with a fixed output size N. Call the set of possible values that the hash function maps to (the range) B

- Take the set of unique inputs to the hash function, where there are at least N+1 elements. Call this set A.

- Since a hash function maps A to B, there is guaranteed to be a collision, since there are more inputs than there are unique outputs (|A| > |B|).

So, since we have collisions with MD5 does this mean that all of the criminals convicted with digital evidence will run free? Have they all been wrongly convicted? I would guess the answer is no.

First there is the inherent trust in the data collection, handling, processing, analysis, and presentation. This is where things such as chain of custody, and trust in an examiner come into play. While nothing will create a deductive chain of logic that we can rely on for the integrity of evidence, current practices seem not to fail either.

Second, the attacks against MD5 have been against the aspect of being collision free. Nested in the concept of being collision free are two additional concepts:

3.1) A hash function is weakly collision-free if given an input X, then it is computationally infeasible to find another input Y, where X != Y, yet both produce the same hash sum. This is also called a "pre-image attack" by the folks at NSRL.

3.2) A hash function is strongly collision-free if it is computationally infeasible to find to different inputs X and Y that produce the same hash sum.

The key difference between 3.1 and 3.2 is that 3.1 has a pre-specified initial input X, and consequently a pre-specified hash sum to find a match while 3.2 prescribes that is is computationally infeasible to find two different inputs which yield the same output. The subtle difference is one you're trying to find an input that will yield a given output, while the other just says find two inputs that have the same output.

The role of MD5 in helping show preservation of evidence integrity in digital forensics is the fact that it is weakly collision-free. Courtesy of the pigeon-hole principle, it has been known that there ARE collisions, just finding them was rather time consuming. The attacks against MD5 were focused on 3.2, showing that MD5 was not strongly collision-free.

Here is a deductive argument:

Premise 1:

If you have a hash algorithm suitable for digital forensics [A], then it is weakly collision-free [B] (don't worry about strongly collision-free, we won't use it).

A -> B

Premise 2:

If a hash algorithm is weakly collision-free [B] and given an input [C] then it is computationally infeasible to find another unique input with the same hash sum [D] .

(B ^ C) -> D

Premise 3:

The MD5 algorithm is suitable for digital forensics [A]

A

Premise 4:

A bit-stream image of evidence is an input [C]

C

Argument:

The MD5 algorithm is weakly collision-free.

Proof:

1. A (by Premise 3)

2. A -> B (by Premise 1)

3. ((A -> B) ^ A) (by Conjunction of Premise 3 and 1)

4. B (by Modus Ponens and 3)

Argument:

Given a bit-stream image of evidence, it is computationally infeasible to find another input that yields the same MD5 hash sum.

Proof:

5. C (by the argument)

6. B ^ C (by Conjunction of 4 and 5)

7. (B ^ C) -> D (by Premise 2)

8. ((B ^ C) -> D) ^ (B ^ C) (by Conjuction of 6 and 7)

9. D (by Modus Ponens and 8)

Of course, the underlying premise that is up for attack is Premise 3, which states that MD5 is suitable for digital forensics. Notice the recent attacks don't come into play, since they focus on the strongly collision-free aspect of MD5. The folks at NSRL also stated:

"none of the attacks are pre-image attacks. In a pre-image attack, an attacker

must manufacture a file with a previously known hash. All of the current attacks

are only able to produce two files with the same hash, but it is a random hash. "

(http://www.nsrl.nist.gov/documents/analysis/draft-060530.pdf)

There are however projects underway to undermine the weakly collision-free aspect of the MD5 algorithm. Notably, the folks over at the Digital Forensics Reasearch Workshop a.k.a. DFRWS have issued a challenge (http://www.dfrws.org/hashchallenge/index.html).

So in summary, MD5 is still suitable for digital forensic use, although it might be advisable to use an additional algorithm such as SHA-1, SHA-256, SHA-512, etc.

First, a hash algorithm/function has the following properties:

1) The algorithm takes in a variable sized input data and transforms the data into a fixed size output (usually smaller).

2) The algorithm is one-way, meaning to calculate a hash sum of the input data is relatively easy, while to reverse the process, to calculate the original input data given the hash sum is hard (a.k.a. computationally infeasible)

3) The algorithm is collision-free, meaning that different inputs to the algorithm will yield different outputs.

A nice write-up from RSA labs about hash functions can be found at http://www.rsasecurity.com/rsalabs/node.asp?id=2176.

Now the third property is somewhat of an issue. Due to the pigeon hole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle) a hash function with a fixed output size is never truly collision-free. Here is the rationale:

- Assume a hash function with a fixed output size N. Call the set of possible values that the hash function maps to (the range) B

- Take the set of unique inputs to the hash function, where there are at least N+1 elements. Call this set A.

- Since a hash function maps A to B, there is guaranteed to be a collision, since there are more inputs than there are unique outputs (|A| > |B|).

So, since we have collisions with MD5 does this mean that all of the criminals convicted with digital evidence will run free? Have they all been wrongly convicted? I would guess the answer is no.

First there is the inherent trust in the data collection, handling, processing, analysis, and presentation. This is where things such as chain of custody, and trust in an examiner come into play. While nothing will create a deductive chain of logic that we can rely on for the integrity of evidence, current practices seem not to fail either.

Second, the attacks against MD5 have been against the aspect of being collision free. Nested in the concept of being collision free are two additional concepts:

3.1) A hash function is weakly collision-free if given an input X, then it is computationally infeasible to find another input Y, where X != Y, yet both produce the same hash sum. This is also called a "pre-image attack" by the folks at NSRL.

3.2) A hash function is strongly collision-free if it is computationally infeasible to find to different inputs X and Y that produce the same hash sum.

The key difference between 3.1 and 3.2 is that 3.1 has a pre-specified initial input X, and consequently a pre-specified hash sum to find a match while 3.2 prescribes that is is computationally infeasible to find two different inputs which yield the same output. The subtle difference is one you're trying to find an input that will yield a given output, while the other just says find two inputs that have the same output.

The role of MD5 in helping show preservation of evidence integrity in digital forensics is the fact that it is weakly collision-free. Courtesy of the pigeon-hole principle, it has been known that there ARE collisions, just finding them was rather time consuming. The attacks against MD5 were focused on 3.2, showing that MD5 was not strongly collision-free.

Here is a deductive argument:

Premise 1:

If you have a hash algorithm suitable for digital forensics [A], then it is weakly collision-free [B] (don't worry about strongly collision-free, we won't use it).

A -> B

Premise 2:

If a hash algorithm is weakly collision-free [B] and given an input [C] then it is computationally infeasible to find another unique input with the same hash sum [D] .

(B ^ C) -> D

Premise 3:

The MD5 algorithm is suitable for digital forensics [A]

A

Premise 4:

A bit-stream image of evidence is an input [C]

C

Argument:

The MD5 algorithm is weakly collision-free.

Proof:

1. A (by Premise 3)

2. A -> B (by Premise 1)

3. ((A -> B) ^ A) (by Conjunction of Premise 3 and 1)

4. B (by Modus Ponens and 3)

Argument:

Given a bit-stream image of evidence, it is computationally infeasible to find another input that yields the same MD5 hash sum.

Proof:

5. C (by the argument)

6. B ^ C (by Conjunction of 4 and 5)

7. (B ^ C) -> D (by Premise 2)

8. ((B ^ C) -> D) ^ (B ^ C) (by Conjuction of 6 and 7)

9. D (by Modus Ponens and 8)

Of course, the underlying premise that is up for attack is Premise 3, which states that MD5 is suitable for digital forensics. Notice the recent attacks don't come into play, since they focus on the strongly collision-free aspect of MD5. The folks at NSRL also stated:

"none of the attacks are pre-image attacks. In a pre-image attack, an attacker

must manufacture a file with a previously known hash. All of the current attacks

are only able to produce two files with the same hash, but it is a random hash. "

(http://www.nsrl.nist.gov/documents/analysis/draft-060530.pdf)

There are however projects underway to undermine the weakly collision-free aspect of the MD5 algorithm. Notably, the folks over at the Digital Forensics Reasearch Workshop a.k.a. DFRWS have issued a challenge (http://www.dfrws.org/hashchallenge/index.html).

So in summary, MD5 is still suitable for digital forensic use, although it might be advisable to use an additional algorithm such as SHA-1, SHA-256, SHA-512, etc.

### Wednesday, June 21, 2006

## The switch to Levenger

For years I've carried around a small notebook, one of the spiral bound that is almost a 3x5 card size. I've even got a nice leather cover for them somewhere at my Dad's house. I normally use the notebook to do things like take case notes, observations, grocery lists, etc. The most recent notebook had been so used and abused it was actually held together by duct tape.

While at the Denver SANS conference, I noticed one of the students was using a Levenger pocket briefcase to take notes. After seeing how well this system worked out for him, I decided to peruse the Levenger website (Levenger.com) and bit the bullet and ordered a black Levenger Flip Pocket Briefcase (http://www.levenger.com/PAGETEMPLATES/PRODUCT/Product.asp?Params=Category=11-76|PageID=2458|Level=2-3#).

All I can say is the device is great. I carry it with me everywhere, keep case notes on it, and pretty much everything. I haven't abandoned my higher-tech devices such as the Blackberry, but the pocket briefcase is much more efficient for that on-the-spot information gathering.

While at the Denver SANS conference, I noticed one of the students was using a Levenger pocket briefcase to take notes. After seeing how well this system worked out for him, I decided to peruse the Levenger website (Levenger.com) and bit the bullet and ordered a black Levenger Flip Pocket Briefcase (http://www.levenger.com/PAGETEMPLATES/PRODUCT/Product.asp?Params=Category=11-76|PageID=2458|Level=2-3#).

All I can say is the device is great. I carry it with me everywhere, keep case notes on it, and pretty much everything. I haven't abandoned my higher-tech devices such as the Blackberry, but the pocket briefcase is much more efficient for that on-the-spot information gathering.

### Sunday, June 18, 2006

## About this blog

This blog will be a spot where I post thoughts about theoretical computer science (computing theory), digital forensics, information security, and other miscellaneous musings, and optionally their relation to digital forensics