Last Resort File Recovery

Note: advice is (of course) followed at your own risk! YMMV.

Scenario: you accidentally installed Linux over the top of your existing operating system, rather than installing it alongside.

All is not lost! It's likely that most of your data is still there, not yet overwritten. The problem is that the file allocation table will have been lost, so you won't know where each file starts and ends, or whether it's split up into several fragments.

Let's have a look at how installing Linux changed the hard disk's contents.

[Figure: three disk maps showing disk usage before the accident, disk usage after the accident, and what the disk looks like in Linux. Key: file table, files 1 to 5, Linux files, Linux file table, unused space.]

When Linux was installed, it wrote a big block of data at the start of the disk, but the rest of the disk was untouched. The reason for leaving the rest untouched is performance: it takes a long time to write to a whole disk, but just marking it as unused is quick and easy.

You can see that parts of file 1 and file 2 were lost. More importantly, the file allocation table was overwritten.

A single file may be split into several fragments at different positions on the disk (e.g. file 3). The file allocation table contains the locations of these fragments, along with the file name and things like creation date and access permissions. Without the original file table, we don't know where each file is placed - all we see is a big array of bytes.

For example, all of file 3 is still present, but we don't know where its fragments are. I'm going to focus on recovering the unfragmented files - 4 and 5.

How can we recover files 4 and 5?

We need to pick out the files from the bytes on disk, without knowing where they start or end. Unless we know something about the files, this is impossible! But if we do know something about the type of files we're looking for, we're in with a chance.

Almost all .jpg files use one of two formats: JFIF and EXIF.

JFIF JPEGs always begin with the bytes FF D8 FF E0 ?? ?? 4A 46 49 46 00 01 - where ?? can be any byte - and they always end with FF D9.

EXIF JPEGs begin with the bytes FF D8 FF E1 ?? ?? 45 78 69 66 00 00 (the ASCII string "Exif" followed by two zero bytes) and they also always end with FF D9.

If we scan through the disk looking for these patterns, we can recover all the unfragmented JPEG images.
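As a rough illustration of the scanning step, here is a minimal sketch of matching the two headers with wildcard bytes. It is not the code from my program: the names kAny, matchesAt and findJpegStarts are made up for this example, and the EXIF pattern assumes the identifier is "Exif" followed by two zero bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wildcard value standing in for the "??" bytes in the text.
constexpr int kAny = -1;

// JFIF: FF D8 FF E0 ?? ?? 4A 46 49 46 00 01
const std::vector<int> kJfifHeader = {0xFF, 0xD8, 0xFF, 0xE0, kAny, kAny,
                                      0x4A, 0x46, 0x49, 0x46, 0x00, 0x01};
// EXIF (assumed): FF D8 FF E1 ?? ?? 45 78 69 66 00 00
const std::vector<int> kExifHeader = {0xFF, 0xD8, 0xFF, 0xE1, kAny, kAny,
                                      0x45, 0x78, 0x69, 0x66, 0x00, 0x00};

// True if `pattern` matches the bytes starting at `pos`, treating kAny as "any byte".
bool matchesAt(const std::uint8_t* data, std::size_t size, std::size_t pos,
               const std::vector<int>& pattern) {
    assert(data != nullptr);
    if (pos > size || pattern.size() > size - pos) return false;  // would run off the end
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] != kAny && data[pos + i] != pattern[i]) return false;
    }
    return true;
}

// Returns every offset at which a JFIF or EXIF header appears to start.
std::vector<std::size_t> findJpegStarts(const std::uint8_t* data, std::size_t size) {
    std::vector<std::size_t> starts;
    for (std::size_t pos = 0; pos < size; ++pos) {
        if (matchesAt(data, size, pos, kJfifHeader) ||
            matchesAt(data, size, pos, kExifHeader)) {
            starts.push_back(pos);
        }
    }
    return starts;
}
```

Each offset returned is only a candidate start; the matching FF D9 still has to be found before anything can be written out as a file.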

A slight snag...

Unfortunately JPEG files can also have FF D9 in the middle. I'm not entirely clear on when/why this happens - I think it may be from embedded thumbnail images. My camera (Canon 400D) always produces JPEGs with FF D9 FF E1 in the middle and FF D9 at the end, so I suggest ignoring any FF D9 that is followed by FF E1.
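The "ignore FF D9 when followed by FF E1" rule could look something like the sketch below. This is an illustration rather than the code from my program; findJpegEnd and its maxLength parameter are hypothetical, and the length cap is an extra assumption so that an image with a missing footer does not swallow the rest of the disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Searches forward from `start` (the offset of a JPEG header) for the closing
// FF D9, skipping any FF D9 that is immediately followed by FF E1.
// Returns the offset one past the end marker, or std::nullopt if no end
// marker is found within `maxLength` bytes of `start`.
std::optional<std::size_t> findJpegEnd(const std::uint8_t* data, std::size_t size,
                                       std::size_t start, std::size_t maxLength) {
    assert(data != nullptr);
    if (start >= size) return std::nullopt;
    const std::size_t end = (maxLength < size - start) ? start + maxLength : size;
    for (std::size_t pos = start + 2; pos + 1 < end; ++pos) {
        if (data[pos] != 0xFF || data[pos + 1] != 0xD9) continue;
        const bool followedByFFE1 =
            pos + 3 < size && data[pos + 2] == 0xFF && data[pos + 3] == 0xE1;
        if (followedByFFE1) continue;  // FF D9 in the middle of the file: keep going
        return pos + 2;                // one past the closing FF D9
    }
    return std::nullopt;
}
```

The heuristic is based on what one camera produces, so files from other cameras may still be cut short or run long.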

Get the code!

I've written a JPEG file recovery program in C++ that uses these techniques. You can download it from GitHub (click 'ZIP' to download as a zip file).

Note: it maps the entire disk into memory, so it requires a 64-bit install of Linux (unless you have a very small disk!). If you have a 64-bit Intel or AMD processor, then the Ubuntu x86_64 livecd is fine (just do sudo apt-get install g++ first).

Several other formats have distinctive headers and footers - why not fork my code on GitHub and extend the technique to other file types?
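To show why the address space matters, here is a minimal sketch of mapping a disk (or any file) read-only with the POSIX calls open, lseek and mmap. It is an assumption about the general approach, not a copy of my program; MappedDisk, mapReadOnly and unmapDisk are names invented for this example. lseek to the end is used to get the size because fstat reports a size of zero for block devices on Linux.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical holder for a read-only mapping; data is nullptr on failure.
struct MappedDisk {
    const std::uint8_t* data = nullptr;
    std::size_t size = 0;
};

// Maps the whole of `path` read-only. Nothing is ever written to the disk.
MappedDisk mapReadOnly(const char* path) {
    assert(path != nullptr);
    MappedDisk result;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return result;
    const off_t length = lseek(fd, 0, SEEK_END);  // size in bytes, also for block devices
    if (length > 0) {
        void* addr = mmap(nullptr, static_cast<std::size_t>(length), PROT_READ,
                          MAP_PRIVATE, fd, 0);
        if (addr != MAP_FAILED) {
            result.data = static_cast<const std::uint8_t*>(addr);
            result.size = static_cast<std::size_t>(length);
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return result;
}

// Releases a mapping created by mapReadOnly.
void unmapDisk(const MappedDisk& disk) {
    if (disk.data != nullptr) {
        munmap(const_cast<std::uint8_t*>(disk.data), disk.size);
    }
}
```

On a 32-bit system the mmap call would fail for anything but a small disk, because the whole device has to fit in the process's address space. Reading a raw device such as /dev/sda (an example path, yours may differ) normally needs root.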
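As a starting point, the same idea can be expressed as a table of signatures. The sketch below is hypothetical and not part of my program: Signature, Extent, findBytes and carve are invented names, and PNG is used as the example because every PNG starts with the 8 bytes 89 50 4E 47 0D 0A 1A 0A and ends with an IEND chunk whose last 8 bytes are 49 45 4E 44 AE 42 60 82.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of a file type by its fixed header and footer bytes.
struct Signature {
    std::string name;
    std::vector<std::uint8_t> header;
    std::vector<std::uint8_t> footer;
};

const Signature kPng = {"png",
                        {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A},
                        {0x49, 0x45, 0x4E, 0x44, 0xAE, 0x42, 0x60, 0x82}};

// A candidate file: bytes [begin, end) of the disk.
struct Extent {
    std::size_t begin;
    std::size_t end;
};

// Returns the first offset >= from at which `needle` occurs, or `size` if absent.
std::size_t findBytes(const std::uint8_t* data, std::size_t size, std::size_t from,
                      const std::vector<std::uint8_t>& needle) {
    assert(data != nullptr && !needle.empty());
    for (std::size_t pos = from; pos + needle.size() <= size; ++pos) {
        if (std::equal(needle.begin(), needle.end(), data + pos)) return pos;
    }
    return size;
}

// Pairs each header with the next footer after it; only works for unfragmented files.
std::vector<Extent> carve(const std::uint8_t* data, std::size_t size,
                          const Signature& sig) {
    std::vector<Extent> found;
    std::size_t pos = 0;
    while ((pos = findBytes(data, size, pos, sig.header)) < size) {
        const std::size_t footerPos =
            findBytes(data, size, pos + sig.header.size(), sig.footer);
        if (footerPos == size) break;  // header with no footer: stop
        found.push_back({pos, footerPos + sig.footer.size()});
        pos = footerPos + sig.footer.size();
    }
    return found;
}
```

Formats whose footer bytes can also appear inside the file would need an extra rule, just like the FF D9 snag for JPEGs.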