
Everybody remembers those CD commercials years ago, which claimed the superiority of the CD as a means of storing data since it would last for a thousand years (sic)! Well, I know that most of my three-decade-old tapes still play, and most of my five-year-old CDs fail to read! Anyway, sometimes you just need to make space on your hard disk and a CD or a DVD is the only choice.

BUT, now and then you still need your data. All of it.

Don't despair, for there is a solution. (warning: the tutorial is for Linux users)

Let's take an example. I just finished my PhD and need to archive my PhD folder, which contains all my research and sums up to 1.5 GB. First of all I make one compressed file out of it. Since I am a Linux user, I use my favorite combination of tar + bzip2. So now I have the file 'PhD.tar.bz2', sized 458 MB (the high compression ratio is thanks to programming code and text simulation files). I could write that file directly onto a CD and wait for 3 years until the moisture where I live makes some areas of the CD like Swiss cheese, and therefore unreadable. I don't want this to happen, so what do I do? I could always make a second clone CD, and not use it until the first one breaks, but this is obviously a highly un-funny and scientifically uninteresting method, appropriate only for lame Windows users (that was humor. If you downvote me for that, you are indeed a lame Windows user!).
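For reference, the archiving step looks something like this (the directory name is mine, and the stand-in data line is only there so the commands run anywhere; your real folder goes in its place):

```shell
# (stand-in data so the commands run anywhere; your real PhD/ goes here)
mkdir -p PhD && echo "simulation output" > PhD/results.txt

# Pack the whole PhD/ directory into one bzip2-compressed tarball
# (c = create, j = compress with bzip2, f = write to this file)
tar cjf PhD.tar.bz2 PhD/

# List the contents to double-check
tar tjf PhD.tar.bz2
```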

Safeguarding Data

Here comes par2 to the rescue. Par2 is a utility (available on Linux and elsewhere) for adding redundancy to an archive, so that even if the archive gets corrupted, enough information remains to reconstruct the original in full.

So, the concept is that we will write as much redundancy as possible on the CD along with the data archive, so that we have a better chance of retrieving it some years later. Since a CD can hold 700 MB of data, and our file is 458 MB, we can stuff as much as 242 MB of redundant information onto the CD. That is 242/458 * 100 = 52.84% redundancy; call it 52% so that we don't run into any space problems.
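If, like me, you don't fully trust mental arithmetic, the shell can confirm the figure:

```shell
# Spare room on a 700 MB CD next to a 458 MB archive, as a percentage
awk 'BEGIN { free = 700 - 458; printf "%d MB free = %.2f%% redundancy\n", free, free * 100 / 458 }'
```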

We now issue, in our favourite shell (bash):

$ par2create -r52 PhD.tar.bz2

We wait a decent amount of time until the redundant info is created; afterwards we can see that many files were created, with names like PhD.tar.bz2.volXXXX+YYY.par2

Now, these files along with PhD.tar.bz2 amount to ~700 MB, and we write them on a CD.

Technical stuff: Par2 works by placing our data into a Reed-Solomon matrix and computing parities, which it then writes into the files you saw created. If your main file later turns out to be corrupted, or if some bytes somewhere in it are missing, par2 recomputes the missing or corrupted bytes from the parities and recovers your file.
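To get a feel for the idea without the Reed-Solomon machinery, here is a much cruder single-parity analogy in shell. Real par2 parities are vastly more capable, but the "recompute the lost piece from what's left" principle is the same:

```shell
# Two "data blocks" (as small integers) and one XOR "parity block"
A=170; B=60
P=$((A ^ B))                       # parity = A XOR B

# Pretend block A was lost; XOR the parity with B brings it back
echo "recovered A = $((P ^ B))"    # prints: recovered A = 170
```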

...

Time passes by... You want to extract a couple of files from the .tar.bz2 archive, but you forgot the CD on your desk; the cat used it as a scratchpad, the kids played the hat man from Goldfinger with it. No wonder it gives you an input/output error when you try to extract something from it. Party's startin'!

Recovering Data

So, we have a corrupted archive. Bzip2 confirms that:

/media/cdrom$ bzip2 -t PhD.tar.bz2
bzip2: PhD.tar.bz2:
bzip2: I/O or other error, bailing out. Possible reason follows.
bzip2: Input/output error
Input file = PhD.tar.bz2, output file = (none)

Until now we have never used the redundant info written on the CD, but now is the time. Par2 will fill in the blanks of our archive from the parities stored in the files par2 created, and we'll have it recovered in no time! Let's issue:

/media/cdrom$ par2repair PhD.tar.bz2
...
Loading "PhD.tar.bz2.par2".
Loaded 4 new packets
Loading "PhD.tar.bz2.vol0000+001.par2".
Loaded 1 new packets including 1 recovery blocks
Loading "PhD.tar.bz2.vol0001+002.par2".
Loaded 2 new packets including 2 recovery blocks
Loading "PhD.tar.bz2.vol0003+004.par2".
Loaded 4 new packets including 4 recovery blocks
Loading "PhD.tar.bz2.vol0007+008.par2".
Loaded 8 new packets including 8 recovery blocks
Loading "PhD.tar.bz2.vol0015+016.par2".
Loaded 16 new packets including 16 recovery blocks
Loading "PhD.tar.bz2.vol0031+032.par2".
Loaded 32 new packets including 32 recovery blocks
Loading "PhD.tar.bz2.vol0063+064.par2".
Loaded 64 new packets including 64 recovery blocks
Loading "PhD.tar.bz2.vol0127+128.par2".
Loaded 128 new packets including 128 recovery blocks
Loading "PhD.tar.bz2.vol0255+256.par2".
Loaded 256 new packets including 256 recovery blocks
Loading "PhD.tar.bz2.vol0511+489.par2".
Could not read 240072 bytes from /media/cdrom0/PhD.tar.bz2.vol0511+489.par2 at offset 61196036
Could not read 1048576 bytes from /media/cdrom0/PhD.tar.bz2.vol0511+489.par2 at offset 61195974

Loaded 254 new packets including 254 recovery blocks

There are 1 recoverable files and 0 other files.
The block size used was 240068 bytes.
There are a total of 2000 data blocks.
The total size of the data files is 480022724 bytes.

Verifying source files:

Could not read 240068 bytes from /media/cdrom0/PhD.tar.bz2 at offset 75621420

Damn! Par2 read the files with the recovery blocks, but the largest recovery file, the one with 489 recovery blocks, could only partly be read. Even worse, the program could not read past the first 72 MB of our 458 MB archive, leaving us with practically no data to repair...

Are we out of luck? Yes... But we are not out of creativity! We understand that the archive has suffered damage somewhere after its first 72 MB, but this doesn't mean that all the rest of the file is unreadable. Simple tools like the unix 'cp' may, for example, be unable to continue copying the file after hitting the damaged part, leaving us with only the first 72 MB; but there are smarter tools that can save us.

Let me introduce you to dd_rescue. It is a persistent copier that skips broken sectors and keeps copying whatever it can. We could eventually get most of our archive just by giving:

/media/cdrom$ dd_rescue PhD.tar.bz2 /tmp/PhD.tar.bz2

but that would take ages to complete because of the damaged sectors right after the first 72 MB. We want our archive badly, and we want it now! So instead we give:

/media/cdrom$ dd_rescue -r PhD.tar.bz2 /tmp/PhD.tar.bz2

This starts copying the file from the end. It copies the last 458 - 72 = 386 MB very fast, and when it starts hitting the bad sectors, we just stop it with Ctrl-C.

We now have a file /tmp/PhD.tar.bz2 of exactly the size it should be, only the first 72 MB are garbage*. (If we needed to, say because the redundancy was not enough, we could merge the first 72 MB we can read with the last 386 MB we just read, giving us most of the archive intact; then just 2-3% of redundancy would suffice to recover it. But why choose the hard way when there is enough redundancy?)

* Bzip2 confirms that the first 72MB are garbage:

/tmp$ bunzip2 -t PhD.tar.bz2
bunzip2: PhD.tar.bz2: bad magic number (file not created by bzip2)
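For the curious, the splice mentioned in the parenthesis above is doable with plain coreutils. A toy demonstration on a synthetic file (the 10 KB size and the 3 KB / 7 KB split are made up; in our case the split would be roughly 72 MB and 386 MB):

```shell
# Fake a 10 KB "archive", keep its first 3 KB (forward copy stopped at the
# damage) and its last 7 KB (reverse copy stopped at the damage), then splice.
head -c 10240 /dev/urandom > original
head -c 3072 original > head_part
tail -c 7168 original > tail_part
cat head_part tail_part > spliced
cmp original spliced && echo "splice is byte-identical"
```

In real life the two readable regions would overlap or leave a gap at the damaged area rather than meet exactly, which is why a little par2 redundancy is still needed on top.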

So, we have 386 MB of our archive, and we bring as much redundancy as we can into the same directory, in order to feed it to par2.

/tmp$ cp /media/cdrom/*par2 /tmp

Some of the par2 redundancy files can be read, some cannot, as we saw before.
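A small gotcha here: a plain cp of the whole glob may abort the moment it hits an unreadable .par2 file, taking the later (perfectly readable) files down with it. A per-file loop soldiers on past the failures; sketched below with stand-in directories (in reality the source would be /media/cdrom and the target /tmp):

```shell
# Stand-in source and destination so the loop is demonstrable anywhere
mkdir -p cd_copy rescued
touch cd_copy/a.par2 cd_copy/b.par2

# Copy each parity file individually, reporting (not dying on) failures
for f in cd_copy/*.par2; do
    cp "$f" rescued/ || echo "skipped unreadable $f"
done
ls rescued
```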

And the time has come:

/tmp$ par2repair *par2

Loading "PhD.tar.bz2.vol0001+002.par2".
Loaded 6 new packets including 2 recovery blocks
Loading "PhD.tar.bz2.vol0255+256.par2".
Loaded 256 new packets including 256 recovery blocks
Loading "PhD.tar.bz2.vol0003+004.par2".
Loaded 4 new packets including 4 recovery blocks
Loading "PhD.tar.bz2.vol0127+128.par2".
Loaded 128 new packets including 128 recovery blocks
Loading "PhD.tar.bz2.vol0007+008.par2".
Loaded 8 new packets including 8 recovery blocks
Loading "PhD.tar.bz2.vol0015+016.par2".
Loaded 16 new packets including 16 recovery blocks
Loading "PhD.tar.bz2.vol0511+489.par2".
Loaded 254 new packets including 254 recovery blocks
Loading "PhD.tar.bz2.vol0063+064.par2".
Loaded 64 new packets including 64 recovery blocks

There are 1 recoverable files and 0 other files.
The block size used was 240068 bytes.
There are a total of 2000 data blocks.
The total size of the data files is 480022724 bytes.

Verifying source files:

Target: "PhD.tar.bz2" - damaged. Found 1656 of 2000 data blocks.

Scanning extra files:


Repair is required.
1 file(s) exist but are damaged.
You have 1656 out of 2000 data blocks available.
You have 732 recovery blocks available.
Repair is possible.
You have an excess of 388 recovery blocks.
344 recovery blocks will be used to repair.

Computing Reed Solomon matrix.
Constructing: done.
Solving: done.

Wrote 97536000 bytes to disk
Wrote 97536000 bytes to disk
Wrote 97516488 bytes to disk
Wrote 97487232 bytes to disk
Wrote 89947004 bytes to disk

Verifying repaired files:

Target: "PhD.tar.bz2" - found.

Repair complete.

A few minutes later, after par2 has finished computing the Reed-Solomon matrix and fixing the wrong byte values using the parities, our archive is reconstructed!

Bzip2 confirms that:

/tmp$ bzip2 -tv PhD.tar.bz2
PhD.tar.bz2: ok

Our archive is ready to extract and use with a simple:

$ tar xvjf PhD.tar.bz2
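And since this whole adventure started because we only wanted a couple of files, remember that tar can also extract selectively. A sketch (the member names are made up, and the first two lines just build a stand-in archive so the commands run anywhere):

```shell
# (stand-in archive so the extraction is demonstrable anywhere)
mkdir -p PhD && echo "chapter one" > PhD/chapter1.txt && echo "data" > PhD/data.txt
tar cjf PhD.tar.bz2 PhD/ && rm -r PhD

# Extract only the one member we care about, not the whole 1.5 GB
tar xjf PhD.tar.bz2 PhD/chapter1.txt
ls PhD
```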

A miracle of maths!
