This is an idea I had while walking home one evening after a bit of a heavy drinking session. New files for games or movies are generally getting bigger and bigger. Files do sometimes get corrupted during downloads, particularly on connections prone to line noise. Bandwidth is increasing, but not everyone has access to services like cable or DSL. Even if they did, redownloading a CD-sized movie is annoying, and extremely wasteful of network resources.

The problem is: if a file is corrupt, how do we proceed? Well, we can assume that only a finite part of the file is corrupt. Realistically speaking, a few hundred bytes could be damaged; a small amount compared to the file size, but for executables and other non-resilient file types, enough to prevent the correct use of the file.

This download system would first generate a checksum, or mathematical hash, of the entire file. Something like MD5 would do the trick. It then asks the server to return the checksum for its copy of the file. If the two don't match, then the program gets clever. It splits the file into two halves and checksums those. It then asks the server for its checksum of each half.

We repeatedly split the file, discarding sections whose checksums are correct. In this way, we rapidly zero in on the damaged section of the file. Once a section shrinks below a certain threshold, we decide to fetch that broken section from the server. Because we're only fetching the damaged part of the file, we save network bandwidth.
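The splitting scheme above amounts to a binary search driven by checksum comparisons. Here is a minimal sketch in Python; for simplicity the server's copy is just a local byte string (`remote`), where a real client would request each checksum over the network, and the `threshold` value of 4 bytes is an arbitrary choice for illustration:

```python
import hashlib

def md5(data):
    """MD5 digest of a byte string, as the scheme above suggests."""
    return hashlib.md5(data).hexdigest()

def find_damage(local, remote, threshold=4):
    """Return (offset, length) spans of `local` that differ from `remote`,
    narrowed down by recursive checksum comparison."""
    def recurse(lo, hi, spans):
        if md5(local[lo:hi]) == md5(remote[lo:hi]):
            return  # this section checks out; discard it
        if hi - lo <= threshold:
            spans.append((lo, hi - lo))  # small enough: refetch this span
            return
        mid = (lo + hi) // 2  # otherwise split in two and recurse
        recurse(lo, mid, spans)
        recurse(mid, hi, spans)
    spans = []
    recurse(0, len(local), spans)
    return spans
```

A file with a single flipped byte yields a single small span to refetch, after only about 2 * log2(filesize / threshold) checksum exchanges.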

In this way, we trade processor time spent calculating checksums for the network bandwidth saved by not redownloading whole files. Obviously, there are problems: the server would be best advised to cache the generated checksums, to prevent excessive CPU usage as they are recomputed. By generating the checksums when the file is uploaded, or in a batch at a time of low processor usage, this can be optimised further.

This is potentially a radical idea which has occurred to someone else at some point in time. I don't care - it occurred to me independently. Going off on a total tangent, if we could get infinitely fast (or simply very, very fast) processors, downloading itself would be unnecessary. By sending just the checksum and file size, then generating data until it matches that checksum, we could theoretically reconstruct a file from a sufficiently advanced checksum. This is at present computationally infeasible. Roll on quantum computing ...

Rsync is a utility written by the Samba team that can recover almost any broken download.

Instead of binary searching progressively smaller sections of the file, it simply checksums small fixed-size blocks from the outset. The checksum is so much smaller than the original block that there's not much point in interactively searching for errors (the checksums for an entire CD image are only 600 KB or so, 1000 times smaller than the whole image). A fixed block size means that the server can precompute and cache the checksums, greatly reducing its processor load.
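The fixed-block approach can be sketched like this. Note this is a deliberately simplified illustration, not rsync's actual algorithm: real rsync also uses a cheap rolling checksum so that matching blocks can be found at any byte offset, which this sketch ignores; the 8 KB block size is just an example figure:

```python
import hashlib

BLOCK = 8192  # example fixed block size; the server can precompute these

def block_sums(data, block=BLOCK):
    """Checksum each fixed-size block of `data`."""
    return [hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

def stale_blocks(local, server_sums, block=BLOCK):
    """Indices of blocks whose local checksum disagrees with the server's.
    Only these blocks need to be fetched again."""
    local_sums = block_sums(local, block)
    longest = max(len(local_sums), len(server_sums))
    # blocks missing from either side also count as stale
    return [i for i in range(longest)
            if i >= len(local_sums) or i >= len(server_sums)
            or local_sums[i] != server_sums[i]]
```

The client downloads only the blocks listed by `stale_blocks`, so a single corrupted byte costs one block's worth of bandwidth plus the checksum list.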

Rsync has its own protocol, but can also transfer files over a Unix pipe, enabling it to be tunnelled over almost any protocol. As an example, Rsync can be tunnelled over the internet using SSH to provide secure off-site backups without the need for a leased line. Using Rsync means that a slower (cheaper) internet connection can be used without compromising backup time.

Most mirror sites on the internet are kept up to date using rsync - transmitting the same data again and again costs real money, both for the mirror, and the host, so anything that can reduce bandwidth costs is very welcome.

Apart from fixing failed downloads and rapidly mirroring incremental backups, it has some unforeseen uses. The bulk of the Debian CD images can be downloaded from the faster (and more numerous) package mirrors, using rsync and some shell scripts, in a method dubbed 'jigsaw downloading'. The files needed for the CD image are downloaded, padded to a multiple of 8 KB (a multiple of the 2 KB CD sector size), and concatenated into one big file in the order they appear in the ISO image. When this image is rsynced against the copy on the CD mirror, it produces a perfect, bootable copy of the CD image, while only downloading 10-20 MB from the slow CD mirror. Rsync automatically fixes any files that downloaded incorrectly, or were missing from the package mirrors.
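The pad-and-concatenate step described above is simple enough to sketch. This is a rough illustration of the idea, not the actual Debian scripts; `files` is a hypothetical in-order list of file contents, and 8 KB alignment follows the figure quoted above:

```python
def pad_and_concat(files, align=8192):
    """Pad each file's contents up to a multiple of `align` bytes with
    zeros, then concatenate them, approximating how the files are laid
    out inside the ISO image. `files` is a list of byte strings in the
    order they appear in the image."""
    out = bytearray()
    for data in files:
        out += data
        remainder = len(data) % align
        if remainder:
            out += b"\x00" * (align - remainder)
    return bytes(out)
```

Because most of the resulting bytes already match the real ISO block-for-block, rsync only has to transfer the image's metadata and any files the package mirrors lacked.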

Unfortunately, there's no way to get an infinitely fast download by generating random data until it fits a checksum. For every checksum value there are 2^(file size in bits) / 2^(checksum size in bits) possible bit combinations that match it. So if we went through every permutation, we'd get every possible match, the vast majority of which are incorrect and meaningless. This is the pigeonhole principle at work: 'if there are n pigeons, and n-1 holes, at least one hole has two pigeons in it'.
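The counting argument is easy to check on a toy scale. Here a hypothetical 3-bit "checksum" (just the value modulo 8) is applied to every possible 8-bit "file": each checksum value is matched by exactly 2^(8-3) = 32 different files, so the checksum alone cannot tell us which one we wanted:

```python
from collections import Counter

def tiny_checksum(value, bits=3):
    """A toy checksum: keep only the low `bits` bits of the value."""
    return value % (1 << bits)

file_bits, sum_bits = 8, 3
# tally how many distinct 8-bit 'files' map onto each checksum value
matches = Counter(tiny_checksum(v, sum_bits) for v in range(1 << file_bits))
# every one of the 8 checksum values is hit by exactly 32 files
```

The same ratio holds at real sizes: a 650 MB file with a 128-bit checksum leaves about 2^(5,200,000,000 - 128) candidates per checksum.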

For there to be only one possible match, we'd need a checksum the same size as the file. So we might as well download it.
