Rsync internals
I’m officially back from a two month blogging hiatus. A few weeks ago, I got really curious about how rsync works and decided to dig deeper.
The objective of the rsync algorithm is to minimize the amount of data that needs to be sent over the wire in order to sync two versions of a file. The algorithm assumes sender has the latest version and we want to override the version on the receiver.
You say merkle tree? Maybe not…
The first technique that came to my mind is what copy-on-write systems do - build a hash tree (merkle tree) over the contents of the file by dividing the file into fixed-size chunks. This works fine in systems where the chunks are fixed-size and chunk boundaries don’t change due to inserts/deletes. In rsync’s case, the user could have inserted (or deleted) bytes into the middle of the file, and if you use fixed-size chunks, then the hashes of the chunks from that point onwards would not match with the hashes on the receiver’s end.
Rolling checksums
The trick which rsync has up its sleeve is the use of rolling checksums. A rolling checksum is a checksum algorithm which given the checksum for byte range [n, n+S-1] for chunk size S, byte n, and byte n+S, can compute the checksum for the byte-range [n+1, n+S] very efficiently. In other words, you can keep computing checksums for a sliding window over a byte stream very efficiently. Rsync actually uses the Adler-32 checksum which is also used in zlib.
But a rolling checksum is cryptographically weak. So just to be safe rsync uses both the MD5 hash and the rolling checksum for each chunk of size S. The receiver computes MD5+rolling-sum for each chunk in its version of the file and sends that across to the sender. The sender then compares the rolling-sums for each chunk, and if they match also compares the MD5sums, and if they match it can safely assume these chunks haven’t changed. Now at some point, the rolling-sums might not match. At this point, the sender can start sliding the window of chunk S over subsequent bytes in its version and try to see the next matching rolling-sum. This operation will be fast due to the nature of the rolling-sum. Once it finds a matching rolling-sum, then the algorithm can proceed as before, and it can also mark the exact byte-sequence which needs to be inserted in the receiver’s version of the file.
Gzip & rsync
The problem with gzipped files is that even if a single byte changes in the uncompressed file, then the rest of the contents of the compressed file from that byte onwards would change completely. This completely messes up rsync’s algorithm.
--rsyncable option - gzip supports this option and when you use it, it writes compressed files in a manner which is conducive to rsync’s algorithm. And what it does is use the same “rolling checksum” idea to define “chunk boundaries”. Once you move onto the next chunk, gzip resets its internal dictionary and starts compressing the next chunk freshly and so any changes to the bytes in the previous chunk don’t affect the compressed bytes in subsequent chunks. This idea is actually quite similar to the SequenceFile idea in Hadoop which lets you store data in compressed format but still be splittable so you can assign file-splits to mappers efficiently.
How are chunk boundaries identified? Whenever the last ‘n’ bits of the rolling checksum become all ‘0’ (statistically this has to happen at least once every 2^n times), and since you compute 1 rolling checksum for each new byte, you roughly get a chunk every 2^n bytes.
More info
There’s an audio recording and a full transcript of an hour-long talk by Dr. Andrew Tridgell who developed the rsync algorithm as part of his PhD thesis. It goes into further detail about the core algorithm itself and is a very entertaining read.