Absolute consistency →
Excellent rant on “consistency” in Dynamo-inspired systems like Riak.
Welcome, my name is Harish Mallipeddi. I work for Amazon Web Services (AWS). This blog is mostly a dump of interesting articles that I come across on the web. Topics span across multiple areas including algorithms/datastructures, NoSQL stores, database internals, web-scale challenges, and functional languages.
Excellent rant on “consistency” in Dynamo-inspired systems like Riak.
I bought a Panasonic Viera LED TV before the holidays and the TV is DLNA certified. I finally figured out how to get DLNA to work properly with the TV.
I installed Serviio on a Linux (Ubuntu) box. Serviio is an Opensource DLNA server written in Java. Running it launches a daemon in the background which listens for requests on a specific port, and it also ships with a console app to configure it.
The TV should automatically discover the DLNA server, and you should be able to browse the shared folders on the server. I was able to play some of the videos directly without any problems, but playing other videos failed.
Serviio supposedly uses ffmpeg underneath to transcode the videos on-the-fly if necessary. I couldn’t get this to work properly. I basically ended up manually transcoding videos into the NTSC-DVD format with ffmpeg and then dropping the transcoded videos into the shared folder. Now the videos started working fine. I was able to pause/skip videos correctly as well.
ffmpeg -i source.avi -target ntsc-dvd -ps 2000000000 -aspect 16:9 target.mpeg
Great article describing how virtual->physical address translation works in x86.
Summary
The author has another article explaining how this works when you’re running virtualized operating systems on top of a hypervisor.
Video of @tlipcon’s talk on Hadoop Performance from Hadoop World 2011 is now available on the Cloudera website. Todd walks through a bunch of performance fixes he did to Hadoop recently, and it’s an interesting list of common perf optimization tricks:
I’m officially back from a two month blogging hiatus. A few weeks ago, I got really curious about how rsync works and decided to dig deeper.
The objective of the rsync algorithm is to minimize the amount of data that needs to be sent over the wire in order to sync two versions of a file. The algorithm assumes sender has the latest version and we want to override the version on the receiver.
The first technique that came to my mind is what copy-on-write systems do - build a hash tree (merkle tree) over the contents of the file by dividing the file into fixed-size chunks. This works fine in systems where the chunks are fixed-size and chunk boundaries don’t change due to inserts/deletes. In rsync’s case, the user could have inserted (or deleted) bytes into the middle of the file, and if you use fixed-size chunks, then the hashes of the chunks from that point onwards would not match with the hashes on the receiver’s end.
The trick which rsync has up its sleeve is the use of rolling checksums. A rolling checksum is a checksum algorithm which given the checksum for byte range [n, n+S-1] for chunk size S, byte n, and byte n+S, can compute the checksum for the byte-range [n+1, n+S] very efficiently. In other words, you can keep computing checksums for a sliding window over a byte stream very efficiently. Rsync actually uses the Adler-32 checksum which is also used in zlib.
But a rolling checksum is cryptographically weak. So just to be safe rsync uses both the MD5 hash and the rolling checksum for each chunk of size S. The receiver computes MD5+rolling-sum for each chunk in its version of the file and sends that across to the sender. The sender then compares the rolling-sums for each chunk, and if they match also compares the MD5sums, and if they match it can safely assume these chunks haven’t changed. Now at some point, the rolling-sums might not match. At this point, the sender can start sliding the window of chunk S over subsequent bytes in its version and try to see the next matching rolling-sum. This operation will be fast due to the nature of the rolling-sum. Once it finds a matching rolling-sum, then the algorithm can proceed as before, and it can also mark the exact byte-sequence which needs to be inserted in the receiver’s version of the file.
The problem with gzipped files is that even if a single byte changes in the uncompressed file, then the rest of the contents of the compressed file from that byte onwards would change completely. This completely messes up rsync’s algorithm.
--rsyncable option - gzip supports this option and when you use it, it writes compressed files in a manner which is conducive to rsync’s algorithm. And what it does is use the same “rolling checksum” idea to define “chunk boundaries”. Once you move onto the next chunk, gzip resets its internal dictionary and starts compressing the next chunk freshly and so any changes to the bytes in the previous chunk don’t affect the compressed bytes in subsequent chunks. This idea is actually quite similar to the SequenceFile idea in Hadoop which lets you store data in compressed format but still be splittable so you can assign file-splits to mappers efficiently.
How are chunk boundaries identified? Whenever the last ‘n’ bits of the rolling checksum become all ‘0’ (statistically this has to happen at least once every 2^n times), and since you compute 1 rolling checksum for each new byte, you roughly get a chunk every 2^n bytes.
There’s an audio recording and a full transcript of an hour-long talk by Dr. Andrew Tridgell who developed the rsync algorithm as part of his PhD thesis. It goes into further detail about the core algorithm itself and is a very entertaining read.
Comprehensive list of things to think about both at the hardware and Linux level in order to run and maintain MySQL servers.
Some interesting things I didn’t know before:
Huffman {en|de}coding is really simple once you know how to build the Huffman tree.
This article provides an excellent overview of the current state of modern microprocessor design for ‘software engineers’.
Super-pipelining - Pipeline instructions aggressively - split complex sections within fetch-decode-execute-writeback into more fine-grained sub-tasks, and also increase the clock speed. Clock speed dictates how often an instruction moves ahead one step in the pipeline. Core 2 has a 14-stage pipeline; Pentium-4 has a 20-stage pipeline; ARM Cortex A9 (Apple A5) has a 8-stage pipeline.
Super-scalar - Execute multiple instructions in parallel in different functional units. Cortex A8/A9 are 2-issue while Core 2 is 4-issue.
Branch prediction - branch mispredictions could be really costly with deeper pipelines. Hence dynamic prediction with on-chip branch prediction table is performed in modern CPUs. Basically some space is reserved to keep count of which way a branch goes and favor that branch when it comes to pipelining.
In-order execution vs out-of-order execution - Pipelining is good but not all instructions are equal (some instructions require more clock cycles to complete than other simpler instructions). So some CPUs rearrange the order in which instructions are executed. But implementing this logic is expensive (lots of logic on chip + power consumption). Hence some CPUs just rely on the compilers to do this for them. A8 and Atom only do in-order while modern x86es do out-of-order execution.
Simultaneous Multi-Threading (SMT) aka Intel’s HyperThreading - At some point, it’s going to become hard to find instructions within the same thread/process to fill up the pipeline even with out-of-order execution because there’s a huge latency involved in load/store instructions which have to read/write from main-memory (100s of cycles). So realistically you can’t find more than 2 parallel instructions to execute over a long period of time. So the solution here is to find instructions from completely different threads and run them on the same processor. This needs only little bit of extra hardware to keep track of the 2 threads of execution but the rest of the pieces of logic on the chip can be shared. This is different from SMP machines where you’ve multiple independent processors and multi-core machines where you’re multiple cores on the same chip.
ext-2 and ext-3 lock a per-inode mutex for the duration of a write. This means that ext-2 and ext-3 do not allow concurrent writes to a file and that can prevent you from getting the write throughput you expect when you stripe a file over several disks with RAID. XFS does not do this which is one of the reasons many people prefer XFS for InnoDB.
—
XFS, ext and per-inode mutexes
I didn’t know this. So if you’re using InnoDB with O_DIRECT on a server with RAID, you’d really be well off using XFS. I’m guessing if you don’t use O_DIRECT, your writes just end up getting cached in the buffer cache, and the write throughout issue won’t be pronounced. Domas made a benchmark to test this behavior.
I’ve started using the following whenever I need to store sensitive stuff in Evernote/Dropbox/GMail/etc.
encrypt_aes128() {
pbpaste | openssl enc -e -aes128 -base64 -pass "pass:$1" | pbcopy
}
decrypt_aes128() {
pbpaste | openssl enc -d -aes128 -base64 -pass "pass:$1" | pbcopy
}
Great series of blog posts on some HBase GC optimization work by Todd Lipcon at Cloudera.
Problem: Long GC pauses were being observed for write-heavy HBase workloads on the RegionServer. One RegionServer is responsible for several regions, and all writes to a Region go to a MemStore which gets flushed to HDFS only after a certain threshold (which means the objects in MemStore make it to tenured generation). Assuming random distribution of writes to regions, memory gets fragmented in the tenured generation because memory for different regions gets interleaved.
Solution: Allocate memory for KVs in MemStore from Region-specific chunks. A chunk is merely just a contiguous byte array. So when a Region’s MemStore gets flushed, it doesn’t fragment memory much. Here’s the original JIRA which also has the patch.
Article introduces what minhashing is and proves that the probability of 2 sets being similar is actually equal to the probability of their minhashes matching. So you can actually calculate the minhashes of sets and use that to determine if the sets are similar/dissimilar without having to compare each and every element.
Bartosz Milewski writes a great article on how STMs are implemented at a high-level.
Proposed redesign of Hadoop by the Y! Hadoop team. In short, HDFS stays the same, but MapReduce becomes an application-level library, and so the existing JobTracker and TaskTrackers get replaced by more generic ResourceManager and NodeManagers.