KodeKabuki

Welcome, my name is Harish Mallipeddi. I work for Amazon Web Services (AWS). This blog is mostly a dump of interesting articles that I come across on the web. Topics span across multiple areas including algorithms/datastructures, NoSQL stores, database internals, web-scale challenges, and functional languages.

January 27, 2012 at 4:27am

0 notes

Absolute consistency →

Excellent rant on “consistency” in Dynamo-inspired systems like Riak.

January 3, 2012 at 11:17am

0 notes

DLNA with a Panasonic Viera

I bought a Panasonic Viera LED TV before the holidays and the TV is DLNA certified. I finally figured out how to get DLNA to work properly with the TV.

Serviio

I installed Serviio on a Linux (Ubuntu) box. Serviio is an Opensource DLNA server written in Java. Running it launches a daemon in the background which listens for requests on a specific port, and it also ships with a console app to configure it.

The TV should automatically discover the DLNA server, and you should be able to browse the shared folders on the server. I was able to play some of the videos directly without any problems, but playing other videos failed.

“Cannot read file” errors

Serviio supposedly uses ffmpeg underneath to transcode the videos on-the-fly if necessary. I couldn’t get this to work properly. I basically ended up manually transcoding videos into the NTSC-DVD format with ffmpeg and then dropping the transcoded videos into the shared folder. Now the videos started working fine. I was able to pause/skip videos correctly as well.

ffmpeg -i source.avi -target ntsc-dvd -ps 2000000000 -aspect 16:9 target.mpeg

December 6, 2011 at 1:31am

0 notes

http://blog.corensic.com/2011/11/28/virtual-machines-memory/ →

Great article describing how virtual->physical address translation works in x86.

Summary

  • Each process gets its own private page table. The pointer to this page table will be stored in a special CR3 register upon a context-switch.
  • Walking the private page tables is expensive. So the entries are cached in a TLB. But that means during a context-switch, you’d to flush the entire TLB cache.
  • Newer AMD and Intel CPUs, apparently attach a Address-Space Identifier (ASID) to every TLB entry. That way no invalidation is required; you just skip the entries in the TLB with ASIDs that don’t match the ASID corresponding to the currently running process.

The author has another article explaining how this works when you’re running virtualized operating systems on top of a hypervisor.

December 3, 2011 at 11:26am

0 notes

http://www.cloudera.com/resource/hadoop-world-2011-presentation-slides-hadoop-and-performance →

Video of @tlipcon’s talk on Hadoop Performance from Hadoop World 2011 is now available on the Cloudera website. Todd walks through a bunch of performance fixes he did to Hadoop recently, and it’s an interesting list of common perf optimization tricks:

  • keeping TCP conns alive in HDFS rather than establishing new ones each time
  • making Hadoop behave better with the Linux pagecache (using fadvise(DONTNEED) so HDFS blocks don’t end up in the pagecache unnecessarily)
  • making the sort implementation better in terms of cache locality. The original implementation had pointers to keys spread all over the heap, and so reading in a key to compare it with something else would result in a completely new cache line being faulted in from memory. In the improvised version, they cache the first 4 bytes of the keys along with the pointers to the keys themselves in the same array. This way the first 4 bytes themselves can be used to do most comparisons and since they’re right next to the pointers, they’ve great cache locality.

October 7, 2011 at 2:23pm

0 notes

Rsync internals

I’m officially back from a two month blogging hiatus. A few weeks ago, I got really curious about how rsync works and decided to dig deeper.

The objective of the rsync algorithm is to minimize the amount of data that needs to be sent over the wire in order to sync two versions of a file. The algorithm assumes sender has the latest version and we want to override the version on the receiver.

You say merkle tree? Maybe not…

The first technique that came to my mind is what copy-on-write systems do - build a hash tree (merkle tree) over the contents of the file by dividing the file into fixed-size chunks. This works fine in systems where the chunks are fixed-size and chunk boundaries don’t change due to inserts/deletes. In rsync’s case, the user could have inserted (or deleted) bytes into the middle of the file, and if you use fixed-size chunks, then the hashes of the chunks from that point onwards would not match with the hashes on the receiver’s end.

Rolling checksums

The trick which rsync has up its sleeve is the use of rolling checksums. A rolling checksum is a checksum algorithm which given the checksum for byte range [n, n+S-1] for chunk size S, byte n, and byte n+S, can compute the checksum for the byte-range [n+1, n+S] very efficiently. In other words, you can keep computing checksums for a sliding window over a byte stream very efficiently. Rsync actually uses the Adler-32 checksum which is also used in zlib.

But a rolling checksum is cryptographically weak. So just to be safe rsync uses both the MD5 hash and the rolling checksum for each chunk of size S. The receiver computes MD5+rolling-sum for each chunk in its version of the file and sends that across to the sender. The sender then compares the rolling-sums for each chunk, and if they match also compares the MD5sums, and if they match it can safely assume these chunks haven’t changed. Now at some point, the rolling-sums might not match. At this point, the sender can start sliding the window of chunk S over subsequent bytes in its version and try to see the next matching rolling-sum. This operation will be fast due to the nature of the rolling-sum. Once it finds a matching rolling-sum, then the algorithm can proceed as before, and it can also mark the exact byte-sequence which needs to be inserted in the receiver’s version of the file.

Gzip & rsync

The problem with gzipped files is that even if a single byte changes in the uncompressed file, then the rest of the contents of the compressed file from that byte onwards would change completely. This completely messes up rsync’s algorithm.

--rsyncable option - gzip supports this option and when you use it, it writes compressed files in a manner which is conducive to rsync’s algorithm. And what it does is use the same “rolling checksum” idea to define “chunk boundaries”. Once you move onto the next chunk, gzip resets its internal dictionary and starts compressing the next chunk freshly and so any changes to the bytes in the previous chunk don’t affect the compressed bytes in subsequent chunks. This idea is actually quite similar to the SequenceFile idea in Hadoop which lets you store data in compressed format but still be splittable so you can assign file-splits to mappers efficiently.

How are chunk boundaries identified? Whenever the last ‘n’ bits of the rolling checksum become all ‘0’ (statistically this has to happen at least once every 2^n times), and since you compute 1 rolling checksum for each new byte, you roughly get a chunk every 2^n bytes.

More info

There’s an audio recording and a full transcript of an hour-long talk by Dr. Andrew Tridgell who developed the rsync algorithm as part of his PhD thesis. It goes into further detail about the core algorithm itself and is a very entertaining read.

July 28, 2011 at 2:48pm

0 notes

http://www.scribd.com/doc/53197944/Linux-and-H-W-optimizations-for-MySQL →

Comprehensive list of things to think about both at the hardware and Linux level in order to run and maintain MySQL servers.

Some interesting things I didn’t know before:

  • nobarrier mount option - barriers are pretty similar to memory barriers, and are useful for filesystem journal updates. If the disk has a volatile cache (without a BBU), and if it tries to reorder writes, then bad things can happen during crashes. If there’s a BBU, then barrier can be safely disabled in exchange for better write performance. Also ext4 has journal checksumming unlike ext3 which safeguards against these scenarios.
  • Use RAID10+HDD for sequential writes (redo log, bin log, double-write buffer) and SSD for random I/O (data files, undo log tablespace).
  • When you set innodb_flush_method=O_DIRECT, only the data files are opened with O_DIRECT. InnoDB redo logs, bin-logs, etc are still opened normally and hence get cached in the page cache. This is done because O_DIRECT enforces the 512-byte alignment restriction for all file I/O.
  • BBCs need to recharge every 90 days typically. Instantly performance goes down, because you switch from write-back to write-through mode. Solution: use FBWC (Flash-backed write cache).
  • Appending + fsync() is much slower than overwriting + fsync()
  • iostat’s %util = (r/s + w/s) * svctm Checking svctm is more useful to see if the same I/O ops are taking more time now than before.

6:21am

0 notes

A quick tutorial on generating a huffman tree →

Huffman {en|de}coding is really simple once you know how to build the Huffman tree.

June 30, 2011 at 11:58am

0 notes

http://www.lighterra.com/papers/modernmicroprocessors/ →

This article provides an excellent overview of the current state of modern microprocessor design for ‘software engineers’.

Key concepts/ideas

  • Super-pipelining - Pipeline instructions aggressively - split complex sections within fetch-decode-execute-writeback into more fine-grained sub-tasks, and also increase the clock speed. Clock speed dictates how often an instruction moves ahead one step in the pipeline. Core 2 has a 14-stage pipeline; Pentium-4 has a 20-stage pipeline; ARM Cortex A9 (Apple A5) has a 8-stage pipeline.

  • Super-scalar - Execute multiple instructions in parallel in different functional units. Cortex A8/A9 are 2-issue while Core 2 is 4-issue.

  • Branch prediction - branch mispredictions could be really costly with deeper pipelines. Hence dynamic prediction with on-chip branch prediction table is performed in modern CPUs. Basically some space is reserved to keep count of which way a branch goes and favor that branch when it comes to pipelining.

  • In-order execution vs out-of-order execution - Pipelining is good but not all instructions are equal (some instructions require more clock cycles to complete than other simpler instructions). So some CPUs rearrange the order in which instructions are executed. But implementing this logic is expensive (lots of logic on chip + power consumption). Hence some CPUs just rely on the compilers to do this for them. A8 and Atom only do in-order while modern x86es do out-of-order execution.

  • Simultaneous Multi-Threading (SMT) aka Intel’s HyperThreading - At some point, it’s going to become hard to find instructions within the same thread/process to fill up the pipeline even with out-of-order execution because there’s a huge latency involved in load/store instructions which have to read/write from main-memory (100s of cycles). So realistically you can’t find more than 2 parallel instructions to execute over a long period of time. So the solution here is to find instructions from completely different threads and run them on the same processor. This needs only little bit of extra hardware to keep track of the 2 threads of execution but the rest of the pieces of logic on the chip can be shared. This is different from SMP machines where you’ve multiple independent processors and multi-core machines where you’re multiple cores on the same chip.

June 25, 2011 at 5:58am

0 notes

ext-2 and ext-3 lock a per-inode mutex for the duration of a write. This means that ext-2 and ext-3 do not allow concurrent writes to a file and that can prevent you from getting the write throughput you expect when you stripe a file over several disks with RAID. XFS does not do this which is one of the reasons many people prefer XFS for InnoDB.

— 

XFS, ext and per-inode mutexes

I didn’t know this. So if you’re using InnoDB with O_DIRECT on a server with RAID, you’d really be well off using XFS. I’m guessing if you don’t use O_DIRECT, your writes just end up getting cached in the buffer cache, and the write throughout issue won’t be pronounced. Domas made a benchmark to test this behavior.

June 10, 2011 at 9:23am

0 notes

Recipe: encrypt/decrypt clipboard contents on OS X

I’ve started using the following whenever I need to store sensitive stuff in Evernote/Dropbox/GMail/etc.


encrypt_aes128() {
    pbpaste | openssl enc -e -aes128 -base64 -pass "pass:$1" | pbcopy
}
decrypt_aes128() {
    pbpaste | openssl enc -d -aes128 -base64 -pass "pass:$1" | pbcopy
}

June 1, 2011 at 10:50pm

0 notes

http://papilio.cc →

An Arduino-like project for FPGA programming.

March 31, 2011 at 1:09pm

0 notes

http://www.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/ →

Great series of blog posts on some HBase GC optimization work by Todd Lipcon at Cloudera.

Problem: Long GC pauses were being observed for write-heavy HBase workloads on the RegionServer. One RegionServer is responsible for several regions, and all writes to a Region go to a MemStore which gets flushed to HDFS only after a certain threshold (which means the objects in MemStore make it to tenured generation). Assuming random distribution of writes to regions, memory gets fragmented in the tenured generation because memory for different regions gets interleaved.

Solution: Allocate memory for KVs in MemStore from Region-specific chunks. A chunk is merely just a contiguous byte array. So when a Region’s MemStore gets flushed, it doesn’t fragment memory much. Here’s the original JIRA which also has the patch.

February 22, 2011 at 2:32pm

0 notes

http://www.toao.com/posts/finding-similar-items-key-store-minhashing.html →

Article introduces what minhashing is and proves that the probability of 2 sets being similar is actually equal to the probability of their minhashes matching. So you can actually calculate the minhashes of sets and use that to determine if the sets are similar/dissimilar without having to compare each and every element.

February 15, 2011 at 2:26pm

0 notes

http://bartoszmilewski.wordpress.com/2010/09/11/beyond-locks-software-transactional-memory/ →

Bartosz Milewski writes a great article on how STMs are implemented at a high-level.

12:16pm

0 notes

http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/ →

Proposed redesign of Hadoop by the Y! Hadoop team. In short, HDFS stays the same, but MapReduce becomes an application-level library, and so the existing JobTracker and TaskTrackers get replaced by more generic ResourceManager and NodeManagers.