http://www.cloudera.com/resource/hadoop-world-2011-presentation-slides-hadoop-and-performance →
Video of @tlipcon’s talk on Hadoop performance from Hadoop World 2011 is now available on the Cloudera website. Todd walks through several performance fixes he recently made to Hadoop, and it’s an interesting list of common perf optimization tricks:
- keeping TCP connections alive in HDFS rather than establishing a new one each time
- making Hadoop behave better with the Linux pagecache (using fadvise(DONTNEED) so HDFS blocks don’t end up in the pagecache unnecessarily)
- making the sort implementation better in terms of cache locality. The original implementation had pointers to keys spread all over the heap, so reading in a key to compare it with another one would pull a completely new cache line in from memory. In the improved version, they cache the first 4 bytes of each key alongside the pointer to that key, in the same array. This way the first 4 bytes alone can settle most comparisons, and since they sit right next to the pointers, they have great cache locality.
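
To make the last trick concrete, here is a minimal sketch of the idea in Java. It is not Hadoop’s actual code: the class name `PrefixSort`, the helper names, and the choice to pack each 4-byte prefix and key index into a single `long` are all assumptions made for illustration. The point it shows is that most comparisons only touch the packed array, and the full keys are read only when two prefixes tie.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; names and layout are not taken from Hadoop.
class PrefixSort {
    // First 4 bytes of the key, big-endian, zero-padded if the key is shorter.
    static long prefix(byte[] key) {
        long p = 0;
        for (int i = 0; i < 4; i++) {
            p <<= 8;
            if (i < key.length) p |= (key[i] & 0xff);
        }
        return p; // always in [0, 2^32)
    }

    // Full unsigned lexicographic comparison; only needed when prefixes tie.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Each entry holds the prefix in the high 32 bits and the key index in the low 32.
    static int compare(long x, long y, byte[][] keys) {
        long px = x >>> 32, py = y >>> 32;
        if (px != py) return Long.compare(px, py);          // fast path: no key access
        return compareBytes(keys[(int) x], keys[(int) y]);  // slow path: follow the "pointers"
    }

    // Plain quicksort (Lomuto partition) over the packed entries; hi is inclusive.
    static void quicksort(long[] e, byte[][] keys, int lo, int hi) {
        if (lo >= hi) return;
        long pivot = e[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (compare(e[j], pivot, keys) < 0) {
                long t = e[i]; e[i] = e[j]; e[j] = t;
                i++;
            }
        }
        long t = e[i]; e[i] = e[hi]; e[hi] = t;
        quicksort(e, keys, lo, i - 1);
        quicksort(e, keys, i + 1, hi);
    }

    // Returns the indices of keys in sorted order.
    static int[] sortedOrder(byte[][] keys) {
        long[] entries = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            entries[i] = (prefix(keys[i]) << 32) | i;
        }
        quicksort(entries, keys, 0, entries.length - 1);
        int[] order = new int[entries.length];
        for (int i = 0; i < order.length; i++) order[i] = (int) entries[i];
        return order;
    }

    public static void main(String[] args) {
        String[] words = {"pear", "apply", "apple", "fig", "pea"};
        byte[][] keys = new byte[words.length][];
        for (int i = 0; i < words.length; i++) {
            keys[i] = words[i].getBytes(StandardCharsets.UTF_8);
        }
        for (int idx : sortedOrder(keys)) System.out.println(words[idx]);
    }
}
```

In this example only “apply” and “apple” share a 4-byte prefix, so that is the only pair that falls back to comparing the full keys; every other comparison is decided from the packed array alone.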