KodeKabuki

Welcome, my name is Harish Mallipeddi. I work for Amazon Web Services (AWS). This blog is mostly a dump of interesting articles that I come across on the web. Topics span multiple areas, including algorithms and data structures, NoSQL stores, database internals, web-scale challenges, and functional languages.

December 3, 2011 at 11:26am


http://www.cloudera.com/resource/hadoop-world-2011-presentation-slides-hadoop-and-performance →

Video of @tlipcon’s talk on Hadoop performance from Hadoop World 2011 is now available on the Cloudera website. Todd walks through a number of performance fixes he recently made to Hadoop, and the result is an interesting list of common perf optimization tricks:

  • keeping TCP connections alive in HDFS rather than establishing a new one each time
  • making Hadoop behave better with the Linux page cache (using posix_fadvise with POSIX_FADV_DONTNEED so HDFS blocks don’t linger in the page cache unnecessarily)
  • giving the sort implementation better cache locality. The original implementation had pointers to keys spread all over the heap, so reading in a key to compare it with another one typically meant pulling a completely new cache line in from memory. In the improved version, they cache the first 4 bytes of each key alongside the pointer to the key in the same array. Those 4 bytes are enough to decide most comparisons, and since they sit right next to the pointers, they have great cache locality.
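
To make the sort trick concrete, here is a minimal Java sketch of the idea. It is not Hadoop’s actual code: the class PrefixSort, its method names, and the interleaved int[] layout are all made up for illustration, and a plain quicksort stands in for whatever sort Hadoop really uses. Each record in the array holds a key’s first 4 bytes followed by the key’s index, so most comparisons only touch the array and never follow the "pointer" to the key.

```java
final class PrefixSort {
    // Pack the first 4 bytes of a key into an int (big-endian, zero-padded
    // for short keys), so unsigned int order matches byte order.
    static int prefix(byte[] key) {
        int p = 0;
        for (int i = 0; i < 4; i++) {
            int b = i < key.length ? (key[i] & 0xFF) : 0;
            p = (p << 8) | b;
        }
        return p;
    }

    // Slow path: full unsigned lexicographic comparison of two keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Record i lives at meta[2*i] (cached prefix) and meta[2*i + 1] (key index).
    static int compare(int[] meta, byte[][] keys, int i, int j) {
        int c = Integer.compareUnsigned(meta[2 * i], meta[2 * j]);
        if (c != 0) return c; // decided without touching the keys themselves
        return compareBytes(keys[meta[2 * i + 1]], keys[meta[2 * j + 1]]);
    }

    static void swap(int[] meta, int i, int j) {
        int p = meta[2 * i], k = meta[2 * i + 1];
        meta[2 * i] = meta[2 * j];
        meta[2 * i + 1] = meta[2 * j + 1];
        meta[2 * j] = p;
        meta[2 * j + 1] = k;
    }

    // Plain quicksort (Lomuto partition, middle element as pivot) over records.
    static void quicksort(int[] meta, byte[][] keys, int lo, int hi) {
        if (lo >= hi) return;
        swap(meta, lo + (hi - lo) / 2, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (compare(meta, keys, i, hi) < 0) {
                swap(meta, i, store);
                store++;
            }
        }
        swap(meta, store, hi);
        quicksort(meta, keys, lo, store - 1);
        quicksort(meta, keys, store + 1, hi);
    }

    // Returns the indices of keys in ascending key order.
    static int[] sortedOrder(byte[][] keys) {
        int n = keys.length;
        int[] meta = new int[2 * n];
        for (int i = 0; i < n; i++) {
            meta[2 * i] = prefix(keys[i]);
            meta[2 * i + 1] = i;
        }
        quicksort(meta, keys, 0, n - 1);
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = meta[2 * i + 1];
        return order;
    }
}
```

Calling PrefixSort.sortedOrder(keys) returns the key indices in sorted order. Only keys that share their first 4 bytes (say, "peach" and "peace") fall through to the full byte-by-byte comparison; everything else is settled by comparing two ints that already sit in the array being sorted.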