January 2012
2 posts
Absolute consistency →
Excellent rant on “consistency” in Dynamo-inspired systems like Riak.
DLNA with a Panasonic Viera
I bought a Panasonic Viera LED TV before the holidays and the TV is DLNA certified. I finally figured out how to get DLNA to work properly with the TV.
Serviio
I installed Serviio on a Linux (Ubuntu) box. Serviio is an open source DLNA server written in Java. Running it launches a daemon in the background which listens for requests on a specific port, and it also ships with a console app to...
December 2011
2 posts
http://blog.corensic.com/2011/11/28/virtual-machine... →
Great article describing how virtual->physical address translation works in x86.
Summary
Each process gets its own private page table. A pointer to this page table is loaded into the special CR3 register on a context switch.
Walking the private page tables is expensive, so the entries are cached in a TLB. But that means during a context switch, you’d have to flush the entire TLB...
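A toy model of the two points above, assuming classic 32-bit x86 paging (no PAE, 4 KB pages), where a virtual address splits into a 10-bit directory index, a 10-bit table index and a 12-bit offset. `ToyMMU`, its fields and the nested-dict page tables are made up for illustration; real hardware does this walk in silicon.

```python
PAGE_SIZE = 4096  # 4 KB pages


def split_virtual_address(vaddr):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    return (vaddr >> 22) & 0x3FF, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF


class ToyMMU:
    """Hypothetical model: cr3 points at a per-process page directory."""

    def __init__(self):
        self.cr3 = None   # page directory of the running process
        self.tlb = {}     # virtual page number -> physical frame number
        self.walks = 0    # how many (expensive) page-table walks happened

    def context_switch(self, page_directory):
        # Load the new process's page directory and flush every cached entry.
        self.cr3 = page_directory
        self.tlb.clear()

    def translate(self, vaddr):
        vpn = vaddr >> 12
        if vpn not in self.tlb:
            # TLB miss: walk directory -> table -> frame.
            dir_idx, tbl_idx, _ = split_virtual_address(vaddr)
            self.walks += 1
            self.tlb[vpn] = self.cr3[dir_idx][tbl_idx]
        return self.tlb[vpn] * PAGE_SIZE + (vaddr & 0xFFF)
```

Translating the same address twice walks the tables only once. After `context_switch`, the same virtual address has to be walked again and may map to a different frame.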
http://www.cloudera.com/resource/hadoop-world-2011-... →
Video of @tlipcon’s talk on Hadoop Performance from Hadoop World 2011 is now available on the Cloudera website. Todd walks through a bunch of performance fixes he did to Hadoop recently, and it’s an interesting list of common perf optimization tricks:
keeping TCP conns alive in HDFS rather than establishing new ones each time
making Hadoop behave better with the Linux pagecache...
October 2011
1 post
Rsync internals
I’m officially back from a two month blogging hiatus. A few weeks ago, I got really curious about how rsync works and decided to dig deeper.
The objective of the rsync algorithm is to minimize the amount of data that needs to be sent over the wire in order to sync two versions of a file. The algorithm assumes the sender has the latest version and we want to overwrite the version on the...
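The excerpt is cut off before the details, so what follows is only a rough sketch of the idea as described in the published rsync algorithm: the receiver checksums fixed-size blocks of its old file, and the sender slides a window over the new file looking for blocks the receiver already has. The weak checksum can be updated in O(1) per byte. Function names, the tiny block size and the use of MD5 as the strong hash are assumptions made for illustration; this is not rsync's actual code.

```python
import hashlib

M = 1 << 16  # modulus for both halves of the weak checksum


def weak_checksum(block):
    """Compute the (a, b) weak checksum of a block from scratch."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def signatures(old, n):
    """Receiver side: weak checksum -> [(strong hash, block index)]."""
    table = {}
    for idx in range(len(old) // n):
        block = old[idx * n:(idx + 1) * n]
        entry = (hashlib.md5(block).digest(), idx)
        table.setdefault(weak_checksum(block), []).append(entry)
    return table


def delta(new, table, n):
    """Sender side: emit ("copy", block index) or ("data", literal bytes)."""
    ops, literal, i = [], bytearray(), 0
    ck = weak_checksum(new[:n]) if len(new) >= n else None
    while i + n <= len(new):
        match = None
        for strong, idx in table.get(ck, ()):
            # Confirm a weak hit with the strong hash.
            if hashlib.md5(new[i:i + n]).digest() == strong:
                match = idx
                break
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            i += n
            if i + n <= len(new):
                ck = weak_checksum(new[i:i + n])
        else:
            literal.append(new[i])
            if i + n < len(new):
                ck = roll(ck[0], ck[1], new[i], new[i + n], n)
            i += 1
    literal.extend(new[i:])
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old, ops, n):
    """Receiver side: rebuild the new file from old blocks plus literals."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * n:(arg + 1) * n] if kind == "copy" else arg
    return bytes(out)
```

Only the `("data", ...)` operations carry file contents over the wire. Every `("copy", ...)` is just a block index, which is where the savings come from.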
July 2011
2 posts
http://www.scribd.com/doc/53197944/Linux-and-H-W-op... →
Comprehensive list of things to think about both at the hardware and Linux level in order to run and maintain MySQL servers.
Some interesting things I didn’t know before:
nobarrier mount option - barriers are pretty similar to memory barriers, and are useful for filesystem journal updates. If the disk has a volatile cache (without a BBU), and if it tries to reorder writes, then bad...
A quick tutorial on generating a Huffman tree →
Huffman {en|de}coding is really simple once you know how to build the Huffman tree.
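A minimal sketch of that claim, not taken from the linked tutorial: build the tree by repeatedly merging the two least frequent nodes, then read codes off the tree. Leaves are single characters and internal nodes are `(left, right)` tuples. All function names here are made up for illustration.

```python
import heapq
from collections import Counter


def build_huffman_tree(text):
    """Return a tree where leaves are symbols and inner nodes are (left, right)."""
    if not text:
        raise ValueError("cannot build a Huffman tree for empty input")
    # Heap entries are (frequency, tie_breaker, node); the tie breaker keeps
    # Python from ever comparing two nodes directly.
    heap = [(freq, i, sym) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    return heap[0][2]


def code_table(node, prefix="", table=None):
    """Walk the tree: left edge appends '0', right edge appends '1'."""
    table = {} if table is None else table
    if isinstance(node, tuple):
        code_table(node[0], prefix + "0", table)
        code_table(node[1], prefix + "1", table)
    else:
        table[node] = prefix or "0"  # single-symbol input still needs one bit
    return table


def encode(text, table):
    return "".join(table[ch] for ch in text)


def decode(bits, tree):
    if not isinstance(tree, tuple):  # single-symbol tree
        return tree * len(bits)
    out, node = [], tree
    for bit in bits:
        node = node[int(bit)]
        if not isinstance(node, tuple):  # reached a leaf
            out.append(node)
            node = tree
    return "".join(out)
```

Once the tree exists, encoding is a table lookup and decoding is a walk from the root that restarts at every leaf.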
June 2011
4 posts
http://www.lighterra.com/papers/modernmicroprocesso... →
This article provides an excellent overview of the current state of modern microprocessor design for ‘software engineers’.
Key concepts/ideas
Super-pipelining - Pipeline instructions aggressively - split complex sections within fetch-decode-execute-writeback into more fine-grained sub-tasks, and also increase the clock speed. Clock speed dictates how often an instruction moves...
ext-2 and ext-3 lock a per-inode mutex for the duration of a write. This means...
– XFS, ext and per-inode mutexes
I didn’t know this. So if you’re using InnoDB with O_DIRECT on a server with RAID, you’d really be better off using XFS. I’m guessing if you don’t use O_DIRECT, your writes just end up getting cached in the buffer cache, and the write...
Recipe: encrypt/decrypt clipboard contents on OS X
I’ve started using the following whenever I need to store sensitive stuff in Evernote/Dropbox/GMail/etc.
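# Encrypt the clipboard contents with AES-128 using the passphrase in $1,
# base64-encode the result, and put it back on the clipboard.
encrypt_aes128() {
  pbpaste | openssl enc -e -aes128 -base64 -pass "pass:$1" | pbcopy
}
# Reverse of the above: decrypt the base64 ciphertext on the clipboard.
decrypt_aes128() {
  pbpaste | openssl enc -d -aes128 -base64 -pass "pass:$1" | pbcopy
}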
http://papilio.cc →
An Arduino-like project for FPGA programming.
March 2011
1 post
http://www.cloudera.com/blog/2011/03/avoiding-full-... →
Great series of blog posts on some HBase GC optimization work by Todd Lipcon at Cloudera.
Problem: Long GC pauses were being observed for write-heavy HBase workloads on the RegionServer. One RegionServer is responsible for several regions, and all writes to a Region go to a MemStore which gets flushed to HDFS only after a certain threshold (which means the objects in MemStore make it to tenured...
February 2011
4 posts
http://www.toao.com/posts/finding-similar-items-key... →
The article introduces minhashing and shows that the probability of two sets’ minhashes matching is equal to their Jaccard similarity. So you can calculate the minhashes of the sets and use them to estimate how similar the sets are without having to compare each and every element.
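A small sketch of the idea, not taken from the linked article: hash every element under many seeded hash functions, keep the minimum per function, and count how often two signatures agree. The seeded SHA-1 construction and the function names are assumptions chosen to keep the example deterministic.

```python
import hashlib


def _seeded_hash(seed, item):
    """One of a family of hash functions, selected by seed (illustrative)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each hash function, keep the smallest hash seen over the set."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a, b):
    """Exact similarity, for comparison: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)
```

With `a = set(range(100))` and `b = set(range(50, 150))` the exact Jaccard similarity is 1/3, and `estimate_similarity` should land close to that. More hash functions tighten the estimate at the cost of longer signatures.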
http://bartoszmilewski.wordpress.com/2010/09/11/bey... →
Bartosz Milewski writes a great article on how STMs are implemented at a high-level.
http://developer.yahoo.com/blogs/hadoop/posts/2011/... →
Proposed redesign of Hadoop by the Y! Hadoop team. In short, HDFS stays the same, but MapReduce becomes an application-level library, and so the existing JobTracker and TaskTrackers get replaced by more generic ResourceManager and NodeManagers.
If your ideas are not being rejected at least 50% of the time, you are playing...
– Summation: Dealing with rejection is a core competency
January 2011
2 posts
http://www.moserware.com/2010/03/computing-your-ski... →
High-level description of the math behind Microsoft’s TrueSkill algorithm used in Xbox player scores.
http://www.javalimit.com/2011/01/understanding-vect... →
The best explanation of vector clocks I’ve seen so far.
December 2010
2 posts
http://www.infoq.com/presentations/LMAX →
Very insightful talk by a couple of engineers from LMAX (UK). They build high-throughput, low-latency financial systems in Java. They go to extreme lengths to avoid using locks by relying on CAS+memory-barriers from JMM for concurrency control and avoiding cache misses. Read the comments below the video on the InfoQ page as well.
They emphasize the importance of having “mechanical...
http://calendar.perfplanet.com/2010/the-full-stack/... →
Well-written article by Carlos Bueno (who works for Facebook) describing how a full-stack programmer would approach thinking/reasoning about a large-scale system (in this case a web application but this sorta thinking can be applied to reason about any system).
Acquiring full-stack experience & knowing the internals of everything is immensely powerful when designing systems. You probably...
November 2010
2 posts
http://blog.tsunanet.net/2010/11/how-long-does-it-t... →
Interesting article - the author sets out to write a micro-benchmark to measure the cost of a context switch in Linux on different x86 hardware.
The cost of a context switch cannot be measured simply by making syscalls to enter/leave kernel mode because in modern Linux kernels that apparently doesn’t cause a full context switch.
Benoit decides to use futexes - parent and child processes waiting...
MySQL TechTalk @ Facebook
I found this recording of a talk that Facebook hosted recently. Their MySQL team presents a bunch of interesting projects they’ve worked on at Facebook. I’ve been doing a lot of MySQL-related projects at work as well.
There’s an interesting section where one of the engineers mentions his...
October 2010
3 posts
Core dumps on OS X
Core dumps are switched off by default.
Make OS X do a core-dump upon a segmentation fault:
ulimit -c unlimited
Unlike Linux, in OS X core dumps end up in /cores instead of the cwd.
gdb /path/to/your/binary /cores/core.XYZ
Sharding gone wrong →
Foursquare Ops published a post mortem of their recent outage on their blog. The post made by a MongoDB engineer in the Google group is actually more interesting since it reveals more of the technical details, and what went wrong specifically because of MongoDB’s behavior.
In bureaucracies many people have the authority to say no, not the authority to...
– John Sculley On Steve Jobs, The Full Interview Transcript | Cult of Mac
September 2010
4 posts
http://jcole.us/blog/archives/2010/09/28/mysql-swap... →
The article discusses how Linux handles memory in a NUMA system, especially when you have a single process (like mysqld) trying to take up 90% of the physical memory on the box.
http://www.tokyohackerspace.org/akihabara/ →
Hacker heaven - video tour of interesting shops in the Akihabara district of Tokyo
http://blog.extracheese.org/2010/05/the-tar-pipe.ht... →
Fascinating 10-minute tour of everything that goes on in Unix when you type (cd src && tar -cf - .) | (cd dest && tar -xpf -) in a bash terminal.
http://www.yosefk.com/blog/my-history-with-forth-st... →
Yossi Kreinin writes about his experiences with the Forth programming language and concludes that he wasn’t able to ever scale Forth to solve a real-life problem. I tried to teach myself Factor, a Forth-inspired modern stack language. I quickly came to the same conclusion. The simplicity of implementing a naive interpreter for a stack language was very exciting but unlike when I learnt...
August 2010
5 posts
JRuby Hacking Guide (RubyKaigi talk by @nahi) →
Also check out his JRuby Source Reading Guide.
Zippers →
Fun article on zippers, purely functional data structures. The article specifically focuses on implementing zipper lists, and zipper trees in Erlang.
…So we used a series of large scale genetic optimization tests running...
– MailChimp’s Project Omnivore: Declassified | MailChimp Email Marketing Blog
Another interesting application of Hadoop in machine learning for BigData crunching. I need to learn more about genetic optimization algorithms.
The goal of Anonymouse is to selectively exclude data from the cookies we drop...
– Anonymouse | Engineering Rapleaf
The article describes how Rapleaf is trying to automate the process of anonymizing user data (from cookies). Read the follow-up post in which they describe how they sped-up ‘superset queries’ (get all supersets containing the given set) to implement...
Hadoop Summit 2010 recordings →
Videos & slides from Hadoop Summit 2010 are now available.
July 2010
3 posts
MongoDB
There are lots of MongoDB-related posts everywhere lately. Here are three important things you should know before delving into MongoDB:
MongoDB uses memory-mapped files for disk I/O. On 32-bit systems, you’ll quickly exceed the process size limit (4GB). So you should always run MongoDB in 64-bit in production.
MongoDB does not have single-node durability yet (they’re reportedly...
Getting memprof to work on Snow Leopard
memprof is a memory-profiler for Ruby (specifically MRI). It’s great for performance analysis, debugging memory leak issues, etc. But unfortunately it doesn’t work with the default Ruby that ships with Snow Leopard since there are no debug symbols installed. There are a few other restrictions at this point - memprof actually rewrites x86 binary in memory which is why it’s...
Wikipedia - Write amplification & TRIM command on... →
Also good reads: benchmarking SSD performance on Windows 7 and OS X with and without TRIM.
Earlier versions of Windows (pre-Windows 7) don’t have TRIM support apparently. The article doesn’t seem to have a good explanation for why performance in OS X doesn’t degrade due to write amplification even though TRIM hasn’t been set (at least according to the System Profiler app...
June 2010
2 posts
when you don’t create things, you become defined by your tastes rather...
– Just saw this tweet by _why in a RailsConf ‘10 talk by Neal Ford. It’s been two months since I started doing full-time Ruby development (at ZumoDrive). I’ve seen some beautiful code, I’ve cursed meta-programming (someone overrode something inside a gem somewhere and...
git-difftool & git-mergetool - using Araxis on OS...
We use GitHub at work for all projects. One of the things I’ve always missed while using git is the ability to see 2 files side-by-side when I view diffs. The minimal context that you get when you do git diff | mate is just not enough sometimes. I discovered a solution to this today.
Git supports external diff-tool and merge-tools. In fact, there’s a git-difftool and a git-mergetool...
May 2010
4 posts
MRI's memory allocation behaviour →
Insightful article. In short, MRI maintains its own heap space to store meta-data about Ruby objects. This heap space is divided into multiple groups and each group has slots of size 20 bytes (32-bit)/40 bytes (64-bit) each. Each slot is meant to hold a C-struct called RVALUE. The number of slots allocated increases by a factor of 1.8 with each iteration, beginning with 10,000+1 slots in the first. Objects...
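To get a feel for how quickly that grows, here is a back-of-the-envelope sketch using only the numbers quoted above. It assumes each new allocation is the previous slot count times 1.8, truncated to an integer; the truncation is an assumption, not something the excerpt states. `heap_slot_counts` and `slot_bytes` are hypothetical helper names, not MRI functions.

```python
def heap_slot_counts(iterations, initial=10_001, factor=1.8):
    """Slots allocated in each successive heap growth step (assumed truncation)."""
    counts, slots = [], initial
    for _ in range(iterations):
        counts.append(slots)
        slots = int(slots * factor)
    return counts


def slot_bytes(slots, slot_size=40):
    """Memory taken by the slots alone: 40 bytes each on 64-bit, 20 on 32-bit."""
    return slots * slot_size


if __name__ == "__main__":
    for step, slots in enumerate(heap_slot_counts(5), start=1):
        print(step, slots, slot_bytes(slots))
```

This counts only the RVALUE slots themselves, so it is a lower bound on what the process actually uses.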
http://www.10gen.com/event_mongosf_10apr30 →
Slides/videos from MongoSF 2010
http://www.infoq.com/presentations/Sustainable-Test... →
Great talk titled “Sustainable TDD” by Steve Freeman (co-author of jMock).
April 2010
2 posts
To me, CAP should really be PACELC —- if there is a partition (P) how does...
– DBMS Musings: Problems with CAP, and Yahoo’s little known NoSQL system
http://kotaku.com/5517715/watch-civilizations-creat... →
Documentary: Sid Meier doing a 48 hour hackathon with university students
March 2010
2 posts
After examining the write syscall sizes, I also like to examine the write...
– IO Profiling of Applications: strace_analyzer via @phunt
Article explaining how to use strace plus a bunch of scripts to do statistical analysis for understanding the I/O characteristics of a given application.
RubyConf India 2010
I attended RubyConf India 2010 last weekend. I haven’t done Ruby much so far (Python is my de facto language). But this was a lot of fun nevertheless.
The conference was organized by ThoughtWorks and the speakers list was more or less dominated by ThoughtWorkers (both from India and overseas). The general impression I got by interacting with the attendees is that most of the big...
February 2010
5 posts
http://www.infoq.com/presentations/Facebook-Hive-Ha... →
Informative talk by Ashish Thusoo and Namit Jain from Facebook’s Hive team.
Hive’s RCFile is pretty interesting - it provides record-columnar storage on top of HDFS. Apparently it results in very good compression and higher scan throughput.
http://www.dabeaz.com/GIL/ →
David Beazley continues his GIL investigation work from his ChiPy ‘09 talk. This time he also analyses the new GIL optimizations that are apparently present in the Python 3.2 svn branch. Look at his slides to see how the new GIL performs!
It took surprisingly few code changes to the Python interpreter to do this kind of investigative work (Read more about his code changes). He...
The spot price of MicroSD cards is nearly identical to the spot price of the...
– On MicroSD Problems « bunnie’s blog
Fascinating article on the MicroSD industry and the economics involved in manufacturing them by a Chumby engineer. He has a whole series of articles about hardware manufacturing and China.