JRuby Hacking Guide (RubyKaigi talk by @nahi) →
Also check out his JRuby Source Reading Guide.
Welcome, my name is Harish Mallipeddi. I work for Amazon Web Services (AWS). This blog is mostly a dump of interesting articles that I come across on the web. Topics span multiple areas including algorithms/data structures, NoSQL stores, database internals, web-scale challenges, and functional languages.
Fun article on zippers, a family of purely functional data structures. The article specifically focuses on implementing zipper lists and zipper trees in Erlang.
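To make the idea concrete, here is a minimal sketch of a list zipper. It is written in Ruby rather than the article’s Erlang, and the names (`ListZipper`, `forward`, `backward`, `replace`) are my own, not the article’s. A zipper keeps the elements before the cursor (nearest first) and the elements from the cursor onward as two separate lists, so moving or editing at the cursor only touches the heads of those lists and never mutates the original. With Erlang’s cons lists each step is constant time; Ruby arrays copy on every step, so treat this as an illustration of the shape only.

```ruby
# A list zipper: `left` holds the elements before the cursor (nearest first),
# `right` holds the element under the cursor followed by the rest.
# All names here are illustrative, not taken from the article.
ListZipper = Struct.new(:left, :right) do
  # Build a zipper with the cursor on the first element.
  def self.from_array(items)
    new([], items.dup)
  end

  # Element under the cursor (nil when the cursor is past the end).
  def current
    right.first
  end

  # Move the cursor one step right, returning a new zipper.
  def forward
    return self if right.empty?
    ListZipper.new([right.first] + left, right.drop(1))
  end

  # Move the cursor one step left, returning a new zipper.
  def backward
    return self if left.empty?
    ListZipper.new(left.drop(1), [left.first] + right)
  end

  # Replace the element under the cursor, returning a new zipper.
  def replace(value)
    return self if right.empty?
    ListZipper.new(left, [value] + right.drop(1))
  end

  # Rebuild the plain list.
  def to_a
    left.reverse + right
  end
end

zipper  = ListZipper.from_array([1, 2, 3]).forward
updated = zipper.replace(20)
```

Here `updated.to_a` reflects the edit while `zipper.to_a` still returns the original list, which is the whole point of a purely functional structure.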
…So we used a series of large scale genetic optimization tests running against every campaign we’ve ever sent to confirm which traits were predictive, and how predictive they were.
—
MailChimp’s Project Omnivore: Declassified | MailChimp Email Marketing Blog
Another interesting application of Hadoop in machine learning for BigData crunching. I need to learn more about genetic optimization algorithms.
The goal of Anonymouse is to selectively exclude data from the cookies we drop so that our users are sufficiently indistinguishable. We define “sufficiently indistinguishable” using the notion of k-anonymity. A dataset is k-anonymous as long as every record in the set is identical to no fewer than k-1 other records.
—
Anonymouse | Engineering Rapleaf
The article describes how Rapleaf is trying to automate the process of anonymizing user data (from cookies). Read the follow-up post in which they describe how they sped up ‘superset queries’ (get all supersets containing the given set) to implement Anonymouse.
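A small sketch of the two ideas mentioned above, under my own assumptions: records are modeled as arrays of attribute values, and the method names `k_anonymous?` and `supersets_of` are hypothetical. The superset query is a naive linear scan; it is not the faster approach from the follow-up post, which I have not reproduced here.

```ruby
require "set"

# True when every distinct record occurs at least k times, i.e. each record
# is identical to no fewer than k-1 other records.
def k_anonymous?(records, k)
  counts = records.each_with_object(Hash.new(0)) { |record, seen| seen[record] += 1 }
  counts.values.all? { |count| count >= k }
end

# Naive superset query: every stored set that contains all elements of `query`.
# This is a plain linear scan, used only to show what the query means.
def supersets_of(stored_sets, query)
  stored_sets.select { |candidate| query.subset?(candidate) }
end

records = [
  ["30-39", "CA"], ["30-39", "CA"],
  ["40-49", "NY"], ["40-49", "NY"]
]
two_anonymous = k_anonymous?(records, 2)

segments = [Set[1, 2, 3], Set[2, 3], Set[1, 4]]
matches  = supersets_of(segments, Set[2, 3])
```

Adding a single record that appears only once, such as `["50-59", "TX"]`, would break 2-anonymity for the whole dataset, which is why data has to be selectively excluded.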
Videos & slides from Hadoop Summit 2010 are now available.
There are lots of MongoDB-related posts everywhere lately. Here are three important things you should know before delving into MongoDB:
MongoDB only does an fsync once every ‘n’ seconds (n is configurable). It doesn’t have a write-ahead log from which you can recover in the event of a crash. So you should always run MongoDB with replication in production.
memprof is a memory profiler for Ruby (specifically MRI). It’s great for performance analysis, debugging memory leaks, etc. Unfortunately it doesn’t work with the default Ruby that ships with Snow Leopard, since no debug symbols are installed. There are a few other restrictions at this point - memprof actually rewrites the x86 binary in memory, which is why it’s difficult to get it to work everywhere. Joe Damato, the author of memprof, has a bunch of interesting blog posts explaining how memprof works underneath if you’re curious.
So here are the steps to install a new ruby interpreter (with debug symbols), rubygems and memprof gem:
Here’s the direct link to the Gist embedded above.
Also good reads: benchmarking SSD performance on Windows 7 and OS X with and without TRIM.
Earlier versions of Windows (pre-Windows 7) don’t have TRIM support apparently. The article doesn’t seem to have a good explanation for why performance in OS X doesn’t degrade due to write amplification even though TRIM hasn’t been set (at least according to the System Profiler app in OS X).
when you don’t create things, you become defined by your tastes rather than ability. your tastes only narrow & exclude people. so create.
—
Just saw this tweet by _why in a RailsConf ‘10 talk by Neal Ford. It’s been two months since I started doing full-time Ruby development (at ZumoDrive). I’ve seen some beautiful code, I’ve cursed meta-programming (someone overrode something inside a gem somewhere and it blew up somewhere else), but most of all I’ve been impressed by how the Ruby community always strives to create beautiful new things.
There are entire talks at Ruby conferences devoted to encouraging people to create, to design great APIs, to strive for better. These ideals make it seem like Rubyists really want to be artists more than engineers.
We use GitHub at work for all projects. One of the things I’ve always missed while using git is the ability to see 2 files side-by-side when I view diffs. The minimal context that you get when you do git diff | mate is just not enough sometimes. I discovered a solution to this today.
Git supports external diff and merge tools. In fact, there’s a git-difftool and a git-mergetool command just for this purpose. You can plug any common diff/merge viewer into git via these two commands. OS X ships with a decent diff/merge tool - FileMerge aka opendiff - but it didn’t work that well for me. So I got myself a copy of Araxis Merge for the Mac.
Once you instruct git to use a difftool, you can just do something like git difftool topic-branch-X..master in the shell, and git will open up Araxis Merge with the 2 files. Similarly with the mergetool, after you do a git merge and end up with a bunch of conflicts, you can type git mergetool to resolve these conflicts from inside Araxis Merge.
Here’s the configuration that I had to use to get git to use Araxis as the difftool and the mergetool (there’s lots of info about this on StackOverflow, which is how I got started).
Save the following into a file called araxisgitdifftool.sh:
Append the following to your ~/.gitconfig:
I’m still not completely happy with this. Araxis opens each changed file in a separate window, which means that if a lot of files changed in a diff, it’ll open lots of windows. Thankfully it’s quite fast and very responsive compared to FileMerge, but it still worries me when I have to view diffs with lots of changed files.
Insightful article. In short, MRI maintains its own heap space to store metadata about Ruby objects. This heap space is divided into multiple groups, and each group has slots of 20 bytes (32-bit) or 40 bytes (64-bit) each. Each slot is meant to hold a C struct called RVALUE. The number of slots allocated increases by a factor of 1.8 each time, beginning with 10,000+1 slots in the first iteration. Objects implemented in C/C++ allocate memory directly on the C heap to hold their internal data.
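To get a feel for those numbers, here is a back-of-the-envelope sketch using only the figures quoted above. The constant and method names are mine, and I am assuming the extra “+1” slot per allocation can be ignored for a rough estimate; none of this is read from MRI’s source.

```ruby
# Figures quoted from the article; treat the results as rough estimates.
INITIAL_SLOTS = 10_000
SLOT_BYTES    = { 32 => 20, 64 => 40 } # RVALUE size by word size

# Slot count of each successive heap group, growing by a factor of 1.8.
# Integer arithmetic (x * 18 / 10) avoids floating-point rounding.
# Assumption: the extra "+1" slot per allocation is ignored.
def heap_slot_counts(groups)
  counts = []
  slots = INITIAL_SLOTS
  groups.times do
    counts << slots
    slots = slots * 18 / 10
  end
  counts
end

# Total bytes spent on slots after `groups` allocations.
def heap_slot_bytes(groups, word_size)
  heap_slot_counts(groups).inject(0, :+) * SLOT_BYTES.fetch(word_size)
end

counts = heap_slot_counts(4)     # slot counts of the first four groups
bytes  = heap_slot_bytes(4, 64)  # total slot bytes on a 64-bit build
```

Under these assumptions the first four groups hold 10,000, 18,000, 32,400 and 58,320 slots, or roughly 4.7 MB of slots on a 64-bit build, before counting anything that C extensions allocate on the C heap.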
Slides/videos from MongoSF 2010
Great talk titled “Sustainable TDD” by Steve Freeman (co-author of jMock).
Bradley Kuszmaul of Tokutek talks about how fractal trees work and how they compare with B-Trees. These slides [pdf] might also be helpful to follow along with the video (the video quality is not that great).
To me, CAP should really be PACELC —- if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?
— DBMS Musings: Problems with CAP, and Yahoo’s little known NoSQL system