- The Network is (not) Reliable - good article which argues why network is not reliable and goes over a pretty exhaustive list of things which cause network partitions.
- Introducing Riak 2.0: Data Types, Strong Consistency, Full-Text Search, and Much More
- Facebook’s memcached routing layer - the way they use leases to guarantee cache consistency is interesting.
- DynamoDB’s global secondary indexes - global secondary indexes are asynchronously updated.
- LinkedIn deploys POPs to performance of dynamic content - LinkedIn terminates SSL in various POPs and sees major performance gains.
- Go Performance Tales
- How Stacks are Handled in Go - v1.2 uses segmented stacks, and 1.3+ use contiguous stacks.
- Two traps in iostat: %util and svctm - tldr; don’t trust all the iostat numbers blindly if you’re using SSDs.
- Inside Shellshock: How hackers are using it to exploit systems
- Why Far-Flung Parts of the Internet Broke Today - highly recommend the Renesys blog if you want to understand how brittle BGP is and how easy it’s to screw up.
Interview with Ed Fries - veteran Microsoft employee who worked on Excel/Word in the late 80s and started the XBox division at Microsoft. He has some amazing stories from early Microsoft days.
- Semi-Synchronous Replication at Facebook
- Secondary Indexing in HBase
- Debunking Myths about the VoltDB in-memory database
- Writing a High Performance Database in Go // Speaker Deck
- Details on Netflix’s CDN
- Linux’s fsync() woes are getting some attention
- It’s all a numbers game — the dirty little secret of scalable systems - Martin Thompson
- Measuring Latency in Linux - a good primer on the various clock sources available in Linux.
- Dynamic Tuple Performance On the JVM
- Debugging performance issues in Go programs
- [video] Availability, Consistency, and Horizontally Scalable Data Management, by Peter Bailis
- Call me maybe: etcd and Consul - another great post from the Jepsen series
- Building a Fast and Reliable Purging System - fastly implements a gossip protocol for purging cache entries from their POPs
- Linux 3.13 features
- Tango from SOSP ‘13 -
- Comparison of XFS with other filesystems
- Left-right concurrency control technique - clever technique of ping-ponging between 2 copies of a data-structure (reads happen on 1 copy while a single writer mutates the other copy). Read the first page where they compare the technique’s advantages over regular reader-writer lock and a copy-on-write/compare-and-swap based approach. It’s very similar to RCU.
- How Bad Can 1GB Pages Be? - measuring the performance of hugepages in Linux
- Another 10 Performance Wins
- 40% better single-threaded performance in MariaDB - cost of instruction cache misses in a MySQL benchmark
- HipHop VM performance improvements - another example of how reducing i-cache misses helps performance; hhvm places hot functions together in one part of the binary, and also uses huge-pages to map these parts in order to improve i-TLB hit rate.
- Misaligned struct in postgres wreaks havoc - a struct which is not 64-byte aligned ends up sharing the same cache line with another instance of the struct (both structs have a high update rate due to a spin-lock that’s inside the struct).
- The cost of dynamic (virtual calls) vs. static (CRTP) dispatch in C++
- The beets blog: the SQLite lock timeout nightmare. - lack of usleep during compilation causes sqlite to sleep for a full second if acquiring lock fails
- Understanding DTrace ustack helpers
- Conversation with scientist, engineer and database legend Jim Gray
- Number crunching: Why you should never, ever, EVER use linked-list in your code again - controversial title but interesting read
- /proc/sys/vm/compact_memory - device drivers in Linux are allowed to allocate contiguous regions of memory so they can do things like TCP offloading. Normally requesting memory from the kernel results in non-contiguous memory stitched from free pages in the free list. This is because the free-list is just a linked-list of pages and you can pop off free-pages from the end of the list. Now in order to guarantee a contigous list of pages, Linux has an implementation of a buddy allocator but that’s apparently only enabled if you set /proc/sys/vm/compact_memory and this is turned off by default due to high cpu usage. Why does this all matter? If your don’t have this option and some driver insists on allocating a contiguous region of memory, then Linux could end up evicting huge portions of file-cache while it’s searching for a contiguous chunk of memory.
- Scaling Mercurial at Facebook
- RocksDB: A High Performance Embedded Key-Value Store for Flash Storage
- The Log: What every software engineer should know about real-time data’s unifying abstraction - the log as a building block!
- The New Threat: Targeted Internet Traffic Misdirection - alarming how easy it’s to divert traffic by advertising malicious BGP routes!
- Attacking Tor: how the NSA targets users’ online anonymity - Tor is not broken but there are vulnerabilities at the end of the tunnel (browser exploits, etc).
- Data Structures in the Andrew Text Editor
- I wrote FAT on an airplane, for heaven’s sake
- Non-blocking transactional atomicity
- Booting to Rust
- Making full table scan 10x faster in InnoDB - they implemented logical-read-ahead to fetch pages in primary-key order.
- Counterfactual Thinking, Rules, and The Knight Capital Accident
- Barbarians at the Gateways - a HFT engineer paints a picture of the status-quo in their world.
- Call me maybe: Cassandra
- Call me maybe: Kafka
- A Few Notes on Kafka and Jepsen - Jay Kreps responds to @aphyr’s post on Kafka
- Immutability Changes Everything - Pat Helland, RICON2012
- Writing Datomic in Clojure - good overview of Datomic internals; everything is immutable, all writes go through single-master, the app/query nodes read off immutable data in storage subsystem without any locking
- Skip Unused Pages with MySQL Enterprise Backup 3.9.0 (MySQL Enterprise Backup)
- WAL-E and Continuous Protection with Heroku Postgres
- Direct Memory Alignment in Java
Recently I had to debug a potential memory leak in an application that I’ve been working on. The problem was extra hard because the application is a multi-process application (spawns ~600 processes) that used a large mmap’d segment (~100G) for IPC. Running top and sorting by ‘rss’ values was not helpful since the top pids may not necessarily be the ones that might be leaking memory. The rss values do not account for sharing of the pages among the processes. So if a page is shared by n pids, then it’s counted ‘n’ times (in fact try summing the rss values of all the running pids and it would be way over the physical memory on the box!).
Fortunately newer Linux kernels track the proportional set size (pss) values for different memory segments of a pid. This can be obtained via /proc/<pid>/smaps. So if a page has been faulted in by 4 pids, then the page only contributes 0.25 to the pss of each pid (vs 1.0 to the rss of the pid). So this gives you a more realistic estimate of the memory usage of each pid.
Top doesn’t display the pss values but luckily someone already wrote a python script called smem to parse the /proc/smaps output of each running pid on the box and spit out the aggregated pss and uss values for each pid. Uss is unshared set size (basically all the memory used by a process that is not shared with anyone else). The uss value is actually computed by summing the PrivateClean and PrivateDirty fields in the /proc/smaps output.
I wrote a quick python script to experiment with smem. The script mmaps a 5G segment and spawns off a bunch of child pids which do different things (one only maps the 5G segment, another faults in 50% of the pages, another faults in 100% of the pages, another creates a private 5G segment, and so on).
View the python script smem_test.py on GitHub.
- Debugging Tools for iOS Developers
- Open sourcing Databus: LinkedIn’s low latency change data capture system |
- The Tail at Scale | February 2013 | Communications of the ACM
- Common Pitfalls in Writing Lock-Free Algorithms
- trisha_gee: The Coalescing Ring Buffer, the latest open-sourced tool from L
- The DDoS That Almost Broke the Internet - CloudFlare blog
- GameInternals - Understanding Pac-Man Ghost Behavior
- How multi-disk failures happen
- Debugging a segfaulting binary without debug symbols « LShift Ltd.
- InfoQ: Lock-free Algorithms
- Flower Filter – A simple time-decaying approximate membership filter | 42 E - a probabilistic data structure for storing a list of items with time decay (essentially a space/time-efficient LRU cache).
- What Your Culture Really Says - Pretty Little State Machine
- Recent Code Search Outages · GitHub Blog
- Inside Cloudera Impala: Runtime Code Generation
- Fast Updates with TokuDB | Tokutek
- Software Defined Networking | The conversation has begun (between applicati - Boundary is hooking up to APIs from SDN vendors now!
- Morris algorithm - how to do in-order traversal of a tree with O(1) space.
- Broken by Design: MongoDB Fault Tolerance :: Hacking, Distributed
- Data alignment for speed: myth or reality? - argues data alignment on 4 byte boundaries is over-rated on the latest x86 cpus.
- Intra-cluster Replication in Apache Kafka | LinkedIn Engineering
- Making the net go faster - talks about Google’s TCP-FastOpen proposal.
- Two recent Go talks
- Meet the Data Brains Behind the Rise of Facebook | Wired Enterprise | Wired
- kellan: Answer to EtsyEng FAQs, “How do you continuously deploy schema changes”
- Zones vs KVM vs Xen
- Office of the CTO | Network Virtualization in the Software-Defined Data Cen - finally someone explains clearly how SDNs change the scene!
- Paper Trail » Blog Archive » Columnar Storage
- Under the Hood: Automated backups - how FB does automated backups
- Rust - Mozilla’s take on a systems language
- igrigorik: scaling to 3000 players in an MMORTS: slow down time, aka time dilation
- The Twitter stack - Braindump - @skr gives an overview of the Twitter stack (and the stuff they’ve already opensourced).
- Is Convergent Encryption really secure? - Cryptography Beta - Stack Exchang - there’s recent discussion on HN about convergent encryption given how Mega’s ToS claim they dedupe your encrypted content.
- Cloudera Impala: A Modern SQL Engine for Hadoop - tech talk on Impala.
- When is “ACID” ACID? Rarely. | Peter Bailis - excellent post which describes how most databases offer isolation levels lower than true serializable (notably Oracle which provides snapshot isolation).
- Twitter Engineering: Improving Twitter search with real-time human computat
- Notes on Distributed Systems for Young Bloods – Something Similar - wisdom which seems like common sense when you read it.
- research!rsc: Go Data Structures: Interfaces
- Lock-Free Data Structures - part 1 which shows how to build a lock-free write-less-read-often map in C++.
- Lock-Free Data Structures with Hazard Pointers | Dr Dobb’s Journal - part 2 improves on the previous solution.
- /jsr166/src/jsr166e/Striped64.java - Doug Lea shows how to write a 64-bit striped number (for an example of how to use this abstract class see LongAdder.java)
- TCP incast: What is it? How can it affect Erlang applications? | Humans and - TCP incast occurs when you’ve micro-bursts of traffic and you overflow the buffers in the switches and TCP slow start kicks in. This problem was exacerbated in riak’s case because of head-of-line blocking in Erlang VM which resulted in starving all the other unrelated TCP connections.
- Hello ARM: Exploring Undefined, Unspecified, and Implementation-defined Beh - how MSFT tweaked the behavior of volatile in their C++ compiler due to ARM’s weaker memory model compared to x86 (writes are not re-ordered).
- How Spotify’s P2P network works - one interesting note is how they use a Markov chain simulation to predict how much to buffer before starting to playing the stream; you do not want to start playing prematurely because buffering in the middle of a song destroys the experience. The states in the Markov model are the discrete throughputs to server observed in 1 sec (33 buckets ranging from 0 to 130 Kbps). Once you’ve built a model using empirical data (they use the last 15 min data collected by the client), then you can run simulations given initial buffer-size, and current observed throughput. If too many trials result in buffer underruns, then it’s better to wait further.
- CAP Twelve Years Later: How the “Rules” Have Changed
- Network problems last Friday · GitHub Blog
- Grey Matters: Blog: The Fitch Cheney Card Trick - very clever interview question that I read about on Quora.
- RethinkDB FAQ for programmers new to distributed databases
- How OpenTSB works on top of HBase - some great tips on how to use HBase effectively
Building a newsfeed-like feature
- Real-Time Delivery Architecture at Twitter - one interesting thing which I didn’t realize they started doing is to open persistent connections from the official desktop client and just stream tweets to users (essentially they’ve streaming servers which apply filters on the global firehose and push to the clients). The website is still rendered by reading the write-materialized inbox that’s stored as a list of tweet-ids in redis for each user.
- Petabyte Scale Data at Facebook - Dhurba Borthakur talks about the different kinds of storage systems they’ve built at Facebook for different needs. Good overview of the BigData landscape.
- Facebook News Feed: Social Data at Scale - interestingly they don’t materialize the newsfeed during writes like Twitter. I think this also helps them because they do ranking on the feed items and return only a subset of interesting items back to the user.
- Scaling Tumblr - talk on scaling tumblr’s Dashboard feature (they used to do materialization during reads, but seem to be switching to materialization during writes like Twitter).
More scaling talks
- Plenty of Fish architecture - a rarely seen scale-up approach!
- Blobstore: Twitter’s in-house photo storage system - very similar to FB’s Haystack; storage nodes stores photos as blobs in fat files (versus storing as individual files - which results in several disk seeks while scanning filesystem metadata). Multi-DC replication is done async via message queue.
- Ketchup’s Chinese origins: how it evolved from fish sauce to today’s tomato - ke-tchup is a Hokkien word which means preserved-fish sauce.
- The Oklahoma City Thunder’s Fairy-Tale Rise - NYTimes.com - I don’t follow basketball but an interesting profile of Kevin Durant, OKC Thunder, and the city
- Good design is invisible: an interview with iA’s Oliver Reichenstein | The
This is probably my last post for 2012. Happy new year!
Recently came across a really interesting talk by developers on the Tor project. The talk is about how different countries across the world have tried to block Tor. Go watch it. Most of the countries except for China really just buy a deep-packet-inspection (DPI) solution that’s sold by a handful of companies (mostly based in the US), and hope it blocks Tor (Tor pretends to be SSL and so you’ve to look for patterns to distinguish Tor from real SSL traffic).
I dug a little deeper to learn how Tor works, and I’ve been running a Tor non-exit relay on one of my boxes just for fun (interestingly Akamai seems to run a bunch of exit relays!).
In Tor, you want the Onion Routers (OR) that form a circuit to only know about the previous hop and the next hop. They should never have a full picture of the entire circuit. That way if someone compromises one OR and records all traffic, they won’t be able to identify the end-user or the contents of the traffic with some exceptions.
- Exception 1 - if a compromised OR is the exit relay (i.e. last hop before exiting out of the Tor network), then the attacker can see the traffic (if the traffic is plaintext to begin with; if it’s HTTPS traffic then even the last hop cannot inspect it obviously). But they cannot identify the user’s IP address. But there’s a big caveat of course - the traffic itself shouldn’t contain any user-identifying information. This is why they encourage you to use HTTPS everywhere. Imagine logging into email with plain HTTP and your email id flying around in plain-text. It’s pointless to hide your IP address when you’ve let your email address go in plain text. You’re not anonymous anymore.
- Exception 2 - if a compromised OR is the entry relay (i.e. beginning of the circuit), then an attacker know who the user is but they don’t know the contents. This is because the traffic is encrypted several times by the end-user such that only at the last hop can it be fully decrypted to reveal the plaintext payload like peeling an onion (hence the name).
So Tor typically creates a 3 OR hop circuit, and you should be safe even if one Tor has been compromised, and the attacker wants to break your anonymity.
The Tor client (Onion Proxy) will pick a bunch of relays to create a circuit. There are several mechanisms to get the list of available relays (there’s a global directory list available via HTTP, there are unpublished bridge relays, and then there are secret relays which friends give to their friends only). In countries like Iran/China, they obviously block all IP addresses listed in the global directory list, and sometimes more. Watch the talk I mentioned above to learn more.
Coming back to how Tor works, relays are used to create a circuit, and a circuit will multiplex several TCP streams. Circuit is established in a serial fashion. Client talks to relay A, does a DH exchange, and so they’ve a secret session key that they’ll use for encryption. Client picks another relay B, and tells A to extend the circuit to B. Client does DH exchange with B via A acting as a relay (A cannot see the session key for B because DH provides that guarantee; similarly B does not know the identify of the client since it only sees A as the source of the traffic). So in the future, client would encrypt TCP stream with B’s session key followed by encryption with A’s session key. So when a packet arrives at A, it cannot see the original payload without knowing B’s session key. So original payload can be obtained only if you know all the session keys, or if you’re the exit relay and the payload is not itself encrypted.