Understanding shared mem usage of multi-proc apps on Linux
Recently I had to debug a potential memory leak in an application that I’ve been working on. The problem was extra hard because the application is a multi-process application (spawns ~600 processes) that used a large mmap’d segment (~100G) for IPC. Running top and sorting by ‘rss’ values was not helpful since the top pids may not necessarily be the ones that might be leaking memory. The rss values do not account for sharing of the pages among the processes. So if a page is shared by n pids, then it’s counted ‘n’ times (in fact try summing the rss values of all the running pids and it would be way over the physical memory on the box!).
Fortunately newer Linux kernels track the proportional set size (pss) values for different memory segments of a pid. This can be obtained via /proc/<pid>/smaps. So if a page has been faulted in by 4 pids, then the page only contributes 0.25 to the pss of each pid (vs 1.0 to the rss of the pid). So this gives you a more realistic estimate of the memory usage of each pid.
Top doesn’t display the pss values but luckily someone already wrote a python script called smem to parse the /proc/smaps output of each running pid on the box and spit out the aggregated pss and uss values for each pid. Uss is unshared set size (basically all the memory used by a process that is not shared with anyone else). The uss value is actually computed by summing the PrivateClean and PrivateDirty fields in the /proc/smaps output.
I wrote a quick python script to experiment with smem. The script mmaps a 5G segment and spawns off a bunch of child pids which do different things (one only maps the 5G segment, another faults in 50% of the pages, another faults in 100% of the pages, another creates a private 5G segment, and so on).
View the python script smem_test.py on GitHub.
Links from my reading list
- Debugging Tools for iOS Developers
- Open sourcing Databus: LinkedIn’s low latency change data capture system |
- The Tail at Scale | February 2013 | Communications of the ACM
- Common Pitfalls in Writing Lock-Free Algorithms
- trisha_gee: The Coalescing Ring Buffer, the latest open-sourced tool from L
- The DDoS That Almost Broke the Internet - CloudFlare blog
- GameInternals - Understanding Pac-Man Ghost Behavior
- How multi-disk failures happen
- Debugging a segfaulting binary without debug symbols « LShift Ltd.
- InfoQ: Lock-free Algorithms
Links from my reading list
- Flower Filter – A simple time-decaying approximate membership filter | 42 E - a probabilistic data structure for storing a list of items with time decay (essentially a space/time-efficient LRU cache).
- What Your Culture Really Says - Pretty Little State Machine
- Recent Code Search Outages · GitHub Blog
- Inside Cloudera Impala: Runtime Code Generation
- Fast Updates with TokuDB | Tokutek
- Software Defined Networking | The conversation has begun (between applicati - Boundary is hooking up to APIs from SDN vendors now!
- Morris algorithm - how to do in-order traversal of a tree with O(1) space.
- Broken by Design: MongoDB Fault Tolerance :: Hacking, Distributed
- Data alignment for speed: myth or reality? - argues data alignment on 4 byte boundaries is over-rated on the latest x86 cpus.
Links from my reading list
- Intra-cluster Replication in Apache Kafka | LinkedIn Engineering
- Making the net go faster - talks about Google’s TCP-FastOpen proposal.
- Two recent Go talks
- Meet the Data Brains Behind the Rise of Facebook | Wired Enterprise | Wired
- kellan: Answer to EtsyEng FAQs, “How do you continuously deploy schema changes”
- Zones vs KVM vs Xen
- Office of the CTO | Network Virtualization in the Software-Defined Data Cen - finally someone explains clearly how SDNs change the scene!
- Paper Trail » Blog Archive » Columnar Storage
- Under the Hood: Automated backups - how FB does automated backups
- Rust - Mozilla’s take on a systems language
- igrigorik: scaling to 3000 players in an MMORTS: slow down time, aka time dilation
Links from my reading list
- The Twitter stack - Braindump - @skr gives an overview of the Twitter stack (and the stuff they’ve already opensourced).
- Is Convergent Encryption really secure? - Cryptography Beta - Stack Exchang - there’s recent discussion on HN about convergent encryption given how Mega’s ToS claim they dedupe your encrypted content.
- Cloudera Impala: A Modern SQL Engine for Hadoop - tech talk on Impala.
- When is “ACID” ACID? Rarely. | Peter Bailis - excellent post which describes how most databases offer isolation levels lower than true serializable (notably Oracle which provides snapshot isolation).
- Twitter Engineering: Improving Twitter search with real-time human computat
- Notes on Distributed Systems for Young Bloods – Something Similar - wisdom which seems like common sense when you read it.
- research!rsc: Go Data Structures: Interfaces
Links from my reading list
- Lock-Free Data Structures - part 1 which shows how to build a lock-free write-less-read-often map in C++.
- Lock-Free Data Structures with Hazard Pointers | Dr Dobb’s Journal - part 2 improves on the previous solution.
- /jsr166/src/jsr166e/Striped64.java - Doug Lea shows how to write a 64-bit striped number (for an example of how to use this abstract class see LongAdder.java)
- TCP incast: What is it? How can it affect Erlang applications? | Humans and - TCP incast occurs when you’ve micro-bursts of traffic and you overflow the buffers in the switches and TCP slow start kicks in. This problem was exacerbated in riak’s case because of head-of-line blocking in Erlang VM which resulted in starving all the other unrelated TCP connections.
- Hello ARM: Exploring Undefined, Unspecified, and Implementation-defined Beh - how MSFT tweaked the behavior of volatile in their C++ compiler due to ARM’s weaker memory model compared to x86 (writes are not re-ordered).
- How Spotify’s P2P network works - one interesting note is how they use a Markov chain simulation to predict how much to buffer before starting to playing the stream; you do not want to start playing prematurely because buffering in the middle of a song destroys the experience. The states in the Markov model are the discrete throughputs to server observed in 1 sec (33 buckets ranging from 0 to 130 Kbps). Once you’ve built a model using empirical data (they use the last 15 min data collected by the client), then you can run simulations given initial buffer-size, and current observed throughput. If too many trials result in buffer underruns, then it’s better to wait further.
- CAP Twelve Years Later: How the “Rules” Have Changed
- Network problems last Friday · GitHub Blog
- Grey Matters: Blog: The Fitch Cheney Card Trick - very clever interview question that I read about on Quora.
- RethinkDB FAQ for programmers new to distributed databases
- How OpenTSB works on top of HBase - some great tips on how to use HBase effectively
Building a newsfeed-like feature
- Real-Time Delivery Architecture at Twitter - one interesting thing which I didn’t realize they started doing is to open persistent connections from the official desktop client and just stream tweets to users (essentially they’ve streaming servers which apply filters on the global firehose and push to the clients). The website is still rendered by reading the write-materialized inbox that’s stored as a list of tweet-ids in redis for each user.
- Petabyte Scale Data at Facebook - Dhurba Borthakur talks about the different kinds of storage systems they’ve built at Facebook for different needs. Good overview of the BigData landscape.
- Facebook News Feed: Social Data at Scale - interestingly they don’t materialize the newsfeed during writes like Twitter. I think this also helps them because they do ranking on the feed items and return only a subset of interesting items back to the user.
- Scaling Tumblr - talk on scaling tumblr’s Dashboard feature (they used to do materialization during reads, but seem to be switching to materialization during writes like Twitter).
More scaling talks
- Plenty of Fish architecture - a rarely seen scale-up approach!
- Blobstore: Twitter’s in-house photo storage system - very similar to FB’s Haystack; storage nodes stores photos as blobs in fat files (versus storing as individual files - which results in several disk seeks while scanning filesystem metadata). Multi-DC replication is done async via message queue.
Misc
- Ketchup’s Chinese origins: how it evolved from fish sauce to today’s tomato - ke-tchup is a Hokkien word which means preserved-fish sauce.
- The Oklahoma City Thunder’s Fairy-Tale Rise - NYTimes.com - I don’t follow basketball but an interesting profile of Kevin Durant, OKC Thunder, and the city
- Good design is invisible: an interview with iA’s Oliver Reichenstein | The
This is probably my last post for 2012. Happy new year!
Onion routing
Recently came across a really interesting talk by developers on the Tor project. The talk is about how different countries across the world have tried to block Tor. Go watch it. Most of the countries except for China really just buy a deep-packet-inspection (DPI) solution that’s sold by a handful of companies (mostly based in the US), and hope it blocks Tor (Tor pretends to be SSL and so you’ve to look for patterns to distinguish Tor from real SSL traffic).
I dug a little deeper to learn how Tor works, and I’ve been running a Tor non-exit relay on one of my boxes just for fun (interestingly Akamai seems to run a bunch of exit relays!).
Tor
In Tor, you want the Onion Routers (OR) that form a circuit to only know about the previous hop and the next hop. They should never have a full picture of the entire circuit. That way if someone compromises one OR and records all traffic, they won’t be able to identify the end-user or the contents of the traffic with some exceptions.
- Exception 1 - if a compromised OR is the exit relay (i.e. last hop before exiting out of the Tor network), then the attacker can see the traffic (if the traffic is plaintext to begin with; if it’s HTTPS traffic then even the last hop cannot inspect it obviously). But they cannot identify the user’s IP address. But there’s a big caveat of course - the traffic itself shouldn’t contain any user-identifying information. This is why they encourage you to use HTTPS everywhere. Imagine logging into email with plain HTTP and your email id flying around in plain-text. It’s pointless to hide your IP address when you’ve let your email address go in plain text. You’re not anonymous anymore.
- Exception 2 - if a compromised OR is the entry relay (i.e. beginning of the circuit), then an attacker know who the user is but they don’t know the contents. This is because the traffic is encrypted several times by the end-user such that only at the last hop can it be fully decrypted to reveal the plaintext payload like peeling an onion (hence the name).
So Tor typically creates a 3 OR hop circuit, and you should be safe even if one Tor has been compromised, and the attacker wants to break your anonymity.
Circuit creation
The Tor client (Onion Proxy) will pick a bunch of relays to create a circuit. There are several mechanisms to get the list of available relays (there’s a global directory list available via HTTP, there are unpublished bridge relays, and then there are secret relays which friends give to their friends only). In countries like Iran/China, they obviously block all IP addresses listed in the global directory list, and sometimes more. Watch the talk I mentioned above to learn more.
Coming back to how Tor works, relays are used to create a circuit, and a circuit will multiplex several TCP streams. Circuit is established in a serial fashion. Client talks to relay A, does a DH exchange, and so they’ve a secret session key that they’ll use for encryption. Client picks another relay B, and tells A to extend the circuit to B. Client does DH exchange with B via A acting as a relay (A cannot see the session key for B because DH provides that guarantee; similarly B does not know the identify of the client since it only sees A as the source of the traffic). So in the future, client would encrypt TCP stream with B’s session key followed by encryption with A’s session key. So when a packet arrives at A, it cannot see the original payload without knowing B’s session key. So original payload can be obtained only if you know all the session keys, or if you’re the exit relay and the payload is not itself encrypted.
Links from my reading list
Debugging
November has been a crazy month of debugging hard problems at work preparing for launch. Not sure if it’s pure coincidence but I came across some excellent articles/talks on debugging this month.
- The little ssh that (sometimes) couldn’t - Mina Naguib — get inspired by this guy’s persistence to debug a hard issue!
- Redis Crashes - Antirez weblog — describes a method which redis is now using to detect crashes due to faulty memory modules on the machine.
- How a crazy GNU assembler macro helps you debug GRUB with GDB — Boundary’s Joe Damato goes very low level (this time into the bowels of grub).
- There and Back Again -or- How I Set out to Benchmark an Algorithm and Ended up fixing Ruby — excellent talk on how to debug hard problems (goes into a lot of detail on how to extract the smallest possible repro for the bug and other good debugging advice).
- Jing Jin & Matthew Delaney: The Web’s Black Magic — JSConf EU 2012 - YouTu — excellent talk by a couple of Xooglers from the Chrome team where they go into the guts of Chrome code to illustrate why certain things are more expensive.
Misc
- Photoshop Scalability - ACM Queue
- 7 ways to handle concurrency in distributed systems - Rest for the Wicked
- Technovelty - PLT and GOT - the key to code sharing and dynamic libraries — how do shared libraries work?
- When the Nerds Go Marching In - Alexis C. Madrigal - The Atlantic — piece on Obama’s tech team.
- Anatomy of a Solid-state Drive - ACM Queue
- HBase Replication Overview | Apache Hadoop for the Enterprise | Cloudera
- How to Launch a 65Gbps DDoS, and How to Stop One
Links from my reading list
- Surge 2012 ~ Bryan Cantrill & Brendan Gregg ~ The Real-Time Web in the Real - Joyent guys demo debugging real production issues of one of their customers (Voxer) with DTrace.
- Surge 2012 ~ Tom Daly ~ The Architecture Behind Super Fast DNS - Routing, R - DynDNS explains how TCP Anycast works
- Surge 2011 ~ A journey through the full stack in search of performance and - good talk on performance engineering (aka full-stack jujitsu)
- Summary of the October 22, 2012 AWS Service Event in the US-East Region - AWS post-mortem for recent outage
- Surge 2012 ~ Artur Bergman ~ Mysteries of a CDN Explained - YouTube - Fastly founder describes how their CDN works.
- Erlang Factory Lite Vancouver 2012: Erlang in Production - YouTube - Heroku’s Geoff Cant talks about his experience of running/debugging production Erlang systems.
- Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real | Apache Hado
- NUMA-aware constructs for java.util.concurrent (David Dice’s Weblog)
- This Is Why They Call It a Weakly-Ordered CPU
- Scalyr: a Systematic Look at EC2 I/O
Links from my reading list
- Van Jacobson: The Slow-Start Algorithm - YouTube
- basho/riak_pipe - building pipelines for data processing on top of riak-core.
- fxr.watson.org: linux-2.6 sys/Documentation/memory-barriers.txt
- calm: consistency as logical monotonicity | Bloom Programming Language
- Burn the Library - good post on why you would need CRDTs if you’re using an eventually consistent store like Riak.
- Building Spanner Presentation • myNoSQL - video on Google’s Spanner store.
- Backblaze Blog » Farming hard drives: how Backblaze weathered the Thailand - Backblaze uses commodity disks!
- Backblaze Blog » Petabytes on a Budget v2.0:Revealing More Secrets - more details on the custom hardware they built and use.
Links from my reading list
- GitHub post-mortem
- Post Mortem: What Yesterday’s Network Outage Looked Like - CloudFlare blog
- Predictive Analytics: Data Preparation
- The Netflix Tech Blog: Benchmarking High Performance I/O with SSD for Cassandra
- Dtrace.conf 2012 talks - YouTube
- Mechanical Sympathy: Memory Access Patterns Are Important
- Thread placement policies on NUMA systems - update
- The Go Memory Model - The Go Programming Language
- Rails - impact of removing config.threadsafe
- Stack Backtracing Inside Your Program | Linux Journal
- The Design of LLVM | Dr Dobb’s
- A rare look inside Facebook’s Oregon data center [photos, video]
- Reddit - Curiosity Rover - How NASA deals with packet loss?
- Financial Firms Face Subpoenas on Tax Strategy - NYTimes.com
- Doctor and Patient: The Bullying Culture of Medical School - NYTimes.com
Links from my reading list
- The Linux Graphics Stack | Clean Rinse
- High Scalability - High Scalability - Changing Architectures: New Datacente
- Brendan’s blog » Flame Graphs
- Brendan’s blog » 10 Performance Wins
- Mars rover draws on nuclear power for trek around Red Planet - latimes.com
- Debugging in an Asynchronous World - ACM Queue
- Interview With A High-Frequency Trader | ZeroHedge
Interesting links from ReadItLater queue
- Resilient Response In Complex Systems
- dablooms - an open source, scalable, counting bloom filter library
- How do debuggers keep track of the threads in a program
- An explanation of IOPS and latency | Recovery Monkey
- Peter Geoghegan’s blog: Sorting improvements in PostgreSQL 9.2: the case for micro-optimizations
- Profiling Garbage Collection in Haskell with DTrace & Instruments - Just Te
- research!rsc: Using Uninitialized Memory for Fun and Profit
- Going Colo
- Why interrupt affinity with multiple cores is not such a good thing - Alex