On Tuesday we launched Instagram for Android, and it’s had a fantastic response so far. The last few weeks (on the infrastructure side) have been all about capacity planning and preparation to get everything in place, but on launch day itself the challenge is to find problems quickly, get to the bottom of them, and roll out fixes ASAP. Here are some tools & techniques we used to tackle problems as they arose:
Statsd

We love statsd at Instagram. Written by Etsy, it’s a network daemon that aggregates and rolls up data into Graphite. At its core, it has two types of statistics: counters and timers. We use counters to track everything from the number of signups per second to the number of likes, and we use timers to time the generation of feeds, how long it takes to follow a user, and any other major action.
The single biggest reason we love statsd is how quickly stats show up and get updated in Graphite. Stats are basically realtime (in our system, they’re about 10 seconds delayed), which allows us to evaluate system and code changes immediately. Stats can be added at will, so if we discover a new metric to track, we can have it up and running very quickly. You can specify a sample rate, so we sprinkle logging calls throughout the web application at relatively low sample rates, without affecting performance.
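To give a flavor of how lightweight these logging calls are, here’s a minimal sketch using the open-source statsd Python client; the host, stat names, and sample rate are hypothetical, not our actual values:

```python
# Minimal sketch with the open-source "statsd" Python client (pip install statsd).
# Host, stat names, and sample rates are illustrative, not Instagram's.
import statsd

stats = statsd.StatsClient('statsd.example.com', 8125)

def record_signup():
    # Counter: statsd aggregates increments and flushes totals to Graphite.
    stats.incr('signups')

def generate_feed(user_id):
    # Timer, sampled at 10% so instrumentation stays cheap on hot code paths.
    with stats.timer('feed.generate', rate=0.1):
        return expensive_feed_query(user_id)

def expensive_feed_query(user_id):
    # Stand-in for the real feed generation work.
    return []
```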
Takeaway: having realtime stats that can be added dynamically lets you diagnose and firefight without having to wait to receive new data.
Dogslow

Written by Bitbucket, Dogslow is a piece of Django middleware that watches your running processes; if it notices any taking longer than N seconds, it snapshots the current process and writes the stack trace to disk. We’ve found it’s too intrusive to run all the time, but it’s very useful when trying to identify bottlenecks that have cropped up (we’ve added a switch to enable it on our web servers).
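Flipping it on is just Django configuration; here’s a sketch of what that switch might look like, with setting names per the dogslow README (treat the specific values as assumptions):

```python
# settings.py sketch: dogslow behind a simple switch. Setting names follow the
# dogslow README; the specific values here are hypothetical.
MIDDLEWARE_CLASSES = (
    'dogslow.WatchdogMiddleware',
    # ... the rest of the middleware stack ...
)

DOGSLOW = True           # the "switch": set False for normal operation
DOGSLOW_TIMER = 2        # snapshot any request still running after 2 seconds
DOGSLOW_OUTPUT = '/tmp'  # directory where stack-trace snapshots are written
```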
We found, halfway through launch day, that processes that were taking over 1.5s to return a response were often stuck in memcached set() and get_many(). Switching over to Munin, which we use to track our machine stats over time, we saw that our memcached boxes were pushing 50k req/s, and though they weren’t maxing out the CPU, they were busy enough to slow down the application servers.
Takeaway: it’s often one piece of the backend infrastructure that becomes a bottleneck, and figuring out the point at which your real, live appservers get stuck can help surface the issue.
Replication & Read-slaves
Two of our main data backends, Redis and PostgreSQL, both support easy replication and read-slaving. When one of our Redis DBs crossed 40k req/s and started becoming a bottleneck, bringing up another machine, SYNCing it to the master, and sending read queries to it took less than 20 minutes. For machines we knew ahead of time would be busy, we’d already brought up read-slaves, but in a couple of cases machines reacted differently under load than we’d projected, and it was useful to be able to split reads off quickly.
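The Redis side of that is only a couple of commands; here’s a sketch using redis-py (the hostnames are placeholders, and SLAVEOF is the replication command of that era):

```python
# Sketch: point a fresh Redis box at the master, wait for the initial SYNC,
# then put it in the read rotation. Hostnames are placeholders.
import time
import redis

replica = redis.Redis(host='redis-replica-1')
replica.slaveof('redis-master-1', 6379)  # start replicating from the master

# Block until the replica reports a live link to the master.
while replica.info().get('master_link_status') != 'up':
    time.sleep(1)

# Safe to start routing read queries here now.
```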
For Postgres, we use a combination of Streaming Replication and Amazon EBS snapshots to bring up a new read-slave quickly. All of our master DBs stream to backup slaves that take frequent EBS snapshots; from these snapshots, we can have a new read-slave up and running, and caught up to the master, in around 20 minutes. Having our machines in an easily scriptable environment like AWS makes provisioning and deploying new read-slaves a quick command-line task.
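As an illustration of that scriptability, the EBS step might look roughly like this with the boto library (a hedged sketch, not our actual tooling; the instance ID, zone, and device name are placeholders):

```python
# Sketch: create a data volume for a new read-slave from the most recent EBS
# snapshot, using boto 2.x. IDs, zone, and device are placeholders.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# Pick the most recent snapshot we own (in practice you'd filter to the
# snapshots of the backup slave's data volume).
latest = max(conn.get_all_snapshots(owner='self'), key=lambda s: s.start_time)

# Create a volume from it and attach it to the new read-slave instance;
# Postgres then replays WAL from the snapshot and starts streaming.
volume = conn.create_volume(size=latest.volume_size, zone='us-east-1a',
                            snapshot=latest.id)
conn.attach_volume(volume.id, 'i-0123abcd', '/dev/sdf')
```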
Takeaway: if read capacity is likely to be a concern, bring up read-slaves ahead of time and get them into rotation; if new read bottlenecks crop up anyway, know in advance how you’ll bring more read capacity into rotation quickly.
PGFouine

PGFouine is a tool that analyzes PostgreSQL query logs and generates a page of analytics on their impact on your database, sliced by the “heaviest” queries, the most frequent ones, or the slowest ones. To make it easier to run, we’ve created a Fabric script that connects to a database, sets it to log every query, waits 30 seconds, then downloads the log file and runs a PGFouine analysis on it; it’s available as a gist. PGFouine is our core tool for analyzing database performance and figuring out which queries could use memcached in front of them, which ones are fetching more data than necessary, and so on. As DBs showed signs of stress on launch day, we would run PGFouine, deploy targeted code improvements to relieve hotspots, and then run it again to make sure those changes had the intended effect.
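For flavor, here’s a rough sketch of that workflow in Fabric style; this is not the actual gist, and the paths, Postgres version, and pgfouine flags are assumptions to adapt to your environment:

```python
# Rough sketch of the sample-and-analyze loop (NOT the actual gist). Assumes a
# Debian-style Postgres 9.1 layout with stderr logging; adjust paths and flags.
import time
from fabric.api import sudo, get, local

PG_CONF = '/etc/postgresql/9.1/main/postgresql.conf'
PG_LOG = '/var/log/postgresql/postgresql-9.1-main.log'

def pgfouine_sample(duration=30):
    # Log every statement for `duration` seconds.
    sudo("sed -i 's/^#*log_min_duration_statement.*/"
         "log_min_duration_statement = 0/' %s" % PG_CONF)
    sudo('pg_ctlcluster 9.1 main reload')
    time.sleep(int(duration))

    # Turn full statement logging back off.
    sudo("sed -i 's/^log_min_duration_statement.*/"
         "log_min_duration_statement = -1/' %s" % PG_CONF)
    sudo('pg_ctlcluster 9.1 main reload')

    # Pull the log down and run pgfouine over it.
    get(PG_LOG, 'pg_sample.log')
    local('pgfouine.php -file pg_sample.log -logtype stderr '
          '-top 20 > pgfouine_report.html')
```

Pointed at a stressed database host (e.g. fab -H some-db-host pgfouine_sample), this gives a before-and-after view around each deploy.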
It’s also important to know what a “normal” day looks like for your databases, so you have a baseline; we run PGFouine periodically against non-stressed-out database instances to gather those baseline statistics.
Takeaway: database log analysis, especially coupled with a tight iteration loop of optimizing queries and caching what’s needed, lets you find and relieve hotspots on your databases quickly.
One more thing
Another tool that helped us get through the first day was one we wrote ourselves—node2dm, a node.js server for delivering push notifications to Android’s C2DM service. It’s handled over 5 million push notifications for us so far.
We surveyed the different options for C2DM servers, but didn’t find any open-source ones that looked actively maintained or that fully supported the Google service. We’re open sourcing node2dm today; feel free to fork it and send pull requests if you have suggestions for improvements.
If all of this is interesting/exciting to you, and you’d like to chat more about working with us, drop us a note; we’d love to hear from you.
You can discuss this post at Hacker News.
Mike Krieger, co-founder