As we’ve scaled Instagram to an ever-growing number of active users, Postgres has continued to be our solid foundation and the canonical data storage for most of the data created by our users. Less than a year ago, we blogged about how we “stored a lot of data” at Instagram, at 90 likes per second; we’re now pushing over 10,000 likes per second at peak, and our fundamental storage technology hasn’t changed.
Over the last two and a half years, we’ve picked up a few tips and tools about scaling Postgres that we wanted to share—things we wish we knew when we first launched Instagram. Some of these are Postgres-specific while others are present in other databases as well. For background on how we’ve horizontally partitioned Postgres, check out our Sharding and IDs at Instagram post.
1. Partial Indexes
If you find yourself frequently filtering your queries by a particular characteristic, and that characteristic is present in a minority of your rows, partial indexes may be a big win.
As an example, when searching tags on Instagram, we try to surface tags that are likely to have many photos in them. While we use technologies like ElasticSearch for fancier searches in our application, this is one case where the database was good enough. Let’s see what Postgres does when searching tag names and ordering by number of photos:
EXPLAIN ANALYZE SELECT id FROM tags WHERE name LIKE 'snow%' ORDER BY media_count DESC LIMIT 10;

                                  QUERY PLAN
---------------------------------------------------------------------------------
 Limit  (cost=1780.73..1780.75 rows=10 width=32) (actual time=215.211..215.228 rows=10 loops=1)
   ->  Sort  (cost=1780.73..1819.36 rows=15455 width=32) (actual time=215.209..215.215 rows=10 loops=1)
         Sort Key: media_count
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using tags_search on tags_tag  (cost=0.00..1446.75 rows=15455 width=32) (actual time=0.020..162.708 rows=64572 loops=1)
               Index Cond: (((name)::text ~>=~ 'snow'::text) AND ((name)::text ~<~ 'snox'::text))
               Filter: ((name)::text ~~ 'snow%'::text)
 Total runtime: 215.275 ms
(8 rows)
Notice how Postgres had to scan and sort through tens of thousands of rows (64,572 actual) just to return the top ten. Since tags (for example) exhibit a long-tail pattern, we can instead create a partial index covering only tags with at least 100 photos:
CREATE INDEX CONCURRENTLY ON tags (name text_pattern_ops) WHERE media_count >= 100;
Now the query plan looks like:
EXPLAIN ANALYZE SELECT * FROM tags WHERE name LIKE 'snow%' AND media_count >= 100 ORDER BY media_count DESC LIMIT 10;

                                  QUERY PLAN
---------------------------------------------------------------------------------
 Limit  (cost=224.73..224.75 rows=10 width=32) (actual time=3.088..3.105 rows=10 loops=1)
   ->  Sort  (cost=224.73..225.15 rows=169 width=32) (actual time=3.086..3.090 rows=10 loops=1)
         Sort Key: media_count
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using tags_tag_name_idx on tags_tag  (cost=0.00..221.07 rows=169 width=32) (actual time=0.021..2.360 rows=924 loops=1)
               Index Cond: (((name)::text ~>=~ 'snow'::text) AND ((name)::text ~<~ 'snox'::text))
               Filter: ((name)::text ~~ 'snow%'::text)
 Total runtime: 3.137 ms
(8 rows)
Notice that Postgres only had to visit a few hundred rows (924 actual, versus 64,572 before), which is why the query is so much faster. Postgres’ query planner is also good at evaluating constraints: if you later decide to query only tags with over 500 photos, since those are a subset of the rows covered by media_count >= 100, it will still use the right partial index.
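As a sketch of that subset behavior (the follow-up query is hypothetical), the stricter threshold still implies the index’s WHERE clause, so the planner can keep using the partial index:

```sql
-- media_count >= 500 implies media_count >= 100,
-- so the partial index built above still applies.
SELECT id FROM tags
WHERE name LIKE 'snow%' AND media_count >= 500
ORDER BY media_count DESC LIMIT 10;
```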
2. Functional Indexes
On some of our tables, we need to index strings (for example, 64-character base64 tokens) that are quite long, and creating an index on those strings ends up duplicating a lot of data. For these, Postgres’ functional index feature can be very helpful:
CREATE INDEX CONCURRENTLY ON tokens (substr(token, 0, 8));
While there will be multiple rows that match that prefix, having Postgres match those prefixes and then filter down is quick, and the resulting index was 1/10th the size it would have been had we indexed the entire string.
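One caveat worth noting: the query has to use the exact expression that was indexed for the planner to pick up the functional index. A sketch, with a hypothetical token value:

```sql
-- The substr() call must match the indexed expression exactly;
-- the equality check on the full token then filters out the
-- handful of rows sharing the same prefix.
SELECT * FROM tokens
WHERE substr(token, 0, 8) = substr('aaaabbbbcccc0001', 0, 8)
  AND token = 'aaaabbbbcccc0001';
```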
3. pg_reorg For Compaction
Over time, Postgres tables can become fragmented on disk (due to Postgres’ MVCC concurrency model, for example). Also, most of the time, row insertion order does not match the order in which you want rows returned. For example, if you’re often querying for all likes created by one user, it’s helpful to have those likes be contiguous on disk, to minimize disk seeks.
Our solution to this is to use pg_reorg, which “compacts” a table in the following steps:
- Acquire an exclusive lock on the table
- Create a temporary table to accumulate changes, and add a trigger on the original table that replicates any changes to this temp table
- Do a CREATE TABLE using a SELECT FROM…ORDER BY, which will create a new table in index order on disk
- Sync the changes from the temp table that happened after the SELECT FROM started
- Cut over to the new table
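The reordering step above boils down to something like the following sketch (table and column names are hypothetical; pg_reorg generates the equivalent internally):

```sql
-- Rebuild the table so that one user's likes are contiguous on disk,
-- minimizing seeks when fetching all likes for a user.
CREATE TABLE likes_reorg AS
    SELECT * FROM likes ORDER BY user_id, id;
```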
There are some details in there around lock acquisition, etc., but that’s the general approach. We vetted the tool and tried several test runs before running in production, and we’ve run dozens of reorgs across hundreds of machines without issues.
4. WAL-E for WAL archiving and backups
We use and contribute code to WAL-E, Heroku’s toolkit for continuous archiving of Postgres Write-Ahead Log files. Using WAL-E has simplified our backup and new-replica bootstrap process significantly.
At its core, WAL-E is a program that archives every WAL file generated by your PG server to Amazon’s S3, using Postgres’ archive_command. These WAL files can then be used, in combination with a base backup, to restore the DB to any point in time since that base backup. The combination of regular base backups and WAL archiving means we can quickly bootstrap a new read-replica or failover slave, too.
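As a configuration sketch (the envdir path holding the S3 credentials is an assumption; see WAL-E’s README for the specifics of your setup), the archiving hookup looks something like:

```
# postgresql.conf
wal_level = archive
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
```

A periodic `wal-e backup-push` of the data directory then provides the base backups that the archived WAL segments replay against.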
We’ve made our simple wrapper script for monitoring repeated failures to archive a file available on GitHub.
5. Autocommit mode and async mode in psycopg2
Over time, we’ve started using more advanced features in psycopg2, the Python driver for Postgres.
The first is autocommit mode; in this mode, psycopg2 won’t issue BEGIN/COMMIT for any queries; instead, every query runs in its own single-statement transaction. This is particularly useful for read-only queries where transaction semantics aren’t needed. It’s as easy as doing:
connection.autocommit = True
This significantly reduced chatter between our application servers and databases, and lowered system CPU usage on the database boxes as well. Further, since we use PgBouncer for our connection pooling, this change allows connections to be returned to the pool sooner.
More details on how this interacts with Django’s db handling here.
Another useful psycopg2 feature is the ability to register a wait_callback for coroutine support. Using this allows for concurrent querying across multiple connections at once, which is useful for fan-out queries that hit multiple nodes—the socket will wake up and notify when there’s data to be read (we use Python’s select module for handling the wake-ups). This also plays well with cooperative multi-threading libraries like eventlet or gevent; check out psycogreen for an example implementation.
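The fan-out pattern above can be sketched with just the standard library; here, plain socketpairs stand in for async psycopg2 connections to multiple shards (all names are illustrative, not our production code):

```python
import select
import socket

# Three "shards", each represented by a connected socket pair.
pairs = [socket.socketpair() for _ in range(3)]

# Pretend each shard has answered its query.
for i, (server, _client) in enumerate(pairs):
    server.sendall(b"result-%d" % i)

# Fan-in: select() wakes us up as soon as any connection has data,
# so we process results in whatever order they arrive.
readers = [client for _server, client in pairs]
results = []
while readers:
    ready, _, _ = select.select(readers, [], [])
    for sock in ready:
        results.append(sock.recv(64).decode())
        readers.remove(sock)

for server, client in pairs:
    server.close()
    client.close()

print(sorted(results))  # all three shard results collected
```

With psycopg2’s wait_callback registered, the driver performs the same poll/select loop against the real database sockets, which is what lets a coroutine library like gevent interleave the queries.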
Overall, we’ve been very happy with Postgres’ performance and reliability. If you’re interested in working on one of the world’s largest Postgres installations with a small team of infrastructure hackers, get in touch at infrajobs <at> instagram.com.
You can discuss this post at Hacker News
Mike Krieger, co-founder