When Instagram joined Facebook in 2012, we quickly found numerous integration points with Facebook’s infrastructure that allowed us to accelerate our product development and make our community safer. In the beginning, we built these integrations by effectively bouncing through Facebook web services using ad-hoc endpoints. However, we found that this approach could be cumbersome, and it limited our ability to use internal Facebook services.

Starting in April of 2013, we began a massive migration to move Instagram’s back-end stack from Amazon Web Services (AWS) into Facebook’s data centers. This would ease the integration with other internal Facebook systems and allow us to take advantage of tooling built to manage large-scale server deployments. The primary goals of the migration were to keep the site fully available during the transition, avoid impacting feature development, and minimize infrastructure-level changes to avoid operational complexity.

The migration seemed simple enough at first: set up a secure connection between Amazon’s Elastic Compute Cloud (EC2) and a Facebook data center, and migrate services across the gap piece by piece. Easy.

Not so much. The main blocker to this easy migration was that Facebook’s private IP space conflicts with that of EC2. We had but one route: migrate to Amazon’s Virtual Private Cloud (VPC) first, followed by a subsequent migration to Facebook using Amazon Direct Connect. Amazon’s VPC offered the addressing flexibility necessary to avoid conflicts with Facebook’s private network.
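The addressing conflict can be illustrated with a minimal sketch using Python's standard `ipaddress` module. The three CIDR ranges below are hypothetical values chosen for illustration; the real ranges are not given here.

```python
import ipaddress

# Hypothetical ranges for illustration only; the actual ranges are not stated.
EC2_PRIVATE = ipaddress.ip_network("10.0.0.0/8")
DATACENTER_PRIVATE = ipaddress.ip_network("10.32.0.0/11")
VPC_CANDIDATE = ipaddress.ip_network("172.16.0.0/16")


def conflicts(a, b):
    """Return True if the two networks share at least one address."""
    return a.overlaps(b)


print(conflicts(EC2_PRIVATE, DATACENTER_PRIVATE))    # True: cannot be routed side by side
print(conflicts(VPC_CANDIDATE, DATACENTER_PRIVATE))  # False: safe to connect
```

Because a VPC lets the operator choose its own address block, a non-overlapping range like the hypothetical `VPC_CANDIDATE` can be picked before linking the two networks.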

This task looked incredibly daunting on the face of it; we were running many thousands of instances in EC2, with new ones spinning up every day. In order to minimize downtime and operational complexity, it was essential that instances running in both EC2 and VPC seemed as if they were part of the same network. AWS does not provide a way to share security groups or to bridge private EC2 and VPC networks. The only way to communicate between the two private networks is to use the public address space.

So we developed Neti, a dynamic iptables manipulation daemon written in Python and backed by ZooKeeper. Neti provided both the missing security group functionality and a single address for each instance, regardless of whether it was running in EC2 or VPC. It managed thousands of local NAT and filter rules on each instance to allow secure communication using a single, flat “overlay” address space. The NAT rules selected the most efficient path for communication based on the source and destination instances: communication between instances across the VPC and EC2 boundary used the public network, while traffic within either environment used the private network. This was transparent to our application and backend systems because Neti applied the appropriate iptables rules on every instance.
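The path-selection idea can be sketched roughly as follows. This is not Neti's actual code: the `Instance` record, the `pick_target` and `dnat_rule` helpers, and every address are hypothetical, and the sketch only builds an iptables argument list rather than applying it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    overlay_ip: str  # stable address the application always uses
    private_ip: str  # reachable only inside the instance's own network
    public_ip: str   # reachable from either network
    network: str     # "ec2" or "vpc"


def pick_target(src, dst):
    """Same network: use the private address. Across the boundary: public."""
    return dst.private_ip if src.network == dst.network else dst.public_ip


def dnat_rule(src, dst):
    """Build iptables arguments that rewrite dst's overlay address on src."""
    return ["iptables", "-t", "nat", "-A", "OUTPUT",
            "-d", dst.overlay_ip, "-j", "DNAT",
            "--to-destination", pick_target(src, dst)]


# Hypothetical instances using documentation and private address ranges.
web = Instance("192.168.0.1", "10.1.0.5", "203.0.113.5", "ec2")
cache = Instance("192.168.0.3", "10.1.0.7", "203.0.113.7", "ec2")
db = Instance("192.168.0.2", "172.16.0.9", "203.0.113.9", "vpc")

print(pick_target(web, cache))  # 10.1.0.7 (both in EC2, private path)
print(pick_target(web, db))     # 203.0.113.9 (crosses the boundary, public path)
print(" ".join(dnat_rule(web, db)))
```

In this sketch the application on `web` only ever connects to overlay addresses; the generated rule decides which real address the packets are sent to.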

It took less than three weeks to migrate all of the various components that make up Instagram’s stack to the VPC environment from EC2 — something that we believe would have taken much longer without Neti. We experienced no significant downtime during the process and as far as we are aware, this was the fastest-ever VPC migration of this scale.

With the VPC migration complete and our instances running in a compatible address space, Instagram was ready to complete its migration into Facebook’s data centers.

Over the years, we had built a set of EC2-centric tools for managing Instagram’s production systems. These included configuration management scripts, Chef for provisioning, and Fabric for a wide range of operations tasks, from application deployment to database master promotion. This tooling made assumptions specific to EC2 that were no longer valid in the Facebook environment.

To provide portability for our provisioning tools, all of the Instagram-specific software now runs inside of a Linux Container (LXC) on the servers in Facebook’s data centers. Facebook provisioning tools are used to build the base system, and Chef runs inside the container to install and configure Instagram-specific software. To support an infrastructure that spans both EC2 and Facebook’s data centers, our existing Chef recipes were augmented with new logic that allowed them to support the CentOS platform used inside Facebook, alongside Ubuntu, which was used in EC2.

The EC2-specific command-line tools used for basic tasks such as enumerating running hosts, along with the provisioning support in the Chef “knife” tool, were replaced with a single tool. This tool was designed as an abstraction layer and provided a workflow similar to the one used in EC2, which eased the human and technical transition into the new environment.
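One way such an abstraction layer might look is sketched below. The interface, class names, and host names are all hypothetical; the sketch shows only the general shape of a single command that asks every environment the same question, not the tool we actually built.

```python
from abc import ABC, abstractmethod


class HostProvider(ABC):
    """Hypothetical interface that each environment implements."""

    name: str

    @abstractmethod
    def list_hosts(self, role):
        """Return the host names serving the given role."""


class StaticProvider(HostProvider):
    """Stand-in backend; a real one would query EC2 or the data center."""

    def __init__(self, name, inventory):
        self.name = name
        self._inventory = inventory

    def list_hosts(self, role):
        return sorted(self._inventory.get(role, []))


def enumerate_hosts(providers, role):
    """Same workflow regardless of where the hosts actually live."""
    return [(p.name, host) for p in providers for host in p.list_hosts(role)]


providers = [
    StaticProvider("ec2", {"web": ["web-ec2-2", "web-ec2-1"]}),
    StaticProvider("datacenter", {"web": ["web-dc-1"], "db": ["db-dc-1"]}),
]
print(enumerate_hosts(providers, "web"))
```

Keeping the caller-facing function identical across backends is what lets operators carry their existing habits into the new environment.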

Once the tooling was ready and the environment was in place, the migration of Instagram’s production infrastructure from VPC to Facebook’s data centers was completed in two weeks.

This multi-stage effort was hugely successful and hit the major goals laid out when the project began. In addition, during the planning and execution phases of the migration, the team shipped major features such as Instagram Direct, and our user base nearly doubled in size. We held true to our initial objective of minimizing change, so the transition was mostly transparent to our engineering teams.

Looking back, there were a few key takeaways from the year-long project:

  • Plan to change just the bare minimum needed to support the new environment, and avoid the temptation of “while we’re here.”
  • Sometimes “crazy” ideas work — Neti is proof of that.
  • Invest in making your tools reliable; the last thing you need is unexpected curveballs when conducting a large-scale migration like this.
  • Reuse the concepts and workflows already familiar to the team, so that communicating changes does not add further complexity.

This was a coordinated effort that spanned multiple teams and a number of individual contributors. We’ll be providing a deeper dive on the work that went into the migration in the weeks to come, so keep an eye on this space.

By Rick Branson, Pedro Canahuati and Nick Shortway