Dec 22 2009

Incremental deployment

I've recently had a chance to look at a high-availability system designed and built by Forward colleagues Andy Kent and Paul Ingles. It is a critical web service where the cost of failure is very high. Essentially, it must stay up at all times.

The service is hosted on Amazon EC2. It is spread across EC2's geographically distributed regions and across different availability zones within each region, fronted by AWS Elastic Load Balancing, with additional global DNS failover outside of EC2/AWS.

[Figure: high-availability architecture]

A part of the project that struck me as particularly interesting is the deployment strategy Paul and Andy settled on. Regardless of how much trust we have in our builds and QA process, deployment becomes a very different, much more stressful activity when critical systems like the one under discussion are involved. Andy mentioned it is important to find the balance between what to automate and what should require manual input.

# deploy.rb

task :us_1b do
  set :region, 'us-east-1'
  set :servers, us_1b
  # More US 1b specific setup...
end

task :eu_1a do
  set :region, 'eu-west-1'
  set :servers, eu_1a
  # More EU 1a specific setup...
end

This service is deployed incrementally, one availability zone at a time, e.g. cap us_1b deploy. Each deployment step is manual: it requires someone to push the button. This means that if something goes wrong, only part of the system is affected, while the zones that have not yet been deployed to keep running the previous release. Even if the failure is severe enough to bring the newly deployed servers down, only one availability zone in one region fails, and the load balancers make sure the failure is transparent to end users and does not affect the system as a whole.
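
To make the "one zone at a time, with a manual gate" idea concrete, here is a minimal sketch in plain Ruby. It is not Paul and Andy's actual tooling: the method name incremental_deploy and the deploy/confirm callbacks are hypothetical, made up for illustration. Only the zone names come from the tasks above.

```ruby
# Hypothetical sketch of an incremental rollout with a manual gate per zone.
# Zone names mirror the Capistrano tasks above; everything else is assumed.
ZONES = %w[us_1b eu_1a].freeze

# zones:   ordered list of availability zone names
# deploy:  callable taking a zone, returning a truthy value on success
# confirm: callable taking a zone, returning true if the operator approves
#
# Returns a hash describing which zones were deployed and, if the rollout
# stopped early, where and why.
def incremental_deploy(zones, deploy:, confirm:)
  deployed = []
  stopped_at = nil
  reason = nil

  zones.each do |zone|
    # Manual gate: nothing is deployed until someone approves this zone.
    unless confirm.call(zone)
      stopped_at = zone
      reason = :declined
      break
    end

    # A failed deploy halts the rollout, so the remaining zones stay on
    # the previous release and keep serving traffic.
    unless deploy.call(zone)
      stopped_at = zone
      reason = :failed
      break
    end

    deployed << zone
  end

  { deployed: deployed, stopped_at: stopped_at, reason: reason }
end
```

In real use, the deploy callback could shell out to Capistrano, for example ->(zone) { system('cap', zone, 'deploy') }, and confirm could prompt on the terminal. The point of the sketch is only that a failure or a "no" stops the rollout with the blast radius limited to a single zone.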