Archive for the ‘Software’ Category

Incremental deployment

Tuesday, December 22nd, 2009

I’ve recently had a chance to look at a high availability system designed and built by Forward colleagues Andy Kent and Paul Ingles. It is a critical web service with a very high impact of failure. Essentially, it must stay up at all times.

The service is hosted on Amazon EC2. It makes use of EC2’s geographically distributed regions and different availability zones within each region, fronted by AWS Elastic Load Balancing and additional global DNS fail over outside of EC2/AWS.

[Figure: high-availability-arch]

A part of the project that struck me as particularly interesting is the deployment strategy Paul and Andy settled on. Regardless of how much trust we have in our builds and QA process, deployments become a whole different, much more stressful activity when critical systems like the one under discussion are involved. Andy mentioned it is important to find the balance between what to automate and bits that should require manual input.

# deploy.rb

task :us_1b do
  set :region, 'us-east-1'
  set :servers, us_1b
  # More US 1b specific setup...
end

task :eu_1a do
  set :region, 'eu-west-1'
  set :servers, eu_1a
  # More EU 1a specific setup...
end

This service is deployed incrementally, one availability zone at a time, e.g. cap us_1b deploy. Each deployment step is manual: it requires someone to push the button. This means that if something goes wrong, only part of the system is affected and the rest keeps providing redundancy. Even if a failure is severe enough to bring the newly deployed code down, only one availability zone in one region fails, and the load balancers make sure the failure is transparent to end users and does not affect the system as a whole.
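The zone-by-zone rollout can be sketched in plain Ruby. This is only an illustration under assumptions, not the project's actual tooling: rollout, the deploy and healthy callables and the confirm hook are hypothetical names standing in for cap <zone> deploy, a post-deploy check and the person pushing the button, and the us_1c zone is made up for the example.

```ruby
# Hypothetical sketch: deploy to one availability zone at a time and stop
# at the first zone that fails its health check, leaving the rest untouched.
def rollout(zones, deploy:, healthy:, confirm: ->(_zone) { true })
  deployed = []
  zones.each do |zone|
    # Manual gate: someone has to approve each step.
    break unless confirm.call(zone)
    deploy.call(zone)
    deployed << zone
    # A failed check halts the rollout; remaining zones keep the old release.
    break unless healthy.call(zone)
  end
  deployed
end

zones = %w[us_1b us_1c eu_1a]
touched = rollout(
  zones,
  deploy:  ->(zone) { puts "cap #{zone} deploy" },
  healthy: ->(zone) { zone != 'us_1c' } # pretend us_1c fails after deploy
)
puts "zones touched: #{touched.join(', ')}"
```

In this run eu_1a never receives the faulty release, which is the property the manual, incremental process is after.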

Deployment setup automation

Tuesday, November 10th, 2009

Part of my work these days has to do with building and deploying numerous experimental applications with varying life cycles. Many of these applications get built and put on a server in less than a day, only to be shut down a couple of days later and never looked at again; others get turned off and revisited after some time, while others graduate to larger, wider scope systems.

This means that I get to deploy applications for the first time more frequently than usual. Also, because we deploy to virtualised infrastructures (including an internal cloud, Slicehost and Amazon EC2), slice instances (servers) tend to get rebuilt more often than they would in the absence of virtualisation. First time deployments are generally more involved than subsequent ones because there is setup to be done and software to be installed before the host servers can accommodate the application.

One way to treat first time deployment woes is to create and maintain images of the system in the state required to host the application. I find this to work well when dealing with moderate numbers of applications and servers, whereas creating and keeping images up to date has a tendency to become tedious and inflexible as the number of applications and images increases.

As an alternative, we can move responsibility for prerequisite system setup and installation closer to the application code, in the form of an after hook to the deploy:setup task that we call the first time we deploy an application with Capistrano. Here’s some Capistrano code that performs one time setup tasks.

namespace :util do
  task :install_libraries do
    sudo 'apt-get install libxml2 libxml2-dev libmysqlclient15-dev -y'
  end
end

after 'deploy:setup', 'util:install_libraries'

With this approach, the application knows how to set up the system the way it needs it to be the next time it gets deployed for the first time. As an added benefit, the Capistrano code serves as documentation for the application’s system requirements.

VCS practices over features

Saturday, August 29th, 2009

I’ve often heard people I know and respect say that git is leaps and bounds better than Subversion. I’ve been a relatively early adopter of git; it’s been my VCS of choice for almost two years now. Even though I find it superior to most of the competition, I struggle to justify the “leaps and bounds” claim and would rather more modestly call it “a step forward”.

This is probably due to the practices we find benefit our development process. Git puts great emphasis on branching, something we generally tend to avoid (to clarify, I’m not referring to local branching). We concentrate on feedback based on the usage of our applications. This means that we strive to commit as often as possible and, most importantly, deploy to production at a constant rate. Grossly simplified, the process is: identify a small coherent feature, build it, commit it to the master branch and deploy. No part of the codebase is owned by a subdivision of the team, everyone works on everything.

By far the most popular git commands we issue are git pull, git add and git push, not that different to svn update and svn commit.

When I first started using git I was wondering if I had developed a fear of branching because of Subversion’s inefficiencies in that area. In reality, I think that an environment where every developer constantly has an up to date understanding of the codebase and especially a current grasp of the design and overall vision will always be more efficient than working remotely and having merge checkpoints, no matter how cleverly the VCS handles branching. This is why I think a faster, distributed, superior at merging VCS is not something more dramatic than a desirable step forward.

Hello world nginx module

Saturday, August 15th, 2009

Several times over the past few months I made short lived attempts at delving into the mechanics of nginx modules. Although an invaluable resource to anyone seriously interested in the subject, Emiller’s Guide To Nginx Module Development doesn’t at the time of this writing include a quick-start example I could hack together and see in action. Getting something to run as quickly as possible is my preferred way of starting the study of new things, and every time I caught myself searching the web for a “Hello world nginx module”.

I will not go into any details, as Emiller’s Guide does an excellent job at that. I’m only going to mention the steps I believe are absolutely necessary to write, compile and run an nginx handler module that responds to every request with the string “hello world”.

A minimum of two files is required for writing an nginx module. The first should be called config and looks something like this:

ngx_addon_name=ngx_http_hello_world_module
HTTP_MODULES="$HTTP_MODULES ngx_http_hello_world_module"
NGX_ADDON_SRCS="$NGX_ADDON_SRCS $ngx_addon_dir/ngx_http_hello_world_module.c"

The second is the module’s implementation in C and nginx convention suggests a name like ngx_http_modulename_module.c, in this case ngx_http_hello_world_module.c .

#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

static char *ngx_http_hello_world(ngx_conf_t *cf, ngx_command_t *cmd, void *conf);

static ngx_command_t  ngx_http_hello_world_commands[] = {

  { ngx_string("hello_world"),
    NGX_HTTP_LOC_CONF|NGX_CONF_NOARGS,
    ngx_http_hello_world,
    0,
    0,
    NULL },

    ngx_null_command
};

static u_char  ngx_hello_world[] = "hello world";

static ngx_http_module_t  ngx_http_hello_world_module_ctx = {
  NULL,                          /* preconfiguration */
  NULL,                          /* postconfiguration */

  NULL,                          /* create main configuration */
  NULL,                          /* init main configuration */

  NULL,                          /* create server configuration */
  NULL,                          /* merge server configuration */

  NULL,                          /* create location configuration */
  NULL                           /* merge location configuration */
};

ngx_module_t ngx_http_hello_world_module = {
  NGX_MODULE_V1,
  &ngx_http_hello_world_module_ctx, /* module context */
  ngx_http_hello_world_commands,   /* module directives */
  NGX_HTTP_MODULE,               /* module type */
  NULL,                          /* init master */
  NULL,                          /* init module */
  NULL,                          /* init process */
  NULL,                          /* init thread */
  NULL,                          /* exit thread */
  NULL,                          /* exit process */
  NULL,                          /* exit master */
  NGX_MODULE_V1_PADDING
};

static ngx_int_t ngx_http_hello_world_handler(ngx_http_request_t *r)
{
  ngx_buf_t    *b;
  ngx_chain_t   out;

  r->headers_out.content_type.len = sizeof("text/plain") - 1;
  r->headers_out.content_type.data = (u_char *) "text/plain";

  b = ngx_pcalloc(r->pool, sizeof(ngx_buf_t));
  if (b == NULL) {
    return NGX_HTTP_INTERNAL_SERVER_ERROR;
  }

  out.buf = b;
  out.next = NULL;

  /* sizeof counts the terminating NUL, which must not be sent */
  b->pos = ngx_hello_world;
  b->last = ngx_hello_world + sizeof(ngx_hello_world) - 1;
  b->memory = 1;
  b->last_buf = 1;

  r->headers_out.status = NGX_HTTP_OK;
  r->headers_out.content_length_n = sizeof(ngx_hello_world) - 1;
  ngx_http_send_header(r);

  return ngx_http_output_filter(r, &out);
}

static char *ngx_http_hello_world(ngx_conf_t *cf, ngx_command_t *cmd, void *conf)
{
  ngx_http_core_loc_conf_t  *clcf;

  clcf = ngx_http_conf_get_module_loc_conf(cf, ngx_http_core_module);
  clcf->handler = ngx_http_hello_world_handler;

  return NGX_CONF_OK;
}

Both config and ngx_http_hello_world_module.c should be placed in the same directory, let’s say /etc/ngxhelloworld. Modules are compiled into the nginx binary. To do so, download the nginx source, uncompress, and in the nginx source directory run:

./configure --add-module=/etc/ngxhelloworld
make
sudo make install

Finally, add a module directive to nginx’s configuration (default is /usr/local/nginx/conf/nginx.conf) to enable the module for a location.

location = /hello {
  hello_world;
}

At this point, we can start nginx; navigating to http://localhost/hello will yield the result of all this labor.

Alongside Emiller’s Guide, I also found reading nginx third party module code helpful.

Asynchronous session content injection

Thursday, August 6th, 2009

Applying a clear distinction between stateless and stateful content when designing a web application is tricky but worth tackling early so that content not specific to user sessions can benefit from web caching. The technique we are trying out for scramble.com reminds me of what I described in State separation and was introduced to me by Mike Jones who was inspired by the Dynamically Update Cached Pages chapter in Advanced Rails Recipes.

[Figure: asynchronous-session-content-injection]

The idea involves serving non session specific resources independently of personalized content and using AJAX calls to inject session specific content into the page.

require 'rubygems'
require 'sinatra'
require 'json'

configure do
  enable :sessions
end

get '/' do
  headers['Cache-Control'] = 'max-age=60, must-revalidate'
  erb :index
end

get '/userinfo' do
  if session[:user]
    JSON.dump(:user => session[:user])
  else
    halt 401
  end
end

get '/login' do
  session[:user] = 'rock'
  redirect '/'
end

get '/logout' do
  session.clear
  redirect '/'
end

Notice some of the headers for '/':

$ curl -I http://localhost:4567/
Cache-Control: max-age=60, must-revalidate
Set-Cookie: rack.session=BAh7AA%3D%3D%0A; path=/

The Cache-Control policy instructs a web cache to keep this version of the resource for 60 seconds before requesting a fresh one. Set-Cookie however will usually cause a web cache to never store the response and always query its back end.
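The interaction between the two headers can be sketched as a small decision function. This is a deliberately simplified model of a shared cache's default behaviour, not Varnish's actual logic, and storable_for is a hypothetical name.

```ruby
# Simplified, hypothetical model of a shared cache deciding how long it may
# store a response. Real caches consider many more headers than these two.
def storable_for(headers)
  # A response that sets a cookie is treated as user specific: never stored.
  return 0 if headers.key?('Set-Cookie')

  cache_control = headers.fetch('Cache-Control', '')
  match = cache_control.match(/max-age=(\d+)/)
  match ? match[1].to_i : 0
end

with_cookie = {
  'Cache-Control' => 'max-age=60, must-revalidate',
  'Set-Cookie'    => 'rack.session=BAh7AA%3D%3D%0A; path=/'
}
without_cookie = { 'Cache-Control' => 'max-age=60, must-revalidate' }

puts storable_for(with_cookie)    # 0: goes to the back end every time
puts storable_for(without_cookie) # 60: served from cache for a minute
```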

The following configuration tells Varnish to throw away the cookie from any request/response that doesn’t match one of the URLs that require authorization, thus causing it to react to response cache policies.

sub vcl_recv {
  if (req.url !~ "^(/login|/logout|/userinfo)") {
    unset req.http.cookie;
  }
}

sub vcl_fetch {
  if (req.url !~ "^(/login|/logout|/userinfo)") {
    unset obj.http.set-cookie;
  }
}
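For readers more at home in Ruby than VCL, the same rule can be expressed as a Rack-style middleware. This is a sketch for illustration only: StripCookies is a hypothetical class, it assumes nothing beyond the plain Rack calling convention (call(env) returning status, headers and body), and it does not replace doing this in the cache itself.

```ruby
# Hypothetical Rack-style middleware mirroring the VCL above: cookies are
# dropped on the way in and on the way out for every path that does not
# need a session.
class StripCookies
  SESSION_PATHS = %r{^(/login|/logout|/userinfo)}

  def initialize(app)
    @app = app
  end

  def call(env)
    return @app.call(env) if env['PATH_INFO'] =~ SESSION_PATHS

    env.delete('HTTP_COOKIE') # like "unset req.http.cookie"
    status, headers, body = @app.call(env)
    # like "unset obj.http.set-cookie"
    headers = headers.reject { |name, _| name.downcase == 'set-cookie' }
    [status, headers, body]
  end
end

# A stand-in application that always tries to set a session cookie.
app = ->(_env) { [200, { 'Set-Cookie' => 'rack.session=abc' }, ['Hi']] }
stack = StripCookies.new(app)

p stack.call({ 'PATH_INFO' => '/' })[1]      # cookie header removed
p stack.call({ 'PATH_INFO' => '/login' })[1] # cookie header kept
```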

A snippet from the HTML response for '/':

<h1>Hi</h1>
<div id="nav">
  <a href="/login" class='login-control'>Login</a>
</div>

… and the javascript for asynchronously injecting session data to the page:

$(function() {
  $.getJSON('/userinfo', function(data) {
    $('h1').text('Hi ' + data.user);
    $('#nav .login-control').attr('href', '/logout').html('logout');
  })
})

In summary, it is likely that a website will have significant amounts of content that is intended for everyone without the need for personalization. The performance of serving that content can benefit from web caching, but that becomes difficult as many websites’ user experience depends on the presence of user sessions. Separating stateless from session specific content at the resource level and using a combination of HTTP and AJAX to merge the results of requests for both types of resources will make stateless content cacheable by decoupling it from the unnecessary cookie dependency.

Runnable code example : http://pastie.org/573878

Rack::CacheHeaders code

Monday, May 18th, 2009

A few months ago I wrote about a possible method for centrally configuring HTTP cache headers in Rack based web applications which I called Rack::CacheHeaders. This is useful if your application’s architecture involves tools like Squid or Varnish, or if you are generally interested in harvesting the numerous advantages of HTTP caching for your web application.

The code has evolved a bit since and proven useful in a number of production systems. I created a gist of Rack::CacheHeaders in case someone else finds it handy. The tool is not exhaustive in terms of policies as found in the HTTP specs; it’s a collection of the ones we needed in the projects it’s been used in so far. Consider adding ones you need to the gist to make the code more complete and widely useful.

Rack::CacheHeaders allows configuring HTTP cache policy response headers based on request URI patterns. For example, to set the Cache-Control: max-age header for a /guitars/:id resource to one hour:

Rack::CacheHeaders.configure do |cache|
  cache.max_age(/^\/guitars\/\d+$/, 3600)
end
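To give a feel for how pattern based configuration like this might work underneath, here is a minimal sketch. It is not the code from the gist: PatternMaxAge and its rules hash are hypothetical, and only the plain Rack calling convention is assumed.

```ruby
# Hypothetical sketch of the idea, not the code from the gist: a Rack-style
# middleware that adds Cache-Control: max-age when the request path matches
# a configured pattern.
class PatternMaxAge
  def initialize(app, rules)
    @app = app
    @rules = rules # { Regexp => seconds }
  end

  def call(env)
    status, headers, body = @app.call(env)
    # Hash#find yields [pattern, seconds] pairs and returns the first match.
    _, seconds = @rules.find { |pattern, _| env['PATH_INFO'] =~ pattern }
    headers = headers.merge('Cache-Control' => "max-age=#{seconds}") if seconds
    [status, headers, body]
  end
end

app = ->(_env) { [200, { 'Content-Type' => 'text/plain' }, ['a guitar']] }
stack = PatternMaxAge.new(app, { %r{^/guitars/\d+$} => 3600 })

p stack.call({ 'PATH_INFO' => '/guitars/42' })[1]  # includes Cache-Control
p stack.call({ 'PATH_INFO' => '/guitars/new' })[1] # left untouched
```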

Download/develop Rack::CacheHeaders

97 Things Every Software Architect Should Know

Saturday, February 28th, 2009

A few months ago I wrote one of the axioms for a community effort called 97 Things Every Software Architect Should Know which was driven and edited by Richard Monson-Haefel. This collection of principles, as contributed by an impressive range of software architects around the world, was recently released as a book by O’Reilly Media and is well worth a look if you’re interested in pragmatic advice based on how some of our colleagues approach technology projects.

Caching proxy fronted web consumer

Saturday, February 14th, 2009

Consider an application which as part of its functionality queries a product search web service.

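require 'net/http'
require 'uri'

WEB_SERVICE_ADDRESS = 'http://www.example.com'

url = URI.parse(WEB_SERVICE_ADDRESS)

Net::HTTP.start(url.host, url.port) do |http|
  # The second argument to get is a headers hash, so the query goes in the path
  http.get('/product-search?q=guitar')
end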

Inspecting the response headers, we notice the web service instructs consumers that the results of the query will remain the same for one hour.

curl -I "http://www.example.com/product-search?q=guitar"

HTTP/1.1 200 OK
Content-Type: text/html
Cache-Control: max-age=3600, must-revalidate
Content-Length: 32650
Date: Sat, 14 Feb 2009 13:53:31 GMT
Age: 0
Connection: keep-alive

At this point we can choose to ignore the cache control header and keep on querying the service for this specific resource regardless of whether the response is going to be the same. This is suboptimal for the consumer, which will suffer unnecessary latency penalties, the service, which will have to respond to inessential requests, and the network which will be subject to unnecessary bandwidth usage. Another option involves making the web consumer aware of the service’s caching policies so that it only queries for data that it doesn’t have or data that’s become stale. This option remedies the above problems but introduces additional complexity to the consumer.
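To make the extra complexity of the second option concrete, here is a rough sketch of a consumer that keeps its own cache and honours max-age. CachingConsumer is a hypothetical class; the fetcher and clock are injected so the example runs without a network, whereas a real consumer would wrap Net::HTTP and would also need to handle revalidation, which is left out here.

```ruby
# Hypothetical sketch of the second option: the consumer caches responses
# itself and only goes back to the service once an entry has gone stale.
class CachingConsumer
  Entry = Struct.new(:body, :expires_at)

  def initialize(fetcher, clock: -> { Time.now })
    @fetcher = fetcher # callable: path -> [headers, body]
    @clock = clock
    @entries = {}
  end

  def get(path)
    entry = @entries[path]
    return entry.body if entry && @clock.call < entry.expires_at

    headers, body = @fetcher.call(path)
    # String#[] with a regexp and a group index returns nil when absent.
    max_age = headers.fetch('Cache-Control', '')[/max-age=(\d+)/, 1].to_i
    @entries[path] = Entry.new(body, @clock.call + max_age) if max_age > 0
    body
  end
end

calls = 0
fetcher = lambda do |_path|
  calls += 1
  [{ 'Cache-Control' => 'max-age=3600, must-revalidate' }, 'results']
end

now = Time.at(0)
consumer = CachingConsumer.new(fetcher, clock: -> { now })
consumer.get('/product-search?q=guitar')
consumer.get('/product-search?q=guitar') # served from the local cache
now += 3601
consumer.get('/product-search?q=guitar') # stale, so the service is queried
puts calls
```

Even this stripped down version has to track expiry times and parse headers, and every consumer would need its own copy of that logic.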

A third option involves introducing a caching proxy to the web consumer’s stack responsible for mediating the service/consumer interactions solely based on the content’s caching characteristics.

[Figure: caching-proxy-fronted-web-consumer]

Benefits of this approach include: the consumer never has to deal with any caching logic; no effort is required in re-implementing cache handling code; the caching engine is likely to perform better than custom caching code in the consumer because it has been built and optimized for this purpose; and the caching proxy can be re-used by more than one type of consumer, or by more than one instance of the same consumer in the stack. As a possible downside, the caching proxy is an additional layer in the consumer stack, which can add network latency (on the consumer’s LAN).

Here’s the configuration needed in order to use Varnish as a caching web consumer proxy for the above example.

# varnish.conf

backend default {
  .host = "www.example.com";
  .port = "http";
}

The only thing that changes in the consumer is the address it directs its requests to.

WEB_SERVICE_ADDRESS = 'http://service-proxy'

url = URI.parse(WEB_SERVICE_ADDRESS)

Net::HTTP.start(url.host, url.port) do |http|
  http.get('/product-search?q=guitar')
end

Distributed key-value store indexing

Sunday, February 1st, 2009

Distributed key-value stores present an interesting alternative to some of the functionality relational databases are commonly employed for. Advantages include improved performance, easy replication, horizontal scaling and redundancy.

By nature, key-value stores offer one way of retrieving data: by some sort of primary key which uniquely identifies each entry. But what about queries that require more elaborate input in order to collect relevant entries? Full text search engines like Sphinx and Lucene do exactly this and, when used in conjunction with a database, will query their indexes and return a collection of ids which are then used to retrieve the results from the database. Full text search engines support indexing data sources other than RDBMSs, so there’s no reason why one couldn’t index a distributed key-value store.
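The two-step lookup can be sketched with stand-ins before bringing in the real tools. Everything here is an assumption made for illustration: a Hash plays the key-value store, a toy inverted index plays the search engine, and STORE, INDEX and search are hypothetical names.

```ruby
# Hypothetical sketch of the two-step lookup: a search index maps terms to
# ids, and the ids are then fetched from the key-value store.
STORE = {
  '1' => 'red guitar',
  '2' => 'blue guitar',
  '3' => 'red piano'
}

# Build term -> [ids]; a real engine would also stem, rank and so on.
INDEX = STORE.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(id, title), index|
  title.split.each { |term| index[term] << id }
end

def search(query)
  # Step 1: ask the index for ids matching every term in the query.
  ids = query.split.map { |term| INDEX.fetch(term, []) }.reduce(:&) || []
  # Step 2: fetch the matching entries from the store by primary key.
  ids.map { |id| STORE[id] }
end

p search('red guitar') # ["red guitar"]
p search('guitar')     # ["red guitar", "blue guitar"]
```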

[Figure: distributed-key-value-store-index]

Here, we’ll look at how we can integrate Sphinx with MemcacheDB, a distributed key-value store which conforms to the memcached protocol and uses Berkeley DB as its storage back-end.

Sphinx comes with an xmlpipe2 datasource, a generic XML interface aimed at simplifying custom integration. What this means is that our application can transform content from MemcacheDB into this format and feed it to Sphinx for indexing. The source block at the top of the following Sphinx configuration instructs Sphinx to use the xmlpipe2 source type and invoke the ruby /app/lib/sphinxpipe.rb script in order to retrieve the data to index.

# sphinx.conf

source products_src
{
  type = xmlpipe2
  xmlpipe_command = ruby /app/lib/sphinxpipe.rb
}

index products
{
  source = products_src
  path = /app/sphinx/data/products
  docinfo = extern
  mlock = 0
  morphology = stem_en
  min_word_len = 1
  charset_type = utf-8
  enable_star = 1
  html_strip = 0
}

indexer
{
  mem_limit = 256M
}

searchd
{
  port = 3312
  log = /app/sphinx/log/searchd.log
  query_log = /app/sphinx/log/query.log
  read_timeout = 5
  max_children = 30
  pid_file = /app/sphinx/searchd.pid
  max_matches = 10000
  seamless_rotate = 1
  preopen_indexes = 0
  unlink_old = 1
}

Following is a Product class. Each product instance can present itself as xmlpipe2 data. The class itself can render the entire product catalog as an xmlpipe2 data source. It also has a search method used for querying Sphinx and retrieving matched products from MemcacheDB. Finally, there’s a bootstrap method for populating the store with some example data.

# product.rb

require "rubygems"
require "xml/libxml"
require "memcached"
require "riddle"

class Product
  attr_reader :id
  MEM = Memcached.new('localhost:21201')

  def initialize(id, title)
    @id, @title = id, title
  end

  def to_sphinx_doc
    sphinx_document = XML::Node.new('sphinx:document')
    sphinx_document['id'] = @id
    sphinx_document << title = XML::Node.new('title')
    title << @title
    sphinx_document
  end

  # Query sphinx and load products with matched ids from MemcacheDB
  def self.search(query)
    client = Riddle::Client.new
    client.match_mode = :any
    client.max_matches = 10_000
    results = client.query(query, 'products')
    ids = results[:matches].map {|m| m[:doc].to_s}
    MEM.get(ids) if ids.any?
  end

  # Load all products from MemcacheDB and convert them to xmlpipe2 data
  def self.sphinx_datasource
    docset = XML::Document.new.root = XML::Node.new("sphinx:docset")
    docset << sphinx_schema = XML::Node.new("sphinx:schema")
    sphinx_schema << sphinx_field = XML::Node.new('sphinx:field')
    sphinx_field['name'] = 'title'

    keys = MEM.get('product_keys')
    products = MEM.get(keys)
    products.each { |id, product| docset << product.to_sphinx_doc }

    %(<?xml version="1.0" encoding="utf-8"?>\n#{docset})
  end

  # Create some products and store them in MemcacheDB
  def self.bootstrap
    product_ids = ('1'..'5').to_a.inject([]) do |ids, id|
      product = Product.new(id, "product #{id}")
      MEM.set(product.id, product)
      ids << id
    end
    MEM.set('product_keys', product_ids)
  end
end

The sphinxpipe.rb script looks like this.

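# sphinxpipe.rb
# Load the Product class; assumes product.rb sits next to this script
require File.join(File.dirname(__FILE__), 'product')

Product.bootstrap
puts Product.sphinx_datasource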

With MemcacheDB (or even memcached for the purpose of this example) running, we can tell Sphinx to create an index of products by invoking indexer --all -c sphinx.conf and then start the search daemon – searchd -c sphinx.conf. Now we’re ready to start querying the index and retrieving results from the distributed store.

puts Product.search('product 1').inspect

It is not uncommon for the database to become a performance hotspot. The integration of a fast, distributed key-value store with an efficient search engine can be an interesting substitute for high throughput data retrieval operations.

State separation

Sunday, February 1st, 2009

It is usual for web applications to deal with serving content specific to a user’s session. This makes web caching harder to implement as we don’t want content that is meant to be viewed by a particular user being cached and accidentally offered to others. Some HTTP accelerators, like Varnish, by default do not cache responses that contain cookies at all. However, not all content is always tied to a user’s session, and if that content doesn’t change in real time, it makes sense to cache the parts that are common to all users in order to improve efficiency. With this in mind, one logical split could be made between parts of the system that are globally cache friendly and ones that aren’t.

Consider online retailer websites which usually operate in two modes, one for visitors and one for logged in users. Logged in users are presented with a customized, session specific experience, yet data like the product catalog is essentially the same regardless of whether one is logged in or not and it makes sense for everyone to be accessing the same cached copy of a common resource.

A possible solution involves creating two separate web applications, one entirely dedicated to stateless interactions and one meant for pages that are rendered as part of a user’s session. This might seem like overkill, but it clearly enforces the divide between what can and what can’t be cached. It also promotes reuse of the system’s web caching layer, which now serves content to site “visitors” as well as to the stateful components. The stateful application can delegate requests for potentially cached content to its stateless counterpart via the caching layer and decorate the responses with session specific data.
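The delegation described above can be sketched with three callables. This is a toy model under stated assumptions: stateless_app, caching_layer and stateful_app are hypothetical names, and the Hash standing in for the web cache never expires entries, unlike a real one.

```ruby
# Hypothetical sketch of the split: a stateless application renders content
# common to everyone, a caching layer sits in front of it, and the stateful
# application fetches through that layer and decorates the result.
renders = 0
stateless_app = lambda do |path|
  renders += 1
  "<h1>Catalog</h1><!-- #{path} -->"
end

# Stand-in for the web cache in front of the stateless application.
cache = {}
caching_layer = ->(path) { cache[path] ||= stateless_app.call(path) }

stateful_app = lambda do |path, session|
  page = caching_layer.call(path)
  # Decorate the shared page with session specific data, if there is any.
  session[:user] ? "Hello #{session[:user]} #{page}" : page
end

puts stateful_app.call('/catalog', { user: 'rock' })
puts stateful_app.call('/catalog', {})
puts "stateless renders: #{renders}"
```

Both requests are answered from a single render of the catalog; only the decoration differs per session.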

[Figure: split_by_state]

Web caching presents but one way to cache data that remains static for predefined periods of time. Apart from harnessing proven existing tools, this form of caching comes with the advantage that its policies are universally understood and can significantly improve a website’s efficiency in ways beyond the maintainer’s control. Retrofitting web caching into an application that hasn’t been designed with it in mind can be difficult, so it is worth logically separating cacheable and non cacheable resources early on.