Also on Twitter (twitter.com/nutrun)

Archive for February, 2009

97 Things Every Software Architect Should Know

Saturday, February 28th, 2009

A few months ago I wrote one of the axioms for 97 Things Every Software Architect Should Know, a community effort driven and edited by Richard Monson-Haefel. This collection of principles, contributed by an impressive range of software architects around the world, was recently released as a book by O’Reilly Media. It is well worth a look if you’re interested in pragmatic advice based on how some of our colleagues approach technology projects.

Caching proxy fronted web consumer

Saturday, February 14th, 2009

Consider an application that, as part of its functionality, queries a product search web service.

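require 'net/http'
require 'uri'

WEB_SERVICE_ADDRESS = 'http://www.example.com'

url = URI.parse(WEB_SERVICE_ADDRESS)

Net::HTTP.start(url.host, url.port) do |http|
  # The second argument to get is a hash of request headers, so the query goes in the path
  http.get('/product-search?q=guitar')
end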

Inspecting the response headers, we notice the web service instructs consumers that the results of the query will remain the same for one hour.

curl -I "http://www.example.com/product-search?q=guitar"

HTTP/1.1 200 OK
Content-Type: text/html
Cache-Control: max-age=3600, must-revalidate
Content-Length: 32650
Date: Sat, 14 Feb 2009 13:53:31 GMT
Age: 0
Connection: keep-alive

At this point we can choose to ignore the Cache-Control header and keep querying the service for this resource regardless of whether the response is going to be the same. This is suboptimal for the consumer, which suffers unnecessary latency; for the service, which has to respond to inessential requests; and for the network, which carries unnecessary traffic. Another option is to make the web consumer aware of the service’s caching policies so that it only queries for data that it doesn’t have or data that has become stale. This remedies the above problems but introduces additional complexity to the consumer.
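To give a sense of the complexity the second option adds, here is a minimal sketch of the freshness check a cache-aware consumer would need. The CachedEntry class and its method names are hypothetical, and the sketch only considers max-age and Age; a real implementation would also have to handle revalidation, Expires, Vary and the other directives.

```ruby
# Hypothetical helper (not part of the original consumer): decides whether a
# stored response may be reused, based only on Cache-Control max-age and Age.
class CachedEntry
  attr_reader :body

  def initialize(body, headers, stored_at = Time.now)
    @body = body
    @stored_at = stored_at
    # A missing or unparsable max-age is treated as 0, i.e. never fresh
    @max_age = headers['Cache-Control'].to_s[/max-age=(\d+)/, 1].to_i
    # Seconds the response had already spent in upstream caches
    @initial_age = headers['Age'].to_i
  end

  # Fresh while the total age of the response is below max-age
  def fresh?(now = Time.now)
    @initial_age + (now - @stored_at) < @max_age
  end
end

headers = { 'Cache-Control' => 'max-age=3600, must-revalidate', 'Age' => '0' }
stored_at = Time.now
entry = CachedEntry.new('<html>...</html>', headers, stored_at)

puts entry.fresh?(stored_at + 60)    # one minute after storing
puts entry.fresh?(stored_at + 3601)  # just over an hour after storing
```

The consumer would consult fresh? before each request and only contact the service when it returns false.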

A third option involves introducing a caching proxy to the web consumer’s stack responsible for mediating the service/consumer interactions solely based on the content’s caching characteristics.

[Figure: caching proxy fronted web consumer]

Benefits of this approach include: the consumer never has to deal with any caching logic; no effort is spent re-implementing cache handling code; the caching engine is likely to perform better than custom caching code in the consumer because it has been built and optimized for this purpose; and the caching proxy can be reused by more than one type of consumer, or by more than one instance of the same consumer, in the stack. As a possible downside, the caching proxy is an additional layer in the consumer stack, which can add network latency (on the consumer’s LAN).

Here’s the configuration needed in order to use Varnish as a caching web consumer proxy for the above example.

# varnish.conf

backend default {
  .host = "www.example.com";
  .port = "http";
}

The only thing that changes in the consumer is the address it directs its requests to.

WEB_SERVICE_ADDRESS = 'http://service-proxy'

url = URI.parse(WEB_SERVICE_ADDRESS)

Net::HTTP.start(url.host, url.port) do |http|
  http.get('/product-search?q=guitar')
end

Distributed key-value store indexing

Sunday, February 1st, 2009

Distributed key-value stores present an interesting alternative to some of the functionality relational databases are commonly employed for. Advantages include improved performance, easy replication, horizontal scaling and redundancy.

By nature, key-value stores offer one way of retrieving data: by a primary key which uniquely identifies each entry. But what about queries that require more elaborate input in order to collect relevant entries? Full text search engines like Sphinx and Lucene do exactly this. When used in conjunction with a database, they query their indexes and return a collection of ids, which are then used to retrieve the results from the database. Full text search engines support indexing data sources other than RDBMSs, so there’s no reason why one couldn’t index a distributed key-value store.

[Figure: distributed key-value store index]
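The two-step lookup described above can be sketched independently of any particular engine. In the following illustration, a plain Hash stands in for the key-value store and a small word-to-ids Hash stands in for the search index; STORE, INDEX and search are hypothetical names, not part of Sphinx or MemcacheDB.

```ruby
# Stand-in for the key-value store: entries can only be fetched by primary key
STORE = {
  '1' => 'red electric guitar',
  '2' => 'acoustic guitar',
  '3' => 'drum kit'
}

# Stand-in for the search engine: maps each word to the ids of entries containing it
INDEX = Hash.new { |hash, word| hash[word] = [] }
STORE.each do |id, text|
  text.split.uniq.each { |word| INDEX[word] << id }
end

# Step 1: ask the index for matching ids. Step 2: fetch those ids from the store.
def search(word)
  ids = INDEX.fetch(word, [])
  ids.map { |id| [id, STORE[id]] }.to_h
end

p search('guitar')
p search('piano')
```

The store never needs to understand the query; it only ever sees primary keys, which is the property that lets a key-value store sit behind a full text index.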

Here, we’ll look at how we can integrate Sphinx with MemcacheDB, a distributed key-value store which conforms to the memcached protocol and uses Berkeley DB as its storage back-end.

Sphinx comes with an xmlpipe2 datasource, a generic XML interface aimed at simplifying custom integration. What this means is that our application can transform content from MemcacheDB into this format and feed it to Sphinx for indexing. The source block in the following Sphinx configuration instructs Sphinx to use the xmlpipe2 source type and to invoke the ruby /app/lib/sphinxpipe.rb script in order to retrieve the data to index.

# sphinx.conf

source products_src
{
  type = xmlpipe2
  xmlpipe_command = ruby /app/lib/sphinxpipe.rb
}

index products
{
  source = products_src
  path = /app/sphinx/data/products
  docinfo = extern
  mlock = 0
  morphology = stem_en
  min_word_len = 1
  charset_type = utf-8
  enable_star = 1
  html_strip = 0
}

indexer
{
  mem_limit = 256M
}

searchd
{
  port = 3312
  log = /app/sphinx/log/searchd.log
  query_log = /app/sphinx/log/query.log
  read_timeout = 5
  max_children = 30
  pid_file = /app/sphinx/searchd.pid
  max_matches = 10000
  seamless_rotate = 1
  preopen_indexes = 0
  unlink_old = 1
}
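For orientation, the stream that the xmlpipe_command script is expected to print has roughly the shape produced by the sketch below: a sphinx:docset root, a sphinx:schema declaring the full text fields, and one sphinx:document per entry. The helper names xml_escape and xmlpipe2_stream are hypothetical, and the sketch builds the XML with plain strings rather than an XML library to stay dependency-free.

```ruby
# Escape the characters that are special in XML text and attribute values
def xml_escape(text)
  text.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;').gsub('"', '&quot;')
end

# Build an xmlpipe2 stream with a single full text field named "title".
# `products` is a hypothetical list of [id, title] pairs.
def xmlpipe2_stream(products)
  docs = products.map do |id, title|
    %(<sphinx:document id="#{xml_escape(id)}"><title>#{xml_escape(title)}</title></sphinx:document>)
  end
  [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<sphinx:docset>',
    '<sphinx:schema><sphinx:field name="title"/></sphinx:schema>',
    *docs,
    '</sphinx:docset>'
  ].join("\n")
end

puts xmlpipe2_stream([[1, 'product 1'], [2, 'guitars & amps']])
```

Sphinx expects document ids to be unique, non-zero integers, which is why the products later in this post use numeric strings as keys.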

Following is a Product class. Each product instance can present itself as xmlpipe2 data. The class itself exposes the entire product catalog as an xmlpipe2 data source. It also has a search method used for querying Sphinx and retrieving matched products from MemcacheDB. Finally, there’s a bootstrap method for populating the store with some example data.

# product.rb

require "rubygems"
require "xml/libxml"
require "memcached"
require "riddle"

class Product
  attr_reader :id
  MEM = Memcached.new('localhost:21201')

  def initialize(id, title)
    @id, @title = id, title
  end

  def to_sphinx_doc
    sphinx_document = XML::Node.new('sphinx:document')
    sphinx_document['id'] = @id
    sphinx_document << title = XML::Node.new('title')
    title << @title
    sphinx_document
  end

  # Query sphinx and load products with matched ids from MemcacheDB
  def self.search(query)
    client = Riddle::Client.new
    client.match_mode = :any
    client.max_matches = 10_000
    results = client.query(query, 'products')
    ids = results[:matches].map {|m| m[:doc].to_s}
    MEM.get(ids) if ids.any?
  end

  # Load all products from MemcacheDB and convert them to xmlpipe2 data
  def self.sphinx_datasource
    docset = XML::Document.new.root = XML::Node.new("sphinx:docset")
    docset << sphinx_schema = XML::Node.new("sphinx:schema")
    sphinx_schema << sphinx_field = XML::Node.new('sphinx:field')
    sphinx_field['name'] = 'title'

    keys = MEM.get('product_keys')
    products = MEM.get(keys)
    products.each { |id, product| docset << product.to_sphinx_doc }

    %(<?xml version="1.0" encoding="utf-8"?>\n#{docset})
  end

  # Create some products and store them in MemcacheDB
  def self.bootstrap
    product_ids = ('1'..'5').to_a.inject([]) do |ids, id|
      product = Product.new(id, "product #{id}")
      MEM.set(product.id, product)
      ids << id
    end
    MEM.set('product_keys', product_ids)
  end
end

The sphinxpipe.rb script looks like this.

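# sphinxpipe.rb
# Assumes product.rb sits in the same directory as this script
require File.join(File.dirname(__FILE__), 'product')

Product.bootstrap
puts Product.sphinx_datasource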

With MemcacheDB (or even memcached for the purpose of this example) running, we can tell Sphinx to create an index of products by invoking indexer --all -c sphinx.conf and then start the search daemon with searchd -c sphinx.conf. Now we’re ready to start querying the index and retrieving results from the distributed store.

puts Product.search('product 1').inspect

It is not uncommon for the database to become a performance hotspot. The integration of a fast, distributed key-value store with an efficient search engine can be an interesting substitute for high throughput data retrieval operations.

State separation

Sunday, February 1st, 2009

It is usual for web applications to serve content specific to a user’s session. This makes web caching harder to implement, as we don’t want content that is meant to be viewed by a particular user being cached and accidentally offered to others. Some HTTP accelerators, like Varnish, by default do not cache responses that contain cookies at all. However, not all content is tied to a user’s session, and if that content doesn’t change in real time, it makes sense to cache the parts that are common to all users in order to improve efficiency. With this in mind, one logical split could be made between parts of the system that are globally cache friendly and ones that aren’t.

Consider online retailer websites which usually operate in two modes, one for visitors and one for logged in users. Logged in users are presented with a customized, session specific experience, yet data like the product catalog is essentially the same regardless of whether one is logged in or not and it makes sense for everyone to be accessing the same cached copy of a common resource.
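As a rough illustration of that split at the HTTP level, the following sketch picks a Cache-Control value depending on whether a response is session specific. The cache_control_for helper and the one hour default are assumptions made for the example, not a recommendation for any particular site.

```ruby
# Hypothetical helper: choose a Cache-Control value for a response.
# Session-specific content must not be stored by shared caches;
# common content may be cached by anyone for max_age seconds.
def cache_control_for(session_specific, max_age = 3600)
  session_specific ? 'private, no-store' : "public, max-age=#{max_age}"
end

puts cache_control_for(false)  # e.g. the product catalog
puts cache_control_for(true)   # e.g. the shopping basket page
```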

A possible solution involves creating two separate web applications, one entirely dedicated to stateless interactions and one meant for pages that are rendered as part of a user’s session. This might seem like overkill, but it clearly enforces the divide between what can and what can’t be cached. It also promotes reuse of the system’s web caching layer, which now serves content to site “visitors” as well as to the stateful components. The stateful application can delegate requests for potentially cached content to its stateless counterpart via the caching layer and decorate the responses with session specific data.

[Figure: split by state]
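The delegate-and-decorate step can be sketched as follows. StatefulPage is a hypothetical class, and the lambda stands in for an HTTP request to the stateless application made through the caching layer; the markup is illustrative only.

```ruby
# Hypothetical stateful front end: fetches shared content through the
# caching layer and adds session-specific data on top of it.
class StatefulPage
  # `fetch_shared` is any callable that returns shared HTML for a path,
  # for example a wrapper around an HTTP GET routed via the caching proxy.
  def initialize(fetch_shared)
    @fetch_shared = fetch_shared
  end

  def render(path, session)
    shared = @fetch_shared.call(path)
    greeting = "<p>Welcome back, #{session[:user_name]}</p>"
    greeting + "\n" + shared
  end
end

# Stand-in for the cached, stateless application
catalog = lambda { |path| "<ul><li>guitar</li></ul><!-- #{path} -->" }

page = StatefulPage.new(catalog)
puts page.render('/catalog', { :user_name => 'alice' })
```

Because the shared fragment is fetched without any session data, the same cached copy can serve both anonymous visitors and the stateful application. A real implementation would also need to escape the user name before writing it into the page.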

Web caching is but one way to cache data that remains static for predefined periods of time. Apart from harnessing proven existing tools, this form of caching has the advantage that its policies are universally understood, so it can significantly improve a website’s efficiency in ways beyond the maintainer’s control. Retrofitting web caching into an application that hasn’t been designed with it in mind can be difficult, so it is worth logically separating cacheable and non-cacheable resources early on.