Feb 01 2009

Distributed key-value store indexing

Distributed key-value stores offer an interesting alternative for some of the tasks relational databases are commonly employed for. Advantages include improved performance, easy replication, horizontal scaling and redundancy.

By nature, key-value stores offer one way of retrieving data: by a primary key that uniquely identifies each entry. But what about queries that need more elaborate input to collect the relevant entries? Full-text search engines like Sphinx and Lucene do exactly this. When used in conjunction with a database, they query their own indexes and return a collection of ids, which are then used to retrieve the results from the database. Full-text search engines can also index data sources other than RDBMSs, so there's no reason why one couldn't index a distributed key-value store.
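
The two-step lookup described above can be sketched without any external services. The following is a minimal illustration, assuming a plain Hash stands in for the key-value store and a toy inverted index stands in for the search engine. The names `STORE`, `build_index` and `search` are hypothetical and not part of Sphinx or MemcacheDB.

```ruby
# Toy stand-in for a key-value store: primary key => record.
STORE = {
  "1" => "red cotton shirt",
  "2" => "blue cotton trousers",
  "3" => "red wool scarf"
}

# Build an inverted index: each word maps to the ids of entries containing it.
def build_index(store)
  index = Hash.new { |hash, word| hash[word] = [] }
  store.each do |id, text|
    text.downcase.split.uniq.each { |word| index[word] << id }
  end
  index
end

# Step 1: ask the index for the ids that match every query word.
# Step 2: fetch those ids from the store by primary key.
def search(index, store, query)
  words = query.downcase.split
  return {} if words.empty?
  ids = words.map { |word| index.fetch(word, []) }.reduce(:&)
  ids.each_with_object({}) { |id, found| found[id] = store[id] }
end

INDEX = build_index(STORE)
p search(INDEX, STORE, "red")
```

A real search engine adds ranking, stemming and much more, but the division of labour is the same: the index answers "which ids?", the store answers "what is stored under this id?".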

Here, we'll look at how we can integrate Sphinx with MemcacheDB, a distributed key-value store which conforms to the memcached protocol and uses Berkeley DB as its storage back-end.

Sphinx comes with an xmlpipe2 data source, a generic XML interface aimed at simplifying custom integration. This means our application can transform content from MemcacheDB into this format and feed it to Sphinx for indexing. The source block at the top of the following Sphinx configuration instructs Sphinx to use the xmlpipe2 source type and to invoke the ruby /app/lib/sphinxpipe.rb script in order to retrieve the data to index.

# sphinx.conf
source products_src
{
  type = xmlpipe2
  xmlpipe_command = ruby /app/lib/sphinxpipe.rb
}

index products
{
  source = products_src
  path = /app/sphinx/data/products
  docinfo = extern
  mlock = 0
  morphology = stem_en
  min_word_len = 1
  charset_type = utf-8
  enable_star = 1
  html_strip = 0
}

indexer
{
  mem_limit = 256M
}

searchd
{
  port = 3312
  log = /app/sphinx/log/searchd.log
  query_log = /app/sphinx/log/query.log
  read_timeout = 5
  max_children = 30
  pid_file = /app/sphinx/searchd.pid
  max_matches = 10000
  seamless_rotate = 1
  preopen_indexes = 0
  unlink_old = 1
}
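
For orientation, the script named in xmlpipe_command is expected to print an xmlpipe2 document on standard output: a sphinx:docset root holding a sphinx:schema that declares the fields, followed by one sphinx:document per entry. The sketch below builds such a document with plain string handling rather than libxml. The `xmlpipe2_docset` helper, the `XML_ESCAPES` table and the sample titles are assumptions made for illustration only.

```ruby
# Characters that must be escaped inside XML text content.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }

# Build an xmlpipe2 document for a set of id => title pairs.
# Hypothetical helper; it declares a single full-text field named "title".
def xmlpipe2_docset(titles_by_id)
  documents = titles_by_id.map do |id, title|
    # Sphinx document ids are integers, so reject anything else early.
    %(<sphinx:document id="#{Integer(id)}">\n) +
      %(<title>#{title.gsub(/[&<>]/, XML_ESCAPES)}</title>\n) +
      %(</sphinx:document>)
  end

  [
    %(<?xml version="1.0" encoding="utf-8"?>),
    %(<sphinx:docset>),
    %(<sphinx:schema>),
    %(<sphinx:field name="title"/>),
    %(</sphinx:schema>),
    *documents,
    %(</sphinx:docset>)
  ].join("\n")
end

puts xmlpipe2_docset({ "1" => "product 1", "2" => "nuts & bolts" })
```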
  

Following is a Product class. Each product instance can present itself as xmlpipe2 data, and the class itself can render the entire product catalog as an xmlpipe2 data source. It also has a search method for querying Sphinx and retrieving the matched products from MemcacheDB. Finally, there's a bootstrap method for populating the store with some example data.

# product.rb
require "rubygems"
require "xml/libxml"
require "memcached"
require "riddle"

class Product
  attr_reader :id
  MEM = Memcached.new('localhost:21201')
  def initialize(id, title)
    @id, @title = id, title
  end

  def to_sphinx_doc
    sphinx_document = XML::Node.new('sphinx:document')
    sphinx_document['id'] = @id
    sphinx_document << title = XML::Node.new('title')
    title << @title
    sphinx_document
  end

  # Query sphinx and load products with matched ids from MemcacheDB
  def self.search(query)
    client = Riddle::Client.new
    client.match_mode = :any
    client.max_matches = 10_000
    results = client.query(query, 'products')
    ids = results[:matches].map {|m| m[:doc].to_s}
    MEM.get(ids) if ids.any?
  end

  # Load all products from MemcacheDB and convert them to xmlpipe2 data
  def self.sphinx_datasource
    docset = XML::Document.new.root = XML::Node.new("sphinx:docset")
    docset << sphinx_schema = XML::Node.new("sphinx:schema")
    sphinx_schema << sphinx_field = XML::Node.new('sphinx:field')
    sphinx_field['name'] = 'title'
    keys = MEM.get('product_keys')
    products = MEM.get(keys)
    products.each { |id, product| docset << product.to_sphinx_doc }
    %(<?xml version="1.0" encoding="utf-8"?>\n#{docset})
  end
  
  # Create some products and store them in MemcacheDB
  def self.bootstrap
    product_ids = ('1'..'5').to_a.inject([]) do |ids, id|
      product = Product.new(id, "product #{id}")
      MEM.set(product.id, product)
      ids << id
    end
    MEM.set('product_keys', product_ids)
  end
end
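
The standard memcached protocol has no command for listing keys, which is why sphinx_datasource relies on the list saved under product_keys to find every product. A minimal sketch of that bookkeeping follows, assuming a Hash-backed `FakeStore` class (hypothetical, standing in for the memcached client) whose `get` accepts either one key or an array of keys. The `add_product` and `all_products` helpers are likewise illustrative.

```ruby
# Hash-backed stand-in for a memcached-style client (hypothetical).
class FakeStore
  def initialize
    @data = {}
  end

  def set(key, value)
    @data[key] = value
  end

  # A single key returns its value; an array of keys returns a hash of hits.
  def get(keys)
    return @data[keys] unless keys.is_a?(Array)
    keys.each_with_object({}) do |key, found|
      found[key] = @data[key] if @data.key?(key)
    end
  end
end

# Store a record and add its key to the list kept under a well-known key.
def add_product(store, id, title)
  store.set(id, title)
  keys = store.get("product_keys") || []
  store.set("product_keys", keys | [id])
end

# Enumerate every record by reading the key list, then fetching all keys at once.
def all_products(store)
  store.get(store.get("product_keys") || [])
end

store = FakeStore.new
add_product(store, "1", "product 1")
add_product(store, "2", "product 2")
p all_products(store)
```

Note that the read-modify-write on product_keys is not atomic, so concurrent writers would need some extra coordination.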
  

The sphinxpipe.rb script looks like this.

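# sphinxpipe.rb
# Load the Product class (assumes product.rb sits in the same directory)
require File.expand_path("product", File.dirname(__FILE__))
Product.bootstrap
puts Product.sphinx_datasource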
  

With MemcacheDB (or even memcached for the purpose of this example) running, we can tell Sphinx to create an index of products by invoking indexer --all -c sphinx.conf, and then start the search daemon with searchd -c sphinx.conf. Now we're ready to start querying the index and retrieving results from the distributed store.

puts Product.search('product 1').inspect
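
A multi-key get returns a hash keyed by id, which does not necessarily follow the order in which Sphinx ranked the matches. If ranking matters, the results can be reordered by walking the matched ids. The sketch below assumes `ranked_ids` came from the search engine and `found` from the store; both names, and the `in_rank_order` helper, are illustrative.

```ruby
# Return stored records in the order the search engine ranked their ids,
# skipping ids that are no longer present in the store.
def in_rank_order(ranked_ids, found)
  ranked_ids.select { |id| found.key?(id) }.map { |id| found[id] }
end

ranked_ids = ["3", "1", "2"]                        # best match first
found = { "1" => "product 1", "3" => "product 3" }  # "2" is missing from the store
p in_rank_order(ranked_ids, found)
```

Skipping missing ids also covers the case where an entry was removed from the store after the index was last rebuilt.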
  

It is not uncommon for the database to become a performance hotspot. Integrating a fast, distributed key-value store with an efficient search engine can be an interesting alternative for high-throughput data retrieval.