Jul 07 2007

Amazon S3 persistent Ruby objects

I have occasionally participated in conversations around the subject of the database as a product with an expiry date, destined to eventually be replaced by highly distributed data storage models. Given the current state of technology, this sounds much like a science fiction scenario, but services like AWS S3 bring the idea closer to science and further from fiction.

Although S3's data storage and retrieval model presently looks better suited to larger units of data (e.g. media content), it is interesting to investigate how it could be applied as an object persistence service.

In the following example, we will use the AWS::S3 Ruby library to create a class resembling Ruby on Rails' ActiveRecord::Base, allowing objects to be persisted to and retrieved from an S3 bucket.

Objects need to be serialized and de-serialized in order to be successfully stored in and retrieved from S3. YAML is one of the standard means of object serialization in Ruby, so we will be making use of it.

require 'yaml'
require 'aws/s3'

class S3Record
  attr_accessor :id

  def initialize(attrs = {})
    # Call the matching attribute writer for each key,
    # e.g. :name => "rock" becomes self.name = "rock".
    # Subclasses declare the accessors.
    attrs.each { |k, v| send("#{k}=", v) }
  end
end

Requiring YAML provides S3Record with, among other functionality, a to_yaml instance method.
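
To illustrate the round trip (a quick sketch):

record = S3Record.new(:id => 1)

yaml = record.to_yaml      # a YAML string tagged !ruby/object:S3Record
restored = YAML.load(yaml) # an S3Record instance with id == 1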

Next, we add the ability to persist an instance of S3Record to S3.

def create
  # S3Object.find raises AWS::S3::NoSuchKey when the key is absent,
  # so reaching the raise below means the key is already taken.
  AWS::S3::S3Object.find(@id, self.class.name)
  raise "Object with key [#{@id}] already exists"
rescue AWS::S3::NoSuchKey
  AWS::S3::S3Object.store(@id, self.to_yaml, self.class.name)
end

The first parameter to the AWS::S3::S3Object#find method is the unique identifier under which the object will be keyed when stored, and the one later used to find it. The second parameter is the name of the bucket in which the object will be stored. Here, we use the name of our class as the bucket name. This implies that a bucket with a name matching that of our class must exist before we can start storing objects.
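
Establishing a connection and creating such a bucket with the AWS::S3 library could look like the following sketch (the credentials are placeholders; Genre is the example class defined further down):

require 'aws/s3'

# Placeholder credentials; substitute your own AWS keys.
AWS::S3::Base.establish_connection!(
  :access_key_id     => 'YOUR_ACCESS_KEY_ID',
  :secret_access_key => 'YOUR_SECRET_ACCESS_KEY'
)

# Create the bucket named after the class.
AWS::S3::Bucket.create('Genre')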

The AWS API will raise a NoSuchKey error when the specified key does not exist in the specified bucket. We make use of this to ensure that we will not overwrite any existing objects. Also, note the call to self.to_yaml: the YAML string it returns is the actual data stored in S3.
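
For instance, attempting to create the same object twice (again using the Genre class from the example further down) would behave as follows:

rock = Genre.new(:id => 1, :name => "rock")
rock.create   # serialized to YAML and stored under key 1
rock.create   # raises "Object with key [1] already exists"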

Next, we provide the ability to retrieve objects.

def self.find(id)
  YAML.load(AWS::S3::S3Object.find(id, self.name).value)
end

def self.find_all(options = {})
  bucket = AWS::S3::Bucket.find(self.name, options)
  bucket.objects.map { |s3_obj| YAML.load(s3_obj.value) }
end

We retrieve a single object by its identifier and the name of its bucket (AWS::S3::S3Object.find(id, self.name)) and return it in its de-serialized form. The same applies to finding all objects in one bucket. The options hash accepts the following parameters (see AWS::S3::Bucket.find):

- :max_keys - the maximum number of keys to retrieve
- :prefix - restrict the response to keys that begin with the specified prefix
- :marker - restrict the response to keys that occur alphabetically after this value
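
For example, a paginated retrieval (the values are illustrative) might look like this:

# Fetch up to 10 objects whose keys sort alphabetically after "100".
genres = Genre.find_all(:max_keys => 10, :marker => "100")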

The methods to update, delete and count objects should be self-explanatory.

def update
  AWS::S3::S3Object.store(@id, self.to_yaml, self.class.name)
end

def self.delete(id)
  AWS::S3::S3Object.delete(id, self.name)
end

def self.count
  AWS::S3::Bucket.find(self.name).objects.size
end
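
For example:

Genre.delete(1)   # removes the object stored under key 1
Genre.count       # => the number of objects in the Genre bucket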

In action, we could operate on objects persisted to S3 in a way similar to the following.

class Genre < S3Record
  attr_accessor :name
end

rock = Genre.new(:id => 1, :name => "rock")
rock.create

rock = Genre.find(1)
rock.name = "heavy rock"
rock.update

# etc...

What about transactions? Indexing? More elaborate querying? All the things databases are so well established for? And bandwidth issues?

There are probably no definitive answers to any of these questions, although one could argue that transaction management is not that hard to implement, that indexing can happen (often more efficiently) outside the database (see Lucene, Ferret), and that bandwidth will not be an issue forever.
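
As a rough sketch of such external indexing, a Ferret index could sit alongside the S3 bucket (the fields and path below are hypothetical):

require 'ferret'

# Hypothetical Ferret index kept next to the Genre bucket.
index = Ferret::Index::Index.new(:path => '/tmp/genre_index')

# Index a record's searchable fields together with its S3 key.
index << { :id => '1', :name => 'heavy rock' }

# Query the index, then fetch the matching objects from S3 by key.
index.search_each('name:rock') do |doc_id, score|
  genre = Genre.find(index[doc_id][:id])
end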

One thing keeping the above example from being realistic is the present S3 billing model ($0.01 per 1,000 PUT or LIST requests, $0.01 per 10,000 GET and all other requests); storing one million small objects, for instance, would cost $10 in PUT requests alone. It does not seem financially sensible for an application that needs to store and retrieve vast numbers of small resources at great frequency.

The aforementioned costs are partly offset if the application is hosted on Amazon's EC2 (Elastic Compute Cloud), as data transferred between Amazon S3 and Amazon EC2 is free of charge.