Amazon S3 persistent Ruby objects
I have occasionally taken part in conversations about the database as a product with an expiry date, destined eventually to be replaced by highly distributed data storage models. Given the current state of technology, this sounds much like a science fiction scenario, but services like AWS S3 bring the idea closer to science and further from fiction.
Although S3’s data storage and retrieval model looks presently better suited for larger units of data (e.g. media content), it would be interesting to investigate how it could be applied as an Object persistence service.
In the following example, we will use Ruby’s AWS::S3 library to create a class resembling Ruby on Rails’ ActiveRecord::Base, allowing Objects to be persisted to and retrieved from an S3 Bucket.
Objects need to be serialized and de-serialized in order to be stored in and retrieved from S3. YAML is one of the standard means of object serialization in Ruby, so we will be making use of it.
require 'yaml'
require 'aws/s3'

class S3Record
  attr_accessor :id

  # Assign each attribute through its writer, e.g. :name => "rock" calls name=.
  def initialize(attrs = {})
    attrs.each { |k, v| send("#{k}=", v) }
  end
end
Requiring YAML provides S3Record with, among other functionality, a to_yaml instance method.
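The serialization step can be tried on its own, without touching S3. The following is a minimal sketch; the `Track` class and the `load_object` helper are hypothetical names used only for illustration. One assumption worth flagging: newer Ruby versions restrict `YAML.load` to basic types, so the sketch uses `YAML.unsafe_load` where it is available, which should only be done with trusted data.

```ruby
require 'yaml'

# Hypothetical class used only for illustration.
class Track
  attr_accessor :id, :title
end

# Hypothetical helper: newer Psych versions refuse to rebuild arbitrary
# classes from YAML.load, so prefer unsafe_load when it exists.
def load_object(yaml)
  YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(yaml) : YAML.load(yaml)
end

track = Track.new
track.id = 1
track.title = "Paranoid"

yaml = track.to_yaml      # the text that would be sent to S3
copy = load_object(yaml)  # what a later find would rebuild

puts yaml
puts copy.title
```

The rebuilt object is a new instance of the same class carrying the same instance variables, which is all the persistence layer relies on.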
Next, we add the ability to persist an instance of S3Record to S3.
# Store a new object; refuse to overwrite an existing key.
def create
  AWS::S3::S3Object.find(@id, self.class.name)
  raise "Object with key [#{@id}] already exists"
rescue AWS::S3::NoSuchKey
  AWS::S3::S3Object.store(@id, self.to_yaml, self.class.name)
end
The first parameter to the AWS::S3::S3Object.find method is the unique identifier by which the Object is keyed when stored, and it is the one used to find the object later. The second parameter is the name of the Bucket in which the object is stored. Here, we use the name of our class as the bucket name. This implies that a bucket with a name matching that of our class must exist before we can start storing objects.
The AWS API will raise a NoSuchKey error in the case where the specified key does not exist in the specified bucket. We make use of this in order to ensure that we will not be overwriting any existing objects. Also, note the call to self.to_yaml. This is the actual data of the Object as it is being stored in S3.
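The look-first, store-on-miss flow can be exercised without an S3 account. The sketch below is an illustration only: `MemoryBucket`, its own `NoSuchKey` error and `create_once` are hypothetical stand-ins, not part of the AWS::S3 library.

```ruby
# Hypothetical error and in-memory bucket; NOT part of the AWS::S3 library.
class NoSuchKey < StandardError; end

class MemoryBucket
  def initialize
    @objects = {}
  end

  # Mirrors the behaviour described above: a missing key raises.
  def find(key)
    @objects.fetch(key) { raise NoSuchKey, "no object with key #{key}" }
  end

  def store(key, data)
    @objects[key] = data
  end
end

# Same shape as the create method: look first, store only when the key is absent.
def create_once(bucket, key, data)
  bucket.find(key)
  raise ArgumentError, "Object with key [#{key}] already exists"
rescue NoSuchKey
  bucket.store(key, data)
end

bucket = MemoryBucket.new
create_once(bucket, "1", "rock")          # key is free, so the data is stored
begin
  create_once(bucket, "1", "heavy rock")  # key is taken, so this is rejected
rescue ArgumentError => e
  puts e.message
end
```

Note that the lookup and the store are two separate requests, so two clients creating the same key at the same moment could both pass the check. The sketch does not attempt to address that.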
Next, we provide the ability to retrieve objects.
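# Fetch one object by key and rebuild it from its YAML form.
def self.find(id)
  YAML.load(AWS::S3::S3Object.find(id, self.name).value)
end

# Fetch every object in the class's bucket, optionally narrowed by options.
def self.find_all(options = {})
  bucket = AWS::S3::Bucket.find(self.name, options)
  bucket.objects.map { |s3_obj| YAML.load(s3_obj.value) }
end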
We retrieve one object by its identifier and the name of its bucket (AWS::S3::S3Object.find(id, self.name)) and return it in its de-serialized form. The same applies to finding many objects from one Bucket. The options Hash accepts the following parameters: :max_keys - the maximum number of keys to retrieve, :prefix - restrict the response to contain results that begin with a specified prefix, and :marker - restrict the response to results that occur alphabetically after this value (see find (AWS::S3::Bucket)).
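To make the three listing options concrete, here is a small pure-Ruby sketch of how they narrow a list of keys. It illustrates the semantics described above and is not the library's implementation; `list_keys` and the sample keys are hypothetical.

```ruby
# Hypothetical helper illustrating the listing options; not the library's code.
def list_keys(keys, options = {})
  selected = keys.sort  # keys come back in alphabetical order
  selected = selected.select { |k| k.start_with?(options[:prefix]) } if options[:prefix]
  selected = selected.select { |k| k > options[:marker] } if options[:marker]
  selected = selected.first(options[:max_keys]) if options[:max_keys]
  selected
end

keys = %w[genre-1 genre-2 genre-3 artist-1]

# Only keys beginning with "genre-".
p list_keys(keys, :prefix => "genre-")

# One key at a time, continuing after the last key already seen.
p list_keys(keys, :prefix => "genre-", :marker => "genre-1", :max_keys => 1)
```

Passing the last key of one page as the :marker of the next request is the usual way to walk through a large bucket in fixed-size pages.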
Methods to update, delete and count should be self-explanatory.
# Overwrite the stored copy with the object's current state.
def update
  AWS::S3::S3Object.store(@id, self.to_yaml, self.class.name)
end

def self.delete(id)
  AWS::S3::S3Object.delete(id, self.name)
end

def self.count
  AWS::S3::Bucket.find(self.name).objects.size
end
In action, we could operate on objects we would like to persist on S3 in a way similar to the following.
class Genre < S3Record
  attr_accessor :name
end

rock = Genre.new(:id => 1, :name => "rock")
rock.create

rock = Genre.find(1)
rock.name = "heavy rock"
rock.update
# etc...
What about transactions? Indexing? More elaborate querying? All things databases are well established for? Bandwidth issues?
There are probably no definitive answers to any of these questions, although one could suggest that transaction management is not that hard to implement, that indexing can happen (often more efficiently) outside the database (see Lucene, Ferret) and that bandwidth will not be an issue forever.
One thing that keeps the above example from being realistic is the present S3 billing model ($0.01 per 1,000 PUT or LIST requests, $0.01 per 10,000 GET and all other requests). It does not look financially attractive for an application that needs to store and retrieve vast numbers of small resources with great frequency.
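To put the request prices in perspective, the arithmetic can be sketched as follows. The prices are the ones quoted above; the workload figures are hypothetical, and storage and data transfer charges are left out.

```ruby
# Prices quoted above, kept as exact rationals in dollars per request.
PUT_OR_LIST_PRICE = Rational(1, 100) / 1_000   # $0.01 per 1,000 requests
GET_PRICE         = Rational(1, 100) / 10_000  # $0.01 per 10,000 requests

# Request charges only; storage and data transfer are ignored.
def request_cost(puts_or_lists, gets)
  puts_or_lists * PUT_OR_LIST_PRICE + gets * GET_PRICE
end

# Hypothetical workload: one million writes and ten million reads a day.
daily = request_cost(1_000_000, 10_000_000)
puts format("$%.2f per day", daily.to_f)
```

Because every object costs one request to write and one to read regardless of its size, many small objects are charged as heavily per request as a few large ones.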
Part of the bill goes away if the application is hosted on Amazon’s EC2 (Elastic Compute Cloud), as data transferred between Amazon S3 and Amazon EC2 is free of charge; the per-request charges still apply.

July 7th, 2007 at 12:52 pm
Interesting approach.
(You don’t need to include YAML, btw. Everything gets a to_yaml method when you require ‘yaml’)
July 7th, 2007 at 2:21 pm
Thanks for the info. Amended the example accordingly.
July 8th, 2007 at 7:57 pm
You could probably control some of the cost and speed issues by implementing it like a pstore (that is, a single file) that syncs every now and then, so no constant HTTP traffic back and forth. Would hurt concurrency though!
July 8th, 2007 at 8:32 pm
On the issue of price, you could at least prototype using _why’s ParkPlace, an S3 clone written in Ruby. http://code.whytheluckystiff.net/parkplace
July 9th, 2007 at 4:04 am
One note to remember - if you are using S3 with EC2 to host your application there is no cost for data between the EC2 server instances and S3 service.
July 9th, 2007 at 12:25 pm
J, indeed. I’ve updated the article to reflect that.
July 12th, 2007 at 7:41 am
Are you using YAML so you can index the data external to your app, or…? I don’t really see a benefit over Marshal, especially with the overhead of instantiating the custom class every time.