Amazon S3 persistent Ruby objects
I have occasionally participated in conversations about the database as a product with an expiry date, destined to eventually be replaced by highly distributed data storage models. Given the current state of the technology, this sounds much like a science fiction scenario, but services like AWS S3 bring the idea closer to science and further from fiction.
Although S3's data storage and retrieval model presently looks better suited to larger units of data (e.g. media content), it would be interesting to investigate how it could be applied as an object persistence service.
In the following example, we will use Ruby's AWS::S3 library to create a class resembling Ruby on Rails' ActiveRecord::Base, allowing Objects to be persisted to and retrieved from an S3 Bucket.
Objects need to be serialized and de-serialized in some way in order to be successfully stored in and retrieved from S3. YAML is one of the standard means of object serialization in Ruby, so we will be making use of it.
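As a quick illustration of the round trip (the `Song` class here is hypothetical and not part of the example that follows), an object can be dumped to a YAML string and reconstructed from it:

```ruby
require 'yaml'

# A hypothetical class, used only to illustrate YAML serialization.
class Song
  attr_accessor :id, :title
end

song = Song.new
song.id = 1
song.title = "Smoke on the Water"

data = song.to_yaml
# Psych 4 (Ruby 3.1+) restricts YAML.load to safe types, so arbitrary
# objects need unsafe_load there; older Rubies only have YAML.load.
copy = YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(data) : YAML.load(data)
copy.title   # => "Smoke on the Water"
```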
```ruby
require 'yaml'
require 'aws/s3'

class S3Record
  attr_accessor :id

  def initialize(attrs = {})
    attrs.each { |k, v| instance_eval "self.#{k} = v" }
  end
end
```
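A short, S3-free illustration of the constructor (the `Track` subclass is hypothetical, and the AWS require is left out so the snippet runs on its own): every `attr_accessor` declared on a subclass becomes assignable through the attributes Hash passed to `new`.

```ruby
# Minimal stand-in for the S3Record constructor above, without the
# AWS dependency, so the snippet runs on its own.
class S3Record
  attr_accessor :id

  def initialize(attrs = {})
    attrs.each { |k, v| instance_eval "self.#{k} = v" }
  end
end

# A hypothetical subclass: its accessors are mass-assigned by new.
class Track < S3Record
  attr_accessor :title
end

track = Track.new(:id => 42, :title => "Smoke on the Water")
track.id     # => 42
track.title  # => "Smoke on the Water"
```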
Requiring YAML provides `S3Record` with, among other functionality, a `to_yaml` instance method.
Next, we add the ability to persist an instance of `S3Record` to S3.
```ruby
def create
  AWS::S3::S3Object.find(@id, self.class.name)
  raise "Object with key [#{@id}] already exists"
rescue AWS::S3::NoSuchKey
  AWS::S3::S3Object.store(@id, self.to_yaml, self.class.name)
end
```
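The find-raise-rescue-store control flow can be sketched against a plain Hash, with `KeyError` standing in for `AWS::S3::NoSuchKey` (the `STORE` constant and bare `create` method are hypothetical, for illustration only):

```ruby
STORE = {}

def create(key, data)
  STORE.fetch(key)                                 # raises KeyError when the key is absent
  raise "Object with key [#{key}] already exists"  # only reached if fetch succeeded
rescue KeyError
  STORE[key] = data                                # store only new keys
end

create("song-1", "payload")    # stores the value
# create("song-1", "payload")  # a second call would raise RuntimeError
```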
The first parameter to the `AWS::S3::S3Object#find` method is the unique identifier by which the Object is keyed when stored, and the one later used to find it. The second parameter is the name of the Bucket in which the object is stored. Here, we use the name of our class as the bucket name. This implies that a bucket with a name matching that of our class must exist before we can start storing objects.
The AWS API raises a `NoSuchKey` error when the specified key does not exist in the specified bucket. We make use of this to ensure that we never overwrite an existing object. Also, note the call to `self.to_yaml`: this is the actual data of the Object as it is stored in S3.
Next, we provide the ability to retrieve objects.
```ruby
def self.find(id)
  YAML.load(AWS::S3::S3Object.find(id, self.name).value)
end

def self.find_all(options = {})
  bucket = AWS::S3::Bucket.find(self.name, options)
  bucket.objects.map { |s3_obj| YAML.load(s3_obj.value) }
end
```
We retrieve one object by its identifier and the name of its bucket (`AWS::S3::S3Object.find(id, self.name)`) and return it in its de-serialized form. The same applies to finding many objects in one Bucket. The `options` Hash accepts the following parameters: `:max_keys` - the maximum number of keys to retrieve, `:prefix` - restrict the response to results that begin with the specified prefix, and `:marker` - restrict the response to results that occur alphabetically after this value (see find (AWS::S3::Bucket)).
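S3 applies these filters server-side, but their semantics can be sketched over a plain sorted list of keys (the `keys` array here is hypothetical; S3 returns keys in alphabetical order):

```ruby
keys = %w[blues country pop rock-hard rock-soft]

# These Array operations mirror what each option restricts.
with_prefix  = keys.select { |k| k.start_with?("rock") }  # :prefix => "rock"
after_marker = keys.select { |k| k > "country" }          # :marker => "country"
first_three  = keys.take(3)                               # :max_keys => 3
```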
Methods to update, delete and count objects should be self-explanatory.
```ruby
def update
  AWS::S3::S3Object.store(@id, self.to_yaml, self.class.name)
end

def self.delete(id)
  AWS::S3::S3Object.delete(id, self.name)
end

def self.count
  AWS::S3::Bucket.find(self.name).objects.size
end
```
In action, we could operate on objects we would like to persist on S3 in a way similar to the following.
```ruby
class Genre < S3Record
  attr_accessor :name
end

rock = Genre.new(:id => 1, :name => "rock")
rock.create

rock = Genre.find(1)
rock.name = "heavy rock"
rock.update
# etc...
```
What about transactions? Indexing? More elaborate querying? All the things databases are well established for? What about bandwidth?
There are probably no definitive answers to any of these questions, although one could suggest that transaction management is not that hard to implement, indexing can happen - often more efficiently - outside the database (see Lucene, Ferret), and bandwidth will not be an issue forever.
A reason the above example falls short of being realistic is the present S3 billing model ($0.01 per 1,000 PUT or LIST requests, $0.01 per 10,000 GET and all other requests). It does not seem financially viable for an application that needs to store and retrieve vast numbers of small resources at great frequency.
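To put the rates quoted above into numbers, a hypothetical workload of one million PUTs and ten million GETs per month works out as follows:

```ruby
# $0.01 per 1,000 PUT/LIST requests, $0.01 per 10,000 GET requests.
put_cost = (1_000_000 / 1_000.0)   * 0.01  # 1,000 billing units of $0.01 = $10
get_cost = (10_000_000 / 10_000.0) * 0.01  # 1,000 billing units of $0.01 = $10
total    = put_cost + get_cost             # $20 for the month
```

Negligible for a handful of media files, but when every read or write of a small object becomes a request, these numbers multiply quickly.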
Data transfer charges, on the other hand, do not apply when the application is hosted on Amazon's EC2 (Elastic Compute Cloud), as data transferred between Amazon S3 and Amazon EC2 is free of charge.