<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/rss2full.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/itemcontent.css" type="text/css" media="screen"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>nutrun » Software</title>
	
	<link>http://nutrun.com</link>
	<description>nutrun</description>
	<pubDate>Sat, 08 Nov 2008 01:26:08 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.3</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/nutrun/feed" type="application/rss+xml" /><item>
		<title>Rack cache headers</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/446063961/</link>
		<comments>http://nutrun.com/weblog/rack-cache-headers/#comments</comments>
		<pubDate>Sat, 08 Nov 2008 01:17:38 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=204</guid>
		<description><![CDATA[Rack is an interface between web servers and Ruby web frameworks. The HTTP protocol, amongst other things, defines requirements on HTTP caches in terms of header fields that control cache behavior. The purpose of this article is to demonstrate a possible implementation of a piece of Rack Middleware which enables web application developers to configure [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://rack.rubyforge.org/" title="Rack: a Ruby Webserver Interface">Rack</a> is an interface between web servers and Ruby web frameworks. The <a href="http://www.w3.org/Protocols/" title="HTTP - Hypertext Transfer Protocol Overview">HTTP</a> protocol, amongst other things, defines requirements on <a href="http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p6-cache-04.txt" title="">HTTP caches</a> in terms of header fields that control cache behavior. This article demonstrates a possible implementation of a piece of Rack Middleware that lets web application developers configure an application&#8217;s cache-related response headers in an unobtrusive, centralized manner.</p>
<p>Rack supports the notion of Middleware, pieces of code that sit between the web server and the application and can inspect or modify requests and responses as they pass through. Rack::Lint, for example, validates an application&#8217;s requests and responses according to the Rack specification.</p>
<pre>
Rack::Handler::Mongrel.run(
  Rack::Lint.new(app), :Host => "0.0.0.0", :Port => 9999
)
</pre>
<p>Similarly, if we were to implement a cache header producing layer on top of Rack we&#8217;d end up with a construct similar to the following.</p>
<pre>
Rack::Handler::Mongrel.run(
  Rack::Lint.new(
    Rack::CacheHeaders.new(app)
  ), :Host => "0.0.0.0", :Port => 9999
)
</pre>
<p>Here&#8217;s a possible way of configuring how an application provides HTTP caching headers based on URL path patterns.</p>
<pre>
Rack::CacheHeaders.configure do |cache|
  cache.max_age("/rock", 3600)
  cache.expires("/metal", "16:00")
end
</pre>
<p>Following is a potential implementation for the above.</p>
<pre>
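require "time" # Time.parse and Time#httpdate

module Rack
  class CacheHeaders
    def initialize(app)
      @app = app
    end

    def call(env)
      result = @app.call(env)
      # Paths without a configured rule pass through untouched.
      rule = Configuration[env['PATH_INFO']]
      if rule
        header = rule.to_header
        result[1][header.key] = header.value
      end
      result
    end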

    def self.configure(&amp;block)
      yield Configuration
    end

    class Configuration
      def self.max_age(path, duration)
        paths[path] = MaxAge.new(duration)
      end

      def self.expires(path, date)
        paths[path] = Expires.new(date)
      end

      def self.[](key)
        paths[key]
      end

      def self.paths
        @paths ||= {}
      end
    end

    class MaxAge
      def initialize(duration)
        @duration = duration
      end

      def to_header
        Header.new("Cache-Control", "max-age=#{@duration}, must-revalidate")
      end
    end

    class Expires
      def initialize(date)
        @date = date
      end

      def to_header
        Header.new("Expires", Time.parse(@date).httpdate)
      end
    end

    class Header &lt; Struct.new(:key, :value);end
  end
end
</pre>
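<p>The implementation above looks paths up by exact match. If real URL path patterns are wanted, one option is to keep an ordered list of rules and return the first one that matches. The following is only a sketch under that assumption: <code>PatternRules</code> and its methods are hypothetical names and are not part of Rack.</p>

```ruby
# Hypothetical rule table: rules are checked in the order they were added.
class PatternRules
  def initialize
    @rules = []
  end

  # pattern may be a String (exact match) or a Regexp.
  def add(pattern, header_name, header_value)
    @rules.push([pattern, header_name, header_value])
    self
  end

  # Returns [header_name, header_value] for the first matching rule, or nil.
  def lookup(path)
    # String#=== tests equality, Regexp#=== tests for a match.
    rule = @rules.find { |candidate| candidate[0] === path }
    rule && rule[1, 2]
  end
end

rules = PatternRules.new
rules.add("/rock", "Cache-Control", "max-age=3600, must-revalidate")
rules.add(%r{\A/metal(/|\z)}, "Cache-Control", "max-age=60")

p rules.lookup("/rock")
p rules.lookup("/metal/thrash")
p rules.lookup("/jazz")
```

<p>Returning <code>nil</code> for unmatched paths lets the middleware leave those responses untouched.</p>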
<p>The code below is a minimal Rack based application.</p>
<pre>
require "rubygems"
require "rack"

# The Rack specification requires the body to respond to each, so wrap it in an Array.
app = proc {|env| [200, {"Content-Type" => "text/plain"}, ["hello"]]}

Rack::Handler::Mongrel.run(
  Rack::Lint.new(
    Rack::CacheHeaders.new(app)
  ), :Host => "0.0.0.0", :Port => 9999
)
</pre>
<p>To observe the caching related headers the application&#8217;s responses are decorated with we can use <code>curl</code> or something similar, e.g. <code>curl -I http://0.0.0.0:9999/rock</code> or <code>curl -I http://0.0.0.0:9999/metal</code>. The output should look something like the following.</p>
<pre>
air:~ gmalamid$ curl -I http://0.0.0.0:9999/rock
HTTP/1.1 200 OK
Connection: close
Date: Sat, 08 Nov 2008 00:51:23 GMT
Cache-Control: max-age=3600, must-revalidate
Content-Type: text/plain
Content-Length: 5

air:~ gmalamid$ curl -I http://0.0.0.0:9999/metal
HTTP/1.1 200 OK
Connection: close
Date: Sat, 08 Nov 2008 00:51:16 GMT
Content-Type: text/plain
Expires: Sat, 08 Nov 2008 16:00:00 GMT
Content-Length: 5
</pre>
<p>Understanding and employing HTTP cache configuration not only enables harnessing the power of tools like <a href="http://varnish.projects.linpro.no/" title="Varnish - Trac">Varnish</a> or <a href="http://www.squid-cache.org/" title="squid : Optimising Web Delivery">Squid</a>, it also makes our applications good citizens in a diverse ecosystem of HTTP aware browsers and caches outside an application&#8217;s knowledge or control.</p>
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/rack-cache-headers/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/rack-cache-headers/</feedburner:origLink></item>
		<item>
		<title>HTTP accelerator cache purging</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/440011560/</link>
		<comments>http://nutrun.com/weblog/http-accelerator-cache-purging/#comments</comments>
		<pubDate>Sun, 02 Nov 2008 14:47:37 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=198</guid>
		<description><![CDATA[The use of an HTTP accelerator such as Varnish or Squid in reverse proxy/accelerator mode can drastically improve a web application&#8217;s content delivery capabilities. Successfully implementing caching comes with numerous challenges but the fundamental goal is straightforward: A stack&#8217;s dynamic content generating layer should ideally not have to generate the same content more than once.

require [...]]]></description>
			<content:encoded><![CDATA[<p>The use of an HTTP accelerator such as <a href="http://varnish.projects.linpro.no/" title="Varnish - Trac">Varnish</a> or <a href="http://www.squid-cache.org/" title="squid : Optimising Web Delivery">Squid</a> in reverse proxy/accelerator mode can drastically improve a web application&#8217;s content delivery capabilities. Successfully implementing caching comes with numerous challenges but the fundamental goal is straightforward: A stack&#8217;s dynamic content generating layer should ideally not have to generate the same content more than once.</p>
<pre>
require "rubygems"
require "sinatra"

def guitars
  @@guitars ||= ['Les Paul', 'SG']
end

get "/guitars" do
  guitars * ', '
end
</pre>
<p>This application exposes a <code>/guitars</code> resource. With no caching in place, every request for it hits the application server. On a high traffic website this can prove suboptimal, especially if generating the content is system resource intensive. Luckily this problem has been solved before. A running instance of Varnish, for example, only requires the following configuration to enable caching of all resources the application serves.</p>
<pre>
backend default {
  .host = "127.0.0.1";
  .port = "4567";
}
</pre>
<p>One of the challenges associated with caching has to do with the cached content&#8217;s freshness. We want to relieve server stress as much as possible, but we also need our application&#8217;s consumers to receive correct data at all times. Let&#8217;s assume that the application contacts guitar manufacturers&#8217; websites once a day to refresh its inventory and we have scheduled this operation to complete at 16:00 every day. This suggests that the cached resource should be refreshed every day at four o&#8217;clock in the afternoon to reflect the latest list of available guitar models. One of the ways of achieving this in HTTP is by making use of the <code>Expires</code> header, whose semantics are understood by (hopefully) any caching aware HTTP component.</p>
<pre>
require "time"

get "/guitars" do
  headers "Expires" => Time.parse("16:00").httpdate
  guitars * ', '
end
</pre>
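<p>One thing to keep in mind is that <code>Time.parse("16:00")</code> resolves to 16:00 on the current day, so a request served after that time receives an <code>Expires</code> value in the past. A possible way around this is to compute the next occurrence of the scheduled time. The sketch below assumes the schedule is expressed in UTC; <code>next_occurrence</code> is a hypothetical helper, not part of Sinatra or the standard library.</p>

```ruby
require "time"

# Returns the next time the wall clock reads hhmm (UTC), strictly after now.
def next_occurrence(hhmm, now = Time.now.utc)
  hour, minute = hhmm.split(":").map { |part| part.to_i }
  today = Time.utc(now.year, now.month, now.day, hour, minute)
  # If today's slot has already passed, use the same slot tomorrow.
  today > now ? today : today + 24 * 60 * 60
end

puts next_occurrence("16:00", Time.utc(2008, 11, 8, 15, 0)).httpdate
puts next_occurrence("16:00", Time.utc(2008, 11, 8, 17, 0)).httpdate
```

<p>In the Sinatra handler this could be used as <code>headers "Expires" => next_occurrence("16:00").httpdate</code>.</p>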
<p>Things aren&#8217;t always as straightforward. In many cases we cannot fully control the exact time or frequency a resource&#8217;s content changes. The example application also comes with an admin interface, allowing the guitar list administrators to manually enter new guitar models.</p>
<pre>
post "/guitars" do
  guitars &lt;&lt; params["guitar"]
  redirect("/guitars")
end
</pre>
<p>It is clear that a means for arbitrary expiration of cached content needs to be available in order to maintain content freshness. With Varnish, this capability comes in two flavors, one of which involves the use of a <code>PURGE</code> HTTP call. The following configuration enables this functionality.</p>
<pre>
acl purge {
  "localhost";
}

sub vcl_recv {
  if (req.request == "PURGE") {
    if (!client.ip ~ purge) {
      error 405 "Not allowed.";
    }
    lookup;
  }
}

sub vcl_hit {
  if (req.request == "PURGE") {
    set obj.ttl = 0s;
    error 200 "Purged.";
  }
}

sub vcl_miss {
  if (req.request == "PURGE") {
    error 404 "Not in cache.";
  }
}
</pre>
<p>To natively make use of this in Ruby, we need to extend the <code>Net::HTTP</code> library to support the <code>PURGE</code> method.</p>
<pre>
require "net/http"
require "uri"

module Net
  class HTTP
    class Purge &lt; HTTPRequest
      METHOD = "PURGE"
      REQUEST_HAS_BODY = false
      RESPONSE_HAS_BODY = false
    end

    def purge(path, initheader=nil)
      request(Purge.new(path, initheader))
    end
  end
end

def purge_cache(u)
  uri = URI.parse(u)
  query = "?#{uri.query}" if uri.query
  Net::HTTP.new(uri.host, uri.port).start {|h| h.purge("#{uri.path}#{query}")}
end
</pre>
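<p>If reopening <code>Net::HTTP</code> feels too invasive, a similar request can probably be built with <code>Net::HTTPGenericRequest</code>, which accepts an arbitrary method name. This is a sketch of that alternative rather than the approach used in the rest of this article; <code>build_purge_request</code> and <code>purge_cache_generic</code> are hypothetical names.</p>

```ruby
require "net/http"
require "uri"

# Builds a PURGE request object without touching the network.
def build_purge_request(url)
  uri = URI.parse(url)
  path = uri.query ? "#{uri.path}?#{uri.query}" : uri.path
  # Arguments: method name, request has a body?, response has a body?, path.
  Net::HTTPGenericRequest.new("PURGE", false, false, path)
end

# Sends the PURGE request to the cache listening on the URL's host and port.
def purge_cache_generic(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.request(build_purge_request(url))
  end
end

request = build_purge_request("http://localhost/guitars?brand=gibson")
puts "#{request.method} #{request.path}"
```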
<p>Now we can expire the cached <code>/guitars</code> resource every time the list is amended.</p>
<pre>
post "/guitars" do
  guitars &lt;&lt; params["guitar"]
  purge_cache("http://localhost/guitars")
  redirect("/guitars")
end
</pre>
<p>Although this method is effective, there can be cases where the bidirectional coupling between the application and caching layers might be undesirable. With the fundamental functional pieces in place, however, it is not hard to implement a more elaborate strategy such as the one described in <a href="http://www.mnot.net/cache_channels/" title="HTTP Cache Channels">Cache Channels</a> in order to reduce the application layer&#8217;s knowledge of the caching infrastructure.</p>
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/http-accelerator-cache-purging/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/http-accelerator-cache-purging/</feedburner:origLink></item>
		<item>
		<title>Parallelize by process</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/432238376/</link>
		<comments>http://nutrun.com/weblog/parallelize-by-process/#comments</comments>
		<pubDate>Sun, 26 Oct 2008 02:57:12 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=189</guid>
		<description><![CDATA[Performing computations in parallel is a popular technique for improving application performance and can be achieved in a number of ways, most commonly by employing threads or by splitting workload in a number of concurrent processes.
Memory usage is often a headache with large dataset computations. While memory optimization is something to be sought after, tracking [...]]]></description>
			<content:encoded><![CDATA[<p>Performing computations in parallel is a popular technique for improving application performance and can be achieved in a number of ways, most commonly by employing threads or by splitting workload in a number of concurrent processes.</p>
<p>Memory usage is often a headache with large dataset computations. While memory optimization is something to be sought after, tracking down memory leaks can become tedious and time consuming. We can decrease the chances of a heavy job running a system&#8217;s memory dry by coming up with a strategy for fragmenting the job into a number of shorter running processes. By doing so, any memory used by a worker process will be released the moment the process completes. Additionally, we can run job fragments in parallel, allow ourselves to harness the operating system&#8217;s multi-core capabilities and potentially distribute worker processes over a number of physical hosts and scale out when the need arises. Smaller processes also dictate more manageable chunks of code which are easier to maintain, optimize and test.</p>
<p>Let&#8217;s look at an example where a job involves fetching a large number of categorized products from various sources and processing them for use by our own application.</p>
<pre>
class Job
  def perform
    ADDRESSES.each do |address|
      category = load_category(address)
      category.products.each { |product| process(product) }
    end
  end

  def process(product)
    #some intensive computation
  end

  def load_category(address)
    #load an addressable category dataset
  end
end
</pre>
<p>Let&#8217;s assume that the <code>ADDRESSES</code> constant in the example is a list consisting of entries such as <code>example.com/toys</code>, <code>example.com/phones</code>, <code>example.org/guitars</code>, etc. The job fetches the product dataset for each category address, iterates over the products and performs a long processing operation on each. Supposing that after every possible optimization the job takes three hours to complete, we can at best run the job eight times a day. What happens if the product categories are updated more often than eight times a day and our application can only be successful if it deals with fresh data all the time?</p>
<p>One natural split can involve creating a worker process for each address entry. We can do so by extracting the majority of the code from the <code>Job</code> class into a <code>Worker</code> class meant to run as a standalone process.</p>
<pre>
class Worker
  def self.process_category(address)
    category = load_category(address)
    category.products.each { |product| process(product) }
  end

  def self.process(product)
    #some intensive computation
  end

  def self.load_category(address)
    #load an addressable category dataset
  end
end

Worker.process_category(ARGV[0]) if ARGV.size == 1
</pre>
<p>Each worker will operate on a significantly smaller dataset and will complete much faster than the initial long running job. Any memory used by each worker will be immediately released the moment the process finishes execution.</p>
<p>After the latest change, <code>Job</code> can take on the role of orchestrating the worker processes. We start by allowing an arbitrary maximum number of concurrent workers, three in this case.</p>
<pre>
require "thread"

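class Job
  def initialize
    @worker_count, @mutex = 3, Mutex.new
  end

  def perform
    threads = ADDRESSES.map do |address|
      # Wait for a free worker slot before spawning another process.
      sleep 0.1 until @worker_count > 0
      @mutex.synchronize {@worker_count -= 1}
      Thread.new do
        system("ruby worker.rb #{address}")
        @mutex.synchronize {@worker_count += 1}
      end
    end
    # Block until every worker process has finished.
    threads.each {|thread| thread.join}
  end
end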
</pre>
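<p>An alternative to counting free slots is a fixed pool of threads pulling addresses from a queue, which avoids polling with <code>sleep</code>. The following is a sketch of that idea, not a drop-in replacement: <code>run_workers</code> and <code>DEFAULT_RUNNER</code> are hypothetical names, and the runner is passed in so that the example can be tried without a <code>worker.rb</code> on disk.</p>

```ruby
require "thread"

# Assumed default: shell out to worker.rb, as in the Job class above.
DEFAULT_RUNNER = lambda { |address| system("ruby", "worker.rb", address) }

# Processes every address using at most pool_size concurrent threads.
# Returns [address, runner result] pairs in completion order.
def run_workers(addresses, pool_size = 3, runner = DEFAULT_RUNNER)
  queue = Queue.new
  addresses.each { |address| queue.push(address) }
  results = Queue.new

  threads = Array.new(pool_size) do
    Thread.new do
      loop do
        begin
          address = queue.pop(true) # non-blocking, raises ThreadError when empty
        rescue ThreadError
          break
        end
        results.push([address, runner.call(address)])
      end
    end
  end

  threads.each { |thread| thread.join }
  Array.new(results.size) { results.pop }
end

# Stub runner standing in for the real worker process.
stub = lambda { |address| address.upcase }
p run_workers(%w[toys phones guitars], 2, stub).sort
```

<p>Changing <code>pool_size</code> is then the only knob needed when experimenting with the number of concurrent workers.</p>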
<p>At this point it is a good idea to run the job and monitor the time it takes for it to complete while also measuring system resource usage. This way we can determine the optimal number of concurrent worker processes based on the system&#8217;s specs. Once available resources have been exhausted and both <code>Job</code> and <code>Worker</code> have been sufficiently optimized, we can start thinking about running workers on separate physical nodes.</p>
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/parallelize-by-process/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/parallelize-by-process/</feedburner:origLink></item>
		<item>
		<title>Anarchic versus controlled scalability</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/411106097/</link>
		<comments>http://nutrun.com/weblog/anarchic-versus-controlled-scalability/#comments</comments>
		<pubDate>Sat, 04 Oct 2008 13:27:29 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=184</guid>
		<description><![CDATA[With the number of websites at the time of this writing in the region of one hundred and sixty million and more than a trillion webpages, the Web is the largest network infrastructure to date. Figures like this are nothing short of enviable and so the web&#8217;s architecture has been increasingly influencing software authors&#8217; design [...]]]></description>
			<content:encoded><![CDATA[<p>With the number of websites at the time of this writing in the region of one hundred and sixty million and more than a trillion webpages, the Web is the largest network infrastructure to date. Figures like these are nothing short of enviable and so the web&#8217;s architecture has been increasingly influencing software authors&#8217; design decisions, to the extent that emergent trends place this approach in habitats where it hasn&#8217;t traditionally been commonplace, such as that of &#8220;enterprise&#8221; middleware.</p>
<p>The Web&#8217;s possibly most notable triumph is offering its citizens the ability to exist and adapt in a context that is difficult to control or predict. The design has achieved its monumental scalability by following the set of constraints which compose the REST architectural style. Alongside other objectives, these constraints were put together in order for systems to effectively satisfy a need for anarchic scalability but - and this is something we must not forget - the benefits of these constraints come with associated trade-offs.</p>
<p>Architectural decisions should involve weighing the costs and benefits they introduce to the specific topic they attempt to address. There is no universal solution to every design problem and, while REST has proven successful in achieving anarchic scalability, not all systems exist in wild, disorderly environments. Introducing REST constraints in a system that doesn&#8217;t need to be as loosely controlled as the web can incur unnecessary overhead.</p>
<p>Section <a href="http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm#sec_5_1_3" title="Fielding Dissertation: CHAPTER 5: Representational State Transfer (REST)">5.1.3 Stateless</a> from Roy Fielding&#8217;s seminal <a href="http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm">Architectural Styles and the Design of Network-based Software Architectures</a> paper is a good example. Particular interest for this discussion lies in the second paragraph:</p>
<p><cite>Like most architectural choices, the stateless constraint reflects a design trade-off. The disadvantage is that it may decrease network performance by increasing the repetitive data (per-interaction overhead) sent in a series of requests, since that data cannot be left on the server in a shared context.</cite></p>
<p>Let&#8217;s consider an imaginary example, an auction service which publishes price updates and accepts bids on auctioned items. As a given - this is a private auction - 3000 consumers will interact with the service, each of those subscribing to price updates and placing bids whenever they see fit. These consumers must be authorized to interact with the service.</p>
<p>If we were to carry out the above over HTTP, a potential implementation would involve the service publishing an item&#8217;s current price as a feed, with the consumers subscribing to it and polling for updates. The service enforces a polling frequency of 10 seconds per consumer. For one item, this will result in 6 * 60 * 24 * 3000 = 25,920,000 requests/day. Consumers also need to be authorized to access the resource, so, respecting the statelessness constraint, 25,920,000 handshakes/day will take place. If we assume that an item receives 20,000 bids a day, the system becomes subject to 25,900,000 unnecessary requests and handshakes.</p>
<p>The 20,000 bids/day assumption suggests an average bid frequency of 86400/20000 = 4.32 seconds. The 10 second interval polling frequency is suboptimal when it comes to consumers being able to act on price updates in near real time.</p>
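<p>The figures above can be reproduced with a few lines of Ruby. The variable names are only illustrative; the inputs are the ones assumed in the example (3000 consumers, one poll every 10 seconds, 20,000 bids a day).</p>

```ruby
consumers        = 3000
poll_interval    = 10            # seconds between polls, per consumer
seconds_per_day  = 24 * 60 * 60  # 86400
bids_per_day     = 20_000

polls_per_consumer   = seconds_per_day / poll_interval    # 8640
polling_requests     = polls_per_consumer * consumers      # 25_920_000
# Requests that cannot carry a new price, since only bids change it.
unnecessary_requests = polling_requests - bids_per_day     # 25_900_000
# Average number of seconds between two bids.
average_bid_interval = seconds_per_day.to_f / bids_per_day # 4.32

puts "requests/day:         #{polling_requests}"
puts "unnecessary requests: #{unnecessary_requests}"
puts "seconds between bids: #{average_bid_interval}"
```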
<p>We can optimize by making the consumers friendlier by respecting ETag, Last-Modified, conditional GET and partial GET instructions as proposed by the service. These manage to reduce some unnecessary network usage, but do not reduce the number of requests, nor do they decrease the number of handshakes. Caching and reverse proxies are also commonly employed for relieving server stress, although, due to the close to real time requirement of this scenario, configuring those effectively can be tricky.</p>
<p>In contrast, if we were to implement the example on top of an event driven, stateful transport such as XMPP, the service could publish updates on PubSub nodes, consumers would subscribe to those and receive updates as they happen. By doing so, we&#8217;re looking at 20,000 messages, equal to the number of bids and 3,000 handshakes, equal to the number of connections, equal to the number of consumers. The number of unnecessary requests/handshakes is reduced to zero.</p>
<p>The latter does not make a good candidate for an environment where the number of consumers interacting with the service is outside our control. With each consumer maintaining an open connection, the service never gets the opportunity to release system resources and there is a finite number of persistent connections a physical infrastructure can accommodate.</p>
<p>Adopting established, widely understood open standards introduces a plethora of benefits. HTTP, BitTorrent, XMPP, SMTP, FTP all have contributed to internet scale success stories and all come with associated merits and trade-offs. When faced with choice, we should examine the benefits and drawbacks of each, relative to the characteristics of the environment the system exists in. More interestingly, we should investigate combining available options so that one complements the others&#8217; strengths while countering potential sacrifices.</p>
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/anarchic-versus-controlled-scalability/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/anarchic-versus-controlled-scalability/</feedburner:origLink></item>
		<item>
		<title>Efficient data imports</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/401228004/</link>
		<comments>http://nutrun.com/weblog/efficient-data-imports/#comments</comments>
		<pubDate>Tue, 23 Sep 2008 22:58:16 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=176</guid>
		<description><![CDATA[An application&#8217;s performance is affected, among other things, by the performance of its parts. A large number of current applications contain a database layer which I&#8217;ve noticed become neglected more often than it deserves. This is unfortunate because there are a lot of quick performance victories that can be achieved by harnessing a database&#8217;s strong [...]]]></description>
			<content:encoded><![CDATA[<p>An application&#8217;s performance is affected, among other things, by the performance of its parts. A large number of current applications contain a database layer which I&#8217;ve noticed become neglected more often than it deserves. This is unfortunate because there are a lot of quick performance victories that can be achieved by harnessing a database&#8217;s strong points.</p>
<p>Let&#8217;s think of an application which periodically collects large amounts of data, adapts it from a foreign structure into its native domain and stores the results in a database for further use. Data units must be unique, something we need to enforce each time a new import takes place.</p>
<p>One way of achieving this would be to construct domain native objects or structures by parsing the external data feeds and check against the existence of duplicates in the database, using a custom hashcode identity mechanism. We can store the hashcode values in a <code>UNIQUE</code> database column to ensure data integrity.</p>
<pre>
DATA.each {|e| DB[:entries] &lt;&lt; e rescue nil}
</pre>
<p>This code iterates over the adapted object enumeration and attempts a database insert for each entry, ignoring any exceptions due to uniqueness violations. It also introduces the significant overhead of performing a number of database queries equal to the number of entries included in the imported collection.</p>
<p>Bulk inserts are nothing new and most, if not all, modern databases offer this functionality, which is also supported by the majority of database access application libraries. Ruby&#8217;s <a href="http://sequel.rubyforge.org/" title="Sequel: The Database Toolkit for Ruby">Sequel</a>, for instance, allows bulk insert operations with the <code><a href="http://sequel.rubyforge.org/rdoc/classes/Sequel/Dataset.html#M000675" title="Class: Sequel::Dataset">multi_insert</a></code> method.</p>
<pre>
DB[:entries].multi_insert(DATA)
</pre>
<p>There&#8217;s a caveat here, as this operation will terminate the moment a duplicate entry violation error occurs. MySQL offers the <code>INSERT IGNORE</code> construct which is particularly useful in this scenario. Using the <code>IGNORE</code> keyword will cause errors that occur while executing the <code>INSERT</code> statement to be treated as warnings.</p>
<p>Looking to investigate the performance boost associated with the above technique, I&#8217;ve put together a small extension for Sequel, enabling the toolkit to make use of <code>INSERT IGNORE</code>.</p>
<pre>
module InsertIgnore
  def ignore_duplicates!
    @ignore = true
    self
  end

  def multi_insert_sql(columns, values)
    columns = column_list(columns)
    values = values.map {|r| literal(Array(r))}.join(Sequel::MySQL::Dataset::COMMA_SEPARATOR)
    ignore = @ignore ? " IGNORE " : ' '
    ["INSERT#{ignore}INTO #{source_list(@opts[:from])} (#{columns}) VALUES #{values}"]
  end
end
</pre>
<p>This can be used like this:</p>
<pre>
Sequel::MySQL::Dataset.send(:include, InsertIgnore)
DB[:entries].ignore_duplicates!.multi_insert(DATA)
</pre>
<p>Inserting 100,000 records, some of them duplicates, using the application loop approach which issues an insert query for each entry took about 49 seconds on my laptop. Its <code>INSERT IGNORE</code> counterpart took about 4 seconds.</p>
<p>There are things to watch out for when using the latter approach. We can potentially construct very large queries, depending on the number of records we intend to insert. MySQL limits the maximum length of packets with the <code>max_allowed_packet</code> system variable, which defaults to 1 megabyte in MySQL 5.0 and 5.1 and can be increased up to 1 gigabyte. Loading such large datasets in memory can prove problematic, so slicing the import in chunks is probably a good idea.</p>
<p>In like manner, it&#8217;s worth mentioning MySQL&#8217;s <code>ON DUPLICATE KEY UPDATE</code>, which updates the existing row when an insert would otherwise fail due to a duplicate key violation.</p>
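<p>Slicing the import in chunks, as suggested earlier, could look something like the sketch below. <code>import_in_chunks</code> is a hypothetical helper and the chunk size of 1000 is an arbitrary assumption; the block is where the actual bulk insert would go, for example <code>DB[:entries].ignore_duplicates!.multi_insert(chunk)</code>.</p>

```ruby
# Yields rows in slices of at most chunk_size and returns the number of slices.
def import_in_chunks(rows, chunk_size = 1000)
  batches = 0
  rows.each_slice(chunk_size) do |chunk|
    yield chunk # one bulk INSERT per chunk keeps each query bounded in size
    batches += 1
  end
  batches
end

# Stand-in data; a real import would pass the adapted entries instead.
rows = (1..2500).map { |n| { :hashcode => n } }
sizes = []
import_in_chunks(rows, 1000) { |chunk| sizes.push(chunk.size) }
p sizes
```

<p>A suitable chunk size depends on the row size and on the configured <code>max_allowed_packet</code>, so it is worth measuring rather than guessing.</p>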
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/efficient-data-imports/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/efficient-data-imports/</feedburner:origLink></item>
		<item>
		<title>EventMachine MapReduce</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/388125294/</link>
		<comments>http://nutrun.com/weblog/eventmachine-mapreduce/#comments</comments>
		<pubDate>Tue, 09 Sep 2008 23:52:06 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=170</guid>
		<description><![CDATA[MapReduce is a parallel computation strategy useful for scaling large data set processing by distributing workload over multiple worker nodes. The distributed nature of MapReduce suggests network communication and, with that in mind, I thought I&#8217;d put together a demonstration employing EventMachine, a library which makes efficient network programming relatively simple in Ruby.
Before going any [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://labs.google.com/papers/mapreduce.html" title="Google Research Publication: MapReduce">MapReduce</a> is a parallel computation strategy useful for scaling large data set processing by distributing workload over multiple worker nodes. The distributed nature of MapReduce suggests network communication and, with that in mind, I thought I&#8217;d put together a demonstration employing <a href="http://rubyeventmachine.com/" title="Ruby / EventMachine - Trac">EventMachine</a>, a library which makes efficient network programming relatively simple in Ruby.</p>
<p>Before going any further, I should mention that the code examples have not been optimized for production use, they only illustrate what&#8217;s possible. Also, it&#8217;s worth bringing up two established Ruby libraries for tackling similar problems, <a href="http://rufy.com/starfish/doc/" title="Starfish - ridiculously easy distributed programming with Ruby">Starfish</a> and <a href="http://skynet.rubyforge.org/" title="space">Skynet</a>. It&#8217;s advisable that these existing options are investigated before delving into custom alternatives.</p>
<p>MapReduce essentially consists of two steps (although intermediate phases usually need be present for real world implementations), <em>map</em> and <em>reduce</em>. <em>map</em> refers to the higher order function also known as <em>transform</em> or <em>collect</em> and is the operation that is typically distributed and involves a number of nodes performing the transformation of a data set into another set of data. <em>reduce</em> refers to the higher order function, sometimes called <em>fold</em>, <em>inject</em> or other, which is in this case used for collecting the results of map to build a return value.</p>
<p>Counting the number of word occurrences in a large number of documents is one of the examples most commonly used for describing MapReduce. A number of distributed jobs is spawned, splitting document contents into words. The results of these operations are passed to a reduce process whose job is to sum its input.</p>
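<p>Before introducing any networking, the two steps can be sketched in plain Ruby on a single machine. The method names <code>map_document</code> and <code>reduce_counts</code> are made up for this illustration.</p>

```ruby
# map: turn a document into a list of [word, 1] pairs.
def map_document(text)
  text.split(" ").map { |word| [word, 1] }
end

# reduce: sum the counts of all pairs sharing the same word.
def reduce_counts(pairs)
  counts = Hash.new(0)
  pairs.each { |word, count| counts[word] += count }
  counts
end

documents = ["the quick fox", "the lazy dog"]
# Each document could be mapped by a different worker; here it happens in turn.
pairs = documents.map { |text| map_document(text) }.inject([]) { |all, part| all + part }
p reduce_counts(pairs)
```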
<p>Map processes can be EventMachine servers. We can have an arbitrary number of those running on a number of physical nodes.</p>
<pre>
require "rubygems"
require "eventmachine"

module Map
  def receive_data(path)
    document = File.read(path)
    word_counts = document.split(' ').map { |word| [word, 1] }
    send_data(Marshal.dump(word_counts))
    close_connection_after_writing
  end
end

EM.run {EM.start_server("localhost", 5555, Map)}
</pre>
<p>A reduce process can send job requests to those servers, receive and process the results.</p>
<pre>
require 'eventmachine'

class Reduce &lt; EM::Connection
  @@all = []

  def initialize(*args)
    super
    @doc, @data = args[0], ''
  end

  def post_init
    send_data(@doc)
  end

  def receive_data(data)
    @data &lt;&lt; data
  end

  def unbind
    Reduce.job_completed
    @@all += Marshal.load(@data)
    unless Reduce.pending_jobs?
      groups = @@all.group_by {|word| word[0] }
      groups.each { |word, pairs| puts "#{word} : #{pairs.size}" }
      EM.stop
    end
  end

  def self.send_map_job(port, doc)
    @job_count ||= 0
    increment_job_count
    EM.connect("localhost", port, Reduce, doc)
  end

  def self.increment_job_count
    @job_count += 1
  end

  def self.pending_jobs?
    @job_count != 0
  end

  def self.job_completed
    @job_count -= 1
  end
end

EM.run do
  {
    5555 => 'docs/america.txt',
    6666 => 'docs/da-vinci.txt'
  }.each { |port, doc| Reduce.send_map_job(port, doc) }
end
</pre>
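<p>The aggregation that happens once the last job completes can be tried without any sockets. The sketch below fakes the payloads two map workers might return and then applies the same unmarshal, group and count steps; the sample sentences are made up. Keep in mind that <code>Marshal.load</code> should only be fed data from trusted peers.</p>

```ruby
# Simulate what two map workers would send back: marshalled [word, 1] pairs.
payloads = [
  Marshal.dump('to be or not to be'.split(' ').map { |word| [word, 1] }),
  Marshal.dump('not now'.split(' ').map { |word| [word, 1] })
]

# Reduce side: unmarshal every payload and concatenate the pairs ...
all_pairs = payloads.inject([]) { |pairs, payload| pairs + Marshal.load(payload) }

# ... then group by word and count the pairs in each group.
totals = {}
all_pairs.group_by { |pair| pair[0] }.each do |word, pairs|
  totals[word] = pairs.size
end

totals.sort.each { |word, total| puts "#{word} : #{total}" }
```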
<p>The example lacks plumbing code which would make things flexible enough and, as you might have noticed, works on a single node (localhost), but hopefully illustrates a mechanism for distributing workload over a networked farm.</p>
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/eventmachine-mapreduce/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/eventmachine-mapreduce/</feedburner:origLink></item>
		<item>
		<title>Phusion Passenger on Amazon EC2</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/370076365/</link>
		<comments>http://nutrun.com/weblog/phusion-passenger-on-amazon-ec2/#comments</comments>
		<pubDate>Wed, 20 Aug 2008 15:49:47 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=167</guid>
		<description><![CDATA[Phusion Passenger has come a long way since its first public release, significantly simplifying the deployment of Ruby web applications on Apache servers, especially since the addition of support for Rack.
You can use this example Capfile if you&#8217;d like to get started quickly with trying out Passenger deployments on Amazon EC2.
It is assumed that your [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.modrails.com/" title="Overview &#x2014; Phusion Passenger&trade; (a.k.a. mod_rails / mod_rack)">Phusion Passenger</a> has come a long way since its first public release, significantly simplifying the deployment of Ruby web applications on Apache servers, especially since the addition of support for <a href="http://rack.rubyforge.org/" title="Rack: a Ruby Webserver Interface">Rack</a>.</p>
<p>You can use <a href="http://nutrun.com/passenger-ec2/Capfile" title="ec2 passenger capfile example">this example Capfile</a> if you&#8217;d like to get started quickly with trying out Passenger deployments on <a href="http://aws.amazon.com/ec2" title="Amazon Web Services @ Amazon.com">Amazon EC2</a>.</p>
<p>It is assumed that your environment has been previously configured for launching EC2 AMIs. If not, you might want to read the <a href="http://docs.amazonwebservices.com/AWSEC2/2007-08-29/GettingStartedGuide/" title="Amazon Elastic Compute Cloud">EC2 Getting Started Guide</a>, or refer to the first bits of <a href="http://nutrun.com/weblog/rubyworks-production-stack-on-amazon-ec2/" title="nutrun  &raquo; Blog Archive   &raquo; RubyWorks Production Stack on Amazon EC2">this article</a>.</p>
<p>By completing the following steps, we will end up with a running <a href="http://www.debian.org/" title="Debian -- The Universal Operating System">Debian</a> AMI, with Ruby 1.8.7, Rubygems 1.2.0, Apache2 and Passenger installed.</p>
<p>First, find the section about AWS credentials in the Capfile and replace the values with yours. These are <code>:keypair</code>, <code>:account_id</code>, <code>:access_key_id</code>, <code>:secret_access_key</code>, <code>:pk</code> and <code>:cert</code>. Once this is done, invoke:</p>
<pre>
cap instance:start
</pre>
<p>Copy the instance id from the output of this command and use it as the value for the <code>:instance_id</code> field in the Capfile. Call <code>ec2-describe-instances</code> until the instance is reported as running. Use the instance URL shown in that output as the value for the <code>:instance_url</code> field in the Capfile. Next, invoke:</p>
<pre>
cap instance:bootstrap
</pre>
<p>This will install Apache2 and Passenger on the instance. Once this step is complete, you can navigate to the instance URL from a browser and see the default page served by the newly installed, Passenger-enabled Apache. At this point - optionally and for demonstration purposes - you can invoke:</p>
<pre>
cap instance:example_app
</pre>
<p>This will install the <a href="http://merbivore.com/" title="Merb | Looking for a hacker's framework?">Merb</a> gems, create a flat Merb application in the instance&#8217;s <code>/var/www/example</code> directory, set it up for use with Passenger (create <code>public</code>, <code>log</code> and <code>tmp</code> directories and add a <code>config.ru</code> Rack configuration file as required by Passenger) and set up an Apache virtual host in order for Passenger to serve the application. Once this step is complete, navigate to the instance&#8217;s URL and you should see a page served by Merb.</p>
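<p>For readers who have not met <code>config.ru</code> before, the sketch below shows the kind of object such a file hands to Rack. It is a generic illustration rather than the file the Capfile generates for Merb; the greeting and the path are made up.</p>

```ruby
# A Rack application is any object that responds to call(env) and returns
# a [status, headers, body] triple. The names below are illustrative only.
app = lambda do |env|
  body = "Hello from #{env['PATH_INFO']}"
  [200, { 'Content-Type' => 'text/plain' }, [body]]
end

# Inside config.ru the application would be handed to Rack with:
#   run app

# Outside a server we can call it directly with a minimal env hash.
status, headers, body = app.call({ 'PATH_INFO' => '/example' })
puts status
puts headers['Content-Type']
puts body.join
```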
<p>There are a couple more convenient commands in the Capfile, <code>cap instance:ssh</code> and <code>cap instance:stop</code>.</p>
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/phusion-passenger-on-amazon-ec2/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/phusion-passenger-on-amazon-ec2/</feedburner:origLink></item>
		<item>
		<title>Rails Summit Latin America</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/364207655/</link>
		<comments>http://nutrun.com/weblog/rails-summit-latin-america/#comments</comments>
		<pubDate>Wed, 13 Aug 2008 20:52:59 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=162</guid>
		<description><![CDATA[
Danilo and I will be talking about REST (or maybe not&#8230;) at the Rails Summit Latin America, October 15, 2008. Many thanks to everyone who&#8217;s given me the opportunity to participate.
]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.locaweb.com.br/railssummit/"><img src="http://www.akitaonrails.com/assets/2008/8/13/en_souPalestrante_210x60.jpg" alt="Rails Summit Latin America" title="Rails Summit Latin America" border="0" width="210" height="60"/></a></p>
<p><a href="http://www.dtsato.com/blog/" title="Danilo Sato">Danilo</a> and I will be talking about REST (or maybe not&#8230;) at the <a href="http://www.locaweb.com.br/railssummit/" title="Servi&ccedil;os de Internet - Locaweb">Rails Summit Latin America</a>, October 15, 2008. Many thanks to everyone who&#8217;s given me the opportunity to participate.</p>
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/rails-summit-latin-america/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/rails-summit-latin-america/</feedburner:origLink></item>
		<item>
		<title>Cheap lunch</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/352078585/</link>
		<comments>http://nutrun.com/weblog/cheap-lunch/#comments</comments>
		<pubDate>Fri, 01 Aug 2008 01:31:00 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=155</guid>
		<description><![CDATA[Is the free lunch really over? It surely is a question that troubles many software developers. Constrained by the laws of physics, processor manufacturing has definitely changed its rules of play in the last few years. It is steadily becoming near impossible to extract more juice out of a CPU&#8217;s single core.
Since I started thinking [...]]]></description>
			<content:encoded><![CDATA[<p>Is the free lunch really <a href="http://www.gotw.ca/publications/concurrency-ddj.htm" title="The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software">over</a>? It surely is a question that troubles many software developers. Constrained by the laws of physics, processor manufacturing has definitely changed its rules of play in the last few years. It is steadily becoming near impossible to extract more juice out of a CPU&#8217;s single core.</p>
<p>Since I started thinking about the implications of the multi-core evolution, I have kept an eye open for situations where taking advantage of multi-core CPUs would benefit my work. It is almost certain that the reason has to do with my work being primarily around server side applications, but I have yet to come across many situations where a multi-core influenced approach would have provided a benefit that could only have been achieved by following this paradigm.</p>
<p>The problem is undeniably evident if we approach it from the side of computational units restricted by the laws of physics. It seems like we will always have a healthy appetite for increased performance, and given we can&#8217;t get much more out of one core, we must start thinking and programming in a multi-core context. We could do with <a href="http://www.openoffice.org/" title="www:        OpenOffice.org - The Free and Open Productivity Suite">OpenOffice</a> being more feature rich and faster, thus our desktop needs to be more potent.</p>
<p>At the same time, the web, and networking in general, is increasingly influencing the way we think about, use and create software. Considering the OpenOffice example, there is already a myriad of applications moving similar functionality over to the web. Networking brings distributed solutions to the table, which, alongside other applications, are widely employed for improving software performance.</p>
<p>Next to physics, the software world is governed by the laws of economics. The creation of software must result in some form of social, financial, or other profit, part of which is achieved by minimizing associated costs. It is almost certain that vendors will claim that a data center of quad-core equipped slices is the next answer to our software woes, but it pays to remember that a cloud of commodity hardware might, in some situations, improve the rate of return. The lunch was never free, but today, just like 10 years ago, it&#8217;s really about how cheap the lunch is.</p>
<p>The need for concurrency is undeniable, but whether its mainstream representation will be multi-core friendly programming or architectures distributed over a network remains to be seen.</p>
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/cheap-lunch/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/cheap-lunch/</feedburner:origLink></item>
		<item>
		<title>Cache watch</title>
		<link>http://feeds.feedburner.com/~r/nutrun/feed/~3/348886092/</link>
		<comments>http://nutrun.com/weblog/cache-watch/#comments</comments>
		<pubDate>Tue, 29 Jul 2008 00:40:55 +0000</pubDate>
		<dc:creator>George Malamidis</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://nutrun.com/?p=151</guid>
		<description><![CDATA[Web frameworks like Merb or Rails provide convenient ways for caching output data to static files or other stores, used for improving a web application&#8217;s performance. Caching is typically handled inside controller classes. With merb-cache, for example, we can cache an entire page by doing something along the lines of:

class Foo &#60; Merb::Controller
  cache_page [...]]]></description>
			<content:encoded><![CDATA[<p>Web frameworks like <a href="http://merbivore.com/" title="Merb | Looking for a hacker's framework?">Merb</a> or Rails provide convenient ways for caching output data to static files or other stores, used for improving a web application&#8217;s performance. Caching is typically handled inside controller classes. With <a href="http://merb-cache.rubyforge.org" title="merb-cache docs">merb-cache</a>, for example, we can cache an entire page by doing something along the lines of:</p>
<pre>
class Foo &lt; Merb::Controller
  cache_page :index
end
</pre>
<p>Expiring cached data is handled with a number of instance methods available to controllers, such as <code>expire_page(key)</code> or <code>expire_all_pages</code>. This implies that cache expiration needs to be put explicitly in place inside actions.</p>
<p>The most common event signifying the need for cache expiration is the modification of the underlying data which has at some point been cached. More often than not, this means some sort of write (insert, update, delete) storage operation, which in turn means that cache expiration is closer to storage aware parts of the application rather than controllers. With this in mind, it would be useful to be able to configure cache expiration in a manner similar to that of cache creation, for example:</p>
<pre>
class Foo &lt; Merb::Controller
  cache_page :index
  cache_watch :foo_store, :bar_store
end
</pre>
<p>The <code>cache_watch :foo_store, :bar_store</code> line signifies that any cached artifacts associated with this controller need to be expired whenever a data altering operation takes place in the context of the <code>FooStore</code> or <code>BarStore</code> classes.</p>
<p>Approaching data altering operations as events presents a good case for employing the Observer pattern in order to enable cache expiration when such events take place. ActiveRecord, for instance, offers means for adding hooks to persistent objects&#8217; life cycle methods in the form of <a href="http://api.rubyonrails.org/classes/ActiveRecord/Observer.html" title="Class: ActiveRecord::Observer">Observers</a>.</p>
<pre>
class FooObserver &lt; ActiveRecord::Observer
  def after_save(foo)
    expire_cache
  end
end
</pre>
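<p>The same idea can be sketched without ActiveRecord. In the example below, <code>WidgetStore</code>, <code>CacheExpirer</code> and the cache hash are hypothetical stand-ins, used only to show the shape of the pattern: the store knows nothing about caching, it simply notifies whoever registered an interest in its writes.</p>

```ruby
# Hypothetical stand-in for a storage class; not part of any library.
class WidgetStore
  def initialize
    @rows = {}
    @observers = []
  end

  # Register any object that responds to after_save(record).
  def add_observer(observer)
    @observers.push(observer)
  end

  # Write the record, then tell every observer about it.
  def save(id, attributes)
    @rows[id] = attributes
    @observers.each { |observer| observer.after_save(attributes) }
  end
end

# Hypothetical observer that empties a cache whenever the store is written to.
class CacheExpirer
  def initialize(cache)
    @cache = cache
  end

  def after_save(_record)
    @cache.clear
  end
end

cache = { 'widgets/index' => 'cached page' }
store = WidgetStore.new
store.add_observer(CacheExpirer.new(cache))
store.save(1, { 'name' => 'sprocket' })
puts cache.empty?
```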
<p>Putting it all together, we can create a module that enables configuring cache expiration declaratively inside controllers in a way reminiscent of how cache creation is handled.</p>
<pre>
module CacheInvalidator
  def cache_watch(controller, *models)
    models.each {|model| (@entries ||= Set.new) &lt;&lt; Entry.new(controller, model)}
  end

  def activate!
    @entries.each do |entry|

      # Skip entries that already have an observer; return would abort the whole loop
      next if Kernel.const_defined?(entry.class_name)

      entry.log

      observer = Class.new(ActiveRecord::Observer) do
        include CacheInvalidator
        observe(entry.model)
        define_method(:entry) {entry}
      end

      Kernel.const_set(entry.class_name, observer)
      observer.instance
    end
  end

  module_function :cache_watch
  module_function :activate!

  def after_save(model)
    destroy_cache
  end

  def after_destroy(model)
    destroy_cache
  end

  private

  def destroy_cache
    FileUtils.rm_f(entry.file_path) if File.file?(entry.file_path)
    FileUtils.rm_r(entry.dir_path) if File.directory?(entry.dir_path)
  end

  class Entry

    attr_reader :controller, :model

    def initialize(controller, model)
      @controller, @model = controller, model
    end

    def class_name
      (controller.name.gsub(/\:\:/, '') + model.to_s.camelize + "CacheObserver").intern
    end

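    def ==(other)
      controller == other.controller &amp;&amp; self.model == other.model
    end

    # Set relies on eql? and hash rather than ==, so define both for de-duplication
    alias eql? ==

    def hash
      [controller, model].hash
    end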

    def file_path
      "#{dir_path}.xml"
    end

    def dir_path
      "#{APP_ROOT}/public/cache/#{@controller.name.underscore}"
    end

    def log
      Merb.logger.info "Cache-watching #{model.to_s.camelize} for #{controller}"
    end
  end
end
</pre>
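<p>The part of <code>activate!</code> most likely to look unfamiliar is the creation of a named observer class at runtime. The sketch below isolates that technique in plain Ruby; <code>BaseObserver</code> is a hypothetical stand-in for <code>ActiveRecord::Observer</code>, and the entry is reduced to a string.</p>

```ruby
# Hypothetical base class standing in for ActiveRecord::Observer.
class BaseObserver
  def describe
    "#{self.class.name} watches #{entry}"
  end
end

entry = 'foo_store' # local variable captured by the closure below
class_name = :FooControllerFooStoreCacheObserver

unless Object.const_defined?(class_name)
  observer = Class.new(BaseObserver) do
    # define_method takes a block, so it can see the local variable entry;
    # a plain def would not be able to.
    define_method(:entry) { entry }
  end

  # Assigning the anonymous class to a constant gives it a name.
  Object.const_set(class_name, observer)
end

puts FooControllerFooStoreCacheObserver.new.describe
```

<p>Guarding with <code>const_defined?</code> keeps a second activation from redefining an observer that already exists.</p>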
<p>Because <code>cache_watch</code> takes the controller as its first argument, cache invalidation rules can be declared inside controllers by calling it on the <code>CacheInvalidator</code> module.</p>
<pre>
class FooController &lt; Merb::Controller
  cache_page :index
  CacheInvalidator.cache_watch self, :FooStore, :BarStore
end
</pre>
</pre>
<p>Cache watching can be activated wherever app initialization tasks are kept, such as <code>init.rb</code> in Merb.</p>
<pre>
Merb::BootLoader.after_app_loads do
  CacheInvalidator.activate!
end
</pre>
]]></content:encoded>
			<wfw:commentRss>http://nutrun.com/weblog/cache-watch/feed/</wfw:commentRss>
		<feedburner:origLink>http://nutrun.com/weblog/cache-watch/</feedburner:origLink></item>
	</channel>
</rss>
