Archive for the ‘Microformats’ Category

Parsing Microformats: rel-tag, adr, hCard

Wednesday, June 27th, 2007

rel-tag

Of all the Microformats, rel-tag is one of the simplest and therefore one of the easiest to parse.

I find it useful to treat Microformat object representations as Structures without behavior, with a set of member attributes that map to their properties to capture their state. A Scanner class takes care of parsing a given piece of HTML to collect all rel-tag occurrences.

[ruby]
require "test/unit"

class RelTagTest < Test::Unit::TestCase
  def test_rel_tag_extraction
    html = %(
      <a href="http://technorati.com/tag/tech" rel="tag">tech</a>
      <a href="http://technorati.com/tag/rock" rel="tag">rock</a>
    )

    tags = RelTagScanner.find_all(html)

    assert_equal(2, tags.size)
    assert_equal("tech", tags[0].name)
    assert_equal("http://technorati.com/tag/tech", tags[0].url)
    assert_equal("rock", tags[1].name)
    assert_equal("http://technorati.com/tag/rock", tags[1].url)
  end
end
[/ruby]

The implementation for the above specification is compact and straightforward.

[ruby]
require "rubygems"
require "hpricot"

class RelTag < Struct.new(:url, :name); end

class RelTagScanner
  # Returns one RelTag per element carrying rel="tag".
  def self.find_all(html)
    (Hpricot(html)/"[@rel=tag]").map do |tag|
      RelTag.new(tag[:href], tag.inner_text)
    end
  end
end
[/ruby]

The RelTag class is a Struct with two members, url and name. In RelTagScanner’s find_all method, we ask Hpricot to fetch all elements with a rel="tag" attribute, and from each of those we take the href attribute and the link text to populate a RelTag object.
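
To see what the Struct gives us for free, here is a minimal sketch in plain Ruby (no Hpricot involved). The tag is built by hand and its values are only illustrative.

```ruby
# A RelTag is a plain value object: Struct generates the constructor,
# the accessors and value-based equality.
class RelTag < Struct.new(:url, :name); end

tag = RelTag.new("http://technorati.com/tag/tech", "tech")

puts tag.name   # prints "tech"
puts tag.url    # prints "http://technorati.com/tag/tech"

# Two tags with the same member values compare as equal.
same = RelTag.new("http://technorati.com/tag/tech", "tech")
puts tag == same   # prints "true"
```

Since the object carries no behavior of its own, all the parsing logic stays in the scanner.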

adr

Even though adr’s schema specifies more properties, parsing it is more straightforward than rel-tag as all the adr fields are marked up with class="field" constructs.

[ruby]
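class AdrTest < Test::Unit::TestCase
  def test_adr_extraction
    html = %(
      <div class="adr">
        <div class="street-address">665 3rd St.</div>
        <div class="extended-address">Suite 207</div>
        <div>
          <span class="locality">San Francisco</span>,
          <span class="region">CA</span>
          <span class="postal-code">94107</span>
        </div>
        <div class="country-name">U.S.A.</div>
      </div>
    )

    adr = AdrScanner.find_all(html)[0]

    assert_equal("665 3rd St.", adr.street_address)
    assert_equal("Suite 207", adr.extended_address)
    assert_equal("San Francisco", adr.locality)
    assert_equal("CA", adr.region)
    assert_equal("94107", adr.postal_code)
    assert_equal("U.S.A.", adr.country_name)
  end
end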
[/ruby]

Again, the Adr class extends a class generated by Struct.new, with members that correspond to the adr spec’s Property List.

[ruby]
class Adr < Struct.new(:post_office_box, :extended_address, :street_address,
                       :locality, :region, :postal_code, :country_name); end

class AdrScanner
  def self.find_all(html)
    doc = Hpricot(html)
    (doc/".adr").map do |adr|
      # Member names use underscores, class names use hyphens.
      # to_s is needed because Struct.members returns symbols on Ruby 1.9+.
      Adr.new(*Adr.members.map { |m| (adr/".#{m.to_s.tr('_', '-')}").inner_text })
    end
  end
end
[/ruby]

In this case, it suffices for the AdrScanner to detect all elements that are marked up as class="adr". For each of those elements, we extract the matching nested properties (e.g. class="locality") and pass them, in member order, to the constructor of Adr.
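
The mapping from Struct members to class names can be tried out without any HTML. In the sketch below, the `extracted` hash is a hypothetical stand-in for what the scanner would pull out of the document; everything else follows the Adr definition above.

```ruby
class Adr < Struct.new(:post_office_box, :extended_address, :street_address,
                       :locality, :region, :postal_code, :country_name); end

# Struct member names use underscores; adr class names use hyphens.
css_classes = Adr.members.map { |m| m.to_s.tr("_", "-") }

# Hypothetical stand-in for the parsed HTML: class name => text content.
extracted = {
  "street-address" => "665 3rd St.",
  "locality"       => "San Francisco",
  "region"         => "CA",
  "postal-code"    => "94107",
  "country-name"   => "U.S.A."
}

# Properties that are absent from the markup end up as empty strings here.
adr = Adr.new(*css_classes.map { |c| extracted.fetch(c, "") })

puts css_classes.first   # prints "post-office-box"
puts adr.locality        # prints "San Francisco"
```

Because the values are passed positionally, the order of the Struct members is the only thing tying a value to its property.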

hCard

Things get slightly more complicated when parsing hCards. hCard is a compound Microformat, in the sense that it contains other Microformats, notably adr. In addition to that, the tel property can appear more than once, whilst it can also accept an optional type attribute.

[html]

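<div>Phone: <span class="tel">+1-727-231-0101</span></div>
<div>
  <span class="tel"><span class="type">Fax</span>:
  <span class="value">+1-727-258-0207</span></span>
</div>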

[/html]

Keeping this in mind, it makes sense to treat the simple members of hCard in a similar fashion to the previous examples. The adr member should be of type Adr, whereas the phone numbers can go in a Hash field and be retrieved as hcard.tels[:type].

[ruby]
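class HcardTest < Test::Unit::TestCase
  def setup
    html = %(
      <div class="vcard">
        <div class="fn org">Wikimedia Foundation Inc.</div>
        <div class="adr">
          <div class="street-address">200 2nd Ave. South #358</div>
          <div>
            <span class="locality">St. Petersburg</span>,
            <abbr class="region" title="Florida">FL</abbr>
            <span class="postal-code">33701-4313</span>
          </div>
          <div class="country-name">USA</div>
        </div>
        <div>Phone: <span class="tel">+1-727-231-0101</span></div>
        <div>Email: <span class="email">info@wikimedia.org</span></div>
        <div>
          <span class="tel"><span class="type">Fax</span>:
          <span class="value">+1-727-258-0207</span></span>
        </div>
      </div>
    )
    @hcard = HcardScanner.find_all(html)[0]
  end

  def test_simple_members
    assert_equal("Wikimedia Foundation Inc.", @hcard.fn)
    assert_equal("Wikimedia Foundation Inc.", @hcard.org)
    assert_equal("info@wikimedia.org", @hcard.email)
  end

  def test_tels
    assert_equal("+1-727-231-0101", @hcard.tels[:default])
    assert_equal("+1-727-258-0207", @hcard.tels[:fax])
  end

  def test_adr
    assert_equal("200 2nd Ave. South #358", @hcard.adr.street_address)
    assert_equal("St. Petersburg", @hcard.adr.locality)
    assert_equal("FL", @hcard.adr.region)
    assert_equal("33701-4313", @hcard.adr.postal_code)
    assert_equal("USA", @hcard.adr.country_name)
  end
end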
[/ruby]

Because tel does not necessarily require the class="type" and class="value" attributes, we are treating the number that is marked up solely as class="tel" as the default.

[ruby]
class Hcard < Struct.new(:fn, :org, :email, :tels, :adr); end

class HcardScanner
  def self.find_all(html)
    doc = Hpricot(html)
    (doc/".vcard").map do |vcard|
      hcard = Hcard.new(*[:fn, :org, :email].map { |m| (vcard/".#{m}").inner_text })
      hcard.tels = find_tels(vcard)
      hcard.adr = AdrScanner.find_all(vcard.to_html)[0]
      hcard
    end
  end

  # Builds a Hash of phone numbers keyed by type (:default when untyped).
  def self.find_tels(vcard)
    tels = {}
    (vcard/".tel").each do |tel|
      type = (tel/".type").inner_text
      if type.empty?
        type = :default
        value = tel.inner_text
      else
        type = type.downcase.to_sym
        value = (tel/".value").inner_text
      end
      tels[type] = value
    end
    tels
  end
  # A bare `private` has no effect on methods defined with `def self.`.
  private_class_method :find_tels
end
[/ruby]

For each vcard element found in the HTML, we construct a new Hcard object. We can use the AdrScanner to extract the adr element and pass it on to the Hcard. For all occurrences of tel, we have to check for the presence of class="type" and class="value" and add them as entries to the Hcard#tels hash. In the absence of those two attributes, we add the phone number to the tels hash keyed as :default.
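
The branching on type can be isolated from Hpricot. The sketch below assumes a hypothetical Tel struct standing in for a parsed tel element, where type and value hold the text of the nested class="type" and class="value" elements (empty when absent) and text holds the text of the tel element itself.

```ruby
# Hypothetical stand-in for a parsed tel element.
Tel = Struct.new(:type, :value, :text)

# Untyped numbers are keyed as :default, typed ones by their
# lower-cased type.
def tels_by_type(tels)
  tels.each_with_object({}) do |tel, result|
    if tel.type.empty?
      result[:default] = tel.text
    else
      result[tel.type.downcase.to_sym] = tel.value
    end
  end
end

tels = tels_by_type([
  Tel.new("", "", "+1-727-231-0101"),
  Tel.new("Fax", "+1-727-258-0207", "Fax: +1-727-258-0207")
])

puts tels[:default]   # prints "+1-727-231-0101"
puts tels[:fax]       # prints "+1-727-258-0207"
```

Note that with a Hash keyed by type, a second number of the same type overwrites the first; an hCard with two untyped numbers would keep only the last one under :default.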

Microformats: Machine CSS

Saturday, June 16th, 2007

I have in the past expressed skepticism about the claim that Microformats are “Designed for humans first and machines second”.

I would argue that Microformats are to machines what CSS is to humans.

There is more than one similarity between what CSS and Microformats are trying to achieve, and even in how they manifest themselves in terms of implementation. They do, after all, share a common platform - HTML.

One of the most important common goals shared by Microformats and CSS is making the resources they decorate more meaningful to the receiver of those resources. And while making a heading bright orange and bold would make it look more like a heading, something directly linked to a human reader’s understanding, decorating a div with class="vcard" facilitates a program’s perception of what the marked-up data is representing and how it is to be treated.

Imagine a tag cloud in two states, before and after its entries have been enhanced with rel-tag. A casual human reader would always perceive the entries as tags regardless of rel-tag and would in fact be oblivious to its existence. The human reader recognizes the tags because of the way they look, as instructed by the web page’s stylesheet. A tag aggregator script, on the other hand, would not recognize the content of the cloud as tags before it was marked as such with rel-tag.

Ultimately, and through various levels of indirection, most, if not all, information is to be used by or for humans. It is probably the number of levels of indirection that signifies what is designed primarily for humans and what is designed primarily for machines.

HMachine

Thursday, May 24th, 2007

20-odd lines of a dirty hack for a Microformats parser (thanks to open-uri and Hpricot).

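%w(rubygems hpricot open-uri).each { |l| require l }

module Microformats
  class Microformat < Struct
    # Fetches uri and fills each member from the first element matching
    # ".<format name> .<member-name>".
    def self.for(uri)
      mf = new
      name = self.name.split('::').last.downcase
      # Kernel#open no longer accepts URLs on Ruby 3; open-uri exposes URI.open.
      doc = Hpricot(URI.open(uri))
      members.each do |m|
        # to_s because Struct.members returns symbols on Ruby 1.9+.
        val = doc.at(".#{name} .#{m.to_s.tr('_', '-')}")
        mf[m] = val.inner_text.strip unless val.nil?
      end
      mf
    end
    class << self; alias :/ :for end
  end
end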

Support for a microformat specification can be added like this:

module Microformats
  class MyFormat < Microformat.new(:x, :y, :z);end
end

Where :x, :y, :z are the Microformat’s properties.
As a more concrete example, let’s add support for (part of) HReview:

module Microformats
  class HReview < Microformat.new(:summary, :fn,
                                  :dtreviewed, :description,
                                  :rating);end
end

In action…

include Microformats

hr = HReview.for('http://www.amk.ca/books/h/Velocity_of_Honey')
p hr.fn

# => "The Velocity of Honey: And More Science of Everyday Life"

… or even slicker…

hr = HReview/'http://www.amk.ca/books/h/Velocity_of_Honey'
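
The hack leans on one Ruby detail: calling new on a subclass of Struct returns a fresh class that inherits from that subclass, class methods included. Here is a minimal sketch of that detail alone, with no network access; the Base class and its selectors method are hypothetical names used only for illustration.

```ruby
class Base < Struct
  # Class methods defined here are inherited by every class that
  # Base.new(...) generates, and by anything subclassing those.
  def self.selectors
    prefix = name.split("::").last.downcase
    members.map { |m| ".#{prefix} .#{m.to_s.tr('_', '-')}" }
  end
end

class HReview < Base.new(:summary, :fn, :dtreviewed); end

p HReview.selectors
# => [".hreview .summary", ".hreview .fn", ".hreview .dtreviewed"]
```

These are the same selectors the parser above hands to Hpricot, one per member.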

How microformats will simplify the web

Tuesday, April 17th, 2007

Just for the fun of it, let me start with what I don’t like about Microformats. To cite the site (heh..) Microformats are “Designed for humans first and machines second”. Every time I see the word “humans” in relation to something to do with technology, I kind of turn red. Partly because I’m the sort of person who would always choose to use the ATM instead of going through the Human Bank-Employee, partly because everything designed for humans is going to end up being used first and foremost by spammers.

You see, there’s no denying that software is meant to be used by humans or to make their lives easier. Things like Microformats, however, are only interesting - and, at least at a low level, are destined to remain so - to a very specific subset of humans: programmers. So I’d be way better off spared the hippy humane talk.

Having gotten this semantic (there we go…) complaint off my chest, I think Microformats are great. In fact, I find them to be an idea as big as REST over SOAP style Web Services.

The magic lies in the simplicity that accompanies Microformats and the household status of their platform. There’s no funny/strict schema within a mile, only mere semantic enhancements to one of the most widely used mediums on the web today: HTML (primarily, although I think it should be exclusively).

Let’s take the following example of something reminiscent of a weblog post:

<html>
  <head><title>Who needs Atom?</title></head>
  <body>
    <h1>Who needs Atom?</h1>
    <h2>
      Posted on Friday, April 13, 2007
      by Dave Mustaine
    </h2>
    <p>
      Really... Who needs it? Or RSS, come to think of it...
    </p>
  </body>
</html>

Now, let’s add a tiny bit of non-intrusive, albeit meaningful, semantic coating to our mark-up:

<html>
  <head><title>Who needs Atom?</title></head>
  <body>
    <h1 class="title">Who needs Atom?</h1>
    <h2>
      Posted on <span class="date">Friday, April 13, 2007</span>
      by <span class="author">Dave Mustaine</span>
    </h2>
    <p class="content">
      Really... Who needs it? Or RSS, come to think of it...
    </p>
  </body>
</html>

Suddenly, our document can say a lot to a syndication engine or human developer with a website scraping API at hand. And there’s no need to maintain a feed.xml, or anything similar. The website really is the weblog.
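
As a rough sketch of what such a scraper could look like, the snippet below reads the annotated page with REXML. That is an assumption made for convenience: REXML is an XML parser, so this only works because the sample markup happens to be well-formed, and real-world HTML would call for an HTML parser instead. The class names are the ones from the example above.

```ruby
require "rexml/document"

html = <<~HTML
  <html>
    <head><title>Who needs Atom?</title></head>
    <body>
      <h1 class="title">Who needs Atom?</h1>
      <h2>
        Posted on <span class="date">Friday, April 13, 2007</span>
        by <span class="author">Dave Mustaine</span>
      </h2>
      <p class="content">
        Really... Who needs it? Or RSS, come to think of it...
      </p>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# Collect the text of the first element carrying each class name.
entry = %w(title date author content).to_h do |css_class|
  node = REXML::XPath.first(doc, "//*[@class='#{css_class}']")
  [css_class.to_sym, node.text.strip]
end

puts entry[:title]    # prints "Who needs Atom?"
puts entry[:author]   # prints "Dave Mustaine"
```

The resulting hash holds everything a feed entry would need, taken straight from the page.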

What Microformats, alongside REST, are proudly showcasing is how much can be achieved by concentrating on two simple things: meaningful URLs (where the resources are and will be) and meaningful mark-up (what the resources are about). Once this is achieved, anyone can do whatever they want with them, because the only thing a website needs to qualify as a weblog, resume, web-service API - and the beat goes on… - is a little bit of meaning.