Parsing Microformats: rel-tag, adr, hCard
Wednesday, June 27th, 2007
rel-tag
Of all the Microformats, rel-tag is one of the simplest and therefore one of the easiest to parse.
I find it useful to represent Microformat objects as Structs without behavior, whose members map to the Microformat's properties and capture its state. A Scanner class takes care of parsing a given piece of HTML and collecting all rel-tag occurrences.
[ruby]
require "test/unit"

class RelTagTest < Test::Unit::TestCase
  def test_rel_tag_extraction
    html = %(
      <a href="http://technorati.com/tag/tech" rel="tag">tech</a>
      <a href="http://technorati.com/tag/rock" rel="tag">rock</a>
    )
    tags = RelTagScanner.find_all(html)
    assert_equal(2, tags.size)
    assert_equal("tech", tags[0].name)
    assert_equal("http://technorati.com/tag/tech", tags[0].url)
    assert_equal("rock", tags[1].name)
    assert_equal("http://technorati.com/tag/rock", tags[1].url)
  end
end
[/ruby]
The implementation for the above specification is compact and straightforward.
[ruby]
require "rubygems"
require "hpricot"

# Plain value object: one rel-tag occurrence.
class RelTag < Struct.new(:url, :name); end

class RelTagScanner
  # Returns a RelTag for every element carrying rel="tag".
  def self.find_all(html)
    (Hpricot(html)/"[@rel=tag]").map do |tag|
      RelTag.new(tag[:href], tag.inner_text)
    end
  end
end
[/ruby]
The RelTag class is a Struct with two members, url and name. In RelTagScanner’s find_all method, we ask Hpricot to fetch all elements with a rel="tag" attribute, and from each of those we take the href attribute and the inner text to populate a RelTag object.
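To make the "Struct without behavior" idea concrete, here is a minimal sketch that uses only Ruby's standard library. TagRecord is a hypothetical name, chosen so that it does not clash with the RelTag class above; the values are borrowed from the test.

```ruby
# Hypothetical stand-in for RelTag, defined the same way.
class TagRecord < Struct.new(:url, :name); end

tag = TagRecord.new("http://technorati.com/tag/tech", "tech")

# Members are filled positionally, in declaration order.
tag.name   # "tech"
tag.url    # "http://technorati.com/tag/tech"

# Structs compare by value, which keeps test assertions simple.
same = TagRecord.new("http://technorati.com/tag/tech", "tech")
tag == same   # true

# The member list can be inspected, which is handy for generic scanners.
TagRecord.members.map(&:to_s)   # ["url", "name"]
```

Because the object carries no parsing logic of its own, the scanner remains the only place that knows about the HTML.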
adr
Even though the adr schema specifies more properties, parsing it is even more straightforward than rel-tag, because every adr field is marked up with a class attribute named after the field (e.g. class="locality").
[ruby]
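class AdrTest < Test::Unit::TestCase
  def test_adr_extraction
    html = %(
      <div class="adr">
        <div class="street-address">665 3rd St.</div>
        <div class="extended-address">Suite 207</div>
        <span class="locality">San Francisco</span>,
        <span class="region">CA</span>
        <span class="postal-code">94107</span>
        <div class="country-name">U.S.A.</div>
      </div>
    )
    adr = AdrScanner.find_all(html)[0]
    assert_equal("665 3rd St.", adr.street_address)
    assert_equal("Suite 207", adr.extended_address)
    assert_equal("San Francisco", adr.locality)
    assert_equal("CA", adr.region)
    assert_equal("94107", adr.postal_code)
    assert_equal("U.S.A.", adr.country_name)
  end
end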
[/ruby]
Again, the Adr class extends a class generated by Struct.new, with members that correspond to the adr spec’s Property List.
[ruby]
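class Adr < Struct.new(:post_office_box, :extended_address, :street_address, :locality, :region, :postal_code, :country_name); end

class AdrScanner
  def self.find_all(html)
    doc = Hpricot(html)
    (doc/".adr").map do |adr|
      # Map each member to its class name, e.g. :postal_code -> ".postal-code".
      # to_s keeps this working where Struct.members returns symbols (Ruby 1.9+).
      Adr.new(*Adr.members.map { |m| (adr/".#{m.to_s.tr('_', '-')}").inner_text })
    end
  end
end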
[/ruby]
In this case, it suffices for the AdrScanner to detect all elements that are marked up as class="adr". For each of those elements, we extract the matching nested properties (e.g. class="locality") and pass them as an array to the constructor of Adr.
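The mapping from Struct members to class names can be tried out without an HTML parser. The following is a sketch under stated assumptions: AdrRecord, selector_for and the extracted hash are hypothetical stand-ins, with the hash playing the role of the text a parser would return for each selector.

```ruby
# Hypothetical stand-in for Adr with the same members.
class AdrRecord < Struct.new(:post_office_box, :extended_address, :street_address,
                             :locality, :region, :postal_code, :country_name); end

# Turn a member name such as :postal_code into the CSS selector ".postal-code".
# to_s is used because Struct.members returns symbols on Ruby 1.9 and later.
def selector_for(member)
  "." + member.to_s.tr("_", "-")
end

selectors = AdrRecord.members.map { |m| selector_for(m) }

# Stand-in for parser output, keyed by selector (values from the test above).
extracted = {
  ".street-address" => "665 3rd St.",
  ".locality"       => "San Francisco",
  ".region"         => "CA",
  ".postal-code"    => "94107"
}

# Selectors with no match fall back to an empty string here.
adr = AdrRecord.new(*selectors.map { |s| extracted.fetch(s, "") })
adr.postal_code       # "94107"
adr.post_office_box   # ""
```

Since the selectors are derived from the member list, adding a property to the Struct is enough for the scanner to pick it up.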
hCard
Things get slightly more complicated when parsing hCards. hCard is a compound Microformat, in the sense that it contains other Microformats, notably adr. In addition, the tel property can appear more than once, and each occurrence can optionally carry a type subproperty alongside its value.
[html]
<span class="tel">
  <span class="type">Fax</span>:
  <span class="value">+1-727-258-0207</span>
</span>
[/html]
Keeping this in mind, it makes sense to treat the simple members of hCard in the same way as in the previous examples. The adr member should be of type Adr, whereas the phone numbers can go into a Hash keyed by type and be retrieved as hcard.tels[:type].
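As a rough sketch of the object shape this describes, the snippet below builds such a card by hand. AddressPart and CardRecord are hypothetical, trimmed-down stand-ins for Adr and Hcard, and the values are borrowed from the test that follows.

```ruby
# Hypothetical stand-ins for Adr and Hcard, trimmed to a few members.
class AddressPart < Struct.new(:locality, :region); end
class CardRecord < Struct.new(:fn, :tels, :adr); end

# Members that are not passed to the constructor start out as nil.
card = CardRecord.new("Wikimedia Foundation Inc.")
card.tels = { :default => "+1-727-231-0101", :fax => "+1-727-258-0207" }
card.adr  = AddressPart.new("St. Petersburg", "FL")

card.tels[:fax]      # "+1-727-258-0207"
card.adr.locality    # "St. Petersburg"
card.tels[:cell]     # nil, because no number of that type was recorded
```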
[ruby]
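class HcardTest < Test::Unit::TestCase
  def setup
    html = %(
      <div class="vcard">
        <div class="fn org">Wikimedia Foundation Inc.</div>
        <div class="adr">
          <div class="street-address">200 2nd Ave. South #358</div>
          <span class="locality">St. Petersburg</span>,
          <span class="region">FL</span>
          <span class="postal-code">33701-4313</span>
          <div class="country-name">USA</div>
        </div>
        <div>Phone: <span class="tel">+1-727-231-0101</span></div>
        <div>Email: <span class="email">info@wikimedia.org</span></div>
        <div class="tel">
          <span class="type">Fax</span>:
          <span class="value">+1-727-258-0207</span>
        </div>
      </div>
    )
    @hcard = HcardScanner.find_all(html)[0]
  end

  def test_simple_members
    assert_equal("Wikimedia Foundation Inc.", @hcard.fn)
    assert_equal("Wikimedia Foundation Inc.", @hcard.org)
    assert_equal("info@wikimedia.org", @hcard.email)
  end

  def test_tels
    assert_equal("+1-727-231-0101", @hcard.tels[:default])
    assert_equal("+1-727-258-0207", @hcard.tels[:fax])
  end

  def test_adr
    assert_equal("200 2nd Ave. South #358", @hcard.adr.street_address)
    assert_equal("St. Petersburg", @hcard.adr.locality)
    assert_equal("FL", @hcard.adr.region)
    assert_equal("33701-4313", @hcard.adr.postal_code)
    assert_equal("USA", @hcard.adr.country_name)
  end
end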
[/ruby]
Because tel does not necessarily require the class="type" and class="value" attributes, we are treating the number that is marked up solely as class="tel" as the default.
[ruby]
class Hcard < Struct.new(:fn, :org, :email, :tels, :adr); end

class HcardScanner
  def self.find_all(html)
    doc = Hpricot(html)
    (doc/".vcard").map do |vcard|
      # Simple members map directly to class names.
      hcard = Hcard.new(*[:fn, :org, :email].map { |m| (vcard/".#{m}").inner_text })
      hcard.tels = find_tels(vcard)
      # Reuse AdrScanner for the nested adr Microformat.
      hcard.adr = AdrScanner.find_all(vcard.to_html)[0]
      hcard
    end
  end

  def self.find_tels(vcard)
    tels = {}
    (vcard/".tel").each do |tel|
      type = (tel/".type").inner_text
      if type.empty?
        # Marked up solely as class="tel": treat it as the default number.
        type = :default
        value = tel.inner_text
      else
        type = type.downcase.to_sym
        value = (tel/".value").inner_text
      end
      tels[type] = value
    end
    tels
  end
  # A bare `private` has no effect on methods defined with `def self.`.
  private_class_method :find_tels
end
[/ruby]
For each vcard element found in the HTML, we construct a new Hcard object. We can use the AdrScanner to extract the adr element and pass it on to the Hcard. For all occurrences of tel, we have to check for the presence of class="type" and class="value" and add them as entries to the Hcard#tels hash. In the absence of those two attributes, we add the phone number to the tels hash keyed as :default.
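The keying rule for phone numbers can be looked at in isolation. This is a minimal sketch, assuming the type and value texts have already been extracted; key_tels is a hypothetical helper that takes them as plain pairs, and the second call uses made-up numbers purely to show what happens when a type repeats.

```ruby
# Each pair is [type text, number], standing in for what a parser would extract.
# An empty type means the element was marked up only as class="tel".
def key_tels(pairs)
  tels = {}
  pairs.each do |type, value|
    key = type.empty? ? :default : type.downcase.to_sym
    tels[key] = value   # a later number of the same type replaces the earlier one
  end
  tels
end

tels = key_tels([["", "+1-727-231-0101"], ["Fax", "+1-727-258-0207"]])
tels[:default]   # "+1-727-231-0101"
tels[:fax]       # "+1-727-258-0207"

# Hypothetical repeated type: only the last number survives.
key_tels([["Fax", "111"], ["Fax", "222"]])   # {:fax => "222"}
```

If several numbers of the same type need to be kept, the hash values could be arrays instead; that is a design choice the scanner above does not make.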
