A collection of computer systems and programming tips that you may find useful.
Brought to you by Craic Computing LLC, a bioinformatics consulting company.

Friday, September 19, 2008

Ruby Hpricot Tip - Extracting Arbitrary Blocks of HTML

Hpricot is an HTML parser for Ruby, written by 'whytheluckystiff', and is a great tool for extracting information from web pages.

If the target page uses divs with unique ids or classes then this task is especially easy, but most of the pages I care about are not as well designed as they might be. I often come across pages where the distinct sections are delimited by some arbitrary feature, such as a horizontal rule or simply a title in plain text.

Hpricot uses CSS selectors (as well as XPath) to pull out specific elements, but that approach is a poor match for pages like these, where no containing element marks the section boundaries.

Here is one way to solve this problem. I've set up a simple web page with four sections, separated by <hr> tags. You can find that here and you can find the Ruby code to parse it here.

The idea is to get the first element inside the page's body, then step through the following elements in turn, adding each one to a new Hpricot::Elements object until either an hr tag or the end of the document is reached. Each time the code finds a delimiter, it pushes the current Elements object onto an array and starts a new one.

Once done, you have an array of Hpricot::Elements objects, one for each section of your page. Each of these can be processed further using Hpricot.

The short version of the code is here (doc is assumed to be an already parsed Hpricot document):

el = doc.search("body > *").first
blocks = Array.new
block = Hpricot::Elements.new
while el
  # An hr tag closes the current block and starts a new one
  if el.to_html =~ /\<hr/
    blocks << block
    block = Hpricot::Elements.new
  end
  block << el
  el = el.next
end
blocks << block
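
To see the grouping logic on its own, here is a minimal sketch that runs without Hpricot. It is not the code from the linked script: the nodes are plain HTML strings standing in for Hpricot elements, and the method name split_on_delimiter and the sample data are made up for illustration. It leans on Ruby's Enumerable#slice_before, which starts a new group at every item that matches the delimiter.

```ruby
# Hypothetical helper: splits a flat list of HTML fragments into blocks,
# starting a new block at every fragment that begins with an <hr> tag.
# The fragments are plain strings here, standing in for Hpricot elements.
def split_on_delimiter(nodes, delimiter = /\A<hr\b/i)
  nodes.slice_before { |html| html.match?(delimiter) }.to_a
end

# Made-up sample data: three sections separated by <hr> tags
nodes = [
  "<h1>Section 1</h1>", "<p>First paragraph</p>",
  "<hr />",
  "<h1>Section 2</h1>", "<p>Second paragraph</p>",
  "<hr>",
  "<p>Third section</p>"
]

blocks = split_on_delimiter(nodes)
blocks.each_with_index do |block, i|
  puts "Block #{i + 1}: #{block.size} nodes"
end
```

In this sketch the delimiter is kept as the first item of the block it opens. If you would rather drop it, you could filter each block afterwards, for example with reject.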

Let me know if you have other solutions to this.
