In previous sections we used Markaby and RedCloth to generate HTML from Ruby code and data. In this section, we’ll look at doing the reverse by taking HTML code and extracting data from it in a structured fashion.
Hpricot is a Ruby library by “why the lucky stiff” designed to make HTML parsing fast, easy, and fun. You can install it as a RubyGem with gem install hpricot. Hpricot gets its speed from a compiled extension written in C, but Windows users don’t need a compiler: a special Windows build with the extension precompiled is available via RubyGems.
Once installed, Hpricot is easy to use. The following example loads the Hpricot library, places some basic HTML in a string, parses that string with Hpricot, and then searches for h1 elements (using search). Because search returns an array of matches, the example retrieves the first one (using first) and prints the HTML within it (using inner_html):

require 'rubygems'
require 'hpricot'
html = <<END_OF_HTML
<html>
<head>
  <title>This is the page title</title>
</head>
<body>
  <h1>Big heading!</h1>
  <p>A paragraph of text.</p>
  <ul><li>Item 1 in a list</li><li>Item 2</li><li class="highlighted">Item 3</li></ul>
</body>
</html>
END_OF_HTML
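# Parse the HTML string into an Hpricot document
doc = Hpricot(html)
# search returns every matching element; print the contents of the first h1
puts doc.search("h1").first.inner_html   # prints "Big heading!"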
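
If Hpricot isn't installed, the same pattern (search for elements, take the first match, read its contents) can be tried with REXML, the XML parser that ships with Ruby. This is only a rough sketch and not Hpricot's API: REXML expects well-formed XML, so it works here only because the sample markup happens to be well formed, and the XPath expressions and variable names are this sketch's own stand-ins for Hpricot's search.

```ruby
require 'rexml/document'

# A trimmed copy of the sample page; it must be well-formed XML for REXML.
html = <<END_OF_HTML
<html>
<body>
  <h1>Big heading!</h1>
  <ul><li>Item 1 in a list</li><li>Item 2</li><li class="highlighted">Item 3</li></ul>
</body>
</html>
END_OF_HTML

doc = REXML::Document.new(html)

# XPath.match returns an array of matching elements, much as search does,
# so first picks out the first match.
headings = REXML::XPath.match(doc, "//h1")
first_heading_text = headings.first.text

# The same idea applied to every list item.
item_texts = REXML::XPath.match(doc, "//li").map { |li| li.text }

# XPath.first returns only the first match; here, the item with a class attribute.
highlighted_text = REXML::XPath.first(doc, "//li[@class='highlighted']").text

puts first_heading_text   # prints "Big heading!"
puts item_texts.inspect
puts highlighted_text
```

Real-world HTML is rarely well-formed XML, which is why a forgiving HTML parser such as Hpricot is the better fit for pages you don't control.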