How I Used Hpricot and Mechanize in GeeQE

While building GeeQE I wanted to enhance the Creative Commons (CC) dump of Stack Overflow's data, mainly to capture Gravatar hashes and user badges. I decided to keep using Ruby, as I did for the XML loading (see my previous post on XML parsing with Ruby). The easy choice was Hpricot to parse the HTML from the users page and Mechanize to move from one page to the next.

The first thing I wanted to make sure of was that I scraped this data as efficiently as possible. That is why I use the users page instead of going over every single user's profile page. This approach is not perfect: the users pages most likely update as the script moves through them, so some users could be missed. I'm not looking for 100% coverage here, so that limitation was acceptable.

To run the user profile script described here, you will first need to load the database schema and then load the CC data dump with the XML loader script.

Two good sources of information about using Hpricot are the Hpricot showcase and the Hpricot challenge wiki pages.

The only tricky thing I needed to do with Hpricot was parsing the badge counts. The count itself sits in an inner span, while the name of the badge type appears only in the title of the outer span:

<div class="user-details">
  <a href="/users/256/example-user" >Example User</a><br>
  <span class="reputation-score" title="reputation score">22k</span>
  <span title="8 gold badges"><span class="badge1">&#9679;</span><span class="badgecount">8</span></span>
  <span title="5 silver badges"><span class="badge2">&#9679;</span><span class="badgecount">5</span></span>
  <span title="7 bronze badges"><span class="badge3">&#9679;</span><span class="badgecount">7</span></span>

I probably could have used the class of the inner span (badge1, badge2, badge3), but since those classes aren't descriptively named I couldn't be completely sure they would stay the same.

I used Hpricot's ability to match attribute values based on Trac Query syntax. You can see that here in "@title~=badge type", where ~= matches when the value of the title attribute contains the corresponding badge type name:

# Each search targets the badgecount span inside the outer span whose title names the badge type.
# A user with no badges of that type has no matching span, so the count defaults to 0.
user_bc = (user_info/"div[@class='user-details']/span[@title~=gold]/span[@class='badgecount']")
user_gold = user_bc && user_bc[0] ? user_bc[0].inner_html.to_i : 0
user_bc = (user_info/"div[@class='user-details']/span[@title~=silver]/span[@class='badgecount']")
user_silver = user_bc && user_bc[0] ? user_bc[0].inner_html.to_i : 0
user_bc = (user_info/"div[@class='user-details']/span[@title~=bronze]/span[@class='badgecount']")
user_bronze = user_bc && user_bc[0] ? user_bc[0].inner_html.to_i : 0
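
As a cross-check that doesn't depend on Hpricot, the same counts can also be read straight from the title attributes in the markup shown earlier. This is only a minimal sketch using Ruby's standard library and a regular expression, not what the script itself does. The names SAMPLE_HTML and badge_counts are made up for illustration, and the optional "s" in the pattern assumes a single badge would be titled "1 gold badge", which I haven't verified against the live pages.

```ruby
# Sample markup copied from the users page excerpt above.
SAMPLE_HTML = <<~HTML
  <div class="user-details">
    <a href="/users/256/example-user" >Example User</a><br>
    <span class="reputation-score" title="reputation score">22k</span>
    <span title="8 gold badges"><span class="badge1">&#9679;</span><span class="badgecount">8</span></span>
    <span title="5 silver badges"><span class="badge2">&#9679;</span><span class="badgecount">5</span></span>
    <span title="7 bronze badges"><span class="badge3">&#9679;</span><span class="badgecount">7</span></span>
HTML

# Hypothetical helper: read counts from titles such as title="8 gold badges".
# Badge types that do not appear in the markup keep the default of 0.
def badge_counts(html)
  counts = { "gold" => 0, "silver" => 0, "bronze" => 0 }
  # Assumption: a single badge is titled "... badge" (no trailing "s").
  html.scan(/title="(\d+) (gold|silver|bronze) badges?"/) do |count, type|
    counts[type] = count.to_i
  end
  counts
end

p badge_counts(SAMPLE_HTML)
```

A regular expression like this is more brittle than a real HTML parser, so I would only treat it as a sanity check against the Hpricot results.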

After parsing the page for user information, the script looks for the URL of the next page, then sleeps for a random amount of time before using Mechanize to pull that page down.
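
The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual script: the method name crawl, its keyword arguments, and the delay bounds are all hypothetical, and the page fetching and next-link lookup are passed in as callables so the sketch runs without Mechanize or Hpricot installed.

```ruby
# Hypothetical crawl loop: fetch a page, hand its body to the caller's block
# for parsing, look up the next page URL, pause for a random interval, repeat.
#
# fetch:    callable taking a URL and returning the page body
# next_url: callable taking a page body and returning the next URL, or nil
# sleeper:  callable taking a number of seconds (defaults to Kernel#sleep)
def crawl(start_url, fetch:, next_url:, min_delay: 2.0, max_delay: 6.0,
          sleeper: method(:sleep))
  url = start_url
  pages = 0
  while url
    body = fetch.call(url)
    yield body if block_given?        # parse the user details here
    pages += 1
    url = next_url.call(body)         # nil ends the loop
    # Only pause when there is another page to request.
    sleeper.call(rand(min_delay..max_delay)) if url
  end
  pages
end
```

With Mechanize, fetch could be something like `->(url) { agent.get(url).body }` where agent is a `Mechanize.new` instance. The next-page lookup depends on the pager markup of the users page, which isn't shown in this post, so I've left it as a callable.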
