While building GeeQE I wanted to enhance the CC dump of Stack Overflow's data. The main reason I wanted to do this was to capture Gravatar hashes and user badges. To do this I decided to continue using Ruby as I did with the XML loading (see my previous post on XML parsing with Ruby). The easy choice was of course Hpricot to parse the HTML from the users page and Mechanize to move from one page to the next.
One of the first things I needed to do while building the GeeQE iPhone application was process the CC data dump from Stack Overflow. The dump contains XML files representing tables from Stack Overflow with the largest file being posts.xml weighing in at 1.2G as of September. I decided it would be pretty easy to use Ruby to parse the XML and load the data into MySQL so I went about finding the right parser for the job.
If you haven't processed large amounts of XML before one thing to realize is that you don't want to use a DOM parser because it is going to load the entire XML structure into memory. What you want is a SAX parser that can work on the XML stream as it comes in. With this in mind I started looking around and quickly found an older benchmark post that gave me an educated guess that the LibXML library was going to be the fastest parser for Ruby. After figuring out how to use it I decided to also give a couple other libraries a shot to see how they stacked up, the other two I looked at were REXML and Nokogiri.