Tip: Use Hpricot’s XML Parsing Mode to Parse … XML

Patrick Reagan

I’ve been working on a project that requires handling a lot of data coming in from external sources – some of it in a delimited format (e.g. CSV) and some in XML format. When trying to decide how to handle the XML-formatted files, Mark suggested that I try using Hpricot instead of a more “traditional” XML parser like REXML. I really love how Hpricot handles traditional HTML documents, so having the same functionality for XML documents made my life much easier.

But, there was a catch. Given this sample XML file, finding the contents of the <link> tag wasn’t possible:

<?xml version="1.0" encoding="UTF-8"?>
 <posts>
  <post>
    <id>1</id>
    <title>Sheer Awesomeness!</title>
    <body>This is an awesome post.</body>
    <link>http://www.sneaq.net/sheer-awesomeness/</link>
  </post>
 </posts>

My first attempt to capture the URL (with XPath) wasn’t successful; it just returned an empty string:

require 'rubygems'
require 'hpricot'
doc = Hpricot(open('sample.xml'))
puts (doc/:posts/:post/:link).inner_html

As a test, I attempted to reconstruct the original document by using the to_html method:

doc = Hpricot(open('sample.xml'))
puts doc.to_html

<?xml version="1.0" encoding="UTF-8"?>
<posts>
  <post>
    <id>1</id>
    <title>Sheer Awesomeness!</title>
    <body>This is an awesome post.</body>
    <link />http://www.sneaq.net/sheer-awesomeness/
  </post>
</posts>

Strange – it appeared that the parser made certain assumptions about the structure and validity of the document, namely that the <link> tag could only contain attributes, not values (in HTML, <link> is an empty element with no content). After digging deeper into the documentation, I found that there is an :xml option when parsing a document, which is what Hpricot.XML turns on. The updated code looks like this (and returns the expected URL):

doc = Hpricot.XML(open('sample.xml'))
puts (doc/:posts/:post/:link).inner_html
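As a cross-check, here is a minimal sketch of the same lookup using REXML, the “traditional” parser mentioned earlier. It assumes the sample document is held in a string instead of being read from sample.xml; SAMPLE_XML and first_link are names made up for this example.

```ruby
require 'rexml/document'

# The sample document from above, inlined so the snippet is self-contained.
SAMPLE_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <posts>
    <post>
      <id>1</id>
      <title>Sheer Awesomeness!</title>
      <body>This is an awesome post.</body>
      <link>http://www.sneaq.net/sheer-awesomeness/</link>
    </post>
  </posts>
XML

# Return the text of the first <link> element found under posts/post.
def first_link(xml)
  doc = REXML::Document.new(xml)
  doc.elements['posts/post/link'].text
end

puts first_link(SAMPLE_XML)
```

A strict XML parser has no notion of HTML's empty elements, so <link> is treated like any other tag here.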

The moral? When parsing XML documents with Hpricot, you need to parse them as … well, XML.

Trick: Overriding ActiveRecord’s ID

Patrick Reagan

So you botched a portion of your long-running data import (who hasn’t?) and you need to re-import that part. The problem is, the rest of your dataset relies on the auto-numbered primary key that ActiveRecord has already generated behind the scenes and you need to re-use that ID. Not a problem:

>> AccountType.create(:id => 100, :name => 'Administrator')
=> AccountType id: 1, name: "Administrator",
   created_at: "2008-02-17 23:29:42", updated_at: "2008-02-17 23:29:42"

Oops, that ID doesn’t look quite right! Let’s try a different way:

>> a = AccountType.new(:name => 'Administrator')
=> AccountType id: nil, name: "Administrator", created_at: nil, updated_at: nil
>> a.id = 100
=> 100
>> a.save
=> true
>> a.reload
=> AccountType id: 100, name: "Administrator",
   created_at: "2008-02-17 23:35:20", updated_at: "2008-02-17 23:35:20"

Much better – since ActiveRecord doesn’t allow setting the ID through mass-assignment, you’ll need to set it separately to make it stick.
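To make the difference concrete, here is a toy sketch of the idea in plain Ruby. ToyRecord and PROTECTED_ATTRIBUTES are made up for illustration; this is not how ActiveRecord is implemented, just a model of mass-assignment skipping a protected attribute while a direct setter call goes through.

```ruby
# Toy model only: this is NOT ActiveRecord's implementation, just the idea.
class ToyRecord
  PROTECTED_ATTRIBUTES = [:id].freeze

  attr_accessor :id, :name

  # Mass-assignment: copy every attribute except the protected ones.
  def initialize(attributes = {})
    attributes.each do |key, value|
      next if PROTECTED_ATTRIBUTES.include?(key)
      send("#{key}=", value)
    end
  end
end

ignored = ToyRecord.new(:id => 100, :name => 'Administrator')
puts ignored.id.inspect   # the :id key was silently skipped

explicit = ToyRecord.new(:name => 'Administrator')
explicit.id = 100         # direct assignment bypasses the filter
puts explicit.id
```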

Trick: Tuning MySQL to Speed Up Bulk Inserts

Patrick Reagan

I’ve been working on a project that requires loading a large dataset into MySQL so I’m using the acts_as_importable plugin that I created. This allows me to generate MySQL bulk insert statements instead of doing the inserts through ActiveRecord. Even though this approach was much faster (2 minutes instead of 30 for ~100k records), I still needed to speed things up a bit more. I ran across this post that addresses the performance of MySQL’s LOAD DATA INFILE statement, but I wasn’t sure if it would work for bulk inserts, so I gave it a shot.
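For anyone who hasn't seen one, a bulk insert is just a single INSERT statement carrying many rows. Here is a rough sketch of building one in Ruby. The bulk_insert_sql helper is hypothetical and is not code from acts_as_importable, and its quoting is deliberately naive (the adapter's own quoting should be used for real data).

```ruby
# Hypothetical helper, not part of acts_as_importable. The quoting here is
# deliberately naive; real code should use the database adapter's quoting.
def bulk_insert_sql(table, columns, rows)
  column_list = columns.map { |c| "`#{c}`" }.join(', ')
  value_lists = rows.map do |row|
    values = row.map { |v| v.nil? ? 'NULL' : "'#{v.to_s.gsub("'", "''")}'" }
    "(#{values.join(', ')})"
  end
  "INSERT INTO `#{table}` (#{column_list}) VALUES #{value_lists.join(', ')}"
end

puts bulk_insert_sql('users', [:first_name, :last_name],
                     [['Patrick', 'Reagan'], ['Conan', "O'Brien"]])
```

One statement with many value lists means one round trip and one parse on the server, which is where most of the savings over row-by-row inserts come from.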

Initial Settings & Benchmarks

I performed the benchmarking and tuning on my MacBook, so the numbers don’t actually reflect what I would see on a production server. Instead, they gave me a good baseline to compare against future benchmark results. Since I am using InnoDB tables, here are the initial settings for the options I wanted to change:

[mysqld]
innodb_buffer_pool_size = 8M
innodb_log_file_size = 5M
innodb_flush_log_at_trx_commit = 1

And here’s the benchmark when importing 100K records:

$ time mysql sample < db/import/listings.sql

real    1m1.050s
user    0m0.945s
sys     0m0.130s

Not too bad, but if we extrapolate this across 10MM records it will take almost 2 hours to load the entire dataset!
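The extrapolation is simple linear scaling. Here it is as a quick sketch, assuming load time grows in proportion to record count (which is optimistic for a growing table); projected_seconds is just an illustrative name.

```ruby
# Linear extrapolation from the benchmark above (61.05s for 100,000 records).
def projected_seconds(sample_seconds, sample_records, target_records)
  sample_seconds * target_records / sample_records.to_f
end

seconds = projected_seconds(61.05, 100_000, 10_000_000)
puts format('%.0f seconds (about %.1f hours)', seconds, seconds / 3600)
```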

Improving Performance

I took the recommendations I found and updated the MySQL configuration file:

[mysqld]
innodb_buffer_pool_size = 1G
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 1

After restarting the server I was ready to try the same benchmark again, but I was unable to access the database tables. I could connect to the database server but had issues when selecting from the table:

080215 20:03:13  mysqld started
InnoDB: Error: log file ./ib_logfile0 is of different size 0 5242880 bytes
InnoDB: than specified in the .cnf file 0 268435456 bytes!
080215 20:03:14 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
080215 16:10:36 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect information in file: './sample/listings.frm'

The solution (from the MySQL documentation) is to stop the server cleanly and move the old logfiles out of the way, so that InnoDB can create new ones at the configured size:

$ cd /path/to/mysql/data
$ mv ib_logfile* ~/

Once the logfiles were moved away, I could restart the server and have access to the tables.

Final Benchmarks

The configuration changes cut the load time for my sample file by about 70%, from just over a minute to under 18 seconds:

$ time mysql sample < db/import/listings.sql

real    0m17.668s
user    0m0.952s
sys     0m0.130s
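For the record, here is how the two wall-clock times compare, as a small sketch (compare_timings is a made-up helper name):

```ruby
# Compare two wall-clock timings: returns [speedup factor, % time saved].
def compare_timings(before_seconds, after_seconds)
  factor = before_seconds / after_seconds
  saved  = (1 - after_seconds / before_seconds) * 100
  [factor, saved]
end

factor, saved = compare_timings(61.050, 17.668)
puts format('%.2fx faster, %.1f%% less time', factor, saved)
```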

Awesome!

Tip: Mocks - No Substitute (and That’s a Fact)

Patrick Reagan

When spec’ing out code for a Rails plugin I’m working on, I needed some methods that were only available as part of the current connection. Rather than create a connection, my first instinct was to start mocking:

describe User do
 
  before do
    @import_file = "#{RAILS_ROOT}/db/import/users.import.sql"
    User.acts_as_importable
  end
 
  it "should append to SQL when given a valid hash" do
    user = User.new
   
    columns_mock = mock()
    columns_mock.expects(:quote_column_name).times(2).returns("`first_name`", "`last_name`")
    user.expects(:connection).at_least_once.returns(columns_mock)
   
    User.expects(:quoted_table_name).returns("`users`")
    User.expects(:quote_value).with('Patrick').returns("'Patrick'")
    User.expects(:quote_value).with('Reagan').returns("'Reagan'")
   
    user.add :first_name => 'Patrick', :last_name => 'Reagan'
    User.sql.should == "INSERT INTO `users` (`first_name`, `last_name`) VALUES ('Patrick', 'Reagan')"
  end
   
end

Ouch, what am I testing here? This quickly took a turn for the worse – maybe creating a connection isn’t so bad to start with:

describe User do
 
  before(:all) do
    User.establish_connection(YAML.load_file("#{RAILS_ROOT}/config/database.yml")['test'])
  end
 
  before do
    @import_file = "#{RAILS_ROOT}/db/import/users.import.sql"
    User.acts_as_importable
  end
 
  it "should append to SQL when given a valid hash" do
    User.new.add :first_name => 'Patrick', :last_name => 'Reagan'
    User.sql.should == "INSERT INTO `users` (`first_name`, `last_name`) VALUES ('Patrick', 'Reagan')"
  end
   
end

Much more succinct and I no longer have to worry about the internals of my implementation.

Tip: Large File? Grep It!

Patrick Reagan

If the thought of typing “man grep” at the command line doesn’t make you cringe, you might find options that you never knew existed:

-m NUM, --max-count=NUM
    Stop  reading  a file after NUM matching lines.
-n, --line-number
    Prefix each line of output with the line number within its input
    file.

I knew about the -n option, but -m came in handy today when I was trying to grab a slice of an import file that contained sequential IDs:

1234|55601|1
1234|55687|2
1235|45990|1

Since I knew the next record past what I was interested in, a simple grep would tell me the line number that I could parse with Ruby and then grab the correct number of lines:

$ grep "^1235|" -m1 -n import_file.txt | cut -d: -f1
3
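On the Ruby side, the rest is short. This sketch assumes the line number has already been parsed from the grep output with to_i; lines_before is a hypothetical helper and the temp file stands in for the real import file.

```ruby
require 'tempfile'

# Return the lines that come before the given 1-based line number.
def lines_before(path, line_number)
  File.foreach(path).first(line_number - 1)
end

# Demo with the three sample records shown above.
Tempfile.create('import_file') do |file|
  file.write("1234|55601|1\n1234|55687|2\n1235|45990|1\n")
  file.flush
  # 3 is what the grep pipeline printed; in practice it would be parsed
  # from the command's output with to_i.
  puts lines_before(file.path, 3)
end
```

File.foreach without a block returns an enumerator, so only the lines that are needed get read.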

Lesson: don’t be afraid to shell out.

Advanced Rails Recipe #4: Custom Response Formats

Patrick Reagan

Just a quick announcement – one of the recipes I submitted for the upcoming Advanced Rails Recipes book has been published in the current beta release. My 4-page tutorial steps you through the process of having your application respond to requests for formats that Rails doesn’t know about out of the box – in this case, audio files.

This is a concept that I pulled from the existing Lemurgene codebase, though the current PHP implementation isn’t quite as nice as the Rails version. Creating this same behavior in a Rails application relies on the config/initializers/mime_types.rb file along with the controller that’s used to serve up the audio files. First, the configuration:

Mime::Type.register 'audio/mpeg', :mp3

And in the controller that the user hits to retrieve the file:

class Mp3sController < ApplicationController

  def show
    mp3 = Mp3.find(params[:id])
   
    respond_to do |format|
      format.mp3 { redirect_to mp3.url }
    end
  end
 
end

Obviously, this isn’t the complete version. Check out the book to see this recipe (and others) in their entirety.

Rails Gotcha: Patching a Class From Your Plugin

Patrick Reagan

While I was working on the caches_constants plugin for the upcoming Advanced Rails Recipes book, I had a bit of trouble adding a new method to the String class. Following some examples I found in the Rails source, I created a new module:

module Viget
  module Format
    def constant_name
      value = self.strip.gsub(/\s+/, '_').gsub(/[^\w_]/, '').upcase
      value = nil if value.blank?
      value
    end
  end
end

To hook it in as an instance method on the String class, I originally dropped this into the init.rb file for the plugin:

class String
  include Viget::Format
end

One final test, and I should have been ready to go (from ./script/console):

>> "Pending Approval".constant_name
NoMethodError: undefined method `constant_name' for "Pending Approval":String
        from (irb):1

Strange. As a test, I re-pasted the include from init.rb into the console and tried again:

>> class String
>>   include Viget::Format
>> end
=> String
>> "Pending Approval".constant_name
=> "PENDING_APPROVAL"

At this point it appeared there wasn’t anything wrong with my code, but for some reason the mix-in wasn’t included when run from inside the plugin initialization. My first fix was to drop the include code into the file that contained the module definition.

This produced the desired effect, but I wasn’t happy with the implementation. In a last-ditch effort, I decided to try using send to perform the mixin:

String.send(:include, Viget::Format)

This functioned as expected and was consistent with how I was extending ActiveRecord::Base. To check out the full implementation, browse the source or install the plugin:

$ cd my_rails_app
$ ./script/plugin install http://svn.extendviget.com/lab/trunk/plugins/caches_constants
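If you want to try the send-based mixin outside of Rails, here is a self-contained sketch. ConstantNaming is a stand-in name for the plugin's module, and it uses plain Ruby's empty? in place of ActiveSupport's blank? so that it runs without Rails loaded.

```ruby
# Hypothetical module name; the plugin's real module is Viget::Format.
module ConstantNaming
  # "Pending Approval" -> "PENDING_APPROVAL"; nil when nothing usable remains.
  def constant_name
    value = strip.gsub(/\s+/, '_').gsub(/\W/, '').upcase
    value.empty? ? nil : value
  end
end

# include was private on older Rubies, hence send; it has been public
# since Ruby 2.5, where String.include(ConstantNaming) also works.
String.send(:include, ConstantNaming)

puts "Pending Approval".constant_name
```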

Using Parameter Matchers in Mocha

Patrick Reagan

As James pointed out in the comments on my last post, Mocha has the ability to do parameter matching based on type. This is similar to the behavior of FlexMock that I discussed earlier and it’s available from SVN if you want to try it for yourself.

To grab the code and build the gem, head on over to the Mocha repository at RubyForge:

$ svn co svn://rubyforge.org/var/svn/mocha/trunk mocha
$ cd mocha
$ rake gem
$ sudo gem install pkg/mocha-0.5.2.gem # or whatever gem it builds

Now, let’s try it out by testing some basic code (yes, this is backwards – always test first):

class Die
  def roll
    rand(6) + 1
  end
end

Here’s the corresponding test:

require 'rubygems'
require 'test/unit'
require 'mocha'
require 'die'

class DieTest < Test::Unit::TestCase
  def test_roll_should_return_rand_plus_one
    die = Die.new
    die.stubs(:rand).once.with(kind_of(Integer)).returns(5)
    assert_equal 6, die.roll
  end
end

Excellent! Thanks to James and the rest of the Mocha team for this one.
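If you're curious what that stub amounts to, here is a hand-rolled stand-in in plain Ruby. It is not how Mocha works internally; rigged_die is a made-up helper that replaces rand on a single instance and insists on an Integer argument, roughly what kind_of(Integer) checks.

```ruby
class Die
  def roll
    rand(6) + 1
  end
end

# Hand-rolled stand-in for the Mocha stub: replace rand on this one
# instance, and insist on an Integer argument like kind_of(Integer) does.
def rigged_die(fixed_value)
  die = Die.new
  die.define_singleton_method(:rand) do |max|
    raise ArgumentError, "expected an Integer, got #{max.inspect}" unless max.is_a?(Integer)
    fixed_value
  end
  die
end

puts rigged_die(5).roll
```

The library version is nicer because the stub is removed again after each test, which this sketch doesn't need since the method lives on a throwaway instance.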

Getting Re-Acquainted With FlexMock

Patrick Reagan

Ok, so I’m on somewhat of a “mocking” kick at the moment. I’m actually preparing a bit of material for an upcoming talk at the next NovaRUG meeting where I’ll be (briefly) talking about mocking libraries in Ruby.

A bit of history – I had switched to using Mocha after initially getting hooked on Jim Weirich’s FlexMock library back in January. This was the first time I had seen mocks used in Ruby and I was completely blown away by how powerful they were. This was definitely not something I was able to do in PHP.

Now that I’m actively comparing the two libraries, I’m noticing that the issues I previously had with FlexMock have disappeared with the newest version (0.6.2 as I write this). It’s still a bit verbose, but I’m now finding it useful to mock out methods based on the type of parameters that they receive instead of the actual value. For example, this is how you would mock out an ActiveRecord::Base.find call with Mocha:

Account.expects(:find).with(1).returns(Account.new(:username => 'preagan'))

FlexMock allows you to be a little more liberal when matching against a method call:

flexmock(Account).should_receive(:find).with(Integer).and_return(Account.new(:username => 'preagan'))

This is something that Stuart Halloway mentioned during a talk he gave in Richmond a few months ago. I still like the syntax of Mocha, but this feature of FlexMock may be enough to bring me back.
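As far as I can tell from the FlexMock documentation, with compares each expected argument to the actual one using Ruby's === operator, which is why a class like Integer can stand in for a value. Here is a small sketch of that kind of matching in plain Ruby; arguments_match? is a hypothetical helper, not FlexMock code.

```ruby
# Hypothetical helper showing === based argument matching in plain Ruby.
def arguments_match?(expected, actual)
  expected.size == actual.size &&
    expected.zip(actual).all? { |pattern, value| pattern === value }
end

puts arguments_match?([Integer], [1])        # a class matches its instances
puts arguments_match?([1], [1])              # a plain value matches itself
puts arguments_match?([Integer], ['one'])    # wrong type, so no match
puts arguments_match?([/^pre/], ['preagan']) # a regexp matches strings
```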

Mocking With Mocha

Patrick Reagan

While I was writing a simple URL validation plugin that checked the HTTP status of a resource, I needed to test some code that made heavy use of some Net::HTTP methods. Here is a simplified version of the code:

require 'uri'
require 'net/http'

class Resource
  def self.exists?(url)
    begin
      uri = URI.parse(url)
      path = uri.path.strip.length == 0 ? '/' : uri.path.strip
      response = Net::HTTP.new(uri.host).head(path)
      exists = response.is_a?(Net::HTTPSuccess)
    rescue
      exists = false
    end
    exists
  end
end

When writing tests I don’t need to test the HTTP methods, and I don’t want to access a resource that might not be available during my test run. So, how do I hit that code path that will make exists ‘true’? The solution here is to use Mocha to stub out the behavior that my code is using:

require 'test/unit'
require 'mocha'
require 'resource'

class ResourceTest < Test::Unit::TestCase
  def test_exists_with_http_success_should_return_true
    success_mock = mock(:head => Net::HTTPSuccess.new('1.2', '200', 'OK'))
    Net::HTTP.expects(:new).returns(success_mock)
    assert_equal true, Resource.exists?('http://www.example.com')
  end 
end

That works well for success, but what happens if URI.parse raises a URI::InvalidURIError?

def test_exists_when_uri_parse_fails_should_return_false
  URI.expects(:parse).raises(URI::InvalidURIError)
  assert_equal false, Resource.exists?('http://www.example.com')
end

I really like the concise syntax of Mocha as compared to other mocking libraries and the added bonus that it keeps the same type information for the instances that have stubbed methods. This is something that I struggled with for a while when trying to use FlexMock for this task.
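As a side note, the failure path can also be exercised without any mocking by passing in a URL that is actually malformed. This sketch uses a condensed version of the Resource class above (same behavior, restructured slightly) and relies on URI.parse rejecting a URL with a space in the host; the URL itself is made up.

```ruby
require 'uri'
require 'net/http'

# Condensed version of the Resource class from the post.
class Resource
  def self.exists?(url)
    uri = URI.parse(url)
    path = uri.path.strip.empty? ? '/' : uri.path.strip
    Net::HTTP.new(uri.host).head(path).is_a?(Net::HTTPSuccess)
  rescue StandardError
    false
  end
end

# The space makes URI.parse raise URI::InvalidURIError before any
# request is attempted, so this does not touch the network.
puts Resource.exists?('http://www.exa mple.com')
```

The mock is still the better fit when the error condition can't be triggered so easily with real input.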