Scraping a website using Ruby is surprisingly easy. With the Nokogiri gem [1] and a few lines of code you can start to extract information from a website quickly and easily.

To get started, create a plain text file with a .rb extension and mark it as executable [2]. In your terminal, type:

touch scraper.rb
chmod +x scraper.rb

Now that we have our script file it’s time to add some content to it. To start, try copying the following lines into your scraper.rb file:

#!/usr/bin/env ruby
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open('http://www.nokogiri.org'))
links = doc.css('a').map{ |link| link['href']}

puts links

The first line tells the OS how to execute this file [2].

The next two require lines tell Ruby that we need some non-core functionality: open-uri from the standard library, and Nokogiri from the Nokogiri gem installed on your machine [1].

We then get to the meat of the script. On line 5 (line 4 is blank), doc = creates a variable called doc and assigns it the result of the right-hand side expression. That expression fetches the passed-in URL and hands the response to Nokogiri::HTML, which parses the entire document into memory.

Line 6 is the most interesting. Taking the parsed document (doc), we use the .css method with standard CSS notation to fetch every <a>...</a> element into what Nokogiri calls a NodeSet.

Once we’ve done that, .map{ ... } tells Ruby to execute the contained code for every item in the NodeSet. The |link| label could have been named anything, e.g. |x| or |i|, but for clarity I like to describe the contents [3].

Each matching Nokogiri node goes into the { } block with all of its attributes, and inside the block we ask the node for the value of its href attribute with link['href'] [4]. We end up with an array of relative URLs which gets assigned to the links variable at the start of line 6.

We then reach the final line of the script, puts links, which just spits the array out to the console, one item per line.
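Since many of the scraped hrefs will be relative, a natural next step is resolving them against the page’s base URL. Ruby’s standard library can do this with URI.join; a small sketch (the base URL and paths here are just example values):

```ruby
require 'uri'

# Resolve relative hrefs against the page they were scraped from.
base  = 'http://www.nokogiri.org'
links = ['/getting-started', 'tutorials/parsing.html']

absolute = links.map { |href| URI.join(base, href).to_s }
puts absolute
```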

This is a very simple example but hopefully it helps you get started.

I first used this approach to test out the core idea for an app I am working on before getting bogged down in all the details and things I don’t yet know about creating full-featured Rails apps. It’s potentially also a good fit for automating a one-off task that would otherwise involve a lot of fussing about in the page source of dozens of web pages.

  1. Installing Nokogiri varies depending on your OS and how your machine is already set up. macOS already has Ruby and RubyGems installed, so installing Nokogiri (gem install nokogiri) should be quite straightforward.

  2. Technically the second step is not required. We can leave out the chmod from the terminal step and the shebang (#!/usr/bin/env ruby) from the Ruby script; it just makes executing the script slightly more complicated (ruby scraper.rb instead of ./scraper.rb).

  3. See the .map documentation for more info. It’s quite succinct to write, so when I’m reading it I try to think of what it’s actually doing. For example, my_collection.map{ |item| item.value } could be read as “for every item in my_collection, return the item’s value”.
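That reading works for any collection, not just a NodeSet. A tiny plain-Ruby example:

```ruby
# .map builds a new array by running the block on every element.
words   = ['one', 'two', 'three']
lengths = words.map { |word| word.length }

puts lengths.inspect  # prints [3, 3, 5]
```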

  4. To inspect what other attributes a node has, we can just grab the first one with first = doc.css('a').first and then inspect it with puts first.inspect.