Scraping the DOM with CSS3 selectors and Perl

    The other day I stumbled upon this blog post http://liamkaufman.com/blog/2012/03/08/scraping-web-pages-with-jquery-nodejs-and-jsdom/ about scraping the web with JavaScript.
    I immediately thought: why would you use JavaScript for that? One of the author’s arguments was that the CSS3 selectors and DOM traversal that jQuery offers take less effort than accessing DOM elements with XPath. I agree, but why reinvent the wheel if other tools already have proven, solid methods for this? I strongly believe in choosing the right technology for the job instead of wasting time writing things others have already written. So why not use Perl? It has several modules for accessing DOM elements with CSS3 selectors, just like jQuery. WWW::Mechanize lets you simulate a browser: filling out web forms, pushing buttons and following links. Perl has great text manipulation capabilities, which come in handy when scraping web pages. It is a proven, reliable and robust tool, used for mission-critical projects in the public and private sectors all over the world. And hey… Perl is listed in the Oxford English Dictionary… ;)
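    As a quick taste of how jQuery-like this gets, here is a minimal, self-contained sketch (the HTML snippet is made up for the example):
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mojo::DOM;

    # Some made-up HTML to select from
    my $dom = Mojo::DOM->new('<p><a href="/apa/1.html">EUR1150 / 1br</a></p>');

    # jQuery would do $('p > a').each(...); Mojo::DOM reads almost the same
    $dom->find('p > a')->each(sub {
        my $link = shift;
        print $link->{href}, " => ", $link->text, "\n";
    });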

    We are going to scrape Craigslist’s housing section for Amsterdam, since that is where I live. It is a good exercise because Craigslist’s HTML is not particularly well structured, which makes it a little harder to scrape.

    Let’s think about the flow of the program for a moment. This is the page I want to scrape: http://amsterdam.en.craigslist.org/apa/; let’s call it the basic page. On the basic page we have 100 links to the details page of each apartment. At the bottom of the basic page there is a “next 100 postings” link, so for each basic page we can create a loop that grabs the links on it, loads the HTML into a variable and extracts the data we need, until there is no “next 100 postings” link anymore. Basically we have two nested loops: the outer one does the navigating, the inner one grabs the HTML for each link on the basic page.
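    In outline the program looks like this (just a sketch of the structure, the real code follows below):
    {
        # inner loop: for every apartment link on the current basic page,
        # fetch the details page and extract the data we want

        # outer loop: if there is a "next 100 postings" link, follow it
        # and restart this block
        redo if $mech->follow_link(text => 'next 100 postings');
    }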

    Okay, enough talking already, let’s get our hands dirty.

    First we need to install a few modules:

    WWW::Mechanize (Simulate a browser)
    Mojo::DOM (DOM traversal stuff, CSS3 selectors)
    JSON::XS (Dump our data in JSON format)

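    If you have cpanminus on your system, for example, a single command installs all three (any other way of installing CPAN modules works just as well):
    cpanm WWW::Mechanize Mojo::DOM JSON::XS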
    I’m assuming you already know how to install Perl modules. If you don’t, no worries, there is plenty of information about that on the web. This is the first part.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use WWW::Mechanize;
    use Mojo::DOM;
    use JSON::XS;
    use feature 'say'; # use "say" instead of "print", adds a newline after the printed output

    # Let's define our house variables
    my $deeplink;
    my $title;
    my $price;
    my $bedrooms;
    my $place = "Amsterdam";
    my $area;
    my $description;
    my @pictures;

    # Open the JSON file to write our data to (lexical filehandle, three-argument open)
    open my $json_fh, ">", "craigslist.json" or die $!;

    # Initialize the mech object
    my $mech = WWW::Mechanize->new();

    # Get our link and fake the User-Agent
    $mech->get("http://amsterdam.en.craigslist.org/apa/",
        'User-Agent' => 'Mozilla/4.76 [en] (win98; U)',
    );

    # Die and report the status if WWW::Mechanize can't get the URL
    die $mech->response->status_line unless $mech->success;
    
    The next explanation is for Perl newcomers; if you already have some experience with Perl you can skip this part, because the comments in the script should be sufficient to understand what’s going on.

    First we load the necessary Perl modules, and after that we declare our housing variables. Since we’re using the strict pragma we must declare all our variables (with “my”) before using them. It is good practice to use strict; more info about it can be found here: http://docstore.mik.ua/orelly/perl4/lperl/ch04_10.htm. Warnings are also good to load, they can save you a lot of time finding bugs. After declaring our housing variables we open a JSON file to write our data to. Next we create an instance of the WWW::Mechanize object and use the “get” method to grab our link, passing a fake user agent along with the request. The last line of the snippet aborts the script and reports the status if WWW::Mechanize can’t get the URL, e.g. because it is malformed or doesn’t exist. The content of the basic page is now loaded in our WWW::Mechanize object.
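    A small side note: instead of passing the User-Agent header with every get, you could also set it once when creating the object; WWW::Mechanize hands the agent option through to LWP::UserAgent. Both approaches do the same thing here:
    my $mech = WWW::Mechanize->new(
        agent => 'Mozilla/4.76 [en] (win98; U)', # set once, used for every request
    );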

    Let’s take a look at the next part.
    # Create a loop block for the redo at the bottom
    {

        # Instantiate a Mojo::DOM object with the WWW::Mechanize content in it
        my $dom = Mojo::DOM->new($mech->content);

        # Find the links to the housing details
        $dom->find('p > a')->each(sub {

            $deeplink = shift->{href};

            # Get the details HTML for further processing
            my $details_page = get $deeplink;
            die "Couldn't get $deeplink" unless defined $details_page;

            # Create a new Mojo::DOM object with the details of the house
            my $dom_details = Mojo::DOM->new($details_page);

            # Title #
            $title = $dom_details->at('h2')->text;

            # Description #
            $description = $dom_details->at('div#userbody')->text;

            # Area #
            $area = $dom_details->at('ul.blurbs li')->text;

            # Remove "Location:" from the string
            $area =~ s/Location: //;

            # The price is always in front of the title, delimited by a forward slash, so let's grab it with a simple split on the forward slash #
            my @price = split /\/ /, $title;
            $price = $price[0];

            # Get rid of the "EUR" in front of it
            $price =~ s/EUR//;

            # The number of bedrooms is also in the title, so let's use our price split again #
            $bedrooms = $price[1];
            my @bedrooms = split / -/, $bedrooms;
            $bedrooms = $bedrooms[0];

            # Get rid of the "br" in it
            $bedrooms =~ s/br//;

            # Create an array with all the house variables
            my @house_data = ($deeplink, $title, $description, $area, $price, $bedrooms);

            # Images #
            $dom_details->find('img')->each(sub {

                # Grab this image's source
                @pictures = shift->{src};

                # Remove /thumb from the image src, those are too small; the larger picture is one folder up
                for (@pictures) {
                    s/\/thumb//;
                }

                # Append the picture to the house_data array
                push(@house_data, @pictures);
            });

            # Create an array reference to pass to JSON::XS
            my $house_data_array_ref = \@house_data;

            # Encode JSON
            my $json = encode_json $house_data_array_ref;

            # Write to the json file
            say $json_fh $json;

        });

        redo if $mech->follow_link(text => 'next 100 postings');
    }

    close $json_fh;
    
    Let’s break this down. First we want to create the outer loop that checks whether there is a “next 100 postings” link. We use the loop-control keyword redo with an if statement for that. If the “follow_link” method finds a link with the text “next 100 postings”, it loads the content of the new basic page into our instantiated WWW::Mechanize object and restarts the loop block.
    # Create a loop block for the redo at the bottom
    {
    ...
    ...
        redo if $mech->follow_link(text => 'next 100 postings');
    }
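    If the bare block with redo feels too exotic, the outer loop can also be written as a plain while loop; this sketch is functionally equivalent:
    while (1) {
        # ... scrape the current basic page here ...

        # stop when there is no "next 100 postings" link anymore
        last unless $mech->follow_link(text => 'next 100 postings');
    }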
    
    To use CSS3 selectors for scraping we need to instantiate a Mojo::DOM object with the WWW::Mechanize content in it; that happens here:
    my $dom = Mojo::DOM->new($mech->content);
    
    Now that we have our content in a Mojo::DOM object, let’s do some CSS selecting. First we want to get the URL of each house. Here is the HTML:
    <p><a href="http://amsterdam.en.craigslist.org/apa/3017387588.html">EUR1150 / 1br - 50ft&sup2; - Nicely furnished ...</a> </p>
    <p><a href="http://amsterdam.en.craigslist.org/apa/3017379903.html">EUR2500 / 2br - 2500 euros 2 bedr...</a> </p>
    .....
    .....
    
    Let’s create the CSS selector and walk through the links:
    $dom->find('p > a')->each(sub {
    
    For each link we are going to get the HTML and load it into a variable so we can start extracting data.
        $deeplink = shift->{href};

        # Get the details HTML for further processing
        my $details_page = get $deeplink;
        die "Couldn't get $deeplink" unless defined $details_page;
    
    Now that we have the details page, we can instantiate a Mojo::DOM object with the details of the house:
        # Create a new Mojo::DOM object with the details of the house
        my $dom_details = Mojo::DOM->new($details_page);
    
    Let’s grab some data!
    
        # Title #
        $title = $dom_details->at('h2')->text;

        # Description #
        $description = $dom_details->at('div#userbody')->text;

        # Area #
        $area = $dom_details->at('ul.blurbs li')->text;

        # Remove "Location:" from the string
        $area =~ s/Location: //;

        # The price is always in front of the title, delimited by a forward slash, so let's grab it with a simple split on the forward slash #
        my @price = split /\/ /, $title;
        $price = $price[0];

        # Get rid of the "EUR" in front of it
        $price =~ s/EUR//;

        # The number of bedrooms is also in the title, so let's use our price split again #
        $bedrooms = $price[1];
        my @bedrooms = split / -/, $bedrooms;
        $bedrooms = $bedrooms[0];

        # Get rid of the "br" in it
        $bedrooms =~ s/br//;
    
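    To make the two splits concrete, here is what happens to the first title from the listing HTML above, with the intermediate values shown in the comments (Mojo::DOM has already decoded the &sup2; entity to ²):
    my $title = 'EUR1150 / 1br - 50ft² - Nicely furnished ...';

    my @price = split /\/ /, $title;       # ('EUR1150 ', '1br - 50ft² - Nicely furnished ...')
    my $price = $price[0];                 # 'EUR1150 '
    $price =~ s/EUR//;                     # '1150 ' (note the leftover trailing space)

    my @bedrooms = split / -/, $price[1];  # ('1br', ' 50ft²', ' Nicely furnished ...')
    my $bedrooms = $bedrooms[0];           # '1br'
    $bedrooms =~ s/br//;                   # '1'

    If that trailing space in the price bothers you, a quick $price =~ s/\s+$//; cleans it up.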
    Ok, we have the data of each house loaded in variables. Let’s create one array out of them, so later on we can join the housing-data array with the pictures array. See the declaration of the pictures array at the top of the script: my @pictures;
    
            # Create an array with all the house variables
            my @house_data = ($deeplink, $title, $description, $area, $price, $bedrooms);
    
    It’s time to grab the pictures of each house:
    
        # Images #
        $dom_details->find('img')->each(sub {

            # Grab this image's source
            @pictures = shift->{src};

            # Remove /thumb from the image src, those are too small; the larger picture is one folder up
            for (@pictures) {
                s/\/thumb//;
            }

            # Append the picture to the house_data array
            push(@house_data, @pictures);
        });
    
    First we find the img tags; since each house has more than one image we walk through each one and add it to the previously declared pictures array. Some image sources have “thumb” in the path: those live in a thumbs directory and are therefore really small. I prefer bigger pictures, and luckily one directory up there is an image with the same name but bigger, so we remove /thumb from the path. Lastly we append the pictures to the housing array, so everything ends up in @house_data. Now we can pass that array (actually an array reference) to JSON::XS to dump our data as JSON.
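    For example, on a hypothetical image URL (the real Craigslist image paths may look different) the substitution does this:
    my $src = 'http://images.example.org/thumb/abc123.jpg'; # hypothetical src
    $src =~ s/\/thumb//;                                    # http://images.example.org/abc123.jpg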

    JSON::XS needs a reference before it can encode our array to JSON, so let’s create one. After that we encode our housing data to JSON format and write it to the already opened .json file. Lastly we close the inner loop, the outer loop and the .json file.
    
        # Create an array reference to pass to JSON::XS
        my $house_data_array_ref = \@house_data;

        # Encode JSON
        my $json = encode_json $house_data_array_ref;

        # Write to the json file
        say $json_fh $json;

    });

        redo if $mech->follow_link(text => 'next 100 postings');
    }

    close $json_fh;
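    Every line of craigslist.json now holds one house as a JSON array. A made-up line, just to show the shape of the output, could look like this:
    ["http://amsterdam.en.craigslist.org/apa/3017387588.html","EUR1150 / 1br - 50ft² - Nicely furnished ...","(description text)","Amsterdam Centrum","1150 ","1","http://images.example.org/abc123.jpg"]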
    For security reasons it’s important to check the return values of the public data you are scraping: there could be malicious JavaScript in the HTML tags, binaries behind the links or image sources, and so on. This is not an out-of-the-box solution but merely an example of the possibilities that Perl offers for web scraping.
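    A minimal example of such a check: inside the ->each callback you could make sure a scraped link actually points at the listing section before fetching it (a sketch, not a complete defense):
    # Inside the ->each callback: skip anything that doesn't look like a listing URL
    return unless $deeplink =~ m{^http://amsterdam\.en\.craigslist\.org/apa/};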

    Further reading: http://shop.oreilly.com/product/9780596005771.do. It’s a little outdated (I don’t think Mojo::DOM existed yet when the book came out), but it has some nice techniques in it.

    That’s it! You can put this script in a nightly cronjob and feed your cool web 11.0 ;) app with the JSON. Happy scraping!