Data Science: Don’t build a crawler (if you can avoid it!)
Mark Needham explains how he used Neo4j to build data sets for personal use when open data isn’t available!
On Tuesday I spoke at the Data Science London meetup about football data and I started out by covering some lessons I’ve learnt about building data sets for personal use when open data isn’t available.
When that’s the case, you often end up scraping HTML pages to extract the data that you’re interested in and then storing it in files, or in a database if you want to be fancier.
Ideally we want to spend our time playing with the data rather than gathering it, so we want to keep this stage to a minimum, which we can do by following these rules.
Don’t build a crawler
One of the most tempting things to do is build a crawler which starts on the home page and then follows some or all of the links it comes across, downloading those pages as it goes.
This is incredibly time-consuming, and yet this was the approach I took when scraping an internal staffing application to model ThoughtWorks consultants/projects in Neo4j about 18 months ago.
Ashok wanted to get the same data a few months later and instead of building a crawler, spent a bit of time understanding the URI structure of the pages he wanted and then built up a list of pages to download.
It took him on the order of minutes to build a script that would get the data, whereas I spent many hours using the crawler-based approach.
If there is no discernible URI structure, or if you want to get every single page, then the crawler approach might make sense, but I try to avoid it as a first port of call.
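As a rough illustration of the list-building approach, here is a minimal Python sketch. The URI template, the `season` and `match_id` parameters and the `build_urls` helper are all made up for the example; the real structure depends entirely on the site you’re looking at.

```python
from itertools import product

# Hypothetical URI template: an assumption for illustration only.
# Work out the real structure by browsing the site you want to scrape.
URL_TEMPLATE = "https://example.com/{season}/matches/{match_id}"


def build_urls(seasons, match_ids):
    """Return one URL for every (season, match_id) combination."""
    return [
        URL_TEMPLATE.format(season=season, match_id=match_id)
        for season, match_id in product(seasons, match_ids)
    ]


# Two seasons with three matches each gives six pages to download.
urls = build_urls(["2012", "2013"], range(1, 4))
for url in urls:
    print(url)
```

Once the list exists, there are no links to follow and no crawl state to manage; the script just iterates over `urls`.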
Download the files
The second thing I learnt is that running WebDriver or Nokogiri or Enlive against live web pages and then only storing the parts of the page we’re interested in is suboptimal.
We pay the network cost every time we run the script, and at the beginning of a data gathering exercise we won’t know exactly what data we need, so we’re bound to have to run it multiple times until we get it right.
It’s much quicker to download the files to disk and work on them locally.
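One way to do this, sketched in Python using only the standard library, is to save each page under a file name derived from its URL and skip the network whenever that file already exists. The `pages` directory and the `cache_path`, `fetch_from_network` and `download` helpers are names I’ve invented for this sketch rather than part of any particular tool.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def cache_path(url, cache_dir):
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.html"


def fetch_from_network(url):
    """Download the raw bytes of a page."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download(url, cache_dir="pages", fetch=fetch_from_network):
    """Return the local path of the page, downloading it only if it is missing."""
    path = cache_path(url, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(url))
    return path
```

Calling `download(url)` for each page in the list pays the network cost once; after that the parsing code can be rerun against the files on disk as many times as it takes to get the extraction right.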