Zero to Hero... Data Collection through Web Scraping

Why?

Machine learning is cool, but we can't really do much without data. So let's kick off our journey the right way through web scraping!

Now I'm going to preface this post by stating there are two options for the kind of data we could collect:

  • Popular, easy-to-manage data
  • Unique and niche data

For this mini-series we're aiming to create a unique, exciting and impressive project, so we're skipping the overused, clichéd projects! Instead, we've chosen to create a video game recommendation system.

How we got here

Before diving into code and pulling logic apart, I feel it's important to give a brief overview of how we'll collect data:

  • Decide on what data we're looking for (game titles, summaries, and reviews)
  • Research and find potential data sources (Wikipedia and Metacritic)
  • Scrape and export final input data

Primary precedence

Let me start with a little insight into this project: I'm actually pairing up with a friend who has just started their machine learning journey. I understand how it feels to work with a minimal amount of knowledge, as I started my first projects this way!

I've never mentioned this before, but I started programming as a naive kid. I always wanted to program, and I thought I could when I was ~13. So I joined a game-creating competition where I completely bombed out! Know why? I thought I knew it all; I thought tutorials would explain all the details for any project I chose, and that they'd explain how everything should come together magically. Thus I was completely unprepared for the challenge ahead of me!

My friend is at the same stage I was back then, and it's reflected in his order of precedence (what he considered most to least important). Now let me explain why this is important with an anecdote.

My friend just started working on his first task: to find box art for all our games. This seemed easy to him, but alas, it was deceptively tricky! You see, Metacritic and Wikipedia provide lists of games with small icons displayed next to each title, but they're just small thumbnail images. This detail was easy to miss, so he breezed over it and tried to collect images without a second thought (a costly mistake).

Our mistakes are similar to wandering through a jungle without reading a map! We may hike halfway through the jungle, but only after thoroughly examining a map can we possibly hope to find which direction we need to travel in! These wild bets, which have extremely low chances of paying off (i.e. getting us to our desired location), are what I call naive assumptions!

The challenge

I can hear people asking how naive assumptions relate to our amazing project.

Data collection is when you start writing code for your project, so it's the point where you're most vulnerable to naive assumptions.

As your guide, I need to explain how we can avoid naive assumptions:

  1. Start with finding context about the problem you're solving
  2. Proceed to decompose your problem into smaller pieces
  3. Strategise by planning how to overcome the problem
  4. Fill in the blanks

The main takeaway here is to always research the mechanisms at play before working. For us, this means closely inspecting where your data will come from and how it will be used. Don't worry if you're not naturally thinking this way (yet), as it'll come with time and practice!

Writing the code

Finally some code!

Let's start off by deciding our output file format and location. Since we're using Scrapy, it's relatively easy! Just set the FEED_FORMAT (file format) and FEED_URI (file path) settings and we're good to go.
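As a rough sketch (the format and output path here are illustrative choices of mine, not the project's actual settings), this could look like the following in settings.py:

```python
# settings.py -- illustrative values; pick whatever format and path suit you.
# Note: newer Scrapy releases prefer the FEEDS dictionary, but the classic
# FEED_FORMAT / FEED_URI settings express the same idea.
FEED_FORMAT = "csv"            # export format: csv, json, jsonlines, xml, ...
FEED_URI = "output/games.csv"  # where the exported feed will be written
```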

Our first task will be to collect a list of PC video game titles and summaries. For simplicity, we're going to use our primary data source, Wikipedia! Although it's possible to easily download all of Wikipedia, we only really need articles on PC games, so we're going to find the URL for each relevant article. There's a pre-created list on Wikipedia, which we'll be scraping for titles and URLs.

We start by isolating the parts of the page with relevant data. To do this we can use CSS or XPath selectors. The first step to finding the right selector is understanding a page's HTML structure, so we can take a look at our browser's inspector, which provides a small interactive view of the HTML code (in Firefox press Ctrl+Shift+I). To easily find an HTML element on the page, use the select-element feature (press Ctrl+Shift+C in Firefox).

After playing around, it's apparent that all our game titles are anchor (a) elements. We can try a CSS selector of just a; however, you'll notice that its output is composed of more than just game titles. We can become more specific, i > a, and we'll get slightly better results! The trick with CSS selectors is to start fairly generic and progressively make them more specific. In the end, our experimentation revealed that td > i > a selects our desired game elements. However, we actually want two things:

  • The name of each game
  • The URL to each Wikipedia article on a game

To select specific parts of an element, we can use ::. For URLs use ::attr(href), and for text use ::text. Note that we don't want HTML elements themselves, so we use the get or getall functions to extract our data! Here are our CSS selectors for collecting our two pieces of information from the page (see the sketch just after this list):

  • The name of each game: td > i > a::text
  • The URL to each Wikipedia article on a game: td > i > a::attr(href)
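To give a feel for how these selectors behave, here's a rough sketch of what this looks like inside scrapy shell (the variable names are my own, and you'd substitute the actual URL of the Wikipedia list page when launching the shell):

```python
# Inside `scrapy shell <url-of-the-wikipedia-list-page>`:
# ::text selects the text inside each matched element,
# ::attr(href) selects its href attribute.
titles = response.css("td > i > a::text").getall()       # every game name
links = response.css("td > i > a::attr(href)").getall()  # relative article URLs

# get() returns only the first match (or None if nothing matched).
first_title = response.css("td > i > a::text").get()
```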

If you've looked at our Wikipedia webpage, you should realise that we need to manually switch between pages 😑. Luckily, we have links to each page at the top of the Wikipedia list! That means we'll need to find a CSS selector for those links and then loop through each page of Wikipedia games.

If you followed my advice (general -> specific), you'd eventually realise that some elements specify classes. We can easily use CSS classes by writing a . followed by the class name! These classes are another amazing way to make selectors more specific. Using our new knowledge of classes, we can create a final CSS selector for the page links: div.toc > div > ul > li > a::attr(href).

All we have to do now is loop through and scrape each page. With Scrapy, all you have to do is yield another request! To do this easily with relative paths, use the response.follow function. Note here that we would scrape the first page twice if we simply followed every link on the page.
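Putting these pieces together, here's a minimal sketch of what the spider could look like. The spider name, start URL placeholder, and item field names are illustrative choices of mine, not the exact code from the repository:

```python
import scrapy


class GamesSpider(scrapy.Spider):
    # Name and start URL are placeholders; point start_urls at the actual
    # Wikipedia list page you're scraping.
    name = "pc_games"
    start_urls = ["https://en.wikipedia.org/wiki/<list-of-pc-games-page>"]

    def parse(self, response):
        # Each game sits in a td > i > a element: grab its title and URL.
        for game in response.css("td > i > a"):
            yield {
                "title": game.css("::text").get(),
                "url": response.urljoin(game.css("::attr(href)").get()),
            }

        # Follow the links to the other pages of the list. Scrapy's default
        # duplicate request filter skips URLs it has already requested, which
        # usually takes care of the "first page scraped twice" problem.
        for href in response.css("div.toc > div > ul > li > a::attr(href)"):
            yield response.follow(href, callback=self.parse)
```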

Now, of course, these are just game titles and article URLs; to get the articles themselves we can use Wikipedia's Special:Export page! However, this, of course, isn't automated, so a small amount of manual work will have to be done to replicate the results.

Technical summary

CSS selector tips and tricks:

  • Elements can be specified like div
  • Using .class_name allows you to specify an element's class
  • Text and attributes can be selected after ::
    • URLs are found with ::attr(href) and text with ::text

For more, take a look at W3Schools' complete list of CSS selectors.

To extract data from HTML elements, use the get or getall functions, and to remove extraneous whitespace or characters, use strip.
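As a quick illustration (reusing the selector from earlier, and assuming you're inside a parse callback or scrapy shell):

```python
# get() may return None if nothing matched, so guard before stripping.
raw_title = response.css("td > i > a::text").get()
clean_title = raw_title.strip() if raw_title else None
```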

The trouble with web scraping

I hope you appreciate how well designed Scrapy is! It makes the process of web scraping relatively painless.

But why am I gloating about Scrapy when I previously emphasised how troublesome web scraping can be? Well, unfortunately for us, whilst Scrapy makes the mechanics of web scraping easy, it can't remove inconsistencies in web pages 😥. Together we've discussed how to scrape a Wikipedia page, and I sincerely hope that scraping alongside me has provided you with some transferable value (i.e. you're now capable of replicating this yourself).

My code

To understand more about how to create a Scrapy spider to extract information from these websites, see my GitHub repository!

THANKS FOR READING!

I know data collection (especially web scraping) can be time-consuming and often feels like a slight pain. However, being persistent and fighting through the data collection process will definitely give us a strong start!

If you haven't already, check out the first post, an intro to building end-to-end machine learning projects. Make sure to stay tuned for the next post, where I actively go through the next (and potentially most important) step of our journey: data preprocessing! Last but not least, make sure to follow me on Twitter for updates!