Data is the basis for many products like Google Maps; without it, these products would not exist. Consider the phone book for your city: it lists phone numbers for the businesses in the city. Phone books have now migrated to the web, providing the public with the same data as the printed version. But showing a list of phone numbers on a website is pretty simple.
The real value is in the data itself; displaying a list of information isn't anything special. What if you wanted to make your own website or app using those phone numbers? You need that data, but it's not made available in a public API that anyone can use. To get it, you'll need to build your own scraper.
What is a scraper?
A scraper is a program that pulls webpages down and parses them to find information. Generally, if you go to a website and can see some data that you'd like to grab, you can scrape it. To see how this works, visit any website, right-click somewhere on the page, and hit View Page Source. That will show you the raw HTML; if you can find the data you're looking for in there, you can scrape it.
How does it work?
You can break the process of scraping a website down into two steps. First, you need to get the raw HTML from the web page you want to scrape. Usually you can use cURL, though the exact tool will depend on what language you're using. The second step is where you actually parse the HTML to find the data you want. To do this you can use regular expressions (regex), a common format for expressing patterns. You use these expressions to define the pattern you are looking for.
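As a sketch of the first step, here is how downloading a page's HTML might look in PHP using the cURL extension. The URL is a placeholder, and real scrapers usually need extra options (user agent, timeouts) that are omitted here:

```php
<?php
// Step one: download the raw HTML of the page you want to scrape.
// The URL passed in is whatever page you are targeting.
function fetch_page(string $url): string|false
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // false on failure, otherwise the raw HTML
}
```

With the HTML in a string like this, the second step (parsing) can work on it directly.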
For example, let's say you want to make a website for wine recommendations. To do this you need a database of wines, and you know there's an existing site with the data you need. To start, look at which URLs you will need to hit to get that data. If there's a page that lists all the wines, like winedatabase.com/list?p=1, you can save that in your scraper. You may be able to get the full wine list by iterating through the pages: start at winedatabase.com/list?p=1 and keep adding one, winedatabase.com/list?p=2 and so on, until you reach a page that has no more data on it.
Then to get the details for each wine, use the page for a single wine, where 1233 is the ID of the wine you want details for. That ID is something you would have scraped when you got the list of wines. Once you have that, run your scraper and you'll have the full wine database.
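The page-by-page loop could be sketched like this in PHP. The URL pattern and the "no more data" check (an empty response) are assumptions about this hypothetical site, and $fetch stands in for whatever function you use to download a URL's HTML:

```php
<?php
// Iterate winedatabase.com/list?p=1, p=2, ... until a page comes back empty.
// $fetch is any function that takes a URL and returns its HTML (or '' / false).
function scrape_list_pages(callable $fetch): array
{
    $pages = [];
    for ($p = 1; ; $p++) {
        $html = $fetch("https://winedatabase.com/list?p=$p");
        if ($html === false || $html === '') {
            break; // reached a page with no more data
        }
        $pages[] = $html; // save this page of the list for parsing later
    }
    return $pages;
}
```

Passing the fetcher in as a callable keeps the loop testable without hitting the network.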
As I mentioned, regular expressions define patterns to search for. These can get quite complicated, but most patterns you'll need to look for are built the same way. As an example, take this section of HTML that might exist on the page: <h1 class="wine_name title">Meursault</h1>. Looking at this, I could do a quick search of the page to see if there's anything else that uses the wine_name class. Once I have confirmed that it's only used for wine titles, I can include it in my regular expression. In PHP my regular expression might look like this:
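A minimal sketch of that expression; the exact pattern is a reconstruction from the group-by-group breakdown that follows, so treat the group boundaries as assumptions:

```php
<?php
// Three groups: a non-capturing group for what comes before the wine name,
// a capturing group for the name itself, and a non-capturing group for
// what comes after it.
$html = '<h1 class="wine_name title">Meursault</h1>';
preg_match('/(?:wine_name.*?>)(.*?)(?:<)/', $html, $matches);
// $matches[1] now holds the captured wine name: Meursault
```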
What I have done here is break the pattern into three groups using the parentheses. The first group is what comes before what I'm trying to capture (the wine name). The second group is the wine name itself, and the third group is what comes after it. The first and third groups together define where the wine name is, but we don't want to include them in our result; we only want the wine name.
The two characters ?: at the start of a group make it so the group is not included in the result. The dot character . substitutes for any character (A-Z, 0-9, etc.). The asterisk * means that the previous character (in this case ., so it can be anything) can repeat zero or more times. And when a question mark ? follows an asterisk *, the match stops at the first opportunity rather than the last. Since the expression contains .*?>, it will stop at the first > it finds.
So the text matched by the first group would be wine_name title">, followed by the wine name in the second group, Meursault, and then the third group, <. Ignoring the first and third groups, we end up with the wine name Meursault, which is exactly what we wanted. Applying this regular expression to the entire page will get us every wine name.
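In PHP, preg_match_all applies the pattern across the whole page and collects every capture. The page contents here are a made-up example with two wines:

```php
<?php
// Run the pattern over a whole page and collect every wine name.
$page = '<h1 class="wine_name title">Meursault</h1>'
      . '<p>Notes about the wine...</p>'
      . '<h1 class="wine_name title">Chablis</h1>';
preg_match_all('/(?:wine_name.*?>)(.*?)(?:<)/', $page, $matches);
// $matches[1] is an array of every captured name: Meursault, Chablis
```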
Issues with scraping
In the ideal scenario, the website you want to scrape will have a straightforward layout. But in reality it's not always that easy: the site's layout could change at any time, breaking whatever patterns you had created. This is annoying and will happen to your scraper at some point. Frequent changes could make your scraper too unstable to be of any use.
The worst scenario is that the site changes but your pattern starts matching something else. It would seem like your scraper is still working, when in reality it's saving garbage data. Maintenance is the biggest drawback of scrapers; they're best suited to sites that change rarely or never. But sometimes, if you want that data, there's no other way to get it.
It's also possible for websites to protect against scraping through a variety of means, which means you first need to get around those protections before you can even try to scrape their data. Generally speaking, it's possible to break through most protections; it comes down to how much time you want to spend on the project. Even if you do manage to break through, the site can always add new protections, so your scraper could break at any moment. Although if you only need to run your scraper once in a while, that may not matter.
Sometimes it works, sometimes it doesn’t
While building a scraper is not very difficult, the main issue is that you can't prevent it from breaking. It's impossible for your scraper to adapt to every change made to the website you're scraping, whether that's a simple design change or a new protection added to block scrapers.
But if you can get it working, the data you get could be very valuable. And since it's not a public API, nobody else will have access to it, making the barrier to entry higher. Once it's set up, it can pay dividends for a long time to come with little maintenance. It all depends on the site you're trying to scrape from; you'll have to decide whether the data you want is worth the effort.