What is web scraping, and why might you want to use it?
This article describes what web scraping is, how it differs from web crawling, and finally, who uses it and why.
Starting with definitions
“For not to have one meaning is to have no meaning, and if words have no meaning our reasoning with one another, and indeed with ourselves, has been annihilated.” (Aristotle)
Web scraping is the process of extracting structured data from a given web page (or a group of pages).
Crawling is the process of visiting (literally crawling) web pages whose data may later be extracted during the web scraping process.
These two processes normally complement each other: to extract data, you first have to visit a page, and to discover a new page, you usually have to extract data (such as links) from a page you have already visited.
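To make the difference more tangible, here is a minimal sketch in Python that does both things on one page: it extracts a piece of data (the title) and discovers links to visit next. It uses only the standard library; the names PageParser, parse_page, and SAMPLE_HTML, as well as the example.com URLs, are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the page title (scraping) and outgoing links (crawling)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(base_url, html):
    """Return the scraped title and the links a crawler could visit next."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


# Hypothetical page content; a real crawler would download it over HTTP.
SAMPLE_HTML = """
<html><head><title>Shop</title></head>
<body><a href="/items/1">Item 1</a> <a href="https://example.com/about">About</a></body>
</html>
"""

result = parse_page("https://example.com/", SAMPLE_HTML)
print(result["title"])
print(result["links"])
```

A crawler would put the discovered links into a queue and repeat the same step for each of them, while the scraped fields would go to storage.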
Who uses scraping/crawling, and why
I am often asked: “Why is it needed? Most sites have APIs, so you can easily get the data through them!”
And that’s true! You don’t need scraping; just use the API and save the Earth from global warming… :)
OK, if you’re still with me here, I hope you have understood that the statement above is a joke! Unfortunately, most websites don’t have proper APIs, and those that do often don’t maintain them properly.
So in practice it looks like this: you need the data, but you can’t get it through the API they provide. It’s as ugly as that, and there isn’t much you can do about it. So we don’t have a chance to stop global warming, at least not yet :(.
So let’s talk about who uses it and why!
Search engines
The modern Internet is hard to imagine without search engines. But how do they work? Of course, there are noticeable differences between their ranking algorithms, which lead to different results. Still, in a nutshell, all search engines require the data to be available and pre-processed upfront. Top search engines like Google, Bing, or Yandex are literally operating with hundreds of petabytes of scraped data. These guys are extracting data at scale :)!
Online marketing
It’s important to promote your goods, so they are ranked higher in search results. But how do you understand the impact of your marketing activities? Well, you can extract the data from search result pages to see how well your products are ranked and how their ranking changes over time!
Machine learning
Machine learning gives us amazing possibilities and a completely new way of writing code. Previously, most tasks were solved only with an algorithmic approach; now it’s also possible to train your code to solve a specific set of tasks. The current state of the industry also allows using machine learning without really knowing the underlying mathematics. Most of the time, you just have to import a few libs or use an API from AWS or IBM, depending on your needs.
In this case, the success of your machine learning project will depend mostly on the quality of your data. Scraping may be the answer to the question: “Where do I get the data for training my models?”
Human resources
As a computer scientist, I get emails with job offers from time to time. Have you ever wondered how recruiters find suitable people for these emails? Well, of course, most of the time, they are just sending random emails everywhere :). However, some of them extract data from large websites to monitor the job market.
Social and political research
A related area is social research on the job market (and not only there). For example, when you’re planning a new startup, it’s quite interesting to understand how many developers with a given skillset are active in a given area.
Affiliate market
Did you know that you can earn a commission by advertising goods from other websites? For example, from Amazon? You might ask why you would build a website that just shows goods that can be bought on another website. The simple answer is the following: “Amazon, for example, is a huge marketplace. It’s almost impossible to provide an equally good user experience for every user, because different goods require a different user interface, different filtering, or different search. That’s why there will always be a place for small players who help Amazon to sell better.”
Challenges
Now that we have covered the use cases, let’s think about the challenges. What does it take to scrape data? What do you need to know before starting?
Scale
It might be a simple task to extract data from a couple of pages. However, when it comes to a scale of a few million pages per day, the task becomes quite complex. It’s even more complex when you’re building a new search engine that literally needs to extract terabytes of data per day.
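A quick back-of-the-envelope calculation shows why. The sketch below is in Python; the figures (3 million pages per day, 2 seconds per page) are illustrative assumptions rather than measurements, and the helper names are made up for this example.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_rate(pages_per_day):
    """Average number of pages per second needed to hit a daily target."""
    return pages_per_day / SECONDS_PER_DAY


def workers_needed(pages_per_day, seconds_per_page):
    """Concurrent fetches needed if one page takes `seconds_per_page` on average."""
    return math.ceil(required_rate(pages_per_day) * seconds_per_page)


# Assumed workload: 3 million pages per day, 2 seconds per page on average.
print(round(required_rate(3_000_000), 1))  # about 34.7 pages per second
print(workers_needed(3_000_000, 2.0))      # 70 fetches in flight at any moment
```

And that is only the average: retries, slow websites, and politeness delays push the real number higher.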
Dealing with JavaScript
The web depends heavily on JavaScript to render data, and this turns out to be one of the big complexities for scraping engines. You have to render the page as if you were a web browser, and most of the currently available tools are purely command-line tools that are not capable of that. People are literally creating browser farms. However, they are expensive to maintain!
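Here is a small Python sketch of the problem. The sample page is hypothetical: its price is filled in by a script, so a parser that does not execute JavaScript (here the standard library’s html.parser; VisibleText and visible_text are names made up for this example) never sees it.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style>, as a static scraper would."""

    def __init__(self):
        super().__init__()
        self._skip = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page: the price only appears after the script has run.
SAMPLE_HTML = """
<html><body>
<h1>Product</h1>
<div id="price"></div>
<script>document.getElementById("price").textContent = "42 EUR";</script>
</body></html>
"""

print(visible_text(SAMPLE_HTML))  # ['Product'], the price is missing
```

To get the price, the scraper has to execute the script, which in practice means driving a real (usually headless) browser.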
Ethical aspects
We will cover ethical aspects in our next articles. For now, we will just mention that it’s important to be polite on the web. You should follow robots.txt guidelines to avoid problems!
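As a minimal sketch of what following robots.txt can look like, the Python standard library ships a parser for it. The rules, the “my-crawler” user agent, and the example.com URLs below are invented for the example; a real crawler would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into an object that can answer 'may I fetch this?'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# Check each URL before requesting it.
print(rules.can_fetch("my-crawler", "https://example.com/private/data"))  # False
print(rules.can_fetch("my-crawler", "https://example.com/products"))      # True

# Seconds to wait between requests, or None if the site does not ask for a delay.
print(rules.crawl_delay("my-crawler"))  # 2
```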
Networking problems
The network is not reliable, and outages are not rare even these days. It’s important to know how to retry failed requests and how to deal with errors.
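One common way to retry is exponential backoff: wait a bit after a failure, and wait twice as long after each further one. The Python sketch below shows the idea; fetch_with_retries and its parameters are hypothetical names, and `fetch` stands for whatever function actually downloads a page.

```python
import time


def fetch_with_retries(fetch, url, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponential backoff.

    `fetch` is any callable that downloads a page. OSError is caught because
    Python's connection, timeout, and urllib errors all derive from it.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

With the standard library you could call it as fetch_with_retries(lambda u: urllib.request.urlopen(u, timeout=10).read(), url). The sleep parameter exists only so the waiting can be swapped out in tests.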
Local versions of websites
Some websites render different content for users from different countries. In these cases, you have to use a proxy server to access the data from the expected region.
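A rough sketch of how this can be wired up with Python’s standard library is shown below. The proxy addresses are placeholders (you would use the endpoints of your own proxy provider), and PROXIES_BY_REGION and opener_for_region are names invented for this example.

```python
import urllib.request

# Hypothetical proxy endpoints, one per region; replace with real ones.
PROXIES_BY_REGION = {
    "de": "http://proxy-de.example.com:8080",
    "us": "http://proxy-us.example.com:8080",
}


def opener_for_region(region):
    """Build a URL opener that sends HTTP and HTTPS traffic through the region's proxy."""
    proxy_url = PROXIES_BY_REGION[region]
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

After that, opener_for_region("de").open(url, timeout=10) would request the page through the German proxy, so the website responds as it would to a visitor from that country.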
Conclusions
Scraping seems to be a well-established area of computer science, widely used across the world by different companies, which are often responsible for setting trends for the entire Internet.
Because the Internet keeps growing, it will continue to offer an enormous number of opportunities related to data intelligence, leading to high salaries, interesting and challenging tasks, and great business opportunities!
That’s why we want to invite you to try our web scraping solution, Crawly UI, an open-source web scraping platform that simplifies creating and monitoring scrapers. See more at http://crawlyui.com/.
Thanks for reading this story!