Visual scraping with Elixir and Crawly, or how to get data without programming.
In one of my previous articles, we discussed why you might want to scrape data from the Internet and showed how to extract data from multiple websites to build a price monitoring solution for a real estate agency.
Here we want to show how to create a web scraper even if you don’t know how to program at all!
Data from the real estate market
As in the previous example, we will be interested in data from one of the Swedish real estate websites; this time we will take the Hemnet website. We want to get: URLs, prices, addresses, and images. As in the previous article, we are not interested in crawling the whole website; instead, we just want to monitor a given list of pages.
Bootstrapping the project
As we mentioned in the previous article, the website does not impose any restrictions on web crawlers (at least according to its robots.txt), so we assume that it’s ethical to get a bit of data out of it!
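If you want to double-check such restrictions yourself, the quickest way is to fetch the site’s robots.txt and read it. Below is a minimal sketch in Elixir; it assumes the HTTPoison HTTP client is available as a dependency and is not part of the CrawlyUI workflow, just a sanity check:

    # Fetch and print the robots.txt of the target website.
    # Assumes {:httpoison, "~> 1.8"} is listed in mix.exs.
    {:ok, %HTTPoison.Response{status_code: 200, body: body}} =
      HTTPoison.get("https://www.hemnet.se/robots.txt")

    IO.puts(body)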
As promised, this time coding will not be required at all. Bootstrapping will be as simple as opening a web page and defining a spider with the help of a few online forms.
In this example, we will use a deployed version of the Crawly UI project, which is our open-source attempt to make crawling more predictable and usable. You’re free to try it via the link above, or you can run your own deployment (see the instructions in the project’s readme).
Open CrawlyUI’s “Create spider” link here.
At this point, you have to enter a spider name and the fields you need to monitor. In our case, the spider name is “HemnetMonitor,” and the fields are “URL,price,address,image.”
Hit the “Update” button once done!
Please note: at the moment, a spider name should not contain whitespace or special characters. Ideally, please use CamelCase notation.
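For readers curious about what such a spider looks like under the hood: conceptually it corresponds to a hand-written Crawly spider module, which is presumably why the name has to look like a valid CamelCase module name. The sketch below is a rough hand-written equivalent, not the code CrawlyUI actually generates; the callbacks follow the documented Crawly.Spider behaviour, and the start URL is just a placeholder:

    # A hypothetical hand-written counterpart of the "HemnetMonitor" spider.
    defmodule HemnetMonitor do
      use Crawly.Spider

      @impl Crawly.Spider
      def base_url(), do: "https://www.hemnet.se"

      @impl Crawly.Spider
      def init(), do: [start_urls: ["https://www.hemnet.se/..."]]  # placeholder URL

      @impl Crawly.Spider
      def parse_item(response) do
        # Extraction of URL, price, address and image is defined later.
        %Crawly.ParsedItem{items: [], requests: []}
      end
    end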
Defining crawl URLs
After clicking update, you will be redirected to the next stage of spider creation — the crawl definition. At this point, you need to specify a starting point and a list of paths to follow.
In this example, we will monitor the following properties:
Note: the paths to follow are expected to be parts of links (path patterns) if you want your crawler to follow anything. In our case, we don’t want the spider to explore the website; we just want to monitor a predefined set of links.
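To make the “paths to follow” idea concrete, here is a hypothetical hand-rolled version of such a filter in plain Elixir. It assumes links is a list of URLs discovered on a page and uses “/bostad/” as an example path pattern (not taken from the article); CrawlyUI’s real implementation may differ:

    # Keep only the links whose URL contains one of the configured path patterns
    # and turn them into Crawly requests.
    follow_patterns = ["/bostad/"]  # example pattern, for illustration only

    requests =
      links
      |> Enum.filter(fn url ->
        Enum.any?(follow_patterns, &String.contains?(url, &1))
      end)
      |> Enum.map(&Crawly.Utils.request_from_url/1)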
Finally, hit save to move forward!
Adding extraction rules
In this final step, we have to define how item extraction is supposed to be done.
So, first of all, let’s add an extraction rule. Enter one of the URLs to monitor in the Add new rule field and hit the “Add” button.
Now you will be redirected to the item extraction dialog.
NOTE: It may be that you will not see the target website in the iframe. Most modern web servers block being embedded in iframes. However, it’s possible to install a Chrome extension called “Ignore X-Frame Headers” to address this problem.
The next thing you have to do is inspect the loaded page with your browser’s developer tools (the Inspect dialog in the case of Google Chrome) and define the item extractors on the left part of the page.
We have tried to achieve a state where the target website and the selectors under test are visible on the same page, so you can immediately tell whether the selector you have created works or not.
In our case, we have found the following CSS selectors required to extract data from the given webpage:
Note: We’re still experimenting with the way selectors are defined. In this particular case, I want to highlight two special selectors:
response_url — gets the URL of the current page
.property-gallery__item//0//src — the long selector-definition syntax, which can be read this way:
1. Get everything that has the .property-gallery__item class
2. Take only the first item out of the picked list
3. Extract the src attribute
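For reference, here is roughly how these two selectors translate into plain Elixir with the Floki HTML parser. This is an approximation of what happens behind the scenes, not CrawlyUI’s actual code; it assumes response is the fetched page (an HTTPoison-style response with body and request_url fields):

    # "response_url": the URL of the page currently being parsed.
    url = response.request_url

    # ".property-gallery__item//0//src", step by step with Floki:
    {:ok, document} = Floki.parse_document(response.body)

    image =
      document
      |> Floki.find(".property-gallery__item")  # 1. everything with the class
      |> Enum.take(1)                           # 2. only the first matched item
      |> Floki.attribute("src")                 # 3. extract the src attribute
      |> List.first()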
Now that everything is set, let’s hit the “Update” button so the extractor is saved.
At this point, you will see the previous dialog with a list of defined extractors:
It’s possible to edit current extractors here or to define additional extractors.
At this point, you may wonder what happens if you create more than one extractor per spider. Well, the extractors are applied one by one, and the spider picks the one that is capable of extracting more fields than the others.
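In other words, the selection logic resembles the simplified sketch below. This is not CrawlyUI’s actual code; it assumes extractors is a list of functions that each take a parsed document and return a map of extracted fields:

    # Run every extractor and keep the result with the most extracted fields.
    best_item =
      extractors
      |> Enum.map(fn extract -> extract.(document) end)
      |> Enum.max_by(&map_size/1)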
That’s basically all! Now let’s click “Save spider,” and we’re ready to perform crawls (which can be triggered via the “Schedule” link)!
Let’s finally schedule the spider:
And also, preview the extracted data:
Note: CrawlyUI is currently under development, so please be advised that the current code is not considered stable. Some parts of the code will also have to be rewritten so that they make more sense from an architectural point of view. Finally, the whole project (CrawlyUI) is currently running on a single EC2 micro instance, so we have applied quite restrictive rules (max 10 workers per spider) for CrawlyUI usage.
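For comparison, in a self-hosted Crawly setup the equivalent restriction is controlled through the crawler configuration. A hedged sketch (option names as documented for recent Crawly versions; the values simply mirror the limits mentioned above):

    # config/config.exs of a self-hosted deployment
    import Config

    config :crawly,
      concurrent_requests_per_domain: 10,  # roughly "max 10 workers per spider"
      closespider_itemcount: 1000          # stop the spider after enough items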
The spider definition can be found here: http://crawlyui.com/spider/new?spider=HemnetPrices
Conclusions
This article shows how to create spiders with the help of Elixir without any knowledge of Elixir. It can automate quite a bit of the scraping work you need; however, this approach may not work for more complex crawling targets!
I am currently running just one Amazon free tier micro instance, which imposes quite a few limitations (and will also expire soon). That’s why I would highly appreciate support that would allow me to get a better server and add new instances for browser rendering and other things. Please let me know if you’re interested!