Visual scraping with Elixir and Crawly, or how to get data without programming.

Image by Pete Linforth from Pixabay

In one of my previous articles, we discussed why you might want to scrape data from the Internet. We showed how to extract data from multiple websites to build a price monitoring solution for a real estate agency.

Here we want to show how to create a web scraper even if you don’t know how to program!

Data from the real estate market

As in the previous example, we are interested in data from a Swedish real estate website; this time we will use Hemnet. We want to get URLs, prices, addresses, and images. As in the previous article, we are not interested in crawling the whole website; instead, we want to monitor a given list of pages.

Bootstrapping the project

As we have mentioned in the previous article, the website does not state any restrictions to web crawlers (at least according to robots.txt), so we assume that it’s ethical to get a bit of data out of it!

As promised, no coding is required this time. Bootstrapping is as simple as opening a webpage and defining a spider with the help of a few online forms.

In this example, we will use a deployed version of the Crawly UI project, which is our open-source attempt to make crawling more predictable and usable. You’re free to try the deployed version, or you can run your own deployment (see the instructions in the project’s README).

Open CrawlyUI’s “create spider” page.

Create a spider dialog, step1

At this point, you have to enter a spider name and the fields you need to monitor. In our case, the spider name is “HemnetMonitor,” and the fields are “URL,price,address,image.”

Hit the “Update” button once done!

Please note: For now, a spider name should not contain whitespace or special characters. Ideally, use camel case notation.
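The naming rule can be checked before you fill in the form. The snippet below is only a sketch: the function name `is_valid_spider_name` and the exact pattern are our assumptions about what "no whitespace or special characters, ideally camel case" means, not a validator that CrawlyUI exposes.

```python
import re

# Assumed rule: letters and digits only, starting with an uppercase letter.
# This is our reading of the note above, not CrawlyUI's own validation.
SPIDER_NAME_PATTERN = re.compile(r"[A-Z][A-Za-z0-9]*")


def is_valid_spider_name(name: str) -> bool:
    """Return True if the name looks like a CamelCase identifier."""
    return SPIDER_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_spider_name("HemnetMonitor"))   # True
print(is_valid_spider_name("Hemnet Monitor"))  # False
```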

Defining crawl URLs

After clicking “Update”, you will be redirected to the next stage of spider creation: the crawl definition. At this point, you need to specify a starting point and a list of paths to follow.

In this example, we will monitor the following properties:

Real estate properties to monitor

Note: The “paths to follow” field expects fragments of links, and you only need to fill it in if you want your crawler to follow something. In our case, we don’t want the spider to explore the website; we just want to monitor a predefined set of links.
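To illustrate the idea of "paths to follow", here is a minimal sketch of how such a filter could behave. The function `links_to_follow` and the substring matching are assumptions made for illustration; CrawlyUI's actual matching logic may differ, and the example.com URLs are placeholders.

```python
from typing import Iterable, List


def links_to_follow(links: Iterable[str], path_fragments: List[str]) -> List[str]:
    """Keep only the links that contain at least one configured fragment."""
    return [
        link for link in links
        if any(fragment in link for fragment in path_fragments)
    ]


# Hypothetical links found on a page (example.com is a placeholder).
found = [
    "https://example.com/listing/1",
    "https://example.com/about",
]

print(links_to_follow(found, ["/listing/"]))  # ['https://example.com/listing/1']
print(links_to_follow(found, []))             # []
```

With no fragments configured, nothing is followed, which is the behavior we want when monitoring a fixed list of pages.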

Create spider: Setting start urls

Finally, hit save to move forward!

Adding extraction rules

In this final step, we have to define how item extraction is supposed to be done.

So, first of all, let’s add an extraction rule. Enter one of the URLs to monitor in the Add new rule field and hit the “Add” button.

Add extraction rule, step 1

Now you will be redirected to the item extraction dialog.

The extraction dialog consists of the fields defined in step 1 and the target page loaded in an iframe.

NOTE: You may not see the target website in the iframe. Most modern web servers send headers (such as X-Frame-Options) that tell the browser not to render their pages inside iframes. However, it’s possible to install a Chrome extension called Ignore X-Frame headers to address this problem.

The next thing to do is inspect the loaded page with your browser's developer tools (the Inspect dialog in Google Chrome) and define the item extractors on the left part of the page.
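If you want to know in advance whether a page is likely to be blocked in an iframe, you can look at its response headers. The sketch below is a rough heuristic under our own assumptions: `blocks_iframe` is a hypothetical helper, it takes headers you have already fetched by some other means, and it treats any `frame-ancestors` directive as a possible restriction.

```python
from typing import Mapping


def blocks_iframe(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers likely prevent third-party framing."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value.lower() for name, value in headers.items()}

    # X-Frame-Options: DENY or SAMEORIGIN stops other sites from framing the page.
    if normalised.get("x-frame-options", "").strip() in ("deny", "sameorigin"):
        return True

    # A Content-Security-Policy with frame-ancestors may also restrict framing.
    return "frame-ancestors" in normalised.get("content-security-policy", "")


print(blocks_iframe({"X-Frame-Options": "SAMEORIGIN"}))  # True
print(blocks_iframe({"Content-Type": "text/html"}))      # False
```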

Finding selectors to extract data

We have tried to keep the target website and the tested selectors visible on the same page, so you can tell right away whether the selector you have created works.

In our case, we have found the following CSS selectors required to extract data from the given webpage:

Selectors required to extract data from the page

Note: We’re still experimenting with the way selectors are defined. In this particular case, I want to highlight two special selectors:

response_url: gets the URL of the current page

.property-gallery__item//0//src: the long selector syntax, which can be read this way:

1. Get everything that has the .property-gallery__item class

2. Take only the first item from the resulting list

3. Extract the src attribute
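The three steps above can be sketched in a few lines of Python. This is not CrawlyUI's implementation: the helper names (`parse_long_selector`, `ClassCollector`, `extract`) are hypothetical, the sketch only understands single-class selectors such as `.name`, and the sample markup is invented for the example.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


def parse_long_selector(selector: str) -> Tuple[str, int, str]:
    """Split 'css//index//attribute' into its three parts."""
    css, index, attribute = selector.split("//")
    return css, int(index), attribute


class ClassCollector(HTMLParser):
    """Collect the attributes of every tag that carries a given class."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.matches: List[Dict[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.class_name in classes:
            self.matches.append(attributes)


def extract(html: str, selector: str) -> Optional[str]:
    css, index, attribute = parse_long_selector(selector)
    collector = ClassCollector(css.lstrip("."))  # step 1: everything with the class
    collector.feed(html)
    if index >= len(collector.matches):
        return None
    return collector.matches[index].get(attribute)  # steps 2 and 3


# Hypothetical markup; the real Hemnet page is more complex.
page = """
<img class="property-gallery__item" src="/img/first.jpg">
<img class="property-gallery__item" src="/img/second.jpg">
"""

print(extract(page, ".property-gallery__item//0//src"))  # /img/first.jpg
```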

Now that everything is set, hit the “Update” button to save the extractor.

At this point, you will see the previous dialog with a list of defined extractors:

Define extractors dialog

It’s possible to edit current extractors here or to define additional extractors.

At this point, you may wonder what happens if you create more than one extractor per spider. Extractors are applied one by one, and the spider picks the one that manages to extract more fields than the others.
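The selection rule can be pictured with a small sketch. The names `count_extracted` and `pick_best`, the sample items, and the tie-breaking behavior (the first extractor wins) are assumptions for illustration only; the article does not specify how CrawlyUI resolves ties.

```python
from typing import Dict, List, Optional

Item = Dict[str, Optional[str]]


def count_extracted(item: Item) -> int:
    """Count the fields that actually received a value."""
    return sum(1 for value in item.values() if value)


def pick_best(items: List[Item]) -> Item:
    """Return the item with the most filled fields (the first one wins ties)."""
    return max(items, key=count_extracted)


# Hypothetical results of two extractors applied to the same page.
first = {"url": "https://example.com/1", "price": None, "address": None, "image": None}
second = {
    "url": "https://example.com/1",
    "price": "1 000 000 kr",
    "address": "Example street 1",
    "image": None,
}

print(pick_best([first, second]) is second)  # True
```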

That’s basically all! Now let's click Save spider, and we’re ready to perform crawls (which can be triggered via Schedule link)!

Let’s finally schedule the spider:

Spider is automatically visible on the dashboard after scheduling.

And also, preview the extracted data:

The data is instantly available for preview.

Note: CrawlyUI is currently under development, so please be advised that the current code is not considered stable. Some parts of the code will also have to be rewritten so that they make more sense from an architectural point of view. Finally, the whole project (CrawlyUI) is currently running on one EC2 micro instance, so we have applied quite restrictive rules (max 10 workers per spider) to CrawlyUI usage.

The spider definition can be found here: http://crawlyui.com/spider/new?spider=HemnetPrices

Conclusions

This article shows how to create spiders with the help of Elixir without any knowledge of Elixir. The approach can automate quite a bit of the scraping work you need. However, it may not work for more complex crawling targets!

I am currently running just one Amazon free-tier micro instance, which imposes quite a few limitations (and will also expire soon). That’s why I would highly appreciate support that would allow me to get a better server and add new instances for browser rendering and other things! Please let me know if you’re interested!

Software developer at Erlang Solutions.
