Building a Chrome-based fetcher for Crawly

It has been almost a year since I wrote a blog post about browser rendering. At that point, we used a combination of Splash and Crawly to fetch data from AJAX-rendered websites.

One of the downsides of this approach is that Splash is not a real browser. Sometimes it behaves strangely, hangs, or crashes in unpredictable situations. Moreover, it looks like a machine, so it’s easy to recognize and block. That’s why it can be useful to drive real browsers instead: they look like real browsers, and can quite easily hide in a crowd of other real browsers, if you know what I mean.

Let us try to build a Crawly fetcher that uses a WebDriver and a real browser engine.

Sounds interesting? Let’s dive in!

Starting out with a project

Let’s build a crawler for a popular German bonsai shop, which lets you find nice trees for your hobby: BonsaiShop.

As you can see, there is almost no data for those of you who do not have JS enabled.

We will gather the product information: name, price, URL, description. It’s not as simple as it may seem, at least without a proper fetcher.

Configuring

Start by setting up a project and defining a basic configuration for it:

mix new bonsai_shop --sup

Create a configuration file config/config.exs with the following contents:

config.exs
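The embedded snippet did not survive in this version of the post. A minimal config/config.exs for this project could look roughly like the following sketch; the exact middleware and pipeline choices are illustrative, assuming Crawly ~> 0.13:

```elixir
import Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 1,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest
  ],
  pipelines: [
    # Validate that items carry the fields we plan to extract
    {Crawly.Pipelines.Validate, fields: [:name, :price, :url, :description]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :url},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```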

Add crawly and floki to the mix.exs file:

# Run "mix help deps" to learn about dependencies.
defp deps do
  [
    {:crawly, "~> 0.13.0"},
    {:floki, "~> 0.26.0"}
  ]
end

Finding item extractors

Now we come to the most frustrating part. A quick investigation shows that product pages are empty when you don’t have JavaScript enabled:

As you can see, there is no information available when JS is disabled
It’s not possible to get price information from a page like this

Building a fetcher that can render JavaScript

Crawly can easily be extended with a custom fetcher, as long as it implements the general fetcher behaviour, which defines a single fetch function.

In a simple case it could look like this:
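The original embed is missing here; a trivial fetcher (essentially what Crawly’s built-in HTTPoison fetcher does) could be sketched like this, assuming Crawly ~> 0.13, where the behaviour module is Crawly.Fetchers.Fetcher:

```elixir
defmodule SimpleFetcher do
  @behaviour Crawly.Fetchers.Fetcher

  # fetch/2 receives a Crawly.Request and must return
  # {:ok, response} with an HTTPoison-style response struct
  def fetch(request, _client_options) do
    HTTPoison.get(request.url, request.headers, request.options)
  end
end
```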

Now we should actually write the code responsible for fetching the page with the help of a browser. And this time, we’re going to use an Elixir library that was built for testing purposes, but which provides an interface to ChromeDriver (and Selenium as well). Of course, in the future, it would be nice to fork it and make things a bit simpler and more suitable for scraping.

If you’re curious which library I am talking about: that’s Wallaby.

  1. Add Wallaby to the list of dependencies, just below Floki
# Run "mix help deps" to learn about dependencies.
defp deps do
  [
    {:crawly, "~> 0.13.0"},
    {:floki, "~> 0.26.0"},
    {:wallaby, "~> 0.28.0"}
  ]
end

2. Extend the fetch function with the following contents

The first implementation of the fetch function
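The gist embed is gone from this version of the post. Based on the description, a first implementation of fetch/2 on top of Wallaby could look roughly like this; the module name and the HTTPoison response shape are assumptions:

```elixir
defmodule BonsaiShop.BrowserFetcher do
  @behaviour Crawly.Fetchers.Fetcher

  def fetch(request, _client_options) do
    # Start a browser session, render the page, and grab the final HTML
    {:ok, session} = Wallaby.start_session()

    page_source =
      session
      |> Wallaby.Browser.visit(request.url)
      |> Wallaby.Browser.page_source()

    Wallaby.end_session(session)

    # Crawly expects an HTTPoison-style response struct
    {:ok,
     %HTTPoison.Response{
       body: page_source,
       status_code: 200,
       request_url: request.url
     }}
  end
end
```

Note that this opens and closes a browser session for every request, which is simple but slow; pooling sessions would be a natural next optimization.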

3. Update the configuration file, so our fetcher is used by Crawly instead of the default one:

Crawly configuration, which uses a custom fetcher
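The embedded snippet is missing; the relevant change is the :fetcher key in the Crawly config, plus telling Wallaby which webdriver to run. The fetcher module name BonsaiShop.BrowserFetcher is an assumption:

```elixir
import Config

# Use our browser-based fetcher instead of the default HTTPoison one
config :crawly,
  fetcher: {BonsaiShop.BrowserFetcher, []}

# Point Wallaby at ChromeDriver
config :wallaby, driver: Wallaby.Chrome
```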

Hooray! Believe it or not, we’re done now :)!

Trying it out

Now we can try to see how it’s possible to extract data from product pages:

  1. Fetch the response with: `response = Crawly.fetch("https://www.bonsai-shop.com/izanpip121-02_26785_9870")`

2. Now we can do the regular Floki magic:

iex> {:ok, document} = Floki.parse_document(response.body)
iex> Floki.find(document, ".price.h1") |> Floki.text()
"\n EUR 49.00\n *"

It worked! We now have access to the data.

Conclusions

In this article, we have managed to build a primitive browser-based fetcher for Crawly.

It’s just our first step in this direction, and we plan to proceed with the following:

  1. Skipping unrelated media assets (unrelated JS, especially tracking scripts, can easily slow us down)
  2. Using proxy servers
  3. Improving the configurability of the fetcher

We will take these steps if the article proves interesting to our audience and we get 250 claps!

Final notes

I would be delighted to see Elixir projects that do scraping, and I will gladly help you set up yours. Your feedback and comments motivate us to move forward!

You can always contact us via our Telegram group https://t.me/crawlyelixir.

Software developer at Erlang Solutions.
