Web scraping with Elixir and Crawly. Browser rendering.

Image by Gerd Altmann from Pixabay

Introduction

In one of our previous articles, we discussed how to perform web scraping with Elixir. In this article, I will take another step into the interesting world of data scraping and investigate how to handle the interactivity of the modern web.

As the number of websites using JavaScript for content rendering grows, so does the demand for extracting data from them. Interactivity adds complexity to the data extraction process, as it’s not possible to get the full content with a regular request made from a command-line HTTP client.

One way of solving this problem is to simulate the asynchronous request with an additional POST. In my opinion, though, this approach adds even more complexity and fragility to your code.

Another way of solving the problem is to use browser renderers instead of command-line HTTP clients. Let me show you how to set up a Crawly spider in order to extract data from autotrader.co.uk with the help of a browser renderer. You will be surprised how simple it is!

Getting started

Let's spin up a new Elixir project:

mix new autosites --sup

Now that the project is created, let’s add the most recent version of Crawly to the dependencies in mix.exs and download it:

# Run "mix help deps" to learn about dependencies.
defp deps do
  [
    {:crawly, "~> 0.8"},
    {:meeseeks, "~> 0.14.0"}
  ]
end

Ok, at this point let’s explore our target a bit. Let’s open one of the pages that contains information about a leased car: https://www.autotrader.co.uk/cars/leasing/product/201911194518336

Example product from the Autotrader website

Ok, so far so good. Let’s try to fetch this page with the help of Crawly:

$ iex -S mix
iex(1)> response = Crawly.fetch("https://www.autotrader.co.uk/cars/leasing/product/201911194518336")
%HTTPoison.Response{
  body: "<!doctype html> …",
  headers: [
    {"Date", "Mon, 17 Feb 2020 13:49:42 GMT"},
    {"Content-Type", "text/html;charset=utf-8"},
    {"Transfer-Encoding", "chunked"},
    ...

NOTE: Autotrader.co.uk is a dynamic website with lots of cars on sale. By the time you read this article, this particular car (id: 201911194518336) might (and probably will) be unavailable on their website. In that case, we recommend picking one of the other cars from the https://www.autotrader.co.uk/cars/leasing section.

Ok, it looks like it’s possible to get a car for only 192.94 pounds per month… cool! But let’s see if we can really get this data in our shell. Let’s open the downloaded HTML in a browser to see how it looks after the fetch:

iex(9)> File.write("/tmp/nissan.html", response.body)

Ok, let’s now try to find the price on the given page:

The same product page fetched with console HTTP client

Ok. Now it’s visible that there is no price in the price block. So how do we deal with that? One possible approach is to dig into the page source and find the following block:

However, writing an appropriate regular expression to extract this data from the JavaScript is a real challenge. And seriously speaking, it would add lots of unneeded complexity to the spider code.
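To give a feel for why this route is fragile, here is a minimal sketch of the regular-expression approach. The script fragment and the `monthlyPrice` key are hypothetical and made up for illustration; the actual markup on the site may look quite different, and the module name is ours.

```elixir
defmodule PriceRegexSketch do
  # Returns {:ok, price_string} when the (hypothetical) pattern is found,
  # and :error otherwise.
  def extract_price(html) when is_binary(html) do
    # Assumption: the price is embedded in inline JavaScript as
    # `"monthlyPrice": 192.94`. Any change to the key name or the
    # formatting on the site would silently break this pattern.
    regex = ~r/"monthlyPrice"\s*:\s*(\d+(?:\.\d+)?)/

    case Regex.run(regex, html, capture: :all_but_first) do
      [price] -> {:ok, price}
      nil -> :error
    end
  end
end

# Made-up fragment for illustration only; not copied from the site.
html = ~s(<script>window.pageData = {"monthlyPrice": 192.94};</script>)
PriceRegexSketch.extract_price(html)
```

Even in this tiny example the spider has to know how the site’s scripts are laid out internally, which is exactly the coupling we would like to avoid.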

Let’s try an alternative approach.

Extracting dynamic content

Crawly 0.8.0 ships with support for pluggable fetchers, which allows us to redefine how Crawly fetches HTTP responses. In our case, we want to pipe all requests through a browser so that every page arrives rendered. One option is to use a lightweight browser implementation that executes basic JavaScript for us. For demonstration purposes, we will use the Splash renderer.

Splash is a lightweight and simple browser renderer based on Python and Qt. While it might not be suitable for complex targets, we find it quite useful for simpler cases.

Let’s start a Splash service locally:

docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300

Now Splash is up and running and can accept requests.

Configuring Crawly

Let’s add some basic configuration. Assuming that we want to extract the id, title, URL, and price, the configuration might look like this.

Create the config/config.exs file with the following content:

# This file is responsible for configuring your application
# and its dependencies with the aid of the Mix.Config module.
use Mix.Config

config :crawly,
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]},
  # Defines how to retry requests
  retry: [
    retry_codes: [400, 500],
    max_retries: 5,
    ignored_middlewares: [Crawly.Middlewares.UniqueRequest]
  ],
  closespider_timeout: 5,
  concurrent_requests_per_domain: 20,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  user_agents: [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36 OPR/38.0.2220.41"
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:id, :title, :url, :price]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :id},
    {Crawly.Pipelines.JSONEncoder, []},
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]

The important part here is:

fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]},

It tells Crawly to use the Splash fetcher to get pages from the target website.

Fetching the car page once again

Now that everything is set up, it’s time to fetch the page again to see the difference:

iex(3)> response = Crawly.fetch("https://www.autotrader.co.uk/cars/leasing/product/201911194518336")
iex(4)> File.write("/tmp/nissan_splash.html", response.body)

Let’s now see how the rendered page looks:

Well :(. That’s a bit empty. It happened because we did not give Splash enough time to render the page. Let’s modify the configuration a bit and ask Splash to wait 3 seconds after fetching the page (so it has enough time to render all elements).

The updated configuration looks like this:

fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html", wait: 3]}
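Splash’s render.html endpoint takes the target address and rendering options such as wait as query parameters. The following is a minimal sketch of how the fetcher options could translate into a request URL for Splash. The module and function names are hypothetical, and this is an illustration of the idea rather than a copy of Crawly’s internal implementation.

```elixir
defmodule SplashUrlSketch do
  # Builds a URL for Splash's render.html endpoint.
  # `opts` stands for the fetcher options from the config, minus :base_url
  # (for example `wait: 3`).
  def render_url(base_url, target_url, opts \\ []) do
    query =
      opts
      |> Keyword.put(:url, target_url)
      |> URI.encode_query()

    base_url <> "?" <> query
  end
end

splash_url =
  SplashUrlSketch.render_url(
    "http://localhost:8050/render.html",
    "https://www.autotrader.co.uk/cars/leasing/product/201911194518336",
    wait: 3
  )
```

Pasting the resulting address into a regular browser while the Splash container is running is a quick way to check what Splash renders for a given wait value.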

Now after re-fetching the page (just repeat previous commands) we will get the following results:

Page with rendered content

The spider

Finally, let’s wrap everything into a spider, so we can extract information about all the cars available on Autotrader. The process of writing a spider is described in our previous article and also in Crawly’s getting started guide, so we will not repeat it here. But for the sake of completeness, let’s see what the spider code might look like (it requires {:meeseeks, "~> 0.14.0"} in the mix.exs deps):

defmodule AutotraderCoUK do
  @behaviour Crawly.Spider

  require Logger
  import Meeseeks.CSS

  @impl Crawly.Spider
  def base_url(), do: "https://www.autotrader.co.uk/"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: [
        "https://www.autotrader.co.uk/cars/leasing/search",
        "https://www.autotrader.co.uk/cars/leasing/product/201911194514187"
      ]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    case String.contains?(response.request_url, "cars/leasing/search") do
      false ->
        parse_product(response)

      true ->
        parse_search_results(response)
    end
  end

  defp parse_search_results(response) do
    # Parse page once only
    parsed_body = Meeseeks.parse(response.body, :html)

    # Extract href elements
    hrefs =
      parsed_body
      |> Meeseeks.all(css("ul.grid-results__list a"))
      |> Enum.map(fn a -> Meeseeks.attr(a, "href") end)
      |> Crawly.Utils.build_absolute_urls(base_url())

    # Get pagination
    pagination_hrefs =
      parsed_body
      |> Meeseeks.all(css(".pagination a"))
      |> Enum.map(fn a ->
        number = Meeseeks.own_text(a)
        "/cars/leasing/search?pageNumber=" <> number
      end)

    all_hrefs = hrefs ++ pagination_hrefs

    requests =
      Crawly.Utils.build_absolute_urls(all_hrefs, base_url())
      |> Crawly.Utils.requests_from_urls()

    %Crawly.ParsedItem{requests: requests, items: []}
  end

  defp parse_product(response) do
    # Parse page once only
    parsed_body = Meeseeks.parse(response.body, :html)

    title =
      parsed_body
      |> Meeseeks.one(css("h1.vehicle-title"))
      |> Meeseeks.own_text()

    price =
      parsed_body
      |> Meeseeks.one(css(".card-monthly-price__cost span"))
      |> Meeseeks.own_text()

    thumbnails =
      parsed_body
      |> Meeseeks.all(css("picture img"))
      |> Enum.map(fn elem -> Meeseeks.attr(elem, "src") end)

    url = response.request_url

    id =
      response.request_url
      |> URI.parse()
      |> Map.get(:path)
      |> String.split("/product/")
      |> List.last()

    item = %{
      id: id,
      url: url,
      thumbnails: thumbnails,
      price: price,
      title: title
    }

    %Crawly.ParsedItem{items: [item], requests: []}
  end
end
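The two URL-based decisions in the spider (search page or product page, and which id a product has) rely only on the Elixir standard library, so they can be tried in isolation. Here is a small sketch that mirrors that logic; the module and function names are hypothetical and exist only for this illustration.

```elixir
defmodule ProductUrlSketch do
  # Mirrors the check in parse_item/1: search result pages contain
  # "cars/leasing/search" in their URL, product pages do not.
  def search_page?(url), do: String.contains?(url, "cars/leasing/search")

  # Mirrors the id extraction in parse_product/1: take the URL path and
  # keep whatever follows "/product/".
  def product_id(url) do
    url
    |> URI.parse()
    |> Map.get(:path)
    |> String.split("/product/")
    |> List.last()
  end
end

product_url = "https://www.autotrader.co.uk/cars/leasing/product/201911194518336"
search_url = "https://www.autotrader.co.uk/cars/leasing/search?pageNumber=2"

ProductUrlSketch.product_id(product_url)
ProductUrlSketch.search_page?(search_url)
```

Because the id is taken from the parsed path rather than from the raw URL string, a query string appended to a product URL would not end up in the id.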

Finally, let’s run the spider:

iex(1)> Crawly.Engine.start_spider(AutotraderCoUK)

That’s it. Now you will have the content extracted!

I think this approach is quite interesting, as it allows us to offload the data extraction complexity to an external service, so the code itself does not have to handle it and can stay simple.

Thanks for reading!

Software developer at Erlang Solutions.
