Web scraping with Elixir and Crawly - rendering JavaScript in 2023

Oleg Tarasenko
5 min read · Aug 25, 2023

Intro

It has been a while since I last wrote about browser rendering support in Crawly. The problem does not come up often, but sometimes you simply cannot get the data out of a website without JavaScript. It may also be that the target website blocks anything that does not look like a browser.

One of the earlier approaches to this problem was the Splash rendering server (you can find an example here). I liked that approach, as it was fast and simple: no need to install Chrome, Chrome drivers, and other things on your machine. Just run a Docker container, and voilà, with a few small hacks you get your JS-rendered page.

However, it turns out that using Splash is problematic these days. First of all, the project looks unmaintained (the last commits were made a couple of years ago). On top of that, its Docker images do not run on M1 Macs, which became a problem for me and my friends.

Finally, Python is a very dynamic language :), so it turns out to be quite hard to build Splash from source against a specific Python version and the dependencies it requires :(.

The alternative

As a quick alternative, I have written a browser renderer on top of Puppeteer and Node.js. Puppeteer already does everything we need to automate browser actions, so all that is missing is a small HTTP wrapper that Crawly can talk to.

I suggest looking at my attempt, the Crawly Render Server. It is not yet battle-tested as of August 2023; however, it has already helped me with several targets.

Trying it

Let’s assume we need to monitor prices for mobile internet packages on the best Swedish telecom provider, hallon.se.

First of all, let's start a new Elixir project:

mix new render_server_example --sup
cd render_server_example

Add Crawly and Floki to the mix.exs dependencies:

defp deps do
  [
    {:crawly, git: "https://github.com/elixir-crawly/crawly", branch: "render_server_integration"},
    {:floki, "~> 0.33.0"}
  ]
end

NOTE: Right now, the render server support lives in a branch. However, it is going to be merged and released with newer versions of Crawly.

After getting dependencies (mix deps.get), you can generate a basic configuration for your project:

mix crawly.gen.config

# No config_file: :enoent -> creating new one
# Done!

That automatically creates a configuration for you:

import Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  closespider_itemcount: 100,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot", "Google"]}
  ],
  pipelines: [
    # An item is expected to have all fields defined in the fields list
    {Crawly.Pipelines.Validate, fields: [:url]},

    # Use the URL as the item's unique identifier; the pipeline drops
    # items with duplicate URLs
    {Crawly.Pipelines.DuplicatesFilter, item_id: :url},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]

In this example, we are interested only in the monthly price (regular and discounted), the package name, and the URL.

So, let’s rewrite the validation pipeline in the config so it looks like this:

{Crawly.Pipelines.Validate, fields: [:url, :name, :price, :discount_price]},
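
For context, here is a sketch of how the whole pipelines section of config/config.exs looks after this change; everything except the Validate line is unchanged from the generated config:

pipelines: [
  # Drop items that are missing any of the listed fields
  {Crawly.Pipelines.Validate, fields: [:url, :name, :price, :discount_price]},

  # Use the URL as the item's unique identifier; duplicates are dropped
  {Crawly.Pipelines.DuplicatesFilter, item_id: :url},
  Crawly.Pipelines.JSONEncoder,
  {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
]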

Now that the configuration is done, let’s try to get the data from the webpage. We start by opening it in a browser:

A quick view of the example page shows quite a good structure

The HTML is quite clear. The class names are a bit convoluted, but extracting the data we need is possible.

The next step is to look at the page with JavaScript switched off. I am using a Chrome extension called “Toggle JavaScript” for that.

Here is what I can see:

The page is now empty, and the natural guess is that all the data is inserted by dynamic code. As a result, most ordinary scraping tools cannot get the data out of it.

There are no more elements on the page. Switching JS off makes the page unscrapable.

A quick check in the Elixir shell confirms this conclusion:

# Fetch the page
iex> response = Crawly.fetch("https://www.hallon.se/")

# Parse it
iex> doc = Floki.parse_document!(response.body)

# Try to extract the package information
iex> Floki.find(doc, ~s(li[class*="OfferCardstyled"] h2:last-child))
[
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
[]},
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
[]},
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
[]},
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
[]},
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
[]}
]

So, as expected, we cannot get the discounted price this way.

Solving the problem with the Crawly Render Server

As promised, the Crawly Render Server does the same job as Splash: it takes the requests you need rendered in a browser and runs them through headless Chrome. Let’s see how it’s done.

Running the Render Server

# Clone the repository
git clone git@github.com:elixir-crawly/crawly-render-server.git
cd ./crawly-render-server

# Build and run the Docker container:
docker run -p 3000:3000 --rm -it $(docker build -q .)

The commands above clone the Crawly Render Server from GitHub, build the Docker image, and run the container. The HTTP API is then available on port 3000.

Now let's test it with curl:

curl -H "Content-Type: application/json" \
localhost:3000/render \
-d '{"url": "https://example.com"}'

It should give an HTTP response, and you should see something like this in the Crawly Render Server output:

Server is running on port 3000
Served Requests: 0
Error Count: 0
[DEBUG] Fetched 'https://example.com' -> 'https://example.com/' status: 200 (0.418s)
[DEBUG] Fetched 'https://example.com' -> 'https://example.com/' status: 200 (0.4s)
Served Requests: 2
Error Count: 0
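
The same smoke test can be done from IEx. Here is a minimal sketch; HTTPoison is already present as a Crawly dependency, while Jason is an assumption on my side (add {:jason, "~> 1.4"} to your deps if your project does not have it):

# Build the JSON request body and POST it to the render endpoint
iex> body = Jason.encode!(%{url: "https://example.com"})
iex> headers = [{"Content-Type", "application/json"}]
iex> {:ok, response} = HTTPoison.post("http://localhost:3000/render", body, headers)
iex> response.status_code
200

The rendered page comes back in response.body.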

Configuring Crawly

As soon as the rendering server is running, we need to tell Crawly to use it. Add the following line to your Crawly configuration:

fetcher: {Crawly.Fetchers.CrawlyRenderServer, base_url: "http://localhost:3000/render"}
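
In context, the top of the generated config/config.exs now starts like this (a sketch; the fetcher entry is just one more key of the config :crawly block):

import Config

config :crawly,
  fetcher: {Crawly.Fetchers.CrawlyRenderServer, base_url: "http://localhost:3000/render"},
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  closespider_itemcount: 100,
  middlewares: [
    # ... unchanged from the generated config
  ],
  pipelines: [
    # ... unchanged from the generated config
  ]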

Now, restart the Crawly shell and try to fetch the page above one more time:

# Fetch the page
iex> response = Crawly.fetch("https://www.hallon.se/")

# Parse it
iex> doc = Floki.parse_document!(response.body)

# Extract the data
iex> Floki.find(doc, ~s(li[class*="OfferCardstyled"] h2:last-child))
[
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
["19 kr/mån"]},
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
["19 kr/mån"]},
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
["19 kr/mån"]},
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
["19 kr/mån"]},
{"h2", [{"class", "Typographystyled__StyledTypography-sc-1keselg-0 cvhOyz"}],
["19 kr/mån"]}
]

This time, the expected information is in the response.
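
To round things off, here is a sketch of a spider for this target. The Crawly.Spider callbacks and the ParsedItem struct are regular Crawly API, but the :name and :price selectors are my assumptions (only the h2:last-child selector was verified above), so check them against the live markup:

defmodule HallonSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.hallon.se"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.hallon.se/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    document = Floki.parse_document!(response.body)

    items =
      document
      |> Floki.find(~s(li[class*="OfferCardstyled"]))
      |> Enum.map(fn card ->
        %{
          url: response.request_url,
          # Hypothetical selectors: verify them in the browser first
          name: card |> Floki.find("h3") |> Floki.text(),
          price: card |> Floki.find("h2:first-of-type") |> Floki.text(),
          discount_price: card |> Floki.find("h2:last-child") |> Floki.text()
        }
      end)

    %Crawly.ParsedItem{items: items, requests: []}
  end
end

Run it with Crawly.Engine.start_spider(HallonSpider), and validated items will be written to /tmp as configured by the WriteToFile pipeline.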

Conclusions

In this article, I have shown another way of solving the JavaScript rendering problem, with a replacement for Splash. However, I must admit that the Crawly Render Server is a prototype and cannot yet guarantee a perfect experience for commercial needs.

We plan to continue evolving it based on the projects we have in our pipeline, and, as usual, I would highly appreciate feedback and contributions to the Crawly ecosystem, on browser rendering and other concerns.

That’s all for today! Thanks for reading!
