Let’s build your personal SEMRush with Elixir!

Oleg Tarasenko
5 min read · Feb 11, 2021

Hey everyone! I am delighted that my article about scraping Google with Elixir and Crawly has gathered 200 claps! As promised, we will now take the extra steps to follow Google's pagination and get more meaningful data from the results. So let's do it in practice!

Setting the task

Many online businesses rely on search engines as their primary source of traffic and revenue, especially eCommerce websites, tools, and service providers. It is commonly claimed that around 80% of the clicks go to the website that holds the first position in the search results.

Usually, digital companies start with paid search advertising to get their first sales and to get the business going. However, it soon becomes clear that ranking first organically is far more favorable. At this point, companies start conducting promotional activities (usually investing in high-quality backlinks) and tracking their position changes.

Another part of their digital strategy is competitor analysis, which helps them understand how hard promotion will be under the given conditions.

Right now, several companies offer tools that allow this kind of tracking, but I would highlight one of them: SEMRush.

What is SEMRush?

Well, SEMRush is definitely a giant in the world of internet marketing. They provide amazing visibility into your positions and the positions of your competitors. Their position tracking is simply beyond expectations! You can read more about them here.

However, one of the weak points of their approach is the pricing model. It's tough for a startup to afford their amazing tool. I totally understand that they provide excellent value for the money; however, sometimes you simply can't afford it, and that's the sad reality :(.

Can we still do something to match their level?

As far as I know, no other tool provides the same level of visibility. Please let me know in the comments if you are aware of another one worth trying :). However, as part of this tutorial, we will build our own crawler in Elixir that covers the core of what they offer: position tracking :).

This is a bit of an advanced tutorial, so if you are looking for something simpler to start with, please have a glance here. Otherwise, you can always get help from us on our Telegram channel.

Defining the task

Let’s build a crawler that gathers results from multiple pages and tracks the position of items it extracts.

This data then allows us to build a historical chart of position changes for a given set of competitors and a given keyword.

We will base our work on the simple Google crawler we built last time. To recap, that crawler extracts data (URL, title, description) from the first page of Google search results. This time, we will extend the extracted data to: URL, title, description, and position in the search results.
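To make the target concrete, a single extracted item could look roughly like this (illustrative values only; the field names are an assumption based on the fields listed above):

%{
  url: "https://example.com/some-page",
  title: "Example result title",
  description: "Example snippet shown under the result",
  position: 12
}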

Getting started

Let's clone the Google SERP Crawly repository to continue from where we stopped in the first article:

git clone https://github.com/oltarasenko/crawly_serp.git
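After cloning, the usual Mix workflow applies: fetch the dependencies and open an IEx session (the directory name below is simply taken from the repository URL):

cd crawly_serp
mix deps.get
iex -S mix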

Let's also define, in the main module of the application (google_serp.ex), an interface that allows starting the spider with a given query:

@doc """
Start a crawler with a given keyword as a starting point

## Examples

    iex> GoogleSerp.crawl("scraping elixir")
    :ok

"""
def crawl(query) do
  Crawly.Engine.start_spider(Spider, query: query)
end

This way, no one will have to guess how to start things. You can call GoogleSerp.crawl("scraping elixir"), and it will give you the expected results!

Extracting requests for Google's pagination

Let's explore how to follow Google's pagination element.

It turns out that there are at least two options:

  • Extracting links using CSS selectors.
We have discovered that all pagination elements have an "fl" class.

So it's possible to get the links with the following:

links = Floki.find(document, "a.fl") |> Floki.attribute("href")
  • However, there is a better (and simpler) way of doing that. It turns out that Google will automatically jump to the next result page as soon as you add a &start=<number> GET parameter to your URL.
Google automatically jumps to the third page if you set start=20

We will use this second approach, since we are only interested in the first 100 placements in Google, and it requires just a tiny adjustment of the init function we defined earlier.

@impl Crawly.Spider
def init(options) do
  # Build a search URL from the query and the start offset
  template = fn query, start ->
    "https://www.google.com/search?q=#{query}&start=#{start}"
  end

  starts = for x <- 0..10, do: x * 10

  # Read the search query from the options passed by the main module
  query =
    options
    |> Keyword.get(:query, "scraping elixir")
    |> URI.encode()

  [start_urls: Enum.map(starts, fn x -> template.(query, x) end)]
end
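To sanity-check the generated URLs, you can call the callback directly in IEx (a hypothetical session with abbreviated output; Spider is the same module referenced in crawl/1 above):

iex> Spider.init(query: "scraping elixir")
[
  start_urls: [
    "https://www.google.com/search?q=scraping%20elixir&start=0",
    "https://www.google.com/search?q=scraping%20elixir&start=10",
    # ...and so on, up to...
    "https://www.google.com/search?q=scraping%20elixir&start=100"
  ]
]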

Ok, that’s it. Now let's see how to add a position to the extracted item.

Extending the item structure with the current placement on the page

Let’s extend our parse_item function in the following way:

  1. Find the page offset (the value of the start parameter, i.e. the position of the first result on the page) from the given response:

page_number =
  response.request_url
  |> URI.parse()
  |> Map.get(:query)
  |> URI.decode_query()
  |> Map.get("start", "0")
  |> String.to_integer()

  2. Add the element's position to the item structure:

items =
  search_results
  |> Enum.with_index()
  |> Enum.map(fn {block, i} ->
    block
    |> parse_search_result()
    |> Map.put(:position, page_number + i)
  end)
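For reference, here is how the two steps could fit together inside the full parse_item callback. This is only a sketch: the result-block selector (div.g) is an assumption, parse_search_result/1 is the helper from the first article, and the return value follows Crawly's ParsedItem struct.

@impl Crawly.Spider
def parse_item(response) do
  {:ok, document} = Floki.parse_document(response.body)

  # The result-block selector is assumed; reuse the one from the first article
  search_results = Floki.find(document, "div.g")

  # Step 1: read the page offset from the request URL
  page_number =
    response.request_url
    |> URI.parse()
    |> Map.get(:query)
    |> URI.decode_query()
    |> Map.get("start", "0")
    |> String.to_integer()

  # Step 2: attach the absolute position to every extracted item
  items =
    search_results
    |> Enum.with_index()
    |> Enum.map(fn {block, i} ->
      block
      |> parse_search_result()
      |> Map.put(:position, page_number + i)
    end)

  %Crawly.ParsedItem{items: items, requests: []}
end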

We’re done now! It should be sufficient for our goals.

Reviewing the scraped results

Let’s start the magic:

GoogleSerp.crawl("crawly elixir")
14:47:45.570 [debug] Starting the manager for Elixir.Spider
14:47:45.574 [debug] Starting requests storage worker for Elixir.Spider...
14:47:45.580 [debug] Started 8 workers for Elixir.Spider
:ok

And after some time, you will have the data baked for you:

Google search results from multiple pages, with a position

Finally, you will have to run the job periodically to build a historical trace of position changes, similar to what you would see in SEMRush.
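For the periodic part, a cron-style library such as Quantum is a natural fit, but a minimal hand-rolled scheduler is enough to get started. The GoogleSerp.Scheduler module below is a hypothetical sketch (not part of the repository) that re-crawls a keyword once a day:

defmodule GoogleSerp.Scheduler do
  # Hypothetical helper: re-crawls a keyword once a day so that
  # position changes can be traced over time.
  use GenServer

  def start_link(query) do
    GenServer.start_link(__MODULE__, query)
  end

  @impl true
  def init(query) do
    schedule_crawl()
    {:ok, query}
  end

  @impl true
  def handle_info(:crawl, query) do
    GoogleSerp.crawl(query)
    schedule_crawl()
    {:noreply, query}
  end

  defp schedule_crawl do
    # Run once every 24 hours
    Process.send_after(self(), :crawl, :timer.hours(24))
  end
end

You could start it with GoogleSerp.Scheduler.start_link("crawly elixir") or add it to your supervision tree.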

As usual, everything can be found here on GitHub.

Conclusions

This article shows how to build a crawler that allows you to track your positions in Google search results, which gives you something like a free version of SEMRush. Of course, we are far from the naive idea that we can completely match their infrastructure and features. However, it can be a good starting point for people who are just launching their business and looking for a cheap way to do the same job.

Also, I want to suggest another small challenge for my dear readers :). Let's reach 300 claps on this article so that I can gauge the interest. After that, I promise to deploy the Google crawler that we have just built on our CrawlyUI demo instance, with the possibility to crawl by search terms :). So imagine a lightweight version of SEMRush available for your needs without any hassle.

If you find what we are doing interesting, we would highly appreciate your support for the project!
