Using Elixir and Crawly for price monitoring
In one of my previous articles, we discussed why one might want to extract data from the Internet. Now it’s time to get more specific and showcase one of the possible use cases: the real estate market.
A bit about the real estate market
The real estate market is highly competitive, with many companies trying to reach the same clients. Since the number of properties advertised online is constantly growing, publicly available data plays an increasingly important role, and those who learn to make use of this data will win clients.
Setting the task
Imagine that you’re a real estate agent interested in a set of properties available through multiple websites. You have discovered some options for your client, and now it’s time to keep an eye on them and see how their prices change over time, as this could give you useful insights and help your client make a choice.
Target websites
Since we’re located in Sweden, let’s assume you need information from two websites:
- Hemnet: probably one of the leaders when it comes to finding a property online.
- Booli: another market leader we want to get insights from.
We have checked the robots.txt entries for both websites to understand how they want their data to be accessed. It looks like there are no special restrictions for regular web scrapers, so we can move forward and extract the data.
To be specific, let’s define a shortlist of URLs from both websites for monitoring:
https://www.booli.se/annons/4060796
https://www.booli.se/annons/4060561
https://www.booli.se/annons/4060535
Please be aware that both websites are highly dynamic, so these properties may well be unavailable by the time you read this article.
Extracting prices with Crawly
Let's use the Crawly scraping framework to extract this data. For a price monitoring solution like this one, we usually need to fetch data from a predefined set of pages, so we will use one spider module for all the links. Let's get started!
Bootstrapping the project
Now that we have a task, let’s bootstrap everything by completing the following steps from the Quickstart guide:
1. Create a project
mix new price_monitoring --sup
2. Add Crawly & Floki to the mix file:
defp deps() do
  [
    {:crawly, "~> 0.12.0"},
    {:floki, "~> 0.26.0"}
  ]
end
Extracting the data from multiple sites
Finally, we need to go through three steps to complete the project!
1. Create the config folder and config.exs file
Compared to the setup from the previous article, which was dedicated to solving the authentication problem, we have excluded the `Crawly.Middlewares.DomainFilter` middleware here. We’re going to crawl the given set of URLs directly, so there is no need to filter them in any way.
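The full file is not reproduced here, so below is a minimal sketch of what config/config.exs could look like for this setup. The specific limits and pipeline options are illustrative assumptions; the omission of `Crawly.Middlewares.DomainFilter` and the /tmp output folder follow the description in this article.

```elixir
# config/config.exs
import Config

config :crawly,
  # Illustrative values: keep the crawl small and polite
  concurrent_requests_per_domain: 2,
  closespider_timeout: 10,
  middlewares: [
    # Crawly.Middlewares.DomainFilter is deliberately left out:
    # we crawl an explicit list of URLs, possibly from several domains.
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    # Keep only items that actually have a URL and a price
    {Crawly.Pipelines.Validate, fields: [:url, :price]},
    Crawly.Pipelines.JSONEncoder,
    # Scraped items end up under /tmp, as mentioned later in the article
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```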
2. Define high-level API
As you can see, our start function sends a list of URLs into the spider. Technically, the start function could just as well read them from a database you already have, which would make this even more flexible!
Please be aware that the current version of Crawly (0.12.0 at the time of writing) has a known bug that prevents initialization with unboundedly large collections of URLs, so for now it’s not possible to send 100_000 or more. Please bear with us :) Crawly is an open-source library and has its own bugs, like any other project, but we’re working on its stability!
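For reference, here is a minimal sketch of what such a high-level API module could look like. The module names (PriceMonitoring, PriceMonitoring.Spider) are assumptions, and handing the URL list over via the application environment is just one simple way to get it into the spider; the actual project may wire it up differently.

```elixir
defmodule PriceMonitoring do
  @moduledoc "High-level entry point for the price monitoring crawl."

  # The shortlist from above; these could just as well be read from a database.
  @urls [
    "https://www.booli.se/annons/4060796",
    "https://www.booli.se/annons/4060561",
    "https://www.booli.se/annons/4060535"
  ]

  def start(urls \\ @urls) do
    # Stash the URL list where the spider's init/0 can pick it up,
    # then start the spider through the Crawly engine.
    Application.put_env(:price_monitoring, :urls, urls)
    Crawly.Engine.start_spider(PriceMonitoring.Spider)
  end
end
```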
3. Define the spider file: spider.ex
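The spider itself is not reproduced here either, so the following is a rough sketch of what spider.ex could contain, assuming the URL handoff from the previous sketch. The CSS selector for the price is a placeholder; the real markup differs between the two sites and needs to be inspected.

```elixir
defmodule PriceMonitoring.Spider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.booli.se"

  @impl Crawly.Spider
  def init() do
    # Pick up the URL list handed over by PriceMonitoring.start/1
    [start_urls: Application.get_env(:price_monitoring, :urls, [])]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    item = %{
      url: response.request_url,
      # Placeholder selector; check each site's markup for the real one
      price: document |> Floki.find(".property__price") |> Floki.text(),
      fetched_at: DateTime.to_iso8601(DateTime.utc_now())
    }

    # No follow-up requests: we only monitor the pages we were given
    %Crawly.ParsedItem{items: [item], requests: []}
  end
end
```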
Now we can finally start our crawler with:
PriceMonitoring.start()
which will allow us to get the data (stored under the /tmp folder by default; see our config.exs definition for details).
Conclusions
This article has shown how to create a spider to monitor price information from multiple real estate websites. The final code for the project is available on GitHub.
Looking for more?
Does it all sound too complicated, and do you want to know how to get started with Crawly? Have a look at our getting started guide!
You can find more articles describing how to use Crawly on my Medium page. Also, try our solution for spider management!