How to convert The New York Times into your database :)

Oleg Tarasenko
3 min read · Jan 19, 2021
Extracting data from newspapers? RSS is your best friend. (Image by MichaelGaida from Pixabay)

In one of my previous articles, we discussed why one might want to extract data from the Internet. In another article, we expanded on that case and demonstrated how to use scraping as part of a price monitoring solution. Now, let's dig into the area of journalism. It might sound a bit unexpected, but journalists rely on web scraping quite a bit these days. The following articles serve as examples:

From our side, we can think of the following use case: let's imagine you're running a local news website, and you're trying to stay informed about the most recent news in a specific area. Of course, you can send a reporter to every city in the area, or... you can periodically scrape the data from their websites and create an alert for specific keywords (e.g., those that require extra attention).
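To make the alerting idea concrete, here is a minimal sketch of a keyword filter over scraped items (the item shape, the keyword list, and the module name are made up purely for illustration):

defmodule KeywordAlert do
  # Keywords that should trigger extra attention (illustrative only).
  @keywords ["flood", "election", "wildfire"]

  # Keep only the items whose title mentions one of the watched keywords.
  def filter_alerts(items) do
    Enum.filter(items, fn %{title: title} ->
      title = String.downcase(title)
      Enum.any?(@keywords, &String.contains?(title, &1))
    end)
  end
end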

Setting the task

Let’s extract data from one of the categories of The New York Times online newspaper (e.g., the World category). We will be interested in titles, links, and descriptions. However, this time, we will rely on the RSS feed as a source of links and other information.

NOTE: Sometimes a crawled website might be fairly complex; for example, it may rely heavily on JavaScript or pose other non-trivial challenges. In these cases, the idea of starting from its RSS feed (or at least using it for some parts of the crawling logic) might be a game-changer.

After some investigation of the target website, we have discovered that they do provide good RSS feeds to work with: https://archive.nytimes.com/www.nytimes.com/services/xml/rss/index.html

So let's use it as a starting point!

Bootstrapping the project

Now that we have a task, let's bootstrap everything by completing the following steps from the Quickstart guide:

1. Create a project

mix new rss_example --sup

2. Add Crawly & SweetXML to the mix file:

defp deps do
  [
    {:crawly, "~> 0.12.0"},
    {:sweet_xml, "~> 0.6.6"}
  ]
end
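After adding the dependencies, fetch them with the usual Mix command:

mix deps.get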

NOTE: In this case, we’re not using Floki, as we’re going to parse an XML file. Crawly itself does not require you to use any pre-defined parser, so you can choose a parser that suits your use case.

3. Create the config folder and config.exs file

use Mix.Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  middlewares: [
    # Identify the crawler with a custom User-Agent header
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
  ],
  pipelines: [
    # Encode each scraped item as JSON and write it to /tmp as a .jl file
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]

As you can see, we did not have to use a complex middleware chain in this case. It’s always easier when you need to parse a feed!

4. Now, finally, we should define XML selectors to extract the needed data.

Let’s fetch the RSS feed (e.g., https://rss.nytimes.com/services/xml/rss/nyt/World.xml) and have a look at it:

RSS Feed from NYT

As you can see, the information is stored under the item block. So we need to loop through all item elements, extracting the title, link, and description.
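If you want to sanity-check the selectors before wiring them into the spider, you can experiment in iex. The snippet below is only a sketch; it uses Crawly.fetch/1 as a convenient way to download the feed and assumes sweet_xml is compiled:

import SweetXml

response = Crawly.fetch("https://rss.nytimes.com/services/xml/rss/nyt/World.xml")

# Count the <item> elements found in the feed
response.body |> xpath(~x"//item"l) |> length()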

In our case, everything boils down to the following simple parse_item function:

@impl Crawly.Spider
def parse_item(response) do
  # Relies on `import SweetXml` in the spider module.
  # Grab every <item> element from the feed as a list.
  items_list = xpath(response.body, ~x"//item"l)

  items =
    Enum.map(
      items_list,
      fn i ->
        %{
          title: "#{xpath(i, ~x"//title/text()")}",
          link: "#{xpath(i, ~x"//link/text()")}",
          description: "#{xpath(i, ~x"//media:description/text()")}"
        }
      end
    )

  # No follow-up requests are needed: the feed already contains everything we want.
  %{requests: [], items: items}
end
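For context, parse_item/1 lives inside a spider module that also tells Crawly where to start crawling. A minimal sketch of such a module could look like this (the module name is mine; see the repository below for the author's exact version):

defmodule RssExample.WorldNewsSpider do
  use Crawly.Spider

  import SweetXml

  @impl Crawly.Spider
  def base_url(), do: "https://rss.nytimes.com"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://rss.nytimes.com/services/xml/rss/nyt/World.xml"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Same extraction logic as shown above
    items =
      response.body
      |> xpath(~x"//item"l)
      |> Enum.map(fn i ->
        %{
          title: "#{xpath(i, ~x"//title/text()")}",
          link: "#{xpath(i, ~x"//link/text()")}",
          description: "#{xpath(i, ~x"//media:description/text()")}"
        }
      end)

    %{requests: [], items: items}
  end
end

With the module compiled, a crawl can be started from iex with Crawly.Engine.start_spider(RssExample.WorldNewsSpider), and the scraped items should land in /tmp as configured above.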

The full code can be found here: https://github.com/oltarasenko/rss_example.

Conclusions

This article has shown how to create a spider that uses an RSS feed as a source of information and has demonstrated how to use an XML parser instead of Floki.

Looking for more?

Try our new spider management application called Crawly UI, which simplifies spider management and data analysis.
