How To Scrape Media Articles in Just a Few Clicks

Saving time and money through a web scraping browser extension

Photo by Headway on Unsplash

Scraping web pages is one of the most effective ways to retrieve data from the web. However, every web page is different, so extracting data from one programmatically requires a scraping script with custom logic.

Building such a script costs time and money. Luckily, several scraping services have recently appeared that let you scrape the web in just a few clicks, so you no longer have to write code to reach your data extraction goals!

Here, you will learn how to extract data from a media article with Listly, a scraping service that contacted me to test their product and review it honestly. Let’s jump into the article!

A media article generally consists of:

Not surprisingly, the most important information here is the text of the article, but images and videos matter too. When dealing with multimedia files, keep in mind that they might be protected by copyright. To avoid problems, you may be required to credit where a file comes from, so retrieving the source and author of each image or video is essential.

Then, you can use all this info to create a news aggregator app, add a news section to your website or app, study how a media article changes over time for marketing purposes, create a data source for a machine learning algorithm that studies how language works, or simply share the article with your friends.

Now, let’s delve into the tool chosen to scrape data from media articles.

“Listly is a Web Scraping Service for everyone from non-technical marketers to advanced developers. It turns web pages into an Excel spreadsheet within seconds. The extracted data is used for retail, research, big data, and other data-related works.” — FAQ — Listly.io

The recommended way to use Listly is through the official Chrome extension, which has already been downloaded by more than 60k users.

Without further ado, let’s learn how to scrape data from media articles with Listly in a step-by-step tutorial with images.

1. Getting started with Listly

First, you need a Listly account. Visit this page, fill out the form, and click on “SIGN UP”.

The Listly Sign Up page

You will receive the following email in your inbox to verify your email address:

The Listly verification email

Click on “Verify email” and you should now have a valid Listly account.

Now, you need to install the Listly Chrome extension. All you have to do is visit the Listly website and click on “ADD TO CHROME”.

The Listly.io homepage

Keep in mind that you can test Listly for free, but the free plan comes with some limitations. This means that if you want a complete experience, you need a paid plan.

Now, you have everything you need to start scraping data from websites. But before using the extension, I recommend pinning Listly to the Chrome extension toolbar by clicking the following button:

Pinning the Listly Chrome extension

2. Selecting the article to scrape

Now, visit a media website and choose the article you want to scrape. In this tutorial, you will see how to scrape the “Why Italy’s ‘king of chocolate’ is so delicious” article from the CNN website.

This is what the article looks like:

The full view of the selected media article

As you can see, it is a long and detailed article with several images. The main challenge with scraping media articles is that they generally consist of several blocks of text. Also, there might be many ads, images, and embeds between them. So, developing a scraping script to retrieve the data you are interested in can involve complex logic. But you can avoid all this with Listly!
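To give a sense of the custom logic a hand-written scraper needs, here is a minimal sketch using only the Python standard library. It keeps the text of `<p>` blocks and skips anything nested inside ad-like containers. The class names in `SKIP_CLASSES` and the sample HTML are hypothetical and are not taken from the CNN page. A real site would need its own rules, which is exactly the maintenance cost a tool like Listly spares you.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collect <p> text, skipping <div> blocks whose class marks them as noise."""

    # Hypothetical class names; every real site uses its own markup.
    SKIP_CLASSES = {"ad", "related-content"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._skip_depth = 0  # > 0 while inside a skipped <div>
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag == "div":
            # Once inside a skipped block, count every nested <div> too.
            if self._skip_depth or classes & self.SKIP_CLASSES:
                self._skip_depth += 1
        elif tag == "p" and not self._skip_depth:
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p" and self._in_p:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the article paragraphs found in an HTML string."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up markup standing in for a media article.
SAMPLE_HTML = """
<article>
  <p>First paragraph of the story.</p>
  <div class="ad"><p>Buy now!</p></div>
  <p>Second paragraph, with <em>inline</em> markup.</p>
  <div class="related-content"><div><p>Read more</p></div></div>
</article>
"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE_HTML):
        print(paragraph)
```

Even this toy version has to track nesting depth and inline tags, and it would break as soon as the site changed its markup.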

Now, let’s see how Listly allows you to scrape such a page with just a bunch of clicks and no code.

3. Scraping a CNN article with Listly in a few clicks

Visit the page of the article you selected and click on the Listly icon in the Chrome extension toolbar.

This is what the popup window displayed from the Listly extension looks like:

The Listly extension popup window

Since a media article is not table-like and you want to scrape the entire article, click on “LISTLY WHOLE”.

Wait for Listly to do its magic, and you should be redirected to the page below:

The Listly Databoard page

This is the Databoard page, where you can decide what data to scrape and what to ignore. Notice how Listly automatically scrapes and organizes all the cards found on the source webpage for you.

By exploring the data in the Listly interface, you should notice that the tab with 58 cards is the one containing what you are looking for. However, only some of those 58 cards are actually relevant. To select only those, choose “Select Cards” in the “Selected Cards” input field.

This is what your Listly Databoard page should now look like:

The Listly Databoard page with the “Select Cards” option enabled

Now, each card has a checkbox you can use to select or deselect it. Only the cards you mark as selected will be taken into account in the final data extraction process.

After selecting the cards of interest, click on the “EXCEL” button to export the extracted data into an Excel file. A LISTLY_SINGLE_XXXXXX_YYYYY.xlsx file will be automatically downloaded.
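If you end up with several exports in your downloads folder, a small helper can tell them apart by name. The sketch below assumes the file name follows `LISTLY_SINGLE_<id>_<YYYYMMDD>.xlsx`, which is only inferred from the single export produced in this tutorial and is not a documented Listly convention. The function name `parse_export_name` is my own.

```python
import re
from datetime import date, datetime
from typing import Optional, Tuple

# Assumed pattern, inferred from one example file name (not documented by Listly).
EXPORT_NAME = re.compile(
    r"^LISTLY_SINGLE_(?P<export_id>[A-Za-z0-9]+)_(?P<stamp>\d{8})\.xlsx$"
)


def parse_export_name(filename: str) -> Optional[Tuple[str, date]]:
    """Return (export id, export date), or None if the name does not match."""
    match = EXPORT_NAME.match(filename)
    if match is None:
        return None
    try:
        exported_on = datetime.strptime(match["stamp"], "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None
    return match["export_id"], exported_on


if __name__ == "__main__":
    print(parse_export_name("LISTLY_SINGLE_ZReu0qsW_20220506.xlsx"))
```

If your files are named differently, adjust the regular expression accordingly.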

Open the Excel file, and you should see the data scraped from the CNN article that you manually selected organized in cells as in the image below:

The LISTLY_SINGLE_ZReu0qsW_20220506.xlsx file exported from Listly

As you can see, the LABEL 1 column contains all the paragraphs, image URLs, and subtitles. The LABEL 2 column stores the TL;DR section and image captions, while the LABEL 3 column has the image author and copyright information.

In short, these three columns hold the most important data you can retrieve from a media article.
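Should you later want to post-process the export programmatically, the three columns map naturally onto paragraphs and credited images. The following sketch works on rows already loaded as dictionaries keyed by column name. It assumes that an image row carries its URL in LABEL 1, its caption in LABEL 2, and its credit in LABEL 3 on the same row, which may not hold for every export. The sample rows are made up, and `ImageCredit` and `split_rows` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ImageCredit:
    url: str
    caption: str
    credit: str  # author and copyright information


def split_rows(rows: List[Dict[str, str]]) -> Tuple[List[str], List[ImageCredit]]:
    """Separate text rows from image rows, keeping each image's caption and credit."""
    paragraphs: List[str] = []
    images: List[ImageCredit] = []
    for row in rows:
        main = (row.get("LABEL 1") or "").strip()
        caption = (row.get("LABEL 2") or "").strip()
        credit = (row.get("LABEL 3") or "").strip()
        if main.startswith(("http://", "https://")):
            # Assumption: a URL in LABEL 1 means the row describes an image.
            images.append(ImageCredit(main, caption, credit))
        elif main:
            paragraphs.append(main)
    return paragraphs, images


# Made-up rows that mimic the layout described above.
SAMPLE_ROWS = [
    {"LABEL 1": "A subtitle", "LABEL 2": "", "LABEL 3": ""},
    {
        "LABEL 1": "https://example.com/photo.jpg",
        "LABEL 2": "A caption for the photo",
        "LABEL 3": "Jane Doe/Example Agency",
    },
    {"LABEL 1": "A paragraph of the article.", "LABEL 2": "", "LABEL 3": ""},
]

if __name__ == "__main__":
    paragraphs, images = split_rows(SAMPLE_ROWS)
    print(paragraphs)
    for image in images:
        print(f"{image.url} | {image.caption} | Credit: {image.credit}")
```

One way to obtain such rows is `pandas.read_excel(path).fillna("").to_dict("records")`, which needs the pandas and openpyxl packages installed. Keeping the credit next to each image URL makes it easier to attribute multimedia files correctly when you reuse them.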

Et voilà! With just a few clicks, you can scrape a web page containing heterogeneous and structured content, all without writing a single line of code.

In this article, we looked at what data you should scrape from a media article, why, and how to do it without writing a single line of code. This was possible thanks to Listly, a web scraping service with a powerful, fast, and easy-to-use browser extension that lets you scrape any website. As shown, Listly comes with some minor pitfalls, but my experience with it has been good overall.

Thanks for reading! I hope that you found this article helpful. Feel free to reach out to me with any questions, comments, or suggestions.
