Saving time and money through a web scraping browser extension
Scraping web pages is one of the most effective ways to retrieve data from the web. On the other hand, every web page is different, and extracting data from it programmatically requires a scraping script involving custom logic.
Building such a script costs you time and money. Luckily, many scraping services that let you scrape the web with just a few clicks have recently been developed. So, you no longer have to write code to achieve your data extraction goals!
Here, you will learn how to extract data from a media article with Listly, a scraping service that contacted me to test their product and review it honestly. Let’s jump into the article!
A media article generally consists of text, images, and videos, often accompanied by information about their source and author.
Not surprisingly, the most important information here is the text of the article, but images and videos matter too. In particular, when dealing with multimedia files, keep in mind that they might be protected by copyright. To avoid problems, you may be required to credit where a multimedia file comes from. So, retrieving the source and author of an image or a video is crucial.
Then, you can use all this info to create a news aggregator app, add a news section to your website or app, study how a media article changes over time for marketing purposes, create a data source for a machine learning algorithm that studies how language works, or simply share the article with your friends.
Now, let’s delve into the tool chosen to scrape data from media articles.
“Listly is a Web Scraping Service for everyone from non-technical marketers to advanced developers. It turns web pages into an Excel spreadsheet within seconds. The extracted data is used for retail, research, big data, and other data-related works.” — FAQ — Listly.io
The recommended way to use Listly is through the official Chrome extension, which has already been downloaded by more than 60k users.
Let’s not waste any more time and learn how to scrape data from media articles with Listly in a step-by-step tutorial with images.
1. Getting started with Listly
First, you need a Listly account. Visit this page, fill out the form, and click on “SIGN UP”.
You will receive the following email in your inbox to verify your email address:
Click on “Verify email” and you should now have a valid Listly account.
Now, you need to install the Listly Chrome extension. All you have to do is visit the Listly website and click on “ADD TO CHROME”.
Keep in mind that you can test Listly for free, but the free plan comes with some limitations. This means that if you want a complete experience, you need a paid plan.
Now, you have everything you need to start scraping data from websites. But before using it, I recommend pinning Listly to the Chrome extension toolbar by clicking the following button:
2. Selecting the article to scrape
Now, visit a media website and choose the article you want to scrape. In this tutorial, you will see how to scrape the “Why Italy’s ‘king of chocolate’ is so delicious” article from the CNN website.
This is what the article looks like:
As you can see, it is a long and detailed article with several images. The main challenge with scraping media articles is that they generally consist of several blocks of text. Also, there might be many ads, images, and embeds between them. So, developing a scraping script to retrieve the data you are interested in can involve complex logic. But you can avoid all this with Listly!
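To give a sense of the custom logic a hand-written scraper needs, here is a minimal sketch that uses only Python's standard library. It is not how Listly works internally, and it does not target the real CNN markup: the sample HTML, the `ad` class name, and the `ArticleParser` and `parse_article` names are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a real article page.
SAMPLE_HTML = """
<article>
  <p>First paragraph of the story.</p>
  <div class="ad"><p>Buy now!</p><img src="https://example.com/ad.png"></div>
  <img src="https://example.com/photo.jpg" alt="A photo">
  <p>Second paragraph, with <a href="#">a link</a> inside.</p>
</article>
"""


class ArticleParser(HTMLParser):
    """Collects paragraph text and image URLs, skipping <div class="ad"> blocks."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.images = []
        self._in_paragraph = False
        self._buffer = []
        self._ad_depth = 0  # greater than 0 while inside an ad container

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._ad_depth:
            # Track nested divs so we know when the ad container really ends.
            if tag == "div":
                self._ad_depth += 1
            return
        if tag == "div" and "ad" in (attrs.get("class") or "").split():
            self._ad_depth = 1
        elif tag == "p":
            self._in_paragraph = True
            self._buffer = []
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if self._ad_depth:
            if tag == "div":
                self._ad_depth -= 1
            return
        if tag == "p" and self._in_paragraph:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        # Keep text only when inside a paragraph that is not part of an ad.
        if self._in_paragraph and not self._ad_depth:
            self._buffer.append(data)


def parse_article(html):
    """Return (paragraphs, image_urls) extracted from an HTML string."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs, parser.images


if __name__ == "__main__":
    paragraphs, images = parse_article(SAMPLE_HTML)
    print(paragraphs)
    print(images)
```

Even this toy version has to track nesting, filter ads, and stitch text fragments back together, and a real page would need rules tailored to its specific markup.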
Now, let’s see how Listly allows you to scrape such a page with just a bunch of clicks and no code.
3. Scraping a CNN article with Listly in a few clicks
Visit the page of the article you selected and click on the Listly icon in the Chrome extension toolbar.
This is what the popup window displayed from the Listly extension looks like:
Since a media article is not table-like and you want to scrape the entire article, click on “LISTLY WHOLE”.
Wait for Listly to do its magic, and you should be redirected to the page below:
This is the Databoard page, where you can decide what data to scrape and what to ignore. Notice how Listly automatically scrapes and organizes all the cards found on the source webpage for you.
By exploring the data the Listly interface offers you, you should notice that the tab with 58 cards is the one containing what you are looking for. But only some of the 58 cards are actually interesting. To select only the relevant ones, choose “Select Tabs” in the “Selected Cards” input field.
This is what your Listly Databoard page should now look like:
Now, each card has a check button you can use to select or deselect it. Only the cards you mark as selected will be taken into account in the final data extraction process.
After selecting the cards of interest, click on the “EXCEL” button to export the extracted data into an Excel file. A LISTLY_SINGLE_XXXXXX_YYYYY.xlsx file will be automatically downloaded.
Open the Excel file, and you should see the data you manually selected from the CNN article, organized in cells as in the image below:
As you can see, the LABEL 1 column contains all the paragraphs, image URLs, and subtitles. The LABEL 2 column stores the TL;DR section and image captions, while the LABEL 3 column has the image author and copyright information.
Basically, these three columns contain all the most important data you can retrieve from a media article.
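If you later want to process the export programmatically, something like the following sketch might be a starting point. It assumes you have saved the sheet as CSV with the LABEL 1, LABEL 2, and LABEL 3 headers, and that an image URL shares its row with its caption and credit. That row alignment is an assumption rather than something Listly guarantees, and the sample rows and the `split_rows` name are made up for illustration.

```python
import csv
import io

# Hypothetical rows mimicking the three exported columns.
SAMPLE_CSV = """LABEL 1,LABEL 2,LABEL 3
Opening paragraph of the article.,,
https://example.com/photo.jpg,A caption for the photo,Photographer Name/Agency
A subtitle,,
"""


def split_rows(csv_text):
    """Separate text blocks from image records in the exported rows."""
    text_blocks, images = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        first = (row.get("LABEL 1") or "").strip()
        if not first:
            continue  # nothing usable in the first column
        if first.startswith(("http://", "https://")):
            # Assumption: caption and credit sit on the same row as the URL.
            images.append({
                "url": first,
                "caption": (row.get("LABEL 2") or "").strip(),
                "credit": (row.get("LABEL 3") or "").strip(),
            })
        else:
            text_blocks.append(first)
    return text_blocks, images


if __name__ == "__main__":
    text_blocks, images = split_rows(SAMPLE_CSV)
    print(text_blocks)
    print(images)
```

Keeping the credit next to each image URL makes it easier to attribute the source whenever you reuse a picture.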
Et voilà! With just a few clicks, you can scrape a web page containing heterogeneous and structured content, all without writing a single line of code.
In this article, we looked at what data you should scrape from a media article, why, and how to do it without writing a single line of code. This was possible thanks to Listly, a web scraping service that comes with a powerful, easy-to-use, and fast browser extension that empowers you with the ability to scrape any website. As shown, Listly comes with some minor pitfalls, but my experience with it has been good overall.
Thanks for reading! I hope that you found this article helpful. Feel free to reach out to me with any questions, comments, or suggestions.