By Sara Metwalli, Associate Editor at Towards Data Science.
No data science project is complete without data; I would even argue that you can’t say “data science” without data. In most data science projects, the data you need to analyze and use to build machine learning models is stored in a database somewhere. Sometimes, that somewhere is the web.
You may collect data from a specific webpage about a certain product or from social media to uncover patterns or perform sentiment analysis. Regardless of why you are collecting the data or how you intend to use it, collecting data from the web — web scraping — is a task that can be quite tedious, but one you will need to perform for your project to achieve its goals.
Web scraping is one of the important skills you need to master as a data scientist; you need to know how to find, collect, and clean your data so that your results are accurate and meaningful.
Web scraping has long been a legal gray area, so before we dive into tools that can help with your data extraction tasks, let’s make sure your activity is fully legal. In 2020, a US court fully legalized web scraping of publicly available data. That is, if anyone can find the data online (such as Wiki articles), then it’s legal to scrape it.
However, when you do that, make sure:
- That you don’t re-use or re-publish the data in a way that violates copyright.
- That you respect the terms of services of the site you’re trying to scrape.
- That you have a reasonable crawl-rate.
- That you don’t try to scrape private parts of the website.
As long as you don’t violate any of those terms, your web scraping activity should be on the legal side.
If you’re building your data science projects in Python, then you have probably used BeautifulSoup and requests to collect your data and Pandas to analyze it. This article presents 6 web scraping tools other than BeautifulSoup that you can use for free to collect the data you need for your next project.
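For context, the typical requests-plus-BeautifulSoup workflow — fetch a page, then pull out the tags you care about — can be sketched as below. To keep the example self-contained and runnable without third-party packages or network access, it uses the standard library’s `html.parser` in place of BeautifulSoup and a hard-coded HTML snippet in place of a live `requests.get` call; the mechanics are the same.

```python
from html.parser import HTMLParser

# Stand-in for the HTML you would normally fetch with requests.get(url).text
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First headline</h2>
  <p>Some body text.</p>
  <h2 class="title">Second headline</h2>
</body></html>
"""

class HeadlineParser(HTMLParser):
    """Collects the text of every <h2> tag, mimicking soup.find_all('h2')."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.headlines.append(data.strip())

parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
print(parser.headlines)  # ['First headline', 'Second headline']
```

The tools below take over exactly this fetch-and-extract loop, so you spend your time on analysis instead.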
№1: Common Crawl
The creators of Common Crawl built this tool because they believe everyone should have the chance to explore and analyze the world around them and uncover its patterns. In line with their open-source beliefs, they offer high-quality data — previously available only to large corporations and research institutes — to any curious mind, free of charge.
This means that if you are a university student, a person navigating your way into data science, a researcher looking for your next topic of interest, or just a curious person who loves to reveal patterns and find trends, you can use this tool without worrying about fees or any other financial complications.
Common Crawl provides open datasets of raw web page data and text extractions. It also offers support for non-code-based use cases and resources for educators teaching data analysis.
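Common Crawl also exposes a public CDX-style index API (at index.commoncrawl.org) that returns one JSON record per captured page, telling you which archive file holds a given URL. The sketch below parses such a record; the field names are the standard CDX index fields, but the concrete values here are made-up placeholders, and a real query would need an actual crawl ID from the Common Crawl site.

```python
import json

# One line of the kind returned by Common Crawl's CDX index API, e.g.
# https://index.commoncrawl.org/CC-MAIN-<crawl-id>-index?url=example.com&output=json
# Field names are standard CDX fields; the values are illustrative placeholders.
sample_record = (
    '{"urlkey": "com,example)/", "timestamp": "20210101000000", '
    '"url": "https://example.com/", "status": "200", '
    '"filename": "crawl-data/CC-MAIN-placeholder/warc/segment.warc.gz", '
    '"offset": "123456", "length": "7890"}'
)

record = json.loads(sample_record)

# offset/length give the byte range of the archive file holding this capture,
# so you can fetch just that slice with an HTTP Range request instead of
# downloading the whole multi-gigabyte archive.
start = int(record["offset"])
end = start + int(record["length"]) - 1
range_header = f"bytes={start}-{end}"
print(record["url"], range_header)  # https://example.com/ bytes=123456-131345
```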
№2: Crawly
Crawly is another amazing choice, especially if you only need to extract basic data from a website or if you want the extracted data in CSV format so you can analyze it without writing any code.
All you need to do is input a URL, the email address the extracted data should be sent to, and the format you want your data in (CSV or JSON), and voila, the scraped data is in your inbox for you to use. You can use the JSON format and then analyze the data in Python using Pandas and Matplotlib, or in any other programming language.
Although Crawly is perfect if you’re not a programmer or are just starting out with data science and web scraping, it has its limitations: it can only extract a limited set of HTML tags, including Title, Author, Image URL, and Publisher.
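As a sketch of what working with such an export looks like, the snippet below flattens a Crawly-style JSON payload into CSV using only the standard library. The field names are inferred from the tags listed above; the schema of a real Crawly export may differ, so treat them as an assumption.

```python
import csv
import io
import json

# Hypothetical Crawly-style JSON export; real field names may differ.
payload = json.loads("""
[
  {"title": "Post A", "author": "Jane Doe",
   "image_url": "https://example.com/a.png", "publisher": "Example Blog"},
  {"title": "Post B", "author": "John Roe",
   "image_url": "https://example.com/b.png", "publisher": "Example Blog"}
]
""")

# Write the records to CSV so they can be opened in a spreadsheet
# or loaded into Pandas with pd.read_csv().
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author", "image_url", "publisher"])
writer.writeheader()
writer.writerows(payload)

print(buffer.getvalue().splitlines()[0])  # title,author,image_url,publisher
```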
№3: Content Grabber
Content Grabber is one of my favorite web scraping tools because it is very flexible: if you just want to scrape a webpage without specifying any other parameters, you can do so through its simple GUI; if you want full control over the extraction parameters, Content Grabber gives you that option too.
One of Content Grabber’s advantages is that you can schedule it to scrape information from the web automatically. As we all know, most webpages update regularly, so scheduled, regular content extraction can be quite beneficial.
It also offers a wide variety of formats for the extracted data, from CSV and JSON to SQL Server or MySQL.
№4: Webhose.io
Webhose.io is a web scraper that lets you extract enterprise-level, real-time data from any online resource. The data collected by Webhose.io is structured, clean, enriched with sentiment and entity recognition, and available in different formats such as XML, RSS, and JSON.
Webhose.io offers comprehensive data coverage for any public website. Moreover, it offers many filters to refine your extracted data, so you can perform fewer cleaning tasks and jump straight into the analysis phase.
The free version of Webhose.io provides 1000 HTTP requests per month. Paid plans offer more calls, more control over the extracted data, and additional benefits such as image analytics, geolocation, and up to 10 years of archived historical data.
№5: ParseHub
ParseHub is a potent web scraping tool that anyone can use free of charge. It offers reliable, accurate data extraction at the click of a button. You can also schedule scraping times to keep your data up to date.
One of ParseHub’s strengths is that it can scrape even the most complex webpages hassle-free. You can instruct it to search through forms and menus, log in to websites, and even click on images or maps to collect further data.
You can also provide ParseHub with various links and some keywords, and it can extract the relevant information within seconds. Finally, you can use its REST API to download the extracted data for analysis in either JSON or CSV format, or export the collected data as a Google Sheet or to Tableau.
№6: Scrapingbee
Scrapingbee can be used in one of three ways:
- General Web Scraping, for example, extracting stock prices or customer reviews.
- Search Engine Result Page scraping, often used for SEO or keyword monitoring.
- Growth Hacking, which includes extracting contact information or social media information.
Scrapingbee offers a free plan that includes 1000 credits, and paid plans for unlimited use.
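Scrapingbee itself is driven through a plain HTTP API: a scrape is just a GET request carrying your API key and the target URL as query parameters. The sketch below only builds such a request URL without sending it, so it needs no account or network access; the endpoint and parameter names reflect Scrapingbee’s public docs at the time of writing and should be verified against the current documentation before use.

```python
from urllib.parse import urlencode

def build_scrapingbee_url(api_key: str, target_url: str, render_js: bool = False) -> str:
    """Build a Scrapingbee request URL.

    Endpoint and parameter names (api_key, url, render_js) are taken from
    Scrapingbee's public docs as of this writing; double-check before use.
    Sending this URL with requests.get() returns the scraped page and
    consumes credits from your plan.
    """
    params = {
        "api_key": api_key,
        "url": target_url,
        "render_js": "true" if render_js else "false",
    }
    return "https://app.scrapingbee.com/api/v1/?" + urlencode(params)

# "YOUR_API_KEY" is a placeholder, not a working key.
request_url = build_scrapingbee_url("YOUR_API_KEY", "https://example.com")
print(request_url)
```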
Collecting data is perhaps the least fun and most tedious step of a data science project workflow. It can be quite time-consuming, and if you work in a company or even freelance, you know that time is money, which means that if there’s a more efficient way to do something, you had better use it.
The good news is that web scraping doesn’t have to be tedious; you don’t need to spend much time doing it manually. Using the right tool can save you a lot of time, money, and effort. Moreover, these tools can be beneficial for analysts and for people without a strong coding background.
When choosing a tool to scrape the web, there are some factors to consider, such as API integration and extensibility to large-scale scraping. This article presented you with tools suited to different data collection regimes; give them a try and choose the one that makes your next data collection task a breeze.
Original. Reposted with permission.
Bio: Sara Metwalli is a Ph.D. student and research assistant working on quantum computing at Keio University.