What are the elements of a web scraping project?
Extracting data from the web is becoming increasingly popular, and more and more businesses are leveraging the power of web scraping. But this doesn't mean the technical challenges are gone. Building a sustainable web scraping infrastructure takes expertise and experience.
Here at Zyte, we scrape 9 billion pages per month using our own web data extraction tools and infrastructure. In this article, we summarize the essential elements of a successful web scraping project and the building blocks you need to take care of in order to develop a healthy web data pipeline.
The building blocks:
- Web spiders
- Spider management
- Javascript rendering
- Data QA
- Proxy management
Web spiders
Let's start with the obvious: spiders. In the web scraping community, a spider is a script or program that extracts data from web pages. Spiders are essential to scraping the web, and there are many libraries and tools available. In Python, you have Scrapy, the web scraping framework, or beautifulsoup. If you’re programming in Java, you have Jsoup or HTMLUnit. In Javascript, it’s Puppeteer or Cheerio. These are just the most popular ones, but there are many other libraries and headless browsers you can use for web scraping. If you’ve never done web scraping before, it can be difficult to know which one to pick.
For one-off, small-scale projects it doesn't really matter which library you use. Choose the one that is easiest to get started with in your preferred programming language, for example beautifulsoup (or Scrapy) or Jsoup. But for long-term projects, where you will need to maintain your spiders and maybe build on top of them, you should choose a tool that lets you focus on your project-specific needs and not on the underlying scraping logic.
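To make this concrete, here is a minimal Scrapy spider sketch. It targets quotes.toscrape.com, a public demo site; the CSS selectors are specific to that site and would need to be adapted to your own target.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Minimal example spider; the start URL and selectors below only
    # make sense for the quotes.toscrape.com demo site.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

You can run a standalone spider like this with `scrapy runspider quotes_spider.py -o quotes.json`, which writes the extracted items to a JSON file.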
Spider management
Now that we have the spiders right, the next element you need in your web scraping stack is spider management. Spider management is about creating an abstraction on top of your spiders. A spider management platform, like Scrapy Cloud, makes it quick to get a sense of how your spiders are performing. You can schedule jobs, review scraped data, and automate spiders, and you can stay up to date on your project’s health without managing servers.
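As a rough sketch of what that abstraction buys you, scheduling a run and reading its output with the python-scrapinghub client might look roughly like this; the API key, project ID, and spider name are placeholders, not values from a real project.

```python
from scrapinghub import ScrapinghubClient  # pip install scrapinghub

# Placeholders: replace with your own API key, project ID, and spider name.
client = ScrapinghubClient("YOUR_API_KEY")
project = client.get_project(123456)

# Schedule a run of the spider; no servers to manage on your side.
job = project.jobs.run("quotes")
print("Scheduled job:", job.key)

# Once the job has finished, iterate over the scraped items.
for item in job.items.iter():
    print(item)
```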
Javascript rendering
Javascript, as a technology, is widely used on modern websites. Unfortunately, it adds complexity to your web scraping project if the data you want to extract is rendered using JS. On the other hand, if the data is transferred from the backend to the website’s frontend using AJAX calls, you may be able to get it by inspecting the website and simulating the AJAX request in your spider to grab the JSON/XML response.
Generally, when you know the website is using JS to render its content, your first instinct might be to grab a headless browser like Selenium, Puppeteer, or Playwright, which can be necessary if you’re trying to work around antibots.
The trade-off with headless browsers is that in the short term it's much quicker to just render JS and get the data, but in the long term it takes far more hardware resources. So if execution time and hardware usage matter to you, inspect the website properly first and see if there are any AJAX requests or hidden API calls in the background. If there are, try to replicate them in your scraper instead of executing JS.
If there's just no other way to get the data and you have to render JS no matter what, only then use a headless browser.
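A minimal sketch of the "replicate the request" approach, assuming you found a hypothetical JSON endpoint in your browser's Network tab; the URL and parameters are made up and would be replaced by whatever the site actually calls.

```python
import requests

# Hypothetical endpoint discovered while inspecting the site;
# replace the URL and parameters with what the site really uses.
API_URL = "https://www.example.com/api/products"

response = requests.get(
    API_URL,
    params={"page": 1, "per_page": 50},
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

# Parse the JSON payload directly, no JS execution needed.
for product in response.json().get("products", []):
    print(product.get("name"), product.get("price"))
```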
Data quality
The next building block is data quality assurance. All our scraping efforts are worth it only if the output data is the right data in the correct format. To make sure this is the case, we can do several things:
- validate the output data against a predefined JSON schema
- check the coverage of the extracted fields
- check for duplicates
- compare two scraping jobs
It is also useful to set up alerts and notifications that are triggered when a certain condition is met. Zyte has open-sourced two tools that we use for spider monitoring and data quality assurance. One is Spidermon, which, for example, can send email or Slack notifications and create custom reports based on web scraping stats. The other is Arche, which can be used to analyze and verify scraped data against a set of rules.
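For illustration, the first two checks on the list above can be as simple as validating every item against a schema and flagging duplicates. The sketch below uses the jsonschema package with a made-up item schema; your own schema and unique key would differ.

```python
from jsonschema import Draft7Validator  # pip install jsonschema

# Made-up schema for illustration; define one that matches your own items.
ITEM_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "url": {"type": "string"},
    },
    "required": ["name", "price", "url"],
}

validator = Draft7Validator(ITEM_SCHEMA)

def check_items(items):
    seen_urls = set()
    for item in items:
        # Schema validation: wrong types, missing fields, etc.
        for error in validator.iter_errors(item):
            print(f"Validation error: {error.message} in {item}")
        # Simple duplicate check on a field that should be unique.
        if item.get("url") in seen_urls:
            print(f"Duplicate item: {item.get('url')}")
        seen_urls.add(item.get("url"))
```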
Proxy management
The last, but very important, building block we need to mention is proxy management. Proxies are necessary for large-scale projects, where you need to make a lot of requests frequently.
First of all, you should have an estimate of what’s possible and what’s not when it comes to scaling up. You cannot solve everything just by throwing more hardware (proxies) at the problem; sometimes it’s impossible to scale further. For example, if your target website has 50,000 visitors/month and you want to scrape it 50,000 times in a month, that’s not going to work. You need to consider the traffic profile of the target website to make sure your expectations are realistic regarding the number of requests your scraper will make. You can get an estimate of what kind of proxies you need, and how many, based on the target websites, request volume, and geolocation (example.com, 3M requests/day, USA).
There are projects where we need ongoing maintenance and we scrape hundreds of millions or even billions of URLs. For this kind of volume, you definitely need a proxy solution, like Zyte Smart Proxy Manager. But you can also just buy proxies and implement your own proxy management logic. Keep in mind that if web scraping is not the core of your business or project, it makes more sense to outsource proxy management and save a lot of development/maintenance time.
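If you do roll your own, a very basic rotation layer might look like the sketch below. The proxy addresses are placeholders, and a production-grade manager would also handle bans, retries, and per-domain throttling.

```python
import random
import requests

# Placeholder proxy pool; in practice these come from your proxy provider.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch(url):
    # Pick a random proxy per request; a real manager would also track
    # failures and rotate away from banned or slow proxies.
    proxy = random.choice(PROXY_POOL)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=30,
    )

response = fetch("https://www.example.com/")
print(response.status_code)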
Web scraping ethics
Before finishing up this article, it’s important to talk about web scraping ethics. When you scrape a website, you have to make sure you also respect it. Here are some best practices you can follow to scrape respectfully.
Don't be a burden
The most important rule when you scrape a website is not to harm it. Do not make too many requests. Making requests too frequently can make it hard for the website's server to serve other visitors. Limit the number of requests in accordance with the target website.
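In Scrapy, for instance, politeness is mostly a matter of settings; the values below are arbitrary starting points, not recommendations for any particular site.

```python
# settings.py -- example politeness settings; tune the numbers to the target site.
DOWNLOAD_DELAY = 1.0                # wait at least 1 second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # keep per-domain parallelism low
AUTOTHROTTLE_ENABLED = True         # back off automatically if the site slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```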
Robots.txt
Before scraping, always inspect the robots.txt file first. This will give you a good idea of which parts of the website you are free to visit and which pages you should stay away from.
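Scrapy can respect robots.txt for you via the ROBOTSTXT_OBEY setting; if you are writing your own client, Python's standard library can do the check, as in this sketch (the URLs and user-agent name are placeholders).

```python
from urllib.robotparser import RobotFileParser

# Placeholder site; point this at your actual target.
robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()

url = "https://www.example.com/products/123"
if robots.can_fetch("my-scraper-bot", url):
    print("Allowed to fetch", url)
else:
    print("Disallowed by robots.txt:", url)
```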
User-agent
Define a user-agent that clearly identifies you or your company. It’s also best to include contact information in your user-agent, so the website’s owners can let you know if they have any issues with what you’re doing.
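For example, a descriptive user-agent might look like the one below, whether you set it as a request header or via Scrapy's USER_AGENT setting; the product name, URL, and email address are of course placeholders.

```python
import requests

# Placeholder identity and contact details; use your own.
HEADERS = {
    "User-Agent": "acme-price-monitor/1.0 (+https://www.acme.example; scraping@acme.example)"
}

response = requests.get("https://www.example.com/", headers=HEADERS, timeout=30)
print(response.status_code)
```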
Pages behind a login wall
There are cases when you can only access certain pages if you are logged in. If you want to scrape those pages, you need to be very careful. If, by logging in, you explicitly agree to the website's terms and conditions, and those terms state that you cannot scrape, then you cannot scrape. You should always honor the terms of any contract you enter into, including website terms and conditions and privacy policies.
Closing note
Getting started with web scraping is easy. But as you scale, it gets more and more difficult to keep your spiders working. You need a game plan for tackling website layout changes, keeping data quality high, and meeting your proxy needs. By perfecting the building blocks above, you should be well on your way.
Learn more about web scraping tools
Here at Zyte, we have been in the web scraping industry for 12 years. We have helped extract web data for more than 1,000 clients ranging from Government Agencies and Fortune 100 companies to early-stage startups and individuals. During this time we gained a tremendous amount of experience and expertise in web data extraction.
Here are some of our best resources if you want to deepen your web scraping knowledge: