Web Scraping Dynamic Websites With Zyte API

Introduction
Understanding Dynamic Websites
Setting Up Zyte API
Making API Requests to Scrape Dynamic Content
Handling Challenges with Zyte API
Dynamic websites scraping tips
Conclusion

Introduction

Web scraping has become critical for businesses and researchers who need to gather data from the internet. That said, scraping dynamic websites presents its own challenges. These sites use technologies like JavaScript and AJAX to load content asynchronously, making it difficult for traditional web scraping methods to extract data effectively. Common obstacles include:

JavaScript rendering
Changing website structures
Infinite scrolling

This is where Zyte comes in. Zyte API provides tools for overcoming these challenges and is specifically designed to streamline web scraping, especially for dynamic websites. Its features include JavaScript rendering support, smart proxy rotation, and automatic handling of anti-bot protections. It is an effective tool, with a user-friendly interface, that allows both developers and non-technical users to navigate complex scraping tasks with ease.

This article aims to provide a step-by-step guide on using Zyte API for dynamic web scraping, demonstrating how to effectively extract data from modern, complex websites.

Understanding Dynamic Websites & Their Challenges

Dynamic websites load content asynchronously, meaning that the data doesn't appear in the initial page source but is generated by JavaScript following user interactions or API calls.

Common obstacles in scraping dynamic websites include:

JavaScript rendering: Modern websites often use JavaScript-heavy frameworks such as React, Angular, and Vue.js to provide dynamic content. Traditional scrapers may not execute JavaScript, failing to capture content rendered client-side.
Anti-bot mechanisms: Websites implement anti-bot mechanisms like CAPTCHAs, rate limits, and IP blocking to protect their resources from overloading. Advanced bot detection systems scrutinize browsing patterns, mouse movements, and request headers to distinguish between human users and automated scrapers.

Traditional scrapers like BeautifulSoup and requests may not work effectively on dynamic websites because they primarily fetch the raw HTML of a page. These tools are adept at parsing static HTML content but struggle to handle content rendered through JavaScript. Unlike headless browsers, traditional web scrapers don't execute JavaScript, meaning dynamically loaded content is not captured.
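
To make this limitation concrete, here is a minimal sketch that uses only Python's standard-library HTML parser. The page markup, the "products" id, and the /api/products path are invented for illustration. The point is that a parser which does not execute JavaScript sees an empty list, because the items only exist after a browser runs the script.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is empty in the raw HTML and is only
# filled in once a browser executes the script.
RAW_HTML = """
<html><body>
  <ul id="products"></ul>
  <script>
    fetch("/api/products").then(r => r.json()).then(items => {
      for (const item of items) {
        const li = document.createElement("li");
        li.textContent = item.name;
        document.getElementById("products").appendChild(li);
      }
    });
  </script>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts <li> elements that are present in the static markup."""

    def __init__(self):
        super().__init__()
        self.li_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1


counter = ListItemCounter()
counter.feed(RAW_HTML)
print(counter.li_count)  # prints 0: no items exist until JavaScript runs
```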

Zyte API is an automated solution for web data extraction, designed to overcome the challenges presented by dynamic websites. It offers a user-friendly interface suitable for both developers and non-technical users, and it provides the tools to extract data from sophisticated sites using state-of-the-art techniques, eliminating time-consuming configuration and anti-bot workarounds.

Key features of Zyte API include:

JavaScript rendering support: Zyte API employs headless browser technology to fully render web pages, execute JavaScript, and mimic human interactions. This allows the API to scrape data from even the most complex and dynamic websites.
Smart proxy rotation: Zyte API uses intelligent IP rotation techniques to avoid detection and ensure continuous access to data. This includes the use of proxy IP addresses distributed across different geographical locations. Zyte's Smart Proxy Management automatically selects the leanest set of proxies and techniques to keep your crawl healthy.
Auto-handling of anti-bot protections: Zyte API is designed to manage bans by using tools such as proxies and clever use of cookies. The API can adapt its methods based on the website's anti-bot measures, ensuring that the most cost-effective solution is used. It is particularly effective in overcoming common obstacles faced during web scraping, including measures like IP bans.
Scalability for large-scale scraping: Zyte API is designed to handle large volumes of requests, making it easy to scale up your data extraction efforts as needed. The infrastructure is hosted and managed by Zyte, eliminating the need to invest in additional resources.

Setting Up Zyte API

To set up the Zyte API, follow these steps:

Create an account on Zyte: Go to zyte.com and sign up for an account. You can sign up using your Google account or with an email and password.

Get API credentials (API key setup):

After creating an account, you may need to go through a checkout process where you enter your credit/debit card information. A minimal amount (like $1) will be charged and immediately refunded to ensure your card works.
Select a trial plan for Zyte API which provides access to the Zyte API via an API key and a $5 credit for a free trial. The free trial lasts until you use the $5 credit or until 30 days have passed, whichever comes first.
To view your API key, select the Zyte API dropdown and click API access. This will take you to a screen where you can view and replace your keys.

Install necessary dependencies:

Python: Ensure you have Python installed.
Requests library: Install the Requests library for making HTTP requests: pip install requests
Zyte API Python library: You can install the official Zyte API client library for Python using pip: pip install zyte-api
CA Certificate: Download Zyte's CA certificate and configure it using the instructions for your OS. The easiest way to use the certificate with Requests is to specify the path to the certificate in your code. If you run into SSL issues, you can pass verify=False into your requests, but this turns off certificate verification, so it is best kept for debugging.

To make API requests, you'll need to pass your API key with each request. Almost all the requests sent to Zyte will use the POST method. This is more secure, as your API keys are sent in a secure Authorization header.
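
As a rough illustration of what that Authorization header contains, the sketch below builds an HTTP Basic header from an API key used as the username with an empty password. This is the same thing Requests does when you pass auth=(API_KEY, ""). The function name zyte_auth_header and the key value are placeholders, not part of any Zyte library.

```python
import base64


def zyte_auth_header(api_key: str) -> dict:
    """Build an HTTP Basic Authorization header with the API key as the
    username and an empty password."""
    # Basic auth encodes "username:password"; the password is empty here.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# "YOUR_API_KEY" is a placeholder, not a real credential.
print(zyte_auth_header("YOUR_API_KEY"))
```

In practice you rarely need to build this header by hand, since the auth argument of Requests produces it for you.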

Making API Requests to Scrape Dynamic Content

To make API requests to scrape dynamic content with Zyte API, you should consider the following:

● Basic API request format: The Zyte API functions through a request-response mechanism. Specify the target URL and parameters to define how the request should be handled. The API processes the request, performing actions such as rendering JavaScript, handling cookies and sessions, and managing IP addresses. Once complete, the API returns a structured response containing the extracted data.

● Sending a request with Zyte API: Most requests to Zyte API use the POST method. To access content, send a POST request to the extract endpoint and place the parameters inside the JSON body of the request:

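import requests

API_KEY = "YOUR_API_KEY"  # placeholder: your Zyte API key
url = "https://toscrape.com"  # placeholder: the page to scrape

response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=(API_KEY, ""),
    json={
        "url": url,
        "httpResponseBody": True,
    },
)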


● auth holds a tuple consisting of your API key and an empty string. json holds the parameters passed into the API:

○ "url": The URL to scrape.

○ "httpResponseBody": Specifies that you want the body of the response.
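
Once the response arrives, the body still needs decoding. The sketch below assumes that the httpResponseBody field of the JSON response is a Base64-encoded string, as Zyte's documentation describes, and that the page is UTF-8 text. The helper name decode_http_response_body and the sample response are made up so the example runs without a network call or an API key.

```python
import base64


def decode_http_response_body(api_response: dict, encoding: str = "utf-8") -> str:
    """Return the page body from a Zyte API JSON response as text."""
    raw_bytes = base64.b64decode(api_response["httpResponseBody"])
    # The encoding is an assumption; undecodable bytes are replaced, not raised.
    return raw_bytes.decode(encoding, errors="replace")


# Simulated response standing in for response.json() from a real request.
sample_response = {
    "url": "https://toscrape.com",
    "httpResponseBody": base64.b64encode(
        b"<html><title>Demo</title></html>"
    ).decode("ascii"),
}

html = decode_http_response_body(sample_response)
print(html)
```

With a real request you would pass response.json() to the helper instead of the simulated dictionary.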

● Handling JavaScript-rendered content: To render JavaScript, pass "browserHtml": True in the JSON body. This instructs Zyte API to open a real browser and render the page, executing JavaScript. Use either httpResponseBody or browserHtml, but not both in the same request. To disable JavaScript, pass "javascript": False.

● Example: Scraping a website with AJAX-loaded data: Websites using AJAX (Asynchronous JavaScript and XML) load content dynamically without requiring a page refresh. Scraping these sites requires intercepting network requests, extracting API responses, and mimicking user interactions. Zyte API automatically manages AJAX-based data extraction without additional setup, overcoming JavaScript rendering issues and retrieving structured API responses.
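
For a page like this, the choice between the two output fields described above can be sketched as a small helper that always sets exactly one of them. The function name build_extract_payload is hypothetical. Only the field names url, browserHtml, and httpResponseBody come from this article.

```python
def build_extract_payload(url: str, render_js: bool) -> dict:
    """Return the JSON body for one request, setting exactly one output field."""
    payload = {"url": url}
    if render_js:
        # Ask for the page as rendered by a real browser (JavaScript executed).
        payload["browserHtml"] = True
    else:
        # Ask for the raw HTTP response body (no rendering).
        payload["httpResponseBody"] = True
    return payload


# An AJAX-heavy page would typically need rendering.
print(build_extract_payload("https://toscrape.com", render_js=True))
```

Keeping the decision in one place makes it harder to send both fields in the same request by accident.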

Code snippet for a simple Zyte API request:

import json

import requests

# Load the API key from a local config file, e.g. {"zyte_api_key": "..."}
with open("config.json") as file:
    config = json.load(file)

# Path to the Zyte CA certificate downloaded during setup
ca_cert_path = "zyte-ca.crt"

# Route both HTTP and HTTPS traffic through the Zyte API proxy endpoint
proxy_url = f"http://{config['zyte_api_key']}:@api.zyte.com:8011"
proxies = {"http": proxy_url, "https": proxy_url}

response = requests.get("https://toscrape.com", proxies=proxies, verify=ca_cert_path)
print(response.content)


This code reads your API key from a config file and points both the HTTP and HTTPS proxies at the proxy endpoint api.zyte.com:8011, with the API key as the username and an empty password. All requests to the target site are then routed through this proxy.

● Advanced Functionality: Zyte API provides flexible customisation options, each passed in as a field in the JSON body.

| Field            | Description                                     | Default         |
| ---------------- | ----------------------------------------------- | --------------- |
| browserHtml      | Opens a real browser and renders the page       | False           |
| screenshot       | Takes a screenshot using the browser            | False           |
| product          | Extracts product data                           | False           |
| customAttributes | Extracts page elements based on criteria        | null            |
| geolocation      | Makes a request through a specific country      | Based on server |
| javascript       | Forces JavaScript execution on the browser      | Based on site   |
| actions          | A list of actions to perform in the browser     | null            |
| session          | Creates a reusable session                      | null            |
| networkCapture   | Captures network requests from the browser      | null            |
| device           | Emulates a specific device                      | desktop         |
| cookieManagement | How cookies are managed in the browser          | auto            |
| requestCookies   | A list of cookies to send with a request        | null            |
| responseCookies  | Shows cookies from the request in its response  | False           |
| serp             | Search engine results of the domain             | False           |
| ipType           | Use either a residential or data centre IP      | datacenter      |
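
Because each option is just another key in the JSON body, a small helper can combine them and catch misspelled field names before a request is sent. This is a sketch only: build_payload and KNOWN_FIELDS are hypothetical names, the field list is copied from the table above plus httpResponseBody, and the example values ("US" as a country code, the shape of the actions entry) are assumptions to check against Zyte's reference documentation.

```python
# Field names taken from the table above, plus httpResponseBody from earlier.
KNOWN_FIELDS = {
    "httpResponseBody", "browserHtml", "screenshot", "product",
    "customAttributes", "geolocation", "javascript", "actions", "session",
    "networkCapture", "device", "cookieManagement", "requestCookies",
    "responseCookies", "serp", "ipType",
}


def build_payload(url: str, **options) -> dict:
    """Combine a target URL with optional fields, rejecting unknown names."""
    unknown = set(options) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"Unknown field(s): {sorted(unknown)}")
    return {"url": url, **options}


payload = build_payload(
    "https://toscrape.com",
    browserHtml=True,
    geolocation="US",                     # assumed: two-letter country code
    actions=[{"action": "scrollBottom"}],  # assumed action format
)
print(payload)
```

Field names are case-sensitive in this sketch, so a slip such as browserHTML instead of browserHtml raises a ValueError instead of being sent silently.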


Handling Challenges with Zyte API

To handle challenges using the Zyte API, you should consider the following points related to pagination, infinite scrolling, dynamic content, and rate limits:

Pagination and Infinite Scrolling:

Pagination involves navigating through multiple pages to extract all the desired data from a website. You can automate navigation through pagination menus using custom crawling rules in your spiders. Zyte API's AI can handle common pagination types automatically.
For websites using infinite scrolling, where content loads as the user scrolls down, Zyte API offers solutions. You can use a headless browser with programmable actions like scrollBottom or scrollTo to mimic scrolling behaviour. However, be mindful of the API's runtime limit of less than 1 minute, which might not suffice for all cases.
An alternative approach is to reverse-engineer the JavaScript code that handles the infinite scrolling, typically implemented through paginated API requests.
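
The reverse-engineering approach usually ends up as a loop over page numbers that stops when the endpoint returns nothing. The sketch below shows that loop in isolation. The names iterate_pages and fake_fetch are hypothetical, and the fake data stands in for whatever paginated endpoint the target site actually uses. In real use the fetch function would send a request (for example through Zyte API) and parse the JSON it returns.

```python
from typing import Callable, Iterator, List


def iterate_pages(
    fetch_page: Callable[[int], List[dict]], max_pages: int = 50
) -> Iterator[dict]:
    """Yield items page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # an empty page marks the end of the feed
        yield from items


# Stand-in for the site's paginated endpoint, so the sketch runs offline.
FAKE_DATA = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}], 3: []}


def fake_fetch(page: int) -> List[dict]:
    return FAKE_DATA.get(page, [])


all_items = list(iterate_pages(fake_fetch))
print(all_items)  # prints [{'id': 1}, {'id': 2}, {'id': 3}]
```

The max_pages cap is a safety net against endpoints that never return an empty page.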

Extracting Data from Dynamically Loaded Elements:

Dynamic content, loaded asynchronously via AJAX requests, poses a challenge for traditional scraping methods because the data isn't available in the initial HTML source.
Zyte API is designed to handle such dynamic content by intercepting network requests and extracting API responses, thus overcoming JavaScript rendering issues.
You can also use Zyte API's server-managed sessions to ensure requests rotate through sessions with pre-configured ZIP codes, which is useful when websites store configurations server-side rather than directly in cookies.
Zyte API can use AI to extract navigation data and can rely entirely on AI for both crawling and parsing. A single input URL is enough for a spider to automatically extract data.

Managing Rate Limits and API Quotas:

Websites implement rate limiting to prevent abuse and manage server load by restricting the number of requests from a user or IP within a timeframe. Exceeding these limits can result in errors or IP blocking.
To handle rate limits, you can adjust request rates and leverage solutions like Crawlee to comply with website limits while maintaining efficient scraping. Zyte API dynamically manages website changes to prevent the scraper from breaking.
Zyte API automatically selects the leanest set of proxies and techniques to keep your crawl healthy and help you avoid bans.
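
One common way to adjust request rates on the client side is exponential backoff: wait longer after each rate-limited response before trying again. The sketch below is a generic pattern, not Zyte's own retry logic. send_with_backoff is a hypothetical helper, and the send callable stands in for whatever function performs your request and returns the HTTP status code.

```python
import random
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], int],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 429 or retries run out."""
    status = send()
    for attempt in range(max_retries):
        if status != 429:
            break
        # Double the wait on each retry and add jitter to avoid bursts.
        sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
        status = send()
    return status


# Offline demo: two rate-limited responses, then success.
statuses = iter([429, 429, 200])
waits = []
result = send_with_backoff(lambda: next(statuses), sleep=waits.append)
print(result, len(waits))  # prints 200 2
```

The sleep parameter is injectable only so the demo can record the delays instead of actually waiting.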

Session Management

Advanced session management is needed to manage modern bot defenses.
Server-managed sessions let Zyte API handle session management.
Client-managed sessions let you control session IDs and manage them as per your scraping logic.
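
For client-managed sessions, the idea is that you generate an ID once and attach it to every request that should share the same session. The sketch below assumes the session field takes an object with an id key holding a UUID string. Verify that shape against Zyte's API reference before relying on it. The helper with_session and the example.com URLs are placeholders.

```python
import uuid


def with_session(payload: dict, session_id: str) -> dict:
    """Return a copy of the payload that reuses the given session ID."""
    # Assumed field shape: {"session": {"id": "<uuid>"}}
    return {**payload, "session": {"id": session_id}}


# One ID shared by two requests that should belong to the same session.
session_id = str(uuid.uuid4())
first = with_session({"url": "https://example.com/page/1", "browserHtml": True}, session_id)
second = with_session({"url": "https://example.com/page/2", "browserHtml": True}, session_id)
print(first["session"] == second["session"])  # prints True
```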

Dynamic websites scraping tips

To scrape dynamic websites effectively with Zyte API, consider the following tips:

Avoiding Detection and Bans:

Mimic normal browsing behaviour by introducing random delays between requests. You can implement this using the random.uniform() function in Python.
Some Web Scraping APIs, like Zyte API, use a mix of tools such as proxies and cookies to solve bans, adapting their methods based on the website's anti-bot measures.
Avoid scraping during peak hours.
Implement organic crawling patterns.
To avoid your requests getting blocked, Zyte API provides intelligent, adaptive session management.
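
The random-delay tip above can be sketched with random.uniform() as follows. The function name polite_pause and the one-to-three-second range are illustrative choices, not recommended values for any particular site.

```python
import random
import time
from typing import Callable


def polite_pause(
    min_seconds: float = 1.0,
    max_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for a random duration in the given range and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Offline demo: record the delays instead of actually sleeping.
recorded = []
for _ in range(3):
    polite_pause(0.5, 1.5, sleep=recorded.append)
print(len(recorded))  # prints 3
```

In a real crawl you would call polite_pause() with the default sleep between consecutive requests.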

Respecting robots.txt:

Always check a website’s robots.txt file to see which parts of the site are allowed to be scraped. The robots.txt file provides directives to web crawlers, indicating which parts of the site they can access.
If the robots.txt file disallows scraping certain sections, respect those rules.
Failing to adhere to the instructions in the robots.txt file could result in being blocked from scraping the website.
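
Python's standard library can check these rules for you. The sketch below parses an inline robots.txt so it runs without network access. The rules, the "my-scraper" user agent, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Check each URL before scraping it.
print(robots.can_fetch("my-scraper", "https://example.com/catalogue/"))    # prints True
print(robots.can_fetch("my-scraper", "https://example.com/private/data"))  # prints False
```

Against a live site you would call set_url() with the robots.txt address and then read() instead of parsing an inline string.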

Structuring API Requests Efficiently:

Zyte API integrates as an HTTP API, where users can POST their API key along with the URLs to be scraped and optional parameters, such as JavaScript rendering or custom headers.

Zyte API is a single automated solution for dependable web data extraction that uses the leanest setup to reliably return HTML from any website at the lowest cost.

Use Zyte API’s automatic extraction feature, powered by AI, which allows you to start getting product data from any e-commerce website in seconds and ensures that any changes in the website’s schema won't affect the schema of the extracted data you get.

Structure API requests efficiently.

Optimising Cost and Performance with Zyte API:

Zyte API offers per-site pricing, providing a cost-effective solution for reliable web data collection.
Metrics such as response time, throughput, and error rates play a significant role in determining an API's effectiveness.
Reduce request and response payloads, enable caching mechanisms, and optimise database queries.
Tools such as Prometheus and Grafana can be utilised for API observability, helping identify bottlenecks and areas for improvement.
Concurrency is not directly managed through the API. If you get a status 429, reduce your concurrent threads, as you are being rate limited.
Zyte API automatically finds the right-size features and configures itself to retrieve data from any website, building your scraping stack in a fraction of the time previously required and only using exactly the right features and resources required on a site-by-site basis.

Dynamic Content:

For complex websites that rely heavily on JavaScript and AJAX to handle dynamic content, the typical approach for writing spiders will not suffice. In these cases, specialised libraries and advanced spiders are needed.
Zyte API comes with the automatic extraction feature powered by AI, allowing developers to start getting product data from any e-commerce website in seconds.

Conclusion

In conclusion, Zyte API is an effective solution for dynamic web scraping because it simplifies data extraction, handles anti-bot measures, and offers scalable solutions. It enables users to focus on data extraction rather than dealing with proxies, bans, and maintenance.

Key benefits and features of Zyte API include:

Effectiveness for Dynamic Web Scraping: Zyte API is particularly effective for scraping dynamic websites that use technologies like JavaScript and AJAX.
Comprehensive Feature Set: Zyte API consolidates web scraping technologies and techniques into a simple API. It is designed to handle large volumes of requests. Its features include dynamic proxy management, smart proxy rotation, and AI-powered data extraction.
Adaptability and Automation: Zyte API automatically adapts to website changes, finds the right-size features, and configures itself to retrieve data from any website. It provides automated proxy management.
Data Handling and Compliance: Zyte API delivers data in JSON format, simplifying data processing and integration. It is designed to comply with legal and ethical standards.
Simplified User Experience: It provides a simple, seamless, and predictable data collection experience.
Cost-Effective Solution: Zyte API offers per-site pricing, making it a cost-effective solution for reliable web data collection.
Anti-Bot Management: Zyte configures settings to unblock websites. It employs sophisticated algorithms to prevent detection and ensure uninterrupted access to target websites.
Scalability: Zyte API is designed to handle large volumes of requests, making it easy to scale data extraction efforts. The infrastructure is managed by Zyte, reducing the need for additional resources.

Zyte offers resources such as comprehensive documentation with code examples and guidelines for various technologies to help users implement advanced scraping techniques.

Zyte API is evolving into an adaptable and resilient web scraping API suitable for simple and complex websites, regardless of project size. Try Zyte API for scalable scraping solutions.