Browser bother: Three painkillers for headless scraping headaches
Browser bother: Three painkillers for headless scraping headaches

Read Time
10 mins
Posted on
March 19, 2025
How To
This article shares three strategies for operationalizing large-scale browser automation yourself, and surveys the alternatives.
By
Theresia Tanzil
Table of Contents

  • Introduction

  • The difficulty of browser-based scraping

  • Three strategies to scale a browser automation operation

  • Browser infrastructure services and providers

  • Conclusion

Web scraping has traditionally been carried out using two broad approaches:


  • The conventional method: using retrieval libraries and scraping frameworks like BeautifulSoup, Scrapy, or even wget to fetch and parse page content.

  • The browser-based method: leveraging automation libraries such as Puppeteer, Selenium, and Playwright to control real, headless browsers.


Conventional wisdom has always been that dedicated, browserless scraping tools are faster and more efficient, while real browser automation is prone to performance concerns.


Nevertheless, the value of browser-based scraping is clear to see. Today, more and more websites rely on JavaScript-heavy content, making raw HTML extraction ineffective. Others closely manage their traffic using CAPTCHAs, fingerprinting, and rate limiting.


Browser-based scraping has become a useful tool in the scraping toolbox – but one which presents a number of challenges.

The difficulty of browser-based scraping


The modern web is a bloated soup of technologies. Websites don’t just serve visible content; they execute scripts, fetch data asynchronously, and track user behavior through different front-end frameworks, third-party trackers, and dynamic elements.


That has turned web browsers into resource-hungry beasts. As anyone with multiple Chrome tabs open will attest, CPU usage can spike unpredictably, memory consumption balloons, and background scripts continue running even when a page seems idle.


Of course, web browsers were not built for web scraping. While a normal user can scroll a page before all assets are loaded, an automated system waits until a page is completely ready.


The difficulties grow when scrapers need scale. Large quantities of data cannot be obtained with one browser alone. But target sites tend to discourage access attempts to a single account from multiple browsers. Success, therefore, depends on being able to project the same “state” across a range of browser instances.


Failing to do so could mean scrape results from a dynamic-content site varying wildly between instances.


Coordinating multi-instance states and managing the required resources can be challenging. But options are emerging to help.

Three strategies to scale a browser automation operation


During Extract Summit 2024, Joel Griffith, CEO of Browserless, outlined three strategies commonly implemented to address these concerns.


1. Manage session states with cookies


When a regular user accesses “stateful” pages that depend on user preferences or authorization, these details are often stored in cookies and sent to the server to achieve the intended page state.


Cookies are the glue of the web, little heroes that have long allowed human users to maintain browser states across sessions. These simple text files are easy to serialize and store.


Web scraping developers, too, can obtain the same page state by passing cookie name-value strings in their HTTP request header.
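The cookie hand-off can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the cookie names and values are hypothetical, and a real scraper would capture them from an authenticated session before serializing them.

```python
import json

def cookies_to_header(cookies: dict) -> str:
    """Serialize name-value cookie pairs into a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())

# Cookies captured from an authenticated session (hypothetical values).
session_cookies = {"sessionid": "abc123", "csrftoken": "xyz789"}

# Simple text is easy to export for reuse on another machine ...
exported = json.dumps(session_cookies)

# ... and to replay as a plain HTTP request header later.
header = cookies_to_header(json.loads(exported))
# header == "sessionid=abc123; csrftoken=xyz789"
```

Because the serialized form is plain text, rotating or inspecting state across a distributed fleet is just string handling.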


  • Cookies, with their simple name-value pairs, take up minimal space and are easy to modify, inspect, and rotate.

  • You can export and reuse cookies across machines, enabling distributed scraping setups.

  • Best for authentication persistence over days or weeks.


Trade-offs:


  • Cookies aren’t a cache – Although they can help maintain state across sessions, a browser will still need to download every page asset each time.

  • Limited authentication coverage – Cookies alone don’t retain client-side data storage like localStorage or IndexedDB, which some systems require for retrieval.

  • Security risks – Improper handling of cookies can expose sensitive user sessions.


2. Leverage Chrome's full user data directory


While cookies can help maintain session-level persistence, for more sophisticated websites, more information needs to be provided.


Chrome’s user data directory goes one step beyond cookies, also storing items saved through the Web Storage API as well as IndexedDB for session persistence and authentication. The folder also caches files served by websites in order to reduce duplicate requests.


The default location of Chrome’s user data directory varies by operating system, and Chrome allows you to specify any directory to load. That’s great for scrapers because it means they can swap in whole sets of custom caches for different scrape jobs.


By starting your Chrome instances while specifying --user-data-dir=/path/to/data/dir, the browser instance gains access to every client-side asset that the website may have cached.
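A minimal sketch of that swap-in, assuming a Linux-style Chrome binary and hypothetical profile paths – each scrape job gets its own directory, so caches and sessions never collide:

```python
def chrome_command(profile_dir: str, url: str) -> list[str]:
    """Build a headless Chrome invocation pinned to a specific user data dir."""
    return [
        "google-chrome",                    # or "chromium", depending on install
        f"--user-data-dir={profile_dir}",   # swap in a whole cached profile
        "--headless=new",
        "--dump-dom",                       # print the rendered DOM to stdout
        url,
    ]

cmd = chrome_command("/tmp/profiles/job-42", "https://example.com")
# subprocess.run(cmd, capture_output=True)  # run where Chrome is installed
```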


Providing a user data directory is often the best strategy when accessing data from single page applications (SPA), which tend to store cookies and other assets in the local file system.


  • Ideal for long-term browsing emulation where credentials, cache, and session history need to be stored.

  • Useful for automation that needs to run over periods of days.


Trade-offs:


  • High storage overhead – User data directories can grow to hundreds of megabytes per instance.

  • Concurrency issues – Sharing the same directory across multiple browser instances can lead to data corruption.

  • Crash recovery concerns – Unexpected browser terminations can cause profile corruption.


3. Accelerate access by keeping browser processes open


For a human, constantly opening and re-opening a browser would be inefficient – no wonder so many of us leave tabs open for days or weeks on end. In web scraping, too, keeping browser instances running continuously can make data acquisition faster.


Retrieving a web page is slower when you need to start a browser from cold. So, instead of starting from scratch, you can keep the browser processes open, mapping your requests back to the right instance on the right server.


To achieve this, you'll need a load balancer that routes network requests back to the correct browser instance, plus logic to intercept browser.close() calls so the processes aren’t shut down prematurely.
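As a rough sketch of that routing logic (the WebSocket endpoints below are hypothetical, and a production router would also handle health checks and eviction), a sticky session router might look like this:

```python
import itertools

class SessionRouter:
    """Sticky router: map each scrape session to a long-lived browser process."""

    def __init__(self, endpoints: list[str]):
        self._endpoints = itertools.cycle(endpoints)
        self._assignments: dict[str, str] = {}

    def endpoint_for(self, session_id: str) -> str:
        # A new session round-robins to a warm browser; repeat requests
        # for the same session always return to the same instance.
        if session_id not in self._assignments:
            self._assignments[session_id] = next(self._endpoints)
        return self._assignments[session_id]

router = SessionRouter(["ws://browser-1:9222", "ws://browser-2:9222"])
assert router.endpoint_for("job-a") == router.endpoint_for("job-a")  # sticky
```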


  • Combines the efficiency of cookies and cache.

  • Requires robust load balancing to manage sessions efficiently.

  • Complex to implement, but highly effective at scale.


Trade-offs:


  • Difficult to scale – Sessions are tied to a specific process, making cross-machine load balancing complex.

  • Session lifecycle management – You need to evict stale sessions, track TTLs, and handle unexpected disconnects.

  • High resource usage – Continuous browser processes can lead to memory leaks and CPU overload if not carefully monitored.


Watch Joel’s talk in full for more insights, sample code, and a Q&A session.

Browser infrastructure services and providers


Despite these tricks, managing browser automation infrastructure can be complex. So, several specialized providers have popped up to lighten the load further.


  • Chrome-for-hire infrastructure providers: Several services allow you to run cloud-hosted headless Chrome instances rather than hosting them on your own infrastructure. Think of it as hailing a ride rather than owning and maintaining your own fleet of vehicles.

  • Rendering APIs: To quickly and easily render a page without your own overhead, specialist services offer lightweight API endpoints like /render. Some services go one step further and wrap these capabilities into standalone products such as web page monitoring services.

  • Web scraping APIs: For scrapers that don’t want to manage their own Chrome instances, even in the cloud, web scraping tech vendors offer APIs that abstract browser functions into easier-to-use endpoints combined with data acquisition capabilities.


For web scraping, the Zyte API’s Headless Browser – part of the Zyte Web Scraping API – is a fully hosted headless browser that is specifically designed for web scraping. Unlike general-purpose browser automation tools, it includes:


  • Proxy and session management: The browser is built to maximize target site access by strategically selecting the best-performing proxies, reusing successful sessions to reduce bans, and handling stateful sessions to ensure consistency between requests.

  • Fingerprint management: Traditional headless browsers use standard browser binaries that expose JavaScript APIs and behavioral traces, revealing automation fingerprints that many websites use as signals to block traffic. Zyte API’s Headless Browser has a built-in mechanism to manage this risk.

  • Memory management: Running browsers to collect data locally or on self-managed browser instances requires provisioning and monitoring your own CPU and RAM. Zyte API’s Headless Browser allows you to tap into an elastic cloud-based infrastructure that scales as needed.
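A hedged sketch of what calling such a hosted browser looks like – the endpoint and field names follow Zyte API's public documentation at the time of writing, but verify them (and supply your own API key) before relying on this:

```python
import json

# Request payload asking the hosted browser to render the page and
# return the post-JavaScript HTML.
payload = {
    "url": "https://example.com",
    "browserHtml": True,
}
body = json.dumps(payload)

# With the `requests` library installed, the call would look like:
# resp = requests.post("https://api.zyte.com/v1/extract",
#                      auth=("YOUR_API_KEY", ""), json=payload)
# html = resp.json()["browserHtml"]
```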


If most browser automation APIs are like solo musicians playing single instruments, Zyte API is a conductor orchestrating an entire ensemble, coordinating multiple musicians into one seamless performance.


When your web data collection runs through a single experience in this way, you gain consistency and simplicity. You don’t need to modify your code, or even your browser, to respond to page markup changes. You work more resiliently, in a browser-agnostic manner, without dealing with low-level decisions.

Conclusion


Browser automation can be bothersome.


Using browsers for web scraping is sometimes unavoidable, but that doesn’t mean it has to be a headache. Solutions like Zyte API remove the complexity of memory management, clunky state handling, and fragile data collection by bundling rendering, crawling, extraction, and unblocking into one streamlined interface.


Ultimately, browser automation is just one part of the web data collection stack. If babysitting browsers isn’t where you want to invest your resources, you always have the option to find help.
