Three strategies to scale a browser automation operation
During Extract Summit 2024, Joel Griffith, CEO of Browserless, outlined three strategies commonly implemented to address these concerns.
1. Manage session states with cookies
When a regular user accesses “stateful” pages that depend on user preferences or authorization, these details are often stored in cookies and sent to the server to achieve the intended page state.
Cookies are the glue of the web, little heroes that have long allowed human users to maintain browser states across sessions. These simple text files are easy to serialize and store.
Web scraping developers, too, can obtain the same page state by passing those cookie name-value pairs in the Cookie header of their HTTP requests.
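To make that concrete, here is a minimal sketch in TypeScript with Puppeteer (the URLs and cookie file path are placeholders, not code from the talk): one run serializes the session cookies to disk, and a later run, possibly on another machine, loads them back before visiting a stateful page.

```ts
import { promises as fs } from 'fs';
import puppeteer from 'puppeteer';

const COOKIE_FILE = 'cookies.json'; // hypothetical path for the serialized cookies

// First run: establish the session (e.g. log in), then serialize its cookies to disk.
async function saveCookies(): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/login'); // placeholder URL
  // ...perform the login or preference-setting steps here...
  const cookies = await page.cookies();
  await fs.writeFile(COOKIE_FILE, JSON.stringify(cookies, null, 2));
  await browser.close();
}

// Later run (possibly on another machine): restore the cookies before visiting stateful pages.
async function reuseCookies(): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const cookies = JSON.parse(await fs.readFile(COOKIE_FILE, 'utf8'));
  await page.setCookie(...cookies);
  await page.goto('https://example.com/account'); // the page now loads with the saved session
  await browser.close();
}

saveCookies().then(reuseCookies).catch(console.error);
```

Because the cookies end up in a plain JSON file, the same file can be shipped to any worker in a distributed setup, which is what makes rotating sessions across machines straightforward.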
Cookies, with their simple name-value pairs, take up minimal space and are easy to modify, inspect, and rotate.
You can export and reuse cookies across machines, enabling distributed scraping setups.
Best for authentication persistence over days or weeks.
Trade-offs:
Cookies aren’t a cache – Although they can help maintain state across sessions, a browser will still need to download every page asset each time.
Limited authentication coverage – Cookies alone don’t cover client-side storage such as localStorage or IndexedDB, which some sites rely on to restore an authenticated session.
Security risks – Improper handling of cookies can expose sensitive user sessions.
2. Leverage Chrome's full user data directory
While cookies can maintain session-level persistence, more sophisticated websites expect more client-side state than cookies alone can carry.
Chrome’s user data directory goes a step beyond cookies: it also stores data saved through the Web Storage API (localStorage and sessionStorage) and IndexedDB, which sites use for session persistence and authentication. The directory also caches files served by websites, reducing duplicate requests.
The default location of Chrome’s user data directory varies by operating system, and Chrome allows you to specify any directory to load. That’s great for scrapers because it means they can swap in whole sets of custom caches for different scrape jobs.
By launching your Chrome instances with --user-data-dir=/path/to/data/dir, the browser gains access to every client-side asset that the website may have cached.
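If you drive Chrome through an automation library rather than the raw CLI, the same setting is typically exposed as a launch option. Here is a minimal sketch with Puppeteer, assuming a placeholder profile path and URL:

```ts
import puppeteer from 'puppeteer';

async function main(): Promise<void> {
  // Point Chrome at a persistent profile; cookies, Web Storage, IndexedDB,
  // and the HTTP cache are all read from and written back to this directory.
  const browser = await puppeteer.launch({
    userDataDir: '/path/to/data/dir', // same effect as --user-data-dir=/path/to/data/dir
  });

  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL
  // ...scrape as usual; whatever state the site writes is persisted in the profile...
  await browser.close(); // the directory survives and can be reused (or copied) for the next job
}

main().catch(console.error);
```

Keeping one directory per session, and never sharing it across concurrently running instances, sidesteps the corruption issues noted in the trade-offs below.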
Providing a user data directory is often the best strategy when accessing data from single-page applications (SPAs), which tend to persist session data in client-side storage such as localStorage and IndexedDB rather than in cookies alone.
Ideal for long-term browsing emulation where credentials, cache, and session history need to be stored.
Useful for automation that needs to run over periods of days.
Trade-offs:
High storage overhead – User data directories can grow to hundreds of megabytes per instance.
Concurrency issues – Sharing the same directory across multiple browser instances can lead to data corruption.
Crash recovery concerns – Unexpected browser terminations can cause profile corruption.
3. Accelerate access by keeping browser processes open
For a human, constantly opening and re-opening a browser would be inefficient – no wonder so many of us leave tabs open for days or weeks on end. In web scraping, too, keeping browser instances running continuously can make data acquisition faster.
Retrieving a web page is slower when you need to start a browser from cold. So, instead of starting from scratch, you can keep the browser processes open, mapping your requests back to the right instance on the right server.
To achieve this, you'll need a load balancer that routes network requests back to the correct browser instance, plus logic to intercept browser.close() calls so the processes aren’t shut down prematurely.
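The connection side of that setup can be sketched with Puppeteer’s remote-connection API: a long-lived browser exposes a WebSocket endpoint, each job attaches to it, and the job disconnects instead of closing the process. The endpoint URL and the session-routing map below are hypothetical stand-ins for the load balancer described above.

```ts
import puppeteer, { Browser } from 'puppeteer';

// Hypothetical routing table: maps a session ID to the WebSocket endpoint of the
// long-lived browser process that owns it. In production, the load balancer or a
// shared session store would play this role.
const sessionEndpoints = new Map<string, string>([
  ['session-123', 'ws://browser-host-1:3000/devtools/browser/abc'], // placeholder endpoint
]);

async function runJob(sessionId: string): Promise<void> {
  const endpoint = sessionEndpoints.get(sessionId);
  if (!endpoint) throw new Error(`No browser registered for ${sessionId}`);

  // Attach to the already-running browser instead of launching a new one.
  const browser: Browser = await puppeteer.connect({ browserWSEndpoint: endpoint });

  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL
  // ...scrape...
  await page.close();

  // Detach without killing the process, so the warm browser (and its cookies,
  // cache, and logged-in state) stays available for the next request.
  browser.disconnect();
}

runJob('session-123').catch(console.error);
```

In practice, job code either never calls browser.close() or has that call intercepted, which is what keeps the process warm between requests.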
Combines the persistence of cookies with the speed of a warm cache.
Requires robust load balancing to manage sessions efficiently.
Complex to implement, but highly effective at scale.
Trade-offs:
Difficult to scale – Sessions are tied to a specific process, making cross-machine load balancing complex.
Session lifecycle management – You need to prevent stale sessions, track TTLs, and handle unexpected disconnects.
High resource usage – Continuous browser processes can lead to memory leaks and CPU overload if not carefully monitored.
Watch Joel’s talk in full for more insights, sample code, and a Q&A session.