Powerful ban management with Zyte API’s Scripting API
As a developer working on custom projects for Zyte Data clients, you encounter many different anti-bot systems. This cat-and-mouse game demands your time, persistence, and deep knowledge of web scraping. We have a story to tell you about one of these games: a daily battle that was slowing the team down. In this post, we’ll break down the problem and show you how we solved it with Page.evaluate() from Zyte API’s Scripting API.
Our client has high demands. We need to deliver data according to three different schedules: hourly, daily, and fortnightly. There isn’t much room for error when you have hourly and daily schedules. Access to websites is of paramount importance to ensure the client gets the web data that’s absolutely essential to their operations.
The headache of managing daily bans
We migrated the client's spiders from Smart Proxy Manager (SPM) to Zyte API with cookies and headless browsing. We had three glorious months free of anti-bot issues before the bans slowly crept back in.
It started with occasional bans and escalated to weekly ones. At first, the Scripting API endpoint and a few header changes were enough to overcome them. But the bans grew more frequent, and our team responded with manual fixes each time. We suspected the website was identifying our crawler by analyzing its headers, so we would spend hours cycling through different combinations of header and cookie logic to find the one that restored website access. That fix would last a couple of days, if we were lucky, and then the cycle would begin again.
This was urgent client work, so the team was working around the clock; the bans happened at all times of the day and night. The manual fixes were difficult, and finding the configuration that allowed website access on any particular day was time-consuming and frustrating. Want to know what demoralizes your web scraping development team? Never-ending ban situations like these.
We were spending so much time figuring out the configuration of the day that the development team’s velocity slowed to a crawl. It was critical to find a solution that would stick.
Scripting API to the rescue
We contemplated writing a solution that would save us time by automatically discovering the winning combination. At the beginning of each day, it would cycle through all the different combinations of headers until it found that special one, and the spiders would then be set to use it. This wouldn’t really solve the fundamental problem of our crawlers being identified, but it would spare the development team the grunt work.
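The idea above can be sketched in a few lines. This is a hypothetical illustration, not our production code: the header values and the `check` callable (which would issue a real probe request and report success) are stand-ins.

```python
from itertools import product

# Candidate header values to rotate through (illustrative examples).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/120.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
]
ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8"]


def find_working_headers(check):
    """Try every header combination until `check(headers)` reports
    success; return the winning combination, or None if all fail.

    `check` is expected to perform a real request against the target
    site and return True when the response is not a ban page.
    """
    for ua, lang in product(USER_AGENTS, ACCEPT_LANGUAGES):
        headers = {"User-Agent": ua, "Accept-Language": lang}
        if check(headers):
            return headers
    return None
```

A daily job could run this once, store the winning combination, and point the spiders at it, which is exactly the grunt work we wanted to automate away.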
We discovered that using browser scripting to replicate a user session was what tipped off the anti-bot system. We still needed the browser to access the website, but we had to find another way to execute user actions so they’d go undetected.
Zyte API’s Scripting API has a function called Page.evaluate(), which executes JavaScript (JS) code within the page context. So instead of using browser scripting to click different elements in the user session, we wrote the actions in JS and called Page.evaluate() to run that code. Once we had access to the site, all the requests came from the same browser and IP, making them hard to detect, and the JS performed the user actions faster.
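A minimal sketch of what this looks like as a Zyte API browser request, assuming the extract endpoint’s `actions` list with an `evaluate` action (which runs the supplied source via Page.evaluate()); the target URL, JS snippet, and element selectors are purely illustrative:

```python
# Hedged sketch: build a Zyte API extract payload that executes the
# user actions as one JS snippet instead of discrete click/type actions.
API_URL = "https://api.zyte.com/v1/extract"


def build_evaluate_request(url: str, js_source: str) -> dict:
    """Return the JSON payload for a browser request that runs
    `js_source` in the page context via an `evaluate` action."""
    return {
        "url": url,
        "browserHtml": True,  # render the page in a headless browser
        "actions": [
            {"action": "evaluate", "source": js_source},
        ],
    }


# The user session expressed as plain JS, e.g. submitting a search form
# (hypothetical selectors for illustration only):
js = """
document.querySelector('#search-input').value = 'widgets';
document.querySelector('#search-form').submit();
"""
payload = build_evaluate_request("https://example.com/", js)
# Send with e.g. requests.post(API_URL, auth=(ZYTE_API_KEY, ""), json=payload)
```

Because the whole interaction is a single scripted evaluation inside one browser session, there are no separate click-and-wait actions for the anti-bot system to fingerprint.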
Our Scripting API strategy eliminated the excessive time spent setting up and discovering the winning combination of headers. JS was the perfect language to perform a request and process the site’s response as if a real user were browsing. Our team’s development velocity (and sanity) returned to normal because we were no longer spending hours battling website bans, and the client now has consistent access to the target domain.
Interested in trying out Zyte API’s powerful ban handling features like the Scripting API? Sign up for your free trial and see how we handle your toughest websites.