Sledge hammer solutions
If you build or buy a system that works like a hammer and treats every site as a nail, you end up paying for tech most sites don't need. That cost adds up quickly when you're scraping a lot of pages.
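To make the metaphor concrete, here is a minimal Python sketch of what a sledgehammer setup tends to look like in practice: every URL, easy or hard, gets a full headless browser and a residential proxy. The proxy details and URLs are placeholder assumptions, not a real configuration.

```python
# A sledgehammer fetcher: every request gets the most expensive treatment
# (full headless browser plus residential proxy), whether the site needs it or not.
import asyncio

from playwright.async_api import async_playwright

# Hypothetical residential proxy details; in practice this is the priciest tier.
RESIDENTIAL_PROXY = {
    "server": "http://residential-proxy.example.com:8000",
    "username": "user",
    "password": "pass",
}

URLS = [
    "https://simple-static-blog.example.com/post/1",    # plain HTML, no anti-bot tech
    "https://heavily-protected-shop.example.com/p/42",  # actually needs the muscle
]

async def fetch(url: str) -> str:
    # Launch a full browser behind a residential proxy for *every* URL.
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy=RESIDENTIAL_PROXY)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html

async def main() -> None:
    for url in URLS:
        html = await fetch(url)
        print(url, len(html))

asyncio.run(main())
```

The protected shop may genuinely need this, but the static blog could have been fetched with a plain HTTP request at a fraction of the cost, and that waste repeats on every page.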
Pros: Instant unblocking of the majority of websites
Cons: Expensive and doesn’t scale well
Trade-offs: Sacrifices cost efficiency for speed and success
The compromise solution
Building a system that trades some success for cost efficiency might work well if you're under little to no time pressure and can keep revising it.
Pros: Cheaper to run than sledge hammer and AI solutions
Cons: Susceptible to missing data and slow crawling
Trade-offs: Sacrifices speed and success for cost efficiency
The optimised solution
You can build systems with waterfalls of varying proxy types, browsers and other infrastructure. We've even seen generative AI used to speed developers up by helping them build crawlers as JSON instructions for a complex, optimised system. The issue is that you end up spending a lot of time and money building a fragile, multi-vendor system that needs constant maintenance and upkeep.
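As a rough illustration of the waterfall idea, the sketch below tries the cheapest option first and only escalates to pricier infrastructure when a request looks banned. The tier endpoints, the ban heuristic and the tier list itself are simplified assumptions; a real system would add headless browsers, stealth profiles and session management as further tiers, which is exactly where the build-and-maintain cost grows.

```python
# A simplified request waterfall: escalate from cheap to expensive infrastructure
# until a request succeeds, instead of paying top price for every page.
import requests

# Ordered cheapest to most expensive; each endpoint is a hypothetical placeholder.
TIERS = [
    ("no_proxy", None),
    ("datacenter", {"https": "http://dc-proxy.example.com:8000"}),
    ("residential", {"https": "http://res-proxy.example.com:8000"}),
    # A production waterfall would continue with headless browsers, stealth
    # browser profiles, cookie/session handling, and so on.
]

BAN_STATUS_CODES = {403, 429, 503}

def looks_banned(response: requests.Response) -> bool:
    # Naive heuristic: a ban-ish status code or a suspiciously tiny body.
    return response.status_code in BAN_STATUS_CODES or len(response.text) < 500

def fetch_with_waterfall(url: str) -> tuple[str, str] | None:
    for tier_name, proxies in TIERS:
        try:
            response = requests.get(url, proxies=proxies, timeout=30)
        except requests.RequestException:
            continue  # this tier failed outright; escalate to the next one
        if not looks_banned(response):
            return tier_name, response.text
    return None  # every tier failed; a human now has to diagnose the ban

result = fetch_with_waterfall("https://example.com/products")
if result:
    tier, html = result
    print(f"Fetched via tier: {tier}, {len(html)} bytes of HTML")
```

Every extra tier, vendor and heuristic in that waterfall is more code someone has to own, which is where the maintenance and upkeep burden comes from.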
Pros: Instant unblocking of the majority of websites
Cons: Costs a lot to build and maintain, and you need highly specialised developers
Trade-offs: Cost efficiency is gained at the high cost of system ownership. Even if you build a smart system with logic to help navigate the distribution problem, you're just shunting the problem into a different challenge: you're swapping the time it takes to scrape websites one by one for the challenge of building a vast scraping system. That system might save your developers time building and maintaining the actual crawlers, but another developer has to balance and maintain a proprietary system stitched together from multiple tools, vendors and internal code bases. Any time savings are eaten up by the total cost of ownership of the entire scraping system.
AI solutions
You can build AI-powered or AI-augmented solutions that speed up some aspects of writing web scraping code, from spider/crawler creation to selector generation. You can even use large language models (LLMs) to parse page data or write selector code for you.
Pros: You can increase productivity in multiple areas by having AI do some of the manual coding work for you at development time, for example helping you write selector code or convert JSON into scraping configurations.
Cons: LLMs are generally too expensive to run on every page (and they aren't very accurate for specific fields like SKU or price), so using them for extraction at scale is out.
Trade-offs: You're essentially just using AI to speed up writing selectors, which is nice, but those selectors will break over time and you'll have to fix them again and again.
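In practice that usually means a pattern like the one below: call an LLM once per site at development time to propose selectors, review them by hand, then run cheap deterministic selector code on every page at crawl time. The model name, prompt and client call are illustrative assumptions; any LLM client would do.

```python
# Use an LLM at development time to draft selectors, then extract at crawl time
# with an ordinary (cheap, deterministic) CSS-selector parser.
from openai import OpenAI
from parsel import Selector

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def propose_selectors(sample_html: str) -> str:
    # Called once per site while developing the crawler, never per page.
    prompt = (
        "Given this product page HTML, suggest CSS selectors for the product "
        "name, price and SKU. Reply as JSON.\n\n" + sample_html[:8000]
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# At crawl time you rely on the (human-reviewed) selectors, not the LLM.
REVIEWED_SELECTORS = {"name": "h1.product-title::text", "price": ".price::text"}

def extract(html: str) -> dict:
    sel = Selector(text=html)
    return {field: sel.css(css).get() for field, css in REVIEWED_SELECTORS.items()}
```

The per-page cost stays low because the LLM never sees production traffic, but the generated selectors are still just selectors: when the site's markup changes, they break and need regenerating.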
Is compromise between cost, speed and success inevitable?
Kind of…
Ultimately, no matter which system you build, you're always tied to one fatal flaw: you're using human power to diagnose, overcome and fix website bans one by one. Your headcount determines your speed and capacity to scale more than any other factor (outside of budget).
Depending on your business and project needs, that might be OK though.
Sometimes it's OK to spend thirty times more per request if speed is your priority and you're only crawling five websites and 10,000 pages.
Sometimes you're only going to extract data from one massive website with tens of millions of pages once a quarter, so you need hyper-optimised, super-cheap requests for that one website.
However, if you want to extract data from lots of different websites quickly, with high success and at a low total cost, and without investing in years-long programmes of system development, then there are few options. Any solution needs the ability to:
Analyse a website’s anti-bot tech on the fly without (much) human intervention.
Automatically assign the minimum set of resources needed to overcome any ban. It has to be as cheap or cheaper for the easy sites and appropriately priced for the more difficult ones.
Monitor and self-correct over time.
Access the infrastructure required to reach the websites and return the data (proxies, browsers, stealth tech, cookie management, etc.).
Expose an API so you can interact with, customise and control the solution from a scraping framework like Scrapy (sketched after this list).
Have a pricing model that adjusts to each site's costs individually.
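As a sketch of what that API-driven control could look like from Scrapy, the spider below stays ordinary Scrapy code and posts each target URL to a hypothetical unblocking endpoint that decides for itself which resources the site needs. The endpoint, request parameters and response shape are placeholders, not a specific product's API.

```python
# A Scrapy spider that delegates unblocking to a hypothetical external API.
import json

import scrapy

UNBLOCKER_ENDPOINT = "https://unblocker.example.com/extract"  # hypothetical service
UNBLOCKER_API_KEY = "YOUR_API_KEY"

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://shop.example.com/category/shoes"]

    def start_requests(self):
        for url in self.start_urls:
            # Wrap each target URL in a call to the unblocking API, which is
            # responsible for picking proxies, browsers and retries on its own.
            payload = {"url": url, "apikey": UNBLOCKER_API_KEY}
            yield scrapy.Request(
                UNBLOCKER_ENDPOINT,
                method="POST",
                body=json.dumps(payload),
                headers={"Content-Type": "application/json"},
                callback=self.parse,
                cb_kwargs={"target_url": url},
            )

    def parse(self, response, target_url):
        # Assume the service returns JSON containing the rendered page HTML.
        html = json.loads(response.text).get("html", "")
        page = scrapy.Selector(text=html)
        for title in page.css("h2.product-name::text").getall():
            yield {"url": target_url, "title": title}
```

The point is that the spider's own logic stays small: ban diagnosis, resource selection and self-correction live behind the API rather than in your code base, which is what keeps the total cost of ownership down.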
Missing any one of these will doom a website unblocking system to the cost vs speed vs success trade-off and crush your ability to collect web data at scale. You'll be bogged down with upfront work unblocking spiders, then with monitoring and maintaining their health.