The Scraper’s System Part 2: The Explorer’s Compass for Analyzing Websites
Welcome to part two of "The Scraper’s System" series.
If you haven’t read the introductory part yet, you can do so here.
In the first part, we discussed a template for defining a clear purpose for your web scraping system, one that helps you design your crawlers better and prepares you for the uncertainty involved in a large-scale web scraping project.
Step 1 clarifies the three W’s – Why, What, and Where – of a large-scale web scraping project, which serve as the guiding North Star throughout the development process.
Step 2 of the framework helps you answer: “How do we extract the data?”
I also like to call this phase The Explorer’s Compass: you must understand this critical navigational tool and some best practices before you set sail through the target websites.
At Zyte, developers spend days analyzing target websites along four parameters that help design the high-level crawl logic and choose the most suitable technology stack for your project:
API Availability.
Dynamic Content.
Antibot Mechanisms.
Pages and Pagination.
API Availability
Web Scraping vs API – Which is the better option?
My answer to this question: it’s always subjective and depends primarily on your business goals.
If you need to collect data from the same website all the time, an API is a suitable choice.
It’s a good idea to always check for API availability and note the data fields that can be extracted from the website’s APIs, rather than jumping straight into scraping them.
Benefits of checking API availability:
Respect the websites by not burdening them.
Save a lot of development time.
Avoid blocks/bans.
Data may be available to developers for free.
Exceptions:
Keep in mind that this may not hold true in certain scenarios, such as:
Collecting real-time data.
Websites with heavy Anti-ban protection.
Data nested in JavaScript, which requires browser emulation.
For example, if the target websites are e-commerce aggregators, then an API would make sense; if they are flight aggregators, where the data changes in near real time, then maybe not.
When deciding between an API and web scraping, it’s also important to check how dynamic the target websites are, look at each website’s history, and consider trends in that specific industry to gauge the likelihood of disruptive website changes that could break the crawlers.
Once again, clarify the business goal and make a list of all data fields that can be extracted using the APIs provided by the target websites.
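For illustration, here is a minimal sketch of such a check in Python. The endpoint URL, query parameters, and response structure are hypothetical – in practice they would come from the target website’s API documentation or from the XHR calls visible in your browser’s network tab.

```python
import requests

# Hypothetical endpoint - replace with the documented API of your target website.
API_URL = "https://www.example.com/api/v1/products"

response = requests.get(
    API_URL,
    params={"page": 1, "page_size": 50},      # assumed pagination parameters
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

payload = response.json()
records = payload.get("results", [])          # assumed response structure

# List the data fields the API exposes, so you can compare them
# against the fields your business goal actually requires.
if records:
    print("Fields available via the API:", sorted(records[0].keys()))
else:
    print("API responded, but returned no records for these parameters.")
```

Comparing the printed field list against your required data fields tells you immediately which fields still have to be scraped from the HTML.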
Dynamic Content
Let me give you two examples as a quick introduction to interactive elements on websites.
Once you see this, little red markers will light up in your head every time you interact with your favorite web or mobile applications.
If you feel like eating your favorite cake… open Google Maps and type in “Bakery near me”.
Did you notice those little red markers appear?
Open your favorite e-commerce website, check the exact availability of any product you want to buy – select the delivery location and enter the pin code.
Did you notice this entire interaction happened without loading the entire page?
After this, you cannot unsee such interactions across many applications over the web.
The use of JavaScript can vary from simple form events to full Single Page Applications (SPAs), where data is displayed to the user on request. As a result, for many web pages the content that is displayed in the browser is not available in the original HTML.
Therefore, the regular approach of fetching and parsing the raw HTML will fail when it comes to scraping dynamic websites.
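A quick way to confirm this is to fetch the raw HTML without executing JavaScript and check whether a value you can see in the browser is actually present in it. A rough sketch, with a hypothetical URL and price value:

```python
import requests

# Hypothetical product page and a value you can see rendered in the browser.
URL = "https://www.example.com/product/123"
VISIBLE_VALUE = "$24.99"

raw_html = requests.get(URL, timeout=30).text

if VISIBLE_VALUE in raw_html:
    print("Value found in the raw HTML - plain HTTP scraping should work.")
else:
    print("Value missing from the raw HTML - the page likely builds it with JavaScript.")
```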
Two alternative approaches:
Reverse engineering JavaScript – we can reverse engineer the website’s behavior (typically its background API calls) and replicate it in our code.
Rendering JavaScript using browser automation/simulation tools such as headless browser libraries – Puppeteer, Playwright, Selenium – or Zyte API (check its actions). A sketch of both approaches follows below.
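To make the two approaches concrete, here is a minimal sketch of each, assuming a hypothetical XHR endpoint discovered in the browser’s network tab and a hypothetical product page. Playwright is used here purely as one example of a headless browser library.

```python
import requests
from playwright.sync_api import sync_playwright

# Approach 1: reverse engineering - call the JSON endpoint the page itself
# uses behind the scenes (hypothetical URL found in the network tab).
xhr = requests.get(
    "https://www.example.com/ajax/availability",
    params={"product_id": "123", "zip": "94105"},   # assumed parameters
    headers={"Accept": "application/json"},
    timeout=30,
)
print("Availability payload:", xhr.json())

# Approach 2: rendering - let a headless browser execute the JavaScript,
# then read the fully rendered HTML.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.example.com/product/123")
    page.wait_for_selector(".availability")          # assumed CSS selector
    rendered_html = page.content()
    browser.close()

print("Rendered HTML length:", len(rendered_html))
```

Reverse engineering is usually faster and cheaper at scale, while rendering is more robust when the page logic is too complex to replicate.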
To Do – make a list of all data fields that require reverse engineering JavaScript or browser simulation tools.
Antibot Mechanisms
This part of the analysis is boolean: either your requests go through, or you start seeing “429” (Too Many Requests) responses.
It’s hard to pinpoint what goes on behind the scenes to block your requests.
Read this blog post, where Akshay Philar covers the most common measures used by websites and shows how to overcome them with Zyte Data API Smart Browser.
To summarize, these are some of the defensive measures that get you blocked:
IP rate limitation – a lot of crawling happens from datacenter IP addresses. If the website owner recognizes that a large number of non-human requests are coming from a set of IPs, they can block all requests from that specific datacenter, so the scrapers will not be able to access the site. To overcome this, you need to use other datacenter proxies or residential proxies, or simply a service that handles proxy management.
Detailed browser fingerprinting – a combination of browser properties/attributes derived from the JavaScript API, used in concert with each other to detect inconsistencies. It includes information about the OS, device, accelerometer, WebGL, canvas, etc.
Captchas and other ‘humanity’ tests – back in the day, captchas used HIP (Human Interactive Proof) with the premise that humans are better at solving visual puzzles than machines, and machine learning algorithms weren’t developed enough to solve distorted-text captchas. As machine learning technologies evolved, however, machines became able to solve this type of captcha easily, so more sophisticated image-based tests were introduced, which pose a bigger challenge for machines.
TCP/IP fingerprinting, geofencing, and IP blocking.
Behavioral patterns – human observation methods, such as analysis of mouse movements and detailed event logging, used to differentiate automated crawlers from the behavior patterns of real-life visitors.
Request patterns – the number and frequency of requests you make. The more frequent your requests (from the same IP) are, the higher the chance your scraper will be recognized.
In this process, try to figure out the level of protection used by the target website. This helps answer whether a rotating proxy solution will be enough, or whether you need an advanced anti-ban solution like Zyte API that takes care of bans of all types.
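To get a first feel for that level of protection, you can probe the site and watch how it reacts to repeated requests. The sketch below retries on HTTP 429 with exponential backoff, honoring the Retry-After header when the server sends one; the URL is hypothetical, and this is only a diagnostic starting point, not an anti-ban solution.

```python
import time
import requests


def fetch_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    """Fetch a URL, backing off when the server answers 429 (Too Many Requests)."""
    delay = 1.0
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            return response
        # Honor Retry-After if it is a plain number of seconds,
        # otherwise fall back to exponential backoff.
        retry_after = response.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else delay
        print(f"Got 429 on attempt {attempt + 1}, waiting {wait:.1f}s")
        time.sleep(wait)
        delay *= 2
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")


# Hypothetical target page.
resp = fetch_with_backoff("https://www.example.com/category/shoes")
print(resp.status_code)
```

If even slow, well-spaced requests get blocked, the site is likely using fingerprinting or behavioral checks rather than simple rate limits.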
Pages and Pagination
Lastly, consider the number of steps required to extract the data – in some cases, all the target data isn’t available on a single page; instead, the crawler has to make multiple requests to obtain it.
In these cases, estimate the number of requests that will need to be made, which in turn determines how much infrastructure the project will require.
Also consider the complexity of iterating through records – certain sites have more complex pagination (infinite scrolling pages, etc.) or formatting structures that can require a headless browser or complex crawl logic. This helps determine the type of pagination and crawl logic required to access all the available records.
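For simple “next page” pagination, the crawl logic can be sketched in a few lines; infinite scrolling usually means finding the underlying paginated JSON endpoint or using a headless browser instead. The URL and CSS selectors below are hypothetical, and parsel is used here for HTML parsing.

```python
from urllib.parse import urljoin

import requests
from parsel import Selector

# Hypothetical listing page.
start_url = "https://www.example.com/category/shoes"

url = start_url
request_count = 0
while url:
    html = requests.get(url, timeout=30).text
    request_count += 1
    sel = Selector(text=html)

    # Collect record links on the current page (assumed selector).
    for href in sel.css("a.product-link::attr(href)").getall():
        print(urljoin(url, href))

    # Follow the "next page" link until there isn't one (assumed selector).
    next_href = sel.css("a.next::attr(href)").get()
    url = urljoin(url, next_href) if next_href else None

print("Requests needed to cover all pages:", request_count)
```

The final request count feeds directly back into the infrastructure estimate above.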
Conclusion
To conclude, try the following exercise first.
Answer the questions below to make sure you have fully understood "The Explorer’s Compass" and are ready to move forward with The Scraper's System.