Scrapy Cloud secrets: Hub Crawl Frontier and how to use it
Imagine a long crawling process, like extracting data from a website for a whole month. We can start it and leave it running until we get the results.
However, we can agree that a whole month is plenty of time for something to go wrong. The target website can go down for a few minutes or hours, your crawling server can suffer a power outage, or the internet connection can drop.
Any of these is a realistic scenario that can happen at any moment, putting your data extraction pipeline at risk.
If something like that happens, you may have to restart your crawling process and wait even longer to get access to that precious data. But you don’t need to panic: this is where Hub Crawl Frontier (HCF) and Scrapy Cloud secrets come to the rescue.