How to set up a custom proxy in Scrapy?
When scraping the web at a reasonable scale, you may come across a series of problems and challenges. You may want to access a website from a specific country/region.
Or maybe you want to work around anti-bot solutions. Whatever the case, to overcome these obstacles you need to use and manage proxies.
In this article, I'm going to cover how to set up a custom proxy inside your Scrapy spider in an easy and straightforward way.
Also, we're going to discuss the best ways to solve your current and future proxy issues.
You will learn how to do it yourself but you can also just use Zyte Proxy Manager to take care of your proxies.
Why do you need smart proxies for your web scraping projects?
If you are extracting data from the web at scale, you’ve probably already figured out the answer. IP banning. The website you are targeting might not like that you are extracting data even though what you are doing is totally ethical and legal.
When your scraper is banned, it can really hurt your business because the incoming data flow that you were so used to is suddenly missing.
Also, sometimes websites have different information displayed based on country or region.
To solve these problems, we use techniques for getting around IP bans, such as rotating proxies, so that our requests succeed and we can access the public data we need.
Setting up proxies in Scrapy
Setting up a proxy inside Scrapy is easy. There are two ways to do it: passing the proxy info as a request parameter or implementing a custom proxy middleware.
Option 1: Via request parameters
Normally when you send a request in Scrapy you just pass the URL you are targeting and maybe a callback function. If you want to use a specific proxy for that URL you can pass it as a meta parameter, like this:
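    from scrapy import Request

    # Inside your spider class:
    def start_requests(self):
        for url in self.start_urls:
            # yield, not return: return would stop after the first URL
            yield Request(
                url=url,
                callback=self.parse,
                headers={"User-Agent": "My UserAgent"},
                meta={"proxy": "http://192.168.1.1:8050"},
            )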
The way it works is that inside Scrapy, there’s a middleware called HttpProxyMiddleware which takes the proxy meta parameter from the request object and sets it up correctly as the used proxy. The middleware is enabled by default so there is no need to set it up.
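If your proxy needs a username and password, the proxy meta value can also carry them inside the URL, in the form http://username:password@host:port. Here is a minimal sketch of a helper that builds such a URL. The name build_proxy_url is my own and not part of Scrapy, and I am assuming the credentials should be percent-encoded so that special characters in a password do not break the URL.

```python
from urllib.parse import quote


def build_proxy_url(host, port, user=None, password=None, scheme="http"):
    """Return a proxy URL suitable for request.meta["proxy"] (hypothetical helper)."""
    if user is None:
        return f"{scheme}://{host}:{port}"
    # Percent-encode the credentials so characters such as "@" or ":" in a
    # password cannot be mistaken for URL separators.
    credentials = f"{quote(user, safe='')}:{quote(password or '', safe='')}"
    return f"{scheme}://{credentials}@{host}:{port}"
```

You would then pass the result as meta={"proxy": build_proxy_url("192.168.1.1", 8050, "<proxy_user>", "<proxy_pass>")} when creating the request.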
Option 2: Create custom middleware
Another way to utilize proxies while scraping is to actually create your own middleware. This way the solution is more modular and isolated. Essentially, what we need to do is the same thing as when passing the proxy as a meta parameter:
    from w3lib.http import basic_auth_header

    class CustomProxyMiddleware:
        def process_request(self, request, spider):
            # Route the request through the proxy and attach its credentials
            request.meta["proxy"] = "http://192.168.1.1:8050"
            request.headers["Proxy-Authorization"] = basic_auth_header(
                "<proxy_user>", "<proxy_pass>"
            )
In the code above, we define the proxy URL and the necessary authentication info. Make sure that you also enable this middleware in the settings and put it before the HttpProxyMiddleware:
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomProxyMiddleware': 350,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
    }
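Hardcoding the proxy address in the middleware is fine for a first test, but you may prefer to keep it in settings.py. The following is a rough sketch of one way to do that using the from_crawler hook that Scrapy calls when it builds a middleware. The class name SettingsProxyMiddleware and the setting name CUSTOM_PROXY_URL are assumptions made up for this example, not Scrapy built-ins.

```python
class SettingsProxyMiddleware:
    """Downloader middleware that reads the proxy URL from project settings."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # CUSTOM_PROXY_URL is a hypothetical setting name, not a Scrapy built-in
        return cls(crawler.settings.get("CUSTOM_PROXY_URL"))

    def process_request(self, request, spider):
        # Leave the request alone if no proxy is configured, or if the
        # request already chose its own proxy through the meta parameter.
        if self.proxy_url and "proxy" not in request.meta:
            request.meta["proxy"] = self.proxy_url
        return None
```

With this approach, a request that sets its own proxy via meta keeps it, and every other request falls back to the project-wide one.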
How to verify if your custom proxy is working?
To verify that you are indeed scraping using your proxy you can scrape a test site that tells you your IP address and location (like this one). If it shows the proxy address and not your computer’s actual IP it is working correctly.
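As a quick check outside Scrapy, you could compare the IP address a test service reports with and without the proxy. This is only a sketch using the Python standard library. It assumes https://httpbin.org/ip as the test site, which answers with a small JSON body like {"origin": "..."}, so swap in whichever service you prefer. The function names are my own.

```python
import json
from urllib.request import ProxyHandler, build_opener

IP_ECHO_URL = "https://httpbin.org/ip"  # assumed test endpoint


def parse_origin(body):
    """Extract the IP address from an httpbin-style {"origin": "..."} body."""
    return json.loads(body)["origin"]


def fetch_origin(proxy_url=None, timeout=10):
    """Return the IP address the echo service sees, optionally via a proxy."""
    # An empty mapping disables proxies, including ones from the environment
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else {}
    opener = build_opener(ProxyHandler(proxies))
    with opener.open(IP_ECHO_URL, timeout=timeout) as response:
        return parse_origin(response.read())


def proxy_is_working(direct_ip, proxied_ip):
    """The proxy is in use if the service no longer sees our own address."""
    return direct_ip != proxied_ip
```

Calling proxy_is_working(fetch_origin(), fetch_origin("http://192.168.1.1:8050")) should return True when the proxy is doing its job. Inside a spider, the same parse_origin function can be applied to response.text in your callback.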
Rotating proxies
Now that you know how to set up Scrapy to use a proxy, you might think that you are done and your IP banning problems are solved forever. Unfortunately not! What if the proxy we just set up gets banned as well? What if you need multiple proxies for multiple pages? Don't worry, there is a solution called IP rotation, and it is key for successful scraping projects.
When you rotate a pool of IP addresses, what you're essentially doing is randomly picking one address to make the request in your scraper. If it succeeds, meaning it returns the proper HTML page, we can extract the data and we're happy. If it fails for some reason (IP ban, timeout error, etc.), we can't extract the data, so we need to pick another IP address from the pool and try again. Obviously, this can be a nightmare to manage manually, so we recommend using an automated solution for this.
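To make the idea concrete, here is a minimal, framework-independent sketch of that pick, try, and retry loop. The function fetch_with_rotation and its parameters are hypothetical. The fetch argument stands in for whatever actually performs the request through a given proxy and raises an exception when it fails.

```python
import random


def fetch_with_rotation(url, proxies, fetch, max_attempts=3, rng=random):
    """Try `url` through randomly chosen proxies until one succeeds.

    `fetch(url, proxy)` is a caller-supplied function that returns the page
    body on success and raises an exception on failure (ban, timeout, ...).
    """
    candidates = list(proxies)
    last_error = None
    for _ in range(min(max_attempts, len(candidates))):
        proxy = rng.choice(candidates)
        try:
            return proxy, fetch(url, proxy)
        except Exception as error:
            # Drop the failed proxy so the next attempt picks a different one
            candidates.remove(proxy)
            last_error = error
    raise RuntimeError(f"all attempted proxies failed for {url}") from last_error
```

A real setup also has to decide when a "successful" response is actually a ban page, and when to give a failed proxy another chance, which is where a dedicated tool starts to pay off.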
IP rotation in Scrapy
If you want to implement IP rotation for your Scrapy spider you can install the scrapy-rotating-proxies middleware which has been created just for this.
It will take care of the rotating itself, adjusting crawling speed, and making sure that we’re using proxies that are actually alive.
After installing and enabling the middleware, you just have to add the proxies you want to use as a list in settings.py:
    ROTATING_PROXY_LIST = [
        'proxy1.com:8000',
        'proxy2.com:8031',
        # ...
    ]
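For completeness, the installation and enabling steps might look like the sketch below. The middleware paths and priority numbers are the ones I recall from the scrapy-rotating-proxies README, so please check them against the version you install.

```python
# settings.py
# Installed beforehand with: pip install scrapy-rotating-proxies

ROTATING_PROXY_LIST = [
    "proxy1.com:8000",
    "proxy2.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Paths and priorities as recalled from the package README (assumption)
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```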
Also, you can customize things like ban detection methods, page retries with different proxies, etc…
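As an illustration of custom ban detection, here is a rough sketch of a policy class. It assumes, based on my reading of the scrapy-rotating-proxies README, that a policy is a class with response_is_ban and exception_is_ban methods returning True, False, or None, and that it is enabled through the ROTATING_PROXY_BAN_POLICY setting. The class name, the status codes, and the captcha check are made up for this example.

```python
class CaptchaBanPolicy:
    """Hypothetical ban policy: treat 403/429 and captcha pages as bans."""

    BAN_STATUSES = {403, 429}

    def response_is_ban(self, request, response):
        # True means "ban detected", False means "not a ban"
        if response.status in self.BAN_STATUSES:
            return True
        return b"captcha" in response.body.lower()

    def exception_is_ban(self, request, exception):
        # None means "unknown": exceptions count neither for nor against a proxy
        return None


# In settings.py (the module path is an assumption about your project layout):
# ROTATING_PROXY_BAN_POLICY = "myproject.policy.CaptchaBanPolicy"
```

The README's own example subclasses the package's default policy instead, which lets you keep the built-in rules and only add to them.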
Conclusion
So now you know how to set up a proxy in your Scrapy project and how to manage simple IP rotation. But if you are scaling up your scraping projects, you will quickly find yourself drowning in proxy-related issues. As a result, you will lose data quality and ultimately waste a lot of time and resources dealing with proxy problems.
For example, you will find yourself dealing with unreliable proxies, poor success rates, and poor data quality, and getting bogged down in minor technical details that stop you from focusing on what really matters: making use of the data. How can we end this never-ending struggle? By using an already available solution that handles all of these headaches for you.
This is exactly why we created Zyte Proxy Manager. Zyte Proxy Manager enables you to reliably crawl at scale, managing thousands of proxies internally, so you don't have to. You never need to worry about rotating or swapping proxies again. Here's how you can use Zyte Proxy Manager with Scrapy.
If you're tired of troubleshooting proxy issues and would like to give Zyte Smart Proxy Manager a try, then sign up today. It has a 14-day FREE trial!