
How to set up a custom proxy in Scrapy?

Read time: 5 mins · Posted on August 8, 2019 · By Attila Toth

When scraping the web at a reasonable scale, you may come across a series of problems and challenges. You may want to access a website from a specific country/region.

Or maybe you want to work around anti-bot solutions. Whatever the case, to overcome these obstacles you need to use and manage proxies.

In this article, I'm going to cover how to set up a custom proxy inside your Scrapy spider in an easy and straightforward way.

We're also going to discuss the best ways to solve your current and future proxy issues.

You will learn how to do it yourself but you can also just use Zyte Proxy Manager to take care of your proxies.

Why do you need smart proxies for your web scraping projects?

If you are extracting data from the web at scale, you’ve probably already figured out the answer: IP banning. The website you are targeting might not like that you are extracting data, even though what you are doing is totally ethical and legal.

When your scraper is banned, it can really hurt your business because the incoming data flow that you were so used to is suddenly missing.

Also, sometimes websites have different information displayed based on country or region.

To solve these problems, we use techniques for bypassing IP bans, such as rotating proxies, so that requests keep succeeding and we can access the public data we need.

Setting up proxies in Scrapy

Setting up a proxy inside Scrapy is easy. There are two ways to use proxies with Scrapy: passing the proxy info as a request meta parameter, or implementing a custom proxy middleware.

Option 1: Via request parameters

Normally when you send a request in Scrapy you just pass the URL you are targeting and maybe a callback function. If you want to use a specific proxy for that URL you can pass it as a meta parameter, like this:

from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        # yield (not return) so every start URL gets its own request
        yield Request(url=url, callback=self.parse,
                      headers={"User-Agent": "My UserAgent"},
                      meta={"proxy": "http://192.168.1.1:8050"})

This works because Scrapy ships with a downloader middleware called HttpProxyMiddleware, which reads the proxy meta key from the request object and routes the request through that proxy. The middleware is enabled by default, so there is no need to set it up.
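In recent Scrapy versions, HttpProxyMiddleware can also pick up credentials embedded directly in the proxy URL. This depends on your Scrapy version, so treat the sketch below as an assumption to verify; the host, port, user, and password are placeholders:

# Assumes a Scrapy version where HttpProxyMiddleware extracts credentials
# from the meta "proxy" URL and turns them into a Proxy-Authorization header.
yield Request(url=url, callback=self.parse,
              meta={"proxy": "http://<proxy_user>:<proxy_pass>@192.168.1.1:8050"})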

Option 2: Create custom middleware

Another way to utilize proxies while scraping is to actually create your own middleware. This way the solution is more modular and isolated. Essentially, what we need to do is the same thing as when passing the proxy as a meta parameter:

from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Set the proxy for every request and attach basic auth credentials
        request.meta["proxy"] = "http://192.168.1.1:8050"
        request.headers["Proxy-Authorization"] = basic_auth_header(
            "<proxy_user>", "<proxy_pass>")

In the code above, we define the proxy URL and the necessary authentication info. Make sure you also enable this middleware in the settings and give it a lower order number than HttpProxyMiddleware, so its process_request runs first:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

How to verify if your custom proxy is working?

To verify that you are indeed scraping through your proxy, you can request a test page that echoes the caller’s IP address and location. If it shows the proxy’s address rather than your machine’s actual IP, the proxy is working correctly.
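As a quick sketch, the spider below requests httpbin.org/ip (a public service that echoes the IP it sees; any similar service works) and logs the reported origin IP, which should be the proxy’s address. The spider name and the proxy address are placeholders:

import json

import scrapy

class ProxyCheckSpider(scrapy.Spider):
    name = "proxy_check"

    def start_requests(self):
        # httpbin.org/ip simply echoes back the IP address it sees
        yield scrapy.Request("https://httpbin.org/ip", callback=self.parse,
                             meta={"proxy": "http://192.168.1.1:8050"})

    def parse(self, response):
        # If the proxy is working, this logs the proxy's IP, not yours
        self.logger.info("Outgoing IP: %s", json.loads(response.text)["origin"])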

Rotating proxies

Now that you know how to set up Scrapy to use a proxy, you might think you are done and your IP banning problems are solved forever. Unfortunately not! What if the proxy we just set up gets banned as well? What if you need multiple proxies for multiple pages? Don’t worry, there is a solution called IP rotation, and it is key to successful scraping projects.

When you rotate a pool of IP addresses, your scraper essentially picks one address at random to make each request. If the request succeeds, i.e. returns the proper HTML page, we can extract data and we’re happy. If it fails for some reason (IP ban, timeout error, etc.), we can’t extract the data, so we need to pick another IP address from the pool and try again. Obviously, this can be a nightmare to manage manually, so we recommend using an automated solution (a minimal manual sketch follows below).
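To make the idea concrete, here is a minimal, hypothetical middleware that picks a random proxy from a hard-coded pool for each request. It does no ban detection or retry bookkeeping, which is exactly the part that becomes painful to maintain by hand:

import random

class RandomProxyMiddleware(object):
    # Hypothetical pool; replace these with your own proxy addresses
    PROXY_POOL = [
        "http://192.168.1.1:8050",
        "http://192.168.1.2:8050",
    ]

    def process_request(self, request, spider):
        # Pick a different proxy for every outgoing request
        request.meta["proxy"] = random.choice(self.PROXY_POOL)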

IP rotation in Scrapy

If you want to implement IP rotation for your Scrapy spider you can install the scrapy-rotating-proxies middleware which has been created just for this.

It takes care of the rotation itself, adjusts the crawling speed, and makes sure we’re only using proxies that are actually alive.

After installing the package and enabling its middlewares (see the settings sketch after the proxy list below), you just have to add the proxies you want to use as a list in settings.py:

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
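To enable the middlewares, the scrapy-rotating-proxies documentation suggests entries along these lines in DOWNLOADER_MIDDLEWARES; double-check the exact paths and order numbers against the version you install:

DOWNLOADER_MIDDLEWARES = {
    # Paths and priorities taken from the scrapy-rotating-proxies docs;
    # verify them against your installed version.
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}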

Also, you can customize things like the ban detection method, how many times a page is retried with different proxies, and so on.
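For instance, the snippet below sketches two such settings. The setting names follow the scrapy-rotating-proxies documentation, so treat them as assumptions to verify, and the policy path is a hypothetical example:

# How many times to retry a page with different proxies before giving up
ROTATING_PROXY_PAGE_RETRY_TIMES = 5

# Dotted path to a custom ban detection policy class that implements
# response_is_ban() and exception_is_ban()
ROTATING_PROXY_BAN_POLICY = 'myproject.policy.MyBanPolicy'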

Conclusion

So now you know how to set up a proxy in your Scrapy project and how to manage simple IP rotation. But if you are scaling up your scraping projects, you will quickly find yourself drowning in proxy-related issues. You will lose data quality, and ultimately you will waste a lot of time and resources dealing with proxy problems.

For example, you will find yourself dealing with unreliable proxies, poor success rates, and poor data quality, and you will get bogged down in minor technical details that stop you from focusing on what really matters: making use of the data. How can you end this never-ending struggle? By using an already available solution that handles all of these headaches for you.

This is exactly why we created Zyte Proxy Manager. It enables you to reliably crawl at scale, managing thousands of proxies internally so you don’t have to. You never need to worry about rotating or swapping proxies again. Here's how you can use Zyte Proxy Manager with Scrapy.

If you’re tired of troubleshooting proxy issues and would like to give Zyte Smart Proxy Manager a try, then sign up today. It has a 14-day free trial!

Useful links

Smart Proxy Manager
