How to use proxies with Python Requests Proxy Module
Configuring and using proxies is not easy, especially when sending HTTP requests.
Built-in modules like urllib (and urllib2 in Python 2) help us deal with HTTP requests. Third-party tools like Requests come in handy when it comes to proxies.
Many developers use this tool because of its efficiency and ease of use for sending HTTP requests. If you want to get your proxy config right, it's key to understand how the Python Requests library handles proxies.
In the web scraping world, there are many obstacles we need to overcome, and choosing the right proxy tool that fits your needs is just one piece of the puzzle. One huge challenge is when your scraper gets blocked.
To solve it, you need to know some strategies to avoid bans. In this article, I'm going to show you how to use proxies to keep your scraper from getting banned.
How to configure proxies
In this part, we're going to cover how to configure proxies in Requests. To get started, we need a working proxy and a URL we want to send the request to.
Basic usage
    import requests

    proxies = {
        "http": "http://10.10.10.10:8000",
        "https": "http://10.10.10.10:8000",
    }
    r = requests.get("http://toscrape.com", proxies=proxies)
The proxies dictionary must follow this scheme. It is not enough to define only the proxy address and port. You also need to specify the protocol. You can use the same proxy for multiple protocols. If you need authentication, use this syntax for your proxy:
http://user:pass@10.10.10.10:8000
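If the username or password contains characters such as @ or :, they need to be percent-encoded, otherwise the proxy URL cannot be parsed unambiguously. Here is a minimal sketch of a helper that builds the proxies dictionary. The name build_proxies and the sample credentials are made up for illustration and are not part of Requests:

```python
from urllib.parse import quote


def build_proxies(host, port, user=None, password=None):
    """Return a proxies dict that uses one HTTP proxy for both protocols."""
    auth = ""
    if user is not None:
        # Percent-encode credentials so "@" or ":" inside them stay unambiguous.
        auth = quote(user, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    proxy_url = f"http://{auth}{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical credentials, only to show the encoding.
proxies = build_proxies("10.10.10.10", 8000, user="user", password="p@ss:word")
# proxies can then be passed as requests.get(url, proxies=proxies)
```

Building the URL in one place keeps the scheme, credentials, and port consistent for both the "http" and "https" keys.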
Environment variables
In the above example, you can define proxies for each individual request. If you don’t need this kind of customization you can just set these environment variables:
    export HTTP_PROXY="http://10.10.10.10:8000"
    export HTTPS_PROXY="http://10.10.10.10:1212"
This way you don’t need to define any proxies in your code. Just make the request and it will work.
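To check what Requests will actually pick up from the environment, you can ask it directly. The sketch below sets the variables from Python only so that it is self-contained (normally you would export them in the shell), and environment_proxies is a hypothetical helper name. It relies on requests.utils.get_environ_proxies, which reads lowercase variants such as http_proxy as well:

```python
import os

import requests

# Assumption: these would normally be exported in the shell. They are set
# here only to keep the sketch self-contained (same example addresses).
os.environ["HTTP_PROXY"] = "http://10.10.10.10:8000"
os.environ["HTTPS_PROXY"] = "http://10.10.10.10:1212"


def environment_proxies(url):
    """Return the proxies Requests would read from the environment for `url`."""
    return requests.utils.get_environ_proxies(url)
```

Calling environment_proxies("http://toscrape.com") returns a dictionary keyed by protocol. If the host is listed in NO_PROXY, the result is empty for that URL.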
Proxy with session
Sometimes you need to create a session and use a proxy at the same time. In this case, you first have to create a new session object and add proxies, then finally send the request through the session object:
    import requests

    s = requests.Session()
    s.proxies = {
        "http": "http://10.10.10.10:8000",
        "https": "http://10.10.10.10:8000",
    }
    r = s.get("http://toscrape.com")
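One thing to watch out for: the Requests documentation warns that values in session.proxies can be overridden by proxies found in the environment. A hedged sketch of two ways around that is to turn off environment lookups on the session, or to pass proxies on the individual request. make_proxied_session is a hypothetical helper name:

```python
import requests


def make_proxied_session(proxy_url, use_environment=False):
    """Create a Session that sends HTTP and HTTPS traffic through `proxy_url`."""
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    # trust_env=False makes the session ignore HTTP_PROXY / HTTPS_PROXY
    # (and other environment-based settings such as .netrc credentials).
    session.trust_env = use_environment
    return session


s = make_proxied_session("http://10.10.10.10:8000")
# A per-request proxies argument still takes priority over s.proxies:
# r = s.get("http://toscrape.com", proxies={"http": "http://10.10.10.10:1212"})
```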
IP rotating
As discussed earlier, a common problem that we encounter while extracting data from the web is that our scraper gets blocked.
It's frustrating because if we can’t even reach the website we won’t be able to scrape it either.
The solution for this is to use some kind of proxy, or rather multiple rotating proxies. This lets you get around the IP ban.
To be able to rotate IPs, we first need to have a pool of IP addresses. We can use free proxies that we can find on the internet or we can use commercial solutions for this. Be aware that if your product/service relies on scraped data, a free proxy solution will probably not be enough for your needs. If a high success rate and data quality are important for you, you should choose a solution like Zyte Smart Proxy Manager.
Python Requests Proxy and IP rotation
So let’s say we have a list of proxies. Something like this:
    ip_addresses = [
        "85.237.57.198:44959",
        "116.0.2.94:43379",
        "186.86.247.169:39168",
        "185.132.179.112:1080",
        "190.61.44.86:9991",
    ]
Then, we can randomly pick a proxy from the list and send the request through it. If it works properly, we can access the given site. If there’s a connection error, we might want to delete this proxy from the list and retry the same URL with another proxy.
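    import random
    import requests

    url = "http://toscrape.com"  # example target, same site as earlier

    try:
        # pick a random proxy from the pool
        proxy_index = random.randint(0, len(ip_addresses) - 1)
        # the pool holds bare host:port pairs, so add the protocol here
        proxy_url = "http://" + ip_addresses[proxy_index]
        proxy = {"http": proxy_url, "https": proxy_url}
        requests.get(url, proxies=proxy)
    except requests.exceptions.RequestException:
        # implement here what to do when there's a connection error
        # for example: remove the used proxy from the pool and retry the request using another one
        pass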
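The "remove and retry" idea can be sketched as a small loop. This is one possible approach under stated assumptions, not a definitive implementation: fetch_with_rotation and its parameters are hypothetical names, only connection errors and timeouts count as proxy failures, and the pool holds bare host:port strings like the list above. The send parameter exists so the function can be tried without real proxies.

```python
import random

import requests


def fetch_with_rotation(url, pool, max_attempts=3, timeout=10, send=requests.get):
    """Request `url` through random proxies from `pool`, dropping ones that fail.

    `pool` is a list of "host:port" strings and is modified in place.
    """
    last_error = None
    for _ in range(max_attempts):
        if not pool:
            break
        address = random.choice(pool)
        proxy_url = f"http://{address}"
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            return send(url, proxies=proxies, timeout=timeout)
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout) as error:
            # Assumption: a connection-level failure means the proxy is unusable.
            last_error = error
            pool.remove(address)
    raise RuntimeError(f"no working proxy found for {url}") from last_error
```

It could be called as fetch_with_rotation("http://toscrape.com", ip_addresses). Because the list is modified in place, dead proxies stay out of the pool for later calls.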
There are multiple ways you can handle connection errors. Sometimes the proxy that you are trying to use is simply banned. In this case, there’s not much you can do about it other than removing it from the pool and retrying with another proxy. Other times the proxy isn’t banned, and you just have to wait a little before using it again.
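Both reactions, removing a proxy for good and letting it rest for a while, can live in one small pool object. The sketch below is only an illustration: ProxyPool, its method names, and the 30-second default cooldown are assumptions, and deciding which errors mean "banned" versus "wait" is left to the caller.

```python
import random
import time


class ProxyPool:
    """Tracks which proxies are usable, resting, or removed."""

    def __init__(self, addresses, cooldown=30.0, clock=time.monotonic):
        # Every proxy starts out ready to use.
        self._ready_at = {address: float("-inf") for address in addresses}
        self._cooldown = cooldown
        self._clock = clock

    def pick(self):
        """Return a random proxy that is not resting; raise LookupError if none."""
        now = self._clock()
        ready = [a for a, ready_at in self._ready_at.items() if ready_at <= now]
        if not ready:
            raise LookupError("no proxy is ready right now")
        return random.choice(ready)

    def rest(self, address):
        """Keep `address` out of rotation for `cooldown` seconds."""
        if address in self._ready_at:
            self._ready_at[address] = self._clock() + self._cooldown

    def ban(self, address):
        """Remove `address` from the pool permanently."""
        self._ready_at.pop(address, None)
```

A scraper would call pick() before each request, then ban() the proxy on a hard failure or rest() it when the site only seems to be rate limiting.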
Implementing your own error-free smart proxy solution is a very hard task.
Next time you have proxy rotation issues, don't go for the first proxy tool you find. Consider a well-managed solution, like Zyte Smart Proxy Manager, and avoid all the unnecessary pain of proxy management.
How to solve Python requests proxy issues
As a closing note, I want to show you how to solve issues with Python Requests proxies using Zyte Smart Proxy Manager.
    import requests

    url = "http://httpbin.org/ip"
    proxy_host = "proxy.crawlera.com"
    proxy_port = "8010"
    proxy_auth = ":"  # the API key goes before the colon (left out here)
    proxies = {
        "https": "https://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
        "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    }
    # verify=False turns off TLS certificate verification for this request
    r = requests.get(url, proxies=proxies, verify=False)
What does this piece of code do?
It sends an HTTP request through the proxy. When you use Zyte Smart Proxy Manager, you don’t need to deal with proxy rotation manually. Everything is taken care of internally.
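A quick way to confirm that a request really went through a proxy is to compare the IP address that httpbin.org/ip reports with and without it. The endpoint returns JSON with an "origin" field. The sketch below is a rough check and not part of any Zyte tooling. reported_ip and proxy_is_in_use are hypothetical names, and the get parameter only exists so the functions can be exercised without network access.

```python
import requests

IP_ECHO_URL = "http://httpbin.org/ip"  # same test URL as in the example above


def reported_ip(proxies=None, get=requests.get):
    """Return the client IP that httpbin.org sees for this request."""
    response = get(IP_ECHO_URL, proxies=proxies, timeout=10)
    response.raise_for_status()
    return response.json()["origin"]


def proxy_is_in_use(proxies, get=requests.get):
    """Return True if the proxied request shows a different IP than a direct one."""
    return reported_ip(proxies, get) != reported_ip(None, get)
```

With the proxies dictionary from above, proxy_is_in_use(proxies) should be True when the proxy is working. If the direct and proxied addresses match, the request did not go through the proxy.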
If managing Python Requests proxies is too complex to do on your own and you need an easy solution, give Zyte Smart Proxy Manager a go. You can try it for free!