A developer’s guide to rotating proxies in Python
A proxy is an intermediary server that hides your IP address, letting you browse the web anonymously and securely. Proxies have many interesting use cases, the most prominent being web scraping for pricing intelligence, SEO monitoring, data collection for market research, and more. The correct use of rotating proxies is a key ingredient of all of them.
If you want to know more about proxies for web scraping and how proxies work, feel free to skim through our recent blog post.
In this developer guide, you will learn how to:
- Set up a proxy using the Python Requests library
- Use rotating proxies in three different ways:
- Using the Requests library
- Using the Scrapy rotating proxies middleware
- Using Zyte’s Smart Proxy Manager
So let’s get started!
Prerequisites
- Requests: It is an elegant and simple HTTP library for Python. It allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs or to form-encode your POST data. To install the library, run this command in the terminal.
python -m pip install requests
- Scrapy: This is one of the most powerful, fast, open-source web crawling frameworks written in Python. It extracts structured data that can be used for a wide range of applications, like data mining, information processing, or historical archival. If you are new to Scrapy, this tutorial on Scrapy would be a good place to start. Scrapy comes with a middleware that makes rotating proxies a breeze, once you have a list of working proxies. To install scrapy and scrapy-rotating-proxies, run the following commands.
pip install scrapy
pip install scrapy-rotating-proxies
- Zyte Smart Proxy Manager: This is a proxy management and antiban solution that manages proxy pools and handles bans so you can focus on extracting quality data. Follow this guide to create a Smart Proxy Manager account and get a 14-day free trial. You can cancel at any time and you won’t be charged a single penny for the free trial.
To use Smart Proxy Manager with Scrapy, you need to install the `scrapy-zyte-smartproxy` middleware.
pip install scrapy-zyte-smartproxy
How to set up a proxy using Requests?
First, import the Requests library, then create a proxy dictionary mapping the protocols (HTTP and HTTPS) to a proxy URL. Finally, make the request with requests.get, passing the proxy dictionary via the proxies parameter. For example:
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('http://example.org', proxies=proxies)
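Requests also honors the standard HTTP_PROXY and HTTPS_PROXY environment variables, so the same effect can be achieved without touching the code. A minimal sketch (the variables are set in-process here purely for demonstration, and the proxy addresses are placeholders):

```python
import os
import urllib.request

# Requests consults these environment variables (via urllib) by default
os.environ['HTTP_PROXY'] = 'http://10.10.1.10:3128'
os.environ['HTTPS_PROXY'] = 'http://10.10.1.10:1080'

# The proxy mapping Requests would pick up automatically
print(urllib.request.getproxies()['http'])
```

In practice you would export these variables in your shell before launching the script, and any requests.get call would route through the proxy with no proxies argument at all.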
Configure proxies for individual URLs
You can configure proxies for individual URLs even if the scheme is the same. This comes in handy when you want to use different proxies for different websites you wish to scrape.
import requests

proxies = {
    'http://example.org': 'http://10.10.1.10:3128',
    'http://something.test': 'http://10.10.1.10:1080',
}
requests.get('http://something.test/some/url', proxies=proxies)
Creating sessions
Sometimes you need to create a session and use a proxy at the same time to request a page. In this case, you first have to create a new session object and add proxies to it then finally send the request through the session object:
(Note that `requests.get` essentially creates and uses a `requests.Session` under the hood.)
import requests

s = requests.Session()
s.proxies = {
    "http": "http://10.10.10.10:8000",
    "https": "http://10.10.10.10:8000",
}
r = s.get("http://toscrape.com")
How to rotate proxies?
To the internet, your IP address is your identity, and you can only make a limited number of requests to a website from one IP. Websites act as regulators of sorts: they get suspicious of requests coming from the same IP over and over again and apply ‘IP rate limiting’, which can lead to blocking, throttling, or CAPTCHAs. One way to overcome this is to rotate proxies. Read more about why you need rotating proxies.
Now let's get to the “how” part. This tutorial demonstrates three ways to work with rotating proxies:
- Writing rotating proxy logic using the Requests library
- Rotating proxies in Python using the Scrapy middleware scrapy-rotating-proxies
- Using Zyte Smart Proxy Manager
Note: You don’t need any special proxies to run the code demonstrated in this tutorial. However, if your product or service relies on web-scraped data, a free proxy solution will probably not be enough for your needs.
Let’s discuss them one by one:
Rotating proxies using the Requests library
In the code shown below, we first create a proxy pool. Then we randomly pick a proxy to use for our request. If the proxy works properly, we can access the given site; if there’s a connection error, we delete that proxy from the pool and retry the same URL with another one.

import random
import requests

# Pool of proxies to rotate through (sample addresses)
proxy_pool = [
    'http://10.10.1.10:3128',
    'http://10.10.1.11:3128',
    'http://10.10.1.12:3128',
]

url = 'http://example.org'
while proxy_pool:
    proxy = random.choice(proxy_pool)
    try:
        response = requests.get(url, proxies={'http': proxy, 'https': proxy})
        break  # success, stop retrying
    except requests.exceptions.ConnectionError:
        # Drop the failing proxy and retry with another one
        proxy_pool.remove(proxy)
Rotating proxies in Python using Scrapy
In your settings.py
- Add the list of proxies like this:
ROTATING_PROXY_LIST = [
    'Proxy_IP:port',
    'Proxy_IP:port',
    # ...
]
If you want more external control over the IPs, you can even load them from a file like this.
ROTATING_PROXY_LIST_PATH = 'listofproxies.txt'
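The middleware expects that file to contain one proxy per line. A hypothetical listofproxies.txt might look like this (the addresses are placeholders):

```text
10.10.1.10:3128
10.10.1.11:3128
10.10.1.12:3128
```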
- Enable the middlewares (the priorities below follow the scrapy-rotating-proxies documentation):
DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    # ...
}
That’s it! Now all your requests will automatically be routed randomly between the proxies.
Note: Sometimes the proxy you are trying to use is simply banned. In this case, there’s not much you can do other than removing it from the pool and retrying with another proxy. Other times, if the proxy isn’t banned, you just have to wait a little before using it again.
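That wait-before-reuse idea can be sketched as a small pool that tracks a per-proxy cooldown. The class and method names below are illustrative, not part of any library:

```python
import time


class ProxyPool:
    """Remove banned proxies and rest temporarily failing ones (sketch)."""

    def __init__(self, proxies, cooldown=30):
        self.proxies = list(proxies)
        self.cooldown = cooldown   # seconds to wait before reusing a proxy
        self.rest_until = {}       # proxy -> earliest time it may be reused

    def mark_slow(self, proxy):
        # Rest the proxy for a while instead of discarding it
        self.rest_until[proxy] = time.time() + self.cooldown

    def mark_dead(self, proxy):
        # A banned proxy is removed from the pool for good
        self.proxies.remove(proxy)

    def available(self):
        # Proxies that are neither dead nor resting right now
        now = time.time()
        return [p for p in self.proxies if self.rest_until.get(p, 0) <= now]
```

This is essentially what scrapy-rotating-proxies does for you internally, with smarter ban detection and exponential backoff on the cooldown.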
Use Zyte Smart Proxy Manager
The above-discussed ways to rotate proxies work well for building demos and minimum viable products. But things can get tricky as soon as you decide to scale your data extraction project. Managing the infrastructure of a proxy pool is challenging, time-consuming, and resource-intensive. You will soon find yourself refurbishing proxies to keep the pool healthy, managing bans and sessions, rotating user agents, and so on. The proxy infrastructure also needs to be configured to work with headless browsers to crawl JavaScript-heavy websites. Phew! It’s no surprise how quickly a data extraction project turns into a proxy management project.
Thanks to the Zyte Smart Proxy Manager – you don't need to rotate and manage any proxies. It is all done automatically so you can focus on extracting quality data. Let’s see how easy it is to integrate with your scrapy project.
- In the settings file of your Scrapy project, enable the middleware
# enable the middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610,
}
- In the same settings file, add these settings
# enable Zyte Smart Proxy Manager
ZYTE_SMARTPROXY_ENABLED = True
# the API key you get with your subscription
ZYTE_SMARTPROXY_APIKEY = '<your_zyte_proxy_apikey>'
Here is demo code for the above-discussed settings. (This spider sets them as spider-level attributes instead of in settings.py, which the middleware also supports.)

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    zyte_smartproxy_enabled = True
    zyte_smartproxy_apikey = '<your_zyte_proxy_apikey>'
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "X-Crawlera-Profile": "desktop",
            "X-Crawlera-Cookies": "disable",
        }
    }

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
This piece of code sends successful HTTP requests to https://quotes.toscrape.com/. When you use Zyte Smart Proxy Manager, you don’t need to deal with proxy rotation manually. Everything is taken care of internally through the use of our rotating proxies.
You can try Zyte Smart Proxy Manager for 14 days for free.