Scrapy on the road to Python 3 support
Scrapy is one of the few popular Python packages (almost 10k GitHub stars) that doesn't yet support Python 3. The team and the community around it are working to make it compatible as soon as possible. Here's an overview of what has been happening so far.
You're invited to read along and participate in the porting process from Scrapy's GitHub repository.
First off you may be wondering: "why is Scrapy not in Python 3 yet?"
If you asked around, you likely heard an answer like "It's based on Twisted, and Twisted is not fully ported yet, you know?". Many people blame Twisted, but other things have actually been holding back the Scrapy Python 3 port.
When it comes to Twisted, its most important parts are already ported. But if we want Scrapy spiders to download from HTTP URLs, we really need a fix or workaround for Twisted's HTTP agent, which doesn't work on Python 3.
The Scrapy team started moving towards Python 3 support two years ago by porting some of Scrapy's dependencies. For about a year now, a subset of Scrapy's tests has been executed under Python 3 on each commit. Apart from Twisted, one bottleneck that blocked progress for a while was that most of Scrapy requires Request or Response objects to work - this was recently resolved, as described below.
Scrapy core devs meet and prepare a sprint
During EuroPython 2015, from July 20 to 28, several Scrapy core developers gathered in Bilbao and made progress on the porting process - meeting in person for the first time after several years of working together.
There was a Scrapy sprint scheduled for the weekend. Mikhail Korobov, Daniel Graña, and Elias Dorneles teamed up to prepare Scrapy for it by porting Request and Response in advance. That way, it would be easier for other people to join and contribute during the sprint.
In the end, they ran short of time to fully port Request and Response before the weekend. Some of the issues they faced were:
- Should HTTP headers be bytes or unicode? Is the answer different for keys and values? Some header values are usually UTF-8 (e.g. cookies); HTTP Basic Auth headers are usually latin1; for other headers there is no single universal encoding either. Generally, bytes make sense for HTTP headers, but there is a gotcha: if you're porting an existing project from Python 2.x to 3.x, code that worked before may silently start producing incorrect results. For example, say a response has an 'application/json' content type. If header values are bytes, then in Python 2.x `content_type == 'application/json'` returns True, but in Python 3.x it returns False, because you're comparing a unicode literal with bytes (see the first sketch after this list).
- How do you percent-escape and unescape URLs properly? Proper escaping depends on the web page encoding and on which part of the URL is being escaped. This matters whenever a webpage author uses non-ascii URLs. After some experiments, we found that browsers do crazy things here: the URL path is encoded to UTF-8 before escaping, but the query string is encoded to the web page encoding before escaping. You can't trust browser URL bars to check this: what browsers display to users depends on the browser and operating system, but what they send to servers is consistent - a UTF-8 path and a page-encoded query string - in both Firefox and Chrome on OS X and Linux (see the second sketch after this list).
- URL-related functions are very different in Python 2.x and 3.x: in Python 2.x they only accept bytes, while in Python 3.x they only accept unicode. Combined with the encoding craziness, this makes porting the code harder still.
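To make the header gotcha concrete, here is a minimal sketch; the `content_type` variable stands in for a header value and is made up for illustration:

```python
# A response header value stored as bytes:
content_type = b'application/json'

# Python 2: str IS bytes, so comparing with a native string literal
# works as expected and this prints True.
# Python 3: bytes and str are distinct types, so the same comparison
# silently evaluates to False - no error, just a wrong result.
print(content_type == 'application/json')

# A portable check has to compare bytes with bytes:
print(content_type == b'application/json')  # True on both versions
```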
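And a minimal sketch of the browser-style escaping described above; the page encoding, URL parts, and the `browser_style_escape` helper are made up for illustration. Note how even the import of `quote` differs between Python 2 and 3:

```python
import sys

if sys.version_info[0] >= 3:
    from urllib.parse import quote  # Python 3 location
else:
    from urllib import quote        # Python 2 location

def browser_style_escape(path, query, page_encoding):
    # Browsers escape the two URL parts differently:
    # the path is encoded to UTF-8 before percent-escaping...
    escaped_path = quote(path.encode('utf-8'))
    # ...while the query string is encoded to the page's own encoding.
    escaped_query = quote(query.encode(page_encoding), safe='=&')
    return escaped_path + '?' + escaped_query

# A windows-1251 page linking to a non-ascii URL:
print(browser_style_escape(u'/поиск', u'q=книга', 'cp1251'))
# -> /%D0%BF%D0%BE%D0%B8%D1%81%D0%BA?q=%EA%ED%E8%E3%E0
```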
The EuroPython sprint weekend arrives
To unblock further porting, the team decided to use bytes for HTTP headers and to temporarily disable some of the tests for non-ascii URL handling, thus removing the two bottlenecks that were holding things back.
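The bytes-for-headers decision is easy to picture as a container that normalizes everything at the door. The class below is a simplified, hypothetical sketch of that idea, not Scrapy's actual implementation:

```python
class BytesHeaders(dict):
    """Toy HTTP headers mapping that stores keys and values as bytes,
    encoding any text it receives. A hypothetical sketch, not the
    real Scrapy headers class."""

    def __init__(self, seed=None, encoding='utf-8'):
        super(BytesHeaders, self).__init__()
        self.encoding = encoding
        for key, value in (seed or {}).items():
            self[key] = value

    def _to_bytes(self, value):
        if isinstance(value, bytes):
            return value
        return value.encode(self.encoding)

    def __setitem__(self, key, value):
        super(BytesHeaders, self).__setitem__(
            self._to_bytes(key), self._to_bytes(value))

    def __getitem__(self, key):
        return super(BytesHeaders, self).__getitem__(self._to_bytes(key))


headers = BytesHeaders({'Content-Type': 'text/html'})
print(headers['Content-Type'])   # b'text/html' on Python 3,
print(headers[b'Content-Type'])  # whichever form the key takes
```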
The sprint itself was quiet but productive. Some highlights of what the developers did there:
- They cleared the main blockers for the port - the handling of URLs and headers - and migrated the Request and Response classes, so it's finally possible to divide the work and port each component independently;
- Elias split the Scrapy Selectors into a separate library, called Parsel, which has reached a stable point and which Scrapy now depends on (work is being done on the documentation ahead of an official release; see the example after this list);
- Mikhail and Daniel ported several Scrapy modules to make further contributions easier;
- A mystery contributor came, silently ported a Scrapy module, and left without a trace (if that was you, please reach out!);
- Two more people joined; completely new to Scrapy, they had fun setting up their first project.
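For the curious, Parsel can already be used on its own. A quick taste of its API (the HTML snippet here is made up for illustration):

```python
from parsel import Selector

html = u'<html><body><h1>Hello, Parsel!</h1><a href="/next">next</a></body></html>'
sel = Selector(text=html)

# CSS and XPath expressions work interchangeably on the same selector:
print(sel.css('h1::text').extract())     # [u'Hello, Parsel!']
print(sel.xpath('//a/@href').extract())  # [u'/next']
```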
The road is long, but the path is clear!
In the end, the plan worked as expected. After porting the Request and Response objects and making some hard decisions, the road to contributions is open.
In the weeks that followed the sprint, developers continued to work on the port, and they also received important contributions from the community. As you can see here, several pull requests have already been merged (for example, from @GregoryVigoTorres and @nyov).
Before the sprint, ~250 tests passed in Python 3; the number is now over 600. These recent advances increased our test coverage under Python 3 from 19% to 54%.
Our next major goal is to port the Twisted HTTP client so spiders can actually download something from remote sites.
There is still a long way to go to full Python 3 support, but Scrapy is now in much better shape for the port. Join us and contribute to porting Scrapy by following these guidelines. We have added a badge to GitHub showing the progress of Python 3 support; the percentage is the number of tests passing in Python 3 divided by the total number of tests. Currently, 633 tests pass on Python 3 out of 1153 in total (roughly 55%).
Thanks for reading!