I spent some time this weekend working on porting feedparser to Python 3, and found that it will be difficult because feedparser includes two separate parsers (a strict parser and a loose parser). Each works differently, but both feed the same core machinery in feedparser.
With the strict parser, things are going fairly well. The "wellformed" test cases don't throw any errors, although some do fail. I have confidence most of those can be fixed fairly easily, but the "illformed" test cases present a more serious problem. What I found is that the SAX parser that the strict parser uses will pass Python 3 str objects to the core machinery. If there's a problem, however, and feedparser falls back on the loose parser -- which uses an SGML parser -- that core machinery will suddenly be dealing with Python 3's bytes objects, and will seize up and throw errors every which way.
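The mismatch is easy to reproduce outside feedparser. Here's a minimal sketch, assuming a made-up handle_data function standing in for the shared machinery (it is not feedparser's actual code): it works when it receives str, and raises as soon as bytes arrive.

```python
def handle_data(pieces, text):
    """Hypothetical stand-in for shared code that accumulates text."""
    pieces.append(text)
    # The join assumes every piece is a str.
    return "".join(pieces)


# Strict path: str goes in, str comes out.
strict_result = handle_data([], "<title>ok</title>")

# Loose path: bytes reach the same code, and Python 3 refuses to mix them.
try:
    handle_data([], b"<title>ok</title>")
    loose_error = None
except TypeError as exc:
    loose_error = exc
```

Python 2 would quietly accept both inputs here, which is presumably why the shared code never had to care about the difference.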
I'm still considering how I want to handle these problems. There's a one-line change that will resolve 1800 "illformed" test case errors (although 700 tests start failing), but it doesn't resolve my central question: do I want to standardize the machinery on Python 3's str or bytes objects? If I choose str objects, that may prove to be a less invasive and time-consuming choice, but it may prove impossible to resolve some of the "illformed" errors and failures. If I choose bytes objects, I may have to make significant changes to sgmllib.py (which would introduce a new library to port and maintain, since it's no longer available in Python 3) as well as feedparser itself. I could choose both, but I'd sure like to avoid a bunch of if-else statements, which could introduce significant code duplication.
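One way to keep the if-else statements from spreading would be to convert once, at the point where parser output enters the shared machinery. This is only a sketch of the str option, not something in my branch: to_text is a hypothetical helper, the utf-8 default is an assumption (real code would have to use whatever encoding feedparser detected for the feed), and it relies on the bytes built-in, so it wouldn't run as-is on Python 2.4.

```python
def to_text(data, encoding="utf-8"):
    """Hypothetical helper: return str whether given bytes or str."""
    if isinstance(data, bytes):
        # "replace" keeps going on undecodable bytes instead of raising.
        return data.decode(encoding, "replace")
    return data


from_loose_parser = to_text(b"caf\xc3\xa9")  # bytes, as the SGML path delivers
from_strict_parser = to_text("caf\u00e9")    # str, as the SAX path delivers
```

The type check then lives in one function instead of being repeated everywhere the machinery touches text.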
Whatever ends up being the most effective choice, it's clear that I don't use git to its fullest potential. I downloaded the feedparser repository using git-svn and then created a git branch to track incremental changes. I've already pursued a few ideas that I committed and later undid, but I'm not using git to backtrack: I'm manually reverting the changes once they turn out to be evolutionary dead ends.
On a positive note, every change I make is tested against Python 2.4 and up to ensure that I don't lose any existing functionality.
Oh, and I want to ask the feedparser community: why are some arguments to functions named so they conflict with Python's built-in namespace? I'm honestly boggled how one function works at all, with code that first uses type as a variable name, and later uses it as a built-in for an equality test! (I'm looking at you, _sanitizeHTML.) I assume it's a relic from the bad days when Python 2 didn't treat built-in objects as first-class citizens, but while I could be wrong, I'm definitely confused. Regardless, it's fixed in my branch.
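To show why this kind of naming worries me, here's a small illustration of the shadowing. Both functions are invented for the example and are not the code from _sanitizeHTML: inside a function that takes an argument called type, the name refers to the argument, so calling it like the built-in fails.

```python
def is_text_shadowed(value, type):
    """Hypothetical: the 'type' argument hides the built-in type()."""
    try:
        # This calls the argument (a str here), not the built-in.
        return type(value) == str
    except TypeError:
        return "argument is not callable"


def is_text_renamed(value, type_):
    """Same check with the argument renamed, so type() is the built-in."""
    return type(value) == str


shadowed = is_text_shadowed("hello", "text/html")
renamed = is_text_renamed("hello", "text/html")
```

The built-in is only hidden in the scope where the name is bound, so code in a different function or method still sees the real type(). That may be part of how the original gets away with it, but I haven't confirmed that.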