Getting back into it (part 2)

Posted 16 October 2012 in feedparser

Last Saturday I was invited to take part in the gPodder hackfest by joining the group on IRC, and I had a blast talking with everyone! The invitation suggested that I talk about my plans for feedparser. I've thought a lot about some serious architectural issues, so after discussing them on IRC I wanted to share what I wrote (spelling mistakes have been fixed):

first, feedparser has a serious design flaw: well-formed feeds are severely punished because the initial design was to use sgmllib. all of the xml parsing is force-fed through the sgml function calls, which doubles the function call count.
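To make the call-count point concrete, here is a minimal sketch of that kind of double dispatch. It is not feedparser's actual code: `SgmlStyleHandler`, `ForwardingSaxHandler`, and `count_calls` are hypothetical names, and the `unknown_starttag`/`unknown_endtag` methods only mimic the callback style sgmllib used. Every SAX event from the XML parser is relayed to an SGML-style callback, so each element costs two extra calls.

```python
import xml.sax


class SgmlStyleHandler:
    """Hypothetical stand-in for sgmllib-style callbacks."""

    def __init__(self):
        self.calls = 0
        self.tags = []

    def unknown_starttag(self, tag, attrs):
        self.calls += 1
        self.tags.append(tag)

    def unknown_endtag(self, tag):
        self.calls += 1


class ForwardingSaxHandler(xml.sax.ContentHandler):
    """SAX handler that relays every event to the SGML-style callbacks."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.calls = 0

    def startElement(self, name, attrs):
        self.calls += 1
        # Second call for the same element: hand it to the SGML-style layer.
        self.target.unknown_starttag(name, list(attrs.items()))

    def endElement(self, name):
        self.calls += 1
        self.target.unknown_endtag(name)


def count_calls(xml_bytes):
    """Parse a document and return (SAX callback count, relayed callback count)."""
    target = SgmlStyleHandler()
    handler = ForwardingSaxHandler(target)
    xml.sax.parseString(xml_bytes, handler)
    return handler.calls, target.calls


FEED = b"<rss><channel><title>example</title></channel></rss>"
sax_calls, sgml_calls = count_calls(FEED)
```

In this toy setup the two counters always match, which is the doubling: a handler that consumed the SAX events directly would make only the first set of calls.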

feedparser has a second flaw that's bothered me. it's not modular (avoiding the word plugins throughout this paragraph), so some decisions were made that have created confusing behavior. it will sometimes act like an xml dom parser and collect everything it doesn't recognize into a dict. unfortunately, it doesn't always guess correctly (depending on the presence of element attributes iirc) and i keep getting bug reports about it not handling such-and-such namespace under such-and-such circumstance.

third, i'm not sold on the beautifulsoup dependency for several reasons. first, i don't think it's worthwhile to parse through html and try to extract vcard or vcalendar information. picking out enclosure information and tag information seems worthwhile, but i think that can be accomplished without beautifulsoup. second, beautifulsoup itself (the 3.2.x version that's supported by feedparser) is a dead end. bs4 might be an upgrade path, but at that point i'm back to "is the return on investment worth it for this dependency?"

and i should add that another objection to beautifulsoup is the effect it has on performance, which goes right back to "is the potential that someone wants a vcard they can download to their address book worth considering?" i personally don't think so, but i'm open to discussion on the topic.
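As a rough illustration of picking out enclosure and tag links without BeautifulSoup, the sketch below uses only the standard library's `html.parser`. It assumes the links are marked up as `rel="enclosure"` and `rel="tag"` anchors; `RelLinkCollector` and `collect_rel_links` are hypothetical names, not part of feedparser.

```python
from html.parser import HTMLParser


class RelLinkCollector(HTMLParser):
    """Collect hrefs from <a> elements whose rel includes 'enclosure' or 'tag'."""

    def __init__(self):
        super().__init__()
        self.enclosures = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel is a space-separated list of values; compare case-insensitively.
        rels = (attrs.get("rel") or "").lower().split()
        if "enclosure" in rels:
            self.enclosures.append(href)
        if "tag" in rels:
            self.tags.append(href)


def collect_rel_links(html):
    """Return (enclosure hrefs, tag hrefs) found in an HTML fragment."""
    collector = RelLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.enclosures, collector.tags
```

For example, `collect_rel_links('<a rel="enclosure" href="http://example.com/a.mp3">audio</a>')` returns the MP3 URL in the first list and leaves the second list empty.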

fourth, i'd like to support python 3 without 2to3 conversion if possible. i worked on this very intermittently for a few months in a private branch and then lost that work when my hdd died. that was discouraging enough i'd rather get a few wins before i tackle a potential dead end. ;)
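One common way to run the same source on Python 2 and Python 3 without a 2to3 step is to guard imports and alias the text type. The sketch below shows that general pattern only; it is an assumption about the approach, not the lost branch, and `PY3`, `text_type`, and `to_text` are hypothetical names.

```python
import sys

try:
    from urllib.parse import urljoin  # Python 3 location
except ImportError:
    from urlparse import urljoin  # Python 2 location

PY3 = sys.version_info[0] >= 3

# The conditional expression is lazy, so `unicode` is never looked up on Python 3.
text_type = str if PY3 else unicode  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

The rest of the code can then import `urljoin`, `text_type`, and `to_text` from one compatibility module instead of branching on the interpreter version everywhere.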

As a direct result of that conversation, I added a one-line patch that fixes a bug introduced by the chardet library. It's a start, and it helped me identify some more issues with my development environment.


Kurt McKee
