Kurt McKee

lessons learned in production

Hey there! This article was written in 2012.

It might not have aged well for any number of reasons, so keep that in mind when reading (or clicking outgoing links!).

Date parsing

Posted 17 December 2012 in feedparser and listparser

I have lost patience with the RFC 822 date parsing in both feedparser and listparser. Back in 2009, when I started writing listparser, I decided to use regular expressions to turn RFC 822 date strings into Python datetime objects. Earlier this year, when I discovered that feedparser's RFC 822 parser had copied code from Python's rfc822 module, I stripped it out and replaced it with the code I'd written for listparser.

Over time it's been necessary to tweak the code to support additional variations: extra commas, extra whitespace, swapped days and months, non-standard timezone modifications... So this weekend I decided to look at what the regular expression currently looks like. The result is not pretty:

(?:(?P<dayname>mon|tue|wed|thu|fri|sat|sun), )?(?P<day> *\\d{1
,2}) (?P<month>jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec
)(?:[a-z]*,?) (?P<year>(?:\\d{2})?\\d{2})(?: (?P<hour>\\d{2}):
(?P<minute>\\d{2})(?::(?P<second>\\d{2}))? (?:etc/)?(?P<tz>ut|
gmt(?:[+-]\\d{2}:\\d{2})?|[aecmp][sd]?t|[zamny]|[+-]\\d{4}))?
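
To make the expression easier to read, here is a rough sketch that rejoins it and applies it to a sample date. It assumes the doubled backslashes above are string-literal escapes (so raw strings with single backslashes are used here) and that the input is lowercased before matching. The names RFC822_PATTERN and match_fields are made up for this illustration and are not from feedparser or listparser.

```python
import re

# The expression from above, rejoined and split at logical boundaries.
# Raw strings are used, assuming the doubled backslashes were escapes.
RFC822_PATTERN = re.compile(
    r"(?:(?P<dayname>mon|tue|wed|thu|fri|sat|sun), )?"
    r"(?P<day> *\d{1,2}) "
    r"(?P<month>jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"
    r"(?:[a-z]*,?) "
    r"(?P<year>(?:\d{2})?\d{2})"
    r"(?: (?P<hour>\d{2}):(?P<minute>\d{2})(?::(?P<second>\d{2}))? "
    r"(?:etc/)?"
    r"(?P<tz>ut|gmt(?:[+-]\d{2}:\d{2})?|[aecmp][sd]?t|[zamny]|[+-]\d{4}))?"
)


def match_fields(date_string):
    """Return the named groups as a dict, or None if nothing matches."""
    # Lowercasing is an assumption: the pattern only lists lowercase names.
    match = RFC822_PATTERN.match(date_string.lower())
    return match.groupdict() if match else None


fields = match_fields("Mon, 17 Dec 2012 10:30:00 GMT")
swapped = match_fields("Dec 17, 2012 10:30:00 GMT")  # month before day
```

Under these assumptions, fields holds the day, month, year, time, and timezone as strings, while swapped is None because the pattern insists on the day coming before the month.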

What's worse, supporting swapped days and months requires a second regular expression to match that form, too. So I decided to rewrite the code using str.split() and a couple of dictionaries. I then ran timing tests on the whole affair, and I'm pretty pleased with the results so far: the new code just barely edges out the current code. I expect the new parser to land in feedparser after I integrate it into listparser.
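
For a sense of what a str.split() and dictionary approach can look like, here is a minimal sketch. It is not the code that went into listparser or feedparser. The function name parse_rfc822, the lookup tables, the two-digit year pivot, and the choice to treat a missing or unknown timezone as GMT are all assumptions made for the example, and it skips some forms the regular expression handles (such as the "etc/" prefix).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup tables; names and contents are illustrative only.
DAY_NAMES = {"mon", "tue", "wed", "thu", "fri", "sat", "sun"}
MONTHS = {name: number for number, name in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split(), start=1)}
TZ_HOURS = {"ut": 0, "gmt": 0, "z": 0,
            "est": -5, "edt": -4, "cst": -6, "cdt": -5,
            "mst": -7, "mdt": -6, "pst": -8, "pdt": -7}


def parse_rfc822(date_string):
    """Parse an RFC 822-style date into an aware datetime, or return None."""
    # Treat commas as whitespace so extra commas and spaces disappear.
    parts = date_string.lower().replace(",", " ").split()
    # Drop an optional leading day name such as "mon" or "monday".
    if parts and parts[0][:3] in DAY_NAMES:
        parts = parts[1:]
    if len(parts) < 3:
        return None
    # Accept swapped day and month ("dec 17" as well as "17 dec").
    if parts[0][:3] in MONTHS:
        parts[0], parts[1] = parts[1], parts[0]
    month = MONTHS.get(parts[1][:3])
    if month is None:
        return None
    try:
        day, year = int(parts[0]), int(parts[2])
        if year < 100:
            # Two-digit year pivot: an assumption, not from the article.
            year += 2000 if year < 90 else 1900
        clock = []
        if len(parts) > 3:
            clock = [int(piece) for piece in parts[3].split(":")]
        if len(clock) > 3:
            return None
        # A missing timezone is treated as GMT (assumption).
        tz = parts[4] if len(parts) > 4 else "gmt"
        if tz[0] in "+-":
            # Numeric offsets look like +hhmm or -hhmm.
            minutes = int(tz[1:3]) * 60 + int(tz[3:5])
            offset = timedelta(minutes=-minutes if tz[0] == "-" else minutes)
        else:
            # Unknown timezone names fall back to GMT (assumption).
            offset = timedelta(hours=TZ_HOURS.get(tz, 0))
        return datetime(year, month, day, *clock, tzinfo=timezone(offset))
    except ValueError:
        # Non-numeric fields or out-of-range values.
        return None


standard = parse_rfc822("Mon, 17 Dec 2012 10:30:00 GMT")
swapped = parse_rfc822("Dec 17, 2012 10:30:00 -0500")
```

Because the day and month are identified by a dictionary lookup rather than by position in a pattern, the swapped form needs only one extra check instead of a second regular expression.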