Date parsing
Posted 17 December 2012 in feedparser and listparser
I have lost patience with the RFC 822 date parsing in both feedparser and listparser. Back in 2009, when I started writing listparser, I decided to use regular expressions to turn RFC 822 date strings into Python datetime objects. Earlier this year, when I discovered that feedparser's RFC 822 parser had copied code from Python's rfc822 module, I stripped it out and replaced it with the code I'd written for listparser.
Over time it's been necessary to tweak the code to support additional variations: extra commas, extra whitespace, swapped days and months, non-standard timezone modifications, and so on. So this weekend I decided to look at what the regular expression currently looks like. The result is not pretty:
(?:(?P<dayname>mon|tue|wed|thu|fri|sat|sun), )?(?P<day> *\\d{1
,2}) (?P<month>jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec
)(?:[a-z]*,?) (?P<year>(?:\\d{2})?\\d{2})(?: (?P<hour>\\d{2}):
(?P<minute>\\d{2})(?::(?P<second>\\d{2}))? (?:etc/)?(?P<tz>ut|
gmt(?:[+-]\\d{2}:\\d{2})?|[aecmp][sd]?t|[zamny]|[+-]\\d{4}))?
What's worse, supporting swapped days and months requires a second regular expression to match that order, too. So I decided to rewrite the code using str.split() and a couple of dictionaries. I then ran timing tests on the whole affair, and I'm feeling pretty pleased with the results so far: the new code just barely edges out the current code. I expect the new parser to land in feedparser after I integrate it into listparser.
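To give a rough idea of the approach, here is a minimal sketch of a parser built on str.split() and a few dictionaries. It is not the code headed for feedparser or listparser. The names parse_rfc822, DAYNAMES, MONTHS, and TZ_HOURS are hypothetical, and the handling of two-digit years, missing times, and missing timezones reflects assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup tables for this sketch (not taken from feedparser/listparser).
DAYNAMES = {'mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun'}
MONTHS = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12,
}
TZ_HOURS = {  # whole-hour offsets from UTC
    'ut': 0, 'gmt': 0, 'z': 0,
    'est': -5, 'edt': -4, 'cst': -6, 'cdt': -5,
    'mst': -7, 'mdt': -6, 'pst': -8, 'pdt': -7,
}


def parse_rfc822(text):
    """Return an aware datetime, or None if the string cannot be parsed."""
    # Treating commas as whitespace absorbs extra commas and extra spaces.
    parts = text.lower().replace(',', ' ').split()
    # Drop an optional leading day name ("mon", "monday", ...).
    if parts and parts[0][:3] in DAYNAMES:
        parts = parts[1:]
    if len(parts) < 3:
        return None
    # Swapped day and month ("dec 17 2012"): put them back in order.
    if parts[0][:3] in MONTHS:
        parts[0], parts[1] = parts[1], parts[0]
    month = MONTHS.get(parts[1][:3])
    if month is None:
        return None
    try:
        day, year = int(parts[0]), int(parts[2])
        # Assumption: a missing time means midnight.
        clock = [int(n) for n in parts[3].split(':')] if len(parts) > 3 else [0, 0]
    except ValueError:
        return None
    if len(clock) == 2:
        clock.append(0)  # seconds are optional
    if len(clock) != 3:
        return None
    if year < 100:
        # Assumption: two-digit years below 50 belong to the 2000s.
        year += 2000 if year < 50 else 1900
    # Assumption: a missing timezone is treated as GMT.
    tz = parts[4] if len(parts) > 4 else 'gmt'
    if len(tz) == 5 and tz[0] in '+-' and tz[1:].isdecimal():
        sign = -1 if tz[0] == '-' else 1
        offset = sign * timedelta(hours=int(tz[1:3]), minutes=int(tz[3:5]))
    elif tz in TZ_HOURS:
        offset = timedelta(hours=TZ_HOURS[tz])
    else:
        return None
    try:
        return datetime(year, month, day, *clock, tzinfo=timezone(offset))
    except ValueError:  # out-of-range day, hour, offset, and so on
        return None
```

Because the day/month swap is a single dictionary lookup, this sketch needs no second pattern for that case. It deliberately leaves out some forms the regular expression accepts, such as the "etc/" prefix and "gmt+01:00" style offsets.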