Supporting a timeout in feedparser
Posted 5 January 2011 in feedparser

There's been a lot of discussion going on in an already-closed feature request for feedparser regarding how best to support a URL request timeout, and I thought it would be good to summarize the issue and what options are available.
The problem
The problem revolves around Python's default request timeout, which is to never time out. Developers get hit by this when they use feedparser to request a URL from a server that never responds: their application hangs, waiting for a response until the end of time. The expectation is that feedparser should be configurable so that developers can set a timeout on URL requests. That seems like a reasonable expectation to me, and I think it would be a valuable addition if the issues below can be resolved. That said, for all of its conveniences feedparser is not an HTTP client, and I'm also willing to question the expectation that it has to support timeouts. (You can and should read "Adding features" by Havoc Pennington for more information on my view here.)
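You can see the stdlib's behavior for yourself; here's a minimal check (nothing feedparser-specific is assumed):

import socket

# Python's default is no timeout at all: this prints None, which
# means a connect() or recv() on a new socket can block forever.
print socket.getdefaulttimeout()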
Unfortunately, Python didn't support fine-grained, per-request control over timeouts until version 2.6, when urlopen() gained a timeout argument. Further, since feedparser supports version 2.4 and up, there are two versions of Python that are special cases: in 2.4 and 2.5 the only way feedparser can set a timeout is by setting the global timeout.
Configuring the timeout period
There are two options for customizing the timeout period:
- Add an argument to parse()
- Add a module-level variable (say, TIMEOUT)
I feel very strongly that it's inappropriate to add an additional argument to parse(), as I reasoned in the feature request:
Unlike the other arguments to parse(), I don't see a pressing need for individualized timeouts (as opposed to, say, User-Agent headers, which might need to be modified on a case-by-case basis depending on whether a particular server will filter access based on the User-Agent). It seems more reasonable to expect that developers would want to set a global timeout.
With that in mind, the rest of this entry works with the assumption that we're dealing with a module-level TIMEOUT variable.
The available options
- Make it work in all versions of Python
- Make it work only in Python 2.6 and up
- Punt
Option 1 is preferable if a timeout is going to be added at all. The issue is that the only way to change the timeout in Python 2.4 and 2.5 is to change the global timeout using code similar to the following:
import socket
import urllib2

# Temporarily swap in the desired timeout, make the request, then
# restore whatever the process-wide default was (try/finally so an
# error during the request can't leave the global default changed).
old_timeout = socket.getdefaulttimeout()
if TIMEOUT:
    socket.setdefaulttimeout(TIMEOUT)
try:
    f = urllib2.urlopen('http://example/')
finally:
    if TIMEOUT:
        socket.setdefaulttimeout(old_timeout)
Notice that this introduces a thread-safety issue: anything that creates a socket in another thread while feedparser has the global timeout modified will silently pick up feedparser's timeout. This is unacceptable, so option 1 is a non-starter.
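If the race isn't obvious, here's a contrived sketch of it. The sleeps, the 0.001-second timeout, and the function names are all made up for illustration:

import socket
import threading
import time

def feedparser_style_fetch():
    # Simulates option 1: swap the global default, "fetch", restore.
    old = socket.getdefaulttimeout()
    socket.setdefaulttimeout(0.001)
    time.sleep(0.1)  # stand-in for the actual network request
    socket.setdefaulttimeout(old)

def unrelated_code():
    time.sleep(0.05)  # lands inside the swapped window above
    s = socket.socket()
    print 'unrelated socket timeout:', s.gettimeout()  # 0.001, not None
    s.close()

t1 = threading.Thread(target=feedparser_style_fetch)
t2 = threading.Thread(target=unrelated_code)
t1.start(); t2.start()
t1.join(); t2.join()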
Option 2 involves a simple try-except block to deal with the different call signatures:
try:
    f = urllib2.urlopen('http://example/', timeout=TIMEOUT)
except TypeError:
    # there's no `timeout` argument in 2.4 and 2.5
    f = urllib2.urlopen('http://example/')
One wart of this try-except approach: it catches any TypeError raised inside urlopen, not just the one caused by the unsupported keyword argument, and silently retries without a timeout. There are some variations on this theme that differ in how they deal with Python 2.4 and 2.5: one possibility is to silently ignore TIMEOUT; another is to throw an exception if TIMEOUT is set but it's Python 2.4 or 2.5:
if TIMEOUT:
    # this will throw a TypeError in 2.4 and 2.5, so it's left
    # to the developer to not set TIMEOUT in the first place
    f = urllib2.urlopen('http://example/', timeout=TIMEOUT)
else:
    f = urllib2.urlopen('http://example/')
I'm not a fan of silently ignoring TIMEOUT, but throwing an exception doesn't feel right either. Adding information to the debug_message key of the result dictionary won't help developers if the application hangs, so that's out. What I haven't seen proposed is issuing a warning when parse() is called:
import sys
import warnings

class FeatureUnavailable(Warning):
    pass

def parse(*args, **kwargs):
    if TIMEOUT and sys.version_info[:2] in ((2, 4), (2, 5)):
        warnings.warn("TIMEOUT is ignored in Python 2.4 and 2.5",
                      FeatureUnavailable)
    # ... continue with the normal parse logic
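One nice property of a warning over a hard exception is that developers who do want strictness can opt into it themselves with the standard warnings machinery. A sketch, assuming feedparser exposed the FeatureUnavailable class above:

import warnings

# Hypothetical: were feedparser to export FeatureUnavailable, callers
# who would rather fail loudly than run without a timeout could
# promote the warning to an exception.
warnings.filterwarnings('error', category=FeatureUnavailable)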
If someone can convince me that adding a timeout that doesn't always work is the right decision, it seems to me that this (option 2 plus a warning) is the best way to do it. Currently, however, I've yet to be convinced on technical merits, which is why I haven't written the patch. As noted in the feature request, developers aren't completely left out in the cold! They've been able to do the following for over seven years (provided they're not creating long-running sockets in a multi-threaded application):
import socket
socket.setdefaulttimeout(10) # 10 seconds
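Put together with feedparser, the workaround looks something like this. The feed URL here is hypothetical, and exactly how a timeout surfaces in the result can vary by feedparser version; typically a fetch error ends up on the bozo_exception key rather than being raised:

import socket
import feedparser

socket.setdefaulttimeout(10)  # applies to every socket created afterwards

d = feedparser.parse('http://example.com/feed.atom')  # hypothetical URL
if d.bozo:
    # feedparser traps fetch errors, so a timed-out request shows up
    # here instead of raising in the caller
    print 'fetch problem:', d.bozo_exception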
Conclusion
I think that setting a timeout is important when requesting URLs. However, because of the scope of the feedparser project and the limitations of the older Python interpreters it currently supports, I'm not convinced that any of the potential solutions are ideal (though I think one is less evil than the others). I'm also not confident that feedparser is where a timeout needs to be added to fix these developers' hanging applications. It may be better to direct them to the urllib3, httplib2, eventlet, or twisted projects. I've only skimmed their documentation, but they all appear to support timeouts.
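For example, here's roughly what the httplib2 route could look like: fetch with a per-client timeout and hand the body to feedparser, which is happy to parse a string instead of a URL. The feed URL is hypothetical; check httplib2's docs for the details:

import httplib2
import feedparser

h = httplib2.Http(timeout=10)  # the timeout applies to this client only
response, content = h.request('http://example.com/feed.atom')  # hypothetical URL
d = feedparser.parse(content)  # feedparser accepts a string directly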
Now then, one of the commenters suggested noting this pitfall in the documentation, which I think is a great idea. Would somebody write that patch?
UPDATED: One of the commenters also mentioned that the eventlet library is another option. I've added it to the list of possibilities and added links for each.