Kurt McKee

lessons learned in production


Hey there! This article was written in 2008.

It might not have aged well for any number of reasons, so keep that in mind when reading (or clicking outgoing links!).

Scraping LiveJournal comments

Posted 8 July 2008 in programming, python, and technology

As a first attempt at expanding my comment tracking software, I did a little testing with scraping LiveJournal comments. Having written some uncomfortably convoluted XSL transformations, I've become familiar with XPath. While BeautifulSoup has served me well in the past for quick excursions into the awful world of malformed HTML, it doesn't currently support XPath. Thus, I chose a tool called lxml, which does.

In short order I came up with a small number of XPath queries that extracted comments from an entry on my blog. Unfortunately, I then discovered that LiveJournal's comment system allows comment threads to be collapsed and span multiple pages, which means I'm going to need to go back and do a little more work for completeness' sake.
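For a rough idea of what that kind of extraction looks like, here is a minimal sketch using lxml's HTML parser and a few XPath queries. These are not the queries from this post: the sample markup, the class names ("comment", "poster", "body"), and the extract_comments helper are all made up for illustration, and LiveJournal's real markup will differ. It also only handles comments that appear in full on a single page, so collapsed threads and extra pages are out of scope.

```python
from lxml import html

# Hypothetical comment markup, NOT LiveJournal's actual HTML.
# The second comment is deliberately malformed (unclosed <p>, missing </div>).
SAMPLE_PAGE = """
<html><body>
<div class="comment" id="t101">
  <span class="poster">alice</span>
  <div class="body">First!</div>
</div>
<div class="comment" id="t102">
  <span class="poster">bob</span>
  <div class="body">Nice post.<p>unclosed paragraph
</div>
</body></html>
"""


def extract_comments(page_source):
    """Return a list of dicts, one per comment found in the page."""
    # lxml's HTML parser is forgiving and repairs malformed markup.
    tree = html.fromstring(page_source)
    comments = []
    # One query finds the comment containers...
    for node in tree.xpath('//div[@class="comment"]'):
        # ...and relative queries pull the fields out of each one.
        # string(...) returns the text content of the first matching node.
        poster = node.xpath('string(.//span[@class="poster"])').strip()
        body = node.xpath('string(.//div[@class="body"])').strip()
        comments.append({"id": node.get("id"), "poster": poster, "body": body})
    return comments


if __name__ == "__main__":
    for comment in extract_comments(SAMPLE_PAGE):
        print(comment["id"], comment["poster"], comment["body"])
```

Running the container query first and then the relative queries against each node keeps every comment's fields grouped together, which matters once comments are nested or partially collapsed.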

Caleb, you may eventually have a comment feed for LiveJournal.
