Scraping LiveJournal comments
Posted 8 July 2008 in programming, python, and technology
As a first attempt at expanding my comment tracking software, I did a little testing with scraping LiveJournal comments. Having written some uncomfortably convoluted XSL transformations in the past, I've become familiar with XPath. While BeautifulSoup has served me well for quick excursions into the awful world of malformed HTML, it doesn't currently support XPath. Thus, I chose lxml, which does.
In short order I came up with a small number of XPath queries that extracted comments from an entry on my blog. Unfortunately, I then discovered that LiveJournal's comment system allows comment threads to be collapsed and span multiple pages, which means I'm going to need to go back and do a little more work for completeness' sake.
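For a rough idea of what XPath queries like these can look like with lxml, here is a minimal sketch. The sample markup, the class names (`comment`, `comment-author`, `comment-text`), and the `extract_comments` helper are all made up for illustration; they are not LiveJournal's actual page structure, and the sketch makes no attempt to handle collapsed threads or multiple pages.

```python
from lxml import html

# Hypothetical markup: these class names are assumptions for illustration,
# not LiveJournal's real comment structure.
SAMPLE_PAGE = """
<html><body>
<div class="comment" id="t101">
  <span class="comment-author">alice</span>
  <div class="comment-text">Nice post!</div>
</div>
<div class="comment" id="t102">
  <span class="comment-author">bob</span>
  <div class="comment-text">I agree.</div>
</div>
</body></html>
"""


def extract_comments(page_source):
    """Return a list of dicts, one per comment found in the page source."""
    tree = html.fromstring(page_source)
    comments = []
    # One XPath query selects every comment container...
    for node in tree.xpath('//div[@class="comment"]'):
        # ...and relative queries pull the fields out of each container.
        author = node.xpath('string(.//span[@class="comment-author"])').strip()
        text = node.xpath('string(.//div[@class="comment-text"])').strip()
        comments.append({"id": node.get("id"), "author": author, "text": text})
    return comments


if __name__ == "__main__":
    for comment in extract_comments(SAMPLE_PAGE):
        print(comment["id"], comment["author"], comment["text"])
```

The real queries would need to match whatever markup a given journal style actually emits, so the expressions above are placeholders rather than something to copy as-is.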
Caleb, you may eventually have a comment feed for LiveJournal.