Painless HTML parsing with lxml

2009-01-14 17:17:10

I am working on a Ruminator 2.0. I intend to parse full stories, not just the summaries that appear in RSS.

So I’ve been investigating my options for HTML parsing. There are quite a few options for Python, with varying degrees of speed, flexibility, and tolerance for broken markup.

After a rapturous writeup from Ian Bicking, I thought I’d try lxml, which is a Pythonic wrapper around GNOME’s libxml2 and libxslt libraries. I’m sold. You can even use CSS selectors, just like jQuery! (I like not having too much loaded into my head at once.)

Suppose you want to scrape a news story (for statistical analysis, not copyright infringement) from the NZ Herald:

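>>> from lxml.html import parse
>>> # Fetch and parse the page; getroot() returns the <html> element
>>> doc = parse('http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=10551829&ref=rss&pnum=0').getroot()
>>> # Select every paragraph inside the article container
>>> paras = doc.cssselect('div.article-holder p')
>>> for p in paras:
...     print(p.text_content())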

Easy peasy.
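
For trying the same idea without hitting the network, here is a minimal sketch that parses an inline string instead of a live page. The sample markup, the extract_paragraphs helper and the ARTICLE_PARAS constant are all made up for illustration; only the div.article-holder class comes from the example above. It also spells the selector as XPath, which lxml supports out of the box, since recent lxml releases appear to need the separate cssselect package before cssselect() will work.

```python
from lxml import html

# Hypothetical markup standing in for a downloaded story page.
SAMPLE = """
<html><body>
  <div class="nav"><p>Home | News</p></div>
  <div class="article-holder">
    <p>First <b>paragraph</b>.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

# XPath spelling of the CSS selector 'div.article-holder p'.
# The concat/normalize-space dance matches the class as a whole word,
# so class="story article-holder" also qualifies.
ARTICLE_PARAS = (
    "//div[contains(concat(' ', normalize-space(@class), ' '), "
    "' article-holder ')]//p"
)


def extract_paragraphs(markup):
    """Return the stripped text of each <p> inside div.article-holder."""
    doc = html.fromstring(markup)
    # text_content() flattens nested tags such as <b> into plain text.
    return [p.text_content().strip() for p in doc.xpath(ARTICLE_PARAS)]


if __name__ == "__main__":
    for text in extract_paragraphs(SAMPLE):
        print(text)
```

If the cssselect package is installed, doc.cssselect('div.article-holder p') should select the same elements; the XPath form is just one way to avoid the extra dependency.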

From Julian on 2009-02-26 21:33:34

I finally got a chance to try this out when a BeautifulSoup-based script from a couple of years ago stopped working. lxml worked first time, great!

Tags: python ~ lxml ~ the ruminator




Rendered at 2010-03-14 17:52:18