Sanitising smelly text

Tuesday, December 04 2007

At work we are migrating an old site to a new CMS.

Unfortunately the content is a mess. Owing to people pasting text in from Word and various other accidents, one fragment of HTML can be a mixture of UTF-8 and Latin-1 and cp1252 and goodness knows what else. When you’ve been a good boy and coded all your templates to declare “I am UTF-8, honest guv” it’s a bit trying. Especially when the client complains.

The markup is pretty broken too. It’s littered with weird Word-generated tags and is generally non-compliant.

So far I’m having good results from a pipeline of various tricks.

  1. Python’s unicode function. unicode takes a byte string and decodes it into a Unicode string. You can optionally force it to treat the input as a particular encoding, and you can tell it how to handle errors.
  2. Beautiful Soup. It finds tag soup delicious. It also makes a best-effort attempt to detect encodings and transcode to Unicode. (You have to love software with a class called UnicodeDammit.)
  3. htmltidy, in its utidylib manifestation. Does beautiful cleanup. It’s not super-robust though; I can make it segfault and dump core by feeding it the crap we have. Which is why I clean up with BeautifulSoup first.
  4. I butchered Josh Goldfoot’s marvellous XSS-defense script to strip out some of the more outrageous markup that I know we won’t use.
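As a rough illustration of step 1 only, here is a minimal sketch of decoding with fallbacks. It is not the script I actually ran: it assumes Python 3, where bytes.decode(encoding, errors) plays the role that unicode(s, encoding, errors) plays in Python 2, and the function name to_text and the order of encodings are made up for the example.

```python
def to_text(raw, encodings=("utf-8", "cp1252")):
    """Decode a byte string by trying each encoding strictly, in order.

    Hypothetical helper: the name and the encoding order are assumptions.
    """
    for enc in encodings:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid bytes,
            # which is how we notice that the guess was wrong.
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never fails.
    return raw.decode("latin-1")


# Valid UTF-8 passes through on the first attempt.
utf8_result = to_text("café".encode("utf-8"))

# Word-style curly quotes in cp1252 are not valid UTF-8, so the fallback kicks in.
cp1252_result = to_text(b"\x93hi\x94")

# Alternatively, force one encoding and choose an error policy instead of falling back.
replaced = b"caf\xe9".decode("utf-8", errors="replace")
```

Trying UTF-8 first matters, because cp1252 and Latin-1 will happily decode almost any bytes and turn real UTF-8 into mojibake. Note that this works per string; a fragment that mixes encodings internally still needs something smarter, which is where Beautiful Soup earns its keep.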

The only downside is that over thousands of items, this is pretty slow. But it’s the price you pay to be beautiful, I guess.


Tags: python ~ unicode ~ markup ~ programming ~ html tidy ~ beautiful soup

More webcam success on Ubuntu Gutsy

Saturday, December 01 2007

Flushed with the pleasure of my last purchase, I bought another webcam from Dick Smith Electronics, so that I could set up video conferencing on the PC upstairs. This webcam was on special for $20. It is a DSE XH5221. And I got it working. The chipset turns out to be a Pixart PAC7311, which shows up in lsusb with USB vendor ID 093a.

i can has webcam

It didn’t run on Ubuntu Gutsy straight away, but it turns out that there is a newer version of gspca that supports it, and I was able to download and install it. There are some nice step-by-step instructions here.

The picture quality is pretty bad, to be honest. Way too contrasty, and with distinct blocky artifacts. However, I can live with that for $20, and I am going to fool with the driver source to see what I can do.

I’ve tested it with Skype, motion, cheese and camorama – works with all of them. I also discovered that if you have been using another USB camera since boot time, all of those programs can get confused (or maybe it’s a V4L problem). But as long as you haven’t plugged another webcam in first, this one works fine.


Tags: webcam ~ linux ~ ubuntu
