Forthcoming improvements to the Ruminator
Sunday, August 03 2008
I have been experimenting with graphing Ruminator data, so that one can see trends a la the Google Zeitgeist.
On Friday Matt At Work sent around a link to a nice Canvas-based graph widget, and this was the kick in the arse I needed to get going. I have visions of being able to graph word counts over time, but also to make any given day’s data available as a histogram, and so on.
There are some real UI challenges here. Not every word you can think of appears in the news, but my corpus already has about 40,000 words in it, and there are only going to be more in future. Lookup lists are out of the question. Finally, a use case where AJAX is really justified—I’ve long admired Zoomin’s search box, and it looks like just the ticket for this kind of situation.
I’m still only part way through, because I have the dreaded lurgy and my energy levels for weekend hacking are low. However, I have also discovered the Ajax Autocompleter widget from Scriptaculous, and got that working. I hope to have a page people can play with soon.
Tags: the ruminator | ajax | graphs
More fooling around with DHTML
Sunday, June 01 2008
Update: I like it better now. Still not guaranteed to work in your browser.
I’m still ironing out the kinks, but here is a first crack at a year’s worth of New Zealand news, more or less. It’s showing the top 10 words, day by day. For each consecutive day they appear, they migrate out from the centre. Size indicates the rank in that day’s news.
Thanks to Andrew At Work for some helpful suggestions about how this might work.
Not guaranteed to work in your browser.
Tags: the ruminator | dhtml | fun
Sunday, May 18 2008
I want the Ruminator to have animated tag clouds that display trends over time, so I’ve been poking around to see what might be done with DHTML. I prefer doing things in DHTML where possible: it strikes me as being the virtuous simple option. It turns out that it’s going to be easy. (This post will probably not do anything if you read it in a feed reader.)
I move when you click me.
Click for frame 1.
Click for frame 2
Tags: the ruminator | dhtml | fun
Wednesday, April 02 2008
The Ruminator now knows about words and phrases. From now on, it can tell you about the “district court”, “helen clark”, and the “human rights commission”, instead of jumbling those words up into clark commission court district helen human rights. This took a bit of work, because even the best statistical analysis produces somewhat dodgy results. I have jiggered things so that I get a list of suggestions with each day’s run, but it’s just not safe to automate it.
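A minimal sketch of how a hand-approved phrase list might be applied to a stream of words, assuming a greedy longest-match-first rule. The function name `merge_phrases` and the matching rule are illustrative assumptions, not the Ruminator's actual code.

```python
def merge_phrases(tokens, phrases):
    """Join runs of tokens that match a known phrase into one item.

    `phrases` is a set of tuples of lower-case words (a hypothetical,
    hand-approved list). Longer phrases win over shorter ones.
    """
    longest = max((len(p) for p in phrases), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two words.
        for size in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in phrases:
                merged.append(" ".join(candidate))
                i += size
                break
        else:
            # No phrase starts here, so keep the single word.
            merged.append(tokens[i])
            i += 1
    return merged


phrases = {
    ("district", "court"),
    ("helen", "clark"),
    ("human", "rights", "commission"),
}
tokens = "helen clark told the human rights commission and the district court".split()
print(merge_phrases(tokens, phrases))
```

Keeping the phrase list as data rather than code fits the point above: the statistics only suggest candidates, and a person decides which ones go into the set.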
I have also written a bit more about how the whole thing works, and where I need help.
Tags: the ruminator
Sunday, March 16 2008
I have been ignoring my burgeoning cold today and working on the Ruminator, teaching it to identify phrases in text.
This has been unexpectedly easy, because I googled up a paper that lays out a technique which has proved very effective. I feel I should write the authors a thank you note.
Their technique is to identify pairs of words with a high mutual information statistic, and then to do a second pass through the corpus to try and find words to the left and right of the pair that might also be part of the phrase. They suggest only testing pairs where at least one word is capitalised.
Bugger me, but it works well.
Here’s a little chunk of output from my New Zealand news corpus:
The initial pair
[‘Coast’, ‘District’]
It appears 48 times
48
These are the words that appear to the left in the corpus.
{‘Union’: 1, ‘issued’: 1, ‘Workers’: 1, ‘executive’: 1, ‘soon’: 1, ‘Under’: 1, ‘chair’: 1, ‘announcement’: 1, ‘death’: 1, ‘workers’: 1, ‘winner’: 1, ‘troubled’: 2, ‘two-month-long’: 1, ‘over’: 2, ‘Both’: 1, ‘government’: 1, ‘assau’: 1, ‘birth’: 2, ‘not’: 1, ’50’: 1, ‘Clinic’: 1, ‘crisis-stricken’: 2, ‘says’: 1, ‘picket’: 1, ‘Organisation’: 1, ‘Disability’: 2, ‘Gisborne’: 1, ‘year’: 2, ‘laboratory’: 1, ‘embattled’: 2, ‘for’: 4, ‘has’: 2, ’11m’: 2, ‘state’: 1, ‘patient’: 1, ‘siege’: 1, ‘met’: 1, ‘address’: 1, ‘by’: 2, ‘on’: 1, ‘about’: 2, ‘her’: 2, ‘of’: 3, ‘products’: 1, ‘action’: 1, ‘footsteps’: 1, ‘raised’: 1, ‘industrial’: 1, ‘Cup’: 1, ‘into’: 1, ‘alleged’: 1, ‘suspended’: 1, ‘crisis’: 1, ‘impressed’: 1, ‘given’: 1, ‘from’: 1, ‘Monday’: 1, ‘hospital’: 1, ‘criticised’: 1, ‘next’: 2, ‘Hospital’: 1, ‘Wellington’: 1, ‘doctors’: 2, ‘line’: 1, ‘with’: 3, ‘Anaesthetists’: 1, ‘hat’: 1, ‘and’: 2, ‘do’: 1, ‘in’: 1, ‘at’: 7, ‘Capital’: 41, ‘Commissioner’: 2, ‘end’: 1, ‘Regional’: 1, ‘Lab’: 1, ‘concerns’: 1, ‘take’: 1, ‘Zealand’: 1, ‘Medical’: 1, ‘rain’: 1, ‘Melbourne’: 1, ‘The’: 9, ‘the’: 20, ‘a’: 3, ‘disbelief’: 1, ‘Wellingtons’: 5, ‘Another’: 1, ‘2008’: 1, ‘gardens’: 1}
These words appear on the right.
{‘and’: 2, ‘says’: 3, ‘over’: 1, ‘expects’: 1, ‘defended’: 1, ‘manager’: 1, ‘Health’: 48, ‘Board’: 43, ‘have’: 1, ‘in’: 3, ‘moves’: 1, ‘staff’: 1, ‘spokeswoman’: 1, ‘for’: 1, ‘remains’: 1, ‘admission’: 1, ‘amidst’: 1, ‘Cabinet’: 1, ‘to’: 4, ‘Opposition’: 1, ‘new’: 1, ‘has’: 4, ‘is’: 2, ‘A’: 2, ‘Neville’: 1, ‘Boards’: 3, ‘after’: 1, ‘but’: 1, ‘CCDHB’: 1, ‘hopes’: 1, ‘The’: 2, ‘about’: 1, ‘scheme’: 1, ‘taking’: 1, ‘compliance’: 1, ‘will’: 1, ‘chief’: 1, ‘maternity’: 1, ‘could’: 1}
This is a phrase.
[‘Capital’, ‘Coast’, ‘District’, ‘Health’, ‘Board’]
We started with “Coast District”, and looked at the frequency of words to the left and right, and presto, we get Capital Coast District Health Board.
Here’s another one:
[‘Australian’, ‘Prime’]
36
{‘help’: 1, ‘charities’: 1, ‘cruise’: 1, ‘hell’: 1, ‘its’: 2, ‘before’: 1, ’24’: 1, ‘informal’: 1, ‘ships’: 1, ‘to’: 3, ‘board’: 1, ‘Helen’: 2, ‘has’: 1, ‘upping’: 1, ‘Prime’: 2, ‘they’: 2, ‘not’: 1, ‘one’: 1, ‘Protests’: 1, ‘calling’: 1, ‘continue’: 1, ‘A’: 1, ‘Howard’: 2, ‘doing’: 1, ‘national’: 1, ‘Somalia’: 1, ‘Sydney’: 2, ‘year’: 1, ‘John’: 1, ‘said’: 1, ‘Environmentalists’: 1, ‘Darfur’: 1, ‘new’: 1, ‘announced’: 1, ‘be’: 1, ‘missing’: 1, ‘aboriginal’: 1, ‘takeover’: 1, ‘MPs’: 1, ‘on’: 5, ‘climate’: 1, ‘Clark’: 1, ‘of’: 4, ‘region’: 1, ‘times’: 1, ‘abuse’: 3, ‘airline’: 1, ‘tough’: 1, ‘angrily’: 1, ‘three’: 1, ‘poll’: 1, ‘Harawira’: 1, ‘given’: 1, ‘from’: 1, ‘would’: 1, ‘&’: 1, ‘Australias’: 1, ‘two’: 1, ‘attack’: 1, ‘way’: 1, ‘forward’: 1, ‘meeting’: 2, ‘gives’: 1, ‘a’: 2, ‘apologise’: 1, ‘labelled’: 1, ‘child’: 1, ‘he’: 2, ‘HIV-positive’: 1, ‘Saturdays’: 1, ‘this’: 3, ‘polls’: 2, ‘reacted’: 1, ‘will’: 1, ‘country’: 1, ‘urging’: 1, ‘are’: 3, ‘have’: 3, ‘Northern’: 3, ‘voters’: 1, ‘moved’: 1, ‘Expectations’: 1, ‘an’: 1, ‘as’: 1, ‘want’: 1, ‘in’: 8, ‘end’: 1, ‘ex-partner’: 1, ‘Minister’: 2, ‘outbreak’: 1, ‘you’: 1, ‘Zealand’: 1, ‘towards’: 1, ‘after’: 1, ‘plane’: 1, ‘mouth’: 1, ‘building’: 1, ‘later’: 2, ‘2005’: 1, ‘the’: 7}
{‘a’: 1, ‘Maori’: 2, ‘says’: 1, ‘Howard’: 27, ‘warned’: 1, ‘that’: 1, ‘visit’: 1, ‘Ministers’: 1, ‘brief’: 1, ‘to’: 1, ‘racist’: 1, ‘Minister’: 35, ‘Howards’: 1, ‘put’: 1, ‘Rudd’: 1, ‘John’: 28, ‘The’: 2, ‘Kevin’: 1, ‘he’: 1}
[‘Australian’, ‘Prime’, ‘Minister’, ‘John’, ‘Howard’]
I’m stoked. It just needs a little tuning, and I’ll have a collection of phrases I can use to make the Ruminator’s output a lot more meaningful.
Tags: python | the ruminator | natural language processing
Wednesday, January 30 2008
I’m trying to make The Ruminator a bit smarter.
Right now, it simply chomps text up into words by splitting it on whitespace and lower-casing it. This means that some things that really ought to be treated as one thing aren’t. I’ve hard-coded “New Zealand” but that approach is pretty stupid.
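For illustration, the split-and-lower-case approach amounts to something like this one-liner (the function name is hypothetical):

```python
def naive_tokens(text):
    """Lower-case the text and split it on runs of whitespace."""
    return text.lower().split()


print(naive_tokens("Wellington City Council met in New Zealand"))
```

Every word comes back as a separate item, so "new" and "zealand" are counted independently unless something special-cases them.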
So I’ve been looking into ways to do this better.
The thing to do seems to be to identify so-called “collocations”, which are sequences of words that are significant. “North Island”, “Wellington City Council”, “aggravated robbery” are examples of collocations that the Ruminator might see. The trick is in deciding on significance just through statistical analysis.
There is a bunch of computer science that deals with this problem already, and I’ve found some helpful references. The guts of the best solution seems to be to calculate the mutual information statistic. Which is to say, take the probability of words x and y appearing in your corpus in sequence, and divide that by the probability of x occurring times the probability of y occurring. Or:
P(“xylophonic yurt”) / (P(“xylophonic”) × P(“yurt”))
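A minimal sketch of that ratio over adjacent word pairs, using only the standard library. The function name and the toy sentence are made up for illustration. Note that the statistic usually called pointwise mutual information is the logarithm of this ratio; the ranking of pairs is the same either way.

```python
from collections import Counter


def pair_scores(tokens):
    """Score each adjacent pair (x, y) as P(x y) / (P(x) * P(y))."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = p_xy / (p_x * p_y)
    return scores


tokens = "the xylophonic yurt stood near the other xylophonic yurt by the lake".split()
scores = pair_scores(tokens)
print(scores[("xylophonic", "yurt")])  # 72/11, about 6.55
print(scores[("the", "xylophonic")])   # 24/11, about 2.18
```

"xylophonic yurt" scores about three times higher than "the xylophonic" because "the" turns up next to many other words, which drags its pairings down.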
Having done that and identified some collocations, we could repeat the exercise with the words appearing before and after, and see whether “white xylophonic yurt” and “xylophonic yurt zapper” are collocations too.
There’s a bunch of tweaking to do after that. What is the threshold for considering something significant? What about sequences that score high, but based on a very few appearances in your corpus?
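The low-count problem is easy to reproduce: a pair of words that each appear exactly once gets a very high score simply because both probabilities in the denominator are tiny. One possible tweak, sketched here with assumed cut-offs (`min_count` and `threshold` are illustrative values, not tuned ones), is to require a minimum number of appearances before scoring at all.

```python
from collections import Counter


def significant_pairs(tokens, min_count=2, threshold=5.0):
    """Keep adjacent pairs seen at least min_count times whose ratio
    P(x y) / (P(x) * P(y)) reaches the threshold."""
    words = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    kept = {}
    for (x, y), count in pairs.items():
        if count < min_count:
            continue  # too few sightings to trust the score
        ratio = (count / (n - 1)) / ((words[x] / n) * (words[y] / n))
        if ratio >= threshold:
            kept[(x, y)] = ratio
    return kept


tokens = "the xylophonic yurt stood near the other xylophonic yurt by the lake".split()
print(significant_pairs(tokens))               # only the repeated pair survives
print(significant_pairs(tokens, min_count=1))  # one-off pairs such as "stood near" flood in
```

With `min_count=1`, "stood near" scores 144/11 (about 13.1), twice the score of the repeated "xylophonic yurt", even though it was seen only once.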
And of course I need better tools for identifying “words” in the first place. Yay NLTK. I hope to use this to “stem” words so that minor variations in syntax don’t result in stories ending up in different places.
I have a big corpus of news items to play with. I’ve already discovered that reading in 100 MB of text at one go isn’t so smart… anyway, the results are interesting, but it’s going to take a while to fine tune.
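One way around loading the whole file, sketched on the assumption that the corpus is a plain text file read line by line (the function name and the `encoding` default are illustrative):

```python
from collections import Counter


def count_words(path, encoding="utf-8"):
    """Count lower-cased, whitespace-separated words without loading the whole file."""
    counts = Counter()
    with open(path, encoding=encoding) as handle:
        for line in handle:  # the file object yields one line at a time
            counts.update(line.lower().split())
    return counts
```

Memory use then grows with the size of the vocabulary rather than the size of the corpus. Counting pairs this way needs a little more care, since the last word of one line and the first word of the next also form a pair.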
Once I’ve done that, I’m going to see whether Bayesian techniques have anything to offer in sorting, tagging and labelling news items. I foresee pain there: someone has to train the sorter, and that could take a while.
Still, it’s enjoyable. Sometimes I regret not having had a full computer science education, and pursuing problems like this makes me feel as though I am somehow making up for it. And it’s just interesting.
Rendered at 2008-08-29 12:34:54