2008-03-16 20:32:03
I have been ignoring my burgeoning cold today and working on the Ruminator, teaching it to identify phrases in text.
This has been unexpectedly easy, because I googled up a paper that lays out a technique which has proved very effective. I feel I should write the authors a thank-you note.
Their technique is to identify pairs of adjacent words with a high mutual information statistic, and then to make a second pass through the corpus to find words to the left and right of the pair that might also be part of the phrase. They suggest only testing pairs where at least one word is capitalised.
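For the curious, the first pass might look something like the sketch below. It is not the Ruminator's code, and it is not taken from the paper: I am assuming the statistic is plain pointwise mutual information over adjacent words, and the names `candidate_pairs` and `min_count` are made up for illustration.

```python
import math
from collections import Counter


def candidate_pairs(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Assumption: PMI = log2(P(pair) / (P(left) * P(right))). The paper's
    exact statistic may differ.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)

    scores = {}
    for (left, right), count in pair_counts.items():
        # Very rare pairs give unreliable scores, so skip them.
        if count < min_count:
            continue
        # Only test pairs where at least one word is capitalised.
        if not (left[:1].isupper() or right[:1].isupper()):
            continue
        p_pair = count / n_pairs
        p_left = word_counts[left] / n_words
        p_right = word_counts[right] / n_words
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores
```

Sorting the result by score, highest first, would give the list of initial pairs to try extending in the second pass.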
Bugger me, but it works well.
Here’s a little chunk of output from my New Zealand news corpus:
The initial pair
[‘Coast’, ‘District’]
It appears 48 times
48
These are the words that appear to the left in the corpus.
{‘Union’: 1, ‘issued’: 1, ‘Workers’: 1, ‘executive’: 1, ‘soon’: 1, ‘Under’: 1, ‘chair’: 1, ‘announcement’: 1, ‘death’: 1, ‘workers’: 1, ‘winner’: 1, ‘troubled’: 2, ‘two-month-long’: 1, ‘over’: 2, ‘Both’: 1, ‘government’: 1, ‘assau’: 1, ‘birth’: 2, ‘not’: 1, ’50’: 1, ‘Clinic’: 1, ‘crisis-stricken’: 2, ‘says’: 1, ‘picket’: 1, ‘Organisation’: 1, ‘Disability’: 2, ‘Gisborne’: 1, ‘year’: 2, ‘laboratory’: 1, ‘embattled’: 2, ‘for’: 4, ‘has’: 2, ’11m’: 2, ‘state’: 1, ‘patient’: 1, ‘siege’: 1, ‘met’: 1, ‘address’: 1, ‘by’: 2, ‘on’: 1, ‘about’: 2, ‘her’: 2, ‘of’: 3, ‘products’: 1, ‘action’: 1, ‘footsteps’: 1, ‘raised’: 1, ‘industrial’: 1, ‘Cup’: 1, ‘into’: 1, ‘alleged’: 1, ‘suspended’: 1, ‘crisis’: 1, ‘impressed’: 1, ‘given’: 1, ‘from’: 1, ‘Monday’: 1, ‘hospital’: 1, ‘criticised’: 1, ‘next’: 2, ‘Hospital’: 1, ‘Wellington’: 1, ‘doctors’: 2, ‘line’: 1, ‘with’: 3, ‘Anaesthetists’: 1, ‘hat’: 1, ‘and’: 2, ‘do’: 1, ‘in’: 1, ‘at’: 7, ‘Capital’: 41, ‘Commissioner’: 2, ‘end’: 1, ‘Regional’: 1, ‘Lab’: 1, ‘concerns’: 1, ‘take’: 1, ‘Zealand’: 1, ‘Medical’: 1, ‘rain’: 1, ‘Melbourne’: 1, ‘The’: 9, ‘the’: 20, ‘a’: 3, ‘disbelief’: 1, ‘Wellingtons’: 5, ‘Another’: 1, ‘2008’: 1, ‘gardens’: 1}
These words appear on the right.
{‘and’: 2, ‘says’: 3, ‘over’: 1, ‘expects’: 1, ‘defended’: 1, ‘manager’: 1, ‘Health’: 48, ‘Board’: 43, ‘have’: 1, ‘in’: 3, ‘moves’: 1, ‘staff’: 1, ‘spokeswoman’: 1, ‘for’: 1, ‘remains’: 1, ‘admission’: 1, ‘amidst’: 1, ‘Cabinet’: 1, ‘to’: 4, ‘Opposition’: 1, ‘new’: 1, ‘has’: 4, ‘is’: 2, ‘A’: 2, ‘Neville’: 1, ‘Boards’: 3, ‘after’: 1, ‘but’: 1, ‘CCDHB’: 1, ‘hopes’: 1, ‘The’: 2, ‘about’: 1, ‘scheme’: 1, ‘taking’: 1, ‘compliance’: 1, ‘will’: 1, ‘chief’: 1, ‘maternity’: 1, ‘could’: 1}
This is a phrase.
[‘Capital’, ‘Coast’, ‘District’, ‘Health’, ‘Board’]
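We started with “Coast District” and counted the words that appear to its left and right. The ones that turn up nearly as often as the pair itself stand out (‘Capital’ 41 times, ‘Health’ 48 and ‘Board’ 43, against 48 for the pair), and presto, we get Capital Coast District Health Board.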
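The extension step can be sketched roughly as follows. This is a simplified stand-in rather than what the Ruminator actually does: the output above looks like it counts words in a window of several positions on each side, whereas this version only checks the single word touching the phrase and grows one word at a time. The function names, the `threshold` of 0.7 and the `max_len` cap are all my own assumptions.

```python
from collections import Counter


def find_occurrences(tokens, phrase):
    """Return the start index of every occurrence of phrase in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def extend_phrase(tokens, pair, threshold=0.7, max_len=8):
    """Grow a pair into a longer phrase, one neighbouring word at a time.

    A neighbour is absorbed when it sits next to the phrase in at least
    `threshold` of the phrase's occurrences (a hypothetical cut-off).
    """
    phrase = list(pair)
    grew = True
    while grew and len(phrase) < max_len:
        grew = False
        starts = find_occurrences(tokens, phrase)
        if not starts:
            break
        end = len(phrase)
        # Count the word immediately before and after each occurrence.
        left = Counter(tokens[i - 1] for i in starts if i > 0)
        right = Counter(tokens[i + end] for i in starts if i + end < len(tokens))
        for counts, side in ((left, "left"), (right, "right")):
            if not counts:
                continue
            word, count = counts.most_common(1)[0]
            if count / len(starts) >= threshold:
                phrase = [word] + phrase if side == "left" else phrase + [word]
                grew = True
                break  # recount neighbours for the longer phrase
    return phrase
```

Recounting after each absorbed word matters: once ‘Capital’ has joined the phrase, stray matches on a different “Coast District” no longer dilute the counts for the words on the right.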
Here’s another one:
[‘Australian’, ‘Prime’]
36
{‘help’: 1, ‘charities’: 1, ‘cruise’: 1, ‘hell’: 1, ‘its’: 2, ‘before’: 1, ’24’: 1, ‘informal’: 1, ‘ships’: 1, ‘to’: 3, ‘board’: 1, ‘Helen’: 2, ‘has’: 1, ‘upping’: 1, ‘Prime’: 2, ‘they’: 2, ‘not’: 1, ‘one’: 1, ‘Protests’: 1, ‘calling’: 1, ‘continue’: 1, ‘A’: 1, ‘Howard’: 2, ‘doing’: 1, ‘national’: 1, ‘Somalia’: 1, ‘Sydney’: 2, ‘year’: 1, ‘John’: 1, ‘said’: 1, ‘Environmentalists’: 1, ‘Darfur’: 1, ‘new’: 1, ‘announced’: 1, ‘be’: 1, ‘missing’: 1, ‘aboriginal’: 1, ‘takeover’: 1, ‘MPs’: 1, ‘on’: 5, ‘climate’: 1, ‘Clark’: 1, ‘of’: 4, ‘region’: 1, ‘times’: 1, ‘abuse’: 3, ‘airline’: 1, ‘tough’: 1, ‘angrily’: 1, ‘three’: 1, ‘poll’: 1, ‘Harawira’: 1, ‘given’: 1, ‘from’: 1, ‘would’: 1, ‘&’: 1, ‘Australias’: 1, ‘two’: 1, ‘attack’: 1, ‘way’: 1, ‘forward’: 1, ‘meeting’: 2, ‘gives’: 1, ‘a’: 2, ‘apologise’: 1, ‘labelled’: 1, ‘child’: 1, ‘he’: 2, ‘HIV-positive’: 1, ‘Saturdays’: 1, ‘this’: 3, ‘polls’: 2, ‘reacted’: 1, ‘will’: 1, ‘country’: 1, ‘urging’: 1, ‘are’: 3, ‘have’: 3, ‘Northern’: 3, ‘voters’: 1, ‘moved’: 1, ‘Expectations’: 1, ‘an’: 1, ‘as’: 1, ‘want’: 1, ‘in’: 8, ‘end’: 1, ‘ex-partner’: 1, ‘Minister’: 2, ‘outbreak’: 1, ‘you’: 1, ‘Zealand’: 1, ‘towards’: 1, ‘after’: 1, ‘plane’: 1, ‘mouth’: 1, ‘building’: 1, ‘later’: 2, ‘2005’: 1, ‘the’: 7}
{‘a’: 1, ‘Maori’: 2, ‘says’: 1, ‘Howard’: 27, ‘warned’: 1, ‘that’: 1, ‘visit’: 1, ‘Ministers’: 1, ‘brief’: 1, ‘to’: 1, ‘racist’: 1, ‘Minister’: 35, ‘Howards’: 1, ‘put’: 1, ‘Rudd’: 1, ‘John’: 28, ‘The’: 2, ‘Kevin’: 1, ‘he’: 1}
[‘Australian’, ‘Prime’, ‘Minister’, ‘John’, ‘Howard’]
I’m stoked. It just needs a little tuning, and I’ll have a collection of phrases I can use to make the Ruminator’s output a lot more meaningful.
Tags: python | the ruminator | natural language processing
Rendered at 2008-12-04 15:02:27