The Ruminator

How the Ruminator works

The Ruminator is a simple beast.

It grazes on RSS and Atom feeds. Right now it is only interested in feeds of New Zealand national news, but in theory it can digest anything that has a feed, including blogs or even sports news.
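As a rough illustration of grazing, here is a minimal Python sketch that pulls story titles out of a feed that has already been downloaded. The function name story_titles and the choice to read only titles are assumptions made for the example; the Ruminator's own code is not shown here and may work quite differently.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace; RSS 2.0 elements have none.
ATOM = "{http://www.w3.org/2005/Atom}"

def story_titles(feed_xml):
    """Return the story titles found in an RSS 2.0 or Atom document (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    if root.tag == ATOM + "feed":
        # Atom: each story is an <entry> with a <title> child.
        return [e.findtext(ATOM + "title", default="") for e in root.iter(ATOM + "entry")]
    # RSS: each story is an <item> with a <title> child.
    return [i.findtext("title", default="") for i in root.iter("item")]
```

A real grazer would also want each story's link and summary, and would have to cope with feeds that are not well-formed XML.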

Once it has obtained a bunch of feeds, it processes each story, recording how many times each word and phrase appears. “Word” and “phrase” are surprisingly slippery concepts, and the Ruminator has to take into account spaces, punctuation and English grammar. The Ruminator takes a crude, robust and quite possibly erroneous view of what counts as a word or phrase.

The Ruminator thinks a word is what’s left after replacing all the hyphens with spaces and removing all other punctuation. A phrase is anything that matches an entry on the Ruminator’s phrase list.
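The word rule can be sketched in a few lines of Python. The helper name tokenize is hypothetical, and two details are assumptions not stated above: that words are lowercased, and that curly quotes count as punctuation.

```python
import string

# Punctuation to delete: ASCII punctuation plus curly quotes (the latter is an assumption).
PUNCTUATION = string.punctuation + "‘’“”"

def tokenize(text):
    """Split text into words using the crude rule described above (hypothetical helper)."""
    # Hyphens become spaces, so "well-known" yields two words.
    text = text.replace("-", " ")
    # Every other punctuation character is simply removed.
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    # Lowercasing is an assumption; the description does not mention case.
    return text.lower().split()
```

For example, tokenize("Well-known, isn't it?") gives ['well', 'known', 'isnt', 'it'].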

Some words are so common or so meaningless on their own that the Ruminator ignores them. The list is getting longer and longer:

a about after an and another are as at be been before being but by despite during for from get got had has have he her him his i if in into is it its last more near newstalk not nzpa of off on or out over s said say says she since so t than that the their them then there they this those to told under up was were what when where which while who will with would zb
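Ignoring these words while counting might look something like the following Python sketch. The set below is only a short excerpt of the list above, and the name count_words is made up for the example.

```python
from collections import Counter

# A short excerpt of the ignore list above; a real run would use the whole list.
STOP_WORDS = {"a", "about", "and", "in", "is", "of", "said", "the", "to", "was"}

def count_words(words):
    """Count how often each word appears, skipping ignored words (hypothetical helper)."""
    return Counter(w for w in words if w not in STOP_WORDS)
```

For example, count_words(["the", "minister", "said", "the", "budget", "budget"]) counts "budget" twice and "minister" once, and drops "the" and "said".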

The phrase list is updated from time to time by applying some simple statistical techniques to the Ruminator’s ever-growing collection of news, and editing the results by hand.
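The statistical techniques are not spelled out here, so the following Python sketch shows just one simple possibility, offered as an assumption rather than a description of the Ruminator: count pairs of adjacent words across all stories and keep the pairs that turn up often enough. The name candidate_phrases and the min_count threshold are invented for the example.

```python
from collections import Counter

def candidate_phrases(stories, min_count=2):
    """Suggest two-word phrases that recur across stories (hypothetical helper).

    stories is a list of word lists, one list per story.
    """
    pairs = Counter()
    for words in stories:
        # zip pairs each word with the word that follows it.
        pairs.update(zip(words, words[1:]))
    # Keep pairs seen at least min_count times, in alphabetical order.
    return sorted(" ".join(pair) for pair, n in pairs.items() if n >= min_count)
```

The output is only a list of candidates; a human editor would still need to weed out the junk by hand.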

Finally, the Ruminator ranks the words and phrases it found, and generates story pages pointing back to the news stories where those words and phrases were found.
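The final step could be sketched in Python as below. Ranking by raw count is an assumption, since the ranking method is not described, and the name rank_terms and the example.com URLs used to try it out are placeholders.

```python
from collections import Counter, defaultdict

def rank_terms(stories):
    """Rank terms by total count and list the stories containing each (hypothetical helper).

    stories maps a story URL to the list of words and phrases found in it.
    """
    totals = Counter()
    found_in = defaultdict(set)
    for url, terms in stories.items():
        totals.update(terms)
        for term in terms:
            found_in[term].add(url)
    # Most frequent first; ties are broken alphabetically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(term, count, sorted(found_in[term])) for term, count in ranked]
```

Each entry in the result holds a term, its count, and the stories a page for that term would point back to.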

Rendered at 2008-08-29 12:44:39