News Clouds at The Daily Anvil

Posted by Danny on Saturday, August 22, 2009 at 2:17 am

A Wordle Cloud of text from The Toronto Sun by Byron at blog.thedailyanvil.com.

Because Byron’s the cool sort of cat what wonders about things and then fiddles with them until he has answers, he’s been playing with Wordle’s graphic representation of word frequency to see if he can capture and distinguish the flavour of diverse news publications by feeding in a week’s worth of copy. It’s a neat smudging of quantification and art. The data could be presented with more rigour and reliability if he’d used a more basic frequency calculator and set rules about what text gets included (advertisements, classifieds, etc., which might be irregularly available or only partially OCR’d across the publications he surveyed). The benefit of this approach, though, is that the presentation essentially distils the visceral impact without wholly discarding the medium: you’re still sort of looking at the emotional impact of the newspaper, just without all the syntax and filler and communication that honestly we probably can’t afford in this economic climate anyway. His write-up and gallery are available here. I really, really like the decision to match the colour schemes of the Wordle clouds to those of the papers.

Over brunch a few days ago, we talked about expanding the scale and rigour of this project so that the newspapers would be monitored over a period of perhaps a year.  The benefit would be a normalisation of the subject matter reported—relatively isolated events that snag media attention briefly but totally, like Michael Jackson’s death, say, would tend toward more appropriate representation in the cloud.  Based on Wordle’s current customisation options, here’re my suggestions for such an enterprise:

  • Collect a snapshot of each publication’s content from a uniform position (like the front page, or from each article featured on the front page) at strictly regular intervals, as close to simultaneously as is practical.
  • Aim to use a capture technique that extracts information from images as well as text.  For example, generating a PDF of the webpages and then running the same OCR software over them would tend to liberate information from static image advertisements.
  • Begin the project only after deciding upon a universal “ignore list” of common words to filter out interface noise (if we’re not interested in that data).  Looking at Byron’s gallery, some words I might filter would be “news,” “am/pm,” “video,” “home,” “articles,” etc.  Possibly include the names of the newspapers themselves in the list.
  • Generate the final Wordle clouds according to the same rules and with the same settings (except colour scheme). This way, tags of a particular size could be more reliably correlated with their frequency across publications.
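
For anyone who wants the numbers alongside the pictures, here is a minimal Python sketch of the "more basic frequency calculator" idea with a shared ignore list. It is not what Byron or Wordle actually does. The function names, the tokenising rule and the exact contents of IGNORE are my own assumptions, with the ignore words taken from the suggestions above. It reports each word's share of the remaining text rather than its raw count, so papers of different lengths can be compared on the same scale.

```python
import re
from collections import Counter

# Hypothetical universal ignore list, seeded from the words suggested above.
# Newspaper names could be added here too.
IGNORE = {"news", "am", "pm", "video", "home", "articles"}

# Assumed tokenising rule: runs of ASCII letters, with an optional
# straight-apostrophe suffix (so "don't" stays one word).
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def word_frequencies(text, ignore=IGNORE):
    """Return each word's share of the text after the ignore list is applied."""
    words = [w for w in WORD_RE.findall(text.lower()) if w not in ignore]
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    # Relative frequency, so a long paper and a short one are comparable.
    return {word: count / total for word, count in counts.items()}


def top_words(text, n=5, ignore=IGNORE):
    """Return the n most frequent words, ties broken alphabetically."""
    freqs = word_frequencies(text, ignore)
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Running every paper's snapshot through the same word_frequencies call, with the same ignore list, is the tabular equivalent of generating every cloud with the same settings. Note that this sketch does not drop ordinary filler words like "the" or "of"; those would need to go on the ignore list as well if the output is meant to resemble a cloud.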

If anyone’s interested in working with us on this (and assuming that Byron’s still gung-ho after the one-week trial run), it might also be fun to collect a bunch of hypotheses from you folks about what sort of trends you expect to see in terms of register, emphasis, reading-level, etc.  This isn’t science, but it could still be a fun experiment.