One of the core tasks in Trading Consequences is being able to identify words in digitised texts which refer to commodities (as well as words which refer to places). Here’s a snippet of the kind of text we might be trying to analyse:
How do we know that gutta-percha in this text is a commodity name but, say, electricity is not? The simplest approach, and the one that we are adopting, is to use a big list of terms that we think could be names of commodities, and check against this list when we process our input texts. If we find gutta-percha in both our list of commodity terms and in the document that is being processed, then we add an annotation to the document that labels gutta-percha as a commodity name.
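As a rough illustration of the list-lookup idea, here is a minimal Python sketch. It is not the project's actual code: the term list, the sample sentence, and the function name `annotate_commodities` are all made up for this example, and a real system would work on tokenised documents rather than a raw string.

```python
import re

# Hypothetical, tiny stand-in for the real list of commodity terms.
COMMODITY_TERMS = {"gutta-percha", "coal", "tin", "cod-liver oil"}


def annotate_commodities(text, terms=COMMODITY_TERMS):
    """Return (start, end, matched_text) for every listed term found in text."""
    if not terms:
        return []
    # Try longer terms first so a multi-word term wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    # The lookarounds stop "tin" from matching inside "tinker" or "tin-plate".
    pattern = re.compile(
        r"(?<![\w-])(?:" + "|".join(re.escape(t) for t in ordered) + r")(?![\w-])",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


# Invented sample sentence, not taken from the corpus.
sample = "Gutta-percha is used to insulate cables that carry electricity."
annotations = annotate_commodities(sample)
```

Here `annotations` holds one entry, for Gutta-percha, while electricity is ignored simply because it is not in the list. Nothing about the surrounding words is consulted, so the quality of the output depends entirely on the quality of the list.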
In our first version of the text mining system, we derived the list of commodity terms from WordNet. WordNet is a big thesaurus or lexical database, and its terms are organised hierarchically. This means that as a first approximation, we can guess that any lexical item in WordNet that is categorised as a subclass of Physical Matter, Plant Life, or Animal might be a commodity term. How well do we do with this? Not surprisingly, when we carried out some initial experiments at the very start of our work on the project, we found that there are some winners and some losers. Here are some of the terms that were plausibly labelled as commodities in a sample corpus of digitised text:
horse, tin, coal, seedlings, grains, crab, merino fleece, fur, cod-liver oil, ice, log, potatoes, liquor, lemons.
And here are some less plausible candidate commodity terms:
weevil, water frontage, vomit, vienna dejeuner, verde-antique, vapours, toucans, steam frigates, smut, simple question, silver oics.
There are a number of factors that conspire to give the incorrect results. The first is that our list of terms is just too broad, and includes things that could never be commodities. The second is that for now, we are not taking into account the context in which words occur in the text; doing so is computationally quite expensive, and not an immediate priority. The third is that the input to our text mining tools is not nice clean text such as we would get from ‘born-digital’ newswire. Instead, nineteenth-century books have been scanned and then turned into text by the process of Optical Character Recognition (OCR for short). As we’ll describe in future posts, OCR can sometimes produce bizarrely bad results, and this is probably responsible for our silver oics.
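To make the first of these factors concrete, here is a small Python sketch of deriving a term list from a hierarchy. It uses a hand-written toy hierarchy rather than WordNet itself: every entry in `HYPERNYM` is illustrative, the names `HYPERNYM`, `ROOTS` and `is_candidate` are invented for this example, and each term is given a single parent for simplicity, whereas a WordNet entry can have several.

```python
# Toy "is a kind of" links standing in for WordNet; all entries are illustrative.
HYPERNYM = {
    "coal": "fossil fuel",
    "fossil fuel": "physical matter",
    "tin": "metal",
    "metal": "physical matter",
    "potato": "vegetable",
    "vegetable": "plant life",
    "horse": "animal",
    "weevil": "beetle",
    "beetle": "animal",
    "electricity": "phenomenon",
}

# The three top-level categories named in the text.
ROOTS = {"physical matter", "plant life", "animal"}


def is_candidate(term, hypernym=HYPERNYM, roots=ROOTS):
    """True if walking up the hierarchy from term reaches one of the roots."""
    seen = set()  # guards against accidental cycles in the toy data
    current = hypernym.get(term)
    while current is not None and current not in seen:
        if current in roots:
            return True
        seen.add(current)
        current = hypernym.get(current)
    return False


candidates = sorted(t for t in HYPERNYM if is_candidate(t))
```

In this toy data, electricity is correctly left out, but weevil gets in for exactly the same reason that horse does: both sit somewhere below Animal. Being a kind of animal, plant or physical matter is a much weaker condition than being something that is traded.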
At the moment, we are working on generating a better list of commodity terms (as mentioned in a recent post by Jim Clifford). We’ll report back on progress soon.