The text mining pipeline takes plain text as input. An example of such a document (The Coconut Planter, Ferguson 1923) can be found here. The document text is processed using a series of linguistic preprocessing steps followed by named entity recognition, grounding and relation extraction. The output is stored in XML; the output for the above example can be found here.
An output file contains a document element with three sections (elements): meta, text and standoff. If available, we make use of any existing metadata and add it as additional information to the meta section of the output file. In the text section we store the text of the document, along with some of the linguistic processing output marked up as nested XML elements or their attributes.

In the standoff section we store the entities (ents) and relations (relations) found in the text. Each entity (ent) links back to its start and end offsets in the text and is accompanied by the snippet it occurs in. We recognise commodity, location and date entities. Each entity mention is grounded depending on its type: locations are grounded to GeoNames identifiers and their latitude/longitude coordinates, commodities to DBpedia concepts and categories, and dates to year, month, date and day attributes.
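To illustrate, the following Python sketch shows how the standoff entities of an output file could be read with xml.etree.ElementTree. The element names (document, standoff, ents, ent) follow the description above, but the file name and the attribute names used for offsets and grounding (id, type, start, end, gazref, lat, long, dbpedia, year, month, date, day) are assumptions made for illustration and may differ from the actual output format.

    import xml.etree.ElementTree as ET

    tree = ET.parse("coconut_planter.xml")   # hypothetical file name
    root = tree.getroot()                    # the <document> element

    for ent in root.find("standoff").find("ents").findall("ent"):
        # Core entity information: id, type and character offsets into the text section
        print(ent.get("id"), ent.get("type"), ent.get("start"), ent.get("end"))

        # Grounding depends on the entity type (attribute names are assumed)
        if ent.get("type") == "location":
            print("  GeoNames:", ent.get("gazref"), ent.get("lat"), ent.get("long"))
        elif ent.get("type") == "commodity":
            print("  DBpedia:", ent.get("dbpedia"))
        elif ent.get("type") == "date":
            print("  Date:", ent.get("year"), ent.get("month"),
                  ent.get("date"), ent.get("day"))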
Each commodity-location relation is made up of two arguments (a location and a commodity entity), each of which links back to its respective entity via the entity's id.
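Continuing the sketch above (where root is the parsed document element), the relations could be resolved back to their entities along these lines; the attribute holding the entity id ("ref" below) is again an assumed name.

    # Index the entities by id so relation arguments can be resolved back to them
    ents_by_id = {e.get("id"): e
                  for e in root.find("standoff").find("ents").findall("ent")}

    for relation in root.find("standoff").find("relations"):
        # Each relation carries two arguments pointing back to entity ids;
        # the "ref" attribute name is an assumption
        args = [ents_by_id[arg.get("ref")] for arg in relation]
        for ent in args:
            print(ent.get("type"), ent.get("id"), sep="\t")
        print("---")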
The XML output is then ingested into the Trading Consequences database. The information in the database can be accessed through a series of web visualisations.
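The schema of the Trading Consequences database is not described here, so the following sketch only illustrates the kind of ingestion step meant, using sqlite3 with made-up table and column names; it reuses the root element parsed in the earlier sketch.

    import sqlite3

    conn = sqlite3.connect("trading_consequences.db")   # made-up database name
    conn.execute("""CREATE TABLE IF NOT EXISTS mentions (
                        doc_id TEXT, ent_id TEXT, ent_type TEXT,
                        grounding TEXT, snippet TEXT)""")

    for ent in root.find("standoff").find("ents").findall("ent"):
        conn.execute(
            "INSERT INTO mentions VALUES (?, ?, ?, ?, ?)",
            ("coconut_planter",                          # made-up document id
             ent.get("id"),
             ent.get("type"),
             ent.get("gazref") or ent.get("dbpedia"),    # whichever grounding applies
             ent.findtext("snippet")))                   # snippet as a child element (assumed)
    conn.commit()
    conn.close()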