Invited talk on Digital History and Big Data

Last week I was invited to give a talk about Trading Consequences at the Digital Scholarship: day of ideas event 2 organised by Dr. Siân Bayne.  If you are interested in my slides, you can look at them here on Slideshare.

Rather than give a summary talk about all the different things going on in the Edinburgh Language Technology Group at the School of Informatics, we decided that it would be more informative to focus on one specific project and provide a bit more detail without getting too technical.  My aim was to raise our profile with attendees from the humanities and social sciences in Edinburgh and further afield who are interested in digital humanities research.  They made up the majority of the audience, so this talk was a great opportunity.

My presentation on Trading Consequences at the Digital Scholarship workshop (photo taken by Ewan Klein).

Most of my previous presentations were directed at people in my field, that is, experts in text mining and information extraction.  This talk therefore had to be completely different from how I would normally present my work, which is to provide detailed information on methods and algorithms, their scientific evaluation etc.  None of the attendees would be interested in such things, but I wanted them to know what sort of things our technology is capable of and at the same time let them understand some of the challenges we face.

I decided to focus the talk on the user-centric approach to our collaboration in Trading Consequences, explaining that our current users and collaborators (Prof. Colin Coates and Dr. Jim Clifford, environmental historians at York University, Toronto) and their research questions are key in all that we design and develop.  Their comments and error analysis feed directly back into the technology, allowing us to improve the text mining and visualisation with every iteration.  The other point I wanted to get across is that transparency about the quality of the text mining is crucial to our users, who want to know to what level they can trust the technology.  Moreover, the output of our text mining tool in its raw XML format is not something that most historians would be able to understand and query easily.  However, when text mining is combined with interesting types of visualisations, the data mined from all the historical document collections comes alive.

We are currently processing digitised versions of over 10 million scanned document images from 5 different collections, amounting to several hundred gigabytes worth of information.  This is not big data in the computer science sense, where people talk about terabytes or petabytes.  However, it is big data to historians, who in the best case have access to some of these collections online using keyword search but often have to visit libraries and archives and go through them manually.  Even if a collection is available digitally and indexed, it does not mean that all the information relevant to a search term is easily accessible to users.  In a large proportion of our data, the optical character recognised (OCRed) text contains a lot of errors and, unless corrected, those errors then find their way into the index.  This means that a search for a correctly spelled term will not return matches in sources where that term appears with one or more OCR errors.
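To make the indexing problem concrete, here is a minimal sketch in Python.  It is purely illustrative and not how our text mining tool or any of the collection indexes actually work: the sample sentences, the function names and the similarity threshold of 0.8 are all made up for this example.  It compares an exact keyword search with a tolerant one based on the standard library's difflib.

```python
from difflib import SequenceMatcher


def exact_search(term, documents):
    """Return the indices of documents containing the term as an exact token."""
    term = term.lower()
    return [i for i, doc in enumerate(documents) if term in doc.lower().split()]


def fuzzy_search(term, documents, threshold=0.8):
    """Return the indices of documents containing a token similar to the term.

    The threshold is an arbitrary value chosen for this illustration.
    """
    term = term.lower()
    hits = []
    for i, doc in enumerate(documents):
        for token in doc.lower().split():
            # ratio() is 1.0 for identical strings and drops as they diverge
            if SequenceMatcher(None, term, token).ratio() >= threshold:
                hits.append(i)
                break  # one similar token is enough for this document
    return hits


# Hypothetical sample texts, not taken from the real collections
documents = [
    "two casks of brandy were sent",  # clean text
    "two casks of hrandy were sent",  # simulated OCR error: b read as h
    "a cargo of timber arrived",      # unrelated text
]

print(exact_search("brandy", documents))  # [0]
print(fuzzy_search("brandy", documents))  # [0, 1]
```

The exact search misses the second document because the misrecognised word went into the index as it was.  The tolerant search finds it, but comparing every token like this would be far too slow for millions of pages and can produce false matches, which is one reason why correcting the text before indexing matters.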

The low text quality in large parts of our text collections is also one of our main challenges when it comes to mining this data.  So, I summarised the types of text correction and normalisation steps we carry out in order to improve the input for our text mining component.  However, there are cases when even we give up, that is when the text quality is so low that it is impossible even for a human being to read a document.  I showed a real example of one of the documents in the collections, the textual equivalent of an upside-down image which was OCRed the wrong way round.

At the end, I got the sense that my talk was well received.  I got several interesting questions, including one asking whether we see that our users’ research questions are now shaped by the technology when the initial idea was for the technology to be driven by their research.  I also made some connections with people in literature, so there could be some exciting new collaborations on the horizon.  Overall, the workshop was extremely interesting and very well organised and I’m glad that I had the opportunity to present our work.



Lethal Brandy for Christmas

by Bea Alex

We are making good progress with our text mining work in Trading Consequences and are able to identify related commodities and geo-referenced locations in our collections. We are working on creating different visualisations for the mentions of commodities in proximity to locations, including dates and global frequencies, to enable historians to get an overview of when things were traded and where in the world. Equally, they will be able to drill down to individual documents, see which documents are most relevant for a given commodity and study the mentions of commodities in context.

For example, at Christmas of 1746 brandy is mentioned in one of the documents in Early Canadiana Online in relation to York Factory on the southwestern shore of Hudson Bay in northeastern Manitoba, Canada (Willson, 1899). In May of that year, two ships of the Hudson’s Bay Company had sailed from England to Hudson Bay aiming to discover a northwest passage to India. In September, they stopped not far from York Factory to stay there for the winter. They erected a log cabin as a shelter and called it Montague House (see a drawing of Montague House here).

The relationship between the local Governor Norton and the explorers was far from cordial. At Christmas, Norton sent them a couple of casks of brandy as a present to celebrate.  Soon afterwards, scurvy broke out amongst the explorers and several of them died. The disease was blamed on the brandy, and Governor Norton was alleged to have refused to give assistance or suggest a remedy to cure the diseased. He had also prevented any Indians from approaching the explorers or providing them with any supplies. The explorers resumed their voyage in the spring of the following year but eventually gave up their mission and returned to England without having discovered a northwest passage. And the moral of the story is, be careful if someone offers you a lot of brandy for Christmas. ;-)