Last week I was invited to give a talk about Trading Consequences at the Digital Scholarship: day of ideas event 2, organised by Dr. Siân Bayne. If you are interested in my slides, you can look at them here on Slideshare.
Rather than give a summary talk about all the different things going on in the Edinburgh Language Technology Group at the School of Informatics, we decided that it would be more informative to focus on one specific project and provide a bit more detail without getting too technical. My aim was to raise our profile with attendees from the humanities and social sciences, in Edinburgh and further afield, who are interested in digital humanities research. They made up the majority of the audience, so this talk was a great opportunity.
Most of my previous presentations were directed at people in my own field, that is, experts in text mining and information extraction. This talk would therefore have to be completely different from how I normally present my work, which is to provide detailed information on methods and algorithms, their scientific evaluation and so on. None of the attendees would be interested in such things, but I wanted them to know what sort of things our technology is capable of and, at the same time, to understand some of the challenges we face.
I decided to focus the talk on the user-centric approach to our collaboration in Trading Consequences, explaining that our current users and collaborators (Prof. Colin Coates and Dr. Jim Clifford, environmental historians at York University, Toronto) and their research questions are key to everything we design and develop. Their comments and error analysis feed directly back into the technology, allowing us to improve the text mining and visualisation with every iteration. The other point I wanted to get across is that transparency about the quality of the text mining is crucial to our users, who want to know how far they can trust the technology. Moreover, the output of our text mining tool in its raw XML format is not something that most historians would be able to understand and query easily. However, when text mining is combined with interesting types of visualisation, the data mined from all the historical document collections comes alive.
We are currently processing digitised versions of over 10 million scanned document images from 5 different collections, amounting to several hundred gigabytes of information. This is not big data in the computer science sense, where people talk about terabytes or petabytes. However, it is big data to historians, who in the best case have access to some of these collections online using keyword search but often have to visit libraries and archives and go through them manually. Even if a collection is available digitally and indexed, it does not mean that all the information relevant to a search term is easily accessible to users. In a large proportion of our data, the optically character recognised (OCRed) text contains a lot of errors and, unless corrected, those errors find their way into the index. This means that a search for a correctly spelled term will not return matches in sources where that term appears with one or more OCR errors in it.
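To make the point about OCR errors and keyword search concrete, here is a minimal Python sketch. It is not the Trading Consequences pipeline or its index: the sample sentences are invented, the function names `exact_search` and `fuzzy_search` are hypothetical, and the similarity threshold of 0.8 is an arbitrary assumption. It only shows how an exact lookup misses a misrecognised word that an approximate comparison can still find.

```python
from difflib import SequenceMatcher


def exact_search(term, documents):
    """Return indices of documents that contain the term as an exact token."""
    term = term.lower()
    return [i for i, doc in enumerate(documents) if term in doc.lower().split()]


def fuzzy_search(term, documents, threshold=0.8):
    """Return indices of documents with at least one token similar to the term.

    Similarity is difflib's ratio in [0, 1]; the threshold is an assumed value.
    """
    term = term.lower()
    hits = []
    for i, doc in enumerate(documents):
        for token in doc.lower().split():
            if SequenceMatcher(None, term, token).ratio() >= threshold:
                hits.append(i)
                break  # one similar token is enough for this document
    return hits


# Invented sample texts: the second simulates OCR confusing "n" with "u".
documents = [
    "shipments of cinchona bark arrived in London",
    "shipments of cinchoua bark arrived in Londou",
    "timber exports from Canada",
]

print(exact_search("cinchona", documents))
print(fuzzy_search("cinchona", documents))
```

The exact search finds only the cleanly recognised sentence, while the fuzzy search also returns the one containing "cinchoua". A looser threshold would recover more damaged words, but at the cost of more false matches.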
The low text quality in large parts of our text collections is also one of our main challenges when it comes to mining this data. So I summarised the types of text correction and normalisation steps we carry out in order to improve the input for our text mining component. However, there are cases when even we give up, namely when the text quality is so low that it is impossible even for a human being to read a document. I showed a real example of one such document in the collections: the textual equivalent of an upside-down image which was OCRed the wrong way round.
At the end, I got the sense that my talk was well received. I got several interesting questions, including one asking whether we see that our users’ research questions are now shaped by the technology, when the initial idea was for the technology to be driven by their research. I also made some connections with people in literature, so there could be some exciting new collaborations on the horizon. Overall, the workshop was extremely interesting and very well organised, and I’m glad that I had the opportunity to present our work.