Invited talk on Digital History and Big Data

Last week I was invited to give a talk about Trading Consequences at the Digital Scholarship: Day of Ideas 2 event organised by Dr. Siân Bayne.  If you are interested in my slides, you can look at them here on Slideshare.

Rather than give a summary talk about all the different things going on in the Edinburgh Language Technology Group at the School of Informatics, we decided that it would be more informative to focus on one specific project and provide a bit more detail without getting too technical.  My aim was to raise our profile with attendees from the humanities and social sciences in Edinburgh and further afield who are interested in digital humanities research.  They made up the majority of the audience, so this talk was a great opportunity.

My presentation on Trading Consequences at the Digital Scholarship workshop (photo taken by Ewan Klein).

Most of my previous presentations were directed at people in my own field, that is experts in text mining and information extraction.  This talk therefore had to be completely different from how I would normally present my work, which is to provide detailed information on methods and algorithms, their scientific evaluation and so on.  Few of the attendees would be interested in that level of detail, but I wanted them to know what sort of things our technology is capable of and, at the same time, to understand some of the challenges we face.

I decided to focus the talk on the user-centric approach to our collaboration in Trading Consequences, explaining that our current users and collaborators (Prof. Colin Coates and Dr. Jim Clifford, environmental historians at York University, Toronto) and their research questions are key to everything we design and develop.  Their comments and error analysis feed directly back into the technology, allowing us to improve the text mining and visualisation with every iteration.  The other point I wanted to get across is that transparency about the quality of the text mining is crucial to our users, who want to know to what extent they can trust the technology.  Moreover, the output of our text mining tool in its raw XML format is not something that most historians would be able to understand and query easily.  However, when text mining is combined with interesting types of visualisations, the data mined from all the historical document collections comes alive.

We are currently processing digitised versions of over 10 million scanned document images from 5 different collections, amounting to several hundred gigabytes worth of information.  This is not big data in the computer science sense, where people talk about terabytes or petabytes.  However, it is big data to historians, who in the best case have access to some of these collections online via keyword search but often have to visit libraries and archives and go through them manually.  Even if a collection is available digitally and indexed, that does not mean that all the information relevant to a search term is easily accessible to users.  In a large proportion of our data, the OCRed (optical character recognised) text contains a lot of errors and, unless corrected, those errors find their way into the index.  This means that searches for correctly spelled terms will not return matches in sources which mention them with one or more errors.
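As a toy illustration (a Python sketch, not our actual search infrastructure, with invented documents), consider how a single OCR error hides a document from an exact-match index:

```python
# A toy index: map each lower-cased token to the documents containing it.
documents = {
    1: "Large shipments of cinchona bark arrived from Peru.",
    2: "Large shipments of cinchoua bark arrived from Peru.",  # OCR error: "n" read as "u"
}

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token.strip(".,;"), set()).add(doc_id)
    return index

index = build_index(documents)
print(index.get("cinchona"))  # {1} -- document 2 is invisible to the search
```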

The low text quality in large parts of our text collections is also one of our main challenges when it comes to mining this data.  So I summarised the types of text correction and normalisation steps we carry out in order to improve the input for our text mining component.  However, there are cases when even we give up, namely when the text quality is so low that it is impossible even for a human being to read a document.  I showed a real example of one of the documents in the collections, the textual equivalent of an upside-down image, which was OCRed the wrong way round.
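For a flavour of what such correction and normalisation can involve, here is a small illustrative Python sketch; the rules shown (rejoining words hyphenated across line breaks, replacing the long s, collapsing whitespace) are typical examples of this kind of clean-up rather than our actual pipeline:

```python
import re

def normalise(text):
    # Rejoin words hyphenated across line breaks: "con-\nsignment" -> "consignment".
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
    # Replace the long s (U+017F), common in older typefaces, with "s".
    text = text.replace("\u017f", "s")
    # Collapse runs of whitespace introduced by the OCR layout analysis.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(normalise("The ſhip carried a large con-\nsignment of brandy."))
# -> "The ship carried a large consignment of brandy."
```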

In the end, I got the sense that my talk was well received.  I got several interesting questions, including one asking whether our users’ research questions are now being shaped by the technology, when the initial idea was for the technology to be driven by their research.  I also made some connections with people in literature, so there could be some exciting new collaborations on the horizon.  Overall, the workshop was extremely interesting and very well organised, and I’m glad that I had the opportunity to present our work.


Professional success for Trading Consequences team member

Dr Jim Clifford, postdoctoral fellow at York University on the Trading Consequences project, has accepted a tenure-track job at the University of Saskatchewan in Saskatoon beginning 1 July 2013.  Jim’s work on this digital humanities project undoubtedly contributed to his success.  Hired as an Environmental Historian, Jim joins the History Department and contributes to their existing strength in environmental and digital history.  He will remain a core researcher on the Trading Consequences team after he takes up this new position, and we will apply to have him listed as a co-applicant on the SSHRC component of the grant.

In addition to this career success, Jim was also awarded a SSHRC postdoctoral fellowship, which he has declined in order to take up the position in Saskatoon.  He also recently received word that he has been offered a visiting fellowship at the Rachel Carson Center for Environment and Society at Ludwig-Maximilians-Universität in Munich.  Congratulations, Jim!


Guest Post on Kew Gardens’ Blog

The Trading Consequences team have written a guest post, “Bringing Kew’s Archive Alive”, for the blog of Kew Gardens’ Library, Art and Archives.

The post looks at how digital data produced by Kew’s Directors’ Correspondence team can be used as a source for visualising the British Empire’s 19th-century trade networks.

You can read the post in full here: http://www.kew.org/news/kew-blogs/library-art-archives/bringing-kews-archive-alive.htm

How to Build a Macroscope

Our York University team members, led by Timothy Bristow at the Library, have organized a one-day workshop on text mining in the humanities on March 15:

A macroscope is designed to capture the bigger picture, to render visible vastly complex systems. Large-scale text mining offers researchers the promise of such perspective, while posing distinct challenges around data access, licensing, dissemination, and preservation, digital infrastructure, project management, and project costs. Join our panel of researchers, librarians, and technologists as they discuss not only the operational demands of text mining the humanities, but also how Ontario institutions can better support this work.

Progress to date on Trading Consequences Visualizations

Up here in St Andrews we are in the process of exploring several routes to visualize the vast amount of commodity data that have been extracted from the historical archives by our colleagues from the University of Edinburgh.

Research in environmental history can be an open-ended process where research questions are formed and refined as part of working with the available data (i.e., historical documents). Our goal is therefore the development of visualization concepts that will reveal a range of temporal, geographic and content-related perspectives on the commodity data, and that will highlight different conceptual angles and relations within the data. Such “interlinked” visualization perspectives can provide an overview of the entire dataset and, at the same time, act as probes to explore certain aspects of the commodity data in more detail. Using this approach we aim to support more open-ended explorations of the commodity data as well as to provide easy access to specific documents of interest.

Our design process so far has been driven by discussions with Jim and Colin, by paper sketches to iterate on particular visualization ideas, and by a review of the literature on information visualization and digital humanities.

Discussions with Jim and Colin revealed that the temporal and geographic aspects of the data are central to their research, but always in close combination with commodity types and their relations to each other. This resulted in several paper sketches, shown below, that explore how these particular aspects could be visually expressed and augmented with interactive features.

We also created (static) computational sketches (shown below) based on samples from the actual database. At the same time, our collaborators from EDINA created an interface to the database that allowed us to interrogate the data through textual queries and list views.

Both these approaches allowed us to explore the character of the data and the potential visualization challenges it introduces.

The implementation of a web-based visualization prototype that combines the ideas from our early design explorations is currently in full swing. This prototype is based on the popular visualization library d3.js. We are closely collaborating with the teams from Toronto and Edinburgh on iterating its design and implementation.

Moving from the questions and interests of researchers in environmental history to interactive visualizations that support digging into data with fluid, commodity-oriented inquiries is a process of continual refinement and of exploring interaction research questions both small and large.

Lethal Brandy for Christmas

by Bea Alex

We are making good progress with our text mining work in Trading Consequences and are able to identify related commodities and geo-referenced locations in our collections. We are working on creating different visualisations of the mentions of commodities in proximity to locations, including dates and global frequencies, to enable historians to get an overview of when things were traded and where in the world. Equally, they will be able to drill down to individual documents, see which documents are most relevant for a given commodity and study the mentions of commodities in context.
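As a toy illustration of the proximity idea, the Python sketch below counts commodity and place mentions that fall within a fixed token window of each other; the tiny gazetteers and the example sentence are invented, and the real system uses proper named entity recognition, grounding and relation extraction rather than simple string matching:

```python
from collections import Counter

# Tiny invented gazetteers, purely for illustration.
COMMODITIES = {"brandy", "fur", "timber"}
PLACES = {"york factory", "hudson bay"}

def cooccurrences(tokens, window=10):
    """Count commodity/place pairs that appear within `window` tokens of each other."""
    pairs = Counter()
    for i, tok in enumerate(tokens):
        if tok in COMMODITIES:
            nearby = " ".join(tokens[max(0, i - window): i + window + 1])
            for place in PLACES:
                if place in nearby:
                    pairs[(tok, place)] += 1
    return pairs

tokens = "two casks of brandy were sent from york factory that winter".split()
print(cooccurrences(tokens))  # Counter({('brandy', 'york factory'): 1})
```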

For example, at Christmas 1746, brandy is mentioned in one of the documents in Early Canadiana Online in relation to York Factory on the southwestern shore of Hudson Bay in northeastern Manitoba, Canada (Willson, 1899). In May of that year, two ships of the Hudson’s Bay Company had sailed from England to Hudson Bay aiming to discover a northwest passage to India. In September, they stopped not far from York Factory to stay there for the winter. They erected a log cabin as a shelter and called it Montague House (see a drawing of Montague House here).

The relationship between the local Governor Norton and the explorers was far from cordial. At Christmas, Norton sent them a couple of casks of brandy as a present to celebrate. Soon afterwards, scurvy broke out amongst the explorers and several of them died. The disease was blamed on the brandy, and Governor Norton was alleged to have refused to give assistance or suggest a remedy to cure the sick. He had also prevented any Indians from approaching the explorers or providing them with any supplies. The latter resumed their voyage in the spring of the following year but eventually gave up their mission and returned to England without having discovered a northwest passage. And the moral of the story is: be careful if someone offers you a lot of brandy for Christmas. ;-)

Commodities, Vampires and Fashion: Making Connections in Victorian Research

By Colin Coates

Earlier this year, Jim Clifford and I were invited to present the Trading Consequences project to a group of scholars, many of them from English Departments in the Toronto region, who are interested in the Victorian period.  We contributed to the workshop, “Making Connections in Victorian Research”, held at York University in Toronto, on 19 October.

Our paper was sandwiched between talks about clothing reform in Victorian Britain and the pornographic elements of Bram Stoker’s Dracula.  Not surprisingly, we were concerned that our discussion of computer-assisted analysis of trading patterns and associated environmental consequences in the British empire might appear tangential, maybe even irrelevant, to the cultural concerns of this audience of scholars.

However, one of the advantages of historical studies is that issues within the same chronological time frame do have ways of connecting.  As Barry Commoner suggested, a key principle of ecology is that “everything is connected to everything else.”  The same is true when one approaches matters historically.

We presented the methodology of the Trading Consequences project, discussing the collaboration with computational linguists and computer scientists, and we showed some preliminary visualisations of the research findings.  The map we showed illustrated the global geographical locations associated with references to natural resources in Canadian government documents from 1860 to 1900.  In presenting these data, we are hoping to understand the mental geography of Canadian decision-makers (politicians, government officials and businesspeople) in this time period.  A feature of the exploitation of natural resources is that extraction activities can shift fairly quickly from one part of the globe to another.  In other words, a fisher off Nova Scotia may have to keep in mind what fishers in the North Sea are doing.  The production of lime for fertiliser in Ontario may be influenced by developments in Florida or Algeria.  The map was based on an experiment with visualisation techniques.  Much of what it illustrated accorded with common sense: concentrations of references to the United Kingdom and the United States.  France seemed more prominent than we would have expected, as did Nepal and the Philippines, possibly pointing to some problems with the data which we will need to explore.  China seemed under-represented.  However, to our mind, the emphasis on the Caribbean seemed one angle worth pursuing.

From Cod to Cinchona: Creating a Bibliographic Database of Sources for the Trading Consequences Project

As part of our work with the Trading Consequences project, Jim Clifford and I have compiled a bibliographic database of secondary sources that focus on the environmental and economic effects of the nineteenth-century global commodity trade. This is no small task, since the historiography is as vast as the imperial networks that this project seeks to explore. In this post, I’ll explain how we went about creating the database.

Earlier this year, Jim created a preliminary database of sources that originated from his own research interests in the environmental history of the British Empire during the nineteenth century. Project members had included many of these sources in the Digging into Data funding application, so it made an obvious starting point for us.

Zotero was an easy choice of software for our database, and it offers a number of advantages. For example, users can create folders within the larger database so that entries can be categorized by descriptors such as geographic area and type of commodity analyzed within the text. The software also enables users to enter source entries by clicking on an icon within the web browser address bar, create notes for such entries, and share their work with others in a group. With the click of a few keys, Zotero easily converts these entries into a conventional bibliography, as we’ve done at the end of this post.

Screen capture of our database in Zotero. Note the various folders on the left and the list of sources in the middle.
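Since Zotero also exposes group libraries through a web API, the database can be queried programmatically as well. Here is a minimal Python sketch using the pyzotero package; the library ID and API key are placeholders rather than our group’s real credentials:

```python
from pyzotero import zotero  # pip install pyzotero

# Placeholder credentials -- substitute a real group library ID and API key.
zot = zotero.Zotero(library_id="123456", library_type="group", api_key="XXXX")

# List the collections ("folders") used to categorise sources ...
for coll in zot.collections():
    print(coll["data"]["name"])

# ... and fetch a few recent top-level items with their titles.
for item in zot.top(limit=5):
    print(item["data"].get("title", "(untitled)"))
```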

During the summer, I joined the Trading Consequences project as a researcher. One of my tasks was to add sources to the existing database. My first strategy led me to survey existing bibliographies related to environmental history. For example, I used the Network in Canadian History and Environment’s (NiCHE) New Scholars Wiki, which its members had created in 2008 in order to assist graduate students who needed to compile secondary sources for comprehensive exams in environmental history.

Putting it all together: first attempt

Within Trading Consequences, our intention is to develop a series of prototypes for the overall system. Initially, these will have limited functionality, but will then become increasingly powerful. We have just reached the point of building our first such prototype. Here’s a picture of the overall architecture:

Text Mining

Our project team has delivered the first prototype of the Trading Consequences system. The system takes in documents from a number of different collections. The Text Mining component begins with an initial preprocessing stage which converts each document to a consistent XML format. Depending on the corpus that we’re processing, a language identification step may be performed to ensure that the current document is in English. (We plan to look at French documents later in the project as well.) The OCRed text is then automatically improved by correcting and normalising a number of known issues.
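As an illustration of the language identification step, the following Python sketch uses the off-the-shelf langdetect package to filter for English documents; this is a stand-in for our actual component, and the length threshold is an invented safeguard against unreliable classifications of very short texts:

```python
from langdetect import detect  # pip install langdetect

def is_english(text, min_chars=200):
    """Keep a document only if the classifier labels it English."""
    if len(text) < min_chars:  # detection is unreliable on very short texts
        return False
    return detect(text) == "en"

print(is_english("The Hudson's Bay Company shipped furs and brandy. " * 10))  # True
```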

The main processing of the Text Mining component involves various types of shallow linguistic analysis of the text, lexicon and gazetteer lookup, named entity recognition and grounding, and relation extraction. We determine which commodities were traded when and in relation to which locations. We also determine whether locations are mentioned as points of origin, transit or destination, and whether vocabulary relating to diseases and disasters appears in the text. All additional information which we mine from the text is added back into the XML document as different types of annotation.
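The following much-simplified Python sketch gives a flavour of that final step, writing mined information back into the XML as annotations; the two-entry gazetteer and the element and attribute names are invented for illustration and do not reflect our actual annotation schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry gazetteer, purely for illustration.
GAZETTEER = {"brandy": "alcohol", "cinchona": "medicinal plant"}

def annotate(sentence):
    """Wrap gazetteer matches in <commodity type="..."> elements."""
    root = ET.Element("s")
    current = None  # last inserted element, so plain text can attach as its tail
    for token in sentence.split():
        key = token.strip(".,").lower()
        if key in GAZETTEER:
            current = ET.SubElement(root, "commodity", type=GAZETTEER[key])
            current.text = token
        elif current is None:
            root.text = (root.text or "") + token + " "
        else:
            current.tail = (current.tail or " ") + token + " "
    return root

print(ET.tostring(annotate("Casks of brandy reached the fort."), encoding="unicode"))
# -> <s>Casks of <commodity type="alcohol">brandy</commodity> reached the fort. </s>
```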

Populating the Commodities Database

The entire annotated XML corpus is parsed to create a relational database (RDB). This stores not just metadata about the individual documents, but also detailed information resulting from the text mining, such as named entities, relations, and how these are expressed in the relevant documents.
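As a rough sketch of what such a database might look like, here is a minimal, hypothetical schema in SQLite; the real Trading Consequences schema is considerably richer:

```python
import sqlite3

conn = sqlite3.connect("commodities.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS documents (
    doc_id     INTEGER PRIMARY KEY,
    collection TEXT,               -- which archive the document came from
    year       INTEGER             -- publication year, where known
);
CREATE TABLE IF NOT EXISTS mentions (
    mention_id INTEGER PRIMARY KEY,
    doc_id     INTEGER REFERENCES documents(doc_id),
    commodity  TEXT,               -- normalised commodity name
    place      TEXT,               -- grounded location name
    country    TEXT,               -- country the location resolves to
    latitude   REAL,
    longitude  REAL,
    snippet    TEXT                -- how the relation is expressed in the text
);
""")
conn.commit()
```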

Visualisation

Both the visualisation and the query interface access the database, so that users can either search the collections directly through textual queries or browse the data in a more exploratory manner through the visualisations. For the prototype, we have created a static web-based visualisation that represents a subset of the data taken from the database. This visualisation sketch is based on a map that shows the location of commodity mentions by country. We are currently setting up the query interface and are busy working on dynamic visualisations of the mined information.
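To give a sense of the aggregation behind such a map, here is a small Python sketch that counts commodity mentions per country and exports them as JSON for a browser-based visualisation to load; it reuses the hypothetical schema sketched above:

```python
import json
import sqlite3

conn = sqlite3.connect("commodities.db")
rows = conn.execute(
    "SELECT country, COUNT(*) FROM mentions "
    "WHERE country IS NOT NULL GROUP BY country"
).fetchall()

# Write the per-country counts out as JSON for the web front end to consume.
with open("mentions_by_country.json", "w") as f:
    json.dump([{"country": c, "mentions": n} for c, n in rows], f, indent=2)
```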

Trading Consequences at SICSA DemoFest 2012

Trading Consequences will be showing a poster at the SICSA DemoFest 2012 at the Informatics Forum, University of Edinburgh on Tuesday 6th November.
Trading Consequences poster for SICSA DEMOFest 12

According to the official publicity:

SICSA is the largest ICT research cluster in Europe and this year’s DEMOfest shows the best of Informatics and Computing Science state of the art research in Scotland. DEMOfest promotes research and encourages commercial collaboration between academia, business and industry. The event will exhibit over 50 presentations and demonstrations aimed at showing:

  • Research with commercial potential.
  • Opportunities for collaboration between university and industry.

We are looking forward to seeing who’s interested in our project!