Background

Humanities text-mining in the digital library

Over the last decade, many millions of dollars have been invested in creating digital library collections: at this point, terabytes of full-text humanities resources are publicly available on the web. Those collections, dispersed across many different institutions (not only libraries but also publishers), are large enough and rich enough to provide an excellent opportunity for text-mining, and we believe that web-based text-mining tools will make those collections significantly more useful, more informative, and more rewarding for research and teaching.

WordHoard and Nora

MONK builds on work done in two separate projects funded by the Andrew W. Mellon Foundation: WordHoard (http://wordhoard.northwestern.edu/), at Northwestern University, and Nora (http://www.noraproject.org/), with participants at the University of Illinois, the National Center for Supercomputing Applications, the University of Maryland, the University of Georgia, the University of Nebraska, the University of Virginia, and the University of Alberta. The two projects shared the basic assumption that the scholarly use of digital texts must progress beyond treating them as book surrogates and move towards the exploration of the potential that emerges when you put many texts in a single environment that allows a variety of analytical routines to be executed across some or all of them.

The WordHoard project applied to literary texts the insights and techniques of corpus linguistics, namely the empirical and computer-assisted study of large bodies of written texts or transcribed speech. In WordHoard, such texts are annotated or tagged according to morphological, lexical, prosodic, and narratological criteria. In its current release, WordHoard contains the entire canon of Early Greek epic in the original and in translation, as well as all of Chaucer and Shakespeare, and Spenser’s Faerie Queene.

The goal of the Nora project was to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries. Like WordHoard, Nora applied some of the tools, techniques, and insights of corpus linguistics to its collections, and like WordHoard, Nora dealt with literary texts, though from a later era: British and American literature of the 18th and 19th centuries. Nora built on D2K (Data to Knowledge), a generalized visual-programming framework for data-mining developed, and still being improved, by the Automated Learning Group at the National Center for Supercomputing Applications.

If you look under the hood of the two projects, many similarities and some differences emerge, but the similarities run much deeper. Both projects have procedures for

1. Ingesting arbitrary texts that meet some rules (e.g. well-formed XML)
2. Tokenizing the texts, assigning to each word a unique location, and applying part-of-speech tagging and other techniques familiar from corpus linguistics
3. Converting the tokenized and preprocessed texts into a datastore that includes various count objects to simplify and speed up subsequent operations
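
The three shared steps can be sketched in a few lines of Python. This is a minimal illustration only, not the code of either project: the element names (text, l), the sample document, and the function names are assumptions made for the example, and part-of-speech tagging is left out.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; the element names are illustrative only.
SAMPLE_XML = (
    "<text id='demo'>"
    "<l>To be, or not to be</l>"
    "<l>that is the question</l>"
    "</text>"
)

# Deliberately simple word pattern; real tokenizers handle far more cases.
WORD_RE = re.compile(r"[A-Za-z']+")


def ingest(xml_string):
    """Step 1: accept only well-formed XML (ParseError is raised otherwise)."""
    return ET.fromstring(xml_string)


def tokenize(root):
    """Step 2: return (location, word) pairs.

    A location is (document id, line number, word number), so every
    token has a unique address within the collection.
    """
    doc_id = root.get("id", "unknown")
    tokens = []
    for line_no, line in enumerate(root.iter("l"), start=1):
        text = "".join(line.itertext())
        for word_no, match in enumerate(WORD_RE.finditer(text), start=1):
            tokens.append(((doc_id, line_no, word_no), match.group().lower()))
    return tokens


def build_counts(tokens):
    """Step 3: precompute a count object so later queries need not rescan the text."""
    return Counter(word for _, word in tokens)


root = ingest(SAMPLE_XML)
tokens = tokenize(root)
counts = build_counts(tokens)
```

Here counts plays the role of a count object: a frequency query becomes a dictionary lookup, while tokens keeps the location of every word for queries that need context.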

In Nora, the datastore provides the basis for a chain of operations that go via D2K to the end-user applications. In the WordHoard environment, the user interface talks to the datastore through an object-relational mapping layer called Hibernate. In both Nora and WordHoard, however, the datastore is separable from the processes it feeds and could in principle feed quite different processes via quite different intermediate layers.

Nora and WordHoard differ marginally in their basic ways of tokenizing and preprocessing data. They have used different tag sets and have differed with regard to lemmatization and named-entity extraction, matters on which it is desirable and quite easy to reach agreement. WordHoard has also tagged some prosodic and narratological phenomena, but these very granular tagging operations are unlikely to scale to data sets that are larger by orders of magnitude.

Nora and WordHoard have both employed relational database systems to maintain their datastores, but Nora is exploring different technical options. Both projects make use of the XML tags in the texts, something that sets them apart from most text-mining done in the scientific community. On a technical level, both projects distribute a Java Web Start application.

If you compare the types of queries supported by the current interfaces of Nora and WordHoard, the latter comes at things from a philological perspective quite familiar to humanists, while the former applies text-mining strategies more deeply rooted in business and the social sciences. But there is real complementarity here, and the underlying operations are in any event very similar. It is difficult to distinguish between “text-analysis” and “text-mining”: it is more productive to think in terms of a broad spectrum of text analysis, with different scholars finding themselves at different points on the spectrum at different moments. What matters, finally, is that we create a common environment where scholars will find the tools that meet their needs.

Because these two projects have very similar underlying requirements for their texts, and very similar basic techniques for analyzing those texts, it made sense to combine them. Because they developed in complementary ways and explored alternative strategies for accomplishing similar goals, it seemed likely that they would strengthen one another, and they did. It was also a challenge, however, to meld them and build out from the two at a technical level, since they made different choices at the level of architecture and implementation.

Scaling up to MONK

Both Nora and WordHoard achieved most of their objectives with limited data sets in which the document count is in the low dozens and the total word count is in the low millions. In order to take full advantage of word-level metadata and the inquiries they support, though, we wanted data sets consisting of hundreds or thousands of documents and running to hundreds of millions to billions of words. In the MONK project we created an environment that lets users carry out complex data-mining and query operations across collections that contain nearly 200 million words. The major challenges, moving from WordHoard and Nora to MONK, were

  1. to develop automated (or largely automated) methods for converting texts from a variety of sources into a common format
  2. to construct a datastore sufficiently robust, fast, and flexible to support MONK's much larger collections
  3. to develop a usable web-based interface
  4. to choose analytic routines that are meaningful for exploring literary texts
  5. to document those routines, and present their results, in ways that are both credible and intelligible for humanities scholars.
