|
This page last changed on Nov 29, 2008 by martinmueller@northwestern.edu.
This is a set of reflections on the Monk Project and where it might go. I write this for scholarly users who know little about Monk. Some of it might find its way into a final report or into subsequent grant proposals.
I discuss things by following a text from its source through the various stages of processing. There are two reasons for this. First, there is a (chrono)logical order to it. Second, and more importantly, for most scholarly users of Monk, texts are the stuff they are interested in and care about. Where do they come from? Are they good enough for my purpose? What can I do with them that I can't do by just reading them? These are questions that constantly need foregrounding.
Scale and interoperability are two keywords for the digital version of 'text and context' that is at the heart of Monk. A successful Monk environment creates a very large document space which from the users' perspective contains many, if not all, of the texts that are necessary for their particular inquiries: enough Elizabethan or Jacobean plays, 17th century witchcraft documents, or 19th century novels from both sides of the Atlantic to support new ways of contextualizing familiar texts. These new ways involve new tools or, more accurately, tools that have been used successfully in other disciplines but are largely new to the humanities. But most scholars will find these tools interesting only to the extent that they help them with their stuff. Tell a group of humanities scholars about an exciting new tool, and their eyes will glaze over. Tell them about the ways in which a tool will help them with their stuff, and their eyes may light up.
Actually, in a successful Monk environment, the distinction between 'stuff' and 'tool' is blurred. If there are tools (or analytics) to profile texts of interest against a background of other texts, are the other texts part of the tool or part of the stuff? And what about the metadata that enable such profiling in the first place? Are they 'tool' or 'stuff'? Whichever way you answer, the quality and properties of the stuff matter a lot.
If from the scholar's perspective it's all about the stuff and if the advantage of Monk consists in a large document space with a potential for complex forms of navigation or analytics, the constraints of copyright law create two opportunities for document spaces that are in principle unencumbered by concerns about who owns what: texts published before 1923 and texts published on the Web since 1994. From the perspective of scholarly needs in history, literature, philosophy, religion, and music, you wouldn't divide the world that way. But we have to live with that division for quite a while.
So you can in principle build a fully interoperable public domain archive of everything before 1923 or of everything on the Web. By 'interoperable' I mean that there are for all practical purposes no legal constraints on what you can do with texts. You can't build such an archive for Vietnam War fiction, to give only one example.
The archive before 1923 is static and has been through the filter of time. However muddy at the margins, there is a lot of agreement about which texts matter more and which texts matter less, at least for many purposes. The Web archive is dynamic, growing at phenomenal speed and, for all I know, adds more words in a year than were published before 1923.
These two document spaces differ in many ways, and they attract different kinds of scholarly or casual users. While for some technical purposes "texts are texts are texts" or "stuff is stuff", there are deep differences in the ways in which the user communities approach the texts of interest to them. The textual world before 1923 is a world of well-defined, if overlapping, scholarly and readerly communities that have their canons and anti-canons. This is a world with well-developed expectations, inherited from the print world, about provenance, the quality of editions, the tolerance for error, etc. These expectations must be honored, adapted, and appropriately enhanced in an environment populated by digital surrogates of the print avatars.
By contrast the textual world of the Web is anarchic, dynamic, and increasingly multimodal. It has been around for only fifteen years. The only thing you can say of it with certainty is that it is growing at a dizzying rate.
Monk belongs much more in the former than in the latter world. It is a cultural heritage project in the broadest sense of the term. It concerns itself with finding appropriate expressions for, and new modes of access to, texts that originated in a world of books and manuscripts. Think of it as a form of 'keeping' or 'digital upkeep'.
The texts
Source texts for MONK are first converted from their original encoding to a particular flavor of TEI called TEI-Analytics and are then linguistically annotated with the NLP toolkit MorphAdorner, using the POS tag set NUPOS.
From source text to TEI-Analytics
Most of the texts in Monk come from archives encoded according to the Level 4 Guidelines for encoding TEI texts. This is a relatively loosely formulated set of rules followed by American libraries, in particular the Virginia, Michigan, and Indiana university libraries. The DTD used in these Guidelines is TEI-Lite, which uses a subset of ~150 elements from the ~500 elements of the complete TEI DTD. All these texts were encoded before the adoption of the new and XML-native P5 version of the TEI.
TEI-Analytics is a close relative of TEI-Lite. It adds elements from the main set that are necessary for linguistic annotation (w, c, s), and creates some new elements that are 'syntactic sugar' for existing elements, such as <sup> for <hi rend="superscript"> or <sb/> for <milestone unit="sentence">. It is called TEI-Analytics because its chief purpose is to enable the movement of diverse texts into a common document space that easily supports analytical operations across different corpora regardless of the origins of particular texts.
Brian Pytlik-Zillig and Stephen Ramsay at the University of Nebraska developed the routines for converting the raw source texts into TEI-A files. In the process we discovered that the people in the different encoding projects had not thought very hard about making their texts play well with each other. Nor had they anticipated their potential transformation into an annotated linguistic corpus. We also discovered the following: if the P5 conversions of the different projects proceed as discrete local projects, the resultant texts will not be very interoperable. If the various libraries established a working group for P5 conversion, they could with relatively few adjustments create text archives that are much more interoperable than what exists now.
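The 'syntactic sugar' mentioned above can be illustrated with a small sketch. It is not the Nebraska conversion routine, only a minimal Python illustration of the two substitutions named in the text. It assumes un-namespaced elements (as in pre-P5 files), and the sample sentence is invented for the purpose.

```python
import xml.etree.ElementTree as ET

def apply_sugar(root):
    """Rename two generic TEI constructs to their shorthand forms, in place."""
    for el in root.iter():
        if el.tag == "hi" and el.get("rend") == "superscript":
            # <hi rend="superscript"> becomes <sup>
            el.tag = "sup"
            del el.attrib["rend"]
        elif el.tag == "milestone" and el.get("unit") == "sentence":
            # <milestone unit="sentence"/> becomes <sb/>
            el.tag = "sb"
            del el.attrib["unit"]
    return root

# Invented sample fragment, not taken from any Monk source text.
sample = (
    '<p>M<hi rend="superscript">r</hi> Smith bowed.'
    '<milestone unit="sentence"/> She smiled.</p>'
)
root = apply_sugar(ET.fromstring(sample))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Only the element names change. Text content and any other attributes pass through untouched, which is what makes the shorthand a matter of convenience rather than a new encoding.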
Linguistic annotation by MorphAdorner
The TEI-Analytics texts go through a process of linguistic annotation that adds part-of-speech tags, lemmatization, and explicit sentence boundaries. This process uses MorphAdorner, a natural language processing toolkit developed by Phil Burns, and NUPOS, a part-of-speech tag set designed by Martin Mueller.
MorphAdorner and NUPOS differ from similar NLP toolkits in their degree of attention to dialectal and diachronic variance in the texts of the Monk corpus. The goal is to surround the texts with a set of metadata that virtually level orthographic and morphological variance and let users manipulate texts from Chaucer to Joyce as if they were written in modern English. The operative word here is 'virtual'. The point of this virtual standardization is not to modernize the texts or obliterate diachronic and dialectal difference. On the contrary, virtual standardization is a device for making actual difference more apparent: a user looking for the verb 'love' can easily retrieve all the forms and spellings of that word.
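A toy sketch may make the idea of virtual standardization concrete. The token list below is invented, and the word-class labels 'verb' and 'noun' are stand-ins rather than actual NUPOS tags. The sketch only shows the principle: the original spellings stay intact, while the lemma and word class attached to each token serve as the retrieval key.

```python
from collections import defaultdict

# Hypothetical adorned tokens: (spelling as printed, lemma, coarse word class).
# Real MorphAdorner output carries NUPOS tags; these labels are placeholders.
tokens = [
    ("loueth", "love", "verb"),
    ("lovyd", "love", "verb"),
    ("loves", "love", "verb"),
    ("loue", "love", "noun"),
    ("herte", "heart", "noun"),
]

def build_index(tokens):
    """Map each (lemma, word class) pair to the set of spellings found for it."""
    index = defaultdict(set)
    for spelling, lemma, word_class in tokens:
        index[(lemma, word_class)].add(spelling)
    return index

index = build_index(tokens)

# One query for the verb 'love' retrieves every attested spelling of it.
forms = sorted(index[("love", "verb")])
print(forms)
```

The noun 'loue' is not returned by the query for the verb, which is the point of keying on lemma and word class together rather than on the surface string.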
Towards a cultural genome of written English or What else can you do with morphadorned TEI-Analytics texts?
Morphadorned TEI-Analytics texts were developed for the purpose of being ingested into the Monk datastore that supports a wide variety of queries and analytics. But it is worth dwelling in some detail on the fact that these texts can serve many other purposes.
You can think of them as a prototype of a 'Book of English', a project that combines methods and approaches from Biology and Corpus Linguistics to construct a 'cultural genome' of written English from Caxton's Troy book (1473) to Joyce's Ulysses (1922). Such a project might be anchored in several CIC libraries, e.g. Northwestern, UIUC, and the University of Chicago, and would complement two very different and much larger digitizing projects, the Text Creation Partnership and the Hathi Trust. The former seeks to create digital transcriptions of about 40,000 British and American books before 1800. The latter, a joint project of the CIC Libraries and the California Digital Library and partly affiliated with Google Books, has so far made 2.5 million digitized books available.
There is an obvious question about the added value of linguistically annotated texts when digitized versions of them are already available. The answer is that digitization is a many-layered thing, and different forms of digitization have different affordances, as these two examples illustrate. A few days ago I wondered whether you can use an English form of the Greek word akribeia. A German form, Akribie, is the standard word for 'meticulousness'. The English form 'acriby' is not found in the OED, but a Google search retrieves 141 hits, partly from the Web and partly from Google Books, and from a quick review of them 'acriby' emerges as a plausible if somewhat recondite word. There are 165,000 hits for Akribie but 6.6 million hits for 'meticulous'.
This is a nice example of a standard philological query to which a vast shallowly encoded archive gives much the best answers. Athenaeus in his Deipnosophist (written approximately 200 CE) coined the nickname Keitosoukeitos for the pedant Ulpian (perhaps a relative of the great jurist), who prided himself on his proper speech and always asked whether a given word occurred (keitai) or did not occur (ou keitai) in the best Greek authors. Keitosoukeitos would have loved Google, which for many lexical searches is much better than the OED and will improve in usefulness as the Google Books project progresses.
You can even draw some quantitative conclusions from the Google search. 'Acriby' is obviously a much rarer animal in English than Akribie is in German. It is less obvious, however, whether English 'meticulous' is relatively more common than German Akribie.
Now consider another example. On several occasions, friends were intrigued when I told them that compared with her contemporaries, Jane Austen uses the noun 'heart' much less often and the verb 'think' much more often. My friends thought that this lexical observation was a nice way of capturing something important about Austen's way of being in the world.
How do you get information of this kind and, more importantly, how do you follow it up and determine whether it is just a fluke or is part of a larger lexical pattern? You cannot get this information from a Google-like search, however large the archive and however fast the search engine. For this you need a corpus that makes it easy for you to define a subcorpus (the novels by Jane Austen), another subcorpus (enough novels by her contemporaries), and a procedure that lets you compare and evaluate differences in usage. The procedure in this case is a 'log likelihood statistic' from which you learn that the ratios of usage (6:10 for 'heart', 9:5 for 'think') are very unlikely to be random given the frequency of those words. But the procedure, in itself a fairly elementary statistic, is useless unless you have a corpus of texts with bibliographical and linguistic metadata that let you quickly compare one set of data with another.
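For readers who want to see what such a comparison looks like, here is a minimal sketch of one common formulation of the log-likelihood statistic used in corpus comparison (the two-term G² form). Whether Monk computes exactly this variant is an assumption on my part. The counts and corpus sizes are invented round numbers chosen only to reproduce a 6:10 ratio; they are not actual figures for Austen or her contemporaries.

```python
import math

def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-term G2 statistic for one word's frequency in two corpora.

    count_a, count_b: occurrences of the word in corpus A and corpus B
    size_a, size_b:   total number of word tokens in each corpus
    """
    total = count_a + count_b
    # Expected counts if the word were used at the same rate in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Invented figures: 600 vs. 1000 occurrences of a word in two subcorpora
# of one million words each (a 6:10 ratio).
g2_word = log_likelihood(600, 1_000_000, 1000, 1_000_000)

# Identical rates of use give a score of zero.
g2_same = log_likelihood(500, 1_000_000, 500, 1_000_000)

print(round(g2_word, 2), g2_same)
```

A score above 10.83 corresponds to p < 0.001 on a chi-square distribution with one degree of freedom, so with counts of this size a 6:10 ratio is very unlikely to be a fluke. The same ratio based on 6 and 10 occurrences would not clear that bar, which is why the statistic depends on the frequency of the words and not on the ratio alone.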
This is where the metaphor of the cultural genome comes in handy. My daughter Rachel is an evolutionary biologist who spent several years as a graduate student in Berkeley's Museum of Vertebrate Zoology. There you can walk along shelves and shelves of salamander specimens, meticulously prepared and labeled by generations of field biologists going back to the 1800s. These are, if you will, surrogates of living animals, and the handwritten metadata on the labels are a minimal representation of their environment. Working with such specimens is not unlike working with books.
As part of her work my daughter extracted DNA from some of these specimens, fed the DNA sequences into a collaborative gene bank, and used a comparative analysis of these sequences to formulate hypotheses about the descent of certain kinds of salamander families. As a generic research problem this is a very familiar story to any literary scholar who has ever traced the affiliations of texts over time. But the manipulation of these particular salamander surrogates is impossible without digital technology. You either do it with a computer or you do not do it at all.
Over the course of my daughter's career as a graduate student the time cost of analyzing DNA sequences on a computer dropped from weeks to hours. Ten years earlier, her work would for all practical purposes have been impossible. 'Rachel's Salamanders' thus is a project in which a particularly intensive and extensive exploration of a digital surrogate played an essential role and was enabled by very rapid improvements in the ability of computers to manipulate and store very large amounts of data more quickly and more cheaply.
The intrinsically collaborative aspects of this project are worth pointing out because they are an essential ingredient of the query potential of the digital surrogate in this case. Rachel contributed her DNA sequences – carefully and tediously extracted from specimens – to GenBank, an 'annotated collection of all publicly available DNA sequences'. GenBank is part of the International Nucleotide Sequence Database Collaboration. In these enterprises, the immense phenotypical variance of life is reduced to systematic descriptions at the level of the genotype. Think of it as a Book of Life, written in a four-letter alphabet, with collaboration and reduction as the cause and cost of scientific insight. The DNA sequences Rachel contributed in a standardized format acquired much of their meaning for her particular project by their incorporation into the large gene bank that allowed her to make sense of them in that wider context. In turn, her contribution enriched that gene bank, by however little. The aggregate of such enrichments over time by hundreds and thousands of biologists continues to increase the query potential of this digital surrogate of the vast Book of Life.
There are several ways in which these standard practices from the workday of a contemporary biologist bear on the idea of a cultural genome or an environment for the digital contextualization of literary texts. It is not enough to have digitized texts
|