This page last changed on May 06, 2007 by unsworth.

Introduction

What is Monk?

MONK aims at making modern forms of text analysis and text mining accessible to humanities scholars. The name is an acronym for Metadata Offer New Knowledge. 'Metadata' does not sound like a very humanities-friendly word, while 'monk' may seem to gesture comfortingly towards a romantic or piously ascetic image of the Middle Ages. But the developers of MONK have not taken special vows of poverty, chastity, and obedience. Their affinity with monks is based on the simple fact that for more than five centuries monks were the pioneers of text processing. In the seventh century Irish monks hit on the idea of marking word boundaries by leaving a space between the end of one word and the beginning of the next. Monks developed lower case characters of which some rise above an upper boundary (b, d) while others drop below a lower boundary (g, j), thus giving to common words distinctive shapes that the reader's eye would take in as a single unit rather than process as a sequence of letters.

Monks were the first to turn a text into a database. Faced with the challenge of understanding every word of the Bible in relation to every other word and not possessing the instant recall of everything they attributed to the Lord, they thought of a finding aid to help their faulty memories. They divided the text into an inventory of its parts and listed the parts in alphabetical order together with the places of occurrence. This came to be known as a 'concordance' because it was a tool for demonstrating the harmony that they thought must be attributed to all the words of a charitable God. Many of the principles of a database as a finding tool are firmly if primitively realized in the Biblical concordance, which became the model of the secular concordances, dictionaries, and indexes that became the tools of trade for centuries of text analysis in a pre-digital philological world.

So the name 'MONK' has a double function. It draws attention to a particular type of data and the procedures associated with them. But more broadly and more importantly, the name says that the application of technology to texts is not a new thing. What you can do with texts in MONK will sometimes speed up the pace of known operations. At other times, it will enable you to do new things that previously were impractical. But whatever you do rests on centuries of text technology. Reading texts and working with them has always taken place in a technological environment, and usually in a relatively high one at that.

What are metadata?

Metadata are second-order data or data about data. 'Metadata' is a relative term: what counts as second-order depends on the boundaries of the first order. If you think of text as a sequence of words, punctuation marks could be seen as a form of metadata, and they do indeed "mark up" the text by grouping words in a hierarchical order: commas group words into clauses, while full stops group clauses into sentences. If you treat punctuation marks as if they were words, which is a common and useful practice in text processing, roughly one word in seven is a punctuation mark. Faced with the question whether you would rather lose punctuation marks or every seventh word from a text, you would probably choose the text without punctuation. And editors who change the punctuation in a text typically don't think of this as alteration or emendation.1 Thus punctuation is better thought of as a kind of metadata.

White space, i.e. space on a page that is not taken up by letters, is a very powerful form of metadata as you can easily prove to yourself by removing all spaces and line breaks from a text file. White space marks up a sequence of sentences as a paragraph, tells the reader that a given sequence of lines is a poem, or that a major section of a work, e.g. a chapter, has come to an end.

Page numbers, chapter headings, tables of content, or indexes are other forms of metadata. In some forms of writing, notably plays, there are rigid and explicit metadata conventions that govern such things as the marking of speakers, stage directions, acts, and scenes.

A book is in fact a very metadata-rich environment with implicit and explicit navigation aids. But it is a characteristic feature of print-culture metadata that the boundaries between text and metadata are blurred and that metadata are kept in imprecise or ambiguous ways.

Metadata in a digital environment

The soft edges of metadata in print culture do not matter very much for at least two reasons. First, the typesetter who encodes information on a printed page has the luxury of dealing with remarkably context sensitive decoders (aka readers) who draw on an extraordinary amount of largely tacit knowledge to overlook inconsistencies or correct errors as their eyes traverse the page in the process of making sense of the symbols on it.

Secondly, in a print world metadata meet their purpose if they guide the reader in a particular context. White space is a very powerful tool for orienting the reader as s/he encounters the next set of facing pages in a book. But white space is not a particularly good paragraph marker if your goal is to use a machine to count all the paragraphs in the chapters of a text and the number of words in each paragraph, which is the kind of dumb or brute force operation that is a typical preparatory step in digital text analysis.

For such operations you need to start from the assumption that the computer is a machine and like other machines lacks tacit knowledge. It is completely uncomprehending and unforgiving outside of the boundaries for which it has been explicitly programmed. It cannot 'read' or make sense; it can only process, but it can process very fast and very accurately if it has been told exactly what to do.

If the computer is used to process texts in ways that go beyond displaying an image on a screen that a human will then read, the texts to be processed must have added to them at least some of the tacit knowledge that readers bring to the task of making sense. And whatever is added must be added in a completely explicit manner. You might want to think of levels of "explicitation." I use the awkward neologism advisedly because the very shape of the word draws attention to the tediousness of unfolding, making explicit, and recording systematically the tacit knowledge on which the reader draws.

There are at least three levels of such explicitation in the documents that MONK deals with, and they may be described as metadata at the top or document level, metadata at the middle or discursive level, and metadata at the bottom or word occurrence level. Metadata at the top or document level are like a bibliographical record, and anybody who has ever used a bibliography or library catalogue, whether in print or online, knows why it is often useful to be able to look at metadata about books before looking at some of the books themselves.

Metadata at the middle or discursive level are in large part explicitations of the information that is conveyed by white space and typographical choices on the printed page. The most firmly established physical way of doing this is to use 'tags' or labels that enclose text segments and to use angle brackets to enclose the labels, as in

<l>To be or not to be, that is the question</l>

where the start and end tag declare that the content between them is a line of verse. Such tagging is called 'structural mark-up', and a tagged text may be said to be transformed into an 'ordered hierarchy of content' or OHCO. This is a good thing for many uses, but it is also the case that many texts cannot be boxed up in this way without considerable compromises along the way.

While metadata at the discursive level can be broadly defined as the translation of the implicit meanings of layout into explicit tags, metadata at the level of word occurrence have no equivalent in the print world. 'Linguistically annotated corpora', which is the technical term for texts that encode low-level linguistic metadata for each word, are a child of the digital age. Such corpora become useful only if they are quite large, but it is impractical to encode the necessary information or sort through result sets by hand.

The context sensitive expertise of the reader comes most fully into play in the tacit disambiguation of the many ambiguities that readers encounter along the way. 'Like' is a verb here, but not there; 'bears' refers to animals here, but is a verb there. In "jumped over the moon" 'over' is a preposition and goes with 'the moon', but in "take over the world" it is used adverbially and goes with 'take.'

The explicitation of such low-level linguistic phenomena is called morphosyntactic or part-of-speech tagging. There are different notation schemes for expressing it. But whether you write

jumped_vvd over_p-acp the_dt moon_n1

or

<w pos="vvd">jumped</w> <w pos="p-acp">over</w> <w pos="dt">the</w> <w pos="n1">moon</w>

it is perfectly clear that such information will never be of any use to a reader making sense of a passage except in the special case of an elementary language course, and even then only in small doses. In this regard metadata at the word occurrence level differ very much from metadata at the discursive level, which are essential to making a text easily readable. From a reader's perspective, text without discursive metadata and text with word occurrence metadata are about equally unreadable.
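
The two notations above carry the same information, so converting one into the other is mechanical. The following is a minimal Python sketch, not a description of how Monk does it. It assumes that a tag never contains an underscore, and the function names are invented for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def parse_underscore(tagged):
    """Split 'word_tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged.split():
        # rpartition splits at the LAST underscore, so the tag is the final piece.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


def to_xml(pairs):
    """Render (word, tag) pairs as <w pos="...">word</w> elements."""
    return " ".join(
        f"<w pos={quoteattr(tag)}>{escape(word)}</w>" for word, tag in pairs
    )


pairs = parse_underscore("jumped_vvd over_p-acp the_dt moon_n1")
xml_line = to_xml(pairs)
print(xml_line)
```

Either form is equally unreadable for a human and equally easy for a program, which is the point of the comparison.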

What is the point of metadata that only get in the way of the reader? This question is answered in the next section.

What are metadata good for?

The utility of digital metadata is best illustrated in the word-level metadata that from the reader's perspective are merely in the way and worse than useless. But unlike white space on the printed page, POS tags were never meant for the reader. Their purpose is to feed data to the computer. People, an IBM executive remarked somewhere, are smart but slow, while computers are dumb but fast. When 'bears' is tagged as a plural noun it joins all the other words that are tagged as plural nouns, and by implication it joins all the words that are tagged as nouns, whether singular, plural or possessive. It then becomes trivial to count all the nouns in a given text or across a very large collection.2
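
How trivial the counting becomes can be shown in a few lines of Python. The sketch below assumes, for illustration only, that every noun tag begins with 'n' (as 'n1' does in the example above) and that 'n2' marks a plural noun. The actual tag set may differ, and the sample sentence is invented.

```python
from collections import Counter

# Invented sample in the word_tag notation; 'n2' for a plural noun is an assumption.
TAGGED = "the_dt bears_n2 jumped_vvd over_p-acp the_dt moon_n1"


def tag_counts(tagged):
    """Count how often each part-of-speech tag occurs."""
    return Counter(token.rpartition("_")[2] for token in tagged.split())


def count_nouns(tagged):
    """Count all tokens whose tag starts with 'n' (assumed here to mean 'noun')."""
    return sum(n for tag, n in tag_counts(tagged).items() if tag.startswith("n"))


counts = tag_counts(TAGGED)
nouns = count_nouns(TAGGED)
print(counts, nouns)
```

The same two functions work unchanged on one sentence or on a very large collection; only the running time grows.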

What is true of nouns is equally true of verbs, of past and present verb forms, or of adjectives. But once you have information about how often given phenomena occur in each work of a large collection, many opportunities arise for analytical operations that draw inferences from the relative frequency of some chosen or discovered set of phenomena in text A or texts A, B, C when compared with text X or texts X, Y, Z.

The guiding assumption here is that the distribution of low-level linguistic phenomena provides powerful evidence for the analysis of higher-order phenomena and that the utility of such linguistic evidence is by no means restricted to questions that are likely to interest a linguist. Given the fact that writers spend endless hours putting their words into the right order, it is disconcerting that a list of their most commonly used nouns will tell you quite a bit about what they are about. And if you profile that list against a background of a similar list from a wider corpus, using some simple statistical routine, the expressive power of the data is greatly increased. It works better than it should. Here, for instance, is a comparison of the ten most common nouns in Early Greek epic and Shakespeare:

Early Greek epic          Shakespeare
Man (anêr)                lord
Ship (naus)               man
God (theos)               sir
Heart (thumos)            love
Hand (cheir)              king
Son (huios)               heart
Horse (hippos)            eye
Father (patêr)            time
Word (epos)               hand
Companion (hetairos)      father

Striking shared features of and characteristic differences between the worlds of Homer and Shakespeare are quite sharply highlighted in this highly reductive comparative routine.
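
One candidate for a "simple statistical routine" that profiles a word list against a background corpus is the log-likelihood ratio, which is widely used in corpus linguistics. The text does not say which routine Monk uses, so the choice of statistic, the function name, and the counts below are all assumptions made for illustration.

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """Log-likelihood ratio (G2) for one word in corpus A versus corpus B.

    count_* is how often the word occurs; size_* is the total number of words.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Invented counts: the same relative frequency in both corpora scores zero,
# while a lopsided distribution scores high.
same = log_likelihood(10, 1000, 20, 2000)
lopsided = log_likelihood(50, 1000, 20, 2000)
print(same, lopsided)
```

Sorting a noun list by such a score, rather than by raw frequency, brings forward the words that are characteristic of one corpus rather than merely common in both.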

The Importance of Not-Reading

As long as there have been books there have been more books than you could read. In the life of a professional or scholar, reading in the strong sense of "close reading" almost certainly takes a back-seat to finding out what is in a book without actually reading all or even any of it. There are age-old techniques for doing this, some more respectable than others, and they include skimming or eyeballing the text, reading a bibliography or following what somebody else says or writes about it. Knowing how to "not-read" is just as important as knowing how to read.

This is where the computer's way with low-level linguistic data really shines. In the early twentieth century Christian Morgenstern, a German poet much influenced by English nonsense poetry, wrote a poem called "The Spectacles". It is about a man who likes to read but is put off by the wordiness of it all and invents a pair of spectacles "whose energies condense the text for him." In the last stanza he gives as an example the poem he has written and concludes that with those spectacles you could not read it at all because "thirty-three instances of it would add up to only one question mark":

Beispielsweise dies Gedicht
läse, so bebrillt, man - nicht!
Dreiunddreißig seinesgleichen
gäben erst - Ein - - Fragezeichen!!

This is a nicely prophetic poem about the promises and doubts that hover around the possibilities of "not-reading" or "distant reading," as Franco Moretti has called it. Part-of-speech tagged corpora offer powerful versions of Morgenstern's condensing spectacles, especially if you use them in connection with other forms of not-reading and use them to complement and direct, rather than replace, reading in the strong sense.

A provisional answer to the question what metadata are good for, then, might say that metadata give you Morgenstern's spectacles. They let you condense not only a single text, but in a sufficiently ample environment they let you condense arbitrarily large sets of texts. And if you employ visualization techniques, an increasingly powerful digital tool, these condensed representations can be displayed as if they were locations on some map. Just as white space in a book with good layout maps the terrain of the pages and orients readers before they actually "read", so metadata, when "laid out" in the right way, can provide readers with a simultaneous overview of many books and direct their attention to areas where it would pay to read closely. That is the promise of Franco Moretti's "distant reading." But the sweeping vistas are made possible by metadata gathered, extracted, and processed through tediously explicit routines.

Monk in detail

Now for a more detailed look at Monk. You encounter it as a service that lets you perform various operations that range from straightforward look-ups to complex statistical routines, where you can make sense of the output but do not really understand how it got to you. An example of a look-up would be searching for 'child' in English and American novels between 1780 and 1830 and being able to group and sort the result set, which runs to several thousand hits. In a statistically driven routine you might begin by selecting texts or parts of text about which you make the holistic judgment that a particular quality (e.g. sentimentality) is conspicuously present or absent in them. The machine then extracts prominent low-level linguistic features from your positive selections and looks for other texts where these features are prominent as well, on the assumption that if a given combination of low-level features adds up to 'sentimentality' here it will also do so there.

Strictly speaking, Monk is a bundle of services that are quite distinct from a particular collection. But since as a user you never encounter the services independently of some text collection(s), I will use 'Monk' loosely to refer to the environment in which you perform Monk operations on a particular set of texts. In the Monk prototype (Monk I hereafter) you encounter an "L-shaped" collection, where the vertical part of the L consists of some 700 works of English and American fiction from about 1550 to 1923 (the cut-off date for copyright), and the horizontal part consists of some 500 texts selected from different genres and published between the birth of Queen Elizabeth (1533) and the death of King James (1625).3

This L marks the contours of a larger space that you may want to think of as a matrix in which every text occupies a spot defined by its specific time and genre coordinates. That is a very crude way of locating the position of a text in a document space, but it is good enough to make the central point that the query potential of a text in the Monk environment has a lot to do with the fact that it can be located precisely and manipulated flexibly in a capacious and complex document space. Conceptually this is no different from reading a text in context, but the digital environment changes by orders of magnitude the size of the context and speed of contextualization. Think of both a telescope and a kaleidoscope.

The Monk I collection will grow as we fill in the empty space marked by the contours of the L. Other instances of Monk will use different collections, including possibly collections in other languages.

The source texts from a scholarly perspective

The following paragraphs apply in their particulars only to the selection of texts in Monk I. But beyond those particulars they address general questions that have to be addressed in any Monk implementation. As a scholarly user you will want to know how trustworthy the source texts are for your chosen purpose.

The L-shaped collection includes sizable numbers of texts from large collections and may in addition include individual texts or some smaller collections. The large collections are

1. The Text Creation Partnership (TCP), an archive of ~13,000 texts published between 1470 and 1700 in England
2. The Chadwyck-Healey Eighteenth-Century Fiction Collection (~100 novels published between 1700 and 1780)
3. The Chadwyck-Healey Nineteenth-Century Fiction Collection (~250 novels written in England between 1780 and 1900)
4. The Chadwyck-Healey Early American Fiction archive (~100 novels written in America between 1780 and 1851)
5. The Wright American Fiction Archive (~1,000 American novels published between 1851 and 1875)

The texts in these collections are not scholarly, let alone critical, editions, but they are good enough for most analytical purposes that are supported by the Monk environment, and in their editorial practices they have much in common. For the TCP texts these editorial practices are quite well documented. For the other collections they can be inferred from the results or from what is known about industry practices.

Each text is a diplomatic transcription of an early print edition into a digital surrogate. In many cases the transcribed text is the first or only edition. The TCP texts have been transcribed from page images based on microfilms of the originals.4 Most digital transcriptions are outsourced to vendors who employ people in developing countries. If you look at the TCP instructions for the work you see instantly that it requires considerable education, knowledge, and discipline to do the work well. Texts are double keyboarded by two individuals following the same instructions. Their output is collated by a machine, and differences are identified as errors and corrected. In the TCP project, the files returned by the vendors are spot-checked for accuracy, defined as an error rate of 0.1%.
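
The machine collation of two keyings can be pictured with a short Python sketch. It uses the standard library's difflib to compare the two versions word by word and to report every place where they disagree. The sample sentences and the function name are invented, and the vendors' actual collation software is not described in the sources, so this is an illustration of the principle only.

```python
from difflib import SequenceMatcher


def collate(first_keying, second_keying):
    """Return the places where two transcriptions of the same text disagree."""
    a = first_keying.split()
    b = second_keying.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # only disagreements need a human decision
    ]


# Invented example: the two typists disagree about one word.
differences = collate("the quick browne fox jumps", "the quick brown fox jumps")
print(differences)
```

The premise of double keying is that two typists are unlikely to make the same mistake in the same place, so every disagreement flags a probable error in one of the two versions.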

There is no reason to doubt that this process of transcription is intelligently and accurately executed, but the description also makes clear that we are not in a world of individually edited texts, licked into shape for nine years by a mother bear, as poems ought to be, according to Horace. On the other hand, one can say of the transcriptions that they are pretty faithful versions of the words in sequence that people at the time actually read. More importantly, the degree of textual variance that becomes a legitimate area of concern in a scholarly edition is much smaller than the margin of error that is part of the analytical routines in Monk. If, for instance, the Gutenberg text of Ulysses were put into a Monk environment, some Joyce scholars would be indignant at the choice of so obviously deficient a text, when there are better and more fully curated texts. Assuming that the Gutenberg text is a good transcription of the 1922 original (a reasonable assumption), there are three ways of answering this complaint:

1. The 'better' text is not freely available
2. The errors in the chosen text do not matter to the kinds of analytics Monk can deliver
3. Whatever the errors (if indeed they are all errors), the text in Monk is the version that most readers of Joyce have read, at least well into the nineties. The allegedly 'bad' text has an authority of its own that derives from its reception.

The digital transcription typically adds bibliographical metadata and 'explicitates' the white-space or other markup of the source text with structural markup. The mark-up language of the Text Encoding Initiative (TEI), which is an XML language, is most commonly used for that purpose. Marginally different flavors of TEI are used for the transcriptions in the TCP, Early American Fiction, and the Wright Archive. The 18th and 19th century fiction archives were originally encoded in a somewhat different mark-up language, but TEI conversions of them do exist.

The different archives differ somewhat in how they define the limits of transcription. The TCP texts, for instance, provide a full transcription of every title page and replicate much of it in the bibliographical header. The Chadwyck-Healey texts do not transcribe the title page. The TCP texts pay a great deal of attention to typographical detail and use 'character entities' to preserve many short cuts that originate in manuscript culture and survived well into the first century of print culture.5

There is a possibility that the Monk prototype will include a selection of late nineteenth- and early twentieth-century novels that have been transcribed in Project Gutenberg, in particular novels by Henry James, Edith Wharton, Joseph Conrad, Virginia Woolf, and James Joyce, a goodly chunk of the early modernist canon. Many of these texts have been very conscientiously transcribed, and they have gone through several rounds of formal proofreading organized by the Distributed Proofreaders Foundation, a volunteer organization. Considered as diplomatic transcriptions of contemporary editions, these individually transcribed texts appear to be no worse than the texts in the project-based archives. But they lack bibliographical headers as well as any structural mark-up. There is no agreement yet on what are in the longer run the most effective ways of integrating such texts into a Monk environment.

Value added to source files in Monk

What happens to source files when they enter into the Monk environment? Remember that all operations in Monk ultimately depend on the three levels of metadata and their interaction. Thus every document in Monk must be modeled as a triple-decker with metadata at the top level (bibliographical description), at the mid level of structural articulation, and at the bottom level of individual word occurrence. And for these metadata to be useful it must be possible to use them for comparative and analytical operations across texts and collections. The texts may be apples and oranges, but the metadata must all be 'orapples'.

Many texts have metadata at the top and midlevel, but some do not. If the texts have been encoded in TEI, consistency at the top level may almost be taken for granted. Consistency at the midlevel is intrinsically harder to achieve. None of the texts enter Monk with metadata at the bottom level. This is a great blessing because it is easier to achieve consistency if you can start from scratch.

The technical details of metadata at the bottom are fairly complicated. They are of interest to you as an end user only to the extent that they clarify what operations you can or cannot perform. Broadly speaking, if it is not in the metadata, it cannot be searched. It is also the case that if you want to take advantage of the opportunities Monk offers you must have a clear understanding of what is in those metadata and how they can be combined for various analytics.

As said before, the provision of metadata at the level of word occurrence is an 'explicitation' of the tacit knowledge that readers bring to the task of construing the sentences on the page. What you need to know about this explicitation is quite satisfactorily modeled by pretending that it is pretty much like cataloguing a book, except that instead of cataloguing the book as a whole you catalogue each word of it separately.

There are two advantages to this make-believe. First, every likely user of Monk understands what a catalog record looks like and what you can do with it. And secondly, if you model the process of bottom-level metadata creation on the process of top-level cataloguing you draw attention to the intimate relationship between metadata at the top and at the bottom. In the Monk environment, operations that involve the combination of top and bottom level metadata will typically be more important than operations based on midlevel metadata.

The reasons for this are quite simple. Every document can be described consistently at the top level in terms of who, what, when, and where, the categories of author, title, date, and place that are the essential pieces of a bibliographical record. And at the bottom every text document can be described as a 'flat' structure or stream of words moving from the first to the last through the minimal discursive structure of sentences. But in the middle things get tricky. Texts are diverse: the structural markup of a text is always a model of the muddle in the middle. A particular model may be quite adequate for a particular text, and for some texts, especially plays, there are models that work well across a range of texts. But there can never be a uniform model that works across a structurally diverse collection in the same way in which you can have a uniform model for bibliographical description at the top and for the stream of words at the bottom.

Let us pretend, then, that the 'tokenizer', the computer program that explicitates the bottom-level metadata from the stream of words, is a librarian who catalogues every word as if it were a new accession to the collection, which indeed it is. And for every word the librarian creates something like a word level MARC record:

Word address: A14876-00025
  In this compound identifier the first part refers to the catalogue number of the document and the second is a word counter. Through the first part, the word inherits all the properties associated with the document as a whole. Through the second number, its neighbors to the right and left are unambiguously established.

Sentence boundary: no
  If sentence boundaries are explicitly marked, you can capture sentences as objects of analysis. Implicit in the records with a 'yes' value are the number of sentences and their length.

Spelling: louyth
  The spelling as it occurs in the document.

Standard spelling: loveth
  The standard spelling of this form in a modern text. In this case the word form itself is archaic, but this is the spelling you would expect in, say, the King James Bible.

Part of speech: vvz
  This combines four pieces of information and says that the word is the third person (1), singular (2), present (3) of a verb (4). #4 is implicit in #3 because only verbs can have tense.

Lemma: love
  This is the dictionary entry form of the word. The information recorded here explicitly is implicit in the combination of the standard spelling and part of speech.

And so on for however many words there are in the document, whether 100 or two million.

In Monk I there will be approximately 250 million records of this type. All information in Monk comes from these records and from what they 'inherit' either from the top level catalogue record or from whatever information is available at the mid level.6
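
The word-level catalogue record described above maps naturally onto a small data structure. The Python sketch below is only a way of picturing the record, not Monk's actual storage format. The class and field names are invented, and it assumes that the address always has the form "document number, hyphen, word counter".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordRecord:
    """One hypothetical catalogue entry for a single word occurrence."""
    address: str              # compound identifier, e.g. "A14876-00025"
    sentence_boundary: bool   # True if a sentence boundary is marked at this word
    spelling: str             # the spelling as it occurs in the document
    standard_spelling: str    # the standardized spelling
    pos: str                  # part-of-speech tag
    lemma: str                # dictionary entry form

    @property
    def document_id(self):
        """First part of the address: the catalogue number of the document."""
        return self.address.rpartition("-")[0]

    @property
    def position(self):
        """Second part of the address: the running word counter."""
        return int(self.address.rpartition("-")[2])


record = WordRecord("A14876-00025", False, "louyth", "loveth", "vvz", "love")
print(record.document_id, record.position)
```

Because the document number is embedded in the address, everything known about the document can be looked up from any one of its words, and the word counter identifies its neighbors as positions 24 and 26.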

Word properties inherited from mid-level metadata

While mid-level metadata are intrinsically problematical, there is at least one distinction that cuts across the entire collection. In all texts of the Monk prototype lines of verse are explicitly marked by being enclosed in <l> tags, and it may well be reasonable to require that in any text that enters a Monk environment the distinction between verse and prose should be explicitly marked, at least if the text includes substantial amounts of verse.

If a word with a specified address occurs in a text segment that is marked as verse, it inherits the verse property from the mid-level tagging, and its prosodic status is implicitly known. It may even be explicitly recorded at the word level, as if our imaginary word level catalog entry included a 'prosodic status' property.
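
The inheritance of the verse property can be sketched in Python with the standard library's XML parser. The sample document is invented and far simpler than a real TEI file, and the function name is hypothetical. The only assumption carried over from the text is that lines of verse are enclosed in <l> tags.

```python
import xml.etree.ElementTree as ET

# Invented miniature document: one prose paragraph and one line of verse.
SAMPLE = "<div><p>Hamlet speaks</p><l>To be or not to be</l></div>"


def words_with_verse_flag(elem, in_verse=False):
    """Yield (word, is_verse) pairs; a word is verse if any enclosing element is <l>."""
    in_verse = in_verse or elem.tag == "l"
    for word in (elem.text or "").split():
        yield word, in_verse
    for child in elem:
        yield from words_with_verse_flag(child, in_verse)
        # Text after a child's closing tag belongs to the current element.
        for word in (child.tail or "").split():
            yield word, in_verse


flags = list(words_with_verse_flag(ET.fromstring(SAMPLE)))
print(flags)
```

Each word picks up the property from its enclosing tag without anything being written on the word itself; recording the result in the word record would make the inherited property explicit.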

Subject information in the top level metadata

Even so minimal a classification as a division of texts into poetry, fiction, drama, and prose has considerable utility. For many texts this primitive classification follows from their inclusion in a particular collection (Early American Fiction), but in a cross-genre collection such as the Monk prototype it needs to be spelled out to be useful. More granular forms of text classification can be created by mapping the basic bibliographical data of a given text against a Library of Congress catalogue, which typically includes a considerable amount of subject information. The TCP texts already include substantial information of this kind in their headers.

More granular forms of text classification are likely to be most useful with the catch-all category of 'prose', which differs little from 'other'. Whether such additional granularity is worth the effort remains to be seen.

The sex of an author is not specified in bibliographical headers and cannot always be inferred from the name (George Eliot). It is, however, an easy thing to add even to quite a large collection, and it is well worth doing because you add considerable query potential if you can classify authors by sex as well as by date and genre.

A tale of two catalogues

An enormous amount of information hides in the interactions of our triple-decker metadata model. For convenience's sake let us ignore the middle level and assume that we have imported into the word catalog the verse/prose distinction, which is the only feature that cuts across all texts in the collection and may be expected to be recorded with reasonable consistency. The triple-decker becomes a double-decker, and working within it means shuttling between its two levels. The approximately 1,000 records of the document catalogue contain metadata about author, title, date, place of publication, and genre. The 250 million records of the word catalog record location, sentence boundary, spelling, standard spelling, part of speech, lemma, and prosodic status. The story of Monk is a tale of two catalogues.

Notice that the Monk environment is not actually structured in this way. But because you, the modal Monk user, are probably savvy in the ways of getting information out of a library catalog but not knowledgeable about the data architecture of a software system, it may be helpful to describe Monk procedures as if they were library catalog queries. The critical thing to keep in mind is that in these queries you may chain criteria from the two catalogues into arbitrarily complex queries. Also remember that the more complex the query the longer it will take to execute, and not everything worth doing can be done in "web time."7
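
The make-believe of two catalogues can be acted out with two small database tables. The Python sketch below uses the standard library's sqlite3 module. As stated above, Monk is not actually structured this way; the table names, column names, and all records are invented solely to show how criteria from the two catalogues can be chained in one query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (doc_id TEXT PRIMARY KEY, author TEXT, year INTEGER, genre TEXT);
CREATE TABLE word (doc_id TEXT, position INTEGER, spelling TEXT,
                   lemma TEXT, pos TEXT, verse INTEGER);
""")
# Invented records for two imaginary documents.
conn.executemany("INSERT INTO document VALUES (?, ?, ?, ?)", [
    ("D1", "Author A", 1590, "drama"),
    ("D2", "Author B", 1850, "fiction"),
])
conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?, ?, ?)", [
    ("D1", 1, "louyth", "love", "vvz", 1),
    ("D1", 2, "loue", "love", "n1", 0),
    ("D2", 1, "love", "love", "n1", 0),
    ("D2", 2, "child", "child", "n1", 0),
])


def lemma_counts_by_genre(conn, lemma, before_year):
    """Chain a word-level criterion (lemma) with document-level ones (year, genre)."""
    rows = conn.execute(
        """SELECT d.genre, COUNT(*)
           FROM word AS w JOIN document AS d ON d.doc_id = w.doc_id
           WHERE w.lemma = ? AND d.year < ?
           GROUP BY d.genre ORDER BY d.genre""",
        (lemma, before_year),
    ).fetchall()
    return dict(rows)


early = lemma_counts_by_genre(conn, "love", 1700)
all_texts = lemma_counts_by_genre(conn, "love", 1900)
print(early, all_texts)
```

The join on the document number is the "shuttling" between the two levels: the lemma comes from the word catalogue, while the date and genre come from the document catalogue.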

Before describing actual procedures in Monk I and some possible procedures to be implemented in subsequent versions it may be helpful to look at some very broad functionalities that are drawn upon in virtually everything you do in Monk.

Counting, grouping, and sorting or All about lists

One of the most famous computer languages is called 'lisp', which is an acronym for 'list processor.' This is an inspired name, and one may wonder whether the lingering unease about computers in the humanities world would dispel more quickly if the animal had been called by a different name from the beginning. The name 'computer' points to counting as a primary activity. "List processor" points to counting as an essential but subordinate tool for making and keeping lists. To-do lists and laundry lists are part of our daily lives. All scholars work with bibliographies. 8 Help with lists is always welcome, and many functionalities of Monk flow from the fact that it wants to be your friendly List Helper.

As Mosteller observed, computers, unlike people, can count high, fast, and accurately. It is no big deal for a not especially powerful computer to keep track of 250 million records and count how many there are of this or of that. When the data sets get large, it may be helpful to 'precompute' some things. You may think of this as another form of explicitation. Thus summary data about words in each document, whether grouped by lemma, part-of-speech tag, prosodic status or whatever, may be computed once and stored for later use rather than computed "on the fly" whenever they are needed. And if your documents run in the thousands or tens of thousands, it may make sense to store summaries of summaries.
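A minimal sketch of precomputation, assuming a word catalogue reduced to (document, lemma) pairs. The function names and the sample records are invented for illustration; the point is only that per-document counts are computed once, and that collection-wide counts (a "summary of summaries") are then built from those counts rather than from the raw records.

```python
from collections import Counter

# Toy word catalogue: (doc_id, lemma) pairs. Values are invented for illustration.
word_records = [
    (1, "child"), (1, "liberty"), (1, "child"),
    (2, "child"), (2, "street"),
]


def summarize_by_document(records):
    """Compute lemma counts once per document so later queries need not rescan the records."""
    summaries = {}
    for doc_id, lemma in records:
        summaries.setdefault(doc_id, Counter())[lemma] += 1
    return summaries


def summarize_collection(doc_summaries):
    """A summary of summaries: collection-wide counts built from the per-document counts."""
    total = Counter()
    for counts in doc_summaries.values():
        total.update(counts)
    return total


doc_summaries = summarize_by_document(word_records)
collection_summary = summarize_collection(doc_summaries)
print(doc_summaries[1]["child"], collection_summary["child"])
```

The same pattern works for any grouping key (part-of-speech tag, prosodic status, and so on) by replacing the lemma in each pair.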

The details of this need not concern you as an end user, but the point to keep in mind is that while 250 million words sounds like a lot of things to count, it does not pose serious problems for computers that are within reach of a humanities cyberinfrastructure. And while a billion words may still sound like a lot today, it will not be a lot three or five years down the road.

For all practical purposes there are few limitations on the kinds of things that can be counted or compared with regard to quantity in a Monk environment. And while humanists typically are not mathematically inclined, they are like everybody else in basing their arguments on the fact that there is more of this here and less of it there. Computers are extremely useful in telling you how much more there is and helping you figure out whether it matters.

As said above, counting is subordinate to list keeping, and Monk pays a great deal of attention to the grouping and sorting that are fundamental aspects of list management. One of the most famous articles ever published in Psychology is George Miller's "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information." Miller argued that up to a certain point we can take in a manifold intuitively or at once, but beyond that we must use various divide and conquer strategies to make sense of things. Once we have gone beyond the Magical Number Seven we have entered the World of Lists.

This has profound implications for interface design when it comes to returning result sets. A search will return one or more results. If the result set stays within Miller's magic number you take it in at a glance. If it gets larger you need to process it in some explicit way, sometimes rearranging the sequence in your mind. Once the result set exceeds a dozen, give or take two, the processing time becomes noticeable. If you have hundreds or thousands of results you are at a loss. Some search engines intervene when there are many results and ask you to refine your search. That is a reasonable assumption if the search is of a "needle in the haystack" sort. But if I look at the uses of 'child' in eighteenth-century fiction, I am not looking for a needle; I am looking for a pattern.

This is where 'group and sort' comes in handy. A search for 'child' retrieves well over a thousand hits in a KWIC output. It is of little use to me that the search may return the hits within seconds if it then takes me hours to work my way through the raw list. But it helps if I can group the list by title, author, or decade and sort it by ascending or descending relative frequency. I may want to sort it by the word that precedes or follows, but in a large list it is likely that some collocations will be common. It may therefore be helpful to group such collocations and sort them in descending order so that the most common collocations appear at the top.9
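Grouping and sorting of this kind can be sketched as follows. The hit records, the field order, and the function names are assumptions made for the example, and the sort uses raw counts rather than relative frequency to keep the sketch short.

```python
from collections import Counter, defaultdict

# Toy KWIC hits for 'child': (title, decade, previous word). Invented for illustration.
hits = [
    ("Novel A", 1740, "poor"),
    ("Novel A", 1740, "dear"),
    ("Novel B", 1750, "poor"),
    ("Novel C", 1750, "poor"),
    ("Novel C", 1750, "only"),
]


def group_hits(hits, key_index):
    """Group hits by one field, e.g. title (index 0) or decade (index 1)."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[key_index]].append(hit)
    return dict(groups)


def collocates_by_frequency(hits):
    """Count preceding words; most common first, ties broken alphabetically."""
    counts = Counter(prev for _, _, prev in hits)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


by_decade = group_hits(hits, 1)
top_collocates = collocates_by_frequency(hits)
print(top_collocates)
```

With a thousand hits instead of five, the grouped and sorted list puts the dominant collocations at the top, so that the reader sees the pattern before reading a single KWIC line.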

Sampling is closely related to grouping and sorting, and it is fully supported in the Monk concordance.

Lists and statistics

If you are a proper humanist you will detest formal statistics and ignore the fact that life is a constant game of figuring the odds, forming expectations on the basis of observation, noticing the unexpected, and revising your expectations if it is repeated often enough. Probability judgements of an informal kind are built deeply into judgments of 'too much', 'too little', or 'not enough'. The statistical calculator that is probably hard-wired into us is conspicuously good at some things and conspicuously bad at certain kinds of risk assessment. But judgments about probability are a constant part of our lives.

Monk offers a number of statistical routines that can be seen as extensions of its mission as a List Helper. Some of them are invoked explicitly; others work implicitly, but all of them have much to do with lists, which may be their inputs or their outputs or both.

Because in Monk you often range across large data sets, and queries may produce large result sets, you will need to decide what, if anything, to make of the fact that there is quite a bit of this here and somewhat more of it there. In those situations it is helpful to know whether an observed difference does or does not stay within a range of expected variance. If it does not, you still have to decide whether this means anything, but if it does, you may think twice before building a case on it. At a minimum statistical routines are a useful form of insurance, but they also undergird the various discovery routines that are part of text mining.

All statistics boil down to the comparison of numbers, and they involve the comparison of two or more sets with regard to one or more variables. In its initial phase, the Monk environment focuses on statistical routines that compare two sets with regard to one or more variables. The two key procedures are Bayes' theorem and Dunning's log likelihood. Both of them have the advantage that the intelligent evaluation of results does not require a precise understanding of the underlying mathematics.

Bayes' Theorem

Bayes' theorem lets you classify data on the basis of past experience. Given what is known about Class A and Class B, it establishes the probability with which C should be classified as a case of A or of B. It was first applied to text analysis by Frederick Mosteller in his classic demonstration of the authorship of some of the Federalist Papers. While most of the papers can be attributed to Madison or Hamilton on the basis of external evidence, the authorship of 12 of the 85 papers was long in dispute. Mosteller and his collaborator David Wallace looked at the papers with known authorship and measured the difference in such low-level phenomena as sentence length and the use of 'while', 'upon', 'by' and 'from'. They then used Bayes' theorem to establish the probability with which the authorship of the unknown papers could be attributed to either Madison or Hamilton on the basis of what was known about the papers of undisputed authorship. The conclusion that Madison was the author of the twelve papers has been universally accepted, not least because it squared with a preponderance of non-quantitative evidence.

A modern and less elevated application of Bayes' theorem is the spam filter. Here you 'train' the spam filter by giving it texts that you declare to be spam. The program then looks for lexical or other patterns that the spam texts have in common, and depending on the probability threshold that you set it will predict which of your incoming emails are spam.

In the Monk environment you want to rule things in rather than out, but the underlying algorithms are the same. Let us say you want to look for texts that are 'sentimental'. You choose a set of texts that for whatever reason you judge to rank high on a sentimental scale. And to improve your results you choose another set of texts that rank low on the sentimental scale. The two sets are your training corpus. Once the program has analyzed the properties of the training corpus, you let it range across a collection of texts and rank them on the sentimental scale.
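The training-and-ranking procedure described above can be sketched with a small naive Bayes classifier. This is one common way of applying Bayes' theorem to word counts, chosen here for illustration; the text does not say which variant Monk uses. The labels, the training texts, and the add-one smoothing are all assumptions of the example.

```python
import math
from collections import Counter


def train(labelled_texts):
    """labelled_texts: list of (label, list_of_words). Returns word counts and document counts per label."""
    word_counts = {}
    doc_counts = Counter()
    for label, tokens in labelled_texts:
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
    return word_counts, doc_counts


def log_scores(tokens, word_counts, doc_counts):
    """Log of (prior times likelihood) for each label, with add-one smoothing."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        label_total = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokens:
            score += math.log((counts[token] + 1) / (label_total + len(vocabulary)))
        scores[label] = score
    return scores


# Invented training corpus: one text judged high and one judged low on the sentimental scale.
training = [
    ("high", "tears sigh heart tender tears".split()),
    ("low", "ledger price cargo price ship".split()),
]
word_counts, doc_counts = train(training)
scores = log_scores("tender heart tears".split(), word_counts, doc_counts)
best = max(scores, key=scores.get)
print(best, scores)
```

To rank a collection rather than classify a single text, one could sort the texts by the difference between their 'high' and 'low' scores.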

Dunning's log likelihood ratio

Dunning's log likelihood ratio is descriptive rather than predictive. It lets you compare an 'analysis corpus' with a 'reference corpus'. It looks at all the word tokens in the analysis corpus, compares them with the words in the reference corpus and produces a list of words that are statistical outliers by being either unusually common or rare in the analysis corpus. The procedure can be seen as a form of automatic keywording or profiling of a text. The list produced by it is often interesting precisely because the words on it are so ordinary. If you compare Julius Caesar with Shakespeare's other tragedies, 'she' is by far the biggest outlier, and it is underused. Overused words are 'man', 'countryman', 'today', 'mighty', 'do', 'street', 'run', and 'honorable'. One can make quite a bit of this list, and generally the list of over- and under-used words in a given text circumscribes major thematic areas with considerable precision.

You need not take the spelling or lemma of a word token as your point of departure. You can see it as an instance of a word class. If you do this for Spenser and Shakespeare, you see that adjectives are astronomically more common in Spenser, which may have more to do with the differences between narrative poetry and drama than with the individual authors. But it is a very striking finding.
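For readers who want to see the arithmetic, here is a minimal sketch of the log likelihood statistic for a single word (or word class), computed from a two-by-two table of observed counts against the counts expected if the word were equally frequent in both corpora. The function names and the sample counts are invented; this is the standard form of the statistic and is not taken from Monk's code.

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 for an item seen count_a times in size_a tokens versus count_b times in size_b tokens."""
    observed = [
        [count_a, size_a - count_a],
        [count_b, size_b - count_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - count_a - count_b]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if observed[i][j] > 0:  # a zero cell contributes nothing to the sum
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * g2


def direction(count_a, size_a, count_b, size_b):
    """Say whether the item is over- or underused in the analysis corpus (corpus a)."""
    return "overused" if count_a / size_a > count_b / size_b else "underused"


# Invented counts: 50 occurrences in a 1,000-word analysis corpus,
# 10 occurrences in a 2,000-word reference corpus.
score = log_likelihood(50, 1000, 10, 2000)
print(round(score, 1), direction(50, 1000, 10, 2000))
```

A score near zero means the item is about as frequent in one corpus as in the other. Computing the score for every word and sorting in descending order yields the list of outliers.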

Other statistical routines

(More is needed here about cluster routines that have been used in nora and multivariate forms of analysis. Forms of it (principal component and discriminant analysis) are widely used in literary and linguistic computing for use cases that lie well within Monk, e.g. the distance between different novels with regard to known or unknown criteria. My hunch is, however, that these techniques are difficult to use intelligently unless you have a firm mathematical background, and the graphic representations of results often look a lot more striking than the associated p-values suggest. It may well be a criterion for a statistical routine used in Monk that non-technical users can clearly understand what goes on in a general way and what the results mean even if they cannot follow the math.)

The Monk Lexicon

The Monk Lexicon is a summary of the information in the two catalogues, but with regard to lemmata it also contains information that is context-independent and therefore does not need to be recorded in the word catalogue. The Lexicon is an important gateway for queries in which the text is modeled as a 'bag of words' and word order is ignored. In many ways the lexicon is a very high tech version of the Biblical concordance invented by medieval monks. It supports the activity of going from the word here to the words there, which lies at the heart of concordance technology. It lets you go from any lemma to all of its occurrences in any form, and it uses both 'group and sort' routines and visualization to help you contextualize a given lemma, word form, or spelling at the desired level of granularity.

If Monk I grows beyond its first collection to a target size of a billion words, the Monk Lexicon will for some purposes be superior to the OED. A particularly wonderful part of the OED is the word histories that were constructed by judiciously chosen quotations. But this virtue is also a weakness. For coverage before 1900, the OED is a collection of word histories as learned Victorians thought they should be written. The Monk Lexicon, on the other hand, starts from and keeps count of actual usage. To the extent that the underlying collection is biased, the bias is reflected in the lexicon. But if the underlying collection is big enough and broadly representative, the Monk Lexicon will in many cases offer a better guide than the OED to changing usage over time. The lexicon for Monk I contains about 100,000 lemmata, including personal and place names.

Variable displays subject to Miller's magic number seven

Like any other dictionary, the Monk Lexicon is organized by lemma, and for every lemma it presents the information in a way that reflects a concern with Miller's "magical number seven" as well as the document and collection frequency of the lemma.10 Words are very unevenly distributed across texts. For instance, in a very large collection of 607 million word occurrences, the 1501 spellings with more than 30,000 occurrences account for 480 million or almost 80% of total occurrences. Of the 2.1 million distinct spellings, 1.1 million occur only once, and 1.8 million occur seven times or less. What is true of spellings is true of lemmata, with minor changes in the proportions.

The Monk Lexicon organizes the default display of a lexical entry according to the combination of document and collection frequency, which is the most fundamental piece of information about a lemma: Lemma A occurs x times in y documents. The Lexicon assumes that where a result stays within Miller's magical number seven there is nothing to group or sort. The user is best served by the raw list presented in chronological order.

There is a considerable difference between a lemma that occurs 25 times in seven documents, and a lemma whose 25 occurrences are restricted to a single document.

Where lemma occurrences exceed the magic number seven and enter the world of lists, there are displays that offer different levels of summarization. At the most general level, you combine the genre information from the document catalogue and the count information from the word catalogue and arrive at something like this:

Lemma: X

           docfreq   colfreq   per 10K
total         50       372      16.8
poetry        46       357      24.4
drama          4        15       2
fiction        0
prose          0
A considerable amount of orientation is provided by this table before you read the first KWIC line: you learn that this lemma occurs commonly in poetry, occasionally in drama, and never in fiction or prose, from which you can infer that the drama occurrences are almost certainly in verse. (A pie chart might be a better way to present this information.)
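The three columns of such a table (document frequency, collection frequency, and frequency per 10,000 words) can be derived as in the following sketch. The record layout, the function name `lemma_profile`, and the genre sizes are invented for the example and do not reproduce the figures in the table.

```python
from collections import defaultdict


def lemma_profile(lemma, occurrences, genre_sizes):
    """occurrences: list of (doc_id, genre, lemma); genre_sizes: total word count per genre.

    Returns {genre: (docfreq, colfreq, per_10k)}.
    """
    docs = defaultdict(set)
    counts = defaultdict(int)
    for doc_id, genre, occ_lemma in occurrences:
        if occ_lemma == lemma:
            docs[genre].add(doc_id)   # distinct documents containing the lemma
            counts[genre] += 1        # every occurrence of the lemma
    profile = {}
    for genre, size in genre_sizes.items():
        colfreq = counts[genre]
        per_10k = round(10000 * colfreq / size, 1) if size else 0.0
        profile[genre] = (len(docs[genre]), colfreq, per_10k)
    return profile


# Invented sample data.
occurrences = [
    (1, "poetry", "x"), (1, "poetry", "x"), (2, "poetry", "x"),
    (3, "drama", "x"), (3, "drama", "y"),
]
genre_sizes = {"poetry": 20000, "drama": 10000, "fiction": 50000}
profile = lemma_profile("x", occurrences, genre_sizes)
print(profile)
```

Normalizing by genre size is what makes the last column comparable across genres of very different extent.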

In a second, temporally oriented display, the Monk Lexicon uses charts familiar from quarterly earnings reports. Fiction, poetry, drama, and prose report their quarter-century frequencies per 10,000 words as if they were a measure of market share. The potential of this feature cannot be fully explored in Monk I because cross-genre data will be available only for the period 1533-1625.

In a third display, you move from the lemma to the spellings and word forms that are bundled under it. The Lexicon tabulates all spellings of all morphosyntactic conditions of a lemma together with document and collection frequencies, so that you can survey the distribution of orthographical and morphosyntactic variance.

You can go to KWIC output from any of the data points in these different tables, and the default KWIC output is governed by the document and collection frequency for the phenomenon in question. There is no grouping and sorting wherever you stay within Miller's magic number seven.

(There is more to be said about this table)

Baskets of words

It will often be useful to gather information about the behaviour of similar or contrasting words, e.g. liberty, freedom, slavery, servitude. In Monk I you can plot the frequency history of a basket of words, using either the total or following a particular genre.

Information added to the Lexicon at the lemma level

Information about the etymology or sound of a lemma is context-independent. While information about names is context-sensitive, the part-of-speech tagging in Monk cannot disambiguate between the referents of a name but only determine that it is a name. These three types of information are therefore most conveniently stored as attributes of a lemma and its parts considered independently of context. But they can be easily imported into the word catalogue and become part of the metadata double-decker on which all queries depend.

Etymology

It would be relatively simple to map each lemma to basic information about its etymology. The point of this is not to do something that the OED does much better. It is rather to create a simple tool for measuring stylistic register, and at least 80% of the work for this tool is done by the simple distinction English/Latin, or more accurately by the coding of Latinate words that entered into English in the early modern period. Distinguishing between Scandinavian and Anglo-Saxon origins (skirt/shirt) does little for that purpose. Nor does the distinction between Anglo-Saxon and Anglo-Norman help. Words like 'beef', 'veal', and 'jail' do not operate in a stylistic register that differs from plain English. Latinate words do, and so perhaps do French words.11

Phonetic properties

Just as you can map a variant spelling to a standard spelling so you can map a spelling to a phonetic representation of it. If you have a phonetic representation of every spelling in the corpus you can use it to explore the sonorities of particular texts. The Italians use 'tintura' or 'coloring' to refer to the distinctive registers and orchestral or vocal timbres that establish the dominant mood of an opera. Can one measure the 'tintura' of a poem?12

Spelling out a poem in the phonetic alphabet is not likely to get you very far with that. But what if you translate phonetic values into colors so that you can 'see' the sonorities of a poem as a variably tinted background? And if this works at the level of the individual line, can you group the sonorities of individual lines, map larger sections, or compare whole poems?

The problem of allophones requires notice in this context. Take spellings like 'chaunce', 'daunce', and 'daunger', which strongly suggest that the vowel was once pronounced as it is today in the French 'danger'. So those words may have had darker sonorities for some speakers than they have for modern speakers, and 'chance' and 'dance' are brighter in American than British English pronunciation. But there is not much one can do about this.

Names

Information about names is by far the most important information that is added at the lemma level. POS tagging can distinguish proper from common nouns, but tag sets do not distinguish between place names and personal names. The distinction between place names and personal names is of great interest to many inquiries, the distinction between different types of personal names less so.

Place names and personal names are not easily distinguished unless one decides (perhaps not unreasonably) to mark as a place name every name that originated as such. Essex, Surrey, Washington and similar names clearly are place names by that criterion. Within the collections of Monk I, place names are often used as personal names. The reverse (Rachel, Nevada; Elizabeth, NJ) is less common, probably quite rare before 1850, and almost never found in a European context.

Whatever information is attached to a lemma at the lexicon level can travel back to the lemma's occurrences and be used for analytical purposes. If GIS coordinates are added to place names at the Lexicon level, their occurrences in a work may point to distances actually traveled. Or they may help in the mapping of a geography of desire. GIS coordinates could be used to "lemmatize" place names and retrieve, for instance, novels in which place names from the South of France often appear.

There are several problems that need addressing. The same name may be used for different places. This is very common in American texts, much less common in European texts. In the great majority of cases one location dominates.

In older texts, the spelling of place names is not standardized, but place names are probably easier to map to standard forms than personal names or other words.

Since there is a magnificent atlas of the ancient Mediterranean, there are GIS coordinates for places that no longer exist (Carthage, etc). Whether these exist in a form that can easily be mapped to a list of ancient names is another question.

Many place names consist of two or even three words, and are not captured in the initial tokenization. It may make sense to identify them in a second pass through the Monk I corpus and make them part of the Monk Lexicon.

The lexicon as gateway to word based queries

Queries that ignore the order of words in a text are based on a text model that is often called 'bag of words' because the word tokens are like so many marbles in a bag. To say that the Monk Lexicon is the gateway to all such queries is something of a tautology. After all, if each text is a bag of words, the Lexicon is simply the bag of all the bags.

But it is not quite accurate to say that a Monk text is modeled as a bag of words. It is modeled as a bag of word occurrences or distinct events each of which can be "seen as" an instance of any of its properties or any combination of its properties. And the term lexicon is misleading if it suggests that a search that goes from a word (or a basket of words) to its occurrences has a privileged status, even though for many users it will remain the most common and certainly the most familiar search. More broadly speaking, the lexicon based search goes from any combination of properties associated with word occurrences and retrieves whatever meets the criteria of that combination. Here are some examples:

1. Adjectives that are only found in prose
2. Word classes that are disproportionately common or rare in drama when compared with other genres
3. Names that occur in both English and American novels but are more common in the latter
4. Words more often used by female than by male novelists
5. Words used by Jane Austen and Sir Walter Scott but in at most two other texts
6. Lexical differences between American novels written in the decade before and the decade after the Civil War
7. Words or word classes that are commonly found in novels with a lot of superlative forms (This is tricky and probably requires phrasal searches as well)

Thomas Mann defined the novelist as the "murmuring conjuror of the imperfect." This is a famous association of a central mission with a grammatical property. Do novelists differ in their use of the past tense? The ratio of past to present tense forms is easily derived from the Lexicon, but a skeptic might say that differences may have more to do with the relative frequency of dialogue than with narrative per se. Unfortunately, the distribution of narrative and speech cannot be analyzed in Monk I because speech is not explicitated in the mid-level metadata, and speech markers in the texts, quotation marks of various kinds, are not used with sufficient consistency to support its identification.

Phrases, collocations, patterns, and repetitions

The bag of words model is based on a huge "as if": it treats a text as if the order of words did not matter. It is a huge insult to a writer's ambition, but unfortunately it works: tell me your word frequencies, and I will tell you what kind of writer you are. There is no such insult in queries that respect word order: at least they pay attention to what the writer labored most about. On the other hand, they are much more complex things to handle, both for the developers and for the users.

The great German classicist Wilamowitz said that "once is never, twice is ever" (einmal ist keinmal, zweimal ist immer). Nonce words or hapax legomena are very common. In a collection of almost any size, about a third of the words will occur only once. But a phrase becomes a phrase only if it is repeated at least once.

Phrases in the strict sense are word sequences repeated exactly, and collocation strictly understood also refers to words that rub shoulders. But it can be used more loosely to refer to the co-occurrence of two or more words. As with all language phenomena, distinctions blur. Take the following: "it is (always| |sometimes|never|often|almost never|very rarely) the case." In the strictest lexical sense this is not a word sequence repeated exactly, but it is a phrase with a quite stringently defined variable.

Phrases can be defined at a more abstract grammatical level. Take this famous opening phrase: "Emma Woodhouse, handsome, clever, and rich." At the grammatical level this is a strict phrase of "adjective, adjective, conjunction, adjective," and it is an instance of the "three adjective rule" articulated by seventeenth century French writers and widely followed in sophisticated writing. Jane Austen says quite a bit about her subject and herself when she opens her novel with this phrase.

Mixed phrases combine lexical with grammatical constraints. You might be interested in the sequence of "adjective + (lad|lass)," and our first example could be rewritten as "it is + adverb + the case."
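A mixed phrase of this kind can be expressed as a list of position-by-position constraints, some lexical and some grammatical. The sketch below is an illustration under assumptions: the constraint format, the function names, and the tagged sample are invented, and of the tags only 'j' for adjective is taken from this document.

```python
def matches(token, constraint):
    """token: (word, pos). constraint: ('word', set_of_words) or ('pos', tag_prefix)."""
    kind, value = constraint
    word, pos = token
    if kind == "word":
        return word.lower() in value
    return pos.startswith(value)


def find_pattern(tagged, pattern):
    """Return every run of consecutive tokens that satisfies the pattern position by position."""
    n = len(pattern)
    found = []
    for start in range(len(tagged) - n + 1):
        window = tagged[start:start + n]
        if all(matches(t, c) for t, c in zip(window, pattern)):
            found.append([w for w, _ in window])
    return found


# Invented tagged text; 'dt' is a placeholder tag for a determiner.
tagged = [
    ("a", "dt"), ("bonny", "j"), ("lass", "n1"), ("met", "vvd"),
    ("a", "dt"), ("tall", "j"), ("lad", "n1"),
]
# adjective + (lad|lass)
pattern = [("pos", "j"), ("word", {"lad", "lass"})]
results = find_pattern(tagged, pattern)
print(results)
```

A purely lexical phrase is the special case in which every constraint is a word set with one member, and a purely grammatical phrase is the case in which every constraint is a tag.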

A phrase, whether lexical or grammatical, makes sense. By contrast, n-grams are segmentations of a text into sequences of arbitrary length. "Woodhouse, handsome" is a trigram, but not a phrase. N-grams that are repeated often are likely to be phrases of some kind, and they are the clever programmer's way of getting the machine to look for phrases or more generally for patterns. The odds of an n-gram being repeated exactly decline precipitously with its length, and empirically it appears to be the case that anything beyond an identical pentagram is likely to involve copying. Plagiarism detectors are based on this assumption.
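Segmenting a text into n-grams and keeping only the repeated ones takes very little code. The sketch below uses an invented sample sentence and invented function names. Punctuation is ignored here, which is a simplification.

```python
from collections import Counter


def ngrams(tokens, n):
    """All sequences of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngrams(tokens, n):
    """N-grams that occur more than once, with their counts."""
    counts = Counter(ngrams(tokens, n))
    return {gram: count for gram, count in counts.items() if count > 1}


tokens = "it is always the case that it is never the case".split()
repeated_pairs = repeated_ngrams(tokens, 2)
repeated_triples = repeated_ngrams(tokens, 3)
print(repeated_pairs, repeated_triples)
```

Even in this tiny sample the repeated two-word sequences survive while no three-word sequence is repeated, which illustrates how quickly exact repetition thins out as n grows.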

Information about lexical, grammatical, and mixed phrases is implicit in the way the word catalogue is kept. You can identify the next or previous address of any word address by adding or subtracting 1. Thus a Monk word may be said to know its immediate neighbours, who in turn know their immediate neighbours. But it appears to be a difficult and slow process to explicitate this knowledge and make it the basis for particular queries. So there is the question how to get at phrases, however defined, or more broadly at what I will call "sequence-sensitive word patterns" (SSWP).
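The address arithmetic can be illustrated with a toy catalogue keyed by integer address. The addresses, the words, and the function names are invented, and a real word catalogue would carry many more properties per address.

```python
# Toy word catalogue: address -> spelling. Addresses and words are invented.
catalogue = {1: "the", 2: "Earl", 3: "of", 4: "Rochester", 5: "spoke"}


def neighbours(catalogue, address):
    """Previous and next word of an address, found by subtracting or adding 1."""
    return catalogue.get(address - 1), catalogue.get(address + 1)


def phrase_at(catalogue, address, length):
    """Walk forward from an address to spell out a phrase of up to the given length."""
    words = []
    for offset in range(length):
        word = catalogue.get(address + offset)
        if word is None:  # ran past the end of the catalogue
            break
        words.append(word)
    return " ".join(words)


print(neighbours(catalogue, 3), phrase_at(catalogue, 2, 3))
```

Each lookup is cheap. The cost noted above arises when such lookups have to be repeated for every candidate address in a collection of hundreds of millions of words.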

Does Monk understand sentences and paragraphs?

The smaller the unit that is transformed into a bag of words the greater the likelihood that it will retrieve SSWPs. If a text is modeled as word bags of sentences, a search for sentences that contain 'Earl' and 'Rochester' will retrieve all instances of the phrase 'Earl of Rochester' with a tolerable noise level ('the Earl went to Rochester', 'while at Rochester, the Earl'). The noise level increases if we go to the paragraph.

If sentences and paragraphs are explicit 'chunks' in Monk, many types of SSWP can be treated as "small-bag-of-word" problems. If I am interested in adjectives that modify 'liberty' I may want to look for all instances of liberty where the previous word has the POS tag 'j'. If I look for sentences that contain 'liberty' and one or more adjectives, I will get all those results together with other stuff. But erring on the side of 'recall' may be better than erring on the side of 'precision' and often lets you stumble across good stuff you were not looking for. 13
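The trade-off between the two searches for 'liberty' can be made concrete with a small sketch. The tagged sentences and function names are invented; the tag 'j' for adjective follows the text, and the other tags are placeholders.

```python
# Toy tagged sentences: lists of (word, pos). Only the tag 'j' is taken from the text.
sentences = [
    [("sweet", "j"), ("liberty", "n1"), ("calls", "vvz")],
    [("liberty", "n1"), ("is", "vbz"), ("dear", "j")],
    [("he", "pn"), ("loved", "vvd"), ("liberty", "n1")],
]


def precise_hits(sentences, target):
    """Adjectives that immediately precede the target word (high precision)."""
    found = []
    for sentence in sentences:
        for i in range(1, len(sentence)):
            if sentence[i][0] == target and sentence[i - 1][1] == "j":
                found.append(sentence[i - 1][0])
    return found


def bag_hits(sentences, target):
    """Indexes of sentences containing the target and at least one adjective, in any order (high recall)."""
    found = []
    for index, sentence in enumerate(sentences):
        words = {word for word, _ in sentence}
        tags = {pos for _, pos in sentence}
        if target in words and "j" in tags:
            found.append(index)
    return found


print(precise_hits(sentences, "liberty"), bag_hits(sentences, "liberty"))
```

The sentence-level search also returns 'liberty is dear', which the strict previous-word search misses. That is the kind of find that erring on the side of recall makes possible, at the price of some noise.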

There has not so far been much discussion of how or whether to model sentences and paragraphs. But sentences and paragraphs are the fundamental units of text construction, and this is a topic worth more discussion.

Types of sequence-sensitive word patterns

Multi-word lexical units

There are many short phrases that are fixed lexical units and might as well be words: 'out of', 'according to', 'in vain', 'by way of', and hundreds of others. Some of them may be written as one or two words: 'in faith' or "i'faith" are the same lexeme. You find both 'insofar' and 'in so far'. In early modern texts, what we now call reflexive pronouns are written as two words and captured by a POS tagger as a sequence of a possessive pronoun and a noun: 'herself' vs. 'her self'. A sophisticated tokenizer and parser would recognize that sometimes a space is not a space and treat 'her self' and 'herself' as orthographical variants of a single token.

One can think of such phenomena as phrasal debris of no interest to anyone. Who cares about the distribution of 'in vain' 'by way of' or 'as far as'? Alternately one could think of them as quite powerful indicators of authorial or genre-based habits and worth keeping and counting precisely because of their empty banality. The second is almost certainly the better choice in any environment that wants to move beyond narrowly and pragmatically oriented goals of information retrieval.

Multiword names

Many names or name-like entities consist of more than one word: William Shakespeare, Archbishop of Canterbury, Wuthering Heights, King Henry, United Provinces, Star Chamber, Bank of England, India Office, to name only a few and not to speak of the Circumlocution Office. Name-based queries of one kind or another will form a big part of user interest in Monk. Name phrases of this kind therefore will need to be captured and catalogued in some form. And they need to support queries by type as well as by string, so that in addition to looking for 'Archbishop of Canterbury' you can look for texts in which these types of names are common or rare: their distribution within or across texts is clearly an important marker of social register.

Syntactic fragments

It is a big and open question to what extent syntactic fragments can serve as good enough proxies for inquiries into large-scale syntactic structure. The L-shaped collection in Monk I will offer considerable support for diachronic stylistic and syntactic studies even though it will be limited to the genre of fiction, but that support will be greatly increased once the empty parts of the L have been filled in and the larger collection supports queries that combine diachronic with cross-genre criteria.

An interest in the syntactic properties of text extends far beyond linguists. If your topic of inquiry is a particular genre, whether approached synchronically or diachronically, it is a good hypothesis that the parameters of a genre are determined as much by syntactic as by lexical habits and choices. If you have an interest in rhetoric, a field of growing interest in recent years, syntax matters as much as the lexicon. Shakespeare's Brutus lost to Mark Antony because of the way he put his words together.

So there has to be a way of getting at syntax in Monk. But full syntactic parsing of a large collection is out of the question because it cannot be done without a lot of human intervention. Can you get good enough information about larger-scale structures from the distribution of syntactic fragments that are easily captured by part-of-speech tagging? A syntactic fragment only makes sense in a wider syntactic structure, which may be reconstructed with reasonable accuracy from the fragments. How and at what length or level of abstraction would you have to capture fragments for them to be useful?

To begin with the question of abstraction, consider the following. The word catalog records a part-of-speech for every word occurrence. Thus the text can be modeled as a sequence of spellings or POS tags, and you get these equivalents:

John likes Mary          np1 vvz np1
John hit Mary            np1 vvd np1
Librarians like books    n2  vvb n2
John likes guns          np1 vvz n2

The four different syntactic fragments involve differences in type of noun (np vs. n), tense (vvd vs. vvb/vvz), person (vvz vs. vvb/vvd), and number (vvz vs. vvb/vvd, n2 vs. np1). If I disregard all those criteria I am left with the single fragment n v n. If I disregard the distinction between nouns and names, I do not in this sample reduce the number of patterns. If I only keep tense and person, I reduce the patterns by one.
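The counts in this paragraph can be checked mechanically. In the sketch below the four tag sequences are those of the example sentences, while the three simplifying functions are one possible reading of "disregarding" a criterion and are assumptions of the example.

```python
fragments = [
    ("np1", "vvz", "np1"),  # John likes Mary
    ("np1", "vvd", "np1"),  # John hit Mary
    ("n2", "vvb", "n2"),    # Librarians like books
    ("np1", "vvz", "n2"),   # John likes guns
]


def merge_names(tag):
    """Drop the noun/name distinction but keep number: np1 -> n1."""
    return "n" + tag[2:] if tag.startswith("np") else tag


def keep_verb_detail(tag):
    """Keep tense and person on verbs; reduce every noun or name to plain 'n'."""
    return "n" if tag.startswith("n") else tag


def word_class_only(tag):
    """Reduce every tag to its first letter: 'n' for nouns and names, 'v' for verbs."""
    return tag[0]


def distinct(fragments, simplify):
    """The set of distinct patterns left after simplifying every tag."""
    return {tuple(simplify(tag) for tag in fragment) for fragment in fragments}


for simplify in (merge_names, keep_verb_detail, word_class_only):
    print(simplify.__name__, len(distinct(fragments, simplify)))
```

Merging nouns and names leaves four patterns, keeping only the verb detail leaves three, and reducing everything to word class leaves the single fragment n v n.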

I can always abstract the simpler from the more complex patterns, and if storage or processing speed are of no concern one would keep all patterns. But if they are of concern, then the question arises what combination of minimal length and maximal abstraction will produce a catalogue of fragments that is both affordable and sufficiently expressive. What do we need to say things like "this is probably a novel by Henry James"?

The question of syntactic fragments is part of the broader question of sentences as an object of attention in Monk. The length and types of sentences can be easily determined by counting tokens and observing punctuation marks. They are no less valuable for being easily found. Variations of sentence length and type may be powerful measures of affect, and attention to them has a long rhetorical tradition (Attic/Senecan vs. Asiatic/Ciceronian). If one wants to posit "sentimental fiction" as a genre, it is highly likely that syntactic features are an important part of the mix that makes the reader say 'sentimental'.
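How little is needed to count tokens and observe punctuation can be shown in a short sketch. It assumes plain text, whitespace-separated tokens, and sentences that end in a full stop, question mark, or exclamation mark. Abbreviations such as 'Mr.' would be miscounted, and trailing text without a closing mark is ignored. The function name and the three type labels are illustrative assumptions, not part of Monk.

```python
import re
from collections import Counter
from typing import List, Tuple

# One or more closing marks; the last one decides the sentence type.
SENTENCE_END = re.compile(r"[.!?]+")
TYPE_BY_MARK = {".": "statement", "?": "question", "!": "exclamation"}


def sentence_profile(text: str) -> Tuple[List[int], Counter]:
    """Return the token count of each sentence and a tally of sentence types."""
    lengths: List[int] = []
    types: Counter = Counter()
    start = 0
    for match in SENTENCE_END.finditer(text):
        tokens = text[start:match.start()].split()
        if tokens:  # skip empty stretches between consecutive marks
            lengths.append(len(tokens))
            types[TYPE_BY_MARK[match.group()[-1]]] += 1
        start = match.end()
    return lengths, types


# An invented sample passage, not a quotation from any novel.
lengths, types = sentence_profile("Is she dead? No! She sleeps, and will wake again.")
print(lengths)      # [3, 1, 6]
print(dict(types))  # {'question': 1, 'exclamation': 1, 'statement': 1}
```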

Repetitivity per se

The degree and kind of repetition in a text or set of texts is an interesting question by itself, and the contextualization afforded by the Monk environment makes it possible to give nuanced answers to the question when, where, how, or why authors repeat themselves.

Writing is more concise than speech, and writing has fairly clear and remarkably consistent thresholds of tolerance beyond which repetition is perceived as a blunder or a special effect. Such diverse works as Thucydides' Peloponnesian Wars, Plato's dialogues, the Aeneid, Emma, and Das Kapital do not differ significantly in their location of that threshold. Jane Austen has fun with it in her characterization of Miss Bates. Homer and Gertrude Stein cross it by a mile or more. In the case of Homer the very different repetition threshold has long been associated with the oral tradition in which the poems originate. Gertrude Stein deliberately and aggressively violates repetition thresholds: the ambition and effects are very literally 'transcendental.'

Adapting plagiarism detection software for the Monk environment

If phrases, collocations, and syntactic fragments are in some fashion 'catalogued' in Monk, the combination of syntactic fragments with single and multi-word lexical items will support many inquiries into the when, where, how, and why of repetition. What is missing is a way of dealing with long repetitions in a manner that goes beyond the confines of a particular text.

If I suspect a student of plagiarism I can submit his essay to a plagiarism detection service, and it will come back with some score or annotation. The underlying algorithms presumably evaluate the density and length of sequence-sensitive word patterns in the test candidate and complain if there is too much of it. What if an "extended repetition test" were part of the preprocessing routine and a report on it were added or linked to the document catalogue entry? Thus in Monk I the first text could report only on itself, the second could report on repetitions with itself and with text 1. This process would need dynamic updating: if text 793 discovered an extended repetition with text 1, the record for text 1 would need updating.

A summary form of this report would assign separate scores to the degrees of internal and external repetition (measured in terms of document count and extent of repetition). Gertrude Stein would be off the charts on internal repetition. Bulwer-Lytton would rank high on external repetition because he liked to show off his education with fancy quotations. The utility of this test would greatly increase with the size and diversity of the collection. And once a large diachronic and cross-genre collection from 1470-1923 is in place somewhere, the cumulative results of that test would lay a pretty firm foundation for intertextual studies. It would, among other things, create something of a citation index.
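One way to picture such a test is an index of overlapping word sequences ('shingles') that grows as each text is added. The sketch below is an assumption about how this could work, not a description of any existing plagiarism service or of Monk itself. The class name, the shingle length, and the report fields are all hypothetical. It assumes each text is added once under a unique identifier, and it updates the records of earlier texts whenever a later text shares a sequence with them.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple


class RepetitionIndex:
    """Incremental record of internal and external repetition (illustrative)."""

    def __init__(self, n: int = 5) -> None:
        self.n = n                            # words per shingle
        self.owners = defaultdict(set)        # shingle -> ids of texts containing it
        self.internal: Dict[Hashable, float] = {}
        self.external = defaultdict(Counter)  # id -> {other id: shared shingles}

    def _shingles(self, text: str) -> List[Tuple[str, ...]]:
        words = text.lower().split()
        return [tuple(words[i:i + self.n]) for i in range(len(words) - self.n + 1)]

    def add(self, doc_id: Hashable, text: str) -> None:
        counts = Counter(self._shingles(text))
        total = sum(counts.values())
        # Internal repetition: share of shingle occurrences that repeat an earlier one.
        self.internal[doc_id] = (total - len(counts)) / total if total else 0.0
        for shingle in counts:
            for other in self.owners[shingle]:
                # Dynamic updating: both the new and the earlier record change.
                self.external[doc_id][other] += 1
                self.external[other][doc_id] += 1
            self.owners[shingle].add(doc_id)

    def report(self, doc_id: Hashable) -> Dict[str, float]:
        shared = self.external[doc_id]
        return {
            "internal": self.internal[doc_id],
            "external_documents": len(shared),
            "external_shingles": sum(shared.values()),
        }


index = RepetitionIndex(n=3)
index.add(1, "to be or not to be that is the question")
index.add(2, "a rose is a rose is a rose")
index.add(3, "he asked whether to be or not to be")
print(index.report(1))  # text 1 now records its overlap with text 3
print(index.report(2))  # high internal repetition, no external repetition
```

Separate scores for internal and external repetition fall out of the same pass. A real test would also need to merge adjacent shared shingles into longer passages and to decide how long a match must be before it counts as 'extended'.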

Annotation and collaboration

If you envisage an environment in which scholars practice new and sophisticated techniques of 'not-reading' across very large archives, you ask whether they can fruitfully collaborate by dividing the labor of reading or, more broadly, work together in the different stages of 'distant reading' and 'close reading' and share data sets or analytical results, preferably by putting them directly on the 'sofa' where they belong.14

Building a framework of annotation and collaboration is clearly beyond the scope of Monk I. On the other hand, it is not too early to make fundamental decisions about data architecture in such a way that they will subsequently support layers of annotation both on a private and collaborative basis.

Some use cases

Several use cases have been mentioned along the way in this discussion, including
1. the definition of sentimental fiction,
2. the use of personal names for measuring social register and of place names for tracing geographical awareness
3. the visualization of acoustic properties of text through phonetic transcriptions
4. the use of syntactic fragments
5. repetitions at the lexical and syntactic level

In the following pages I sketch three use cases. One of them is real (sentimental fiction), the others are hypothetical, but all of them take advantage of the investigators' ability to employ search or discovery routines that range over a much larger body of texts than they could read closely in the typical context of writing a dissertation or, for that matter, revising it into a book. All of these projects would benefit from divide and conquer strategies made possible by collaboration.

Sentimental Fiction

The investigator is interested in identifying and analyzing the lexical, syntactic, or generally rhetorical practices that lead readers to identify a novel as sentimental. Is there a genre of sentimental fiction, and what are its parameters? A subsidiary goal is to ask whether American and English novels differ interestingly in their way of being sentimental.

In Monk I she has at her disposal a body of 350 English novels written between 1700 and 1900 and a body of 300 novels written in America between 1789 and 1875. This body of fiction includes all the "great" novels, novels that are canonical because of their very "badness", e.g. The Lamplighter that Joyce poked fun at, and a lot of novels that many people read then and few people read now, e.g. Bulwer-Lytton.

She starts by building a training corpus of well-known and indeed super-canonical sentimental scenes in fiction, such as the deaths of Jo in Bleak House, Little Nell in The Old Curiosity Shop, and Eva in Uncle Tom's Cabin. The training corpus is used to find books "like it" with the help of Bayesian statistics or similar routines.
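What "finding books like it" might mean in practice can be suggested by a very small naive Bayes classifier. This is a sketch under stated assumptions, not the routine Monk uses. It assumes whitespace-separated words, add-one smoothing, and a second, contrasting set of training passages (the text above mentions only the sentimental set). The class name and labels are hypothetical, and the training strings are invented toy data, not quotations.

```python
import math
from collections import Counter
from typing import Dict


class NaiveBayes:
    """Multinomial naive Bayes over words, with add-one smoothing (illustrative)."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {}  # label -> word frequencies
        self.doc_counts: Counter = Counter()       # label -> number of passages

    def train(self, label: str, text: str) -> None:
        self.word_counts.setdefault(label, Counter()).update(text.lower().split())
        self.doc_counts[label] += 1

    def log_scores(self, text: str) -> Dict[str, float]:
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denominator = sum(counts.values()) + len(vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                if word in vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return scores

    def classify(self, text: str) -> str:
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


model = NaiveBayes()
model.train("sentimental", "tears weeping angel dying child tears")
model.train("other", "ship cargo harbour ledger merchant ship")
print(model.classify("the dying child and the weeping angel"))  # sentimental
print(model.classify("cargo ledger"))                           # other
```

The per-word terms that contribute most to a score are also a first list of candidate features to examine by hand.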

In a second step she isolates the lexical and syntactic features that are identified in the statistical procedure and in an iterative way analyses their function not only in the works classified by the procedure as sentimental but also in other works. Much of this work is of a familiar look-up kind, but the time cost of such look-ups is reduced by the group and sort capabilities of the Monk concordance tool as well as by the ability of the Monk Lexicon to organize frequency-based lexical information on a time line.

The study that results from this work might involve a return to canonical works read in the light of their wider context. Or, perhaps more promisingly, it might be a study that defines the parameters of sentimental fiction through some case studies that focus on lesser known works. The reader of the resulting study might not necessarily recognize that the research relied explicitly or implicitly on statistical routines. There might or might not be tables with numbers. But in the preface the author would write with much conviction that she could not have oriented herself in so large a fiction space in so short a time without the help of the various orientation tools provided in the Monk environment.

The rhetoric of the body in fiction from 1700-1900

Twelve students in a senior honors seminar explore the rhetoric of the body in fiction from 1700 to 1900. Monk I has no collaboration tools, but they all have laptops, and there is a nice coffee shop with free wireless.

They have available to them a body of 350 English novels from 1700-1900 and 300 American novels written between 1789 and 1875. This includes all the "great" novels and a lot of novels that were popular then but are little read now. One of them regrets that they cannot include poetry in their study, and another points out that nothing would keep them from looking at English Poetry in the Philologic database, which does some basic things very well and very fast. On reflection, they decide that a quarter is not a semester and that they will have their hands full with fiction.

The parts of the body and the clothes that cover them are pretty well known, and discovery procedures are less important than search strategies.15 They work in teams of two, dividing the two centuries into six generations and the body parts in some similar fashion.

They use collocation tools to identify the attributes and behaviours of various body parts, that is to say adjectives or verbs that collocate strongly with eye, hand, heart, etc. They have much use for the ability of the Monk lexicon to plot this lexical information on a time line. Another important tool is the word basket, which lets them load all or some body parts into a single search and get a quick overview of relative frequencies.

They get quite excited when they discover that they can use their lists of body parts, attributes, and behaviours to test whether the rhetoric of the body differs between male and female novelists. They are a little disappointed when they discover that there are no good procedures for distinguishing between physical and metaphorical uses of body parts. Sometimes you just have to look at it yourself.

Towards the end of the seminar they wonder whether one could construct from various components a thermometer that would measure, so to speak, the body temperature of each novel. And if those measurements were visualized, would they show interesting patterns by date, genre, sex, or location? But the quarter was over before they could get around to it.

Theology, ideology, and politics in the age of Elizabeth and James

In this graduate seminar a dozen students use the Monk environment to explore to what extent the rhetoric of explicitly theological and political documents is reflected in other genres. They have at their disposal a collection of some 500 texts published between the birth of Elizabeth (1533) and the death of James (1625). This is an eclectic cross-genre collection that includes historical, theological, philosophical, rhetorical, and political writing as well as plays, poems, stories, travel books, and manuals about farming and domestic life.

In the first half of the course they try to get a handle on explicitly ideological texts (few of them will have done much reading in this genre). They read various samples, focus on developing a list of keywords and their collocates, and ask whether the distribution of keywords and their collocates changes over time. Depending on their findings they build two or more training sets at appropriate time intervals, consisting of text samples that appear to them to speak very explicitly about what Marx called 'the struggles and wishes of an age'. They then use Bayesian statistics to test the plays, poems, stories, and the books about travel, farming, and domestic life to discover which, if any, are "like" the training corpus.

This is of course an experiment designed to fail: you will not expect any text to match very closely, but you will be interested in texts that are least far away. Depending on the outcome of the experiment they then turn to the question how one would go about finding keywords in other genres.

The most successful students in this seminar will probably be good at a version of distant reading that is more like snacking or spot-reading: digitally enhanced forms of skimming.

1 The famous change of punctuation, "Call me, Ishmael", is precisely what editors do not do.
2 You can trust the computer to do this accurately, but you cannot trust people. When Mosteller applied computer analysis to the Federalist Papers and reviewed earlier and manually based statistical inquiries, he remarked wryly: "I learned quickly that people cannot count, at least not very high."
3 If you are interested in texts that are not represented in the collection they can be added. User-driven growth may well be the best way to add to a collection.
4 In the 600 million words of TCP text there are three million instances where the transcriber marked a letter, word, or whole passage as illegible. Not a surprise if one considers that the transcriber faced a page at three removes from an original that may not have started out its life as a well-printed page.
5 A character entity is a periphrastic expression that describes a feature that cannot be directly represented. For instance, there is a symbol to abbreviate the prefix 'per', which appears in TCP texts as &abper;, where the opening ampersand and closing semicolon mark the string as a character entity, the string 'ab' defines it as a symbol for an abbreviation and the 'per' is the content of the entity. Thus a spelling &abper;ficit translates ultimately to 'perficit'.
6 This is a technical term from computing. Programmers think in hierarchies. Anything in a lower 'class' is saddled with whatever is true of the higher 'class'. This is inheritance.
7 Paradoxically it may be simpler to deal with queries that take days rather than an hour or minutes to execute. It is at least clear that you cannot hang around for the answer but must think of it in Interlibrary loan terms, with email notification when the answer has arrived. That may seem an odd way of operating if your expectations are governed by Web experience. But it is a perfectly standard way in the sciences where evolutionary biologists, for instance, feed data sets into their computer and are delighted if the analysis finishes in five days rather than two weeks.

8 In a well-known essay on "Scholarly Primitives" John Unsworth identified a preliminary set as "Discovering Annotating Comparing Referring Sampling Illustrating Representing." Listing probably should be on this list, but escaped notice perhaps because it sits too deep.
9 Complex group and sort routines have a time cost: the initial response to the request is relatively slow because what is returned is not only the raw list but the information required to group and sort arbitrarily subject only to the constraints imposed by the metadata. The time cost of an operation, however, should not be measured by the computer's initial response but by how long it takes users to carry out their task.
10 Document frequency refers to the number of texts in which a word occurs, while collection frequency refers to the total count in all texts.
11 An interest in this feature was expressed by Ken Price, the editor of the Whitman archive at the University of Nebraska.
12 The opening of Verdi's Simon Boccanegra is a spectacular example of 'tintura'.
13 These are terms from information retrieval. Given a return to a query, you may ask how many relevant hits the query returns (recall) or how many of the returns are relevant to the query (precision). If there are 100 relevant hits in a query addressed to 1000 items, you will have total recall if you return all items and total precision if you return one correct item. Obviously neither result set gives you what you want.
14 'Sofa' is an inspired term of art from the UIMA world; it marks the 'subject of annotation'. UIMA stands for "Unstructured Information Management Architecture." It originated as an IBM research project and is now part of the Apache Foundation. Monk is very much unstructured information management architecture.

15 Pornography, an important part of the rhetoric of the body, is not well represented in Monk I. Fanny Hill is the only example.
Notes towards a user manual, page 2

