This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Monk Data Cell Conference Call

Tuesday, May 29, 2007. 3-4 p.m. central time

Present: Amit Kumar (chair), Loretta Auvil, Martin Mueller, Phil Burns (Pib), Joe Paris, Bill Parod, John Norstad (secretary)

We began by discussing Martin's mailing list message from earlier this morning "Must haves for the data cell". Martin asked if we could accept this as a statement of agreement.

Martin Mueller: Must haves for the data cell

From: martinmueller@northwestern.edu
Subject: Monk - Must haves for the data cell
Date: May 29, 2007 9:50:39 AM CDT
To: monk@lists.lis.uiuc.edu

It is still not clear to me whether we have full agreement within the data cell about the must haves of our task. It would be useful to have a formal decision on this, and I hope we can achieve this in today's conference call. In the following I sketch in deliberately non-technical language what appear to be the fundamental requirements that turn a set of texts into the basic data store that provides all information that supports subsequent data models and the analytical operations that are run against those data models.

The slogan is "A flexible data model." By this we understand a data model that permits end users to ask questions that are based on arbitrary combinations of the quasi-atomic data points that are first created in the process of tokenization and then systematically classified in a data model such that they can be combined and retrieved at will.

1. There is a text or collection of texts for "monkification." These will normally be TEI texts of some flavour, but they need not be.

2. Simple but consistent bibliographical information is extracted from or associated with the text(s). This information contains data about author, sex of author, title, a date or date range of origin, place of publication, as well as a minimal classification by genre (poetry, fiction, drama, prose), which comes from membership in a collection (e.g. Wright Archive) or from keywords in the text's header (TCP, DocSouth collections).

3. Each text is tokenized and linguistically annotated. This means that the text is split into sentences and the sentences are split into tokens. A token is an explicitly recorded word occurrence and is associated with

a) a collection-wide unique location id
b) information about the status of its spelling (variant or standard)
c) a morphosyntactic description (POS)
d) its lemma

4. The process of tokenization transforms the source text into a collection of "token bags." The largest of these token bags is the "document bag," which contains all the other token bags with their tokens. The smallest of them is the "sentence bag." Depending on the structural articulation of each text and its expression in some mark-up scheme, a text will contain (potentially multiple) hierarchies of intermediate bags (paragraphs, pages, chapters, sections, acts, scenes, stanzas, cantos, etc).

In addition to these bags, the tokens of a text are also divided into a "main" bag and a "side" bag. Many texts contain materials that are not fully part of them and often are not by the author. In TEI-encoded texts, such materials are found in the <front> or <back> elements, as well as in the so-called "Jump tags," such as <note>, <stage>, or <speaker>. The sequestration of such tokens is useful for many purposes, and it is easily achieved. It can be ignored by researchers who are skeptical about any distinction between authorial and non-authorial words.

5. The token bags are the fundamental building blocks for analytical routines. Tokens in bags from the sentence bag up can be counted, classified, and compared in various ways. Each token may be seen as an instance of any of its properties, separately or in combination. This is especially true of morphosyntactic information: the tag for 'is' specifies that it is

a) the third person
b) singular
c) present
d) a verb form
e) a special kind of verb (auxiliary)

In any analytical operation a token is always treated "as" an instance of something (verb, present tense, lemma etc). In whatever form the tokenized text is kept, it must support data models that allow end users to explore the query potential that derives from the original identification of the token as an instance of different properties and as a member of different token bags. In a properly encoded version of Hamlet, it is easy to isolate the token bag of words spoken by Ophelia in prose and used only by her.
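As an illustration only (it is not part of the original message), items 3 to 5 can be pictured as a token record plus a filter over properties and bag memberships. This is a minimal sketch, not the Monk data store design: the `Token` class, the `select` function, the location ids, the feature names, and the bag labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record following item 3 of the message."""
    loc_id: str       # collection-wide unique location id (3a)
    spelling: str     # surface form as it occurs in the text
    standard: bool    # True for a standard spelling, False for a variant (3b)
    pos: frozenset    # morphosyntactic description split into features (3c)
    lemma: str        # lemma (3d)
    bags: frozenset   # labels of the token bags this token belongs to (item 4)


def select(tokens, pos=(), bags=()):
    """Return the tokens carrying every requested feature and bag label."""
    pos, bags = set(pos), set(bags)
    return [t for t in tokens if pos <= t.pos and bags <= t.bags]


# Two invented tokens; ids and labels are assumptions for illustration.
tokens = [
    Token("ham-3.1-0001", "is", True,
          frozenset({"verb", "auxiliary", "present", "3rd", "singular"}),
          "be",
          frozenset({"doc:hamlet", "speaker:Ophelia", "prose", "main"})),
    Token("ham-3.1-0002", "Exit", True,
          frozenset({"verb", "present"}),
          "exit",
          frozenset({"doc:hamlet", "stage", "side"})),
]

# Treat tokens "as" auxiliaries spoken by Ophelia in prose.
ophelia_aux = select(tokens, pos={"auxiliary"},
                     bags={"speaker:Ophelia", "prose"})
print([t.loc_id for t in ophelia_aux])
```

Storing features and bag labels as sets lets any combination of them serve as a query. The further step in the Hamlet example, keeping only words used by no other speaker, would need a comparison against the other speakers' bags and is not shown here.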

6a. Any token is also a member of token bags that are created by bibliographical information, e.g. "fiction between 1750 and 1770" or "19th century." Beyond such "bags of bags", the process of tokenization and linguistic annotation creates the potential for a frequency-based lexicon that systematically maintains information gathered from texts in a particular collection and may in time become a stand-alone resource that cuts across different collections. The lexicon from the first testbed collection will be a deliverable that lowers the cost and improves the accuracy of preprocessing in subsequent collections. This assumes that certain kinds of data extracted from proprietary text collections will always be in the public domain, as in the hypothetical case that the spelling 'louyth' is found in 34 documents between 1470 and 1567 but never thereafter.

6b. Since sentence splitting and tokenization are error-prone operations it is desirable to create the conditions for user-contributed error correction over time. The tokenized and linguistically annotated text must be kept in a form that will support review and correction at a later date. The Distributed Proofreader Foundation associated with Project Gutenberg is a model in this regard.

7. Many analytical operations are sensitive to the order of words in a text and must therefore target token sequences. The precise details of how to deal with token sequences have not been fully determined. It is highly likely that many analytic routines are best supported by some procedure of "second-order" tokenization in which repeated token sequences are captured and classified as fixed phrases, named entities, and the like. The great majority of analytically useful token sequences will be bigrams or trigrams (Wuthering Heights, Earl of Rochester). But there is also a need to capture ngrams of variable and indefinite length.
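As an illustration only (it is not part of the original message), the first step of such "second-order" tokenization could be a pass that collects repeated bigrams and trigrams as candidates. This is a minimal sketch under stated assumptions: the function name `repeated_ngrams`, its parameters, and the sample sentence are hypothetical, and classifying the candidates as fixed phrases or named entities is not attempted.

```python
from collections import Counter


def repeated_ngrams(tokens, n_values=(2, 3), min_count=2):
    """Count n-grams of the given lengths; keep those seen min_count+ times."""
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Invented sample text built from the phrases named in item 7.
words = ("the Earl of Rochester met the Earl of Rochester "
         "at Wuthering Heights").split()

candidates = repeated_ngrams(words)
for gram, count in sorted(candidates.items()):
    print(" ".join(gram), count)
```

Passing a longer range in `n_values` extends the same loop to longer sequences, though n-grams of indefinite length would probably call for a different technique, such as a suffix-based index.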

Unfortunately, Martin's original message had two items numbered 6. To avoid confusion, we have edited his message above to refer to these items as 6a and 6b.

Amit said that it seems broadly right to him, but of course we need to formalize the details and make them more concrete. He mentioned the treatment of n-grams as an example. Martin said that for the moment, if we wish, we could bracket out the issue of n-grams.

Martin talked about the "Frankenfile" output files of MorphAdorner. These files serve as input to the data ingest pipeline. Amit asked if these Frankenfiles might potentially be too big to be handled gracefully by the data ingest software. Pib mentioned that we can always process them sequentially if necessary, e.g. using SAX instead of trying to read in entire DOM trees. Pib also mentioned that there are ways to reduce file sizes if necessary, by having MorphAdorner generate stand-off markup files. This is a tractable problem.
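Pib's suggestion of sequential processing can be sketched with Python's standard `xml.sax` module, which hands each element to a handler as it is read instead of building a whole tree in memory. The Frankenfile format is not described in these minutes, so the `<w>` element and its `lem` and `pos` attributes below are assumptions made for illustration, as is the `TokenCounter` class.

```python
import io
import xml.sax


class TokenCounter(xml.sax.ContentHandler):
    """Counts tokens and lemmas while streaming; keeps no document tree."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.lemmas = {}

    def startElement(self, name, attrs):
        # Assumed markup: one <w> element per token with a "lem" attribute.
        if name == "w":
            self.count += 1
            lemma = attrs.get("lem", "")
            self.lemmas[lemma] = self.lemmas.get(lemma, 0) + 1


# Tiny invented sample standing in for a large adorned file.
sample = (b'<text><s><w lem="be" pos="vbz">is</w>'
          b'<w lem="it" pos="pn">it</w></s>'
          b'<s><w lem="be" pos="vbz">is</w></s></text>')

handler = TokenCounter()
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.count, handler.lemmas)
```

Because the handler only keeps running totals, memory use stays roughly constant regardless of file size. `xml.sax.parse` also accepts a file name in place of the in-memory stream used here.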

John asked if the raw Frankenfiles are Monk deliverables in and of themselves. Bill said that he thinks they are not, only the detailed documentation of the file format.

Martin asked if everything in his message is compatible with D2K analysis pipelines. Amit said yes, it is all compatible, and Loretta agreed, saying that D2K would be a consumer of this datastore.

Martin talked about item 6b in his message. He stressed that at a later point (not Monk 1), it would be desirable to develop software to facilitate data corrections by users. Some such corrections would be to individual data items, while others would be systematic in nature. We need to support both kinds of corrections. We all agreed that this is a worthy goal, and we will architect the data store in such a way that we do not prohibit or hinder the possibility of developing such tools in the future. In general, however, item 6b is not part of the specification for the first version of the Monk datastore.

We ended by talking a bit more about the problem of n-grams. Pib and John talked about the general problem of supporting queries involving sequences or "patterns" of characters and tokens. In general, it should be possible to perform the same operations on these kinds of sequences as we permit with single words: find them, generate concordances containing them, count them, generate sparse matrices of counts for D2K, etc.
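The operations listed above (find, concordance, count, sparse matrix) can be sketched for token sequences as follows. This is a minimal illustration and not the Monk design: the function names, the sample documents, and the dictionary-of-coordinates matrix layout are assumptions, and the input format that D2K actually expects is not specified in these minutes.

```python
def find_sequence(tokens, pattern):
    """Return the start positions where pattern occurs in tokens."""
    pattern = list(pattern)
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == pattern]


def concordance(tokens, pattern, width=2):
    """Return (left context, match, right context) for each occurrence."""
    lines = []
    for i in find_sequence(tokens, pattern):
        left = tokens[max(0, i - width):i]
        right = tokens[i + len(pattern):i + len(pattern) + width]
        lines.append((" ".join(left), " ".join(pattern), " ".join(right)))
    return lines


def sparse_counts(docs, patterns):
    """Map (document row, pattern column) to a count, omitting zeros."""
    matrix = {}
    for row, doc_id in enumerate(sorted(docs)):
        for col, pattern in enumerate(patterns):
            hits = len(find_sequence(docs[doc_id], pattern))
            if hits:
                matrix[(row, col)] = hits
    return matrix


# Invented sample documents and patterns.
docs = {
    "doc1": "to be or not to be".split(),
    "doc2": "not to worry".split(),
}
patterns = [["to", "be"], ["not", "to"]]

kwic = concordance(docs["doc1"], ["to", "be"])
matrix = sparse_counts(docs, patterns)
print(kwic)
print(matrix)
```

A single word is simply a pattern of length one, so the same three functions cover both cases without special handling.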

In summary, we all agreed that Martin's message above will serve as a statement of agreement for the data store that will be constructed by our data cell, except for item 6b, which will not be part of the first version.
