This page last changed on Apr 14, 2008 by martinmueller@northwestern.edu.

The following is an informal proposal for the workflow of a text as it migrates into and through MONK.

As a text moves through MONK, information gets added to it. Information should be recorded in a formal way as soon as it becomes available in the workflow and passed on to the next stage.

The curatorial stage

In the curatorial phase a text is taken from somewhere and transformed into a TEI-A text. At the end of this process all bibliographical and structural facts about the text are known and should be formally expressed. Call this the MONK fact sheet.

Bibliographical data

Bibliographical data are taken from the teiHeader, but they need to be checked and in some cases supplied. In particular, some data needed for analytics are not present in the teiHeader and must be added.

Perhaps the simplest way of doing this would be to create a database, whether relational or XML, that is initially populated with the relevant data from the sourceDesc element of the teiHeader. This database would have a Web interface that lets curatorial staff edit the data. At some point users could also suggest corrections.
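
As a rough illustration of the initial population step, the sketch below pulls a few fields out of the sourceDesc element and builds a record that curators could then check and complete. The sample header, the field names, and the function `initial_record` are all assumptions made for this example. Real teiHeaders vary in structure and may use a namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified teiHeader used only for illustration.
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>Bleak House: electronic edition</title></titleStmt>
    <sourceDesc>
      <biblFull>
        <titleStmt>
          <title>Bleak House</title>
          <author>Dickens, Charles</author>
        </titleStmt>
        <publicationStmt>
          <date>1853</date>
        </publicationStmt>
      </biblFull>
    </sourceDesc>
  </fileDesc>
</teiHeader>
"""


def initial_record(header_xml):
    """Build a first-draft bibliographical record from sourceDesc.

    Fields that the header cannot supply (here: genre) are left as None
    so that curatorial staff can fill them in later.
    """
    root = ET.fromstring(header_xml.strip())
    source = root.find(".//sourceDesc")

    def text_of(tag):
        # Return the text of the first matching descendant of sourceDesc.
        if source is None:
            return None
        node = source.find(f".//{tag}")
        if node is None or node.text is None:
            return None
        return node.text.strip()

    return {
        "title": text_of("title"),
        "author": text_of("author"),
        "date": text_of("date"),
        "genre": None,  # assumed not to be in the header; supplied by a curator
    }


record = initial_record(SAMPLE_HEADER)
print(record)
```

The record is a plain dictionary, so it could be written to either a relational or an XML store. The choice of store is left open here, as it is in the proposal.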

MorphAdorner needs some information from this database to go about its business. For instance, the choice of the appropriate training set depends on genre and date information.

Structural data

In order to create its "pseudo-pages" MorphAdorner needs to know what div elements it should use for pagination. That is a curatorial decision.

It would be helpful to create an "element profile" of each text. This profile would list all elements down to the div level with their type attributes and the count of each div/type combination. In other words, one counts the distinct XPaths down to the div level.

Below the div level, it is enough to know the elements and their counts; the nesting adds little. Even so, an element profile gives very useful information about the make-up of the document, both for subsequent processing and for the end user.
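
One possible reading of the element profile is sketched below: it counts distinct paths down to the div level, with the type attribute included in each div step, and keeps flat counts for everything below. The sample document and the function `element_profile` are assumptions for illustration, and the sketch ignores namespaces.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature text used only for illustration.
SAMPLE_TEXT = """<text><body>
<div type="chapter"><head>One</head><p>a</p><p>b</p>
  <div type="section"><p>c</p></div>
</div>
<div type="chapter"><head>Two</head><p>d</p></div>
</body></text>"""


def element_profile(xml_text):
    """Return (paths, below_div) counters for a document.

    paths     counts distinct paths down to the div level
    below_div counts elements below the div level, ignoring nesting
    """
    root = ET.fromstring(xml_text)
    paths = Counter()
    below_div = Counter()

    def label(elem):
        # A div step carries its type attribute; other steps are bare names.
        if elem.tag == "div":
            return f"div[@type='{elem.get('type', '')}']"
        return elem.tag

    def walk(elem, path, inside_div):
        if elem.tag == "div" or not inside_div:
            # At or above the div level: extend and count the path.
            path = f"{path}/{label(elem)}"
            paths[path] += 1
            inside_div = inside_div or elem.tag == "div"
        else:
            # Below the div level: a flat count is kept, nesting is dropped.
            below_div[elem.tag] += 1
        for child in elem:
            walk(child, path, inside_div)

    walk(root, "", False)
    return paths, below_div


paths, below_div = element_profile(SAMPLE_TEXT)
for path, count in sorted(paths.items()):
    print(count, path)
for tag, count in sorted(below_div.items()):
    print(count, tag)
```

A profile of this kind would also give curators a quick overview of which div types occur in a text when they decide which ones MorphAdorner should use for pagination.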

The MONK fact sheet

The MONK fact sheet for each text is kept as a distinct document in Fedora. Some facts about a text are maintained through "keys." For instance, information about Charles Dickens need not be kept for every text by Dickens; instead, a text is associated with an author, about whom appropriate information is kept in an authority file. If we can rely on external sources for this, e.g. the Library of Congress, we are better off.
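
The idea of keys can be sketched as follows. The class names, the key format, and the fields are assumptions for illustration only; an actual authority file, especially an external one, would have its own identifiers and structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    """One entry in a hypothetical authority file."""
    name: str
    birth_year: int
    death_year: int


@dataclass
class FactSheet:
    """A fact sheet stores only the author's key, not the author's details."""
    title: str
    date: str
    author_key: str


# The authority file: author details are recorded once, under a key.
AUTHORITY = {
    "dickens-charles": Author("Dickens, Charles", 1812, 1870),
}

sheets = [
    FactSheet("Bleak House", "1853", "dickens-charles"),
    FactSheet("Hard Times", "1854", "dickens-charles"),
]


def author_of(sheet, authority):
    # Resolve the key at the point of use.
    return authority[sheet.author_key]


for sheet in sheets:
    print(sheet.title, "by", author_of(sheet, AUTHORITY).name)
```

Because both fact sheets point to the same entry, a correction to the author record is made in one place and applies to every text by that author.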

The MorphAdorner stage

The process of linguistic annotation creates a variety of summary data that it may be useful to keep in a separate document, also stored in Fedora and usable for a variety of purposes. The data include:

  1. the number of characters, word tokens, and average word length
  2. the number of sentences and their distribution by length
  3. a "bag of words" model of the text that gives distinct "token tuples" with their counts

Many of these data are subsequently precomputed and kept in the MONK data store. But it may be useful, and not particularly expensive, to keep a set of lexical summaries as well.
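
A minimal sketch of such a lexical summary follows. It assumes that the annotated text is available as a list of sentences, each a list of (spelling, part of speech, lemma) tuples. That input shape, the tags in the sample, and the function `lexical_summary` are assumptions for illustration and do not describe MorphAdorner's actual output format. Characters are counted within word tokens only.

```python
from collections import Counter


def lexical_summary(sentences):
    """Summarize a text given as sentences of (spelling, pos, lemma) tuples."""
    tokens = [token for sentence in sentences for token in sentence]
    char_count = sum(len(spelling) for spelling, _pos, _lemma in tokens)
    token_count = len(tokens)
    return {
        "characters": char_count,
        "word_tokens": token_count,
        "average_word_length": char_count / token_count if token_count else 0.0,
        "sentences": len(sentences),
        # Maps a sentence length (in tokens) to the number of such sentences.
        "sentence_lengths": Counter(len(sentence) for sentence in sentences),
        # Bag of words: each distinct token tuple with its count.
        "bag_of_words": Counter(tokens),
    }


# Hypothetical annotated input; the part-of-speech tags are illustrative.
sample = [
    [("the", "dt", "the"), ("cat", "n1", "cat"), ("sat", "vvd", "sit")],
    [("the", "dt", "the"), ("cat", "n1", "cat")],
]

summary = lexical_summary(sample)
print(summary["word_tokens"], summary["average_word_length"])
print(summary["bag_of_words"].most_common())
```

Keeping the whole tuple as the bag-of-words key means that the same spelling with a different part of speech or lemma is counted separately.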

Before Prior

Once the texts have gone through the curatorial process and linguistic annotation, different representations of them will be stored in Fedora. These representations are:

  1. The source text
  2. The TEI-A text
  3. The MONK fact sheet with bibliographical and structural information
  4. The lexical skeleton or "bag of words" model of the text