This page last changed on Apr 13, 2008 by martinmueller@northwestern.edu.

The following is a report on where we stand as of April 12, 2008 with regard to the processing of texts for ingestion into the Monk data store.

Brian and Steve have succeeded in creating TEI-Analytics versions of all the texts we are likely to use in Monk I, that is to say, the 250 novels in the Chadwyck Healey collection of 19th century British fiction (~ 40 million words), a selection of some 300 novels from the Wright archive of American fiction 1851-1876 (~ 40 million words), and a collection of ~ 700 TCP-EEBO texts that focus on Early Modern English writing between the birth of Elizabeth (1533) and the death of James (1625) (~ 60 million words). The temporal boundaries are not rigidly observed and are systematically broken by the inclusion of ~100 texts about witchcraft, the majority of which are documents from the middle and late seventeenth century.

It would be easy to add texts from Early American Fiction or the DocSouth collection, since these are texts with sparse but very consistent mark-up practices that do not raise new problems beyond those encountered with the three data sets already processed.

There is likely to be some adjustment at the margins in the structure of the TEI-A schema and in the selection of texts. But this is a good time to take stock, and significant changes are very unlikely.

TEI-Analytics

The process of conversion involves creating texts that parse under TEI-Analytics, a slightly modified subset of the P5 TEI schema. The latest versions of TEI-A are found at http://segonku.unl.edu/teianalytics/. I have been validating texts with the RelaxNG schema at http://segonku.unl.edu/teianalytics/TEIAnalytics.rng.

TEI-Analytics is a close cousin of TEI-Lite. It differs from TEI-Lite P4 and TEI-Lite P5 in the following ways:

  1. Like TEI-Lite P5, it abandons numbered divs, which have been widely used by archives that encode texts according to the Level 4 Guidelines of libraries.
  1. It adds a number of elements at the word level to support morphosyntactic annotation, notably <w> and <c>.
  1. It extends the content model of the <w> element to include word tokens that are very commonly found in a variety of text archives. The P4 and P5 content models do not anticipate the existence of word tokens where word parts are wrapped in elements. But this is an extremely common phenomenon. Examples include "<hi>Peter</hi>'s" and "S<sup>t</sup>."

The current content model of <w> is identical with that of <seg> and identical with <w><seg></seg></w>, which is valid under P4 and P5. It may tighten a little, but it creates a token space sufficiently generous to allow the inclusion of many orthographic and typographic phenomena that otherwise involve complex ways of splitting and recombining character strings that are clearly one word.
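The effect of the generous <w> content model can be illustrated with a minimal Python sketch. It assumes only that a token arrives as a well-formed <w> element; the function name token_text is hypothetical and is not part of any MONK tool. The point is that the character string of the word can be recovered without splitting and recombining around the inline markup.

```python
import xml.etree.ElementTree as ET

def token_text(w_xml: str) -> str:
    """Return the character string of a <w> token, ignoring inline markup.

    Hypothetical helper for illustration only.
    """
    w = ET.fromstring(w_xml)
    # itertext() walks text and tail nodes in document order
    return "".join(w.itertext())

print(token_text("<w><hi>Peter</hi>'s</w>"))  # Peter's
print(token_text("<w>S<sup>t</sup>.</w>"))    # St.
```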

  1. It adds <sup> and <sub> elements, which are largely, but not exclusively, a convenient shorthand for <hi rend="sup"> etc. The <hi> element is a typographical marker without semantic significance (though it marks emphasis). Removing a <hi> element will never change the plain meaning of a character string. But this is not true of strings like "y<sup>e</sup>", which resolves to "the", or "Ma<sup>tie</sup>", which resolves to "Majesty."
  1. P5 introduced a <floatingText> element that allows for a much better encoding of various forms of inserted documents. Thus a deeply nested structure like
    <q><text><body><div type="letter"></div></body> </text></q>

can be more economically modeled in P5 as

<floatingText type="letter"><body></body></floatingText>

In the TEI-A texts somewhat different practices in the source texts have been consistently modeled via <floatingText>.

  1. TEI-Lite P4 merged the <quote> and <q> elements. In practice, the <q> element in archives encoded with Level 4 Guidelines always marks a quotation and never marks direct speech. P5 introduced <said> as a new element to mark the representation of spoken utterances in written language. TEI-A translates all <q> elements of the source texts into <quote> elements and enables the explicit representation of spoken language through the <said> element.
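The two restructurings described in the list above (embedded documents into <floatingText>, and <q> into <quote>) can be sketched in Python as follows. This is not the project's XSLT and makes simplifying assumptions: an embedded document is recognized only by the exact path q/text/body/div, only the first such div is kept, and the function name to_tei_a is hypothetical.

```python
import xml.etree.ElementTree as ET

def to_tei_a(root: ET.Element) -> ET.Element:
    """Rewrite <q> elements in place (illustrative sketch only).

    A <q> wrapping text/body/div becomes <floatingText type="...">;
    every other <q> becomes <quote>.
    """
    for q in list(root.iter("q")):      # copy: the tree is modified below
        div = q.find("text/body/div")
        if div is None:
            q.tag = "quote"
            continue
        tail = q.tail                   # clear() would discard the tail
        body = ET.Element("body")
        body.text = div.text
        body.extend(list(div))          # move the div's children into <body>
        q.clear()
        q.tag = "floatingText"
        if div.get("type"):
            q.set("type", div.get("type"))
        q.append(body)
        q.tail = tail
    return root

src = ('<p>He wrote: <q><text><body><div type="letter"><p>Dear Sir</p>'
       '</div></body></text></q> and added <q>so it is</q>.</p>')
print(ET.tostring(to_tei_a(ET.fromstring(src)), encoding="unicode"))
```

A real conversion would also have to carry over heads, multiple divs, and attributes on the intermediate elements, which this sketch ignores.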

Processing the TCP SGML files

The source files of the TCP texts were encoded in SGML, and the project has used a number of project-specific encoding procedures that are independent of SGML. The TEI-A files generated for this project are "lossy" in the sense that they sacrifice a number of features that are preserved in the source texts but have been judged irrelevant to any foreseeable purpose inside MONK. Features of the source files that have been shed in the first transformation and cannot be reconstructed from the TEI-A file include:

  1. the distinction between the short and long s (half-heartedly observed in about a third of the TCP texts)
  2. the treatment of soft hyphens, where they exist or do not exist at the end of a line in the source text
  3. the expansion of certain brevigraphs: the character entity "&abque;" marking an abbreviation for 'que' appears just as 'que'.
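Two of the lossy steps listed above can be illustrated with a short Python sketch. It assumes, for illustration only, that the long s is present as the literal character U+017F and that brevigraph entities are still unresolved in the input string; neither assumption is confirmed for the actual conversion. The name normalize and the mapping table are hypothetical. Soft hyphens are left out because the rule applied to them is not described here.

```python
LONG_S = "\u017f"                      # LATIN SMALL LETTER LONG S
BREVIGRAPHS = {"&abque;": "que"}       # only the entity named in this report

def normalize(text: str) -> str:
    """Collapse long s to s and expand known brevigraph entities.

    The result cannot be mapped back to the source spelling,
    which is what "lossy" means in this context.
    """
    text = text.replace(LONG_S, "s")
    for entity, expansion in BREVIGRAPHS.items():
        text = text.replace(entity, expansion)
    return text

print(normalize("ble\u017f\u017fed"))  # blessed
print(normalize("at&abque;"))          # atque
```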

Treatment of gap element

The <gap> element marks something that is not there in the source text, for whatever reason. There has been a change from P4 to P5 in the required or allowed attributes in the <gap> element. The TCP texts are quite inconsistent in the description of gaps, largely as a result of different vendors doing the same thing in somewhat different ways. The TEI-A texts try to keep the information about gaps that was recorded consistently but shed attributes or values that tell you nothing ("missing", "illegible").

By far the most common type of gap is one or more missing letters. After consultation with a Unicode guru, Deborah Anderson at Berkeley, we settled on replacing all those gap elements with a Unicode character from the Block elements, so that a string like <HI>Pr<GAP DESC="ILLEGIBLE" EXTENT="3 letters">ustes</HI> from the SGML source appears in the TEI-A text as <hi>Pr...ustes</hi>. (The dots should be instances of \u25cf, which is used by the University of Michigan to display missing characters, but this Wiki can't save that character yet.)

Simplifying the gap element for missing letters makes it much easier for humans to read and manipulate the TEI-A files, and some manipulation of this kind will clearly be part of the life cycle of these files.
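The replacement of missing-letter gaps can be sketched in Python as follows. This is an illustration, not the project's conversion code: it assumes the EXTENT attribute always has the form "N letters" (or "1 letter"), treats the SGML source as a plain string, and leaves the case of the surrounding tags untouched. The names FILLER, GAP_LETTERS and fill_letter_gaps are hypothetical.

```python
import re

FILLER = "\u25cf"  # the character named above as the gap marker

# Matches an unclosed SGML <GAP ...> tag whose EXTENT counts letters.
GAP_LETTERS = re.compile(
    r'<GAP\b[^>]*\bEXTENT="(\d+) letters?"[^>]*>', re.IGNORECASE
)

def fill_letter_gaps(sgml: str) -> str:
    """Replace each missing-letter <GAP> with one filler per letter."""
    return GAP_LETTERS.sub(lambda m: FILLER * int(m.group(1)), sgml)

source = '<HI>Pr<GAP DESC="ILLEGIBLE" EXTENT="3 letters">ustes</HI>'
print(fill_letter_gaps(source))
```

Gaps whose extent is given in other units (words, lines, pages) do not match the pattern and are left in place for separate handling.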

TCP files that do not parse

25 TCP files do not parse under the current version of TEI-A. We have quarantined them and may or may not use them. There are several reasons why files do not parse. The most common reason is that the TEI content model is inappropriate. In the process of encoding some 14,000 Early Modern texts, Paul Schaffner, the technical director of the project, discovered that the P3 and P4 content models simply did not square with the textual facts on the ground, and he made minor extensions of the content model for some elements. In nearly all these cases, the extensions simply follow the contours of what is in the text.

The two major cases are the <sp> and <postscript> elements. The <sp> element has been in the TEI from the beginning. In P5, there is a <said> element that models the representation of spoken utterances in written texts. The <said> element allows for the fact that a speaker might read a letter or the Bill of Rights aloud, which would then appear as a <floatingText> element. But the <sp> element does not permit this, although the occurrence of such phenomena in drama is legion.

The <postscript> element is new to P5 but has been in the TCP DTD for some time. The folks in the TCP knew that not infrequently "postscripts" are anything but. They can be complex documents with salutations, closers, etc. The P5 model of a postscript only allows what a postscript ought to be.

In these cases, amendments to the TEI content model seem the sensible way to go. A trickier case is the content model for <cell>. In some TCP texts, notably Foxe's famous Book of Martyrs, there are pages and pages in which the fates of martyrs are set forth in tabular form, and the <cell> element that carries the case narrative is often a complex scene with paragraphs and speeches. Paul Schaffner loosened the content model of <cell> to make it more like <item> and justified it in the following passage from an email to me:

    • given that so many actually occurring text features hover on the
      cusp between markup as tables and markup as lists, it seems hard
      to justify a scheme in which item and cell cannot contain similar
      markup. Or at least that was the way I was thinking when I
      loosened up the <cell> model to make it more like <item>. Probably
      those who insisted on the tighter definition of <cell> fell
      into the school of 'table purists'--i.e. those who think that
      tables are a thing unto themselves and fear above all their
      abuse as page-formatting devices. I do not subscribe to that
      opinion myself.

In one way or another TEI-A will have to accommodate the very sensible extensions that the TCP folks made to the content models of some elements, if only because they are the generators of the largest and most valuable archives. The best solution would be for the TEI to integrate the TCP modifications into its content model.

25 TCP files did not parse immediately but required some manual attention, largely because of oddities in the original encoding. Working with oXygen, it typically takes five minutes or so to spot and correct the quirky features that keep the file from parsing.

A better way of processing TCP files

The TEI-A versions of the TCP files generated by Abbot are good enough for MONK. Indeed, they are more useful than the source files because they are more consistent, and the information lost is not kept in any other MONK files and is without analytical value. On the other hand, there are some shortcuts in the current procedures. A perfect way of generating TEI-A files from the SGML source files would be a two-step procedure, which first generates a non-lossy P5 XML file and then creates a reversible TEI-A derivative.

That is a project for another day, but the groundwork for it has already been laid in Brian Pytlik-Zillig's XSLT style sheets. It will probably involve some more thinking through the relationship of the TEI-A file and the processes of tokenization and linguistic annotation, whether in MorphAdorner or some other NLP tool suite. There may be something to be said for separating the process of tokenization from the process of linguistic annotation. If the process of tokenization (with its attendant creation of token IDs) occurs at the stage of creating a lossless P5 representation, the tokenized P5 representation can become the new original and derivatives from it are always traceable back to their source.
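The idea of assigning token IDs once, at the lossless stage, can be sketched in Python. Everything here is hypothetical: the ID scheme (document id plus a running number), the whitespace tokenizer, and the function names are illustrative assumptions and do not describe MorphAdorner or any existing MONK procedure. The sketch shows only that a lossy derivative which keeps the IDs remains traceable to the tokens it came from.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (token id, character string)

def tokenize(doc_id: str, text: str) -> List[Token]:
    """Assign a stable id to every whitespace-delimited token."""
    return [
        (f"{doc_id}-{n:05d}", tok)
        for n, tok in enumerate(re.findall(r"\S+", text), start=1)
    ]

def derive(tokens: List[Token]) -> List[Token]:
    """Lossy derivative: lowercase and strip punctuation, keep the ids."""
    return [(tid, re.sub(r"[^\w']", "", tok).lower()) for tid, tok in tokens]

original = tokenize("doc1", "The Quene, God save her")
derived = derive(original)

# Each derived token can be traced back through its id.
lookup = dict(original)
for tid, form in derived:
    print(tid, form, "<-", lookup[tid])
```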
