This page last changed on May 10, 2007 by martinmueller@northwestern.edu.

Not all the words in a document belong to the text in the same way. If a document has <front>, <body>, and <back> elements, there is a good chance that the stuff in the front is not part of the text as the reader would ordinarily understand it. The stuff in a <back> element may be a publisher's note or an advertisement. Notes interrupt the flow of the text. Speaker tags are labels. Stage directions are an odd form of side text.

Two conclusions follow from this. First, tokenization must be sensitive to what are called "jump" tags, that is, tags that interrupt the flow of the main text. In TEI-encoded texts, it is very common for <note> elements to be inserted in the middle of a sentence. <note> elements are classic jump tags, but so are <speaker> and <stage> tags.
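A minimal sketch of what jump-tag sensitivity could look like, written in Python with the standard xml.etree.ElementTree module. The JUMP_TAGS set, the function names, and the sample sentence are illustrative assumptions, not part of any existing tokenizer. The idea is that the text after a jump element (its "tail") is stitched back onto the text before it, so a sentence interrupted by a <note> is tokenized as one continuous sentence.

```python
import re
import xml.etree.ElementTree as ET

# Assumed set of jump tags, taken from the examples named in the text above.
JUMP_TAGS = {"note", "speaker", "stage"}

def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1]

def main_flow_text(elem):
    """Return the text of elem, skipping the content of jump elements."""
    if local_name(elem.tag) in JUMP_TAGS:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(main_flow_text(child))
        # The tail follows the child's closing tag, so it is part of
        # the parent's flow even when the child itself is skipped.
        parts.append(child.tail or "")
    return "".join(parts)

# Made-up sample: a note inserted in the middle of a sentence.
sample = "<p>It was the best of times,<note>An editor's note.</note> it was the worst of times.</p>"
flow = main_flow_text(ET.fromstring(sample))
tokens = re.findall(r"\w+", flow.lower())
print(flow)  # It was the best of times, it was the worst of times.
```

The word pattern `\w+` is a placeholder for whatever tokenization rules are actually in use; the point is only that the jump element is removed before the rules are applied.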

Second, whether or not the content of an element may be said to be in the linear flow of the document, it may be useful to divide the document bag into a main bag and a side bag. The point of this exercise is not to divide the document into stuff that is or is not "by" the author. Rather, it is to increase the odds that the word tokens in the main bag are only the author's words, whereas the words in the side bag may be of mixed origin.

A crude but useful procedure would be to define the main document bag as everything in the <body> element of a text minus the content of "jump" elements and to define the side document bag as whatever is in the <front> or <back> elements (if there are any) plus the content of jump elements.
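The crude procedure can be sketched as follows, again in Python with the standard library. All names here (JUMP_TAGS, SIDE_SECTIONS, fill_bags, split_bags) and the sample document are hypothetical. The sketch assumes that jump elements go to the side bag wherever they occur, and that text outside <front>, <body>, and <back> (a header, for instance) goes into neither bag.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

JUMP_TAGS = {"note", "speaker", "stage"}   # assumed list of jump tags
SIDE_SECTIONS = {"front", "back"}          # sections that feed the side bag

def local_name(tag):
    """Drop an XML namespace prefix, if there is one."""
    return tag.rsplit("}", 1)[-1]

def tokenize(text):
    """Placeholder tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def fill_bags(elem, main, side, bag=None):
    """Walk the tree and count each word in the main or the side bag."""
    name = local_name(elem.tag)
    if name in JUMP_TAGS or name in SIDE_SECTIONS:
        bag = side
    elif name == "body" and bag is not side:
        bag = main
    if bag is not None and elem.text:
        bag.update(tokenize(elem.text))
    for child in elem:
        fill_bags(child, main, side, bag)
        # Text after the child's closing tag belongs to the current element.
        if bag is not None and child.tail:
            bag.update(tokenize(child.tail))

def split_bags(xml_string):
    """Return (main_bag, side_bag) as word-count Counters."""
    main, side = Counter(), Counter()
    fill_bags(ET.fromstring(xml_string), main, side)
    return main, side

# Made-up sample document with front matter, a speech, and back matter.
sample = """<text>
<front><p>Printed for the author</p></front>
<body><sp><speaker>Hamlet</speaker>
<p>To be,<note>A famous line.</note> or not to be</p>
<stage>Exit</stage></sp></body>
<back><p>Advertisement</p></back>
</text>"""
main_bag, side_bag = split_bags(sample)
print(sorted(main_bag.items()))
print(sorted(side_bag.items()))
```

In this sample the main bag holds only the six words of the speech, while the speaker label, the note, the stage direction, and the front and back matter all land in the side bag.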

Something like this procedure is clearly necessary for plays, where you want to distinguish between words uttered by characters and everything else.

An interesting problem is posed by epigraphs, which are very common in literary works and almost by definition not the words of the author. On the other hand, they are typically short and may not be worth bothering with.

Document generated by Confluence on Apr 19, 2009 15:04