|
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
I had a long conversation with John Norstad the other day, and we talked about various ways of getting information out of TEI elements. It is clear that the elements differ quite drastically in their predictive or analytical value, and it may be a good idea to pay more attention to some of them than to others.
I conclude from a cursory survey that from the perspective of cross-document comparison the number of important tags is quite small. There is the distinction between verse and prose. There is the type attribute that lets you identify correspondence. There are the <sp> and <said> elements that oppose the written representation of spoken words to everything else. And there are elements that let you find epigraphs, prologues, or prefatory materials that are typically gathered in <front> elements. In terms of precomputing or other forms of special treatment a small number of elements may meet most needs.
Am I simply reflecting my own prejudices here or is that a reasonable conclusion to be drawn from the ways in which TEI elements are actually used in encoding?
There is, however, one very primitive way in which elements are always useful. If I remember right, TAPOR does something along those lines. I learn certain things about a document if I know what element/type combinations are used in the tagging. Thus it may be helpful to have a MONK-wide element catalogue. A summary version of it consists of data rows like
<div type="dedication"> 72 98
<epigraph> 231 2314
The first number refers to the number of documents that contain this particular element or element/type combination. The second number refers to the total number of instances. From this summary you can go in two directions. You can get more detailed information about the use of each element. Thus, following up on <epigraph> you would get a list of the works that contain this element, together with the number of instances in each.
Alternatively, you can look at a particular work and get a list of its elements with their counts. If you have a very modest understanding of the TEI, these very primitive frequency lists of elements and type attributes may be surprisingly effective in helping users identify certain kinds of content that they want to look at more closely.
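The catalogue described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MONK implementation: the function names (`element_keys`, `build_catalogue`, `works_with`) are made up for this example, documents are assumed to be available as well-formed XML strings keyed by some document id, and namespaces are simply stripped from tag names.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_keys(xml_text):
    """Count element or element/type combinations in one document."""
    counts = Counter()
    for el in ET.fromstring(xml_text).iter():
        tag = el.tag.split('}')[-1]  # drop a namespace prefix, if any
        typ = el.get('type')
        key = f'<{tag} type="{typ}">' if typ is not None else f'<{tag}>'
        counts[key] += 1
    return counts


def build_catalogue(docs):
    """docs maps a (hypothetical) document id to its XML text.

    Returns the per-document counts and a summary that maps each key to
    (number of documents containing it, total number of instances).
    """
    per_doc = {doc_id: element_keys(text) for doc_id, text in docs.items()}
    doc_freq, total = Counter(), Counter()
    for counts in per_doc.values():
        for key, n in counts.items():
            doc_freq[key] += 1
            total[key] += n
    summary = {key: (doc_freq[key], total[key]) for key in total}
    return per_doc, summary


def works_with(per_doc, key):
    """First direction: the works that contain a key, with their counts."""
    return {doc_id: c[key] for doc_id, c in per_doc.items() if key in c}
```

The second direction, looking at one work, is just `per_doc[doc_id]`, and `per_doc[doc_id].most_common()` gives that work's elements ordered by frequency.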
What follows is a coarse grouping of the elements that will be part of TEI-A, together with some preliminary observations about their analytical value. It will be helpful for others to add, subtract, or contradict.
<l>, <said>, <sp> and other paragraph level elements.
The vast majority of words in a TEI file sit in one of the following elements: <ab>, <l>, <p>, <q>, <quote>, <said>, <sp>. Of these, <l> tells you that it contains verse, while <said> and <sp> tell you that they contain utterances that are spoken by somebody. The rare <ab> (anonymous block) is aggressively non-predictive, while <q> and <quote> set off their enclosed text in a not very clearly specified fashion.
For a literary scholar, the most powerful oppositions emerging from these elements are
<l> versus not <l>, or the opposition between verse and prose
<said> or <sp> versus neither <said> nor <sp>, or the opposition between the written representation of a spoken utterance and writing
The distinction between verse and prose is marked to some extent at the bibliographical level. If a work is coded as poetry it will consist for the most part of verse. And if it is not coded as poetry it will consist largely of prose. But this top-level distinction is less precise than the distinction between what is encoded in <l> or not in <l>. In the 250 English novels from 1780-1900, there are some 23,000 lines of poetry--roughly the equivalent of half a dozen plays by Shakespeare. So there is a fair amount of poetry in some prose, and if you are interested in the different usage of words over time and genre, you will almost certainly want to be able to draw on the power of the "<l> vs. not <l>" encoding.
If you have an interest in spoken language, certain kinds of drama and the spoken sections of fiction and some other types of documents are often the only available proxy for earlier periods. The latter has not been encoded in any of the MONK documents, although it is in principle possible to do so. Prose drama is a major source of evidence for colloquial language. But the difference between written and spoken language has a lot of analytical potential, and one should try to make its potential available wherever it has been encoded.
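Both oppositions come down to the same operation: split the words of a document into those that sit inside a given set of elements and those that do not. Here is a minimal sketch, assuming well-formed XML and a crude whitespace tokenizer rather than MONK's real tokenization. The name `partition_words` and the `skip_tags` parameter are inventions for this example; `skip_tags` is there so that labels such as <speaker> are not counted as spoken words.

```python
import xml.etree.ElementTree as ET


def partition_words(xml_text, inside_tags, skip_tags=()):
    """Split words into those inside any of inside_tags and the rest.

    Words are whitespace-separated tokens, a deliberate simplification.
    Elements named in skip_tags are dropped together with their content.
    """
    inside, outside = [], []

    def walk(el, is_inside):
        tag = el.tag.split('}')[-1]  # drop a namespace prefix, if any
        if tag in skip_tags:
            return
        is_inside = is_inside or tag in inside_tags
        bucket = inside if is_inside else outside
        bucket.extend((el.text or '').split())
        for child in el:
            walk(child, is_inside)
            # Text after a child belongs to the context of the parent.
            bucket.extend((child.tail or '').split())

    walk(ET.fromstring(xml_text), False)
    return inside, outside
```

Calling `partition_words(text, {'l'})` gives verse versus everything else, and `partition_words(text, {'said', 'sp'}, skip_tags={'speaker'})` gives represented speech versus everything else.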
The type attribute in <div> and <floatingText> elements
The type attribute establishes subclassifications among certain elements, notably <div> and <floatingText>. Type attributes vary widely across texts and collections. Sometimes they simply make explicit what is implicit. In a play, <div type="scene"> tells you nothing that you don't know already from the second-level div status.
The attribute value of "letter" has fairly strong predictive quality and identifies the content of its element as belonging to a very specific form of writing, which hovers interestingly between writing and speech. Letters may occur as <div> elements in epistolary novels that consist entirely of letters. There are novels in which some divs are letters and others are chapters. Most letters, however, appear as <floatingText> elements inside a <div type="chapter"> element.
It is doubtful whether there is another type attribute that may be said to be used so consistently across different collections. Of course, the letters themselves vary widely in style.
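Pulling the letters out of a document is correspondingly simple. The following is a rough sketch, assuming that letters are marked with type="letter" on <div> or <floatingText> as described above; the function name `find_letters` is hypothetical, and joining text nodes with spaces is a shortcut that would split a word broken up by inline markup.

```python
import xml.etree.ElementTree as ET


def find_letters(xml_text):
    """Return (container tag, text) for each <div> or <floatingText>
    whose type attribute is "letter", in document order."""
    letters = []
    for el in ET.fromstring(xml_text).iter():
        tag = el.tag.split('}')[-1]  # drop a namespace prefix, if any
        if tag in ('div', 'floatingText') and el.get('type') == 'letter':
            # Join all text nodes and normalize whitespace (crude).
            text = ' '.join(' '.join(el.itertext()).split())
            letters.append((tag, text))
    return letters
```

The container tag is kept in the result because it tells you whether a letter is a structural unit of an epistolary novel or an insertion within a chapter.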
Opening and closing elements
There are a number of elements that define a stretch of text in terms of its position at the beginning or end of a larger unit. There is some predictive value in this because what is said at the beginning or end tends to be more conventionally restricted than what is said in the middle. Whether this yields much for analytical purposes is open to question. If you have a meticulously encoded correspondence, a relationship between sender and recipient may be apparent from changes in address over time. But such consistency cannot be assumed for the texts in MONK.
The <epigraph> element may be of some interest. It is a subgenre of fiction, and the study of epigraphs (which begins with their identification) may be illuminating in various ways. In drama <prologue> and <epilogue> refer to special kinds of scenes, but this is less true of prologues or epilogues in novels or other genres. So the analytical value of opening and closing elements is probably quite limited.
The elements in this group include <argument>, <closer>, <epigraph>, <epilogue>, <head>, <prologue>, <opener>, <salute>, <signed>, <trailer>.
Large-scale structural elements
The large-scale structural elements include <TEI>, <teiCorpus>, <back>, <body>, <div>, <front>, <group>, and <text>. The <floatingText> element is something of a wild card and lets you insert text of any length almost anywhere. The predictive value of these elements is very low. The <front> and <back> elements are perhaps most expressive in telling you that their content is not strictly part of the text at all, and they are "paratext" par excellence. The content of <front> elements, however, may be useful for a scholar who is interested in dedicatory rhetoric. On the other hand, not all dedicatory rhetoric is contained in <front> elements. Still, it is a good way of getting at much of it.
Inline and other small-scale elements
These elements include <abbr>, <add>, <addrLine>, <address>, <c>, <corr>, <date>, <emph>, <email>, <foreign>, <formula>, <gap>, <hi>, <mentioned>, <name>, <num>, <orig>, <reg>, <s>, <seg>, <sic>, <soCalled>, <term>, <sub>, <sup>, <unclear>, <w>.
These elements are of virtually no analytical interest, except for <name> and <date>, which take you into named entity extraction or more generally into the world of 'who', 'where', 'when'. Actually, most of the texts do not use <name> or <date> elements to identify people, places, or dates. People and place names are currently being identified through morphosyntactic tagging. There is discussion, still unresolved, about second-order passes over a linguistically annotated text, where you use existing morphosyntactic tagging as a basis for more precise extraction of the names of places, people, and institutions. It is unclear at this time whether the results of such extraction are subsequently added to the linguistically annotated text in the form of TEI elements. That is certainly a possibility.
Bibliographical, linking, and referring elements
Bibliographical elements include <bibl>, <biblFull>, <byline>, <cit>, <dateline>, <docAuthor>, <docDate>, <docEdition>, <docImprint>, <docTitle>, <link>, <note>, <ptr>, <ref>, <rs>, <imprimatur>, <titlePage>.
These are quite specialized elements and are probably of very limited utility in the MONK environment. Or to put it differently: if you wanted to do research that focused on the content of those elements you would be better off working with a large library catalogue.
Lists and tables
Lists and tables are relatively rare in the MONK documents. They are currently classified as paratext, partly for their rarity and partly because their content does not lend itself easily to morphosyntactic tagging or sentence splitting. Thus they contain for the most part language in a form that is resistant to analysis and may distort results, although there may not be enough of them to affect any analysis seriously.
The TEI header
In addition to <teiHeader>, there are <author>, <authority>, <availability>, <edition>, <editionStmt>, <editor>, <editorialDecl>, <encodingDesc>, <extent>, <fileDesc>, <change>, <idno>, <keywords>, <langUsage>, <language>, <notesStmt>, <principal>, <profileDesc>, <projectDesc>, <pubPlace>, <publicationStmt>, <publisher>, <resp>, <respStmt>, <revisionDesc>, <series>, <seriesStmt>, <sourceDesc>, <taxonomy>, <textClass>, <title>, <titlePart>, <titleStmt>.
The analytical potential of these elements in MONK is quite limited. Most of the bibliographical information supporting analytics will come from the simple 'factors' or categories of author, date, genre, sex, and region of origin. Not all of this information is in the header, and most of it is likely to enter the MONK space in a simplified form as part of a SIP (submission information package) that may accompany the TEI-A file.
|