This page last changed on Feb 22, 2008 by martinmueller@northwestern.edu.

Present: Amit, Bill, Joe, John N., Loretta, Phil, Tanya

We had a lively and inconclusive discussion that clarified (perhaps) some questions and raised some new ones. I hope the following remarks shape rather than distort the discussion. I'll repeat some things that seem to me to be the case or clearly on the agenda. Correct me if I am wrong. I'll also assign some tasks.

Our discussion focused on various ways of gathering samples at different levels of granularity: the word, the sentence, and higher discursive levels, whether defined as sentence sequences or structural XML elements.

The problem of the mid-level XML structure in a data store

The discussion was shaped by apparent agreement on these constraints: if you want to model a set of diverse TEI documents in a relational database, it is a relatively straightforward business to find consistent descriptions across many documents from the top down to the <div> level and from the bottom up to the sentence level. It is very difficult and may be impossible to develop a model that accurately and consistently reflects the intermediate hierarchy of elements around the <p> level with their multiple nestings and encoding options.

If you are primarily interested in analytics of varying kinds and if you have information in the datastore from the top to the div level and from the word to the sentence level, what information from the intermediate level is essential or important to analysis?

It seems to me we have edged towards agreement that you can sidestep the "model of the muddle in the middle" if you can extract from the intermediate hierarchy data that let the data store respond to questions of the following kind:

  • Is it verse or prose? (Does it occur in an <l> element or not?)
  • Is it spoken language or not? (Does it occur in a <said> element or is its grandparent a <sp> element?)
  • Is it part of an inserted document? (Does it occur inside a <floatingText> element? BTW I am virtually certain that in a TEI-A document <floatingText> elements will NOT be children of <p> elements)
  • Is it a case of correspondence? (Does it occur inside a <div> or <floatingText> with a @type="letter" attribute?)

If the data store can answer these questions, it can satisfy a lot of inquiries on the literary side of the humanities. This looks like a hodgepodge of criteria, but in the collections we are dealing with these may be the only elements and @type values that occur with sufficient frequency across a sufficiently large number of documents to become objects of analytical interest.
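The four questions above can be answered from a word's chain of ancestors. What follows is only a sketch under stated assumptions: the sample document is invented, it ignores the TEI namespace, and it assumes words are wrapped in <w> elements carrying an xml:id. The names `build_parent_map` and `word_flags` are hypothetical, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample; real TEI-A documents are namespaced and far richer.
SAMPLE = """<text><body>
<div type="letter"><p><w xml:id="w1">Dear</w> <w xml:id="w2">friend</w></p></div>
<div><sp><l><w xml:id="w4">To</w> <w xml:id="w5">be</w></l></sp></div>
<div><floatingText type="letter"><body><p><w xml:id="w7">Sir</w></p></body></floatingText></div>
</body></text>"""


def build_parent_map(root):
    """ElementTree has no parent pointers, so build a child -> parent map."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestors(elem, parents):
    """Return the ancestors of elem, nearest first."""
    chain = []
    while elem in parents:
        elem = parents[elem]
        chain.append(elem)
    return chain


def word_flags(word, parents):
    """Answer the four questions from the notes for a single word."""
    chain = ancestors(word, parents)
    tags = [e.tag for e in chain]
    grandparent = tags[1] if len(tags) > 1 else None
    return {
        "verse": "l" in tags,
        "spoken": "said" in tags or grandparent == "sp",
        "inserted": "floatingText" in tags,
        "letter": any(
            e.tag in ("div", "floatingText") and e.get("type") == "letter"
            for e in chain
        ),
    }


root = ET.fromstring(SAMPLE)
parents = build_parent_map(root)
flags = {w.get(XML_ID): word_flags(w, parents) for w in root.iter("w")}
for word_id, answers in flags.items():
    print(word_id, answers)
```

If flags of this kind were computed once at load time and stored with each word or sentence, the data store would not need to model the intermediate hierarchy itself.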

Words and sentences

The output of MorphAdorner explicitly marks sentence boundaries. It is therefore possible to identify sentences as minimal discursive units. It is possible to model information about sentences in the data store. John and Phil should develop a practical model for what kinds of information about sentences should be kept explicitly in the data store and propose it to the Analytics/DataCell.

Main text and paratext

We agreed some time ago that for some purposes many users will find it helpful to exclude the content of some elements from their analysis. The elements that are identified as paratext are the <front> and <back> elements of a TEI document and the following elements when they occur inside the <body>:

  1. core: <add>, <bibl>, <head>, <item>, <label>, <list>, <note>, <ref>, <respStmt>, <speaker>, <stage>
  2. drama: <castGroup>, <castList>, <role>, <roleDesc>
  3. figures: <cell>, <figDesc>, <figure>, <row>, <table>
  4. textstructure: <byline>, <docAuthor>, <docDate>, <docEdition>, <docImprint>, <docTitle>, <epigraph>, <titlePage>, <titlePart>, <trailer>

Users will have the ability to make "main text" or "all text" the object of their analysis. The paratext is by definition a hodgepodge of elements and not a useful object for analysis. In practice this means that the data store can answer the question:

  • Is it paratext or not?

and treats the paratext as a kind of "spam" that users may choose to ignore. Be aware that in Literary Studies one man's spam is another man's filet mignon.
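The paratext rule can be read as a test on a word's ancestor tags. The sketch below assumes the ancestor tags have already been collected as a list; the function name `is_paratext` and the example tag lists are illustrative assumptions. The set of element names is copied from the list above.

```python
# Element names taken from the paratext list in these notes.
PARATEXT_IN_BODY = frozenset({
    # core
    "add", "bibl", "head", "item", "label", "list", "note", "ref",
    "respStmt", "speaker", "stage",
    # drama
    "castGroup", "castList", "role", "roleDesc",
    # figures
    "cell", "figDesc", "figure", "row", "table",
    # textstructure
    "byline", "docAuthor", "docDate", "docEdition", "docImprint",
    "docTitle", "epigraph", "titlePage", "titlePart", "trailer",
})


def is_paratext(ancestor_tags):
    """True if a word with these ancestor tags counts as paratext."""
    tags = set(ancestor_tags)
    if "front" in tags or "back" in tags:
        return True
    return "body" in tags and bool(tags & PARATEXT_IN_BODY)


def select_words(words, scope="main"):
    """Keep all words, or only main-text words.

    `words` is a list of (word, ancestor_tags) pairs; `scope` is
    "main" or "all", mirroring the two user choices in the notes.
    """
    if scope == "all":
        return [w for w, _ in words]
    return [w for w, tags in words if not is_paratext(tags)]


example = [
    ("Hamlet", ["speaker", "sp", "div", "body", "text"]),
    ("To", ["l", "sp", "div", "body", "text"]),
    ("Preface", ["p", "div", "front", "text"]),
]
print(select_words(example, "main"))
print(select_words(example, "all"))
```

Storing the result as a single boolean per word keeps the decision reversible: a user who regards the paratext as filet mignon can simply ask for "all text".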

Sentence samples and discursive samples

The following points are still very much under deliberation. For many purposes, users will want to select text samples and construct them in ways that suit their purposes. Some users will choose text ranges for rating purposes. Such text ranges can in principle be defined as sequences of words, sentences, XML elements on the <p> level or as page ranges. We need to decide which "from here to there" procedure(s) offer the best combination of precision, user convenience, and ease of implementation. Some conversation involving Amit, John, Phil, and Stan might be helpful here. We are looking for the cheapest solution that will "satisfice" the needs of most users.

Other users might use samples for analytical operations of different kinds. This is one way of conquering scale. Operations on several thousand samples of sentences, sentence sequences, or paragraph-like units might take minutes to compute, and their results may either be definitive in themselves or sufficiently suggestive to justify operations on large text regions that might take hours or days.

For operations of this kind, the purpose of your inquiry will shape the definition of the sample. If your primary interest is in the "what" of a document, you may find a "bag of words" model informative or you may be interested in samples of contiguous sentences. If your interest is in the "how" of a document, a sample of discontiguous sentences may be as informative or more informative.

Does it matter whether a contiguous stretch of sentences stays within element boundaries or not? If sentence sequences have ragged edges at the element boundaries, does it matter? There was some disagreement in our discussion. One could argue that it does not matter as long as the sentences stay within a certain kind of discourse (plain prose, poetry, letter). But another person could argue that it would be very helpful to associate every paragraph in a work with metadata about the number and length of sentences it contains.

Seeing the samples

Our discussion raised an issue that we have never addressed before. If there are samples, can users see them, and how will they be displayed? We agreed that users will want to see them. Is the following a useful point of departure for implementation?

From the perspective of the data store, any text sample is a span of words that starts with one wordID and ends with another. These wordIDs may or may not coincide with element boundaries. John N. has argued that wordIDs inside the data store should correspond to wordIDs that exist outside the data store. If that is the case and a given sample starts at wordID XYZ345 and continues for 181 words, can the request for a display go to an XML version of the text, look for wordID XYZ345 and display it and the next 180 words?

If the start and end point of the sample are inside elements and the resulting text region is not a well-formed XML fragment, can one transform it on the fly into a well-formed HTML fragment that articulates minimal structural breaks? Minimal is quite minimal: as long as the line breaks of verse and the paragraph breaks of prose are observed, readers will get enough information to orient themselves.

Is this a matter for John, Amit, and Stan to follow up?
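As a point of departure for that follow-up, here is a minimal sketch of the "start wordID plus word count" display. It assumes an un-namespaced XML version of the text in which each word is a <w> element with an xml:id, and it ignores punctuation and whitespace elements. The function names and the sample document are invented for illustration. It never tries to cut out a well-formed XML fragment; it collects the words and rebuilds only the line breaks of verse and the paragraph breaks of prose.

```python
import html
import xml.etree.ElementTree as ET
from itertools import groupby

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample document.
SAMPLE = """<text><body>
<p><w xml:id="a1">One</w> <w xml:id="a2">two</w> <w xml:id="a3">three</w></p>
<lg><l><w xml:id="a4">Four</w> <w xml:id="a5">five</w></l>
<l><w xml:id="a6">six</w> <w xml:id="a7">seven</w></l></lg>
</body></text>"""


def nearest_block(word, parents):
    """Return the closest <p> or <l> ancestor of a word, or None."""
    elem = word
    while elem in parents:
        elem = parents[elem]
        if elem.tag in ("p", "l"):
            return elem
    return None


def render_span(root, start_id, n_words):
    """Render n_words words, beginning at start_id, as minimal HTML."""
    parents = {child: parent for parent in root.iter() for child in parent}
    words = list(root.iter("w"))  # document order
    ids = [w.get(XML_ID) for w in words]
    start = ids.index(start_id)  # raises ValueError for an unknown wordID
    span = words[start:start + n_words]

    pieces = []
    # Consecutive words that share a <p> or <l> ancestor form one unit.
    for block, group in groupby(span, key=lambda w: nearest_block(w, parents)):
        text = html.escape(" ".join(w.text or "" for w in group))
        if block is not None and block.tag == "l":
            pieces.append(text + "<br/>")  # verse: keep the line break
        else:
            pieces.append("<p>" + text + "</p>")  # prose: keep the paragraph
    return "\n".join(pieces)


root = ET.fromstring(SAMPLE)
fragment = render_span(root, "a2", 4)
print(fragment)
```

Because the output is assembled from the words rather than sliced out of the tree, a span that starts or ends inside an element still yields a well-formed fragment, with ragged edges showing up as a partial paragraph or a partial line.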

If the sample consists of discontiguous sentences, the user's ability to display the samples in different sort orders (e.g. by author and length) could be extremely illuminating.

It occurs to me that a sample of discontiguous sentences is productively envisaged as a "data frame" in "long data format" so that each sentence is a column entry in a data row that tells you about author, work, sex, date, origin, and genre. The problems of display and manipulation are very similar to "Search and Sort."
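A small sketch of that idea, using plain Python dictionaries as rows. All values are invented placeholders, and the field names simply follow the attributes listed above. Sorting by author and then by sentence length is one of the sort orders a user might ask for.

```python
# Each row is one sentence plus the metadata of the work it came from.
# All values are invented placeholders, not real data.
rows = [
    {"author": "Author B", "work": "Work 2", "sex": "F", "date": 1850,
     "origin": "Place Y", "genre": "fiction",
     "sentence": "This one is a somewhat longer sentence."},
    {"author": "Author A", "work": "Work 1", "sex": "M", "date": 1790,
     "origin": "Place X", "genre": "drama",
     "sentence": "A slightly longer line here."},
    {"author": "Author B", "work": "Work 2", "sex": "F", "date": 1850,
     "origin": "Place Y", "genre": "fiction",
     "sentence": "Short one."},
    {"author": "Author A", "work": "Work 1", "sex": "M", "date": 1790,
     "origin": "Place X", "genre": "drama",
     "sentence": "Brief."},
]


def sentence_length(row):
    """Length of the sentence in whitespace-separated words."""
    return len(row["sentence"].split())


def sort_sample(rows, *keys):
    """Sort rows by the given keys; each key is a field name or a function."""
    def sort_key(row):
        return tuple(k(row) if callable(k) else row[k] for k in keys)
    return sorted(rows, key=sort_key)


by_author_and_length = sort_sample(rows, "author", sentence_length)
for row in by_author_and_length:
    print(row["author"], sentence_length(row), row["sentence"])
```

The same rows could be re-sorted by date, genre, or any other column without touching the underlying texts, which is what makes the display problem resemble "Search and Sort."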
