This page last changed on Mar 02, 2008 by martinmueller@northwestern.edu.

The following is a set of observations and questions about the display of text in MONK. Text display is not an 'analytic' in itself, but how it is done has considerable implications for the effectiveness of analytical operations. Just about any question addressed to a text corpus sooner or later ends up with users looking at some text and reading it. More accurately, in a digital environment users will typically look at a lot of passages. The critical design question is how to reduce the time cost of locating one passage and going from it to others. Remember Ranganathan's fourth law of library science: "Save the time of the reader."

The major conclusion of the following remarks is that while the XML organization of a text is extraordinarily helpful for many purposes, it cannot be allowed to dominate analytical or display procedures. In other words, users must be free to choose text blocks that begin or end arbitrarily, and they must be able to see such text blocks whether or not the start and end points coincide with element boundaries that make for a well-formed XML fragment. (It may not be possible to display all encoding choices of the XML document in such an environment, but that will not be necessary.)

Must the display follow the XML hierarchy?

Text display comes in three sizes: small (KWIC), medium (snippet/paragraph), and large. In the print world, "large" is the double page of a book. In the screen world it is a more variable concept. If it follows the XML structure of the source document, it will be the <div> that contains paragraphs and may run for many print pages.

The line and the snippet/paragraph will nearly always take up less than a screen. How do you define the page? The KWIC context is typically defined by a number of characters to the left and right of the hit. It has physically justified but conceptually ragged margins. One could imagine extending that approach to the page and defining a digital page either statically, by "screen breaks" at fixed word intervals, or dynamically, as a fixed number of words (~150) that precede and follow a hit word.
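The two definitions of a digital page can be sketched in a few lines of Python. This is only an illustration: the function names, the token list as input, and the 300-word static page size are assumptions, not anything MONK prescribes.

```python
def word_window(tokens, hit_index, radius=150):
    """Dynamic page: the hit word plus up to `radius` words on either side."""
    if not 0 <= hit_index < len(tokens):
        raise IndexError("hit_index is outside the text")
    start = max(0, hit_index - radius)
    end = min(len(tokens), hit_index + radius + 1)
    return tokens[start:end]


def static_pages(tokens, page_size=300):
    """Static pages: 'screen breaks' at fixed word intervals (page_size is an assumed value)."""
    return [tokens[i:i + page_size] for i in range(0, len(tokens), page_size)]
```

Near the beginning or end of a text the dynamic window is simply shorter on one side, which is the "conceptually ragged" margin carried over from the KWIC line.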

If you prefer "conceptually justified" margins you have to deal with text segments of greatly varying size. "Two paragraphs up and down" produces highly variable output. A mixed model takes more programming: "the closest paragraph breaks within 150 words up or down" will produce fairly equal text segments in most cases.
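The mixed model admits more than one reading. The sketch below assumes one of them: take the 150-word limit on each side of the hit and snap each edge inward to the paragraph break nearest that limit, falling back on the raw word limit when no break lies within range. The function name and the representation of paragraph breaks as a sorted list of word offsets (ending with the text length) are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def snapped_window(breaks, hit, radius=150):
    """Return (start, end) word offsets around `hit`, end exclusive.

    `breaks` is a sorted list of word offsets at which paragraphs start,
    with the total text length as its last entry (an assumed representation).
    """
    text_len = breaks[-1]
    lo = max(0, hit - radius)
    hi = min(text_len, hit + radius + 1)

    # Upward: the earliest paragraph break that still lies within the limit.
    i = bisect_left(breaks, lo)
    start = breaks[i] if i < len(breaks) and breaks[i] <= hit else lo

    # Downward: the latest paragraph break that still lies within the limit.
    j = bisect_right(breaks, hi) - 1
    end = breaks[j] if j >= 0 and breaks[j] > hit else hi

    return start, end
```

With breaks at 0, 40, 120, 300, 420, 900, and 1000, a hit at word 350 yields the paragraph from 300 to 420, while a hit at word 600, deep inside a 480-word paragraph, falls back on the plain word window from 450 to 751.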

You can give up on the problem and simply deliver the div. But that leaves users with the task of wandering across a wilderness of text. The disadvantages of that approach are apparent from the current WordHoard implementation of nineteenth-century fiction, where users confront unbroken chapters. Some form of explicit segmentation and pagination is of great help in cutting down the time cost of readerly orientation.

Must digital pages necessarily be well-formed XML fragments? If you have

<div>
<p>blah</p>
<p>blah</p>
</div>

there is no problem. But if you have

blah</p>
<p>blah</p>
<p>blah</p>
<p> blah

why not transform it into

<p class="fade">blah</p>
<p>blah</p>
<p>blah</p>
<p class="fade"> blah</p>

No doubt the rules for this get complicated. But are there intrinsic technical difficulties in displaying such fragments, or is there a belief that one ought not to do this because the XML hierarchy is sacrosanct? Speaking for myself, I have no conceptual or aesthetic qualms about text display that fades in and fades out, perhaps greying out incomplete sentences at the beginning or end. In fact, there is much to be said in favor of fixed-length display segments with "introductory" and "extraductory" phrases.
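For the simplest case the transformation is not hard. The following sketch assumes a flat run of plain <p> elements with no attributes and no nesting; real TEI documents will need far more elaborate rules. The function name is hypothetical, and the class name "fade" is taken from the example above.

```python
FADE_OPEN = '<p class="fade">'


def fade_fragment(fragment):
    """Repair a run of <p> elements whose first and last paragraphs may be cut off.

    Assumes flat, attribute-free <p> tags only (an illustrative simplification).
    """
    text = fragment.strip()

    first_open = text.find("<p>")
    first_close = text.find("</p>")
    # A </p> before any <p> means the fragment starts inside a paragraph.
    if first_close != -1 and (first_open == -1 or first_close < first_open):
        text = FADE_OPEN + text

    last_open = text.rfind("<p>")
    last_close = text.rfind("</p>")
    # A <p> with no </p> after it means the fragment ends inside a paragraph.
    if last_open != -1 and last_open > last_close:
        text = text[:last_open] + FADE_OPEN + text[last_open + len("<p>"):] + "</p>"

    return text
```

Applied to the ill-formed fragment above, this returns the four-paragraph version with the first and last paragraphs marked for fading. A stylesheet can then grey them out.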

Sentence and text samples

Considered as a discursive unit for stylistic analysis, the sentence is much more important than the paragraph. It is an odd feature of TEI encoding that sentences are rarely marked. Given the difficulty of sentence splitting and the problem of concurrent hierarchies, especially in poetry, this is understandable. But sentences are powerful analytical objects, and making them available to users should be a major feature of MONK. A lot of work in MorphAdorner has gone into reducing the error rate of sentence splitting.

For many analytical purposes you may want to collect a sample of discontiguous sentences from some work(s) or work set(s). These will almost never add up to well-formed XML fragments.

Sequences of discontiguous sentences are most easily thought of as data frames in long form, where the sentence is contained in a column that identifies the target of observation and the other columns provide the factors that serve as variables in subsequent analysis. Users will want to look at such sentences and order them in various ways, whether by length, genre, date, author, or some combination of them.
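A minimal sketch of such a long-form table, using only the Python standard library, might look like this. The column names, the helper function, and the three rows are invented placeholders, not MONK data or a MONK schema.

```python
from dataclasses import dataclass


@dataclass
class SentenceRow:
    sentence: str  # the target of observation
    author: str    # factor
    genre: str     # factor
    year: int      # factor

    @property
    def length(self):
        """Sentence length in whitespace-separated words."""
        return len(self.sentence.split())


# Placeholder rows for illustration only.
rows = [
    SentenceRow("blah blah blah blah", "Author A", "fiction", 1850),
    SentenceRow("blah blah", "Author B", "drama", 1820),
    SentenceRow("blah blah blah", "Author A", "drama", 1840),
]


def order_by(rows, *keys):
    """Sort rows by one factor or by a combination of factors."""
    return sorted(rows, key=lambda row: tuple(getattr(row, k) for k in keys))
```

A call such as order_by(rows, "genre", "year") orders the sentences by genre and then by date; order_by(rows, "length") orders them from shortest to longest.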

Sequences of contiguous sentences are better modeled as blocks of fixed-length texts, e.g. 500 words. Such blocks by definition do not fit the XML hierarchy: they are likely to begin in the middle of one sentence and end in the middle of another, although you may for convenience's sake begin and end at sentence breaks. There are some powerful analytical advantages to defining text samples in so mechanical a fashion, which is presumably why linguists often do it that way: if the average number of paragraphs in the sample of author A is 2.5 and the average number in author B is 4.7, you have already learned or seen something useful. If you can hold text block size constant at the 'molecular' level of word occurrence, a lot of procedures are simplified. At the minimum, you can directly compare counts from one sample with those from another.
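The mechanical character of this sampling is easy to show in code. The sketch below cuts a token list into full 500-word blocks and counts a word in each; dropping the short remainder at the end is an assumption made here so that every block has the same size, and both function names are hypothetical.

```python
from collections import Counter


def fixed_blocks(tokens, size=500):
    """Cut tokens into full blocks of `size` words; the short remainder is dropped."""
    full = len(tokens) - len(tokens) % size
    return [tokens[i:i + size] for i in range(0, full, size)]


def counts_per_block(blocks, word):
    """Raw count of `word` in each block; comparable because block size is constant."""
    return [Counter(block)[word] for block in blocks]
```

Because every block has the same number of words, the raw counts can be compared directly across blocks, authors, or works without normalizing first.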

Investigators who select fixed-length text blocks for some analytical purpose may not want to see all of them, but they will want to be able to see some of them. They may want to check all of them if they suspect that a particular selection might produce outliers that should be removed.

Citation and location schemes as time savers

However texts are displayed, there has to be a stable citation scheme that is visible at all times and helps readers cut down the time cost of text navigation. Ideally, a citation scheme sticks closely to the structural articulation of a text. Thus a novel has volumes, chapters, and page numbers. A play has acts and scenes or at least acts. It may not be possible to retrieve chapter or act numbers algorithmically from the XML structure without a lot of editorial intervention. The best we may be able to do is follow actual page numbers or make up "pseudo pages," as Phil Burns has called them.

We may want to imitate one of the dumbest but most effective of all citation schemes, the "Stephanus" pages of Plato. Henri Estienne in 1578 published a complete edition of Plato. Every page was divided into between four and six regions. The Gorgias, for instance, goes from 447a to 527e, and a reference like Gorgias 501d pinpoints a citation with a span of approximately 50 words, which is precise enough for just about any purpose. It does not matter that the exact boundary between 501c and 501d is not marked on the page.

In a similar manner, dividing the digital page into quintiles will offer a good enough citation and location scheme. The digital text of Jane Austen's Emma, for instance, comes from a three-volume edition where each volume begins with a page 1. A reference like Emma 3.27.2 directs the reader's attention to a little above the middle of page 27 in the third volume. If the page and quintile numbers are clearly displayed, the time cost of finding something on a page drops from tens of seconds to seconds. That is a non-trivial gain if you look at a lot of passages.
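Computing such a reference is simple arithmetic. The sketch below assumes that a word's position on the page is known as a zero-based word index and that the quintile is derived from word counts rather than from lines or screen position; the function names are hypothetical.

```python
def quintile(word_index, words_on_page):
    """Return the 1-based fifth of the page that contains the word at `word_index`."""
    if not 0 <= word_index < words_on_page:
        raise ValueError("word_index must lie within the page")
    return word_index * 5 // words_on_page + 1


def citation(work, volume, page, word_index, words_on_page):
    """Build a reference of the form 'Emma 3.27.2' (work volume.page.quintile)."""
    return f"{work} {volume}.{page}.{quintile(word_index, words_on_page)}"
```

On a hypothetical 300-word page, citation("Emma", 3, 27, 100, 300) gives "Emma 3.27.2", since the word at index 100 falls in the second fifth of the page.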

The programmers will say at this point that they can highlight the hit word in color. And so they can. But that is of no use if I want to tell a colleague or students that a particularly fine example of Austen's prose is found at location X. But if I can say that in MONK it is at Emma 3.27.2, they can find it right away.

As Ranganathan said, "Save the time of the reader."

Side-by-side display of arbitrarily chosen passages

Franco Moretti's "distant reading" or my "not-reading" are important aspects of MONK. But neither claims to replace "close reading." You just get to close reading in different ways, and you may assemble different passages for that exercise. Close reading typically involves comparing one passage with another. For that you want to be able to have two arbitrarily chosen passages in the same field of vision. That is something impossible to do with the book, but it is quite easy to do on the screen. In this regard Literary Studies can learn from Art History, where the side-by-side display of slides has been an indispensable tool for decades. There is a lot of analytical potential in being able to see two things at once.

Highlighting words in text

You can use size, color, or both to highlight words in texts. There are, however, two problems with this technique. There is the part/whole problem, and there are the criteria by which something is highlighted. Highlighting all the adjectives in a novel is of limited use if I can see only one screen at a time. Thus highlighting is helpful only if the whole text fits on one screen or if I can treat a single text screen as if it were a whole. For instance, in a close analysis of a sonnet or a paragraph by Gibbon I might see new things by successively highlighting nouns, verbs, adjectives, prepositions, or conjunctions.

Why does this or that word get highlighted? Raw counts or even relative frequencies may be uninformative or even misleading because the highlighted phenomenon may be an ordinary feature of the language or of a particular genre. Highlighting is useful only if it isolates distinctive features of the text.
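One simple way to approximate "distinctive" is to compare a word's relative frequency in the text with its relative frequency in a reference corpus and to highlight only words that are markedly over-represented. The sketch below is an assumption about how this might be done, not a description of MONK's analytics: the function name, the add-one smoothing, and the threshold of 2.0 are all illustrative choices.

```python
from collections import Counter


def distinctive_words(text_tokens, reference_tokens, min_ratio=2.0):
    """Return words whose relative frequency in the text is at least
    `min_ratio` times their (add-one smoothed) relative frequency in the reference."""
    text_counts = Counter(text_tokens)
    ref_counts = Counter(reference_tokens)
    n_text = len(text_tokens)
    n_ref = len(reference_tokens)

    selected = set()
    for word, count in text_counts.items():
        rel_text = count / n_text
        # Add-one smoothing keeps words absent from the reference from dividing by zero.
        rel_ref = (ref_counts[word] + 1) / (n_ref + 1)
        if rel_text / rel_ref >= min_ratio:
            selected.add(word)
    return selected
```

A word like "the" may be very frequent in the text and still not be selected, because it is just as frequent in the reference corpus. More careful measures exist, and the choice of reference corpus (the language at large, a genre, an author's other works) determines what counts as distinctive.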

If the object of inquiry is a long text, it is probably more helpful to present distinctive phenomena in a table or a visualization derived from a table.
