This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Introduction

The following is an attempt to move forward on the question of repetitions and create a framework for making some decisions on what to support or not. 'Repetitions' has been our shorthand for multiword expressions (of whatever length) that occur more than once. I write this in the double column mode in the hope that others will add to it. On the left column I try to discuss things in plain English (or some semblance of it). The right side is for Tech Talk. As I read this over, I suspect my colleagues will tell me that towards the end I departed way too far from plain English.

I hope that in the course of the left column I have articulated all of the 'must haves' and some of the 'may haves' that have been raised by one person or another in the course of discussion. So now the task remains to figure out whether the left column has left things out and whether its poetry of desire can be accommodated by the prose of implementation on the right.

The problem of repetitions comes in two flavors: ordinary and extraordinary. All language is shot through with repetition at all levels of organization from the phoneme up. Consider the fact that when you talk for an hour you are constantly repeating approximately forty phonemes in only a tiny percentage of the combinations that are mathematically possible.

Written language has a much lower tolerance for repetition than spoken language, and works as different as Thucydides' Peloponnesian Wars, the Aeneid, Das Kapital, or Emma (where the author punishes the heroine for her own fun at the expense of the loquacity of the Misses Bates) are remarkably similar with regard to the threshold beyond which a repetition is perceived as either a blunder or a special effect. Ordinary repetition refers to the kind of things that for the most part happen below the threshold of attention and are either part of the language or the result of the combination of habits and choices that make up a writer's idiolect.

Extraordinary repetition refers to cases where either the readers or the writer or both make a fuss about the fact that the ordinary thresholds for repetition in written discourse have been violated. The oldest and most famous case is Homer. From the beginnings of Homeric criticism in antiquity scholars have pointed to the odd ways in which the poet repeats himself, and parodies or imitations of Homer from Vergil through Fielding to Joyce invariably focus on repetition.

It has been an endlessly debated question whether Homeric repetition is intentional or whether the surprise of the literate reader is an inappropriate response to a poetry that has its roots in an oral tradition with quite different repetition thresholds. On the other hand, there is no question that Gertrude Stein repeats herself on purpose (perhaps under the influence of Southern oral tradition) and that the essence of her style is an aggressive, continuous, and systematic violation of the thresholds of written discourse, as witnessed by her signature sentence "Rose is a rose is a rose." No writer before her made such a fuss about repetition.

Whether inquiries into ordinary and extraordinary forms of repetition can be handled by the same computational routines is a question to which the answer on the whole is probably 'yes'. But it seems preferable to get a handle on what kinds of challenges must be met in the the domain of ordinary repetition before moving to extrme cases and figuring out what special routines, if any, they require.

I don't think it is necessary, and it may not even be helpful, to base the discussion of ordinary repetition on specific use cases. Phrasal phenomena are a powerful aspect of language and a research topic of considerable interest with a technical language of their own, witness 'multi-word expression' with its own acronym (MWE). While it is the case that the majority of lexical items are single words, the language is shot through with lexical items that cross word boundaries. In Latin, Greek, and German, compound verbs are typically single words. In English, you 'put on' rather than 'onput' or 'put off' rather than 'offput', but these are clearly lexical items.

There are fluid boundaries between MWE's that are clearly fixed lexical items and MWE's that are phrases or idioms that may allow for variance of one king or another. It is prima facie plausible that the phrasal repertoire of a text offers a powerful guide to the habits and choices of writers. It may well be more telling than their single word lexicon. If a text has 'features' that one identifies, extracts, and analyzes, then MWE's rank very high on the list of desirable features.

I take it, then, that there is a 'must-have' for MONK, and it consists of capturing across an entire MONK environment the repertoire of phrasal phenomena, identifying them, labeling them as we now label single-word tokens, and creating capabilities for manipulating and analyzing them.

Tech Talk

Ordinary repetition

A text is a sequence of words "ordered together" (the literal meaning of 'syntax'). The most common forms of text analysis throw away the syntax and manipulate the text as if it were a collection of single tokens (the bag of words model). The working hypothesis is that each token is a discrete unit that carries independent meaning. It is a miracle that for a large variety of purposes so grossly reductive a textual model works as well as it does.

Repetition is not a problem within the range where the bag-of-words model works. You count the single tokens, map their distribution, and notice patterns. Of the nine occurrences of 'sad' in The Merchant of Venice eight occur in the opening scene, and one of those eight is emphatically placed at the end of the first line. Is this going to be a comedy?

But repetition becomes a problem as soon as you push against the limits of the bag-of-words model. In fact, you can make a plausible argument that the problems of ordinary repetition are caused by what is left out in the bag-of-word model but cannot really be ignored. And these problems arise pretty much in the same way in all texts.

Tech Talk

Multi-word expressions: Named entities

In one of his books John Sinclair writes about the shortcomings of an 'open choice' model of language in which a speaker puts words together according to the rules of syntax and at any point in a discourse is theoretically free to choose whatever is lexically and syntactically legal. He argues that the facts are otherwise, that an 'idiom principle' is at work and that in practice speakers and writers typically work with prefabricated multi-word units of various kinds.

Multiword expressions (MWE) divide roughly into two types: noun equivalents and phrases. The Book of Genesis tells you that God brought the animals to Adam "to see what he would call them: and whatsoever Adam called every living creature, that was the name thereof." Naming continues to be a central human activity, but what with the proliferation of things, substances, species, procedures, or institutions that are created or discovered daily single-word names have gone the way of seven-digit telephone numbers. A subfield of information retrieval, called "named entity recognition" or "entity extraction," concerns itself centrally with the fact that two or more more words are often used to refer to a single thing in the world, whether the Earl of Rochester, sulfur dioxide, or the United States of America. Apart from their practicality, acronyms, whether MONK or POTUS, may be a nostalgic residue of the Adamic dream that a single thing should have a single name: lion, elephant, serpent...

Not all multi-word noun equivalents are 'named entities' in the strict sense. 'OS X' is certainly a named entity, 'operating system' is not. But language has wide marshlands, and there is a pretty swampy stretch between what are clearly names and clearly nouns.

Multi-word named entities have one great advantage over other repeated phrases: you don't need frequency information to identify them. Instead you identify them by the putative properties that such phrases have, whether proper names in succession, proper names preceded by a title, with or without an intervening preposition, and the like.

Tech Talk

Phrases and other multiword expressions

For my purposes here, whatever is not a multi-word noun equivalent is a 'phrase', and phrases range from lexical items that just happen to be written as separate words (and sometimes are written as single words, e.g. insofar goodbye) to idioms and proverbial expressions that may be completely frozen or permit variation. Phrases tend to have little direct value for the purposes of information retrieval narrowly construed: they don't add much to your knowledge of what a text is about. But the phrasal repertoire of a text may tell you as much or more about a writer's habits and disposition than the lexical repertoire. The present POTUS is exceptionally fond of the phrase "Make no mistake about it." The frequency of this phrase tells you nothing about the 'what' of his views but much about the 'how' of his thinking.

I take it as axiomatic that the 'qualitas' or 'howness' of texts, whether at the level of the individual text, an author, a genre, a period, or some intersection of those parameters, will be a central concern of many MONK users. Whatever 'features' can be extracted to help users get a better grasp of the 'howness' of texts are must-haves for MONK.

The upshot of this is that we need 'phrasal recognition' procedures as much as or even more than named entity recognition procedures. There is, however, a very real difficulty with such recognition procedures. Phrases are defined solely by their frequency; unlike multiword named entities they not contain properties that are shared across all or most of them. If, as they say, the dose makes the poison, so frequency makes the phrase. But a phenomenon of uncertain size and shape and solely defined by its frequency is a very difficult animal to get hold of.

Tech Talk

N-grams

N-grams are written sequences of something: bigrams, trigrams tetragrams, twelve-grams, whatever. The something can be a character: word tokens are character n-grams with an analytical potential that we have not yet explored in MONK.

In MONK parlance, n-grams refer to sequences of word tokens. N-grams are the programmer's way of coming to terms with multi-word expressions. The latter always make sense in terms of their lexical and syntactic properties. The former are generated by defining n and, moving from beginning to end token by token, creating the n-gram that starts with the current token. Most n-grams make no sense, but you use them to hunt for the multi-word expressions.

From the discussion in the FeatureLens technical report and from browsing a little on the Web, I gather that multi-word expression hunting goes something like this. There are techniques that let businessmen figure out whether there are patterns in their customers' behaviour. Millions of customers purchase millions of items. Each shopping trip that leads to a transaction produces a shopping cart with at least one item in it. The transaction that is recorded in the cash register becomes a transaction in a database, which now reports that Customer A bought a specified set of items.

A lot of customers buy a lot of items, and it is inevitable that customer A's purchase of x and y on the same shopping trip is repeated by Customer B on some other shopping trip. So there are a lot of 'patterns' out there whose pursuit is unlikely to improve your profits. And you establish a frequency threshold below which you ignore the pattern because it does not have enough 'support'. If a pattern occurs above the support level, it is a 'frequent item set'. If you notice that a given pattern always occurs in conjunction with another pattern, you can merge the two patterns into one and end up with a 'closed item set'. These are ways of reducing the noise level of irrelevant patterns.

If you apply this model to text, the writing of a paragraph or other text chunk is a 'transaction' in which the writer buys items from the Word Store. In the FeatureLens case, the items are trigrams, and a paragraph is a shopping cart of trigrams. The more trigrams two paragraphs share the more they are alike.

The technique is ingeniously robust. If two passages don't quite match, their trigram patterns don't quite match. There are, however, a number of questions about its application to textual analysis. Some of them have to do with what the system may be said to 'know' about the relationship of trigrams in a pattern. In the State of the Union example of the FeatureLens Demo, the patterns are defined as sets of trigrams. The largest of these sets (somewhat confusingly called 'longest') consists of 12 trigrams. For the sake of simplicity, I choose a smaller example. In the State of the Union addresses for 2005, 2006, and 2007, the President used the phrase "cut(ting) the deficit in half by 2009." In FeatureLens this event is modelled as follows:

the deficit in half by 2009
the deficit in
  deficit in half
    in half by
      half by 2009

I think because the "support" level was set at 3 and the trigrams were created from literal rather than lemmatized strings, the system does not recognize that the repeated phrase is actually longer than shown. As they occur in each paragraph, the trigrams are marked in red, and to the extent that they overlap, they form in each of the three occurrences the hexagram 'the deficit in half by 2009'. What is not clear to me, however, is whether the system 'knows' that there is a hexagram 'the deficit in half by 2009', which is repeated three times with specific start and end points. If the system stores the trigrams as a set without their start locations, the proper display of the hexagram may be said to be an accidental by-product. If the trigrams are stored with their start point then the system may be able to recognize that the trigrams are in sequence and form a tetragram and a pentagram that are subsumed in the final hexagram.

That, I think, is what you want from a phrase recognition system. You want it to say "there is a phrase 'the deficit in half by 2009.'" It is six words long, and it occurs, here, there, and there. And you want to able to ask the system to show you the longest phrase shared between this document and that document and whatever other questions can be addressed to a system that knows about an entity with certain properties, in this case a multi-word strong of known frequency, known length, and known tokens.

Is that possible within the current capabilities of D2K?

As a curiosity, I note the difficulties that arise when you try to spot the resemblance between the two phrases "the deficit in half by 2009" and "the deficit by half in 2000." The human reader instantly recognizes the pattern "the deficit . half . 2009." But these two phrases share no trigrams, and they share only one bigram. The same six unigrams are used within a space of six words, with two words switching position. How does one deal with that?

Tech Talk

What can be achieved on the basis of n-grams?

The fundamental question that arises from the discussion above is whether or how you can use short n-grams to identify repeated phrases of varying length across an entire Monk environment and then 'catalogue' them for retrieval by various properties that are associated with them, such as length, frequency, composition of its parts, and status (e.g. named entity).

The underlying idea here is that the MONK lexicon as the catalogue of all discrete entities includes not only the unigrams known as words but n-grams of varying size and status.

Tech Talk

How frequent is frequent enough?

Wilamowitz, the greatest classical scholar of the late nineteenth and early twentieth century, once said that "Einmal is keinmal. Zweimal ist immer" or "once is never, twice is forever." This was spoken out of a keen sense of the fragmentary nature of the texts that have survived from antiquity. They are a very small sample of what once was there. Of a single item in those fragmentary remains you could not tell whether it was a nonce phenomenon. But if something occurred twice, the assumption that the only two instances survived accidentally stretches belief. If two survived, there must have been a lot more. Thus "twice is forever" and thus frequent enough.

The repertoire of documents in MONK environments is by and large much less fragmentary. So the question of support levels for frequent item sets poses itself somewhat differently. I have no answer to the question, but it is a question worth posing. At the level of the unigram, we keep track of all occurrences from 1 to infinity. The concept of the unique occurrence, the nonce word or "hapax legomenon" of Homeric epic, is well understood. Should we therefore have "nonce repetitions." If an n-gram occurs twice, it is a nonce repetition and therefore deserves to be recognized. Or is that madness?

Tech Talk

Syntactic patterns

The phrasal repertoire of a text consists not only of multi-word units with a specific lexical content. It exists just as importantly as a preference for or avoidance of particular ways of putting words together. "Handsome, clever, and rich" from the opening sentence of Emma is a famous example of a syntactic choice that has high value as a stylistic marker. If we can describe textual texture from one perspective as the interweaving of common words in a particular frequency distribution, we can also describe it as an interweaving of common syntactic patterns in a particular frequency distribution. Graduate students in my department once wrote a parodistic memo under my name. They were quite good at capturing my writerly habits, and I was not displeased since in their gently mocking portrayal of how they saw me through my syntax I recognized by and large a kind of person I don't mind being. Thus the repertoire of syntactic habits and choices in a given text is a very powerful indicator, and if there is a way of capturing it, it is well worth doing.

The simplest measure (not to be despised for its crudity) is to count the length of sentences and of the words in them. In an ideal world a text would be not only tokenized but also syntactically parsed. That appears unachievable, and even 'shallow' forms of parsing may be beyond the scope of the project. The question is whether n-grams at the level of syntax can be used to identify recurring syntactic fragments of varying length. If such fragments can be identified it is a reasonable assumption that once can extrapolate from the fragments to larger syntactic structures. (If one goes that route it is probably wise to give the verb 'be' its own tag set because its syntactic funtions are unique).

A trigram in a given location will have a lexical and syntactic description. 'Beautiful evening' will have the tag j-n1 wherever it occurs. 'Loved her deeply' could be vvd-pn-av or vvn-pn-av depending on the context. The 'lexicon' of tags is much smaller than the lexicon of words, especially if one takes them at their coarsest level. Thus one can set the 'support' level for 'frequent item sets' much higher.

If a text is modeled in a skeletal fashion as a sequence of tags we are moving in the direction of bio-informatics. If a text is modelled on no more tags than there are letters in the alphabet (not an unrealistic assumption) we may be in a world that from technical and substantive angles has a lot to do with genome sequences and can perhaps use some of its techniques.

Tech Talk

Possible uses of character n-grams

Perhaps we should take a look a character n-grams as having non-trivial potential in getting at the 'how' of a text. Character n-grams are the most primitive of all n-grams: they track the letter sequences of the word tokens. The Roman alphabet is used to express many different languages. The possible combinations of letters from that alphabet differ sharply from one language to another, and the distribution of possible combinations differs just as sharply. Thus a relatively crude tabulation of character n-grams will allow you to say with great certainty: "This must be German," or "I don't know what this is, but it cannot be English."

This works because words in a given language have characteristic patterns, especially at the end and at the beginning. If you scroll through a large word list arranged in the manner of a rhyming dictionary, you move from the world of -ed to through worlds of -iage, '-ness, 'tion', -tude, and the like. This was well known to the long-suffering Mrs. Gradgrind in Dickens' Hard Timnes, who would speak of her husband's -ologies.

The character n-gram profile of a text is probably quite telling and seems like a relatively simple thing to produce. It is as crude as counting the length of words and sentences but it may give you a quite nuanced overview of both lexical and syntactic habits. This is a feature where raw or relative infomrmation is not very informative. You need directly comparative information, such as Dunning's log likelihood ratio provides.

Tech Talk

Extraordinary Repetition

I now turn to extraordinary repetition, by which I understand the presence in a text of repetitive elements that lie clearly outside the narrow thresholds of tolerance that written language has for extended repetition. Plagiarism is a good example for that narrow tolerance. In a plagiarism case, the facts are rarely in dispute, and arguments that a given phrase is coincidental or conventional do not survive the third iteration. The question then becomes whether the repetition is unconscious and innocent, fraudulent, or legitimate in a particular discursive environment (I once was approached by a Minnesota radio station that had discovered Garrison Keillor's liberal and unacknowledged borrowings from various local histories).

Homer and Gertrude Stein are very conspicuous examples of extraordinary repetition. The phenomenon is central to the 'how' of their texts, and not to talk about it would be like talking about Christianity without mentioning Jesus. There are actually not very many writers like that.

Tech Talk

The Chicago Homer

I want to describe briefly what we did in the Chicago Homer almost a decade ago (www.library.northwestern.edu/homer). This is a heavily customized and author-specific application, modeled in a relational database environment, and designed to capture with as much precision as possible the quantity and variety of repeated phrases in Early Greek epic. We have now moved Gertrude Stein's Making of Americans into that environment, and while there are some quirks, the implementation is good enough to test whether in principle it is useful with that text.

Early Greek epic (largely the Iliad and Odyssey plus a few shorter texts) consists of some 32,000 lines of hexameter and ~230,000 words (Stein's Making of Americans has more than half a million words). The text is full of repeated bigrams, chiefly of the epithet-noun type (rosy-fingered dawn, winged words, swift-footed Achilles). There are longer repetitions that range from half a line to ~twenty lines or 123 words, but there are fewer than two dozen repeated passages that are more than five lines long.

Craig Berry, a Northwestern English PhD and Spenser scholar, who makes his living as a computer consultant, wrote a perl script to extract repetitions. The script was based on the concept of the 'independently recurring substring' (I think I owe a debt to Ian Lancashire here). This concept seems to me quite similar to the D2K "closed frequent item set." Given a repeated string 'abcd' (where each letter stands for a word token in the text), the substrings 'ab', 'bc','cd', 'abc', 'bcd' are ignored unless they occur outside of 'abcd'. Thus if 'ab' only occurs as part of 'abcd' it is ignored. But if there is a repeated string 'abcd' and another string 'abfg', the program records two occurrences of 'abcd' and three occurrences of 'ab'.

The program was run on a lemmatized version of the text. Greek is a much more heavily inflected language than English, and accordingly a lemmatized version of the text is a morphologically much more reduced version. The input was a tab file that associated a particular word occurrence with its unique address and its lemma.

The program is computationally intensive. Ten years ago it took two days to process the 230,000 words of Greek. It spent some of its time assembling repeated strings of varying length, but most of its time throwing out substrings that did not belong. When it was done it produced an output which consisted of a string, the length of the string and the start points of each repetition occurrence. There is some post processing that generates from this central output the various tables that provide the query potential in a relational environment.

Bill Parod then wrote a quite simple but stunningly successful program that permitted the visual display and linking of repeated strings in the text, including the odd ways in which they overlapped. Thus assuming that the original listener of Homeric epic 'heard' certain resonances as one phrase echoed another, the modern reader, not trained in the patterns of oral composition, could 'see' those resonances in a given verse paragraph and by following the links of repetition trace what I somewhat pretentiously called 'the network of bardic memory.'It was quite early Web technology, but it still works better than anything else in that field.

A decade later it took some thirty hours to process Gertrude Stein's text, which is more than twice the size (the load on the program is geometrical or worse rather than linear). So this is an intrinsically expensive procedure, although Craig, an old VMS hand and at the time new both to NLP programming and perl, is the first to admit that it could be written much more efficiently today. You would need something that is quicker by two or three orders of magnitude to make this approach scale to a substantial number of texts.

What the program delivers to the scholarly user is an inventory of all repeated strings, searchable by location, frequency, word, and whatever other properties you choose to encode in the source text.

The program does not understand 'near repetitions'. For instance, in the opening book of the Iliad there is a line that means "threw many valiant souls into Hades" and goes like

pollas d' iphthimous psychas Aidi proiapsen

It is repeated almost verbatim in Book 11, with 'heads' substitude for 'souls' and the ubiquitous Greek particle 'de' omitted:

pollas iphthimous kephalas Aidi proiapsen

Once you direct his attention to it, the human reader immediately sees the string

pollas . iphthimous . Aidi proiapsen

But the program sees only the identical verse-terminal bigram. Fortunately, that is not a very common phrase, and once you follow up on it you see that the two lines are nearly the same.

Overlapping trigrams perform even worse here, because the two lines, obviously the 'same' for human readers, share no trigrams at all:

pollas d' iphthimous psuchas Aidi proiapsen
pollas d' iphthimous
  d' iphthimous psuchas
    iphtimous psuchas Aidi
      psuchas Aidi proiapsen

For the repetition there is

pollas iphthimous kephalas Aidi proiapsen
pollas iphthimous kephalas    
  iphthimous kephalas Aidi  
    kephalas Aidi proiapsen

What they do share is four unigrams in the same order within a five- or six-word range. Bigram analysis would at least have caught the shared final bigram.

Tech Talk

Conclusions

Two conclusions seem to me emerge from this discussion.

The first is that extraordinary repetition does not differ qualitatively from ordinary repetition. There is just a lot more of it, and some of it is a lot longer.

The second is that if we want to identify all repeated multi-word expressions of whatever length and treat them as if they were unigram tokens, surrounded by metadata that support deep querying, that may be a hard goal and beyond our current powers.

On this point I would like to be proved wrong.

Tech Talk

Document generated by Confluence on Apr 19, 2009 15:04