This page last changed on May 10, 2007 by martinmueller@northwestern.edu.

The lexicon is a key element in the data structure of a Monk environment. Go back to the three questions answered in the tokenizing process:

  1. Where do you live?
  2. What do you look like?
  3. What do you stand for?

The three questions can be collapsed into two data points: a unique location is associated with a pointer that refers to a place in which all the relevant information about the location is to be found. This place in turn includes pointers. Imagine four unique locations inhabited by the spellings 'louyd', 'loued', 'loued', and 'loved'. All four locations point to the standard spelling 'loved'. But three of them include a pointer to the past tense of the verb 'love' while the fourth points to the past participle of the verb 'love'. All four of them point to the verb 'love' rather than the noun 'love.'
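The two data points can be sketched as a small data structure. This is only an illustration, not the actual Monk schema: the class names, field names, and location addresses are assumptions, and since the text does not say which three of the four locations carry the past tense, the sketch arbitrarily assigns it to the first three.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    """The 'place' a location points to (hypothetical field names)."""
    standard_spelling: str  # what the spelling is normalized to
    lemma: str              # dictionary headword
    word_class: str         # distinguishes the verb 'love' from the noun
    morphology: str         # e.g. 'past tense' or 'past participle'


@dataclass(frozen=True)
class Token:
    """One unique location, the spelling found there, and its pointer."""
    location: str           # made-up address format, for illustration only
    spelling: str
    entry: LexiconEntry


# Two shared entries: the tokens point at them rather than copying them.
LOVED_PAST = LexiconEntry("loved", "love", "verb", "past tense")
LOVED_PARTICIPLE = LexiconEntry("loved", "love", "verb", "past participle")

# Assumption: the first three locations are past tense, the fourth is
# the past participle. The source does not specify which is which.
tokens = [
    Token("text1:0001", "louyd", LOVED_PAST),
    Token("text1:0042", "loued", LOVED_PAST),
    Token("text2:0107", "loued", LOVED_PAST),
    Token("text3:0015", "loved", LOVED_PARTICIPLE),
]

for token in tokens:
    e = token.entry
    print(f"{token.location} {token.spelling!r} -> {e.standard_spelling!r}, "
          f"{e.morphology} of the {e.word_class} {e.lemma!r}")
```

Because each token holds a reference to a shared entry, correcting an entry once corrects it for every location that points to it.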

The aggregate of such referring events is the lexicon. It is a hierarchical structure in which the spellings resident at token addresses point ultimately towards lemmata via the association of standard spellings with particular morphosyntactic conditions. The structure can be imagined as an abstract system telling you that "'louyd' is a possible past tense or past participle of the verb love." You can also associate it concretely with particular counts that let you say things like "'louyd' occurs 234 times in 58 texts between 1470 and 1589 and is the most common spelling of the past participle of 'love' in that period." (This example is made up.)
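A count-based statement of this kind could be derived roughly as follows. The sketch assumes each occurrence is stored with a text identifier and a year; the record layout, the function name `spelling_profile`, and all the data are invented for illustration and do not come from any Monk collection.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Occurrence(NamedTuple):
    text_id: str     # hypothetical identifier of the containing text
    year: int        # publication year of that text
    spelling: str
    lemma: str
    morphology: str


# Invented toy data, far smaller than any real collection.
OCCURRENCES = [
    Occurrence("A", 1475, "louyd", "love", "past participle"),
    Occurrence("A", 1475, "louyd", "love", "past participle"),
    Occurrence("B", 1520, "louyd", "love", "past participle"),
    Occurrence("B", 1520, "loued", "love", "past participle"),
    Occurrence("C", 1600, "loved", "love", "past participle"),
    Occurrence("C", 1600, "loved", "love", "past tense"),
]


def spelling_profile(occurrences, lemma, morphology, first_year, last_year):
    """Map each spelling to (occurrence count, number of distinct texts),
    ordered from most to least frequent, within an inclusive year range."""
    counts = Counter()
    texts = defaultdict(set)
    for occ in occurrences:
        if (occ.lemma == lemma
                and occ.morphology == morphology
                and first_year <= occ.year <= last_year):
            counts[occ.spelling] += 1
            texts[occ.spelling].add(occ.text_id)
    return {s: (n, len(texts[s])) for s, n in counts.most_common()}


profile = spelling_profile(OCCURRENCES, "love", "past participle", 1470, 1589)
for spelling, (n, n_texts) in profile.items():
    print(f"{spelling!r} occurs {n} times in {n_texts} texts")
```

The result holds only spellings and counts, never the surrounding text, so aggregate statements can be produced without exposing any passage from the sources.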

There are several things worth noting about such a lexicon. It is built up from the ground of actual word occurrences and grows with those occurrences. It helps you make sense of the occurrences. Aristotle in the Poetics distinguishes between common and rare words. The lexicon is very precise about how common or rare any spelling, wordform, or lemma is for a given work, author, or time period. You may not care very much about the orthographic variance of past participial forms of 'love.' On the other hand, you may be deeply interested in the distribution over time of 'liberty' and 'slavery', but you can't get to the top-level forms of 'liberty' and 'slavery' except through the orthographic and morphosyntactic variants of those lemmata.

The second point is of a legal kind. The first Monk Lexicon will be built from both public domain and proprietary sources. Where a source is proprietary, the owners will clearly limit the display of even text snippets to users who have purchased access rights. Thus to go from the abstract spelling 'louyd' to its actual occurrence and context in some TCP text from the 1520s will require the owner's permission until the texts enter the public domain. On the other hand, the statement that "'louyd' occurs 234 times in 58 texts between 1470 and 1589 and is the most common spelling of the past participle of 'love' in that period" is almost certainly a public domain statement. And there is a great deal of query potential in the aggregate of such possible statements, even for those users who cannot go from such aggregate data to particular passages in particular authors.

Thirdly, a lexicon is transferable from the first Monk collection to the next. This is an effortless process if the next collection uses the same lemmatization and POS tagging scheme. It requires some effort if different POS taggers and lemmatizers are used. But the transferability of lexical information from one collection to another will be a big benefit to subsequent users. It also makes it doubly important to get the first collection right.
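The "some effort" case can be pictured as a tag-mapping step. The sketch below is an assumption about how such a transfer might work, not a description of Monk's tooling: the function name, the tuple layout, and both tag sets are hypothetical (the target labels are Penn-Treebank-style tags used purely as an example).

```python
# Hypothetical mapping from the first collection's morphological labels
# to the labels used by the next collection's tagger.
TAG_MAP = {
    "past tense": "VBD",
    "past participle": "VBN",
}


def transfer_entries(entries, tag_map):
    """Re-tag (spelling, lemma, tag) triples for a new tagging scheme.

    Returns two lists: entries that could be re-tagged automatically,
    and entries whose tag has no mapping and needs manual review.
    """
    transferred, unmapped = [], []
    for spelling, lemma, tag in entries:
        if tag in tag_map:
            transferred.append((spelling, lemma, tag_map[tag]))
        else:
            unmapped.append((spelling, lemma, tag))
    return transferred, unmapped


entries = [
    ("louyd", "love", "past tense"),
    ("loved", "love", "past participle"),
    ("louing", "love", "present participle"),  # no mapping defined above
]

transferred, unmapped = transfer_entries(entries, TAG_MAP)
print("transferred:", transferred)
print("needs review:", unmapped)
```

When both collections share a scheme, the mapping is the identity and nothing lands in the review list; the effort grows with the number of tags that have no clean counterpart.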
