This page last changed on Feb 20, 2008 by martinmueller@northwestern.edu.

What information do we want to know about each word?
The following was taken from John Norstad's Data Cell meeting notes from April 17, 2007. Let's further discuss this topic in this space.

Bill: Lexical Data - What information do we want to know about each word?

identifier - a string that uniquely identifies the word
address - its unambiguous position/extent in the text
spelling - the word token string
standard original form - From Martin's memo:
"The value of the spe attribute is usually identical with the value of the tok attribute, but sometimes it is not. Look at 'common|lie,' where the vertical bar is the SGML representation of a soft hyphen at the end of a line. The value of the spe attribute is the original spelling as it would ordinarily appear in a text from that period. In this case that is 'commonlie.' The spe attribute is also used to resolve printer attributes or odd spelling conventions that are not found in this stretch of text but are very common. Thus 'y^t' becomes 'that', '&abper;ficit' becomes 'perficit', and other printing conventions are similarly written out in their contemporary rather than modern form (although these will often be the same)."
standard modern form - From Martin's memo:
"The value of the reg attribute is the standard modern orthographic form of the original spelling. But the morphological form is not modernized. Thus a spelling like 'lovyth' would be regularized to 'loveth', but 'loveth' would not be regularized to 'loves' but is recognized as a standard archaic form."
lemma - The lemma or dictionary headword for the word
pos - The part of speech
sentence boundary - indicate whether the word ends a sentence
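
To make the list above concrete, here is a minimal sketch of one word record in Python. It is an illustration only, not the Monk datastore schema: the class name `WordRecord`, the field types, the identifier format, the use of character offsets for the address, and the helper `resolve_soft_hyphen` are all assumptions. The tok and spe values come from Martin's 'common|lie' example; the reg, lemma, and pos values are assumed for illustration. Resolving abbreviations such as 'y^t' would need a lookup table, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WordRecord:
    """One word occurrence with the attributes listed in the notes."""
    identifier: str           # unique id for the word (format is hypothetical)
    address: Tuple[int, int]  # assumed encoding: (start, end) character offsets
    tok: str                  # spelling: the word token string as transcribed
    spe: str                  # standard original form
    reg: str                  # standard modern form
    lemma: str                # dictionary headword
    pos: str                  # part of speech tag
    ends_sentence: bool       # sentence boundary flag


def resolve_soft_hyphen(tok: str) -> str:
    """Drop the vertical bar that marks a soft hyphen at a line end."""
    return tok.replace("|", "")


record = WordRecord(
    identifier="w0001",                     # hypothetical id
    address=(0, 10),                        # hypothetical offsets
    tok="common|lie",                       # from Martin's memo
    spe=resolve_soft_hyphen("common|lie"),  # 'commonlie', as in the memo
    reg="commonly",                         # assumed modern form
    lemma="commonly",                       # assumed lemma
    pos="av",                               # assumed tag, for illustration
    ends_sentence=False,
)
```

Keeping tok separate from spe and reg means the transcribed token, including its capitalization, is never overwritten by a normalized form.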

We want to capture all of the above. This information also forms a reference lexicon, a central resource that is useful in its own right.

Loretta mentioned the importance of preserving capitalization. Bill and Pib reassured her that the spelling attribute does indeed retain the original spelling of the word token, including all of the capitalization.

John raised the issue of contractions, words which have more than one lemma and part of speech. This has always been an important issue for Martin. An example is the first word of Hamlet, "who's". This is a single word, a single lexical token, but it has two parts. The first part is an instance of the lemma "who" with NUPOS part of speech "q-crq". The second part is an instance of the lemma "be", with NUPOS part of speech "vaz". Pib mentioned that MorphAdorner knows how to deal with these kinds of words, and emits multiple lemma and part of speech tags for them, as in the example from Hamlet.
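
One way to picture the contraction case is a single word token that carries an ordered list of parts, each with its own lemma and part of speech. The sketch below is an assumption about how a client might model this, not MorphAdorner's output format; the names `WordPart` and `Word` are hypothetical. The lemma and tag values are the ones given above for "who's".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WordPart:
    """One lemma / part-of-speech pair inside a word token."""
    lemma: str
    pos: str


@dataclass(frozen=True)
class Word:
    """A single lexical token that may contain several parts."""
    spelling: str
    parts: Tuple[WordPart, ...]

    def lemmas(self) -> List[str]:
        # Lemmas in the order the parts occur in the token.
        return [part.lemma for part in self.parts]

    def is_compound(self) -> bool:
        # True when the token has more than one lemma/pos pair.
        return len(self.parts) > 1


# The example from the notes: one token, two parts.
whos = Word(
    spelling="who's",
    parts=(WordPart("who", "q-crq"), WordPart("be", "vaz")),
)
```

With this shape, a word count still sees one token, while a lemma search for "be" can still find it.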

John also raised the issue of keeping track of word order and word proximity, to make it possible to answer questions involving collocation, n-grams, and general morphological pattern matching searches. Steve, Amit and Pib discussed the facilities available for doing these kinds of tasks within existing search engine products like Lucene. Do we need to concern ourselves with this issue in the Monk datastore proper?
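
To illustrate what preserved word order makes possible, the following sketch computes n-grams and simple window-based co-occurrence counts from an ordered token list. It uses plain Python rather than Lucene or any other search engine API, and the function names `ngrams` and `cooccurrences` are hypothetical. Whether this belongs in the datastore or in a search layer is the open question above; the sketch only shows that both tasks need nothing more than tokens in their original order.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in text order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cooccurrences(tokens: Sequence[str], target: str, window: int) -> Counter:
    """Count tokens that appear within `window` positions of `target`."""
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# A short illustrative token sequence, punctuation omitted.
tokens = ["who's", "there", "nay", "answer", "me"]
bigrams = ngrams(tokens, 2)
near_nay = cooccurrences(tokens, "nay", 1)
```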

We also talked about the need to keep track of punctuation in the datastore and make it possible for clients of the datastore to work with punctuation as analysis features. Pib remarked that MorphAdorner does indeed maintain all punctuation.
