|
MONK : Lexical data (archive)
This page last changed on Feb 20, 2008 by martinmueller@northwestern.edu.
What information do we want to know about each word?

Bill: Lexical Data - What information do we want to know about each word?
identifier - a string that uniquely identifies the word

We want to capture the above. This information also forms a reference lexicon, a central resource that is useful in its own right.

Loretta mentioned the importance of preserving capitalization. Bill and Pib reassured her that the spelling attribute retains the original spelling of the word token, including all capitalization.

John raised the issue of contractions: words that have more than one lemma and part of speech. This has always been an important issue for Martin. An example is the first word of Hamlet, "who's". It is a single word, a single lexical token, but it has two parts. The first part is an instance of the lemma "who" with NUPOS part of speech "q-crq". The second part is an instance of the lemma "be" with NUPOS part of speech "vaz". Pib mentioned that MorphAdorner knows how to deal with these kinds of words and emits multiple lemma and part of speech tags for them, as in the example from Hamlet.

John also raised the issue of keeping track of word order and word proximity, so that the datastore can answer questions involving collocation, n-grams, and general morphological pattern matching searches. Steve, Amit and Pib discussed the facilities available for these kinds of tasks in existing search engine products like Lucene. Do we need to concern ourselves with this issue in the Monk datastore proper?

We also talked about the need to keep track of punctuation in the datastore, so that clients of the datastore can work with punctuation as analysis features. Pib remarked that MorphAdorner does maintain all punctuation. |
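To make the points above concrete, here is a minimal Python sketch of what a word record might look like. It is an illustration only, not the Monk datastore schema or MorphAdorner's output format: the class names (`WordPart`, `WordToken`), the `is_punctuation` flag, the identifier strings, and the `bigrams` helper are all hypothetical. Only the "who's" lemmas and NUPOS tags come from the discussion; the tags on the other two tokens are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WordPart:
    """One lemma / part-of-speech pair inside a word token."""
    lemma: str
    pos: str  # NUPOS tag


@dataclass(frozen=True)
class WordToken:
    """A single lexical token; a contraction carries more than one part."""
    identifier: str               # hypothetical id format, for illustration
    spelling: str                 # original spelling, capitalization preserved
    parts: Tuple[WordPart, ...]   # one entry per lemma in the token
    is_punctuation: bool = False  # assumed flag so clients can filter


def bigrams(tokens: List[WordToken],
            include_punctuation: bool = False) -> List[Tuple[str, str]]:
    """Return adjacent spelling pairs, in document order."""
    kept = [t for t in tokens if include_punctuation or not t.is_punctuation]
    return [(a.spelling, b.spelling) for a, b in zip(kept, kept[1:])]


# "TBD" is a placeholder, not a real NUPOS tag.
tokens = [
    WordToken("w1", "Who's", (WordPart("who", "q-crq"), WordPart("be", "vaz"))),
    WordToken("w2", "there", (WordPart("there", "TBD"),)),
    WordToken("w3", "?", (WordPart("?", "TBD"),), is_punctuation=True),
]

lemmas_of_first = [p.lemma for p in tokens[0].parts]
pairs_without_punct = bigrams(tokens)
pairs_with_punct = bigrams(tokens, include_punctuation=True)
```

Keeping the tokens in an ordered list is what makes the n-gram helper possible here; whether that ordering lives in the datastore itself or in a search engine index is the open question raised in the discussion.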
| Document generated by Confluence on Apr 19, 2009 15:05 |