|
MONK : Proposal about word and sentence level metadata (12-1-2007)
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Word level metadata are data that are kept about each token in a text. Punctuation marks count as separate word tokens. Problems of disambiguation arise with the apostrophe/single quote and the period. The period is part of a word token in abbreviations and decimal numbers. The single quote mark is part of the token when it acts as an apostrophe.
A word token has the following properties or attributes:
1. A token address or corpus-wide unique identifier, which consists of a work identifier and a word counter
A lemma always belongs to a particular word class and is an abstract concept that bundles various inflected or orthographic forms of a word. In English the lemma is represented by the zero form of a word: the singular of a noun (love) or the present tense of a verb (love). A search for a lemma is therefore always a search for all inflectional and orthographic variants of a word.
For a variety of analytical purposes it is helpful to search for a combination of lemma and POS tag. A LemmaPOS is a particular inflected form of a word regardless of its orthographic form: 'loves', 'louyth', 'loueth', 'loveth' are or can be instances of the third person singular of the verb 'love' (love_vvz). A LemmaPOS is nearly always the same as a combination of a standardized spelling and POS tag (loves_n2 vs. loves_vvz). There are exceptions, however: leaves_n2 could refer to the LemmaPOS leaf_n2 or leave_n2. |
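The relationship between token address, lemma, and LemmaPOS can be illustrated with a minimal Python sketch. It is not the MONK data model: the class name `WordToken`, the field names, the `work1`/`work2` identifiers, the `-` separator in the address, and the one-letter word class codes are all assumptions made for the example. Only the spellings and the tags `vvz` and `n2` come from the proposal.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordToken:
    work_id: str     # hypothetical work identifier
    counter: int     # running word counter within the work
    spelling: str    # form of the word as it appears in the text
    headword: str    # zero form, e.g. "love"
    word_class: str  # assumed coarse class code: "v" verb, "n" noun
    pos: str         # POS tag, e.g. "vvz" or "n2"

    @property
    def address(self) -> str:
        # Corpus-wide identifier: work identifier plus word counter.
        # The "-" separator is an assumption of this sketch.
        return f"{self.work_id}-{self.counter}"

    @property
    def lemma(self) -> tuple:
        # A lemma always belongs to a particular word class.
        return (self.headword, self.word_class)

    @property
    def lemma_pos(self) -> str:
        return f"{self.headword}_{self.pos}"


tokens = [
    WordToken("work1", 1, "loves", "love", "v", "vvz"),
    WordToken("work1", 2, "louyth", "love", "v", "vvz"),
    WordToken("work2", 1, "loueth", "love", "v", "vvz"),
    WordToken("work2", 2, "loves", "love", "n", "n2"),
    WordToken("work2", 3, "leaves", "leaf", "n", "n2"),
    WordToken("work2", 4, "leaves", "leave", "n", "n2"),
]

# Index the tokens by lemma and by LemmaPOS.
by_lemma = defaultdict(list)
by_lemma_pos = defaultdict(list)
for token in tokens:
    by_lemma[token.lemma].append(token)
    by_lemma_pos[token.lemma_pos].append(token)

# A lemma search returns every inflectional and orthographic variant.
love_verb_forms = sorted({t.spelling for t in by_lemma[("love", "v")]})

# Spelling plus POS tag does not always pick out a single LemmaPOS.
leaves_candidates = sorted(
    {t.lemma_pos for t in tokens if t.spelling == "leaves" and t.pos == "n2"}
)

print(tokens[0].address)
print(love_verb_forms)
print(leaves_candidates)
```

In this sketch the lemma key is a pair of headword and word class, so the noun 'love' and the verb 'love' stay distinct. The last lookup shows why 'leaves' with the tag n2 needs the LemmaPOS to be stored on the token rather than derived from spelling and tag.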
| Document generated by Confluence on Apr 19, 2009 15:04 |