This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Word-level metadata are data that are kept about each token in a text. Punctuation marks count as separate word tokens.

Problems of disambiguation arise with the apostrophe/single quote and the period. The period is part of a word token in abbreviations and decimal numbers. The single quote mark is part of the token when it acts as an apostrophe.
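The two disambiguation rules can be sketched as a small tokenizer. This is a minimal illustration, not the tokenizer actually used for the corpus: the `ABBREVIATIONS` set, the `TOKEN_RE` pattern, and the `tokenize` function are all hypothetical names, and the abbreviation list is deliberately tiny. It only treats a single quote as an apostrophe when it stands between letters, so word-initial apostrophes (as in 'tis) are not handled.

```python
import re

# Hypothetical abbreviation list; a real corpus would need a far larger one.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "etc."}

TOKEN_RE = re.compile(
    r"""
    \d+\.\d+                    # decimal number: the period stays in the token
    | \d+                       # integer
    | [A-Za-z]+(?:'[A-Za-z]+)*  # word, keeping apostrophes between letters
    | \S                        # any other non-space character (punctuation)
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into word tokens; punctuation marks become separate tokens."""
    tokens = []
    prev_end = -1
    for match in TOKEN_RE.finditer(text):
        piece = match.group()
        # Attach a period to the preceding word only if it follows the word
        # directly and the result is a known abbreviation.
        if (
            piece == "."
            and tokens
            and match.start() == prev_end
            and tokens[-1] + "." in ABBREVIATIONS
        ):
            tokens[-1] += "."
        else:
            tokens.append(piece)
        prev_end = match.end()
    return tokens
```

With this sketch, the period in "Mr." and in "3.50" stays inside its token, the apostrophe in "Smith's" stays inside its token, and a sentence-final period or a quotation mark around a word becomes a token of its own.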

A word token has the following properties or attributes:

1. A token address or corpus-wide unique identifier, which consists of a work identifier and a word counter
2. The spelling that occupies the token address
3. A part-of-speech tag
4. The standard spelling of the spelling at the token address
5. The lemma associated with the spelling and POS tag at the token address
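
The five properties map naturally onto a small record type. The sketch below is an assumption about how such a record could look, not the project's actual schema: the class name `WordToken`, the field names, the work identifier `work01`, and the `work-counter` address format are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordToken:
    work_id: str            # work identifier (first half of the token address)
    word_counter: int       # word counter within the work (second half)
    spelling: str           # the spelling that occupies the token address
    pos: str                # part-of-speech tag
    standard_spelling: str  # standard spelling of the spelling at this address
    lemma: str              # lemma associated with the spelling and POS tag

    @property
    def address(self) -> str:
        # Hypothetical address format: work identifier, hyphen, word counter.
        return f"{self.work_id}-{self.word_counter}"


# Illustrative values only; "work01" is not a real work identifier.
token = WordToken("work01", 42, "leaves", "n2", "leaves", "leaf")
```

Making the record immutable (`frozen=True`) reflects the idea that a token address identifies one fixed occurrence in the corpus.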

A lemma always belongs to a particular word class. It is an abstract concept that bundles the various inflected and orthographic forms of a word. In English the lemma is represented by the zero form of a word: the singular of a noun (love) or the base present-tense form of a verb (love). A search for a lemma is therefore always a search for all inflectional and orthographic variants of a word.

For a variety of analytical purposes it is helpful to search for a combination of lemma and POS tag. A LemmaPOS is a particular inflected form of a word regardless of its orthographic form: 'loves', 'louyth', 'loueth', 'loveth' are or can be instances of the third person singular of the verb 'love' (love_vvz). A LemmaPOS nearly always corresponds to exactly one combination of standardized spelling and POS tag (loves_n2 vs. loves_vvz). There are exceptions: leaves_n2 could refer to the LemmaPOS leaf_n2 or leave_n2.
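
The difference between a lemma search, a LemmaPOS search, and a lookup by spelling plus tag can be sketched with the examples above. The data layout and the function names (`lemma_pos`, `search_lemma`, `search_lemma_pos`, `lemma_pos_by_spelling_tag`) are hypothetical. The lookup is keyed on the spelling as written, which coincides with the standardized spelling only for the modern forms in this sample.

```python
from collections import defaultdict

# (spelling, POS tag, lemma) triples built from the examples in the text.
TOKENS = [
    ("loves", "vvz", "love"),
    ("louyth", "vvz", "love"),
    ("loueth", "vvz", "love"),
    ("loveth", "vvz", "love"),
    ("loves", "n2", "love"),
    ("leaves", "n2", "leaf"),
    ("leaves", "n2", "leave"),
]


def lemma_pos(lemma, pos):
    """Build a LemmaPOS key such as 'love_vvz'."""
    return f"{lemma}_{pos}"


def search_lemma(tokens, lemma):
    """All inflectional and orthographic variants of a lemma."""
    return [t for t in tokens if t[2] == lemma]


def search_lemma_pos(tokens, key):
    """All orthographic variants of one inflected form."""
    return [t for t in tokens if lemma_pos(t[2], t[1]) == key]


def lemma_pos_by_spelling_tag(tokens):
    """Map each spelling_tag pair to the set of LemmaPOS values it can mean."""
    index = defaultdict(set)
    for spelling, pos, lemma in tokens:
        index[f"{spelling}_{pos}"].add(lemma_pos(lemma, pos))
    return index
```

In this sample, `search_lemma(TOKENS, "love")` returns the noun and verb forms alike, `search_lemma_pos(TOKENS, "love_vvz")` returns only the four verb spellings, and the index entry for `leaves_n2` holds two LemmaPOS values, which is the ambiguity described above.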

Document generated by Confluence on Apr 19, 2009 15:04