This page last changed on Feb 20, 2008 by martinmueller@northwestern.edu.

What morphological data is associated with words?

See Lexical data (archive) and Morphology and NUPOS

Perhaps it is better to ask: "What data are associated with a given token?", where a 'token' is a character string separated from other character strings by white space or by other characters that act as word boundary markers. The answer to that question is:

  1. a unique and sequential id that not only locates the token in the text but also makes it possible to look for its nearest neighbour to the right and left
  2. a marker that defines the token as sentence terminal or not
  3. the standard spelling of the token, which will often be identical to the spelling found in the text
  4. the POS tag assigned to the token
  5. the lemma that is associated with the standard spelling and POS tag
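
The five attributes above can be pictured as one record per token. What follows is a minimal Python sketch, not the project's actual schema: the field names, the use of 0-based ids that double as list positions, the `neighbours` helper, and every tag other than 'vvd' are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    token_id: int           # unique, sequential position in the text
    spelling: str           # the token as it appears in the text
    sentence_final: bool    # True if the token ends a sentence
    standard_spelling: str  # often identical to `spelling`
    pos: str                # POS tag assigned to the token
    lemma: str              # lemma tied to standard spelling + POS tag


def neighbours(tokens, token_id):
    """Return the (left, right) neighbours of a token, or None at a text edge.

    Assumes token ids are 0-based and equal to the token's list position.
    """
    left = tokens[token_id - 1] if token_id > 0 else None
    right = tokens[token_id + 1] if token_id + 1 < len(tokens) else None
    return left, right


# 'PRON' and 'PUNC' are placeholder tags, not taken from any real tag set.
tokens = [
    Token(0, "She", False, "she", "PRON", "she"),
    Token(1, "lovd", False, "loved", "vvd", "love"),
    Token(2, "him", False, "him", "PRON", "he"),
    Token(3, ".", True, ".", "PUNC", "."),
]

left, right = neighbours(tokens, 1)
print(left.spelling, right.spelling)
```

Because the id is sequential, finding the nearest neighbour to the left or right is simple arithmetic on the id rather than a search.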

Strictly speaking, lemma information need not be encoded at the token level because a lemma is implicit in the combination of a standard spelling and a POS tag. Leaving aside the question of homonyms (e.g. the verb 'lie' in its present forms, but not 'lied', 'lay', 'laid'), a standard spelling and a POS tag will always point to one, and only one, lemma.
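
One way to picture this is a lookup table keyed by the pair (standard spelling, POS tag). The sketch below is illustrative only: the table name, the `lemma_for` helper, and the sample entries are assumptions, and the pairing of each form with a tag reflects the usual reading of these verb tags rather than anything stated on this page.

```python
# Hypothetical lexicon fragment: (standard spelling, POS tag) -> lemma.
LEMMA_BY_FORM = {
    ("love", "vvb"): "love",
    ("love", "vvi"): "love",
    ("loves", "vvz"): "love",
    ("loved", "vvd"): "love",
    ("loved", "vvn"): "love",
}


def lemma_for(standard_spelling, pos):
    """Look up the lemma implied by a standard spelling and a POS tag.

    Returns None when the pair is not in the table.
    """
    return LEMMA_BY_FORM.get((standard_spelling, pos))


print(lemma_for("loved", "vvd"))
```

A plain dictionary like this cannot represent the homonym case mentioned above, since each key maps to exactly one lemma; those cases would need extra information at the token level.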

There is additional information about tokens that is useful for many purposes, such as:

  1. the length of the token
  2. the reverse spelling
  3. the phonetic value

But since this information is context-independent, it needs to be stored only once, in a lexicon.
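
A minimal sketch of such a lexicon, assuming it is keyed by spelling and built in one pass over the tokens. The function name and the dictionary layout are hypothetical, and the phonetic value is left out because it cannot be derived from the spelling alone.

```python
def build_lexicon(spellings):
    """Store context-independent properties once per distinct spelling."""
    lexicon = {}
    for spelling in spellings:
        if spelling not in lexicon:
            lexicon[spelling] = {
                "length": len(spelling),    # length of the token
                "reverse": spelling[::-1],  # reverse spelling
            }
    return lexicon


# 'the' occurs twice in the text but is stored only once.
lexicon = build_lexicon(["the", "cat", "saw", "the", "dog"])
print(len(lexicon), lexicon["saw"])
```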

A word about sentence boundaries. The 'sentences' established by the tokenizer may not always be sentences in the grammatical sense: things like "Chapter Five" would count as a sentence. But it is extremely useful for many inquiries to be able to count sentences or the number of words in them. Sentence splitting in a good tokenizer will have about the same error rate as the tokenizing itself (~3%). Any procedure that gets at least 95% of sentence boundaries right is worth having.
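
Given the sentence-terminal marker on each token, counting sentences and their lengths takes a single pass. The sketch below assumes tokens arrive as (spelling, sentence_final) pairs in text order; that representation and the function name are illustrative only. Note that it counts every token, punctuation included, so a word count in the strict sense would need to skip punctuation tokens.

```python
def sentence_lengths(tokens):
    """Return the number of tokens in each sentence, in text order.

    `tokens` is an iterable of (spelling, sentence_final) pairs.
    """
    lengths = []
    current = 0
    for _spelling, sentence_final in tokens:
        current += 1
        if sentence_final:
            lengths.append(current)
            current = 0
    if current:
        # Trailing tokens that never reached a terminal marker.
        lengths.append(current)
    return lengths


tokens = [
    ("Chapter", False), ("Five", True),            # counted as a 'sentence'
    ("It", False), ("rained", False), (".", True),
]
lengths = sentence_lengths(tokens)
print(len(lengths), lengths)
```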

The annotated token can be seen as an instance of any of its attributes. In the case of the POS attribute, it is important to note that this attribute operates at different levels of granularity: tokens with the POS values 'vvd', 'vvn', 'vvb', 'vvi', 'vvz' can be aggregated as instances of a verb token.
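
The two levels of granularity can be sketched as a mapping from fine-grained tags to a coarser class. The five verb tags come from the paragraph above; the names `VERB_TAGS` and `coarse_pos`, the label 'verb', and the placeholder tag 'OTHER' are assumptions for illustration.

```python
from collections import Counter

# Fine-grained tags that can be aggregated as instances of a verb token.
VERB_TAGS = {"vvd", "vvn", "vvb", "vvi", "vvz"}


def coarse_pos(pos):
    """Map a fine-grained tag to its coarse class; pass other tags through."""
    return "verb" if pos in VERB_TAGS else pos


# 'OTHER' is a placeholder standing in for any non-verb tag.
tags = ["vvd", "vvz", "OTHER", "vvn", "vvd"]

fine_counts = Counter(tags)
coarse_counts = Counter(coarse_pos(tag) for tag in tags)
print(fine_counts["vvd"], coarse_counts["verb"])
```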

Document generated by Confluence on Apr 19, 2009 15:04