|
This page last changed on Feb 20, 2008 by martinmueller@northwestern.edu.
Lemmatizing English Word Spellings
Lemmatization is the process of reducing an inflected spelling to
its lexical root or lemma form. The lemma form is the base form or
head word form you would find in a dictionary. The combination of
the lemma form with its word class (noun, verb, etc.) is called the
lexeme.
In English, the base form for a verb is the simple infinitive. For
example, the present participle "striking" and the past form "struck"
are both forms of the lemma "(to) strike". The base form for a noun
is the singular form. Thus the plural "mice" is a form of the lemma
"mouse."
Most English spellings can be lemmatized using regular rules of
English grammar, as long as the word class is known. MorphAdorner
uses a list of about 200 such rules (details below). Some spellings
require special handling because they don't follow the general rules.
These irregular forms include "strong" verbs like "to catch" and
nouns like "mouse." MorphAdorner recognizes over 3,000 irregular forms.
The lemma form of a spelling depends upon its word class. Thus the
noun "bee" has "bee" as a lemma form, while "bee" as a verb has "(to)
be" as a lemma form. This turns out to be a bigger problem in Early
Modern English than in contemporary English because spelling was not
reasonably standardized until the late eighteenth century. Using a
standard spelling helps in finding the lemma form. For example,
"strykynge" is an old spelling for "striking." By transforming the
old spelling to a standardized (usually modern) spelling, we can apply
the standard lemmatization rules and obtain "(to) strike" as the lemma.
MorphAdorner's English lemmatizer works best with standardized spellings.
Another problem area is the use of the "'s" as a possessive.
Sixteenth and seventeenth century English texts generally did not use
the "'s" for the possessive form. Thus a phrase like "his majesty's
horses" might appear as "his majesties horses." Handling this problem
requires part of speech tagging in tandem with spelling
standardization.
A harder problem is the disambiguation of homonyms like "lie" or
"bark". There are a few hundred (at most) such pairs in English. In
the future we may be able to distinguish which homonym is meant in
some situations using methods collectively called word sense
disambiguation. That would allow more accurate lemmatization for
homonyms. MorphAdorner does not currently include such
disambiguation.
Lemmatization procedure
Given a (spelling, NUPos part of speech) pair, MorphAdorner
first checks if a lemma appears for that combination in the currently
active word lexicon. If so, MorphAdorner returns the lemma
specified by the lexicon.
Consider the spelling pair (striking, vvg). MorphAdorner's
19th century English lexicon defines the lemma strike for this
combination of spelling and NUPos part of speech.
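The lexicon check can be sketched as a plain dictionary lookup. This is only an illustration: the names LEXICON and lexicon_lemma are hypothetical, and the single entry is the (striking, vvg) example from the text, not MorphAdorner's actual lexicon data or API.

```python
from typing import Optional

# Hypothetical stand-in for a word lexicon: (spelling, NUPos tag) -> lemma.
# Only the (striking, vvg) entry comes from the text above.
LEXICON = {
    ("striking", "vvg"): "strike",
}


def lexicon_lemma(spelling: str, pos: str) -> Optional[str]:
    """Return the lexicon lemma for the pair, or None when it is absent."""
    return LEXICON.get((spelling, pos))
```

A None result signals that the pair is not in the lexicon, so the caller has to fall back to another method.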
When the (spelling, part of speech) combination is not found in the
current word lexicon, MorphAdorner uses its general English
lemmatizer, which is based upon a list of irregular forms and
grammar rules. The lemmatizer is not tied to a specific
part of speech set. Instead the lemmatizer categorizes irregular forms
and rules using the following major part of speech classes.
- adjective
- adverb
- compound
- conjunction
- infinitive-to
- noun, plural
- noun, possessive
- preposition
- pronoun
- verb
The NUPos (or other) part of speech is converted to one or more of these
major word classes for the purposes of lemmatization. In our example
above, the NUPos gerund tag vvg maps to the verb class. The
lemmatizer then processes the spelling pair (striking, verb) as follows.
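The tag conversion might look like the following sketch. The table MAJOR_CLASS and the function major_classes are hypothetical names; only the vvg to verb mapping is stated in the text, and the other two entries are assumptions based on tags that appear elsewhere on this page.

```python
from typing import List

# Hypothetical mapping from NUPos tags to the lemmatizer's major word classes.
# Only vvg -> verb is stated in the text; the other entries are assumptions.
MAJOR_CLASS = {
    "vvg": ["verb"],
    "vam": ["verb"],       # assumed
    "pns11": ["pronoun"],  # assumed
}


def major_classes(pos: str) -> List[str]:
    """Return the major word classes for a tag, or an empty list if unknown."""
    return MAJOR_CLASS.get(pos, [])
```

A list is returned because a tag may convert to more than one major word class.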
When the spelling pair appears in the irregular forms list,
the lemmatizer returns the lemma specified in that list.
In our example, striking does not appear on the irregular forms list.
On the other hand, the spelling pair (mice,noun) does
appear on the irregular forms list, which specifies that
mouse is the lemma form for mice.
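A minimal sketch of the irregular forms check, assuming the list can be modeled as a dictionary keyed by (spelling, major word class). The names IRREGULAR_FORMS and irregular_lemma are hypothetical, and the entries are limited to examples given on this page.

```python
from typing import Optional

# Hypothetical slice of an irregular forms list, keyed by
# (spelling, major word class). Entries are examples from this page only.
IRREGULAR_FORMS = {
    ("mice", "noun"): "mouse",
    ("struck", "verb"): "strike",
    ("gimme", "compound"): "give|i",
}


def irregular_lemma(spelling: str, word_class: str) -> Optional[str]:
    """Return the listed lemma, or None so the caller can try the rules."""
    return IRREGULAR_FORMS.get((spelling, word_class))
```

Here (striking, verb) yields None, which corresponds to falling through to the rule matching step.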
When the spelling pair does not appear in the irregular forms list,
the lemmatizer begins a series of rule matches for the major
word class. Each rule specifies an affix pattern to match and a
replacement pattern which generates the lemma form. Once a replacement
has been effected, the lemmatization process is complete. These rules
are often called rules of detachment because the affixes are detached
from the inflected word form to produce the lemma form.
In the case of striking, the first match occurs against a rule
which says "match a consonant, followed by a vowel,
followed by a consonant, followed by ing at the end of
the word." The replacement string says to keep the
consonant followed by the vowel followed by the consonant,
but replace ing with e . The result is that striking
is lemmatized to strike.
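The rule described above can be approximated with a Python regular expression. This is a stand-in, not MorphAdorner's own rule notation; the function name detach_ing is hypothetical, and treating "y" as a consonant is an assumption.

```python
import re
from typing import Optional

# Consonant, vowel, consonant, then "ing" at the end of the word.
# Treating "y" as a consonant is an assumption made for this sketch.
CONSONANT = "[b-df-hj-np-tv-z]"
CVC_ING = re.compile("(" + CONSONANT + "[aeiou]" + CONSONANT + ")ing$")


def detach_ing(spelling: str) -> Optional[str]:
    """Keep the consonant-vowel-consonant group and replace "ing" with "e".

    Returns None when the rule does not match the spelling.
    """
    if CVC_ING.search(spelling) is None:
        return None
    return CVC_ING.sub(r"\1e", spelling)
```

For striking, the group "rik" is kept and "ing" becomes "e", giving strike. A spelling such as singing does not match, because the three letters before the final "ing" are a vowel followed by two consonants.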
Some words require the application of multiple sets of detachment
rules. For example, the word "astoundingly" is an adverb formed
from a present participle. The lemmatizer first applies the adverb
rules to remove the ly producing "astounding", then applies the
verb rules to remove ing and produce "astound" as the lemma form.
Once a successful substitution occurs, the lemmatization process stops.
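Chaining rule sets for a word like "astoundingly" could be sketched as follows. The rule sets are reduced to one hypothetical rule each, and it is an assumption of this sketch that the caller supplies the sequence of word classes to apply; the text does not say how MorphAdorner decides when to continue with a second rule set.

```python
import re

# Hypothetical, heavily reduced rule sets: (pattern, replacement) per class.
RULES = {
    "adverb": [(re.compile(r"ly$"), "")],
    "verb": [(re.compile(r"ing$"), "")],
}


def apply_first(spelling: str, word_class: str) -> str:
    """Apply the first matching rule for the class; stop after one match."""
    for pattern, replacement in RULES[word_class]:
        if pattern.search(spelling):
            return pattern.sub(replacement, spelling)
    return spelling


def lemmatize_chain(spelling: str, classes: list) -> str:
    """Run the rule sets for each class in turn (the order is assumed)."""
    for word_class in classes:
        spelling = apply_first(spelling, word_class)
    return spelling
```

Calling lemmatize_chain("astoundingly", ["adverb", "verb"]) first strips "ly" and then "ing". Within each rule set, processing stops at the first successful substitution.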
The reduced form for some endings is ambiguous. For example,
the lemma for the past tense of a verb ending in ored
may end in ore (e.g., implored -> implore) or in or
(e.g., colored -> color). To help disambiguate such cases,
a lemmatization rule can specify that the resulting candidate
lemma formed by applying the rule must appear in a known word list.
MorphAdorner uses a large list of standard word forms taken from the
1911 Webster's Dictionary and other sources.
For example, consider a sequence of two rules for the ending "ored".
The first rule says to replace "ored" with "ore" and check that
the result is a known word (that's what the "+" denotes). When
the result is not a known word, the rule is bypassed, and the
following rule which replaces "ored" with "or" is used instead.
Examples:
Lemmatize recolored:
- recolored -> recolore : recolore not in dictionary, go to next rule.
- recolored -> recolor : recolor in dictionary, accept this form as the lemma.
Lemmatize implored:
- implored -> implore : implore in dictionary, accept this form as the lemma.
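The dictionary check can be sketched as follows. The tuple layout of ORED_RULES, the tiny KNOWN_WORDS set, and the function name lemmatize_ored are all hypothetical; the boolean flag plays the role the "+" marker is described as playing above.

```python
# Tiny hypothetical stand-in for the known word list.
KNOWN_WORDS = {"implore", "color", "recolor"}

# (suffix, replacement, candidate must be a known word)
ORED_RULES = [
    ("ored", "ore", True),
    ("ored", "or", False),
]


def lemmatize_ored(spelling: str) -> str:
    """Try each rule in order, bypassing a checked rule whose result is unknown."""
    for suffix, replacement, must_be_known in ORED_RULES:
        if not spelling.endswith(suffix):
            continue
        candidate = spelling[: -len(suffix)] + replacement
        if must_be_known and candidate not in KNOWN_WORDS:
            continue  # bypass this rule and try the following one
        return candidate
    return spelling
```

With this word list, recolored is first turned into recolore, which is rejected, and then into recolor, which is accepted without a check because the second rule carries no flag.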
Words containing more than one part of speech require special
handling. MorphAdorner attempts to split such words at a logical
point and assign a separate lemma using the process above to
each word part. For example, the spelling I'm with a compound
NUPos part of speech pns11|vam (the vertical bar separates the
parts of speech), is split into two pairs:
- (I,pns11)
- ('m,vam)
The first pair lemmatizes to i and the second pair to be,
giving the compound lemma form i|be.
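The handling of a compound part of speech might be sketched like this. Splitting the spelling at the apostrophe is an assumption made for this sketch (the text only says the word is split "at a logical point"), and PART_LEMMAS is a hypothetical lookup that stands in for the full lemmatization of each part.

```python
# Hypothetical per-part lemma lookup, standing in for the full process.
PART_LEMMAS = {
    ("I", "pns11"): "i",
    ("'m", "vam"): "be",
}


def split_contraction(spelling: str) -> list:
    """Split at the first apostrophe (an assumed 'logical point')."""
    head, sep, tail = spelling.partition("'")
    return [head, sep + tail] if sep else [spelling]


def compound_lemma(spelling: str, pos: str) -> str:
    """Lemmatize each part separately and join the lemmata with '|'."""
    tags = pos.split("|")
    parts = split_contraction(spelling)
    if len(parts) != len(tags):
        return spelling  # cannot align parts with tags; leave untouched
    return "|".join(
        PART_LEMMAS.get((part, tag), part) for part, tag in zip(parts, tags)
    )
```

For the pair (I'm, pns11|vam), the parts are I and 'm, and the joined result is i|be.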
Certain irregular compound forms such as gimme, a
contraction of "give me", appear under the compound
entry in the irregular forms list. The lemma form for gimme is
give|i.
Punctuation and symbols "lemmatize" to themselves.
Foreign words (marked by one of the foreign part of speech tags)
and singular nouns are left untouched by MorphAdorner's lemmatizer –
the original spelling is considered the lemma form.
The lemma form for some words is ambiguous. For example, "axes"
is the plural form of both "axe" and "axis". MorphAdorner returns one of
the possible forms (e.g., "axe" for "axes"). This will not be the
correct form for all cases.
|