This page last changed on Feb 20, 2008 by martinmueller@northwestern.edu.
Spelling Standardization
English texts of the past exhibit far greater spelling variance than
contemporary texts. Texts from the seventeenth century and earlier
follow conventions that differ from contemporary standards in the
use of "u", "v", and "y" and in capitalization, among others. Often
the same word is spelled several different ways even within the same
work. By the eighteenth century, texts employ much more modern
orthographic standards, except for capitalization.
MorphAdorner uses rules, word lists, and extended search techniques
such as spelling correction methods and other heuristics to map
variant spellings to their standard (usually modern) form. For
obsolete words, a representative standard form is chosen, usually
the Oxford English Dictionary headword form.
Presently MorphAdorner knows a couple of hundred thousand variant
spellings. Using these methods and word lists, MorphAdorner can
automatically determine the correct standard form for previously
unseen spellings in many cases.
Sometimes a new spelling is just too different from any of the ones
MorphAdorner already knows. Using the extended search facilities on
such a spelling may result in a "standard spelling" which veers far
from the correct form. As time goes on we hope to reduce the
occurrence of such errors.
Orthographic standardization improves the quality of part-of-speech
tagging, name recognition, and text searching. However,
standardization by itself isn't sufficient to fix some other
problems. These include the lack of the apostrophe to mark the
possessive case and the inconsistent practices of capitalization as
markers of proper nouns.
In English before 1700 the apostrophe never indicates the genitive,
and "her mother's daughter" is written "her mothers daughter". An
even more problematic example is "her majesty's daughter" which
appears in early texts as "her majesties daughter." The use of the
apostrophe as a genitive marker gained ground during the eighteenth
century, and has been used as it is today since the early nineteenth
century.
In the eighteenth century, the apostrophe is sometimes used as a
plural marker in certain character combinations. Thus "canoe's" is
much more likely to be a plural than a possessive form.
The modern practice of restricting capitalization to names, namelike
entities, and certain emphatic uses is about two centuries old. In
earlier English nouns are freely capitalized, and capitalization is
not a reliable way of picking out proper nouns. However, proper nouns
have usually been capitalized in all forms of written English since
about 1550. Before that, names can appear in lower case.
In poetry the first word of each line is often capitalized even when
that word does not start a sentence. For purposes of part-of-speech
tagging, a simple workaround is to use the lower case form of a word
that does not start a sentence, except if the word appears in a list
of known proper names.
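The workaround just described can be sketched in a few lines of Python. This is only an illustration: the function name tagging_form, the starts_sentence flag, and the tiny name list are assumptions made for the example, not part of MorphAdorner.

```python
def tagging_form(word, starts_sentence, proper_names):
    """Return the form of `word` to hand to a part-of-speech tagger.

    Hypothetical helper: a word keeps its capitalization only when it
    starts a sentence or appears in the list of known proper names.
    """
    if starts_sentence or word in proper_names:
        return word
    return word.lower()


# Illustrative name list; a real one would be far larger.
KNOWN_NAMES = {"Ulysses", "Syracuse"}

# A made-up verse line that continues a sentence begun on the previous
# line, so its capitalized first word does not start a sentence.
line = ["And", "Ulysses", "Sailed", "Home"]
forms = [tagging_form(w, starts_sentence=False, proper_names=KNOWN_NAMES)
         for w in line]
print(forms)  # ['and', 'Ulysses', 'sailed', 'home']
```

Deciding whether a word actually starts a sentence is a separate problem that this sketch leaves to the caller.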
Standardization Process
MorphAdorner attempts to standardize a spelling as follows.
- Load the list of known standard spellings. This is a combination
of entries from the 1911 Webster's Dictionary and entries verified
against the Oxford English Dictionary from ongoing work with the
TCP/EEBO texts.
- Load a map of known variant spellings to modern spellings. Currently
this list contains several hundred thousand known variants culled from
ongoing work with the TCP/EEBO texts.
- Create a ternary trie of all the standard and variant spellings.
A ternary trie allows very efficient extraction of strings within a
specified edit distance of a given string. In other words, it allows
efficient extraction of a list of words whose spellings are near to any
given word's spelling.
- Load a list of modernization rules. Currently MorphAdorner defines
about 70 such rules which can transform many variant spellings to their
modern spellings, or come very close. The rules also provide for
correcting defective spellings that contain "gap" markers reflecting
illegible letters in the original text. Some sample rules include:
- Transform the ending "me~" to "men"
- Transform the ending "ynge" to "ing"
- Transform "uu" to "w"
- Transform "v" followed by a non-vowel to "u"
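The four sample rules above could be expressed as regular-expression substitutions along the following lines. This is a minimal sketch, not MorphAdorner's actual rule format: the names SAMPLE_RULES and apply_rules are hypothetical, the order of application is assumed, and "non-vowel" is assumed here to mean a consonant letter (with "y" counted as a consonant).

```python
import re

# Each rule is (compiled pattern, replacement). Only the four sample
# rules from the list are encoded, not the full set of about 70.
SAMPLE_RULES = [
    (re.compile(r"me~$"), "men"),                   # ending "me~" -> "men"
    (re.compile(r"ynge$"), "ing"),                  # ending "ynge" -> "ing"
    (re.compile(r"uu"), "w"),                       # "uu" -> "w"
    (re.compile(r"v(?=[b-df-hj-np-tv-z])"), "u"),   # "v" + consonant -> "u"
]


def apply_rules(spelling):
    """Apply every sample rule in turn to a lowercased spelling."""
    result = spelling.lower()
    for pattern, replacement in SAMPLE_RULES:
        result = pattern.sub(replacement, result)
    return result


for old in ["wome~", "makynge", "uuord", "vnder"]:
    print(old, "->", apply_rules(old))
# wome~ -> women
# makynge -> making
# uuord -> word
# vnder -> under
```

With only these four rules, strykynge becomes stryking and vniuersitie becomes uniuersitie; further rules from the full set are needed to reach the forms discussed in the steps that follow.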
Now for each old spelling, perform the following steps.
- Apply all the applicable transformation rules to produce an
improved spelling. If this spelling appears in the standard spellings list,
we're done. For example, applying the rules to strykynge directly
produces the modern standard spelling striking.
- See if the transformed spelling appears in the variant spellings map.
If so, assign the mapped spelling value as the standard spelling.
We're done. For example, applying the rules
to vniuersitie produces universitie. This is not the modern spelling,
but it is close. The mapped spelling list for Early Modern English
provides an entry for universitie, giving the modern spelling as
university.
- Compile a list of words whose spellings are "close to" the transformed
spelling by using the ternary trie to search quickly for all words within
a specified edit distance of the transformed word.
- Compute a measure of string similarity between each found spelling and
the transformed spelling. String similarity measures how similar two
strings of characters are. A similarity of 0.0 indicates two strings are
completely different, while a similarity of 1.0 indicates two strings are
identical. MorphAdorner uses a weighted similarity score based upon
letter pair similarity, phonetic distance, and edit distance.
- Choose the found spelling with the highest similarity as the most probable
correct/standard spelling. If this spelling appears in the standard
spellings list, we're done. If not, see if it appears in the mapped
spellings list. If so, take the mapped spelling value as the standard
spelling, and we're done. Otherwise, accept the transformed spelling
as the standard spelling, with the proviso that it may not be a proper
standard spelling, and requires further review.
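The per-spelling steps above can be sketched as a single function. Several simplifications are assumed: the word lists are toy data, a brute-force edit-distance scan stands in for the ternary trie search, and Python's difflib.SequenceMatcher ratio stands in for MorphAdorner's weighted similarity score. The names standardize and edit_distance are hypothetical.

```python
from difflib import SequenceMatcher


def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def standardize(spelling, standard, variants,
                apply_rules=lambda s: s, max_distance=2):
    """Return (standard_spelling, needs_review) for one old spelling."""
    transformed = apply_rules(spelling)
    if transformed in standard:          # rules gave a standard spelling
        return transformed, False
    if transformed in variants:          # known variant: use its mapping
        return variants[transformed], False
    # Brute-force stand-in for the trie search over all known spellings.
    known = sorted(set(standard) | set(variants))
    near = [w for w in known if edit_distance(transformed, w) <= max_distance]
    if near:
        best = max(near,
                   key=lambda w: SequenceMatcher(None, transformed, w).ratio())
        if best in standard:
            return best, False
        if best in variants:
            return variants[best], False
    # Nothing suitable found: keep the transformed spelling, flag for review.
    return transformed, True


# Toy data for illustration only.
STANDARD = {"doubt", "striking", "university"}
VARIANTS = {"dout": "doubt", "universitie": "university"}

print(standardize("universitie", STANDARD, VARIANTS))  # ('university', False)
print(standardize("dowt", STANDARD, VARIANTS))         # ('doubt', False)
print(standardize("qwxz", STANDARD, VARIANTS))         # ('qwxz', True)
```

In the second call the closest known spelling is the variant dout, so the result comes from the variant map rather than directly from the standard list.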
Interactions with Part Of Speech
The standard spelling for some words cannot be determined until the part of
speech for the word is known. Examples of such words include doe, bee, poor,
marie, and wast. Thus "doe" is most likely "doe" (a female deer) when it appears
as a noun, and most likely "do" when it appears as a verb.
When "marie" appears as an adjective it is probably "merry", but most likely
"marry" when used as a verb.
MorphAdorner keeps a short list of variant spellings by general word class.
The final standardized spelling is not assigned until a part of speech has
been assigned, so these special cases can usually be disambiguated properly.
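One way to picture this lookup is a table keyed by spelling and word class. The sketch below is an assumption about the shape of such a table, not MorphAdorner's actual data structure; it contains only the two examples given above, and the word-class labels are illustrative.

```python
# Hypothetical table: (variant spelling, general word class) -> standard form.
POS_VARIANTS = {
    ("doe", "noun"): "doe",
    ("doe", "verb"): "do",
    ("marie", "adjective"): "merry",
    ("marie", "verb"): "marry",
}


def standard_by_word_class(spelling, word_class):
    """Return the class-specific standard form, or the spelling unchanged
    when the table has no entry for this spelling and word class."""
    return POS_VARIANTS.get((spelling.lower(), word_class), spelling)


print(standard_by_word_class("doe", "verb"))         # do
print(standard_by_word_class("marie", "adjective"))  # merry
```

Because the key includes the word class, the lookup can only run after the tagger has assigned a part of speech, which matches the ordering described above.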
Standardizing Proper Names
Proper names can appear with a bewildering variety of spellings even
within a single work. Some variants can be transformed to their modern
standard forms by using the general standardization rules presented above.
For example, the spellings Syracvse and Vlysses, which are the commonest
variants of those proper name spellings in the TCP/EEBO version
of Plutarch's Lives, both transform by rule to their modern spellings
Syracuse and Ulysses.
Other variants are not so easily rectified. The place name Cappadocia
appears in Plutarch's Lives as
CPADOCIA 1
Cappadocia 21
OHPPADOCIA 1
Coppadocia 1
CAPRADOCIA 1
where the frequency of occurrence follows each variant.
MorphAdorner currently uses the following algorithm to look for standard
spelling candidates for proper names. This is a variant of the extended
search algorithm for standard spellings described above. Because we
know we are looking for proper names, we can do a better job by limiting
the search space to known proper names.
Proper name search algorithm
- Collect the list of known spellings of proper names (tagged with NUPos
parts of speech np1 and np2) in the early modern English lexicon.
Currently there are around 66,000 such spellings.
- Construct a "name" ternary trie of the lowercase versions of
all these names. A ternary trie allows very efficient extraction of
strings within a specified edit distance of a given string.
- Construct a "consonant" ternary trie of the lowercase
versions of the names with all vowels removed. For each unique
combination of consonants (in order), store the list of spellings
which reduce to that consonant string.
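The two structures described above can be sketched as follows. This is a simplified illustration, not MorphAdorner's implementation: the class and method names are hypothetical, the search carries one row of the Levenshtein table down each branch of the trie, and "vowels" is assumed to mean the letters a, e, i, o, u.

```python
class Node:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.word = None  # full key, set only where a stored key ends


class TernaryTrie:
    def __init__(self):
        self.root = None

    def insert(self, word):
        if word:
            self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.word = word
        return node

    def near(self, target, max_dist):
        """Return {stored key: edit distance} for keys within max_dist."""
        found = {}
        self._near(self.root, target, list(range(len(target) + 1)),
                   max_dist, found)
        return found

    def _near(self, node, target, prev_row, max_dist, found):
        if node is None:
            return
        # lo/hi hold other characters at the same position: same row.
        self._near(node.lo, target, prev_row, max_dist, found)
        self._near(node.hi, target, prev_row, max_dist, found)
        # Extend the Levenshtein table by one row for this node's character.
        row = [prev_row[0] + 1]
        for j, tc in enumerate(target, 1):
            row.append(min(row[j - 1] + 1, prev_row[j] + 1,
                           prev_row[j - 1] + (tc != node.ch)))
        if node.word is not None and row[-1] <= max_dist:
            found[node.word] = row[-1]
        if min(row) <= max_dist:  # otherwise no longer key can qualify
            self._near(node.eq, target, row, max_dist, found)


def consonant_key(name):
    """Lowercase the name and drop the letters assumed to be vowels."""
    return "".join(c for c in name.lower() if c not in "aeiou")


names = ["cappadocia", "syracuse", "ulysses"]
name_trie, consonant_trie, names_by_key = TernaryTrie(), TernaryTrie(), {}
for n in names:
    name_trie.insert(n)
    consonant_trie.insert(consonant_key(n))
    names_by_key.setdefault(consonant_key(n), []).append(n)

print(name_trie.near("coppadocia", 2))  # {'cappadocia': 1}
print(names_by_key["cppdc"])            # ['cappadocia']
```

Inserting keys in sorted order produces long lo/hi chains, so a real word list would probably be shuffled before insertion to keep the recursion shallow.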
For each unknown name, perform the following steps.
- Find all strings in the "name" trie within a specified edit
distance of the unknown name. An edit distance of 2 seems to be a
good choice.
- If any names were found in step 1, compute a measure of string
similarity between each found name and the unknown name. Choose the
found name with the highest similarity as the most probable
correct/standard spelling. Letter-pair similarity seems
to work well as a measure of string similarity, but there are many
other possible choices.
- If no names were found in step 1, find all strings in the
"consonant" trie within a specified edit distance of the unknown name
with vowels removed. An edit distance of 3 seems to be a good choice.
- If any consonant strings were found in step 3, perform the
following steps for each consonant string.
- Pick up all the names which reduce to this consonant string.
- For each of those names, compute a measure of string
similarity between the name and the unknown name (that is, between
the full spellings).
- Keep a list of those found names with a similarity score above a
reasonable threshold. 0.75 seems to be a good choice.
- Choose the found name with the highest similarity as the most probable
correct/standard spelling.
If no names were found by either lookup procedure, leave the unknown
name alone.
Here is an example of the algorithm applied to the list of names
above. In each case, only one candidate spelling (the correct one,
it turns out) was found.
Names near CPADOCIA
cappadocia (0.75)
Names near Cappadocia
cappadocia (1.0)
Names near OHPPADOCIA
cappadocia (0.7777777777777778)
Names near Coppadocia
cappadocia (0.7777777777777778)
Names near CAPRADOCIA
cappadocia (0.7777777777777778)
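The scores listed above are consistent with a letter-pair similarity computed as twice the number of shared adjacent letter pairs divided by the total number of pairs in both strings. The sketch below uses that formula together with a brute-force version of the two-stage lookup. The function names, the small name list, the choice of a, e, i, o, u as vowels, and the invented spelling "constaantinoopale" are all assumptions made for illustration.

```python
from collections import Counter


def letter_pair_similarity(a, b):
    """2 * shared adjacent letter pairs / total pairs in both strings."""
    pa = Counter(a[i:i + 2] for i in range(len(a) - 1))
    pb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(pa.values()) + sum(pb.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2.0 * sum((pa & pb).values()) / total


def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def consonant_key(name):
    """Drop the letters assumed to be vowels."""
    return "".join(c for c in name if c not in "aeiou")


def find_standard_name(unknown, known_names, name_dist=2, cons_dist=3,
                       threshold=0.75):
    """Return (best known name, similarity), or None to leave the name alone."""
    target = unknown.lower()
    # First lookup: full spellings within name_dist edits.
    near = [n for n in known_names if edit_distance(target, n) <= name_dist]
    if not near:
        # Second lookup: consonant skeletons within cons_dist edits,
        # keeping only names whose full spelling is similar enough.
        key = consonant_key(target)
        near = [n for n in known_names
                if edit_distance(key, consonant_key(n)) <= cons_dist
                and letter_pair_similarity(target, n) >= threshold]
    if not near:
        return None
    best = max(sorted(near), key=lambda n: letter_pair_similarity(target, n))
    return best, letter_pair_similarity(target, best)


KNOWN = ["cappadocia", "constantinople", "syracuse", "ulysses"]
for variant in ["CPADOCIA", "Cappadocia", "OHPPADOCIA", "Coppadocia",
                "CAPRADOCIA"]:
    print(variant, find_standard_name(variant, KNOWN))
```

All five Cappadocia variants are resolved by the first lookup. The invented spelling "constaantinoopale" is three edits away from constantinople, so it is only found through the consonant lookup, where both reduce to the same consonant string.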