|
This page last changed on May 16, 2007 by martinmueller@northwestern.edu.
Update about MorphAdorner by Martin Mueller
This is an update on the report below about MorphAdorner. The earlier report focused on the rather gritty problems of making Early Modern English texts computationally tractable. In the past three weeks Phil Burns and I have focused on getting MorphAdorner to work with texts written after 1750. We are confident that by the time of the next 'all hands' meeting in June we will be able to process the fiction texts that are part of the first testbed collection.
I put this report in the form of bullet points, some of which recapitulate earlier discussions. I also add some explanatory remarks to the attached spreadsheet, which analyzes tagging errors.
1. MorphAdorner is a tool that, in addition to sentence splitting, tokenizing, and POS tagging, can provide orthographic standardization and lemmatization.
2. MorphAdorner fits into the GATE and UIMA architecture. Phil Burns did some preliminary testing of this a month ago. But recent work has focused mainly on getting the substantive functions of the tool to work properly, so that given input from various periods and genres it will do the proper job of splitting sentences, isolating tokens, and assigning to each token the appropriate values in terms of standard spelling, lemma, and POS tag.
3. Data from different periods or genres will require different treatment. There is no single training corpus that will process Holinshed, Jane Austen, or O'Neill's play with acceptable margins of error. Collections in Monk will differ at least as much as Holinshed, Jane Austen or O'Neill differ from each other, and the ability to absorb a new training set quickly is an important feature of a tool chain for Monk purposes. MorphAdorner is fast in this regard (minutes rather than hours).
4. The best POS taggers get things right about 97% of the time on modern texts. This is not nearly as good as it sounds, because about 90% of word occurrences can be POS tagged accurately in terms of their lexical status and without regard to context. Moreover, a lot of the errors occur in the 10-15% of word occurrences that users are likely to care about. So the difference between a 3% error rate and a 4% error rate is a lot bigger than it sounds. MorphAdorner has hovered around an error rate of 3% - 4% on early texts, and we are trying to get it unambiguously close to 3% or better.
5. MorphAdorner has a throughput of approximately 5 million words an hour. Larger lists of names and variant spellings may slow things down somewhat in exchange for greater accuracy.
6. MorphAdorner currently uses the NUPOS tag set but could use any tag set. The NUPOS tag set (developed by Martin Mueller) owes a lot to the Penn Treebank set and CLAWS (not to speak of the basic facts of the English language) but differs from other tag sets in these ways:
a) it allows for the explicit coding of grammatical forms current in earlier English, notably the second person singular and explicit plural form of verbs.
b) it makes it easier to interpret tags at different levels of granularity.
c) it explicitly draws attention to characteristic errors by identifying word classes that are subject to characteristic tagging errors, e.g. adverbial, prepositional, and conjunctive uses of certain words.
d) it makes explicit in the tagging some things that must be inferred from other tags, such as the adjectival use of a participle.
But there is no special connection between MorphAdorner and NUPOS.
It is possible, with some loss, to translate tags from one scheme to another. Thus in a text tagged with NUPOS you can map POS tags to Penn Treebank or CLAWS tags. This is a 'lossy' translation: 'vvd2' becomes 'vvd', and 'j-vvg' becomes 'j'. But much of the loss is recoverable: if a 'vvd' form (past tense) ends in 'st' it is almost certainly a second person singular.
7. The time table for development is as follows:
a) By the June meeting we expect to have tagged the Chadwyck-Healey 19th century archive and selections from the Wright Archive of American fiction. We can also provide on-demand tagging for other texts from that period.
b) By early July we expect to have tagged the Early Modern English texts for use in the Monk testbed.
c) As soon as we are confident that MorphAdorner does what it is supposed to do in a substantive way, i.e. split sentences, tokenize words, and assign proper POS tags and standard spellings for texts from a wide variety of periods and genres, Phil will focus on wrapping up the code properly, providing the hooks that integrate it into other frameworks, and providing documentation.
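The lossy tag translation described under point 6 can be sketched in a few lines of Python. This is only an illustration, not MorphAdorner's code: the mapping table holds just the two reductions named in the report (vvd2 to vvd, j-vvg to j), and the function names `to_coarse` and `recover_second_singular` are hypothetical.

```python
# Hypothetical, deliberately tiny mapping: only the two reductions named in
# the report. A real table would have to cover the whole NUPOS tag set.
NUPOS_TO_COARSE = {
    "vvd2": "vvd",   # second person singular past -> plain past tense
    "j-vvg": "j",    # present participle used as adjective -> plain adjective
}


def to_coarse(tag: str) -> str:
    """Map a NUPOS tag to a coarser tag; tags not in the table pass through."""
    return NUPOS_TO_COARSE.get(tag, tag)


def recover_second_singular(word: str, coarse_tag: str) -> str:
    """Heuristic from the report: a 'vvd' form ending in 'st' is almost
    certainly a second person singular, so restore the finer 'vvd2' tag."""
    if coarse_tag == "vvd" and word.lower().endswith("st"):
        return "vvd2"
    return coarse_tag


if __name__ == "__main__":
    for word, tag in [("lovedst", "vvd2"), ("loved", "vvd"), ("loving", "j-vvg")]:
        coarse = to_coarse(tag)
        print(word, tag, "->", coarse, "->", recover_second_singular(word, coarse))
```

The recovery step is a heuristic and will occasionally misfire, for instance on a past tense such as 'cast', which is why the report says "almost certainly" rather than "always".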
MorphAdorner errors
Like any other tagger, MorphAdorner makes mistakes. Attached to this report is the spreadsheet MorphAdornerErrors.xls. The current training set for MorphAdorner consists of Shakespeare, Wroth's Urania, Painter's Palace of Pleasure, and Austen's Emma. This is a corpus of 1.8 million words. The training corpus was hand-corrected by me. Phil tagged the training corpus with MorphAdorner trained on it, a slightly incestuous but useful procedure. There were 45,093 errors in a corpus of 1,840,713 words. This is a very respectable error rate of 2.4%. Allow an additional 0.5% for errors on new texts, and you're still within 3%.
Over the course of the next two weeks I will extend the training data by including chapters from Scott, Edgeworth, Dickens, George Eliot, Trollope, Thackeray, Stowe and some others. Scottish and black dialects are going to be the two challenges for MorphAdorner in the nineteenth-century fiction world.
Looking in detail at the most common errors reveals a common story but also points to an advantage of the NUPOS tag set, which explicitly alerts users to expected errors. The fourteen most common errors account for 22,871 occurrences, or more than half of the total. Some 5,000 cases involve a confusion of conjunctive or prepositional uses of words that can be used adverbially, conjunctively, or prepositionally. In the mistakes of 'p-acp' for 'c-acp' or 'c-acp' for 'p-acp', a common error source is flagged in the heuristic word class 'acp' ('since' is a good example). There are almost 1,200 cases in which an interrogative use of a wh-word is wrongly classified as relative (r-crq for c-crq).
Not unexpectedly, the most common error with verbs involves the confusion of the infinitive and the present form. There are 2,900 errors of mistaking vvb for vvi and 2,400 errors of mistaking vvi for vvb. Strictly speaking, this is an error in only one of the three specifications of word class (verb), tense (present), and mood (indicative or infinitive). The Penn Treebank tag set does not make the distinction between infinitive and indicative.
A very similar and equally expected kind of error involves mistaking a past participle for a past indicative (2,200) or mistaking a past indicative for a participle (1,300). The former error is more common by more than 50% because a past participle is most decisively identified by an immediately preceding auxiliary verb (was, had, etc.). Where the auxiliary is split from the participle by intervening words, errors will be more common. But with the vvd/vvn confusion, as with vvb/vvi, two out of three specifications (verb and past) are accurately captured.
From my experience with hand-correcting CLAWS tagged texts, it is my sense that MorphAdorner and CLAWS are fairly similar in their error rate with regard to vvd/vvn and vvb/vvi.
In 2,100 of the top 22,871 errors, everything is wrong: a noun is classified as an infinitive verb (1,227) or the other way round (873). These are the kinds of errors where you look for improvement. But the good news is that the "hopeless errors" make up only about 10% of these most common errors.
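The arithmetic behind these figures is easy to check. The short Python sketch below uses only the counts quoted in this report (the rounded ones, such as 2,900 and 2,400, are approximate); the names `CONFUSIONS` and `pct` are introduced here for illustration.

```python
# Figures quoted in the report.
TOTAL_WORDS = 1_840_713
TOTAL_ERRORS = 45_093
TOP_14_ERRORS = 22_871

# Counts for individual confusions as quoted in the report (some are rounded).
CONFUSIONS = {
    "vvb for vvi": 2_900,
    "vvi for vvb": 2_400,
    "past participle for past indicative": 2_200,
    "past indicative for past participle": 1_300,
    "noun for infinitive verb": 1_227,
    "infinitive verb for noun": 873,
}


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print("overall error rate:", pct(TOTAL_ERRORS, TOTAL_WORDS), "%")
    print("top 14 as share of all errors:", pct(TOP_14_ERRORS, TOTAL_ERRORS), "%")
    for label, count in CONFUSIONS.items():
        print(f"{label}: {pct(count, TOTAL_ERRORS)} % of all errors")
```

Run as a script, this prints the overall error rate, the share of all errors covered by the fourteen most common types, and the share of each individual confusion.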
I take these figures as very reassuring and am comfortable with the assertion that we are very close to having a morphosyntactic tagging environment that can be easily customized to provide linguistic annotation for English texts that differ widely by period or genre and that will work as well as the best taggers work for modern texts. "Easily" does not necessarily mean "quickly": wherever the training set is altered, some human has to do some work, which is likely to be measured in days rather than hours, but typically not weeks rather than days.
MorphAdorner and Early Modern English
The following is a non-technical report on Phil Burns' MorphAdorner as a tool for tagging early modern English texts ('early modern' means roughly texts from the late fifteenth to the early eighteenth century). I have spent a fair amount of time over the course of the past week evaluating the tagging of the first volume of Holinshed's Histories of England, Ireland, and Scotland (1587). We chose this text as a test case because we know it to be difficult in various ways, and our assumption has been that good enough tagging results on this text would be fairly substantial evidence that MorphAdorner can handle all but bizarrely difficult texts that are thrown at it. The error rate on this text is on the order of 3%, which is what good modern taggers deliver on modern texts. And the throughput rate is encouraging: it processes ~ 100,000 words a minute or between five and six million words an hour.
Rough edges remain in the fields of orthographic standardization, lemmatization, and name recognition, but the pieces for fixing these things are well in place, and it is a matter of tweaking things and tying them together. It is my understanding that we can now produce some tagging on demand, and it will probably be a good thing to do so because other people will also catch errors that we missed. I also understand that there is a lot of progress on integrating MorphAdorner with the GATE and UIMA environments and that Phil will be able to make some fairly definite statements about this fairly soon.
I know very little and will say nothing about the innards of MorphAdorner, but I know something about tagger outputs, having worked with the CLAWS tagger on a number of corpora. There are several problems when you want to tag texts from before 1800, and the problems get worse the further back you go in time. English orthography was pretty wild in the early sixteenth century, and some of its fixed conventions were orthogonal to ours. It's Gershwin's "tomato/potato" song: they wrote 'vniuersitie' where we write 'university'. Orthographic standardization is a process that worked over several centuries until something like reasonably firm standards of educated spelling and capitalization were firmly established by the early nineteenth century with its own fixed transatlantic variants (honour/honor).
Early modern English contains some forms that are now archaic, such as the second person singular. It uses capitalization to mark names, but capitals may be used for a lot of other purposes so that capitals are unreliable indicators of named entities. And the use of the apostrophe to mark the possessive case is virtually unknown before 1700. Even in modern prose the primary marker of sentence boundaries, the period sign, is an 'overloaded' character that poses difficulties. Early modern print culture adds to the overloading with its own conventions, including weird treatments of Roman numerals.
There is a lot there to confuse a modern tagger, and to my knowledge there is no tagging tool that has the right combination of POS tags, tagging rules, and training corpora to deal with texts that stray outside the prose of the Wall Street Journal (nice as that prose is).
For generations to come, the major source of early modern digital texts will be the transcriptions of EEBO texts in the TCP project (Text Creation Partnership). This archive, consisting now of some 13,000 texts and destined to grow to 25,000 texts, is the largest scholarly archive of responsibly encoded English texts: it is considerably larger than the major Chadwyck-Healey archives put together, and it is a multi-disciplinary collection that includes poetry, drama, as well as histories and theological, political, domestic, alchemical, mathematical, astronomical, astrological, scientific, musical and philosophical treatises. The TCP texts are produced and funded in an enterprise that joins a commercial publisher with a consortium of universities. The texts are currently proprietary, but they will pass into the public domain by the middle of the next decade.
To the extent that MONK will deal with early modern materials it must confront the TCP. It is often the only game in town. And it is critical to remember that for every TCP text there is an EEBO facsimile or digitized version of the microfilm of the original. It matters a lot to scholars that they can go back and look at the original page.
TCP texts pose some generic problems to NLP procedures. The project is older than XML. Texts are still encoded in SGML with character entities for any but lower ASCII characters. Much effort has gone into preserving early typographical conventions such as abbreviations or superscripts. An equal but not very consistent effort has gone into marking the extent of passages that the encoder could not read (some three million of them). Many of the texts have extensive marginal notes. These have been encoded as <note> elements wherever the encoder encountered them.
Put all this together, and your modal TCP text will give you a pretty bumpy ride if you approach it with the idea that a text is a linear thing that moves from the previous to the next word in an orderly and predictable fashion.
There is not at this moment a lossless XML representation of the TCP SGML texts. This is something we have been working on at Northwestern in the context of the CIC-funded VOSPOS project (virtual orthographic standardization and part of speech tagging). But once you have such a lossless representation you need to shed or standardize a lot of information in order to make the texts amenable to NLP procedures of any kind. Thousands of y's followed by a superscript 't' or 'e' are so many instances of 'that' or 'the', and something that appears in the SGML transcription as '&abper;' is a printer's symbol for the syllable 'per'. There are hundreds of such typographical conventions for which there are no Unicode code points and which are now handled by character entities in the SGML texts and by private use Unicode code points (and some kludges) in XML versions of TCP texts.
At the same time you do not want to break the chain that extends from the original SGML transcription (which has some authority, however dubious) to the processable representation of the text. This is the general context within which you might want to look at the following fragment from a "Frankenfile," as Steve Ramsay calls it, which produces a tokenized, orthographically standardized, lemmatized, and POS tagged version of the following SGML fragment:
<P>ABout this time,<NOTE PLACE="marg">Beda histae<GAP DESC="illegible" RESP="tech" EXTENT="3 letters"/> lib. 3. cap. 21. 653.</NOTE> the people of Mercia common|lie
called Middleangles, receiued the christian
faith vnder their king named Peda. . .
<p>
<w eos="0" id="a68197-544663" lem="About" pos="p-acp" reg="About" spe="ABout" tok="ABout" ord="531388">ABout</w>
<w eos="0" id="a68197-544664" lem="this" pos="d" reg="this" spe="this" tok="this" ord="531389">this</w>
<w eos="0" id="a68197-544665" lem="time" pos="n1" reg="time" spe="time" tok="time" ord="531390">time</w>
<w eos="0" id="a68197-544666" lem="," pos="," reg="," spe="," tok="," ord="531391">,</w>
<note anchored="yes" place="marg">
<w eos="0" id="a68197-544879" lem="Beda" pos="FW-LA" reg="Beda" spe="Beda" tok="Beda" ord="531392">Beda</w>
<w eos="0" id="a68197-544880" lem="histae?" pos="n1" reg="histae?" spe="histae?" tok="histae?" ord="531393">histae</w>
<gap desc="illegible" extent="3 letters" resp="tech">
<w eos="0" id="a68197-544880" lem="histae?" pos="n1" reg="histae?" spe="histae?" tok="histae?" ord="531393">???</w>
</gap>
<w eos="0" id="a68197-544881" lem="lib" pos="n1-j" reg="lib" spe="lib" tok="lib" ord="531394">lib</w>
<w eos="0" id="a68197-544882" lem="." pos="." reg="." spe="." tok="." ord="531395">.</w>
<w eos="0" id="a68197-544883" lem="3" pos="crd" reg="3." spe="3." tok="3." ord="531396">3.</w>
<w eos="0" id="a68197-544884" lem="cap" pos="n1" reg="cap" spe="cap" tok="cap" ord="531397">cap</w>
<w eos="0" id="a68197-544885" lem="." pos="." reg="." spe="." tok="." ord="531398">.</w>
<w eos="0" id="a68197-544886" lem="21" pos="crd" reg="21." spe="21." tok="21." ord="531399">21.</w>
<w eos="1" id="a68197-544887" lem="653" pos="crd" reg="653." spe="653." tok="653." ord="531400">653.</w>
</note>
<w eos="0" id="a68197-544667" lem="the" pos="dt" reg="the" spe="the" tok="the" ord="531401">the</w>
<w eos="0" id="a68197-544668" lem="people" pos="n1" reg="people" spe="people" tok="people" ord="531402">people</w>
<w eos="0" id="a68197-544669" lem="of" pos="pp-f" reg="of" spe="of" tok="of" ord="531403">of</w>
<w eos="0" id="a68197-544670" lem="Mercia" pos="np1" reg="Mercia" spe="Mercia" tok="Mercia" ord="531404">Mercia</w>
<w eos="0" id="a68197-544671" lem="common" pos="av-j" reg="commonly" spe="commonlie" tok="common|lie" ord="531405">common|lie</w>
<w eos="0" id="a68197-544672" lem="call" pos="vvn" reg="called" spe="called" tok="called" ord="531406">called</w>
<w eos="0" id="a68197-544673" lem="Middleangle" pos="np2" reg="Middleangles" spe="Middleangles" tok="Middleangles" ord="531407">Middleangles</w>
<w eos="0" id="a68197-544674" lem="," pos="," reg="," spe="," tok="," ord="531408">,</w>
<w eos="0" id="a68197-544675" lem="receive" pos="vvd" reg="received" spe="receiued" tok="receiued" ord="531409">receiued</w>
<w eos="0" id="a68197-544676" lem="the" pos="dt" reg="the" spe="the" tok="the" ord="531410">the</w>
<w eos="0" id="a68197-544677" lem="christian" pos="jp" reg="christian" spe="christian" tok="christian" ord="531411">christian</w>
<w eos="0" id="a68197-544678" lem="faith" pos="n1" reg="faith" spe="faith" tok="faith" ord="531412">faith</w>
<w eos="0" id="a68197-544679" lem="under" pos="p-acp" reg="under" spe="vnder" tok="vnder" ord="531413">vnder</w>
<w eos="0" id="a68197-544680" lem="their" pos="po32" reg="their" spe="their" tok="their" ord="531414">their</w>
<w eos="0" id="a68197-544681" lem="king" pos="n1" reg="king" spe="king" tok="king" ord="531415">king</w>
<w eos="0" id="a68197-544682" lem="nam" pos="vvn" reg="nam" spe="named" tok="named" ord="531416">named</w>
<w eos="0" id="a68197-544683" lem="Peda" pos="np1" reg="Peda" spe="Peda" tok="Peda" ord="531417">Peda</w>
<w eos="0" id="a68197-544684" lem="or" pos="cc" reg="or" spe="or" tok="or" ord="531418">or</w>
The doubly interrupted flow of this text is represented in eight different versions, expressed as different sequences of attribute values. There is a 'dumb' word counter and a 'smart' or context-sensitive word counter. The dumb counter, the ord attribute, counts word tokens as they occur in the text as a flat file. The smart counter, the id attribute, keeps track of different sequences: while a note with nine tokens interrupts the flow of the main text "ABout this time, the people," the id sequence identifies 'the' as the token that follows the comma after 'time'.
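The difference between the two counters can be made concrete with a small Python sketch. It uses a cut-down, well-formed version of the fragment above (only the id, ord, and tok attributes, with the note shortened to two tokens). The helper names are hypothetical, and the sketch assumes that the number after the hyphen in the id is what orders the tokens within a sequence.

```python
import xml.etree.ElementTree as ET

# Cut-down sample: attribute values are copied from the fragment above, but
# most attributes and most of the note's tokens are omitted.
SAMPLE = """
<p>
  <w id="a68197-544665" ord="531390" tok="time">time</w>
  <w id="a68197-544666" ord="531391" tok=",">,</w>
  <note place="marg">
    <w id="a68197-544879" ord="531392" tok="Beda">Beda</w>
    <w id="a68197-544887" ord="531400" tok="653.">653.</w>
  </note>
  <w id="a68197-544667" ord="531401" tok="the">the</w>
  <w id="a68197-544668" ord="531402" tok="people">people</w>
</p>
"""


def id_number(w):
    """Numeric part of an id such as 'a68197-544665' (assumed layout)."""
    return int(w.get("id").rsplit("-", 1)[1])


def flat_order(root):
    """Tokens in flat-file order, as counted by the 'dumb' ord attribute."""
    words = sorted(root.iter("w"), key=lambda w: int(w.get("ord")))
    return [w.get("tok") for w in words]


def logical_order(root):
    """Tokens ordered by the 'smart' id counter: main text stays contiguous."""
    return [w.get("tok") for w in sorted(root.iter("w"), key=id_number)]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE.strip())
    print(flat_order(root))     # note tokens interrupt the main text
    print(logical_order(root))  # 'the' directly follows the comma
```

In this sample the note's ids are higher than those of the surrounding main text, so sorting by id moves the note after the paragraph. Whether that holds for every note in a full file is not something this sketch can confirm.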
The word that is split by the <gap> element is accurately tokenized as a single entity in the tok attribute but its parts are separately recorded as the content of sequential <w> elements with identical id and ord attribute values (I don't quite understand why it is done this way, but it seems to work better from a programming perspective).
The value of the spe attribute is usually identical with the value of the tok attribute, but sometimes it is not. Look at "common|lie," where the vertical bar is the SGML representation of a soft hyphen at the end of a line. The value of the spe attribute is the original spelling as it would ordinarily appear in a text from that period. In this case that is 'commonlie.' The spe attribute is also used to resolve printers' abbreviations or odd spelling conventions that are not found in this stretch of text but are very common. Thus "y^t" becomes "that", "&abper;ficit" becomes "perficit", and other printing conventions are similarly written out in their contemporary rather than modern form (although these will often be the same).
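As a rough illustration of how a spe value relates to a tok value, here is a minimal Python sketch. It is not MorphAdorner's method: the expansion table contains only the examples mentioned in this report, the ASCII notation "y^t" is taken from the text above rather than from the actual encoding, and the function name `tok_to_spe` is hypothetical.

```python
# Hypothetical expansion table built only from the examples in the report.
EXPANSIONS = {
    "y^t": "that",     # y followed by a superscript t
    "&abper;": "per",  # printer's symbol for the syllable 'per'
}


def tok_to_spe(tok: str) -> str:
    """Derive a period spelling from a raw token (illustrative only)."""
    spe = tok.replace("|", "")  # drop the end-of-line soft hyphen marker
    for raw, expanded in EXPANSIONS.items():
        spe = spe.replace(raw, expanded)
    return spe


if __name__ == "__main__":
    for tok in ["common|lie", "y^t", "&abper;ficit", "vnder"]:
        print(tok, "->", tok_to_spe(tok))
```

Note that 'vnder' passes through unchanged: modernizing it to 'under' is the job of the reg attribute, not of spe.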
The value of the reg attribute is the standard modern orthographic form of the original spelling. But the morphological form is not modernized. Thus a spelling like 'lovyth' would be regularized to 'loveth', but 'loveth' would not be regularized to 'loves' but is recognized as a standard archaic form.
The lem and pos attributes record the lemma and part of speech. The eos attribute (end of sentence) records whether the token ends a sentence (1) or not (0).
This is a very verbose representation from which you can derive a simplified version in which there is only one <w> element for each token and the value of the spe attribute becomes the content of that element. You could also use <c> elements to capture punctuation, drop the eos attribute, and mark sentence boundaries as a type attribute of the <c> element. Alternately you could do it with some milestone element. But this simplified file can be traced back token by token at least to the stage of the lossless XML version that served as the source file for the process of linguistic annotation.
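One way such a simplification might look is sketched below in Python. The input is a small invented sample in the style of the fragment above (the ids are made up), and the rule for recognizing punctuation (no alphanumeric characters in the spe value) is an assumption of this sketch. Each output element keeps its id, so it can still be traced back to the verbose file.

```python
import xml.etree.ElementTree as ET

# Invented sample in the style of the verbose file; the ids are hypothetical.
VERBOSE = """
<p>
  <w eos="0" id="a1-1" spe="histae?" tok="histae?">histae</w>
  <w eos="0" id="a1-1" spe="histae?" tok="histae?">???</w>
  <w eos="0" id="a1-2" spe="commonlie" tok="common|lie">common|lie</w>
  <w eos="0" id="a1-3" spe="called" tok="called">called</w>
  <w eos="1" id="a1-4" spe="." tok=".">.</w>
</p>
"""


def simplify(root):
    """Build a flat <p> with one <w> or <c> element per token id."""
    out = ET.Element("p")
    seen = set()
    for w in root.iter("w"):
        wid = w.get("id")
        if wid in seen:  # later part of a token split by a <gap> element
            continue
        seen.add(wid)
        spe = w.get("spe")
        # Assumption: a token with no letters or digits is punctuation.
        is_punct = not any(ch.isalnum() for ch in spe)
        el = ET.SubElement(out, "c" if is_punct else "w", id=wid)
        el.text = spe  # the spe value becomes the element content
        if is_punct and w.get("eos") == "1":
            el.set("type", "eos")  # sentence boundary marked on <c>
    return out


if __name__ == "__main__":
    simple = simplify(ET.fromstring(VERBOSE.strip()))
    print(ET.tostring(simple, encoding="unicode"))
```

The sketch only marks a sentence boundary when the sentence-final token is punctuation. Where the eos value sits on a word token, as with '653.' in the fragment above, a milestone element or some other device would be needed.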
There is another representation of this text, which has been useful for developing a training corpus and for doing data checking, which will remain an important task until there is a fully trustworthy training corpus. It is probably more accurate to speak of training corpora because works from widely different periods or genres may require adjustments in the training data. At least there should be some skepticism about the claim that one set of training data can accommodate all texts that fall under MONK. On the other hand, it may turn out that variations in the training data will produce only marginal improvements in the output. That is an empirical matter to be settled by checking data and quantifying types and rates of error. On the plus side, Phil's tagger is very fast at 'learning' a new training set.
For data checking purposes, I have used a tabular representation of the text in which the attribute values appear as columns and each data row includes columns that show 40 characters on either side of the <w> elements. This gives you a highly manipulable KWIC output in which you can focus on particular POS tags and look for improvements in the areas that are likely to be of highest interest to users. It is, for instance, almost impossible to keep the different uses of words like 'as' or 'like' clearly separate. But if you distinguish with high precision between verbs, adjectives, nouns, and named entities, most users will probably be forgiving about the failure to distinguish consistently between prepositional or adverbial uses of 'in' or 'on'. I attach a short section of such a table as an Excel document (MorphAdornerSample.xsl).
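A table of this kind is simple to generate. The Python sketch below builds rows with up to 40 characters of context on either side of each token; the token list is taken from the fragment above, while the function name `kwic_rows` and the row layout are assumptions made for illustration, not the format of the attached spreadsheet.

```python
# (spelling, POS tag) pairs taken from the fragment earlier in this report.
TOKENS = [
    ("the", "dt"), ("people", "n1"), ("of", "pp-f"), ("Mercia", "np1"),
    ("commonlie", "av-j"), ("called", "vvn"), ("Middleangles", "np2"),
]


def kwic_rows(tokens, width=40):
    """Return (left context, word, pos, right context) for every token.

    Contexts are cut to at most `width` characters on either side.
    """
    words = [word for word, _ in tokens]
    rows = []
    for i, (word, pos) in enumerate(tokens):
        left = " ".join(words[:i])[-width:]
        right = " ".join(words[i + 1:])[:width]
        rows.append((left, word, pos, right))
    return rows


if __name__ == "__main__":
    # Focus on one POS tag, here past participles, as a checker might.
    for left, word, pos, right in kwic_rows(TOKENS):
        if pos == "vvn":
            print(f"{left:>40} | {word} ({pos}) | {right}")
```

Rebuilding the context string for every token is wasteful on a large file, so a real tool would use a sliding window, but the result is the same kind of sortable, filterable table.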
A network-based version of such a tabular representation is a possible tool for user-generated data improvement. The quality of texts in Project Gutenberg has greatly benefited from the volunteer-based efforts of the Distributed Proofreaders Foundation. It would not be difficult from a technical perspective to create such an environment. Whether there are enough users who want to do this kind of thing is another question. But if users find the data not good enough, it would be nice to be able to tell them: "Here is a way in which you can help yourself and us to fix them."
|