|
MONK : The analytical potential of POS n-grams
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
The following is a brief report on an experiment I did with POS n-grams in Shakespeare. Baayen, in his Analyzing Linguistic Data, reports on several experiments in which linguists used tag trigrams to discriminate between different types of texts. I wanted to know whether the analytical potential of POS n-grams would be enhanced if you added longer n-grams. The short answer to the question appears to be No.
I based my experiments on forty plays and poems of Shakespeare, a corpus of some 900,000 words. Some works consist entirely of verse. Others mix verse and prose. No work consists of prose alone. The works are conventionally divided into the subgenres of tragedy, history, comedy, and romance. I ignored commas, but not other punctuation marks. This may or may not have been a good decision, but since I applied it uniformly to n-grams of all lengths, it is unlikely to make a difference to the outcome.
The fundamental facts jump to the eye in the following table that groups POS n-grams by length and occurrence. The frequency of n-grams diminishes very rapidly with their length. There are 34 trigrams that occur more than 1,000 times in Shakespeare, compared with three tetragrams, and no pentagrams.
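For readers who want to reproduce this kind of count on their own tagged text, here is a minimal Python sketch of the procedure described above: drop commas, slide a window of length n over the tag sequence, and keep the n-grams whose count exceeds a threshold. It is not the code used for this report. The function names, the tag names, and the ten-tag toy sequence are all assumptions made for illustration, and the threshold of 1 stands in for the 1,000 used on the full corpus.

```python
from collections import Counter

def pos_ngrams(tags, n, skip=(",",)):
    """Return all n-grams of POS tags after dropping the tags in `skip`.

    Only commas are skipped by default, mirroring the decision in the
    report to ignore commas but keep other punctuation.
    """
    kept = [t for t in tags if t not in skip]
    return [tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)]

def count_ngrams(tags, lengths=(3, 4, 5)):
    """Count the n-grams of each requested length."""
    return {n: Counter(pos_ngrams(tags, n)) for n in lengths}

def frequent(counter, threshold):
    """Keep only the n-grams that occur more than `threshold` times."""
    return {gram: c for gram, c in counter.items() if c > threshold}

# Toy tag sequence with hypothetical tag names (not the tagset of the corpus).
tags = ["DET", "NOUN", "VERB", ",", "DET", "NOUN", "VERB", "DET", "NOUN", "."]
counts = count_ngrams(tags)

for n in (3, 4, 5):
    print(n, len(frequent(counts[n], 1)))
```

Even in this toy sequence, three distinct trigrams occur more than once, against two tetragrams and one pentagram, which is the same pattern of rapid decline with length seen on a much larger scale in the corpus.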
I played around with these data in the JMP statistical program, using discriminant analysis and testing whether POS n-grams distinguish sharply between prose and verse, between genres, or between periods. Discriminant analysis is a useful technique because you can look for the discriminant power of multiple variables separately or together. In a first trial, I used the 35 tag trigrams that occurred more than 1,000 times. They discriminate very sharply between verse and prose. They also discriminate sharply between prose or verse written before 1596 and prose or verse written after 1605, and somewhat less sharply between prose or verse in different genres. Then I used the 36 tetragrams and one pentagram that occurred more than 300 times. They did a fine job at distinguishing prose and verse. They did not work so well on distinguishing prose or verse by period, and they did poorly on distinguishing verse or prose by genre. I conclude from this that tag trigrams work quite well and that longer n-grams add little. If the purpose is to use syntactic fragments as proxies for the analysis of larger syntactic structure, tag trigrams appear to offer evidence that is good enough for a variety of uses.
Can one extrapolate confidently from the Shakespeare corpus to other corpora? I think you can. If anything, you may expect differences between authors to be rather larger than differences within the work of a single author.
The results of the various Shakespearean tests square with common sense. The differences between prose and verse are more striking than other differences. This suggests the rule of thumb that one should always measure prose and verse separately. Otherwise one may simply be measuring the effect of different proportions of prose and verse in a given work. Something similar is almost certainly true of the difference between spoken and narrated passages in fiction. But we lack fiction corpora in which the difference between speech and narrative is encoded.
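To give a rough idea of what a discriminant analysis does with n-gram frequencies, here is a minimal Python sketch of a two-class Fisher linear discriminant on two variables. It is not the JMP procedure used for this report, which handled many more variables and more than two groups. The function names are assumptions, and the numbers are invented for illustration: they stand for the rates of two hypothetical trigrams per 10,000 tokens in four verse samples and four prose samples, and are not Shakespeare data.

```python
def mean(rows):
    """Column-wise mean of a list of (x, y) feature pairs."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, m):
    """2x2 scatter matrix of the rows around their mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - m[0], r[1] - m[1])
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fit_fisher(class_a, class_b):
    """Return (w, threshold) for a two-class Fisher discriminant.

    w is the inverse pooled within-class scatter times the difference
    of the class means; the threshold is the projection of the midpoint
    between the two means.
    """
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = (ma[0] - mb[0], ma[1] - mb[1])
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = ((ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2)
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold

def classify(x, w, threshold, labels=("verse", "prose")):
    """Project x onto w; scores above the threshold get the first label."""
    score = w[0] * x[0] + w[1] * x[1]
    return labels[0] if score > threshold else labels[1]

# Invented rates of two hypothetical trigrams (illustration only).
verse = [(52, 30), (55, 28), (50, 33), (54, 31)]
prose = [(40, 41), (38, 44), (42, 39), (41, 43)]

w, threshold = fit_fisher(verse, prose)
print(classify((53, 29), w, threshold))
print(classify((39, 42), w, threshold))
```

The sketch also illustrates why the variables can be examined separately or together: each weight in w shows how much one trigram contributes to the separation once the other is taken into account.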
I say nothing in this report about what happens when you look closely, as you should, at the different trigrams, and ask what larger syntactic preferences they point to. |
| Document generated by Confluence on Apr 19, 2009 15:04 |