|
MONK : Baayen's Analyzing linguistic data
This page last changed on Dec 16, 2007 by martinmueller@northwestern.edu.
I have put a copy of Harald Baayen's forthcoming Analyzing linguistic data on the MONK wiki under Recommended Readings. The ground rule for downloading the free pdf of the draft book is a willingness to buy it when it comes out. It will almost certainly be helpful for several copies of this book to float around MONK and SEASR, if only to help with the all-important task of documentation. Mastering the Art of Text Analysis (with apologies to Julia Child) would be a good companion volume to the first release of MONK.

For MONK purposes, the key chapter in Baayen is "Clustering and classification." Like the other chapters, it explains particular procedures through the detailed examination of particular examples (use cases). While all of them come from Linguistics, some of them turn on purposes and features that are clearly of interest to Literary Studies. I notice in particular the following:

1. A data set 'affixProductivity', which tabulates information about different affixes ('hood', 'tion', 'ness', etc.) for 44 different authors classified under four genres (religious, children, literature, official). Affixes are classified as 'native' (

2. A data frame 'oldFrench', which provides the evidence for an analysis of register variation and diachronic variation in the use of syntactic constructions in Medieval French. The source documents are manuscripts by 29 different authors. They were divided into 2,000-word chunks, and the 35 most common trigrams were extracted from the texts. The guiding assumption here is that a limited set of tag trigrams is a good enough proxy for capturing syntactic variance in texts that differ by the categories that are of analytical interest in MONK: time, gender, genre, place of origin, social register, etc. Correspondence analysis and SVM are used to analyze the data.

3. A data set that lists the presence or absence of 125 grammatical features (fricatives, prenasalized stops, etc.) for 15 Papuan and 16 Oceanic languages. The goal is to create a dendrogram that gives you a hypothesis about the genealogy of those languages. Of particular interest in this example is the visualization technique of "unrooted trees," in which lines branch off in all directions from a center, with sub-branchings along the way. If you look at the unrooted trees showing the affinities of Oceanic and Papuan languages, it is an attractive thought to imagine similar visualizations of novels or authors in a "fiction space."

4. A data set 'spanish', which consists of 3,000-word samples from five different texts by three Spanish authors. Principal component analysis and discriminant analysis are used to test how successfully tag trigrams discriminate between authors or can be used to assign unknown text samples correctly.

I think it would be helpful to use Baayen's book as a point of departure for a discussion within MONK and SEASR about a sufficient set of statistical routines and data requirements ('features') to cope with the likely range of questions that users will bring to MONK and for which quantitative inquiries of one kind or another sound promising. The likely range can, I think, be defined with some precision by looking at

1. The individual use cases of people on the MONK team

We probably have a pretty good idea of what all this adds up to. But it will be helpful to make it explicit in the weeks to come. |
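As a concrete illustration of the 'features' at stake in the 'oldFrench' and 'spanish' examples above, here is a minimal Python sketch of how tag-trigram features might be derived from texts. It is not Baayen's code. It assumes each text is already available as a sequence of part-of-speech tags; the function names, the choice of relative frequencies, and the decision to drop a short final chunk are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Trigram = Tuple[str, str, str]


def chunk(tags: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split a tag sequence into consecutive chunks of exactly `size` tags.

    Assumption: a short remainder at the end of a text is dropped.
    """
    return [tags[i:i + size] for i in range(0, len(tags) - size + 1, size)]


def tag_trigrams(tags: Sequence[str]) -> List[Trigram]:
    """Return every run of three adjacent tags in the sequence."""
    return list(zip(tags, tags[1:], tags[2:]))


def trigram_features(
    texts: Dict[str, Sequence[str]],
    chunk_size: int = 2000,
    top_k: int = 35,
) -> Tuple[List[str], List[Trigram], List[List[float]]]:
    """Build a chunk-by-trigram table of relative frequencies.

    `texts` maps a (hypothetical) text label to its tag sequence.
    Returns chunk labels, the `top_k` most common trigrams overall,
    and one row of relative frequencies per chunk.
    """
    labels: List[str] = []
    per_chunk: List[Counter] = []
    overall: Counter = Counter()

    for name, tags in texts.items():
        for n, piece in enumerate(chunk(tags, chunk_size)):
            counts = Counter(tag_trigrams(piece))
            labels.append(f"{name}#{n}")
            per_chunk.append(counts)
            overall.update(counts)

    # Keep only the most common trigrams across all chunks.
    vocab = [trigram for trigram, _ in overall.most_common(top_k)]

    rows: List[List[float]] = []
    for counts in per_chunk:
        total = sum(counts.values())
        rows.append([counts[t] / total if total else 0.0 for t in vocab])
    return labels, vocab, rows
```

A table of this shape (one row per chunk, one column per trigram) is the kind of input that procedures such as correspondence analysis, principal component analysis, or a classifier would then work on.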
| Document generated by Confluence on Apr 19, 2009 15:04 |