MONK : agenda items for DataAnalytics
This page last changed on Feb 13, 2008 by martinmueller@northwestern.edu.
The following is a stab at the agenda items that Data and Analytics have to decide on in the weeks to come. The order is somewhat arbitrary, and I may have forgotten some items.

Modeling midlevel metadata

Midlevel metadata refers to everything between the top level of the document (bibliographical data of various kinds) and the bottom level of linguistic annotation at the word occurrence and sentence level. Agreement about top and bottom level metadata is expressed in the proposal for metadata about works of October 19, 2007 and the proposal for word and sentence level metadata of December 1, 2007 (both at https://apps.lis.uiuc.edu/wiki/display/MONK/Analytics+Cell). With regard to the midlevel, I take it there is broad agreement that the structural articulation of documents across different collections is intrinsically too varied and too inconsistent in its encoding to make its modeling below the <div> level a fruitful endeavour for a data store. Information from the midlevel will draw on some combination of the following:
Named entity extraction

Named entity extraction is a major field of inquiry in all information retrieval and likely to be of keen interest to literary scholars. Phil Burns' memo of February 1 (https://apps.lis.uiuc.edu/wiki/display/MONK/Named+Entity+Extraction) is an excellent summary of the difficulties you face when extracting names from texts of different periods and genres.

Main text and paratext

The approved proposal of October 19, 2007 (https://apps.lis.uiuc.edu/wiki/display/MONK/Analytics+Cell) may need some tweaking at the edges in the light of John Norstad's work with the actual data.

The Abbot and the Prior

The Abbot is Steve Ramsay's name for the application that takes a text from the "wild" and transforms it into a TEI-A text that is linguistically annotated and can be ingested into MONK. The Prior is John Norstad's name for software routines that govern the process of ingestion of an Abbot file into a MONK data store. Citation schemes are a problem that requires particular attention in this context. MorphAdorner assigns two kinds of unique IDs to a text in the process of 'adornment':
Neither scheme is reader friendly in the sense of communicating to readers where they are in a document. Reader-friendly schemes can be created by constructing "bijective" citation schemes in which there is one and only one citation for every numerical ID in the text. Citation schemes of this type are likely to be of a title,volume,chapter,chunk,wordcount type. John Norstad has pointed out that if one envisages multiple representations of a text in different data stores, such bijective citation schemes need to be created before ingestion into the data store(s) so that every word in one representation can be matched in another.

Settle on analytical routines for inclusion in MONK

Naive Bayes and SVM have been identified as binary text classifiers for inclusion in MONK. Other analytics include Dunning's log likelihood ratio and a set of simpler 'analytics' that are bundled under Search and Sort and provide users with tools for exploratory data analysis. It appears that a handful, but no more than a handful, of other classifying and clustering techniques (both supervised and unsupervised) should be part of the initial MONK arsenal. PCA and discriminant analysis are commonly used in text analysis pieces published in the standard journals. "Burrows' Delta" is a technique that has received considerable attention over the past few years. According to Shlomo Argamon it is a k-nearest neighbor algorithm. It is an open question whether it is worth implementing in SEASR or whether SEASR already has an algorithm that performs equally well or better. According to Amit and Loretta, it will be a relatively trivial operation to plug other statistical routines into a work flow that has been developed to accommodate Naive Bayes. On the other hand, the user-friendly display of results is specific to Naive Bayes/SVM, and equivalent display modules would have to be written for other routines.

Data visualization

Data visualization is a key goal in MONK.
It is not something that humanities scholars are familiar with. Examples like the New York Times graphics of lexical changes in State of the Union addresses are very striking. It is also the case that there are few instances of so tightly controlled a genre in which variance can be measured and displayed with equal ease or precision. Most literary research problems involve larger and messier data sets. There are two complementary strategies:
Obviously the second is easier for the non-technical user and more glamorous but also harder for the developers. On the other hand, it will be a non-trivial achievement for MONK if it supports the extraction of customizable data sets from a variety of textual sources in such a manner that the subsequent manipulation in third-party programs can be done by non-technical people with tolerable effort.
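
To make the "bijective" citation schemes discussed under The Abbot and the Prior more concrete, here is a minimal sketch in Python. It is not taken from MorphAdorner, the Abbot, or the Prior. The function name, the shape of the input, and the use of a running integer as the numerical ID are all assumptions made for illustration. The only point is that every word gets exactly one (title, volume, chapter, chunk, word position) citation and every citation points back to exactly one ID.

```python
from typing import Dict, List, Tuple

# Hypothetical citation shape: (title, volume, chapter, chunk, word position).
Citation = Tuple[str, int, int, int, int]


def build_citation_index(
    title: str,
    chunks: List[Tuple[int, int, int, int]],
) -> Tuple[Dict[int, Citation], Dict[Citation, int]]:
    """Build a two-way mapping between numerical word IDs and citations.

    `chunks` lists (volume, chapter, chunk, word_count) in document order.
    Word IDs are assumed to be consecutive integers starting at 0; the real
    ID format is not specified here.
    """
    id_to_citation: Dict[int, Citation] = {}
    citation_to_id: Dict[Citation, int] = {}
    word_id = 0
    for volume, chapter, chunk, word_count in chunks:
        # Word positions restart at 1 inside every chunk.
        for position in range(1, word_count + 1):
            citation = (title, volume, chapter, chunk, position)
            if citation in citation_to_id:
                # A repeated (volume, chapter, chunk) would break bijectivity.
                raise ValueError(f"duplicate citation: {citation}")
            id_to_citation[word_id] = citation
            citation_to_id[citation] = word_id
            word_id += 1
    return id_to_citation, citation_to_id
```

If such an index is built before ingestion, two data stores that hold different representations of the same text could look up the same citation and arrive at the same word, which is the requirement John Norstad pointed out.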
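
For readers unfamiliar with Dunning's log likelihood ratio, mentioned above among the analytical routines, the following Python sketch computes the standard G2 statistic on a 2x2 contingency table for one word in two collections. How MONK itself computes or presents the statistic is not described here. The function and parameter names are assumptions chosen for illustration.

```python
import math


def log_likelihood_ratio(count_a: int, total_a: int,
                         count_b: int, total_b: int) -> float:
    """G2 statistic for one word occurring count_a times in a collection of
    total_a words and count_b times in a collection of total_b words."""
    # Observed 2x2 table: rows are collections, columns are
    # "this word" versus "all other words".
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [
        count_a + count_b,
        (total_a - count_a) + (total_b - count_b),
    ]
    grand_total = total_a + total_b

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand_total
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

A value of zero means the word is used at the same rate in both collections. Larger values mean the difference in rates is less likely to be due to chance. The statistic does not say which collection overuses the word, so a caller would compare the two relative frequencies separately.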
Document generated by Confluence on Apr 19, 2009 15:04