MONK : Analytics Cell Interim Report November 27
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Users have questions. Use cases are one way of formalizing questions. The chief task of the Analytics Cell is to take the users' questions or use cases and translate them into the requirements for "analytics" or "software objects that perform analysis operations." These are technical terms from the IBM Research Report about UIMA. In defining the requirements for particular analytics, the Analytics Cell interacts both with the Data Cell and with the Interface Cell. It works with the Data Cell in formulating the data requirements for particular analytics. It works with the Interface Cell in exposing the various analytics in a user-friendly fashion to students and scholars who typically lack technical expertise in programming or in the statistical underpinnings of many analytical routines.

Search and Sort as the fundamental analytic

In the context of scholarly text analysis in the humanities, the fundamental analytic is something we call Search and Sort, which is both like and unlike "Googling." Like Googling, it is a "find" operation in which you enter some search term(s) and evaluate results. In Googling you want to identify the shortest list of top hits in the quickest time. This is not an untypical operation in scholarly inquiry. But Search and Sort also includes a different mode of operation where you assemble data by some combination of criteria and then work your way through them to look for patterns of various kinds. The concept of "top hits" is not especially relevant to this kind of inquiry, which is both iterative and ruminative. The claim that Search and Sort is the fundamental analytic rests on at least four arguments:
A good Search and Sort implementation depends on the ability to
Search criteria fall into the broad categories of
Details about bibliographical metadata are spelled out in the Proposal for Metadata about works (October 19, 2007).

The key feature of Search and Sort is that the data retrieved in the first search step can subsequently be manipulated by any combination of the criteria that were available for the search in the first place. What a search returns to the user is a 'data frame' in a 'long data format', to use terminology from Harald Baayen's Analyzing Linguistic Data: a tabular representation in which every search criterion appears as a column. Such a data frame becomes the input for the MONK interface, but it may also be exported to third-party spreadsheets, statistical programs, or visualization tools, whether Excel, Minitab, or ManyEyes.

Partial models for Search and Sort in MONK are the search page of Philologic (http://www.lib.uchicago.edu/efts/ARTFL/philologic/), which is very strong on complex query formulation, and the Find Words feature of WordHoard (http://wordhoard.northwestern.edu), which is very strong on letting users group and sort search results in an iterative fashion.

The structure of the data frame for the initial return of search results varies with the size of the result set. If the number of hits is below some threshold (still to be determined but probably between 1,000 and 3,000), the data frame will return individual locations and KWIC information in the form of ~35 characters before and after each hit. If the hits exceed that threshold, the data frame will return aggregate information. Random sampling from large result sets is also a feature of Search and Sort.

The challenges of translating the requirements of Search and Sort into a user-friendly interface have been discussed with the Interface group at length and appear to be well understood.
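The behaviour described above (one row per hit with KWIC context below a threshold, aggregate rows above it, then re-sorting and sampling) can be sketched as follows. This is only an illustration under stated assumptions, not MONK code: the function names, the column names, the threshold value of 1,000, and the choice to aggregate by work are all hypothetical, and the search is a plain whole-word match.

```python
import random
import re
from collections import Counter

KWIC_WIDTH = 35        # characters of context on each side, as in the text
HIT_THRESHOLD = 1000   # assumed value; the report leaves the threshold open


def search(works, term, threshold=HIT_THRESHOLD):
    """Return a long-format data frame as a list of dicts.

    Below the threshold: one row per hit, with location and KWIC context.
    Above it: one aggregate row per work with a hit count.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    rows = []
    for work in works:
        text = work["text"]
        for match in pattern.finditer(text):
            start, end = match.span()
            rows.append({
                "author": work["author"],
                "title": work["title"],
                "offset": start,
                "left": text[max(0, start - KWIC_WIDTH):start],
                "hit": match.group(0),
                "right": text[end:end + KWIC_WIDTH],
            })
    if len(rows) <= threshold:
        return rows
    # Too many hits: collapse to aggregate information per work.
    counts = Counter((row["author"], row["title"]) for row in rows)
    return [{"author": a, "title": t, "hits": n} for (a, t), n in counts.items()]


def sort_rows(rows, *columns):
    """Re-sort a result set by any combination of its columns."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in columns))


def sample_rows(rows, k, seed=0):
    """Draw a reproducible random sample from a large result set."""
    return random.Random(seed).sample(rows, min(k, len(rows)))
```

Because every criterion is a column, the same rows can be regrouped repeatedly, for example `sort_rows(rows, "author", "right")` to order hits by author and then by the following context, and the list of dicts can be written out as CSV for use in a spreadsheet or statistics package.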
Clustering and classification analytics

Harald Baayen in Analyzing Linguistic Data has a very useful chapter on "Clustering and classification," which illustrates with a variety of linguistic and literary use cases the major statistical routines that are used in text analysis and text mining. Clustering techniques involve
Classification techniques involve
Binary text classification

Within this domain of text analysis routines, the central use case in MONK has been binary text classification using either Naive Bayes or Support Vector Machines. Naive Bayes was successfully used to discriminate between 'erotic' and 'non-erotic' letters in Emily Dickinson's correspondence. Efforts to discriminate sharply between 'sentimental' and 'non-sentimental' passages in nineteenth-century novels have not yet been successful. Possible reasons are:
The first and second of these are straightforward if tedious matters. The fourth we hope is not true. Feature selection may be a good target of attention. It appears to be more of an art than a science, and it may be helpful to have more explicit discussions about it. Features include such things as
There may not, however, be enough of a shared sense of what has worked here or there and what is likely to work.

Other techniques of clustering and classification

While particular use cases in the current MONK group have focused on binary text classification as the central text mining routine, there are other techniques that are widely used. This is apparent from Baayen's survey, and it is confirmed by even the most casual examination of Literary and Linguistic Computing, the leading journal in the humanities text analysis field. A tool kit can be too large, but it can also be too small. We probably should look beyond the hammer of binary text classification.

There are two questions that the Analytics Cell should discuss. First, what are the most promising techniques to be added to the repertoire? Second, if there is a workflow that routes data from the data store through the D2K analysis engine to an interface for the purposes of binary text classification, how much programming is required to adapt that workflow to other techniques, e.g. principal component or discriminant analysis?

In the Humanities Computing text analysis world, "Burrows' delta" has received a fair amount of attention. As I understand it, this is an implementation of a 'nearest neighbor classification' system. The general utility of that approach is well illustrated by Burrows' thoughtful and nuanced piece on textual analysis in Schreibman's, Siemens' and Unsworth's Companion to Digital Humanities. Is it worth implementing Burrows' delta in MONK? Or does D2K already contain a procedure that works as well or better?

Burrows defines delta as "the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text." Shlomo Argamon has redefined it as "the sum of the standard-deviation-normalized absolute differences of the word frequencies.
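The two verbal definitions can be written out as follows. This is a reconstruction from the prose above, not a quotation from either author: the symbols f_i(D) for the relative frequency of word i in text D, and mu_i and sigma_i for the mean and standard deviation of that frequency across the reference texts, are notation assumed here. Only n and the subscript B come from the text.

```latex
\[
  z_i(D) = \frac{f_i(D) - \mu_i}{\sigma_i}
\]
\[
  \Delta_B(D, D')
    = \frac{1}{n} \sum_{i=1}^{n} \bigl| z_i(D) - z_i(D') \bigr|
    = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{f_i(D) - f_i(D')}{\sigma_i} \right|
\]
```

The means cancel when two z-scores for the same word are subtracted, which is why the two descriptions agree. The factor 1/n gives Burrows' mean; since n is fixed for any one comparison, dropping it to obtain a plain sum does not change which text comes out nearest.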
Note that n is the number of frequent words used, and the subscript B indicates equivalence to

These are not sentences that humanist readers love to read. But they would probably appreciate clustering techniques that can be used and interpreted intelligently with a very limited understanding of the underlying math or its attendant computational routines.

Dunning's log likelihood ratio

Dunning's log likelihood ratio is a straightforward technique for binary text comparison. Given a corpus A and a corpus B, you can determine which textual phenomena are over- or underused in A as compared with B or the other way round. It is a very helpful tool for defining a text in terms of its positive and negative keywords. It is being implemented in MONK, using code from WordHoard.

Collocation

J. B. Firth famously said that "you shall know a word by the company it keeps." Collocation analysis is the translation of that maxim into particular techniques. There is a variety of statistical techniques for measuring the company a word keeps. A technique called "Specific Mutual Information" tends to do the best job at measuring the company a content word keeps with other content words, which is what most users are interested in. We have not yet made a decision about whether or how to implement a collocation feature, but it is an important topic to think about. In a practical way, concordance output that can be sorted by the word that precedes or follows provides a very simple tool for spotting immediate collocates.

N-grams and repeated phrases

Questions relating to n-grams and repetitions have been discussed frequently but inconclusively. This is an area to which we need to return with some desire for decisions and closure. There are two quite different approaches. In the first, you have no particular research interest in repetitions. You assume that n-grams of fixed size are useful 'features' for various text analysis routines.
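As a concrete picture of the first approach, fixed-size n-grams over any token layer (word forms, lemmata, or POS tags) can be turned into features with a few lines of code. The sketch below is only an illustration; the function names are hypothetical, and relative frequency is just one possible choice of feature value.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, n):
    """Map each n-gram to its relative frequency among all n-grams of the text."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# The same function serves POS trigrams or lemma bigrams; only the input
# layer and the value of n change.
pos_tags = ["DT", "NN", "VBZ", "DT", "NN"]
pos_trigram_features = ngram_features(pos_tags, 3)
```

The resulting dictionaries can be fed to a classifier as sparse feature vectors, which is where the open question of which n and which token layer to prefer arises.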
To judge from MONK internal discussions, from conversations with linguists, and some outside reading, I gather that there is no firmly received wisdom on what n-grams are useful for what. Baayen reports on two quite interesting analyses that are based on POS trigrams. In conversation, he seemed to imply that longer n-grams become statistically unmanageable. Mark Olsen and Shlomo Argamon have reported good success with lemma bigrams as features in Bayesian analysis. If there is knowledge beyond anecdote in this area I have not found it.

In the second approach, you actually care about repetitions in your text. This is Tanya's use case. You want to know what they are, where they begin, where they end, and how many of them there are. Instead of a starting point of fixed n-grams, you have an end point of a lexicon of repeated phrases. Fixed n-grams are, however, one way of getting to such a lexicon.

Two different questions arise from these approaches. With the first approach you want guidelines that tell you which n-grams to use for which purpose. With the second approach you want to know how you can construct a lexicon of repeated phrases or passages of varying length in an author or across a corpus.

Google gave the Linguistic Data Consortium a vast corpus of five-word n-grams that appear at least 40 times on the web: 1,176,470,663 of them, to be precise (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13). Does this do anything for us? It tells me that the science of n-grams is still a decidedly heuristic enterprise.
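The second approach, using fixed n-grams as a route to a lexicon of repeated phrases of varying length, might be sketched as follows. This is one simple strategy among several and is not drawn from MONK: it collects repeated n-grams for growing n until none repeat, then drops any phrase that only ever occurs inside a longer repeated phrase. All names are hypothetical, and overlapping occurrences are counted.

```python
from collections import defaultdict


def repeated_ngrams(tokens, n):
    """Map each n-gram that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        positions[tuple(tokens[i:i + n])].append(i)
    return {gram: pos for gram, pos in positions.items() if len(pos) > 1}


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def repeated_phrase_lexicon(tokens, min_len=2):
    """Return {phrase: start positions} for maximal repeated phrases."""
    found = {}
    n = min_len
    while True:
        repeats = repeated_ngrams(tokens, n)
        if not repeats:
            break  # if no n-gram repeats, no longer phrase can repeat either
        found.update(repeats)
        n += 1
    lexicon = {}
    for phrase, pos in found.items():
        # A phrase is absorbed if a longer repeated phrase contains it and
        # occurs equally often, i.e. it never appears on its own.
        absorbed = any(
            len(other) > len(phrase)
            and len(other_pos) == len(pos)
            and contains(other, phrase)
            for other, other_pos in found.items()
        )
        if not absorbed:
            lexicon[phrase] = pos
    return lexicon
```

Each entry records what the repetition is, where each occurrence begins (its end follows from the phrase length), and how many occurrences there are. The pairwise check at the end is quadratic in the number of repeated n-grams, so a corpus-scale version would need a more economical structure such as a suffix array.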