|
MONK : Analytics Cell
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Members
Purpose and Scope
The Analytics Cell is responsible for what we are calling "data analytics" in MONK. We're using the term "analytics" to include text analysis broadly conceived, from traditional procedures (like concordance generation) to more sophisticated quantitative and statistical analyses (a la WordHoard) and text mining (a la nora). There's a thin boundary separating analytics from issues of primary concern to the Interface Cell and the Uses and Users Cell, particularly since most of us believe that visualizations (the usual results of analytical procedures) are themselves interfaces, and that active use cases are what drive development at all levels. Still, we have established this cell as a distinct specialization within the overall project in the hope that participants pursuing various ideas in text analysis can compare notes and work together.
Approved Proposals
The following sections contain proposals that have been reviewed and approved by the Analytics Cell. The claim is that the approved proposals spell out requirements with sufficient precision for other cells, especially Data and Interface, to implement them, leaving details to their discretion. If this is not the case, tell us about it right away.
Proposal for metadata about works (October 19, 2007)
The following metadata will be kept about each work. The list follows the list of metadata about works that Stéfan circulated after the October 16 call about metadata needed for the current workbench components.
It will be a matter for the Data Cell to determine the most efficient way of maintaining these metadata through some combination of the teiHeader, METS, MODS, or Dublin Core.
In a memo of October 22, Catherine suggests adding a unique "most likely date," which for many purposes may be better than an average of start and end date. I add this as a friendly amendment. MM 11-30-2007
Proposal about main text and paratext (October 17, 2007)
For a variety of purposes, many users will find it helpful to filter out certain kinds of 'paratext' from their searches and analyses. They prefer to see the movie without the trailers and credits. Main text consists of what is clearly part of the author's words from the naive reader's perspective. Paratext consists of what is ambiguously or clearly not part of the author's words. It also includes information in lists and tables that are not easily parsed as sentences.
The distinction is easier to maintain in some genres than in others. In plays before the late nineteenth century, for instance, the main text consists of all the words intended to be spoken by actors on the stage. Everything else is paratext. In the plays of Ibsen or Shaw, it is harder to decide whether or how stage directions are a form of paratext.
The distinction between 'main text' and paratext is established at the point of a text's ingestion into MONK and becomes part of its SIP or submission information package. In principle it is possible to change the distinction on a text-by-text basis. In practice, one will do it on a batch basis, using as the criterion a genre (plays) or a particular collection of texts.
Since MONK texts will overwhelmingly come in some version of TEI, the distinction between main and side text can be expressed in terms of elements that will count as one or the other. Paratext will always include the content of <front> and <back> elements.
It will also include some elements that occur inside the <body> element, in particular the following elements from these TEI modules:
Where count objects are precomputed, separate counts are kept for main text and paratext. The latter is by its nature a hodgepodge and unlikely to be an object of attention in itself. But users must have the option of performing their operations on main text, paratext, or "all text." Users will typically have the relevant knowledge to determine whether they want to filter out or include
The initial selection of side text elements will be a curatorial decision and will always be based on local knowledge of a particular collection or set of texts. Sir Walter Scott, for instance, wrote
Proposal about word and sentence level metadata (December 1, 2007)
Word level metadata are data that are kept about each token in a text. Punctuation marks count as separate word tokens. Problems of disambiguation arise with the apostrophe/single quote and the period. The period is part of a word token in abbreviations and decimal numbers. The single quote mark is part of the token when it acts as an apostrophe. A word token has the following properties or attributes:
1. A token address or corpus-wide unique identifier, which consists of a work identifier and a word counter
A lemma always belongs to a particular word class and is an abstract concept that bundles various inflected or orthographic forms of a word. In English the lemma is represented by the zero form of a word: the singular of a noun (love) or the present tense of a verb (love). A search for a lemma is therefore always a search for all inflectional and orthographic variants of a word.
For a variety of analytical purposes it is helpful to search for a combination of lemma and POS tag. A LemmaPOS is a particular inflected form of a word regardless of its orthographic form: 'loves', 'louyth', 'loueth', 'loveth' are or can be instances of the third person singular of the verb 'love' (love_vvz). A LemmaPOS is nearly always the same as a combination of a standardized spelling and POS tag (loves_n2 vs. loves_vvz).
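The token address and the lemma/LemmaPOS distinction can be illustrated with a small sketch. This is not the MONK data model: only the token address (work identifier plus word counter) comes from the attribute list above, while the field names, the "work:counter" address format, and the sample work identifiers are assumptions made for illustration. The tags vvz and n2 are the ones used in the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordToken:
    work_id: str   # hypothetical work identifier
    counter: int   # running word counter within the work
    spelling: str  # orthographic form as it appears in the text
    lemma: str     # abstract headword, e.g. "love"
    pos: str       # POS tag, e.g. "vvz" or "n2"

    @property
    def address(self) -> str:
        # Assumed format; the proposal only says "work identifier and word counter".
        return f"{self.work_id}:{self.counter}"

    @property
    def lemma_pos(self) -> str:
        return f"{self.lemma}_{self.pos}"


# Invented sample tokens: three spellings, two LemmaPOS values.
tokens = [
    WordToken("work-a", 17, "loveth", "love", "vvz"),
    WordToken("work-a", 52, "loves", "love", "n2"),
    WordToken("work-b", 8, "loues", "love", "vvz"),
]


def find_lemma_pos(tokens, key):
    """Return every token with the given LemmaPOS, whatever its spelling."""
    return [t for t in tokens if t.lemma_pos == key]


def find_lemma(tokens, lemma):
    """Return every inflectional and orthographic variant of a lemma."""
    return [t for t in tokens if t.lemma == lemma]
```

Searching for love_vvz with find_lemma_pos returns 'loveth' and 'loues' but not the noun 'loves', while find_lemma returns all three tokens.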
The exception arises with ambiguous spellings: leaves_n2 could refer to the LemmaPOS leaf_n2 or leave_n2.
Proposal about Search and Sort as the fundamental analytic (December 1, 2007)
In the context of scholarly text analysis in the humanities, the fundamental analytic is something we call Search and Sort, which is both like and unlike "Googling." Like Googling it is a "find" operation in which you enter some search term(s) and evaluate results. In Googling you want to identify the shortest list of top hits in the quickest time. This is not an atypical operation in scholarly inquiry. But Search and Sort also includes a different mode of operation where you assemble data by some combination of criteria and then work your way through them to look for patterns of various kinds. The concept of "top hits" is not especially relevant to this kind of inquiry, which is both iterative and ruminative.
The claim that Search and Sort is the fundamental analytic rests on at least four arguments:
A good Search and Sort implementation depends on the ability to
Search criteria fall into the broad categories of
Details about bibliographical metadata are spelled out in the Proposal for Metadata about works (October 19, 2007).
The key feature of Search and Sort is that the data retrieved in the first search step can be subsequently manipulated by any combination of the criteria available for the search in the first place. What a search returns to the user is a 'data frame' in a 'long data format' (to use terminology from Harald Baayen's Analyzing Linguistic Data), that is, a tabular representation in which every search criterion appears as a column. Such a data frame becomes the input for the MONK interface, but it may also be exported to third-party spreadsheets, statistical programs, or visualization tools, whether Excel, Minitab, or ManyEyes.
Partial models for Search and Sort in MONK are the search page of Philologic (http://www.lib.uchicago.edu/efts/ARTFL/philologic/), which is very strong on complex query formulation, and the Find Words feature of WordHoard (http://wordhoard.northwestern.edu), which is very strong on letting users group and sort search results in an iterative fashion.
The structure of the data frame for the initial return of search results varies with the size of the result set. If the number of hits is below some threshold (still to be determined but probably between 1,000 and 3,000), the data frame will return individual locations and KWIC information in the form of ~35 characters before and after each hit. If the hits exceed that threshold, the data frame will return aggregate information. Random sampling from large result sets is also a feature of Search and Sort.
The challenges of translating the requirements of Search and Sort into a user-friendly interface have been discussed with the Interface group at length and appear to be well understood. The Analytics
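The size-dependent return format described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the MONK implementation: the function name search, the column names, and the plain substring matching are invented here, and the threshold of 1,000 is just one value from the "probably between 1,000 and 3,000" range, which is still to be determined. The proposal also does not say what happens when the hit count equals the threshold; this sketch treats that case as aggregate.

```python
from collections import Counter

KWIC_WIDTH = 35       # "~35 characters before and after each hit"
HIT_THRESHOLD = 1000  # assumption: the actual threshold is still to be determined


def search(works, term, threshold=HIT_THRESHOLD, width=KWIC_WIDTH):
    """Search a dict of {work_id: text} for a literal term.

    Returns ("kwic", rows) with one row per hit when the result set is
    small, or ("aggregate", rows) with one row per work otherwise.
    """
    if not term:
        raise ValueError("search term must not be empty")
    rows = []
    for work_id, text in works.items():
        start = text.find(term)
        while start != -1:
            end = start + len(term)
            rows.append({
                "work": work_id,
                "offset": start,
                "left": text[max(0, start - width):start],
                "hit": term,
                "right": text[end:end + width],
            })
            start = text.find(term, end)
    if len(rows) < threshold:
        return "kwic", rows
    counts = Counter(row["work"] for row in rows)
    return "aggregate", [{"work": w, "hits": n} for w, n in sorted(counts.items())]


# Invented sample data.
works = {"w1": "I love thee, love", "w2": "no love lost"}
kind, rows = search(works, "love")
# The "sort" half: re-order the same rows by any column, here the right context.
by_right_context = sorted(rows, key=lambda row: row["right"])
```

Because every row is a flat record with one key per criterion, the same list can be re-sorted, grouped, sampled, or written out as CSV for a spreadsheet or statistics package without another search.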
Analytics Use Case Actor-Goal List
Repetition (SCHOLAR: Tanya)
The repetition analysis is seen as finding frequently occurring patterns in the text. A frequent pattern (the topic is described in Wikipedia under association rules) is a pattern (in our case a word or phrase) that occurs frequently in the data set (in our case the document set). The following document also provides a nice survey of frequent patterns: http://www.adrem.ua.ac.be/~goethals/software/survey.pdf
A definition: an item is an attribute-value combination, a word, or a phrase. Essentially, frequent pattern analysis compares every item to every other item. Algorithms have been optimized to cut off this comparison when certain conditions are met.
These frequent patterns can also be seen as a form of automated hypothesis generation, but the patterns need to be evaluated by the domain expert. Frequent pattern analysis generates lots and lots of patterns. Some are not interesting because they report known and/or common occurrences, but others may be novel.
There are several implementations of frequent pattern analysis in D2K. D2K modules are inserted into the workflow before frequent pattern analysis to control whether stop words are removed, whether stemmed words are used, and whether words or ngrams (phrases) are used. Actual counts of words are not used, but could be. At this time, a boolean representation indicates whether or not the word occurs in the document.
Since frequent pattern analysis generates so many rules, we can use a clustering approach to cluster similar patterns together. Given K clusters, patterns that have a common set of items are clustered together.
Sentimentality (SCHOLAR: Sara)
This use case is an evolution of the one used for the nora project. Sara will use the NoraVis tool to tag individual paragraphs as instances of sentimentality. Once this training set has been created, we will need to create a sparse matrix of word frequencies.
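As a rough illustration of such a matrix, the sketch below builds one row per paragraph and one column per distinct word, storing only the nonzero counts. It is a plain-Python stand-in, not the D2K data structure: the function name, the whitespace tokenization, and the sample paragraphs are all assumptions, and the classification step itself is not shown.

```python
from collections import Counter


def sparse_frequency_matrix(paragraphs):
    """Return (vocabulary, rows).

    vocabulary maps each distinct word to a column index; each row is a
    dict {column index: count} that omits words absent from the paragraph.
    """
    vocabulary = {}
    rows = []
    for text in paragraphs:
        # Assumption: lowercase and split on whitespace; real tokenization
        # would also treat punctuation marks as separate tokens.
        counts = Counter(text.lower().split())
        row = {}
        for word, n in counts.items():
            column = vocabulary.setdefault(word, len(vocabulary))
            row[column] = n
        rows.append(row)
    return vocabulary, rows


# Invented training paragraphs; in the use case the tags would come from NoraVis.
paragraphs = ["tears tears and sighs", "the ledger and the bill"]
vocabulary, rows = sparse_frequency_matrix(paragraphs)
```

Each row stores only the handful of words a paragraph actually contains, which is what keeps the matrix manageable when the vocabulary spans a whole corpus.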
The x-axis will consist of all individual word tokens in the target corpus; the y-axis will represent individual paragraphs with word frequencies. The classification will then be undertaken using the stock D2K itineraries for performing Naive Bayesian inference and Support Vector Machine analysis. (help me out here, guys)
Phone Conference Time and Dates
Tuesday at 13:00 on the following dates.
Callers in the Champaign-Urbana area should dial 217-244-7526. Toll free callers should dial 877-607-8976.
Phone Conference Minutes
Conference call, 2007 Apr. 4, Analytics |
| Document generated by Confluence on Apr 19, 2009 15:04 |