This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Members

  • Bernie A'cs, NCSA
  • Loretta Auvil, UIUC
  • Tanya Clement, UMD
  • Vered Goren, UIUC
  • Martin Mueller, Northwestern, Chair
  • Brian Pytlik Zillig, Nebraska
  • Steve Ramsay, Nebraska
  • Sara Steger, Nebraska

Purpose and Scope

The Analytics Cell is responsible for what we are calling "data analytics" in MONK. We're using the term "analytics" to include text analysis broadly conceived, from traditional procedures (like concordance generation), to more sophisticated quantitative and statistical analyses (a la WordHoard) and text mining (a la nora).

There's a thin boundary separating analytics from issues of primary concern to the Interface Cell and the Uses and Users Cell – particularly since most of us believe that visualizations (the usual results of analytical procedures) are themselves interfaces, and that active use cases are what drives development at all levels. Still, we have established this cell as a distinct specialization within the overall project in the hope that participants pursuing various ideas in text analysis can compare notes and work together.

Approved Proposals

The following are proposals that have been reviewed and approved by the Analytics Cell. The claim is that the approved proposals spell out requirements with sufficient precision for other cells, especially Data and Interface, to implement them, leaving details to their discretion. If this is not the case, tell us about it right away.

Proposal for metadata about works (October 19, 2007)

The following metadata will be kept about each work. The list follows the list of metadata about works that Stéfan circulated after the October 16 call about metadata needed for the current workbench components.

  1. full title
  2. short title (explicit or computed)
  3. author
  4. size, expressed as character counts, word counts, and file size
  5. first circulation date range start
  6. first circulation date range end
  7. sex of the author
  8. author life date range start
  9. author life date range end
  10. place of origin
  11. major genre
  12. keywords (subject terms, when available or computed)
  13. sentence counts

It will be a matter for the Data Cell to determine the most efficient way of maintaining these metadata through some combination of the teiHeader, METS, MODS, or Dublin Core.

In a memo of October 22, Catherine suggests adding a unique "most likely date," which for many purposes may be better than an average of start and end date. I add this as a friendly amendment. MM 11-30-2007
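As a rough illustration of the record described above, here is a minimal Python sketch. The class name, field names, and the use of plain years for dates are assumptions made for the example. They are not decisions of the Data Cell, which may store the same information in the teiHeader, METS, MODS, or Dublin Core.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkMetadata:
    """One record per work. All names here are illustrative, not MONK's."""
    full_title: str
    short_title: str                          # explicit or computed
    author: str
    char_count: int                           # size as character count
    word_count: int                           # size as word count
    file_size_bytes: int                      # size as file size
    circulation_start: int                    # first circulation date range start (year)
    circulation_end: int                      # first circulation date range end (year)
    author_sex: Optional[str] = None
    author_birth_year: Optional[int] = None   # author life date range start
    author_death_year: Optional[int] = None   # author life date range end
    place_of_origin: Optional[str] = None
    major_genre: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    sentence_count: Optional[int] = None
    most_likely_date: Optional[int] = None    # the amendment suggested by Catherine

    def sort_date(self) -> float:
        """Prefer the most likely date; otherwise use the midpoint of the range."""
        if self.most_likely_date is not None:
            return float(self.most_likely_date)
        return (self.circulation_start + self.circulation_end) / 2


# Placeholder values only, not a real catalogue entry.
work = WorkMetadata("A Long Example Title", "Example", "Anon",
                    char_count=1000, word_count=200, file_size_bytes=1200,
                    circulation_start=1810, circulation_end=1820)
midpoint = work.sort_date()
work.most_likely_date = 1812
preferred = work.sort_date()
```

The sort_date helper shows why a single "most likely date" can be more useful than the average of start and end: it is used whenever it is present, and the average is only a fallback.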

Proposal about main text and paratext (October 17, 2007)

For a variety of purposes, many users will find it helpful to filter out certain kinds of 'paratext' from their searches and analyses. They prefer to see the movie without the trailers and credits. Main text consists of what is clearly part of the author's words from the naive reader's perspective. Paratext consists of what is ambiguously or clearly not part of the author's words. It also includes information in lists and tables that is not easily parsed as sentences.

The distinction is easier to maintain in some genres than in others. In plays before the late nineteenth century, for instance, the main text consists of all the words intended to be spoken by actors on the stage. Everything else is paratext. In the plays of Ibsen or Shaw, it is harder to decide whether or how stage directions are a form of paratext.

The distinction between 'main text' and paratext is established at the point of a text's ingestion into MONK and becomes part of its SIP or submission information package. In principle it is possible to change the distinction on a text by text basis. In practice, one will do it on a batch basis, using as the criterion a genre (plays) or a particular collection of texts.

Since MONK texts will overwhelmingly come in some version of TEI, the distinction between main text and paratext can be expressed in terms of elements that will count as one or the other.

Paratext will always include the content of <front> and <back> elements. It will also include some elements that occur inside the <body> element, in particular the following elements from these TEI modules:

  1. core: <add>, <bibl>, <head>, <item>, <label>, <list>, <note>, <ref>, <respStmt>, <speaker>, <stage>
  2. drama: <castGroup>, <castList>, <role>, <roleDesc>
  3. figures: <cell>, <figDesc>, <figure>, <row>, <table>
  4. textstructure: <byline>, <docAuthor>, <docDate>, <docEdition>, <docImprint>, <docTitle>, <epigraph>, <titlePage>, <titlePart>, <trailer>

Where count objects are precomputed, separate counts are kept for main text and paratext. The latter is by its nature a hodgepodge and unlikely to be an object of attention in itself. But users must have the option of performing their operations on main text, paratext, or "all text." Users will typically have the relevant knowledge to determine whether they want to filter out or include paratext.
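The element-based split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree, treats the element names listed above as a fixed set, and the function names and the sample fragment are invented for the example. A real ingestion step would make the set configurable per collection.

```python
import xml.etree.ElementTree as ET

# Element names from the proposal; <front> and <back> are always paratext.
PARATEXT_TAGS = {
    "front", "back",
    "add", "bibl", "head", "item", "label", "list", "note", "ref",
    "respStmt", "speaker", "stage",
    "castGroup", "castList", "role", "roleDesc",
    "cell", "figDesc", "figure", "row", "table",
    "byline", "docAuthor", "docDate", "docEdition", "docImprint",
    "docTitle", "epigraph", "titlePage", "titlePart", "trailer",
}


def local_name(tag):
    """Strip an XML namespace such as '{http://www.tei-c.org/ns/1.0}stage'."""
    return tag.rsplit("}", 1)[-1]


def split_text(elem, in_paratext=False, main=None, para=None):
    """Collect text chunks into main text and paratext, in document order."""
    if main is None:
        main, para = [], []
    # Once inside a paratext element, everything below it is paratext too.
    in_paratext = in_paratext or local_name(elem.tag) in PARATEXT_TAGS
    bucket = para if in_paratext else main
    if elem.text and elem.text.strip():
        bucket.append(elem.text.strip())
    for child in elem:
        split_text(child, in_paratext, main, para)
        # Text after a child's end tag belongs to the parent element.
        if child.tail and child.tail.strip():
            bucket.append(child.tail.strip())
    return main, para


# Invented sample fragment, not taken from a MONK text.
SAMPLE = (
    "<text><front><titlePage><docTitle>A Play</docTitle></titlePage></front>"
    "<body><sp><speaker>First Actor</speaker>"
    "<l>Words spoken <stage>aside</stage> on the stage.</l></sp></body></text>"
)
main_chunks, para_chunks = split_text(ET.fromstring(SAMPLE))
main_words = sum(len(chunk.split()) for chunk in main_chunks)
para_words = sum(len(chunk.split()) for chunk in para_chunks)
```

Keeping the two lists separate makes it straightforward to precompute separate counts and still offer an "all text" option by adding them together.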

The initial selection of paratext elements will be a curatorial decision and will always be based on local knowledge of a particular collection or set of texts. Sir Walter Scott, for instance, wrote voluminous notes for his historical novels. If you know this, you are likely to include the notes in the main text as "clearly part of the author's words from the naive reader's perspective." Similarly, you might decide that the stage directions of some authors are really part of the main text. Or you might still classify them as paratext because users can ignore the distinction.

Proposal about word and sentence level metadata (December 1, 2007)

Word level metadata are data that are kept about each token in a text. Punctuation marks count as separate word tokens.

Problems of disambiguation arise with the apostrophe/single quote and the period. The period is part of a word token in abbreviations and decimal numbers. The single quote mark is part of the token when it acts as an apostrophe.
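The two disambiguation rules can be illustrated with a small regular-expression tokenizer in Python. It is a sketch, not the tokenizer MONK will use. The abbreviation list is a hypothetical sample, and the apostrophe rule is deliberately narrow: only a single quote between two word characters is kept inside the token, so leading and trailing apostrophes ('tis, boys') are still split off and would need further rules.

```python
import re

# Hypothetical sample list; a real list would be much longer.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St.", "etc."}

TOKEN_RE = re.compile(
    r"""
    \d+\.\d+            # decimal number: the period stays inside the token
  | \w+(?:'\w+)*        # word, keeping a single quote that sits between letters
  | [^\w\s]             # any other punctuation mark is a token of its own
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into word tokens, with punctuation marks as separate tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:          # whitespace: skip one character
            pos += 1
            continue
        token, end = match.group(), match.end()
        # A following period is part of the token only for known abbreviations.
        if text[end:end + 1] == "." and token + "." in ABBREVIATIONS:
            token += "."
            end += 1
        tokens.append(token)
        pos = end
    return tokens


tokens = tokenize("Mr. Smith paid 3.5 pounds. He said 'no' and didn't stay.")
```

In this example the periods in "Mr." and "3.5" stay inside their tokens, the sentence-final periods become separate tokens, the quote marks around "no" are separate tokens, and the apostrophe in "didn't" stays inside the word.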

A word token has the following properties or attributes:

1. A token address or corpus-wide unique identifier, which consists of a work identifier and a word counter
2. The spelling that occupies the token address
3. A part-of-speech tag
4. The standard spelling of the spelling at the token address
5. The lemma associated with the spelling and POS tag at the token address

A lemma always belongs to a particular word class and is an abstract concept that bundles various inflected or orthographic forms of a word. In English the lemma is represented by the zero form of a word, the singular of a noun (love) or the present tense of a verb (love). A search for a lemma is therefore always a search for all inflectional and orthographic variants of a word.

For a variety of analytical purposes it is helpful to search for a combination of lemma and POS tag. A LemmaPOS is a particular inflected form of a word regardless of its orthographic form: 'loves', 'louyth', 'loueth', 'loveth' are or can be instances of the third person singular of the verb 'love' (love_vvz). A LemmaPOS is nearly always the same as a combination of a standardized spelling and POS tag (loves_n2 vs. loves_vvz). But leaves_n2 could refer to the LemmaPOS leaf_n2 or leave_n2.
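The five token properties and the LemmaPOS combination can be sketched as a small Python record. The class name, the field names, the address format (work identifier, hyphen, word counter), and the work identifiers "w1" and "w2" are assumptions for illustration only. The spellings and tags are the ones used in the examples above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    work_id: str            # work identifier (part of the token address)
    word_counter: int       # word counter within the work
    spelling: str           # spelling that occupies the token address
    pos: str                # part-of-speech tag
    standard_spelling: str  # standard spelling of that spelling
    lemma: str              # lemma for this spelling and POS tag

    @property
    def address(self):
        """Corpus-wide unique identifier; the exact format is an assumption."""
        return f"{self.work_id}-{self.word_counter}"

    @property
    def lemma_pos(self):
        return f"{self.lemma}_{self.pos}"

    @property
    def standard_pos(self):
        return f"{self.standard_spelling}_{self.pos}"


tokens = [
    Token("w1", 1, "loves", "vvz", "loves", "love"),
    Token("w1", 2, "loueth", "vvz", "loves", "love"),
    Token("w2", 1, "loveth", "vvz", "loves", "love"),
    Token("w2", 2, "leaves", "n2", "leaves", "leaf"),
    Token("w2", 3, "leaves", "n2", "leaves", "leave"),
]

# Group spellings by LemmaPOS, and LemmaPOS values by standard spelling + POS.
spellings_by_lemma_pos = defaultdict(list)
lemma_pos_by_standard = defaultdict(set)
for tok in tokens:
    spellings_by_lemma_pos[tok.lemma_pos].append(tok.spelling)
    lemma_pos_by_standard[tok.standard_pos].add(tok.lemma_pos)
```

The second grouping shows the exception noted above: loves_vvz maps to a single LemmaPOS, while leaves_n2 maps to both leaf_n2 and leave_n2.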

Proposal about Search and Sort as the fundamental analytic (December 1, 2007)

In the context of scholarly text analysis in the humanities, the fundamental analytic is something we call Search and Sort, which is both like and unlike "Googling." Like Googling it is a "find" operation in which you enter some search term(s) and evaluate results. In Googling you want to identify the shortest list of top hits in the quickest time. This is not an untypical operation in scholarly inquiry. But Search and Sort also includes a different mode of operation where you assemble data by some combination of criteria and then work your way through them to look for patterns of various kinds. The concept of "top hits" is not especially relevant to this kind of inquiry, which is both iterative and ruminative.

The claim that Search and Sort is the fundamental analytic rests on at least four arguments:

  1. Whatever else users will do, they will all use Search and Sort as an important tool.
  2. Many users will be satisfied with relatively straightforward find operations. This may not excite the developer as a design challenge, but it is all-important to the user.
  3. Sophisticated users will use combinations of regular expression and metadata searches for exploratory data analysis.
  4. The results of 'aggregate analytics' such as Naive Bayes or other text mining routines will in nearly all cases require the detailed analysis of the manner in which particular features or criteria contribute to a statistical result. This cannot be done without sophisticated Search and Sort routines.

A good Search and Sort implementation depends on the ability to

  1. formulate search criteria based on arbitrary combinations of search terms in the text as well as in the metadata
  2. group and sort the results by arbitrary combinations of the same search terms

Search criteria fall into the broad categories of

  1. regular expression searches
  2. bibliographical metadata about the work as a whole
  3. linguistic metadata generated by part-of-speech tagging and lemmatization
  4. frequency and distributional data created in the act of linguistic annotation and aggregated appropriately
  5. structural metadata about works.

Details about bibliographical metadata are spelled out in the Proposal for metadata about works (October 19, 2007).

The key feature of Search and Sort consists in the fact that the data retrieved in the first search step can be subsequently manipulated by any combination of the criteria available for the search in the first place.

What a search returns to the user is a 'data frame' in a 'long data format' (to use terminology from Harald Baayen's Analyzing Linguistic Data): a tabular representation in which every search criterion appears as a column. Such a data frame becomes the input for the MONK interface, but it may also be exported to third-party spreadsheets, statistical programs, or visualization tools, whether Excel, Minitab, or ManyEyes.
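A long-format data frame and the group-and-sort step can be sketched with the Python standard library alone. The column names and the four rows are invented for the example, and the helper names are hypothetical. The point is only that each hit is one row, each criterion is one column, and any combination of columns can be used to regroup and re-sort the same rows.

```python
import csv
import io
from collections import Counter

# One row per hit; every search criterion is a column (illustrative values).
rows = [
    {"work": "w1", "author_sex": "F", "year": 1813, "lemma": "love", "pos": "vvz"},
    {"work": "w1", "author_sex": "F", "year": 1813, "lemma": "love", "pos": "n1"},
    {"work": "w2", "author_sex": "M", "year": 1850, "lemma": "love", "pos": "vvz"},
    {"work": "w2", "author_sex": "M", "year": 1850, "lemma": "love", "pos": "vvz"},
]


def group_count(rows, *columns):
    """Count hits per combination of columns; largest groups come first."""
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_csv(rows):
    """Serialize the data frame for spreadsheets or statistical programs."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


by_pos = group_count(rows, "pos")
by_work_and_pos = group_count(rows, "work", "pos")
exported = to_csv(rows)
```

Because grouping works on column names, the same function serves the first search and every later regrouping, which is the iterative behaviour described above.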

Partial models for Search and Sort in MONK are the search page of Philologic (http://www.lib.uchicago.edu/efts/ARTFL/philologic/), which is very strong on complex query formulation, and the Find Words feature of WordHoard (http://wordhoard.northwestern.edu), which is very strong on letting users group and sort search results in an iterative fashion.

The structure of the data frame for the initial return of search results varies with the size of the result set. If the number of hits is below some threshold (still to be determined but probably between 1,000 and 3,000), the data frame will return individual locations and KWIC information in the form of ~35 characters before and after each hit. If the hits exceed that threshold, the data frame will return aggregate information.
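The threshold rule can be sketched as follows. The threshold value of 2,000 is a placeholder inside the range mentioned above, not a decision, and the function name, the row layout, and the choice to aggregate per work are assumptions. A hit count exactly equal to the threshold is treated here as "large", since the proposal does not say which way that case falls.

```python
import re
from collections import Counter

HIT_THRESHOLD = 2000    # placeholder; the real value is still to be determined
CONTEXT_CHARS = 35      # characters of context before and after each hit


def search(works, pattern, threshold=HIT_THRESHOLD, context=CONTEXT_CHARS):
    """works maps a work identifier to its text; returns a list of rows."""
    hits = [(work_id, match)
            for work_id, text in works.items()
            for match in re.finditer(pattern, text)]
    if len(hits) < threshold:
        # Small result set: individual locations with KWIC context.
        return [
            {
                "work": work_id,
                "offset": match.start(),
                "left": works[work_id][max(0, match.start() - context):match.start()],
                "hit": match.group(),
                "right": works[work_id][match.end():match.end() + context],
            }
            for work_id, match in hits
        ]
    # Large result set: aggregate information only (here, hits per work).
    counts = Counter(work_id for work_id, _ in hits)
    return [{"work": work_id, "hits": n} for work_id, n in sorted(counts.items())]


sample_works = {"w1": "love and love again", "w2": "no love lost"}
kwic_rows = search(sample_works, r"\blove\b")
aggregate_rows = search(sample_works, r"\blove\b", threshold=2)
```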

Random sampling from large result sets is also a feature of Search and Sort.
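One possible shape for the sampling step, as a Python sketch: draw a fixed number of hits without replacement, keep them in their original order so that the sample still reads like a result list, and accept an optional seed so that a sample can be reproduced. The function name and the seed parameter are assumptions.

```python
import random


def sample_hits(hits, k, seed=None):
    """Return k randomly chosen hits, preserving their original order."""
    if k >= len(hits):
        return list(hits)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(hits)), k))
    return [hits[i] for i in chosen]


all_hits = list(range(100))          # stand-in for a large result set
first_sample = sample_hits(all_hits, 10, seed=1)
second_sample = sample_hits(all_hits, 10, seed=1)
```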

The challenges of translating the requirements of Search and Sort into a user-friendly interface have been discussed with the Interface group at length and appear to be well understood.

The Analytics

  1. Determining what is in the collection
  2. Searching and Sorting
    • address regular expression queries to the tokens, whether
      considered as text or POS tags
    • address SQL-like queries (XQuery?) to the metadata considered as
      parameters for queries
    • use group-filter-and-sort routines to return result sets, where
      sorting includes:
      1. word before
      2. word after
      3. text
      4. date
  3. Statistical routines that perform
    • text classification
    • cluster analysis
    • nearest neighbor analysis (how different from b?)
    • over and under use of features in a comparative analysis (Dunning, chi-square)
  4. Determining syntactic or lexical patterns consisting of multiword
    expressions.
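For the over- and under-use item in the list above, the Dunning log-likelihood statistic (G2) can be computed from four numbers: the count of a feature in an analysis corpus, the size of that corpus, and the same two numbers for a reference corpus. The Python sketch below uses the full two-by-two contingency table. The function and parameter names are ours, and the figures in the example are invented.

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """Dunning's G2 for one feature in corpus A compared with corpus B."""
    observed = [
        [count_a, size_a - count_a],   # corpus A: feature, everything else
        [count_b, size_b - count_b],   # corpus B: feature, everything else
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - count_a - count_b]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            expected = row_totals[i] * col_totals[j] / total
            if o > 0:                  # a zero cell contributes nothing
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def relative_use(count_a, size_a, count_b, size_b):
    """G2 has no sign, so report the direction of the difference separately."""
    rate_a, rate_b = count_a / size_a, count_b / size_b
    if rate_a > rate_b:
        return "over"
    if rate_a < rate_b:
        return "under"
    return "equal"


same_rate = log_likelihood(10, 100, 20, 200)      # 10% in both corpora
different_rate = log_likelihood(30, 100, 20, 200)
direction = relative_use(30, 100, 20, 200)
```

When the relative frequencies are identical, the statistic is zero. Larger values indicate a difference that is less likely to be due to chance, and the direction has to be read off the relative frequencies.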

Analytics Use Case Actor-Goal List

Repetition (SCHOLAR: Tanya)

Repetition analysis is seen as finding frequently occurring patterns in the text. A frequent pattern is a pattern (in our case a word or phrase) that occurs frequently in the data set (in our case the document set). Frequent patterns and association rules are described in Wikipedia.

The following document also provides a nice survey of frequent patterns. http://www.adrem.ua.ac.be/~goethals/software/survey.pdf
Jiawei Han (UIUC CS Professor) has written a book on data mining techniques and his slide sets for each chapter are online. See http://www-sal.cs.uiuc.edu/~hanj/bk2.

A definition: an item is an attribute-value combination, a word, or a phrase.

Essentially, frequent pattern analysis compares every item to every other item. Algorithms have been optimized to cut off this comparison when certain conditions are met. These frequent patterns can also be seen as a form of automated hypothesis generation, but the patterns need to be evaluated by the domain expert. Frequent pattern analysis generates lots and lots of patterns. Some are not interesting because they report known or common occurrences; others may turn out to be novel.

There are several implementations of frequent pattern analysis in D2K.

D2K modules inserted into the workflow before frequent pattern analysis control whether stop words are removed, whether stemmed words are used, and whether words or n-grams (phrases) are used. Actual counts of words are not used, but could be. At this time, a boolean representation indicates whether or not the word occurs in the document.
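To make the idea concrete, here is a small level-wise (Apriori-style) sketch in Python. It is not the D2K implementation. It assumes the boolean representation just described (each document is reduced to the set of words it contains, after optional stop word removal) and uses the standard cutoff that a larger pattern is only counted if all of its sub-patterns are already frequent. The function names, the sample texts, and the support value are invented.

```python
from itertools import combinations


def to_items(text, stop_words=frozenset()):
    """Boolean representation: the set of words that occur in the document."""
    return {word for word in text.lower().split() if word not in stop_words}


def frequent_itemsets(documents, min_support):
    """Return {itemset: document count} for itemsets with enough support."""
    n = len(documents)
    items = sorted({item for doc in documents for item in doc})
    current = [frozenset([item]) for item in items]
    result = {}
    size = 1
    while current:
        counts = {c: sum(1 for doc in documents if c <= doc) for c in current}
        frequent = {c: k for c, k in counts.items() if k / n >= min_support}
        result.update(frequent)
        # Build candidates one item larger; drop any candidate that has an
        # infrequent subset, since it cannot be frequent itself.
        size += 1
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == size and all(
                frozenset(subset) in frequent
                for subset in combinations(union, size - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return result


texts = ["love heart tears", "the love heart", "love tears", "heart money"]
docs = [to_items(text, stop_words={"the"}) for text in texts]
patterns = frequent_itemsets(docs, min_support=0.5)
```

With four documents and a minimum support of 0.5, a pattern must occur in at least two documents. The pair (heart, tears) occurs only once, so it is dropped, and the triple (love, heart, tears) is never counted at all.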

Since frequent pattern analysis generates so many rules, we can use a clustering approach to cluster similar patterns together. Given K clusters, patterns that have a common set of items are clustered together.

Sentimentality (SCHOLAR: Sara)

This use case is an evolution of the one used for the nora project. Sara will use the NoraVis tool to tag individual paragraphs as instances of sentimentality. Once this training set has been created, we will need to create a sparse matrix of word frequencies. The x-axis will consist of all individual word tokens in the target corpus; the y-axis will represent individual paragraphs with word frequencies. The classification will then be undertaken using the stock D2K itineraries for performing Naive Bayesian inference and Support Vector Machine analysis.
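The sparse matrix can be pictured with a short Python sketch. It only builds the matrix and leaves classification to the D2K itineraries. Splitting on whitespace, lowercasing, the function name, and the two sample paragraphs are assumptions for illustration. Each paragraph becomes one row that stores only the words it actually contains, keyed by column number.

```python
from collections import Counter


def build_sparse_matrix(paragraphs):
    """Return (vocabulary, rows): word -> column index, and one dict per paragraph."""
    vocabulary = {}
    rows = []
    for paragraph in paragraphs:
        counts = Counter(paragraph.lower().split())
        row = {}
        for word, frequency in counts.items():
            column = vocabulary.setdefault(word, len(vocabulary))
            row[column] = frequency      # zero entries are simply not stored
        rows.append(row)
    return vocabulary, rows


# Invented sample paragraphs; the real rows would be the tagged paragraphs.
vocabulary, rows = build_sparse_matrix(["tears tears and sighs", "and money"])
```

The training labels produced by the tagging step would be kept in a parallel list, one label per row.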

(help me out here, guys)

Phone Conference Time and Dates

Tuesday at 13:00 on the following dates.

  • September 11, 25
  • October 9, 23
  • November 6, 20
  • December 4, 18

Callers in the Champaign-Urbana area should dial 217-244-7526. Toll free callers should dial 877-607-8976.

Phone Conference Minutes

Conference call, 2007 Apr. 4, Analytics
Conference call, 2007 Apr. 18, Analytics
Conference call, 2007 May 2, Analytics
Conference call, 2007 July 18, Analytics
Conference call, 2007 July 25, Analytics
Conference call, 2007 Aug. 1, Analytics
Conference call, 2007 Aug. 22, Analytics
Conference call, 2007 Sept. 11, Analytics
Conference call, 2007 Sept. 25, Analytics
Conference call, 2007 Nov. 6, Analytics
Conference call, 2007 Dec. 4, Analytics

Document generated by Confluence on Apr 19, 2009 15:04