This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

2007/08/22 Analytics Cell Meeting Minutes

Present: Loretta Auvil, Phil Burns, Vered Goren, Martin Mueller, Stan Ruecker, Duane Searsmith, Sara Steger

We discussed Stan's list of eleven analytic categories as defined by the User Interface group. During the course of our discussion we combined categories to produce the following revised list of nine.

1. Find text chunks that are like the text chunks I like (theme, genre, and topic)

This requires the ability to mark up arbitrary swatches of text and assign an attribute and associated value(s) to that swatch. Such user-defined adornments can take different types of values (yes/no, ranks, numerical values). Once defined, one or more sets of such adornments can be selected for input to a classification algorithm such as a supervised learning method. The resulting classifier can then be applied to categorize other texts on the selected attributes.

Monk should provide at least the two algorithms already supported in Nora: Naive Bayes and Support Vector Machines. We need to define both default and optional behavior for each approach and determine a good way to present the options. As an improvement over Nora, Monk should be able to generate classifiers from multiple attributes (adornment classes).
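
As a rough illustration of the adornment-to-classifier workflow described above, the following is a minimal Naive Bayes sketch in Python. It is not the Nora or Monk implementation: the function names, the whitespace tokenization, the add-one smoothing, and the toy training chunks are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_chunks):
    """labelled_chunks: list of (text, label) pairs, i.e. user-adorned swatches."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of chunks
    vocab = set()
    for text, label in labelled_chunks:
        words = text.lower().split()     # naive whitespace tokenization (assumption)
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}

def classify(model, text):
    """Return the label with the highest log posterior, using add-one smoothing."""
    total_chunks = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, n_chunks in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_chunks / total_chunks)          # log prior
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical yes/no style adornment with two values.
training = [
    ("tears sighs weeping heart", "sentimental"),
    ("tender tears fond heart", "sentimental"),
    ("ledger account price market", "plain"),
    ("market price goods trade", "plain"),
]
model = train_naive_bayes(training)
print(classify(model, "her heart full of tears"))
```

Generating classifiers from multiple attributes could, under the same assumptions, amount to training one such model per adornment class.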

Should Monk also include unsupervised learning or statistical methods for building classifiers?

2. Search, filter, group and sort words (including concordance display)

There are at least three places where Monk should allow these activities.

  1. At the collection browser level;
  2. Inside texts in a work set;
  3. Against the output of other methods, e.g., classification results.

Monk should allow searching of the text alone, the text metadata alone, and the text and metadata jointly.

Monk should present aggregate displays of returned values, particularly for long lists. The WordHoard concordance display demonstrates one approach.
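
A keyword-in-context listing of the kind a concordance display builds on might be sketched as follows. This does not reproduce the WordHoard display; the function name `kwic`, the context width, and the punctuation handling are assumptions for illustration.

```python
def kwic(text, keyword, width=3):
    """Return (left context, matched token, right context) for each hit."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, ignoring punctuation attached to the token.
        if token.lower().strip('.,;:!?"\'') == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

for left, hit, right in kwic("To be or not to be, that is the question.", "be", width=2):
    print(f"{left:>10} | {hit} | {right}")
```

For long result lists, the rows could be grouped or counted by their left or right context before display.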

3. Compare texts or authors by finding distinctive vocabulary or showing patterns that involve parts of speech

Monk should allow searching both for emergent patterns and for user-defined patterns. An analysis is likely to involve an interaction of both search types. We need to define the syntax for such pattern searches.

Since an author can be viewed as a collection of that author's texts, these methods can also be used to profile the language used by an author. Such profiles, in combination with classification methods, can address authorship attribution questions.
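
One possible way to find distinctive vocabulary is a log-likelihood comparison of word frequencies between an analysis text and a reference text. The minutes do not name a statistic, so the choice of the two-term log-likelihood (G2) score here is an assumption, as are the function names and the whitespace tokenization.

```python
import math
from collections import Counter

def log_likelihood(a, b, n_a, n_b):
    """G2 for a word seen a times in n_a tokens and b times in n_b tokens."""
    expected_a = n_a * (a + b) / (n_a + n_b)
    expected_b = n_b * (a + b) / (n_a + n_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2

def distinctive_words(analysis_text, reference_text, top=5):
    """Rank words by G2; '+' marks relative overuse in the analysis text."""
    analysis = Counter(analysis_text.lower().split())
    reference = Counter(reference_text.lower().split())
    n_a, n_b = sum(analysis.values()), sum(reference.values())
    scored = []
    for word in set(analysis) | set(reference):
        a, b = analysis[word], reference[word]
        overused = a / n_a > b / n_b
        scored.append((log_likelihood(a, b, n_a, n_b), word, "+" if overused else "-"))
    return sorted(scored, reverse=True)[:top]

print(distinctive_words("love love love death", "war death death death"))
```

Treating an author as the concatenation of that author's texts, the same comparison could be run author against author.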

4. Show repetitions with variations within a text, or across texts

We decided to postpone further discussion of repetitions until Tanya can join us.

5. Build a social network map from text

Monk should display how objects (including named entities) act upon, or are acted upon by, other objects or entities. Several D2K modules produce such act-on/acted-on lists, which could be used to generate network diagrams. Does D2K also provide methods for recognizing and merging multiple references to the same entities?
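
To make the idea concrete, an act-on/acted-on list could be turned into weighted network edges roughly as follows. The triple format, the alias table for merging references to the same entity, and the sample data are all hypothetical; the actual output format of the D2K modules is not described here.

```python
from collections import Counter

def build_network(triples, aliases=None):
    """Count directed edges (actor -> target) from (actor, action, target) triples.

    aliases maps variant names to a canonical name, so that multiple
    references to the same entity end up on a single node.
    """
    aliases = aliases or {}
    edges = Counter()
    for actor, _action, target in triples:
        source = aliases.get(actor, actor)
        dest = aliases.get(target, target)
        edges[(source, dest)] += 1
    return edges

# Made-up triples for illustration only.
triples = [
    ("Emma", "visits", "Harriet"),
    ("Miss Woodhouse", "advises", "Harriet"),
    ("Mr. Knightley", "scolds", "Emma"),
]
edges = build_network(triples, aliases={"Miss Woodhouse": "Emma"})
for (source, dest), weight in edges.items():
    print(f"{source} -> {dest} (weight {weight})")
```

The edge weights could then drive line thickness or node placement in a network diagram.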

Aaron Coburn and colleagues at Middlebury College have an ongoing project using latent semantic indexing for producing relationship maps. We may want to look at their work.

6. Create a chronological timeline of language use or from bibliographical text

Monk should provide methods for analyzing language use longitudinally and presenting the results. This would inform studies of, for example, gender differences in language over the centuries, or of how clusters of words such as "liberty" and "slavery" or syntactic structures co-evolve over time. A range of dates is usually more useful than a point date when assigning a phenomenon to a time period.
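
The preference for date ranges over point dates could be handled along these lines: each text carries a start and end year, and its word count is spread evenly over the years in that range before being summed into periods. The even spreading, the 50-year period, and the sample records are assumptions for the sketch, not a decided design.

```python
from collections import defaultdict

def counts_by_period(texts, word, period=50):
    """texts: list of (start_year, end_year, text). Returns {period_start: count}."""
    totals = defaultdict(float)
    for start, end, text in texts:
        hits = text.lower().split().count(word)
        n_years = end - start + 1
        for year in range(start, end + 1):
            bucket = year // period * period     # e.g. 1590 -> 1550 for period=50
            totals[bucket] += hits / n_years     # spread hits evenly over the range
    return dict(sorted(totals.items()))

# Made-up records: one text dated to a range, one to a single year.
texts = [
    (1590, 1609, "liberty liberty"),
    (1650, 1650, "liberty slavery"),
]
result = counts_by_period(texts, "liberty")
print(result)
```

A text dated 1590 to 1609 then contributes half of its occurrences to each of the two periods it straddles, rather than all of them to one.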

7. Find collocated words

This includes "traditional" collocate measures, ways to extract "interesting" phrases and n-grams, and various types of multiword combinations of spellings, parts of speech, etc.
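
As one example of a "traditional" collocate measure, pointwise mutual information (PMI) over adjacent word pairs might be computed as below. The choice of PMI, the restriction to adjacent bigrams, and the minimum-count threshold are assumptions for illustration; other measures and window sizes would fit the same shape.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Return {(w1, w2): PMI in bits} for adjacent pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue                      # rare pairs give unreliable PMI values
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "new york is big and new york is old".split()
scores = bigram_pmi(tokens)
for pair, pmi in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(pmi, 2))
```

Replacing the word tokens with part-of-speech tags, or with word/tag pairs, would give the same kind of scores for other multiword combinations.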

Should parsed sentences be part of the base morphological data for Monk documents? If so, what type of parsing should we use? How would we represent this in the external version of a document?

Should the way a word is used in a sentence be allowed as a search/sort/filter or classification criterion?

It would be very useful if Monk allowed multiword combinations to be used as input to other analytic procedures in the same way as individual words.

8. Visualize sonic coloring

Metaphone and soundex can assign phonetic codes to words. It is also possible to assign levels of "darkness" or "lightness" to syllables or morphemes. These phonetic values can be displayed using color values and levels to produce a "sonic map" of a text. The phonetic values can also be used for sort/search/filter and classification. The same kind of map could be used for other types of researcher-supplied adornments, e.g., for displaying the degree of sentimentality in portions of a text.

We have to be cognizant of the pronunciation differences in words across the centuries. A word like "dance" is now usually pronounced with a "short" a (as in U.S. "pat") or "ah" (as in U.S. "pot"). In earlier times "dance" was probably pronounced with an "aw" sound (as in U.S. "paw"), as evidenced by the typical older spelling "daunce". Some now-silent letter combinations such as medial "gh" were pronounced in earlier times, and still are in some areas. It may be difficult to adjust modern phonetic algorithms to accommodate historical and dialectal pronunciations.
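
For reference, the standard American Soundex coding can be sketched in a few lines. This is an independent illustration, not MorphAdorner's code; the function name and the handling of non-letter characters are assumptions.

```python
# Consonant groups and their Soundex digits; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the four-character Soundex code, or '' if the word has no letters."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":          # h and w do not separate repeated codes
            previous = digit
    return (code + "000")[:4]       # pad with zeros, truncate to four characters

for word in ("Robert", "Rupert", "dance", "daunce"):
    print(word, soundex(word))
```

Because Soundex ignores vowels after the first letter, "dance" and "daunce" receive the same code, so the vowel shift described above is invisible to it. That is convenient for matching spelling variants but unhelpful for mapping historical pronunciation.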

Should phonetic codes be added to each word during initial document preparation? MorphAdorner includes the technology to do this, but currently does not. Should we make the choice of phonetic value user-configurable, treating it as just another researcher-supplied adornment?

9. Show geographical awareness visualization

In "real world" texts (e.g., news stories) geographic awareness can incorporate notions of actual physical location using standard global coordinates. This may be less useful in literature studies where the locations may not exist in reality. A scholar is probably more interested in how characters interact with fictive locations, and how characters perceive locations mentally. We may be able to address some of these questions using the Monk social network methods.

We decided to postpone further discussion until Steve can join us.
