This page last changed on Dec 13, 2007 by plaisant@cs.umd.edu.

(Update edited by Catherine from Tanya's emails)

The perspective on repetition afforded by text mining and visualization software within the MONK project gave me an opportunity to reread The Making of Americans as a purposefully and intricately structured text. The software has facilitated the "distant reading" that provides this new perspective. Using the project's D2K software, Loretta generated lists of thousands of recurring patterns from the text. Since each slight variation generates a new pattern, the resulting list of thousands of patterns is difficult to understand. To bring these results into focus, I worked with Catherine's team on FeatureLens to visualize those patterns more coherently. One analysis produced a subset of co-occurring patterns that enabled the discovery of two sections, one from chapter 4 (¶1726-27) and one from chapter 5 (¶1823-24), that share 495 words. The loss and the subsequent discovery of these paragraphs as they play out across the chaos of repetitions pique our interest. They raise the question of what else our attempts at close reading an unreadable text may have missed.

Next, I worked with Martin to look at a different set of repeated phrases derived from the same text. Each item in the database is an independently recurring string within the text (including characters and punctuation) that occurs at a specific location. For example, if the four-token string "abcd" occurs twice in the text (once in chapter 1 and once in chapter 9), it appears twice in the database, with two different locations but the same ID. The count is based on a closed set, so substrings of the two instances of "abcd" are not included in the database. However, if one of the substrings also occurs independently elsewhere in the text (e.g., "ab" occurs one other time, in chapter 4), the database includes the two instances of "abcd" and the three instances of "ab", the latter under a new ID. This data allowed me to create visualizations (2D and 3D scatterplots) in Spotfire, a commercial application provided by Catherine. As a result, a significant and previously undiscovered observation can be made: the longer and less frequent repetitions are local to the first half of the text, while the shorter, more frequent repetitions run across the text as a whole. This work opens an alternative direction or perspective for reading the text.
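
The "closed set" rule described above can be illustrated with a small sketch. This is not the procedure used to build the actual database; it is a minimal illustration that assumes whitespace-separated tokens and treats a repeated sequence as independent when no one-token extension of it occurs equally often. The function names closed_repeats and build_rows are hypothetical, and the quadratic enumeration is only practical for short examples.

```python
from collections import defaultdict


def closed_repeats(tokens, min_count=2):
    """Map each independently recurring token sequence to its start positions."""
    n = len(tokens)
    # Record the start position of every contiguous sequence of every length.
    positions = defaultdict(list)
    for start in range(n):
        for end in range(start + 1, n + 1):
            positions[tuple(tokens[start:end])].append(start)

    def count(start, end):
        return len(positions.get(tuple(tokens[start:end]), ()))

    repeats = {}
    for gram, starts in positions.items():
        if len(starts) < min_count:
            continue
        # Drop the sequence if a one-token extension (left or right)
        # occurs just as often: it never appears on its own.
        absorbed = False
        for start in starts:
            end = start + len(gram)
            if start > 0 and count(start - 1, end) == len(starts):
                absorbed = True
            if end < n and count(start, end + 1) == len(starts):
                absorbed = True
        if not absorbed:
            repeats[gram] = starts
    return repeats


def build_rows(tokens):
    """Return (id, location, text) rows; instances of one string share an ID."""
    repeats = closed_repeats(tokens)
    ordered = sorted(repeats, key=lambda gram: (-len(gram), gram))
    rows = []
    for repeat_id, gram in enumerate(ordered, start=1):
        for start in repeats[gram]:
            rows.append((repeat_id, start, " ".join(gram)))
    return rows


# Hypothetical toy text: "a b c d" occurs twice, "a b" once more on its own.
for row in build_rows("a b c d x a b y a b c d".split()):
    print(row)
```

In this toy text the sketch keeps the two instances of "a b c d" under one ID and the three instances of "a b" under another, while substrings such as "b c d" that never occur independently are left out, which mirrors the "abcd" and "ab" example above.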

Since then I have been working on writing my thesis and a paper. 

Papers

Clement, T. "'A thing not beginning or ending': Using Digital Tools to Distant-Read Gertrude Stein's The Making of Americans." Literary and Linguistic Computing 23.1 (Invited, April 2008).
Don, A.; Zheleva, E.; Gregory, M.; Tarkan, S.; Auvil, L.; Clement, T.; Shneiderman, B.; & Plaisant, C. "Discovering interesting usage patterns in text collections: Integrating text mining with visualization." Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. New York: ACM Press, 2007: 213-222.

Expectations or hopes for short-term MONK features

I would like to pursue the other questions in my user case study (i.e., those that involve comparing MoA and Three Lives to the NCS data with text mining). I would also like to see the following changes made in FeatureLens:

1. Repetitions that have an ID number identifying a "single" unit, much like Mueller's database of repetitions.

2. A visualization that incorporates clustering.

3. A scatterplot overview of the patterns.

Document generated by Confluence on Apr 19, 2009 15:05