|
This page last changed on Apr 29, 2008 by tclement@wam.umd.edu.
Use case summary updated 4.29.2008
My next step in analyzing The Making of Americans (hereafter Making) within this developmental phase of MONK is to compare patterns in Making to patterns within other texts across MONK collections such as NCF and EAF. Now that the MONK infrastructure is further in development, this comparison can be done by choosing features or decision criteria for classification that may be counted and compared across the many texts. Specifically, I will compare Making's style with the styles of other British and American nineteenth-century novels in order to look more closely at Stein's assertion that, as a modernist, she was developing characters in a dramatically different way than were popular nineteenth-century novelists. Comparing patterns within novels by Charles Dickens, Jane Austen, or George Eliot in NCF or by Harriet Beecher Stowe, Louisa May Alcott, or Fenimore Cooper in EAF could provide evidence for or against Stein's contention that nineteenth-century novels were about "characters" and twentieth-century novels were more about "form." This method for discovery will begin by using D2K to extract named entities (character names) and parts of speech (such as noun phrases) from Making and from two extremely popular texts with popular characters, Old Curiosity Shop and Uncle Tom's Cabin, drawn from the NCF and EAF collections. Once these features (character names and various parts of speech) have been found as they are co-located within sentences, we will use D2K to find and cluster frequent patterns so that I can further classify and label these patterns for text mining across other texts. This analysis may underscore or undermine Stein's contention about the difference between her novel about "form" and nineteenth-century novels about characters like Little Nell and Uncle Tom.
Developing an analysis like this would be valuable for literary studies by providing perspective on various styles of character development, but the process by which we will extract and visualize the data is important as well for the continued development of MONK and other large-scale projects that propose to do text mining analysis on large collections of literary texts. The idea that parts of speech can be used to discriminate between authors is well established. John Burrows used multivariate statistical techniques (such as principal component analysis and probability distribution tests) in the 1980s to examine Jane Austen's style through words like "in" and "it" and "of." Harald Baayen's approach in Analyzing Linguistic Data (2008) relies on using part-of-speech trigrams as a factor in authorship attribution. Digital humanities scholars have used computational analysis to ask the question "Was Text A written by author X or author Y?", a question not far removed from asking "How does author X differ from author Y?" (the main question of this use case). Now, because of both the sheer number of digital texts encoded and available and the extensive processing power of applications like D2K, mining for styles could extend well beyond an author's oeuvre, her genre, or her century.
The process by which we could perform these analyses, however, remains essentially untested. Ideally, we could extract parts of speech from Making together with the named entities co-located with them in each sentence or paragraph; then, using a frequent pattern analysis algorithm and a Naive Bayes classifier, we could attempt to find which patterns in tens or hundreds of other texts are like or unlike these. Yet this "simple" process is complicated by the fact that the data returned from each step, including, but not limited to, the extraction of "dirty" (or unedited) named entities, would require an iterative approach that allows the user to manage, correct, or label what would be large amounts of data. Thus far, MONK has produced a social network representation as a proof-of-concept application for unsupervised learning (clustering) based on named entity extraction; The Making of Americans appears in this investigation. As part of my case study, further development is underway to use SocialAction, a social network/clustering tool created by Ben Shneiderman and Adam Perer at HCIL. In collaboration with Romain Vuillemot (also at HCIL), who will create and design the augmentations, I will investigate how the D2K frequent pattern analysis may be visualized in such a way that its results are comprehensible to the user. This development will include visualizing social networks over the evolution of a text, using the names as nodes and the features (the part-of-speech patterns) to determine relationships between nodes. An interface with multiple views will be incorporated in order to facilitate comparing "snapshots" across texts (such as the same data from Uncle Tom's Cabin or Old Curiosity Shop). Again, my case study represents a crucial step in developing a process for analyzing style across multiple texts and will prove, as did FeatureLens, to be an integral part of the future MONK interface.
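To make the classification step described above more concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over part-of-speech patterns, written in plain Python. It is not D2K and does not reflect how D2K is implemented; the pattern strings, the labels "Stein" and "NCF", and the function names `train` and `predict` are all hypothetical, and the tiny training set is hand-made purely for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, list_of_pattern_strings). Returns the counts the classifier needs."""
    class_counts = Counter()            # number of training passages per label
    feat_counts = defaultdict(Counter)  # pattern counts per label
    vocab = set()                       # every pattern seen in training
    for label, feats in docs:
        class_counts[label] += 1
        feat_counts[label].update(feats)
        vocab.update(feats)
    return class_counts, feat_counts, vocab

def predict(model, feats):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, feat_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total = sum(feat_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for f in feats:
            if f in vocab:  # patterns never seen in training are ignored
                score += math.log((feat_counts[label][f] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, hand-made training passages: each is a bag of POS patterns
# found in sentences that contain a character name ("NP").
training = [
    ("Stein", ["NP V ADV", "NP V ADV", "PRON V ADV"]),
    ("Stein", ["NP V ADV", "PRON V V"]),
    ("NCF", ["DET ADJ NP", "NP V DET"]),
    ("NCF", ["DET ADJ NP", "ADJ NP V"]),
]
model = train(training)
print(predict(model, ["NP V ADV", "PRON V V"]))
print(predict(model, ["DET ADJ NP"]))
```

In practice the labeled passages would come out of the iterative correct-and-label step described above, which is exactly the part of the process this sketch leaves out.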
Use case summary old:
Stein believed that she was writing something "new" with MoA and
Three Lives (most modernists did) in comparison to what was being
written in the 19th century by Dickens, George Eliot, Austen,
Collins, etc. One thing in particular that she talks about is
character development. So, essentially, I am wondering if I could
do some sort of comparison of how characters are developed between
The Making of Americans and Three Lives and select texts from NCF. The question for the analytics
group might be what this comparison could entail--what would the
Bayesian/SVM analysis be run on? One thing that Bill and I discussed
was possibly looking at patterns of "major word classes" (a
WordHoard term for an expanded notion of parts of speech)
surrounding proper names, getting the frequencies and distributions
and then running that through a datamining analysis to find
clusters or like patterns, etc.
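As a rough illustration of what "patterns of major word classes surrounding proper names" might mean in practice, the sketch below counts the word-class sequence immediately around each proper name in a pre-tagged token list. The tag names ("NP", "V", "ADV", "PUNC"), the toy sentences, and the function name `name_contexts` are assumptions made for the example; they are not WordHoard's actual tag set or output.

```python
from collections import Counter

# Hypothetical pre-tagged tokens: (word, word_class). "NP" marks a proper name.
tagged = [
    ("Then", "ADV"), ("Martha", "NP"), ("was", "V"), ("telling", "V"), (".", "PUNC"),
    ("Then", "ADV"), ("Alfred", "NP"), ("was", "V"), ("listening", "V"), (".", "PUNC"),
]

def name_contexts(tokens, name_tag="NP", width=1):
    """Count the tag sequences spanning `width` tags on either side of each name."""
    # Pad both ends so that names at the very start or end still get a full window.
    tags = ["<s>"] * width + [tag for _, tag in tokens] + ["</s>"] * width
    counts = Counter()
    for i, tag in enumerate(tags):
        if tag == name_tag:
            counts[tuple(tags[i - width : i + width + 1])] += 1
    return counts

contexts = name_contexts(tagged)
for pattern, freq in contexts.most_common():
    print(" ".join(pattern), freq)
```

The resulting frequency table is the kind of input a clustering or classification step could then be run on.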
Response from Martin interspersed with further replies from Tanya:
What you say intersects with two issues that I have been thinking
about and would like to raise in the Analytics cell: tag ngrams and
text samples.
If I understand you correctly you want to get at questions of
changing representations of characters (or character change) by
looking at syntactic patterns that contain name tags.
**Yes . . .
The idea that
POS ngrams can be used to discriminate between texts is well
established. In Baayen's forthcoming book on Analyzing Linguistic
Data there is an extended discussion of tag trigrams as a factor in
authorship attribution. The question "how does author X differ from
author Y" is often more interesting than the question "Was Text A
written by author X or author Y?" But it is a very similar question.
Now n-grams are messy because there are so many of them, especially
if you want to have n-grams of varying length. But what about text
samples? Imagine taking 100 500-word samples from Stein. Then you
take 100 500-word samples each from novels of the 1870s, 1880s, and
1890s: 200,000 words altogether.
**Would these samples have any relationship with names or would they be random?
Now you extract various n-grams from
them, say bigrams, trigrams, pentagrams, and heptagrams. You keep the
patterns that are sufficiently frequent to support analysis.
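The sampling-and-extraction idea above can be sketched as follows. This is only an illustration under stated assumptions: the tag stream is randomly generated stand-in data, the frequency threshold `min_count=5` is arbitrary, samples are allowed to overlap, and the function names are hypothetical. The n-gram sizes 2, 3, 5, and 7 follow the bigrams, trigrams, pentagrams, and heptagrams mentioned in the message.

```python
import random
from collections import Counter

def sample_windows(tags, n_samples, sample_len, rng):
    """Draw fixed-length runs of consecutive tags at random start positions (may overlap)."""
    starts = [rng.randrange(len(tags) - sample_len + 1) for _ in range(n_samples)]
    return [tags[s : s + sample_len] for s in starts]

def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i : i + n]) for i in range(len(seq) - n + 1)]

def frequent_ngrams(samples, sizes=(2, 3, 5, 7), min_count=5):
    """Count n-grams of several lengths and keep those seen at least min_count times."""
    counts = Counter()
    for sample in samples:
        for n in sizes:
            counts.update(ngrams(sample, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}

rng = random.Random(0)
# Stand-in for one tagged text; a real run would use tags from an actual novel.
toy_tags = [rng.choice(["NP", "V", "ADV", "DET", "N"]) for _ in range(5000)]
samples = sample_windows(toy_tags, n_samples=100, sample_len=500, rng=rng)
kept = frequent_ngrams(samples)

# Four groups (Stein plus three decades) of 100 samples of 500 words each.
total_words = 4 * 100 * 500
print(len(samples), len(kept), total_words)
```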
As a next step, you could ask a) whether there are any interesting
differences between the NCF groups and b) whether patterns that
involve names stand out in any way.
**Okay, so what you are saying is that for my above question: NO, the samples are random and may or may not have names involved. In the analysis result, however, we see where the names occur and what happens around them.
Finally, you turn to Stein and
look at the differences with the NCF texts both with regard to
general patterns and patterns that involve names.
**Here's where things get a little confusing. I'm not sure that we have addressed in the project what a "pattern" will look like, right? Or, rather, we are addressing that issue now and haven't quite figured it out. Does it look like it looks in FeatureLens, for example? What am I looking at to detect those differences?
Perhaps sampling novels across a decade without regard to author or
subgenre makes no sense. One could play a very similar game by
picking different authors, whether at random or according to some
hypothesis about significant resemblances or differences.
My hunch is that the distribution of POS ngrams in Stein would be
very different from just about any comparison group. So whatever you
do you would probably want to look at POS ngrams with or without
names. And you'd like to find some features that differentiate ngrams
with names from ngrams without.
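One hedged way to picture this comparison: split the n-gram counts into those that contain a name tag and those that do not, then look at which patterns differ most in relative frequency between two groups of texts. The counts below are invented for illustration, and the tag "NP" and the helper names are assumptions, not part of any MONK tool.

```python
from collections import Counter

def split_by_name(ngram_counts, name_tag="NP"):
    """Separate n-gram counts into those containing a name tag and those without."""
    with_names, without_names = Counter(), Counter()
    for gram, c in ngram_counts.items():
        target = with_names if name_tag in gram else without_names
        target[gram] += c
    return with_names, without_names

def relative_freq(counts):
    """Convert raw counts to proportions of the total."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

def frequency_gaps(counts_a, counts_b):
    """Difference in relative frequency (a minus b) for every n-gram in either group."""
    fa, fb = relative_freq(counts_a), relative_freq(counts_b)
    return {g: fa.get(g, 0.0) - fb.get(g, 0.0) for g in set(fa) | set(fb)}

# Invented counts, for illustration only.
stein = Counter({("NP", "V", "ADV"): 6, ("DET", "N", "V"): 4})
ncf = Counter({("NP", "V", "ADV"): 2, ("DET", "N", "V"): 8})

stein_names, stein_plain = split_by_name(stein)
gaps = frequency_gaps(stein, ncf)
for gram, gap in sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(" ".join(gram), round(gap, 2))
```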
**Very true--they would be very different. This is part of where the problem is . . . if everything is different then it will be very difficult for me to detect difference. You know what I mean? In other words, I need to be able to pinpoint similarities in order to gauge difference. So, in this case it would be important to be able to say "In sample 1 this author is describing a character. In sample 112 this author is describing a character. In the Stein sample, she is describing a character. Now, what do the patterns tell me is the difference here?"
Anyhow, am I right in thinking that your proposal involves
"interrogating" (as the literary critics like to say) syntactic
fragments about issues of character representation? And is the question
whether sampling would be a legitimate form of data reduction, so that
a lot of tedious operations would become computationally more tractable?
**Yes. I think this is right except for the fact that I'm (a) not sure what a pattern "looks like" and (b) not sure what data needs to go into the machinations in order for me to understand what is coming out . . .
|