This page last changed on Jun 13, 2007 by ssteger@uga.edu.
Sentimentality Storyboard
I outline two possible scenarios below. In the first, I know the classification and want to know which low-level features correlate across similarly classified texts. In the second, I want to know which other texts might be grouped with a dataset based on the similarity of their features - the "more like these" classification.
SESSION ONE - Feature extraction
There are six passages from texts in which I am particularly interested - these include Bleak House, Uncle Tom's Cabin, The Old Curiosity Shop, The History of Mary Prince, Mary Barton, and Little Women. (NOTE - Mary Prince is a part of the DocSouth collection, and the other texts are likely a part of the NCF collections.) There are sections of each text which are either repeatedly referenced as being sentimental or that I've identified as highly sentimental for the purposes of my study. I know the classification of the texts; what I don't know is which low-level features form the sentimental pattern.
I sit down at my computer and open up MONK. I browse the available collections and select my first text, which I can then view divided into chapters (or equivalent divisions). I select a division and browse it. I isolate the section of the division in which I'm interested, then highlight it and drag it into a window where I build my dataset. In this window, I'm able to rank the level of sentimentality of the selection. When I have completed the ranking, the selection minimizes, and I go back to the collection to add more ranked selections to my dataset. I may even throw in one or two sections of text that are very unsentimental to expand the training set.
After I have completed all my selections, I click on the "find me features" button to ask MONK to run data mining algorithms that compare the low-level features of the selected texts, including vocabulary, part of speech, punctuation, and even bigrams and trigrams (to isolate key phrases). In other words, I want MONK to return a list of the features that correlate with sentimentality in the selected texts - perhaps the strongest shared feature is an abundance of adjectives, then the frequency of the word "little," then the frequency of the dash, and so on. I'm also able to click on any one feature to see a breakdown by text - for example, how many times the word "little" is used in each selection.
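As a rough illustration of the kind of counting behind the "find me features" button, here is a minimal Python sketch. It is not MONK's implementation: the function names extract_features and feature_breakdown are hypothetical, part-of-speech features are left out because they would need a tagger, and bigrams are formed across punctuation for simplicity.

```python
import re
from collections import Counter


def extract_features(text):
    """Count words, punctuation marks, and word bigrams in one selection.

    Hypothetical helper for illustration; features are keyed as
    (kind, value) tuples, e.g. ("word", "little") or ("punct", "-").
    """
    # Lowercased words, keeping internal apostrophes ("don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Any single character that is not a letter, digit, underscore,
    # whitespace, or apostrophe counts as punctuation.
    punct = re.findall(r"[^\w\s']", text)

    features = Counter()
    features.update(("word", w) for w in words)
    features.update(("punct", p) for p in punct)
    # Adjacent word pairs; these ignore intervening punctuation.
    features.update(("bigram", a + " " + b) for a, b in zip(words, words[1:]))
    return features


def feature_breakdown(selections, feature):
    """Return {title: count} for one feature across all selections.

    `selections` maps a selection title to its text.
    """
    return {
        title: extract_features(text)[feature]
        for title, text in selections.items()
    }
```

Calling feature_breakdown(selections, ("word", "little")) would give the per-selection count described above. Ranking features by how strongly they track the sentimentality rankings would be a further step that this sketch does not attempt.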
I am able to save the session and result set.
SESSION TWO - Automatic text classification: more like these
I open up MONK and browse the list of available titles. I again click on some texts, which pulls up their divisions, from which I browse, select, and isolate sections of text. As I read the isolated sections, I rank the level of sentimentality, building a training set for supervised learning. It is very important that I'm able to save my set so that I can continue to build it later (it could potentially be a long process).
Once I have built the training set, I then choose texts from the available collection to run against the training set. I don't know much about these other texts, or I would have made them a part of my training set, but I am interested in texts from certain years (or I'm only interested in women authors, or I'm interested in novels with certain keywords, etc.). For each text in the collection, I'm able to click to get a pop-up window with bibliographic information about the selected text, so I can then choose whether to add it to my collection or not. I can also sort the texts according to bibliographic info to make this process easier. Since I realize that running a whole novel as a single text wouldn't give me the best results, I'm able to choose to isolate each division of the novel (presumably chapters) as separate texts for the purpose of the analysis.
By this time, I have two collections of texts - the training set and the set I want to use for predictions. I click on the "more like these" button and am able to choose which classifier I want to use for the automatic text classification: Naïve Bayes or SVM. If I'm not sure what these mean, I'm able to hover over each one for a short explanation in layman's terms. I can also click through to more detailed documentation with technical information. I make my selection, and I'm given an estimate of how much time this could take. I've built a large collection, so I'm okay with the fact that it will take a while. I wander off to get some coffee.
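To make the Naïve Bayes option a little more concrete, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing, written in plain Python. It is an illustration under stated assumptions, not a description of how MONK classifies: the class name, the word-only tokenizer, and the use of rank labels such as "sentimental" and "unsentimental" are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens only; a deliberately crude assumption."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        # Per-label word counts and per-label document counts.
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return {label: log prior + sum of log word likelihoods}."""
        total_docs = sum(self.label_counts.values())
        tokens = [w for w in tokenize(text) if w in self.vocab]
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in tokens:
                score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

In this sketch, the ranked selections from the training set would be passed to fit, and each chapter in the prediction set would be passed to predict or log_scores. Words never seen in training are simply ignored.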
When I return, MONK has done its magic. There are a few ways I can view the results. I can see a visual which clusters the sections of texts according to their sentimentality. If I'm more numbers-oriented, I can toggle to a screen that lists the text sections according to their similarity ratios. If I'm interested in the back-end, I can move to a screen that lists the features that were used for the classification.
I look at the list or the visual and see that, for a certain novel, the chapters keep showing up in ways that I think are interesting. I decide I'd like to see the distribution of sentimentality across that novel. By selecting one of the chapters, I'm able to choose to re-group all the chapters of that novel (or at least the ones I included in the prediction set) and visualize how they rank. MONK remembers where chapters are in the novel so I see a progression over the timespan of the novel (first chapter to last).
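The regrouping step can be pictured with a small Python sketch. It assumes, hypothetically, that each classified chapter comes back as a (novel, chapter number, score) record; neither that record shape nor the function name comes from MONK.

```python
from collections import defaultdict


def progression_by_novel(results):
    """Group (novel, chapter_number, score) records by novel.

    Returns {novel: [(chapter_number, score), ...]} with each list
    sorted by chapter number, i.e. from first chapter to last.
    """
    by_novel = defaultdict(list)
    for novel, chapter, score in results:
        by_novel[novel].append((chapter, score))
    return {novel: sorted(chapters) for novel, chapters in by_novel.items()}
```

The sorted list for one novel is what a chart of sentimentality across the novel would plot, with chapter position on one axis and the score on the other.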
Ultimately, I am presented with a list of chapters from texts that are similar to my training set - other sentimental moments. I am also provided with the features that are similar, so that I can better understand what comprises a sentimental moment.