This page last changed on Apr 23, 2008 by plaisant@cs.umd.edu.

Report on Progress for Sentimentality Use Case

(Report from Sara, after discussion of the results at the Montreal Hackfest, with summary comments from Catherine)

When I last reported to the group, I had been using WordHoard to run some experiments. At our meeting in December, I talked about the lexicons I had built for the sentimental and unsentimental chapters and I reported the results from the log-likelihood feature (in WordHoard) for comparing texts. In December, Loretta and Amit got me a decision tree based on my classified texts. I found the tree really helpful in understanding not just which words were sentimental, but how the presence and absence of these words determined sentimentality.

Since then, Amit has been working to get me some more data. Through SEASR, there are now three flows (what were once called itineraries in D2K) that are relevant to my use case. They're not in the workbench yet, but hopefully that will change during the course of the current hackfest. These flows are Naïve Bayes classification (predictions), the decision tree of the training data (a structure of if-rules based on features), and Naïve Bayes classification with a decision tree (where the tree corresponds to the prediction for the entire workset). I've been running experiments and wanted to pass on some of the results to the group.

First off, I used the decision tree flow to create a tree based on the training data. The results were based on absolute classifications, not predicted classifications. The training set originally consisted of 40 sentimental chapters and 70 unsentimental chapters from the NCF collection. The resulting tree is here. Right now, SEASR uses Weka algorithms for classification, and the decision tree comes from Weka's J48 algorithm. The visualization here uses the open source Graphviz program. While it's a pretty simple tree (no branches, just a really long trunk), it still gives me information about my training set. As in the tree Amit and Loretta provided me back in December, the word "forgiveness" is an important indicator of sentimentality. I don't know what to make of "trunk" or "hammer" - one of the things we're talking about again here at the hackfest is the essential need to provide access to the texts where the features are used (an old song). Right now I have to search manually, and this is painfully time-consuming. I also see from the tree that my training set is too small - the presence of the word "schoolmaster" at the end of the tree (which corresponds to a particular character in a particular novel) shows me that that work is over-represented in the training set.

I then ran the Naïve Bayes classification on the original workset using the original training set. These were the sets that I created last December by making worksets in WordHoard; I wanted all novels between 1838 and 1865 (80 novels) for my workset. As I found out, this wasn't the best way to create the workset: I had selected entire works in WordHoard, and the resulting list of NCF chapter names didn't include the full chapter name. When a work had volumes, it stripped off the individual chapters (the last part of the NCF name). In other words, it provided the volume name, but not the chapters, so the workset was much smaller than I had intended. Nonetheless, I decided to run the NB classification anyway - just to see if I could get things to work. As it turns out, I got some astounding results. The chapters that were labeled as sentimental were mostly ones that I would have labeled as sentimental as well. The system is really great at detecting sentimentality in Dickens, in particular. I had left out the scene from Dombey and Son where Paul dies (the chapter "What the Waves Were Always Saying") - a chapter that is pretty much always referenced in discussions of sentimentality - just to see if it would be "found" through the classification. It was! Other chapters included the final scene from A Christmas Carol (think Tiny Tim and "God bless us, everyone!") and several scenes from David Copperfield and Nicholas Nickleby. It also returned chapters from Fanny Trollope's Michael Armstrong and Anne Brontë's Agnes Grey.
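For readers new to the technique, here is a minimal sketch of Naïve Bayes classification over word counts. It is not the SEASR flow or the Weka implementation; the function names (train_nb, classify_nb) and the four tiny training "chapters" are invented for illustration only, and a real run would use the full chapter texts.

```python
import math
from collections import Counter


def train_nb(docs):
    """Count words per label. docs is a list of (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}


def classify_nb(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["labels"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        # Start from the prior: how common the label is in the training set.
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok in model["vocab"]:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, NOT taken from the NCF collection.
TRAINING = [
    ("tears forgiveness remembrance sob".split(), "sentimental"),
    ("forgiveness tears dew bequeathed".split(), "sentimental"),
    ("ledger invoice carriage timetable".split(), "unsentimental"),
    ("carriage ledger road timetable".split(), "unsentimental"),
]
MODEL = train_nb(TRAINING)
```

Calling classify_nb(MODEL, "forgiveness tears".split()) picks whichever label makes those words most probable given the toy counts. The classifier treats every word as independent evidence, which is the "naïve" assumption in the name.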

While the results from this NB run only included works that weren't broken into volumes, I still found the exercise useful not just for testing the system, but for increasing my training set. I added 22 chapters to the sentimental training set, bringing it to 62 chapters. I then wanted to see how it would impact the decision tree, so I ran that again. Here are the results. There seem to be fewer anomalies in the tree, and it ousted the "schoolmaster" overfit problem. "Forgiveness" still tops the tree, and words like "remembrance," "sob," "bequeathed," and "dew" all make sense. They are words that have more to do with general sentimentality than particular scenes or works, which is what you would want to see.

Given my expanded training set, I then ran the Naïve Bayes Classification Flow that included a Decision Tree with an expanded workset that included all the chapters from the 80 novels in the time period (mid-Victorian or "High" Victorian) in which I was interested. I'm working through the resulting 91 pages of classified chapters. So far, I've been really impressed and excited by the results. When something is classified as "sentimental" and it's not really sentimental, I can usually understand a trend in the "mistake." For instance, I'm finding many chapters that include a direct address to the reader ("you, oh reader!") are included. This is, in fact, something that happens a lot with sentimentality - where the author is trying to reach out to the audience and make that emotional connection.

This Flow also includes a decision tree that is based, not on the training set data, but on the machine-predicted classifications (i.e. the entire workset). It's a much more complicated tree (it's a much more complicated workset). It's a bit hard to see, but here it is: NB Tree. You have to click to zoom in to make much sense of it. I'm looking at these results and trying to make sense of them. In particular, "Wittenberg" means nothing to me.

THINGS WE STILL WANT TO DO WITH DECISION TREES:
It would be great if I could click on a word in the tree (like "playful") and be taken to its context in the text for the five instances where it's used in the sentimental texts and the one instance where it's used in the unsentimental texts (and know the difference). Also, we want to add some parameters to the way the decision tree is working. Right now, we're not holding back any "folds" for cross-validation, but we want to add that parameter so that we can check how well the tree holds up on chapters it wasn't built from, and have more confidence in it. We'll also probably be using a different visualization for the tree than Graphviz.
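To make the idea of holding back "folds" concrete, here is a minimal sketch of a k-fold split. It is not the Weka or SEASR parameter itself; the function name k_fold_splits and the seed argument are assumptions made for this illustration. Each chapter is held out exactly once, and the tree would be built on the rest and checked against the held-out chapters.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (train, held_out) pairs; each item is held out exactly once."""
    shuffled = list(items)
    if not 2 <= k <= len(shuffled):
        raise ValueError("k must be between 2 and the number of items")
    # A fixed seed keeps the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds, like dealing cards.
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out
```

With a training set of 10 chapters and k=5, this gives five rounds, each building on 8 chapters and testing on the 2 that were held back.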

THINGS WE STILL WANT TO DO IN GENERAL:
To review the results, I'm having to take the NCF chapter name that is returned by the NB classification, go to a table that lists the NCF chapter name to find out which work and chapter it is, and then go find it (out on the web, usually) to review it. I don't actually have access to the texts. This is, of course, pretty cumbersome. We think all things will be hooked up by the end of the hackfest so that I can review the classified chapters through the workbench.

Using the workbench's collection tree browser to input 130 or so individual chapters for the training set isn't really practical. Stefan is working on a way to have an input field that has completion so that you can begin typing what you want to select.

As we have discussed, the ranking of 1 to 5 is okay, but it would be helpful also to have a "yes/no" ranking for classification. Will the 1 to 5 ranking turn into yes/no? Do we need to have a feature selection for the way you want to do rankings?

We've also been talking about the idea of iteration - when we get results back from the classification, we need an easy way to "correct" the classification by over-riding the machine-prediction and re-iterating the experiment.
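One way to picture this iteration is sketched below. It is only an illustration of the bookkeeping, not a workbench feature; the function names and the chapter identifiers used with them (such as "ch-1") are hypothetical. The human correction always wins over the machine prediction, and only chapters that someone has actually reviewed are added to the training set for the next run.

```python
def apply_corrections(predictions, corrections):
    """Return labels where a human correction overrides the machine prediction."""
    unknown = set(corrections) - set(predictions)
    if unknown:
        raise KeyError(f"corrections for unclassified chapters: {sorted(unknown)}")
    return {
        chapter: corrections.get(chapter, label)
        for chapter, label in predictions.items()
    }


def extend_training_set(training, corrected, reviewed):
    """Copy the training set and add only the chapters a person has reviewed."""
    extended = dict(training)
    for chapter in reviewed:
        extended[chapter] = corrected[chapter]
    return extended
```

The extended training set would then be fed back into the classification flow for the next round of the experiment.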

Extra notes from Catherine:

Overall this is good news! Naïve Bayes can be useful for Sara's use case. The classification worked better than for the Erotic/Dickinson Nora use case, showing that some questions will be more appropriate than others. Many of the lessons learned from this use case confirm the findings of the earlier use case and reinforce our understanding of users' needs. Which is good news too...

Sara's experience confirms that it is imperative to provide access to the text from the list of features or any results provided from the analytics.
It also confirms that the process is iterative: even problems with "bad" worksets generated good predictions which could become part of the training set. This also confirms that saving multiple separate ratings, multiple worksets and multiple results is important and should be allowed.

Even though a very large workpart/chunk size was used, the overall experience was very positive. This is probably thanks to the fact that the characteristic researched, i.e. sentimentality, infuses entire chapters. Sara confirmed that it didn't find chapters where only a small part is sentimental. Kirsten may not have the same luck with classification if she looks for small specialized elements in the text, unless we allow training and classification on smaller workparts.
Running the decision tree on the training set allowed Sara to evaluate the quality of the training set. The tree may be hard to use but Sara was able to make sense of the numbers and rules. A more novice user may be confused by the tree, rules and numbers, but would still benefit from the list of words.

Sara's comment about the explainable "mistakes" confirms what we saw with Martha and the Dickinson erotics use case. Sara's classification didn't always find sentimentality, but it helped uncover characteristics that are often associated with sentimentalism (e.g. addressing the reader). With Martha, the erotics classification seemed to suggest that a text had to include personal pronouns (you, me, her, etc.) for erotics to be found, but that this was not a sufficient condition, hence the apparent errors.

Detail note about the 1-5 versus yes/no: From past experience we know that some users will want 1-5 while others will want yes/no. We also know that the data mining will work better with 2 classes than with 5 for a training set of the same size. So what has been suggested, and seems to be the best way to start in the short term, is to allow 1-5 ratings, with users who want yes/no using only the 1 and 5 ratings. The next step is to allow users to specify that 1+2 should be considered as NO and 4+5 as YES (or whatever mapping they want)... (In the future, the best solution would be to give users the choice of the number of classes, labels for the classes, etc.)
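The mapping step could look something like the sketch below. The names DEFAULT_MAPPING and ratings_to_classes are made up for this illustration, and the handling of a rating of 3 is an assumption: the note above does not say what should happen to it, so here any rating missing from the mapping is simply left out of the two-class training set.

```python
# 1+2 count as "no", 4+5 as "yes"; 3 is deliberately absent (an assumption).
DEFAULT_MAPPING = {1: "no", 2: "no", 4: "yes", 5: "yes"}


def ratings_to_classes(ratings, mapping=DEFAULT_MAPPING):
    """Turn 1-5 ratings into class labels, skipping ratings with no mapping."""
    classes = {}
    for item, rating in ratings.items():
        if rating not in range(1, 6):
            raise ValueError(f"rating for {item!r} must be 1-5, got {rating!r}")
        if rating in mapping:
            classes[item] = mapping[rating]
    return classes
```

A user who wants strict yes/no can pass mapping={1: "no", 5: "yes"}, which matches the short-term suggestion of using only the 1 and 5 ratings.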

Document generated by Confluence on Apr 19, 2009 15:05