MONK : From WordHoard's Find Words to Monk's Search and Sort
This page last changed on Sep 23, 2007 by martinmueller@northwestern.edu.
I write the following in response to Stan Ruecker's request to give the interface designers a better idea of Search and Sort by showing them how it works in WordHoard. I am going to go through some examples step by step, from both a technical and an interpretive perspective. I'll add some comments about things we didn't do in WordHoard but which would be improvements in MONK.

Starting up WordHoard

I assume you will read this with WordHoard on your screen. Get WordHoard from http://wordhoard.northwestern.edu and follow the instructions for download. Once you're ready to go, choose "Find Words" from the Find menu. You see a dialogue box with three choice points.
Take a look at the criteria. There are 20 of them. There is broad, but not complete, overlap with MONK.
I ignore criteria that are WordHoard specific. Choosing a criterion will restrict the values in the values box to the values appropriate to that criterion. The result of any search via Find Words can be thought of as returning a "data frame" with the columns as criteria and the rows as values. The return set will implicitly include three additional columns:
Once this data frame is retrieved from the server to the client computer, users can manipulate it by any combination of applicable criteria. Retrieving the data frame is not particularly fast, especially if the return list is large. Manipulating the data frame, however, is very fast.

'Think' in Shakespeare as an example

Let us look at "think" in Shakespeare. As you open the Find Words dialogue you get the default setting, which asks you first to specify a corpus and second to specify a lemma. You don't have to do so, but let us stick with that and specify 'Shakespeare' for 'corpus' and 'think' for 'lemma'. The result list is given to you by default by work, in descending order of frequency per 10K words. Individual KWIC lines are collapsed so that you get an overview. You see Othello at the top (33.23) and Venus and Adonis at the bottom (6.17). Not much thinking is going on in Shakespeare's soft-porn narrative poem about a handsome and rather reluctant young man being relentlessly pursued by the voracious Venus.

What about 'think' in Othello? If you remember Eliot's Waste Land you won't be surprised: 'What are you thinking of? What thinking? What? I think we are in rats' alley.' But it's nice to know that a very primitive statistic (relative frequency) puts Othello at the top of the list and provides quantitative evidence for Eliot's allusion. If you know your Shakespeare reasonably well, you'll see at once that the top six plays on the 'think' list all deal prominently with sex and betrayal. The top two plays (Othello and Much Ado) are specifically about male jealousy. Too much thinking of the wrong kind.

If you 'group by' 'work part' as well as work, you see how 'think' is distributed across the different scenes of Othello, and you notice immediately the spikes in relative frequency in two scenes (3.3 and 4.3). You can also look for scenes in other works with high frequencies.
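The 'group by' step and the frequency per 10K words can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WordHoard's implementation: the row layout, the names `per_10k` and `group_by_work`, the work labels, and every number are hypothetical.

```python
from collections import Counter

# Hypothetical return set: one row per hit, with made-up work and scene labels.
hits = [
    {"work": "Play A", "work_part": "1.1"},
    {"work": "Play A", "work_part": "3.3"},
    {"work": "Play A", "work_part": "3.3"},
    {"work": "Play B", "work_part": "2.1"},
]

# Made-up word counts for each work; a real system would supply these.
word_counts = {"Play A": 20_000, "Play B": 10_000}


def per_10k(count, total_words):
    """Relative frequency expressed as occurrences per 10,000 words."""
    return 10_000 * count / total_words


def group_by_work(rows, totals):
    """Collapse hits by work and sort by descending frequency per 10K words."""
    counts = Counter(row["work"] for row in rows)
    table = [(work, n, per_10k(n, totals[work])) for work, n in counts.items()]
    return sorted(table, key=lambda item: item[2], reverse=True)


for work, n, rate in group_by_work(hits, word_counts):
    print(f"{work}: {n} hits, {rate:.2f} per 10K")
```

Grouping by 'work part' instead would only change the key passed to `Counter` and require word counts per scene rather than per work.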
You notice that a third-act scene of 'Twelfth Night', a close neighbour of Othello in many ways, has a very high concentration of 'think'. You need not look very closely to see that it is the scene in which Olivia 'hits on' Viola/Cesario. You may formulate the fruitful hypothesis that 'think' in Shakespeare is closely related to error and sexual confusion.

Limitations of the WordHoard approach

If you reformulate the 'think' search without specifying the corpus you get 2,199 occurrences of 'think' in Chaucer, Shakespeare and Spenser. If you group them by author, you see a design flaw in WordHoard. Relative frequencies were precomputed for some categories but not others. They exist for works and work parts, but not for aggregates like author or publication decade. I suggested relative frequencies for works and work parts, when I should have thought about relative frequency (and other derivative computations) as a customizable feature of the available data set.

If you think of a data frame as returning information about a collection of works, the critical data points for initial orientation are
From these data points you can analyze, tabulate, and visualize returns in many ways. To stay with 'think', WordHoard does not tell you in how many Shakespeare plays the word was found or how many documents there were (although it knows). It tells you that it retrieved 1467 words. It would have been better to say that it searched 41 documents and retrieved 1467 hits from 39 documents. This does not matter if you're only interested in the documents in which 'think' occurs. It matters a lot if you want to see the dispersion of a word across a corpus. You should also remember that in much literary analysis you are not only interested in the top hits; you are very often interested in the distribution of a phenomenon across some collection.

Single-screen returns as quasi-graphics; the case of 'sad'

If you know that there are ~40 works in the Shakespeare corpus, the default display by work and descending frequency allows you to 'eyeball' a result list. You can read it almost like a graphic and see instantly whether a word occurs in a lot of works or just a few. Look up 'cogitation', which occurs in just two plays. And as long as you can see the entire result list in a single window and you know that it is sorted by descending frequency, you can also look at the middle of the list and get an idea of the median.

Look at 'sad'. You see right away that the median frequency per 10K is a little less than 3. That makes sense: a Shakespeare play is on average about 20,000 words long. There are 210 hits, so the average count per work will be about 5, which is about 2.5 per 10K words. When you look at the top, you see a very high frequency in a work with just one occurrence. That must be a very short work and an outlier (and it is). You also see that (unsurprisingly) 'sad' appears to be a key word in the narrative poem The Rape of Lucrece, which has a much higher count and relative frequency than any other work.
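The "searched N documents, found hits in M" summary and the median can be sketched as follows. This is a hedged illustration, not WordHoard code: the function name `dispersion_summary`, the work labels, and the rates are all made up; only the closing back-of-envelope figures (210 hits, about 40 works, about 20,000 words each) come from the text above.

```python
from statistics import mean, median


def dispersion_summary(rates_by_work, searched_works):
    """Summarise how a word is spread over every searched work.

    rates_by_work maps a work to its frequency per 10K words; works without
    hits are missing from it and are counted here as 0.0.
    """
    rates = [rates_by_work.get(work, 0.0) for work in searched_works]
    return {
        "searched": len(searched_works),
        "with_hits": sum(1 for rate in rates if rate > 0),
        "median": median(rates),
        "mean": mean(rates),
    }


# Made-up rates for five hypothetical works; "E" has no hits at all.
works = ["A", "B", "C", "D", "E"]
rates = {"A": 12.0, "B": 3.0, "C": 2.5, "D": 2.0}
print(dispersion_summary(rates, works))

# Back-of-envelope check from the text: 210 hits over about 40 works of about
# 20,000 words each.
hits_per_work = 210 / 40
print(hits_per_work, 10_000 * hits_per_work / 20_000)
```

Counting works without hits as zero is a design choice made here for illustration: it lets the outlier "A" pull the mean above the median while the median stays close to the typical work.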
If you are used to looking at Find Words returns in WordHoard you can see right away that the frequency of 'sad' in The Rape of Lucrece is five times as high as the median (as compared with 'think', where the frequency in the top hit, Othello, is about twice as high).

The examples just discussed depend critically on the fact that the return set fits on a single screen and is analyzed as a quasi-graphic (writing is a visualization of language). But you cannot 'read' results this way if they take up more than a screen, as they do, for instance, in the WordHoard implementation of the 250 NCF novels. (If you have an NU netid, you can get to this prototype via VPN at http://noir.at.northwestern.edu/wordhoard/ncf/) If you look for 'inward', you get a result of 1453 words (roughly the same as Shakespearean 'think'). If you eyeball the results in their default display of decreasing frequency per work, you see that the word is most common in George Eliot's Daniel Deronda (3.69). You also see that George Eliot is the author of six of the top dozen hits and that there is only one male author in the top dozen. (You need to know that Currer Bell is the pseudonym for Charlotte Bronte.) But you cannot easily see what the median value is or how the term is distributed. What you would like to see first in this case is probably a box plot. And if the top dozen values are for the most part by women writers, you might want two parallel box plots by gender. There are two basic design features to grasp here:
There are several ways of solving the problem:
The Lemma Window in WordHoard

There is a WordHoard tabulation that is indirectly useful for MONK. Go to the Windows menu and choose "Chaucer Lexicon". You see a table with all the lemmas used by Chaucer. The table includes information about word class, collection count, and number of works in which the lemma appears. For every word you get a crude overview of its dispersal. If you look at the Table of Contents for WordHoard, you see that the Chaucer corpus is divided into 11 works. Sort the lemma list by count/lemma and click on the word 'be', which is the second most common word in Chaucer. A separate lemma window opens up, which shows a summary by default but has a 'word forms' option. (You can also get to this window by clicking on any occurrence of a word in the text.) Click on Word Forms. You see a table that gives you summary information about the different forms of the verb 'be' in Chaucer. The information is arranged in descending frequency and gives you an overview of actual usage.

Grammatical information of this kind is more useful for a student of Chaucer than of later writers, and we will probably not devote much energy to displaying it in MONK. The point of the table for the present purpose is that it gives you an overview of a lemma from one perspective. You see almost at once that there are forms and spellings of 'be' that do not exist in modern English, such as 'weren', 'beth', 'ybeen', 'nis' or 'nart'. But the most casual look at the raw counts also tells you that these forms and spellings are much less common than the standard modern forms. Thus a single-screen overview of 'be' tells you a considerable amount about the relationship of Chaucer to modern English: distinctively medieval forms are on the way out, and Chaucer's English is in some ways more an early form of modern English than a late form of Middle English. If you look at the tabulation of other verb forms, you see similar patterns.
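A word-form table of this kind amounts to counting the spellings of one lemma and listing them in descending order. The following Python sketch assumes a flat list of tokens already assigned to the lemma; the function name `word_form_table` is hypothetical, 'beth' is one of the spellings mentioned above, and the other forms and all counts are invented for illustration.

```python
from collections import Counter

# Made-up tokens for one lemma: 'beth' is a spelling the text mentions,
# the other forms and all counts are invented for illustration.
tokens = ["is"] * 6 + ["was"] * 3 + ["beth"] * 1


def word_form_table(forms):
    """Return (form, count, share of lemma) rows in descending count order."""
    counts = Counter(forms)
    total = sum(counts.values())
    return [(form, n, n / total) for form, n in counts.most_common()]


for form, n, share in word_form_table(tokens):
    print(f"{form:<6}{n:>4}{share:>8.1%}")
```

Because every row belongs to the same lemma in the same corpus, the raw counts and the shares tell the same story, which is why raw counts suffice for this table.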
In MONK we will also want to give overviews of lemmata, but the overview will focus on semantic rather than morphosyntactic or orthographic phenomena. Instead of seeing a verb divided into its present, past, and participial forms (the minimal morphological structure that remains in modern English) we may want to think about its usage over time, in different genres (prose, fiction, drama, poetry), and by authors who are either male or female or English or American. In the grammatical tabulation raw counts are sufficient. Because we are dealing with a single corpus, raw counts are accurate indicators of proportion. In a survey of usage, we must start from the fact that all the variables have different raw counts: there are more words from 1850 than from 1580, more words by men than women, more words in fiction than poetry, and so forth.

A "mosaicplot" may hold the answer to some of this. "Mosaicplot" is a graphics feature in the R statistical program. It takes as its input a set of 'contingency tables' or cross-tabulated information. The output consists of the division of a rectangle into rectangles of different size, reflecting the distribution of various phenomena. But one would probably need another dimension. In the case of a word like 'liberty', the mosaic plot would be a good way of visualizing the percentage of its occurrence in different forms of writing. But it wouldn't show you relative frequency.
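The gap between a share of occurrences and a relative frequency can be made concrete with a small cross-tabulation. The Python sketch below is an illustration under stated assumptions: the two dimensions (genre and period), the cell labels, and all counts are made up, and the functions `shares` and `rates_per_10k` are hypothetical names. It assumes that the tiles of a mosaic plot are sized in proportion to the cell counts.

```python
# Made-up counts of one word, cross-tabulated by genre and period.
hits = {
    ("fiction", "early"): 10,
    ("fiction", "late"): 60,
    ("poetry", "early"): 20,
    ("poetry", "late"): 10,
}

# Made-up total words in the same cells; needed for relative frequency.
words = {
    ("fiction", "early"): 100_000,
    ("fiction", "late"): 1_000_000,
    ("poetry", "early"): 50_000,
    ("poetry", "late"): 50_000,
}


def shares(counts):
    """Share of all hits in each cell: the areas a mosaic plot would draw."""
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def rates_per_10k(counts, totals):
    """Occurrences per 10,000 words in each cell."""
    return {cell: 10_000 * n / totals[cell] for cell, n in counts.items()}


cell_shares = shares(hits)
cell_rates = rates_per_10k(hits, words)
for cell in hits:
    print(cell, f"share={cell_shares[cell]:.2f}", f"rate={cell_rates[cell]:.2f}")
```

In these invented numbers the late fiction cell holds the largest share of hits but has the lowest rate per 10K words, simply because that cell contains far more text. That is the extra dimension a plot of shares alone would not show.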