This page last changed on Mar 10, 2007 by martinmueller@northwestern.edu.

The following is not an exhaustive catalog of everything possible in WordHoard, but it attempts to describe the major analytical routines that users can perform. I'll try to do this as far as possible in non-technical terms. Bill, John, and Phil may want to add to this in various ways.

What WordHoard 'knows' about each word

Everything in WordHoard depends on what the underlying data model may be said to "know" about each word in a text. It knows:

  1. the precise address at which it can be found
  2. the precise address of the word that precedes or follows (and implicitly everything that can be extracted from that address)
  3. the spelling of the word at that address
  4. the grammatical properties of the word form represented by the spelling, expressed as a POS tag
  5. the lemma or dictionary entry form of the word form associated with the spelling
  6. the word class to which the lemma belongs
  7. the work in which the word occurs
  8. everything that is associated with knowledge of the work (e.g. author, date, genre)
  9. the work part of which it is part
  10. the token counts of the various types under which the spelling at its location can be classified (lemma, POS tag, etc.)
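
The list above can be pictured as one record per word occurrence. What follows is only a minimal sketch in Python: the class name, field names, addresses, and tags are all invented for illustration and are not WordHoard's actual schema. The token counts in the last item are not stored in the record but derived by counting over any of its attributes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordOccurrence:
    """One word occurrence; field names are illustrative, not WordHoard's schema."""
    address: str                 # 1. where the word is found
    prev_address: Optional[str]  # 2. address of the preceding word, if any
    next_address: Optional[str]  # 2. address of the following word, if any
    spelling: str                # 3. spelling at that address
    pos_tag: str                 # 4. grammatical properties as a POS tag
    lemma: str                   # 5. dictionary entry form
    word_class: str              # 6. word class of the lemma
    work: str                    # 7. work; also the key to work metadata (8.)
    work_part: str               # 9. part of the work


def token_counts(occurrences, attribute):
    """Count tokens per type, where the type is any attribute of the record."""
    return Counter(getattr(occ, attribute) for occ in occurrences)


# Invented sample records; addresses and tags are placeholders.
sample = [
    WordOccurrence("w1/p1/1", None, "w1/p1/2", "sad", "adj", "sad", "adjective", "w1", "p1"),
    WordOccurrence("w1/p1/2", "w1/p1/1", "w1/p1/3", "sadder", "adj-comp", "sad", "adjective", "w1", "p1"),
    WordOccurrence("w1/p1/3", "w1/p1/2", None, "goes", "verb-3sg", "go", "verb", "w1", "p1"),
]

print(token_counts(sample, "lemma"))
```

Every analytic described on this page can be thought of as a filter, a grouping, or a count over records of roughly this shape.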

Get Info on a word occurrence

A first and simple analytic in WordHoard is the report it generates about any word occurrence with a single click. This is quite a powerful tool with Chaucer, where for any given word occurrence you can see a tabular representation of all spellings of all word forms of the lemma.

'Group and sort' concordance output

Probably the most useful analytic for most users is the group and sort capability of the concordance tool. If a simple lookup for a word retrieves a result list that stays within George Miller's magic number 7 plus or minus 2, you can take in the results informally and at a glance. The further a result list exceeds a dozen, the more useful it is to group the results in some fashion. In WordHoard you can group the results by just about any aspect of what WordHoard "knows" about each word (spelling, POS, work, date, etc.).

If you are struck by 'sad' in the opening line of The Merchant of Venice and group a concordance by work and work part, you see immediately that of the nine occurrences of the word in that play, eight occur in the first scene.
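
Grouping by work and work part amounts to bucketing the hits by a tuple of attributes. Here is a minimal Python sketch of that idea; the function name, the dictionary keys, and the four hits are invented placeholders, not real concordance output.

```python
from collections import defaultdict


def group_hits(hits, keys):
    """Bucket concordance hits by the given attributes, e.g. work and part."""
    groups = defaultdict(list)
    for hit in hits:
        groups[tuple(hit[key] for key in keys)].append(hit)
    return dict(groups)


# Invented hits for illustration only.
hits = [
    {"work": "Play A", "part": "1.1", "spelling": "sad"},
    {"work": "Play A", "part": "1.1", "spelling": "sad"},
    {"work": "Play A", "part": "3.2", "spelling": "sad"},
    {"work": "Play B", "part": "2.1", "spelling": "sad"},
]

# Print each group with its size, sorted by work and then by part.
for group, members in sorted(group_hits(hits, ["work", "part"]).items()):
    print(group, len(members))
```

Changing the list of keys (say, to spelling or POS tag) regroups the same hits without a new search, which is the point of the group and sort feature.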

You can activate a concordance search from within a text or by entering a word in the Find Word menu.

The Find Lemma tool

WordHoard lets you use drag and drop routines to build up constraints for searches across all its corpora. Since this type of search always returns a list of lemmata, it is called Find Lemma. But that is not a good name for it (users don't know how it differs from Find Word).

For me the coolest feature of this tool is that it lets you explore shared rare vocabulary. Let's say you want to know what words occur in Hamlet and Lear but are rare elsewhere in Shakespeare. You drag Hamlet and Lear from the table of contents, activate Boolean 'all', and select a document frequency of < 3 in Shakespeare. That will retrieve all words that occur in both plays but in at most one other play. If you are curious about Chaucerian words that occur in The Faerie Queene, you can drag The Faerie Queene and all of Chaucer, again choose Boolean 'all', and filter out lemmata with a frequency above, say, 10.

You get the idea of this kind of search, which can be very useful in a variety of contexts.
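
In set terms, a search of this kind intersects the vocabularies of the chosen works and then filters by document frequency. The Python sketch below shows the logic under stated assumptions: the function name and the toy vocabularies are invented, and the threshold here counts the selected works themselves, which may not be how WordHoard counts document frequency.

```python
from collections import Counter


def shared_rare_lemmata(lemmata_by_work, required_works, max_document_frequency):
    """Lemmata found in every required work and in few works overall."""
    # Document frequency: the number of works in which a lemma occurs.
    document_frequency = Counter()
    for lemmata in lemmata_by_work.values():
        document_frequency.update(set(lemmata))
    # Boolean 'all': the lemma must occur in each of the required works.
    shared = set.intersection(*(set(lemmata_by_work[w]) for w in required_works))
    return sorted(
        lemma for lemma in shared
        if document_frequency[lemma] <= max_document_frequency
    )


# Toy vocabularies; the works and lemmata are placeholders.
lemmata_by_work = {
    "work_a": {"alpha", "beta", "gamma", "delta"},
    "work_b": {"alpha", "beta", "gamma"},
    "work_c": {"alpha", "beta"},
    "work_d": {"alpha"},
}

print(shared_rare_lemmata(lemmata_by_work, ["work_a", "work_b"], 3))
```

With a maximum of 3, a lemma shared by the two required works may occur in at most one further work, so a lemma that occurs in all four toy works is excluded.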

The WordHoard Calculator

The WordHoard Calculator is a programmable statistical application that lets you run various statistical routines against customized data sets drawn from the general data store in WordHoard. In principle you can write your own programs at the command line. Nobody has actually done that so far, and people have used only what has been mediated through a GUI.

Dunning's log likelihood ratio

Dunning's log likelihood ratio is a statistic that does more or less the same thing as a chi-square, but it is supposed to be more suitable for textual data. It supports a 'figure and ground' operation where you choose one set of texts or words, called the Analysis Corpus, compare it with another set of texts or words, the Reference Corpus, and find words that are, in comparison with the Reference Corpus, disproportionately common or rare in the Analysis Corpus. The resultant log likelihood ratio maps to a probability table that you interpret in the same way as a chi-square statistic. There are some precomputed work and word sets that can be used as Analysis or Reference corpora, but users can also construct and save their own sets.
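
For readers who want to see the arithmetic, here is a minimal Python sketch of a widely used two-term form of the log likelihood statistic for comparing one word across two corpora. It is an illustration, not a copy of WordHoard's code: the function name and the counts are invented, and I am assuming that the two corpora do not overlap.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-term log likelihood (G2) for a single word.

    a: count of the word in the Analysis Corpus
    b: count of the word in the Reference Corpus
    c: total words in the Analysis Corpus
    d: total words in the Reference Corpus
    """
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


# Invented counts: 30 per 1000 words against 10 per 1000 words.
score = log_likelihood(30, 10, 1000, 1000)
overused = 30 / 1000 > 10 / 1000  # direction comes from the relative frequencies
print(round(score, 2), "overused" if overused else "underused")
```

The score itself is never negative, so the direction (more common or more rare in the Analysis Corpus) has to be read from the relative frequencies. A score above 3.84 corresponds to p < 0.05 on a chi-square table with one degree of freedom.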

If, for instance, you compare Shakespeare's tragedies with his comedies, you discover that they differ most sharply in the use of 'she' (much less common in tragedy). If you then compare Julius Caesar with the tragedies, you discover that it differs most sharply in the use of 'she' (much less common in Julius Caesar).

Because the Shakespeare corpus in WordHoard distinguishes speaker by sex and lines by verse or prose, you can look for characteristic differences in the word choices of male and female speakers. Or you could compare the verse of men with the prose of men. (There are some limitations at the moment on the size of word sets you can construct.)

Triangulating from WordHoard to MONK, it would be quite illuminating to place novelists in their generation. You could probably do this with sufficient accuracy if you had precomputed sixty-year Reference Corpora that move in 25-year windows: 1800-1860, 1825-1885, etc.

Collocation

J. R. Firth famously said that "you shall know a word by the company it keeps." There are various statistics in WordHoard that let you measure collocation, and you can define the collocation window (words within n words of the chosen term). At bottom the calculation of collocation is another figure-ground operation: the span of words you define as the collocation window constitutes the analysis corpus, and the text as a whole is the reference corpus. If a word is relatively more frequent in the collocation span than in the text as a whole, it may be said to collocate.

There are different ways of measuring the strength of association between collocating words. WordHoard doesn't explain these very well at the moment, but the "specific mutual information" statistic tends to give the most interesting results when you want to know what content words appear in close proximity to another content word. Doing this for 'honour' in Chaucer and Shakespeare gives interestingly different profiles for the company of 'honour'.
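
The figure-ground idea can be sketched in a few lines of Python. The score used here is pointwise mutual information, computed as the base-2 log of a word's relative frequency in the collocation span divided by its relative frequency in the whole text. Whether this matches WordHoard's "specific mutual information" is an assumption on my part, and the function name and tokens are invented.

```python
import math
from collections import Counter


def collocates(tokens, node, window):
    """Score each word found within `window` words of `node`."""
    text_counts = Counter(tokens)
    span_counts = Counter()
    span_size = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node occurrence itself
                span_counts[tokens[j]] += 1
                span_size += 1  # overlapping windows are counted twice
    scores = {}
    for word, count in span_counts.items():
        p_span = count / span_size               # the analysis corpus
        p_text = text_counts[word] / len(tokens)  # the reference corpus
        scores[word] = math.log2(p_span / p_text)
    return scores


# Toy token stream; 'x' is the node word.
tokens = ["x", "b", "c", "d", "e", "f", "g", "x", "b", "h"]
for word, score in sorted(collocates(tokens, "x", 1).items()):
    print(word, round(score, 3))
```

A positive score means the word is relatively more frequent near the node than in the text as a whole. Scores of this kind are known to favor rare words, so in practice a minimum count is usually applied before ranking.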

Multiword units

Analytics based on phrases did not get as far in WordHoard as we expected, but there are some useful features. You can, for instance, generate a list of repeated phrases. Thus a search for repeated phrases between two and seven words in length will retrieve the bulk of repetitions in the Odyssey within two to three minutes.
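
A naive way to find repeated phrases is to count every word sequence of each length in the range and keep those that occur more than once. The Python sketch below does exactly that on a toy token list; it says nothing about how WordHoard implements the search, and the function name is invented.

```python
from collections import Counter


def repeated_phrases(tokens, min_len=2, max_len=7):
    """Return every phrase of min_len..max_len words that occurs more than once."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of n words across the text.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {phrase: count for phrase, count in counts.items() if count > 1}


# Toy tokens standing in for a text.
tokens = "a b c d a b c e".split()
for phrase, count in sorted(repeated_phrases(tokens).items()):
    print(" ".join(phrase), count)
```

Note that a repeated long phrase also reports all of its shorter sub-phrases, so a real report would probably want to collapse those.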

Words over time

You can trace the history of a word over time. You can only do this within a corpus, and it is of limited utility because diachronic information about Early Greek epic and Chaucer is not reliable. Tracing the history of Shakespearean words can be interesting. But broadly speaking, this feature becomes much more appealing when a multi-author corpus stretches over many decades.
