This page last changed on Oct 14, 2008 by martinmueller@northwestern.edu.

The following is an attempt to move the discussion of Search and Sort into an implementation stage by spelling out requirements in a way hopefully intelligible to the implementer.

"Search and Sort" is a short hand name for a bundle of simple analytics that support exploratory data analysis in the informal, iterative, and interactive manner characteristic of scholars in the humanities. In this short hand, 'search' refers to all the steps through which a search is defined. 'Sort' refers to all procedures by which the results of a search can be postprocessed. These procedures are not limited to literal sort operations but include groupings and counts of subsets that follow logically from sort operations.

Search Criteria

Every Search and Sort operation begins with the identification of a subset of data in the data store that are to be moved to the client for post-processing. The identification of this subset results from the arbitrary combination of selection criteria that the datastore exposes to the user. These include

  1. a string literal or regular expression
  2. lemma
  3. standardized spelling
  4. POS tag
  5. word class
  6. author
  7. work
  8. work part
  9. date (specifiable as some range between a start and end date)
  10. sex of author
  11. author origin (controlled vocabulary)
  12. work genre (controlled vocabulary)

An example of such an arbitrary combination of criteria would be "adjectives ending in 'ly' in novels by American women published between 1851 and 1853."
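As a minimal sketch of how such a combination might be expressed, the following treats every criterion as optional and requires a word occurrence to satisfy all the criteria that are set. The field names, the `Criteria` and `Token` classes, and the four sample tokens are hypothetical illustrations, not part of any existing data store.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """One word occurrence with a few of the attributes listed above (hypothetical names)."""
    spelling: str
    word_class: str
    genre: str
    author_origin: str
    author_sex: str
    year: int

@dataclass
class Criteria:
    """A criterion left as None does not restrict the search."""
    pattern: Optional[str] = None
    word_class: Optional[str] = None
    genre: Optional[str] = None
    author_origin: Optional[str] = None
    author_sex: Optional[str] = None
    start_year: Optional[int] = None
    end_year: Optional[int] = None

def matches(tok: Token, c: Criteria) -> bool:
    """True if the token satisfies every criterion that has been set."""
    if c.pattern is not None and not re.search(c.pattern, tok.spelling):
        return False
    for name in ("word_class", "genre", "author_origin", "author_sex"):
        wanted = getattr(c, name)
        if wanted is not None and getattr(tok, name) != wanted:
            return False
    if c.start_year is not None and tok.year < c.start_year:
        return False
    if c.end_year is not None and tok.year > c.end_year:
        return False
    return True

# "adjectives ending in 'ly' in novels by American women published between 1851 and 1853"
query = Criteria(pattern=r"ly$", word_class="adjective", genre="novel",
                 author_origin="American", author_sex="F",
                 start_year=1851, end_year=1853)

# Invented sample data, for illustration only.
tokens = [
    Token("lovely", "adjective", "novel", "American", "F", 1852),
    Token("quickly", "adverb", "novel", "American", "F", 1852),
    Token("lovely", "adjective", "novel", "English", "F", 1852),
    Token("kindly", "adjective", "novel", "American", "F", 1860),
]
hits = [t for t in tokens if matches(t, query)]
```

Only the first sample token survives all the criteria. A real data store would presumably translate the same criteria into an indexed query rather than filter in memory.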

The data frame returned by a search

In response the data store delivers to the client machine the requested data in a format that can either be postprocessed directly or can be exported for further processing to a third-party application such as Excel, Minitab, R, ManyEyes etc.

The requested data constitute a "data frame" in a "long data format" that spreadsheets or statistical programs are familiar with. Depending on the size of the result set, the data will be returned as a "full data frame" or a "summary data frame."

The "long data format" is redundant or "denormalized" from a perspective of normalized data representation. Whether data should be sent in a normalized fashion and be denormalized on the client, or whether they should be denormalized on the server side is a practical question.

The size of the result set may be user configurable and depends on some balance between the user's patience and the server's capabilities. For the sake of argument let us say that results up to 3,000 data rows are routinely delivered as a full data frame and that results of more than 3,000 data rows are delivered as a summary data frame, unless the user requests a higher ceiling. There may be some ceiling (10,000? 64,000?) beyond which the server will always deliver summary data frames.
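The decision rule just described can be sketched in a few lines. The 3,000-row default comes from the paragraph above. The 10,000-row server ceiling is an assumption that picks one of the candidate values mentioned there, and the function name is hypothetical.

```python
from typing import Optional

FULL_FRAME_DEFAULT = 3_000   # routine limit suggested in the text
SERVER_CEILING = 10_000      # assumption: one of the ceilings the text floats

def frame_type(row_count: int, user_limit: Optional[int] = None) -> str:
    """Return "full" or "summary" for a result set of row_count rows."""
    limit = FULL_FRAME_DEFAULT if user_limit is None else user_limit
    limit = min(limit, SERVER_CEILING)  # the server's ceiling always wins
    return "full" if row_count <= limit else "summary"
```

With these values, `frame_type(3000)` yields a full data frame, `frame_type(3001)` a summary data frame, and a user who asks for a limit of 100,000 rows is still held to the server's ceiling.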

The scenario of the full data frame

In the scenario of a full data frame the server delivers to the client a table with information about every word occurrence that meets the search criteria. The columns of the table will specify

  1. The unique word occurrence ID of the hit word, which will serve as a link to the full context
  2. The lemma of the word before the hit word
  3. The lemma of the word following the hit word
  4. 40 characters preceding the hit word
  5. 40 characters following the hit word
  6. lemma
  7. standardized spelling
  8. POS tag
  9. word class
  10. author
  11. work
  12. The word count for the work
  13. work part
  14. The word count for the work part
  15. date (specifiable as some range between a start and end date)
  16. sex of author
  17. author origin (controlled vocabulary)
  18. work genre (controlled vocabulary)

The scenario of the summary data frame

In the scenario of a summary data frame the server delivers to the client a table that provides these data:

  1. The spelling(s) with counts
  2. The standard spelling(s) with counts
  3. The lemma(s) with counts
  4. The POS tag(s) with counts
  5. The LemmaPOS combination(s) with counts
  6. The work(s) with total word count and counts for the work(s) at the spelling, lemma, POS, and LemmaPOS level
  7. The date of each work
  8. The author of each work
  9. The sex of the author
  10. The origin of the author
  11. The genre of the work
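One way to picture the relation between the two scenarios is that the counts of a summary data frame can be derived from the rows of a full data frame. The sketch below assumes each row is a dictionary with hypothetical keys `spelling`, `lemma`, `pos`, and `work`. The three sample rows and their tags are invented.

```python
from collections import Counter

def summarize(rows):
    """Collapse full-frame rows (dicts) into counts at each level."""
    return {
        "spelling": Counter(r["spelling"] for r in rows),
        "lemma": Counter(r["lemma"] for r in rows),
        "pos": Counter(r["pos"] for r in rows),
        "lemma_pos": Counter((r["lemma"], r["pos"]) for r in rows),
        "work": Counter(r["work"] for r in rows),
    }

# Invented sample rows, for illustration only.
rows = [
    {"spelling": "loved", "lemma": "love", "pos": "verb", "work": "Work A"},
    {"spelling": "lov'd", "lemma": "love", "pos": "verb", "work": "Work A"},
    {"spelling": "love", "lemma": "love", "pos": "noun", "work": "Work B"},
]
summary = summarize(rows)
```

Here the lemma 'love' is counted three times, the LemmaPOS combination of 'love' as a verb twice, and each spelling once. Work-level metadata such as date, author, and genre would be attached to each work separately.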

Postprocessing search results

The fundamental point about the data frame returned by a search is that it allows the subsequent manipulation of data by all the criteria in the data store, whether or not they were specified in the search. Some of these criteria are "factors" in statistical parlance, others are "counts". Genre, sex, author, origin, and date are factors. So are lemma and part of speech. A key operation at the postprocessing stage consists of grouping returns by one or more factors and performing counts or other simple mathematical operations on the resultant groups.

One way of looking at the postprocessing stage is to think of it as an arbitrary shuttling between "group by" operations of a SQL database and simple statistical routines of a statistical/graphics package. The "group by" operations let you aggregate or subset your data and construct new data frames from the data frame that was originally requested. These become the inputs for various operations.
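A rough sketch of this shuttling, under the assumption that the data frame is held on the client as a list of dictionaries: one helper subsets the rows, another groups them by any number of factors and counts each group. The helper names and the sample rows are hypothetical.

```python
from collections import Counter

def subset(rows, **conditions):
    """Keep only rows whose factors equal the given values."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]

def group_count(rows, *factors):
    """Count rows per combination of factor values, like SQL GROUP BY with COUNT(*)."""
    return Counter(tuple(r[f] for f in factors) for r in rows)

# Invented sample rows, for illustration only.
rows = [
    {"lemma": "liberty", "genre": "novel", "sex": "F"},
    {"lemma": "liberty", "genre": "novel", "sex": "M"},
    {"lemma": "liberty", "genre": "poem", "sex": "F"},
    {"lemma": "freedom", "genre": "novel", "sex": "F"},
]
by_genre_sex = group_count(rows, "genre", "sex")
novels_by_lemma = group_count(subset(rows, genre="novel"), "lemma")
```

The second call shows the point of the data frame: the rows can be regrouped by a factor (lemma within novels) that played no part in the first grouping, without another trip to the server.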

Descriptive statistics for search results

The most useful of these operations belong to the world of elementary descriptive statistics and provide information about frequency and distribution. Key concepts of descriptive statistics, such as count, relative frequency, average, and standard deviation, should be treated as important elements to be exposed transparently to users.

Some searches result in hit lists that can be taken in at a glance or are sorted and grouped in the user's memory without any assistance. Searches that result in more than two dozen hits will benefit from formal groupings and quantitative manipulations.

Statistics textbooks typically begin with a section on "descriptive statistics," very simple procedures of counting results, computing medians or averages, and putting the data on some range of 'quantiles' to give you a sense of scale and a sense of what is high, low, or in the middle. A simple quantitative overview is often good enough.

The data relevant to such a simple quantitative overview consist of

  1. The total number of words in a 'container', such as a work or an aggregate of works by author, genre, time, or sex of author
  2. The count of a word in a container (term frequency)
  3. The number of documents in which a word occurs (document frequency)
  4. The count of a word in the entire collection or largest relevant container (collection frequency)

Relative frequency and standardized scores

Relative frequency is the simplest thing to compute: you divide the count of the search term(s) by the total number of tokens. Because most words are rare, frequency figures per 10,000 words are easier to read than percentage figures.
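The arithmetic is simple enough to state as code. The figures in the usage note are invented for illustration, and the function names are hypothetical.

```python
def per_10k(count: int, total_tokens: int) -> float:
    """Occurrences per 10,000 tokens."""
    return count * 10_000 / total_tokens

def percent(count: int, total_tokens: int) -> float:
    """The same rate expressed as a percentage."""
    return count * 100 / total_tokens
```

A word that occurs 50 times in a work of 20,000 tokens has a frequency of 25 per 10,000 words, which is easier to take in than the equivalent 0.25 percent.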

Standardized figures or z-scores are harder to compute but extend the range of comparison. In order to standardize counts or frequencies, you need to have a range of observations that let you compute an average and a standard deviation. In the standardized score the average value is converted to zero and other values are expressed as the number of standard deviations from zero. Values between +1 and -1 are largely unremarkable, but +2 takes you into the world of "very tall" and -2 into a world of "quite short."
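A minimal sketch of the computation follows, assuming the observations are per-work frequencies of a single word. The three frequencies are invented, and the use of the sample standard deviation rather than the population standard deviation is an assumption that an implementer would have to settle.

```python
from statistics import mean, stdev

def z_scores(freqs):
    """Map each observation to its distance from the mean in standard deviations."""
    m = mean(freqs.values())
    s = stdev(freqs.values())  # sample standard deviation; needs at least two values
    return {work: (f - m) / s for work, f in freqs.items()}

# Invented per-10,000 frequencies of one word in three works.
freqs = {"Work A": 30.0, "Work B": 40.0, "Work C": 50.0}
scores = z_scores(freqs)
```

With a mean of 40 and a standard deviation of 10, the three works score -1, 0, and +1. None of them would count as remarkable by the rule of thumb above.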

Consider the following descriptive statistics for 'come' and 'think' in Othello:

word     raw count   frequency per 10,000 words   standardized value
come     119         45.2                         0.15
think    86          32.7                         2.73

The raw count, relative frequency, and standardized value together provide a fair amount of information that is within the grasp of eighth grade math. The raw counts assure you that there are enough occurrences to make statistical inferences reasonable. The relative frequencies tell you about the rate of occurrences. The standardized value of 0.15 tells you that the use of 'come' stays very close to the Shakespearean average but the value of 2.73 for 'think' identifies that word as something of an outlier.

Standardized values are suspect because they assume a normal distribution, which is not the case for language. On the other hand, they work well enough in practice, and they are very easy to interpret.

What Minitab and similar programs call Basic Statistics provide a very useful survey of behaviour or variance for common words. If you see that in Shakespeare's plays and poems the average frequency for 'the' is 323 per 10,000 words, but the standard deviation is 45, a moment's reflection will tell you that the variance of this ubiquitous word is approximately the same as that of people's weight: in a room full of people you're not surprised to see some people who weigh twice as much as others.

Thus figures like

word     count   average frequency per 10,000 words   standard deviation
come     3787    43.2                                 13.9
think    1467    16.6                                 5.8

are useful in telling you about the behaviour of common verbs in Shakespeare. Not surprisingly, their variance is greater than that of the article, but they are quite similar to each other.

For a student with minimal statistical skills, the lesson learned from such very basic quantitative data is that linguistic behaviour varies considerably and that you should be careful not to make too much of small differences.

By way of postscript, Ian Ayres in his recent and deeply interesting Super Crunchers tells us with some paternal pride how he had taught his middle school daughter to think of the 2SD rule (two standard deviations) as her friend, and how it is a useful rough approximation for getting an initial grasp of data even where the data do not follow a normal distribution, as they certainly do not in language.

Inverse document frequency

Inverse document frequency is an alternative and may be a better way of providing basic quantitative orientation about usage of a word. There are a number of formulas that work with the basic ingredients of term frequency, document frequency, and collection frequency. Tf-idf (term frequency-inverse document frequency) values can be easily computed from the inputs of the full or summary data frame. The main function of these statistics, however, is to identify salient content words. It is unclear how successful they are at identifying differences in the usage of common words, which may be an important feature of stylistic analysis.
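As an illustration only, here is one common variant among the several formulas: term frequency as a share of the document's tokens, multiplied by the natural logarithm of the ratio of total documents to documents containing the term. The choice of variant, the function name, and the figures in the usage note are assumptions.

```python
import math

def tf_idf(term_count: int, doc_length: int,
           docs_with_term: int, total_docs: int) -> float:
    """One common tf-idf variant: (count / length) * ln(N / df)."""
    tf = term_count / doc_length
    idf = math.log(total_docs / docs_with_term)
    return tf * idf
```

A word that occurs 5 times in a 1,000-word work and appears in only 1 of 10 works scores about 0.0115. A word that appears in all 10 works scores zero however often it occurs, which illustrates why this family of measures may say little about differences in the usage of common words.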

Visualizing search results

Visualizing a collection

With three dimensions of size, colour, and patterning one can represent the document space of a collection by assigning to each work a number of pixels according to size, a colour according to genre, and some pattern to distinguish by origin or sex of author.

Words over time

Keywords are likely to differ in their frequency across genre, sex, or origin. The same graphics that trace a bundle of stock prices over time can be used to track words, for instance 'liberty' in American or English novels by male or female writers by decade.

In such a graphic one probably has to stick to averages by decade so as not to overload the screen with data. But if you track by a single factor over time, you might want to use a variant of the box plot. In this case, you see the variance and outliers for each decade.

If you track by quarter century, it may be possible to track by two factors, so that male and female values are distinguished.
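The figures behind such graphics can be sketched as follows, assuming each observation is a pair of publication year and frequency per 10,000 words for one work. The helper names are hypothetical, and the "inclusive" quartile method is only one of several conventions a box plot might use.

```python
from collections import defaultdict
from statistics import mean, quantiles

def by_decade(observations):
    """Group (year, frequency) pairs into lists keyed by the start of the decade."""
    groups = defaultdict(list)
    for year, freq in observations:
        groups[year // 10 * 10].append(freq)
    return dict(groups)

def decade_averages(observations):
    """One average per decade, suitable for a simple line chart."""
    return {decade: mean(freqs) for decade, freqs in by_decade(observations).items()}

def five_number(values):
    """Minimum, quartiles, and maximum: the figures a box plot draws.

    Needs at least two values.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)
```

For a quarter-century view, the grouping key would become `year // 25 * 25`, and a second factor such as sex of author could be added to the key so that male and female values are kept apart.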

Document generated by Confluence on Apr 19, 2009 15:04