This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

In the context of scholarly text analysis in the humanities, the fundamental analytic is something we call Search and Sort, which is both like and unlike "Googling." Like Googling, it is a "find" operation in which you enter one or more search terms and evaluate the results. In Googling you want to identify the shortest list of top hits in the quickest time, and that is a common operation in scholarly inquiry as well. But Search and Sort also includes a different mode of operation, in which you assemble data by some combination of criteria and then work your way through them to look for patterns of various kinds. The concept of "top hits" is not especially relevant to this kind of inquiry, which is both iterative and ruminative.

The claim that Search and Sort is the fundamental analytic rests on at least four arguments:

  1. Whatever else users will do, they will all use Search and Sort as an important tool.
  2. Many users will be satisfied with a relatively straightforward find operation. This may not excite the developer as a design challenge, but it is all-important to the user.
  3. Sophisticated users will use combinations of regular expression and metadata searches for exploratory data analysis.
  4. The results of 'aggregate analytics' such as Naive Bayes or other text mining routines will in nearly all cases require detailed analysis of how particular features or criteria contribute to a statistical result. This cannot be done without sophisticated Search and Sort routines.

A good Search and Sort implementation depends on the ability to

  1. formulate search criteria based on arbitrary combinations of search terms in the text as well as in the metadata
  2. group and sort the results by arbitrary combinations of the same search terms

Search criteria fall into the broad categories of

  1. regular expression searches
  2. bibliographical metadata about the work as a whole
  3. linguistic metadata generated by part-of-speech tagging and lemmatization
  4. frequency and distributional data created in the act of linguistic annotation and aggregated appropriately
  5. structural metadata about works.
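
The way these categories combine in a single query can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MONK implementation: the Token record, its field names, the search function, and the sample data are all hypothetical, invented here to show one field per category and a regular expression combined with metadata filters.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical record for one word occurrence (field names are assumptions)."""
    spelling: str    # surface form, the target of regular expression searches
    lemma: str       # linguistic metadata from lemmatization
    pos: str         # linguistic metadata from part-of-speech tagging
    work: str        # bibliographical metadata about the work as a whole
    division: str    # structural metadata, e.g. a chapter or an act
    work_count: int  # frequency of the lemma within the work


def search(tokens, pattern, **metadata):
    """Return tokens whose spelling fully matches `pattern` and whose
    fields equal every keyword argument given."""
    rx = re.compile(pattern)
    return [
        t for t in tokens
        if rx.fullmatch(t.spelling)
        and all(getattr(t, key) == value for key, value in metadata.items())
    ]


# Invented sample data, for illustration only.
TOKENS = [
    Token("loved", "love", "verb", "work_a", "ch1", 2),
    Token("loves", "love", "noun", "work_a", "ch2", 2),
    Token("loved", "love", "verb", "work_b", "ch1", 1),
    Token("glove", "glove", "noun", "work_a", "ch2", 1),
]

# Regular expression on the text plus two metadata criteria.
hits = search(TOKENS, r"lov\w*", pos="verb", work="work_a")
print([(t.spelling, t.division) for t in hits])
```

Because every criterion is just a field on the same record, any combination of them can be passed to the same function, which is the property the two requirements above depend on.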

Details about bibliographical metadata are spelled out in the Proposal for Metadata about works (October 19, 2007).

The key feature of Search and Sort is that the data retrieved in the first search step can subsequently be manipulated by any combination of the criteria that were available for the search in the first place.
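
A minimal sketch of that second step, assuming the first search has already returned a list of rows: the same rows are regrouped and re-sorted by different combinations of criteria without running the search again. The function group_and_sort, the column names, and the sample rows are hypothetical.

```python
from collections import Counter


def group_and_sort(rows, keys):
    """Count rows per combination of `keys`, sorted by descending count
    and then by the key values themselves."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented result set from an earlier search, for illustration only.
RESULTS = [
    {"lemma": "love", "pos": "verb", "work": "work_a"},
    {"lemma": "love", "pos": "noun", "work": "work_a"},
    {"lemma": "love", "pos": "verb", "work": "work_b"},
    {"lemma": "love", "pos": "verb", "work": "work_a"},
]

# The same retrieved data viewed two ways.
by_work = group_and_sort(RESULTS, ["work"])
by_work_and_pos = group_and_sort(RESULTS, ["work", "pos"])
print(by_work)
print(by_work_and_pos)
```

Each call is cheap and leaves the retrieved rows untouched, which suits the iterative, ruminative mode of inquiry described earlier.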

What a search returns to the user is a 'data frame' in a 'long data format', to use terminology from Harald Baayen's Analyzing Linguistic Data: a tabular representation in which every search criterion appears as a column. Such a data frame becomes the input for the MONK interface, but it may also be exported to third-party spreadsheets, statistical programs, or visualization tools, whether Excel, Minitab, or ManyEyes.
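
As a rough illustration of such a data frame, the sketch below holds one row per hit and one column per criterion, and writes the rows out as CSV, a format that spreadsheets and statistical programs can generally import. The column names, the to_csv helper, and the sample rows are assumptions made for this example, not part of any MONK specification.

```python
import csv
import io

# Hypothetical column set: one column per search criterion.
COLUMNS = ["work", "division", "spelling", "lemma", "pos"]


def to_csv(rows, columns=COLUMNS):
    """Serialize long-format rows (one dict per hit) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented rows, for illustration only.
ROWS = [
    {"work": "work_a", "division": "ch1", "spelling": "loved",
     "lemma": "love", "pos": "verb"},
    {"work": "work_b", "division": "ch3", "spelling": "loves",
     "lemma": "love", "pos": "verb"},
]

exported = to_csv(ROWS)
print(exported)
```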

Partial models for Search and Sort in MONK are the search page of Philologic (http://www.lib.uchicago.edu/efts/ARTFL/philologic/), which is very strong on complex query formulation, and the Find Words feature of WordHoard (http://wordhoard.northwestern.edu), which is very strong on letting users group and sort search results in an iterative fashion.

The structure of the data frame for the initial return of search results varies with the size of the result set. If the number of hits is below some threshold (still to be determined but probably between 1,000 and 3,000), the data frame will return individual locations and KWIC information in the form of ~35 characters before and after each hit. If the number of hits exceeds that threshold, the data frame will return aggregate information.
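
The switch between the two shapes of result might look something like the following sketch. The threshold value of 2,000 is an arbitrary pick from the range mentioned above, the treatment of a result set exactly at the threshold is an assumption, and the function names and the form of the aggregate rows (a count per matched form) are hypothetical.

```python
import re
from collections import Counter

KWIC_THRESHOLD = 2000  # assumption: the real value is still to be determined
CONTEXT_CHARS = 35     # characters of context on each side of a hit


def kwic(text, pattern, width=CONTEXT_CHARS):
    """Return one row per hit with its location and surrounding context."""
    rows = []
    for match in re.finditer(pattern, text):
        rows.append({
            "offset": match.start(),
            "left": text[max(0, match.start() - width):match.start()],
            "hit": match.group(0),
            "right": text[match.end():match.end() + width],
        })
    return rows


def search_frame(text, pattern, threshold=KWIC_THRESHOLD):
    """Return KWIC rows for small result sets, aggregate counts otherwise."""
    rows = kwic(text, pattern)
    if len(rows) < threshold:
        return {"mode": "kwic", "rows": rows}
    # Assumed aggregate shape: one row per distinct matched form.
    counts = Counter(row["hit"].lower() for row in rows)
    return {
        "mode": "aggregate",
        "rows": [{"hit": form, "count": n} for form, n in counts.most_common()],
    }


sample = "To be, or not to be"
small = search_frame(sample, r"\b[Tt]o\b")
large = search_frame(sample, r"\b[Tt]o\b", threshold=2)  # force aggregation
print(small["mode"], large["mode"])
```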

Random sampling from large result sets is also a feature of Search and Sort.
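
One simple way to draw such a sample is shown below, as a sketch only. The sample_hits function and its seed parameter are hypothetical; a fixed seed is used here on the assumption that a scholar may want to return to the same sample later, and the sampled hits are kept in their original order so they still read as a sequence through the text.

```python
import random


def sample_hits(hits, k, seed=None):
    """Return up to `k` hits chosen uniformly at random, in original order."""
    if k >= len(hits):
        return list(hits)
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    chosen = sorted(rng.sample(range(len(hits)), k))
    return [hits[i] for i in chosen]


# Invented hit identifiers standing in for a large result set.
all_hits = [f"hit_{i}" for i in range(10000)]
subset = sample_hits(all_hits, 100, seed=42)
print(len(subset))
```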

The challenges of translating the requirements of Search and Sort into a user-friendly interface have been discussed with the Interface group at length and appear to be well understood.

Document generated by Confluence on Apr 19, 2009 15:04