|
This page last changed on Aug 22, 2007 by sramsay@unlserve.unl.edu.
Most of the searching and sorting that occurs in a digital context is prefatory to some more substantive activity. We enter keywords into Google and click on the results so we can then go read the documents. We sort the files on our computers (alphabetically or by most recent modification date, for example) so that we can more easily locate the file we're interested in. Because these are practical matters, we tend to judge the efficacy of the interfaces and algorithms involved in similarly practical terms. We want searches to be fast, and we want the results to be relevant to the task at hand. We want sorting mechanisms to be uncomplicated and the interaction brief.
There will undoubtedly be elements of the MONK system that demand these kinds of interactions. For example, a user might simply want to find out which documents contain a certain phrase, or might want to view documents by date of composition, perhaps as a prelude to some other analytical procedure. But searching and sorting have a slightly more exalted role in text analysis. There are many contexts in which the accuracy and visual tractability of the preliminary search or sort are of critical importance to the subsequent analysis, often consuming considerably more of the user's attention than the analysis itself. Analytical results might likewise be meaningless without the ability to make output data tractable through further searching and sorting. Finally, there are circumstances in which searching and sorting are neither a matter of efficacy nor a prelude to something else, but the primary algorithmic substance of the analysis. For all these reasons, searching and sorting need to be understood as first-class analytical procedures in their own right, deserving of the same attention we give to things like classification algorithms and network visualizations.
Searching and sorting are cognate activities, insofar as one is rarely seen without the other. But when thinking about searching and sorting in the context of text analysis, it is useful to break the activity into its constituent parts.
Query
In the practical circumstances outlined above, users are usually able to get by with simple keywords, and indeed, keyword searching is probably adequate for many of the ordinary tasks within MONK. However, more advanced operations require a query tool of uncommon power and flexibility. At the very least, the user should be able to:
- Undertake proximity searches with user-specified ranges at arbitrary levels of granularity: for example, x within 10 characters (or words, or sentences) of y.
- Use regular expressions, by which we mean not a limited set of wildcard or globbing operators, but symbolic expressions that can match any arbitrary string. MONK's query tool should support at least the operators found in the traditional UNIX standard, but should ideally support the Perl-compatible or POSIX extended set.
- Search the metadata of a document, and combine queries on the text content with queries on the XML structure of a document. In other words, it should be possible to constrain proximity, regex, and keyword searches to "only within paragraphs," "only within chapters," or any other structure in teisimple. Ideally, MONK would be able to incorporate any arbitrary XPath expression as a constraint on a query, even allowing such constraints as "only in the last paragraphs of the documents" or "only in opening sentences."
- Perform compound searches in which two or more distinct queries are combined to form an intersection or disjunction (the latter being important for comparative work).
(XQuery is probably the logical choice here, both as an interface option for advanced users and as a messaging format for a more abstract UI.)
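To make the combination of requirements above concrete, here is a minimal sketch in Python (not XQuery, and not a proposal for MONK's actual implementation) of a regex-based proximity search that is constrained to a structural unit. The sample document, its element names, and the function names are all invented for illustration. Python's standard ElementTree supports only a limited subset of XPath, so a real system would need a fuller XPath or XQuery engine.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = """<text>
  <div type="chapter">
    <p>The sad merchant sighed. His ships were late and the sea was rough.</p>
    <p>Nobody spoke of ships that evening.</p>
  </div>
</text>"""

WORD = re.compile(r"\w+")


def proximity_hits(text, pattern_x, pattern_y, max_words):
    """Return (x_word, y_word, distance) for every pair of words matching
    the two regexes that lie within max_words words of each other."""
    words = WORD.findall(text)
    rx, ry = re.compile(pattern_x), re.compile(pattern_y)
    xs = [i for i, w in enumerate(words) if rx.fullmatch(w)]
    ys = [i for i, w in enumerate(words) if ry.fullmatch(w)]
    return [
        (words[i], words[j], abs(i - j))
        for i in xs
        for j in ys
        if i != j and abs(i - j) <= max_words
    ]


def search_in(xml_source, path, pattern_x, pattern_y, max_words):
    """Run the proximity search separately inside each element selected by
    path, so that a match never straddles two structural units."""
    root = ET.fromstring(xml_source)
    results = []
    for n, element in enumerate(root.findall(path), start=1):
        text = "".join(element.itertext())
        for hit in proximity_hits(text, pattern_x, pattern_y, max_words):
            results.append((n, *hit))
    return results


# "sad" within 10 words of "ship" or "ships", only within paragraphs.
hits = search_in(SAMPLE, ".//p", r"(?i)sad", r"(?i)ships?", 10)
for paragraph_number, x, y, distance in hits:
    print(paragraph_number, x, y, distance)
```

Counting distance in words is only one of the granularities mentioned above; characters or sentences would need a different tokenizer in place of the WORD pattern.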
Sort
As with search, there is undoubtedly a place for the "list of hits" in the MONK interface. However, sorting in the context of text analytical work often demands much more sophisticated features:
- In a conventional list of hits, the data is typically sorted by relevance. In the case of Google, one rarely makes it to the third page of relevant hits before reformulating the query. In text analysis, however, the notion of relevance is quite different. It's possible that the first five hundred to a thousand hits are all "relevant" to the interpretation of the data. Likewise, the bottom of the list is almost always as important to the researcher as the top. This poses serious challenges for interface design.
- Sorted data must always be subject to further filtering: by date, by keyword, by collocate, and perhaps even by allowing the user to formulate a new query on the result set.
- Columnar data must allow for internal sorting, after the manner of an SQL ORDER BY clause with more than one sort key. So, for example, it must always be possible to sort the data by numeric value (a frequency, say), but then have all the rows that share the same numeric value sorted internally in alphabetical order.
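The internal sorting described in the last point can be sketched in a few lines of Python. The (word, frequency) rows are invented for illustration, and the two approaches shown are just common idioms, not a statement about how MONK would do it.

```python
from operator import itemgetter

# Hypothetical (word, frequency) rows, invented for illustration.
rows = [("sad", 9), ("merry", 4), ("grave", 9), ("weary", 4), ("bond", 12)]

# Equivalent in spirit to: ORDER BY frequency DESC, word ASC.
# Negating the frequency gives descending order on the first key while the
# second key stays ascending.
ordered = sorted(rows, key=lambda row: (-row[1], row[0]))

# The same result using two passes: sort by the secondary key first, then by
# the primary key. Python's sort is stable, so rows with equal frequency keep
# their alphabetical order from the first pass.
by_word = sorted(rows, key=itemgetter(0))
ordered_two_pass = sorted(by_word, key=itemgetter(1), reverse=True)

for word, frequency in ordered:
    print(frequency, word)
```

The two-pass form is the more general one, since it also works when the primary key is not a number that can be negated.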
|
I very much agree with the notion that Search and Sort is a first-order analytical procedure. In fact, I would push things a little further with the following thought experiment: If you told a group of humanities scholars that they could have either a sophisticated Search and Sort tool or all other text analytics, but not both, most of them would go for Search and Sort.
Of course, this is a silly alternative. Search and Sort is the glue that holds all other analytics together. It can be, and often is, used by itself. It may be the first exploratory step. It is likely to be a follow-up analytic exercise on the results of some other analytic, which in turn leads to yet another analytic.
A few more points on "columnar data." While I don't quite understand the innards of WordHoard, a WordHoard "Find Words" search returns a result list (invisible to the user) that can be conceptualized as a flat table including:
- The hit address
- The spelling
- The lemma
- The part-of-speech tag
- The work part
- The author
- The work
- The date
- The word before
- The word after
- The KWIC output
Once users have received this information, they can sort and group this invisible table in a variety of ways. This can be quite powerful. Sorting occurrences of 'sad' by work part shows instantly that eight of nine occurrences of that word in The Merchant of Venice occur in the opening scene.
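As a rough illustration of grouping such a flat table, here is a small Python sketch. It does not reflect WordHoard's internals, which are not described here; the Hit record uses only a few of the columns listed above, and every row is invented placeholder data rather than real search output.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One row of the flat result table (a hypothetical subset of columns)."""
    spelling: str
    lemma: str
    work: str
    work_part: str
    kwic: str


# Invented placeholder rows; not real output from any collection.
hits = [
    Hit("sad", "sad", "Work A", "1.1", "... so sad it ..."),
    Hit("sad", "sad", "Work A", "1.1", "... a sad one ..."),
    Hit("sadly", "sad", "Work A", "1.1", "... sadly gone ..."),
    Hit("sad", "sad", "Work A", "3.2", "... the sad tale ..."),
    Hit("sad", "sad", "Work B", "2.1", "... sad news ..."),
]


def group_counts(rows, field):
    """Count rows per value of the given column, largest group first,
    with ties broken alphabetically."""
    counts = Counter(getattr(row, field) for row in rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Restrict to one work, then group its hits by work part.
work_a = [h for h in hits if h.work == "Work A"]
for part, count in group_counts(work_a, "work_part"):
    print(part, count)
```

Because the grouping function takes the column name as an argument, the same table can be regrouped by work, lemma, or any other column without a new search.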
There are refinements of that model that may or may not be within MONK's reach, but here is a sketch of them:
A search operates on some subset of a collection. Hits occur in some documents but not in others. Imagine that the search result includes the word count for each document as well as the number of documents (with their total word count) in which the search term does not occur. Now the result is a "data frame" (Harald Baayen's term in his forthcoming book on R) that can be used for subsequent analysis inside or outside MONK.
In the current WordHoard, some result groupings have relative frequencies attached to them, which provides an informative sort. But it would be better if some simple analytics could be performed at run time on user-defined groupings of results, whether by author, work, decade, etc. If you have the count of hits, the word counts, and the number of documents that contain or don't contain the term, you can, I believe, standardize the results (z-score) or create a tf-idf statistic.
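Here is one way those statistics might be computed from the counts just described, as a minimal Python sketch. The per-document figures are invented, and the specific choices (frequency per 10,000 words, population standard deviation, idf as the natural log of total documents over documents containing the term) are assumptions; several other definitions of both statistics are in common use.

```python
import math
from statistics import mean, pstdev

# Invented figures: (document, hits for the search term, total word count).
docs = [
    ("doc1", 9, 20000),
    ("doc2", 2, 25000),
    ("doc3", 0, 18000),
    ("doc4", 4, 16000),
]


def relative_frequencies(rows, per=10_000):
    """Hits per `per` words for each document."""
    return {name: hits / words * per for name, hits, words in rows}


def z_scores(values):
    """Standardize a {name: value} mapping: (value - mean) / std deviation."""
    data = list(values.values())
    mu, sigma = mean(data), pstdev(data)
    if sigma == 0:
        return {name: 0.0 for name in values}
    return {name: (value - mu) / sigma for name, value in values.items()}


def tf_idf(rows):
    """tf = hits / words in the document; idf = ln(N / df), where N is the
    number of documents and df the number that contain the term."""
    n_docs = len(rows)
    df = sum(1 for _, hits, _ in rows if hits > 0)
    if df == 0:
        return {name: 0.0 for name, _, _ in rows}
    idf = math.log(n_docs / df)
    return {name: (hits / words) * idf for name, hits, words in rows}


rates = relative_frequencies(docs)
standardized = z_scores(rates)
weights = tf_idf(docs)
for name in rates:
    print(name, round(rates[name], 2), round(standardized[name], 2), weights[name])
```

The same functions apply unchanged to user-defined groupings: pass in one row per author or decade, with summed hits and word counts, instead of one row per document.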
You could also pass them to a visualization tool. Phil drew my attention to the effectiveness of box-and-whisker displays for such data. They are simple to read and they make no assumptions about the distribution of data.
Finally, for some users (probably not very many), it would be helpful to have the result list available as a tab-delimited file that can be passed to some external application.
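Such an export is simple to produce. The following Python sketch writes a result list as tab-delimited text with a header row; the column names and the two rows are invented for illustration and do not correspond to any actual MONK or WordHoard output format.

```python
import csv
import io

# Hypothetical column names, loosely following the flat table sketched above.
FIELDS = ["spelling", "lemma", "pos", "work", "word_before", "word_after"]

# Invented placeholder rows.
rows = [
    {"spelling": "sad", "lemma": "sad", "pos": "adj", "work": "Work A",
     "word_before": "so", "word_after": "it"},
    {"spelling": "sadly", "lemma": "sad", "pos": "adv", "work": "Work B",
     "word_before": "and", "word_after": "gone"},
]


def to_tsv(records, fields):
    """Serialize a list of dicts as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


tsv_text = to_tsv(rows, FIELDS)
print(tsv_text)
```

Using the csv module rather than joining strings by hand means that a field which itself contains a tab or a line break is quoted instead of corrupting the file.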

Posted by martinmueller@northwestern.edu at Aug 22, 2007 11:01
|
|