|
This page last changed on Feb 22, 2008 by martinmueller@northwestern.edu.
2007/09/11 Analytics Cell Meeting Minutes
Present: Phil Burns, Tanya Clement, Martin Mueller, Sara Steger, Steve
Ramsay.
Sara has been working with the NCF tests in the WordHoard environment.
Martin suggested we examine the tool sets and analytic operations
proposed by the interface group and identify data cell needs for
those operations. We began by considering "search and sort."
Steve noted that search and sort are fundamental operations for both text
analysis and web page searches. However, the applications differ. A web
search seeks the top matching results; that is, the results of interest
generally appear on the first page. If they do not, the query is probably
best reformulated. Rarely does the list of web search results feed further
analyses.
In text analysis, the search results may themselves form the basis for
further analysis in an iterative fashion. The most interesting results
may well appear in the middle or at the end of the results list. There
is not necessarily anything like a top match. Anomalies can also appear
anywhere in the results list. The ability to sort and filter the results
in many different ways should be a key feature of Monk search and sort.
As an example, Tanya expressed her frustration at the inability of
FeatureLens to provide easy access to arbitrary portions of the results.
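As a rough illustration of the kind of flexibility discussed here, the following sketch sorts and filters a small result list and then pulls out an arbitrary window from the middle of it. The field names, the sample rows, and the helper functions (sort_hits, filter_hits, page) are all hypothetical; nothing here reflects an actual Monk or FeatureLens interface.

```python
from operator import itemgetter

# Hypothetical search hits; field names and values are illustrative only.
hits = [
    {"word": "love", "work": "Work A", "year": 1813, "count": 42},
    {"word": "love", "work": "Work B", "year": 1847, "count": 7},
    {"word": "hate", "work": "Work A", "year": 1813, "count": 3},
    {"word": "hate", "work": "Work C", "year": 1851, "count": 19},
]


def sort_hits(rows, keys, descending=False):
    """Return a new list of rows sorted by one or more fields."""
    return sorted(rows, key=itemgetter(*keys), reverse=descending)


def filter_hits(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def page(rows, start, size):
    """Return an arbitrary window of the results, not just the top."""
    return rows[start:start + size]


by_count = sort_hits(hits, ["count"], descending=True)
middle = page(by_count, 1, 2)                       # rows 2 and 3 of the sorted list
rare = filter_hits(hits, lambda r: r["count"] < 10)  # low-frequency hits only
```

The point of page() is simply that any portion of the list is as easy to reach as the first screen.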
An analysis selects data from a corpus or set of corpora using
bibliographic, morphological, and/or derived data such as counts.
The results of an analysis can be stored internally as a kind of
"hypercube" or "data frame" of multidimensional tabular data.
Spreadsheet-like slices should be extractable, displayable, and
manipulable.
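One way to picture the "hypercube" and its spreadsheet-like slices is sketched below, assuming a cube with three dimensions (word, work, decade) holding counts. The dimension names, the data, and the slice_cube helper are assumptions made for illustration, not a proposed storage format.

```python
from collections import defaultdict

# Hypothetical cube: (word, work, decade) -> count.
DIMS = ("word", "work", "decade")
cube = {
    ("love", "Work A", 1810): 42,
    ("love", "Work B", 1840): 7,
    ("hate", "Work A", 1810): 3,
    ("hate", "Work C", 1850): 19,
    ("love", "Work C", 1850): 11,
}


def slice_cube(cube, row_dim, col_dim):
    """Collapse the cube to a two-dimensional table.

    Counts are summed over every dimension other than row_dim and col_dim.
    """
    r, c = DIMS.index(row_dim), DIMS.index(col_dim)
    table = defaultdict(lambda: defaultdict(int))
    for key, count in cube.items():
        table[key[r]][key[c]] += count
    return {row: dict(cols) for row, cols in table.items()}


word_by_decade = slice_cube(cube, "word", "decade")
work_by_word = slice_cube(cube, "work", "word")
```

The same stored data yields different two-dimensional tables depending on which pair of dimensions is chosen.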
Steve noted that while two-dimensional tables offer a customary and familiar
representation, we do not want the interface group to feel constrained to
present only this type of display. In particular, we want to suggest
presenting graphical displays when feasible. Some of these visualizations
will aid exploration of the results, while others will provide simpler
summary results than long lists of numbers.
Tanya noted that the scatterplots provided by FeatureLens were
often more illuminating than the lists of numeric results.
When the results list is small – a single response or less than a
screenful – eyeballing the results may be sufficient.
As the results list grows, scrolling through it may be unrevealing,
unless the list can be reorganized according to various criteria – i.e.,
the result hypercube can be sliced differently, and/or different
summary measures presented. For gigantic result sets (where gigantic
is not well defined at the moment), the initial display should be
a summary, numeric or visual, which allows further exploration of
subsets of interest. One simple approach is to display a random
sample of the large result set. Another is to limit the results to
those associated with words from a user-specified asset list. We may want
to allow user-defined specification of the values which designate the
switchover from full results display to summary display.
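The switchover described above might look something like the following sketch. The threshold (full_limit), the sample size, and the function name choose_display are hypothetical, user-adjustable values chosen only for illustration; "gigantic" remains undefined, as noted.

```python
import random


def choose_display(results, full_limit=50, sample_size=20,
                   word_list=None, seed=None):
    """Decide whether to show all results or a random sample.

    results   -- list of (word, count) pairs
    word_list -- optional user-specified set of words to restrict to
    Returns a (mode, rows) pair where mode is "full" or "sample".
    """
    if word_list is not None:
        results = [r for r in results if r[0] in word_list]
    if len(results) <= full_limit:
        return "full", list(results)
    rng = random.Random(seed)  # a seed makes the sample repeatable
    return "sample", rng.sample(results, min(sample_size, len(results)))


# Hypothetical result set of 200 (word, count) pairs.
results = [(f"w{i}", i) for i in range(200)]

mode_big, shown_big = choose_display(results, seed=1)
mode_small, shown_small = choose_display(results[:10])
mode_list, shown_list = choose_display(results, word_list={"w1", "w2"})
```

A visual or numeric summary could replace the random sample in the "sample" branch without changing the switchover logic.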
Given a list of result words, one might want to expand outwards and
obtain bibliographic information about those words. This leads to
further analyses, in an iterative fashion, with data being pruned and
joined in different ways at each step. We may want to allow saving
the intermediate data as well.
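A minimal sketch of such an iterative chain follows, assuming result words are linked to works by an identifier and that bibliographic data is available per work. The identifiers, titles, years, and the steps dictionary used to save intermediate data are all invented for illustration.

```python
# Hypothetical hits: (word, work identifier).
hits = [("love", "w1"), ("love", "w2"), ("hate", "w1"), ("hate", "w3")]

# Hypothetical bibliographic data keyed by work identifier.
biblio = {
    "w1": {"title": "Work A", "year": 1813},
    "w2": {"title": "Work B", "year": 1847},
    "w3": {"title": "Work C", "year": 1851},
}

steps = {}  # intermediate data saved under a name for each step

# Step 1: expand outwards by joining each hit to its bibliographic record.
steps["joined"] = [
    {"word": word, "work_id": work_id, **biblio[work_id]}
    for word, work_id in hits
]

# Step 2: prune to works from 1840 onwards.
steps["pruned"] = [row for row in steps["joined"] if row["year"] >= 1840]

# Step 3: re-sort the pruned data by title, then word.
steps["sorted"] = sorted(steps["pruned"],
                         key=lambda row: (row["title"], row["word"]))
```

Because each step is kept under its own name, any earlier stage can be revisited or branched from later.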
We agreed that the same search and sort operations used to specify the
initial analysis should be available for further processing of the results.
We should reemphasize to the interface group the fundamental importance
of flexible sort and search facilities.
Another one of the analysis categories proposed by the interface group
is "create a time line." Martin asked whether such a chronological analysis
is just a sorting procedure. Here one of the dimensions of the data hypercube
has a well-defined order, namely date/time. Such data leads naturally
to a timeline display. Pib noted that it also offers the possibility
of longitudinal statistical analyses using log-linear models or generalized
linear models.
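To illustrate the sorting view of a timeline, the sketch below orders dated hits along the time dimension and counts them per year or per decade. The data and the timeline helper are hypothetical, and the longitudinal statistical models mentioned above are not attempted here.

```python
from collections import Counter

# Hypothetical dated hits: (year, word).
dated_hits = [
    (1851, "hate"), (1813, "love"), (1847, "love"),
    (1813, "hate"), (1851, "love"),
]


def timeline(hits, bucket=1):
    """Count hits per time bucket and return them in chronological order.

    bucket=1 groups by year, bucket=10 by decade, and so on.
    """
    counts = Counter((year // bucket) * bucket for year, _ in hits)
    return sorted(counts.items())


by_year = timeline(dated_hits)
by_decade = timeline(dated_hits, bucket=10)
```

In this view the timeline is a sort on the date dimension combined with a summary count, which is the raw material a timeline display would plot.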
|