This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

2007/09/25 Analytics Cell Meeting Minutes

Present: Loretta Auvil, Phil Burns (Pib), Tanya Clement, Vered Goren,
Martin Mueller, Sara Steger, Steve Ramsay.

Martin suggested a disconnect exists between D2K knowledge at NCSA
and the ability of the non-NCSA technical and non-technical staff to
use and understand D2K. The main problem is that the existing
documentation is inadequate for programmers unless one already knows
D2K. There also seems to be a lack of adequate high-level
documentation for use by non-programmers. As examples, Martin stated
that he does not clearly understand what D2K does or what it
provides, and Bill Parod has sometimes had trouble determining what
kind of data his code should present to D2K itineraries.

Loretta responded that D2K offers a general platform for writing
almost any computational method. D2K currently includes a number of useful
predefined methods. Loretta will work with Bill to answer his
questions.

Martin recommended more D2K high-level documentation and analysis
descriptions for scholars like Martin and Sara.

Steve asked whether we want to document the D2K itineraries, the
algorithms used by D2K modules, or the output of specific
algorithms. Probably we want all of these documented, as well as
tutorial information for programmers beyond the javadoc level. Pib
noted that such documentation, while necessary, is time-consuming to
produce. Everything except the code-level documentation can (and
probably should) be written by people other than programmers.
Martin and Sara volunteered to help write the high-level
documentation.

Pib suggested collating all the existing D2K documents in a single
location in the Monk repository. This would allow everyone to see
exactly what currently exists, what is missing, and what could be
improved.

Martin asked whether the existing D2K code in Nora/Monk extracts the
data it needs from the XML texts or from a database. Loretta responded that
the existing itineraries extract data from the NoraDB database rather
than the XML.

Sara asked whether the D2K algorithm can work with spelling/part-of-speech
combinations. Loretta said yes: you can present the D2K algorithm
with a column of counts of such combinations. Loretta suggested it
would be useful to allow selecting specific parts of speech or word
classes. Sara noted she would be interested in selecting just words
that are adjectives.
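
As an illustration of the counts discussed above, here is a minimal
Python sketch (not D2K code). The function name, the sample tokens,
and the tag names such as "JJ" for adjectives are all assumptions made
for the example; they are not taken from D2K, NoraDB, or Monk.

```python
from collections import Counter

# Hypothetical input: (spelling, part-of-speech) pairs for one text.
# The tag names ("JJ" for adjective, and so on) are assumed for illustration.
tokens = [
    ("green", "JJ"), ("fields", "NNS"), ("green", "JJ"),
    ("green", "NN"), ("old", "JJ"), ("run", "VB"),
]


def count_combinations(tagged_tokens, keep_pos=None):
    """Count spelling/part-of-speech pairs, optionally keeping only some tags."""
    return Counter(
        (spelling, pos)
        for spelling, pos in tagged_tokens
        if keep_pos is None or pos in keep_pos
    )


# Counts for every spelling/part-of-speech combination.
all_counts = count_combinations(tokens)

# Counts restricted to adjectives only.
adjective_counts = count_combinations(tokens, keep_pos={"JJ"})
```

Each distinct combination would become one column of counts; the
optional filter corresponds to selecting a single word class.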

Martin suggested a Naive Bayes implementation should allow you to
select as many features as possible from the underlying data. This
led to an extended discussion of what features can be selected in the
existing NoraVis application, how NoraVis uses D2K to perform Naive
Bayes analyses, and how to extend this for Monk. NoraVis currently
allows a model containing 2-grams and 3-grams of spellings. Loretta
suggested extending this to handle lemmata, synonyms and antonyms,
and so on. Such extensions simply require different summary data
than NoraDB currently produces.
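
To make the n-gram features concrete, a minimal Python sketch follows.
It is not NoraVis or D2K code; the helper name and the sample word
lists are hypothetical. It shows that the same routine yields 2-grams
and 3-grams whether it is fed spellings or lemmata, so only the input
summary data changes.

```python
def ngrams(items, n):
    """Return the list of n-grams (as tuples) over a sequence of items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Hypothetical sample data: the same four words as spellings and as lemmata.
spellings = ["call", "me", "ishmael", "some"]
lemmata = ["call", "I", "ishmael", "some"]

bigrams = ngrams(spellings, 2)       # 2-grams of spellings
trigrams = ngrams(spellings, 3)      # 3-grams of spellings
lemma_bigrams = ngrams(lemmata, 2)   # same routine, different input data
```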

Loretta stated that all D2K routines use tables in more or less the
same format. It should be feasible to define a small set of methods to
extract the needed tabular data from the data store and provide it to
the D2K modules in a consistent D2K-compatible tabular format.
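
A rough Python sketch of one such extraction method is given below.
The actual D2K table format is not described in these minutes, so the
layout used here (one row per work, one column per feature, zero for
absent features) is an assumption, as are the function name and the
work identifiers.

```python
def to_table(counts_by_work):
    """Turn {work_id: {feature: count}} into (column_names, rows).

    Each row is [work_id, count_1, count_2, ...], with features in sorted
    order and 0 for any feature a work does not contain.
    """
    features = sorted({f for counts in counts_by_work.values() for f in counts})
    columns = ["work_id"] + features
    rows = [
        [work_id] + [counts.get(f, 0) for f in features]
        for work_id, counts in sorted(counts_by_work.items())
    ]
    return columns, rows


# Hypothetical summary data for two works.
columns, rows = to_table({
    "work_b": {"green": 1, "old": 4},
    "work_a": {"green": 3},
})
```

Whatever the real format turns out to be, the point is that every
analysis receives its data through the same few table-building calls.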

Martin asked if D2K should access the data store directly rather than
through an interface. Loretta responded that it is possible to write
data-wrapper D2K modules, but it may prove faster and easier to use
database-specific methods. Pib suggested this is an implementation
issue that should be left up to the data cell. He suggested keeping
the D2K modules independent of the particulars of the data store.
The D2K itineraries in Monk should access the data store only through
well-defined interface methods and not via direct database calls.
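
The separation Pib described might look roughly like the following
Python sketch. All class and method names are hypothetical; nothing
here is an actual Monk or D2K interface. The analysis function depends
only on the abstract interface, so the data cell could replace the
in-memory stand-in with a database-backed implementation without
touching the analysis code.

```python
from abc import ABC, abstractmethod


class FeatureSource(ABC):
    """Hypothetical interface between analysis code and the data store."""

    @abstractmethod
    def feature_counts(self, work_id):
        """Return {feature: count} for one work."""


class InMemorySource(FeatureSource):
    """Stand-in implementation; a real one might query a database instead."""

    def __init__(self, data):
        self._data = data

    def feature_counts(self, work_id):
        # Return a copy so callers cannot modify the stored data.
        return dict(self._data.get(work_id, {}))


def total_count(source, work_id):
    """Analysis-side code: uses only the FeatureSource interface."""
    return sum(source.feature_counts(work_id).values())


source = InMemorySource({"work_a": {"green": 3, "old": 1}})
total = total_count(source, "work_a")
```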

Martin asked if the D2K tabular format would be helpful in
implementing any of the sort/search facilities envisioned for Monk.
Steve responded that D2K is not really useful for that. Sort and
search should be handled at the data store level.

Document generated by Confluence on Apr 19, 2009 15:04