This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Agenda: Friday 12/14 - Saturday 12/15 2007
Hornbake Library - South Wing (i.e. enter on the side of the room)
Room 2119
University of Maryland College Park

Map of the building/campus: http://www.cs.umd.edu/hcil/contact/parking.shtml.

For orientation: The UMUC Marriott is the building at the very top of the map (which is west!). The Quality Inn is off the bottom left of the map, on Rt 1.

Friday December 14

8:30 coffee

9 AM: Inventory of what's been accomplished to date in the major cells: Data, Interface, Analytics (and what are highlights for this year's annual report?)

Progress Reports:

SuperCell: Most recent minutes, with update/overview, dated November 26th--

https://apps.lis.uiuc.edu/wiki/display/MONK/November+26+conference+call

Set the agenda for this meeting, mostly, in the Nov. 26th call, and raised the issues of scale/granularity...

Data: Progress report dated November 26th--

https://apps.lis.uiuc.edu/wiki/display/MONK/Data+Cell+Progress+Report+-+11.26.2007

Defined TEI-Analytics schema; developed MONK data-access; further work on MorphAdorner; processed NCF into a datastore; leveraged SEASR and the NCF datastore to work on Sara's use case (with post-processing by hand for clean-up; results will be handed off in tabular form). First run of results was not very good, but we are beefing up the training set and the post-processing filters for a second run. Currently preparing Wright American Fiction and EEBO collections.

Future plans: Rationalize the migration of TEI materials through XSL into TEI-Analytics, and then into the database. Ingest has rules for meeting data integrity: open question whether to enforce them before ingest or during ingest. There is still a lot of mapping to do on what the required elements are, who requires them, where to enforce each requirement, etc. In principle, no requirements are enforced just for the hell of it. Other questions: where do bibliographic records come from? In the header? How to make certain kinds of markup consistent (line-end hyphenation)? Downstream, what are you counting, and when are you counting it?

  • Develop monk datastore
  • Review what interface needs from datastore
  • Review what SEASR needs from datastore
  • Implement facilities to meet analytics requirements

See also the recently posted datastore demo: http://scribe.at.northwestern.edu:8090/monk/servlet

and the MorphAdorner demo, updated December 9: http://picard.at.northwestern.edu/morphadorner/

Uses and Users: No meeting minutes, and it doesn't appear that use cases have developed a lot since first posted--does this cell live? Should it?

There are some very recent updates to the use cases, but a lot of work in this cell was done in June. Since then, repetition data from Martin has come in to Tanya, and that was very useful; there have also been discussions about potentially useful visualizations. Good news is that by using early FeatureLens plus various other tools, and data ginned up by Martin, we have some ideas about how to work with MONK tools when they become ready. Sara has been working on sentimentalism: a small training set against a very large collection produced pretty much useless results. Reduced testbed to 80 novels from 250; training set increased from 25 chapters to 111 chapters, or about 10% (using WordHoard to create worksets).

How do requirements get from the use cases to the relevant other cells? Examples: features that need to be made available in the datastore; ability to swap out analytics for the same data set; what results might need to be saved and/or shared. How would we want to select subsets (data range, gender of author, etc.)? We'll take up some of these questions in the breakouts.

Witchcraft documents: colored marker, sticky notes. 300 documents (possession, obsession, witchcraft documents in EEBO). Some documents are 500 words; others are 800 pages. Useful chunk might be less than 500 words--100? Given an interest in following transformation across time in the description of the devil, witchcraft, etc., how to employ the computer? Latent semantic indexing? Syntactic patterns? Timeline visualizations? Can we save the markup that the user would supply, e.g. in colored highlighter? This moves into annotation, but how do we bring annotation into analysis?
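
As a rough illustration of the chunking question above, here is a minimal Python sketch that splits a document into fixed-size word chunks. The function name `chunk_words` and the 100-word default are assumptions for illustration only; nothing here was decided at the meeting, and real chunking would probably need to respect the TEI structure (paragraphs, divisions) rather than raw word counts.

```python
def chunk_words(text, size=100):
    """Split text into consecutive chunks of at most `size` whitespace-delimited words.

    Hypothetical helper: the 100-word default echoes the figure floated in
    the discussion and is not a MONK setting.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    words = text.split()
    # Step through the word list in strides of `size`; the last chunk may be shorter.
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# A 250-word document yields three chunks of 100, 100 and 50 words.
short_doc = " ".join(f"w{n}" for n in range(250))
chunks = chunk_words(short_doc)
print([len(c.split()) for c in chunks])
```

Fixed-size chunks keep very short pamphlets and 800-page books comparable, at the cost of cutting across sentence and paragraph boundaries.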

Interface: Bang-up report dated December 2007--

https://apps.lis.uiuc.edu/wiki/display/MONK/Interface+Cell+December+2007+Report

Stan had a dream: no hood to look under. The team at McMaster has been working on the workbench; the Alberta group has been working on a search widget for that workbench. This group hasn't posted a lot of news to the list, only sometimes to each other. Three workbenches: workflow (first approximation of user interface), webclipse (expert user), desktop (a simple desktop metaphor). The point is to show how different tools need to be combined at different stages – toolsets are the expression of this. Everything is staged, and you use different tools at different stages. The current version has some session management (remembers different users), allows combining tools, has a sense of process, saves result sets and worksets, and uses the old proxy/lucene-exist datastore. Most of the workbench stuff is written in JavaScript; graphics control is provided by Ext JS. Analytics is currently D2K Webservices. Runs predictions, works.

SEASR/MEANDRE update:

Loretta has a powerpoint to share, will link in here

Meandre is a SEASR workbench/workflow/dataflow environment. MONK needs to sit on top of this as a user interface for working with data, and on top of that needs to sit some portal for sharing results. MEANDRE is written in Google Web Toolkit.

Long discussion of architecture (how does MEANDRE relate to monk datastore; to Stefan/Stan's workbench, to a portal for sharing). Long discussion of transparency (we should be able to publish all known stuff that has bearing on how you achieved results, including component metadata, flow parameterization, etc.). Agreed that "performance" ultimately is measured by quality of results, rather than speed of results. Regular meetings will begin at UIUC to bring MONK/SEASR folks together (Stan and Stefan to be included by phone).

Analytics: No meeting minutes since November 6th, but there are a number of proposals at

https://apps.lis.uiuc.edu/wiki/display/MONK/Analytics+Cell

Analytics Cell Interim Report November 27 is just that.

Many meetings, many proposals, many of which have been ratified. The most important thing we've discussed has been to realize that the iterative, exploratory, interactive routines are key for scholarly users. Whatever more automated routines we add should not slight those things. Don't forget food, sex, sleep, and rummaging, even if you have a brand new car.

Collaboration: Most recent minutes are October 18th, Martin Wattenberg, ManyEyes.

Modest but tangible work this semester; early fall was productive (talked with Thorny Staples, Fedora evangelist, and Martin Wattenberg of ManyEyes). Commitment from IBM to give us the ManyEyes API. Loretta and Catherine have followed up in person, and they remain committed, but the challenge will be getting past legal issues. Renewed focus in the spring: Dan Cohen will join us for a conversation about Zotero, comparative annotation (Peter Boot).

10 AM: What is going on with SEASR?

SEASR/MONK:

Points of Intersection: https://apps.lis.uiuc.edu/wiki/display/MONK/Points+of+Intersection

Goals and Timelines: https://apps.lis.uiuc.edu/wiki/display/MONK/Goals+and+Timelines

How can we start to coordinate MONK and SEASR to mutual benefit?

Conference calls: Loretta will take part in collaboration cell calls; MONKs at Urbana will meet to coordinate with SEASR; Stan et al. can call in.

Noon: working lunch: summer SEASR/MONK workshop; other opportunities for workshops (Harvard in the fall, others?) Check back in a month to see where

1 PM - 3 PM: Proxy calls breakout

1 PM - 3 PM: Use cases breakout

3 PM - 5 PM: Specify deliverables and timelines for the second half of MONK, middle half of SEASR: where should we be in February, April, June, August, October, December? When in this timeline will we hit scale problems? When will we hit complexity issues?

Use cases:

Sara's case: proxy calls required include SEASR itinerary, manager, collection selection, rating, creation of a workset, storing results. Middle of January. SEASR web services? TEI-Analytics integrated into the ingest pipeline: end of February.

New NCF: TEI-A end of January, middle to end of March

Wright: TEI-A now; ingested new-style end of February

EEBO: TEI-A mid-January, experimental ingest as early as possible just with Kristen's subset, full EEBO ingested new-style middle of March

SEASR: clustering components end of January; Interface/proxy: output for visualization using ManyEyes (a URL, etc.)

Witchcraft: work now to February on building a map-viz component for SEASR, limiting named entity extraction by geographical range of reference, working also on how to control for change of place-names over time.

Martin: will suggest some readings on clustering from Baayen (forthcoming), Burrows (Companion to Digital Humanities)

6 PM: Dinner

Saturday December 15

8:30 AM: Coffee (might want to bring your own...)

Goals and timeline:

January-March:

  • Wright, NCF, EEBO in TEI-A
  • TEI-A ingest routine designed for Monk Datastore
  • Wright, NCF, EEBO ingested into Monk Datastore
  • SEASR Web Services
  • SEASR clustering components
  • SEASR Map-mashup component
  • SEASR Place-name extraction, limited by region
  • Interface integration with new Proxy and Datastore
  • Ability to ship visualizations to ManyEyes via URL
  • Ability to export CSV

March-May:

  • Review results, iterating on the use cases
  • Reverse engineering FeatureLens into MONK interface
  • Support for same in proxy and datastore
  • Interface components for search and sort
  • SEASR components for search and sort
  • Experiment with ingesting Brown, DocSouth, Perseus, LOC text, other nora collections?
  • Experiment with OCA texts.
  • XML index

June-October:

  • Portal
  • More use cases
  • Ingest additional designated collections based on March-May experiments
  • Supporting use cases and analyzing results
  • Hardening and documenting re-usable components (Brian's XSL, Monk Datastore, SEASR components, MorphAdorner)
  • SEASR component for generalized search on Monk datastore

October-February:

  • Pull as much of the proxy as possible into SEASR
  • Initial experiments with distributed collections using Terracotta, OAI (Tim Cole), etc.

11 AM: Organizational questions: do all the cells need to continue as such? What's not working? What's working well?

Cells are OK, but now we need to communicate across cells about projects. PLEASE send all email through the monk list, not as side-mail to selected participants. Use a subject-line prefix so that people can tell whether they need to read a message. The prefix could be the name of a cell or the name of a project:

INGEST project
INTEGRATION project
DATASTORE cell
INTERFACE cell
COLLABORATION cell
SEASR project
USE-CASE cell
etc.
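
The prefix convention could be applied mechanically, for example in a mail filter. The following Python sketch is only an illustration: the exact prefix format (a label followed by a colon), the label set, and the function name `split_prefix` are all assumptions, since the list above is open-ended and no format was fixed at the meeting.

```python
import re

# Labels taken from the list above; the list ends in "etc.", so treat this
# set as an assumed starting point, not a fixed vocabulary.
KNOWN_PREFIXES = {
    "INGEST", "INTEGRATION", "DATASTORE", "INTERFACE",
    "COLLABORATION", "SEASR", "USE-CASE",
}

# Assumed format: "LABEL: rest of subject", optionally after "Re:"/"Fwd:" markers.
_PREFIX_RE = re.compile(r"^(?:(?:re|fwd?):\s*)*([A-Za-z-]+):\s*(.*)$", re.IGNORECASE)


def split_prefix(subject):
    """Return (label, rest) if the subject carries a known prefix, else (None, subject)."""
    subject = subject.strip()
    match = _PREFIX_RE.match(subject)
    if match and match.group(1).upper() in KNOWN_PREFIXES:
        return match.group(1).upper(), match.group(2)
    return None, subject


print(split_prefix("Re: seasr: clustering components"))
print(split_prefix("Lunch on Friday?"))
```

Matching is case-insensitive and skips reply markers, so a thread keeps its label as it grows; unlabeled mail falls through with `None` so a reader can still catch it.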

Next meetings:

Infrastructure hackfest: Feb. 8-10, Chicago. JMU will arrange meeting space, hotel.

Use-case meeting: April 14-16, Montreal, Kristen/Catherine

July meeting of supercell in Finland, plus any who can.

12 Noon: adjourn

Document generated by Confluence on Apr 19, 2009 15:04