|
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Monk Data Cell Conference Call
Tuesday, March 20, 2007. 3-4 p.m. central time
Present: Martin Mueller, Bill Parod, Phil Burns, Bob Taylor, Joe Paris, James Chartrand, John Norstad (secretary), plus three new data cell members from NCSA who joined us for the first time: Loretta Auvil, Bernie Acs, and Vered Goren.
Missing: Amit Kumar
We began with introductions and welcomed the new NCSA members to the Monk data cell.
Loretta described the new Mellon-funded SEASR project. The purpose of the project is to establish an architecture that makes it easy to integrate text processing and analysis tools. It will interact with UIMA. The new framework will be along the lines of D2K but will be distinct from it. The project will also offer a humanities-driven interface for using the tools. She is very interested in having the Monk tools be usable with this new SEASR infrastructure.
Martin talked about scale issues and asked Bernie what he thought about scaling up from a few million words in Nora and WordHoard to the order of hundreds of millions or even a billion words in Monk. Bernie said this would be a large database, similar to the kinds used in large data warehousing applications, but it should not be a major issue.
Martin has been working on a Monk DTD, based on TEI Tight, a kind of greatest common factor markup language for our initial collection of CH and EEBO texts. This DTD provides markup tags at three levels. At the top level, we have a kind of library card catalog structure. At the bottom level, we have word and sentence tagging. These two levels are pretty much the same across all the different works and genres. At the middle level, we have tags for structures like chapters, stanzas, paragraphs, lines, and speeches. These middle-level tags and structures are more heterogeneous across the works and genres.
Martin described a "shadow play" generic use case, where a scholar profiles the works of an author against the backdrop of that author's contemporaries. It may be possible to precompute the backdrop data needed for this kind of application in, for example, a sliding window of 50-year periods that advances in increments of 25 years.
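As a rough illustration of the sliding-window idea, the following Python sketch enumerates 50-year windows that advance in 25-year steps and finds the windows that contain a given year. The function names and the 1500-1700 year range are assumptions chosen for illustration only; they were not discussed on the call.

```python
def sliding_windows(start, end, width=50, step=25):
    """Return half-open (lo, hi) year ranges of the given width.

    Only full-width windows that fit inside [start, end] are returned.
    """
    windows = []
    lo = start
    while lo + width <= end:
        windows.append((lo, lo + width))
        lo += step
    return windows


def windows_containing(year, windows):
    """Return every window whose half-open range [lo, hi) contains the year."""
    return [(lo, hi) for (lo, hi) in windows if lo <= year < hi]


# Hypothetical range: backdrop windows covering 1500 through 1700.
backdrop = sliding_windows(1500, 1700)

# A work dated 1599 falls in two overlapping windows,
# so it could be profiled against either backdrop.
print(windows_containing(1599, backdrop))
```

Because consecutive windows overlap by half their width, most years fall in two windows, which is what would let a scholar pick the backdrop that best brackets an author's career.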
Martin also described a generic kind of use case where one performs analyses within the two dimensions of time and genre.
Martin mentioned that we should ignore fine typographical markup for presentation purposes. While we must be able to display attractive and readable text, it is not critical that the presentations have high typographical fidelity with the original sources.
Bernie mentioned that XML schemas are richer and more flexible than DTDs.
Bill mentioned that at NU we hope to begin outlining a detailed draft of a domain model definition for Monk. This will determine data storage needs and is a prerequisite for designing a data store architecture. Bernie agreed, and he referred to this as the "structural elements" or "meat" of Monk.
Loretta expressed her concern that the Monk data store, however it is implemented, must support the kinds of analytics that will be done in SEASR.
Pib briefly described his MorphAdorner pipeline for morphological tagging. He and Loretta discussed UIMA as a potentially useful tool for managing this pipeline.
|