This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Monk Data Cell Conference Call
Tuesday, July 10, 2007. 3-4 p.m. central time
Present: Phil Burns (Pib), Bill Parod, Loretta Auvil, Amit Kumar (chair), John Norstad (secretary), Duane Searsmith
Amit will log on to Ariadne, get a copy of the adorned NCF texts and put them into a repository at UIUC for other Monk people to get them.
John is working on ingesting the NCF texts into the current WordHoard data store.
Bill is trying to extract sparse matrices from the WordHoard data store into D2K itineraries.
Amit asks what the steps might be if a user wants to upload a new text to be incorporated into Monk. What is the formal workflow? What steps are iterative? What might be done in parallel?
Pib: Does the text conform to a DTD we understand? Does it use the kind of English for which our training data is appropriate?
Amit: To add a new collection to the Monk data store, we need answers to several questions. What are the documents? Where are they located? Who is the contact person? In what format is the metadata? Is the XML well-formed? Is it valid? Is there training data?
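The intake questions above could be recorded in a small structure, so that it is clear which ones remain open for a given collection. This is only a sketch: the class name, field names, and the sample value are hypothetical and do not come from any Monk code.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CollectionIntake:
    """Hypothetical record of the intake answers for one collection."""
    documents: Optional[str] = None          # what the documents are
    location: Optional[str] = None           # where they are located
    contact_person: Optional[str] = None     # who to ask about them
    metadata_format: Optional[str] = None    # format of the metadata
    xml_well_formed: Optional[bool] = None   # does the XML parse?
    xml_valid: Optional[bool] = None         # does it validate?
    has_training_data: Optional[bool] = None # is there training data?

    def unanswered(self) -> list[str]:
        """Names of the questions that have no answer yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Placeholder values for illustration only.
intake = CollectionIntake(documents="example collection", xml_well_formed=True)
missing = intake.unanswered()
```

Here `missing` lists every question that still needs an answer before the collection can move on.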
Martin: Let's be practical and concrete. We have NCF. The Wright archives are next, and Brian is the appointed curator. We can ask Steve and Brian to add Wright and document the process. We then have a second collection. The one after that is the TCP early modern texts, which will be a bit harder. The next step is to get Steve and Brian involved.
Amit wants to define a workflow within the next few weeks, at least for the simplest use case. A user uploads a file and fills out a sequence of forms. The process checks for well-formed and valid XML. Amit would like to set up something quickly and modify it as time goes on.
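The first automated check in the simplest use case might look like the sketch below. It is an assumption about how such a step could be written, not a description of what Amit built. It uses Python's standard `xml.etree.ElementTree`, which tests well-formedness only; checking validity against a DTD would need a validating parser, which this sketch does not include. The function name and sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text: str) -> tuple[bool, str]:
    """Return (True, "") if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser message includes the line and column of the problem.
        return False, str(err)
    return True, ""


# Illustrative inputs: the second is missing its closing </p> tag.
good = "<text><p>Sample paragraph.</p></text>"
bad = "<text><p>Sample paragraph.</text>"

ok_good, _ = check_well_formed(good)
ok_bad, message = check_well_formed(bad)
```

A failed check returns the parser's message, which an upload form could show to the user before any later step runs.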
Martin: A critical variable: if the process involves morphological adornment, is the training data sufficient? When data does not come from the same linguistic universe, somebody has to do some work with training data.
Pib: We must distinguish between processes that can be automated and the hard time-consuming work that must be done by hand. Documenting this process will be difficult. It may be possible to write additional software in the future for Monk to facilitate this kind of work. But some things are simply irreducibly complex.
Amit: Tanya would like to get some documents into Monk. Martin: These are a Doc South subset, in standard American English with some Southern dialect.
Pib: Interested in workflow languages and software systems, for reasons at NU beyond Monk.
Amit: Looking at jPDL (jBPM Process Definition Language) as a workflow language. It's XML-based, and JBoss supports it for use in web applications.
Duane: They see Monk as a use case for CAESR, and are working with Amit on workflow issues. These issues bleed into component and service-based frameworks.
Amit: Need to involve Brian and Steve in this project, to help define what the workflow should be.