This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Scheduled for

Mar 06 2007 from 15:00 to 16:00 CST

Agenda Items

  1. What are our responsibilities?
  2. What to document? Code Documentation/Documentation of the project infrastructure.
  3. Please review the Milestones for your cell that are sketched in the notes from the meeting at Northwestern, at https://apps.lis.uiuc.edu:443/wiki/display/MONK/Meetings+and+Milestones
  4. What are the modules in this project?
  5. A brief review of the data cell wiki resources.
  6. Who will write functional specifications?
  7. Testing, Design and Implementation responsibilities.
  8. Meeting times and how frequently?

Monk Data Cell Conference Call

Tuesday, March 6, 2007. 3-4 p.m. central time

The full cell was present: Amit Kumar (chair), Martin Mueller, Bob Taylor, James Chartrand, Bill Parod, Joe Paris, Phil Burns (Pib), and John Norstad (secretary).

The agenda items were discussed slightly out of order, with the milestones at the end.

1. What are our responsibilities?

We have two major responsibilities.

The first is text preparation, the process Martin has described in his note "What is a MONK text?" (archive). There are two steps in this process: first, transforming raw texts into a common standards-based "Monkable" format, and second, adorning the texts with morphological tagging data using Pib's MorphAdorner pipeline to produce "Monk texts" or "Monkified texts". Work is already in progress on this task at NU.
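As a rough illustration only, the two-step shape of text preparation (normalize to a common format, then adorn) might be sketched in Java as below. Everything here is an assumption made for the example: the class and method names are hypothetical, the "common format" is reduced to whitespace cleanup, and the toy lexicon lookup merely stands in for the adornment step. It does not show MorphAdorner's API, and the tags are placeholders rather than real tagset values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a two-step text preparation pipeline. */
final class TextPreparationSketch {

    /** A token paired with a placeholder tag. */
    static final class AdornedToken {
        final String spelling;
        final String tag;

        AdornedToken(String spelling, String tag) {
            this.spelling = spelling;
            this.tag = tag;
        }
    }

    /** Step 1 stand-in: collapse whitespace so every source reaches step 2 in one shape. */
    static String toCommonFormat(String raw) {
        return raw.trim().replaceAll("\\s+", " ");
    }

    /** Step 2 stand-in: attach a tag from a toy lexicon; unknown words get "UNKNOWN". */
    static List<AdornedToken> adorn(String text, Map<String, String> lexicon) {
        List<AdornedToken> adorned = new ArrayList<>();
        if (text.isEmpty()) {
            return adorned;
        }
        for (String token : text.split(" ")) {
            String tag = lexicon.getOrDefault(token.toLowerCase(Locale.ROOT), "UNKNOWN");
            adorned.add(new AdornedToken(token, tag));
        }
        return adorned;
    }

    /** Chains the two steps so callers see a single raw-text-to-adorned-text function. */
    static Function<String, List<AdornedToken>> pipeline(Map<String, String> lexicon) {
        Function<String, String> firstStep = TextPreparationSketch::toCommonFormat;
        return firstStep.andThen(text -> adorn(text, lexicon));
    }
}
```

Keeping the steps as separate functions means the normalization step can change per source collection while the adornment step stays the same.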

The second is designing and implementing a persistent store for this data with a flexible API that efficiently exposes all of the data and its major behaviors to the higher levels of Monk software (e.g., middle-ware servers and end-user applications).

The first job in designing the persistent store is to enumerate the requirements for the store. This process is ultimately guided by and defined by the needs of end-use cases. Amit announced that the user cell is working on documenting some of these end-use cases. John mentioned the functionality of WordHoard as an end-use case, although many of WordHoard's features may not be relevant to Monk, and there is definitely a lack of clarity in this area that needs to be resolved.

These "requirements" may also be called "functional specifications", "technical specifications", a "data model" or an "object model". For now at least, we consider these terms to be more-or-less interchangeable. Whatever they may be called, they ultimately define what we are going to do with the data, what we're not going to do, what we're going to model and what we're not going to model, what data services we are going to provide to the rest of Monk, what ones we are not, and so on.

Martin mentioned concordances with grouping and sorting of the kind provided by WordHoard as an important area of functionality for Monk that should be supported by the data store.
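To make "concordances with grouping and sorting" concrete, here is a minimal keyword-in-context sketch in Java. It is not WordHoard's implementation and not a proposed Monk API: the names `ConcordanceSketch`, `Hit`, `find`, and `groupAndSort` are hypothetical, tokenization is a naive whitespace split that ignores punctuation, and grouping by work with sorting by right-hand context is just one of many possible choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical keyword-in-context concordance with grouping and sorting. */
final class ConcordanceSketch {

    /** One occurrence of the keyword with its surrounding context. */
    static final class Hit {
        final String work;
        final String left;
        final String keyword;
        final String right;

        Hit(String work, String left, String keyword, String right) {
            this.work = work;
            this.left = left;
            this.keyword = keyword;
            this.right = right;
        }
    }

    /** Finds each case-insensitive occurrence of word, keeping up to width tokens per side. */
    static List<Hit> find(String work, String text, String word, int width) {
        String[] tokens = text.trim().split("\\s+");
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equalsIgnoreCase(word)) {
                int from = Math.max(0, i - width);
                int to = Math.min(tokens.length, i + 1 + width);
                String left = String.join(" ", Arrays.copyOfRange(tokens, from, i));
                String right = String.join(" ", Arrays.copyOfRange(tokens, i + 1, to));
                hits.add(new Hit(work, left, tokens[i], right));
            }
        }
        return hits;
    }

    /** Groups hits by work (alphabetical) and sorts each group by right context, ignoring case. */
    static Map<String, List<Hit>> groupAndSort(List<Hit> hits) {
        Map<String, List<Hit>> groups = new TreeMap<>();
        for (Hit hit : hits) {
            groups.computeIfAbsent(hit.work, key -> new ArrayList<>()).add(hit);
        }
        for (List<Hit> group : groups.values()) {
            group.sort(Comparator.comparing((Hit h) -> h.right, String.CASE_INSENSITIVE_ORDER));
        }
        return groups;
    }
}
```

A data store that supports this kind of feature would need to expose at least word positions, work boundaries, and whatever attributes the grouping and sorting keys are drawn from.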

2. What to document?

For Java code, John and Pib recommend following Sun's standard conventions for javadoc documentation. All methods, classes, packages, etc. should be formally documented using these conventions. In addition, object model and architecture overview documentation is important.
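For readers unfamiliar with the convention, a documented class and method in the standard javadoc style might look like the sketch below. The class `WordCounter` and its method are hypothetical and exist only to show the comment layout (summary sentence, then `@param`, `@return`, and `@throws` tags); they are not part of any Monk or WordHoard code.

```java
import java.util.Objects;

/**
 * Counts word occurrences in plain text.
 *
 * <p>Hypothetical utility class, shown only to illustrate javadoc layout.
 */
final class WordCounter {

    /** Not instantiable; all methods are static. */
    private WordCounter() {
    }

    /**
     * Counts the whitespace-separated tokens in a text that equal a given
     * word, ignoring case.
     *
     * @param text the text to search; must not be {@code null}
     * @param word the word to count; must not be {@code null}
     * @return the number of tokens equal to {@code word}, ignoring case
     * @throws NullPointerException if {@code text} or {@code word} is {@code null}
     */
    static int count(String text, String word) {
        Objects.requireNonNull(text, "text");
        Objects.requireNonNull(word, "word");
        int occurrences = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.equalsIgnoreCase(word)) {
                occurrences++;
            }
        }
        return occurrences;
    }
}
```

Running the `javadoc` tool over sources commented this way produces browsable HTML API documentation.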

All data files, raw and processed, and all code and documentation will reside in a central Monk repository. We will use the tools already in place at UIUC that Amit has set up for this purpose.

Martin and Amit will work together to get the raw XML source files for our initial collections of text loaded into the Monk repository at UIUC.

John will look at the Library page for the data cell area on the Monk Wiki and will add links, if necessary, to relevant NU WordHoard resources. (I, John, just did this - there's already a link to our WordHoard web site, which has everything including source code, javadoc, overview docs, user docs, developer docs, data files, etc. I don't see a need for any more work here.)

4. What are the modules in this project?

This question is deferred until we have a first draft of the requirements for the data store. There is little we can say intelligently about modules and architectures until we have at least a detailed first draft of the requirements.

5. A brief review of the data cell wiki resources.

We all promised to keep up to date with activity in the data cell area of the Monk Wiki. Amit again mentioned RSS feeds as a convenient way to keep up to date.

6. Who will write functional specifications?

We will all work together on this, as a group. John will begin enumerating some details of major areas of concern in the data store via a series of mailing list messages. For now, we will use the full Monk mailing list with subjects tagged with the phrase "[DATA]" to carry on this discussion. Amit will investigate whether a separate data cell-only list might be more appropriate. Eventually, when preliminary agreement is reached on details, the results will be posted as first draft specification documents on the Wiki. We're not at all certain that this method of communication is best, but we're going to start this way and see how it works.

7. Testing, design and implementation responsibilities.

We deferred this question.

8. Meeting times and how frequently?

The data cell will have a biweekly conference call on the same day and at the same time as this one: every other Tuesday afternoon from 3-4 p.m. central time. Our next call will be on Tuesday, March 20.

3. Milestones.

• Agree on preliminary data model for interoperability.

We will certainly keep in mind the desirability of achieving as much interoperability as possible with other text processing and analysis systems that exist outside of Monk. Producing a detailed first draft of our preliminary data model is our next task.

• Agree on an API (or proxy, or whatever) for interfaces to address.

We are all committed to a universal and open API as a central mission-critical goal. We will provide an API that is not targeted to any particular end-user application architecture or set of technologies and that should be easily usable within virtually any conceivable architecture or framework. The details of this API will be designed after we have completed a first draft of the requirements.

• Agree on some ways of working across collections (where a collection is represented by an index).

This is a desirable goal with major challenges. Some preliminary work has been done on thinking hard about this problem and investigating technologies and alternatives, but much hard work remains to be done. As with so many other tasks and milestones for the data cell, this problem also cannot be addressed intelligently until we have a first draft of the requirements.

• Assemble a collection of fiction from 1600-1920 (with some public domain texts included and identified).

This task is done. Martin briefly discussed the possibility of adding some early 20th-century Gutenberg texts, and of investigating and implementing some of his ideas for "up-tagging" as part of that project. This idea is, however, not critical for now and is deferred.

• Reassess data model with early modern texts in mind.

This milestone refers to NUPOS, Martin's new part-of-speech tagset, together with the work Pib is doing on his MorphAdorner software. This work is in progress. We will post details soon on how NUPOS is represented in the data model.

• Assemble a collection of early modern texts in various genres (with some public domain texts included and identified).

The public domain texts will be from the Wright Early American Fiction Archive. This work is done. We have decided on the initial collection of texts for Monk, and we have them in hand.

Document generated by Confluence on Apr 19, 2009 15:04