This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
  1. Summarizing on the wiki the decisions reached on the mailing list.
  2. What are our Tasks?
    My suggestion for moving forward would be to create a set of broad tasks and have a timeline with them.
    Any suggestions in this regard?
  3. Are there people with spare cycles? Can they make themselves known?
  4. We need volunteers to take Design, Implementation and Evaluation roles.
    Each volunteer would manage the discussions that lead to decisions. See https://apps.lis.uiuc.edu/wiki/display/MONK/The+SuperCell (scroll down to "Roles in each cell"). Omitted for now, for lack of understanding of what part the roles play.

Monk Data Cell Conference Call

Tuesday, April 17, 2007. 3-4 p.m. central time

Present: Amit Kumar (chair), James Chartrand, Phil Burns (Pib), John Norstad (secretary), Bill Parod, Joe Paris, Vered Goen, Loretta Auvil and Bernie Arcs

Missing: Martin Mueller and Bob Taylor (traveling).

Honored guest from the Analytics cell: Steve Ramsey

Thanks to: Bill Parod for contributing to these minutes from his own notes.

We started by discussing the need to move from the general discussion and brainstorming phase into a more disciplined phase where we draw up specific tasks, responsibilities, and timelines. We decided that this will be the primary agenda item for our next conference call, and in the next two weeks we will all work towards that goal.

In recent days we have been working hard, both on the Monk mailing list and in correspondence within the data cell, discussing some of the decisions we need to make. We need to summarize these decisions on the wiki.

We talked a great deal about lexical data, based on the following summary prepared by Bill Parod:

Bill: Lexical Data - What information do we want to know about each word?

identifier - a string that uniquely identifies the word
address - its unambiguous position/extent in the text
spelling - the word token string
standard original form - From Martin's memo:
"The value of the spe attribute is usually identical with the value of the tok attribute, but sometimes it is not. Look at "common|lie," where the vertical bar is the SGML representation of a soft hyphen at the end of a line. The value of the spe attribute is the original spelling as it would ordinarily appear in a text from that period. In this case that is 'commonlie.' The spe attribute is also used to resolve printer attributes or odd spelling conventions that are not found in this stretch of text but are very common. Thus "y^t" becomes "that", "&abper;ficit" becomes 'perficit', and other printing conventions are similarly written out in their contemporary rather than modern form (although these will often be the same.)."
standard modern form - From Martin's memo:
"The value of the reg attribute is the standard modern orthographic form of the original spelling. But the morphological form is not modernized. Thus a spelling like 'lovyth' would be regularized to 'loveth', but 'loveth' would not be regularized to 'loves' but is recognized as a standard archaic form."
lemma - The lemma or dictionary headword for the word
pos - The part of speech
sentence boundary - indicate whether the word ends a sentence

We want to capture all of the above. This information also forms a reference lexicon, which is a central resource and useful in its own right.
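As a rough illustration of the list above, the per-word information could be held in a record like the following. This is only a sketch under stated assumptions: the class name, the identifier format, the use of character offsets for the address, and the sample part-of-speech value are all hypothetical and were not decided on the call. The field names tok, spe and reg follow the attribute names used in Martin's memo, and the 'lovyth' / 'loveth' values come from its example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WordRecord:
    """One word occurrence. Field layout is an assumption, not an agreed design."""
    identifier: str           # string that uniquely identifies the word
    address: Tuple[int, int]  # position/extent in the text (assumed: start/end offsets)
    tok: str                  # spelling: the word token string, capitalization kept
    spe: str                  # standard original form
    reg: str                  # standard modern orthographic form
    lemma: str                # dictionary headword
    pos: str                  # part of speech
    ends_sentence: bool       # True if the word ends a sentence


# Hypothetical record; identifier, offsets and pos value are invented.
word = WordRecord(
    identifier="example-work-000123",
    address=(4510, 4516),
    tok="lovyth",
    spe="lovyth",
    reg="loveth",   # orthography modernized, archaic morphology kept
    lemma="love",
    pos="vvz",      # assumed tag, for illustration only
    ends_sentence=False,
)

print(word.tok, "->", word.reg)
```

A collection of such records, keyed by spe or reg, is one way the reference lexicon mentioned above could be derived.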

Loretta mentioned the importance of preserving capitalization. Bill and Pib reassured her that the spelling attribute does indeed retain the original spelling of the word token, including all of the capitalization.

John raised the issue of contractions, that is, words that have more than one lemma and part of speech. This has always been an important issue for Martin. An example is the first word of Hamlet, "who's". This is a single word and a single lexical token, but it has two parts. The first part is an instance of the lemma "who" with NUPOS part of speech "q-crq". The second part is an instance of the lemma "be" with NUPOS part of speech "vaz". Pib mentioned that MorphAdorner knows how to deal with these kinds of words and emits multiple lemma and part of speech tags for them, as in the example from Hamlet.
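One possible way to model this, sketched here only for discussion, is to let a single token carry an ordered list of parts, each with its own lemma and part of speech. The class and function names are hypothetical, and this is not a description of MorphAdorner's actual output format; only the "who's" values come from the example above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WordPart:
    lemma: str  # dictionary headword for this part
    pos: str    # NUPOS part of speech for this part


@dataclass(frozen=True)
class Token:
    spelling: str                # the single lexical token as it appears in the text
    parts: Tuple[WordPart, ...]  # one entry per lemma, in order


def lemmas(token: Token) -> List[str]:
    """Return the lemmas of all parts of a token, in order."""
    return [part.lemma for part in token.parts]


# The first word of Hamlet: one token, two parts.
whos = Token("who's", (WordPart("who", "q-crq"), WordPart("be", "vaz")))

print(lemmas(whos))
```

An ordinary word would simply have a one-element parts tuple, so clients would not need a special case for contractions.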

John also raised the issue of keeping track of word order and word proximity, to make it possible to answer questions involving collocation, n-grams, and general morphological pattern matching searches. Steve, Amit and Pib discussed the facilities available for doing these kinds of tasks within existing search engine products like Lucene. Do we need to concern ourselves with this issue in the Monk datastore proper?

We also talked about the need to keep track of punctuation in the datastore and make it possible for clients of the datastore to work with punctuation as analysis features. Pib remarked that MorphAdorner does indeed maintain all punctuation.

We agreed to defer a detailed discussion of the important issue of n-grams.

We moved on to a discussion of structural issues, and the notion of "chunks" in particular.

Loretta explained that one reason chunks are important in Nora is that they are the smallest units of text over which counting and analysis are possible. This is not the case in WordHoard, which permits counting and analysis over arbitrary "bags of words", and uses its notion of "work parts" primarily as a way to organize tables of contents for works and as a unit of text presentation. We talked about the needs of some use cases to identify and characterize passages of text and in a sense make them "user-defined chunks" over which counting and analysis are possible. This issue was raised by Catherine Plaisant in a Monk mailing list message:

Catherine on 4/16/07, 9:39 PM

The use cases also suggest that users do wish to rate chunks with as
much flexibility as possible, so clearly saving data about low level
chunks is important. They will be marked as erotic, sentimental or as
being a witch accusation. In fact one of the challenge is to allow for
the assignment of a rating to a custom size of text (either part of a
chunk, or a consecutive set of chunks) as one unit for the rating. We
could create interfaces to do that, but I am not sure how the analytics
can deal with it. This is a very common request though.
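To make the idea of a "user-defined chunk" concrete, here is a minimal sketch in which a rating is attached to a range of word positions rather than to a chunk. Such a range can cover part of one chunk or run across several consecutive chunks, and counts can still be taken over it as over a bag of words. Everything here is an assumption for illustration: the names, the word-index addressing, and the sample data are invented, and the sketch says nothing about how the analytics would consume such spans.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# (chunk_id, lemma) pairs in document order. Invented sample data.
WORDS: List[Tuple[str, str]] = [
    ("c1", "the"), ("c1", "witch"), ("c1", "be"),
    ("c2", "accuse"), ("c2", "the"), ("c2", "witch"),
    ("c3", "she"), ("c3", "weep"),
]


@dataclass(frozen=True)
class RatedSpan:
    start: int  # index of the first word, inclusive
    end: int    # index just past the last word, exclusive
    label: str  # the user's rating, e.g. "witch accusation"


def lemma_counts(span: RatedSpan, words: Sequence[Tuple[str, str]] = WORDS) -> Counter:
    """Count lemmas over the words covered by the span."""
    return Counter(lemma for _, lemma in words[span.start:span.end])


def chunks_touched(span: RatedSpan, words: Sequence[Tuple[str, str]] = WORDS) -> List[str]:
    """List the chunks the span overlaps, in document order."""
    seen: List[str] = []
    for chunk_id, _ in words[span.start:span.end]:
        if chunk_id not in seen:
            seen.append(chunk_id)
    return seen


# A span that starts inside chunk c1 and runs to the end of chunk c2.
span = RatedSpan(start=1, end=6, label="witch accusation")

print(chunks_touched(span))
print(lemma_counts(span).most_common(1))
```

Addressing spans by word position rather than by chunk is what lets a rating apply to a custom size of text; the fixed chunks remain available for presentation and tables of contents.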

We also discussed the following comment by Catherine in the same mail message. We need to follow up on this with her.

Catherine again on 4/16/07, 9:39 PM

Earlier on in Nora we had discussions about having a fixed but general
and human-comprehensible hierarchy of chunk types which could have
predictable behaviors (a chapter would have different properties and
behaviors than a paragraph or a preface). Chapter might get counts, not
paragraphs or prefaces. Chunk types also drive what interface widget is
used to browse it. Just having labels for the chunks types will not be
enough.

For our next meeting, Amit reiterated the need to agree on a formal timeline to finalize our initial informal design discussions and begin to make decisions.

Pib suggested that we all read and carefully study the use cases on the wiki, examining them to determine in detail what requirements they impose on the Monk datastore.

We will concentrate on continuing to work on the mailing list and on the wiki over the next two weeks to formalize the lexical and structural issues and start to make firm decisions.

Vered and Bernie were present; they just didn't say so at roll call.

Posted by amitku at Apr 18, 2007 14:16

I want to add some comments on the question of word order. My general position is that we must build into the basic structure of Monk extensive capabilities for supporting analytical routines that are sensitive to the order of words in a document. How much of these capabilities we will fully implement in Phase I is a practical question. But we don't want to find ourselves in a situation where we want to add routines that are sequence-sensitive only to discover that earlier architectural decisions do not support them or support them in truncated ways.

So this is an issue where we need to be very careful and avoid short cuts that will cost us dearly later.

Posted by martinmueller@northwestern.edu at Apr 19, 2007 09:11
Document generated by Confluence on Apr 19, 2009 15:04