|
MONK : Conference call, 2007 Apr. 17, Data
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Monk Data Cell Conference Call
Tuesday, April 17, 2007, 3-4 p.m. central time
Present: Amit Kumar (chair), James Chartrand, Phil Burns (Pib), John Norstad (secretary), Bill Parod, Joe Paris, Vered Goen, Loretta Auvil and Bernie Arcs
Missing: Martin Mueller and Bob Taylor (traveling)
Honored guest from the Analytics cell: Steve Ramsey
Thanks to: Bill Parod for contributing to these minutes from his own notes.

We started by discussing the need to move from the general discussion and brainstorming phase into a more disciplined phase where we draw up specific tasks, responsibilities, and timelines. We decided that this will be the primary agenda item for our next conference call, and in the next two weeks we will all work towards that goal.

We have been working hard in recent days discussing some of the decisions we need to make on the Monk mailing list and in correspondence just within the data cell. We need to summarize these decisions on the wiki.

We talked a great deal about lexical data, based on the following summary prepared by Bill Parod:

Bill: Lexical Data - What information do we want to know about each word?
identifier - a string that uniquely identifies the word:

We want to capture the above. Also, this information forms a reference lexicon, which is a central resource and useful in its own right.

Loretta mentioned the importance of preserving capitalization. Bill and Pib reassured her that the spelling attribute does indeed retain the original spelling of the word token, including all of the capitalization.

John raised the issue of contractions, words which have more than one lemma and part of speech. This has always been an important issue for Martin. An example is the first word of Hamlet, "who's". This is a single word, a single lexical token, but it has two parts. The first part is an instance of the lemma "who" with NUPOS part of speech "q-crq". The second part is an instance of the lemma "be", with NUPOS part of speech "vaz".
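
As an illustration only, the contraction case could be modeled along the following lines. This is a minimal sketch, not the Monk datastore schema or MorphAdorner's output format: the class names `WordPart` and `WordToken`, the field names, the helper `lemma_pos_pairs`, and the identifier string are all hypothetical. Only the lemmas and NUPOS tags come from the Hamlet example above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class WordPart:
    """One lemma/part-of-speech pair inside a word token (hypothetical name)."""
    lemma: str
    pos: str  # NUPOS part-of-speech tag


@dataclass
class WordToken:
    """A single lexical token that may carry more than one part (hypothetical name)."""
    identifier: str  # string that uniquely identifies the word; format assumed
    spelling: str    # original spelling, capitalization preserved
    parts: List[WordPart] = field(default_factory=list)


def lemma_pos_pairs(token: WordToken) -> List[Tuple[str, str]]:
    """Return the (lemma, pos) pairs of a token in order."""
    return [(part.lemma, part.pos) for part in token.parts]


# The first word of Hamlet: one token, two parts.
whos = WordToken(
    identifier="hamlet-w1",  # made-up identifier, for illustration only
    spelling="Who's",
    parts=[WordPart("who", "q-crq"), WordPart("be", "vaz")],
)

print(lemma_pos_pairs(whos))
```

Keeping the parts as an ordered list on the token, rather than splitting the contraction into two tokens, preserves the point made in the discussion that "who's" is a single word with two lemmas.
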
Pib mentioned that MorphAdorner knows how to deal with these kinds of words, and emits multiple lemma and part of speech tags for them, as in the example from Hamlet.

John also raised the issue of keeping track of word order and word proximity, to make it possible to answer questions involving collocation, n-grams, and general morphological pattern matching searches. Steve, Amit and Pib discussed the facilities available for doing these kinds of tasks within existing search engine products like Lucene. Do we need to concern ourselves with this issue in the Monk datastore proper?

We also talked about the need to keep track of punctuation in the datastore and make it possible for clients of the datastore to work with punctuation as analysis features. Pib remarked that MorphAdorner does indeed maintain all punctuation.

We agreed to defer a detailed discussion of the important issue of n-grams.

We moved on to a discussion of structural issues, and the notion of "chunks" in particular. Loretta explained that one reason chunks are important in Nora is that they are the smallest units of text over which counting and analysis are possible. This is not the case in WordHoard, which permits counting and analysis over arbitrary "bags of words", and uses its notion of "work parts" primarily as a way to organize tables of contents for works and as a unit of text presentation.

We talked about the needs of some use cases to identify and characterize passages of text and in a sense make them "user-defined chunks" over which counting and analysis are possible. This issue was raised by Catherine Plaisant in a Monk mailing list message:

Catherine on 4/16/07, 9:39 PM
The use cases also suggest that users do wish to rate chunks with as

We also discussed the following comment by Catherine in the same mail message. We need to follow up on this with her.
Catherine again on 4/16/07, 9:39 PM
Earlier on in Nora we had discussions about having a fixed but general

For our next meeting, Amit reiterated the need to agree on a formal timeline to finalize our initial informal design discussions and begin to make decisions.

Pib suggested that we all read and carefully study the use cases on the wiki, examining them to determine in detail what requirements they impose on the Monk datastore.

We will concentrate on continuing to work on the mailing list and on the wiki over the next two weeks to formalize the lexical and structural issues and start to make firm decisions. |
| Document generated by Confluence on Apr 19, 2009 15:04 |