|
MONK : Conference call, 2007 Apr 3, Data
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Agenda Items:
Monk Data Cell Conference Call
Tuesday, April 3, 2007, 3-4 p.m. Central time

Present: Amit Kumar (chair), James Chartrand, Phil Burns (Pib), John Norstad (secretary), Martin Mueller, Loretta Auvil, Joe Paris, and Bernie Arcs.
Missing: Bill Parod and Bob Taylor (out of town), Vered Goren.

We began by discussing Morph Adorner. Pib mentioned that we still have unresolved issues concerning sharing the proprietary data in Morph Adorner. Martin thinks that sharing between UIUC and NU should not be a problem. We made resolving this issue an action item for Bill Parod and Bob Taylor when they return to the office next week.

Amit wants to see use cases, which should define our data cell requirements. Martin mentioned a category of use cases that analyze lexical use over time. Martin also mentioned some other categories of use cases in our March 20 conference call. John said that the major functionality of WordHoard should be considered a use case.

Amit asked how one would go about converting documents to Martin's "Monkable" format (using Martin's new Monk DTD). Martin has been working on the TCP texts, whose DTD is a slightly modified version of TEI. To accommodate these texts he has relaxed the TEI DTD a little, added a few things, and adjusted element name case issues, and he's done. For C-H, the SGML texts are not very compatible with either the TCP or the American Fiction texts. Martin and Bill have converted these texts by hand and with the help of a few scripts. This work is done. All that remains for our initial collection of fiction texts is the 18th century fiction collections.

For texts in the wild, Amit would like some kind of automatic way to up-tag arbitrary texts to Monk-DTD format. Pib says this is not easy, not even for wild texts in HTML format. He says we could think about developing tools to help with this process, but this would be a big project, and we could never fully automate the process. Some cases that might be considered are plain text with no markup and HTML texts.

Martin takes a concrete point of view. Our initial L-shaped collection is more than enough to keep us busy. There are other collections that might be useful to include, perhaps at the end of the Monk project or with side grants. He agrees with Pib that we probably need to table trying to automate the processing of texts in the wild.

We returned to more discussion of Morph Adorner. Martin will work with Bill and Bob to resolve issues involving the sharing of the proprietary data in Morph Adorner. Beyond these legal issues, Pib mentioned several technical issues with sharing his current version of Morph Adorner: there's no documentation, there's no sample code, the code needs to be cleaned up, and the code base is very volatile. Pib feels that the code is not really releasable, not even for testing. He said that by mid-summer things should have settled down enough to release a first testing version.

Amit feels that he and others would like to be part of this process, not just consumers of Morph Adorner, and that waiting until mid-summer is too long. He'd like to try it out now and compare it to other NLP toolkits. Pib disagreed, and said that he doesn't see how this development work could be shared effectively, and that there's plenty of other Monk work to be done, but he will leave this up to management.

Martin says that he hopes to be able to share his new Monk DTD and some sample texts next week. |