This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Monk Data Cell Conference Call

Tuesday, July 24, 2007. 3-4 p.m. central time

Present: Amit Kumar (chair), Bernie A'cs, Phil Burns (Pib), Bill Parod, John Norstad (secretary), James Chartrand, Martin Mueller, Loretta Auvil, Joe Paris

Amit is still working on the workflow system and parallel processing of texts for Nora DB. Ingesting NCF into Nora DB is on his to-do list.

John is still working on ingesting NCF into the WordHoard datastore. He has ingested 10 sample works. He received the full collection of 250 tagged texts yesterday and started working on them. He has examined about 50 of them so far.

Amit needs Pib's help to understand MorphAdorner: how it works, its dependencies, etc. Amit wants to be able to replace OpenNLP with MorphAdorner as part of his process of ingesting texts into Nora DB. Amit will look at Pib's code and then ask questions. Pib has a sample batch file that reads an input file, adorns it, and generates the output file.

Bill will take care of keeping the Monk SVN repository updated with Pib's latest versions of source code, and with the latest versions of the unadorned and adorned text files.

Martin asked to talk about repetitions. How do we capture and store multi-word patterns? Martin talked about his previous work with repetitions in Homer, where multi-word patterns are computed, stored, and indexed. Martin says it's not clear to him at the moment that this kind of pattern information is stored in this way in other systems like FeatureLens or D2K.

We do have all the data needed to do such things. We have the words, their order, and their attributes. Multi-word patterns are derived data, and can in theory be computed from the data we have. The technical details of computing, storing, and analyzing repetitions are difficult, however.
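As a rough illustration of why repetitions count as derived data, the sketch below computes repeated n-word sequences from nothing more than an ordered list of words. It is not the method used for the Homer work or by any Monk component; the function name `find_repetitions`, its parameters, and the sample token list are all hypothetical, and it ignores the hard parts mentioned above (scale, storage, indexing, and matching on lemmas or other attributes rather than surface forms).

```python
from collections import defaultdict


def find_repetitions(tokens, n, min_count=2):
    """Return each n-word sequence that occurs at least min_count times,
    mapped to the list of token positions where it starts.

    Hypothetical helper for illustration only.
    """
    positions = defaultdict(list)
    # Slide a window of n words over the text and record where each
    # distinct sequence begins.
    for i in range(len(tokens) - n + 1):
        positions[tuple(tokens[i:i + n])].append(i)
    # Keep only the sequences that actually repeat.
    return {
        gram: starts
        for gram, starts in positions.items()
        if len(starts) >= min_count
    }


# Made-up sample text, already split into ordered words.
tokens = "sing o goddess the anger of achilles son of peleus sing o muse".split()
repeats = find_repetitions(tokens, 2)
print(repeats)
```

Storing the start positions rather than just counts keeps the link back to the word-level data, so the attributes of each word in a repeated pattern could still be looked up afterwards.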
