|
MONK : Conference call, 2007 Oct. 2, Data
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Tuesday, October 2, 2007, 3-4 p.m. Central time
Present: Bill Parod (chair, secretary), Phil Burns (Pib), Loretta Auvil, Amit Kumar, Tim Cole, Martin Mueller
Bill: NU (John Norstad) is working on a search implementation with lexical and bibliographic constraints that returns its results as data frames with lemma/POS/spelling counts. If successful, this data access facility could also be used to feed sparse matrix inputs to the D2K Naive Bayesian analytic.
Loretta: D2K documentation for tables is coming along. The NB and SVM itineraries are being documented as well.
We discussed whether to use D2K as a workflow engine for raw text processing and whether that would be feasible in real time. That scenario was discouraged.
Bill: What might be the next practical steps with D2K for Sara's use case? Loretta described the input table structure: Each row in the matrix is a 'document', whatever we mean by 'document'. For this discussion each row probably represents a 'chunk'. It might also represent an arbitrary set of words, perhaps identified by a user as a span of text. Each column generally represents a feature, which can be a word, phrase, lemma/POS/spelling, and so on. Each cell is a count of that feature in the 'document' row. Each such 'document' is also labeled for classification. Loretta recommends at least 20 to 30 'documents' labeled in this way.
Amit: Amit is looking for recommendations for next steps on the proxy. He will convene a call next week to discuss the Proxy API among data producers and consumers.
Martin: Martin did some data checking on MorphAdorner output using NCF training data on the Wright corpus. He is pleased with the results he saw, even though expectations were modest given that custom training data has not been developed for Wright.
Tim: Tim asked about the Collection description effort. A Collection description would also include a statement about what is in the collection and why it is a collection.
Bill: Collection definitions vary and might sometimes seem arbitrary or idiosyncratic. This is one reason it's useful to have a collection-level record, so that when a user sees "19th Century Fiction" they can read a description of its coverage and how it came about, and have an appropriate expectation of what they'll find there. |
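Illustration of the input table Loretta described: the following is a minimal sketch in Python, not D2K code and not the format D2K actually reads. It only shows the shape of the table: one row per labeled 'document' (chunk), one column per feature, and one count per cell. The chunk contents, the labels "class_a" and "class_b", and the names build_table and to_sparse are all hypothetical, chosen for illustration. Three chunks are used for brevity, far fewer than the 20 to 30 labeled 'documents' recommended above.

```python
from collections import Counter

# Hypothetical labeled chunks: (classification label, feature tokens).
# Features are plain words here; they could equally be phrases or
# lemma/POS/spelling combinations.
chunks = [
    ("class_a", ["tears", "heart", "heart", "home"]),
    ("class_b", ["castle", "shadow", "heart"]),
    ("class_a", ["home", "mother", "tears"]),
]


def build_table(labeled_chunks):
    """Return (features, rows, labels).

    rows[i][j] is the count of features[j] in chunk i;
    labels[i] is the classification label of chunk i.
    """
    counts = [Counter(tokens) for _, tokens in labeled_chunks]
    # Column order: every feature seen in any chunk, sorted alphabetically.
    features = sorted(set().union(*counts))
    rows = [[c[f] for f in features] for c in counts]
    labels = [label for label, _ in labeled_chunks]
    return features, rows, labels


def to_sparse(rows):
    """Keep only non-zero cells: one {column_index: count} dict per row."""
    return [{j: n for j, n in enumerate(row) if n} for row in rows]


features, rows, labels = build_table(chunks)
sparse_rows = to_sparse(rows)

print(features)
for label, row in zip(labels, rows):
    print(label, row)
```

Most cells in such a table are zero once the vocabulary grows, which is why a sparse form like the one returned by to_sparse is a natural fit for the sparse matrix inputs mentioned in Bill's report.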
| Document generated by Confluence on Apr 19, 2009 15:04 |