This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Tuesday, October 2, 2007. 3-4 p.m. central time

Present: Bill Parod (chair, secretary), Phil Burns (Pib), Loretta Auvil, Amit Kumar, Tim Cole, Martin Mueller

Bill - NU (John Norstad) is working on a search implementation with lexical and bibliographic constraints that returns its results as data frames with lemma/POS/spelling counts. If successful, this data access facility could also be used to feed sparse matrix inputs to the D2K Naive Bayesian analytic.

Loretta - D2K documentation is coming along for tables. NB and SVM itineraries are being documented as well.
Martin asked how one judges the advantage of NB vs. SVM. Loretta: By checking accuracy.
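
As a rough illustration of "checking accuracy", the sketch below compares two sets of predictions against the same held-out labels. It is not D2K code: the function name `accuracy`, the labels, and the prediction lists are all made up for the example and do not come from any actual NB or SVM run.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one labeled example")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical held-out labels and classifier outputs (illustration only).
held_out_labels = ["a", "b", "a", "b"]
nb_predictions = ["a", "b", "b", "b"]
svm_predictions = ["a", "b", "a", "b"]

nb_accuracy = accuracy(nb_predictions, held_out_labels)
svm_accuracy = accuracy(svm_predictions, held_out_labels)
```

Both classifiers would be scored on the same held-out 'documents', which were not used for training, so that the two accuracy figures are comparable.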

We discussed whether to use D2K as a workflow engine for raw text processing and whether that would be feasible in real time. That scenario was discouraged.

Bill: What might be the next practical steps with D2K for Sara's use case?
Sara would create training data by labeling 'chunks'. Those labels and their identified chunks would be presented to D2K. D2K would make requests (using a custom InputModule) to the data store for (lemma/POS/spelling) count data for Sara's identified chunks and create a sparse matrix representing the training set.

Loretta described the input table structure: Each row in that matrix is a 'document', whatever we mean by 'document'. For this discussion each row probably represents a 'chunk'. It might also represent an arbitrary set of words, perhaps identified by a user as a span of text. Each column generally represents a feature, which can be a word, phrase, lemma/POS/spelling, ... Each cell is a count of that feature in the 'document' row. Each such 'document' is also labeled for classification. Loretta recommends at least 20 to 30 'documents' labeled in this way.
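
The table structure described above might be sketched as follows. This is a minimal illustration in plain Python, not the D2K table API or the planned InputModule: the function name `build_sparse_matrix`, the class labels, and the `lemma/pos` feature strings are hypothetical. Each row is stored as a mapping from column index to count, so features absent from a chunk take no space.

```python
from collections import Counter


def build_sparse_matrix(chunks):
    """Build a sparse count matrix from labeled chunks.

    chunks: list of (label, features) pairs, where features is the list
    of feature occurrences (e.g. lemma/POS strings) found in one chunk.
    Returns (rows, labels, feature_index); each row maps a column index
    to the count of that feature in the chunk.
    """
    feature_index = {}  # feature -> column index, in order of first appearance
    rows = []
    labels = []
    for label, features in chunks:
        row = {}
        for feature, count in Counter(features).items():
            column = feature_index.setdefault(feature, len(feature_index))
            row[column] = count
        rows.append(row)
        labels.append(label)
    return rows, labels, feature_index


# Two hypothetical labeled chunks (illustration only).
chunks = [
    ("sentimental", ["heart/n", "weep/v", "heart/n"]),
    ("gothic", ["castle/n", "heart/n"]),
]
rows, labels, feature_index = build_sparse_matrix(chunks)
```

A real training set would contain at least the 20 to 30 labeled 'documents' Loretta recommends, with counts supplied by the data store rather than computed from token lists.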

Amit: Amit is looking for recommendations for next steps on the proxy. He will convene a call to discuss Proxy API among data producers and consumers next week.

Martin: Martin did some data checking on MorphAdorner output using NCF training data on the Wright corpus. He is pleased with the results he saw, even though expectations were modest given that custom training data has not been developed for Wright.

Tim: Tim asked about the Collection description effort.
Amit: Kelly is working on this. We expect descriptions for the first 2 or 3 collections in the coming weeks.
Martin: How does this intersect with Brian's work? Amit: Brian's effort is at the work level, Kelly is working at the collection level.
Martin: What is needed for Collection description more than what is obtained from its works?
Amit: We want to see a genre classification vocabulary, a METS expression, a formalized chunk-level vocabulary, and Dublin Core for the collection as a whole, including its size, title, access rights, audience, ...

Tim: A Collection description would also include a statement about what is in the collection and why it is a collection.
Amit: and perhaps how it is related to other collections.
Martin: Who uses that information? Amit: Used for search and browse.

Bill: Collection definitions vary and might sometimes seem arbitrary or idiosyncratic. This is one reason it's useful to have a collection-level record, so that when a user sees "19th Century Fiction" they can read a description of its coverage and how it came about, and have an appropriate expectation of what they'll find there.
