This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Present: Bill Parod (chair, secretary), Phil Burns (Pib), Amit Kumar, Vered Goren, Martin Mueller, Sara Steger, Brian Pytlik Zillig, Tim Cole

Data Cell welcomes Brian Pytlik Zillig to the cell.

Agenda Item:
1) Sara Steger analysis status:

Sara has four sets of chapter level work sets with the following classifications:
Highly sentimental chapters
Anti-sentimental chapters
Random selection of sentimental chapters
Random selection of unsentimental chapters

Amit will meet with Vered to discuss selection and operation of an appropriate D2K itinerary to use these sets to rank all NCF chapters for sentimentality.

Bill provided a D2K InputModule for data access and sparse matrix creation, which Vered has reviewed and finds straightforward.
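For readers unfamiliar with the idea, the following is a minimal sketch of what a chapter-by-lemma sparse matrix might look like. It is not Bill's D2K InputModule; the function name `build_sparse_matrix`, the input layout (chapter id mapped to a list of lemmas), and the sample lemmas are all assumptions made for illustration.

```python
from collections import Counter


def build_sparse_matrix(chapters):
    """Build a sparse chapter-by-lemma count matrix.

    `chapters` is a hypothetical input: a dict mapping a chapter id to
    the list of lemmas that occur in that chapter.
    Returns (vocab, rows), where vocab maps lemma -> column index and
    rows maps chapter id -> {column index: count}. Zero counts are
    simply not stored, which is what makes the matrix sparse.
    """
    vocab = {}
    rows = {}
    for chapter_id, lemmas in chapters.items():
        row = {}
        for lemma, count in Counter(lemmas).items():
            # Assign the next free column the first time a lemma is seen.
            column = vocab.setdefault(lemma, len(vocab))
            row[column] = count
        rows[chapter_id] = row
    return vocab, rows


# Illustrative data only, not taken from the NCF chapters.
vocab, rows = build_sparse_matrix({
    "c1": ["tear", "weep", "tear"],
    "c2": ["weep"],
})
```

Storing only the non-zero counts keeps memory use proportional to the number of distinct lemmas per chapter rather than to the full vocabulary.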

Vered suggests we use itineraries 305 309
Amit will include new itineraries with web service.
We will probably see early results by next week. This will include prediction classes for all the chapters with a confidence level for each prediction. Results will also include ranks of influential lemmas.

Agenda Item:
2) Wright corpus preprocessing

Brian has been busy updating the teisimple content model for <w>.
Brian has developed a method to convert the Wright texts to teisimple. He is ready to perform that conversion.
We discussed who will then perform morphological 'adornment' of the Wright teisimple texts, and the benefits of Nebraska undertaking that step with MorphAdorner (MA). Doing so would tease out what additional documentation or operational aids are needed. We would like to see that experience carefully documented in order to capture workflow issues, so that the process can be formalized and automated in the future. Brian will check with Steve whether this is in scope for Nebraska's work on MONK. MA is a command-line program that generates adorned files. Pib will have a new version next week, incorporating better training data and processing.

Brian asked about <orig> handling.

Tim Cole asked how we verify the results of adornment. Martin described his verification process using Microsoft Access. In addition to XML output, MA can provide tabular output with KWIC. Martin uses group and sort routines as well as sampling of 10k or so words to check results. He can usually extrapolate from there whether there are problems or results are good.
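The group, sort, and sample checks Martin describes could be approximated outside Access along the following lines. This is only a sketch under stated assumptions: the row layout (spelling, part of speech, lemma), the function name `summarize_adornments`, and the sample tags are hypothetical and do not reflect MA's actual tabular format.

```python
import random
from collections import Counter


def summarize_adornments(rows, sample_size=10000, seed=0):
    """Group adorned tokens and draw a random sample for spot checks.

    `rows` is a hypothetical input: an iterable of
    (spelling, pos, lemma) tuples, one per adorned token.
    Returns (ambiguous, sample):
      ambiguous - spellings that received more than one analysis, each
                  with its (count, pos, lemma) entries, most frequent first
      sample    - up to `sample_size` randomly chosen rows for manual review
    """
    rows = list(rows)
    counts = Counter(rows)

    # Group the distinct analyses by spelling.
    by_spelling = {}
    for (spelling, pos, lemma), n in counts.items():
        by_spelling.setdefault(spelling, []).append((n, pos, lemma))

    # Keep only spellings with competing analyses, sorted by frequency.
    ambiguous = {
        spelling: sorted(analyses, reverse=True)
        for spelling, analyses in by_spelling.items()
        if len(analyses) > 1
    }

    # A fixed seed makes the review sample reproducible.
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return ambiguous, sample


# Illustrative rows only; the tags are placeholders.
example_rows = [
    ("love", "vvi", "love"),
    ("love", "n1", "love"),
    ("love", "n1", "love"),
    ("the", "dt", "the"),
]
ambiguous, sample = summarize_adornments(example_rows, sample_size=2)
```

Rare analyses of a frequent spelling tend to surface tagging errors quickly, while the random sample gives a basis for judging overall quality.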

Pib has a fix for the <orig> problems in Wright. The MONK wiki has a memo from Pib on how MA handles split tokens in general. <orig> handling conforms to this. See MorphAdorner XML Output; see also Handling orig tags in Wright texts (archive).

We expect to process approximately 300 Wright texts, taken from the first 3 years, the last 3 years, and the (Civil) War years.

Martin: Wright conversion is the first process of monkification where we use all the tools that we are likely to have. We should take special note of this as it informs process in the future.

Tim: How important is the 'collection' model for processing? What about 'loose' individual texts? This is also a relevant processing scenario for us.

Document generated by Confluence on Apr 19, 2009 15:04