This page last changed on Apr 12, 2008 by martinmueller@northwestern.edu.

Present: Amit, Brian, Duane, John N., Loretta, Martin, Sara, Tanya

Brian reported on the progress of Abbot. Of 695 TCP-EEBO texts (~60 million words), all but 67 now parse under TEI-Analytics.

Additional post-meeting comments by Martin: We will work with the ~630 texts that currently parse. Most of the remaining texts will parse once certain adjustments have been made to the TEI-Analytics schema, notably to the content models for <sp> and <postscript>. We want these adjustments to be friendly amendments to TEI and are still working on that process.

We can now say that all the texts currently envisaged for inclusion in MONK I exist in TEI-A format. They include:

  1. 630 TCP-EEBO texts (~50 million words)
  2. 250 NCF texts from British fiction between 1780 and 1900 (~40 million words)
  3. 300 Wright American fiction texts (~40 million words)

It would be trivial to include selections from Early American fiction or DocSouth if there are use cases that require them. These collections will not pose new parsing problems.

Amit and Sara reported on the use of a decision tree algorithm with the NCF data. Sharable results should be forthcoming within days.

Amit discussed visualization routines for the results that might work with Meandre and with the Monk interface.

Duane and Loretta reported on the porting of text analytics from D2K into SEASR. These fall under the three broad headings of classification, clustering, and information extraction.

Loretta and Tanya reported on some experiments with extracting named entities and associating them with other entities or with groups of words from particular semantic fields (e.g., color).

Loretta will talk about Monk and FeatureLens at the forthcoming ICDM data mining conference in Atlanta.

We discussed a possible trip by Amit and Steve or Brian to Evanston in May to tackle workflow problems.
