This page last changed on Mar 31, 2008 by unsworth.

Present: Catherine, Martin, Stan, John

Data/Analytics cell:

TCP texts are not quite done: a few things remain, but Brian is back this morning and they should be finished in the next few days. NCF texts validate under TEI-A.

Once the texts are morphadorned and ingested into a MONK datastore, is the datastore they go into the same one that shows up in the workbench demo? There is a datastore at Northwestern and one on monk.lis.uiuc.edu. The latter should be the one that is public to the project; the former can remain as a development version. To avoid confusion, we need to have just one that is public to the project (so that when we ask "are Sara's texts up?" we'll know which datastore we're talking about).

Training sets have been updated, by the way, so some of the texts that were ingested some time ago might benefit from being re-done in morphadorner.

Witchcraft texts are the next priority, though, for testing the training sets, to see whether acceptable quality can be produced.

Next mini-hackfest in April is to get Sara working through the MONK interface, rather than having her work through SEASR. Stopword, stemming, and lemmatization features cannot be turned on or off in SEASR, but the latter two are in the datastore; stopwords are problematic for certain purposes (as Bei's research showed). In a workbench scenario, one could choose a stop-list or design one (pronouns, say) on the fly. Do we know what analytics are available? We need more back and forth with SEASR about what analytics are appropriate to literary texts. We need a priority list with an ETA for analytics. Decision trees, which Sara asked for, will be done soon. Dunning's log likelihood, Burrows's Delta... or, more generically, a nearest-neighbor classifier. Could we have some documentation of the characteristics of different analytic tools, aimed at end users?
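
As a rough illustration of one of the analytics named above, here is a minimal sketch of Dunning's log likelihood (G2) for a single word compared across two collections, computed over a 2x2 contingency table (word vs. all other words, collection A vs. collection B). The function name and its arguments are hypothetical and are not taken from MONK or SEASR; the sketch assumes both collections are non-empty.

```python
import math


def log_likelihood(count_a, total_a, count_b, total_b):
    """Dunning's G2 for one word in two collections.

    count_a / count_b: occurrences of the word in collection A / B.
    total_a / total_b: total word tokens in collection A / B.
    All names here are illustrative assumptions, not a MONK or SEASR API.
    """
    # Observed 2x2 table: rows are collections, columns are
    # (this word, every other word).
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    row_totals = [total_a, total_b]
    grand_total = total_a + total_b
    col_totals = [count_a + count_b, grand_total - (count_a + count_b)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            # Expected count if the word were equally frequent in both.
            e = row_totals[i] * col_totals[j] / grand_total
            if o > 0:  # a zero cell contributes nothing (0 * ln 0 = 0)
                g2 += o * math.log(o / e)
    return 2.0 * g2
```

A word with the same relative frequency in both collections scores zero; the score grows as the frequencies diverge. The score does not say which collection overuses the word, so an end-user display would also need to compare the two relative frequencies.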

Workbench interface: it doesn't work and is a long way from being usable right now. Andrew is doing most of the development (a lot for one person...). Catherine spent an hour on the phone with him, sorting out the urgent from the not-so-urgent and getting tasks sorted out. CP: let's get back at least to the level of functionality of the open-laszlo NORA: run, rate, get results. Selecting and deselecting items in large collections, like NCF, is very slow. There is a long enough to-do list to keep several people occupied. Alejandro has been working on the search function; we need to know what we should be looking at, though, otherwise we assume nothing has happened.

Would it make sense to have one version of the interface that we could plateau as a working version and another with development features, plus a public demo with a small text collection drawn from publicly available collections? We do need a flag at the collection or document level that indicates open access; at the moment, for us, that covers everything except NCF and TCP. We need another flag for "clean and working" data vs. problematic data needing cleanup.
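
To make the two flags concrete, here is a minimal sketch, assuming each collection carries a boolean for open access and another for "clean and working". The class, field, and function names are hypothetical and not part of the MONK datastore. Only the open-access status of NCF and TCP comes from these notes; the other entries and all the `clean` values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Collection:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    open_access: bool  # may be shown in the public demo
    clean: bool        # "clean and working" vs. needing cleanup


def public_demo_collections(collections):
    """Keep only collections that are both open access and clean."""
    return [c for c in collections if c.open_access and c.clean]


# Illustrative data: NCF and TCP are not open access (per the notes);
# the remaining entries and every `clean` value are invented.
collections = [
    Collection("NCF", open_access=False, clean=True),
    Collection("TCP", open_access=False, clean=True),
    Collection("Example open collection", open_access=True, clean=True),
    Collection("Example problematic collection", open_access=True, clean=False),
]

demo = public_demo_collections(collections)
```

The same two booleans could sit at the document level instead; the filter would not change.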

Stan will ask for more reporting. JMU will ask Stefan about an intern for ManyEyes.

SEASR: from the point of view of Data/Analytics, the most useful thing to hear from them would be an indication of which half a dozen analytics would be most useful; Martin would then spend time describing them in terms that we could use with end users.

Document generated by Confluence on Apr 19, 2009 15:05