This page last changed on Sep 29, 2008 by unsworth.

Supercell Call:

Present: Matt, Martin, Stefan, John, Amit, Steve, Catherine

Topics:

Edit some of the proceedings of MONK? Steve, Brian, and Martin will be giving a talk in November: perhaps stage it as an argument.

Philologic: Amit asks: what are we looking at using? Do we see it as something complementary (a web service), are we providing them data, or are we wrapping up some or all of it in SEASR?
Martin: the 3 big Illinois libraries can work productively together on questions of text analysis, beyond MONK. The more immediate issue is search and sort: we haven't made headway on that problem in MONK. Could we make use of Philologic for that, perhaps as an alternate data store for some kinds of things? They have thought a lot about text search, and we can learn from the facilities they provide.
Amit: there are a few months left in this project, so we can't move off the critical path to something that needs completion. Still, one thing might be promising: the web service interface. We don't have a good keyword-in-context search (done on the client side), though we do have a search facility, so you can get all the functionality John N. has provided (get work, work part, etc.). But the business of seeing where words, lemmata, and n-grams occur might be something to ship off to Philologic as a web service.
Martin: two issues: what to do in the next few months, and what happens beyond that. We need to make sure that our datasets will be usable in a variety of ways (by non-technical users).
Amit: some aspects of interface usability at the workbench level are complicated because we are trying to provide something generalized. Displaying Philologic-type information isn't in itself difficult, but integrating it into the workbench could be.
John: first find out from Mark how close Philologic's web service is. If it is close or available, we could look at integrating a couple of key pieces of functionality as a proof of concept (KWIC, for example). We'd need to figure out when to ship them data, with IDs that are consistent for both datastores (currently assigned at the Morphadorner stage).
Martin: Meet at Chicago digital colloquium, around November 1st? Teleconference first, to do some planning? Tuesday Oct. 28th for a call? Martin will also visit around the 6th and discuss some of this as well.
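
To illustrate the keyword-in-context (KWIC) lookup discussed above, here is a minimal Python sketch. It is not Philologic's or MONK's implementation: the function name kwic, the width parameter, and the naive regex tokenizer are all assumptions made for this example, and a real datastore would match on the token IDs assigned at the Morphadorner stage rather than on raw strings.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, match, right context) for each hit.

    Hypothetical helper for illustration only: tokenization is a plain
    regex split on word characters, and matching is case-insensitive.
    """
    tokens = re.findall(r"\w+", text)
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            # Up to `width` tokens on either side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits
```

For example, kwic("The whale swam. The white whale dived deep.", "whale", width=2) returns two hits: ("The", "whale", "swam The") and ("The white", "whale", "dived deep").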

Wordle and ManyEyes: we aren't going to be able to incorporate Wordle or other ManyEyes code: too many lawyers at IBM. Our goal, for the time being, has to be to ship data to hosted applets at ManyEyes. The most useful thing has been using Dunning's log-likelihood as input, to give relative frequencies, with the damper for adjusting extreme values. We need to build that facility into the interface, using whatever ManyEyes/Wordle offers right now. It is the same kind of thing as Google Charts: a final output, a dead end, as it were, with no input coming back.
Catherine: do we put our energy into creating our own tag cloud? Or do we put time into the possible complications of working with someone else's stuff, even if it is complex?
Martin: I vote for the first. John: I vote for the second. Hmm.
Martin: tag clouds that show relative frequency would be very useful, for A and B corpora. Duane has done this, I think?
Amit: Duane has done some of these things, and we can re-use his code once he has his first demo out. We can integrate it into the workbench, but Stan or Andrew need to be here, as developers, to say how these things can actually be done. We have started using Trac, the task-ticketing system, and we have to be realistic about what can be done.
John: Google Charts?
Amit: Mike Plouffe was working on it, but I will pick it up from here. Not a big problem: a week.
John: it's a worthy proof of concept to take our output and put it into someone else's input, especially if it buys a meaningful visualization, but no matter what we do, we have to have someone to do it.
Amit: Dashiki's funding is running out about the same time as MONK's. The dynamic URL for data-polling is going to be done within Dashiki, by Matt McKeon.
Stefan: after the last conference call, we left it that we'd continue exchanging email, and we'll get a sneak preview of Dashiki.
Perhaps the best thing is to ask Martin (at ManyEyes) about a REST-based solution for input of data to a hosted ManyEyes applet. Stefan will ask Matt about the URL.
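
For reference, the Dunning scoring mentioned above can be sketched in Python as follows. This uses the common two-term form of the log-likelihood ratio (G2) for one word in corpus A versus corpus B, signed so that positive values mean the word is relatively more frequent in A. The damp function is purely hypothetical: the notes do not say how MONK's damper works, so a simple power compression stands in for it here. All names and parameters are assumptions for illustration.

```python
import math


def dunning_g2(count_a, total_a, count_b, total_b):
    """Signed two-term log-likelihood ratio for one word in corpora A and B.

    count_a, count_b: occurrences of the word in each corpus.
    total_a, total_b: total word counts of each corpus (must be > 0).
    """
    combined = count_a + count_b
    if combined == 0:
        return 0.0
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    g2 *= 2.0
    # Sign the score by which corpus uses the word relatively more often.
    overused_in_a = count_a / total_a >= count_b / total_b
    return g2 if overused_in_a else -g2


def damp(score, exponent=0.5):
    """Hypothetical damper: compress extreme scores while keeping their sign."""
    return math.copysign(abs(score) ** exponent, score)
```

The damped scores could then serve as the word weights shipped to a tag-cloud applet, so that a handful of very high-scoring words do not crowd out everything else.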

FeatureLens: Duane sent some email with a question about how we might integrate aspects of FeatureLens into MONK proper. Amit: how do we bring in TalesTek, Duane's clustering stuff? That will involve solving the applet integration problems we've been having: mid-November? If we can do that, we can probably look at bringing in FeatureLens. We could treat it as a separate thing, since it needs its own datastore. This is, then, like Philologic: we need to fork off a separate data store in each case, after morphadorning, with IDs. A future project could look at integration across different applications, with consistently ID'd documents.

81.5M words currently in the data store: 60 more novels coming from Wright; a dozen long TCP texts coming (will probably double the size of the TCP corpus) and perhaps some EAF novels from Virginia. End size will probably be 160M words. Public access datastore: Wright plus some of the EAF texts. Martin could also ask TCP (Mark Sandler) whether they would be willing to have some subset made available (early modern drama). ProQuest/NCF has a lot of the canonical British authors: Martin will follow up as well, through Jeff Garrett. We'll then plan to put out a separate, open access datastore with the final release of MONK, and we should use that set as the basis of a Philologic and/or FeatureLens datastore.
