This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
1 PM Thursday, June 7
Room 2000, NCSA Building, UIUC
Pizza will be provided
0. Introductions and cell reports
Data cell: our important stuff is at https://apps.lis.uiuc.edu/wiki/display/MONK/Data+Cell+Topics and our priority is to get things decided at this meeting so we can begin coding. Details here: https://apps.lis.uiuc.edu/wiki/display/MONK/May+29+2007+minutes
Interface cell: core application with optional modules. Three levels of modules: core, interesting but stable, highly experimental.
Analytics cell: Steve? Anyone seen Steve? What we need to know is how to reduce use cases to analytics, and what's important in them. What data, from what corpora, will drive the use cases? There are some things that are so simple we overlook them, but they must be there: find the item in the text, get there easily.
Uses and Users: Prioritize requirements list coming out of proposed use cases, and decide which of these things needs to be done next. Please review the list of requirements here and provide yes/no votes: Monk high level requirements - V1
Collaboration: Public web site is up. Our virtual neighborhood is surveyed (seven or eight related projects or tools); assignments are handed out for becoming familiar with these things. The problem with this cell in the long term is that everyone on it is also on other cells that are consuming their time. Another problem is that it's difficult to think about this part of the problem in the abstract.
Supercell:
I. Demo videos for Mellon
- Wordhoard demo (status)
- Nora demo (status)
Try to finish these by end of next week, ship them to Don. Northwestern will use Snap-Z, Maryland will use Camtasia. Five minutes apiece, on DVD.
II. Uses and Users
Review the first four use cases at
https://apps.lis.uiuc.edu/wiki/display/MONK/Uses+and+Users+Cell
- Repetition: Tri-grams are the unit of analysis used here to isolate patterns in a very long text with a great deal of repetition (Stein's The Making of Americans). Clustering was used for a while and then abandoned, but we'd like to bring it back. http://monk.lis.uiuc.edu:6060/openlaszlo-3.3.3-servlet/my-apps/featurelens/src/featurelens.lzx?debug=false&lzt=html is an experiment in meeting the needs of this use case. If N-grams are computed on a text that has been stemmed and lemmatized, and if results can be shown in the original unstemmed, unlemmatized words, then you can find patterns at both levels. Combinatorics are a challenge here, since we are creating a datastore that has many fields per record (word, lemma, part of speech, location, soundex, spelling variants, semantic marker, wordnet, etc.). But let's not rule out choices at the level of the datastore; let's handle that at the interface or user-education level. If you set too many parameters, you run the risk of creating a job that will never finish; if you choose 3 to 5 parameters, the job will be more manageable. Implement so that simple requests come back quickly, complex ones come back eventually, and complex ones are relatively simple to specify. Statistical information about higher-level structures (sentences, paragraphs, etc.) might be useful and interesting, but it can be computed on the fly and does not need to be indexed upon ingestion. In general, the data cell needs to make sure that it is not ruling out analytic possibilities, and should err on the side of keeping them open. The interface, too, should give the user flexibility in comparing results and keeping track of history, so that the user can choose which views of the data to put side by side.
- Sentimentality: data-mining to uncover low-level features that are associated with higher-level meaning. And since sentimentality is associated with cliche and stock language, these levels are pretty closely related in this case. Cross-referencing feature extractions will be important: I'm not just interested in whether "weeping" occurs, but words plus punctuation, like "God!" expressed as an interest in proper nouns plus exclamation mark. Select a section of text, and have that compared to the features in the datastore, and offer a choice of the features in that selection as the basis of finding patterns. Location as feature for analysis. Document as a contour map that shows the distribution/density of features. Or overlays of different feature-displays. Abstracting from this to a general case, the researcher wants to go back and forth between high-level and detailed examination, and you want some group and sort capabilities. A possible additional feature request: multiple feature ranking for classification, possibly with ranking for multiple characteristics in a single pass, or ranking for one in each of multiple passes, but merging and comparing results in various ways (which texts, or which words, are sentimental and erotic, which are sentimental but not erotic, which are erotic but not sentimental, etc.).
- (Bracketed Geographical Awareness, Transformation)
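A rough sketch of the Repetition idea above (n-grams computed on lemmas, results shown in the original word forms). The token records here are hypothetical (surface form, lemma) pairs, and the function names are made up for illustration; this is not the FeatureLens implementation.

```python
from collections import defaultdict

def lemma_trigrams(tokens):
    """Group trigrams by lemma sequence, remembering each surface variant.

    tokens is a list of (surface form, lemma) pairs. This is an assumed,
    simplified record; a real record would carry more fields
    (part of speech, location, and so on).
    """
    index = defaultdict(list)
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        key = tuple(lemma for _, lemma in window)
        surface = " ".join(word for word, _ in window)
        index[key].append((i, surface))  # position plus the words as written
    return index

def repeated(index, min_count=2):
    """Keep only lemma trigrams that occur at least min_count times."""
    return {key: hits for key, hits in index.items() if len(hits) >= min_count}

# Invented sample tokens, for illustration only.
tokens = [
    ("she", "she"), ("was", "be"), ("living", "live"),
    ("she", "she"), ("is", "be"), ("living", "live"),
]
hits = repeated(lemma_trigrams(tokens))
```

Here "she was living" and "she is living" fall under the same lemma trigram, so the repetition is found at the lemma level while the display keeps both surface forms and their positions.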
Data Store questions: can we have a single shared data-store that is the MONK testbed? Yes. It should be SVN, and from there, we need to have texts automatically extracted to be ingested into an xml database for searching and browsing, and into the MONK datastore for use with MONK tools. Let's bear in mind that we are building a testbed, not a permanent collection. We need to identify the clean-up that could be generalized and automated as part of the ingestion process, isolate the clean-up beyond that which needs to be done by hand and do it only for texts that are actually being used in use cases or texts that are being used as training data, and identify as a collaboration cell project thinking up ways for users to do that second kind of clean-up when MONK is deployed with library collections. Writing up problems with the data could be part of the final deliverable from MONK--also it could be part of the documentation for curators.
Milestones:
- By the end of June, PIB will produce a large collection of texts (the L testbed); meanwhile, texts on ariadne.northwestern.edu in e:\users\shared\monk\ncf\xml\adorned can be picked up (five or six texts, preferably ones that Sara could use).
9 AM Friday, June 8
Room 2000/2100 NCSA Building, UIUC
III. Parallel sessions:
Review Data (Room 2100)
See https://apps.lis.uiuc.edu/wiki/display/MONK/May+29+2007+minutes
- Review of Data Cell Milestones. https://apps.lis.uiuc.edu/wiki/display/MONK/Meetings+and+Milestones
- Do we have agreement on the basic elements of a MONK data object?
- Can we express this in an API that is made available to the other cells and that is versioned only by agreement with those cells?
- Let's figure out how, by the end of June, to be ready with a datastore, ingesting PIB's texts. NB: ingest-date is information that should be stored and exposed routinely.
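One possible shape for the questions above, as a discussion aid only. All class names, field names, and the version string are assumptions (the fields are borrowed from the per-record list in the Repetition use case); nothing here is an agreed MONK API.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical version label; changed only by agreement with the other cells.
API_VERSION = "0.1"

@dataclass(frozen=True)
class Token:
    word: str       # surface form as it appears in the text
    lemma: str
    pos: str        # part-of-speech tag
    location: int   # position of the token within the text

@dataclass
class MonkText:
    text_id: str
    ingest_date: date               # stored at ingestion, exposed routinely
    tokens: list = field(default_factory=list)

    def describe(self):
        """Summary that any client of the data cell could request."""
        return {
            "api_version": API_VERSION,
            "text_id": self.text_id,
            "ingest_date": self.ingest_date.isoformat(),
            "token_count": len(self.tokens),
        }

# Invented sample record, for illustration only.
doc = MonkText("demo-001", date(2007, 6, 30), [Token("weeping", "weep", "verb", 0)])
summary = doc.describe()
```

Carrying the version and the ingest date in every summary is one way to make both visible to the other cells without their having to ask.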
Data Cell Meeting notes
Collaboration (Room 2001)
- how do we get something started on this front?
- Second Life?
- who will do what, etc?
COLLABORATION CELL
Full Report Here:
https://apps.lis.uiuc.edu/wiki/display/MONK/6-8-07+Minutes
Kinds of collaboration:
between MONK and other projects (e.g. interact with MONK from within 2nd Life; exposing Gutenberg texts)
between members of the MONK team
between users of MONK
our users helping us with our work (e.g. data cleaning by the Distributed Proofreaders Project)
I want to load my own single document, say a novel, in feature lens. E.g. The Bostonians. How does that fit in with the larger collection? They may not have rights, and so on.
Other projects
Zotero: can we use it as a general annotation framework? You can grab snapshots of any webpage...
TAPoR: we want these co-operative rather than competitive. Texts in the wild with annotations in the world.
ManyEyes: for scientific visualizations.
Yahoo Pipes: we may not build more than one or two, but we need to provide a feed.
NINES and Collex: collection tools. You browse, harvest into a shopping cart, then create a collection that you curate as an exhibit. Networked Infrastructure for Nineteenth-Century Scholarship.
Digital Docket: Wayne McIntosh (Govt Studies) and Jimmy Lin (Info Science)
Project Gutenberg: Distributed proofreading
SEASR: their infrastructure
Second Life: too fun to take off the list. Can we render some of our data there?
DATA CELL
250-500 morphadorned texts by end of June.
Start from the Wordhoard data model and morph it. The NORA data can't express relations.
For 4-6 weeks, we'll try tough queries in different backends to compare performance.
Query times may be long.
Eranos: Electronic Research Annotation System (Greek for potluck), proposal to Mellon
Bill has the thought that annotations would be Fedora objects (Matt says the alternative is a DSpace world).
Martin
You can extract statistics, for instance, even from proprietary collections, and create a frequency-based diachronic lexicon that can be used as a backboard for comparison in profiling a text you do have access to.
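A minimal sketch of the backboard idea, assuming only word counts (not the texts themselves) are extracted from the restricted collection. The counts, the period label, and the function names are invented for illustration; a diachronic lexicon would hold one such table per period.

```python
from collections import Counter

def relative_freqs(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def profile(text_words, reference_counts):
    """Ratio of each word's relative frequency in the text to the reference.

    Values above 1 mean the word is more common in the text than in the
    reference lexicon; words missing from the reference are skipped.
    """
    text_rel = relative_freqs(Counter(text_words))
    ref_rel = relative_freqs(reference_counts)
    return {w: text_rel[w] / ref_rel[w] for w in text_rel if w in ref_rel}

# Hypothetical counts for one period of the reference lexicon.
reference_1850s = {"tears": 10, "heart": 30, "the": 960}
sample = "the tears the heart tears the the the the the".split()
ratios = profile(sample, reference_1850s)
```

In this toy case "tears" is heavily over-represented in the sample relative to the reference period, while "the" is slightly under-represented.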
To choose Wordhoard functions, have Sara try the 19th century texts and keep track of which functions were useful for her.
The Wright Archive. All the American novels published over 25 years. Civil war plus and minus a decade. Untrained classification and comparison across time.
Catherine
The basket of words. Martin asks: what about Roget's Thesaurus? There was a basket of words app called Mac Searcher, out of Stanford and "based on Pat out of Michigan"
Interface
We may want to provide a slider that shows response times that can be expected for various numbers of documents.
Key function from Martin: a concordance that gives you side-by-side display of arbitrarily chosen columns. Is Versioning Machine Java? John Norstad's concordance lets you get other information that the client can use for sorting and grouping (part of speech tag, before or after the hit, and so on). Very helpful.
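To make the concordance request concrete, here is a small sketch of a keyword-in-context table whose rows carry extra columns a client could sort or group on. The token format (word, lemma, part-of-speech tag), the tag names, and the column names are assumptions; this is not John Norstad's code.

```python
def concordance(tokens, target_lemma, width=3):
    """Return one row per hit, with context and neighboring POS tags.

    tokens is a list of (word, lemma, pos) triples (an assumed format).
    """
    rows = []
    for i, (word, lemma, pos) in enumerate(tokens):
        if lemma != target_lemma:
            continue
        left = tokens[max(0, i - width):i]
        right = tokens[i + 1:i + 1 + width]
        rows.append({
            "left": " ".join(w for w, _, _ in left),
            "hit": word,
            "right": " ".join(w for w, _, _ in right),
            "pos_before": left[-1][2] if left else "",
            "pos_after": right[0][2] if right else "",
        })
    return rows

# Invented sample tokens, for illustration only.
tokens = [
    ("she", "she", "pron"), ("wept", "weep", "verb"),
    ("bitterly", "bitterly", "adv"), ("and", "and", "conj"),
    ("he", "he", "pron"), ("weeps", "weep", "verb"), ("too", "too", "adv"),
]
rows = concordance(tokens, "weep", width=2)
by_right_context = sorted(rows, key=lambda r: r["right"])
```

Because each row is a plain record, the client can pick any columns to show side by side and sort or group on any of them, for example on the part of speech before the hit.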
Both Martin and Matt K feel strongly that we should provide a common visual branding.
Phil has created a calculator in Java/Swing.
Martin suggests we might have a collection greet you with a backboard of summary info.
1 PM
Lunch, Dos Reales, University Ave.
IV. Parallel sessions:
Review Interface decisions (Room 2000)
See https://apps.lis.uiuc.edu/wiki/display/MONK/Survey+of+Ajax+Technologies
- What are the outstanding issues in selecting a development environment or environments?
- What is the least bad choice?
- What are the shortcomings of that choice?
- Going forward on the basis of that choice, who will be coding which parts of the project, and how will they coordinate?
Analytics (Room 2000)
- What's required to do the use cases selected?
- What is the impact of the data model decisions on analytics?
- Who will do what in the development of the first MONK demo?
V. Milestones going forward:
- 2 months
- 4 months
- 6 months
- year
5 PM
Meeting concludes