This page last changed on Jun 28, 2007 by mkirschenbaum@gmail.com.

PRESENT: Matt K., Greg, Tanya, Matt B., Stan, James, Stefan, John U.

DISCUSSION

Thoughts on Google / OCA:

  • Uses for raw texts from OCA?
  • Usefulness depends on the error rate; past a certain rate the text is not acceptable even for introductory access to the material

Could untrained classification strategies be used to clean up OCR scans in the first place?

  • Gibberish from OCR is (sometimes) predictable gibberish

Workflow?

  • Get text from OCA
  • Some kind of classification
  • Give back to OCA

People would be able to withdraw materials from OCA and load them as a corpus

How many books are they producing?

  • OCA: about 215,000 books online, 12,000 more / month
  • Google: ~10 million as a goal

What kinds of projects can we take on using dirty books vs. clean ones?

  • uses of 1000 of each? uses of 10 of each?

Data not at a place where we could use it for serious analytics

  • reservations about its usefulness, but we haven't developed methodologies that correspond with using such a huge corpus
  • there are things we could do, but we don't yet have the analytical procedures
  • should think about what kinds of operations are meaningful with what kinds of material
  • tiered approach: if there are things that are useful to do with dirty OCR, we can do them, and fall back to Project Gutenberg or other resources
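The tiered approach could be sketched as a simple routing rule: estimate how dirty a text is, then either use it as is or fall back to a cleaner source. This is a minimal sketch under stated assumptions, not a method agreed at the meeting. The share of tokens missing from a word list is used as a rough proxy for OCR error rate, and the function names, tier labels, and the 0.2 threshold are all hypothetical.

```python
import string


def error_rate(text, lexicon):
    """Estimate OCR error as the fraction of tokens not found in lexicon.

    This is only a proxy: rare words and proper names also count as errors.
    """
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    if not tokens:
        return 1.0  # treat empty text as unusable
    unknown = sum(1 for t in tokens if t not in lexicon)
    return unknown / len(tokens)


def choose_tier(text, lexicon, max_rate=0.2):
    """Return 'dirty-ocr' if the text is clean enough to use directly,
    otherwise 'clean-fallback' (e.g. look for a Project Gutenberg edition).

    max_rate is a hypothetical threshold; the right value depends on the task.
    """
    if error_rate(text, lexicon) <= max_rate:
        return "dirty-ocr"
    return "clean-fallback"
```

Different operations would presumably tolerate different thresholds, so max_rate is left as a parameter rather than fixed.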

Operations to perform on large-scale corpora:

  • basic bibliographic metadata is there, processing tasks based on that
  • orthography, compare things over timeframes, things like spelling changes
  • some processes (orthography) that don't require all the words to be correct
    (so much data there that it works out in the end)
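As an illustration of an operation that tolerates dirty text, the sketch below tracks the share of one spelling against another by decade, using only publication year and raw text. It is a hypothetical example: the function name, the (year, text) input shape, and the word pair are assumptions. Misread tokens simply fail to match either spelling and drop out of the count.

```python
from collections import Counter, defaultdict


def variant_share_by_decade(books, old, new):
    """Return {decade: share of `new` among occurrences of `old` and `new`}.

    books is an iterable of (year, text) pairs. Tokens that match neither
    spelling, including OCR garbage, are ignored.
    """
    counts = defaultdict(Counter)
    for year, text in books:
        decade = (year // 10) * 10
        for token in text.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word in (old, new):
                counts[decade][word] += 1
    shares = {}
    for decade in sorted(counts):
        total = counts[decade][old] + counts[decade][new]
        shares[decade] = counts[decade][new] / total
    return shares
```

A call might look like variant_share_by_decade(books, "shew", "show"). Decades with no occurrence of either spelling are omitted from the result.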

If we can begin articulating specific scenarios w/ Google/OCA, that's how we can open up avenues toward those resources

Annotation:

Annotation for FeatureLens, or moving toward more abstract/generalizable model for project as a whole

John talked to Roy Rosenzweig and Dan Cohen about Zotero, possibly moving two projects together, shared project management

  • needed explanation about project management (Zotero has a close-knit team of programmers)
  • someone coordinating between the projects, discussing goals and deadlines
  • John Bradley's work on PLINY

Tim Cole
Formation of deep subcollections
Building of annotation, shared proofreading into these digital repositories

Key Questions: Are we annotating texts or states? What's the object of annotation?

Matt B.: "we're going to do history at the state level, why not do annotation at state level too?"

ManyEyes
Can browse visualizations, explore the same datasets you've been browsing, and then use them to make new visualizations

  • can attach the state of the visualization to comments about the visualization
  • ManyEyes is an IBM project, might be worth asking NCSA (Loretta) if those capabilities are the same you'd get from their project

User registration for MONK

  • common login (passport) for MONK

Multiple layers:
  • Annotation level, open, etc.
  • access to text, heavier layer

Distributed proofreading

FeatureLens text-analysis environment, sub windows, views (like Eclipse) for interface

Stefan mentioned idea for a Web-based implementation of Eclipse

ACTION ITEMS

John U. will push SuperCell about availability of a common data model.

Catherine will post requirements for FeatureLens annotation to list.

Matt will contact Loretta about ManyEyes/UIMA.

Matt B. will recommend some places to start for deep background reading on how state/history are conceptualized in software architectures.

James will send information about the user registration system in Nora.
