This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Discussions:

  1. Objects in MONK
    Collection Metadata
    Chunk/Work part
    Word
    Bibliographic Data
  2. Architecture
    Proxy, D2K server, relational database and Tuple based approach
    Separation of Interface and the services
    A model of GUI components at the server end akin to JSF
  3. Areas of responsibility in the data cell
    Data processing/Indexing/Ingestion
    Storing the output from the tagger to a persistent datastore for efficient querying.
    Storing the bibliographic information and metadata about the document and collection for search and retrieval.
    Provides count/frequency-based data to the D2K itinerary for data mining.
    Provides batch processing facilities for long-running, on-demand data crunching (for example, certain higher-order n-grams).
    Provides a process or task monitoring API that allows long-running jobs to continue in the background and report back once the task is finished.
    Named Entity Extraction
    Data pre-processing
    Morphological Tagger
    -Output format?
    Web Services layer
    -Data Mining and D2K
    -Collections and works discovery/visualization for selection.
    ...
    Cataloging the data and Itinerary
  4. Checkpoints and Milestones.
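
A minimal sketch of the task monitoring idea from item 3 (long-running jobs continue in the background and report back when finished), using higher-order n-gram counting as the example job. It is written in Python for brevity; `TaskMonitor`, `count_ngrams`, and the status strings are hypothetical names for illustration, not part of MONK or D2K.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import uuid


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class TaskMonitor:
    """Hypothetical monitor: submit a job, poll its status, fetch the result."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures = {}

    def submit(self, func, *args):
        # Start the job in the background and hand back an ID to poll with.
        task_id = str(uuid.uuid4())
        self._futures[task_id] = self._pool.submit(func, *args)
        return task_id

    def status(self, task_id):
        future = self._futures[task_id]
        if not future.done():
            return "running"  # queued or still executing
        return "failed" if future.exception() else "finished"

    def result(self, task_id, timeout=None):
        # Blocks until the job completes (or the timeout expires).
        return self._futures[task_id].result(timeout=timeout)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

A caller would `submit(count_ngrams, tokens, 3)`, keep the returned ID, and check `status()` later instead of waiting on the request.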

Conference call, 2007 June 8, Data

Identify work we can do on Monday

AK: take the WordHoard data model and hook itineraries to it
WordHoard client not relevant
Hibernate part of the model
JN: two options: talk JDBC to the database and forget about Hibernate,
or use Hibernate
Try out performance issues in Hibernate vs. JDBC
try
Wright Selection, DocSouth subcollections
NCF
TCP

RDF triple store: can it do a better job than SQL?
stuff in MONK not in WordHoard; higher-level relationships
use RDF for the higher-order relations
Jena allows parallel progress with regard to relational and RDF models
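
To make the "RDF for higher-order relations" point concrete, here is a toy in-memory triple store with wildcard pattern matching. This is not the Jena API; `TripleStore`, the predicate names, and the `work:`/`collection:` identifiers are all made up for illustration.

```python
class TripleStore:
    """Toy store of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


ts = TripleStore()
ts.add("work:1", "partOf", "collection:A")
ts.add("work:2", "partOf", "collection:A")
ts.add("work:1", "adaptationOf", "work:2")  # a higher-level relationship
```

New relationship types are just new predicates, with no schema change, which is the flexibility being weighed against SQL here.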

Real questions
Many RDF models against the

Morphadorned tables into datastore

JN
parts of WordHoard we don't care about
parts of MONK not in WordHoard
parts of WordHoard that don't work

focusing on weak points is a good way of moving forward

  • SVN for adorned text and derivatives and bibliographic searches
  • put documents back into eXist
  • inline vs. standoff markup
  • Franken file has advantages (one file to have everything), but higher-level adornments should be in separate files - PIB
  • The issue with SVN is that versioning is not the only thing; the derivative data could be a database/METS profile.
  • Support for multiple users, with their own metadata - Duane
  • Managing the XML documents
    What kind of management is required?
    Document ID
    List of text available
    List of collections available.
    Status of the text: Has named entity reference been done? Where is it stored?
    Does the document parse? Error message if it does not
    Is it in SVN?
    Attribute sets: these can be used for any of these workflows.

Morphadorner workflow: Validated -> Entity references expanded -> various other things (missed that)

SVN -> Validate -> TEI Simple -> UTF-8 -> Morphadorner -> Persistent Store
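
The workflow line above can be sketched as an ordered list of stages that records how far each document got and the error message if a stage fails. The stage names come from the notes; every function body is a placeholder assumption. In particular, `validate` only checks XML well-formedness, the TEI Simple and Morphadorner stages are no-ops standing in for the real tools, and the store is a plain dict.

```python
import xml.etree.ElementTree as ET


def validate(doc):
    # Well-formedness check only; raises ParseError on broken XML.
    ET.fromstring(doc["text"])
    return doc


def to_utf8(doc):
    doc["bytes"] = doc["text"].encode("utf-8")
    return doc


def placeholder(doc):
    # Stand-in for a stage whose real tool is outside this sketch.
    return doc


STAGES = [
    ("Validate", validate),
    ("TEI Simple", placeholder),
    ("UTF-8", to_utf8),
    ("Morphadorner", placeholder),
]


def run_workflow(doc_id, text, store):
    """Run the stages in order; stop at the first failure and record it."""
    doc = {"id": doc_id, "text": text, "completed": [], "error": None}
    for name, stage in STAGES:
        try:
            doc = stage(doc)
        except Exception as exc:
            doc["error"] = f"{name}: {exc}"
            break
        doc["completed"].append(name)
    store[doc_id] = doc  # "Persistent Store" is just a dict here
    return doc
```

The `completed` and `error` fields answer the management questions listed earlier: which steps a text has been through, and what the parse error was if it did not get past validation.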

Who|What|Property|Date

John U will create this Application.
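
One possible shape for the Who|What|Property|Date record is a single log table. The sketch below uses an in-memory SQLite database; the reading of the four fields (person, document ID, status attribute, ISO date) and the names `status_log`, `record`, and `properties_of` are assumptions, not decisions from the call.

```python
import sqlite3


def open_log():
    """Create an in-memory log with the four columns from the notes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE status_log (who TEXT, what TEXT, property TEXT, date TEXT)"
    )
    return conn


def record(conn, who, what, prop, date):
    # One row per event: who set which property on what, and when.
    conn.execute(
        "INSERT INTO status_log VALUES (?, ?, ?, ?)", (who, what, prop, date)
    )


def properties_of(conn, what):
    """List the properties recorded for one document, oldest first."""
    rows = conn.execute(
        "SELECT property FROM status_log WHERE what = ? ORDER BY date", (what,)
    )
    return [row[0] for row in rows]
```

With ISO-formatted dates, ordering by the text column is also chronological, so the history of a document reads off directly.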

Document generated by Confluence on Apr 19, 2009 15:04