This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Discussions:

  1. Objects in MONK
    Collection Metadata
    Chunk/Work part
    Word
    Bibliographic Data
  2. Architecture
    Proxy, D2K server, relational database and Tuple based approach
    Separation of Interface and the services
    A model of GUI components at the server end akin to JSF
  3. Areas of responsibility in the data cell
    Data processing/Indexing/Ingestion
    Storing the output from the tagger to a persistent datastore for efficient querying.
    Storing the bibliographic information and metadata about the document and collection for search and retrieval.
    Provides count/freq based data to the D2K itinerary for data mining.
    Provides batch processing facilities for long lasting on demand data crunching like certain higher order ngrams (for example).
    Provides a process or task monitoring API that would allow the long lasting jobs to continue in the background and report back once the task is finished.
    Named Entity Extraction
    Data pre-processing
    Morphological Tagger
    -Output format?
    Web Services layer
    -Data Mining and D2K
    -Collections and works discovery/visualization for selection.
    ...
    Cataloging the data and Itinerary
  4. Checkpoints and Milestones.
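
The batch-processing and task-monitoring items under point 3 could look roughly like the sketch below: a long-running n-gram count is submitted as a background job, and the caller polls for its status or collects the result later. The names `TaskMonitor`, `count_ngrams`, and the status strings are hypothetical illustrations, not part of any MONK or D2K API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import uuid


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class TaskMonitor:
    """Hypothetical monitor: submit a job, poll its status, fetch the result."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._tasks = {}

    def submit(self, func, *args):
        # Start the job in the background and hand back an opaque task ID.
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = self._pool.submit(func, *args)
        return task_id

    def status(self, task_id):
        # "running" covers both queued and executing jobs.
        future = self._tasks[task_id]
        if not future.done():
            return "running"
        return "failed" if future.exception() is not None else "finished"

    def result(self, task_id):
        # Blocks until the job is done, then returns its value.
        return self._tasks[task_id].result()

    def shutdown(self):
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    monitor = TaskMonitor()
    tokens = "the cat sat on the mat the cat sat".split()
    task_id = monitor.submit(count_ngrams, tokens, 3)
    trigrams = monitor.result(task_id)
    print(monitor.status(task_id), trigrams.most_common(1))
    monitor.shutdown()
```

A real service would also need persistence of task state across restarts, which this in-memory sketch does not attempt.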

Conference call, 2007 June 8, Data

Identify work we can do on Monday

AK take wordhoard data model and hook itineraries to it
Wordhoard client not relevant
Hibernate part of the model
JN: two options: talk JDBC to the database and forget about Hibernate,
or use Hibernate
Try out performance issues in Hibernate vs. JDBC
Try with:
Wright Selection, DocSouth subcollections
NCF
TCP

RDF triple store: can it do a better job than SQL?
stuff in Monk not in WordHoard; higher level relationships
use the RDF for higher order relations
JENA allows parallel progress with regard to relational and RDF models
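
To make the "higher order relations" idea concrete, here is a minimal in-memory sketch of the triple model: every statement is a (subject, predicate, object) tuple, and a query is a pattern with wildcards. This is not the Jena API; the class `TripleStore` and the identifiers in the example (`work:1`, `inCollection`, and so on) are invented for illustration.

```python
class TripleStore:
    """Toy triple store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple for triple in self._triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )


if __name__ == "__main__":
    store = TripleStore()
    # Hypothetical higher-level relations that a word-level schema might lack.
    store.add("work:1", "inCollection", "collection:A")
    store.add("work:2", "inCollection", "collection:A")
    store.add("work:1", "hasGenre", "genre:novel")
    for triple in store.match(predicate="inCollection", obj="collection:A"):
        print(triple)
```

New relation types need no schema change here, which is the flexibility being weighed against SQL; whether that outweighs query performance is the open question in these notes.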

Real questions
Many RDF models against the

MorphAdorned tables into datastore

JN
parts of wordhoard we don't care about
parts of Monk not in Wordhoard
parts of wordhoard that don't work

focusing on weak points is a good way of moving forward

  • SVN for adorned text and derivatives and bibliographic searches:
  • put documents back into eXist.
  • Inline vs. standoff markup
  • Franken file has advantages: one file has everything; but higher-level adornments should be in separate files -PIB
  • The issue with SVN is that versioning is not the only concern; the derivative data could be a database/METS profile.
  • Support for multiple users, with their own metadata - Duane
  • Managing the XML documents
    What kind of management is required?
    Document ID
    List of text available
    List of collections available.
    Status of the text: Has named entity reference been done? Where is it stored?
    Do documents parse? Error message if it does not
    Is it in SVN?
    Attribute sets: what can be used for any of these workflows.

MorphAdorner workflow: Validated -> Entity references expanded -> various other things (missed that)

SVN -> Validate -> TEI Simple -> UTF-8 -> Morphadorner -> Persistent Store

Who|What|Property|Date
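
One possible reading of the pipeline and the Who|What|Property|Date line above is a log in which each entry records who moved which document through which stage, and when. The sketch below assumes that reading; the field meanings, the `LogEntry` class, and the `next_stage` helper are illustrative guesses, not the design of the application John U will build.

```python
from dataclasses import dataclass
from datetime import date

# Stage names taken from the pipeline line above.
STAGES = ["SVN", "Validate", "TEI Simple", "UTF-8", "MorphAdorner", "Persistent Store"]


@dataclass(frozen=True)
class LogEntry:
    who: str    # person or process that ran the stage (assumed meaning)
    what: str   # document ID (assumed meaning)
    prop: str   # stage that was completed (assumed meaning of "Property")
    when: date  # date the stage was completed


def next_stage(log, document_id):
    """Return the first stage not yet logged for the document, or None if all are done."""
    done = {entry.prop for entry in log if entry.what == document_id}
    for stage in STAGES:
        if stage not in done:
            return stage
    return None


if __name__ == "__main__":
    log = [
        LogEntry("user-a", "doc-1", "SVN", date(2007, 6, 8)),
        LogEntry("user-a", "doc-1", "Validate", date(2007, 6, 8)),
    ]
    print(next_stage(log, "doc-1"))
```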

John U will create this Application.

Minutes

  1. Develop and experiment with two database approaches: a triples/eXist/Lucene-based approach here at UIUC (Amit, in consultation with Bernie), and a Hibernate/relational database approach at NW (John N, Bill and others).
    We want to have something working and a prototype (to test performance and flexibility) by the end of this month. Put some of the MONK texts (NCF, Wright and Stein) into the WordHoard data store, and provide
    functions and methods that satisfy D2K calls.

Note: Bill Parod thinks desired texts MorphAdorned by the end of June is very feasible, but having new texts in Wordhoard by the end of June is unlikely.

  1. John Norstad will have a page or two
    of documentation that will guide both groups with regard to parts of WordHoard we don't care about, parts of MONK not in WordHoard, and parts of WordHoard that don't work.
  2. I will do the same thing with regard to the NORA datastore.
  3. Phil will have a large corpus of documents (from multiple collections) adorned by the end of this month; in the meantime there are
    some adorned texts we can use.
  4. John U will lead an effort to create a workflow application based on another project of his. It will allow team members to query and view the documents that are available, to manage the documents, adornments and collections, and will provide a programming interface (web service) to do the same. Amit will help in this process, and the first step is to set up WebDAV on the MONK server.
  5. James: John U suggested your name as the lead person for the web services part of MONK and we thought that was a good idea.

To do:

  • Set up a dropbox on monk.lis.uiuc.edu
  • Install SFTP/WebDav/SCP
  • Set up a database with
    • Input ID
    • Submitter
    • Process run
    • Date
    • Output result
    • Output ID
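
A minimal sketch of such a tracking database, using SQLite from the Python standard library as a stand-in for whatever database ends up on the server. The table name `process_log`, the column names, and the helper functions are assumptions derived from the field list above.

```python
import sqlite3

# One row per process run; columns mirror the field list in the to-do item.
SCHEMA = """
CREATE TABLE IF NOT EXISTS process_log (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    input_id      TEXT NOT NULL,
    submitter     TEXT NOT NULL,
    process_run   TEXT NOT NULL,
    run_date      TEXT NOT NULL,
    output_result TEXT,
    output_id     TEXT
)
"""


def open_log(path=":memory:"):
    """Open (or create) the tracking database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_run(conn, input_id, submitter, process_run, run_date, output_result, output_id):
    """Insert one process run; the transaction commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO process_log "
            "(input_id, submitter, process_run, run_date, output_result, output_id) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (input_id, submitter, process_run, run_date, output_result, output_id),
        )


def runs_for(conn, input_id):
    """Return (process_run, output_result, output_id) rows for one input, oldest first."""
    rows = conn.execute(
        "SELECT process_run, output_result, output_id "
        "FROM process_log WHERE input_id = ? ORDER BY id",
        (input_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_log()
    record_run(conn, "doc-1", "user-a", "validate", "2007-06-08", "ok", "doc-1.valid")
    print(runs_for(conn, "doc-1"))
```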

Addressable by web services, including business rules: what X has to be done before you can do Y
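
The "what X has to be done before you can do Y" rule can be sketched as a prerequisite table that a service consults before starting a process. The process names, the `PREREQUISITES` table, and the response shape below are hypothetical; the notes do not specify the actual rules.

```python
# Hypothetical rule table: each process maps to the processes that must precede it.
PREREQUISITES = {
    "validate": set(),
    "tei_simple": {"validate"},
    "utf8": {"tei_simple"},
    "morphadorn": {"utf8"},
    "store": {"morphadorn"},
}


def check(process, completed):
    """Report whether a process may run, given the processes already completed."""
    if process not in PREREQUISITES:
        raise ValueError(f"unknown process: {process}")
    missing = sorted(PREREQUISITES[process] - set(completed))
    return {"process": process, "allowed": not missing, "missing": missing}


if __name__ == "__main__":
    print(check("morphadorn", ["validate", "tei_simple"]))
```

Returning the missing steps, instead of a bare yes/no, lets a client tell the user what to run first.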

Service-Oriented Architecture

Areas of responsibility:

Cataloging: Bryan
Preprocessing: Northwestern
Processing/Indexing/Ingestion: UIUC - RDF/Lucene/Exist
NWU - Hibernate/relational/object
Web services/Proxy: McMaster

Checkpoints/Milestones:

  • Now: ingest sample texts while recording process information
  • End of the month: Ingest lots of texts
  • End of next month: Evaluate parallel persistent store experiments
Document generated by Confluence on Apr 19, 2009 15:04