|
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Discussions:
- Objects in MONK
Collection Metadata
Chunk/Work part
Word
Bibliographic Data
- Architecture
Proxy, D2K server, relational database and Tuple based approach
Separation of Interface and the services
A model of GUI components at the server end akin to JSF
- Areas of responsibility in the data cell
Data processing/Indexing/Ingestion
Storing the output from the tagger to a persistent datastore for efficient querying.
Storing the bibliographic information and metadata about the document and collection for search and retrieval.
Provides count/freq based data to the D2K itinerary for data mining.
Provides batch processing facilities for long lasting on demand data crunching like certain higher order ngrams (for example).
Provides a process or task monitoring API that would allow the long lasting jobs to continue in the background and report back once the task is finished.
Named Entity Extraction
Data pre-processing
Morphological Tagger
-Output format?
Web Services layer
-Data Mining and D2K
-Collections and works discovery/visualization for selection.
...
Cataloging the data and Itinerary
- Checkpoint and Milestones.
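The process or task monitoring API listed under Data processing/Indexing/Ingestion could be sketched roughly as follows. This is only an illustration, not an agreed design: the `TaskMonitor` class, its method names, and the `count_bigrams` example job are all hypothetical, and a thread pool stands in for whatever batch facility is eventually chosen.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class TaskMonitor:
    """Hypothetical monitor: submit a long-running job, poll its status later."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._tasks = {}  # task id -> Future

    def submit(self, func, *args):
        """Start func(*args) in the background and return a task id."""
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = self._pool.submit(func, *args)
        return task_id

    def status(self, task_id):
        """Return 'in progress', 'failed', or 'finished' for a known task id."""
        future = self._tasks[task_id]
        if not future.done():
            return "in progress"
        return "failed" if future.exception() is not None else "finished"

    def result(self, task_id, timeout=None):
        """Block until the task finishes (or the timeout expires) and return its value."""
        return self._tasks[task_id].result(timeout=timeout)


def count_bigrams(words):
    """Example job: count adjacent word pairs (a stand-in for n-gram crunching)."""
    counts = {}
    for pair in zip(words, words[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
```

A caller would keep the id returned by `submit`, check `status` from time to time, and fetch the value with `result` once the job reports finished.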
Conference call, 2007 June 8, Data
Identify work we can do on Monday
AK take wordhoard data model and hook itineraries to it
Wordhoard client not relevant
Hibernate part of the model
JN: two options: talk JDBC to the database and forget about Hibernate
or use Hibernate
Try out performance issues in hibernate vs. JDBC
try
Wright Selection, DocSouth subcollections
NCF
TCP
RDF Triple Store can it do a better job than SQL
stuff in Monk not in WordHoard; higher level relationships
use the RDF for higher order relations
JENA allows parallel progress with regard to relational and RDF models
Real questions
Many RDF models against the
Morphadorned tables into datastore
JN
parts of wordhoard we don't care about
parts of Monk not in Wordhoard
parts of wordhoard that don't work
focusing on weak points is a good way of moving forward
- SVN for adorned text and derivatives and bibliographic searches:
- put documents back into eXist.
- inline vs. standoff markup;
- Franken file has advantages: one file has everything; but higher-level adornments should be in separate files -PIB
- The issue with SVN is that versioning is not the only concern; the derivative data could be a database/METS profile.
- Support for multiple users, with their own metadata - Duane
- Managing the XML documents
What kind of management is required?
Document ID
List of text available
List of collections available.
Status of the text: Has named entity reference been done? Where is it stored?
Do documents parse? Error message if they do not
Is it in SVN?
Attribute sets: that is what can be used for any of these workflows.
Morphadorner Workflow: Validated -> Entity references expanded -> various other things (missed that)
SVN -> Validate -> TEI Simple -> UTF-8 -> Morphadorner -> Persistent Store
Who|What|Property|Date
John U will create this Application.
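A minimal sketch of the document-management checks above, under stated assumptions: the step names come from the workflow line in these notes, but the function names are hypothetical, and `check_parses` only tests XML well-formedness with the standard library. It does not validate against a TEI schema.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Step order taken from the workflow line in the notes.
WORKFLOW = ["SVN", "Validate", "TEI Simple", "UTF-8", "Morphadorner", "Persistent Store"]


def check_parses(xml_text):
    """Return (True, None) if the text is well-formed XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


def next_step(completed):
    """Return the first workflow step not yet completed, or None when all are done."""
    for step in WORKFLOW:
        if step not in completed:
            return step
    return None


def status_record(who, what, prop, when):
    """Build one Who|What|Property|Date row as a pipe-delimited string."""
    return "|".join([who, what, prop, when.isoformat()])
```

For example, `next_step({"SVN", "Validate"})` names the step a document should go through next, and `status_record` produces one row in the Who|What|Property|Date shape noted above.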
Minutes
- Develop and experiment with two database approaches: a Triples/eXist/Lucene-based approach here at UIUC (Amit, in consultation with Bernie) and a Hibernate/relational database approach at NW (John N, Bill and others).
We want to have something working and a prototype (to test performance and flexibility) by the end of this month. Put some of the MONK texts (NCF, Wright and Stein) into the Wordhoard data store, and provide
functions and methods that satisfy D2K calls.
Note: Bill Parod thinks having the desired texts MorphAdorned by the end of June is very feasible, but having new texts in Wordhoard by the end of June is unlikely.
- John Norstad will have a page or two
of documentation that will guide both groups with regard to parts of wordhoard we don't care about, parts of Monk not in Wordhoard, and parts of wordhoard that don't work.
- I will do the same thing with regard to the NORA datastore.
- Phil will have a large corpus of documents (from multiple collections) adorned by the end of this month; in the meantime there are
some adorned texts we can use.
- John U will lead an effort to create a workflow application, based on another project of his, that will allow team members to query and view the documents that are available, manage the documents, adornments and collections, and do the same through a programming interface (web service). Amit will help in this process, and the first step is to set up WebDAV on the monk server.
- James: John U suggested your name as the lead person for the web services part of MONK and we thought that was a good idea.
To do:
- Set up a dropbox on monk.lis.uiuc.edu
- Install SFTP/WebDav/SCP
- Set up a database with
- Input ID
- Submitter
- Process run
- Date
- Output result
- Output ID
Addressable by web services, including business rules: what X has to be done before you can do Y
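One possible shape for this process database, sketched with SQLite purely for illustration. The column names follow the field list above; the table name, the `PREREQUISITES` rule table, the process names in it, and the `'ok'` result value are all assumptions, not decisions recorded in these notes.

```python
import sqlite3

# Hypothetical business rules: each process names the process that must succeed first.
PREREQUISITES = {"Validate": None, "Morphadorner": "Validate", "Ingest": "Morphadorner"}


def open_log(path=":memory:"):
    """Open (or create) the process log; columns mirror the to-do list fields."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS process_log (
               input_id      TEXT NOT NULL,
               submitter     TEXT NOT NULL,
               process_run   TEXT NOT NULL,
               run_date      TEXT NOT NULL,
               output_result TEXT NOT NULL,
               output_id     TEXT
           )"""
    )
    return conn


def can_run(conn, input_id, process):
    """True if the prerequisite process (if any) has an 'ok' run for this input."""
    required = PREREQUISITES[process]
    if required is None:
        return True
    row = conn.execute(
        "SELECT 1 FROM process_log WHERE input_id = ? AND process_run = ? "
        "AND output_result = 'ok' LIMIT 1",
        (input_id, required),
    ).fetchone()
    return row is not None


def record_run(conn, input_id, submitter, process, run_date, result, output_id=None):
    """Log one process run, refusing it when the business rule is not satisfied."""
    if not can_run(conn, input_id, process):
        raise ValueError(f"{PREREQUISITES[process]} must succeed before {process}")
    conn.execute(
        "INSERT INTO process_log VALUES (?, ?, ?, ?, ?, ?)",
        (input_id, submitter, process, run_date, result, output_id),
    )
    conn.commit()
```

Keeping the "X before Y" rules in one table means a web service in front of the log can answer `can_run` without knowing anything about the individual processing steps.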
Service-Oriented Architecture
Areas of responsibility:
Cataloging: Bryan
Preprocessing: Northwestern
Processing/Indexing/Ingestion: UIUC - RDF/Lucene/Exist
NWU - Hibernate/relational/object
Web services/Proxy: McMaster
Checkpoints/Milestones:
- Now: ingest sample texts while recording process information
- End of the month: Ingest lots of texts
- End of next month: Evaluate parallel persistent store experiments
|