|
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Discussions:
- Objects in MONK
Collection Metadata
Chunk/Work part
Word
Bibliographic Data
- Architecture
Proxy, D2K server, relational database and Tuple based approach
Separation of Interface and the services
A model of GUI components at the server end akin to JSF
- Areas of responsibility in the data cell
Data processing/Indexing/Ingestion
Storing the output from the tagger to a persistent datastore for efficient querying.
Storing the bibliographic information and metadata about the document and collection for search and retrieval.
Provides count/frequency-based data to the D2K itinerary for data mining.
Provides batch-processing facilities for long-running, on-demand data crunching (for example, certain higher-order n-grams).
Provides a process or task monitoring API that allows long-running jobs to continue in the background and report back once the task is finished.
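A minimal sketch of the batch-processing and task-monitoring idea above, assuming a thread pool as the background runner. The names `TaskMonitor`, `count_ngrams`, and the status strings are hypothetical illustrations, not part of MONK or D2K.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import uuid


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class TaskMonitor:
    """Hypothetical monitor: runs jobs in the background and reports status."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}  # job id -> Future

    def submit(self, func, *args):
        """Start a job in the background and return an id for later polling."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def status(self, job_id):
        """Report 'running' (queued or in progress), 'failed', or 'finished'."""
        future = self._jobs[job_id]
        if not future.done():
            return "running"
        return "failed" if future.exception() is not None else "finished"

    def result(self, job_id, timeout=None):
        """Block until the job finishes (or the timeout expires) and return its value."""
        return self._jobs[job_id].result(timeout=timeout)

    def shutdown(self):
        """Wait for outstanding jobs and release the worker threads."""
        self._pool.shutdown(wait=True)
```

A caller would submit a job with `monitor.submit(count_ngrams, tokens, 3)`, keep working, and poll `monitor.status(job_id)` until it reports `finished`.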
Named Entity Extraction
Data pre-processing
Morphological Tagger
-Output format?
Web Services layer
-Data Mining and D2K
-Collections and works discovery/visualization for selection.
...
Cataloging the data and Itinerary
- Checkpoint and Milestones.
Conference call, 2007 June 8, Data
Identify work we can do on Monday
AK: take the WordHoard data model and hook itineraries to it
WordHoard client not relevant
Hibernate part of the model
JN: two options: talk JDBC to the database and forget about Hibernate
or use Hibernate
Try out performance issues in Hibernate vs. JDBC
try
Wright Selection, DocSouth subcollections
NCF
TCP
RDF triple store: can it do a better job than SQL?
stuff in Monk not in WordHoard; higher level relationships
use the RDF for higher order relations
JENA allows parallel progress with regard to relational and RDF models
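To illustrate what "higher order relations" as triples might look like, here is a small sketch of subject/predicate/object matching in plain Python. It is not Jena and not an RDF library; the identifiers (`work:1`, `partOf`, `respondsTo`) and the `match` helper are made up for illustration.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Hypothetical data: two works in one collection, plus a relation
# between works that a purely word-level relational model would not hold.
triples = [
    ("work:1", "partOf", "collection:A"),
    ("work:2", "partOf", "collection:A"),
    ("work:2", "respondsTo", "work:1"),
]

# All works in collection A.
members = match(triples, predicate="partOf", obj="collection:A")

# Everything known about work:2.
about_work_2 = match(triples, subject="work:2")
```

The point of the sketch is that a new kind of relation is just another predicate value, with no schema change, which is one reason a triple store could sit alongside the relational tables.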
Real questions
Many RDF models against the
Morphadorned tables into datastore
JN
parts of WordHoard we don't care about
parts of Monk not in WordHoard
parts of WordHoard that don't work
focusing on weak points is a good way of moving forward
- SVN for adorned text and derivatives and bibliographic searches:
- put documents back into eXist.
- inline vs. standoff markup
- Franken file has the advantage of one file holding everything, but higher-level adornments should be in separate files - PIB
- The issue with SVN is that versioning is not the only concern; the derivative data could be a database/METS profile.
- Support for multiple users, with their own metadata - Duane
- Managing the XML documents
What kind of management is required?
Document ID
List of text available
List of collections available.
Status of the text: Has named entity reference been done? Where is it stored?
Does the document parse? Error message if it does not.
Is it in SVN?
Attribute sets: these are what can be used for any of these workflows.
Morphadorner workflow: Validated -> Entity references expanded -> various other things (missed that)
SVN -> Validate -> TEI Simple -> UTF-8 -> Morphadorner -> Persistent Store
Who|What|Property|Date
John U will create this Application.
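A rough sketch of how such a document-management application might track workflow status, assuming the stage order listed above and reading the Who|What|Property|Date columns as (person, document ID, stage reached, date). That reading, and the names `STAGES` and `DocumentRecord`, are assumptions for illustration, not the planned design.

```python
from dataclasses import dataclass, field
from datetime import date

# Stage order taken from the workflow line in the notes.
STAGES = ["SVN", "Validate", "TEI Simple", "UTF-8", "Morphadorner", "Persistent Store"]


@dataclass
class DocumentRecord:
    """Hypothetical per-document status record."""
    doc_id: str
    completed: list = field(default_factory=list)  # stages finished so far
    log: list = field(default_factory=list)        # (who, what, property, date) rows

    def next_stage(self):
        """Return the next stage to run, or None when the workflow is complete."""
        if len(self.completed) == len(STAGES):
            return None
        return STAGES[len(self.completed)]

    def advance(self, who, when):
        """Mark the next stage as done and append a Who|What|Property|Date row."""
        stage = self.next_stage()
        if stage is None:
            raise ValueError(f"{self.doc_id} has already reached the final stage")
        self.completed.append(stage)
        self.log.append((who, self.doc_id, stage, when))
        return stage
```

With a record per document, questions such as "Is it in SVN?" or "Has it been through Morphadorner?" reduce to checking `completed`, and the log gives the audit trail.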
|