|
This page last changed on May 18, 2008 by amitku.
This prototype has been implemented. Current work focuses on using Fedora as the backend document repository, with certain sub-workflows that require curator intervention.
Please also see the SEASR page https://apps.lis.uiuc.edu/wiki/display/seasrD/Workflows
Goal
The goal of this subproject is to:
-Develop a Workflow model for the MONK ingestion process.
-Implement the Workflow in jBPM/JPDL.
-Integrate the Workflow in the NORA-OL interface.
The deliverables are a new set of features in the MONK web services that would provide:
-A description of the set of tasks that must be carried out in order to MONK-ify a collection.
-A web application for submitting a collection for processing, and an XML-RPC interface for inquiring about the status of such a task.
Simple Workflow for Implementation
0. Documents received.
-Retrieve documents from the file system/upload the files or from
Bernie suggests that this could be a portal environment.
1. Require a collection-level profile description as user input.
Or check to see if an existing profile can be applied to this collection?
2. Check that the documents are well-formed.
3. Check for validity.
Is validation always a requirement?
4. Convert the existing XML to MONK XML (TEIsimple) using an XSLT stylesheet.
5. Extract metadata from the MONK XML files at the collection and individual work level using XSLT. (Validate metadata?)
-This will be used for the collection selection and search process in the interface.
6. Create a training set for MorphAdorner.
Or check to see if an existing training set can be used?
7. Process the documents through MorphAdorner.
8. Ingest the documents into the datastore for data mining and full-text search and retrieval.
Do we have from the data cell an agreed-upon "data model 1.0"?
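Steps 2 and 4 above can be sketched with the XML APIs that ship with the JDK (JAXP). This is only an illustration under stated assumptions: the class name IngestionSteps, its method names, and the tiny stylesheet in main are hypothetical placeholders, not the MONK services code or the real TEIsimple stylesheet. The same transform call would also serve the XSLT-based metadata extraction in step 5.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; not part of the MONK services code base.
class IngestionSteps {

    // Step 2: a document counts as well-formed if an XML parser reads it without a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // parse error: the document is not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser unavailable or input unreadable", e);
        }
    }

    // Step 4 (and step 5): apply an XSLT stylesheet to a document and return the result as text.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder input and stylesheet, for illustration only.
        String source = "<doc><title>Emma</title></doc>";
        String stylesheet =
                "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:output method='text'/>"
                + "<xsl:template match='/'><xsl:value-of select='/doc/title'/></xsl:template>"
                + "</xsl:stylesheet>";
        if (isWellFormed(source)) {
            System.out.println(transform(source, stylesheet));
        }
    }
}
```

Checking well-formedness before transforming lets the workflow reject a bad file early, before any later step touches it. Validation (step 3) is deliberately left out of the sketch, since whether it is always required is still an open question above.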
Participants and SVN access
-Amit, Andrew, Brian, Duane, Mary, Vered and others
SVN
URL: svn://nora.lis.uiuc.edu/MonkServices/trunk
If you don't have access or have forgotten your login/password, let Amit know.
Documents
We will use NCF documents that have already been morphadorned in this process. For part-of-speech tagging, we will use OpenNLP (unless we are able to get the MorphAdorner source code and training data), and for the ingestion process we will use nora-db. In the second iteration, we will implement the WordHoard ingestion step.
Timeline and Planning -based on Simple Workflow
-A working prototype by 20th of July (Workflow could still be in flux at that time, but we aim for a simple implementation)
Get folks SVN accounts (only Andrew is left). Done July 10th 2007.
Integrate jBPM with the Spring container in MONK services by 11th July. Use the Spring jBPM extension; if there are hurdles, we will just use jBPM inside Spring without IoC support. -Amit. Done July 11th 2007: jBPM Spring modules integrated with MonkSpringServices.
HTML wireframe diagrams on a piece of paper.
Second iteration: 23rd July onwards, which will involve the XML-RPC adapter and WordHoard ingestion.
Current Status
Finished July 28th 2007:
-Create New Process.
/WorkflowManager.createNewProcess?workflowName=simple&comments=here%20are%20some%20comments
-Infrastructure to support any Workflow and add it at runtime
/WorkflowManager.createNewWorkflow?processDefResourceLoc=processdefinition.xml
-Retrieve a list of Processes started by a user
-Retrieve a list of available workflows
/WorkflowManager.getAvailableWorkflows
-The Nora-db/Nora-chunk XML validation, unzip, and Lucene ingestion tasks have all been created.
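The request paths listed above can be assembled programmatically, which avoids hand-encoding parameters such as the comments text. The following is a minimal sketch: the class name WorkflowManagerUrls and the base URL in the usage note are hypothetical, and only the paths and parameter names are taken from this page.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical URL builder for the WorkflowManager calls listed on this page.
class WorkflowManagerUrls {
    private final String baseUrl;

    WorkflowManagerUrls(String baseUrl) {
        // Drop a trailing slash so that paths can be appended uniformly.
        this.baseUrl = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
    }

    // Start a new process for a named workflow, with free-text comments.
    String createNewProcess(String workflowName, String comments) {
        return baseUrl + "/WorkflowManager.createNewProcess?workflowName=" + encode(workflowName)
                + "&comments=" + encode(comments);
    }

    // Register a new workflow at runtime from a process definition resource.
    String createNewWorkflow(String processDefResourceLoc) {
        return baseUrl + "/WorkflowManager.createNewWorkflow?processDefResourceLoc="
                + encode(processDefResourceLoc);
    }

    // List the workflows that are currently available.
    String getAvailableWorkflows() {
        return baseUrl + "/WorkflowManager.getAvailableWorkflows";
    }

    // URLEncoder writes spaces as '+'; rewrite them as %20 to match the form used on this page.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

For example, with an assumed base URL of http://localhost:8080/monk, calling createNewProcess("simple", "here are some comments") yields the same path and query string as the createNewProcess example above.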

Expected Skills
-Knowledge of Spring Framework.
-jBPM/jpdl.
-HTML/JSP development.
|