This page last changed on Feb 21, 2007 by amitku.

Topics for discussion in advance of February 23 meeting in Evanston

We need to trim this down to a manageable agenda for the 23rd by resolving some of the following issues on the email list and removing them from this list. You may also either comment on this document (see the comment feature at the bottom of the page) or edit the page directly (under "page operations" on the left-hand menu).

  1. Namespace
    What is the nature of the relationship between Wordhoard, Nora, and MONK? Is MONK to be a common environment for both Wordhoard and Nora? Will Wordhoard and Nora retain discrete identities?

My understanding and expectation is that MONK will draw on the experience and, where appropriate, the technology of Nora and Wordhoard, but that MONK is a new project: it carries forward many of the goals and ambitions of Nora and Wordhoard while expanding on them in several areas. Our task and challenge will be to explicate what those areas are, how best to accomplish them, and then, where appropriate, how to leverage work from Nora and Wordhoard. I don't take it as a given that our goal is to somehow combine Nora and Wordhoard. (BP)

As for whether Nora and Wordhoard will retain discrete identities: I expect that Wordhoard will retain a separate identity for the near future. It remains to be seen what MONK will be and whether it will subsume Wordhoard's benefits. If it does, there is an incentive to focus all efforts on MONK. If Wordhoard retains unique value separate from MONK, then we'll likely treat it as a separate project and development/support activity. My hope is that MONK will subsume Wordhoard. (BP)

  1. MONK mission statement
    What is our mission statement?

    MONK is a toolkit for exploring linguistic patterns in literary texts.

    MONK is a Web-deliverable applications layer that can be integrated into existing digital library collections to support large-scale, cross-collection text mining and text analysis with rich visualization and social software capabilities. (MGK)

  2. End-user applications
    What are the end-user tools we envisage for MONK?

    -A shopping cart metaphor for the digital library texts, where a repository could be added to the Monastery and scholars could select documents
    while browsing through the catalog or search results.
    More important is to provide a backend that is GUI-agnostic. The end tools could take any shape, built from a set of simple web services such as:
    a.) search/browse
    b.) run a text mining algorithm
    c.) get the top 10 high-frequency words
    d.) get the repeating patterns
    ....

    A pipeline metaphor, where these functions can be organized with dependencies to produce visualizations at the user's end. (AK)
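
To make the pipeline metaphor concrete, here is a minimal Python sketch, not a proposed design. The step names, the sample texts, and the `run_pipeline` helper are all hypothetical; in practice each step would stand in for a call to one of the web services listed above. Steps declare what they depend on, and the runner executes them in dependency order.

```python
import re
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def search(results):
    # Hypothetical stand-in for a search/browse service returning texts.
    return ["the cat sat on the mat", "the dog sat on the log"]


def tokenize(results):
    # Split every text returned by the search step into lowercase words.
    return [
        word
        for text in results["search"]
        for word in re.findall(r"[a-z]+", text.lower())
    ]


def top_words(results):
    # The ten most frequent words, as (word, count) pairs.
    return Counter(results["tokenize"]).most_common(10)


# Each step name maps to its function and to the steps it depends on.
STEPS = {"search": search, "tokenize": tokenize, "top_words": top_words}
DEPENDS_ON = {"search": set(), "tokenize": {"search"}, "top_words": {"tokenize"}}


def run_pipeline(steps, depends_on):
    """Run every step after all of the steps it depends on."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        results[name] = steps[name](results)
    return results
```

With the sample texts above, `run_pipeline(STEPS, DEPENDS_ON)["top_words"][0]` is `("the", 4)`. A visualization at the user's end would simply be one more step that depends on `top_words`.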

    What use cases do they support?
    How do we obtain those use cases?

I think these two points are important. MONK is intended to provide benefits to scholars. Are the scholars on the project sufficient representation for that community? How do we identify and prioritize MONK features in that regard? (BP)

  1. Corpora:
    What texts will we work with initially, eventually?

I would focus on a very broad collection of fiction written in English between 1700 and 1900 on both sides of the Atlantic. Something like 1,000 novels to begin with. A collection of that size requires solving some of the scale issues that are part of MONK. It builds on Nora interests, and it is something we have been trying to do at Northwestern. There are interesting opportunities there for exploring differences between American and English fiction over at least a century. (MM)

What are our assumptions about their property rights and locations?
What prerequisites (IP, technical, bibliographic, lexical, ...) will texts have for inclusion in MONK?

  1. Preprocessing
    What transformations, extractions, annotations will we perform on texts before their ingest into MONK?
    What operations must their markup support?
    When we say such files are "interoperable", in what specific sense do we mean that - for what operations?

    An important issue that we have not discussed is the use of markup in this preprocessing. The existing NORA infrastructure
    uses markup as a structure delimiter to create chunks such as chapters, works, and sections, but it does not use the semantic meaning
    of the markup for feature extraction. For example, //persname/@reg should be used for persname extraction; how, then, should
    we reconcile this with data-mining-based entity extraction?
    (AK)
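
As an illustration of the //persname/@reg point, the following Python sketch pulls regularized names out of markup and merges them with names found by a miner. The sample XML is invented and has no namespaces, and both `extract_persnames` and `reconcile` are hypothetical helpers. Recording where each name came from is only one possible answer to the reconciliation question.

```python
import xml.etree.ElementTree as ET

# Invented sample; real project texts would be larger and may be namespaced.
SAMPLE = """<text>
  <div type="chapter">
    <p><persname reg="Austen, Jane">Miss Austen</persname> wrote to
       <persname reg="Austen, Cassandra">her sister</persname>.</p>
    <p><persname>Mr. Darcy</persname> said nothing.</p>
  </div>
</text>"""


def extract_persnames(xml_text):
    """Return one name per persname element, in document order."""
    root = ET.fromstring(xml_text)
    names = []
    for elem in root.iter("persname"):
        # Prefer the regularized form; fall back to the element's own text.
        reg = elem.get("reg")
        names.append(reg if reg else "".join(elem.itertext()).strip())
    return names


def reconcile(markup_names, mined_names):
    """Map each name to the set of sources that produced it."""
    merged = {}
    for name in markup_names:
        merged.setdefault(name, set()).add("markup")
    for name in mined_names:
        merged.setdefault(name, set()).add("mined")
    return merged
```

Here `extract_persnames(SAMPLE)` returns `["Austen, Jane", "Austen, Cassandra", "Mr. Darcy"]`. Names confirmed by both markup and mining end up with both labels, which could serve as a simple confidence signal.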

  2. User interface design
    What are appropriate user interfaces for the identified tools?


    Bill, do you mean browser-based vs. fat Java Swing-based visualization tools? Both have advantages and disadvantages.

    - No, I simply mean what the user sees as opposed to 'backend'.

    The underlying infrastructure should provide a service-oriented architecture that would allow different technologies
    and interfaces to coexist. This option is not easy: moving everything to a proxy or server makes things very complicated,
    and the trick will be to design the system so that the user experience is the same as with a fat client. (AK)

  3. Analytics
    What text analysis/mining algorithms/services/tools do we want to provide?

    -Predictive analysis; frequent pattern analysis; supervised methods, or automated ones such as clustering.
    -Ability to configure these algorithms with different feature sets, for example using stemming or WordNet to
    compress/expand feature sets and reduce dimensions in an SVM. (AK)
    -Ability to configure these algorithms with different algorithm parameters; some algorithms have additional tuning parameters. (LA)
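
The two kinds of configuration mentioned above (feature sets and algorithm parameters) could be kept separate, roughly as in this Python sketch. Every name here is hypothetical, and the suffix stripping is only a crude stand-in for a real stemmer or a WordNet lookup; it is meant to show how a feature option shrinks the feature set.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FeatureConfig:
    """Options that change which features are extracted from a text."""
    lowercase: bool = True
    min_length: int = 1          # drop tokens shorter than this
    strip_suffixes: tuple = ()   # crude stand-in for a real stemmer


@dataclass
class AlgorithmConfig:
    """Name of an algorithm plus its tuning parameters."""
    name: str
    params: dict = field(default_factory=dict)


def extract_features(text, cfg):
    """Count tokens after applying the options in cfg."""
    features = Counter()
    for token in re.findall(r"[A-Za-z]+", text):
        if cfg.lowercase:
            token = token.lower()
        for suffix in cfg.strip_suffixes:
            # Strip at most one suffix, and never strip a whole token.
            if token.endswith(suffix) and len(token) > len(suffix):
                token = token[: -len(suffix)]
                break
        if len(token) >= cfg.min_length:
            features[token] += 1
    return features
```

For the text "Walking walked walks", the default `FeatureConfig()` yields three distinct features, while `FeatureConfig(strip_suffixes=("ing", "ed", "s"))` collapses them into the single feature `walk` with a count of 3. An `AlgorithmConfig("svm", {"C": 1.0})` would then carry the tuning side independently of the feature side.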

  4. Service architecture
    What data access, analysis, search, (other service classes?) services do the end-user tools require?

    How do digital libraries and software like Fedora/DSpace relate to what we are doing? How can we use content already in these
    repositories, and existing functionality such as browse/search and digital asset management?
    -Project management functions, such as:
      User preferences and collection selection/grouping information. Create a pile of work sets.
      Being able to overlay results and graphs from various data mining iterations.
      Create sample projects that act as templates for users to explore and modify.
    -Digital-library-like functions, such as:
      Browse/search. Annotate and mark. Share annotations. Store search results.
      Use search results as a base for data mining operations.
    -Data mining functions
      Ability to retrieve outputs from past jobs without rerunning them.
      Retrieve the list of algorithms available, and what others have been using in general.
    -Visualization functions
      I have not thought through these enough... Catherine/Stan might be good people to ask, along with Loretta.
    (AK)
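
Retrieving the output of a past job without rerunning it amounts to caching results under a key derived from the job's inputs. The Python sketch below is one hypothetical way to do that: `JobCache` and its methods are invented names, parameters are assumed to be JSON-serializable, and the store is in memory only, whereas a real service would need to persist it.

```python
import hashlib
import json


class JobCache:
    """Remember job outputs, keyed by algorithm, parameters and documents."""

    def __init__(self):
        self._results = {}
        self.runs = 0  # how many jobs were actually executed

    @staticmethod
    def job_key(algorithm, params, document_ids):
        # Sorting makes the key independent of parameter and document order.
        payload = json.dumps(
            {"algorithm": algorithm, "params": params, "docs": sorted(document_ids)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, algorithm, params, document_ids, func):
        """Return a stored result if one exists; otherwise run func and store it."""
        key = self.job_key(algorithm, params, document_ids)
        if key not in self._results:
            self.runs += 1
            self._results[key] = func(document_ids, **params)
        return self._results[key]
```

Submitting the same algorithm, parameters, and documents twice executes the job once; changing any of the three produces a new key and a fresh run.
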
  5. Data models
    What data models are we currently working with?

    In NORA, we have an XML database, a Lucene index for each feature and each chunk level, and a Chunk object specification.
    (AK)
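
For readers unfamiliar with this arrangement, the Python sketch below shows the general shape of "one index per feature and chunk level". It is not the NORA code and does not use the Lucene API: the `Chunk` fields are assumptions, since the actual Chunk specification is not given here, and a plain dictionary stands in for each index.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A unit of text at some structural level (fields are assumed)."""
    chunk_id: str
    level: str      # e.g. "work", "chapter", "section"
    parent_id: str  # empty string for a top-level chunk
    text: str


class FeatureIndex:
    """One inverted index per (feature, chunk level) pair."""

    def __init__(self):
        self._indexes = defaultdict(lambda: defaultdict(set))

    def add(self, feature, chunk, values):
        # Record that this chunk contains each of the given feature values.
        index = self._indexes[(feature, chunk.level)]
        for value in values:
            index[value].add(chunk.chunk_id)

    def lookup(self, feature, level, value):
        # Chunk ids at the given level that contain the value.
        return sorted(self._indexes[(feature, level)].get(value, set()))
```

A query therefore names both the feature and the level: after indexing the words of a chapter-level chunk, a lookup at the "chapter" level finds it, while the same lookup at the "work" level returns an empty list.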

    Can they be reconciled with one another?
    Do they (one, or the other, or the reconciled version) have what is required to support MONK's proposed services and preprocessing?

I think we need the NW and UIUC folks to discuss this in detail to understand the limitations of our approaches,
and to open this topic for discussion with the use cases that we want to satisfy.
(AK)

I think we must first articulate MONK requirements in this regard before any particular technology (Lucene, eXist, MySQL, Hibernate, ...) can be meaningfully considered. (BP)

  1. Social computing
    What are our goals in this area?

    RSS support, annotation, shared datastores, exposure of results to third-party client tools like Zotero or Yahoo! Pipes. (MGK)

  2. Documentation and dissemination
    Who is responsible for what kinds of documentation, and where does documentation come in the process?
  3. Project support
    What tools and protocols do we have for communications (project tracking, bug tracking, user feedback, ...)?
    What are the necessary common practices on source code generation, organization, and control?

    Ours is a small team of programmers. Source code organization and control should, IMHO, be based
    on industry-wide standards, with programs like style checkers and PMD used to check
    the quality of the code. We should use unit tests for testing; a plugin like Clover helps a developer
    a lot in discovering untested code.
    (AK)

    What are the necessary common practices on corpus management (organization and control of source texts, preprocessed texts, bibliographic catalog, ...)?
    What are our computing resources (storage, preprocessing, service hosting, ...)? What are our needs? How, when, and where will they be met?
    What are our plans for process management (orchestrating preprocessing pipelines, transitioning corpus work states)?

  4. Project management
    What are the subprojects (such as the topics above)? Who will direct them and who will work on them? How will work assignments be made?

    Broad categories like interface development, web services, and core API. We should be looking
    at plugin mechanisms and dividing the code into a core API and plugins.
    (AK)
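
One common way to split code into a core API and plugins is a registry that the core owns and that plugins add themselves to. The Python sketch below is purely illustrative; `register`, `run_plugin`, and the `word_count` plugin are hypothetical names, not a proposed MONK interface.

```python
# Core side: a registry that maps plugin names to callables.
PLUGINS = {}


def register(name):
    """Decorator that adds a function to the registry under the given name."""
    def decorator(func):
        if name in PLUGINS:
            raise ValueError(f"plugin {name!r} is already registered")
        PLUGINS[name] = func
        return func
    return decorator


def run_plugin(name, text):
    """Core entry point: look a plugin up by name and apply it to a text."""
    try:
        plugin = PLUGINS[name]
    except KeyError:
        raise KeyError(f"no plugin registered as {name!r}") from None
    return plugin(text)


# Plugin side: code outside the core only needs the decorator.
@register("word_count")
def word_count(text):
    return len(text.split())
```

The core never imports a plugin by name, so `run_plugin("word_count", "a b c")` returns 3 without the core knowing anything about word counting.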

    What are their interdependencies, what will they produce, and how will they communicate?

    I have felt that on-site meetings go a long way in moving the project forward. These should form an important
    part of the process: every month for the local core institutes and every other month for the larger group, IMHO.
    (AK)

    How will work be designated MONK-"chargeable"?
    How will disagreements be reviewed and resolved?

(I tried to edit the text but could only see the first third of it in the editor window. I was afraid I might delete it if I proceeded. In any case, here are a couple of comments. Sorry for the awkward format--Stan)

USE CASES
I would be interested in the idea of soliciting a list of volunteer projects from the humanities computing community, then choosing some to start on. One of our selection criteria could be the availability in the project of a group of study participants who could help to provide use cases for us. (SR)

CORPORA
If we created a short list of volunteer projects, these could help drive our initial collection choices. I see Martin's plan to work with 1000 novels as an example of this approach. (SR)

Posted by sruecker@ualberta.ca at Feb 21, 2007 09:47
Document generated by Confluence on Apr 19, 2009 15:04