This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Monk Data Cell Conference Call

Tuesday, May 1, 2007. 3-4 p.m. central time

Present: Amit Kumar (chair), Loretta Auvil, Bernie Arcs, Duane Searsmith, Phil Burns (Pib), Martin Mueller, Bob Taylor, Joe Paris, Bill Parod (secretary)

Missing: James Chartrand and John Norstad.

Notes towards a Monk User Manual
We discussed Martin Mueller's "Notes towards a Monk User Manual". Martin wrote it to gather current assumptions about MONK, with the expectation that MONK participants will strike, add, and edit statements as needed until the document becomes a consensus expression of MONK requirements. Data cell issues have been broken out into individual wiki pages:

Bibliographic Data (done) - What bibliographic descriptive data is associated with works, collections, and data sets?
Structural Data (archive) - What structural data is captured for works?
Morphological Data (done) - What morphological data is associated with words?
Named Entities - How are personal and place names obtained, recognized, referenced, and managed?
Words in order or Multi-word-units, phrases, and repetitions (N-grams) - How are N-grams expressed and obtained?
Data Compliance and Profiles - What can software assume about a text's or data set's features?
User Contributed Data - How do users create, contribute, discover, reference, and manage their data?
Repository Services - What discovery, access, and deposit protocols should we support for our repository?
MONK SIP (archive) - Should we define a Submission Information Package? What should it include? Is the Nora chunk file it?

Requirements assertions in Martin's document regarding these topics will be moved into the wiki pages by Bill Parod. These wiki pages will gather technical requirements and serve as a record of decisions made and decisions outstanding for these topics. Martin will update the "Notes towards a Monk User Manual" as requirements are detailed on the wiki.

Collections
Amit pointed out that the notion of a "Collection" is missing from the model. That will be added and its details discussed.

Named entities
Can we recognize named entities in texts? We can recognize known names and even abstract patterns that might indicate personal titles, for example "PN of PN" matched using POS tags. But disambiguating entities that share the same name will be more difficult. It was asked whether texts containing named entity markup would present problems for processing. MorphAdorner does not remove any markup from the texts it processes, so existing named entity markup does not present problems.
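
As a rough illustration of the "PN of PN" idea, the sketch below scans a POS-tagged token list for a proper noun, the word "of", and another proper noun. The tag names ("PN", "DT", and so on), the function name find_pn_of_pn, and the sample sentence are assumptions made for this example; they are not MorphAdorner's actual tag set or API.

```python
from typing import List, Tuple

# A token is (word, POS tag). The tag names are illustrative assumptions.
Token = Tuple[str, str]


def find_pn_of_pn(tokens: List[Token], proper_tag: str = "PN") -> List[Tuple[int, int]]:
    """Return (start, end) token index pairs for every 'PN of PN' span.

    `end` is exclusive, so tokens[start:end] is the matched phrase.
    """
    spans = []
    for i in range(len(tokens) - 2):
        first, middle, last = tokens[i], tokens[i + 1], tokens[i + 2]
        if (first[1] == proper_tag
                and middle[0].lower() == "of"
                and last[1] == proper_tag):
            spans.append((i, i + 3))
    return spans


# Hypothetical tagged input, for illustration only.
tagged = [("the", "DT"), ("Duke", "PN"), ("of", "IN"), ("York", "PN"), ("spoke", "VB")]
matches = find_pn_of_pn(tagged)
phrases = [" ".join(word for word, _ in tagged[s:e]) for s, e in matches]
```

A pattern like this only proposes candidates. It says nothing about whether two matches with the same surface form refer to the same person, which is the harder disambiguation problem noted above.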

Amit Remarks:

A further question that we should consider concerns words/phrases that are detected as entities by MorphAdorner and that are also tagged as <persName> or <placeName>. We should be able to
#1 Disambiguate the source.
#2 Resolve duplicates
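
One possible way to handle both points is to key each mention by its position in the text and record every source that reported it. The sketch below assumes that mentions arrive as (start, end, text, kind) tuples and that identical positions mean the same mention. The EntityMention class, the merge_mentions function, and the source labels "markup" and "detected" are hypothetical names for illustration, not part of MorphAdorner or MONK.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# (start, end, text, kind): token positions plus surface text and entity type.
RawMention = Tuple[int, int, str, str]


@dataclass
class EntityMention:
    start: int
    end: int
    text: str
    kind: str
    sources: Set[str] = field(default_factory=set)  # who reported this mention


def merge_mentions(markup: List[RawMention],
                   detected: List[RawMention]) -> List[EntityMention]:
    """Merge mentions from both sources, one entry per (start, end) position."""
    merged: Dict[Tuple[int, int], EntityMention] = {}
    # Markup is processed first, so its entity kind wins when both sources agree
    # on the position but not on the kind.
    for source, mentions in (("markup", markup), ("detected", detected)):
        for start, end, text, kind in mentions:
            key = (start, end)
            if key not in merged:
                merged[key] = EntityMention(start, end, text, kind)
            merged[key].sources.add(source)
    return [merged[key] for key in sorted(merged)]


# Hypothetical input: one mention found by both sources, one by detection only.
from_markup = [(1, 4, "Duke of York", "person")]
from_detection = [(1, 4, "Duke of York", "person"), (7, 8, "London", "place")]
result = merge_mentions(from_markup, from_detection)
```

Mentions that overlap without sharing exact boundaries are not merged here; deciding how to treat them would be a separate policy question.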

Structural data
We can expect to capture punctuation and sentence boundaries as part of the tokenization process.
We will likely rely on collection providers for encoding of paragraph and major structural divisions. We might develop some ingest workflows that apply heuristics for structural markup to texts that are not encoded with structural markup. Regardless, we will need to detail what MONK recognizes as structural markup and what MONK does with it.

Martin said we can describe documents at the top (bibliographic) and bottom (lexical) levels in the same language, but not in the middle (structural level).

We agree that defining 'chunks' is important for the project. We will need to define what we mean by 'chunk'.

Amit Remarks

Bill and I will create this page; I will take the first stab today.

Martin on words:

I think we should be very careful about words that have a distinct WordHoard or nora history and either avoid them or be very explicit about their meaning. 'Chunk' seems to me a word of that kind. It has a very specific meaning in the nora context, where 'chunk', 'nora chunk', and 'nora chunk file' are used in quite project-specific ways. 'Monk chunk' might be a nice mouthful, but we need to make sure that we don't inadvertently carry things over from previous projects. I say 'inadvertently' advisedly because we will clearly carry many things forward. But we need to be aware of it.


A chunk is a bag of words with labels. It is used as a unit for data analysis and navigation. That is it. Nothing less than that. "Words", "ngrams" and "entities" are not chunks from the Nora perspective.

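The definition of a chunk as a labelled bag of words can be pictured with a small data structure. This is only a sketch of the idea: the Chunk class, its field names, the lowercasing of tokens, and the sample labels are all assumptions for illustration, not the Nora chunk file format.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Chunk:
    """A bag of words (order discarded, counts kept) plus descriptive labels."""
    labels: Dict[str, str]
    words: Counter = field(default_factory=Counter)

    @classmethod
    def from_tokens(cls, tokens: Iterable[str], **labels: str) -> "Chunk":
        # Lowercasing is an assumption of this sketch, not a stated requirement.
        return cls(labels=dict(labels), words=Counter(t.lower() for t in tokens))


# Hypothetical labels and text, for illustration only.
chunk = Chunk.from_tokens(["To", "be", "or", "not", "to", "be"],
                          work="Hamlet", unit="speech")
```

Because a bag keeps counts but not word order, a structure like this cannot answer n-gram questions on its own, which is consistent with treating n-grams as something other than chunks.
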
MONK lexicon / glossary
It was pointed out that we often use terminology in project discussions ('chunk' or 'word', for instance) that has specific but perhaps different meanings to Nora and WordHoard members. It was suggested that we begin compiling a lexicon of project terms with definitions that we can refer to. It is thought that compiling such a glossary will also help clarify many issues currently in discussion. Phil Burns and Amit Kumar volunteered to start working on a first draft.

N-grams
We want a facility to obtain any n-gram that existing data affords. Are there particular types of n-grams that should be pre-computed? Bernie Arcs asked if we have considered use of a document coordinate system to extract arbitrary n-grams. Phil Burns said that is a useful and commonly used approach and confirmed that document ordinal values for words are obtained in current MorphAdorner processing. Martin Mueller suggested the value of a repeated phrase 'dictionary'. It was observed that N-gram expression/discovery and named entity discovery might be the same task and an example of one such dictionary.
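
The document coordinate approach and the repeated phrase 'dictionary' can be sketched together: if every word has an ordinal position, any n-gram is a slice of the word sequence, and a dictionary of repeated phrases is a map from each n-gram to the ordinals where it starts. The function names ngram_at and repeated_phrases and the zero-based ordinals are assumptions for this example; how MorphAdorner actually numbers words is not specified here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def ngram_at(words: List[str], start: int, n: int) -> Tuple[str, ...]:
    """Return the n-gram that begins at word ordinal `start` (zero-based)."""
    if start < 0 or n < 1 or start + n > len(words):
        raise IndexError("n-gram falls outside the document")
    return tuple(words[start:start + n])


def repeated_phrases(words: List[str], n: int) -> Dict[Tuple[str, ...], List[int]]:
    """Map each n-gram that occurs more than once to its starting ordinals."""
    positions: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for start in range(len(words) - n + 1):
        positions[ngram_at(words, start, n)].append(start)
    return {gram: starts for gram, starts in positions.items() if len(starts) > 1}


# Hypothetical word sequence, for illustration only.
words = "to be or not to be".split()
repeats = repeated_phrases(words, 2)
```

Storing only ordinals keeps the dictionary small, since the phrase text can always be recovered from the word sequence with ngram_at.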
