|
This page last changed on May 01, 2007 by plaisant@cs.umd.edu.
(Tanya, Sara, Martin, Catherine + notes sent from Kirsten)
Friday April 27
We discussed Martin's document and gave feedback and suggestions.
We all agreed that this document was very helpful.
Title/Scope
We were concerned with the "user manual" part of the title. The beginning reads well as an introduction to Monk, but the rest of the document is more like "Reflections on Monk", "The philosophy of Monk" or "Toward Monk", with open questions, than a manual, which should reflect the decisions we took. From this philosophy document should stem a more technical description of the algorithms and the "how to" user manual. The more open-ended research questions should be separated out and identified as such in their own section. Adding more structure to the document would help.
Overall focus
We felt that even though many nora-flavored aspects were mentioned briefly, there was a stronger focus on the wordhoard word-based heritage. There were also some worries from use case writers that aspects of their work had no clear place (yet?) in the document. (These two issues are related.) Martin needs help emphasizing the nora-heritage philosophy (Matt was proposed as a good candidate :) ).
Need for a dictionary of Monk terminology
This document is a good step toward clarifying many things (e.g. what is metadata in Monk), but we as a group use terms in a loose and inconsistent way. We would all benefit from our own dictionary, so an early section dedicated to definitions would help.
E.g. the document talks about a lexicon, catalogues, a word class, a genre. This is a bit confusing without definitions. What is the difference between a Monk collection and a document/text (the distinction is sometimes messy), a chunk, a paragraph (defined by tags), a word (separated by spaces, but is it in fact only what actually appears in the text?), a lemma, punctuation, etc.? What are the names of the many forms a text can take (raw, stemmed, lemmatized, POS, phonetics/soundex, summarized to a list of topics)?
Metadata
We felt that the description of the 3 levels of metadata was very useful. The top and bottom levels were clear, but we felt that a lot of the Monk functionalities will rely on the middle layer, which was not defined enough. As Martin said, the middle layer is messy and hard to achieve, but we can't dismiss it entirely without abandoning a lot of Monk. We need to define what the Monk middle layer of metadata will be.
Having in that document a proposed list of middle layer metadata (at least to address the use cases we know about) would be helpful.
A starting list of the middle layer metadata might be:
- collection structural information (e.g. chapters, sections etc. (practically the chunk hierarchy) but also characteristics specifying preface, TOC, etc. so users can navigate between sections, compare sections etc.)
- all the counts at structural levels lower than the document level.
- features extracted from analytics tools (e.g. frequent patterns)
- user-added knowledge in the form of tags (e.g. the training sets in the classification tools e.g. sentimental chunks, or any other exploration tools e.g. accusations or personal)
- preprocessing tags for Names and Places, or for the presence of dialogs, or poetry etc.
- prosody information (if we want to study how a text sounds)
- Others missing?
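To make the starting list above more concrete, here is a rough sketch of how the middle layer could be recorded per chunk. This is only an illustration for discussion, not a decision: the `Chunk` record, its field names and the `chunks_with_tag` helper are all hypothetical and do not come from Martin's document or from any existing Monk code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Chunk:
    """One node in the chunk hierarchy (hypothetical record, for discussion only)."""
    chunk_id: str
    kind: str                                  # structural role, e.g. "chapter", "preface", "toc"
    parent_id: Optional[str] = None            # link to the enclosing chunk, None at the top
    counts: Dict[str, int] = field(default_factory=dict)         # counts below the document level
    features: List[str] = field(default_factory=list)            # e.g. frequent patterns from analytics tools
    user_tags: List[str] = field(default_factory=list)           # user-added knowledge, e.g. "sentimental"
    preprocessing_tags: List[str] = field(default_factory=list)  # e.g. "name", "place", "dialog", "poetry"


def chunks_with_tag(chunks: List[Chunk], tag: str) -> List[str]:
    """Return the ids of the chunks that carry a given user tag."""
    return [c.chunk_id for c in chunks if tag in c.user_tags]


# Made-up example data: two chunks, one tagged by a user as a training example.
example_chunks = [
    Chunk("ch1", "chapter", counts={"words": 1200}, user_tags=["sentimental"]),
    Chunk("pref", "preface", counts={"words": 300}),
]
sentimental_ids = chunks_with_tag(example_chunks, "sentimental")
```

A flat record per chunk with a parent link is only one possible shape. Prosody information is left out here because the notes do not say what form it would take.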
The section "what is metadata good for" focuses only on POS.
The other forms of the text and the middle layer metadata will open the door to types of analysis beyond what POS provides.
Miller
We found that Miller's seven +/- 2 was overemphasized. It was designed for short-term memory tasks and is often used counterproductively. Visualization is a good example, as it would be silly to show only seven items. In Monk we are not trying to have scholars memorize things but see patterns. Grouping and sorting are still appropriate, of course.
Making use case data available somewhere now
We also discussed the need to clarify how we will get the collections our use case users need into Monk. We thought that it is time to make them available somewhere so that scholars can use the existing tools with the data they care about. Martin agreed to bring that question to the Data Cell next week.
Extra from Catherine
- I feel that the footnote on page 5 is misleading (about computers being more trustworthy than people...). That is true for counting, but computers can't do things like POS correctly, so an accurate count of bad data can't be trusted either. The challenge is how to make clear and understandable how much trust users should put in the results.
- Also, word frequencies do not really tell you what the text is about or the kind of writer you are. It's more like DNA or a fingerprint: it uniquely defines an individual but can't predict how you think as a person or what the text means.
|