This page last changed on Oct 11, 2007 by plaisant@cs.umd.edu.

NOTES: We tried to pick words we could use consistently in our discussions, in the interface, and in the code and middleware calls. The terms themselves may change later on, but hopefully the "animals/concepts" they represent are becoming stable.

  • MONK is a web environment with a MONK USER INTERFACE. Text is available in COLLECTIONS of WORKS made of work PARTS. Users use TOOLS integrated in TOOLSETS. New toolsets can be created using a WORKBENCH.
  • COLLECTION: consists of a TABLE OF CONTENTS, which describes the hierarchical structure of its PARTS, and has METADATA (e.g. provenance, curator).

MM: Collection continues to be a difficult word for me. Let us assume that a given MONK environment consists of a number of works. A work is something whose integrity is established by some combination of authorial and editorial decisions. 'Dubliners' and 'The Man with the Blue Guitar' are works in the sense that they were published as single books. 'The Dead' is not a work in that sense, but is part of a work. These distinctions have more to do with history or conventions than with logic. But there they are.

Some works are parts of collections that are recognized bibliographical entities of one kind or another. 'Great Expectations' enters MONK as part of NCF; 'Moby Dick' enters it as part of the Wright Archive. Some properties of the digital encoding can be inferred from the membership of a given text in a particular collection of that sort.

Some works in MONK are not part of collections in that sense. Stein's Making of Americans is not a collection, but a single work. The poems and letters of Emily Dickinson are distinct works (or are they parts of a single bibliographical entity, "The Works of Emily Dickinson?"). But they are not a collection in the sense of TCP or NCF.

We may be better off thinking in terms of a library. When I go to a library, I don't approach items as parts of a collection except in the trivial sense that all the items are part of the collection that is the library as a whole. I look for individual items in the library catalogue. Some of these items may be part of a collection (Rare Books, Music, Transportation), but that is incidental to my search.

To the extent that MONK is like a library it has a catalog, and from that catalog I choose items to work with, that is to say, a work set or "My Collection," a completely arbitrary construction of a set of items for a particular project. I think that MOA and Dickinson in the nora project are collections in that sense: project specific work sets.

My hunch is that users will have an easier time with MONK if work at the bibliographical level works like it does in a library. The primary object is the bibliographical record for an individual item: Moby Dick. Whether this is part of Wright or the Library of the Americas is secondary (though it certainly should be there in the bibliographical record).


CP: Regarding MM's comment above: I think we can deal with this problem with an adequate UI. Users will be able to browse and search both at the collection level and at the work level (e.g. they can ask for a list of works written by Author X or by women), so we are OK I think... Amit made some strong points for keeping the TOC at the collection level to deal with subcollections etc.

  • PART: (previously called chunks) is a piece of text. Parts have a type e.g. WORKS and PARAGRAPHS.
  • WORK: a part in a Collection. Works are special because they have METADATA (author, date etc.). So far they are the only parts to have metadata.

MM: As above: a work may, but need not, be part of a collection. But you access it directly rather than through a collection.
CP: this should be OK. Now that we know it could happen we will deal with it. A collection is only required to have one work and one paragraph! E.g. Stein could be alone in its own MONK collection... not a problem.

  • PARAGRAPHS are the lowest possible level of parts in the table of contents. Collections have at least 1 work, which has at least 1 paragraph.
  • The TABLE OF CONTENTS describes the structure of a collection. It is created when the collection is ingested into MONK, and it uses the structure of the XML tags found in the original files. A collection needs to have at least one work. Analysis can only be done at the WORK or PARAGRAPH level. Rationale: there is no standard vocabulary for the types of parts (i.e. terms like subcollection, chapter, section, play, act, letter are not used consistently in the tags), so we cannot say which parts are equivalent across collections in order to run an analysis across equivalent types of parts. The evolution path might be: today we can only do analysis at the paragraph and work levels. In the future we might define a MONK standard set of types and enlist the help of curators to make collections MONK-compliant. The scholars who really need the added functionality will most likely be willing to invest some time in that tedious work if the benefits are made clear.
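
As a rough illustration of the structure described above, a table of contents can be pictured as a tree of typed parts. This is only a sketch: the class name `Part`, its fields, and the helper functions are hypothetical and not part of any MONK code, the "Chapter 1" node is invented, and reading the rule as "every work has at least one paragraph" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Part:
    """A node in a collection's table of contents (hypothetical model)."""
    part_type: str                     # e.g. "collection", "work", "chapter", "paragraph"
    label: str
    children: List["Part"] = field(default_factory=list)
    # So far only collections and works carry metadata of their own.
    metadata: Dict[str, str] = field(default_factory=dict)


def descendants(part: Part, part_type: str) -> List[Part]:
    """Return all descendants of `part` with the given type, in document order."""
    found = []
    for child in part.children:
        if child.part_type == part_type:
            found.append(child)
        found.extend(descendants(child, part_type))
    return found


def is_valid_collection(collection: Part) -> bool:
    """Assumed rule: at least one work, and every work has at least one paragraph."""
    works = descendants(collection, "work")
    return bool(works) and all(descendants(w, "paragraph") for w in works)


# A single-work collection, as in the Stein case discussed above.
stein = Part("collection", "Stein", children=[
    Part("work", "The Making of Americans",
         metadata={"author": "Gertrude Stein"},
         children=[Part("chapter", "Chapter 1",
                        children=[Part("paragraph", "p1")])]),
])
empty = Part("collection", "Empty")
```

Intermediate part types such as "chapter" are allowed in the tree but ignored by the check, which matches the idea that only the WORK and PARAGRAPH levels are guaranteed to be comparable across collections.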

MM:
I have second thoughts about 'table of contents.' It may confuse users. A table of contents is typically found at the beginning of a book or single work. The word for the table of contents of a collection is 'catalog.' There certainly needs to be a MONK catalog, which is a list sortable by the major bibliographical criteria of author, date, genre, sex, place of origin. The bibliographical collection status of an item should be a searchable criterion, but it will rarely be of primary interest to users. If I want to look for fiction written by women between 1600 and 1870, I am unlikely to care whether a given item comes from TCP, EAF, NCF, DocSouth, or Wright. Nor should I need information about the collection to identify items of a certain kind. The fact that an item is of a certain kind (novel, sermon, play, written by a woman, originating in the American South) needs to be recorded in the metadata for each work so that items can be directly retrieved from the MONK library in terms of their classifications.

Would it make sense to give each item an LC catalogue number, which would allow us to leverage a lot of subject classification? The question arises in a slightly different manner with the TCP collection. There, each record has an STC number, i.e. the catalog number of its bibliographical record in the Short Title Catalog.

A table of contents is something that the user sees. Do we really envisage users seeing a representation of 'Bleak House' with its 7,000 distinct paragraphs? I find that use case very hard to imagine. That's different from saying that users should be able to identify and select a paragraph for a particular purpose (e.g. adding it to a training set). But they are extremely unlikely to do this by choosing from a tree structure of some sort. It will be much more intuitive to read a chapter or similar section and select paragraphs or other text segments as you go along.

It may be that 'tree structure' or 'chunk hierarchy' will be better terms for what we intend here. But these are terms for internal discussion.

CP: to discuss... I do like table of contents better than tree structure. This also reveals that a Workset also needs a table of contents, as I WILL argue that yes, you want to see the context of the text parts you are working with, at least at the work level (maybe not at the collection level): e.g. when your workset is the set of sentimental parts, then you do want to see a table of contents of your workset.

  • Issue: do all collections actually have divs called works and paragraphs? Will we convert by hand or set equivalences for those 2 levels we expect in a MONK collection? Or will we just treat the lowest level as the paragraph, whatever it is? (E.g. in Dickinson I am not sure there are paragraphs.)
  • SENTENCES: Sentences can be identified with MorphAdorner and can be counted, or their length can be averaged. So there will be metrics related to sentences, but sentences are not parts in the table of contents.

MM Sentences are identified in the process of tokenization and sentence splitting. Their start and end points are explicitly marked, and they are identifiable objects with unique IDs just as word tokens are. They are not, however, part of the tree structure of a work, since they often cross XML element boundaries in certain forms of writing (poetry, drama).

CP: I guess this is just a clarification you give, no problem.

  • ATTRIBUTES: There are 2 types of MONK attributes (note that in the interface we may call them all attributes, but we need more specific words for ourselves):
  1. METADATA (extracted from the bibliographic records or TEI tags) describes COLLECTIONS and WORKS. Parts only inherit the metadata from the work they belong to (e.g. the author of a Part is the author of the work).
  2. METRICS are numerical attributes which correspond to anything that can be counted, averaged, plotted etc. They are attributes of the parts and can be aggregated up the table of contents to the work or even the collection (e.g. number of words). Question: some metrics may not be aggregated easily, e.g. number of unique words. Is this a problem?

MM 'Unique word' is an ambiguous term. 'Unique' in what context? Every work has a lot of words that occur only once in it. But the word may occur in other works. The ratio of lemmata that occur only once in a given work can be a useful measure of lexical density. You cannot sum the unique words of works. You can determine unique words in an author. That too may be useful and does in fact follow automatically from the aggregation of lexical data.

CP: I was just using unique words as an example; analytics has to define those metrics and publish a definition. I just know that the number of unique words is useful to Tanya to get an idea of the amount of repetition... e.g. Green Eggs and Ham has only 50 unique words if I remember correctly.
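
To make the aggregation question concrete, here is a small sketch of why word counts can be summed up the table of contents while distinct-word counts cannot. It assumes each part stores a frequency table of lowercased tokens; the function names and the two sample parts are invented for illustration and say nothing about how MONK analytics actually defines these metrics.

```python
from collections import Counter


def word_counts(tokens):
    """Frequency table of lowercased tokens for one part."""
    return Counter(token.lower() for token in tokens)


def aggregate(tables):
    """Merge per-part frequency tables into one table for the parent."""
    merged = Counter()
    for table in tables:
        merged.update(table)
    return merged


part_a = word_counts("I do not like green eggs and ham".split())
part_b = word_counts("I do not like them Sam I am".split())
work = aggregate([part_a, part_b])

# "Number of words" is additive: the parent total is the sum of the parts.
total_words = sum(part_a.values()) + sum(part_b.values())

# "Number of distinct words" is not additive: shared words get counted twice.
naive_distinct = len(part_a) + len(part_b)
true_distinct = len(work)

# Words occurring exactly once must also be recomputed on the merged table.
once_only = sorted(word for word, n in work.items() if n == 1)
```

The design point is that if the frequency tables themselves are aggregated, metrics such as distinct words or once-only words can be derived at any level, whereas storing only the per-part numbers would not allow that.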

  • TOKENS are the low level objects used in analytics:
  1. Single word tokens: spelling, standardized spelling, lemma, stemmed form, soundex form, any of the PART OF SPEECH descriptors etc.
  2. Multi-word tokens: bi-grams, ngrams. It is not clear if we will use sentences as low-level items.
    Note: Reversed spelling is only a display option, not a token.
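
For readers less familiar with multi-word tokens, the following sketch shows how bi-grams and longer ngrams can be derived from a sequence of single-word tokens. The function name `ngrams` is our own; in practice the input could be spellings, lemmata, or part-of-speech tags, which is an assumption about usage rather than a description of the MONK implementation.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples, in text order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = ["call", "me", "ishmael"]
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
too_long = ngrams(words, 4)   # fewer tokens than n yields an empty list
```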

MM If you imagine a MONK Dictionary, which is the aggregate of information about tokenization, there are kinds of information about token attributes that are context independent and can be stored at the dictionary level. The phonetic quality of a word or its reverse spelling belong there. You can also encode information about the stop word status of certain words at that level.

Question: do we have stemmed tokens, or is a stem the same as a lemma?

MM Lemmatization and stemming are different. See http://snowball.tartarus.org for what I think is a good reference site.

  • SIDE TEXT: things like the preface, footnotes, speaker labels in plays etc., i.e. they can be entire parts or small segments of text inside a part. MorphAdorner does mark side text at a low level (e.g. speaker labels in plays), but we need to decide how we mark larger side texts. Is it an attribute of PARTS? (This would mean that all PARTS need metadata, not just the works.)

MM For the distinction of main and side text you must make a distinction across all works in a Monk environment about elements whose content counts as main or side. It may be possible to vary these distinctions by sizable batches of texts from different collections in the bibliographical sense. Perhaps distinctions by genre are even more helpful. Side text in a play, for instance, can be identified with considerable precision.

Main text consists of the content of all child elements of the <body> element, except for <stage>, <speaker>, <table>, <note>.

Side text consists of the content of the <front> and <back> elements, plus the list of elements excluded from <body>.
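
The two rules above can be sketched with Python's standard xml.etree.ElementTree module. This is a minimal illustration under several assumptions: tags carry no XML namespace, <front>, <body> and <back> are not nested inside one another, and the function name and sample document are invented. It does not attempt the genre-dependent handling of <head>.

```python
import xml.etree.ElementTree as ET

# Elements inside <body> whose content counts as side text.
SIDE_TAGS_IN_BODY = {"stage", "speaker", "table", "note"}


def split_main_and_side(tei_xml):
    """Return (main, side) lists of text fragments, each in document order."""
    root = ET.fromstring(tei_xml)
    main, side = [], []

    def walk(elem, is_side):
        # Once inside a side element, everything below it is side text too.
        is_side = is_side or elem.tag in SIDE_TAGS_IN_BODY
        bucket = side if is_side else main
        if elem.text and elem.text.strip():
            bucket.append(elem.text.strip())
        for child in elem:
            walk(child, is_side)
            # Tail text follows the child but belongs to this element.
            if child.tail and child.tail.strip():
                bucket.append(child.tail.strip())

    for section in root.iter():
        if section.tag in ("front", "back"):
            walk(section, True)
        elif section.tag == "body":
            walk(section, False)
    return main, side


SAMPLE = """<text>
  <front><p>Dramatis personae</p></front>
  <body>
    <sp><speaker>Ham.</speaker><l>To be, or not to be</l></sp>
    <stage>Exit.</stage>
  </body>
</text>"""

main_text, side_text = split_main_and_side(SAMPLE)
```

Keeping both lists, rather than discarding side text, leaves the choice of whether to use it to the user interface.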

Main text thus defined will only include words that readers will naively consider the 'real' text, but it may not include all those words. Side text is a miscellany of stuff. Whether side text by itself is a useful searchable body I don't know. I doubt it. So the point of side text is to filter out some stuff. There is no claim that the filtered-out stuff is in itself a useful thing.

Some users will find this distinction very useful, others will hate it. So you have to give users the option to ignore it. One useful feature of side text is that it is likely to filter out a lot of problematical sentences.

The <head> element is tricky in this regard. In certain genres, notably fiction, it will in 99.9% of cases consist of side text. In some TCP texts that have a question and answer structure (e.g a catechism), the question is encoded in a <head> element. There it is obviously part of the main text.

CP: as long as the ingestion process generates the data, the UI can give users the option to use the side text or not, maybe with some level of granularity if the data is available. Whether we should assume the data will be there is the question! (Or we ignore this problem for now.)

  • WORKSET: the set of PARTS of interest, selected by a user to conduct their work. Users can have multiple worksets which can be edited, emailed etc. They represent the entire scope of interest, from which 2 or more corpora might be subsetted and compared.
  • CUSTOM PASSAGE: a continuous logical segment of text that does not correspond to a chunk. It may be contained in a chunk or cross chunk boundaries. We are not planning to allow custom passages in the short term, but they might be needed to allow users to rate good and bad examples with less noise.

MM Some discussions in the data cell suggest that this may actually be the simplest way for users to identify passages of arbitrary length for use in training corpora.

  • CORPUS (or SET_OF_PARTS): the sets of work parts that you want to compare. E.g. the sentimental corpus is all the parts in a NB data mining class rated sentimental; corpora can also be generated automatically by a cluster analysis, or a corpus could be the text spoken by one character, to be compared to the corpus of text spoken by another character.
  • FEATURE: any aspect of a corpus that can be used as a criterion in a search or emerges as a characteristic during data mining. In other words: users will search for features in the text, or features will be revealed to them by data mining. Features can be of many different types:
  1. instances of tokens (i.e. specific words, specific ngrams, or POS values (e.g. past tense verbs), specific sounds) or
  2. metadata values (e.g. specific authors, specific genre, decades or year, place of origin) or
  3. named entities (e.g. King Arthur appears a lot in this corpus).
    Features may be defined as BASKETS, i.e. a list of features (e.g. all the words representing love).
    Note: more complex features may also appear as we get more advanced: e.g. clusters of ngrams can be returned by a repetition analysis.
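
As a small illustration of a BASKET used as a search feature, the sketch below counts how often any member of a basket occurs in a corpus. The basket contents, the sample tokens, and the function name are all invented; a real implementation would presumably match on lemmata rather than on lowercased spellings.

```python
# A hypothetical basket: a named set of token-level features.
LOVE_BASKET = {"love", "beloved", "adore", "darling"}


def count_basket_hits(tokens, basket):
    """Count tokens whose lowercased spelling is a member of the basket."""
    return sum(1 for token in tokens if token.lower() in basket)


corpus_tokens = "My darling , I love you and adore you".split()
hits = count_basket_hits(corpus_tokens, LOVE_BASKET)
```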

MM
We may not get a clean nomenclature for features, criteria, attributes. 'Factor' may be another term that will come in. It is a well-established term in statistical analysis. For several potential analytics in MONK, sex, date, or genre will be 'factors' in the analysis.

A distinction for which we do not yet have a good vocabulary has to do with what the user knows in advance of an operation. Factors are always known in advance. Features are (partially) discovered: in text classification, I settle on some criteria in advance (lemmata, bigrams, POS tags, tag n-grams, separately or in combination) but I know nothing about the variables that make the difference.

  • Note about authors: we may benefit from a separate relation for authors with their own attributes: gender, nationality etc. instead of duplicating all that info in the work metadata
  • RATINGS: ratings are annotations; they have a label (e.g. sentimentality), a value, an author and an access status (private or public). For performance reasons ratings are attached to the workset.
  • NAMED ENTITIES
    We will have People and Places (with links to their locations in the text i.e. wordID)
    Issue: we need a way to allow cleaning of this data, i.e. manual or semi-manual corrections. This is essential.
  • ENTITY RELATIONSHIPS: sets of pairs linking people and people, or possibly people and places.
  • Users can use individual TOOLS (e.g. to browse collections or to get a concordance), use pre-defined TOOLSETS that combine tools to accomplish more complex analysis (e.g. compare 2 corpora), or they can assemble sets of tools into their own custom toolsets using a WORKBENCH.
  • TOOL: corresponds to the smallest unit of functionality offered to users (and probably a software module that can be composed with others). Individual tools can be combined in the workbench. Some may be used autonomously, others only in combination with other tools in a toolset.
  • TOOLSET: an integrated set of tools built using the workbench.
  • Pre-defined toolset: a toolset prepared by MONK for novice users as an independent application they can use "as is"
  • WORKBENCH: the tool users use to combine individual TOOLS into custom TOOLSETS to accomplish new tasks not possible with the existing pre-defined toolsets.
  • PROJECT: a composite made of one or more worksets and toolsets, a time-stamped stack of history states you can return to, and user generated content. A project has an owner, private/public access and a description.
  • HISTORY STATE: a saved point to which you can return. A URL can be generated to return directly to that state.
  • ANNOTATIONS: the user generated content
    Ratings of parts
    text annotations of history states
  • MIDDLEWARE = what was called Proxy before
  • DATASTORE: where the data lives
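
To show how several of the terms above fit together, here is a hedged sketch of RATINGS attached to a WORKSET, a CORPUS derived from those ratings, and a PROJECT with a stack of HISTORY STATES. Every class, field, and method name is hypothetical, rating values are assumed to be numeric, and nothing here reflects an actual MONK schema or middleware call.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rating:
    part_id: str
    label: str            # e.g. "sentimentality"
    value: float          # assumed numeric; the notes do not fix a type
    author: str
    public: bool = False  # access status: private unless stated otherwise


@dataclass
class Workset:
    part_ids: List[str]
    ratings: List[Rating] = field(default_factory=list)  # ratings live on the workset

    def corpus_for(self, label, min_value):
        """Parts of this workset rated at least `min_value` for `label`."""
        rated = {r.part_id for r in self.ratings
                 if r.label == label and r.value >= min_value}
        return [p for p in self.part_ids if p in rated]


@dataclass
class HistoryState:
    timestamp: float
    note: str = ""        # text annotation of the state


@dataclass
class Project:
    owner: str
    description: str
    public: bool = False
    worksets: List[Workset] = field(default_factory=list)
    history: List[HistoryState] = field(default_factory=list)

    def save_state(self, timestamp, note=""):
        """Push a new history state and return its position in the stack."""
        self.history.append(HistoryState(timestamp, note))
        return len(self.history) - 1


workset = Workset(
    part_ids=["p1", "p2", "p3"],
    ratings=[Rating("p1", "sentimentality", 0.9, "cp"),
             Rating("p2", "sentimentality", 0.2, "cp"),
             Rating("p3", "irony", 0.8, "mm")],
)
sentimental_corpus = workset.corpus_for("sentimentality", 0.5)

project = Project(owner="cp", description="demo", worksets=[workset])
first_state = project.save_state(1192060800.0, "after first rating pass")
```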

WORDS WE SHOULD AVOID USING if we are trying to be precise and avoid confusion:

  • documents or texts: instead clarify whether you are talking about WORKS (that have metadata like author or date) or PARTS. Of course those words will still be used in generic terms, BUT NOT when describing a middleware call or a user interface feature or widget.
  • chunk hierarchy, or structure: now use table of contents
  • proxy: now called middleware
  • secrets : isn't it the "features"?
  • word patterns? are they the features? maybe the more complex features.
Document generated by Confluence on Apr 19, 2009 15:04