This page last changed on Feb 08, 2008 by plaisant@cs.umd.edu.

MONK lexicon: work area for discussion. Last updated 2007/10/04.

LIST of TERMS in meaningful order

  • MONK is a web environment with a MONK USER INTERFACE. Text is available in COLLECTIONS of WORKS made of work PARTS. Users use TOOLS integrated in TOOLSETS and WORKBENCHES.
  • COLLECTION: consists of a TABLE OF CONTENTS of WORKS and has METADATA (e.g. provenance, curator).
  • WORK: has a TABLE OF CONTENTS of PARTS. Works have METADATA (author, date, etc.).
  • PART: (previously called chunks) is a piece of text. Parts have a type e.g. PARAGRAPHS.
    Analysis can only be done at the WORK or PARAGRAPH level. Rationale: there is no standard vocabulary for the types of parts (i.e. terms like sub-collection, chapter, section, play, act, and letter are not used consistently in the tags), so we cannot say which parts are equivalent across collections in order to run an analysis across equivalent types of parts. The evolution path might be: today we can only do analysis at the paragraph and work levels. In the future we might define a MONK standard set of types and enlist the help of curators to make collections MONK-compliant. The scholars who really need the added functionality will most likely be willing to invest some time in that tedious work if the benefits are made clear.
  • PARAGRAPHS are the lowest possible level of PARTS in the table of contents.
  • The TABLE OF CONTENTS describes the structure of a collection or a work. It is created at the time of the ingestion of the collection in MONK, using the structure of the XML tags found in the original files. Issue: not all collections actually have divs called works and paragraphs, so some equivalent has to be defined, e.g. stanza and line group.
  • SENTENCES: Sentences can be identified with MorphAdorner and can be counted, or their length can be averaged. So there will be metrics related to sentences, but sentences are not parts in the table of contents.
  • METADATA: There are different types of MONK metadata (note that in the interface we may call them all attributes, but we need more specific words for ourselves):
  1. BIBLIOGRAPHIC METADATA (extracted from the bibliographic records or TEI tags) describes COLLECTIONS and WORKS. Parts only inherit the metadata from the work they belong to (e.g. the author of a Part is the author of the work).
  2. MORPHO-SYNTACTIC METADATA:
    2 types: Single-word MSmetadata (spelling, standardized spelling, lemma, stemmed form, soundex form, any of the PART OF SPEECH descriptors, etc.) and Multi-word MSmetadata (bigrams, ngrams). It is not clear if we will use sentences as low-level items. Note: Reversed spelling is only a display option, not metadata.
  3. STRUCTURAL METADATA (but it is really the TABLE OF CONTENTS, put here for completeness)
  4. METRICS are numerical attributes which correspond to anything that can be counted, averaged, plotted, etc. They are attributes of the parts and can be aggregated up the table of contents to the work or even collection level (e.g. number of words). Issue: some metrics may not be aggregated easily (e.g. number of unique words) and may require PART-level metadata. Is this a problem?
  • PARATEXT: things like the preface, footnotes, speaker labels in plays, etc., i.e. they can be entire parts or little segments of text inside a part. MorphAdorner does mark side text at a low level (e.g. speaker labels in plays), but we need to decide how we mark larger side texts. Is it an attribute of PARTS? (This would mean that all PARTS need metadata, not just the works.)
  • WORKSET: the set of PARTS of interest, selected by a user to conduct their work. Users can have multiple worksets which can be edited, emailed etc. They represent the entire scope of interest, from which 2 or more corpora might be subsetted and compared.
  • CUSTOM PASSAGE: we are not planning to allow them in the short term. A custom passage is a continuous logical segment of text that does not correspond to a part (previously called a chunk). It may be contained in a part or cross part boundaries. Might be needed to allow users to rate good and bad examples with less noise.
  • CORPUS (or SET_OF_PARTS): the sets of work parts that you want to compare. E.g. the sentimental corpus is all the parts in a NB data mining class rated sentimental; corpora can also be generated automatically by a cluster analysis; or it could be the corpus of text spoken by a character, to be compared to the corpus of text spoken by another character. Issue: not sure it's the right word, but we'll see later when we get to that. We may just split worksets and compare them.
  • FEATURE: anything which can be used as a criterion in a search, a factor in an analysis, or emerges as a characteristic during data mining. In other words: users will search for features in the text, or features will be revealed to them by data mining. Features can be of many different types:
  1. instances of MSMetadata (i.e. specific words, specific ngrams, or POS values (e.g. past tense verbs), specific sounds) or
  2. metadata values (e.g. specific authors, specific genre, decades or year, place of origin) or
  3. named entities (e.g. King Arthur appears a lot in this corpus).
    Features may be defined as BASKETS i.e. a list of features (e.g. all the words representing love).
    Note: more complex features may also appear as we get more advanced: e.g. clusters of ngrams can be returned by a repetition analysis.
  • Note about authors: we may benefit from a separate relation for authors with their own attributes: gender, nationality etc. instead of duplicating all that info in the work metadata
  • RATINGS: ratings are annotations (OR NOW WE MAY CALL THEM "INTERMEDIATE DATA"). They have a label (e.g. sentimentality), a value, an author, and an access status (private or public). For performance reasons ratings were attached to the workset, but this is being changed.
  • NAMED ENTITIES
    We will have People and Places (with links to their locations in the text i.e. wordID)
    Issue: we need a way to allow cleaning of this data, i.e. manual or semi-manual corrections. This is essential.
  • ENTITY RELATIONSHIPS: sets of pairs linking people and people, or possibly people and places.
  • Users can use individual TOOLS (e.g. to browse collections or to get a concordance), use pre-defined TOOLSETS that combine tools to accomplish more complex analysis (e.g. compare 2 corpora), or they can assemble sets of tools into their own custom toolsets using a WORKBENCH.
  • TOOL: corresponds to the smallest unit of functionality offered to users (and probably a software module that can be composed with others). Individual tools can be combined in the workbench. Some may be used autonomously, others only in combination with other tools in a toolset.
  • TOOLSET: an integrated set of tools built using the workbench.
  • Pre-defined toolset: a toolset prepared by MONK for novice users as an independent application they can use "as is" (but more advanced users may still add tools to the toolset to customize it and save it as their own toolset).
  • WORKBENCH: A work environment where a set of TOOLS and TOOLSETS are available. The PROGRAMMER WORKBENCH has ALL the tools in it, but because there will eventually be so many tools, SPECIALIZED WORKBENCHES will be created to accomplish particular high-level tasks (e.g. a repetition workbench, or a search-by-example workbench). A specialized workbench can have a single toolset (in which case that toolset is opened automatically at the start, like in the example we have now "wrongly" called programmer's workbench), or multiple toolsets, like the designer workbench we have. When a workbench has multiple toolsets it starts with a menu of toolsets and a way to create your own toolsets. Tools typically have a single window. Toolsets are coordinated sets of Tools. They appear as coordinated windows with predefined placements. They may include a sequence of steps to help users go through the tools one at a time.
  • PROJECT has one or more worksets, results (from using toolsets) and intermediate data such as training sets.  It also will eventually have a time stamped history stack of states you can return to, and user generated content such as comments and notes. A project has an owner, private/public access and a description.
  • HISTORY STATE: a saved point to which you can return. A URL can be generated to return directly to that state.
  • ANNOTATIONS: the user generated content
    text annotations of history states
    (not clear if ratings are considered annotations or intermediate data)
  • PROXY or MIDDLEWARE lives between the interface and the DATASTORE
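
The aggregation issue raised under METRICS above can be made concrete with a small sketch. The part names and word lists below are invented for illustration and are not MONK data: a word count can simply be summed up the table of contents, whereas a unique-word count cannot, because the same word may occur in several parts. Getting the work-level figure right requires the part-level vocabularies themselves.

```python
# Hypothetical parts of a single work; each part is a list of word tokens.
parts = {
    "part-1": ["the", "monk", "reads", "the", "text"],
    "part-2": ["the", "text", "is", "long"],
}

# Additive metric: the work-level value is the sum of the part-level values.
work_word_count = sum(len(tokens) for tokens in parts.values())

# Non-additive metric: summing part-level unique counts overcounts words
# that are shared between parts.
naive_unique = sum(len(set(tokens)) for tokens in parts.values())

# Correct aggregation needs the part-level vocabularies, not just the counts.
work_unique = len(set().union(*(set(tokens) for tokens in parts.values())))

print(work_word_count, naive_unique, work_unique)
```

Here the summed unique count is larger than the true work-level count, which is why such metrics may need part-level metadata to be stored.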

WORDS WE SHOULD AVOID USING if we are trying to be precise and avoid confusion:

  • documents or texts: instead, clarify whether you are talking about works, collections, or parts. Of course those words will still be used in generic terms, BUT NOT when describing a middleware call or a user interface feature or widget.
  • chunk hierarchy, or structure: now use table of contents
  • word patterns? are they the features? may be the more complex features.

-------------------------

NOT UPDATED BELOW

I DO NOT UPDATE THIS ALPHABETICAL LIST BECAUSE THIS IS NOT A GOOD WAY TO LEARN ABOUT THE VOCABULARY.  AND IF YOU KNOW THE TERM YOU WILL JUST SEARCH FOR IT NOT SCROLL THE LIST (CP)

 ------------------------

ALPHABETIC LIST

Adorned Collection

An adorned collection is a collection in which the words in each work in the corpus have been adorned with morphological information such as part of speech and lemma.

Adornment

Adornment is the process of adding information such as morphological information to texts. We use the term "adornment" in preference to terms such as "annotation" or "tagging" which carry too many alternative and confusing meanings. Adornment harkens back to the medieval sense of manuscript adornment or illumination performed by monks - attaching pictures and marginal comments to texts.

Affix

An affix is a prefix or suffix which can be added to a morpheme or word to modify its meaning.

Attribute (in machine learning terms only)

An attribute in machine learning terms is a property of an object which may be used to determine its classification. For example, one attribute of a literary work is its genre: play, novel, short story, etc.

Bayes's Rule

Bayes's rule defines the conditional probability for two events A and B as follows:

Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)
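
As a worked illustration of the rule, the sketch below plugs in made-up probabilities (they are assumptions chosen for easy arithmetic, not MONK results). Event A is "a part is rated sentimental" and event B is "a part contains the word tears"; the function name bayes is likewise only illustrative.

```python
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """Return Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)."""
    if p_b == 0:
        raise ValueError("Pr(B) must be non-zero")
    return p_b_given_a * p_a / p_b


# Invented example values.
p_a = Fraction(1, 5)          # Pr(A): one part in five is rated sentimental
p_b_given_a = Fraction(1, 2)  # Pr(B | A): half of those contain "tears"
p_b = Fraction(1, 4)          # Pr(B): a quarter of all parts contain "tears"

posterior = bayes(p_b_given_a, p_a, p_b)
print(posterior)
```

With these numbers, seeing the word raises the probability that a part is sentimental from 1/5 to 2/5.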

Bigram

A bigram is an ordered sequence of two adjacent words, characters, or morphological adornments.

Bound Morpheme

A bound morpheme is a prefix or suffix which is not a word but which can be attached to a free morpheme to modify its meaning. For example, the bound morpheme "un" may be attached to the free morpheme "known" to form the new morpheme/word "unknown."

Part

A part (formerly called a chunk) is a segment of a work residing in a collection. A part consists of an ordered series of words and associated morphological information, with a label. A part may be treated as a bag of words or ngrams for data analysis and navigation.

Collocate

Words which appear near each other in a text more frequently than we would expect by chance are called collocates. Collocates may be ngrams, but may also consist of multiple words with gaps between one or more of the words.

Component

A component is a bundle of services. A component knows how to render messages.

Collection

A Collection is a set of works.

Data Herding

Data herding is the process of acquiring, combining, editing, normalizing, and warehousing texts so they can be used for further analysis.

Datastore

A datastore means a query-able data source.

Document Coordinate System

A document coordinate system assigns a numeric vector of coordinate values to the position of each token in a document. A typical coordinate value might consist of a pair of line and column values based upon the printed form of the text, or a character offset and length pair based upon the digitized text.
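
A minimal sketch of such a coordinate system follows. It assumes whitespace-delimited tokens and uses the line breaks of the digitized text as a stand-in for printed lines; the names TokenCoordinate and token_coordinates are hypothetical and do not describe the MorphAdorner implementation.

```python
import bisect
import re
from typing import NamedTuple


class TokenCoordinate(NamedTuple):
    token: str
    offset: int   # 0-based character offset in the digitized text
    length: int   # number of characters in the token
    line: int     # 1-based line number
    column: int   # 1-based column within the line


def token_coordinates(text):
    """Return both coordinate pairs for every whitespace-delimited token."""
    # Character offsets at which each line begins.
    line_starts = [0] + [m.end() for m in re.finditer(r"\n", text)]
    coords = []
    for m in re.finditer(r"\S+", text):
        offset = m.start()
        # The number of line starts at or before the offset is the line number.
        line = bisect.bisect_right(line_starts, offset)
        column = offset - line_starts[line - 1] + 1
        coords.append(
            TokenCoordinate(m.group(), offset, len(m.group()), line, column)
        )
    return coords


coords = token_coordinates("To be\nor not")
for c in coords:
    print(c)
```

Either pair locates a token uniquely; the offset and length pair is tied to the digitized file, while the line and column pair is closer to what a reader sees.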

Edit Distance

The edit distance between two strings of characters is the minimum number of operations required to transform one of them into the other. The most commonly used transformation operations are character insertion, character deletion, and character replacement.
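
The definition above, restricted to the three operations named (this variant is usually called Levenshtein distance), can be sketched with a standard dynamic-programming table. The function name edit_distance is illustrative only.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("loue", "love"))
print(edit_distance("kitten", "sitting"))
```

A small distance between a variant spelling and a standard form (such as "loue" and "love") is one signal that could be used when standardizing spellings.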

Feature

TO ADD

Free Morpheme

A free morpheme is the basic or root form of a word. Bound morphemes can be attached to modify the meaning.

Hard tag

A hard tag is an SGML, HTML, or XML tag which starts a new text segment but does not interrupt the reading sequence of a text. Examples of hard tags include <div> and <p>.

Hidden Markov Model

A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters. The problem is to find the unknown parameters using values of the observable model parameters.

HMM

Abbreviation for hidden Markov model.

Interface

Interface means user interface(s).

Jump tag

A jump tag is an SGML, HTML, or XML tag which interrupts the reading sequence of a text and starts a new text segment. Examples of jump tags include <note> and <speaker>.

Keyword Extraction

Keyword extraction extracts "interesting" phrases which characterize a text.

Language Recognition

Language recognition attempts to determine the language(s) in which a text is written. Literary texts are generally composed in one principal language with possible inclusions of short passages (letters, quotations) from other languages. It is helpful to categorize texts by principal language and most prominent secondary language, if any. We can use statistical methods based upon character ngrams and rank order statistics to determine the principal language of a text and list possible secondary languages.

Lemma

The lemma form or lexical root of an inflected spelling is the base form or head word form you would find in a dictionary. A lemma can also refer to the set of lexemes with the same lexical root, the same major word class, and the same word-sense.

Lemmatization

Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or head word form you would find in a dictionary.

Lexeme

A lexeme is the combination of the lemma form of a spelling along with its word class (noun, verb, etc.).

Lexicon

A lexicon is a collection of words and their associated morphological information as used in a corpus.

Machine Learning

Machine learning occurs when a computer program modifies itself or "learns" so that subsequent executions with the same input result in a different and hopefully more accurate output. Machine learning methods may be supervised, i.e., using training data, or unsupervised, without using training data.

Markov Process

A Markov process is a discrete state random process in which the conditional probability distribution of the future states of the process depends only upon the present state and not on any past states.

Message

A message is a query or the result of a query. ?????????

Middleware (OR PROXY???)

Middleware means the stuff between the interface and the datastore(s).

MorphAdorn

MorphAdorn used as a verb is a Monk neologism which means "to adorn a text using MorphAdorner."

MorphAdorner

MorphAdorner is a suite of Java programs which performs morphological adornment of words in a text. A high-level description of MorphAdorner's capabilities appears at http://apps.lis.uiuc.edu/wiki/display/MONK/About+MorphAdorner.

Morpheme

A morpheme is a minimal grammatical unit of a language. A morpheme consists of a word or meaningful part of a word that cannot be divided into smaller independent grammatical units.

Multiword Unit

A multiword unit is a special type of collocate in which the component words comprise a meaningful phrase. ???????

Named Entity

A named entity is a multiword unit consisting of a type of name such as a personal name, corporate name, place name, or date.

Ngram

An ngram is an ordered sequence of n adjacent words, characters, or morphological adornments.
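
The definition can be sketched in a few lines. The helper name ngrams is illustrative; it works on any sliceable sequence, so the same function yields word ngrams from a token list and character ngrams from a string.

```python
def ngrams(items, n):
    """Return all ordered runs of n adjacent items as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "to be or not to be".split()
word_bigrams = ngrams(words, 2)    # bigrams: n = 2
word_trigrams = ngrams(words, 3)   # trigrams: n = 3
char_bigrams = ngrams("monk", 2)   # character bigrams

print(word_bigrams)
print(char_bigrams)
```

A sequence shorter than n simply produces no ngrams.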

NUPOS

NUPOS is a part of speech tag set devised by Martin Mueller to allow part of speech tagging of English texts from all periods as well as texts in classical languages. Further information about NUPOS appears in Morphology and NUPOS.

Part of Speech

The part of speech is the role a word performs in a sentence. A simple list of the parts of speech for English includes adjective, adverb, conjunction, noun, preposition, pronoun, and verb. For computational purposes, however, each of these major word classes is usually subdivided to reflect more granular syntactic and morphological structure.

Part of Speech Tagging

Part of speech tagging adorns or "tags" words in a text with each word's corresponding part of speech. Part of speech tagging relies both on the meaning of the word and its positional relationship with adjacent words.

Phone

A phone is an acoustic pattern which speakers of a particular natural language consider distinguishable and linguistically important. Distinct phones in one language may be grouped together and treated as the same sound in another language.

Phoneme

A phoneme is a group of phones considered to be the same sound by speakers of a specific natural language. One or more phonemes combine to form a morpheme.

Prefix

A prefix consists of characters comprising one or more bound morphemes which can be added to the front of a word to modify its meaning.

Pronoun Coreference Resolution

Pronoun coreference resolution matches pronouns with the nouns to which they refer. Some pronouns may not actually refer to a specific noun. For example, in the sentence "It is not clear how to proceed" the initial pronoun "It" does not refer to any specific noun.

Pseudo-bigram

A pseudo-bigram generalizes the computation of bigram statistical measures to ngrams longer than two words by splitting the original multiword units into two groups of words, each treated as a single "word".

Sentence Splitting

Sentence splitting assembles a tokenized text into sentences. Recognizing sentence boundaries is a difficult task for a computer and generally requires a combination of rules and statistical methods.

Sentiment Assignment ?????

Service

A service is a list of messages that serve a particular component.

Soft tag

A soft tag is an SGML, HTML, or XML tag which does not interrupt the reading sequence of a text and does not start a new text segment. Examples of soft tags include <hi> and <em>.

Spelling

The spelling is the orthographic representation of a spoken word. Words may have more than one spelling, particularly in texts dating from earlier periods when spelling was not standardized.

Spelling Standardization

Spelling standardization is the mapping of variant, often archaic, spellings to standard modern forms.

Stemming

Stemming removes affixes from a spelling. The resulting stem is not necessarily a proper lexeme. Stemming offers a simpler alternative to lemmatization. Stemming can be useful in information retrieval applications, but is much less useful in literary applications. Popular stemmers include Martin Porter's stemmer and the Lancaster (Paice-Husk) stemmer.

String similarity

String similarity is a measure of how similar two strings of characters are. A similarity of 0.0 indicates two strings are completely different, while a similarity of 1.0 indicates two strings are identical. Dozens of different string similarity measures have been proposed.
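
As one example among the many possible measures, the sketch below uses the ratio computed by SequenceMatcher in Python's standard difflib module, which falls in the 0.0 to 1.0 range described above. The choice of this particular measure and the wrapper name similarity are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity in [0.0, 1.0] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


print(similarity("monk", "monk"))   # identical strings
print(similarity("abc", "xyz"))     # no characters in common
print(similarity("loue", "love"))   # variant spelling, partly matching
```

Different measures rank pairs differently, so the right choice depends on the task (for example, matching variant spellings versus matching personal names).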

Suffix

A suffix consists of characters comprising one or more bound morphemes which can be added to the end of a word to modify its meaning.

Supervised Learning

Supervised learning is a machine learning technique which predicts the value of a given function for any valid input after having been presented with training examples (i.e. pairs of input and correct output).

Tagged Collection

See adorned collection.

Text Encoding Initiative

The Text Encoding Initiative (TEI) Guidelines "are an international and interdisciplinary standard that enables libraries, museums, publishers, and individual scholars to represent a variety of literary and linguistic texts for online research, teaching, and preservation." More information may be found at the official Text Encoding Initiative site.

TEI

Abbreviation for Text Encoding Initiative.

TEISimple

TEISimple is a literary DTD created by Martin Mueller to enable the use of a common XML DTD across all texts to be included in Monk. A fuller description may be found at TEISimple A useful May Have?.

Trigram

A trigram is an ordered sequence of three adjacent words, characters, or morphological adornments.

Unsupervised Learning

Unsupervised learning is a machine learning method which fits a model to observed data without benefit of training data.

Viterbi Algorithm

The Viterbi algorithm allows a space containing an apparently exponential number of points to be searched in polynomial time. The Viterbi algorithm is frequently used in hidden Markov model statistical part of speech tagging applications to reduce the time complexity of searches for the best tags for a sequence of spellings in a sentence.
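
A minimal sketch of the algorithm for a toy hidden Markov model tagger is given below. The two-tag tag set and all probabilities are invented for illustration and are unrelated to NUPOS or to MorphAdorner's actual tagger. For a sentence of T words and S possible tags there are S^T tag sequences, but the table below is filled in roughly T * S * S steps.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    if not observations:
        return 1.0, []
    # Each column maps a state to (best probability of any path ending in
    # that state, the previous state on that path).
    first = observations[0]
    table = [{s: (start_p[s] * emit_p[s].get(first, 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (table[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        table.append(column)
    # Pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: table[-1][s][0])
    prob = table[-1][last][0]
    path = [last]
    for column in reversed(table[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path


# Toy model with made-up probabilities.
states = ("noun", "verb")
start_p = {"noun": 0.6, "verb": 0.4}
trans_p = {
    "noun": {"noun": 0.3, "verb": 0.7},
    "verb": {"noun": 0.8, "verb": 0.2},
}
emit_p = {
    "noun": {"time": 0.5, "flies": 0.2, "arrows": 0.3},
    "verb": {"time": 0.1, "flies": 0.6, "like": 0.3},
}

prob, tags = viterbi(["time", "flies"], states, start_p, trans_p, emit_p)
print(tags, prob)
```

A practical tagger would work with log probabilities to avoid numeric underflow on long sentences; plain products are kept here for readability.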

Word

A word is the basic unit of a language. Words are composed of morphemes.

Word Sense Disambiguation

Word sense disambiguation is the process of distinguishing different meanings of the same word in different textual contexts. For example, a "bank" can be either a financial institution or a geographic location next to a river.

Word Tokenization

Word tokenization splits a text into words, whitespace, and punctuation.
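
A deliberately naive sketch of this split, using a single regular expression, is shown below. The pattern and the function name tokenize are assumptions made for illustration. Real tokenizers need many more rules: this one, for instance, breaks "don't" at the apostrophe.

```python
import re

# One alternative per token kind; every character matches exactly one of them.
TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<space>\s+)|(?P<punct>[^\w\s])")


def tokenize(text):
    """Split text into (kind, token) pairs without dropping any characters."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


tokens = tokenize("Hello, world!")
print(tokens)
```

Because whitespace is kept as tokens, joining the token texts reproduces the input exactly, which makes it possible to map tokens back to positions in the original text.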

Work

A work is a single text which is a member of a Collection. Each work consists of one or more text segments called parts.

Part

See the earlier Part entry (parts were formerly called chunks).

Document generated by Confluence on Apr 19, 2009 15:04