This page last changed on Feb 20, 2008 by martinmueller@northwestern.edu.

What structural data is captured for works?

Comment by MM: This is a difficult section for me to understand. I think we will be better off if we start at a non-technical level and work with a model of a text that moves up from words through sentences and intermediate hierarchies to the top level of the document. I have tried to do that in the entry "Sentence and document bags".

AK Comments

Structural data for works will include

  1. Counts of items (line breaks, paragraphs, sentences, letters, sections, ...), i.e. the discrete, self-contained units that are used for navigation. This is collection specific.

    Bill Asks: Amit's Comments are in bold
    # So for each chunk there would be a count of line breaks (<lb> or \n or both?). *<lb> sure. I don't know if \n adds anything, just because in markup \n may coincide with the manuscript.*
    # sentences - as in source text or provided by tokenization? *Both, I would say. But I think in the collections where sentences are marked up, they should take precedence.*
    # letters (<div type="letter"> or something else? - might this be a chunk itself?) *yes*
    # section - what is a section? <seg>, <div ...>, ...? *section is div[@type="section"]*
    # Are chunks hierarchical? *Yes they are. A section type chunk may have chapters and each chapter may have paragraphs. But the system should be able to describe arbitrary unions and intersections of other structural units, such as the last 5 paragraphs from div1[1] and the first 5 paragraphs from div1[2], as chunks. The user could select these units in the User Interface. The lowest of those units should be sentences. In other words there are two kinds of chunks: system-identified ones and the ones that users can arbitrarily create.*
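A minimal sketch of the second kind of chunk (user-created), assuming a TEI-like document parsed with Python's standard xml.etree.ElementTree. The sample document and the helper name `user_chunk` are hypothetical and not part of any existing API; the sketch only shows that such a chunk can be expressed as a list of existing units drawn from more than one div.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two div1 sections with six paragraphs each.
SAMPLE = "<text>" + "".join(
    "<div1 type='section'>"
    + "".join(f"<p>s{d} p{i}</p>" for i in range(1, 7))
    + "</div1>"
    for d in (1, 2)
) + "</text>"


def user_chunk(root, n=5):
    """Return the last n paragraphs of div1[1] plus the first n of div1[2]."""
    first, second = root.findall("div1")[:2]
    return first.findall("p")[-n:] + second.findall("p")[:n]


root = ET.fromstring(SAMPLE)
chunk = user_chunk(root)
texts = [p.text for p in chunk]
print(texts)
```

The chunk here is just an ordered list of paragraph elements, so it does not need to correspond to any single element in the source markup.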
AK Comments
  1. The structure itself: where it starts and ends and functions to retrieve those sections.

    Bill Asks: Amit's Comments are in bold
    # How is its location, start or end indicated? Character/byte offsets in document, xml:id, ...? Does this matter or do we just need a way to obtain the chunk by identifier?
    *You are right, it should not matter whether we store character offsets or xml:id as long as the API provides methods to
    retrieve the chunk.*
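As a sketch of that point, the two hypothetical classes below locate chunks differently (character offsets versus xml:id) but expose the same `get_chunk` method, so callers never see how the location is stored. The class names, the method name and the sample data are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


class OffsetStore:
    """Locates chunks by character offsets into the plain text."""

    def __init__(self, text, offsets):
        self.text = text
        self.offsets = offsets  # chunk id -> (start, end)

    def get_chunk(self, chunk_id):
        start, end = self.offsets[chunk_id]
        return self.text[start:end]


class XmlIdStore:
    """Locates chunks by xml:id attributes in the marked-up source."""

    def __init__(self, xml_source):
        root = ET.fromstring(xml_source)
        self.by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}

    def get_chunk(self, chunk_id):
        return "".join(self.by_id[chunk_id].itertext())


plain = "First paragraph. Second paragraph."
offset_store = OffsetStore(plain, {"p1": (0, 16), "p2": (17, 34)})
id_store = XmlIdStore(
    '<div><p xml:id="p1">First paragraph.</p> <p xml:id="p2">Second paragraph.</p></div>'
)
for store in (offset_store, id_store):
    print(store.get_chunk("p2"))
```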
AK Comments
  1. Counts of items such as words, lemmas and n-grams for each chunk.

    Bill Asks: Amit's Comments are in bold
    Should n-gram counts be kept as part of the chunk or separately reference their chunk(s)? I ask because I imagine n-grams to be variously defined in ad-hoc ways and so variously available - we don't know yet what kind of patterns of what length we will automatically pre-gather, if any.
    *We can pre-compute and store certain n-gram pattern counts; the rest can be calculated at runtime. n-grams of size 5 would be a great candidate to pre-compute, just words, not the POS n-grams. We can decide about this after carrying out actual experiments to see whether the n-grams can be computed effectively at runtime or not.*
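The split between stored and on-demand counts could look roughly like the sketch below. It assumes a chunk is already tokenized into a list of words; the function names and the sample tokens are hypothetical, and whether runtime computation is fast enough is exactly what the experiments mentioned above would have to show.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = "to be or not to be that is the question".split()

# Word 5-grams are computed once and stored with the chunk.
precomputed = {5: ngram_counts(tokens, 5)}


def counts_for(n):
    """Use the stored counts when available, otherwise compute on demand."""
    if n not in precomputed:
        return ngram_counts(tokens, n)
    return precomputed[n]


print(counts_for(2)[("to", "be")])
print(sum(counts_for(5).values()))
```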
AK Comments
  1. Types of chunks: the //div/@type attribute in TEI or //article/@class in DocBook

    Bill Asks: Amit's Comments are in bold
    So all <div>s are chunks? Then chunks are hierarchical? Are we expecting DocBook documents?
    *Potentially all divs are chunks. So are all paragraphs. What counts as a chunk can be declared in stand-off markup. DocBook was an example, but our vocabulary for describing chunks should be schema independent.*

How are 'chunks' obtained or declared?

AK Comments
  1. In my view, chunks should be declared in a stand-off markup format. The vocabulary should allow for describing structures
    for a group of documents that might form a sub-collection, or documents of the same kind by virtue of their markup structure. I am biased towards the nora-chunk approach to satisfy this requirement.


    Bill Asks: Amit's Comments are in bold
    # So would chapters, paragraphs, ... of a text be declared in a separate file/place? Or are you referring to document/collection inclusions like in the Nora Chunk File? *nora chunk file, yes.*

    # Vocabulary used where/how? Are you referring to the @type vocabulary for <div>s? *yes, div/@type for TEI*

    # Meaning that a given set of documents are declared to conform to the same schema and so can be processed the same way with the same assumptions about tagging? *exactly*
AK Comments
  1. Chunks are obtained from the source documents. A collection could support some default chunk types, like work/chapters/paragraphs
    etc.; this would be collection specific. Chunks in theory can span across the markup, and a use case might need the user to be able
    to create chunks that break across structural units, like the last 5 sentences of one paragraph and the first 5 of the subsequent paragraph.
  2. Sentence chunks will be created by the tokenizer, and if sentences are tagged using <seg> and other tags in TEI they should
    be included as chunk entities.


Bill Asks: Amit's Comments are in bold
# Do we need sentence chunks? Are we interested in the count of words, lemmas, ... in each sentence? I simply ask - I don't know. *Yes, I would think we would need word, lemma, ... counts for each sentence, just because these chunks will form the unit of analysis for data mining, just like paragraphs or chapters would.*
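A rough sketch of how sentence chunks might be produced, assuming that sentences marked up with <seg> take precedence and that a tokenizer is used only when no such markup exists. The regular-expression splitter below is a deliberately naive stand-in for a real tokenizer, and the function name `sentence_chunks` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def sentence_chunks(paragraph, tag="seg"):
    """Prefer sentences tagged in the source; otherwise fall back to a naive split."""
    marked = paragraph.findall(f".//{tag}")
    if marked:
        return ["".join(s.itertext()).strip() for s in marked]
    text = "".join(paragraph.itertext()).strip()
    # Naive stand-in for a tokenizer: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


tagged = ET.fromstring("<p><seg>Dr. Smith arrived.</seg> <seg>He sat.</seg></p>")
untagged = ET.fromstring("<p>It rained. We stayed in.</p>")

tagged_sentences = sentence_chunks(tagged)
untagged_sentences = sentence_chunks(untagged)

# Per-sentence word counts, the unit of analysis discussed above.
word_counts = [len(s.split()) for s in untagged_sentences]
print(tagged_sentences, untagged_sentences, word_counts)
```

The tagged example also shows why markup should win: the naive splitter would break "Dr. Smith arrived." after "Dr.", whereas the <seg> boundaries keep it whole.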

What are a chunk's properties and what are they used for?

They're useful for (from JLN):

1. As a navigation aid for reading. The "table of contents" for a work, in other words.

2. As a unit of text presentation. In WordHoard the chunk ("work part") is used this way. For example, in plays, a "scene" is the lowest-level chunk, and that's our unit of text display - we display one scene at a time for plays.

3. As a target for annotation, adornment, marking, or whatever other term might be appropriate here for the act of a user "attaching something" to text. Yes, I think so, but I think this is a broader concept. Our users should be permitted to attach comments and attributes and marks to any range of text, not just whatever "chunks" might be pre-defined. This would include but not be limited to individual words, sentences, paragraphs, stanzas, speeches, etc.

4. As a unit of analysis. E.g., one example we've sometimes used in WordHoard is comparing the prologues of the Canterbury Tales against the background

They have at least these properties (from Amit):

type - can be of chapter/work/poem/line/sentence or anything.
title or abbreviated display (say first 25 words for a paragraph and div/head/title for work)
count of features
can return the list of feature Instances (list of words for example).
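These properties could be sketched as a small class. This is only an illustration under assumptions: the class name `Chunk`, its field names, and the choice of plain word tokens as the feature are hypothetical, and a real chunk would presumably carry lemmas and other features as well.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    type: str          # e.g. "work", "chapter", "paragraph", "sentence"
    words: List[str]   # feature instances, here plain word tokens
    head: str = ""     # heading taken from the markup, if any

    def title(self, max_words=25):
        """Heading if present, otherwise the first max_words words."""
        return self.head or " ".join(self.words[:max_words])

    def feature_count(self):
        """Total number of feature instances in the chunk."""
        return len(self.words)

    def feature_counts(self):
        """Count of each distinct feature value."""
        return Counter(self.words)


para = Chunk("paragraph", "It was the best of times it was the worst of times".split())
work = Chunk("work", ["..."], head="A Tale of Two Cities")
print(para.title(4), para.feature_count(), para.feature_counts()["was"])
print(work.title())
```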

Catherine referred to varied chunk behaviors discussed early in Nora. What are these?

There has been side discussion about how chunks are obtained - declared in chunk file, marked in a conforming text, ... I suggest we bracket that discussion and take whatever profits we've obtained in chunk definition above for now.

How do we maintain a chunk type vocabulary or vocabularies in general?

Amit's Comments
  1. Chunk types will no doubt be collection specific, but we should have a notion of equivalent chunks. For example,
    the Documenting the American South collection has several documents that have the following structure:
    Work.1:
    /div1[@type=section]
      /div1/div2[@type=chapter]
      /div1/div2[@type=letter]
      ...
    ...
    
    Work.2
    /div1[@type=section]
    /div1[@type=section]
    ...
    
    Work.3
    /div1[@type=chapter]
    /div1[@type=chapter]
    ...
    
    Work.1, Work.2 and Work.3
    all have p tags in the divs.
    
    

If we decide that work, chapter, section and paragraph represent the chunk type vocabulary for this collection, then in our stand-off markup
we promote @type="section" to the level of chapter in Work.2 and @type="letter" to the level of chapter in Work.1. The
details are in the nora-chunk.
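The promotion described above can be pictured as a lookup table kept outside the documents. The sketch below is an assumption about how such stand-off equivalences might be represented, not the nora-chunk format itself; the table name, the work identifiers and the function name are hypothetical.

```python
# Hypothetical equivalence table: (work id, @type in the source) -> collection vocabulary.
PROMOTIONS = {
    ("work.1", "letter"): "chapter",
    ("work.2", "section"): "chapter",
}


def chunk_type(work_id, source_type):
    """Map a source @type to the collection vocabulary; unlisted types pass through."""
    return PROMOTIONS.get((work_id, source_type), source_type)


for work_id, source_type in [
    ("work.1", "section"),
    ("work.1", "letter"),
    ("work.2", "section"),
    ("work.3", "chapter"),
]:
    print(work_id, source_type, "->", chunk_type(work_id, source_type))
```

Keeping the table per work means the same source @type can map differently in different documents, as with "section" in Work.1 and Work.2.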

Document generated by Confluence on Apr 19, 2009 15:04