This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

This note discusses the Monk datastore ingest process and starts to flesh out some of the technical details of this process and our plans for the development of a new ingest process. These notes are preliminary and in this first draft they are based only on my own thoughts, observations, and work to date. I hope that we can work together to turn them into formal specifications as time goes on.

The Current Ingest Process

The current Monk datastore is based on a subset of the WordHoard data. The data ingest process is:

  1. The input is a collection of unmodified TEI XML source files for the texts to be ingested.
  2. The TEI source files are processed by the MorphAdorner tool to tokenize the text and adorn it with NUPOS morphological tagging data.
  3. The MorphAdorned files are converted to WordHoard XML input format using a tool named ConvertMorph.
  4. The WordHoard input files are ingested into a WordHoard datastore using the old WordHoard build tools.
  5. The WordHoard datastore is converted to a Monk datastore using a tool named ConvertWordHoard.
  6. The output is a MySQL relational database which contains the data for the Monk datastore.

The MorphAdorner and ConvertWordHoard tools are part of the Monk codebase. Source code is available in the Monk SVN repository.

The ConvertMorph tool and the old WordHoard build tools are part of the WordHoard codebase. Source code is available at the WordHoard web site at http://wordhoard.northwestern.edu/userman/dev-files.html. Detailed documentation on file formats is also available at the same web site.

Problems with the Current Ingest Process

This current ingest process made it possible to develop the first version of the Monk datastore much more quickly than would have been possible otherwise, by temporarily making use of code written for a different project. It is not, however, a path to the future. With this ingest process, in order to add the many new features we want in the Monk datastore, we would have to implement them in both WordHoard and Monk. This is double the work it would take to implement them in Monk alone.

Thus, in order to move forward, we need a new Monk data ingest process which does not involve WordHoard.

The New Ingest Process

At our all-hands meeting in Maryland on Dec. 14 and 15, we started discussing the new Monk ingest process. This new process will be a collaborative effort of the TEI-Analytics team at Nebraska and the datastore team at Northwestern. The new ingest process will be:

  1. The input is a collection of unmodified TEI XML source files for the texts to be ingested.
  2. The TEI source files are processed by TEI-Analytics to normalize them to a common, interoperable, simplified, well-defined format called TEI-A. This process includes using MorphAdorner to tokenize the text and adorn it with NUPOS morphological tagging data. (Nebraska).
  3. The TEI-A files are processed by a new BuildMonk tool to construct the Monk datastore. (Northwestern).
  4. The output is a MySQL relational database which contains the data for the Monk datastore.

The exact division of labor between steps 2 and 3 and the details of the TEI-A format that defines the interface between them in this new process have not yet been fully specified, but a broad general statement of principle was enunciated by Steve Ramsey in his message to the Monk mailing list on Dec. 18:

The data store's job is to store data, not to determine
what the data is in the first place. Having this stuff represented in
the TEI-A file also provides a hedge against radical changes to the
datastore (or radical changes of datastore technology). Finally, it
allows these texts to be useful in contexts beyond MONK.

Martin Mueller also proposed a general principle for this division of labor in his message on Dec. 20:

More broadly speaking, the goal of the curatorial process is to
produce a text that parses under TEI-Analytics and can be ingested
into the datastore without further accommodations.

My understanding of these principles is that TEI-Analytics will take care of all the details that involve curator interaction, and will produce TEI-A files that contain all the data needed by BuildMonk to fully populate the data for all the objects and their attributes in the Monk datastore, without requiring any secondary additional input to BuildMonk by a curator or from any other source.

Is my understanding correct? Are we all in agreement about this plan? If not, we need to discuss these basic principles in more detail.

In Maryland, we scheduled this work for January and February, and promised the delivery of an initial implementation by the end of February. At that time, we should have the first version of a new ingest process that is general, efficient, extensible, and does not involve any legacy WordHoard code or processes. This will lay the foundation for future work adding the new features to the datastore that go beyond the current minimal set of features that is based on a subset of the old WordHoard data.

A note on terminology: In this note, by TEI-A I mean the whole package of deliverables from the TEI-Analytics team at Nebraska to the datastore team at Northwestern. This is not necessarily just one "TEI-A" file per work. We probably need a new name for the whole package to distinguish it from the individual work files, but for now I'll just call the whole thing "TEI-A".

The Data

In this section we review the data that is in the current initial implementation of the Monk datastore. The purpose is to start a detailed discussion of the issues of how each data item is ingested. We follow the presentation of the details in the Monk datastore overview documentation available at the datastore test web site. In particular, we assume familiarity with the documentation on "Core Objects and Attributes" at http://scribe.at.northwestern.edu:8090/monk/core-objects.html.

All of this is subject to discussion and change if needed, of course. This section is intended to be just a first draft of an attempt to begin working through all the details.

Note that I only include the data that is in the current version of the datastore, not the additional data that we will need to implement all of the features on our wish list in subsequent versions of the datastore. Just this core basic data is hard enough, and I'd prefer to discuss the extensions needed to implement new features elsewhere.

Corpora

Each corpus has the following attributes:

tag: A short unique string identifier for the corpus. Supplied by the curator and provided in a well-defined location and format in TEI-A to BuildMonk.

title: The corpus title. Supplied by the curator and provided in a well-defined location and format in TEI-A to BuildMonk.

The remaining attributes (numWorks, works, numAuthors, authors, numWords, and numWordParts) are derived data computed by BuildMonk.

Works

Each work has the following attributes:

corpus: The corpus to which the work belongs. Supplied by the curator and provided in a well-defined location and format in TEI-A to BuildMonk.

pubDateStart and pubDateEnd: Supplied by the original unmodified TEI source files and/or by a curator. These attributes may have the value -1 to indicate "missing", but of course we want to have publication dates when it is at all possible to supply them. Provided in a well-defined location and format in TEI-A to BuildMonk.

numAuthors and authors: Supplied by the original unmodified TEI source files and/or by a curator. Provided in a well-defined location and format in TEI-A to BuildMonk.

Works are also work parts. In addition to the attributes listed above, each work also has all the attributes listed below for work parts.

Work Parts

Each work part has the following attributes:

tag: A short unique string identifier for the work part. For works, this should probably be derived using some algorithm from the identifying information contained in the original TEI source file header sections, or perhaps from the file names of those source files. In any case, work tags are supplied in a well-defined location and format in TEI-A to BuildMonk. For work parts, TEI-Analytics may wish to supply this data, or BuildMonk could generate it, most likely by appending child index numbers at each level to work tags. E.g., if "ham" is the tag for "Hamlet", the tag for Act 1, Scene 3 might be generated as "ham-1-3".
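The index-appending scheme for work part tags can be sketched as follows. This is only an illustration of the idea, not a specification: the WorkPart class and the assign_tags function are hypothetical names, and the real BuildMonk data structures may look quite different.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkPart:
    """A node in the tree of work parts (hypothetical structure)."""
    title: str
    children: List["WorkPart"] = field(default_factory=list)
    tag: str = ""


def assign_tags(part: WorkPart, tag: str) -> None:
    """Give `part` the tag `tag`, then tag each child "<tag>-<1-based index>"."""
    part.tag = tag
    for index, child in enumerate(part.children, start=1):
        assign_tags(child, f"{tag}-{index}")


hamlet = WorkPart("Hamlet", [
    WorkPart("Act 1", [WorkPart("Scene 1"), WorkPart("Scene 2"), WorkPart("Scene 3")]),
    WorkPart("Act 2", [WorkPart("Scene 1")]),
])
assign_tags(hamlet, "ham")
print(hamlet.children[0].children[2].tag)  # ham-1-3
```

One property of this scheme worth keeping in mind is that the generated tags depend on the order of the children, so inserting or removing a work part changes the tags of its later siblings.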

title: The title of the work part. Supplied by the original unmodified TEI source files and/or by the curator. Provided in a well-defined location and format in TEI-A to BuildMonk.

type: The type attribute on the "div" element for the work part. TEI-Analytics will supply additional derived data based on this type, including a "mapped type" from some small fixed vocabulary, and a boolean attribute which indicates whether the work part contains "paratext" or "non-paratext". All three pieces of data will be provided in a well-defined location and format in TEI-A to BuildMonk.

htmlText: The human-readable text for the work part, formatted as HTML. The TEI-A markup for text must be well-defined in such a way that BuildMonk can use a fixed set of simple rules to generate the human-readable text.

All of the other many attributes of work parts, which we will not list in detail here, are derived data computed by BuildMonk.

Authors

Each author has the following attributes:

tag: A short unique string identifier for the author. In the current datastore, this is the same as the author's full name. Is this good enough? One problem is distinct authors who may happen to have the same name, e.g., two "Doe, John" authors. In this case we'd have the same name for both authors, but we'd need unique tags. In any case, author tags are supplied in a well-defined location and format in TEI-A to BuildMonk.

name: The author's name, in the format "last name, first name". Supplied by the original unmodified TEI source files and/or by a curator. Provided in a well-defined location and format in TEI-A to BuildMonk.

birthYear and deathYear: Supplied by the original unmodified TEI source files and/or by a curator. These attributes may have the value -1 to indicate "unknown", but of course we want to have these dates when it is at all possible to supply them. Provided in a well-defined location and format in TEI-A to BuildMonk.

The remaining attributes, which we will not list in detail here, are derived data computed by BuildMonk.

Note that a work may have multiple authors, and a single author may have multiple works in multiple corpora, at least in theory.
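To make the supplied attributes concrete, here is a rough sketch of authors and works as plain records. The attribute names come from the lists above; the class layout, the MISSING_YEAR constant, and the example tags "doe-john-1" and "doe-john-2" are assumptions made up for illustration and are not part of the datastore.

```python
from dataclasses import dataclass, field
from typing import List

MISSING_YEAR = -1  # the value this note uses for "missing" or "unknown" years


@dataclass
class Author:
    tag: str                       # must be unique, even when names collide
    name: str                      # "last name, first name"
    birthYear: int = MISSING_YEAR
    deathYear: int = MISSING_YEAR


@dataclass
class Work:
    tag: str
    title: str
    corpus: str                    # tag of the corpus the work belongs to
    authors: List[Author] = field(default_factory=list)
    pubDateStart: int = MISSING_YEAR
    pubDateEnd: int = MISSING_YEAR

    @property
    def numAuthors(self) -> int:
        # Computed from the author list here purely for convenience.
        return len(self.authors)


# Two distinct authors who share a name need distinct tags (hypothetical tags).
doe_a = Author(tag="doe-john-1", name="Doe, John", birthYear=1800, deathYear=1860)
doe_b = Author(tag="doe-john-2", name="Doe, John")

# A work may have several authors, and an author may appear on several works.
work = Work(tag="0001", title="An Example", corpus="ncf", authors=[doe_a, doe_b])
```

If author tags were simply the full names, doe_a and doe_b could not be told apart, which is the problem raised above.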

NUPOS data

This data includes all of the word-level NUPOS morphological tagging data: lemmas, parts of speech, spellings, word classes, major word classes, and part of speech categories.

This data comes from two sources: The adornments added to the TEI source files by MorphAdorner, and definitions of the internal structure and attributes of the NUPOS part of speech tagset as supplied by Martin Mueller. I do not believe that there are any issues regarding this category of data that need to be addressed by TEI-Analytics.

Counters

See http://scribe.at.northwestern.edu:8090/monk/counts.html.

Counters are derived data computed by BuildMonk. I do not believe that there are any implications for TEI-A beyond the other specs outlined here.

Words

See http://scribe.at.northwestern.edu:8090/monk/words.html.

Given all the other specs outlined here, I believe that BuildMonk has everything it needs to construct and persist the Word objects, and there are no additional implications for TEI-A.

ConvertMorph, NCF, and Stein

At our summer all-hands meeting in Champaign-Urbana, I was tasked with the job of "loading up NCF into WordHoard". The ConvertMorph program mentioned earlier was the result of this work. We also loaded up two novels by Gertrude Stein into WordHoard using the same tool. The result of this work in turn provided the data for the NCF and Stein corpora in the current version of our Monk datastore, using the ingest process described earlier.

ConvertMorph takes as input MorphAdorned TEI source files for works to be ingested into WordHoard and produces as output WordHoard-format XML input files. These files are in turn processed by the old largely unmodified WordHoard build tools to produce a WordHoard datastore.

In our new Monk build process, we face a very similar problem. TEI-Analytics must take as input unmodified TEI source files for works to be ingested into Monk, run those texts through MorphAdorner, normalize their formats, and produce as output TEI-A format XML files, which are in turn read by BuildMonk to produce a Monk datastore.

Our experience at Northwestern in the development of ConvertMorph and the use of that tool to ingest NCF and Stein may be useful to the TEI-Analytics team at Nebraska, even though they almost certainly will prefer to use their own architecture and tools and code to solve their problems. In this section I will outline the approach we took in ConvertMorph and highlight some of the significant problems we had to solve.

There were two major kinds of problems we had to solve. First, the raw data files we were given for NCF and Stein were quite a mess. The data was not consistently encoded, certainly not between the two collections, and not even within the same collection. A good example is publication dates. In NCF, publication dates were encoded in many different formats, and we had to use complex regular expression pattern matching rules to decode them. In Stein, publication dates were not present at all in the TEI files, and we had to look them up on the Internet. We faced similar problems with generating work tags, getting author names and birth and death dates, and even extracting such simple information as work titles. We faced a variety of very difficult technical issues in mapping the structures of "div" and other elements into the tree of work parts. Extracting usable titles for work parts was particularly bothersome. Another issue was the use of idiosyncratic values for "rend" attributes. Determining how to map these values into human-readable text formatting operations was non-trivial. We had no documentation for the NCF conventions, and we still don't know for certain what some of them are supposed to mean!
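To give a feel for the publication date problem, here is a small sketch of the decoding. The four patterns mirror the date patterns in the NCF SIP in Appendix 1, but the code itself is only an approximation written for this note: it uses Python's re module as a stand-in for java.util.regex, it assumes each pattern must match the entire string, and the names DATE_RULES and parse_pub_date are made up.

```python
import re

# Each rule is (pattern, {item: template}). Templates use \1-style group
# references. \u2013 is the en-dash that NCF sometimes uses instead of "-".
DATE_RULES = [
    (r"\[?(\d{4})\]?", {"pubDateStart": r"\1"}),
    (r"(\d{4})[-\u2013](\d{4})", {"pubDateStart": r"\1", "pubDateEnd": r"\2"}),
    (r"(\d\d)(\d\d)[-\u2013](\d\d)", {"pubDateStart": r"\1\2", "pubDateEnd": r"\1\3"}),
    (r"(\d{3})(\d)[-\u2013](\d)", {"pubDateStart": r"\1\2", "pubDateEnd": r"\1\3"}),
]


def parse_pub_date(text):
    """Return the items extracted by the first matching rule, or {} if none match."""
    # Replace runs of line feeds and carriage returns with a single space.
    text = re.sub(r"[\r\n]+", " ", text)
    for pattern, extracts in DATE_RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return {item: int(match.expand(tpl)) for item, tpl in extracts.items()}
    return {}


print(parse_pub_date("[1850]"))   # {'pubDateStart': 1850}
print(parse_pub_date("1794-97"))  # {'pubDateStart': 1794, 'pubDateEnd': 1797}
print(parse_pub_date("1826-7"))   # {'pubDateStart': 1826, 'pubDateEnd': 1827}
```

A date in any other format falls through all four rules and comes back empty, which is the point at which a curator would have to supply the value by hand.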

In short, it was clear that it was impossible to extract all the data we needed in the format we needed, even for very simple information like titles and dates, from the raw TEI data files by themselves as given to us. Too much data was missing, and way too much of the data was inconsistent and ambiguous. Extra information needed to be supplied somehow to get the data we needed.

Our solution was to specify all this extra information in "SIP" files. ("SIP" stands for "Submission Information Package".) We also called these files "rule files" in the code and documentation. To ingest a corpus, in addition to the individual TEI source files, the curator must prepare and supply a SIP file for the corpus. We have one such SIP file for the NCF corpus, and one for the Stein corpus.

The SIP file provides two kinds of additional information to the ConvertMorph tool. First, any data that is simply missing in the TEI files is specified in the SIP. For example, this is how publication dates and the author name are supplied for the Stein corpus. Second, the SIP specifies a set of "rules" that define how to process the TEI source files and all their elements and attributes, in order to resolve ambiguities and inconsistencies.

As an example, in the SIP for NCF, we specify the following set of rules for extracting work part titles from the TEI files:

  • First, try looking for a "head" child of the "div" element. If one is found, use its contents as the work part title, after stripping out any embedded footnotes (yes, this was necessary!) and stripping out formatting elements and other junk.
  • If the "div" element has no "head" child, use the "div" element's "type" attribute with the first character converted to upper case. Many "Dedication" and similar kinds of work parts get their titles using this rule.
  • If there's no "head" child and there's no "type" attribute, as a last resort use the element name with the first character converted to upper case. Quite a few "Trailer" work parts got their titles using this rule.

When we ingested Stein, we found that a different set of rules in the SIP for Stein worked best for getting reasonable work part titles. The SIP for Stein specifies the following rules:

  • First, try using the "type" and "n" attributes on the "div" element to generate titles like "Chapter 2".
  • If that doesn't work, look for a "head" child element as in NCF.
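The two sets of title rules above can be sketched in a few lines. This is an illustration only and not the ConvertMorph code: the function names are invented, the elements are assumed to carry no XML namespace, and footnotes are assumed to be encoded as "note" elements inside the "head".

```python
import xml.etree.ElementTree as ET


def upper_first(s):
    """Convert the first character to upper case, leaving the rest unchanged."""
    return s[:1].upper() + s[1:]


def head_text(elem):
    """Text of the "head" child with footnotes and markup stripped, or None."""
    head = elem.find("head")
    if head is None:
        return None

    def collect(node):
        parts = [node.text or ""]
        for child in node:
            if child.tag != "note":          # skip embedded footnotes (assumed <note>)
                parts.append(collect(child))
            parts.append(child.tail or "")
        return "".join(parts)

    return " ".join(collect(head).split()) or None


def ncf_title(elem):
    """NCF rules: "head" child, then "type" attribute, then element name."""
    return head_text(elem) or upper_first(elem.get("type", "")) or upper_first(elem.tag)


def stein_title(elem):
    """Stein rules: "type" plus "n" attributes, then "head" child."""
    kind, n = elem.get("type"), elem.get("n")
    if kind and n:
        return f"{upper_first(kind)} {n}"
    return head_text(elem)


dedication = ET.fromstring('<div type="dedication"><p>To my mother.</p></div>')
chapter = ET.fromstring(
    '<div type="chapter" n="2"><head>The Journey<note>A footnote.</note></head></div>'
)
print(ncf_title(dedication))  # Dedication
print(ncf_title(chapter))     # The Journey
print(stein_title(chapter))   # Chapter 2
```

The same element yields different titles under the two rule sets, which is why the rules live in the per-corpus SIP file rather than in the tool itself.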

There are many more details of the SIP files we developed for use by ConvertMorph. For the details, see the two appendices below where we present the full SIP files for NCF and Stein, in all their gory detail. They contain lots of comments which discuss all the details. For even more details, see the WordHoard source code for the ConvertMorph tool in the package edu.northwestern.at.wordhoard.tools.cm, which contains about 4,000 lines of Java code in 27 source code files.
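One of those details, described in the comments of the NCF SIP in Appendix 1, is how a single "author" element is split on a separator and each piece is matched against an ordered list of patterns, with the first match winning. The sketch below illustrates that behavior only. The two patterns are simplified stand-ins written for this note, not the actual NCF patterns, and the names AUTHOR_RULES and parse_authors are hypothetical.

```python
import re

# Ordered rules: the first pattern that matches a piece wins.
# Both patterns are illustrative assumptions, not the real NCF rules.
AUTHOR_RULES = [
    # Name, optional comma, then "birth-death" years at the end of the string.
    (r"(.*?),?\s*(\d{4})[-\u2013](\d{4})",
     {"authorName": r"\1", "authorBirthYear": r"\2", "authorDeathYear": r"\3"}),
    # Fallback: the whole string is the name. This must come last, because
    # it matches everything and would otherwise hide the rule above.
    (r"(.*)", {"authorName": r"\1"}),
]


def parse_authors(text, sep=" / "):
    """Split `text` on `sep` and apply the first matching rule to each piece."""
    authors = []
    for piece in text.split(sep):
        for pattern, extracts in AUTHOR_RULES:
            match = re.fullmatch(pattern, piece.strip())
            if match:
                author = {}
                for item, template in extracts.items():
                    value = match.expand(template)
                    author[item] = int(value) if item.endswith("Year") else value
                authors.append(author)
                break  # skip the remaining patterns for this piece
    return authors


for author in parse_authors(
    "Somerville, E. OE. (Edith OEnone), 1858-1949 / Ross, Martin, 1862-1915"
):
    print(author)
```

Swapping the order of the two rules would make every piece match the fallback, so no birth or death years would ever be extracted.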

Appendix 1 - NCF SIP

<?xml version="1.0" encoding="utf-8"?>

<!-- 
    ConvertMorph rules for the 19th Century Fiction collection (NCF). 
    
    These rules define the translation of MorphAdorner XML output files for NCF into WordHoard 
    XML input files for NCF.
    
    We've included quite a few comments in this file about the format of the file in general and 
    what all the many rules do in general, in addition to comments that are specific to the NCF 
    collection. These comments do not add up to complete formal specifications, but they should 
    give you a good idea of what everything means and how it works.
    
    The general idea is to make ConvertMorph fully parameterized and rule-driven, so that
    it can be used to process morphadorned XML files from many different sources and in many
    different formats. Indeed, in an ideal world, these files would not even need to be in
    TEI format or a TEI subset or "simple" format. They could be in almost any format at all,
    as long as it is well-formed XML that has been processed by MorphAdorner. We're still short
    of fully reaching this goal, but we've made progress.
    
    Even within the realm of "TEISimple", there are many variations in the way collections are
    encoded. For an example which is quite different from NCF, see the ConvertMorph rules file
    for the Stein collection.
-->

<ConvertMorphRules>

    <!-- 
        The WordHoard corpus tag for this collection is "ncf".
        
        This element is required. Every WordHoard corpus must have a unique corpus tag, and
        there must be a definition for the specified corpus in the WordHoard "corpora.xml" file.
    -->

    <corpusTag>ncf</corpusTag>
    
    <!--
        Title page rules.
        This element is optional.
        
        This rule contains text to be generated for the WordHoard title pages of each work. 
        There may be any number of <respStmt> elements and at most one <publicationStmt> element, 
        in the format described in the WordHoard manual.
    -->
    
    <titlePageRules>
        <publicationStmt>
            <p>This version of the text is an experimental derivative of
            its version in the Chadwyck-Healey archive of
            Nineteenth-Century Fiction (hereafter NCF) and will be
            improved or replaced in the course of the MONK Project.</p>
            <p>The copyright to this text is owned by ProQuest, and this
            version may be accessed only by members who are affiliated
            with institutions that have purchased the NCF source files
            or by members directly affiliated with the MONK Project, who
            have been granted special permission by ProQuest to use the
            source files for purposes that bear directly on the
            development of the MONK Project.</p> <p>As an experimental
            derivative, this version differs from its source file in the
            following ways:</p> <p>1. The text file has been transformed
            from its original encoding in a Chadwyck-Healey SGML dtd
            into an XML file that parses under a TEI P5 dtd that uses a
            sharply reduced number of elements. This dtd, provisionally
            called teisimple.dtd, employs some minor extensions and
            relaxations of the TEI content model to accommodate a
            variety of text archives, including the Text Creation
            Partnership (TCP), Early American Fiction (EAF), the Wright
            Archive of American Fiction (1851-75), and Documenting the
            American South. Teisimple will evolve with the final release
            of TEI P5, scheduled for October 2007.</p> <p>2. While
            letters are explicitly tagged as such in other fiction
            collections, they are not identified in NCF, unless they
            occur in epistolary novels that consist only of letters. In
            order to enhance the comparability of NCF texts with texts
            in other fiction archives, letters in NCF have been
            explicitly tagged by Martin Mueller.  Nested narratives and
            other inserted documents have also been tagged explicitly.
            In the WordHoard environment, this tagging is not yet
            available for querying.</p> <p>3. The use of the apostrophe
            as an elision marker at the beginning or end of a word is a
            very common feature in poetic, regional or colloquial
            speech, but in digital environments its consistent
            disambiguation from opening or closing single quotations
            marks is a non-trivial task, with considerable consequences
            for tokenization and POS tagging. For various reasons, the
            current version does not distinguish between the apostrophe
            and the single quote as  accurately as it should and
            includes some unnecessary errors in tokenization or POS
            tagging. We know where to look for and correct these errors
            but will be grateful for users to point them out anyhow by
            using the error reporting feature of WordHoard.</p>
        </publicationStmt>
    </titlePageRules>
    
    <!-- We have no file rules for NCF. For an example of file rules, see the Stein rules. -->

    <!-- 
        Header rules for extracting WordHoard bibliographic information for each work. 
        
        These rules use regular expression pattern matching. For details, see
        Sun's javadoc for the class java.util.regex.Pattern.
        
        NCF uses some rather complicated and sometimes inconsistent conventions for encoding
        header information, and we often need to use rather complex patterns to get the
        information we need for WordHoard. The examples given below along with the patterns
        illustrate all the different cases.
        
        These rules extract the following bibliographic information about each work for 
        WordHoard:
        
        workTag = Work tag. Each work in a collection must have a unique tag. For NCF we
        use the numbers as assigned by the C-H encoders.
        
        title = Work title.
        
        pubDateStart = Publication year, if known, or first year in a range of publication 
        years.
        
        pubDateEnd = Last year in a range of publication years.
                
        Note that for both works and work parts, WordHoard has the notion of
        "full" and "short" titles. ConvertMorph currently only has one notion of
        "title", and sets both WordHoard titles to the same value.
    -->

    <headerRules>
    
        <headerRule>
        
            <!-- Extracts the WordHoard work tag from the <idno> element. -->
                            
            <path>TEI/teiHeader/fileDesc/publicationStmt/idno</path>
            
            <pattern>
                <!--
                    Example:
                    In ANCF0101.xml, the <idno> is "NCF0101". The WordHoard work tag for this 
                    work is "0101", with the "NCF" prefix stripped off.
                    
                    Note that the WordHoard "full work tag" for a work is always 
                    "corpusTag-workTag", in this example "ncf-0101". Thus, from the point
                    of view of WordHoard, encoding "NCF" in the work tag itself would be
                    redundant, and we don't do it.
                    
                    In the pattern below, parentheses are used to "group" everything after
                    the "NCF" prefix, and "$1" is used to extract the value of this group
                    as the WordHoard work tag.
                -->
                <match>NCF(.*)</match>
                <extract item="workTag">$1</extract>
            </pattern>
            
        </headerRule>
        
        <headerRule>
        
            <!-- 
                Extracts the WordHoard work title from the <title> element. 
                
                The titles encoded in NCF are verbose and include author names and
                other junk. It would be nice to have "cleaner" titles, but this would
                most likely need to be done by hand by a human being. It does not seem that
                regular expressions are adequate for this task. So for now we just use the
                work titles as is, without making any attempt to clean them up.
                
                Note that for NCF, ConvertMorph often generates very long titles for both
                works and their parts. The WordHoard client is ill-prepared to deal
                with long titles, which present human interface problems in quite a few places.
                To solve these problems, the WordHoard ingest program BuildWorks currently 
                truncates all titles to 50 characters.
                
                Whenever pattern matching is used to examine a string extracted from an
                XML source file, ConvertMorph replaces any runs of line feeds and carriage 
                returns in the string by a space before trying to match patterns against the
                string. Believe it or not, such garbage actually appears in at least one NCF
                work title string!
            -->
            
            <path>TEI/teiHeader/fileDesc/titleStmt/title</path>
            
            <pattern>
                <!-- 
                    Note that .* matches the whole string, and $0 extracts the whole string.
                    This is the simplest possible kind of pattern matching and header value
                    extraction.
                -->
                <match>.*</match>
                <extract item="title">$0</extract>
            </pattern>
            
        </headerRule>
        
        <headerRule>
        
            <!-- 
                Extracts WordHoard publication dates from the <date> element.
                
                WordHoard supports both simple publication dates (yyyy) and publication date
                ranges (yyyy-yyyy). NCF specifies this information in a variety of different
                formats, and we need quite a few different patterns to catch all the
                variations.
            -->
            
            <path>TEI/teiHeader/fileDesc/sourceDesc/biblFull/publicationStmt/date</path>
            
            <pattern>
                <!--
                    Four digit pub date, optionally enclosed in square brackets.
                    Examples:
                    ANCF0101.xml: 1839
                    ANCF22505.xml: [1850]
                    
                    The square brackets presumably encode some kind of meaningful information
                    in NCF, but we just ignore them. In the second example above, the
                    WordHoard publication date is simply 1850, and whatever information is
                    represented by the square brackets is lost.
                -->
                <match>\[?(\d\d\d\d)\]?</match>
                <extract item="pubDateStart">$1</extract>
            </pattern>
            
            <pattern>
                <!--
                    Two four digit dates separated by dash or en-dash.
                    Example:
                    ANCF22503.xml: 1840-1841
                    
                    Yes, NCF sometimes uses dash, and sometimes en-dash. It's hard to even
                    see the difference in the source text, but it's there and must be dealt
                    with in our patterns.
                -->
                <match>(\d\d\d\d)[-–](\d\d\d\d)</match>
                <extract item="pubDateStart">$1</extract>
                <extract item="pubDateEnd">$2</extract>
            </pattern>
            
            <pattern>
                <!--
                    Four digit date, dash or en-dash, then two digits.
                    Example:
                    ANCF1602.xml: 1794-97
                    In this example, the WordHoard pub date range is 1794-1797.
                -->
                <match>(\d\d)(\d\d)[-–](\d\d)</match>
                <extract item="pubDateStart">$1$2</extract>
                <extract item="pubDateEnd">$1$3</extract>
            </pattern>
            
            <pattern>
                <!--
                    Four digit date, dash or en-dash, then one digit.
                    Example:
                    ANCF3701.xml: 1826-7
                    In this example, the WordHoard pub date range is 1826-1827.
                -->
                <match>(\d\d\d)(\d)[-–](\d)</match>
                <extract item="pubDateStart">$1$2</extract>
                <extract item="pubDateEnd">$1$3</extract>
            </pattern>
            
        </headerRule>
        
    </headerRules>
    
    <!-- 
        Rules for extracting WordHoard author information for each work. 
        
        A work can have more than one author. For each author, we extract the following 
        information for WordHoard:
        
        authorName = Author name.
        
        authorBirthYear = Author birth year, if known.
        
        authorDeathYear = Author death year, if known.
        
        authorEarliestWorkYear = Author earliest work year, if known. (Not used for NCF.)
        
        authorLatestWorkYear = Author latest work year, if known. (Not used for NCF.)
        
        ConvertMorph gathers together all the author information and updates the WordHoard
        authors definition file "authors.xml". Any new authors are added to the file. 
        If any author is encountered that is already in the authors.xml file with 
        conflicting attribute values, an error message is issued.
    -->
    
    <authorRules>
    
        <!-- 
            The path to author elements. There may be more than one instance of this path
            for multiple authors, although for NCF this is not the case.
        -->
    
        <path>TEI/teiHeader/fileDesc/titleStmt/author</path>
        
        <headerRule sep=" / ">
        
            <!--
                Extracts WordHoard author information from the <author> element.
                
                NCF encodes author names, birth dates, and death dates in a variety of 
                different formats.
                
                The "sep" attribute is used to specify the separator for multiple values 
                encoded in a single element. In this rule, sep=" / " is specified for NCF. 
                The string is split using the separator, and each part is processed using 
                the patterns specified. For example, in ANCF25901.xml there are two 
                authors specified as:
                
                     Somerville, E. OE. (Edith OEnone), 1858-1949 / Ross, Martin, 1862-1915
                    
                In this example, the two parts are split out of the string and matched against 
                the patterns separately:
                
                    Somerville, E. OE. (Edith OEnone), 1858-1949
                    Ross, Martin, 1862-1915
                    
                This results in two WordHoard "author" elements being generated, one for
                the author "Somerville, E. OE. (Edith OEnone)" with birth and death dates
                1858 and 1949, and one for the author "Ross, Martin" with birth and death dates
                1862 and 1915.
                
                Patterns are matched in order. When a pattern matches, the specified extraction
                rules are applied, and the remaining patterns are skipped. In this example, 
                the third and last pattern is used only if the first two patterns fail. The order
                of the patterns is therefore important. For this rule, if we put the last pattern
                first in the list, the rule would not work as intended!
            -->
            
            <!--
                The path to the element for this rule. This path is relative to the base
                path given above to author elements. For NCF, the author information is
                encoded directly in the author element rather than in child elements, so this
                relative path is empty.
            -->
            
            <path></path>
            
            <pattern>
                <!-- 
                    Both birth and death dates specified.
                    Examples:
                    ANCF24501.xml: Linton, E. Lynn (Elizabeth Lynn) 1822-1898
                    ANCF1501.xml: Hays, Mary, 1759/60-1843
                     
                    Either a hyphen (-) or an en-dash (–) can be used to separate the dates.
                    
                    In the second example above, the WordHoard birth and death dates are set
                    to 1759 and 1843 respectively. The "/60" following "1759" is ignored.
                -->
                <match>(.*?),? *(\d+)(/\d+)?[-–](\d+)(/\d+)?</match>
                <extract item="authorName">$1</extract>
                <extract item="authorBirthYear">$2</extract>
                <extract item="authorDeathYear">$4</extract>
            </pattern>
            
            <pattern>
                <!-- 
                    Only birth date specified.
                    Example: 
                    ANCF0901.xml: Dacre, Charlotte, b. 1782 
                -->
                <match>(.*?),? *b. *(\d+)(/\d+)?</match>
                <extract item="authorName">$1</extract>
                <extract item="authorBirthYear">$2</extract>
            </pattern>
            
            <pattern>
                <!-- 
                    No dates specified.
                    Example: 
                    ANCF1101.xml: Fenwick, E. (Eliza) 
                -->
                <match>.*</match>
                <extract item="authorName">$0</extract>
            </pattern>
            
        </headerRule>
    
    </authorRules>
    
    <!-- 
        The path to the text for the work is TEI/text. 
        This element is required.
    -->
    
    <textPath>TEI/text</textPath>
    
    <!-- 
        Text element rules. 
        This element is required.
        
        For each element in the subtree rooted at the text path, these rules specify how to 
        process the element. The rules specify both when to create work parts and how to 
        format the text within the work parts in WordHoard.
        
        Each rule has the following possible attributes. Only the "name" attribute is 
        required.
        
        name = the name of the element.
        
        parBreak = true to force a paragraph break before and after the element. Default = false. 
        In WordHoard, paragraphs are separated by blank lines.
        
        lineBreak = true to force a line break before and after the element. Default = false.
        
        Note that parBreak implies lineBreak.
        
        lineStyle = left, center, or right. Default = no change in current line style.
        
        indent = indentation in pixels. Default = no change in current line indentation.
        Indentation is cumulative and is only used with the left justification line style.
        
        wordStyles = list of word styles separated by commas. Default = no change in current
        word styles. The word styles may be any styles supported by WordHoard: bold, italic,
        extended, underline, overline, superscript, subscript, monospaced, and plain. Word
        styles are cumulative, except for "plain", which is used to remove any current other
        word styles and revert to plain unstyled text.
        
        ignoreChildren = true to ignore any children of this element. Default = false.
        
        createPart = never, sometimes, or always. Default = never. "always" means to always
        create a new work part for this element (e.g., in NCF, for <div> elements). "sometimes"
        means to create a new work part only if it is necessary (e.g., in NCF, for <trailer>
        and <epigraph> elements, which sometimes occur in "stranded" contexts where there is
        no active work part into which we are able to generate text).
        
        footnote = true to treat as a footnote. Default = false. Footnotes are represented by 
        footnote numbers in the main text, and the numbered footnotes proper are generated at 
        the end of each work part.
        
        ignoreRend = true to ignore any rend attributes that might be present on this element.
        Default = false.
        
        genBefore = plain text string to generate before processing the element. Default =
        nothing.
        
        genAfter = plain text string to generate after processing the element. Default =
        nothing.
        
        <w> and <c> elements are generated by MorphAdorner. These elements are processed
        specially by ConvertMorph, and result in WordHoard <w> and <punc> elements. None of the
        other elements are processed specially in any way other than as specified by the rules
        enumerated here.
        
        ConvertMorph generates an error message if it encounters an element while processing
        the text which is not defined by a rule.
    -->
    
    <textElementRules>
        <textElementRule name="add"/>
        <textElementRule name="argument" parBreak="true" indent="20"/>
        <textElementRule name="back"/>
        <textElementRule name="bibl" parBreak="true" lineStyle="right"/>
        <textElementRule name="body"/>
        <textElementRule name="c" ignoreChildren="true"/>
        <textElementRule name="cell" genBefore=" [" genAfter="] " ignoreRend="true"/>
        <textElementRule name="closer" parBreak="true"/>
        <textElementRule name="div" createPart="always"/>
        <textElementRule name="epigraph" parBreak="true" indent="20" createPart="sometimes"/>
        <textElementRule name="figure" ignoreChildren="true"/>
        <textElementRule name="foreign" wordStyles="italic"/>
        <textElementRule name="front"/>
        <textElementRule name="gap" ignoreChildren="true" genBefore=" "/>
        <textElementRule name="head" parBreak="true" lineStyle="center" wordStyles="bold"/>
        <textElementRule name="hi"/>
        <textElementRule name="insertDoc" parBreak="true" indent="20"/>
        <textElementRule name="item" parBreak="true" indent="20"/>
        <textElementRule name="l" lineBreak="true"/>
        <textElementRule name="label" parBreak="true" lineStyle="center" wordStyles="italic"/>
        <textElementRule name="lb" lineBreak="true" ignoreChildren="true"/>
        <textElementRule name="letter" parBreak="true" indent="20"/>
        <textElementRule name="lg" parBreak="true"/>
        <textElementRule name="list" parBreak="true"/>
        <textElementRule name="milestone" ignoreChildren="true"/>
        <textElementRule name="note" parBreak="true" footnote="true"/>
        <textElementRule name="opener" parBreak="true"/>
        <textElementRule name="p" parBreak="true"/>
        <textElementRule name="pb" ignoreChildren="true"/>
        <textElementRule name="q" parBreak="true" indent="20"/>
        <textElementRule name="row" lineBreak="true"/>
        <textElementRule name="salute" parBreak="true"/>
        <textElementRule name="seg"/>
        <textElementRule name="signed" lineBreak="true"/>
        <textElementRule name="sp" parBreak="true" indent="20"/>
        <textElementRule name="speaker" parBreak="true" indent="-20"/>
        <textElementRule name="stage" parBreak="true" lineStyle="center" wordStyles="italic"/>
        <textElementRule name="table" parBreak="true" indent="20"/>
        <textElementRule name="text"/>
        <textElementRule name="title" wordStyles="italic"/>
        <textElementRule name="trailer" parBreak="true" lineStyle="center" createPart="sometimes"/>
        <textElementRule name="w" ignoreChildren="true"/>
    </textElementRules>
    
    <!--
        Work part title rules.
        This element is required.
        
        There are four kinds of rules that can be used for getting work part titles. The rules 
        are tried in the order listed until one works. If none of the rules work, an error 
        message is issued and the title is set to "Untitled".
        
        useFirstChild: Use the first child element with a specified name. For NCF, we use the 
        first <head> child element, which works reasonably well in most cases. All of the text
        of the child element is used, except for any embedded footnotes and any embedded
        descendant elements which have ignoreChildren set to true in their text element rule.
        This is important, because in NCF there are indeed some <head> elements which have
        these kinds of descendants.
        
        useAttributeValue: Use an attribute value, optionally converting the first letter
        to upper case. For NCF we use the "type" attribute. For example, in some NCF
        works, there are <div type="dedication"> elements which have no <head> children. In
        this case, the work part title is set to "Dedication".
        
        useElementName: Use the element name, optionally converting the first letter to
        upper case. In NCF, this rule catches quite a few <trailer> elements which become
        work parts with the title "Trailer".
        
        useAttributeValuePair: Uses a pair of attribute values separated by a space,
        optionally converting the first letter of the first attribute value to upper case.
        This rule is not used for NCF, but it might be useful for other collections. For
        example, a <div n="3" type="chapter"> element under this rule might result in the
        work title "Chapter 3". The rule for this example would be:
        
           <useAttributeValuePair name1="type" name2="n" capitalizeFirstLetter="true"/>
        
        For many NCF works, perhaps most of them, these rules work quite well, even
        surprisingly well. For many other works, however, the table of contents formed by
        the work part hierarchy and the titles generated by these rules ends up being, shall we 
        say, a bit goofy, ugly, and rather short of optimal. 
    -->
    
    <workPartTitleRules>
        <useFirstChild name="head"/>
        <useAttributeValue name="type" capitalizeFirstLetter="true"/>
        <useElementName capitalizeFirstLetter="true"/>
    </workPartTitleRules>
    
    <!--
        Rend attribute rules.
        
        These rules map the "rend" attribute values used in NCF to WordHoard "rend" attribute 
        values.
        
        Rend attributes are processed wherever they occur, on any element, unless the element
        rule specifies ignoreRend="true".
        
        Each rend rule can contain optional lineStyle, indent, and wordStyles attributes that work
        the same way as in element rules.
    -->
    
    <rendAttributeRules>
        <rendAttributeRule attrName="rend">
            <rendAttributeMapping value="b(1)" wordStyles="bold"/>
            <rendAttributeMapping value="i(1)" wordStyles="italic"/>
            <rendAttributeMapping value="i(2)" wordStyles="italic"/>
            <rendAttributeMapping value="italics" wordStyles="italic"/>
            <rendAttributeMapping value="align(c)" lineStyle="center"/>
            <rendAttributeMapping value="align(r)" lineStyle="right"/>
            <rendAttributeMapping value="indent(1)" indent="20"/>
            <rendAttributeMapping value="indent(2)" indent="40"/>
            <rendAttributeMapping value="indent(3)" indent="60"/>
            <rendAttributeMapping value="indent(5)" indent="100"/>
            <rendAttributeMapping value="sc(1)"/>
            <rendAttributeMapping value="sc(2)"/>
            <rendAttributeMapping value="small(1)"/>
            <rendAttributeMapping value="small(2)"/>
            <rendAttributeMapping value="sub(1)" wordStyles="subscript"/>
            <rendAttributeMapping value="sub(2)" wordStyles="subscript"/>
            <rendAttributeMapping value="sup(1)" wordStyles="superscript"/>
            <rendAttributeMapping value="sup(2)" wordStyles="superscript"/>
            <rendAttributeMapping value="roman(1)" wordStyles="plain"/>
            <rendAttributeMapping value="roman(2)" wordStyles="plain"/>
            <rendAttributeMapping value="speaker"/>
            <rendAttributeMapping value="caption - pb"/>
            <rendAttributeMapping value="caption - div"/>
        </rendAttributeRule>
    </rendAttributeRules>
    
    <!--
        Footnote rules.
        This element is optional. The default values are as shown below.
        
        Footnotes are rendered at the ends of work parts, with superscript references in
        the main text.
    -->
    
    <footnoteRules>
        <footnoteRefStyle wordStyles="superscript"/>
        <footnoteStyle indent="20"/>
    </footnoteRules>

</ConvertMorphRules>
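
The author rule in the NCF rules above (split the element text on "sep", then try each pattern in order and stop at the first one that matches) can be sketched in a few lines of Python. This is only an illustration and not the ConvertMorph implementation, which is not shown in this note. It assumes that a pattern must match the whole part, that each part is trimmed before matching, and that "$n" in an extract refers to capture group n, with "$0" meaning the entire match. The date separator is written as a hyphen or an en-dash, as the rule's comment describes. The names AUTHOR_PATTERNS, expand, and apply_header_rule are made up for the example.

```python
import re

# Patterns copied from the NCF author rule above. Order matters:
# the catch-all ".*" must come last.
AUTHOR_PATTERNS = [
    (r"(.*?),? *(\d+)(/\d+)?[-–](\d+)(/\d+)?",
     {"authorName": "$1", "authorBirthYear": "$2", "authorDeathYear": "$4"}),
    (r"(.*?),? *b. *(\d+)(/\d+)?",
     {"authorName": "$1", "authorBirthYear": "$2"}),
    (r".*",
     {"authorName": "$0"}),
]


def expand(template, match):
    """Replace each $n in an extract template with capture group n (assumed semantics)."""
    return re.sub(r"\$(\d)",
                  lambda ref: match.group(int(ref.group(1))) or "",
                  template)


def apply_header_rule(value, patterns, sep=None):
    """Split value on sep, then apply the first matching pattern to each part."""
    parts = value.split(sep) if sep else [value]
    results = []
    for part in parts:
        part = part.strip()  # assumption: surrounding whitespace is not significant
        for regex, extracts in patterns:
            match = re.fullmatch(regex, part)  # assumption: whole-string match
            if match:
                results.append({item: expand(template, match)
                                for item, template in extracts.items()})
                break  # remaining patterns are skipped
    return results
```

Under these assumptions, the ANCF25901.xml example splits into two parts and produces two author records, and "Hays, Mary, 1759/60-1843" produces a single record in which the "/60" is dropped. Reversing the pattern list shows why order matters: the catch-all pattern then matches first and the dates end up inside the author name.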

Appendix 2 - Stein SIP

<?xml version="1.0" encoding="utf-8"?>

<!-- 
    ConvertMorph rules for the Stein collection. 
-->

<ConvertMorphRules>

    <corpusTag>stein</corpusTag>
    
    <!-- We currently have no title page rules for Stein. -->
    
    <!--
        File rules.
        
        Unlike the NCF collection, in the Stein collection full bibliographic data is
        not encoded in the TEI files. Only the work titles are encoded in the files. We
        use the <fileRules> section here to enumerate the other bibliographic information
        for each file in the Stein collection.
    -->
    
    <fileRules>
        <fileRule>
            <name>threelives-1.0.xml</name>
            <workTag>tli</workTag>
            <author>Stein, Gertrude</author>
            <pubDateStart>1909</pubDateStart>
        </fileRule>
        <fileRule>
            <name>moa-1.1.xml</name>
            <workTag>moa</workTag>
            <author>Stein, Gertrude</author>
            <pubDateStart>1925</pubDateStart>
        </fileRule>
    </fileRules>
    
    <!--
        Header rules.
        
        Note that file rules override header rules. For example, suppose a file rule
        specifies a publication date of 1832 for a work, and a header rule extracts a
        publication date of 1847 for the same work. In this case, the WordHoard publication
        date is set to 1832, from the file rule for the file, and the value in the header
        is ignored. This is not an issue for Stein, where the only header rule we specify
        is for titles.
    -->

    <headerRules>
        <headerRule>
            <path>TEI.2/teiHeader/fileDesc/titleStmt/title</path>
            <pattern>
                <match>.*</match>
                <extract item="title">$0</extract>
            </pattern>
        </headerRule>
    </headerRules>
    
    <!-- 
        We have no author rules for Stein - author names are given by the file rules above,
        and author attributes are specified in the WordHoard authors.xml file. 
    -->
    
    <textPath>TEI.2/text</textPath>
    
    <!--
        Text element rules.
        
        Note that we generate work parts for <div2> elements. This works well for "Three Lives",
        but results in some goofy "Space-break" parts in "Making of Americans". We could
        specify createPart="never" for <div2> to fix this, but then we'd need separate
        rule files for the two works. Within a single rules file there is no way to apply
        one rule to one work and a different rule to other works.
    -->
    
    <textElementRules>
        <textElementRule name="bibl" parBreak="true" lineStyle="right"/>
        <textElementRule name="body"/>
        <textElementRule name="c" ignoreChildren="true"/>
        <textElementRule name="div0" createPart="always"/>
        <textElementRule name="div1" createPart="always"/>
        <textElementRule name="div2" createPart="always"/>
        <textElementRule name="epigraph" parBreak="true" indent="20"/>
        <textElementRule name="head" parBreak="true" lineStyle="center" wordStyles="bold"/>
        <textElementRule name="name"/>
        <textElementRule name="note" parBreak="true" footnote="true"/>
        <textElementRule name="p" parBreak="true"/>
        <textElementRule name="pb" ignoreChildren="true"/>
        <textElementRule name="text"/>
        <textElementRule name="trailer" parBreak="true" lineStyle="center"/>
        <textElementRule name="w" ignoreChildren="true"/>
    </textElementRules>
    
    <!--
        Work part title rules.
        
        The combination and order of the work part title rules below was determined by
        trial and error. It seems to result in the most reasonable titles for the two Stein 
        novels.
    -->
    
    <workPartTitleRules>
        <useAttributeValuePair name1="type" name2="n" capitalizeFirstLetter="true"/>
        <useFirstChild name="head"/>
    </workPartTitleRules>
    
    <!-- We have no rend attribute rules. Stein doesn't have any style formatting! -->

</ConvertMorphRules>
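
The way the Stein work part title rules fall through from one rule to the next can be sketched in Python as follows. This is a rough illustration under stated assumptions and not the ConvertMorph code. It assumes that useAttributeValuePair fails unless both attributes are present (this note does not say what the real tool does when one is missing), and it simplifies useFirstChild to take all of the child's text without removing footnotes or ignored descendants. The function names and the sample elements are invented for the example.

```python
import xml.etree.ElementTree as ET


def capitalize_first(text, enabled):
    """Upper-case only the first letter, leaving the rest untouched."""
    return text[:1].upper() + text[1:] if enabled else text


def use_attribute_value_pair(elem, name1, name2, capitalize=False):
    """Title from two attribute values separated by a space."""
    value1, value2 = elem.get(name1), elem.get(name2)
    if not value1 or not value2:
        return None  # assumption: the rule fails unless both attributes exist
    return capitalize_first(value1, capitalize) + " " + value2


def use_first_child(elem, name):
    """Title from the text of the first child element with the given name."""
    child = elem.find(name)
    if child is None:
        return None
    # Simplification: all descendant text is kept, with whitespace collapsed.
    text = " ".join("".join(child.itertext()).split())
    return text or None


def work_part_title(elem, rules):
    """Try each rule in order; fall back to "Untitled" if none produces a title."""
    for rule in rules:
        title = rule(elem)
        if title:
            return title
    return "Untitled"


# Same order as the Stein <workPartTitleRules> above.
STEIN_RULES = [
    lambda e: use_attribute_value_pair(e, "type", "n", capitalize=True),
    lambda e: use_first_child(e, "head"),
]
```

With these assumptions, an element such as <div1 type="part" n="2"> gets the title "Part 2" even if it also has a <head> child, because the attribute pair rule is listed first. An element with only a <head> child takes its title from the head text, and an element with neither ends up as "Untitled".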
Document generated by Confluence on Apr 19, 2009 15:04