MONK : Notes on Data Ingest
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
This note discusses the Monk datastore ingest process and starts to flesh out some of the technical details of this process and our plans for the development of a new ingest process. These notes are preliminary and in this first draft they are based only on my own thoughts, observations, and work to date. I hope that we can work together to turn them into formal specifications as time goes on.

The Current Ingest Process

The current Monk datastore is based on a subset of the WordHoard data. The data ingest process is:
The MorphAdorner and ConvertWordHoard tools are part of the Monk codebase. Source code is available in the Monk SVN repository. The ConvertMorph tool and the old WordHoard build tools are part of the WordHoard codebase. Source code is available at the WordHoard web site at http://wordhoard.northwestern.edu/userman/dev-files.html. Detailed documentation on file formats is also available at the same web site.

Problems with the Current Ingest Process

This current ingest process made it possible to develop the first version of the Monk datastore much more quickly than would have been possible otherwise, by temporarily making use of code written for a different project. It is not, however, a path to the future. With this ingest process, in order to add the many new features we want in the Monk datastore, we would have to implement them in both WordHoard and in Monk. This is double the work it would take to implement them in Monk alone. Thus, in order to move forward, we need a new Monk data ingest process which does not involve WordHoard.

The New Ingest Process

At our all-hands meeting in Maryland on Dec. 14 and 15, we started discussing the new Monk ingest process. This new process will be a collaborative effort of the TEI-Analytics team at Nebraska and the datastore team at Northwestern. The new ingest process will be:
The exact division of labor between steps 2 and 3 and the details of the TEI-A format that defines the interface between them in this new process have not yet been fully specified, but a broad general statement of principle was enunciated by Steve Ramsey in his message to the Monk mailing list on Dec. 18:

The data store's job is to store data, not to determine

Martin Mueller also proposed a general principle for this division of labor in his message on Dec. 20:

More broadly speaking, the goal of the curatorial process is to

My understanding of these principles is that TEI-Analytics will take care of all the details that involve curator interaction, and will produce TEI-A files that contain all the data needed by BuildMonk to fully populate the data for all the objects and their attributes in the Monk datastore, without requiring any secondary additional input to BuildMonk by a curator or from any other source. Is my understanding correct? Are we all in agreement about this plan? If not, we need to discuss these basic principles in more detail.

In Maryland, we scheduled this work for January and February, and promised the delivery of an initial implementation by the end of February. At that time, we should have the first version of a new ingest process that is general, efficient, extensible, and does not involve any legacy WordHoard code or processes. This will lay the foundation for future work adding the new features to the datastore that go beyond the current minimal set of features that is based on a subset of the old WordHoard data.

A note on terminology: In this note, by TEI-A I mean the whole package of deliverables from the TEI-Analytics team at Nebraska to the data cell team at Northwestern. This is not necessarily just one "TEI-A" file per work. We probably need a new name for the whole package to distinguish it from the individual work files, but for now I'll just call the whole thing "TEI-A".
The Data

In this section we review the data that is in the current initial implementation of the Monk datastore. The purpose is to start a detailed discussion of the issues of how each data item is ingested. We follow the presentation of the details in the Monk datastore overview documentation available at the datastore test web site. In particular, we assume familiarity with the documentation on "Core Objects and Attributes" at http://scribe.at.northwestern.edu:8090/monk/core-objects.html.

All of this is subject to discussion and change if needed, of course. This section is intended to be just a first draft of an attempt to begin working through all the details. Note that I only include the data that is in the current version of the datastore, not the additional data that we will need to implement all of the features on our wish list in subsequent versions of the datastore. Just this core basic data is hard enough, and I'd prefer to discuss the extensions needed to implement new features elsewhere.

Corpora

Each corpus has the following attributes:
tag: A short unique string identifier for the corpus. Supplied by the curator and provided in a well-defined location and format in TEI-A to BuildMonk.
title: The corpus title. Supplied by the curator and provided in a well-defined location and format in TEI-A to BuildMonk.
The remaining attributes (numWorks, works, numAuthors, authors, numWords, and numWordParts) are derived data computed by BuildMonk.

Works

Each work has the following attributes:
corpus: The corpus to which the work belongs. Supplied by the curator and provided in a well-defined location and format in TEI-A to BuildMonk.
pubDateStart and pubDateEnd: Supplied by the original unmodified TEI source files and/or by a curator. These attributes may have the value -1 to indicate "missing", but of course we want to have publication dates when it is at all possible to supply them. Provided in a well-defined location and format in TEI-A to BuildMonk.
numAuthors and authors: Supplied by the original unmodified TEI source files and/or by a curator. Provided in a well-defined location and format in TEI-A to BuildMonk.

Works are also work parts. In addition to the attributes listed above, each work also has all the attributes listed below for work parts.

Work Parts

Each work part has the following attributes:
tag: A short unique string identifier for the work part. For works, this should probably be derived using some algorithm from the identifying information contained in the original TEI source file header sections, or perhaps from the file names of those source files. In any case, work tags are supplied in a well-defined location and format in TEI-A to BuildMonk. For work parts, TEI-Analytics may wish to supply this data, or BuildMonk could generate it, most likely by appending child index numbers at each level to work tags. E.g., if "ham" is the tag for "Hamlet", the tag for Act 1, Scene 3 might be generated as "ham-1-3".
title: The title of the work part. Supplied by the original unmodified TEI source files and/or by the curator. Provided in a well-defined location and format in TEI-A to BuildMonk.
type: The type attribute on the "div" element for the work part. TEI-Analytics will supply additional derived data based on this type, including a "mapped type" from some small fixed vocabulary, and a boolean attribute which indicates whether the work part contains "paratext" or "non-paratext". All three pieces of data will be provided in a well-defined location and format in TEI-A to BuildMonk.
htmlText: The human-readable text for the work part, formatted as HTML. The TEI-A markup for text must be well-defined in such a way that BuildMonk can use a fixed set of simple rules to generate the human-readable text.
All of the other many attributes of work parts, which we will not list in detail here, are derived data computed by BuildMonk.
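The child-index scheme suggested above for work part tags ("ham" giving "ham-1-3") can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name WorkPartTagSketch, the nested Part class, and the assignTags method are hypothetical and are not taken from BuildMonk, and the sketch assumes 1-based child indexes joined with hyphens, as in the Hamlet example.

```java
import java.util.ArrayList;
import java.util.List;

public class WorkPartTagSketch {

    /** Hypothetical minimal work part: a tag plus an ordered list of children. */
    static final class Part {
        String tag;
        final List<Part> children = new ArrayList<>();

        /** Appends a new, not yet tagged child and returns it. */
        Part addChild() {
            Part child = new Part();
            children.add(child);
            return child;
        }
    }

    /**
     * Tags every descendant of the given part by appending the 1-based
     * child index at each level to the parent's tag.
     * The tag of the part passed in (the work tag) must already be set.
     */
    static void assignTags(Part parent) {
        for (int i = 0; i < parent.children.size(); i++) {
            Part child = parent.children.get(i);
            child.tag = parent.tag + "-" + (i + 1);
            assignTags(child);
        }
    }

    public static void main(String[] args) {
        Part hamlet = new Part();
        hamlet.tag = "ham";            // work tag, assumed to come from TEI-A
        Part act1 = hamlet.addChild(); // Act 1
        act1.addChild();               // Scene 1
        act1.addChild();               // Scene 2
        Part scene3 = act1.addChild(); // Scene 3
        assignTags(hamlet);
        System.out.println(scene3.tag); // prints "ham-1-3"
    }
}
```

Tags generated this way depend on the order of the children, so they change if parts are inserted or reordered in the source file. That may be one reason for TEI-Analytics to supply the tags instead.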
Authors

Each author has the following attributes:
tag: A short unique string identifier for the author. In the current datastore, this is the same as the author's full name. Is this good enough? One problem is distinct authors who may happen to have the same name, e.g., two "Doe, John" authors. In this case we'd have the same name for both authors, but we'd need unique tags. In any case, author tags are supplied in a well-defined location and format in TEI-A to BuildMonk.
name: The author's name, in the format "last name, first name". Supplied by the original unmodified TEI source files and/or by a curator. Provided in a well-defined location and format in TEI-A to BuildMonk.
birthYear and deathYear: Supplied by the original unmodified TEI source files and/or by a curator. These attributes may have the value -1 to indicate "unknown", but of course we want to have these dates when it is at all possible to supply them. Provided in a well-defined location and format in TEI-A to BuildMonk.
The remaining attributes, which we will not list in detail here, are derived data computed by BuildMonk.

Note that a work may have multiple authors, and a single author may have multiple works in multiple corpora, at least in theory.

NUPOS data

This data includes all of the word-level NUPOS morphological tagging data: lemmas, parts of speech, spellings, word classes, major word classes, and part of speech categories. This data comes from two sources: The adornments added to the TEI source files by MorphAdorner, and definitions of the internal structure and attributes of the NUPOS part of speech tagset as supplied by Martin Mueller. I do not believe that there are any issues regarding this category of data that need to be addressed by TEI-Analytics.

Counters

See http://scribe.at.northwestern.edu:8090/monk/counts.html. Counters are derived data computed by BuildMonk. I do not believe that there are any implications for TEI-A beyond the other specs outlined here.
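Returning to the question raised under Authors about two authors who share a name: one possible way to build more distinctive author tags is to append the birth and death years to the name when they are known. The Java sketch below is only an illustration of that idea, not a description of how the datastore or TEI-A works. The class AuthorTagSketch and the method authorTag are hypothetical names, and the only convention taken from these notes is that -1 means "unknown".

```java
public class AuthorTagSketch {

    /** Convention from the notes above: -1 means the year is unknown. */
    static final int UNKNOWN = -1;

    /**
     * Builds a candidate author tag from the name plus any known years.
     * With both years unknown the tag is just the name, as in the
     * current datastore.
     */
    static String authorTag(String name, int birthYear, int deathYear) {
        if (birthYear == UNKNOWN && deathYear == UNKNOWN) {
            return name;
        }
        String birth = birthYear == UNKNOWN ? "" : Integer.toString(birthYear);
        String death = deathYear == UNKNOWN ? "" : Integer.toString(deathYear);
        return name + ", " + birth + "-" + death;
    }

    public static void main(String[] args) {
        System.out.println(authorTag("Doe, John", 1800, 1870));       // Doe, John, 1800-1870
        System.out.println(authorTag("Doe, John", 1902, UNKNOWN));    // Doe, John, 1902-
        System.out.println(authorTag("Doe, John", UNKNOWN, UNKNOWN)); // Doe, John
    }
}
```

This does not guarantee uniqueness: two same-named authors with no known dates would still collide, so a curator-supplied tag would still be needed in those cases.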
Words

See http://scribe.at.northwestern.edu:8090/monk/words.html. Given all the other specs outlined here, I believe that BuildMonk has everything it needs to construct and persist the Word objects, and there are no additional implications for TEI-A.

ConvertMorph, NCF, and Stein

At our summer all-hands meeting in Champaign-Urbana, I was tasked with the job of "loading up NCF into WordHoard". The ConvertMorph program mentioned earlier was the result of this work. We also loaded up two novels by Gertrude Stein into WordHoard using the same tool. The result of this work in turn provided the data for the NCF and Stein corpora in the current version of our Monk datastore, using the ingest process described earlier.

ConvertMorph takes as input MorphAdorned TEI source files for works to be ingested into WordHoard and produces as output WordHoard-format XML input files. These files are in turn processed by the old, largely unmodified WordHoard build tools to produce a WordHoard datastore.

In our new Monk build process, we face a very similar problem. TEI-Analytics must take as input unmodified TEI source files for works to be ingested into Monk, run those texts through MorphAdorner, normalize their formats, and produce as output TEI-A format XML files, which are in turn read by BuildMonk to produce a Monk datastore.

Our experience at Northwestern in the development of ConvertMorph and the use of that tool to ingest NCF and Stein may be useful to the TEI-Analytics team at Nebraska, even though they almost certainly will prefer to use their own architecture and tools and code to solve their problems. In this section I will outline the approach we took in ConvertMorph and highlight some of the significant problems we had to solve.

There were two major kinds of problems we had to solve. First, the raw data files we were given for NCF and Stein were quite a mess.
The data was not consistently encoded, certainly not between the two collections, and not even within the same collection. A good example is publication dates. In NCF, publication dates were encoded in many different formats, and we had to use complex regular expression pattern matching rules to decode them. In Stein, publication dates were not present at all in the TEI files, and we had to look them up on the Internet. We faced similar problems with generating work tags, getting author names and birth and death dates, and even extracting such simple information as work titles.

We faced a variety of very difficult technical issues in mapping the structures of "div" and other elements into the tree of work parts. Extracting usable titles for work parts was particularly bothersome.

Another issue was the use of idiosyncratic values for "rend" attributes. Determining how to map these values into human-readable text formatting operations was non-trivial. We had no documentation for the NCF conventions, and we still don't know for certain what some of them are supposed to mean!

In short, it was clear that it was impossible to extract all the data we needed in the format we needed, even for very simple information like titles and dates, from the raw TEI data files by themselves as given to us. Too much data was missing, and way too much of the data was inconsistent and ambiguous. Extra information needed to be supplied somehow to get the data we needed.

Our solution was to specify all this extra information in "SIP" files. ("SIP" stands for "Submission Information Package".) We also called these files "rule files" in the code and documentation. To ingest a corpus, in addition to the individual TEI source files, the curator must prepare and supply a SIP file for the corpus. We have one such SIP file for the NCF corpus, and one for the Stein corpus.

The SIP file provides two kinds of additional information to the ConvertMorph tool.
First, any data that is simply missing in the TEI files is specified in the SIP. For example, this is how publication dates and the author name are supplied for the Stein corpus. Second, the SIP specifies a set of "rules" that define how to process the TEI source files and all their elements and attributes, in order to resolve ambiguities and inconsistencies. As an example, in the SIP for NCF, we specify the following set of rules for extracting work part titles from the TEI files:
When we ingested Stein, we found that a different set of rules in the SIP for Stein worked best for getting reasonable work part titles. The SIP for Stein specifies the following rules:
There are many more details of the SIP files we developed for use by ConvertMorph. For the details, see the two appendices below where we present the full SIP files for NCF and Stein, in all their gory detail. They contain lots of comments which discuss all the details. For even more details, see the WordHoard source code for the ConvertMorph tool in the package edu.northwestern.at.wordhoard.tools.cm, which contains about 4,000 lines of Java code in 27 source code files.

Appendix 1 - NCF SIP

<?xml version="1.0" encoding="utf-8"?>
<!--
ConvertMorph rules for the 19th Century Fiction collection (NCF).
These rules define the translation of MorphAdorner XML output files for NCF into WordHoard
XML input files for NCF.
We've included quite a few comments in this file about the format of the file in general and
what all the many rules do in general, in addition to comments that are specific to the NCF
collection. These comments do not add up to complete formal specifications, but they should
give you a good idea of what everything means and how it works.
The general idea is to make ConvertMorph fully parameterized and rule-driven, so that
it can be used to process morphadorned XML files from many different sources and in many
different formats. Indeed, in an ideal world, these files would not even need to be in
TEI format or a TEI subset or "simple" format. They could be in almost any format at all,
as long as it is well-formed XML that has been processed by MorphAdorner. We're still short
of fully reaching this goal, but we've made progress.
Even within the realm of "TEISimple", there are many variations in the way collections are
encoded. For an example which is quite different from NCF, see the ConvertMorph rules file
for the Stein collection.
-->
<ConvertMorphRules>
<!--
The WordHoard corpus tag for this collection is "ncf".
This element is required. Every WordHoard corpus must have a unique corpus tag, and
there must be a definition for the specified corpus in the WordHoard "corpora.xml" file.
-->
<corpusTag>ncf</corpusTag>
<!--
Title page rules.
This element is optional.
This rule contains text to be generated for the WordHoard title pages of each work.
There may be any number of <respStmt> elements and at most one <publicationStmt> element,
in the format described in the WordHoard manual.
-->
<titlePageRules>
<publicationStmt>
<p>This version of the text is an experimental derivative of
its version in the Chadwyck-Healey archive of
Nineteenth-Century Fiction (hereafter NCF) and will be
improved or replaced in the course of the MONK Project.</p>
<p>The copyright to this text is owned by ProQuest, and this
version may be accessed only by members who are affiliated
with institutions that have purchased the NCF source files
or by members directly affiliated with the MONK Project, who
have been granted special permission by ProQuest to use the
source files for purposes that bear directly on the
development of the MONK Project.</p> <p>As an experimental
derivative, this version differs from its source file in the
following ways:</p> <p>1. The text file has been transformed
from its original encoding in a Chadwyck-Healey SGML dtd
into an XML file that parses under a TEI P5 dtd that uses a
sharply reduced number of elements. This dtd, provisionally
called teisimple.dtd, employs some minor extensions and
relaxations of the TEI content model to accommodate a
variety of text archives, including the Text Creation
Partnership (TCP), Early American Fiction (EAF), the Wright
Archive of American Fiction (1851-75), and Documenting the
American South. Teisimple will evolve with the final release
of TEI P5, scheduled for October 2007.</p> <p>2. While
letters are explicitly tagged as such in other fiction
collections, they are not identified in NCF, unless they
occur in epistolary novels that consist only of letters. In
order to enhance the comparability of NCF texts with texts
in other fiction archives, letters in NCF have been
explicitly tagged by Martin Mueller. Nested narratives and
other inserted documents have also been tagged explicitly.
In the WordHoard environment, this tagging is not yet
available for querying.</p> <p>3. The use of the apostrophe
as an elision marker at the beginning or end of a word is a
very common feature in poetic, regional or colloquial
speech, but in digital environments its consistent
disambiguation from opening or closing single quotation
marks is a non-trivial task, with considerable consequences
for tokenization and POS tagging. For various reasons, the
current version does not distinguish between the apostrophe
and the single quote as accurately as it should and
includes some unnecessary errors in tokenization or POS
tagging. We know where to look for and correct these errors
but will be grateful for users to point them out anyhow by
using the error reporting feature of WordHoard.</p>
</publicationStmt>
</titlePageRules>
<!-- We have no file rules for NCF. For an example of file rules, see the Stein rules. -->
<!--
Header rules for extracting WordHoard bibliographic information for each work.
These rules use regular expression pattern matching. For details, see
Sun's javadoc for the class java.util.regex.Pattern.
NCF uses some rather complicated and sometimes inconsistent conventions for encoding
header information, and we often need to use rather complex patterns to get the
information we need for WordHoard. The examples given below along with the patterns
illustrate all the different cases.
These rules extract the following bibliographic information about each work for
WordHoard:
workTag = Work tag. Each work in a collection must have a unique tag. For NCF we
use the numbers as assigned by the C-H encoders.
title = Work title.
pubDateStart = Publication year, if known, or first year in a range of publication
years.
pubDateEnd = Last year in a range of publication years.
Note that for both works and work parts, WordHoard has the notion of
"full" and "short" titles. ConvertMorph currently only has one notion of
"title", and sets both WordHoard titles to the same value.
-->
<headerRules>
<headerRule>
<!-- Extracts the WordHoard work tag from the <idno> element. -->
<path>TEI/teiHeader/fileDesc/publicationStmt/idno</path>
<pattern>
<!--
Example:
In ANCF0101.xml, the <idno> is "NCF0101". The WordHoard work tag for this
work is "0101", with the "NCF" prefix stripped off.
Note that the WordHoard "full work tag" for a work is always
"corpusTag-workTag", in this example "ncf-0101". Thus, from the point
of view of WordHoard, encoding "NCF" in the work tag itself would be
redundant, and we don't do it.
In the pattern below, parentheses are used to "group" everything after
the "NCF" prefix, and "$1" is used to extract the value of this group
as the WordHoard work tag.
-->
<match>NCF(.*)</match>
<extract item="workTag">$1</extract>
</pattern>
</headerRule>
<headerRule>
<!--
Extracts the WordHoard work title from the <title> element.
The titles encoded in NCF are verbose and include author names and
other junk. It would be nice to have "cleaner" titles, but this would
most likely need to be done by hand by a human being. It does not seem that
regular expressions are adequate for this task. So for now we just use the
work titles as is, without making any attempt to clean them up.
Note that for NCF, ConvertMorph often generates very long titles for both
works and their parts. The WordHoard client is ill-prepared to deal
with long titles, which present human interface problems in quite a few places.
To solve these problems, the WordHoard ingest program BuildWorks currently
truncates all titles to 50 characters.
Whenever pattern matching is used to examine a string extracted from an
XML source file, ConvertMorph replaces any runs of line feeds and carriage
returns in the string by a space before trying to match patterns against the
string. Believe it or not, such garbage actually appears in at least one NCF
work title string!
-->
<path>TEI/teiHeader/fileDesc/titleStmt/title</path>
<pattern>
<!--
Note that .* matches the whole string, and $0 extracts the whole string.
This is the simplest possible kind of pattern matching and header value
extraction.
-->
<match>.*</match>
<extract item="title">$0</extract>
</pattern>
</headerRule>
<headerRule>
<!--
Extracts WordHoard publication dates from the <date> element.
WordHoard supports both simple publication dates (yyyy) and publication date
ranges (yyyy-yyyy). NCF specifies this information in a variety of different
formats, and we need quite a few different patterns to catch all the
variations.
-->
<path>TEI/teiHeader/fileDesc/sourceDesc/biblFull/publicationStmt/date</path>
<pattern>
<!--
Four digit pub date, optionally enclosed in square brackets.
Examples:
ANCF0101.xml: 1839
ANCF22505.xml: [1850]
The square brackets presumably encode some kind of meaningful information
in NCF, but we just ignore them. In the second example above, the
WordHoard publication date is simply 1850, and whatever information is
represented by the square brackets is lost.
-->
<match>\[?(\d\d\d\d)\]?</match>
<extract item="pubDateStart">$1</extract>
</pattern>
<pattern>
<!--
Two four digit dates separated by dash or en-dash.
Example:
ANCF22503.xml: 1840-1841
Yes, NCF sometimes uses dash, and sometimes en-dash. It's hard to even
see the difference in the source text, but it's there and must be dealt
with in our patterns.
-->
<match>(\d\d\d\d)[-–](\d\d\d\d)</match>
<extract item="pubDateStart">$1</extract>
<extract item="pubDateEnd">$2</extract>
</pattern>
<pattern>
<!--
Four digit date, dash or en-dash, then two digits.
Example:
ANCF1602.xml: 1794-97
In this example, the WordHoard pub date range is 1794-1797.
-->
<match>(\d\d)(\d\d)[-–](\d\d)</match>
<extract item="pubDateStart">$1$2</extract>
<extract item="pubDateEnd">$1$3</extract>
</pattern>
<pattern>
<!--
Four digit date, dash or en-dash, then one digit.
Example:
ANCF3701.xml: 1826-7
In this example, the WordHoard pub date range is 1826-1827.
-->
<match>(\d\d\d)(\d)[-–](\d)</match>
<extract item="pubDateStart">$1$2</extract>
<extract item="pubDateEnd">$1$3</extract>
</pattern>
</headerRule>
</headerRules>
<!--
Rules for extracting WordHoard author information for each work.
A work can have more than one author. For each author, we extract the following
information for WordHoard:
authorName = Author name.
authorBirthYear = Author birth year, if known.
authorDeathYear = Author death year, if known.
authorEarliestWorkYear = Author earliest work year, if known. (Not used for NCF.)
authorLatestWorkYear = Author latest work year, if known. (Not used for NCF.)
ConvertMorph gathers together all the author information and updates the WordHoard
authors definition file "authors.xml". Any new authors are added to the file.
If any author is encountered that is already in the authors.xml file with
conflicting attribute values, an error message is issued.
-->
<authorRules>
<!--
The path to author elements. There may be more than one instance of this path
for multiple authors, although for NCF this is not the case.
-->
<path>TEI/teiHeader/fileDesc/titleStmt/author</path>
<headerRule sep=" / ">
<!--
Extracts WordHoard author information from the <author> element.
NCF encodes author names, birth dates, and death dates in a variety of
different formats.
The "sep" attribute is used to specify the separator for multiple values
encoded in a single element. In this rule, sep=" / " is specified for NCF.
The string is split using the separator, and each part is processed using
the patterns specified. For example, in ANCF25901.xml there are two
authors specified as:
Somerville, E. OE. (Edith OEnone), 1858-1949 / Ross, Martin, 1862-1915
In this example, the two parts are split out of the string and matched against
the patterns separately:
Somerville, E. OE. (Edith OEnone), 1858-1949
Ross, Martin, 1862-1915
This results in two WordHoard "author" elements being generated, one for
the author "Somerville, E. OE. (Edith OEnone)" with birth and death dates
1858 and 1949, and one for the author "Ross, Martin" with birth and death dates
1862 and 1915.
Patterns are matched in order. When a pattern matches, the specified extraction
rules are applied, and the remaining patterns are skipped. In this example,
the third and last pattern is used only if the first two patterns fail. The order
of the patterns is therefore important. For this rule, if we put the last pattern
first in the list, the rule would not work as intended!
-->
<!--
The path to the element for this rule. This path is relative to the base
path given above to author elements. For NCF, the author information is
encoded directly in the author element rather than in child elements, so this
relative path is empty.
-->
<path></path>
<pattern>
<!--
Both birth and death dates specified.
Examples:
ANCF24501.xml: Linton, E. Lynn (Elizabeth Lynn) 1822-1898
ANCF1501.xml: Hays, Mary, 1759/60-1843
Either a dash (-) or an en-dash (–) can be used to separate the dates.
In the second example above, the WordHoard birth and death dates are set
to 1759 and 1843 respectively. The "/60" following "1759" is ignored.
-->
<match>(.*?),? *(\d+)(/\d+)?[-–](\d+)(/\d+)?</match>
<extract item="authorName">$1</extract>
<extract item="authorBirthYear">$2</extract>
<extract item="authorDeathYear">$4</extract>
</pattern>
<pattern>
<!--
Only birth date specified.
Example:
ANCF0901.xml: Dacre, Charlotte, b. 1782
-->
<match>(.*?),? *b. *(\d+)(/\d+)?</match>
<extract item="authorName">$1</extract>
<extract item="authorBirthYear">$2</extract>
</pattern>
<pattern>
<!--
No dates specified.
Example:
ANCF1101.xml: Fenwick, E. (Eliza)
-->
<match>.*</match>
<extract item="authorName">$0</extract>
</pattern>
</headerRule>
</authorRules>
<!--
The path to the text for the work is TEI/text.
This element is required.
-->
<textPath>TEI/text</textPath>
<!--
Text element rules.
This element is required.
For each element in the subtree rooted at the text path, these rules specify how to
process the element. The rules specify both when to create work parts and how to
format the text within the work parts in WordHoard.
Each rule has the following possible attributes. Only the "name" attribute is
required.
name = the name of the element.
parBreak = true to force a paragraph break before and after the element. Default = false.
In WordHoard, paragraphs are separated by blank lines.
lineBreak = true to force a line break before and after the element. Default = false.
Note that parBreak implies lineBreak.
lineStyle = left, center, or right. Default = no change in current line style.
indent = indentation in pixels. Default = no change in current line indentation.
Indentation is cumulative and is only used with the left justification line style.
wordStyles = list of word styles separated by commas. Default = no change in current
word styles. The word styles may be any styles supported by WordHoard: bold, italic,
extended, underline, overline, superscript, subscript, monospaced, and plain. Word
styles are cumulative, except for "plain", which is used to remove any current other
word styles and revert to plain unstyled text.
ignoreChildren = true to ignore any children of this element. Default = false.
createPart = never, sometimes, or always. Default = never. "always" means to always
create a new work part for this element (e.g., in NCF, for <div> elements). "sometimes"
means to create a new work part only if it is necessary (e.g., in NCF, for <trailer>
and <epigraph> elements, which sometimes occur in "stranded" contexts where there is
no active work part into which we are able to generate text.)
footnote = true to treat as a footnote. Default = false. Footnotes are represented by
footnote numbers in the main text, and the numbered footnotes proper are generated at
the end of each work part.
ignoreRend = true to ignore any rend attributes that might be present on this element.
Default = false.
genBefore = plain text string to generate before processing the element. Default =
nothing.
genAfter = plain text string to generate after processing the element. Default =
nothing.
<w> and <c> elements are generated by MorphAdorner. These elements are processed
specially by ConvertMorph, and result in WordHoard <w> and <punc> elements. None of the
other elements are processed specially in any way other than as specified by the rules
enumerated here.
ConvertMorph generates an error message if it encounters an element while processing
the text which is not defined by a rule.
-->
<textElementRules>
<textElementRule name="add"/>
<textElementRule name="argument" parBreak="true" indent="20"/>
<textElementRule name="back"/>
<textElementRule name="bibl" parBreak="true" lineStyle="right"/>
<textElementRule name="body"/>
<textElementRule name="c" ignoreChildren="true"/>
<textElementRule name="cell" genBefore=" [" genAfter="] " ignoreRend="true"/>
<textElementRule name="closer" parBreak="true"/>
<textElementRule name="div" createPart="always"/>
<textElementRule name="epigraph" parBreak="true" indent="20" createPart="sometimes"/>
<textElementRule name="figure" ignoreChildren="true"/>
<textElementRule name="foreign" wordStyles="italic"/>
<textElementRule name="front"/>
<textElementRule name="gap" ignoreChildren="true" genBefore=" "/>
<textElementRule name="head" parBreak="true" lineStyle="center" wordStyles="bold"/>
<textElementRule name="hi"/>
<textElementRule name="insertDoc" parBreak="true" indent="20"/>
<textElementRule name="item" parBreak="true" indent="20"/>
<textElementRule name="l" lineBreak="true"/>
<textElementRule name="label" parBreak="true" lineStyle="center" wordStyles="italic"/>
<textElementRule name="lb" lineBreak="true" ignoreChildren="true"/>
<textElementRule name="letter" parBreak="true" indent="20"/>
<textElementRule name="lg" parBreak="true"/>
<textElementRule name="list" parBreak="true"/>
<textElementRule name="milestone" ignoreChildren="true"/>
<textElementRule name="note" parBreak="true" footnote="true"/>
<textElementRule name="opener" parBreak="true"/>
<textElementRule name="p" parBreak="true"/>
<textElementRule name="pb" ignoreChildren="true"/>
<textElementRule name="q" parBreak="true" indent="20"/>
<textElementRule name="row" lineBreak="true"/>
<textElementRule name="salute" parBreak="true"/>
<textElementRule name="seg"/>
<textElementRule name="signed" lineBreak="true"/>
<textElementRule name="sp" parBreak="true" indent="20"/>
<textElementRule name="speaker" parBreak="true" indent="-20"/>
<textElementRule name="stage" parBreak="true" lineStyle="center" wordStyles="italic"/>
<textElementRule name="table" parBreak="true" indent="20"/>
<textElementRule name="text"/>
<textElementRule name="title" wordStyles="italic"/>
<textElementRule name="trailer" parBreak="true" lineStyle="center" createPart="sometimes"/>
<textElementRule name="w" ignoreChildren="true"/>
</textElementRules>
<!--
Work part title rules.
This element is required.
There are four kinds of rules that can be used for getting work part titles. The rules
are tried in the order listed until one works. If none of the rules work, an error
message is issued and the title is set to "Untitled".
useFirstChild: Use the first child element with a specified name. For NCF, we use the
first <head> child element, which works reasonably well in most cases. All of the text
of the child element is used, except for any embedded footnotes and any embedded
descendant elements which have ignoreChildren set to true in their text element rule.
This is important, because in NCF there are indeed some <head> elements which have
these kinds of descendants.
useAttributeValue: Use an attribute value, optionally converting the first letter
to upper case. For NCF we use the "type" attribute. For example, in some NCF
works, there are <div type="dedication"> elements which have no <head> children. In
this case, the work part title is set to "Dedication".
useElementName: Use the element name, optionally converting the first letter to
upper case. In NCF, this rule catches quite a few <trailer> elements which become
work parts with the title "Trailer".
useAttributeValuePair: Uses a pair of attribute values separated by a space,
optionally converting the first letter of the first attribute value to upper case.
This rule is not used for NCF, but it might be useful for other collections. For
example, a <div n="3" type="chapter"> element under this rule might result in the
work part title "Chapter 3". The rule for this example would be:
<useAttributeValuePair name1="type" name2="n" capitalizeFirstLetter="true"/>
For many NCF works, perhaps most of them, these rules work quite well, even
surprisingly well. For many other works, however, the table of contents formed by
the work part hierarchy and the titles generated by these rules ends up being, shall we
say, a bit goofy, ugly, and rather short of optimal.
-->
<workPartTitleRules>
<useFirstChild name="head"/>
<useAttributeValue name="type" capitalizeFirstLetter="true"/>
<useElementName capitalizeFirstLetter="true"/>
</workPartTitleRules>
<!--
Rend attribute rules.
These rules map the "rend" attribute values used in NCF to WordHoard "rend" attribute
values.
Rend attributes are processed wherever they occur, on any element, unless the element
rule specifies ignoreRend="true".
Each rend rule can contain optional lineStyle, indent, and wordStyles attributes that work
the same way as in element rules.
-->
<rendAttributeRules>
<rendAttributeRule attrName="rend">
<rendAttributeMapping value="b(1)" wordStyles="bold"/>
<rendAttributeMapping value="i(1)" wordStyles="italic"/>
<rendAttributeMapping value="i(2)" wordStyles="italic"/>
<rendAttributeMapping value="italics" wordStyles="italic"/>
<rendAttributeMapping value="align(c)" lineStyle="center"/>
<rendAttributeMapping value="align(r)" lineStyle="right"/>
<rendAttributeMapping value="indent(1)" indent="20"/>
<rendAttributeMapping value="indent(2)" indent="40"/>
<rendAttributeMapping value="indent(3)" indent="60"/>
<rendAttributeMapping value="indent(5)" indent="100"/>
<rendAttributeMapping value="sc(1)"/>
<rendAttributeMapping value="sc(2)"/>
<rendAttributeMapping value="small(1)"/>
<rendAttributeMapping value="small(2)"/>
<rendAttributeMapping value="sub(1)" wordStyles="subscript"/>
<rendAttributeMapping value="sub(2)" wordStyles="subscript"/>
<rendAttributeMapping value="sup(1)" wordStyles="superscript"/>
<rendAttributeMapping value="sup(2)" wordStyles="superscript"/>
<rendAttributeMapping value="roman(1)" wordStyles="plain"/>
<rendAttributeMapping value="roman(2)" wordStyles="plain"/>
<rendAttributeMapping value="speaker"/>
<rendAttributeMapping value="caption - pb"/>
<rendAttributeMapping value="caption - div"/>
</rendAttributeRule>
</rendAttributeRules>
<!--
Footnote rules.
This element is optional. The default values are as shown below.
Footnotes are rendered at the ends of work parts, with superscript references in
the main text.
-->
<footnoteRules>
<footnoteRefStyle wordStyles="superscript"/>
<footnoteStyle indent="20"/>
</footnoteRules>
</ConvertMorphRules>
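The work part title rules in the NCF file above are tried in order until one yields a title. The following Python fragment is only a rough sketch of that fallback logic, not ConvertMorph's actual code. The names `NCF_TITLE_RULES`, `apply_rule`, and `work_part_title` are hypothetical, and the sketch simplifies `useFirstChild` by taking all text of the child element (the real rule also skips embedded footnotes and descendants marked `ignoreChildren`).

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of the NCF <workPartTitleRules> element.
NCF_TITLE_RULES = [
    ("useFirstChild", {"name": "head"}),
    ("useAttributeValue", {"name": "type", "capitalizeFirstLetter": True}),
    ("useElementName", {"capitalizeFirstLetter": True}),
]


def capitalize_first(text):
    """Upper-case only the first letter, leaving the rest unchanged."""
    return text[:1].upper() + text[1:]


def apply_rule(kind, args, element):
    """Return a title string, or None if the rule does not apply."""
    cap = args.get("capitalizeFirstLetter", False)
    if kind == "useFirstChild":
        child = element.find(args["name"])
        if child is None:
            return None
        # Simplification: use all descendant text, with whitespace normalized.
        title = " ".join("".join(child.itertext()).split())
        return title or None
    if kind == "useAttributeValue":
        value = element.get(args["name"])
        if not value:
            return None
        return capitalize_first(value) if cap else value
    if kind == "useElementName":
        return capitalize_first(element.tag) if cap else element.tag
    if kind == "useAttributeValuePair":
        first = element.get(args["name1"])
        second = element.get(args["name2"])
        if not first or not second:
            return None
        return f"{capitalize_first(first) if cap else first} {second}"
    raise ValueError(f"unknown rule kind: {kind}")


def work_part_title(element, rules):
    """Try each rule in order; fall back to "Untitled" if none applies."""
    for kind, args in rules:
        title = apply_rule(kind, args, element)
        if title:
            return title
    # The notes say an error message is also issued at this point.
    return "Untitled"


if __name__ == "__main__":
    part = ET.fromstring('<div type="dedication"><p>To my friends.</p></div>')
    print(work_part_title(part, NCF_TITLE_RULES))
```

Because `useElementName` always produces a title, the "Untitled" fallback can only be reached when a rule file omits that rule, as the Stein rule file does.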
Appendix 2 - Stein SIP
<?xml version="1.0" encoding="utf-8"?>
<!--
ConvertMorph rules for the Stein collection.
-->
<ConvertMorphRules>
<corpusTag>stein</corpusTag>
<!-- We currently have no title page rules for Stein. -->
<!--
File rules.
Unlike the NCF collection, in the Stein collection full bibliographic data is
not encoded in the TEI files. Only the work titles are encoded in the files. We
use the <fileRules> section here to enumerate the other bibliographic information
for each file in the Stein collection.
-->
<fileRules>
<fileRule>
<name>threelives-1.0.xml</name>
<workTag>tli</workTag>
<author>Stein, Gertrude</author>
<pubDateStart>1909</pubDateStart>
</fileRule>
<fileRule>
<name>moa-1.1.xml</name>
<workTag>moa</workTag>
<author>Stein, Gertrude</author>
<pubDateStart>1925</pubDateStart>
</fileRule>
</fileRules>
<!--
Header rules.
Note that file rules override header rules. For example, suppose a file rule
specifies a publication date of 1832 for a work, and a header rule extracts a
publication date of 1847 for the same work. In this case, the WordHoard publication
date is set to 1832, from the file rule for the file, and the value in the header
is ignored. This is not an issue for Stein, where the only header rule we specify
is for titles.
-->
<headerRules>
<headerRule>
<path>TEI.2/teiHeader/fileDesc/titleStmt/title</path>
<pattern>
<match>.*</match>
<extract item="title">$0</extract>
</pattern>
</headerRule>
</headerRules>
<!--
We have no author rules for Stein - author names are given by the file rules above,
and author attributes are specified in the WordHoard authors.xml file.
-->
<textPath>TEI.2/text</textPath>
<!--
Text element rules.
Note that we generate work parts for <div2> elements. This works well for "Three Lives",
but results in some goofy "Space-break" parts in "Making of Americans". We could
specify createPart="never" for <div2> to fix this, but then we'd need separate
rule files for the two works. The rule format gives us no way to apply one rule to
one work and a different rule to the others.
-->
<textElementRules>
<textElementRule name="bibl" parBreak="true" lineStyle="right"/>
<textElementRule name="body"/>
<textElementRule name="c" ignoreChildren="true"/>
<textElementRule name="div0" createPart="always"/>
<textElementRule name="div1" createPart="always"/>
<textElementRule name="div2" createPart="always"/>
<textElementRule name="epigraph" parBreak="true" indent="20"/>
<textElementRule name="head" parBreak="true" lineStyle="center" wordStyles="bold"/>
<textElementRule name="name"/>
<textElementRule name="note" parBreak="true" footnote="true"/>
<textElementRule name="p" parBreak="true"/>
<textElementRule name="pb" ignoreChildren="true"/>
<textElementRule name="text"/>
<textElementRule name="trailer" parBreak="true" lineStyle="center"/>
<textElementRule name="w" ignoreChildren="true"/>
</textElementRules>
<!--
Work part title rules.
The combination and order of the work part title rules below was determined by
trial and error. It seems to result in the most reasonable titles for the two Stein
novels.
-->
<workPartTitleRules>
<useAttributeValuePair name1="type" name2="n" capitalizeFirstLetter="true"/>
<useFirstChild name="head"/>
</workPartTitleRules>
<!-- We have no rend attribute rules. Stein doesn't have any style formatting! -->
</ConvertMorphRules>
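The header rules comment in the Stein file above says that file rules override header rules. The Python fragment below is a minimal sketch of that precedence, under several assumptions: the function names are hypothetical, header text is passed in as a plain dictionary keyed by path rather than read from a TEI file, the `<match>` pattern is assumed to match the whole header value, and `$0`, `$1`, ... in `<extract>` are assumed to refer to regular expression groups. It is not the ConvertMorph implementation.

```python
import re


def expand_template(template, match):
    """Replace $0, $1, ... in the template with the matching regex groups."""
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))) or "", template)


def apply_header_rules(header_text_by_path, header_rules):
    """Extract bibliographic items from header values using the header rules."""
    items = {}
    for rule in header_rules:
        text = header_text_by_path.get(rule["path"])
        if text is None:
            continue
        # Assumption: the pattern must match the entire header value.
        match = re.fullmatch(rule["match"], text)
        if match:
            items[rule["item"]] = expand_template(rule["extract"], match)
    return items


def resolve_metadata(file_rule, header_items):
    """Merge both sources; values from the file rule win on conflict."""
    merged = dict(header_items)
    merged.update(file_rule)
    return merged


# The single Stein header rule, in a hypothetical dictionary form.
STEIN_HEADER_RULES = [
    {
        "path": "TEI.2/teiHeader/fileDesc/titleStmt/title",
        "match": ".*",
        "item": "title",
        "extract": "$0",
    },
]

# The first Stein file rule, in the same hypothetical form.
THREE_LIVES_FILE_RULE = {
    "name": "threelives-1.0.xml",
    "workTag": "tli",
    "author": "Stein, Gertrude",
    "pubDateStart": "1909",
}

if __name__ == "__main__":
    header = {"TEI.2/teiHeader/fileDesc/titleStmt/title": "Three Lives"}
    items = apply_header_rules(header, STEIN_HEADER_RULES)
    print(resolve_metadata(THREE_LIVES_FILE_RULE, items))
```

For Stein the two sources never conflict, since the header supplies only the title. The merge order matters only for collections where both a file rule and a header rule provide the same item, as in the 1832 versus 1847 example in the comment.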
Document generated by Confluence on Apr 19, 2009 15:04