This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Introduction 

MorphAdorner can add word-level morphological adornments
to XML texts encoded in either of two common formats: the Text Encoding
Initiative (TEI) format or the Text Creation Partnership (TCP) format.
Other XML formats can be accommodated using customized input methods.

MorphAdorner adds XML tags to mark words, punctuation, and whitespace.
All other XML tags which appear in the input file are passed through to the
output unchanged except for minor reformatting.

XML Tag types

In order to adorn an XML formatted text properly, MorphAdorner needs
to determine the reading context of each word in the input text by
constructing the reading sequence for the text. The reading context
for a word depends upon the type of XML tag in which
it appears as well as the text of its neighboring words.

A hard tag is an SGML, HTML, or XML tag which starts a new text segment but
does not interrupt the reading sequence of a text. Examples of hard tags
include <div> and <p>.

A jump tag is an SGML, HTML, or XML tag which interrupts the reading
sequence of a text and starts a new text segment. An example of a jump tag
is <note>. Jump tags initiate a new reading context.  The previous reading
sequence continues after the end of the jump tag.

A soft tag is an SGML, HTML, or XML tag which does not interrupt the
reading sequence of a text and does not start a new text segment. Some
soft tags provide textual decoration such as <hi> and <em>. Others
indicate textual milestones such as <milestone> or formatting such as <lb>.
Still others mark higher level text segments such as <rs>.
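
The effect of the three tag types on reading order can be illustrated with a small sketch. This is not MorphAdorner's implementation. The tag sets contain only the examples named above, every other tag is treated as soft, and the function name reading_segments is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: only the example tags named in the text are classified here.
HARD_TAGS = {"div", "p"}
JUMP_TAGS = {"note"}


def reading_segments(xml_string):
    """Split the text of an XML fragment into text segments.

    Hard and jump tags open a new segment. Text that follows the end of
    a jump tag is appended to the segment that was active before it, so
    the earlier reading sequence continues. Soft tags change nothing.
    """
    root = ET.fromstring(xml_string)
    segments = []

    def new_segment():
        segments.append([])
        return segments[-1]

    def walk(elem, current):
        if elem.tag in HARD_TAGS or elem.tag in JUMP_TAGS:
            inner = new_segment()
        else:
            inner = current  # soft tag: stay in the current segment
        if elem.text:
            inner.append(elem.text)
        for child in elem:
            walk(child, inner)
            if child.tail:
                # Text after the child's end tag belongs to this element.
                inner.append(child.tail)

    walk(root, new_segment())
    joined = ("".join(parts) for parts in segments)
    return [text for text in joined if text.strip()]


sample = "<p>The cat<note>See p. 4.</note> sat on the <hi>mat</hi>.</p>"
print(reading_segments(sample))
# prints ['The cat sat on the mat.', 'See p. 4.']
```

The note text forms its own segment, while the words on either side of the note, and the word inside the soft <hi> tag, stay together in one reading sequence.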

The <w> and <c> tags

MorphAdorner uses the <w> tag to enclose the text of a word,
a symbol, or a punctuation mark, and the <c> tag to enclose
whitespace.

The text enclosed by the <w></w> tags is the original token text,
which may be a complete word token, or a token fragment when the
token text is split by soft or jump tags. Split words are discussed below.

MorphAdorner normalizes the whitespace in input documents, mapping
all multiple blanks, tabs, and end of line characters to single blanks.
The normalized whitespace is output using the <c> tag.  Each <c> </c>
tag pair encloses a single whitespace character.

To prevent output lines from becoming too long, MorphAdorner emits
each <w></w> tag and each <c></c> tag on a separate output line.
Most other XML tags are also indented and emitted on separate lines. Because
of this "pretty-printing", programs which process the MorphAdorner output
should ignore end of line characters and use the contents of the <c></c>
tags to perform basic text spacing.
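
As a minimal sketch of that advice, the following Python function rebuilds running text using only the contents of <w> and <c> elements, so the line breaks and indentation added by pretty-printing are ignored. The function name plain_text and the sample fragment are hypothetical, and the sketch assumes the elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET


def plain_text(xml_string):
    """Concatenate the text of <w> and <c> elements in document order."""
    root = ET.fromstring(xml_string)
    pieces = []
    for elem in root.iter():
        # Assumption: tags are not namespace-qualified.
        if elem.tag in ("w", "c"):
            pieces.append(elem.text or "")
    return "".join(pieces)


# Hypothetical pretty-printed fragment, one element per line.
sample = """<p>
  <w>Lady</w>
  <c> </c>
  <w>Susan</w>
  <w>.</w>
</p>"""

print(plain_text(sample))  # prints: Lady Susan.
```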

<w> tag attributes

MorphAdorner defines the following attribute fields for each <w> tag.

xml:id Provides a unique id for the token or token fragment. This
should be treated as an opaque value. However, see the section
on split tokens below.
ord Specifies the ordinal of the token, beginning at 1 for the first
token. The ordinal is consecutive across all XML tags.
MorphAdorner assigns the same ordinal value to all parts of a token
split by soft tags since these token fragments appear consecutively
in the input file. Tokens split by jump tags receive different
ordinal values for non-consecutive fragments.
eos A value of "1" indicates this token ends a sentence.
A value of "0" indicates this token does not end a sentence.
The eos value is most accurately set for ordinary text. Tokens
within cells or other abbreviated entries may not be marked
correctly.
lem Provides the lemma form(s) of the token. For punctuation and
symbols this is the same as the spelling. For words, this is the
uninflected base form (head word) you would find in a dictionary.
When a word contains more than one lemma, a vertical bar
separates the lemma forms.
part Indicates which part of a split token this token text provides.
  • A value of "N" means the token text is unsplit.
  • A value of "I" means the token text is the first part of a split token.
  • A value of "M" means the token text is some part after the first but
    before the last.
  • A value of "F" means the token text is the last part of a split token.
pos The part of speech for the token. By default, MorphAdorner
uses the NUPos part of speech tag set. For symbols and punctuation
the part of speech is the same as the token. For words containing
more than one part of speech (e.g., contractions), a vertical bar
separates the part of speech tags.
reg A standardized, usually modern, version of the spelling.
For obsolete words, a representative standard form is chosen,
usually the Oxford English Dictionary headword form.
spe The spelling. This value combines the fragments
of a split word into the complete spelling. In most cases the
spe value will match the tok value. However, some
corpora use special metacharacters in the tokens which are
not intended to be part of a word. For example, the TCP/EEBO
texts use characters such as "+" and "|" to mark various
kinds of word breaks. The tok attribute value retains those
metacharacters for archival completeness, but the spe value
removes them.
tok The original token text. Includes all metacharacters
in the original text. The tok value may be a fragment of
the complete token when the token text is split by soft or jump tags.

Abbreviated attribute output

By default MorphAdorner outputs the full set of <w> attributes.
MorphAdorner can also output an abbreviated attribute set, in which
only non-redundant attribute values appear in the <w> tag. This produces
smaller output files with no loss of information, since the omitted attribute
field values can be restored from those of the other attributes or the
token text.

MorphAdorner uses the following algorithm to generate the abbreviated
set of <w> tag attributes.

  1. Let the token-text be the text enclosed within the <w></w> tag pair.
  2. When tok has the same value as the token-text, omit the tok attribute.
  3. When spe has the same value as tok, omit the spe attribute.
  4. When reg has the same value as spe, omit the reg attribute.
  5. When pos has the same value as tok, omit the pos attribute.
  6. When lem has the same value as spe, omit the lem attribute.
  7. When eos has the value "0", omit the eos attribute.
  8. When part has the value "N", omit the part attribute.
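
The steps above can be sketched in Python as follows. The function name abbreviate is hypothetical, and attributes are assumed to be held in a plain dictionary of strings. The sample values are those of the "Edward" token used elsewhere on this page.

```python
def abbreviate(attrs, token_text):
    """Return a copy of attrs without the redundant <w> attributes.

    attrs is assumed to hold the full attribute set as strings;
    token_text is the text enclosed by the <w></w> tag pair.
    """
    out = dict(attrs)
    tok = attrs["tok"]
    spe = attrs["spe"]
    if tok == token_text:
        del out["tok"]
    if spe == tok:
        del out["spe"]
    if attrs["reg"] == spe:
        del out["reg"]
    if attrs["pos"] == tok:
        del out["pos"]
    if attrs["lem"] == spe:
        del out["lem"]
    if attrs["eos"] == "0":
        del out["eos"]
    if attrs["part"] == "N":
        del out["part"]
    return out  # xml:id and ord are never removed


full = {
    "eos": "0", "lem": "Edward", "pos": "np1", "reg": "Edward",
    "spe": "Edward", "tok": "Edward", "xml:id": "ancf0207-050740",
    "part": "N", "ord": "4958",
}
print(abbreviate(full, "Edward"))
# prints {'pos': 'np1', 'xml:id': 'ancf0207-050740', 'ord': '4958'}
```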

The following algorithm can be used to reconstruct the full set of
<w> attributes from the abbreviated set.

  1. When tok is missing, set its value to the text enclosed by the <w></w> tags.
  2. When spe is missing, set its value to the value of tok.
  3. When reg is missing, set its value to the value of spe.
  4. When pos is missing, set its value to the value of tok.
  5. When lem is missing, set its value to the value of spe.
  6. When eos is missing, set its value to "0" (zero).
  7. When part is missing, set its value to "N".

The xml:id and ord attributes are always present, in both
abbreviated and full (verbose) output files.
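
The reconstruction steps can be sketched in Python as follows. The function name expand is hypothetical, and attributes are assumed to be held in a plain dictionary of strings. The order of the calls matters: tok must be restored before spe and pos, and spe before reg and lem.

```python
def expand(attrs, token_text):
    """Restore the full <w> attribute set from an abbreviated one.

    token_text is the text enclosed by the <w></w> tag pair.
    """
    full = dict(attrs)
    full.setdefault("tok", token_text)
    full.setdefault("spe", full["tok"])
    full.setdefault("reg", full["spe"])
    full.setdefault("pos", full["tok"])
    full.setdefault("lem", full["spe"])
    full.setdefault("eos", "0")
    full.setdefault("part", "N")
    return full


short = {"pos": "np1", "xml:id": "ancf0207-050740", "ord": "4958"}
restored = expand(short, "Edward")
print(restored["lem"], restored["eos"], restored["part"])
# prints: Edward 0 N
```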

Split tokens

Individual tokens in XML texts may be split by soft tags, and occasionally
by jump tags. MorphAdorner assembles the fragments of a split token into
a complete token and sets the tok and spe attributes of each fragment's
<w> tag to contain the complete token.

The xml:id value for each fragment of a split word ends with a period
followed by the part number (for example, ".1" for the first fragment).
The xml:id can still be treated as an opaque value, but the part number
can be extracted from the end if desired. In many cases the part number
is not needed, and the value of the part attribute of the <w> tag suffices.

  • part="N" means the token is unsplit (complete).
  • part="I" means the token is the first part of a split token.
  • part="M" means the token is some part after the first but before the last.
  • part="F" means the token is the last part of a split token.
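
When the part number is wanted, it can be pulled off the end of the xml:id value. The sketch below is one way to do so in Python. The function name split_word_id is hypothetical, and the sample ids follow the pattern used in the example on this page. The part attribute is consulted first, so the id of an unsplit token is never cut at a period.

```python
def split_word_id(xml_id, part):
    """Return (base_id, part_number) for a <w> tag.

    part is the value of the part attribute. Unsplit tokens (part "N")
    carry no part number, so None is returned for them.
    """
    if part == "N":
        return xml_id, None
    # The part number follows the last period in the id.
    base, _, number = xml_id.rpartition(".")
    return base, int(number)


print(split_word_id("ancf0207-050750.2", "M"))
# prints ('ancf0207-050750', 2)
print(split_word_id("ancf0207-050740", "N"))
# prints ('ancf0207-050740', None)
```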

Here is an example of a split word from Austen's Lady Susan
(ancf0207.xml). The original XML text is:

<p rend="align(r)">Edward S<hi rend="sup(1)">t</hi>.</p>

The "St." token is split into three pieces by soft tags.
The corresponding adorned text is:

<p rend="align(r)">
  <w eos="0" lem="Edward" pos="np1" reg="Edward"
     spe="Edward" tok="Edward" xml:id="ancf0207-050740" part="N"
     ord="4958">Edward</w>
  <c> </c>
  <w eos="1" lem="saint" pos="n1" reg="St." spe="St." tok="St."
     xml:id="ancf0207-050750.1" part="I" ord="4959">S</w>
  <hi rend="sup(1)">
    <w eos="1" lem="saint" pos="n1" reg="St." spe="St." tok="St."
       xml:id="ancf0207-050750.2" part="M" ord="4959">t</w>
  </hi>
  <w eos="1" lem="saint" pos="n1" reg="St." spe="St." tok="St."
     xml:id="ancf0207-050750.3" part="F" ord="4959">.</w>
</p>

The ord attribute value is the same for all three fragments
of "St.", as it is for any word split solely by soft tags.
The ord attribute values will not be the same for words split by
jump tags, as the individual word fragments can be separated by
hundreds or even thousands of other words.
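
Because ord values can differ for words split by jump tags, a program that reassembles split words may prefer to group fragments by xml:id. The sketch below does this in Python. It assumes that all fragments of one token share the xml:id text before the final period, as in the example above. The function name whole_tokens is hypothetical, and the sample is a shortened copy of the adorned text with most attributes left out.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def whole_tokens(xml_string):
    """Join the fragments of split tokens, grouped by xml:id base."""
    root = ET.fromstring(xml_string)
    tokens = {}  # base id -> list of (part number, fragment text)
    for w in root.iter("w"):
        xml_id = w.get(XML_ID)
        if w.get("part", "N") == "N":
            base, number = xml_id, 0
        else:
            base, _, suffix = xml_id.rpartition(".")
            number = int(suffix)
        tokens.setdefault(base, []).append((number, w.text or ""))
    # Sort each group by part number before joining the fragment text.
    return ["".join(text for _, text in sorted(parts))
            for parts in tokens.values()]


sample = """<p>
  <w xml:id="ancf0207-050740" part="N" ord="4958">Edward</w>
  <c> </c>
  <w xml:id="ancf0207-050750.1" part="I" ord="4959">S</w>
  <hi rend="sup(1)">
    <w xml:id="ancf0207-050750.2" part="M" ord="4959">t</w>
  </hi>
  <w xml:id="ancf0207-050750.3" part="F" ord="4959">.</w>
</p>"""

print(whole_tokens(sample))  # prints ['Edward', 'St.']
```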

Named Entities

MorphAdorner contains a procedure which can add named entity tags
to input texts. At present named entities are not added routinely for Monk.

Each named entity is enclosed by <rs type="named entity type"></rs> tags.
The type= attribute value specifies the type of named entity, which
may be one of the following.

type="date" A date reference (e.g., March 12).
type="location" A geographical location (e.g., England).
type="money" An amount of money (e.g., 1 shilling).
type="organization" An organization name (e.g., Bank of England).
type="person" A person's name (e.g., Emma Woodhouse).
type="time" A time reference (e.g., 12 midnight).
type="literary" A literary reference (e.g., Ivanhoe).
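
A program reading adorned output could collect named entities along the following lines. This is only a sketch. It assumes that each <rs> element wraps the <w> and <c> elements of the entity and that tags carry no XML namespace. The function name named_entities is hypothetical, and the sample fragment is made up for illustration, not actual MorphAdorner output.

```python
import xml.etree.ElementTree as ET


def named_entities(xml_string):
    """Return (type, text) pairs for each <rs> element."""
    root = ET.fromstring(xml_string)
    found = []
    for rs in root.iter("rs"):
        # Rebuild the entity text from its <w> and <c> children.
        text = "".join(elem.text or ""
                       for elem in rs.iter()
                       if elem.tag in ("w", "c"))
        found.append((rs.get("type"), text))
    return found


# Hypothetical fragment with most <w> attributes left out.
sample = """<p>
  <rs type="person">
    <w>Emma</w>
    <c> </c>
    <w>Woodhouse</w>
  </rs>
  <c> </c>
  <w>smiled</w>
</p>"""

print(named_entities(sample))  # prints [('person', 'Emma Woodhouse')]
```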