Monk Datastore Overview

Acolyte

Acolyte adds curator-supplied bibliographic data to TEI XML files.

Usage:

java edu.northwestern.at.monk.acolyte.Acolyte params

params = path to parameters file.

The parameters file

The parameters file is an XML file which specifies the parameters for the program. For example:

<params>

    <curatorDataFile>curator-data.txt</curatorDataFile>
    
    <directories corpus="eebo">
        <input>texts/eebo/adorned</input>
        <output>texts/eebo/bibadorned</output>
    </directories>
    
    <directories corpus="ncf">
        <input>texts/ncf/adorned</input>
        <output>texts/ncf/bibadorned</output>
    </directories>
    
    <directories corpus="sha">
        <input>texts/sha/adorned</input>
        <output>texts/sha/bibadorned</output>
    </directories>

</params>

The curatorDataFile element specifies the path to the curator data input file. In the example, this file is located at curator-data.txt.

Each directories element specifies a pair of input and output directories to be processed for a corpus. In the example, three such directory pairs are specified, one pair for each of the three corpora with the tags "eebo", "ncf", and "sha".

The input and output elements specify the paths to the input and output directories for a corpus. In the example, for the "eebo" corpus, the input directory is located at texts/eebo/adorned and the output directory is located at texts/eebo/bibadorned.
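As an illustration only, the following sketch reads a parameters file of this shape with the standard Java DOM API. The class name ParamsSketch and its methods are hypothetical and are not taken from Acolyte's source; the real program may read the file differently.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Acolyte.
class ParamsSketch {

    /** Parses parameters XML held in a string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse parameters XML", e);
        }
    }

    /** Returns the text of the curatorDataFile element. */
    static String curatorDataFile(Document doc) {
        return doc.getElementsByTagName("curatorDataFile")
            .item(0).getTextContent().trim();
    }

    /** Returns corpus tag -> {input directory, output directory}, in file order. */
    static Map<String, String[]> readDirectories(Document doc) {
        Map<String, String[]> result = new LinkedHashMap<>();
        NodeList dirs = doc.getElementsByTagName("directories");
        for (int i = 0; i < dirs.getLength(); i++) {
            Element dir = (Element) dirs.item(i);
            String input = dir.getElementsByTagName("input")
                .item(0).getTextContent().trim();
            String output = dir.getElementsByTagName("output")
                .item(0).getTextContent().trim();
            result.put(dir.getAttribute("corpus"), new String[] {input, output});
        }
        return result;
    }

    public static void main(String[] args) {
        String xml =
            "<params>"
            + "<curatorDataFile>curator-data.txt</curatorDataFile>"
            + "<directories corpus=\"sha\">"
            + "<input>texts/sha/adorned</input>"
            + "<output>texts/sha/bibadorned</output>"
            + "</directories>"
            + "</params>";
        Document doc = parse(xml);
        System.out.println("curator data: " + curatorDataFile(doc));
        for (Map.Entry<String, String[]> e : readDirectories(doc).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue()[0]
                + " -> " + e.getValue()[1]);
        }
    }
}
```

A LinkedHashMap is used so that the corpora come back in the order in which they appear in the parameters file.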

In each input directory, only files with the extension ".xml" are processed.

Output directories are created if they do not already exist.

In this example, the input file for Shakespeare's Hamlet is located at texts/sha/adorned/ham.xml, and the output file is written to texts/sha/bibadorned/ham.xml.
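The mapping from an input file to its output file can be sketched as follows. This is a minimal illustration under the rules stated above (same file name, different directory, ".xml" files only); the class OutputPathSketch and the method outputFor are hypothetical names, not Acolyte's own.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Acolyte.
class OutputPathSketch {

    /**
     * Returns the output path for an input file, or an empty Optional
     * if the file does not have the ".xml" extension and is skipped.
     */
    static Optional<Path> outputFor(Path inputFile, Path outputDir) {
        String name = inputFile.getFileName().toString();
        if (!name.endsWith(".xml")) {
            return Optional.empty();
        }
        // The output file keeps the input file's name.
        return Optional.of(outputDir.resolve(name));
    }

    public static void main(String[] args) {
        Path input = Paths.get("texts/sha/adorned/ham.xml");
        Path outputDir = Paths.get("texts/sha/bibadorned");
        System.out.println(outputFor(input, outputDir).orElse(null));
    }
}
```

To create a missing output directory, a program of this kind could call Files.createDirectories(outputDir) from java.nio.file, which does nothing if the directory already exists.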

The curator data file

The curator data file contains bibliographic data for all the works and authors. It is a tab-delimited text file. Each line contains 14 columns in the following order:

  1. File name, not including the ".xml" extension.
  2. Author sequence number (1, 2, 3, ...).
  3. Author name.
  4. Author birth year, or empty if unknown.
  5. Author death year, or empty if unknown.
  6. When the author flourished.
  7. The origin of the author.
  8. The gender of the author.
  9. The title of the work.
  10. The genre of the work.
  11. The subgenre of the work.
  12. The tag of the corpus.
  13. The circulation year of the work, or empty if unknown.
  14. The availability of the work.

If a work has more than one author, it has multiple lines in the curator data file, one per author.

For example, the following line provides the bibliographic data for Hamlet. Each \t stands for a tab character, and the subgenre column is empty:

ham\t1\tShakespeare, William\t1564\t1616\tn/a\tBritish Isles\tM\tHamlet\tplay\t\tsha\t1600\tunrestricted
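A line in this format can be split into its 14 columns along the following lines. This is a sketch only: the class CuratorLineSketch and the method parseLine are hypothetical, and the column-count check is an assumed example of the kind of consistency check a reader of this file might perform.

```java
// Hypothetical helper, not part of Acolyte.
class CuratorLineSketch {

    static final int COLUMN_COUNT = 14;

    /** Splits one curator data line into its 14 columns, keeping empty ones. */
    static String[] parseLine(String line) {
        // A negative limit makes split keep trailing empty columns.
        String[] columns = line.split("\t", -1);
        if (columns.length != COLUMN_COUNT) {
            throw new IllegalArgumentException(
                "Expected " + COLUMN_COUNT + " columns but found " + columns.length);
        }
        return columns;
    }

    public static void main(String[] args) {
        String line = String.join("\t",
            "ham", "1", "Shakespeare, William", "1564", "1616", "n/a",
            "British Isles", "M", "Hamlet", "play", "", "sha", "1600",
            "unrestricted");
        String[] columns = parseLine(line);
        // Column 1 is the file name, column 9 the title, column 12 the corpus tag.
        System.out.println(columns[0] + " | " + columns[8] + " | " + columns[11]);
    }
}
```

Passing -1 as the limit to String.split matters here, because otherwise a line whose last columns are empty would come back with fewer than 14 entries.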

The bibadorned output files

Each bibadorned output file is a copy of its corresponding TEI input file, with the bibliographic data for the work added as a new monkHeader element. The monkHeader element is inserted as the first child of the root TEI element, before the teiHeader element. For example, the output file for Hamlet begins as follows:

<?xml version="1.0" encoding="utf-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <monkHeader xmlns="http://monk.at.northwestern.edu/ns/1.0">
        <tag>sha-ham</tag>
        <corpus>sha</corpus>
        <fileName>ham</fileName>
        <title>Hamlet</title>
        <author>
            <name>Shakespeare, William</name>
            <birthYear>1564</birthYear>
            <deathYear>1616</deathYear>
            <flourished>n/a</flourished>
            <origin>British Isles</origin>
            <gender>M</gender>
        </author>
        <circulationYear>1600</circulationYear>
        <genre>play</genre>
        <subgenre></subgenre>
        <availability>unrestricted</availability>
    </monkHeader>
  <teiHeader>
    <fileDesc>
    ...

Note that the tag for a work is formed by joining the corpus tag, a hyphen, and the file name (without the ".xml" extension), as in "sha-ham".

If a work has more than one author, there are multiple author elements, one per author, in the order specified by the author sequence numbers in the curator data file.
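The two rules above, forming the work tag and ordering the authors, might look like this in Java. The class WorkSketch, its nested Author type, and the placeholder author names are hypothetical and serve only to illustrate the rules.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Acolyte.
class WorkSketch {

    /** Holds the sequence number and name from one curator data line. */
    static final class Author {
        final int sequence;
        final String name;

        Author(int sequence, String name) {
            this.sequence = sequence;
            this.name = name;
        }
    }

    /** Forms the work tag: corpus tag, hyphen, file name. */
    static String workTag(String corpus, String fileName) {
        return corpus + "-" + fileName;
    }

    /** Returns author names ordered by their sequence numbers. */
    static List<String> orderedNames(List<Author> authors) {
        List<Author> sorted = new ArrayList<>(authors);
        sorted.sort(Comparator.comparingInt(a -> a.sequence));
        List<String> names = new ArrayList<>();
        for (Author author : sorted) {
            names.add(author.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(workTag("sha", "ham"));

        // Placeholder authors, listed out of order on purpose.
        List<Author> authors = new ArrayList<>();
        authors.add(new Author(2, "Second Author"));
        authors.add(new Author(1, "First Author"));
        System.out.println(orderedNames(authors));
    }
}
```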

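Note that we use two XML namespaces: http://www.tei-c.org/ns/1.0 for the TEI elements, and http://monk.at.northwestern.edu/ns/1.0 for the monkHeader element and its children.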

The report file

Acolyte writes a report to stdout. The report lists the tags of all the works processed.

If there are any errors or inconsistencies in the curator data file, an error message is written to the report and Acolyte terminates without processing any of the text files.

If a work has an XML file for which there is no data in the curator data file, the work is not processed. The report lists the tags of all unprocessed works of this kind.

If a work has data in the curator data file, but there is no corresponding input XML file, the work is not processed. The report lists the tags of all unprocessed works of this kind.
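Both kinds of unprocessed works amount to a set difference between the tags that have an input XML file and the tags that have curator data. The following sketch shows one way to compute them; the class ReportSketch, the method missingFrom, and the tags "sha-mac" and "sha-lr" are hypothetical and used for illustration only.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Acolyte.
class ReportSketch {

    /** Returns the tags present in 'have' but absent from 'other', sorted. */
    static Set<String> missingFrom(Set<String> have, Set<String> other) {
        Set<String> difference = new TreeSet<>(have);
        difference.removeAll(other);
        return difference;
    }

    public static void main(String[] args) {
        // Illustrative tags only.
        Set<String> withXmlFile = new TreeSet<>(Arrays.asList("sha-ham", "sha-mac"));
        Set<String> withCuratorData = new TreeSet<>(Arrays.asList("sha-ham", "sha-lr"));

        System.out.println("XML file but no curator data: "
            + missingFrom(withXmlFile, withCuratorData));
        System.out.println("Curator data but no XML file: "
            + missingFrom(withCuratorData, withXmlFile));
    }
}
```

Works whose tags appear in both sets are the ones that would be processed.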
