Monk Datastore Overview
Acolyte adds curator-supplied bibliographic data to TEI XML files.
Usage:
    java edu.northwestern.at.monk.acolyte.Acolyte params
where params is the path to the parameters file.
The parameters file is an XML file which specifies the parameters for the program. For example:

    <params>
        <curatorDataFile>curator-data.txt</curatorDataFile>
        <directories corpus="eebo">
            <input>texts/eebo/adorned</input>
            <output>texts/eebo/bibadorned</output>
        </directories>
        <directories corpus="ncf">
            <input>texts/ncf/adorned</input>
            <output>texts/ncf/bibadorned</output>
        </directories>
        <directories corpus="sha">
            <input>texts/sha/adorned</input>
            <output>texts/sha/bibadorned</output>
        </directories>
    </params>

The curatorDataFile element specifies the path to the curator data input file. In the example, this file is located at curator-data.txt.
Each directories element specifies a pair of input and output directories to be processed for a corpus. In the example, three such directory pairs are specified, one pair for each of the three corpora with the tags "eebo", "ncf", and "sha".
The input and output elements specify the paths to the input and output directories for a corpus. In the example, for the "eebo" corpus, the input directory is located at texts/eebo/adorned and the output directory is located at texts/eebo/bibadorned.
For each input directory, only files in the directory with the extension ".xml" are processed.
Output directories are created if they do not already exist.
In this example, the input file for Shakespeare's Hamlet is located at texts/sha/adorned/ham.xml, and the output file is written to texts/sha/bibadorned/ham.xml.
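The mapping from a parameters file to concrete input and output paths can be sketched as follows. This is an illustrative Python sketch, not Acolyte's own Java code: the function names read_params and plan_files are hypothetical, and only the element and attribute names (params, curatorDataFile, directories, corpus, input, output) come from the example above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# A shortened copy of the example parameters file (one corpus only).
EXAMPLE_PARAMS = """
<params>
  <curatorDataFile>curator-data.txt</curatorDataFile>
  <directories corpus="sha">
    <input>texts/sha/adorned</input>
    <output>texts/sha/bibadorned</output>
  </directories>
</params>
"""

def read_params(xml_text):
    """Return (curator_data_path, [(corpus, input_dir, output_dir), ...])."""
    root = ET.fromstring(xml_text)
    curator_data = root.findtext("curatorDataFile").strip()
    pairs = [
        (d.get("corpus"), d.findtext("input").strip(), d.findtext("output").strip())
        for d in root.findall("directories")
    ]
    return curator_data, pairs

def plan_files(input_dir, output_dir):
    """Pair every *.xml file in input_dir with the same file name in output_dir.

    Files with other extensions are skipped, mirroring the rule that only
    ".xml" files are processed. Nothing is read or written here.
    """
    return [(p, Path(output_dir) / p.name) for p in sorted(Path(input_dir).glob("*.xml"))]

if __name__ == "__main__":
    curator_data, pairs = read_params(EXAMPLE_PARAMS)
    print(curator_data)
    for corpus, in_dir, out_dir in pairs:
        print(corpus, in_dir, "->", out_dir)
```

With the example layout, plan_files("texts/sha/adorned", "texts/sha/bibadorned") would pair ham.xml in the input directory with ham.xml in the output directory.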
The curator data file contains bibliographic data for all the works and authors. It is a tab-delimited text file. Each line contains 14 columns in the following order:
- File name, not including the ".xml" extension.
- Author sequence number (1, 2, 3, ...).
- Author name.
- Author birth year, or empty if unknown.
- Author death year, or empty if unknown.
- When the author flourished.
- The origin of the author.
- The gender of the author.
- The title of the work.
- The genre of the work.
- The subgenre of the work.
- The tag of the corpus.
- The circulation year of the work, or empty if unknown.
- The availability of the work.
If a work has more than one author, it has multiple lines in the curator data file, one per author.
For example, the following line provides the bibliographic data for Hamlet:
ham\t1\tShakespeare, William\t1564\t1616\tn/a\tBritish Isles\tM\tHamlet\tplay\t\tsha\t1600\tunrestricted
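To make the column layout concrete, here is a minimal Python sketch of reading such lines. It is not Acolyte's implementation: the field names in COLUMNS and the functions parse_line and group_by_work are hypothetical labels chosen for this example, and the only behavior taken from the description above is the 14-column tab-delimited layout and the one-line-per-author rule.

```python
from collections import defaultdict

# Hypothetical field names for the 14 columns, in the documented order.
COLUMNS = [
    "file_name", "author_seq", "author_name", "birth_year", "death_year",
    "flourished", "origin", "gender", "title", "genre", "subgenre",
    "corpus", "circulation_year", "availability",
]

def parse_line(line):
    """Split one tab-delimited line into a dict keyed by COLUMNS."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return dict(zip(COLUMNS, fields))

def group_by_work(lines):
    """Group rows by (corpus, file name), ordering authors by sequence number."""
    works = defaultdict(list)
    for line in lines:
        if line.strip():  # ignore blank lines
            row = parse_line(line)
            works[(row["corpus"], row["file_name"])].append(row)
    for rows in works.values():
        rows.sort(key=lambda r: int(r["author_seq"]))
    return dict(works)

if __name__ == "__main__":
    hamlet = ("ham\t1\tShakespeare, William\t1564\t1616\tn/a\tBritish Isles"
              "\tM\tHamlet\tplay\t\tsha\t1600\tunrestricted")
    row = parse_line(hamlet)
    print(row["title"], row["corpus"], repr(row["subgenre"]))
```

Empty columns, such as the subgenre in the Hamlet line, come through as empty strings rather than being dropped, so the column count stays at 14.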
Each bibadorned output file is a copy of its corresponding TEI input file, with the bibliographic data for the work added as a new monkHeader element at the beginning of the file. For example, the output file for Hamlet begins as follows:

    <?xml version="1.0" encoding="utf-8"?>
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
        <monkHeader xmlns="http://monk.at.northwestern.edu/ns/1.0">
            <tag>sha-ham</tag>
            <corpus>sha</corpus>
            <fileName>ham</fileName>
            <title>Hamlet</title>
            <author>
                <name>Shakespeare, William</name>
                <birthYear>1564</birthYear>
                <deathYear>1616</deathYear>
                <flourished>n/a</flourished>
                <origin>British Isles</origin>
                <gender>M</gender>
            </author>
            <circulationYear>1600</circulationYear>
            <genre>play</genre>
            <subgenre></subgenre>
            <availability>unrestricted</availability>
        </monkHeader>
        <teiHeader>
            <fileDesc>
                ...

Note that the tag for a work is formed from the corpus tag, a hyphen, and the file name.
If a work has more than one author, there are multiple author elements, one per author, in the order specified by the author sequence numbers in the curator data file.
Note that we use two XML namespaces:
- For TEI proper: http://www.tei-c.org/ns/1.0
- For Monk extensions to TEI: http://monk.at.northwestern.edu/ns/1.0
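The structure of the monkHeader element can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than Acolyte's actual code: build_monk_header and the dictionary keys are hypothetical, while the element names, their order, the tag format, and the Monk namespace are taken from the Hamlet example. The default_namespace argument of ET.tostring requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

MONK_NS = "http://monk.at.northwestern.edu/ns/1.0"

def build_monk_header(rows):
    """Build a monkHeader element from the curator rows for one work.

    rows: one dict per author (hypothetical keys); work-level values are
    read from the first row after sorting by author sequence number.
    """
    def child(parent, name, text):
        el = ET.SubElement(parent, f"{{{MONK_NS}}}{name}")
        el.text = text
        return el

    rows = sorted(rows, key=lambda r: int(r["author_seq"]))
    work = rows[0]
    header = ET.Element(f"{{{MONK_NS}}}monkHeader")
    # The tag is the corpus tag, a hyphen, and the file name.
    child(header, "tag", f"{work['corpus']}-{work['file_name']}")
    child(header, "corpus", work["corpus"])
    child(header, "fileName", work["file_name"])
    child(header, "title", work["title"])
    for r in rows:  # one author element per author, in sequence order
        author = child(header, "author", None)
        child(author, "name", r["author_name"])
        child(author, "birthYear", r["birth_year"])
        child(author, "deathYear", r["death_year"])
        child(author, "flourished", r["flourished"])
        child(author, "origin", r["origin"])
        child(author, "gender", r["gender"])
    child(header, "circulationYear", work["circulation_year"])
    child(header, "genre", work["genre"])
    child(header, "subgenre", work["subgenre"])
    child(header, "availability", work["availability"])
    return header

if __name__ == "__main__":
    hamlet = {
        "file_name": "ham", "author_seq": "1", "author_name": "Shakespeare, William",
        "birth_year": "1564", "death_year": "1616", "flourished": "n/a",
        "origin": "British Isles", "gender": "M", "title": "Hamlet",
        "genre": "play", "subgenre": "", "corpus": "sha",
        "circulation_year": "1600", "availability": "unrestricted",
    }
    print(ET.tostring(build_monk_header([hamlet]), encoding="unicode",
                      default_namespace=MONK_NS))
```

Qualifying every element with the Monk namespace keeps the header distinct from the surrounding TEI elements, which live in the TEI namespace.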
Acolyte writes a report to stdout. The report lists the tags of all the works processed.
If there are any errors or inconsistencies in the curator data file, an error message is written to the report and Acolyte terminates without processing any of the text files.
If a work has an XML file for which there is no data in the curator data file, the work is not processed. The report lists the tags of all unprocessed works of this kind.
If a work has data in the curator data file, but there is no corresponding input XML file, the work is not processed. The report lists the tags of all unprocessed works of this kind.
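The two "not processed" cases above amount to a set comparison between the tags found in the curator data file and the tags of the input XML files. The following Python sketch shows that comparison under stated assumptions: the function name classify_works and the result keys are hypothetical, and the sketch does not reproduce the format of Acolyte's actual report.

```python
def classify_works(curator_tags, input_tags):
    """Split work tags into processed and unprocessed groups.

    curator_tags: tags (corpus-fileName) that have curator data.
    input_tags:   tags that have an input XML file.
    """
    curator_tags, input_tags = set(curator_tags), set(input_tags)
    return {
        # Data and file both present: the work is processed.
        "processed": sorted(curator_tags & input_tags),
        # XML file exists but the curator data file has no entry.
        "file_without_data": sorted(input_tags - curator_tags),
        # Curator data exists but no input XML file was found.
        "data_without_file": sorted(curator_tags - input_tags),
    }

if __name__ == "__main__":
    result = classify_works(
        curator_tags=["sha-ham", "sha-mac"],
        input_tags=["sha-ham", "sha-oth"],
    )
    for group, tags in result.items():
        print(group, tags)
```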