Monk Datastore Overview
Prior, together with the cdb.csh script, creates the Monk MySQL database.
Usage:
java edu.northwestern.at.monk.prior.Prior params
params = path to the parameters file.
For large texts, Prior requires a large Java heap and the 64-bit version of Java. For example, to run Prior with a 14 gigabyte heap, use:
java -Xms14g -Xmx14g edu.northwestern.at.monk.prior.Prior params
The parameters file is an XML file which specifies the parameters for the program. For example:
<params>
  <wordClasses>word-classes.xml</wordClasses>
  <pos>pos.xml</pos>
  <outputDir>tables</outputDir>
  <corpus tag="eebo" title="Early English Books Online"/>
  <corpus tag="ncf" title="Nineteenth Century Fiction"/>
  <corpus tag="wright" title="Wright American Fiction"/>
  <corpus tag="eaf" title="Early American Fiction"/>
  <corpus tag="sha" title="Shakespeare"/>
  <texts>texts/eebo/bibadorned</texts>
  <texts>texts/ncf/bibadorned</texts>
  <texts>texts/wright/bibadorned</texts>
  <texts>texts/eaf/bibadorned</texts>
  <texts>texts/sha/bibadorned</texts>
</params>

The wordClasses element specifies the path to the NUPOS word classes definition file.

The pos element specifies the path to the NUPOS parts of speech definition file.

The outputDir element specifies the path to the output directory. Prior writes one file to this directory for each table in the MySQL database, with the file name the same as the table name. Each file is a tab-delimited text file in MySQL "load data infile" import format. The output directory is created if it doesn't already exist.

The corpus elements enumerate all the corpora and specify their tags and titles.

The texts elements enumerate the input directories of bibadorned files as produced by Acolyte. Only files with the extension ".xml" are processed.

The input texts must be in the format produced by Abbot, MorphAdorner, and Acolyte. That is, they must be bibadorned, morphadorned, TEI-A files.
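Because a run over large texts is expensive, it can be useful to check the parameters file before starting Prior. The following is a minimal sketch, not part of Monk: it assumes only the element names described above, and the function name summarize_params is hypothetical. It parses an abbreviated copy of the example with Python's standard library and collects the settings.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example parameters file shown above.
SAMPLE = """<params>
  <wordClasses>word-classes.xml</wordClasses>
  <pos>pos.xml</pos>
  <outputDir>tables</outputDir>
  <corpus tag="eebo" title="Early English Books Online"/>
  <corpus tag="sha" title="Shakespeare"/>
  <texts>texts/eebo/bibadorned</texts>
  <texts>texts/sha/bibadorned</texts>
</params>"""


def summarize_params(xml_text):
    """Return the settings found in a Prior parameters file as a dict.

    Hypothetical helper: it reads only the elements described in this
    document and does not reproduce any checks that Prior itself makes.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "params":
        raise ValueError("root element must be <params>")
    return {
        "wordClasses": root.findtext("wordClasses"),
        "pos": root.findtext("pos"),
        "outputDir": root.findtext("outputDir"),
        # Map each corpus tag to its title.
        "corpora": {c.get("tag"): c.get("title") for c in root.findall("corpus")},
        # Input directories, in document order.
        "texts": [(t.text or "").strip() for t in root.findall("texts")],
    }


summary = summarize_params(SAMPLE)
print(summary["corpora"])
print(summary["texts"])
```

To check a real file, read its contents and pass the text to summarize_params in place of SAMPLE.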
The word classes definition file defines all the word classes. For example:

<wordClasses>
  <wordClass id="j" majorClass="adjective">adjective</wordClass>
  <wordClass id="jn" majorClass="adjective">adjective/noun</wordClass>
  ...
  <wordClass id="it" majorClass="foreign word">Italian</wordClass>
  <wordClass id="ge" majorClass="foreign word">German</wordClass>
</wordClasses>

Each wordClass element defines a single word class. The id attribute is the tag of the word class. The majorClass attribute is the major word class of the word class. The text is the description of the word class.
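The structure of a wordClass element can be illustrated by reading the file into a lookup table. This is a sketch under stated assumptions rather than Prior's own code: the function name load_word_classes is hypothetical, and rejecting duplicate ids is an assumption, since this document does not say how Prior treats them.

```python
import xml.etree.ElementTree as ET

# Entries taken from the example above; the elided entries are omitted.
SAMPLE = """<wordClasses>
  <wordClass id="j" majorClass="adjective">adjective</wordClass>
  <wordClass id="jn" majorClass="adjective">adjective/noun</wordClass>
  <wordClass id="it" majorClass="foreign word">Italian</wordClass>
  <wordClass id="ge" majorClass="foreign word">German</wordClass>
</wordClasses>"""


def load_word_classes(xml_text):
    """Map each word class tag to a (major class, description) pair."""
    root = ET.fromstring(xml_text)
    classes = {}
    for element in root.findall("wordClass"):
        tag = element.get("id")
        # Assumption: ids are meant to be unique within the file.
        if tag in classes:
            raise ValueError(f"duplicate word class id: {tag}")
        classes[tag] = (element.get("majorClass"), (element.text or "").strip())
    return classes


word_classes = load_word_classes(SAMPLE)
print(word_classes["jn"])
```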
The parts of speech definition file defines all of the parts of speech. For example:

<partsOfSpeech>
  <pos id="(" syntax="pu" tense="" mood="" case="" person="" number="" degree="" negative="" wordClass="pu"/>
  ...
  <pos id="zz" syntax="zz" tense="" mood="" case="" person="" number="" degree="" negative="" wordClass="zz"/>
</partsOfSpeech>

Each pos element defines a single part of speech. The id attribute is the tag of the part of speech. The syntax through negative attributes are the values for the various part of speech categories. The wordClass attribute is the word class of the part of speech.
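Since each pos element names a word class, the two definition files can be checked against each other. The sketch below assumes that every wordClass value should match an id in the word classes file; this document does not state whether Prior enforces that, and the function name undefined_word_classes is hypothetical.

```python
import xml.etree.ElementTree as ET

# The two entries shown in the example above.
POS_SAMPLE = """<partsOfSpeech>
  <pos id="(" syntax="pu" tense="" mood="" case="" person="" number="" degree="" negative="" wordClass="pu"/>
  <pos id="zz" syntax="zz" tense="" mood="" case="" person="" number="" degree="" negative="" wordClass="zz"/>
</partsOfSpeech>"""


def undefined_word_classes(pos_xml, known_word_classes):
    """Return {pos id: wordClass} for parts of speech whose word class is unknown."""
    root = ET.fromstring(pos_xml)
    missing = {}
    for element in root.findall("pos"):
        word_class = element.get("wordClass")
        if word_class not in known_word_classes:
            missing[element.get("id")] = word_class
    return missing


# Hypothetical set of known word class tags; "zz" is left out on purpose
# so that the check has something to report.
problems = undefined_word_classes(POS_SAMPLE, {"pu"})
print(problems)
```

In practice the set of known tags would come from the word classes definition file rather than a literal.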
Prior writes a report to stdout. The report lists all the works processed and any error messages. Error messages begin with "#####".
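Because error messages are marked with a fixed prefix, a saved copy of the report can be scanned for them. The following is a small sketch: the function name find_errors is hypothetical, and the sample lines are invented for illustration, since the actual wording of Prior's report is not shown here.

```python
def find_errors(report_lines):
    """Return (line number, text) pairs for lines that begin with '#####'."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(report_lines, start=1)
        if line.startswith("#####")
    ]


# Hypothetical report text; the real wording of Prior's messages may differ.
sample_report = [
    "work-0001\n",
    "##### example error message\n",
    "work-0002\n",
]
errors = find_errors(sample_report)
print(errors)
```

An open file can be passed in place of the list, for example after redirecting Prior's stdout to a file.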
Prior can be run in "debug" mode to check texts which have not yet been adorned with bibliographic information. In this mode, the input texts must be morphadorned TEI-A files, but they do not have to be bibadorned.
To process a directory of texts in debug mode, use the debugCorpusTag attribute on the texts element in the parameters file. For example:

<texts debugCorpusTag="ecco">texts/ecco/adorned</texts>