Monk Datastore Overview

Prior

Prior, together with the cdb.csh script, creates the Monk MySQL database.

Usage:

java edu.northwestern.at.monk.prior.Prior params

params = path to parameters file.

For large texts, Prior requires a large Java heap and the 64-bit version of Java. For example, to run Prior with a 14 gigabyte heap, use:

java -Xms14g -Xmx14g edu.northwestern.at.monk.prior.Prior params

The parameters file

The parameters file is an XML file which specifies the parameters for the program. For example:

<params>

    <wordClasses>word-classes.xml</wordClasses>
    <pos>pos.xml</pos>
    
    <outputDir>tables</outputDir>
    
    <corpus tag="eebo" title="Early English Books Online"/>
    <corpus tag="ncf" title="Nineteenth Century Fiction"/>
    <corpus tag="wright" title="Wright American Fiction"/>
    <corpus tag="eaf" title="Early American Fiction"/>
    <corpus tag="sha" title="Shakespeare"/>

    <texts>texts/eebo/bibadorned</texts> 
    <texts>texts/ncf/bibadorned</texts> 
    <texts>texts/wright/bibadorned</texts>
    <texts>texts/eaf/bibadorned</texts>
    <texts>texts/sha/bibadorned</texts> 

</params>

The wordClasses element specifies the path to the NUPOS word classes definition file.

The pos element specifies the path to the NUPOS parts of speech definition file.

The outputDir element specifies the path to the output directory. Prior writes one file to this directory for each table in the MySQL database, with the file name the same as the table name. Each file is a tab-delimited text file in MySQL "load data infile" import format. The output directory is created if it doesn't already exist.
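Because each output file is named after its table, the import statements can be derived from the directory listing alone. The following is a minimal sketch, not the contents of cdb.csh: the function name load_statements and the table names "word" and "work" are hypothetical. It relies on the MySQL defaults for LOAD DATA INFILE (tab-separated fields, newline-terminated lines), so no FIELDS or LINES clause is written.

```python
import tempfile
from pathlib import Path


def load_statements(output_dir):
    """Build one LOAD DATA INFILE statement per file in output_dir."""
    statements = []
    for path in sorted(Path(output_dir).iterdir()):
        if path.is_file():
            # The file name is the table name, as described above.
            statements.append(
                f"LOAD DATA INFILE '{path.resolve().as_posix()}' "
                f"INTO TABLE `{path.name}`;"
            )
    return statements


# Demo on a temporary directory holding two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for table in ("word", "work"):  # hypothetical table names
        (Path(tmp) / table).touch()
    statements = load_statements(tmp)

for statement in statements:
    print(statement)
```

The statements are only printed here; running them against a database is left to whatever client is in use.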

The corpus elements enumerate all the corpora and specify their tags and titles.

The texts elements enumerate the input directories of bibadorned files as produced by Acolyte. Only files with the extension ".xml" are processed.

The input texts must be in the format produced by Abbot, MorphAdorner, and Acolyte. That is, they must be bibadorned, morphadorned, TEI-A files.
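To check a parameters file before a long run, it can help to read it back and list what it declares. The sketch below uses Python's standard xml.etree.ElementTree module on a shortened copy of the example above. The function name read_params and the shape of the returned dictionary are assumptions made for illustration, and this is not how Prior itself reads the file.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example parameters file.
PARAMS_XML = """<params>
    <wordClasses>word-classes.xml</wordClasses>
    <pos>pos.xml</pos>
    <outputDir>tables</outputDir>
    <corpus tag="eebo" title="Early English Books Online"/>
    <corpus tag="sha" title="Shakespeare"/>
    <texts>texts/eebo/bibadorned</texts>
    <texts>texts/sha/bibadorned</texts>
</params>"""


def read_params(xml_text):
    """Return the parameters as a plain dictionary (hypothetical layout)."""
    root = ET.fromstring(xml_text)

    def text_of(name):
        # Missing elements yield an empty string rather than an error.
        return (root.findtext(name) or "").strip()

    return {
        "wordClasses": text_of("wordClasses"),
        "pos": text_of("pos"),
        "outputDir": text_of("outputDir"),
        # Map each corpus tag to its title.
        "corpora": {c.get("tag"): c.get("title") for c in root.findall("corpus")},
        # Input directories, in document order.
        "texts": [(t.text or "").strip() for t in root.findall("texts")],
    }


params = read_params(PARAMS_XML)
print(params["corpora"])
print(params["texts"])
```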

The NUPOS word classes definition file

This file defines all the word classes.

<wordClasses>
    <wordClass id="j"
        majorClass="adjective">adjective</wordClass>
    <wordClass id="jn"
        majorClass="adjective">adjective/noun</wordClass>
    ...
    <wordClass id="it"
        majorClass="foreign word">Italian</wordClass>
    <wordClass id="ge"
        majorClass="foreign word">German</wordClass>
</wordClasses>

Each wordClass element defines a single word class. The id attribute is the tag of the word class. The majorClass attribute is the major word class to which the word class belongs. The text content of the element is the description of the word class.
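As an illustration of this layout, the following sketch reads the four word classes shown above into a lookup table keyed by tag. The function name read_word_classes and the tuple layout are assumptions. The elided entries ("...") of the example are simply left out.

```python
import xml.etree.ElementTree as ET

# The four word classes shown in the example above.
WORD_CLASSES_XML = """<wordClasses>
    <wordClass id="j"
        majorClass="adjective">adjective</wordClass>
    <wordClass id="jn"
        majorClass="adjective">adjective/noun</wordClass>
    <wordClass id="it"
        majorClass="foreign word">Italian</wordClass>
    <wordClass id="ge"
        majorClass="foreign word">German</wordClass>
</wordClasses>"""


def read_word_classes(xml_text):
    """Map each word class tag to (major class, description)."""
    root = ET.fromstring(xml_text)
    return {
        wc.get("id"): (wc.get("majorClass"), (wc.text or "").strip())
        for wc in root.findall("wordClass")
    }


word_classes = read_word_classes(WORD_CLASSES_XML)
print(word_classes["jn"])
```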

The NUPOS parts of speech definition file

This file defines all of the parts of speech.

<partsOfSpeech>
    <pos id="("
        syntax="pu"
        tense=""
        mood=""
        case=""
        person=""
        number=""
        degree=""
        negative=""
        wordClass="pu"/>
    ...
    <pos id="zz"
        syntax="zz"
        tense=""
        mood=""
        case=""
        person=""
        number=""
        degree=""
        negative=""
        wordClass="zz"/>
</partsOfSpeech>

Each pos element defines a single part of speech. The id attribute is the tag of the part of speech. The attributes from syntax through negative (syntax, tense, mood, case, person, number, degree, and negative) give the values for the various part of speech categories. The wordClass attribute is the tag of the word class to which the part of speech belongs.
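Since every pos element refers to a word class by tag, one useful sanity check is that each referenced tag is actually defined. The sketch below reads the two parts of speech shown above and reports word class tags that are missing from a given set. Both function names are hypothetical, and whether Prior performs a check like this is not stated here.

```python
import xml.etree.ElementTree as ET

# The two parts of speech shown in the example above.
POS_XML = """<partsOfSpeech>
    <pos id="(" syntax="pu" tense="" mood="" case="" person=""
        number="" degree="" negative="" wordClass="pu"/>
    <pos id="zz" syntax="zz" tense="" mood="" case="" person=""
        number="" degree="" negative="" wordClass="zz"/>
</partsOfSpeech>"""


def read_pos(xml_text):
    """Map each part of speech tag to a dictionary of its attributes."""
    root = ET.fromstring(xml_text)
    return {p.get("id"): dict(p.attrib) for p in root.findall("pos")}


def undefined_word_classes(pos_table, word_class_ids):
    """Return word class tags used by pos elements but not defined."""
    used = {a["wordClass"] for a in pos_table.values() if "wordClass" in a}
    return sorted(used - set(word_class_ids))


pos_table = read_pos(POS_XML)
# Pretend only "pu" is defined in the word classes file.
missing = undefined_word_classes(pos_table, {"pu"})
print(missing)
```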

The report file

Prior writes a report to stdout. The report lists all the works processed and any error messages. Error messages begin with "#####".
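When the report is long, the error messages can be pulled out by their prefix. The sketch below assumes the report was captured as text, for example by redirecting stdout to a file. The sample report lines are invented for illustration, since the report layout beyond the "#####" prefix is not specified here.

```python
# Invented sample report; only the "#####" prefix comes from the description.
REPORT = """\
texts/sha/bibadorned/example-work-1.xml
##### hypothetical error message
texts/sha/bibadorned/example-work-2.xml
"""


def error_lines(report_text):
    """Return the report lines that begin with the error prefix."""
    return [line for line in report_text.splitlines() if line.startswith("#####")]


errors = error_lines(REPORT)
print(len(errors), "error(s)")
for line in errors:
    print(line)
```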

Debug mode

Prior can be run in "debug" mode to check texts which have not yet been adorned with bibliographic information. In this mode, the input texts must be morphadorned TEI-A files, but they do not have to be bibadorned.

To process a directory of texts in debug mode, use the debugCorpusTag attribute on the texts element in the parameters file. For example:

<texts debugCorpusTag="ecco">texts/ecco/adorned</texts> 
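To see which input directories a parameters file marks for debug mode, the texts elements can be split by the presence of the debugCorpusTag attribute. This is a minimal sketch: the function name split_texts is hypothetical, and mixing ordinary and debug texts elements in one parameters file, as the sample does, is an assumption rather than documented behaviour.

```python
import xml.etree.ElementTree as ET

# Sample mixing one ordinary and one debug texts element (an assumption).
PARAMS_XML = """<params>
    <texts>texts/sha/bibadorned</texts>
    <texts debugCorpusTag="ecco">texts/ecco/adorned</texts>
</params>"""


def split_texts(xml_text):
    """Return (ordinary directories, {debug directory: corpus tag})."""
    root = ET.fromstring(xml_text)
    ordinary, debug = [], {}
    for t in root.findall("texts"):
        path = (t.text or "").strip()
        tag = t.get("debugCorpusTag")
        if tag is None:
            ordinary.append(path)
        else:
            debug[path] = tag
    return ordinary, debug


ordinary, debug = split_texts(PARAMS_XML)
print(ordinary)
print(debug)
```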
