Monk Datastore Overview
A Word object represents a single occurrence of a word somewhere in the text
of a work. Words have the following attributes:
tag: A short unique string identifier for the word.
corpus: The corpus that contains the word.
work: The work that contains the word.
workPart: The work part that contains the word.
numWordParts: The number of word parts in the word.
wordParts: The array of word parts in the word.
spelling: The spelling = the token mapped to lower case.
standardSpelling: The standard spelling mapped to lower case.
context: The TEI context of the word.
verse: True if the word is in verse, false if it is in prose.
paratext: True if the word is in paratext, false if it is in main text.
puncBefore: The string of punctuation preceding the word.
token: The token = the word exactly as it appears in the text.
puncAfter: The string of punctuation following the word.
lineBreak: True if a line break follows the word.
parBreak: True if a paragraph break follows the word.
workOrdinal: The ordinal of the word within its work, starting at 0 for the first word.
colOrdinal: A collocation ordinal for the word.
endOfSentence: True if this word is at the end of a sentence.
sentenceInWorkOrdinal: The ordinal of this word's sentence within its work, starting at 0 for the first sentence.
wordInSentenceOrdinal: The ordinal of the word within its sentence, starting at 0 for the first word in the sentence.
Word parts have the following attributes:
word: The word to which the word part belongs.
lemma: The lemma of the word part.
pos: The part of speech of the word part.
Word parts are used for contractions. For example, consider the first word of Shakespeare's Hamlet, the word "who's" in the question asked by Bernardo "Who's there?" This word has two parts. The word has the following attributes:
spelling = who's
token = Who's
word part 1:
    lemma = who (crq)
    part of speech = q-crq
word part 2:
    lemma = be (va)
    part of speech = vax
Note that this word has only one spelling and one token, but it has two lemmas and two parts of speech.
The puncBefore, token, puncAfter, lineBreak and parBreak attributes are used for generating concordances and in other contexts which need to present a textual representation of a sequence of words.
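As a rough illustration of how these attributes combine into display text, here is a minimal sketch. It does not use the Monk classes: `SimpleWord` and `WordTextSketch` are hypothetical stand-ins, and joining words with a single space is an assumption. The " / " marker for line breaks follows the convention this document describes for formatAsString.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in holding only the text-related attributes of a word. */
final class SimpleWord {
    final String puncBefore;
    final String token;
    final String puncAfter;
    final boolean lineBreak;

    SimpleWord(String puncBefore, String token, String puncAfter, boolean lineBreak) {
        this.puncBefore = puncBefore;
        this.token = token;
        this.puncAfter = puncAfter;
        this.lineBreak = lineBreak;
    }
}

class WordTextSketch {
    /** Joins words into one string, marking line breaks with " / ". */
    static String format(List<SimpleWord> words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            SimpleWord w = words.get(i);
            sb.append(w.puncBefore).append(w.token).append(w.puncAfter);
            if (i < words.size() - 1) {
                // Assumption: one space between words unless a line break follows.
                sb.append(w.lineBreak ? " / " : " ");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<SimpleWord> words = Arrays.asList(
            new SimpleWord("", "Who's", "", false),
            new SimpleWord("", "there", "?", true),
            new SimpleWord("", "Nay", ",", false),
            new SimpleWord("", "answer", "", false),
            new SimpleWord("", "me", ".", false));
        System.out.println(format(words)); // prints: Who's there? / Nay, answer me.
    }
}
```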
The context attribute is the path of TEI element names leading from and including the element that created the work part containing the word down to but not including the w element for the word, separated by slashes, with leading and trailing slashes. For example:
/div/p/
/div/sp/l/
/trailer/hi/
A word is considered to be "in verse" if it is a descendant of an l element in the TEI source. Otherwise, it is considered to be "in prose."
A word is considered to be in "paratext" if it is a descendant of any of the following elements in the TEI source: back, bibl, castGroup, castItem, castList, docImprint, docAuthor, docDate, docEdition, docTitle, figure, front, head, note, ref, role, roleDesc, speaker, stage, titlePage or trailer. Otherwise, it is considered to be in "main text".
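The two rules above can be approximated from a context path alone. The sketch below is an illustration under assumptions, not the datastore's implementation: `ContextFlagsSketch` is a hypothetical class, and because a context path starts at the element that created the work part, it cannot see ancestors higher up in the TEI source.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ContextFlagsSketch {
    /** The paratext element names listed in the text above. */
    static final Set<String> PARATEXT = new HashSet<>(Arrays.asList(
        "back", "bibl", "castGroup", "castItem", "castList", "docImprint",
        "docAuthor", "docDate", "docEdition", "docTitle", "figure", "front",
        "head", "note", "ref", "role", "roleDesc", "speaker", "stage",
        "titlePage", "trailer"));

    /** True if the context path contains an l element. */
    static boolean isVerse(String context) {
        for (String name : context.split("/")) {
            if (name.equals("l")) return true;
        }
        return false;
    }

    /** True if the context path contains any paratext element. */
    static boolean isParatext(String context) {
        for (String name : context.split("/")) {
            if (PARATEXT.contains(name)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        for (String context : new String[] {"/div/p/", "/div/sp/l/", "/trailer/hi/"}) {
            System.out.println(context + " verse=" + isVerse(context)
                + " paratext=" + isParatext(context));
        }
    }
}
```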
The colOrdinal attribute is a "collocation ordinal" for the word, a number with the following useful property: word1 is n words to the left of word2 and in the same work part as word2 if and only if word1.getColOrdinal() + n = word2.getColOrdinal(), for all n < 2^31.
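One encoding that would give a number this property is to pack a work part index into the high 32 bits of a long and the word's position within the work part into the low 32 bits. This is only a sketch of the idea: the actual encoding used by the datastore is not described here, and `ColOrdinalSketch`, `colOrdinal` and `isNWordsLeftOf` are hypothetical names. It assumes positions and offsets are non-negative Java ints.

```java
class ColOrdinalSketch {
    /** Packs a work part index and a position within that work part into one long. */
    static long colOrdinal(int workPartIndex, int position) {
        // Assumption: position is a non-negative int, so it occupies only the low 31 bits.
        return ((long) workPartIndex << 32) | position;
    }

    /** True if the word with ordinal a is exactly n words to the left of the word with ordinal b. */
    static boolean isNWordsLeftOf(long a, long b, int n) {
        // Adding a non-negative int offset can never carry into the work part bits,
        // so equality also implies that both words are in the same work part.
        return a + n == b;
    }

    public static void main(String[] args) {
        long a = colOrdinal(3, 10);
        long b = colOrdinal(3, 15);
        long c = colOrdinal(4, 15);
        System.out.println(isNWordsLeftOf(a, b, 5)); // prints: true
        System.out.println(isNWordsLeftOf(a, c, 5)); // prints: false
    }
}
```

With this layout, a collocation test reduces to one addition and one comparison, with no separate check that the two words share a work part.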
The workOrdinal, colOrdinal, sentenceInWorkOrdinal and wordInSentenceOrdinal attributes define orderings of the words and sentences within each work part and of the words in each sentence. This ordering is the same as the ordering of the words and sentences in the TEI-A source, except that all note elements are moved to the end of each work part.
Word objects are immutable and comparable. The natural ordering of word objects is first by corpus, then by work, then by work ordinal.
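The natural ordering described above can be pictured with a small stand-in class. This is a sketch under assumptions: `OrderedWord` is hypothetical, and comparing corpora and works by their tag strings is a guess, since the text does not say how corpora and works themselves are ordered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical immutable word carrying only the fields used for ordering. */
final class OrderedWord implements Comparable<OrderedWord> {
    final String corpusTag;
    final String workTag;
    final long workOrdinal;

    OrderedWord(String corpusTag, String workTag, long workOrdinal) {
        this.corpusTag = corpusTag;
        this.workTag = workTag;
        this.workOrdinal = workOrdinal;
    }

    /** Orders first by corpus, then by work, then by work ordinal. */
    @Override
    public int compareTo(OrderedWord other) {
        int c = corpusTag.compareTo(other.corpusTag);
        if (c != 0) return c;
        c = workTag.compareTo(other.workTag);
        if (c != 0) return c;
        return Long.compare(workOrdinal, other.workOrdinal);
    }

    @Override
    public String toString() {
        return corpusTag + "/" + workTag + "/" + workOrdinal;
    }
}

class OrderingSketch {
    public static void main(String[] args) {
        List<OrderedWord> words = new ArrayList<>(Arrays.asList(
            new OrderedWord("sha", "ham", 7),
            new OrderedWord("sha", "ham", 2),
            new OrderedWord("aus", "emma", 9)));
        Collections.sort(words);
        System.out.println(words); // prints: [aus/emma/9, sha/ham/2, sha/ham/7]
    }
}
```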
Unlike core objects, the Word objects are not all read into memory at initialization, because there are too many of them. There is no getAll method.
There can be multiple copies of a Word object in memory at the same time, so you must use the equals method, not the == operator, when comparing words for equality.
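The difference matters whenever the same word is loaded twice. The sketch below uses a hypothetical `TaggedWord` class rather than the real Word class, and it assumes that equality is decided by the unique tag, which this document does not state explicitly.

```java
/** Hypothetical immutable word identified by its tag. */
final class TaggedWord {
    private final String tag;

    TaggedWord(String tag) {
        this.tag = tag;
    }

    /** Assumption: two copies are equal when their tags are equal. */
    @Override
    public boolean equals(Object other) {
        return other instanceof TaggedWord && ((TaggedWord) other).tag.equals(tag);
    }

    @Override
    public int hashCode() {
        return tag.hashCode();
    }
}

class EqualitySketch {
    public static void main(String[] args) {
        // Two separate in-memory copies of the same word (the tag is made up).
        TaggedWord first = new TaggedWord("example-tag-1");
        TaggedWord second = new TaggedWord("example-tag-1");
        System.out.println(first == second);      // prints: false
        System.out.println(first.equals(second)); // prints: true
    }
}
```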
The static method Word.get gets a word given its tag. The static methods
Word.find search for words using collections of search criteria.
The static methods Word.sort sort arrays and collections of words.
    /** Gets a word by tag and prints some of its attributes.
     *
     *  @param tag  Tag.
     *
     *  @throws ModelException
     */

    void getWordAndPrintAttributes (String tag)
        throws ModelException
    {
        Word word = Word.get(tag);
        if (word == null) {
            System.out.println("There is no word with tag: " + tag);
            return;
        }
        System.out.println("tag = " + tag);
        System.out.println("corpus = " + word.getCorpus().getTitle());
        System.out.println("work = " + word.getWork().getTitle());
        System.out.println("work part = " + word.getWorkPart().getTitle());
        System.out.println("word parts:");
        for (WordPart wordPart : word.getWordParts()) {
            System.out.println(" lemma = " + wordPart.getLemma().getTag());
            System.out.println(" pos = " + wordPart.getPos().getTag());
        }
        System.out.println("spelling = " + word.getSpelling().getTag());
        System.out.println("punc before = $" + word.getPuncBefore() + "$");
        System.out.println("punc after = $" + word.getPuncAfter() + "$");
        System.out.println("work ordinal = " + word.getWorkOrdinal());
        System.out.println("collocation ordinal = " + word.getColOrdinal());
    }
In this example we print the text for a work part "vertically," one word per line. On each line we print the word's spelling plus its lemmas and parts of speech.
    /** Prints a work part for "vertical" reading.
     *
     *  @param workPart  Work part.
     *
     *  @throws ModelException
     */

    void printWorkPartVertical (WorkPart workPart)
        throws ModelException
    {
        Word[] words = workPart.getWords();
        for (Word word : words) {
            System.out.print(word.getSpelling().getTag());
            for (WordPart wordPart : word.getWordParts())
                System.out.print(" " + wordPart.getLemma().getTag() +
                    "/" + wordPart.getPos().getTag());
            System.out.println();
        }
    }

Note the call to workPart.getWords(). This convenience method finds all the words in a work part and returns them in order. It is implemented in the WorkPart class as follows:

    /** Gets the words in the work part.
     *
     *  @return  Array of words in the work part, in order.
     *
     *  @throws ModelException
     *           <br>Could not get word list
     */

    public Word[] getWords ()
        throws ModelException
    {
        try {
            Collection<Word> words = Word.find(new WorkPartCriterion(this));
            return Word.sort(words, Word.SortOption.WORK_ORDINAL_ASCENDING);
        } catch (Exception e) {
            throw new ModelException("Could not get word list", e);
        }
    }
    /** Prints an HTML concordance of all occurrences of a lemma in a corpus.
     *
     *  @param corpus  Corpus.
     *
     *  @param lemma   Lemma.
     *
     *  @throws ModelException
     */

    void printHtmlConcordance (Corpus corpus, Lemma lemma)
        throws ModelException
    {
        Collection<Word> words = Word.find(
            new CorpusCriterion(corpus),
            new LemmaCriterion(lemma));
        Collection<Concordance> result = Concordance.find(10, words);
        System.out.println("<table>");
        for (Concordance c : result) {
            String leftText = c.getLeftText(50);
            Word word = c.getWord();
            String rightText = c.getRightText(50);
            System.out.println("<tr>");
            System.out.println("<td align=\"right\">" +
                escape(leftText) + "</td>");
            System.out.println("<td align=\"left\">" +
                "<b>" + escape(word.getSpelling().getTag()) + "</b>" +
                escape(rightText) + "</td>");
            System.out.println("</tr>");
        }
        System.out.println("</table>");
    }

    /** Escapes a string for HTML.
     *
     *  @param str  String.
     *
     *  @return  Escaped string.
     */

    String escape (String str) {
        // Escape "&" first so the entities added below are not escaped again.
        str = str.replace("&", "&amp;");
        str = str.replace("<", "&lt;");
        str = str.replace(">", "&gt;");
        return str;
    }

This example uses the Concordance class to build the concordance after doing the search. The call to Concordance.find requests that KWIC lines be constructed for 10 words to the left and the right of each word found in the search. The calls to getLeftText, getWord, and getRightText get the left text truncated to 50 characters, the word in the middle that was the search result, and the right text truncated to 50 characters. These items are output in HTML table format. The search result words are aligned vertically and displayed in boldface.
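To make the idea of a KWIC line concrete, here is a minimal sketch that builds left and right context from a plain list of tokens. It is not the Concordance class: `KwicSketch` and its methods are hypothetical, words are joined with single spaces, and truncating the left text from its left end (so the characters nearest the keyword survive) is an assumption.

```java
import java.util.Arrays;
import java.util.List;

class KwicSketch {
    /** Up to numWords tokens before the hit, keeping at most the last maxChars characters. */
    static String leftText(List<String> tokens, int hit, int numWords, int maxChars) {
        int from = Math.max(0, hit - numWords);
        String text = String.join(" ", tokens.subList(from, hit));
        return text.length() <= maxChars ? text : text.substring(text.length() - maxChars);
    }

    /** Up to numWords tokens after the hit, keeping at most the first maxChars characters. */
    static String rightText(List<String> tokens, int hit, int numWords, int maxChars) {
        int to = Math.min(tokens.size(), hit + 1 + numWords);
        String text = String.join(" ", tokens.subList(hit + 1, to));
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("to", "be", "or", "not", "to", "be");
        int hit = 3; // index of "not"
        System.out.println(leftText(tokens, hit, 2, 50) + " [" + tokens.get(hit) + "] "
            + rightText(tokens, hit, 2, 50)); // prints: be or [not] to be
    }
}
```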
In this example we do a full text search for patterns exemplified by the phrase "handsome, clever, and rich," that is, patterns of the form "adjective, adjective, optional coordinating conjunction, adjective."
To be precise, we search the full sequence of word parts in a work for the following pattern:
part of speech syntax category = j or jp
part of speech syntax category = j or jp
optional part of speech word class = cc
part of speech syntax category = j or jp

Only phrases which do not cross sentence boundaries are printed.
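Stripped of the datastore classes, the pattern is a small fixed-shape match over a sequence of labels. The sketch below is a simplification under assumptions: each word part is reduced to a single string label, which merges the syntax category ("j", "jp") and word class ("cc") distinction made above, and `AdjectivePatternSketch` is a hypothetical name. It does not check sentence boundaries.

```java
class AdjectivePatternSketch {
    /** True for the two adjective syntax categories named in the pattern. */
    static boolean isAdjective(String label) {
        return label.equals("j") || label.equals("jp");
    }

    /** Returns the length of the match starting at index i (3 or 4), or 0 if there is none. */
    static int matchAt(String[] labels, int i) {
        if (i + 2 >= labels.length) return 0;
        if (!isAdjective(labels[i]) || !isAdjective(labels[i + 1])) return 0;
        if (labels[i + 2].equals("cc")) {
            // Optional conjunction present: a third adjective must follow it.
            return (i + 3 < labels.length && isAdjective(labels[i + 3])) ? 4 : 0;
        }
        return isAdjective(labels[i + 2]) ? 3 : 0;
    }

    public static void main(String[] args) {
        // Labels for something like "handsome, clever, and rich" followed by a noun.
        String[] labels = {"j", "j", "cc", "j", "n"};
        System.out.println(matchAt(labels, 0)); // prints: 4
        System.out.println(matchAt(labels, 1)); // prints: 0
    }
}
```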
    /** Finds and prints patterns like "handsome, clever, and rich" in a work.
     *
     *  @param work  Work.
     *
     *  @throws ModelException
     */

    void handsomeCleverAndRich (Work work)
        throws ModelException
    {
        Collection<WorkPart> workParts = work.getDescendants();
        SyntaxCategory j = SyntaxCategory.get("j");
        SyntaxCategory jp = SyntaxCategory.get("jp");
        WordClass cc = WordClass.get("cc");
        for (WorkPart workPart : workParts) {
            if (!workPart.hasWords()) continue;
            WordPart[] parts = workPart.getWordParts();
            int len = parts.length;
            for (int i = 0; i < len-2; i++) {
                WordPart part1 = parts[i];
                SyntaxCategory syntax = part1.getPos().getSyntaxCategory();
                if (syntax != j && syntax != jp) continue;
                WordPart part2 = parts[i+1];
                syntax = part2.getPos().getSyntaxCategory();
                if (syntax != j && syntax != jp) continue;
                WordPart part3 = parts[i+2];
                WordClass wordClass = part3.getPos().getWordClass();
                if (wordClass == cc) {
                    if (i+3 >= len) continue;
                    WordPart part4 = parts[i+3];
                    syntax = part4.getPos().getSyntaxCategory();
                    if (syntax != j && syntax != jp) continue;
                    if (crossSentenceBoundary(part1, part2, part3, part4))
                        continue;
                    printMatch(part1, part2, part3, part4);
                } else {
                    syntax = part3.getPos().getSyntaxCategory();
                    if (syntax != j && syntax != jp) continue;
                    if (crossSentenceBoundary(part1, part2, part3)) continue;
                    printMatch(part1, part2, part3);
                }
            }
        }
    }

    /** Returns true if a sequence of word parts crosses a sentence boundary.
     *
     *  @param parts  Array of word parts.
     *
     *  @return  True if word parts cross a sentence boundary.
     */

    private boolean crossSentenceBoundary (WordPart... parts) {
        int numParts = parts.length;
        Word lastWord = parts[numParts-1].getWord();
        long sentenceInWorkOrdinal = lastWord.getSentenceInWorkOrdinal();
        for (int i = 0; i < numParts-1; i++) {
            Word word = parts[i].getWord();
            if (word.getSentenceInWorkOrdinal() != sentenceInWorkOrdinal)
                return true;
        }
        return false;
    }

    /** Prints a match.
     *
     *  @param parts  Word parts matching the pattern.
     */

    void printMatch (WordPart... parts) {
        Word location = parts[0].getWord();
        String phrase = WordPart.formatAsString(parts);
        System.out.println(location.getTag() + ": " + phrase);
    }

Note the call to WordPart.formatAsString(parts). This convenience method formats a contiguous sequence of word parts as a string, with punctuation, and with line breaks represented by " / ". There's a similar method Word.formatAsString to format a contiguous sequence of words.

Also note that checking to see if a pattern match crosses a sentence boundary is a bit tricky, because we are matching sequences of word parts, but sentence ordinal is a property of words, not word parts.
    /** Finds all occurrences of a lemma in a work which appear in speeches.
     *
     *  @param work   Work.
     *
     *  @param lemma  Lemma.
     *
     *  @return  Collection of all the words in the work which have the
     *           specified lemma and which occur inside a TEI "sp"
     *           element.
     *
     *  @throws ModelException
     */

    Collection<Word> findWordsInSpeeches (Work work, Lemma lemma)
        throws ModelException
    {
        return Word.find(
            new WorkCriterion(work),
            new LemmaCriterion(lemma),
            new ContextPatternCriterion("*/sp/*"));
    }
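How ContextPatternCriterion interprets its pattern is not spelled out here. One plausible reading, used in the sketch below, is a glob in which "*" matches any run of characters and everything else matches literally. `ContextPatternSketch` is a hypothetical class built on java.util.regex, not part of the datastore.

```java
import java.util.regex.Pattern;

class ContextPatternSketch {
    /** Translates a glob with "*" wildcards into a regular expression. */
    static Pattern compile(String glob) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char ch : glob.toCharArray()) {
            if (ch == '*') {
                if (literal.length() > 0) {
                    // Quote literal text so characters like "." are not treated as regex syntax.
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(".*");
            } else {
                literal.append(ch);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    /** True if the whole context path matches the glob. */
    static boolean matches(String glob, String context) {
        return compile(glob).matcher(context).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*/sp/*", "/div/sp/l/")); // prints: true
        System.out.println(matches("*/sp/*", "/div/p/"));    // prints: false
    }
}
```

Because context paths have leading and trailing slashes, the pattern "*/sp/*" under this reading matches an sp element at any depth without also matching element names that merely contain the letters "sp".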