This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

This is a companion to Amit's notes titled Nora Chunk and Nora DB. It describes the parallel facilities in WordHoard.

This is only a very high level overview and comparison. For the technical details, see the documentation at the WordHoard web site. In particular, see:

Notes for Developers: http://wordhoard.northwestern.edu/userman/dev-intro.html
Adding New Texts: http://wordhoard.northwestern.edu/userman/text-intro.html
Object Model Drawings: http://wordhoard.northwestern.edu/userman/other-files/model.pdf
The Javadoc: http://wordhoard.northwestern.edu/userman/javadoc/index.html

We of course keep essentially the same information about works and their parts in WordHoard as in Nora; we just do it a bit differently and use slightly different terminology. What Nora calls a "collection" we call a "corpus". What Nora calls a "chunk" we call a "work part". I believe that our uses of the term "work" have pretty much the same meaning, although there are often decisions to be made with even such a simple concept as this one. For example, in WordHoard we made a decision to make all of Shakespeare's sonnets a single "work", with each individual sonnet a separate "part" within the work, grouped into sections of 20 sonnets each to make navigation a little bit easier.

Both the external and internal representations of this information are different in Nora and in WordHoard.

For the external representation we both use XML. In Nora, the "chunk file" is separate from the work XML files proper. In WordHoard we have a separate XML file to define corpus (collection) attributes, but the work parts (chunks) and their attributes are defined inside the work XML files proper, at the "div" level. This is a mostly trivial distinction. In WordHoard, we could easily separate out the work part structural information into a separate XML file or files. Mapping between the two ways of representing this information is a simple matter. They are functionally equivalent, differing only in the most minor of details.

Both Nora and WordHoard have an "ingest pipeline" that takes the external XML representation of the data and transforms it into an internal representation. From Amit's notes I can see that Nora makes more of an effort to provide tools to help curators ingest new works from the "wild". We did not have the time to address this issue to any great extent in WordHoard, although we certainly would have liked to. Pib's current work on MorphAdorner is a step in this general direction, and this is certainly an important issue for Monk.

For the internal representation, Nora uses what appears to be a combination of the original XML files, the chunk XML files, Lucene, eXist, some Java code that ties these things together in a way not clear to me, and I'm not sure what if anything else. Amit's descriptions assume a level of familiarity with these other technologies that I do not possess. (Bill and Pib are more familiar with them than I am.)

In WordHoard, our internal representation of all of this information is a plain old Java object model, what's called a "POJO" in the industry. For example, we have a "Corpus" object which contains both corpus-level attributes (title, counts, etc.) and the collection of all the works which comprise the corpus. We have a "Work" object which contains work-level attributes (title, author(s), publication date(s), counts, etc.). Each Work object is also the root of a tree of "WorkPart" objects. WorkPart objects contain work part attributes (full title, short title, counts, etc.) and a reference to the text for the work part. For example, in our Shakespeare corpus, each play is a work which has a list of work parts for the acts, and each act is a work part that contains a list of work parts for the scenes in the act, and each work part for a scene contains a reference to the text for the scene. And so on. This representation in WordHoard of the notions of "corpus", "work", and "work part" is not at all complicated or especially significant or difficult. It's just a handful of regular Java classes which have the expected simple relationships with each other. Because our notion of "work part" is not much more or less than a simple tree rooted at the work, we can accommodate most literary structures, from drama to novels to ancient Greek epics to poetry to whatever else might be thrown our way. As in Nora, however, we typically require the assistance of a curator or some other domain expert to help define the chunks, give them ids, give them names if they don't already have reasonable ones, etc. This is all part of the process of preparing a text for ingest into a system such as Nora, WordHoard, or Monk.
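To make the shape of this concrete, here is a minimal sketch of such a tree-shaped model. It is only an illustration: the class names follow the terms used above, but the fields, constructors, and helper methods are assumptions and are much simpler than the real WordHoard classes documented in the Javadoc.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the corpus / work / work part idea.
// These are not the real WordHoard classes; see the Javadoc for those.
class ObjectModelSketch {

    /** A node in the tree of parts: an act, a scene, a sonnet, a chapter, ... */
    static class WorkPart {
        private final String shortTitle;
        private final String text; // null for purely structural parts such as acts
        private final List<WorkPart> children = new ArrayList<>();

        WorkPart(String shortTitle, String text) {
            this.shortTitle = shortTitle;
            this.text = text;
        }

        WorkPart add(WorkPart child) { children.add(child); return this; }
        String getShortTitle() { return shortTitle; }
        String getText() { return text; }
        List<WorkPart> getChildren() { return children; }
    }

    /** Work-level attributes plus the top-level parts of the work. */
    static class Work {
        private final String title;
        private final String author;
        private final List<WorkPart> children = new ArrayList<>();

        Work(String title, String author) {
            this.title = title;
            this.author = author;
        }

        Work add(WorkPart part) { children.add(part); return this; }
        String getTitle() { return title; }
        String getAuthor() { return author; }
        List<WorkPart> getChildren() { return children; }
    }

    /** Corpus-level attributes plus the works that make up the corpus. */
    static class Corpus {
        private final String title;
        private final List<Work> works = new ArrayList<>();

        Corpus(String title) { this.title = title; }
        Corpus add(Work work) { works.add(work); return this; }
        String getTitle() { return title; }
        List<Work> getWorks() { return works; }
    }

    /** Builds a tiny corpus: one play, two acts, three scenes with placeholder text. */
    static Corpus sampleCorpus() {
        WorkPart act1 = new WorkPart("Act 1", null)
            .add(new WorkPart("Scene 1", "placeholder text for 1.1"))
            .add(new WorkPart("Scene 2", "placeholder text for 1.2"));
        WorkPart act2 = new WorkPart("Act 2", null)
            .add(new WorkPart("Scene 1", "placeholder text for 2.1"));
        Work play = new Work("Hamlet", "William Shakespeare").add(act1).add(act2);
        return new Corpus("Shakespeare").add(play);
    }

    public static void main(String[] args) {
        Corpus corpus = sampleCorpus();
        Work play = corpus.getWorks().get(0);
        System.out.println(play.getTitle() + " has " + play.getChildren().size() + " acts");
    }
}
```

The same three classes would serve equally well for a novel (chapters under the work) or for the sonnets (sections under the work, sonnets under the sections), since nothing in the tree is specific to drama.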

In WordHoard, this "chunk" information is stored along with the rest of our object model on a large MySQL relational database. We use the Hibernate Java object persistence architecture to manage this datastore. This is all mostly transparent to the programmer, however. Programmers write their code directly to the object model, without having to concern themselves for the most part with the persistence details. There is no need for special APIs for basic object model traversal. The programmer just uses the normal Java programming conventions for this kind of work. E.g., "c.getWorks()" gets a collection of all the works in a corpus "c", "w.getChildren()" gets the child work parts for a work "w", "p.getText()" gets the text for a work part "p", and so on. The programmer does not need to worry about invoking remote methods on a server, or constructing SQL queries or API requests to some other data store architecture, or converting requests and responses back and forth between XML and internal representations, or anything else complicated. The complexity is managed by Hibernate behind the scenes.
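As a rough illustration of what "writing directly to the object model" looks like, the sketch below walks a work part tree with nothing but getters. The Supplier-based loader is a hand-rolled stand-in for the persistence layer, included only to show that a fetch can hide behind an ordinary getter; it is an assumption for illustration and is not how Hibernate is implemented or how WordHoard is coded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: plain getters on the outside, deferred loading on the inside.
// The Supplier is a stand-in for the persistence layer, not Hibernate itself.
class TraversalSketch {

    static class WorkPart {
        private final String title;
        private final Supplier<String> loader; // fetches the text on first use; may be null
        private String text;
        private boolean loaded;
        private final List<WorkPart> children = new ArrayList<>();

        WorkPart(String title, Supplier<String> loader) {
            this.title = title;
            this.loader = loader;
        }

        WorkPart add(WorkPart child) { children.add(child); return this; }
        String getTitle() { return title; }
        List<WorkPart> getChildren() { return children; }

        /** Callers see an ordinary getter; the fetch happens at most once. */
        String getText() {
            if (!loaded) {
                text = (loader == null) ? null : loader.get();
                loaded = true;
            }
            return text;
        }
    }

    /** Depth-first walk that collects the text of every part that has some. */
    static List<String> collectText(WorkPart part) {
        List<String> out = new ArrayList<>();
        String text = part.getText();
        if (text != null) {
            out.add(text);
        }
        for (WorkPart child : part.getChildren()) {
            out.addAll(collectText(child));
        }
        return out;
    }

    public static void main(String[] args) {
        int[] fetches = {0}; // counts simulated trips to the datastore
        WorkPart act = new WorkPart("Act 1", null)
            .add(new WorkPart("Scene 1", () -> { fetches[0]++; return "text of 1.1"; }))
            .add(new WorkPart("Scene 2", () -> { fetches[0]++; return "text of 1.2"; }));
        System.out.println(collectText(act));
        System.out.println(collectText(act)); // second walk reuses the cached text
        System.out.println("fetches: " + fetches[0]);
    }
}
```

The point of the sketch is that collectText contains no SQL, no remote calls, and no XML conversion. Whatever the datastore does happens behind getText and getChildren.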

Of course, in WordHoard, this "work part" or "chunk" model we've talked about is just a small part of our full object model. We don't treat it specially or differently from the other parts of our model. This has never been all that big a deal for us, just another part of the program that we implemented to fit together with all the many other parts. We used basically the same "POJO" strategy for all the many other parts of our model and our system.

I'm not certain how flexible or powerful Nora's query language is. In the WordHoard model, we can ask essentially any possible question about the underlying data, using either the higher level Hibernate query language or direct low-level SQL, usually the former because it's much more convenient and it expresses the queries and their results directly in terms of our Java objects, which makes everything easier to program. We did not attempt to design our own query language. This would have been reinventing a truly enormous wheel. We rather leveraged the existing mature and sophisticated tools and languages for querying these kinds of large complex datastores.

It seems clear that Nora and WordHoard support the same essential broad categories of queries over this data, although again all of the many details of how Nora does this and exactly what kinds of queries it supports are not at all clear to me.

We can both deliver the text for a work part. In Nora, this appears to be in either XML or HTML format. In WordHoard, we have our own abstract text model and internal representation of the model, along with Swing components to present interactive text to the user in our Java Swing end-user application. We were forced to do this because the existing Swing tools at our disposal for dealing with XML and HTML text were sadly (and inexcusably, IMHO) not adequate for our requirements in WordHoard. The WordHoard model could, however, easily deliver XML and/or HTML versions of these text objects. This is another small matter of detail, not an essential issue or problem.

To briefly go beyond the work part model into related other parts of the model, in Nora and WordHoard we can both deliver detailed information about the words in a work part or parts, including absolute and relative frequency counts for analysis routines. We can deliver morphological information for individual words and collections of words, collocation information, and bigram and trigram information. In the WordHoard model, these kinds of requests can be defined in terms of any kind of word attributes (part of speech, lemma, word class, major word class, spelling, etc.) and in terms of the relationships between words and other parts of our model (e.g., all the adverbs in Shakespeare spoken by female characters in comedies published in some particular decade). I don't know about Nora when it comes to this level of detail. Basically, in WordHoard, a word's being "part of a chunk" is just another attribute or relationship, not treated any differently from any of its very large number of other attributes and relationships. It's just another search criterion, only one of several dozen others that can be used to find and analyze words.
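The idea that chunk membership is "just another search criterion" can be sketched with composable predicates. This is an in-memory illustration only: WordHoard answers such questions with database queries, and the attribute names, work part ids, and toy words below are all invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical in-memory sketch. WordHoard itself answers such questions with
// database queries; the attribute names and toy data here are invented.
class WordCriteriaSketch {

    /** One word occurrence with a few illustrative attributes. */
    static class Word {
        final String spelling;
        final String partOfSpeech;
        final String speakerGender;
        final String genre;
        final String workPartId;

        Word(String spelling, String partOfSpeech, String speakerGender,
             String genre, String workPartId) {
            this.spelling = spelling;
            this.partOfSpeech = partOfSpeech;
            this.speakerGender = speakerGender;
            this.genre = genre;
            this.workPartId = workPartId;
        }
    }

    // Each criterion is just a predicate over a word; none of them is special.
    static Predicate<Word> partOfSpeech(String value) { return w -> w.partOfSpeech.equals(value); }
    static Predicate<Word> speakerGender(String value) { return w -> w.speakerGender.equals(value); }
    static Predicate<Word> genre(String value) { return w -> w.genre.equals(value); }
    static Predicate<Word> inWorkPart(String id) { return w -> w.workPartId.equals(id); }

    /** Returns the spellings of all words that satisfy the combined criteria. */
    static List<String> find(List<Word> words, Predicate<Word> criteria) {
        return words.stream()
            .filter(criteria)
            .map(w -> w.spelling)
            .collect(Collectors.toList());
    }

    /** Invented sample data, not drawn from the real database. */
    static List<Word> toyWords() {
        return Arrays.asList(
            new Word("sweetly", "adverb", "female", "comedy", "play1.1.1"),
            new Word("boldly", "adverb", "male", "comedy", "play1.1.1"),
            new Word("gently", "adverb", "female", "tragedy", "play2.2.1"),
            new Word("love", "noun", "female", "comedy", "play1.1.2"),
            new Word("softly", "adverb", "female", "comedy", "play1.1.2"));
    }

    public static void main(String[] args) {
        Predicate<Word> query = partOfSpeech("adverb")
            .and(speakerGender("female"))
            .and(genre("comedy"));
        System.out.println(find(toyWords(), query));
        // Restricting to one work part is one more criterion, nothing more.
        System.out.println(find(toyWords(), query.and(inWorkPart("play1.1.2"))));
    }
}
```

Note that inWorkPart composes with the other criteria in exactly the same way as partOfSpeech or genre, which is the sense in which the chunk is not treated differently.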

Both Nora and WordHoard support the delivery of the data needed to present concordances, and perhaps this is as good a place as any to present a somewhat extended example. In WordHoard, this is based on the use of doubly-linked lists of word occurrences in our model. Given a set of word occurrences (e.g., the result of a search operation), a concordance is generated by simply traversing these lists to the left and the right using each word occurrence in the result set as a starting point. There isn't anything special or terribly complicated about this operation. It's just another simple traversal of the object graph - a pair of loops which invoke the "getPrev" and "getNext" methods until the left and right margins of the display area are reached respectively. We did find the need to do some preloading of "adjacent words" to make this operation efficient, but that was not difficult, since our model supports the notion of "collocation" out to any desired number of words. The difficult part of doing concordances was the interactive human interface Swing code, which is rather sophisticated in WordHoard and supports multi-level grouping and sorting operations in addition to just displaying the lines of a traditional concordance presentation. Dealing with the data model was not hard - retrieving the proper words in their proper order from the data store that needed to be displayed on the screen in a concordance was a triviality thanks to our "POJO" architecture.
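The pair of loops described above can be sketched as follows. This is a simplified illustration, not the WordHoard code: only the method names getPrev and getNext come from the description, the Word class is a hypothetical stand-in, and the margins are approximated by a fixed number of words on each side rather than by the width of a display area.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of concordance generation over doubly-linked word occurrences.
class ConcordanceSketch {

    /** A word occurrence linked to its neighbours in reading order. */
    static class Word {
        private final String spelling;
        private Word prev;
        private Word next;

        Word(String spelling) { this.spelling = spelling; }
        String getSpelling() { return spelling; }
        Word getPrev() { return prev; }
        Word getNext() { return next; }
    }

    /** Links the given spellings into a doubly-linked list; returns the nodes in order. */
    static List<Word> link(String... spellings) {
        List<Word> words = new ArrayList<>();
        Word last = null;
        for (String s : spellings) {
            Word w = new Word(s);
            w.prev = last;
            if (last != null) {
                last.next = w;
            }
            words.add(w);
            last = w;
        }
        return words;
    }

    /** One keyword-in-context line: up to `width` words on each side of the hit. */
    static String line(Word hit, int width) {
        // Walk left with getPrev, prepending so the words stay in reading order.
        StringBuilder left = new StringBuilder();
        Word w = hit.getPrev();
        for (int i = 0; i < width && w != null; i++, w = w.getPrev()) {
            left.insert(0, w.getSpelling() + " ");
        }
        // Walk right with getNext, appending.
        StringBuilder right = new StringBuilder();
        w = hit.getNext();
        for (int i = 0; i < width && w != null; i++, w = w.getNext()) {
            right.append(" ").append(w.getSpelling());
        }
        return left + "[" + hit.getSpelling() + "]" + right;
    }

    public static void main(String[] args) {
        List<Word> words = link("to", "be", "or", "not", "to", "be", "that", "is", "the", "question");
        // The "search result" here is simply every occurrence spelled "be".
        for (Word w : words) {
            if (w.getSpelling().equals("be")) {
                System.out.println(line(w, 2));
            }
        }
    }
}
```

With a width of two words, the two hits produce the lines "to [be] or not" and "not to [be] that is". The first line is shorter on the left because the traversal stops when getPrev returns null at the start of the text.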

Amit mentioned a Java interface in Nora for delivering sparse matrices of frequency counts in the format used by D2K. We don't have such a D2K-specific interface in the WordHoard model, but it would be a trivial matter to add it. We do indeed have a great deal of internal architecture and code developed by Pib for efficiently generating these kinds of sparse matrices and performing computations that use them. In WordHoard we precompute a good deal of this kind of frequency information as a stage of our ingest pipeline for the sake of efficiency, and I believe that Nora does this also.
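For readers unfamiliar with the idea, a sparse matrix of frequency counts can be sketched as a map of maps that stores only the non-zero cells. This is a generic illustration under assumed names; it is neither the D2K format nor the code Pib wrote for WordHoard, and the work part ids and lemmas are toy data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: rows are work parts, columns are lemmas, and only
// non-zero counts are stored. Not the D2K format and not WordHoard code.
class SparseCountsSketch {

    /** Builds the sparse matrix from the lemmas observed in each work part. */
    static Map<String, Map<String, Integer>> count(Map<String, List<String>> lemmasByPart) {
        Map<String, Map<String, Integer>> matrix = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : lemmasByPart.entrySet()) {
            Map<String, Integer> row = new TreeMap<>();
            for (String lemma : entry.getValue()) {
                row.merge(lemma, 1, Integer::sum); // increment, starting from 1
            }
            matrix.put(entry.getKey(), row);
        }
        return matrix;
    }

    /** Reads one cell; absent rows and absent cells both mean a count of zero. */
    static int get(Map<String, Map<String, Integer>> matrix, String part, String lemma) {
        Map<String, Integer> row = matrix.get(part);
        return row == null ? 0 : row.getOrDefault(lemma, 0);
    }

    /** Invented sample input. */
    static Map<String, List<String>> toyInput() {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("sonnet-1", Arrays.asList("rose", "beauty", "rose"));
        input.put("sonnet-2", Arrays.asList("beauty", "winter"));
        return input;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> matrix = count(toyInput());
        System.out.println(matrix);
        System.out.println(get(matrix, "sonnet-2", "rose"));
    }
}
```

Computing such a structure once at ingest time, rather than on every request, is the kind of precomputation mentioned above.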

The Nora "proxy server" is not really part of the main topic here, but it's important and related to this discussion, so I will say a few words about it. We have no such server in WordHoard, although we do have a tiny server for mediating access to shared user-created objects and enforcing authentication and authorization policies involving these objects. Essentially, in WordHoard, for the very large static part of the data store, MySQL is the server, as mediated and in a sense "hidden" in a critical way by Hibernate. For Monk, we definitely need a server, to accommodate a more heterogeneous development environment for end-user client software if for no other reason. For our purposes here, it suffices to mention that WordHoard's object model and data store architecture in no way prohibit or make especially difficult the development of such a server (or servers). Indeed, as an example, it would probably be quite feasible to implement a compatible version of the Nora proxy server layered on top of an object model and datastore similar to the one we used in WordHoard. Should we actually do this? I don't know! But it's a possibility. There's nothing in the architecture to prohibit such an approach. Whether or not it's a good idea is another question, and a more difficult one.

Again, this is just an attempt to summarize the main points about this part of the problem, and give a few examples of how we addressed some of the issues in WordHoard. I emphasize again that this is only one small corner of the big picture. The WordHoard object model is much more than just a way to keep track of "chunks"! For example, I have only mentioned in passing the morphology model, and I haven't mentioned at all the parts of the model which deal with user-created objects like annotations, saved and shared word, work, and work part sets, etc. In addition, I have concentrated here mostly on our core object model and data store, and I haven't said much about either the details of our ingest pipeline or our end-user application Swing code, neither of which is a triviality or a small amount of code! But the object model is the heart and soul of all of this, and that's an important fact to remember too.

Again, all of the many details about everything I've discussed here are available at the WordHoard web site using the links I gave above.

I hope this helps, and thanks to Amit for starting the conversation with his discussion of how some of this is done in Nora.
