|
MONK : Abbot and the Datastore
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Thanks to Martin for his very useful note titled "The Abbot and other curatorial matters" on 12/30/07. It raises many important questions about the implications for my work on the datastore. In particular, I'd like to focus on what Martin often calls the "mid-level" data that is important for analytics - all the many complex structures between the level of the TEI-A "div" elements and the MorphAdorner "w" elements, between the "work parts" and the "words" inside the work parts. This includes paragraphs, lines, sentences, quotes, letters, speeches, notes, floating text, and so on and so on in what I must confess is to me quite bewildering detail. Brian tells me that the vocabulary of TEI-A consists of "only" about 120 elements, most of which I guess live at this level.

What is expected of the datastore in terms of modeling all of this kind of data? With the sole exception of modeling speeches and speakers in drama, we did not address any of these problems in our WordHoard work, so this is all new to me. Unlike my other work to date on the datastore, I have no prior experience in designing solutions to these problems, and I have no clear picture of what is expected.

While the work on Abbot as described by Martin defines the input to the datastore, I cannot build and implement a model of this data without also understanding the output, which is the API that I must implement between the datastore and the other layers of the Monk server. That API exists to serve the needs of those other layers, and the development of specifications for the API is primarily the responsibility of the programmers of those other layers. As the programmer of the datastore, I cannot even begin to design an implementation without thoroughly understanding the needs of my clients at the detailed technical level of an API specification. That's the first step of the design process for a component in a large software system. So we have lots of work to do.
The kinds of specifications Martin has begun to lay out in detail for TEI-A are necessary but not sufficient to define my work on the datastore. They are only half of the problem. They describe the input I will receive, but not the output I am expected to deliver. I expect that my biggest job after my work on the new ingest process in January and February will be to work with the programmers of the other layers of the Monk server, especially the analytics programmers, to define in detail this API, the interface between our components. Only when that work is complete will I be able to actually model these mid-level objects and implement the model in the datastore. I expect this to be a massive challenge and a huge amount of hard work.

To date the only kind of specification I have is much too vague: "copy the data you are given to the datastore, don't throw anything away, and make the data available on request". That's trivial to implement: I just copy each TEI-A file to disk and implement an API that delivers the contents of the files on request, perhaps as parsed XML DOM trees. To make the whole thing more efficient, perhaps I do the DOM tree parsing at ingest time, and store serialized versions of the DOM trees. But that's obviously not enough, and obviously not what we want or need, right?

At the other extreme, the specification might demand an API that is similar to and no less powerful than that provided by the major existing XML search engine products. While that would be very nice, we're talking about many man-years of work to produce a major new product, which is clearly well outside the scope of our Monk project. If that is what is needed, then we have no practical alternative but to use one of the existing XML search engine products to store and make available on demand all of this mid-level data. John Unsworth has mentioned this many times. If this is what we need, then I have no programming work to do in my part of the datastore.
We just load all the TEI-A files up into one of those existing search engine products, and the programmers of the other layers of the Monk server query the search engine whenever they need mid-level data. Unfortunately, Steve Ramsey has told us many times that all of these available products are much too slow for our needs, and that none of them scales up at all well to handle the amount of data we need to store and query. Other technical people who understand the capabilities of these products also tell me that they would not meet Martin Mueller's needs for what he calls "exploratory data analysis" or "search and sort." I have nothing to contribute to this discussion, because I've never used any of these products and I know nothing at all about them. Fortunately, there are many other technical people in our project who have considerable experience with these products, and I defer to them on these questions.

There are infinitely many possibilities between these two extremes. For example, in Maryland Steve Ramsey mentioned the need to ask whether a word is in a paragraph or not. I can easily develop an API to meet this particular need. I simply add a boolean attribute named "inParagraph" to "word" objects, add a corresponding search criterion, populate the attribute in the obvious way at ingest time from the data I'm given in the TEI-A input files, and that's the end of my work.

But what about all the many other questions that might be asked about all the mid-level data? I suspect that "is a word in a paragraph or not" is not the only question of interest. I suspect that we're interested in all sorts of behaviors with respect to at least some significant subset of this mid-level data that involve searching, sorting, grouping, counting, and so on. None of this has been specified yet. I am not the programmer of the parts of Monk which need this data, so I have no answers to these questions.
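To make the "inParagraph" example above concrete, here is a minimal sketch in Python of what populating such an attribute at ingest time might look like. It is only an illustration under stated assumptions, not a proposed datastore design: it assumes that words appear as "w" elements and paragraphs as "p" elements, that the input uses no XML namespace, and the names `Word`, `ingest_words`, and `find_words` are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Hypothetical datastore 'word' object; only the fields needed here."""
    text: str
    in_paragraph: bool  # the "inParagraph" attribute, set at ingest time


def ingest_words(xml_text: str) -> List[Word]:
    """Walk a TEI-A-like document and build Word objects.

    Assumption: words are <w> elements, paragraphs are <p> elements,
    and no XML namespace is used (a simplification for this sketch).
    """
    root = ET.fromstring(xml_text)
    words: List[Word] = []

    def walk(elem: ET.Element, inside_p: bool) -> None:
        # A <w> counts as "in a paragraph" if any ancestor is a <p>.
        if elem.tag == "w":
            words.append(Word((elem.text or "").strip(), inside_p))
        inside = inside_p or elem.tag == "p"
        for child in elem:
            walk(child, inside)

    walk(root, False)
    return words


def find_words(words: List[Word], in_paragraph: Optional[bool] = None) -> List[Word]:
    """Search with an optional 'in_paragraph' criterion; None means no filter."""
    return [w for w in words
            if in_paragraph is None or w.in_paragraph == in_paragraph]


# Made-up sample input: one heading, one paragraph, one verse line.
SAMPLE = """
<div>
  <head><w>Preface</w></head>
  <p><w>Call</w> <w>me</w> <hi><w>Ishmael</w></hi></p>
  <l><w>Sing</w> <w>Muse</w></l>
</div>
"""

words = ingest_words(SAMPLE)
print([w.text for w in find_words(words, in_paragraph=True)])
print([w.text for w in find_words(words, in_paragraph=False)])
```

A single flag like this answers exactly one question. Each further question about the mid-level data (which line, which speech, which note, and so on) would need its own attribute or a more general model, and that is the part that still has to be specified.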
I certainly expect to be intimately involved in developing the answers, but I cannot take the lead in this. I'm the supplier of this data, but it's the consumer who defines the needs. The consumers in this case are, I believe, primarily the programmers of the analytics layers of Monk.

To summarize, the major unanswered question is "what are the implications for the datastore of the mid-level data produced by Abbot?" Do we use existing XML search engine products to model this data, despite their purported flaws and inadequacies, in which case there are no implications for the datastore? If not, what do we do? I cannot even begin to answer these questions by myself - I need your help. |
| Document generated by Confluence on Apr 19, 2009 15:04 |