MONK : Thoughts on Monk and Models (archive)
This page last changed on Feb 20, 2008 by martinmueller@northwestern.edu.
This is a proposal that we start to design a detailed formal object model for Monk. I've posted this in the data cell area, but this work is really a job for the entire Monk team. This note is a bit long, and some of it is devoted to a general discussion of what software modeling and models are all about, and why I feel it's important to start this way. I'm certain much if not all of this is familiar to many people, but perhaps not to everyone.

Martin has posted previously with his thoughts about the initial and important task of data preparation and morphological adornment. Let me look beyond that task and ask where we are once we have finished it. We will have a large collection of adorned XML text files, in a common interoperable format, containing a massive amount of data for several hundred million words of English literature. This is all of the data and its structural markup that we plan to use for the first phase of our project.

What can we do with these files and the rich and highly structured information that they contain? We can open the files in our favorite editors like BBEdit and Oxygen and look at them and admire our hard work. There's not much else we can do with them.

So these interoperable standards-compliant XML files, while they are a major Monk deliverable, are just the beginning of the project, not its end. We want both to develop our own software and to facilitate the development by others of future software that permits flexible and arbitrary exploration and exploitation of all this data and all of the relationships inherent in the data. We are of course guided by our experiences with Nora and WordHoard.
We want to do things like point at a word in a text of interest and find out its basic morphological attributes and perhaps gracefully go find similar words in a "scholar's workbench" kind of exploratory environment; or browse interactive lexicons for corpora and other text collections with basic linguistic frequency information; or get interactive concordances of all occurrences of the verb "love" in Shakespeare, and examine his use of this lemma over the two dimensions of time and genre; or compare early 19th-century British to American fiction using the algorithms of corpus linguistics; or explore the use of repetition in Virginia Woolf; or develop data-mining classifiers for eroticism in the letters of Emily Dickinson; or do just about anything else imaginable with the data within a huge but reasonably well-defined and well-understood universe of the kinds of scholarly exploration enabled by our new collection of interoperable data.

My impression is that we have a good idea of the general kinds of scholarly tasks we want to support in our software. I don't see that as a huge issue, although a dissection of the major Nora end-use cases to enumerate their precise detailed data needs would be quite useful as a sanity check if nothing else, and is certainly something that needs to be done.

As another thought about this, I've always thought that it would be useful to examine all the papers from the Digital Humanities 2006 conference in Paris and ask ourselves the question "Could this research be done in Monk, given the necessary data preparation work? If not, why not?" I recall sitting in the sessions at the time and wondering about the research I was hearing described and asking the same question about WordHoard, with mixed results, some of them discouraging, a few encouraging, and with much learning and eye-opening on my part.
In any case, to get back to the point, ignoring some of these details for the moment, in principle the Monk XML text files as envisioned and described by Martin certainly do by themselves contain all the information needed to do all of these things. At the same time, however, they are for all practical purposes useless by themselves for doing any of these things. So we need to write quite a bit of software above and beyond the software we write to get us from source files to interoperable Monk files.

There are many challenges. First, there are many different kinds of objects in the data: corpora, works, authors, dates of various kinds associated with both works and authors, additional bibliographic attributes like genre for works, work parts (aka "chunks") that form a hierarchy with structures that vary widely from work to work and genre to genre, text and its formatting information and additional attributes, primary text as opposed to secondary text like front and back matter, footnotes and stage directions, titles, sub-titles and other kinds of headers, paragraphs of prose, stanzas and lines of poetry, sentences, punctuation marks, named entities, words, spellings archaic and modern, word parts (for contractions), lemmas, parts of speech, part of speech categories like case, number, voice, mood, tense, etc., word classes and major word classes, annotations, translations, speeches with speakers and their attributes within the drama genre at least, and so on. Many of these objects we care about deeply, others not as much or perhaps even not at all.

Our first job is to painstakingly enumerate all of these objects, their attributes, and their myriad relationships to each other, in all the detail that is necessary to express this information in a computer programming language (e.g., in Java, although the initial modeling work is usually best kept abstract, not tied to a particular programming language, to avoid getting bogged down in particular language details).
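To make the idea of enumerating objects, attributes, and relationships a little more concrete, here is a minimal sketch of a tiny slice of such a model. It is written in Python only for brevity. Every class name, field name, and sample value in it is an assumption made for illustration; it is not the WordHoard model and not a proposal for the Monk model, and it covers only a handful of the objects listed above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class Lemma:
    """A dictionary headword together with its word class (hypothetical fields)."""
    headword: str
    word_class: str


@dataclass
class Word:
    """One word occurrence in a text, with its morphological adornment."""
    spelling: str
    lemma: Lemma
    part_of_speech: str  # tag values used below are made up, not a real tag set


@dataclass
class WorkPart:
    """A node in a work's part hierarchy (a "chunk"): act, scene, chapter, ..."""
    title: str
    words: List[Word] = field(default_factory=list)
    children: List["WorkPart"] = field(default_factory=list)

    def all_words(self) -> Iterator[Word]:
        """Yield this part's own words, then those of its descendants, in order."""
        yield from self.words
        for child in self.children:
            yield from child.all_words()


@dataclass
class Author:
    name: str


@dataclass
class Work:
    title: str
    authors: List[Author]
    year: Optional[int]  # dates are often uncertain, so this may be missing
    genre: str
    root: WorkPart       # top of the work-part hierarchy


@dataclass
class Corpus:
    name: str
    works: List[Work] = field(default_factory=list)


# Invented sample data: one play with one act, one scene, and two words.
love = Lemma("love", "verb")
she = Lemma("she", "pronoun")
scene = WorkPart("Scene 1", words=[
    Word("she", she, "pronoun-nominative"),
    Word("loves", love, "verb-present-3sg"),
])
act = WorkPart("Act 1", children=[scene])
play = Work("An Example Play", [Author("A. N. Author")], None, "drama",
            WorkPart("An Example Play", children=[act]))
corpus = Corpus("Example Corpus", [play])

spellings = [w.spelling for w in play.root.all_words()]
```

Even this toy version forces decisions that the XML files leave open: whether a lemma is shared between occurrences, whether a work part owns its words directly or only through its children, and how a missing date is represented.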
In addition to precisely cataloging all these objects and their many attributes and relationships, we also need to define the major behaviors of these objects, often modeled with the help of abstractions called "interfaces" (not to be confused with either "human interfaces" or architectural "API interfaces").

For example, some objects and their attributes can be used as criteria in the definitions of searches for other objects. In WordHoard, and presumably in Monk, a critical behavior is the ability to efficiently do arbitrarily complex user-defined searches over the full collection of objects, attributes, and relationships. These searches are used to do things like generate concordances for direct interactive examination, and to define and/or extract word, lemma and work sets of scholarly interest as the first step for use in subsequent linguistic analysis procedures. The sets of objects generated by the searches, as well as the searches themselves, can be saved and optionally shared with other researchers and groups of researchers. All of this basic behavior has to be defined and modeled, including the notions of searches, search criteria, search result sets, users, groups, permissions, privileges, saved queries, saved object sets, and all of the relationships among these dynamic objects.

As another example, many objects and their attributes can be used to group and order collections of other objects, and that kind of behavior can also be defined and made explicit as part of the object model, as another interface in the case of WordHoard. As trivial as this basic grouping and ordering behavior may seem, it has turned out to be one of the more useful and appreciated features of our program.

As a third and final example, for computational analysis, we need behaviors for efficiently counting collections of objects, in a general way, generating both simple counts and structured kinds of counting objects such as large sparse matrices.
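The three behaviors just described (searching by criteria, grouping and ordering, and counting) can be sketched in a few lines. This is an illustration under stated assumptions, not WordHoard's actual interfaces: the names `SearchCriterion`, `LemmaIs`, `GenreIs`, `search`, `group_by`, and `count_matrix` are all hypothetical, the sample data is invented, and saved searches, users, and permissions are left out entirely.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Protocol, Tuple


@dataclass(frozen=True)
class WordOccurrence:
    """A flattened word occurrence; all fields are illustrative assumptions."""
    spelling: str
    lemma: str
    work_title: str
    genre: str


class SearchCriterion(Protocol):
    """Behavior: anything that can accept or reject a word occurrence."""
    def matches(self, word: WordOccurrence) -> bool: ...


@dataclass(frozen=True)
class LemmaIs:
    lemma: str

    def matches(self, word: WordOccurrence) -> bool:
        return word.lemma == self.lemma


@dataclass(frozen=True)
class GenreIs:
    genre: str

    def matches(self, word: WordOccurrence) -> bool:
        return word.genre == self.genre


def search(words: Iterable[WordOccurrence],
           criteria: List[SearchCriterion]) -> List[WordOccurrence]:
    """Return the occurrences that satisfy every criterion (logical AND)."""
    return [w for w in words if all(c.matches(w) for c in criteria)]


def group_by(words: Iterable[WordOccurrence],
             key: Callable[[WordOccurrence], str]) -> Dict[str, List[WordOccurrence]]:
    """Group occurrences by a key and order the groups by that key."""
    groups: Dict[str, List[WordOccurrence]] = defaultdict(list)
    for w in words:
        groups[key(w)].append(w)
    return dict(sorted(groups.items()))


def count_matrix(words: Iterable[WordOccurrence]) -> Counter:
    """Sparse (work, lemma) -> count table; pairs never seen count as 0."""
    return Counter((w.work_title, w.lemma) for w in words)


# Invented sample data.
words = [
    WordOccurrence("love", "love", "Play A", "comedy"),
    WordOccurrence("loved", "love", "Play B", "tragedy"),
    WordOccurrence("hate", "hate", "Play B", "tragedy"),
    WordOccurrence("loves", "love", "Play B", "tragedy"),
]

hits = search(words, [LemmaIs("love")])
tragic_hits = search(words, [LemmaIs("love"), GenreIs("tragedy")])
by_genre = group_by(hits, key=lambda w: w.genre)
counts = count_matrix(words)
```

The point of treating a criterion as an interface is that any object in the model can serve as one, as long as it knows how to test an occurrence. A real system would push such searches down into the persistence layer rather than filter in memory as this sketch does.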
We quickly learned that our WordHoard model had to be extended with a number of "derived" objects to facilitate this kind of counting and other aggregation tasks, and to optimize the production of the most commonly requested kinds of such objects.

This is far from a complete list of the behaviors that we will need to model. I've just given a few of the more important and obvious examples that we needed to address in our WordHoard object model and that, presumably, we're also going to need to address in Monk. Collaboration and annotation is another major area which I believe is important to us for Monk, and which we only started to model in WordHoard.

So this is the "object model". It's not a program or program source code, although of course there is eventually a concrete implementation of it as code, and that happens sooner rather than later. It is an abstraction or conceptual framework within which code is written and the rest of the project is designed and implemented. It's the structured internal representation of exactly the same information that is present in the XML files, but in a directed-graph-with-attributes-and-behavior structure that can be manipulated by the higher-level middleware and end-user pieces of software we want to write, and which in fact defines the kinds of such software that it is possible to write.

The object model defines the "what" of the software system at a precise level of detail. Many critical choices must be made here about how we represent all this data inside our software. These decisions are among the most important ones to be made in the early stages of a large software development project. They add up to an object model definition, which has to be laid out in the most excruciating detail, at least in a first draft, in order to be used as the basis for any subsequent software development. Computers are stupid, even stupider than we are, and you have to tell them in laborious detail exactly what to do.
The details of the object model are the first step in this process.

So that's been the focus of my thoughts since our big meeting ended on Saturday - the object model. This is always the first thing a software designer concentrates on when starting a complex new project. We have to get the "what" defined first as an object model, before we can worry about all the "hows".

"How" questions include the details of persistence strategies such as relational/object/hierarchical/XML databases and/or Lucene datastores etc. They also include "middleware" API architectures such as mediating proxy servers with data transfer objects that present slices and views of model subsets and expose behavior APIs to a collection of horizontal heterogeneous clients, or more vertical but more powerful Hibernate/JDO kinds of frameworks which offer transparent unmediated direct access to the full object model, or mixtures of the two approaches, which are common. "How" questions also include end-user application and human interface programming frameworks such as web apps with technologies like Spring, OpenLaszlo, etc. and/or stand-alone traditional desktop metaphor direct-manipulation Swing applications, etc. There are a million relevant technologies in this arena.

None of these "how" questions can be addressed intelligently without prior work on the details of the "what" questions (the object model). In most cases, the structural and behavioral requirements imposed by the model determine which "how" technologies are appropriate, and are used to measure and judge the tradeoffs imposed by the set of appropriate available "how" technologies.

A detailed object model draft serves another critical function. It exposes in a clear and useful way the tradeoffs that are part of any big software project. We can't do everything we may think we would like to do in this project, as Steve Ramsay reminded us several times during our meeting.
We must make painful decisions about what we're going to do and what we could do in theory but must table for some other time due to lack of resources. I've always been an arch-conservative about these matters, preferring to do a few things well rather than lots of things poorly. These are big decisions, not mine to make, and not even decisions for just the data cell, but rather for the entire Monk team.

Where does that leave us? From my perspective, the WordHoard object model is a good place to start from, as long as we keep in mind that it will not be the place where we will end. I propose we start there. I know of several areas that I suspect will need a good deal of work, including the text model, named entities, n-grams and repetitions, dynamic annotations and adornments, and most likely more support in the model for data-mining activities. Others will have other ideas.

Here is another way of putting it. Once we can take the preprocessed and fully adorned Monk files for granted, we have a pile of bricks. But we now have the task of drawing up the blueprints for the kind of house we think our scholarly occupants would like to live in. Those plans don't follow from the bricks. Designing the object model is the first step of drawing up those plans in detail. That is the next major task. And as we approach it we should decide first on what it is we want to build. The "how" - the choice of this rather than that technology - is like choosing the contractor. And it should be deferred until we have a clear idea of what it is we want to build.
Document generated by Confluence on Apr 19, 2009 15:04