This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

In our data cell conference call on 4/17/07 (see the Conference call, 2007 Apr. 17, Data), we began to discuss issues surrounding the general problem of architecting Monk so that researchers can extend the system by adding idiosyncratic tagging data. Unfortunately, this happened near the end of the meeting, and we ran out of time. Later, Pib, Bill and I discussed this at some length.

In Monk, we will provide morphological tagging data at the word level (NUPOS part of speech, lemma, word class, spellings, etc.). We will also of course need to provide high level bibliographic tagging data (author, title, publication date, genre, etc.), and we will pre-define a kind of "chunk" structure for the works in our collections. We're making good progress on making some detailed decisions about these matters. There is a great deal that can be done with just this kind of data that will be pre-computed and offered as a fixed collection of static data to our users.

It's clear from the use case scenarios, however, that many researchers need to be able to create their own tagging data and attach it to the texts. Rating passages for "hotness" and "witchiness" are two examples that we have been discussing. We in the data cell do not, of course, plan to do this ourselves in the same way that we plan to do the morphological and bibliographic tagging.

This may seem to be an enormous problem, and it is. After giving it some thought, however, we have come to realize that a good deal of the work has already been done as part of the WordHoard project, and that our model and datastore would require only some relatively minor enhancements to support this kind of functionality.

We already have an annotation facility which permits the attachment of plain text annotations to passages of text. These annotations can be private or shared with other users or groups of users. So our model already supports the notions of text ranges, user-defined attachments to text ranges, accounts, groups, permissions, privileges, etc. This is a great deal of infrastructure that is already in place.

We also already have a general abstract architecture for searching and saving the result sets as user-defined "bags of words". These "bags of words" can be displayed in interactive concordances with multi-level grouping and sorting. They can be saved in the dynamic part of our datastore and used for counting and analysis in the same way that works and chunks can be used as units of counting and analysis. So all of the infrastructure required to support arbitrary "bags of words" and use them for searching, displaying, grouping, sorting, counting, and analysis is already in place in our core object model and its underlying datastore.

What would we need to add to our model to make it possible for researchers to define and use their own tagsets? We have identified two key extensions, neither of them trivial, but both of them quite feasible.

The first extension is a generalization of the notion of an "annotation". In addition to supporting plain text annotations, we would need to support a simple kind of "structured annotation". Lacking better terminology, I'll follow Pib's lead and call this kind of object an "adornment" to distinguish it from a "plain-text annotation".

The simplest example would be an adornment type (or "template") with a boolean value "hot" or "not hot" which could be attached to any word, phrase, sentence, paragraph, or arbitrary range of text, including but not limited to chunks and even entire works. We would also need to support at least the other usual kinds of primitive data types for our adornments: numeric adornments (e.g., a rating from 1 to 10 of "hotness"), ordered or unordered enumerated types (e.g., an adornment whose values could be any one of "hot", "warm", "luke-warm", "cool", or "cold"), string valued adornments, lists of "keyword=value" adornments, and so on. A few years ago, we developed some software here at NU which solved this kind of problem for a different project. It was used by political scientists to mark up Supreme Court oral arguments. It would not be difficult to add such a facility to our existing model and datastore.
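To make the idea of a typed adornment concrete, here is a minimal sketch in Python. It is not the WordHoard or Monk model: the class names (TextRange, AdornmentTemplate, Adornment), the word-offset addressing scheme, and the set of "kind" strings are all assumptions made for illustration. It only shows how a template could constrain the values attached to a range of text.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class TextRange:
    """A span of words in a work, by word offset (hypothetical addressing scheme)."""
    work_id: str
    start: int  # index of the first word, inclusive
    end: int    # index of the last word, inclusive


@dataclass(frozen=True)
class AdornmentTemplate:
    """Describes one user-defined tag: its name, value kind, and constraints."""
    name: str
    kind: str                         # "boolean", "numeric", "enum", or "string"
    minimum: Optional[float] = None   # numeric only
    maximum: Optional[float] = None   # numeric only
    choices: Tuple[str, ...] = ()     # enum only
    ordered: bool = False             # enum only: are the choices ranked?

    def validate(self, value: Any) -> None:
        """Raise ValueError if value is not legal for this template."""
        if self.kind == "boolean":
            ok = isinstance(value, bool)
        elif self.kind == "numeric":
            # bool is a subclass of int in Python, so exclude it explicitly.
            ok = (isinstance(value, (int, float))
                  and not isinstance(value, bool)
                  and (self.minimum is None or value >= self.minimum)
                  and (self.maximum is None or value <= self.maximum))
        elif self.kind == "enum":
            ok = value in self.choices
        elif self.kind == "string":
            ok = isinstance(value, str)
        else:
            raise ValueError(f"unknown adornment kind: {self.kind}")
        if not ok:
            raise ValueError(f"{value!r} is not a valid {self.name!r} value")


@dataclass(frozen=True)
class Adornment:
    """One value of one template, attached to one range of text."""
    template: AdornmentTemplate
    target: TextRange
    value: Any

    def __post_init__(self) -> None:
        # Reject illegal values at construction time.
        self.template.validate(self.value)
```

With these definitions, the "hot" / "not hot" example is a template of kind "boolean", the 1 to 10 rating is a "numeric" template with minimum=1 and maximum=10, and the five-level scale is an ordered "enum". Constructing an Adornment with a rating of 11 would raise a ValueError.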

We would also need to provide a namespace for these kinds of user-defined adornments. For example, one of our "hotness" examples above might be named "Catherine's hotness tagset". It would belong to her, in the sense that it would be "owned" by her account on the system. She could choose to keep her tagging data private, share it with some set of specified colleagues, or share it with everybody.
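The ownership and sharing rules for a tagset can be sketched as follows. This is only an illustration under assumed names: the Tagset class, the "owner/name" form of the qualified name, and the three visibility levels are hypothetical, and the account and group model already in the datastore may well look different.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Tagset:
    """A named collection of adornment templates owned by one account."""
    owner: str
    name: str
    visibility: str = "private"                # "private", "shared", or "public"
    shared_with: FrozenSet[str] = frozenset()  # accounts or groups, used when "shared"

    @property
    def qualified_name(self) -> str:
        # Prefixing with the owner keeps two users' "hotness" tagsets distinct.
        return f"{self.owner}/{self.name}"

    def can_read(self, account: str, groups: FrozenSet[str] = frozenset()) -> bool:
        """True if the account, or one of its groups, may see this tagset."""
        if account == self.owner or self.visibility == "public":
            return True
        if self.visibility == "shared":
            return account in self.shared_with or bool(groups & self.shared_with)
        return False
```

Here a tagset owned by "catherine" and named "hotness" would have the qualified name "catherine/hotness". It is readable only by her when private, by the listed colleagues or groups when shared, and by everybody when public.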

The second extension is to add, to the existing large collection of searching/grouping/sorting criteria, a new criterion that uses the user-defined adornments. For example, a search criterion might be "all words tagged in Catherine's hotness tagset as hot" (e.g., the bag of all words marked by Catherine as being in "hot" passages of text), or "all words tagged in her tagset with a hotness rating greater than or equal to 6", or "at least luke-warm", and so on.

If a user does a search using such a criterion, likely in combination with other criteria, the resulting bag of words is a first class object just like the result of any other kind of search, available for further use and analysis by all the other parts of Monk. It can be presented in an interactive concordance, grouped, sorted, saved, counted, analyzed, data mined, viewed and manipulated using data visualization tools, and so on.
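The following sketch shows how an adornment-based criterion could combine with other criteria and yield an ordinary bag of words. Everything here is assumed for illustration: the Word class, the (tagset, name, value) triples, the helper names, and the simplification that each word already carries the adornments of the passages containing it. The "pos" field is a plain placeholder string, not a NUPOS tag.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Word:
    spelling: str
    lemma: str
    pos: str
    # (tagset, adornment name, value) triples inherited from enclosing passages.
    adornments: Tuple[Tuple[str, str, Any], ...] = ()


Criterion = Callable[[Word], bool]

# Ordered enumeration, coldest to hottest, so "at least" comparisons make sense.
WARMTH = ("cold", "cool", "luke-warm", "warm", "hot")


def adorned(tagset: str, name: str, test: Callable[[Any], bool]) -> Criterion:
    """Criterion: the word has the named adornment with a value passing test."""
    def criterion(word: Word) -> bool:
        return any(ts == tagset and n == name and test(v)
                   for ts, n, v in word.adornments)
    return criterion


def at_least(level: str) -> Callable[[Any], bool]:
    """Value test for the ordered WARMTH scale."""
    floor = WARMTH.index(level)
    return lambda value: value in WARMTH and WARMTH.index(value) >= floor


def all_of(*criteria: Criterion) -> Criterion:
    """Combine criteria so that a word must satisfy every one of them."""
    return lambda word: all(c(word) for c in criteria)


def search(words: Iterable[Word], criterion: Criterion) -> List[Word]:
    """Return the bag of words (a list, duplicates kept) matching the criterion."""
    return [w for w in words if criterion(w)]


def lemma_counts(bag: Iterable[Word]) -> Counter:
    """Count lemmas in a bag, as one might for any other search result."""
    return Counter(w.lemma for w in bag)
```

A criterion such as adorned("catherine/hotness", "rating", lambda v: v >= 6) is just another predicate, so it can be passed to all_of together with, say, a part of speech test. The list that search returns can then be counted, grouped, or sorted like any other bag.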

After talking about all of this for quite some time yesterday, Bill, Pib and I came to the conclusion that this basic outline of a design makes good sense within the context of the work we've already done over the last several years. While not trivial to implement (the devil's always in the details!), it is really quite feasible, if we can leverage the work we've already done.

Other extensions that go beyond these basics also come to mind. For example, a researcher might generate a large custom set of tagging data using some other software, and have the data in a file that he would like to import into the system. There should be provisions for this kind of large "batch tagging data" ingest operation in the datastore. We have in fact developed this kind of ingest software already, in the context of static annotations for The Iliad and The Shepheardes Calender. Another idea is to permit the attachment of annotations and adornments to a wider class of objects than just words and other kinds of text ranges. Lemmas are an obvious example, and Bill has done some preliminary work in this area. This is where one would be most likely to add semantic tags, for example.

Let's explore this example of semantic tagging a bit further. Martin has done some preliminary work with this which he has described elsewhere, but so far there hasn't been an implementation of his ideas. Should we try to explore this as something we might do in Monk? That's an open question, and an interesting one. It's certainly a possibility, but we may unfortunately not have the time to address this in our first version of Monk. If, however, we design the system from the start for extensibility along the lines outlined here, this kind of thing would be much easier to add later than it would be without such a core architecture already in place for extensibility.

As a last example, and another kind of use case scenario, we have had conversations about place names and geographical tagging data and analysis. Imagine a researcher with an externally-generated list of place names and their latitudes and longitudes. Given an extensible architecture like the one outlined here, he could import this data into Monk in the form of "latitude=xxx, longitude=yyy" adornments attached to proper noun lemmas in the Monk lexicon, and use that data for searching, grouping, sorting, analysis, mining, etc. From the point of view of this researcher and his colleagues, this data and its "intelligence" about place names and locations has become just another part of the system, as fully functional and powerful as all the other data in the system, like the built-in part of speech tags.
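The place name scenario can be sketched as keyword=value adornments attached to lemmas. The Lemma class, the "proper noun" word class string, and the function names below are assumptions for illustration, not the Monk lexicon's actual structure. The coordinates in the usage lines are rounded, illustrative values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Lemma:
    headword: str
    word_class: str
    # keyword=value adornments, e.g. {"latitude": 41.9, "longitude": 12.5}
    adornments: Dict[str, float] = field(default_factory=dict)


def attach_coordinates(lexicon: Dict[str, Lemma],
                       gazetteer: Dict[str, Tuple[float, float]]) -> List[str]:
    """Attach latitude/longitude to matching proper noun lemmas.

    Returns the place names that could not be matched, for the researcher to review.
    """
    unmatched: List[str] = []
    for name, (latitude, longitude) in gazetteer.items():
        lemma = lexicon.get(name)
        if lemma is None or lemma.word_class != "proper noun":
            unmatched.append(name)
            continue
        lemma.adornments["latitude"] = latitude
        lemma.adornments["longitude"] = longitude
    return unmatched


def in_box(lemma: Lemma, south: float, north: float,
           west: float, east: float) -> bool:
    """Search criterion: the lemma has coordinates inside the bounding box."""
    latitude = lemma.adornments.get("latitude")
    longitude = lemma.adornments.get("longitude")
    if latitude is None or longitude is None:
        return False
    return south <= latitude <= north and west <= longitude <= east


# Usage: a tiny made-up lexicon and an externally generated place list.
lexicon = {
    "Rome": Lemma("Rome", "proper noun"),
    "Athens": Lemma("Athens", "proper noun"),
    "rose": Lemma("rose", "noun"),
}
unmatched = attach_coordinates(lexicon, {"Rome": (41.9, 12.5), "Athens": (38.0, 23.7)})
```

Once the coordinates are attached, a test like in_box(lemma, 36, 47, 6, 19) behaves as one more search criterion alongside the built-in ones, which is the sense in which the imported data becomes "just another part of the system".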

In summary, we think it's clear that it would be desirable if we could design and implement an extensible system. This meets a real need, both for our concrete current collection of use case scenarios and for other purposes. We will provide our users with a rather modest but powerful collection of built-in tags at the word, chunk, work, and collection level (the details aren't finalized yet, of course). But we don't think we can or should stop there, if it's possible to go further given our resources and budget. While these pre-defined tags are quite useful, if we cannot provide a way for researchers to define and use their own tags as first-class objects with the same range of functionality as our built-in ones, we will have failed to meet the real needs of many of the people in the target audience for our software system. At this point anyway, we think that the plan outlined here is feasible given our resource and budget constraints.

Finally, to perhaps state the obvious, I have deliberately focused on core modeling and datastore issues. I haven't talked at all about end-user client software or human interface issues. Those are not at all trivial or uninteresting! In fact, it is typically true that it takes more work and lines of code (and perhaps even intelligence and creativity!) to implement good end-user clients and human interfaces for these kinds of problems than it does to deal with the core modeling and persistence issues. But this is, after all, the data cell. I'm hoping that people who read this can imagine a number of different graceful ways to design attractive and powerful human interfaces for this kind of functionality. For our purposes here, we're worried about what it would take to support this kind of thing at the very lowest level, in the core model and its datastore, and to export the required functionality to the higher levels of Monk.
