This page last changed on Feb 16, 2007 by martinmueller@northwestern.edu.

Metadata Offer New Knowledge
(MONK)

Abridged version of proposal
September 25, 2006

Humanities text-mining in the digital library

The work proposed here builds on work done in two separate projects funded by the Andrew W. Mellon Foundation: WordHoard (http://wordhoard.northwestern.edu/), at Northwestern University, and Nora (http://www.noraproject.org/), with participants at the University of Illinois, the National Center for Supercomputing Applications, the University of Maryland, the University of Georgia, the University of Nebraska, the University of Virginia, and the University of Alberta. The two projects share the basic assumption that the scholarly use of digital texts must progress beyond treating them as book surrogates and move towards the exploration of the potential that emerges when you put many texts in a single environment that allows a variety of analytical routines to be executed across some or all of them.

The WordHoard project applied to literary texts the insights and techniques of corpus linguistics, namely the empirical and computer-assisted study of large bodies of written texts or transcribed speech. In WordHoard, such texts are annotated or tagged according to morphological, lexical, prosodic, and narratological criteria. In its current release, WordHoard contains the entire canon of Early Greek epic in the original and in translation, as well as all of Chaucer and Shakespeare, and Spenser's Faerie Queene.

The goal of the Nora project is to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries. Like WordHoard, Nora applies some of the tools, techniques, and insights of corpus linguistics to its collections, and like WordHoard, Nora deals with literary texts, though from a later era: British and American literature of the 18th and 19th centuries. Nora builds on D2K (Data to Knowledge), a generalized visual-programming framework for data-mining developed and still being improved at the National Center for Supercomputing Applications, in the Automated Learning Group.

In their current state, the two projects look very different to an end user. WordHoard offers an integrated environment for close reading and scholarly analysis of a limited number of texts. The user interface focuses on the basic philological activity of "going from the word here to the words there," and it leverages the power of the computer to offer easier and more complex support to this activity than you could achieve in a world of books and concordances or, for that matter, simple web sites. But WordHoard's statistical module also includes a number of routines that take you towards the domain of data-mining from which Nora starts.

Nora does not look like an application that is built around particular texts. It is better thought of as a set of procedures that are run against data that are extracted from diverse collections. These procedures are accessible to users through distinct applications built on D2K. Current Nora applications focus on text categorization, text-mining, and visualizations of patterns in collections.

If you look under the hood of the two projects, many similarities and some additional differences emerge, but the similarities run much deeper. Both projects have procedures for
1. Ingesting arbitrary texts that meet some rules (e.g. well-formed XML)
2. Tokenizing the texts, assigning to each word a unique location, and applying part-of-speech tagging and other techniques familiar from corpus linguistics
3. Converting the tokenized and preprocessed texts into a datastore that includes various count objects to simplify and speed up subsequent operations
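
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of either project: the sample XML, the element name `l`, the address scheme (text id, line, position), and the function names are all hypothetical, and part-of-speech tagging is left out for brevity.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; real collections use richer markup.
SAMPLE = "<text id='demo'><l>To be, or not to be</l><l>that is the question</l></text>"

def ingest(xml_string):
    """Step 1: accept only well-formed XML (ET.fromstring raises ParseError otherwise)."""
    return ET.fromstring(xml_string)

def tokenize(root):
    """Step 2: split lines into words and give each word a unique address."""
    text_id = root.get("id")
    tokens = []
    for line_no, line in enumerate(root.iter("l"), start=1):
        words = re.findall(r"[A-Za-z']+", line.text or "")
        for position, word in enumerate(words, start=1):
            tokens.append({"loc": (text_id, line_no, position),
                           "spelling": word.lower()})
    return tokens

def build_datastore(tokens):
    """Step 3: keep the tokens together with a precomputed count object."""
    counts = Counter(token["spelling"] for token in tokens)
    return {"tokens": tokens, "counts": counts}

store = build_datastore(tokenize(ingest(SAMPLE)))
```

Because the count object is computed once at load time, later operations can read `store["counts"]` without rescanning the tokens.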

In Nora the data store provides the basis for a chain of operations that go via D2K to the end-user applications. In the WordHoard environment, the user interface talks to the data store through a software layer called Hibernate. In both Nora and WordHoard, however, the datastore is separable from the processes it feeds and could in principle feed quite different processes via quite different intermediate layers.

Nora and WordHoard differ marginally in their basic ways of tokenizing and preprocessing data. They have used different tag sets and have differed with regard to lemmatization and named-entity extraction, matters on which it is desirable and quite easy to reach agreement. WordHoard has also tagged some prosodic and narratological phenomena, but these very granular tagging operations are unlikely to scale to data sets that are larger by orders of magnitude.

Nora and WordHoard have both employed relational database systems to maintain their data stores, but Nora is exploring different technical options. Both projects make use of the XML tags in the texts, something that sets them apart from most text-mining done in the scientific community. On a technical level, both projects distribute a Java Web Start application.

If you compare the types of queries supported by the current interfaces of Nora and WordHoard, the latter, while including several statistical routines, comes at things from a philological perspective quite familiar to humanists, while the former applies text-mining strategies more deeply rooted in business and the social sciences. But there is real complementarity here, and the underlying operations are in any event very similar. It is difficult to distinguish between "text-analysis" and "text-mining": it is more productive to think in terms of a broad spectrum of text analysis, with different scholars finding themselves in different moments at different points on the spectrum. What matters, finally, is that we create a common environment where scholars will find the tools that meet their needs.

Because these two projects have very similar underlying requirements for their texts, and very similar basic techniques for analyzing those texts, it makes sense to combine them. Because they have developed in complementary ways and explored alternative strategies for accomplishing similar goals, it seems likely that they will strengthen one another. But it will also be a challenge to meld them and build out from the two at a technical level, since they have made different choices at the level of architecture and implementation.

The major challenge will be to construct a datastore that will be sufficiently robust, fast, and flexible to support the much larger data sets envisaged in this joint proposal. Such a datastore must hold more data by two or three orders of magnitude. It must deliver these data at acceptable speed, and it should support different ways of accessing, querying, and manipulating data, including D2K, which is a particular focus of attention in the current proposal. In order to meet this challenge, we put an emphasis in the budget for this phase on salaries for full-time or parts of full-time technical staff. In the discussions preceding this proposal we have found that the key developers on the Nora and WordHoard side have wrestled with very similar problems in quite similar ways, and we think that the joint development of the architecture for a very large datastore (or set of such datastores ) will be the critical step to the success of MONK. But some student involvement in the building of this architecture remains desirable, not least because of the learning opportunities that this project offers them.

Over the last decade, many millions of dollars have been invested in creating digital library collections: at this point, terabytes of full-text humanities resources are publicly available on the web. Those collections, dispersed across many different institutions (not only libraries but also publishers) are large enough and rich enough to provide an excellent opportunity for text-mining, and we believe that web-based text-mining tools built on the architecture of a reusable framework like D2K will make those collections significantly more useful, more informative, and more rewarding for research and teaching.

At the same time, we can see from the rapid rise and surprising power of social software (wikis, blogs, social bookmarking, folksonomies, etc.) that in some cases it might make sense to assume that users could contribute to improving, maintaining, tagging, and enriching texts in digital libraries. The current release of WordHoard, for instance, lets users construct customized 'word sets' and store these for private or public use. Of course, there are many issues to address there, beginning with the library's need to ensure the integrity of its collections, and the fact that many of these collections will actually be licensed from and served up by publishers, who are also going to be concerned about the integrity of their product. Still, it seems at least imaginable that user contributions might be layered on top of the original texts and that an environment for the online analysis of texts would be far richer if it were an online community as well, where analysis could be shared, where intermediate artifacts from one person's research process could be made available to others rather than being recreated by them, and so on.
D2K and UIMA

D2K (Data to Knowledge) is a data-mining framework developed by the Automated Learning Group (ALG) at the National Center for Supercomputing Applications (NCSA), initially in order to facilitate its own research activities. D2K is a rapid, flexible data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection, with data and information visualization tools. It offers a visual programming environment that allows users to connect programming modules together to build data mining applications and supplies a core set of modules, application templates, and a standard API for software component development. All D2K components are written in Java for maximum flexibility and portability. D2K "itineraries" for data can also be run as web services. Major features that D2K provides to an application developer include:

• Visual Programming System Employing a Scalable Framework
• Robust Computational Infrastructure
    o Enables processor intensive applications
    o Supports distributed computing
    o Enables data intensive applications
    o Provides low overhead for module execution
• Flexible and Extensible Architecture
    o Provides plug and play subsystem architectures and standard APIs
    o Promotes code reuse and sharing
    o Expedites custom software developments
    o Relieves distributed computing burden
• Rapid Application Development (RAD) Environment
• Integrated Environment for Models and Visualization
(http://alg.ncsa.uiuc.edu/do/tools/d2k)

D2K has been an integral part of Nora since the beginning of the Nora project, with Nora applications calling D2K web services to execute itineraries designed for text-mining on literary texts. We are also currently working on wrapping up in D2K the routines that we use to produce the data-stores against which those itineraries run, so that data-preparation as well as data-exploration can be run as D2K-enabled web services. At the end-user application level, we distribute a Java Web Start executable which currently calls an external, web-accessible properties file in order to determine how to configure the application interface for a particular collection. That process, and the application itself, run outside the D2K framework, but that's largely because we haven't yet felt these parts were sufficiently stable and well understood to merit being brought into the D2K environment in the form of modules.

UIMA (Unstructured Information Management Architecture) is a project of IBM Research. It is an open, industrial-strength, scalable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components. IBM makes UIMA available as a free software development kit, and makes the core Java framework available as open source software. We have not worked with UIMA, and it has only recently come to our attention, but we are interested in working with ALG to explore how it might be combined with D2K to accomplish our purposes in the Nora project. For example, UIMA includes a semantic search engine that keys on XML fragments produced by named-entity recognition, and we are using named-entity recognition to reconstruct the social networks in novels. UIMA uses "a common representation system called the CAS or Common Analysis Structure" that could be quite useful to Nora:

The CAS is used to provide analysis engines with read access to the artifact being analyzed (e.g., document, image, video, etc) and read/write access to the analysis results or annotations associated with defined regions of the artifact. Regions may correspond to words, sentences or paragraphs in text or frames or parts of frames in video, for example. The CAS is shared among analysis engines working in concert as part of a larger workflow to process a collection of artifacts. UIMA supports standard XML and high-speed binary serializations of the CAS. The CAS maybe shared among Java and C++ analysis engines. UIMA provides a native Java Interface to the CAS that renders analysis results as Java objects and properties making it easy for the Java programmer to interact with the CAS. The CAS contains high-speed indices to speed up access to type instances.
(http://www.research.ibm.com/UIMA/)

Our understanding, at this point, is that program officers at the Andrew W. Mellon Foundation are working with researchers at ALG and IBM to broker a conversation about the possibility of combining or coordinating these efforts. We are committed to being engaged in those conversations as early and as often as our use-case offers a useful example to focus the general discussion into particulars.
Scaling up from Nora and WordHoard

Both Nora and WordHoard achieved most of their objectives with limited data sets where document count is in the low dozens and the total word count in the low millions. But if we want to take full advantage of word-level metadata and the inquiries they support we will want data sets consisting of thousands of documents and running to hundreds of millions of words. In the MONK project we want to create an environment that will scale up by orders of magnitude and let users carry out complex data-mining and query operations across collections that will eventually consist of more than a billion words. There are three phases in the approach to this goal, and only the first two are contemplated under the round of funding being requested in this proposal. By the time we reach the end of this round, we should have a good idea whether the third phase is actually feasible: if it is, we will want to bring in other partners who are content providers.

Phase 1

The first phase in the MONK project will be to combine the texts that have been the testbeds for Nora and WordHoard, to add some texts from periods not represented in either Nora or WordHoard, and to pre-process that collection uniformly. The resulting testbed as a whole would not be publicly available, though we would identify a subset of it that could be made available in demos, beginning with those texts that Nora and WordHoard have used in this way. The main purpose for this uniformly pre-processed collection of texts would be for developing and testing our software.

The testbed we plan to assemble will consist of primary texts written in English from the very early modern to the beginning of modernism, from 1471 (the date of Caxton's Recuyell of the Histories of Troy) to the early 20th century, a cutoff date dictated by copyright. We expect that, since this is a continuation of the Nora and WordHoard projects, we could continue to test software privately on the texts collected from libraries and scholars for those projects, and we would approach DLF libraries for a broader selection. In addition, Martin Mueller has a working relationship with ProQuest (around Chadwyck-Healey and the Text Creation Partnership), and he has had preliminary conversations with them that indicate they would have no objection to having their texts included in the testbed if use were limited to faculty or students at institutions with subscriptions to this content. Even partial success in negotiating for testbed texts across these sites would produce a collection that includes most of the texts that are taught regularly in colleges or are the focus of research by humanities scholars whose work involves the study of documents in English written between 1500 and 1900. Thus the MONK testbed will for many purposes be a good enough "Book of English" or cultural genome from the very early modern to the modernist.

During both the first and the second phase, we will be exploring ways to integrate WordHoard functionality with environments such as D2K, working to expand the analytical tools available to the end user, exploring the possibilities for semantic analysis, expanding the repertoire of visualizations, improving visual design and user-interface functionality, and developing technical and end-user documentation.

Phase 2

The second phase would involve some proof of concept work on social software capabilities for MONK, including the sharing of intermediate work-products (for example, pre-processed sub-collections selected by one user and then shared with others), sharing of results, annotation and correction of data, etc. Part of this second phase would also be to work with a small number of libraries and publishers to provide the tools we have built with existing large collections. Candidates for this would include the libraries at some of the participating MONK institutions (Maryland, Nebraska, Northwestern, UIUC) as well as OCLC (concerning the RLG collections) and JSTOR. We would also begin conversations with them and others about what it would take to do text-mining and text-analysis across collections. That conversation would include Katherine Kott, from the Digital Library Federation, as well as Herbert Van de Sompel, Carl Lagoze, and Sandy Payette, in connection with the Pathways project.

Phase 3

The third phase would be deploying the MONK tools in a distributed environment that would allow scholars to do text-mining across multiple large collections. Developing that distributed environment is beyond the scope of this round of funding, but we believe that we can provide a use-case for projects like Pathways (Van de Sompel et al.) or for the Digital Library Federation's Aquifer project, or both.

Deliverables

The deliverables from this round of funding on the MONK project would be:
• Data pre-processing routines that support a wide variety of text-mining and text-analysis activities and that are essentially "point and shoot" for content providers who agree to provide MONK with their collections;
• APIs for use by D2K and other interfaces to data stores, client software, and properties (currently "Nora Chunk") file;
• Additional D2K modules for text-analysis and text-mining (for example, modules to do clustering of named entities, modules to embody some of WordHoard's functionality), including some facilities for semantic analysis;
• Improved user-interface and visual design;
• Working demos of MONK with open-access text collections, hosted on MONK project servers;
• Beta installations of MONK alongside several large collections provided by libraries or publishers, hosted on their servers;
• Proof-of-concept social-software facilities for use in MONK;
• Analysis and proposals concerning grid-based text mining and text analysis across collections.
At the end of the two-year period for which funding is being requested in this proposal, the MONK pre-processing software will be capable of ingesting hundreds of millions of words in an offline process that would be completed before real-time exploration of the collection, and the end-user client software and the D2K itineraries it calls on will be usable at scale (hundreds of millions of words) over the web, in real time.
Phase 1 of MONK

We begin with a very broad description of the steps involved in creating a uniform testbed, because it is a good way of directing attention to the various points at which scale issues are likely to arise.

Comparable Representations of Text

Virtually any text-analysis operation turns on the opposition of 'like' and 'different'. The texts in our testbed must be at some level comparable. Comparability is achieved by representations that are variously reductive. The card catalogue of a library is a classic model of this: books of different size, length, and content are replaced with representations of them on cards of equal size, which contain heavily standardized information. Reading the bibliographical record is not quite the same as reading the book. On the other hand, looking closely at a bibliography may tell you more about a subject than reading any individual item in it.

With current computing technology and storage capacities, cataloguing each word in every book of a library becomes a feasible thing to do. A catalog record for a word occurrence is by itself a very primitive kind of thing: an occurrence of the word 'loves' declares "I am the third person singular of the verb 'love' and I live at address X in text Y." The provision of such minimal grammatical and positional information is called part-of-speech or POS tagging.
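
A catalog record of this kind can be pictured as a small immutable data structure. The sketch below is an assumption for illustration only: the field names and the tag label are hypothetical and do not reproduce the tag set of WordHoard, Nora, or any particular tagger.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WordOccurrence:
    """One catalog record for one word occurrence (hypothetical field names)."""
    spelling: str   # the word as it appears in the text
    lemma: str      # the dictionary headword it belongs to
    pos: str        # grammatical description assigned by the tagger
    text_id: str    # which text the word lives in ("text Y")
    address: int    # where in that text it lives ("address X")

    def declare(self):
        """Spell out the record in the words used in the paragraph above."""
        return (f"I am the {self.pos} of '{self.lemma}' and I live at "
                f"address {self.address} in text {self.text_id}.")

occurrence = WordOccurrence(
    spelling="loves",
    lemma="love",
    pos="third person singular verb form",  # illustrative label, not a real tag
    text_id="Y",
    address=42,
)
```

Calling `occurrence.declare()` returns the sentence-style declaration; in a real datastore the same fields would more likely be columns in a table than Python objects.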

In printed English texts before 1800 there is much orthographical variance, and such variance is also found in representations of dialect. Unlike a human reader, a POS tagger cannot identify 'university' and 'vniuersitie' as mere variants of the same word. Where you have a significant amount of orthographic variance, you need to apply a layer of "virtual orthographic standardization" (VOS) so that the tagging encounters its word tokens in a form that it recognizes. More importantly, such standardization lets modern users look for a word in the standard form that they are familiar with.
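
One naive way to picture virtual orthographic standardization is to generate plausible modern respellings of an old form and accept the first one found in a modern word list. The sketch below assumes just two rewrite rules (u/v interchange and a final "ie" read as "y") and a toy lexicon. It is not the method used by NUPOS or WordHoard, and every name in it is hypothetical.

```python
from itertools import product

# Toy stand-in for a modern word list; a real one would hold many thousands of entries.
MODERN_LEXICON = {"university", "love", "have", "upon"}

def candidates(spelling):
    """Yield respellings produced by the two illustrative rules."""
    form = spelling.lower()
    forms = [form]
    if form.endswith("ie"):
        forms.append(form[:-2] + "y")        # rule: final "ie" may stand for "y"
    for base in forms:
        # rule: every u or v may stand for either letter
        slots = [("u", "v") if ch in "uv" else (ch,) for ch in base]
        for combo in product(*slots):
            yield "".join(combo)

def standardize(spelling):
    """Return the first candidate found in the lexicon, else the lowercased original."""
    for candidate in candidates(spelling):
        if candidate in MODERN_LEXICON:
            return candidate
    return spelling.lower()
```

With this toy lexicon, `standardize("vniuersitie")` yields `"university"`, while a form the rules cannot resolve is passed through unchanged. The original spelling is never overwritten, which is what makes the standardization "virtual". The number of candidates grows exponentially with the number of u/v letters, so a production routine would need a smarter search.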

The VOS-POS process permits the creation of another representation of a text: the text as "bag of words" that contains a list of all the distinct words with a count of their occurrences. This representation is quite basic in its importance to many kinds of statistical analysis of text. Once documents are reduced to word lists, they become eminently comparable, too. Given the trouble writers take to put words in a particular order, it may seem an insult to point out that for some purposes that order is not important, but many useful practices within information retrieval can be carried out with catalogues of words that carry no record of the original sequence in which those words occurred. The combination of more traditional catalog information from the top level of a document with catalog information from the bottom level of its word occurrences creates remarkably powerful tools for analysis.
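
The loss of word order in a bag of words is easy to demonstrate. In the minimal sketch below, two sentences with opposite meanings reduce to the same bag; the sentences and function name are illustrative assumptions.

```python
from collections import Counter

def bag_of_words(tokens):
    """Reduce a token sequence to distinct words with occurrence counts."""
    return Counter(tokens)

first = "the dog bites the man".split()
second = "the man bites the dog".split()

bag_first = bag_of_words(first)
bag_second = bag_of_words(second)
```

Here `bag_first == bag_second` holds even though the sentences differ, which is exactly the reduction that makes documents of any length directly comparable.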

Phil Burns in Northwestern's Academic Technologies has recently been working on data pre-processing routines for early texts. This work has grown out of WordHoard but intersects with a project funded by the Council of Library Initiatives at the CIC to provide virtual orthographic standardization and part-of-speech tagging for the large collection of 15th to early 18th century texts in the Text Creation Partnership project. The purpose of this project, provisionally named NUPOS, is to develop a tag set, tagging rules, and workflow routines that can cope with characteristic orthographic, lexical, and syntactic features of English and American letters before 1900. (The part-of-speech taggers in common use, such as GATE and CLAWS, are very much focused on 20th century English and by default come with the Wall Street Journal or something like it as their training corpus.) Some tentative initial test results from NUPOS are very promising, with an error rate of 3.5%, which is in the ballpark of acceptable taggers. If these results are confirmed in further testing, NUPOS is likely to form the basis of preprocessing routines in MONK.

Information about the middle level of texts (phrases, sentences, semantic units) is much harder to catalogue and compare. Texts are diverse in structure, and encoding practices differ widely. Certain classes of texts (e.g. plays) are fairly universally divided into speeches that combine into scenes and acts, and something can be made of that. But for the most part MONK will focus on metadata that are derived from the top and bottom levels of documents.

Broadly speaking, then, the search space of MONK is created by reducing texts to simplified representations: silhouettes, if you will. Analytical procedures, whether direct searches or data mining, target these silhouettes separately or in combination. The justification for such simplification is found in the results it yields. With respect to any individual data-point, it has been the practice of both Nora and WordHoard to allow the user to see that word in its original context in the document, even though combining that requirement with the bag-of-words approach to analysis has raised some interesting technical challenges. It may be that others have already developed solutions to those challenges. For example, in IBM's Unstructured Information Management Architecture, researchers have developed "a common representation system called the CAS or Common Analysis Structure. The CAS is used to provide analysis engines with read access to the artifact being analyzed (e.g., document, image, video, etc) and read/write access to the analysis results or annotations associated with defined regions of the artifact. Regions may correspond to words, sentences or paragraphs in text" (http://www.research.ibm.com/UIMA/). We might be able to use this structure, in combination with D2K's analytical itineraries and our datastores, to produce a generalized text-mining environment.
Managing expectations

Some problems of scale may be solvable only by changing the user's expectations. Nora aimed at keeping the time of any given operation "within the acceptable time limits of the World Wide Web." This may, however, be an unrealistic constraint for an application that aims at supporting scholarly inquiry into large and complex archives. Consider evolutionary biologists (who are textual scholars of a peculiar kind), comparing the "texts" of various genomes to get a handle on their relationships. They use various kinds of software to do this, and they are very happy if their newest desktop computer does in hours what previously took days, but they still expect certain procedures to run for days, perhaps even weeks. Twenty years ago, classicists were delighted when Pandora, a HyperCard program, took only forty minutes to execute a search through all of the TLG. Today it takes seconds. But while the power of computers has grown enormously and you can do a lot more, it may never be the case that all things worth doing can be done instantly. Psychologically, it may be easier to accept processes that take hours or days than to accept ones that take minutes: processes that take minutes encourage you to wait and count the seconds impatiently. Managing the horizon of expectations is therefore an important part of interface design. In both Nora and WordHoard, it often takes longer to construct a data set than to perform an operation on it. Some scale problems could be solved by having users request sub-collections tailored to their interest from "the stacks" and come back later to the "reading room" to work with those collections in real time, once they've been notified by email that their sub-collection is ready.
Search and display procedures

Many analytical routines in Nora and WordHoard are forms of text categorization or make use of it. Text categorization broadly understood is any form of text summary. Authors give their documents a title. A catalog number in the LC or Dewey system provides a lot of text categorization. So does a "bag of words" model of a text. Indeed, it is a reasonable assumption that the relative frequencies of words in a given text may tell you more about it than a title or catalog number, especially if you have some way of seeing those relative frequencies in contrast to some presumed norm.
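
One common way to see relative frequencies "in contrast to some presumed norm" is a log-likelihood score that compares a word's count in an analysis text with its count in a reference corpus. The sketch below uses the familiar two-term form of that statistic as an assumed example; this document does not specify which measure Nora or WordHoard uses, and the toy counts are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in c analysis words and b times in d reference words."""
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    g2 = 0.0
    if a:
        g2 += a * math.log(a / expected_a)
    if b:
        g2 += b * math.log(b / expected_b)
    return 2 * g2

def overused_words(analysis, reference):
    """Rank words that are relatively more frequent in the analysis text."""
    c = sum(analysis.values())
    d = sum(reference.values())
    scores = {}
    for word, a in analysis.items():
        b = reference.get(word, 0)
        if a / c > b / d:                      # keep only relative overuse
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores, key=scores.get, reverse=True)

# Invented toy counts: a short text against a larger reference corpus.
analysis_counts = Counter({"love": 30, "the": 70})
reference_counts = Counter({"love": 10, "the": 690, "war": 300})
ranking = overused_words(analysis_counts, reference_counts)
```

In this toy case "love" ranks far above "the", because its share of the analysis text is thirty times its share of the reference corpus while "the" is used at nearly the same rate in both.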

Text categorization in the narrower context of information retrieval involves routines by which you seek to foreground a set of features in one or more texts against a relevant background of some other text or group of texts. Scale problems in MONK will arise to a considerable extent from the fact that this "relevant background" shifts with the user's interests. A user's exploration typically involves a long chain of steps. If you know that this chain will always or most of the time involve certain steps, you can precompute those steps and then combine them very quickly, even if they are not always combined in the same order or if not all steps are included. But if you want to give the user the power to build analytical procedures from the bottom up, then by definition you don't know in advance what steps will be taken, and that limits the ability to compute partial result sets. This has implications for both the software's designers and its users.
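
The precomputation idea can be illustrated with per-document word counts that are calculated once and then summed on demand for whatever foreground and background the user chooses. This is a minimal sketch with invented document ids and counts; it does not describe the actual Nora or WordHoard datastores.

```python
from collections import Counter

# Assumed to be computed once, offline, when each document is ingested.
PER_DOCUMENT_COUNTS = {
    "hamlet": Counter({"death": 3, "love": 1}),
    "othello": Counter({"love": 4, "death": 1}),
    "emma": Counter({"love": 5, "carriage": 2}),
}

def combine(document_ids):
    """Sum precomputed counts for any user-selected group of documents."""
    total = Counter()
    for document_id in document_ids:
        total.update(PER_DOCUMENT_COUNTS[document_id])
    return total

def background_for(foreground_ids):
    """The 'relevant background' here is simply every document not in the foreground."""
    chosen = set(foreground_ids)
    return combine(d for d in PER_DOCUMENT_COUNTS if d not in chosen)

foreground = combine(["hamlet", "othello"])
background = background_for(["hamlet", "othello"])
```

Summing small count objects is cheap, so the background can shift with every query. What cannot be prepared this way is any step whose inputs are unknown until the user defines them, which is the limit the paragraph above describes.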
Interface design

Scaling up to thousands of texts requires substantial revision of the interface. More complex forms of grouping or visualization may be required to help users make sense of result sets. In the current concordance display of WordHoard, for instance, if the result set for a search in the Shakespeare corpus is grouped by play, with each item collapsing the hits in a play, the user will encounter a list of at most forty items. A similar search in a novel corpus may retrieve results from several hundred authors. You need to find a better way of giving a user a quick grasp of the shape of a result set. These are not problems of computer performance but of using sorting, grouping, and visualization to give the user immediate cues on how to make sense of the results. Similar problems need solving at all MONK interfaces once result sets run in the hundreds or thousands rather than low dozens.
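
As one possible illustration of giving the user the "shape" of a large result set, hits can be collapsed by author, with only the most frequent groups shown and the remainder folded into a single line. The hit format, the label for the remainder, and the function name below are assumptions made for this sketch, not features of the existing interfaces.

```python
from collections import Counter

def summarize_hits(hits, top_n=3):
    """Collapse (author, work, line) hits into the top_n authors plus one remainder row."""
    per_author = Counter(author for author, _work, _line in hits)
    top = per_author.most_common(top_n)
    remainder = sum(per_author.values()) - sum(count for _author, count in top)
    summary = list(top)
    if remainder:
        summary.append(("(all other authors)", remainder))
    return summary

# Invented search hits for illustration.
hits = [
    ("Austen", "Emma", 10), ("Austen", "Emma", 55), ("Austen", "Persuasion", 7),
    ("Dickens", "Bleak House", 3), ("Dickens", "Bleak House", 90),
    ("Scott", "Waverley", 12),
    ("Eliot", "Middlemarch", 40),
]
summary = summarize_hits(hits, top_n=2)
```

The same grouping could be applied by decade, genre, or collection, and the collapsed rows could be expanded on demand.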

Phase 2 of MONK

MONK Deployed with Existing Collections

We believe that in order to see real adoption of the MONK tools, we will need to present them along with the collections that people are already using. In the second phase of MONK, we will seek out some large and interesting collections, beginning with libraries at some of the institutions participating in MONK, and will offer to work with the owners of those collections to deploy MONK on their servers, within their security and authentication schemes. Some of these collections would provide a good test for the claim that a project like MONK encourages new forms of research. For example, consider the possibilities of text-analysis at scale, across a collection of British and American fiction. With a few exceptions, graduate students specializing in American fiction do not read much English fiction and the other way round. If you put a comprehensive collection of fiction in English (from both sides of the Atlantic) in a common analytical environment, will it lead researchers to look more at the other side of the Atlantic when dealing with the 'sentimental' or the 'Gothic' or whatever? Will it lead to new ways of looking at the progressive differentiation of 'English' and 'American' fiction? It will certainly put within easy reach of many researchers quite powerful tools for asking questions of this kind: for example, a catalogue of word occurrences can be turned into a diachronic and frequency-based dictionary with very little additional effort. That is a step well worth taking, and this lexical summary would almost certainly be heavily used.
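
The step from a catalogue of word occurrences to a diachronic, frequency-based dictionary can be sketched as follows. The sketch assumes that each occurrence carries a lemma and the publication year of its text, and it groups by decade; the sample data and the choice of decades are illustrative assumptions.

```python
from collections import Counter, defaultdict

def diachronic_dictionary(occurrences):
    """Map each lemma to its relative frequency per decade.

    occurrences: iterable of (lemma, year) pairs, one per word occurrence.
    """
    words_per_decade = Counter()
    lemma_per_decade = defaultdict(Counter)
    for lemma, year in occurrences:
        decade = year - year % 10
        words_per_decade[decade] += 1
        lemma_per_decade[lemma][decade] += 1
    return {
        lemma: {decade: count / words_per_decade[decade]
                for decade, count in sorted(counts.items())}
        for lemma, counts in lemma_per_decade.items()
    }

# Invented occurrences for illustration.
sample = [
    ("gothic", 1794), ("castle", 1794),
    ("gothic", 1818), ("castle", 1818), ("castle", 1819), ("train", 1818),
]
dictionary = diachronic_dictionary(sample)
```

Each entry then shows when a word first appears in the collection and how its share of all words rises or falls over time.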
Grid computing

Nora and WordHoard are running on fairly ordinary servers. We need to explore what kinds of improvements one might achieve by parallelizing processes and running them across clusters of computers with more powerful CPUs and faster disks. As we approach the problem of trying to allow users to select their "sub-collections" from collections provided by different libraries or publishers, we will also be in the territory of grid computing, where the user wants on-demand access to networked computational resources, complete with virtualized storage (in which items contributed from the various distributed collections can be treated, for purposes of analysis, as a single sub-collection). We recognize the very great technical and policy challenges that attend this vision of the future (our "phase 3," above), and we think addressing those challenges is the central mission of those organizations and entities interested in cyberinfrastructure, in any domain. Our hope with the MONK project is not to provide some unilateral solution to the problems raised here, even if only for the humanities, but rather to provide a well-developed test case that raises these challenges in ways that will be immediately recognizable to those who work on cyberinfrastructure in other domains, but also in ways that embody the research needs and practices of humanities scholars. Inasmuch as the Andrew W. Mellon Foundation will be involved in these larger discussions of cyberinfrastructure, both from a technical and a scholarly perspective, we hope that they might help to ensure that our test case is on the table in those discussions, and that we have the opportunity to help shape the collective solutions that emerge.
Social software

If MONK is going to be a fully manipulable large-scale collection of word-objects, then it should support collaborative work. In the digital library, individuals working on their own can do things that they could never do in a print world, but the exploration of collections through something like MONK also lends itself to collaborative work or scholarly barn-raising and benefits from collaborative software that permits users in a common project to share the results of their explorations, enrich the data, and generate annotations, public or private.

In principle, any combination of texts should be available for user manipulation. In practice, users will beat certain paths and demarcate boundaries that are rarely if ever crossed. Users will cross lines of genre or period, but not typically both: there will not be many cases in which a theological treatise from the sixteenth century and a Victorian novel are called up in the same procedure. In both Nora and WordHoard, users construct data sets for subsequent manipulation. In Nora you can select from a list of pre-defined collections ('sentimental' novels, Emily Dickinson's poems) and then perform certain kinds of analysis on those collections. In WordHoard you can build work sets (Shakespeare comedies before 1596) or word sets (words spoken by women in comedies) and manipulate them in various ways. Such collections or data sets are computationally expensive to pre-process in ways that make them useful for analysis, but once that's been done, the resulting datastore is cheap to keep on hand. Thus one could imagine a structure in which large library collections (and perhaps distributed library collections) are the source from which persistent sub-collections have been drawn, and these sub-collections are visible to other users and are pre-processed in ways that allow them to respond to analysis immediately, as long as users stay within their boundaries. If not, then back to the stacks for the longer process of building another user-selected sub-collection.
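
The trade-off described above (expensive to build, cheap to keep) is essentially a cache of pre-processed sub-collections keyed by their member texts. The Python sketch below illustrates that idea under stated assumptions: the class `SubCollectionStore`, the `toy_preprocess` stand-in, and the text identifiers are hypothetical and do not correspond to any Nora or WordHoard component.

```python
class SubCollectionStore:
    """Keeps pre-processed sub-collections on hand, keyed by member texts."""

    def __init__(self, preprocess):
        self._preprocess = preprocess  # the expensive step (hypothetical)
        self._store = {}               # frozenset of text ids -> processed data
        self.builds = 0                # how many times the expensive step ran

    def get(self, text_ids):
        """Return the processed sub-collection, building it only if needed."""
        key = frozenset(text_ids)
        if key not in self._store:
            # "Back to the stacks": run the slow build for a new boundary.
            self._store[key] = self._preprocess(sorted(key))
            self.builds += 1
        return self._store[key]


def toy_preprocess(text_ids):
    """Stand-in for real pre-processing: records which texts were included."""
    return {"texts": text_ids, "indexed": True}


store = SubCollectionStore(toy_preprocess)
first = store.get(["emma", "persuasion"])
second = store.get(["persuasion", "emma"])  # same members, different order
third = store.get(["emma", "moby-dick"])    # new boundary, so it is built
print(store.builds)
```

Keying on the set of member texts rather than on a user or a session is what would make a sub-collection built by one user immediately available to others who stay within the same boundaries.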

Annotation is a basic concept in data-mining, and it was the core of a 'seminar-ware component' planned for WordHoard (a component that did not develop as far as hoped). A simple version of location-bound annotation is nonetheless in operation in an internal version of WordHoard, and there is a prototype of 'concept-bound' annotation, tied to a phenomenon that has more than one occurrence in the text (e.g. lemmata). Another developers' group at Northwestern has done very striking work on annotation software for images and media (ProjectPad). Work on annotation software will remain an important part of Academic Technologies at Northwestern, and the general goal is to develop a module that is project- and platform-independent or, more modestly, can be attached to different platforms or projects with relatively little effort.
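
The difference between the two kinds of annotation can be sketched as a data structure. In the hypothetical Python example below, a location-bound annotation is attached to one place in one text, while a concept-bound annotation is attached to a lemma and so surfaces wherever that lemma occurs. The class names, fields, and the `notes_for` helper are assumptions made for illustration and do not reflect the actual WordHoard prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocationAnnotation:
    """A note attached to one place in one text."""
    work: str
    line: int
    note: str


@dataclass(frozen=True)
class ConceptAnnotation:
    """A note attached to a recurring phenomenon, here a lemma."""
    lemma: str
    note: str


def notes_for(work, line, lemmas_at_line, annotations):
    """Return all notes visible at (work, line), given the lemmas found there."""
    found = []
    for a in annotations:
        if isinstance(a, LocationAnnotation):
            if a.work == work and a.line == line:
                found.append(a.note)
        elif a.lemma in lemmas_at_line:
            found.append(a.note)
    return found


# Invented sample annotations, for illustration only.
annotations = [
    LocationAnnotation("Play A", 56, "note on this passage"),
    ConceptAnnotation("be", "note on the lemma 'be'"),
]

here = notes_for("Play A", 56, {"be", "question"}, annotations)
elsewhere = notes_for("Play B", 10, {"be"}, annotations)
print(here)
print(elsewhere)
```

The concept-bound note appears in both lookups because it follows the lemma rather than the location, which is the property that makes it useful across a large collection.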

The relevance of this to MONK should be kept in mind, because the availability of a collaborative annotation module would greatly improve the utility of MONK for a variety of scholarly and pedagogical projects. Whether a solid framework for the generation and display of annotation can be built within, or attached to, MONK through some combination of D2K and UIMA is a question worth pursuing.

Roles in MONK
Tasks

The following table gives an overview of areas of responsibility and activity in MONK, based in part on past experience and accomplishments in Nora and in WordHoard. The table is roughly organized into tasks and institutional participants, which include faculty or staff from UIUC, Northwestern, Maryland, Humanities Visualization Project (including McMaster and Alberta), Nebraska, and NCSA.

Task Illinois Northwestern Maryland HumViz Nebraska NCSA
Software design & systems integration X X * * * X
Rights & licenses X X
Data preparation * *
Data storage and retrieval X X * X
Use Cases * * * * X
Data exploration X X * * * X
Visualization * * X X *
Interface design * X *
Visual design * X
Library integration * X
Evaluation X *
Documentation and project management X *
X indicates a lead role (which may be shared); * indicates a collaborative role

Personnel
Biographical sketches for all non-student personnel are included in Appendix B. Boldfacing in the list below indicates Co-PI at participating institution: curricula vitae for Co-PIs are included in Appendix C, in alphabetical order by last name.

Bernhard Ács, Database Architect, NCSA
Loretta Auvil, Senior Project Coordinator, NCSA
Philip Burns, Senior Academic Technologies Software Developer, Scholarly Technologies Group, Information Technology, Northwestern University
Vered Goren, Research Programmer, NCSA
Matthew Kirschenbaum, Assistant Professor of English, Associate Director of the Maryland Institute for Technology in the Humanities, University of Maryland
Martin Mueller, Professor of English and Classics, Northwestern University
John Norstad, Lead Academic Technologies Software Developer, Scholarly Technologies Group, Information Technology, Northwestern University
Joseph Paris, Senior Academic Technologies Systems Engineer, Scholarly Technologies Group, Information Technology, Northwestern University
William Parod, Architect for Scholarly Technologies, Scholarly Technologies Group, Information Technology, Northwestern University
Stephen Ramsay, Assistant Professor of English, University of Nebraska
Stan Ruecker, Assistant Professor of Humanities Computing, Department of English and Film Studies, University of Alberta; co-director of the Humanities Visualization Project
Stefan Sinclair, Assistant Professor of Multimedia, McMaster University; co-director of the Humanities Visualization Project
Martha Nell Smith, Professor of English, University of Maryland
John Unsworth, Dean and Professor, Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign
Brian Pytlik Zillig, Assistant Professor on the Library faculty and Digital Initiatives Librarian, Center for Digital Research in the Humanities, University of Nebraska
Institutional Partners

The Automated Learning Group (ALG) at NCSA collaborates with researchers on new computer methods and applies the results to historical data to improve future decisions. This field of study, often called data mining, has already produced high-value applications in areas such as customer modeling, manufacturing optimization, and fraud detection. The primary goal of the Automated Learning Group is to extend the state of the art in the field of data mining. Toward that end, we collaborate with researchers to invent new approaches and tools that will become the basis for future commercial software. Development efforts are primarily fueled by data and problems brought to us by our industrial, government, and academic partners. The algorithms and solutions developed are then made available to partners and collaborators through web repositories, tutorials, and direct collaboration with ALG group members. By this process our partners have access to new methods long before they become commercially available. One of the core activities of the Automated Learning Group is the development of D2K (Data to Knowledge). D2K is a rapid, flexible data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection, with data and information visualization tools. It offers a visual programming environment that allows users to connect programming modules together to build data mining applications and supplies a core set of modules, application templates, and a standard API for software component development.

Our collaborators at Alberta (Ruecker) and McMaster (Sinclair) have been engaged in the Humanities Visualization Project (HumViz). Their research has focused on experimenting with designs and prototypes for a new generation of visualization tools that are better suited to the richly encoded electronic texts that are currently available. Among the initial experimental deliverables of HumViz are the Digital Play Book (an environment for viewing the movements and speeches of characters in TEI-encoded plays) and the Mandala Browser (a generalized XML-aware environment for exploring and searching data collections through direct manipulation of items). Alberta and McMaster are also nodes in the Text Analysis Portal for Research (TAPoR) Project, an initiative to build an online work environment for scholars working with electronic texts. McMaster in particular has led the development of the actual TAPoR Portal software, with the assistance of Open Sky Solutions, a software development company specializing in adaptation of open source software for academic projects, in particular the Apache Software Foundation suite of open source Java applications and the MySQL database.

The Northwestern MONK team consists of Phil Burns, John Norstad, Joseph Paris, and Bill Parod in Academic Technologies and Martin Mueller, Professor of English and Classics. Burns, Mueller, Norstad, and Parod have worked together on developing WordHoard. Their collective strengths are general architecture design, statistics (in particular statistics applicable to Natural Language Processing), and the design of interfaces that make very complex query environments as user-friendly as possible. Joseph Paris joins the MONK effort for his experience in cluster and grid computing, as well as visualization experience gained with the Argonne Futures Lab. Burns, Norstad, Paris, and Parod are staff in Academic Technologies (AT), a department of Northwestern University Information Technology that supports faculty in their primary roles as instructors and researchers. Academic Technologies also provides access to educational technologies and various multimedia resources for the larger Northwestern community. AT works in close partnership with experts from the NU Library to provide "one-stop" service to faculty from a joint home office in the University Library on the Evanston campus. Within AT, the Scholarly Technologies staff (headed by Parod) has been working on digital archives and associated tools for research and teaching for over 10 years: this work has been accomplished with the help of a number of research grants secured in partnership with NU faculty. ST's current work includes object design for cultural heritage sites accommodating 3D models and photography, 2D high resolution photography, archeological artifacts, and associated metadata; software for natural language processing, delivery, and analysis of text transcriptions; and basic work on the services of digital archives.

The Graduate School of Library and Information Science (GSLIS) at the University of Illinois, Urbana-Champaign is home to two Mellon-funded projects on data-mining using D2K, the Nora project (headed by Dean John Unsworth) and the Music Information Retrieval project, headed by J. Stephen Downie. The School also hosts the Library Research Center, which will be under new leadership (Professor Carole Palmer) by the time the MONK project begins, and which will be focusing on scientific and scholarly communication. Also in GSLIS is the Information Science Research Lab, with senior research scientist David Dubin (who specializes in statistical research methods) and Senior Programmer Amit Kumar, who has been working with the Nora project over the last year (since he moved to Illinois from the Maryland Institute for Technology in the Humanities, where he worked with Matt Kirschenbaum).

The MONK group at the University of Maryland brings together talent and expertise from both a working digital humanities center, the Maryland Institute for Technology in the Humanities (MITH), and the Human Computer Interaction Lab, founded by Ben Shneiderman and widely regarded as one of the finest such groups in the world. In addition, the team includes significant representation in literary scholarship, as multiple team members are based in the Maryland English department. This particular combination positions Maryland to assume lead responsibility in both interface design (identifying what features users will require in order to work with the tools effectively) and evaluation (one team member, Catherine Plaisant, literally co-wrote the book on user needs evaluation, while Martha Nell Smith has been especially effective in recruiting "rank and file" literary scholars in the testing and evaluation of the Nora tools produced to date). Furthermore, the established expertise in information visualization at HCIL, coupled with Matthew Kirschenbaum's longstanding interests and experimentation in this area, prepares Maryland to share lead responsibility for visualization design with the Humanities Visualization Project (Alberta and McMaster).

A joint initiative of the University of Nebraska-Lincoln Libraries and the UNL College of Arts and Sciences, the Center for Digital Research in the Humanities (CDRH) is a collaborative, interdisciplinary research center for creating unique digital content, developing text analysis and visualization tools, and advancing knowledge (and refinement) of international standards for humanities computing. Many of its projects are conducted in partnership with the University of Nebraska Press, 19th Century Studies, and the Plains Humanities Alliance (one of the nine original NEH-funded regional humanities centers). The Center has received funding totaling $1.7 million over seven years and is co-directed by Kenneth M. Price and Katherine L. Walter. A portion of the Center's funding supports research faculty fellowships, a Council on Library and Information Resources (CLIR) Postdoctoral Fellowship in the Humanities, and the Nebraska Digital Workshop. The Workshop is an annual forum to improve, showcase, and critically evaluate the digital humanities work of the best early-career scholars. For information on the Center for Digital Research in the Humanities, see http://cdrh.unl.edu.

Timeline

January 15, 2007: Nora ends; Phase 1 of MONK begins

June 15, 2007: Data pre-processing routines stabilized and packaged as
a separate D2K application.
APIs for D2K interface to data stores, client software,
and properties (currently "Nora Chunk") file.

January 15, 2008: Phase 2 of MONK begins.
Additional D2K modules for text-analysis and text-mining.
Improved user-interface and visual design.
Working demos of MONK with open-access text collections

June 15, 2008: Beta installations of MONK alongside several large collections
provided by libraries or publishers, hosted on their servers.

September 15, 2008: Analysis and proposals concerning grid-based text-mining
and text-analysis across collections; possible proposal for Phase 3
of MONK.

January 15, 2009: Proof-of-concept social-software facilities for use in MONK
Phase 2 of MONK ends.
