|
This page last changed on Jul 25, 2007 by martinmueller@northwestern.edu.
To the Monks
|
This is an attempt to synthesize stuff on the Use and Users, Analytics, and Interface cells. Think of it as an inventory of things you can do in Monk, organized in a stepwise fashion, from first to last, and from the simple to the complex. Everything that has been discussed as a must-have or may-have and received some support in the discussions should be reflected here.
There are things that users want to do. Call them user wants. There are things that the system can deliver. Call them analytics. And there is the miraculous User Interface that mediates between them.
I also offer this as a trial balloon for shared writing between the non-technical and technical folks. The left hand column is designed to be fully intelligible to the student of literature whom we want to attract as a user. The right column has no such rhetorical constraints and may be as elliptical or full of tech talk as the Monks feel happy with. I'm not sure whether this will work, but it's worth a try to ensure that as little as possible gets "lost in translation."
You will find it a little more trouble writing comments in this form rather than adding them as comments. In the source file, the left column has a width=40% attribute. You write your comments in the column with the width=54% attribute. I have also flagged it with 'Tech Talk'. There is an empty middle column to keep a space. While it may be a little harder to write this way, it will be easier for the reader because the non-technical and technical items are on, so to speak, facing pages of a bilingual work. You are of course entirely welcome to add to or edit the left column, but the rule there is "plain English only." Point out the instances where I have violated that rule.
|
|
Tech Talk |
Some Guiding Assumptions
|
The left hand text of what follows is addressed to "you," the reincarnation of the "gentle reader" of yore as the intended user of a digital text archive. You may be a student, a teacher, a scholar, or a retired accountant. But whoever and wherever you are we assume that the following will be more or less true of each of you:
- You like to read, you have a special interest in texts that we call 'literary', and you are willing to spend time, energy, as well as ingenuity and patience in the study of texts that are of particular concern to you.
- You have an interest in what is said in a book, but you also have an interest in how it is said.
- You believe that if you want to know something well you need to look at it from a distance and from up close. The "figure in the carpet," to quote the title of a Henry James story, is made up of thousands of tiny little knots, and just as a connoisseur of carpets will take pleasure in examining these knots, so the reader will pay attention to the minute details of verbal texture on which large-scale effects rest. If you are into Oriental rugs you have used a magnifying glass to study the knots in a carpet more closely and from the underside as well. You will be just as happy to explore digital tools for close textual analysis and remember that 'text' and 'textile' come from the same Latin word for 'weaving'.
- On the other hand, you cannot read all the books you want to read, not to speak of the books you don't want to read but need to know something about anyhow. As long as there have been books there have been ways of "not-reading" them, whether by skimming (not easily done in the world of scrolls), reading reviews (not until the eighteenth century), depending on someone else's 'cherry-picking' (florilegia or anthologies), taking somebody else's word, thinking a knowledge of the title is good enough, or just pretending to have read them anyhow. In the digital world the number of texts has grown, and with it the "Importance of Not-Reading" has assumed a new urgency. You are willing to use the help of the computer to orient yourself rapidly in an unfamiliar archive or to see a familiar text in a new light when it is profiled against a large digital archive. And you are willing to entertain the bet that when it comes to techniques of "not-reading" digital tools do a rather better job than their predigital avatars.
- Perhaps you are a digital enthusiast and have been converted, like the reader in Rilke's sonnet about the archaic torso of Apollo with its closing line "you must change your life" (Du musst dein Leben aendern). But perhaps you are more taken with this quotation from Douglas Engelbart's famous 1962 report "Augmenting Human Intellect":
"You're probably waiting for something impressive. What I'm trying to prime you for, though, is the realization that the impressive new tricks all are based upon lots of changes in the little things you do. This computerized system is used over and over and over again to help me do little things--where my methods and ways of handling little things are changed until, lo, they've added up and suddenly I can do impressive new things."
As the Scots say "many a mickle makes a muckle." You need not change your life if you don't feel like it. You can just use digital tools to help you keep track of a lot of mickles, which they are very good at and you are probably not so good at.
- Finally and very importantly, there is the assumption that however eager you are to make intelligent use of digital tools you have no desire to go "beyond" reading in the sense of leaving it behind and replacing it with something else. You will be like Antaeus, whom nobody could conquer as long as he kept his feet on the earth. Hercules eventually defeated him by lifting him up and strangling him in mid-air. You don't want to be Hercules in this scenario, but like Antaeus you will always want to return to the earth of the text.
Some important design principles flow from this. The various text analysis routines of MONK stand at various distances from reading in the traditional sense and its associated activities. Looking up a word in a digital dictionary or concordance is not very different from doing it in print. Performing cluster analysis on a whole set of books is very different from reading, but the results may deeply shape how or what you read next. Thus typical use scenarios are likely to be iterative and shuttle between directed searches and data-driven discovery routines in ways that should be driven by the goals of your project.
Sara says: This reminds me of the mantra - "We're not trying to take the human out of Humanities Computing."
|
|
Tech Talk |
A few words about language
|
Written texts of any kind and in any language, unless they are very short or highly unusual, consist of two layers of words. There is a relatively small set of core words (< 2,000) that occur very frequently and a very large set of words that are quite rare. In Shakespeare, who is utterly typical in this regard, 10% of the 18,000 distinct lemmata account for 90% of all word occurrences. Over a third of all lemmata occur only once, and almost two thirds occur four times or less.
If you want to think of a text as a textile, you can imagine it as a sequin-studded dress in which a basic fabric, woven from the strands of the core vocabulary, is decorated with many types of sequins that differ in shape or colour, each type quite rare. It follows from this fundamental structure of all texts that the character of a particular text is shaped by the weave of its common words and the repertoire of its rare words. Texts will differ from each other in the weave of their basic fabric (more strands of this, fewer of that) as much as in the distribution of rare words.
If your interest in a collection of texts is primarily driven by information retrieval you are likely to focus on the rare or less common words because they are likely to be more distinctive indicators of what the text is about. On the other hand, a text's way of being in the world and addressing its readers is much more likely to be shaped by the author's lexical and syntactic habits that are reflected in the weave of the core vocabulary. Readers respond keenly and tacitly to differences in the 'qualitas' or 'howness' of a text, but it is practically impossible for them to keep count of multiple subtle shifts in the use of common words that collectively create that 'howness'. Some statistical analytics that on the surface are least readerly can in fact become very useful tools in helping scholars to identify the quantitative changes that in the aggregate amount to a perception of qualitative difference. To return once more to Douglas Engelbart (the man who first thought of the computer mouse): "This computerized system is used over and over and over again to help me do little things."
|
|
Tech Talk |
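The two-layer claim about vocabulary can be checked on any lemmatized text with a few counts. The following is a minimal sketch, not a MONK routine: the function name `coverage_profile` and the toy sentence are made up for illustration, and a real run would feed in the lemma of every token in a work.

```python
from collections import Counter

def coverage_profile(lemmata, top_share=0.10):
    """Summarize how unevenly word occurrences are spread over distinct lemmata."""
    counts = Counter(lemmata)
    ranked = [n for _, n in counts.most_common()]  # frequencies, highest first
    total = sum(ranked)
    top_n = max(1, round(len(ranked) * top_share))
    return {
        "distinct": len(ranked),
        # share of all occurrences covered by the most frequent lemmata
        "top_coverage": sum(ranked[:top_n]) / total,
        # share of distinct lemmata that occur exactly once
        "once_only": sum(1 for n in ranked if n == 1) / len(ranked),
        # share of distinct lemmata that occur four times or fewer
        "four_or_fewer": sum(1 for n in ranked if n <= 4) / len(ranked),
    }

# Toy input (hypothetical); it is far too short to show Shakespeare-like proportions.
sample = ("the cat sat on the mat and the dog sat on the log "
          "while the owl and the fox and the hen looked on").split()
profile = coverage_profile(sample)
print(profile)
```

On a full corpus the `top_coverage` figure is the one to compare with the "10% of lemmata, 90% of occurrences" observation.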
Find out what is in the collection
|
Any Monk environment will have an aggregate of texts from one or more collections. Each text has a bibliographical description that tells you about its
- title
- date of origin
- place of origin
- sex of author
- genre (fiction, prose, poetry, drama)
- extent (count of word tokens)
- the collection (if any) that it belongs to
Sara says: Maybe edition information?
MM: edition information should certainly be in the header. Whether it is a search criterion across different texts seems to me another question.
You can use these criteria, separately or in combination, to make up lists, grouping and sorting them in various ways, for instance a list of novels written in England by women between 1830 and 1870.
You can use such a list to define a "work set," that is to say, a set of texts on which you perform various operations, ranging from simple look-ups to complex text mining routines.
If a Monk environment includes texts from different collections, you can always "mix and match," search across collections or construct your own sub-collection from different collections. Full interoperability across texts from different collections is a cardinal feature of any Monk environment.
Sara says: It's important to make the distinction that a "work set" doesn't necessarily need to be made from full "texts." Perhaps a scholar is interested in portions of a full text - certain chapters or paragraphs (the ubiquitous "chunks"). For instance, if I were interested in watching the trends in one novel - in my case, I have a hypothesis that there's a seismograph of emotion or affect in a novel - I could create a work set that consists of the chapters from that novel.
MM: Agreed, and an oversight on my part. Both WordHoard and Nora have exactly this feature, but call it by different names.
|
|
Tech Talk
Is this the Collection Browser? MM |
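One way to picture the definition of a work set from bibliographical criteria is a filter over catalogue records. This is a sketch under stated assumptions: the `TextRecord` fields mirror the list in the left column, but the class, the `work_set` function, and the placeholder titles and collections are hypothetical and say nothing about how MONK actually stores its catalogue.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextRecord:
    title: str
    year: int
    place: str
    author_sex: str   # "F" or "M"
    genre: str
    word_tokens: int
    collection: str

def work_set(catalogue, **criteria):
    """Return the records for which every named predicate is true."""
    return [rec for rec in catalogue
            if all(test(getattr(rec, name)) for name, test in criteria.items())]

# Placeholder records, not real catalogue data.
catalogue = [
    TextRecord("Novel A", 1847, "England", "F", "fiction", 180_000, "Collection X"),
    TextRecord("Novel B", 1859, "England", "M", "fiction", 140_000, "Collection X"),
    TextRecord("Poems C", 1850, "England", "F", "poetry", 20_000, "Collection Y"),
    TextRecord("Novel D", 1871, "England", "F", "fiction", 320_000, "Collection Y"),
]

# "Novels written in England by women between 1830 and 1870"
selected = work_set(
    catalogue,
    genre=lambda g: g == "fiction",
    place=lambda p: p == "England",
    author_sex=lambda s: s == "F",
    year=lambda y: 1830 <= y <= 1870,
)
print([rec.title for rec in selected])
```

Because the criteria are independent predicates, the same call works across collections, which is the "mix and match" behaviour described in the left column.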
Save work sets and generally keep track of what you are doing
|
You can register and set up an account in Monk. If you do so, you can save work sets and return to them at a later point without having to redefine them. By the same token, you can save the results of particular searches or operations performed on these work sets.
Monk will aim to overcome, as much as possible, the 'stateless' condition of a Web environment. 'Stateless' means that any user act is an isolated event and as such unrelated to what happened before or will follow. In a maximally 'stateful' environment, on the other hand, the system keeps track of everything you do. Not only are the results of particular procedures available, but the procedures themselves may be repeated. Keeping track of everything may be almost as bad as starting from scratch every time you do anything. At this point, we have not figured out how much "state" will be useful and affordable. |
|
Tech Talk
Specify workset(s) of interest and save them (e.g. the entire Dickinson collection, only 1 book in a collection of 10 books, all the book chapters that include the word "witch" in them, all the texts written by a particular author, all texts older than 1700 that can be analyzed at the sentence level, all texts for which the POS tagging has been cleaned up by experts, i.e. not done automatically, a set of texts hand-picked one by one, etc.). In all cases users should be able to specify whether the "extra" materials are wanted or not (e.g. title page, preface, etc.). This task needs to be done while having access to the text, if it is available. Users DO need to be able to navigate the hierarchy of texts in the collections and see >meaningful< titles for all the chunks (CP) |
Share work sets and data derived from them
|
In a later version of Monk, if not in the original one, you can share with others the things you save in your account. |
|
|
Get a literal 'overview' of a collection through visualization
|
There are some simple visualization routines that let you survey the texts in your MONK environment. You can look for fiction texts between 1700 and 1900, group them by a third of a century, and get a bar chart that shows you the number of words written respectively by male and female novelists, with the number of texts (and perhaps the average length) showing up in the chart as well. This is very primitive information indeed, but it tells you at a glance how much there is in the collection, what is the balance of male and female writers, and whether novels are getting longer or shorter. Did Hawthorne have a point in complaining about all those "scribbling women"? |
|
TechTalk
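The overview chart in the left column boils down to a group-and-sum over catalogue records. The sketch below is illustrative only: the records are invented `(year, author_sex, word_count)` tuples, the period boundaries (years 0-32, 33-66, 67-99 of each century) are an assumption, and the text bars stand in for a real chart.

```python
from collections import defaultdict

def third_of_century(year):
    """Map a year to the first year of its third-of-a-century bucket, e.g. 1847 -> 1833."""
    century = year // 100 * 100
    offset = year - century
    return century + (0 if offset < 33 else 33 if offset < 67 else 67)

def overview(records):
    """Count texts and sum words per (period, author_sex)."""
    table = defaultdict(lambda: {"texts": 0, "words": 0})
    for year, sex, words in records:
        cell = table[(third_of_century(year), sex)]
        cell["texts"] += 1
        cell["words"] += words
    return dict(table)

# Invented records: (year, author_sex, word_count)
records = [(1847, "F", 180_000), (1848, "F", 150_000),
           (1859, "M", 140_000), (1871, "F", 320_000)]
summary = overview(records)

for (period, sex), cell in sorted(summary.items()):
    average = cell["words"] // cell["texts"]
    bar = "#" * (cell["words"] // 50_000)   # one mark per 50,000 words
    print(period, sex, f"texts={cell['texts']}", f"avg={average}", bar)
```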
|
Get a comparative profile of a text or set of texts
|
Consider the statement that "she is tall for a three-year old, but her father is very short."
You are quite confident from this statement that the father is still much taller than the daughter because you immediately evaluate "tall" and "short" against quite firm expectations about the average height of a three-year old child or an adult male. Most of our quantitative judgements are immediately and tacitly relative or "towards something" as Plato charmingly put it. And whatever our attitudes towards "quantitative" inquiries, we continually make quantitative judgments that within their vague boundaries are quite firm: a short vacation does not last four weeks, nor a long movie thirty minutes, unless you say something like
"it was only thirty minutes but felt like the longest movie I'd ever seen."
Now consider this statement: "This text contains 134,589 words in 12,345 sentences. Average word length is 4.57 letters, and average sentence length is 10.9 words." This set of observations will be useless to most people because it does not occur in a "horizon of expectations" that gives meaning to it. By contrast, the statement "This 700 page novel is a must-read" may provoke the immediate response: "not for me." Pages are something that we are used to tacitly measuring, and a significant benefit of a book is that it carries a lot of quantitative information. You see immediately whether a book is "big," and a few seconds' look and feel encounter with the thickness of the paper, font size and margin width will give you enough information to revise your initial estimate upwards or downwards.
These are quantitative judgments of considerable precision within a fairly clearly understood margin of error. Monk offers you a lot of "descriptive statistics" about a text or texts of your choice, but it puts them in a context that gives you the equivalent of information that you associate with the height and weight of a human male or female at a particular age.
You can choose the default context or define your own context from the combination of parameters that are available at the collection level. The default context looks at the genre of your text and its origin by half-century. The profile for an individual text lets you see, probably in the form of a bar chart, comparative information about word count, sentence count, word length, and sentence length. And you may immediately spot a text that is relatively short but has very long sentences with lots of long words in them. |
|
TechTalk |
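A comparative profile can be thought of as the usual descriptive statistics expressed relative to a reference set. The sketch below assumes texts arrive as lists of sentences, each a list of word strings; the function names `describe` and `compare` and the use of a simple ratio to the reference average are illustrative choices, not a description of what MONK computes.

```python
from statistics import mean

def describe(sentences):
    """Descriptive statistics for one text given as a list of word lists."""
    words = [w for s in sentences for w in s]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "word_length": mean(len(w) for w in words),
        "sentence_length": len(words) / len(sentences),
    }

def compare(text, reference_texts):
    """Express each measure of `text` as a ratio to the reference average."""
    own = describe(text)
    refs = [describe(t) for t in reference_texts]
    return {key: own[key] / mean(r[key] for r in refs) for key in own}

# Toy data: one text with a single long sentence, two reference texts with short ones.
text = [["long", "sentences", "wander", "on", "and", "on"]]
reference = [
    [["short", "one"], ["and", "another"]],
    [["two", "words"], ["here", "too"]],
]
ratios = compare(text, reference)
print(ratios)
```

A ratio above 1 means "more than is usual in this context", which is the textual equivalent of "tall for a three-year old".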
Histograms
|
A more detailed look lets you call up histograms that provide detailed information about the distribution of sentences or words by length, again against the background of similar histograms for default or customized reference data.
A histogram is a very primitive but very powerful tool for displaying the distribution of phenomena that vary on a continuous range, such as income, age, or weight. You divide the continuum into arbitrary but equally spaced ranges or 'bins'. Then you put the various phenomena in the appropriate bins and find out how many there are in each bin. In the most typical display of such sorting, each bin becomes a bar in a bar chart, and if your data are 'normally distributed', you will see something approaching the Bell curve, with the bins in the middle having the highest bars.
There are a lot of textual phenomena for which a histogram offers a helpful first orientation. Comparing the histograms of sentence length for two texts may be quite striking. You will discover that histograms of textual phenomena will usually not follow anything like a normal distribution.
Histograms are very tedious to assemble by hand, but they are quickly and easily produced by a computer, and they are easily interpreted. |
|
Tech Talk |
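The binning procedure described in the left column is short enough to sketch directly. The sentence lengths below are invented, and the bin width of 10 is an arbitrary choice, as the text says bin boundaries always are.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per equally spaced bin; each key is the lower bound of a bin."""
    return Counter((v // bin_width) * bin_width for v in values)

def show(hist, bin_width):
    """Print one text bar per bin, including empty bins between the extremes."""
    for low in range(0, max(hist) + bin_width, bin_width):
        print(f"{low:>3}-{low + bin_width - 1:<3} {'#' * hist.get(low, 0)}")

# Invented sentence lengths (in words) for one short text.
sentence_lengths = [3, 7, 8, 12, 14, 15, 19, 22, 41]
hist = histogram(sentence_lengths, bin_width=10)
show(hist, 10)
```

The single sentence of 41 words sits well away from the rest, the kind of long tail that keeps textual data from looking like a Bell curve.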
Read the Text
|
Whatever else you want to do in Monk, there will always be times when you want to just read the text, and text display in a pleasing and functional way is a central design goal.
Display arbitrarily chosen passages from the same text side by side
When it comes to reading a single page, books win hands-down over any screen display. But with one copy of a book you cannot display passages from different pages unless they happen to be on facing pages in your edition. On a computer screen of sufficient size you can have side-by-side display of arbitrarily chosen passages from the same text in virtually the same field of vision. In many situations of close reading this is a crucial advantage, and the Monk interface tries to maximize it. It is particularly useful in combination with another feature, where the computer wins hands-down over the book: the concordance. |
|
Tech Talk
This is where Stan might talk about the document reader |
Search
|
"Seek, and ye shall find," Jesus famously said in the Sermon on the Mount (Mat 7.7), leaving it conveniently open whether what you end up finding is what you set out seeking. Often it is not, but unless you seek something you will not find anything. That is the point of serendipity, a term coined by Horace Walpole in 1754 in a letter quoted in the very useful Wikipedia entry:
I once read a silly fairy tale, called The Three Princes of Serendip: as their highnesses travelled, they were always making discoveries, by accidents and sagacity, of things which they were not in quest of: for instance, one of them discovered that a mule blind of the right eye had travelled the same road lately, because the grass was eaten only on the left side, where it was worse than on the right--now do you understand serendipity?
Enhancing what Walpole elsewhere in this letter calls "accidental sagacity" is a key goal of this project. MONK offers a variety of 'analytics' or particular search procedures, which range from the highly directed to "fishing expeditions." You may want to think of these 'analytics' as useful tools for practicing the art of noticing.
|
|
Tech Talk |
What you must know about the MONK environment before you can seek or find anything
|
You may engage in highly directed searches that proceed from a hunch or formally articulated hypothesis. Or you may pretend that you have no hypothesis but want to go fishing with the computer and let its pattern searching algorithms act as a dragnet in the hope that something interesting may turn up. It does often enough, especially if you have a knack for recognizing it when it happens, and no computer will ever substitute for that knack. "Therein the patient must minister to himself," as his doctor reminded the dying Dr. Johnson in a famous anecdote.
But whether you hunt or fish, it helps to know something about the way texts are structured in the MONK environment. At the collection level, there is a catalog record for every text, and it is very much like a library catalogue record in that it records discrete information in highly structured fields. At the document level, each document is linguistically annotated. There is something very much like a catalog entry for every location in every text and it contains "metadata" or formal descriptions of
- The location of the word by work and location in the work
- The spelling resident at that location
- The standard spelling of that spelling: orthographic variance by date or region is a common thing in English
- The part of speech or POS tag
- the lemma of the word form or the form in which it typically appears in a dictionary
This process of dividing a text stream into discrete items with descriptive metadata is called "tokenization," and a word that has been thus isolated and identified is a "token."
The process of tokenization not only establishes firmly where one word ends and another begins (a far from unproblematical decision); it also establishes sentence boundaries. Once you know where words begin and end and where sentences begin and end you have a number of "count items," things the system keeps track of, computes, and compares in various ways. Given the explicit information about the status of each token and about sentence boundaries, the system can draw out the implications and create information about
- How many sentences there are
- How many words there are in each sentence
- How often each word occurs in the work
All this is conceptually quite simple, though it can be quite hairy in its technical details. The important thing to grasp is that a search of any kind is never run against the text that you see displayed on the screen in the way in which a look-up in a dictionary is executed quite literally "on" the page where the word is found. All data about a word in the print edition of the OED is visible to the naked eye. By contrast, all searches in MONK are run against the metadata and data created by drawing out the implications of the explicitly recorded metadata.
|
|
Tech Talk |
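The per-location "catalog entry" and the counts derived from it can be sketched as a small record type. Everything here is an assumption made for illustration: the field names, the part-of-speech labels, and the six invented tokens are not the MONK or MorphAdorner schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    work: str        # which text the token belongs to
    position: int    # sequential position within the work
    sentence: int    # index of the containing sentence
    spelling: str    # the spelling found at that location
    standard: str    # the standardized spelling
    pos: str         # part-of-speech tag (labels here are made up)
    lemma: str       # dictionary headword

# Two invented three-word sentences.
tokens = [
    Token("w1", 0, 0, "He", "he", "pron", "he"),
    Token("w1", 1, 0, "louyth", "loveth", "verb", "love"),
    Token("w1", 2, 0, "her", "her", "pron", "she"),
    Token("w1", 3, 1, "She", "she", "pron", "she"),
    Token("w1", 4, 1, "loueth", "loveth", "verb", "love"),
    Token("w1", 5, 1, "him", "him", "pron", "he"),
]

# "Count items" derived from the explicit metadata:
words_per_sentence = Counter(t.sentence for t in tokens)  # words in each sentence
sentence_count = len(words_per_sentence)                  # how many sentences
lemma_frequency = Counter(t.lemma for t in tokens)        # how often each word occurs

print(sentence_count, dict(words_per_sentence), dict(lemma_frequency))
```

Note that none of the derived counts touches the displayed text; they are computed from the token records alone, which is the point made at the end of this section.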
A Tale of Two Catalogues
|
Most analytics in MONK boil down to some query that combines constraints from the collection level with constraints at the word occurrence level. That is why much of the analytics in MONK may be called a tale of two catalogues. An inquiry into whether men and women use verbs differently depends on first, using the collection catalogue to create sets of texts written by male and female writers and second, using the word catalogues for each text to identify verb forms.
There are good reasons why a majority of analytic procedures are based on information from the top (bibliographical) and bottom (word occurrence) level. You can readily model any collection of texts as a set of items with uniform bibliographical descriptions. You can also model every text as a sequence of sentences consisting of sequences of words as long as you can tolerate an error rate on the order of 3% or less. That is a lot of individual errors in a large collection, but in the aggregate they are extremely unlikely to affect the validity of the results of queries that within seconds or minutes range across hundreds of documents or millions of words.
It is, however, impossible to develop a consistent descriptive model for texts at a higher discursive level than the sentences. Even 'sentences' are problematic at the margins, but when you enter a world of paragraphs, chapters, sections, scenes, acts, etc. you enter a world of irreducible diversity. Subsets of texts share common structural features: plays are divided into acts and scenes and novels into chapters or parts and chapters. But there are no "chunks" (a usefully colloquial piece of Tech Talk) that run consistently across diverse collections, and how to "chunk" a particular text into sub-units remains a persistent problem that must be managed but cannot really be solved.
Sara says: Would it be possible for a user to define a chunk - for example a "scene" that consists of multiple paragraphs that stretch across two chapters?
MM: If the chunk begins inside one chapter and moves to a point inside another chapter, and the text were tokenized with MorphAdorner, which uses sequential token IDs, you can easily define any text range as beginning with one token ID and ending with another. Or so I think. The trick would be to make you look at the token IDs and enter them as start and end points for your desired chunk.
|
|
Tech Talk |
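The "tale of two catalogues" can be sketched as a join between a collection-level table and a word-level table. The data, the tag labels, and the function `pos_share_by` are all invented for illustration; the sketch only shows the shape of the query behind a question such as whether men and women use verbs differently.

```python
from collections import defaultdict

# Collection-level catalogue: one record per text (values are illustrative).
texts = {
    "t1": {"author_sex": "F"},
    "t2": {"author_sex": "M"},
    "t3": {"author_sex": "F"},
}

# Word-level catalogue: (text id, lemma, part of speech) for each token.
tokens = [
    ("t1", "walk", "verb"), ("t1", "road", "noun"),
    ("t2", "see", "verb"), ("t2", "go", "verb"),
    ("t2", "hill", "noun"), ("t2", "sea", "noun"),
    ("t3", "sing", "verb"), ("t3", "hear", "verb"),
]

def pos_share_by(field, pos):
    """Share of tokens tagged `pos`, grouped by a collection-level field."""
    hits, totals = defaultdict(int), defaultdict(int)
    for text_id, _lemma, tag in tokens:
        group = texts[text_id][field]   # look up the top-level catalogue
        totals[group] += 1
        if tag == pos:                  # test the bottom-level catalogue
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

shares = pos_share_by("author_sex", "verb")
print(shares)
```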
Look up words
|
Looking up a word is the simplest search routine. It is a centuries-old practice, and the digital environment is much better at supporting it than the print environment. Just because it is simple does not mean it is not important. For many users it is the thing they do most often, and for some users it is the only thing they do.
Simple look-up can be activated by clicking on any word in the text, which is the equivalent of entering the word in a search box. In the digital world, the standard return for a search is not a definition, as in a dictionary, but a list of the passages in which the word occurs. This is the 'concordance', to use the print term, or the keyword-in-context or KWIC, which is the common digital term. There is more to be said about the ways in which the results of a search are reported if the "hit list" grows beyond the handful of returns that you can take in more or less at a glance.
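Tech Talk: a keyword-in-context list is easy to sketch over a list of tokens. The function below is a minimal illustration, not the MONK concordance; the name `kwic`, the `width` parameter, and the case-insensitive match are assumptions.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = "Love is not love which alters when it alteration finds".split()
hits = kwic(text, "love", width=2)
for left, word, right in hits:
    # right-align the left context so the keywords line up in a column
    print(f"{left:>12} [{word}] {right}")
```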
|
|
|
Constrain a lookup in various ways
|
In the simplest look-up you constrain your search by a spelling and you look for all the passages in which a given spelling occurs. This is helpful if the word you are looking for is rare (and most words are rare), but it is unhelpful if you look up a common word in a large collection. The return list may take hours or days to work through. So you will want to constrain or refine your search in various ways.
|
|
Tech Talk |
Refine your search at the collection level by bibliographical criteria
It is an obvious strategy to look up a word only in those texts that you are interested in. All the techniques that let you define a subset of works for special inquiry can be used to constrain the look-up of a particular word. If you want to focus on female English novelists between 1830 and 1870 you can limit your search to texts written by them.
|
|
Tech Talk |
Refine a search by proximity or Boolean criteria
|
Very often it is helpful to look for all occurrences of a given word in the vicinity of another word. 'Love' and 'death' are each very common words, but the number of passages in which 'love' occurs within, say, ten words of 'death' is much smaller.
Technically speaking, a proximity search is a special kind of Boolean "AND" search, where you look for documents that contain both one word and another but you add the additional constraint that the words should be separated by no more than a specified distance. |
|
Tech Talk |
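A proximity search adds a distance test to a Boolean AND. The sketch below works on one tokenized text; the function name `near`, the exact-match comparison, and the made-up sample sentence are assumptions for illustration.

```python
def near(tokens, a, b, max_distance=10):
    """Return (i, j) position pairs where word a lies within max_distance words of word b."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    # Both words must occur (the AND), and the pair must be close enough.
    return [(i, j) for i in pos_a for j in pos_b if abs(i - j) <= max_distance]

# Made-up sample sentence.
tokens = ("love is strong as death and jealousy is cruel as the grave "
          "but love endures").split()

close = near(tokens, "love", "death", max_distance=5)
wide = near(tokens, "love", "death", max_distance=10)
print(close, wide)
```

Widening the window from five to ten words picks up the second 'love', which shows how the distance setting trades precision for recall.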
Expand a search by looking for a word at different levels
|
Remember that a simple search returns a list of word locations. In the catalog record of that location its content is captured as an instance of a spelling, a standard spelling, a morphosyntactic condition, or a lemma. Thus
- a search for the spelling 'louyth' returns all instances of that spelling.
- a search for the standard spelling 'loveth' returns all instances of 'loveth', 'loueth', 'louyth' and the like.
- a search for the lemma 'love' will ask you whether you are looking for the noun or the verb, and if you specify the verb it returns all spellings of all different morphosyntactic conditions in which the verb 'love' occurs.
You can also think of word locations more abstractly as instances of morphosyntactic conditions and look for verbs or nouns, or plural nouns, or third person singular forms of verbs. That is likely to produce very large result lists. It is discussed more fully in the section "Look for unknown words that meet specified criteria". |
|
Tech Talk |
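Searching "at different levels" amounts to matching against different fields of the same location record. In this sketch each location is a tuple of (spelling, standard spelling, part of speech, lemma); the tuple layout, the tag labels, and the `search` function are hypothetical.

```python
# Each word location: (spelling, standard spelling, part of speech, lemma).
locations = [
    ("louyth", "loveth", "verb", "love"),
    ("loueth", "loveth", "verb", "love"),
    ("loved", "loved", "verb", "love"),
    ("loue", "love", "noun", "love"),
    ("hateth", "hateth", "verb", "hate"),
]
LEVELS = {"spelling": 0, "standard": 1, "pos": 2, "lemma": 3}

def search(**constraints):
    """Return the indexes of locations that match every level=value constraint."""
    return [i for i, loc in enumerate(locations)
            if all(loc[LEVELS[level]] == value
                   for level, value in constraints.items())]

by_spelling = search(spelling="louyth")        # one exact spelling
by_standard = search(standard="loveth")        # all spellings of 'loveth'
by_lemma = search(lemma="love", pos="verb")    # every form of the verb 'love'
by_pos = search(pos="verb")                    # the abstract, very large case
print(by_spelling, by_standard, by_lemma, by_pos)
```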
Expand a search through a 'basket of words'
|
Sometimes you want to expand rather than refine your search parameters. If you are interested in a particular concept there may be more than one word to express it. In Monk you can make up a 'basket of words' and look them up as if they were a single term. You can save such a 'basket' (term expansion is a more technical term), just as you can save a collection of works, and you can edit it by adding to, or subtracting from, it.
In Boolean terms, a basket of words is an example of a Boolean 'OR' search: you look for the documents that contain word A or word B or word C etc. |
|
Tech Talk |
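A 'basket of words' behaves like a saved set that is matched with Boolean OR. The sketch below uses Python sets for the basket, so adding and subtracting words are ordinary set operations; the document ids, the sample sentences, and the function name are invented.

```python
def documents_with_any(documents, basket):
    """Boolean OR: ids of documents that contain at least one word from the basket."""
    wanted = {w.lower() for w in basket}
    return sorted(doc_id for doc_id, words in documents.items()
                  if wanted & {w.lower() for w in words})

documents = {
    "d1": "Give me liberty or give me death".split(),
    "d2": "The truth shall make you free".split(),
    "d3": "A quiet evening by the fire".split(),
}

freedom_basket = {"liberty", "freedom", "free"}
hits = documents_with_any(documents, freedom_basket)

# Editing the basket: subtract one word, then search again.
narrower = documents_with_any(documents, freedom_basket - {"free"})
print(hits, narrower)
```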
Expand a search through 'wild cards' or 'regular expressions'
|
You can also expand a search by not completely specifying a word but looking for all words whose spelling contains, begins or ends with some string, say 'nat'. MONK supports 'regular expression' searching, which is a particularly powerful and flexible form of wildcard searching. |
|
Tech Talk |
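The 'nat' example can be tried with any regular expression engine. The sketch below uses Python's `re` module as a stand-in; whether MONK's pattern syntax matches Python's is an assumption, and the small vocabulary list is invented.

```python
import re

vocabulary = ["nation", "native", "unnatural", "senate", "notion"]

def matching(pattern, words):
    """Return the words whose whole spelling matches the regular expression."""
    rx = re.compile(pattern)
    return [w for w in words if rx.fullmatch(w)]

begins = matching(r"nat.*", vocabulary)       # spelling begins with 'nat'
contains = matching(r".*nat.*", vocabulary)   # spelling contains 'nat' anywhere
flexible = matching(r"(un)?nat\w+", vocabulary)  # optional 'un' prefix, then 'nat...'
print(begins, contains, flexible)
```

The last pattern shows what a regular expression adds to a plain wildcard: the optional prefix is stated once instead of being run as two separate searches.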
Look for unknown words that meet specified criteria.
|
Not every word search needs to take the form of a dictionary look-up and start from some string of letters that define particular spellings wholly or in part. You can also look for words in terms of the metadata that are associated with them in the collection. Such searches typically produce lists of words, often quite long lists. Thus you may look for adjectives that occur more than five times, adjectives that occur in both Jane Austen and George Eliot, or words that do not occur before or after a certain date in your collection.
Searches of this kind typically combine parameters from the collection level (bibliographical data) with parameters from the text level (nouns occurring in verse, etc).
|
|
Tech Talk |
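Searches that return word lists rather than passages can be sketched as set operations over per-author frequency counts. The author labels, tags, and tokens below are placeholders, not data about Jane Austen or George Eliot, and the threshold of "more than once" stands in for "more than five times" on a toy sample.

```python
from collections import Counter

# (author, lemma, part of speech) for each token; all values are placeholders.
tokens = [
    ("Author A", "fine", "adj"), ("Author A", "fine", "adj"),
    ("Author A", "amiable", "adj"), ("Author A", "think", "verb"),
    ("Author B", "fine", "adj"), ("Author B", "ardent", "adj"),
    ("Author B", "think", "verb"),
]

def lemma_counts(author, pos):
    """Frequency of each lemma with the given tag in one author's texts."""
    return Counter(lemma for a, lemma, p in tokens if a == author and p == pos)

adj_a = lemma_counts("Author A", "adj")
adj_b = lemma_counts("Author B", "adj")

shared = sorted(set(adj_a) & set(adj_b))                   # adjectives in both authors
frequent_in_a = sorted(w for w, n in adj_a.items() if n > 1)  # above a frequency threshold
only_in_a = sorted(set(adj_a) - set(adj_b))                # adjectives unique to one author
print(shared, frequent_in_a, only_in_a)
```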
Group, sort, filter, and visualize result sets
|
If your search retrieves a lot of hits you want the computer to help you with making sense of your results. Sometimes you can do this by closely defining your search to begin with and excluding results you will not need. But this does not always work. If, for instance, you are interested in the use of the definite article, the most common word in English, you will always get a lot of results.
It is also clear that in such cases it makes little sense to work your way through a long concordance. You want procedures that let you group and sort your result set. And since any interesting results will turn on difference in quantity, you will want to 'see' those differences in some graphic display that is easier to grasp than a list of numbers. You will also want to look at the numbers, but not until after you have 'seen' them.
The grouping and sorting will follow the criteria that are explicitly recorded in or can be implicitly derived from the bibliographical data at the collection level. Thus in the extreme case of 'the', you can chart the usage of the word over time, whether by decade or generation. You will base such charts on relative frequencies rather than raw counts, and you can determine whether the word is used differently at different times or by different authors.
|
|
Tech Talk
Some of these functionalities are well implemented in FeatureLens |
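Charting a word by decade on the basis of relative frequencies rather than raw counts can be sketched as follows. The texts are invented one-liners, and the function name and the `per` scaling parameter are assumptions; a real chart would plot the returned rates.

```python
from collections import defaultdict

def relative_frequency_by_decade(texts, word, per=1000):
    """texts: iterable of (year, tokens). Returns {decade: occurrences per `per` tokens}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for year, tokens in texts:
        decade = year // 10 * 10
        totals[decade] += len(tokens)
        hits[decade] += sum(1 for t in tokens if t.lower() == word)
    # Dividing by the decade's total size makes decades of unequal size comparable.
    return {d: per * hits[d] / totals[d] for d in sorted(totals)}

# Invented miniature "texts" with years.
texts = [
    (1841, "the cat and the dog".split()),
    (1848, "a bird in the hand".split()),
    (1853, "the end of the road is the sea".split()),
]
rates = relative_frequency_by_decade(texts, "the", per=100)
print(rates)
```

In this toy data both decades have three occurrences of 'the', so raw counts would show no difference; the rates differ only because the decades differ in size.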
Compare results
|
To analyze is to compare, whether explicitly or implicitly. Many Monk features are designed to lower the time cost of comparing one thing with another. If you cycle through a concordance list, the side-by-side display of arbitrarily chosen passages lets you look at two things at the same time in the same space. This does not matter if you have an excellent memory. If you do not, it helps a great deal.
The same principle carries into other functionalities. If you see a chart tracing the fortunes of 'the' by decade or generation, you see one phenomenon at different points in time. More interestingly, you may want to look at apparent synonyms like 'liberty' and 'freedom', see charts by time or genre and see immediately whether or how their usage differs.
|
|
Tech Talk |
Filter, re-group and re-sort results
|
The result of a search in Monk is always at bottom a list of locations in which words occur.
Sara says: This seems a bit too specific. Maybe the result of a search is always at bottom a list of locations in which tokens occur - so that punctuation is included.
MM Agreed. I've been working with a model where a punctuation mark counts as a word token and should have been explicit about it.
Each location can be tied to a word, the context in which it occurs, a work, author, date, etc. Once you have the list, you can regroup and resort it in various ways. It may tell no story if sorted by frequency; it may tell an interesting story if sorted by time or genre. Once a list has been assembled, which may take minutes (or hours in extreme cases), it will take seconds to regroup: you can think of a kaleidoscope that can be shaken to reveal different configurations almost instantly.
You may be familiar with similar operations from web sites where a list of products can be sorted by price or brand. These are very simple procedures, but they are quite powerful, especially if they can be done quickly.
For many literary and linguistic use scenarios, this ability to re-group or filter initial result sets in multiple ways may be the key feature. In the end you are interested in making distinctions or identifying patterns that no algorithm can retrieve without a lot of 'noise' but that you recognize instantly when you see them. You cannot 'see' them unless you look at a lot of examples, and you want a tool that lets you cut down the time cost of working your way through them. This is a prime example of Engelbart's argument that the computer 'augments' your intellect by helping you "do little things." |
|
Tech Talk |
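One way to picture the "list of locations" and the kaleidoscope effect is the sketch below. It assumes a hypothetical `Hit` record carrying the token plus collection-level metadata; none of the names come from MONK, and the works and authors are placeholders.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    """One search result: a token at a location, plus metadata (hypothetical)."""
    token: str
    author: str
    work: str
    year: int
    genre: str

def regroup(hits, key):
    """Count hits per group; `key` maps a Hit to its group label."""
    return Counter(key(h) for h in hits)

# Placeholder result set; the expensive search happens once.
hits = [
    Hit("liberty", "Author A", "Work 1", 1815, "fiction"),
    Hit("freedom", "Author A", "Work 1", 1815, "fiction"),
    Hit("liberty", "Author B", "Work 2", 1853, "fiction"),
    Hit("liberty", "Author C", "Work 3", 1819, "verse"),
]

# Each "shake of the kaleidoscope" is a cheap pass over the same list.
by_genre = regroup(hits, lambda h: h.genre)
by_decade = regroup(hits, lambda h: h.year // 10 * 10)
by_word_and_genre = regroup(hits, lambda h: (h.token, h.genre))
in_time_order = sorted(hits, key=lambda h: h.year)
verse_only = [h for h in hits if h.genre == "verse"]
```

The point of the design is that the result set is assembled once and every regrouping, resorting, or filtering is just a different key function applied to it.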
Save or export result sets
|
The result of a search can be saved, as described above.
Some users in some contexts will want to manipulate the results of searches in programs or services outside of MONK, either because they are already familiar with their routines or because they offer functionalities that MONK does not have.
It is a key feature of MONK that you can export the results of many searches in a format that other programs can read, typically as tables in which the columns are defined by fixed width, tab delimiters, or comma-separated values. Microsoft Excel is probably the most widely used program to manipulate such tables and use them as inputs for various computations or graphic displays. ManyEyes is a Web-based service for common forms of data visualization.
FileMaker, Microsoft Access, and Minitab are other examples of programs that have, in Ben Shneiderman's terms, "low thresholds and high ceilings" for use by users with no programming skills.
|
|
Tech Talk |
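A minimal sketch of the export step, using Python's standard `csv` module. The column names (`word`, `decade`, `per_1000`), the numbers, and the helper `export_rows` are made up for illustration; MONK's actual export format is not specified here.

```python
import csv
import io

def export_rows(rows, fieldnames, delimiter=","):
    """Serialize a list of dicts as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented result table: one row per word and decade.
rows = [
    {"word": "the", "decade": 1810, "per_1000": 61.2},
    {"word": "the", "decade": 1820, "per_1000": 59.8},
]
fields = ["word", "decade", "per_1000"]

as_csv = export_rows(rows, fields)                  # comma-separated values
as_tsv = export_rows(rows, fields, delimiter="\t")  # tab-delimited
```

Either string can be written to a file and opened in a spreadsheet or statistics program. The `csv` module also takes care of quoting fields that themselves contain the delimiter.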
Perform text mining and other statistical routines
|
It is only one step, but an important one, from informal comparisons (more of this here, less of it there) to formally quantified comparisons that depend on statistical routines and let you classify unknown texts in various ways, isolate distinctive features, or offer guidance in determining whether observed phenomena lie outside the degree of variance that you would expect on a random basis.
Here the humanist enters the strange world of "machine learning," where s/he encounters assumptions and a rhetoric that are unfamiliar and sometimes repellent. So it is important to describe "machine learning" in terms that humanists can understand and separate the procedures from assumptions about goals and purposes that may not be relevant to what humanists do most of the time. Or to put it positively and use Engelbart's language again, it is important to explain the procedures in such ways that humanists can judge whether they may "augment the humanist intellect" in doing things that humanists have always done or have wanted to do but in the past have found impracticable. MONK assumes that the answer to that question is "yes," provided you have a clear understanding of what such procedures can in principle deliver and within what margins of error they operate.
Machine learning divides into "supervised" and "unsupervised" routines. Both routines depend on algorithms that extract information from textual data. In supervised learning, you provide the machine and its algorithms with examples or "training data" that tell it what to look for. In unsupervised learning the algorithms are not provided with examples.
An algorithm is a set of instructions that is formulated into a "process" that can be executed by a machine without the intervention of human intelligence at any point. The English philosopher Michael Oakeshott in his great book On Human Conduct draws a powerful distinction between 'process' and 'procedure'. A process is a 'going-on' that does not involve human intelligence. A blinking eye and a dripping faucet are examples of processes. A procedure is a 'going-on' that involves intelligence. Does the person next to you 'blink' or 'wink' at you? When surgeons perform 'procedures' or cooks add salt and pepper 'according to taste' they perform routinized actions that involve judgment calls at unpredictable stages of the action.
Machines are incapable of executing procedures in Oakeshott's sense of the term. They cannot add salt or pepper "according to taste." They execute processes designed by humans, and they never make judgment calls. They may in many situations respond to differences more quickly, accurately, and helpfully than a human would. Some new cars can sense impending collisions and perform appropriate actions such as braking or stiffening the suspension. But that is because some human anticipated that situation and told the machine exactly what to do under exactly specified conditions.
It is worth dwelling on this point in some detail because for humanist readers judgment, taste, tact, gut feelings, or other 'soft' categories are matters of supreme importance that they do not want to surrender to a machine. It is therefore critical to be modest in what one claims for machine learning and not to make claims that are false and will put off the users you want to attract.
What the machine tells you is never a substitute for human judgment. What it offers you is the result of processes that reflect the judgment of other (and often very ingenious) humans. In circumstances that call on your irreplaceable judgment, you will find it helpful to have information before you in a form that allows you to make appropriate decisions in the time available to you.
From this perspective Engelbart's phrase needs some modification. It is not really the machine that augments your intellect. It is the humans who designed that machine. You rely indirectly on the judgment of others, which is something you do all the time. The dwarf standing on the shoulders of the giant is an old and famous image of individual human intelligence relying on past human intelligence, absorbing it, and adding to it. |
|
Tech Talk |
Supervised learning
|
The spam filter on your email is a daily reminder of the utility of a simple "text classifier" that depends on supervised learning. The task of the machine is to make a yes/no judgment about incoming messages and separate the wheat of messages you should read from the chaff that you never want to see in the first place. You or somebody whose judgment you trust gives the machine examples of texts that have been classified as junk. This is the 'training set.' The machine uses its algorithms to identify various low-level linguistic phenomena ('features' in tech talk) whose presence always or above a certain threshold of frequency makes it likely that the incoming message is junk mail. The machine then uses information about these features to 'classify' incoming messages as 'junk or not'.
You could start from scratch and create your own training data by manually classifying incoming messages until the machine performs at a level that is good enough for you. It is much more likely that you will start with somebody else's judgement about what constitutes junk but modify the training data by classifying some incoming real mail as junk and rescuing some messages from the junk bin.
Spam filters make mistakes and can be fooled. In World War II, the RAF used tinsel to confuse German radar, and as a child I was delighted to find "Lametta" in the backyard (the German word for Christmas decorations). Today you often see junk mail that contains a large jumble of text below the ad. This is to confuse the spam filter, and it succeeds often enough.
A very common algorithm for classifying texts in this way is Naive Bayes. Bayes was an 18th century English mathematician who developed a formula for judging the probability of future events on the basis of past experience. The algorithm is called Naive Bayes because it operates on the assumption that the various classifying features are independent of each other, when in fact they are not. But it turns out that despite its faulty assumptions Naive Bayes works well enough in practice.
A classic and ground-breaking application of Bayesian statistics to text classification was the work of Frederick Mosteller, who in the sixties used it to resolve the question whether twelve unidentified Federalist papers should be assigned to Hamilton or Madison. Since his findings that they were Madison's supported the preponderance of judgments that historians had reached on other grounds, he may be said to have settled that question definitively.
Sara says: For the intro to my dissertation, I've been crafting an explanation of Naive Bayes. I'm including it here in case it might prove helpful. (By the way, if there's any error, do let me know so that I can fix it in my diss.)
Bayes' theorem can be expressed by the following equation:
P(c|d) = [P(d|c) * P(c)] / P(d)
This equation expresses the probability that the document d belongs to the category c. To clarify what's going on, I'll explain each item in the equation.
Let's say we want to predict whether a certain document is a letter. We've collected information about whether the text contains an address and whether it contains chapter divisions. P(c) represents the prior (or marginal) probability that the category c (it is a letter) was used. It's called "prior" because it doesn't take into account any information about the document d, which means that it indicates the probability that any given document falls into the category, regardless of what's in the document (whether it has an address or chapters). P(c|d), which is read "the probability of c, given d," indicates a different type of probability, called conditional (or posterior) probability, that the document d belongs to the category c. This probability does take into account the specified information about the documents - namely, the probability that a document falls into a certain category given its characteristics (presence of address line and absence of chapters). P(d|c) is another conditional probability, of d given c. It expresses the probability that a document does have an address line and doesn't have chapters, given the fact that we know it is a letter. Finally, P(d) represents another prior (or marginal) probability about the probability of d, without taking into account any information about category. It is the probability that a document has an address line and doesn't have chapters. P(d) represents a constant for all categories, and acts as a normalizing function. In other words, probabilistic information about the documents being evaluated is always the same. Ultimately, Bayes' theorem finds the conditional (or posterior) probability that a document belongs to a particular category by multiplying the prior probability distribution by the likelihood factor that it belongs to the category, and then dividing by the normalizing constant.
What complicates matters is that the information about the documents that is used in the classification can get really complex. Our simplified example above only took into account two variables - presence or absence of address lines and chapters. Given just a few variables, things are relatively simple; if there's an address line and there aren't chapters, it's probably a letter. But if you have also collected information about word frequencies, length of the document, parts of speech used, proper names used, and number of chapters, and if this information is also considered in the probability classifiers, they present more possible combinations of features and confuse the issue of classification. Some features are more helpful than others, depending on the classification.
To get around this problem, we use Naive Bayes classifiers. What's naive about Naive Bayes is that it assumes that each feature is independent from the other. Thus the probability that a document that 1) has an address line, 2) has no chapters, 3) is four pages, 4) uses the word "love" most frequently, and 5) uses verbs most frequently is a letter is calculated on the independent probabilities that a letter has an address line, no chapters, is less than five pages, etc. Logically, we know that there is some relationship between whether a document has an address line and lacks chapter divisions when it comes to classifying it as a letter. Nonetheless, Naive Bayes works. Making the features dependent on each other, what is called "relaxing the naive assumption," doesn't produce results that are more accurate or statistically reliable. So, Naive Bayes remains the main classification algorithm for text mining.
Another text classifying algorithm is called SVM (Support Vector Machines). The underlying mathematics are more complicated than Bayes' quite old and simple formulas, but it does not appear to produce better results.
|
|
Tech Talk |
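The letter example from the Naive Bayes explanation above can be worked through in a few lines. This is a from-scratch sketch under stated assumptions, not the classifier MONK uses: features are simply present or absent, add-one smoothing is an added choice to avoid zero probabilities, the training data are invented, and the function names `train` and `posterior` are hypothetical.

```python
import math
from collections import defaultdict

def train(examples):
    """examples: list of (set_of_features_present, label) pairs."""
    label_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    vocabulary = set()
    for features, label in examples:
        label_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocabulary.add(f)
    return label_counts, feature_counts, vocabulary

def posterior(features, model):
    """Return {label: P(label | features)} under the naive assumption."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log of the prior P(c)
        for f in vocabulary:
            # P(feature present | c), with add-one smoothing (an assumption).
            p = (feature_counts[label][f] + 1) / (n + 2)
            # Naive step: each feature contributes independently.
            score += math.log(p if f in features else 1 - p)
        log_scores[label] = score
    # Normalizing across labels plays the role of dividing by P(d).
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

# Invented training set: which documents have an address line or chapters.
examples = [
    ({"address"}, "letter"),
    ({"address"}, "letter"),
    ({"address", "chapters"}, "letter"),
    ({"chapters"}, "novel"),
    ({"chapters"}, "novel"),
    (set(), "novel"),
]
model = train(examples)
result = posterior({"address"}, model)  # address line present, no chapters
```

With these toy counts the unnormalized scores are 0.5 * 0.8 * 0.6 = 0.24 for "letter" and 0.5 * 0.2 * 0.4 = 0.04 for "novel", so the posterior for "letter" is 0.24 / 0.28, about 0.86. Working in logarithms matters only when there are many features, where multiplying small probabilities would underflow.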
How do supervised text classifying routines help literary scholars?
|
Authorship attribution is an obvious problem where text classification can be helpful, as Mosteller's example demonstrates. But it is one of those sub-disciplines that passionately engage the interest of a few while boring most others to tears. And there are not really a lot of remaining cases where the answer matters. So you have to look for other problems whose solution or analysis interests a larger group of users.
There are two related use scenarios. In one case, you have a set of texts that interest you for whatever reasons and you want to identify others that are "like them" from a large group of texts that you have neither the patience nor the time to read through. Conceptually and technically this is exactly the spam problem, except that you want to keep rather than toss the texts that the algorithms have identified with the help of your training data.
In the second case you may be interested in identifying the underlying and unknown low-level linguistic phenomena that prompt you to make the holistic judgement "this is a sentimental novel." The close relationship between these two interests is well illustrated by the following use case.
|
|
Tech Talk |
Classifying sentimental fiction
|
The investigator is interested in identifying and analyzing the lexical, syntactic, or generally rhetorical practices that lead readers to identify a novel as sentimental. Is there a genre of sentimental fiction, and what are its parameters? A subsidiary goal is to ask whether American and English novels differ interestingly in their way of being sentimental.
In Monk I she has at her disposal a body of 350 English novels written between 1700 and 1900 and a body of 300 novels written in America between 1789 and 1875. This body of fiction includes all the "great" novels, novels that are canonical because of their very "badness", e.g. The Lamplighter that Joyce poked fun at, and a lot of novels that many people read then and few people read now, e.g. Bulwer-Lytton.
She starts by building a training corpus of well-known and indeed super-canonical sentimental scenes in fiction, such as the deaths of Jo in Bleak House, Little Nell in The Old Curiosity Shop, and Eva in Uncle Tom's Cabin. The training corpus is used to find books "like it" with the help of Bayesian statistics or similar routines.
In a second step she isolates the lexical and syntactic features that are identified in the statistical procedure and in an iterative way analyses their function not only in the works classified by the procedure as sentimental but also in other works. Much of this work is of a familiar look-up kind, but the time cost of such look-ups is reduced by the group and sort capabilities of the Monk concordance tool as well as by the ability of the Monk Lexicon to organize frequency-based lexical information on a time line.
The study that results from this work might involve a return to canonical works read in the light of their wider context. Or, perhaps more promisingly, it might be a study that defines the parameters of sentimental fiction through some case studies that focus on lesser known works. The reader of the resulting study might not necessarily recognize that the research relied explicitly or implicitly on statistical routines. There might or might not be tables with numbers. But in the preface the author would write with much conviction that she could not have oriented herself in so large a fiction space in so short a time without the help of the various orientation tools provided in the Monk environment. |
|
Tech Talk |
Unsupervised Machine Learning
|
In unsupervised machine learning you rely on algorithms alone and do not provide the machine with examples of any kind. Without knowing anything about the underlying algorithms you may expect this method to be inherently less reliable and for its result to be harder to interpret because there is less information to constrain the 'decisions' of the machine.
As with supervised learning, the underlying algorithms examine the frequency of low-level 'features' and group or 'cluster' a collection of texts according to their similarity with regard to the features that the algorithms deem to be relevant. The grouping follows a branching procedure, as in a tree, and the result produces something like a genealogical chart that lets you see how closely or distantly related different texts are with regard to a putative common ancestor.
This method is quite suggestive if you have a large number of texts and want to see whether or how they "organize themselves" into relationships that an investigator can make sense of. But you need to keep in mind that the groupings suggested by cluster analysis are suggestions rather than discoveries. They need a lot of skeptical interpretation and further contextualization. In the Companion to Digital Humanities John Burrows has an essay on "Textual Analysis" that is a model of how to respond with shrewd caution to the suggestions of cluster analysis. In a general way, it appears that this technique may be of greater use to a person who already has a good sense of the lay of the land. But this may ultimately be true of all quantitative procedures. |
|
Tech Talk
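The branching, tree-like grouping described in the left column could look roughly like this. It is a toy sketch only: each text is reduced to an invented vector of relative frequencies for two features, and the choices of Euclidean distance and average linkage are assumptions made for the example, not a statement of what MONK does.

```python
import math
from itertools import combinations

def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def leaves(tree):
    """Flatten a nested-tuple tree into its list of text labels."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def cluster(profiles):
    """Agglomerative clustering; returns a binary tree of nested tuples."""
    def linkage(a, b):
        # Average distance over all cross-cluster pairs of texts.
        pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
        return sum(distance(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    clusters = list(profiles)
    while len(clusters) > 1:
        # Merge the two closest clusters into one branch of the tree.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [(a, b)]
    return clusters[0]

# Invented profiles: relative frequencies (per 100 words) of two features.
profiles = {
    "Text A": [6.1, 0.4],
    "Text B": [6.0, 0.5],
    "Text C": [4.2, 1.9],
    "Text D": [4.0, 2.0],
}
tree = cluster(profiles)
```

The nested tuples are the "genealogical chart": texts merged early are close relatives, texts joined only at the root are distant ones. Changing the distance measure or the linkage rule can change the tree, which is one reason the groupings call for skeptical interpretation.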
Burrows' Delta may be a method to incorporate into MONK. Revised and simplified versions of it appear to have a good reputation in the NLP community as a "nearest neighbor classification method." MM |
Analytics for profiling a text
|
Remember the metaphor of the text as a sequin-studded dress in which the fabric is woven by the distribution of common words and the sequins are the rare words that decorate it. The literal 'texture' arises from the combination of different frequencies. If you are interested in the details of that, how do you get at the data?
You could start counting words or get somebody's word list. But word lists with raw counts are very uninformative things. What do you learn from the fact that 'the' occurs some 28,000 times in the Shakespeare corpus? Nothing. If you are told that 'the' adds up to 3.3% of all word occurrences in Shakespeare, you have a little more knowledge, but you have no idea whether 3.3% is a lot or a little. You need a horizon of expectations within which you can make sense of the numbers and translate them into everyday categories of much, quite a bit, very few, extremely high, and so forth.
There are statistics that do just that and they are relatively straightforward to use and interpret. MONK uses Dunning's log likelihood ratio. This is a cousin of chi-square, a famous elementary statistic in which you compare two lists, each of which consists of items with their count. You make the founding assumption that the first list is the summary count of drawing items at random from a putative collection that contains some distribution of those items. Then you try to figure out how likely it is that the second list is a summary count of another random draw from the same collection.
If you apply this technique to texts you proceed as if texts were created by random draws from some wordhoard. This is of course an utterly ridiculous "as if." But however ridiculous it sounds, in any given case it lets you conclude with considerable certainty: if you assume for a moment that writing is like drawing at random from a wordhoard, the odds that these two texts (as represented by their word lists) were drawn from the same wordhoard are "quite high", "vanishingly low", or whatever other description is justified by the numbers.
In Dunning's log likelihood ratio probabilities are associated with individual words and the test is equally sensitive to underuse as to overuse. This is important because what is not said or rarely said in a text may be just as telling as what is said often.
In a test of this kind you focus on a foreground against a background. You want to see the figure in the carpet. Much depends on your choice of your background. If you profile a novel of Dickens against the set of other Dickens novels, you will register its deviations from the Dickens corpus. If you profile it against a set of 19th century novels, the results will not tell you what is specific to Dickens and what is specific to this novel. If you profile it against a multi-genre corpus across different centuries, the results may tell you more about nineteenth century fiction than about Dickens. So this is a test that depends very much on what you choose to compare with what and for what purpose. It is also a test where the most interesting results come from iterative profilings against different backgrounds.
There is a very nice example of this in Shakespeare. If you profile the tragedies against the comedies you learn that they differ most in their underuse of 'she'. If you now profile Julius Caesar against Shakespeare's tragedies, you learn that it differs most in its underuse of 'she'. But Portia's famous speech to Brutus is the most eloquent plea for marriage as a partnership of equals in all of Shakespeare. From which you learn that frequency and salience are different things.
MONK has a number of precomputed reference corpora that you can use as background for such profiling exercises. In future versions, if not in the first version, you can also customize.
Tests of this kind are poor tools for making discoveries about rare words, which will by definition show up as highly overused. But they offer a terrific tool for studying the fabric of common words, and the pattern that emerges from the aggregate of underused and overused common words may be thought of as an automatic self-indexing that establishes useful parameters for interpretation. To return to Julius Caesar, the pattern of common words consists of 'she', statistically by far the greatest outlier through its underuse, followed by the following words that are strongly overused: 'man', 'do', 'noble', 'street', 'today', 'countryman', 'friend'.
The results of this likelihood test lend themselves to very accurate and striking forms of visualization.
|
|
Tech Talk
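For reference, a per-word log likelihood score can be sketched as below. The code uses the two-cell simplification that is common in corpus linguistics (observed against expected counts of the word in foreground and background); Dunning's full statistic also takes into account the counts of all the other words. The sign convention (positive for overuse, negative for underuse), the function name, and the counts are assumptions for illustration, not MONK's implementation.

```python
import math

def log_likelihood(count_fg, size_fg, count_bg, size_bg):
    """Signed G2 for one word: positive means overused in the foreground."""
    total = count_fg + count_bg
    # Expected counts if both texts were drawn from the same wordhoard.
    expected_fg = size_fg * total / (size_fg + size_bg)
    expected_bg = size_bg * total / (size_fg + size_bg)
    g2 = 0.0
    for observed, expected in ((count_fg, expected_fg), (count_bg, expected_bg)):
        if observed > 0:  # the term 0 * ln(0) is taken to be 0
            g2 += observed * math.log(observed / expected)
    g2 *= 2
    overused = count_fg / size_fg >= count_bg / size_bg
    return g2 if overused else -g2

# Invented counts: a 20,000-word play profiled against a 200,000-word background.
underused = log_likelihood(10, 20_000, 900, 200_000)   # rare in the foreground
overused = log_likelihood(90, 20_000, 100, 200_000)    # frequent in the foreground
same_rate = log_likelihood(50, 1_000, 500, 10_000)     # identical relative frequency
```

A word used at the same rate in both corpora scores zero, and the score grows in either direction as the rates diverge, which is what makes the signed value a convenient input for sizing and coloring words in a display.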
The results of a log likelihood test may be a very good input for the Tapor Word Rain feature. You can show heavily overused words as fat blobs dropping from above and heavily underused words as pale wraiths trying to rise. This is cool and playful, but it is also informative. Since the results are logarithmic, the differences between numbers look smaller than they are; the right choice of size and color helps to draw attention to the real magnitude of difference. |
|