This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

The following is a (mostly technical) commentary on the general problem of studying "geographical awareness" in English prose literature. The general literary critical problem is more fully described elsewhere, though it is useful to recall the basic set of questions that this analytic proposes to address: "How did the expansion of the British Empire influence the way authors talk about the world (from a purely geographical standpoint)?", "How do key historical events, such as the arrival of the railroad or the telegraph, influence the way space is represented in literary narratives?", "To what degree is an author's sense of place and geography a function of where they live and work? Are there aspects of an author's 'locatedness' that generally lead toward wider or narrower narrative settings?"

The Data in Question

The most elemental data points in the study of geographical awareness (hereafter GA) are place names extracted through gazetteering, statistical named entity extraction, or some combination of these two approaches. However, it is unlikely that lists of place names will suffice. It is to be expected that other geographical markers (where a novel was published or written) and more general metadata (date of publication or composition) will also matter, because in general, the goal is to speak about GA – however we come to define this – in terms of some other feature. In other words, the question is not so much, "What patterns lie in the place names?" but "How do patterns in place name references relate to the author's gender, the time in which a work was written, its country of origin?"

It is also perhaps unwise to speak of data apart from the structures that hold it and the procedures that render it useful. In general, we are talking about counting place names, arranging them in graphs, trees, or vectors, and providing access points through which they can be retrieved and re-arranged. Still, place names, in combination with the ordinary metadata that accompany the average TEI text, are (for this monk, at least) the stuff of GA.

The Text Mining Approach

It is possible to consider GA as a classification problem. One could, for example, take all the place names in a corpus, arrange them in a vector, and populate a matrix where each row corresponds to a work. Whenever a place name from the full vector occurs in a work, the appropriate cell is given a one or zero (presence or absence of the term) or a frequency. We then label these rows using some feature that is either derived directly from the metadata (place of publication, gender, year of publication) or one that the user supplies. The learning algorithm (naive Bayes, SVM, etc.) is then trained on some subset of the data and applied to the rest. As with all such methods, the utility will perhaps depend on the user's confidence in the labels. If we are classifying on known labels (American vs. British literature, for example), we will undoubtedly be most interested in the place names that serve as the strongest classification features. If we are working with less objective labels, the system hopefully reveals further avenues for thoughtful consideration. In both cases, the "outliers" become as interesting as the successful matches.
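To make the matrix idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy corpus, the work titles, the labels, and the choice of a hand-rolled Bernoulli naive Bayes with add-one smoothing. It is not a proposal for how MONK should implement classification.

```python
import math

# Hypothetical toy corpus: work title -> (label, place names found in it).
CORPUS = {
    "work_a": ("British", ["London", "Bath", "Dorchester"]),
    "work_b": ("British", ["London", "Oxford"]),
    "work_c": ("American", ["Boston", "New York"]),
    "work_d": ("American", ["Boston", "Chicago", "London"]),
}

# The "full vector": every place name in the corpus, in a fixed order.
vocabulary = sorted({p for _, places in CORPUS.values() for p in places})

def to_row(places):
    """Presence/absence row: 1 if the place occurs in the work, else 0."""
    present = set(places)
    return [1 if name in present else 0 for name in vocabulary]

matrix = {title: to_row(places) for title, (_, places) in CORPUS.items()}
labels = {title: label for title, (label, _) in CORPUS.items()}

def train(matrix, labels):
    """Bernoulli naive Bayes: per label, a prior and P(place present | label)."""
    model = {}
    for label in set(labels.values()):
        rows = [matrix[t] for t in matrix if labels[t] == label]
        prior = len(rows) / len(matrix)
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[label] = (prior, probs)
    return model

def classify(model, row):
    """Return the label with the highest log-probability for this row."""
    def log_score(prior, probs):
        return math.log(prior) + sum(
            math.log(p if x else 1 - p) for x, p in zip(row, probs)
        )
    return max(model, key=lambda label: log_score(*model[label]))

model = train(matrix, labels)
print(classify(model, to_row(["Boston", "Chicago"])))
print(classify(model, to_row(["London", "Oxford"])))
```

In a real experiment the model would be trained on one subset of the works and applied to the rest, and the per-place probabilities in `model` would be the first place to look for the place names that separate the labels most strongly.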

The Mapping Approach

If we plot place names in geographical space, we gain access to the physical distances between the places being mentioned, and therefore also gain some quantitative data with which to pursue further text mining approaches or some other kind of analytical operation. An early nineteenth-century novel that seldom strays out of the south of England will cover less physical distance than a twentieth-century novel about jet-setters moving between California and India. Being able to plot those changes in overall distance over time (and across genres, etc.) would be most interesting.
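One way to put a number on "distance covered" is sketched below. The gazetteer entries and their approximate coordinates are hypothetical. Summing great-circle (haversine) distances between consecutive mentions is only one possible measure, chosen here for simplicity.

```python
import math

# Hypothetical gazetteer: approximate (latitude, longitude) in degrees.
GAZETTEER = {
    "London": (51.51, -0.13),
    "Bath": (51.38, -2.36),
    "Los Angeles": (34.05, -118.24),
    "Mumbai": (19.08, 72.88),
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(a, b):
    """Great-circle distance in kilometres between two gazetteer entries."""
    lat1, lon1 = map(math.radians, GAZETTEER[a])
    lat2, lon2 = map(math.radians, GAZETTEER[b])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def distance_covered(mentions):
    """Sum of distances between consecutive place mentions (an assumed measure)."""
    return sum(haversine_km(a, b) for a, b in zip(mentions, mentions[1:]))

# Two invented mention sequences standing in for the two kinds of novel.
english = ["London", "Bath", "London"]
jet_set = ["Los Angeles", "Mumbai", "Los Angeles"]

print(round(distance_covered(english)), "km")
print(round(distance_covered(jet_set)), "km")
```

Note that this measure treats every mention as a movement, which is exactly the weakness taken up in the paragraphs that follow.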

We cannot go any further, though, without mentioning an obvious flaw in this approach (and perhaps in GA more generally). If the data we have to work with consists of place name references, we cannot say anything with confidence about the role of that place in the narrative. Japan is mentioned once or twice in Lawrence's Women in Love; no one in the narrative actually goes to Japan. Percival in Woolf's The Waves spends the whole of the narrative in India, but the narrative itself never "goes there."

For this reason, GA is quite a slippery concept. However, it still seems possible to draw conclusions (over broad spaces of time and vast tracts of literature) about general trends. References to Japan in eighteenth-century novels are rare; they are quite common in novels of the British fin de siècle (as Japonisme begins to enthrall the early Modernist movement). The hope is that we can speak less anecdotally about such trends, even if the specificity behind a "reference" is rather crude.

The mapping approach is surely worth pursuing, but I have a sense that such a method will introduce highly skewed data points that will overemphasize references that aren't in any way central to the narrative, and perhaps also provide wildly varying quantifications in novels that are mostly similar in their comparatively narrow view of the geographical world. For this reason, I wouldn't want to suggest a full implementation of the mapping approach until we've conducted some small-scale experiments with it first. For the Data Cell and the wider Analytic Cell, however, it might be useful to note that some parallel data structure (that does not reside in the text data itself) might be requisite within the MONK architecture for supporting operations like these. It's hard (though not impossible) to imagine a "MONKed" document containing GIS coordinates (for example). It's hard to imagine a MONKed document containing the distances between a place name mentioned in a text and every other place name mentioned in the other texts. It would not be hard to imagine a cross-reference table that contains such information and to which the analytical operation has access. Thoughts?

The Graphing Approach

This is without doubt the most

A graph is a mathematical object consisting of a set of vertices (sometimes called nodes or points) connected to one another by edges (or lines). More formally, a graph is defined as an ordered pair

G := (V, E)
where V is a finite set of vertices and E is a set of pairs of vertices. Graphs are commonly represented visually:

When the pairs in E are ordered, the graph is called a directed graph, and is usually represented with arrows on one end of each edge:

It is possible to encode the visualization of a graph with lots of additional information. We might, for example, label the edges, vary the size or symbolism of the nodes, and add shades of color – all of which could be used to convey concepts such as difference, "weight," and category. However, the underlying mathematical structure is largely independent of such notions. It is especially important to note that the layout of the graph is mostly an aesthetic matter. This graph, for example, is mathematically identical to the previous one:

A graph does convey information about relationships and associations between vertices, but it is more akin to the way an electrical schematic or a network diagram represents relationships – eschewing metric space for something like "relational space." It would be possible, for example, to redraw an airline flight network as a graph. Such a graph might more easily convey whether it was possible to get from one place to another (or how many stopovers you would need), but it might also say nothing at all about distance or geographical proximity. Subway maps often do something like this. The London Underground map, for example, fails to represent distance and direction accurately (Victoria, St. James's Park, and Westminster are portrayed as both equidistant and lying along a perfect east-west axis), but the resulting map is much easier for a busy traveler to read and understand.
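The airline example can be sketched in a few lines of Python. The route map is hypothetical (airport codes are used only as labels, and every route is assumed to run in both directions). The graph is stored as the pair of sets V and E, with no coordinates at all, yet a breadth-first search can still answer the stopover question.

```python
from collections import deque

# Hypothetical route map: V is the set of airports, E the set of direct routes.
V = {"BOS", "ORD", "SFO", "SJC", "LHR", "HNL"}
E = {("BOS", "ORD"), ("ORD", "SFO"), ("BOS", "LHR"), ("SFO", "SJC")}

def stopovers(vertices, edges, origin, destination):
    """Fewest stopovers between two airports, or None if there is no route."""
    # Build neighbour sets; each edge is treated as usable in both directions.
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Breadth-first search visits airports in order of number of flights taken.
    flights = {origin: 0}
    queue = deque([origin])
    while queue:
        here = queue.popleft()
        if here == destination:
            return max(flights[here] - 1, 0)
        for nxt in adj[here]:
            if nxt not in flights:
                flights[nxt] = flights[here] + 1
                queue.append(nxt)
    return None

print(stopovers(V, E, "LHR", "SJC"))  # route LHR, BOS, ORD, SFO, SJC
print(stopovers(V, E, "BOS", "HNL"))  # HNL has no routes in this toy map
```

Nothing in V or E records where an airport is, which is the sense in which the graph lives in "relational space" rather than metric space.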

It is possible to describe the formal properties of graphs, and to use them as the basis for classification. Properties frequently used by graph theorists (typically as part of theorems) include:

  • order: the number of vertices in a graph (|V|)
  • size: the number of edges in a graph (|E|)
  • degree: the number of edges that meet at a particular vertex (in a simple graph, the number of vertices connected to it). This allows us to describe a graph as having a maximum degree (the degree of the vertex with the highest degree) and a directed graph as having a maximum in-degree and a maximum out-degree
  • eccentricity: the greatest distance between a vertex and any other vertex in the graph, where distance is the number of edges on a shortest path. From this we can derive the notion of graph diameter (the maximum eccentricity in the graph) and the radius (the minimum eccentricity)
  • cyclic: A walk in a graph is a sequence of vertices in which an edge leads from each vertex to the next; a path is a walk that repeats no vertex. A walk whose start vertex and end vertex are the same (with no other vertex repeated) is a cycle, and a graph that contains one is said to be cyclic, though it is often more useful to speak of a graph as being acyclic (that is, it contains no cycles). There are many subcategories of graphs based on cyclic properties. For example, a cycle that visits every vertex in the graph exactly once before returning to its start is called a Hamiltonian cycle.
  • oriented: A simple directed graph permits both edges (a,b) and (b,a) for vertices a and b. A directed graph that contains at most one of the two for every pair of vertices is said to be oriented.
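Several of these properties are easy to compute directly. The sketch below uses a small, hypothetical undirected graph and assumes it is connected (otherwise eccentricity is undefined for some vertices). The function names are my own, not part of any library.

```python
from collections import deque

def adjacency(vertices, edges):
    """Map each vertex to the set of its neighbours (undirected edges)."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def eccentricity(adj, start):
    """Greatest shortest-path distance from start; assumes a connected graph."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

# Hypothetical example: the path a-b-c-d plus one extra edge b-d.
V = {"a", "b", "c", "d"}
E = {("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")}
adj = adjacency(V, E)

order = len(V)                                   # |V|
size = len(E)                                    # |E|
max_degree = max(len(adj[v]) for v in V)
ecc = {v: eccentricity(adj, v) for v in V}
diameter = max(ecc.values())
radius = min(ecc.values())
# A connected undirected graph is acyclic (a tree) exactly when |E| = |V| - 1.
is_acyclic = size == order - 1

print(order, size, max_degree, diameter, radius, is_acyclic)
```

Numbers like these could serve as features for comparing one text's graph with another's, though which of them are meaningful for GA remains an open question.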

Studying Geographical References in Texts

Texts mention places, and it would be useful (from my standpoint, fascinating) to compare texts based on the pattern of place name references.

It is tempting to conceive of a system that can take the place name references, plot them on a map of some kind, and then measure the properties that result. However, this method is fraught with serious difficulties. For one thing, the notion of "place" is likely to occur at several levels of granularity. In one sentence, the author mentions Dorchester; in another, Dorsetshire; in another, the South of England; in another, Europe. There's also the problem of the passing reference (Jane Eyre might mention Japan, for example). In all of these cases, there really isn't any obvious way to tell whether a character went to a particular place or merely referred to it, or whether the mention of a place represents a change of scene.

It's possible, though, that when we take the pattern of place name references in individual documents across a large corpus, we can detect broad changes in the way people talk about place. But within this scheme, the distance between places might not be all that significant. More significant, it seems to me, would be the level at which a narrator or character keeps mentioning a place – coming back to it again and again – or the various sequences in which places are embedded.

Geographical References as Directed Graphs

Graphs might provide a way to study these kinds of properties – in part by focussing less on matters like granularity and distance, and more on the overall sequence in which places are mentioned. Consider the following narratives. Two travellers write postcards back East about their experiences in California. The first is on vacation:

"We've been having a lovely time! We flew to San Jose, drove up to Palo Alto, and then went from Palo Alto to San Francisco. Palo Alto is particularly lovely, with the bay on one side and the mountains on the other. We don't want to leave, but we'll be back in Boston on the 31st."

The second lives and works in the area:

"I really have the worst commute ever. I live in Palo Alto, but work in San Francisco. It doesn't sound like a big deal, but after you've spent hours in traffic each week – Palo Alto, San Francisco, Palo Alto, San Francisco – it starts to wear on you. There are days when I actually miss Boston traffic."
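As a rough sketch of where this might lead, the two postcards can be turned into weighted directed graphs in Python. The construction is my own assumption, not a settled method: a tiny hand-made gazetteer finds the mentions, and an edge runs from each mentioned place to the next different place, with a weight counting how often that transition occurs.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer: just the places named in the two postcards.
GAZETTEER = ["San Jose", "Palo Alto", "San Francisco", "Boston"]
PLACE_PATTERN = re.compile("|".join(re.escape(name) for name in GAZETTEER))

VACATION = (
    "We've been having a lovely time! We flew to San Jose, drove up to "
    "Palo Alto, and then went from Palo Alto to San Francisco. Palo Alto "
    "is particularly lovely, with the bay on one side and the mountains on "
    "the other. We don't want to leave, but we'll be back in Boston on the 31st."
)
COMMUTE = (
    "I really have the worst commute ever. I live in Palo Alto, but work in "
    "San Francisco. It doesn't sound like a big deal, but after you've spent "
    "hours in traffic each week - Palo Alto, San Francisco, Palo Alto, "
    "San Francisco - it starts to wear on you. There are days when I actually "
    "miss Boston traffic."
)

def mention_sequence(text):
    """Place names in the order they are mentioned."""
    return PLACE_PATTERN.findall(text)

def mention_graph(sequence):
    """Directed edges (from, to) -> count, skipping immediate repeats."""
    edges = Counter()
    for here, there in zip(sequence, sequence[1:]):
        if here != there:
            edges[(here, there)] += 1
    return edges

vacation_graph = mention_graph(mention_sequence(VACATION))
commute_graph = mention_graph(mention_sequence(COMMUTE))

for name, graph in [("vacation", vacation_graph), ("commute", commute_graph)]:
    print(name)
    for (here, there), weight in sorted(graph.items()):
        print(f"  {here} -> {there} (x{weight})")
```

Under these assumptions the vacation postcard yields four edges of weight one, while the commuter's postcard is dominated by a heavy back-and-forth between Palo Alto and San Francisco, even though the two texts mention almost the same places.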

More soon.


graph_analytic1.png (image/png)
graph_analytic2.png (image/png)
graph_analytic3.png (image/png)
Document generated by Confluence on Apr 19, 2009 15:04