MONK : Geographical Awareness
This page last changed on May 26, 2007 by sramsay@unlserve.unl.edu.
Geographical Awareness

Analyzing Place Name References in Literary Texts

Name: Stephen Ramsay, Imagineer

Problem/Question: Humans have developed several abstractions and visual representations for talking about geography. The most obvious example is, of course, the map: a representation of space that highlights the relationships between objects as they actually occur within that space. Yet, as Alfred Korzybski famously observed, "the map is not the territory." In the case of literary narratives, the "territory" is much more likely to be a highly distorted, cognitive artifact that encapsulates some other, far more subjective abstraction.

Studying these cognitive artifacts as they appear in literary narratives (and in particular, comparing subjective representations of space across time, genre, and author) is a vital question in literary studies, closely related to the more general study of "place" in literature. The problem manifests itself in such literary questions as "How did the expansion of the British Empire influence the way authors talk about the world (from a purely geographical standpoint)?", "How do key historical events, such as the arrival of the railroad or the telegraph, influence the way space is represented in literary narratives?", and "To what degree is an author's sense of place and geography a function of where they live and work? Are there aspects of an author's 'locatedness' that generally lead toward wider or narrower narrative settings?"

Status of the research: I am currently working on this problem using a combination of software I developed and some off-the-shelf tools (in particular, GraphViz and the Waikato Environment for Knowledge Analysis (Weka)). I have developed some software that can (a) generate directed graph visualizations of the placenames mentioned in a narrative (see http://segonku.unl.edu/graphs for some examples), and (b) compute a dozen or so mathematical properties.
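The two steps just described, building a directed graph of placenames and computing properties from it, can be illustrated with a minimal sketch. This is not the software mentioned above. It assumes, purely for illustration, that an edge runs from each place name to the next one mentioned in the narrative, and the handful of properties it computes are stand-ins for the "dozen or so" that the page does not enumerate. The function names and the sample sequence of mentions are hypothetical.

```python
from collections import Counter


def build_place_graph(mentions):
    """Map each (source, target) edge to the number of times it occurs.

    Assumption (not confirmed by the page): an edge links each place
    name to the next place name mentioned in the narrative.
    """
    return Counter(zip(mentions, mentions[1:]))


def graph_properties(edges):
    """Compute a few simple directed-graph properties (illustrative only)."""
    nodes = {place for edge in edges for place in edge}
    n = len(nodes)
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    possible = n * (n - 1)  # ordered pairs of distinct nodes
    proper_edges = sum(1 for src, dst in edges if src != dst)
    return {
        "nodes": n,
        "edges": len(edges),
        "self_loops": len(edges) - proper_edges,
        "density": proper_edges / possible if possible else 0.0,
        "max_out_degree": max(out_degree.values(), default=0),
        "max_in_degree": max(in_degree.values(), default=0),
    }


def to_dot(edges):
    """Serialize the graph in GraphViz DOT syntax, labelling edges with counts."""
    lines = ["digraph places {"]
    for (src, dst), count in sorted(edges.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sequence of place-name mentions, in narrative order.
mentions = ["London", "Paris", "London", "Calcutta", "London", "Paris"]
edges = build_place_graph(mentions)
props = graph_properties(edges)
print(props)
print(to_dot(edges))
```

Note that nothing in the property dictionary records which places were mentioned, only the shape of the graph. The DOT text can be passed to GraphViz for rendering.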
I then use those properties as the basic vectors for classification (using a data mining algorithm called Random Forest). Using this technique, I've been able to classify commanders mentioned in battle reports from the American Civil War by rank (using Brigadier Generals and Lieutenant Colonels) with about 80% accuracy, using only the mathematical properties of their geography graphs. This is very interesting, because the data I'm analyzing does not include the actual place names themselves. I realize this is a fairly technical description of the "status of the research," but that's basically where I am.

Measure of success: I think we first need to prove that my "geographical awareness" graphs are useful objects for talking about the subjective sense of place in narrative texts. If, for example, we generated graphs and then found that the data drawn from them were useful classification vectors (say, 90% accuracy or above on some label), we'd have proof that there is a correlation between the order of placenames mentioned in a narrative and some other facet of that narrative (genre, time, or what have you). So, that's clearly the first measure of success. The second measure of success is proof of the viability of "geographical awareness graphs" as an interface modality. If people can usefully organize, rearrange, and navigate a corpus using graphs as one type of "handle" for the data, we'll have produced something quite novel and significant (IMHO).

What would be a good indicator of success in this case study? For you personally (e.g. a paper, a paragraph of my thesis talking about the work I did)? Being able to analyse more than 2 books? Prove Joe Shmoe wrong?

For me personally? At least one article that talks about this as a purely technical matter (my sources tell me that data mining with graphs is a nascent research area in CS). And then at least one article that uses this method to talk about place from a purely humanistic standpoint.
After that, I suppose I'd like to get a call from Stockholm.

Texts needed in the collection:
I don't know that I need a specific text, but I want a text where the place name tags are pretty uniform and that has other "hooks" for classification vectors. So, for example, a collection of texts that mentions place names fairly frequently, but which has some utterly unambiguous date, genre, or place of composition tag somewhere.
I think this use case is by nature intended for the analysis of corpora. The corpus need not be exceedingly large, but there needs to be enough in there to see broad patterns over a collection of documents.
I've been having very good luck with the Official Reports from the various campaigns during the American Civil War, but I only have three collections (still a considerable amount of data). Any large newspaper collection would be fascinating (and we would obviously get unambiguous date markers with newspapers). All of Wright American Fiction would be nice. This corpus, which tries to be comprehensive for American fiction published between 1851 and 1875, is one of the most exciting collections we have, but it's not yet complete.
I don't think this applies to this particular use case.

Generality: What other questions might other users ask that would be similar to your question? I'd love to hear some. I suspect that I should write out how I imagine this working in more detail, but that's hard to do on a wiki.

Granularity: Can you guess the granularity(ies) you will need to use? Word, paragraph, page, book, etc.? Multiple levels? I suspect that the granularity will be fairly coarse for this one. If you're looking at novels, you probably want the whole novel (I'm not sure what chapter-by-chapter graphs would look like). If you're looking at letters, you want individual letters. If you're looking at newspapers, you probably want individual articles. Of course, whether those levels of granularity correspond to individual files on the filesystem is another matter. We certainly won't be able to assume that a collection of letters will always be in one document rather than several.

Characteristics: What low-level characteristics of the text do you think will be useful for your research (e.g. POS, n-grams, Soundex)? I think we're talking about graph properties here, and as I said, I've already developed techniques for generating graphs (in GraphML) and for generating matrices of graph properties (in ARFF). If this turns into something, I suppose we may need to think about how to get that data generated as part of the MONK pre-processing pipeline. For now, though, it might be enough to take the graph files and pour them into Amit's chunking framework.

Patterns: Can you try to express examples of complex patterns you want to identify, or hope to find? Right now, I'd like to see the mathematical properties of the graphs correlate to anything at all (because, as I mentioned above, that would prove that they are useful objects). But honestly, my greatest hope is that they identify changes in geographical awareness over time.

Morphology: Example of use? Not sure what's being asked here. Catherine?

Tags:
Some kind of predictable placeName tag (of the sort generated by programs that do gazetteering, like GATE, OpenNLP, etc.).
It's not so much a set of dream tags as it is a set of highly predictable locations for existing tags. For example, I want something that always tells me who wrote the document, when they wrote it, and what genre it is. And I want that tag to always be in the same location in the XML document (am I talking about something like Dublin Core here?). If we have some documents that give the author's name over here, and others that give it over there, we're doomed. Not just on this use case, but maybe on all of the use cases that try to look at corpora of texts.

Classification: Is classification interesting (e.g. supervised learning like in nora) or not? Give examples of questions. I think this project is all about classification, but again, there are possibilities for using it as a straight-up navigation/visualization tool as well.

Comparisons: Are comparisons between texts useful? Which comparisons? It's also all about comparison. Graphs in isolation are of very limited use. You always want to be able to compare graphs across the collection.

Topic extraction: Interested? Well, sure. But I don't think it's absolutely essential.

Lexicon, counts of words, most common occurrences, concordance: Describe need and importance. Probably useful as yet another classification vector to explore. I'm planning to experiment with some of these to see how they work.

Annotation: If you could annotate, give an example of what type of annotation you would love to have in the tool itself. I really have no idea. I'm just not far enough in my thinking on this to give a good answer.

Collaboration: If we had collaborative tools, who would you collaborate with? What annotations, results, tools, etc. would you like to share, or not share? Same as above.

Bonus question: What's your favorite example of a text analysis results paper, and why? Man, there was a paper a few years ago (in either CHUM or LLC) in which these guys looked at patterns among date references in a 100-year run of newspapers.
It was absolutely dazzling. I'll see if I can dig it up. I also like the guys who used Constraint Logic Programming to untangle the temporal sequence in Faulkner's "A Rose for Emily," just because it was such a crazy thing to do. But to be honest, I can think of many more text analysis articles that I don't like. What I have seen over and over (and I have read just about every text analysis paper written in the last 50 years) is talented people using amazing techniques to draw banal conclusions. Not because text analytical techniques are weak, but because people (for reasons I discuss in my forthcoming book
Document generated by Confluence on Apr 19, 2009 15:05