|
This page last changed on Sep 21, 2007 by plaisant@cs.umd.edu.
i.e. USER CENTERED FUNCTIONALITIES DRIVEN BY USE CASE
Vocabulary
see the lexicon
Todo
Clear requirements
The functionalities that 1) are requested in one or more use cases, 2) by consensus represent general needs, and 3) are good candidates for exploring the technology while making progress in developing tools people can use.
PRIORITY P=HIGH/MED/LOW
It would be good to assign dates: e.g. HIGH=by the end of December? MED=end of spring 2008? LOW= End of 2008?
Getting Started
PREDEFINED TOOLSET: Create Workset (i.e. sets of text of interest)
- TOOL: Browse collections. View what is available (HIGH) (by time range, author, British vs American access restriction, tools available to run on that collection). Get an estimate of size for each category.
This can be done in a simple way with lists (HIGH), but if we have the resources we can do better with an interactive browser of the facets of our collections (MED). Seeing statistics for the collections and for individual texts will be useful (MED).
- TOOL: Select chunks and save them (HIGH) (e.g. the entire Dickinson collection, only 1 book in a collection of 10 books, all the book chapters that include the word "witch", all the texts written by a particular author, all texts older than 1700 that can be analyzed at the sentence level, all texts for which the POS tagging has been cleaned up by experts, i.e. not the one done automatically, a set of texts hand-picked one by one, etc.). In all cases users should be able to specify whether the "extra" materials are wanted or not (e.g. title page, preface, etc.). This task needs to be done while having access to the text, if it is available. Users DO need to be able to navigate the hierarchy of texts in the collections (HIGH) and see >meaningful< titles for all the chunks (LOW because it needs customization).
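The selection criteria above can be read as filters over chunk metadata. The sketch below is only an illustration of that idea: the `Chunk` fields, the `is_front_matter` flag and the `select_chunks` helper are all hypothetical names, not part of any existing MONK or NORA interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record; field names are assumptions for illustration."""
    chunk_id: str
    author: str
    year: int
    text: str
    is_front_matter: bool = False  # "extra" material: title page, preface, etc.


def select_chunks(chunks, predicate, include_extras=False):
    """Keep chunks that satisfy the predicate, optionally dropping extra material."""
    return [
        c for c in chunks
        if (include_extras or not c.is_front_matter) and predicate(c)
    ]


chunks = [
    Chunk("c1", "Author A", 1650, "The witch was tried."),
    Chunk("c2", "Author A", 1650, "Preface about a witch.", is_front_matter=True),
    Chunk("c3", "Author B", 1850, "A quiet morning."),
]

# All chunks that include the word "witch", without the extra materials.
with_witch = select_chunks(chunks, lambda c: "witch" in c.text.lower())
# All texts older than 1700.
before_1700 = select_chunks(chunks, lambda c: c.year < 1700)
```

Expressing each criterion as a predicate keeps hand-picked sets, metadata filters and word searches behind one saving mechanism.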
TOOL: "Manage your MONK project(s)"
COMMENT: is it a tool in the workbench?? NO, so there is a different class of tools that are always present????
Manage multiple projects: save current work so nothing is ever lost (MED). A minimum level of history keeping would be nice (HIGH because we decided to do that early). Projects should be independent from worksets. Projects use worksets, but one can give a workset to someone else... A project represents a scholarly question one is studying; a workset is only a set of texts. A project should encompass the history of the multiple analyses (including training steps) done to a workset. A training set should be considered as additional Monk metadata created for the text. It should NOT be linked to the workset, but to the texts themselves, so that you can pass/give ratings to other people or other projects that might use different worksets.
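As a minimal sketch of the separation described above (projects use worksets, ratings attach to texts), one possible data model is shown here. Every class and field name is an assumption made for illustration; nothing here reflects an agreed MONK data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rating:
    """A user rating attached to a text chunk, not to a workset (hypothetical)."""
    chunk_id: str
    label: str
    rated_by: str


@dataclass
class Workset:
    """Only a set of texts; it can be handed to someone else as-is."""
    name: str
    chunk_ids: set[str] = field(default_factory=set)


@dataclass
class Project:
    """A scholarly question: refers to worksets and records the analysis history."""
    question: str
    workset_names: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)


def ratings_for(workset: Workset, ratings: list[Rating]) -> list[Rating]:
    """Ratings follow the texts, so any workset containing a chunk can reuse them."""
    return [r for r in ratings if r.chunk_id in workset.chunk_ids]


ratings = [
    Rating("chunk-1", "sentimental", "sara"),
    Rating("chunk-9", "not sentimental", "sara"),
]
mine = Workset("my-novels", {"chunk-1", "chunk-2"})
theirs = Workset("shared-set", {"chunk-1", "chunk-9"})
```

Because a `Rating` only names a chunk, a second project with a different workset picks up the ratings for whatever chunks the two worksets share.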
IN WEB PAGES of the MONK WEBSITE (LOW).
- read intro about what monk can do (MED)
- Highlights of special collections and examples of use
- View detailed information about the tools available - Detailed info about algorithms.
- Detail information about each collection
- Success stories
Implicit needs 1 of 2 (for users)
- TOOL: Search: simple, e.g. a word (HIGH), and advanced (MED), e.g. search of a complex regular expression combining words + POS
- IN EVERY TOOL: Saving/export of results. Need a "Save" and "Save as" (HIGH). Whatever you do, it should not be lost (HIGH)
- TOOL: Reading of the text (including seeing where it is in the table of contents if it is a large document or collection, with highlight of features in the text as needed)
- TOOL: get an account and login (HIGH) - Guest account (MED - good for demos and making results public)
- IN EVERY TOOL: keeping history of action taken, undo and redo whenever possible (MED but pushed to HIGH)
Implicit needs 2 of 2 (for us Monks as researchers)
- IN EVERY TOOL Logging of usage so we can report on use, even during the early stages of experimentation within the use case. (HIGH)
Classification and feature extraction (explicitly requested by a few Use Cases e.g. Sara/Sentimentality)
PREDEFINED TOOLSET: "Find more like this"
i.e. Find text chunks that resemble the text chunks I like
(theme, genre, topic) and show me the features the system used to group them
- select or create a workset
- Rate chunks
- Show me new chunklist of the ones the system has identified
- Let me choose the granularity of results
- Show features believed to be representative of this set of chunks.
Example from Sara. Find chunks that represent various kinds of affect.
- Create a workset of chunks, each chunk to contain roughly 5-10 paragraphs from nineteenth-century British (and American) texts.
Workset comments: NORA didn't support aggregate paragraphs as a chunk type, so Sara would be constrained to using single paragraphs.
Proxy calls: CollectionManager.getCollections. We need to add a return of DocCount for FeatureLens.
CollectionManager.getChunkCount. We should have the system choose how many levels down to go: either the current node and its direct children; or the current node and everything under it. Otherwise we risk locking the system while it retrieves too many nodes.
CollectionManager.getChunkHierarchy.
CollectionManager.searchCollections.
Return data: Collection labels, chunkcounts, and chunklist labels in hierarchy for display.
Submit data: Selected workset.
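The comment on limiting how many levels of the hierarchy are retrieved can be sketched as a depth-bounded traversal. This is an illustration only: `Node` and `labels_to_depth` are hypothetical names and do not describe the actual CollectionManager calls or their signatures.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node in a collection hierarchy (collection, book, chapter...)."""
    label: str
    children: list[Node] = field(default_factory=list)


def labels_to_depth(node, max_depth=1):
    """Return (depth, label) pairs in display order.

    max_depth=1 gives the current node and its direct children;
    max_depth=None gives the current node and everything under it.
    """
    out = []
    stack = [(node, 0)]
    while stack:
        current, depth = stack.pop()
        out.append((depth, current.label))
        if max_depth is None or depth < max_depth:
            # Push children in reverse so they are popped in their original order.
            for child in reversed(current.children):
                stack.append((child, depth + 1))
    return out


tree = Node("Collection", [
    Node("Book A", [Node("Chapter 1"), Node("Chapter 2")]),
    Node("Book B"),
])
shallow = labels_to_depth(tree, max_depth=1)
full = labels_to_depth(tree, max_depth=None)
```

Fetching one level at a time, and expanding further only when the user opens a node, is one way to avoid retrieving too many nodes in a single call.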
- Rate representative chunks for sanctity of motherhood, ellipses, childhood innocence, last looks, Christian death or death of children, and indicators of sensibility - fainting, sighing, weeping, paleness.
Rating Comments: In the NORA interface, Sara has to do these one at a time. We should consider ways to allow multiple rankings at once.
Proxy calls: D2KManager.runFakeAnalysis (eventually replaced with D2KManager.runAnalysis)
D2KManager.getJobStatus
D2KManager.getApproxTime
D2KManager.abortJob
CollectionManager.renderChunkWithFeatures
CollectionManager.searchCollections
Submit data: Choice of chunk to view.
Return data: Chunk text.
Submit data: List of representative chunks with user ratings (positive and negative).
- Show me new chunklist of the ones the system has identified
Show new chunklist comments: If Sara can rate simultaneously on multiple dimensions, we would want to return multiple system ratings at once, and multiple feature sets at once. Once we have these multiples, we would want ways to group, sort, and compare them.
Proxy calls: D2KManager.getPrediction
Return data: System ratings for all chunks in the workset.
- Let me choose the granularity of results NEED FIXING
Granularity comments: we don't have this feature in NORA. The goal is to visualize groups of results for cases where there are long lists. So for example, we might want to see a list of author names and document titles rather than a list of chunks of 5-10 paragraphs each.
Proxy calls: CollectionManager.getDocumentMetadata
Submit data: Chunklist of chunks with significant system ratings.
Return data: List of authors and titles that can also serve as links back to the chunks.
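The granularity idea above (showing authors and titles instead of a long list of chunks) amounts to grouping rated chunks by their document metadata. The sketch below assumes each result is a `(chunk_id, author, title, score)` tuple; that shape and the function name are assumptions for illustration.

```python
from collections import defaultdict


def group_by_document(rated_chunks):
    """Group rated chunks under (author, title) so the list can link back to them."""
    groups = defaultdict(list)
    for chunk_id, author, title, score in rated_chunks:
        groups[(author, title)].append((chunk_id, score))
    return dict(groups)


# Made-up sample data, not real system ratings.
rated = [
    ("c1", "Author A", "Title One", 0.91),
    ("c2", "Author A", "Title One", 0.84),
    ("c7", "Author B", "Title Two", 0.88),
]
by_document = group_by_document(rated)
```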
- Show features used for the rating
Show features comments: If we provide for multiple ratings, we want to show multiple features. We'll need to find a way to make it clear what features correspond to what ratings. We will also want to support different ways of displaying the features: lists, graphs, and other visuals. We may also want to make use of some form of granularity, as with the chunklists.
Proxy calls: D2KManager.getGraph
D2KManager.getFeatures
D2KManager.getFeatureChart
CollectionsManager.getChunksContainingFeature
CollectionManager.getDocumentMetadata
CollectionManager.renderChunkWithFeatures with the AsHTML parameter. We should also be able to use this for FeatureLens in place of Hilight Pattern.
Return data: List of features.
Submit data: Feature search for selected individual items.
Return data: Highlighted list of selected items in the current chunk in the reading view.
COMMENTS//
- Sara's use case would need to allow training on multiple factors at once, and new analytics to do the data mining on multiple factors. (MED)
- ANALYTICS: need better mechanisms to refine how the data mining is performed (e.g. options for ignoring the default stop words, or user-defined stop words, using stemming, using other algorithms, etc) (MED)
- ANALYTICS AND UI: allow unsupervised classification (NOTE by MM. I am actually quite interested in unsupervised classification, though I'm not well read in the literature about results produced by one or the other.) (MED????????)
- DATA AND ANALYTICS: need to be able to use ngrams, part of speech instead of words in the analysis. Soundex was also requested by a use case. (MED????????)
- DATA: to start we would need to get access to the Nora datasets (that would give us Sara's 1st book, and Dickinson for Martha's use case), and Tanya and Kirsten can also use this immediately if we process their data now. (HIGH)
- UI: we have a lot of existing designs that we would refine and extend. In particular we need to change the design so that we can work with hierarchical worksets. Also, dealing with multiple factors at once will change the design significantly (MED).
- UI: Need better ways of exploring the features returned by the analysis (MED). More visualization and text browsing capabilities (MED). Sending the features to FeatureLens might be nice to better explore how the features are used (MED).
- UI: need a history mechanism and some interface to review all the analyses that were done. Ideally there is a way to compare analysis runs (e.g. what's the difference between the results using NB versus SVM).
- NOTE: so far no one in our use cases seems to have requested unsupervised classification (e.g. what are the overall topics in this book/collection), so supervised classification seems to remain the focus.
- COLLAB: sharing training sets with other people (LOW???), comparing results between users (e.g. comparing erotics according to Martha and someone else) (LOW?). Comparing results of a group of people and running a hierarchical clustering of people and text (LOW).
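The analytics options mentioned in the comments above (ignoring the default stop words, user-defined stop words, using part of speech instead of words) could be exposed as parameters of the feature extraction step. The sketch below is a guess at what that might look like; the function name, the tiny stop list and the Penn-style POS tags in the sample are all illustrative assumptions.

```python
DEFAULT_STOP_WORDS = frozenset({"the", "a", "an", "of", "and"})  # illustrative only


def extract_features(tagged_tokens, unit="word",
                     stop_words=DEFAULT_STOP_WORDS, extra_stop_words=()):
    """Turn (word, pos) pairs into features for the data mining step.

    unit="word" uses lowercased words, unit="pos" uses the POS tags.
    stop_words=None ignores the default stop list; extra_stop_words adds user words.
    """
    if unit not in ("word", "pos"):
        raise ValueError("unit must be 'word' or 'pos'")
    blocked = set(stop_words or ()) | {w.lower() for w in extra_stop_words}
    features = []
    for word, pos in tagged_tokens:
        if word.lower() in blocked:
            continue
        features.append(word.lower() if unit == "word" else pos)
    return features


tokens = [("The", "DT"), ("child", "NN"), ("wept", "VBD")]
words = extract_features(tokens)
tags = extract_features(tokens, unit="pos", stop_words=None)
```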
Frequent pattern analysis (Explicitly requested in Tanya's Use Case)
PREDEFINED TOOLSET: Show Patterns of Repetition (comment: the names are not consistent, Explore patterns of repetition is better).
- Requested by Tanya's use case, but seems to be general enough to apply to many other studies (especially once you study your text at the part of speech level, the level of repetition is even greater than what is found in the Stein book).
- UI: we have a great start with FeatureLens, just integrating it in MONK will give us a lot (HIGH)
- ANALYTICS: we have some D2K itineraries already. Work is needed to deal with speed of processing and to study what the best strategies to find patterns are, e.g. are 3-grams the best? How to remove/aggregate the close duplicates? On-demand or preprocessing? Can we preprocess all possible options of processing? If not, which ones do we offer by default? (HIGH)
- DATA: to start we need to process Stein's book (DONE)
- UI: The FeatureLens prototype is OK already but we have a list of many things we can do to improve it (e.g. better way to review the lists, showing more trend lines at once, refining the spike-and-other trends filter algorithm, adding saving of results, annotations, etc.).
- >>>>> See also: featurelens-nextsteps (list started at Illinois meeting in July 07)
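To make the 3-gram question above concrete, here is a minimal sketch of counting repeated word n-grams per section, which is roughly the kind of data a trend-line view needs. It is not the D2K itinerary; the function names and the `min_count` threshold are assumptions, and near-duplicate aggregation is left out.

```python
from collections import Counter


def ngrams(tokens, n=3):
    """All consecutive n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngrams(sections, n=3, min_count=2):
    """Map each n-gram seen at least min_count times to its count in every section."""
    per_section = [Counter(ngrams(tokens, n)) for tokens in sections]
    totals = Counter()
    for counts in per_section:
        totals.update(counts)
    return {
        gram: [counts[gram] for counts in per_section]
        for gram, total in totals.items()
        if total >= min_count
    }


# Made-up sample sections, already tokenized.
sections = ["a rose is a rose".split(), "is a rose red".split()]
patterns = repeated_ngrams(sections, n=3, min_count=2)
```

Running this for several values of `n` ahead of time versus on demand is exactly the preprocessing trade-off raised above; the sketch takes no position on it.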
Studying the characters/names and geography/place (Explicitly requested by at least 2 use cases, Steve and Tanya)
(HIGH/MED/LOW???????? Not clear)
PREDEFINED TOOL SET: Explore relationships
Comment: the names should be more clear about the fact that this is name and place only.
We could imagine at least doing some side experiments in the short term (e.g. using the UMd SocialAction tool for social networks), as long as we get good entity and link data from MONK, but I am not sure this is being taken care of????
- DATA: Entity extraction of names and geographical locations. Loretta's team already has some tools available that go further than just extracting entities and also get information about the relationship itself between entities.
- Some kind of analytics and visualization is needed to help understand the social network, time-space or storyline-space relationships in the data (the use cases here are very vague at this point).
- UI: at minimum we need a way to see where each entity appears in the text (can use FeatureLens for that). For more advanced functionality, e.g. using network tools, it would be best to throw our users' data into existing tools (e.g. SocialAction from UMd for social networks) and see how our users use it, before reinventing the wheel within Monk.
- Vis: Zoomable timelines, a network vis (either node-link or matrix or both) - Using SocialAction from UMd would be great (it's Java)
- DATA: to start we would need to run entity extraction tools on Steve's data and Tanya's Stein book ASAP, and decide where the extracted entities reside in the data model. Plan that users WILL ask to be able to correct the many mistakes made by the automatic tools.
- ANALYTICS: computing various statistics of the social network, cluster analysis etc.
- NOTE: The danger is that this could be a whole 2-year project in itself, and there are entire books of research on time-space analysis, e.g. http://www.ais.fraunhofer.de/and/eda/index.html .
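As a small-scale illustration of the entity and link data discussed above, the sketch below builds a co-occurrence network from entities that appear in the same chunk and computes one simple statistic (degree). Treating co-occurrence in a chunk as a link is an assumption made here for illustration; the use cases do not specify how relationships should be derived, and all names and sample entities are made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(chunk_entities):
    """Count how many chunks each pair of entities shares (undirected edges)."""
    edges = Counter()
    for entities in chunk_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def degree(edges):
    """Number of distinct neighbours of each entity."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


# Made-up extracted entities, one list per chunk.
chunks = [["Anne", "Joan"], ["Anne", "Joan", "Chelmsford"], ["Joan"]]
edges = cooccurrence_edges(chunks)
degrees = degree(edges)
```

An edge list of this shape is also the sort of input an external network tool would need, which fits the suggestion to try existing tools first.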
From Word to Word level study - or Statistical comparisons (I assume this is the core of Martin's use case)
(HIGH/MED/LOW?????? Not sure... what do we want in the short term?)
PREDEFINED TOOLSET: Explore word patterns
and
PREDEFINED TOOLSET: Compare documents
- ANALYTICS: needs detailed statistics for word/POS/sentence/author/time period or any unit of text
- UI: not clear to me yet how this will be integrated in the rest of the Monk UI. Maybe a separate UI with simple "classic" stats, then access to advanced scripts. Maybe "just" a profile viewer, i.e. pick 2 worksets and compare them?
- UI: the WordHoard UI (and TAPoR UI) do a lot of this very well. Should we reimplement what already works well elsewhere? (In other words, I am not sure what is different, except that MONK will give access to larger data collections, but not even necessarily allow using your own text.) We could "improve" on the existing UIs, e.g. maybe by adding visualization?
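For the "pick 2 worksets and compare them" idea, one common statistic for comparing a word's frequency in two corpora is the log-likelihood ratio (G2). The sketch below uses the two-term form often seen in corpus comparison; choosing this statistic is an assumption on our part, not something the use cases have asked for, and the function name is hypothetical.

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 score for one word: counts in worksets A and B, and their total word counts.

    0 means the word is used at the same rate in both worksets;
    larger values mean a bigger difference in usage.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


same_rate = log_likelihood(10, 1000, 20, 2000)
different_rate = log_likelihood(30, 1000, 5, 1000)
```

Sorting all words of two worksets by this score would give a basic profile view of what distinguishes them.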
Note by CP: If not every text is "super-monkified" with extra tagging, we need to make sure users can tell what has been processed when they select their workset.
Note by MM: Minimal monkification involves consistent bibliographical description in the document header, as well as tokenization, sentence splitting, and POS tagging. I wouldn't call the addition of bibliographical information 'extra-tagging' but a minimal requirement. And if there were a text that has no date assigned to it, monkification would involve at the minimum an explicit statement that the date is unknown.
REPLY by CP: Yes, adding tags by automatic processing is a minimal requirement. The question is: do we let users add their own tags or correct existing tags? I see that as important but low priority compared to the rest of the work.
Changing patterns over time (explicitly requested by Martin and Kirsten)
(MED)
PREDEFINED TOOL SET: CREATE TIMELINE
NOT CLEAR WHAT PLAN IS
- This could be seen as a subset of the above Names and Places problem, but really visualizing change over time is I think such an important task it needs to be addressed more generally.
- For Kirsten there are several patterns of interest in the early English witchcraft trial documents. One is of accusation and counter-accusation. Another is the change of the physical appearance, ownership, and activity of familiars. A third is the pattern of movement of trials from one town to another.
- Martin's interest, if I have this right, was in the changing use of words over decades or centuries.
- UI: FeatureLens allows some time analysis already. It would need to be refined to allow more varied types of aggregation for the units of time to be considered (i.e. instead of using about a dozen sections/chapters or years, it would need to allow 100 sections for 100 years? Or maybe a hierarchical timeline corresponding to a collection TOC?)
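The question of varied time aggregation can be illustrated with a function that bins dated tokens into periods of any length, so the same data can be viewed by decade or by century. This is only a sketch under the assumption that every token carries the year of its text; the function and parameter names are hypothetical.

```python
from collections import Counter


def frequency_by_period(dated_tokens, word, period_years=10):
    """Relative frequency of a word per period, keyed by the period's start year."""
    hits = Counter()
    totals = Counter()
    for year, token in dated_tokens:
        start = (year // period_years) * period_years
        totals[start] += 1
        if token.lower() == word:
            hits[start] += 1
    return {start: hits[start] / totals[start] for start in sorted(totals)}


# Made-up sample data: (year of the text, token).
data = [(1592, "witch"), (1595, "trial"), (1601, "witch"), (1604, "witch")]
by_decade = frequency_by_period(data, "witch", period_years=10)
by_century = frequency_by_period(data, "witch", period_years=100)
```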
Adding user defined tags (requested by Kirsten)
(LOW)
CURRENTLY NOT INCLUDED IN TOOLS THAT I CAN SEE
- creating new tags or annotations as the result of the analysis is not too hard to do. Do we get a commitment to allow that?
- saving ratings from the classification is already there in NORA and is a possible way to enter any tag-like info, but it doesn't currently make that user-provided content available as tags to others.
Emerging requirements we have not worked on yet
terminology
It seems clear that we want to avoid using techy words like data mining... (see discussion Stan etc. around 9/20)
usage of a thesaurus or WordNet for words, gazetteers for places
Clearly something people wish to have. Need to see what can be done practically to allow a way to deal with higher concepts and not just words... One potential "easy" thing to do is to let users type their own set of words representing a concept (e.g. all the love words) and then use all those words as a single basket for everything in the interface (e.g. to search, show changes in frequencies, etc.). See discussion around 9/18-20 2007. One issue is that an existing thesaurus is unlikely to work well with old texts.
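The "basket of words" idea could be as simple as treating a user-typed word set as one unit when counting. A minimal sketch, with a hypothetical function name and a made-up basket:

```python
def basket_frequency(tokens, basket):
    """Share of tokens that belong to a user-defined basket of words."""
    words = {w.lower() for w in basket}
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t.lower() in words) / len(tokens)


love_words = {"love", "beloved", "dear"}  # typed in by the user
tokens = "My dear love is dead".split()
share = basket_frequency(tokens, love_words)
```

Because the basket is typed by the user, it can hold period spellings that an existing thesaurus would miss.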
Collaboration with other scholars
(LOW but some HIGH)
- There are some easy things we can do...
- Create a "URL" for any state which can be returned to automatically by anyone (HIGH in FeatureLens)
- Annotation of saved states of one's work (MED)
- Export results into ManyEyes and let users annotate it (LOW)
Several use cases request some way to share results of analysis (e.g. a way to publish/exchange the training sets or the classification sets, or to add XML tags to the collection) (LOW)
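One way to get a "URL" for any state is to serialize the state into the query string, so that anyone opening the link gets the same view back. The sketch below assumes the state is small enough to fit in a URL and can be expressed as JSON; the base URL, parameter name and state fields are all hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://example.org/monk/featurelens"  # placeholder, not a real address


def state_to_url(state):
    """Encode a JSON-serializable state dict into a shareable URL."""
    return BASE_URL + "?" + urlencode({"state": json.dumps(state, sort_keys=True)})


def url_to_state(url):
    """Recover the state dict from a URL produced by state_to_url."""
    query = parse_qs(urlparse(url).query)
    return json.loads(query["state"][0])


state = {"workset": "stein", "ngram": 3, "selected": ["a rose"]}
link = state_to_url(state)
restored = url_to_state(link)
```

For larger states, storing the state on the server and putting only an identifier in the URL would be the usual alternative.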
Study of prosody
(LOW)
Blue sky requirements - Those we most likely won't get to this round
add here
add here
|