|
MONK : Frequent pattern analysis
This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.
Frequent pattern analysis
Use case: repetition
Repetition analysis is treated as the task of finding frequently occurring patterns in the text. A frequent pattern (see also association rules, as described in Wikipedia) is a pattern (in our case a word or phrase) that occurs frequently in the data set (in our case the document set). The following document also provides a nice survey of frequent patterns: http://www.adrem.ua.ac.be/~goethals/software/survey.pdf
Some definitions: an item is an attribute-value combination, a word, or a phrase. Essentially, frequent pattern analysis compares every item to every other item. Algorithms have been optimized to cut off this comparison when certain conditions are met.
Frequent patterns can also be seen as a form of automated hypothesis generation, but the patterns need to be evaluated by a domain expert. Frequent pattern analysis generates lots and lots of patterns. Some are not interesting because they report known or common occurrences, but others may reveal a novel pattern.
There are several implementations of frequent pattern analysis in D2K. D2K modules are inserted into the workflow before frequent pattern analysis to control whether stop words are removed, whether stemmed words are used, and whether words or n-grams (phrases) are used. Actual counts of words are not used, but could be. At this time, a boolean representation indicates whether or not the word occurs in the document.
Pattern summarization
Use case: repetition
Since frequent pattern analysis generates so many rules, we can use a clustering approach to group similar patterns together. Given K clusters, patterns that have a common set of items are clustered together. |
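The two steps described above can be illustrated with a minimal Python sketch. It is not the D2K implementation: the page does not say which algorithms D2K uses, so the sketch assumes an Apriori-style search for the mining step and a greedy agglomerative merge on Jaccard overlap for the summarization step. The function names, the `min_support` threshold, and the sample documents are all hypothetical.

```python
from itertools import combinations


def frequent_patterns(documents, min_support):
    """Apriori-style sketch: return {itemset: number of documents containing it}."""
    # Boolean representation: a document is the set of words it contains,
    # so repeated words inside one document are counted once.
    doc_sets = [set(doc.lower().split()) for doc in documents]

    def support(itemset):
        return sum(1 for d in doc_sets if itemset <= d)

    # Level 1: single words that occur in enough documents.
    vocabulary = sorted({w for d in doc_sets for w in d})
    current = {}
    for word in vocabulary:
        s = support(frozenset([word]))
        if s >= min_support:
            current[frozenset([word])] = s
    result = dict(current)

    size = 2
    while current:
        # Join frequent itemsets of the previous size into larger candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == size}
        # Cutoff: drop a candidate if any of its smaller subsets is infrequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, size - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        size += 1
    return result


def jaccard(a, b):
    """Share of items two item sets have in common."""
    return len(a & b) / len(a | b)


def cluster_patterns(patterns, k):
    """Greedy agglomerative sketch: merge the most similar clusters until k remain."""
    clusters = [[p] for p in sorted(patterns, key=sorted)]

    def items(cluster):
        return frozenset().union(*cluster)

    while len(clusters) > k:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: jaccard(items(clusters[ij[0]]),
                                          items(clusters[ij[1]])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical sample data, for illustration only.
docs = ["to be or not to be", "to be is to do",
        "not to do is to be", "do be do"]
patterns = frequent_patterns(docs, min_support=3)
clusters = cluster_patterns(patterns, k=2)
```

The pruning step is one example of the cutoff conditions mentioned above: a candidate pattern is never counted if one of its sub-patterns has already been found infrequent. Replacing the sets with word counts would be the natural change if actual counts were to be used instead of the boolean representation.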
| Document generated by Confluence on Apr 19, 2009 15:04 |