This page last changed on May 10, 2007 by martinmueller@northwestern.edu.

Word tokens can be aggregated in various ways. That is the main point of tokenizing a document in the first place. The BOW model (bag of words) is a term of art in the NLP community to describe analytical routines that treat a text as a collection of words without regard to their order.

What bags do you put the words into? The two fundamental bags are the sentence bag and the document bag. A minimal text consists of one sentence that contains at least one word, which may be a representation of silence, such as "". In practice documents are likely to contain paragraphs that aggregate into chapters, etc. But while every document will consist of one document bag that contains one or more sentence bags, the intermediate bags come in different shapes and size. Poems have lines and (perhaps) stanzas; novels have chapters, plays have acts and scenes. There is considerable isomorphism within genres, but less across them. One can say with some justice that a stanza and a paragraph are comparable bags for poems and essay. It is less obvious whether the dramatic bag equivalent for a fictional chapter is the act or the scene.

The document bag is established by the bibliographical record: given a title 'foo', the document bag 'foo' contains all the words in that title. The sentence bag is established by the sentence splitting that is associated with tokenization. The intermediate bags are established by various kinds of mark-up and differ greatly.

The BOW model thus means that every tokenized document will by definition have a document bag and one or sentence bags. It may have an indeterminate hierarchy and number of intermediate bags

Document generated by Confluence on Apr 19, 2009 15:04