|
MONK : Collections Cell
This page last changed on Feb 14, 2007 by amitku.
I'd be inclined to draw the line between the collection cell and the datastore cell a little differently and include data preprocessing in the Collection Cell. I assume that in practice we will acquire a Monk library. No monastery without a library. And in this library every item will exist in two shapes: the shape in which it arrives from the archive to which it belongs, and the shape it assumes once certain information has been added in preprocessing (POS tagging, lemmatization, orthographic standardization) and, perhaps, some types of tagging have been stripped. It's hard to imagine that any MONK process would ever need something like y<SUP>t</SUP>, the TCP representation of an abbreviated 'that', rather than the word to which it should be resolved. The preprocessed texts are the input for the various data representations and structures that support end-user operations directly or indirectly. And I'd move this into the collection or curatorial cell because the preprocessing is part of a set of routines governed by one question: how do we manage a diverse collection of text files and derive from them uniform input data for subsequent processing? |
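
To make the "two shapes" idea concrete, here is a minimal Python sketch, not a description of any actual MONK code. It assumes a hypothetical LibraryItem record that keeps the archival text untouched next to a preprocessed copy. The names LibraryItem, ABBREVIATIONS, resolve_abbreviations and ingest are invented for illustration; only the y<SUP>t</SUP> to 'that' pair comes from the comment above. POS tagging and lemmatization are left out.

```python
from dataclasses import dataclass

# Hypothetical abbreviation table. Only the y<SUP>t</SUP> -> "that" pair
# comes from the page; a real table would be supplied by the curators.
ABBREVIATIONS = {"y<SUP>t</SUP>": "that"}


@dataclass(frozen=True)
class LibraryItem:
    """One library item held in both of its shapes (illustrative only)."""
    item_id: str
    archival: str      # the text exactly as it arrived from its archive
    preprocessed: str  # the text after standardization


def resolve_abbreviations(text: str) -> str:
    """Replace each known abbreviated form with its resolved word."""
    for abbreviated, resolved in ABBREVIATIONS.items():
        text = text.replace(abbreviated, resolved)
    return text


def ingest(item_id: str, archival_text: str) -> LibraryItem:
    """Store the archival shape unchanged and derive the preprocessed shape."""
    return LibraryItem(item_id, archival_text, resolve_abbreviations(archival_text))


item = ingest("sample-001", "I say y<SUP>t</SUP> it is so")
print(item.archival)
print(item.preprocessed)
```

Keeping both shapes on the same record means later processes read only the preprocessed field, while the archival form stays available for curation.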