This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Discussion of <div> types and their implications - 12/20/2007

Present: Amit, Stan, John N., Martin, Stefan, Bill (secretary), Brian

Seeking interim solution for NCF types

Martin - what are we faking for?
Amit - Analysis/D2K for Sara's use case, but it is a general problem that needs to be solved.

Amit - With SEASR analysis (prediction/classification) we need a training set and a test set. The size of the training set is a factor, and chunk sizes should be roughly equivalent. Navigation/visual display wants an indication of the types of chunks.

Martin - If interim solution - is it just for Sara? If so, Chapters will work just fine.

Amit - if you can give me Chapters and Volumes that is fine for now.

Martin - If NB can discriminate Sentimental, and you exclude workparts that are not 'Chapter', that doesn't matter.

Amit - there is a notion of Chunk Types. We are looking for chunks that are roughly the same size. That is why we want to know

Martin - Chapters vary in length by at least a factor of 5 and perhaps an order of magnitude.

JLN - Why don't we key off size if that is what matters?

Amit - we probably need both type and size in the end, by March for example.

Stefan - it strikes me that there might be a difference in type usage with analytics versus communications to users.

JLN - Chapter length varies from a couple hundred to 50K words in NCF.

Amit - indicate length along with chapter type to the user.

Martin - would reformulate as pages of about 500 words, each ending at the closest paragraph break. Some would be shorter and some longer.
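
A minimal sketch of the rechunking Martin describes, assuming the paragraphs of a text are already available as a list of strings and that words can be counted by whitespace splitting. The function name `chunk_paragraphs` and the `target` parameter are hypothetical, not part of any existing proxy call.

```python
def chunk_paragraphs(paragraphs, target=500):
    """Group paragraphs into chunks whose word counts land near `target`.

    A chunk is closed at the paragraph break that leaves its word count
    closest to `target`, so some chunks run shorter and some longer.
    """
    chunks = []
    current = []   # paragraphs in the chunk being built
    count = 0      # words in the chunk being built
    for para in paragraphs:
        n = len(para.split())
        # Close the chunk first if adding this paragraph would move the
        # word count further from the target than stopping here.
        if current and abs(count + n - target) > abs(count - target):
            chunks.append(current)
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than the target still becomes (part of) one chunk, since paragraphs are never split in this sketch.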

Stan - can these be 200 words for viewing/scrolling purposes?

Martin - start with what fits on a screen and then offer a way of bundling (handfuls of screens).

JLN - Don't analytics work with frequencies rather than counts?

Amit - If chunk is small, then its vocabulary is small.
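
A rough illustration of the exchange above, assuming whitespace tokenization: relative frequencies normalize for chunk length, but a chunk's vocabulary can never exceed its token count, so very small chunks still give sparse profiles. The function name `chunk_profile` and the returned keys are hypothetical.

```python
from collections import Counter

def chunk_profile(text):
    """Return raw counts, relative frequencies, and vocabulary size for a chunk."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # Relative frequency = count / chunk length; empty chunks get no entries.
    relative = {word: c / total for word, c in counts.items()} if total else {}
    return {
        "tokens": total,
        "vocabulary": len(counts),   # distinct words; at most `total`
        "counts": counts,
        "relative": relative,
    }
```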

JLN - If analysis is sensitive to small containers, then why not eliminate small containers? Isn't 'Sentimentality' a function of user defined boundaries?

Martin - If you skim through a text and find a 'sentimental' blob - sentimental parts cluster -

Amit - If it is 10 words and the corpus size is 100Ks of words, then those 10 words

JLN - Are we saying that chunks are the only things that can be classified?

Amit - when we do prediction, should it return results at 'chapter', 'volume' level...?

Stefan - I'm worried that basing these things on one use case is dangerous.

We're looking at size of chunk and vocabulary of chunk. Let's make sure our solution is sensitive to both issues.

--------
Amit and John will work in early January to meet Amit's immediate needs.

Martin - if the immediate hack is to meet Sara's use case, check with Sara about whether there are any problems with ignoring workparts other than Chapters.

Amit - can also put a hack in that Chapters should be greater than 200 words...
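
A sketch of the interim hack discussed here, assuming each workpart is available as a record with a type label and its text. The dictionary keys `type` and `text`, the function name `select_chapters`, and the `min_words` parameter are hypothetical; the actual datastore attributes may differ.

```python
def select_chapters(workparts, min_words=200):
    """Keep only workparts typed 'Chapter' that have more than `min_words` words."""
    return [
        wp for wp in workparts
        if wp["type"] == "Chapter" and len(wp["text"].split()) > min_words
    ]
```

Whether dropping the other workparts is acceptable is the open question for Sara noted above.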
--------

Stefan - Can we really set word length thresholds for ignoring Chapters?
Do algorithms accommodate length profiles?

------

Martin will draft a revision of the expectations for chunking. JLN will need technical specifications derived from that draft, spelled out at the proxy call level as well.

Amit modified some of the proxy calls and added attributes for what the datastore provides.


Martin - types are sometimes structural ('Chapter') and sometimes genre ('letter').

John - In general, I will need specifications for these issues from the Analytics Cell.

Martin - do we have tools to explore and analyze the complexity of texts' markup?

JLN - I don't.

Amit - We could put texts into eXist and use XQuery for some of this.
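
As an alternative to the eXist/XQuery route Amit mentions, a first survey of markup complexity could be sketched with Python's standard library. This assumes TEI-like XML in which workparts are `<div>` (or numbered `<div1>`, `<div2>`, ...) elements carrying a `type` attribute; the function name `div_type_counts` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def div_type_counts(xml_text):
    """Count how often each div `type` value occurs in one XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for elem in root.iter():
        # Drop any '{namespace}' prefix so namespaced TEI also matches.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag.startswith("div"):
            counts[elem.get("type", "(untyped)")] += 1
    return counts
```

Summing these counters across a corpus would show which type labels are in use and how many divs carry no type at all.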


Stefan - Mainly the interface wants to know what the chunk type is, rather than its length. I would like a high-level ontology/classification of 'chunk super types' under which we could group the varying types.

Document generated by Confluence on Apr 19, 2009 15:04