This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

Loretta wanted to see a sample of the MorphAdorner training data. I
sent her a brief extract from the 19th century fiction training
data. I'll put up the next round of the complete 19th century
training data on our ariadne server once we do a bit more cleaning,
probably this coming week. Martin made a lot of improvements to the
training data this past week. I should have these changes folded
into MorphAdorner shortly. The MorphAdorner training data can be
used with other NLP systems too.

John Norstad has made good progress with updating the WordHoard
ingestion process to accept MorphAdorned texts. He also has updated
the WordHoard work part mappings to handle the wider variety of
TEI-tagged text sections, and the text display to show them in a
presentable format. He will start working on adding the entire
250-novel NCF set as soon as we are satisfied with the re-adorned
texts. This will give us a good look at scaling issues using a
relational database approach. Amit is looking at moving the
MorphAdorned texts into Nora-DB.

Amit has also been looking at methods for automating portions of the
Monk processing workflow.

Vered has had some success in performing named entity extraction on
the Stein texts. This led into a more extended discussion of named
entity extraction. I noted that Gate, out of the box, doesn't work
very well on literature. After modifying and augmenting the gazetteer
lists, changing some of the Jape rules, and adding a regular
expression post-processor to fix up some of the extracted entities, I
was able to improve the results for 19th century fiction. They still
aren't very good. My code to use Gate appears in the MorphAdorner
snapshot, for what it's worth.

Duane suggested contacting Hamish Cunningham to see if we can gain
access to some of the improved versions of the Annie system that are
not available in Gate. Duane also suggested looking at Snow. I
mentioned we had some folks in the business school working with BBN's
IdentiFinder. That is a very expensive commercial system, but Duane
mentioned some folks have been able to get free access for research
projects. Loretta suggested we look at the Alembic Workbench. These
statistical systems require training data. I mentioned that we
should be able to mutate, perhaps in a mechanical fashion, the
existing MorphAdorner training data, to have tags indicating the
start, middle, and end of entities as well as non-entity words. Then
MorphAdorner could be trained on this data for entity recognition
purposes. The same training data could be used with other entity recognizers.
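
As a rough illustration of the mechanical mutation mentioned above, here
is a minimal Python sketch. It assumes the training data can be read as a
list of word tokens and that entity spans are already known as token
index ranges. The function name, the tag labels (B, M, E, O), the entity
type names, and the sample sentence are all made up for illustration;
none of this is the actual MorphAdorner training data format or code.

```python
def tag_entities(tokens, spans):
    """Attach start/middle/end/non-entity tags to a list of word tokens.

    tokens: list of word strings.
    spans:  list of (start, end, entity_type) token index ranges,
            with end exclusive. These are assumed to be non-overlapping.
    Returns a list of (word, tag) pairs.
    """
    # Every word starts out as a non-entity word.
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        for i in range(start, end):
            if i == start:
                prefix = "B"   # start of entity (also used for one-word entities)
            elif i == end - 1:
                prefix = "E"   # end of entity
            else:
                prefix = "M"   # middle of entity
            tags[i] = prefix + "-" + entity_type
    return list(zip(tokens, tags))


# Hypothetical sample sentence and spans, for illustration only.
sample_tokens = ["Mr.", "Edward", "Rochester", "left", "Thornfield", "."]
sample_spans = [(0, 3, "PERSON"), (4, 5, "PLACE")]

for word, tag in tag_entities(sample_tokens, sample_spans):
    print(word + "\t" + tag)
```

The tagged pairs could then be written out in whatever column layout a
given tagger or entity recognizer expects for training.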

Tanya offered a cogent summary of her experiences with
FeatureLens. Many of its deficiencies for her purposes stem from its
limited data access. Loretta noted that she had been working on
stemmed versions of Tanya's texts, but relating these back to the
original words is not currently possible. This underscores the need
to be able to relate analyses of any morphologically derived data to
the original texts. This in turn indicates the need for this type of
access in the Monk data access layer.
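
One simple way to keep derived forms tied to the original words is to
store the token position and original spelling alongside each stem. The
Python sketch below shows the idea only. The toy stemmer, the function
names, and the sample words are hypothetical stand-ins; this is not how
the Monk data access layer or FeatureLens works.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: lowercase and strip a few suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w


def build_stem_index(tokens, stem=toy_stem):
    """Map each stem to the (position, original word) pairs that produced it."""
    index = defaultdict(list)
    for position, word in enumerate(tokens):
        index[stem(word)].append((position, word))
    return dict(index)


# Hypothetical sample tokens, for illustration only.
sample = ["Roses", "rose", "walked", "walking", "walks"]
index = build_stem_index(sample)

# Any result computed on the stem "walk" can be traced back to the text.
for position, original in index["walk"]:
    print(position, original)
```

With the positions kept, an analysis run on stems can always be reported
in terms of the original words and where they occur.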

Several folks asked what texts other than the NCF texts we at
Northwestern were looking at processing. We have already adorned
Tanya's two Stein texts. The folks at Nebraska are selecting a set
of a few hundred works from the Wright archive, some of interest to
Sara. The main problem with these texts will be constructing good
training data for the dialects. The same is true for texts from
Documenting the American South. We also adorned The Scarlet Letter
as an example of an earlier American text.

Martin and I will return to Early Modern English texts over the rest
of this summer. Our initial focus will be on primary sources for
Shakespeare, such as Holinshed, Painter, and Plutarch.

Document generated by Confluence on Apr 19, 2009 15:04