This page last changed on Feb 20, 2008 by martinmueller@northwestern.edu.

Sentence Splitting in MorphAdorner

Extracting words and extracting sentences from a text are fundamental operations
required by other language processing functions. Word tokenization
splits a text into words and punctuation marks. Sentence splitting
assembles the tokenized text into sentences.

Recognizing the end of a sentence is not an easy task for a computer.
In English, punctuation marks that usually appear at the end of a
sentence may not indicate the end of a sentence. The period is the
worst offender. A period can end a sentence but it can also be part
of an abbreviation or acronym, an ellipsis, a decimal number, or part
of a bracket of periods surrounding a Roman numeral. A period can
even act both as the end of an abbreviation and the end of a sentence
at the same time. On the other hand, some poems may not contain
any sentence punctuation at all.

Another problem punctuation mark is the single quote, which can
introduce a quote or start a contraction such as 'tis. Leading-quote
contractions are uncommon in contemporary English texts, but appear
frequently in early Modern English texts.

Few literary texts that have already been marked up using SGML or
XML recognize sentences in the markup. (The Chadwyck-Healey archive
of eighteenth-century novels is a notable counterexample.) Sentences
often cross other element boundaries. Texts without sentence markup
require preprocessing to add it without disturbing the existing
markup. This allows further processing of the texts, in particular,
part of speech tagging and name recognition. MorphAdorner allows
pluggable input and output processors to handle reification of texts
and addition of extra markup as needed.

MorphAdorner's default sentence splitter uses the standard Java
BreakIterator class along with a set of heuristics for determining if
two or more sentences generated by BreakIterator should be joined
into one sentence, or split into more than one sentence. The heuristics
include special treatment of sentence-ending brackets (right parenthesis,
right bracket, and right brace), abbreviations, and interjections.
Some of these heuristics are described below. The resulting sentence
extraction is not perfect but is better than BreakIterator's splitting
and much better than naive splitting methods.

The article "Finding text boundaries in Java" by Rich Gillam at
http://www.ibm.com/developerworks/java/library/j-boundaries/boundaries.html
describes the methods underlying the Java BreakIterator. MorphAdorner
only uses BreakIterator to provide initial sentence boundaries.
MorphAdorner's word tokenizer uses its own methods for determining token
boundaries within a sentence.
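
As a rough illustration of the first step only, the following sketch
collects the candidate sentences that the standard Java BreakIterator
reports for a text. The class and method names are hypothetical and are
not taken from MorphAdorner; the joining and splitting heuristics
described on this page would run on the result.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of MorphAdorner. */
class InitialSentenceBoundaries {

    /** Returns the candidate sentences BreakIterator finds in the text. */
    static List<String> candidateSentences(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        // Each boundary pair (start, end) delimits one candidate sentence,
        // including any whitespace that trails it.
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }
}
```

Calling InitialSentenceBoundaries.candidateSentences(text) returns
segments that, concatenated, reproduce the input text exactly, so later
heuristics can join or re-split them without losing characters.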

Abbreviations

The period ending an abbreviation may act as both a part of the abbreviation
and the end of a sentence. MorphAdorner maintains a list of common
abbreviations along with a flag indicating if the abbreviation usually
can end a sentence. MorphAdorner will not split a sentence after an
abbreviation which is not designated as a potential sentence ender.

For example, the abbreviation Mrs. rarely ends a sentence, so
MorphAdorner does not issue sentence splits following Mrs. Thus

Mrs. Smith was here earlier.

is correctly considered a single sentence, while

I will leave it up to the Mrs. She will know what to do.

which should be two sentences (with a split after Mrs.) is also treated
as a single sentence by MorphAdorner. This could be handled by
recognizing that Mrs. can end a sentence when followed by something
other than a proper name.
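
The abbreviation list with its sentence-ender flag might be modeled as
in the sketch below. The class name, the method names, and the choice of
a case-insensitive map are assumptions made for illustration; they do
not describe MorphAdorner's actual data structures.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical abbreviation table, not MorphAdorner's own class. */
class AbbreviationTable {

    // Maps a lower-cased abbreviation to its "can end a sentence" flag.
    private final Map<String, Boolean> enderFlags = new HashMap<>();

    void add(String abbreviation, boolean mayEndSentence) {
        enderFlags.put(abbreviation.toLowerCase(Locale.ROOT), mayEndSentence);
    }

    boolean isAbbreviation(String token) {
        return enderFlags.containsKey(token.toLowerCase(Locale.ROOT));
    }

    /** True only for a listed abbreviation flagged as a potential sentence ender. */
    boolean mayEndSentence(String token) {
        return enderFlags.getOrDefault(token.toLowerCase(Locale.ROOT), false);
    }
}
```

With add("Mrs.", false) and add("p.m.", true), a splitter would never
split after Mrs. and would go on to apply further checks after p.m.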

When an abbreviation can end a sentence, MorphAdorner tries to determine
if a particular use ends a sentence or not by looking for possible verbs
before and after the abbreviation. MorphAdorner does not split the
sentence after the abbreviation unless it has found a possible verb in the
sentence preceding the abbreviation. MorphAdorner does not use detailed
part of speech information during sentence splitting. However, the parts
of speech for any word can be looked up in the word lexicon or determined
using a part of speech guesser. That is sufficient to guide the sentence
splitting algorithm in many but not all cases.

MorphAdorner splits the text

I mailed the letter early in the a.m. The next step is to wait for a reply.

correctly into two sentences following a.m., while

I mailed the letter early in the a.m. the next day too.

is left unsplit.

MorphAdorner correctly leaves unsplit the following sentences.

She needs her car by 5 p.m. Saturday evening.
At 5 p.m. I had to go to the bank.
She has an appointment at 5 p.m. Saturday afternoon.
By 5 p.m. Sunday I have to be at home.

MorphAdorner correctly splits the following text into two sentences
following p.m.:

It was due Friday at 5 p.m. Saturday afternoon would be too late.

The text

She has an appointment at 5 p.m. Saturday afternoon to get her car fixed.

should be left as a single sentence, but MorphAdorner splits it into
two sentences with the split occurring after p.m. While both get and
fixed can be verbs, neither appears in context as the right kind of
verb form to allow the text following p.m. to be considered a sentence.
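
The verb check can be approximated with the sketch below. It is a
simplified reading of the description above, not MorphAdorner's
algorithm: it assumes a split is allowed only when a possible verb
occurs both before and after the abbreviation, and it replaces the word
lexicon and part of speech guesser with a small hand-picked word set.
All names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical verb-check heuristic, not MorphAdorner's own code. */
class VerbCheckSketch {

    // Toy stand-in for a lexicon lookup or a part of speech guesser.
    static final Set<String> POSSIBLE_VERBS = new HashSet<>(Arrays.asList(
        "mailed", "is", "wait", "needs", "had", "go", "has", "have",
        "be", "was", "would", "get", "fixed"));

    /** True if any word of the text is listed as a possible verb. */
    static boolean containsPossibleVerb(String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (POSSIBLE_VERBS.contains(word)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Assumed rule: split after a sentence-ending abbreviation only when
     * the text before it and the text after it each hold a possible verb.
     */
    static boolean splitAfterAbbreviation(String before, String after) {
        return containsPossibleVerb(before) && containsPossibleVerb(after);
    }
}
```

Under these assumptions the sketch agrees with the outcomes reported
above, including the unwanted split in the last example, where get and
fixed count as possible verbs. A real lexicon lists many more words as
possible verbs, so actual results depend on the lexicon used.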

MorphAdorner does not recognize abbreviations containing blanks, such as
"U. S." for United States. However, "U.S." without the blank is recognized.

Characters not allowed to start a sentence

MorphAdorner does not allow a sentence to start with a comma, a period,
or a percent sign. These characters will be attached to the previous token
and/or sentence, if any. Dashes and hyphens are joined preferentially
to the end of a sentence rather than the start of a sentence.
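
A minimal sketch of the reattachment rule for commas, periods, and
percent signs follows. It assumes the candidate sentences arrive as
trimmed strings and moves any leading run of these characters to the end
of the previous sentence. The names are hypothetical, and the preference
for dashes and hyphens is not covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical clean-up pass, not MorphAdorner's own code. */
class LeadingPunctuationSketch {

    static boolean mayNotStartSentence(char c) {
        return c == ',' || c == '.' || c == '%';
    }

    /** Moves leading commas, periods, and percent signs to the previous sentence. */
    static List<String> reattach(List<String> sentences) {
        List<String> result = new ArrayList<>();
        for (String sentence : sentences) {
            // Count the leading characters that may not start a sentence.
            int i = 0;
            while (i < sentence.length() && mayNotStartSentence(sentence.charAt(i))) {
                i++;
            }
            String rest = sentence;
            // With no previous sentence the characters stay where they are.
            if (i > 0 && !result.isEmpty()) {
                int last = result.size() - 1;
                result.set(last, result.get(last) + sentence.substring(0, i));
                rest = sentence.substring(i).trim();
            }
            if (!rest.isEmpty()) {
                result.add(rest);
            }
        }
        return result;
    }
}
```

For example, the candidates "Prices rose 5" and "% in May." come back as
"Prices rose 5%" and "in May."; a later joining step would be needed to
merge them into one sentence.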

Interjections

MorphAdorner maintains a list of common interjections. These are words
typically used for emphasis, and generally followed by an exclamation
mark or question mark. MorphAdorner does not split the sentence following
the interjection, and it leaves the question mark or exclamation point
attached to the interjection word. The situation can become ambiguous when
quote marks are involved.

MorphAdorner treats the following lines as single sentences.

What! That's bad!
"What! That's bad!"

On the other hand, the following line is treated as two sentences.

"What!" "That's bad!"

"What!" is the first sentence and "That's bad!" is the second
sentence.
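
One way to reproduce the three outcomes above is sketched below. It
assumes that a candidate sentence consisting of an optional opening
double quote, a listed interjection, and a closing ! or ? is joined to
the sentence that follows, while a closing quote after the punctuation
ends the sentence. The interjection list and all names are hypothetical;
MorphAdorner's real rules may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical interjection handling, not MorphAdorner's own code. */
class InterjectionSketch {

    // Toy list; MorphAdorner keeps its own list of common interjections.
    static final Set<String> INTERJECTIONS =
        new HashSet<>(Arrays.asList("what", "oh", "alas"));

    /** True for What! or "What! but false for "What!" with a closing quote. */
    static boolean isOpenInterjection(String sentence) {
        String s = sentence.trim();
        if (s.startsWith("\"")) {
            s = s.substring(1);
        }
        if (s.length() < 2) {
            return false;
        }
        char last = s.charAt(s.length() - 1);
        if (last != '!' && last != '?') {
            return false;
        }
        String word = s.substring(0, s.length() - 1).toLowerCase(Locale.ROOT);
        return INTERJECTIONS.contains(word);
    }

    /** Joins each open interjection with the candidate sentence after it. */
    static List<String> joinInterjections(List<String> sentences) {
        List<String> result = new ArrayList<>();
        String pending = null;
        for (String sentence : sentences) {
            String current = (pending == null) ? sentence : pending + " " + sentence;
            if (isOpenInterjection(sentence)) {
                pending = current;      // keep collecting
            } else {
                result.add(current);
                pending = null;
            }
        }
        if (pending != null) {
            result.add(pending);        // interjection at the very end
        }
        return result;
    }
}
```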

Numbers

A period following a number may act as both a decimal point and the end
of a sentence (in English). In general, MorphAdorner ends a sentence
following a number ending in a period when the next word begins with a
capital letter. The following text is considered one sentence by
MorphAdorner.

There are 12. of them.

MorphAdorner splits each of the following two lines into two sentences
following 12.

There are 12. More would be unnecessary.
There are 12. "More would be unnecessary."
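
The capital-letter rule for numbers can be sketched as follows. The
sketch assumes whitespace-separated tokens and skips an opening quote or
other leading punctuation on the next token before testing its first
letter. The names are hypothetical, and the general rule stated above
may have exceptions in MorphAdorner that are not modeled here.

```java
/** Hypothetical number heuristic, not MorphAdorner's own code. */
class NumberPeriodSketch {

    /** True for tokens such as 12. (digits followed by one period). */
    static boolean isNumberWithPeriod(String token) {
        return token.matches("\\d+\\.");
    }

    /**
     * True when the token is a number ending in a period and the next
     * token, ignoring leading punctuation, begins with a capital letter.
     */
    static boolean endsSentence(String token, String nextToken) {
        if (!isNumberWithPeriod(token)) {
            return false;
        }
        int i = 0;
        while (i < nextToken.length()
                && !Character.isLetterOrDigit(nextToken.charAt(i))) {
            i++;
        }
        return i < nextToken.length()
            && Character.isUpperCase(nextToken.charAt(i));
    }
}
```

Here endsSentence("12.", "of") is false, while endsSentence("12.", "More")
and endsSentence("12.", "\"More") are both true, matching the three
examples above.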

Document generated by Confluence on Apr 19, 2009 15:04