This paper presents the background and motivation for a processing model
that segments discourse into simple, non-nested clauses before recognizing
clause-internal phrasal constituents, together with experimental results
that support the model. One set of results is derived
from a statistical reanalysis of the Swedish empirical data in Strangert,
Ejerhed and Huber 1993 concerning the linguistic structure of major
prosodic units. The other set of results is derived from experiments in
segmenting part-of-speech-annotated Swedish text corpora into clauses,
using a new clause segmentation algorithm. The clause-segmented
corpus data is taken from the Stockholm Umeå Corpus (SUC), 1 million words
of Swedish text from different genres, annotated for part of speech by
hand, and from the Umeå corpus DAGENS INDUSTRI 1993 (DI93), 5 million
words of Swedish financial newspaper text, processed fully automatically
by tokenization, lexical analysis, and probabilistic part-of-speech
tagging. The results of
these two experiments show that the proposed clause segmentation algorithm
is 96% correct when applied to manually tagged text, and 91% correct when
applied to probabilistically tagged text.