A New Probability Model for Data Oriented Parsing (Extended Version)
Remko Bonnema, Paul Buying, Remko Scha
Abstract:
Data oriented parsing systems employ redundant stochastic tree
substitution grammars (STSGs) to analyse natural language utterances on
the basis of an annotated corpus (a treebank). An important component
of such systems is the way in which the substitution probability of a
fragment is estimated from its occurrences in the treebank. In the
standard method for doing this, the probability of a fragment is
directly correlated with its occurrence frequency in the collection of
all fragments of all corpus trees. We show that this results in
undesirable statistical biases. We therefore propose an alternative
method, which estimates the substitution probability of a fragment as
the probability that it has been involved in the derivation of a
corpus tree. We show that this method has more plausible properties.
Keyword(s): linguistics, parsing, probabilistic parsing, data oriented
parsing, dop, statistical methods in linguistics, corpus linguistics
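
The standard estimate mentioned in the abstract can be illustrated with a small sketch. It assumes the usual relative-frequency formulation, in which a fragment's count is normalised by the total count of fragments sharing its root label; the abstract itself does not spell out this normalisation. The tree encoding (nested tuples) and the names `fragments_at`, `nodes` and `relative_frequency_probs` are hypothetical, chosen only for this example. The alternative method proposed in the paper is not shown, since the abstract gives no formula for it.

```python
from collections import Counter
from itertools import product

# Assumed encoding: a tree is a tuple (label, child, ...); a word is a str.
# A one-element tuple (label,) marks an unexpanded frontier nonterminal.

def fragments_at(node):
    """Return all fragments rooted at `node`.

    Each nonterminal child is either cut off (left as a frontier node)
    or replaced by one of the fragments rooted at that child.
    """
    label, *children = node
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])  # words are always kept
        else:
            options.append([(child[0],)] + fragments_at(child))
    return [(label, *combo) for combo in product(*options)]

def nodes(tree):
    """Yield every nonterminal node of `tree`, root first."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from nodes(child)

def relative_frequency_probs(corpus):
    """Estimate fragment probabilities by relative frequency per root label."""
    counts = Counter(
        frag for tree in corpus for node in nodes(tree)
        for frag in fragments_at(node)
    )
    root_totals = Counter()
    for frag, n in counts.items():
        root_totals[frag[0]] += n
    return {frag: n / root_totals[frag[0]] for frag, n in counts.items()}

# Toy treebank with a single tree (illustrative data only).
corpus = [("S", ("NP", "she"), ("VP", "sleeps"))]
probs = relative_frequency_probs(corpus)
```

In this toy treebank the root S yields four fragments, so each receives probability 0.25, while the single NP and VP fragments each receive probability 1.0. Because the number of fragments grows quickly with the size of a corpus tree, large trees contribute disproportionately many fragments to the counts, which is one way to see how frequency-based estimates of this kind can become skewed.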