%Nr: DS-1999-02
%Author: Khalil Sima'an
%Title: Learning Efficient Disambiguation
It is widely recognized that the resolution of syntactic ambiguity of natural
language grammars demands resources that are usually considered
extra-linguistic, e.g. discourse, cultural and geographical preferences and
world-knowledge. Performance models of natural language parsing provide
mechanisms for modeling the effects of these extra-linguistic factors in
ambiguity resolution. Most existing performance models acquire
probabilistic grammars from tree-banks that represent language use
in limited domains. In the probabilistic approach to ambiguity resolution,
a language is considered a probability distribution over word-sequences,
derivations and parse-trees. The probabilities are acquired from a tree-bank
(a multi-set of sentence-analysis pairs) and are assumed to represent the
relative plausibility/grammaticality of parse-trees/utterances given that
tree-bank. Among the existing performance models, the Data Oriented Parsing
(DOP) model represents a memory-based approach to natural language parsing.
The DOP model casts a whole tree-bank, which is assumed to represent the
language experience of an adult in some domain of language use, into a
probabilistic grammar called a Stochastic Tree-Substitution Grammar (STSG).
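The relative-frequency estimation underlying this probabilistic set-up can be illustrated by a minimal sketch. Note the simplifying assumption: for brevity the fragments here are depth-1 productions (CFG-style rules), whereas an STSG in the DOP model ranges over subtrees of arbitrary depth; the tiny tree-bank and its trees are hypothetical.

```python
from collections import Counter

def productions(tree):
    """Yield depth-1 fragments (CFG-style rules) of a parse tree.
    A tree is a (label, child, ...) tuple; a leaf is a bare string."""
    if isinstance(tree, str):
        return
    label, *children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, child_labels)
    for c in children:
        yield from productions(c)

def estimate(treebank):
    """Relative-frequency estimate:
    P(rule) = count(rule) / total count of rules with the same root label."""
    counts = Counter(p for t in treebank for p in productions(t))
    root_totals = Counter()
    for (root, _), n in counts.items():
        root_totals[root] += n
    return {rule: n / root_totals[rule[0]] for rule, n in counts.items()}

# A hypothetical two-tree tree-bank (multi-set of sentence-analysis pairs).
treebank = [
    ("S", ("NP", "John"), ("VP", "sleeps")),
    ("S", ("NP", "Mary"), ("VP", ("V", "sees"), ("NP", "John"))),
]
probs = estimate(treebank)
```

Each fragment's probability is thus its frequency relative to all fragments sharing its root, so the probabilities of the rules for any one nonterminal sum to one; the DOP model applies the same idea to all subtrees of the tree-bank.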
The efficiency of human cognitive capacities is often considered a hallmark
of human intelligence. A remarkable fact about contemporary performance models
is their entrenched inefficiency. In principle, existing performance models of
parsing are based on exhaustive search of the space of possible analyses for
whatever input utterance they encounter. It is evident that the actual and
worst-case time- and space-complexities of these models are (almost) independent
of the frequencies of individual utterances in the tree-bank. For example,
existing models do not account for two appealing properties of human language
processing:
1. utterances in specific contexts, typical for limited domains of language
use, are usually less ambiguous than Broad-Coverage Grammars (BCGs) and
tree-bank annotations assume,
2. more frequent utterances are usually processed (e.g. understood) more
efficiently.
We observe that the first property concerns samples of utterances that are
typically considered limited domains of language. Hence, this property must
manifest itself as statistical biases in samples of limited domains of
language use, e.g. tree-banks. A major theoretical hypothesis of the present
dissertation states that when these statistical biases are exploited within an
Information-Theoretic framework of learning for acquiring the first property
above, the second property is acquired as a side-effect.
The present dissertation addresses issues of complexity and efficiency of
probabilistic disambiguation in general and under the DOP model in particular.
It starts out by presenting proofs that some problems of probabilistic
disambiguation are intractable (NP-hard) under existing performance models.
Some of these problems underlie some of the most relevant technological
applications, such as speech understanding. Clearly, these proofs concern
worst-case complexities that typically assume independence from properties of
individual inputs. Therefore, the dissertation also provides empirical
evidence further supporting the hypothesis that contemporary performance
models do not account for efficiency properties of human language processing
in limited domains, specifically the two properties mentioned above. The DOP
model serves, throughout the dissertation, as a representative of
contemporary performance models.
The dissertation, then, studies solutions to the inefficiency of performance
models in general and the DOP model in particular. The principal idea for
removing these sources of inefficiency is to incorporate the desired
efficiency properties of human behavior in limited domains of language use,
such as the properties mentioned above, into existing performance models.
These properties can be incorporated into a performance model through the
combination of two methods of learning from a domain-specific tree-bank:
1) an off-line method that constrains the recognition-power and the ambiguity
of the linguistic annotation of the tree-bank in order to specialize it for the
domain, and 2) an on-line performance model that acquires less ambiguous and
more efficient probabilistic grammars from that "specialized tree-bank". With
this sketchy idea as departure point, this dissertation provides a
theoretical, computational and empirical study of both on-line and off-line
learning of ambiguity resolution in the context of the DOP model. To this end
* it presents a framework for specializing performance models, especially
the DOP model, and Broad-Coverage Grammars to limited domains of
language-use. The goal of automatic specialization algorithms is to
minimize the ambiguity of the specialized grammar while retaining
sufficient coverage of the domain. Ambiguity-Reduction Specialization
(ARS) takes place off-line on a tree-bank that is representative of a
limited domain of language use, resulting in a specialized tree-bank. A
Specialized DOP (SDOP) model is then acquired from this specialized
tree-bank. The dissertation discusses various algorithms based on
Information-Theoretic and other measures of ambiguity and coverage.
* it presents deterministic polynomial-time algorithms for parsing and
disambiguation under the DOP model for various tasks such as sentence
disambiguation and word-graph (speech-recognizer's output) disambiguation.
Crucially, these algorithms have time complexity linear in STSG size. It
is noteworthy that prior to the first publication of these algorithms,
parsing and disambiguation under the DOP model took place solely by means
of inefficient non-deterministic exponential-time algorithms.
* it reports on an extensive empirical study of implementations of the DOP
model and the specialization algorithms on two independent domains that
feature two languages (English and Dutch) and two different tasks. In most
experiments the present Specialized DOP model outperforms the DOP model in
efficiency while retaining almost the same coverage and accuracy figures. A
dedicated experiment exemplifies the fact that the specialized DOP models
tend to process more frequent input more efficiently (in contrast to the
original DOP model).
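The Information-Theoretic measures of ambiguity mentioned above can be pictured with a simplified sketch (an illustration of the general idea, not the dissertation's actual ARS algorithms; the constructions and counts below are hypothetical): measure, for each construction observed in the tree-bank, the entropy of the distribution over its competing analyses. Constructions with low entropy are nearly unambiguous within the domain and are natural candidates for specialization.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Hypothetical tree-bank statistics: for each word-sequence (construction),
# how often each competing analysis was observed in the domain.
analysis_counts = {
    "flying planes": Counter({"NP[gerund]": 1, "NP[adjective]": 1}),
    "the old man":   Counter({"NP[det-adj-noun]": 49, "S[det-noun-verb]": 1}),
}

ambiguity = {phrase: entropy(c.values())
             for phrase, c in analysis_counts.items()}
# A maximally uncertain construction ("flying planes") has entropy 1 bit;
# a construction that is nearly unambiguous in this domain ("the old man")
# has entropy close to 0 and is a candidate for specialization.
```

An off-line specialization procedure in this spirit would prune or merge the analyses of low-entropy constructions while retaining enough coverage of the domain, which is the trade-off the ARS algorithms negotiate.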
The conclusion of this dissertation concerns the two subjects that this
dissertation relates and studies: the computational and algorithmic aspects
of disambiguation, and the specialization of performance models to limited
domains of language use. The complexity analysis of disambiguation implies
that efficient probabilistic disambiguation under the DOP model and various
similar performance models cannot be achieved using conventional
optimization techniques; this suggests that it is necessary to involve
non-conventional optimization techniques and extra-linguistic sources of
knowledge for achieving efficient probabilistic disambiguation. As to the
disambiguation algorithms, the empirical results clearly show that they
improve the efficiency of disambiguation under the DOP model substantially.
Finally, the study of the specialization of performance models to limited
domains results in new insights concerning the modeling of efficiency
properties of human language processing. The empirical results support our
hypothesis concerning the relation between these efficiency properties and
the statistical biases in limited domains. Naturally, only further empirical
exploration will clarify whether the present method can be applied as
successfully to other languages and other domains of language-use.