Learning Efficient Disambiguation
Khalil Sima'an

Abstract:

It is widely recognized that the resolution of syntactic ambiguity of natural language grammars demands resources that are usually considered extra-linguistic, e.g. discourse, cultural and geographical preferences, and world knowledge. Performance models of natural language parsing provide mechanisms for modeling the effects of these extra-linguistic factors in ambiguity resolution. The majority of existing performance models acquire probabilistic grammars from tree-banks that represent language use in limited domains. In the probabilistic approach to ambiguity resolution, a language is considered a probability distribution over word-sequences, derivations and parse-trees. The probabilities are acquired from a tree-bank (a multi-set of sentence-analysis pairs) and are assumed to represent the relative plausibility / grammaticality of parse-trees / utterances given that tree-bank. Among the existing performance models, the Data Oriented Parsing (DOP) model represents a memory-based approach to natural language parsing. The DOP model casts a whole tree-bank, which is assumed to represent the language experience of an adult in some domain of language use, into a probabilistic grammar called a Stochastic Tree-Substitution Grammar (STSG).

The efficiency of human cognitive capacities is often considered a hallmark of human intelligence. A remarkable fact about contemporary performance models is their entrenched inefficiency. In principle, existing performance models of parsing are based on exhaustive search of the space of possible analyses for whatever input utterance they encounter. It is evident that the actual and worst-case time- and space-complexities of these models are (almost) independent of the frequencies of individual utterances in the tree-bank. For example, existing models do not account for two appealing properties of human language processing:

1. utterances in specific contexts, typical of limited domains of language use, are usually less ambiguous than Broad-Coverage Grammars (BCGs) and tree-bank annotations assume;
2. more frequent utterances are usually processed (e.g. understood) more efficiently.

We observe that the first property concerns samples of utterances that are typically considered limited domains of language. Hence, this property must be available in the form of statistical biases in samples of limited domains of language use, e.g. tree-banks. A major theoretical hypothesis of the present dissertation states that when these statistical biases are exploited within an Information-Theoretic framework of learning for acquiring the first property above, the second property is acquired as a side-effect.

The present dissertation addresses issues of complexity and efficiency of probabilistic disambiguation in general, and under the DOP model in particular. It starts out by presenting proofs that some problems of probabilistic disambiguation are intractable (NP-hard) under existing performance models. Some of these problems underlie some of the most relevant technological applications, such as speech understanding. Clearly, these proofs concern worst-case complexities that typically assume independence from properties of individual inputs.
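To make the objects of these complexity results concrete, the standard DOP/STSG probability model can be reconstructed as follows; this is a textbook-style reconstruction consistent with the abstract, and the notation (count, root, yields) is assumed rather than quoted from the dissertation. A fragment f extracted from the tree-bank receives its relative frequency among fragments sharing its root label, a derivation composes fragments by substitution, and a parse tree sums over all derivations that yield it:

```latex
% DOP-style fragment, derivation, and parse-tree probabilities (reconstruction):
P(f) \;=\; \frac{count(f)}{\sum_{f'\,:\,root(f')=root(f)} count(f')},
\qquad
P(f_1 \circ \cdots \circ f_n) \;=\; \prod_{i=1}^{n} P(f_i),
\qquad
P(T) \;=\; \sum_{d \text{ yields } T} P(d).
```

Since a single parse tree may be produced by exponentially many distinct derivations, maximizing P(T) over trees is the kind of disambiguation problem that turns out to be intractable, and such worst-case results hold regardless of how favorable the input distribution is.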
Therefore, the dissertation also provides empirical evidence further supporting the hypothesis that contemporary performance models do not account for efficiency properties of human language processing in limited domains, specifically the two properties mentioned above. The DOP model serves, throughout the dissertation, as a representative of contemporary performance models.

The dissertation then studies solutions to the inefficiency of performance models in general, and of the DOP model in particular. The principal idea for removing these sources of inefficiency is to incorporate the desired efficiency properties of human behavior in limited domains of language use, such as the properties mentioned above, into existing performance models. These properties can be incorporated into a performance model through the combination of two methods of learning from a domain-specific tree-bank: 1) an off-line method that constrains the recognition power and the ambiguity of the linguistic annotation of the tree-bank in order to specialize it for the domain, and 2) an on-line performance model that acquires less ambiguous and more efficient probabilistic grammars from that "specialized tree-bank". With this sketchy idea as a departure point, this dissertation provides a theoretical, computational and empirical study of both on-line and off-line learning of ambiguity resolution in the context of the DOP model. To this end:

* It presents a framework for specializing performance models, especially the DOP model, and Broad-Coverage Grammars to limited domains of language use. The goal of the automatic specialization algorithms is to minimize the ambiguity of the specialized grammar while retaining sufficient coverage of the domain. Ambiguity-Reduction Specialization (ARS) takes place off-line on a tree-bank that is representative of a limited domain of language use, resulting in a specialized tree-bank. A Specialized DOP (SDOP) model is then acquired from this specialized tree-bank. The dissertation discusses various algorithms based on Information-Theoretic and other measures of ambiguity and coverage.

* It presents deterministic polynomial-time algorithms for parsing and disambiguation under the DOP model for various tasks, such as sentence disambiguation and word-graph (speech-recognizer output) disambiguation (a minimal sketch of the shared dynamic-programming core appears after this list). Crucially, these algorithms have time complexity linear in STSG size. It is noteworthy that prior to the first publication of these algorithms, parsing and disambiguation under the DOP model took place solely by means of inefficient non-deterministic exponential-time algorithms.

* It reports on an extensive empirical study of implementations of the DOP model and the specialization algorithms on two independent domains that feature two languages (English and Dutch) and two different tasks. In most experiments the present Specialized DOP model outperforms the DOP model in efficiency while retaining almost the same coverage and accuracy figures. A dedicated experiment exemplifies the fact that the specialized DOP models tend to process more frequent input more efficiently (in contrast to the original DOP model).
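The flavor of such deterministic algorithms can be conveyed by the familiar Viterbi-style CKY recurrence. The sketch below is a minimal illustration for a PCFG in Chomsky normal form, not the dissertation's STSG algorithm (which additionally handles fragment substitution and word-graph input); the function name, rule encoding, and toy grammar are invented for illustration.

```python
import math
from collections import defaultdict

def viterbi_cky(words, lexical, binary, start="S"):
    """Most probable derivation of a sentence under a CNF PCFG.

    lexical: dict mapping word -> list of (lhs, prob)
    binary:  list of (lhs, rhs1, rhs2, prob) rules
    Returns (log-probability, bracketed tree) or None if no parse.
    """
    n = len(words)
    # (i, j, label) -> (best log-prob, back-pointer)
    best = defaultdict(lambda: (-math.inf, None))

    # Initialize length-1 spans with lexical rules.
    for i, w in enumerate(words):
        for lhs, p in lexical.get(w, []):
            best[(i, i + 1, lhs)] = (math.log(p), w)

    # Combine adjacent spans bottom-up; each split scans the rule
    # list once, so cost is O(n^3) times a term linear in grammar size.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, r1, r2, p in binary:
                    score = math.log(p) + best[(i, k, r1)][0] + best[(k, j, r2)][0]
                    if score > best[(i, j, lhs)][0]:
                        best[(i, j, lhs)] = (score, (k, r1, r2))

    def build(i, j, label):
        _, back = best[(i, j, label)]
        if isinstance(back, str):           # lexical entry
            return f"({label} {back})"
        k, r1, r2 = back
        return f"({label} {build(i, k, r1)} {build(k, j, r2)})"

    logp, back = best[(0, n, start)]
    return None if back is None else (logp, build(0, n, start))

# Toy usage:
lexical = {"dogs": [("N", 0.5)], "cats": [("N", 0.5)], "saw": [("V", 1.0)]}
binary = [("S", "N", "VP", 1.0), ("VP", "V", "N", 1.0)]
print(viterbi_cky(["dogs", "saw", "cats"], lexical, binary))
# -> (log 0.25, '(S (N dogs) (VP (V saw) (N cats)))')
```

Note that this computes the most probable derivation; under an STSG the most probable parse additionally requires summing over derivations, which is exactly where the NP-hardness results above apply.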
The conclusion of this dissertation concerns both subjects that this dissertation relates and studies: the computational and algorithmic aspects of disambiguation, and the specialization of performance models to limited domains of language use. The complexity analysis of disambiguation implies that efficient probabilistic disambiguation under the DOP model and various similar performance models cannot be achieved using conventional optimization techniques; this suggests that it is necessary to involve non-conventional optimization techniques and extra-linguistic sources of knowledge for achieving efficient probabilistic disambiguation. As to the disambiguation algorithms, the empirical results clearly show that they substantially improve the efficiency of disambiguation under the DOP model. Finally, the study of the specialization of performance models to limited domains results in new insights concerning the modeling of efficiency properties of human language processing. The empirical results support our hypothesis concerning the relation between these efficiency properties and the statistical biases in limited domains. Naturally, only further empirical exploration will clarify whether the present method can be applied as successfully to other languages and other domains of language use.