Learning Syntactic Structure

Yoav Seginer

Abstract: Syntactic structure plays a central role in most theories of language, but it cannot be directly observed. The question this thesis investigates is whether there is a relation between syntactic structure and immediately observable properties of language, such as the statistics of the words and sentences that we hear and read. Finding such a relation has important consequences for the problem of language acquisition by children, as well as implications for the theory of syntax itself. It can also be applied in engineering language-processing systems. One way to tackle this question is to design an algorithm which collects statistics from sentences in a language and uses these statistics to determine the syntactic structure of these and other sentences in the language. Such an algorithm learns the syntactic structure of a language from unannotated examples: plain sentences, as they can be observed in a language, without any extra information. The algorithm encodes a relation between the statistics collected from the observable language and the syntactic structure. If the algorithm successfully captures at least part of the syntactic structure, we can say that it approximates the relation between the observable language and its syntactic structure. Such an algorithm is called an unsupervised parser. This thesis describes a specific proposal for an unsupervised parsing algorithm. By testing the proposed parser on corpora of different languages, it is shown that the algorithm successfully discovers part of the syntactic structure of these languages. The relation described by the unsupervised parser depends not only on the choice of statistics being collected but also on the choice of representation of the syntactic structure. The right choice of representation is therefore important in keeping the parser simple.
The first part of the thesis describes a completely new representation of syntactic structure, called common cover links, and a parsing method suitable for this representation. Common cover links allow the parser to exploit two important properties of natural language which I presuppose: humans process language incrementally, and the syntactic structures of language are skewed (every subtree of a parse tree has a short branch). Using common cover links, it is possible to define an incremental parser which constructs the syntactic structure of a sentence as its words are read one by one. The common cover link representation ensures that the parser produces only skewed syntactic trees. This considerably reduces the number of possibilities the parser has to consider and makes it simple and fast.

The second part of the thesis describes the statistics that the parser collects and how they are used during parsing. An important property of the algorithm is that statistics are collected from each sentence only after that sentence has been parsed. In this way, the parser is gradually improved: new statistics are added to those previously collected, and together they are used to parse the next sentence and collect additional statistics. The statistics are collected for each word separately. For each word, they are represented by labels based on the frequencies of adjacent words and of those words' labels. From the labels, simple properties are induced which determine how words should be linked to each other during parsing. As a result of the Zipfian distribution of words in natural languages, the most frequent words dominate the properties of all words. In this way, the most frequent labels take over the role of the traditional parts of speech. The properties of a word are induced by summing over the properties of other words.
This makes the learning process very simple and fast. The parser is tested on three corpora, of English, German, and Chinese. In each of these three languages the parser successfully discovers part of the syntactic structure of the language. The parser is much faster than other unsupervised parsers while achieving state-of-the-art accuracy.
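The skewness assumption (every subtree of a parse tree has a short branch) can be made concrete with a toy check. The tuple encoding of trees and the branch-length bound used here are illustrative assumptions, not the thesis's formal definitions:

```python
# Toy illustration of the skewness property the parser relies on:
# a parse tree is "skewed" if every subtree has at least one short
# branch. Leaves are strings; internal nodes are tuples of children.

def shortest_branch(tree):
    """Length of the shortest root-to-leaf path."""
    if isinstance(tree, str):           # a leaf word
        return 0
    return 1 + min(shortest_branch(child) for child in tree)

def is_skewed(tree, bound=1):
    """Every subtree must have a branch no longer than `bound`
    (an assumed bound, chosen here for illustration)."""
    if isinstance(tree, str):
        return True
    return (shortest_branch(tree) <= bound and
            all(is_skewed(child) for child in tree))

# A right-branching tree: every subtree has an immediate leaf child.
right_branching = ("the", ("cat", ("sat", ("on", ("the", "mat")))))
print(is_skewed(right_branching))   # True

# A balanced tree over four leaves: the root's shortest branch
# has length 2, so it fails the bound.
balanced = (("a", "b"), ("c", "d"))
print(is_skewed(balanced))          # False
```

Because only trees of this restricted shape can be produced, the space of analyses the incremental parser must consider at each word is sharply reduced, which is the source of the speed the abstract refers to.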
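The parse-then-update regime and the per-word adjacency statistics described above can be sketched as follows. The counts kept here, and the crude "label" (a word's most frequent right neighbours), are heavy simplifications for illustration only; they are not the thesis's actual label or linking computation:

```python
from collections import Counter, defaultdict

class AdjacencyStats:
    """Simplified per-word statistics: counts of immediately
    adjacent words. Illustrative only, not the thesis's model."""

    def __init__(self):
        self.left = defaultdict(Counter)   # left[w][v]: v seen left of w
        self.right = defaultdict(Counter)  # right[w][v]: v seen right of w

    def update(self, sentence):
        """Collect statistics from a sentence (done only after the
        sentence has been parsed with the current statistics)."""
        for a, b in zip(sentence, sentence[1:]):
            self.right[a][b] += 1
            self.left[b][a] += 1

    def label(self, word, k=3):
        """A crude stand-in for a label: the word's k most frequent
        right neighbours. By Zipf's law, a few very frequent words
        dominate these labels across the whole vocabulary."""
        return tuple(v for v, _ in self.right[word].most_common(k))

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat saw the dog".split(),
]

stats = AdjacencyStats()
for sentence in corpus:
    # In the thesis's regime the sentence would first be parsed with
    # the statistics gathered so far; only the update step is shown.
    stats.update(sentence)

print(stats.label("the"))
```

Each sentence improves the statistics available for the next one, so the parser gets gradually better as it reads the corpus, with no annotated data involved at any point.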