Aligning the Foundations of Hierarchical Statistical Machine Translation
Gideon Maillette de Buy Wenniger

Abstract: Globalization is one of the characteristics of our time. Documents and written texts are available from all over the world, but they are not always accessible because of language barriers. The work of human translators is time-consuming and costly. The introduction of machine translation has opened new opportunities, but its results are often still lacking in various respects, including word order. This dissertation contributes methods to improve hierarchical statistical machine translation (SMT) using the hierarchical translation equivalence relations induced from word alignments. The information obtained from these relations improves word order and coherence of the produced translations in particular, but lexical choice is affected as well.

The core problem addressed in this dissertation is that hierarchical SMT uses little context when composing rules into translations, leading to independently made and poorly coordinated reordering decisions. In particular, Hiero (Chiang, 2005) grammars lack labels for nonterminals, causing the decoder to ignore the context of other rules when rules are applied. The foundational idea in this thesis is to explicitly represent the hierarchical translation equivalence structure induced by word alignments, using a newly proposed framework called hierarchical alignment trees (HATs) (Sima'an and Maillette de Buy Wenniger, 2013). This allows us to model how translation equivalence is composed in observed data, and to generalize from this to learn context-aware rules that can be composed for the translation of unseen data. The main specific case of this general scheme addressed in this thesis is the use of this representation to provide reordering context to hierarchical rules in the form of labels.
The poor use of context by Hiero has been addressed before in the literature by adding syntactic labels to Hiero. The popular syntax-augmented machine translation (SAMT) system (Zollmann and Venugopal, 2006) is the standard example of this approach. But there are two main problems with using syntax. The first problem is that syntax and alignment structure are not necessarily compatible. The second problem is that reliable parsers are not available for all languages. These problems motivate our new approach, which does not rely on syntax, but instead on the rich information from the word alignments. These word alignments serve to construct bilingual reordering labels that allow rules to support better, context-aware reordering decisions. Reordering labels are applied in combination with a loosely constrained approach to label matching, which allows the grammar to learn soft preferences for particular label substitutions during tuning. These labels yield significant improvements over both Hiero and SAMT for three different language pairs, with the strongest improvements obtained for Chinese-English translation. Where do reordering labels come from? Reordering labels come from HATs. HATs are bilingual trees that represent the hierarchical translation equivalence structure induced from word alignments. HATs compactly represent all contiguous translation equivalence units (TEUs) that can be induced from word alignments. How do HATs differ from existing representations for TEUs and hierarchical reordering structure? HATs build further upon permutation trees (PETs) (Gildea et al., 2006) and normalized decomposition trees (NDTs) (Zhang et al., 2008a). Importantly, HATs preserve all information present in the original word alignments, distinguishing them from NDTs, which present only the decomposition structure of the TEUs.
Crucially, HATs generalize both PETs and NDTs by representing arbitrary discontiguous word alignments while at the same time representing the recursive bilingual correspondence relations for all TEUs induced from word alignments. What new contributions are made based on HATs in this thesis? HATs, like NDTs, provide a basis for the extraction of bilingual rules, but unlike NDTs they also provide the information required to form reordering labels for those rules. Furthermore, HATs have been applied to visualize hierarchical translation equivalence, facilitating a better qualitative understanding of empirical hierarchical translation equivalence (Maillette de Buy Wenniger and Sima'an, 2014b). Additionally, HATs have been used to study the complexity of empirical translation equivalence quantitatively, as discussed in more detail next. The last part of this thesis examines how to characterize the complexity of empirical translation equivalence as induced from word alignments. In particular, given a word alignment and a grammar, we try to give a formal answer to the question of what it means for the grammar to cover the word alignment. Based on the intersection of the set of TEUs induced from the word alignment and the set of TEUs inferable from the grammar, we contribute a method to answer this question exactly. This contrasts with other work, which provides only upper bounds on alignment coverage. It is then shown how HATs can be applied to implement our method while avoiding explicit intersection of sets of translation equivalents. This enables exact measurement of alignment coverage that is also efficient.
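As a rough illustration of the notions used above, the following Python sketch enumerates the contiguous TEUs of a small word alignment and intersects them with a second set of TEUs. It is not the method of the thesis: the brute-force enumeration is exactly the explicit approach that HATs are said to avoid, the function name `contiguous_teus` is hypothetical, and `monotone_teus` is a hand-built stand-in for "the TEUs inferable from a grammar", here assumed to be a grammar that never reorders. A TEU is taken to be a pair of inclusive source and target spans that contains at least one alignment link and has no link crossing its boundary.

```python
def contiguous_teus(alignment, src_len, tgt_len):
    """Enumerate contiguous phrase pairs consistent with a word alignment.

    alignment: set of (source_position, target_position) links, 0-based.
    Returns a set of ((i, j), (k, l)) pairs of inclusive spans.
    Brute force, for illustration only.
    """
    teus = set()
    for i in range(src_len):
        for j in range(i, src_len):
            for k in range(tgt_len):
                for l in range(k, tgt_len):
                    links_inside = 0
                    consistent = True
                    for s, t in alignment:
                        in_src = i <= s <= j
                        in_tgt = k <= t <= l
                        if in_src and in_tgt:
                            links_inside += 1
                        elif in_src or in_tgt:
                            # A link leaves the phrase pair on one side only.
                            consistent = False
                            break
                    if consistent and links_inside > 0:
                        teus.add(((i, j), (k, l)))
    return teus


# Toy alignment: word 0 stays in place, words 1 and 2 swap.
alignment = {(0, 0), (1, 2), (2, 1)}
alignment_teus = contiguous_teus(alignment, src_len=3, tgt_len=3)

# Hypothetical stand-in for the TEUs a purely monotone grammar could build:
# every source span paired with the identical target span.
monotone_teus = {((i, j), (i, j)) for i in range(3) for j in range(i, 3)}

# TEUs shared by the alignment and the monotone stand-in.
shared = alignment_teus & monotone_teus
print(sorted(alignment_teus))
print(sorted(shared))
```

In this toy case the swapped pair survives only as the undivided unit ((1, 2), (1, 2)) in the intersection; its two single-word TEUs are not reachable without reordering.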
A large empirical study of both manually and automatically produced word alignments shows that: 1) empirical hierarchical translation equivalence is much more complex than commonly believed, 2) for all language pairs, a large fraction of the word alignments is neither binarizable nor coverable by permutations (bijective mappings) alone, 3) embedding complex alignment configurations up to a limited maximum length in atomic units, which are then ignored when measuring complexity, eliminates only part of the complex alignment configurations and is by itself not sufficient to achieve full alignment coverage. This thesis shows that word order and coherence of hierarchical statistical machine translation can be significantly improved without syntax, by using just the information present in word alignments. The thesis contributes the framework of HATs and demonstrates its usefulness for a variety of applications, including rule extraction, labeling, and the analysis of empirical hierarchical translation equivalence.
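To make the term "binarizable" in finding 2 concrete, the sketch below checks whether a permutation can be built up from binary monotone and inverted merges only. It assumes the simplest case, a one-to-one alignment written as a permutation of 0..n-1, and uses the well-known shift-reduce test on adjacent ranges; the function name `is_binarizable` is hypothetical, and the general many-to-many alignments studied in the thesis are not handled here.

```python
def is_binarizable(permutation):
    """Return True if the permutation decomposes into binary merges.

    permutation: list containing each of 0..n-1 exactly once, where
    permutation[i] is the target position of source position i.
    Uses a shift-reduce pass: each item is pushed as a one-element range,
    and the top two ranges are merged whenever they are adjacent.
    """
    stack = []
    for position in permutation:
        stack.append((position, position))
        while len(stack) >= 2:
            lo_a, hi_a = stack[-2]
            lo_b, hi_b = stack[-1]
            if hi_a + 1 == lo_b:      # monotone merge
                stack[-2:] = [(lo_a, hi_b)]
            elif hi_b + 1 == lo_a:    # inverted merge
                stack[-2:] = [(lo_b, hi_a)]
            else:
                break
    return len(stack) <= 1


print(is_binarizable([1, 0, 2]))     # a local swap followed by a monotone step
print(is_binarizable([1, 3, 0, 2]))  # no two neighbouring items are adjacent
```

The second permutation is the standard small example of a non-binarizable reordering: no pair of neighbouring source positions maps to neighbouring target positions, so the first merge can never happen.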