A Sampling-Based Exploration of Neural Text Generation Models

Bryan Eikema

Abstract: Neural text generation models form the basis of most modern natural language processing (NLP) systems. In recent years, many important innovations to neural network architectures and training paradigms have appeared, such as attention mechanisms, Transformers, and pre-training and data augmentation strategies, which have improved the performance of NLP systems tremendously. At their core, however, these models have not changed their probabilistic formulation since the original neural text generation models were first described. Yet the probabilistic nature of these models is often quickly forgotten once the model has been trained. For many natural language processing tasks, such as machine translation, deterministic search algorithms are employed to extract a single “best” generation from the model. In such cases the probabilistic model is only used to score partial subsequences during generation so as to find the highest-scoring sequence, i.e. the sequence with the highest probability under the distribution, also known as the mode of the distribution. This assumes that neural text generation models indeed put data-like sequences at their modes. A well-known observation across text generation tasks, however, suggests that the highest-probability generations of neural text generation models are not at all data-like.

A better understanding of the sequence distributions that our neural networks predict allows us to make better-informed decisions about what kind of generation strategy is appropriate for our models. Sampling is a natural way to explore the properties of the sequence distributions predicted by neural networks. By studying the properties of such samples, we also indirectly study the properties of the sequence distributions we are working with. Samples can also be used to inform generation algorithms, and for some tasks samples are even the outputs of choice from the model.

In this dissertation we explore the use of sampling to better understand neural text generation models and to inform decoding algorithms. We view commonly known pathologies and biases of neural text generation models through the lens of such a probabilistic exploration and provide a new perspective on their potential causes. We use these insights to propose and iterate on a sampling-based decoding algorithm inspired by risk minimisation strategies, as well as to develop entirely new sampling strategies for sampling from arbitrary distributions for which a per-token, i.e. autoregressive, factorisation does not exist.
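To make the contrast between mode-seeking decoding and sampling concrete, the sketch below sets up a toy autoregressive distribution and runs three decoding strategies on it: greedy search as a simple stand-in for mode-seeking search (beam search being the more common approximation), unbiased ancestral sampling, and a sampling-based decision rule in the spirit of minimum Bayes risk decoding, one common formalisation of the risk minimisation strategies mentioned above. Everything in the sketch is hypothetical and chosen purely for illustration: the vocabulary, the next_token_probs stand-in for a trained network, and the unigram-overlap utility are not taken from the dissertation itself.

```python
# Minimal illustrative sketch: mode-seeking decoding vs. sampling-based decoding
# from a toy autoregressive distribution. The "model" below is made up for
# illustration; in practice p(y_t | y_<t, x) would come from a trained network.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<eos>", "the", "cat", "sat", "mat"]
EOS = 0


def next_token_probs(prefix):
    """Toy stand-in for a network's p(y_t | y_<t, x): a fixed softmax over the
    vocabulary that depends only on the prefix length."""
    logits = np.sin(np.arange(len(VOCAB)) + len(prefix))  # arbitrary but deterministic
    logits[EOS] += 0.5 * len(prefix)                      # longer prefixes end sooner
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def greedy_decode(max_len=10):
    """Mode-seeking (greedy) decoding: commit to the locally highest-probability token."""
    prefix = []
    for _ in range(max_len):
        token = int(np.argmax(next_token_probs(prefix)))
        if token == EOS:
            break
        prefix.append(token)
    return prefix


def ancestral_sample(max_len=10):
    """Unbiased ancestral sampling: draw each token from p(y_t | y_<t, x)."""
    prefix = []
    for _ in range(max_len):
        token = int(rng.choice(len(VOCAB), p=next_token_probs(prefix)))
        if token == EOS:
            break
        prefix.append(token)
    return prefix


def utility(hyp, ref):
    """Toy utility: unigram overlap between two token sequences."""
    if not hyp and not ref:
        return 1.0
    return len(set(hyp) & set(ref)) / max(len(set(hyp) | set(ref)), 1)


def mbr_decode(num_samples=32):
    """Sampling-based minimum Bayes risk decoding: return the sample whose
    expected utility against the other samples is highest."""
    samples = [ancestral_sample() for _ in range(num_samples)]
    scores = [np.mean([utility(h, r) for r in samples]) for h in samples]
    return samples[int(np.argmax(scores))]


if __name__ == "__main__":
    detok = lambda ids: " ".join(VOCAB[i] for i in ids)
    print("greedy (mode-seeking):", detok(greedy_decode()))
    print("ancestral sample:     ", detok(ancestral_sample()))
    print("sampling-based MBR:   ", detok(mbr_decode()))
```

The point of the sketch is only the shape of the computation: mode-seeking search commits to the single highest-scoring continuation at each step, whereas the sampling-based rule treats the model as a distribution, draws a set of hypotheses from it, and selects the one expected to do well against the others.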