Minimum Description Length Model Selection: Problems and Extensions Steven de Rooij Abstract: Model selection is a strange and wonderful topic in learning theory and statistics. At first glance the question seems very clear-cut: how should we decide which set of probability distributions matches the observations at hand best. This question comes up time and again in many different contexts, ranging from testing scientific hypotheses in general (which among these psychological models describes best how people behave?) to more concrete applications (what order polynomial should we use to fit the data in this regression problem? What lossy representation of this image best captures the structural properties of the original?). Thus, model selection is ubiquitous, and the one-size-fits-all criteria based on the Minimum Description Length (MDL) principle and the closely related Bayesian statistics are appreciated by many. Upon closer inspection, many applications of model selection are not as similar as they may first appear. They can be distinguished by technical properties (are the models nested? Parametric? Countable?), but also by a priori assumptions (is the process generating the data believed to be an element of any of the considered models?), as well as the motivation for performing model selection in the first place (do we want to identify which model contains the data generating process, or do we want to identify which model we may expect to predict future data best?). The best choice of methodology in any situation often depends on such particulars, and is further determined by practical considerations such as whether or not the relevant quantities can be evaluated analytically, and whether efficient algorithms exist for their calculation. MDL/Bayesian model selection has been shown to perform quite well in many different contexts and applications; in this thesis we treat some of the puzzling problems and limitations that have also become apparent over time. We also extend the idea by linking it to other topics in machine learning and statistical inference. To apply MDL, universal codes or distributions have to be associated with each of the considered models. The preferred code is the Normalised Maximum Likelihood (NML) or Shtarkov code. However, this code yields infinite code word lengths for many models. This first issue with MDL model selection is investigated in Chapter 2, in which we perform computer experiments to test the performance of some of the available alternatives. One result is that the model selection criterion based on the so-called prequential plug-in code displays inferior performance. This observation seems important because the prequential plug-in code is often thought of as a convenient alternative to other universal codes such as the NML code, as it is much easier to calculate. It was thought to result in code lengths similar to those obtained for other universal codes (such as NML, 2-part codes or Bayesian mixtures), but we discovered that this is only the case if the data generating process is in the model. We show in Chapter 3 that the redundancy of the prequential plug-in code is fundamentally different from the standard set by other universal codes if the data generating process is not an element of the model, so that caution should be exercised when it is applied to model selection. The third problem treated in this thesis is that MDL/Bayesian model selection normally does not take into account that, even in the ideal case where one of the considered models is ``true'' (contains the data generating process), and even if the data generating process is stationary ergodic, then still the index of the model whose associated universal code issues the best predictions of future data often changes with the sample size. Roughly put, at small sample sizes simple models often issue better predictions of future data than the more complex ``true'' model, i.e. the smallest model that contains the data generating distribution. When from a certain sample size onward the true model predicts best, the simpler model has already built up a lot of evidence in its favour, and a lot of additional data have to be gathered before the true model ``catches up'' and is finally identified by Bayesian/MDL model selection. This phenomenon is described in Chapter 5, in which we also introduce a novel model selection procedure that selects the true model almost as soon as enough data have been gathered for it to be able to issue the best predictions. The criterion is consistent: under mild conditions, the true model is selected with probability one for sufficiently large sample sizes. We also show that a prediction strategy based on this model selection criterion achieves an optimal rate of convergence: its cumulative KL-risk is as low as that of any other model selection criterion. The method is based on the ``switch distribution'', which can be evaluated using an efficient expert tracking algorithm. More properties of this switch distribution are treated in Chapter 4, which also contains a survey of this and other expert tracking algorithms and shows how such algorithms can be formulated in terms of Hidden Markov Models. Finally, in Chapter 6 we evaluate the new theory of algorithmic rate-distortion experimentally. This theory was recently proposed by Vitanyi and Vereshchagin as an alternative to classical rate-distortion theory. It allows analysis of the structural properties of individual objects and does not require the specification of a probability distribution on source objects. Instead it is defined in terms of Kolmogorov complexity, which is uncomputable. To be able to test this theory in practice we have approximated the Kolmogorov complexity by the compressed size of a general purpose data compression algorithm. This practical framework is in fact a generalisation of MDL model selection. The perspectives offered in this thesis on many aspects of MDL/Bayesian model selection, contribute to a better understanding of the relationships between model selection and such diverse topics as universal learning, prediction with expert advice, rate distortion theory and Kolmogorov complexity.