Statistical Inference Through Data Compression

Rudi Cilibrasi

Abstract: This thesis concerns a remarkable new scientific development that advances the state of the art in the field of data mining, or searching for previously unknown but meaningful patterns in fully or semi-automatic ways. A substantial amount of mathematical theory is presented, as well as very many (though not yet enough) experiments. The results serve to test, verify, and demonstrate the power of this new technology.

The core ideas of this thesis relate substantially to data compression programs. For more than 30 years, data compression software has been developed and significantly improved with better models for almost every type of file. Until recently, the main driving interests in such technology were to economize on disk storage or network data transmission costs. A new way of looking at data compressors and machine learning allows us to use compression programs for a wide variety of problems.

Three themes are important in this thesis. The first is the use of data compressors in new ways. The second is a new tree visualization technique. The third is an information-theoretic connection between a web search engine and the data mining system. Let us examine each of these in turn.

1. Data Compression and Statistics

The first theme concerns the statistical significance of compressed file sizes. Most computer users realize that there are freely available programs that can compress text files to about one quarter of their original size. The less well-known aspect of data compression is that combining two or more files into a single conglomerate {\em archive file} prior to compression often yields better compression in aggregate. This has been used to great advantage in widely popular programs like {\tt tar} or {\tt pkzip}, which combine archival and compression functionality. Only in recent years have scientists begun to appreciate that compression ratios carry a great deal of important statistical information.

All of the experiments in this thesis make use of a group of compressible objects. In each case, the individual compressed size of each object is calculated. Then, some or all possible pairs of objects are combined and compressed to yield pairwise compressed sizes. It is the tiny variations in the pairwise compressed sizes that yield the surprisingly powerful results of the following experiments. The key concept is that if two files are very similar in their contents, then they compress much better when combined prior to compression than the sum of their separately compressed sizes would suggest. If two files have little or nothing in common, then combining them yields no benefit over compressing each file separately.

Although the principle is easy to grasp intuitively, it has surprising breadth of applicability. Using even the simplest string-matching compressors of the 1970s, it is possible to construct evolutionary trees for animals fully automatically from files containing their mitochondrial gene sequences. We first construct a matrix of pairwise distances between objects (files) that indicate how similar they are. These distances are based on comparing compressed file sizes as described above. We can apply this to files of widely different types, such as music pieces or genetic codes, as well as to many other specialized domains.
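As an illustration, the following is a minimal sketch of this measurement in Python, assuming bzip2 (via the standard {\tt bz2} module) as the compressor and byte strings as inputs; any real compressor could be substituted, and the particular normalization shown anticipates the Normalized Compression Distance formula developed in this thesis.

\begin{verbatim}
import bz2

def compressed_size(data):
    # Length in bytes of the bzip2-compressed representation of data.
    return len(bz2.compress(data))

def ncd(x, y):
    # Similar files gain more from being compressed together: compare the
    # compressed size of the concatenation with the separately compressed sizes.
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def distance_matrix(files):
    # Pairwise distances between a list of byte strings.
    return [[ncd(a, b) for b in files] for a in files]
\end{verbatim}

For nearly identical files this quantity is close to zero, while for files with nothing in common it approaches one.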
Although simple compressors work, it is also easy to use the most advanced modern compressors with the theory presented in this thesis; these often yield more accurate results than simpler compressors in particular circumstances or domains. In this thesis we describe the {\em Normalized Compression Distance} (NCD), which formalizes the ideas we have just described. We report on a plethora of experiments showing applications to a variety of interesting data mining problems using gene sequences, music, text corpora, and other inputs.

2. Visualization

Custom open source software has been written to provide powerful new visualization capabilities. The {\em CompLearn\/} software system implements our theory, and with it experiments of two types may be carried out: classification and clustering. Classification refers to the application of discrete labels to a set of objects based on a set of examples from a human expert. Clustering refers to the arrangement of objects into groups without prior training or influence by a human expert.

In this thesis we deal primarily with hierarchical or nested clustering, in which a group of objects is arranged into a sort of binary tree. This clustering method is called the {\em quartet method}. In a nutshell, the quartet method is a way to determine a best-matching tree for data that is to be understood as a hierarchical clustering. It is called the quartet method because it is based on the smallest unrooted binary tree, which consists of two pairs of two nodes, for a total of four nodes comprising the quartet. The method evaluates a big tree by adding up the costs of the many such small binary trees embedded in it, and then adjusts the tree according to the results of the evaluation (a sketch of this scoring step appears at the end of this summary). After a time, a best-fitting tree is declared and interpretation of the experimental results becomes possible. The compression-based algorithms output a matrix of pairwise distances between objects. Because such a matrix is hard to interpret, we try to extract some of its essential features using the quartet method. This results in a tree optimized so that similar objects with small distances are placed near each other.

3. Data Compression and the Web: the Google Distance

It is possible to use coding theory to connect the compression approach to the web with the help of a search engine index database. By using a simple formula based on logarithms we can find ``compressed sizes'' of search terms. As before, a distance matrix is made, but this time with Google providing page count statistics that are converted to codelengths for use in the distance matrix calculations (a sketch of this conversion also appears at the end of this summary).

Throughout this thesis the reader will find ample experiments demonstrating the machine learning technology. There are objective experiments based on pure statistics using true data compressors, as well as subjective experiments using statistics from web pages. There are examples taken from genetics, linguistics, literature, radio astronomy, optical character recognition, music, and many more diverse areas.
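To make the quartet scoring of Section 2 concrete, here is a minimal sketch assuming a candidate unrooted binary tree given as an adjacency dictionary and a symmetric distance matrix given as a nested dictionary; the function names are illustrative, and the randomized search that repeatedly mutates the tree to raise the score is omitted, so this is a sketch under those assumptions rather than the CompLearn implementation itself.

\begin{verbatim}
import itertools
from collections import deque

def path_nodes(tree, a, b):
    # Nodes on the unique path from leaf a to leaf b in an unrooted tree,
    # where the tree is an adjacency dict {node: list of neighbouring nodes}.
    parent = {a: None}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            break
        for v in tree[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    nodes, u = set(), b
    while u is not None:
        nodes.add(u)
        u = parent[u]
    return nodes

def tree_cost(tree, leaves, d):
    # For every group of four leaves, exactly one of the three pairings pq|rs
    # is embedded in a binary tree (its two paths are disjoint); the cost of
    # the tree is the sum of d[p][q] + d[r][s] over these embedded pairings.
    total = 0.0
    for a, b, c, e in itertools.combinations(leaves, 4):
        for (p, q), (r, s) in (((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))):
            if path_nodes(tree, p, q).isdisjoint(path_nodes(tree, r, s)):
                total += d[p][q] + d[r][s]
                break
    return total

def tree_score(tree, leaves, d):
    # Normalized score in [0, 1]: 1 means every quartet is embedded in its
    # cheapest possible pairing.  A search would mutate the tree and keep
    # changes that raise this score.
    best = worst = 0.0
    for a, b, c, e in itertools.combinations(leaves, 4):
        costs = [d[a][b] + d[c][e], d[a][c] + d[b][e], d[a][e] + d[b][c]]
        best += min(costs)
        worst += max(costs)
    cost = tree_cost(tree, leaves, d)
    return (worst - cost) / (worst - best) if worst > best else 1.0
\end{verbatim}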
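Similarly, for the codelength conversion of Section 3, the following sketch turns page counts into codelengths and into a distance between two search terms. The page counts fx, fy, fxy (pages containing the first term, the second term, and both) and the total number of indexed pages n are assumed to be supplied by hand; no search engine API is invoked here.

\begin{verbatim}
import math

def codelength(pagecount, total_pages):
    # Codelength in bits of a term occurring on pagecount of total_pages
    # pages; rarer terms receive longer codes.
    return -math.log2(pagecount / total_pages)

def google_distance(fx, fy, fxy, n):
    # Normalized Google Distance between two search terms, computed from the
    # page counts fx, fy, fxy out of n indexed pages.
    lx, ly, lxy = math.log2(fx), math.log2(fy), math.log2(fxy)
    return (max(lx, ly) - lxy) / (math.log2(n) - min(lx, ly))
\end{verbatim}

Terms that almost always occur together on the same pages receive a small distance, while terms that rarely co-occur receive a large one.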