Statistical Inference Through Data Compression
Rudi Cilibrasi
Abstract:
This thesis concerns a remarkable new scientific development that
advances the state of the art in data mining: the search for
previously unknown but meaningful patterns in fully or semi-automatic
ways. A substantial amount of mathematical theory is presented, along
with a great many (though not yet enough) experiments.
The results serve to test, verify, and demonstrate the power of this
new technology. The core ideas of this thesis relate substantially to
data compression programs. For more than 30 years, data compression
software has been developed and significantly improved with better
models for almost every type of file. Until recently, the main
driving interests in such technology were to economize on disk storage
or network data transmission costs. A new way of looking at data
compressors and machine learning allows us to use compression programs
for a wide variety of problems.
Three themes are central to this thesis. The first is the use of
data compressors in new ways. The second is a new tree visualization
technique. The third is an information-theoretic connection between a
web search engine and the data mining system. Let us examine each of
these in turn.
1. Data Compression and Statistics
The first theme concerns the statistical significance of compressed
file sizes. Most computer users realize that there are freely
available programs that can compress text files to about one quarter
their original size. A less well-known aspect of data compression is
that combining two or more files into a single conglomerate {\em
archive file} prior to compression often yields better compression in
aggregate. This fact is exploited in widely popular programs like
{\tt tar} or {\tt pkzip}, which combine archival and compression
functionality. Only in recent
years have scientists begun to appreciate the fact that compression
ratios signify a great deal of important statistical information. All
of the experiments in this thesis make use of a group of compressible
objects. In each case, the individual compressed size of each object
is calculated. Then some or all possible pairs of objects are
combined and compressed to yield pairwise compressed sizes. It is the
tiny variations in these pairwise compressed sizes that yield the
surprisingly powerful results of the experiments that follow. The key
observation is that if two files are very similar in their contents,
then their concatenation compresses to much less than the sum of the
sizes of the two separately compressed files. If two files have
little or nothing in common, then combining them yields little or no
benefit over compressing each file separately.
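The effect is easy to observe directly. The following minimal sketch
(in Python, with {\tt zlib} standing in for the real-world compressors
used in this thesis; the byte strings are illustrative stand-ins for
files) prints the separate and combined compressed sizes of a similar
pair and of a dissimilar pair:
\begin{verbatim}
import os
import zlib

def csize(data):
    # Compressed size of a byte string, in bytes.
    return len(zlib.compress(data, 9))

x = b"the quick brown fox jumps over the lazy dog. " * 50
y = b"the quick brown fox leaps over the lazy cat. " * 50
z = os.urandom(len(x))  # incompressible stand-in for an unrelated file

# Similar files compress far better together than separately...
print(csize(x) + csize(y), csize(x + y))
# ...while an unrelated file gains little or nothing from concatenation.
print(csize(x) + csize(z), csize(x + z))
\end{verbatim}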
Although the principle is easy to grasp, it has surprising breadth of
applicability. Using even the simplest string-matching compressors of
the kind developed in the 1970s, it is possible to construct
evolutionary trees for animals fully automatically from files
containing their mitochondrial gene sequences. We first construct a
matrix of pairwise distances between objects (files) that indicates
how similar they are, the distances being based on comparing
compressed file sizes as described above. We can apply this to files
of widely different types, such as pieces of music or genetic codes,
as well as to many other specialized domains.
Although simple compressors work, it is just as easy to use the most
advanced modern compressors with the theory presented in this thesis;
these often give more accurate results than simpler compressors in
particular circumstances or domains. In this thesis we describe the
{\em Normalized Compression Distance} (NCD), which formalizes the
ideas just described. We report on a plethora of experiments showing
applications to a variety of interesting data mining problems using
gene sequences, music, text corpora, and other inputs.
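Concretely, writing $C(x)$ for the compressed size of a file $x$ and
$C(xy)$ for the compressed size of the concatenation of $x$ and $y$,
the NCD is
\[
\mbox{NCD}(x,y) = \frac{C(xy) - \min\{C(x),C(y)\}}{\max\{C(x),C(y)\}} .
\]
Its value is close to $0$ for very similar objects and close to $1$
for objects that have nothing in common.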
2. Visualization
Custom open source software has been written to provide powerful new
visualization capabilities. The {\em CompLearn\/} software system
implements our theory; with it, experiments of two types may be
carried out: classification and clustering. Classification refers to
the application of discrete labels to a set of objects based on
examples from a human expert. Clustering refers to the arrangement of
objects into groups without prior training or influence by a human
expert. In this thesis we deal primarily with hierarchical or nested
clustering, in which a group of objects is arranged into a binary
tree. The clustering method we use is called the {\em quartet
method}. In a nutshell, the quartet method is a way to determine a
best-matching tree for data that is to be understood as a
hierarchical clustering. It is called the quartet method because it
is based on the smallest informative unrooted binary tree: two joined
pairs of leaves, four leaves in all, comprising a quartet. The method
evaluates a large tree by summing the costs of the many small quartet
trees it embeds, and then adjusts the tree according to the results
of the evaluation. After a time, a best-fitting tree is declared and
the interpretation of the experimental results becomes possible. The
compression-based algorithms output a matrix of pairwise distances
between objects. Because such a matrix is hard to interpret, we
extract some of its essential features using the quartet method.
This results in a tree optimized so that similar objects, those at
small distances, are placed near each other.
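As a rough sketch of the bookkeeping involved (not the CompLearn
implementation itself), the quartet costs and the bounds used to
normalize a tree's score can be computed as follows, assuming a
symmetric distance matrix {\tt d} given as a list of lists:
\begin{verbatim}
from itertools import combinations

def quartet_costs(d, quartet):
    # For leaves u, v, w, x there are three possible pairings
    # (topologies); the cost of uv|wx is d[u][v] + d[w][x].
    u, v, w, x = quartet
    return [d[u][v] + d[w][x],   # uv|wx
            d[u][w] + d[v][x],   # uw|vx
            d[u][x] + d[v][w]]   # ux|vw

def cost_bounds(d):
    # Sums m and M of the minimal and maximal costs over all
    # quartets.  A candidate tree with raw cost C_T (the sum of the
    # costs of the topologies it embeds) gets the normalized benefit
    # score S(T) = (M - C_T) / (M - m).
    n = len(d)
    m = M = 0.0
    for q in combinations(range(n), 4):
        costs = quartet_costs(d, q)
        m += min(costs)
        M += max(costs)
    return m, M
\end{verbatim}
A hill-climbing search then proposes random tree mutations and keeps
those that raise $S(T)$, declaring a best-fitting tree when the score
stops improving.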
3. Data Compression and the Web: the Google Distance
It is possible to use coding theory to connect the compression
approach to the web, with the help of a search engine's index
database. Using a simple formula based on logarithms, we can assign
``compressed sizes'' to search terms. As before, a distance matrix is
constructed, but this time Google supplies page count statistics,
which are converted to codelengths for use in the distance matrix
calculations.
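Writing $f(x)$ for the number of pages containing the search term
$x$, $f(x,y)$ for the number containing both $x$ and $y$, and $N$ for
the total number of pages indexed, the codelength of $x$ is
$\log(N/f(x))$, and the resulting {\em Normalized Google Distance}
(NGD) is
\[
\mbox{NGD}(x,y) =
\frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}
     {\log N - \min\{\log f(x), \log f(y)\}} .
\]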
Throughout this thesis the reader will find ample experiments
demonstrating this machine learning technology: objective experiments
based on pure statistics, using true data compressors, as well as
subjective experiments using statistics from web pages. The examples
are drawn from genetics, linguistics, literature, radio astronomy,
optical character recognition, music, and many other diverse areas.