Text Categorization and Prototypes
Alexander Bergo

Abstract:
Text categorization faces a basic trade-off between categorization
quality (kNN) and low processing cost (Rocchio). We will try to
reconcile the two by developing centroid-based or, as we prefer to
call them, prototype-based algorithms that look for similarities and
dissimilarities. The basic idea is to measure the distance between
two objects, not their closeness. In general, our algorithms will
consist of two phases:
  1. generating prototypes, and 
  2. comparing documents to be classified to prototypes.
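
As a rough illustration of the two phases, the sketch below builds one
prototype per category and then compares a new document against those
prototypes. It assumes the simplest possible prototype (the
component-wise mean of a category's document vectors) and Euclidean
distance; the function names and the toy data are hypothetical and are
not taken from the thesis.

```python
import math
from collections import defaultdict


def build_prototypes(training):
    """Phase 1: one prototype per category (assumed here: the plain centroid)."""
    sums = {}
    counts = defaultdict(int)
    for vector, category in training:
        if category not in sums:
            sums[category] = [0.0] * len(vector)
        for i, value in enumerate(vector):
            sums[category][i] += value
        counts[category] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def euclidean(x, y):
    """Euclidean distance between two equal-length term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def classify(document, prototypes, distance=euclidean):
    """Phase 2: return the category whose prototype is least dissimilar."""
    return min(prototypes, key=lambda c: distance(document, prototypes[c]))


# Toy two-term vectors with hypothetical category labels.
training = [
    ([1.0, 0.0], "grain"),
    ([3.0, 0.0], "grain"),
    ([0.0, 2.0], "trade"),
    ([0.0, 4.0], "trade"),
]
prototypes = build_prototypes(training)
label = classify([1.5, 0.5], prototypes)
```

Only one distance per category is computed at classification time,
which is where the low processing cost of prototype methods comes from.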

In some settings, however, the two steps will be combined into a
single one. This is done by first measuring the distance between
the document to be categorized and each of the previously categorized
documents. The dissimilarity scores for each category are then
averaged and ranked. The category with the lowest average
dissimilarity score is assigned to the document we want to categorize.
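
The combined procedure might be sketched as follows. This is a minimal
illustration under stated assumptions: documents are plain lists of
term weights, the distance is the city-block (Manhattan) distance, and
the helper names and toy data are hypothetical.

```python
from collections import defaultdict


def manhattan(x, y):
    """City-block distance between two equal-length term vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def rank_categories(document, labelled, distance=manhattan):
    """Average the dissimilarity per category and rank, lowest first."""
    per_category = defaultdict(list)
    for vector, category in labelled:
        per_category[category].append(distance(document, vector))
    averages = {c: sum(d) / len(d) for c, d in per_category.items()}
    return sorted(averages.items(), key=lambda item: item[1])


# Toy two-term vectors with hypothetical category labels.
labelled = [
    ([1.0, 0.0], "grain"),
    ([3.0, 0.0], "grain"),
    ([0.0, 2.0], "trade"),
    ([0.0, 4.0], "trade"),
]
ranking = rank_categories([2.0, 1.0], labelled)
best_category = ranking[0][0]  # category with the lowest average score
```

Unlike the two-phase variant, this version touches every previously
categorized document at classification time, so its cost grows with
the size of the training set.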

One of the core issues, then, is to come up with a good notion of
prototype. We will try to implement notions of prototype that are
inspired by fields like psychology, cognitive linguistics and
philosophy, mainly by measuring dissimilarities between objects. We
compute the pairwise distances between the object we want to place in
n-dimensional term space and the previously categorized objects; then,
for each object sub-space, we take the mean of these distances and
assign a category accordingly. The aim is to choose the correct
categories for the test documents by adopting the least dissimilar
category. Our approach to distance measuring is based on variations of
the so-called Minkowski and Canberra metrics. Among other things, we
will use these notions of dissimilarity to implement Rocchio
prototypes.
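
For orientation, the textbook forms of the two metrics can be written
down in a few lines. The sketch below shows the standard definitions
only, not the variations developed in the thesis; the convention that a
0/0 term in the Canberra sum contributes zero is an assumption made
here.

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p: (sum |x_i - y_i|^p)^(1/p).

    p=1 gives the city-block distance, p=2 the Euclidean distance.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def canberra(x, y):
    """Canberra distance: sum of |x_i - y_i| / (|x_i| + |y_i|).

    Terms where both components are zero are skipped (assumed convention).
    """
    total = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator > 0:
            total += abs(a - b) / denominator
    return total
```

Because each Canberra term is normalized by the size of the components
involved, differences in small term weights count relatively more than
they do under a Minkowski metric.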

Our main aim is to apply the new approach in such a way that it can
possibly outperform the better approaches in the field today. We will
look for methods that perform well both in categorizing documents
correctly and in saving computation time.

The rest of the thesis is organized as follows. In Chapter 2 we recall
further technical facts about kNN and the Rocchio classifiers; these
will serve as our starting points. Then, in Chapter 3 we present the
basic ideas about the dissimilarity measures that we will use, based
on the Canberra and Minkowski metrics. The next step is to
evaluate our newly developed methods; in Chapter 4 we provide the
basis for a reliable test of the systems, by taking a brief look at
the Reuters collection, at ways of representing documents, and at an
example of how the relevant computations are done. In Chapter 5 we
present our experimental results, first for kNN, then for the Rocchio
classifier, and then for a variety of dissimilarity systems. The final
chapter of the thesis is devoted to a discussion, a conclusion, and
some thoughts on how to develop the systems further.