Dunja Mladenic - joint work with Janez Brank and Marko Grobelnik, J.Stefan Institute, Slovenia and Natasa Milic-Frayling, Microsoft Research, Cambridge
Document categorization is the task of classifying natural language documents
into a set of predefined categories. Documents are typically represented
by sparse vectors under the vector space model, where each word in
the vocabulary is mapped to one coordinate axis and its occurrence
in the document gives rise to one nonzero component in the vector representing
that document. When training classifiers on large collections of documents,
both the time and memory requirements may be prohibitive. This calls
for the use of a feature selection method, not only to reduce
the number of features but also to increase the sparsity of document vectors.
We explore effects of various feature selection algorithms on document classification performance. We propose to use two, possibly distinct linear classifiers: one used exclusively for feature selection in order to obtain the feature space for training the second classifier, using possibly a different training set. We propose a feature selection method based on linear Support Vector Machines (SVMs). Linear SVM is used on a subset of the training set to train a linear classifier separating positive from negative instances. The linear classifier is characterized by the normal to the separating hyperplane, essentially a weighting of features where a greater feature weight (in terms of absolute value) implies a higher impact during data classification. We explore the effects of feature selection based on a simple cut-off on the weights of features in the normal. After feature selection on a subset of training data, the model is trained on the full training set represented by the reduced set of features.
We compare this feature selection approach to more traditional feature selection methods for documents, such as information gain and odds ratio in terms of the sparsity of vectors and classification performance. Experiments show that feature selection based on the linear SVM algorithm combines well with different types of classifiers and on Reuters-2000 dataset outperforms the two tested traditional feature selection methods.
Back to the Main Page
Charles Rosenberg Last modified: Mon Dec 9 10:46:44 EST 2002