
Improving Text Classification by Shrinkage in a Hierarchy of Classes

A. McCallum, R. Rosenfeld, Tom Mitchell, and A. Ng
Conference Paper, Proceedings of the International Conference on Machine Learning (ICML), pp. 359–367, July 1998

Abstract

When documents are organized into a large number of topic categories, the categories are often arranged in a hierarchy. The U.S. patent database and Yahoo are two examples. This paper shows that the accuracy of a naive Bayes text classifier can be significantly improved by taking advantage of a hierarchy of classes. We adopt an established statistical technique called shrinkage that smooths the parameter estimates of a data-sparse child with those of its parent in order to obtain more robust parameter estimates. This approach is also employed in deleted interpolation, a technique for smoothing n-grams in language modeling for speech recognition. Our method scales well to large data sets with numerous categories in large hierarchies. Experimental results on three real-world data sets from UseNet, Yahoo, and corporate web pages show improved performance, with a reduction in error of up to 29% over the traditional flat classifier.
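To make the idea concrete, here is a minimal sketch of shrinkage applied to naive Bayes word probabilities. It interpolates a leaf class's maximum-likelihood estimate with its parent's, the root's, and a uniform distribution. The class names, word counts, and fixed interpolation weights are hypothetical; the paper estimates the weights with EM on held-out data rather than fixing them.

```python
# Hedged sketch of hierarchical shrinkage for naive Bayes word probabilities.
# Fixed interpolation weights stand in for the EM-estimated weights of the paper;
# the toy hierarchy and counts are hypothetical.
from collections import Counter

def mle(counts, vocab):
    """Maximum-likelihood word distribution over the vocabulary (zeros allowed)."""
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in vocab}

# Toy hierarchy: root -> science -> {physics, chemistry}; word counts per leaf class.
leaf_counts = {
    "physics": Counter({"quark": 3, "energy": 2}),
    "chemistry": Counter({"acid": 4, "energy": 1}),
}
vocab = {w for c in leaf_counts.values() for w in c}

# Ancestor estimates pool the counts of all their descendants.
science_counts = sum(leaf_counts.values(), Counter())
root_counts = science_counts  # only one branch in this toy example

# Weights for (leaf, parent, root, uniform); chosen arbitrarily for illustration.
lam = (0.5, 0.3, 0.15, 0.05)

def shrunk_estimate(leaf):
    p_leaf = mle(leaf_counts[leaf], vocab)
    p_parent = mle(science_counts, vocab)
    p_root = mle(root_counts, vocab)
    uniform = 1.0 / len(vocab)
    return {w: lam[0] * p_leaf[w] + lam[1] * p_parent[w]
               + lam[2] * p_root[w] + lam[3] * uniform
            for w in vocab}

# Words unseen in the sparse "physics" class (e.g. "acid") receive nonzero
# probability from the parent, root, and uniform components.
print(shrunk_estimate("physics"))
```

The effect is the one described in the abstract: a data-sparse child borrows statistical strength from its ancestors, which yields smoother, more robust parameter estimates than the flat maximum-likelihood model.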

BibTeX

@conference{McCallum-1998-14715,
  author    = {A. McCallum and R. Rosenfeld and Tom Mitchell and A. Ng},
  title     = {Improving Text Classification by Shrinkage in a Hierarchy of Classes},
  booktitle = {Proceedings of (ICML) International Conference on Machine Learning},
  year      = {1998},
  month     = {July},
  pages     = {359--367},
}