Authors
Janez Brank, D Mladenic, Marko Grobelnik
Publication date
2010
Description
We address the problem of classifying textual documents into a topical hierarchy of categories. Multi-class classification problems such as this one are often handled by converting them into several two-class problems: a binary classifier is trained for each of these problems, and their predictions are then combined to form the final classification of a document into the topic hierarchy. The conversion from the original multi-class problem into a group of two-class problems can be succinctly described by a "coding matrix". In traditional approaches, the coding matrix is either completely random or (more commonly) completely fixed in advance (e.g. 1-vs-1, 1-vs-rest); in both cases, the training data does not affect the design of the coding matrix. Our approach constructs the coding matrix gradually, one column at a time, with each new column defined so that the new binary classifier attempts to rectify the most common mistakes of the ensemble of binary classifiers built up to that point. The goal is to achieve good performance with a smaller number of binary classifiers. We also present systematic experiments on a small dataset which demonstrate that good coding matrices with a small number of columns exist, but are rare.
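To make the notion of a coding matrix concrete, the following minimal Python sketch builds the two fixed designs mentioned above (1-vs-rest and 1-vs-1) and decodes a vector of binary predictions into a class. It is an illustration under stated assumptions, not the method from the paper: the function names, the +1/-1/0 encoding, the disagreement-count decoding rule, and in particular the `next_column` heuristic (which only hints at how a data-driven column might be chosen from confusion counts) are all hypothetical.

```python
from itertools import combinations


def one_vs_rest(n_classes):
    """Rows are classes, columns are binary problems.
    +1 marks the positive class of a column, -1 the negative classes."""
    return [[1 if c == j else -1 for j in range(n_classes)]
            for c in range(n_classes)]


def one_vs_one(n_classes):
    """One column per pair of classes (a, b): a is +1, b is -1,
    and 0 means the class is left out of that binary problem."""
    pairs = list(combinations(range(n_classes), 2))
    return [[1 if c == a else -1 if c == b else 0 for (a, b) in pairs]
            for c in range(n_classes)]


def decode(matrix, predictions):
    """Return the class whose row disagrees least with the binary
    predictions (each +1 or -1). Entries equal to 0 are ignored;
    ties go to the lowest class index."""
    def disagreements(row):
        return sum(1 for m, p in zip(row, predictions) if m != 0 and m != p)
    return min(range(len(matrix)), key=lambda c: disagreements(matrix[c]))


def next_column(n_classes, confusions):
    """Hypothetical heuristic, NOT the paper's rule: given counts of how
    often the current ensemble confuses class pairs, return a new column
    that puts the most-confused pair on opposite sides (others are 0)."""
    (a, b), _ = max(confusions.items(), key=lambda kv: kv[1])
    return [1 if c == a else -1 if c == b else 0 for c in range(n_classes)]
```

For three classes, `one_vs_rest(3)` has three columns and `one_vs_one(3)` also has three, but the 1-vs-1 design grows quadratically with the number of classes, which is one reason a construction that reaches good accuracy with few columns is attractive.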