Authors
Janez Brank,
Marko Grobelnik,
Natasa Milic-Frayling,
Publication date
2003
Description
Text categorization involves a predefined set of categories and a set of documents that need to be classified according to that categorization scheme. Each document can be assigned one or more categories (or perhaps none at all). We address the multi-class categorization problem as a set of binary problems where, for each category, the set of positive examples consists of documents belonging to that category, while all other documents are considered negative examples. Labelled documents are used as input to various learning algorithms to train classifiers that automatically categorize new unlabelled documents. Traditionally, machine learning research has assumed that the class distribution in the training data is reasonably balanced. More recently, it has been recognized that this is often not the case with realistic data sets, where many more negative examples than positive ones are available. The question then arises of how best to utilize the available labelled data. It has been observed that a disproportional abundance of negative examples decreases the performance of learning algorithms such as naive Bayes and decision trees [KM97]. Thus, research has been conducted into balancing the training set by duplicating positive examples (oversampling) or discarding negative ones (downsizing) [Jap00]. When discarding negative examples, the emphasis has sometimes been on those that are close to positive ones. In this way, one reduces the chance that the learning method might produce a classifier that misclassifies positive examples as negatives [KM97]. An alternative approach has been explored in [CS98]. It involves training several …
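The binary decomposition and the two balancing strategies mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function names, the per-category label sets, and the use of uniform random sampling (rather than, say, selecting negatives close to positives) are all assumptions made here.

```python
import random

def binarize(label_sets, category):
    # One-vs-rest: documents tagged with `category` are positive (1),
    # all other documents are negative (0).
    return [1 if category in cats else 0 for cats in label_sets]

def oversample(examples, labels, seed=0):
    # Duplicate randomly chosen positive examples until the class
    # counts are equal (assumes negatives outnumber positives).
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    idx = list(range(len(labels))) + extra
    return [examples[i] for i in idx], [labels[i] for i in idx]

def downsize(examples, labels, seed=0):
    # Discard negatives uniformly at random until the class counts
    # are equal; the paper notes that informed selection (e.g. keeping
    # negatives near positives) is an alternative to this random choice.
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    idx = sorted(pos + rng.sample(neg, len(pos)))
    return [examples[i] for i in idx], [labels[i] for i in idx]
```

For example, binarizing a five-document collection for the category "sports" and then applying either function yields a training set with equal numbers of positive and negative examples.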