Authors
Janez Brank,
Marko Grobelnik,
Nataša Milic-Frayling,
Publication date
2003
Publisher
Technical Report MSR-TR-2003-34, Microsoft Corp
Total citations
Description
Text categorization involves a predefined set of categories and a set of documents that need to be classified using that categorization scheme. Each document can be assigned one or multiple categories (or perhaps none at all). We address the multi-class categorization problem as a set of binary problems where, for each category, the set of positive examples consists of documents belonging to the category while all other documents are considered negative examples. Labelled documents are used as input to various learning algorithms to train classifiers and automatically categorize new unlabelled documents. Traditionally, machine learning research has assumed that the class distribution in the training data is reasonably balanced. More recently it has been recognized that this is often not the case with realistic data sets where many more negative examples than positive ones are available. The question then …