Authors
Janez Brank,
Marko Grobelnik,
Natasa Milic-Frayling
Publication date
2002
Description
Trends toward personalizing information services and client-based applications have increased the importance of effective and efficient document categorization techniques. New user scenarios, including the use of devices with varying specifications, place higher demands on performance and computing resource management when designing and testing classification techniques. This aspect of text classification led us to explore methods for training classifiers that make optimal use of memory and processing cycles for the available training data. In particular, we consider tradeoffs between the quality of document classification, as measured by commonly used performance measures, and reductions in the feature set used to represent the data. The timing of our study coincides with the availability of a larger data collection for text classification research provided by Reuters [Reu00]. While most past research has been constrained by the limited availability of data sets suitable for experimentation, the new Reuters collection provides opportunities for numerous experimental designs. It enables researchers and practitioners to design and test systems that can meet the requirements of real-life operations. Indeed, while the popular, smaller Reuters collection [Lew98] contained fewer than 30,000 documents in total, requiring less than 30 MB of disk storage, the new collection contains more than 800,000 documents and amounts to over 2 GB of text data.