Authors
Dunja Mladenic, Marko Grobelnik
Description
Most text domains are characterized by a large number of features and examples. This paper describes an approach to text categorization that uses a small subset of features. The data is collected from Yahoo, a large text hierarchy of Web documents. The number of features is further reduced by using the short description of each document given in the hierarchy instead of the document itself. Documents are represented as feature vectors that include word sequences, rather than only single words as is common when learning on text data. Based on the hierarchical structure, the problem is divided into subproblems, each representing one of the categories included in the Yahoo hierarchy. The result of learning is a set of independent classifiers, each used to predict the probability that a new example is a member of the corresponding category. Experimental evaluation on real-world data shows that …
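The two ideas in the description, word-sequence features and one independent classifier per category, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the description does not say which learning algorithm is used, so a Laplace-smoothed naive Bayes model stands in here, and the names `word_sequences`, `CategoryClassifier`, and `train_per_category`, the maximum sequence length of 2, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def word_sequences(text, max_len=2):
    """Return all word sequences of length 1..max_len found in the text."""
    words = text.lower().split()
    feats = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            feats.append(" ".join(words[i:i + n]))
    return feats


class CategoryClassifier:
    """Binary classifier for one category: member vs. non-member.

    Assumption: naive Bayes is used here only as a simple stand-in model.
    """

    def __init__(self, max_len=2):
        self.max_len = max_len
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = word_sequences(text, self.max_len)
            self.counts[label].update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def predict_proba(self, text):
        """Estimate the probability that the text belongs to the category."""
        total_docs = self.docs[True] + self.docs[False]
        log_score = {}
        for label in (True, False):
            # Laplace-smoothed class prior.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            denom = sum(self.counts[label].values()) + len(self.vocab)
            for feat in word_sequences(text, self.max_len):
                if feat in self.vocab:  # features never seen in training are skipped
                    score += math.log((self.counts[label][feat] + 1) / denom)
            log_score[label] = score
        # Normalize the two log scores into a probability.
        top = max(log_score.values())
        exp = {k: math.exp(v - top) for k, v in log_score.items()}
        return exp[True] / (exp[True] + exp[False])


def train_per_category(texts, categories):
    """Train one independent binary classifier per category."""
    return {
        cat: CategoryClassifier().fit(texts, [c == cat for c in categories])
        for cat in sorted(set(categories))
    }


# Toy short descriptions standing in for hierarchy entries (made-up data).
texts = [
    "machine learning research groups",
    "neural networks and machine learning",
    "football league scores",
    "basketball league news",
]
categories = ["Science", "Science", "Sports", "Sports"]
models = train_per_category(texts, categories)
```

Each model is queried separately, for example `models["Science"].predict_proba("machine learning conference")`, so a new description receives one membership estimate per category rather than a single forced label.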