Authors
Dunja Mladenić, Marko Grobelnik
Publication date
2003
Publisher
North-Holland
Description
The paper describes feature subset selection as used in learning on text data (text learning) and gives a brief overview of feature subset selection methods commonly used in machine learning. Several known and some new feature scoring measures appropriate for feature subset selection on large text data are described and related to each other. An experimental comparison of the described measures is given on real-world data collected from the Web. Machine learning techniques are applied to data collected from Yahoo, a large text hierarchy of Web documents. Our approach includes some original ideas for handling a large number of features, categories and documents. The high number of features is reduced by feature subset selection and additionally by using a ‘stop-list’, pruning low-frequency features and using the short description of each document given in the hierarchy instead of the document itself. Documents are …
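
The reduction steps named in the description (a stop-list and pruning of low-frequency features, applied to short document descriptions) might be sketched as follows. This is a minimal illustration and not the paper's implementation: the function name `select_features`, the tiny `STOP_LIST`, the `min_doc_freq` threshold, and the use of document frequency as the ranking score are all assumptions made for the example. The scoring measures compared in the paper are not reproduced here.

```python
from collections import Counter

# Hypothetical stop-list for illustration; a real one would be much longer.
STOP_LIST = {"the", "a", "of", "and", "for", "to", "in"}


def select_features(descriptions, min_doc_freq=2, top_k=None, stop_list=STOP_LIST):
    """Return a ranked vocabulary after stop-list removal and low-frequency pruning."""
    doc_freq = Counter()
    for text in descriptions:
        # Count each word at most once per description (document frequency).
        words = set(text.lower().split())
        doc_freq.update(words - stop_list)

    # Prune features that occur in fewer than min_doc_freq descriptions.
    kept = {w: n for w, n in doc_freq.items() if n >= min_doc_freq}

    # Rank by document frequency (a stand-in score), ties broken alphabetically.
    ranked = sorted(kept, key=lambda w: (-kept[w], w))
    return ranked if top_k is None else ranked[:top_k]


# Made-up short descriptions standing in for hierarchy entries.
descriptions = [
    "guide to machine learning resources",
    "machine learning for text classification",
    "text classification of the web",
]
features = select_features(descriptions)
```

Any other per-feature scoring measure could replace the document-frequency ranking by changing the sort key; the stop-list and pruning steps stay the same.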