Authors
Blaž Novak,
Marko Grobelnik,
Dunja Mladenić
Description
In many machine learning problem domains, large amounts of data are available, but the cost of labelling them correctly prohibits their use for model training. Especially relevant for us are the large quantities of raw information available on the internet, which pose an interesting challenge: how to exploit the information hidden within them without first investing much human work in manual tagging. Several methods exist for using a small initial set of labelled data together with a large supplementary pool of unlabelled data in order to learn a better hypothesis than is possible from the labelled data alone. In document classification, overall performance has been reported to improve on many data sets when such systems use unlabelled data or ask the user for the labels of selected examples (active learning). We present several approaches to using unlabelled data in document classification. This deliverable presents an overview of the state of the art in this field and provides working implementations of methods found to be useful on large textual domains. On average, fewer than half of the labelled samples usually required are needed to reach the same classification accuracy.
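The description does not say how the examples shown to the user are selected. One common choice in active learning is least-confidence uncertainty sampling, and the following is a minimal sketch under that assumption, not the deliverable's own implementation. The function name `least_confident` and the probability values are hypothetical; the probabilities stand in for the output of any classifier trained on the small labelled set.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k pool examples whose top class probability is lowest."""
    # Confidence of a prediction = probability of its most likely class.
    confidence = [max(p) for p in probabilities]
    # Sort pool indices from least to most confident; ties keep pool order.
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical class-probability estimates for four unlabelled documents
# (assumed values, for illustration only).
pool_probabilities = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.20, 0.80],
    [0.50, 0.50],
]

# Documents the user would be asked to label next.
to_label = least_confident(pool_probabilities, k=2)
print(to_label)
```

In a full loop, the newly labelled documents would be moved from the pool into the labelled set, the classifier retrained, and the selection repeated until the labelling budget is spent.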