Authors
Craig Saunders,
Hauke Tschach,
John Shawe-Taylor
Publication date
2002
Description
String kernels, which compare documents as strings of letters, have recently been shown to achieve good results on text classification problems. In this paper we apply the string kernel to syllables rather than letters. Using syllables shortens the representation of documents and therefore reduces computation time. Moreover, syllables provide a more natural representation of text than either the coarse bag-of-words representation or the overly fine one obtained by considering individual letters only. We give experimental results showing that syllables can be used effectively in text categorisation problems. We also propose two extensions to the string kernel. The first is a new lambda-weighting scheme in which different symbols can be given different decay weightings. This may be useful in text and other applications where the insertion of certain symbols is known to be less significant. The second is 'soft matching', where symbols can match (possibly weighted by relevance) even if they are not identical. Again, this provides a way of incorporating prior knowledge: certain symbols can be regarded as a partial or exact match and contribute to the overall similarity measure for two data items.
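The two extensions can be illustrated with a small brute-force sketch. It is not the paper's formulation or its efficient recursion, only one possible reading of the abstract: a gap-weighted subsequence kernel of order n in which every symbol spanned by a subsequence occurrence contributes its own decay factor, and in which non-identical symbols may match with a weight below 1. The names `soft_string_kernel`, `span_weight`, `decay`, `default_decay` and `similarity`, and all numeric values, are assumptions made for illustration. Inputs are sequences of tokens, so a document already split into syllables can be passed as a list of syllable strings; the syllabification step itself is assumed to have been done elsewhere.

```python
from itertools import combinations
from math import prod


def span_weight(seq, idx, decay, default_decay):
    # Product of per-symbol decay factors over the whole span covered by idx,
    # so skipped (gap) symbols are penalised by their own factor as well.
    # With one shared factor this reduces to the usual lambda ** span_length.
    return prod(decay.get(seq[k], default_decay)
                for k in range(idx[0], idx[-1] + 1))


def soft_string_kernel(s, t, n, decay=None, default_decay=0.5, similarity=None):
    """Brute-force subsequence kernel of order n (illustrative only).

    decay:      hypothetical map symbol -> decay factor in (0, 1]
    similarity: hypothetical map (symbol, symbol) -> partial-match weight
    Cost grows combinatorially, so use it on short sequences only.
    """
    decay = decay or {}
    similarity = similarity or {}

    def match(a, b):
        # Identical symbols match exactly; other pairs match softly if listed.
        if a == b:
            return 1.0
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    total = 0.0
    for i in combinations(range(len(s)), n):
        w_s = span_weight(s, i, decay, default_decay)
        for j in combinations(range(len(t)), n):
            m = prod(match(s[a], t[b]) for a, b in zip(i, j))
            if m:
                total += w_s * span_weight(t, j, decay, default_decay) * m
    return total


# Hard matching, single decay factor 0.5: only "ca" is shared by both strings.
print(soft_string_kernel("cat", "car", 2))  # 0.0625

# Soft matching: treat "t" and "r" as a partial match (weight is an assumption).
print(soft_string_kernel("cat", "car", 2, similarity={("t", "r"): 0.5}))

# Per-symbol decay: skipping "-" costs nothing when its factor is set to 1.0.
print(soft_string_kernel(["a", "-", "b"], ["a", "b"], 2, decay={"-": 1.0}))
```

Setting every decay factor to the same value and leaving `similarity` empty gives the standard gap-weighted string kernel on this toy scale. For the soft-matching variant to remain a valid kernel, the symbol similarity itself has to be positive semi-definite, which this sketch does not check.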