Authors
Alexei Vinokourov, John Shawe-Taylor, Nello Cristianini
Publication date
2002
Description
The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can be used either for cross-linguistic retrieval or, more generally, as part of a monolingual categorisation or clustering system. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated, hence forming a semantic representation of the data. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way, we report positive results in retrieval of documents, in both cross-language and single-language settings. Our results consistently and significantly outperform those obtained by LSI on the same data.
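
The pipeline described above (bag-of-words inner-product kernels on each language, followed by kernel CCA) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names (`bow_matrix`, `center_kernel`, `kcca`), the regularisation parameter `reg`, and the particular regularised eigenproblem are all assumptions chosen for illustration, using one common formulation of regularised kernel CCA.

```python
import numpy as np

def bow_matrix(docs):
    """Term-count matrix (documents x vocabulary) from whitespace tokens."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for w in d.lower().split():
            X[row, index[w]] += 1.0
    return X

def center_kernel(K):
    """Center a kernel matrix in feature space: H K H with H = I - 11^T/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, reg=1.0, n_components=2):
    """Regularised kernel CCA (assumed formulation, not taken from the paper).

    Solves (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx alpha = rho^2 alpha
    and recovers beta = (Ky + reg*I)^-1 Kx alpha / rho.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    M = np.linalg.solve(Kx + reg * I, Ky) @ np.linalg.solve(Ky + reg * I, Kx)
    vals, vecs = np.linalg.eig(M)
    # Eigenvalues are real and non-negative in exact arithmetic;
    # drop tiny imaginary parts introduced by floating-point error.
    order = np.argsort(-vals.real)[:n_components]
    rho = np.sqrt(np.clip(vals.real[order], 0.0, None))
    alpha = vecs[:, order].real
    beta = np.linalg.solve(Ky + reg * I, Kx @ alpha) / np.maximum(rho, 1e-12)
    return alpha, beta, rho

# Hypothetical toy paired corpus (English documents and French translations).
english = ["the cat sleeps", "the dog runs",
           "the cat eats fish", "the dog eats meat"]
french = ["le chat dort", "le chien court",
          "le chat mange poisson", "le chien mange viande"]

X_en, X_fr = bow_matrix(english), bow_matrix(french)
Kx = center_kernel(X_en @ X_en.T)   # bag-of-words inner-product kernel
Ky = center_kernel(X_fr @ X_fr.T)

alpha, beta, rho = kcca(Kx, Ky, reg=1.0, n_components=2)

# Semantic coordinates of the training documents in each language.
sem_en = Kx @ alpha
sem_fr = Ky @ beta
```

In this sketch, `sem_en` and `sem_fr` place each document and its translation in a shared low-dimensional space, so cross-language retrieval could be approximated by nearest-neighbour search between the two sets of coordinates. The regulariser `reg` matters: with few documents and a large vocabulary, unregularised kernel CCA tends to find spurious perfect correlations.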