Authors
Ian M Soboroff,
Peter Bailey,
Nick Craswell,
Emine Yilmaz,
Publication date
2008
Publisher
Ian M. Soboroff, Peter Bailey, Nick Craswell, Alan Smeaton, Emine Yilmaz, Paul Thomas
Description
We investigate to what extent people making relevance judgments for a reusable IR test collection are exchangeable. We consider three classes of judge: gold standard judges, who are topic originators and are experts in a particular information seeking task; silver standard judges, who are task experts but did not create topics; and bronze standard judges, who are those who did not define topics and are not experts in the task. Analysis shows low levels of agreement in relevance judgments between these three groups. We report on experiments to determine if this is sufficient to invalidate the use of a test collection for measuring system performance when relevance assessments have been created by silver standard or bronze standard judges. We find that both system scores and system rankings are subject to consistent but small differences across the three assessment sets. It appears that test collections are …