Analysis of Inter-rater Agreement among Human Observers Who Judge Image Similarity

  • Krzysztof Michalak
  • Bartłomiej Dzieńkowski
  • Elżbieta Hudyma
  • Michał Stanek
Conference paper
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 95)


This paper discusses inter-rater agreement among human observers who judge how similar pairs of images are. Judgments of this kind differ significantly within a group of people. We observed that for some pairs of images, every similarity rating is assigned by different people with approximately the same probability. To investigate this phenomenon more thoroughly, we performed experiments in which inter-rater agreement coefficients were used to measure the level of agreement for each pair of images and for each pair of human judges. The results suggest that the level of agreement varies considerably among pairs of images as well as among pairs of people. We suggest that this effect should be taken into account in the design of computer systems that use image similarity as a criterion.
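As an illustration of the kind of coefficient the abstract refers to, the following is a minimal sketch of Cohen's weighted kappa for one pair of judges who rate the same image pairs on an ordinal scale. The abstract does not state which coefficient or weighting scheme the authors used, so the linear disagreement weights, the 1 to 5 rating scale, and the function name `weighted_kappa` are assumptions made for this example only.

```python
import math


def weighted_kappa(ratings_a, ratings_b, categories):
    """Cohen's weighted kappa for two raters, with linear disagreement weights.

    ratings_a, ratings_b: ratings given by two judges to the same items
    (here: image pairs), in the same order.
    categories: ordered list of all possible rating values (assumed scale).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty item list")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two rating categories are required")

    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions: observed[i][j] is the share of items that
    # rater A put in category i and rater B put in category j.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions of each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Linear disagreement weight: 0 on the diagonal, 1 for the two extremes.
    def weight(i, j):
        return abs(i - j) / (k - 1)

    d_observed = sum(weight(i, j) * observed[i][j]
                     for i in range(k) for j in range(k))
    # Disagreement expected if the two raters were statistically independent.
    d_expected = sum(weight(i, j) * row[i] * col[j]
                     for i in range(k) for j in range(k))

    if d_expected == 0.0:
        # Both raters used one and the same category; kappa is undefined.
        return math.nan
    return 1.0 - d_observed / d_expected


# Hypothetical ratings of five image pairs on an assumed 1..5 scale.
scale = [1, 2, 3, 4, 5]
judge_a = [1, 2, 3, 4, 5]
judge_b = [5, 4, 3, 2, 1]
print(weighted_kappa(judge_a, judge_a, scale))
print(weighted_kappa(judge_a, judge_b, scale))
```

A value near 1 indicates agreement well above chance, a value near 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Computing the coefficient for every pair of judges gives a per-judge-pair view of agreement of the kind described above.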


Image Pair, Image Similarity, Weighted Kappa, Rater Agreement, Pure Chance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Krzysztof Michalak (1)
  • Bartłomiej Dzieńkowski (1)
  • Elżbieta Hudyma (1)
  • Michał Stanek (1)
  1. Institute of Informatics, Wrocław University of Technology, Wrocław, Poland
