
Analysis of Inter-rater Agreement among Human Observers Who Judge Image Similarity

  • Conference paper
Computer Recognition Systems 4

Part of the book series: Advances in Intelligent and Soft Computing ((AINSC,volume 95))



In this paper the problem of inter-rater agreement is discussed for the case of human observers who judge how similar pairs of images are. In such a task, significant differences in judgment appear within a group of people. We have observed that for some pairs of images, all values of the similarity rating are assigned by different people with approximately equal probability. To investigate this phenomenon more thoroughly, we performed experiments in which inter-rater agreement coefficients were used to measure the level of agreement for each pair of images and for each pair of human judges. The results obtained in the experiments suggest that the level of agreement varies considerably among pairs of images as well as among pairs of judges. We suggest that this effect should be taken into account in the design of computer systems that use image similarity as a criterion.
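To illustrate the kind of coefficient the abstract refers to, here is a minimal sketch of Cohen's kappa for two raters (Cohen, 1960); the similarity ratings below are invented for illustration only and do not come from the paper's experiments:

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who rated the same items.

    Kappa corrects the observed agreement for the agreement
    expected by chance: kappa = (p_o - p_e) / (1 - p_e).
    """
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # Observed agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, assuming the two raters are independent.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical similarity ratings (scale 1-5) from two observers
# over ten image pairs:
rater_1 = [5, 3, 4, 2, 1, 5, 3, 4, 2, 5]
rater_2 = [5, 3, 4, 1, 1, 4, 3, 4, 2, 5]
print(round(cohens_kappa(rater_1, rater_2), 3))  # → 0.75
```

Computing this coefficient once per pair of judges (over all image pairs they both rated) gives the per-judge-pair agreement levels the paper analyzes; a per-image-pair analysis would instead need a multi-rater statistic such as Fleiss' kappa.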







Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Michalak, K., Dzieńkowski, B., Hudyma, E., Stanek, M. (2011). Analysis of Inter-rater Agreement among Human Observers Who Judge Image Similarity. In: Burduk, R., Kurzyński, M., Woźniak, M., Żołnierek, A. (eds) Computer Recognition Systems 4. Advances in Intelligent and Soft Computing, vol 95. Springer, Berlin, Heidelberg.


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20319-0

  • Online ISBN: 978-3-642-20320-6

  • eBook Packages: Engineering (R0)
