
Gaze- and Speech-Enhanced Content-Based Image Retrieval in Image Tagging

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 6792)


We describe a setup and experiments in which users check and correct image tags assigned by an automatic tagging system. We study how much applying a content-based image retrieval (CBIR) method speeds up the process of finding and correcting erroneously tagged images. We also analyze the use of implicit relevance feedback from the user’s gaze-tracking patterns as a way to boost CBIR performance. Finally, we use automatic speech recognition to supply the correct tags for images that were wrongly tagged. The experiments show a large variance in tagging-task performance, which we believe is caused primarily by the users’ subjective interpretation of image contents and by their varying familiarity with the gaze-tracking and speech-recognition setups. The results suggest that gaze- and/or speech-enhanced CBIR has potential in image tagging, at least for some users.


  • Content-based image retrieval
  • automatic image tagging
  • gaze tracking
  • speech recognition

  • DOI: 10.1007/978-3-642-21738-8_48
  • Chapter length: 8 pages



Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, H., Ruokolainen, T., Laaksonen, J., Hochleitner, C., Traunmüller, R. (2011). Gaze- and Speech-Enhanced Content-Based Image Retrieval in Image Tagging. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2011. ICANN 2011. Lecture Notes in Computer Science, vol 6792. Springer, Berlin, Heidelberg.


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21737-1

  • Online ISBN: 978-3-642-21738-8

  • eBook Packages: Computer Science, Computer Science (R0)