
Visual Search Target Inference Using Bag of Deep Visual Words

  • Sven Stauden
  • Michael Barz
  • Daniel Sonntag
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11117)

Abstract

Visual search target inference subsumes methods that predict, from eye tracking data, the object a person intends to find in a visual scene based on their fixation behavior. Knowing the search target can improve intelligent user interaction. In this work, we implement a new feature encoding, the Bag of Deep Visual Words, for search target inference using a pre-trained convolutional neural network (CNN). Our work builds on a recent approach from the literature that uses a Bag of Visual Words encoding, which is common in computer vision applications. We evaluate our method on a gold standard dataset.

The results show that our new feature encoding outperforms the baseline from the literature, in particular when fixations on the target are excluded.
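For illustration, the following is a minimal sketch of a Bag of Deep Visual Words encoding as described above: deep features are extracted from image patches around fixations with a pre-trained CNN, quantized against a learned visual vocabulary, and aggregated into a per-trial histogram. The concrete choices here (a torchvision VGG16 backbone, 64x64 pixel patches, a 60-word vocabulary, and an SVM classifier) are assumptions for the sketch, not the paper's exact configuration.

```python
# Sketch of a Bag of Deep Visual Words encoding for fixation data.
# Assumed components (not taken from the paper): torchvision VGG16,
# 64x64 patches, a 60-word k-means vocabulary, and a linear SVM.
import numpy as np
import torch
from torchvision import models, transforms
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Pre-trained CNN used as a fixed feature extractor (last layer removed,
# so the network outputs 4096-d fc7 activations).
cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
cnn.classifier = cnn.classifier[:-1]
cnn.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def fixation_features(image, fixations, patch=64):
    """Extract one deep feature vector per fixated image patch."""
    feats = []
    for x, y in fixations:
        x0, y0 = max(0, int(x) - patch // 2), max(0, int(y) - patch // 2)
        crop = image[y0:y0 + patch, x0:x0 + patch]
        with torch.no_grad():
            feats.append(cnn(preprocess(crop).unsqueeze(0)).squeeze(0).numpy())
    return np.stack(feats)

def encode_bag(features, kmeans):
    """L1-normalized histogram of deep visual words for one search trial."""
    words = kmeans.predict(features)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Training outline: learn the vocabulary on all extracted features, then fit
# a classifier on per-trial histograms labeled with the search-target category.
# kmeans = KMeans(n_clusters=60).fit(np.vstack(all_trial_features))
# clf = SVC(kernel="linear").fit(
#     [encode_bag(f, kmeans) for f in all_trial_features], labels)
```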

Keywords

Search target inference · Eye tracking · Visual attention · Deep learning · Intelligent user interfaces

Acknowledgement

This work was funded by the Federal Ministry of Education and Research (BMBF) under grant number 16SV7768 in the Interakt project.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany
