Give Ear to My Face: Modelling Multimodal Attention to Social Interactions

  • Giuseppe Boccignone
  • Vittorio Cuculo
  • Alessandro D’Amelio
  • Giuliano Grossi
  • Raffaella Lanzarotti
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11130)


We address the deployment of perceptual attention to social interactions as displayed in conversational clips, relying on multimodal information (audio and video). A probabilistic modelling framework is proposed that goes beyond the classic saliency paradigm while integrating multiple information cues. Attentional allocation is determined not just by stimulus-driven selection but, importantly, by social value, which modulates the selection history of relevant multimodal items. Thus, the construction of attentional priority is the result of a sampling procedure conditioned on the potential value dynamics of socially relevant objects emerging moment to moment within the scene. Preliminary experiments on a publicly available dataset are presented.


Audio-visual attention, Social interaction, Multimodal perception



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. PHuSe Lab, Department of Computer Science, Università degli Studi di Milano, Milano, Italy
