RT-GENE: Real-Time Eye Gaze Estimation in Natural Environments

  • Tobias Fischer
  • Hyung Jin Chang
  • Yiannis Demiris
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)

Abstract

In this work, we consider the problem of robust gaze estimation in natural environments. Large camera-to-subject distances and high variations in head pose and eye gaze angles are common in such environments. This leads to two main shortfalls in state-of-the-art methods for gaze estimation: hindered ground truth gaze annotation and diminished gaze estimation accuracy as image resolution decreases with distance. We first record a novel dataset of varied gaze and head pose images in a natural environment, addressing the issue of ground truth annotation by measuring head pose using a motion capture system and eye gaze using mobile eyetracking glasses. We apply semantic image inpainting to the area covered by the glasses to bridge the gap between training and testing images by removing the obtrusiveness of the glasses. We also present a new real-time algorithm involving appearance-based deep convolutional neural networks with increased capacity to cope with the diverse images in the new dataset. Experiments with this network architecture are conducted on a number of diverse eye-gaze datasets including our own, and in cross dataset evaluations. We demonstrate state-of-the-art performance in terms of estimation accuracy in all experiments, and the architecture performs well even on lower resolution images.
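For readers unfamiliar with appearance-based gaze estimation, the sketch below illustrates the general idea behind such a network: each eye patch is encoded by a convolutional branch, the resulting features are concatenated with the head pose, and fully connected layers regress the gaze yaw and pitch. This is a minimal illustrative sketch in PyTorch only; the layer sizes, input resolutions, and class name are assumptions and it is not the authors' RT-GENE architecture.

    import torch
    import torch.nn as nn

    class GazeEstimationSketch(nn.Module):
        """Appearance-based gaze sketch: two eye patches + head pose -> (yaw, pitch).
        Illustrative only; all layer sizes are assumptions, not the RT-GENE network."""
        def __init__(self):
            super().__init__()
            # Convolutional encoder applied to each eye patch (here 3x36x60 crops).
            self.eye_encoder = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
                nn.Flatten(),
                nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            )
            # Fuse left/right eye features with the 2D head pose (yaw, pitch).
            self.regressor = nn.Sequential(
                nn.Linear(128 + 128 + 2, 128), nn.ReLU(),
                nn.Linear(128, 2),  # gaze yaw and pitch
            )

        def forward(self, left_eye, right_eye, head_pose):
            feats = torch.cat([self.eye_encoder(left_eye),
                               self.eye_encoder(right_eye),
                               head_pose], dim=1)
            return self.regressor(feats)

    # Example: a batch of 8 eye-patch pairs and head poses.
    model = GazeEstimationSketch()
    gaze = model(torch.randn(8, 3, 36, 60), torch.randn(8, 3, 36, 60), torch.randn(8, 2))
    print(gaze.shape)  # torch.Size([8, 2])

Training such a regressor would typically minimize an L2 or angular loss between the predicted gaze angles and ground-truth angles, which in this work are obtained from the mobile eyetracking glasses after the glasses themselves have been removed from the training images by semantic inpainting.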

Keywords

Gaze estimation · Gaze dataset · Convolutional neural network · Semantic inpainting · Eyetracking glasses

Notes

Acknowledgment

This work was supported in part by the Samsung Global Research Outreach program, and in part by the EU Horizon 2020 Project PAL (643783-RIA). We would like to thank Caterina Buizza, Antoine Cully, Joshua Elsdon and Mark Zolotas for their help with this work, and all subjects who volunteered for the dataset collection.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Personal Robotics Laboratory, Department of Electrical and Electronic Engineering, Imperial College London, London, UK
