
A framework for generalizable neural networks for robust estimation of eyelids and pupils

Original Manuscript · Behavior Research Methods

Abstract

Deep neural networks (DNNs) have enabled recent advances in the accuracy and robustness of video-oculography. However, to make robust predictions, most DNN models require extensive and diverse training data, which is costly to collect and label. In this work, we develop pylids, a pupil- and eyelid-estimation DNN model based on DeepLabCut, and seek to train it robustly with less labeled data. We show that the performance of pylids-based pupil estimation is related to the distance of test data from the distribution of training data. Based on this principle, we explore methods for efficient data selection for training our DNN. We show that guided sampling of new training data points approaches state-of-the-art pupil and eyelid estimation with fewer labeled examples. We also demonstrate the benefit of using an efficient sampling method to select data augmentations for training DNNs. These sampling methods aim to minimize the time and effort required to label and train DNNs while promoting model generalization to new, diverse datasets.


References

  • Arpit, D., Jastrzȩbski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., ... Bengio, Y. (2017). A closer look at memorization in deep networks. In International conference on machine learning (pp. 233–242).

  • Binaee, K., Sinnott, C., Capurro, K. J., MacNeilage, P., & Lescroart, M. D. (2021). Pupil tracking under direct sunlight. ACM symposium on eye tracking research and applications (pp. 1–4).

  • Biswas, A., Binaee, K., Capurro, K. J., & Lescroart, M. D. (2021). Characterizing the performance of deep neural networks for eye-tracking. ACM symposium on eye tracking research and applications (pp. 1–4).

  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... Amodei, D. (2020). Language models are few-shot learners. CoRR. arXiv:2005.14165.

  • Chaudhary, A. K., Gyawali, P. K., Wang, L., & Pelz, J. B. (2021). Semi-supervised learning for eye image segmentation. ACM symposium on eye tracking research and applications (pp. 1–7).

  • Chaudhary, A. K., Kothari, R., Acharya, M., Dangi, S., Nair, N., Bailey, R., ... Pelz, J.B. (2019). RITnet: Real-time semantic segmentation of the eye for gaze tracking. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW) (pp. 3698–3702). https://doi.org/10.1109/ICCVW.2019.00568

  • Chaudhary, A. K., Nair, N., Bailey, R. J., Pelz, J. B., Talathi, S. S., & Diaz, G. J. (2022). Temporal RIT-eyes: From real infrared eye-images to synthetic sequences of gaze behavior. IEEE Transactions on Visualization and Computer Graphics, 28(11), 3948–3958.


  • Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4, 129–145.


  • Coleman, C., Yeh, C., Mussmann, S., Mirzasoleiman, B., Bailis, P., Liang, P., ... Zaharia, M. (2019). Selection via proxy: Efficient data selection for deep learning. arXiv:1906.11829.

  • Coleman, C., Yeh, C., Mussmann, S., Mirzasoleiman, B., Bailis, P., Liang, P., ... Zaharia, M. (2020). Selection via proxy: Efficient data selection for deep learning. International Conference on Learning Representations.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. CVPR09. http://www.image-net.org/papers/imagenet_cvpr09.bib.

  • Eivazi, S., Santini, T., Keshavarzi, A., Kübler, T., & Mazzei, A. (2019). Improving real-time CNN-based pupil detection through domain-specific data augmentation. Proceedings of the 11th ACM symposium on eye tracking research & applications (40, pp. 6). Association for Computing Machinery. https://doi.org/10.1145/3314111.3319914

  • Fischer, T., Chang, H. J., & Demiris, Y. (2018). Rt-gene: Real-time eye gaze estimation in natural environments. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 334–352).

  • Fuhl, W., Kasneci, G., & Kasneci, E. (2021). Teyed: Over 20 million real-world eye images with pupil, eyelid, and iris 2d and 3d segmentations, 2d and 3d landmarks, 3d eyeball, gaze vector, and eye movement types. In 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (pp. 367–375).

  • Fuhl, W., Santini, T., & Kasneci, E. (2017). Fast and robust eyelid outline and aperture detection in real-world scenarios. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 1089–1097).

  • Fuhl, W., Santini, T., Kasneci, G., & Kasneci, E. (2017). PupilNet V2.0: Convolutional neural networks for robust pupil detection. CoRR.

  • Gal, Y., Islam, R., & Ghahramani, Z. (2017). Deep Bayesian active learning with image data. In International conference on machine learning (pp. 1183–1192). PMLR

  • Gander, W., Golub, G. H., & Strebel, R. (1994). Least-squares fitting of circles and ellipses. BIT Numerical Mathematics,34(4), 558– 578.

  • Garbin, S. J., Shen, Y., Schuetz, I., Cavin, R., Hughes, G., & Talathi, S. S. (2019). Openeds: Open eye dataset. arXiv:1905.03702.

  • Guo, C., Zhao, B., & Bai, Y. (2022). DeepCore: A comprehensive library for coreset selection in deep learning. arXiv:2204.08499.

  • He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (770–778). https://doi.org/10.1109/CVPR.2016.90.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. CoRR. arXiv:1512.03385

  • Hennessey, C., Noureddin, B., & Lawrence, P. (2006). A single camera eye-gaze tracking system with free head motion. Proceedings of the 2006 symposium on Eye tracking research & applications (pp. 87–94).

  • Jung, A. B., Wada, K., Crall, J., Tanaka, S., Graving, J., Reinders, C., ... Laporte, M. (2020). imgaug. https://github.com/aleju/imgaug. Accessed 01 Feb 2020.

  • Kansal, P., & Devanathan, S. (2019). Eyenet: Attention based convolutional encoder-decoder network for eye region segmentation. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW) (3688–3693).

  • Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (8110–8119).

  • Kassner, M., Patera, W., & Bulling, A. (2014). Pupil: An open source platform for pervasive eye tracking and mobile gaze-based interaction. arXiv:1405.0006.

  • Katsini, C., Abdrabou, Y., Raptis, G. E., Khamis, M., & Alt, F. (2020). The role of eye gaze in security and privacy applications: Survey and future hci research directions. In Proceedings of the 2020 CHI conference on human factors in computing systems (1–21).

  • Kingma, D. P., & Ba, J. (2017). Adam: A method for stochastic optimization. arXiv:1412.6980.

  • Kothari, R. S., Bailey, R. J., Kanan, C., Pelz, J. B., & Diaz, G. J. (2022). EllSeg-Gen, towards domain generalization for head-mounted eyetracking. Proceedings of the ACM on human-computer interaction, 6(ETRA), 1–17.

  • Kothari, R. S., Chaudhary, A. K., Bailey, R. J., Pelz, J. B., & Diaz, G. J. (2020). EllSeg: An ellipse segmentation framework for robust gaze tracking. arXiv:2007.09600

  • Kothari, R., Yang, Z., Kanan, C., Bailey, R., Pelz, J. B., & Diaz, G. J. (2020). Gaze-in-Wild: A dataset for studying eye and head coordination in everyday activities. Scientific Reports, 10(1), 1–18.


  • Kouw, W. M., & Loog, M. (2018). An introduction to domain adaptation and transfer learning. arXiv:1812.11806

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.


  • Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling (no. 11). Sage.

  • Labs, P. (2013). Pupil Labs github repository. GitHub repository. https://github.com/pupil-labs/pupil

  • Lauer, J., Zhou, M., Ye, S., Menegas, W., Schneider, S., Nath, T., ... Mathis, A. (2022). Multi-animal pose estimation, identification and tracking with DeepLabCut. Nature Methods, 19(4), 496–504.


  • Malinen, M. I., & Fränti, P. (2014). Balanced k-means for clustering. Joint IAPR international workshops on statistical techniques in pattern recognition (spr) and Structural and Syntactic Pattern Recognition (SSPR) (pp. 32–41).

  • Mathis, A., Biasi, T., Schneider, S., Yuksekgonul, M., Rogers, B., Bethge, M., & Mathis, M. W. (2021). Pretraining boosts out-of-domain robustness for pose estimation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 1859–1868).

  • Mathis, A., Mamidanna, P., Cury, K. M., Abe, T., Murthy, V. N., Mathis, M. W., & Bethge, M. (2018a). DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience. https://www.nature.com/articles/s41593-018-0209-y

  • Mathis, A., Mamidanna, P., Cury, K. M., Abe, T., Murthy, V. N., Mathis, M. W., & Bethge, M. (2018). DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 21(9), 1281–1289.


  • Meyer, A. F., O’Keefe, J., & Poort, J. (2020). Two distinct types of eye-head coupling in freely moving mice. Current Biology, 30(11), 2116–2130.


  • Nair, N., Kothari, R., Chaudhary, A. K., Yang, Z., Diaz, G. J., Pelz, J. B., & Bailey, R. J. (2020). RIT-Eyes: Rendering of near-eye images for eye-tracking applications. ACM symposium on applied perception 2020 (pp. 1–9).

  • Nath, T., Mathis, A., Chen, A. C., Patel, A., Bethge, M., & Mathis, M. W. (2019). Using DeepLabCut for 3D markerless pose estimation across species and behaviors. Nature Protocols. https://doi.org/10.1038/s41596-019-0176-0


  • Neyshabur, B., Tomioka, R., & Srebro, N. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv:1412.6614.

  • Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., & Sohl-Dickstein, J. (2018). Sensitivity and generalization in neural networks: An empirical study. arXiv:1802.08760.

  • Park, S., Mello, S. D., Molchanov, P., Iqbal, U., Hilliges, O., & Kautz, J. (2019). Few-shot adaptive gaze estimation. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9368–9377).

  • Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 36(2), 3336–3341.


  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125

  • Rebecq, H., Ranftl, R., Koltun, V., & Scaramuzza, D. (2019). High speed and high dynamic range video with an event camera. IEEE Transactions on Pattern Analysis & Machine Intelligence, 01, 1–1. https://doi.org/10.1109/TPAMI.2019.2963386


  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-resolution image synthesis with latent diffusion models. arXiv:2112.10752.

  • Rot, P., Emeršič, Ž, Struc, V., & Peer, P. (2018). Deep multi-class eye segmentation for ocular biometrics. In 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI) (pp. 1–8).

  • Sener, O., & Savarese, S. (2017). Active learning for convolutional neural networks: A core-set approach. arXiv:1708.00489.

  • Settles, B. (2009). Active learning literature survey.

  • Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1–48.


  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.

  • Swirski, L., & Dodgson, N. (2013). A fully-automatic, temporal approach to single camera, glint-free 3D eye model fitting. Proc. PETMEI, 1–11.

  • Tonsen, M., Zhang, X., Sugano, Y., & Bulling, A. (2016). Labelled pupils in the wild: A dataset for studying pupil detection in unconstrained environments. Proceedings of the ninth biennial ACM symposium on eye tracking research & applications (pp. 139–142).

  • Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. CVPR 2011 (pp. 1521–1528). IEEE

  • van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., & the scikit-image contributors. (2014). Scikit-image: Image processing in Python. PeerJ, 2, e453. https://doi.org/10.7717/peerj.453

  • Vera-Olmos, F. J., Pardo, E., Melero, H., & Malpica, N. (2019). DeepEye: Deep convolutional network for pupil detection in real environments. Integrated Computer-Aided Engineering, 26(1), 85–95.


  • Wang, T., Zhu, J.-Y., Torralba, A., & Efros, A. A. (2018). Dataset distillation. arXiv:1811.10959.

  • Yiu, Y.-H., Aboulatta, M., Raiser, T., Ophey, L., Flanagin, V. L., Zu Eulenburg, P., & Ahmadi, S.-A. (2019). Deepvog: Open-source pupil segmentation and gaze estimation in neuroscience using deep learning. Journal of Neuroscience Methods, 324, 108307.


  • Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., & Savarese, S. (2018). Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Zdarsky, N., Treue, S., & Esghaei, M. (2021). A deep learning-based approach to video-based eye tracking for human psychophysics. Frontiers in Human Neuroscience, 15.

  • Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.


  • Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, 2223–2232.


Acknowledgements

We thank Kaylie Capurro for helping label the eyelid and pupil data. We would also like to thank Kamran Binaee, Matthew Shinkle, and Joseph (Yu) Zhao for helpful discussions. This work was supported by NSF EPSCoR # 1920896 to Michelle R. Greene, Mark D. Lescroart, Paul MacNeilage, and Benjamin Balas.

Author information


Correspondence to Arnab Biswas.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Data augmentation

To simulate outdoor eye images, we used image processing techniques to artificially perturb the eye images in our training dataset: increased exposure, reflections, defocus blur, eye rotation, JPEG compression, motion blur, Gaussian noise, and mock pupils and glints. For each perturbation, we generated four additional images with increasing perturbation intensities. We detail the process of generating perturbed eye images below; a brief code sketch of the full pipeline follows the list.

  1. Exposure: To simulate the effect of an increase in exposure and decrease in the contrast between the pupil/eyelashes and other regions in the eye video in bright sunlight, we added four steps of luminance increments (each 35 units) to all pixels in each frame. After each increment, pixel values were clipped to a maximum of 255.

  2. Rotation: To simulate the effect of different camera angles and facial anatomy across participants, we rotated the eye videos in four five-degree increments, followed by scaling and cropping to ensure uniform frame size. This rotation resulted in the eye going partially out of the frame for the 15- and 20-degree rotation conditions.

  3. Reflection: Corneal reflections and shadows on the eye present a challenge when recording eye videos outdoors. We used the method presented by Eivazi et al. (2019) to add reflections and shadows to the eye images. We modified the blending factor for images superimposed on the eye video in four steps. For every frame, we randomly selected the reflected image from the Driving Events Camera Dataset (Rebecq et al., 2019), which contains videos from dashboard cameras of cars driving through highways and cityscapes.

  4. JPEG artifacts: Compressed video formats are desirable when storing eye videos as they take up less space. Thus, we tested the robustness of our DNN to compression artifacts by altering the video frames with JPEG compression. We varied the JPEG quality parameter (which varies from 100 to 0, denoting best to worst quality) from 32 to 8 in four steps of 8.

  5. Defocus blur: To mimic the defocus blur from a camera, we used the imgaug image augmentation library (Jung et al., 2020) and iteratively increased the severity parameter from 1 to 4 to create an incremental loss of focus in the eye videos.

  6. Motion blur: To simulate the motion blur due to saccadic eye movements or blinks, we used the motion blur augmentation implemented in imgaug (Jung et al., 2020). Motion blur was varied between intensities 20 and 80. The angle of the motion blur was randomly sampled from the range [-45, 45] degrees.

  7. Gaussian noise: To simulate the effect of Gaussian noise, we used the additive Gaussian noise augmentation implemented in imgaug (Jung et al., 2020). Noise was sampled once per pixel from a normal distribution, with the scale varied from N(0, 0.1·255) to N(0, 0.3·255).

  8. Mock pupils: Pupil-like structures often appear in eye images due to shadows and reflections from the environment. To avoid spurious pupil detections caused by such structures, we augmented our data with blacked-out mock pupils, inspired by Eivazi et al. (2019). We created blacked-out ellipses to simulate mock pupils; the center of each ellipse was sampled from a 2D Gaussian distribution centered on the center of the frame and spanning half the height and width of the eye image.

  9. Mock glints: Using the same procedure as mock pupils, we created white ellipses to simulate the appearance of glints due to reflections and environmental conditions.

Appendix B: Multi-dimensional scaling

Fig. 9

Eyelid polynomial fits and ellipse fits for pupils demonstrating robustness of our baseline pylids model on samples from the GiW dataset. A Each row shows data from a different participant in the test set. Pupil positions estimated by our pylids model are shown in red and compared to pupil estimates from the Pupil Labs package in blue. B and C compare pylids-estimated pupil positions (red) to Pupil Labs-based pupil estimates (blue) during a blink (B) and a saccade (C). Partial pupil occlusion during a blink (B) leads to loss of data when using Pupil Labs pupil detection but not with pylids

Appendix C: Comparison with alternate model

Fig. 10

Correlation of cosine distance between GiW training data and test samples from the Fuhl dataset with average error and average keypoint likelihood for eyelid estimation. With an increase in cosine distance, there is a monotonic increase in the average error

Appendix D: Distance from training distribution is correlated with model performance

First, using our baseline pylids model trained on 520 labels from the Gaze-in-Wild dataset, we evaluated whether model performance depends on the cosine distance from the training dataset in ResNet50 feature space. To test this, we computed the rank-order correlation between model error on the test distribution and the cosine distance from the training distribution in ResNet50 space (see Fig. 10). A sketch of this computation is given below.
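The following is a minimal sketch of one way to carry out this analysis, assuming a torchvision ResNet50 backbone as the feature extractor and Spearman's rank correlation; the exact feature layer, preprocessing, and definition of distance to the training distribution (here, the mean cosine distance to all training frames) are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np
import torch
from torchvision import models, transforms
from scipy.spatial.distance import cdist
from scipy.stats import spearmanr

# ImageNet-pretrained ResNet50 with the classification head removed,
# so the network returns 2048-d pooled features.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),                 # frames assumed to be HxWx3 uint8 arrays
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(frames):
    """Embed a list of eye frames into ResNet50 feature space -> (N, 2048)."""
    batch = torch.stack([preprocess(f) for f in frames])
    return resnet(batch).numpy()

def mean_cosine_distance(test_frames, train_frames):
    """Mean cosine distance of each test frame to the training frames."""
    test_feat, train_feat = embed(test_frames), embed(train_frames)
    return cdist(test_feat, train_feat, metric="cosine").mean(axis=1)

# Rank-order correlation between distance from the training distribution and
# per-frame model error (test_errors is assumed to come from evaluating pylids):
# dists = mean_cosine_distance(test_frames, train_frames)
# rho, p_value = spearmanr(dists, test_errors)
```

A positive, significant rho would indicate that frames farther from the training distribution tend to be estimated less accurately.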

Based on our finding that distance from the training distribution is correlated with model performance, we used our guided sampling algorithm to select uniformly distributed samples for fine-tuning our model. We show that models fine-tuned using guided sampling outperform models trained using other sampling methods (see Fig. 6). To test whether the relationship between distance from the training distribution and model performance also holds for the fine-tuned models, we correlated model error with distance from the training distribution (which now included the newly labeled frames). We found a significant correlation, so the relationship still holds (see Fig. 11). A generic sketch of distance-based sample selection is shown below.
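As a point of reference only, the sketch below shows a greedy farthest-point selection over ResNet50 features, one common way to choose labeling candidates that are spread out relative to the data already covered; it is not necessarily the guided sampling algorithm used in this work, and the variable names in the example are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def farthest_point_selection(candidate_feat, covered_feat, n_select):
    """Greedily pick candidates whose nearest covered point (in cosine distance)
    is as far away as possible, adding each pick to the covered set."""
    covered = covered_feat.copy()
    selected = []
    for _ in range(n_select):
        d = cdist(candidate_feat, covered, metric="cosine").min(axis=1)
        idx = int(np.argmax(d))
        selected.append(idx)
        covered = np.vstack([covered, candidate_feat[idx:idx + 1]])
    return selected

# Example (hypothetical names): choose 50 Fuhl frames to label, given ResNet50
# features of the Fuhl candidates and of the existing GiW training frames.
# to_label = farthest_point_selection(fuhl_feat, giw_feat, n_select=50)
```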

Fig. 11

Correlation between model error and cosine distance from training data for models fine-tuned using samples from the Fuhl dataset. The subplots are in the same order as in Fig. 6. The correlation is significant across sampling methods

To further investigate whether guided sampling did indeed reduce the distance from the training distribution, we plotted model error as a function of both the distance to the initial baseline training samples and the distance to the newly labeled samples. Compared to other sampling methods, guided sampling reduced the distance from the training distribution for all test samples (see Fig. 12). A plotting sketch of this comparison is shown below.
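A minimal sketch of how such a comparison could be plotted, assuming per-frame ResNet50 features for the test set, the initial training frames, and the newly labeled frames, along with per-frame model errors; the nearest-neighbour distance, axis labels, and styling are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

def plot_distance_shift(test_feat, baseline_feat, new_feat, test_errors):
    """Scatter each test frame's nearest-neighbour cosine distance to the initial
    training frames (x) against its distance to the newly labeled frames (y),
    colored by model error. Points below the identity line are closer to the
    newly labeled frames than to the baseline training data."""
    d_old = cdist(test_feat, baseline_feat, metric="cosine").min(axis=1)
    d_new = cdist(test_feat, new_feat, metric="cosine").min(axis=1)
    lim = max(d_old.max(), d_new.max())
    plt.scatter(d_old, d_new, c=test_errors, cmap="viridis", s=10)
    plt.plot([0, lim], [0, lim], "k--")          # identity line
    plt.xlabel("cosine distance to initial training frames")
    plt.ylabel("cosine distance to newly labeled frames")
    plt.colorbar(label="model error")
    plt.show()
```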

Fig. 12

Model error as a function of both the cosine distance from the initial training dataset and the cosine distance from the newly selected training frames from the Fuhl dataset. These subplots correspond to the same subplots visualized in Figs. 6 and 11. The dashed diagonal line represents the identity line. A greater number of samples below the diagonal indicates that test samples are closer to the newly labeled training frames than to the initial baseline training data. This is especially true for guided sampling

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Biswas, A., Lescroart, M.D. A framework for generalizable neural networks for robust estimation of eyelids and pupils. Behav Res (2023). https://doi.org/10.3758/s13428-023-02266-3
