ARBEE: Towards Automated Recognition of Bodily Expression of Emotion in the Wild

Abstract

Humans are arguably innately prepared to comprehend others’ emotional expressions from subtle body movements. If robots or computers can be empowered with this capability, a number of robotic applications become possible. Automatically recognizing human bodily expression in unconstrained situations, however, is daunting given our incomplete understanding of the relationship between emotional expressions and body movements. The current research, a multidisciplinary effort among computer and information sciences, psychology, and statistics, proposes a scalable and reliable crowdsourcing approach for collecting in-the-wild perceived emotion data so that computers can learn to recognize the body language of humans. To accomplish this task, a large and growing annotated dataset with 9,876 video clips of body movements and 13,239 human characters, named the Body Language Dataset (BoLD), has been created. Comprehensive statistical analysis of the dataset revealed many interesting insights. A system to model emotional expressions based on bodily movements, named Automated Recognition of Bodily Expression of Emotion (ARBEE), has also been developed and evaluated. Our analysis shows the effectiveness of Laban Movement Analysis (LMA) features in characterizing arousal, and our experiments using LMA features further demonstrate the computability of bodily expression. We report and compare the results of several other baseline methods developed for action recognition, based on two different modalities: body skeleton and raw image. The dataset and findings presented in this work will likely serve as a launchpad for future discoveries in body language understanding, enabling robots to interact and collaborate more effectively with humans.

Notes

  1. https://github.com/CMU-Perceptual-Computing-Lab/caffe_rtpose.

  2. https://github.com/abewley/sort.

  3. https://github.com/vadimkantorov/fastvideofeat.

  4. http://pytorch.org/.

References

  1. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.

  2. Aristidou, A., Charalambous, P., & Chrysanthou, Y. (2015). Emotion analysis and classification: Understanding the performers’ emotions using the LMA entities. Computer Graphics Forum, 34(6), 262–276.

  3. Aristidou, A., Zeng, Q., Stavrakis, E., Yin, K., Cohen-Or, D., Chrysanthou, Y., & Chen, B. (2017). Emotion control of unstructured dance movements. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, article 9.

  4. Aviezer, H., Trope, Y., & Todorov, A. (2012). Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science, 338(6111), 1225–1229.

  5. Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. Proceedings of the IEEE International Conference on Image Processing, pp. 3464–3468. https://doi.org/10.1109/ICIP.2016.7533003.

  6. Biel, J. I., & Gatica-Perez, D. (2013). The YouTube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs. IEEE Transactions on Multimedia, 15(1), 41–55.

  7. Caba Heilbron, F., Escorcia, V., Ghanem, B., & Carlos Niebles, J. (2015). ActivityNet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970.

  8. Cao, Z., Simon, T., Wei, S.E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299.

  9. Carmichael, L., Roberts, S., & Wessell, N. (1937). A study of the judgment of manual expression as presented in still and motion pictures. The Journal of Social Psychology, 8(1), 115–142.

  10. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 4724–4733.

  11. Noroozi, F., Kaminska, D., Corneanu, C., Sapinski, T., Escalera, S., & Anbarjafari, G. (2018). Survey on emotional body gesture recognition. IEEE Transactions on Affective Computing. https://doi.org/10.1109/TAFFC.2018.2874986.

  12. Dael, N., Mortillaro, M., & Scherer, K. R. (2012). Emotion expression in body action and posture. Emotion, 12(5), 1085.

  13. Dalal, N., Triggs, B., & Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. In: Proceedings of the European Conference on Computer Vision, Springer, pp. 428–441.

  14. Datta, R., Joshi, D., Li, J., & Wang, J.Z. (2006). Studying aesthetics in photographic images using a computational approach. In: European conference on computer vision, Springer, pp. 288–301.

  15. Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28, 20–28.

  16. De Gelder, B. (2006). Towards the neurobiology of emotional body language. Nature Reviews Neuroscience, 7(3), 242–249.

  17. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.

  18. Douglas-Cowie, E., Cowie, R., Sneddon, I., Cox, C., Lowry, L., McRorie, M., Martin, J.-C., Devillers, L., Abrilian, S., Batliner, A., et al. (2007). The HUMAINE database: Addressing the needs of the affective computing community. In: Proceedings of the International Conference on Affective Computing and Intelligent Interaction, pp. 488–500.

  19. Ekman, P. (1992). Are there basic emotions? Psychological Review, 99(3), 550–553.

  20. Ekman, P. (1993). Facial expression and emotion. American Psychologist, 48(4), 384.

  21. Ekman, P., & Friesen, W. V. (1977). Facial Action Coding System: A technique for the measurement of facial movement. Palo Alto: Consulting Psychologists Press, Stanford University.

  22. Ekman, P., & Friesen, W. V. (1986). A new pan-cultural facial expression of emotion. Motivation and Emotion, 10(2), 159–168.

  23. Eleftheriadis, S., Rudovic, O., & Pantic, M. (2015). Discriminative shared gaussian processes for multiview and view-invariant facial expression recognition. IEEE Transactions on Image Processing, 24(1), 189–204.

  24. Fabian Benitez-Quiroz, C., Srinivasan, R., & Martinez, A.M. (2016). EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5562–5570.

  25. Gu, C., Sun, C., Ross, D.A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., et al. (2018). AVA: A video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056.

  26. Gunes, H., & Piccardi, M. (2005). Affect recognition from face and body: early fusion vs. late fusion. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 4, 3437–3443.

  27. Gunes, H., & Piccardi, M. (2007). Bi-modal emotion recognition from expressive face and body gestures. Journal of Network and Computer Applications, 30(4), 1334–1345.

  28. Gwet, K.L. (2014). Handbook of Inter-rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics, LLC.

  29. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.

  30. Iqbal, U., Milan, A., & Gall, J. (2017). PoseTrack: Joint multi-person pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2011–2020.

  31. Kantorov, V., & Laptev, I. (2014). Efficient feature extraction, encoding and classification for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2593–2600.

  32. Karg, M., Samadani, A. A., Gorbet, R., Kühnlenz, K., Hoey, J., & Kulić, D. (2013). Body movements for affective expression: A survey of automatic recognition and generation. IEEE Transactions on Affective Computing, 4(4), 341–359.

  33. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., & Natsev, P., et al. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

  34. Kipf, T.N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

  35. Kleinsmith, A., & Bianchi-Berthouze, N. (2013). Affective body expression perception and recognition: A survey. IEEE Transactions on Affective Computing, 4(1), 15–33.

  36. Kleinsmith, A., De Silva, P. R., & Bianchi-Berthouze, N. (2006). Cross-cultural differences in recognizing affect from body posture. Interacting with Computers, 18(6), 1371–1389.

  37. Kleinsmith, A., Bianchi-Berthouze, N., & Steed, A. (2011). Automatic recognition of non-acted affective postures. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(4), 1027–1038.

  38. Kosti, R., Alvarez, J.M., Recasens, A., & Lapedriza, A. (2017). Emotion recognition in context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1667–1675.

  39. Krakovsky, M. (2018). Artificial (emotional) intelligence. Communications of the ACM, 61(4), 18–19.

  40. Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.

  41. Laban, R., & Ullmann, L. (1971). The Mastery of Movement. Bingley: ERIC.

  42. Lu, X., Suryanarayan, P., Adams Jr, R.B., Li, J., Newman, M.G., & Wang, J.Z. (2012). On shape and the computability of emotions. In: Proceedings of the ACM International Conference on Multimedia, ACM, pp. 229–238.

  43. Lu, X., Adams Jr, R.B., Li, J., Newman, M.G., Wang, J.Z. (2017). An investigation into three visual characteristics of complex scenes that evoke human emotion. In: Proceedings of the International Conference on Affective Computing and Intelligent Interaction, pp. 440–447.

  44. Luvizon, D.C., Picard, D., & Tabia, H. (2018). 2D/3D pose estimation and action recognition using multitask deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5137–5146.

  45. Martinez, J., Hossain, R., Romero, J., & Little, J.J. (2017). A simple yet effective baseline for 3d human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649.

  46. Meeren, H. K., van Heijnsbergen, C. C., & de Gelder, B. (2005). Rapid perceptual integration of facial expression and emotional body language. Proceedings of the National Academy of Sciences of the United States of America, 102(45), 16518–16523.

  47. Mehrabian, A. (1980). Basic dimensions for a general psychological theory: Implications for personality, social, environmental, and developmental studies. Cambridge: The MIT Press.

  48. Mehrabian, A. (1996). Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Current Psychology, 14(4), 261–292.

  49. Nicolaou, M. A., Gunes, H., & Pantic, M. (2011). Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Transactions on Affective Computing, 2(2), 92–105.

  50. Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.

  51. Potapov, D., Douze, M., Harchaoui, Z., & Schmid, C. (2014). Category-specific video summarization. In: European Conference on Computer Vision, Springer, pp. 540–555.

  52. Ruggero Ronchi, M., & Perona, P. (2017). Benchmarking and error diagnosis in multi-instance pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 369–378.

  53. Schindler, K., Van Gool, L., & de Gelder, B. (2008). Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural Networks, 21(9), 1238–1246.

  54. Shiffrar, M., Kaiser, M.D., & Chouchourelou, A. (2011). Seeing human movement as inherently social. The Science of Social Vision, pp. 248–264.

  55. Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576.

  56. Soomro, K., Zamir, A.R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402v1.

  57. Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., et al. (2016). YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2), 64–73.

  58. Towns, J., Cockerill, T., Dahan, M., Foster, I., Gaither, K., Grimshaw, A., et al. (2014). XSEDE: Accelerating scientific discovery. Computing in Science & Engineering, 16(5), 62–74.

  59. Wakabayashi, A., Baron-Cohen, S., Wheelwright, S., Goldenfeld, N., Delaney, J., Fine, D., et al. (2006). Development of short forms of the Empathy Quotient (EQ-Short) and the Systemizing Quotient (SQ-Short). Personality and Individual Differences, 41(5), 929–940.

  60. Wallbott, H. G. (1998). Bodily expression of emotion. European Journal of Social Psychology, 28(6), 879–896.

  61. Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3551–3558.

  62. Wang, H., Kläser, A., Schmid, C., & Liu, C.L. (2011). Action recognition by dense trajectories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3169–3176.

  63. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In: European Conference on Computer Vision, Springer, pp. 20–36.

  64. Xu, F., Zhang, J., & Wang, J. Z. (2017). Microexpression identification and categorization using a facial dynamics map. IEEE Transactions on Affective Computing, 8(2), 254–267.

  65. Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence.

  66. Ye, J., Li, J., Newman, M. G., Adams, R. B., & Wang, J. Z. (2019). Probabilistic multigraph modeling for improving the quality of crowdsourced affective data. IEEE Transactions on Affective Computing, 10(1), 115–128.

  67. Zach, C., Pock, T., & Bischof, H. (2007). A duality based approach for realtime TV-L1 optical flow. In: Proceedings of the Joint Pattern Recognition Symposium, Springer, pp. 214–223.

Acknowledgements

This material is based upon work supported in part by The Pennsylvania State University. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation Grant No. ACI-1548562 (Towns et al. 2014). The work was also supported through a GPU gift from the NVIDIA Corporation. The authors are grateful to the thousands of Amazon Mechanical Turk independent contractors for their time and dedication in providing invaluable emotion ground truth labels for the video collection. Hanjoo Kim contributed to some of the discussions. Jeremy Yuya Ong supported the data collection and visualization effort. We thank Amazon.com, Inc. for supporting the expansion of this line of research.

Author information

Corresponding authors

Correspondence to Yu Luo or James Z. Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Luo, Y., Ye, J., Adams, R.B. et al. ARBEE: Towards Automated Recognition of Bodily Expression of Emotion in the Wild. Int J Comput Vis 128, 1–25 (2020). https://doi.org/10.1007/s11263-019-01215-y

Keywords

  • Body language
  • Emotional expression
  • Computer vision
  • Crowdsourcing
  • Video analysis
  • Perception
  • Statistical modeling