
Autonomous Robots, Volume 42, Issue 6, pp 1281–1298

Skeleton-based bio-inspired human activity prediction for real-time human–robot interaction

  • Brian Reily
  • Fei Han
  • Lynne E. Parker
  • Hao Zhang
Article

Abstract

Activity prediction, the task of inferring ongoing human activities from incomplete observations, is essential in practical human-centered robotics applications such as security and assisted living. To address this challenging problem, we introduce a novel bio-inspired predictive orientation decomposition (BIPOD) approach to construct representations of people from 3D skeleton trajectories. BIPOD is invariant to scale and viewpoint, runs in real time on basic computer systems, and is able to recognize and predict activities in an online fashion. Our approach is inspired by biological research in human anatomy. To capture the spatio-temporal information of human motion, we spatially decompose 3D human skeleton trajectories and project them onto three anatomical planes (i.e., the coronal, transverse, and sagittal planes); we then describe short-term temporal information of joint motions and encode high-order temporal dependencies. By using Extended Kalman Filters to estimate future skeleton trajectories, we endow the BIPOD representation with the critical capabilities to reduce noise in skeleton observations and to predict ongoing activities. Experiments on benchmark datasets show that our BIPOD representation significantly outperforms previous methods for real-time human activity classification and prediction from 3D skeleton trajectories. Empirical studies using TurtleBot2 and Baxter humanoid robots further validate that BIPOD obtains promising performance in terms of both accuracy and efficiency, making it a fast, simple, yet powerful representation for low-latency online activity prediction in human–robot interaction applications.
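To make the pipeline sketched in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation: it projects 3D joint positions onto the three anatomical planes and predicts a joint's near-future position with a constant-velocity Kalman predictor standing in for the paper's Extended Kalman Filter. The plane normals, torso-centered axes, noise levels, and the `predict_joint` helper are assumptions made only for this example.

```python
# Illustrative sketch only: a simplified take on two ingredients of a
# BIPOD-style pipeline, not the authors' implementation. Plane normals,
# axis conventions, and noise levels below are assumptions.
import numpy as np

# Anatomical plane normals in an assumed torso-centered frame
# (x = left-right, y = back-front, z = down-up).
PLANE_NORMALS = {
    "sagittal": np.array([1.0, 0.0, 0.0]),    # divides left/right
    "coronal": np.array([0.0, 1.0, 0.0]),     # divides front/back
    "transverse": np.array([0.0, 0.0, 1.0]),  # divides upper/lower
}

def project_onto_planes(joints):
    """Project 3D joint positions (N x 3) onto each anatomical plane."""
    projections = {}
    for name, n in PLANE_NORMALS.items():
        # Remove the component of each joint position along the plane normal.
        projections[name] = joints - np.outer(joints @ n, n)
    return projections

def predict_joint(history, steps_ahead=5, dt=1.0 / 30.0):
    """Predict a joint's future 3D position with a constant-velocity
    Kalman filter (a linear stand-in for the paper's EKF)."""
    # State: [x, y, z, vx, vy, vz]; only the position is observed.
    F = np.eye(6)
    F[:3, 3:] = dt * np.eye(3)                  # constant-velocity transition
    H = np.hstack([np.eye(3), np.zeros((3, 3))])
    Q, R = 1e-4 * np.eye(6), 1e-2 * np.eye(3)   # assumed noise covariances
    x, P = np.zeros(6), np.eye(6)
    x[:3] = history[0]
    for z in history:                           # filter the observed frames
        x, P = F @ x, F @ P @ F.T + Q           # predict step
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)                 # update step
        P = (np.eye(6) - K @ H) @ P
    for _ in range(steps_ahead):                # roll the motion model forward
        x = F @ x
    return x[:3]

if __name__ == "__main__":
    # Toy example: a hand joint moving roughly along +y at 0.3 m/s.
    t = np.arange(30) / 30.0
    hand = np.stack([0.2 * np.ones_like(t), 0.3 * t, 1.0 + 0.01 * t], axis=1)
    print(project_onto_planes(hand)["sagittal"][:2])
    print("predicted hand position:", predict_joint(hand))
```

In the paper the filtered and extrapolated joint trajectories would feed the orientation-decomposition descriptor; here the projection and prediction steps are shown in isolation to convey the idea.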

Keywords

Human representation · Activity classification · Activity prediction · Real-time human–robot interaction


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  1. Human-Centered Robotics Laboratory, Department of Electrical Engineering and Computer Science, Colorado School of Mines, Golden, USA
  2. Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA
