Abstract
One option for teaching a robot new skills is learning from demonstration. While traditional techniques often require expensive sensors and equipment, advances in computer vision now make similar outcomes achievable at lower cost. To the best of our knowledge, no previous research has examined a robot learning to produce 3D motions from 2D data and then using that knowledge to interact with people. To this end, we designed a study in which a NAO robot imitates human behavior by reproducing motions in 3D space after viewing a small number of 2D RGB videos per motion. The goal is for the robot to acquire social interactive skills through video observation and then apply them during human-robot interaction. Five steps were taken to achieve this objective: 1) collecting a dataset, 2) human pose estimation, 3) transferring data from human space to robot space, 4) robot control, and 5) human-robot interaction. These steps were organized into two phases: robot imitation learning and human-robot social interaction. Most of the algorithms employed are deep learning-based, achieving ~96% action-recognition accuracy on our dataset. The results were also promising when implemented on the robot. Overall, this preliminary exploratory study demonstrated a proof of concept for producing 3D motions from 2D data. The approach is noteworthy because abundant online video can serve as training data, the robot can be trained quickly, and no expert is required.
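The human-to-robot transfer step in the pipeline above (step 3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes three estimated 3D keypoints (shoulder, elbow, wrist) define an elbow joint angle, which is then clamped to the robot's allowed joint range. The keypoint coordinates and the joint limits used below are illustrative assumptions (the limits are roughly those of NAO's RElbowRoll joint).

```python
import math

def joint_angle(a, b, c):
    """Angle at keypoint b (radians) between segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp the cosine to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))

def to_robot_joint(angle, lo, hi):
    """Clamp a human joint angle into the robot's allowed range."""
    return max(lo, min(hi, angle))

# Illustrative keypoints: a right angle at the elbow.
shoulder, elbow, wrist = (0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.3, 0.3, 0.0)
theta = joint_angle(shoulder, elbow, wrist)        # pi/2 for these points
cmd = to_robot_joint(theta, 0.0349, 1.5446)        # assumed joint limits (rad)
```

Per-frame angles computed this way would then be sent as joint commands to the robot controller; the exact joint mapping and limit values depend on the robot's documentation.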
Availability of Data and Material (data transparency)
All data from this project (e.g., videos of the sessions) are available in the Social & Cognitive Robotics Laboratory archive.
Code Availability
All the code is available in the Social & Cognitive Robotics Laboratory archive. Readers who need the code may contact the corresponding author.
References
Roveda, L., et al.: Model-Based Reinforcement Learning Variable Impedance Control for Human-Robot Collaboration. Journal of Intelligent & Robotic Systems. 100(2), 417–433 (2020). https://doi.org/10.1007/s10846-020-01183-3
Meghdari, A., Alemi, M., Zakipour, M., Kashanian, S.A.: Design and Realization of a Sign Language Educational Humanoid Robot. Journal of Intelligent & Robotic Systems. 95(1), 3–17 (2019). https://doi.org/10.1007/s10846-018-0860-2
Basiri, S., Taheri, A., Meghdari, A., Alemi, M.: Design and Implementation of a Robotic Architecture for Adaptive Teaching: a Case Study on Iranian Sign Language. Journal of Intelligent & Robotic Systems. 102(2), 48 (2021). https://doi.org/10.1007/s10846-021-01413-2
da Silva, I.J., Perico, D.H., Homem, T.P.D., da Costa Bianchi, R.A.: Deep Reinforcement Learning for a Humanoid Robot Soccer Player. Journal of Intelligent & Robotic Systems. 102(3), 69 (2021). https://doi.org/10.1007/s10846-021-01333-1
Hong, A., Igharoro, O., Liu, Y., Niroui, F., Nejat, G., Benhabib, B.: Investigating Human-Robot Teams for Learning-Based Semi-autonomous Control in Urban Search and Rescue Environments. Journal of Intelligent & Robotic Systems. 94(3), 669–686 (2019). https://doi.org/10.1007/s10846-018-0899-0
Ravichandar, H., Polydoros, A.S., Chernova, S., Billard, A.: Recent Advances in Robot Learning from Demonstration. Annual Review of Control, Robotics, and Autonomous Systems. 3, 297–330 (2020). https://doi.org/10.1146/annurev-control-100819-063206
Torabi, F., Warnell, G., Stone, P.: Recent Advances in Imitation Learning from Observation. pp. 6325–6331 (2019)
Calinon, S., Billard, A.: Incremental Learning of Gestures by Imitation in a Humanoid Robot. pp. 255–262 (2007)
Peng, X.B., Abbeel, P., Levine, S., van de Panne, M.: DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph. 37(4), 143 (2018). https://doi.org/10.1145/3197517.3201311
Nair, A., et al.: Combining self-supervised learning and imitation for vision-based rope manipulation. pp. 2146–2153 (2017)
Pavse, B.S., Torabi, F., Hanna, J., Warnell, G., Stone, P.: RIDM: Reinforced Inverse Dynamics Modeling for Learning from a Single Observed Demonstration. IEEE Robotics and Automation Letters. 5(4), 6262–6269 (2020). https://doi.org/10.1109/LRA.2020.3010750
Torabi, F., Warnell, G., Stone, P.: Behavioral cloning from observation, presented at the Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden (2018)
Guo, X., Chang, S., Yu, M., Tesauro, G., Campbell, M.: Hybrid reinforcement learning with expert state sequences, presented at the Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, Hawaii, USA, [Online]. (2019) https://doi.org/10.1609/aaai.v33i01.33013739
Edwards, A.D., Sahni, H., Schroecker, Y., Isbell, Jr C.L.: Imitating Latent Policies from Observation. CoRR, vol. abs/1805.07914. [Online]. (2018) Available: http://arxiv.org/abs/1805.07914
Zheng, C., et al.: Deep learning-based human pose estimation: A survey. arXiv preprint arXiv:2012.13392 (2020)
Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008 (2018)
Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 21–26 July 2017, pp. 1263–1272. (2017) https://doi.org/10.1109/CVPR.2017.139
Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Semantic Graph Convolutional Networks for 3D Human Pose Regression. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15–20 June 2019, pp. 3420–3430. (2019) https://doi.org/10.1109/CVPR.2019.00354
Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-End Recovery of Human Shape and Pose. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18–23 June 2018, pp. 7122–7131 (2018) https://doi.org/10.1109/CVPR.2018.00744
Kocabas, M., Athanasiou, N., Black, M.J.: VIBE: Video Inference for Human Body Pose and Shape Estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13–19 June 2020, pp. 5252–5262 (2020) https://doi.org/10.1109/CVPR42600.2020.00530
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graph. 34(6), 248 (2015). https://doi.org/10.1145/2816795.2818013
Pavlakos, G., et al.: Expressive Body Capture: 3D Hands, Face, and Body From a Single Image. pp. 10967–10977 (2019)
Kolotouros, N., Pavlakos, G., Black, M., Daniilidis, K.: Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop. pp. 2252–2261 (2019)
Benzine, A., Chabot, F., Luvison, B., Pham, Q., Achard C.: PandaNet: Anchor-Based Single-Shot Multi-Person 3D Pose Estimation. pp. 6855–6864 (2020)
Mehta, D., et al.: XNect: real-time multi-person 3D motion capture with a single RGB camera. ACM Trans. Graph. 39, 82:1–82:17 (2020). https://doi.org/10.1145/3386569.3392410
Zhang, Z., Niu, Y., Yan, Z., Lin, S.: Real-Time Whole-Body Imitation by Humanoid Robots and Task-Oriented Teleoperation Using an Analytical Mapping Method and Quantitative Evaluation. Appl. Sci. 8, 2005 (2018). https://doi.org/10.3390/app8102005
Koenemann, J., Burget, F., Bennewitz, M.: Real-time Imitation of Human Whole-Body Motions by Humanoids. (2014)
Zhang, L., Cheng, Z., Gan, Y., Zhu, G., Shen, P., Song, J.: Fast human whole body motion imitation algorithm for humanoid robots. pp. 1430–1435 (2016)
Shahverdi, P., Masouleh, M.T.: A simple and fast geometric kinematic solution for imitation of human arms by a NAO humanoid robot. In: 2016 4th International Conference on Robotics and Mechatronics (ICROM), 26–28 Oct. 2016, pp. 572–577. https://doi.org/10.1109/ICRoM.2016.7886806
Ren, B., Liu, M., Ding, R., Liu, H.: A survey on 3d skeleton-based action recognition using learning method. arXiv preprint arXiv:2002.05907, (2020)
Wang, L., Huynh, D.Q., Koniusz, P.: A comparative review of recent kinect-based action recognition algorithms. IEEE Trans. Image Process. 29, 15–28 (2019)
Wang, H., Wang, L.: Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. (2017)
Liu, J., Wang, G., Hu, P., Duan, L., Kot, A.C.: Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 21–26 July 2017, pp. 3671–3680 (2017). https://doi.org/10.1109/CVPR.2017.391
Caetano, C., Sena, J., Brémond, F., Dos Santos, J.A., Schwartz, W.R.: Skelemotion: A new representation of skeleton joint sequences based on motion information for 3d action recognition. In: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), IEEE, pp. 1–8 (2019)
Shahroudy, A., Liu, J., Ng, T.-T., Wang, G.: NTU RGB+D: A large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016)
Caetano, C., Brémond, F., Schwartz, W.R.: Skeleton image representation for 3D action recognition based on tree structure and reference joints. In: 2019 32nd SIBGRAPI conference on graphics, patterns and images (SIBGRAPI), IEEE, pp. 16–23 (2019)
Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12026–12035 (2019)
Duan, H., Zhao, Y., Chen, K., Shao, D., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. arXiv preprint arXiv:2104.13586, (2021)
Djordjevic, V., Tao, H., Song, X., He, S., Gao, W., Stojanovic, V.: Data-driven control of hydraulic servo actuator: An event-triggered adaptive dynamic programming approach. Math. Biosci. Eng. 20(5), 8561–8582 (2023)
Nedic, N., Stojanovic, V., Djordjevic, V.: Optimal control of hydraulically driven parallel robot platform based on firefly algorithm. Nonlinear Dynamics. 82, 1457–1473 (2015)
Zhou, C., Tao, H., Chen, Y., Stojanovic, V., Paszke, W.: Robust point-to-point iterative learning control for constrained systems: A minimum energy approach. Int J Robust Nonlinear Control 32(18), 10139–10161 (2022)
Taheri, A., Meghdari, A., Mahoor, M.H.: A close look at the imitation performance of children with autism and typically developing children using a robotic system. Int. J. Soc. Robot. 13, 1125–1147 (2021)
Mahmood, N., Ghorbani, N., Troje, N., Pons-Moll, G., Black, M.: AMASS: Archive of Motion Capture As Surface Shapes. pp. 5441–5450 (2019)
Lugaresi, C., et al.: Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, (2019)
Aldebaran Documentation. http://doc.aldebaran.com/
Cleveland, W.S., Devlin, S.J.: Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. J. Am. Stat. Assoc. 83(403), 596–610 (1988). https://doi.org/10.1080/01621459.1988.10478639
Müller, M.: Dynamic time warping. Information Retrieval for Music and Motion. 2, 69–84 (2007). https://doi.org/10.1007/978-3-540-74048-3_4
Yang, Z., Li, Y., Yang, J., Luo, J.: Action Recognition With Spatio–Temporal Visual Attention on Skeleton Image Sequences. IEEE Transactions on Circuits and Systems for Video Technology. 29(8), 2405–2415 (2019). https://doi.org/10.1109/TCSVT.2018.2864148
Xu, H., Bazavan, E., Zanfir, A., Freeman, W., Sukthankar, R., Sminchisescu, C.: GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models. pp. 6183–6192 (2020)
Pham, H.H., Salmane, H., Khoudour, L., Crouzil, A., Velastin, S.A., Zegers, P.: A unified deep framework for joint 3d pose estimation and action recognition from a single rgb camera. Sensors. 20(7), 1825 (2020)
Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 13–18 June 2010, pp. 9–14, (2010) https://doi.org/10.1109/CVPRW.2010.5543273
Yun, K., Honorio, J., Chattopadhyay, D., Berg, T.L., Samaras, D.: Two-person interaction detection using body-pose features and multiple instance learning. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 16–21 June 2012, pp. 28–35, (2012) https://doi.org/10.1109/CVPRW.2012.6239234
Mazhar, O., Ramdani, S., Navarro, B., Passama, R., Cherubini, A.: Towards Real-Time Physical Human-Robot Interaction Using Skeleton Information and Hand Gestures. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1–5 Oct. 2018, pp. 1–6, (2018) https://doi.org/10.1109/IROS.2018.8594385
Bandi, C., Thomas, U.: Skeleton-based Action Recognition for Human-Robot Interaction using Self-Attention Mechanism. In: 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), 15–18 Dec. 2021, pp. 1–8, (2021) https://doi.org/10.1109/FG52635.2021.9666948
Song, Z., et al.: Attention-Oriented Action Recognition for Real- Time Human-Robot Interaction. In: 2020 25th International Conference on Pattern Recognition (ICPR), 10–15 Jan. 2021, pp. 7087–7094, (2021) https://doi.org/10.1109/ICPR48806.2021.9412346
Acknowledgments
This research was supported by the Sharif University of Technology. The complementary and continued support of the Social & Cognitive Robotics Laboratory by a Dr. Ali Akbar Siassi Memorial Grant is also greatly appreciated. We also thank Mrs. Shari Holderread for the English editing of the final manuscript.
Funding
This research was funded by the Sharif University of Technology (Grant No. G980517).
Author information
Contributions
All authors contributed to the study’s conception and design. Material preparation, data collection, and analysis were performed by Seyed Adel Alizadeh Kolagar. The first draft of the manuscript was written by Seyed Adel Alizadeh Kolagar. All authors read, revised, and approved the final manuscript. Alireza Taheri defined the project. Alireza Taheri and Ali F. Meghdari supervised the research and provided guidance/expertise in the area of AI and HRI.
Ethics declarations
Ethics Approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
Consent to Participate
Informed consent was obtained from all individual participants included in the study.
Consent for Publication
The authors affirm that the human research participants provided informed consent to publish the images used in all the figures.
Conflict of Interest
Author Alireza Taheri has received research grants from the Sharif University of Technology. Authors Seyed Adel Alizadeh Kolagar and Ali F. Meghdari declare no conflict of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Alizadeh Kolagar, S.A., Taheri, A. & Meghdari, A.F. NAO Robot Learns to Interact with Humans through Imitation Learning from Video Observation. J Intell Robot Syst 109, 4 (2023). https://doi.org/10.1007/s10846-023-01938-8