Abstract
Vision-based control systems play an important role in modern robotics. Implementing such a system requires an effective algorithm for recognizing human actions and the working environment, as well as the design of intuitive gesture commands. This paper proposes an action recognition algorithm for robotics and manufacturing automation. The key contributions are (1) fusion of multimodal information obtained from depth sensors and visible-range cameras, (2) a modified Gabor-based and 3-D binary descriptor using micro-block differences, (3) an efficient skeleton-based descriptor, and (4) a recognition algorithm built on the combined descriptor. Representing 3-D patches of video with a complex background by binary micro-block differences at several scales and orientations yields an informative description of the observed action. Experimental results demonstrate the effectiveness of the proposed algorithm on benchmark datasets.
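The core idea of a binary micro-block difference descriptor can be illustrated with a minimal sketch: within a 3-D (time × height × width) video patch, pairs of small micro-blocks are sampled, their mean intensities compared, and each comparison contributes one bit to a binary code. The sketch below is only an illustration of this general technique under assumed parameters (block size, number of pairs, random pairing); the paper's exact sampling scheme, scales, and orientations are not reproduced here.

```python
import numpy as np

def micro_block_diff_descriptor(volume, block=2, n_pairs=32, seed=0):
    """Binary descriptor for a 3-D (T, H, W) patch.

    Randomly pairs micro-blocks of size block^3 inside the patch,
    compares their mean intensities, and thresholds each difference
    at zero to produce one bit per pair. Illustrative sketch only;
    the published descriptor's sampling scheme may differ.
    """
    t, h, w = volume.shape
    rng = np.random.default_rng(seed)
    # Top-left-front corners of the micro-blocks, two per comparison pair.
    corners = np.stack([
        rng.integers(0, t - block + 1, size=(n_pairs, 2)),
        rng.integers(0, h - block + 1, size=(n_pairs, 2)),
        rng.integers(0, w - block + 1, size=(n_pairs, 2)),
    ], axis=-1)  # shape: (n_pairs, 2, 3)
    bits = np.empty(n_pairs, dtype=np.uint8)
    for i, ((z0, y0, x0), (z1, y1, x1)) in enumerate(corners):
        m0 = volume[z0:z0 + block, y0:y0 + block, x0:x0 + block].mean()
        m1 = volume[z1:z1 + block, y1:y1 + block, x1:x1 + block].mean()
        bits[i] = m0 > m1  # one bit per micro-block pair
    return bits
```

Because the comparisons depend only on relative intensities, the resulting code is robust to global brightness changes; applying the same sampling at several patch scales would give the multi-scale variant described above.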
Availability of data and materials
Not applicable.
Code availability
Not applicable.
Funding
The reported study was funded by the Educational Organizations in 2020–2022 Project under Grant No. FSFS-2020-0031 and in part by RFBR and NSFC under research project No. 20-57-53012.
Author information
Authors and Affiliations
Contributions
All authors contributed to the critical literature review, as well as to writing and revising the manuscript.
Corresponding author
Ethics declarations
Ethics approval
The manuscript in part or in full has not been submitted or published anywhere. The manuscript will not be submitted elsewhere until the editorial process is completed.
Consent to participate
Not applicable.
Consent for publication
The authors transfer to Springer the non-exclusive publication rights.
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Voronin, V., Zhdanova, M., Semenishchev, E. et al. Action recognition for the robotics and manufacturing automation using 3-D binary micro-block difference. Int J Adv Manuf Technol 117, 2319–2330 (2021). https://doi.org/10.1007/s00170-021-07613-2