Abstract
Isolated sign recognition from video streams is a challenging problem due to the multi-modal nature of signs, where local and global hand features and facial gestures need to be attended to simultaneously. This problem has recently been studied widely using deep Convolutional Neural Network (CNN) based features and Long Short-Term Memory (LSTM) based deep sequence models. However, the current literature lacks an empirical analysis of Hidden Markov Models (HMMs) used with deep features. In this study, we provide a framework composed of three modules to solve the isolated sign recognition problem using different sequence models. The dimensions of deep features are usually too large to work with HMMs. To solve this problem, we propose two alternative CNN based architectures as the second module of our framework to reduce deep feature dimensions effectively. After extensive experiments, we show that, using pretrained ResNet50 features and one of our CNN based dimension reduction models, HMMs can classify isolated signs with 90.15% accuracy on the Montalbano dataset using RGB and skeletal data. This performance is comparable to that of current LSTM based models. HMMs have fewer parameters and can be trained and run quickly on commodity computers, without requiring GPUs. Therefore, our analysis with deep features shows that HMMs are a viable alternative to deep sequence models for the challenging isolated sign recognition problem.
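The abstract does not spell out how the HMM stage assigns a label to a sign. A common setup, assumed here rather than taken from the paper, is to train one HMM per sign class and pick the class whose model gives the highest sequence likelihood. The minimal sketch below illustrates that idea with the forward algorithm in the log domain. The model parameters, the class names `wave` and `point`, and the 2-dimensional feature vectors (standing in for reduced deep features) are all hypothetical toy values.

```python
import math

def safe_log(p):
    # log(0) is represented as -inf so forbidden transitions drop out cleanly.
    return math.log(p) if p > 0 else -math.inf

def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) for a list of log-domain values.
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_gaussian_diag(x, mean, var):
    # Log-density of a Gaussian with diagonal covariance.
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
        for xi, mu, v in zip(x, mean, var)
    )

def forward_log_likelihood(seq, hmm):
    # Forward algorithm: log p(seq | hmm), summed over all state paths.
    n = len(hmm["pi"])
    log_pi = [safe_log(p) for p in hmm["pi"]]
    log_a = [[safe_log(p) for p in row] for row in hmm["A"]]

    def emit(s, x):
        return log_gaussian_diag(x, hmm["means"][s], hmm["vars"][s])

    alpha = [log_pi[s] + emit(s, seq[0]) for s in range(n)]
    for x in seq[1:]:
        alpha = [
            log_sum_exp([alpha[r] + log_a[r][s] for r in range(n)]) + emit(s, x)
            for s in range(n)
        ]
    return log_sum_exp(alpha)

def classify(seq, models):
    # Pick the class whose HMM assigns the sequence the highest likelihood.
    return max(models, key=lambda label: forward_log_likelihood(seq, models[label]))

def make_left_to_right(means):
    # Hypothetical two-state left-to-right HMM with fixed toy parameters.
    return {
        "pi": [1.0, 0.0],
        "A": [[0.7, 0.3], [0.0, 1.0]],
        "means": means,
        "vars": [[0.1, 0.1], [0.1, 0.1]],
    }

# Toy models: each class moves between two feature prototypes in opposite order.
models = {
    "wave": make_left_to_right([[0.0, 0.0], [1.0, 1.0]]),
    "point": make_left_to_right([[1.0, 1.0], [0.0, 0.0]]),
}

# Toy per-frame feature sequence that drifts from near (0, 0) to near (1, 1).
sequence = [[0.1, -0.1], [0.0, 0.2], [0.9, 1.1], [1.0, 0.8]]
predicted = classify(sequence, models)
print(predicted)
```

In this sketch each per-frame feature vector has only two dimensions. With diagonal Gaussian emissions, the number of emission parameters per state grows linearly with the feature dimension, which is one reason a reduction step before the HMM can be useful.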
Acknowledgements
The research presented is part of a project funded by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under the grant number 217E022.
About this article
Cite this article
Tur, A.O., Keles, H.Y. Evaluation of hidden Markov models using deep CNN features in isolated sign recognition. Multimed Tools Appl 80, 19137–19155 (2021). https://doi.org/10.1007/s11042-021-10593-w
DOI: https://doi.org/10.1007/s11042-021-10593-w