Learning Efficient Spatial-Temporal Gait Features with Deep Learning for Human Identification
The integration of the latest breakthroughs in bioinformatics technology on one side and artificial intelligence on the other has enabled remarkable advances in fields such as intelligent security, computational biology, and healthcare. Among these, biometrics-based automatic human identification is one of the most fundamental and significant research topics. Human gait, a biometric feature that can be captured remotely and is both robust and hard to disguise, has attracted significant attention for biometrics-based human identification. However, existing methods cannot adequately handle the indistinctive inter-class differences and large intra-class variations of human gait in real-world situations. In this paper, we develop efficient spatial-temporal gait features with deep learning for human identification. First, we propose a gait energy image (GEI) based Siamese neural network that automatically extracts robust and discriminative spatial gait features. Furthermore, we exploit deep 3-dimensional convolutional networks to learn convolutional 3D (C3D) representations of gait as temporal features. Finally, the GEI and C3D gait features are embedded into a null space by the Null Foley-Sammon Transform (NFST). In this new space, the spatial-temporal features are combined through distance metric learning, which drives the similarity metric to be small for pairs of gaits from the same person and large for pairs from different persons. Experiments on the world's largest gait database show that our framework substantially outperforms state-of-the-art methods.
Keywords: Gait recognition · Siamese neural network · Spatio-temporal features · Metric learning · Human identification
This work is partially supported by the Funds for International Cooperation and Exchange of the National Natural Science Foundation of China (No. 61720106007), the NSFC-Guangdong Joint Fund (No. U1501254), the National Natural Science Foundation of China (No. 61602049), and the Cosponsored Project of Beijing Committee of Education.
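The metric-learning objective described in the abstract, a similarity metric that is small for gait pairs from the same person and large for pairs from different persons, is the contrastive loss commonly used to train Siamese networks. The following is a minimal NumPy sketch of that objective; the linear `embed` branch is a toy stand-in for the paper's GEI-based CNN, and all names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def embed(x, W):
    # Toy linear embedding branch; both inputs of the Siamese network
    # share the same weights W (a stand-in for the GEI-based CNN branch).
    return np.tanh(W @ x)

def contrastive_loss(e1, e2, same, margin=1.0):
    # Contrastive loss: pull embeddings of the same person together,
    # push embeddings of different persons at least `margin` apart.
    d = np.linalg.norm(e1 - e2)
    if same:
        return d ** 2
    return max(margin - d, 0.0) ** 2

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))              # shared branch weights
x1, x2 = rng.standard_normal(16), rng.standard_normal(16)

loss_same = contrastive_loss(embed(x1, W), embed(x2, W), same=True)
loss_diff = contrastive_loss(embed(x1, W), embed(x1, W), same=False)
print(loss_same >= 0.0, loss_diff)            # → True 1.0
```

An identical pair labeled "different" incurs the full margin penalty (here 1.0), while a genuine-pair loss shrinks to zero only when the two embeddings coincide; gradient descent on this loss over many labeled pairs shapes the embedding space the way the abstract describes.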