
An efficient end-to-end deep learning architecture for activity classification

  • Amel Ben Mahjoub
  • Mohamed Atri

Abstract

Deep learning is widely considered one of the most important methods in computer vision, with applications such as image recognition, robot navigation systems and self-driving cars. Recent developments in neural networks have led to efficient end-to-end architectures for human activity representation and classification. In light of these advances, there is now considerable interest in methods that are less expensive in terms of computation and memory. This paper presents an optimized end-to-end approach to describing and classifying human action videos. First, RGB activity videos are sampled into frame sequences. Convolutional features are then extracted from these frames using the pre-trained Inception-v3 model. Finally, video actions are classified by training a long short-term memory (LSTM) network on the resulting feature vectors. The proposed architecture aims to combine low computational cost with improved accuracy. Our efficient end-to-end approach outperforms previously published results, reaching accuracy rates of 98.4% and 98.5% on the UTD-MHAD HS and UTD-MHAD SS public dataset experiments, respectively.
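
The pipeline described above (frame sampling, Inception-v3 feature extraction, LSTM classification) can be illustrated with a short sketch. The snippet below is a minimal example assuming a TensorFlow/Keras environment; the sequence length, LSTM width and other hyper-parameters are illustrative assumptions rather than the authors' exact configuration.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, models

    SEQ_LEN = 30          # frames sampled per video (assumed value)
    NUM_CLASSES = 27      # e.g. the number of UTD-MHAD action classes
    FEAT_DIM = 2048       # size of the pooled Inception-v3 feature vector

    # 1) Frozen Inception-v3 backbone used only as a frame-level feature extractor.
    backbone = tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(299, 299, 3))
    backbone.trainable = False

    def extract_features(frames):
        """frames: (SEQ_LEN, 299, 299, 3) array -> (SEQ_LEN, FEAT_DIM) features."""
        frames = tf.keras.applications.inception_v3.preprocess_input(frames)
        return backbone.predict(frames, verbose=0)

    # 2) LSTM classifier trained on the per-frame feature sequences.
    classifier = models.Sequential([
        layers.Input(shape=(SEQ_LEN, FEAT_DIM)),
        layers.LSTM(256),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    classifier.compile(optimizer="adam",
                       loss="sparse_categorical_crossentropy",
                       metrics=["accuracy"])

    # Random data standing in for one real video, just to show the data flow.
    dummy_video = np.random.rand(SEQ_LEN, 299, 299, 3).astype("float32")
    features = extract_features(dummy_video)[np.newaxis, ...]   # (1, SEQ_LEN, FEAT_DIM)
    print(classifier.predict(features).shape)                   # (1, NUM_CLASSES)

Because the backbone is frozen, only the lightweight LSTM head is trained, which is what keeps the computational and memory cost of such an approach low.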

Keywords

Pre-trained CNN · LSTM · End-to-end model · Feature extraction · Action recognition


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Laboratory of Electronics and Micro-electronics, Faculty of Sciences, Monastir University, Monastir, Tunisia
