
An efficient end-to-end deep learning architecture for activity classification

Published in Analog Integrated Circuits and Signal Processing

Abstract

Deep learning is widely regarded as one of the most important methods in computer vision, with applications ranging from image recognition to robot navigation systems and self-driving cars. Recent advances in neural networks have enabled efficient end-to-end architectures for human activity representation and classification. In light of these developments, there is growing interest in methods that are less expensive in both computation and memory. This paper presents an optimized end-to-end approach for describing and classifying human action videos. First, RGB activity videos are sampled into frame sequences. Convolutional features are then extracted from these frames using the pre-trained Inception-v3 model. Finally, actions are classified by training a long short-term memory (LSTM) network on the resulting feature vectors. The proposed architecture aims for low computational cost and improved accuracy. Our efficient end-to-end approach outperforms previously published results, reaching accuracy rates of 98.4% and 98.5% on the UTD-MHAD HS and UTD-MHAD SS public dataset experiments, respectively.
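As a rough illustration of the pipeline described above, the sketch below (assuming a TensorFlow/Keras setup, which the paper does not specify) extracts per-frame features from a frozen pre-trained Inception-v3 backbone and trains an LSTM classifier on the resulting feature sequences. The frame count, LSTM width, and training settings are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of the two-stage pipeline: frozen Inception-v3 features per frame,
# then an LSTM classifier over the feature sequence. Hyperparameters are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

NUM_FRAMES = 30     # frames sampled per video (assumed)
NUM_CLASSES = 27    # UTD-MHAD contains 27 action classes
FEATURE_DIM = 2048  # Inception-v3 global-average-pooled feature size

# 1) Pre-trained Inception-v3 used only as a fixed feature extractor.
backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
backbone.trainable = False

def extract_features(frames):
    """frames: array of shape (NUM_FRAMES, 299, 299, 3) -> (NUM_FRAMES, FEATURE_DIM)."""
    return backbone.predict(preprocess_input(frames.astype("float32")), verbose=0)

# 2) LSTM classifier trained on per-frame feature sequences.
classifier = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, FEATURE_DIM)),
    layers.LSTM(256),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```

Because the backbone is frozen, feature extraction can be run once offline and only the comparatively small LSTM head is trained, which is consistent with the low computational cost the approach targets.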



Author information

Correspondence to Amel Ben Mahjoub.

About this article

Cite this article

Ben Mahjoub, A., Atri, M. An efficient end-to-end deep learning architecture for activity classification. Analog Integr Circ Sig Process 99, 23–32 (2019). https://doi.org/10.1007/s10470-018-1306-2
