
An efficient end-to-end deep learning architecture for activity classification

Published in Analog Integrated Circuits and Signal Processing

Abstract

Deep learning is widely regarded as one of the most important methods in computer vision, with applications ranging from image recognition to robot navigation systems and self-driving cars. Recent advances in neural networks have enabled efficient end-to-end architectures for human activity representation and classification. In light of these developments, there is growing interest in methods that are less expensive in both computation and memory. This paper presents an optimized end-to-end approach for describing and classifying human action videos. First, RGB activity videos are sampled into frame sequences. Convolutional features are then extracted from these frames using the pre-trained Inception-v3 model. Finally, actions are classified by training a long short-term memory (LSTM) network on the resulting feature vectors. The proposed architecture aims for low computational cost and improved accuracy. Our efficient end-to-end approach outperforms previously published results, reaching accuracy rates of 98.4% and 98.5% on the UTD-MHAD HS and UTD-MHAD SS public dataset experiments, respectively.
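As a rough illustration of the pipeline described above, the sketch below (assuming a TensorFlow/Keras setup, which the paper does not specify) extracts per-frame features from a frozen pre-trained Inception-v3 backbone and trains an LSTM classifier on the resulting feature sequences. The frame count, LSTM width, and training settings are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of the two-stage pipeline: frozen Inception-v3 features per frame,
# then an LSTM classifier over the feature sequence. Hyperparameters are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

NUM_FRAMES = 30     # frames sampled per video (assumed)
NUM_CLASSES = 27    # UTD-MHAD contains 27 action classes
FEATURE_DIM = 2048  # Inception-v3 global-average-pooled feature size

# 1) Pre-trained Inception-v3 used only as a fixed feature extractor.
backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
backbone.trainable = False

def extract_features(frames):
    """frames: array of shape (NUM_FRAMES, 299, 299, 3) -> (NUM_FRAMES, FEATURE_DIM)."""
    return backbone.predict(preprocess_input(frames.astype("float32")), verbose=0)

# 2) LSTM classifier trained on per-frame feature sequences.
classifier = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, FEATURE_DIM)),
    layers.LSTM(256),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```

Because the backbone is frozen, feature extraction can be run once offline and only the comparatively small LSTM head is trained, which is consistent with the low computational cost the approach targets.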



Author information

Correspondence to Amel Ben Mahjoub.

About this article

Cite this article

Ben Mahjoub, A., Atri, M. An efficient end-to-end deep learning architecture for activity classification. Analog Integr Circ Sig Process 99, 23–32 (2019). https://doi.org/10.1007/s10470-018-1306-2
