Attention-based encoder-decoder networks for workflow recognition

Zhang, Min; Hu, Haiyang; Li, Zhongjin; Chen, Jie

doi:10.1007/s11042-021-10633-5

1166- Advances of machine learning in data analytics and visual information processing
Published: 06 March 2021

Volume 80, pages 34973–34995, (2021)

Multimedia Tools and Applications

Min Zhang¹, Haiyang Hu¹ (ORCID: orcid.org/0000-0002-6070-8524), Zhongjin Li¹ & Jie Chen¹

Abstract

Behavior recognition is a fundamental yet challenging task in intelligent surveillance systems, which play an increasingly important role in the process of “Industry 4.0”. However, monitoring the workflow of both workers and machines during production is difficult in complex industrial environments. In this paper, we propose a novel attention-based workflow recognition framework, termed AWR, which recognizes the behavior of working subjects using a well-designed encoder-decoder structure. To improve the accuracy of workflow recognition, a temporal attention cell (AttCell) is introduced in the last stage of the framework to produce a dynamic attention distribution. In addition, a Rough-to-Refine phase localization model is used to improve localization accuracy by identifying the boundaries of a specific phase instance in long untrimmed videos. Comprehensive experiments show that, compared with the advanced GTAN pipeline, AWR improves mAP at IoU = 0.4 by 1.4% on the THUMOS’14 dataset and by 3.4% on the detection challenge of a hand-crafted workflow dataset. The effectiveness of the workflow recognition system is also validated in a real-world production scenario.
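The mAP@IoU = 0.4 figures depend on the temporal intersection-over-union between a predicted phase segment and a ground-truth segment. The following is a minimal sketch of that matching criterion only, not the authors' evaluation code; the function names `temporal_iou` and `is_true_positive` and the `(start, end)` interval format in seconds are assumptions made for illustration.

```python
def temporal_iou(pred, truth):
    """Temporal IoU of two (start, end) intervals, e.g. in seconds."""
    p_start, p_end = pred
    t_start, t_end = truth
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, truth, threshold=0.4):
    """A prediction counts as a match when its IoU reaches the threshold."""
    return temporal_iou(pred, truth) >= threshold


if __name__ == "__main__":
    # Overlap of 4 s over a union of 8 s gives an IoU of 0.5.
    print(temporal_iou((2.0, 8.0), (4.0, 10.0)))
    print(is_true_positive((2.0, 8.0), (4.0, 10.0)))
```

A full mAP computation would additionally rank predictions by confidence and average precision over classes; this sketch covers only the per-segment IoU test at the 0.4 threshold.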

Acknowledgments

This work is supported by the National Science Foundation of China (Grant nos. 61572251, 61572162, 61702144 and 61802095), the Natural Science Foundation of Zhejiang Province (LQ17F020003), and the Key Science and Technology Project Foundation of Zhejiang Province (2018C01012).

Author information

Authors and Affiliations

School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
Min Zhang, Haiyang Hu, Zhongjin Li & Jie Chen

Corresponding author

Correspondence to Haiyang Hu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, M., Hu, H., Li, Z. et al. Attention-based encoder-decoder networks for workflow recognition. Multimed Tools Appl 80, 34973–34995 (2021). https://doi.org/10.1007/s11042-021-10633-5

Received: 28 December 2019
Revised: 21 December 2020
Accepted: 04 February 2021
Published: 06 March 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s11042-021-10633-5
