Attention-based encoder-decoder networks for workflow recognition

  • Part of a collection: Advances of machine learning in data analytics and visual information processing
  • Published in: Multimedia Tools and Applications

Abstract

Behavior recognition is a fundamental yet challenging task in intelligent surveillance systems and plays an increasingly important role in the context of Industry 4.0. However, monitoring the workflow of both workers and machines during production is difficult in complex industrial environments. In this paper, we propose a novel workflow recognition framework, termed the attention-based workflow recognition (AWR) framework, which recognizes the behavior of working subjects with a carefully designed encoder-decoder structure. To improve recognition accuracy, a temporal attention cell (AttCell) is introduced in the last stage of the framework to compute a dynamic attention distribution over the encoded features. In addition, a Rough-to-Refine phase localization model is employed to improve localization accuracy by effectively identifying the boundaries of a specific phase instance in long untrimmed videos. Comprehensive experiments show a 1.4% mAP@IoU=0.4 improvement on the THUMOS’14 dataset and a 3.4% mAP@IoU=0.4 improvement on our hand-crafted workflow detection dataset compared with the advanced GTAN pipeline. More remarkably, the effectiveness of the workflow recognition system is validated in a real-world production scenario.
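To make the mechanism sketched in the abstract concrete, the snippet below gives a minimal soft temporal attention cell over clip-level encoder features, together with the temporal IoU used for mAP@IoU scoring. It is an illustrative PyTorch sketch under assumed names and dimensions (AttCell, feat_dim, hidden_dim, attn_dim), not the authors' released implementation of AWR.

```python
# Minimal sketch of a soft temporal attention cell plus temporal IoU.
# Assumed setup: a 3D-CNN encoder produces clip-level features of shape
# (batch, T, feat_dim); the decoder carries a hidden state of size hidden_dim.
import torch
import torch.nn as nn


class AttCell(nn.Module):
    """Soft temporal attention: weights each time step of the encoder
    feature sequence conditioned on the decoder's previous hidden state."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.enc_proj = nn.Linear(feat_dim, attn_dim)    # project encoder features
        self.dec_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)              # scalar energy per time step

    def forward(self, enc_feats: torch.Tensor, dec_hidden: torch.Tensor):
        # enc_feats: (B, T, feat_dim), dec_hidden: (B, hidden_dim)
        energy = torch.tanh(self.enc_proj(enc_feats)
                            + self.dec_proj(dec_hidden).unsqueeze(1))  # (B, T, attn_dim)
        alpha = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # (B, T), sums to 1 over T
        context = torch.bmm(alpha.unsqueeze(1), enc_feats).squeeze(1)  # (B, feat_dim)
        return context, alpha


def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Temporal IoU between two (start, end) segments, as used in mAP@IoU."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    cell = AttCell(feat_dim=2048, hidden_dim=512)
    ctx, alpha = cell(torch.randn(2, 32, 2048), torch.randn(2, 512))
    print(ctx.shape, alpha.shape)                # torch.Size([2, 2048]) torch.Size([2, 32])
    print(temporal_iou((1.0, 4.0), (2.0, 6.0)))  # 0.4
```

Running the example yields a per-sample context vector and an attention distribution over the T time steps that sums to one, which is the kind of dynamic temporal weighting the decoder can apply at each step.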

References

  1. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473

  2. Blum T, Feußner H, Navab N (2010) Modeling and segmentation of surgical workflow from laparoscopic video. In: International conference on medical image computing and computer-assisted intervention, pp 400–407

  3. Chao YW, Vijayanarasimhan S, Seybold B, Ross DA, Deng J, Sukthankar R (2018) Rethinking the faster r-cnn architecture for temporal action localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1130–1139

  4. Chen Y, Sun QL, Zhong K (2018) Semi-supervised spatio-temporal CNN for recognition of surgical workflow. EURASIP J Image Video Process 2018(1):76

  5. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: IEEE Conference on computer vision and pattern recognition, pp 248–255

  6. Dogan E, Eren G, Wolf C, Baskurt A (2015) Activity recognition with volume motion templates and histograms of 3d gradients. In: 2015 IEEE International Conference on Image Processing (ICIP), pp 4421–4425

  7. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941

  8. Gorban A, Idrees H, Jiang Y G, Zamir A R, Laptev I, Shah M (2015) THUMOS challenge: Action recognition with a large number of classes

  9. Hu H, Cheng K, Li Z, Chen J, Hu H (2018) Workflow recognition with structured two-stream convolutional networks. Pattern Recogn Lett 130:267–274

  10. Jiang B, Wang M, Gan W, Wu W, Yan J (2019) STM: SpatioTemporal and motion encoding for action recognition. In: Proceedings of the IEEE international conference on computer vision, pp 2000–2009

  11. Jin Y, Dou Q, Chen H, Yu L, Qin J, Fu CW, Heng PA (2017) SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans Med Imaging 37(5):1114–1126

  12. Kosmopoulos D I, Doulamis N D, Voulodimos A S (2012) Bayesian filter based behavior recognition in workflows allowing for user feedback. Comput Vis Image Underst 116(3):422–434

  13. Kulkarni A, Shivananda A (2019) Deep learning for NLP. In: Natural language processing recipes, pp 185–227

  14. Lalys F, Riffaud L, Bouget D, Jannin P (2011) A framework for the recognition of high-level surgical tasks from video images for cataract surgeries. IEEE Trans Biomed Eng 59(4):966–976

  15. Lan T, Wang Y, Mori G (2011) Discriminative figure-centric models for joint action localization and recognition. In: 2011 International conference on computer vision, pp 2003–2010

  16. Li Z, Gavrilyuk K, Gavves E, Jain M, Snoek C G (2018) Videolstm convolves, attends and flows for action recognition. Comput Vis Image Underst 166:41–50

  17. Long F, Yao T, Qiu Z, Tian X, Luo J, Mei T (2019) Gaussian temporal awareness networks for action localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 344–353

  18. Lu J, Corso JJ (2015) Human action segmentation with hierarchical supervoxel consistency. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3762–3771

  19. Lu J, Yang J, Batra D, Parikh D (2016) Hierarchical question-image co-attention for visual question answering. In: Advances in neural information processing systems (NIPS)

  20. Ma Z, Chang X, Yang Y, Sebe N, Hauptmann A G (2017) The many shades of negativity. IEEE Trans Multimed 19(7):1558–1568

  21. Makantasis K, Doulamis A, Doulamis N, Psychas K (2016) Deep learning based human behavior recognition in industrial workflows. In: 2016 IEEE International conference on image processing (ICIP), pp 1609–1613

  22. Padoy N (2019) Machine and deep learning for workflow recognition during surgery. Minim Invasive Ther Allied Technol 28(2):82–90

  23. Protopapadakis EE, Doulamis AD, Doulamis ND (2013) Tapped delay multiclass support vector machines for industrial workflow recognition. In: 2013 14th International workshop on image analysis for multimedia interactive services (WIAMIS), pp 1–4

  24. Protopapadakis E, Doulamis A, Makantasis K, Voulodimos A (2012) A semi-supervised approach for industrial workflow recognition. In: Proceedings of the second international conference on advanced communications and computation, pp 21–26

  25. Rensink R A (2000) The dynamic representation of scenes. Vis Cogn 7(1-3):17–42

  26. Sharma S, Kiros R, Salakhutdinov R (2015) Action recognition using visual attention. arXiv:1511.04119

  27. Shou Z, Wang D, Chang SF (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1049–1058

  28. Tao L, Zappella L, Hager GD, Vidal R (2013) Surgical gesture segmentation and recognition. In: International conference on medical image computing and computer-assisted intervention, pp 339–346

  29. Thomay C, Gollan B, Haslgrübler M, Ferscha A, Heftberger J (2019) A multi-sensor algorithm for activity and workflow recognition in an industrial setting. In: Proceedings of the 12th ACM international conference on pervasive technologies related to assistive environments, pp 69–76

  30. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497

  31. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6450–6459

  32. Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1510–1517

  33. Voulodimos A, Kosmopoulos D, Vasileiou G, Sardis E, Anagnostopoulos V, Lalos C, Varvarigou T (2012) A threefold dataset for activity and workflow recognition in complex industrial environments. IEEE MultiMedia 19(3):42–52

  34. Voulodimos A, Kosmopoulos D, Veres G, Grabner H, Van Gool L, Varvarigou T (2011) Online classification of visual tasks for industrial workflow monitoring. Neural Netw 24(8):852–860

  35. Wang L, Qiao Y, Tang X (2014) Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognition Challenge 1(2):2

  36. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 3551–3558

  37. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2018) Temporal segment networks for action recognition in videos. IEEE Trans Pattern Anal Mach Intell 41(11):2740–2755

  38. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057

  39. Xu H, Das A, Saenko K (2017) R-c3d: Region convolutional 3d network for temporal activity detection. In: Proceedings of the IEEE international conference on computer vision, pp 5783–5792

  40. Yang Y, Ma Z, Nie F, Chang X, Hauptmann A G (2015) Multi-class active learning by uncertainty sampling with diversity maximization. Int J Comput Vis 113(2):113–127

  41. Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. arXiv:1409.2329

  42. Zhang Q, Hua G (2015) Multi-view visual recognition of imperfect testing data. In: Proceedings of the 23rd ACM international conference on multimedia, pp 561–570

  43. Zhang L, Wang QW (2018) XIOLIFT database, https://pan.baidu.com/s/lySILNURWDN40q5TpAvGKUA

  44. Zhu W, Hu J, Sun G, Cao X, Qiao Y (2016) A key volume mining deep framework for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1991–1999

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant nos. 61572251, 61572162, 61702144 and 61802095), the Natural Science Foundation of Zhejiang Province (LQ17F020003), and the Key Science and Technology Project Foundation of Zhejiang Province (2018C01012).

Author information

Corresponding author

Correspondence to Haiyang Hu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, M., Hu, H., Li, Z. et al. Attention-based encoder-decoder networks for workflow recognition. Multimed Tools Appl 80, 34973–34995 (2021). https://doi.org/10.1007/s11042-021-10633-5
