Multimedia Tools and Applications, Volume 78, Issue 1, pp 677–695

LSTM-based multi-label video event detection

  • An-An Liu
  • Zhuang Shao
  • Yongkang Wong
  • Junnan Li
  • Yu-Ting Su
  • Mohan Kankanhalli


Since large-scale surveillance videos often contain complex visual events, generating video descriptions effectively and efficiently without human supervision has become essential. To address this problem, we propose a novel architecture, motivated by the sequence-to-sequence network, for jointly recognizing multiple events in a given surveillance video. The proposed architecture predicts what happens in a video directly, without the preprocessing steps of object detection and tracking. We evaluate several variants of the proposed architecture with different visual features on a novel dataset prepared by our group. Moreover, we compute a wide range of quantitative metrics to evaluate the architecture and compare it to the popular Support Vector Machine-based visual event detection method. The comparison results suggest that the proposed method outperforms traditional computer vision pipelines for visual event detection.
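The core idea of the abstract — an LSTM encodes the frame sequence and independent per-class sigmoid outputs yield a multi-label prediction — can be illustrated with a minimal NumPy sketch. All dimensions, weights, function names, and the 0.5 decision threshold below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: gates computed from input x and previous state (h, c)."""
    z = W @ x + U @ h + b                  # stacked gate pre-activations, shape (4H,)
    H = h.shape[0]
    i = sigmoid(z[0:H])                    # input gate
    f = sigmoid(z[H:2*H])                  # forget gate
    o = sigmoid(z[2*H:3*H])                # output gate
    g = np.tanh(z[3*H:4*H])                # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def detect_events(frames, W, U, b, W_out, b_out, threshold=0.5):
    """Encode a sequence of frame features with an LSTM, then emit
    independent sigmoid scores per event label (multi-label output)."""
    H = b.shape[0] // 4
    h, c = np.zeros(H), np.zeros(H)
    for x in frames:                       # frames: list of per-frame feature vectors
        h, c = lstm_step(x, h, c, W, U, b)
    scores = sigmoid(W_out @ h + b_out)    # one independent score per event class
    return scores, scores > threshold      # raw scores and binary detections

# Toy dimensions: 8-d frame features, 16-d hidden state, 5 event classes.
rng = np.random.default_rng(0)
D, H, K = 8, 16, 5
W = rng.normal(size=(4 * H, D)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)
W_out = rng.normal(size=(K, H)) * 0.1
b_out = np.zeros(K)
frames = [rng.normal(size=D) for _ in range(10)]
scores, detected = detect_events(frames, W, U, b, W_out, b_out)
```

Because each class gets its own sigmoid rather than competing in a softmax, several events can fire simultaneously, which is what distinguishes this multi-label setup from ordinary single-label video classification.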


Keywords: Concurrent event detection, Recurrent neural network



This work was supported in part by the National Natural Science Foundation of China (61772359, 61472275, 61572356), the Tianjin Research Program of Application Foundation and Advanced Technology (15JCYBJC16200), the National Research Foundation, Prime Minister Office, Singapore under its International Research Centre in Singapore Funding Initiative.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  • An-An Liu (1)
  • Zhuang Shao (1)
  • Yongkang Wong (2)
  • Junnan Li (3)
  • Yu-Ting Su (1)
  • Mohan Kankanhalli (4)

  1. School of Electrical and Information Engineering, Tianjin University, Tianjin, China
  2. Smart Systems Institute, National University of Singapore, Singapore
  3. NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore
  4. School of Computing, National University of Singapore, Singapore
