LSTM and multiple CNNs based event image classification

  • 1162: Machine learning for big multimedia analytics
Multimedia Tools and Applications

Abstract

Previous studies have shown that the complexity and variation of event images are the major challenges in event classification. We approach the problem with an integrated method that uses a Long Short-Term Memory (LSTM) network to fuse multiple Convolutional Neural Networks (CNNs). To address complexity, we use three dedicated CNNs to extract scene, object and human visual cues, respectively. To reduce the semantic gap and exploit the complementarity of features at different levels, we adopt AlexNet and VGG-16 as the base architectures and concatenate the outputs of their first and second fully-connected layers. To capture the contextual correlations between visual cues, we arrange the concatenated features of the three CNNs in the order scene, object, human and feed them as a whole into the LSTM network. To add further context, we crop each image into five blocks as input, so that, owing to the temporal characteristics of the LSTM, each individual image is supplemented with contextual features. We evaluate our method on the Web Image Dataset for Event Recognition (WIDER), and the results demonstrate the effectiveness of each of these design choices. Compared with state-of-the-art methods, the proposed method offers a promising way to improve event classification performance.
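
The fusion pipeline described above can be sketched at the level of array shapes. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for real CNN activations, the weights are untrained, and the abstract does not say how cues and crops map onto LSTM time steps, so the sketch assumes each of the five crops is one time step whose input is the scene, object and human features concatenated in that order. All names (`fake_fc_outputs`, `build_sequence`, `lstm_last_hidden`, `classify`) and the sizes `FC_DIM` and `HIDDEN` are hypothetical.

```python
import numpy as np

CUES = ("scene", "object", "human")  # cue order given in the abstract
NUM_CROPS = 5      # five blocks cropped from each image
FC_DIM = 8         # small stand-in for the 4096-d fc layers of AlexNet / VGG-16
HIDDEN = 16        # hypothetical LSTM hidden size
NUM_CLASSES = 61   # number of event classes in WIDER


def fake_fc_outputs(rng):
    """Placeholder for one CNN: random first and second fully-connected outputs."""
    return rng.standard_normal(FC_DIM), rng.standard_normal(FC_DIM)


def build_sequence(rng):
    """Return a (NUM_CROPS, 3 * 2 * FC_DIM) array, one row per crop (assumed time step)."""
    steps = []
    for _ in range(NUM_CROPS):
        # For each cue, concatenate the two fully-connected outputs.
        per_cue = [np.concatenate(fake_fc_outputs(rng)) for _ in CUES]
        steps.append(np.concatenate(per_cue))  # scene | object | human
    return np.stack(steps)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_last_hidden(seq, W, U, b):
    """Run a single-layer LSTM over seq and return the final hidden state."""
    hidden = U.shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in seq:
        # Pre-activations for input, forget, output gates and candidate cell.
        i, f, o, g = np.split(W @ x + U @ h + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def classify(seq, rng):
    """Untrained LSTM followed by a softmax layer; returns class probabilities."""
    d = seq.shape[1]
    W = rng.standard_normal((4 * HIDDEN, d)) * 0.1
    U = rng.standard_normal((4 * HIDDEN, HIDDEN)) * 0.1
    b = np.zeros(4 * HIDDEN)
    h = lstm_last_hidden(seq, W, U, b)
    logits = rng.standard_normal((NUM_CLASSES, HIDDEN)) @ h
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
sequence = build_sequence(rng)
probs = classify(sequence, rng)
print(sequence.shape, probs.shape)
```

With the toy sizes above, each time step carries 3 cues × 2 layers × 8 values = 48 features. With real 4096-dimensional fully-connected layers the same layout would give 24,576 features per step, which is one reason a practical system might reduce dimensionality before the LSTM; whether the paper does so is not stated in the abstract.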

Acknowledgments

This work was supported in part by the National Science Foundation Project of P. R. China under Grants No. 52071349 and No. 61701554, the cross-discipline research project of Minzu University of China (2020MDJC08), the State Language Commission Key Project (ZDl135-39), the Promotion plan for young teachers’ scientific research ability of Minzu University of China, the MUC 111 Project, and First class courses (Digital Image Processing KC2066). We gratefully acknowledge Dr. Lizhi Zhao for assistance with parts of the revised manuscript and for valuable discussion.

Author information

Corresponding author

Correspondence to Wei Song.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Li, P., Tang, H., Yu, J. et al. LSTM and multiple CNNs based event image classification. Multimed Tools Appl 80, 30743–30760 (2021). https://doi.org/10.1007/s11042-020-10165-4

  • DOI: https://doi.org/10.1007/s11042-020-10165-4
