Multimodal movie genre classification using recurrent neural network

  • Published:
Multimedia Tools and Applications

Abstract

Genre is a defining feature of a movie that shapes its structure and target audience. The number of streaming companies interested in automatically deriving movies’ genres is growing rapidly. Genre categorization of trailers is a challenging problem because genre is conceptual: it is not physically present in any single frame and can only be perceived from the trailer as a whole. Moreover, several genres may appear in a movie at the same time. Multi-label learning algorithms have not improved as significantly as single-label classification models, which makes the genre categorization problem even more complicated. In this paper, we propose a novel multimodal deep recurrent model for movie genre classification. A new structure based on the Gated Recurrent Unit (GRU) is designed to derive spatio-temporal features from movie frames. The video features are then concatenated with the audio features to predict the final genres of the movie. The proposed design outperforms state-of-the-art models in both accuracy and computational cost and substantially improves the performance of the movie genre classifier system.
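At a high level, the pipeline can be sketched with the standard GRU update equations [7] followed by late fusion of video and audio features. All dimensions, weight shapes, and the single linear fusion head below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

# Illustrative sizes only (not the paper's exact dimensions)
frame_dim, audio_dim, hidden, n_genres = 32, 16, 8, 9
p = {k: 0.1 * rng.standard_normal((hidden, frame_dim)) for k in ("Wz", "Wr", "Wh")}
p.update({k: 0.1 * rng.standard_normal((hidden, hidden)) for k in ("Uz", "Ur", "Uh")})

frames = rng.standard_normal((108, frame_dim))   # per-frame visual features
audio = rng.standard_normal(audio_dim)           # clip-level audio embedding

h = np.zeros(hidden)
for x in frames:                  # GRU summarizes the frame sequence over time
    h = gru_step(x, h, p)

# Fuse video and audio features, then output independent per-genre probabilities
W_out = 0.1 * rng.standard_normal((n_genres, hidden + audio_dim))
probs = sigmoid(W_out @ np.concatenate([h, audio]))
print(probs.shape)   # (9,)
```

In practice the per-frame features would come from a pretrained CNN and the audio embedding from an audio network, and a trained model would threshold each sigmoid output (e.g. at 0.5) to obtain the multi-label genre set.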



Notes

  1. https://github.com/Tinbeh97/MovieGenre

References

  1. Aytar Y, Vondrick C, Torralba A (2016) Soundnet: Learning sound representations from unlabeled video. In: Advances in neural information processing systems, pp 892–900

  2. Álvarez F, Sánchez F, Hernández-Peñaloza G, Jiménez D, Menéndez JM, Cisneros G (2019) On the influence of low-level visual features in film classification. PloS one 14(2):e0211406


  3. Badamdorj T, Rochan M, Wang Y, Cheng L (2021) Joint visual and audio learning for video highlight detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 8127–8137

  4. Ben-Ahmed O, Huet B (2018) Deep multimodal features for movie genre and interestingness prediction. In: 2018 International Conference on Content-Based Multimedia Indexing (CBMI), IEEE, pp 1–6

  5. Bhoraniya DM, Ratanpara TV (2017) A survey on video genre classification techniques. In: 2017 International conference on intelligent computing and control (I2C2), IEEE, pp 1–5

  6. Choroś K (2019) Fast method of video genre categorization for temporally aggregated broadcast videos. J Intell Fuzzy Syst, preprint, pp 1–11

  7. Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555

  8. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297


  9. Fu S, Liu W, Tao D, Zhou Y, Nie L (2020) hesGCN: Hessian graph convolutional networks for semi-supervised classification. Inf Sci 514:484–498


  10. Fu S, Liu W, Zhang K, Zhou Y (2021) Example-feature graph convolutional networks for semi-supervised classification. Neurocomputing 461:63–76


  11. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, pp 770–778

  12. Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B et al (2017) CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 131–135

  13. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780


  14. Huang X, Acero A, Hon H-W (2001) Spoken language processing: A guide to theory, algorithm, and system development (foreword by Reddy R). Prentice Hall PTR

  15. Huang Y-F, Wang S-H (2012) Movie genre classification using SVM with audio and video features. In: International conference on active media technology, Springer, pp 1–10

  16. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167

  17. Jain SK, Jadon R (2009) Movies genres classifier using neural network. In: 2009 24th International Symposium on Computer and Information Sciences, IEEE, pp 575–580

  18. Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980

  19. Li Y, Tarlow D, Brockschmidt M, Zemel R (2015) Gated graph sequence neural networks. arXiv:1511.05493

  20. Liu W, Ma X, Zhou Y, Tao D, Cheng J (2018) p-Laplacian regularization for scene recognition. IEEE Trans Cybern 49(8):2927–2940


  21. Mangolin RB, Pereira RM, Britto AS, Silla CN, Feltrim VD, Bertolini D, Costa YM (2020) A multimodal approach for multi-label movie genre classification. Multimed Tools Appl, pp 1–26

  22. Oliva A, Torralba A (2001) Modeling the shape of the scene: A holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175


  23. Pant P, Sabitha AS, Choudhury T, Dhingra P (2019) Multi-label classification trending challenges and approaches. In: Emerging trends in expert applications and security. Springer, pp 433–444

  24. Pillai I, Fumera G, Roli F (2013) Threshold optimisation for multi-label classifiers. Pattern Recogn 46(7):2055–2065


  25. Rasheed Z, Sheikh Y, Shah M (2005) On the use of computable features for film classification. IEEE Trans Circuits Syst Video Technol 15(1):52–64


  26. Rawat W, Wang Z (2017) Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput 29(9):2352–2449


  27. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

  28. Schwarz D, O’Leary S (2015) Smooth granular sound texture synthesis by control of timbral similarity. In: Sound and Music Computing (SMC), p 6

  29. Serban IV, Sordoni A, Bengio Y, Courville A, Pineau J (2015) Hierarchical neural network generative models for movie dialogues. arXiv:1507.04808

  30. Simões GS, Wehrmann J, Barros RC, Ruiz DD (2016) Movie genre classification with convolutional neural networks. In: 2016 International joint conference on neural networks (IJCNN), IEEE, pp 259–266

  31. Sivaraman K, Somappa G (2016) Moviescope: Movie trailer classification using deep neural networks. University of Virginia

  32. Srinivas S, Sarvadevabhatla RK, Mopuri KR, Prabhu N, Kruthiventi SS, Babu RV (2016) A taxonomy of deep convolutional neural nets for computer vision. Front Robot AI 2:36


  33. Thompson K, Smith J (2008) Film art: An introduction. McGraw-Hill Higher Education

  34. Tian Y, Xu C (2021) Can audio-visual integration strengthen robustness under multimodal attacks?. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5601–5611

  35. Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605


  36. Varghese J, Nair KR (2019) A novel video genre classification algorithm by keyframe relevance. In: Information and communication technology for intelligent systems. Springer, pp 685–696

  37. Wang W, Yang Y, Wang X, Wang W, Li J (2019) Development of convolutional neural network and its application in image classification: a survey. Opt Eng 58(4):040901


  38. Wehrmann J, Barros RC (2017) Movie genre classification: A multi-label approach based on convolutions through time. Appl Soft Comput 61:973–982


  39. Wehrmann J, Barros RC, Simões GS, Paula TS, Ruiz DD (2016) (Deep) learning from frames. In: 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), IEEE, pp 1–6

  40. Wessel DL (1979) Timbre space as a musical control structure. Computer Music Journal, pp 45–52

  41. Wu J, Rehg JM (2008) Where am I: Place instance and category recognition using spatial PACT. In: 2008 IEEE Conference on computer vision and pattern recognition, IEEE, pp 1–8

  42. Yi Y, Li A, Zhou X (2020) Human action recognition based on action relevance weighted encoding. Signal Process Image Commun 80:115640


  43. Yu Y, Lu Z, Li Y, Liu D (2021) ASTS: Attention based spatio-temporal sequential framework for movie trailer genre classification. Multimed Tools Appl 80(7):9749–9764


  44. Zhou H, Hermans T, Karandikar AV, Rehg JM (2010) Movie genre classification via scene categorization. In: Proceedings of the 18th ACM international conference on multimedia, pp 747–750

  45. Zhou Y, Zhang L, Yi Z (2019) Predicting movie box-office revenues using deep neural networks. Neural Comput & Applic 31(6):1855–1865



Author information


Corresponding author

Correspondence to Mohammad Ali Akhaee.

Ethics declarations

Competing interests

The authors declare they have no competing interests.


Appendices

Appendix A: LSTM merge

The LSTM network is a well-known architecture for classifying data that changes through time and can learn long-term dependencies within a sequence [13]. The designed LSTM model is presented in Fig. 9, where L represents the number of classes. Tables 11 and 12 indicate that the LSTM+SVM network achieves lower scores and higher Hamming loss than both the GRU+SVM and 1D_Conv+SVM models.

Fig. 9 LSTM model. LSTM.i, i = 1,...,12, denotes the ith parallel LSTM network. The outputs of these 12 networks are merged and fed into the final sigmoid layer

Table 11 LSTM network hamming loss and AUC micro, macro, and weighted scores
Table 12 LSTM network AUC score for 9 genres data
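For reference, the Hamming loss and the micro, macro, and weighted AUC variants used in these tables can be computed from scratch. The data below is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genres = 200, 9
y_true = rng.integers(0, 2, size=(n_samples, n_genres))     # multi-label targets
y_score = y_true + rng.normal(0.0, 0.6, size=y_true.shape)  # noisy synthetic scores

# Hamming loss: fraction of label slots predicted incorrectly at threshold 0.5
y_pred = (y_score >= 0.5).astype(int)
hamming = float(np.mean(y_pred != y_true))

def auc(y, s):
    """ROC AUC via the Mann-Whitney statistic: the probability that a
    random positive example is scored above a random negative one."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

per_genre = np.array([auc(y_true[:, g], y_score[:, g]) for g in range(n_genres)])
support = y_true.sum(axis=0)                        # positives per genre

micro = auc(y_true.ravel(), y_score.ravel())        # pool all label slots together
macro = float(per_genre.mean())                     # unweighted mean over genres
weighted = float(np.average(per_genre, weights=support))  # weight by genre frequency
print(round(micro, 3), round(macro, 3), round(weighted, 3))
```

The same quantities are available in scikit-learn as `hamming_loss` and `roc_auc_score(..., average='micro'|'macro'|'weighted')`; a lower Hamming loss and higher AUC indicate a better multi-label classifier.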

Appendix B: GRU merge

Figure 10 shows the network structure of the GRU merge model. Separating the frames into 12 consecutive inputs of nine frames each reduces the computational complexity (Fig. 5) but increases the Hamming loss and decreases all AUC and F1 scores (Tables 13 and 14). For GRU merge, the training time per epoch is reduced by half compared to the proposed GRU model.

Fig. 10 GRU merge model. GRU.i, i = 1,...,12, denotes the ith parallel GRU network. The outputs of these 12 networks are merged and fed into the final sigmoid layer

Table 13 GRU merge network hamming loss and AUC micro, macro, and weighted scores
Table 14 GRU merge network AUC score for 9 genres data
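A minimal sketch of this chunk-and-merge idea is shown below. A single shared set of GRU weights stands in for the 12 separate GRU.i networks, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_gru(chunk, p, hidden):
    """Summarize one chunk of frames with a GRU; return the final hidden state."""
    h = np.zeros(hidden)
    for x in chunk:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)                   # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)                   # reset gate
        h = (1 - z) * h + z * np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))
    return h

frame_dim, hidden, n_genres = 32, 8, 9
p = {k: 0.1 * rng.standard_normal((hidden, frame_dim)) for k in ("Wz", "Wr", "Wh")}
p.update({k: 0.1 * rng.standard_normal((hidden, hidden)) for k in ("Uz", "Ur", "Uh")})

frames = rng.standard_normal((108, frame_dim))
chunks = frames.reshape(12, 9, frame_dim)   # 12 consecutive inputs of 9 frames each

# Each chunk is summarized independently (so the 12 GRUs can run in parallel),
# then the summaries are merged and fed to the final sigmoid layer.
merged = np.concatenate([run_gru(c, p, hidden) for c in chunks])  # (12 * hidden,)
W_out = 0.1 * rng.standard_normal((n_genres, merged.size))
probs = sigmoid(W_out @ merged)
print(probs.shape)   # (9,)
```

Each GRU now unrolls over 9 steps instead of 108, which is consistent with the reported halving of per-epoch training time, at the cost of losing long-range temporal context across chunk boundaries.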


About this article


Cite this article

Behrouzi, T., Toosi, R. & Akhaee, M.A. Multimodal movie genre classification using recurrent neural network. Multimed Tools Appl 82, 5763–5784 (2023). https://doi.org/10.1007/s11042-022-13418-6

