Multimodal movie genre classification using recurrent neural network

  • Published:
Multimedia Tools and Applications

Abstract

Genre is a defining feature of a movie that shapes its structure and target audience. The number of streaming companies interested in automatically deriving movies’ genres is growing rapidly. Genre categorization of trailers is a challenging problem because genre is conceptual: it is not physically present in any single frame and can only be perceived from the trailer as a whole. Moreover, several genres may appear in a movie at the same time. Multi-label learning algorithms have not improved as significantly as single-label classification models, which makes the genre categorization problem even more complicated. In this paper, we propose a novel multimodal deep recurrent model for movie genre classification. A new structure based on the Gated Recurrent Unit (GRU) is designed to derive spatio-temporal features from movie frames. The video features are then concatenated with the audio features to predict the final genres of the movie. The proposed design outperforms state-of-the-art models in both accuracy and computational cost and substantially improves the performance of the movie genre classifier system.
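At a high level, the pipeline can be sketched with the standard GRU update equations [7] followed by late fusion of video and audio features. All dimensions, weight shapes, and the single linear fusion head below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

# Illustrative sizes only (not the paper's exact dimensions)
frame_dim, audio_dim, hidden, n_genres = 32, 16, 8, 9
p = {k: 0.1 * rng.standard_normal((hidden, frame_dim)) for k in ("Wz", "Wr", "Wh")}
p.update({k: 0.1 * rng.standard_normal((hidden, hidden)) for k in ("Uz", "Ur", "Uh")})

frames = rng.standard_normal((108, frame_dim))   # per-frame visual features
audio = rng.standard_normal(audio_dim)           # clip-level audio embedding

h = np.zeros(hidden)
for x in frames:                  # GRU summarizes the frame sequence over time
    h = gru_step(x, h, p)

# Fuse video and audio features, then output independent per-genre probabilities
W_out = 0.1 * rng.standard_normal((n_genres, hidden + audio_dim))
probs = sigmoid(W_out @ np.concatenate([h, audio]))
print(probs.shape)   # (9,)
```

In practice the per-frame features would come from a pretrained CNN and the audio embedding from an audio network, and a trained model would threshold each sigmoid output (e.g. at 0.5) to obtain the multi-label genre set.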



Notes

  1. https://github.com/Tinbeh97/MovieGenre

References

  1. Aytar Y, Vondrick C, Torralba A (2016) Soundnet: Learning sound representations from unlabeled video. In: Advances in neural information processing systems, pp 892–900

  2. Álvarez F, Sánchez F, Hernández-Peñaloza G, Jiménez D, Menéndez JM, Cisneros G (2019) On the influence of low-level visual features in film classification. PloS one 14(2):e0211406


  3. Badamdorj T, Rochan M, Wang Y, Cheng L (2021) Joint visual and audio learning for video highlight detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 8127–8137

  4. Ben-Ahmed O, Huet B (2018) Deep multimodal features for movie genre and interestingness prediction. In: 2018 International Conference on Content-Based Multimedia Indexing (CBMI), IEEE, pp 1–6

  5. Bhoraniya DM, Ratanpara TV (2017) A survey on video genre classification techniques. In: 2017 International conference on intelligent computing and control (I2C2), IEEE, pp 1–5

  6. Choroś K (2019) Fast method of video genre categorization for temporally aggregated broadcast videos. J Intell Fuzzy Syst, preprint, pp 1–11

  7. Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555

  8. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297


  9. Fu S, Liu W, Tao D, Zhou Y, Nie L (2020) hesGCN: Hessian graph convolutional networks for semi-supervised classification. Inf Sci 514:484–498


  10. Fu S, Liu W, Zhang K, Zhou Y (2021) Example-feature graph convolutional networks for semi-supervised classification. Neurocomputing 461:63–76


  11. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, pp 770–778

  12. Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B et al (2017) CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 131–135

  13. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780


  14. Huang X, Acero A, Hon H-W (2001) Spoken language processing: A guide to theory, algorithm, and system development (foreword by Reddy R). Prentice Hall PTR

  15. Huang Y-F, Wang S-H (2012) Movie genre classification using SVM with audio and video features. In: International conference on active media technology, Springer, pp 1–10

  16. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167

  17. Jain SK, Jadon R (2009) Movies genres classifier using neural network. In: 2009 24th International Symposium on Computer and Information Sciences, IEEE, pp 575–580

  18. Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980

  19. Li Y, Tarlow D, Brockschmidt M, Zemel R (2015) Gated graph sequence neural networks. arXiv:1511.05493

  20. Liu W, Ma X, Zhou Y, Tao D, Cheng J (2018) p-Laplacian regularization for scene recognition. IEEE Trans Cybern 49(8):2927–2940


  21. Mangolin RB, Pereira RM, Britto AS, Silla CN, Feltrim VD, Bertolini D, Costa YM (2020) A multimodal approach for multi-label movie genre classification. Multimed Tools Appl, pp 1–26

  22. Oliva A, Torralba A (2001) Modeling the shape of the scene: A holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175


  23. Pant P, Sabitha AS, Choudhury T, Dhingra P (2019) Multi-label classification trending challenges and approaches. In: Emerging trends in expert applications and security. Springer, pp 433–444

  24. Pillai I, Fumera G, Roli F (2013) Threshold optimisation for multi-label classifiers. Pattern Recogn 46(7):2055–2065


  25. Rasheed Z, Sheikh Y, Shah M (2005) On the use of computable features for film classification. IEEE Trans Circuits Syst Video Technol 15(1):52–64


  26. Rawat W, Wang Z (2017) Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput 29(9):2352–2449


  27. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

  28. Schwarz D, O’Leary S (2015) Smooth granular sound texture synthesis by control of timbral similarity. In: Sound and Music Computing (SMC), p 6

  29. Serban IV, Sordoni A, Bengio Y, Courville A, Pineau J (2015) Hierarchical neural network generative models for movie dialogues. arXiv:1507.04808

  30. Simões GS, Wehrmann J, Barros RC, Ruiz DD (2016) Movie genre classification with convolutional neural networks. In: 2016 International joint conference on neural networks (IJCNN), IEEE, pp 259–266

  31. Sivaraman K, Somappa G (2016) Moviescope: Movie trailer classification using deep neural networks. University of Virginia

  32. Srinivas S, Sarvadevabhatla RK, Mopuri KR, Prabhu N, Kruthiventi SS, Babu RV (2016) A taxonomy of deep convolutional neural nets for computer vision. Front Robot AI 2:36


  33. Thompson K, Smith J (2008) Film art: An introduction. McGraw-Hill Higher Education

  34. Tian Y, Xu C (2021) Can audio-visual integration strengthen robustness under multimodal attacks?. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5601–5611

  35. Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605


  36. Varghese J, Nair KR (2019) A novel video genre classification algorithm by keyframe relevance. In: Information and communication technology for intelligent systems. Springer, pp 685–696

  37. Wang W, Yang Y, Wang X, Wang W, Li J (2019) Development of convolutional neural network and its application in image classification: a survey. Opt Eng 58(4):040901


  38. Wehrmann J, Barros RC (2017) Movie genre classification: A multi-label approach based on convolutions through time. Appl Soft Comput 61:973–982


  39. Wehrmann J, Barros RC, Simões GS, Paula TS, Ruiz DD (2016) (Deep) learning from frames. In: 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), IEEE, pp 1–6

  40. Wessel DL (1979) Timbre space as a musical control structure. Computer Music Journal, pp 45–52

  41. Wu J, Rehg JM (2008) Where am I: Place instance and category recognition using spatial PACT. In: 2008 IEEE Conference on computer vision and pattern recognition, IEEE, pp 1–8

  42. Yi Y, Li A, Zhou X (2020) Human action recognition based on action relevance weighted encoding. Signal Process Image Commun 80:115640


  43. Yu Y, Lu Z, Li Y, Liu D (2021) ASTS: Attention based spatio-temporal sequential framework for movie trailer genre classification. Multimed Tools Appl 80(7):9749–9764


  44. Zhou H, Hermans T, Karandikar AV, Rehg JM (2010) Movie genre classification via scene categorization. In: Proceedings of the 18th ACM international conference on multimedia, pp 747–750

  45. Zhou Y, Zhang L, Yi Z (2019) Predicting movie box-office revenues using deep neural networks. Neural Comput & Applic 31(6):1855–1865



Author information


Corresponding author

Correspondence to Mohammad Ali Akhaee.

Ethics declarations

Competing interests

The authors declare they have no competing interests.


Appendices

Appendix A: LSTM merge

The LSTM network is a well-known architecture for classifying data that changes through time and can learn long-term dependencies within a sequence [13]. The designed LSTM model is presented in Fig. 9, where L represents the number of classes. Tables 11 and 12 indicate that the LSTM+SVM network achieves lower scores and higher Hamming loss than both the GRU+SVM and 1D_Conv+SVM models.

Fig. 9 LSTM model. LSTM.i, i = 1,...,12, denotes the ith parallel LSTM network. The outputs of these 12 networks are merged and fed into the final sigmoid layer

Table 11 LSTM network hamming loss and AUC micro, macro, and weighted scores
Table 12 LSTM network AUC score for 9 genres data
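For reference, the Hamming loss and the micro, macro, and weighted AUC variants used in these tables can be computed from scratch. The data below is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genres = 200, 9
y_true = rng.integers(0, 2, size=(n_samples, n_genres))     # multi-label targets
y_score = y_true + rng.normal(0.0, 0.6, size=y_true.shape)  # noisy synthetic scores

# Hamming loss: fraction of label slots predicted incorrectly at threshold 0.5
y_pred = (y_score >= 0.5).astype(int)
hamming = float(np.mean(y_pred != y_true))

def auc(y, s):
    """ROC AUC via the Mann-Whitney statistic: the probability that a
    random positive example is scored above a random negative one."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

per_genre = np.array([auc(y_true[:, g], y_score[:, g]) for g in range(n_genres)])
support = y_true.sum(axis=0)                        # positives per genre

micro = auc(y_true.ravel(), y_score.ravel())        # pool all label slots together
macro = float(per_genre.mean())                     # unweighted mean over genres
weighted = float(np.average(per_genre, weights=support))  # weight by genre frequency
print(round(micro, 3), round(macro, 3), round(weighted, 3))
```

The same quantities are available in scikit-learn as `hamming_loss` and `roc_auc_score(..., average='micro'|'macro'|'weighted')`; a lower Hamming loss and higher AUC indicate a better multi-label classifier.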

Appendix B: GRU merge

Figure 10 shows the network structure of the GRU merge model. Separating the frames into 12 consecutive inputs of nine frames each reduces the computational complexity (Fig. 5) but increases the Hamming loss and decreases all AUC and F1 scores (Tables 13 and 14). For GRU merge, the training time per epoch is reduced by half compared to the proposed GRU model.

Fig. 10 GRU merge model. GRU.i, i = 1,...,12, denotes the ith parallel GRU network. The outputs of these 12 networks are merged and fed into the final sigmoid layer

Table 13 GRU merge network hamming loss and AUC micro, macro, and weighted scores
Table 14 GRU merge network AUC score for 9 genres data
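A minimal sketch of this chunk-and-merge idea is shown below. A single shared set of GRU weights stands in for the 12 separate GRU.i networks, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_gru(chunk, p, hidden):
    """Summarize one chunk of frames with a GRU; return the final hidden state."""
    h = np.zeros(hidden)
    for x in chunk:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)                   # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)                   # reset gate
        h = (1 - z) * h + z * np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))
    return h

frame_dim, hidden, n_genres = 32, 8, 9
p = {k: 0.1 * rng.standard_normal((hidden, frame_dim)) for k in ("Wz", "Wr", "Wh")}
p.update({k: 0.1 * rng.standard_normal((hidden, hidden)) for k in ("Uz", "Ur", "Uh")})

frames = rng.standard_normal((108, frame_dim))
chunks = frames.reshape(12, 9, frame_dim)   # 12 consecutive inputs of 9 frames each

# Each chunk is summarized independently (so the 12 GRUs can run in parallel),
# then the summaries are merged and fed to the final sigmoid layer.
merged = np.concatenate([run_gru(c, p, hidden) for c in chunks])  # (12 * hidden,)
W_out = 0.1 * rng.standard_normal((n_genres, merged.size))
probs = sigmoid(W_out @ merged)
print(probs.shape)   # (9,)
```

Each GRU now unrolls over 9 steps instead of 108, which is consistent with the reported halving of per-epoch training time, at the cost of losing long-range temporal context across chunk boundaries.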


About this article


Cite this article

Behrouzi, T., Toosi, R. & Akhaee, M.A. Multimodal movie genre classification using recurrent neural network. Multimed Tools Appl 82, 5763–5784 (2023). https://doi.org/10.1007/s11042-022-13418-6

