
Movie tag prediction: An extreme multi-label multi-modal transformer-based solution with explanation

  • Research
  • Published in: Journal of Intelligent Information Systems

Abstract

Providing rich and accurate metadata for indexing media content is a crucial problem for companies offering streaming entertainment services. Such metadata are commonly used to enhance search-engine results and to feed recommendation algorithms, improving the match with user interests. However, labeling multimedia content with informative tags is challenging: the procedure, performed manually by domain experts, is time-consuming and error-prone. AI-based methods have recently proven effective at automating this complex process, but developing a robust solution requires coping with several difficult issues, such as data noise and the scarcity of labeled examples during training. In this work, we address these challenges by introducing a Transformer-based framework for multi-modal multi-label classification, enriched with model-prediction explanation capabilities. These explanations can help domain experts understand the system's predictions. Experiments on two real-world test cases demonstrate the framework's effectiveness.
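To make the general setup concrete, below is a minimal, hypothetical sketch of multi-modal multi-label classification in PyTorch. It is not the architecture proposed in the paper: it assumes precomputed text and image embeddings (e.g., [CLS] vectors from pretrained BERT and ViT encoders), fuses them by simple concatenation, and scores each tag independently with a sigmoid head. All module names and dimensions are illustrative.

```python
# Hypothetical sketch of a multi-modal multi-label tag classifier.
# Assumes text/image embeddings are precomputed by frozen pretrained
# encoders (e.g., BERT for the plot synopsis, ViT for the poster).
import torch
import torch.nn as nn

class MultiModalTagClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=768, hidden_dim=512, num_tags=100):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),  # late fusion by concatenation
            nn.ReLU(),
            nn.Dropout(p=0.1),
        )
        # One logit per tag: tags are not mutually exclusive, so no softmax.
        self.head = nn.Linear(hidden_dim, num_tags)

    def forward(self, text_emb, image_emb):
        fused = self.fusion(torch.cat([text_emb, image_emb], dim=-1))
        return self.head(fused)  # raw per-tag logits

model = MultiModalTagClassifier()
text_emb = torch.randn(8, 768)                   # stand-in for BERT [CLS] embeddings
image_emb = torch.randn(8, 768)                  # stand-in for ViT [CLS] embeddings
labels = torch.randint(0, 2, (8, 100)).float()   # multi-hot ground-truth tag vectors

logits = model(text_emb, image_emb)
loss = nn.BCEWithLogitsLoss()(logits, labels)    # independent binary loss per tag
predicted = torch.sigmoid(logits) > 0.5          # threshold each tag separately
```

The per-tag sigmoid with binary cross-entropy (rather than a softmax) is what makes the setup multi-label: each tag is an independent yes/no decision, so a movie can receive any subset of the tag vocabulary.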


Data Availability

Data were gathered from different open-source knowledge bases; however, they also integrate proprietary data from a third-party source and therefore cannot be released. The list of movie IDs can be provided upon request.

Notes

  1. https://www.trade.gov/media-entertainment

  2. Source of data: Scopus. Research query: "Movie Genre Classification"

  3. https://grouplens.org/datasets/movielens/25m/


Acknowledgements

This work was partially supported by (i) PON I&C 2014-2020 FESR MISE, Catch 4.0, and (ii) European Union - NextGenerationEU - National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) - Project: “SoBigData.it - Strengthening the Italian RI for Social Mining and Big Data Analytics” - Prot. IR0000013 - Avviso n. 3264 del 28/12/2021.

Funding

Not applicable.

Author information

Contributions

Massimo Guarascio, Marco Minici, and Francesco Sergio Pisani contributed equally to this work and should all be considered first authors. Massimo Guarascio: Investigation, Conceptualization, Writing - original draft, Writing - review & editing. Marco Minici: Investigation, Conceptualization, Software, Writing - original draft, Writing - review & editing. Francesco Sergio Pisani: Investigation, Conceptualization, Software, Writing - original draft, Writing - review & editing. Erika De Francesco: Investigation, Software. Pasquale Lambardi: Writing - review & editing, Funding acquisition.

Corresponding author

Correspondence to Marco Minici.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Guarascio, M., Minici, M., Pisani, F.S. et al. Movie tag prediction: An extreme multi-label multi-modal transformer-based solution with explanation. J Intell Inf Syst (2024). https://doi.org/10.1007/s10844-023-00836-7

