
Movie tag prediction: An extreme multi-label multi-modal transformer-based solution with explanation

  • Research
  • Published in: Journal of Intelligent Information Systems

Abstract

Providing rich and accurate metadata for indexing media content is a crucial problem for companies offering streaming entertainment services. Such metadata are commonly used to enhance search-engine results and to feed recommendation algorithms, improving the match with user interests. However, labeling multimedia content with informative tags is challenging: the procedure, performed manually by domain experts, is time-consuming and error-prone. AI-based methods have recently proven effective at automating this complex process, but developing a robust solution requires coping with several difficult issues, such as data noise and the scarcity of labeled examples during training. In this work, we address these challenges by introducing a Transformer-based framework for multi-modal multi-label classification, enriched with model-prediction explanation capabilities. These explanations can help domain experts understand the system's predictions. Experiments on two real-world test cases demonstrate the framework's effectiveness.
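To make the general setup concrete, below is a minimal, hypothetical sketch of multi-modal multi-label classification in PyTorch. It is not the architecture proposed in the paper: it assumes precomputed text and image embeddings (e.g., [CLS] vectors from pretrained BERT and ViT encoders), fuses them by simple concatenation, and scores each tag independently with a sigmoid head. All module names and dimensions are illustrative.

```python
# Hypothetical sketch of a multi-modal multi-label tag classifier.
# Assumes text/image embeddings are precomputed by frozen pretrained
# encoders (e.g., BERT for the plot synopsis, ViT for the poster).
import torch
import torch.nn as nn

class MultiModalTagClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=768, hidden_dim=512, num_tags=100):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),  # late fusion by concatenation
            nn.ReLU(),
            nn.Dropout(p=0.1),
        )
        # One logit per tag: tags are not mutually exclusive, so no softmax.
        self.head = nn.Linear(hidden_dim, num_tags)

    def forward(self, text_emb, image_emb):
        fused = self.fusion(torch.cat([text_emb, image_emb], dim=-1))
        return self.head(fused)  # raw per-tag logits

model = MultiModalTagClassifier()
text_emb = torch.randn(8, 768)                   # stand-in for BERT [CLS] embeddings
image_emb = torch.randn(8, 768)                  # stand-in for ViT [CLS] embeddings
labels = torch.randint(0, 2, (8, 100)).float()   # multi-hot ground-truth tag vectors

logits = model(text_emb, image_emb)
loss = nn.BCEWithLogitsLoss()(logits, labels)    # independent binary loss per tag
predicted = torch.sigmoid(logits) > 0.5          # threshold each tag separately
```

The per-tag sigmoid with binary cross-entropy (rather than a softmax) is what makes the setup multi-label: each tag is an independent yes/no decision, so a movie can receive any subset of the tag vocabulary.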


Data Availability

Data were gathered from different open-source knowledge bases; however, they also integrate proprietary data from a third-party source and therefore cannot be released. The list of movie IDs can be provided upon request.

Notes

  1. https://www.trade.gov/media-entertainment

  2. Source of data: Scopus. Research query: "Movie Genre Classification"

  3. https://grouplens.org/datasets/movielens/25m/


Acknowledgements

This work was partially supported by (i) PON I&C 2014-2020 FESR MISE, Catch 4.0, and (ii) European Union - NextGenerationEU - National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) - Project: “SoBigData.it - Strengthening the Italian RI for Social Mining and Big Data Analytics” - Prot. IR0000013 - Avviso n. 3264 del 28/12/2021.

Funding

Not applicable.

Author information

Contributions

Massimo Guarascio, Marco Minici, and Francesco Sergio Pisani contributed equally to this work and should all be considered first authors. Massimo Guarascio: Investigation, Conceptualization, Writing - original draft, Writing - review & editing. Marco Minici: Investigation, Conceptualization, Software, Writing - original draft, Writing - review & editing. Francesco Sergio Pisani: Investigation, Conceptualization, Software, Writing - original draft, Writing - review & editing. Erika De Francesco: Investigation, Software. Pasquale Lambardi: Writing - review & editing, Funding acquisition.

Corresponding author

Correspondence to Marco Minici.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Guarascio, M., Minici, M., Pisani, F.S. et al. Movie tag prediction: An extreme multi-label multi-modal transformer-based solution with explanation. J Intell Inf Syst (2024). https://doi.org/10.1007/s10844-023-00836-7

