Exploring Textual Features for Multi-label Classification of Portuguese Film Synopses

Portolese, Giuseppe; Domingues, Marcos Aurélio; Feltrim, Valéria Delisandra

doi:10.1007/978-3-030-30244-3_55

Giuseppe Portolese¹¹,
Marcos Aurélio Domingues¹¹ &
Valéria Delisandra Feltrim¹¹

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11805))

Included in the following conference series:

EPIA Conference on Artificial Intelligence

1674 Accesses
4 Citations

Abstract

The multi-label classification of film genres by using features extracted from their synopses has recently gained some attention from the scientific community, however, the number of studies is still limited. These studies are even scarcer for languages other than English. In this work we present the P-TMDb dataset, which contains 13, 394 Portuguese film synopses, and explore the film genre classification by experimenting with nine different groups of textual features and four multi-label algorithms. As our dataset is unbalanced, we also conducted experiments with an oversampled version of the dataset. The best result obtained for the original dataset was achieved by a TF-IDF based classifier, presenting an average F1 score of 0.478, while the best result for the oversampled dataset was achieved by a combination of several feature groups and presented an average F1 score of 0.611.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
https://www.themoviedb.org.
2.
P-TMDb and P-TMDb(+) datasets are available upon request.

References

Austin, A., Moore, E., Gupta, U., Chordia, P.: Characterization of movie genre based on music score. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 421–424. IEEE (2010)
Google Scholar
Balage Filho, P.P., Pardo, T.A.S., Aluísio, S.M.: An evaluation of the Brazilian Portuguese LIWC dictionary for sentiment analysis. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology (2013)
Google Scholar
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
MATH Google Scholar
Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: A first approach to deal with imbalance in multi-label datasets. In: Pan, J.-S., Polycarpou, M.M., Woźniak, M., de Carvalho, A.C.P.L.F., Quintián, H., Corchado, E. (eds.) HAIS 2013. LNCS (LNAI), vol. 8073, pp. 150–160. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40846-5_16
Chapter Google Scholar
Fonseca, E.R., Rosa, J.L.G.: Mac-Morpho revisited: towards robust part-of-speech tagging. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology (2013)
Google Scholar
Hartmann, N., Fonseca, E., Shulby, C., Treviso, M., Rodrigues, J., Aluisio, S.: Portuguese word embeddings: evaluating on word analogies and natural language tasks. arXiv preprint arXiv:1708.06025 (2017)
Herrera, F., Charte, F., Rivera, A.J., del Jesus, M.J.: Multilabel classification. Multilabel Classification, pp. 17–31. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41111-8_2
Chapter Google Scholar
Hinton, G., Salakhutdinov, R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
Article MathSciNet Google Scholar
Ho, K.W.: Movies’ genres classification by synopsis (2011)
Google Scholar
Hoang, Q.: Predicting movie genres based on plot summaries. arXiv preprint arXiv:1801.04813 (2018)
Huang, Y.-F., Wang, S.-H.: Movie genre classification using SVM with audio and video features. In: Huang, R., Ghorbani, A.A., Pasi, G., Yamaguchi, T., Yen, N.Y., Jin, B. (eds.) AMT 2012. LNCS, vol. 7669, pp. 1–10. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35236-2_1
Chapter Google Scholar
Ivasic-Kos, M., Pobar, M., Ipsic, I.: Automatic movie posters classification into genres. In: Bogdanova, A.M., Gjorgjevikj, D. (eds.) ICT Innovations 2014. AISC, vol. 311, pp. 319–328. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-09879-1_32
Chapter Google Scholar
Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Mach. Learn. J. 85(3), 333–359 (2011)
Article MathSciNet Google Scholar
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)
Google Scholar
Monteiro, R.A., Santos, R.L.S., Pardo, T.A.S., de Almeida, T.A., Ruiz, E.E.S., Vale, O.A.: Contributions to the study of fake news in Portuguese: new corpus and automatic detection results. In: Villavicencio, A., et al. (eds.) PROPOR 2018. LNCS (LNAI), vol. 11122, pp. 324–334. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99722-3_33
Chapter Google Scholar
Pearson, K.: X. on the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond. Edinb. Dublin Philos. Mag. J. Sci. 50(302), 157–175 (1900)
Article Google Scholar
Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count: LIWC 2001. Lawrence Erlbaum Associates, Mahway (2001). 71(2001), 2001
Google Scholar
Rahman, R.I., Kadir, S., et al.: Genre classification of movies using their synopsis. Ph.D. thesis, BRAC University (2017)
Google Scholar
Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2011)
Book Google Scholar
Rasheed, Z., Sheikh, Y., Shah, M.: On the use of computable features for film classification. IEEE Trans. Circuits Syst. Video Technol. 15(1), 52–64 (2005)
Article Google Scholar
Read, J., Reutemann, P., Pfahringer, B., Holmes, G.: MEKA: a multi-label/multi-target extension to WEKA. J. Mach. Learn. Res. 17(21), 1–5 (2016). http://jmlr.org/papers/v17/12-164.html
MathSciNet MATH Google Scholar
Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer (2010)
Google Scholar
Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer, Boston (2009). https://doi.org/10.1007/978-0-387-09823-4_34
Chapter Google Scholar
Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multi-label classification. IEEE Trans. Knowl. Data Eng. 23, 1079–1089 (2010)
Article Google Scholar
Wehrmann, J., Barros, R.C.: Convolutions through time for multi-label movie genre classification. In: Proceedings of the Symposium on Applied Computing, pp. 114–119. ACM (2017)
Google Scholar
Zhou, H., Hermans, T., Karandikar, A.V., Rehg, J.M.: Movie genre classification via scene categorization. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 747–750. ACM (2010)
Google Scholar
Zhou, L., Burgoon, J.K., Twitchell, D.P., Qin, T., Nunamaker Jr., J.F.: A comparison of classification methods for predicting deception in computer-mediated communication. J. Manag. Inf. Syst. 20(4), 139–166 (2004)
Article Google Scholar

Download references

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

Author information

Authors and Affiliations

State University of Maringá, Av. Colombo 5790, Maringá, PR, 87020-900, Brazil
Giuseppe Portolese, Marcos Aurélio Domingues & Valéria Delisandra Feltrim

Authors

Giuseppe Portolese
View author publications
You can also search for this author in PubMed Google Scholar
Marcos Aurélio Domingues
View author publications
You can also search for this author in PubMed Google Scholar
Valéria Delisandra Feltrim
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Giuseppe Portolese .

Editor information

Editors and Affiliations

INESC-TEC, University of Trás-os-Montes and Alto Douro, Vila Real, Portugal
Paulo Moura Oliveira
University of Minho, Braga, Portugal
Paulo Novais
LIACC/UP, University of Porto, Porto, Portugal
Luís Paulo Reis

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Portolese, G., Domingues, M.A., Feltrim, V.D. (2019). Exploring Textual Features for Multi-label Classification of Portuguese Film Synopses. In: Moura Oliveira, P., Novais, P., Reis, L. (eds) Progress in Artificial Intelligence. EPIA 2019. Lecture Notes in Computer Science(), vol 11805. Springer, Cham. https://doi.org/10.1007/978-3-030-30244-3_55

Download citation

DOI: https://doi.org/10.1007/978-3-030-30244-3_55
Published: 30 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30243-6
Online ISBN: 978-3-030-30244-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics