Semantically Aware Text Categorisation for Metadata Annotation

  • Giulio Carducci
  • Marco Leontino
  • Daniele P. Radicioni
  • Guido Bonino
  • Enrico Pasini
  • Paolo Tripodi
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 988)


In this paper we present a system aimed at a long-standing and challenging problem: acquiring a classifier that automatically annotates bibliographic records, starting from a large set of unbalanced and unlabelled data. We describe the main features of the dataset, the learning algorithm adopted, and how it was used to discriminate philosophical documents from documents of other disciplines. One strength of our approach lies in the novel combination of a standard learning approach with a semantic one: the results of the acquired classifier are improved by accessing a semantic network containing conceptual information. We describe the experimentation by explaining the rationale behind the construction of the training and test sets, report and discuss the results obtained, and conclude by outlining future work.


Text categorization, Lexical resources, Semantics, NLP, Language models



The authors wish to thank the EThOS staff for their prompt and kind support. Giulio Carducci and Marco Leontino have been supported by the REPOSUM project, BONG_CRT_17_01 funded by Fondazione CRT.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Dipartimento di Informatica, Università degli Studi di Torino, Turin, Italy
  2. Dipartimento di Filosofia, Università degli Studi di Torino, Turin, Italy
