The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles

  • Angelo A. Salatino
  • Francesco Osborne
  • Thiviyan Thanapalasingam
  • Enrico Motta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11799)


Classifying research papers according to their research topics is an important task: it improves their retrievability, assists the creation of smart analytics, and supports a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles, yielding a significant improvement over alternative methods.
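The input/output contract described in the abstract can be illustrated with a minimal sketch. It is not the CSO Classifier itself: the topic set `TOY_TOPICS`, the function names, and the plain n-gram matching are assumptions made for illustration only, and the real ontology and method are not reproduced here.

```python
import re

# Hypothetical stand-in for an ontology: a handful of topic labels.
# This is NOT the Computer Science Ontology, only a toy example.
TOY_TOPICS = {"ontology", "semantic web", "text mining", "word embeddings"}


def ngrams(tokens, max_n=3):
    """Yield every run of 1..max_n consecutive tokens as a single string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def classify(paper, topics=TOY_TOPICS):
    """Return the sorted topic labels that occur verbatim in the paper metadata.

    `paper` is a dict with optional keys "title", "abstract" (strings)
    and "keywords" (a list of strings).
    """
    text = " ".join(
        [paper.get("title", ""), paper.get("abstract", "")]
        + list(paper.get("keywords", []))
    )
    # Lowercase and keep alphanumeric tokens only.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({gram for gram in ngrams(tokens) if gram in topics})


paper = {
    "title": "Text mining for the Semantic Web",
    "abstract": "We use word embeddings.",
    "keywords": ["Ontology"],
}
print(classify(paper))
```

The sketch only matches topic labels that appear literally in the text; it shows the shape of the task (metadata in, ontology concepts out), not how the paper selects or ranks concepts.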


Keywords: Scholarly data, Digital libraries, Bibliographic data, Ontology, Text mining, Topic detection, Word embeddings, Science of science



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Angelo A. Salatino (1), corresponding author
  • Francesco Osborne (1)
  • Thiviyan Thanapalasingam (1)
  • Enrico Motta (1)
  1. Knowledge Media Institute, The Open University, Milton Keynes, UK
