Ontology Driven Extraction of Research Processes

  • Vayianos Pertsas
  • Panos Constantopoulos
  • Ion Androutsopoulos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11136)


We address the automatic extraction, from publications, of two key concepts for representing research processes: the concept of research activity and the sequence relation between successive activities. These representations are driven by the Scholarly Ontology, which was specifically conceived for documenting research processes. Unlike typical named entity recognition and relation extraction tasks, ours involves textual descriptions of activities that vary widely in length, and pairs of successive activities often span multiple sentences. We developed and experimented with several sliding window classifiers using Logistic Regression, SVMs, and Random Forests, as well as a two-stage pipeline classifier. Our classifiers employ task-specific features, as well as word, part-of-speech and dependency embeddings, engineered to exploit distinctive traits of research publications written in English. The extracted activities and sequences are associated with other relevant information from publication metadata and stored as RDF triples in a knowledge base. Evaluation on datasets from three disciplines (Digital Humanities, Bioinformatics, and Medicine) shows very promising performance.
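To make the sliding window idea concrete, the following is a minimal sketch of how each token might be described by itself and its neighbours before being passed to a per-token classifier such as Logistic Regression, an SVM, or a Random Forest. It is an illustration under stated assumptions, not the feature set used in the paper: the function names, the `<PAD>` symbol, the window size, and the example sentence are all hypothetical, and the task-specific features and embeddings mentioned in the abstract are omitted.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the text


def window_features(tokens: List[str], index: int, size: int = 2) -> Dict[str, str]:
    """Describe the token at `index` by itself and `size` neighbours on each side."""
    features = {}
    for offset in range(-size, size + 1):
        position = index + offset
        inside = 0 <= position < len(tokens)
        # Key encodes the relative position, e.g. word[-1], word[+0], word[+1]
        features[f"word[{offset:+d}]"] = tokens[position].lower() if inside else PAD
    return features


def sliding_windows(tokens: List[str], size: int = 2) -> List[Dict[str, str]]:
    """Build one feature dictionary per token, in token order."""
    return [window_features(tokens, i, size) for i in range(len(tokens))]


# Illustrative sentence (not taken from the paper's datasets)
tokens = "We annotated the corpus manually".split()
rows = sliding_windows(tokens, size=1)
```

Each dictionary in `rows` could then be vectorised and labelled (for example, as inside or outside an activity description) to train a classifier; how the authors actually encode and label windows is not specified in the abstract.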


Ontology population; Information extraction; Machine learning methodologies; Linked data



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Vayianos Pertsas (1)
  • Panos Constantopoulos (1, 2)
  • Ion Androutsopoulos (1, 2)
  1. Department of Informatics, Athens University of Economics and Business, Athens, Greece
  2. Digital Curation Unit, IMSI - Athena Research Centre, Athens, Greece
