Semantic Annotation of Data Processing Pipelines in Scientific Publications

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10249)


Data processing pipelines are a core object of interest for data scientists and practitioners operating in a variety of data-related application domains. To effectively capitalise on the experience gained in the creation and adoption of such pipelines, mechanisms are needed that can capture knowledge about datasets of interest, data processing methods designed to achieve a given goal, and the performance achieved when applying such methods to the considered datasets. However, due to its distributed and often unstructured nature, this knowledge is not easily accessible. In this paper, we use (scientific) publications as a source of knowledge about data processing pipelines. We describe a method designed to classify sentences according to the nature of the information they contain (i.e. scientific objective, dataset, method, software, result) and to extract the relevant named entities. The extracted information is then semantically annotated and published as linked data in open knowledge repositories, according to the DMS ontology for data processing metadata. To demonstrate the effectiveness and performance of our approach, we present the results of a quantitative and qualitative analysis performed on four different conference series.
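The sentence-classification step described above could, in principle, be sketched as a standard supervised text classifier over the five information types. The following is a minimal illustrative sketch only: the toy training sentences, the label set encoding, and the TF-IDF plus logistic-regression model are assumptions for demonstration, not the classifier actually used in the paper.

```python
# Hypothetical sketch of classifying sentences into the five information
# types named in the abstract: objective, dataset, method, software, result.
# Training data and model choice are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled sentences standing in for an annotated scholarly corpus.
train_sentences = [
    "The goal of this work is to improve entity linking.",
    "We aim to detect duplicate records automatically.",
    "Experiments were run on the DBpedia 2015 dump.",
    "The corpus contains 10,000 abstracts from PubMed.",
    "We apply a conditional random field to label tokens.",
    "A support vector machine classifies each sentence.",
    "The system is implemented on top of Apache Spark.",
    "All components were written using the GATE framework.",
    "Our approach achieves an F1 score of 0.82.",
    "Precision improves by 5% over the baseline.",
]
train_labels = [
    "objective", "objective",
    "dataset", "dataset",
    "method", "method",
    "software", "software",
    "result", "result",
]

# TF-IDF features fed into a logistic-regression classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_sentences, train_labels)

def classify_sentence(sentence: str) -> str:
    """Return the predicted information type for a single sentence."""
    return classifier.predict([sentence])[0]
```

In a realistic setting the classifier would be trained on a manually annotated corpus of publication sentences, and its output would feed the named-entity extraction and semantic-annotation stages mentioned in the abstract.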


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Delft University of Technology, Delft, The Netherlands
