Semantic Web Evaluation Challenge

Semantic Web Evaluation Challenges pp 105-116

Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 548)

Abstract

Scholarly publishing increasingly requires automated systems that semantically enrich documents in order to support management and quality assessment of scientific output. However, contextual information, such as the authors’ affiliations, references, and funding agencies, is typically hidden within PDF files. To access this information we have developed a processing pipeline that analyses the structure of a PDF document incorporating a diverse set of machine learning techniques. First, unsupervised learning is used to extract contiguous text blocks from the raw character stream as the basic logical units of the article. Next, supervised learning is employed to classify blocks into different meta-data categories, including authors and affiliations. Then, a set of heuristics are applied to detect the reference section at the end of the paper and segment it into individual reference strings. Sequence classification is then utilised to categorise the tokens of individual references to obtain information such as the journal and the year of the reference. Finally, we make use of named entity recognition techniques to extract references to research grants, funding agencies, and EU projects. Our system is modular in nature. Some parts rely on models learnt on training data, and the overall performance scales with the quality of these data sets.

Keywords

PDF extraction Machine learning Named entity recognition 

References

  1. 1.
    Aiello, M., Monz, C., Todoran, L., Worring, M.: Document understanding for a broad class of documents. Int. J. Doc. Anal. Recogn. 5(1), 1–16 (2002)CrossRefMATHGoogle Scholar
  2. 2.
    Berger, A.L., Pietra, V.J.D., Pietra, S.A.D.: A maximum entropy approach to natural language processing. Comput. Linguist. 22(1), 39–71 (1996)Google Scholar
  3. 3.
    Councill, I.G., Giles, C.L., Kan, M.Y.: ParsCit: an open-source CRF reference string parsing package. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odjik, J., Piperidis, S., Tapias, D. (eds.) Proceedings of LREC, vol. 2008, pp. 661–667. Citeseer, European Language Resources Association (ELRA) (2008)Google Scholar
  4. 4.
    Kern, R., Jack, K., Hristakeva, M., Granitzer, M.: TeamBeam - meta-data extraction from scientific literature. D-Lib Mag. 18(7/8) (2012)Google Scholar
  5. 5.
    Kern, R., Klampfl, S.: Extraction of references using layout and formatting information from scientific articles. D-Lib Mag. 19(9/10) (2013)Google Scholar
  6. 6.
    Klampfl, S., Granitzer, M., Jack, K., Kern, R.: Unsupervised document structure analysis of digital scientific articles. Int. J. Digit. Libr. 14(3–4), 83–99 (2014)CrossRefGoogle Scholar
  7. 7.
    Klampfl, S., Kern, R.: An unsupervised machine learning approach to body text and table of contents extraction from digital scientific articles. In: Aalberg, T., Papatheodorou, C., Dobreva, M., Tsakonas, G., Farrugia, C.J. (eds.) TPDL 2013. LNCS, vol. 8092, pp. 144–155. Springer, Heidelberg (2013) CrossRefGoogle Scholar
  8. 8.
    Kröll, M., Klampfl, S., Kern, R.: Towards a marketplace for the scientific community: accessing knowledge from the computer science domain. D-Lib Mag. 20(11/12) (2014)Google Scholar
  9. 9.
    Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning (ICML-2001), pp. 282–289 (2001)Google Scholar
  10. 10.
    Ratnaparkhi, A.: Maximum entropy models for natural langual ambiguity resolution. Ph.D. thesis (1998)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. 1.Know-Center GmbHGrazAustria

Personalised recommendations