An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles

  • Stefan Klampfl
  • Roman Kern
Conference paper

DOI: 10.1007/978-3-642-40501-3_15

Part of the Lecture Notes in Computer Science book series (LNCS, volume 8092)
Cite this paper as:
Klampfl S., Kern R. (2013) An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. In: Aalberg T., Papatheodorou C., Dobreva M., Tsakonas G., Farrugia C.J. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2013. Lecture Notes in Computer Science, vol 8092. Springer, Berlin, Heidelberg

Abstract

Scientific articles are predominantly stored in digital document formats, which are optimised for presentation but lack structural information. This makes it difficult to access the documents’ content, for example for information retrieval. We have developed a processing pipeline that makes use of unsupervised machine learning techniques and heuristics to detect the logical structure of a PDF document. Our system uses only information available from the current document and does not require any pre-trained model. Starting from a set of contiguous text blocks extracted from the PDF file, we first determine geometrical relations between these blocks. These relations, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on this logical structure we finally extract the body text and the table of contents of a scientific article. We evaluate our pipeline on a number of datasets and compare it with state-of-the-art document structure analysis approaches.
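
The idea of labelling text blocks using only statistics from the current document can be illustrated with a small sketch. It is not the pipeline described in the paper: it replaces the unsupervised learning step with a simple frequency heuristic (the font size that carries the most characters is taken to be body text, and larger sizes are taken to be headings). The `TextBlock` type, its fields, and the functions `classify_blocks` and `extract` are hypothetical names introduced for illustration, and the reading order assumes a single-column layout.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A contiguous text block, as a hypothetical PDF extractor might emit it."""
    text: str
    font_size: float
    page: int
    top: float  # distance from the top of the page


def classify_blocks(blocks):
    """Label each block 'body', 'heading' or 'other' using only this document."""
    # Weight each font size by the number of characters set in it;
    # the dominant size is assumed (not guaranteed) to be the body text.
    chars_per_size = Counter()
    for block in blocks:
        chars_per_size[round(block.font_size, 1)] += len(block.text)
    body_size = chars_per_size.most_common(1)[0][0]

    labels = []
    for block in blocks:
        size = round(block.font_size, 1)
        if size == body_size:
            labels.append("body")
        elif size > body_size:
            labels.append("heading")
        else:
            labels.append("other")
    return labels


def extract(blocks):
    """Return (body_text, table_of_contents) in a single-column reading order."""
    ordered = sorted(blocks, key=lambda b: (b.page, b.top))
    labels = classify_blocks(ordered)
    body = "\n".join(b.text for b, label in zip(ordered, labels) if label == "body")
    toc = [b.text for b, label in zip(ordered, labels) if label == "heading"]
    return body, toc
```

Calling `extract` on a list of `TextBlock` objects returns the concatenated body text and the list of heading strings. No model is trained in advance; every threshold is derived from the document being processed, which is the property the abstract emphasises.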

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Stefan Klampfl (1)
  • Roman Kern (1, 2)

  1. Know-Center GmbH, Austria
  2. Knowledge Technologies Institute, Graz University of Technology, Graz, Austria
