SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)

  • Jöran Beel
  • Bela Gipp
  • Ammar Shaker
  • Nick Friedrich
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6273)

Abstract

Extracting the title from a PDF’s full text is an important task in information retrieval, because the title helps to identify the document. Existing approaches apply complex and computationally expensive machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule-based heuristic that uses style information (font size) to identify a PDF’s title. In a first experiment we show that, in an ‘academic search engine’ scenario, this heuristic delivers better results (77.9% accuracy) than a Support Vector Machine by CiteSeer (69.4% accuracy) and a shorter run time (8:19 minutes vs. 57:26 minutes).
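
The abstract names font size as the deciding signal but does not spell out the rules. The following Python sketch is an illustration only, not the authors' implementation. It assumes the first page has already been parsed into lines carrying a font size, and it takes the first run of consecutive lines set in the largest font as the title. The names TextLine and guess_title and the tolerance parameter are hypothetical.

```python
from typing import List, NamedTuple


class TextLine(NamedTuple):
    """One line of text from the first page (hypothetical input format)."""
    text: str
    font_size: float


def guess_title(lines: List[TextLine], tolerance: float = 0.1) -> str:
    """Return the first run of consecutive lines set in the largest font."""
    # Ignore lines that contain only whitespace.
    candidates = [ln for ln in lines if ln.text.strip()]
    if not candidates:
        return ""

    largest = max(ln.font_size for ln in candidates)

    title_parts: List[str] = []
    for ln in candidates:
        if abs(ln.font_size - largest) <= tolerance:
            # Still inside (or just entering) the largest-font run.
            title_parts.append(ln.text.strip())
        elif title_parts:
            # The run has ended; later large text is not part of the title.
            break
    return " ".join(title_parts)


# Made-up font sizes for illustration; real values come from a PDF parser.
first_page = [
    TextLine("Lecture Notes in Computer Science", 9.0),
    TextLine("SciPlore Xtract: Extracting Titles from", 16.0),
    TextLine("Scientific PDF Documents", 16.0),
    TextLine("Jöran Beel, Bela Gipp", 11.0),
    TextLine("Abstract", 10.0),
]
print(guess_title(first_page))
```

A rule of this kind needs no training data and only a single pass over the first page, which is consistent with the run-time advantage reported above, although the actual rules of SciPlore Xtract may differ.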

Keywords

header extraction, title extraction, style information, document analysis

References

  1. Han, H., Giles, C.L., Manavoglu, E., Zha, H., Zhang, Z., Fox, E.A.: Automatic document metadata extraction using support vector machines. In: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, pp. 37–48 (2003)
  2. Hu, Y., Li, H., Cao, Y., Teng, L., Meyerzon, D., Zheng, Q.: Automatic extraction of titles from general documents using machine learning. Information Processing and Management 42(5), 1276–1293 (2006)
  3. Peng, F., McCallum, A.: Accurate information extraction from research papers using conditional random fields. Information Processing and Management 42(4), 963–979 (2006)

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Jöran Beel (1, 2)
  • Bela Gipp (1, 2)
  • Ammar Shaker (1)
  • Nick Friedrich (1)

  1. Computer Science/ITI/VLBA-Lab, Otto-von-Guericke University, Magdeburg, Germany
  2. UC Berkeley, Berkeley, USA
