Wrapping PDF Documents Exploiting Uncertain Knowledge

  • S. Flesca
  • S. Garruzzo
  • E. Masciari
  • A. Tagarelli
Conference paper

DOI: 10.1007/11767138_13

Part of the Lecture Notes in Computer Science book series (LNCS, volume 4001)
Cite this paper as:
Flesca S., Garruzzo S., Masciari E., Tagarelli A. (2006) Wrapping PDF Documents Exploiting Uncertain Knowledge. In: Dubois E., Pohl K. (eds) Advanced Information Systems Engineering. CAiSE 2006. Lecture Notes in Computer Science, vol 4001. Springer, Berlin, Heidelberg

Abstract

The PDF format is the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges in the information extraction field. The proposal is based on a novel bottom-up wrapping approach that extracts information tokens and integrates them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because the structure and presentation of PDF documents are intrinsically uncertain, we express constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose a wrapper evaluation algorithm that runs in polynomial time with respect to the size of a PDF document.
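The idea of grouping tokens bottom-up under fuzzy constraints can be illustrated with a minimal sketch. This is not the authors' wrapper language or evaluation algorithm, which the abstract does not detail. The `Token` fields, the linear membership function `closeness`, the tolerances, the use of minimum as the fuzzy AND, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A text token with a hypothetical layout box (units: PDF points)."""
    text: str
    x: float      # left edge
    y: float      # baseline; PDF coordinates grow upwards
    width: float


def closeness(distance: float, tolerance: float) -> float:
    """Fuzzy membership in [0, 1]: 1 at distance 0, falling linearly to 0 at `tolerance`."""
    return max(0.0, 1.0 - abs(distance) / tolerance)


def same_line_degree(a: Token, b: Token) -> float:
    """Degree to which b continues the line started by a (assumed constraint)."""
    aligned = closeness(a.y - b.y, tolerance=4.0)    # similar baseline
    gap = b.x - (a.x + a.width)                      # horizontal gap between boxes
    adjacent = closeness(gap, tolerance=20.0)        # small gap
    return min(aligned, adjacent)                    # fuzzy AND taken as the minimum


def group_tokens(tokens, threshold=0.5):
    """Greedy bottom-up grouping: extend the current group while the constraint holds."""
    groups = []
    # Read top-to-bottom, then left-to-right.
    for tok in sorted(tokens, key=lambda t: (-t.y, t.x)):
        if groups and same_line_degree(groups[-1][-1], tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return groups


if __name__ == "__main__":
    sample = [
        Token("Total", 50, 700, 30),
        Token("42.00", 85, 700, 30),
        Token("Date", 50, 650, 25),
    ]
    for group in group_tokens(sample):
        print([t.text for t in group])
```

In this sketch a single pass over the sorted tokens suffices, so the cost is dominated by the sort. A wrapper in the sense of the paper would additionally check groups against group type definitions, which is not modelled here.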

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • S. Flesca (1)
  • S. Garruzzo (2)
  • E. Masciari (3)
  • A. Tagarelli (1)

  1. DEIS, University of Calabria
  2. DIMET, University of Reggio Calabria
  3. ICAR-CNR – Institute of Italian National Research Council
