Wrapping PDF Documents Exploiting Uncertain Knowledge

  • S. Flesca
  • S. Garruzzo
  • E. Masciari
  • A. Tagarelli
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4001)


The PDF format is the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges in the information extraction field. The proposal is based on a novel bottom-up wrapping approach that extracts information tokens and integrates them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because the structure and presentation of PDF documents are intrinsically uncertain, we express constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose a wrapper evaluation algorithm that runs in polynomial time with respect to the size of a PDF document.
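The idea of constraining token groupings with fuzzy conditions can be illustrated with a minimal sketch. This is not the wrapper language or the evaluation algorithm of the paper, which the abstract does not detail. The `Token` class, the `closeness` membership function, the distance thresholds, and the greedy grouping pass are all hypothetical choices made for illustration, and the conjunction is taken as the minimum of the two degrees, as in Zadeh's fuzzy sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A text token with a position on the page (hypothetical model)."""
    text: str
    x: float  # left edge, in points
    y: float  # vertical position, in points, growing down the page


def closeness(distance: float, full: float, zero: float) -> float:
    """Fuzzy membership: 1.0 up to `full`, falling linearly to 0.0 at `zero`."""
    if distance <= full:
        return 1.0
    if distance >= zero:
        return 0.0
    return (zero - distance) / (zero - full)


def same_group_degree(a: Token, b: Token) -> float:
    """Degree to which two tokens belong to the same group.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    aligned = closeness(abs(a.x - b.x), full=2.0, zero=20.0)
    adjacent = closeness(abs(a.y - b.y), full=14.0, zero=40.0)
    return min(aligned, adjacent)  # fuzzy conjunction as minimum


def group_tokens(tokens, threshold=0.5):
    """Bottom-up pass: attach each token to the current group when the
    fuzzy condition against the group's last token meets the threshold."""
    groups = []
    for tok in sorted(tokens, key=lambda t: t.y):
        if groups and same_group_degree(groups[-1][-1], tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return groups


if __name__ == "__main__":
    page = [
        Token("Assets", 50, 100),
        Token("Cash", 50, 112),
        Token("Inventory", 52, 124),
        Token("Liabilities", 50, 200),
        Token("Loans", 50, 212),
    ]
    for group in group_tokens(page):
        print([t.text for t in group])
```

On this toy balance-sheet fragment the large vertical gap before "Liabilities" drives the grouping degree to zero, so the pass yields two groups. A single linear scan like this is only a stand-in: it shows how a graded condition replaces a hard layout rule, not how the paper's group type definitions are evaluated.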


Keywords: Balance Sheet; Information Extraction; Group Type; Content Model; Maximal Group



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • S. Flesca (1)
  • S. Garruzzo (2)
  • E. Masciari (3)
  • A. Tagarelli (1)
  1. DEIS, University of Calabria
  2. DIMET, University of Reggio Calabria
  3. ICAR-CNR, Institute of Italian National Research Council
