Abstract
The PDF format is the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges in the information extraction field. Our proposal is based on a novel bottom-up wrapping approach that extracts information tokens and integrates them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because the structure and presentation of PDF documents are intrinsically uncertain, we express constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose a wrapper evaluation algorithm that runs in polynomial time with respect to the size of a PDF document.
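To give a feel for how a fuzzy condition can constrain a bottom-up token grouping, here is a minimal Python sketch. It is not the wrapper language or the evaluation algorithm of the paper: the `Token` fields, the membership function `falling`, the conditions `same_line` and `adjacent`, the numeric thresholds, and the greedy left-to-right merge are all assumptions made for illustration. The only standard ingredient is the use of `min` as fuzzy conjunction, as in Zadeh's fuzzy sets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    # Hypothetical token layout data; a real PDF extractor would supply these.
    text: str
    x: float      # left edge of the token, in points
    y: float      # baseline of the token, in points
    width: float  # rendered width of the token, in points


def falling(value: float, full: float, zero: float) -> float:
    """Membership that is 1 up to `full`, 0 from `zero` on, linear in between."""
    if value <= full:
        return 1.0
    if value >= zero:
        return 0.0
    return (zero - value) / (zero - full)


def same_line(a: Token, b: Token) -> float:
    # Degree to which two tokens share a baseline (thresholds are assumed).
    return falling(abs(a.y - b.y), 1.0, 4.0)


def adjacent(a: Token, b: Token) -> float:
    # Degree to which b starts right after a ends (thresholds are assumed).
    return falling(abs(b.x - (a.x + a.width)), 4.0, 12.0)


def belongs_together(a: Token, b: Token) -> float:
    # Fuzzy conjunction of the two conditions, using min as the t-norm.
    return min(same_line(a, b), adjacent(a, b))


def group_tokens(tokens: List[Token],
                 threshold: float = 0.5) -> List[Tuple[List[Token], float]]:
    """Greedily merge consecutive tokens; each group keeps its weakest link."""
    groups: List[Tuple[List[Token], float]] = []
    for tok in tokens:
        if groups:
            members, degree = groups[-1]
            link = belongs_together(members[-1], tok)
            if link >= threshold:
                groups[-1] = (members + [tok], min(degree, link))
                continue
        groups.append(([tok], 1.0))
    return groups
```

Each group carries a degree in [0, 1] rather than a yes/no verdict, so a later stage could rank or reject groupings instead of failing outright when the layout deviates slightly from what was expected.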
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Flesca, S., Garruzzo, S., Masciari, E., Tagarelli, A. (2006). Wrapping PDF Documents Exploiting Uncertain Knowledge. In: Dubois, E., Pohl, K. (eds) Advanced Information Systems Engineering. CAiSE 2006. Lecture Notes in Computer Science, vol 4001. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11767138_13
DOI: https://doi.org/10.1007/11767138_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34652-4
Online ISBN: 978-3-540-34653-1
eBook Packages: Computer Science (R0)