Wrapping PDF Documents Exploiting Uncertain Knowledge
- Cite this paper as:
- Flesca S., Garruzzo S., Masciari E., Tagarelli A. (2006) Wrapping PDF Documents Exploiting Uncertain Knowledge. In: Dubois E., Pohl K. (eds) Advanced Information Systems Engineering. CAiSE 2006. Lecture Notes in Computer Science, vol 4001. Springer, Berlin, Heidelberg
The PDF format is the de facto standard for print-oriented documents. In this paper we address the problem of wrapping PDF documents, which raises new challenges in the information extraction field. The proposal is based on a novel bottom-up wrapping approach that extracts information tokens and integrates them into groups related according to the logical structure of a document. A PDF wrapper is defined by specifying a set of group type definitions, which impose a target structure on the token groups containing the required information. Because the structure and presentation of PDF documents are intrinsically uncertain, we express constraints on token groupings as fuzzy logic conditions. We define a formal semantics for PDF wrappers and propose a wrapper evaluation algorithm that runs in polynomial time with respect to the size of a PDF document.
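To make the idea of fuzzy constraints on token groupings concrete, the following is a minimal illustrative sketch in Python and not the paper's formalism or algorithm. It assumes tokens carry page coordinates, that "left-aligned" and "vertically adjacent" are modeled by a piecewise-linear membership function, and that fuzzy conjunction is taken as the minimum. The names `Token`, `closeness`, `same_group_degree`, and `group_tokens`, as well as every numeric threshold, are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A text token with an assumed position on the page (in points)."""
    text: str
    x: float  # left edge
    y: float  # distance of the baseline from the top of the page


def closeness(distance: float, full: float, zero: float) -> float:
    """Fuzzy membership: 1.0 up to `full`, falling linearly to 0.0 at `zero`."""
    if distance <= full:
        return 1.0
    if distance >= zero:
        return 0.0
    return (zero - distance) / (zero - full)


def same_group_degree(a: Token, b: Token) -> float:
    """Degree to which two tokens satisfy 'left-aligned AND vertically adjacent'."""
    aligned = closeness(abs(a.x - b.x), full=2.0, zero=20.0)    # assumed tolerances
    adjacent = closeness(abs(a.y - b.y), full=14.0, zero=40.0)  # assumed tolerances
    return min(aligned, adjacent)  # minimum as the fuzzy conjunction


def group_tokens(tokens, threshold=0.5):
    """Bottom-up pass: extend the current group while the fuzzy condition holds."""
    groups = []
    for tok in sorted(tokens, key=lambda t: t.y):
        if groups and same_group_degree(groups[-1][-1], tok) >= threshold:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return groups


tokens = [
    Token("Name:", 50.0, 100.0),
    Token("Alice", 52.0, 112.0),
    Token("Total", 300.0, 200.0),
]
groups = group_tokens(tokens)
print([[t.text for t in g] for g in groups])  # [['Name:', 'Alice'], ['Total']]
```

This sketch sorts the tokens once and then makes a single linear pass, so it trivially stays within a polynomial bound. It handles only one hard-coded condition, whereas the wrappers described above are specified through group type definitions.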