XCDF: A Canonical and Structured Document Format

  • Jean-Luc Bloechle
  • Maurizio Rigamonti
  • Karim Hadjar
  • Denis Lalanne
  • Rolf Ingold
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)

Abstract

Accessing the structured content of PDF documents is a difficult task that requires pre-processing and reverse engineering techniques. In this paper, we first present different methods for accomplishing this task, based either on document image analysis or on electronic content extraction. We then propose XCDF, a canonical format with well-defined properties, as a suitable solution for representing structured electronic documents and as an entry point for further research. The system and methods used for reverse engineering PDF documents into this canonical format are also presented. Finally, we present current applications of this work in various domains, ranging from data mining to multimedia navigation, all of which rely on our canonical format to access the content and structures of PDF documents.

Keywords

Canonical Format, Reverse Engineering, Logical Structure, Electronic Content, Text Line
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jean-Luc Bloechle (1)
  • Maurizio Rigamonti (1)
  • Karim Hadjar (1)
  • Denis Lalanne (1)
  • Rolf Ingold (1)

  1. DIVA Group, DIUF, University of Fribourg, Fribourg, Switzerland