A System for Converting PDF Documents into Structured XML Format

  • Hervé Déjean
  • Jean-Luc Meunier
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)

Abstract

We present a system for converting legacy PDF documents into structured XML. The system first extracts the different streams contained in PDF files (text, bitmap and vectorial images) and then applies a series of components to express the logical structure of the documents in XML. Some of these components are traditional in document analysis; others are more specific to PDF. We also present a graphical user interface for checking, correcting and validating the output of the components. Finally, we report on two real-world use cases in which the system was applied.
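The two-stage design described in the abstract (extract content first, then pass it through a chain of analysis components) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `PAGE`/`TEXT` element names, the `(x, y, text)` fragment format, and the toy `order_top_down` component, which is a naive sort and not the reading-order method used by the authors.

```python
import xml.etree.ElementTree as ET

# Hypothetical extracted text fragments: (x, y, text), with y growing downward.
fragments = [
    (50, 120, "Second line of the page"),
    (50, 100, "First line of the page"),
    (50, 40, "A Title"),
]

def to_xml(fragments):
    """Wrap raw fragments in a flat, hypothetical <PAGE>/<TEXT> layout."""
    page = ET.Element("PAGE")
    for x, y, text in fragments:
        node = ET.SubElement(page, "TEXT", x=str(x), y=str(y))
        node.text = text
    return page

def order_top_down(page):
    """Toy component: reorder TEXT nodes top-to-bottom, then left-to-right."""
    nodes = sorted(page, key=lambda n: (int(n.get("y")), int(n.get("x"))))
    for node in nodes:
        page.remove(node)
    page.extend(nodes)
    return page

def run_pipeline(fragments, components):
    """Build the initial XML, then apply each component in turn."""
    page = to_xml(fragments)
    for component in components:
        page = component(page)
    return page

page = run_pipeline(fragments, [order_top_down])
ordered_texts = [node.text for node in page]
xml_string = ET.tostring(page, encoding="unicode")
```

Keeping each component as a function from XML tree to XML tree means further steps (for example, header and footer detection) could be appended to the list without changing the others; this is a design choice of the sketch, not a claim about the actual system.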

Keywords

Vectorial Image, Text Block, Portable Document Format, Layout Analysis, Reading Order
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Hervé Déjean (1)
  • Jean-Luc Meunier (1)
  1. Xerox Research Centre Europe, Meylan
