On tables of contents and how to recognize them

  • Hervé Déjean
  • Jean-Luc Meunier
Original Paper

Abstract

We present a method for structuring a document according to the information present in its organizational tables: the table of contents, tables of figures, etc. The method follows a two-step approach that leverages both functional and formal (layout-based) knowledge. A functional definition of an organizational table, based on five properties, provides a first solution, which a second step improves by automatically learning the form of the table of contents. We also report on the robustness and performance of the method and illustrate its use in a real conversion case.

Keywords

Document structuring, Table of contents recognition, Functional approach, Machine learning

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Xerox Research Centre Europe, Meylan, France
