Detection and analysis of table of contents based on content association

  • Xiaofan Lin
  • Yan Xiong
Original Paper

Abstract

As a special type of table understanding, the detection and analysis of tables of contents (TOCs) play an important role in the digitization of multi-page documents. Most previous TOC analysis methods only concentrate on the TOC itself without taking into account the other pages in the same document. Besides, they often require manual coding or at least machine learning of document-specific models. This paper introduces a new method to detect and analyze TOCs based on content association. It fully leverages the text information throughout the whole multi-page document and can be directly applied to a wide range of documents without the need to build or learn the models for individual documents. In addition, the associations of general text and page numbers are combined to make the TOC analysis more accurate. Natural language processing and layout analysis are integrated to improve the TOC functional tagging. The applications of the proposed method in a large-scale digital library project are also discussed.
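The idea of content association can be illustrated with a minimal sketch. This is not the authors' algorithm; it only assumes that a TOC entry's title text tends to reappear near the top of the page it points to, and that the printed page number and the physical page index differ by a constant offset. All function names, the 5-line page head, and the 0.8 threshold are hypothetical choices made for illustration.

```python
import re
from collections import Counter

def split_entry(line):
    """Split a candidate TOC line into (title, printed page number or None)."""
    m = re.match(r"^(.*?)[\s.]*(\d+)\s*$", line)
    if m and m.group(1).strip():
        return m.group(1).strip(), int(m.group(2))
    return line.strip(), None

def words(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def title_match(title, page_text, head_lines=5):
    """Fraction of title words found among the first lines of a page."""
    title_words = words(title)
    if not title_words:
        return 0.0
    head = set(words("\n".join(page_text.splitlines()[:head_lines])))
    return sum(w in head for w in title_words) / len(title_words)

def associate(toc_lines, pages, toc_index, threshold=0.8):
    """Link each TOC line to the best-matching body page (text association).

    pages is a list of page texts; toc_index is the candidate TOC page,
    which is skipped so that entries do not match themselves.
    Returns (title, printed_page, physical_index or None) per line.
    """
    links = []
    for line in toc_lines:
        title, printed = split_entry(line)
        best, best_score = None, 0.0
        for i, text in enumerate(pages):
            if i == toc_index:
                continue
            score = title_match(title, text)
            if score > best_score:
                best, best_score = i, score
        links.append((title, printed, best if best_score >= threshold else None))
    return links

def page_offset(links):
    """Most common (physical index - printed number) among linked entries
    (page-number association)."""
    diffs = [phys - printed for _, printed, phys in links
             if printed is not None and phys is not None]
    return Counter(diffs).most_common(1)[0][0] if diffs else None

# Toy three-page document: page 0 is the candidate TOC page.
pages = [
    "Contents\nIntroduction ..... 1\nRelated Work ..... 2",
    "Introduction\nTables of contents are a special type of table.",
    "Related Work\nPrevious methods only look at the TOC itself.",
]
toc_lines = pages[0].splitlines()[1:]
links = associate(toc_lines, pages, toc_index=0)
offset = page_offset(links)
```

In this toy setting, a page whose lines mostly link to later pages with a consistent offset would be a plausible TOC candidate, and the offset could then be used to check entries whose text match is weak.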

Keywords

Table of contents, Document structure analysis, Table recognition, Optical character recognition, Algorithm combination

Copyright information

© Springer-Verlag 2005

Authors and Affiliations

  1. Hewlett-Packard Laboratories, Palo Alto, USA
  2. KLA-Tencor Corporation, Milpitas, USA