Automatic Table-of-Contents Generation for Efficient Information Access

Bentabet, Najah-Imane; Juge, Rémi; El Maarouf, Ismaïl; Valsamou-Stanislawski, Dialekti; Ferradans, Sira

doi:10.1007/s42979-020-00302-z

Original Research
Published: 27 August 2020

SN Computer Science, Volume 1, article number 283 (2020)

Najah-Imane Bentabet¹,
Rémi Juge¹,
Ismaïl El Maarouf¹ (ORCID: 0000-0002-9164-7090),
Dialekti Valsamou-Stanislawski¹ &
Sira Ferradans¹

Abstract

Purpose

This paper presents a novel neural approach, applicable to any searchable PDF document, that automatically generates the Table of Contents (TOC): it first detects the titles and then orders them hierarchically using sequence labelling. A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation.

Methods

Unlike previous methods, we do not assume the presence of parsable TOC pages in the document; instead, we infer the TOC from a data-driven analysis of section titles, their order and their depth.
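
To make the final assembly step concrete, here is a minimal sketch of how titles that have already been detected and assigned a depth could be nested into a TOC from their order and depth alone. It does not reproduce the neural title detection or sequence labelling models of the paper; the names `TocNode`, `build_toc` and `render`, and the example titles, are hypothetical and introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TocNode:
    title: str
    depth: int
    children: list[TocNode] = field(default_factory=list)


def build_toc(labelled_titles):
    """Nest (title, depth) pairs, given in reading order, into a tree.

    Assumption: depths are integers >= 1, as produced by some upstream
    labelling step that is not shown here.
    """
    root = TocNode(title="ROOT", depth=0)
    stack = [root]  # path from the root to the most recent title
    for title, depth in labelled_titles:
        # Pop until the top of the stack is strictly shallower than the new title.
        while stack[-1].depth >= depth:
            stack.pop()
        node = TocNode(title=title, depth=depth)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node, indent=0):
    """Return the TOC as indented lines, skipping the artificial root."""
    lines = []
    for child in node.children:
        lines.append("  " * indent + child.title)
        lines.extend(render(child, indent + 1))
    return lines


# Hypothetical labelled titles, in document order.
example = [("Introduction", 1), ("Fund overview", 1), ("Fees", 2), ("Risks", 2), ("Annex", 1)]
print("\n".join(render(build_toc(example))))
```

The stack holds the path from the root to the most recent title, so each new title attaches to the nearest preceding title with a smaller depth. This relies only on order and depth, not on a parsable TOC page.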

Results

We offer an exhaustive analysis of the proposed model and evaluate it on French and English documents from the financial domain, which we release to encourage interest from the community. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments.

Conclusions

The approach described in this paper can easily be adapted to other domains and document types, and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work.

Availability of data and material

Data will be made available upon request to any of the authors belonging to Fortia Financial Solutions.

Notes

See https://www.amf-france.org/en_US/Formulaires-et-declarations/OPCVM-et-fonds-d-investissement/OPCVM/Plan-type-du-prospectus0.
See for instance this prospectus: https://www.amffunds.com/html/F17-0998-AMF-Large-Cap-Prospectus.pdf.
See for instance Tesseract at https://github.com/tesseract-ocr/tesseract.
The last edition to date is available at http://icdar2019.org/.
Such as MS Office at https://products.office.com/.
More on this in sections “Investment documents datasets” and “Title hierarchization”.
An exhaustive study of the usage of prospectuses confirms this, with MS Word used in 92% of the cases: https://morecarrot.com/wp-content/uploads/2019/10/MC_Prospectus_StudyReportFinal_23oct19.pdf.
For prospectuses, see https://www.amf-france.org/en_US/Formulaires-et-declarations/OPCVM-et-fonds-d-investissement/OPCVM/Plan-type-du-prospectus0.
See http://www.poppler.freedesktop.org.
We refer here to the logical page number as opposed to the physical page number which is printed in the content of the document.
Please contact the authors of the paper to access this dataset.

Funding

This research has been fully supported by Fortia Financial Solutions.

Author information

Najah-Imane Bentabet and Rémi Juge have contributed equally to this work.

Authors and Affiliations

Fortia Financial Solutions, 17 Av George V, Paris, France
Najah-Imane Bentabet, Rémi Juge, Ismaïl El Maarouf, Dialekti Valsamou-Stanislawski & Sira Ferradans

Contributions

Najah-Imane Bentabet and Rémi Juge contributed equally to this work. Ismaïl El Maarouf contributed to rewriting this paper. Dialekti Valsamou-Stanislawski reviewed the final version of this paper, and Sira Ferradans contributed to the initial version of the paper while a member of Fortia Financial Solutions.

Corresponding author

Correspondence to Ismaïl El Maarouf.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

The authors consider they have provided enough details in the paper to ensure reproducibility and may be contacted in case of missing experimental details.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Document Analysis and Recognition” guest edited by Michael Blumenstein, Seiichi Uchida and Cheng-Lin Liu.

About this article

Cite this article

Bentabet, NI., Juge, R., El Maarouf, I. et al. Automatic Table-of-Contents Generation for Efficient Information Access. SN COMPUT. SCI. 1, 283 (2020). https://doi.org/10.1007/s42979-020-00302-z

Received: 01 February 2020
Accepted: 11 August 2020
Published: 27 August 2020
DOI: https://doi.org/10.1007/s42979-020-00302-z
