Skip to main content

Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application

Abstract

In this paper, we study the problem of extracting variable-depth “logical document hierarchy” from long documents, namely organizing the recognized “physical document objects” into hierarchical structures. The discovery of logical document hierarchy is the vital step to support many downstream applications (e.g., passage-based retrieval and high-quality information extraction). However, long documents, containing hundreds or even thousands of pages and a variable-depth hierarchy, challenge the existing methods. To address these challenges, we develop a framework, namely Hierarchy Extraction from Long Document (HELD), where we “sequentially” insert each physical object at the proper position on the current tree. Determining whether each possible position is proper or not can be formulated as a binary classification problem. To further improve its effectiveness and efficiency, we study the design variants in HELD, including traversal orders of the insertion positions, heading extraction explicitly or implicitly, tolerance to insertion errors in predecessor steps, and so on. As for evaluations, we find that previous studies ignore the error that the depth of a node is correct while its path to the root is wrong. Since such mistakes may worsen the downstream applications seriously, a new measure is developed for a more careful evaluation. The empirical experiments based on thousands of long documents from Chinese financial market, English financial market and English scientific publication show that the HELD model with the “root-to-leaf” traversal order and explicit heading extraction is the best choice to achieve the tradeoff between effectiveness and efficiency with the accuracy of 0.972 6, 0.729 1 and 0.957 8 in the Chinese financial, English financial and arXiv datasets, respectively. Finally, we show that the logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task. In summary, we conduct a systematic study on this task in terms of methods, evaluations, and applications.

This is a preview of subscription content, access via your institution.

References

  1. Bloechle J L. Physical and logical structure recognition of pdf documents [PhD Thesis]. University of Fribourg, 2010.

  2. Mao S, Rosenfeld A, Kanungo T. Document structure analysis algorithms: A literature survey. In Proc. the 2003 Document Recognition and Retrieval X, Jan. 2003, pp.197-207. https://doi.org/10.1117/12.476326.

  3. Pembe F C, Gungor T. Heading-based sectional hierarchy identification for HTML documents. In Proc. the 22nd International Symposium on Computer and Information Sciences, Nov. 2007. https://doi.org/10.1109/ISCIS.2007.4456839.

  4. Geva M, Berant J. Learning to search in long documents using document structure. In Proc. the 27th International Conference on Computational Linguistics, Aug. 2018, pp.161-176.

  5. Howard T, Bruce C. Inference networks for document retrieval. ACM SIGIR Forum, 2017, 51(2): 124-147. https://doi.org/10.1145/3130348.3130361.

    MathSciNet  Article  Google Scholar 

  6. Summers K. Automatic discovery of logical document structure [PhD Thesis]. Cornell University, 1998.

  7. Luong M T, Nguyen T D, Kan M Y. Logical structure recovery in scholarly articles with rich document features. International Journal of Digital Library Systems, 2010, 1(4): 1-23. https://doi.org/10.4018/jdls.2010100101.

    Article  Google Scholar 

  8. Pembe F C, Güngör T. A tree-based learning approach for document structure analysis and its application to Web search. Natural Language Engineering, 2014, 21(4): 569-605. https://doi.org/10.1017/S1351324914000023.

    Article  Google Scholar 

  9. Ramakrishnan C, Patnia A, Hovy E, Burns G A. Layout-aware text extraction from full-text pdf of scientific articles. Source Code for Biology Medicine, 2012, 7(1): Article No. 7. https://doi.org/10.1186/1751-0473-7-7.

  10. Manabe T, Tajima K. Extracting logical hierarchical structure of HTML documents based on headings. Proceedings of the VLDB Endowment, 2015, 8(12): 1606-1617. https://doi.org/10.14778/2824032.2824058.

    Article  Google Scholar 

  11. Rahman M M, Finin T. Understanding the logical and semantic structure of large documents. arXiv:1709.00770, 2017. https://arxiv.org/abs/1709.00770, April 2021.

  12. Bentabet N I, Juge R, Ferradans S. Table-of-contents generation on contemporary documents. In Proc. the 2019 International Conference on Document Analysis and Recognition, Sept. 2019, pp. 100-107. https://doi.org/10.1109/ICDAR.2019.00025.

  13. Conway A. Page grammars and page parsing: A syntactic approach to document layout recognition. In Proc. the 2nd International Conference on Document Analysis and Recognition, Oct. 1993, pp.761-764. https://doi.org/10.1109/ICDAR.1993.395626.

  14. Tsujimoto S, Asada H. Understanding multi-articled documents. In Proc. the 10th International Conference on Pattern Recognition, June 1990, pp.124-133. https://doi.org/10.1109/ICPR.1990.118163.

  15. Constantin A, Pettifer S, Voronkov A. PDFX: Fully-automated PDF-to-XML conversion of scientific literature. In Proc. the 2013 ACM Symposium on Document Engineering, Sept. 2013, pp.177-180. https://doi.org/10.1145/2494266.2494271.

  16. Tkaczyk D, Szostek P, Fedoryszak M, Dendek P J, Bolikowski. CERMINE: Automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition, 2015, 18(4): 317-335. https://doi.org/10.1007/s10032-015-0249-8.

  17. Summers K. Toward a taxonomy of logical document structures. In Proc. the Dartmouth Institute for Advanced Graduate Studies: Electronic Publishing and the Information Superhighway, May 30-June 2, 1995, pp.124-133.

  18. Baird H S, Jones S E, Fortune S J. Image segmentation by shape-directed covers. In Proc. the 10th International Conference on Pattern Recognition, June 1990, pp.820-825. https://doi.org/10.1109/ICPR.1990.118223.

  19. Nagy G, Seth S, Viswanathan M. A prototype document image analysis system for technical journals. Computer, 1992, 25(7): 10-22. https://doi.org/10.1109/2.144436.

    Article  Google Scholar 

  20. Kopec G E, Chou P A. Document image decoding using Markov source models. In Proc. the 1993 IEEE International Conference on Acoustics Speech and Signal Processing, April 1993, pp.85-88. https://doi.org/10.1109/ICASSP.1993.319753.

  21. Xiao Y, Yumer E, Asente P, Kraley M, Kifer D, Giles C L. Learning to extract semantic structure from documents using multimodal fully convolutional neural network. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Jul. 2017, pp.4342-4351. https://doi.org/10.1109/CVPR.2017.462.

  22. Augusto Borges Oliveira D, Palhares Viana M. Fast CNN-based document layout analysis. In Proc. the 2017 IEEE International Conference on Computer Vision Workshops, Oct. 2017, pp.1173-1180. https://doi.org/10.1109/ICCVW.2017.142.

  23. Wong K Y, Casey R G, Wahl F M. Document analysis system. IBM Journal of Research and Development, 1982, 26(6): 647-656. https://doi.org/10.1147/rd.266.0647.

    Article  Google Scholar 

  24. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In Proc. the 3rd International Conference on Learning Representations, May 2015.

  25. Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In Proc. the 2015 IEEE International Conference on Computer Vision, Dec. 2015, pp.1520-1528. https://doi.org/10.1109/ICCV.2015.178.

  26. He D, Cohen S, Price B, Kifer D, Giles C L. Multi-scale multi-task FCN for semantic page segmentation and table detection. In Proc. the 14th IAPR International Conference on Document Analysis and Recognition, Nov. 2017, pp.254-261. https://doi.org/10.1109/ICDAR.2017.50.

  27. Schuster M, Paliwal K K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681. https://doi.org/10.1109/78.650093.

    Article  Google Scholar 

  28. Zhou G, Luo P, Cao R, Xiao Y, Lin F, Chen B, He Q. Tree-structured neural machine for linguistics-aware sentence generation. In Proc. the 32nd AAAI Conference on Artificial Intelligence, February 2018, pp.5722-5729.

  29. Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In Proc. the 27th International Conference on Neural Information Processing Systems, December 2014, pp.3104-3112.

  30. Tan Z, Wang M, Xie J, Chen Y, Shi X. Deep semantic role labeling with self-attention. In Proc. the 32nd AAAI Conference on Artificial Intelligence, Feb. 2018, pp.4929-4936.

  31. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing, December 2017, pp.5998-6008.

  32. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In Proc. the 2013 International Conference on Learning Representations, May 2013.

  33. Lin M, Chen Q, Yan S. Network in network. arXiv:1312.4400, 2013. https://arxiv.org/abs/1312.4400, Jan. 2021.

  34. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. the 32nd International Conference on Machine Learning, July 2015, pp.448-456.

  35. Nair V, Hinton G E. Rectified linear units improve restricted Boltzmann machines. In Proc. the 27th International Conference on Machine Learning, Jun. 2010, pp.807-814.

  36. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. the IEEE International Conference on Computer Vision, Dec. 2015, pp.1026-1034. https://doi.org/10.1109/ICCV.2015.123.

  37. Kingma D P, Ba J. Adam: A method for stochastic optimization. In Proc. the 3rd International Conference on Learning Representations, May 2015.

  38. Sergeev A, Del Balso M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799, 2018. https://arxiv.org/abs/1802.05799, Jan. 2021.

  39. Friedman J H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 2001, 29(5): 1189-1232. https://doi.org/10.1214/aos/1013203451.

    MathSciNet  Article  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rong-Yu Cao.

Supplementary Information

ESM 1

(PDF 74 kb)

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Cao, RY., Cao, YX., Zhou, GB. et al. Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application. J. Comput. Sci. Technol. 37, 699–718 (2022). https://doi.org/10.1007/s11390-021-1076-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-021-1076-7

Keywords

  • logical document hierarchy
  • long document
  • passage retrieval