A hybrid approach to recognize generic sections in scholarly documents

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)


Discourse parsing of scholarly documents underpins efforts to standardize scholarly writing, understand document content, and quickly locate and extract specific information. As the volume of scholarly documents keeps growing, analyzing them automatically, quickly, and effectively has become an active research topic. In this paper, we propose a hybrid model that considers both section headers and body text to automatically recognize generic sections in scholarly documents. We conduct a comprehensive analysis of the semantic difference between short phrases and long narrative text chunks on the SectLabel dataset. The experimental results show that our model achieves an \(F_{1}\) score of 91.67% in generic section recognition, outperforming the baseline.


  1. Pdftotext command line tools


