Enhancing identification of structure function of academic articles using contextual information

Scientometrics

Abstract

With the enrichment of literature resources, researchers face a growing problem of information explosion and knowledge overload. To help scholars retrieve literature and acquire knowledge effectively, clarifying the semantic structure of the content of academic literature has become an essential research question. In research on identifying the structure function of chapters in academic articles, only a few studies have used deep learning models and explored the optimization of feature input, which limits the application and optimization potential of deep learning models for this task. This paper takes articles from the ACL conference as the corpus. We employ traditional machine learning models and deep learning models to construct classifiers based on various feature inputs. Experimental results show that (1) compared with the chapter content, the chapter title is more conducive to identifying the structure function of academic articles; (2) relative position is a valuable feature for building traditional models; and (3) inspired by (2), introducing contextual information into the deep learning models achieves significant results. Our models also show good transferability in an open test containing 200 samples drawn from outside the training data. Based on the best-performing models, we also annotate the ACL main conference papers from the most recent five years and perform a time series analysis of the overall corpus. Through multiple comparative experiments, this work explores and summarizes the practical features and models for this task and provides a reference for related text classification tasks. Finally, we point out the limitations of the current model and directions for further optimization.
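To make the two features named in the abstract concrete, the following is a minimal sketch of how a relative-position feature and a context-augmented input could be derived from a list of chapter titles. It is not the authors' implementation: the abstract does not define either feature, so the definition of relative position as (index + 1) / number of chapters, the use of the neighbouring titles as context, and the names `ChapterFeatures`, `build_features`, `sep`, and `pad` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChapterFeatures:
    title: str
    relative_position: float  # assumed definition: (index + 1) / number of chapters
    context_input: str        # assumed format: previous title, own title, next title


def build_features(
    titles: List[str], sep: str = " [SEP] ", pad: str = "<none>"
) -> List[ChapterFeatures]:
    """Derive a position feature and a context-augmented input per chapter title."""
    n = len(titles)
    features = []
    for i, title in enumerate(titles):
        # Neighbouring titles act as context; a placeholder marks missing neighbours.
        prev_title = titles[i - 1] if i > 0 else pad
        next_title = titles[i + 1] if i < n - 1 else pad
        features.append(
            ChapterFeatures(
                title=title,
                relative_position=(i + 1) / n,
                context_input=sep.join([prev_title, title, next_title]),
            )
        )
    return features


# Hypothetical chapter titles, used only to exercise the function.
paper = ["Introduction", "Related Work", "Method", "Experiments", "Conclusion"]
features = build_features(paper)
for f in features:
    print(f"{f.relative_position:.2f}  {f.context_input}")
```

In such a setup, the relative position could be appended to the feature vector of a traditional classifier, while the concatenated string could serve as the text input of a deep learning classifier; both uses are assumptions rather than details stated in the abstract.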

Notes

  1. https://acl-arc.comp.nus.edu.sg/ Collection date: April, 2018.

  2. https://www.aclweb.org/anthology/ Collection date: April, 2018.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 72074113) and the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF-IPIC201903).

Author information

Corresponding author

Correspondence to Chengzhi Zhang.

About this article

Cite this article

Ma, B., Zhang, C., Wang, Y. et al. Enhancing identification of structure function of academic articles using contextual information. Scientometrics 127, 885–925 (2022). https://doi.org/10.1007/s11192-021-04225-1

  • DOI: https://doi.org/10.1007/s11192-021-04225-1
