Abstract
With the enrichment of literature resources, researchers face a growing problem of information explosion and knowledge overload. To help scholars retrieve literature and acquire knowledge effectively, clarifying the semantic structure of the content of academic literature has become an essential research question. In research on identifying the structure function of chapters in academic articles, only a few studies have used deep learning models and explored the optimization of feature input, which limits the application and optimization potential of deep learning models for this task. This paper takes articles from the ACL conference as the corpus. We employ traditional machine learning models and deep learning models to construct classifiers based on various feature inputs. Experimental results show that (1) compared with the chapter content, the chapter title is more conducive to identifying the structure function of academic articles; (2) relative position is a valuable feature for building traditional models; and (3) inspired by (2), introducing contextual information into the deep learning models achieves significant results. Meanwhile, our models show good migration ability in an open test containing 200 sampled non-training samples. We also annotated the ACL main conference papers from the most recent five years using the best-performing models and performed a time series analysis of the overall corpus. This work explores and summarizes the practical features and models for this task through multiple comparative experiments and provides a reference for related text classification tasks. Finally, we indicate the limitations and shortcomings of the current model and directions for further optimization.
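As a rough illustration of the three feature types named in the abstract (chapter title, relative position, and contextual information), the following minimal sketch shows one way such classifier inputs could be assembled. It is not the authors' implementation: the abstract does not define these features, so the formula for relative position (chapter index divided by chapter count), the use of neighbouring chapter titles as context, the `[SEP]` separator, and the names `ChapterExample` and `build_examples` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChapterExample:
    title: str                # title of the chapter being classified
    relative_position: float  # assumed definition: (index + 1) / number of chapters
    context: str              # assumed context: titles of neighbouring chapters


def build_examples(titles: List[str], window: int = 1) -> List[ChapterExample]:
    """Turn an ordered list of chapter titles into hypothetical classifier inputs."""
    n = len(titles)
    examples = []
    for i, title in enumerate(titles):
        # Neighbouring titles within `window` positions, excluding the chapter itself.
        before = titles[max(0, i - window):i]
        after = titles[i + 1:i + 1 + window]
        examples.append(
            ChapterExample(
                title=title,
                relative_position=(i + 1) / n,
                context=" [SEP] ".join(before + after),
            )
        )
    return examples


# Illustrative chapter titles only; not taken from the corpus.
paper = ["Introduction", "Related Work", "Method", "Experiments", "Conclusion"]
examples = build_examples(paper)
```

Under these assumptions, the title and context strings would be passed to a text encoder, while the relative position could be appended as a numeric feature for a traditional model.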
Notes
https://acl-arc.comp.nus.edu.sg/ Collection date: April, 2018.
https://www.aclweb.org/anthology/ Collection date: April, 2018.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 72074113) and the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF-IPIC201903).
About this article
Cite this article
Ma, B., Zhang, C., Wang, Y. et al. Enhancing identification of structure function of academic articles using contextual information. Scientometrics 127, 885–925 (2022). https://doi.org/10.1007/s11192-021-04225-1
DOI: https://doi.org/10.1007/s11192-021-04225-1