
Constructing a high-quality dataset for automated creation of summaries of fundamental contributions of research articles

Scientometrics

Abstract

Research contributions, which indicate how a research paper contributes new knowledge or new understanding in contrast to prior research on the topic, are the most valuable type of information for researchers seeking to understand the main content of a paper. However, little research has used research contributions to identify and recommend valuable knowledge in the academic literature; most existing studies focus instead on other elements of academic literature, such as keywords, citations, rhetorical structure, and discourse. This paper first introduces a fine-grained annotation scheme with six categories of research contributions in academic literature. To evaluate the reliability of our annotation scheme, we annotate 5024 sentences collected from the Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL Anthology) and the academic journal Information Processing & Management (IP&M). We reach an inter-annotator agreement of Cohen's kappa = 0.91 and Fleiss' kappa = 0.91, demonstrating the high quality of the dataset. We then build two types of classifiers for automated research contribution identification on this dataset: classic feature-based machine learning (ML) models and transformer-based deep learning (DL) models. Our experimental results show that SciBERT, a pretrained language model for scientific text, achieves the best performance with an F1 score of 0.58, improving on the best classic ML model (nouns + verbs + tf-idf + random forest) by 2%. This also indicates that classic feature-based ML models remain competitive with DL models such as SciBERT on this dataset. The fine-grained annotation scheme can be applied to large-scale analysis of research contributions in academic literature, and the automated contribution classifiers built in this paper provide the basis for automatic research contribution extraction and knowledge-fragment recommendation. The research contribution dataset developed in this research is publicly available on Zenodo at https://zenodo.org/record/6284137#.YhkZ7-iZO4Q, and the code for the data analysis and experiments will be released at https://github.com/HuyenNguyenHelen/Contribution-Sentence-Classification.
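To make the reported agreement figures concrete, the sketch below shows how such scores are typically computed with scikit-learn (Cohen's kappa, two raters) and statsmodels (Fleiss' kappa, any number of raters). The labels are invented toy data over a hypothetical six-category scheme, not the paper's actual annotations.

    # Sketch: inter-annotator agreement metrics of the kind reported above.
    # The ratings below are toy data, not the paper's annotations.
    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Two annotators labeling the same 8 sentences with categories 0-5.
    rater_a = [0, 2, 2, 5, 1, 3, 4, 0]
    rater_b = [0, 2, 1, 5, 1, 3, 4, 0]
    print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))

    # Fleiss' kappa generalizes to more than two raters: each row is a
    # sentence, each column a rater, each cell the assigned category.
    ratings = [[0, 0, 0], [2, 2, 1], [5, 5, 5], [1, 1, 3], [4, 4, 4]]
    table, _ = aggregate_raters(ratings)  # per-sentence category counts
    print("Fleiss' kappa:", fleiss_kappa(table, method='fleiss'))

Both statistics correct raw agreement for agreement expected by chance; values above 0.8 are conventionally read as near-perfect agreement, which is the basis for the dataset-quality claim.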

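The best classic ML configuration named in the abstract combines noun and verb features with tf-idf weighting and a random forest. A minimal sketch of such a pipeline with NLTK and scikit-learn follows; the POS-filtering rule, hyperparameters, and toy labels are illustrative assumptions, not the authors' exact setup.

    # Sketch of a "nouns + verbs + tf-idf + random forest" classifier.
    import nltk
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline

    nltk.download('punkt', quiet=True)
    nltk.download('averaged_perceptron_tagger', quiet=True)

    def nouns_and_verbs(text):
        """Keep only tokens POS-tagged as nouns (NN*) or verbs (VB*)."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return [tok.lower() for tok, tag in tagged
                if tag.startswith(('NN', 'VB'))]

    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(analyzer=nouns_and_verbs)),  # tf-idf over nouns+verbs only
        ('rf', RandomForestClassifier(n_estimators=300, random_state=42)),
    ])

    # Toy usage with binary labels (contribution sentence or not).
    X = ["We propose a novel annotation scheme for research contributions.",
         "Prior work has focused mainly on citation analysis."]
    y = [1, 0]
    pipeline.fit(X, y)
    print(pipeline.predict(["This paper introduces a new dataset."]))

Passing a callable as TfidfVectorizer's analyzer replaces its built-in tokenization, so the forest only ever sees noun and verb features.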

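SciBERT, the strongest model in the comparison, is available on the Hugging Face hub; a skeletal fine-tuning setup for six-way sentence classification might look like the following. The hyperparameters and the omitted dataset wiring are placeholders, not the paper's configuration.

    # Skeleton: fine-tuning SciBERT as a six-class contribution classifier.
    # Hyperparameters and dataset wiring are placeholders.
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    MODEL = "allenai/scibert_scivocab_uncased"  # SciBERT checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=6)

    def encode(batch):
        # Tokenize annotated sentences; labels pass through unchanged.
        return tokenizer(batch["sentence"], truncation=True, padding="max_length")

    args = TrainingArguments(output_dir="scibert-contrib",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    # In practice: build train/eval datasets from the Zenodo release,
    # map encode() over them, and pass them to the Trainer before training.
    trainer = Trainer(model=model, args=args)

Because SciBERT's vocabulary and pretraining corpus are scientific text, it is a natural fit for contribution sentences, though the 2% margin over the tf-idf baseline suggests the classic pipeline remains a strong, much cheaper alternative.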

Acknowledgements

This paper is a substantially extended version of the ISSI 2021 conference paper "A Fine-Grained Annotation Scheme for Research Contribution in Academic Literature". The authors thank Roohia Shahzad, Rubab Shahzad, Aakansha Tallapally, Durga Bhavana Yerrabelli, Nikitha Malladi, and Riyaz Ahmad Shaik at the University of North Texas for participating in the annotation experiment; Bhavya Nandana Kanuboddu for contributing to the data analysis and the visualizations of the ISSI conference paper; and Marie Bloechle at the University of North Texas for editing the language and writing of the paper. The authors are also grateful to the anonymous reviewers for their valuable comments and suggestions.

Author information


Contributions

HC: Research design, project management, investigation, methodology, writing. HN: Methodology, experiments, data analysis, writing. AA: Data curation, data analysis, review, and editing.

Corresponding author

Correspondence to Haihua Chen.


Cite this article

Chen, H., Nguyen, H. & Alghamdi, A. Constructing a high-quality dataset for automated creation of summaries of fundamental contributions of research articles. Scientometrics 127, 7061–7075 (2022). https://doi.org/10.1007/s11192-022-04380-z
