Abstract
The rapid growth of scientific publications makes it necessary to identify the essential parts of massive bodies of text. Scientific research proceeds from posing problems to applying methods, so to capture the main idea of a paper we focus on extracting its problem and method sentences. Annotating sentences in scientific papers is labor-intensive, which yields small-scale datasets that limit model learning. Data augmentation addresses this challenge by generating synthetic data with minor variations, thereby expanding the original training set. Various augmentation methods now exist, such as those based on random word replacement or back translation, but their suitability for sentence classification in scientific papers remains unexplored. This paper therefore constructs two manually annotated datasets and evaluates the performance of these methods on them. It further examines the mechanisms underlying their effects. Previous studies have suggested that data augmentation can diminish a model's reliance on high-frequency patterns; accordingly, this paper uses attention values to represent the model's dependence on individual words and analyzes how augmentation methods alter the attention assigned to words within sentences. The experimental results indicate that data augmentation improves the macro F1 score on the sentence classification task. Furthermore, the augmentation methods effectively reduce the attention assigned to stop words, to words common in scientific papers generally, and to words common in method and problem sentences.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Grant No. 72074113).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, Y., Zhang, C. (2024). Data Augmentation on Problem and Method Sentence Classification Task in Scientific Paper: A Mechanism Analysis Study. In: Sserwanga, I., et al. Wisdom, Well-Being, Win-Win. iConference 2024. Lecture Notes in Computer Science, vol 14598. Springer, Cham. https://doi.org/10.1007/978-3-031-57867-0_2
Print ISBN: 978-3-031-57866-3
Online ISBN: 978-3-031-57867-0
eBook Packages: Computer Science (R0)