Abstract
The rapid growth of scientific publications makes it necessary to identify the essential parts of massive bodies of text. Scientific research proceeds from posing problems to applying methods, so to capture the main idea of a paper we focus on extracting its problem and method sentences. Annotating sentences in scientific papers is labor-intensive, which yields small-scale datasets that limit model learning. Data augmentation addresses this challenge by generating synthetic data with minor variations, thereby expanding the original training set. Various augmentation methods now exist, such as those based on random word replacement or back translation, but their suitability for sentence classification in scientific papers remains unexplored. This paper therefore constructs two manually annotated datasets and evaluates the performance of these methods on them. It further examines the mechanisms underlying their effects. Previous studies have suggested that data augmentation can diminish a model's reliance on high-frequency patterns; accordingly, this paper uses attention values to represent the model's dependence on individual words and analyzes how augmentation methods alter the attention assigned to words within sentences. The experimental results indicate that data augmentation improves the macro F1 score on the sentence classification task. Furthermore, the augmentation methods effectively reduce the attention assigned to stop words, to words common in scientific papers generally, and to words common in method and problem sentences.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Grant No. 72074113).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, Y., Zhang, C. (2024). Data Augmentation on Problem and Method Sentence Classification Task in Scientific Paper: A Mechanism Analysis Study. In: Sserwanga, I., et al. Wisdom, Well-Being, Win-Win. iConference 2024. Lecture Notes in Computer Science, vol 14598. Springer, Cham. https://doi.org/10.1007/978-3-031-57867-0_2
Print ISBN: 978-3-031-57866-3
Online ISBN: 978-3-031-57867-0
eBook Packages: Computer Science (R0)