TextCut: A Multi-region Replacement Data Augmentation Approach for Text Imbalance Classification

Jiang, Wanrong; Chen, Ya; Fu, Hao; Liu, Guiquan

doi:10.1007/978-3-030-92273-3_35

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 13111)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

In practical text classification, data imbalance is common and typically biases the classifier toward the majority class. Handling imbalanced text datasets so as to alleviate the skewed distribution is therefore a crucial task. Existing mainstream methods use interpolation-based augmentation strategies to synthesize new texts from minority class texts. However, interpolation may disrupt the syntactic and semantic information of the original texts, which makes the new texts hard to model. To overcome this problem, we propose TextCut, a novel data augmentation method based on paired samples. For a minority class text and its paired text, TextCut samples multiple small square regions of the minority text in the hidden space and replaces them with the corresponding regions cut out from the paired text. We build TextCut upon the BERT model to better capture the features of minority class texts. Experiments on three benchmark imbalanced text datasets show that TextCut further improves classification performance on both the minority classes and all classes overall, and effectively alleviates the imbalance problem.
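The multi-region replacement described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `textcut_replace`, the parameters `num_regions` and `region_size`, the uniform sampling of region positions, and the use of plain arrays in place of BERT hidden states are all assumptions made for illustration.

```python
import numpy as np


def textcut_replace(h_minority, h_paired, num_regions=4, region_size=8, rng=None):
    """Replace several square regions of h_minority with the same regions of h_paired.

    Both inputs are (seq_len, hidden_dim) arrays standing in for hidden states.
    Returns the mixed array and a boolean mask marking the replaced positions.
    Hypothetical helper: names and sampling scheme are assumptions, not the paper's.
    """
    if h_minority.shape != h_paired.shape:
        raise ValueError("hidden states must have the same shape")
    rng = np.random.default_rng() if rng is None else rng
    seq_len, hidden_dim = h_minority.shape
    # Clamp the side length so a square region always fits inside the matrix.
    size = min(region_size, seq_len, hidden_dim)

    mask = np.zeros((seq_len, hidden_dim), dtype=bool)
    for _ in range(num_regions):
        # Sample the top-left corner uniformly (assumed); regions may overlap.
        top = rng.integers(0, seq_len - size + 1)
        left = rng.integers(0, hidden_dim - size + 1)
        mask[top:top + size, left:left + size] = True

    # Take the paired text's values inside the mask, the minority text's elsewhere.
    mixed = np.where(mask, h_paired, h_minority)
    return mixed, mask


# Usage with random stand-ins for the two texts' hidden states.
rng = np.random.default_rng(0)
h_min = rng.normal(size=(32, 64))
h_pair = rng.normal(size=(32, 64))
mixed, mask = textcut_replace(h_min, h_pair, num_regions=4, region_size=8, rng=rng)
replaced_fraction = mask.mean()  # share of hidden positions taken from the paired text
```

The returned mask makes it easy to check which positions came from the paired text. Whether and how the replaced fraction should also be used to mix the two labels, as CutMix does for images, is not stated in the abstract, so the sketch leaves labels untouched.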

Acknowledgements

This paper has been supported by the National Key Research and Development Program of China (No. 2018YFB1801105).

Author information

Authors and Affiliations

School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Wanrong Jiang, Ya Chen, Hao Fu & Guiquan Liu

Corresponding author

Correspondence to Guiquan Liu.

Editor information

Editors and Affiliations

Sampoerna University, Jakarta, Indonesia
Teddy Mantoro
Kyungpook National University, Daegu, Korea (Republic of)
Minho Lee
Sampoerna University, Jakarta, Indonesia
Media Anugerah Ayu
Murdoch University, Murdoch, WA, Australia
Kok Wai Wong
Universitas Indonesia, Depok, Indonesia
Achmad Nizar Hidayanto

Cite this paper

Jiang, W., Chen, Y., Fu, H., Liu, G. (2021). TextCut: A Multi-region Replacement Data Augmentation Approach for Text Imbalance Classification. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol 13111. Springer, Cham. https://doi.org/10.1007/978-3-030-92273-3_35

DOI: https://doi.org/10.1007/978-3-030-92273-3_35
Published: 05 December 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92272-6
Online ISBN: 978-3-030-92273-3
eBook Packages: Computer Science, Computer Science (R0)
