
Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

  • Conference paper

Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12979)


Abstract

Many real-world applications use Optical Character Recognition (OCR) engines to transform handwritten images into transcripts on which downstream Natural Language Processing (NLP) models are applied. In this process, OCR engines may introduce errors, so the inputs to downstream NLP models become noisy. Although pre-trained models achieve state-of-the-art performance on many NLP benchmarks, we show that they are not robust to noisy texts generated by real OCR engines. This greatly limits the application of NLP models in real-world scenarios. To improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. In most cases, however, only labelled clean texts are available. Since there are no handwritten images corresponding to the texts, it is impossible to use a recognition model directly to obtain noisy labelled data. Human annotators can be employed to copy the texts by hand and photograph them, but this is extremely expensive given the amount of data needed for model training. Consequently, we are interested in making NLP models intrinsically robust to OCR errors in a low-resource manner. We propose a novel robust training framework that (1) employs simple but effective methods to directly simulate natural OCR noise from clean texts, (2) iteratively mines hard examples from a large pool of simulated samples for optimal performance, and (3) employs a stability loss so that the model learns noise-invariant representations. Experiments on three real-world datasets show that the proposed framework boosts the robustness of pre-trained models by a large margin. Although the algorithms we use are simple and straightforward, we believe this work can greatly promote the application of NLP models in real scenarios. We make our code and three datasets publicly available (https://github.com/tal-ai/Robust-learning-MSSHEM).
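The abstract's three components can be made concrete with a short sketch. The following is a minimal, hypothetical Python/PyTorch rendering of (1) character-level OCR noise simulation, (2) loss-based hard example mining, and (3) a stability loss; the function names, the confusion-table format, the KL-divergence form of the stability term, and the weight `alpha` are illustrative assumptions, not details taken from the paper.

```python
import random
import torch
import torch.nn.functional as F

def simulate_ocr_noise(text, p=0.05, confusions=None):
    """Corrupt clean text with OCR-like edits (a generic illustration,
    not the paper's exact multi-source simulator). `confusions` maps a
    character to a sequence of visually similar characters."""
    confusions = confusions or {}
    out = []
    for ch in text:
        r = random.random()
        if r < p / 3:
            continue                                        # deletion
        elif r < 2 * p / 3:
            out.append(random.choice(confusions.get(ch, ch)))  # substitution
        elif r < p:
            out.append(ch)
            out.append(ch)                                  # spurious insertion
        else:
            out.append(ch)
    return "".join(out)

def mine_hard_examples(model, encode, texts, labels, k):
    """Keep the k simulated samples the current model gets most wrong,
    ranked by per-sample cross-entropy (one common mining criterion)."""
    k = min(k, len(texts))
    model.eval()
    with torch.no_grad():
        logits = model(encode(texts))
        losses = F.cross_entropy(logits, labels, reduction="none")
    hard = losses.topk(k).indices
    return [texts[i] for i in hard], labels[hard]

def stability_loss(logits_clean, logits_noisy, labels, alpha=1.0):
    """Task loss on the clean input plus a term that pulls the
    noisy-input distribution toward the clean one, encouraging
    noise-invariant representations. `alpha` is an assumed weight."""
    task = F.cross_entropy(logits_clean, labels)
    stab = F.kl_div(F.log_softmax(logits_noisy, dim=-1),
                    F.softmax(logits_clean, dim=-1),
                    reduction="batchmean")
    return task + alpha * stab
```

In a full training loop, each clean batch would be paired with its simulated noisy counterpart, and mining would be re-run periodically over the simulated pool, matching the iterative scheme the abstract describes.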


Notes

  1. https://translate.google.cn.

  2. https://www.hw99.com/index.php.

  3. https://ai.100tal.com/product/ocr-hr.

  4. Parallel data do not have task-specific labels, so they are not used as training data.

  5. https://github.com/ymcui/Chinese-BERT-wwm.



Acknowledgment

This work was supported in part by the National Key R&D Program of China under Grant No. 2020AAA0104500 and in part by the Beijing Nova Program (Z201100006820068) from the Beijing Municipal Science & Technology Commission.

Author information


Corresponding author

Correspondence to Wenbiao Ding.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Xu, G., Ding, W., Fu, W., Wu, Z., Liu, Z. (2021). Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining. In: Dong, Y., Kourtellis, N., Hammer, B., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12979. Springer, Cham. https://doi.org/10.1007/978-3-030-86517-7_18


  • DOI: https://doi.org/10.1007/978-3-030-86517-7_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86516-0

  • Online ISBN: 978-3-030-86517-7

  • eBook Packages: Computer Science, Computer Science (R0)
