
Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

  • Conference paper

Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12979)


Abstract

Many real-world applications use Optical Character Recognition (OCR) engines to transform handwritten images into transcripts on which downstream Natural Language Processing (NLP) models are applied. In this process, OCR engines may introduce errors, so the inputs to downstream NLP models become noisy. Although pre-trained models achieve state-of-the-art performance on many NLP benchmarks, we show that they are not robust to noisy texts generated by real OCR engines. This greatly limits the application of NLP models in real-world scenarios. To improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. In most cases, however, only labelled clean texts are available. Since there are no handwritten images corresponding to the texts, it is impossible to use a recognition model directly to obtain noisy labelled data. Human annotators can be employed to copy the texts by hand and photograph them, but this is extremely expensive given the amount of data needed for model training. Consequently, we are interested in making NLP models intrinsically robust to OCR errors in a low-resource manner. We propose a novel robust training framework that (1) employs simple but effective methods to directly simulate natural OCR noise from clean texts, (2) iteratively mines hard examples from a large pool of simulated samples for optimal performance, and (3) employs a stability loss so that the model learns noise-invariant representations. Experiments on three real-world datasets show that the proposed framework boosts the robustness of pre-trained models by a large margin. Although the algorithms we use are simple and straightforward, we believe this work can greatly promote the application of NLP models in real scenarios. We make our code and three datasets publicly available (https://github.com/tal-ai/Robust-learning-MSSHEM).
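The abstract's three components can be made concrete with a short sketch. The following is a minimal, hypothetical Python/PyTorch rendering of (1) character-level OCR noise simulation, (2) loss-based hard example mining, and (3) a stability loss; the function names, the confusion-table format, the KL-divergence form of the stability term, and the weight `alpha` are illustrative assumptions, not details taken from the paper.

```python
import random
import torch
import torch.nn.functional as F

def simulate_ocr_noise(text, p=0.05, confusions=None):
    """Corrupt clean text with OCR-like edits (a generic illustration,
    not the paper's exact multi-source simulator). `confusions` maps a
    character to a sequence of visually similar characters."""
    confusions = confusions or {}
    out = []
    for ch in text:
        r = random.random()
        if r < p / 3:
            continue                                        # deletion
        elif r < 2 * p / 3:
            out.append(random.choice(confusions.get(ch, ch)))  # substitution
        elif r < p:
            out.append(ch)
            out.append(ch)                                  # spurious insertion
        else:
            out.append(ch)
    return "".join(out)

def mine_hard_examples(model, encode, texts, labels, k):
    """Keep the k simulated samples the current model gets most wrong,
    ranked by per-sample cross-entropy (one common mining criterion)."""
    k = min(k, len(texts))
    model.eval()
    with torch.no_grad():
        logits = model(encode(texts))
        losses = F.cross_entropy(logits, labels, reduction="none")
    hard = losses.topk(k).indices
    return [texts[i] for i in hard], labels[hard]

def stability_loss(logits_clean, logits_noisy, labels, alpha=1.0):
    """Task loss on the clean input plus a term that pulls the
    noisy-input distribution toward the clean one, encouraging
    noise-invariant representations. `alpha` is an assumed weight."""
    task = F.cross_entropy(logits_clean, labels)
    stab = F.kl_div(F.log_softmax(logits_noisy, dim=-1),
                    F.softmax(logits_clean, dim=-1),
                    reduction="batchmean")
    return task + alpha * stab
```

In a full training loop, each clean batch would be paired with its simulated noisy counterpart, and mining would be re-run periodically over the simulated pool, matching the iterative scheme the abstract describes.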


Notes

  1. https://translate.google.cn.

  2. https://www.hw99.com/index.php.

  3. https://ai.100tal.com/product/ocr-hr.

  4. Parallel data do not have task-specific labels, so they are not used as training data.

  5. https://github.com/ymcui/Chinese-BERT-wwm.



Acknowledgment

This work was supported in part by the National Key R&D Program of China under Grant No. 2020AAA0104500 and in part by the Beijing Nova Program (Z201100006820068) from the Beijing Municipal Science & Technology Commission.

Author information


Corresponding author

Correspondence to Wenbiao Ding.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Xu, G., Ding, W., Fu, W., Wu, Z., Liu, Z. (2021). Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining. In: Dong, Y., Kourtellis, N., Hammer, B., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12979. Springer, Cham. https://doi.org/10.1007/978-3-030-86517-7_18


  • DOI: https://doi.org/10.1007/978-3-030-86517-7_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86516-0

  • Online ISBN: 978-3-030-86517-7

  • eBook Packages: Computer Science, Computer Science (R0)
