Abstract
Large pre-trained language models have demonstrated strong performance on natural language processing tasks. However, their massive number of parameters and slow inference speed make it challenging to deploy them on resource-constrained devices. Existing knowledge distillation methods transfer knowledge point-to-point, which limits the student's ability to learn higher-level semantic knowledge from the teacher network. In this paper, we propose Representation and Relation Distillation with Data Augmentation (2RDA), a novel knowledge distillation framework. Unlike previous methods, 2RDA introduces an improved contrastive distillation loss for augmented data, addressing the problem that data augmentation during fine-tuning on downstream tasks can cause positive and negative sample pairs to be mislabeled for contrastive learning. In addition, we guide the student model to acquire structural knowledge by distilling the relational knowledge between samples in a mini-batch through a distance loss. 2RDA surpasses state-of-the-art model compression methods on the GLUE benchmark, demonstrating the effectiveness of our approach.
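To make the two distillation signals concrete, the PyTorch sketch below shows one plausible form of (i) a contrastive distillation loss in which augmented copies of the same sentence are masked out of the negative set, and (ii) an RKD-style pairwise-distance loss over a mini-batch. It is an illustration only, not the authors' released implementation; the function names, the temperature value, the use of sample IDs to mark augmented views, and the student-to-teacher projection are all assumptions.

```python
# Minimal sketch of the two loss terms described in the abstract.
# All names and hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_repr, teacher_repr, sample_ids, temperature=0.07):
    """InfoNCE-style loss between student and teacher sentence representations.

    The mini-batch may contain augmented copies of the same original sentence;
    `sample_ids` marks which rows share a source, so those pairs are excluded
    from the negatives instead of being wrongly treated as negative pairs.
    """
    s = F.normalize(student_repr, dim=-1)            # (B, d)
    t = F.normalize(teacher_repr, dim=-1)            # (B, d)
    logits = s @ t.T / temperature                   # (B, B) similarity matrix

    same_source = sample_ids.unsqueeze(0) == sample_ids.unsqueeze(1)
    diag = torch.eye(len(s), dtype=torch.bool, device=s.device)
    # Mask augmented views of the same sentence (off-diagonal) so they are not negatives.
    logits = logits.masked_fill(same_source & ~diag, float('-inf'))

    targets = torch.arange(len(s), device=s.device)  # positive is the matching teacher row
    return F.cross_entropy(logits, targets)

def relational_distance_loss(student_repr, teacher_repr):
    """RKD-style loss: match the pairwise distance structure within the mini-batch."""
    def normalized_pdist(x):
        d = torch.cdist(x, x, p=2)
        return d / (d[d > 0].mean() + 1e-8)          # scale-invariant distances
    return F.smooth_l1_loss(normalized_pdist(student_repr), normalized_pdist(teacher_repr))

if __name__ == "__main__":
    # Random features stand in for [CLS] representations of a batch with two
    # augmented views per sentence; dimensions are hypothetical.
    B, d_student, d_teacher = 8, 312, 768
    proj = torch.nn.Linear(d_student, d_teacher)     # project student dim to teacher dim
    student = proj(torch.randn(B, d_student))
    teacher = torch.randn(B, d_teacher)
    ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
    loss = contrastive_distillation_loss(student, teacher, ids) \
         + relational_distance_loss(student, teacher)
    print(loss.item())
```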
Acknowledgement
This research is supported by the National Key R&D Program of China (No. 2022YFB3904700), the Industrial Internet Innovation and Development Project in 2021 (TC210A02M, TC210804D), and the Opening Project of the Beijing Key Laboratory of Mobile Computing and Pervasive Device.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, X., Ye, J. (2023). 2RDA: Representation and Relation Distillation with Data Augmentation. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. Lecture Notes in Computer Science, vol 14261. Springer, Cham. https://doi.org/10.1007/978-3-031-44198-1_1
DOI: https://doi.org/10.1007/978-3-031-44198-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44197-4
Online ISBN: 978-3-031-44198-1
eBook Packages: Computer Science (R0)