Abstract
Large pre-trained language models have demonstrated strong performance on natural language processing tasks. However, their massive number of parameters and slow inference speed make it challenging to deploy them on resource-constrained devices. Existing knowledge distillation methods transfer knowledge point-to-point, which limits the student's ability to learn higher-level semantic knowledge from the teacher network. In this paper, we propose Representation and Relation Distillation with Data Augmentation (2RDA), a novel knowledge distillation framework. Unlike previous methods, 2RDA introduces an improved contrastive distillation loss for augmented data, addressing the problem that data augmentation during fine-tuning on downstream tasks can cause positive and negative sample pairs to be mislabeled for contrastive learning. In addition, we guide the student model to acquire structural knowledge by distilling the relational knowledge between samples in a mini-batch through a distance loss. 2RDA surpasses state-of-the-art model compression methods on the GLUE benchmark, demonstrating the effectiveness of our approach.
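To make the two distillation signals concrete, the PyTorch sketch below shows one plausible form of (i) a contrastive distillation loss in which augmented copies of the same sentence are masked out of the negative set, and (ii) an RKD-style pairwise-distance loss over a mini-batch. It is an illustration only, not the authors' released implementation; the function names, the temperature value, the use of sample IDs to mark augmented views, and the student-to-teacher projection are all assumptions.

```python
# Minimal sketch of the two loss terms described in the abstract.
# All names and hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_repr, teacher_repr, sample_ids, temperature=0.07):
    """InfoNCE-style loss between student and teacher sentence representations.

    The mini-batch may contain augmented copies of the same original sentence;
    `sample_ids` marks which rows share a source, so those pairs are excluded
    from the negatives instead of being wrongly treated as negative pairs.
    """
    s = F.normalize(student_repr, dim=-1)            # (B, d)
    t = F.normalize(teacher_repr, dim=-1)            # (B, d)
    logits = s @ t.T / temperature                   # (B, B) similarity matrix

    same_source = sample_ids.unsqueeze(0) == sample_ids.unsqueeze(1)
    diag = torch.eye(len(s), dtype=torch.bool, device=s.device)
    # Mask augmented views of the same sentence (off-diagonal) so they are not negatives.
    logits = logits.masked_fill(same_source & ~diag, float('-inf'))

    targets = torch.arange(len(s), device=s.device)  # positive is the matching teacher row
    return F.cross_entropy(logits, targets)

def relational_distance_loss(student_repr, teacher_repr):
    """RKD-style loss: match the pairwise distance structure within the mini-batch."""
    def normalized_pdist(x):
        d = torch.cdist(x, x, p=2)
        return d / (d[d > 0].mean() + 1e-8)          # scale-invariant distances
    return F.smooth_l1_loss(normalized_pdist(student_repr), normalized_pdist(teacher_repr))

if __name__ == "__main__":
    # Random features stand in for [CLS] representations of a batch with two
    # augmented views per sentence; dimensions are hypothetical.
    B, d_student, d_teacher = 8, 312, 768
    proj = torch.nn.Linear(d_student, d_teacher)     # project student dim to teacher dim
    student = proj(torch.randn(B, d_student))
    teacher = torch.randn(B, d_teacher)
    ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
    loss = contrastive_distillation_loss(student, teacher, ids) \
         + relational_distance_loss(student, teacher)
    print(loss.item())
```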
Acknowledgement
This research is supported by the National Key R&D Program of China (No. 2022YFB3904700), the Industrial Internet Innovation and Development Project in 2021 (TC210A02M, TC210804D), and the Opening Project of the Beijing Key Laboratory of Mobile Computing and Pervasive Device.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, X., Ye, J. (2023). 2RDA: Representation and Relation Distillation with Data Augmentation. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. Lecture Notes in Computer Science, vol 14261. Springer, Cham. https://doi.org/10.1007/978-3-031-44198-1_1
DOI: https://doi.org/10.1007/978-3-031-44198-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44197-4
Online ISBN: 978-3-031-44198-1
eBook Packages: Computer Science (R0)