Knowledge Transfer via Dense Cross-Layer Mutual-Distillation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12360)


Knowledge Distillation (KD) methods adopt a one-way Knowledge Transfer (KT) scheme in which a pre-trained high-capacity teacher network guides the training of a lower-capacity student network. Recently, Deep Mutual Learning (DML) presented a two-way KT strategy, showing that the student network can also help improve the teacher network. In this paper, we propose Dense Cross-layer Mutual-distillation (DCM), an improved two-way KT method in which the teacher and student networks are trained collaboratively from scratch. To augment knowledge representation learning, well-designed auxiliary classifiers are added to certain hidden layers of both the teacher and student networks. To boost KT performance, we introduce dense bidirectional KD operations between the layers appended with classifiers. After training, all auxiliary classifiers are discarded, so the final models contain no extra parameters. We test our method on a variety of KT tasks, showing its superiority over related methods. Code is available at
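The dense bidirectional KD idea described in the abstract can be illustrated with a minimal sketch for a single training sample. This is not the authors' implementation: the function name `dcm_losses`, the use of KL divergence as the distillation term, the temperature, and the uniform `kd_weight` are all assumptions made for illustration. Each network contributes one logit vector per classifier (auxiliary and final), every classifier is trained on the label, and every teacher classifier is paired with every student classifier in both directions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between the softmax of logits and a hard label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dcm_losses(teacher_logits, student_logits, label,
               temperature=2.0, kd_weight=1.0):
    """Return (teacher_loss, student_loss) for one sample.

    teacher_logits / student_logits: one logit vector per classifier
    (auxiliary classifiers plus the final one). The loss form, the
    temperature and kd_weight are assumptions, not values from the paper.
    """
    t_probs = [softmax(z, temperature) for z in teacher_logits]
    s_probs = [softmax(z, temperature) for z in student_logits]

    # Supervised term: every classifier of each network sees the label.
    t_loss = sum(cross_entropy(z, label) for z in teacher_logits)
    s_loss = sum(cross_entropy(z, label) for z in student_logits)

    # Dense bidirectional term: all teacher/student classifier pairs,
    # covering same-stage and cross-stage combinations alike.
    for p_t in t_probs:
        for p_s in s_probs:
            s_loss += kd_weight * kl_divergence(p_t, p_s)  # teacher -> student
            t_loss += kd_weight * kl_divergence(p_s, p_t)  # student -> teacher
    return t_loss, s_loss
```

In an actual training loop, each network would presumably treat the other network's soft predictions as fixed targets when computing gradients, and the auxiliary classifiers would be dropped at inference time, as the abstract states.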


Knowledge Distillation, Deep supervision, Convolutional Neural Network, Image classification

Supplementary material

Supplementary material 1 (pdf 269 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Intel Labs, Beijing, China
