Matching Guided Distillation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12360)


Abstract

Feature distillation is an effective way to improve the performance of a smaller student model, which has fewer parameters and a lower computation cost than the larger teacher model. Unfortunately, there is a common obstacle: the gap in semantic feature structure between the intermediate features of the teacher and the student. The classic scheme transforms the intermediate features by adding an adaptation module, such as a naive convolutional layer, an attention-based module, or something more complicated. However, this introduces two problems: (a) the adaptation module brings more parameters into training; (b) an adaptation module with random initialization or a special transformation is not friendly for distilling a pre-trained student. In this paper, we present Matching Guided Distillation (MGD), an efficient and parameter-free way to solve these problems. The key idea of MGD is to pose the matching of teacher channels with student channels as an assignment problem. We compare three solutions to the assignment problem for reducing channels from teacher features under a partial distillation loss. The overall training takes a coordinate-descent approach that alternates between two optimization objectives: assignment updates and parameter updates. Since MGD contains only normalization or pooling operations with negligible computation cost, it can be flexibly plugged into networks together with other distillation methods. The project site is
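To make the alternating scheme concrete, below is a minimal sketch (not the authors' released code) of one way to realize it in PyTorch: summarize each channel with a pooled descriptor, match student channels to teacher channels by solving a rectangular assignment problem with the Hungarian algorithm, and alternate between refreshing the assignment and updating the student with a partial distillation loss on the matched channels. The feature hooks (`teacher_features`, `student_forward`), the channel descriptor, the loss weight, and the refresh schedule are all illustrative assumptions.

```python
# Minimal MGD-style sketch (PyTorch + SciPy). Names marked "assumed"
# are illustrative placeholders, not from the paper's released code.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def channel_cost(f_t, f_s):
    """Pairwise L2 cost between teacher and student channels.

    f_t: (N, Ct, H, W) teacher features; f_s: (N, Cs, H, W) student
    features. Each channel is summarized by its spatially pooled
    responses across the batch (an assumed, simple descriptor).
    """
    t = f_t.mean(dim=(2, 3)).t()   # (Ct, N) teacher channel descriptors
    s = f_s.mean(dim=(2, 3)).t()   # (Cs, N) student channel descriptors
    return torch.cdist(t, s)       # (Ct, Cs) cost matrix


def update_assignment(f_t, f_s):
    """Match each student channel to one teacher channel (Hungarian).

    With Ct > Cs, the rectangular assignment keeps only Cs teacher
    channels, which is how this sketch "reduces" the teacher side.
    """
    cost = channel_cost(f_t, f_s).detach().cpu().numpy()
    t_idx, s_idx = linear_sum_assignment(cost)
    return t_idx, s_idx


def partial_distill_loss(f_t, f_s, t_idx, s_idx):
    """Distillation loss computed only on the matched channel pairs."""
    return F.mse_loss(f_s[:, s_idx], f_t[:, t_idx])


def train_mgd(student_forward, teacher_features, loader, optimizer,
              refresh_every=500, alpha=0.1):
    """Coordinate descent: refresh the assignment periodically,
    otherwise hold it fixed and update student parameters.
    `student_forward` returns (logits, features); all arguments and
    the schedule/weight defaults are assumed for illustration."""
    t_idx = s_idx = None
    for step, (x, y) in enumerate(loader):
        with torch.no_grad():
            f_t = teacher_features(x)      # frozen teacher features
        logits, f_s = student_forward(x)   # student logits + features
        if step % refresh_every == 0:
            t_idx, s_idx = update_assignment(f_t, f_s)
        loss = F.cross_entropy(logits, y) + alpha * partial_distill_loss(
            f_t, f_s, t_idx, s_idx)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Note that the paper itself reduces teacher channels with normalization or pooling operations at negligible cost and compares three assignment solvers; the sketch above only illustrates the coordinate-descent structure between assignment updates and parameter updates.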

Supplementary material

504470_1_En_19_MOESM1_ESM.pdf: Supplementary material 1 (PDF, 15.8 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

Algorithm Research, Aibee Inc., Beijing, China
