Prime-Aware Adaptive Distillation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12364)

Abstract

Knowledge distillation (KD) aims to improve the performance of a student network by mimicking the knowledge of a powerful teacher network. Existing methods focus on what knowledge should be transferred and treat all samples equally during training. This paper introduces adaptive sample weighting to KD. We find that hard-mining methods, though effective in other settings, are not appropriate for distillation. We therefore propose Prime-Aware Adaptive Distillation (PAD), which incorporates uncertainty learning: PAD identifies the prime samples in distillation and adaptively emphasizes their effect. PAD is fundamentally different from existing methods and can refine them through this novel view of unequal training. As a result, PAD is versatile and applies to various tasks, including classification, metric learning, and object detection. Across ten teacher-student combinations on six datasets, PAD improves existing distillation methods and outperforms recent state-of-the-art methods.
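The core idea of the abstract — weighting each sample's distillation loss by a learned per-sample uncertainty so that reliable ("prime") samples dominate training — can be sketched as follows. This is an illustrative reconstruction in the spirit of heteroscedastic-uncertainty losses, not the authors' exact formulation; the function name and signature are hypothetical.

```python
import math


def pad_weighted_loss(student_feats, teacher_feats, log_vars):
    """Hypothetical sketch of an uncertainty-weighted distillation loss.

    student_feats / teacher_feats: lists of per-sample feature vectors.
    log_vars: predicted per-sample log-variances. A sample with low predicted
    variance ("prime" sample) receives a large weight exp(-log_var); the
    additive log_var term keeps the network from predicting huge variances
    for every sample just to shrink the loss.
    """
    total = 0.0
    for s, t, lv in zip(student_feats, teacher_feats, log_vars):
        # Per-sample mean squared error between student and teacher features.
        mse = sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(s)
        # Down-weight uncertain samples; regularize the predicted variance.
        total += math.exp(-lv) * mse + lv
    return total / len(log_vars)
```

With all log-variances at zero this reduces to the plain mean squared feature-mimicking loss; as a sample's predicted log-variance grows, its imitation error contributes less, which is the "unequal training" behavior the paper advocates.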

Keywords

Knowledge distillation · Adaptive weighting · Uncertainty learning

Notes

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700800. We thank Xiruo Tang for her help with paper writing.

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Megvii Inc., Beijing, China
  2. University of Science and Technology of China, Hefei, China
  3. Tongji University, Shanghai, China
