
MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12359)

Abstract

Knowledge Distillation (KD) has been one of the most popular methods for learning a compact model. However, it still suffers from the high demand for time and computational resources imposed by its sequential training pipeline. Furthermore, the soft targets produced by deeper models often do not serve as good cues for shallower models because of the compatibility gap between them. In this work, we address both problems at once. Specifically, we propose to generate better, more compatible soft targets with a label generator that fuses the feature maps from deeper stages in a top-down manner, and we employ meta-learning to optimize this label generator. Using the soft targets learned from the intermediate feature maps of the model, we achieve better self-boosting of the network than the state-of-the-art. Experiments are conducted on two standard classification benchmarks, CIFAR-100 and ILSVRC2012, across various network architectures to show the generalizability of our MetaDistiller. The results on both datasets demonstrate the effectiveness of our method.
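
Below is a minimal, hypothetical PyTorch sketch of the top-down label generator described above. The module names, the shared channel width, the fusion rule, and the distillation loss are illustrative assumptions; the code only shows the mechanism of fusing deeper-stage feature maps top-down into per-stage soft targets, not the authors' implementation.

```python
# Hypothetical sketch of a top-down label generator (not the authors' code).
# Assumptions: each student stage exposes a feature map; 1x1 convolutions
# project every stage to the channel width of the deepest stage; deeper
# fused features are upsampled and added to shallower ones.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopDownLabelGenerator(nn.Module):
    def __init__(self, stage_channels, num_classes):
        super().__init__()
        width = stage_channels[-1]
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, width, kernel_size=1) for c in stage_channels
        )
        self.heads = nn.ModuleList(
            nn.Linear(width, num_classes) for _ in stage_channels
        )

    def forward(self, feats):
        # feats: list of stage feature maps ordered shallow -> deep.
        fused = None
        soft_targets = []
        for i in range(len(feats) - 1, -1, -1):  # walk deep -> shallow
            x = self.laterals[i](feats[i])
            if fused is not None:
                # Propagate deeper context downwards before producing targets.
                fused = x + F.interpolate(fused, size=x.shape[-2:], mode="nearest")
            else:
                fused = x
            pooled = F.adaptive_avg_pool2d(fused, 1).flatten(1)
            soft_targets.append(self.heads[i](pooled))
        return list(reversed(soft_targets))  # shallow -> deep


# Usage sketch: distill each intermediate (side) output of the student toward
# the corresponding generated soft target with a temperature-scaled KL loss.
def distillation_loss(side_logits, soft_targets, T=4.0):
    loss = 0.0
    for s, t in zip(side_logits, soft_targets):
        loss = loss + F.kl_div(
            F.log_softmax(s / T, dim=1),
            F.softmax(t.detach() / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
    return loss
```

The meta-learning step described in the abstract would then update the generator's parameters through the student's loss on held-out data after a virtual training step; that bi-level update is omitted from this sketch for brevity.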

Keywords

Knowledge distillation · Meta learning

Notes

Acknowledgement

Cho-Jui Hsieh would like to acknowledge the support by NSF IIS-1719097, Intel, and Facebook. This work was also supported by the National Key Research and Development Program of China under Grant 2017YFA0700802, National Natural Science Foundation of China under Grant 61822603, Grant U1813218, Grant U1713214, and Grant 61672306, Beijing Natural Science Foundation under Grant L172051, Beijing Academy of Artificial Intelligence, a grant from the Institute for Guo Qiang, Tsinghua University, the Shenzhen Fundamental Research Fund (Subject Arrangement) under Grant JCYJ20170412170602564, and Tsinghua University Initiative Scientific Research Program.

Supplementary material

Supplementary material 1: 504468_1_En_41_MOESM1_ESM.zip (13 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of California, Los Angeles, Los Angeles, USA
  2. Tsinghua University, Beijing, China
