
How does a kernel based on gradients of infinite-width neural networks come to be widely used: a review of the neural tangent kernel

  • Trends and Surveys
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

The neural tangent kernel (NTK) emerged from the use of limiting arguments to study the theory of neural networks. The NTK is defined from a neural network model trained by gradient descent in the infinite-width limit. Such over-parameterized models achieve good test accuracy in experiments, and the success of the NTK underlines not only the value of describing neural network models in the width limit \(h \to \infty\), but also the further development of deep learning theory for gradient flow in the step-size limit \(\eta \to 0\). Moreover, the NTK can be applied across a wide range of machine learning models. This review provides a comprehensive overview of the development of the NTK. First, it introduces the bias-variance tradeoff in statistics, the widely adopted over-parameterization and gradient descent in deep learning, and the kernel method. Second, it traces research on the infinite-width limit of networks and the introduction of the NTK, and discusses the development and latest progress of NTK theory. Finally, it presents extensions of the NTK to neural networks with other architectures and applications of the NTK to other areas of machine learning.
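For orientation, a minimal sketch of the object under review follows, written in the standard notation of the NTK literature rather than taken verbatim from this article. For a network output \(f(x;\theta)\) with parameters \(\theta\), the empirical NTK evaluated at inputs \(x\) and \(x'\) is

\[\hat{\Theta}_\theta(x,x') = \nabla_\theta f(x;\theta)^{\top}\, \nabla_\theta f(x';\theta),\]

and under gradient flow on a training loss \(L\) over samples \(x_1,\dots,x_n\), the network predictions evolve as

\[\frac{\mathrm{d}}{\mathrm{d}t} f(x;\theta_t) = -\sum_{i=1}^{n} \hat{\Theta}_{\theta_t}(x,x_i)\, \frac{\partial L}{\partial f(x_i;\theta_t)}.\]

In the infinite-width limit \(h \to \infty\) with a suitable parameterization, \(\hat{\Theta}_{\theta_t}\) converges to a deterministic kernel that remains essentially constant throughout training, so learning with a squared loss reduces to kernel regression with that fixed kernel.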




Acknowledgements

We thank all participants for their contributions to this study.

Author information


Contributions

Tan wrote the main manuscript text. All authors reviewed the manuscript.

Corresponding author

Correspondence to Haizhong Liu.

Ethics declarations

Conflict of interest

I confirm that the corresponding author has read the journal policies and that this manuscript was submitted in accordance with those policies. The authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Data availability

All of the material is owned by the authors, and/or no permissions are required.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tan, Y., Liu, H. How does a kernel based on gradients of infinite-width neural networks come to be widely used: a review of the neural tangent kernel. Int J Multimed Info Retr 13, 8 (2024). https://doi.org/10.1007/s13735-023-00318-0

