Abstract
The neural tangent kernel (NTK) arose from applying limiting arguments to the theory of neural networks: it is defined from a neural network model in the infinite-width limit trained by gradient descent. Such over-parameterized models achieve good test accuracy in experiments, and the success of the NTK underscores not only the importance of describing neural network models in the width limit \(h \to \infty\), but also the value of further developing deep learning theory for gradient flow in the step-size limit \(\eta \to 0\). Moreover, the NTK applies broadly across machine learning models. This review provides a comprehensive overview of the development of NTKs. First, we introduce the bias–variance tradeoff from statistics, the now-ubiquitous over-parameterization and gradient descent in deep learning, and the widely used kernel method. Second, we trace the research on the infinite-width limit of networks that led to the introduction of the NTK, and discuss the development and latest progress of NTK theory. Finally, we present work on transferring NTKs to neural networks of other architectures and on applying NTKs to other areas of machine learning.
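To make the object under review concrete: for a parameterized model \(f(x;\theta)\), the (empirical) NTK is \(K(x, x') = \langle \nabla_\theta f(x;\theta), \nabla_\theta f(x';\theta) \rangle\), and in the infinite-width limit this kernel concentrates around a deterministic value at initialization. The following is a minimal illustrative sketch (not from the paper; the function name and setup are our own) that computes the empirical NTK of a two-layer ReLU network under the NTK parameterization \(f(x) = \frac{1}{\sqrt{m}} a^\top \mathrm{relu}(Wx)\), using the closed-form gradients rather than autodiff:

```python
import numpy as np

def empirical_ntk(X, m=10000, seed=0):
    """Empirical NTK of f(x) = (1/sqrt(m)) * a . relu(W x) at random init.

    Returns K with K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>,
    summing the contributions of the output weights a and hidden weights W.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((m, d))    # hidden weights, i.i.d. N(0, 1)
    a = rng.standard_normal(m)         # output weights, i.i.d. N(0, 1)

    pre = X @ W.T                      # (n, m) pre-activations w_i . x
    act = np.maximum(pre, 0.0)         # relu(w_i . x)
    dact = (pre > 0).astype(float)     # relu'(w_i . x)

    # d f / d a_i = relu(w_i . x) / sqrt(m): contributes (act @ act.T) / m.
    K_a = act @ act.T / m
    # d f / d w_i = (a_i / sqrt(m)) relu'(w_i . x) x: the inner product over
    # i and input dims gives (x . x') * sum_i a_i^2 relu'_i relu'_i / m.
    K_w = (X @ X.T) * ((dact * a**2) @ dact.T) / m
    return K_a + K_w
```

As the width `m` grows, `empirical_ntk` concentrates around its deterministic infinite-width value; for this parameterization the limit satisfies \(K(x, x) = \lVert x \rVert^2\), which a quick numerical check at large `m` reproduces to within sampling error.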
Acknowledgements
We thank all participants for their contributions to this study.
Contributions
Tan wrote the main manuscript text. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The corresponding author confirms having read the journal's policies, and this manuscript is submitted in accordance with them. The authors declare that they have no competing interests as defined by Springer, nor any other interests that might be perceived to influence the results and/or discussion reported in this paper.
Data availability
All of the material is owned by the authors, and/or no permissions are required.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Tan, Y., Liu, H. How does a kernel based on gradients of infinite-width neural networks come to be widely used: a review of the neural tangent kernel. Int J Multimed Info Retr 13, 8 (2024). https://doi.org/10.1007/s13735-023-00318-0