
How does a kernel based on gradients of infinite-width neural networks come to be widely used: a review of the neural tangent kernel

  • Trends and Surveys
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

The neural tangent kernel (NTK) emerged from the use of limiting arguments to study the theory of neural networks. The NTK is defined from a neural network model trained by gradient descent in the infinite-width limit. Such over-parameterized models achieve good test accuracy in experiments, and the success of the NTK underlines not only the value of describing neural network models in the width limit \(h \to \infty\), but also the further development of deep learning theory for gradient flow in the step-size limit \(\eta \to 0\). Moreover, the NTK can be applied across a wide range of machine learning models. This review provides a comprehensive overview of the development of the NTK. First, it introduces the bias-variance tradeoff in statistics, the widely adopted over-parameterization and gradient descent in deep learning, and the kernel method. Second, it traces research on the infinite-width limit of networks and the introduction of the NTK, and discusses the development and latest progress of NTK theory. Finally, it presents extensions of the NTK to neural networks with other architectures and applications of the NTK to other areas of machine learning.
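For orientation, a minimal sketch of the object under review follows, written in the standard notation of the NTK literature rather than taken verbatim from this article. For a network output \(f(x;\theta)\) with parameters \(\theta\), the empirical NTK evaluated at inputs \(x\) and \(x'\) is

\[\hat{\Theta}_\theta(x,x') = \nabla_\theta f(x;\theta)^{\top}\, \nabla_\theta f(x';\theta),\]

and under gradient flow on a training loss \(L\) over samples \(x_1,\dots,x_n\), the network predictions evolve as

\[\frac{\mathrm{d}}{\mathrm{d}t} f(x;\theta_t) = -\sum_{i=1}^{n} \hat{\Theta}_{\theta_t}(x,x_i)\, \frac{\partial L}{\partial f(x_i;\theta_t)}.\]

In the infinite-width limit \(h \to \infty\) with a suitable parameterization, \(\hat{\Theta}_{\theta_t}\) converges to a deterministic kernel that remains essentially constant throughout training, so learning with a squared loss reduces to kernel regression with that fixed kernel.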




Acknowledgements

We thank all participants for their contributions to this study.

Author information


Contributions

Tan wrote the main manuscript text. All authors reviewed the manuscript.

Corresponding author

Correspondence to Haizhong Liu.

Ethics declarations

Conflict of interest

I confirm that the corresponding author has read the journal policies and that this manuscript was submitted in accordance with those policies. The authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Data availability

All of the material is owned by the authors, and/or no permissions are required.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tan, Y., Liu, H. How does a kernel based on gradients of infinite-width neural networks come to be widely used: a review of the neural tangent kernel. Int J Multimed Info Retr 13, 8 (2024). https://doi.org/10.1007/s13735-023-00318-0

