
Improved Architectures and Training Algorithms for Deep Operator Networks


Abstract

Operator learning techniques have recently emerged as a powerful tool for learning maps between infinite-dimensional Banach spaces. Trained under appropriate constraints, they can also be effective in learning the solution operator of partial differential equations (PDEs) in an entirely self-supervised manner. In this work we analyze the training dynamics of deep operator networks (DeepONets) through the lens of Neural Tangent Kernel theory, and reveal a bias that favors the approximation of functions with larger magnitudes. To correct this bias we propose to adaptively re-weight the importance of each training example, and demonstrate how this procedure can effectively balance the magnitude of back-propagated gradients during training via gradient descent. We also propose a novel network architecture that is more resilient to vanishing gradient pathologies. Taken together, our developments provide new insights into the training of DeepONets and consistently improve their predictive accuracy by a factor of 10-50x, demonstrated in the challenging setting of learning PDE solution operators in the absence of paired input-output observations. All code and data accompanying this manuscript will be made publicly available at https://github.com/PredictiveIntelligenceLab/ImprovedDeepONets.
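To make the proposed re-weighting idea concrete, the following is a minimal sketch (written against JAX [48]) of one simple way to balance per-example back-propagated gradient magnitudes. It illustrates the general principle only; it is not the exact algorithm analyzed in this paper, and the names per_example_loss_fn, balance_weights, and weighted_loss are placeholders introduced here for illustration.

    import jax
    import jax.numpy as jnp

    def balance_weights(per_example_loss_fn, params, batch, eps=1e-8):
        # Weight each training example inversely to the norm of its loss gradient,
        # so that no single example dominates the back-propagated signal.
        def grad_norm(example):
            g = jax.grad(lambda p: per_example_loss_fn(p, example))(params)
            leaves = jax.tree_util.tree_leaves(g)
            return jnp.sqrt(sum(jnp.vdot(x, x) for x in leaves))
        norms = jnp.stack([grad_norm(ex) for ex in batch])
        w = 1.0 / (norms + eps)
        return w / jnp.mean(w)  # normalize so the average weight equals one

    def weighted_loss(per_example_loss_fn, params, batch, weights):
        # The weights are treated as constants: no gradient flows through them.
        losses = jnp.stack([per_example_loss_fn(params, ex) for ex in batch])
        return jnp.mean(jax.lax.stop_gradient(weights) * losses)

In practice such weights would be recomputed periodically during training, so that the balancing tracks the evolving gradients.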


Data Availability

All methods needed to evaluate the conclusions in the paper are present in the paper and/or the Appendix. All code and data accompanying this manuscript will be made publicly available at https://github.com/PredictiveIntelligenceLab/ImprovedDeepONets.

References

  1. Lanthaler, S., Mishra, S., Karniadakis, G.E.: Error estimates for DeepONet: a deep learning framework in infinite dimensions. arXiv preprint arXiv:2102.09618 (2021)

  2. Kovachki, N., Lanthaler, S., Mishra, S.: On universal approximation and error bounds for Fourier neural operators. arXiv preprint arXiv:2107.07562 (2021)

  3. Yu, A., Becquey, C., Halikias, D., Mallory, M.E., Townsend, A.: Arbitrary-depth universal approximation theorems for operator neural networks. arXiv preprint arXiv:2109.11354 (2021)

  4. Lu, L., Jin, P., Pang, G., Zhang, Z., Karniadakis, G.E.: Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell. 3(3), 218–229 (2021)

  5. Kovachki, N., Li, Z., Liu, B., Azizzadenesheli, K., Bhattacharya, K., Stuart, A., Anandkumar, A.: Neural operator: learning maps between function spaces. arXiv preprint arXiv:2108.08481 (2021)

  6. Owhadi, H.: Do ideas have shape? Plato’s theory of forms as the continuous limit of artificial neural networks. arXiv preprint arXiv:2008.03920 (2020)

  7. Kadri, H., Duflos, E., Preux, P., Canu, S., Rakotomamonjy, A., Audiffren, J.: Operator-valued kernels for learning from functional response data. J. Mach. Learn. Res. 17(20), 1–54 (2016)

  8. Wang, S., Wang, H., Perdikaris, P.: Learning the solution operator of parametric partial differential equations with physics-informed DeepONets. Sci. Adv. 7(40), eabi8605 (2021)

  9. Wang, S., Perdikaris, P.: Long-time integration of parametric evolution equations with physics-informed DeepONets. arXiv preprint arXiv:2106.05384 (2021)

  10. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)

  11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)

  12. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, pp. 448–456. PMLR (2015)

  13. Salimans, T., Kingma, D.P.: Weight normalization: a simple reparameterization to accelerate training of deep neural networks. Adv. Neural Inf. Process. Syst. 29, 901–909 (2016)

  14. LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.R.: Efficient backprop. In: Montavon, G., Orr, G.B., Müller, K.R. (eds.) Neural networks: tricks of the trade, pp. 9–48. Springer, Berlin (2012)

  15. Di Leoni, P.C., Lu, L., Meneveau, C., Karniadakis, G., Zaki, T.A.: DeepONet prediction of linear instability waves in high-speed boundary layers. arXiv preprint arXiv:2105.08697 (2021)

  16. Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., Anandkumar, A.: Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895 (2020)

  17. Wang, S., Teng, Y., Perdikaris, P.: Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput. 43(5), A3055–A3081 (2021)

  18. Wang, S., Yu, X., Perdikaris, P.: When and why PINNs fail to train: a neural tangent kernel perspective. arXiv preprint arXiv:2007.14527 (2020)

  19. McClenny, L., Braga-Neto, U.: Self-adaptive physics-informed neural networks using a soft attention mechanism. arXiv preprint arXiv:2009.04544 (2020)

  20. Wang, S., Perdikaris, P.: Deep learning of free boundary and Stefan problems. J. Comput. Phys. 428, 109914 (2021)

  21. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571–8580 (2018)

  22. Du, S., Lee, J., Li, H., Wang, L., Zhai, X.: Gradient descent finds global minima of deep neural networks. In: International Conference on Machine Learning, pp. 1675–1685. PMLR (2019)

  23. Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization. In: International Conference on Machine Learning, pp. 242–252. PMLR (2019)

  24. Cao, Y., Fang, Z., Wu, Y., Zhou, D.-X., Gu, Q.: Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198 (2019)

  25. Xu, Z.-Q.J., Zhang, Y., Luo, T., Xiao, Y., Ma, Z.: Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523 (2019)

  26. Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., Courville, A.: On the spectral bias of neural networks. In: International Conference on Machine Learning, pp. 5301–5310 (2019)

  27. Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., Pennington, J.: Wide neural networks of any depth evolve as linear models under gradient descent. In: Advances in Neural Information Processing Systems, pp. 8572–8583 (2019)

  28. Wang, S., Wang, H., Perdikaris, P.: On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks. arXiv preprint arXiv:2012.10047 (2020)

  29. Chen, T., Chen, H.: Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Trans. Neural Netw. 6(4), 911–917 (1995)

  30. Baydin, A.G., Pearlmutter, B.A., Radul, A.A., Siskind, J.M.: Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res. 18, 1–43 (2018)

  31. Cai, S., Wang, Z., Lu, L., Zaki, T.A., Karniadakis, G.E.: DeepM&Mnet: inferring the electroconvection multiphysics fields based on operator approximation by neural networks. arXiv preprint arXiv:2009.12935 (2020)

  32. Iserles, A.: A First Course in the Numerical Analysis of Differential Equations, No. 44. Cambridge University Press (2009)

  33. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  34. Fort, S., Dziugaite, G.K., Paul, M., Kharaghani, S., Roy, D.M., Ganguli, S.: Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. arXiv preprint arXiv:2010.15110 (2020)

  35. Leclerc, G., Madry, A.: The two regimes of deep network training. arXiv preprint arXiv:2002.10376 (2020)

  36. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(7), 2121–2159 (2011)

  37. Cai, T., Gao, R., Hou, J., Chen, S., Wang, D., He, D., Zhang, Z., Wang, L.: Gram–Gauss–Newton method: learning overparameterized neural networks for regression problems. arXiv preprint arXiv:1905.11675 (2019)

  38. Zhang, G., Martens, J., Grosse, R.B.: Fast convergence of natural gradient descent for over-parameterized neural networks. In: Advances in Neural Information Processing Systems, 32 (2019)

  39. van den Brand, J., Peng, B., Song, Z., Weinstein, O.: Training (overparametrized) neural networks in near-linear time. arXiv preprint arXiv:2006.11648 (2020)

  40. Schoenholz, S.S., Gilmer, J., Ganguli, S., Sohl-Dickstein, J.: Deep information propagation. arXiv preprint arXiv:1611.01232 (2016)

  41. Yang, Y., Perdikaris, P.: Physics-informed deep generative models. arXiv preprint arXiv:1812.03511 (2018)

  42. Driscoll, T.A., Hale, N., Trefethen, L.N.: Chebfun Guide (2014)

  43. Cox, S.M., Matthews, P.C.: Exponential time differencing for stiff systems. J. Comput. Phys. 176(2), 430–455 (2002)

  44. Alnæs, M., Blechta, J., Hake, J., Johansson, A., Kehlet, B., Logg, A., Richardson, C., Ring, J., Rognes, M.E., Wells, G.N.: The FEniCS project version 1.5. Arch. Numer. Softw. 3(100), 9–23 (2015)

  45. Shin, Y., Darbon, J., Karniadakis, G.E.: On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs (2020)

  46. Mishra, S., Molinaro, R.: Estimates on the generalization error of physics informed neural networks (PINNs) for approximating PDEs. arXiv preprint arXiv:2006.16144 (2020)

  47. Mitusch, S.K., Funke, S.W., Dokken, J.S.: dolfin-adjoint 2018.1: automated adjoints for FEniCS and Firedrake. J. Open Source Softw. 4(38), 1292 (2019)

  48. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M.J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., Zhang, Q.: JAX: composable transformations of Python+NumPy programs (2018)

  49. Hunter, J.D.: Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9(3), 90–95 (2007)

  50. Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., et al.: Array programming with NumPy. Nature 585(7825), 357–362 (2020)


Acknowledgements

We would like to thank Dr. Mohamed Aziz Bhouri for preparing the data using FEniCS [44, 47]. We would also like to thank the developers of the software that enabled our research, including JAX [48], Matplotlib [49], and NumPy [50].

Funding

This work received support from DOE grant DE-SC0019116, AFOSR grant FA9550-20-1-0060, and DOE-ARPA grant DE-AR0001201.

Author information

Authors and Affiliations

Authors

Contributions

SW and PP conceptualized the research and designed the numerical studies. SW implemented the methods and conducted the numerical experiments. HW assisted with the numerical studies. PP provided funding and supervised all aspects of this work. SW and PP wrote the manuscript.

Corresponding author

Correspondence to Paris Perdikaris.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Nomenclature

Table 8 summarizes the main symbols and notation used in this work.

Table 8 Nomenclature: Summary of the main symbols and notation used in this work

B Hyper-parameter settings

See Table 9.

Table 9 Default hyper-parameter settings for each benchmark employed in this work (unless otherwise stated)

C Computational cost

Training: Table 10 summarizes the computational cost (in hours) of training physics-informed DeepONet models with different network architectures and weighting schemes. All networks were trained on a single NVIDIA RTX A6000 graphics card.

Table 10 Computational cost (hours) for training a physics-informed DeepONet model across the different benchmarks and architectures employed in this work

D Proof of Lemma 3.2

Proof

First we observe that \(\varvec{H}(\varvec{\theta })\) is a Gram matrix. Indeed, we define

$$\begin{aligned} v_i = T^{(i)}(u^{(i)}(\varvec{x}_i), G_{\varvec{\theta }}(\varvec{u}^{(i)})(\varvec{y}_i)), \quad i=1,2, \dots , N^*. \end{aligned}$$
(D.1)

Then by the definition of \(\varvec{H}(\varvec{\theta })\), we have

$$\begin{aligned} \varvec{H}_{ij}(\varvec{\theta }) = \langle v_i, v_j \rangle . \end{aligned}$$
(D.2)
  (a)

    By the definition of a Gram matrix.

  (b)

    Let \(\Vert \varvec{H}(\varvec{\theta })\Vert _\infty = \langle v_i, v_j \rangle \) for some \(i, j\), and let \(\langle v_k, v_k \rangle = \max _{1 \le k \le N^*} \varvec{H}_{kk}(\varvec{\theta })\) for some \(k\). Then we have

    $$\begin{aligned} \max _{1 \le k \le N^*} \varvec{H}_{kk}(\varvec{\theta }) = \langle v_k, v_k \rangle \le \Vert \varvec{H}(\varvec{\theta })\Vert _\infty = \langle v_i, v_j \rangle \le \Vert v_i\Vert \Vert v_j\Vert \le \Vert v_k\Vert ^2 = \langle v_k, v_k \rangle . \end{aligned}$$
    (D.3)

    Therefore, we have

    $$\begin{aligned} \Vert \varvec{H}(\varvec{\theta })\Vert _\infty = \max _{1\le k \le N^*} \varvec{H}_{kk}(\varvec{\theta }). \end{aligned}$$
    (D.4)

\(\square \)
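For concreteness, a small worked example of property (b), added here for illustration: with \(N^* = 2\), \(v_1 = (1, 0)\) and \(v_2 = (1, 1)\), the Gram matrix is
$$\begin{aligned} \varvec{H} = \begin{pmatrix} 1 &{} 1 \\ 1 &{} 2 \end{pmatrix}, \end{aligned}$$
and its largest entry \(\Vert \varvec{H}\Vert _\infty = 2\) indeed lies on the diagonal, coinciding with \(\max _{1 \le k \le 2} \varvec{H}_{kk} = \langle v_2, v_2 \rangle = 2\).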

E Proof of Lemma 3.4

Proof

Recall the definition of the loss function (3.6):

$$\begin{aligned} \mathcal {L}(\varvec{\theta }) = \frac{2}{N^*} \sum _{i=1}^{N^*} \left| T^{(i)}(u^{(i)}(\varvec{x}_i), G_{\varvec{\theta }}(\varvec{u}^{(i)})(\varvec{y}_i))\right| ^2. \end{aligned}$$
(E.1)

Now we consider the corresponding gradient flow

$$\begin{aligned} \frac{d \varvec{\theta }}{ dt}&= - \nabla \mathcal {L}(\varvec{\theta }), \end{aligned}$$
(E.2)
$$\begin{aligned}&= - \frac{4}{N^*} \sum _{i=1}^{N^*} \left( T^{(i)}(u^{(i)}(\varvec{x}_i), G_{\varvec{\theta }}(\varvec{u}^{(i)})(\varvec{y}_i)) \right) \frac{d T^{(i)}(u^{(i)}(\varvec{x}_i), G_{\varvec{\theta }}(\varvec{u}^{(i)})(\varvec{y}_i)) }{d \varvec{\theta }}. \end{aligned}$$
(E.3)

For \(1 \le j \le N^*\), note that

$$\begin{aligned}&\frac{d T^{(j)}(u^{(j)}(\varvec{x}_j), G_{\varvec{\theta }}(\varvec{u}^{(j)})(\varvec{y}_j))}{ dt } \end{aligned}$$
(E.4)
$$\begin{aligned}&= \frac{d T^{(j)}(u^{(j)}(\varvec{x}_j), G_{\varvec{\theta }}(\varvec{u}^{(j)})(\varvec{y}_j))}{ d \varvec{\theta } } \cdot \frac{d \varvec{\theta }}{ dt } \end{aligned}$$
(E.5)
$$\begin{aligned}&= \frac{d T^{(j)}(u^{(j)}(\varvec{x}_j), G_{\varvec{\theta }}(\varvec{u}^{(j)})(\varvec{y}_j))}{ d \varvec{\theta } }\nonumber \\&\quad \cdot \left[ - \frac{4}{N^*} \sum _{i=1}^{N^*} \left( T^{(i)}(u^{(i)}(\varvec{x}_i), G_{\varvec{\theta }}(\varvec{u}^{(i)})(\varvec{y}_i)) \right) \frac{d T^{(i)}(u^{(i)}(\varvec{x}_i), G_{\varvec{\theta }}(\varvec{u}^{(i)})(\varvec{y}_i)) }{d \varvec{\theta }} \right] \end{aligned}$$
(E.6)
$$\begin{aligned}&= - \frac{4}{ N^*} \sum _{i=1}^{N^*} \left( T^{(i)}(u^{(i)}(\varvec{x}_i), G_{\varvec{\theta }}(\varvec{u}^{(i)})(\varvec{y}_i)) \right) \nonumber \\&\quad \left\langle \frac{d T^{(i)}(u^{(i)}(\varvec{x}_i), G_{\varvec{\theta }}(\varvec{u}^{(i)})(\varvec{y}_i)) }{d \varvec{\theta }}, \frac{d T^{(j)}(u^{(j)}(\varvec{x}_j), G_{\varvec{\theta }}(\varvec{u}^{(j)})(\varvec{y}_j))}{ d \varvec{\theta } } \right\rangle . \end{aligned}$$
(E.7)

By Definitions 3.1 and 3.3, we conclude that

$$\begin{aligned} \frac{d \varvec{T}\left( \varvec{U}(\varvec{X}), G_{\varvec{\theta }(t)} (\varvec{U}) (\varvec{Y}) \right) }{ d t} = - \frac{4}{ N^*} \varvec{K}(\varvec{\theta }) \cdot \varvec{T}\left( \varvec{U}(\varvec{X}), G_{\varvec{\theta }(t)} (\varvec{U}) (\varvec{Y}) \right) , \end{aligned}$$
(E.8)

where \(\varvec{K}(\varvec{\theta })\) is an \(N^* \times N^*\) matrix whose entries are given by

$$\begin{aligned} \varvec{K}_{ij}(\varvec{\theta }) = \left\langle \frac{d T^{(i)} \left( u^{(i)} (\varvec{x}_i), G_{\varvec{\theta }} (\varvec{u}^{(i)})(\varvec{y}_i) \right) }{d \varvec{\theta }} , \frac{d T^{(j)} \left( u^{(j)} (\varvec{x}_j), G_{\varvec{\theta }} (\varvec{u}^{(j)})(\varvec{y}_j) \right) }{d \varvec{\theta }} \right\rangle . \end{aligned}$$
(E.9)

\(\square \)
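For readers who wish to inspect \(\varvec{K}(\varvec{\theta })\) numerically, the snippet below is a minimal sketch in JAX [48] (not part of the released code) that assembles the kernel of (E.9) from per-example residual Jacobians; residual_fn, ntk_matrix, and the toy residual are placeholders introduced here for illustration.

    import jax
    import jax.numpy as jnp

    def ntk_matrix(residual_fn, params, inputs):
        # K_ij = < dT_i/dtheta, dT_j/dtheta >, with T_i = residual_fn(params, inputs[i]).
        jac_fn = jax.jacobian(residual_fn)                    # derivative w.r.t. params
        jacs = jax.vmap(lambda x: jac_fn(params, x))(inputs)  # parameter pytree with leading axis N*
        leaves = jax.tree_util.tree_leaves(jacs)
        J = jnp.concatenate([leaf.reshape(leaf.shape[0], -1) for leaf in leaves], axis=1)
        return J @ J.T                                        # (N*, N*) Gram matrix

    # Toy residual T(theta, x) = tanh(w . x + b), used only to exercise the routine.
    params = {"w": jnp.ones(3), "b": jnp.array(0.1)}
    def residual_fn(p, x):
        return jnp.tanh(jnp.dot(p["w"], x) + p["b"])

    xs = jnp.linspace(-1.0, 1.0, 15).reshape(5, 3)            # N* = 5 collocation inputs
    K = ntk_matrix(residual_fn, params, xs)                   # symmetric positive semi-definite
    print(K.shape)                                            # (5, 5)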

F Antiderivative

See Fig. 15.

Fig. 15

Anti-derivative operator: Training loss convergence of DeepONet models using different loss weighting schemes for \(4 \times 10^4\) iterations of gradient descent using the Adam optimizer. Here we remark that all losses are weighted and, therefore, the magnitude of these losses is not informative

G Advection equation

See Figs. 16, 17 and 18.

Fig. 16

Advection equation: Training loss convergence of DeepONet models using different architectures and weighting schemes for \(3 \times 10^5\) iterations of gradient descent using the Adam optimizer. Here we remark that all losses are unweighted

Fig. 17

Advection equation: Relative \(L^2\) error of physics-informed DeepONets represented by different architectures and trained using different fixed weights \(\lambda _{\text {bc}} = \lambda _{\text {ic}} = \lambda \in [10^{-2}, 10^{2}]\), averaged over the test data-set

Fig. 18

Advection equation: Predicted solution of the best trained physics-informed DeepONet for three different examples in the test data-set

H Burgers’ equation

See Figs. 19, 20, 21, 22 and 23.

Fig. 19

Burgers’ equation: Training loss convergence of DeepONet models using different architectures and weighting schemes for \(2 \times 10^5\) iterations of gradient descent using the Adam optimizer. Here we remark that all losses are unweighted

Fig. 20

Burgers’ equation: Relative \(L^2\) error of physics-informed DeepONets represented by different architectures and trained using different fixed weights \(\lambda _{\text {bc}} = \lambda _{\text {ic}} = \lambda \in [10^{-2}, 10^{2}]\), averaged over the test data-set

Fig. 21

Burgers’ equation (\(\nu = 0.01\)): Predicted solution of the best trained physics-informed DeepONet for three different examples in the test data-set

Fig. 22

Burgers’ equation (\(\nu = 0.001\)): Predicted solution of the best trained physics-informed DeepONet for three different examples in the test data-set

Fig. 23

Burgers’ equation (\(\nu = 0.0001\)): Predicted solution of the best trained physics-informed DeepONet for three different examples in the test data-set

I Stokes flow

See Figs. 24, 25, 26 and 27.

Fig. 24

Stokes equation: Training loss convergence of DeepONet models using different architectures and weighting schemes for \(2 \times 10^5\) iterations of gradient descent using the Adam optimizer. Here we remark that all losses are unweighted

Fig. 25

Stokes equation: Predicted solution of the best trained physics-informed DeepONet for one example in the test data-set

Fig. 26

Stokes equation: Predicted solution of the best trained physics-informed DeepONet for one example in the test data-set

Fig. 27

Stokes equation: Predicted solution of the best trained physics-informed DeepONet for one example in the test data-set

Cite this article

Wang, S., Wang, H. & Perdikaris, P. Improved Architectures and Training Algorithms for Deep Operator Networks. J Sci Comput 92, 35 (2022). https://doi.org/10.1007/s10915-022-01881-0
