1 Introduction

Rough functions, i.e., functions which are at most Lipschitz continuous and could even be discontinuous, arise in a wide variety of problems in physics and engineering. Prominent examples include (weak) solutions of nonlinear partial differential equations. For instance, solutions of nonlinear hyperbolic systems of conservation laws, such as the compressible Euler equations of gas dynamics, contain shock waves and are in general discontinuous [9]. Similarly, solutions to the incompressible Euler equations may well be only Hölder continuous in the turbulent regime [12]. Moreover, solutions of fully nonlinear PDEs such as Hamilton-Jacobi equations are in general Lipschitz continuous [11]. Images constitute another class of rough, or rather piecewise smooth, functions, as they are often assumed to be no more regular than functions of bounded variation on account of their sharp edges [5].

Given this context, the efficient and robust numerical approximation of rough functions is of great importance. However, classical approximation theory has severe drawbacks when it comes to the interpolation (or approximation) of such rough functions. In particular, it is well known that standard linear interpolation procedures degrade to at best first-order accuracy (in terms of the interpolation mesh width) as soon as the derivative of the underlying function has a singularity; see [1] and references therein. This order of accuracy degrades further if the underlying function is itself discontinuous. Moreover, approximating rough functions with polynomials can lead to spurious oscillations at points of singularity. Hence, the approximation of rough functions poses a formidable challenge.

Artificial neural networks, formed by concatenating affine transformations with pointwise applications of nonlinearities, have been shown to possess universal approximation properties; see [6, 8, 18] and references therein. This implies that for any continuous (or even merely measurable) function, there exists a neural network that approximates it accurately. However, the precise architecture of this network is not specified in these universality results. Recently, Yarotsky [26] constructed deep neural networks with ReLU activation functions, together with very explicit estimates on the size and parameters of the network, that approximate Lipschitz functions to second-order accuracy. Even more surprisingly, in a very recent paper [27], the authors constructed deep neural networks with alternating ReLU and Sine activation functions that can approximate Lipschitz (or Hölder continuous) functions to exponential accuracy.

The aforementioned results of Yarotsky clearly illustrate the power of deep neural networks in approximating rough functions. However, there is a practical issue in the use of these deep neural networks, as they are mappings from the space coordinate \(x \in D \subset {\mathbb {R}}^d\) to the output \(f^{*}(x) \in {\mathbb {R}}\), with the neural network \(f^{*}\) approximating the underlying function \(f:D \rightarrow {\mathbb {R}}\). Hence, for every given function f, the neural network \(f^{*}\) has to be trained, i.e., its weights and biases determined by minimizing a suitable loss function with respect to some underlying samples of f [14]. Although it makes sense to train neural networks to approximate individual functions f in high dimensions, for instance in the context of uncertainty quantification of PDEs (see [20] and references therein), doing so for every low-dimensional function is unrealistic. Moreover, in a large number of contexts, the goal of approximating a function is to produce an interpolant \({\tilde{f}}\), given the vector \(\{f(x_i)\}\) at sampling points \(x_i \in D\) as input. Training a neural network as a regression function for every individual f can quickly become expensive. Hence, for a fixed set of sampling points, one would like to construct neural networks that map the full input vector to an output interpolant (or its evaluation at certain sampling points). If this is possible, one would only need to train a neural network once and then apply it to every individual function f. However, it is unclear if the function approximation results for neural networks are informative in this particular context.

On the other hand, data-dependent interpolation procedures have been developed in the last decades to deal with the interpolation of rough functions. A notable example of these data-dependent algorithms is provided by the essentially non-oscillatory (ENO) procedure. First developed in the context of the reconstruction of non-oscillatory polynomials from cell averages in [16], ENO was also adapted for interpolating rough functions; see [23] and references therein. Once augmented with the sub-cell resolution (SR) procedure of [15], the ENO-SR interpolant was proved in [1] to approximate (univariate) Lipschitz functions to second-order accuracy. Moreover, ENO was shown to satisfy a subtle non-linear stability property, the so-called sign property [13]. Given these desirable properties, it is not surprising that the ENO procedure has been very successfully employed in a variety of contexts, ranging from the numerical approximation of hyperbolic systems of conservation laws [16] and Hamilton-Jacobi equations [24] to data compression in image processing; see [1, 2, 17] and references therein.

Given the ability of neural networks as well as ENO algorithms to approximate rough functions accurately, it is natural to investigate connections between them. This is the central premise of the current paper, where we aim to reinterpret ENO (and ENO-SR) algorithms in terms of deep neural networks. We prove the following results:

  • We prove that for any order, the ENO interpolation (and the ENO reconstruction) procedure can be cast as a suitable deep ReLU neural network.

  • We prove that a variant of the piecewise linear ENO-SR (sub-cell resolution) procedure of [15] can also be cast as a deep ReLU neural network. Thus, we prove that there exists a deep ReLU neural network that approximates piecewise smooth (say Lipschitz) functions to second-order accuracy.

  • The above theorems provide the requisite architecture for the resulting deep neural networks and we train them to obtain what we term DeLENO (deep learning ENO) approximation procedures for rough functions. We test this procedure in the context of numerical methods for conservation laws and for data and image compression.

Thus, our results reinforce the enormous abilities of deep ReLU neural networks to approximate functions, in particular rough functions, and add a different perspective to the many existing results on approximation with ReLU networks.

2 Deep neural networks

In statistics, machine learning, numerical mathematics and many other scientific disciplines, the goal of a certain task can often be reduced to the following. We consider a (usually unknown) function \({\mathcal {L}}:D \subset {\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) and we assume access to a (finite) set of labelled data \({\mathbb {S}}\subset \{ (X,{\mathcal {L}}(X)):X \in D \}\), using which we wish to select an approximation \({\hat{{\mathcal {L}}}}\) from a parametrized function class \(\mathcal {L}^{\theta }\) that predicts the outputs of \({\mathcal {L}}\) on D with a high degree of accuracy.

One possible function class is that of deep neural networks (DNNs). In particular, we consider multilayer perceptrons (MLPs) in which the basic computing units (neurons) are stacked in multiple layers to form a feedforward network. The input is fed into the source layer and flows through a number of hidden layers to the output layer. An example of an MLP with two hidden layers is shown in Fig. 1.

In our terminology, an MLP of depth L consists of an input layer, \(L-1\) hidden layers and an output layer. We denote the vector fed into the input layer by \(X=Z^0\). The l-th layer (with \(n_l\) neurons) receives an input vector \(Z^{l-1} \in {\mathbb {R}}^{n_{l-1}}\) and transforms it into the vector \(Z^{l} \in {\mathbb {R}}^{n_{l}}\) by first applying an affine linear transformation, followed by a component-wise (non-linear) activation function \({\mathcal {A}}^l\),

$$\begin{aligned} Z^{l} = {\mathcal {A}}^l(W^l Z^{l-1} + b^l), \quad W^l \in {\mathbb {R}}^{n_{l} \times n_{l-1}}, \ b^l \in {\mathbb {R}}^{n_{l}}, \quad 1 \le l \le L, \end{aligned}$$
(2.1)

with \(Z^l\) serving as the input for the \((l+1)\)-th layer. For consistency, we set \(n_0 = m\) and \(n_L = n\). In (2.1), \(W^l\) and \(b^l\) are respectively known as the weights and biases associated with the l-th layer. The parameter space \(\Theta \) then consists of all possible weights and biases. A neural network is said to be deep if \(L\ge 3\) and such a deep neural network (DNN) is denoted as a ReLU DNN if the activation functions are defined by the very popular rectified linear (ReLU) function,

$$\begin{aligned} {\mathcal {A}}^l(Z) = (Z)_+ = \max (0,Z) \quad \text {for}\quad 1\le l\le L-1\quad \text {and}\quad {\mathcal {A}}^L(Z)=Z. \end{aligned}$$
(2.2)
Fig. 1 An MLP with 2 hidden layers. The source layer transmits the signal X to the first hidden layer. The final output of the network is \({\hat{Y}}\)
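To fix ideas, a minimal NumPy sketch of the forward pass (2.1)-(2.2) is given below; the layer widths and the randomly drawn weights and biases are placeholders rather than parameters of any trained network.

```python
import numpy as np

def relu(x):
    # (x)_+ = max(0, x), applied component-wise, cf. (2.2)
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Forward pass (2.1): affine map followed by ReLU on the hidden layers, identity on the output layer."""
    z = np.asarray(x, dtype=float)
    L = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        y = W @ z + b                     # W^l Z^{l-1} + b^l
        z = y if l == L else relu(y)      # A^L is the identity, A^l = ReLU otherwise
    return z

# Depth L = 3: two hidden layers of width 5, input dimension m = 4, output dimension n = 2 (placeholder sizes).
rng = np.random.default_rng(0)
dims = [4, 5, 5, 2]
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]
print(mlp_forward(rng.standard_normal(4), weights, biases))
```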

Depending on the nature of the problem, the output of the network may have to pass through an output function \({\mathcal {S}}\) to convert the signal into a meaningful form. In classification problems, a suitable choice for such an output function is the softmax function

$$\begin{aligned} {\mathcal {S}}(x):{\mathbb {R}}^n\rightarrow {\mathbb {R}}^n:x\mapsto \left( \frac{e^{x_1}}{\sum _{j=1}^n e^{x_j}},\ldots ,\frac{e^{x_n}}{\sum _{j=1}^n e^{x_j}}\right) . \end{aligned}$$
(2.3)

This choice ensures that the final output vector \({\hat{Y}}={\mathcal {S}}(Z^L)\) satisfies \(\sum _{j=1}^n {\hat{Y}}_j = 1\) and \(0 \le {\hat{Y}}_j \le 1\) for all \(1\le j \le n\), which allows \({\hat{Y}}_j\) to be viewed as the probability that the input \(Z^0\) belongs to the j-th class. Note that the class predicted by the network is \(\text {arg} \max _j {\hat{Y}}_j\). For regression problems, no additional output function is needed.

Remark 2.1

It is possible that multiple classes have the largest probability. In this case, the predicted class can be uniquely defined as \(\min \mathrm {arg} \max _j\{{\hat{Y}}_j\}\), following the usual coding conventions. Also note that the softmax function only contributes towards the interpretability of the network output and has no effect on the predicted class, that is,

$$\begin{aligned} \min \mathrm {arg} \max _j\{{\hat{Y}}_j\} = \min \mathrm {arg} \max _j\{Z^L_j\}. \end{aligned}$$

This observation will be used at a later stage.
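A minimal sketch of the softmax (2.3) and of the \(\min \mathrm {arg} \max \) convention of Remark 2.1 (with 0-based indices) is the following; the placeholder output vector contains a tie, and the assertion checks that the softmax indeed leaves the predicted class unchanged.

```python
import numpy as np

def softmax(z):
    # (2.3), with the usual shift by max(z) for numerical stability
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def predicted_class(z):
    # min arg max_j: the smallest index attaining the maximum (Remark 2.1), 0-based here
    return int(np.flatnonzero(z == np.max(z)).min())

zL = np.array([1.3, 2.7, 2.7, -0.5])                 # placeholder network output Z^L with a tie
assert predicted_class(softmax(zL)) == predicted_class(zL)
print(softmax(zL), predicted_class(zL))
```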

The expressive power of ReLU neural networks, in particular their capability of approximating rough functions, has already been demonstrated in the literature [22, 27]. In practice, however, there is a major issue with this approach when it is used to approximate an unknown function \(f:D\subseteq {\mathbb {R}} \rightarrow {\mathbb {R}}\) based on a finite set \({\mathbb {S}}\subset \{(x,f(x)):x\in D\}\): for each individual function f a new neural network has to be found by training. The computational cost of this training is significantly higher than that of classical regression methods, which makes the approach rather impractical, at least for functions in low dimensions. This motivates us to investigate how one can obtain a neural network that takes the sampled values of f as input and produces an output interpolant, or rather its evaluation at certain sample points. Such a network depends primarily on the training data \({\mathbb {S}}\) and can be reused for each individual function, thereby drastically reducing the computational cost. Instead of creating an entirely novel data-dependent interpolation procedure, we base ourselves in this paper on the essentially non-oscillatory (ENO) interpolation framework of [15, 16], which we introduce in the next section.

3 ENO framework for interpolating rough functions

In this section, we explore the essentially non-oscillatory (ENO) interpolation framework [15, 16], on which we will base our theoretical results of later sections. Although the ENO procedure has its origins in the context of the numerical approximation of solutions of hyperbolic conservation laws, this data-dependent scheme has also proven useful for the interpolation of rough functions [23].

3.1 ENO interpolation

We first focus on the original ENO procedure, as introduced in [16]. This procedure can attain any order of accuracy for smooth functions, but reduces to first-order accuracy for functions that are merely Lipschitz continuous. In particular, the ENO-p interpolant is p-th order accurate in smooth regions and suppresses the appearance of spurious oscillations in the vicinity of points of discontinuity. In the following, we describe the main idea behind this algorithm.

Let f be a function on \(\Omega = [c,d]\subset {\mathbb {R}}\) that is at least p times continuously differentiable. We define a sequence of nested uniform grids \(\{{\mathcal {T}}^k\}_{k=0}^K\) on \(\Omega \), where

$$\begin{aligned} {\mathcal {T}}^k = \{x_i^k\}_{i=0}^{N_k}, \, I_i^k = [x^k_{i-1}, x^k_i], \, x_i^k = c + i h_k, \, h_k = \frac{(d-c)}{N_k}, \, N_k = 2^k N_0, \end{aligned}$$
(3.1)

for \(0\le i \le N_k\), \(0\le k\le K\) and some positive integer \(N_0\). Furthermore, we define \(f^k=\{f(x):x\in {\mathcal {T}}^k\}\) and \(f^k_i=f(x_i^k)\), and we let \(f^k_{-p+2}, \ldots ,f^k_{-1}\) and \(f^k_{N_k+1},\ldots , f^k_{N_k + p-2}\) be suitably prescribed ghost values. We are interested in finding an interpolation operator \({\mathcal {I}}^{h_k}\) such that

$$\begin{aligned} {\mathcal {I}}^{h_k}f(x)=f(x) \text { for } x\in {\mathcal {T}}^k \quad \text {and} \quad \left\Vert {\mathcal {I}}^{h_k}f-f\right\Vert _{\infty }=O(h_k^p) \text { for } k\rightarrow \infty . \end{aligned}$$

In standard approximation theory, this is achieved by defining \({\mathcal {I}}^{h_k}f\) on \(I_i^k\) as the unique polynomial \(p^k_i\) of degree \(p-1\) that agrees with f on a chosen set of p points, including \(x_{i-1}^k\) and \(x_i^k\). The linear interpolant (\(p=2\)) can be uniquely obtained using the stencil \(\{ x^k_{i-1},x^k_i \}\). However, there are several candidate stencils to choose from when \(p> 2\). The ENO interpolation procedure considers the stencil sets

$$\begin{aligned} {\mathcal {S}}^r_i = \{x^k_{i-1-r+j}\}_{j=0}^{p-1}, \quad 0\le r \le p-2, \end{aligned}$$

where r is called the (left) stencil shift. The smoothest stencil is then selected based on the local smoothness of f using Newton’s undivided differences. These are inductively defined in the following way. Let \(\Delta _j^0 = f^k_{i+j}\) for \(-p+1\le j \le p-2\) and \(0\le i \le N_k\). We can then define

$$\begin{aligned} \Delta ^s_{j} = {\left\{ \begin{array}{ll} \Delta ^{s-1}_{j}-\Delta ^{s-1}_{j-1} &{} \text {for }s \text { odd}\\ \Delta ^{s-1}_{j+1}-\Delta ^{s-1}_{j} &{} \text {for }s \text { even.} \end{array}\right. } \end{aligned}$$

Algorithm 1 describes how the stencil shift r can be obtained using these undivided differences. Note that r uniquely defines the polynomial \(p_i^k\). We can then write the final interpolant as

$$\begin{aligned} {\mathcal {I}}^{h_k}f(x)=\sum _{i=1}^{N_k}p^k_i(x)\mathbbm {1}_{[x_{i-1}^k,x_i^k)}(x). \end{aligned}$$

This interpolant can be proven to be total-variation bounded (TVB), which guarantees the disappearance of spurious oscillations (e.g. near discontinuities) when the grid is refined. This property motivates the use of the ENO framework over standard techniques for the interpolation of rough functions.
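Algorithm 1 itself is displayed as a float; as an illustration, the NumPy sketch below implements the standard ENO selection loop it describes: starting from the two-point stencil \(\{x^k_{i-1},x^k_i\}\), the undivided differences of the left- and right-extended candidate stencils are compared at each level and the smoother extension is kept. Its tie and boundary handling should be read as an assumption rather than as an exact transcription of Algorithm 1.

```python
import numpy as np

def eno_stencil_shift(f_loc, p):
    """Left stencil shift r in {0, ..., p-2} for the interval I_i^k = [x_{i-1}^k, x_i^k].

    f_loc holds the 2p-2 values (f_{i-p+1}, ..., f_{i+p-2}); index p-2 corresponds to f_{i-1}.
    """
    f_loc = np.asarray(f_loc, dtype=float)
    r = 0                                    # current stencil: {x_{i-1-r}, ..., x_{i-1-r+s-1}}
    left = p - 2                             # position of f_{i-1} within f_loc
    for s in range(2, p):                    # one decision per level, p-2 in total
        # s-th order undivided difference of the s+1 consecutive values starting at index j
        undiv = lambda j: np.diff(f_loc[j:j + s + 1], n=s)[0]
        d_left = undiv(left - r - 1)         # stencil extended to the left by x_{i-2-r}
        d_right = undiv(left - r)            # stencil extended to the right
        if abs(d_left) < abs(d_right):       # keep the smoother (left) extension
            r += 1
    return r

# Linear data keeps r = 0; a kink near the right end of the stencil pushes the stencil to the left (r = 2).
print(eno_stencil_shift([0, 1, 2, 3, 4, 5], p=4), eno_stencil_shift([0, 1, 2, 3, 10, 40], p=4))
```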

In many applications, one is only interested in predicting the values of \(f^{k+1}\) given \(f^k\). In this case, there is no need to calculate \({\mathcal {I}}^{h_{k}}f\) and evaluate it on \({\mathcal {T}}^{k+1}\). Instead, one can use Lagrangian interpolation theory to see that there exist fixed coefficients \(C_{r,j}^p\) such that

$$\begin{aligned} {\mathcal {I}}^{h_{k}}f(x^{k+1}_{2i-1})= & {} \sum _{j=0}^{p-1} C_{r_i^k,j}^p f^k_{i-r_i^k+j} \quad \text { for }\, 1\le i \le N_k \quad \text {and}\quad \nonumber \\ {\mathcal {I}}^{h_{k}}f(x^{k+1}_{2i})= & {} f^{k+1}_{2i}=f^k_i \quad \text { for }\, 0\le i \le N_k, \end{aligned}$$
(3.2)

where \(r_i^k\) is the stencil shift corresponding to the smoothest stencil for interval \(I_i^k\). The coefficients \(C_{r,j}^p\) are listed in Table 5 in Appendix B.
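Since the \(C_{r,j}^p\) are the values of the Lagrange basis polynomials of the stencil \({\mathcal {S}}^r_i\) at the midpoint \(x^{k+1}_{2i-1}\), they can also be computed on the fly, as in the sketch below; the indexing follows the stencil definition of Sect. 3.1 and may differ by an index shift from the convention used in Table 5 and (3.2).

```python
import numpy as np

def midpoint_coefficients(p, r):
    """Weights of the degree p-1 interpolant on S^r_i evaluated at the midpoint of I_i^k.

    Nodes are expressed in units of h_k with x_{i-1}^k placed at 0, so that
    S^r_i = {j - r : j = 0, ..., p-1} and the midpoint x_{2i-1}^{k+1} sits at 1/2.
    """
    nodes = np.arange(p) - r
    target = 0.5
    coeffs = np.empty(p)
    for j in range(p):
        others = np.delete(nodes, j)
        coeffs[j] = np.prod((target - others) / (nodes[j] - others))   # Lagrange basis ell_j(1/2)
    return coeffs

print(midpoint_coefficients(2, 0))                       # [0.5 0.5]: plain averaging of the endpoints
print(midpoint_coefficients(3, 0), midpoint_coefficients(3, 1))
```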

Remark 3.1

ENO was initially introduced by [16] for high-order accurate piecewise polynomial reconstruction, given cell averages of a function. This allows the development of high-order accurate numerical methods for hyperbolic conservation laws, the so-called ENO schemes. ENO reconstruction can be loosely interpreted as ENO interpolation applied to the primitive function and is discussed in Appendix A.

Remark 3.2

The prediction of \(f^{k+1}\) from \(f^k\) can be framed in the context of multi-resolution representations of functions, which are useful for data compression [17]. As we will use ENO interpolation for data compression in Sect. 6, we refer to Appendix C for details on multi-resolution representations.

Algorithm 1 (stencil selection for ENO-p interpolation)

3.2 An adapted second-order ENO-SR algorithm

Even though ENO is able to interpolate rough functions without undesirable side effects (e.g. oscillations near discontinuities), there is still room for improvement. By itself, the ENO interpolation procedure degrades to first-order accuracy for piecewise smooth functions, i.e., functions with a singularity in the second derivative. However, following [15], one can use sub-cell resolution (SR), together with ENO interpolation, to obtain a second-order accurate approximation of such functions. We propose a simplified variant of the ENO-SR procedure from [1] and prove that it is still second-order accurate. In the following, we assume f to be a continuous function that is twice differentiable except at a single point z, where the first derivative has a jump of size \([f'] = f^{\prime }(z+) - f^{\prime }(z-) \). We use the notation introduced in Sect. 3.1.

The first step of the adapted second-order ENO-SR algorithm is to label intervals that might contain the singular point z as bad (B); all other intervals are labelled good (G). We use second-order differences

$$\begin{aligned} \Delta ^2_hf(x):=f(x-h)-2f(x)+f(x+h) \end{aligned}$$
(3.3)

as smoothness indicators. The rules of the ENO-SR detection mechanism are the following:

  1. (1)

    The intervals \(I_{i-1}^k\) and \(I_i^k\) are labelled B if

    $$\begin{aligned} |\Delta _{h_k}^2f(x^k_{i-1})| > \max _{n=1,2,3}|\Delta _{h_k}^2f(x^k_{i-1\pm n})|. \end{aligned}$$
  2. (2)

    Interval \(I_i^k\) is labelled B if

    $$\begin{aligned} |\Delta _{h_k}^2f(x^k_{i})|> \max _{n=1,2}|\Delta _{h_k}^2f(x^k_{i+n})| \quad \text {and} \quad |\Delta _{h_k}^2f(x^k_{i-1})| > \max _{n=1,2}|\Delta _{h_k}^2f(x^k_{i-1-n})|. \end{aligned}$$
  3. (3)

    All other intervals are labelled G.

Fig. 2 Visualization of the second-order ENO-SR algorithm in the case where an interval is labelled as good (left) and bad (right). The superscript k was omitted for clarity

Note that neither detection rule implies the other and that an interval can be labelled B by both rules at the same time. In the following, we will denote by \({p_i^k:[c,d]\rightarrow {\mathbb {R}}}\) the linear interpolation of the endpoints of \(I_i^k\). The rules of the interpolation procedure are stated below; a visualization of the algorithm can be found in Fig. 2, and a code sketch of the combined detection and interpolation procedure follows the list.

  1. (1)

    If \(I_i^k\) was labelled as G, then we take the linear interpolation on this interval as approximation for f,

    $$\begin{aligned} {\mathcal {I}}_i^{h_k}f(x) = p_i^k(x). \end{aligned}$$
  2. (2)

    If \(I_i^k\) was labelled as B, we use \(p^k_{i-2}\) and \(p^k_{i+2}\) to predict the location of the singularity. If both lines intersect at a single point y, then we define

    $$\begin{aligned} {\mathcal {I}}_i^{h_k}f(x) = p^k_{i-2}(x)\mathbbm {1}_{[c,\max \{y,c\})}(x)+p^k_{i+2}(x)\mathbbm {1}_{[\min \{y,d\},d]}(x). \end{aligned}$$

    The relation between this intersection point y and the singularity z is quantified by Lemma D.3. If the two lines do not intersect, we treat \(I_i^k\) as a good interval and let \({\mathcal {I}}_i^{h_k}f(x) = p_i^k(x)\).
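The NumPy sketch below combines the detection rules with the interpolation rules above to predict f at the midpoints of a uniform grid, as needed for predicting \(f^{k+1}\) from \(f^k\) in (3.2). It is only a sketch: intervals near the boundary are simply treated as good and no ghost values are introduced.

```python
import numpy as np

def eno_sr2_midpoints(f, x):
    """Adapted ENO-SR-2 prediction of f at the midpoints of the uniform grid x, given f = f(x).

    Returns an array of length len(x) - 1; entry i-1 approximates f at the midpoint of I_i = [x_{i-1}, x_i].
    Intervals close to the boundary are simply treated as good.
    """
    f, x = np.asarray(f, float), np.asarray(x, float)
    N = len(x) - 1                             # intervals I_1, ..., I_N
    D = np.zeros(N + 1)                        # second differences (3.3) at the interior nodes
    D[1:N] = f[:-2] - 2.0 * f[1:-1] + f[2:]

    bad = np.zeros(N + 1, dtype=bool)          # bad[i] <=> I_i is labelled B
    for m in range(4, N - 3):                  # rule (1): node x_m dominates its six neighbours
        if abs(D[m]) > max(abs(D[m + n]) for n in (-3, -2, -1, 1, 2, 3)):
            bad[m] = bad[m + 1] = True
    for i in range(4, N - 3):                  # rule (2)
        if (abs(D[i]) > max(abs(D[i + 1]), abs(D[i + 2]))
                and abs(D[i - 1]) > max(abs(D[i - 2]), abs(D[i - 3]))):
            bad[i] = True

    def p(i, t):                               # linear interpolation of the endpoints of I_i
        return f[i - 1] + (f[i] - f[i - 1]) * (t - x[i - 1]) / (x[i] - x[i - 1])

    h = x[1] - x[0]
    out = np.empty(N)
    for i in range(1, N + 1):
        t = 0.5 * (x[i - 1] + x[i])
        if bad[i]:
            a_minus, a_plus = f[i - 2] - f[i - 3], f[i + 2] - f[i + 1]   # slopes (times h) of p_{i-2}, p_{i+2}
            if not np.isclose(a_minus, a_plus):                          # the two lines intersect at a point y
                y = x[i - 2] + h * (p(i + 2, x[i - 2]) - f[i - 2]) / (a_minus - a_plus)
                out[i - 1] = p(i - 2, t) if t < y else p(i + 2, t)
                continue
        out[i - 1] = p(i, t)                   # good interval, or parallel lines
    return out

# f(x) = |x - 0.3| has a corner at z = 0.3; for this piecewise linear f the prediction is exact up to round-off.
x = np.linspace(0.0, 1.0, 33)
print(np.max(np.abs(eno_sr2_midpoints(np.abs(x - 0.3), x) - np.abs(0.5 * (x[:-1] + x[1:]) - 0.3))))
```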

The theorem below states that our adaptation of ENO-SR is indeed second-order accurate.

Theorem 3.3

Let f be a globally continuous function with a bounded second derivative on \({\mathbb {R}}\backslash \{z\}\) and a discontinuity in the first derivative at a point z. The adapted ENO-SR interpolant \({\mathcal {I}}^hf\) satisfies

$$\begin{aligned} \left\Vert f-{\mathcal {I}}^hf\right\Vert _{\infty } \le Ch^2 \sup _{{\mathbb {R}}\backslash \{z\}}|f''| \end{aligned}$$

for all \(h>0\), with \(C>0\) independent of f.

Proof

The proof is an adaptation of the proof of Theorem 1 in [1] and can be found in Appendix D. \(\square \)

4 ENO as a ReLU DNN

As mentioned in the introduction, we aim to recast the ENO interpolation algorithm from Sect. 3.1 as a ReLU DNN. Our first approach to this end begins by noticing that the crucial step of the ENO procedure is determining the correct stencil shift. Given the stencil shift, the retrieval of the ENO interpolant is straightforward. ENO-p can therefore be interpreted as a classification problem, with the goal of mapping an input vector (the evaluation of a certain function on a number of points) to one of the \(p-1\) classes (the stencil shifts). We now present one of the main results of this paper. The following theorem states that the stencil selection of p-th order ENO interpolation can be exactly obtained by a ReLU DNN for every order p. The stencil shift can be obtained from the network output by using the default output function for classification problems (cf. Remark 2.1).

Theorem 4.1

There exists a ReLU neural network consisting of \(p+ \left\lceil \log _2 \left( {\begin{array}{c}p-2\\ \lfloor \frac{p-2}{2}\rfloor \end{array}}\right) \right\rceil \) hidden layers that takes as input \({\varvec{\Delta }}^\mathbf{{0}} = \{f^k_{i+j}\}_{j=-p+1}^{p-2}\) and leads to exactly the same stencil shift as the one obtained by Algorithm 1.

Proof

We first sketch an intuitive argument why there exists a ReLU DNN that, after applying \(\min \mathrm{arg\,max}\), leads to the ENO-p stencil shift, using notation from Sect. 3.1. For a function \(f:[c,d]\rightarrow {\mathbb {R}}\), Algorithm 1 maps every input stencil \({\varvec{\Delta }}^\mathbf{{0}}\) \(\in [c,d]^{2p-2}\) to a certain stencil shift r. A more careful look at the algorithm reveals that the input space \([c,d]^{2p-2}\) can be partitioned into polytopes such that the interior of every polytope is mapped to one of the \(p-1\) possible stencil shifts. Given that every ReLU DNN is a continuous, piecewise affine function (e.g. [4]), one can construct for every \(i\in \{0, \ldots , p-2\}\) a ReLU DNN \(\phi _i: [c,d]^{2p-2}\rightarrow {\mathbb {R}}\) that is equal to 0 on the interior of every polytope corresponding to stencil shift i and that is strictly smaller than 0 on the interior of every polytope not corresponding to stencil shift i. It is then clear that

$$\begin{aligned} \min \mathrm{arg\,max}\{\phi _0, \ldots , \phi _{p-2}\}-1 \end{aligned}$$
(4.1)

corresponds to the ENO stencil shift on the interiors of all polytopes. Thanks to the minimum in (4.1), it also corresponds to the unique stencil shift from Algorithm 1 on the faces of the polytopes, where multiple \(\phi _i\) are equal to zero. The claim then follows from the fact that the mapping \({{\varvec{\Delta }}^\mathbf{{0}}}\mapsto (\phi _0({{\varvec{\Delta }}^\mathbf{{0}}}), \ldots , \phi _{p-2}({{\varvec{\Delta }}^\mathbf{{0}}}))\) can be written as a ReLU DNN.

In what follows, we present a more constructive proof that sheds light on the architecture that is needed to represent ENO and the sparsity of the corresponding weights. In addition, a technique to replace the Heaviside function (as in Algorithm 1) is used. Recall that we look for the ENO stencil shift \(r:=r_i^k\) corresponding to the interval \(I_i^k\). Let \(k\in {\mathbb {N}}\) and define \(\Delta _j^0 = f^k_{i+j}\) for \(-p+1\le j \le p-2\) and \(0\le i \le N_k\), where \(f^k_{-p+1},\ldots , f^k_{-1}\) and \(f^k_{N_k+1},\ldots , f^k_{N_k+p-2}\) are suitably defined ghost values. Note that \(\Delta ^0_j\) depends on i, but this dependence is omitted in the notation. Following Sect. 3.1, we define \(\Delta ^s_{j}=\Delta ^{s-1}_{j}-\Delta ^{s-1}_{j-1}\) for s odd and \(\Delta ^s_{j}=\Delta ^{s-1}_{j+1}-\Delta ^{s-1}_{j}\) for s even, and with \({\varvec{\Delta }}^\mathbf{{s}}\) we denote the vector consisting of all \(\Delta ^s_j\) for all applicable j. In what follows, we use \(Y^l\) and \(Z^l\) to denote the values of the l-th layer of the neural network before and after activation, respectively. We use the notation \(X^l\) for an auxiliary vector needed to calculate \(Y^l\).

Step 1. Take the input to the network to be

$$\begin{aligned} Z^0 = [\Delta _{-p+1}^0,\ldots ,\Delta _{p-2}^0]\in {\mathbb {R}}^{2(p-1)}. \end{aligned}$$

These are all the candidate function values considered in Algorithm 1.

Step 2. We want to obtain all quantities \(\Delta _{j}^s\) that are compared in Algorithm 1, as shown in Fig. 3. We therefore choose the first layer (before activation) to be

$$\begin{aligned} Y^1 = \begin{bmatrix} Y_{\Delta } \\ -Y_{\Delta } \end{bmatrix} \in {\mathbb {R}}^{2M} \quad \text {where} \quad Y_{\Delta } = \begin{bmatrix} \Delta _{0}^2 \\ \Delta _{-1}^2 \\ \vdots \end{bmatrix} \in {\mathbb {R}}^{M} \end{aligned}$$

is the vector of all the terms compared in Algorithm 1 and \(M=\frac{p(p-1)}{2}-1\). Note that every undivided difference is a linear combination of the network input. Therefore one can obtain \(Y^1\) from \(Z^0\) by taking a null bias vector and weight matrix \(W^1\in {\mathbb {R}}^{2M\times (2p-2)}\). After applying the ReLU activation function, we obtain

$$\begin{aligned} Z^1 = \begin{bmatrix} (Y_{\Delta })_+ \\ (-Y_{\Delta })_+ \end{bmatrix}. \end{aligned}$$

Step 3. We next construct a vector \(X^2\in {\mathbb {R}}^{L}\), where \(L=\frac{(p-2)(p-1)}{2}\), that contains all the quantities of the if-statement in Algorithm 1. This is ensured by setting,

$$\begin{aligned} X^2 = \begin{bmatrix} |\Delta _{-1}^2|-|\Delta _{0}^2|\\ |\Delta _{0}^3|-|\Delta _{1}^3| \\ |\Delta _{-1}^3|-|\Delta _{0}^3| \\ \vdots \end{bmatrix}. \end{aligned}$$

Keeping in mind that \(|a|=(a)_++(-a)_+\) for \(a\in {\mathbb {R}}\) we see that there is a matrix \({\widetilde{W}}^2\in {\mathbb {R}}^{L\times 2M}\) such that \(X^2={\widetilde{W}}^2 Z^1\). We wish to quantify for each component of \(X^2\) whether it is strictly negative or not (cf. the if-statement of Algorithm 1). For this reason, we define the functions \(H_1:{\mathbb {R}}\rightarrow {\mathbb {R}}\) and \(H_2:{\mathbb {R}}\rightarrow {\mathbb {R}}\) by

$$\begin{aligned} H_1(x) = {\left\{ \begin{array}{ll} 0 &{}x\le -1 \\ x+1 &{}-1< x< 0 \\ 1 &{}x \ge 0 \end{array}\right. } \quad \text {and} \quad H_2(x) = {\left\{ \begin{array}{ll} -1 &{}x\le 0 \\ x-1 &{}0< x < 1 \\ 0 &{}x \ge 1 \end{array}\right. }. \end{aligned}$$

The key property of these functions is that \(H_1\) and \(H_2\) agree with the Heaviside function on \(x>0\) and \(x<0\), respectively. When \(x=0\) the output is respectively \(+1\) or \(-1\). Now note that \(H_1(x) = (x+1)_{+} - (x)_{+}\) and \(H_2(x) = (x)_{+} - (x-1)_{+}-1\). This motivates us to define

$$\begin{aligned} Y^2 = \begin{bmatrix} X^2+1 \\ X^2 \\ X^2-1 \end{bmatrix}\in {\mathbb {R}}^{3L}, \end{aligned}$$

which can be obtained from \(Z^1\) by taking weight matrix \(W^2\in {\mathbb {R}}^{3L\times 2M}\) and bias vector \(b^2\in {\mathbb {R}}^{3L}\),

$$\begin{aligned} W^2=\left( \begin{bmatrix} 1\\ 1\\ 1 \end{bmatrix}\otimes {\mathbb {I}}_{L}\right) \cdot {\widetilde{W}}^2 \quad \text {and} \quad b^2_j={\left\{ \begin{array}{ll} 1 &{}1 \le j \le L \\ 0 &{}L+1 \le j \le 2L \\ -1 &{}2L+1 \le j \le 3L \end{array}\right. } \end{aligned}$$

where \({\mathbb {I}}_{L}\) denotes the \(L\times L\) unit matrix. After activation we obtain \(Z^2=(Y^2)_+=(W^2Z^1+b^2)_+\).
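As a quick numerical check of the identities \(H_1(x) = (x+1)_{+} - (x)_{+}\) and \(H_2(x) = (x)_{+} - (x-1)_{+}-1\) used above:

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
H1 = lambda x: relu(x + 1) - relu(x)         # = 0 for x <= -1, x + 1 on (-1, 0), 1 for x >= 0
H2 = lambda x: relu(x) - relu(x - 1) - 1     # = -1 for x <= 0, x - 1 on (0, 1), 0 for x >= 1

x = np.linspace(-2.0, 2.0, 9)
print(np.column_stack([x, H1(x), H2(x)]))
```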

Step 4. We first define \(X^3 \in {\mathbb {R}}^{2L}\) by

$$\begin{aligned} X^3_j = {\left\{ \begin{array}{ll} H_1(X^2_j) = Z^2_j-Z^2_{L+j}&{} 1\le j \le L \\ H_2(X^2_{j-L}) = Z^2_j-Z^2_{L+j}-1 &{} L + 1 \le j \le 2L. \end{array}\right. } \end{aligned}$$

This is clearly for every j an affine transformation of the entries of \(Z^2\). For this reason there exist a matrix \({\widetilde{W}}^3\in {\mathbb {R}}^{2L\times 3L}\) and a bias vector \(\tilde{b}^3\in {\mathbb {R}}^{2L}\) such that \(X^3 = {\widetilde{W}}^3 Z^2+\tilde{b}^3\).

Fig. 3 Only undivided differences in the shaded region are compared in Algorithm 1

Fig. 4 Arrangement of \(N_1,\ldots ,N_L\) into directed acyclic graph

In order to visualize the next steps, we arrange the elements of \(X^3\) in a triangular directed acyclic graph, shown in Fig. 4, where every node \({\mathcal {N}}_j\) corresponds to the tuple \((X^3_j,X^3_{j+L}) = ( H_1(X^2_j), H_2(X^2_j))\). We note that this tuple is either of the form \(( +1, H_2(X^2_j))\) or \(( H_1(X^2_j), -1)\). Algorithm 1 is equivalent to finding a path from the top node to one of the bins on the bottom. Starting from \({\mathcal {N}}_1\), we move to the closest element to the right in the row below (i.e. \({\mathcal {N}}_2\)) if \({\mathcal {N}}_1\) is of the form \(( +1, H_2(X^2_j))\). If \({\mathcal {N}}_1\) is of the form \(( H_1(X^2_j), -1)\), we move to the closest element to the left in the row below (i.e. \({\mathcal {N}}_3\)). If \({\mathcal {N}}_1\) is of the form \((+1,-1)\), then it is not important in which direction we move. Both paths lead to a suitable ENO stencil shift. Repeating the same procedure at each row, one ends up in one of the \(p-1\) bins at the bottom representing the stencil shift r.

There are \(2^{p-2}\) paths from the top to one of the bins at the bottom. In order to represent the path using a \((p-2)\)-tuple of entries of \(X^3\), one needs to choose between \(H_1(X^2_j)\) and \(H_2(X^2_j)\) at every node of the path, leading to \(2^{p-2}\) variants of each path. At least one of these variants only takes the values \(+1\) and \(-1\) on the nodes and is identical to the path described above; this is the variant we wish to select. Counting all variants, the total number of paths is \(2^{2p-4}\).

Consider a path \({\mathcal {P}}=(X^3_{j_1}, \ldots , X^3_{j_{p-2}})\) that leads to bin r. We define for this path a weight vector \(W \in \{-1,0,1\}^{2L}\) whose elements are set as

$$\begin{aligned} W_j = {\left\{ \begin{array}{ll} +1 &{}\quad \text {if } X^3_j=+1 \text { and } j=j_s \text { for some } 1\le s \le p-2 \\ -1 &{}\quad \text {if } X^3_j=-1 \text { and } j=j_s \text { for some } 1\le s \le p-2 \\ 0 &{}\quad \text {otherwise.} \end{array}\right. } \end{aligned}$$

For this particular weight vector and for any possible \(X^3 \in {\mathbb {R}}^{2L}\) we have \({W \cdot X^3}\le p-2\), with equality achieved if and only if the entries of \(X^3\) appearing in \({\mathcal {P}}\) are assigned the precise values used to construct W. One can construct such a weight vector for each of the \(2^{2p-4}\) paths. We next construct the weight matrix \({\widehat{W}}^3\in {\mathbb {R}}^{2^{2p-4}\times 2L}\) in such a way that the first \(2^{p-2}\cdot \left( {\begin{array}{c}p-2\\ 0\end{array}}\right) \) rows correspond to the weight vectors for paths reaching \(r=0\), the next \(2^{p-2}\cdot \left( {\begin{array}{c}p-2\\ 1\end{array}}\right) \) for paths reaching \(r=1\) et cetera. We also construct the bias vector \({\hat{b}}^3\in {\mathbb {R}}^{2^{2p-4}}\) by setting each element to \(p-2\) and we define \({\hat{X}}^3 = {\widehat{W}}^3 X^3 + {\hat{b}}^3={\widehat{W}}^3 ({\widetilde{W}}^3 Z^2+\tilde{b}^3) + {\hat{b}}^3\). By construction, \({\hat{X}}^3_j = 2p-4\) if and only if path j corresponds to a suitable ENO stencil shift, otherwise \(0\le {\hat{X}}^3_j<2p-4\).

Step 5. Finally, we define the output vector by taking the maximum of all components of \({\hat{X}}^3\) that correspond to the same bin,

$$\begin{aligned} {\hat{Y}}_j = \max \left\{ {\hat{X}}^3\left( 1+2^{p-2}\cdot \sum _{k=0}^{j-2} \left( {\begin{array}{c}p-2\\ k\end{array}}\right) \right) , \ldots , {\hat{X}}^3\left( 2^{p-2}\cdot \sum _{k=0}^{j-1} \left( {\begin{array}{c}p-2\\ k\end{array}}\right) \right) \right\} , \end{aligned}$$

for \(j=1,\ldots , p-1\) and where \({\hat{X}}^3(j):={\hat{X}}^3_j\). Note that \({\hat{Y}}_j\) is the maximum of \(2^{p-2}\cdot \left( {\begin{array}{c}p-2\\ j-1\end{array}}\right) \) real positive numbers. Using the observation that \(\max \{a,b\}=(a)_++(b-a)_+\) for \(a,b\ge 0\), one finds that the calculation of \({\hat{Y}}\) requires \( p-2+ \left\lceil \log _2 \left( {\begin{array}{c}p-2\\ \lfloor \frac{p-2}{2}\rfloor \end{array}}\right) \right\rceil \) additional hidden layers. By construction, it is true that \({\hat{Y}}_j=2p-4\) if and only if the \((j-1)\)-th bin is reached. Furthermore, \({\hat{Y}}_j<2p-4\) if the \((j-1)\)-th bin is not reached. The set of all suitable stencil shifts R and the unique stencil shift r from Algorithm 1 are then respectively given by

$$\begin{aligned} R = \mathrm {argmax}_j {\hat{Y}}_j - 1 \quad \text {and} \quad r = \min R = \min \mathrm {argmax}_j {\hat{Y}}_j - 1, \end{aligned}$$
(4.2)

where for classification problems, \(\min \mathrm {argmax}\) is the default output function to obtain the class from the network output (see Remark 2.1). \(\square \)

Remark 4.2

The neural network constructed in the above theorem is local in the sense that for each cell, it provides a stencil shift. These local neural networks can be concatenated to form a single neural network that takes as its input the vector \(f^k\) of sampled values and returns the vector of interpolated values that approximates \(f^{k+1}\). The global neural network combines the output stencil shift of each local neural network with a simple linear mapping (3.2).

Although the previous theorem provides a network architecture for every order p, the obtained networks are excessively large for small p. We therefore present alternative constructions for ENO interpolation of orders \(p=3,4\).

Algorithm 1 for \(p=3\) can be exactly represented by the following ReLU network with a single hidden layer, whose input is given by \(X = (\Delta _{-2}^0,\Delta ^0_{-1}, \Delta ^0_0, \Delta ^0_1)^\top \). The first hidden layer is identical to the one described in the original proof of Theorem 4.1 for \(p=3\), with a null bias vector and \(W^1 \in {\mathbb {R}}^{4 \times 4}\),

$$\begin{aligned} W^1 = \begin{pmatrix} 0 &{}\quad 1 &{}\quad -2 &{} \quad 1 \\ 1 &{}\quad -2&{}\quad 1 &{}\quad 0 \\ 0 &{}\quad -1 &{}\quad 2 &{}\quad -1 \\ -1 &{}\quad 2&{}\quad -1 &{}\quad 0 \\ \end{pmatrix}, \quad b^1 = \begin{pmatrix} 0\\ 0\\ 0\\ 0\end{pmatrix}. \end{aligned}$$
(4.3)

The weights and biases of the output layer are

$$\begin{aligned} W^2 = \begin{pmatrix} -1 &{}\quad 1 &{} \quad -1 &{}\quad 1\\ 1 &{}\quad -1 &{}\quad 1 &{} \quad -1\\ \end{pmatrix}, \quad b^2 = \begin{pmatrix} 0\\ 0\end{pmatrix}. \end{aligned}$$
(4.4)

The resulting network output is

$$\begin{aligned} {\hat{Y}} = \begin{pmatrix}|\Delta ^2_{-1}|-|\Delta ^2_0| \\ |\Delta ^2_{0}|-|\Delta ^2_{-1}|\end{pmatrix}, \end{aligned}$$

from which the ENO stencil shift can then be determined using (4.2).
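The network (4.3)-(4.4) can be transcribed directly, as in the sketch below, which evaluates it on the four input values \((f^k_{i-2},f^k_{i-1},f^k_i,f^k_{i+1})\) and applies the output function of (4.2); with 0-based indexing the \(-1\) in (4.2) is absorbed.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

# Weights (4.3)-(4.4); the rows of W1 produce (D2_0, D2_{-1}, -D2_0, -D2_{-1}).
W1 = np.array([[0., 1., -2., 1.],
               [1., -2., 1., 0.],
               [0., -1., 2., -1.],
               [-1., 2., -1., 0.]])
W2 = np.array([[-1., 1., -1., 1.],
               [1., -1., 1., -1.]])

def delono3_shift(stencil):
    """ENO-3 stencil shift r from the four values (f_{i-2}, f_{i-1}, f_i, f_{i+1})."""
    y_hat = W2 @ relu(W1 @ np.asarray(stencil, float))      # = (|D2_{-1}| - |D2_0|, |D2_0| - |D2_{-1}|)
    return int(np.flatnonzero(y_hat == y_hat.max()).min())  # min arg max, 0-based, cf. (4.2)

# Linear data keeps the centred stencil (r = 0); a kink between x_i and x_{i+1} forces the left shift (r = 1).
print(delono3_shift([0.0, 1.0, 2.0, 3.0]), delono3_shift([0.0, 1.0, 2.0, 9.0]))
```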

For \(p=4\), Algorithm 1 can be represented by following ReLU network with 3 hidden layers, whose input is given by \(X = (\Delta _{-3}^0,\Delta _{-2}^0,\Delta ^0_{-1}, \Delta ^0_0, \Delta ^0_1,\Delta _{2}^0)^\top \). The first hidden layer is identical to the one described in the original proof of Theorem 4.1 for \(p=4\), with a null bias vector and \(W^1 \in {\mathbb {R}}^{10 \times 6}\),

$$\begin{aligned} W^1 = \begin{pmatrix} {\widetilde{W}}^1 \\ -{\widetilde{W}}^1 \end{pmatrix} \in {\mathbb {R}}^{10 \times 6} \quad \text {where}\quad {\widetilde{W}}^1 = \begin{pmatrix} 0 &{}\quad 0 &{}\quad 1 &{}\quad -2 &{}\quad 1 &{}\quad 0 \\ 0 &{}\quad 1 &{}\quad -2 &{}\quad 1 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad -1 &{}\quad 3 &{}\quad -3 &{}\quad 1 \\ 0 &{}\quad -1 &{}\quad 3 &{}\quad -3 &{}\quad 1 &{}\quad 0 \\ -1 &{}\quad 3 &{}\quad -3 &{}\quad 1 &{}\quad 0 &{}\quad 0\end{pmatrix}. \end{aligned}$$
(4.5)

The second hidden layer has a null bias vector and the weight matrix

$$\begin{aligned} W^2 = \begin{pmatrix} {\widetilde{W}}^2 \\ -{\widetilde{W}}^2 \end{pmatrix} \in {\mathbb {R}}^{6 \times 10} \quad \text {where}\quad {\widetilde{W}}^2 = \begin{pmatrix} 1&1 \end{pmatrix}\otimes \begin{pmatrix} -1&{}\quad 1&{}\quad 0&{}\quad 0&{}\quad 0\\ 0&{}\quad 0&{}\quad -1&{}\quad 1&{}\quad 0\\ 0&{}\quad 0&{}\quad 0&{}\quad -1&{}\quad 1 \end{pmatrix}. \end{aligned}$$
(4.6)

Note that \({\widetilde{W}}^2 \in {\mathbb {R}}^{3 \times 10}\) is as in the original proof of Theorem 4.1 for \(p=4\). The third hidden layer and the output layer both have a null bias vector and their weights are respectively given by,

$$\begin{aligned} W^3 = \begin{pmatrix} 1 &{}\quad 1 &{} \quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 1\\ 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 1\\ -1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad -1\\ 0 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 &{}\quad 1 \end{pmatrix} \quad \text {and} \quad W^4 = \begin{pmatrix} 1 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 1 &{}\quad 1 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 \end{pmatrix}. \end{aligned}$$
(4.7)

After an elementary, yet tedious case study, one can show that the shift can again be determined using (4.2).

Remark 4.3

Similarly, one can show that the stencil selection algorithm for ENO reconstruction (Algorithm 2 in Appendix A) for \(p=2\) can be exactly represented by a ReLU DNN with one hidden layer of width 4. The input and output dimension are 3 and 2, respectively. For \(p=3\), Algorithm 2 can be shown to correspond to a ReLU DNN with three hidden layers of dimensions (10, 6, 4). Input and output dimension are 5 and 4, respectively.

Remark 4.4

After having successfully recast the ENO stencil selection as a ReLU neural network, it is natural to investigate whether there exists a pure ReLU neural network (i.e. without an additional output function) with input \(f^k\) and output \(({\mathcal {I}}^{h_k}f)^{k+1}\), as in the setting of (3.2) in Sect. 3.1. Since ENO is a discontinuous procedure and a pure ReLU neural network is a continuous function, a network with such an output does not exist. It remains interesting, however, to investigate to which extent we can approximate ENO using ReLU neural networks. This is the topic of Section 4.4 of the thesis [10], where it is shown that there exists a pure ReLU DNN that mimics ENO in such a way that some of the desirable properties of ENO are preserved.

5 ENO-SR as a ReLU DNN

The goal of this section is to recast the second-order ENO-SR procedure from Sect. 3.2 as a ReLU DNN, similar to what we did for ENO in Sect. 4. Just as for ENO, the crucial step of ENO-SR is the stencil selection, allowing us to interpret ENO-SR as a classification problem. In this context, we prove the equivalent of Theorem 4.1 for ENO-SR-2. Afterwards, we interpret ENO-SR as a regression problem (cf. Remark 4.4) and investigate whether we can cast ENO-SR-2 as a pure ReLU DNN, i.e. without an additional output function. In the following, we assume f to be a continuous function that is twice differentiable except at a single point z, where the first derivative has a jump of size \([f'] = f^{\prime }(z+) - f^{\prime }(z-) \).

5.1 ENO-SR-2 stencil selection as ReLU DNN

We will now prove that a second-order accurate prediction of \(f^{k+1}\) can be obtained given \(f^k\) using a ReLU DNN, where we use notation as in Sect. 3.1. Equation (3.2) shows that we only need to calculate \({\mathcal {I}}^{h_k}_{i}f(x^{k+1}_{2i-1})\) for every \(1\le i\le N_k\). The proof we present can be directly generalized to interpolation at points other than the midpoints of the cells, e.g. retrieving cell boundary values for reconstruction purposes. From the ENO-SR interpolation procedure it is clear that for every i there exists \(r_i^k\in \{-2,0,2\}\) such that \({\mathcal {I}}^{h_k}_{i}f(x^{k+1}_{2i-1})=p^k_{i+r_i^k}(x^{k+1}_{2i-1})\). Analogously to what was described in Sect. 4, this gives rise to a classification problem. Instead of considering the stencil shifts as the output classes of the network, one can also treat the different cases that are implicitly described in the ENO-SR interpolation procedure in Sect. 3.2 as classes. This enables us to construct a ReLU neural network such that the stencil shift \(r^k_i\) can be obtained from the network output by using the default output function for classification problems (cf. Sect. 3.1 and Remark 2.1).

Theorem 5.1

There exists a ReLU neural network with input \(f^k\) that leads to output \((r_1^k, \ldots , r_{N_k}^k)\) as defined above.

Proof

Instead of explicitly constructing a ReLU DNN, we will prove that we can write the output vector as a composition of functions that can each be written as pure ReLU DNNs with linear output functions. Such functions include the rectifier function, the absolute value, the maximum and the identity function. A possible architecture realising the network of this proof is presented after the proof. Furthermore, we will assume that the discontinuity is not located in the first four or last four intervals. This can be achieved by taking k large enough, or by introducing suitably prescribed ghost values. We also assume without loss of generality that \(x_i^k=i\) for \(0\le i\le N_k\).

The input of the DNN will be the vector \(X^0\in {\mathbb {R}}^{N_k+1}\) with \(X^0_{i+1}=f(x^k_i)\) for all \(0\le i \le N_k\). Using a simple affine transformation, we can obtain \(X^1\in {\mathbb {R}}^{N_k-1}\) such that \(X^1_{i}=\Delta ^2_{h_k} f(x^k_{i})\) for all \(1\le i \le N_k-1\). We now define the following quantities,

$$\begin{aligned} M_i = \max _{n=1,2,3}|\Delta _{h_k}^2f(x^k_{i\pm n})|=\max _{n=1,2,3}|X^1_{i\pm n}|,\quad N_i^{\pm } = \max _{n=1,2}|\Delta _{h_k}^2f(x^k_{i\pm n})|=\max _{n=1,2}|X^1_{i\pm n}|, \end{aligned}$$
(5.1)

where \(4\le i\le N_k-4\). Next, we construct a vector \(X^2\in {\mathbb {R}}^{N_k}\) such that every entry corresponds to an interval. For \(1\le i\le N_k\), we want \(X^2_i>0\) if and only if the interval \(I_{i}^k\) is labelled as B by the adapted ENO-SR detection mechanism. We can achieve this by defining

$$\begin{aligned} \begin{aligned} X^2_{i} = (\min \{|X^1_{i}|- N_i^+,|X^1_{i-1}| - N_{i-1}^-\})_+ +(|X^1_i|-M_i)_+ + (|X^1_{i-1}|-M_{i-1})_+ \end{aligned} \end{aligned}$$
(5.2)

for \(5\le i\le N_k-4\). Furthermore we set \(X^2_1 = X^2_2=X^2_3 =X^2_4 = X^2_{N_k-3}= X^2_{N_k-2}= X^2_{N_k-1}= X^2_{N_k}=0\). Note that the first term of the sum will be strictly positive if \(I_i^k\) is labelled bad by the second rule of the detection mechanism and one of the other terms will be strictly positive if \(I_i^k\) is labelled bad by the first rule. Good intervals \(I_i^k\) have \(X^2_i=0\).

Now define \(n_{i,l} = l + 4(i-1)\) for \(1\le i \le N_k\) and \(1\le l \le 4\). Using this notation, i refers to the interval \(I_i^k\). We denote by \({p_i^k:[c,d]\rightarrow {\mathbb {R}}:x\mapsto a_ix+b_i}\) the linear interpolation of the endpoints of \(I_i^k\), where we write \(a_i\) and \(b_i\) instead of \(a_i^k\) and \(b_i^k\) to simplify notation. Define \(X^3\in {\mathbb {R}}^{4N_k}\) in the following manner:

$$\begin{aligned} \begin{aligned} X^3_{n_{i,1}}&= X^2_i, \qquad \qquad \qquad \qquad \mathrm { }X^3_{n_{i,3}} = \left( |b_{i-2}-b_{i+2}|-x^{k+1}_{2i-1}|a_{i-2}-a_{i+2}|\right) _+,\\ X^3_{n_{i,2}}&= |a_{i-2} - a_{i+2}|, \,\qquad X^3_{n_{i,4}} = \left( -|b_{i-2}-b_{i+2}|+x^{k+1}_{2i-1}|a_{i-2}-a_{i+2}|\right) _+, \end{aligned} \end{aligned}$$
(5.3)

for \(5\le i \le N_k-4\). We set \(X^3_{n_{i,l}}=0\) for \(1 \le l \le 4\) and \(1\le i \le 4\) or \(N_k-3\le i \le N_k\). We can now define the output \({\hat{Y}}\in {\mathbb {R}}^{N_k}\) of the ReLU neural network by

$$\begin{aligned} {\hat{Y}}_i = \min \mathrm {argmin}_{1\le l \le 4} X^3_{n_{i,l}}. \end{aligned}$$
(5.4)

where we used the notation \({\hat{Y}}_i\) for the predicted class instead of the network output to simplify notation. It remains to prove that \(r_i^k\) can be obtained from \({\hat{Y}}_i\). Note that \({\hat{Y}}_i=1\) if and only if \(I_i^k\) was labelled G. Therefore \({\hat{Y}}_i=1\) corresponds to \(r_i^k=0\). If \({\hat{Y}}_i=2\), then \(I_i^k\) was labelled B and the interpolants \(p_{i-2}^k\) and \(p_{i+2}^k\) do not intersect, leading to \(r_i^k=0\) according to the interpolation procedure. Next, \({\hat{Y}}_i=3,4\) corresponds to the case where \(I_i^k\) was labelled B and the interpolants \(p_{i-2}^k\) and \(p_{i+2}^k\) do intersect. This intersection point is seen to be \(y=\frac{b_{i+2}-b_{i-2}}{a_{i-2}-a_{i+2}}\). If \({\hat{Y}}_i=3\), then \(x^{k+1}_{2i-1}\) is right of y and therefore \(r_i^k=2\). Analogously, \({\hat{Y}}_i=4\) corresponds to \(r^k_i=-2\), which concludes the proof. \(\square \)

Now that we have established that our adaptation of the second-order ENO-SR algorithm can be written as a ReLU DNN augmented with a discontinuous output function, we can present a possible architecture of a DNN that calculates the output \({\hat{Y}}_i\) from \(f^k\). The network we present has five hidden layers, with widths varying from 6 to 20, and an output layer of 4 neurons. The network is visualized in Fig. 5.

Fig. 5 Flowchart of a ReLU DNN to calculate \({\hat{Y}}_i\) from \(f^k\)

We now give some more explanation about how each layer in Fig. 5 can be calculated from the previous layer, where we use the same notation as in the proof of Theorem 4.1. In addition, we define and note that

$$\begin{aligned} \gamma _{i-2,i+2}(z)&=|b_{i-2}-b_{i+2}|-z|a_{i-2}-a_{i+2}|, \end{aligned}$$
(5.5a)
$$\begin{aligned} \max \{x,y\}&= x + (y-x)_+. \end{aligned}$$
(5.5b)

A.B. It is easy to see that all quantities of the first layer are linear combinations of the input neurons. C. Application of \(|x|=(x)_++(-x)_+\) and definition (5.5a) to \(\pm (b_{i-2}-b_{i+2})\) and \(\pm (a_{i-2}-a_{i+2})\). D. Straightforward application of the identity \(|x|=(x)_++(-x)_+\) to \(\pm \Delta _q\), followed by taking linear combinations. E.G.I. Values are passed on unchanged. F. The first six quantities are passed on from the previous layer. The other ones are applications of Eq. (5.5b), where the order of the arguments of the maximums is carefully chosen. H. Equation (5.5b) was used, together with \(\min \{x,y\}=-\max \{-x,-y\}\). J. Application of Eq. (5.1). K. The result follows from combining definitions (5.2) and (5.3). L. As can be seen in definition (5.4), \({\hat{Y}}_i\) is obtained by applying the output function \(\min \mathrm {argmin}\) to the output layer.

Remark 5.2

The second-order ENO-SR method as proposed in [1] can also be written as a ReLU DNN, but it leads to a neural network that is considerably larger than the one presented above.

5.2 ENO-SR-2 regression as ReLU DNN

After having successfully recast the ENO-SR stencil selection as a ReLU neural network, it is natural to investigate whether there exists a ReLU neural network with output \(({\mathcal {I}}^{h_k}f)^{k+1}\), as in the setting of (3.2) in Sect. 3.1. Since ENO-SR interpolation is a discontinuous procedure and a ReLU neural network is a continuous function, a network with such an output does not exist. It is however interesting to investigate to which extent we can approximate ENO-SR using ReLU neural networks. In what follows, we design an approximate ENO-SR method, based on the adapted ENO-SR-2 method of Sect. 3.2, and investigate its accuracy.

We first introduce for \(\epsilon \ge 0\) the function \(H_{\epsilon }:{\mathbb {R}}\rightarrow {\mathbb {R}}\), defined by

$$\begin{aligned} H_{\epsilon }(x) = {\left\{ \begin{array}{ll} 0 &{}\quad x\le 0\\ x/\epsilon &{}\quad 0<x\le \epsilon \\ 1 &{}\quad x > \epsilon .\end{array}\right. } \end{aligned}$$
(5.6)

Note that \(H_0\) is nothing more than the Heaviside function. Using this function and the notation of the proof of Theorem 5.1, we can write down a single formula for \({\mathcal {I}}^{h_k}_{i}f(x^{k+1}_{2i-1})\),

$$\begin{aligned} \begin{aligned} {\mathcal {I}}^{h_k}_{i}f(x^{k+1}_{2i-1})&= (1-\alpha )p^k_i(x^{k+1}_{2i-1}) + \alpha \left( (1-\beta )p^k_{i+2}(x^{k+1}_{2i-1})+\beta p^k_{i-2}(x^{k+1}_{2i-1})\right) , \\&\quad \text { where } \alpha =H_{0}(\min \{X^3_{n_{i,1}},X^3_{n_{i,2}}\}), \quad \beta = H_{0}(X^3_{n_{i,3}}), \end{aligned} \end{aligned}$$
(5.7)

for \(1\le i \le N_k\). Observe that this formula cannot be calculated using a pure ReLU DNN. Nevertheless, we will base ourselves on this formula to introduce an approximate ENO-SR algorithm that can be exactly written as a ReLU DNN.

The first step is to replace \(H_0\) by \(H_{\epsilon }\) with \(\epsilon > 0\), since

$$\begin{aligned} H_{\epsilon }(x) = \frac{1}{\epsilon }(x)_+-\frac{1}{\epsilon }(x-\epsilon )_+, \end{aligned}$$
(5.8)

which clearly can be calculated using a ReLU neural network. The remaining issue is then that the multiplication of two numbers cannot be exactly represented using a ReLU neural network. Moreover, as we aim for a network architecture that is independent of the accuracy of the final network that approximates ENO-SR-2, we cannot use the approximate multiplication networks in the sense of [26]. We therefore introduce an operation that resembles the multiplication of bounded numbers in another way. For \(\lambda >0\), we denote by \(\star \) the operation on \([0,1]\times [-\lambda ,\lambda ]\) defined by

$$\begin{aligned} x \star y := (y+\lambda x-\lambda )_+-(-y+\lambda x-\lambda )_+. \end{aligned}$$
(5.9)

Like \(H_{\epsilon }\), this operation can be cast as a simple ReLU DNN. We compare \(x\star y\) with \(x\cdot y\) for fixed \(x\in [0,1]\) and \(\lambda >0\) in Fig. 6. Next, we list some properties of \(\star \) that are of great importance for the construction of our approximation.

Fig. 6 Plot of \(x\star y\) and \(x\cdot y\) for fixed \(x\in [0,1]\) and \(\lambda >0\)

Lemma 5.3

For \(\lambda >0\), the operation \(\star \) satisfies the following properties:

  1. (1)

    For all \(x\in \{0,1\}\) and \(y\in [-\lambda ,\lambda ]\) it holds true that \(x \star y = xy\).

  2. (2)

    For all \(x\in [0,1]\) and \(y\in [0,\lambda ]\) we have \(0\le x \star y \le xy\).

  3. (3)

    For all \(x\in [0,1]\) and \(y\in [-\lambda ,0]\) we have \(xy\le x \star y \le 0\).

  4. (4)

    There exist \(x\in [0,1]\) and \(y_1,y_2\in [-\lambda ,\lambda ]\) such that

    $$\begin{aligned} \min \{y_1,y_2\}\le (1-x)\star y_1 + x \star y_2\le \max \{y_1,y_2\} \end{aligned}$$

    does not hold.

  5. (5)

    For all \(x\in [0,1]\) and \(y_1,y_2\in [-\lambda ,\lambda ]\) it holds true that

    $$\begin{aligned}\min \{y_1,y_2\}\le y_1 + x \star (y_2-y_1)\le \max \{y_1,y_2\}. \end{aligned}$$

Proof

Properties 1,2 and 3 follow immediately from the definition and can also be verified on Fig. 6. For property 4, note that \((1/2)\star (\lambda /2)+(1/2)\star (\lambda /2)=0\). Property 5 is an application of properties 2 and 3. \(\square \)
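Both \(H_{\epsilon }\) in (5.8) and \(\star \) in (5.9) consist of two ReLU evaluations each; the short sketch below implements them and illustrates properties 1-3 of Lemma 5.3 on a few sample points.

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

def H_eps(x, eps):
    # (5.8): H_eps written as a difference of two ReLUs (eps > 0)
    return (relu(x) - relu(x - eps)) / eps

def star(x, y, lam):
    # (5.9): ReLU surrogate for the product on [0, 1] x [-lam, lam]
    return relu(y + lam * x - lam) - relu(-y + lam * x - lam)

lam = 16.0
for x, y in [(0.0, 5.0), (1.0, -3.0), (0.5, 8.0), (0.5, -8.0)]:
    print(x, y, star(x, y, lam), x * y)      # exact for x in {0, 1}; |x * y| is underestimated otherwise
print(H_eps(np.array([-1.0, 0.25, 2.0]), 0.5))
```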

For the moment, we assume that there exists \(\lambda >0\) such that all quantities that we will need to multiply lie in the interval \([-\lambda ,\lambda ]\). This assumption will be verified in the proof of Theorem 5.5. In view of the fourth property in Lemma 5.3, directly replacing all multiplications in (5.7) by the operation \(\star \) may lead to a quantity that is no longer a convex combination of \(p^k_{i-2}(x^{k+1}_{2i-1}), p^k_{i}(x^{k+1}_{2i-1})\) and \(p^k_{i+2}(x^{k+1}_{2i-1})\). We therefore introduce the approximate ENO-SR prediction \({\hat{f}}_{i,\epsilon }^{k+1}\) of \(f(x^{k+1}_i)\) by setting \({\hat{f}}_{2i,\epsilon }^{k+1}=f_{2i}^{k+1}\) for \(0\le i \le N_k\) and

$$\begin{aligned} {\hat{f}}_{2i-1,\epsilon }^{k+1}= & {} p^k_i(x^{k+1}_{2i-1}) + \alpha \star \left( p^k_{i+2}(x^{k+1}_{2i-1})- p^k_i(x^{k+1}_{2i-1}) +\beta \star \left( p^k_{i-2}(x^{k+1}_{2i-1})-p^k_{i+2}(x^{k+1}_{2i-1})\right) \right) ,\nonumber \\&\quad \text { where } \alpha =H_{\epsilon }(\min \{X^3_{n_{i,1}},X^3_{n_{i,2}}\}), \quad \beta = H_{\epsilon }(X^3_{n_{i,3}}), \end{aligned}$$
(5.10)

for \(1\le i \le N_k\). The fifth property of Lemma 5.3 ensures that the two convex combinations in (5.7) are replaced by convex combinations (with possibly different weights). The theorem below quantifies the accuracy of the approximate ENO-SR predictions for \(\epsilon >0\).

Theorem 5.4

Let \(f:[c,d]\rightarrow [-1,1]\) be a globally continuous function with a bounded second derivative on \({\mathbb {R}}\backslash \{z\}\) and a discontinuity in the first derivative at a point z. For every k, the approximate ENO-SR predictions \({\hat{f}}^{k+1}_{i,\epsilon }\) satisfy for every \(0\le i\le N_{k+1}\) and \(\epsilon \ge 0\) that

$$\begin{aligned} |{\mathcal {I}}^{h_k}_{i}f(x^{k+1}_{2i-1})-{\hat{f}}^{k+1}_{i,\epsilon }| \le Ch_k^2 \sup _{[c,d]\backslash \{z\}}|f''| + \frac{3}{2}\epsilon , \end{aligned}$$
(5.11)

where \({\mathcal {I}}^{h_k}_{i}f(x^{k+1}_{2i-1})\) is the ENO-SR-2 prediction.

Proof

The proof can be found in Appendix E. \(\square \)

We see that our approximation is second-order accurate up to an additional constant error, which can be made arbitrarily small. Finally, the following theorem states that the constructed approximation can indeed be represented by a ReLU DNN, i.e. there exists a pure ReLU DNN that satisfies bound (5.11).

Theorem 5.5

Let \({\mathcal {F}}\) denote the class of functions \(f:[c,d]\rightarrow [-1,1]\) that are globally continuous with a bounded second derivative on \({\mathbb {R}}\backslash \{z\}\) and a discontinuity in the first derivative at a point z. For every \(\epsilon >0\), there exists a pure ReLU neural network that takes for every \(f\in {\mathcal {F}}\) as input the vector \(\{f^k_{i+q}\}_{q=-5}^4\) and returns the value \({\hat{f}}^{k+1}_{2i-1,\epsilon }\) for every \(1\le i \le N_k\), \(0\le k \le K\).

Proof

Most of the work was already done in Theorem 5.1 and the discussion preceding Theorem 5.4. Indeed, we have already established that \(p^k_{i-2}(x^{k+1}_{2i-1}), p^k_{i}(x^{k+1}_{2i-1})\), \(p^k_{i+2}(x^{k+1}_{2i-1}),X^3_1,X^3_2\) and \(X^3_3\), as well as the operation \(\star \) can be represented using pure ReLU networks. It only remains to find a bound for all second arguments of the operation \(\star \) in (5.10). Since the codomain of f is \([-1,1]\), one can calculate that \(p^k_{i-2}(x^{k+1}_{2i-1}), p^k_{i}(x^{k+1}_{2i-1})\) and \(p^k_{i+2}(x^{k+1}_{2i-1})\) lie in \([-4,4]\). Using Lemma 5.3, we then find that

$$\begin{aligned} p^k_{i+2}(x^{k+1}_{2i-1})- p^k_i(x^{k+1}_{2i-1}) +\beta \star \left( p^k_{i-2}(x^{k+1}_{2i-1})-p^k_{i+2}(x^{k+1}_{2i-1})\right) \in [-16,16] \end{aligned}$$

for all \(\beta \in [0,1]\). We can thus use the operation \(\star \) with \(\lambda = 16\) in (5.9). \(\square \)

We now present the network architecture of a ReLU neural network that computes the approximate ENO-SR prediction (5.10). The network we propose is visualized in Fig. 7 and consists of eight hidden layers with widths 23, 13, 14, 12, 8, 7, 6 and 3.

Fig. 7 Flowchart of a ReLU DNN to calculate \({\hat{f}}^{k+1}_{2i-1,\epsilon }\) from \(f^k\)

In the figure, the following notation was used,

$$\begin{aligned} \begin{aligned} m_i&= \min \{X^3_{n_{i,1}},X^3_{n_{i,2}}\},\\ P^k_{m,n}&= p^k_{m}(x^{k+1}_{2i-1})- p^k_n(x^{k+1}_{2i-1}), \end{aligned} \end{aligned}$$
(5.12)

for \(1\le i,m,n \le N_k\). We now give some more explanation about how all the layers can be calculated from the previous layer in Fig. 7. A.B. All quantities of the first layer are linear combinations of the input neurons, where we also refer to Fig. 5. From the proof of Theorem 5.5, it follows that we can take \(\lambda =16\). C.D. Linear combinations. E. We refer to (5.10) and (5.12) for the definitions of \(\beta \) and \(m_i\), respectively. F. We refer to (5.9) and (5.10) for the definitions of \(\star \) and \(\alpha \), respectively. G. From (5.10) it follows that the value of the output layer is indeed equal to the approximate second-order ENO-SR prediction \({\hat{f}}^{k+1}_{2i-1,\epsilon }\).

6 Numerical results

From Sects. 4 and 5, we know that there exist deep ReLU neural networks, of a specific architecture, that mimic the ENO-p and second-order ENO-SR-2 algorithms for interpolating rough functions. In this section, we investigate whether such networks can be trained in practice and examine their performance on a variety of tasks for which the ENO procedure is heavily used. We will refer to these trained networks as DeLENO (Deep Learning ENO) and DeLENO-SR networks. Details on the training procedure can be found in Sect. 6.1; the performance is discussed in Sect. 6.2 and illustrated by various applications at the end of this section.

6.1 Training

The training of these networks involves finding a parameter vector \(\theta \) (the weights and biases of the network) that approximately minimizes a certain loss function \({\mathcal {J}}\) which measures the error in the network’s predictions. To achieve this, we have access to a finite data set \({\mathbb {S}}= \{ (X ^i,{\mathcal {L}}(X^i))\}_i \subset D \times {\mathcal {L}}(D)\), where \({\mathcal {L}}:D\subset {\mathbb {R}}^m\rightarrow {\mathbb {R}}^n\) is the unknown function we try to approximate using a neural network \({\mathcal {L}}^\theta \).

For classification problems, each \(Y^i={\mathcal {L}}(X^i)\) is an n-tuple that indicates to which of the n classes \(X^i\) belongs. The output of the network \({\hat{Y}}^i={\mathcal {L}}^\theta (X^i)\) is an approximation of \(Y^i\) in the sense that \({\hat{Y}}^i_j\) can be interpreted as the probability that \(X^i\) belongs to class j. A suitable loss function in this setting is the cross-entropy function with regularization term

$$\begin{aligned} {\mathcal {J}}(\theta ; {\mathbb {S}},\lambda ) = - \frac{1}{\# {\mathbb {S}}}\sum \limits _{(X^i,Y^i) \in {\mathbb {S}}} \ \sum \limits _{j=1}^{n} Y^i_j \log ({\hat{Y}}^i_j)+ \lambda {\mathcal {R}}(\theta ). \end{aligned}$$
(6.1)

The cross-entropy term measures the discrepancy between the probability distributions of the true outputs and the predictions. It is common to add a regularization term \(\lambda {\mathcal {R}}(\theta )\) to prevent overfitting of the data and thus improve the generalization capabilities of the network [14]. The network hyperparameter \(\lambda > 0\) controls the extent of regularization. Popular choices of \({\mathcal {R}}(\theta )\) include the sum of some norm of all the weights of the network. To monitor the generalization capability of the network, it is useful to split \({\mathbb {S}}\) into a training set \({\mathbb {T}}\) and a validation set \(\mathbb {V}\) and minimize \({\mathcal {J}}(\theta ; {\mathbb {T}},\lambda )\) instead of \({\mathcal {J}}(\theta ; {\mathbb {S}},\lambda )\). The validation set \(\mathbb {V}\) is used to evaluate the generalization error. The accuracy of network \(\mathcal {L}^{\theta }\) on \(\mathbb {T}\) is measured as

$$\begin{aligned} \mathbb {T}_{acc} = \#\left\{ (X,Y) \in \mathbb {T}\ | \ {\hat{Y}} = \mathcal {L}^{\theta }(X), \ \ \arg \max \limits _{1\le j\le n} {\hat{Y}}_j= \arg \max \limits _{1\le j\le n} Y_j \right\} / \# \mathbb {T}, \end{aligned}$$
(6.2)

with a similar expression for \(\mathbb {V}_{acc}\).
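As an illustration, (6.1) and (6.2) can be implemented in a few lines. The NumPy sketch below assumes one-hot labels \(Y\) and predicted class probabilities \({\hat{Y}}\), uses the squared \(L_2\) norm of the weights as one possible choice of \({\mathcal {R}}(\theta )\), and adds a small constant inside the logarithm purely as a numerical safeguard (not present in (6.1)).

```python
import numpy as np

def cross_entropy_loss(Y, Y_hat, weights, lam):
    """Regularized cross-entropy (6.1); Y and Y_hat have shape (#S, n)."""
    ce = -np.mean(np.sum(Y * np.log(Y_hat + 1e-12), axis=1))
    reg = lam * sum(np.sum(W ** 2) for W in weights)  # L2 penalty as R(theta)
    return ce + reg

def accuracy(Y, Y_hat):
    """Accuracy (6.2): fraction of samples whose predicted and true argmax agree."""
    return np.mean(np.argmax(Y_hat, axis=1) == np.argmax(Y, axis=1))
```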

For regression problems, \({\hat{Y}}^i\) is a direct approximation of \(Y^i\), making the mean squared error with regularization term

$$\begin{aligned} {\mathcal {J}}(\theta ; {\mathbb {S}},\lambda ) = \frac{1}{\# {\mathbb {S}}}\sum \limits _{(X^i,Y^i) \in {\mathbb {S}}} \left\Vert Y^i-{\hat{Y}}^i\right\Vert ^2+ \lambda {\mathcal {R}}(\theta ), \end{aligned}$$
(6.3)

an appropriate loss function. As before, the data set \({\mathbb {S}}\) can be split into a training set \(\mathbb {T}\) and a validation set \(\mathbb {V}\), in order to minimize \( {\mathcal {J}}(\theta ; \mathbb {T},\lambda )\) and estimate the MSE of the trained network by \( {\mathcal {J}}(\theta ; \mathbb {V},\lambda )\).

The loss functions are minimized either with a mini-batch version of the stochastic gradient descent algorithm with an adaptive learning rate or with the popular ADAM optimizer [19]. We use batch sizes of 1024, unless otherwise specified.
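A schematic mini-batch training loop, using the Adam optimizer and the batch size of 1024 mentioned above, could look as follows (PyTorch); the learning rate and the interface of `loss_fn` are illustrative assumptions and do not reflect the exact settings used in our experiments.

```python
import torch

def train(model, X, Y, loss_fn, epochs, lr=1e-3, batch_size=1024):
    """Minimize loss_fn over mini-batches of (X, Y) with the Adam optimizer."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    n = X.shape[0]
    for _ in range(epochs):
        perm = torch.randperm(n)               # reshuffle the training set
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            loss = loss_fn(model(X[idx]), Y[idx])
            loss.backward()
            opt.step()
    return model
```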

6.1.1 Training DeLENO-p

We want to construct a suitable training data set \({\mathbb {S}}\) to train DeLENO-p for interpolation purposes. Thanks to the results of Sect. 4, we are guaranteed that for certain architectures it is theoretically possible to achieve an accuracy of 100%. For any order, this architecture is given by Theorem 4.1 and its proof. For the small orders \(p=3,4\) we use the alternative network architectures described at the end of Sect. 4, as they are of smaller size. The network takes an input from \({\mathbb {R}}^m\), \(m=2p-2\), and predicts the stencil shift r. We generate a data set \({\mathbb {S}}\) of size \(460{,}200-200m\) using Algorithm 1 with inputs given by,

  • A total of 400,000 samples \(X \in {\mathbb {R}}^m\), with each component \(X_j\) randomly drawn from the uniform distribution on the interval \([-1,1]\).

  • The set

    $$\begin{aligned} \{ (u_l, \ldots , u_{l+m})^\top \ | \ 0\le l \le N-m, \ \ 0\le q \le 39, \ \ N \in \{100,200,300,400,500\} \} \end{aligned}$$

    where \(u_l\) is defined as

    $$\begin{aligned} u_l:=\sin \left( (q+1) \pi \frac{l}{N} \right) , \quad 0\le l \le N. \end{aligned}$$

The input data needs to be appropriately scaled before being fed into the network, to ensure faster convergence during training. We use the following scaling for each input X,

$$\begin{aligned} \text {Scale}(X) = {\left\{ \begin{array}{ll} \frac{2X - (b + a)}{b-a}&{} \quad \text {if } X \ne 0 \\ (1,\ldots ,1)^\top \in {\mathbb {R}}^m &{} \quad \text {otherwise} \end{array}\right. }, \quad a = \min _j (X_j) , \, b = \max _j (X_j), \end{aligned}$$
(6.4)

which scales the input to lie in the box \([-1,1]^m\).
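A minimal implementation of the scaling (6.4) is given below (Python/NumPy). The guard for constant, non-zero stencils is our own addition to avoid a division by zero; (6.4) itself only singles out the case \(X=0\).

```python
import numpy as np

def scale(X):
    """Map a stencil X componentwise into the box [-1, 1]^m, following (6.4)."""
    a, b = X.min(), X.max()
    if a == b:
        # constant stencils (including X = 0); only X = 0 is treated in (6.4)
        return np.ones_like(X, dtype=float)
    return (2.0 * X - (b + a)) / (b - a)
```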

Remark 6.1

When the input data is scaled using formula (6.4), Newton's undivided differences are scaled by a factor of \(2(b-a)^{-1}\) as well. Therefore scaling does not alter the stencil shift obtained using Algorithm 1 or 2.

The loss function \({\mathcal {J}}\) is chosen as (6.1), with an \(L_2\) penalization of the network weights and \(\lambda = 7.8\cdot 10^{-6}\). The network is retrained 5 times, with the weights and biases initialized from a random normal distribution for each retraining. The last \(20 \%\) of \({\mathbb {S}}\) is set aside to be used as the validation set \(\mathbb {V}\). For each p, we denote by DeLENO-p the network with the highest accuracy \(\mathbb {V}_{acc}\) at the end of the training. The training of the DeLENO reconstruction networks was performed entirely analogously, the only difference being that we now set \(m=2p-1\).

6.1.2 Training DeLENO-SR

Next, we construct a suitable training data set \({\mathbb {S}}\) to train second-order DeLENO-SR for use as an interpolation algorithm. Recall that ENO-SR is designed to interpolate continuous functions f that are twice differentiable except at isolated points \(z\in {\mathbb {R}}\), where the first derivative has a jump of size \([f']\). Locally, such functions can be viewed as piecewise linear functions. Based on this observation, we create a data set using functions of the form

$$\begin{aligned} f(x)=a(x-z)_-+b(x-z)_+, \end{aligned}$$
(6.5)

where \(a,b,z\in {\mathbb {R}}\). For notational simplicity we assume that the x-values of the stencil that serves as input for the ENO-SR algorithm (Sect. 3.2) are \(0,1,\ldots ,9\). The interval of interest is then [4, 5] and the goal of ENO-SR is to find an approximation of f at \(x=4.5\). We generate 100,000 samples, choosing \(a\), \(b\) and \(z\) in the following manner (a sketch of this sampling procedure is given after the list),

  • The parameters a and b are drawn from the uniform distribution on the interval \([-1,1]\). Note that any interval that is symmetric around 0 could have been used, since the data will be scaled afterwards.

  • For 25,000 samples, z is drawn from the uniform distribution on the interval [4, 5]. This simulates the case where the discontinuity is inside the interval of interest.

  • For 75,000 samples, z is drawn from the uniform distribution on the interval \([-9,9]\), which also includes the case in which f is smooth on the stencil.
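The sketch below (Python/NumPy) illustrates this sampling procedure. The random seed, the ordering of the two groups of samples and the sign convention used for \((x-z)_\pm \) are illustrative assumptions; since a and b are drawn from a symmetric interval, the sign convention does not affect the distribution of the generated stencils.

```python
import numpy as np

def sample_eno_sr_stencils(n_samples=100_000, seed=0):
    """Generate input stencils f(0), ..., f(9) from the kink functions (6.5)."""
    rng = np.random.default_rng(seed)
    x = np.arange(10, dtype=float)              # stencil points 0, 1, ..., 9
    stencils = np.empty((n_samples, 10))
    for i in range(n_samples):
        a, b = rng.uniform(-1.0, 1.0, size=2)
        if i < n_samples // 4:                  # 25,000 kinks inside [4, 5]
            z = rng.uniform(4.0, 5.0)
        else:                                   # 75,000 kinks anywhere in [-9, 9]
            z = rng.uniform(-9.0, 9.0)
        stencils[i] = a * np.minimum(x - z, 0.0) + b * np.maximum(x - z, 0.0)
    return stencils
```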

The network architecture is described in Sect. 5. The network takes an input from \({\mathbb {R}}^{10}\) and predicts the stencil shift r. The training of DeLENO-SR was performed in a very similar fashion to the training of DeLENO-p (Sect. 6.1.1), only this time we retrained the DeLENO-SR network 5 times for 5000 epochs each. Furthermore, we used 8-fold cross-validation on a data set of 20,000 samples to select the optimal regularization parameter, resulting in the choice \(\lambda =1\cdot 10^{-8}\).

Remark 6.2

Note that the detection mechanism of the ENO-SR interpolation method (Sect. 3.2) labels an interval as bad when \(\alpha -\beta >0\) for certain numbers \(\alpha ,\beta \in {\mathbb {R}}\). In practice this criterion can lead to poor approximations due to numerical errors: when, for example, \(\alpha =\beta \), rounding errors can make the computed difference \(\alpha -\beta \) positive, leading to an incorrect label. This deteriorates the accuracy of the method and is very problematic for training. Therefore, in our code we use the alternative detection criterion \(\alpha -\beta >\epsilon \), with for example \(\epsilon =10^{-10}\).

6.2 Performance

In the previous sections, we have proven the existence of ReLU neural networks that approximate ENO(-SR) well, or can even exactly reproduce its output. However, it might be challenging to obtain these networks by training on a finite set of samples. Fortunately, Table 1 demonstrates that this is not the case for the DeLENO(-SR) stencil selection networks. For both interpolation and reconstruction, the classification accuracy (6.2) is nearly 100\(\%\), where we used the network architecture from our theoretical results. A comparison between the trained weights and biases in Appendix F and their theoretical counterparts in (4.3)–(4.7) reveals that there are multiple DNNs that can represent ENO. Moreover, this indicates that the weights of two DNNs (i.e. the theoretical and trained DNNs) can be very different even though their outputs are approximately the same (Table 1). This is in agreement with the result from [21] that the map from a family of weights to the function computed by the associated network is not inverse stable.

Table 1 Shape of DeLENO-\(p\) and DeLENO-SR networks with their accuracies for the interpolation and reconstruction problem

Next, we investigate the order of accuracy of the DeLENO-SR regression network, for the functions

$$\begin{aligned} f_1(x)&= -2\left( x-\frac{\pi }{6}\right) \mathbbm {1}_{[0,\frac{\pi }{6})}(x)+\left( x-\frac{\pi }{6}\right) ^2\mathbbm {1}_{[\frac{\pi }{6},1]}(x), \end{aligned}$$
(6.6)
$$\begin{aligned} f_2(x)&= \sin (x). \end{aligned}$$
(6.7)

Note that the first derivative of \(f_1\) has a jump at \(\frac{\pi }{6}\). In Fig. 8, the order of accuracy of second-order ENO-SR-2 and DeLENO-SR-2 is compared with those of ENO-3 and DeLENO-3 for both the piecewise smooth function \(f_1\) and the smooth function \(f_2\).

Fig. 8

Plots of the approximation error of the DeLENO-SR-2 regression network for the piecewise smooth function \(f_1\) and the smooth sine function \(f_2\). The approximation errors of ENO-SR-2 and (DeL)ENO-3 are shown for comparison

In both cases, ENO-3 and DeLENO-3 completely agree, which is not surprising given the high classification accuracies listed in Table 1. (DeL)ENO-3 is third-order accurate for the smooth function and only first-order accurate for the rougher function, in agreement with expectations. For both \(f_1\) and \(f_2\), DeLENO-SR-2 is second-order accurate on coarse grids, but a deterioration to first-order accuracy is seen on very fine grids. This deterioration is an unavoidable consequence of the error that the trained network makes and of the linear rescaling of the input stencils (6.4). A more detailed discussion of this issue can be found in Section 6.2 of [10]. Furthermore, although the DeLENO-SR regression network is initially second-order accurate for the smooth function, its approximation error does not agree with that of ENO-SR. This is in line with Theorem 5.4, where we proved that there exists a network that is a second-order accurate approximation of ENO-SR-2 up to an error term that can be made arbitrarily small, yet is fixed. A second contributing factor might be that DeLENO-SR-2 was trained on piecewise linear functions, which can be thought of as second-order accurate approximations of a smooth function, thus leading to a higher error.
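The orders of accuracy reported in Fig. 8 can be read off from the errors on successive dyadic refinements. A minimal sketch, assuming that the mesh width halves from one level to the next as in the nested grids (3.1):

```python
import numpy as np

def observed_orders(errors):
    """Observed order of accuracy between consecutive levels of dyadic refinement."""
    e = np.asarray(errors, dtype=float)
    return np.log2(e[:-1] / e[1:])

# Errors decreasing by a factor of four per level indicate second-order accuracy:
print(observed_orders([1.0e-2, 2.5e-3, 6.25e-4]))   # approximately [2., 2.]
```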

6.3 Applications

Next, we apply the DeLENO algorithms in the following examples.

6.3.1 Function approximation

We first demonstrate the approximating ability of the DeLENO interpolation method using the function

$$\begin{aligned} q(x) = {\left\{ \begin{array}{ll} -x &{} \quad \text {if } x< 0.5,\\ 3\sin (10 \pi x) &{} \quad \text {if } 0.5< x< 1.5,\\ - 20(x-2)^2 &{} \quad \text {if } 1.5< x< 2.5,\\ 3 &{} \quad \text {if } 2.5<x,\\ \end{array}\right. } \end{aligned}$$
(6.8)

which consists of jump discontinuities and smooth high-frequency oscillations. We discretize the domain [0, 3] and generate a sequence of nested grids of the form (3.1) by setting \(N_0 = 16\) and \(K=4\). We use the data on the grid \({\mathcal {T}}^k\), and interpolate it onto the grid \({\mathcal {T}}^{k+1}\) for \(0\le k < K\). As shown in Fig. 9, the interpolation with ENO-4 and DeLENO-4 is identical on all grids, for this particular function.
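For reference, the test function (6.8) and the nested grids used here can be set up as follows (Python/NumPy); the nodal indexing of the grids is an assumption, since the precise form of (3.1) is not repeated here, and the behaviour of q at the breakpoints is left to the strict inequalities of (6.8).

```python
import numpy as np

def q(x):
    """Test function (6.8) on [0, 3]: jumps plus high-frequency oscillations."""
    return np.where(x < 0.5, -x,
           np.where(x < 1.5, 3.0 * np.sin(10.0 * np.pi * x),
           np.where(x < 2.5, -20.0 * (x - 2.0) ** 2, 3.0)))

# Nested dyadic grids on [0, 3] with N_0 = 16 and K = 4, as in the text.
grids = [np.linspace(0.0, 3.0, 2 ** k * 16 + 1) for k in range(5)]
samples = [q(x) for x in grids]   # data to be interpolated from level k to k + 1
```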

Fig. 9

Interpolating the function (6.8) using ENO and DeLENO

6.3.2 Data compression

We now apply the multi-resolution representation framework of Appendix C to use DeLENO to compress the function (6.8). We construct a nested sequence of meshes on [0, 3] by choosing \(N_0 = 9\) and \(K=5\) in (C.1). We use Algorithm 3 to obtain the multi-resolution representation of the form (C.5) and decode the solution using Algorithm 4 to obtain the approximation \({\widehat{q}}^K\). The compression thresholds needed for the encoding procedure are set using (C.6).

Figure 10 provides a comparison of the results obtained using different values for the threshold parameters \(\epsilon \), and shows the non-zero coefficients \({\widehat{d}}^k\) for each mesh level k. A higher value of \(\epsilon \) can truncate a larger number of \({\widehat{d}}^k\) components, as is evident for \(p=3\). However, there is no qualitative difference between \({\widehat{q}}^K\) obtained for the two \(\epsilon \) values considered. Thus, it is beneficial to use the larger \(\epsilon \), as it leads to a sparser multi-resolution representation without deteriorating the overall features. The solutions obtained with ENO and DeLENO are indistinguishable. We refer to Table 2 for the errors of the two methods.

Fig. 10

Data compression of (6.8) using ENO and DeLENO with \(N_0=9\), \(L=5\) and \(t=0.5\). Comparison of thresholded decompressed data with the actual data on the finest level (left); non-zero coefficients \({\widehat{d}}^k\) at each level (right)

Table 2 1D compression errors for (6.8)

The compression ideas used for one-dimensional problems can be easily extended to handle functions defined on two-dimensional tensorized grids. We consider a sequence of grids \({\mathcal {T}}^k\) with \((N^x_k+1) \times (N^y_k + 1)\) nodes, where \(N^x_k = 2^k N^x_0\) and \(N^y_k = 2^k N^y_0\), for \(0\le k \le K\). Let \(q^k\) be the data on grid \({\mathcal {T}}^k\) and denote by \({\widehat{q}}^{k+1}\) the compressed interpolation on grid \({\mathcal {T}}^{k+1}\). To obtain \({\widehat{q}}^{k+1}\), we first interpolate along the x-coordinate direction to obtain an intermediate approximation \({\widetilde{q}}^{k+1}\) of size \((N^x_{k+1}+1) \times (N^y_k + 1)\). Then we use \({\widetilde{q}}^{k+1}\) to interpolate along the y-coordinate direction to obtain the final approximation \({\widehat{q}}^{k+1}\).
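The dimension-by-dimension procedure can be sketched as follows (Python/NumPy). Here `interp_line` stands for any one-dimensional (DeL)ENO-type routine that maps the values on a grid line of \({\mathcal {T}}^k\) to predicted values on the corresponding line of \({\mathcal {T}}^{k+1}\); its name and signature are illustrative assumptions, and the data are stored with the y-index along rows and the x-index along columns.

```python
import numpy as np

def interpolate_2d(q_k, interp_line):
    """Refine 2D data along the x-direction first, then along the y-direction."""
    # act on each row (fixed y) to refine in the x-direction
    q_tilde = np.stack([interp_line(row) for row in q_k], axis=0)
    # act on each column (fixed x) of the intermediate data to refine in y
    q_hat = np.stack([interp_line(col) for col in q_tilde.T], axis=1)
    return q_hat
```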

Next, we use ENO and DeLENO to compress an image with \(705 \times 929\) pixels, shown in Fig. 11a. We set \(K=5\), \(\epsilon = 1\), \(t=0.2\) in Eq. (C.6). Once again, ENO and DeLENO give similar results, as can be seen from the decompressed images in Fig. 11 and the relative errors in Table 3. In this table we additionally list the compression rate

$$\begin{aligned} c_r = 1 - \frac{\#\left\{ d^k_{i,j} | d^k_{i,j} > \epsilon ^k, \ 1\le k \le K \right\} }{(N^x_L + 1)(N^y_L + 1) - (N^x_0 + 1)(N^y_0 + 1)}, \end{aligned}$$
(6.9)

which represents the fraction of detail coefficients that are set to zero.
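A direct transcription of (6.9) reads as follows (Python/NumPy); the function and argument names are illustrative, and the one-sided threshold \(d^k_{i,j} > \epsilon ^k\) is kept exactly as in (6.9).

```python
import numpy as np

def compression_rate(d, eps, n_fine, n_coarse):
    """Compression rate (6.9): d is a list of detail arrays d^k, eps the thresholds eps^k,
    n_fine = (N^x_L + 1)(N^y_L + 1) and n_coarse = (N^x_0 + 1)(N^y_0 + 1)."""
    kept = sum(int(np.count_nonzero(dk > ek)) for dk, ek in zip(d, eps))
    return 1.0 - kept / (n_fine - n_coarse)
```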

Fig. 11

Image compression

Table 3 Image compression errors

As an additional example of two-dimensional data compression, we consider the function

$$\begin{aligned} q(x,y) = {\left\{ \begin{array}{ll} -10 &{} \quad \text {if } (x - 0.5)^2 + (y-0.5)^2 < 0.0225\\ 30 &{} \quad \text {if } |x-0.5|>0.8 \text { or } |y-0.5|>0.8\\ 40 &{} \quad \text {otherwise }\\ \end{array}\right. }, \end{aligned}$$
(6.10)

where \((x,y) \in [0,1] \times [0,1]\), and generate a sequence of meshes by setting \(K=4\), \(N_0^x = 16\) and \(N_0^y = 16\). The threshold for data compression is chosen according to (C.6), with \(\epsilon =10\) and \(t=0.5\). The non-zero \({\widehat{d}}^k\) coefficients are plotted in Fig. 12, while the errors and compression rate (6.9) are listed in Table 4. Overall, ENO and DeLENO perform equally well, with DeLENO giving marginally smaller errors.

Table 4 2D compression errors for (6.10)
Fig. 12

Non-zero coefficients \({\widehat{d}}^k\) for data compression of (6.10) using ENO and DeLENO for mesh level \(1\le k \le 4\)

6.3.3 Conservation laws

We compare the performance of ENO and DeLENO reconstruction, when used to approximate solutions of conservation laws. We work in the framework of high-order finite difference schemes with flux-splitting and we use a fourth-order Runge-Kutta scheme for the time integration.
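For completeness, one step of the time integrator can be sketched as follows, assuming the classical fourth-order Runge-Kutta method; here \(L\) denotes the semi-discrete spatial operator obtained from the flux-split finite difference scheme with (DeL)ENO reconstruction, whose implementation is not reproduced.

```python
def rk4_step(u, dt, L):
    """One classical RK4 step for the semi-discrete system u_t = L(u)."""
    k1 = L(u)
    k2 = L(u + 0.5 * dt * k1)
    k3 = L(u + 0.5 * dt * k2)
    k4 = L(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```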

As an example, we consider the system of conservation laws governing compressible flows given by

$$\begin{aligned} \partial _t \begin{pmatrix} \rho \\ \rho v \\ E \end{pmatrix} + \partial _x \begin{pmatrix} \rho v \\ \rho v^2 + p \\ (E +p) v \end{pmatrix} = 0, \qquad E = \frac{1}{2} \rho v^2 + \frac{p}{\gamma -1}, \end{aligned}$$

where \(\rho , v\) and p denote the fluid density, velocity and pressure, respectively. The quantity E represents the total energy per unit volume, and \(\gamma =c_p/c_v\) is the ratio of specific heats, chosen as \(\gamma =1.4\) for our simulations. We consider the shock-entropy problem [23], which describes the interaction of a right-moving shock with smooth oscillatory waves. The initial conditions for this test case are prescribed as

$$\begin{aligned} (\rho , \ v, \ p ) = {\left\{ \begin{array}{ll} (3.857143,\ 2.629369,\ 10.33333) &{} \quad \text {if } x < -4 \\ (1+0.2 \sin (5x),\ 0,\ 1) &{} \quad \text {if } x > -4 \end{array}\right. }, \end{aligned}$$

on the domain \([-5,5]\). Due to the generation of high-frequency physical waves, we solve the problem on a fine mesh with \(N=200\) cells up to \(T_f = 1.8\) with \(\mathrm {CFL} = 0.5\). A reference solution is obtained with ENO-4 on a mesh with \(N=2000\) cells. As can be seen in Fig. 13, ENO-\(p\) and DeLENO-\(p\) perform equally well for each order \(p\).
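A sketch of the initial data in primitive variables is given below (Python/NumPy); the nodal grid with \(N=200\) cells is an illustrative assumption, as the exact cell arrangement of the finite difference scheme is not reproduced, and the state at \(x=-4\) (left undefined by the strict inequalities above) is assigned to the right state.

```python
import numpy as np

def shock_entropy_ic(x):
    """Initial (rho, v, p) for the shock-entropy interaction problem."""
    left = x < -4.0
    rho = np.where(left, 3.857143, 1.0 + 0.2 * np.sin(5.0 * x))
    v = np.where(left, 2.629369, 0.0)
    p = np.where(left, 10.33333, 1.0)
    return rho, v, p

x = np.linspace(-5.0, 5.0, 201)      # nodes of a grid with N = 200 cells
rho0, v0, p0 = shock_entropy_ic(x)
```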

Fig. 13

Solution for Euler shock-entropy problem with ENO-\(p\) and DeLENO-\(p\) on a mesh with \(N=200\) cells

Next we solve the Sod shock tube problem [25], whose initial conditions are given by

$$\begin{aligned} (\rho , \ v, \ p ) = {\left\{ \begin{array}{ll} (1,\ 0,\ 1) &{} \quad \text {if } x < 0 \\ (0.125,\ 0,\ 0.1) &{} \quad \text {if } x > 0 \end{array}\right. }, \end{aligned}$$

on the domain \([-5,5]\). The solution consists of a shock wave, a contact discontinuity and a rarefaction wave. The domain is discretized with \(N=50\) cells and the problem is solved up to \(T_f = 2\) with \(\mathrm {CFL} = 0.5\). The solutions obtained with ENO-\(p\) and DeLENO-\(p\) are identical, as depicted in Fig. 14.

Fig. 14

Solution for Euler Sod shock tube problem with ENO-\(p\) and DeLENO-\(p\) on a mesh with \(N=50\) cells

7 Discussion

In this paper, we considered the efficient interpolation of rough or piecewise smooth functions. A priori, both deep neural networks (on account of universality) and the well-known data-dependent ENO (and ENO-SR) interpolation procedure are able to interpolate rough functions accurately. We proved here that the ENO interpolation (and the ENO reconstruction) procedure, as well as a variant of the second-order ENO-SR procedure, can be cast as deep ReLU neural networks, at least for univariate functions. This equivalence provides a different perspective on the ability of neural networks to approximate functions and reveals their enormous expressive power, as even a highly non-linear, data-dependent procedure such as ENO can be written as a ReLU neural network.

On the other hand, the impressive function approximation results, for instance of [26, 27], might have limited utility for functions in low dimensions, as the neural network needs to be trained for every function that has to be interpolated. By interpreting ENO (and ENO-SR) as a neural network, we provide a natural framework for recasting the problem of interpolation in terms of pre-trained neural networks such as DeLENO, where the input vector of sample values is transformed by the network into the output vector of interpolated values. Thus, these networks are trained once and do not need to be retrained for every new underlying function. This interpretation of ENO as a neural network also opens the possibility of extending ENO-type interpolation to several space dimensions on unstructured grids.