1 Introduction

Gravitational-wave (GW) science has reached a level of maturity in which many tools from areas such as modern approximation theory, data science, machine learning, and artificial intelligence are being incorporated into the field. These tools attempt to address challenges such as complex modeling, analysis, and the handling of big data. A common feature of these challenges is the computational cost involved, which in many cases can be prohibitive, to the point that it cannot be overcome solely with larger or faster (super)computers, specialized hardware such as GPUs, or software optimization. This is particularly the case for parametrized problems, where each query depends on multiple input parameters that might only be known at run time. The situation worsens as the number of parameters grows, usually resulting in the curse of dimensionality: the complexity of the problem (here leaving the term complexity intentionally ambiguous) grows fast, sometimes exponentially, with the number of parameters. In the case of gravitational waves from binary systems, parameters can be intrinsic or extrinsic. The former relate to quantities such as the mass and spin of the binary components, the initial separation and eccentricity of the system, and the equations of state if matter is present. Extrinsic parameters include the distance of the source from Earth, sky position, orientation, and time and phase of arrival.

One of the first challenges of a parametrized problem is sampling it. With standard methods (for example, equally spaced sampling, the metric approach in GWs, or stochastic sampling) the accuracy of such catalogs for GW detection increases in many cases at best linearly with the number of samples, and their sizes typically increase exponentially with the number of parameter dimensions \(\mathrm{dim}\) as \({{{\mathcal {O}}}}(N^{\mathrm{dim}})\). In fact, for the metric approach, the number of templates grows as \((1-\mathrm {MM})^{-\mathrm{dim}/2}\), with \(\mathrm {MM}\) the minimal match, defined as the minimal overlap between elements of a bank of templates. Since we do not actually discuss the metric approach in this review we do not delve into its precise definition; we refer instead to Sathyaprakash and Dhurandhar (1991).

For the purposes of this review a catalog or a bank is a collection of templates, or waveforms. Producing such catalogs, as well as storing and analyzing them, can become challenging and, again, even intractable through raw computational power and storage.

One approach to this problem is decreased fidelity: the modeling of the problem is simplified by approximations to Einstein's general relativity equations, which are cheaper to solve and analyze. Decreased fidelity is a delicate approach, though, since the accuracy of the approximations might not be known for an arbitrary query without access to the high-fidelity models; if those are not available, it becomes difficult to assign error bars when a given precision needs to be guaranteed for statistical purposes. Decreased-fidelity models can also lead to missed signals or biases in the parameter estimation of detected GWs.

Thus, another challenge is to simulate the GWs emitted by, for example, the coalescence of compact binary objects in real time, without any physical approximation. Here the gold standard is high-accuracy numerical relativity (NR) solutions to the full Einstein field equations. It might appear unattainable to replace, for an arbitrary query, an online (that is, not evaluated in advance) supercomputer simulation which might take hundreds to tens of thousands of hours of computing time per query with a substitute or surrogate model of equal accuracy that can be evaluated in real time on a commodity laptop. As we show throughout this review, this can be, and is being, done, though there are still outstanding challenges left and the problem is not completely solved.

Finally, another challenge of equal importance is to perform parameter estimation on the sources of any detected GW signal in quasi real time; that is, fast enough to process the large number of detections made by modern GW laser interferometers and, most importantly in the case of sources with electromagnetic counterparts, fast enough to allow for rapid telescope follow-ups. This requires both fast evaluations of the GW waveforms and rapid likelihood evaluations, both on demand.

In this review we discuss some state-of-the-art approaches which attempt to (and in some, but not all, cases manage to) obtain accurate, fast-to-evaluate and fast-to-analyze surrogate models of gravitational waves emitted by binary systems, addressing some of the aforementioned challenges. The common aspect underlying all these methods is reduced order modeling (ROM). The approaches reviewed here are principal component analysis (PCA), proper orthogonal and singular value decompositions (POD, SVD), reduced basis (RB), the empirical interpolation method (EIM), and derived techniques such as reduced order quadratures (ROQ). The goal is intentionally not to be a definitive survey, since this is a very active field. Instead, we attempt to provide an introduction to some of the approaches that are being used in practice and do deliver on one or several of the above challenges, along with some basic theory underlying each of them. The review is divided into three parts, each of which builds upon the previous one:

  1.

    Representation: Sections 2 to 8

    One example is the generation of compact catalogs or banks of gravitational waves for searches. Another one is to analyze a system and look for redundancies.

  2.

    Predictive models: Sections 9 to 12

    The most ambitious goal here is to build surrogate models that can be evaluated in real time and are indistinguishable from numerical relativity supercomputer simulations of the Einstein equations, without any physical approximation. The target is the evaluation of a waveform in the order of milliseconds per angular multipole mode, which implies a speed-up of at least \(\sim 10^8\) with respect to NR without loss of accuracy.

  3.

    Data analysis: Sections 13 and 14

    One of the main goals of ROM and other efforts in GW science is to achieve very fast parameter estimation, in particular so that real-time alerts can be sent for searches of electromagnetic counterparts. From an astrophysical point of view, the target is to go from months using standard methods to around 10 min using focused reduced order quadratures (discussed in Sect. 14), involving millions to tens of millions of waveform and likelihood evaluations.

We encourage the reader to provide us with feedback, including topics to cover in future versions of this Living Review article. For brevity we have had to skip many references; we apologize in advance for any and all omissions in such an active field.

2 Reduced order modeling and gravitational waves

This article reviews approaches which attempt to solve many of the aforementioned problems in GW science through reduced order modeling (ROM), also known as dimensional or complexity reduction, and the related field of surrogate modeling. ROM as a field has been around for a long time, but over the last decade and a half there have been major theoretical advances followed by a rapid rise in the number of applications and in the pace at which both take place, in many areas of science, engineering and technology.

This is in part, again, due to powerful new approaches and results, from approximation theory to numerical analysis and scientific computing, but also due to the recognition of the power of dimensional reduction in many important problems, some of them long-standing. That is, problems which are only now being seen in the light of the latest developments in ROM. On top of that, fields such as data science (DS), machine learning (ML) and artificial intelligence (AI) are benefiting considerably from ROM, either to eliminate redundancies in big data, making them more amenable to analysis, and/or to identify relevant features for further studies using techniques from these disciplines. In fact, depending on the definitions of these fields, some include ROM as a subdiscipline.

There is a big difference, though. There are many books on established DS, ML, and AI practices and theory, while the literature on modern approaches to ROM for parametrized problems, with some important exceptions that we mention in this review, is largely composed of technical papers, which could be difficult to grasp for practitioners as introductory material. Also, these notable exceptions usually focus on time-independent partial differential equations (PDEs), and even as introductory material they might be hard to absorb by non-mathematicians. In contrast, this review covers approaches which are purely data-driven and attempts to be amenable as an introduction for GW science practitioners, keeping theory to the minimum necessary to build intuition on why a given approach works as it does, what are the challenges left, and possible approaches to them.

One of the reasons why ROM for parametrized systems has been largely devoted to time-independent PDEs is that in that case many rigorous results, such as a priori error bounds on the reduced model, can be proved; these are called certified approaches. Here we move beyond that particular arena to ROM for general parametrized systems. This includes problems which might not involve any differential equation, might be time-dependent, or might involve the analysis and handling of large amounts of data. This broadening comes at the expense, in many cases, of more heuristics and less rigor, such as a posteriori validation as opposed to rigorous a priori error bounds. The rationale for this is simply that many problems of interest in GW science are far too complex for existing detailed rigorous theorems. On the other hand, GW problems share many similarities with others for which approaches with proven properties have been developed, and therefore certain algorithms can be adapted to cases of interest in GWs. In other instances it is quite the opposite: available rigorous results are quite generic and can be abstracted from whatever application they were originally introduced for.

The situation with half-rigorous, half-heuristic approaches is not as pessimistic as it might seem. Many widely used techniques in DS, ML and AI are somewhat heuristic in nature and require a posteriori validation (what we here call validation is usually referred to as testing in ML, where validation has a different meaning). The same holds for half-rigorous, half-heuristic cutting-edge ROM approaches such as those discussed here: they can (and should) always be a posteriori validated.

Our presentation keeps technicalities and analyses to the minimum necessary for building intuition of why a technique works as it does, under which conditions it does so, and what can be expected given the properties of the problem of interest. At the same time, it is not a survey of all the work done in the field.

There are two important books about reduced basis; though focused on partial differential equations, they contain a lot of valuable information (well beyond the scope of this short introduction): Hesthaven et al. (2015) and Quarteroni et al. (2016). A website for ROM for parametrized systems with several resources is MoRePaS (2010–2021), while ROM in GR workshop (2013) has the speakers' slides from a workshop on ROM in general relativity; even though the latter is several years old and there has been much progress since then, the topics covered are still highly relevant. A very recent and lengthier, year-long program at ICERM (Brown University) focused on ROM in GR can be found on the website ICERM (2020), including many of the slides and talks of each workshop.

We assume the reader has (at least) some basic familiarity with gravitational waves. For introductory textbooks we recommend Maggiore (2007, 2018).

3 Mathematical preliminaries

This section serves as a brief recap of some basic mathematics and introduces some notation used from here on. Since in this review we deal with model reduction for parametrized problems, we next consider an example amenable to it.

Example 1

Consider the model

$$\begin{aligned} h(t; \lambda ) = \exp (i \lambda t) , \end{aligned}$$
(1)

where \(\lambda \) is a complex parameter and \({\text {Im}}({\lambda }) > 0\). This represents oscillatory, exponentially damped functions; it can be seen as a toy model for the ringdown of a black hole (discussed in Sect. 12.7).

In the notation that we use throughout this review, h is a function of the physical variable t parametrized by \(\lambda \) and for compactness we often use the following type of shortcuts:

$$\begin{aligned} h = h_{\lambda } = h(\lambda ) = h(t; \lambda ) . \end{aligned}$$

Suppose now that we have a set of N samples of \(\lambda \), leading to a training set

$$\begin{aligned} {{{\mathcal {K}}}} = \{ h_i \} = \{ h_{\lambda _i} \} = \{ h(\lambda _i) \} = \{ h(t; \lambda _i) \}\, , \quad i=1\ldots N. \end{aligned}$$

Since the functions of interest (1) are known in closed-form, building such a training set is straightforward, though in general this is not the case (as when solving the Einstein equations is required).

One of the goals of ROM is to find a reduced basis; that is, a subset of \({{{\mathcal {K}}}}\) with a number of elements \(n \le N\) (and, hopefully, \(n \ll N\)) such that its span (the span of a set of vectors is the set of their linear combinations) represents \({{{\mathcal {K}}}}\) with arbitrarily high accuracy. The compression factor is then

$$\begin{aligned} C_r := N/n . \end{aligned}$$

Furthermore, in practice one also needs to discretize time (or frequency). So let's sample these training set elements at an arbitrary set of times \(\{ t_i \}_{i=1}^L \), where L stands for Length. Using, for example, the empirical interpolation method (EIM, discussed in Sect. 10), one can subsample the \(\{ t_i \}_{i=1}^L \) down to a subset of size \(l \le L\). In fact, in the EIM, if the number of reduced basis elements is n, then by construction \(l=n\).

That is, the initial set, of size \(N\times L\), can now be reduced to one of size \(n^2\), with a double compression factor (both in parameter and time domains)

$$\begin{aligned} C_r := \frac{N \times L}{n^2} . \end{aligned}$$
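To make the sizes involved concrete, the following Python sketch builds a training set for the toy model (1) and prints the corresponding compression factors. The grid sizes, parameter ranges, and reduced dimensions n and l are arbitrary illustrative choices, not values from any particular study.

```python
import numpy as np

# Toy model (1): h(t; lambda) = exp(i * lambda * t), with Im(lambda) > 0.
N, L = 1000, 4096                      # number of parameter samples and of time samples (arbitrary)
t = np.linspace(0.0, 10.0, L)          # physical (time) grid
lam = np.linspace(1.0, 10.0, N) + 1j * np.linspace(0.1, 2.0, N)   # complex parameters, Im > 0

# Training set K: one column per parameter sample (cf. the snapshot matrix of Sect. 3.2)
H = np.exp(1j * np.outer(t, lam))      # shape (L, N)

n = 20                                  # hypothetical reduced-basis size
Cr_parameter = N / n                    # compression in parameter space only
Cr_double = (N * L) / n**2              # double compression (parameter and time), EIM with l = n
print(H.shape, Cr_parameter, Cr_double)
```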

3.1 Inner products and norms for functions or vectors

Let \(\varOmega \) denote the physical domain. Throughout this review it represents a time or frequency interval, \([t_{\mathrm{min}},t_{\mathrm{max}}]\) or \([f_{\mathrm{min}},f_{\mathrm{max}}]\) respectively, though in general it could be space, space-time, or some more abstract arena. For the sake of discussion we consider time intervals. For any two complex-valued functions, \(h_1(t)\) and \(h_2(t)\), we consider inner/scalar/dot products and their corresponding norms of the form

$$\begin{aligned} \langle h_1, h_2 \rangle := \int _{\varOmega } {\bar{h}}_1(t) h_2(t) \omega (t)\, dt , \quad \Vert h \Vert ^2 := \langle h, h \rangle \, , \end{aligned}$$
(2)

where the bar over \(h_1\) denotes complex conjugation and \( \omega (t)\) is a generic non-negative weight function. These are referred to as weighted \(L_2\) scalar products. In data analysis/signal processing, and usually in the frequency domain, \(S:= \omega ^{-1}\) characterizes the sensitivity of the detector and is referred to as the power spectral density (PSD).

We consider also discrete inner products and norms of the form

$$\begin{aligned} \langle h_1, h_2\rangle := \sum _{i=1}^L {\bar{h}}_1(t_i) h_2(t_i) \omega _i \, , \quad \Vert h \Vert ^2 := \langle h , h \rangle . \end{aligned}$$
(3)

This discrete version of an \(L_2\) scalar product is usually denoted as \(\ell _2\), with a lowercase to distinguish it from the continuum case.

Whenever (3) is the discrete approximation of an integral (2) the values \(t_i \in \varOmega \) and \(\omega _i\) are respectively referred to as quadrature nodes and weights. Together, \(\{t_i,\omega _i\}_{i=1}^L\) is called a quadrature rule. When \(\omega _i =1\), (3) is referred to as the Euclidean scalar product. Throughout this review we also use the infinity, or max, norm

$$\begin{aligned} \Vert h \Vert _{\infty } := \max _{1 \le i \le L}|h (t_i)| \end{aligned}$$

of a vector \(h \in {\mathbb {C}}^{L}\). This norm is not induced by an inner product; that is, it can be shown that there is no scalar product \(\langle \cdot , \cdot \rangle \) for which \(\Vert h \Vert _{\infty }^2 =\langle h, h \rangle \) for all h.

Formally speaking, (2) is defined for functions while (3) is for vectors. We use the same notation \(\langle \cdot , \cdot \rangle \) for both cases, with the hope that the distinction is clear from the context. Some proofs are more convenient in a continuous setting, while numerical computations are restricted to discrete values. Depending on the context, we switch between these two settings using the same notation for both of them.
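The discrete inner product (3), its associated norm, and the max norm are straightforward to implement. The sketch below uses trapezoidal-rule nodes and weights as one possible, purely illustrative, quadrature rule \(\{t_i,\omega _i\}\):

```python
import numpy as np

def discrete_inner(h1, h2, w):
    """Discrete weighted inner product (3): sum_i conj(h1_i) h2_i w_i."""
    return np.sum(np.conj(h1) * h2 * w)

L = 1001
t = np.linspace(0.0, 1.0, L)
# Trapezoidal-rule weights as one possible quadrature rule {t_i, w_i}
w = np.full(L, t[1] - t[0]); w[0] *= 0.5; w[-1] *= 0.5

h1 = np.exp(2j * np.pi * t)
h2 = np.exp(4j * np.pi * t)

print(discrete_inner(h1, h2, w))                 # ~ 0: distinct Fourier modes are orthogonal
print(np.sqrt(discrete_inner(h1, h1, w).real))   # discrete L2-type norm, ~ 1
print(np.max(np.abs(h1)))                        # max (infinity) norm
```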

Example 2

Polynomials.

Consider the space of degree n polynomials defined on \(\varOmega = [-1,1]\). Any element

$$\begin{aligned} h(t) = \sum _{i=0}^n c_i t^i \, \end{aligned}$$

can be written as a sum of \((n+1)\) terms, and such a space is a linear one of dimension \((n+1)\). The first \((n+1)\) normalized Legendre polynomials \(\{P_i(t)\}_{i=0}^{n}\) form an orthonormal basis of this space with respect to the scalar product (2) with \(\omega (t)\equiv 1\).

3.2 Matrices

Consider a complex \(L \times N\) matrix \(\mathbf{H } \in {\mathbb {C}}^{L \times N}\), where L is the number of rows (“length of the time series”) of \(\mathbf{H }\) and N is the number of columns (“number of parameter samples”),

$$\begin{aligned} {\mathbf {H}}:= \begin{pmatrix} h_{11} &\quad h_{12} &\quad \cdots &\quad h_{1N}\\ h_{21} &\quad h_{22} &\quad \cdots &\quad h_{2N} \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ h_{L1} &\quad h_{L2} &\quad \cdots &\quad h_{LN} \end{pmatrix} . \end{aligned}$$
(4)

As one might imagine, for gravitational waveforms \(h_{ij} = h_j(t_i)\). Each column is referred to as a snapshot. A matrix \({\mathbf {H}}\) of shape \(L\times N\) appears often throughout this review, where N is the number of samples in the training set (or n, the dimensionality of the reduced basis) and L is the length of each time series, so this is a very concrete example.

A square matrix \({\mathbf {H}}\) is said to be non-singular if its inverse—denoted by \({\mathbf {H}}^{-1}\)—exists, i.e. \({\mathbf {H}}{\mathbf {H}}^{-1}={\mathbf {H}}^{-1}{\mathbf {H}}= {\mathbf {I}}\), where

$$\begin{aligned} {\mathbf {I}}= \begin{pmatrix} 1 &\quad 0 &\quad \cdots &\quad 0 \\ 0 &\quad 1 &\quad \cdots &\quad 0 \\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots \\ 0 &\quad 0 &\quad \cdots &\quad 1 \end{pmatrix} =: diag (1,1,\ldots ,1) . \end{aligned}$$

The (Hermitian) transpose of \({\mathbf {H}}\), denoted by \({\mathbf {H}}^{\dagger }\) is defined as the matrix with elements

$$\begin{aligned} \left( {\mathbf {H}}\right) ^{\dagger }_{ij} = {\bar{{\mathbf {H}}}}_{ji} \, , \end{aligned}$$

the bar again indicates complex conjugation, and we say that \({\mathbf {H}}\) is Hermitian (self-adjoint), or symmetric in the real case, if \({\mathbf {H}}={\mathbf {H}}^{\dagger }\).

Eigenvectors and eigenvalues:

A vector \(x\ne 0\) is an eigenvector of \({\mathbf {H}}\) with eigenvalue \(\alpha \) if

$$\begin{aligned} {\mathbf {H}}x = \alpha x . \end{aligned}$$

Eigenvectors are defined up to a normalization constant; that is, if x is an eigenvector with eigenvalue \(\alpha \) so is ax, for any non-zero a.

3.2.1 Matrix norms

The Frobenius norm. Imagine “unpacking" \({\mathbf {H}}\) into a long vector of size \(L \times N\). Measuring this vector with the Euclidean norm defines the Frobenius norm

$$\begin{aligned} \Vert { {\mathbf {H}}} \Vert ^{2}_{F} := \sum _{i=1}^L \sum _{j=1}^N \, \left| h_{ij} \right| ^{2} \end{aligned}$$

of a matrix \({\mathbf {H}}\).

Later we will see that the Frobenius norm plays a key role when discussing approximation by proper orthogonal decomposition.

Induced norm.

The matrix can be viewed as a linear operator mapping vectors from \({\mathbb {C}}^N\) to \({\mathbb {C}}^L\). Given norms for both \({\mathbb {C}}^L\) and \({\mathbb {C}}^N\), which need not be the same, the induced norm of a matrix is defined as

$$\begin{aligned} \Vert { {\mathbf {H}}} \Vert = \max _{ \Vert x \Vert = 1} \Vert { {\mathbf {H}}} x \Vert . \end{aligned}$$

This norm characterizes the maximal possible “amplification” from the application of \({\mathbf {H}}\) to x. More precisely: the value of \(\Vert { {\mathbf {H}}} \Vert \) is the smallest number c such that, for all x,

$$\begin{aligned} \Vert { {\mathbf {H}}} x \Vert \le c \Vert x \Vert . \end{aligned}$$

Notice that, following standard practice, we use the same notation—namely, \(|| \cdot ||\)—for both the norm of a vector (\(\Vert x \Vert \) and \(\Vert {\mathbf {H}}x \Vert \) in the above definition) and a matrix (\(\Vert {\mathbf {H}}\Vert \)). It should be clear from the context, though, which one we are referring to.
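As a quick numerical illustration (the matrix below is an arbitrary example), the Frobenius norm and the induced norm for Euclidean vector norms, which equals the largest singular value, can be computed and compared as follows:

```python
import numpy as np

H = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])               # arbitrary 3x2 example

fro = np.linalg.norm(H, 'fro')           # Frobenius norm: "unpack" H and use the Euclidean norm
two = np.linalg.norm(H, 2)               # induced 2-norm = largest singular value
print(fro, two, np.linalg.svd(H, compute_uv=False)[0])

# The induced norm bounds the "amplification": ||H x|| <= ||H|| ||x|| for any x
x = np.random.randn(2)
assert np.linalg.norm(H @ x) <= two * np.linalg.norm(x) + 1e-12
```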

3.2.2 Rank and kernel

The range of a complex \(L\times N\) matrix \({\mathbf {H}}\) is

$$\begin{aligned} {\text{range}}\,({\mathbf {H}}) = \{ y \in {\mathbb {C}}^L |\, y = {\mathbf {H}}x \;\; {\text{for\,some}}\, x \in {\mathbb {C}}^N \} \, , \end{aligned}$$

its rank

$$\begin{aligned} {\text{rank}}\,({\mathbf {H}}) = \dim \left( {\text{range}}\, ({\mathbf {H}}) \right) , \end{aligned}$$

its kernel

$$\begin{aligned} {\text{ker}}\,({\mathbf {H}}) = \{ x \in {\mathbb {C}}^N |\, {\mathbf {H}}x =0 \} , \end{aligned}$$

and its nullity

$$\begin{aligned} {\text{nul}}\,({\mathbf {H}})={\text{dim}}\,({\text{ker}}\,({\mathbf {H}})). \end{aligned}$$

As a linear algebra exercise it can be shown that

$$\begin{aligned} {\text{rank}}\,({\mathbf {H}}) +{\text{nul}}\,({\mathbf {H}}) = N . \end{aligned}$$

For a square \(N\times N\) matrix \({\mathbf {A}}\), the following properties are equivalent:

  1.

    \({\mathbf {A}}\) is non-singular

  2.

    \({\text{det}}\,({\mathbf {A}}) \ne 0\)

  3.

    \({\text{ker}}({\mathbf {A}}) = \{ \mathbf{0}\in {\mathbb {C}}^N \}\)

  4.

    \({\text{rank}}({\mathbf {A}})=N\)

  5.

    \({\mathbf {A}}\) has linearly independent columns (and rows).
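These notions can be checked numerically; in the following sketch (the matrices are arbitrary examples) the kernel is obtained from the right-singular vectors associated with vanishing singular values:

```python
import numpy as np

H = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])           # arbitrary 2x3 matrix with linearly dependent columns

U, s, Vh = np.linalg.svd(H)
tol = 1e-12
rank = np.sum(s > tol)
kernel_basis = Vh[rank:].conj().T          # right-singular vectors with (numerically) zero singular value
nullity = kernel_basis.shape[1]

print(rank, nullity, rank + nullity)       # rank + nullity = N = 3
print(np.allclose(H @ kernel_basis, 0))    # the columns of kernel_basis lie in ker(H)
```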

3.3 Approximation by projection

3.3.1 The least-squares problem

It is rather easy and common to approximate a function by functions of a special type, such as Fourier modes, wavelets, polynomials, or somewhat physically inspired bases. In this article we focus on representing gravitational waveforms by themselves. Intuitively, this should be (and is) more efficient. In this subsection we briefly introduce projection-based approximations, with more details given in Sect. 6.

To motivate the problem consider an n-dimensional vector space \(W_n\) which is itself a subspace of a Hilbert space \({{{\mathcal {H}}}}\) (for the purposes of this article, this means that a scalar product is assumed to exist). A common approximation criterion is a least squares (LS) one. That is, one seeks to approximate \(h \in {{{\mathcal {H}}}}\) by \(h^{(n)} \in W_n\), the solution to

$$\begin{aligned} h^{(n)} = \mathrm {argmin}_{{\tilde{h}}} \Vert h - {\tilde{h}} \Vert ^2 \, , \end{aligned}$$
(5)

where \({\tilde{h}} \in W_n\). Equation (5) involves finding the element of \(W_n\) that minimizes the squared norm of the representation error.

If \(\{e_i\}_{i=1}^n\) is an orthonormal basis of \(W_n\) and \(\langle . \, , \, . \rangle \) the scalar product associated with \({{{\mathcal {H}}}}\), the following is the (unique) solution to the LS problem,

$$\begin{aligned} h^{(n)} = \sum _{i=1}^n c_i e_i \, , \quad {\text{with}} \;c_i = \langle e_i, h\rangle . \end{aligned}$$
(6)

The solution \(h^{(n)}\) is the orthogonal projection of h onto \(W_n\), therefore denoted as

$$\begin{aligned} {{{\mathcal {P}}}}_n h := h^{(n)} . \end{aligned}$$

What this means is that if \(h \in W_n\) then

$$\begin{aligned} {{{\mathcal {P}}}}_n h = h \, , \end{aligned}$$

and the residual \(\delta h = {{{\mathcal {P}}}}_n h - h \) satisfies

$$\begin{aligned} \langle {{{\mathcal {P}}}}_n h , \delta h \rangle = 0 . \end{aligned}$$

The solution in Eq. (6) is basis-independent and \(h^{(n)}\) is uniquely defined. In other words, the orthogonal projection onto any linear space is a geometric construction, so it is independent of the basis used to represent it: if \(\{ e_i\}_{i=1}^n\) and \(\{ {\tilde{e}}_i\}_{i=1}^n\) are any two orthonormal bases, and the projection coefficients \(c_i\) and \({\tilde{c}}_i\) are computed according to (6) (replacing in the second case \(e_i\) by \({\tilde{e}}_i\)), then it is easy to show that

$$\begin{aligned} {{{\mathcal {P}}}}_n h&= \sum _{i=1}^n c_i e_i = \sum _{i=1}^n {\tilde{c}}_i {\tilde{e}}_i \, , \\ \sum _{i=1}^n \langle e_i, h\rangle e_i&= \sum _{i=1}^n \langle {\tilde{e}}_i, h\rangle {\tilde{e}}_i \, , \end{aligned}$$

leading to the same solution of the LS problem. One might ask why the emphasis on orthogonal or orthonormal bases: it is due to a conditioning issue which we discuss in Sect. 6. For the time being the summary is that one should always use orthogonal or orthonormal bases. An exception is when building predictive models, as discussed in Sect. 11.
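A minimal sketch of the LS solution (6): it builds an orthonormal basis of a small subspace \(W_n\) by orthonormalizing a few arbitrary functions (a QR factorization, used here purely for illustration), computes the projection coefficients, and checks the orthogonality of the residual:

```python
import numpy as np

L = 2000
t = np.linspace(-1.0, 1.0, L)
dt = t[1] - t[0]
inner = lambda a, b: np.sum(np.conj(a) * b) * dt        # discrete L2-type inner product, w_i = dt

# Orthonormal basis {e_i} of a 4-dimensional W_n, obtained by orthonormalizing monomials
A = np.column_stack([t**k for k in range(4)])
Q, _ = np.linalg.qr(A)
E = Q / np.sqrt(dt)                                      # columns orthonormal w.r.t. `inner`

h = np.exp(t)                                            # element of the Hilbert space to approximate
c = np.array([inner(E[:, i], h) for i in range(E.shape[1])])   # c_i = <e_i, h>, Eq. (6)
h_n = E @ c                                              # P_n h, the LS solution

residual = h_n - h
print(np.abs(inner(h_n, residual)))                      # ~ 0: residual orthogonal to W_n
print(np.sqrt(inner(residual, residual).real))           # representation error ||h - P_n h||
```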

3.3.2 Representing a set of elements: collective error

Suppose we seek to approximate not one function but a set of them:

$$\begin{aligned} {{{\mathcal {F}}}} = \{h_1, h_2, \dots , h_N | h_i \in {{{\mathcal {H}}}} \} . \end{aligned}$$

From the above discussion we know that \({{{\mathcal {P}}}}_n h_i\) is the best representation of any particular \(h_i\). The question of how to measure the approximation error of the collective set naturally arises. Two reasonable and standard criteria are

$$\begin{aligned} \frac{1}{N} \sum _{i=1}^N \Vert h_i - {{{\mathcal {P}}}}_n h_i \Vert ^2 \, , \end{aligned}$$
(7)

and

$$\begin{aligned} \max _i \Vert h_i - {{{\mathcal {P}}}}_n h_i \Vert ^2 \, , \end{aligned}$$

which have the interpretations of the mean and maximum errors, respectively. Whether the mean error (7) or the maximum error is chosen as the minimization criterion leads to the two main ROM approaches discussed in this review: proper orthogonal/singular value decompositions, or the reduced basis-greedy approach.

3.4 Further reading

The material discussed can be found in standard linear algebra/numerical analysis books. There are many good introductory ones; we suggest Stewart (1996) and Stewart (1998).

4 Principal component analysis

Principal component analysis (PCA) is perhaps one of the most used tools when seeking redundancy or a hierarchy of importance of variables in statistical analysis. It has a close relationship with proper orthogonal decomposition (POD), as we discuss in Sect. 5.1. Within PCA one seeks to determine the most statistically relevant variables and to potentially discard the least relevant ones from the problem. In order to do this, we recall the definition of the covariance between two stochastic variables X, Y, as given by

$$\begin{aligned} {\mathrm{Cov}} (X,Y)&= \big \langle ( X - \langle X \rangle ) ( Y - \langle Y \rangle ) \big \rangle \\&= \langle X Y \rangle - \langle X \rangle \langle Y \rangle \, , \end{aligned}$$

where the brackets denote expectation values. The covariance between two variables provides a measure of the degree to which their fluctuations are correlated. A smaller (larger) covariance implies lower (higher) correlation. In particular, the covariance of a variable with itself is its variance (i.e., the standard deviation squared) and measures deviations from the mean value.

When there are multiple stochastic variables \(X_i\) (\(i=1, \ldots , n\)) one can construct their associated covariance matrix \({\mathbf {C}}\) with components \(C_{ij} = {\mathrm{Cov}} (X_i,X_j)\). This matrix is symmetric, non-negative definite and can therefore be diagonalized with an orthogonal transformation. Consider the ith normalized eigenvector \(\hat{{\mathbf {V}}}_i\). If we set \({{\mathbf {X}}}:=(X_1,\ldots ,X_n)\), then the principal components (PCs) are the projections of \({\mathbf {X}}\) onto these eigenvectors,

$$\begin{aligned} {{{\mathcal {E}}}}_i = {{\mathbf {X}}}\cdot \hat{{\mathbf {V}}}_{i} . \end{aligned}$$

The PCs are new variables representing directions in which the \(X_i\) have different levels of variance. They are uncorrelated, a consequence of the orthogonality of the eigenvectors, and their associated eigenvalues \(\lambda \) are their variances,

$$\begin{aligned} {\mathrm{Cov}}({{{\mathcal {E}}}}_i, {{{\mathcal {E}}}}_j ) = \lambda _i \delta _{ij} . \end{aligned}$$

The fact that, by construction, principal components are uncorrelated with each other is important since they provide independent pieces of statistical information.

The smaller an eigenvalue \(\lambda _i\), the more likely that the corresponding linear combination \({{{\mathcal {E}}}}_i\) will not deviate from its average value for a random realization of the variables. Therefore, if there exist small eigenvalues then the associated PCs are largely conserved in a statistical sense. Conversely, the larger an eigenvalue is, the more relevant the associated PC is in describing the dynamics and variations in the problem.

There are two related but different senses in which for a parametrized time series a principal component with small variance can be semi-conserved. The first is being constant as a function of time for an arbitrary but fixed set of parameters. The second one is in the statistical sense that deviations of a principal component from the mean value are small for arbitrary but fixed initial and final times over a set of runs with different initial configurations. A small variance automatically implies approximate conservation in the second sense but not necessarily in the first one. The interest is not only in those PCs which have the smallest variances (and thus identify semi-conserved quantities in the second sense) but also in those with the largest variances, which encode the most information about the system dynamics. This is illustrated through the following example.
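As an illustration with entirely synthetic data (made up for this sketch, not taken from any reference), the following code builds two correlated variables, diagonalizes their covariance matrix, and recovers uncorrelated principal components with a clear hierarchy of variances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated zero-mean variables (one large and one small direction)
n_samples = 100_000
a = rng.normal(scale=1.0, size=n_samples)     # dominant fluctuation
b = rng.normal(scale=0.01, size=n_samples)    # small fluctuation
X = np.column_stack([a + b, a - b])           # X_1, X_2 are highly correlated

C = np.cov(X, rowvar=False)                   # 2x2 covariance matrix C_ij = Cov(X_i, X_j)
eigvals, eigvecs = np.linalg.eigh(C)          # symmetric, non-negative definite
order = np.argsort(eigvals)[::-1]             # descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

PCs = X @ eigvecs                             # principal components E_i = X . V_i
print(eigvals)                                # variances of the PCs: one large, one tiny
print(np.cov(PCs, rowvar=False))              # approximately diagonal: the PCs are uncorrelated
```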

Example 3

PCA for spin dynamics.

One may wonder whether, given an initially uniform spin-orientation distribution for each component in a binary system, those orientations evolve after a while towards some preferred configuration, such as becoming aligned with some preferred direction relative to the orbital angular momentum. This was studied in Galley et al. (2010) through large scale numerical simulations. We only briefly review some of the results of that reference. The interest at the time was motivated by the unexpectedly large kicks found in numerical relativity simulations of binary black holes and whether they were of a generic nature in a statistical sense; for a review of recent results see Sect. 12.1. In order to perform a large enough analysis the post-Newtonian (PN) equations were used, up to 3.5PN order in the angular frequency and 2PN with the covariant spin supplementary condition.

Next, in any PCA study one has to define which variables to analyze. In order for the quantities to be invariant under rotations of the reference frame, at least in a Newtonian sense, the first quantities chosen were scalar products between the normalized spins of each binary component and the orbital angular momentum of the system; more precisely, the differences between their initial and final values:

$$\begin{aligned} \varDelta ({\hat{\mathbf{S}}}_1 \cdot {\hat{\mathbf{L}}})&= {\hat{\mathbf{S}}}_1 \cdot {\hat{\mathbf{L}}} |_f - {\hat{\mathbf{S}}}_1 \cdot {\hat{\mathbf{L}}} |_i =: {{\varDelta }_{1{\text{L}}}} \end{aligned}$$
(8)
$$\begin{aligned} \varDelta ({\hat{\mathbf{S}}}_2 \cdot {\hat{\mathbf{L}}})&= {\hat{\mathbf{S}}}_2 \cdot {\hat{\mathbf{L}}} |_f - {\hat{\mathbf{S}}}_2 \cdot {\hat{\mathbf{L}}} |_i =: {{\varDelta }_{2{\text{L}}}} \\ \varDelta ({\hat{\mathbf{S}}}_1 \cdot {\hat{\mathbf{S}}}_2)&= {\hat{\mathbf{S}}}_1 \cdot {\hat{\mathbf{S}}}_2 |_f - {\hat{\mathbf{S}}}_1 \cdot {\hat{\mathbf{S}}}_2 |_i =: {{\varDelta }_{12}} \, , \end{aligned}$$
(9)

where hats stand for unitary vectors. One of the results of Galley et al. (2010) is that the findings described below do not depend on the choices of initial and final times, which gives insight into the fact that the binary problem in GR is highly redundant, as discussed next.

The orbital angular momentum and spin orientations naturally become correlated due to spin-orbit and spin-spin interactions as each of these binary black hole configurations evolves in time. However, at least within the PN approximation considered here, the orbital angular momentum and spin vectors remain perfectly uniformly distributed (Herrmann et al. 2010). For example, a Kolmogorov–Smirnov test for a representative configuration returns a p-value of \(\sim 10^{-5}\) when testing for lack of uniformity (Herrmann et al. 2010). Higher PN expansions might introduce small biases (Lousto et al. 2010), but if so they appear to be at a level at which approximating the mean of the above scalar products at any instant of time by zero is a very good approximation.

Galley et al. (2010) start with a simple case and make contact with previous conservation results. The authors build towards the more general case by first performing a PCA using only the two spin-orbit (SO) variables in (8) and (9),

$$\begin{aligned} {{\varDelta }_{1{\text{L}}}} := \varDelta ({\hat{\mathbf{S}}}_1 \cdot {\hat{\mathbf{L}}})\, , \quad {{\varDelta }_{2{\text{L}}}} := \varDelta ({\hat{\mathbf{S}}}_2 \cdot {\hat{\mathbf{L}}}) . \end{aligned}$$
(10)

However, spin-spin interactions are still included in the PN equations of motion used to evolve each configuration, both in the numerical simulations and in the analytical calculations.

For mass and (dimensionless) spin magnitudes (\(m_j,\chi _j\)) of each black hole the covariance matrix for the variables (10) is

$$\begin{aligned} {\mathbf {C}} = \begin{pmatrix} {\mathrm{Cov}} ( {{\varDelta }_{1{\text{L}}}}, {{\varDelta }_{1{\text{L}}}}) & {\mathrm{Cov}} ( {{\varDelta }_{1{\text{L}}}}, {{\varDelta }_{2{\text{L}}}}) \\ {\mathrm{Cov}} ( {{\varDelta }_{2{\text{L}}}}, {{\varDelta }_{1{\text{L}}}}) & {\mathrm{Cov}} ( {{\varDelta }_{2{\text{L}}}}, {{\varDelta }_{2{\text{L}}}}) \end{pmatrix} \, , \end{aligned}$$

where the entries can come either from numerical simulations or from what the authors call the instantaneous approximation. The matrix \({\mathbf {C}}\) is then diagonalized to find the principal components.

From numerical simulations the authors find that, sampling across many random initial spin orientations, each of the principal components has zero mean over time (to numerical accuracy), \(\langle \varDelta {{{\mathcal {E}}}}_j^{\text {SO}} \rangle = 0\), a consequence of the spin orientation distributions remaining highly uniform during the inspiral.

Furthermore, they find \(\lambda _2\) to be in the range \(\sim 10^{-9}\)\(10^{-4}\) for the parameters sampled and that it grows with both spin magnitudes, which is expected from physical intuition, but also that it increases as the equal mass case is approached, which was unexpected.

As an example with \((m_1,m_2, \chi _1,\chi _2) = (0.4, 0.6, 1.0, 1.0)\), Fig. 1 shows a graphical representation of the principal components overlaid on a scatter plot of the \({{\varDelta }_{1{\text{L}}}}\) and \({{\varDelta }_{2{\text{L}}}}\) data from 1,000 out of 100,000 numerical simulations using random initial spin orientations. Notice that the first PC, which points along the direction of the eigenvector \({\hat{\mathbf{V}}}_1\) with the largest eigenvalue \(\lambda _1\), captures the largest variation in the data while the second PC, pointing along \({\hat{\mathbf{V}}}_2\), indicates that there is very little spread in the data in that direction, which is also implied by the smallness of \(\lambda _2\) relative to \(\lambda _1\). Therefore, for the time interval considered, the second PC is largely irrelevant. This figure is almost an ideal example of dimensionality reduction through PCA. For a detailed analysis and more powerful results see Galley et al. (2010).

Fig. 1

A graphical representation of the principal components for the spin-orbit variables and the numerical data of \({{\varDelta }_{1{\text{L}}}}\) and \({{\varDelta }_{2{\text{L}}}}\) for a binary black hole system with, as illustration, \(m_1 = 0.4\) and maximal spin magnitudes. One can see that PC2 is largely irrelevant. For details see Galley et al. (2010)

4.1 Further reading

Principal component analysis is a very well known tool to look for redundancies or to categorize the relevance of different variables. One of its weaknesses is that the outcome depends on which variables one chooses to analyze. We chose this case study because a closer look at the original reference reviewed here gives strong indications that there are at least three (out of eight) redundant quantities in the problem, and therefore the problem is amenable to a more powerful dimensional reduction approach, as presented throughout this review. Jolliffe (2002) is a very pedagogical book solely on PCA, with many case studies and modern developments.

5 Proper orthogonal and singular value decompositions

Given the snapshots \(h_1, \ldots , h_N \in {\mathbb {C}}^{L}\) (the training samples, or training waveforms), the snapshot matrix (4) defined in Sect. 3 can be written as

$$\begin{aligned} {\mathbf {H}} = [h_1, \ldots , h_N] \in {\mathbb {C}}^{L \times N} \end{aligned}$$

where, in general, \(L\ge N\). The goal here is to find \(n \le N\) orthonormal vectors \(\{ \phi _i \}_{i=1}^n\) in \({\mathbb {C}}^{L}\) minimizing the average approximation error (7)

$$\begin{aligned} \frac{1}{N}\sum _{i=1}^N \Vert h_i - {{{\mathcal {P}}}}_n h_i \Vert ^2, \end{aligned}$$
(11)

where \({{{\mathcal {P}}}}_n\) is the orthogonal projector onto \(W_n:= {\text{span}}\{ \phi _i \}_{i=1}^n\) and \(\Vert \cdot \Vert \) is the Euclidean norm. The minimizer subspace of (11) is called a proper orthogonal decomposition (POD) of rank n, and its generating basis the POD basis.

Notice that in general the basis elements \(\phi _i\) are not members of \(\{h_i\}_{i=1}^N\). In the gravitational wave case this implies that the basis is not itself composed of waveforms; this is one disadvantage of POD, since having an empirical basis made of actual waveforms comes with advantages. In Sect. 8 we show how to build such a basis, and in Sects. 11 and 12 we see some of its main applications.

Another disadvantage is that the problem, involving a potentially very large linear algebra system, is not easily parallelizable in distributed memory architectures. Still, the method is straightforward to implement and can, for example, quickly provide insight into whether a problem is amenable to dimensionality reduction: after performing a POD decomposition of the snapshot matrix, one can look at the decay rate of singular values as a function of n in order to know if the problem can be codified in a space of lower dimensionality than the original one. This can be turned into an actual strategy: a POD for a small subset of the problem can provide a quick insight, and if the singular values decay fast enough with n, the larger problem of interest can be tackled with a greedy approach, as discussed in Sect. 8.3.

The optimal solution to the minimization problem defined by Eq. (11) can be accomplished by means of a singular value decomposition (SVD) procedure. In this framework the POD basis is given by the first n left singular vectors of the snapshot matrix \({\mathbf {H}}\) (this can be proved by considering the first-order optimality conditions for the minimization problem).

Consider then the following singular value decomposition of \({\mathbf {H}}\):

$$\begin{aligned} {\mathbf {H}}=\mathbf {U\Sigma V}^\dagger \, , \end{aligned}$$
(12)

where \({\mathbf {U}}=[u_1, \ldots , u_L]\in {\mathbb {C}}^{L\times L}\), \({\mathbf {V}}=[v_1, \ldots , v_N]\in {\mathbb {C}}^{N\times N}\) are unitary matrices and

$$\begin{aligned} {\varvec{\Sigma }}=\begin{bmatrix} {\mathbf {D}}\\ {\mathbf {0}} \end{bmatrix}\in {\mathbb {C}}^{L\times N}, {\mathbf {D}}=diag(\sigma _1, \ldots , \sigma _N)\in {\mathbb {C}}^{N\times N}, \sigma _1\ge \cdots \ge \sigma _N\ge 0. \end{aligned}$$
(13)

Then, the rank-n POD basis is given by \(\phi _i=u_i,\,i=1, \ldots , n\). This basis provides a low-rank approximation to \({\mathbf {H}}\) and represents the minimizer basis for the optimization problem stated in (11). To see this, consider from (12) the relation

$$\begin{aligned} u_i^\dagger {\mathbf {H}}=\sigma _i v_i^\dagger , \end{aligned}$$

valid for all column vectors of \({\mathbf {U}}\). Multiply by \(u_i\) on the left and sum over i

$$\begin{aligned} \sum _{i=1}^n u_i u_i^\dagger {\mathbf {H}}=\sum _{i=1}^n\sigma _i u_i v_i^\dagger . \end{aligned}$$
(14)

The left hand side of Eq. (14) is exactly the orthogonal projector \({{{\mathcal {P}}}}_n\) (in matrix representation) associated with the basis \(\{u_i\}_{i=1}^n\) acting on \({\mathbf {H}}\) and the right hand side represents the rank-n approximation \({\mathbf {H}}_n\) of \({\mathbf {H}}\). Therefore we can rewrite (14) as

$$\begin{aligned} {{{\mathcal {P}}}}_n{\mathbf {H}}={\mathbf {H}}_n. \end{aligned}$$

Remarks on the SVD decomposition

  • The numbers \(\sigma _i\) are known as the singular values of the matrix \({\mathbf {H}}\) and correspond to the positive square roots of the eigenvalues of the associated matrix \({\mathbf {K}}={\mathbf {H}}^{\dagger }{\mathbf {H}}\). In practice, they are usually arranged in descending order to facilitate the recognition of the most relevant components of the decomposition.

  • For the matrices \({\mathbf {U}}\) and \({\mathbf {V}}\), only the first \(\mathrm {rank}({\mathbf {H}})\) columns are unique, whereas the remaining ones are arbitrarily extended such that orthogonality is maintained. Since \(\sigma _i = 0\) whenever \(i > \mathrm {rank}({\mathbf {H}})\), the factorization (12) does not depend on the choice of this extension, which is annihilated by the null entries of \({\varvec{\Sigma }}\).

  • The first N columns of \({\mathbf {U}}\) and \({\mathbf {V}}\) are known as the left- and right-singular vectors of \({\mathbf {H}}\) respectively. The right-singular vectors, \(v_i\), are the normalized eigenvectors of \({\mathbf {H}}^{\dagger }{\mathbf {H}}\) and the left-singular vectors, \(u_i\), are the normalized eigenvectors of \({\mathbf {H}}{\mathbf {H}}^{\dagger }\).

  • The number K of non-zero elements in \({\varvec{\Sigma }}\) is exactly \({\text{rank}}({\mathbf {H}})\). Consequently, Eq. (12) is sometimes called a rank-revealing factorization.

Error of a POD approximation

The accuracy of the low rank-n approximation \({\mathbf {H}}_n=\mathcal{P}_n{\mathbf {H}}=[{{{\mathcal {P}}}}_n h_1,\ldots , {{{\mathcal {P}}}}_n h_N]\) is given by the following lemma.

Lemma 1

Define the approximation error by

$$\begin{aligned} \epsilon = \frac{1}{N} \sum _{i=1}^N \Vert h_i-{{{\mathcal {P}}}}_n h_i \Vert ^2. \end{aligned}$$

It can be shown that \(\epsilon \) satisfies

$$\begin{aligned} \epsilon = \frac{1}{N} \sum _{i=n+1}^N\sigma _i^2. \end{aligned}$$

In terms of the Frobenius norm,

$$\begin{aligned} \epsilon =\frac{1}{N}\Vert {\mathbf {H}}-{\mathbf {H}}_n\Vert ^2_F. \end{aligned}$$
(15)

Intuitively, the square of the Frobenius norm of the difference between a matrix and its low-rank approximation represents the total squared error accumulated over all entries of \({\mathbf {H}}\) due to omitting the last \((N-n)\) singular values when forming \({\mathbf {H}}_n\). As discussed above, the projection \({\mathbf {H}}_n\) represents the best rank-n approximation of \({\mathbf {H}}\) in the Frobenius norm.
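Lemma 1 is easy to verify numerically. The sketch below uses the toy damped-exponential training set of Example 1 as a stand-in for actual waveforms (the grid and parameter values are arbitrary) and compares the directly computed average error (11) with the sum of the discarded squared singular values:

```python
import numpy as np

# Toy training set as in Example 1: columns are snapshots h(t; lambda_i)
L, N = 2048, 200
t = np.linspace(0.0, 10.0, L)
lam = np.linspace(1.0, 5.0, N) + 0.3j
H = np.exp(1j * np.outer(t, lam))                        # L x N snapshot matrix

U, s, Vh = np.linalg.svd(H, full_matrices=False)

n = 10                                                    # rank of the POD approximation
P = U[:, :n] @ U[:, :n].conj().T                          # orthogonal projector onto span{u_1,...,u_n}
Hn = P @ H                                                # rank-n approximation H_n

# Average error (11) equals (1/N) * sum of the discarded squared singular values (Lemma 1)
eps_direct = np.sum(np.abs(H - Hn) ** 2) / N
eps_lemma = np.sum(s[n:] ** 2) / N
print(eps_direct, eps_lemma)                              # agree to round-off
```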

In summary, going back to Sect. 3, one can see that the minimization problem for the average error (7) of a set of functions \({{{\mathcal {F}}}}\) is optimally solved by the POD/SVD decomposition of the matrix associated with those functions.

In the context of a gravitational-wave template bank each row of the snapshot matrix is a waveform evaluated at different time/frequency values, with different rows corresponding to different intrinsic parameters. The sum of the inner products of the rows with themselves, in turn, corresponds to the total power in the bank. Thus, the squared Frobenius norm of a template bank is the total power contained within all the templates in it. For normalized templates, this is simply the number of templates in the bank. It can be seen how Eq. (15) directly corresponds to the total power in a template bank lost from a low-rank approximation to it. More precisely, the error measure is directly related to the average fractional signal-to-noise ratio (SNR) loss which, up to a constant, is the squared Frobenius norm of the difference between the full template bank and its low-rank approximation. Fixing the average fractional SNR loss thus determines the total number of non-zero singular values that must be retained to guarantee that the rank-reduced bank remains effective. This is discussed in practical terms in the following case study.

Example 4

SVD for gravitational waves.

Cannon et al. (2010) studied the application of the SVD to gravitational wave templates to find redundancies in the bank and to build a compact basis able to represent the entire dataset. As a proof of concept, the authors applied an SVD approach to a set of CBC waveforms corresponding to a sliver of the BNS parameter space. Next, we summarize some of the results of this reference, closely following its notation, which might differ from that of the rest of this review.

In order to detect a GW signal, the common choice for the minimal match between an arbitrary point in parameter space and its nearest point in the template bank is \(97\%\). To compare and filter the data against the entire template bank, an approximation to the matched filter output \(\rho _\alpha =\langle {\bar{h}}_{\alpha }, s\rangle \) is sought, where \(h_{\alpha }\) is a complex waveform vector and s is the data vector (containing the presumed signal, in addition to detector noise). In this way the number of inner-product evaluations, and hence the computational cost, can be reduced. Define the \(N\times L\) template matrix, where \(N=2M\) and M stands for the number of complex waveforms (there is no obvious reason to perform an SVD on the real and imaginary parts of a waveform separately, given that the POD approach can handle complex snapshots):

$$\begin{aligned} {\mathbf {H}}=\{h_1^R,h_1^I,\ldots ,h_M^R,h_M^I\}, \end{aligned}$$

where \(h_i^{R,I}\) are the real and imaginary parts of the ith template waveform, each one corresponding to a row of \({\mathbf {H}}\). Applying an SVD decomposition to \({\mathbf {H}}\) and writing it in component form (here we follow the notation of Cannon et al. 2010)

$$\begin{aligned} {\mathbf {H}}_{\mu \nu }=\sum _{\kappa =1}^N v_{\mu \kappa }\sigma _\kappa u_{\kappa \nu }, \end{aligned}$$

one can define the truncated sum

$$\begin{aligned} {\mathbf {H}}'_{\mu \nu }=\sum _{\kappa =1}^{N'} v_{\mu \kappa }\sigma _\kappa u_{\kappa \nu }, \end{aligned}$$

and approximate \(\rho _\alpha \) by

$$\begin{aligned} \rho '_\alpha = \langle H'_{2\alpha -1}-iH'_{2\alpha }, s\rangle =\sum _{\nu =1}^{N'} (v_{{2\alpha -1}\,\nu } \sigma _\nu - i v_{{2\alpha }\,\nu }\sigma _\nu ) \langle u_\nu , s\rangle , \end{aligned}$$

where the \(H'_j\) are the vector rows of \({\mathbf {H}}'\).

Figure 2 shows a representation of the matrix of waveforms \({\mathbf {H}}\) and its associated SVD basis. One disadvantage of this kind of dimensional reduction is that the resulting basis barely resembles the structure of the original templates. This can be avoided with an RB-greedy approach, presented in Sect. 8.

Fig. 2

Image reproduced with permission from Cannon et al. (2010), copyright by APS

Left: Representation of template waveforms corresponding to \({\mathbf {H}}\) against time. Right: Representation of the first four SVD basis elements. Notice that since the SVD basis functions do not correspond to actual waveforms, they display non-physical behavior which might be difficult to fit for when building a surrogate model as discussed in Sect. 11.

Accuracy. An approximation to the data snapshot matrix is sought such that \(\Vert {\mathbf {H}}_\mu - {\mathbf {H}}'_\mu \Vert \sim 1\%\). Following this requirement, the fractional SNR loss

$$\begin{aligned} \Big \langle \frac{\delta \rho }{\rho }\Big \rangle :=\frac{1}{M}\sum _{\alpha =1}^M \frac{\delta \rho _\alpha }{\rho _\alpha } \end{aligned}$$
(16)

can be approximated by

$$\begin{aligned} \Big \langle \frac{\delta \rho }{\rho }\Big \rangle =\frac{1}{2N}\sum _{\mu =N'+1}^N \sigma _\mu ^2, \end{aligned}$$

resulting from a Taylor expansion valid in the range \(\langle \delta \rho /\rho \rangle <10\%\). Note that this approximation is proportional to the squared Frobenius norm of the truncation error of \({\mathbf {H}}\),

$$\begin{aligned} \Big \langle \frac{\delta \rho }{\rho }\Big \rangle =\frac{1}{2N}\Vert {\mathbf {H}}-{\mathbf {H}}' \Vert _F^2. \end{aligned}$$

As a case of application, Cannon et al. (2010) show an SVD analysis of gravitational waves emitted by a CBC-BNS, with chirp masses \(1.125\,M_\odot \le M_c<1.240\,M_\odot \) (that is, a rather small sliver in parameter space) and component masses \(1\,M_\odot \le m_1,m_2<3\,M_\odot \). It was found that \(M=456\) templates (\(N=912\)) are needed to cover the parameter space for a minimal match of 96.8%. In Fig. 3 (a plot of \(\langle \delta \rho /\rho \rangle \) vs. the number of SVD-basis elements) it can be seen that, to obtain \(\langle \delta \rho /\rho \rangle =10^{-3}\), the number of basis elements needed to reconstruct the whole template bank to that accuracy can be reduced from \(N=912\) to \(N'=118\). Though POD/SVD is a good starting point for ROM, starting from Sect. 6 we develop a more modern framework for model reduction.

Fig. 3

Image reproduced with permission from Cannon et al. (2010), copyright by APS

Fractional SNR loss [Eq. (16)] for CBC-BNS gravitational waves as a function of the number of SVD-basis elements; the approximation is valid for \(\langle \delta \rho /\rho \rangle < 10\%\).
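The rank-selection logic described in this example can be sketched as follows; this is not the code of Cannon et al. (2010), and the "template bank" below is a synthetic stand-in of unit-norm, chirp-like rows, used only to show how one would pick \(N'\) from the singular values for a given tolerance on \(\langle \delta \rho /\rho \rangle \):

```python
import numpy as np

def choose_rank(H, tol=1e-3):
    """Smallest N' such that the estimated <delta rho / rho> = (1/2N) sum_{mu > N'} sigma_mu^2 <= tol."""
    s = np.linalg.svd(H, compute_uv=False)
    N = H.shape[0]                                   # number of (real) template rows (N = 2M in the paper)
    tail = np.cumsum(s[::-1] ** 2)[::-1] / (2 * N)   # loss incurred when truncating after each rank
    ranks = np.nonzero(tail <= tol)[0]
    return (ranks[0] if ranks.size else len(s)), s

# Stand-in "template bank": rows are unit-norm chirp-like time series (not actual CBC templates)
t = np.linspace(0.0, 1.0, 4096)
rows = [np.cos(2 * np.pi * (50 + 0.05 * k) * t**1.5) for k in range(200)]
H = np.array([r / np.linalg.norm(r) for r in rows])

Nprime, s = choose_rank(H, tol=1e-3)
print(H.shape, Nprime)                               # typically far fewer basis elements than rows
```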

5.1 PCA and POD

Principal component analysis and proper orthogonal decomposition are closely related from a mathematical point of view, though usually applied in different contexts. PCA is in general used in statistical analysis. It represents the solution to an optimization problem, namely, that of finding the uncorrelated directions of maximum statistical variability in a stochastic data set. POD is usually applied in dimensional reduction modeling, and usually for deterministic, not stochastic, models. As in PCA, POD also represents the solution to an optimization problem: finding the best low-rank approximation to a data set matrix. This is achieved by minimizing the Frobenius norm of the error matrix [Eq. (15)].

Suppose we have a data set matrix of size \(L\times N\); as usual in this review, N is the number of variables and L the number of samples corresponding to each variable. In PCA one is not interested in the “length” of each variable and its individual components, but rather in their correlations. Put differently, if there are N stochastic variables, the covariance matrix, regardless of the number of samples of each stochastic variable, is in all cases of size \(N \times N\). Compression can be achieved by ignoring the directions in which the stochastic data have little variance.

5.1.1 Mathematical relationship between PCA and POD

Although conceptually different, both methods are mathematically related. To see this, we make a simple observation concerning the snapshot matrix and its covariance matrix. Let \({\mathbf {X}}:=[X_1,\ldots ,X_N]\in {\mathbb {C}}^{L\times N}\) be the snapshot matrix with each column i storing L independent and identically distributed observations of the random variable \(X_i\). We replace \({\mathbf {X}}\) by a “centered” matrix version of it by subtracting from each column \(X_i\) the corresponding mean value \(\langle X_i \rangle \),

$$\begin{aligned} X_{s,i}\rightarrow X_{s,i}-\langle X_i \rangle \, , \quad s=1\ldots L . \end{aligned}$$

Then, the associated covariance matrix \({\mathbf {C}}\) can be written as

$$\begin{aligned} {\mathbf {C}}=\frac{1}{L-1}{\mathbf {X}}^\dagger {\mathbf {X}}. \end{aligned}$$
(17)

Without loss of generality, we assume that \({\mathbf {X}}\) has full rank. Therefore, being an hermitian, non-negative definite matrix, \({\mathbf {C}}\) can be diagonalized as

$$\begin{aligned} {\mathbf {C}}={\mathbf {V}}{\varvec{\varLambda }}{\mathbf {V}}^\dagger , \end{aligned}$$

where \({\varvec{\varLambda }}={\text{diag}}(\lambda _1, \ldots, \lambda _N)\) is the matrix of non-zero descending-ordered real eigenvalues and \({\mathbf {V}}\) is the unitary matrix storing eigenvectors of \({\mathbf {C}}\). The columns of \({\mathbf {V}}\) are called principal axes or principal directions in the PCA framework. PCs correspond to the projection of the data onto these principal directions, the columns of the matrix product \({\mathbf {X}}{\mathbf {V}}\).

Now, perform a POD/SVD decomposition of the snapshot matrix \({\mathbf {X}}\),

$$\begin{aligned} {\mathbf {X}}={\mathbf {U}}{\varvec{\Sigma }}{\tilde{{{\mathbf {V}}}}}^{\dagger } \, , \end{aligned}$$

and rewrite Eq. (17) as

$$\begin{aligned} {\mathbf {C}}=\frac{1}{L-1}\, \,({\mathbf {U}}{\varvec{\Sigma }}{\tilde{{{\mathbf {V}}}}}^{\dagger })^\dagger {\mathbf {U}}{\varvec{\Sigma }}{\tilde{{{\mathbf {V}}}}}^{\dagger }={\tilde{{{\mathbf {V}}}}}\,\frac{{\varvec{\Sigma }}^\top {\varvec{\Sigma }}}{L-1}\,{\tilde{{{\mathbf {V}}}}}^\dagger . \end{aligned}$$

The similarity relation between \({\mathbf {C}}\) and \(\frac{{\varvec{\Sigma }}^\top {\varvec{\Sigma }}}{L-1}\) implies that both have the same eigenvalues:

$$\begin{aligned} \lambda _i = \frac{\sigma ^2_i}{L-1},\quad i=1,\ldots , N. \end{aligned}$$

Another benefit of this mathematical correspondence is that the PCs are straightforward to compute as \({\mathbf {X}}{\mathbf {V}}={\mathbf {U}}{\varvec{\Sigma }}\).
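This correspondence can be verified in a few lines (the data matrix below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
L, N = 500, 4
X = rng.normal(size=(L, N)) @ rng.normal(size=(N, N))   # arbitrary L x N data matrix
X = X - X.mean(axis=0)                                   # center each column (variable)

# PCA route: eigen-decomposition of the covariance matrix C = X^dagger X / (L - 1)
C = X.T @ X / (L - 1)
lam = np.sort(np.linalg.eigvalsh(C))[::-1]               # descending eigenvalues

# POD/SVD route: singular values of the centered snapshot matrix
U, s, Vh = np.linalg.svd(X, full_matrices=False)

print(np.allclose(lam, s**2 / (L - 1)))                  # lambda_i = sigma_i^2 / (L - 1)
print(np.allclose(X @ Vh.conj().T, U * s))               # PCs: X V = U Sigma
```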

5.2 Further reading

We refer the reader to Jolliffe (2002) and Trefethen and Bau (1997) for clear expositions of the classical matrix factorization methods presented in this section. QR decompositions form another class of matrix factorizations which can be used for low-rank approximations. A theoretical framework for rank-revealing QR decompositions of a matrix was developed in the late 1980s and 1990s (Chan 1987; Hong and Pan 1992), motivated partly by the SVD's high computational cost. QR decompositions form the basis for many modern fast algorithms (Harbrecht et al. 2012; Guruswami and Sinop 2012; Deshpande and Rademacher 2010; Civril and Magdon-Ismail 2008). In fact, motivated by GW research, an important result has been shown relating two apparently disconnected approaches: the RB-greedy algorithm described in Sect. 8.3 is completely equivalent to a certain type of QR decomposition (Antil et al. 2018).

6 Spectral expansions

In this section we discuss fast converging classical linear approximations, since they naturally lead to a reduced basis when considering parametrized problems.

Consider a set of complex functions,

$$\begin{aligned} {{{\mathcal {F}}}}:= \{ h: {\mathbb {C}} \rightarrow {\mathbb {C}} \} . \end{aligned}$$

A standard linear approximation consists of the following: a basis of n (not necessarily orthogonal) functions,

$$\begin{aligned} \{ e_i \}_{i=1}^n \, , \end{aligned}$$

is somehow chosen. Throughout this review we discuss several approaches for selecting bases.

Next, a function h in \({{{\mathcal {F}}}}\) is approximated by a linear combination of the chosen basis elements,

$$\begin{aligned} h(z) \approx h^{(n)}(z) := \sum _{i=1}^n c_i e_i(z) \, , \quad c_i \in {\mathbb {C}} . \end{aligned}$$
(18)

There are several criteria for choosing the expansion coefficients \(\{ c_i \}\). Here we recap least squares (LS); in Sect. 10 we discuss interpolation. In the LS approach the coefficients \(\{ c_i \}\) are chosen such that the representation error is minimized with respect to some chosen norm,

$$\begin{aligned} \left\| h - \sum _{i=1}^n c_i e_i \right\| ^2. \end{aligned}$$
(19)

As was discussed in Sect. 3, the solution to the LS problem (Eq. (5), Sect. 3) is the orthogonal projection onto the span of the basis,

$$\begin{aligned} h^{(n)} = {{{\mathcal {P}}}}_n h . \end{aligned}$$

That is, the approximation (18) minimizes the error (19) when the coefficients \(\{ c_i \}\) are chosen such that the residual

$$\begin{aligned} \delta ^{(n)} h = h - h^{(n)} \end{aligned}$$

is orthogonal to the approximation space, \(\langle \delta ^{(n)} h , e_j \rangle = 0\) for each basis element (and hence \(\langle \delta ^{(n)} h , h^{(n)} \rangle = 0\)); that is,

$$\begin{aligned} \left\langle e_j (\cdot ) , h (\cdot ) - \sum _{i=1}^n c_i e_i (\cdot ) \right\rangle = 0 \quad {\text {for}}\quad j=1,\ldots , n . \end{aligned}$$
(20)

The solution to (20) is

$$\begin{aligned} c_i = \sum _{j=1}^n({\mathbf {G}}^{-1} )_{ij} \langle e_j , h \rangle \, , \end{aligned}$$
(21)

where \({\mathbf {G}}^{-1}\) is the inverse of the Gram matrix or Gramian \({\mathbf {G}}\), with entries

$$\begin{aligned} {\mathbf {G}}_{ij} := \langle e_i , e_j \rangle . \end{aligned}$$

If the basis is orthonormal, this matrix is the identity and one recovers the familiar expression

$$\begin{aligned} {{{\mathcal {P}}}}_n h = \sum _{i=1}^n \langle e_i , h \rangle e_i . \end{aligned}$$
(22)

The Gramian matrix can be very ill-conditioned in general (Gautschi 2012; Boyd and Gildersleeve 2011), meaning that the calculation of its inverse, needed in (21), can have large numerical errors (see, for example, Taylor 1978). Therefore, from a numerical-conditioning point of view, it is convenient to work with an orthonormal (or orthogonal) basis, since then the Gram matrix is the identity (or diagonal).
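As a quick illustration of this conditioning issue (the monomial basis and the interval \([0,1]\) are our own choice of example), the Gramian of the first n monomials with unit weight is the Hilbert matrix, whose condition number grows extremely fast:

import numpy as np

# Gramian G_ij = <x^i, x^j> = int_0^1 x^(i+j) dx of the monomials {1, x, ..., x^(n-1)}:
# this is the Hilbert matrix, a classic example of ill-conditioning.
n = 10
G = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)
print(np.linalg.cond(G))   # ~1e13: inverting G in Eq. (21) can lose about 13 digits

An orthonormal basis spanning the same space (for instance, suitably normalized shifted Legendre polynomials) has the identity as its Gramian, and Eq. (22) applies directly.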

In iterative approximations one defines the convergence rate as the rate at which the representation error

$$\begin{aligned} \Vert \delta ^{(n)}h \Vert = \left\| h - {{{\mathcal {P}}}}_n h \right\| \end{aligned}$$

decreases as n increases for any given h. In the context of RB, when one is dealing with parametrized systems, this error also depends on the parameters of the system.

6.1 Spectral methods

In terms of accuracy and optimal convergence rates for approximation of a space \({{{\mathcal {F}}}}\), we have not yet discussed two related aspects:

  1. 1.

    The choice of a “good” basis or, more precisely, the approximation space \(W_n\).

  2. 2.

    The optimal choice of how many basis elements to use. One might think that the more, the better. This is related to the regularity of the functions in \({{{\mathcal {F}}}}\), that is, how smooth or differentiable they are.

The first comment might be puzzling; after all, we have emphasized that, given an approximation space \(W_n\), the LS approximation is uniquely defined in a geometric way. The question really is what the approximation space \(W_n\) should be in order to minimize the representation error.

6.1.1 Fourier expansions

We start with the simplest and most familiar case: that of periodic functions in \([0,2\pi ]\) with unit weight, \(\omega (x)=1\),Footnote 1 and Fourier modes as (orthonormal) basis,

$$\begin{aligned} e_j (x) = \frac{1}{\sqrt{2\pi }} e^{ijx} \, , \quad j\in {\mathbb {Z}}. \end{aligned}$$

Assuming for simplicity that n is even, we have

$$\begin{aligned} {{{\mathcal {P}}}}_{n+1} h (x) = \sum _{j=-n/2}^{n/2} {\hat{h}}_j e_j(x)\, , \end{aligned}$$

where \({\hat{h}}_j\) are the Fourier coefficients of h.

One can show that if h has s derivatives, then there exists a constant \(C>0\) independent of n such that

$$\begin{aligned} \Vert h - {{{\mathcal {P}}}}_{n+1} h \Vert \le C(n+1)^{-s} \left\| \frac{d^s h }{dx^s} \right\| \end{aligned}$$

for all \(n\ge 0\). In particular, if h is smooth (\(h\in C^\infty \)), then the representation error decays faster than any power law as \(n\rightarrow \infty \), which is referred to as spectral convergence. Under further conditions, in particular when h is analytic, the error decay is actually exponential. The Fourier case gives a very intuitive picture of how this happens, simply through integration by parts; for more details see Chapt. 9 of Sarbach and Tiglio (2012).

The summary here is that Fourier modes are a good basis not merely because they are periodic functions, but because, as a representation space, they provide convergence as fast as the regularity of the function(s) being represented allows. This is the core idea of spectral methods and, for parametrized systems, reduced basis. A brief summary of the non-periodic case follows.
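A simple numerical experiment makes this dependence on regularity visible; the two periodic test functions below are hypothetical choices of ours, and the projection error is estimated via the FFT and Parseval's identity:

import numpy as np

def fourier_projection_error(h, n, M=4096):
    # L2 error of projecting a 2*pi-periodic function onto the n+1 modes j = -n/2, ..., n/2,
    # estimated from M equispaced samples via the FFT and Parseval's identity.
    x = np.linspace(0.0, 2.0 * np.pi, M, endpoint=False)
    chat = np.fft.fft(h(x)) / M                    # approximate Fourier coefficients c_k
    k = np.fft.fftfreq(M, d=1.0 / M)               # integer mode numbers
    tail = np.abs(chat[np.abs(k) > n // 2]) ** 2   # discarded modes
    return np.sqrt(2.0 * np.pi * tail.sum())

smooth = lambda x: np.exp(np.sin(x))               # analytic and periodic
kinked = lambda x: np.abs(np.sin(x)) ** 3          # periodic, but with only a few derivatives
for n in (4, 8, 16, 32):
    print(n, fourier_projection_error(smooth, n), fourier_projection_error(kinked, n))
# The first error decays exponentially with n, the second one only as a power law.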

6.1.2 Jacobi polynomials

In the case of non-periodic functions a result similar to the one discussed for periodic functions holds, whether the domain is bounded or infinite. Again, the choice of weight is crucial. Any bounded interval can be mapped to \([-1,1]\) or \((-1,1)\), which is why one usually refers to these. The open-interval case arises because in some instances (such as Chebyshev) the weights actually diverge at the \(\pm 1\) end points; there is a reason for this but we shall skip it.

Polynomials are a natural basis to use. Why? Simply because after centuries we understand them well; going beyond that is one way of looking at reduced basis. Given a maximum degree, the span of polynomials is the same regardless of the particular family chosen, so the question is only which weights guarantee fast convergence given the regularity of the function to be represented. The following family of weights is a sufficient class, though not a necessary one,

$$\begin{aligned} \omega (x) = (1-x)^{\alpha }(1+x)^{\beta } \, \quad x\in (-1,1) \end{aligned}$$
(23)

with \(\alpha , \beta > -1\).

Under these assumptions, the following holds in a Jacobi polynomial approximation:

$$\begin{aligned} \Vert h - {{{\mathcal {P}}}}_{n+1} h\Vert \le C (n+1)^{-s} \left\| (1-x^2)^{s/2} \frac{d^s h}{dx^s} \right\| \, , \end{aligned}$$

for all \(n > (s-1)\) and C independent of n (Funaro 1992). If h is smooth there is asymptotic spectral convergence, by which is meant convergence faster than any polynomial law. Under additional conditions on the smoothness (for example, though not necessarily, analyticity), the convergence is in fact exponential. In this review we will loosely associate smoothness with asymptotic exponential convergence of application-specific spectral expansions, without discussing these additional assumptions.
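For a concrete Legendre example (the Jacobi weight with \(\alpha =\beta =0\); the analytic test function and truncation degrees below are arbitrary choices of ours), the \(L_2\) error of the degree-n projection can be estimated with a high-order Gauss–Legendre rule:

import numpy as np
from numpy.polynomial import legendre as Leg

f = lambda x: np.exp(-x) * np.cos(4.0 * x)        # a smooth (analytic) test function on [-1, 1]

x, w = Leg.leggauss(200)                          # high-order quadrature for the integrals
P = Leg.legvander(x, 60)                          # P_0(x_i), ..., P_60(x_i)
k = np.arange(61)
c = (2.0 * k + 1.0) / 2.0 * (P * (w * f(x))[:, None]).sum(axis=0)   # projection coefficients

for n in (5, 10, 20, 40):
    err2 = np.sum(w * (f(x) - Leg.legval(x, c[:n + 1])) ** 2)       # squared L2 error of P_n f
    print(n, np.sqrt(err2))
# The error decays faster than any power of n, as expected for an analytic function.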

Most of the spectral methods literature is based on solving numerical problems, prominently differential equations. In this review we shift this focus to that of the more fundamental representation problem. This is not just a matter of taste, but will make the introduction of reduced basis and all its associated calculus and applications very natural and almost trivial.

6.2 Further reading

The weights (23) lead to Jacobi polynomials, which are solutions to a singular Sturm–Liouville problem (see, for example, Weidmann 1987), the properties of which guarantee the above mentioned spectral convergence. Standard examples are Legendre and Chebyshev polynomials. Their span is the same; what differs is the scalar product with respect to which they guarantee fast convergence, as well as their discrete versions and relation to interpolation, discussed in Sect. 10. They are also orthogonal (and can be normalized) with respect to their own scalar products, which helps avoid the typical conditioning problem of inverting the associated Gramian matrix. For more details on, and a quick glimpse at, spectral methods from a physicist or practitioner perspective, see Chapt. 9 of Sarbach and Tiglio (2012). Trefethen (2000) is a surprisingly compact and efficient book to get started with spectral methods (despite its title, MATLAB is not necessary at all to digest the content). A long, classic reference, especially for practitioners, is Boyd (2001); the book is legally available for free from the author’s webpage, though with some typos. A great book, in some sense targeted at ordinary differential equations, is Funaro (1992). For a modern presentation and the latest results, from theory to current approaches to beat the Gibbs phenomenon, see Hesthaven et al. (2015) and references therein.

7 Parametrized problems and optimal approximations

We are interested in approximating some abstract space of parametrized functions \({{{\mathcal {F}}}}\), which in general is not linear (the sum of two waveforms does not need to be a waveform) but we assume that it can be embedded in a Hilbert space \({{{\mathcal {H}}}}\) (for example, the space of integrable functions in the \(L_2\) sense). We denote the underlying space of parameters as \(\varPhi \), which we assume to be compact. For example, \({{{\mathcal {F}}}}\) can be the space of \(\lambda \)-parametrized solutions \(h_{\lambda }(x)\) of a partial differential equation representing the dynamics of a physical system. In this review we place emphasis on gravitational waves, parametrized for example by the mass and spin of each black hole in a binary collision.

The question discussed next is how well one can theoretically approximate all of \({{{\mathcal {F}}}}\) by a set of n basis elements of \({{{\mathcal {H}}}}\) in a linear and most compact way. This leads to the Kolmogorov n-widthFootnote 2\(d_n\) of \({{{\mathcal {F}}}}\) with respect to \({{{\mathcal {H}}}}\),

$$\begin{aligned} d_n = d_n({{{\mathcal {F}}}}, {{{\mathcal {H}}}}):= \min _{\{ e_i \}_{i=1}^n \in {{{\mathcal {H}}}}} \max _{\lambda \in \varPhi } \min _{c_i \in {\mathbb {C}}} \bigg \Vert h(\cdot ;\lambda ) - \sum _{i=1}^n c_i(\lambda ) e_i(\cdot ) \bigg \Vert ^2 . \end{aligned}$$
(24)

Comment 1

We explain the meaning of (24) by reading the \(\min , \max , \min \) operations from right to left.

  • The first \(\min \) implies that the optimal representation with respect to the (so far arbitrary) basis \(\{ e_i \}_{i=1}^n\) is used. That is, given a basis, the best representation (in the induced norm \(\Vert .\Vert \)) is considered. We already discussed that this is the orthogonal projection \({{{\mathcal {P}}}}_n\) onto the span of the basis \(\{e_i \}_{i=1}^n\), so we can replace (24) by

    $$\begin{aligned} d_n := \min _{\{ e_i \}_{i=1}^n \in {{{\mathcal {H}}}}} \max _{\lambda \in \varPhi } \bigg \Vert h(\cdot ;\lambda ) - {{{\mathcal {P}}}}_n h(\cdot ; \lambda ) \bigg \Vert ^2 . \end{aligned}$$
    (25)
  • Next, the largest error in parameter space for such a best approximation is picked. That is the “worst best” approximation given a choice of basis.

  • Finally, a choice of basis (more precisely, the subspace \(W_n\)) which minimizes this worst best error is chosen and the associated error is by definition the n-width. The compactness of \(\varPhi \) is important to guarantee that these minima and maxima exist in the search through the parameter space.

In other words, the n-width quantifies “the best that one can possibly do” with an optimally chosen n-dimensional basis.

This is mostly a theoretical problem, since solving for such a basis is impractical (or, in most cases, actually intractable) from a computational point of view because it carries combinatorial complexity (all elements of the basis have to be chosen simultaneously). It is more of a theoretical benchmark against which to compare the quality of any computable approximation that seeks an approximate solution to the n-width problem.
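To make the combinatorial character concrete, the following toy sketch brute-forces a discrete, restricted version of (25): it searches only among subsets of a small training set (which only yields an upper bound on the discrete n-width), with hypothetical damped sinusoids standing in for actual waveforms. Even in this restricted form the number of candidate bases grows combinatorially with N and n:

import itertools
import numpy as np

t = np.linspace(0.0, 10.0, 200)
lams = np.linspace(1.0, 2.0, 8)                        # only N = 8 so that brute force is feasible
K = np.array([np.sin(lam * t) * np.exp(-0.1 * lam * t) for lam in lams])
K = K / np.linalg.norm(K, axis=1, keepdims=True)       # normalized training functions

def worst_projection_error(subset):
    # Orthonormalize the chosen functions; return the largest squared projection error over the set
    Q, _ = np.linalg.qr(K[list(subset)].T)             # columns of Q span the trial subspace W_n
    residual = K.T - Q @ (Q.T @ K.T)
    return np.max(np.linalg.norm(residual, axis=0) ** 2)

n = 3
d_n_bound = min(worst_projection_error(s)
                for s in itertools.combinations(range(len(lams)), n))
print(d_n_bound)                                       # brute-force bound on the (discrete) n-width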

7.1 Parameter regularity and fast convergence

Exact expressions for the n-width can in general only be obtained in a few cases (see, for example, Magaril-Il’yaev et al. 2001). In general, the best one can do is to set up upper bounds under specific assumptions and infer the dependence of the n-width on n. Indeed, Kolmogorov calculated this distance for a special class of functions (Magaril-Il’yaev et al. 2001) whose first \((r-1)\) derivatives with respect to parameter variation are absolutely continuous, and obtained a power-law dependence for the n-width,

$$\begin{aligned} d_n \sim n^{-r} . \end{aligned}$$

If the set of functions considered were \(C^{\infty }\) with respect to parameter variation (the functions themselves can be discontinuous), the rate of convergence of its optimal basis representation would be better than any power law. This is called spectral convergence and will be relevant in later sections when quantifying the fast convergence of a reduced basis approach. This is exactly the case in many GW scenarios, since the waveforms depend smoothly on the parameters of their sources, and one therefore expects an optimal approach to have very fast (in fact, exponential) convergence, as opposed to, for example, random sampling, for which the convergence is sublinear. This is the main reason underlying the extremely high accuracy of very compact surrogate models based on reduced bases. That is, one expects in most GW scenarios

$$\begin{aligned} d_n \sim e^{-an^b} \, , \end{aligned}$$

and an optimal basis should be extremely accurate and compact; in fact, in some cases super-exponential convergence (\(b>1\)) has been found in the GW context (Field et al. 2011). The question then becomes how to find a basis which is not only computable but also nearly optimal with respect to the n-width.

It is “common knowledge” that parametrized problems with regularity with respect to parameter variation show spectral convergence of the n-width in practice. One might ask how this observation can be made at all when, as mentioned, the n-width is not computable in practice; we will return to this point when we discuss the greedy algorithm. To our knowledge there is no rigorous proof of this expectation for general parametrized systems, but there is a very compelling argument. Asymptotic exponential convergence is also observed in practice in GWs and, being physicists, that is good enough for us. The argument (courtesy of Albert Cohen) is as follows:

The convergence rate of the n-width depends on the smoothness/regularity of the functions with respect to the parameters of the problem. We now argue that if there is regularity with respect to parameter variation, as in many cases of interest, then one can expect fast (in fact, up to exponential or even super-exponential) decay of the approximation error. The argument is a standard spectral one (spectral methods were discussed in the previous section), applied here to parameter variation:

If the parametric map

$$\begin{aligned} \lambda \rightarrow h_{\lambda }(\cdot ) \end{aligned}$$

is smooth enough with respect to \(\lambda \), then it can be very well approximated in some appropriate basis in the \(\lambda \) parameter variable, for instance by Fourier or Jacobi polynomials. This means that there is an expansion of the form

$$\begin{aligned} c_1(\lambda )h_1 (\cdot ) + \cdots + c_n(\lambda ) h_n(\cdot )+ \ldots \, , \end{aligned}$$
(26)

with \(h_i \in {{{\mathcal {F}}}}\), that converges rapidly to \(h_{\lambda }(\cdot )\) in \({{{\mathcal {F}}}}\), uniformly in \(\lambda \).

Now, a partial sum of the form (26) is a member of the span of \(\{ h_1, \ldots ,h_n \}\), which means that this linear space approximates well all functions in \({{{\mathcal {F}}}}\).

What “well” means depends on the approximation result used. If the number of parameters is finite and the dependence is analytic, then exponential rates of the form \(\exp \left( -n^{1/\mathrm{dim}}\right) \) can easily be proved. If the dependence is only \(C^s\), then a rate \(n^{-s/\mathrm{dim}}\) can also easily be proved. This argument establishes, then, that there are bases in which expansions of the form (26) converge very fast, and as a consequence the n-width can only decay faster.

To summarize, for problems with smooth parametric dependence, fast convergence in terms of greedy reduced bases can be expected and it is not surprising that such global methods outperform local ones, as discussed in Sect. 8.3 below.

We mentioned that the functions themselves can be discontinuous, which might be confusing, so it is worthwhile explaining it in more detail, even if only qualitatively. Consider a problem of fluid dynamics, where the solutions might develop shocks at a finite time, and imagine that one is solving the partial differential equations for a parametrized family of initial data. This is completely fine in terms of the fast convergence of the n-width, since the location, shape, etc., of the shock depend smoothly on the initial data; the time series (in this case) itself does not need to have regularity with respect to the physical variable (time in this example), only with respect to parameter variation.

8 Reduced basis

In this section we present the RB-greedy framework for parametrized systems, which addresses the Kolmogorov problem in a quasi-optimal way. First, mathematical conventions related to the RB scheme are presented, followed by a discussion of its near-optimality with respect to different n-width behaviors.

8.1 Introduction

In previous sections we have discussed linear approximation of functions by orthogonal projection onto the span of a basis. The latter was taken to be a generic, problem-independent one, such as Fourier modes or polynomials. The reduced basis (RB) approach is a framework for efficiently solving parametrized problems, representing the solutions in a compact way, and predicting new ones based on an offline-online decomposition.

In parametrized problems one is interested in functions of the form

$$\begin{aligned} h_{\lambda } = h_{\lambda }( \cdot ) = h(\cdot ; \lambda ) \, , \end{aligned}$$

where \(\lambda \) is an (in general multidimensional) parameter, the discretization of which will define the training space.

In fields related to scientific computing and data science one is usually interested in multiple evaluations and analyses of functions in real time. The RB approach is especially tailored to the case in which some numerical problem has to be solved in order to obtain each function, and obtaining such numerical solutions is very expensive. The approach is also very powerful when large data sets already exist and one needs a sparse representation of them and/or multiple, fast operations on them.

Then a minimal, nearly optimal set of representative such solutions (waveforms, for the purpose of this review) is sought in order to construct a reduced basis (rb) for the whole solution set. These rb solutions constitute an application-specific basis. That is, a set

$$\begin{aligned} \{h_{\varLambda _i} \}_{i=1}^n \end{aligned}$$
(27)

of functions in the space of interest, for carefully chosen parameter values \(\varLambda _i\), is used as the basis itself, as opposed to a generic basis such as polynomials or Fourier modes.

The scalar product in this parametrized space is taken at fixed parameter values; that is, with respect to the physical dimension(s),

$$\begin{aligned} \langle h_i , h_j \rangle = \langle h_{\lambda _i} , h_{\lambda _j} \rangle = \langle h({\cdot ; \lambda _i}) , h({\cdot ; \lambda _j}) \rangle \; . \end{aligned}$$

The results of Sect. 6 carry over to the parametrized, application-specific case. Namely, the optimal solution to the least-squares approximation is

$$\begin{aligned} h (\cdot ; \lambda ) \approx {{{\mathcal {P}}}}_n h (\cdot ; \lambda ) = \sum _{i=1}^n c_i (\lambda ) e_i (\cdot ) \, , \end{aligned}$$
(28)

with the coefficients \(c_i\) given by the equivalent of Eq. (21), computed in the physical dimension(s) and with the reduced basis (27) relabeled as \(\{e_1, \ldots , e_n\}\). Namely,

$$\begin{aligned} c_i(\lambda ) = \sum _{j=1}^n({\mathbf {G}}^{-1} )_{ij} \langle e_j (\cdot ) , h (\cdot ; \lambda ) \rangle \, , \end{aligned}$$

where the entries of the Gramian matrix are

$$\begin{aligned} {\mathbf {G}}_{ij}:= \langle e_i , e_j \rangle . \end{aligned}$$

Clearly, the Gramian coefficients depend on the special parameters \(\varLambda _i\) associated with each basis element.

So far, the span of the reduced basis uniquely determines the rb representation; this reduced space depends only on the choice of the selected parameter points \(\varLambda _i\) that define the starting basis. As for the basis itself, for the same numerical-conditioning reasons discussed in Sect. 6 it is convenient to work with an orthonormal basis built out of the rb solutions through a simple orthonormalization procedure, which does not change their span.

8.2 The training set

A commonly used approach to construct a basis, described below in Sect. 8.3, is through a greedy algorithm. In its simplest version, the algorithm identifies a set of n points in parameter space out of a sufficiently representative set of functions of interest that are actually known. We call this set of functions the training set:

$$\begin{aligned} {{{{\mathcal {K}}}}} := \{ h_{\lambda _i} \}_{i=1}^N. \end{aligned}$$

The subset \(\{ h_{\varLambda _i} \}_{i=1}^n\) constitutes a nearly optimal basis for application-specific spectral expansions of any function in \({{{\mathcal {K}}}}\) in a precise sense discussed in Sect. 8.3.1. If there is partial redundancy/similarity in the latter, then \(n<N\) or even \(n \ll N\). Note here that we have used the symbol \({{{\mathcal {K}}}}\) to represent a discretization of the space of interest \({{{\mathcal {F}}}}\) in order to perform actual computer calculations.

The training set can be constructed by any means, including simple random or uniform sampling, more sophisticated stochastic methods (Messenger et al. 2009), the metric approach (Owen 1996), or those of Manca and Vallisneri (2010), for example. For large problems, resampling it while constructing the basis can be critical; for an application in the case of GWs see for example Blackman et al. (2014). Regardless of the method used to populate the training set, the RB-greedy formalism produces a compact and highly accurate representation of the training space catalog. The reduced basis is used to approximate other functions in the space of interest, whether they were in the training set catalog or not, through linear combinations that represent an orthogonal projection onto its span,

$$\begin{aligned} h_{\lambda } = {{{\mathcal {P}}}}_n h_{\lambda } + \delta h_{\lambda } \, , \end{aligned}$$

where the RB approximation is \({{{\mathcal {P}}}}_n h_{\lambda } \) and satisfies, by construction, \(\langle {{{\mathcal {P}}}}_n h_{\lambda }, \delta h_{\lambda } \rangle =0\).

8.3 Greedy algorithms

We mentioned that a commonly used way to generate a reduced basis is through a greedy algorithm. In its simplest form, such as when the waveforms are inexpensive to compute or the data are already somehow known, the greedy algorithm, outlined in Algorithm 1, has as input a discretization of the parameter and solution space,

$$\begin{aligned} {{{\mathcal {T}}}}:= \{\lambda _i \}_{i=1}^N \, , \quad {{{{\mathcal {K}}}}}:= \{ h_{\lambda _i} \}_{i=1}^N \, , \end{aligned}$$

with the elements of \({{{\mathcal {T}}}}\) usually called training points. To put emphasis on structure instead of form, all functions in the training set in this review are normalized. How to recover the norm, particularly for parameter estimation of any detected signal, is discussed in Sect. 14.1.

The scheme needs an arbitrary seed \(\varLambda _1 \in {{{\mathcal {T}}}}\) to initialize it, and a threshold error \(\epsilon \) for a target representation accuracy (or greedy error). Part of the output of the algorithm is a sequential selection of n parameter greedy points

$$\begin{aligned} \{\varLambda _1,\varLambda _2,\ldots , \varLambda _n \} \subset {{{\mathcal {T}}}} \end{aligned}$$

and their associated waveforms

$$\begin{aligned} \{h_{\varLambda _1},h_{\varLambda _2},\ldots , h_{\varLambda _n} \} \subset \{ h_{\lambda _i} \}_{i=1}^N. \end{aligned}$$

The set of waveforms \(\{h_{\varLambda _i}\}_{i=1}^n\) constitutes the reduced basis. As was already discussed, for numerical conditioning purposes it is sometimes convenient (see Eq. (6)) to work with an orthonormalized set \(\{ e_i \}_{i=1}^n\) instead of the \(\{ h_{\varLambda _i} \}_{i=1}^n\) directly.

The values \(\{ \lambda _i \}_{i=1}^N\) label the elements of the training set, and \(\{ \varLambda _i \}_{i=1}^n\) is the subset chosen by the greedy algorithm, with \(n\le N\) and in general \(n\ll N\) (Fig. 4).

Fig. 4
figure 4

Graphical illustration of an orthogonal projection in the context of a reduced basis-greedy approach

Another output of the algorithm is precisely the set of projection coefficients for functions in the training space catalog, whereas coefficients for any other known function can be computed through projection onto the basis. In Sect. 10 we discuss how to approximate these coefficients through interpolation in the physical dimension (time in the case of gravitational waveforms), and in Sect. 11 a predictive approach for accurate and fast evaluation of new (unknown) functions in \({{{\mathcal {F}}}}\) through surrogate models. These are predictive models: as opposed to a known function being projected into a compact basis, they predict new solutions (waveforms).

Algorithm 1: the RB-greedy algorithm (displayed as a figure in the source; its numbered steps are referred to below).
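Since the algorithm is displayed only as an image, the following is a minimal NumPy sketch of the loop it describes; the step numbering of Algorithm 1 is not reproduced, and the flat (unweighted) inner product, the seed choice, and all names are simplifying assumptions of ours. It takes a training set of normalized waveforms sampled on a common grid and returns the orthonormal basis, the greedy points (as indices into the training set) and the greedy errors:

import numpy as np

def greedy_rb(training, tol):
    # 'training' is an (N, L) array whose rows are normalized waveforms h_{lambda_i};
    # the inner product below is a plain dot product, standing in for whatever
    # quadrature approximates <., .> in practice.
    inner = lambda a, b: np.vdot(a, b)
    N = training.shape[0]
    coeffs = np.zeros((0, N), dtype=complex)      # stores <e_i, h_lambda> for every training point
    basis, picked, sigmas = [], [0], []           # arbitrary seed: the first training waveform
    while True:
        # Orthonormalize the newly selected waveform against the current basis
        # (Gram-Schmidt, iterated twice for better conditioning) and append it.
        e = np.array(training[picked[-1]], dtype=complex)
        for _ in range(2):
            for b in basis:
                e = e - inner(b, e) * b
        e = e / np.sqrt(inner(e, e).real)
        basis.append(e)
        # Update the projection coefficients and the squared representation errors.
        coeffs = np.vstack([coeffs, [inner(e, h) for h in training]])
        errors = 1.0 - np.sum(np.abs(coeffs) ** 2, axis=0)   # valid since ||h_lambda|| = 1
        sigmas.append(errors.max())
        if sigmas[-1] <= tol or len(basis) == N:
            return np.array(basis), picked, sigmas
        picked.append(int(np.argmax(errors)))                # greedy sweep: worst represented waveform

For a redundant training set, calling greedy_rb on an (N, L) array of normalized waveforms returns \(n \le N\) basis elements, with \(n \ll N\) in the situations described above.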

Comment 2

 

  • Greedy-type algorithms are global optimization procedures used in contexts outside reduced basis or even dimensional reduction.

  • Since the algorithm performs a global optimization at each step, the choice of the seed \(\varLambda _1\) is largely irrelevant. This was explicitly discussed as a sidenote in Caudill et al. (2012), see Fig. 5 below. We also refer to Sect. 12.7 for a discussion of Caudill et al. (2012) in the context of RB for multi-mode black hole ringdown.

  • What the greedy algorithm does in Step 8 is to select the waveform whose representation error with respect to the existing basis of \(i-1\) elements is worst, and in Step 11 to add it, enriching the rb representation.

  • In steps 9 and 10 the rb waveforms are orthonormalized to avoid ill-conditioning of the computation of the projection (see the discussion in Sect. 6, before 6.1).

  • Given an arbitrary user-defined tolerance \(\epsilon \) (introduced as an input to the greedy algorithm), the algorithm stops when the approximation (28) meets it,

    $$\begin{aligned} \Vert h_{\lambda } - {{{\mathcal {P}}}}_n {h_{\lambda }} \Vert ^2 \le \epsilon \,\,\,\, \forall \,\, \lambda \in {{{\mathcal {T}}}}. \end{aligned}$$
  • The expected exponential convergence of the method for the problems of interest implies that \(\epsilon \) can be made arbitrarily small with a relatively small number n of basis elements, with \(n < N\) and, in many cases, \(n \ll N\).

Fig. 5
figure 5

Image reproduced with permission from Caudill et al. (2012), copyright by IOP

This figure shows the representation error as a function of the number of reduced basis waveforms for a single QNM catalog. The authors iterate over all possible seed waveforms in the training set. The dark line represents the average, and the shaded area the maximum dispersion around it. This numerical experiment confirms that the seed choice is nearly irrelevant, as intuitively expected from the global nature of the greedy optimization at each step.

In practice, one uses a better conditioned orthogonalization algorithm in Step 9 than the standard (i.e., “classical”) Gram–Schmidt one. The naive implementation of the classical Gram–Schmidt procedure is actually ill-conditioned. This is related to the fact that the Gramian matrix, which would have to be inverted, can become nearly singular (Taylor 1978). To overcome this one can use an iterated Gram–Schmidt algorithm or a QR decomposition. A popular and widely used alternative to the classical Gram–Schmidt, called the modified Gram–Schmidt, is also ill-conditioned, contrary to common perception, since its ill-conditioning only becomes apparent for large data sets. See Ruhe (1983), Giraud et al. (2005), Stewart (2006) for discussions about the conditioning and numerical stability of different orthonormalization procedures.

Error definitions. We define the greedy error of the rb \(\{{e_i}\}_{i=1}^n\) as

$$\begin{aligned} \sigma _n := \max _{\lambda } \bigg \Vert h_{\lambda } - {{{\mathcal {P}}}}_n h_{\lambda } \bigg \Vert ^2 \le \epsilon \, , \end{aligned}$$
(29)

where \(\epsilon \) is the user-defined tolerance; the greedy error \(\sigma _n\) itself depends on the number n of basis elements.

The quantity \(\sigma _n\) represents the largest error in the parameter space of the best approximation by the greedy-reduced basis.

As discussed in Field et al. (2011), in the limit of sufficiently dense training spaces the greedy error is comparable to the minimal match (\(\mathrm {MM}\)) through

$$\begin{aligned} \sigma _n \sim 1 - \mathrm {MM} \quad \mathrm{as} \quad N \rightarrow \infty . \end{aligned}$$

Since the RB framework allows us to compress the information presented in the training set, it is useful to introduce the quantity

$$\begin{aligned} C_r:= N/n , \end{aligned}$$

called the compression factor (Salomon and Motta 2010), to measure it.

8.3.1 Convergence rates and near-optimality

The greedy algorithm chooses, in a precise sense, a nearly-optimal basis for the function spaces \({{{\mathcal {K}}}}\) or \({{{\mathcal {F}}}}\), depending on whether one is discussing the discrete or continuum cases. In order to quantify this near-optimality, recall the definition of the n-width from Sect. 7, Eq. (25), but write it in a more geometric way:

$$\begin{aligned} d_n = d_n({{{\mathcal {F}}}}, {{{\mathcal {H}}}}) = \min _{W_n \in {{{\mathcal {H}}}}} \max _{\lambda } \bigg \Vert h_\lambda - {{{\mathcal {P}}}}_n h_\lambda \bigg \Vert ^2 . \end{aligned}$$
(30)

The search space here is the whole Hilbert space and one looks for an optimal n-dimensional subspace \(W_n\) for approximating the function manifold \({{{\mathcal {F}}}}\). As was stated in Sect. 7, \(d_n\) is a theoretical benchmark for any practical linear-approximation algorithm. In practice, solving such an optimization problem is unfeasible due to its intrinsic combinatorial complexity. The nested nature of the greedy algorithm becomes crucial for reducing the complexity of the Kolmogorov problem: each \(H_m=\mathrm{Span}\{e_i\}_{i=1}^m\) satisfies \(H_1\subset H_2\subset \ldots \subset H_n\), and this feature dramatically reduces the search space in (30). One might expect this reduction of the search space to come at the expense of losing the n-width optimality, completely or in part, but, as we discuss next, this turns out not to be the case.

In which sense does the reduced basis-greedy procedure degrade the optimality in the Kolmogorov sense? To answer this, let us summarize two important results related to the greedy algorithm: if \(d_n\) decays exponentially with n then so does \(\sigma _n\) (DeVore et al. 2013),

$$\begin{aligned} d_n \le D \mathrm {e}^{-a n^\alpha } \implies \sigma _n \le \sqrt{2D}\gamma ^{-1}\mathrm {e}^{ -a'_{\alpha } n^{\alpha }} \, , \end{aligned}$$
(31)

where D, a, \(\alpha \) are positive constants.

Similarly, if the n-width has polynomial decay then so does the greedy error,

$$\begin{aligned} d_n \le D n^{-\alpha } \implies \sigma _n \le D'_{\alpha }n^{-\alpha } , \end{aligned}$$

where \(D,\alpha >0\). More generally, for any decay rate of the n-width,

$$\begin{aligned} \sigma _n \le 2\gamma ^{-1}d_{n/2} . \end{aligned}$$
(32)

The factor \(\gamma \) in Eqs. (31) and (32) is a constant in (0, 1] (DeVore et al. 2013). In recent years there have been efforts to improve these bounds. For details see Wojtaszczyk (2015) and Nguyen (2020).

In light of these results, we see that the reduced basis-greedy procedure inherits the optimality of the n-width: if the latter has exponential convergence, so does the greedy error. In this precise sense the reduced basis-greedy approach is nearly optimal: we cut down the complexity of the Kolmogorov problem at the very low expense of a loss of quality which, in most practical applications, is insignificant.

The convergence rate of the n-width (and, in consequence, of the greedy error) depends on the parametric smoothness/regularity of the functions/waveforms. Indeed, if the functions are analytic and can be extended to a complex region, exponential decay of the greedy error can be proven (see Quarteroni et al. 2016, Sect. 5.5, and citations therein). Therefore, the RB expansion is expected to converge very fast to the original waveform as the number of basis elements is increased. This is similar in spirit to the standard fast convergence of spectral methods, but here smoothness in the parameters of the problem is exploited, and the basis elements are themselves members of the space of functions of interest. For this reason RB is sometimes referred to as an application-specific spectral expansion.

To summarize, for problems with smooth parametric dependence, fast convergence in terms of greedy bases can be expected and it is not surprising that such global methods outperform local ones.

Comment 3

Not only is the choice of basis important, but so is the global nature of the expansion. To illustrate this, consider again the example of the space of smooth periodic functions from Sect. 6.1.1. If one is interested in a family of them, and is able to choose n Fourier modes to represent them as well as possible, one could:

  • build a discretization of the space of periodic functions and choose n Fourier modes which when compared against the whole training set gives the best global approximation error; this approach is a global variation of the well known best m-term approximation problem, in which one is interested in approximating some function and wants the best set of m elements chosen from a dictionary of functions to do it;

  • follow the standard approach, in which the functions of interest are projected onto the span of the first n Fourier modes. If the functions of interest are smooth, this approach gives exponentially decaying representation errors with the number of Fourier modes. This is, of course, not a predictive model, since the function to be projected has to be known.

The first path would return, by construction, a global and problem-dependent basis. The second one, on the contrary, returns a standardized basis, with no further adaptation to the problem than the minimization of the representation error due to the projection.

8.3.2 Complexity, scaling, computational aspects

Besides its near-optimality the greedy algorithm has several implementation advantages:

  1. 1.

    It is simple to implement (see Algorithm 1).

  2. 2.

    It follows a stricter approximation criterion (see Sect. 3) than POD/SVD, namely, the minimization of the worst error in parameter space, as opposed to the average one, due to the projection onto the reduced basis. It therefore provides stricter error control.

  3. 3.

    It is embarrassingly parallel. Each greedy sweep, defined by Step 7 of Alg. 1, can be carried out simultaneously and independently for different values of \(\lambda \in {{{\mathcal {T}}}}\). That is, the computation of the projection error for each element in the training space is independent for different values of \(\lambda \) and can therefore be evaluated on different processors. This is a key advantage over SVD, which involves matrix calculations that are not as easy to parallelize, since communication between processors usually requires each processor to store the whole matrix.

  4. 4.

    As we remarked in Sect. 8.3.1, the greedy points and associated bases are hierarchical (nested). This is particularly important if the algorithm is to be used to guide which numerical relativity simulations should be carried out. Since such simulations are so expensive, it is desirable that, if further accuracy is needed, it can be achieved by adding more points (as opposed to starting the process from scratch).

  5. 5.

    Low complexity. That is, the cost of each of the steps 6–12 in Alg. 1 is independent of i, leading to a total cost linear in n, the total number of basis elements. This observation follows from the fact that

    $$\begin{aligned} {{{\mathcal {P}}}}_{i+1}h_{\lambda } = {{{\mathcal {P}}}}_{i}h_{\lambda } + \langle e_{i+1} , h_{\lambda }\rangle e_{i+1}. \end{aligned}$$

    So, if the projection coefficients are stored while building the basis, at each sweep only the dot product between the elements of the training set and the last basis element has to be computed. A quick numerical check of this update formula is sketched below.
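The following check uses a random orthonormal set standing in for the \(\{e_i\}\); all names and sizes below are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((100, 5)))     # columns: an orthonormal set e_1, ..., e_5
h = rng.standard_normal(100)
P4 = Q[:, :4] @ (Q[:, :4].T @ h)                       # P_4 h
P5 = Q[:, :5] @ (Q[:, :5].T @ h)                       # P_5 h
print(np.allclose(P5, P4 + (Q[:, 4] @ h) * Q[:, 4]))   # P_5 h = P_4 h + <e_5, h> e_5: True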

8.4 Further reading

A comprehensive review of reduced basis is Quarteroni et al. (2011). For textbooks see Hesthaven et al. (2015) and Quarteroni et al. (2016).

9 Polynomial interpolation

There are several reasons why we find it useful to discuss classical polynomial interpolation:

  • It is used in ROM for gravitational wave physics and other fields; we discuss some advantages and disadvantages.

  • There is a close relationship between interpolation, quadrature rules, spectral methods and reduced basis. It is easier to introduce the standard theory first, followed by the application-specific version at the heart of ROM for parametrized systems.

The emphasis of this section is on approximation quality, quantified through convergence rates and error bounds. Except for some side comments pointing out implementations known to be poorly conditioned, we shall not discuss numerical algorithms or their implementation, which can be found in standard references.

9.1 Representation versus prediction

In standard approximation theory one can expand a function in some basis (see for example Sects. 5 and 6). Assuming that the approximation quality is measured by some weighted \(L_2\) norm, the optimal approximation is the orthogonal projection onto the representation space. This projection is independent of the basis, in the sense that it only depends on its span. Assuming these are orthonormal for simplicity and conditioning purposes, the orthogonal projection is given by

$$\begin{aligned} h(x) \approx {{{\mathcal {P}}}}_n h (x) = \sum _{i=1}^n \langle e_i, h \rangle e_i(x). \end{aligned}$$
(33)

Approximating h by its projection as in (33), while perhaps useful by itself, is not a predictive tool: we need to know the values of h(x) for all x in order to calculate the projection coefficients in Eq. (33). Here, by prediction we mean a new evaluation within the range used to build the approximating model; in the case of GWs this means, for example, within the same parameter and physical (time or frequency) ranges. We are not dealing with the much more difficult problem of predicting, say, a time series based on its previous history. To gain predictive power we first turn to an interpolation scheme.

Building an interpolant for h(x) requires only the knowledge of n evaluations of h(x); the unknown values of the function are then predicted by the interpolant. As previously mentioned, the applications are many, including numerical schemes to solve partial differential equations, the numerical computation of integrals, and the construction of predictive surrogate models. As we will see in Sect. 9.4, these seemingly very different approaches to approximation, i.e. interpolation and projection, are closely related and, within certain frameworks, even equivalent.

9.2 Prediction through polynomial interpolation

Suppose we have a collection of \((n+1)\) different points \(\{ x_i \}_{i=0}^n\)Footnote 3, and a set of associated function values \(\{ h_i := h(x_i)\}_{i=0}^n\). We assume h to be a real valued, one-dimensional function. We discuss the more challenging multivariate case in Sect. 9.5.

The interpolation problem consists in using the partial function sampling \(\{ h_i \}_{i=0}^n\) to approximate h(x) by another function which agrees exactly with h at the interpolating nodes \(\{ x_i \}_{i=0}^n\). For definiteness we focus on linear interpolation: we require the interpolant of h to be expressible as a linear combination of the \((n+1)\) linearly independent basis functions \(\{ e_i(x) \}_{i=0}^n\),

$$\begin{aligned} {{{\mathcal {I}}}}_n[h](x):= \sum _{i=0}^nC_i e_i(x) . \end{aligned}$$
(34)

We have already discussed in Sect. 6 that the optimal approximation in the least-squares sense of the above form is provided by the orthogonal projection onto the basis \(\{ e_i(x) \}_{i=0}^n\). This is shown in Eq. (33). As we pointed out above, this optimality comes at the cost of knowing the function at all x in order to compute the projection coefficients. In interpolation, one relaxes this requirement and makes a compromise between knowledge and accuracy by requiring the interpolant to match the function at the nodes:

$$\begin{aligned} {{{\mathcal {I}}}}_n[h](x_j) = h(x_j) \, , \quad j=0,\ldots , n . \end{aligned}$$
(35)

When the basis elements are linearly independent polynomials of degree \(\le n\) this is standard polynomial interpolation. One can show by explicit construction that there is a unique polynomial which satisfies the interpolation problem (34)–(35). Before doing so we give an example.

Example 5

Suppose we have two points, \(\{ x_0,x_1\}\), sampling a function h(x). Let the sample values be \(h_0 = h(x_0)\) and \(h_1 = h(x_1)\). The functions \(\{1, x\}\) comprise a suitable basis for our \(n=1\) polynomial interpolant. Then the interpolating polynomial is of degree \(\le 1\), i.e. a straight line joining the two points:

$$\begin{aligned} {{{\mathcal {I}}}}_1[h](x) = ax + b \end{aligned}$$

We explicitly solve for (ab) by writing the interpolation problem (35)

$$\begin{aligned} {{{\mathcal {I}}}}_1[h](x_0)&= ax_0 + b = h_0 \\ {{{\mathcal {I}}}}_1[h](x_1)&= ax_1 + b = h_1. \end{aligned}$$

This is a linear system. The solution reads

$$\begin{aligned} a = \frac{h_0 - h_1}{x_0-x_1} \,\, , \,\, b= \frac{x_0h_1 - x_1h_0}{x_0-x_1} , \end{aligned}$$

with the interpolant being

$$\begin{aligned} {{{\mathcal {I}}}}_1[h](x) = \left( \frac{h_0 - h_1}{x_0-x_1} \right) x + \left( \frac{x_0h_1 - x_1h_0}{x_0-x_1} \right) . \end{aligned}$$
(36)

9.2.1 General solution: Lagrange polynomials

Existence. The existence of a polynomial solving the interpolation problem (35) can be shown by explicit construction. Suppose we have access to a basis \(e_j(x) = l_j^{(n)} (x)\) and write the candidate interpolant as

$$\begin{aligned} {{{\mathcal {I}}}}_n[h](x) = \sum _{j=0}^n h_j l_j^{(n)}(x) \end{aligned}$$
(37)

where, for each \(j=0\ldots n\), \(l_j^{(n)}(x)\) is a polynomial of degree \(\le n\) such that

$$\begin{aligned} l_j^{(n)}(x_i) = \delta _{ij} \,\,\, {\text{for}}\quad i=0\ldots n. \end{aligned}$$
(38)

The symbol \(\delta _{ij}\) is the Kronecker delta (\(\delta _{ij} = 1\) if \(i=j\) and zero otherwise). If such polynomials exist, the interpolant of Eq. (37) solves the interpolation problem (35).

The Lagrange polynomials are precisely those with the property (38). They are given by

$$\begin{aligned} l_j^{(n)}(x) = \left( \prod _{k=0,k \ne j}^n (x-x_k) \right) / \left( \prod _{k=0,k \ne j}^n (x_j-x_k) \right) . \end{aligned}$$
(39)

Example 6

We look at the same case (\(n=1\)) of Example 5, but now we construct an interpolant through Lagrange polynomials. By Eq. (39) these are

$$\begin{aligned} l_0^{(1)}(x) = \frac{x-x_1}{x_0-x_1} \, , \quad l_1^{(1)}(x) = \frac{x-x_0}{x_1-x_0} \, , \end{aligned}$$

and using (37) the Lagrange form of the interpolant becomes

$$\begin{aligned} {{{\mathcal {I}}}}_1[h](x) =&\, h_0 l_0^{(1)}(x) + h_1 l_1^{(1)}(x) \nonumber \\ =&\left( \frac{h_0-h_1}{x_0-x_1} \right) x + \left( \frac{x_0h_1 - x_1h_0}{x_0-x_1} \right) \end{aligned}$$
(40)

The result (40) is the same as (36). This is not a coincidence, but a consequence of the uniqueness of the solution, which we prove next.

Uniqueness. Suppose there are two polynomials \(P_n\) and \(R_n\) of degree \(\le n\) satisfying the interpolation condition \(P_n(x_j)=R_n(x_j)=h_j,\, j=0, \ldots , n\). The polynomial \(Q(x):=P_n(x)-R_n(x)\) then satisfies \(Q(x_j)=0,\, j=0, \ldots , n\). That is, Q(x) has \((n+1)\) zeros while being of degree at most n, which is only possible if it vanishes identically. Ergo, \(Q(x)=0\) and \(P_n(x)=R_n(x)\).

9.2.2 The Vandermonde form

Yet another way of showing existence and uniqueness for the polynomial interpolant in a constructive way is the following. We write down the polynomial interpolant using monomial basis functions,

$$\begin{aligned} {{{\mathcal {I}}}}_n[h](x) = \sum _{j=0}^n a_j x^j. \end{aligned}$$

The interpolation conditions (35) become a set of linear equations for the coefficients \(a_j\)

$$\begin{aligned} {\mathbf {V}} a = h \, , \end{aligned}$$
(41)

where \({\mathbf {V}}_{ij} = x_i^{\,j}\) (with \(i,j = 0,\ldots ,n\)), \(a=[a_0,\ldots , a_n]^\top \) and \(h=[h(x_0),\ldots , h(x_n)]^\top \). The matrix \({\mathbf {V}}\) is called the Vandermonde matrix. Its determinant, called the Vandermonde polynomial, takes the form

$$\begin{aligned} \det ({\mathbf {V}}) = \prod _{0\le i< j \le n}(x_j-x_i) . \end{aligned}$$

In order to have one and only one solution to Eq. (41) this determinant has to be non-zero. In one dimension this condition is automatically satisfied whenever the interpolation points are distinct. There are ways of inverting \({\mathbf {V}}\) in order to solve Eq. (41) using Lagrange polynomials, which we will not pursue here. The main reason for briefly mentioning the Vandermonde matrix is that a generalization of it will play a prominent role in the empirical interpolation method, described in Sect. 10.
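A minimal sketch of the Vandermonde route (the nodes and sampled function below are arbitrary choices of ours), together with a reminder of why it is rarely used directly, namely that the condition number of \({\mathbf {V}}\) grows quickly with n:

import numpy as np

x = np.linspace(0.0, 1.0, 6)                    # six distinct nodes -> a unique degree-5 interpolant
h = np.sin(2.0 * np.pi * x)                     # sampled function values
V = np.vander(x, increasing=True)               # V[i, j] = x_i**j
a = np.linalg.solve(V, h)                       # monomial coefficients of Eq. (41)
print(np.allclose(np.polyval(a[::-1], x), h))   # interpolation conditions (35) hold: True
print(np.linalg.cond(np.vander(np.linspace(0.0, 1.0, 20), increasing=True)))   # grows rapidly with n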

9.3 Convergence rates, collocation points and Runge’s phenomenon

Here we discuss the behavior of the error of the polynomial interpolant \({{{\mathcal {I}}}}_n[h](x)\) of some function h(x) at \((n+1)\) nodes, to illustrate the difficulties in obtaining an accurate interpolation scheme on an unstructured set of nodes which, from the point of view of ROM, one wishes to be sparse.

Leaving aside both sparsity and unstructured meshes, one might imagine that for “nice enough” functions (for example, infinitely differentiable ones) this error decreases as the number of nodes and, therefore, the degree n of the polynomial basis increase. This is not necessarily the case: the classical counterexample, described below, is due to Runge, and the overall problem is referred to as Runge’s phenomenon.

The distinction is between local approximations of fixed degree and global ones. Because these two concepts play a key role in reduced order modeling we discuss them in this context with some detail. Suppose that one has \(\{ x_i \}_{i=0}^n\) nodes at which to interpolate a function.

9.3.1 Local interpolation

We consider local interpolants of degree one because we have already explicitly discussed them in Example 5. One would partition the interval of interest, where on each subinterval

$$\begin{aligned} I_j:= [x_j, x_{j+1}], \quad j=0, \ldots , n-1 \end{aligned}$$
(42)

one builds a local interpolant of degree one using the end points as interpolating nodes,

$$\begin{aligned} {{{\mathcal {I}}}}[h](x) = \left( \frac{h_j - h_{j+1}}{x_j-x_{j+1}} \right) x + \left( \frac{x_jh_{j+1} - x_{j+1}h_j}{x_j-x_{j+1}} \right) , \quad {\text{for}}\quad x\in I_j\, , \quad j=0, \ldots , n-1 . \end{aligned}$$
(43)

Such fixed-order interpolants, whose degree is independent of the number of nodes n, are guaranteed (as explained below) to converge to the interpolated function as the number of nodes in a given fixed physical interval increases (the grid spacing decreases). Furthermore, such convergence is guaranteed regardless of the structure/location of the interpolating nodes; they can even be randomly located. Moreover, to some degree, convergence is also guaranteed regardless of the smoothness of the data; this is why fixed-order methods, especially very low order ones, are in general used for noisy data.

This class of methods is usually very robust, but at the expense of accuracy. The loss of accuracy is exacerbated in the absence of densely sampled data. On the contrary, in reduced order modeling one seeks very sparse representations of very high accuracy. This combination of, and even trade-off between, robustness (convergence) and accuracy is a delicate and challenging issue in ROM.

9.3.2 Global interpolation

Instead, one could use a global interpolant, where all \((n+1)\) nodes and function values are used to build a single interpolant,

$$\begin{aligned} {{{\mathcal {I}}}}_n[h](x) = \sum _{i=0}^n l_i ^{(n)} (x)\, h_i \, , \quad x \in [x_0, x_n] . \end{aligned}$$

As the density of points and n increases, one would keep using all of them, thereby increasing the degree of the interpolant. This might seem desirable and more accurate than a local approach of fixed degree, since more information is used. The classical counterexample is due to Runge (1895), and it consists of interpolating the function

$$\begin{aligned} h(x) = \frac{1}{1+(5x)^2} \; \, , \quad x\in [-1,1] . \end{aligned}$$

This function is \(C^{\infty }\) (it has derivatives of all orders), yet the error of a global polynomial interpolant at equally spaced points diverges near the boundaries as the order increases (Epperson 1987); see Fig. 6. For unstructured points, such as those usually appearing in reduced models, the lack of convergence of global approximations is in general worse.

Fig. 6
figure 6

Runge’s phenomenon in polynomial interpolation. In this example the approximation uses 19 equispaced nodes over the interval \([-1, 1]\) and the interpolant polynomial is of degree 18. The interpolation tends to diverge near the boundaries. This issue cannot be solved by increasing the sampling resolution, since, for equispaced nodes, divergences get worse at higher resolutions
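The phenomenon is easy to reproduce numerically by evaluating the global interpolant in its Lagrange form (37)–(39) for Runge's function at equispaced versus Chebyshev nodes (the degrees and evaluation grid below are our own choices):

import numpy as np

def lagrange_eval(nodes, values, x):
    # Evaluate the interpolating polynomial in the Lagrange form, Eqs. (37) and (39)
    out = np.zeros_like(x)
    for j, (xj, hj) in enumerate(zip(nodes, values)):
        others = np.delete(nodes, j)
        out += hj * np.prod((x[:, None] - others) / (xj - others), axis=1)
    return out

runge = lambda x: 1.0 / (1.0 + (5.0 * x) ** 2)
x_fine = np.linspace(-1.0, 1.0, 2001)
for n in (6, 10, 14, 18):
    equi = np.linspace(-1.0, 1.0, n + 1)
    cheb = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * (n + 1)))   # Chebyshev nodes
    err_equi = np.max(np.abs(runge(x_fine) - lagrange_eval(equi, runge(equi), x_fine)))
    err_cheb = np.max(np.abs(runge(x_fine) - lagrange_eval(cheb, runge(cheb), x_fine)))
    print(n, err_equi, err_cheb)
# The equispaced error eventually grows with n, while the Chebyshev error decreases
# (slowly here, since the function has complex poles close to the interval).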

As discussed below, this can be resolved and fast convergence achieved if the interpolation nodes can be appropriately chosen. However, standard choices for such special nodes are in general not suitable for reduced order modeling. In particular, they are neither necessarily sparse nor adapted to the problem of interest. They are generic and, most importantly, in general not hierarchical, requiring the sought approximation to be recomputed from scratch if higher accuracy is required.

9.3.3 Error analysis and convergence I

In what follows we assume that h is as smooth as the context requires. Denoting the interpolation error by

$$\begin{aligned} E_n(x) := \left| h(x)- {{{\mathcal {I}}}}_n[h](x) \right| \end{aligned}$$

the following result holds:

$$\begin{aligned} E_n (x) = \left| \frac{1}{(n+1)!} h^{(n+1)}(\xi ) \omega _{n+1}(x) \right| \end{aligned}$$
(44)

where

$$\begin{aligned} \omega _{n+1}(x) := \prod _{i=0}^n (x-x_i) \end{aligned}$$

is called the nodal polynomial of order \((n+1)\), and \(\xi \) is in the smallest interval \(I_x\) containing \(x_0 \ldots x_n\) and x. Equation (44) can be proven as follows. Let’s define the function

$$\begin{aligned} H(t):=h(t)-{{{\mathcal {I}}}}_n[h](t)-\alpha (x) \omega _{n+1}(t) \end{aligned}$$

where

$$\begin{aligned} \alpha (x)=\frac{h(x)-{{{\mathcal {I}}}}_n[h](x)}{\omega _{n+1}(x)} \end{aligned}$$

This function is well defined since \(x\ne x_i\), and it is as smooth as h(t). Observe that \(H(x)=0\) and \(H(x_i)=0\) (\(i=0,\ldots,n\)). It means that H(t) has \((n+2)\) distinct zeros in the interval in which h(t) is smooth. Now, Rolle’s theorem (a special case of the Mean Value Theorem) tells us that between any pair of consecutive zeros there is at least one zero of the derivative \(H'(t)\). Then the function \(H'(t)\) has at least \((n+1)\) zeros. Repeating the previous argument, we conclude that \(H^{(n+1)}(t)\) has at least one zero. Let’s call it \(\xi \). Since \({{{\mathcal {I}}}}_n[h]\) has degree at most n and \(\omega _{n+1}\) is a monic polynomial of degree \((n+1)\), we obtain \(H^{(n+1)}(\xi )=h^{(n+1)}(\xi )-\alpha (x)(n+1)!=0\). The statement in (44) follows directly from this relation.

Let’s make a comment here. The point x at which the interpolant is evaluated when computing the error (44) does not need to be in the smallest interval containing \(x_0\ldots x_n\). Assuming an ordering \(x_0<\ldots <x_n\), sometimes the process of approximating h(x) by \({{{\mathcal {I}}}}_n[h](x)\) is called interpolation only if \(x\in [x_0,x_n]\) and extrapolation otherwise.

We get back to the error formula (44) and analyze its behavior as n increases. In particular, we ask ourselves under what conditions does

$$\begin{aligned} E_n(x) \rightarrow 0 \;\;\; {\text{as}} \;\;\; n \rightarrow \infty . \end{aligned}$$

The error (44) can be decomposed in two terms, one related to the behavior of the derivatives of h,

$$\begin{aligned} \left| \frac{h^{(n+1)}(\xi )}{(n+1)!} \right| \end{aligned}$$

and another one related to the distribution of nodes through \(\omega _{n+1}(x)\).

We first discuss why fixed order interpolants are guaranteed to converge, regardless of the distribution of nodes (assuming the interpolated function is smooth enough), since the argument is simple.

Discussion 1

Local interpolants For definiteness consider the case of equally spaced points (the generalization to an arbitrary set is straightforward),

$$\begin{aligned} x_j = x_0 + j \varDelta x\, , \quad j=0, \ldots n. \end{aligned}$$

Then the error of the interpolant (43) on each subinterval (42) is bounded by

$$\begin{aligned} E \le \frac{1}{4(m+1)} \max _{x\in I_j} \left| h^{(m+1)} (x) \right| (\varDelta x)^{m+1} \, , \end{aligned}$$

where m is the (fixed) degree of the local interpolants, each built from \((m+1)\) consecutive nodes (clearly, \(m\le n\)). Since m is fixed, the convergence rate is determined by \(\varDelta x \rightarrow 0\), and is of order \((m+1)\); for the linear interpolants (43), for instance, the error is \({{{\mathcal {O}}}}\left( (\varDelta x)^2\right) \).

Discussion 2

Global interpolants In this case \(m=n\). One can see from the interpolation error (44) that convergence is guaranteed if the derivatives of h grow with n sufficiently slowly. In fact, the interpolant converges in the infinity norm if all derivatives \(h^{(s)}\) are bounded by the same constant in the interval of interest. However, it is not straightforward to read off from (44) the allowed growth rate of the derivatives of h for which convergence or divergence of the error takes place.

In fact, it is not easy to see how such an error formula explains Runge’s phenomenon without resorting to complex calculus (Epperson 1987). However, we can ask ourselves the following related questions:

(Q1) What can be said about the maximum of \(|\omega _{n+1}(x)|\) in the interval of interest?

(Q2) If the physical problem allows such freedom, how can one choose the nodes so as to minimize such maximum?

For simplicity and without loss of generality we focus on the interval \(x\in [-1,1]\). These questions can be answered as follows:

Answer to (Q1): For all choices of nodes

$$\begin{aligned} \max _{[-1,1]} | \omega _{n+1} (x) | \ge 2^{-n} \end{aligned}$$

Answer to (Q2): If

$$\begin{aligned} \omega _{n+1}(x) = 2^{-n}T_{n+1}(x), \end{aligned}$$

where \(T_{n+1}\) denotes the Chebyshev polynomial of degree \(n+1\), then

$$\begin{aligned} \max _{[-1,1]} | \omega _{n+1} (x) | = 2^{-n} . \end{aligned}$$

In other words, using the roots of the Chebyshev polynomial as interpolating nodes minimizes the error associated with the location of such nodes. It is in this sense that Chebyshev nodes are many times referred to as optimal points for interpolation. They have the disadvantage, though, that they are neither hierarchical/nested nor adapted to the problem of interest. The former failure means, from a ROM perspective, that any model relying on them, including perhaps expensive numerical simulations, has to be completely rebuilt from scratch if higher accuracy is required. This is solved by a nearly-optimal set of nested nodes for application-specific interpolants designed with ROM problems in mind, discussed in Sect. 10.
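Both answers are easy to check numerically (the degree and evaluation grid below are arbitrary choices of ours):

import numpy as np

n = 20
x = np.linspace(-1.0, 1.0, 20001)

def max_nodal(nodes):
    # max over [-1, 1] of |omega_{n+1}(x)| = |prod_i (x - x_i)|
    return np.max(np.abs(np.prod(x[:, None] - nodes, axis=1)))

equi = np.linspace(-1.0, 1.0, n + 1)
cheb = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * (n + 1)))   # roots of T_{n+1}
print(max_nodal(equi))       # well above the lower bound 2**(-n)
print(max_nodal(cheb))       # approximately 2**(-n)
print(2.0 ** (-n))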

9.3.4 Error analysis and convergence II

In principle, direct approximation through projection requires knowledge of the continuum waveform, while interpolation only needs information at a user-specified set of nodes. We next address the relationship between these two strategies. In terms of representation accuracy, we focus on the \(L_2\) norm, since it is directly related to the overlap error commonly used in data analysis. Since, as discussed, given a basis and a linear approximation, projection is optimal in the least-squares sense, one knows that

$$\begin{aligned} \Vert h - {{{{\mathcal {P}}}}}_n h \Vert ^2\le \Vert h - {{{\mathcal {I}}}}_n [h] \Vert ^2 . \end{aligned}$$
(45)

The question is how “sub-optimal” in the above sense interpolation is. The answer obviously depends on the basis and interpolation nodes. A general framework to study this is through the definition of the Lebesgue constant.

For any arbitrary basis and interpolation scheme one can derive the more precise bound related to (45):

$$\begin{aligned} \Vert h - {{{\mathcal {I}}}}_n [h] \Vert ^2 \le \varLambda _n \Vert h - {{{{\mathcal {P}}}}}_n h \Vert ^2\, , \end{aligned}$$
(46)

where

$$\begin{aligned} \varLambda _n := \Vert {{{\mathcal {I}}}}_n \Vert ^2 . \end{aligned}$$

Comment 4

 

  1. 1.

    The \(\varLambda _n\)’s are referred to as the Lebesgue constants and depend, given a vector norm, purely on the interpolation operator \({{{\mathcal {I}}}}_n\).

  2. 2.

    In terms of accuracy, the strategy is to build an interpolant (by choosing both the basis and the nodes) such that the Lebesgue constant grows as slowly as possible with n. It is a computable quantity, so one can actually quantify how much is “lost” with respect to projection through Eq. (46).

  3. 3.

    Strictly speaking, the Lebesgue constant is usually referred to in the context of the infinity norm,

    $$\begin{aligned} \Vert h \Vert _{\infty }:= \max _{1\le i \le L} | h(t_i) | \, , \end{aligned}$$

    but throughout this review, unless otherwise stated, we refer to it in the \(L_2\) norm.

As an example, and going back to Runge’s phenomenon at equally spaced versus Chebyshev nodes for global polynomial interpolation, the following illustrates the advantages of the latter and explains why they are such a popular choice, when global polynomial interpolation is a feasible approach at all within the problem of interest, which is not always the case in practice.

Example 7

Here the \(\Vert . \Vert _\infty \) norm is used. For large n the following holds for the Lebesgue constants (see, for example, Hesthaven et al. 2007).

  1. 1.

    The Lebesgue constant grows exponentially for equally spaced nodes,

    $$\begin{aligned} \varLambda _n \sim \frac{ 2^{n+1} }{n \log n }. \end{aligned}$$
  2. 2.

    For Chebyshev nodes, instead, it only grows logarithmically,

    $$\begin{aligned} \varLambda _n \sim \log {(n)} . \end{aligned}$$
    (47)

In the context of polynomial interpolation it can be shown that the Lebesgue constant grows at least logarithmically in the infinity norm for any selection of nodes. In particular, Chebyshev nodes belong to a family that induces a nearly-optimal behavior of the Lebesgue constant, since the growth (47) is only logarithmic. For further discussion around this topic see Hesthaven et al. (2007).
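
The growth rates quoted in Example 7 are easy to observe numerically. Below is a small sketch of ours that estimates the infinity-norm Lebesgue constant \(\varLambda _n = \max _x \sum _i |\ell _i(x)|\) (with \(\ell _i\) the Lagrange cardinal polynomials) on a fine grid, for both node families; the grid size and degrees are arbitrary choices:

    import numpy as np

    def lebesgue_constant(nodes, n_fine=20_000):
        """max over [-1, 1] of sum_i |l_i(x)|, with l_i the Lagrange cardinal polynomials."""
        x = np.linspace(-1.0, 1.0, n_fine)
        total = np.zeros_like(x)
        for i, xi in enumerate(nodes):
            others = np.delete(nodes, i)
            cardinal = np.prod((x[:, None] - others) / (xi - others), axis=1)
            total += np.abs(cardinal)
        return total.max()

    for n in (5, 10, 15, 20):
        equispaced = np.linspace(-1.0, 1.0, n + 1)
        chebyshev = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * (n + 1)))
        print(n, lebesgue_constant(equispaced), lebesgue_constant(chebyshev))

The equally spaced column grows roughly exponentially with n, while the Chebyshev one stays of order a few, consistent with the logarithmic growth (47).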

9.4 Discrete expansions and interpolation

In approximation through projection, one needs to compute the projection coefficients \(\langle e_i , h \rangle \) of Eq. (33) in order to approximate a function h. In practice, the integrals for computing \(\langle e_i , h \rangle \) are replaced by a quadrature rule, i.e. a numerical approximation (we discuss quadrature rules, and specifically those built using ROM, in Sect. 13). These are then called discrete projection approximations. Here it suffices to say that, if the quadrature nodes can be chosen, one can find an optimal family of quadrature rules that maximizes the polynomial degree for which the quadrature is exact. These are called Gaussian quadratures and the nodes Gaussian nodes; examples of the latter are Legendre and Chebyshev nodes. The relevant result here relating approximation by discrete projection and interpolation is the following:

Result 1

Discrete projection and interpolation

If the projection coefficients are approximated by Gaussian quadratures and the same Gaussian nodes are used for interpolation, then the resulting discrete approximation to \({{{\mathcal {P}}}}_n h\) exactly equals the polynomial interpolant \({{{\mathcal {I}}}}_n[h]\) (Hesthaven et al. 2007).

Through this result one can guarantee fast convergence of interpolation using Gauss nodes. The problem is that in many parametrized problems of interest one cannot choose Gaussian nodes in parameter space as representative solutions, either because data is not available (for example, if taken from experiments) or because it is not efficient to do so from a modeling perspective. In Sect. 10 we discuss an approach for ROM and parametrized systems that attempts to generalize Result 1.
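
Result 1 can be verified numerically in a few lines. The following sketch of ours (using NumPy's Gauss-Legendre routines and an arbitrary smooth test function) compares the coefficients of the discrete projection onto normalized Legendre polynomials, computed with an \((n+1)\)-point Gauss-Legendre rule, with the coefficients of the interpolant at those same nodes:

    import numpy as np
    from numpy.polynomial import legendre

    n = 12                                       # highest polynomial degree (arbitrary)
    nodes, weights = legendre.leggauss(n + 1)    # (n+1)-point Gauss-Legendre rule

    # Normalized Legendre basis at the Gauss nodes: V[k, i] = e_i(x_k)
    norms = np.sqrt((2 * np.arange(n + 1) + 1) / 2.0)
    V = legendre.legvander(nodes, n) * norms

    f = lambda x: np.exp(x) * np.sin(3 * x)      # arbitrary smooth test function
    fvals = f(nodes)

    c_discrete = V.T @ (weights * fvals)         # quadrature approximation of <e_i, f>
    C_interp = np.linalg.solve(V, fvals)         # coefficients of the interpolant at the nodes

    print(np.max(np.abs(c_discrete - C_interp))) # agreement to round-off

The two sets of coefficients agree to machine precision, i.e. the discrete projection and the interpolant are the same polynomial.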

9.5 Multiple dimensions

9.5.1 Evaluation cost of multivariate polynomials

Even when enough data is available for multi-dimensional fits (enough for accurate local approximations which avoid Runge’s phenomenon), a perhaps not so well known fact is that the evaluation cost of multivariate polynomials in general grows exponentially with the dimensionality of the problem.

Naive evaluation of a 1-dimensional polynomial of degree n, that is, evaluating all the monomials in the standard format

$$\begin{aligned} p_n(x) = a_0 + a_1 x + \cdots + a_n x^n \end{aligned}$$
(48)

carries an operation count of \({{{\mathcal {O}}}}(n^2)\). Through a simple factorization known as Horner’s rule (see Stoer and Bulirsch 1980) the operation count can be reduced to 2n. Suppose we want to evaluate \(p_n(x)\). First note that \(p_n(x)\) can be decomposed in a nested manner

$$\begin{aligned} p_n(x)=a_0 + x(a_1 + x(a_2+\cdots + x(a_{n-1}+a_n x)\cdots )). \end{aligned}$$

Horner’s rule consists in computing iteratively the recursion

$$\begin{aligned} \begin{aligned} b_0=&a_n\\ b_ k =&a_{n-k} + b_{k-1}x\quad k=1,\ldots , n \end{aligned} \end{aligned}$$

reaching \(p_n(x)=b_n\) in 2n operations (one multiplication and one addition, n times). It can be shown that this is the minimum number of operations needed to evaluate a generic one-dimensional polynomial, unless some offline factorization is carried out for multiple online evaluations, in which case the computational cost can be slightly reduced.
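
For concreteness, the recursion above in a few lines of Python (with the coefficients ordered as in Eq. (48), i.e. \(a_0\) first):

    def horner(a, x):
        """Evaluate p(x) = a[0] + a[1] x + ... + a[n] x**n with 2n operations."""
        b = a[-1]                  # b_0 = a_n
        for coeff in reversed(a[:-1]):
            b = coeff + b * x      # b_k = a_{n-k} + b_{k-1} x
        return b

    print(horner([1.0, 2.0, 3.0], 0.5))   # p(x) = 1 + 2x + 3x^2 at x = 0.5, gives 2.75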

Next, consider multivariate polynomial evaluation. Suppose one attempts to evaluate the generalization of (48) for the d-dimensional case. For definiteness, in the \(d=2\) case it takes the form

$$\begin{aligned} p_n(x,y) = \sum _{i=0}^{n_x}\sum _{j=0}^{n_y} a_{ij} x^i y^j . \end{aligned}$$

The evaluation cost of such an approach in the d-dimensional case, assuming for simplicity the same degree n in each dimension, is of order \({{{\mathcal {O}}}}(n^{2d})\). Ideally, one would expect a generalization of Horner’s rule with evaluation cost of order \({{{\mathcal {O}}}}(d \times n)\). Unfortunately, no such algorithm is known, nor is there a proof of the optimal number of operations needed to perform these evaluations. There have been several proposals to address this problem, ranging from rigorous attempts (see, e.g., Peña and Sauer 1998; Moroz 2013; Umans 2008; Leiserson et al. 2010; Ballico et al. 2013; Kedlaya and Umans 2008; van der Hoeven and Lecerf 2020) to heuristics such as greedy algorithms (Ceberio and Kreinovich 2003), all aiming at reducing the operation count in the evaluation of the polynomial.

Multivariate polynomial evaluations scale exponentially with the dimensionality of the problem. This is a severe issue in most cases of interest, one that should not be underestimated. In ROM, the limitations of multivariate polynomial approximations appear in almost every possible setting. This point is discussed in more detail and emphasized in Sect. 15.
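
To make the scaling with dimension concrete, the short sketch below (ours; the degree is an arbitrary choice) counts the coefficients of a dense tensor-product polynomial as d grows, which is already a lower bound on the evaluation work of any scheme that must read every coefficient, and evaluates a 2-D example with NumPy's nested polynomial evaluation:

    import numpy as np
    from numpy.polynomial import polynomial as P

    n = 10                                    # degree per dimension (arbitrary choice)
    for d in range(1, 7):
        print(f"d = {d}: (n+1)^d = {(n + 1) ** d:,} coefficients")

    # A dense 2-D tensor-product polynomial; a single evaluation already
    # touches all (n+1)^2 coefficients.
    coeffs_2d = np.random.default_rng(0).standard_normal((n + 1, n + 1))
    print(P.polyval2d(0.3, -0.7, coeffs_2d))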

9.6 Further reading

Gasca and Sauer (2000) reviews multivariate polynomial interpolation, and Narayan and Xiu (2012) presents a proposal for polynomial interpolation on unstructured grids in multiple dimensions.

There are schemes for fast multiple evaluations of multivariate polynomials, though. They are based on an offline factorization of the given polynomial (which can be expensive, but done only once), with a total cost of \({{{{\mathcal {O}}}}}(n^{d +1})\) for \({{{\mathcal {O}}}}(n^d)\) evaluations, leading to an average of \({{{\mathcal {O}}}}(n)\) for each evaluation (Lodha and Goldman 1997). To our knowledge this problem of multi-dimensionality has not yet been tackled in ROM for gravitational waves and is still largely an open problem in approximation theory. Instead, in GW science local interpolants in the form of splines are being used, which results in a degradation of the accuracy of the surrogate models being built. For more than one dimension, sparse grids are being used. One of their problems is that they are not necessarily hierarchical, leading to a problem when building training sets from numerical relativity to later apply the RB framework.

10 The empirical interpolation method

The empirical interpolation method (EIM) was proposed in 2004 (Barrault et al. 2004) as a way of identifying a good set of interpolation nodes on multi-dimensional unstructured meshes and has since found numerous applications (for a very short list, see Aanonsen 2009; Chaturantabut and Sorensen 2009, 2010; Eftang and Stamm 2012; Maday et al. 2009). When the basis is application-specific, so is the EIM.

Here, perhaps contrary to what might be suspected, interpolation is in the physical dimension(s), not in the parametric ones. For definiteness, in the case of gravitational waves this is either the time or frequency domain.

Before getting into technical details, we highlight some of the properties of the EIM:

  • It is hierarchical (nested). The “most relevant” points in the physical dimension(s) and associated interpolants are selected and built, respectively. This follows a process that is dual, in a precise sense, to the construction of a reduced basis using a greedy algorithm.

  • It is designed to overcome the difficulties of stability, accuracy and evaluation cost on sparse and scattered grids, especially in multiple dimensions.

  • It is highly accurate. By design, the algorithm attempts to control the behavior of the Lebesgue constant (see Sect. 10.3) at each loop.

  • It provides an affine parametrization of a model. This has multiple consequences. In particular, it allows for the design of reduced order quadratures, described in Sect. 13, which lead in particular to fast likelihood evaluations, as discussed in Sect. 14.

The material below relies on standard interpolation ideas from non-parametrized problems which we have summarized in Sect. 9. We will refer to the physical variable(s) as x, the nodes selected by the EIM as \(X_i\,(i=1\ldots n)\), the (in general multidimensional) parameter as \(\lambda \) and its corresponding greedy points as \(\varLambda _i\,( i=1\ldots n)\). It is not a coincidence that the number of EIM nodes equals the number of greedy points; in fact, they go hand in hand.

10.1 EIM: algorithm and properties

The EIM approximation of a function \(h(x;\lambda )\) takes the form

$$\begin{aligned} h(x; \lambda ) \approx {{{\mathcal {I}}}}_n [h](x; \lambda )= \sum _{i=1}^n B_i(x) h(X_i; \lambda ) . \end{aligned}$$
(49)

This represents an empirical (that is, problem-dependent) global interpolant, which also provides an affine parametrization of h. The latter means that the r.h.s. of (49) is a sum of terms which depend only on x (the \(B_i\) coefficients) multiplied by other ones which depend only on \(\lambda \) (the \(h(X_i; \lambda )\)); one might also want to refer to this as “separability”.

In the following subsections we discuss how one arrives at (49), how to compute the \(B_i\) coefficients in the offline stage, and the quality of this approximation.

In addition, even though it is perhaps not usually viewed this way, or not emphasized enough compared to fast online evaluations of solutions of parametrized PDEs, the EIM also provides an application-specific down- and upsampling that, for practical purposes, beats Nyquist downsampling, with implications for signal or, more generally, data processing. See for example Sect. 11.2 and the discussion around Figs. 13 and 14.

10.2 From projection to interpolation

In Sect. 3 we discussed that, given a basis (assuming for simplicity and conditioning that it is orthonormal) of cardinality n, the optimal linear representation in any weighted \(L_2\) norm is through an expression of the form

$$\begin{aligned} h(x ; \lambda ) \approx {{{\mathcal {P}}}}_n h(x; \lambda ) = \sum _{i=1}^n c_i (\lambda ) e_i (x) \, , \end{aligned}$$
(50)

where

$$\begin{aligned} c_i (\lambda ) = \langle e_i (\cdot ), h(\cdot ;\lambda ) \rangle . \end{aligned}$$
(51)

This is the standard approximation by projection, the most common choices of basis being polynomials or Fourier modes. Now, approximation through projection requires knowledge of the function to be represented at sufficiently many values of x so as to accurately compute the coefficients (51). In Sect. 9 we discussed how this is usually replaced by interpolation, and the strong relationship between projection and interpolation in standard spectral theory. Next we describe the EIM strategy, which mimics the spectral approach.

The approximation through projection onto a reduced basis, Eq. (50), is replaced by interpolation (in the physical dimension(s), x) as follows. First, some basis \(\{ e_i (x) \}\) is built, e.g. through a RB or POD approach. Then, the interpolant is sought to have the form

$$\begin{aligned} {{{\mathcal {I}}}}_n [h](x; \lambda ) := \sum _{i=1}^n C_i (\lambda ) e_i(x)\, , \end{aligned}$$
(52)

where the expansion coefficients \(C_i\), which are not projection coefficients, are on purpose denoted by capital letters to distinguish them from the projection ones (51). Instead, they are defined to be solutions of an interpolation problem: given the EIM nodes (the construction of which we discuss below), we want the interpolant to exactly match the function at those nodes,

$$\begin{aligned} {{{\mathcal {I}}}}_n [h](X_i; \lambda ) = h(X_i;\lambda ) , \qquad \forall \, i=1,\dots ,n . \end{aligned}$$
(53)

For the moment, we shall assume that the EIM nodes \(X_i\) are known and proceed to describe how to use them to find the EIM interpolant. This can be somewhat misleading, since it is not the way it works in practice, which is: the first EIM node is found, its associated interpolant built, the second EIM node found, the interpolant enriched to take it into account, and so on. They are not disjoint processes, unlike (again) standard spectral methods, where all the Gaussian nodes are found first and the interpolant built afterwards. This is not only a procedural feature, but highlights a key difference of the EIM: namely, that the approach (nodes and interpolant) is hierarchical. Hopefully this gradual presentation is intuitive and pedagogical; later in this section we present the full, coupled algorithm together with its numerical intricacies.

Equation (53) is equivalent to solving the n-by-n system

$$\begin{aligned} \sum _{i=1}^n {\mathbf {V}}_{ji}C_i(\lambda ) = h (X_j; \lambda ), \quad j=1, \ldots , n \end{aligned}$$

for the coefficients \(\{ C_i \}_{i=1}^n\), where the interpolation matrix

$$\begin{aligned} {\mathbf {V}} := \left( \begin{array}{cccc} e_1(X_1) &{} e_2(X_1) &{} \cdots &{} e_{n}(X_1) \\ e_1(X_2) &{} e_2(X_2) &{} \cdots &{} e_{n}(X_2) \\ e_1(X_3) &{} e_2(X_3) &{} \cdots &{} e_{n}(X_3) \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ e_1(X_{n}) &{} e_2(X_{n}) &{} \cdots &{} e_{n}(X_{n}) \\ \end{array} \right) \end{aligned}$$
(54)

is a generalization of the Vandermonde matrix [Eq. (41)], which corresponds to a basis composed of monomials. The polynomial Vandermonde matrix can easily become very ill-conditioned if the nodes are chosen, for example, equally spaced (Gautschi 2012), not to mention if they are scattered. So one can anticipate that the solution to the problem (52) is already non-trivial; one of the goals of the EIM is to make sure that it does not lead to an ill-conditioned problem while providing a highly accurate interpolant.

The nodes given by the EIM and the basis’ linear independence ensure that \({\mathbf {V}}\) [Eq. (54)] is invertible so that

$$\begin{aligned} C_i = \sum _{j=1}^n \left( {\mathbf {V}}^{-1}\right) _{ij}h (X_j; \lambda ) \end{aligned}$$
(55)

is the unique solution to (53). It then follows upon substituting (55) into (52) that the empirical interpolant can be written as

$$\begin{aligned} {{{\mathcal {I}}}}_n [h](x;\lambda ) = \sum _{j=1}^n B_j (x) h(X_j; \lambda ) \end{aligned}$$
(56)

where

$$\begin{aligned} B_j(x) := \sum _{i=1}^n e_i (x) \left( {\mathbf {V}}^{-1} \right) _{ij} \end{aligned}$$
(57)

is independent of \(\lambda \).

Note that (56) is a linear combination of waveform evaluations at the empirical nodes. The coefficients \(\{ B_i \}_{i=1}^n\) are directly built from the basis and satisfy \(B_i(X_j)=\delta _{ij}\). This follows from the uniqueness of the solution to the interpolation problem and the constraint equations summarized in (53). These coefficients provide a clean offline/online separation. Because of this, the \(\{B_i\}_{i=1}^n\) functions can be pre-computed offline once the basis is generated, while the (fast) interpolation is evaluated during the online stage from (56) when the parameter \(\lambda \) is specified by the user. Evaluations of the waveform at the EIM nodes are still needed at the arbitrarily chosen parameter \(\lambda \) in order to construct the interpolant in (56). One can wonder how this can be of any use. In Sect. 11 we explain how to build surrogate predictive models for the evaluations of \( h(X_j; \lambda )\) for any \(\lambda \) (that is, not present in the original training space).
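
In matrix form, Eqs. (54)–(57) amount to a single linear solve. The sketch below uses our own notation: basis is an n × L array whose i-th row samples \(e_i\) on the physical grid, and nodes holds the indices of the EIM points on that grid.

    import numpy as np

    def eim_interpolant_matrix(basis, nodes):
        """Return B (L x n) such that B @ h_at_nodes realizes Eq. (56)."""
        V = basis[:, nodes].T                  # V_{ji} = e_i(X_j), Eq. (54)
        return np.linalg.solve(V.T, basis).T   # B = basis^T V^{-1}, Eq. (57)

    # Online stage for a new parameter value: evaluate h only at the EIM nodes
    # (h_at_nodes, length n) and recover the full length-L series via Eq. (56):
    #     h_interp = B @ h_at_nodes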

Next we discuss how to compute the EIM nodes in a rather qualitative way, before presenting the general algorithm later on.

Example 8

Finding the EIM nodes and building the interpolant.

Consider for definiteness a time series. The algorithm takes as input the basis set \(\{ e_i \}_{i=1}^n\) and an arbitrary choice of time samples \(\{ t_i\}_{i=1}^{L}\) from which the empirical nodes \(\{ T_i \}_{i=1}^n\) have to be selected. The EIM algorithm proceeds as follows.

  1. 1.

    The first time node is chosen to maximize the value of \(| e_1(t_i) |\); that is, \(\left| e_1(T_1) \right| \ge \left| e_1(t_i) \right| \) for all time samples \(t_i\).

  2. 2.

    Next, an interpolant for the second basis function is built using only the first basis function: from Eqs. (52,53) we have \({{{\mathcal {I}}}}_1 [e_2](t)= C_1 e_1(t)\) where \(C_1 = e_2(T_1) / e_1(T_1)\) has been found from Eq. (53) with \(i=1\).

  3. 3.

    The second empirical node is chosen to maximize the pointwise interpolation error \(\left| {{{\mathcal {I}}}}_1 [e_2](t) - e_2(t)\right| \); that is, \(\left| {{{\mathcal {I}}}}_1 [e_2] (T_2) - e_2(T_2) \right| \ge \left| {{{\mathcal {I}}}}_1 [e_2](t_i) - e_2(t_i) \right| \) for all time samples.

  4. 4.

    Steps 2 and 3 are then repeated to select the remaining \((n-2)\) nodes. The full algorithm for generating the EIM nodes is shown in Algorithm 2.

Comment 5

As described, the EIM follows a greedy approach, albeit somewhat different from the one usually used to build a reduced basis. While a greedy algorithm to build a RB selects the most relevant points in parameter space, the EIM selects the most relevant points in the physical dimension(s). In addition, the former uses a (possibly weighted) \(L_2\) norm while the EIM uses the infinity one.

Comment 6

Notice also that, even though constructing the EIM is also done offline, it has a much smaller computational cost than forming the basis: since it operates only on this basis, the training set is not involved anymore.

The following example graphically shows the first three iterations of the EIM algorithm applied to Legendre polynomials.

Example 9

Consider the set \(\{P_i(x)\}_{i=0}^{n}\) of \((n+1)\) normalized Legendre polynomials defined on \([-1,1]\). These form an orthonormal basis for the space of polynomials of degree at most n with respect to the weight function \(\omega (x) \equiv 1\). Their ordering is important for approximations: expanding, by orthogonal projection, a smooth function using the first n Legendre polynomials typically results in exponential convergence (Hesthaven et al. 2007; Wang and Xiang 2011). Said another way, \(P_0\) is the most important basis function, \(P_1\) the next most important, and so on.

Given a convergence rate for the aforementioned Legendre projection-based approximation, we might wonder how much accuracy is lost by trading it for the interpolation (52) and how to optimally choose the nodes \(\{X_i\}\). When the relevant error measurement is the maximum pointwise error, Chebyshev nodes are known to be well suited for interpolation, bringing an additional error factor bounded by \(\sim \log (n)\) (Press et al. 1992; Quarteroni et al. 2010), as discussed in Sect. 9.3 (see Example 7). For this reason, Chebyshev nodes are especially well suited to benchmark the nodes selected by the EIM in this example.

We select the EIM nodes according to the same preferential ordering as the Legendre polynomials themselves; the first six normalized polynomials are shown in Fig. 7.

Fig. 7

The first six normalized Legendre polynomials

For this example we chose as basis the first 24 Legendre polynomials. Figure 8 shows the process of the EIM algorithm to choose the interpolation nodes for the Legendre basis. The first EIM node is defined by the location of max\((|P_0|)\). Since \(P_0\) is a constant function there is no preference in the search, so we simply select the middle point \(x=0\). To identify the second node we

  1. 1.

    build the empirical interpolant \({{{\mathcal {I}}}}_0[P_1]\) of \(P_1\) using \(P_0\) as the basis and \(x=0\) as interpolation node (the first EIM node),

  2. 2.

    compute the pointwise error \(r_1 = P_1 - {{{\mathcal {I}}}}_0[P_1] \), and

  3. 3.

    select the second EIM node by the location of max\((\left| r_1\right| )\). In this case we find \(r_1 = P_1\) since \({{{\mathcal {I}}}}_0[P_1]\) is zero due to the fact that \(P_1(0) = 0\).

The process continues until the number of EIM nodes equals the number of basis elements.

Fig. 8

Representation of the EIM algorithm acting on the first 24 normalized Legendre basis functions. We show the basis functions, residuals and EIM nodes involved in the iterations. The first three panels (upper-left, upper-right and bottom-left) illustrate the two steps of the algorithm that maximize residuals and select the EIM nodes. The last panel (bottom-right) shows the last iteration of the algorithm, which corresponds to the 24th basis function. The vertical line corresponds to the residual’s maximum

In Fig. 9 we show the 24 EIM nodes found by the algorithm and compare them to the first 24 Chebyshev nodes in the interval \([-1, 1]\). Both node distributions are qualitatively similar. Note that the EIM distribution emulates the Chebyshev nodes’ clustering near the edges \(x= \pm 1\). Maday et al. (2009), which has inspired our numerical experiment with Legendre polynomials, also compares the Lebesgue constant(s) for the EIM nodes of this example and the Chebyshev nodes.

Fig. 9

A direct comparison between the distribution of the 24 EIM nodes of Example 9 and the first 24 Chebyshev nodes

Example 10

Taken from Canizares et al. (2013), this example illustrates the application of the EIM algorithm to basis functions for which a good set of interpolation nodes is unknown. The fiducial model is a sine-Gaussian family of the form

$$\begin{aligned} h(t; \lambda ) := A e^{-(t-t_c)^2/(2\alpha ^2)}\sin (2\pi f_0 (t-t_c)) \, , \end{aligned}$$

which models gravitational waves from generic burst processes. Here A, \(f_0\) and \(\alpha \) are the amplitude, frequency and width of the waveform h, respectively, \(t_c\) is the arrival time of the GW-burst signal, and \(t\in (-\infty ,\infty )\). The Fourier transform (FT) of this waveform is given by

$$\begin{aligned} {{{\tilde{h}}}} (f,t_c; \lambda )= e^{i2\pi f t_c}{{{\tilde{h}}}} (f; \lambda ), \end{aligned}$$

where \({{{\tilde{h}}}} (f; \lambda )\) is the FT of the GW-burst at \(t_c = 0\):

$$\begin{aligned} {{{\tilde{h}}}} (f; \lambda ) = i2A\alpha \sqrt{2\pi } \sinh (4\pi ^2\alpha ^2f_0f) e^{-2\pi ^2\alpha ^2(f_0^2+f^2)} . \end{aligned}$$

This waveform family is, then, described by four free parameters \(\lambda = (\alpha ,f_0, t_c, A)\). We focus on the two parameters \((\alpha ,f_0)\), since the others are extrinsic and can be handled differently. The parameter space considered is

$$\begin{aligned} \alpha \in [0.02,2] \, \mathrm{s} \, , \quad f_0 \in [0.01,1] \, \mathrm{Hz} \, . \end{aligned}$$
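
For readers who want to reproduce this example, the waveform family and its Fourier transform above translate directly into code. A minimal sketch (parameter values arbitrary within the quoted ranges):

    import numpy as np

    def sine_gaussian(t, alpha, f0, A=1.0, tc=0.0):
        """Time-domain sine-Gaussian burst h(t; alpha, f0, tc, A)."""
        return A * np.exp(-(t - tc)**2 / (2 * alpha**2)) * np.sin(2 * np.pi * f0 * (t - tc))

    def sine_gaussian_fd(f, alpha, f0, A=1.0, tc=0.0):
        """Frequency-domain waveform, including the e^{2 pi i f t_c} factor."""
        h0 = (1j * 2 * A * alpha * np.sqrt(2 * np.pi)
              * np.sinh(4 * np.pi**2 * alpha**2 * f0 * f)
              * np.exp(-2 * np.pi**2 * alpha**2 * (f0**2 + f**2)))
        return np.exp(2j * np.pi * f * tc) * h0

    t = np.linspace(-20.0, 20.0, 4001)            # seconds
    h = sine_gaussian(t, alpha=1.0, f0=0.5)       # one member of the family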

Figure 10 provides a graphical illustration of the first EIM iterations.

Fig. 10

Iterations 1 (left) and 2 (right) of the EIM algorithm for Example 10. The first EIM node is defined by the location of max\((|e_1|)\). To identify the second node one (i) builds the empirical interpolant \({{{\mathcal {I}}}}_1[e_2]\) of \(e_2\) using \(e_1\) and the EIM node \(F_1\), (ii) computes the pointwise error \(\mathcal{I}_1[e_2] - e_2\), and (iii) defines the second EIM node by the location of max\((\left| {{{\mathcal {I}}}}_1[e_2] - e_2\right| )\). The process continues until all n empirical interpolation nodes are found. Images reproduced with permission from Canizares et al. (2013), copyright by APS

10.3 EIM algorithm

We present the EIM algorithm in the form used in Field et al. (2014).

Algorithm 2 The EIM node-selection algorithm
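
In lieu of reproducing the algorithm figure, the following NumPy sketch paraphrases the steps described in Example 8 (it is our own transcription, not a verbatim copy of Algorithm 2 of Field et al. 2014); basis is an n × L array whose i-th row samples \(e_i\) on the chosen grid, and the returned values are the indices of the EIM nodes:

    import numpy as np

    def eim_nodes(basis):
        """Greedy selection of the EIM node indices (Example 8, Steps 1-4)."""
        n, L = basis.shape
        nodes = [int(np.argmax(np.abs(basis[0])))]          # first node: max of |e_1|
        for j in range(1, n):
            V = basis[:j, nodes].T                          # current interpolation matrix
            C = np.linalg.solve(V, basis[j, nodes])         # interpolate e_{j+1} with e_1..e_j
            residual = basis[j] - C @ basis[:j]             # pointwise interpolation error
            nodes.append(int(np.argmax(np.abs(residual))))  # next node: max |residual|
        return np.array(nodes)

The interpolant matrix B of Eq. (57) then follows from these nodes as in the sketch of Sect. 10.2.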

Comment 7

We make some remarks on the algorithm:

  • It is a greedy-type algorithm in the infinity norm.

  • The iteration is performed only on the basis, and there is no need for a training set as was the case in the construction of the basis itself.

  • At each greedy sweep it selects the point at which the current basis function differs most from its interpolant built so far.

  • The selection of EIM nodes is basis-dependent.

10.4 Accuracy and conditioning of the EIM

In Sect. 9.3 we discussed how the quality of an interpolant in an \(L_2\) norm is related to the optimal projection representation. In general, with interpolation one loses accuracy but with the advantage of needing less information to compute the approximation. One can bound the interpolation error by introducing a Lebesgue constant (see Eq. (46)):

$$\begin{aligned} \Vert h_\lambda - {{{\mathcal {I}}}}_n [h_\lambda ] \Vert ^2 \le \varLambda _n \Vert h_\lambda - {{{{\mathcal {P}}}}}_n h_\lambda \Vert ^2,\quad \varLambda _n = \Vert {{{\mathcal {I}}}}_n \Vert ^2. \end{aligned}$$

Taking the maximum in both sides of the inequality, and recalling the definition of the greedy projection error in Eq. (29), we obtain

$$\begin{aligned} \max _{\lambda } \Vert h{_\lambda } - {{{\mathcal {I}}}}_n [h_{\lambda }] \Vert ^2 \le \varLambda _n \sigma _n . \end{aligned}$$
(58)

Comment 8

The maxima in (58) involve some ambiguity, as they might refer to those taken over the training set or over the underlying space of interest. Which one is meant is problem- and user-dependent, but the above results apply to both.

Comment 9

As discussed below, the Lebesgue constant is an a posteriori computable number, so if one has an estimate of the projection error \(\sigma _n\), one also has an estimate of the interpolation one. Assuming there is enough training data, an estimate for \(\sigma _n\) can be obtained by standard validation tests. Otherwise, if the data is sparse, one may apply other standard methods from machine learning such as k-fold or leave-one-out cross-validation.

In the context of the EIM, some obvious questions arise from these observations: how is the EIM related to the Lebesgue constant? Does it optimize for accuracy in some way? These questions were addressed in Tiglio and Villanueva (2021b).

First notice that, at each step of the EIM, when building the interpolant, a Vandermonde matrix \({\mathbf {V}}_n\) [Eq. (54)] has to be inverted for computing the interpolation coefficients. The nature of the algorithm already ensures the invertibility of \({\mathbf {V}}_n\) at all steps. It turns out that the EIM in fact optimizes with respect to this invertibility.

Theorem. Define \(\det ({\mathbf {V}}_j):=V_j(X_1,\ldots ,X_j)\). Then, the residual \(r_j(x)\) computed in the \((j-1)\)-th iteration of the EIM loop satisfies

$$\begin{aligned} r_j(x)=\frac{V_j(X_1,\ldots , X_{j-1},x)}{V_{j-1}(X_1,\ldots ,X_{j-1})}\quad j=2,3, \ldots n . \end{aligned}$$

In consequence, once \(X_j\) is chosen in Step 6, the residual at this node becomes

$$\begin{aligned} r_j(X_j)=\frac{V_j(X_1,\ldots , X_{j-1},X_j)}{V_{j-1}(X_1,\ldots ,X_{j-1})}=\frac{{\text{det}}({\mathbf {V}}_j)}{{\text{det}}({\mathbf {V}}_{j-1})}. \end{aligned}$$

The proof can be found in Tiglio and Villanueva (2021b). This result says that, at each j-step, the EIM algorithm selects a new node \(X_j\) in order to maximize the modulus of the determinant of \({\mathbf {V}}_j\), making the Vandermonde matrix as invertible as possible at each iteration.

The following is a useful identity to compute Lebesgue constants in practice (Antil et al. 2013): if the basis vectors are orthonormal in the 2–norm \(\Vert \cdot \Vert _2\), then

$$\begin{aligned} \varLambda _n=\Vert {\mathbf {V}}_n^{-1}\Vert ^2_2. \end{aligned}$$

We use this relation and the expression of the inverse of a matrix in terms of its adjugate (classical adjoint) to rewrite the Lebesgue constant as

$$\begin{aligned} \varLambda _n=\frac{\Vert {\text{adj}}({\mathbf {V}}_n)\Vert ^2_2}{|{\text {det}}({\mathbf {V}}_n)|^2}. \end{aligned}$$
(59)

From this expression we see that, by maximizing the determinant of the Vandermonde matrix at each step, the EIM attempts to partially control the growth of \(\varLambda _n\), in the sense of making the denominator of (59) as large as possible, but without controlling its numerator.

So the algorithm solves neither the global optimization problem of finding the set of nodes that maximizes the determinant of the Vandermonde matrix (it is a nested approach), nor the partial problem of minimizing the whole Lebesgue constant at each step, as is usually thought. Analogous observations can be made for the conditioning of \({\mathbf {V}}_n\), since the Lebesgue constant and the condition number of \({\mathbf {V}}_n\) are directly related by

$$\begin{aligned} \kappa _n = \Vert {\mathbf {V}}_n\Vert _2\varLambda _n. \end{aligned}$$

For further discussions around these topics in the context of GWs, see Tiglio and Villanueva (2021b).
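
With an orthonormal basis at hand, the diagnostics above are a couple of lines of linear algebra. A small sketch, reusing the interpolation matrix V of Eq. (54) and the relations just quoted:

    import numpy as np

    def eim_diagnostics(V):
        """Lebesgue constant and condition number associated with V, Eq. (54)."""
        Vinv = np.linalg.inv(V)
        lebesgue = np.linalg.norm(Vinv, 2) ** 2    # Lambda_n = ||V^{-1}||_2^2
        kappa = np.linalg.norm(V, 2) * lebesgue    # kappa_n = ||V||_2 Lambda_n
        return lebesgue, kappa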

Finally, we point out that, in the context of the EIM, a rather pessimistic bound for the Lebesgue constant can be derived (Chaturantabut and Sorensen 2010):

$$\begin{aligned} \varLambda _n\le (1+\sqrt{2L})^{n-1} \Vert e_1\Vert _{\infty }^{-1}, \end{aligned}$$

where L stands for the size of the sampling of the x variable. As an a priori estimate, this bound is not useful in practice since it does not reflect the observed growth of \(\varLambda _n\), which is, in most cases of interest, much slower. Sharper a priori estimates are difficult to prove. In practice, one computes the Lebesgue constant a posteriori.

10.5 Further reading

The original papers introducing the EIM are Barrault et al. (2004) and Maday et al. (2009). Chaturantabut and Sorensen (2010) presents a very clearly explained discrete version with emphasis on solutions to non-linear time-dependent parametrized equations. In the latter case one is interested in approximating non-linear terms, in similarity with Galerkin versus collocation methods when solving partial differential equations. An hp-refinement of the standard EIM approach is presented in Eftang and Stamm (2012); this in general should be of very practical importance for several scenarios, such as when there are discontinuities with respect to parameter variation, but it has so far not been used in GW science. The study of conditioning of Vandermonde matrices can be traced to Gautschi (1983), where orthogonal polynomials are used to improve the condition number. See also Gautschi (2011, 2012), and, for a short review, Higham (2014). For different studies around Vandermonde-like matrices see Kuian et al. (2019), Demmel and Koev (2006), Pan (2015), Reichel and Opfer (1991).

11 Surrogate models

For the purposes of this review a surrogate model is any predictive model. The first type of gravitational wave surrogates followed the lines of fitting for the projection coefficients of a reduced basis, be it obtained through a POD or a greedy approach. That is, given a basis \(\{ e_i \}_{i=1}^n\) the RB approximation for a waveform h is

$$\begin{aligned} h(t ; \lambda ) := \sum _{i=1}^n c_i(\lambda ) e_i(t) \, , \end{aligned}$$
(60)

where the basis elements \(\{ e_i\}\) can be waveforms themselves as in the RB-greedy approach, or linear combinations of them as in the POD approach.

This representation requires knowledge of the projection coefficients \(\{ c_i\}\), which are only known if the waveform itself is known. That is, as it stands, Eq. (60) is not a predictive tool but a representation one. A natural step to predict new waveforms is to build a surrogate simply by fitting in some way for new values of the parameter \(\lambda \):

$$\begin{aligned} h_{\mathrm{s}}(t, \lambda ) := \sum _{i=1}^n c_{{\mathrm{s}}, i}(\lambda ) e_i(t) \, , \end{aligned}$$

where \(\{ c_{{\mathrm{s}}\, , i} \}\) are n approximations to the true projection coefficients \(\{ c_i \}\). There is a large variety of ways to fit for these coefficients, such as polynomial interpolation, splines, radial basis functions, or ML regression approaches. A number of them (without much success) were studied in Kaye (2012).

One of the problems found when the basis elements are not gravitational waves themselves, as in the POD case, or when using the auxiliary orthonormal basis in the RB-greedy approach, is that the corresponding projection coefficients have a non-physical, complicated structure in their parameter dependence that is difficult to represent; this is also discussed in Appendix F of Field et al. (2014). One solution is to use as basis elements the waveforms themselves, something that is not possible within the POD approach, since it does not provide a set of the most representative waveforms to use. Another approach, if a dense training space is available, is to use local fits (such as splines) to find these unknown projection coefficients. The limitation of needing a dense training set is precisely being able to generate it.

The main source of errors in these approaches is the fitting procedure, which in general has slow convergence. Therefore, fitting only at the EIM nodes, as described below, serves at least two goals: (i) to minimize the source of fitting errors by doing so only at those nodes, and (ii) to provide surrogate models that are very fast to evaluate. Impressive results, which we discuss in Sect. 12, have been achieved using this approach. Other approaches follow non-algorithmic schemes, and the literature in this field is rather new and in constant evolution. To mention only a few, see for example Hesthaven and Ubbiali (2018) and Wang et al. (2019), in which neural networks are combined with reduced order modeling, or Chua et al. (2019) for the same approach in the context of GWs. We leave for future updates of this review a thorough discussion of these topics.

11.1 Surrogate models for components

Even though we have emphasized data-driven approaches, some domain knowledge is still (very) useful. For example, the GWs emitted by two black holes have an apparent complexity, but can be decomposed into components with simpler structure that are easier to build models for. In the non-precessing case, for instance, the GWs are essentially oscillatory functions of time with an increasing amplitude until around the time of merger, followed by the ringdown regime. Thus, it is advantageous to decompose them into phase and amplitude:

$$\begin{aligned} h(t ; \lambda ) = A(t, \lambda ) e^{i \varPhi (t, \lambda ) } . \end{aligned}$$
(61)

In the case of non-spinning binary black holes initially in quasi-circular orbit, the GWs can be parametrized using only the mass ratio \(q := m_1/m_2\). The structure of the waveforms themselves, as well as of their phase and amplitude, is shown in Figs. 11 and 12. One can notice the simplification in structure due to this decomposition.
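
In code, the decomposition (61) amounts to taking the modulus and the unwrapped argument of a complex time series; a minimal sketch for a sampled complex mode h:

    import numpy as np

    def amplitude_phase(h):
        """Split a complex time series h(t) into A(t) and Phi(t) as in Eq. (61)."""
        amplitude = np.abs(h)
        phase = np.unwrap(np.angle(h))   # unwrapping removes the 2*pi jumps
        return amplitude, phase

    # Reconstruction: amplitude * np.exp(1j * phase) recovers h up to round-off.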

Fig. 11

Waveform emitted by a non-spinning binary black hole coalescence initially in quasi-circular orbit with \(q:=m_1/m_2=1.7\). The (2, 2) \(h_+\) polarization is shown, using the surrogate model SpEC_q1_10_NoSpin (Blackman et al. 2015) generated with the GWSurrogate Python package (Blackman et al. 2021)

Fig. 12

Representation of amplitude \(A(t, \lambda =q)\) and phase \(\phi (t, \lambda =q)\) surfaces for the (2, 2) mode. They were computed using a symbolic model (Tiglio and Villanueva 2021a) which was trained using the surrogate SpEC_q1_10_NoSpin

Similarly, in the precessing case, by using a co-precessing frame one simplifies the structure of the waveforms, and it is advantageous to include the coordinate transformations themselves as components to model. A question that arises is whether to build a reduced basis for the waveforms and later build surrogate models for the system components, or to start by building bases for the latter and then reconstruct the waveform from these surrogate components or pieces. In general, the latter approach should be more accurate, even if at the expense of a higher offline cost.

11.2 Empirical interpolant based surrogates

The surrogate approach described here was introduced in Field et al. (2014); we closely follow the presentation of that reference. It represents, with some conceptually minor variations, the state of the art as of this writing. The approach has three offline stages, which are described below in decreasing order of computational cost, and one online stage corresponding to fast evaluations of the surrogate model.

11.2.1 Offline stages

  1. 1.

    Select the most relevant n points in parameter space (shown as red dots in Fig. 13) using, for example, a greedy algorithm as described in Sect. 8.3. The waveforms/functions associated with these selections (shown as red lines) provide a nearly optimal reduced basis for the space of interest \({{{\mathcal {F}}}}\).

  2. 2.

    Identify n time (or frequency) samples of the full time series using the EIM and build an interpolant that accurately reconstructs any fiducial waveform for all times if it is known at the EIM nodes. These nodes are shown as blue dots on the vertical axis in Fig. 13.

  3. 3.

    At each EIM node perform a fit with the method of choice in the parameter dimension for the waveform or its “components” (such as the amplitude and phase) using the greedy data or the full training parameters from Step 1. The fits are indicated by blue horizontal lines in Fig. 13.

Fig. 13

Image reproduced with permission from Field et al. (2014)

A schematic of the method for building and evaluating the surrogate model. The red dots show the greedy selection of parameter points for building the reduced basis (Step 1, offline), the blue dots (Step 2, offline) show the associated empirical nodes in time from which a waveform can be reconstructed by interpolation with high accuracy, and the blue lines (Step 3, offline) indicate a fit for the waveform’s parametric dependence at each EIM node. The yellow dot shows a generic parameter, which is predicted at the yellow diamonds and filled in between for arbitrary times using the EIM, represented as a dotted black line (Step 4, online).

For further illustration, we show in Fig. 14 the distribution of EIM nodes for gravitational waves corresponding to an EOB model, taken from Field et al. (2014). Notice the sparsity of the distribution of EIM subsamples (nodes): only 19 empirical nodes are needed to reconstruct the whole time series, consisting of roughly 65–70 waveform cycles, for the entire parameter space. From a signal processing perspective this may sound quite strange, since the EIM sampling does not seem to meet the Nyquist rate needed to reconstruct the time series. This apparent contradiction is resolved by realizing that the reduced basis already encodes the physical dependency of the model; in effect, there is prior information working in the background for a faithful reconstruction of the waveforms.

Fig. 14

Image reproduced with permission from Field et al. (2014)

Example of a distribution of the EIM nodes corresponding to a fiducial model composed by EOB waveforms. Only 19 nodes are needed to represent 65–70 cycles within about machine accuracy.

11.2.2 Step by step offline construction

Put in a more specific way, the surrogate model is built as follows. Given a set of greedy parameters \(\{ \varLambda _i \}_{i=1}^n\) and their associated functions (waveforms), the empirical interpolant for an arbitrary parameter value \(\lambda \) has the form (Sect. 10)

$$\begin{aligned} {{{\mathcal {I}}}}[h](t, \lambda ) = \sum _{i=1}^nB_i(t) h(T_i, \lambda ) \approx h(t, \lambda ) \, , \end{aligned}$$
(62)

where the \(\{ B_i \}\) functions are computed offline from the basis, as well as the EIM nodes \(\{ T_i \}\).

In the case of non-spinning binaries, it is convenient to decompose all waveforms h into their amplitudes and phases as in Eq. (61) and, in particular, at the EIM nodes,

$$\begin{aligned} h(T_i, \lambda ) = A(T_i, \lambda ) e^{i \varPhi (T_i, \lambda ) } . \end{aligned}$$
(63)

For each \(i=1\ldots n\), perform fits for amplitude and phase,

$$\begin{aligned} A(T_i, \lambda ) \approx A^{\mathrm{fit}}(T_i, \lambda ) \quad , \quad \varPhi (T_i, \lambda ) \approx \varPhi ^{\mathrm{fit}}(T_i, \lambda )\quad \forall \lambda . \end{aligned}$$

Replacing these approximants, which are so far only valid at the EIM nodes \(\{T_i\}\), as opposed to arbitrary values of t, into (63) yields

$$\begin{aligned} h_{\mathrm{s}}(T_i, \lambda ) := A^{\mathrm{fit}} (T_i, \lambda ) e^{i \varPhi ^{\mathrm{fit}} (T_i, \lambda ) } . \end{aligned}$$
(64)

Finally replacing h by \(h_{\mathrm{s}}\) in Eq. (62) leads to the final surrogate model for all values of t and \(\lambda \),

$$\begin{aligned} h_{\mathrm{s}}(t, \lambda ):= \sum _{i=1}^n B_i(t) A^{\mathrm{fit}} (T_i, \lambda ) e^{i \varPhi ^{\mathrm{fit}} (T_i, \lambda )} \quad \forall t,\lambda . \end{aligned}$$
(65)

Expressions similar to Eq. (65) are the final form of the surrogates used so far within numerical relativity, with the waveforms decomposed into more components in the cases with spin and precession. Importantly, once the training set has been processed to build the bases, the surrogate model relies only on the knowledge of the bases; the training set is no longer involved. Its accuracy (error estimates) and evaluation costs are discussed next.

11.2.3 Online stage, evaluation cost

We now discuss the cost, in terms of operation count, of evaluating a surrogate model. For simplicity we count every arithmetic operation as a single one, regardless of its actual complexity.

Also for simplicity and definiteness, still restricting the discussion to the non-spinning case, the complete surrogate model is given by Eq. (65), where the n coefficients \(B_i(t)\) and the 2n fitting functions \(\{ A_i (\lambda ) \}_{i=1}^n\) and \(\{ \phi _i (\lambda ) \}_{i=1}^n\) are assembled offline as described above.

In order to evaluate the surrogate model for some parameter \(\lambda _0\) we only need to evaluate each of those 2n fitting functions at \(\lambda _0\), recover the n complex values \(\{ A_i(\lambda _0) e^{i \phi _i(\lambda _0)} \}_{i=1}^n\), and finally perform the summation in (65). Each \(B_i(t)\) is a complex-valued time series with L samples, where L is the desired number of time or frequency outputs.
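
Schematically, this online stage is just 2n fit evaluations followed by a matrix-vector product. A sketch, assuming (our notation) that amp_fits[i] and phase_fits[i] are the per-node fitting functions built offline (e.g. splines or polynomials) and that B is the L × n matrix of Eq. (57):

    import numpy as np

    def evaluate_surrogate(lam, B, amp_fits, phase_fits):
        """Online evaluation of Eq. (65): the full length-L series h_s(t, lam)."""
        h_nodes = np.array([A(lam) * np.exp(1j * Phi(lam))    # Eq. (64) at each EIM node
                            for A, Phi in zip(amp_fits, phase_fits)])
        return B @ h_nodes                                    # Eq. (65)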

Therefore, the overall operation count to evaluate (65) at each \(\lambda _0\) is \((2n-1)L\) plus the cost to evaluate the fitting functions. At least in one dimension, the evaluation cost of polynomials of order \(n_{\mathrm{fit}}\) using Horner’s method (which is known to be optimal in terms of operation counts – Pan 1966) is \(2n_{\mathrm{fit}}\). We have discussed the exponential cost of evaluating polynomials in higher (than one) dimensions in Sect. 9.5 and in the closing comments of that section. We present some further discussion of this and other topics related to the fitting stage in Sect. 11.2. To summarize, the cost of evaluating the surrogate (65) is

$$\begin{aligned} {\mathrm{cost}}_{\mathrm{s}} = \left[ (2n-1) + c_{\mathrm{fit }}\right] L \end{aligned}$$

where at least in one dimension \(c_{\mathrm{fit}} \le 2n\) and therefore

$$\begin{aligned} {\mathrm{cost}}_{\mathrm{s}} \le \left( 4n-1 \right) L \sim {{{\mathcal {O}}}}(L) \end{aligned}$$
(66)

Comment 10

 

  • The evaluation of the surrogate model (65) is embarrassingly parallel in the number of outputs L (each output evaluation is completely independent of the others) and therefore the total computational time can be sped up as desired. In contrast, time-dependent equations (ordinary or partial differential) are very difficult to parallelize in time.

  • The cost of most ordinary and partial differential equation solvers is at best linear in L. The evaluation time of the described surrogate approach, if parallelized, is actually independent of the number of desired outputs L.

  • In Eq. (66), the linearity in L is simply due to the number of independent output sample evaluations. The optimal cost would be \(\mathrm{cost_s} = L\), corresponding to one arithmetic operation to evaluate the surrogate at any output x (time or frequency, in the context of this review) of interest. If n is small enough, as happens in practice for GWs, the operation count (66) is, in a precise sense, very efficient.

11.2.4 Error estimates

One of the errors of interest for the complete surrogate model is a discrete version of the \(L_2\)-normed difference between a fiducial waveform and its surrogate. For definiteness we consider L equally spaced time samples,

$$\begin{aligned} \Vert h(\cdot , \lambda ) - h_{\mathrm{S}} (\cdot , \lambda ) \Vert ^2 := \varDelta t \sum _{i=1}^L\left| h(t_i; \lambda ) - h_{\mathrm{S}}(t_i; \lambda )\right| ^2 \, , \end{aligned}$$
(67)

where \(\varDelta t = (t_\mathrm {max} - t_\mathrm {min})/(L-1)\). Other errors of interest are the relative pointwise ones for the phase and amplitude,

$$\begin{aligned} \left| \frac{A(t; \lambda ) - A_{\mathrm{S}}(t; \lambda )}{A(t; \lambda )} \right| \, , \quad \left| \frac{\phi (t; \lambda ) - \phi _{\mathrm{S}}(t; \lambda )}{\phi (t; \lambda )}\right| . \end{aligned}$$

The following error bound for the discrete error (67) can be derived:

$$\begin{aligned} \Vert h(\cdot , \lambda ) - h_{\mathrm{S}} (\cdot , \lambda ) \Vert ^2\le \varLambda _n \sigma _n + \varLambda _n \varDelta t \sum _{i=1}^n \left[ h(T_i, \lambda ) - h_{\mathrm{S}}(T_i, \lambda ) \right] ^2 . \end{aligned}$$
(68)

For a proof, see Field et al. (2014). This bound identifies contributions from two sources. The first term in the r.h.s. of (68) describes how well the EIM interpolant (i.e., the basis and EIM nodes) represents \(h(t;\lambda )\). The expected exponential decay of the greedy error \(\sigma _n\) with increasing n, together with a slowly growing Lebesgue constant \(\varLambda _n\), results in this term being very small. The second term in the r.h.s. of (68) is related to the quality of the fit. Currently, the fitting step is the dominant source of error in the surrogate models built so far, compared to the first two steps of generating the reduced basis and building the empirical interpolant. Improving on this source of error remains an open goal.

11.3 Further reading

Certified approaches refer to those which can a priori guarantee error bounds for the reduced model compared to the ground truth. To our knowledge, these have so far been restricted to elliptic (coercive) partial differential equations. We again refer to Quarteroni et al. (2011), Hesthaven et al. (2015) and Quarteroni et al. (2016). For an implementation of the RB and EIM methods in a concise and user-friendly API, take a look at the Arby Python package (Villanueva 2021), a fully data-driven module for building surrogate models, reduced bases and empirical interpolants from training data.

12 Surrogates of compact binaries

To date the most accurate numerical relativity (NR) binary black hole (BBH) surrogates are based on reduced bases and the empirical interpolation method as described in this review, perhaps with some variations, and on the SpEC code for the NR training simulations (SpEC 2021). However, other public catalogs of NR waveforms, such as that of the RIT group, could equally be used (Healy et al. 2019, 2020). The SpEC-based surrogate catalog can be found in Field et al. (2019) and can be evaluated with the GWSurrogate Python package (Blackman et al. 2021, 2015). Another package for surrogate evaluations is surfinBH (Varma 2020), built on top of GWSurrogate. What is found in all these surrogates is that, when compared to NR, they are at least an order of magnitude more accurate than other existing models such as hybrid, phenomenological or effective ones.

In addition to NR surrogates, we also discuss others that use as fiducial models post-Newtonian, effective-one-body, extreme-mass-ratio and ringdown approximations, some of which employ a singular value decomposition approach or Gaussian process regression.

12.1 Numerical relativity binary black holes

  1. 1.

    Non spinning (1-D):

    The first NR BBH surrogate is presented in Blackman et al. (2015), and referred to as SpEC_q1_10_NoSpin. The model corresponds to non-spinning black holes in quasi-circular orbit with initial orbital eccentricity smaller than \(10^{-3}\), with time range \([-2750, 100]\)M, which corresponds to about 15 orbits before merger. As is common practice, the waveforms are aligned so that \(t=0\) corresponds to the peak of the amplitude (which is around the merger of the two black holes). The single parameter of the problem is the mass ratio \(q := m_1/m_2 \in [1,10]\).

    In this work, due to the prohibitive cost of computing a dense training set of NR waveforms, the greedy values are chosen using an EOB model. This results in 22 greedy parameters. Next, a reduced basis is built by performing 22 SpEC simulations using those parameter values. The final surrogate includes all spherical harmonic modes up to \(\ell =8\) and is proven to be indistinguishable from NR within the numerical errors of the latter using leave-one-out cross validation. It is also found to be more accurate than an EOB model when compared with NR.

    One could also ask for a symbolic model representing the numerical surrogate one. Tiglio and Villanueva (2021a) construct the first ab initio free-form symbolic model for GWs using symbolic regression through genetic programming. The fiducial model corresponds to the (2, 2)-mode of SpEC_q1_10_NoSpin, which is taken as ground truth solution of the Einstein field equations since it is indistinguishable from NR. The approach is ab initio, meaning that no approximations to the Einstein equations are made, as opposed to stitching PN waveforms with EOB, NR and ringdown ones. The search for closed expressions is completely free, meaning that no prior hypothesis related to the type of functions is made. This is at the heart of genetic programming: successive models evolve under evolutionary pressure until reaching a tolerance error (or another stopping condition) without incurring in any human bias. To generate the training set of waveforms for symbolic regression, instead of performing naive samplings such as equally spaced grids (which turn out to be prohibitively expensive in terms of convergence times), a key step of Tiglio and Villanueva (2021a) is to sample the time domain with the 22 EIM nodes. This results in fast convergence in the regression instance, ending with closed-form expressions with maximum overlap error, using all temporal points, of 1% with respect to the NR surrogate. One of the salient features of these closed expressions is that they are not divided into separate pieces for the inspiral regime, the merger, and the ringdown, but cover the whole inspiral-merger-ringdown at once.

  2. 2.

    Spinning, motivated by GW150914 (4/5-D):

    Following the work of Blackman et al. (2015), Blackman et al. (2017a) presents the first spinning BBH surrogate model for a parameter range motivated by the first GW detection, GW150914.

    Two surrogates are built: NRSur4d2s_TDROM_grid12 and NRSur4d2s_FDROM_grid12. The physical range is \(q \le 2\) and dimensionless spin magnitudes \(\chi _{1,2}\le 0.8\). The initial spin of the smaller black hole is assumed to be along the axis of the orbital angular momentum. The parameter region includes LIGO’s first detection, GW150914, though with fewer cycles: the model was trained on (and is restricted to) NR simulations with 25–35 cycles before merger, while GW150914 has around 55 cycles.

    Model NRSur4d2s_TDROM_grid12 is built using 276 NR simulations as training set and PN waveforms to find the greedy points, for which the PN surrogate reaches a floor error of \(10^{-3}\). The parametric fits used in the surrogate assembly are fixed to a particular “order” (not necessarily polynomial order, since trigonometric functions are included as part of the dictionary of fitting functions). The authors attribute this error floor to that fixed order; the general problem behind this issue is discussed in Sect. 15. Next, NR simulations are performed for those PN greedy points using a minimally rotating co-precessing frame and, together with the coordinate transformations to an inertial frame, reduced bases for the different “pieces” or components of the waveform are built.

    The initial spin of the smaller black hole by construction lies along the axis of the orbital angular momentum, reducing the parameter dimensionality to 5. Next, the azimuthal component of the spin of the larger black hole at a reference time \(t_0 = t_{\text {peak}}-4500\)M is included through an analytical approximation, effectively reducing the parameter dimensionality to 4 while modeling 5 dimensions. Thus, the model does include a physical approximation and is not fully based on NR simulations, while being able to include precession in the modeling.

    Model NRSur4d2s_FDROM_grid12 is built not from the 276 NR training set simulations but from the time-domain surrogate NRSur4d2s_TDROM_grid12, which can be quickly evaluated at a much larger number of points to populate a new, more dense, training set.

  3. 3.

    Spinning and precessing, full dimensional case (7-D):

    The first surrogate model, named NRSur7dq2, for the full 7-D parameter space of GWs emitted by a non-eccentric BBH coalescence is presented in Blackman et al. (2017b). We recall that it is standard in NR to reduce the parameter space from 8-D to 7-D by using the scale invariance of GR with respect to the total mass of the system.

    It uses 744 NR simulations to construct the training set, with parameter ranges \(q\le 2\), \(\chi _1, \chi _2 \le 0.8\), for about 20 orbits prior to merger, and \(\ell \le 4\). The authors “recycle” the 276 NR simulations used in the spinning 4/5-D case described above and complete a total of 744 NR simulations with a metric-based population criterion to select the remaining parameter points. Next, the number of simulations is extended to 886 by invoking symmetry arguments. As in Blackman et al. (2017a), but with some improvements, the waveforms are decomposed into data pieces before the construction of the surrogate.

    The in-sample errors computed for the 886 NR waveforms show that the largest surrogate errors are comparable to the largest NR ones (\(\sim 10^{-2}\)). To estimate the out-of-sample errors the authors perform cross-validation: the training set is randomly divided into 20 sets of 44–45 waveforms, leaving out one set at each step to build a trial surrogate with the remaining 19 sets. The trial surrogate is then compared against the set that was left out. This results in mismatches similar to those of the in-sample case. Also, the authors compare mismatches for a fully precessing EOB model (SEOBNRv3 – Pan et al. 2014) and a phenomenological waveform model (IMRPhenomPv2 – Hannam et al. 2014) which includes some effects of precession. For these models the mismatches are more than an order of magnitude larger than those of NRSur7dq2.

    In Varma et al. (2019a) the 7-D parameter space is also covered, but now for about 20 orbits before merger, mass ratios \(q\le 4\), arbitrary spin orientations with dimensionless magnitudes \(\chi _1, \chi _2 \le 0.8\), \(\ell \le 4\) multipole modes, and initial orbital eccentricity less than \(10^{-3}\), using 1,528 NR simulations for the training set. The resulting model, NRSur7dq4, extends the previous 7-D model and is to date the most exhaustive and general NR-based surrogate model for BBH coalescences. The extent of these simulations is sufficient to represent some but not all of the BBH signals measured by LIGO and Virgo in the first two observing runs. The reference proposes to hybridize the NR waveforms with PN approximations for higher values of mass ratio and spin magnitudes (see Sect. 12.2). In addition, a surrogate model, named NRSur7dq4Remnant, is built for the mass, spin, and recoil kick velocity of the remnant black hole. To test these models, the authors apply a 20-fold cross-validation. First they randomly divide the 1,528 training simulations into 20 groups of \(\sim \)76 elements each. For each group, they build a trial surrogate using the \(\sim \)1,452 remaining training elements and test this trial against the \(\sim \)76 elements that were left out. They show that the mismatches for NRSur7dq4 against NR, computed with the Advanced LIGO design sensitivity curve, are always \(\lesssim 8\times 10^{-3}\) at the 95 percentile level over the mass range (50–200)M. For NRSur7dq4Remnant the 95th percentile errors are \(\sim 5\times 10^{-4}\)M for mass, \(\sim 2\times 10^{-3}\) for spin magnitude, and \(\sim 4\times 10^{-4}c\) for kick speed. Compared to the spinning EOB waveform model SEOBNRv3 (Pan et al. 2014), the authors find for the two models an error improvement of at least one order of magnitude (Table 1).

Table 1 Characterization of the waveform surrogates described so far corresponding to NR binary black holes

12.1.1 Kicks

Before the release of accurate GW surrogates covering the full 7-D parameter space, black hole kicks were mostly modeled with fitting formulas based on PN approximations with subsequent calibrations to NR simulations. Gerosa et al. (2018) presents the first effort to determine remnant properties from BBH mergers using an RB/EIM-based surrogate approach. More specifically, the authors use the NRSur7dq2 model (Blackman et al. 2017b) to generate a waveform catalog and analyze the linear momentum dissipated through the emission of GWs. Their procedure provides the velocity accumulation profile \({\mathbf {v}}(t)\) and the final kick speed \(v_k\) of the remnant black hole. The comparison of recoil speeds between NR simulations and NRSur7dq2 shows agreement at the level of \(\sim 10^{-4}c\), with some outliers of order \(\sim 10^{-3}c\). The authors suggest that, even when the surrogate accurately models the post-merger strain, small errors might propagate to the phase of the center-of-mass oscillation, causing a relatively large error in the final kick velocity.

More recently, in Varma et al. (2019c) the first surrogate models for the remnant properties have been constructed using Gaussian process regression (a type of machine learning regression method; see, for example, Doctor et al. 2017) and NR simulations for training. These fits provide the remnant mass, spin vector and recoil kick vector with high accuracy for (1) precessing BHs with mass ratio \(q\le 2\) and spin magnitudes \(\chi _1 , \chi _2\le 0.8\); and (2) non-precessing BHs with mass ratio \(q\le 8\) and aligned or anti-aligned spins with magnitudes \(\chi _1 , \chi _2\le 0.8\).
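
Such fits can be reproduced schematically with off-the-shelf Gaussian process regression; the sketch below uses scikit-learn and synthetic placeholder data (it is not the actual fit of the reference, and the parameters and "data" are purely illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# Toy training set: parameters (q, chi1z, chi2z) and a fake "remnant mass" standing in
# for NR data; in the references these would come from the NR simulation catalog.
rng = np.random.default_rng(0)
X = rng.uniform([1.0, -0.8, -0.8], [2.0, 0.8, 0.8], size=(200, 3))
y = 0.95 - 0.01*X[:, 0] + 0.005*(X[:, 1] + X[:, 2])

gpr = GaussianProcessRegressor(kernel=ConstantKernel()*RBF(length_scale=[1.0, 1.0, 1.0]),
                               normalize_y=True, n_restarts_optimizer=5)
gpr.fit(X, y)
m_f, sigma = gpr.predict([[1.5, 0.3, -0.2]], return_std=True)   # prediction with uncertainty
```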

12.2 Numerical relativity hybrid binary black holes

The previous surrogate models do not cover the entire LIGO band. To remedy this, in Varma et al. (2019b) a hybridized non-precessing model named NRHybSur3dq8 is presented. In that work NR waveforms are “stitched” at early times to PN and EOB ones, thus covering the entire band of advanced LIGO with a starting frequency of 20 Hz for systems with total mass as low as \(2.25\,M_{\odot }\). This model is based on 104 NR simulations for the 3-D parameter region \(q\le 8\), \(|\chi _{z1}|, |\chi _{z2}|\le 0.8\), including the modes \(\ell \le 4\) and (5, 5) but excluding (4, 1) and (4, 0). To populate the training space, the authors perform 91 NR simulations and complete the set to a total of 104 with 13 waveforms obtained from the black-hole exchange symmetry (equal masses, unequal spins). The parameters of the 91 NR simulations are selected by a greedy procedure: a PN surrogate model is iteratively constructed, tested against a dense validation set, and the point with the largest model error is selected as the next greedy parameter. At the hybridization stage, the early-inspiral waveforms are stitched to the NR waveforms by minimizing a cost function over the time and frame shifts between waveforms in an appropriate matching region. The matching region is set to start 1,000 M after the beginning of the NR waveforms and to end after 3 orbits of the binary inspiral. The authors find that the hybridized surrogate model NRHybSur3dq8 performs well within NR truncation errors and is about two orders of magnitude more accurate than the SEOBNRv4HM spinning-EOB model (Cotesta et al. 2018).
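
A minimal sketch of the stitching step is given below, under simplifying assumptions: aligned spins (so that the frame shift reduces to a rigid phase rotation of the mode) and both waveforms already given as complex strains on a common uniform time grid. The function and variable names are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

def hybridize(t, h_insp, h_nr, t1, t2):
    """Align an early-inspiral (PN/EOB) waveform to an NR one over the matching
    window [t1, t2] via a time shift and phase rotation, then blend them smoothly."""
    win = (t >= t1) & (t <= t2)

    def shifted(dt):   # time-shifted inspiral waveform (linear interpolation)
        return np.interp(t, t - dt, h_insp.real) + 1j*np.interp(t, t - dt, h_insp.imag)

    def cost(p):       # least-squares disagreement in the matching window
        dt, dphi = p
        return np.sum(np.abs(shifted(dt)[win]*np.exp(1j*dphi) - h_nr[win])**2)

    dt, dphi = minimize(cost, x0=[0.0, 0.0], method="Nelder-Mead").x
    aligned = shifted(dt)*np.exp(1j*dphi)

    s = np.clip((t - t1)/(t2 - t1), 0.0, 1.0)    # 0 before t1, 1 after t2
    w = 0.5*(1.0 + np.cos(np.pi*s))              # smooth transition from 1 to 0
    return w*aligned + (1.0 - w)*h_nr            # inspiral at early times, NR at late times
```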

As an application, NRHybSur3dq8 is used in Barkett et al. (2016) to generate waveforms including tidal effects by means of a PN tidal splicing method. The resulting model, named NRHybSur3dq8Tidal, has been added to the latest update of the SXS Collaboration catalog (Boyle et al. 2019) of numerical simulations of BBH coalescences.

12.3 Extreme mass ratio inspirals

Rifat et al. (2020) build a surrogate model, EMRISur1dq1e4, for extreme mass ratio inspirals using non-spinning point-particle black hole perturbation theory (ppBHPT). The construction uses numerical solutions of the Teukolsky equation with a point particle as source. The trajectory of the particle is determined by an adiabatic inspiral at early times, a late-stage geodesic plunge, and a transition region in between. The mass ratio covered is in the range \([3, 10^4]\).

After a mass rescaling, the surrogate model agrees remarkably well with NR waveforms (which solve the full Einstein equations and are available for mass ratios \(q\le 10\)). The rescaling is a function of the mass ratio and is empirical in the sense that it is numerically chosen to minimize the difference with NR waveforms for the (2, 2) mode. Even so, the degree of agreement after rescaling is surprising, since a priori there is no reason to expect it. As the authors point out, however, their result seems to be in line with growing evidence suggesting that perturbation theory with self-force corrections might be applicable to nearly equal mass systems.

12.4 Eccentric inspirals

Barta and Vasúth (2018) introduce an SVD-based ROM technique to model waveforms emitted by compact binary coalescences with any residual orbital eccentricity. They apply this framework to eccentric PN waveforms generated with the CBwaves open-source software (Csizmadia et al. 2012) and build a reduced order model for a 3-D subset of the full 8-D parameter space, corresponding to total mass M, mass ratio q and eccentricity \(e_0\). The ranges covered by the template bank are \(2.15\,M_\odot \le M\le 215\,M_\odot \), \(0.01\le q\le 1\) and \(0\le e_0 \le 0.96\).

Evaluating the surrogate model is 2–3 orders of magnitude faster than generating the corresponding CBwaves waveforms, with the speedup reaching a factor of several thousand around \((10-50)\,M_{\odot }\).

12.5 Effective one body

In Field et al. (2014) accurate surrogate models for EOB waveforms of non-spinning BBH coalescences are built using the RB framework, corresponding to modes (2, 1), (2, 2), (3, 3), (4, 4) and (5, 5) with mass ratios from 1 to 10. The authors benchmark the surrogate model against a fiducial one generated by the EOB solver of the LIGO Algorithm Library (LAL) software package. For a sampling rate of 2048 Hz they find a speedup of \(\sim 2,300\); that is, about three orders of magnitude faster than the LAL waveform model.

Lackey et al. (2017) presents a surrogate of a non-spinning EOB waveform model with \(l=2,3,4\) tidal multipole moments that reproduces binary neutron star (BNS) numerical waveforms up to merger. The authors find, within the RB framework, that 12 amplitude and 7 phase basis elements are sufficient to reconstruct any BNS waveform with a starting frequency of 10 Hz. The surrogate has maximum errors of 3.8% in amplitude (0.04% when excluding the last 100M before merger) and 0.043 radians in phase. Following a different approach, Lackey et al. (2019) implements Gaussian process regression to build a frequency-domain surrogate of an aligned-spin BNS waveform model based on the EOB formalism. The resulting surrogate has a maximum mismatch of \(4.5 \times 10^{-4}\) and a speedup of order \(10^{3}\) with respect to the original model.

As an alternative to greedy-based methods, singular value decomposition can be used to build reduced bases for GWs. This is done in Pürrer (2014) and Pürrer (2016), where two frequency-domain reduced order models are presented for the EOB models SEOBNRv1 (Taracchini et al. 2012) and SEOBNRv2 (Taracchini et al. 2014). These surrogates are built using SVD for the reduced bases and tensor-product splines as the fitting method. The surrogate for SEOBNRv2 corresponds to a spin-aligned model for the dominant (2, 2) mode and extends the spin range of the SEOBNRv1 surrogate to almost the entire Kerr range, while covering the entire parameter space on which the first surrogate is defined: symmetric mass ratios \(0.01\le \eta \le 0.25\) and spin magnitudes \(-1\le \chi _i \le 0.99\). In general, mismatches against SEOBNRv2 are better than \(\sim 0.1 \%\), except in regions of parameter space in which the original model presents discontinuities, which induce mismatches of \(\sim 1 \%\) in the surrogate.
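
In schematic form, and with `H` a hypothetical training matrix whose rows are waveforms sampled on a common frequency grid, an SVD/POD reduced basis can be obtained as follows (a sketch only, not the implementation of the references):

```python
import numpy as np

def svd_basis(H, tol=1e-10):
    """Orthonormal reduced basis from an SVD of the training matrix H, truncated
    so that the relative (Frobenius-norm) truncation error is at most `tol`."""
    U, s, Vh = np.linalg.svd(H, full_matrices=False)
    tail = np.sqrt(np.cumsum(s[::-1]**2))[::-1]      # tail[k]: error of keeping k vectors
    n = int(np.searchsorted(-tail, -tol*tail[0]))    # smallest k reaching the tolerance
    return Vh[:max(n, 1)]                            # rows: orthonormal basis vectors

# The projection coefficients of the training set onto the basis,
#     coeffs = H @ svd_basis(H).conj().T,
# (or amplitude/phase data at a reduced set of frequency nodes) are what is then fit
# across parameter space, with tensor-product splines in the references above.
```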

12.6 Post-Newtonian (PN)

The first contact between reduced basis methods and GW modeling is presented in Field et al. (2011). The authors build a reduced basis for 2PN waveforms corresponding to the 2-D space of non-spinning BNS inspirals with component masses in the range \([1,3]\,M_\odot \). Remarkably, it is found that only 921 basis elements are needed to represent, within machine precision, the full template bank used as fiducial model. The greedy approach is compared against a metric template placement method, finding exponential decay of the greedy error with the number of basis elements, as opposed to the approximately linear convergence rate of the metric approach.

Later on, in Field et al. (2012) a reduced basis is built for the non-precessing case of BBH inspirals (a 4-D parameter space: 2 masses and 2 aligned or anti-aligned spins) using the restricted TaylorF2 PN approximation, with component masses in the range \([3,30]\,M_\odot \) and dimensionless spin magnitudes in the full range \([-1,1]\). The authors find that, for a tolerance error of \(10^{-11}\), when increasing the dimensionality of the parameter space from 2-D to 4-D the number of basis elements needed to span the whole space of waveforms increases by only 6.6% with respect to the 1,725 bases needed for the 2-D case. Furthermore, going from the 3-D to the 4-D case requires adding only 15 more basis elements. This suggests the possibility that the curse of dimensionality could be beaten for the complete 8/7-D case, as discussed in Sect. 15.

In Blackman et al. (2014) the problem of building reduced bases for the full 7-D PN case, where there are no closed-form expressions and ordinary differential equations need to be solved, is tackled. The waveforms correspond to 3.5PN precessing inspirals (mass ratio \(q\in [1,10]\) and dimensionless spin magnitudes \(||\chi _i||\le 0.9\)). The approach uses a modified version of the standard reduced basis greedy algorithm: at each iteration the 7-D parameter space is resampled with a fixed number K of waveforms and the usual greedy sweep is performed on this reduced training set. This turns out to be crucial to overcome the limitations imposed by the curse of dimensionality, since for high dimensional problems it is hopeless to compute a dense training set without incurring unfeasible computational costs (and storage). The sampling number K is increased in each run until reaching \(K=36,000\), at which point the number of RB waveforms becomes independent of K. By choosing to work in the binary's precessing frame, the authors exploit the fact that in this frame the waveforms have a weaker parametric dependence than they do in the inertial frame. Another important ingredient is the choice of parametrizing the waveforms in phase instead of time or frequency, taking full advantage of the waveforms' smooth dependence on this variable. It is found that, with all of these modifications, only 50 waveforms are needed to represent the entire 7-D parameter space with an error of \(10^{-7}\).
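
A sketch of this resampling strategy is given below; `waveform` and `sample_params` are placeholders for the (precessing-frame, phase-parametrized) waveform generator and the random parameter sampler of the reference.

```python
import numpy as np

def resampled_greedy(waveform, sample_params, dx, K=36000, tol=1e-7, max_basis=100, seed=0):
    """Greedy reduced basis in which the training set is redrawn at random at every
    iteration, so that a dense high-dimensional training set is never stored."""
    rng = np.random.default_rng(seed)
    e0 = waveform(sample_params(rng, 1)[0])
    basis = [e0/np.sqrt(np.vdot(e0, e0).real*dx)]
    while len(basis) < max_basis:
        B = np.array(basis)
        worst_err, worst_lam = -1.0, None
        for lam in sample_params(rng, K):             # fresh random training set
            h = waveform(lam)
            r = h - ((B.conj() @ h)*dx) @ B           # residual of projecting onto the basis
            err = np.vdot(r, r).real*dx
            if err > worst_err:
                worst_err, worst_lam = err, lam
        if worst_err < tol:                           # greedy stopping criterion
            break
        e = waveform(worst_lam)
        e = e - ((B.conj() @ e)*dx) @ B               # Gram-Schmidt orthogonalization
        basis.append(e/np.sqrt(np.vdot(e, e).real*dx))
    return np.array(basis)
```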

12.7 Ringdown

The detection of the GW150914 event in 2015 marked the advent of GW astronomy and opened a new era for testing GR and alternative theories of gravity in strong-field regimes (Isi et al. 2019; Berti et al. 2018; Zhao et al. 2019; Radice et al. 2017). In the case of compact binary coalescences, different techniques have been developed (Talukder et al. 2013; Da Silva Costa et al. 2018; O’Brien et al. 2019) to improve the extraction of physical information after merger. As is well known, post-merger properties are characterized through the analysis of the quasinormal modes (QNMs) excited in the remnant Kerr black hole. Models for ringdown waveforms have been constructed using several techniques; see, for example, London (2015, 2020), London et al. (2014), London and Fauchon-Jones (2019) for applications of greedy and regression methods to the construction of QNM models.

In practice, single-mode ringdown searches can limit the number of events measurable by advanced ground-based detectors. Moreover, parameter estimation errors can become large when the actual waveform contains a second mode (which is expected to generically be the case). Besides this, there are further motivations for multi-mode searches, such as consistency tests of GR [e.g., of the no-hair theorem, Berti et al. (2007), Dreyer et al. (2004), Berti et al. (2006)] and inference about the progenitors of the post-merger black hole. Looking forward to future multi-mode ringdown searches, the RB method has been implemented in Caudill et al. (2012) to build a compact representation of multi-mode QNM catalogs. A single QNM waveform has the form

$$\begin{aligned} h_{lmn}(t)={{{\mathcal {A}}}}_{lmn}(\varOmega ) \frac{M}{r} e^{- \pi \frac{ f_{lmn}}{Q_{lmn}}t} \cos {(2\pi f_{lmn}t)} \, , \end{aligned}$$

where \({{{\mathcal {A}}}}_{lmn}(\varOmega )\) is an orientation-dependent dimensionless amplitude, \(r\) the distance to the source, \(M\) the black hole mass, \(f_{lmn}\) the central frequency, and \(Q_{lmn}\) the quality factor; geometric units \(G=c=1\) are used. Starting from the one-mode case, a template bank of ringdown waveforms is constructed for the \((l,m,n)=(2,2,0)\) and (3, 3, 0) QNMs. For several values of the minimal match MM between template and signal, the parameter space is sampled using a metric-based placement criterion over the 2-D space (f, Q). For example, a minimal match \(\mathrm {MM}=0.99\) corresponds to a catalog of 2,213 waveforms, with the spin magnitude of the remnant black hole in the range [0, 0.9947] and its mass M in the range \([2.9744,3025.7]\,M_{\odot }\). Table 2 shows the number of RB elements needed for each case.
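
This template is straightforward to implement; the snippet below uses purely illustrative numbers for the amplitude, central frequency and quality factor (they are not tied to any particular remnant):

```python
import numpy as np

def h_qnm(t, A, f, Q):
    """Single-mode QNM template above: a damped sinusoid with central frequency f,
    quality factor Q and overall amplitude A = A_lmn(Omega) M / r, for t >= 0."""
    return A*np.exp(-np.pi*f*t/Q)*np.cos(2.0*np.pi*f*t)

t = np.arange(0.0, 0.1, 1.0/4096)          # 0.1 s sampled at 4096 Hz
h = h_qnm(t, A=1.0, f=250.0, Q=3.0)        # illustrative values only
```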

Table 2 Number of reduced basis waveforms (\(N_{\mathrm{RB}}\)) and metric-based templates \(N_{\mathrm{metric}}\) needed to represent one-mode QNM training spaces with \((l, m, n) = (2, 2, 0)\) and (3, 3, 0) for different minimal matches MM

In Caudill et al. (2012) the RB representation is tested using Monte Carlo simulations in which the parameter space is randomly sampled with \(10^7\) waveforms. An average waveform representation error \(||h_{\lambda }-{{{\mathcal {P}}}}_N h_{\lambda }||^2\approx 4.21\times 10^{-13}\) is found, below the maximum training space representation error \(\epsilon ^2=10^{-12}\). For the two-mode case, the single modes (2, 2, 0) and (3, 3, 0) are linearly combined and, since the approach takes advantage of the linearity of QNMs, results similar to the one-mode case are found. Following this observation, the authors propose a simple method to represent unconstrained multi-mode waveforms by approximating them through the sum of the individual one-mode projections.
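
A sketch of this proposal, assuming precomputed orthonormal one-mode bases (here hypothetical arrays whose rows are basis elements on a common time grid with spacing `dt`), reads:

```python
import numpy as np

def project(h, basis, dt):
    """Orthogonal projection of a (real) waveform h onto the span of an orthonormal basis."""
    return ((basis @ h)*dt) @ basis

def two_mode_representation(h, basis_220, basis_330, dt):
    """Approximate an unconstrained two-mode ringdown by the sum of its
    individual one-mode projections, as proposed in Caudill et al. (2012)."""
    return project(h, basis_220, dt) + project(h, basis_330, dt)
```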

13 Reduced order quadratures

We have so far presented a method to produce a reduced model which can be used as a representation of an underlying set of functions, combining the RB and EIM frameworks to produce accurate representations; a predictive surrogate model is then built through interpolation in parameter space at the EIM nodes. Next we discuss how this framework can be used to compute numerical approximations of integrals (quadratures).

Reduced order quadratures (ROQ) were introduced, at least in the context of GWs, in Antil et al. (2013). They use the EIM to build a set of nodes and weights with which to approximate integrals; the main difference with standard quadratures is that they are application-specific, tailored to parametrized problems. This results in fast online quadrature evaluations, which are at the core of data analysis when computing correlations, likelihoods, etc.

Comment 11

Reduced order quadratures choose a nearly optimal subset of nodes from the training physical points (time, frequency, space) at which any given basis is known. These training points can be arbitrarily located: they can be equally spaced, randomly distributed, etc. This is an important aspect for many practical purposes, such as experimental data, where one might not be able to dictate when signals are measured, and it is in sharp contrast with fast converging Gaussian quadratures, which do dictate the time at which data should be measured. The latter might not only be impractical but also unfeasible in many experimental scenarios.

Comment 12

Being based on the EIM, the method naturally applies to multiple dimensions, unstructured data and meshes of arbitrary shapes.

Comment 13

By design, and again due to being based on the EIM, very fast convergence with the number of ROQ nodes is observed, typically exponential in the cases of interest. This would not be possible using standard methods, which rely on smoothness of the functions to be integrated. More precisely, in the case of GWs, in order for Gaussian quadratures to converge fast the signal would have to be smooth as a function of time, which is never the case for GW signals due to the presence of measurement noise. In contrast, ROQ lift this requirement and in practice achieve exponential convergence even for noisy data.

The idea of ROQ is remarkably simple for its impact:

  1. Step 1. Build a reduced basis for the problem of interest. The basis can be built, for example, using a POD or greedy approach.

  2. Step 2. Build an empirical interpolant, based on the previous basis, as discussed in Sect. 10. This approximates a parametrized function \(h(x, \lambda )\) in the space of interest by an empirical interpolant of the form

    $$\begin{aligned} {{{\mathcal {I}}}}_n [h](x, \lambda ) := \sum _{i=1}^n B_i(x) h(X_i, \lambda ) \approx h(x, \lambda ) \end{aligned}$$
    (69)

    Recall that in GW science h would typically be a waveform and x time or frequency, but the method is generic.

    The affine parametrization in \(\lambda \) and the physical dimensions x achieved by (69) is one of the critical ingredients that allows for ROQ. By affine parametrization one means a decomposition in terms of products of functions, each depending only on the parameters or only on the physical variables, as in (69).

  3. Step 3. Reduced order quadratures follow what would otherwise be the standard procedure when building quadratures, but replacing standard polynomial interpolants with EIM-based ones. Namely,

    $$\begin{aligned} I_n [ h ](\lambda ) := \int h(x, \lambda )\, dx \approx \int \mathcal{I}_n [h](x,\lambda )\, dx = \sum _{i=1}^n \omega _i h(X_i, \lambda ) , \end{aligned}$$

    where the ROQ weights

    $$\begin{aligned} \omega _i := \int B_i(x)\, dx \end{aligned}$$
    (70)

    are computed offline, with any quadrature method of choice or availability for computing (approximating) the integrals in Eq. (70).

Comment 14

If the number of available (say, time or frequency) samples is L, then one would typically use them to precompute the weights (70) in the offline stage. After this offline work, the online evaluation cost is reduced to a sum over n ROQ nodes, with no practical loss of accuracy and usually \(n \ll L\), as we discuss in Sect. 14, leading to dramatic speedups in likelihood computations, among other applications.

13.1 ROQ, other quadrature methods, dependence on dimensionality

Gaussian quadratures (such as Gauss–Chebyshev and Gauss–Legendre rules) are considered some of the most efficient methods for integrating smooth generic functions. This is not just a perception: there is a whole theory of why this is generically the case. Compared to ROQ, though, they suffer from several disadvantages, mentioned in Discussion 2 of Sect. 9.3:

  1. The location of their nodes, at which the integrand has to be known, is dictated by the method, which is unrealistic for any experiment or application based on observations.

  2. They are not hierarchical. That is, more nodes cannot simply be added for higher accuracy but each new quadrature has to be built from scratch.

  3. Their fast convergence and near-optimality are subject to very specific conditions, smoothness of the integrand being the most important one after the prescribed node locations. This is an unrealistic assumption for signals taken from experiments.

  4. As with any polynomial-based approach, their efficiency in general (as when using grids based on tensor products of one-dimensional ones) decreases with the dimensionality of the problem, unlike ROQ for cases of interest in this review.

As we discuss next, ROQ can beat these best known generic methods while working under more relaxed conditions, at the expense of offline work. Obstacles (1) and (2) are well known, so there is not much need to comment on them. Point (3) is discussed in Canizares et al. (2013, 2015), where it is shown how ROQ can lead to fast convergence even in the presence of noisy data.

Next we discuss point (4) by considering two simple examples, in one and two dimensions (both physically and parametrically), taken from Antil et al. (2013).

Example 11

The example function family of interest here is (there is nothing particular about this choice)

$$\begin{aligned} h(x, \lambda ) = \left[ \left( x - \lambda _1 \right) ^2 + 0.1^2 \right] ^{-1/2} \, , \end{aligned}$$
(71)

where both x and \(\lambda \) are real and we want to compute an approximation to its integral,

$$\begin{aligned} I (\lambda ) = \int _\varOmega h(x, \lambda )\, dx \, , \quad \end{aligned}$$
(72)

with, arbitrarily, \(\varOmega = [-1,1]\) and \(\lambda _1 \in [-0.1,0.1]\). We use two approaches to approximate (72):

  1. Gaussian quadratures: Numerical integration is carried out using up to \(n = 150\) nodes and a Gauss–Legendre rule. Recall that for each value of \(\lambda \) a separate quadrature has to be performed for each value of n, since the method is not hierarchical.

  2. Reduced order quadratures: A reduced basis (in this case using a greedy approach) is first built for the family of functions (71), followed by the empirical interpolant and its nodes and, finally, the ROQ rule. Since ROQ are hierarchical, the additional cost of a convergence test from n to \((n+1)\) is independent of n: only an extra node needs to be added. This, again, is in contrast to regenerating each quadrature rule as in the Gaussian case.

The results are shown in Fig. 15 as black solid and dashed lines. Even for a modest error of \(10^{-4}\), ROQ provide a factor of \(\sim 4\) savings, with the savings increasing dramatically with higher accuracy. We recall once again that we are comparing ROQ with one of the best general-purpose quadrature rules.
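
This comparison is simple enough to reproduce end to end; the toy script below is not the code of the reference: for brevity the basis is obtained from an SVD of a dense training set rather than with a greedy algorithm, the offline weights (70) use a composite trapezoidal rule on the dense grid, and a single out-of-training parameter is tested.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 4001); dx = x[1] - x[0]           # dense grid on Omega = [-1, 1]
lams = np.linspace(-0.1, 0.1, 401)                           # training values of lambda_1
H = 1.0/np.sqrt((x[None, :] - lams[:, None])**2 + 0.1**2)    # training set for the family (71)

# Orthonormal reduced basis (SVD truncation used here instead of a greedy basis)
U, s, Vh = np.linalg.svd(H, full_matrices=False)
n = int(np.searchsorted(-s, -1e-12*s[0]))
E = Vh[:n]                                                   # rows: basis functions e_j(x)

# EIM: select n nodes X_i and build the interpolating functions B_i(x) of Eq. (69)
nodes = [int(np.argmax(np.abs(E[0])))]
for j in range(1, n):
    c = np.linalg.solve(E[:j, nodes].T, E[j, nodes])         # interpolate e_j at current nodes
    nodes.append(int(np.argmax(np.abs(E[j] - c @ E[:j]))))   # next node: largest residual
V = E[:, nodes].T                                            # V[i, k] = e_k(X_i)
Bfun = np.linalg.solve(V.T, E)                               # rows: B_i(x)

# Offline ROQ weights, Eq. (70), via a composite trapezoidal rule on the dense grid
w = (Bfun.sum(axis=1) - 0.5*(Bfun[:, 0] + Bfun[:, -1]))*dx

# Online stage: integrate h(x, lambda) for a parameter not in the training set
lam = 0.037
exact = np.arcsinh((1.0 - lam)/0.1) + np.arcsinh((1.0 + lam)/0.1)   # closed form of (72)
roq = w @ (1.0/np.sqrt((x[nodes] - lam)**2 + 0.1**2))               # only n evaluations of h
xg, wg = np.polynomial.legendre.leggauss(n)                         # Gauss-Legendre, n nodes
gl = wg @ (1.0/np.sqrt((xg - lam)**2 + 0.1**2))
print(n, abs(roq - exact), abs(gl - exact))
```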

Fig. 15

Error curves for the 1- and 2-dimensional cases of Examples 11 and 12 from Antil et al. (2013), respectively, using Gauss–Legendre and ROQ rules. Errors are computed by taking the maximum over the entire training set. The ROQ savings increase from \(\sim 4\) to \(\sim 12\) as the number of spatial dimensions is increased from one to two; further savings are expected as the number of spatial dimensions increases. Note that the non-monotonicity of the ROQ curves is due to the fact that the EIM does not optimize for accuracy at each step, as discussed in Sect. 10.3

Approximating functions by products of functions (in particular of polynomials) in each physical dimension leads to evaluation costs of the approximants which scale exponentially with the dimensionality. As with higher accuracy, increasing the dimensionality of the problem leads to ROM providing larger savings when compared to generic methods. We next give an example.

Example 12

The family of functions of interest is now 2-dimensional, both in space and parameters,

$$\begin{aligned} h(x,\lambda ) = \left[ \left( x_1 - \lambda _1 \right) ^2 + \left( x_2 - \lambda _2 \right) ^2 + 0.1^2 \right] ^{-1/2} \, , \end{aligned}$$
(73)

where the physical domain is \(x = (x_1,x_2) \in \varOmega = [-1,1]\times [-1,1]\), the parametric one is \(\lambda \in [-0.1, 0.1] \times [-0.1, 0.1] \), and we are interested in multiple evaluations of the parametrized integral

$$\begin{aligned} I (\lambda )= \int _{\varOmega } h(x, \lambda )\, dx. \end{aligned}$$

When considering Gaussian quadratures, the curse of dimensionality becomes apparent already in two dimensions. In standard multidimensional quadratures the integrand is approximated by products of one-dimensional polynomials; therefore we consider up to \(150^2\) Gauss nodes to integrate functions from the family (73). And, again, because Gaussian quadratures are not hierarchical, each quadrature evaluation in a convergence test requires building a separate rule, with up to \(150^2\) nodes, from scratch.

Next, we build a reduced basis, its EIM, and the corresponding ROQ. The convergence of both methods, Gaussian and ROQ, is displayed as solid and dashed blue lines in Fig. 15. Instead of the factor \(\sim 4\) savings of the 1-D case for a modest error of \(10^{-4}\), for this same accuracy the savings of ROQ are \(\sim 12\). As we show in the next section, for realistic problems in GW physics the savings are much larger than in these test problems.

14 Accelerating parameter estimation with ROQ

Once a GW detection is made one wants to infer the source's astrophysical properties from the detection data. The goal here is to do a full Bayesian analysis to compute the posterior probability distribution function (PDF) of the set of astrophysical parameters \(\{\lambda \}\) which describe the source. Being able to do so in real time (or nearly so) is particularly important for rapid follow-ups of electromagnetic counterparts, enabling multi-messenger astronomy, among other motivations. For this reason we place emphasis on binary neutron stars or mixed pairs (black hole-neutron star).

Assuming that the detector data d contains the source's signal \(h(\lambda _{\mathrm{true}})\) and stationary Gaussian noise n in an additive way,

$$\begin{aligned} d = h(\lambda _{\mathrm{true}}) + n\, , \end{aligned}$$

the likelihood of data d corresponding to a parameter \(\lambda \) (Jaranowski and Krolak 2009) is given by

$$\begin{aligned} {\mathcal {L}}( d | \lambda ) := A \exp [ -\frac{1}{2} \langle d - h (\lambda ), d- h (\lambda ) \rangle ] , \end{aligned}$$
(74)

where \(\langle d , h(\lambda )\rangle \) is defined for this section—and closely following the notation of Canizares et al. (2013)—as a weighted inner product in the frequency domain for discretely sampled data,

$$\begin{aligned} \langle d , h(\lambda )\rangle := 4{\mathbb {R}} \ \varDelta f \sum _{i=1}^{L} \frac{{\bar{d}}(f_i) h(f_i, \lambda )}{S_{n}(f_i)}. \end{aligned}$$
(75)

The values \(d(f_i)\) and \(h(f_i, \lambda )\) are the (discrete) Fourier transforms at frequencies \(\{ f_i \}_{i=1}^L\), a bar denotes complex conjugation, and the power spectral density (PSD) \(S_{n}(f_i)\) characterizes the detector's noise.
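
In schematic form (a minimal sketch, not pipeline code), Eqs. (74)–(75) read as follows, with `d`, `h` and `Sn` arrays of the L frequency-domain samples and `df` the frequency resolution:

```python
import numpy as np

def inner(a, b, Sn, df):
    """Discrete noise-weighted inner product of Eq. (75)."""
    return 4.0*df*np.real(np.sum(np.conj(a)*b/Sn))

def log_likelihood(d, h, Sn, df):
    """Logarithm of the Gaussian likelihood (74), up to the normalization constant A."""
    r = d - h
    return -0.5*inner(r, r, Sn, df)
```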

The evaluation of likelihoods for parameter estimation (PE) is a high-dimensional problem. For binary systems, even excluding equations of state in the presence of matter, it is already a 15-dimensional one, with 8 intrinsic parameters (the mass and spin vector of each binary component) and 7 extrinsic ones, such as luminosity distance, coalescence time, inclination of the binary plane, and sky localization.

The bulk of the computational cost in (74) comes from two sources:

  1. Computing the gravitational wave candidates \(h(\lambda )\).

  2. Evaluating the inner products in (74).

Surrogate models can decrease the computational cost of (1), and ROQ that of (2). In this section we briefly review (2); that is, the use of ROQ, and ROM in general, for accelerating PE. This section is therefore intentionally limited in scope; it is not aimed at being a general review of approaches for accelerating PE, which is a field by itself given its importance.

For a given observation time \(T=1/\varDelta f\) and detection frequency window \((f_{\mathrm{high}} - f_{{\mathrm{low}}})\) there are

$$\begin{aligned} L = {\mathrm{int}}\left( \left[ f_{\mathrm{high}} - f_{{\mathrm{low}}} \right] T \right) \end{aligned}$$
(76)

sampling points in the sum (75). Usually L is large, and this accounts for the second source of computational cost when evaluating the likelihood (74).

14.1 Constructing the reduced order quadrature

The ROQ scheme for parameter estimation requires three steps, as described in Sect. 13. Here we slightly reformulate them in order to bring up some practical issues. These steps are:

  1. Construct a reduced basis, i.e., a set of n elements whose span reproduces the GW model within a desired precision.

  2. Using the EIM, construct the empirical interpolant and its nodes.

  3. Compute the ROQ weights (79), which are used to replace, without practical loss of accuracy, the inner product evaluations (75) by ROQ-compressed ones.

Comment 15

 

  • Step 1. The reduced basis only needs to be built over the space of intrinsic parameters of the waveform family. Furthermore, if the basis is generated using a PSD \(\equiv 1\), the representation of the waveform family can be used with any PSD whenever the weights are built as in Eq. (79). In practice, for the purposes of this section, basis generation proceeds in two stages. A reduced basis is first constructed, which can be evaluated at any frequency value. Next, this basis is evaluated at the L equally spaced frequency samples appropriate for the detector.

  • Step 2. Given an accurate basis of dimension n, it is possible to uniquely and accurately reconstruct any waveform \(h(\lambda )\) from only n subsamples \(\{ h(F_i, \lambda )\}_{i=1}^n\) using the EIM. The frequency nodes \(\{F_i\}_{i=1}^n\), selected from the full set \(\{f_i\}_{i=1}^L\), are also obtained by the EIM.

    This step provides a near-optimal compression strategy in frequency which is complementary to the parameter-space one of Step (1). The model's empirical interpolant, valid for all parameters, can be written as (cf. Eq. (19) of Field et al. 2014)

    $$\begin{aligned} {{{\mathcal {I}}}} [h](f_i, \lambda ) = \exp (-2 \pi \mathrm {i} t_{c} f_i ) \sum _{j=1}^n B_j (f_i) h (t_c=0, F_j, \lambda ) \approx h(f_i, \lambda )\, , \end{aligned}$$
    (77)

    where, for the sake of the discussion below, we have temporarily isolated the coalescence time \(t_{c}\) from the other extrinsic parameters.

  • Step 3. Except for \(t_c\), no extrinsic parameter affects the construction of the ROQs. The coalescence time, however, requires special treatment. One can see this by substituting Eq. (77) into Eq. (75),

    $$\begin{aligned} \langle d , h(t_{c}, \lambda )\rangle = \sum _{k=1}^n \omega _k(t_{c}) h(t_c=0, F_k, \lambda ) \, , \end{aligned}$$
    (78)

    with the ROQ weights given by

    $$\begin{aligned} \omega _k(t_{c}) = 4{\mathbb {R}} \varDelta f \sum _{i=1}^L \frac{ {\bar{d}}(f_i) B_k (f_i)}{ S_n(f_i) } \exp (-2 \pi \mathrm {i} t_{c} f_i ) . \end{aligned}$$
    (79)

In practice, the dependence of (79) on \(t_c\) can be handled through a simple domain decomposition, using the fact that an estimate of the coalescence time \(t_{\mathrm{trigger}}\), together with a time window W around it, is provided by the GW search pipeline. This suggests using the prior interval \([ t_{\mathrm{trigger}} - W, t_{\mathrm{trigger}} + W] \) for \(t_c\). This prior interval is then split into \(n_c\) equal subintervals of size \(\varDelta t_c\), with the number of subintervals chosen so that the discretization error is below the measurement uncertainty on the coalescence time. Finally, a unique set of ROQ weights is constructed on each subinterval.
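
Schematically, and only as a sketch of Eqs. (78)–(79) rather than of the actual LALInference implementation, the per-subinterval weights and the compressed overlap can be written as below; all array names are placeholders, and the real part is taken after contracting with the waveform samples at the nodes.

```python
import numpy as np

def roq_weights(d, Bfun, Sn, df, freqs, tc_grid):
    """Complex ROQ weights in the spirit of Eq. (79), one row per t_c subinterval.
    d, Sn: data and PSD at the L frequencies `freqs`; Bfun: (n, L) EIM functions B_k(f_i)."""
    phase = np.exp(-2j*np.pi*np.outer(tc_grid, freqs))   # (n_c, L)
    kernel = (np.conj(d)/Sn)[None, :]*phase              # (n_c, L)
    return 4.0*df*(kernel @ Bfun.T)                      # (n_c, n)

def compressed_overlap(w_row, h_at_nodes):
    """<d, h(t_c, lambda)> as in Eq. (78), from only n waveform samples at the EIM nodes."""
    return np.real(w_row @ h_at_nodes)

# Hypothetical domain decomposition as described above:
# tc_grid = t_trigger + np.arange(-W, W, dtc)   # one set of weights per t_c subinterval
```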

Step (3) is currently implemented in the LALInference pipeline as summarized in Algorithm (3). The offline Steps (1) and (2) are carried out independently. By construction, the approach guarantees that these offline steps need to be carried out only once for each waveform family model.

[Algorithm 3]

14.1.1 Compressed norm evaluations

There are two more terms to consider for fast computations of inner products, which can be seen from Eq. (74). One of them is \(\langle h(\lambda ) , h(\lambda ) \rangle \). Unless the GW model is a closed-form expression, one needs to build a fast online evaluation for this norm for values of \(\lambda \) that are only known at run time. This can be achieved by constructing a reduced basis for \(\langle h(\lambda ) , h(\lambda ) \rangle \), then its empirical interpolant, and finally its ROQ. The other term, \(\langle d , d\rangle \), only needs to be computed once per parameter estimation or search analysis, so it does not require any special treatment to speed up its calculation.

14.2 Total speedup

Notice that even though the ROQ weights (Alg. 3) are computed in the online stage, they only depend on the detection-triggered data d and not on any \(h(\lambda )\). Therefore, they can be computed in what can be referred to as a startup stage, which requires n full (size-L) inner product evaluations (75) for each \(t_c\) interval. As discussed below, in practice this cost is negligible, since each likelihood is subsequently calculated millions of times; the observed speedup of the whole PE study is then L/n, the reduction in the number of terms needed to compute (78) instead of (75). This assumes that \(n<L\) but, as we see below, this is indeed the case in problems of interest (furthermore, usually \(n \ll L\)).

This speedup follows from operation counts, but it has also been observed in practice in actual implementations in the LAL pipeline, as discussed below. For BNS, for the early advanced detectors' configuration (Abbott et al. 2020), ROQ were shown to provide a factor of \(\sim 30\times \) speedup in PE for a low-frequency sensitivity of 40 Hz, and \(\sim 70\times \) and \(\sim 150\times \) as the sensitivity band is lowered to 20 Hz and 10 Hz, respectively. This was true in all cases, without practical loss of accuracy or systematic biases.

Example 13

The material for this example is taken from Canizares et al. (2015).

The majority of a binary neutron star's GW signal can be expected to be in the inspiral regime (Sathyaprakash and Schutz 2009), which can be described by the closed-form TaylorF2 approximation (Buonanno et al. 2009). While TaylorF2 does not incorporate spins or the merger-ringdown phases of the binary's evolution, these might not be important for BNS parameter estimation and can therefore be neglected, at least in a first approximation (Singer et al. 2014); for a thorough study of this point using the SpinTaylorF2 approximation, see Miller et al. (2015). Even for this simple-to-evaluate waveform family, inference on a single data set used to require significant computational wall-time with standard parameter estimation methods (Aasi et al. 2013).

Canizares et al. (2015) first computed the observation time T required to contain a typical BNS signal. Next, a reduced basis of dimensionality n needed to represent this model for any pair of BNS masses was constructed. The upper frequency \(f_{\mathrm{high}}\) was fixed to 1024 Hz while \(f_{{\mathrm{low}}}\) varied between 10 Hz and 40 Hz.

The time taken for a BNS system with an initial GW frequency of \(f_{{\mathrm{low}}}\) to inspiral to 1024 Hz,

$$\begin{aligned} T_{\mathrm{BNS}} = \left[ 6.32 + 2.07\times \frac{10^6}{ \left( f_{\mathrm{low}}/{\text{Hz}} \right) ^3 + 5.86\left( f_{{\mathrm{low}}}/{\text{Hz}} \right) ^2 } \right] {\text{s}}, \end{aligned}$$
(80)

was empirically found by generating a \(\left( 1+1\right) \,M_{\odot }\) waveform (directly given in the frequency domain) and Fourier transforming it to the time domain, where the duration up to the point at which the waveform's evolution terminates is measured. Equation (80) and subsequent fits were found using genetic-algorithm-based symbolic regression software (Eureqa Software 2021; Schmidt and Lipson 2009, 2010). The length L implied by Eq. (76) is plotted in the top panel of Fig. 16.

Fig. 16

Image reproduced with permission from Canizares et al. (2015), copyright by APS. What is denoted here by n (the number of basis elements) is called N in that reference, a symbol used here instead for the size of the initial training set

Top: Length L (red dots) of a typical binary neutron star inspiral waveform, with the solid black curve connecting this data implied by the fit (80). Middle: Number of reduced basis waveforms (red crosses), with the solid black curve given by the fit (81). Bottom: Speedup implied by operation counts, as given by Eq. (82).

As discussed, each basis only needs to be constructed over the space of intrinsic parameters—in this case the 2-dimensional space of component masses, chosen to be in the range \([1,4]\,M_{\odot }\). This range is wider than expected for neutron stars, but ensures that the resulting PDFs do not have sharp cut-offs (Mandel et al. 2014). The number of reduced bases required to represent the TaylorF2 model within this range with a representation error around double precision (\(\sim 10^{-14}\)) can be fitted by

$$\begin{aligned} n_{\mathrm{BNS}} = 3.12\times 10^5 \left( f_{{\mathrm{low}}}/\mathrm {Hz} \right) ^{-1.543} \, , \end{aligned}$$
(81)

and is shown in the middle panel of Fig. 16. It was found that increasing the high-frequency cutoff to 4096 Hz only adds a handful of basis elements, while L changes by a factor of 4, indicating that the speedup for an inspiral-merger-ringdown model might be higher, especially given that not many EIM nodes are needed for the merger and ringdown regimes (Field et al. 2014). This was indeed shown in Smith et al. (2016), as highlighted below.

Recalling Eq. (76), the expected speedup from standard- to compressed-ROQ likelihood evaluations is given by

$$\begin{aligned} \frac{L}{n} \approx \left( 1024 {\text{Hz}} - f_{{\mathrm{low}}} \right) \frac{T_{\mathrm{BNS}}}{n_{\mathrm{BNS}}} \, , \end{aligned}$$
(82)

with \(T_{\mathrm{BNS}}\) and \(n_{\mathrm{BNS}}\) given by Eqs. (80) and (81), respectively. As reported in Smith et al. (2016), this speedup is indeed observed using LALInference, and is shown in Fig. 16 (bottom), with a reduction in computational cost and time of \(\sim 30\) for the initial detectors (with a cutoff of \(f_{{\mathrm{low}}} = 40\,\)Hz) and \(\sim 150\) once the advanced detectors reach \(f_{{\mathrm{low}}} \sim 10\,\)Hz.
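
These scalings follow directly from Eqs. (76) and (80)–(82), and can be checked with a few lines (a sketch evaluating the published fits; it reproduces speedups of roughly 30, 70 and 150 at 40, 20 and 10 Hz):

```python
def T_bns(f_low):
    """Fit (80): duration in seconds of a (1+1) Msun inspiral from f_low to 1024 Hz."""
    return 6.32 + 2.07e6/(f_low**3 + 5.86*f_low**2)

def data_length(f_low, f_high=1024.0):
    """Eq. (76) with T = T_bns(f_low)."""
    return int((f_high - f_low)*T_bns(f_low))

def n_basis(f_low):
    """Fit (81): number of reduced basis elements for TaylorF2 over [1, 4] Msun."""
    return 3.12e5*f_low**(-1.543)

for f_low in (40.0, 20.0, 10.0):
    print(f_low, data_length(f_low)/n_basis(f_low))   # expected speedup L/n, Eq. (82)
```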

14.3 Implementation in LALInference

Reduced order quadratures for compressed likelihood evaluations and Algorithm (3) were originally implemented in the LAL parameter estimation pipeline, known as LALInference (Aasi et al. 2013), by Vivien Raymond and Rory Smith; the resulting variant is called LALInference\(\_\)ROQ. This section provides unpublished details which were not included in Canizares et al. (2015), also courtesy of VR and RS. Below is a comparison between MCMC parameter estimation results using the standard version of LALInference and ROQ-accelerated studies using LALInference\(\_\)ROQ for the previous example, where TaylorF2 is the waveform model. Synthetic signals embedded in simulated Gaussian noise were injected into the LAL pipeline, for settings anticipating, at the time, the initial configuration of aLIGO, using the zero-detuned high-power PSD (LIGO Scientific Collaboration 2010) and \(f_{\mathrm{low}} = 40\,\mathrm{Hz}\).

The time window was taken to be \(W= 0.1\) s about the coalescence time \(t_c\) of a binary neutron star signal (Smith et al. 2013; Aasi et al. 2013). Following the procedure discussed in Sect. 14.1 above, LALInference\(\_\)ROQ discretizes this prior into \(n_c=2,000\) sub-intervals, each of size \(\varDelta t_c=10^{-4}\) s, and constructs a unique set of ROQ weights on each sub-interval. This width ensures that the discretization error is below the measurement uncertainty on the coalescence time, which is typically \(\sim 10^{-3}\) s (Aasi et al. 2013).

As expected, the ROQ and standard likelihood approaches produce, for all practical purposes, statistically indistinguishable results for posterior probability density functions over the full 9-D parameter space (2 intrinsic dimensions and the full 7 extrinsic ones). Figure 17 and Table 3 describe results for the nine parameters obtained in one particular MCMC simulation; other simulations were qualitatively similar.

Table 3 Chirp mass \({\mathcal {M}}_c\), symmetric mass ratio \(\eta \), source distance (from Earth) D, coalescence time \(t_c\), right ascension \(\alpha \), declination \(\delta \), coalescence phase \(\phi _c\), polarization phase \(\psi \), inclination \(\iota \) and Signal-to-Noise Ratio (SNR) of the analysis from Fig. 17

It is also useful to quantify the fractional difference between the 9-D likelihood function computed using ROQ and the standard approach. This fractional error has been found to be

$$\begin{aligned} \varDelta \log \,{\mathcal {L}} = 1 - \left( \frac{ \log \,{\mathcal {L}}}{\log \,{\mathcal {L}}_{{\mathrm{ROQ}}} } \right) \lesssim 10^{-6} \end{aligned}$$

in all cases.

In addition to providing indistinguishable results, ROQ-accelerated inference is significantly faster: the ROQ-based MCMC study with the discussed settings takes \(\sim 1\) h, compared to \(\sim 30\) h using the standard likelihood approach, in remarkable agreement with the expected savings based on operation counts. The wall-time of the analysis is proportional to the total number of posterior samples of the MCMC simulation, which in this case was \(\sim 10^{7}\). The startup stage required to build the ROQ weights has negligible cost and is completed in near real time (\(\sim 30\) s), equivalent to \(\sim 0.028\%\) of the total cost of a standard likelihood parameter estimation study.

Fig. 17

Probability density functions for the chirp mass \({\mathcal {M}}_c\), symmetric mass ratio \(\eta \), inclination \(\iota \), coalescence phase \(\phi _c\), right ascension \(\alpha \), declination \(\delta \), polarization phase \(\psi \), source distance (from Earth) D and coalescence time \(t_c\) of a simulated event in LIGO/Virgo data. In green, the results obtained in \(\sim 30\) h with the standard likelihood; in blue, those obtained in \(\sim 1\) h with the ROQ. The injection values are shown in red and listed in Table 3. The hatched region is the overlap of the two sets of PDFs. Credit: Vivien Raymond

For a lower cutoff frequency of \(f_{{\mathrm{low}}} \sim 20\,\mathrm{Hz}\), the reduction is from a couple of weeks to hours. For advanced detectors with \(f_{{\mathrm{low}}} \sim 10\,\)Hz, the longest BNS signals last around 2048 s. Assuming a fiducial high-frequency cutoff of 1024 Hz, which approaches the upper limit of the sensitivity of aLIGO/AdV, datasets can be as large as \(L\sim 1024\,\mathrm {Hz}\times 2048\,\mathrm {s} \sim 10^6\). Assuming that the advanced detectors require at least \(\sim 10^7\) posterior samples, this implies runtimes upwards of \(\sim 100\) days and a petabyte worth of model evaluations using the standard approach. ROQ, on the other hand, reduce this to hours. Remarkably, this approach applied to the advanced detectors operating at design sensitivity is faster than even the standard likelihood approach used for the initial detectors. Additionally, with parallelization of the sum in each likelihood evaluation, essentially real-time full Bayesian analysis might be achieved.

14.4 Further reading

Even though ROQ decrease the computational cost of likelihood evaluations for binary neutron stars in advanced detectors by about two orders of magnitude, this still amounts to the order of hours, which does not meet the need for real-time follow-up of GW detections through electromagnetic counterparts. Therefore, other ideas, beyond or on top of ROQ, are needed. One approach, named focused reduced order quadratures (FROQ) (Morisaki and Raymond 2020), is to restrict the parameter estimation search based on trigger values from the detection pipeline; this further reduces the cost to around 10 min, a good target from an astrophysical point of view for searches of electromagnetic counterparts. Even though the study of Morisaki and Raymond (2020) uses post-Newtonian closed-form approximations, waveform evaluation cost should not be an obstacle to extending FROQ to NR-based surrogates.

Smith et al. (2016) built the first ROQ for precessing inspiral-merger-ringdown compact binaries, using the IMRPhenomPv2 model, which includes all 7 intrinsic parameters. For a low cutoff frequency of \(f_{{\mathrm{low}}} =20\,\mathrm {Hz}\) the authors found speedups between \(\sim 4\times \) (for short, BBH-like signals) and \(\sim 300 \times \) (for long, BNS-like signals), leading to estimates of 6–12 h for a full PE study, as opposed to \(\sim 1\) day and \(\sim 6\) months, respectively, without ROQ. The resulting code is also available from LALInference. This study was extended in Meidam et al. (2018) by adding a parametrization of deviations from general relativity on top of IMRPhenomPv2.

We mentioned before that one of the computational burdens of PE is the generation of the gravitational waveforms themselves, and that this can be addressed using surrogate models. This speedup was analyzed in detail using SVD-interpolated waveforms in Smith et al. (2013).

Even though in this section we have focused on the use of ROQ for parameter estimation, the generic procedure for GW searches is similar.

As evidence of the generality of ROQ, they were used in Brown et al. (2016) in the context of studies of laser light scattering, with a speedup of \(\sim 2,750\) and essentially no loss in precision.

As emphasized in Sect. 15, an important topic that we have not touched upon in this review is that of statistical machine learning approaches (as opposed to the algorithmic ones of this review) for GW prediction and analysis. An exception worth making here is Chua et al. (2019), since it bridges ROM with artificial neural networks (ANN) and addresses an important problem: the challenge of interpolating the projection coefficients in a reduced basis approach. Instead of interpolating them algorithmically, the authors use an ANN to map the relationship between parameters and projection coefficients. Though the focus of the reference is on parameter estimation of GWs, the fundamental idea is applicable to the predictive step itself; in fact, it was proposed around the same time in a different context in Wang et al. (2019).

15 Challenges, open issues and comments

To discuss one of the main challenges left in reduced order and surrogate models for gravitational waves, we review the case of inspiraling spinning, non-precessing PN binary black hole waveforms from Field et al. (2012). The 4-D sector of parameter space studied corresponds to \(m_i\in [3, 30]\,M_{\odot }\) and \(\chi _i\in [-1,1]\) (\(i=1,2\)). The authors build reduced bases for parameter dimensions \(\mathrm{dim}=2,3,4\), corresponding to non-spinning, spinning-aligned, and spinning non-precessing cases, respectively. The histogram of Fig. 18 corresponds to the 4-D case and shows the greedy parameter counts for mass ratios \(m_1/m_2\in [0.1, 10]\), for a representation error of \(\sim 10^{-14}\) for any waveform in the parameter range. Figure 19 shows the distribution of greedy points in parameter space for the 3-D case, with aligned or anti-aligned spins and \(\chi _1=\chi _2=\chi \). The lower cutoff frequency considered at the time was 40 Hz.

Fig. 18

Image reproduced with permission from Field et al. (2012), copyright by APS

The greedy parameter choices for mass ratio \(m_1/m_2\in [0.1, 10]\). Notice that the greedy points cluster in the near-equal mass sector.

Fig. 19

Image reproduced with permission from Field et al. (2012), copyright by APS

Greedy parameter points for the 3-D case. Notice the sparsity of the distribution of points.

Fig. 20

Image reproduced with permission from Field et al. (2012), copyright by APS

Representation errors as a function of the number of basis elements n for \(\mathrm{dim}=2,3,4\). The value of n barely grows with the dimensionality \(\mathrm{dim}\) of the problem, suggesting that, at least in the context of GWs, the curse of dimensionality might be beaten by the RB approach.

Comment 16

 

  1. From Fig. 18 it can be seen that the largest number of selected greedy points is near the equal mass case. The reason is that, for a fixed frequency range, the longest waveforms correspond to nearly equal mass components; they have a larger number of cycles and therefore more structure and representative information. Notice that the choice of seed in the greedy algorithm in general breaks the symmetry \(m_1 \leftrightarrow m_2\). If \(m_1=m_2\) is chosen as a seed, there is a degeneracy in the choice of the next greedy point, with 2-tuples (related by \(m_1 \leftrightarrow m_2\)) giving the same representation error. An (arbitrary) choice has to be made to remove this degeneracy, which again breaks the aforementioned symmetry.

  2. In the non-spinning case, around 1,700 basis waveforms are needed for machine precision representation, see Fig. 20. This provides a rather large number of data points for fitting each waveform's phase and amplitude at the EIM nodes over the considered mass range when building a surrogate as proposed in Sect. 11.2. Therefore, a low-order local polynomial or spline interpolation would probably be accurate enough for practical purposes when the fiducial model is fast enough to evaluate (as it is in this example), while avoiding Runge's phenomenon (discussed in Sect. 9).

  3. The reason why the reduced basis approach seems to beat the curse of dimensionality as the dimensionality of the problem increases can be seen from Fig. 19: the data become extremely sparse, and the number of basis elements remains approximately constant while the volume of the parameter space grows exponentially with the number of dimensions (Fig. 20). This sparsity is a very much desired property for any ROM approach, but it introduces difficult challenges, as discussed below.

In the case of projection, when the waveforms are known in closed form or can be computed in a relatively inexpensive way, there are no major issues: in order to evaluate the representation of a waveform one simply computes its projection coefficients onto the basis. The same holds for predictive models, since in those cases there is an “arbitrary” number of local waveforms to use for the fitting in Step 3 of an EIM-based interpolant (Sect. 11.2): a local fit of low order (be it in the form of interpolation, least squares, splines, etc.) can be used with high accuracy while avoiding Runge's phenomenon.

A very different situation is one in which the training set is computationally expensive to build, as when it involves solving the full Einstein equations through numerical relativity: only a sparse training set can be constructed, and there are not enough nearby points in parameter space to perform local fits with sufficient accuracy.

One solution is to keep running NR simulations to enrich the training set. With enough computational resources, this approach might be affordable for binary black holes. For binary neutron stars or mixed pairs, on the other hand, where the parameter dimensionality grows considerably, this might be impractical if not outright impossible. This challenge follows directly from the agnostic, data-driven approach, chosen on purpose for the surrogate models discussed in this review, in which no differential equations are invoked. The motivation for this design is that building a production Einstein solver for realistic problems of interest is a major effort, and an intrusive approach would require major algorithmic and software changes; as a consequence, the NR community might be reluctant to adopt it. Yet, that might be the only feasible approach in the long run. An intermediate solution might be further research into state-of-the-art global approaches to approximation in multiple dimensions which are fast to evaluate and do not suffer from Runge's phenomenon, be they deterministic or statistical machine learning/artificial intelligence-type algorithms.

In this review we have mostly covered algorithms of the first kind and only tangentially mentioned some results of the second. There is a lot of ongoing activity and there are interesting results on the ML/AI side, but it appears premature to cover them here, so we defer their discussion to a future, updated version of this Living Review. For a recent, though preliminary, comparative study of ML/AI surrogates, see Setyawati et al. (2020). In terms of parameter estimation within reduced order modeling, the use of focused reduced order quadratures, discussed at the end of Sect. 14, might be a good enough solution for near real-time, or fast enough, follow-ups of electromagnetic counterparts.

A related active field, though not yet in GW science, is that of uncertainty quantification (UQ) (Smith 2013) and, in particular, generalized polynomial chaos (Xiu 2010), which we have not discussed in this review but which is closely related to reduced basis methods and can remove the usual assumption (and its consequences) that the noise is Gaussian and stationary, which is not the case in GW science. Applications of these areas of research to GW science are unknown to us, so we also defer them to a future version of this review. It should be highlighted, however, that other techniques for dealing with non-Gaussian noise have already yielded new detections from public LIGO data (Zackay et al. 2021), so the importance of the topic cannot be overemphasized.