Sparse polynomial prediction

In numerical analysis, sparse grids are point configurations used in stochastic finite element approximation, numerical integration and interpolation. This paper is concerned with the construction of polynomial interpolator models in sparse grids. Our proposal stems from the fact that a sparse grid is an echelon design with a hierarchical structure that identifies a single model. We then formulate the model and show that it can be written using inclusion–exclusion formulæ. At this point, we deploy efficient methodologies from the algebraic literature that can simplify considerably the computations. The methodology uses Betti numbers to reduce the number of terms in the inclusion–exclusion while achieving the same result as with exhaustive formulæ.


Introduction
The application of computational algebraic methods to experimental design has been extensively covered by the authors, co-workers and others following the seminal paper (Pistone and Wynn 1996), monograph (Pistone et al. 2001) and review by the authors (Maruri-Aguilar and Wynn 2015), with other developments such the study of duality between fractions of factorial design and models (Maruri-Aguilar et al. 2012). In this work a design is seen as a set of points and defines a natural polynomial model which can be used to interpolate data at design points. We concentrate on designs that are the union of product tensor structures, known as echelon designs. For such designs, the exponents of the interpolator have the same form as the design, giving a duality between designs and models, with both exhibiting hierarchical structure.
Another feature of this hierarchical design-model framework is the central role played by inclusion-exclusion (IE) formulae. Inclusion-exclusion has been used extensively in reliability theory, where formulae are derived from monomial ideals, see (Sáenz-de Cabezón Wynn 2009). We use the same principles to describe the echelon design points as IE of tensor (product) designs. The benefit of this method is the reduction in complexity achieved by using the so-called minimal free resolution. In short, the coefficients in the IE are Betti numbers of the resolution of the monomial ideal.
In this paper we apply the same machinery to study sparse grids. These grids are used widely in quadrature, particularly in the Finite Element solution of differential equations, and the difference of terminology should be noted. Sparse grids are sometimes referred to as Smolyak grids following the seminal paper (Smolyak 1963). Recent work has used sparse grids to the solution of differential equations (Nobile et al. 2008), for interpolation (Barthelmann et al. 2000) and for integration (Novak and Ritter 1996).
In our development, a set of quadrature points becomes an experimental design (design for short). This approach is in line with the recent history of the design of computer experiments, also termed as simulation experiments or simply simulation. We show that a Smolyak grid satisfies the same IE despite not having the same hierarchical form as discussed above.
The main result of this paper is that the IE formulae which hold for interpolators over echelon designs extend to more general sparse grids. Polynomial interpolators based on the "echelon" designs inherit the IE decomposition that uses the natural product model interpolators. Remarkably, the same IE forms are inherited by sparse grid interpolators. Moreover the reduced forms use the multi-graded Betti numbers from the algebraic theory.
Betti numbers are at the foundation of algebraic topology, and describe features of the geometry of manifolds via topologically equivalent, and one might say more abstract entities, namely simplicial complexes. In statistics they have been used intensively within the relatively new area topological data analysis (TDA) (Edelsbrunner Harer 2010). However, here Betti numbers are used in a different way, namely as a way of reducing the complexity of the inclusion-exclusion lemma. It is quickly observed that in handling large inclusion-exclusion formulae, there are redundancies; that is to say a large amount of cancellation of terms and many sets may be empty. This happens partly because the underlying dimension may be much smaller that the number of sets and partly because of the complexity of the inter-relation between the sets. Both in TDA and IE there are "minimal free resolutions" (a term from the underlying algebraic topology) which gives, in a well-defined sense the, least complex formulae in which the plus and minus signs of the formulae are replaced by (multigraded) Betti numbers (Naiman and Wynn 1997;Sáenz-de Cabezón Wynn 2009).
The "sets" in this paper are the tensor grids of exponents α in monomials x α . Thus, in summary, the Betti numbers support less complex IE formulae, in the sense of number of terms, in which the complexity is, in a sense, buried in the Betti numbers. The main purpose of this paper is to point out this structure, that is to say complexity reduction using the Betti numbers, is inherited by the natural interpolators based supported on the tensor grids and their extension to sparse grids.
The rest of the paper is ordered as follows. In Sect. 2 we introduce echelon designs and the duality between designs and models. We discuss the tensor product notation and the extreme corner points that define echelon designs. Section 3 contains the main results for interpolation. We define indicators which are the building blocks of a Lagrange interpolator over a single tensor design and give the result of inclusionexclusion which we prove in Appendix 1. This result allows interpolation over general echelon designs which is achieved by IE formulae. In Sect. 4 we apply our results to sparse grids and show this with an example. At the core of this paper are two approaches to response surface modelling, the numerical analysis and the statistical viewpoints. We briefly discuss these two approaches and suggest future work in Sect. 5.

Designs and models
We start with a general definition of a design that covers designs for response surface modelling and designs for computer experiments. The objective is to study the effect of k continuous factors x 1 , . . . , x k ∈ R on a real valued output y, with information collected in an experimental design.
Definition 1 A design D is a finite set of n points in R k . For every factor, the values of the coordinates of points in D are the levels of the design.
In this paper we are interested in a special type of designs called echelon designs. The following two definitions form the core of this paper: a hierarchical grid in the integers and the echelon design built from it. Both objects below are designs as of Definition 1.
Definition 2 A reference grid G is a finite subset of Z k ≥0 with the order ideal property, that is for every point α = (α 1 , . . . , α k ) ∈ G, all α ∈ Z k ≥0 such that 0 ≤ α ≤ α (coordinatewise) are also in G.
Definition 3 An echelon design is a design obtained from a reference grid G by changing the levels of each factor to any finite real values using a one to one mapping.
Definition 3 starts from a reference grid G to build an echelon design D. The converse procedure is important, namely that any echelon design D maps to a unique reference grid G(D), hence the use of the word reference. Using the identity map, a reference grid is trivially an echelon design.
Example 1 Full factorial experiments 2 k are echelon designs. The echelon property is not limited to designs with two levels, and if the design has all combinations of levels of factors, then the full factorial design n 1 n 2 · · · n k is an echelon design, where n i is the number of levels of the i-th factor.
In general, regular fractions of factorial designs are not echelon. A simple case of this is the fraction 2 2−1 with point coordinates (−1, −1), (1, 1) that cannot be converted  The coordinates of D were obtained with the function smolyak.quad from the R library gss (Gu 2014). Given the symmetry of G, this map is valid for both coordinates x 1 and x 2 to a reference grid by relabeling levels. However, certain fractions of factorial designs are echelon designs because they can be converted to a reference grid, such as the designs used in the method of elementary effects (EE) for screening active factors in computer experiments, see (Morris 1991).

Example 2
The central panel of Fig. 1 shows a reference grid G with n = 17 points, which is transformed into a Smolyak design by the map shown in Table 1. For instance, all the 7 points of G with horizontal (vertical) coordinate equal to zero are relabeled to a x 1 (x 2 ) coordinate with value 0.5. The map is not unique, as points of G with coordinate value 1 (or 2) could be relabelled to be one of 0.113 or 0.887. The left panel in Fig. 1 shows the Smolyak design, while the right panel shows another echelon design with the same size and reference grid G. In both Smolyak and echelon, the numbers at the bottom (left) are numbers of points with the same x 1 (x 2 ) coordinate. The numbers shown in G give the actual coordinates of points.
A reference grid G is defined by a special set of corner points A(G). This set of corners A(G) is comprised with those points α ∈ G such that there are no other points α ∈ G that satisfy α ≤ α coordinatewise. Trivially, the reference grid defined by A = A(G) equals G, i.e. G(A) = G. In this last statement, we slightly abused notation using G for a reference grid and G(A) as the reference grid built from the set of extreme corners A.

The simplest reference grid is created with a single corner point
is a product design, written equivalently as and we refer to it as a tensor design. A tensor grid G(α) is already a full factorial design and any echelon created from it will also be a full factorial. We now discuss the polynomial models for our designs. The building block for the model is a monomial whose exponents are entries of α, that is x α = x α 1 1 · · · x α k k . We write the model over the tensor grid G(α) as and in general for a reference grid G with corners A(G) we have the model: This model has observations collected in a reference grid design G and uses points α in G as model exponents. This duality between design and model was referred to in the introduction, see also (Pistone et al. 2001). When we consider the elements of G as exponents of a model, then A(G) are the exponents of the set of directing monomials, see (Bates et al. 2003).
Although the model y A(G) (x) uses all the points and exponents in G, we refer to it by its most extreme points A(G). This notation is necessary for later as our methods depend on inclusion exclusion. We also use the notation G(A) when we wish to recapture the reference grid from its defining corner set A.
Example 3 Consider the reference grid G shown in Fig. 1. If we start from G, we recover its extreme points as A = A(G) = {(0, 6), (2, 2), (6, 0)}. Taking a slightly different approach, we recover the full grid of n = 17 points from A as G = G(A).
The following lemma links models built with the reference grid and an echelon design. This result is taken from Pistone et al. (2001).

Lemma 1 If D is an echelon design, then it identifies a unique model with exponent vectors given by the elements of the reference grid with extreme corner A(G(D)).
In the following section we develop the models for the reference grids using indicator functions. a real data value y α is available. We will construct a polynomial interpolator using the full set of data {y α : α ∈ G(α)}.
To build the interpolator, we start by defining the indicator function for the point α in the grid G(α). The indicator is the product of k one dimensional indicators obtained by Lagrange interpolation: α (x) that makes explicit that this is the indicator function for point α in the grid G(α) for a given α ∈ Z k ≥0 . The notation I For a point x in G(α), the function I α (x) has the indicator property The interpolator function over the full tensor grid G(α) is built as a linear combination of the indicator functions, using the values y α as coefficients: (2) Example 4 Consider the grid {(α 1 , α 2 ) : α 1 = 0, 1; α 2 = 0, 1, 2, 3} of n = 8 points in Z 2 ≥0 . This is the reference grid G(α) for α = (1, 3), equivalently written as a tensor grid

Interpolation on echelon design
We consider interpolation over a reference grid G(A), for a set of corner points A. Suppose that this grid is the union of two tensor grids Here A = {α, γ }, where α, γ are distinct vectors with non-negative integer entries.
To avoid trivial cancellations, in what follows the vectors α, γ do not dominate each other, that is neither α ≤ γ nor α ≥ γ . By this requirement we avoid the inclusions The following is a simple inclusion-exclusion for the designs, using indicator functions in an obvious way: where the indicator I (G) takes the unit value on grid points in G and zero on integer vectors (grid points) outside G.
The greatest common divisor between monomials x α and x γ is the monomial A main result of this paper is that the set inclusion-exclusion (IE) for the design points extends to the interpolators. The proof is given in Appendix 1.
In general, reference grids we will have more than two corners and the inclusionexclusion will require several layers. Figure 2 shows the grids involved in the IE with three generators α, γ , δ. The top left panel is the reference grid G = G ({α, γ , δ}); each of the other figures corresponds to a specific labeled grid; black dots are points in the grid while grey dots are points in G but outside the labeled grid.
The main result considers a reference grid defined by a set of corner points A. We give the theorem below, whose proof by induction we omit.

Theorem 1 Let G(A) be a reference grid for a set of corner points A. Then the following holds:
(i) There is an IE for G(A) in terms of indicators of tensor grids: (ii) The unique interpolator y A (x) for points in G(A) satisfies the following

Betti numbers
In algebraic theory, the inclusion-exclusion of Theorem 1 is known as the Taylor resolution, which is the most complex case of IE, namely using all the singleton generators, then all possible pairs, triples and so on. The Taylor resolution is often highly redundant and it is possible to reduce the complexity of the IE by using other "resolutions" Sáenz-de Cabezón Wynn (2009). Here is a simple example to show reduction in IE complexity by cancelation.  (G((2, 0)) ∩ G ((1, 1)) ∩ G((0, 2))).

For this grid, or any echelon design built from it, the IE interpolator is
An efficient way of construction the sought inclusion-exclusion uses results from monomial ideals (Sáenz-de Cabezón Wynn 2009). Recall that a monomial ideal is a set of polynomials, and in our case, we build the ideal by flipping the grid G around a point. After flipping, the points become the exponents of monomials that generate an ideal. We use the shorthand that the monomial ideal obtained by this flipping the corners in A around a point c is the derived ideal. The objective of creating the derived ideal is to determine efficiently the IE decomposition of the grid G. Figure 3 gives a graphical explanation of the process. We flip the corners in A around a point c as follows, every corner α ∈ A is flipped to becomeα = c − α and similarly for γ, δ in the same figure. When the whole grid is flipped, the point c becomes the origin as shown in the right panel. The choice of flip point c is not critical as long as it is at least equal to the maximum value of G for the coordinate in turn. In the figure, c was even beyond the maximum value per coordinate which has no influence in the result.

Example 8
We continue with the grid G of Example 7. To be specific, if the corner points A are flipped around the maximum point per coordinate c = (2, 2), we retrieve the same A which when turned to exponents yields x 2 1 , x 1 x 2 , x 2 2 which are used to generate the derived ideal. For the generator corners of G in Example 2, we flip around c = (6, 6) and convert to exponents so that we have ideal generators x 6 1 , x 4 1 x 4 2 , x 6 2 . In a more complex case, the grid of Example 6 uses flip point c = (2, 2, 3) which gives monomials x 3 , x 2 1 , x 2 2 .
The result of homological computations with ideals is a collection of Betti numbers β α, j . These numbers are non-negative integers that provide the coefficients in a maximally reduced IE and thus are what we are looking for. Each number is indexed by a vector of non-negative integers α and an integer j that controls the role of the number in the inclusion-exclusion. We only use positive β α, j , as those zeroes do not contribute to the result and for simplicity, we collect the Betti numbers for the same j in a set that we label B j . The computations are carried out using, for example, the function HilbertSeriesMultiDeg of the CoCoA computer package (Abbott et al.). The results of computations over the derived ideal have to be reversed around c so that we can use them for the grid.
The following theorem is a simplification of the exhaustive series of Theorem 1 and it is given without a proof.
Theorem 2 Let G(A) be an echelon for a set of corner points A. Let β α, j be the multigraded Betti numbers associated with the minimal free resolution of the derived monomial ideal, and let the non-zero Betti numbers with the same value of j be collected in B j . Then (i) There is an IE for I (G(A)): (ii) The unique interpolator over G(A) can be decomposed into tensor grid interpolator of the same form: Theorems 1 and 2 yield exactly the same result concerning either the composition of I (G(A)), or the interpolation formulae for y A (x) over G(A). However the former is much more complex, involving 2 |A| − 1 terms, while the latter has many of these redundancies removed. As shown by example earlier, the reduction in complexity achieved with Theorem 2 depends on the generators of the derived ideal. The following example continues an earlier grid with maximal redundancy.

Extension to sparse grids
Sparse grids are designs used in stochastic finite element approximation, numerical integration and interpolation Nobile et al. (2008) andPlumlee (2014). They have a reduced number of runs while preserving the accuracy of approximations.
To construct a sparse design, we use the construction of Definition 3. That is, we start from a reference grid G(A). Recall that although the reference grid lies in Z k ≥0 , it is already an echelon design, and each factor x i has n i levels 0, 1, . . . , n i − 1. Next, define new levels for the design D, in which the levels of variable x i are replaced by new levels The replacement of levels is arbitrary and only requires a one to one mapping, hence the design D may not resemble at all its reference grid G. The operations described above can be reversed and design D in the class defined by the operation has a unique reference grid G(D).
The reversibility of operations implies that the design D still identifies a single polynomial model with exponents given by points in the reference grid G. The main result of the paper is that the inclusion-exclusion results for the tensor designs and interpolators that hold for the echelon designs extend to the sparse designs obtained by a set of transformation of level of each factor. We shall now call such a design a Smolyak grid.
It is clear that any component tensor grid of the reference grid is transformed into a tensor grid in the design D. Thus a Smolyak grid defined from a reference grid G(D) has a set of defining corner points A which then indexes an IE, for D. We use the simple notation yα to indicate the interpolator over the block defined by the "corner" α, andÃ for the full Smolyak grid. Over the sparse gridÃ, the (Betti) reduced IE polynomial interpolator is Example 10 Consider a Smolyak design D with k = 2 factors and n = 161 points and with 15 levels per factor. The reference grid G has generators {(14, 6), (6, 14)} so that the inclusion-exclusion involves three tensor grids and corresponding design points as shown in Fig. 4.

Conclusions and future research
The sparse grids described are based on a reference grid and despite the apparently predominant one-factor-at-a-time structure, the relabelling of coordinates allows an efficient exploration of the design space while keeping the number of runs low. These Sparse grids have been mainly used in two disciplines: numerical analysis (NA) and experimental design (DE). A motivation for the current paper was to rediscover and clarify intricate formulae used in the numerical analysis literature by using the Betti number decomposition of the last section.
For a fixed reference grid G(A), the construction of a design for which G(D) = G(A) still requires the specification of the numerical values of levels of each factor z i,1 , . . . , z i,n i . Thus if the design is to optimised, it should be optimised with respect both the reference grid G(D) and the spacings.
By analogy with factorial design and response surface design, one may outline a design first without specifying the spacing. A "star-composite design" is an example, being made up of a factorial design and a star without specifying the spacing of either. The spacing may be decided later based on the shape of the whole design region, preset candidate points, optimum design or some structural property e.g., rotatability. We could say, informally, that we shown one method of producing a generalised star design, but have left till later issues of spacing. This is partly because the numerical analysis definitions of optimality and the statistical ones are different so that more research is needed. We also note that although we worked on saturated models, actions like adding replicate points to improve statistical capabilities do not change the model identified.
In NA one may consider an approximation of a function by yÃ(x) and select the spare gridÃ making use of the inequality where y * is some best approximator to f , in some class, is the so-called Lebesgue constant, and · a chosen metric. But sparse grids are beginning to attract interest in the statistical literature; an example is Plumlee (2014). There is an opportunity to draw the extensive literature on optimal design for application to sparse grids, in the choice of reference grid and spacing, whether for physical response surface experiments or computer experiments.
Whether in NA or DE, the use of algebra reduces complexity by tensor decomposition and we believe, this will help us to understand in future research the important trade off between sparsity and optimality. and y α∧γ (x) = 0≤α ≤α∧γ where we used the notation I (γ ) α (x), I (α∧γ ) α (x) emphasizing that these are indicators over different reference grids, namely G(α) and G(α ∧ γ ). The ranges of summation above are G(α) and G(α) ∩ G(γ ) respectively and neither of these include the current case α ∈ G(α) \ (G(α) ∩ G(γ )) hence y γ (x) and y α∧γ (x) do not interpolate for α . However, the core part of this case is that when evaluated at x = α , these two functions take the same value y γ (α ) = y α∧γ (α ), which we show next. By the construction of y γ (x), the only non zero terms I (γ ) α (α ) of y γ (α ) are for those points in G(γ ) that share the largest number of coordinates with the given α , as no single point shares all coordinates with α because α / ∈ G(γ ). Without lack of generality, assume that a point α ∈ G(γ ) shares the first k − 1 coordinates with α ∈ G(α) \ (G(α) ∩ G(γ )), and that the coordinate that is not equal between this generic α ∈ G(γ ) and α k is the k-th coordinate, that is α = (α 1 , α 2 , . . . , α k−1 , α k ). The inner product in the i-th coordinate in Equation (1) is and this product is equal to one because it is evaluated at x i = α i . This result holds for coordinates i = 1, . . . , k − 1.
To complete our development for I (γ ) α (α ), we compute the product of Equation (A1) evaluated at x k = α k that becomes the indicator, that is This non-zero quantity is the coefficient of y α in the evaluation of y γ (α ). For the indicator I α∧γ,α (α ) the results above hold, namely that the indicator terms are zero except for when α shares all but one coordinate with α and that the product of terms equals one. This is true despite the fact that the product of Equation (A1) in I α∧γ,α (x) runs for values of m in the range 0, 1, α i − 1, α i + 1, . . . , min(α i , γ i ) with min(α i , γ i ) ≤ γ i . In other words, indicators for different reference grids involve a different number of products.