Optimal subsampling design for polynomial regression in one covariate

Improvements in technology lead to increasing availability of large data sets which makes the need for data reduction and informative subsamples ever more important. In this paper we construct D-optimal subsampling designs for polynomial regression in one covariate for invariant distributions of the covariate. We study quadratic regression more closely for specific distributions. In particular we make statements on the shape of the resulting optimal subsampling designs and the effect of the subsample size on the design. To illustrate the advantage of the optimal subsampling designs we examine the efficiency of uniform random subsampling.


Introduction
Data reduction is a major challenge as technological advances have led to a massive increase in data collection, to a point where traditional statistical methods fail or computing power cannot keep up. In this case we speak of big data. We typically differentiate between the case where the number of covariates is much larger than the number of observations and the case where the massive amount of observations is the problem. The first case is well studied, most notably by Tibshirani (1996) introducing the LASSO, which utilizes ℓ1 penalization to find sparse parameter vectors, thus fusing subset selection and ridge regression. The second case, often referred to as massive data, can be tackled in two ways. Firstly, in a probabilistic fashion, creating random subsamples in a non-uniform manner. Prominent studies include Drineas et al. (2006), Mahoney (2011) and Ma et al. (2014). They present subsampling methods for linear regression models, called algorithmic leveraging, that sample according to probabilities based on the normalized statistical leverage scores of the covariate matrix. More recently, Dereziński and Warmuth (2018) study volume sampling, where subdata is chosen proportional to the squared volume of the parallelepiped spanned by its observations. Conversely to these probabilistic methods, one can select subdata by applying deterministic rules. Shi and Tang (2021) present such a method, which maximizes the minimal distance between two observations in the subdata. Wang et al. (2021) propose orthogonal subsampling inspired by orthogonal arrays. Most prominently, Wang et al. (2019) introduce information-based optimal subdata selection (IBOSS) to tackle big data linear regression in a deterministic fashion based on D-optimality.
In this paper we study D-optimal subsampling designs for polynomial regression in one covariate, where the goal is to select a percentage α of the full data that maximizes the determinant of the information matrix. For the conventional study of approximate designs in this setting we refer to Gaffke and Heiligers (1996). Heiligers and Schneider (1992) consider specifically cubic regression on a ball. We consider D-optimal designs with measure α that are bounded from above by the distribution of the known covariate. Such directly bounded designs were first studied by Wynn (1977) and Fedorov (1989). Pronzato (2004) considers this setting using a form of the subsampling design standardized to one and bounded by α^{-1} times the distribution of the covariates. More recently, Pronzato and Wang (2021) study the same in the context of sequential subsampling. For the characterization of the optimal subsampling designs we make use of an equivalence theorem by Sahm and Schwabe (2001). This equivalence theorem enables us to construct such subsampling designs for various settings of the distributional assumptions on the covariate. Here we will only look at distributions of the covariate that are invariant to a sign change, i.e. symmetric about the vertical axis. We first discuss the shape of D-optimal subsampling designs for polynomial regression of degree q. We then study quadratic regression under several distributional assumptions more closely, after showing two examples for simple linear regression. In particular we take a look at the percentage of mass of the optimal subsampling design on the outer intervals compared to the inner one, which changes drastically with the distribution of the covariate, particularly for heavy-tailed distributions. In addition we examine the efficiency of uniform random subsampling to illustrate the advantage of the optimal subsampling designs. All numerical results are obtained by the Newton method implemented in the R package nleqslv by Hasselman (2018). All relevant R scripts are available on a GitHub repository: https://github.com/TorstenReuter/polynomial_regression_in_one_covariate.
The rest of this paper is organized as follows. In Section 2 we specify the polynomial model. In Section 3 we introduce the concept of continuous subsampling designs and give characterizations for optimization. In Sections 4 and 5 we present optimal subsampling designs in the case of linear and quadratic regression, respectively, for various classes of distributions of the covariate. Section 6 contains some efficiency considerations showing the magnitude of the improvement of the optimal subsampling design over random subsampling. The paper concludes with a discussion in Section 7. Proofs are deferred to the Appendix.

Model Specification
We consider the situation of pairs (x_i, y_i) of data, where y_i is the value of the response variable Y_i and x_i is the value of a single covariate X_i for unit i = 1, ..., n, for very large numbers of units n. We assume that the dependence of the response on the covariate is given by a polynomial regression model Y_i = β_0 + β_1 X_i + ... + β_q X_i^q + ε_i. The largest exponent q ≥ 1 denotes the degree of the polynomial regression, and p = q + 1 is the number of regression parameters β_0, ..., β_q to be estimated, where, for each k = 1, ..., q, the parameter β_k is the coefficient of the kth monomial x^k, and β_0 denotes the intercept. For example, for q = 1, we have ordinary linear regression, Y_i = β_0 + β_1 X_i + ε_i, with p = 2 parameters β_0 (intercept) and β_1 (slope) and, for q = 2, we have quadratic regression, Y_i = β_0 + β_1 X_i + β_2 X_i^2 + ε_i, with p = 3 and an additional curvature parameter β_2. Further, we assume that the covariates X_i are identically distributed and that all X_i and random errors ε_i are independent.
For notational convenience, we write the polynomial regression as a general linear model Y_i = f(X_i)'β + ε_i, where f(x) = (1, x, ..., x^q)' is the p-dimensional vector of regression functions and β = (β_0, β_1, ..., β_q)' is the p-dimensional vector of regression parameters.

Subsampling Design
We are faced with the problem that the responses Y_i are expensive or difficult to observe, while the values x_i of all units of the covariate are available. To overcome this problem, we consider the situation that the responses Y_i will be observed only for a certain percentage α of the units (0 < α < 1) and that these units will be selected on the basis of the knowledge of the values x_i of the covariate for all units. As an alternative motivation, we can consider a situation where all pairs (x_i, y_i) are available but parameter estimation is computationally feasible only on a percentage α of the data. In either case we want to find the subsample of pairs (x_i, y_i) that yields the most precise estimation of the parameter vector β.
To obtain analytical results, the covariate X_i is supposed to have a continuous distribution with density f_X(x), and we assume that the distribution of the covariate is known. The aim is to find a subsample of this distribution that covers a percentage α of the distribution and that contains the most information. For this, we will consider continuous designs ξ as measures of mass α on R with density f_ξ(x) bounded by the density f_X(x) of the covariate X_i such that ∫ f_ξ(x) dx = α and f_ξ(x) ≤ f_X(x) for all x ∈ R. A subsample can then be generated according to such a continuous design by accepting units i with probability f_ξ(x_i)/f_X(x_i). For a continuous design ξ, the information matrix M(ξ) is defined as M(ξ) = (m_{j+j'})_{j,j'=0,...,q}, where m_k = ∫ x^k f_ξ(x) dx is the kth moment associated with the design ξ. Thus, it has to be required that the distribution of X_i has a finite moment E(X_i^{2q}) of order 2q in order to guarantee that all entries in the information matrix M(ξ) exist for all continuous designs ξ for which the density f_ξ(x) is bounded by f_X(x).
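The acceptance step can be sketched as follows; this is an illustrative Python snippet (the paper's own scripts are in R), using the simplest admissible design f_ξ = α·f_X, i.e. uniform thinning, to show the mechanism:

```python
import numpy as np
from scipy import stats

def accept_reject_subsample(x, f_xi, f_X, rng):
    """Accept unit i with probability f_xi(x_i) / f_X(x_i)."""
    u = rng.random(len(x))
    return x[u * f_X(x) < f_xi(x)]

rng = np.random.default_rng(0)
alpha = 0.25
x = rng.standard_normal(100_000)

# simplest admissible design: uniform thinning, f_xi = alpha * f_X
f_X = stats.norm.pdf
f_xi = lambda t: alpha * f_X(t)

sub = accept_reject_subsample(x, f_xi, f_X, rng)
print(len(sub) / len(x))  # close to alpha
```

Any design with f_ξ ≤ f_X can be plugged in for `f_xi`; for the 0-1 designs derived below, the acceptance probability is simply 0 or 1.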
The information matrix M(ξ) measures the performance of the design ξ in the sense that the covariance matrix of the least squares estimator β̂ based on a subsample according to the design ξ is proportional to the inverse M(ξ)^{-1} of the information matrix M(ξ) or, more precisely, √(αn)(β̂ − β) is normally distributed with mean zero and covariance matrix σ_ε^2 M(ξ)^{-1}, at least asymptotically. Note that for continuous designs ξ the information matrix M(ξ) is always of full rank and, hence, the inverse M(ξ)^{-1} exists. Based on the relation to the covariance matrix, it is desirable to maximize the information matrix M(ξ). However, as is well known in design optimization, maximization of the information matrix cannot be achieved uniformly with respect to the Loewner ordering of positive-definiteness. Thus, commonly, a design criterion which is a real-valued functional of the information matrix M(ξ) is maximized instead. We will focus here on the most popular design criterion in applications, the D-criterion, in its common form log(det(M(ξ))), to be maximized. Maximization of the D-criterion can be interpreted in terms of the covariance matrix as minimizing the volume of the confidence ellipsoid for the whole parameter vector β based on the least squares estimator or, equivalently, minimizing the volume of the acceptance region of a Wald test on the whole model. The subsampling design ξ* that maximizes the D-criterion log(det(M(ξ))) will be called D-optimal, and its density is denoted by f_{ξ*}(x).
To obtain D-optimal subsampling designs, we will make use of standard techniques coming from constrained convex optimization and symmetrization. For convex optimization we employ the directional derivative of the D-criterion at a design ξ with non-singular information matrix M(ξ) in the direction of a design η, where we allow η to be a general design of mass α that does not necessarily have a density bounded by f_X(x). In particular, η = ξ_x may be a one-point design which assigns all mass α to a single setting x in R. Evaluating the directional derivative yields F_D(ξ, η) = p − trace(M(ξ)^{-1} M(η)) (compare Silvey, 1980, Example 3.8), which reduces to F_D(ξ, ξ_x) = p − α f(x)'M(ξ)^{-1}f(x) for a one-point design η = ξ_x. Equivalently, for one-point designs η = ξ_x, we may consider the sensitivity function ψ(x, ξ) = α f(x)'M(ξ)^{-1}f(x), which incorporates the essential part of the directional derivative (ψ(x, ξ) = p − F_D(ξ, ξ_x)). For the characterization of the D-optimal continuous subsampling design, the constrained equivalence theorem under Kuhn-Tucker conditions (see Sahm and Schwabe, 2001, Corollary 1 (c)) can be reformulated in terms of the sensitivity function and applied to our case of polynomial regression.
Theorem 3.1. In polynomial regression of degree q with density f_X(x) of the covariate X_i, the subsampling design ξ* with support X* is D-optimal if and only if there exist a threshold s* and settings a_1 > ... > a_{2r} for some r (1 ≤ r ≤ q) such that (i) the D-optimal subsampling design ξ* has density f_{ξ*}(x) = f_X(x) 1_{X*}(x), where X* = [a_1, ∞) ∪ [a_3, a_2] ∪ ... ∪ (−∞, a_{2r}], and (ii) ψ(x, ξ*) ≥ s* for x ∈ X* and ψ(x, ξ*) ≤ s* otherwise.
The density of the D-optimal subsampling design ξ* is concentrated on, at most, q + 1 intervals I_k, where 1_A(x) denotes the indicator function of the set A, i.e. 1_A(x) = 1 for x ∈ A, and 1_A(x) = 0 otherwise. The density f_{ξ*}(x) has a 0-1-property such that it is either equal to the density f_X(x) of the covariate (on X*) or equal to 0 (on the complement of X*). Thus, the generation of a subsample according to the optimal continuous subsampling design ξ* can be implemented easily by accepting all units i for which the value x_i of the covariate is in X* and rejecting all other units, with x_i ∉ X*. The threshold s* can be interpreted as the (1 − α)-quantile of the distribution of the sensitivity function ψ(X_i, ξ*) as a function of the random variable X_i (see Pronzato and Wang, 2021).
A further general concept to be used is equivariance. This can be employed to transform the D-optimal subsampling design simultaneously with a transformation of the distribution of the covariate. More precisely, the location-scale transformation Z_i = σX_i + µ of the covariate and its distribution is conformable with the regression function f(x) in polynomial regression, and the D-criterion is equivariant with respect to such transformations.
Theorem 3.2. Let f_{ξ*}(x) be the density of a D-optimal subsampling design ξ* for the covariate X_i with density f_X(x). Then f_{ζ*}(z) = σ^{-1} f_{ξ*}((z − µ)/σ) is the density of a D-optimal subsampling design ζ* for the covariate Z_i = σX_i + µ with density f_Z(z) = σ^{-1} f_X((z − µ)/σ).
As a consequence, also the optimal subsampling design ζ* is concentrated on, at most, p = q + 1 intervals, and its density f_{ζ*}(z) is either equal to the density f_Z(z) of the covariate Z_i (on Z* = σX* + µ) or equal to 0 (elsewhere) such that, also here, the optimal subsampling can be implemented quite easily.
A further reduction of the optimization problem can be achieved by utilizing symmetry properties. Therefore, we consider the transformation of sign change, g(x) = −x, and assume that the distribution of the covariate is symmetric, f_X(−x) = f_X(x) for all x. For a continuous design ξ, the design ξ^g transformed by sign change has density f_{ξ^g}(x) = f_ξ(−x) and, thus, satisfies the boundedness condition f_{ξ^g}(x) ≤ f_X(x) when the distribution of X_i is symmetric, and has the same value for the D-criterion as ξ, log(det(M(ξ^g))) = log(det(M(ξ))). By the concavity of the D-criterion, standard invariance arguments can be used as in Pukelsheim (1993, Chapter 13) and Heiligers and Schneider (1992). In particular, any continuous design ξ is dominated by its symmetrization ξ̄ = (ξ + ξ^g)/2 with density f_ξ̄(x) = (f_ξ(x) + f_ξ(−x))/2 (see Pukelsheim, 1993, Chapter 13.4). Hence, we can restrict the search for a D-optimal subsampling design to symmetric designs ξ̄ with density f_ξ̄(−x) = f_ξ̄(x) which are invariant with respect to sign change (ξ̄^g = ξ̄). For these symmetric subsampling designs ξ̄, the moments m_k(ξ̄) are zero for odd k and positive for even k. Hence, the information matrix M(ξ̄) is an even checkerboard matrix (see Jones and Willms, 2018) with positive entries m_{j+j'}(ξ̄) for even index sums and entries equal to zero when the index sum is odd. The inverse M(ξ̄)^{-1} of the information matrix M(ξ̄) shares the structure of an even checkerboard matrix. Thus, the sensitivity function ψ(x, ξ̄) is a polynomial with only terms of even order and is, hence, a symmetric function of x. This leads to a simplification of the representation of the optimal subsampling design in Theorem 3.1 because the support X* of the optimal subsampling design ξ* will be symmetric, too.
Corollary 3.3. In polynomial regression of degree q with a symmetrically distributed covariate X_i with density f_X(x), the D-optimal subsampling design ξ* has density f_{ξ*}(x) = f_X(x) 1_{X*}(x), where the support X* is a union of at most q + 1 intervals placed symmetrically around zero.
This characterization of the optimal subsampling design ξ* will be illustrated in the next two sections for ordinary linear regression (q = 1) and for quadratic regression (q = 2).

Optimal Subsampling for Linear Regression
In the case of ordinary linear regression (q = 1), the information matrix of any subsampling design ξ has the entries m_0 = α, m_1 and m_2. The inverse M(ξ)^{-1} of the information matrix is given by M(ξ)^{-1} = (α m_2 − m_1^2)^{-1} ((m_2, −m_1), (−m_1, α)), and the sensitivity function
ψ(x, ξ) = α(m_2 − 2 m_1 x + α x^2)/(α m_2 − m_1^2) (1)
is a polynomial of degree two in x. The D-optimal continuous subsampling design ξ* has density f_{ξ*}(x) = f_X(x)(1_{(−∞, a_2]}(x) + 1_{[a_1, ∞)}(x)). The corresponding subsampling design then accepts those units i for which x_i ≤ a_2 or x_i ≥ a_1, and rejects all units i for which a_2 < x_i < a_1.
To obtain the D-optimal continuous subsampling design ξ* by Theorem 3.1, the boundary points a_1 and a_2 have to be determined to solve the two non-linear equations
P(X_i ≤ a_2) + P(X_i ≥ a_1) = α (2)
and ψ(a_2, ξ*) = ψ(a_1, ξ*). By equation (1), the latter condition can be written as
a_1 + a_2 = 2 m_1(ξ*)/α. (3)
When the distribution of X_i is symmetric, Corollary 3.3 provides symmetry a_2 = −a_1 of the boundary points. This is in agreement with condition (3) because m_1(ξ*) = 0 in the case of symmetry. Further, by the symmetry of the distribution, P(X_i ≤ a_2) = P(X_i ≥ a_1) = α/2, and a_1 has to be chosen as the (1 − α/2)-quantile of the distribution of X_i to obtain the D-optimal continuous subsampling design.
Example 4.1 (normal distribution). If the covariate X_i comes from a standard normal distribution, then the optimal boundaries are the (α/2)- and the (1 − α/2)-quantiles ∓z_{1−α/2}, and unit i is accepted when |x_i| ≥ z_{1−α/2}. For X_i having a general normal distribution with mean µ and variance σ^2, the optimal boundaries remain the (α/2)- and (1 − α/2)-quantiles a_2 = µ − σz_{1−α/2} and a_1 = µ + σz_{1−α/2}, respectively, by Theorem 3.2. This approach applies accordingly to all distributions which are obtained by a location or scale transformation of a symmetric distribution: units will be accepted if their values of the covariate lie in the lower or upper (α/2)-tail of the distribution. This procedure can be interpreted as a theoretical counterpart in one dimension of the IBOSS method proposed by Wang et al. (2019).
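For the normal case of Example 4.1, the optimal acceptance rule only requires a quantile computation; a minimal Python sketch (the function name is ours, not from the paper's scripts):

```python
import numpy as np
from scipy import stats

def linear_normal_subsample(x, mu, sigma, alpha):
    """Accept unit i iff |x_i - mu| >= sigma * z_{1-alpha/2} (Example 4.1)."""
    z = stats.norm.ppf(1 - alpha / 2)
    return x[np.abs(x - mu) >= sigma * z]

rng = np.random.default_rng(1)
mu, sigma, alpha = 2.0, 3.0, 0.1
x = rng.normal(mu, sigma, 100_000)
sub = linear_normal_subsample(x, mu, sigma, alpha)
print(len(sub) / len(x))  # close to alpha
```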
However, for an asymmetric distribution of the covariate X_i, the optimal proportions for sampling from the upper and lower tail may differ. By condition (2), there will be a proportion α_1, 0 ≤ α_1 ≤ α, for the upper tail and α_2 = α − α_1 for the lower tail such that a_1 is the (1 − α_1)-quantile and a_2 is the α_2-quantile of the distribution of the covariate X_i, respectively. In view of condition (3), neither α_1 nor α_2 can be zero. Hence, the optimal subsampling design will have positive, but not necessarily equal, mass at both tails. This will be illustrated in the next example.
Example 4.2 (exponential distribution). If the covariate X_i comes from a standard exponential distribution with density f_X(x) = e^{−x}, x ≥ 0, we conclude from Theorem 3.1 that f_{ξ*}(x) = f_X(x)(1_{[0, b]}(x) + 1_{[a, ∞)}(x)) with a = a_1 and b = a_2 when a_2 ≥ 0. Otherwise, when a_2 < 0, the density f_X(x) of the covariate X_i vanishes on the left interval I_1 = (−∞, a_2] because the support of the distribution of X_i does not cover the whole range of R. In that case, we may formally let b = 0. Then, we can calculate the entries of M(ξ*) as functions of a and b as m_1(ξ*) = 1 − (1 + b)e^{−b} + (1 + a)e^{−a} and m_2(ξ*) = 2 − (b^2 + 2b + 2)e^{−b} + (a^2 + 2a + 2)e^{−a}. To obtain the optimal solutions for a and b in the case a_2 ≥ 0, the two non-linear equations (2) and (3) have to be satisfied, which here become e^{−b} − e^{−a} = 1 − α and α(a + b) = 2 m_1(ξ*). If a_2 < 0 would hold, the first condition reveals a = −log(α) and, hence, m_1(ξ*) = α(a + 1). Then, similar to the proof of Theorem 5.2 below, the second condition has to be relaxed to ψ(a, ξ*) ≥ ψ(0, ξ*), which can be reformulated to αa ≥ 2 m_1(ξ*) = 2α(a + 1) and yields a contradiction. Thus, this case can be excluded, and a_2 has to be larger than 0 for all α.
For selected values of α, numerical results are presented in Table 1. In addition to the optimal values for a and b, the proportions P(X_i ≤ b) and P(X_i ≥ a) are presented in Table 1, together with the percentage of mass allocated to the left interval [0, b]. In Figure 1, the density f_{ξ*} of the optimal subsampling design ξ* and the corresponding sensitivity function ψ(x, ξ*) are exhibited for α = 0.5 and α = 0.3. Vertical lines indicate the positions of the boundary points a and b, and the dotted horizontal line displays the threshold s*. As could have been expected, less mass is assigned to the right tail of the right-skewed distribution because observations from the right tail are more influential and, thus, more observations seem to be required on the lighter left tail for compensation.
For X_i having an exponential distribution with general intensity λ > 0 (scale 1/λ), the optimal boundary points remain the same quantiles as in the standard exponential case, a_1 = a/λ and a_2 = b/λ associated with the proportion α, by Theorem 3.2.
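The two non-linear equations of Example 4.2 can be solved with any root finder; the paper uses nleqslv in R, and the following Python sketch with scipy.optimize.fsolve is an illustrative equivalent:

```python
import numpy as np
from scipy.optimize import fsolve

alpha = 0.3

def equations(v):
    a, b = v
    # mass condition: P(X <= b) + P(X >= a) = alpha  <=>  e^{-b} - e^{-a} = 1 - alpha
    eq1 = np.exp(-b) - np.exp(-a) - (1 - alpha)
    # first moment of the design on [0, b] and [a, inf)
    m1 = 1 - (1 + b) * np.exp(-b) + (1 + a) * np.exp(-a)
    # equal sensitivity at the boundaries: a + b = 2 m1 / alpha
    eq2 = alpha * (a + b) - 2 * m1
    return [eq1, eq2]

a, b = fsolve(equations, x0=[2.0, 0.2])
left_mass = 1 - np.exp(-b)          # mass on the left interval [0, b]
print(a, b, left_mass / alpha)      # left interval gets more than half the mass
```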

Optimal Subsampling for Quadratic Regression
In the case of quadratic regression (q = 2), the information matrix of a symmetric subsampling design ξ̄ has non-zero entries m_0 = α, m_2 and m_4 only for even index sums. The inverse M(ξ̄)^{-1} of the information matrix is an even checkerboard matrix with diagonal entries m_4/d, 1/m_2 and α/d and off-diagonal corner entries −m_2/d, where d = α m_4 − m_2^2, and the sensitivity function
ψ(x, ξ̄) = α((m_4 − 2 m_2 x^2 + α x^4)/d + x^2/m_2) (5)
is a polynomial of degree four and is symmetric in x.
According to Corollary 3.3, the density f_{ξ*}(x) of the D-optimal continuous subsampling design ξ* has, at most, three intervals that are symmetrically placed around zero, where the density is equal to the bounding density f_X(x), and f_{ξ*}(x) is equal to zero elsewhere. Thus the density f_{ξ*}(x) of the D-optimal subsampling design has the shape f_{ξ*}(x) = f_X(x)(1_{(−∞, −a]}(x) + 1_{[−b, b]}(x) + 1_{[a, ∞)}(x)), where a > b ≥ 0. We formally allow b = 0, which means that ψ(0, ξ*) ≤ s* = ψ(a, ξ*) and that the density f_{ξ*}(x) is concentrated on only two intervals, (−∞, −a] and [a, ∞). Although the information matrix will be non-singular even in the case of two intervals (b = 0), the optimal subsampling design will include a non-degenerate interior interval [−b, b] in many cases, b > 0, as illustrated below in Examples 5.1 and 5.3. However, for a heavy-tailed distribution of the covariate X_i, the interior interval may vanish in the optimal subsampling design, as shown in Example 5.5.
To obtain the D-optimal continuous subsampling design ξ* by Corollary 3.3, the boundary points a = a_1 and b = a_2 ≥ 0 have to be determined to solve the two non-linear equations
P(|X_i| ≥ a) + P(|X_i| ≤ b) = α (7)
and ψ(a, ξ*) = ψ(b, ξ*). (8)
By equation (5), the latter condition can be written as (m_4 − 2 m_2 a^2 + α a^4)/d + a^2/m_2 = (m_4 − 2 m_2 b^2 + α b^4)/d + b^2/m_2, which, after dividing by a^2 − b^2 > 0, can be reformulated as
α m_2(ξ*)(a^2 + b^2) = 3 m_2(ξ*)^2 − α m_4(ξ*). (9)
For finding the optimal solution, we use the Newton method implemented in the R package nleqslv by Hasselman (2018) to calculate numeric values for a and b based on equations (7) and (9) for various symmetric distributions.
Example 5.1 (normal distribution). For the case that the covariate X_i comes from a standard normal distribution, results are given in Table 2 for selected values of α. In addition to the optimal values for a and b, the proportions of mass on the interior and the outer intervals are also reported.
[Figure: Density of the optimal subsampling design (solid line) and the standard normal density (dashed line, upper panels), and sensitivity functions (lower panels) for subsampling proportions α = 0.5 (left) and α = 0.1 (right). Vertical lines indicate the positions of the boundary points b and a; in the subplots of the sensitivity function, the dotted horizontal line displays the threshold s*.]
For other values of α, the plots look similar.
The numerical results in Table 2 suggest that the interior interval [−b, b] does not vanish for any α (0 < α < 1). This will be established in the following theorem.
Theorem 5.2. In quadratic regression with standard normal covariate X_i, for any subsampling proportion α ∈ (0, 1), the D-optimal subsampling design ξ* has density f_{ξ*}(x) = f_X(x)(1_{(−∞, −a]}(x) + 1_{[−b, b]}(x) + 1_{[a, ∞)}(x)) with 0 < b < a.
For X_i having a general normal distribution with mean µ and variance σ^2, the optimal boundary points remain the same quantiles as in the standard normal case, a_1, a_4 = µ ± σa and a_2, a_3 = µ ± σb, by Theorem 3.2.
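For the standard normal case of Example 5.1, the defining conditions (mass α on the support and equal sensitivity at the boundary points a and b, which we write in the reformulated moment form α m_2(a² + b²) = 3 m_2² − α m_4) can be solved numerically; an illustrative Python sketch with scipy, whereas the paper's own computations use R/nleqslv:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import fsolve

alpha = 0.5
phi = stats.norm.pdf

def moment(k, a, b):
    # k-th moment of the design with support (-inf,-a] u [-b,b] u [a,inf)
    tail = quad(lambda x: x**k * phi(x), a, np.inf)[0]
    inner = quad(lambda x: x**k * phi(x), 0, b)[0]
    return 2 * (tail + inner)

def equations(v):
    a, b = v
    mass = 2 * stats.norm.sf(a) + (2 * stats.norm.cdf(b) - 1) - alpha
    m2, m4 = moment(2, a, b), moment(4, a, b)
    # equal sensitivity at a and b, reformulated moment condition
    sens = alpha * m2 * (a**2 + b**2) - (3 * m2**2 - alpha * m4)
    return [mass, sens]

a, b = fsolve(equations, x0=[1.0, 0.3])
print(a, b)  # three proper intervals: 0 < b < a
```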

Example 5.3 (uniform distribution). If the covariate X_i is uniformly distributed on [−1, 1], we can obtain analytical results for the dependence of the subsampling design on the proportion α to be selected.
The distribution of X_i is symmetric. By Corollary 3.3, the density of the D-optimal continuous subsampling design ξ* has the shape
f_{ξ*}(x) = (1/2)(1_{[−1, −a]}(x) + 1_{[−b, b]}(x) + 1_{[a, 1]}(x)), (10)
where we formally allow a = 1 or b = 0, resulting in only one or two intervals of support. The relevant entries in the information matrix are m_2(ξ*) = ((1 − a^3) + b^3)/3 and m_4(ξ*) = ((1 − a^5) + b^5)/5. If, in Corollary 3.3, the boundary points a_1 and a_2 satisfy a_1 ≤ 1 and a_2 ≥ 0, then a = a_1 and b = a_2 are the solution of the two equations a − b = 1 − α and α m_2(ξ*)(a^2 + b^2) = 3 m_2(ξ*)^2 − α m_4(ξ*) arising from conditions (7) and (9). On the other hand, if there exist solutions a and b of these equations such that 0 < b < a < 1, then these are the boundary points in the representation (10), and the density of the optimal subsampling design is supported by three proper intervals. Solving the two equations results in closed-form expressions for a and b, given in equations (11) and (12).
[Table 3: Values for the boundary points a and b for selected values of the subsampling proportion α in the case of a uniformly distributed covariate.]
Moreover, the percentage of mass at the different intervals is displayed in Figure 5. The results in Table 3 and Figure 5 suggest that the percentages of mass on all three intervals [−1, −a], [−b, b], and [a, 1] tend to 1/3 as α tends to 0. We establish this in the following theorem.
Theorem 5.4. In quadratic regression with covariate X_i uniformly distributed on [−1, 1], let ξ*_α be the optimal subsampling design for subsampling proportion α, 0 < α < 1, defined by equations (11) and (12). Then lim_{α→0} ξ*_α([−b, b])/α = 1/3.
It is worthwhile mentioning that the percentages of mass displayed in Figure 5 are not monotonic over the whole range of α ∈ (0, 1): for example, the percentage of mass on the interior interval [−b, b] increases from 0.419666 at α = 0.50 to 0.448549 at α = 0.92 and then decreases slightly again to 0.447553 at α = 0.99.
Finally, it can be checked that, for all α, the solutions satisfy 0 < b < a < 1 such that the optimal subsampling designs are supported on three proper intervals.
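For the uniform case, the two equations reduce to simple polynomial expressions in a and b, so the solution is easy to reproduce; a Python sketch (an illustrative stand-in for the paper's R code) that also recovers the reported interior-interval mass fraction 0.419666 at α = 0.5 and approaches the limit 1/3 of Theorem 5.4 for small α:

```python
import numpy as np
from scipy.optimize import fsolve

def solve_uniform(alpha):
    def equations(v):
        a, b = v
        m2 = ((1 - a**3) + b**3) / 3.0  # second moment of the design
        m4 = ((1 - a**5) + b**5) / 5.0  # fourth moment of the design
        return [(a - b) - (1 - alpha),  # mass condition
                alpha * m2 * (a**2 + b**2) - (3 * m2**2 - alpha * m4)]
    return fsolve(equations, x0=[1 - 2 * alpha / 3, alpha / 3])

a, b = solve_uniform(0.5)
print(b / 0.5)            # fraction of mass on [-b, b], about 0.4197

a_small, b_small = solve_uniform(0.05)
print(b_small / 0.05)     # closer to the limit 1/3 of Theorem 5.4
```

The interior interval [−b, b] carries mass b under the uniform density 1/2, so b/α is directly the interior mass fraction.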
In the two preceding examples it could be noticed that the mass of observations is of comparable size on the three supporting intervals in the case of a normal and of a uniform distribution with light tails. This may be different in the case of a heavy-tailed distribution of the covariate X_i such as the t-distribution.
Example 5.5 (t-distribution). For the case that the covariate X_i comes from a t-distribution with ν degrees of freedom, we observe a behavior which differs substantially from the normal case of Example 5.1. We show this for ν = 5 degrees of freedom, the smallest number for which the fourth moment, which appears in the information matrix of the D-optimal continuous subsampling design ξ*, still exists, so that the dispersion is maximal.
Theorem 5.6. In quadratic regression with t-distributed covariate X_i ∼ t_5 with five degrees of freedom, there is a critical value α* ≈ 0.082065 of the subsampling proportion α such that the D-optimal subsampling design ξ* is (i) supported on three proper intervals (b > 0) for α < α*, and (ii) supported on the two outer intervals only (b = 0) for α ≥ α*.
For illustration, numerical results are given in Table 4. The percentage of mass on the interior interval [−b, b] is equal to zero for all larger values of α, as stated in Theorem 5.6. Below α*, the percentage of mass on [−b, b] decreases with increasing subsampling proportion α before vanishing entirely.
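The critical value α* of Theorem 5.6 can be reproduced by solving the defining conditions at the degenerate boundary b = 0; an illustrative Python sketch (with the equal-sensitivity condition written in the reformulated moment form, our derivation):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.optimize import fsolve

t5 = stats.t(df=5)

def tail_moment(k, a):
    # k-th moment of one tail [a, inf); finite for k <= 4 since df = 5 > 4
    return quad(lambda x: x**k * t5.pdf(x), a, np.inf)[0]

def equations(v):
    # at the critical proportion the interior interval degenerates, b = 0
    a, alpha = v
    m2 = 2 * tail_moment(2, a)
    m4 = 2 * tail_moment(4, a)
    return [2 * t5.sf(a) - alpha,
            alpha * m2 * a**2 - (3 * m2**2 - alpha * m4)]

a_star, alpha_star = fsolve(equations, x0=[2.2, 0.08])
print(alpha_star)  # close to 0.082
```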
[Table 4: Values for the boundary points a and b for selected values of the subsampling proportion α in the case of a t_5-distributed covariate.]
Further calculations show that the critical value α*, at which the D-optimal subsampling design switches from a three-interval support to a two-interval support, increases with the number of degrees of freedom ν of the t-distribution and converges to one when ν tends to infinity. This is in accordance with the results for the normal distribution in Example 5.1, as the t-distribution converges in distribution to a standard normal distribution for ν → ∞. We give numeric values for the crossover points for selected degrees of freedom in Table 5, where ν = ∞ relates to the normal distribution. The corresponding value α* = 1 indicates that the D-optimal subsampling design is supported by three intervals for all α in this case.

Efficiency Considerations

To exhibit the gain in using a D-optimal subsampling design compared to random subsampling, we consider the performance of the uniform random subsampling design ξ_α of size α, which has density f_{ξ_α}(x) = α f_X(x), compared to the D-optimal subsampling design ξ*_α with mass α. More precisely, the D-efficiency of any subsampling design ξ with mass α is defined as eff_{D,α}(ξ) = (det(M(ξ))/det(M(ξ*_α)))^{1/p}, where p is the dimension of the parameter vector β. For this definition the homogeneous version (det(M(ξ)))^{1/p} of the D-criterion is used, which satisfies the homogeneity condition (det(λM(ξ)))^{1/p} = λ(det(M(ξ)))^{1/p} for all λ > 0 (see Pukelsheim, 1993, Chapter 6.2).
For uniform random subsampling, the information matrix is given by M(ξ_α) = αM(ξ_1), where M(ξ_1) is the information matrix of the full sample with raw moments m_k(ξ_1) = E(X_i^k) as entries in the (j, j')th position, j + j' − 2 = k. Thus, the D-efficiency eff_{D,α}(ξ_α) of uniform random subsampling can be nicely interpreted: the sample size (mass) required to obtain the same precision (in terms of the D-criterion) as when the D-optimal subsampling design ξ*_α of mass α is used is equal to the inverse of the efficiency, eff_{D,α}(ξ_α)^{-1}, times α. For example, if the efficiency eff_{D,α}(ξ_α) is equal to 0.5, then twice as many observations would be needed under uniform random subsampling as for a D-optimal subsampling design of size α. Of course, the full sample has higher information than any proper subsample such that, obviously, for uniform random subsampling, eff_{D,α}(ξ_α) ≥ α holds for all α.
For the examples of Sections 4 and 5, the efficiency of uniform random subsampling is given in Table 6 for selected values of α and exhibited in Figure 6 for the full range of α between 0 and 1 (solid lines). Here the determinant of the information matrix of the optimal subsampling design ξ*_α is determined as in the examples of Sections 4 and 5, either numerically or by explicit formulas where available. Both Table 6 and Figure 6 indicate that the efficiency of uniform random subsampling decreases in all cases when the proportion α of subsampling gets smaller. In the case of quadratic regression with uniformly distributed covariate, the decrease is more or less linear with a minimum value of approximately 0.58 when α is small. In the other cases, where the distribution of the covariate is unbounded, the efficiency apparently decreases faster when the proportion α is smaller than 10%, and tends to 0 for α → 0. The latter property can be easily seen for linear regression and symmetric distributions: there, the efficiency eff_{D,α}(ξ_α) of uniform random subsampling is bounded from above by c/q_{1−α/2}, where c = E(X_i^2)^{1/2} is a constant and q_{1−α/2} is the (1 − α/2)-quantile of the distribution of the covariate. When the distribution is unbounded, like the normal distribution, these quantiles tend to infinity for α → 0 and, hence, the efficiency tends to 0. Similar results hold for quadratic regression and asymmetric distributions.
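For linear regression with a standard normal covariate, this efficiency even has a closed form, since det M(ξ_α) = α² and det M(ξ*_α) = α m_2(ξ*_α) with m_1 = 0; a small Python sketch (illustrative, not from the paper's scripts):

```python
from scipy import stats

def eff_uniform_linear_normal(alpha):
    """D-efficiency of uniform random subsampling relative to the optimal
    design for linear regression with standard normal covariate (p = 2)."""
    z = stats.norm.ppf(1 - alpha / 2)
    # second moment of the optimal design (mass alpha on the two tails)
    m2_opt = 2 * (z * stats.norm.pdf(z) + stats.norm.sf(z))
    det_opt = alpha * m2_opt       # det M(xi*) = m0 m2 - m1^2 with m1 = 0
    det_unif = alpha * alpha       # M(xi_alpha) = alpha * I_2
    return (det_unif / det_opt) ** 0.5

for alpha in (0.5, 0.1, 0.01):
    print(alpha, round(eff_uniform_linear_normal(alpha), 3))
```

The values decrease with α, illustrating the vanishing efficiency of uniform random subsampling for small proportions.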
In any case, as can be seen from Table 6, the efficiency of uniform random subsampling is quite low for reasonable proportions α ≤ 0.1 and, hence, the gain in using the D-optimal subsampling design is substantial.
By equivariance arguments as indicated above in the examples of Sections 4 and 5, the present efficiency considerations carry over directly to a covariate having a general normal, exponential, or uniform distribution, respectively.
In the IBOSS approach by Wang et al. (2019), half of the proportion α is taken from each tail of the data. The corresponding continuous subsampling design ξ̃_α would be to have two intervals (−∞, b] and [a, ∞) and to choose the boundary points a and b to be the (1 − α/2)- and (α/2)-quantile of the distribution of the covariate, respectively. For linear regression, it can be seen from Corollary 3.3 that the subsampling design ξ̃_α is D-optimal when the distribution of the covariate is symmetric. As the IBOSS procedure does not use prior knowledge of the distribution, it is tempting to investigate the efficiency of the corresponding continuous subsampling design ξ̃_α under asymmetric distributions. For the exponential distribution, this efficiency eff_{D,α}(ξ̃_α) is added to the upper left panel in Figure 6 by a dashed line. There the subsampling design ξ̃_α shows a remarkably high efficiency over the whole range of α, with a minimum value 0.976 at α = 0.332.
As an extension of IBOSS to quadratic regression, we may propose a procedure which takes proportions α/3 from both tails of the data as well as from the center of the data. This procedure can be performed without any prior knowledge of the distribution of the covariate. The choice of the proportions α/3 is motivated by the standard case of the D-optimal design on an interval, where one third of the weight is allocated to each of the two endpoints and to the midpoint of the interval. For a symmetric distribution, the corresponding continuous subsampling design ξ̃_α can be defined by the boundary points a and b chosen as the (1 − α/3)- and (1/2 + α/6)-quantile of the distribution of the covariate, respectively. In the case of the uniform distribution, the subsampling design ξ̃_α is the limiting D-optimal subsampling design for α → 0 by Theorem 5.4. In Figure 6, the efficiency eff_{D,α}(ξ̃_α) is shown by dashed lines over the whole range of α for the uniform distribution as well as for the normal and the t-distribution in the case of quadratic regression. In all three cases, the subsampling design ξ̃_α is highly efficient over the whole range of α, with minimum values 0.994 at α = 0.079 for the normal distribution, 0.989 at α = 0.565 for the uniform distribution, and 0.978 at α = 0.245 for the t_5-distribution, respectively. This is of particular interest for the t_5-distribution, where the interior interval of the D-optimal subsampling design ξ*_α is considerably smaller than that of the IBOSS-like subsampling design ξ̃_α and even vanishes entirely for α > α* ≈ 0.08. However, we have only tested this extension of IBOSS for quadratic regression for symmetric distributions of the covariate. Further investigations for non-symmetric distributions are necessary.
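The proposed IBOSS-like rule for quadratic regression uses empirical quantiles only; a Python sketch of how such a selection could look (function name and data are our illustration):

```python
import numpy as np

def iboss_quadratic(x, alpha):
    """Keep ~alpha/3 of the observations from each tail and from the center,
    using empirical quantiles only (no knowledge of the covariate law)."""
    lo, lc, uc, hi = np.quantile(x, [alpha / 3, 0.5 - alpha / 6,
                                     0.5 + alpha / 6, 1 - alpha / 3])
    keep = (x <= lo) | ((lc <= x) & (x <= uc)) | (x >= hi)
    return x[keep]

rng = np.random.default_rng(2)
x = rng.standard_t(df=5, size=90_000)
sub = iboss_quadratic(x, alpha=0.3)
print(len(sub) / len(x))  # close to 0.3
```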

Concluding Remarks
In this paper we have considered a theoretical approach to evaluate subsampling designs under distributional assumptions on the covariate in the case of polynomial regression on a single explanatory variable. We first reformulated the constrained equivalence theorem under Kuhn-Tucker conditions in Sahm and Schwabe (2001) to characterize the D-optimal continuous subsampling design for general distributions of the covariate. For symmetric distributions of the covariate we concluded the following: the D-optimal subsampling design is equal to the bounding distribution on its support, and this support is the union of at most q + 1 intervals that are placed symmetrically around zero. Further, we have found that in the case of quadratic regression the D-optimal subsampling design has three support intervals with positive mass for all α ∈ (0, 1), whereas for a t-distributed covariate the interior interval vanishes for some α. In contrast, for linear regression, two intervals at the tails of the distribution are always required.
The main emphasis of this work was on D-optimal subsampling designs. But many of the results may be extended to other optimality criteria like A- and E-optimality from Kiefer's Φ_q-class of optimality criteria, IMSE-optimality for predicting the mean response, or optimality criteria based on subsets or linear functionals of the parameters.
The D-optimal subsampling designs perform well compared to uniform random subsampling. In particular, for small proportions α, the efficiency of uniform random subsampling tends to zero when the distribution of the covariate is unbounded. This property is in accordance with the observation that estimation based on subsampling according to IBOSS is "consistent" in the sense that the mean squared error goes to zero with increasing population size even when the size of the subsample is fixed.
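The gap can be illustrated numerically for quadratic regression with a standard normal covariate: a deterministic tails-and-center subsample yields a much larger D-criterion value than a uniform random subsample of the same size. This is a rough Monte Carlo sketch under our own setup, not a computation from the paper.

```python
import numpy as np

def info_det(x):
    """Determinant of the normalized information matrix for quadratic
    regression with f(x) = (1, x, x^2)^T."""
    F = np.column_stack([np.ones_like(x), x, x**2])
    return np.linalg.det(F.T @ F / x.size)

rng = np.random.default_rng(2)
x = np.sort(rng.normal(size=100_000))
alpha = 0.01
t = int(alpha * x.size / 3)                  # observations per piece
m = x.size // 2
# tails-and-center subsample vs. uniform random subsample of equal size
informative = np.concatenate([x[:t], x[m - t // 2 : m - t // 2 + t], x[-t:]])
random_sub = rng.choice(x, size=informative.size, replace=False)

# Placing mass in the tails and the center inflates the determinant
# of the information matrix by an order of magnitude or more:
assert info_det(informative) > 10 * info_det(random_sub)
```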
We propose a generalization of the IBOSS method to quadratic regression which does not require prior knowledge of the distribution of the covariate and which performs remarkably well compared to the optimal subsampling design.However, an extension to higher order polynomials does not seem to be obvious.

Appendix A. Proofs
Before proving Theorem 3.1, we establish two preparatory lemmas on properties of the sensitivity function ψ(x, ξ) for a continuous subsampling design ξ with density f_ξ(x) and reformulate an equivalence theorem on constrained design optimality by Sahm and Schwabe (2001) for the present setting. The first lemma deals with the shape of the sensitivity function.
Lemma A.1.The sensitivity function ψ(x, ξ) is a polynomial of degree 2q with positive leading term.
The second lemma reveals a distributional property of the sensitivity function considered as a function in the covariate X i .
Lemma A.2.The random variable ψ(X i , ξ) has a continuous cumulative distribution function.
Proof of Lemma A.2. As the sensitivity function ψ(x, ξ) is a non-constant polynomial by Lemma A.1, the equation ψ(x, ξ) = s has only finitely many roots x_1, …, x_ℓ, ℓ ≤ 2q, say, by the fundamental theorem of algebra. Hence, P(ψ(X_i, ξ) = s) = ∑_{k=1}^{ℓ} P(X_i = x_k) = 0 by the continuity of the distribution of X_i, which proves the continuity of the cumulative distribution function of ψ(X_i, ξ).
With the continuity of the distribution of ψ(X i , ξ * ) the following equivalence theorem can be obtained from Corollary 1(c) in Sahm and Schwabe (2001) for the present setting by transition from the directional derivative to the sensitivity function and considering R as the design region.
Theorem A.3 (Equivalence Theorem). The subsampling design ξ* is D-optimal if and only if there exist a threshold s* and a subset X* of ℝ such that (i) the D-optimal subsampling design ξ* is given by the density f_{ξ*}(x) = f_X(x) 1_{X*}(x) and (ii) ψ(x, ξ*) ≥ s* for x ∈ X* and ψ(x, ξ*) ≤ s* for x ∉ X*.
Proof of Theorem 3.1. By Lemma A.1 the sensitivity function ψ(x, ξ) is a polynomial in x of degree 2q with positive leading coefficient. Using the same argument as in the proof of Lemma A.2, we obtain that there are at most 2q roots of the equation ψ(x, ξ*) = s* and, hence, at most 2q sign changes in ψ(x, ξ*) − s*. As ψ(x, ξ*) is a polynomial of even degree, the number of (proper) sign changes also has to be even, and they occur at a_1 > ⋯ > a_{2r}, say, with r ≤ q. Moreover, for 0 < α < 1, X* is a proper subset of ℝ and, thus, there must be at least one sign change, r ≥ 1. Finally, as the leading coefficient of ψ(x, ξ*) is positive, ψ(x, ξ*) exceeds s* for x → ±∞ and, hence, the outermost intervals [a_1, ∞) and (−∞, a_{2r}] are included in the support X* of ξ*. By the interlacing property of the intervals with positive and negative sign of ψ(x, ξ*) − s*, the result follows from the conditions on the D-optimal subsampling design ξ* in Theorem A.3.
Proof of Theorem 3.2. First note that for any µ and σ > 0, the location-scale transformation z = σx + µ is conformable with the regression function f(x), i. e. there exists a non-singular matrix Q such that f(σx + µ) = Q f(x) for all x. Then, for any design ξ bounded by f_X(x), the induced design ζ of z = σx + µ is bounded by f_Z(z). Hence, by the transformation theorem for measure integrals, it holds that M(ζ) = Q M(ξ) Q^⊤. Therefore det(M(ζ)) = det(Q)² det(M(ξ)). Thus ξ* maximizes the D-criterion over the set of subsampling designs bounded by f_X(x) if and only if ζ* maximizes the D-criterion over the set of subsampling designs bounded by f_Z(z).
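For q = 2 the identity used in this proof can be checked numerically; here Q is written out explicitly for f(x) = (1, x, x²)ᵀ, and the discrete design as well as the values of µ and σ are arbitrary choices for illustration.

```python
import numpy as np

# For f(x) = (1, x, x^2)^T and z = sigma*x + mu we have f(z) = Q f(x) with
# Q = [[1, 0, 0], [mu, sigma, 0], [mu^2, 2*mu*sigma, sigma^2]],
# hence M(zeta) = Q M(xi) Q^T and det(M(zeta)) = det(Q)^2 det(M(xi)).
mu, sigma = 1.5, 2.0
Q = np.array([[1.0, 0.0, 0.0],
              [mu, sigma, 0.0],
              [mu**2, 2 * mu * sigma, sigma**2]])

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=7)          # support points of a discrete design xi
w = np.full(x.size, 1 / x.size)         # equal weights

def info(points, weights):
    """Information matrix sum_i w_i f(x_i) f(x_i)^T for quadratic regression."""
    F = np.column_stack([np.ones_like(points), points, points**2])
    return F.T @ (weights[:, None] * F)

M_xi = info(x, w)
M_zeta = info(sigma * x + mu, w)        # design transported by z = sigma*x + mu
assert np.allclose(M_zeta, Q @ M_xi @ Q.T)
assert np.isclose(np.linalg.det(M_zeta),
                  np.linalg.det(Q)**2 * np.linalg.det(M_xi))
```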
Proof of Theorem 5.2. In view of the shape (6) of the density and by Corollary 3.3, the tails are included in the optimal subsampling design, so that a < ∞.
The difference c(α) = ψ(0, ξ′) − ψ(a, ξ′) is continuous in α and does not have any roots in (0, 1). Further, it can be checked that, for example, c(0.1) > 0. Thus c(α) > 0, which means that ψ(0, ξ′) > ψ(a, ξ′) for all α. Hence, by Theorem A.3, the subsampling design ξ′ cannot be optimal and, as a consequence, the optimal subsampling design ξ* is supported on three proper intervals with b > 0 for all α.
Proof of Theorem 5.6. The proof follows the idea of the proof of Theorem 5.2. For α ∈ (0, 1), we consider the symmetric design ξ′ which is supported only on the tails and which would be the optimal subsampling design when b = 0. This design has density f_{ξ′}(x) = 1_{(−∞,−a]∪[a,∞)}(x) f_X(x) with a = t_{5,1−α/2}. Computing the relevant entries of the information matrix M(ξ′) shows that for α < α* ≈ 0.082065 we have ψ(0, ξ′) > ψ(a, ξ′), so the design ξ′ cannot be optimal by Theorem A.3. In this situation, an inner interval has to be included in the optimal subsampling design ξ* with b > 0.

Table 2.
Figure 2. Density of the optimal subsampling design (solid line) and the standard normal distribution (dashed line, upper panels), and sensitivity functions (lower panels) for subsampling proportions α = 0.5 (left) and α = 0.1 (right)

Figure 3. Boundary points a (dashed) and b (solid) of the D-optimal subsampling design in the case of uniform X_i on [−1, 1] as functions of α

Both boundary points tend to 1/√5 as α tends to 1. Similar to the case of the normal distribution, the resulting values and illustrations are given in Table 3 and Figure 4. Note that the mass of the interior interval P(−b ≤ X_i ≤ b) is equal to b itself, as X_i is uniformly distributed on [−1, 1]. Also here, in Figure 4, vertical lines indicate the positions of the boundary points.

Figure 6. Efficiency of uniform random subsampling (solid line) and of an IBOSS-type subsampling design (dashed line) w.r.t. D-optimality

Figure 7. Difference c(α) = ψ(0, ξ′) − ψ(a, ξ′) (solid) for the case of a t-distributed covariate with 5 degrees of freedom; the vertical dotted line indicates the position of the critical value α* ≈ 0.082065, where the curve of the function c(α) intersects the horizontal dotted line indicating c = 0

Conversely, for α ≥ α* ≈ 0.082065 we have that ψ(0, ξ′) ≤ ψ(a, ξ′). Hence, the design ξ′ is optimal by Theorem A.3, and no inner interval has to be added to the optimal subsampling design ξ* = ξ′ (b = 0).

Table 1. Numerical values of the boundary points a and b for selected values of the subsampling proportion α in the case of standard exponential X_i

Table 5. Values of the critical value α* for selected degrees of freedom ν of the t-distribution

Table 6. Efficiency of uniform subsampling w.r.t. D-optimality for selected values of the subsampling proportion α