Robust optimal designs using a model misspecification term

Much of classical optimal design theory relies on specifying a model with only a small number of parameters. In many applications, such models will give reasonable approximations. However, they will often be found not to be entirely correct when enough data are at hand. A property of classical optimal design methodology is that the amount of data does not influence the design when a fixed model is used. However, it is reasonable that a low-dimensional model is satisfactory only if limited data are available. With more data available, more aspects of the underlying relationship can be assessed. We consider a simple model that is not thought to be fully correct. The model misspecification, that is, the difference between the true mean and the simple model, is explicitly modeled with a stochastic process. This gives a unified approach to handle situations with both limited and rich data. Our objective is to estimate the combined model, which is the sum of the simple model and the assumed misspecification process. In our situation, the low-dimensional model can be viewed as a fixed effect and the misspecification term as a random effect in a mixed-effects model. Our aim is to predict within this model. We describe how we minimize the prediction error using an optimal design. We compute optimal designs for the full model in different cases. The results confirm that the optimal design depends strongly on the sample size. In low-information situations, traditional optimal designs for models with a small number of parameters are sufficient, while the inclusion of the misspecification term leads to very different designs in data-rich cases.


Introduction
The emergence of Data Science partly inspires the current work. There is a great overlap between the fields of Statistics and Data Science and there is no consensus view on how exactly to define them or the difference between them. However, one important difference is that Data Science often applies a high-dimensional parameter space to analyze very rich data. At the same time, (traditional) Statistics has often used pre-defined low-parameter models, for situations where information is limited. It is not the size of the sample per se that is important but rather the amount of information in relation to the magnitude of the signal in the data. In many applications, one faces both situations with limited and rich information. Therefore, unified approaches could be beneficial.
Statistics has been used extensively both for designed experiments and observational studies, while Data Science typically refers to the latter type of study. In general, while Optimal Design theory has a strong legacy in Statistics, the design of experiments is not a large part of Data Science methodology. Still, there are many situations where experiments can provide large data sets and strong signals. One example is web advertisements, see e.g. Kohavi et al. (2009) and Example 1.6 of Montgomery (2017). Such ads could be displayed to the customer for different time durations, repeated with different frequency patterns, use different appearances, price deals, etc. The designer of an advertisement might be interested in the conversion rate, the percentage of visits to the website that include a purchase (Kohavi et al. 2009), or in another performance metric. While the amount of data on customer behavior may be large, in terms of the number of customers and the number of ad showings, the results may have enough uncertainty to call for the optimal design of the experiment. Even small changes in a performance metric like conversion rate can have a huge impact on revenues (Kohavi et al. 2013). Fernandez-Tapia and Guéant (2017) deal with bidding models for ad-inventory auctions using a Poisson process and demonstrate that the Hamilton-Jacobi-Bellman equation describes the optimal bids. Mardanlou et al. (2017) propose an optimal cost design model that links a campaign-level control signal to the total cost to the advertiser for the influence obtained at each control signal value.
A sub-field of artificial intelligence, active machine learning, has been used over the years in data science and is related to optimal experimental design. If a learning algorithm can choose the data it wants to learn from, it can perform better with less training data. The first applied statistical analyses of active learning for regression in robot arm kinematics were presented by Cohn (1996) and Cohn et al. (1996). Important areas of application include speech recognition, image or video classification, and information extraction from texts; see Settles (2010) for a review. Nie et al. (2018) focus on active learning for regression models including a model misspecification term. López-Fidalgo and Wiens (2022) recommend methods for binary data for estimation and classification in active learning which are robust to both model misspecification and response mislabelling.
When analyzing data, we may use either a multi-purpose model or a scientifically based model. Commonly used multi-purpose models include polynomial models, such as linear and quadratic ones, and generalized linear models. When possible, it is often useful to base a model on subject-matter science. As an example, we may study the dose-dependent effect of drugs. In this context, the Michaelis-Menten model can be derived theoretically for certain simple concentration-response experiments. The Michaelis-Menten model is often generalized to an Emax model when studying dose-response based on results from patients. While the Michaelis-Menten model has two parameters for location and maximal response, the standard 4-parameter Emax model includes additional free parameters for the placebo efficacy and the degree of sigmoidicity of the response curve. The Emax model has been widely used and useful for clinical trials. However, for several theoretical reasons, it cannot be exact. Assume, for example, that an individual patient's concentration-response curve follows an Emax model. Late-stage clinical trials often aim at modeling dose-response, not measuring plasma concentration. Variations in concentration will therefore distort the population model. Patient heterogeneity in the Emax parameters will also lead to distortion. In addition, the Michaelis-Menten model may hold for receptor binding, but the relation between receptor binding and the clinical endpoint, e.g. blood pressure, glucose, or FEV1, is likely a complex non-linear function. Some of these distortions can be included in a more sophisticated scientific model. However, other kinds of distortions, like the relation between receptor binding and response, are likely far too deep to capture in a low-parameter model. Similarly, the relation between price or display time and a customer response like conversion rate will likely not exactly follow a simple model.
We will therefore combine a low-dimensional model with a term that models the misspecification. A simple example can be a linear (straight-line) regression combined with a Brownian bridge as a (modeled) misspecification term. Our objective is to estimate the combined model, including the misspecification term, based on the collected data. The experimental conditions used for data collection should be chosen to optimize the precision of the estimated model.
Optimal designs for models with a stochastic process as error term have been considered e.g. by Sacks and Ylvisaker (1966), Harman and Štulajter (2011), and Dette et al. (2016, 2017). However, in contrast to these situations, our stochastic process (the misspecification term) contributes to the regression function of interest and is not an error term.
In our situation, the low-dimensional regression model is a fixed effect, and the misspecification term is a random effect in a mixed-effects model. Since we are here interested in estimating the combination of fixed and random effects, our aim is to predict within the mixed model. Optimal designs for prediction in mixed models have been considered by Prus and Schwabe (2016), Liu et al. (2019), and Prus (2020). Fedorov and Jones (2005) describe how a mixed-effects model can be used to determine designs for multicenter clinical trials. If we use a Gaussian process as the stochastic process, the model considered here is related to the non-parametric Gaussian process models used in machine learning (Williams and Rasmussen 2006). A Bayesian description of these models dates back to the 1970's (Blight and Ott 1975), and some design optimization has also been considered from a Bayesian perspective (O'Hagan 1978). This framework has been used in multiple applications, see e.g. Siivola et al. (2021). The optimal design methodology which we consider is, however, different from the one used by O'Hagan (1978).
The general model set-up and the methods used for prediction and design optimization are specified in Sect. 2. In Sect. 2.1, we derive how to estimate the response over the whole design space, and the structure of the covariance estimate is presented in Sect. 2.2. We present the specific model in Sect. 3. In Sect. 4, we present the optimality criterion used and the results obtained by applying Fedorov's algorithm. It is shown that even if the parameters of the low-dimensional model are known (locally optimal design), the optimal design depends on the sample size, in contrast to standard optimal design theory. In practice, our optimal design will typically be very similar to the traditional optimal design (without a misspecification term) when the sample size is small. However, with increasing sample size, the new optimal design will zoom in on the most important area. Depending on the optimal design criterion, this may e.g. be where the expected response has a certain target value.

The general model
We assume that we observe N observations Y_i in an experiment which follow Y_i = μ(x_i) + e_i, where the e_i are uncorrelated random variables with E(e_i) = 0 and Var(e_i) = σ² > 0, and the x_i are elements of a compact one- or multidimensional design space X. We consider a simple, low-dimensional approximate model ν(x) for the mean response. As this model is likely not fully correct, we consider the misspecification function μ(x) − ν(x). We will explicitly model the misspecification function with a zero-mean stochastic process C(x), i.e. we assume here that the true mean is μ(x) = ν(x) + C(x). The discrepancy C(x) can be seen as a random effect. We can view the model including C as a vehicle to construct robust designs and to shed light on the question of when and how much to rely on the approximate simple model ν.
We assume further that ν(x) = f(x)'β. Here, f(x)'β is a linear regression model with a d-dimensional unknown parameter vector β ∈ R^d and a d-dimensional known regression function f: X → R^d. So, μ is a mixture of a fixed and a random part:

μ(x) = f(x)'β + C(x).    (1)

In the random part, C(x) is assumed to be a stochastic process with zero mean, E{C(x)} = 0, x ∈ X, and existing second moments (E|C(x)C(y)| < ∞, x, y ∈ X). It is assumed to be independent of e = (e_1, . . . , e_N)'.
The interpretation of (1) is that our model is assumed to be a linear regression model f(x) β plus an additional misspecification term C. We know the functional shape of the model only up to this misspecification term. In this paper, we are interested in predicting the whole μ(x) on X and not only the linear regression term, since the misspecification correction part belongs to our model of interest.
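To make the set-up concrete, the following is a minimal simulation sketch of model (1) for the specific example used later in the paper (straight-line regression plus a Brownian bridge); the parameter values, noise level, and grid are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def brownian_bridge(x, rng):
    """Sample a Brownian bridge C at sorted points x in [0, 1].

    Uses the representation C(t) = W(t) - t * W(1), where W is a standard
    Brownian motion built from independent Gaussian increments.
    """
    dx = np.diff(np.concatenate(([0.0], x)))
    w = np.cumsum(rng.normal(0.0, np.sqrt(dx)))          # W at the points x
    w1 = w[-1] + rng.normal(0.0, np.sqrt(1.0 - x[-1]))   # W(1)
    return w - x * w1

# Simple model nu(x) = beta0 + beta1 * (x - 1/2) plus misspecification C(x)
beta = np.array([1.0, 2.0])                              # illustrative values
x = np.linspace(0.0, 1.0, 101)
C = brownian_bridge(x, rng)
mu = beta[0] + beta[1] * (x - 0.5) + C                   # true mean mu(x)
y = mu + rng.normal(0.0, 0.1, size=x.size)               # noisy observations Y_i
```

By construction C(0) = C(1) = 0, so the misspecification vanishes at the endpoints of the design space, a property that becomes relevant for the optimal designs later on.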
Often, one will observe several times at some x. Let m + 1 be the number of distinct design points and denote them by x̄_0, x̄_1, . . . , x̄_m ∈ X. If the design space is one-dimensional, we assume that the x_i and x̄_j are sorted ascendingly. The number of observations made at x̄_j is n_j, j = 0, . . . , m, i.e. ∑_{j=0}^{m} n_j = N. Our model can be written as a mixed-effects model.

Inference results of BLUE and BLUP
In mixed-effects models, the interest is to estimate β and to predict γ. A Best Linear Unbiased Estimator (BLUE) for β and a Best Linear Unbiased Predictor (BLUP) for γ are well known, see Christensen (2002). In the following lemma, we show how they can be computed specifically for our model. We provide formulae based both on Y and on Ȳ. Let M = (ZDZ' + R)^{-1}, which is a symmetric N × N matrix, and let I_d be the identity matrix of dimension d × d.

The Best Linear Unbiased Predictor for μ is μ̂.
4. Let x* ∈ [0, 1] be an arbitrary point where one wants to predict μ, and define the following functions of x*. The proof is available in the Appendix.
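The standard mixed-model formulae behind the lemma can be sketched as follows. This is a generic BLUE/BLUP computation with M = (ZDZ' + R)^{-1} as defined above (see e.g. Christensen 2002), not a reproduction of the paper's Lemma 2.1, and the function name and example dimensions are our own.

```python
import numpy as np

def blue_blup(X, Z, D, R, Y):
    """BLUE of beta and BLUP of gamma in Y = X beta + Z gamma + e,
    with Cov(gamma) = D and Cov(e) = R.

    Standard mixed-model formulae: M = (Z D Z' + R)^{-1},
    beta_hat = (X' M X)^{-1} X' M Y,
    gamma_hat = D Z' M (Y - X beta_hat).
    """
    M = np.linalg.inv(Z @ D @ Z.T + R)
    beta_hat = np.linalg.solve(X.T @ M @ X, X.T @ M @ Y)
    gamma_hat = D @ Z.T @ M @ (Y - X @ beta_hat)
    return beta_hat, gamma_hat
```

Consistent with the limiting behavior discussed below, letting D → 0 (random effect negligible) drives β̂ to the ordinary least squares estimator and γ̂ to the zero vector.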
In the case of a large variance σ², or alternatively in the case of small sample sizes n_i, the random-effect part in the mixed model becomes unimportant. Therefore, β̂ converges to the ordinary least squares estimator (X'X)^{-1}X'Y and γ̂ converges to the zero vector 0_{m+1}. This can formally be seen from Lemma 2.1. For design optimization, we need expressions for covariance matrices. In Lemma A.1 in the appendix, we state them for the BLUE and the BLUP. Since we aim to optimize the design by minimizing the uncertainty in prediction, the following theorem is important for our elaborations in Sect. 4.

Theorem 2.2
The covariance of the prediction error μ̂ − μ is given below. The proof can be found in the Appendix. Note that the diagonal entry belonging to x_i in the first expression, Cov(μ̂ − μ), is equal to the value of the second expression, Var(μ̂(x_i) − μ(x_i)).

The specific model
To illustrate our model, we use the following specific model. We consider straight-line regression on X = [0, 1], centered at 1/2, combined with a Brownian bridge as the misspecification process C(x); see Ross et al. (1996) and Chow (2009) for the definition and properties of the Brownian bridge.
The matrix W in model (3) for the vector of means Ȳ is the (m + 1) × 2-dimensional matrix with rows f(x̄_j)', j = 0, . . . , m.
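For m = 4, the ingredients of the specific model can be assembled as in the following sketch. The centered regression rows (1, x̄_j − 1/2) and the Brownian-bridge covariance Cov{C(s), C(t)} = min(s, t) − st are our reading of the model described above; the variable names are our own.

```python
import numpy as np

m = 4
xbar = np.arange(m + 1) / m                     # design points 0, 1/4, ..., 1

# Regression matrix W for the centered straight line: rows (1, xbar_j - 1/2)
W = np.column_stack([np.ones(m + 1), xbar - 0.5])

# Covariance matrix D of the Brownian bridge at the design points:
# Cov{C(s), C(t)} = min(s, t) - s * t
S, T = np.meshgrid(xbar, xbar, indexing="ij")
D = np.minimum(S, T) - S * T
```

Note that the first and last rows and columns of D are zero, since the Brownian bridge is tied down at 0 and 1.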

Numerical interpretation of the prediction
In order to better understand the impact of adding more information to the estimators and predictors, we present in this subsection a numerical illustration of them for the cases m = 2 and m = 4, for equidistantly spaced observations x̄_i = i/m, i = 0, . . . , m.
In Tables 1 and 2 (showing, for m = 2, the BLU-estimator β̂ and the BLU-predictor γ̂, respectively), we apply the same sample size at all design points, so n_0 = n_1 = n_2, and we investigate the effect of increasing N. Based on the results for β̂_1, the intercept of the linear regression, we see that while the weights of all Ȳ_i are equal when N is close to 0, the weight of Ȳ_1 decreases as N increases, until it is close to 0 when N tends to infinity. Thus, the information is spread equally among the three weights when N is small, and all the information is shared by the extreme values when N is large. Regarding the weights for B̂_1, we see that the weight of the middle design point Ȳ_1 increases with N, approaching 1 as N tends to infinity. Since the weights of B̂_1 sum to 0, the increase of the middle-point weight decreases the weights of the extreme points. Thus, starting with all weights close to 0 when N is almost zero, the weight of Ȳ_1 tends to 1 while the weights of Ȳ_0 and Ȳ_2 tend to −0.5 when N is large. Another way to read the two tables is through the impact of the amount of data on the use of the misspecification term. When there is little data (N small), we obtain almost all information from the simple linear regression. On the other hand, when we have a lot of data (N tends to infinity), the regression includes the Brownian bridge, which handles the large amount of information.
The results in Tables 3 and 4 give the estimators and the predictors for m = 4, so our n-vector consists of 5 elements. We obtain symmetry between the four outer means Ȳ_i in the case of the intercept β̂_1: the weight of Ȳ_0 equals the weight of Ȳ_4, and the weight of Ȳ_1 equals the weight of Ȳ_3. On the other hand, for the slope β̂_2 we have negative symmetry between Ȳ_0 and Ȳ_4 and between Ȳ_1 and Ȳ_3. The weights of the intercept sum to 1, while those of the slope sum to 0.
Since our misspecification term is a Brownian bridge, we get B̂_0 and B̂_4 equal to zero. There is a symmetry between B̂_1 and B̂_3: the weight that corresponds to Ȳ_0 in the case of B̂_1 is the same as the weight of Ȳ_4 for B̂_3, and in the same way the weight of Ȳ_1 for B̂_1 matches the weight of Ȳ_3 for B̂_3. Vice versa, the weights of Ȳ_0 and Ȳ_1 for B̂_3 equal those of Ȳ_4 and Ȳ_3, respectively, for B̂_1. The weight of Ȳ_2 remains the same for both predictors. The middle predictor B̂_2 follows the same weight symmetry: we get the same weight for the pair Ȳ_0 and Ȳ_4 and for the pair Ȳ_1 and Ȳ_3. As in the case of m = 2, the weights of the predictors sum to 0. In Tables 3 and 4 (for n-vectors such as (30, 10, 20, 10, 30)), we also see the effect of a 10-fold increase of N on the estimators and predictors. The weights of the means Ȳ_0 and Ȳ_4 at the extreme points increase with N, and consequently the weights of the means Ȳ_1, Ȳ_2 and Ȳ_3 at the middle points decrease. Therefore, for N = 100, we can focus on the boundary points, where there is no misspecification due to the Brownian bridge. On the other hand, for a smaller sample size, the observations at the edges do not give much information; instead, the middle observations gather most of the information, even at the risk of bias.
In a low-information case with N close to 0, the prediction follows the maximum likelihood estimate for a straight-line model (without misspecification term). In a high-information case with N large, the prediction at x̄_i becomes Ȳ_i. Between the observational points x̄_i, the prediction interpolates linearly; this can be seen from formula (4), since W* is linear in x and D* is linear on each interval between the x̄_i for our specific example. For N = 10 or N = 100 (as in Tables 3 and 4), the prediction shrinks the high-information curve towards the straight-line regression.

Optimal designs
Based on the results in Sect. 2, we can predict μ given data Y. We can base this prediction on the BLUE and the BLUP. For given data Y, the formulae discussed in Sect. 2 depend on the design for the data collection, which is specified by the numbers of observations n_j ≥ 0 at the design points x̄_j = j/m, j = 0, . . . , m. Now we consider the planning stage of an experiment, where we have the possibility to choose the design, i.e. the numbers n_j. We want to choose a design for the experiment which allows the best possible prediction of the mean vector μ.

Experimental designs
Following standard notation (Atkinson et al. 2007, for example), a design is written as

ξ = ( x̄_0  x̄_1  · · ·  x̄_m
      n_0  n_1  · · ·  n_m )    (5)

with n_j ≥ 0, j = 0, . . . , m, and ∑_{j=0}^{m} n_j = N for a given total number of observations N. When the n_j are only allowed to attain integer values, designs of the form (5) are called exact designs. Dropping this integer requirement and allowing arbitrary nonnegative values for n_j, designs are called approximate designs, following the approach of Kiefer (1974). We will later consider mainly approximate designs, but also exact designs in some examples. Since the quality of an experimental design for fixed-effects models usually does not depend on the total number of observations N, designs are often described by the weights w_j = n_j/N only, and these weights form a probability measure on the design space. We will see, however, that the quality of the design depends on N for our model. Therefore, we need the n_j, and not only the weights, to describe the design even if we drop the integer requirement.

Optimal designs in fixed and mixed models
In optimal experimental design, we determine a design (5) which minimizes some error of the estimates. In fixed-effects models with a parameter vector β, one is therefore often interested in minimizing the covariance matrix Cov(β̂). We are here interested not only in the low-dimensional model with the parameter vector β, but consider μ as the underlying model which we want to estimate. This means that our interest is to reduce the error of μ̂. Since μ is our underlying model, we should minimize Var(μ̂(x) − μ(x)) and not Cov(μ̂). Note that μ is a random parameter, in contrast to β. So while we have Cov(β̂ − β) = Cov(β̂) in fixed-effects models, we cannot remove μ from the covariance in mixed-effects models. Var(μ̂(x) − μ(x)) was specified in Theorem 2.2.

Optimality criteria
While the Loewner ordering could be used to order covariance matrices, defining A ≥ B if A − B is positive semi-definite, this is only a partial order, and minimization of covariance matrices is usually not well defined with this ordering. Instead, following a usual optimal design approach, we have to choose which "aspect" of the covariance matrix to optimize using a criterion function, which maps the space of non-negative definite matrices to real numbers.
When minimizing Cov(β̂) in fixed-effects models, many optimality criteria have been discussed in the literature, see e.g. Atkinson et al. (2007). Popular possibilities are to minimize the determinant of the covariance matrix, det(Cov(β̂)), to minimize the average variance, trace(Cov(β̂)), or the variance of a linear combination, Var(c'β̂) = c'Cov(β̂)c. If prediction is desired in random-effects models, these criteria can be applied, too, by applying them to the covariance of the prediction error, Cov(μ̂ − μ), see Prus and Schwabe (2016) and Prus (2020). See also Hooks et al. (2009) for a discussion of optimality criteria for mixed-effects models.
In our case, it is reasonable to focus on the predicted mean function μ(x), x ∈ [0, 1]. One possibility is to require that the variance of μ̂(x) − μ(x), x ∈ [0, 1], is small, and we can consider the integrated variance

∫₀¹ Var(μ̂(x) − μ(x)) dx.    (6)

We will focus in this article on designs minimizing expression (6), where Var(μ̂(x) − μ(x)) can be found in Theorem 2.2. Prus and Schwabe (2016) called this criterion the integrated mean-squared error (IMSE) criterion in the context of prediction and used it as their favorite criterion. Other alternatives than IMSE-optimality would be possible, of course. First, one could weigh different regions of the design space differently and define a weighted IMSE criterion. One could also be concerned with the worst variance in the design space and minimize max_{x∈[0,1]} Var(μ̂(x) − μ(x)). This criterion, called G-optimality in fixed-effects models, is actually the oldest optimality criterion, dating back to the work of Smith (1918). Another alternative is to minimize a scalar function of the prediction error covariance Cov(μ̂ − μ).
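The IMSE criterion (6) can be evaluated numerically for the specific model of Sect. 3. The sketch below uses the standard universal-kriging (BLUP) form of the prediction-error variance, which should agree in structure with Theorem 2.2; the function names, the grid size, and the trapezoidal quadrature are our own choices.

```python
import numpy as np

def pred_var(xstar, xbar, n, sigma2=1.0):
    """Prediction-error variance Var(mu_hat(x*) - mu(x*)) in the
    standard universal-kriging form for the Brownian-bridge kernel
    with a centered straight-line trend (a sketch; the paper's exact
    expression is given in its Theorem 2.2)."""
    k = lambda s, t: np.minimum(s, t) - s * t        # Brownian-bridge kernel
    K = k(xbar[:, None], xbar[None, :]) + np.diag(sigma2 / n)
    F = np.column_stack([np.ones_like(xbar), xbar - 0.5])
    ks = k(xbar, xstar)
    fs = np.array([1.0, xstar - 0.5])
    Kinv_ks = np.linalg.solve(K, ks)
    Kinv_F = np.linalg.solve(K, F)
    u = fs - F.T @ Kinv_ks
    return (k(xstar, xstar) - ks @ Kinv_ks
            + u @ np.linalg.solve(F.T @ Kinv_F, u))

def imse(xbar, n, sigma2=1.0, grid=201):
    """Integrated prediction variance (6), approximated by the
    trapezoidal rule on an equidistant grid over [0, 1]."""
    xs = np.linspace(0.0, 1.0, grid)
    v = np.array([pred_var(x, xbar, n, sigma2) for x in xs])
    return np.sum((v[1:] + v[:-1]) / 2 * np.diff(xs))
```

As expected from the discussion above, the criterion value decreases when the numbers of replications n_j increase, since the group means Ȳ_j then carry smaller error variances σ²/n_j.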
In some contexts, it might be desirable to estimate the value of x which gives a certain response, or the value of x which minimizes some cost function or maximizes a utility function, e.g. when the optimal price is of interest. These desires lead to c-optimality criteria. See e.g. Tsirpitzi and Miller (2021) for a maximization of a utility in a fixed-effects model.

Fedorov (1972) introduced an exchange method in which the sum of design weights remains the same. The procedure starts by setting the initial design with the initial weights n_0, n_1, . . . , n_m. The main idea of the Fedorov exchange algorithm is to search for the optimal design by exchanging a unit α from n_i to n_j. Thus, an important stage in this algorithm is to identify all possible exchange couples (n_i, n_j). In the classical Fedorov algorithm the unit α equals 1, so the algorithm exchanges one observation from n_i to n_j. If we consider not exact but continuous designs, α can be any value in the interval (0, 1] and can vary between iterations, so we can make larger changes in the first iterations and smaller ones in the last, when the algorithm approaches the optimal design. In order to find the couple (n_i, n_j) that will trade a unit α, Fedorov considered the interaction between the variance functions of the two weights. He defined the so-called Δ-function, which describes what happens to the optimality criterion when making a small change. Since we consider an IMSE-optimal design, we use the following Δ-function:

Fedorov algorithm for the IMSE-optimal design
where crit is the integrated variance of the prediction error μ̂ − μ for the initial n-vector. The Fedorov algorithm computes the Δ-value for all possible pairs and chooses the pair with the smallest Δ-value. If there is more than one couple with the same Δ-value, one is picked randomly. Since we are looking for the minimum Δ, we already have an improved design compared to the previous one as soon as we have a negative Δ-value. This procedure is repeated until no further exchange between a couple (n_i, n_j) decreases the Δ-value and improves the design.

Algorithm 1 Fedorov algorithm
Set the initial design with the initial values of the n-vector
Compute the integral of the prediction error μ̂ − μ
while a negative delta for a couple of weights is found do
    for i from 1 to m + 1 do
        for j from 1 to m + 1 do
            By exchanging α from n_i to n_j, compute the delta function for this couple
        end for
    end for
    Find the couple of weights with the smallest delta among all combinations
    if there is more than one couple with minimum delta then
        randomly select one couple
    end if
    Exchange α from n_i to n_j
    Update the n-vector
    Reset the minimum delta
end while

For further information on the exchange algorithm, see e.g. Triefenbach (2008).
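A minimal Python sketch of Algorithm 1 for a generic criterion function crit (the IMSE of (6) would be plugged in as crit); the function name is our own, and the random tie-breaking of Algorithm 1 is simplified here to taking the first best pair.

```python
import numpy as np

def fedorov_exchange(n, crit, alpha=1.0, max_iter=1000):
    """Fedorov-type exchange algorithm: repeatedly move a unit alpha
    from n[i] to n[j] whenever that decreases the criterion.

    `crit` maps an n-vector to the criterion value; the computed
    difference `delta` plays the role of the Delta-function in the text.
    The total sum of the n-vector is preserved by every exchange.
    """
    n = np.asarray(n, dtype=float).copy()
    for _ in range(max_iter):
        best = (0.0, None)
        c0 = crit(n)
        for i in range(len(n)):
            if n[i] < alpha:              # cannot take alpha from n[i]
                continue
            for j in range(len(n)):
                if i == j:
                    continue
                trial = n.copy()
                trial[i] -= alpha
                trial[j] += alpha
                delta = crit(trial) - c0  # change in the criterion
                if delta < best[0]:
                    best = (delta, (i, j))
        if best[1] is None:               # no improving exchange left
            break
        i, j = best[1]
        n[i] -= alpha
        n[j] += alpha
    return n
```

With alpha = 1 this yields exact designs; a small alpha gives the continuous variant discussed above.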

Constrained optimization for approximate designs
If we want to compute optimal approximate designs rather than optimal exact designs, we can also use one of many standard optimization algorithms. We can optimize both the numbers of observations n_i and the locations x̄_i of the design points. Optimal design problems are, however, constrained optimization problems. One equality constraint is that we want to determine the optimal design for a given total sample size N, i.e. the sum of all n_i needs to be N. We handle this here by setting n_m = N − ∑_{i=0}^{m−1} n_i and dropping n_m from the parameters to be optimized. We have further linear inequality constraints, which we handle by applying the barrier method (see e.g. Lange 2013, Chapter 16). The constraints are here: 0 ≤ x̄_0; x̄_{i−1} ≤ x̄_i, i = 1, . . . , m; x̄_m ≤ 1; n_i ≥ 0, i = 0, . . . , m − 1; ∑_{i=0}^{m−1} n_i ≤ N. After incorporating a barrier, we can apply standard optimization algorithms. For the numerical calculations in Sect. 4.6.2, we have used the Nelder-Mead algorithm (see e.g. Givens and Hoeting 2013, Chapter 2.2.4) as implemented in the R function constrOptim.
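The reparameterization and log-barrier described above can be sketched as follows; barrier_objective and the barrier parameter t are hypothetical names, and the resulting unconstrained objective would then be passed to a derivative-free optimizer such as Nelder-Mead (the paper itself uses R's constrOptim).

```python
import numpy as np

def barrier_objective(theta, crit, N, t=1e-3):
    """Log-barrier version of a design criterion for m + 1 support points.

    theta = (x0, ..., xm, n0, ..., n_{m-1}); n_m is recovered as
    N - sum(n_i), so the equality constraint is eliminated.  The linear
    inequality constraints are collected as g >= 0 and enforced by
    subtracting t * sum(log(g)).
    """
    mp1 = (len(theta) + 1) // 2
    x = theta[:mp1]
    n = np.concatenate([theta[mp1:], [N - theta[mp1:].sum()]])
    # inequality constraints g >= 0: 0 <= x0, ordering of x, xm <= 1, n >= 0
    g = np.concatenate([[x[0]], np.diff(x), [1.0 - x[-1]], n])
    if np.any(g <= 0):
        return np.inf                 # outside the feasible region
    return crit(x, n) - t * np.sum(np.log(g))
```

The barrier term grows without bound as a constraint becomes active, which keeps an unconstrained search such as Nelder-Mead inside the feasible region.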

Results of optimal design
This section considers our specific example where the fixed-effects model is straight-line regression on [0, 1] and the misspecification term is the Brownian bridge. We start in Sect. 4.6.1 by keeping the observational points fixed at x̄_i = i/m, i = 0, . . . , m, and optimizing the weights at these observational points, i.e. we use the discrete design space X_m = {0, 1/m, . . . , 1}. In Sect. 4.6.2, we will optimize both the x̄_i and the weights and have design space X = [0, 1] again.

Optimal weights for fixed observational points for IMSE optimality
In Figures 2, 3, 4 and 5, the weights n_i/N of the optimal design are shown in dependence on N/σ² = ∑_i n_i/σ² for m = 4 and m = 12, where σ² is set equal to 1 and the x-axis is in log scale. Note again that n_j/N = n_{m−j}/N, since the structure of the regression function is symmetric around 1/2, the Brownian bridge is tied down and symmetric, and the optimality criterion is symmetric, too. For m = 4, the values of n_i/N, i = 0, 1, 2, are shown in Figure 2 depending on the values of N/σ². With σ² = 1, the values of N run from 5 to 10,000, and the exchange unit α in Fedorov's algorithm is set to 0.01. When N is small, the weight goes to the extreme values, while when N is larger than 30, the middle n_i/N carry more weight than the extreme ones. For any N higher than 145, the weights stabilize: the weight of the extreme points n_0 = n_4 is almost 0.17, while the weights of the middle points n_1 = n_3 and n_2 are almost 0.22.
In order to better understand how small N should be for all the weight to be placed on the two extreme values n_0, n_4, we created Figure 3. With the log-scaled x-axis on the interval 1 to 10 and α = 0.001, the weight of the extreme values moves from 0.5 down to 0.25. Accordingly, the two extreme values share all the weight as long as N/σ² is below 2. For values greater than or equal to 2, n_0 and n_4 lose their monopoly: they start sharing the weight first with the two less extreme values n_1, n_3 and later with n_2.
In Figure 4 we illustrate two cases in order to show the impact of the exchange unit α in the Fedorov algorithm. The left panel corresponds to the discrete case, where α is 1, while the right panel is the continuous case with α = 0.001. While both plots follow the same pattern, the interesting aspect in these two cases is the fluctuation of the n_i lines due to discreteness, compared to the smoothness that occurs in the continuous case. Due to discreteness, the optimal design need not be symmetric, and we therefore do not have n_i = n_{m−i} in all cases; these values can differ by 1. In the continuous case, the computed optimal designs are always symmetric with n_i/N = n_{m−i}/N.
(Figure caption: optimal design for n_i/N for m = 4 and N/σ² between 1 and 10, using the Fedorov algorithm with exchange unit α = 0.01.)
In Figure 5, for m = 12, we present the values of n_i/N, i = 0, 1, . . . , 6, on the y-axis and the values of N/σ² in log scale on the x-axis, with σ² set equal to 1 and α = 0.01. In contrast to small values of N/σ², where the weight is gathered mainly at the extreme values, all middle n_i/N tend to the same value, which is almost 0.08, when N/σ² exceeds 1,000.
So in both cases, m = 4 and m = 12, we see the impact of the amount of data on the use of the misspecification part, as already mentioned. When we have more and more data (N/σ² large), the optimal design estimates the misspecification part, since it is possible to obtain meaningful information on this part when a lot of data can be collected. On the other hand, the misspecification part is ignored by the optimal design if only little data can be collected (N/σ² small), and the optimal design coincides with the one for simple linear regression, putting half of the observations at each of the two endpoints of the design space.

Optimal observational points and weights
For the case m = 4, we now also allow the observational points x̄_i, i = 0, . . . , 4, to be chosen to optimize the IMSE criterion. Figure 6 shows the optimal x̄_i in the left panel and the corresponding optimal weights n_i/N in the right panel.
For N/σ² ≤ 1, the two-point design having 50% weight at 0 and 1 is optimal. For N/σ² somewhat larger than 1, two further design points, x̄_1 and x̄_3, close to 0 and 1, respectively, are included with low weights, and then a design point x̄_2 is added in the middle at 0.5 as well. With increasing N/σ², x̄_1 and x̄_3 then tend towards the middle and increase their weights. For N/σ² < 10, x̄_0 = 0, x̄_4 = 1 is the optimal choice, while it is better to choose x̄_0 > 0 and x̄_4 < 1 when N/σ² ≥ 10.
For large N/σ², the optimal design has equal weights on all five observational points. The observational points are equidistantly spaced and x̄_0 ≈ 0.0668. We now consider the case N/σ² → ∞ formally in Sect. 4.7, where we determine the IMSE-optimal design in the limiting case analytically.

Asymptotic case for large information
When N → ∞, or alternatively σ² → 0, we get D_Ȳ → D. This limiting model can be seen as being free from the error term e.
It can be seen from our formulas that μ̂ = Ȳ. The model without error term e has been considered for computer experiments, see e.g. Sacks et al. (1989) and Williams and Rasmussen (2006).
For the model without error term, it makes no sense for the prediction of μ to observe repeatedly at the same observational point, and since we then have only single observations at each point, we can call them here x_i. To optimize the design, we therefore consider no weights and optimize only the choice of the x_i ∈ [0, 1], i = 0, …, m. We now derive the optimal design for our specific example, the Brownian bridge with straight-line regression, f(x)ᵀ = (1, x). Let the number of design points x_0, …, x_m be fixed at m + 1. For N → ∞ or for σ² → 0, the IMSE converges to the limiting criterion (7).

Theorem 4.1 The IMSE-optimal design for this limiting case is the equidistant design with design points x_i = x_0 + i(1 − 2x_0)/m, i = 0, …, m, where x_0 is the unique solution in (0, 1/m) of the third-degree equation h_m(x) = 0 in (9).

The proof of Theorem 4.1 can be found in the appendix. For m = 4, x_0 ≈ 0.0668. Thus, we have analytically confirmed the numeric results for large N/σ² obtained in Sect. 4.6.2. For m = 12, x_0 ≈ 0.0256.
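As a quick numerical check (our own sketch, not code from the paper), x_0 can be computed by bisection on (0, 1/m), using the cubic h_m(x) = (8m + 8)x³ − (9m + 12)x² + (3m + 6)x − 1 from the appendix:

```python
def h(m, x):
    # Cubic from the appendix: h_m(x) = (8m+8)x^3 - (9m+12)x^2 + (3m+6)x - 1
    return (8 * m + 8) * x**3 - (9 * m + 12) * x**2 + (3 * m + 6) * x - 1

def x0(m, tol=1e-12):
    # Lemma A.2: h_m has exactly one root in (0, 1/m); since h_m(0) = -1 < 0
    # and h_m(1/m) > 0, simple bisection on [0, 1/m] locates it.
    lo, hi = 0.0, 1.0 / m
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(m, mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(x0(4), 4))   # 0.0668, matching the numeric results of Sect. 4.6.2
print(round(x0(12), 4))  # 0.0256
```

The same root-finding could of course be done with any standard solver; bisection is used here only to keep the sketch dependency-free.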
We now state the limiting distribution for m → ∞ in the following corollary.
Corollary 4.2 Consider the IMSE-optimal design for the limiting case N → ∞ or σ² → 0 from Theorem 4.1. For m → ∞, the IMSE-optimal design converges weakly to the uniform distribution on [0, 1].
Proof Since x_0 is guaranteed to be below 1/m and since the x_i are equidistantly spread, it is obvious that the maximal difference between the empirical distribution function F_m(x) of the optimal design and F(x) = x, x ∈ [0, 1], converges to 0, i.e. sup_{x ∈ [0,1]} |F_m(x) − x| → 0 for m → ∞. □
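This convergence can be illustrated numerically (again a sketch of ours, reusing the cubic h_m(x) = (8m + 8)x³ − (9m + 12)x² + (3m + 6)x − 1 from the appendix): the Kolmogorov distance between the optimal design's empirical distribution function and the uniform one shrinks as m grows.

```python
def x0(m, tol=1e-12):
    # Unique root of h_m(x) = (8m+8)x^3 - (9m+12)x^2 + (3m+6)x - 1 in (0, 1/m),
    # found by bisection (h_m(0) = -1 < 0, h_m(1/m) > 0).
    h = lambda x: (8*m + 8)*x**3 - (9*m + 12)*x**2 + (3*m + 6)*x - 1
    lo, hi = 0.0, 1.0 / m
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def kolmogorov_distance(m):
    # Equidistant optimal design x_i = x_0 + i (1 - 2 x_0)/m, weight 1/(m+1)
    # each.  F_m is a step function jumping at every x_i, so the supremum of
    # |F_m(x) - x| is attained at a jump point, from the left or the right.
    a = x0(m)
    pts = [a + i * (1 - 2 * a) / m for i in range(m + 1)]
    return max(max(abs((i + 1) / (m + 1) - x), abs(i / (m + 1) - x))
               for i, x in enumerate(pts))

for m in (4, 12, 100):
    print(m, round(kolmogorov_distance(m), 4))  # 4 0.1332, 12 0.0513, 100 0.0066
```

The distance decays at rate O(1/m), consistent with the jump heights 1/(m + 1) of F_m.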
We have therefore shown that it is best to spread the design points uniformly over the design space when we expect to be able to collect highly informative data.
In some other cases, IMSE-optimal designs have been computed for the error-free model. As one example, Mukherjee (2003) considers the Brownian bridge as C(x) without a regression function, f(x)ᵀβ = 0; her solution for the IMSE-optimal design is x_i = i/(m + 1), i = 1, …, m. Abt (1992) has also considered the Brownian bridge C(x), with a constant regression f(x) = 1. An IMSE-optimal design in his case has distance 1/(m + 1) between the design points, and the first design point can be chosen freely such that it is at most 1/(m + 1). The case of linear or quadratic regression with Brownian motion as the process has been considered by Harman and Štulajter (2011). They show that equidistant designs are optimal for a large class of optimality criteria (not including IMSE-type criteria).
In addition to the mentioned cases where equidistant designs are optimal, the uniform design also has optimality properties for other models and situations. In the different setting of optimizing a lack-of-fit test when the alternative hypothesis is a rich class of functions, Wiens (1991) has shown the optimality of the uniform design; see also Biedermann and Dette (2001), Bischoff and Miller (2006), and Wiens (2019).

Discussion
Model selection is an old area within statistics and is, in specific forms, used in machine learning. The universe of potential models can itself be viewed as a pre-defined model. However, model selection traditionally does not respond to small or moderate misspecifications of a low-parameter model. A third-order polynomial can be attempted as an alternative to a quadratic. Data mining can cut the covariate space into fragments and offer different models in different intervals. Yet an almost linear model with irregular deviations, for example, may not be caught. The idea in this paper of adding a misspecification term could potentially be expanded to such settings, where competing models could be combined with a misspecification term.
The Brownian bridge distortion used in the main example should only be viewed as one possible misspecification function. We do not think that the difference between the true model and the low-parameter model, e.g. a linear model, would be exactly a Brownian bridge. The idea is rather that the misspecification cannot easily be understood or modeled. The Brownian bridge is then one reasonable choice that will result in a more pragmatic design, humbly reflecting that the simple low-parameter model is an approximation. While the Brownian bridge may be a useful misspecification model, an important factor limiting the practical usefulness of the methodology presented in this paper is that the volatility of the Brownian bridge is assumed to be known in advance. The choice of the volatility parameter will impact the choice of design. It is, in principle, possible to estimate the volatility from the experimental data. However, that would partly contradict the idea of having a misspecification term to represent unknown model inaccuracies. One solution, in some situations, is to use previous experiments for similar situations to estimate the volatility. Another solution is to guesstimate the volatility before the experiment and then to check the robustness over a range of plausible sizes for the misspecification. Alternative misspecification models exist that could be used instead of a Brownian bridge, but the issue of choosing a volatility parameter, or a similar measure of the size of the misspecification, will remain. An interesting idea is to use spline functions for the misspecification. A drawback, in settings where monotonicity should theoretically be expected, is that the resulting combined model will not be monotone with standard splines. Consequently, the function will not be invertible. The same issue is partly relevant for the Brownian bridge. In a concrete situation, one should consider whether this is a real problem or a useful approximation. One way of distorting e.g. an Emax model while preserving monotonicity would be to distort (compress/expand) the dose scale (x-axis) rather than the response (y-axis).
The choice of optimality criterion is important when applying optimal design theory. In the example, we have chosen to focus on IMSE-optimality, integrating the variance over the entire design space. Specific situations may call for quite different optimality criteria. For example, optimizing the profit in a simple demand model could imply that the objective is to find arg max_x ((x − k) · f(x)), where x is the price, k the variable cost per unit, and f(x) the demand function. Further research could explore optimal designs for this and a multitude of other relevant optimality criteria.
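As an aside, for a hypothetical linear demand f(x) = a − bx (an illustrative assumption of ours, not a model from the paper), this profit objective has a closed form:

```latex
% Profit for price x, unit cost k, and hypothetical linear demand f(x) = a - bx:
%   \pi(x) = (x - k) f(x) = (x - k)(a - bx).
% Setting \pi'(x) = a + bk - 2bx = 0 gives the unique maximizer
%   x^* = \frac{a + bk}{2b} = \tfrac{1}{2}\!\left(\frac{a}{b} + k\right),
% i.e. the midpoint between the unit cost k and the choke price a/b.
```

With an estimated, nonlinear f(x), no such closed form is available in general, which is what makes the design question for this criterion nontrivial.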
The methodology is relatively computer-intensive. Our example, using only one covariate and two parameters, does not require much computing time. However, computational issues can be expected to arise in more complicated models. It would be of interest to build similar models with multiple covariates and bring the methodology closer to machine learning, where many covariates are typically explored. Computing times can be decreased by using more efficient optimization algorithms; how best to construct such algorithms may be an area for future research.
For σ² = 0, the matrix D_Ȳ = D is not invertible if x_0 = 0 or x_m = 1, but the limits of the integrated v_1 and v_2 exist when σ² → 0. That is, we carry out the following considerations for the case x_0 > 0, x_m < 1, but the resulting formulas are correct if x_0 = 0 or x_m = 1 as well. Using the well-known tri-diagonal form of D⁻¹ (see e.g. Abt 1992; Bischoff et al. 2003), the limiting IMSE (7) can be computed explicitly. To show the optimality of design (8), we compute the derivatives of the IMSE (7) with respect to x_i, i = 0, …, m, and set them to 0. From the derivatives for x_i, i = 1, …, m − 1, we obtain the unique solution x_i = (x_{i+1} + x_{i−1})/2, and therefore the optimal design is equidistant. Setting the derivatives with respect to x_0 and x_m to 0, we obtain that x_m = 1 − x_0 and that x_0 solves the third-degree equation h_m(x) = 0 with

h_m(x) = (8m + 8)x³ − (9m + 12)x² + (3m + 6)x − 1.   (9)
Lemma A.2 shows that h_m(x) = 0 has exactly one solution and that this solution is in (0, 1/m). Solving the equation yields the value noted in the theorem.
Lemma A.2 For each m ≥ 1 and for h_m defined by (9), the third-degree equation h_m(x) = 0 has exactly one solution x_0. This solution is in the interval (0, 1/m).