A note on the Gao et al. (2019) uniformmixture model in the case of regression

We extend the uniform mixture model of Gao et al. (Ann Oper Res, 2019. https://doi.org/10. 1007/s10479-019-03236-9) to the case of linear regression. Gao et al. (Ann Oper Res, 2019. https://doi.org/10.1007/s10479-019-03236-9) proposed that to characterize the probability distributions of multimodal and irregular data observed in engineering, a uniform mixture model can be used. This model is a weighted combination of multiple uniform distribution components. This case is of empirical interest since, in many instances, the distribution of the error term in a linear regression model cannot be assumed unimodal. Bayesian methods of inference organized around Markov chain Monte Carlo are proposed. In a Monte Carlo experiment, significant efficiency gains are found in comparison to least squares justifying the use of the uniform mixture model.

The uniform distribution in the interval (a, b) has probability density: The UMM is defined by discretizing the support to points {a 1 , a 2 , . . . , a N +1 }, where N is given, and using the following mixture density: where I(·) is the indicator function and the weights w j satisfy w j ≥ 0, j = 1, . . . , N ,

The case of linear regression
Consider now a regression model of the form where y i is the dependent variable, x i ∈ k is a vector of explanatory variables, β ∈ k is a vector of coefficients to be estimated, and n > k is the number of observations. Suppose the first element of x i is unity so that an intercept is always present in the model. Assuming the distribution of the error term, u i , is unknown but can be approximated by a UMM, we must have E(u i |{x t } n t=1 ) = 0, i = 1, . . . , n, which implies the following constraint: assuming a j+1 − a j = ∀ j. From (2) we have that: Since a j = a 1 + ( j − 1) , we can write this equation as: which implies that

Markov chain Monte Carlo (MCMC) in general
Very often, complicated posterior distributions arise in statistics, operations research, and related field. Given a parameter α ∈ A ⊆ R d , and data D, suppose that the likelihood function is L (α; D). Suppose also we have a prior on the parameters, say p(α). By Bayes theorem we know that the posterior is: In general, we are interested in the posterior means of certain functions of interest, say f (α). The posterior mean of this function of interest is: where E α|D [ f (α)] denotes posterior expectation, and the denominator is the normalizing constant of the posterior. Part of the problem could be to find marginal posterior densities. If we partition α = α 1 , α 2 then the marginal posterior density of α 1 would be These integrals are typically, not available in closed form unless the problem is very simple. The Gibbs sampler, a particular MCMC technique relies on the idea that we may be able to produce a sequence of parameter draws α (s) , s = 1, . . . , S , not necessarily iid, which converges (as S → ∞) to the posterior whose unormalized density is given by (9). If such a sample were available, the posterior expectation in (10) could be accurately approximated as follows: Therefore, a sampling approach would facilitate the tasks of Bayesian inference to a great degree. The Gibbs sampler relies on the idea that the sequence α (s) , s = 1, . . . , S can be produced recursively by using the conditional posterior distribution of each element of α. Suppose for example α = [α 1 , α 2 ] where α 1 , α 2 are two scalar parameters for simplicity (although clearly they can be vectors). The Gibbs sampler is as follows: and so on, if there are additional parameters. We repeat for s = 1, . . . , S and we assume α (0) 2 is available. Quite often, the conditional posterior distributions are univariate and amenable to random number generation by commonly available means.

MCMC in the UMM linear regression model
whose interpretation is that u i is drawn from a uniform distribution in a J i , a J i +1 with probability w j . In turn, the posterior (augmented) distribution of the model is: Here, θ is the parameter vector which includes β and some other elements as we explain below, and D denotes the entire data set {y i , x i } n i=1 . Therefore, we have: where I(·) is the indicator function, So, n j represents the number of observations in the jth sub-interval. It turns out that given and N the endpoint a 1 can be estimated from the data. Define the parameter vector as Given the J i s we must have: Therefore, the conditional posterior of regression parameters, β, is: From (5) along with the posterior in (15) where the first inequality comes from the restriction: a N +1 = a 1 + N > max t=1,...,n (y t − x t β). Moreover, we have: Therefore, the right endpoint can be expressed in terms of N , a 1 and a N +1 . If we wish to impose the constraint a 1 = −a N +1 then we have a N +1 = N 2 . In this case, a 1 = − N 2 , and a 1 has to be treated as given. We follow this practice, throughout to simplify the analysis as treating a 1 adds a layer of technicalities, although it is straightforward to treat it as an unknown parameter. In practice, the support of the error can be accurately estimated using the standard error of LS residuals.
Given {w j }, N and , these equations determine the values of the endpoints. Suppose our prior is In turn, the conditional posterior of weights is: subject to (3), which is a Dirichlet distribution.
From (17) we have that β has to be drawn from the prior p(β) subject to the restrictions that > x i β > ψ, i = 1, . . . , n, as in (17). A particular convenient prior is the flat prior, viz. p(β) ∝ const. All the above techniques can be implemented using straightforward Markov Chain Monte Carlo (MCMC) techniques organized around the Gibbs sampler (Gelfand and Smith 1990) by drawing successively random numbers from the conditional posterior distributions in (17) and (21). In particular, for β we proceed as follows. The restrictions that > x i β > ψ, i = 1, . . . , n, as in (17), can be written, in matrix notation as: where 1 n is an n × 1 vector of ones, and X is the n × k matrix of regressors. In turn, the posterior conditional distribution of β is p(β) ∝ const. subject to these restrictions. Suppose X = [x 1 , . . . , x k ] where x j is the jthe column of X , an n × 1 vector. We can write (22) as follows: Suppose we want to draw β 1 |β 2 , . . . , β k , D. Then the conditional posterior distribution of β 1 is uniform in subject to the restrictions: * We can draw β 1 (conditional on all other βs) from a uniform distribution subject to the restrictions in (24) which are enforced via rejection sampling. Repeating for each j = 1, . . . , k we obtain draws from the posterior conditional distribution of β j |β (− j) , D, j = 1, . . . , k. Finally, to obtain draws from the conditional distribution of {J i } n i=1 we have: (25) In turn, we normalize π j = , and we set J i = j with probability π j , j = 1, . . . , N . The Gibbs sampler yields a sample β (s) , w (s) , J (s) S s=1 which converges to the posterior distribution whose non-normalized density is given in (15), as S increases.

Monte Carlo evidence
We consider four cases for the distribution of the error term as in Fig. 1.
Our interest focuses on comparing with least squares (LS) regression and the potential improvement in efficiency, which is defined as Eff = var(b j,L S )/var(b j ), where j = 1, 2, b j,L S is the Bayes posterior mean estimate of β j from the UMM model, b j,L S is the LS estimator of β j , and "var" denotes sampling variance. We use 10,000 Monte Carlo simulations to examine the efficiency of LS versus UMM-regression-based techniques. MCMC is implemented using 15,000 passes the first 5000 of which are discarded during the "burn-in" phase. Initial conditions were obtained from LS and, in all cases, we have N = 100 points in the support of the error term.
From the results in Table 1, regression-UMM-based techniques are considerable more efficient compared to LS particularly for "small" samples (i.e. n ≤ 1000) although even at n =10,000 the improvement in efficiency is quite evident. With n =10,000 the efficiency is (d) Fig. 1 a A mixture of five normals, with means − 10, − 5, 0, 5, 10, standard deviations 0.25, 1, 0.5, 1, 0.25, and probabilities 0.2. b A mixture of five Student-t densities with one degree of freedom and the same configurations as in a. c A mixture of five lognormal densities with one degree of freedom and the same configurations as in a. d A mixture of ten Student-t densities with randomly selected means using N (0, 10 2 ), randomly selected standard deviations using |N (0, 1)| and ten randomly selected probabilities in the interval (0, 1) normalized so that they sum up to unity The results are based on 10,000 of Monte Carlo replications. The results refer to b 1,L S and b 1 . The efficiency of b 1,L S and b 1 was quite similar to the results reported above. We use 10,000 Monte Carlo simulations to examine the efficiency of LS versus UMM-regression-based techniques. MCMC is implemented using 15,000 passes the first 5000 of which are discarded during the "burn-in" phase. Initial conditions were obtained from LS and, in all cases, we have N = 100 points in the support of the error term s.e. standard error close to unity but still the efficiency of UMM is larger (notice that LS is best linear unbiased, but the UMM-regression estimator is not linear so efficiency gains are possible even in quite large samples). Moreover, the regression-UMM-based estimator is, practically, unbiased as it mean squared error and variance are very similar (results available on request). Finally, efficiency gains are largest in cases (b) and (c) where the mixing components are far from normality (viz. Student-t with one degree of freedom and lognormal components). Another interesting case is to consider u i ∼ N (0, σ 2 ), i = 1, . . . , n, where σ 2 is estimated using the LS estimator , and b L S = (X X ) −1 X y. In turn, we know that the support of the error terms is, approximately, (−3s, 3s) (perhaps too "generously"). Even a plot of LS residuals can inform us, at least in large samples, about the support as well as the form of the distribution of errors.
Using the same data generating process as in cases (a), (b), and (c), we examine the bias and efficiency of LS estimator of β 1 and UMM-regression with n = 100 but different number of points (N ) in the support of UMM-regression. in Table 2.
For example the mean square error (MSE) of LS is 0.011 2 + 0.014 2 = 0.000317 while the MSE of UMM-regression estimator with N = 50 is 0.007 2 + 0.011 2 = 0.00017 so the ratio of MSEs is almost 1.86. The MSE is lower compared to LS even if we use only N = 10 points in the support of the error.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.