# A note on the Gao et al. (2019) uniform mixture model in the case of regression

• Mike G. Tsionas
• Athanasios Andrikopoulos
Open Access
Short Note

## Abstract

We extend the uniform mixture model of Gao et al. (Ann Oper Res, 2019) to the case of linear regression. Gao et al. (2019) proposed that a uniform mixture model can be used to characterize the probability distributions of multimodal and irregular data observed in engineering. This model is a weighted combination of multiple uniform distribution components. The regression case is of empirical interest since, in many instances, the distribution of the error term in a linear regression model cannot be assumed unimodal. Bayesian methods of inference organized around Markov chain Monte Carlo are proposed. In a Monte Carlo experiment, significant efficiency gains are found in comparison to least squares, justifying the use of the uniform mixture model.

## Keywords

Multimodal data, Uniform mixture model, Regression models, Statistical inference, Bayesian analysis

## 1 Introduction

Gao et al. (2019) proposed that a uniform mixture model (UMM), which is a weighted combination of multiple uniform distribution components, can be used to characterize the probability distributions of multimodal and irregular data observed in practical engineering. As these authors note, because of noise in many data sets, “probability distributions of observed data can not be accurately characterized by typical unimodal distributions (such as normal, lognormal, and Weibull distributions), and the adequacy of typical unimodal distributions may be questioned”.

The uniform distribution on the interval $$(a,b)$$ has probability density:
\begin{aligned} f(u)={\left\{ \begin{array}{ll} \frac{1}{b-a}, &{} \mathrm {if}\;a<u<b,\\ 0, &{} \mathrm {otherwise}. \end{array}\right. } \end{aligned}
(1)
The UMM is defined by discretizing the support to points $$\left\{ a_{1},a_{2},\ldots ,a_{N+1}\right\}$$, where N is given, and using the following mixture density:
\begin{aligned} f_{UMM}(u)=\sum _{j=1}^{N}w_{j}\frac{1}{a_{j+1}-a_{j}}\mathbb {I}(a_{j}<u<a_{j+1}), \end{aligned}
(2)
where $$\mathbb {I}(\cdot )$$ is the indicator function and the weights $$w_{j}$$ satisfy
\begin{aligned} w_{j}\ge 0,\,j=1,\ldots ,N,\,\sum _{j=1}^{N}w_{j}=1. \end{aligned}
(3)
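The density in (2) and the weight constraint in (3) can be illustrated with a minimal Python sketch. The function names `umm_pdf` and `umm_sample`, and the use of NumPy, are assumptions made for this illustration and are not part of Gao et al. (2019).

```python
import numpy as np

def umm_pdf(u, edges, weights):
    """Evaluate the UMM density (2) at the points u.

    edges   : a_1 < a_2 < ... < a_{N+1}
    weights : w_1, ..., w_N satisfying (3)
    """
    u = np.atleast_1d(np.asarray(u, dtype=float))
    edges = np.asarray(edges, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Constraint (3): non-negative weights that sum to one.
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to one")
    widths = np.diff(edges)  # a_{j+1} - a_j
    # 0-based index of the sub-interval containing each u. Points that fall
    # exactly on an interior edge are assigned to the interval on their right;
    # this is a measure-zero convention of the sketch.
    j = np.searchsorted(edges, u, side="right") - 1
    inside = (u > edges[0]) & (u < edges[-1])
    dens = np.zeros_like(u)
    dens[inside] = weights[j[inside]] / widths[j[inside]]
    return dens

def umm_sample(n, edges, weights, rng):
    """Draw n values: pick component j with probability w_j, then draw uniformly."""
    edges = np.asarray(edges, dtype=float)
    weights = np.asarray(weights, dtype=float)
    j = rng.choice(len(weights), size=n, p=weights)
    return rng.uniform(edges[j], edges[j + 1])
```

For instance, with `edges = [-1, 0, 1, 2]` and `weights = [0.2, 0.5, 0.3]`, the density is constant on each of the three sub-intervals and zero outside $$(-1,2)$$.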

## 2 The case of linear regression

Consider now a regression model of the form
\begin{aligned} y_{i}=x'_{i}\beta +u_{i},i=1,\ldots ,n, \end{aligned}
(4)
where $$y_{i}$$ is the dependent variable, $$x_{i}\in \mathfrak {R}^{k}$$ is a vector of explanatory variables, $$\beta \in \mathfrak {R}^{k}$$ is a vector of coefficients to be estimated, and $$n>k$$ is the number of observations. Suppose the first element of $$x_{i}$$ is unity so that an intercept is always present in the model. Assuming the distribution of the error term, $$u_{i}$$, is unknown but can be approximated by a UMM, we must have $$\mathbb {E}(u_{i}|\{x_{t}\}_{t=1}^{n})=0,i=1,\ldots ,n$$, which implies the following constraint:
\begin{aligned} \mathbb {E}(u_{i}|x_{i})=\sum _{j=1}^{N}w_{j}\frac{a_{j}+a_{j+1}}{2}=\Delta \sum _{j=1}^{N}jw_{j}+a_{1}-\tfrac{\Delta }{2}=0, \end{aligned}
(5)
assuming $$a_{j+1}-a_{j}=\Delta \,\forall j$$. From (2) we have that:
\begin{aligned} f_{UMM}(u_{i})=\sum _{j=1}^{N}w_{j}\frac{1}{\Delta }\mathbb {I}(a_{j}<y_{i}-x'_{i}\beta <a_{j+1}),i=1,\ldots ,n. \end{aligned}
(6)
Since $$a_{j}=a_{1}+(j-1)\Delta$$, we can write this equation as:
\begin{aligned} f_{UMM}(u_{i})=\sum _{j=1}^{N}w_{j}\frac{1}{\Delta }\mathbb {I}(a_{1}+(j-1)\Delta<y_{i}-x'_{i}\beta <a_{1}+j\Delta ),i=1,\ldots ,n, \end{aligned}
(7)
which implies that
\begin{aligned} f_{UMM}(u_{i})=\sum _{j=1}^{N}w_{j}\frac{1}{\Delta }\mathbb {I}(-\Delta<y_{i}-x'_{i}\beta -a_{1}-j\Delta <0),i=1,\ldots ,n. \end{aligned}
(8)
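The mean constraint (5) and the interval membership in (7) and (8) can be checked numerically. The following Python sketch assumes equal-width sub-intervals as in the text; the helper names `umm_mean` and `bin_index` are hypothetical.

```python
import numpy as np

def umm_mean(weights, a1, delta):
    """Mean of the equal-width UMM, i.e. the middle expression in (5)."""
    weights = np.asarray(weights, dtype=float)
    j = np.arange(1, len(weights) + 1)
    return delta * np.sum(j * weights) + a1 - delta / 2.0

def bin_index(resid, a1, delta, n_bins):
    """1-based index j with a1 + (j-1)*delta < resid < a1 + j*delta, as in (7)."""
    resid = np.asarray(resid, dtype=float)
    j = np.ceil((resid - a1) / delta).astype(int)
    if np.any((j < 1) | (j > n_bins)):
        raise ValueError("residual outside the support (a_1, a_{N+1})")
    return j
```

A symmetric weight vector on a support that is symmetric about zero, for example `weights = [0.25, 0.5, 0.25]` with `a1 = -1.5` and `delta = 1.0`, gives a mean of zero and therefore satisfies (5).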

## 3 Statistical inference

### 3.1 Markov chain Monte Carlo (MCMC) in general

Very often, complicated posterior distributions arise in statistics, operations research, and related fields. Given a parameter $$\alpha \in \mathcal {A}\subseteq \mathfrak {R}^{d}$$, and data $$\mathcal {D}$$, suppose that the likelihood function is $$\mathscr {L}(\alpha ;\mathcal {D})$$. Suppose also we have a prior on the parameters, say $$p(\alpha )$$. By Bayes' theorem we know that the posterior is:
\begin{aligned} p(\alpha |\mathcal {D})\propto \mathscr {L}(\alpha ;\mathcal {D})p(\alpha ). \end{aligned}
(9)
In general, we are interested in the posterior means of certain functions of interest, say $$f(\alpha )$$. The posterior mean of this function of interest is:
\begin{aligned} \mathbb {E}_{\alpha |\mathcal {D}}\left[ f(\alpha )\right] =\frac{\intop _{\mathcal {A}}f(\alpha )p(\alpha |\mathcal {D})\,d\alpha }{\intop _{\mathcal {A}}p(\alpha |\mathcal {D})\,d\alpha }, \end{aligned}
(10)
where $$\mathbb {E}_{\alpha |\mathcal {D}}\left[ f(\alpha )\right]$$ denotes posterior expectation, and the denominator is the normalizing constant of the posterior. Part of the problem could be to find marginal posterior densities. If we partition $$\alpha =\left[ \alpha '_{1},\alpha '_{2}\right] '$$ then the marginal posterior density of $$\alpha _{1}$$ would be
\begin{aligned} p(\alpha _{1}|\mathcal {D})=\frac{\intop _{\mathcal {A}}p(\alpha _{1},\alpha _{2}|\mathcal {D})\,d\alpha _{2}}{\intop _{\mathcal {A}}p(\alpha |\mathcal {D})\,d\alpha }. \end{aligned}
(11)
These integrals are typically not available in closed form unless the problem is very simple. The Gibbs sampler, a particular MCMC technique, relies on the idea that we may be able to produce a sequence of parameter draws $$\left\{ \alpha ^{(s)},\,s=1,\ldots ,S\right\}$$, not necessarily iid, which converges (as $$S\rightarrow \infty$$) to the posterior whose unnormalized density is given by (9). If such a sample were available, the posterior expectation in (10) could be accurately approximated as follows:
\begin{aligned} \mathbb {E}_{\alpha |\mathcal {D}}\left[ f(\alpha )\right] \simeq S^{-1}\sum _{s=1}^{S}f(\alpha ^{(s)}). \end{aligned}
(12)
Therefore, a sampling approach would facilitate the tasks of Bayesian inference to a great degree. The Gibbs sampler relies on the idea that the sequence $$\left\{ \alpha ^{(s)},\,s=1,\ldots ,S\right\}$$ can be produced recursively by using the conditional posterior distribution of each element of $$\alpha$$. Suppose for example $$\alpha =[\alpha _{1},\alpha _{2}]'$$ where $$\alpha _{1},\alpha _{2}$$ are two scalar parameters for simplicity (although clearly they can be vectors). The Gibbs sampler is as follows:
• Draw $$\alpha _{1}^{(s)}$$ from its conditional distribution $$\alpha _{1}|\alpha _{2}^{(s-1)},\mathcal {D},$$

• Draw $$\alpha _{2}^{(s)}$$ from its conditional distribution $$\alpha _{2}|\alpha _{1}^{(s)},\mathcal {D},$$

and so on, if there are additional parameters. We repeat for $$s=1,\ldots ,S$$ and we assume $$\alpha _{2}^{(0)}$$ is available. Quite often, the conditional posterior distributions are univariate and amenable to random number generation by commonly available means.
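The two-step scheme above and the sample average in (12) can be sketched in Python. The target used here, a bivariate normal with zero means, unit variances and correlation `rho`, is an assumption chosen only because both conditional distributions are known in closed form; it is not the model of this note.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_draws, rng, alpha2_init=0.0):
    """Two-block Gibbs sampler for a standard bivariate normal (illustrative target).

    For this target, alpha_1 | alpha_2 ~ N(rho * alpha_2, 1 - rho^2), and
    symmetrically for alpha_2 | alpha_1.
    """
    sd = np.sqrt(1.0 - rho ** 2)
    draws = np.empty((n_draws, 2))
    a2 = alpha2_init  # alpha_2^{(0)}, assumed available
    for s in range(n_draws):
        a1 = rng.normal(rho * a2, sd)  # draw alpha_1^{(s)} | alpha_2^{(s-1)}
        a2 = rng.normal(rho * a1, sd)  # draw alpha_2^{(s)} | alpha_1^{(s)}
        draws[s] = (a1, a2)
    return draws

rng = np.random.default_rng(0)
draws = gibbs_bivariate_normal(rho=0.5, n_draws=20_000, rng=rng)
# Equation (12) with f(alpha) = alpha_1 * alpha_2: the average approximates
# the posterior expectation, which equals rho for this target.
approx = np.mean(draws[:, 0] * draws[:, 1])
```

In practice one would discard an initial stretch of draws as burn-in before averaging; the sketch omits this because the chain starts close to the centre of the target.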

### 3.2 MCMC in the UMM linear regression model

Suppose now there is an index $$J_{i}\in \{1,\ldots ,N\}$$ so that
\begin{aligned} -\Delta<y_{i}-x'_{i}\beta -a_{1}-J_{i}\Delta <0,\,i=1,\ldots ,n, \end{aligned}
(13)
whose interpretation is that $$u_{i}$$ is drawn from a uniform distribution in $$\left( a_{J_{i}},a_{J_{i}+1}\right)$$ with probability $$w_{J_{i}}$$.
In turn, the posterior (augmented) distribution of the model is:
\begin{aligned} p(\theta ,\{J_{i}\}_{i=1}^{n}|D)\propto \prod _{i=1}^{n}w_{J_{i}}\mathbb {I}\left( -\Delta<y_{i}-x'_{i}\beta -a_{1}-J_{i}\Delta <0\right) p(\theta ). \end{aligned}
(14)
Here, $$\theta$$ is the parameter vector which includes $$\beta$$ and some other elements as we explain below, and D denotes the entire data set $$\{y_{i},x_{i}\}_{i=1}^{n}$$. Therefore, we have:
\begin{aligned} p(\theta ,\{J_{i}\}_{i=1}^{n}|D)\propto w_{1}^{n_{1}}\ldots w_{N}^{n_{N}}\prod _{i=1}^{n}\mathbb {I}\left( -\Delta<y_{i}-x'_{i}\beta -a_{1}-J_{i}\Delta <0\right) p(\theta ), \end{aligned}
(15)
where $$\mathbb {I}(\cdot )$$ is the indicator function, $$n_{j}=\sum _{i=1}^{n}\mathbb {I}\left( -\Delta<y_{i}-x'_{i}\beta -a_{1}-j\Delta <0\right)$$, and $$\sum _{j=1}^{N}n_{j}=n$$. So, $$n_{j}$$ represents the number of observations in the jth sub-interval.
It turns out that, given $$\Delta$$ and N, the endpoint $$a_{1}$$ can be estimated from the data. Define the parameter vector as $$\theta =[\beta ',a_{1},\{J_{i}\}_{i=1}^{n},w']'$$. Given the $$J_{i}$$s we must have:
\begin{aligned} a_{1}+(J_{i}-1)\Delta<y_{i}-x'_{i}\beta <a_{1}+J_{i}\Delta ,\,i=1,\ldots ,n. \end{aligned}
(16)
Therefore, the conditional posterior of regression parameters, $$\beta$$, is:
\begin{aligned} \begin{array}{c} p(\beta |\{J_{i}\},w,a_{1})\propto \mathrm {const}.,\\ \mathrm {s.t}\;\Psi \equiv \left( \min _{t=1,\ldots ,n}y_{t}-a_{J_{t}}\right)>x'_{i}\beta >\left( \max _{t=1,\ldots ,n}y_{t}-a_{J_{t}}\right) \equiv \psi ,\,i=1,\ldots ,n. \end{array} \end{aligned}
(17)
From (5) along with the posterior in (15) we have
\begin{aligned} \max _{t=1,\ldots ,n}(y_{t}-x'_{t}\beta )-N\Delta<a_{1}<\min _{t=1,\ldots ,n}(y_{t}-x'_{t}\beta ), \end{aligned}
(18)
where the first inequality comes from the restriction: $$a_{N+1}=a_{1}+N\Delta >\max _{t=1,\ldots ,n}(y_{t}-x'_{t}\beta )$$. Moreover, we have:
\begin{aligned} a_{N+1}=a_{1}+N\Delta . \end{aligned}
(19)
Therefore, the right endpoint can be expressed in terms of $$N$$, $$a_{1}$$ and $$\Delta$$. If we wish to impose the constraint $$a_{1}=-a_{N+1}$$ then we have $$a_{N+1}=\tfrac{N\Delta }{2}$$. In this case, $$a_{1}=-\tfrac{N\Delta }{2}$$, and $$a_{1}$$ has to be treated as given. We follow this practice throughout to simplify the analysis, as treating $$a_{1}$$ as unknown adds a layer of technicalities, although it is straightforward to do so. In practice, the support of the error can be accurately estimated using the standard error of LS residuals.
Given $$\{w_{j}\}$$, N and $$\Delta$$, these equations determine the values of the endpoints. Suppose our prior is
\begin{aligned} p(\beta ,w)\propto w_{1}^{-1}\ldots w_{N}^{-1}p(\beta ). \end{aligned}
(20)
In turn, the conditional posterior of weights is:
\begin{aligned} p(w|\beta ,\{J_{i}\}_{i=1}^{n},D)\propto w_{1}^{n_{1}-1}\ldots w_{N}^{n_{N}-1}, \end{aligned}
(21)
subject to (3), which is a Dirichlet distribution.
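A draw from the conditional posterior (21) can be sketched as follows. The sketch uses the standard representation of a Dirichlet vector as normalized independent gamma variates. Setting the weight of an empty sub-interval ($$n_{j}=0$$) to zero is a convention assumed here, since (21) is not a proper density in that component; the function name `draw_weights` is hypothetical.

```python
import numpy as np

def draw_weights(J, n_bins, rng):
    """Draw w from (21) given the indices J_i (1-based).

    Returns the weights and the counts n_j of observations per sub-interval.
    """
    counts = np.bincount(np.asarray(J) - 1, minlength=n_bins)  # n_1, ..., n_N
    g = np.zeros(n_bins)
    occupied = counts > 0
    # Normalized Gamma(n_j, 1) variates are Dirichlet(n_1, ..., n_N) distributed.
    # ASSUMPTION: empty sub-intervals receive weight zero.
    g[occupied] = rng.gamma(shape=counts[occupied], scale=1.0)
    return g / g.sum(), counts

rng = np.random.default_rng(0)
w, n_j = draw_weights([1, 1, 2, 4, 4, 4], n_bins=4, rng=rng)
```

With many sub-intervals relative to the sample size (for example $$N=100$$ and $$n=25$$), most counts are zero, so how empty sub-intervals are handled is a practical design choice rather than a detail.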
From (17) we have that $$\beta$$ has to be drawn from the prior $$p(\beta )$$ subject to the restrictions that $$\Psi>x'_{i}\beta >\psi ,i=1,\ldots ,n$$, as in (17). A particularly convenient prior is the flat prior, viz. $$p(\beta )\propto \mathrm {const}.$$ All the above techniques can be implemented using straightforward MCMC techniques organized around the Gibbs sampler (Gelfand and Smith 1990) by drawing successively random numbers from the conditional posterior distributions in (17) and (21). In particular, for $$\beta$$ we proceed as follows. The restrictions that $$\Psi>x'_{i}\beta >\psi ,i=1,\ldots ,n$$, as in (17), can be written in matrix notation as:
\begin{aligned} \Psi 1_{n}>X\beta >\psi 1_{n}, \end{aligned}
(22)
where $$1_{n}$$ is an $$n\times 1$$ vector of ones, and X is the $$n\times k$$ matrix of regressors. In turn, the posterior conditional distribution of $$\beta$$ is $$p(\beta )\propto \mathrm {const}.$$ subject to these restrictions. Suppose $$X=[\mathbf {x}_{1},\ldots ,\mathbf {x}_{k}]$$ where $$\mathbf {x}_{j}$$ is the jth column of X, an $$n\times 1$$ vector. We can write (22) as follows:
\begin{aligned} \Psi 1_{n}>\beta _{1}\mathbf {x}_{1}+\cdots +\beta _{k}\mathbf {x}_{k}>\psi 1_{n}. \end{aligned}
(23)
Suppose we want to draw $$\beta _{1}|\beta _{2},\ldots ,\beta _{k},D$$. Then the conditional posterior distribution of $$\beta _{1}$$ is uniform in $$\mathfrak {R}$$ subject to the restrictions:
\begin{aligned} \varvec{\Psi }_{1}^{*}\equiv \Psi 1_{n}-\sum _{j\ne 1}\beta _{j}\mathbf {x}_{j}>\beta _{1}\mathbf {x}_{1}>\psi 1_{n}-\sum _{j\ne 1}\beta _{j}\mathbf {x}_{j}\equiv \varvec{\psi }_{1}^{*}. \end{aligned}
(24)
We can draw $$\beta _{1}$$ (conditional on all other $$\beta$$s) from a uniform distribution subject to the restrictions in (24) which are enforced via rejection sampling. Repeating for each $$j=1,\ldots ,k$$ we obtain draws from the posterior conditional distribution of $$\beta _{j}|\beta _{(-j)},D,\,j=1,\ldots ,k$$. Finally, to obtain draws from the conditional distribution of $$\left\{ J_{i}\right\} _{i=1}^{n}$$ we have:
\begin{aligned} p(J_{i}=j|\beta ,w,D)\propto \sum _{t=1}^{n}\mathbb {I}\left( a_{1}+(j-1)\Delta<y_{t}-x'_{t}\beta <a_{1}+j\Delta \right) ,\,j=1,\ldots ,N. \end{aligned}
(25)
In turn, we normalize $$\pi _{j}=\frac{p(J_{i}=j|\beta ,w,D)}{\sum _{j'=1}^{N}p(J_{i}=j'|\beta ,w,D)}$$, and we set $$J_{i}=j$$ with probability $$\pi _{j},\,j=1,\ldots ,N$$. The Gibbs sampler yields a sample $$\left\{ \beta ^{(s)},w^{(s)},J^{(s)}\right\} _{s=1}^{S}$$ which converges to the posterior distribution whose non-normalized density is given in (15), as S increases.
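The coordinate-wise draw of $$\beta _{j}$$ under element-wise restrictions of the form (24) can be sketched in Python. The text enforces the restrictions by rejection sampling; the sketch below instead computes the admissible interval directly by intersecting the $$n$$ element-wise bounds and then draws uniformly from it, which targets the same truncated uniform conditional. The function name `draw_beta_j` and the arguments `lower` and `upper` are hypothetical: they stand for the bounds on $$x'_{i}\beta$$, either the scalars $$\psi$$ and $$\Psi$$ of (22) or per-observation bounds such as those implied by (16).

```python
import numpy as np

def draw_beta_j(j, beta, X, lower, upper, rng):
    """Draw beta_j uniformly subject to lower < X @ beta < upper (element-wise),
    holding the other coefficients fixed."""
    x_j = X[:, j]
    rest = X @ beta - beta[j] * x_j  # contribution of all other coefficients
    lo_num = np.broadcast_to(lower, rest.shape) - rest
    hi_num = np.broadcast_to(upper, rest.shape) - rest
    pos, neg = x_j > 0, x_j < 0
    # Dividing by a negative x_ij flips the inequality; rows with x_ij == 0
    # place no restriction on beta_j and are skipped.
    lows = np.concatenate([lo_num[pos] / x_j[pos], hi_num[neg] / x_j[neg]])
    highs = np.concatenate([hi_num[pos] / x_j[pos], lo_num[neg] / x_j[neg]])
    lo = lows.max() if lows.size else -np.inf
    hi = highs.min() if highs.size else np.inf
    if not (np.isfinite(lo) and np.isfinite(hi) and lo < hi):
        raise ValueError("empty or unbounded admissible interval for beta_j")
    return rng.uniform(lo, hi)
```

Cycling `j` over all coefficients and updating `beta[j]` after each draw gives one pass of the $$\beta$$ block of the sampler. Computing the interval directly avoids the low acceptance rates that rejection sampling can suffer when the admissible interval is narrow.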

## 4 Monte Carlo evidence

We consider four cases for the distribution of the error term as in Fig. 1.

For each case we assume that the sample size is $$n=25$$, 50, 100, 500, 1000 and 10,000. We have two correlated regressors: the first one, $$x_{i1}\sim N(0,1)$$ and the second is $$x_{i2}=x_{i1}+0.1\varepsilon _{i}$$, where $$\varepsilon _{i}\sim N(0,1),\,i=1,\ldots ,n$$. The regression model is: $$y_{i}=\beta _{0}+\beta _{1}x_{i1}+\beta _{2}x_{i2}+u_{i},$$ where $$u_{i}$$ is generated according to cases (a) through (d). The true parameter values are: $$\beta _{0}=10,\,\beta _{1}=1,\,\beta _{2}=-1$$.
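The data generating process just described can be sketched in Python. Because the four error distributions are defined only in Fig. 1, the sketch assumes a hypothetical bimodal error (an equal mixture of two normals centred at -2 and 2) purely for illustration; the function names are also assumptions.

```python
import numpy as np

def simulate_data(n, rng, beta=(10.0, 1.0, -1.0)):
    """Simulate (y, X) with two correlated regressors and an intercept."""
    x1 = rng.normal(size=n)
    x2 = x1 + 0.1 * rng.normal(size=n)  # x_{i2} = x_{i1} + 0.1 * eps_i
    X = np.column_stack([np.ones(n), x1, x2])
    # ASSUMPTION: illustrative zero-mean bimodal error, not one of cases (a)-(d).
    centre = rng.choice([-2.0, 2.0], size=n)
    u = centre + 0.5 * rng.normal(size=n)
    y = X @ np.asarray(beta) + u
    return y, X

def ls_fit(y, X):
    """Least squares coefficients."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def ls_sampling_variance(n, n_rep, rng):
    """Sampling variance of the LS coefficients across n_rep simulated data sets."""
    est = np.array([ls_fit(*simulate_data(n, rng)) for _ in range(n_rep)])
    return est.var(axis=0, ddof=1)
```

Since the two regressors are nearly collinear, the LS slope estimates have a large sampling variance, which makes differences between estimators easier to see.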

Our interest focuses on comparing with least squares (LS) regression and the potential improvement in efficiency, which is defined as $$\text {Eff}=\sqrt{\mathrm {var}(b_{j,LS})/\mathrm {var}(b_{j})}$$, where $$j=1,2$$, $$b_{j}$$ is the Bayes posterior mean estimate of $$\beta _{j}$$ from the UMM model, $$b_{j,LS}$$ is the LS estimator of $$\beta _{j}$$, and “var” denotes sampling variance. We use 10,000 Monte Carlo simulations to examine the efficiency of LS versus UMM-regression-based techniques. MCMC is implemented using 15,000 passes, the first 5000 of which are discarded during the “burn-in” phase. Initial conditions were obtained from LS and, in all cases, we have $$N=100$$ points in the support of the error term.
Table 1 Efficiency of regression-UMM versus LS

| | Case (a) | Case (b) | Case (c) | Case (d) |
|---|---|---|---|---|
| $$n=25$$ | 1.712 | 1.912 | 1.981 | 2.231 |
| $$n=50$$ | 1.515 | 1.832 | 1.872 | 1.945 |
| $$n=500$$ | 1.350 | 1.644 | 1.750 | 1.717 |
| $$n=1{,}000$$ | 1.210 | 1.355 | 1.515 | 1.422 |
| $$n=10{,}000$$ | 1.07 | 1.101 | 1.113 | 1.130 |

The results are based on 10,000 Monte Carlo replications and refer to $$b_{1,LS}$$ and $$b_{1}$$. The efficiency of $$b_{2,LS}$$ and $$b_{2}$$ was quite similar to the results reported above.

From the results in Table 1, regression-UMM-based techniques are considerably more efficient than LS, particularly for “small” samples (i.e. $$n\le 1000$$), although even at $$n=$$10,000 the improvement in efficiency is quite evident. With $$n=$$10,000 the efficiency is close to unity, but UMM remains the more efficient estimator (notice that LS is best linear unbiased, but the UMM-regression estimator is not linear, so efficiency gains are possible even in quite large samples). Moreover, the regression-UMM-based estimator is, practically, unbiased as its mean squared error and variance are very similar (results available on request). Finally, efficiency gains are largest in cases (b) and (c) where the mixing components are far from normality (viz. Student-t with one degree of freedom and lognormal components).

Another interesting case is to consider $$u_{i}\sim N(0,\sigma ^{2}),\,i=1,\ldots ,n$$, where $$\sigma ^{2}$$ is estimated using the LS estimator $$s^{2}=\frac{\sum _{i=1}^{n}(y_{i}-x'_{i}b_{LS})^{2}}{n-k}$$, and $$b_{LS}=(X'X)^{-1}X'y$$. In turn, we know that the support of the error terms is, approximately, $$\left( -3s,3s\right)$$ (perhaps too “generously”). Even a plot of LS residuals can inform us, at least in large samples, about the support as well as the form of the distribution of errors.
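The support estimate $$\left( -3s,3s\right)$$ can be turned into the grid $$a_{1},\ldots ,a_{N+1}$$ used by the UMM. The following Python sketch assumes a symmetric support with equal-width sub-intervals; the function name `support_grid` and the `width` argument are hypothetical.

```python
import numpy as np

def support_grid(y, X, n_bins, width=3.0):
    """Build equally spaced endpoints on (-width*s, width*s) from LS residuals."""
    b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)  # b_LS = (X'X)^{-1} X'y
    resid = y - X @ b_ls
    n, k = X.shape
    s = np.sqrt(resid @ resid / (n - k))  # square root of s^2
    edges = np.linspace(-width * s, width * s, n_bins + 1)  # a_1, ..., a_{N+1}
    delta = edges[1] - edges[0]  # common width, equal to 2*width*s / N
    return edges, delta, s
```

The grid satisfies $$a_{1}=-a_{N+1}$$ by construction. For heavy-tailed errors a wider multiple of $$s$$, or bounds based on the extreme LS residuals, may be needed so that every residual falls inside the support.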

Using the same data generating process as in cases (a), (b), and (c), we examine the bias and efficiency of the LS estimator of $$\beta _{1}$$ and of UMM-regression with $$n=100$$ but different numbers of points (N) in the support of UMM-regression. The results are reported in Table 2.
Table 2 Bias and efficiency of LS estimator of $$\beta _{1}$$ and UMM-regression

| | $$N=10$$ | $$N=50$$ | $$N=100$$ |
|---|---|---|---|
| Bias LS | 0.014 | | |
| Bias UMM | 0.012 | 0.011 | 0.011 |
| s.e. LS | 0.011 | | |
| s.e. UMM | 0.009 | 0.007 | 0.007 |

s.e. standard error

For example, the mean squared error (MSE) of LS is $$0.011^{2}+0.014^{2}=0.000317$$ while the MSE of the UMM-regression estimator with $$N=50$$ is $$0.007^{2}+0.011^{2}=0.00017$$, so the ratio of MSEs is almost 1.86. The MSE is lower compared to LS even if we use only $$N=10$$ points in the support of the error.
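The MSE arithmetic can be reproduced from the entries of Table 2 with a short Python sketch, using the decomposition MSE = bias² + s.e.². The helper name `mse` is an assumption of this illustration.

```python
def mse(bias, se):
    """Mean squared error as squared bias plus squared standard error."""
    return bias ** 2 + se ** 2

mse_ls = mse(bias=0.014, se=0.011)      # LS
mse_umm_10 = mse(bias=0.012, se=0.009)  # UMM-regression, N = 10
mse_umm_50 = mse(bias=0.011, se=0.007)  # UMM-regression, N = 50
ratio_50 = mse_ls / mse_umm_50          # about 1.86
```

The same calculation for $$N=10$$ gives an MSE of 0.000225, which is still below the LS value of 0.000317.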

## References

1. Gao, J., An, Z., & Bai, X. (2019). A new representation method for probability distributions of multimodal and irregular data based on uniform mixture model. Annals of Operations Research.
2. Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 398–409.