Cluster’s Number Free Bayes Prediction of General Framework on Mixture of Regression Models

Prediction based on a single linear regression model is one of the most common approaches in various fields of study. It enables us to understand the structure of data, but it may not be suitable for data whose structure is complex. To express the structure of the data more accurately, we assume that the data can be divided into clusters, each with its own linear regression model. In this case, we can assume that each explanatory variable has its own role: explaining the assignment to the clusters, explaining the regression to the target variable, or both. Introducing a probabilistic structure for the data generating process, we derive the optimal prediction under the Bayes criterion and an algorithm that calculates it sub-optimally with a variational inference method. One advantage of our algorithm is that it automatically weights the probability of each number of clusters in the course of the algorithm, thereby resolving the concern about selecting the number of clusters. Experiments are performed on both synthetic and real data to demonstrate these advantages and to uncover some behaviors and tendencies of the algorithm.


Introduction
In the field of machine learning, prediction is a common problem: given pairs of explanatory and target variables {(x_i, y_i)}_{i=1}^n and a new explanatory variable x_{n+1}, we predict the corresponding target variable y_{n+1}. Data analysis using a linear regression model has been one of the most common and fundamental ways to detect the structure of the data. However, it may not be suitable to adopt a single linear regression model when the structure of the data is too complex.
To attain high prediction performance with interpretability, previous works have proposed several extensions of the linear regression model (e.g., [1][2][3][4][5][6][7][8][9][10][11][12][13][14]). In stratified regression [1], the data are stratified based on design variables, and at each level a linear regression model is defined using explanatory variables and response variables. Another extension is the Mixture of Experts model (e.g., [2,3]) or the Hierarchical Mixture of Experts model (e.g., [4][5][6][7][8]), which consists of several experts and a gating function. Each expert is a linear regression model that outputs the response variable. The gating function, which receives the same input as the experts, weights each expert's output so that a single output is finally obtained. The Piecewise Linear Regression model (e.g., [9,10]) is also an extension of the linear regression model: it divides the input space and has a different linear regression model in each subspace. Generalized Linear Mixed Models [11] consider a "random effect" in the model, which leads the data to be divided into groups with a different linear regression model in each group. Although each work uses a different algorithm, the structure they assume for the data is the same: the data can be grouped into clusters, with a linear regression model in each cluster.
Here, we provide a point of view that organizes these previous works. By focusing on the probabilistic model that expresses the data generation, we can treat these previous works in a unified manner. Each data point is represented as (z, x_1, …, x_d, y), the realized value of the random variables Z, X_1, …, X_d and Y. Here, Z ∈ {1, …, k} is the assignment variable, which decides the assignment of the data point to one of k clusters. X_1, …, X_d are the explanatory variables, and Y represents the target variable. Let U denote the set of explanatory variables that explain the assignment to the clusters, and V the set of explanatory variables used in the regression on Y. From this point of view, we can consider broad patterns of data generation models by adjusting these factors, and many of the previous works can be organized accordingly. For example, stratified regression is the case with U = ∅, continuous Y, and observable Z. The Mixture of Linear Experts and the Piecewise Linear Regression model are the cases with U ∩ V = {X_1, …, X_d}, continuous Y, and unobservable Z.
In this paper, we assume the data generation model with continuous target variable Y, and unobservable assignment variable Z. Here, we do not restrict the number of elements in U ∩ V , therefore we can treat broad patterns of data generation model by our approach. Assuming this structure to the data, we deliver the optimal prediction under Bayes criterion [15]. Moreover, we derive an approximative algorithm that calculates the prediction by adopting variational inference method.
The advantages of our algorithm are as follows:
1. It is not necessary to input the number of clusters beforehand, since the proposed algorithm automatically weights the probability of each number of clusters in the process of prediction.
2. Since the prediction is based on an explicit assumption about the probabilistic data generation structure, it is easier to understand the data structure.
3. We can easily consider broader extensions of our algorithm. For example, by using a logistic regression model instead of a linear regression model, we can solve classification problems where Y is discrete.
Regarding advantage 1, recent works (e.g., [5][6][7]) often fix the number of regression models, and hence the number of clusters. In contrast, our algorithm does not choose a single number of clusters, but considers several candidate numbers of clusters and weights their prediction results. In Sect. 2, we define the probabilistic structure that expresses the data generation. In Sects. 3 and 4, we derive the optimal prediction under the Bayes criterion and the approximate algorithm, respectively. In Sect. 5, we describe experiments conducted to confirm the behavior of the proposed algorithm on both synthetic and real data. In the first experiment, we synthesized data and parameters from the priors, so that we can check the average behavior of the algorithm. In the second experiment, we generated data with specific tendencies, so that we can confirm more detailed behavior of the algorithm. In the last experiment, we used real data to confirm that our algorithm also works well in practice.

Probabilistic Model of Data Generation
We assume that the given data have the following characteristics:
• The data are pairs of explanatory variables and response variables.
• All the variables are continuous.
• It is not appropriate to fit a single linear regression model to the data.
• If we divide the data into clusters, it is appropriate to fit a linear regression model in each cluster.
Let each of the given data points be represented as (z, x, y), the realized value of the three random variables (Z, X, Y). Each random variable has the following features.
From now on, we express the p.d.f.s of the Gaussian, Dirichlet, and Wishart distributions as N(⋅), Dir(⋅), and W(⋅), respectively.

Assumption 2.1
Let K be an unobservable random variable which represents the number of clusters, and k be a realized value of K. Also, let k max be a fixed value which represents the maximum number of clusters we assume.
Let Z_k ∈ {0, 1}^k be a latent random variable called the assignment variable, given the number of clusters k. Here, the l-th component Z_{l,k} equals 1 if and only if the data point is assigned to cluster l. The distribution of Z_k is a categorical distribution whose mixing ratio is π_k, and the cluster explanatory variables follow a Gaussian mixture whose components are parameterized by M_k := {μ_{1,k}, …, μ_{k,k}}, a set of (p + q)-dimensional mean vectors, and Λ_k := {Λ_{1,k}, …, Λ_{k,k}}, a set of (p + q) × (p + q)-dimensional precision matrices.
Equation (5) suggests that, among the explanatory variables X_1, …, X_d, the first (p + q) variables follow a mixture of k Gaussian distributions, while the remaining r variables follow arbitrary distributions. To give a real-world example, consider precipitation prediction over several observation areas. The first p + q explanatory variables could be elevation, latitude, and distance from the sea, which might explain the cluster structure of the observation areas. Hereafter, we use calligraphic fonts to express the set to which each random variable belongs.
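As an illustration of this generative structure, the following is a minimal sketch that samples data from a q = 0 instance of the model: the first p variables follow a k-component Gaussian mixture (identity precisions for simplicity), the remaining r variables are drawn i.i.d. standard normal as a stand-in for the "arbitrary distributions", and y is a cluster-specific linear regression. All function and variable names here are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_data(n, k=3, p=2, q=0, r=2, sigma2=0.5):
    """Sample n points from a q = 0 instance of the assumed model."""
    d_u, d_v = p + q, q + r
    pi = rng.dirichlet(np.ones(k))            # mixing ratio of the clusters
    mu = rng.normal(0, 3, size=(k, d_u))      # cluster mean vectors
    w = rng.normal(0, 1, size=(k, d_v))       # per-cluster regression coefficients
    z = rng.choice(k, size=n, p=pi)           # latent assignment variable
    u = mu[z] + rng.normal(size=(n, d_u))     # cluster explanatory variables
    v = rng.normal(size=(n, d_v))             # regression explanatory variables
    y = np.einsum('ij,ij->i', w[z], v) + rng.normal(0, np.sqrt(sigma2), n)
    return z, u, v, y

z, u, v, y = generate_data(200)
```

With q = 0 the cluster and regression explanatory variables are disjoint, matching the setting used later in the experiments.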

Optimal Prediction Under Bayes Criterion
Assuming the given data to be represented by the probabilistic model introduced in Sect. 2, we propose the optimal prediction in the sense of the Bayes criterion when considering squared-error loss.
Since the goal is to predict the new response variable y n+1 from the given data D , the decision rule can be defined as follows.
Using the squared-error loss between the true value y_{n+1} and the decision δ(D), we define the loss function L as L(y_{n+1}, δ(D)) := (y_{n+1} − δ(D))². Given the loss function, the risk function of the decision rule δ(⋅) is defined as follows.
Moreover, the Bayes risk of the decision rule δ(D), with respect to the prior distributions defined in Assumptions 2.2 and 2.5, is defined as follows. Here, we obtain the following result; the proof of Proposition 3.1 is in Appendix A.

Proposition 3.1 The decision rule δ*(D) that minimizes the Bayes risk (which we call the "optimal prediction under the Bayes criterion") is
Note here that, given the same data set (x^n, y^n), different new data points x_{n+1} and x′_{n+1} induce different posterior distributions of the parameters and the latent variables such as k and z_{n+1}. The effect of the new data point x_{n+1} becomes smaller when the data set size n is large.
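For intuition, the standard squared-error argument behind a result of this kind can be sketched as follows (notation follows Sect. 3; this is our paraphrase of the usual bias-variance decomposition, not the paper's proof in Appendix A):

```latex
\begin{aligned}
\mathrm{BR}(\delta)
  &= \mathbb{E}_{D}\!\left[\,\mathbb{E}\!\left[(Y_{n+1}-\delta(D))^{2}\mid D\right]\right],\\
\mathbb{E}\!\left[(Y_{n+1}-\delta)^{2}\mid D\right]
  &= \mathrm{Var}\!\left(Y_{n+1}\mid D\right)
     + \left(\mathbb{E}\!\left[Y_{n+1}\mid D\right]-\delta\right)^{2},\\
\therefore\quad \delta^{*}(D)
  &= \mathbb{E}\!\left[Y_{n+1}\mid D\right]
   = \sum_{k=1}^{k_{\max}} p(k\mid D)\,\mathbb{E}\!\left[Y_{n+1}\mid D,k\right].
\end{aligned}
```

The last line already shows the structure exploited later: the optimal prediction averages cluster-number-specific predictions with posterior weights p(k | D).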

Approximation Using Variational Bayes Algorithm
Although we derived the optimal prediction under the Bayes criterion in Proposition 3.1, it is hard to obtain a closed-form analytical solution of the posterior distribution p(z_{n+1,k}, θ_k, k | D) that appears in (14). Therefore, we adopt the variational Bayes algorithm to approximate the posterior distribution, and thereby obtain an approximation of the optimal prediction under the Bayes criterion.

Priors of Parameters and Cluster Number
First, we introduce priors over each parameter given the number of clusters k, namely π_k, M_k, Λ_k, W_k, and over the cluster number K. We mostly choose a conjugate distribution for each prior, which simplifies the analysis considerably.
where α_{0,k} := (α_{0,k}, …, α_{0,k}) ∈ ℝ^k is a real-valued constant vector, and C(α_{0,k}) is the normalization constant of the Dirichlet distribution. Here, all the hyper-parameters are assumed to be known constants.
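Under priors of this conjugate form, one parameter set can be sampled with plain NumPy as below. This is a sketch under our own naming (alpha0, beta0, nu0 are not the paper's symbols); the Wishart draw uses the fact that AᵀA is Wishart-distributed with identity scale when the rows of A are i.i.d. standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_priors(k, dim, alpha0=1.0, beta0=0.5, nu0=None):
    """Draw one parameter set from conjugate-style priors (illustrative)."""
    nu0 = dim + 2 if nu0 is None else nu0           # Wishart degrees of freedom
    pi = rng.dirichlet(np.full(k, alpha0))          # mixing ratio ~ Dir(alpha0)
    # Wishart(I, nu0) draw as A^T A, rows of A i.i.d. N(0, I)
    Lambda = [(lambda A: A.T @ A)(rng.normal(size=(nu0, dim))) for _ in range(k)]
    # cluster means ~ N(0, (beta0 * Lambda_l)^{-1}): Gaussian-Wishart coupling
    mu = [rng.multivariate_normal(np.zeros(dim), np.linalg.inv(beta0 * L))
          for L in Lambda]
    return pi, mu, Lambda
```

The coupling of the mean's covariance to the drawn precision is the usual Gaussian-Wishart construction; the paper's exact hyper-parameter values appear in the experiment settings.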

Analysis on Approximated Posterior Distribution
Now, we consider an approximated posterior distribution. The approximated posterior distribution is a distribution over the latent variables Z_{n+1,k}, the parameters θ_k, and the cluster number K that belongs to a family of factorized (mean-field) distributions. The purpose of the variational Bayes algorithm is to obtain the variational distribution q* within this family that minimizes the Kullback-Leibler (KL) divergence from the posterior distribution. Here, we obtain the following two propositions.

Proposition 4.1
Then, the optimal factors satisfy the stationary conditions (22)-(26), where E_{q*(¬⋆)}[⋅] denotes the expectation with respect to the distribution q* over all variables except ⋆.

Proposition 4.2 Regarding the variational distribution of K, it holds that
Equations (27) and (28) give the distribution that minimizes the KL divergence from the posterior distribution subject to the factorization constraint. However, they still do not give an explicit solution, since the expectations in (22)-(26) depend on the other factors of q*. Therefore, we first initialize all the factors and then repeatedly update each factor in turn using (22)-(26), so that we can approximately calculate (27).
In the next subsections, we introduce the updating formula for each factor q*(⋅_k). Hereafter, q^{(t)}(⋆_k) denotes the variational distribution of the latent variable or parameter ⋆ given cluster number k at iteration t, and E_{⋆_k}[⋅] denotes the expectation with respect to q^{(t)}(⋆_k). Also, for simplicity, we define U := (X_1, …, X_{p+q}) ∈ ℝ^{p+q} (the cluster explanatory variables) and V := (X_{p+1}, …, X_d) ∈ ℝ^{q+r} (the regression explanatory variables), write u_i and v_i for the realized values of U and V respectively, and define ṽ_i accordingly.
1. Updating q(z_{n,k}): from (22), the updating formula of q(z_{n,k}) is given by the following.

Journal of Statistical Theory and Applications (2021) 20:425-449

From (24), the updating formula of the corresponding factor is given by the following.
6. Testing convergence: we use the variational lower bound to test convergence and decide when to stop the iteration, since the variational Bayes algorithm guarantees that the variational lower bound increases in each iteration [16].
Using the variational distribution of each parameter and latent variable after t iterations, define q^{(t)}(z_{n+1,k}, θ_k) as the product of those factors. The variational lower bound at iteration t is then defined as in (43). Note that all the expectations in (43) are taken with respect to q^{(t)}(z_{n+1,k}, θ_k). Each term in (43) can be calculated as in Appendix B. By calculating the variational lower bound at each iteration, we judge the convergence of the algorithm.
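The overall iteration and its stopping rule can be sketched as a generic coordinate-ascent loop; `update_factors` and `lower_bound` are placeholders of our own for the model-specific updates (22)-(26) and the bound (43):

```python
def run_variational_bayes(update_factors, lower_bound, state, tol=1e-4, max_iter=30):
    """Generic coordinate-ascent skeleton: stop when the variational lower
    bound improves by less than tol, or after max_iter sweeps."""
    history = [lower_bound(state)]
    for _ in range(max_iter):
        state = update_factors(state)        # one sweep over all factors of q
        history.append(lower_bound(state))
        if history[-1] - history[-2] < tol:  # convergence test on the bound
            break
    return state, history

# toy illustration: the bound -(s-2)^2 is maximized at s = 2,
# and each "sweep" moves the state halfway toward the optimum
state, hist = run_variational_bayes(lambda s: s + 0.5 * (2 - s),
                                    lambda s: -(s - 2) ** 2, 0.0)
```

Because each sweep cannot decrease the bound (by [16]), the recorded history is non-decreasing, which is what makes the threshold test a valid stopping rule.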

Adopting Approximated Posterior Distribution
Let T be the number of iterations after which the variational lower bound satisfies the convergence condition. Then, we can derive the variational distribution of the cluster number k (k = 1, …, k_max) as follows. By substituting the approximated posterior distribution q^{(T)}(z_{n+1,k}, θ_k, k) := q^{(T)}(k) q^{(T)}(z_{n+1,k}, θ_k) for the posterior distribution p(z_{n+1,k}, θ_k, k | D), we obtain (14) in approximated form (details are in Appendix C). Using this approximated predictive distribution, we can finally derive the next proposition.

Proposition 4.3 Adopting the variational distribution derived by variational Bayes algorithm, the optimal prediction under Bayes criterion (13) can be approximated as
The proof of Proposition 4.3 is in Appendix D.
Note here that the term in brackets in (46) represents the approximation of the optimal prediction under the Bayes criterion when k is fixed as the cluster number. The prediction (46) is the weighted sum of the predictions at each cluster number, where the weight is q^{(T)}(k) for each k (k = 1, …, k_max).
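The weighting in (46) amounts to a simple model average over candidate cluster numbers; a minimal sketch:

```python
import numpy as np

def averaged_prediction(q_k, preds_per_k):
    """Weighted sum over candidate cluster numbers k = 1..k_max,
    weighted by the (approximate) posterior q(k), as in (46)."""
    q_k = np.asarray(q_k, dtype=float)
    q_k = q_k / q_k.sum()                 # guard against unnormalized weights
    return float(q_k @ np.asarray(preds_per_k, dtype=float))

# e.g. three candidate cluster numbers, with most weight on k = 2
y_hat = averaged_prediction([0.1, 0.8, 0.1], [1.0, 2.0, 3.0])  # -> 2.0
```

Here `preds_per_k` stands for the bracketed per-k predictions in (46); no single k is selected, which is the point of advantage 1 in the introduction.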

Experiments
To demonstrate the behavior of our algorithm, we used various types of data. First, we generated synthetic data from the model assumed in Sect. 2 and applied our algorithm. In addition to generating parameters from the priors defined in Sect. 4.1, we also fixed the parameters to generate data with specific tendencies. Finally, we used real data to confirm that our algorithm works well in practice, too.
For all the experiments, we set q = 0, which means there is no intersection between the cluster explanatory variables and the regression explanatory variables. This makes it easier to understand the behavior of each role of the explanatory variables.
All the code used in the experiments is available on GitHub [17].

Setting
In Experiment 1, we generated both the parameters and the data from the model described in the previous sections, and checked the average squared error incurred by the prediction (46). The procedure of Experiment 1 is as follows. First, we generated the cluster number k from the prior (15), and generated the hyper-parameters as follows:
• α_{0,k}: generated from the uniform distribution on [0, 1),
• β_{0,k}: generated from the uniform distribution on [0.001, 1.001),
• A_{0,k} = I (identity matrix),
• ν_{0,k}: generated from the uniform distribution on [2, 3),
• w_{0,k} = 0, Λ_{w0,k} = I, and σ²_{l,k} = 0.5 for all l = 1, …, k.
Next, we generated the parameters from the priors (16)-(18). Consequently, we generated N sets of training data and M sets of test data from the model. Changing the number of training data n over 10, 20, …, N, we calculated the prediction and the prediction error. We generated the initial values of the hyper-parameters at each cluster number k in the same way as the genuine hyper-parameters. We repeated the data generation P times per parameter generation, the parameter generation Q times per cluster number generation, and the cluster number generation R times. The constants were set as follows: p = 3, q = 0, r = 5, k_max = 5, N = 100, M = 100, P = 100, Q = 10, and R = 10. The iteration was stopped either when L^{(t+1)}_k − L^{(t)}_k < 0.0001 or when the number of iterations reached 30.

Result and Discussion
The results of Experiment 1 are shown in Figs. 2 and 3. Figure 2 shows the prediction error incurred by both our algorithm and the random forest algorithm [18]. Since our algorithm approximates the Bayes-optimal prediction, it is expected to have a smaller prediction error than other prediction algorithms. As shown in Fig. 2, the proposed algorithm obtained the expected result: its prediction error is lower than that of random forest regardless of the number of data n. Therefore, despite the approximation by the variational Bayes algorithm, the prediction by the proposed algorithm still exhibits the characteristics of the optimal prediction under the Bayes criterion.
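For reference, a random-forest baseline of this kind can be set up along the following lines with scikit-learn; the paper does not state the forest's hyper-parameters, so the toy data and default settings below are our assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Toy stand-in for one (train, test) split of the synthetic data; the real
# experiment averages squared error over many generated datasets.
X = rng.normal(size=(200, 8))                  # p + r = 3 + 5 explanatory variables
y = X[:, 3:] @ rng.normal(size=5) + rng.normal(0, 0.5, size=200)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
mse = mean_squared_error(y_te, rf.predict(X_te))
```

The comparison in Fig. 2 plots such a mean squared error against the number of training data n for both methods.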
In Fig. 3, the lines k = 1, 2, 3, 4, 5 indicate the prediction error at each cluster number, which corresponds to the term in brackets in (46). These lines can be regarded as the prediction error when the cluster number is fixed at k. The graph shows that the prediction error of our algorithm is smaller than all of these. Therefore, rather than assuming a concrete cluster number, weighting over cluster numbers yields a better prediction.

Setting
In Experiment 2, the cluster number and the parameters were fixed, so that we could confirm the behavior of our algorithm on data with specific tendencies. We set k = 3, p = 3, q = 0, r = 5, and compared the following patterns. For the other parameters, we fixed Λ_{l,k} = 10I and σ²_{l,k} = 0.5 (l = 1, 2, 3). The constants and initial values for the variational Bayes algorithm were set as in Experiment 1.
In Experiment 2, we examined the behavior of the following values: (i) the prediction error incurred by the prediction (46), and (ii) the approximated posterior distribution of the cluster number K.
The procedure of Experiment 2 was the same as that of Experiment 1, except that the cluster number k was fixed.

Result and Discussion
The results of Experiment 2 are shown in Figs. 4 and 5. Since condition (a) can be regarded as the benchmark, we discuss conditions (b)-(d).
For condition (b), the prediction error is larger than that of condition (a), as shown in Fig. 4. In contrast, as shown in Fig. 5, the approximated posterior distribution of the cluster number K is similar to that of condition (a). Therefore, the larger prediction error can be attributed to the complexity of clustering, which increases the prediction error at each cluster number k.
For condition (c), as shown in Fig. 4, the prediction error is smaller than in condition (a) when the number of data n is small. In addition, Fig. 5 shows that the approximated posterior probability of the cluster number k = 2 takes a non-negligible value when n is small. Hence, when the data are few, our algorithm indicates that it is appropriate to weight the prediction toward the cluster number k = 2 for the optimal prediction under the Bayes criterion. Conversely, when there are enough data n to distinguish clusters 1 and 2, our algorithm weights the prediction toward the actual cluster number k = 3. This confirms that our algorithm automatically adjusts its weighting over the cluster number k.
For condition (d), the behaviors observed in condition (c) become more obvious. Figure 5 shows that the approximated posterior probability of k = 2 is the highest at any number of data n. Since there are essentially two clusters in condition (d), the algorithm adopts k = 2 even when there is a large number of data.
To conclude, Experiment 2 confirmed the following features of our algorithm. First, the complexity of clustering is one of the causes of larger prediction error. Second, as long as we provide an appropriate k_max, the proposed algorithm automatically tunes the weight of the cluster number k for the optimal prediction under the Bayes criterion.

Fig. 5 The approximated posterior distribution of the cluster number K. The x-axis represents the number of samples n, and the y-axis represents the approximated posterior probability of k = 1, 2, …, 5.

Setting
In Experiment 3, we used a real dataset taken from the scikit-learn datasets library [19], which includes explanatory variables such as age, body mass index (BMI), average blood pressure, and six blood serum measurements obtained for each patient, as well as the target variable, a quantitative measure of disease progression one year after baseline.
In this experiment, we set the explanatory variables in several patterns and compared the prediction error of each generative model. Setting the quantitative measure of disease progression as the target variable y ∈ ℝ, we fixed the six blood serum measurements as the regression explanatory variables. In one case we used average blood pressure as the cluster explanatory variable, and in the other case we used BMI.
The procedure of Experiment 3 is as follows. First, we randomly sampled (N + M) data points from the whole dataset and divided them into N training data and M test data. Changing the number of training data n over 10, 20, …, N, we calculated the prediction and the prediction error. We repeated the random sampling of the data P times. Here, we set N = 50, M = 50, and P = 100. The constants and initial values for the variational Bayes algorithm were set as in Experiment 1.
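The sampling procedure can be sketched as follows. Since the described variables (age, BMI, average blood pressure, six serum measurements, disease progression after one year) match scikit-learn's diabetes dataset, we assume that dataset here; the paper itself only cites the library [19].

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Assumption: the dataset described is scikit-learn's diabetes dataset.
df = load_diabetes(as_frame=True).frame
N, M = 50, 50

rng = np.random.default_rng(0)
idx = rng.choice(len(df), size=N + M, replace=False)   # one random (N + M) sample
train, test = df.iloc[idx[:N]], df.iloc[idx[N:]]

serum = ["s1", "s2", "s3", "s4", "s5", "s6"]   # regression explanatory variables
cluster_var = "bmi"                            # or "bp" for the other condition
u_train = train[[cluster_var]].to_numpy()      # cluster explanatory variable
v_train = train[serum].to_numpy()
y_train = train["target"].to_numpy()
```

The experiment repeats this random split P = 100 times and averages the prediction error over the repetitions.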

Result and Discussion
The result of Experiment 3 is shown in Fig. 6.
As can be seen from Fig. 6, using BMI as the cluster explanatory variable lessens the prediction error compared to using blood pressure. Therefore, we can infer that the data structure that uses BMI as the cluster explanatory variable fits the data better.

Fig. 6 The prediction error of our algorithm. The blue and orange lines show the error when using average blood pressure and BMI, respectively, as the cluster explanatory variable. The error bars represent the standard error of the means.