# Cluster’s Number Free Bayes Prediction of General Framework on Mixture of Regression Models

## Abstract

Prediction based on a single linear regression model is one of the most common approaches in many fields of study. It helps us understand the structure of data, but may be unsuitable when that structure is complex. To express the structure of data more accurately, we assume that the data can be divided into clusters, each with its own linear regression model. In this case, each explanatory variable can play its own role: explaining the assignment to the clusters, explaining the regression to the target variable, or both. Introducing a probabilistic structure for the data generating process, we derive the optimal prediction under the Bayes criterion and an algorithm that computes it sub-optimally via the variational inference method. One advantage of our algorithm is that it automatically weights the probability of each number of clusters during the procedure, thereby resolving the concern about selecting the number of clusters. Experiments on both synthetic and real data demonstrate these advantages and reveal some behaviors and tendencies of the algorithm.

## Introduction

In the field of machine learning, prediction is a common problem: given pairs of explanatory and target variables $$\{(x_i, y_i)\}_{i=1}^n$$ and a new explanatory variable $$x_{n+1}$$, predict the corresponding target variable $$y_{n+1}$$. Data analysis using a linear regression model has been one of the most common and fundamental ways to detect the structure of data. However, a single linear regression model may be unsuitable when the structure of the data is too complex.

To attain high prediction performance with interpretability, previous works have proposed several extensions of the linear regression model (e.g., [1,2,3,4,5,6,7,8,9,10,11,12,13,14]). In stratified regression [1], the data are stratified based on design variables, and at each level, a linear regression model is defined using explanatory variables and response variables. Another extension is the Mixture of Experts model (e.g., [2, 3]) or the Hierarchical Mixture of Experts model (e.g., [4,5,6,7,8]), which consists of several experts and a gating function. Each expert is a linear regression model that outputs the response variable. The gating function, which receives the same input as the experts, weights the output of each expert, so that a single output is finally obtained. The Piecewise Linear Regression model (e.g., [9, 10]) is a further extension: it divides the input space and fits a different linear regression model in each subspace. Generalized Linear Mixed Models [11] include a “random effect” in the model, which divides the data into groups with a different linear regression model in each group. Although these works use different algorithms, the structure they assume for the data is the same: the data can be grouped into clusters, each with its own linear regression model.

Here, we provide a point of view that organizes these previous works. By focusing on the probabilistic model which expresses the data generation, we can treat these previous works in a unified manner. Each data point is represented as $$(z, x_1,\ldots ,x_d,y)$$, the realized value of random variables $$Z, X_1,\ldots , X_d$$, and Y. Here, $$Z \in \{1,\ldots ,k\}$$ is the assignment variable, which determines the assignment of the data point to one of k clusters. $$X_1,\ldots ,X_d$$ are the explanatory variables, and Y represents the target variable. Let $${\varvec{U}}$$ and $${\varvec{V}}$$ be defined as

\begin{aligned} {\varvec{U}}&:=\left\{ X_i \in \{X_1,\ldots ,X_d\} \mid X_i \ \mathrm{explains\ the\ cluster\ structure} \right\} , \\ {\varvec{V}}&:=\left\{ X_j \in \{X_1,\ldots ,X_d\} \mid X_j \ \mathrm{explains\ the\ regression\ to\ } Y \right\} . \end{aligned}

From this point of view, we can consider broad patterns of data generation models by adjusting these factors:

• Number of variables in $${\varvec{U}} \cap {\varvec{V}}$$.

• Whether Y is continuous or discrete.

• Whether Z is observable or not.

Many of the previous works can be organized by this point of view. For example, the stratified regression is the one with $${\varvec{U}}=\emptyset$$, continuous Y, and observable Z. Mixture of Linear Experts and Piecewise Linear Regression model are the ones with $${\varvec{U}} \cap {\varvec{V}}=\{X_1,\ldots ,X_d\}$$, continuous Y, and unobservable Z.

In this paper, we assume a data generation model with a continuous target variable Y and an unobservable assignment variable Z. We do not restrict the number of elements in $${\varvec{U}} \cap {\varvec{V}}$$; therefore, our approach covers broad patterns of data generation models. Assuming this structure for the data, we derive the optimal prediction under the Bayes criterion [15]. Moreover, we derive an approximate algorithm that calculates the prediction using the variational inference method.

The advantages of using our algorithm are shown below:

1. It is not necessary to input the number of clusters beforehand, since our proposed algorithm automatically weights the probabilities of each number of clusters in the process of prediction.

2. Since the prediction is based on an explicit assumption of a probabilistic data generation structure, it is easier to interpret the structure of the data.

3. We can easily consider broader extensions of our algorithm. For example, by using a logistic regression model instead of a linear regression model, we can solve classification problems where Y is discrete.

Regarding (1), recent works (e.g., [5,6,7]) often fix the number of regression models, and hence the number of clusters. In contrast, our algorithm does not commit to a single number of clusters, but considers several numbers of clusters and weights their prediction results.

In Sect. 2, we define the probabilistic structure which expresses the data generation. In Sects. 3 and 4, we derive the optimal prediction under the Bayes criterion and the approximate algorithm, respectively. In Sect. 5, we describe experiments conducted to examine the behavior of the proposed algorithm on both synthetic and real data. In the first experiment, we sampled data and parameters from the priors, so that the average behavior of the algorithm can be checked. In the second experiment, we generated data with a specific tendency to examine the algorithm's behavior in more detail. In the last experiment, we used real data to confirm that our algorithm also works well in practice.

## Probabilistic Model of Data Generation

We assume that the given data have the following characteristics:

• The data are the pairs of explanatory variables and response variables.

• All the variables are continuous.

• It is not appropriate to adopt a single linear regression model to the data.

• If we divide the data into some clusters, it will be appropriate to adopt a linear regression model in each cluster.

Let each of the given data be represented as $$({\varvec{z}},{\varvec{x}}, y)$$, the realized value of three random variables $$({\varvec{Z}},{\varvec{X}}, Y)$$. Each random variable has the following features. From now on, we express the p.d.f.s of the Gaussian, Dirichlet, and Wishart distributions as $$\mathcal {N}(\cdot ), \mathrm {Dir}(\cdot )$$, and $$\mathcal {W}(\cdot )$$, respectively.

### Assumption 2.1

Let K be an unobservable random variable which represents the number of clusters, and k be a realized value of K. Also, let $$k_\mathrm{max}$$ be a fixed value which represents the maximum number of clusters we assume.

### Assumption 2.2

Let $${\varvec{Z}}_k=(Z_1, \ldots ,Z_k)^{\mathsf T}\in \{0,1\}^{k}$$ be a latent random variable called the assignment variable, given the number of clusters k. Here, the $$l_k$$th component $$Z_{l_k}\ (l_k=1, \ldots , k)$$ is defined as

\begin{aligned} Z_{l_k}={\left\{ \begin{array}{ll} 1 &{} \mathrm{\Big (when\ in\ cluster}\ l_k\ \Big )\\ 0 &{} \mathrm{otherwise} \end{array}\right. }, \sum _{l_k=1}^k Z_{l_k}=1. \end{aligned}
(1)

The distribution of $${\varvec{Z}}_k$$ is given by

\begin{aligned} p\left( {\varvec{z}}_k \mid {\varvec{\pi }}_k \right) =\prod _{l_k=1}^k \pi _{l_k}^{z_{l_k}} , \end{aligned}
(2)

where $${\varvec{\pi }}_k:=(\pi _1,\ldots ,\pi _k)^{\mathsf {T}} \in [0,1]^k$$ is a mixing parameter which satisfies

\begin{aligned} \sum _{l_k=1}^k \pi _{l_k}=1. \end{aligned}
(3)

### Assumption 2.3

Let $${\varvec{X}}=(X_1,\ldots ,X_d)^\mathsf {T}\in \mathbb {R}^d$$ be an observable random variable, where $$d:=p+q+r$$. Given the number of clusters k, the first $$(p+q)$$ variables of $${\varvec{X}}$$ follow the Gaussian mixture distribution below:

\begin{aligned} p\left( {\varvec{x}}\mid {\varvec{z}}_k,{\varvec{M}}_k,{\varvec{L}}_k \right)&=p\left( x_1,\ldots ,x_{p+q} \mid {\varvec{z}}_k,{\varvec{M}}_k,{\varvec{L}}_k \right) p \left( x_{p+q+1},\ldots ,x_d \right) \end{aligned}
(4)
\begin{aligned}&=\prod _{l_k=1}^k \mathcal {N}\left( x_1,\ldots , x_{p+q}\mid {\varvec{\mu }}_{l_k}, {\varvec{\Lambda }}_{l_k}^{-1} \right) ^{z_{l_k}}p\left( x_{p+q+1},\ldots ,x_d \right) , \end{aligned}
(5)

where $${\varvec{M}}_k:=\{{\varvec{\mu }}_1,\ldots ,{\varvec{\mu }}_k\}$$ is a set of $$(p+q)$$ dimensional mean vectors, and $${\varvec{L}}_k:=\{{\varvec{\Lambda }}_1,\ldots ,{\varvec{\Lambda }}_k\}$$ is a set of $$(p+q) \times (p+q)$$ dimensional precision matrices.

Equation (5) states that among the explanatory variables $$X_1, \ldots , X_d$$, the first $$(p+q)$$ variables follow a mixture of k Gaussian distributions, and the remaining r variables follow arbitrary distributions. As a real-world example, consider precipitation prediction over several observation areas. The first $$p+q$$ explanatory variables can be elevation, latitude, and distance from the sea, which might explain the cluster structure of the observation areas.

### Assumption 2.4

Let $$Y \in \mathbb {R}$$ be an observable random variable called the target variable. Given the number of clusters k, Y follows a regression model on the last $$(q+r)$$ explanatory variables:

\begin{aligned} p\left( y \!\mid \! {\varvec{x}},{\varvec{W}}_k,{\varvec{z}}_k \right) \!=\! \prod _{l_k=1}^k \mathcal {N}\left( y \!\mid \!w_{l_k0}+w_{l_k1}x_{p\!+\!1}+\cdots +w_{l_k(q\!+\!r)}x_{d}, \sigma _{l_k}^2 \right) ^{z_{l_k}}. \end{aligned}
(6)

Here, $${\varvec{W}}_k:=({\varvec{w}}_1,\ldots ,{\varvec{w}}_k) \in \mathbb {R}^{(q+r+1)\times k}$$ is a coefficient parameter, where

\begin{aligned} {\varvec{w}}_{l_k}=\left( w_{l_k0},\ldots ,w_{l_k(q+r)}\right) ^{\mathsf {T}} \in \mathbb {R}^{q+r+1}\ \ \left( l_k=1,\ldots ,k \right) , \end{aligned}
(7)

and $$\sigma _{l_k}^2 \in \mathbb {R}_+\ (l_k=1,\ldots ,k)$$ is a fixed variance.

Equation (6) states that among the explanatory variables $$X_1,\ldots ,X_d$$, the last $$(q+r)$$ variables explain the regression to the target variable Y.

As can be understood from Assumptions 2.3 and 2.4, the first $$(p+q)$$ explanatory variables explain the cluster structure, and the last $$(q+r)$$ explanatory variables explain the regression to Y. Therefore, we name the variables as below.

### Definition 2.1

Regarding Assumptions 2.3 and 2.4, the first $$(p+q)$$ explanatory variables are called cluster explanatory variables, and the last $$(q+r)$$ explanatory variables are called regression explanatory variables.

### Assumption 2.5

Let $${\varvec{\pi }}_k, {\varvec{M}}_k, {\varvec{L}}_k, {\varvec{W}}_k$$, and K have respective priors $$p({\varvec{\pi }}_k)$$, $$p({\varvec{M}}_k)$$, $$p({\varvec{L}}_k)$$, $$p({\varvec{W}}_k)$$, and p(K).

The graphical model of data and parameters given the number of clusters k is shown in Fig. 1.

### Definition 2.2

Given the number of clusters k, let parameters $${\varvec{\theta }}_k$$ be defined as

\begin{aligned} {\varvec{\theta }}_k:=\left\{ {\varvec{\pi }}_k, {\varvec{M}}_k, {\varvec{L}}_k, {\varvec{W}}_k \right\} . \end{aligned}
(8)

Also, let $${\varvec{x}}^n$$ express n independent, identically distributed data drawn from $$p({\varvec{x}})$$, i.e., $${\varvec{x}}^n:=\{{\varvec{x}}_1,\ldots , {\varvec{x}}_n\}$$, and define $$y^n, {\varvec{z}}^n$$ in the same way. Then, define the given data $${\varvec{D}}$$ as $${\varvec{D}}:=\{{\varvec{x}}^n, y^n, {\varvec{x}}_{n+1}\}$$. Here, the new data point $$({\varvec{x}}_{n+1},y_{n+1})$$ is drawn independently from the same distribution as $$({\varvec{x}}_i, y_i)\ (i=1,\ldots ,n)$$.

Hereafter, we use calligraphic fonts to denote the set to which each random variable belongs.
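To make Assumptions 2.1–2.5 concrete, the following is a minimal sketch of the assumed generative process in the special case $$p=q=r=1$$ (so $$d=3$$): $$x_1, x_2$$ explain the clusters and $$x_2, x_3$$ explain the regression. Identity precision matrices, the seed, and all numeric constants below are illustrative assumptions, not values prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 2                      # number of clusters (a realization of K)
n = 100                    # sample size
p, q, r = 1, 1, 1
sigma2 = 0.5               # fixed regression variance per cluster

pi = rng.dirichlet(np.ones(k))                    # mixing parameter pi_k
mu = rng.normal(0.0, 3.0, size=(k, p + q))        # cluster means M_k
w = rng.normal(0.0, 1.0, size=(k, q + r + 1))     # regression coefficients W_k

z = rng.choice(k, size=n, p=pi)                   # assignment variables
u = mu[z] + rng.normal(size=(n, p + q))           # cluster explanatory part
x_rest = rng.normal(size=(n, r))                  # remaining r variables
x = np.hstack([u, x_rest])                        # full explanatory vector
v_tilde = np.hstack([np.ones((n, 1)), x[:, p:]])  # (1, x_{p+1}, ..., x_d)
y = np.einsum('ij,ij->i', w[z], v_tilde) + rng.normal(0.0, np.sqrt(sigma2), n)
```

Note that the middle variable $$x_2$$ enters both the mixture and the regression, which is exactly the overlap $${\varvec{U}} \cap {\varvec{V}}$$ discussed in the Introduction.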

## Optimal Prediction Under Bayes Criterion

Assuming the given data are represented by the probabilistic model introduced in Sect. 2, we derive the optimal prediction in the sense of the Bayes criterion under squared-error loss.

Since the goal is to predict the new response variable $$y_{n+1}$$ from the given data $${\varvec{D}}$$, the decision rule $$\delta$$ can be defined as follows.

\begin{aligned} \delta : \mathcal {X}^n \times \mathcal {Y}^n \times \mathcal {X} \rightarrow \mathcal {Y} ; {\varvec{D}} \mapsto \delta ({\varvec{D}}) . \end{aligned}
(9)

Using the squared-error loss between the true value of $$y_{n+1}$$ and the decision $$\delta ({\varvec{D}})$$, we define the loss function L as below:

\begin{aligned}&L\left( \delta ({\varvec{D}}), {\varvec{z}}_k^{n+1}, {\varvec{\theta }}_k, k \right) =\int _{\mathcal {Y}}\left( y_{n+1}-\delta ({\varvec{D}})\right) ^2 p\left( y_{n+1}\mid {\varvec{x}}_{n+1}, {\varvec{z}}_{n+1\,k},{\varvec{\theta }}_k,k \right) dy_{n+1}. \end{aligned}
(10)

Given the loss function, the risk function of decision rule $$\delta (\cdot )$$ is defined as follows.

\begin{aligned}&R\left( \delta (\cdot ), {\varvec{z}}_k^{n+1}, {\varvec{\theta }}_k, k \right) := \int _{\mathcal {X}^{n+1}}\int _{\mathcal {Y}^{n}}L\left( \delta ({\varvec{D}}), {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k, k \right) p\left( {\varvec{D}} \mid {\varvec{z}}_k^{n+1}, {\varvec{\theta }}_k,k\right) dy^n\,d{\varvec{x}}^{n+1}. \end{aligned}
(11)

Moreover, the Bayes risk of the decision rule $$\delta (\cdot )$$, with respect to the prior distributions defined in Assumptions 2.2 and 2.5, is defined as

\begin{aligned}&\mathrm{BR}(\delta (\cdot )) :=\sum _{k=1}^{k_\mathrm{max}}\sum _{\mathcal {Z}_k^{n+1}}\int _{\Theta _k} R\left( \delta (\cdot ), {\varvec{z}}_k^{n+1}, {\varvec{\theta }}_k,k \right) p\left( {\varvec{z}}_k^{n+1}, {\varvec{\theta }}_k, k \right) d{\varvec{\theta }}_k. \end{aligned}
(12)

Here, we can obtain the following result. The proof of Proposition 3.1 is in Appendix A.

### Proposition 3.1

The decision rule $$\delta ^*({\varvec{D}})$$ which minimizes the Bayes risk (which we call “optimal prediction under Bayes criterion”) is

\begin{aligned} \delta ^*({\varvec{D}})=\int _{\mathcal {Y}}y_{n+1}{p}^*\left( y_{n+1}\mid {\varvec{D}}\right) dy_{n+1}, \end{aligned}
(13)

where

\begin{aligned}&{p}^*\left( y_{n+1}\mid {\varvec{D}}\right) := \sum _{k=1}^{k_\mathrm{max}}\sum _{\mathcal {Z}_k^{n\!+\!1}}\int _{\Theta _k}p\left( y_{n\!+\!1}\mid {\varvec{D}}, {\varvec{z}}_k^{n\!+\!1}, {\varvec{\theta }}_k, k\right) p\left( {\varvec{z}}_k^{n\!+\!1}, {\varvec{\theta }}_k, k \mid {\varvec{D}}\right) d{\varvec{\theta }}_k. \end{aligned}
(14)

Here, $${p}^*(y_{n+1}\mid {\varvec{D}})$$ is called predictive distribution.

Note that, given the same data set $$({\varvec{x}}^n, y^n)$$, different new data points such as $${\varvec{x}}_{n+1}$$ and $${\varvec{x}}_{n+1}'$$ yield different posterior distributions of the parameters and the latent variables such as k and $${\varvec{z}}^{n+1}$$. The effect of the new data point $${\varvec{x}}_{n+1}$$ becomes smaller as the data size n grows.

## Approximation Using Variational Bayes Algorithm

Although we derived the optimal prediction under the Bayes criterion in Proposition 3.1, it is hard to obtain a closed-form analytical solution for the posterior distribution $$p({\varvec{z}}_k^{n+1},{\varvec{\theta }}_k,k \mid {\varvec{D}})$$ which appears in (14). Therefore, we adopt the variational Bayes algorithm to approximate the posterior distribution, and thereby obtain an approximation of the optimal prediction under the Bayes criterion.

### Priors of Parameters and Cluster Number

First, we introduce priors over each parameter given the number of clusters k, namely $${\varvec{\pi }}_k, {\varvec{M}}_k, {\varvec{L}}_k, {\varvec{W}}_k$$, and over the cluster number K. We mostly choose a conjugate distribution as the prior of each parameter, which considerably simplifies the analysis.

### Assumption 4.1

Let a uniform distribution govern the cluster number K, i.e.,

\begin{aligned} p(k)=\dfrac{1}{k_\mathrm{max}}. \end{aligned}
(15)

### Assumption 4.2

Let the mixing parameter $${\varvec{\pi }}_k \in [0,1]^k$$ be drawn from the Dirichlet distribution

\begin{aligned} p({\varvec{\pi }}_k)=\mathrm{Dir}\left( {\varvec{\pi }}_k \mid {\varvec{\alpha }}_{0_k}\right) =C\left( {\varvec{\alpha }}_{0_k}\right) \prod _{l_k=1}^k \pi _{l_k}^{\alpha _{0_k}-1}, \end{aligned}
(16)

where $${\varvec{\alpha }}_{0_k}:=(\alpha _{0_k},\ldots ,\alpha _{0_k})^{\mathsf {T}}\in \mathbb {R}^k$$ is a constant real-valued vector, and $$C({\varvec{\alpha }}_{0_k})$$ is the normalization constant of the Dirichlet distribution.

### Assumption 4.3

Let a Gaussian-Wishart distribution govern the mean vector and precision matrix of each Gaussian component, i.e.,

\begin{aligned}&p\left( {\varvec{M}}_k,{\varvec{L}}_k\right) =p\left( {\varvec{M}}_k \mid {\varvec{L}}_k \right) p\left( {\varvec{L}}_k\right) =\prod _{l_k=1}^k \mathcal {N}\left( {\varvec{\mu }}_{l_k} \mid {\varvec{m}}_{0_k}, (\beta _{0_k} {\varvec{\Lambda }}_{l_k})^{-1}\right) \mathcal {W}\left( {\varvec{\Lambda }}_{l_k} \mid {\varvec{A}}_{0_k}, \nu _{0_k}\right) . \end{aligned}
(17)

Here, all the hyper-parameters $${\varvec{m}}_{0_k} \in \mathbb {R}^{p+q}$$, $$\beta _{0_k} \in \mathbb {R}_+$$, $${\varvec{A}}_{0_k} \in \mathbb {R}^{(p+q) \times (p+q)}$$, and $$\nu _{0_k} \in \mathbb {R}$$ are constants.

### Assumption 4.4

Let the coefficient parameter be drawn from a Gaussian distribution in each cluster, i.e.,

\begin{aligned} p({\varvec{W}}_k)=\prod _{l_k=1}^k \mathcal {N}\left( {\varvec{w}}_{l_k} \mid {\varvec{\mu }}_{w{0_k}}, {\varvec{\Lambda }}_{w{0_k}}^{-1} \right) . \end{aligned}
(18)

Here, the mean vector $${\varvec{\mu }}_{w0_k} \in \mathbb {R}^{q+r+1}$$ and the precision matrix $${\varvec{\Lambda }}_{w0_k} \in \mathbb {R}^{(q+r+1)\times (q+r+1)}$$ are both constants.

### Analysis on Approximated Posterior Distribution

Now, we consider an approximated posterior distribution, which is a distribution of the latent variables $${\varvec{Z}}_k^{n+1}$$, the parameters $${\varvec{\theta }}_k$$, and the cluster number K, belonging to the function family $$\mathfrak {Q}$$ defined below:

\begin{aligned}&\mathfrak {Q}:=\left\{ q\left( {\varvec{z}}_k^{n+1}, {\varvec{\theta }}_k, k \right) \mid q\left( {\varvec{z}}_k^{n\!+\!1} \!,{\varvec{\theta }}_k,k \right) =\!q\left( {\varvec{z}}_k^n \right) q\left( {\varvec{z}}_{n\!+\!1\, k} \right) q\left( {\varvec{\pi }}_k \right) q\left( {\varvec{M}}_k,\!{\varvec{L}}_k \right) q\left( {\varvec{W}}_k \right) q(k)\right\} . \end{aligned}
(19)

The purpose of the variational Bayes algorithm is to obtain the variational distribution $$q^* \in \mathfrak {Q}$$ which minimizes the Kullback–Leibler (KL) divergence from the posterior distribution $$p({\varvec{z}}_k^{n+1},{\varvec{\theta }}_k,k \mid {\varvec{D}})$$, i.e.,

\begin{aligned} q^*\left( {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k,k \right)&=q^*\left( {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k\mid k \right) q^*(k) \end{aligned}
(20)
\begin{aligned}&=\mathop {\mathrm{arg~min}}\limits _{q \in \mathfrak {Q}} \sum _{k\!=\!1}^{k_\mathrm{max}}\sum _{\mathcal {Z}_k^{n\!+\!1}}\int _{\Theta _k} q\left( {\varvec{z}}_k^{n\!+\!1}\!, {\varvec{\theta }}_k,k \right) \ln \frac{q\left( {\varvec{z}}_k^{n\!+\!1}\!, {\varvec{\theta }}_k,k \right) }{p\left( {\varvec{z}}_k^{n\!+\!1}\!,{\varvec{\theta }}_k,k \!\mid \! {\varvec{D}} \right) }d{\varvec{\theta }}_k. \end{aligned}
(21)

Here, we can obtain the following two propositions.

### Proposition 4.1

\begin{aligned}&\ln q^*\left( {\varvec{z}}_k^{n}\right) := E_{\backslash q^*({\varvec{z}}_k^n)}\left[ \ln p\left( {\varvec{x}}^{n\!+\!1},y^n, {\varvec{z}}_k^{n\!+\!1},{\varvec{\theta }}_k\mid k\right) \right] +\mathrm{const.}, \end{aligned}
(22)
\begin{aligned}&\ln \!q^*\left( {\varvec{z}}_{n\!+\!1\,k}\right) \!:=\!E_{\backslash q^*({\varvec{z}}_{n\!+\!1\,k})}\left[ \ln p\left( {\varvec{x}}^{n\!+\!1}\!,y^n\!, {\varvec{z}}_k^{n\!+\!1}\!,{\varvec{\theta }}_k\!\mid \!k\right) \right] +\mathrm{const.}, \end{aligned}
(23)
\begin{aligned}&\ln q^*({\varvec{\pi }}_k):= E_{\backslash q^*({\varvec{\pi }}_k)}\left[ \ln p\left( {\varvec{x}}^{n+1},y^n, {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k\mid k\right) \right] +\mathrm{const.}, \end{aligned}
(24)
\begin{aligned}&\ln \!q^*({\varvec{M}}_k,{\varvec{L}}_k)\!:=\! E_{\backslash q^*({\varvec{M}}_k,{\varvec{L}}_k)}\left[ \ln p\left( {\varvec{x}}^{n\!+\!1}\!,y^n,\! {\varvec{z}}_k^{n\!+\!1}\!,{\varvec{\theta }}_k\!\mid \!k\right) \right] +\mathrm{const.}, \end{aligned}
(25)
\begin{aligned}&\ln q^*({\varvec{W}}_k):= E_{\backslash q^*({\varvec{W}}_k)}\left[ \ln p\left( {\varvec{x}}^{n\!+\!1},y^n, {\varvec{z}}_k^{n\!+\!1},{\varvec{\theta }}\!\mid \!k\right) \right] +\mathrm{const.} \end{aligned}
(26)

Then,

\begin{aligned} q^*\left( {\varvec{z}}_k^{n\!+\!1},{\varvec{\theta }}_k\!\mid \! k \right) =q^*\left( {\varvec{z}}_k^{n}\right) q^*\left( {\varvec{z}}_{n\!+\!1\,k}\right) q^*\left( {\varvec{\pi }}_k\right) q^*\left( {\varvec{M}}_k,{\varvec{L}}_k\right) q^*\left( {\varvec{W}}_k\right) , \end{aligned}
(27)

where $$E_{\backslash q^*(\star )}[\cdot ]$$ denotes an expectation with respect to the distribution $$q^*$$ over all variables except $$\star$$.

### Proposition 4.2

Regarding the variational distribution of K, it holds that

\begin{aligned}&\ln q^*(k)=\ln p(k) + \ln k! +\sum _{\mathcal {Z}_k^{n\!+\!1}}\!\int _{\Theta _k}q^*({\varvec{z}}_k^{n\!+\!1},{\varvec{\theta }}_k\!\mid \!k)\!\ln \frac{p\left( {\varvec{x}}^{n\!+\!1},y^n, {\varvec{z}}_k^{n\!+\!1},{\varvec{\theta }}_k\mid k\right) }{q^*\left( {\varvec{z}}_k^{n\!+\!1},{\varvec{\theta }}_k\!\mid \!k\right) }d{\varvec{\theta }}_k +\mathrm{const}. \end{aligned}
(28)

Equations (27) and (28) give the distribution which minimizes the KL divergence from the posterior distribution subject to the constraint of $$\mathfrak {Q}$$. However, they still do not give an explicit solution, since the expectations in (22)–(26) depend on the other factors of $$q^*$$. Therefore, we first initialize all the factors and then repeatedly update each factor in turn using (22)–(26), so that (27) can be calculated approximately.
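The initialize-and-cycle structure just described can be sketched as follows. As a stand-in for the full updates (29)–(40), this toy alternates only two factors (responsibilities and cluster means) for a one-dimensional two-component mixture with known unit variances, stopping when a lower-bound surrogate stops increasing; only the loop shape carries over to the proposed algorithm, and every numeric value is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
m = np.array([-0.5, 0.5])            # initialized factor (cluster means)
old_bound, tol = -np.inf, 1e-8
for t in range(200):
    # update the q(z) factor: responsibilities given the current means
    logp = -0.5 * (x[:, None] - m[None, :]) ** 2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # update the means factor: responsibility-weighted averages
    m = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    # convergence test on a lower-bound surrogate
    bound = (r * logp).sum() - (r * np.log(r + 1e-12)).sum()
    if bound - old_bound < tol:
        break
    old_bound = bound
```

Because each coordinate update can only increase the bound, monitoring its increments is a valid stopping rule, which is exactly the role the variational lower bound plays in step 6 below.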

In the next subsections, we introduce the updating formula of each factor $$q^*(\cdot _k)$$. Hereafter, $$q^{(t)}(\star _k)$$ denotes the variational distribution of the latent variable or parameter $$\star$$ given cluster number k at iteration t, and $$E_{\star _k}[\cdot ]$$ denotes the expectation with respect to $$q^{(t)}(\star _k)$$. Also, for simplicity, we define $${\varvec{U}}\in \mathbb {R}^{p+q}$$ and $${\varvec{V}} \in \mathbb {R}^{q+r}$$ as follows:

\begin{aligned} {\varvec{U}}&:=\left( X_1,\ldots ,X_{p+q}\right) \ (\mathrm{cluster \ explanatory \ variables}), \\ {\varvec{V}}&:=\left( X_{p+1},\ldots ,X_{d}\right) \ (\mathrm{regression \ explanatory \ variables}). \end{aligned}

We use $${\varvec{u}}_i =(u_{i1},\ldots ,u_{i(p+q)})^\mathsf {T} \in \mathbb {R}^{p+q}, {\varvec{v}}_i =(v_{i1},\ldots ,v_{i(q+r)})^\mathsf {T} \in \mathbb {R}^{q+r}\ (i=1,\ldots ,n)$$ as the realized values of $${\varvec{U}}$$ and $${\varvec{V}}$$ respectively, and define $$\tilde{{\varvec{v}}}_i$$ as $$\tilde{{\varvec{v}}}_i=(1,v_{i1},\ldots ,v_{i(q+r)}) \in \mathbb {R}^{q+r+1}\ (i=1,\ldots ,n)$$.

1. Updating $$q({\varvec{z}}_k^n)$$: From (22), the updating formula of $$q({\varvec{z}}_k^n)$$ is given by

\begin{aligned} q^{(t+1)}({\varvec{z}}_k^n)=\prod _{i=1}^n \prod _{l_k=1}^k r_{il_k}^{(t) \ z_{il_k}}. \end{aligned}
(29)

Here, $$r_{il_k}^{(t)}$$ is given by

\begin{aligned} r_{il_k}^{(t)}=\dfrac{\rho _{il_k}^{(t)}}{\sum _{l'_k=1}^k \rho _{il'_k}^{(t)}}, \end{aligned}
(30)

where

\begin{aligned} {\rho _{il_k}^{(t)}}&=\exp \left[ \frac{1}{2}\left\{ \sum _{j=1}^{p+q} \psi \left( \frac{\nu _{l_k}^{(t)}+1-j}{2}\right) +(p+q)\ln 2+ \ln |{\varvec{A}}_{l_k}^{(t)}|\right\} -\!\frac{1}{2} \left\{ (p+q){\beta _{l_k}^{(t)}}^{-1}+\nu _{l_k}^{(t)}({\varvec{u}}_i-{\varvec{m}}_{l_k}^{(t)})^{\mathsf {T}}{\varvec{A}}_{l_k}^{(t)}({\varvec{u}}_i-{\varvec{m}}_{l_k}^{(t)})\right\} \right. \nonumber \\&\left. -\frac{1}{2\sigma _{l_k}^2}\left\{ y_i^2-2y_i{{\varvec{\mu }}_{wl_k}^{(t)}}^\mathsf {T}\tilde{{\varvec{v}}}_i+\tilde{{\varvec{v}}}_i^\mathsf {T}\left( {{\varvec{\Lambda }}_{wl_k}^{(t)}}^{-1}+{\varvec{\mu }}_{wl_k}^{(t)}{{\varvec{\mu }}_{wl_k}^{(t)}}^\mathsf {T}\right) \tilde{{\varvec{v}}}_i\right\} +\psi (\alpha _{l_k}^{(t)})-\psi \left( \sum _{l_k=1}^k \alpha _{l_k}^{(t)}\right) -\frac{p+q}{2}\ln (2\pi )\!-\!\frac{1}{2}\ln (2\pi \sigma _{l_k}^2)\right] . \end{aligned}
(31)

Here, $$\psi (\cdot )$$ denotes the digamma function.
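As a sanity check on (30)–(31), the sketch below evaluates the responsibility of a single data point for $$p+q=2$$, $$q+r=2$$, and $$k=2$$. Every value below (alpha, beta, nu, m, A, the regression factors, and the data point itself) is an illustrative assumption standing in for the time-t variational quantities, not a value from the paper.

```python
import numpy as np
from scipy.special import digamma

k, pq = 2, 2
u_i = np.array([0.5, -1.0])                 # cluster explanatory part of x_i
v_i_t = np.array([1.0, -1.0, 0.3])          # (1, v_i): regression part
y_i = 0.8
sigma2 = np.array([1.0, 1.0])

alpha = np.array([3.0, 2.0])                # Dirichlet hyper-parameters
beta = np.array([4.0, 4.0])
nu = np.array([5.0, 5.0])
m = np.array([[0.0, 0.0], [1.0, -1.0]])
A = np.stack([np.eye(pq), np.eye(pq)])      # Wishart scale matrices
mu_w = np.array([[0.1, 0.5, -0.2], [0.0, 1.0, 0.3]])
Lam_w_inv = np.stack([np.eye(3), np.eye(3)])

log_rho = np.empty(k)
for l in range(k):
    # E[ln |Lambda_l|] under the Wishart factor
    e_ln_det = (digamma((nu[l] + 1 - np.arange(1, pq + 1)) / 2).sum()
                + pq * np.log(2) + np.log(np.linalg.det(A[l])))
    # E[(u_i - mu_l)^T Lambda_l (u_i - mu_l)]
    diff = u_i - m[l]
    e_quad = pq / beta[l] + nu[l] * diff @ A[l] @ diff
    # E[(y_i - w_l^T v~_i)^2] under the Gaussian factor of q(W)
    e_sq = (y_i ** 2 - 2 * y_i * mu_w[l] @ v_i_t
            + v_i_t @ (Lam_w_inv[l] + np.outer(mu_w[l], mu_w[l])) @ v_i_t)
    # E[ln pi_l] under the Dirichlet factor
    e_ln_pi = digamma(alpha[l]) - digamma(alpha.sum())
    log_rho[l] = (0.5 * e_ln_det - 0.5 * e_quad - e_sq / (2 * sigma2[l])
                  + e_ln_pi - pq / 2 * np.log(2 * np.pi)
                  - 0.5 * np.log(2 * np.pi * sigma2[l]))

r_i = np.exp(log_rho - log_rho.max())       # normalize in log space
r_i /= r_i.sum()                            # responsibilities sum to one
```

Working with $$\ln \rho$$ and subtracting the maximum before exponentiating avoids the underflow that the product of many small densities in (31) would otherwise cause.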

2. Updating $$q({\varvec{z}}_{n+1\,k})$$: From (23), the updating formula of $$q({\varvec{z}}_{n+1\,k})$$ is given by

\begin{aligned} q^{(t+1)}({\varvec{z}}_{n+1\,k})= \prod _{l_k=1}^k \varphi _{l_k}^{(t)\ z_{n+1\ l_k}}. \end{aligned}
(32)

Here, $$\varphi _{l_k}^{(t)}$$ is given by

\begin{aligned} \varphi _{l_k}^{(t)}=\dfrac{\eta _{l_k}^{(t)}}{\sum _{l'_k=1}^k \eta _{l'_k}^{(t)}}, \end{aligned}
(33)

where

\begin{aligned}\eta _{l_k}^{(t)}&=\exp \left[ \frac{1}{2}\left\{ \sum _{j=1}^{p+q} \psi \left( \frac{\nu _{l_k}^{(t)}+1-j}{2}\right) +(p+q)\ln 2+ \ln |{\varvec{A}}_{l_k}^{(t)}|\right\} +\psi (\alpha _{l_k}^{(t)})-\psi \left( \sum _{l_k=1}^k \alpha _{l_k}^{(t)}\right) \right. \nonumber \\& \quad -\left. \!\frac{1}{2} \left\{ (p+q){\beta _{l_k}^{(t)}}^{-1}+\nu _{l_k}^{(t)}({\varvec{u}}_{n+1}-{\varvec{m}}_{l_k}^{(t)})^{\mathsf {T}}{\varvec{A}}_{l_k}^{(t)}({\varvec{u}}_{n+1}-{\varvec{m}}_{l_k}^{(t)})\right\} -\frac{p+q}{2}\ln (2\pi )\right] . \end{aligned}
(34)
3. Updating $$q({\varvec{\pi }}_k)$$: We define the following quantity for each $$l_k=1,\ldots ,k$$:

\begin{aligned} N_{l_k}^{(t)}=\sum _{i=1}^n r_{il_k}^{(t)}+\varphi _{l_k}^{(t)}. \end{aligned}
(35)

From (24), the updating formula of $$q({\varvec{\pi }}_k)$$ is given by

\begin{aligned} q({\varvec{\pi }}_k)^{(t+1)}=\mathrm{Dir}\left( {\varvec{\pi }}_k \mid {\varvec{\alpha }}_k^{(t+1)}\right) , \end{aligned}
(36)

where $${\varvec{\alpha }}_k^{(t+1)}:=\left( \alpha _1^{(t+1)},\ldots ,\alpha _k^{(t+1)}\right) ^{\mathsf {T}}$$ and $$\alpha _{l_k}^{(t+1)}:=N_{l_k}^{(t+1)}+\alpha _{0_k}$$ for $$l_k=1,\ldots ,k$$.

4. Updating $$q({\varvec{M}}_k,{\varvec{L}}_k)$$: We define the following quantities for each $$l_k=1,\ldots ,k$$:

\begin{aligned}&\bar{{\varvec{u}}}^{(t)}_{l_k}= \frac{1}{N^{(t)}_{l_k}}\left\{ \sum _{i=1}^n r^{(t)}_{il_k}{\varvec{u}}_i+\varphi ^{(t)}_{l_k}{\varvec{u}}_{n+1}\right\} , \end{aligned}
(37)
\begin{aligned}&{\varvec{S}}_{l_k}^{(t)}=\frac{1}{N_{l_k}^{(t)}}\left\{ \sum _{i=1}^{n}r_{il_k}^{(t)}\left( {\varvec{u}}_i-\bar{{\varvec{u}}}_{l_k}^{(t)}\right) \left( {\varvec{u}}_i-\bar{{\varvec{u}}}_{l_k}^{(t)}\right) ^{\mathsf {T}}+\varphi _{l_k}^{(t)}\left( {\varvec{u}}_{n+1}-\bar{{\varvec{u}}}_{l_k}^{(t)}\right) \left( {\varvec{u}}_{n+1}-\bar{{\varvec{u}}}_{l_k}^{(t)}\right) ^{\mathsf {T}}\right\} . \end{aligned}
(38)

From (25), the updating formula of $$q({\varvec{M}}_k,{\varvec{L}}_k)$$ is given by

\begin{aligned} q^{(t+1)}({\varvec{M}}_k,{\varvec{L}}_k) =\prod _{l_k=1}^k \!\mathcal {N}\!\left( {\varvec{\mu }}_{l_k} \!\mid \!{\varvec{m}}_{l_k}^{(t\!+\!1)},(\beta _{l_k}^{(t\!+\!1)} {\varvec{\Lambda }}_{l_k})^{-1}\right) \mathcal {W}\left( {\varvec{\Lambda }}_{l_k} \!\mid \! {\varvec{A}}_{l_k}^{(t\!+\!1)},\!\nu _{l_k}^{(t\!+\!1)}\right) . \end{aligned}
(39)

Here, $$\beta _{l_k}^{(t+1)},\nu _{l_k}^{(t+1)},{\varvec{m}}_{l_k}^{(t+1)}$$, and $${\varvec{A}}_{l_k}^{(t+1)}$$ are defined as

\begin{aligned} \beta _{l_k}^{(t+1)}&=\beta _{0_k}+N_{l_k}^{(t+1)},\ \nu _{l_k}^{(t+1)}=\nu _{0_k}+N_{l_k}^{(t+1)}, \\ {\varvec{m}}_{l_k}^{(t+1)}&= \frac{1}{\beta _{l_k}^{(t+1)}}\left( \beta _{0_k} {\varvec{m}}_{0_k}+N_{l_k}^{(t+1)}\bar{{\varvec{u}}}_{l_k}^{(t+1)}\right) , \\ \left( {\varvec{A}}_{l_k}^{(t+1)}\right) ^{-1}&=\left( {\varvec{A}}_{0_k}\right) ^{-1}+N_{l_k}^{(t+1)}{\varvec{S}}_{l_k}^{(t+1)}+\frac{\beta _{0_k} N_{l_k}^{(t+1)}}{\beta _{0_k}+N_{l_k}^{(t+1)}}\left( \bar{{\varvec{u}}}_{l_k}^{(t+1)}-{\varvec{m}}_{0_k}\right) \left( \bar{{\varvec{u}}}_{l_k}^{(t+1)}-{\varvec{m}}_{0_k}\right) ^{\mathsf {T}}. \end{aligned}
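The Gaussian-Wishart update (39) for one cluster l can be sketched as follows; the prior hyper-parameters and the statistics $$N_l, \bar{{\varvec{u}}}_l, {\varvec{S}}_l$$ (which would come from (35), (37), (38)) are illustrative assumptions.

```python
import numpy as np

pq = 2                              # dimension p + q
beta0, nu0 = 1.0, float(pq) + 1.0   # prior hyper-parameters (assumed)
m0 = np.zeros(pq)
A0 = np.eye(pq)

N_l = 10.0                          # effective cluster size, eq. (35)
u_bar = np.array([1.5, -0.5])       # weighted mean, eq. (37)
S_l = np.array([[0.8, 0.1],         # weighted scatter, eq. (38)
                [0.1, 1.2]])

beta_l = beta0 + N_l
nu_l = nu0 + N_l
m_l = (beta0 * m0 + N_l * u_bar) / beta_l
diff = u_bar - m0
A_l_inv = (np.linalg.inv(A0) + N_l * S_l
           + (beta0 * N_l / (beta0 + N_l)) * np.outer(diff, diff))
A_l = np.linalg.inv(A_l_inv)        # updated Wishart scale matrix
```

The posterior mean $${\varvec{m}}_l$$ is a convex combination of the prior mean and the responsibility-weighted sample mean, so clusters with small effective size $$N_l$$ stay close to the prior.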
5. Updating $$q({\varvec{W}}_k)$$: From (26), the updating formula of $$q({\varvec{W}}_k)$$ is given by

\begin{aligned} q^{(t+1)}({\varvec{W}}_k)=\prod _{l_k=1}^k \mathcal {N}\left( {\varvec{w}}_{l_k}^{(t+1)} \mid {\varvec{\mu }}_{wl_k}^{(t+1)},\left( {\varvec{\Lambda }}_{wl_k}^{(t+1)}\right) ^{-1}\right) . \end{aligned}
(40)

Here, $${\varvec{\Lambda }}_{wl_k}^{(t+1)}$$ and $${\varvec{\mu }}_{wl_k}^{(t+1)}$$ are given as

\begin{aligned}&{\varvec{\Lambda }}_{wl_k}^{(t+1)}=\frac{1}{\sigma _{l_k}^2} \sum _{i=1}^n r_{il_k}^{(t+1)}\tilde{{\varvec{v}}}_i\tilde{{\varvec{v}}}_i^{\mathsf {T}}+{\varvec{\Lambda }}_{w0_k} , \\&{\varvec{\mu }}_{wl_k}^{(t+1)}=\left( {\varvec{\Lambda }}_{wl_k}^{(t+1)}\right) ^{-1} \left( \frac{1}{\sigma _{l_k}^2} \sum _{i=1}^n r_{il_k}^{(t+1)}y_i \tilde{{\varvec{v}}}_i+{\varvec{\Lambda }}_{w0_k}{\varvec{\mu }}_{w0_k}\right) . \end{aligned}
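This coefficient update is a responsibility-weighted Bayesian linear-regression update: each data point contributes to the posterior precision and mean in proportion to its responsibility for cluster l. A minimal sketch, in which the data, responsibilities, and prior are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, dim = 20, 3                         # dim = q + r + 1 (with intercept)
V = np.hstack([np.ones((n, 1)), rng.normal(size=(n, dim - 1))])  # rows: v~_i
y = rng.normal(size=n)
r_l = rng.uniform(0.0, 1.0, size=n)    # responsibilities of cluster l
sigma2_l = 0.5                         # fixed regression variance
Lam_w0 = np.eye(dim)                   # prior precision
mu_w0 = np.zeros(dim)                  # prior mean

# Posterior precision: (1/sigma^2) sum_i r_i v~_i v~_i^T + prior precision
Lam_wl = (V.T * r_l) @ V / sigma2_l + Lam_w0
# Posterior mean: solve instead of forming the inverse explicitly
mu_wl = np.linalg.solve(Lam_wl, (V.T @ (r_l * y)) / sigma2_l + Lam_w0 @ mu_w0)
```

Setting all responsibilities to one recovers the standard single-cluster Bayesian linear-regression posterior, which is a quick way to test an implementation.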
6. Testing convergence: We use the variational lower bound to test convergence and decide when to stop the iteration, since the variational Bayes algorithm guarantees that the variational lower bound increases at each iteration [16].

Using the variational distributions of the parameters and latent variables after t iterations, define $$q^{(t)}({\varvec{z}}^{n+1}_k,{\varvec{\theta }}_k)$$ as

\begin{aligned} q^{(t)}\left( {\varvec{z}}_k^{n\!+\!1}\!,{\varvec{\theta }}_k\right) \!:=\!q^{(t)}\left( {\varvec{z}}_k^n\right) q^{(t)}\left( {\varvec{z}}_{n\!+\!1\,k}\right) q^{(t)}\left( {\varvec{\pi }}_k\right) q^{(t)}\left( {\varvec{M}}_k,{\varvec{L}}_k\right) q^{(t)}\left( {\varvec{W}}_k\right) . \end{aligned}

Then, the variational lower bound at iteration t is defined as

\begin{aligned} \mathcal {L}_k^{(t)}&:=\sum _{\mathcal {Z}_k^{n+1}}\int _{\Theta _k}q^{(t)}({\varvec{z}}_k^{n+1},{\varvec{\theta }}_k)\ln \frac{p\left( {\varvec{D}}, {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k\right) }{q^{(t)}\left( {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k\right) }d{\varvec{\theta }}_k+\ln k! \end{aligned}
(41)
\begin{aligned}&=E\left[ \ln p\left( {\varvec{D}}, {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k\right) \right] -E\left[ \ln q^{(t)}\left( {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k\right) \right] \end{aligned}
(42)
\begin{aligned}&=E\left[ \ln p\left( {\varvec{u}}^{n+1} \mid {\varvec{z}}_k^{n+1},{\varvec{M}}_k,{\varvec{L}}_k\right) \right] +E\left[ \ln p\left( y^n \mid {\varvec{v}}^{n+1},{\varvec{W}}_k,{\varvec{z}}_k^{n+1}\right) \right] \nonumber \\&\quad +E\left[ \ln p\left( {\varvec{z}}_k^{n+1} \mid {\varvec{\pi }}_k\right) \right] +E\left[ \ln p\left( {\varvec{M}}_k,{\varvec{L}}_k\right) \right] \nonumber \\&+E\left[ \ln p\left( {\varvec{\pi }}_k\right) \right] +E\left[ \ln p\left( {\varvec{W}}_k\right) \right] -\!E\left[ \ln q^{(t)}\left( {\varvec{z}}_k^n\right) \right] \!-\!E\left[ \ln q^{(t)}\left( {\varvec{z}}_{n+1\,k}\right) \right] \nonumber \\&\quad - E\left[ \ln q^{(t)}\left( {\varvec{M}}_k,{\varvec{L}}_k\right) \right] \! -E\left[ \ln q^{(t)}\left( {\varvec{\pi }}_k\right) \right] \!-\!E\left[ \ln q^{(t)}\left( {\varvec{W}}_k\right) \right] . \end{aligned}
(43)

Note here that all the expectations in (43) are taken with respect to the distribution $$q^{(t)}({\varvec{z}}_k^{n+1}, {\varvec{\theta }}_k)$$. Each term in (43) can be calculated as in Appendix B. By calculating the variational lower bound at each iteration, we judge the convergence of the algorithm.
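This stopping rule can be sketched as follows; `update_step` is a hypothetical stand-in for one full pass of the variational coordinate updates, and the toy bound used in the usage line is illustrative only.

```python
def run_vb(update_step, max_iter=30, tol=1e-4):
    """Iterate until the variational lower bound stops improving.

    update_step(t) is assumed to perform one full pass of the
    coordinate updates and return the lower bound; the VB algorithm
    guarantees this value is non-decreasing [16], so a small
    increment signals convergence.
    """
    prev = float("-inf")
    for t in range(max_iter):
        curr = update_step(t)
        if curr - prev < tol:   # bound barely moved: converged
            return curr, t + 1
        prev = curr
    return prev, max_iter

# Toy usage: a bound that saturates at 0 from below.
final, iters = run_vb(lambda t: -(0.5 ** t))
```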

Let T be the number of iterations at which the variational lower bound satisfies the convergence condition. Then, we can derive the variational distribution of the cluster number $$k\ (k=1,\ldots ,k_\mathrm{max})$$ as follows:

\begin{aligned} q^{(T)}(k):=\frac{p(k)\exp \left\{ {\mathcal {L}}_k^{(T)}\right\}}{\sum _{k'=1}^{k_\mathrm{max}}p(k')\exp \left\{ \mathcal {L}_{k'}^{(T)}\right\}}. \end{aligned}
(44)
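Because the bounds $$\mathcal {L}_k^{(T)}$$ appear inside exponentials, (44) is best evaluated in the log domain. A minimal sketch, assuming a uniform prior p(k) and illustrative bound values:

```python
import numpy as np

def cluster_number_weights(log_prior, elbo):
    """Compute q^(T)(k) of Eq. (44) in the log domain.

    log_prior: ln p(k) for k = 1..k_max
    elbo:      converged variational lower bounds L_k^(T)
    Working with ln p(k) + L_k and subtracting the maximum avoids
    overflow/underflow in exp{L_k} (log-sum-exp trick).
    """
    s = np.asarray(log_prior) + np.asarray(elbo)
    s -= s.max()            # shift for numerical stability
    w = np.exp(s)
    return w / w.sum()

# Uniform prior over k = 1..3 and illustrative lower bounds.
q_k = cluster_number_weights(np.log([1/3, 1/3, 1/3]),
                             [-500.0, -490.0, -495.0])
```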

By substituting the approximated posterior distribution $$q^{(T)}({\varvec{z}}_k^{n+1},{\varvec{\theta }}_k,k):=q^{(T)}(k)q^{(T)}({\varvec{z}}_k^{n+1}, {\varvec{\theta }}_k)$$ for the posterior distribution $$p({\varvec{z}}_k^{n+1},{\varvec{\theta }}_k,k \mid {\varvec{D}})$$, we can obtain (14) in the approximated form (the details are in Appendix C):

\begin{aligned} p^*(y_{n+1}\mid {\varvec{D}})\approx \!\sum _{k=1}^{k_\mathrm{max}}\!q^{(T)}(k)\! \sum _{l_k=1}^k \!\varphi _{l_k}^{(T)} \times \mathcal {N}\!\left( y_{n\!+\!1} \left| \ {{\varvec{\mu }}_{wl_k}^{(T)}}^{\mathsf {T}}\tilde{{\varvec{v}}}_{n\!+\!1}, \frac{1}{\sigma _{l_k}^2}\!+\!\tilde{{\varvec{v}}}_{n\!+\!1}^{\mathsf {T}}\left( {\varvec{\Lambda }}_{wl_k}^{(T)}\right) ^{-1}\!\tilde{{\varvec{v}}}_{n\!+\!1}\right. \right) \!. \end{aligned}
(45)

Using this approximated predictive distribution, we can finally derive the next proposition.

### Proposition 4.3

Adopting the variational distribution derived by the variational Bayes algorithm, the optimal prediction under Bayes criterion (13) can be approximated as

\begin{aligned} \delta ^*({\varvec{D}})\approx \sum _{k=1}^{k_\mathrm{max}}q^{(T)}(k)\left( \sum _{l_k=1}^k \varphi _{l_k}^{(T)} {{\varvec{\mu }}_{wl_k}^{(T)}}^{\mathsf {T}} \tilde{{\varvec{v}}}_{n+1}\right) . \end{aligned}
(46)

The proof of Proposition 4.3 is in Appendix D.

Note here that the term in brackets of (46) represents the approximation of the optimal prediction under Bayes criterion when the cluster number is fixed at k. The prediction (46) is the weighted sum of the predictions at each cluster number, where the weight is $$q^{(T)}(k)$$ for each $$k\ (k=1,\ldots ,k_\mathrm{max})$$.
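The nested weighted sum in (46) can be evaluated directly once the variational quantities are available. A minimal sketch with illustrative toy values (all numbers are assumptions for demonstration):

```python
import numpy as np

def bayes_prediction(q_k, phi, mu_w, v_tilde):
    """Evaluate the approximated prediction (46).

    q_k:     q^(T)(k), one weight per candidate cluster number
    phi:     phi[k][l] = varphi_{l_k}^(T), mixing weight of cluster l
             when the cluster number is fixed at k+1
    mu_w:    mu_w[k][l] = posterior mean of the regression coefficients
    v_tilde: augmented regression explanatory variable of the new point
    """
    pred = 0.0
    for k, qk in enumerate(q_k):
        # Term in brackets of (46): prediction with the cluster number fixed.
        fixed_k = sum(phi[k][l] * (mu_w[k][l] @ v_tilde)
                      for l in range(k + 1))
        pred += qk * fixed_k
    return pred

# Toy example with k_max = 2.
v = np.array([1.0, 2.0])
mu = [[np.array([1.0, 0.0])],
      [np.array([1.0, 0.0]), np.array([0.0, 1.0])]]
phi = [[1.0], [0.5, 0.5]]
y_hat = bayes_prediction([0.3, 0.7], phi, mu, v)
```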

## Experiments

To demonstrate the behavior of our algorithm, we used various types of data. First, we generated synthetic data from the model we assumed in Sect. 2 and applied our algorithm to it. We not only generated parameters using the priors defined in Sect. 4.1, but also fixed the parameters to generate data with specific tendencies. In addition, we used real data to confirm that our algorithm works well in practical use, too.

For all the experiments, we set $$q=0$$, which means that there is no overlap between the clustering explanatory variables and the regression explanatory variables. This makes it easier to understand the behavior of each role of the explanatory variables.

All the code used in the experiments is available on GitHub [17].

### Experiment 1

#### Setting

In Experiment 1, we generated both the parameters and the data from the model described in the previous sections, and checked the average of squared error (Note 3) incurred by the prediction (46).

The procedure of Experiment 1 is as follows. First, we generated the cluster number k from the prior (15) and generated the hyper-parameters as follows: $${\varvec{\alpha }}_{0_k}$$: from the uniform distribution on [0, 1); $$\beta _{0_k}$$: from the uniform distribution on [0.001, 1.001); $${\varvec{A}}_{0_k}={\varvec{I}}$$ (identity matrix); $$\nu _{0_k}$$: from the uniform distribution on [2, 3); $${\varvec{\mu }}_{w{0_k}}={\varvec{0}}$$; $${\varvec{\Lambda }}_{w0_k}={\varvec{I}}$$; and $$\sigma _{l_k}^2=0.5$$ for all $$l_k=1,\ldots ,k$$. Next, we generated the parameters from the priors (16)–(18). Then, we generated N sets of training data and M sets of test data from the model. Changing the number of training data n over $$10, 20, \ldots , N$$, we calculated the prediction and the prediction error. We generated the initial values of the hyper-parameters at each cluster number k in the same way as the genuine hyper-parameters. We repeated the data generation P times per parameter generation, the parameter generation Q times per cluster-number generation, and the cluster-number generation R times.
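The data-generating procedure above can be sketched as follows. For brevity, the Gauss–Wishart prior on the cluster parameters is replaced by standard normal draws, so the snippet illustrates the generative structure (mixing weights, assignments, per-cluster regression) rather than reproducing the exact priors:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, sigma2, n = 3, 5, 0.5, 100   # q = 0, as in the experiments

def generate_data(k, alpha0=1.0):
    """Sample one data set from a simplified version of the model."""
    pi = rng.dirichlet(np.full(k, alpha0))     # mixing weights
    mu = rng.standard_normal((k, p))           # cluster centres (simplified)
    w = rng.standard_normal((k, r + 1))        # regression coefs (+ bias)
    z = rng.choice(k, size=n, p=pi)            # latent cluster assignments
    u = mu[z] + rng.standard_normal((n, p))    # clustering explanatory vars
    v = rng.standard_normal((n, r))            # regression explanatory vars
    v_tilde = np.hstack([v, np.ones((n, 1))])  # augmented with bias term
    y = np.einsum("ij,ij->i", w[z], v_tilde) + \
        rng.normal(scale=np.sqrt(sigma2), size=n)
    return u, v, y, z

u, v, y, z = generate_data(k=3)
```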

Here, each constant was set as follows: $$p=3$$, $$q=0$$, $$r=5$$, $$k_\mathrm{max}=5$$, $$N=100$$, $$M=100$$, $$P=100$$, $$Q=10$$, and $$R=10$$. The iteration was stopped either when $$\mathcal {L}_k^{(t+1)}-\mathcal {L}_k^{(t)}<0.0001$$ or when the number of iterations reached 30.

#### Result and Discussion

The result of Experiment 1 is shown in Figs. 2 and 3.

Figure 2 shows the prediction error incurred by both our algorithm and the random forest algorithm [18]. Since our algorithm approximates the Bayes optimal prediction, it is expected to have a smaller prediction error than other prediction algorithms. As shown in Fig. 2, the proposed algorithm obtained the expected result: its prediction error is lower than that of random forest regardless of the number of data n. Therefore, in spite of the approximation by the variational Bayes algorithm, the prediction by the proposed algorithm still exhibits the characteristic of the optimal prediction under Bayes criterion.
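The baseline can be reproduced with scikit-learn's implementation of random forests [18, 19]; the synthetic data and the hyper-parameters below are illustrative assumptions, since the paper does not state the settings used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Illustrative data: two clusters with different regression coefficients.
n = 200
z = rng.integers(0, 2, size=n)
x = rng.standard_normal((n, 1)) + 3 * z[:, None]   # clustering variable
v = rng.standard_normal((n, 5))                    # regression variables
w = np.array([[1, 2, 0, 0, 0], [0, 0, 1, 2, 0]], float)
y = np.einsum("ij,ij->i", w[z], v) + rng.normal(scale=np.sqrt(0.5), size=n)

# Random forest sees the concatenated explanatory variables.
X = np.hstack([x, v])
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[:150], y[:150])
mse = np.mean((rf.predict(X[150:]) - y[150:]) ** 2)  # prediction error
```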

In Fig. 3, the lines $$k=1,2,3,4,5$$ indicate the prediction error at each cluster number, which corresponds to the term in brackets of (46). These lines can be regarded as the prediction error when the cluster number is fixed at k. The graph shows that the prediction error of our algorithm is smaller than any of these. Therefore, we obtain a better prediction by not fixing the cluster number to a concrete value.

### Experiment 2

#### Setting

In Experiment 2, the cluster number and the parameters were fixed, so that we could confirm the behavior of our algorithm on data with specific tendencies. We set $$k=3$$, $$p=3, q=0, r=5$$, and compared the following patterns.

1. (a)

Neither clusters nor regression lines overlap.

$${\varvec{\mu }}_1=(2,0,0,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{\mu }}_2=(0,2,0,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{\mu }}_3=(0,0,2,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_1=(0,0,0,1,2,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_2=(0,0,0,0,0,1,2,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_3=(0,0,0,0,0,0,0,1,2)^{\mathsf {T}}.$$

2. (b)

Clusters 1 and 2 overlap.

$${\varvec{\mu }}_1=(4/5,0,0,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{\mu }}_2=(0,4/5,0,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{\mu }}_3=(0,0,2,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_1=(0,0,0,1,2,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_2=(0,0,0,0,0,1,2,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_3=(0,0,0,0,0,0,0,1,2)^{\mathsf {T}}.$$

3. (c)

Regression lines 1 and 2 overlap.

$${\varvec{\mu }}_1=(2,0,0,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{\mu }}_2=(0,2,0,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{\mu }}_3=(0,0,2,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_1=(0,0,0,1,2,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_2=(0,0,0,1,2,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_3=(0,0,0,0,0,0,0,1,2)^{\mathsf {T}}.$$

4. (d)

Clusters 1 and 2 overlap, and regression lines 1 and 2 overlap.

$${\varvec{\mu }}_1=(4/5,0,0,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{\mu }}_2=(0,4/5,0,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{\mu }}_3=(0,0,2,0,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_1=(0,0,0,1,2,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_2=(0,0,0,1,2,0,0,0,0)^{\mathsf {T}},$$

$${\varvec{w}}_3=(0,0,0,0,0,0,0,1,2)^{\mathsf {T}}.$$

For the other parameters, we fixed $${\varvec{\Lambda }}_{l_k}=10 {\varvec{I}}$$ and $$\sigma _{l_k}^2=0.5\ (l_k=1,2,3)$$. Constants and initial values for the variational Bayes algorithm were set as in Experiment 1.

In Experiment 2, we examined the behavior of the following values:

1. (i)

The prediction error occurred by the prediction (46).

2. (ii)

The approximated posterior distribution of the cluster number K.

The procedure of Experiment 2 was the same as that of Experiment 1, except that the cluster number k was fixed.

#### Result and Discussion

The results of Experiment 2 are shown in Figs. 4 and 5.

Since condition (a) can be regarded as the benchmark, we discuss conditions (b)–(d).

For condition (b), the prediction error is larger than that of condition (a), as shown in Fig. 4. In contrast, as shown in Fig. 5, the approximated posterior distribution of the cluster number K is similar to that of condition (a). Therefore, the larger prediction error can be attributed to the complexity of clustering, which increases the prediction error at each cluster number k.

For condition (c), as shown in Fig. 4, the prediction error is smaller than that in condition (a) when the number of data n is small. In addition, Fig. 5 shows that the approximated posterior probability of the cluster number $$k=2$$ takes a non-negligible value when n is small. Hence, when there are few data, our algorithm indicates that it is appropriate to weight the prediction that regards the cluster number as $$k=2$$ to make the optimal prediction under Bayes criterion. Conversely, when there are enough data n to distinguish clusters 1 and 2, our algorithm indicates that it is suitable to weight the prediction that regards the cluster number as the actual $$k=3$$. Therefore, we confirmed that our algorithm automatically optimizes the weighting over the cluster number k.

For condition (d), the behavior indicated in condition (c) becomes more obvious. Figure 5 shows that the approximated posterior probability of $$k=2$$ is the highest for any number of data n. Since there are essentially two clusters in condition (d), the algorithm adopts $$k=2$$ even when there is a large number of data.

To conclude, Experiment 2 confirmed the following features of our algorithm. First, the complexity of clustering is one cause of a larger prediction error. Second, as long as we provide an appropriate $$k_\mathrm{max}$$, the proposed algorithm automatically tunes the weight of each cluster number k for the optimal prediction under Bayes criterion.

### Experiment 3

#### Setting

In Experiment 3, we used real data taken from the Scikit-learn datasets library [19], which include explanatory variables such as age, body mass index (BMI), average blood pressure, and six blood serum measurements obtained for each patient, as well as the target variable, a quantitative measure of disease progression one year after baseline.

In this experiment, we set the explanatory variables in several patterns and compared the prediction error under each generative model. Setting the quantitative measure of disease progression as the target variable $$y \in \mathbb {R}$$, we fixed the six blood serum measurements as the regression explanatory variables. In one case we used average blood pressure as the cluster explanatory variable, and in the other case we used BMI.

The procedure of Experiment 3 is as follows. First, we randomly sampled $$(N+M)$$ data points from the whole dataset and divided them into N training data and M test data. Changing the number of training data n over 10, 20, ..., N, we calculated the prediction and the prediction error. We repeated the random sampling of the data P times. Here, we set $$N=50, M=50, P=100$$. Constants and initial values for the variational Bayes algorithm were set as in Experiment 1.
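The features described above match scikit-learn's diabetes dataset; assuming that is the data set of [19], the variable assignment and split can be sketched as:

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Assumption: the data set is scikit-learn's diabetes data, whose features
# (age, sex, BMI, blood pressure, six serum measurements) match the text.
data = load_diabetes()
names = data.feature_names
serum = [names.index(s) for s in ("s1", "s2", "s3", "s4", "s5", "s6")]

X_reg = data.data[:, serum]                      # regression explanatory vars
X_clu_bmi = data.data[:, [names.index("bmi")]]   # cluster variable, case 1
X_clu_bp = data.data[:, [names.index("bp")]]     # cluster variable, case 2
y = data.target                  # disease progression one year after baseline

# Random split into N training and M test points, as in the procedure above.
rng = np.random.default_rng(0)
idx = rng.permutation(len(y))[:100]              # N + M = 100
train, test = idx[:50], idx[50:]
```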

#### Result and Discussion

The result of Experiment 3 is shown in Fig. 6.

As can be seen from Fig. 6, using BMI as the cluster explanatory variable reduces the prediction error compared to using blood pressure. Therefore, we can infer that the data structure which uses BMI as the cluster explanatory variable better explains the given data. Hence, our algorithm enables data analysts to understand the structure of the data well.

## Conclusion

In this paper, we considered a data generation model composed of several linear regression models and derived the optimal prediction under Bayes criterion. In addition, we constructed an algorithm which approximately calculates the optimal prediction. The experiments indicated the following behaviors of the proposed algorithm:

1. 1.

By taking the weighted average of the predictions under the approximated posterior distribution of the cluster number K, the average prediction error becomes smaller.

2. 2.

The complexity of clustering makes the prediction error larger.

3. 3.

The algorithm automatically optimizes the weight of the cluster number k, so that it realizes the approximate Bayes optimal prediction.

4. 4.

Adopting our algorithm enables us to estimate the structure of data generation.

To extend our research in the future, we will focus on automatic selection of the cluster explanatory variables and the regression explanatory variables. This extension will enable us to apply the algorithm to any kind of real-valued data without any tuning beforehand.

## Availability of Data and Material

See [17] for the code used for the experiments. The dataset used in Experiment 3 is available at [19].

## Notes

1. 1.

It is called “variational distribution” in [16].

2. 2.

Refer to Sections 10.1.1 and 10.1.4 of [16] for the proofs of Propositions 4.1 and 4.2, respectively.

3. 3.

Hereafter, we use "the prediction error" as shorthand for "the average of squared error".

## Abbreviations

Optimal prediction under Bayes criterion:

The decision rule $$\delta ^*({\varvec{D}})$$ which minimizes the Bayes risk

BMI:

Body mass index

## References

1. 1.

Nathan, G., Holt, D.: The effect of survey design on regression analysis. J. R. Stat. Soc. Ser. B (Methodological) 42(3), 377–386 (1980)

2. 2.

Jacobs, R.A., Jordan, M.: Adaptive mixture of local experts. Neural Comput. 3(1), 78–88 (1991)

3. 3.

Jiang, Y., Conglian, Y., Qinghua, J.: Model selection for the localized mixture of experts models. J. Appl. Stat. 45(11), 1994–2006 (2018)

4. 4.

Bishop, C.M., Svensen, M.: Bayesian hierarchical mixtures of experts. In: Uncertainty in Artificial Intelligence: Proceedings of the Nineteenth Conference, (2003)

5. 5.

Baldacchino, T., Worden, K., Rowson, J.: Robust nonlinear system identification: Bayesian mixture of experts using the t-distribution. Mech. Syst. Signal Process. 85, 977–992 (2017)

6. 6.

Mossavat, I., Amft, O.: Sparse Bayesian hierarchical mixture of experts. In: IEEE Statistical Signal Processing Workshop (SSP), (2011)

7. 7.

Iikubo, Y., Horii, S., Matsushima, T.: Sparse Bayesian hierarchical mixture of experts and variational inference. In: ISITA, (2018)

8. 8.

Iikubo, Y., Horii, S., Matsushima, T.: Model selection of Bayesian hierarchical mixture of experts based on variational inference. In: IEEE International Conference on Systems, Man and Cybernetics (SMC), (2019)

9. 9.

Nusser, S., Otte, C., Hauptmann, W.: An EM-based piecewise linear regression algorithm. In: Corchado, E., Abraham, A., Pedrycz, W. (eds.) Hybrid Artificial Intelligence Systems, pp. 466–474. Springer Berlin Heidelberg, Berlin (2008)

10. 10.

Eto, R., Fujimaki, R., Morinaga, S., Tamano, H.: Fully-automatic Bayesian piecewise sparse linear models. J. Mach. Learn. Res. 33, 238–246 (2014)

11. 11.

Burgard, J.P., Dorr, P.: Survey-weighted generalized linear mixed models. In: Research Papers in Economics, No. 1/18, (2018)

12. 12.

Damas, B., Victor, J.S.: Online learning of single- and multivalued functions with an infinite mixture of linear experts. Neural Comput. 25(11), 3044–3091 (2013)

13. 13.

Tan, S.L., Nott, D.J.: Variational approximation of mixtures of linear mixed model. J. Comput. Graph. Stat. 23(2), 564–585 (2014)

14. 14.

Liang, J., Chen, K., Lin, M., Zhang, C., Wang, F.: Robust finite mixture regression for heterogeneous targets. Data Min. Knowl. Disc. 32, 1509–1560 (2018)

15. 15.

Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer, Berlin (1985)

16. 16.

Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)

17. 17.

Codes for experiments. [Online]. https://github.com/syttw23/experiments.git. Accessed 15 Sept 2021

18. 18.

Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

19. 19.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

## Acknowledgements

The authors are very grateful to the reviewers for their helpful comments.

## Funding

This work was supported by JSPS KAKENHI Grant Numbers JP17K06446, JP18K11585, JP19K04914, JP19K14989.

## Author information


### Contributions

Conceptualization: HM, SS, YN, YI and TM; Methodology: HM, SS, YN, YI and TM; Software: HM, YN, and YI; Validation: HM, SS, YN, YI and TM; Formal analysis: HM, SS, YN, YI and TM; Investigation: HM, SS, YN, YI and TM; Resources: HM; Data curation: HM; Writing–original draft preparation: HM, SS, YN; Writing–review & editing: HM, SS, YN, YI and TM; Visualization: HM; Supervision: TM; Project administration: TM; Funding acquisition: SS and TM. All authors have read and agreed to the published version of the manuscript.

### Corresponding author

Correspondence to Haruka Murayama.

## Ethics declarations

### Conflict of Interests

The authors declare they have no competing interests.


## Appendices

### Proof of Proposition 3.1

By applying Bayes’ theorem, the Bayes risk of decision rule $$\delta ({\varvec{D}})$$ can be transformed as

\begin{aligned} \mathrm{BR}(\delta (\cdot ))&=\int _{\mathcal {D}} \biggl \{ \int _{\mathcal {Y}}\left( y_{n+1}-\delta ({\varvec{D}})\right) ^2 \sum _{k=1}^{k_\mathrm{max}} \sum _{\mathcal {Z}_k^{n+1}} \nonumber \\&\int _{\Theta _k} p\left( y_{n\!+\!1}\mid {\varvec{x}}_{n\!+\!1},{\varvec{z}}_k^{n\!+\!1},{\varvec{\theta }}_k,k\right) p\left( {\varvec{z}}_k^{n\!+\!1},{\varvec{\theta }}_k,k \mid {\varvec{D}}\right) d{\varvec{\theta }}_k dy_{n\!+\!1}\biggr \} \times p({\varvec{D}})d{\varvec{D}}. \end{aligned}
(47)

Therefore, minimizing the Bayes risk is equivalent to minimizing the term in the brackets of (47). Let a represent $$\delta ({\varvec{D}})$$, and let F(a) represent the term in the brackets of (47), i.e.,

\begin{aligned}&F(a)\! \equiv \!\int _{\mathcal {Y}}\left( y_{n+1}-a\right) ^2 \sum _{k=1}^{k_\mathrm{max}} \!\sum _{\mathcal {Z}_k^{n+1}}\!\int _{\Theta _k}\! p\left( y_{n+1}\mid {\varvec{x}}_{n+1},{\varvec{z}}_k^{n+1},{\varvec{\theta }}_k,k\right) p\left( {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k,k \mid {\varvec{D}}\right) d{\varvec{\theta }}_k dy_{n+1}. \end{aligned}

By taking the partial derivative of F(a) with respect to a and setting it to 0, we find that F(a) attains its minimum when a is represented as

\begin{aligned} a=\int _{\mathcal {Y}}y_{n+1}{p}^*\left( y_{n+1}\mid {\varvec{D}}\right) dy_{n+1}, \end{aligned}
(48)

where

\begin{aligned}&{p}^*\left( y_{n+1}\mid {\varvec{D}}\right) :=\sum _{k=1}^{k_\mathrm{max}}\sum _{\mathcal {Z}_k^{n\!+\!1}}\int _{\Theta _k}p\left( y_{n\!+\!1}\mid {\varvec{D}}, {\varvec{z}}_k^{n\!+\!1}, {\varvec{\theta }}_k, k\right) p\left( {\varvec{z}}_k^{n\!+\!1}, {\varvec{\theta }}_k, k \mid {\varvec{D}}\right) d{\varvec{\theta }}_k. \end{aligned}
(49)
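For completeness, the differentiation step can be made explicit. Since $${p}^*(y_{n+1}\mid {\varvec{D}})$$ integrates to one,

\begin{aligned} \frac{\partial F(a)}{\partial a}=-2\int _{\mathcal {Y}}\left( y_{n+1}-a\right) {p}^*\left( y_{n+1}\mid {\varvec{D}}\right) dy_{n+1}=-2\left( \int _{\mathcal {Y}}y_{n+1}{p}^*\left( y_{n+1}\mid {\varvec{D}}\right) dy_{n+1}-a\right) , \end{aligned}

and setting this derivative to zero yields (48). Since $$\partial ^2 F/\partial a^2=2>0$$, this stationary point is indeed the minimum.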

### Calculation of Variational Lower Bound

Each term in the variational lower bound (43) can be written as follows.

\begin{aligned}&E[\ln p({\varvec{u}}^{n+1} \mid {\varvec{z}}_k^{n+1},{\varvec{M}}_k,{\varvec{L}}_k)] \end{aligned}
(50)
\begin{aligned}&=\frac{1}{2} \sum _{l_k=1}^k N_{l_k}^{(t)} \left\{ \sum _{j=1}^{p+q} \psi \left( \frac{\nu _{l_k}^{(t)}+1-j}{2}\right) +(p+q)\ln 2+\ln |{\varvec{A}}_{l_k}^{(t)}| -(p+q){\beta _{l_k}^{(t)}}^{-1}-\nu _{l_k}^{(t)} \mathrm{tr}\left( {\varvec{S}}_{l_k}^{(t)} {\varvec{A}}_{l_k}^{(t)}\right) \right. \nonumber \\&\left. \ \ \ -\nu _{l_k}\left( \bar{{\varvec{u}}}_{l_k}^{(t)}-{\varvec{m}}_{l_k}^{(t)}\right) ^{\mathsf {T}}{\varvec{A}}_{l_k}^{(t)}\left( \bar{{\varvec{u}}}_{l_k}^{(t)}-{\varvec{m}}_{l_k}^{(t)}\right) -(p+q)\ln (2\pi )\right\} , \nonumber \\&E[\ln p(y^n \mid {\varvec{v}}_k^{n+1},{\varvec{W}}_k,{\varvec{z}}_k^{n+1})] \end{aligned}
(51)
\begin{aligned}&=-\frac{1}{2} \sum _{l_k=1}^k \left[ N_{l_k}^{(t)} \ln \left( 2 \pi \sigma _{l_k}^2\right) +\frac{1}{\sigma _{l_k}^2}\sum _{i=1}^n r_{il_k}^{(t)}\left\{ y_i^2-2y_i{{\varvec{\mu }}_{wl_k}^{(t)}}^{\mathsf {T}}\tilde{{\varvec{v}}}_i+\tilde{{\varvec{v}}}_i^{\mathsf {T}}\left( {\varvec{\Lambda }}_{wl_k}^{-1}+{\varvec{\mu }}_{wl_k}^{(t)}{{\varvec{\mu }}_{wl_k}^{(t)}}^{\mathsf {T}}\right) \tilde{{\varvec{v}}}_i\right\} \right] , \nonumber \\&E[\ln p({\varvec{z}}_k^{n+1} \mid {\varvec{\pi }}_k)] =\sum _{l_k=1}^k N_{l_k}^{(t)} \left\{ \psi \left( \alpha _{l_k}^{(t)}\right) -\psi \left( \sum _{l_k=1}^k \alpha _{l_k}^{(t)}\right) \right\} , \end{aligned}
(52)
\begin{aligned}&E[\ln p({\varvec{M}}_k,{\varvec{L}}_k)]=\frac{1}{2}\sum _{l_k=1}^k \left[ (p+q)\ln (\beta _{0_k} / 2\pi )+\sum _{j=1}^{p+q} \psi \left( \frac{\nu _{l_k}^{(t)}+1-j}{2}\right) +(p+q) \ln 2+\ln |{\varvec{A}}_{l_k}^{(t)}| \right. \end{aligned}
(53)
\begin{aligned}&\ \ \ -\frac{(p+q)\beta _{0_k}}{\beta _{l_k}^{(t)}} -\beta _{0_k}\nu _{l_k}^{(t)}({\varvec{m}}_{l_k}^{(t)}-{\varvec{m}}_{0_k})^{\mathsf {T}}{\varvec{A}}_{l_k}^{(t)}({\varvec{m}}_{l_k}^{(t)}-{\varvec{m}}_{0_k})-\nu _{0_k}\ln \mid {\varvec{A}}_{0_k}\mid -\nu _{0_k} (p+q)\ln 2 \nonumber \\&\ \ \ -\frac{(p+q)(p+q-1)}{2}\ln \pi -2\sum _{j=1}^{p+q} \ln \Gamma \left( \frac{\nu _{0_k}+1-j}{2} \right) \nonumber \\&\ \ \ \left. +(\nu _{0_k}-p-q-1)\left\{ \sum _{j=1}^{p+q} \psi \left( \frac{\nu _{l_k}^{(t)}+1-j}{2}\right) +(p+q)\ln 2+\ln |{\varvec{A}}_{l_k}^{(t)}|\right\} -\nu _{l_k}^{(t)} \mathrm{tr}({\varvec{A}}_{0_k}^{-1}{\varvec{A}}_{l_k}^{(t)})\right] ,\nonumber \\&E[\ln p({\varvec{\pi }}_k)]=\ln \frac{\Gamma (\sum _{l_k=1}^k \alpha _{0_k})}{\prod _{l_k=1}^k \Gamma (\alpha _{0_k})}+(\alpha _{0_k}-1)\sum _{l_k=1}^k \left\{ \psi \left( \alpha _{l_k}^{(t)}\right) -\psi \left( \sum _{l_k=1}^k \alpha _{l_k}^{(t)}\right) \right\} , \end{aligned}
(54)
\begin{aligned}E[\ln p({\varvec{W}}_k)]&=\frac{1}{2}\sum _{l_k=1}^k \left\{ -(q+r+1)\log (2\pi )+\ln \mid {\varvec{\Lambda }}_{w0_k}\mid -\mathrm{tr}\left( {\varvec{\Lambda }}_{w0_k}({{\varvec{\Lambda }}_{wl_k}^{(t)}}^{-1}+{\varvec{\mu }}_{wl_k}^{(t)}{{\varvec{\mu }}_{wl_k}^{(t)}}^{\mathsf {T}})\right) \right. \end{aligned}
(55)
\begin{aligned}& \left. +2{{\varvec{\mu }}_{wl_k}^{(t)}}^{\mathsf {T}}{\varvec{\Lambda }}_{w0_k}{\varvec{\mu }}_{w0_k}-{\varvec{\mu }}_{w0_k}^{\mathsf {T}}{\varvec{\Lambda }}_{w0_k}{\varvec{\mu }}_{w0_k}\right\} , \nonumber \\&E[\ln q^{(t)}({\varvec{z}}_k^n)]=\sum _{i=1}^{n}\sum _{l_k=1}^k r_{il_k}^{(t)}\ln r_{il_k}^{(t)}, \end{aligned}
(56)
\begin{aligned}&E[\ln q^{(t)}({\varvec{z}}_{n+1\,k})]=\sum _{l_k=1}^k \varphi _{l_k}^{(t)}\ln \varphi _{l_k}^{(t)}, \end{aligned}
(57)
\begin{aligned}E[\ln q^{(t)}({\varvec{M}}_k,{\varvec{L}}_k)]&=\sum _{l_k=1}^k \left[ \frac{1}{2}\left\{ \sum _{j=1}^{p+q}\psi \left( \frac{\nu _{l_k}^{(t)}+1-j}{2}\right) +(p+q)\ln 2+\ln |{\varvec{A}}_{l_k}^{(t)}| \right\} \right. \end{aligned}
(58)
\begin{aligned}& \left. +\frac{p+q}{2} \ln \left( \frac{\beta _{l_k}^{(t)}}{2\pi }\right) -\frac{p+q}{2}-H[q^{(t)}({\varvec{\Lambda }}_{l_k}^{(t)})]\right] ,\nonumber \\&E[\ln q^{(t)}({\varvec{\pi }}_k)]=\ln \frac{\Gamma (\sum _{l_k=1}^k \alpha _{l_k}^{(t)})}{\prod _{l_k=1}^k \Gamma (\alpha _{l_k}^{(t)})}+\sum _{l_k=1}^k (\alpha _{l_k}^{(t)}-1)\left\{ \psi \left( \alpha _{l_k}^{(t)}\right) -\psi \left( \sum _{l_k=1}^k \alpha _{l_k}^{(t)}\right) \right\} , \end{aligned}
(59)
\begin{aligned}&E[\ln q^{(t)}({\varvec{W}}_k)]=\frac{1}{2}\sum _{l_k=1}^k \left[ -(q+r+1)\ln (2\pi )+\ln \mid {\varvec{\Lambda }}_{wl_k}\mid -\mathrm{tr}\left( {\varvec{\Lambda }}_{wl_k^{(t)}}\left( {{\varvec{\Lambda }}_{wl_k}^{(t)}}^{-1}+{\varvec{\mu }}_{wl_k}^{(t)}{{\varvec{\mu }}_{wl_k}^{(t)}}^{\mathsf {T}}\right) \right) +{{\varvec{\mu }}_{wl_k}^{(t)}}^{\mathsf {T}}{\varvec{\Lambda }}_{wl_k}^{(t)}{\varvec{\mu }}_{wl_k}^{(t)}\right] . \end{aligned}
(60)

Here, $$\psi (\cdot )$$ represents the digamma function, and $$H[\cdot ]$$ represents the entropy of the Wishart distribution.

### Proof of Equation (45)

By substituting the approximated posterior distribution $$q^{(T)}({\varvec{z}}_k^{n+1},{\varvec{\theta }}_k, k)$$ for the posterior distribution $$p({\varvec{z}}_k^{n\!+\!1}, {\varvec{\theta }}_k, k \mid {\varvec{D}})$$ in (14), we obtain

\begin{aligned}&p^{*}(y_{n+1}\mid {\varvec{D}}) \nonumber \\&=\sum _{k=1}^{k_\mathrm{max}}\sum _{\mathcal {Z}_k^{n\!+\!1}}\int _{\Theta _k}p\left( y_{n\!+\!1}\mid {\varvec{D}}, {\varvec{z}}_k^{n\!+\!1}, {\varvec{\theta }}_k, k\right) p\left( {\varvec{z}}_k^{n\!+\!1}, {\varvec{\theta }}_k, k \mid {\varvec{D}}\right) d{\varvec{\theta }}_k \end{aligned}
(61)
\begin{aligned}&\approx \sum _{k=1}^{k_\mathrm{max}}\sum _{\mathcal {Z}_k^{n\!+\!1}}\int _{\Theta _k}p\left( y_{n\!+\!1}\mid {\varvec{D}}, {\varvec{z}}_k^{n\!+\!1}, {\varvec{\theta }}_k, k\right) q^{(T)}\left( {\varvec{z}}_k^{n+1},{\varvec{\theta }}_k, k\right) d{\varvec{\theta }}_k \end{aligned}
(62)
\begin{aligned}&=\sum _{k=1}^{k_\mathrm{max}}\sum _{\mathcal {Z}_k}\int _{\mathcal {W}_k}p\left( y_{n\!+\!1}\mid {\varvec{v}}_{n+1},{\varvec{z}}_{n+1\,k},{\varvec{W}}_k,k\right) q^{(T)}({\varvec{z}}_{n+1\,k})q^{(T)}({\varvec{W}}_k)q^{(T)}(k)d{\varvec{W}}_k \end{aligned}
(63)
\begin{aligned}&=\sum _{k=1}^{k_\mathrm{max}}q^{(T)}(k)\sum _{\mathcal {Z}_k}\int _{\mathcal {W}_k} \prod _{l_k=1}^k \mathcal {N}\left( y_{n+1}\mid {{\varvec{w}}_{l_k}^{(T)}}^{\mathsf {T}}\tilde{{\varvec{v}}}_{n+1}, \sigma _{l_k}^2\right) ^{z_{n+1\ l_k}}\cdot \prod _{l_k=1}^k {\varphi _{l_k}^{(T)}}^{z_{n+1\ l_k}}\cdot \prod _{l_k=1}^k \mathcal {N}\left( {\varvec{w}}_{l_k} \mid {\varvec{\mu }}_{wl_k}^{(T)}, {{\varvec{\Lambda }}_{wl_k}^{(T)}}^{-1}\right) d{\varvec{W}}_k \end{aligned}
(64)
\begin{aligned}&=\sum _{k=1}^{k_\mathrm{max}}q^{(T)}(k)\sum _{\mathcal {Z}_k}\int _{\mathcal {W}_k} \prod _{l_k=1}^k \left[ \left\{ \varphi _{l_k}^{(T)} \mathcal {N}\left( y_{n\!+\!1}\mid {{\varvec{w}}_{l_k}^{(T)}}^{\mathsf {T}}\tilde{{\varvec{v}}}_{n\!+\!1},\sigma _{l_k}^2\right) \right\} ^{z_{n\!+\!1\ l_k}} \cdot \mathcal {N}\left( {\varvec{w}}_{l_k} \mid {\varvec{\mu }}_{wl_k}^{(T)},{{\varvec{\Lambda }}_{wl_k}^{(T)}}^{-1}\right) \right] d{\varvec{W}}_k. \end{aligned}
(65)

Here, defining $$l_k^*$$ as the $$l_k$$ for which $$z_{n+1\ l_k}=1$$, we can rewrite (65) as

\begin{aligned}&\sum _{k=1}^{k_\mathrm{max}}q^{(T)}(k)\sum _{l_k^*=1}^k\int _{\mathcal {W}}\!\left\{ \! \varphi _{l_k^*}^{(T)} \mathcal {N}\left( y_{n+1}\!\mid \!{{\varvec{w}}_{l_k^*}^{(T)}}^{\mathsf {T}}\!\tilde{{\varvec{v}}}_{n+1},\sigma _{l_k^*}^2\right) \cdot \mathcal {N}({\varvec{w}}_{l_k^*} \!\mid \!{\varvec{\mu }}_{wl_k^*}^{(T)},{{\varvec{\Lambda }}_{wl_k^*}^{(T)}}^{\!-\!1}\!) \cdot \!\prod _{l_k \ne l_k^*}\mathcal {N}\left( {\varvec{w}}_{l_k} \mid {\varvec{\mu }}_{wl_k}^{(T)},{{\varvec{\Lambda }}_{wl_k}^{(T)}}^{-1} \right) \right\} d{\varvec{W}}_k \end{aligned}
(66)
\begin{aligned}&=\sum _{k=1}^{k_\mathrm{max}}q^{(T)}(k) \sum _{l_k^*=1}^k\varphi _{l_k^*}^{(T)}\mathcal {N}\left( y_{n+1} \left| \ {{\varvec{\mu }}_{wl_k^*}^{(T)}}^{\mathsf {T}}\tilde{{\varvec{v}}}_{n+1}, \frac{1}{\sigma _{l_k^*}^2}+\tilde{{\varvec{v}}}_{n+1}^\mathsf {T}\left( {{\varvec{\Lambda }}_{wl_k^*}}^{(T)}\right) ^{-1}\tilde{{\varvec{v}}}_{n+1}\right. \right) \prod _{l_k \ne l_k^*}\underbrace{\int _{\mathcal {W}_{l_k}}\mathcal {N}({\varvec{w}}_{l_k} \mid {\varvec{\mu }}_{wl_k}^{(T)},{{\varvec{\Lambda }}_{wl_k}^{(T)}}^{-1} )d{\varvec{w}}_{l_k}}_{=1} \end{aligned}
(67)
\begin{aligned}&=\sum _{k=1}^{k_\mathrm{max}}\!q^{(T)}(k)\! \sum _{l_k=1}^k \!\varphi _{l_k}^{(T)} \mathcal {N}\!\left( y_{n\!+\!1} \left| \ {{\varvec{\mu }}_{wl_k}^{(T)}}^{\mathsf {T}}\tilde{{\varvec{v}}}_{n\!+\!1}, \! \frac{1}{\sigma _{l_k}^2}\!+\!\tilde{{\varvec{v}}}_{n\!+\!1}^{\mathsf {T}}\left( {\varvec{\Lambda }}_{wl_k}^{(T)}\right) ^{-1}\!\tilde{{\varvec{v}}}_{n\!+\!1}\right. \right) \!. \end{aligned}
(68)

### Proof of Proposition 4.3

As proven in Proposition 3.1, the optimal prediction under Bayes criterion is the mean of the predictive distribution $$p^*(y_{n+1} \mid {\varvec{D}})$$. Here, we use the approximated predictive distribution derived in (45) and take its mean. Then, we obtain the prediction $$\delta ^*({\varvec{D}})$$ in the approximated form

\begin{aligned} \delta ^*({\varvec{D}})\approx \sum _{k=1}^{k_\mathrm{max}}q^{(T)}(k)\left( \sum _{l_k=1}^k \varphi _{l_k}^{(T)} {{\varvec{\mu }}_{wl_k}^{(T)}}^{\mathsf {T}} \tilde{{\varvec{v}}}_{n+1}\right) . \end{aligned}
(69)

## Rights and permissions


Murayama, H., Saito, S., Iikubo, Y. et al. Cluster’s Number Free Bayes Prediction of General Framework on Mixture of Regression Models. J Stat Theory Appl 20, 425–449 (2021). https://doi.org/10.1007/s44199-021-00001-5