1 Introduction

1.1 Motivation

This paper is concerned with the issue of detecting changes of a model that lies behind a data stream. Here a model refers to discrete structural information, such as the number of free parameters, in the mechanism for generating the data. We consider the situation where the model changes over time. In this setting, it is important to detect model changes as accurately as possible, because they may correspond to important events. For example, it is reported in [14] that when customers’ behaviors are modeled using a Gaussian mixture model, a change in the number of mixture components corresponds to the emergence or disappearance of a cluster of customers’ behaviors. In this case a model change implies a change of the market trend. For another example, it is reported in [35] that when syslog behaviors are modeled using a mixture of hidden Markov models, a change in the number of mixture components may correspond to a system failure.

The issue of model change detection has been extensively explored. This paper is rather concerned with the issue of detecting signs, or early warning signals, of model changes. Why is it important to detect such signs? One reason is that if they were detected earlier than the changes themselves, we could predict the changes before they were actualized. The other reason is that if they were detected after the changes themselves, we could analyze the cause of the changes retrospectively.

A model, say the number of parameters, is in general an integer-valued index. Therefore, a model change appears to occur abruptly. However, it is reasonable to suppose that some intrinsic change, which we call a latent change, occurs gradually behind the model change. We may then define a sign of the model change as the starting point of the latent change. Therefore, if we properly define a real-valued index to quantify the model dimensionality in the transition period, we can understand how rapidly the latent change is progressing and can detect signs of model changes by tracking the rise-up/descent of the index (Fig. 1).

Fig. 1

Transition period of dimensionality change. We consider the situation where the clustering structure changes over time so that the number k of clusters changes from \(k=2\) to \(k=3\). Here k can be thought of as an integer-valued model dimensionality, called the parametric dimensionality. If we define a real-valued intrinsic model dimensionality, called the descriptive dimensionality, then we can quantify the dimensionality in the transition period. For example, k becomes 2.5 at some time point. By tracking the rise-up/descent of such a real-valued model dimensionality, we are able to detect signs of increase of the number of clusters

The key idea of this paper is to employ the notion of descriptive dimensionality (Ddim) for the quantification of a model in the transition period. Ddim is a real-valued index that quantifies the model dimensionality in the case where a number of models are mixed. We thereby establish a methodology of continuous model selection, which determines the optimal real-valued model dimensionality from data on the basis of Ddim. In the transition period of model changes, the mixing structure of models may change over time. Hence, by tracking the rise-up/descent of Ddim, we will be able to track the latent changes behind model changes.

The purpose of this paper is twofold: One is to establish a novel methodology for detecting signs (or early warning signals) of model changes from a data stream. We realize this by using Ddim for the quantification of model dimensionality in its transition period. The theory of Ddim is developed on the basis of the minimum description length (MDL) principle [24] in combination with the theory of the box counting dimension. The other is to empirically validate the effectiveness of the methodology using synthetic and real data sets. We evaluate how early and how reliably it can raise alarms for signs of model changes.

1.2 Related work

Model change detection has been studied in the scenario of dynamic model selection (DMS) developed in [34, 35]. Model change detection is different from classical continuous parameter change detection. Taking finite mixture models as an example, the former is to detect changes in the number of components, while the latter is to detect changes in the real-valued parameters of individual components or in the mixing parameters. In [34, 35], the authors proposed the DMS algorithm, which outputs a model sequence of the shortest description length, on the basis of the MDL principle [24]. They demonstrated its effectiveness from the empirical and information-theoretic aspects. The MDL-based model change detection has been further theoretically justified in [33]. Problems similar to model change detection have been discussed in the scenarios of switching distributions [8], tracking best experts [13], on-line clustering [27], cluster evolution [21], Bayesian change detection [29], and structure break detection for autoregression models [3]. In all of these previous studies, however, a model change was considered to be an abrupt change of a discrete structure. The transition period of changes has never been analyzed there. In the conventional state-space model, change detection of continuous states has been addressed (see e.g. [5]). However, the state itself does not have the same meaning as a model in our sense; it is rather the number of states that corresponds to what we call a model.

Table 1 Comparison of related works on model change detection
Table 2 Comparison of related works on dimensionality

Changes that occur incrementally rather than abruptly have been discussed in the context of detecting incremental changes in concept drift [10], gradual changes [36], volatility shifts [17], etc. However, it has never been quantitatively analyzed how rapidly a model changes in the transition period.

Recently, the indices of structural entropy [16] and graph-based entropy [22] have been developed for measuring the uncertainty associated with model changes. Although they can be thought of as early warning signals of model changes, they cannot quantify the intrinsic model dimensionality nor explain how rapidly a model changes in the transition period. A change sign detection method using differential MDL change statistics has been proposed in [37]. However, it applies to change sign detection for parameters only. We summarize the related work in Table 1 from the viewpoints of abrupt model change detection, model change sign detection, and quantification of model dimensionality.

This paper proposes a methodology for analyzing model transition in terms of real-valued dimensionality. A number of notions of dimensionality have been proposed in the areas of physics and statistics. The metric dimension was proposed by Kolmogorov and Tihomirov [18] to measure the complexity of a given set of points in terms of the notion of covering numbers. This evolved into the notion of the box counting dimension, equivalently, the fractal dimension [20]. It is a real-valued index for quantifying the complexity of a given set. It is also related to the capacity [7]. The Vapnik-Chervonenkis dimension was proposed to measure the power of representation of a given class of functions [28]. It was also related to the rate of uniform convergence of estimating functions. See [12] for relations between dimensionality and learning. Dimensionality as a power of representation is conventionally integer-valued, but when it changes over time, there is no effective non-integer-valued quantification of its transition. The previous notions of dimensionality are summarized in Table 2 from the viewpoints of integer/real-valuedness, characterization of learning rate, and quantification of model change.

Preliminary versions of this paper appeared on arXiv [31, 32].

1.3 Significance of this paper

The significance of this paper is summarized as follows: (1) Proposal of a novel methodology for detecting signs of model changes with continuous model selection. This paper proposes a novel methodology for detecting signs of model changes. The key idea is to track model transitions with continuous model selection using the notion of descriptive dimensionality  (Ddim). It measures the model dimensionality in the case where a number of models with different dimensionalities are mixed.

For example, we employ the Gaussian mixture model (GMM) to consider the situation where the number of mixture components changes over time. We suppose that in the transition period of model change, a number of probabilistic models with various mixture sizes are fused. We give a method for calculating Ddim for this case. The transition period of model change can be visualized by drawing a Ddim graph versus time. Once a Ddim graph is obtained, we can understand how rapidly the model changes over time. We eventually detect signs of model changes by tracking the rise-up/descent of Ddim. This methodology is significantly important in data mining since it helps us predict model changes in earlier stages.

(2) Empirical demonstration of effectiveness of model change sign detection via Ddim. We empirically validate how early we are able to detect signs of model changes with continuous model selection, for GMMs and auto-regression (AR) models. With synthetic data sets and real data sets, we illustrate that our method is able to effectively visualize the transition period of model change using Ddim. We further empirically demonstrate that our methodology is able to detect signs of model changes significantly earlier than any existing dynamic model selection algorithms and is comparable to structural entropy in [16]. Through our empirical analysis, we demonstrate that Ddim is an effective index for measuring the model dimensionality in the model transition period.

(3) Giving theoretical foundations for Ddim. In this paper, Ddim plays a central role in continuous model selection. We introduce this notion from an information-theoretic view based on the MDL principle [24] (see also [11]). We show that Ddim coincides with the number of free parameters in the case where the model consists of a single parametric class. We also derive Ddim for the case where a number of models with different dimensionalities are mixed. We characterize Ddim by demonstrating that it governs the rate of convergence of the MDL-based learning algorithm. This corresponds to the fact that the metric dimensionality governs the rate of convergence of the empirical risk minimization algorithm in statistical learning theory [12].

The rest of this paper is organized as follows: Section 2 introduces the notion of Ddim. Section 3 gives a methodology for model change sign detection via Ddim. Section 4 shows experimental results. Section 5 characterizes Ddim by relating it to the rate of convergence of the MDL learning algorithm. Section 6 concludes the paper. Source codes and data sets are available at a GitHub repository [38].

2 Descriptive dimensionality

2.1 NML and parametric complexity

This section introduces the theory of Ddim. This theory is based on the MDL principle (see [24] for the original paper and [25] for the recent advances) from the viewpoint of information theory. We start by introducing a number of fundamental notions of the MDL principle.

Let \({\mathcal X}\) be the data domain where \({\mathcal X}\) is either discrete or continuous. Without loss of generality, we assume that \({\mathcal X}\) is discrete. Let \({\textbf{x}}=x_{1},\dots ,x_{n}\in {\mathcal X}^{n}\) be a data sequence of length n. We assume that each \(x_{i}\) is independently generated. Let \({\mathcal P}=\{p({\textbf{x}})\}\) be a class of probabilistic models where \(p({\textbf{x}})\) is a probability mass function or a probability density function. Hereafter, we assume that for any \({\textbf{x}}\), the maximum of \(p({\textbf{x}})\) with respect to p exists.

Under the MDL principle, the information of a datum \({\textbf{x}}\) is measured in terms of description length, i.e., the codelength required for encoding the datum with a prefix coding method. We may encode \({\textbf{x}}\) with the help of a class \({\mathcal P}\) of probability distributions. One of the most important methods for calculating the codelength of \({\textbf{x}}\) using \({\mathcal P}\) is the normalized maximum likelihood (NML) coding [25]. This is defined as the codelength associated with the NML distribution as follows:

Definition 1

We define the normalized maximum likelihood (NML) distribution over \({\mathcal X}^{n}\) with respect to \({\mathcal P}\) by

$$\begin{aligned} p_{_\textrm{NML}}({\textbf{x}};{\mathcal P})\buildrel \text {def} \over =\frac{\max _{p\in {\mathcal P}}p({\textbf{x}})}{\sum _{{\textbf{y}}} \max _{p\in {\mathcal P}}p({\textbf{y}})}. \end{aligned}$$
(1)

The normalized maximum likelihood (NML) codelength of \({\textbf{x}}\) relative to \({\mathcal P}\), which we denote as \(L_{_\textrm{NML}}({\textbf{x}}; {\mathcal P})\), is given as follows:

$$\begin{aligned} \begin{aligned} L_{_\textrm{NML}}({\textbf{x}};{\mathcal P})\buildrel \text {def} \over =&-\log p_{_\textrm{NML}}({\textbf{x}}; {\mathcal P}) \\ =&-\log \max _{p\in {\mathcal P}}p({\textbf{x}})+\log {\mathcal C}_{n}({\mathcal P}),\\ \end{aligned} \end{aligned}$$
(2)

where

$$\begin{aligned} \log {\mathcal C}_{n}({\mathcal P})\buildrel \text {def} \over =\log \sum _{{\textbf{y}}} \max _{p\in {\mathcal P}}p({\textbf{y}}). \end{aligned}$$
(3)

The first term in (2) is the negative logarithm of the maximum likelihood, while the second term (3) is the logarithm of the normalization term. The latter is called the parametric complexity of \({\mathcal P}\) [25]. It represents the information-theoretic complexity of the model class \({\mathcal P}\) relative to the length n of the data sequence. The NML codelength can be thought of as an extension of the Shannon information \(-\log p({\textbf{x}})\) to the case where the true model p is unknown but only \({\mathcal P}\) is known.
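As a concrete illustration, the following sketch computes the NML codelength (2) for the one-parameter Bernoulli class, evaluating the normalization term (3) by brute-force enumeration over \(\{0,1\}^{n}\). The function names and the toy sequence are our own; the sketch is only meant to make the definitions tangible for small n.

```python
import math
from itertools import product

def max_loglik_bernoulli(x):
    """Maximized log-likelihood max_p log p(x) over the Bernoulli class."""
    n, k = len(x), sum(x)
    if k in (0, n):                      # the ML probability of the sequence is then 1
        return 0.0
    return k * math.log(k / n) + (n - k) * math.log((n - k) / n)

def nml_codelength_bernoulli(x):
    """NML codelength (2): negative maximized log-likelihood plus log C_n from (3)."""
    n = len(x)
    log_C = math.log(sum(math.exp(max_loglik_bernoulli(y))
                         for y in product((0, 1), repeat=n)))
    return -max_loglik_bernoulli(x) + log_C

print(nml_codelength_bernoulli((1, 0, 0, 1, 1, 0, 1, 1)))  # codelength in nats
```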

In order to understand the meaning of the NML codelength and the parametric complexity, we define the minimax regret as follows:

$$\begin{aligned} R_{n}({\mathcal P}) \buildrel \text {def} \over =\min _{q}\max _{{\textbf{x}}}\left\{ -\log q({\textbf{x}})-\min _{p\in {\mathcal P}}(-\log p({\textbf{x}}))\right\} , \end{aligned}$$

where the minimum is taken over the set of all probability distributions. The minimax regret means the descriptive complexity of the model class, indicating how much any codelength must deviate from the smallest negative log-likelihood over the model class. Shtarkov [26] proved that the NML distribution (1) is optimal in the sense that it attains the minimum of the minimax regret. In this sense the NML codelength is the optimal codelength for encoding \({\textbf{x}}\) for given \({\mathcal P}\). Then we can immediately see that the minimax regret coincides with the parametric complexity. That is,

$$\begin{aligned} R_{n}({\mathcal P})=\log {\mathcal C}_{n}({\mathcal P}). \end{aligned}$$
(4)

We next consider how to calculate the parametric complexity. According to [25] (pp:43-44), the parametric complexity can be rewritten using a variable transformation technique as follows:

$$\begin{aligned} C_{n}({\mathcal P})= \sum _{{\textbf{y}}} \max _{p\in {\mathcal P}}p({\textbf{y}}) =\int g(\hat{p}, \hat{p})d\hat{p}, \end{aligned}$$
(5)

where \(g(\hat{p},p)\) is defined as

$$\begin{aligned} g(\hat{p}, p) \buildrel \text {def} \over = \sum _{{\textbf{y}}:\max _{\bar{p}\in {\mathcal P}}\bar{p}({\textbf{y}})=\hat{p}({\textbf{y}})} p({\textbf{y}}). \end{aligned}$$
(6)

2.2 Definition of descriptive dimension

Below we give the definition of Ddim from the viewpoint of approximating the parametric complexity, or equivalently, the minimax regret (by (4)). The scenario for defining Ddim is as follows: We first count how many points are required to approximate the parametric complexity (5) with quantization. We consider that count as the information-theoretic richness of representation of a model class. We then employ that count to define Ddim in a manner similar to the box counting dimension.

We consider approximating (5) by a finite sum of partial integrals of \(g(\hat{p},\hat{p})\). Let \(\overline{{\mathcal P}}=\{p _{1}, p _{2},\dots \}\subset {\mathcal P}\) be a finite subset of \({\mathcal P}\). Let \(\epsilon \) be the parameter defining the diameter of the neighborhood of a given probability distribution. For \(\epsilon >0\) and \(p_{i}\in \overline{{\mathcal P}}\), let \(D_{\epsilon }^{n}(i)\buildrel \text {def} \over =\{p\in {\mathcal P} :\ d_{n}(p_{i},p)\le \epsilon ^{2}\}\), where \(d_{n}\) is the Kullback-Leibler (KL) divergence normalized by n:

$$\begin{aligned} d_{n}(p, p_{i})=\frac{1}{n}\sum _{{\textbf{x}}} p_{i}({\textbf{x}})\log \frac{p_{i}({\textbf{x}})}{p({\textbf{x}})}. \end{aligned}$$

Then we approximate \(C_{n}({\mathcal P})\) by

$$\begin{aligned} \overline{{C}_{n}}(\overline{{\mathcal P}}){\buildrel \text {def} \over = }\sum _{i} Q_{\epsilon }(i), \end{aligned}$$
(7)

where

$$\begin{aligned} Q_{\epsilon }(i){\buildrel \text {def} \over =}\int _{\hat{p}\in D_{\epsilon }^{n}(i)}g(\hat{p}, \hat{p})d\hat{p}. \end{aligned}$$
(8)

That is, (7) gives an approximation to \(C_{n}({\mathcal P})\) by a finite sum of integrals of \(g(\hat{p}, \hat{p})\) over the \(\epsilon ^{2}\)-neighborhoods of the points \(p_{i}\). We define \(m_{n}(\epsilon :{\mathcal P})\) as the smallest number of points \(\mid \overline{\mathcal P}\mid \) with respect to \(\overline{\mathcal P}\) such that \(C_{n}({\mathcal P}) \le \overline{C}_{n}(\overline{{\mathcal P}})\). More precisely,

$$\begin{aligned} m_{n}(\epsilon :{\mathcal P}){\buildrel \text {def} \over =}\min _{\overline{{\mathcal P}}} \mid \overline{{\mathcal P}}\mid \ \ \text {subject to}\ C_{n}({\mathcal P})\le \overline{C_{n}}(\overline{{\mathcal P}}). \end{aligned}$$
(9)

We are now led to the definition of descriptive dimension.

Definition 2

[31] Let \({\mathcal P}\) be a class of probability distributions. We let \(m(\epsilon :{\mathcal P})\) be the one obtained by choosing \(\epsilon ^{2}n=O(1)\) in \(m_{n}(\epsilon :{\mathcal P} )\) as in (9). We define the descriptive dimension (Ddim) of \({\mathcal P}\) by

$$\begin{aligned} \text {Ddim}({\mathcal P}){\buildrel \text {def} \over =}\lim _{\epsilon \rightarrow 0}\frac{\log m(\epsilon : {\mathcal P})}{\log (1/\epsilon )}, \end{aligned}$$
(10)

when the limit exists.

The definition of Ddim is similar to that of the box counting dimension [7, 9, 20]. The main difference between them is how the number of points is counted. Ddim is calculated on the basis of the number of points required for approximating the parametric complexity, while the box counting dimension is calculated on the basis of the number of points required for covering a given object with their \(\epsilon \)-neighborhoods.
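The limit (10) also suggests a simple numerical reading: Ddim is the slope of \(\log m(\epsilon )\) plotted against \(\log (1/\epsilon )\). The sketch below estimates such a slope from synthetic counts obeying \(m(\epsilon )\propto \epsilon ^{-d}\); it is a consistency check of the definition rather than a procedure taken from the paper.

```python
import numpy as np

def dimension_from_counts(eps, counts):
    """Estimate a dimension as the least-squares slope of log m(eps) versus log(1/eps)."""
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(eps)), np.log(np.asarray(counts)), 1)
    return slope

eps = np.array([0.1, 0.05, 0.02, 0.01, 0.005])
counts = 3.0 * eps ** (-2.5)               # synthetic counts with m(eps) = c * eps^(-d), d = 2.5
print(dimension_from_counts(eps, counts))  # approximately 2.5
```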

Consider the case where \({\mathcal P}_{k}\) is a k-dimensional parametric class, i.e., \({\mathcal P}_{k}=\{p({\textbf{x}};\theta ):\ \theta \in \Theta _{k}\subset {\mathbb R}^{k}\}\), where \(\Theta _{k}\) is a k-dimensional real-valued parameter space. Let \(p({\textbf{x}};\theta )=f({\textbf{x}}\mid \hat{\theta }({\textbf{x}}))g(\hat{\theta }({\textbf{x}});\theta )\) for the conditional probabilistic mass function \(f({\textbf{x}}\mid \hat{\theta }({\textbf{x}}))\). We then write g according to (6) as follows

$$\begin{aligned} g(\hat{\theta },\theta )= \sum \limits _{{\textbf{x}}:\text {argmax}_{\theta }p({\textbf{x}};\theta )=\hat{\theta }}p({\textbf{x}};\theta ). \end{aligned}$$
(11)

Assume that the central limit theorem holds for the maximum likelihood estimator of a parameter vector \(\theta \). Then according to [25], we can take a Gaussian density function as (11) asymptotically. That is, for sufficiently large n, (11) can be approximated as:

$$\begin{aligned} g(\hat{\theta }, \theta )\simeq \left( \frac{n}{2\pi }\right) ^{\frac{k}{2}}\mid I_{n}(\theta )\mid ^{\frac{1}{2}}e^{-n(\hat{\theta }-\theta )^{\top }I_{n}(\theta )(\hat{\theta }-\theta )/2}, \end{aligned}$$
(12)

where \(I_{n}(\theta ){\buildrel \text {def} \over =} (1/n)E_{\theta }[-\partial ^{2}\log p({\textbf{x}};\theta )/\partial \theta \partial \theta ^{\top }]\) is the Fisher information matrix.

The following theorem shows the basic property of \(m_{n}(\epsilon :{\mathcal P}_{k})\) for the parametric case.

Theorem 1

Suppose that \(p({\textbf{x}};\theta )\in {\mathcal P}_{k}\) is continuously three-times differentiable with respect to \(\theta \). Under the assumption of the central limit theorem so that (12) holds, for sufficiently large n, we have

$$\begin{aligned} \log C_{n}({\mathcal P}_{k}) = \log m_{n}(1/\sqrt{n} :{\mathcal P}_{k})+O(1). \end{aligned}$$
(13)

The proof is given in the Appendix.

It is known [25] (p. 53) that under a regularity condition ensuring that the central limit theorem holds for the maximum likelihood estimator of \(\theta \), the parametric complexity for \({\mathcal P}_{k}\) is asymptotically expanded as

$$\begin{aligned} \log C_{n}({\mathcal P}_{k})=\frac{k}{2}\log \frac{n}{2\pi }+\log \int \sqrt{\mid I(\theta )\mid }d\theta +o(1), \end{aligned}$$
(14)

where \(I(\theta )\) is the Fisher information matrix: \(I(\theta )\buildrel \text {def} \over =\text {lim}_{n\rightarrow \infty }(1/n)\times \) \({\text {E}}_{\theta }[-\partial ^{2}\log p({\textbf{x}};\theta )/\partial \theta \partial \theta ^{\top }]\). Plugging (13) with (14) for \(\epsilon ^{2}n=O(1)\) into (10) yields the following theorem.

Theorem 2

For a k-dimensional parametric class \({\mathcal P}_{k}\), under the regularity condition for \({\mathcal P}_{k}\) as in Theorem 1, we have

$$\begin{aligned} \textrm{Ddim}({\mathcal P}_{k})=k. \end{aligned}$$
(15)

Theorem 2 shows that when the model class is a single parametric one, Ddim coincides with the conventional notion of dimensionality (the number of free parameters), which we call the parametric dimensionality in the rest of this paper.
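In outline, the computation behind Theorem 2 combines (13) and (14) with the choice \(\epsilon =1/\sqrt{n}\) of Definition 2 (suppressing the \(O(1)\) and \(o(1)\) terms):

$$\begin{aligned} \log m(\epsilon :{\mathcal P}_{k})&=\log C_{n}({\mathcal P}_{k})+O(1)=\frac{k}{2}\log \frac{n}{2\pi }+O(1)=k\log \frac{1}{\epsilon }+O(1),\\ \text {Ddim}({\mathcal P}_{k})&=\lim _{\epsilon \rightarrow 0}\frac{k\log (1/\epsilon )+O(1)}{\log (1/\epsilon )}=k. \end{aligned}$$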

Ddim can also be defined even for the case where the model class is not a single parametric class. Hence Theorem 2 implies that Ddim is a natural extension of the parametric dimensionality.

Let us consider model fusion where a number of model classes are probabilistically mixed. Let \({\mathcal F}=\{ {\mathcal P}_{1},\dots , {\mathcal P}_{s}\}\) be a family of model classes and assume a model class is probabilistically distributed according to \(p({\mathcal P})\) over \({\mathcal F}\). We denote the model fusion over \({\mathcal F}\) as \({\mathcal F}^{\odot }={\mathcal P}_{1}\odot \cdots \odot {\mathcal P}_{s}\). We may interpret the resulting distribution over \({\mathcal X}\) as a finite mixture model [19] of a number of model classes with different dimensionalities. Then Ddim of \({\mathcal F}^{\odot }\) is calculated as

$$\begin{aligned} \lim _{\epsilon \rightarrow 0}\frac{\log E_{{\mathcal P}}[m(\epsilon : {\mathcal P})]}{\log (1/\epsilon )}\ge & {} \lim _{\epsilon \rightarrow 0}\sum ^{s}_{i=1}p({\mathcal P}_{i})\frac{\log m(\epsilon : {\mathcal P}_{i})}{\log (1/\epsilon )}\nonumber \\= & {} \sum _{i=1}^{s}p({\mathcal P}_{i})\text {Ddim}({\mathcal P}_{i}), \end{aligned}$$
(16)

where we have used Jensen’s inequality to derive the first inequality. We call the lower bound (16) the pseudo Ddim for model fusion \({\mathcal F}^{\odot }\). In the rest of this paper, we adopt it as the Ddim value for model fusion and write it as \(\underline{\text {Ddim}}({\mathcal F}^{\odot })\). Model fusion is a reasonable setting when we consider the transition period of model changes. Note that Ddim is then no longer integer-valued.

Fig. 2

Continuous model selection. We consider the situation where the number k of clusters in a GMM changes from two to three as time goes by. In the transition period, one cluster gradually collapses into two clusters. Ddim is a continuous variant of the number of clusters, hence Ddim may change continuously in the transition period, taking a value in (2, 3). It represents the gradual change of the clustering structure well

3 Model change sign detection

3.1 Continuous model selection for GMMs

This section proposes a methodology for detecting signs of model changes with continuous model selection. We first focus on the case where the model is a Gaussian mixture model (GMM). The problem setting is as follows: At each time we obtain a number of unlabeled multi-dimensional examples. By observing such examples sequentially, we obtain a data stream of the examples. At each time, we may conduct clustering of the examples using GMMs. Assuming that the number of components in the GMM may change over time, we aim to detect such changes and their signs.

The key idea is to conduct continuous model selection, which is to determine the real-valued model dimensionality on the basis of Ddim. Below we give a scenario of continuous model selection with applications to model change sign detection. Let \({\mathcal P}_{k}\) be a class of GMMs with k components. We consider the situation where the structure of GMM gradually changes over time (Fig. 2), while the model k may abruptly change. The key observation is that during the model transition period, model fusion occurs where a number of GMMs with different ks are probabilistically mixed according to the posterior probability distribution. Then model dimensionality in the transition period can be calculated as Ddim of model fusion. Thus we can detect signs of model changes by tracking the rise-up/descent of Ddim.

Below let us formalize the above scenario. Let \({\mathcal X}\) be an m-dimensional real-valued domain and let \(x\in {\mathcal X}\) be an observed datum. Let \(z\in \{1,\dots , k\}\) be a latent variable indicating which component x comes from. Let \(\mu _{i}\in {\mathbb R}^{m}\), \(\Sigma _{i}\in {\mathbb R}^{m\times m}\) be the mean vector and variance-covariance matrix for the ith component, respectively. Let \(\mu =(\mu _{1},\dots ,\mu _{k})\) and \(\Sigma =(\Sigma _{1},\dots , \Sigma _{k})\). Let \(\sum _{i}\pi _{i}=1,\ \pi _{i}\ge 0\ (i=1,\dots ,k)\). Let \(\theta =(\mu _{i}, \Sigma _{i}, \pi _{i})\mid _{i=1,\dots ,k}\). Then a complete variable model of GMM with k-components is given by

$$\begin{aligned} p( x,z; \theta ,k )= & {} p(x\mid z; \mu , \Sigma )p(z; \pi ), \end{aligned}$$

where

$$\begin{aligned}&p( x\mid z=i; \mu , \Sigma ) =\frac{1}{(2\pi )^{\frac{m}{2}}\cdot \mid \Sigma _i \mid ^{\frac{1}{2}}} \exp \left\{ -\frac{1}{2} (x-\mu _i)^{\top } \Sigma _i^{-1} (x-\mu _i) \right\} , \nonumber \\&p(z=i;\pi )=\pi _{i}\ \ (i=1, \dots , k). \end{aligned}$$
(17)

Let \({\textbf{x}}=x_{1},\dots , x_{n}\) be a sequence of observed variables of length n. Let \(z_{j}\) denote a latent variable which corresponds to \(x_{j}\) and \({\textbf{z}}=z_{1},\dots , z_{n}\). Let \({\textbf{y}}=({\textbf{x}},{\textbf{z}})\) be a complete variable. Let \(\hat{\mu }_{i}, \hat{\Sigma } _{i}\) be the maximum likelihood estimators of \(\mu _{i}, \Sigma _{i}\) \((i=1,\dots ,k)\) for given \({\textbf{y}}\). Let \(\hat{\pi }_{i}=n_{i}/n\) where \(n_{i}\) is the number of occurrences in \(\varvec{z}\) such that \(z=i\) \((i=1,\dots ,k)\) and \(\sum ^{k}_{i=1}n_{i}=n\). \({\textbf{z}}\) may be estimated by sampling from the posterior probability obtained by the EM algorithm. Let \(\hat{\theta }({\textbf{y}})=(\hat{\pi }_{i},\hat{\mu } _{i},\hat{\Sigma } _{i})\mid _{i=1,\dots ,k}\).
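As one concrete way to obtain \(\hat{\theta }({\textbf{y}})\), the sketch below fits a k-component GMM by EM, samples the latent assignments \({\textbf{z}}\) from the resulting posterior, and computes the complete-variable maximum likelihood estimates. The use of scikit-learn here is our illustration, not necessarily the implementation used in the paper; it also assumes every component receives at least one sampled point.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def complete_variable_estimates(X, k, seed=0):
    """Fit a k-component GMM by EM, sample z from the posterior p(z | x), and return
    the maximum likelihood estimates (pi_hat, mu_hat, Sigma_hat) for the complete data."""
    rng = np.random.default_rng(seed)
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=seed).fit(X)
    post = gmm.predict_proba(X)                        # posterior over components for each x_j
    z = np.array([rng.choice(k, p=p) for p in post])   # sampled latent assignments
    pi_hat = np.bincount(z, minlength=k) / len(X)      # pi_i = n_i / n
    mu_hat = np.array([X[z == i].mean(axis=0) for i in range(k)])
    Sigma_hat = np.array([np.cov(X[z == i].T, bias=True) for i in range(k)])
    return pi_hat, mu_hat, Sigma_hat, z
```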

The NML codelength of \({\textbf{y}}\) for a complete variable model of a GMM is given by

$$\begin{aligned} L_{_\textrm{NML}}({\textbf{y}};k)= & {} -\log p_{_\textrm{NML}}({\textbf{y}};k) \nonumber \\= & {} -\log p({\textbf{y}}; \hat{\theta }({\textbf{y}}) ,k) + \log \mathcal {C}_{n}(k), \end{aligned}$$
(18)

where \(\mathcal {C}_{n}(k)\) is a parametric complexity for a GMM. According to [15], an upper bound on \(\mathcal {C}_{n}(k)\) is given as follows:

$$\begin{aligned} \mathcal {C}_{n}(k)&\le \sum \limits _{n_1,\cdots , n_k} \frac{n!}{n_1! \cdots n_k !} \times \prod \limits _{i=1}^{k} \left( \frac{n_i}{n} \right) ^{n_i} B(m,R,\epsilon ) \nonumber \\&\times \left( \frac{n_{i}}{2e} \right) ^{\frac{mn_i}{2}} \left( \Gamma _ m\left( \frac{n_i-1}{2}\right) \right) ^{-1}, \end{aligned}$$
(19)

where

$$\begin{aligned} B(m,R,\epsilon ) \overset{\text {def}}{=}\frac{2^{m+1}R^{\frac{m}{2}}\epsilon ^{-\frac{m^{2}}{2}}}{m^{m+1}\cdot \Gamma \left( \frac{m}{2} \right) }, \end{aligned}$$

where R is a positive constant such that \(\parallel \hat{\mu }_{i} \parallel ^{2}\le R\) for all i, and \(\epsilon \) is a positive constant giving a lower bound on the smallest eigenvalue of \(\Sigma _{i}\) for any i. \(\Gamma _{m}\) is the multivariate Gamma function defined as \(\Gamma _{m}(x)=\pi ^{\frac{m(m-1)}{4}}\prod ^{m}_{j=1}\Gamma (x+\frac{1-j}{2})\) and \(\Gamma \) is the Gamma function. We use the bound (19) as the value of \({\mathcal C}_{n}(k)\). It is known [14] that \(C_{n}(k)\) is computable in time \(O(n^{2}k)\).

At each time t, we observe a data sequence \({\textbf{x}}_{t} =x_{1},\dots , x_{n}\in {\mathcal X}^{n}\) of length n. We sequentially observe such data as shown in Fig. 2. Let \({\textbf{x}}^{T}={\textbf{x}}_{1},\dots ,{\textbf{x}}_{T}\ ({\textbf{x}}_{t}\in {\mathcal X}^{n},\ t=1,\dots ,T)\) be an observed data sequence. The length n may vary over time. We denote the joint sequence of observed variables and latent variables at time t as \({\textbf{y}}_{t}=({\textbf{x}}_{t},{\textbf{z}}_{t})\).

We suppose that a number of GMMs with different ks are fused according to the probability distribution \(p(k\mid {\textbf{y}}_{t})\) at each time. We define \(p(k\mid {\textbf{y}}_{t})\) as the annealed posterior probability of k for \({\textbf{y}}_{t}\):

$$\begin{aligned}&p(k\mid {\textbf{y}}_{t}) \buildrel \text {def} \over =\frac{(p_{_\textrm{NML}}({\textbf{y}}_{t};k)p(k\mid k_{t-1}))^{\beta }}{\sum _{k'}(p_{_\textrm{NML}}({\textbf{y}}_{t};k')p(k'\mid k_{t-1}))^{\beta }} \\ \nonumber&= \frac{\exp (-\beta L_{_\textrm{NML}}({\textbf{x}}_{t},{\textbf{z}}_{t};k)+\beta \log p(k\mid k_{t-1}))}{\sum _{k'}\exp (-\beta L_{_\textrm{NML}}({\textbf{x}}_{t},{\textbf{z}}_{t};k')+\beta \log p(k'\mid k_{t-1}))}, \end{aligned}$$
(20)

where \(k_{t-1}\) is the dimensionality estimated at time \(t-1\), and

$$\begin{aligned} p (k \mid k_{t-1}) \buildrel \text {def} \over =\left\{ \begin{array}{ll} 1-\gamma &{} \text{ if }~ k=k_{t-1}~\text {and}~k_{t-1} \ne 1,k_\text {max},\\ 1-\gamma /2~&{} \text{ if }~ k=k_{t-1}~\text {and}~k_{t-1} = 1,k_\text {max},\\ \gamma /2~ &{}\text{ if }~ k=k_{t-1} \pm 1. \end{array}\right. \end{aligned}$$
(21)

\(\gamma (0<\gamma <1)\) is a parameter, and \(k_{\max }\) is the maximum value of k. We estimate \(\gamma \) using the MAP estimator with the beta distribution \(\textrm{Beta}(a,b)\) being the prior where the density function of \(\text {Beta}(a,b)\) is proportional to \(\gamma ^{a-1}(1-\gamma )^{b-1}\) for hyper-parameters a and b. The MAP estimator of \(\gamma \) is given as follows:

$$\begin{aligned} \hat{\gamma } =\frac{N_t+a-1}{t+a+b-2}, \end{aligned}$$

where \(N_t\) shows how many times the number of clusters has changed up to time \(t-1\). In the experiments to follow, we set \((a, b)=(2,10)\). \(\beta (>0)\) is the temperature parameter. In our experiments, we set \(\beta \) as

$$\begin{aligned} \beta =1/\sqrt{n}. \end{aligned}$$
(22)

This is due to the PAC-Bayesian argument [1] for Gibbs posteriors.

Note that (20) is calculated on the basis of the NML distribution. This is because the probability distribution with unknown parameters should be estimated as the NML distribution since it is the optimal distribution in terms of the minimax regret (see Section 2.1).

By (16), we can calculate Ddim of model fusion of GMMs with various ks at time t as

$$\begin{aligned} \underline{\text {Ddim}}({\mathcal F}^{\odot })_{t}= \sum _{k}p_{t}({\mathcal P}_k)\text {Ddim}({\mathcal P}_{k}), \end{aligned}$$
(23)

where \(p_{t}({\mathcal P}_{k})\) is the probability of \({\mathcal P}_{k}\) at time t, and \(p_{t}({\mathcal P}_{k})=p(k\mid {\textbf{y}}_{t})\) in this case. Note that Ddim for GMM with k components is \(k(m^{2}/2+(5m/2))-1\approx kf(m)\) where \(f(m)=m^{2}/2+(5m/2)\). However, in order to focus on the mixture size, we divide the true Ddim by f(m) to consider an alternative Ddim of the form of (24).

$$\begin{aligned} \underline{\text {Ddim}}({\mathcal F}^{\odot })_{t}&\buildrel \text {def} \over =&\sum _{k}p(k\mid {\textbf{y}}_{t})\text {Ddim}({\mathcal P}_{k})/f(m) \nonumber \\\approx & {} \sum _{k}p(k\mid {\textbf{y}}_{t})k. \end{aligned}$$
(24)

Computing the real-valued k according to (24) is exactly what we mean by continuous model selection.
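A minimal sketch of (20), (21), (22) and (24): given the NML codelengths \(L_{\text {NML}}({\textbf{y}}_{t};k)\) for the candidate values of k (assumed to be computed via (18) and (19)), it returns the annealed posterior and the approximate Ddim value \(\sum _{k}p(k\mid {\textbf{y}}_{t})k\). The function names and the toy codelengths are ours.

```python
import numpy as np

def transition_prior(k, k_prev, gamma, k_max):
    """Transition probability p(k | k_{t-1}) as in (21)."""
    if k == k_prev:
        return 1 - gamma if 1 < k_prev < k_max else 1 - gamma / 2
    return gamma / 2 if abs(k - k_prev) == 1 else 0.0

def ddim_gmm_fusion(codelengths, k_prev, gamma, n, k_max):
    """Annealed posterior (20) with beta = 1/sqrt(n) as in (22), and Ddim as in (24).
    `codelengths` maps k to L_NML(y_t; k) in nats."""
    beta = 1.0 / np.sqrt(n)
    ks = np.array(sorted(codelengths))
    logits = np.array([-beta * codelengths[k]
                       + beta * np.log(transition_prior(k, k_prev, gamma, k_max) + 1e-300)
                       for k in ks])
    post = np.exp(logits - logits.max())
    post /= post.sum()
    return float(post @ ks), dict(zip(ks.tolist(), post.tolist()))

# Toy usage with hypothetical codelengths for candidate k in {2, 3, 4}.
ddim_t, posterior = ddim_gmm_fusion({2: 5120.0, 3: 5105.0, 4: 5150.0},
                                    k_prev=2, gamma=0.1, n=1000, k_max=10)
print(ddim_t)  # a real value between 2 and 3, as expected in a transition period
```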

Suppose that there exists a true parametric dimensionality \(k^{*}\). Then because of the consistency of MDL model estimation ([25], pp:63-69),

$$\begin{aligned} p(\hat{k}=k^{*}\mid {\textbf{y}}_{t}) \rightarrow 1 \end{aligned}$$

for \(\hat{k}\) minimizing the NML codelength as n increases. Hence (24) will coincide with \(k^{*}\) with probability 1 as n goes to infinity. This implies that (24) is a natural extension of the parametric dimensionality.

3.2 Model change sign detection algorithms

Consider the situation where we sequentially observe a complete variable sequence: \({\textbf{y}}_{1}, {\textbf{y}}_{2},\dots ,{\textbf{y}}_{T}\). We then obtain a Ddim graph:

$$\begin{aligned} \{(t, \underline{\text {Ddim}}({\mathcal F}^{\odot })_{t}): t=1,2,\dots , T\}, \end{aligned}$$

as in Fig. 1. We can visualize the transition period by drawing the Ddim graph versus time. In this paper we have defined a sign of a model change as the starting point of the latent gradual change associated with it. Thus we can detect signs of model changes by looking at the rise-up/descent of Ddim.

More precisely, we propose the following two methods for raising alarms of model change signs.

1) Thresholding method (TH): We raise an alarm if the absolute difference between Ddim and the baseline exceeds a given threshold \(\delta _{1}\). The baseline is the parametric dimensionality estimated by the sequential dynamic model selection algorithm (SDMS) [14], which is a sequential variant of DMS in [34, 35]. It outputs a model \(k=\hat{k}\) with the shortest codelength, i.e., for \(\lambda >0\),

$$\begin{aligned} \hat{k}=\underset{k}{\text {argmin}}\,\{L_{_{\text {NML}}}({\textbf{y}}_{t};k)-\lambda \log p(k\mid k_{t-1})\}, \end{aligned}$$
(25)

where \(L_{_{\text {NML}}}({\textbf{y}}_{t};k)\) is calculated as in (18). Letting \(\hat{k}\) be the output of SDMS and \(\underline{\text {Ddim}}_{t}\) be Ddim of model fusion at time t, we raise an alarm if

$$\begin{aligned} \text {TH}{-}\text {Score}\buildrel \text {def} \over =\mid \underline{\text {Ddim} }_{t}-\hat{k}\mid > \delta _{1}. \end{aligned}$$
(26)

2) Differential method (Diff): We raise an alarm if the time difference of Ddim exceeds a given threshold \(\delta _{2}\). That is, we raise an alarm if

$$\begin{aligned} \text {Diff}{-}\text {Score}\buildrel \text {def} \over =\mid \underline{\text {Ddim}}_{t}-\underline{\text {Ddim}}_{t-1}\mid > \delta _{2}. \end{aligned}$$
(27)

The computational complexity of TH and Diff at each time t is governed by that of computing the NML codelength (18). The first term in (18) is computable in time O(nk). The second term in (18) is computable in time \(O(n^{2}k)\) [15], but it does not depend on the data; hence it can be calculated beforehand for various n and k and referred to when necessary. Hence the computational complexity of TH and Diff at each time is O(nK), where K is an upper bound on k.
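The two alarm rules (26) and (27) amount to simple thresholding of the Ddim trajectory. A sketch, assuming the Ddim values and the SDMS outputs \(\hat{k}_{t}\) have already been computed (the toy numbers below are hypothetical):

```python
def th_alarms(ddim, k_hat, delta1):
    """Thresholding method (26): alarm when |Ddim_t - k_hat_t| > delta1."""
    return [t for t, (d, k) in enumerate(zip(ddim, k_hat)) if abs(d - k) > delta1]

def diff_alarms(ddim, delta2):
    """Differential method (27): alarm when |Ddim_t - Ddim_{t-1}| > delta2."""
    return [t for t in range(1, len(ddim)) if abs(ddim[t] - ddim[t - 1]) > delta2]

ddim = [2.0, 2.02, 2.15, 2.48, 2.85, 3.0]   # hypothetical Ddim graph
k_hat = [2, 2, 2, 2, 3, 3]                  # hypothetical SDMS outputs
print(th_alarms(ddim, k_hat, delta1=0.1))   # [2, 3, 4]
print(diff_alarms(ddim, delta2=0.1))        # [2, 3, 4, 5]
```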

3.3 Continuous model selection for AR model

The above methodology can be applied to general classes of finite mixture models other than GMMs. It can also be applied to general parametric probabilistic model classes. We illustrate the case of the auto-regression (AR) model as an example. This model does not include latent variables.

Let a data sequence \(\{x_{t}\},\ x_t\in {\mathbb R}\ (t=1,2,\dots ,T)\) be given. For the modeling of the data sequence, we consider AR(k) (k-th order auto-regression) model of the form:

$$\begin{aligned} x_{t}=a_{1}x_{t-1}+\cdots + a_{k}x_{t-k}+\epsilon , \end{aligned}$$

where \(a_{i}\in {\mathbb R}\ (i=1,\dots ,k)\) are unknown parameters, and \(\epsilon \) is a random variable following the Gaussian distribution with mean 0 and unknown variance \(\sigma ^{2}\). We set \(\theta =(a_{1},\dots ,a_{k},\sigma ^{2})\).
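For reference, the maximum likelihood estimate \(\hat{\theta }=(\hat{a}_{1},\dots ,\hat{a}_{k},\hat{\sigma }^{2})\) of AR(k) reduces to least squares; a minimal sketch, conditioning on the first k observations (an assumption of ours for simplicity):

```python
import numpy as np

def fit_ar(x, k):
    """Least-squares (conditional maximum likelihood) fit of an AR(k) model
    x_t = a_1 x_{t-1} + ... + a_k x_{t-k} + eps with Gaussian noise."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x[k - i - 1:len(x) - i - 1] for i in range(k)])  # lagged regressors
    y = x[k:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ a) ** 2)       # ML estimate of the noise variance
    return a, sigma2

rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(2, 500):                      # simulate an AR(2) process
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print(fit_ar(x, 2))                          # coefficients close to (0.6, -0.3)
```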

Let \({\textbf{x}}_{t}=x_{t},\dots , x_{t-w+1}\) be the t-th session for a window size w. We calculate the NML codelength \(L_{\text {NML}}({\textbf{x}}_{t}; k)\) for \({\textbf{x}}_{t}\) associated with AR(k) in the following sequential manner: Letting \(\hat{\theta }\) be the maximum likelihood estimator,

$$\begin{aligned} L_{_\textrm{NML}}({\textbf{x}}_{t}; k)=\sum ^{t}_{j=t-w+1} \left( -\log \frac{p(x_{j}; \hat{\theta }(x_{j},x^{j-1}))}{\int p(y_{j}; \hat{\theta }(y,x^{j-1}))dy} \right) . \end{aligned}$$

Letting \({\mathcal P}_{k}=\mathrm{AR}(k)\), similarly to (20) and (23), we can calculate Ddim at time t and draw a Ddim graph. We thereby detect model change signs by applying TH and Diff to the graph.

Fig. 3

Graph of the proportion of \(\mu _{3}\) relative to \(\mu _{2}\) in the mean versus time for various \(\alpha \)

4 Experimental results

4.1 Synthetic data: GMM

4.1.1 Data set

We employ synthetic data sets to evaluate how well we are able to detect signs of model changes using Ddim. We let \(n=1000\) at each time. We generated DataSet 1 according to GMMs so that the number of components changed from \(k=2\) to \(k=3\) as follows:

$$\begin{aligned} \left\{ \begin{array}{l} k=2,\ \mu =(\mu _{1},\mu _{2}) \qquad \qquad \quad \text {if}\ 0\le t \le \tau _1, \\ k=3, \ \mu =(\mu _1, \mu _2, f_{\alpha }(t)) \qquad \quad \text {if}\ \tau _1+1 \le t \le \tau _2, \\ k=3, \ \mu =(\mu _1, \mu _2, \mu _3) \qquad \qquad \text {if}\ \tau _2 +1 \le t \le T, \end{array} \right. \end{aligned}$$
(28)

where letting \(\alpha \) be a smoothness parameter,

$$\begin{aligned} f_{\alpha }(t)\buildrel \text {def} \over = \frac{(\tau _2-t)^{\alpha }\mu _2 + (t-\tau _1)^{\alpha }\mu _3}{(\tau _2-t )^{\alpha }+(t-\tau _1)^{\alpha }} \ \ (\alpha >0). \end{aligned}$$
(29)

In it, one component collapsed gradually in the transition period from \(t=\tau _1+1\) to \(t=\tau _2\). \(f_{\alpha }(t)\) is the mean value which switches from \(\mu _{2}\) to \(\mu _{3}\), where the speed of change is specified by the parameter \(\alpha \). Figure 3 shows the graph of the proportion of \(\mu _{3}\) relative to \(\mu _{2}\) in the mean versus time for various \(\alpha \). The change becomes rapid as \(\alpha \) approaches zero. The variance-covariance matrix of each component is given by

$$\begin{aligned} \Sigma =(rAA^{\top }+(1-r)I)\times \text {var}, \end{aligned}$$
(30)

where \(r=0.2\), \(\text {var}=3\), and A is a randomly generated \(m\times m\) matrix. We set \( m=3, \tau _1=9,\tau _2=29,T=39\). It appears that the number of components of the GMM abruptly changed at \(t=20\) since it takes a discrete value. However, in the early stage of \(k=3\), the model is very close to \(k=2\) because the mean values of the Gaussian components are very close to each other. It may be more natural to recognize the model dimensionality at this stage as a value between \(k=2\) and \(k=3\).
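A sketch of a generator for DataSet 1 following (28)-(30); the concrete mean vectors and the equal mixing weights are placeholders of ours, since they are not specified above.

```python
import numpy as np

def make_dataset1(alpha, n=1000, m=3, tau1=9, tau2=29, T=39, r=0.2, var=3.0, seed=0):
    """Generate a stream of GMM samples whose third mean moves from mu_2 to mu_3
    during the transition period, as in (28)-(30)."""
    rng = np.random.default_rng(seed)
    mu1, mu2, mu3 = np.zeros(m), 5.0 * np.ones(m), 10.0 * np.ones(m)   # placeholder means
    A = rng.standard_normal((m, m))
    Sigma = (r * A @ A.T + (1 - r) * np.eye(m)) * var                  # covariance (30)
    stream = []
    for t in range(T + 1):
        if t <= tau1:
            means = [mu1, mu2]
        elif t <= tau2:
            w1, w2 = (tau2 - t) ** alpha, (t - tau1) ** alpha
            means = [mu1, mu2, (w1 * mu2 + w2 * mu3) / (w1 + w2)]      # f_alpha(t) as in (29)
        else:
            means = [mu1, mu2, mu3]
        z = rng.integers(len(means), size=n)                           # equal mixing weights (assumption)
        stream.append(np.stack([rng.multivariate_normal(means[i], Sigma) for i in z]))
    return stream
```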

We evaluate how well Ddim tracked the transition period of model change. The temperature parameter \(\beta \) was chosen so that \(\beta =0.0316\) according to (22). Figure 4 shows how Ddim gradually grows as time goes by for various \(\alpha \) values. The gray zone shows the transition period when the model changes from \(k=2\) to \(k=3\). The blue line shows the number of components of the GMM estimated by the SDMS algorithm as in (25). The green curve shows the Ddim graph. The red and purple curves show TH-Score and Diff-Score as in (26) and (27), respectively. We show the time points of their alarms of TH and Diff using the same colors. Ddim successfully visualized how rapidly the GMM structure changed in the transition period from \(t=10\) to 29. The true change occurs rapidly for \(\alpha =0.2\), while it occurs slowly for \(\alpha =1.0\). Ddim was able to successfully track their transition process depending on \(\alpha \). Ddim detected signs earlier than SDMS made an alarm of model change.

Fig. 4

Ddim graph (transition period: \([\tau _1=9,\tau _2=29], T=39\)). We see how Ddim gradually grows as time goes by for various \(\alpha \) values. The gray zone shows the transition period when the model changes from \(k=2\) to \(k=3\). The blue line shows the number of components of the GMM estimated by the SDMS algorithm as in (25). The green curve shows the Ddim graph. The red and purple curves show TH-Score and Diff-Score as in (26) and (27), respectively. We show the time points of their alarms of TH and Diff using the same colors. We see that Ddim successfully visualized how rapidly the GMM structure changes in the transition period from \(t=10\) to 29 for each \(\alpha \), where the true change occurs rapidly for \(\alpha = 0.2\), while it occurs slowly for \(\alpha =1.0\). Furthermore, Ddim detected signs of model change earlier than SDMS made an alarm of model change

We next consider the case where there are multiple change points. We generated DataSet 2 according to GMMs so that the number of components changed from \(k=2\) to \(k=3\), and then from \(k=3\) to \(k=4\), as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} k=2,\ \mu =(\mu _{1},\mu _{2}) &{} \text {if}\ 1\le t \le \tau _1, \\ k=3, \ \mu =\left( \mu _1, \mu _2, \frac{(\tau _{2}-t)^{\alpha }\mu _{2}+(t-\tau _{1})^{\alpha }\mu _{3}}{\tau _{2}-\tau _{1}} \right) &{} \text {if}\ \tau _1+1 \le t \le \tau _2,\\ k=3, \ \mu =(\mu _1, \mu _2, \mu _3) &{} \text {if}\ \tau _2 \le t \le \tau _{3}, \\ k=4, \ \mu =\left( \mu _1, \mu _2, \mu _{3}, \frac{(\tau _{4}-t)^{\alpha }\mu _{3}+(t-\tau _{3})^{\alpha }\mu _{4}}{\tau _{4}-\tau _{3}} \right) &{} \text {if}\ \tau _3+1 \le t \le \tau _4,\\ k=4, \ \mu =(\mu _1, \mu _2, \mu _3, \mu _4) &{} \text {if}\ \tau _4+1 \le t \le T. \end{array}\right. \end{aligned}$$

One component collapsed gradually over time from \(t=10\) to \(t=29\) and another collapsed from \(t=50\) to \(t=69\). We set \(\tau _{1}=9, \tau _{2}=29, \tau _{3}=49, \tau _{4}=69\), and \(T=79\). In the transition periods the parameters varied as in the single change point case.

Figure 5 shows the Ddim graph for \(\alpha =0.5\). The gray zone shows the transition periods when the model changes from \(k=2\) to \(k=3\) and from \(k=3\) to \(k=4\). The green curve shows the Ddim graph. The blue line shows the number of components of the GMM estimated by SDMS. The red and purple lines show the times when alarms for signs of model changes are raised using TH and Diff, respectively. The Ddim graph helps us understand well how rapidly the GMM structure gradually changes in the transition periods from \(t=10\) to \(t=29\) and from \(t=50\) to \(t=69\). TH detected the signs of model changes earlier than SDMS.

4.1.2 Evaluation metrics

Next we quantitatively evaluate how early we were able to detect signs of model changes with Ddim. We measure the performance of any algorithm in terms of benefit. Let \(\hat{t}\) be the first time when an alarm is made and \(t^{*}\) be the true sign, which we define as the starting point of model change. Then benefit is defined as

$$\begin{aligned} \text {benefit}={\left\{ \begin{array}{ll} 1-(\hat{t}-t^{*})/{U} &{} (t^{*}\le \hat{t}<t^{*}+U),\\ 0 &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(31)

where U is a given parameter. Benefit takes the maximum value 1 when the alarm coincides with the true sign. It decreases linearly as \(\hat{t}\) increases and becomes zero once \(\hat{t}\) reaches \(t^{*}+U\).

False alarm rate (FAR) is defined as the ratio of the number of alarms outside the transition period over the total number of alarms.
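A sketch of the two evaluation metrics; `transition` denotes the set of time points inside the transition period (our naming):

```python
def benefit(t_hat, t_star, U=10):
    """Benefit (31): 1 for an alarm at the true sign, decaying linearly to 0 over U steps."""
    return 1 - (t_hat - t_star) / U if t_star <= t_hat < t_star + U else 0.0

def false_alarm_rate(alarms, transition):
    """FAR: fraction of alarms raised outside the transition period."""
    return sum(t not in transition for t in alarms) / len(alarms) if alarms else 0.0

print(benefit(t_hat=13, t_star=10))                              # 0.7
print(false_alarm_rate([5, 13, 35], transition=range(10, 30)))   # 2/3
```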

Fig. 5

Ddim graph (transition periods: \([\tau _1=9,\tau _2=29],[\tau _3=49,\tau _{4}=69], T=79\)). Ddim successfully visualizes how rapidly the GMM structure changes in the transition periods from \(t = 10\) to 29 and from \(t=50\) to \(t=69\). We also see that Ddim detected signs of model change earlier than SDMS made an alarm of model change

We evaluate each model change sign detection method in terms of the Area Under Curve (AUC) of the Benefit-FAR curve, which is obtained by varying the threshold parameter \(\delta \) in (26) and (27). We set \(U=10\) in (31).

4.1.3 Methods for comparison

We consider the following methods for comparison.

1) The sequential DMS algorithm (SDMS) [14]: The SDMS algorithm with \(\lambda =1\) outputs the estimated parametric dimensionality as in (25). We raise an alarm when the output of SDMS changes.

2) Fixed share algorithm (FS) [13]: We think of each model k as an expert, and perform Herbster and Warmuth’s fixed share algorithm, abbreviated as FS. It was originally designed to make prediction by taking a weighted average over a number of experts, where the weight is calculated as a linear combination of the exponential update weight and the sum of other experts’ ones. In it the expert with the largest weight is the best expert, which may change over time. We can think of FS as a model change detection algorithm by tracking the time-varying best expert.

Here is a summary of FS. Let k be the index of the expert. \(L_{k}({\textbf{z}}_{t-1})\) is the loss function of the kth expert for data \({\textbf{z}}_{t-1}\), which is the NML codelength in our setting. \(w_{t,k}^{u}\) and \(w_{t,k}^{s}\) are tentative and final weights for the kth expert at time t. FS conducts the following weight update rule: Letting \(\alpha >0\) be a sharing parameter and \(\beta \) be a learning rate,

$$\begin{aligned} w_{t-1,k}^u= & {} w_{t-1,k}^s \cdot \exp \{-\beta L_{k}({\textbf{z}}_{t-1})\} , \\ w_{t,k}^s= & {} (1-\alpha )w_{t-1,k}^u + \sum _{\ell \ne k} \frac{\alpha }{n-1}w_{t-1,\ell }^u , \end{aligned}$$

where n is the total number of experts.

Let \(\hat{k}\) be the best expert, i.e., the one for which \(w_{t,k}^{s}\) is maximum. FS raises an alarm when the best expert changes. The learning rate was set to be the same as in our method; a minimal sketch of this update is given after the method comparison below.

3) Fixed share weighted algorithm (FSW-TH, FSW-Diff): We consider variants of TH and Diff in which \(p(k\mid {\textbf{y}}_{t})\) in (20) is replaced with the normalized weight for k calculated in the process of FS. FSW-TH and FSW-Diff calculate scores by plugging \(w_{t,k}^{s}\) into (24) in place of \(p(k\mid {\textbf{y}}_{t})\) and raise alarms according to (26) and (27), respectively. The learning rate was set to be the same as in our method.

4) Structural entropy (SE): It is a measure of uncertainty for model selection, developed in [16]. It is calculated as the entropy with respect to the model posterior probability distribution (20). SE makes alarms when it exceeds a threshold.

Method 1) is the only existing work that performs on-line dynamic model selection. Methods 2) and 3) are the ones adapted to our problem setting. Methods 1) and 2) are model change detection algorithms, while method 3) quantifies latent gradual changes in a way similar to TH or Diff. Method 4) detects change signs from the viewpoint of model uncertainty, but does not perform continuous model selection.
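A minimal sketch of the fixed share update used by FS and FSW-TH/FSW-Diff above, with the NML codelengths playing the role of the losses; the variable names and toy numbers are ours, and the final normalization is added so that the weights can be plugged into (24):

```python
import numpy as np

def fixed_share_update(w_prev, losses, alpha, beta):
    """One step of fixed share: exponential weighting followed by weight sharing."""
    w_u = w_prev * np.exp(-beta * np.asarray(losses))                 # tentative weights
    n = len(w_u)
    w_s = (1 - alpha) * w_u + (alpha / (n - 1)) * (w_u.sum() - w_u)   # share with the other experts
    return w_s / w_s.sum()                                            # normalized for use in (24)

w = np.ones(3) / 3        # experts corresponding to, say, k = 2, 3, 4
w = fixed_share_update(w, losses=[5120.0, 5105.0, 5150.0], alpha=0.05, beta=1 / np.sqrt(1000))
print(int(w.argmax()))    # index of the current best expert
```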

Table 3 AUC comparison results for GMMs. The bold values show highest AUC records over all the methods

4.1.4 Results

We generated random data 10 times and took the average value of benefit over the 10 trials for each method. Table 3 shows the comparison of all the methods in terms of AUC for both the single and multiple change cases. AUC was calculated for the benefit-FAR curve obtained by varying the threshold. The parameter \(\alpha \) specifies the speed of change. For the multiple change cases, AUC was calculated as an average taken over all change points. For both the single and multiple change cases, TH and Diff had much higher benefit than the FS-based methods in all the cases. The differences were statistically significant according to a t-test with p-values less than \(5\%\). This implies that TH and Diff were able to detect signs of model changes significantly earlier than the FS-based methods. TH worked almost as well as Diff.

It is worth noting that TH and Diff performed better than FSW-TH and FSW-Diff. This implies that the posterior based on the NML distribution is more suitable for tracking gradual model changes than that based on the FS-based heuristics. As \(\alpha \) becomes small, the superiority of TH and Diff over the others becomes more remarkable. This implies that Ddim is able to capture the growth of a cluster much more quickly than the others.

Both for the single and multiple change cases, TH worked slightly better than Diff, but they were almost comparable.

TH and Diff were comparable to SE in terms of change sign detection alone. However, note that the Ddim-based methods realize both model change sign detection and continuous model selection simultaneously. The latter function is particularly important for understanding how fast the model changes. Meanwhile, SE can only perform sign detection by measuring the uncertainty in model selection. Therefore, Ddim has an advantage over SE in the sense that it can continuously quantify the model complexity in the transition period as well as detect signs of model changes.

4.2 Synthetic data: Auto-regression model

4.2.1 Data sets

We next examined continuous model selection for auto-regression (AR) models as in Section 3.3. We let \(n=1000\) at each time. We generated DataSet 3 according to an AR model in the setting of Section 3.3, where the number k of coefficients in the AR model changed over time as follows:

$$\begin{aligned} \text {k}={\left\{ \begin{array}{ll} 1 &{} \text {if}\ ~1\le t< \tau _{1},\\ 1 ~\text {with prob.} ~1-\frac{t-\tau _{1}}{\tau _{2}-\tau _{1}}&{} \text {if} ~\tau _{1}\le t< \tau _{2},\\ 3 ~\text {with prob.} ~\frac{t-\tau _{1}}{\tau _{2}-\tau _{1}}&{} \text {if} ~\tau _{1}\le t< \tau _{2},\\ 3 &{} \text {if} ~\tau _{2}\le t\le T, \end{array}\right. } \end{aligned}$$

where \(\tau _{1}=100, \tau _{2}=200\) and \(T=300\).

4.2.2 Results

We generated random data 10 times and took an average value of benefit over 10 trials for each method. Table 4 shows results on comparison of all the methods in terms of AUC for Benefit-FAR curves.

Table 4 AUC comparison results for AR models. The bold value shows the highest AUC over all the methods

Table 4 shows that TH and Diff were comparable to SE and obtained much larger values of AUC than the other methods except SE. This implies that TH and Diff could catch signs of model changes significantly earlier than the others except SE. Note again that TH and Diff conduct not only change sign detection but also continuous model selection, whereas SE only performs change sign detection.

4.3 Real data: Market data

4.3.1 Data sets

We apply our method to real market data provided by HAKUHODO,INC. (https://www.hakuhodo-global.com/) and M-CUBE,INC. (https://www.m-cube.com/). This data set consists of 912 customers’ beer purchase transactions from Nov. 1st 2010 to Jan. 31st 2011. See [38]. Each customer’s record is specified by a four-dimensional feature vector, each component of which shows a consumption volume for a certain beer category. Categories are: {Beer(A), Low-malt beer(B), Other brewed-alcohol(C), Liquor (D)}.

We constructed a sequence of customers’ feature vectors as follows: A time unit is a day. At each time \(t\ (=\tau ,\dots ,T)\), we denote the feature vector of the ith customer as \(x_{it} = (x_{it,A}, \dots , x_{it,D}) \in {\mathbb R}^{4}\). Each \(x_{it,j}\) is the ith customer’s consumption of the jth category from time \(t-\tau +1\) to t. We denote the data at time t as \({\textbf{x}}_{t} = (x_{1t}, \dots , x_{nt})\), where \(n=912\) is the number of customers. The total number of transactions is 13993. We set \(\tau =14\) and \(T=53\). Since TH and Diff turned out to outperform the other methods in the previous section and we would like to conduct continuous model selection simultaneously, we focus on evaluating how well they work on the real data sets.

Fig. 6

Change sign detection for market data. Ddim continuously increases from \(t=24\) to \(t=26\). TH and Diff raised an alarm at \(t=25\) as a sign of that market structure change

4.3.2 Results

Figure 6 shows Ddim (green), the estimated number of clusters in the GMM (blue) using SDMS, and the time points of alarms raised by TH and Diff (red and purple) with \(\delta _{1}=\delta _{2}=0.1\). Table 5 shows the clustering structures at \(t=24,25,26\). Each number in the (i, j)th cell shows the purchase volume of category \(i\ (=A,B,C,D)\) for the customers in the jth cluster \(cj\ (j=1,2,3,4)\). The last row shows the number of customers.

Table 5 Market structure change

The purchase volume of category C in cluster c4 gradually increased from \(t=24\) to \(t=25\); eventually c4 started to collapse at \(t=25\) and was split into c4 and c5 at \(t=26\). We confirm from Table 5 that c4 consisted of heavy users in category C; at \(t=26\), some of them became dormant users that did not purchase anything, forming a new cluster. The SDMS algorithm detected this market structure change at \(t=26\). As shown in Fig. 6, TH and Diff successfully raised an alarm at \(t=25\) as a sign of that market structure change. The reason why we could detect the early warning signal is that there were gradual changes among clusters as well as within individual clusters before the clustering change occurred. Our result shows that our method was effective in detecting signs of model changes in such a case.

4.4 Real Data: Electric power consumption data

4.4.1 Data sets

Next we apply our method to the household electric power consumption dataset provided by [6]. This dataset contains three categories of electric power consumption, corresponding to electricity consumed 1) in the kitchen and laundry rooms, 2) by electric water heaters, and 3) by air-conditioners. The data were obtained every other minute from Dec. 17, 2006 to Dec. 10, 2010. We set \({\textbf{x}}_{t} = (x_{1}, \cdots , x_{n})\) and \(x_{i} = (x_{i1}, x_{i2}, x_{i3})\), where each \(x_{i}\) denotes the consumption per hour for the three categories, respectively, and \({\textbf{x}}_{t}\) is the consumption over two weeks (\(n=336\)).

4.4.2 Results

Figure 7 shows how Ddim (the green curve) and the number of clusters (the blue line) changed over time. Here each cluster shows a consumption pattern. The red dotted line shows the alarm positions for TH and Diff with \(\delta _{1}=\delta _{2}=0.1\). Let us focus on the duration from \(t=18\) to \(t=22\). At \(t=18,19\), there were three clusters, one of which collapsed into two clusters at \(t=21\), eventually producing a fourth cluster. The Ddim graph in Fig. 7 shows that Ddim gradually increased from \(k=3\) to \(k=4\) during this period. The alarm was raised by TH and Diff at \(t=20\) while there were still three clusters. This alarm can be thought of as a sign of the emergence of a new cluster having a unique consumption pattern.

Fig. 7

Change sign detection for power consumption data. Ddim continuously grows from \(t=19\) to \(t=21\). TH and Diff made an alarm at \(t=20\), which can be thought of as a sign of the emergence of a new cluster with a unique consumption pattern

Table 6 Electric power consumption structure change

Table 6 shows the contents of the clusters in the weeks starting from May 14th, 21st, and 28th in 2007. c denotes the cluster, and m1, m2, m3 denote the mean amounts of meters 1, 2, 3, respectively. The last column shows the total number of users in the respective cluster. A sign of model change was detected on May 21st. The model change was detected on May 28th. We see from Table 6 that cluster 2 collapsed into clusters 2 and 3. Cluster 2 shows a pattern of homogeneous consumption with a relatively high weight on category 3. Cluster 3 shows a pattern of homogeneous consumption with a relatively high weight on category 1. The sign of this collapse was successfully detected on May 21st by monitoring the Ddim value. The reason why we could detect the early warning signal is that there was a gradual change in the collapse of cluster 2 before the clustering change occurred. Our result shows that our method was effective in detecting signs of model changes in such a case.

5 Relation of Ddim to MDL Learning

This section gives a theoretical foundation of Ddim by relating it to the rate of convergence of the MDL learning algorithm [2, 30]. It selects a model with the shortest total codelength required for encoding the data as well as the model itself. We give an NML-based version of the MDL algorithm as follows.

Let \({\mathcal F}=\{{\mathcal P}_{1},\dots , {\mathcal P}_{s}\}\) where \(\mid {\mathcal F}\mid =s< \infty \) and each \({\mathcal P}_{i}\) is a class of probability distributions. For a given training data sequence \({\textbf{x}}=x_{1},\dots ,x_{n}\) where each \(x_{i}\) is independently drawn, the MDL learning algorithm selects \(\hat{{\mathcal P}}\) such that

$$\begin{aligned} \hat{{\mathcal P}}= & {} \underset{{\mathcal P}\in {\mathcal F}}{\text {argmin}}(-\log p_{_\text {NML}}({\textbf{x}}; {\mathcal P}))\\= & {} \underset{{\mathcal P}\in {\mathcal F}}{\text {argmin}}\left\{ -\log \max _{p\in {\mathcal P}}p({\textbf{x}})+\log {\mathcal C}_{n}({\mathcal P})\right\} , \nonumber \end{aligned}$$
(32)

where \({\mathcal C}_{n}({\mathcal P})\) is the parametric complexity of \({\mathcal P}\) as in (5). The MDL learning algorithm outputs the NML distribution associated with \(\hat{{\mathcal P}}\) as in (32): for a sequence \({\textbf{y}}=y_{1},\dots , y_{n}\),

$$\begin{aligned} \hat{p}({\textbf{y}})=\frac{\max _{p\in \hat{{\mathcal P}}}p({\textbf{y}})}{C_{n}(\hat{{\mathcal P}})}. \end{aligned}$$
(33)

Note that \({\textbf{y}}\) is independent of the training sequence \({\textbf{x}}\) used to obtain \(\hat{{\mathcal P}}\). In previous work [2, 30], the MDL learning algorithm has been designed so that it outputs the two-stage shortest-codelength distribution with quantized parameter values, belonging to the model classes. Our algorithm differs from them in that it outputs the NML distribution (33), which is not included in the model classes. The NML distribution and the MDL principle are the central notions in deriving Ddim throughout this paper. Thus it is significant to investigate the relation of Ddim to the NML distribution estimated with the MDL learning algorithm.
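To make (32) concrete, the sketch below selects between two toy model classes for binary data, a fine grid of Bernoulli parameters and the singleton fair-coin class, by comparing their NML codelengths computed by brute force as in Section 2.1. Both classes and the helper function are our illustration, not the experimental setup of the paper.

```python
import math
from itertools import product

def nml_codelength(x, model_class):
    """NML codelength (2) of x for a finite class of Bernoulli parameters,
    with the normalizer (3) computed by brute force over {0,1}^n."""
    def max_loglik(y):
        k = sum(y)
        return max(k * math.log(th) + (len(y) - k) * math.log(1 - th) for th in model_class)
    log_C = math.log(sum(math.exp(max_loglik(y)) for y in product((0, 1), repeat=len(x))))
    return -max_loglik(x) + log_C

# Two candidate classes: a fine Bernoulli parameter grid vs. the fair coin only.
classes = {"bernoulli_grid": [i / 100 for i in range(1, 100)], "fair_coin": [0.5]}
x = (1, 1, 0, 1, 1, 1, 0, 1, 1, 1)
best = min(classes, key=lambda name: nml_codelength(x, classes[name]))
print(best)  # the class attaining the shortest NML codelength, as selected by (32)
```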

We have the following theorem relating Ddim to the rate of convergence of the MDL learning algorithm.

Theorem 3

Suppose that each \({\textbf{x}}\) is generated according to \( p^{*}\in {\mathcal P}^{*}\in {\mathcal F}=\{{\mathcal P}_{1},\dots , {\mathcal P}_{s}\}\). Let \(\hat{p}\) be the output of the MDL learning algorithm as in (32). Let \(d_{B}^{(n)}(\hat{p},p^{*})\) be the Bhattacharyya distance between \(\hat{p}\) and \(p^{*}\):

$$\begin{aligned} d_{B}^{(n)}(\hat{p},p^{*})\buildrel \text {def} \over =-\frac{1}{n}\log \sum _{{\textbf{y}}} (p^{*}({\textbf{y}})\hat{p}({\textbf{y}}))^{\frac{1}{2}}. \end{aligned}$$
(34)

Then for any \(\epsilon >0\), under the condition on \({\mathcal P}^{*}\) as in Theorem 1, we have the following upper bound on the probability that the Bhattacharyya distance between the output of the MDL learning algorithm and the true distribution exceeds \(\epsilon \):

$$\begin{aligned} Prob[d_{B}^{(n)}(\hat{p},p^{*})>\epsilon ]= & {} O\left( n^\mathrm{{Ddim}({\mathcal {P}}^{*})/4}e^{-n\epsilon }\right) . \end{aligned}$$
(35)

Suppose that \({\mathcal P}\) is chosen randomly according to the probability distribution \(\pi ({\mathcal P})\) over \({\mathcal F}=\{{\mathcal P}_1,\dots , {\mathcal P}_{s}\}\) and that the unknown true distribution \(p^{*}\) is chosen from \({\mathcal P}^{*}\). Then we have the following upper bound on the expected Bhattacharyya distance between the output of the MDL learning algorithm and the true distribution:

$$\begin{aligned} E_{{\mathcal P}^{*}}E_{{\textbf{x}}\sim p^{*}\in {\mathcal P}^{*}}[d_{B}^{(n)}(\hat{p},p^{*}) ] =O\left( \frac{\mathrm{{Ddim}}({\mathcal F}^{\odot })\log n}{n}\right) , \end{aligned}$$
(36)

where \(\mathrm{{Ddim}}({\mathcal F}^{\odot })\) is Ddim for model fusion as in (16).

The proof is given in the Appendix. This result may be generalized to the agnostic case where the model class misspecifies the true distribution (see also [4] for this case). We omit this result from this manuscript since our main concern is how the expected generalization performance is related to Ddim.

Theorem 3 implies that the NML distribution with model of the shortest NML codelength converges exponentially to the true distribution in probability as n increases and the rate is governed by Ddim for the true model. In conventional studies on PAC (probably approximately correct) learning [12], the performance of the empirical risk minimization algorithm has been analyzed using the technique of uniform convergence, where the rate of convergence is governed by the metric dimension. Meanwhile, the performance of the MDL learning algorithm is analyzed using the non-uniform convergence technique, since the non-uniform model complexity is considered. In this case the rate of convergence of the MDL algorithm is governed by Ddim. Then the expected Bhattacharyya distance between the true distribution and the output of the MDL learning algorithm is characterized by Ddim for model fusion over \({\mathcal F}\).

6 Conclusion

This paper has proposed a novel methodology for detecting signs of model changes from a data stream. The key idea is to conduct continuous model selection using the notion of descriptive dimensionality (Ddim). Ddim quantifies the real-valued model dimensionality in the model transition period. We are able not only to visualize the model complexity in the transition period of model changes, but also to detect their signs by tracking the rise-up/descent of Ddim. Focusing on model changes in Gaussian mixture models, we have shown that gradual structural changes of GMMs can be effectively visualized by drawing a Ddim graph. Furthermore, we have empirically demonstrated that our methodology was able to detect signs of changes of the number of mixture components in a GMM and of the order of an AR model earlier than they were actualized. Experimental results have shown that it was able to detect them significantly earlier than the existing dynamic model selection methods.

This paper has offered the use of continuous model selection in the scenario of model change sign detection only. Exploring other scenarios of continuous model selection remains for future study.