1 Introduction

1.1 Motivation

This paper is concerned with the issue of detecting changes of a model that lies behind a data stream. Here a model refers to discrete structural information, such as the number of free parameters, in the mechanism for generating the data. We consider the situation where the model changes over time. In this setting, it is important to detect model changes as accurately as possible, because they may correspond to important events. For example, it is reported in [14] that when customers’ behaviors are modeled using a Gaussian mixture model, a change in the number of mixture components corresponds to the emergence or disappearance of a cluster of customers’ behaviors. In this case a model change implies a change of the market trend. For another example, it is reported in [35] that when syslog behaviors are modeled using a mixture of hidden Markov models, a change in the number of mixture components may correspond to a system failure.

The issue of model change detection has been extensively explored. This paper is rather concerned with the issue of detecting signs, or early warning signals, of model changes. Why is it important to detect such signs? One reason is that if they were detected earlier than the changes themselves, we could predict the changes before they were actualized. The other reason is that if they were detected after the changes themselves, we could analyze the cause of the changes retrospectively.

A model, say the number of parameters, is in general an integer-valued index. Therefore, a model change appears to occur abruptly. However, it is reasonable to suppose that some intrinsic change, which we call a latent change, occurs gradually behind the model change. We may then define a sign of the model change as the starting point of the latent change. Therefore, if we properly define a real-valued index to quantify the model dimensionality in the transition period, we can understand how rapidly the latent change is progressing and can detect signs of model changes by tracking the rise-up/descent of the index (Fig. 1).

Fig. 1

Transition period of dimensionality change. We consider the situation where the clustering structure changes over time so that the number k of clusters changes from \(k=2\) to \(k=3\). Here k can be thought of as an integer-valued model dimensionality, called the parametric dimensionality. If we define a real-valued intrinsic model dimensionality, called the descriptive dimensionality, then we can quantify the dimensionality in the transition period. For example, k becomes 2.5 at some time point. By tracking the rise-up/descent of such a real-valued model dimensionality, we are able to detect signs of increase of the number of clusters

The key idea of this paper is to employ the notion of descriptive dimensionality (Ddim) for the quantification of a model in the transition period. Ddim is a real-valued index that quantifies the model dimensionality in the case where a number of models are mixed. We thereby establish a methodology of continuous model selection, which determines the optimal real-valued model dimensionality from data on the basis of Ddim. In the transition period of model changes, the mixing structure of models may change over time. Hence, by tracking the rise-up/descent of Ddim, we will be able to track the latent changes behind model changes.

The purpose of this paper is twofold: One is to establish a novel methodology for detecting signs (or early warning signals) of model changes from a data stream. We realize this by using Ddim for the quantification of model dimensionality in its transition period. The theory of Ddim is developed on the basis of the minimum description length (MDL) principle [24] in combination with the theory of the box counting dimension. The other is to empirically validate the effectiveness of the methodology using synthetic and real data sets. We evaluate how early and how reliably it can raise alarms for signs of model changes.

1.2 Related work

Model change detection has been studied in the scenario of dynamic model selection (DMS) developed in [34, 35]. Model change detection is different from classical continuous parameter change detection. Taking finite mixture models as an example, the former is to detect changes in the number of components, while the latter is to detect changes in the real-valued parameters of individual components or in the mixing parameters. In [34, 35], the authors proposed the DMS algorithm, which outputs a model sequence of the shortest description length, on the basis of the MDL principle [24]. They demonstrated its effectiveness from the empirical and information-theoretic aspects. The MDL-based model change detection has been further theoretically justified in [33]. Problems similar to model change detection have been discussed in the scenarios of switching distributions [8], tracking best experts [13], on-line clustering [27], cluster evolution [21], Bayesian change detection [29], and structure break detection for autoregression models [3]. In all of these previous studies, however, a model change was considered to be an abrupt change of a discrete structure. The transition period of changes has never been analyzed there. In the conventional state-space model, change detection of continuous states has been addressed (see e.g. [5]). However, the state itself does not have the same meaning as a model in our sense; it is rather the number of states that corresponds to what we call a model.

Table 1 Comparison of related works on model change detection
Table 2 Comparison of related works on dimensionality

Changes that occur incrementally rather than abruptly have been discussed in the context of detecting incremental changes in concept drift [10], gradual changes [36], volatility shifts [17], etc. However, it has never been quantitatively analyzed how rapidly a model changes in the transition period.

Recently, the indices of structural entropy [16] and graph-based entropy [22] have been developed for measuring the uncertainty associated with model changes. Although they can be thought of as early warning signals of model changes, they cannot quantify the intrinsic model dimensionality nor explain how rapidly a model changes in the transition period. A change sign detection method using differential MDL change statistics has been proposed in [37]. However, it applies to change sign detection for parameters only. We summarize the related work in Table 1 from the viewpoints of abrupt model change detection, model change sign detection, and quantification of model dimensionality.

This paper proposes a methodology for analyzing model transition in terms of real-valued dimensionality. A number of notions of dimensionality have been proposed in the areas of physics and statistics. The metric dimension was proposed by Kolmogorov and Tihomirov [18] to measure the complexity of a given set of points in terms of the notion of covering numbers. This evolved into the notion of the box counting dimension, equivalently, the fractal dimension [20]. It is a real-valued index for quantifying the complexity of a given set. It is also related to the capacity [7]. The Vapnik-Chervonenkis dimension was proposed to measure the power of representation of a given class of functions [28]. It was also related to the rate of uniform convergence of estimating functions. See [12] for relations between dimensionality and learning. Dimensionality as a power of representation is conventionally integer-valued, but when it changes over time, there is no effective non-integer-valued quantification of its transition. The previous notions of dimensionality are summarized in Table 2 from the viewpoints of integer/real-valuedness, characterization of learning rate, and quantification of model change.

Preliminary versions of this paper appeared on arXiv [31, 32].

1.3 Significance of this paper

The significance of this paper is summarized as follows: (1) Proposal of a novel methodology for detecting signs of model changes with continuous model selection. This paper proposes a novel methodology for detecting signs of model changes. The key idea is to track model transitions with continuous model selection using the notion of descriptive dimensionality  (Ddim). It measures the model dimensionality in the case where a number of models with different dimensionalities are mixed.

For example, we employ the Gaussian mixture model (GMM) to consider the situation where the number of mixture components changes over time. We suppose that in the transition period of model change, a number of probabilistic models with various mixture sizes are fused. We give a method for calculating Ddim for this case. The transition period of model change can be visualized by drawing a Ddim graph versus time. Once a Ddim graph is obtained, we can understand how rapidly the model changes over time. We eventually detect signs of model changes by tracking the rise-up/descent of Ddim. This methodology is significantly important in data mining since it helps us predict model changes in earlier stages.

(2) Empirical demonstration of effectiveness of model change sign detection via Ddim. We empirically validate how early we are able to detect signs of model changes with continuous model selection, for GMMs and auto-regression (AR) models. With synthetic data sets and real data sets, we illustrate that our method is able to effectively visualize the transition period of model change using Ddim. We further empirically demonstrate that our methodology is able to detect signs of model changes significantly earlier than any existing dynamic model selection algorithms and is comparable to structural entropy in [16]. Through our empirical analysis, we demonstrate that Ddim is an effective index for measuring the model dimensionality in the model transition period.

(3) Giving theoretical foundations for Ddim. In this paper, Ddim plays a central role in continuous model selection. We introduce this notion from an information-theoretic view based on the MDL principle [24] (see also [11]). We show that Ddim coincides with the number of free parameters in the case where the model consists of a single parametric class. We also derive Ddim for the case where a number of models with different dimensionalities are mixed. We characterize Ddim by demonstrating that it governs the rate of convergence of the MDL-based learning algorithm. This corresponds to the fact that the metric dimensionality governs the rate of convergence of the empirical risk minimization algorithm in statistical learning theory [12].

The rest of this paper is organized as follows: Section 2 introduces the notion of Ddim. Section 3 gives a methodology for model change sign detection via Ddim. Section 4 shows experimental results. Section 5 characterizes Ddim by relating it to the rate of convergence of the MDL learning algorithm. Section 6 concludes the paper. Source codes and data sets are available at a GitHub repository [38].

2 Descriptive dimensionality

2.1 NML and parametric complexity

This section introduces the theory of Ddim. This theory is based on the MDL principle (see [24] for the original paper and [25] for the recent advances) from the viewpoint of information theory. We start by introducing a number of fundamental notions of the MDL principle.

Let \({\mathcal X}\) be the data domain where \({\mathcal X}\) is either discrete or continuous. Without loss of generality, we assume that \({\mathcal X}\) is discrete. Let \({\textbf{x}}=x_{1},\dots ,x_{n}\in {\mathcal X}^{n}\) be a data sequence of length n. We assume that each \(x_{i}\) is independently generated. Let \({\mathcal P}=\{p({\textbf{x}})\}\) be a class of probabilistic models where \(p({\textbf{x}})\) is a probability mass function or a probability density function. Hereafter, we assume that for any \({\textbf{x}}\), the maximum of \(p({\textbf{x}})\) with respect to p exists.

Under the MDL principle, the information of a datum \({\textbf{x}}\) is measured in terms of description length, i.e., the codelength required for encoding the datum with a prefix coding method. We may encode \({\textbf{x}}\) with the help of a class \({\mathcal P}\) of probability distributions. One of the most important methods for calculating the codelength of \({\textbf{x}}\) using \({\mathcal P}\) is the normalized maximum likelihood (NML) coding [25]. This is defined as the codelength associated with the NML distribution as follows:

Definition 1

We define the normalized maximum likelihood (NML) distribution over \({\mathcal X}^{n}\) with respect to \({\mathcal P}\) by

$$\begin{aligned} p_{_\textrm{NML}}({\textbf{x}};{\mathcal P})\buildrel \text {def} \over =\frac{\max _{p\in {\mathcal P}}p({\textbf{x}})}{\sum _{{\textbf{y}}} \max _{p\in {\mathcal P}}p({\textbf{y}})}. \end{aligned}$$
(1)

The normalized maximum likelihood (NML) codelength of \({\textbf{x}}\) relative to \({\mathcal P}\), which we denote as \(L_{_\textrm{NML}}({\textbf{x}}; {\mathcal P})\), is given as follows:

$$\begin{aligned} \begin{aligned} L_{_\textrm{NML}}({\textbf{x}};{\mathcal P})\buildrel \text {def} \over =&-\log p_{_\textrm{NML}}({\textbf{x}}; {\mathcal P}) \\ =&-\log \max _{p\in {\mathcal P}}p({\textbf{x}})+\log {\mathcal C}_{n}({\mathcal P}),\\ \end{aligned} \end{aligned}$$
(2)

where

$$\begin{aligned} \log {\mathcal C}_{n}({\mathcal P})\buildrel \text {def} \over =\log \sum _{{\textbf{y}}} \max _{p\in {\mathcal P}}p({\textbf{y}}). \end{aligned}$$
(3)

The first term in (2) is the negative logarithm of the maximum likelihood, while the second term (3) is the logarithm of the normalization term. The latter is called the parametric complexity of \({\mathcal P}\) [25]. It represents the information-theoretic complexity of the model class \({\mathcal P}\) relative to the length n of the data sequence. The NML codelength can be thought of as an extension of the Shannon information \(-\log p({\textbf{x}})\) to the case where the true model p is unknown but only \({\mathcal P}\) is known.
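As a concrete illustration, the following sketch computes the NML codelength (2) for the one-parameter Bernoulli class, evaluating the normalization term (3) by brute-force enumeration over \(\{0,1\}^{n}\). The function names and the toy sequence are our own; the sketch is only meant to make the definitions tangible for small n.

```python
import math
from itertools import product

def max_loglik_bernoulli(x):
    """Maximized log-likelihood max_p log p(x) over the Bernoulli class."""
    n, k = len(x), sum(x)
    if k in (0, n):                      # the ML probability of the sequence is then 1
        return 0.0
    return k * math.log(k / n) + (n - k) * math.log((n - k) / n)

def nml_codelength_bernoulli(x):
    """NML codelength (2): negative maximized log-likelihood plus log C_n from (3)."""
    n = len(x)
    log_C = math.log(sum(math.exp(max_loglik_bernoulli(y))
                         for y in product((0, 1), repeat=n)))
    return -max_loglik_bernoulli(x) + log_C

print(nml_codelength_bernoulli((1, 0, 0, 1, 1, 0, 1, 1)))  # codelength in nats
```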

In order to understand the meaning of the NML codelength and the parametric complexity, we define the minimax regret as follows:

$$\begin{aligned} R_{n}({\mathcal P}) \buildrel \text {def} \over =\min _{q}\max _{{\textbf{x}}}\left\{ -\log q({\textbf{x}})-\min _{p\in {\mathcal P}}(-\log p({\textbf{x}}))\right\} , \end{aligned}$$

where the minimum is taken over the set of all probability distributions. The minimax regret means the descriptive complexity of the model class, indicating how much any codelength must deviate from the smallest negative log-likelihood over the model class. Shtarkov [26] proved that the NML distribution (1) is optimal in the sense that it attains the minimum of the minimax regret. In this sense the NML codelength is the optimal codelength for encoding \({\textbf{x}}\) for given \({\mathcal P}\). Then we can immediately see that the minimax regret coincides with the parametric complexity. That is,

$$\begin{aligned} R_{n}({\mathcal P})=\log {\mathcal C}_{n}({\mathcal P}). \end{aligned}$$
(4)

We next consider how to calculate the parametric complexity. According to [25] (pp:43-44), the parametric complexity can be rewritten using a variable transformation technique as follows:

$$\begin{aligned} C_{n}({\mathcal P})= \sum _{{\textbf{y}}} \max _{p\in {\mathcal P}}p({\textbf{y}}) =\int g(\hat{p}, \hat{p})d\hat{p}, \end{aligned}$$
(5)

where \(g(\hat{p},p)\) is defined as

$$\begin{aligned} g(\hat{p}, p) \buildrel \text {def} \over = \sum _{{\textbf{y}}:\max _{\bar{p}\in {\mathcal P}}\bar{p}({\textbf{y}})=\hat{p}({\textbf{y}})} p({\textbf{y}}). \end{aligned}$$
(6)

2.2 Definition of descriptive dimension

Below we give the definition of Ddim from the viewpoint of approximating the parametric complexity, or equivalently, the minimax regret (by (4)). The scenario for defining Ddim is as follows: We first count how many points are required to approximate the parametric complexity (5) with quantization. We consider that count as the information-theoretic richness of representation of a model class. We then employ that count to define Ddim in a manner similar to the box counting dimension.

We consider approximating (5) by a finite sum of partial integrals of \(g(\hat{p},\hat{p})\). Let \(\overline{{\mathcal P}}=\{p _{1}, p _{2},\dots \}\subset {\mathcal P}\) be a finite subset of \({\mathcal P}\). Let \(\epsilon \) be the parameter defining the diameter of the neighborhood of a given probability distribution. For \(\epsilon >0\) and \(p_{i}\in \overline{{\mathcal P}}\), let \(D_{\epsilon }^{n}(i)\buildrel \text {def} \over =\{p\in {\mathcal P} :\ d_{n}(p_{i},p)\le \epsilon ^{2}\}\), where \(d_{n}\) is the Kullback-Leibler (KL) divergence normalized by n:

$$\begin{aligned} d_{n}(p, p_{i})=\frac{1}{n}\sum _{{\textbf{x}}} p_{i}({\textbf{x}})\log \frac{p_{i}({\textbf{x}})}{p({\textbf{x}})}. \end{aligned}$$

Then we approximate \(C_{n}({\mathcal P})\) by

$$\begin{aligned} \overline{{C}_{n}}(\overline{{\mathcal P}}){\buildrel \text {def} \over = }\sum _{i} Q_{\epsilon }(i), \end{aligned}$$
(7)

where

$$\begin{aligned} Q_{\epsilon }(i){\buildrel \text {def} \over =}\int _{\hat{p}\in D_{\epsilon }^{n}(i)}g(\hat{p}, \hat{p})d\hat{p}. \end{aligned}$$
(8)

That is, (7) gives an approximation to \(C_{n}({\mathcal P})\) by a finite sum of integrals of \(g(\hat{p}, \hat{p})\) over the \(\epsilon ^{2}\)-neighborhoods of the points \(p_{i}\). We define \(m_{n}(\epsilon :{\mathcal P})\) as the smallest number of points \(\mid \overline{\mathcal P}\mid \) with respect to \(\overline{\mathcal P}\) such that \(C_{n}({\mathcal P}) \le \overline{C}_{n}(\overline{{\mathcal P}})\). More precisely,

$$\begin{aligned} m_{n}(\epsilon :{\mathcal P}){\buildrel \text {def} \over =}\min _{\overline{{\mathcal P}}} \mid \overline{{\mathcal P}}\mid \ \ \text {subject to}\ C_{n}({\mathcal P})\le \overline{C_{n}}(\overline{{\mathcal P}}). \end{aligned}$$
(9)

We are now led to the definition of descriptive dimension.

Definition 2

[31] Let \({\mathcal P}\) be a class of probability distributions. We let \(m(\epsilon :{\mathcal P})\) be the one obtained by choosing \(\epsilon ^{2}n=O(1)\) in \(m_{n}(\epsilon :{\mathcal P} )\) as in (9). We define the descriptive dimension (Ddim) of \({\mathcal P}\) by

$$\begin{aligned} \text {Ddim}({\mathcal P}){\buildrel \text {def} \over =}\lim _{\epsilon \rightarrow 0}\frac{\log m(\epsilon : {\mathcal P})}{\log (1/\epsilon )}, \end{aligned}$$
(10)

when the limit exists.

The definition of Ddim is similar to that of the box counting dimension [7, 9, 20]. The main difference between them is how the number of points is counted. Ddim is calculated on the basis of the number of points required for approximating the parametric complexity, while the box counting dimension is calculated on the basis of the number of points required for covering a given object with their \(\epsilon \)-neighborhoods.
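The limit (10) also suggests a simple numerical reading: Ddim is the slope of \(\log m(\epsilon )\) plotted against \(\log (1/\epsilon )\). The sketch below estimates such a slope from synthetic counts obeying \(m(\epsilon )\propto \epsilon ^{-d}\); it is a consistency check of the definition rather than a procedure taken from the paper.

```python
import numpy as np

def dimension_from_counts(eps, counts):
    """Estimate a dimension as the least-squares slope of log m(eps) versus log(1/eps)."""
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(eps)), np.log(np.asarray(counts)), 1)
    return slope

eps = np.array([0.1, 0.05, 0.02, 0.01, 0.005])
counts = 3.0 * eps ** (-2.5)               # synthetic counts with m(eps) = c * eps^(-d), d = 2.5
print(dimension_from_counts(eps, counts))  # approximately 2.5
```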

Consider the case where \({\mathcal P}_{k}\) is a k-dimensional parametric class, i.e., \({\mathcal P}_{k}=\{p({\textbf{x}};\theta ):\ \theta \in \Theta _{k}\subset {\mathbb R}^{k}\}\), where \(\Theta _{k}\) is a k-dimensional real-valued parameter space. Let \(p({\textbf{x}};\theta )=f({\textbf{x}}\mid \hat{\theta }({\textbf{x}}))g(\hat{\theta }({\textbf{x}});\theta )\) for the conditional probabilistic mass function \(f({\textbf{x}}\mid \hat{\theta }({\textbf{x}}))\). We then write g according to (6) as follows

$$\begin{aligned} g(\hat{\theta },\theta )= \sum \limits _{{\textbf{x}}:\text {argmax}_{\theta }p({\textbf{x}};\theta )=\hat{\theta }}p({\textbf{x}};\theta ). \end{aligned}$$
(11)

Assume that the central limit theorem holds for the maximum likelihood estimator of a parameter vector \(\theta \). Then according to [25], we can take a Gaussian density function as (11) asymptotically. That is, for sufficiently large n, (11) can be approximated as:

$$\begin{aligned} g(\hat{\theta }, \theta )\simeq \left( \frac{n}{2\pi }\right) ^{\frac{k}{2}}\mid I_{n}(\theta )\mid ^{\frac{1}{2}}e^{-n(\hat{\theta }-\theta )^{\top }I_{n}(\theta )(\hat{\theta }-\theta )/2}, \end{aligned}$$
(12)

where \(I_{n}(\theta ){\buildrel \text {def} \over =} (1/n)E_{\theta }[-\partial ^{2}\log p({\textbf{x}};\theta )/\partial \theta \partial \theta ^{\top }]\) is the Fisher information matrix.

The following theorem shows the basic property of \(m_{n}(\epsilon :{\mathcal P}_{k})\) for the parametric case.

Theorem 1

Suppose that \(p({\textbf{x}};\theta )\in {\mathcal P}_{k}\) is continuously three-times differentiable with respect to \(\theta \). Under the assumption of the central limit theorem so that (12) holds, for sufficiently large n, we have

$$\begin{aligned} \log C_{n}({\mathcal P}_{k}) = \log m_{n}(1/\sqrt{n} :{\mathcal P}_{k})+O(1). \end{aligned}$$
(13)

The proof is given in the Appendix.

It is known [25] (p. 53) that under a regularity condition ensuring that the central limit theorem holds for the maximum likelihood estimator of \(\theta \), the parametric complexity for \({\mathcal P}_{k}\) is asymptotically expanded as

$$\begin{aligned} \log C_{n}({\mathcal P}_{k})=\frac{k}{2}\log \frac{n}{2\pi }+\log \int \sqrt{\mid I(\theta )\mid }d\theta +o(1), \end{aligned}$$
(14)

where \(I(\theta )\) is the Fisher information matrix: \(I(\theta )\buildrel \text {def} \over =\text {lim}_{n\rightarrow \infty }(1/n)\times \) \({\text {E}}_{\theta }[-\partial ^{2}\log p({\textbf{x}};\theta )/\partial \theta \partial \theta ^{\top }]\). Plugging (13) with (14) for \(\epsilon ^{2}n=O(1)\) into (10) yields the following theorem.

Theorem 2

For a k-dimensional parametric class \({\mathcal P}_{k}\), under the regularity condition for \({\mathcal P}_{k}\) as in Theorem 1, we have

$$\begin{aligned} \textrm{Ddim}({\mathcal P}_{k})=k. \end{aligned}$$
(15)

Theorem 2 shows that when the model class is a single parametric one, Ddim coincides with the conventional notion of dimensionality (the number of free parameters), which we call the parametric dimensionality in the rest of this paper.
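In outline, the computation behind Theorem 2 combines (13) and (14) with the choice \(\epsilon =1/\sqrt{n}\) of Definition 2 (suppressing the \(O(1)\) and \(o(1)\) terms):

$$\begin{aligned} \log m(\epsilon :{\mathcal P}_{k})&=\log C_{n}({\mathcal P}_{k})+O(1)=\frac{k}{2}\log \frac{n}{2\pi }+O(1)=k\log \frac{1}{\epsilon }+O(1),\\ \text {Ddim}({\mathcal P}_{k})&=\lim _{\epsilon \rightarrow 0}\frac{k\log (1/\epsilon )+O(1)}{\log (1/\epsilon )}=k. \end{aligned}$$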

Ddim can also be defined even for the case where the model class is not a single parametric class. Hence Theorem 2 implies that Ddim is a natural extension of the parametric dimensionality.

Let us consider model fusion where a number of model classes are probabilistically mixed. Let \({\mathcal F}=\{ {\mathcal P}_{1},\dots , {\mathcal P}_{s}\}\) be a family of model classes and assume a model class is probabilistically distributed according to \(p({\mathcal P})\) over \({\mathcal F}\). We denote the model fusion over \({\mathcal F}\) as \({\mathcal F}^{\odot }={\mathcal P}_{1}\odot \cdots \odot {\mathcal P}_{s}\). We may interpret the resulting distribution over \({\mathcal X}\) as a finite mixture model [19] of a number of model classes with different dimensionalities. Then Ddim of \({\mathcal F}^{\odot }\) is calculated as

$$\begin{aligned} \lim _{\epsilon \rightarrow 0}\frac{\log E_{{\mathcal P}}[m(\epsilon : {\mathcal P})]}{\log (1/\epsilon )}\ge & {} \lim _{\epsilon \rightarrow 0}\sum ^{s}_{i=1}p({\mathcal P}_{i})\frac{\log m(\epsilon : {\mathcal P}_{i})}{\log (1/\epsilon )}\nonumber \\= & {} \sum _{i=1}^{s}p({\mathcal P}_{i})\text {Ddim}({\mathcal P}_{i}), \end{aligned}$$
(16)

where we have used Jensen’s inequality to derive the first inequality. We call the lower bound (16) the pseudo Ddim for model fusion \({\mathcal F}^{\odot }\). In the rest of this paper, we adopt it as the Ddim value for model fusion and write it as \(\underline{\text {Ddim}}({\mathcal F}^{\odot })\). Model fusion is a reasonable setting when we consider the transition period of model changes. Note that Ddim is then no longer integer-valued.

Fig. 2

Continuous model selection. We consider the situation where the number k of clusters in a GMM changes from two to three as time goes by. In the transition period, one cluster gradually collapses into two clusters. Ddim is a continuous variant of the number of clusters, hence Ddim may change continuously in the transition period, taking a value in (2, 3). It represents the gradual change of the clustering structure well

3 Model change sign detection

3.1 Continuous model selection for GMMs

This section proposes a methodology for detecting signs of model changes with continuous model selection. We first focus on the case where the model is a Gaussian mixture model (GMM). The problem setting is as follows: At each time we obtain a number of unlabeled multi-dimensional examples. By observing such examples sequentially, we obtain a data stream of the examples. At each time, we may conduct clustering of the examples using GMMs. Assuming that the number of components in the GMM may change over time, we aim to detect such changes and their signs.

The key idea is to conduct continuous model selection, which is to determine the real-valued model dimensionality on the basis of Ddim. Below we give a scenario of continuous model selection with applications to model change sign detection. Let \({\mathcal P}_{k}\) be a class of GMMs with k components. We consider the situation where the structure of GMM gradually changes over time (Fig. 2), while the model k may abruptly change. The key observation is that during the model transition period, model fusion occurs where a number of GMMs with different ks are probabilistically mixed according to the posterior probability distribution. Then model dimensionality in the transition period can be calculated as Ddim of model fusion. Thus we can detect signs of model changes by tracking the rise-up/descent of Ddim.

Below let us formalize the above scenario. Let \({\mathcal X}\) be an m-dimensional real-valued domain and let \(x\in {\mathcal X}\) be an observed datum. Let \(z\in \{1,\dots , k\}\) be a latent variable indicating which component x comes from. Let \(\mu _{i}\in {\mathbb R}^{m}\), \(\Sigma _{i}\in {\mathbb R}^{m\times m}\) be the mean vector and variance-covariance matrix for the ith component, respectively. Let \(\mu =(\mu _{1},\dots ,\mu _{k})\) and \(\Sigma =(\Sigma _{1},\dots , \Sigma _{k})\). Let \(\sum _{i}\pi _{i}=1,\ \pi _{i}\ge 0\ (i=1,\dots ,k)\). Let \(\theta =(\mu _{i}, \Sigma _{i}, \pi _{i})\mid _{i=1,\dots ,k}\). Then a complete variable model of GMM with k-components is given by

$$\begin{aligned} p( x,z; \theta ,k )= & {} p(x\mid z; \mu , \Sigma )p(z; \pi ), \end{aligned}$$

where

$$\begin{aligned}&p( x\mid z=i; \mu , \Sigma ) =\frac{1}{(2\pi )^{\frac{m}{2}}\cdot \mid \Sigma _i \mid ^{\frac{1}{2}}} \exp \left\{ -\frac{1}{2} (x-\mu _i)^{\top } \Sigma _i^{-1} (x-\mu _i) \right\} , \nonumber \\&p(z=i;\pi )=\pi _{i}\ \ (i=1, \dots , k). \end{aligned}$$
(17)

Let \({\textbf{x}}=x_{1},\dots , x_{n}\) be a sequence of observed variables of length n. Let \(z_{j}\) denote a latent variable which corresponds to \(x_{j}\) and \({\textbf{z}}=z_{1},\dots , z_{n}\). Let \({\textbf{y}}=({\textbf{x}},{\textbf{z}})\) be a complete variable. Let \(\hat{\mu }_{i}, \hat{\Sigma } _{i}\) be the maximum likelihood estimators of \(\mu _{i}, \Sigma _{i}\) \((i=1,\dots ,k)\) for given \({\textbf{y}}\). Let \(\hat{\pi }_{i}=n_{i}/n\) where \(n_{i}\) is the number of occurrences in \(\varvec{z}\) such that \(z=i\) \((i=1,\dots ,k)\) and \(\sum ^{k}_{i=1}n_{i}=n\). \({\textbf{z}}\) may be estimated by sampling from the posterior probability obtained by the EM algorithm. Let \(\hat{\theta }({\textbf{y}})=(\hat{\pi }_{i},\hat{\mu } _{i},\hat{\Sigma } _{i})\mid _{i=1,\dots ,k}\).
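As one concrete way to obtain \(\hat{\theta }({\textbf{y}})\), the sketch below fits a k-component GMM by EM, samples the latent assignments \({\textbf{z}}\) from the resulting posterior, and computes the complete-variable maximum likelihood estimates. The use of scikit-learn here is our illustration, not necessarily the implementation used in the paper; it also assumes every component receives at least one sampled point.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def complete_variable_estimates(X, k, seed=0):
    """Fit a k-component GMM by EM, sample z from the posterior p(z | x), and return
    the maximum likelihood estimates (pi_hat, mu_hat, Sigma_hat) for the complete data."""
    rng = np.random.default_rng(seed)
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=seed).fit(X)
    post = gmm.predict_proba(X)                        # posterior over components for each x_j
    z = np.array([rng.choice(k, p=p) for p in post])   # sampled latent assignments
    pi_hat = np.bincount(z, minlength=k) / len(X)      # pi_i = n_i / n
    mu_hat = np.array([X[z == i].mean(axis=0) for i in range(k)])
    Sigma_hat = np.array([np.cov(X[z == i].T, bias=True) for i in range(k)])
    return pi_hat, mu_hat, Sigma_hat, z
```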

The NML codelength of \({\textbf{y}}\) for a complete variable model of a GMM is given by

$$\begin{aligned} L_{_\textrm{NML}}({\textbf{y}};k)= & {} -\log p_{_\textrm{NML}}({\textbf{y}};k) \nonumber \\= & {} -\log p({\textbf{y}}; \hat{\theta }({\textbf{y}}) ,k) + \log \mathcal {C}_{n}(k), \end{aligned}$$
(18)

where \(\mathcal {C}_{n}(k)\) is a parametric complexity for a GMM. According to [15], an upper bound on \(\mathcal {C}_{n}(k)\) is given as follows:

$$\begin{aligned} \mathcal {C}_{n}(k)&\le \sum \limits _{n_1,\cdots , n_k} \frac{n!}{n_1! \cdots n_k !} \times \prod \limits _{i=1}^{k} \left( \frac{n_i}{n} \right) ^{n_i} B(m,R,\epsilon ) \nonumber \\&\times \left( \frac{n_{i}}{2e} \right) ^{\frac{mn_i}{2}} \left( \Gamma _ m\left( \frac{n_i-1}{2}\right) \right) ^{-1}, \end{aligned}$$
(19)

where

$$\begin{aligned} B(m,R,\epsilon ) \overset{\text {def}}{=}\frac{2^{m+1}R^{\frac{m}{2}}\epsilon ^{-\frac{m^{2}}{2}}}{m^{m+1}\cdot \Gamma \left( \frac{m}{2} \right) }, \end{aligned}$$

where R is a positive constant such that \(\parallel \hat{\mu }_{i} \parallel ^{2}\le R\) for all i, and \(\epsilon \) is a positive constant giving a lower bound on the smallest eigenvalue of \(\Sigma _{i}\) for any i. \(\Gamma _{m}\) is the multivariate Gamma function defined as \(\Gamma _{m}(x)=\pi ^{\frac{m(m-1)}{4}}\prod ^{m}_{j=1}\Gamma (x+\frac{1-j}{2})\) and \(\Gamma \) is the Gamma function. We use the bound (19) as the value of \({\mathcal C}_{n}(k)\). It is known [14] that \(C_{n}(k)\) is computable in time \(O(n^{2}k)\).

At each time t, we observe a data sequence \({\textbf{x}}_{t} =x_{1},\dots , x_{n}\in {\mathcal X}^{n}\) of length n. We sequentially observe such data as shown in Fig. 2. Let \({\textbf{x}}^{T}={\textbf{x}}_{1},\dots ,{\textbf{x}}_{T}\ ({\textbf{x}}_{t}\in {\mathcal X}^{n},\ t=1,\dots ,T)\) be an observed data sequence. The length n may vary over time. We denote the joint sequence of observed variables and latent variables at time t as \({\textbf{y}}_{t}=({\textbf{x}}_{t},{\textbf{z}}_{t})\).

We suppose that a number of GMMs with different ks are fused according to the probability distribution \(p(k\mid {\textbf{y}}_{t})\) at each time. We define \(p(k\mid {\textbf{y}}_{t})\) as the annealed posterior probability of k for \({\textbf{y}}_{t}\):

$$\begin{aligned}&p(k\mid {\textbf{y}}_{t}) \buildrel \text {def} \over =\frac{(p_{_\textrm{NML}}({\textbf{y}}_{t};k)p(k\mid k_{t-1}))^{\beta }}{\sum _{k'}(p_{_\textrm{NML}}({\textbf{y}}_{t};k')p(k'\mid k_{t-1}))^{\beta }} \\ \nonumber&= \frac{\exp (-\beta L_{_\textrm{NML}}({\textbf{x}}_{t},{\textbf{z}}_{t};k)+\beta \log p(k\mid k_{t-1}))}{\sum _{k'}\exp (-\beta L_{_\textrm{NML}}({\textbf{x}}_{t},{\textbf{z}}_{t};k')+\beta \log p(k'\mid k_{t-1}))}, \end{aligned}$$
(20)

where \(k_{t-1}\) is the dimensionality estimated at time \(t-1\), and

$$\begin{aligned} p (k \mid k_{t-1}) \buildrel \text {def} \over =\left\{ \begin{array}{ll} 1-\gamma &{} \text{ if }~ k=k_{t-1}~\text {and}~k_{t-1} \ne 1,k_\text {max},\\ 1-\gamma /2~&{} \text{ if }~ k=k_{t-1}~\text {and}~k_{t-1} = 1,k_\text {max},\\ \gamma /2~ &{}\text{ if }~ k=k_{t-1} \pm 1. \end{array}\right. \end{aligned}$$
(21)

\(\gamma (0<\gamma <1)\) is a parameter, and \(k_{\max }\) is the maximum value of k. We estimate \(\gamma \) using the MAP estimator with the beta distribution \(\textrm{Beta}(a,b)\) being the prior where the density function of \(\text {Beta}(a,b)\) is proportional to \(\gamma ^{a-1}(1-\gamma )^{b-1}\) for hyper-parameters a and b. The MAP estimator of \(\gamma \) is given as follows:

$$\begin{aligned} \hat{\gamma } =\frac{N_t+a-1}{t+a+b-2}, \end{aligned}$$

where \(N_t\) shows how many times the number of clusters has changed up to time \(t-1\). In the experiments to follow, we set \((a, b)=(2,10)\). \(\beta (>0)\) is the temperature parameter. In our experiments, we set \(\beta \) as

$$\begin{aligned} \beta =1/\sqrt{n}. \end{aligned}$$
(22)

This is due to the PAC-Bayesian argument [1] for Gibbs posteriors.

Note that (20) is calculated on the basis of the NML distribution. This is because the probability distribution with unknown parameters should be estimated as the NML distribution since it is the optimal distribution in terms of the minimax regret (see Section 2.1).

By (16), we can calculate Ddim of model fusion of GMMs with various ks at time t as

$$\begin{aligned} \underline{\text {Ddim}}({\mathcal F}^{\odot })_{t}= \sum _{k}p_{t}({\mathcal P}_k)\text {Ddim}({\mathcal P}_{k}), \end{aligned}$$
(23)

where \(p_{t}({\mathcal P}_{k})\) is the probability of \({\mathcal P}_{k}\) at time t, and \(p_{t}({\mathcal P}_{k})=p(k\mid {\textbf{y}}_{t})\) in this case. Note that Ddim for GMM with k components is \(k(m^{2}/2+(5m/2))-1\approx kf(m)\) where \(f(m)=m^{2}/2+(5m/2)\). However, in order to focus on the mixture size, we divide the true Ddim by f(m) to consider an alternative Ddim of the form of (24).

$$\begin{aligned} \underline{\text {Ddim}}({\mathcal F}^{\odot })_{t}&\buildrel \text {def} \over =&\sum _{k}p(k\mid {\textbf{y}}_{t})\text {Ddim}({\mathcal P}_{k})/f(m) \nonumber \\\approx & {} \sum _{k}p(k\mid {\textbf{y}}_{t})k. \end{aligned}$$
(24)

Computing the real-valued k according to (24) is exactly what we mean by continuous model selection.
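A minimal sketch of (20), (21), (22) and (24): given the NML codelengths \(L_{\text {NML}}({\textbf{y}}_{t};k)\) for the candidate values of k (assumed to be computed via (18) and (19)), it returns the annealed posterior and the approximate Ddim value \(\sum _{k}p(k\mid {\textbf{y}}_{t})k\). The function names and the toy codelengths are ours.

```python
import numpy as np

def transition_prior(k, k_prev, gamma, k_max):
    """Transition probability p(k | k_{t-1}) as in (21)."""
    if k == k_prev:
        return 1 - gamma if 1 < k_prev < k_max else 1 - gamma / 2
    return gamma / 2 if abs(k - k_prev) == 1 else 0.0

def ddim_gmm_fusion(codelengths, k_prev, gamma, n, k_max):
    """Annealed posterior (20) with beta = 1/sqrt(n) as in (22), and Ddim as in (24).
    `codelengths` maps k to L_NML(y_t; k) in nats."""
    beta = 1.0 / np.sqrt(n)
    ks = np.array(sorted(codelengths))
    logits = np.array([-beta * codelengths[k]
                       + beta * np.log(transition_prior(k, k_prev, gamma, k_max) + 1e-300)
                       for k in ks])
    post = np.exp(logits - logits.max())
    post /= post.sum()
    return float(post @ ks), dict(zip(ks.tolist(), post.tolist()))

# Toy usage with hypothetical codelengths for candidate k in {2, 3, 4}.
ddim_t, posterior = ddim_gmm_fusion({2: 5120.0, 3: 5105.0, 4: 5150.0},
                                    k_prev=2, gamma=0.1, n=1000, k_max=10)
print(ddim_t)  # a real value between 2 and 3, as expected in a transition period
```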

Suppose that there exists a true parametric dimensionality \(k^{*}\). Then because of the consistency of MDL model estimation ([25], pp:63-69),

$$\begin{aligned} p(\hat{k}=k^{*}\mid {\textbf{y}}_{t}) \rightarrow 1 \end{aligned}$$

for \(\hat{k}\) minimizing the NML codelength as n increases. Hence (24) will coincide with \(k^{*}\) with probability 1 as n goes to infinity. This implies that (24) is a natural extension of the parametric dimensionality.

3.2 Model change sign detection algorithms

Consider the situation where we sequentially observe a complete variable sequence: \({\textbf{y}}_{1}, {\textbf{y}}_{2},\dots ,{\textbf{y}}_{T}\). We then obtain a Ddim graph:

$$\begin{aligned} \{(t, \underline{\text {Ddim}}({\mathcal F}^{\odot })_{t}): t=1,2,\dots , T\}, \end{aligned}$$

as in Fig. 1. We can visualize the transition period by drawing the Ddim graph versus time. In this paper we have defined a sign of a model change as the starting point of the latent gradual change associated with it. Thus we can detect signs of model changes by looking at the rise-up/descent of Ddim.

More precisely, we propose the following two methods for raising alarms of model change signs.

1) Thresholding method (TH): We raise an alarm if the absolute difference between Ddim and the baseline exceeds a given threshold \(\delta _{1}\). The baseline is the parametric dimensionality estimated by the sequential dynamic model selection algorithm (SDMS) [14], which is a sequential variant of DMS in [34, 35]. It outputs a model \(k=\hat{k}\) with the shortest codelength, i.e., for \(\lambda >0\),

$$\begin{aligned} \hat{k}=\underset{k}{\text {argmin}}\,\{L_{_{\text {NML}}}({\textbf{y}}_{t};k)-\lambda \log p(k\mid k_{t-1})\}, \end{aligned}$$
(25)

where \(L_{_{\text {NML}}}({\textbf{y}}_{t};k)\) is calculated as in (18). Letting \(\hat{k}\) be the output of SDMS and \(\underline{\text {Ddim}}_{t}\) be Ddim of model fusion at time t, we raise an alarm if

$$\begin{aligned} \text {TH}{-}\text {Score}\buildrel \text {def} \over =\mid \underline{\text {Ddim} }_{t}-\hat{k}\mid > \delta _{1}. \end{aligned}$$
(26)

2) Differential method (Diff): We raise an alarm if the time difference of Ddim exceeds a given threshold \(\delta _{2}\). That is, we raise an alarm if

$$\begin{aligned} \text {Diff}{-}\text {Score}\buildrel \text {def} \over =\mid \underline{\text {Ddim}}_{t}-\underline{\text {Ddim}}_{t-1}\mid > \delta _{2}. \end{aligned}$$
(27)

The computational complexity of TH and Diff at each time t is governed by that of computing the NML codelength (18). The first term in (18) is computable in time O(nk). The second term in (18) is computable in time \(O(n^{2}k)\) [15], but it does not depend on the data; hence it can be calculated beforehand for various n and k and referred to when necessary. Hence the computational complexity of TH and Diff at each time is O(nK), where K is an upper bound on k.
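The two alarm rules (26) and (27) amount to simple thresholding of the Ddim trajectory. A sketch, assuming the Ddim values and the SDMS outputs \(\hat{k}_{t}\) have already been computed (the toy numbers below are hypothetical):

```python
def th_alarms(ddim, k_hat, delta1):
    """Thresholding method (26): alarm when |Ddim_t - k_hat_t| > delta1."""
    return [t for t, (d, k) in enumerate(zip(ddim, k_hat)) if abs(d - k) > delta1]

def diff_alarms(ddim, delta2):
    """Differential method (27): alarm when |Ddim_t - Ddim_{t-1}| > delta2."""
    return [t for t in range(1, len(ddim)) if abs(ddim[t] - ddim[t - 1]) > delta2]

ddim = [2.0, 2.02, 2.15, 2.48, 2.85, 3.0]   # hypothetical Ddim graph
k_hat = [2, 2, 2, 2, 3, 3]                  # hypothetical SDMS outputs
print(th_alarms(ddim, k_hat, delta1=0.1))   # [2, 3, 4]
print(diff_alarms(ddim, delta2=0.1))        # [2, 3, 4, 5]
```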

3.3 Continuous model selection for AR model

The above methodology can be applied to general classes of finite mixture models other than GMMs. It can also be applied to general parametric probabilistic model classes. We illustrate the case of the auto-regression (AR) model as an example. This model does not include latent variables.

Let a data sequence \(\{x_{t}\},\ x_t\in {\mathbb R}\ (t=1,2,\dots ,T)\) be given. For the modeling of the data sequence, we consider AR(k) (k-th order auto-regression) model of the form:

$$\begin{aligned} x_{t}=a_{1}x_{t-1}+\cdots + a_{k}x_{t-k}+\epsilon , \end{aligned}$$

where \(a_{i}\in {\mathbb R}\ (i=1,\dots ,k)\) are unknown parameters, and \(\epsilon \) is a random variable following the Gaussian distribution with mean 0 and unknown variance \(\sigma ^{2}\). We set \(\theta =(a_{1},\dots ,a_{k},\sigma ^{2})\).
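For reference, the maximum likelihood estimate \(\hat{\theta }=(\hat{a}_{1},\dots ,\hat{a}_{k},\hat{\sigma }^{2})\) of AR(k) reduces to least squares; a minimal sketch, conditioning on the first k observations (an assumption of ours for simplicity):

```python
import numpy as np

def fit_ar(x, k):
    """Least-squares (conditional maximum likelihood) fit of an AR(k) model
    x_t = a_1 x_{t-1} + ... + a_k x_{t-k} + eps with Gaussian noise."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x[k - i - 1:len(x) - i - 1] for i in range(k)])  # lagged regressors
    y = x[k:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ a) ** 2)       # ML estimate of the noise variance
    return a, sigma2

rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(2, 500):                      # simulate an AR(2) process
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print(fit_ar(x, 2))                          # coefficients close to (0.6, -0.3)
```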

Let \({\textbf{x}}_{t}=x_{t},\dots , x_{t-w+1}\) be the t-th session for a window size w. We calculate the NML codelength \(L_{\text {NML}}({\textbf{x}}_{t}; k)\) for \({\textbf{x}}_{t}\) associated with AR(k) in the following sequential manner: Letting \(\hat{\theta }\) be the maximum likelihood estimator,

$$\begin{aligned} L_{_\textrm{NML}}({\textbf{x}}_{t}; k)=\sum ^{t}_{j=t-w+1} \left( -\log \frac{p(x_{j}; \hat{\theta }(x_{j},x^{j-1}))}{\int p(y_{j}; \hat{\theta }(y,x^{j-1}))dy} \right) . \end{aligned}$$

Letting \({\mathcal P}_{k}=\mathrm{AR}(k)\), similarly to (20) and (23), we can calculate Ddim at time t and draw a Ddim graph. We thereby detect model change signs by applying TH and Diff to the graph.

Fig. 3

Graph of the proportion of \(\mu _{3}\) relative to \(\mu _{2}\) in the mean versus time for various \(\alpha \)

4 Experimental results

4.1 Synthetic data: GMM

4.1.1 Data set

We employ synthetic data sets to evaluate how well we are able to detect signs of model changes using Ddim. We let \(n=1000\) at each time. We generated DataSet 1 according to GMMs so that the number of components changed from \(k=2\) to \(k=3\) as follows:

$$\begin{aligned} \left\{ \begin{array}{l} k=2,\ \mu =(\mu _{1},\mu _{2}) \qquad \qquad \quad \text {if}\ 0\le t \le \tau _1, \\ k=3, \ \mu =(\mu _1, \mu _2, f_{\alpha }(t)) \qquad \quad \text {if}\ \tau _1+1 \le t \le \tau _2, \\ k=3, \ \mu =(\mu _1, \mu _2, \mu _3) \qquad \qquad \text {if}\ \tau _2 +1 \le t \le T, \end{array} \right. \end{aligned}$$
(28)

where letting \(\alpha \) be a smoothness parameter,

$$\begin{aligned} f_{\alpha }(t)\buildrel \text {def} \over = \frac{(\tau _2-t)^{\alpha }\mu _2 + (t-\tau _1)^{\alpha }\mu _3}{(\tau _2-t )^{\alpha }+(t-\tau _1)^{\alpha }} \ \ (\alpha >0). \end{aligned}$$
(29)

In it, one component collapsed gradually in the transition period from \(t=\tau _1+1\) to \(t=\tau _2\). \(f_{\alpha }(t)\) is the mean value which switches from \(\mu _{2}\) to \(\mu _{3}\), where the speed of change is specified by the parameter \(\alpha \). Figure 3 shows the graph of the proportion of \(\mu _{3}\) relative to \(\mu _{2}\) in the mean versus time for various \(\alpha \). The change becomes rapid as \(\alpha \) approaches zero. The variance-covariance matrix of each component is given by

$$\begin{aligned} \Sigma =(rAA^{\top }+(1-r)I)\times \text {var}, \end{aligned}$$
(30)

where \(r=0.2\), \(\text {var}=3\), and A is a randomly generated \(m\times m\) matrix. We set \( m=3, \tau _1=9,\tau _2=29,T=39\). It appears that the number of components of the GMM abruptly changed at \(t=20\) since it takes a discrete value. However, in the early stage of \(k=3\), the model is very close to \(k=2\) because the mean values of the Gaussian components are very close to each other. It may be more natural to recognize the model dimensionality at this stage as a value between \(k=2\) and \(k=3\).
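A sketch of a generator for DataSet 1 following (28)-(30); the concrete mean vectors and the equal mixing weights are placeholders of ours, since they are not specified above.

```python
import numpy as np

def make_dataset1(alpha, n=1000, m=3, tau1=9, tau2=29, T=39, r=0.2, var=3.0, seed=0):
    """Generate a stream of GMM samples whose third mean moves from mu_2 to mu_3
    during the transition period, as in (28)-(30)."""
    rng = np.random.default_rng(seed)
    mu1, mu2, mu3 = np.zeros(m), 5.0 * np.ones(m), 10.0 * np.ones(m)   # placeholder means
    A = rng.standard_normal((m, m))
    Sigma = (r * A @ A.T + (1 - r) * np.eye(m)) * var                  # covariance (30)
    stream = []
    for t in range(T + 1):
        if t <= tau1:
            means = [mu1, mu2]
        elif t <= tau2:
            w1, w2 = (tau2 - t) ** alpha, (t - tau1) ** alpha
            means = [mu1, mu2, (w1 * mu2 + w2 * mu3) / (w1 + w2)]      # f_alpha(t) as in (29)
        else:
            means = [mu1, mu2, mu3]
        z = rng.integers(len(means), size=n)                           # equal mixing weights (assumption)
        stream.append(np.stack([rng.multivariate_normal(means[i], Sigma) for i in z]))
    return stream
```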

We evaluate how well Ddim tracked the transition period of model change. The temperature parameter \(\beta \) was chosen so that \(\beta =0.0316\) according to (22). Figure 4 shows how Ddim gradually grows as time goes by for various \(\alpha \) values. The gray zone shows the transition period when the model changes from \(k=2\) to \(k=3\). The blue line shows the number of components of the GMM estimated by the SDMS algorithm as in (25). The green curve shows the Ddim graph. The red and purple curves show TH-Score and Diff-Score as in (26) and (27), respectively. We show the time points of their alarms of TH and Diff using the same colors. Ddim successfully visualized how rapidly the GMM structure changed in the transition period from \(t=10\) to 29. The true change occurs rapidly for \(\alpha =0.2\), while it occurs slowly for \(\alpha =1.0\). Ddim was able to successfully track their transition process depending on \(\alpha \). Ddim detected signs earlier than SDMS made an alarm of model change.

Fig. 4

Ddim graph (transition period: \([\tau _1=9,\tau _2=29], T=39\)). We see how Ddim gradually grows as time goes by for various \(\alpha \) values. The gray zone shows the transition period when the model changes from \(k=2\) to \(k=3\). The blue line shows the number of components of the GMM estimated by the SDMS algorithm as in (25). The green curve shows the Ddim graph. The red and purple curves show TH-Score and Diff-Score as in (26) and (27), respectively. We show the time points of their alarms of TH and Diff using the same colors. We see that Ddim successfully visualized how rapidly the GMM structure changes in the transition period from \(t=10\) to 29 for each \(\alpha \), where the true change occurs rapidly for \(\alpha = 0.2\), while it occurs slowly for \(\alpha =1.0\). Furthermore, Ddim detected signs of model change earlier than SDMS made an alarm of model change

We next consider the case where there are multiple change points. We generated DataSet 2 according to GMMs so that the number of components changed from \(k=2\) to \(k=3\), and then from \(k=3\) to \(k=4\), as follows:

$$\begin{aligned} \left\{ \begin{array}{ll} k=2,\ \mu =(\mu _{1},\mu _{2}) &{} \text {if}\ 1\le t \le \tau _1, \\ k=3, \ \mu =\left( \mu _1, \mu _2, \frac{(\tau _{2}-t)^{\alpha }\mu _{2}+(t-\tau _{1})^{\alpha }\mu _{3}}{\tau _{2}-\tau _{1}} \right) &{} \text {if}\ \tau _1+1 \le t \le \tau _2,\\ k=3, \ \mu =(\mu _1, \mu _2, \mu _3) &{} \text {if}\ \tau _2 \le t \le \tau _{3}, \\ k=4, \ \mu =\left( \mu _1, \mu _2, \mu _{3}, \frac{(\tau _{4}-t)^{\alpha }\mu _{3}+(t-\tau _{3})^{\alpha }\mu _{4}}{\tau _{4}-\tau _{3}} \right) &{} \text {if}\ \tau _3+1 \le t \le \tau _4,\\ k=4, \ \mu =(\mu _1, \mu _2, \mu _3, \mu _4) &{} \text {if}\ \tau _4+1 \le t \le T. \end{array}\right. \end{aligned}$$

One component collapsed gradually over time from \(t=10\) to \(t=29\) and another collapsed from \(t=50\) to \(t=69\). We set \(\tau _{1}=9, \tau _{2}=29, \tau _{3}=49, \tau _{4}=69\), and \(T=79\). In the transition periods the parameters varied as in the single change point case.

Figure 5 shows the Ddim graph for \(\alpha =0.5\). The gray zone shows the transition periods when the model changes from \(k=2\) to \(k=3\) and from \(k=3\) to \(k=4\). The green curve shows the Ddim graph. The blue line shows the number of components of the GMM estimated by SDMS. The red and purple lines show the times when alarms for signs of model changes are raised using TH and Diff, respectively. The Ddim graph helps us understand well how rapidly the GMM structure gradually changes in the transition periods from \(t=10\) to \(t=29\) and from \(t=50\) to \(t=69\). TH detected the signs of model changes earlier than SDMS.

4.1.2 Evaluation metrics

Next we quantitatively evaluate how early we were able to detect signs of model changes with Ddim. We measure the performance of any algorithm in terms of benefit. Let \(\hat{t}\) be the first time when an alarm is made and \(t^{*}\) be the true sign, which we define as the starting point of model change. Then benefit is defined as

$$\begin{aligned} \text {benefit}={\left\{ \begin{array}{ll} 1-(\hat{t}-t^{*})/{U} &{} (t^{*}\le \hat{t}<t^{*}+U),\\ 0 &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(31)

where U is a given parameter. Benefit takes the maximum value 1 when the alarm coincides with the true sign. It decreases linearly as \(\hat{t}\) increases and becomes zero once \(\hat{t}\) reaches \(t^{*}+U\).

False alarm rate (FAR) is defined as the ratio of the number of alarms outside the transition period over the total number of alarms.
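A sketch of the two evaluation metrics; `transition` denotes the set of time points inside the transition period (our naming):

```python
def benefit(t_hat, t_star, U=10):
    """Benefit (31): 1 for an alarm at the true sign, decaying linearly to 0 over U steps."""
    return 1 - (t_hat - t_star) / U if t_star <= t_hat < t_star + U else 0.0

def false_alarm_rate(alarms, transition):
    """FAR: fraction of alarms raised outside the transition period."""
    return sum(t not in transition for t in alarms) / len(alarms) if alarms else 0.0

print(benefit(t_hat=13, t_star=10))                              # 0.7
print(false_alarm_rate([5, 13, 35], transition=range(10, 30)))   # 2/3
```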

Fig. 5

Ddim graph (transition periods: \([\tau _1=9,\tau _2=29],[\tau _3=49,\tau _{4}=69], T=79\)). Ddim successfully visualizes how rapidly the GMM structure changes in the transition periods from \(t = 10\) to 29 and from \(t=50\) to \(t=69\). We also see that Ddim detected signs of model change earlier than SDMS made an alarm of model change

We evaluate each model change sign detection method in terms of the Area Under Curve (AUC) of the Benefit-FAR curve, which is obtained by varying the threshold parameter \(\delta \) in (26) and (27). We set \(U=10\) in (31).

4.1.3 Methods for comparison

We consider the following methods for comparison.

1) The sequential DMS algorithm (SDMS) [14]: The SDMS algorithm with \(\lambda =1\) outputs the estimated parametric dimensionality as in (25). We raise an alarm when the output of SDMS changes.

2) Fixed share algorithm (FS) [13]: We think of each model k as an expert, and perform Herbster and Warmuth’s fixed share algorithm, abbreviated as FS. It was originally designed to make prediction by taking a weighted average over a number of experts, where the weight is calculated as a linear combination of the exponential update weight and the sum of other experts’ ones. In it the expert with the largest weight is the best expert, which may change over time. We can think of FS as a model change detection algorithm by tracking the time-varying best expert.

Here is a summary of FS. Let k be the index of the expert. \(L_{k}({\textbf{z}}_{t-1})\) is the loss function of the kth expert for data \({\textbf{z}}_{t-1}\), which is the NML codelength in our setting. \(w_{t,k}^{u}\) and \(w_{t,k}^{s}\) are tentative and final weights for the kth expert at time t. FS conducts the following weight update rule: Letting \(\alpha >0\) be a sharing parameter and \(\beta \) be a learning rate,

$$\begin{aligned} w_{t-1,k}^u= & {} w_{t-1,k}^s \cdot \exp \{-\beta L_{k}({\textbf{z}}_{t-1})\} , \\ w_{t,k}^s= & {} (1-\alpha )w_{t-1,k}^u + \sum _{\ell \ne k} \frac{\alpha }{n-1}w_{t-1,\ell }^u , \end{aligned}$$

where n is the total number of experts.

Let \(\hat{k}\) be the best expert, i.e., the one for which \(w_{t,k}^{s}\) is maximum. FS raises an alarm when the best expert changes. The learning rate was set to be the same as in our method; a minimal sketch of this update is given after the method comparison below.

3) Fixed share weighted algorithm (FSW-TH, FSW-Diff): We consider variants of TH and Diff in which \(p(k\mid {\textbf{y}}_{t})\) in (20) is replaced with the normalized weight for k calculated in the process of FS. FSW-TH and FSW-Diff calculate scores by plugging \(w_{t,k}^{s}\) into (24) in place of \(p(k\mid {\textbf{y}}_{t})\) and raise alarms according to (26) and (27), respectively. The learning rate was set to be the same as in our method.

4) Structural entropy (SE): It is a measure of uncertainty for model selection, developed in [16]. It is calculated as the entropy with respect to the model posterior probability distribution (20). SE makes alarms when it exceeds a threshold.

Method 1) is the only existing work that performs on-line dynamic model selection. Methods 2) and 3) are the ones adapted to our problem setting. Methods 1) and 2) are model change detection algorithms, while method 3) quantifies latent gradual changes in a way similar to TH or Diff. Method 4) detects change signs from the viewpoint of model uncertainty, but does not perform continuous model selection.
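A minimal sketch of the fixed share update used by FS and FSW-TH/FSW-Diff above, with the NML codelengths playing the role of the losses; the variable names and toy numbers are ours, and the final normalization is added so that the weights can be plugged into (24):

```python
import numpy as np

def fixed_share_update(w_prev, losses, alpha, beta):
    """One step of fixed share: exponential weighting followed by weight sharing."""
    w_u = w_prev * np.exp(-beta * np.asarray(losses))                 # tentative weights
    n = len(w_u)
    w_s = (1 - alpha) * w_u + (alpha / (n - 1)) * (w_u.sum() - w_u)   # share with the other experts
    return w_s / w_s.sum()                                            # normalized for use in (24)

w = np.ones(3) / 3        # experts corresponding to, say, k = 2, 3, 4
w = fixed_share_update(w, losses=[5120.0, 5105.0, 5150.0], alpha=0.05, beta=1 / np.sqrt(1000))
print(int(w.argmax()))    # index of the current best expert
```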

Table 3 AUC comparison results for GMMs. The bold values show highest AUC records over all the methods

4.1.4 Results

We generated random data 10 times and took the average value of benefit over the 10 trials for each method. Table 3 shows the comparison of all the methods in terms of AUC for both the single and multiple change cases. AUC was calculated for the benefit-FAR curve obtained by varying the threshold. The parameter \(\alpha \) specifies the speed of change. For the multiple change cases, AUC was calculated as an average taken over all change points. For both the single and multiple change cases, TH and Diff had much higher benefit than the FS-based methods in all the cases. The differences were statistically significant according to a t-test with p-values less than \(5\%\). This implies that TH and Diff were able to detect signs of model changes significantly earlier than the FS-based methods. TH worked almost as well as Diff.

It is worth noting that TH and Diff performed better than FSW-TH and FSW-Diff. This implies that the posterior based on the NML distribution is more suitable for tracking gradual model changes than that based on the FS-based heuristics. As \(\alpha \) becomes small, the superiority of TH and Diff over the others becomes more remarkable. This implies that Ddim is able to capture the growth of a cluster much more quickly than the others.

Both for the single and multiple change cases, TH worked slightly better than Diff, but they were almost comparable.

TH and Diff were comparable to SE in terms of change sign detection alone. However, note that the Ddim-based methods realize both model change sign detection and continuous model selection simultaneously. The latter function is particularly important for understanding how fast the model changes. Meanwhile, SE can only perform sign detection by measuring the uncertainty in model selection. Therefore, Ddim has an advantage over SE in the sense that it can continuously quantify the model complexity in the transition period as well as detect signs of model changes.

4.2 Synthetic data: Auto-regression model

4.2.1 Data sets

We next examined continuous model selection for auto-regression (AR) models as in Section 3.3. We let \(n=1000\) at each time. We generated DataSet 3 according to an AR model in the setting of Section 3.3, where the number k of coefficients in the AR model changed over time as follows:

$$\begin{aligned} \text {k}={\left\{ \begin{array}{ll} 1 &{} \text {if}\ ~1\le t< \tau _{1},\\ 1 ~\text {with prob.} ~1-\frac{t-\tau _{1}}{\tau _{2}-\tau _{1}}&{} \text {if} ~\tau _{1}\le t< \tau _{2},\\ 3 ~\text {with prob.} ~\frac{t-\tau _{1}}{\tau _{2}-\tau _{1}}&{} \text {if} ~\tau _{1}\le t< \tau _{2},\\ 3 &{} \text {if} ~\tau _{2}\le t\le T, \end{array}\right. } \end{aligned}$$

where \(\tau _{1}=100, \tau _{2}=200\) and \(T=300\).

4.2.2 Results

We generated random data 10 times and took an average value of benefit over 10 trials for each method. Table 4 shows results on comparison of all the methods in terms of AUC for Benefit-FAR curves.

Table 4 AUC comparison results for AR models. The bold value shows the highest AUC over all the methods

Table 4 shows that TH and Diff were comparable to SE and obtained much larger values of AUC than the other methods except SE. This implies that TH and Diff could catch signs of model changes significantly earlier than the others except SE. Note again that TH and Diff conduct not only change sign detection but also continuous model selection, whereas SE only performs change sign detection.

4.3 Real data: Market data

4.3.1 Data sets

We apply our method to real market data provided by HAKUHODO,INC. (https://www.hakuhodo-global.com/) and M-CUBE,INC. (https://www.m-cube.com/). This data set consists of 912 customers’ beer purchase transactions from Nov. 1st 2010 to Jan. 31st 2011. See [38]. Each customer’s record is specified by a four-dimensional feature vector, each component of which shows a consumption volume for a certain beer category. Categories are: {Beer(A), Low-malt beer(B), Other brewed-alcohol(C), Liquor (D)}.

We constructed a sequence of customers’ feature vectors as follows: A time unit is a day. At each time \(t\ (=\tau ,\dots ,T)\), we denote the feature vector of the ith customer as \(x_{it} = (x_{it,A}, \dots , x_{it,D}) \in {\mathbb R}^{4}\). Each \(x_{it,j}\) is the ith customer’s consumption of the jth category from time \(t-\tau +1\) to t. We denote the data at time t as \({\textbf{x}}_{t} = (x_{1t}, \dots , x_{nt})\), where \(n=912\) is the number of customers. The total number of transactions is 13993. We set \(\tau =14\) and \(T=53\). Since TH and Diff turned out to outperform the other methods in the previous section and we would like to conduct continuous model selection simultaneously, we focus on evaluating how well they work on the real data sets.

Fig. 6

Change sign detection for market data. Ddim continuously increases from \(t=24\) to \(t=26\). TH and Diff raised an alarm at \(t=25\) as a sign of that market structure change

4.3.2 Results

Figure 6 shows Ddim (green), the estimated number of clusters in the GMM (blue) using SDMS, and the time points of alarms raised by TH and Diff (red and purple) with \(\delta _{1}=\delta _{2}=0.1\). Table 5 shows the clustering structures at \(t=24,25,26\). Each number in the (i, j)th cell shows the purchase volume of category \(i\ (=A,B,C,D)\) for the customers in the jth cluster \(cj\ (j=1,2,3,4)\). The last row shows the number of customers.

Table 5 Market structure change

The purchase volume of category C in cluster c4 gradually increased from \(t=24\) to \(t=25\); eventually c4 started to collapse at \(t=25\) and was split into c4 and c5 at \(t=26\). We confirm from Table 5 that c4 consisted of heavy users in category C; at \(t=26\), some of them became dormant users that did not purchase anything, forming a new cluster. The SDMS algorithm detected this market structure change at \(t=26\). As shown in Fig. 6, TH and Diff successfully raised an alarm at \(t=25\) as a sign of that market structure change. The reason why we could detect the early warning signal is that there were gradual changes among clusters as well as within individual clusters before the clustering change occurred. Our result shows that our method was effective in detecting signs of model changes in such a case.

4.4 Real Data: Electric power consumption data

4.4.1 Data sets

Next we apply our method to the household electric power consumption dataset provided by [6]. This dataset contains three categories of electric power consumption, corresponding to electricity consumed 1) in the kitchen and laundry rooms, 2) by electric water heaters, and 3) by air-conditioners. The data were obtained every other minute from Dec. 17, 2006 to Dec. 10, 2010. We set \({\textbf{x}}_{t} = (x_{1}, \cdots , x_{n})\) and \(x_{i} = (x_{i1}, x_{i2}, x_{i3})\), where each \(x_{i}\) denotes the consumption per hour for the three categories, respectively, and \({\textbf{x}}_{t}\) is the consumption over two weeks (\(n=336\)).

4.4.2 Results

Figure 7 shows how Ddim (the green curve) and the number of clusters (the blue line) changed over time. Here each cluster shows a consumption pattern. The red dotted line shows the alarm positions for TH and Diff with \(\delta _{1}=\delta _{2}=0.1\). Let us focus on the duration from \(t=18\) to \(t=22\). At \(t=18,19\), there were three clusters, one of which collapsed into two clusters at \(t=21\), eventually producing a fourth cluster. The Ddim graph in Fig. 7 shows that Ddim gradually increased from \(k=3\) to \(k=4\) during this period. The alarm was raised by TH and Diff at \(t=20\) while there were still three clusters. This alarm can be thought of as a sign of the emergence of a new cluster having a unique consumption pattern.

Fig. 7

Change sign detection for power consumption data. Ddim continuously grows from \(t=19\) to \(t=21\). TH and Diff made an alarm at \(t=20\), which can be thought of as a sign of the emergence of a new cluster with a unique consumption pattern

Table 6 Electric power consumption structure change

Table 6 shows the contents of the clusters in the weeks starting from May 14th, 21st, and 28th in 2007. c denotes the cluster, and m1, m2, m3 denote the mean amounts of meters 1, 2, 3, respectively. The last column shows the total number of users in the respective cluster. A sign of model change was detected on May 21st. The model change was detected on May 28th. We see from Table 6 that cluster 2 collapsed into clusters 2 and 3. Cluster 2 shows a pattern of homogeneous consumption with a relatively high weight on category 3. Cluster 3 shows a pattern of homogeneous consumption with a relatively high weight on category 1. The sign of this collapse was successfully detected on May 21st by monitoring the Ddim value. The reason why we could detect the early warning signal is that there was a gradual change in the collapse of cluster 2 before the clustering change occurred. Our result shows that our method was effective in detecting signs of model changes in such a case.

5 Relation of Ddim to MDL Learning

This section gives a theoretical foundation of Ddim by relating it to the rate of convergence of the MDL learning algorithm [2, 30]. It selects a model with the shortest total codelength required for encoding the data as well as the model itself. We give an NML-based version of the MDL algorithm as follows.

Let \({\mathcal F}=\{{\mathcal P}_{1},\dots , {\mathcal P}_{s}\}\) where \(\mid {\mathcal F}\mid =s< \infty \) and each \({\mathcal P}_{i}\) is a class of probability distributions. For a given training data sequence \({\textbf{x}}=x_{1},\dots ,x_{n}\) where each \(x_{i}\) is independently drawn, the MDL learning algorithm selects \(\hat{{\mathcal P}}\) such that

$$\begin{aligned} \hat{{\mathcal P}}= & {} \underset{{\mathcal P}\in {\mathcal F}}{\text {argmin}}(-\log p_{_\text {NML}}({\textbf{x}}; {\mathcal P}))\\= & {} \underset{{\mathcal P}\in {\mathcal F}}{\text {argmin}}\left\{ -\log \max _{p\in {\mathcal P}}p({\textbf{x}})+\log {\mathcal C}_{n}({\mathcal P})\right\} , \nonumber \end{aligned}$$
(32)

where \({\mathcal C}_{n}({\mathcal P})\) is the parametric complexity of \({\mathcal P}\) as in (5). The MDL learning algorithm outputs the NML distribution associated with \(\hat{{\mathcal P}}\) as in (32): for a sequence \({\textbf{y}}=y_{1},\dots , y_{n}\),

$$\begin{aligned} \hat{p}({\textbf{y}})=\frac{\max _{p\in \hat{{\mathcal P}}}p({\textbf{y}})}{C_{n}(\hat{{\mathcal P}})}. \end{aligned}$$
(33)

Note that \({\textbf{y}}\) is independent of the training sequence \({\textbf{x}}\) used to obtain \(\hat{{\mathcal P}}\). In previous work [2, 30], the MDL learning algorithm has been designed so that it outputs the two-stage shortest-codelength distribution with quantized parameter values, belonging to the model classes. Our algorithm differs from them in that it outputs the NML distribution (33), which is not included in the model classes. The NML distribution and the MDL principle are the central notions in deriving Ddim throughout this paper. Thus it is significant to investigate the relation of Ddim to the NML distribution estimated with the MDL learning algorithm.
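To make (32) concrete, the sketch below selects between two toy model classes for binary data, a fine grid of Bernoulli parameters and the singleton fair-coin class, by comparing their NML codelengths computed by brute force as in Section 2.1. Both classes and the helper function are our illustration, not the experimental setup of the paper.

```python
import math
from itertools import product

def nml_codelength(x, model_class):
    """NML codelength (2) of x for a finite class of Bernoulli parameters,
    with the normalizer (3) computed by brute force over {0,1}^n."""
    def max_loglik(y):
        k = sum(y)
        return max(k * math.log(th) + (len(y) - k) * math.log(1 - th) for th in model_class)
    log_C = math.log(sum(math.exp(max_loglik(y)) for y in product((0, 1), repeat=len(x))))
    return -max_loglik(x) + log_C

# Two candidate classes: a fine Bernoulli parameter grid vs. the fair coin only.
classes = {"bernoulli_grid": [i / 100 for i in range(1, 100)], "fair_coin": [0.5]}
x = (1, 1, 0, 1, 1, 1, 0, 1, 1, 1)
best = min(classes, key=lambda name: nml_codelength(x, classes[name]))
print(best)  # the class attaining the shortest NML codelength, as selected by (32)
```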

We have the following theorem relating Ddim to the rate of convergence of the MDL learning algorithm.

Theorem 3

Suppose that each \({\textbf{x}}\) is generated according to \( p^{*}\in {\mathcal P}^{*}\in {\mathcal F}=\{{\mathcal P}_{1},\dots , {\mathcal P}_{s}\}\). Let \(\hat{p}\) be the output of the MDL learning algorithm as in (32). Let \(d_{B}^{(n)}(\hat{p},p^{*})\) be the Bhattacharyya distance between \(\hat{p}\) and \(p^{*}\):

$$\begin{aligned} d_{B}^{(n)}(\hat{p},p^{*})\buildrel \text {def} \over =-\frac{1}{n}\log \sum _{{\textbf{y}}} (p^{*}({\textbf{y}})\hat{p}({\textbf{y}}))^{\frac{1}{2}}. \end{aligned}$$
(34)

Then for any \(\epsilon >0\), under the condition on \({\mathcal P}^{*}\) as in Theorem 1, we have the following upper bound on the probability that the Bhattacharyya distance between the output of the MDL learning algorithm and the true distribution exceeds \(\epsilon \):

$$\begin{aligned} Prob[d_{B}^{(n)}(\hat{p},p^{*})>\epsilon ]= & {} O\left( n^\mathrm{{Ddim}({\mathcal {P}}^{*})/4}e^{-n\epsilon }\right) . \end{aligned}$$
(35)

Suppose that \({\mathcal P}\) is chosen randomly according to the probability distribution \(\pi ({\mathcal P})\) over \({\mathcal F}=\{{\mathcal P}_1,\dots , {\mathcal P}_{s}\}\) and that the unknown true distribution \(p^{*}\) is chosen from \({\mathcal P}^{*}\). Then we have the following upper bound on the expected Bhattacharyya distance between the output of the MDL learning algorithm and the true distribution:

$$\begin{aligned} E_{{\mathcal P}^{*}}E_{{\textbf{x}}\sim p^{*}\in {\mathcal P}^{*}}[d_{B}^{(n)}(\hat{p},p^{*}) ] =O\left( \frac{\mathrm{{Ddim}}({\mathcal F}^{\odot })\log n}{n}\right) , \end{aligned}$$
(36)

where \(\mathrm{{Ddim}}({\mathcal F}^{\odot })\) is Ddim for model fusion as in (16).

The proof is given in the Appendix. This result may be generalized to the agnostic case where the model class misspecifies the true distribution (see also [4] for this case). We omit this result from this manuscript since our main concern is how the expected generalization performance is related to Ddim.

Theorem 3 implies that the NML distribution with model of the shortest NML codelength converges exponentially to the true distribution in probability as n increases and the rate is governed by Ddim for the true model. In conventional studies on PAC (probably approximately correct) learning [12], the performance of the empirical risk minimization algorithm has been analyzed using the technique of uniform convergence, where the rate of convergence is governed by the metric dimension. Meanwhile, the performance of the MDL learning algorithm is analyzed using the non-uniform convergence technique, since the non-uniform model complexity is considered. In this case the rate of convergence of the MDL algorithm is governed by Ddim. Then the expected Bhattacharyya distance between the true distribution and the output of the MDL learning algorithm is characterized by Ddim for model fusion over \({\mathcal F}\).

6 Conclusion

This paper has proposed a novel methodology for detecting signs of model changes from a data stream. The key idea is to conduct continuous model selection using the notion of descriptive dimensionality (Ddim). Ddim quantifies the real-valued model dimensionality in the model transition period. We are able not only to visualize the model complexity in the transition period of model changes, but also to detect their signs by tracking the rise-up/descent of Ddim. Focusing on model changes in Gaussian mixture models, we have shown that gradual structural changes of GMMs can be effectively visualized by drawing a Ddim graph. Furthermore, we have empirically demonstrated that our methodology was able to detect signs of changes of the number of mixture components in a GMM and of the order of an AR model earlier than they were actualized. Experimental results have shown that it was able to detect them significantly earlier than the existing dynamic model selection methods.

This paper has offered the use of continuous model selection in the scenario of model change sign detection only. Exploring other scenarios of continuous model selection remains for future study.