1 Introduction

The mean shift (MS) algorithm is a non-parametric mode seeking technique that was introduced by Fukunaga and Hostetler (1975) and later developed by Cheng (1995) and Comaniciu and Meer (2002). The algorithm starts from the data points and iteratively shifts each point to a weighted average of the data set in order to find the stationary points of an estimated probability density function (pdf). Modes of an estimated pdf have been used in a wide range of applications, including image segmentation (Comaniciu and Meer 2002; Wang et al. 2004), object tracking (Comaniciu et al. 2000, 2003), noisy source vector quantization (Aliyari Ghassabeh et al. 2012b), and nonlinear dimensionality reduction (Aliyari Ghassabeh et al. 2012a). The main advantage of the MS algorithm is that it does not require any prior knowledge of the number of clusters and makes no assumption about the shape of the clusters. The MS algorithm generates a sequence, called the mode estimate sequence, in order to estimate modes of an estimated pdf. In the original paper, the authors claimed that the mode estimate sequence is a convergent sequence (Comaniciu and Meer 2002), but the given proof was not correct. Later, Carreira-Perpinán (2007) showed that the MS algorithm with a Gaussian kernel is an expectation maximization (EM) algorithm and therefore the generated sequence converges to a mode of the estimated pdf. However, there are situations in which the EM algorithm may not converge (Boyles 1983), so the convergence of the MS algorithm does not follow from this equivalence. The author in Carreira-Perpinán (2007) also considered the iteration index to be a continuous variable; in addition, for the special case in which all the terms in the Gaussian mixture model have the same diagonal bandwidth matrix, this author introduced a strict Lyapunov function in order to show that an equilibrium point of the system is an asymptotically stable point.

In two recent works, the convergence of the MS algorithm in the one-dimensional space (\(d=1\)) has been investigated (Aliyari Ghassabeh et al. 2013; Aliyari Ghassabeh 2013). The authors in Aliyari Ghassabeh et al. (2013) showed that the MS algorithm with an analytic kernel (e.g., the Gaussian kernel) generates a convergent sequence in the one-dimensional space. The author in Aliyari Ghassabeh (2013) proved that for the MS algorithm in the one-dimensional space with a certain class of kernels, the mode estimate sequence is a monotone and convergent sequence. However, the authors in Aliyari Ghassabeh et al. (2013) and Aliyari Ghassabeh (2013) could not generalize the convergence result to higher-dimensional spaces (\(d>1\)).

In this paper, we first generalize the results given in Carreira-Perpinán (2007) for the iteration index treated as a continuous variable. In particular, we assume that each term in the pdf estimate using the Gaussian kernel has its own covariance matrix instead of assuming a constant diagonal bandwidth matrix for all the terms. Then, we introduce a strict Lyapunov function and show that it satisfies the required condition for an equilibrium point to be asymptotically stable. We also investigate the discrete case with isolated stationary points and show that the proposed Lyapunov function for the continuous case can be used for the discrete case as well. The availability of a Lyapunov function guarantees the asymptotic stability of the system (i.e., the mode estimate sequence remains close to an equilibrium point and finally converges to it). In Sect. 2, we give a short introduction to the MS algorithm. We provide a brief review of Lyapunov stability theory in Sect. 3. The main theoretical results are given in Sect. 4. The concluding remarks are given in Sect. 5.

2 Mean shift algorithm

Let \(\mathbf {x}_{i} \in \mathbb {R}^d,\; i=1, \ldots , n\) be a set of \(n\) independent and identically distributed (iid) random variables. The multivariate kernel density estimate using kernel \(K\) and bandwidth matrix \(\mathbf {H}\) is given by (Silverman 1986)

$$\begin{aligned} \hat{f}_{K,\mathbf {H}}(\mathbf {x})=\frac{1}{n|\mathbf {H}|^{1/2}} \sum \limits _{i=1}^nK\left( \mathbf {H}^{-1/2}(\mathbf {x}-\mathbf {x}_{i})\right) , \end{aligned}$$

where the kernel \(K\) is a non-negative, real-valued, and integrable function satisfying the following conditions (Wand 1995)

$$\begin{aligned} \int \limits _{\mathbb {R}^d}K(\mathbf {x})d\mathbf {x}=1, \quad&\lim _{\Vert \mathbf {x}\Vert \rightarrow \infty } \Vert \mathbf {x}\Vert ^dK(\mathbf {x})=0, \quad \int \limits _{\mathbb {R}^d}\mathbf {x}K(\mathbf {x})d\mathbf {x}=0. \end{aligned}$$

For simplicity, we assume a specific class of kernel functions called radially symmetric kernels that are defined in terms of a profile \(k\).

Definition 1

A profile \(k:[0,\infty )\rightarrow [0,\infty )\) is a non-negative, non-increasing, and piecewise continuous function that satisfies \(\int \nolimits _{0}^{\infty }k(x)dx<\infty \) and \(K({\varvec{x}})=c_{k,d}k(\Vert {\varvec{x}}\Vert ^2)\), where \(c_{k,d}\) is a normalization factor that causes \(K({\varvec{x}})\) to integrate to one.

Furthermore, the shadow of a profile \(k\) is defined by Cheng (1995)

Definition 2

A profile \(h\) is called the shadow of a profile \(k\) if and only if

$$\begin{aligned} h(x)=a+b\int \limits _{x}^{\infty }k(t)dt, \end{aligned}$$

where \(b>0\) and \(a \in \mathbb {R}\) are constants.

To reduce the computational cost, in practice the bandwidth matrix \(\mathbf {H}\) is chosen to be proportional to the identity matrix, i.e., \(\mathbf {H}=h^2\mathbf {I}\) with a single bandwidth parameter \(h>0\). The estimated pdf using the profile function and a single bandwidth parameter then simplifies to the following form

$$\begin{aligned} \hat{f}_{h,k}(\mathbf {x})=\frac{c_{k,d}}{nh^d}\sum \limits _{i=1}^n k\left( \left\| \frac{\mathbf {x}-\mathbf {x}_{i}}{h}\right\| ^2\right) . \end{aligned}$$
(1)
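For illustration, the following minimal Python sketch evaluates the estimate (1) for the Gaussian profile \(k(x)=e^{-x/2}\), whose normalization factor is \(c_{k,d}=(2\pi )^{-d/2}\); the synthetic data, bandwidth value, and function names are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def gaussian_profile(x):
    """Profile k(x) = exp(-x/2) corresponding to the Gaussian kernel."""
    return np.exp(-x / 2.0)

def kde(x, data, h):
    """Density estimate (1) at point x with bandwidth h."""
    n, d = data.shape
    c_kd = (2.0 * np.pi) ** (-d / 2.0)               # normalization c_{k,d}
    sq_dist = np.sum(((x - data) / h) ** 2, axis=1)  # ||(x - x_i)/h||^2
    return c_kd / (n * h ** d) * np.sum(gaussian_profile(sq_dist))

# illustrative usage on synthetic two-dimensional data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(kde(np.zeros(2), X, h=0.5))
```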

The modes of an estimated pdf are the zeros of its gradient. Taking the gradient of (1) and setting it equal to zero reveals that the modes of the estimated pdf are zeros of the following function (equivalently, fixed points of \(\mathbf {m}_{h,g}(\mathbf {x})+\mathbf {x}\))

$$\begin{aligned} \mathbf {m}_{h,g}(\mathbf {x})=\frac{\sum \nolimits _{i=1}^n\mathbf {x}_{i}g\left( \left\| \frac{\mathbf {x}-\mathbf {x}_{i}}{h}\right\| ^2\right) }{\sum \nolimits _{i=1}^ng\left( \left\| \frac{\mathbf {x}-\mathbf {x}_{i}}{h}\right\| ^2\right) }-\mathbf {x}, \end{aligned}$$
(2)

where \(g(x)=-k'(x)\). The vector \(\mathbf {m}_{h,g}\) is called the MS vector (Comaniciu and Meer 2002).

Note that a fixed point of a function \(f:\mathbb {R}^d\rightarrow \mathbb {R}^d\) is any value \({\varvec{x}}\in \mathbb {R}^d\) such that \(f({\varvec{x}})={\varvec{x}}\), whereas a stationary point of \(f\) is any value \({\varvec{y}}\in \mathbb {R}^d\) such that \(\nabla f({\varvec{y}})=\mathbf {0}\).

The MS vector can alternatively be expressed as (Comaniciu and Meer 2002)

$$\begin{aligned} \mathbf {m}_{h,g}(\mathbf {x})=c\frac{\nabla \hat{f}_{k}(\mathbf {x})}{ \hat{f}_{g}(\mathbf {x})}, \end{aligned}$$
(3)

where \(c\) is a scalar depending on the bandwidth \(h\), and \(\hat{f}_{k}\) represents the pdf estimate using the profile \(k\). The above expression shows that at an arbitrary point \(\mathbf {x}\), the MS vector is proportional to the normalized density gradient estimate at \(\mathbf {x}\). The MS algorithm starts from one of the data points and updates the mode estimate iteratively. The mode estimate at the \(k\)th iteration is updated by

$$\begin{aligned} \mathbf {y}_{k+1}&=\mathbf {m}_{h,g}(\mathbf {y}_{k})+\mathbf {y}_{k}\nonumber \\&=\frac{\sum \nolimits _{i=1}^n\mathbf {x}_{i}g\left( \left\| \frac{\mathbf {y}_{k}-\mathbf {x}_{i}}{h}\right\| ^2\right) }{\sum \nolimits _{i=1}^ng\left( \left\| \frac{\mathbf {y}_{k}-\mathbf {x}_{i}}{h}\right\| ^2\right) }. \end{aligned}$$
(4)

It can be shown that the norm of the difference between two consecutive mode estimates converges to zero (Aliyari Ghassabeh et al. 2012a), i.e., \(\lim _{k\rightarrow \infty }\Vert \mathbf {y}_{k+1}-\mathbf {y}_{k}\Vert =0\). Therefore, in practice the MS algorithm terminates the iterations when the norm of the difference between two consecutive mode estimates becomes less than some predefined threshold. The convergence of the algorithm for the special one-dimensional case (\(d=1\)) has been proved (Aliyari Ghassabeh 2013), but unfortunately the convergence result has not been generalized to higher dimensions, i.e., \(d>1\).
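For concreteness, the following minimal Python sketch implements the update (4) together with the stopping rule just described, assuming the Gaussian profile (for which \(g(x)=-k'(x)\propto e^{-x/2}\), and the proportionality constant cancels in (4)); the synthetic data, bandwidth, and tolerance are illustrative choices only.

```python
import numpy as np

def mean_shift_mode(y0, data, h, tol=1e-6, max_iter=500):
    """Iterate the update (4) from y0 until ||y_{k+1} - y_k|| < tol."""
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        # weights g(||(y - x_i)/h||^2) with the Gaussian profile
        w = np.exp(-np.sum(((y - data) / h) ** 2, axis=1) / 2.0)
        y_next = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(y_next - y) < tol:
            return y_next
        y = y_next
    return y

# illustrative usage: start from one of the data points of a two-cluster sample
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
print(mean_shift_mode(X[0], X, h=0.8))
```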

3 Lyapunov stability theory

Consider a general nonlinear dynamical system (Luenberger 1979)

$$\begin{aligned} \dot{\mathbf {x}}&=f(\mathbf {x}(t),t), \quad \text {Continuous case},\\ \mathbf {x}(k+1)&=f(\mathbf {x}(k),k),\quad \mathbf {x}(0)=\mathbf {x}_{0}, \quad \quad \text {Discrete case}, \end{aligned}$$

where \(\mathbf {x}\in \mathbb {U}\subset \mathbb {R}^d\), \(\mathbb {U}\) is a neighborhood of the origin, and \(f:\mathbb {R}^d\rightarrow \mathbb {R}^d\) is a continuous and differentiable function. An equilibrium point \(\mathbf {x}^*\) for the continuous and discrete cases is defined as follows (Antsaklis 2006).

Definition 3

A vector \(\mathbf {x}^*\) is called an equilibrium point from time \(t_{0}\) for the continuous case if \(f(\mathbf {x}^*,t)=0, \forall t\ge t_{0}\) and is called an equilibrium (or fixed) point from time \(k_{0}\) for the discrete case if \(f(\mathbf {x}^*,k)=\mathbf {x}^*, \forall k>k_{0}\).

An equilibrium point \(\mathbf {x}^*\) is called Lyapunov stable if solutions that start close enough to the equilibrium point remain close to it for all time. Formally, we have (Antsaklis 2006):

Definition 4

An equilibrium point \(\mathbf {x}^*\) is called Lyapunov stable if for every \(\epsilon >0\) there exists a \(\delta (\epsilon )>0\) such that if \(\Vert \mathbf {x}(0)-\mathbf {x}^*\Vert <\delta (\epsilon )\) then \(\Vert \mathbf {x}(t)-\mathbf {x}^*\Vert <\epsilon \) for all \(t\ge 0\) (the Lyapunov stability is defined similarly for the discrete case).

The equilibrium point \(\mathbf {x}^*\) is said to be asymptotically stable if it is Lyapunov stable and if there exists \(\delta >0\) such that if \(\Vert \mathbf {x}(0)-\mathbf {x}^*\Vert <\delta \) then \(\lim _{t\rightarrow \infty } \Vert \mathbf {x}(t)-\mathbf {x}^*\Vert =0\) (Antsaklis 2006).

Let \(\mathbf {x}^*\) denote an equilibrium point of a continuous dynamical system. Lyapunov’s second method states that if there exists a continuous, differentiable function \(V(\mathbf {x}): E\rightarrow \mathbb {R}\), where \(E\subset \mathbb {R}^d\) is a neighborhood of \(\mathbf {x}^*\), such that \(V(\mathbf {x}^*)=0\) and \(V(\mathbf {x})>0\) if \(\mathbf {x}\ne \mathbf {x}^*\), then \(\mathbf {x}^*\) is asymptotically stable if \(\dot{V}(\mathbf {x})<0\) for all \(\mathbf {x}\in E\backslash \{\mathbf {x}^*\}\) (Tripathi 2008). For a discrete-time system, the theorem is slightly different: if there exists a continuous, differentiable function \(V(\mathbf {x}): E\rightarrow \mathbb {R}\), where \(E\) is defined as before, such that \(V(\mathbf {x}^*)=0\) and \(V(\mathbf {x})>0\) if \(\mathbf {x}\ne \mathbf {x}^*\), then \(\mathbf {x}^*\) is asymptotically stable if \(\Delta {V}(\mathbf {x})=V(\mathbf {x}_{k+1})-V(\mathbf {x}_{k})<0\) for all \(\mathbf {x}\in E\backslash \{\mathbf {x}^*\}\) (Haddad and Chellaboina 2008).
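As a simple illustrative example, consider the scalar discrete system \(x(k+1)=x(k)/2\) with equilibrium point \(x^*=0\); the function \(V(x)=x^2\) satisfies \(V(0)=0\), \(V(x)>0\) for \(x\ne 0\), and \(\Delta V(x)=x^2/4-x^2=-3x^2/4<0\) for \(x\ne 0\), so the origin is asymptotically stable.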

4 Theoretical results

In this section, we first consider the iteration index for the MS algorithm to be continuous and generalize the results in Carreira-Perpinán (2007). Then we show that the proposed function can also be used as a Lyapunov function for the discrete case with isolated stationary points, which implies that the fixed points of the MS algorithm are asymptotically stable.

4.1 Continuous case

Carreira-Perpinán (2007) investigated the MS algorithm with a Gaussian kernel and considered the iteration index to be a continuous variable. The Gaussian MS algorithm with a continuous iteration index can be written as follows (Carreira-Perpinán 2007)

$$\begin{aligned} \dot{\mathbf {x}}=\frac{\sum \nolimits _{i=1}^n \mathbf {\Sigma }_{i}^{-1}(\mathbf {x}_{i}-\mathbf {x}) \exp \left( -(\mathbf {x}_{i}-\mathbf {x})^t\mathbf {\Sigma }_{i}^{-1}( \mathbf {x}_{i}-\mathbf {x})/2\right) }{\sum \nolimits _{i=1}^n \exp \left( -(\mathbf {x}_{i}-\mathbf {x})^t\mathbf {\Sigma }_{i}^{-1}(\mathbf {x}_{i}- \mathbf {x})/2 \right) }= \frac{\nabla \hat{f}(\mathbf {x})}{\hat{f}(\mathbf {x})}, \end{aligned}$$
(5)

where \(\mathbf {\Sigma }_{i}\) is the covariance matrix for the \(i\)th component in the Gaussian mixture model. For simplicity, the author in Carreira-Perpinán (2007) assumed that \(\mathbf {\Sigma }_{i}=h^2\mathbf {I}\). The above continuous dynamical system then reduces to

$$\begin{aligned} \dot{\mathbf {x}}=\nabla \left( h^2\log (\hat{f}(\mathbf {x}))\right) , \end{aligned}$$
(6)

where \(\hat{f}(\mathbf {x})\) is defined in (1) using a Gaussian kernel with the bandwidth matrix \(h^2\mathbf {I}\). The Lyapunov function in a neighborhood \(E\) of any equilibrium point \(\mathbf {x}^*\) is defined by Carreira-Perpinán (2007) as

$$\begin{aligned} V(\mathbf {x})=h^2\log \frac{\hat{f}(\mathbf {x}^*)}{\hat{f}(\mathbf {x})}. \end{aligned}$$
(7)

It is not difficult to show that \(V(\mathbf {x}^*)=0\) and \(V(\mathbf {x})>0\) for all \(\mathbf {x}\in E\backslash \{\mathbf {x}^*\}\), i.e., \(V\) is positive definite on \(E\backslash \{\mathbf {x}^*\}\). The author in Carreira-Perpinán (2007) also showed that \(\dot{V}(\mathbf {x})<0\) for all \(\mathbf {x}\in E\backslash \{\mathbf {x}^*\}\). Therefore, the equilibrium point \(\mathbf {x}^*\) is an asymptotically stable point of the dynamical system. The author in Carreira-Perpinán (2007) mentioned that finding a Lyapunov function for the general case (5) is more difficult.

In a recent work, the authors provided a sufficient condition for the MS algorithm with the Gaussian kernel to have a unique mode in the convex hull of the data set (Theorem \(2\) in Liu et al. 2013). They showed that if the MS algorithm has a unique mode in the convex hull of the data set, then the mode is globally stable and the mode estimate sequence converges exponentially (Theorem \(3\) in Liu et al. 2013). The sufficient condition provided in Liu et al. (2013) depends on the data set and on the covariance matrix of each Gaussian term in the pdf estimate. In general, it may be a difficult task to choose the covariance matrices so that this condition is satisfied. Furthermore, the MS algorithm with a unique mode has limited use in practice: the MS algorithm has been widely used in applications such as image segmentation and clustering, which require the algorithm to find multiple modes.

We propose a Lyapunov function for the general case (5) in order to guarantee the asymptotic stability of the algorithm for a general Gaussian mixture model. Let \(\mathbf {x}_{i}\in \mathbb {R}^{d}, i=1,\ldots ,n\) denote our samples. The density estimate using the Gaussian kernel is given by

$$\begin{aligned} \hat{f}(\mathbf {x})=c\sum _{i=1}^n\exp \left( -(\mathbf {x}-\mathbf {x}_{i})^t \mathbf {\Sigma }_{i}^{-1}(\mathbf {x}-\mathbf {x}_{i})/2\right) , \end{aligned}$$
(8)

where \(c\) is the normalization factor and \(\mathbf {\Sigma }_{i}, i=1,\ldots ,n\), is the covariance matrix for the \(i\)th sample. Let \(N(\mathbf {x}_{i},\mathbf {\Sigma }_{i})=\exp (-(\mathbf {x}-\mathbf {x}_{i})^t \mathbf {\Sigma }_{i}^{-1}(\mathbf {x}-\mathbf {x}_{i})/2)\) denote the (unnormalized) Gaussian function evaluated at \(\mathbf {x}\); then the gradient estimate at \(\mathbf {x}\) using \(\hat{f}(\mathbf {x})\) is computed as

$$\begin{aligned} \nabla \hat{f}(\mathbf {x})&=c\sum \limits _{i=1}^n\mathbf {\Sigma }_{i}^{-1}(\mathbf {x}_{i}- \mathbf {x})N(\mathbf {x}_{i},\mathbf {\Sigma }_{i})\nonumber \\&=c\sum \limits _{i=1}^n\mathbf {\Sigma }_{i}^{-1}\mathbf {x}_{i}N(\mathbf {x}_{i}, \mathbf {\Sigma }_{i})- c\sum _{i=1}^n\mathbf {\Sigma }_{i}^{-1}N(\mathbf {x}_{i},\mathbf {\Sigma }_{i})\mathbf {x}. \end{aligned}$$
(9)

Multiplying both sides of (9) by \([\sum _{i=1}^n\mathbf {\Sigma }_{i}^{-1}N(\mathbf {x}_{i},\mathbf {\Sigma }_{i})]^{-1}/c\), we obtain

$$\begin{aligned} \left[ \sum \limits _{i=1}^n\mathbf {\Sigma }_{i}^{-1}N(\mathbf {x}_{i},\mathbf { \Sigma }_{i})\right] ^{-1}\frac{\nabla \hat{f}(\mathbf {x})}{c}&=\left[ \sum \limits _{i=1}^n\mathbf {\Sigma }_{i}^{-1}N(\mathbf {x}_{i},\mathbf {\Sigma }_{i})\right] ^{-1}\sum \limits _{i=1}^n\mathbf {\Sigma }_{i}^{-1}\mathbf {x}_{i}N(\mathbf {x}_{i},\mathbf { \Sigma }_{i})-\mathbf {x}\nonumber \\&=\mathbf {m}(\mathbf {x}). \end{aligned}$$
(10)

Consider the nonlinear continuous system \(\dot{\mathbf {x}}(t)=\mathbf {m}(\mathbf {x}(t))\), where \(\mathbf {m}:\mathbb {R}^d\rightarrow \mathbb {R}^d\) is the continuous, differentiable function defined in (10). Let \(\mathbf {x}^*\) be an equilibrium point of the system, i.e., \(\mathbf {m}(\mathbf {x}^*)=\mathbf {0}\). Let \(V(\mathbf {x})=\hat{f}(\mathbf {x}^*)-\hat{f}(\mathbf {x})\), where \(V:E\rightarrow \mathbb {R}\) is a continuous, differentiable function and \(E\subset \mathbb {R}^d\) is an open neighborhood around \(\mathbf {x}^*\) such that \(\hat{f}(\mathbf {x}^*)>\hat{f}(\mathbf {x})\) for all \(\mathbf {x}\in E\backslash \{\mathbf {x}^*\}\).Footnote 1 Since \(\mathbf {x}^*\) is a mode of the estimated pdf in the local neighborhood \(E\), we have \(V(\mathbf {x})=\hat{f}(\mathbf {x}^*)-\hat{f}(\mathbf {x})>0\) for all \(\mathbf {x}\in E\backslash \{\mathbf {x}^*\}\) and \(V(\mathbf {x}^*)=0\), i.e., \(V(\mathbf {x})\) is strictly positive definite in the local neighborhood \(E\). We now show that \(\dot{V}(\mathbf {x})\) is negative definite, i.e., \(\dot{V}(\mathbf {x})<0\) for all \(\mathbf {x}\in E\backslash \{\mathbf {x}^*\}\) and \(\dot{V}({\mathbf {x}}^*)=0\). Taking the derivative of \(V\) along the trajectories of the system and using the chain rule, we have

$$\begin{aligned} \dot{V}(\mathbf {x})&=\dot{\mathbf {x}}^{t}\nabla V\nonumber \\&=-\dot{\mathbf {x}}^{t}\nabla \hat{f}(\mathbf {x})=-\mathbf {m}(\mathbf {x}(t))^t\nabla \hat{f}(\mathbf {x})\nonumber \\&=-\left( \left[ \sum \limits _{i=1}^n\mathbf {\Sigma }_{i}^{-1}N(\mathbf {x}_{i},\mathbf {\Sigma }_{i})\right] ^{-1}\frac{\nabla \hat{f}(\mathbf {x})}{c}\right) ^t\nabla \hat{f}(\mathbf {x})\nonumber \\&=\frac{-1}{c}\nabla \hat{f}(\mathbf {x})^t\left[ \sum \limits _{i=1}^n\mathbf {\Sigma }_{i}^{-1}N(\mathbf {x}_{i},\mathbf {\Sigma }_{i})\right] ^{-1} \nabla \hat{f}(\mathbf {x})< 0. \end{aligned}$$

The last inequality is true since the weighted sum of the inverses of the covariance matrices is a positive definite matrix. It is also obvious that \(\dot{V}(\mathbf {x}^*)=0\). Therefore, \(V(\mathbf {x})=\hat{f}(\mathbf {x}^*)-\hat{f}(\mathbf {x})\) is a strict Lyapunov function for the continuous dynamical system in (5) and \(\mathbf {x}^*\) is locally asymptotically stable, i.e., if we start from any point \(\mathbf {x}_{0}\in E\), then the mode estimate sequence remains close to \(\mathbf {x}^*\) and finally converges to \(\mathbf {x}^*\).
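To make the sign argument above concrete, the following Python sketch evaluates \(\dot{V}(\mathbf {x})=-\frac{1}{c}\nabla \hat{f}(\mathbf {x})^t[\sum _{i=1}^n\mathbf {\Sigma }_{i}^{-1}N(\mathbf {x}_{i},\mathbf {\Sigma }_{i})]^{-1}\nabla \hat{f}(\mathbf {x})\) at random test points for a small synthetic data set with randomly generated positive definite covariance matrices; the data, covariance construction, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 20
X = rng.normal(size=(n, d))                               # sample points x_i
# inverses of random symmetric positive definite covariance matrices Sigma_i
Sig_inv = [np.linalg.inv(np.eye(d) + 0.1 * A @ A.T) for A in rng.normal(size=(n, d, d))]
c = 1.0                                                   # normalization factor in (8)

def weights(x):
    """Gaussian factors N(x_i, Sigma_i) of (8) evaluated at x."""
    return np.array([np.exp(-0.5 * (x - xi) @ Si @ (x - xi)) for xi, Si in zip(X, Sig_inv)])

def grad_f(x):
    """Gradient estimate (9)."""
    w = weights(x)
    return c * sum(wi * Si @ (xi - x) for wi, xi, Si in zip(w, X, Sig_inv))

def V_dot(x):
    """dV/dt along the flow (5) as derived above; negative away from stationary points."""
    w = weights(x)
    S = sum(wi * Si for wi, Si in zip(w, Sig_inv))        # positive definite weighted sum
    g = grad_f(x)
    return -g @ np.linalg.solve(S, g) / c

# dV/dt should be non-positive everywhere (strictly negative where the gradient is nonzero)
print(all(V_dot(rng.normal(size=d)) <= 0 for _ in range(100)))   # expected: True
```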

4.2 Discrete case

For the discrete case, Fashing and Tomasi proved the following theorem (Theorem \(2\) in Fashing and Tomasi 2005).

Theorem 1

The MS procedure with a piecewise constant profile \(k\) is equivalent to Newton’s method applied to a density estimate using the shadow of \(k\).

Theorem 1 implies that for a very special class of profile functions, namely piecewise constant profiles, the MS algorithm is equivalent to Newton’s method. A piecewise constant profile (e.g., the uniform profile) defines a piecewise constant kernel. Piecewise constant kernels (e.g., uniform kernels) have limited use in kernel density estimation, since the pdf estimate using a piecewise constant kernel is a non-smooth function, which is undesirable. Theorem 1 does not hold for widely used kernels (e.g., the Gaussian kernel), and therefore the MS algorithm is not, in general, equivalent to Newton’s method. Furthermore, even for a piecewise constant profile \(k\), Theorem 1 does not necessarily imply the convergence of the sequence, since there are situations in which Newton’s method diverges. For example, consider the function \(f(x)=x^{1/3}\): starting at a point \(x_{1}=a\; (a \in \mathbb {R})\), Newton’s method generates the following sequence

$$\begin{aligned} x_{n+1}=x_{n}-\frac{f(x_{n})}{f'(x_{n})}=-2x_{n}. \end{aligned}$$
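As a quick numerical illustration of this divergence (with an arbitrary starting point chosen only for the example), the short Python sketch below iterates the recursion and shows the magnitude of the iterates doubling at every step rather than approaching the root \(x=0\).

```python
# Newton's method on f(x) = x**(1/3): x_{n+1} = x_n - f(x_n)/f'(x_n) = -2 * x_n
x = 0.1                      # illustrative starting point a != 0
for n in range(6):
    x = -2.0 * x
    print(n + 1, x)          # magnitudes 0.2, 0.4, 0.8, ... move away from the root 0
```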

It is clear that for \(a \ne 0\) the sequence \(\{x_{n}\}_{n=1,2,\ldots }\) grows in magnitude instead of converging; hence Newton’s method fails to find the root \(x=0\). The authors in Fashing and Tomasi (2005) showed that the MS procedure at step \(k, k\ge 1\), maximizes a quadratic function \(\rho _{k}({\varvec{x}})\) (Theorem \(3\) in Fashing and Tomasi 2005). Furthermore, they proved that \(\rho _{k}({\varvec{x}})\) can be regarded as a (lower) bounding function for the density estimate \(\hat{f}({\varvec{x}})\), where a bounding function \(\rho _{k}({\varvec{x}})\) for \(\hat{f}({\varvec{x}})\) is defined as follows (Salakhutdinov et al. 2003).

Definition 5

Let \(\hat{f}({\varvec{x}}): \mathcal {X} \rightarrow \mathbb {R}\) denote our objective function, where \(\mathcal {X}\subset \mathbb {R}^D, D\ge 1\). The bounding function \(\rho _{k}({\varvec{x}})\) for \(\hat{f}({\varvec{x}})\) is a function such that \(\rho _{k}({\varvec{x}}^*)=\hat{f}({\varvec{x}}^*)\) at some point \({\varvec{x}}^* \in \mathcal {X}\) and \(\rho _{k}({\varvec{x}})\le \hat{f}({\varvec{x}})\) for every other \({\varvec{x}}\in \mathcal {X}\).

The authors in Fashing and Tomasi (2005) showed that the MS algorithm with profile \(k\) is a quadratic bound maximization over a density estimate \(\hat{f}\) using the shadow of \(k\) (Theorem \(4\) in Fashing and Tomasi 2005). This result implies that the pdf estimate along the sequence generated by the MS algorithm is an increasing sequence, i.e., \(\hat{f}({\varvec{y}}_{k+1})\ge \rho _{k}({\varvec{y}}_{k+1})>\rho _{k}({\varvec{y}}_{k})=\hat{f}({\varvec{y}}_{k})\) (Fashing and Tomasi 2005).

Assume we are interested in maximizing a scalar-valued function \(L(\theta )\) of a free parameter vector \(\theta \). Bound maximizer algorithms (e.g., the EM algorithm for maximum likelihood learning in latent variable models) never worsen the objective function. In other words, bound maximizer algorithms generate a sequence \(\{\theta _{k}\}_{k=1,2,\ldots }\) such that \(L(\theta _{k+1})\ge L(\theta _{k}), k\ge 1\) (Salakhutdinov et al. 2003). However, a bound maximizer algorithm (e.g., the EM algorithm) may not converge without additional conditions (Wu 1983). For example, Boyles presented a counterexample that satisfies all the hypotheses of Theorem \(2\) in Dempster et al. (1977) but converges to the unit circle instead of converging to a single point (Boyles 1983). Thus, showing that the MS algorithm is a bound optimization is not enough to prove the convergence of the mode estimate sequence.

From (4) and (10), the discrete dynamical system for the MS algorithm is

$$\begin{aligned} \mathbf {y}(k+1)=\mathbf {m}(\mathbf {y}(k))+\mathbf {y}(k), \end{aligned}$$
(11)

where \(\mathbf {y}(k)\) is the mode estimate at the \(k\)th iteration. Let \(\mathbf {y}^*\) denote an equilibrium point of (11); then \(\mathbf {y}^*\) is a fixed point of (11), which implies \(\mathbf {m}(\mathbf {y}^*)=\mathbf {0}\). Consider the proposed Lyapunov function \(V(\mathbf {y})=\hat{f}(\mathbf {y}^*)-\hat{f}(\mathbf {y})\). For any isolated mode \(\mathbf {y}^*\) of the estimated pdf there is an open neighborhood \(E\) around \(\mathbf {y}^*\) such that the estimated pdf attains its maximum over \(E\) at \(\mathbf {y}^*\), i.e., \(V(\mathbf {y})=\hat{f}(\mathbf {y}^*)-\hat{f}(\mathbf {y})>0\) for all points \(\mathbf {y}\in E\backslash \{\mathbf {y}^*\}\). It is clear that \(V(\mathbf {y}^*)=0\); therefore \(V(\mathbf {y})\) is strictly positive definite in \(E\). To show that \(\Delta V(\mathbf {y})<0\), we need the following lemma.Footnote 2

Lemma 1

If the profile \(k\) is a convex and strictly decreasing function, then the density estimate values \(\hat{f}\) are increasing along the mode estimate sequence.

Proof

Let \(\mathbf {y}_{j}\ne \mathbf {y}_{j+1}\); we show that \(\hat{f}(\mathbf {y}_{j+1})>\hat{f}(\mathbf {y}_{j})\). From Eq. (1), we have

$$\begin{aligned} \hat{f}(\mathbf {y}_{j+1})-\hat{f}(\mathbf {y}_{j})&=\frac{c_{k,d}}{nh^d} \left[ \sum \limits _{i=1}^n k\left( \left\| \frac{\mathbf {y}_{j+1}-\mathbf {x}_{i}}{h}\right\| ^2\right) - \sum \limits _{i=1}^n k\left( \left\| \frac{\mathbf {y}_{j}-\mathbf {x}_{i}}{h}\right\| ^2\right) \right] \\&=\frac{c_{k,d}}{nh^d}\sum \limits _{i=1}^n\left[ k\left( \left\| \frac{\mathbf {y}_{j+1} -\mathbf {x}_{i}}{h}\right\| ^2\right) -k\left( \left\| \frac{\mathbf {y}_{j}- \mathbf {x}_{i}}{h}\right\| ^2\right) \right] \\&\ge \frac{c_{k,d}}{nh^d}\sum \limits _{i=1}^nk'\left( \left\| \frac{\mathbf {y}_{j}- \mathbf {x}_{i}}{h}\right\| ^2\right) \left( \left\| \frac{\mathbf {y}_{j+1}-\mathbf {x}_{i}}{h}\right\| ^2-\left\| \frac{\mathbf {y}_{j} -\mathbf {x}_{i}}{h}\right\| ^2\right) , \end{aligned}$$

where the last inequality is true since the convexity of the profile function \(k\) implies that \(k(x_{2})-k(x_{1})\ge k'(x_{1})(x_{2}-x_{1})\). By expanding the terms on the right-hand side of the above inequality and using Eq. (4), we have

$$\begin{aligned} \hat{f}(\mathbf {y}_{j+1})-\hat{f}(\mathbf {y}_{j})&\ge \frac{c_{k,d}}{nh^d}\sum \limits _{i=1}^nk'\left( \left\| \frac{ \mathbf {y}_{j}-\mathbf {x}_{i}}{h}\right\| ^2\right) \left( \left\| \frac{\mathbf {y}_{j+1}-\mathbf {x}_{i}}{h}\right\| ^2-\left\| \frac{\mathbf {y}_{j}-\mathbf {x}_{i}}{h}\right\| ^2\right) \\&=\frac{c_{k,d}}{nh^{d+2}}\sum \limits _{i=1}^n k'\left( \left\| \frac{ \mathbf {y}_{j}-\mathbf {x}_{i}}{h}\right\| ^2\right) \left( \left\| \mathbf {y}_{j+1}\right\| ^2+ \left\| \mathbf {x}_{i}\right\| ^2-2\mathbf {y}_{j+1}\cdot \mathbf {x}_{i}-\left\| \mathbf {y}_{j}\right\| ^2-\left\| \mathbf {x}_{i}\right\| ^2+2\mathbf {y}_{j}\cdot \mathbf {x}_{i}\right) \\&=\frac{c_{k,d}}{nh^{d+2}}\sum \limits _{i=1}^n k'\left( \left\| \frac{\mathbf {y}_{j}- \mathbf {x}_{i}}{h}\right\| ^2\right) \left( \left\| \mathbf {y}_{j+1}\right\| ^2- \left\| \mathbf {y}_{j}\right\| ^2-2\left( \mathbf {y}_{j+1}-\mathbf {y}_{j}\right) \cdot \mathbf {x}_{i}\right) \\&=\frac{c_{k,d}}{nh^{d+2}}\sum \limits _{i=1}^n k'\left( \left\| \frac{ \mathbf {y}_{j}-\mathbf {x}_{i}}{h}\right\| ^2\right) \left( \left\| \mathbf {y}_{j+1}\right\| ^2- \left\| \mathbf {y}_{j}\right\| ^2-2\left( \mathbf {y}_{j+1}-\mathbf {y}_{j}\right) \cdot \mathbf {y}_{j+1}\right) \\&=\frac{c_{k,d}}{nh^{d+2}}\sum \limits _{i=1}^n k'\left( \left\| \frac{\mathbf {y}_{j}- \mathbf {x}_{i}}{h}\right\| ^2\right) \left( -\left\| \mathbf {y}_{j+1}\right\| ^2- \left\| \mathbf {y}_{j}\right\| ^2+2\mathbf {y}_{j}\cdot \mathbf {y}_{j+1}\right) \\&=-\frac{c_{k,d}}{nh^{d+2}}\sum \limits _{i=1}^n k'\left( \left\| \frac{\mathbf {y}_{j}- \mathbf {x}_{i}}{h}\right\| ^2\right) \left\| \mathbf {y}_{j+1}-\mathbf {y}_{j}\right\| ^2>0, \end{aligned}$$

where \(\cdot \) denotes the inner product. The last inequality comes from the fact that the profile function \(k\) is convex and strictly decreasing, which implies that its derivative is strictly negative, i.e., \(k'(x)<0\). Therefore, the sequence \(\{\hat{f}(\mathbf {y}_{j})\}_{j=1,2,\ldots }\) is strictly increasing and, for an arbitrary \(j\) with \(\mathbf {y}_{j+1}\ne \mathbf {y}_{j}\), we have \(\hat{f}(\mathbf {y}_{j+1})-\hat{f}(\mathbf {y}_{j})>0\). \(\square \)

Using Lemma 1, we have

$$\begin{aligned} \Delta V(\mathbf {y})&=V(\mathbf {y}({k+1}))-V(\mathbf {y}({k}))\nonumber \\&= \hat{f}(\mathbf {y}^*)-\hat{f}(\mathbf {y}(k+1))-\hat{f}(\mathbf {y}^*) +\hat{f}(\mathbf {y}(k))\nonumber \\&=\hat{f}(\mathbf {y}(k))-\hat{f}(\mathbf {y}(k+1)) <0. \end{aligned}$$
(12)

The last inequality holds since, by Lemma \(1\), the sequence \(\{\hat{f}(\mathbf {y}(k))\}_{k=1,2,\ldots }\) is strictly increasing whenever consecutive mode estimates differ. Therefore, for the discrete dynamical system in (11), the function \(V\) is a strict Lyapunov function and the equilibrium point \(\mathbf {y}^*\) is asymptotically stable.
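The monotonicity behind (12) is straightforward to verify numerically. The following self-contained Python sketch runs the Gaussian-profile MS update (4) on synthetic two-dimensional data and checks that the density values \(\hat{f}(\mathbf {y}(k))\) do not decrease along the mode estimate sequence; the data, bandwidth, and number of iterations are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
h, n, d = 0.8, X.shape[0], X.shape[1]

def f_hat(y):
    """Density estimate (1) with the Gaussian profile k(x) = exp(-x/2)."""
    k = np.exp(-np.sum(((y - X) / h) ** 2, axis=1) / 2.0)
    return (2.0 * np.pi) ** (-d / 2.0) / (n * h ** d) * k.sum()

y = X[0].astype(float)
values = [f_hat(y)]
for _ in range(30):                                  # MS update (4)
    w = np.exp(-np.sum(((y - X) / h) ** 2, axis=1) / 2.0)
    y = (w[:, None] * X).sum(axis=0) / w.sum()
    values.append(f_hat(y))

# Lemma 1 / inequality (12): f_hat is non-decreasing along the sequence
# (strictly increasing until the sequence settles at a fixed point)
print(all(b >= a - 1e-12 for a, b in zip(values, values[1:])))   # expected: True
```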

Remarks

1. For the Lyapunov function in Carreira-Perpinán (2007), it is required that all the covariance matrices be identical and proportional to the identity matrix, i.e., \(\mathbf {\Sigma }_{i}=h^2\mathbf {I}, i=1,2,\ldots , n\). For the proposed Lyapunov function, in contrast, there is no constraint on the covariance matrices other than being positive definite.

2. The systems in (5) or (11) can have many equilibrium points, and the above argument works as long as the equilibrium points are isolated. For each equilibrium point \(\mathbf {x}_{i}^*, i=1,2, \ldots \) (\(\mathbf {y}^*\) for the discrete case), there is an open neighborhood \(E_{i}\) such that the estimated pdf \(\hat{f}(\mathbf {x})\) attains its maximum over \(E_{i}\) at \(\mathbf {x}_{i}^*\).

3. By proving the asymptotic stability of the isolated equilibrium points of the MS algorithm, we showed that if we start from a point close to a specific equilibrium point, then the MS algorithm remains close to that equilibrium point and finally converges to it.

4. In real-world applications, where digital computers store numbers in floating point representation, the MS algorithm may not converge exactly to a fixed point due to rounding errors. As mentioned before, the MS algorithm stops when the distance between two consecutive mode estimates becomes less than some predefined threshold, i.e., \(\Vert \mathbf {y}_{k+1}- \mathbf {y}_{k}\Vert <\epsilon \). By choosing a small threshold, we can guarantee that the stopping point is close enough to the fixed point.

5 Conclusion

The MS algorithm is a widely used technique for estimating the modes of an estimated pdf. Although the algorithm has been used in many applications, its theoretical properties have received relatively little attention in the literature. In this paper, we generalized the asymptotic stability results in Carreira-Perpinán (2007) by introducing a Lyapunov function for the MS algorithm with a continuous iteration index. The author in Carreira-Perpinán (2007) proposed a Lyapunov function for the MS algorithm with the Gaussian kernel when all terms in the pdf estimate have equal covariance matrices that are proportional to the identity matrix. In our case, there is no constraint on the covariance matrices; they only need to be positive definite. We also showed that the proposed function satisfies the required condition for an equilibrium (fixed) point of the discrete MS algorithm with isolated stationary points to be asymptotically stable. In other words, we proved that for the MS algorithm with isolated stationary points, if we start the iterations close enough to an equilibrium point, then the mode estimate sequence remains close to that point and finally converges to it.