
On the breakdown behavior of the TCLUST clustering procedure

  • Original Paper

Abstract

Clustering procedures allowing for general covariance structures of the fitted clusters need some constraints on the solutions, and several proposals to this end have been introduced in the literature. The TCLUST procedure works with a restriction on the “eigenvalues ratio” of the clusters' scatter matrices. To achieve robustness against outliers, the procedure allows a proportion α of the most outlying observations to be trimmed off. The resistance of TCLUST to infinitesimal contamination has already been studied. This paper examines its resistance to a higher amount of contamination through a study of its breakdown behavior. The rather new concept of restricted breakdown point will demonstrate that the TCLUST procedure resists a proportion α of contamination as long as the data set is sufficiently “well clustered”.
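As a rough illustration of the α-trimming idea only (not the actual TCLUST algorithm, which additionally fits cluster weights and eigenvalue-constrained scatter matrices), here is a minimal trimmed k-means sketch; the function name, the naive initialization from the first g points, and the fixed iteration count are our own choices:

```python
import numpy as np

def trimmed_kmeans(X, g, alpha, n_iter=25):
    """Minimal trimmed k-means sketch: assign each point to its nearest
    center, trim the ceil(alpha*n) points with the largest distances,
    and recompute each center from the untrimmed points only."""
    n = len(X)
    keep = n - int(np.ceil(alpha * n))       # number of untrimmed points
    centers = X[:g].astype(float).copy()     # naive initialization
    kept = np.arange(n)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        kept = np.argsort(dist.min(axis=1))[:keep]   # untrimmed observations
        for j in range(g):
            pts = X[kept][labels[kept] == j]
            if len(pts) > 0:
                centers[j] = pts.mean(axis=0)
    trimmed = np.setdiff1d(np.arange(n), kept)
    return centers, trimmed
```

With two tight groups and one gross outlier, the outlier ends up among the trimmed observations while the centers stay on the groups, which is the qualitative behavior the breakdown analysis below makes precise.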


Notes

  1. A set is relatively compact if its closure is a compact set.

  2. The pigeonhole principle states that if n items are put into m pigeonholes with n>m, then at least one pigeonhole must contain more than one item.

References

  • Cuesta-Albertos JA, Gordaliza A, Matrán C (1997) Trimmed k-means: an attempt to robustify quantizers. Ann Stat 25:553–576

  • Dennis JE Jr (1982) Algorithms for nonlinear fitting. In: Nonlinear optimization, Cambridge, 1981. Academic Press, London, pp 67–78

  • Donoho D, Huber PJ (1983) The notion of breakdown point. In: A festschrift for Erich L. Lehmann. Wadsworth, Belmont, pp 157–184

  • Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611–631

  • Gallegos MT, Ritter G (2005) A robust method for cluster analysis. Ann Stat 33:347–380

  • Gallegos MT, Ritter G (2009a) Trimmed ML estimation of contaminated mixtures. Sankhyā 71:164–220

  • Gallegos MT, Ritter G (2009b) Trimming algorithms for clustering contaminated grouped data and their robustness. Adv Data Anal Classif 3:135–167

  • García-Escudero LA, Gordaliza A (1999) Robustness properties of k means and trimmed k means. J Am Stat Assoc 94:956–969

  • García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2008) A general trimming approach to robust cluster analysis. Ann Stat 36:1324–1345

  • García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2010) A review of robust clustering methods. Adv Data Anal Classif 4:89–109

  • García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2011) Exploring the number of groups in robust model-based clustering. Stat Comput 21:585–599

  • Genton MG, Lucas A (2003) Comprehensive definitions of breakdown points for independent and dependent observations. J R Stat Soc, Ser B, Stat Methodol 65:81–94

  • Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13:795–800

  • Hennig C (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. J Multivar Anal 99:1154–1176

  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley–Interscience, New York

  • McLachlan G, Peel D (2000) Finite mixture models. Wiley–Interscience, New York

  • Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52:299–308

  • Ruwet C, García-Escudero LA, Gordaliza A, Mayo-Iscar A (2012) The influence function of the TCLUST robust clustering procedure. Adv Data Anal Classif 6:107–130

  • Zhong S, Ghosh J (2004) A unified framework for model-based clustering. J Mach Learn Res 4:1001–1037


Acknowledgements

This research has been partially supported by the Spanish Ministerio de Ciencia y Tecnología and the FEDER grant MTM2011-28657-C02-01.

Author information


Correspondence to C. Ruwet.

Appendix

Proof of Lemma 1

(a) From (1), we have

$$2 L(X_{\mathcal{R}_\theta}|\mathbf{p},\mathbf{m},\mathbf{V})=2 \sum _{j=1}^{g}\sum_{x_i\in R_j}\log \bigl(p_j\varphi (x_i;m_j,V_j ) \bigr)\leq 2 \sum_{j=1}^{g}\sum _{x_i\in R_j}\log\varphi (x_i;m_j,V_j ) $$

since \(0\leq p_{j}\leq 1\) for all j. The inequality is strict if \(0<p_{j}<1\) for at least one j=1,…,g. Then, using the expression of the normal pdf, we obtain

by the (ER) restriction. Finally, we get

since \(m_{j}=\bar{x}_{R_{j}}\) (Proposition 4 of García-Escudero et al. 2008).

(b) Since \(|\mathcal{R}_\theta\cap X_{n}|\geq gd+1\), there is a subset, w.l.o.g. \(R_{l}\), that contains at least d+1 elements of \(X_{n}\) (pigeonhole principle; see Note 2). By general position of \(X_{n}\), \(W_{R_{l}\cap X_{n}}\) is regular and

$$W_{\mathcal{R}_\theta}=\sum_{j=1}^{g}W_{R_j}\succeq W_{R_l} \succeq W_{R_l\cap X_n}\succeq K I_d$$

with some constant K>0 that depends only on X n . The result follows directly from this relation and part (a). □
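The Löwner chain above combines two standard facts: a sum of positive semidefinite scatter matrices dominates each summand, and the scatter of a subset about the enclosing cluster's mean dominates the scatter about the subset's own mean. A quick numeric sanity check with arbitrary made-up data (the variable names are ours, not the paper's):

```python
import numpy as np

def scatter(X, m=None):
    """Scatter matrix sum((x - m)(x - m)^T) over the rows of X;
    m defaults to the sample mean of X."""
    m = X.mean(axis=0) if m is None else m
    D = X - m
    return D.T @ D

def is_psd(A, tol=1e-10):
    """Positive semidefiniteness up to rounding error."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

rng = np.random.default_rng(1)
cluster = rng.normal(size=(8, 3))      # plays the role of R_l
subset = cluster[:5]                   # plays the role of R_l ∩ X_n

# Step 1: dropping points can only remove PSD rank-one terms.
step1 = scatter(cluster) - scatter(subset, cluster.mean(axis=0))
# Step 2: deviations about the subset's own mean are smallest.
step2 = scatter(subset, cluster.mean(axis=0)) - scatter(subset)
```

Both differences are positive semidefinite, so the full-cluster scatter dominates the subset scatter in the Löwner order, as the chain of inequalities asserts.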

Proof of Proposition 1

(a) Let M be the modified data set where n−r+g−1 observations have been replaced arbitrarily and denote by \(\mathcal{R}^{*}\) and \(\theta^{*}=(\mathbf{p}^{*},\mathbf{m}^{*},\mathbf{V}^{*})\) the optimal partition and the optimal parameters for M, i.e. \(\mathcal{R}_{\theta^{*}}=\mathcal{R}^{*}\) and \(\theta_{\mathcal{R}^{*}}=\theta^{*}\). The maximum trimmed likelihood remains bounded from below by a strictly positive constant that depends only on \(X_{n}\). To see this, take \(R=\{x_{1},\ldots,x_{r-g+1}\}\cup\{y_{1},\ldots,y_{g-1}\}\) with r−g+1 original observations and g−1 replacements, \(p_{j}=1/g\), \(V_{j}=I_{d}\) for all j=1,…,g, and \(m_{1}=0\) and \(m_{j}=y_{j-1}\) for all j=2,…,g. Then, by optimality of \(\mathcal{R}^{*}\) and \((\mathbf{p}^{*},\mathbf{m}^{*},\mathbf{V}^{*})\), we have

$$f\bigl(X_{\mathcal{R}^*}|\mathbf{p}^*,\mathbf{m}^*,\mathbf{V}^*\bigr)\geq f \bigl(X_{\mathcal{R}_\theta}|\theta=(\mathbf{p},\mathbf{m},\mathbf{V})\bigr) $$

which is also larger than the likelihood computed on R with (p,m,V) and the following assignments: \(x_{i}\in R_{1}\) for every i=1,…,r−g+1 and \(y_{i}\in R_{i+1}\) for every i=1,…,g−1. This leads to

By assumption, any subset of M of size r contains at least r−(n−r+g−1)≥gd+1 original data points. Then, Lemma 1(b) can be applied and combined with the previous step to obtain

$$-\infty<2\log C \leq 2 L \bigl(X_{\mathcal{R}^*}|\mathbf{p}^*,\mathbf{m}^*,\mathbf{V}^*\bigr) \leq -dr\log(2\pi m_n)-d M_n^{-1}K $$

which implies that \(m_{n}\) and \(M_{n}\) are uniformly bounded and bounded away from zero.

(b) Let M be the modified data set where n−r+g observations have been replaced arbitrarily and denote by \(\mathcal{R}^{*}\) and \((\mathbf{p}^{*},\mathbf{m}^{*},\mathbf{V}^{*})\) the optimal set of untrimmed observations and the optimal parameters for M. By assumption, \(\mathcal{R}^{*}\) contains at least g replacements. Then, either one cluster contains at least two replacements or each cluster contains at least one replacement. In the first case, let \(R_{l}^{*}\) be a cluster with two replacements x and y. In the second case, let \(R_{l}^{*}\) be a cluster that contains at least d+1≥2 elements (such a cluster exists due to the general assumption r≥gd+1 and the pigeonhole principle); this cluster contains at least one replacement y and some other element x (replacement or not). In both cases, it is easy to see that \(\operatorname{tr}(S^{*}_{l}) \geq \frac{1}{4}\|x-y\|^{2}\). Moreover, by (3.4) in García-Escudero et al. (2008), we know that the eigenvalues of the matrices \({V^{*}_{j}}^{-1}\), j=1,…,g, denoted by \(\lambda_{i,j}\) for i=1,…,d and j=1,…,g, satisfy

$$(\lambda_{1,1},\ldots,\lambda_{d,1},\lambda_{1,2}\ldots,\lambda_{d,g})= \mathop{\mbox{argmin}}\limits_{\tilde{\lambda}_{i,j}}\sum_{j=1}^{g}p^{*}_{j}\sum_{i=1}^{d} \bigl(-\log \tilde{\lambda}_{i,j}+\lambda_i\bigl(S^{*}_{j}\bigr) \tilde{\lambda}_{i,j} \bigr) $$

(the absence of the weights in their expression is a typo). If \(\mathbf{V}^{*}\in\mathcal{V}_{c}\), then \(2\mathbf{V}^{*}\in\mathcal{V}_{c}\) and the previous equation leads to

since the eigenvalues are positive. Then, using \(\lambda_{i}\bigl({V^{*}_{l}}^{-1}\bigr)\geq \lambda_{\mathrm{min}}\bigl({V^{*}_{l}}^{-1}\bigr)\) and \(p^{*}_{l}\geq 2/r\), we have

$$0\leq d \log 2 - \frac{1}{2}\frac{2}{r}\lambda_{\mathrm{min}}\bigl({V^*_l}^{-1} \bigr)\sum_{i=1}^{d}\lambda_i \bigl(S^{*}_{l}\bigr) \leq d \log 2 - \frac{1}{r} \lambda_{\mathrm{min}}\bigl({V^*_l}^{-1}\bigr)\operatorname{tr} \bigl(S_{l}^{*}\bigr) $$

which implies

$$\lambda_{\mathrm{min}} \bigl({V^*_l}^{-1}\bigr)\|x-y\|^2\leq 4 \lambda_{\mathrm{min}} \bigl({V^*_l}^{-1}\bigr)\operatorname{tr} \bigl(S^{*}_{l}\bigr)\leq 4dr\log 2. $$

This shows that the smallest eigenvalue of \({V^{*}_{l}}^{-1}\), i.e. the inverse of the largest eigenvalue of \(V^{*}_{l}\), would have to tend toward zero if a replacement were chosen arbitrarily far away from the original observations and from the other replacements.
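The argmin display above separates across the pairs (i,j): each coordinate term has the form \(-\log\tilde{\lambda}+s\tilde{\lambda}\) with \(s=\lambda_{i}(S^{*}_{j})>0\), whose unconstrained minimizer over \(\tilde{\lambda}>0\) is \(\tilde{\lambda}=1/s\) with minimum value \(1+\log s\). A grid check of this one-dimensional fact (s = 2.5 is an arbitrary test value of ours):

```python
import numpy as np

s = 2.5                                  # stands for an eigenvalue of S_j*
lam = np.linspace(1e-3, 5.0, 200001)     # fine grid over lambda-tilde > 0
f = -np.log(lam) + s * lam               # per-eigenvalue objective
lam_star = lam[f.argmin()]               # grid minimizer, close to 1/s
```

The constrained problem in the proof differs only in that the ratios of the \(\tilde{\lambda}_{i,j}\) are restricted, which is exactly where the (ER) restriction enters.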

(c) Direct from (a) and (b). □

Proof of Lemma 2

Assume \(y_{1}\in R_{j}\). Proposition 4 of García-Escudero et al. (2008) implies

$$m_j=\frac{1}{|R_j|} \biggl(\sum_{x_i\in R_j}x_i+\sum_{y_i\in R_j}y_i \biggr)=\frac{1}{|R_j|} \biggl(\sum_{x_i\in R_j}x_i+\sum_{y_i\in R_j}y_1+\sum_{y_i\in R_j}(y_i-y_1) \biggr)$$

which is enough since \((y_{i}-y_{1})\), i=2,…,q, are bounded. □
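The decomposition above is an exact algebraic identity (the middle sum contributes one copy of \(y_{1}\) for each replacement in \(R_{j}\)), which a quick numeric check confirms; the points and variable names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(4, 2))             # original observations in R_j
ys = 50.0 + rng.normal(size=(3, 2))      # replacements y_1, ..., y_q in R_j
Rj = np.vstack([xs, ys])
q = len(ys)

m_j = Rj.mean(axis=0)                    # cluster mean of R_j
# Right-hand side of the decomposition in the proof:
rhs = (xs.sum(axis=0) + q * ys[0] + (ys - ys[0]).sum(axis=0)) / len(Rj)
```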

Proof of Proposition 2

(a) The proof follows that of Theorem 2(a) in Gallegos and Ritter (2009b) with some adaptations. While they use the fact that \(\det(W_{\mathcal{R}^{*}})\rightarrow\infty\) when ∥y∥→∞, we use the fact that \(\lambda_{\mathrm{max}}(W_{\mathcal{R}^{*}})\rightarrow\infty\). Then, Lemma 1(a) and the previous claim show that the maximum of the log-likelihood function (1) tends to −∞ if the optimal solution does not trim the replacement. On the other hand, if the replacement is trimmed off, the log-likelihood value obtained with \(p_{j}=1/g\), \(m_{j}=0\) and \(V_{j}=I_{d}\) for all j=1,…,g provides a finite lower bound for the trimmed log-likelihood function, which leads to a contradiction.

(b) The proof follows that of Theorem 3.5(b) in Gallegos and Ritter (2009a). Parts (α), construction of the data set, and (β), bounded behavior of the maximum likelihood function, are the same. In part (γ), we can use the same reasoning as in the proof of Proposition 1(b) to show that \(\operatorname{tr}(S_{l}^{*})\rightarrow\infty\) if \(K_{1},K_{2}\rightarrow\infty\) (in replacement of (3.1)) and the pigeonhole principle with the general assumption r≥gd+1 to show that \(W_{\mathcal{R}^{*}}\succeq W_{R_{j}}\succeq c_{F} I_{d}\) (in replacement of (3.2)). Then, Lemmas 1(a) and 2, the first two steps, and the two previous claims allow us to conclude as in Gallegos and Ritter (2009a).

(c) Direct from (a) and (b). □

Proof of Proposition 3

This proof follows that of Proposition 2 in Gallegos and Ritter (2009b). Let M be any admissible data set obtained from \(X_{n}\) by modifying at most r−q elements and let \(\mathcal{R}^{*}\) and \(\theta^{*}=(\mathbf{p}^{*},\mathbf{m}^{*},\mathbf{V}^{*})\) be the optimal partition and optimal parameters for M, i.e. \(\mathcal{R}_{\theta^{*}}=\mathcal{R}^{*}\) and \(\theta_{\mathcal{R}^{*}}=\theta^{*}\).

(α):

\(M_{n}^{*}\) and \(m_{n}^{*}\) are bounded and bounded away from zero by constants that depend only on X n .

Since \(\mathcal{R}^{*}\) contains at most r−q replacements, it contains at least q≥gd+1 original observations. The proof finishes as that of Proposition 1(a).

(β):

If \(R_{j}^{*}\) contains some original observations, then \(m_{j}^{*}\) is bounded by a constant that depends only on X n .

Let \(x\in R_{j}^{*}\cap X_{n}\). We have \(\operatorname{tr}(S_{j}^{*}) \geq \frac{1}{|R_{j}^{*}|}\|x-m_{j}^{*}\|^{2}\). Following the argument in the proof of Proposition 1(b), we show that \(\|x-m_{j}^{*}\|^{2}\leq d r^{2} M_{n}^{*} \log 2 \) and the claim follows from the previous step.

(δ):

If there exists j∈{1,…,g} such that \(0<p_{j}^{*}<1\), then

$$ L\bigl(X_{\mathcal{R}^*}|\mathbf{p}^*,\mathbf{m}^*,\mathbf{V}^* \bigr)< c_{d,r}+\frac{d r}{2}\log c -\frac{r}{2}\log \biggl( \frac{\operatorname{tr}(S_{\mathcal{R}^*})}{d} \biggr)^d. $$
(6)

By Lemma 1(a), and since \(M^{*}_{n}\leq c m_{n}^{*}\) due to (ER), we have

(ϵ):

\(\mathcal{R}^{*}\) contains no modification with a sufficiently large norm.

The reasoning before (12) of Gallegos and Ritter (2009b) still holds in our case. This, with (3), implies that

$$\biggl(\frac{\operatorname{tr}(W_{\mathcal{R}^*})}{d} \biggr)^d\geq \biggl(\frac{\operatorname{tr}(W_\mathcal{T})}{d} \biggr)^d\geq g^2\max_{T\subseteq R\in \bigl(\substack{X_n\\r} \bigr)}\det(c W_{\mathcal{P}\cap R})$$

and then,

$$ \max_{T\subseteq R\in \bigl(\substack{X_n\\r} \bigr)} \log\det(S_{\mathcal{P}\cap R})\leq -2\log g -d\log c +\log \biggl(\frac{\operatorname{tr}(S_{\mathcal{R}^*})}{d} \biggr)^d. $$
(7)

With their notation \(\bar{x}_{\mathcal{P}\cap R}\) and for all \(R\subseteq X_{n}\) of cardinality r, we have

Then,

We can use (7) and (6) to obtain

which contradicts the optimality of \(\mathcal{R}^{*}\) and \((\mathbf{p}^{*},\mathbf{m}^{*},\mathbf{V}^{*})\). Hence \(\mathcal{R}^{*}\) contains no replacement with large norm, and step (β) shows that the means remain bounded.

 □
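For concreteness, the (ER) restriction used throughout, \(M_{n}\leq c\, m_{n}\) with \(M_{n}\) and \(m_{n}\) the largest and smallest eigenvalue across all cluster scatter matrices, can be checked mechanically; this helper is our own illustration, not code from the TCLUST implementation:

```python
import numpy as np

def satisfies_er(scatters, c):
    """Check the eigenvalues-ratio restriction: the largest eigenvalue over
    ALL cluster scatter matrices is at most c times the smallest one."""
    eig = np.concatenate([np.linalg.eigvalsh(V) for V in scatters])
    return bool(eig.max() <= c * eig.min())

# Two diagonal scatter matrices: eigenvalues {1, 2} and {3, 8},
# so the overall eigenvalues ratio is 8 / 1 = 8.
V1 = np.diag([1.0, 2.0])
V2 = np.diag([3.0, 8.0])
```

Note that the ratio is taken across clusters jointly, not within each scatter matrix separately; this is what couples the clusters and prevents degenerate, near-singular solutions.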

Proof of Lemma 3

The first half of the proof follows that of Lemma 5 of Gallegos and Ritter (2009b). Then, we can use the linearity of the trace operator to obtain

as we can also use Lemma A1 of Gallegos and Ritter (2009b). Then

$$\operatorname{tr}(W_{\mathcal{T}}) \geq \operatorname{tr}(W_{\mathcal{T}\sqcap\mathcal{P}}) \biggl(1+ \frac{\kappa_\rho}{\operatorname{tr}(S_{\mathcal{T}\sqcap\mathcal{P}})}\min_{\substack{k,l\neq j\\P_h\cap T_k\neq\emptyset, h=j,l}} \|\bar{x}_{T_k\cap P_j}- \bar{x}_{T_k\cap P_l}\|^2 \biggr) $$

and the separation condition (4) leads to the conclusion. □

Proof of Proposition 4

(a) Direct from Proposition 3 and Lemma 3.

(b) Follows proof of Theorem 3(b) of Gallegos and Ritter (2009b) with Lemma 2.

(c) Follows proof of Theorem 3(c) of Gallegos and Ritter (2009b). □


Cite this article

Ruwet, C., García-Escudero, L.A., Gordaliza, A. et al. On the breakdown behavior of the TCLUST clustering procedure. TEST 22, 466–487 (2013). https://doi.org/10.1007/s11749-012-0312-4

