Abstract
Clustering procedures that allow general covariance structures for the obtained clusters require some constraints on the solutions. With this in mind, several proposals have been introduced in the literature. The TCLUST procedure works with a restriction on the "eigenvalues-ratio" of the clusters' scatter matrices. To achieve robustness with respect to outliers, the procedure allows the trimming of a proportion α of the most outlying observations. The resistance of TCLUST to infinitesimal contamination has already been studied. This paper examines its resistance to a larger amount of contamination through a study of its breakdown behavior. The rather new concept of restricted breakdown point is used to show that the TCLUST procedure resists a proportion α of contamination as soon as the data set is sufficiently "well clustered".
Notes
A set is relatively compact if its closure is a compact set.
The pigeonhole principle states that if n items are put into m pigeonholes with n>m, then at least one pigeonhole must contain more than one item.
References
Cuesta-Albertos JA, Gordaliza A, Matrán C (1997) Trimmed k-means: an attempt to robustify quantizers. Ann Stat 25:553–576
Dennis JE Jr. (1982) Algorithms for nonlinear fitting. In: Nonlinear optimization, Cambridge, 1981. Academic Press, London, pp 67–78
Donoho D, Huber PJ (1983) The notion of breakdown point. In: A festschrift for Erich L. Lehmann. Wadsworth, Belmont, pp 157–184
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611–631
Gallegos MT, Ritter G (2005) A robust method for cluster analysis. Ann Stat 33:347–380
Gallegos MT, Ritter G (2009a) Trimmed ML estimation of contaminated mixtures. Sankhyā 71:164–220
Gallegos MT, Ritter G (2009b) Trimming algorithms for clustering contaminated grouped data and their robustness. Adv Data Anal Classif 3:135–167
García-Escudero LA, Gordaliza A (1999) Robustness properties of k means and trimmed k means. J Am Stat Assoc 94:956–969
García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2008) A general trimming approach to robust cluster analysis. Ann Stat 36:1324–1345
García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2010) A review of robust clustering methods. Adv Data Anal Classif 4:89–109
García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2011) Exploring the number of groups in robust model-based clustering. Stat Comput 21:585–599
Genton MG, Lucas A (2003) Comprehensive definitions of breakdown points for independent and dependent observations. J R Stat Soc, Ser B, Stat Methodol 65:81–94
Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13:795–800
Hennig C (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. J Multivar Anal 99:1154–1176
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley–Interscience, New York
McLachlan G, Peel D (2000) Finite mixture models. Wiley–Interscience, New York
Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52:299–308
Ruwet C, García-Escudero LA, Gordaliza A, Mayo-Iscar A (2012) The influence function of the TCLUST robust clustering procedure. Adv Data Anal Classif 6:107–130
Zhong S, Ghosh J (2004) A unified framework for model-based clustering. J Mach Learn Res 4:1001–1037
Acknowledgements
This research has been partially supported by the Spanish Ministerio de Ciencia y Tecnología and the FEDER grant MTM2011-28657-C02-01.
Appendix
Proof of Lemma 1
(a) From (1), we have
since \(0\leq p_{j}\leq 1\) for all j. This inequality is strict as soon as \(0<p_{j}<1\) for at least one j∈{1,…,g}. Then, using the expression of the normal pdf, we obtain
by the (ER) restriction. Finally, we get
since \(m_{j}=\bar{x}_{R_{j}}\) (Proposition 4 of García-Escudero et al. 2008).
(b) Since \(|R\cap X_{n}|\geq gd+1\), at least one of the sets, w.l.o.g. \(R_{l}\), contains at least d+1 elements of \(X_{n}\) (pigeonhole principle, see Note 2). Since \(X_{n}\) is in general position, \(W_{R_{l}\cap X_{n}}\) is regular and
with some constant K>0 that depends only on X n . The result follows directly from this relation and part (a). □
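The pigeonhole count used in part (b) can be made explicit. The following display is our sketch, not part of the original proof:

```latex
% The untrimmed set R meets X_n in at least gd+1 points, and R splits
% into the g clusters R_1,...,R_g. If every cluster contained at most d
% original points, the g clusters would cover at most gd of them, a
% contradiction. Hence
\[
  |R\cap X_{n}| \;\geq\; gd+1
  \quad\Longrightarrow\quad
  \max_{1\leq l\leq g} |R_{l}\cap X_{n}|
  \;\geq\; \Bigl\lceil \tfrac{gd+1}{g} \Bigr\rceil \;=\; d+1 .
\]
```

With d+1 points of \(X_{n}\) in general position, the scatter matrix \(W_{R_{l}\cap X_{n}}\) is then regular, as claimed.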
Proof of Proposition 1
(a) Let M be the modified data set where n−r+g−1 observations have been replaced arbitrarily and denote by \(\mathcal{R}^{*}\) and \(\theta^{*}=(\mathbf{p}^{*},\mathbf{m}^{*},\mathbf{V}^{*})\) the optimal partition and the optimal parameters for M, i.e. \(\mathcal{R}_{\theta^{*}}=\mathcal{R}^{*}\) and \(\theta_{\mathcal{R}^{*}}=\theta^{*}\). The maximum trimmed likelihood remains bounded from below by a strictly positive constant that depends only on \(X_{n}\). To see this, take \(R=\{x_{1},\dots,x_{r-g+1}\}\cup\{y_{1},\dots,y_{g-1}\}\) with r−g+1 original observations and g−1 replacements, and set \(p_{j}=1/g\) and \(V_{j}=I_{d}\) for all j=1,…,g, \(m_{1}=0\) and \(m_{j}=y_{j-1}\) for all j=2,…,g. Then, by optimality of \(\mathcal{R}^{*}\) and \((\mathbf{p}^{*},\mathbf{m}^{*},\mathbf{V}^{*})\), we have
which is also larger than the likelihood computed on R with \((\mathbf{p},\mathbf{m},\mathbf{V})\) and the following assignments: \(x_{i}\in R_{1}\) for every i=1,…,r−g+1 and \(y_{i}\in R_{i+1}\) for every i=1,…,g−1. This leads to
By assumption, any subset of M of size r contains at least r−(n−r+g−1)≥gd+1 original data points. Lemma 1(b) can therefore be applied and, combined with the previous step, gives
which implies that \(m_{n}\) and \(M_{n}\) are uniformly bounded and bounded away from zero.
(b) Let M be the modified data set where n−r+g observations have been replaced arbitrarily and denote by \(\mathcal{R}^{*}\) and (p ∗,m ∗,V ∗) the optimal set of untrimmed observations and the optimal parameters for M. By assumption, \(\mathcal{R}^{*}\) contains at least g replacements. Then, either one cluster contains at least 2 replacements or each cluster contains at least one replacement. In the first case, let \(R_{l}^{*}\) be a cluster with two replacements x and y. If each cluster contains at least one replacement, let \(R_{l}^{*}\) be a cluster that contains at least d+1≥2 elements (such a cluster exists due to the general assumption r≥gd+1 and the pigeonhole principle). This cluster contains at least one replacement y and some other element x (replacement or not). In both cases, it is easy to see that \(\operatorname{tr}(S^{*}_{l}) \geq \frac{1}{4}\|x-y\|^{2}\). Moreover, by (3.4) in García-Escudero et al. (2008), we know that the eigenvalues of matrices \({V^{*}_{j}}^{-1}\), j=1,…,g, denoted by λ i,j for i=1,…,d and j=1,…,g, satisfy
(the absence of the weights in their expression is a typo). If \(\mathbf{V}^{*}\in\mathcal{V}_{c}\), then \(2\mathbf{V}^{*}\in\mathcal{V}_{c}\) and the previous equation leads to
since the eigenvalues are positive. Then, using \(\lambda_{i}\bigl({V^{*}_{l}}^{-1}\bigr)\geq \lambda_{\mathrm{min}}\bigl({V^{*}_{l}}^{-1}\bigr)\) and \(p^{*}_{l}\geq 2/r\), we have
which implies
This shows that the smallest eigenvalue of \({V^{*}_{j}}^{-1}\), i.e. the inverse of the largest eigenvalue of \(V^{*}_{j}\), could tend to zero if a replacement were chosen far away from the original observations and far away from the other replacements.
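The "easy to see" trace bound in part (b) rests on a standard two-point inequality; the following display is our sketch of that step (the exact constant 1/4 then follows from the paper's normalization of \(S^{*}_{l}\)):

```latex
% For any center m, write a = x - m and b = y - m. The parallelogram law
% ||a||^2 + ||b||^2 = (1/2)||a-b||^2 + (1/2)||a+b||^2 gives
\[
  \|x-m\|^{2}+\|y-m\|^{2}
  \;=\;\tfrac{1}{2}\|x-y\|^{2}
       +2\Bigl\|\tfrac{x+y}{2}-m\Bigr\|^{2}
  \;\geq\;\tfrac{1}{2}\|x-y\|^{2},
\]
% with equality exactly when m is the midpoint (x+y)/2.
```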
(c) Direct from (a) and (b). □
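For readers who want to experiment numerically with the (ER) restriction invoked above, here is a minimal sketch in Python. The function names are ours, and we assume, as in García-Escudero et al. (2008), that the ratio is taken over the pooled eigenvalues of all g cluster scatter matrices:

```python
import numpy as np

def er_ratio(scatters):
    # Pool the eigenvalues of all cluster scatter matrices and
    # return the ratio of the largest to the smallest one.
    eigs = np.concatenate([np.linalg.eigvalsh(S) for S in scatters])
    return eigs.max() / eigs.min()

def satisfies_er(scatters, c):
    # The (ER) restriction holds when the pooled eigenvalue
    # ratio does not exceed the constant c >= 1.
    return er_ratio(scatters) <= c
```

For instance, two identity scatter matrices give ratio 1 and satisfy (ER) for any c≥1, while a very elongated cluster breaks the restriction for small c, which is exactly the degeneracy the constraint is designed to rule out.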
Proof of Lemma 2
Assume \(y_{1}\in R_{j}\). Proposition 4 of García-Escudero et al. (2008) implies
which suffices since the differences \((y_{i}-y_{1})\), i=2,…,q, are bounded. □
Proof of Proposition 2
(a) The proof follows that of Theorem 2(a) in Gallegos and Ritter (2009b) with some adaptations. While they use the fact that \(\det(W_{\mathcal{R}^{*}})\rightarrow\infty\) when ∥y∥→∞, we use the fact that \(\lambda_{\mathrm{max}}(W_{\mathcal{R}^{*}})\rightarrow\infty\). Then, Lemma 1(a) and the previous claim show that the maximum of the log-likelihood function (1) tends to −∞ if the optimal solution does not trim the replacement. On the other hand, if the replacement is trimmed off, the log-likelihood value obtained with \(p_{j}=1/g\), \(m_{j}=0\) and \(V_{j}=I_{d}\) for all j=1,…,g provides a finite lower bound for the trimmed log-likelihood function, which leads to a contradiction.
(b) The proof follows that of Theorem 3.5(b) in Gallegos and Ritter (2009a). Parts (α), the construction of the data set, and (β), the bounded behavior of the maximum likelihood function, are the same. In part (γ), we can use the same reasoning as in the proof of Proposition 1(b) to show that \(\operatorname{tr}(S_{l}^{*})\rightarrow\infty\) if \(K_{1},K_{2}\rightarrow\infty\) (in place of (3.1)), and the pigeonhole principle with the general assumption r≥gd+1 to show that \(W_{\mathcal{R}^{*}}\succeq W_{R_{j}}\succeq c_{F} I_{d}\) (in place of (3.2)). Then, Lemmas 1(a) and 2, the first two steps and the two previous claims allow us to conclude as in Gallegos and Ritter (2009a).
(c) Direct from (a) and (b). □
Proof of Proposition 3
This proof follows that of Proposition 2 in Gallegos and Ritter (2009b). Let M be any admissible data set obtained from \(X_{n}\) by modifying at most r−q elements and let \(\mathcal{R}^{*}\) and \(\theta^{*}=(\mathbf{p}^{*},\mathbf{m}^{*},\mathbf{V}^{*})\) be the optimal partition and optimal parameters for M, i.e. \(\mathcal{R}_{\theta^{*}}=\mathcal{R}^{*}\) and \(\theta_{\mathcal{R}^{*}}=\theta^{*}\).
(α) \(M_{n}^{*}\) and \(m_{n}^{*}\) are bounded and bounded away from zero by constants that depend only on \(X_{n}\).
Since \(R^{*}\) contains at most r−q replacements, it contains at least q≥gd+1 original observations. The proof concludes as that of Proposition 1(a).
(β) If \(R_{j}^{*}\) contains some original observations, then \(m_{j}^{*}\) is bounded by a constant that depends only on \(X_{n}\).
Let \(x\in R_{j}^{*}\cap X_{n}\). We have \(\operatorname{tr}(S_{j}^{*}) \geq \frac{1}{|R_{j}^{*}|}\|x-m_{j}^{*}\|^{2}\). Following the argument in the proof of Proposition 1(b), we show that \(\|x-m_{j}^{*}\|^{2}\leq d r^{2} M_{n}^{*} \log 2 \) and the claim follows from the previous step.
(δ) If there exists j∈{1,…,g} such that \(0<p_{j}^{*}<1\), then
$$ L\bigl(X_{\mathcal{R}^*}|\mathbf{p}^*,\mathbf{m}^*,\mathbf{V}^* \bigr)< c_{d,r}+\frac{d r}{2}\log c -\frac{r}{2}\log \biggl( \frac{\operatorname{tr}(S_{\mathcal{R}^*})}{d} \biggr)^d. $$
(6)
By Lemma 1(a), and since \(M^{*}_{n}\leq c m_{n}^{*}\) due to (ER), we have
(ϵ) \(R^{*}\) contains no modification with a sufficiently large norm.
The reasoning before (12) of Gallegos and Ritter (2009b) still holds in our case. This, together with (3), implies that
$$\biggl(\frac{\operatorname{tr}(W_{\mathcal{R}^*})}{d} \biggr)^d\geq \biggl(\frac{\operatorname{tr}(W_\mathcal{T})}{d} \biggr)^d\geq g^2\max_{T\subseteq R\in \bigl(\substack{X_n\\r} \bigr)}\det(c W_{\mathcal{P}\cap R})$$
and then,
$$ \max_{T\subseteq R\in \bigl(\substack{X_n\\r} \bigr)} \log\det(S_{\mathcal{P}\cap R})\leq -2\log g -d\log c +\log \biggl(\frac{\operatorname{tr}(S_{\mathcal{R}^*})}{d} \biggr)^d. $$
(7)
With their notation \(\bar{x}_{\mathcal{P}\cap R}\) and for all R⊆X n of cardinality r, we have
Then,
We can use (7) and (6) to obtain
which contradicts the optimality of \(\mathcal{R}^{*}\) and \((\mathbf{p}^{*},\mathbf{m}^{*},\mathbf{V}^{*})\). Hence \(R^{*}\) contains no replacement with a large norm, and step (β) shows that the means remain bounded.
□
Proof of Lemma 3
The first half of the proof follows that of Lemma 5 of Gallegos and Ritter (2009b). Then, we can use the linearity of the trace operator to obtain
as we can also use Lemma A1 of Gallegos and Ritter (2009b). Then
and the separation condition (4) leads to the conclusion. □
Proof of Proposition 4
(a) Direct from Proposition 3 and Lemma 3.
(b) Follows the proof of Theorem 3(b) of Gallegos and Ritter (2009b), combined with Lemma 2.
(c) Follows proof of Theorem 3(c) of Gallegos and Ritter (2009b). □
Ruwet, C., García-Escudero, L.A., Gordaliza, A. et al. On the breakdown behavior of the TCLUST clustering procedure. TEST 22, 466–487 (2013). https://doi.org/10.1007/s11749-012-0312-4