Approximating symmetrized estimators of scatter via balanced incomplete U-statistics

Dümbgen, Lutz; Nordhausen, Klaus

doi:10.1007/s10463-023-00879-1

Published: 08 August 2023

Volume 76, pages 185–207, (2024)

Annals of the Institute of Statistical Mathematics

Lutz Dümbgen¹ &
Klaus Nordhausen²

Abstract

We derive limiting distributions of symmetrized estimators of scatter. Instead of considering all $n(n-1)/2$ pairs of the $n$ observations, we only use $nd$ suitably chosen pairs, where $d \ge 1$ is substantially smaller than $n$. It turns out that the resulting estimators are asymptotically equivalent to the original one whenever $d = d(n) \rightarrow \infty$ at arbitrarily slow speed. We also investigate the asymptotic properties for arbitrary fixed $d$. These considerations and numerical examples indicate that for practical purposes, moderate fixed values of $d$ between 10 and 20 already yield estimators which are computationally feasible and rather close to the original ones.

References

Barbour, A. D., Chen, L. H. Y. (eds.) (2005). An introduction to Stein’s method, Lecture Notes Series, Vol. 4, Institute for Mathematical Sciences, National University of Singapore, Singapore University Press.
Blom, G. (1976). Some properties of incomplete $U$-statistics. Biometrika, 63, 573–580.
Brown, B. M., Kildea, D. G. (1978). Reduced $U$-statistics and the Hodges–Lehmann estimator. The Annals of Statistics, 6, 828–835.
Dudley, R. M., Sidenko, S., Wang, Z. (2009). Differentiability of $t$-functionals of location and scatter. The Annals of Statistics, 37, 939–960.
Dümbgen, L. (1998). On Tyler’s $M$-functional of scatter in high dimension. Annals of the Institute of Statistical Mathematics, 50, 471–491.
Dümbgen, L., Nordhausen, K., Schuhmacher, H. (2014). fastM: Fast computation of multivariate M-estimators. R package, https://cran.r-project.org/web/packages/fastM
Dümbgen, L., Pauly, M., Schweizer, T. (2015). M-functionals of multivariate scatter. Statistics Surveys, 9, 32–105.
Dümbgen, L., Nordhausen, K., Schuhmacher, H. (2016). New algorithms for M-estimation of multivariate scatter and location. Journal of Multivariate Analysis, 144, 200–217.
Feller, W. (1945). The fundamental limit theorems in probability. Bulletin of the American Mathematical Society, 51, 800–832.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19, 293–325.
Hoeffding, W. (1951). A combinatorial Central Limit Theorem. The Annals of Mathematical Statistics, 22, 558–566.
Kent, J. T., Tyler, D. E. (1991). Redescending $M$-estimates of multivariate location and scatter. The Annals of Statistics, 19, 2102–2119.
Lee, A. J. (1990). U-statistics—theory and practice (Vol. 110). New York: Marcel Dekker, Inc.
Miettinen, J., Nordhausen, K., Taskinen, S., Tyler, D. E. (2016). On the computation of symmetrized $M$-estimators of scatter. In C. Agostinelli, A. Basu, P. Filzmoser, D. Mukherjee (Eds.), Recent Advances in Robust Statistics: Theory and Applications (pp. 151–167). India: Springer.
Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10, 1–141.
Nordhausen, K., Tyler, D. E. (2015). A cautionary note on robust covariance plug-in methods. Biometrika, 102, 573–588.
Nordhausen, K., Oja, H., Ollila, E. (2008). Robust independent component analysis based on two scatter matrices. Austrian Journal of Statistics, 37, 91–100.
Paindaveine, D. (2008). A canonical definition of shape. Statistics and Probability Letters, 78, 2240–2247.
Serfling, R. J. (1980). Approximation theorems of mathematical statistics. Wiley series in probability and mathematical statistics, New York: Wiley.
Sirkia, S., Taskinen, S., Oja, H. (2007). Symmetrised M-estimators of multivariate scatter. Journal of Multivariate Analysis, 98, 1611–1629.
Stein, C. (1986). Approximate computation of expectations, Institute of mathematical statistics lecture notes—monograph series, Vol. 7, Hayward, CA: Institute of Mathematical Statistics.
Tyler, D. E. (1987). A distribution-free $M$-estimator of multivariate scatter. The Annals of Statistics, 15, 234–251.
Tyler, D. E., Critchley, F., Dümbgen, L., Oja, H. (2009). Invariant coordinate selection (with discussion). Journal of the Royal Statistical Society, Series B: Statistical Methodology, 71, 549–592.

Acknowledgements

We thank Sara Taskinen for stimulating discussions. Constructive comments of three referees are gratefully acknowledged.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, University of Bern, Alpeneggstrasse 22, CH-3012, Bern, Switzerland
Lutz Dümbgen
Department of Mathematics and Statistics, University of Jyväskylä, PO Box 35, FI-40014, Jyväskylä, Finland
Klaus Nordhausen

Corresponding author

Correspondence to Lutz Dümbgen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work of L. Dümbgen was supported by the Swiss National Science Foundation.

Appendix A: Auxiliary results

A.1 A particular coupling of random permutations

Preparations. For an integer $n \ge 1$, let ${\mathcal {S}}_n$ be the set of all permutations of $\langle n\rangle := \{1,2,\ldots ,n\}$. A cycle in ${\mathcal {S}}_n$ is a permutation $\sigma \in {\mathcal {S}}_n$ such that for $m \ge 1$ pairwise different points $a_1,\ldots ,a_m \in \langle n\rangle$,

$$\begin{aligned} a_1 \ \mapsto \ a_2 \mapsto \ \cdots \ \mapsto \ a_m \ \mapsto \ a_1, \end{aligned}$$

while $\sigma (i) = i$ for $i \in \langle n\rangle {\setminus } \{a_1,\ldots ,a_m\}$. (In the case of $m = 1$, $\sigma (i) = i$ for all $i \in \langle n\rangle$.) We write

$$\begin{aligned} \sigma \ = \ (a_1,\ldots ,a_m)_{\textrm{c}} \end{aligned}$$

for this mapping and note that it has $m$ equivalent representations

$$\begin{aligned} \sigma \ = \ (a_1,\ldots ,a_m)_{\textrm{c}} \ = \ (a_2,\ldots ,a_m,a_1)_{\textrm{c}} \ = \ \cdots \ = \ (a_m,a_1,\ldots ,a_{m-1})_{\textrm{c}}. \end{aligned}$$

Any permutation $\sigma \in {\mathcal {S}}_n$ can be written as

$$\begin{aligned} \sigma \ = \ (a_{11},\ldots ,a_{1m(1)})_{\textrm{c}} \circ \cdots \circ (a_{k1},\ldots ,a_{km(k)})_{\textrm{c}}, \end{aligned}$$

where the sets $\{a_{j1},\ldots ,a_{jm(j)}\}$, $1 \le j \le k$, form a partition of $\langle n\rangle$. Note that the cycles $(a_{j1},\ldots ,a_{jm(j)})_{\textrm{c}}$, $1\le j\le k$, commute. This representation of $\sigma$ as a combination of cycles is unique if we require, for instance, that

$$\begin{aligned} a_{jm(j)} \ = \ \min \{a_{j1},\ldots ,a_{jm(j)}\} \quad \text {for} \ 1 \le j \le k \end{aligned}$$

and

$$\begin{aligned} a_{1m(1)}< \cdots < a_{km(k)}. \end{aligned}$$
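The standardized representation can be computed by scanning the points in increasing order: the smallest point not yet visited is the minimum of a new cycle and is written last. The following is a minimal sketch, assuming a permutation is stored as a Python dict mapping $i$ to $\sigma (i)$; the function name is illustrative only.

```python
def standardized_cycles(sigma):
    """Cycles of sigma (dict i -> sigma(i) on 1..n), each written with its
    minimum in the last position, ordered by increasing minimum."""
    n = len(sigma)
    seen = set()
    cycles = []
    for start in range(1, n + 1):
        if start in seen:
            continue
        # 'start' is the smallest unvisited point, hence the minimum of its cycle.
        cycle = []
        i = sigma[start]
        while i != start:
            cycle.append(i)
            seen.add(i)
            i = sigma[i]
        cycle.append(start)  # minimum goes last: (a_1, ..., a_m) with a_m = min
        seen.add(start)
        cycles.append(tuple(cycle))
    return cycles


# 1 -> 3 -> 4 -> 1, 2 fixed, 5 <-> 6
example = {1: 3, 2: 2, 3: 4, 4: 1, 5: 6, 6: 5}
print(standardized_cycles(example))  # [(3, 4, 1), (2,), (6, 5)]
```

Here $(3,4,1)_{\textrm{c}}$ encodes $3 \mapsto 4 \mapsto 1 \mapsto 3$, and the last entries $1< 2 < 5$ are the increasing cycle minima.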

In what follows, let ${\mathcal {S}}_n^*$ be the set of all permutations $\sigma \in {\mathcal {S}}_n$ consisting of just one cycle, i.e.,

$$\begin{aligned} \sigma \ = \ (a_1,a_2,\ldots ,a_n)_{\textrm{c}} \end{aligned}$$

with pairwise different numbers $a_1, a_2, \ldots , a_n \in \langle n\rangle$.

The coupling. The standardized cycle representation of $\sigma \in {\mathcal {S}}_n$ gives rise to a particular mapping ${\mathcal {S}}_n \ni \pi \mapsto (\sigma ,\sigma ^*) \in {\mathcal {S}}_n \times {\mathcal {S}}_n^*$ such that $\pi \mapsto \sigma$ is bijective. For fixed $\pi \in {\mathcal {S}}_n$ and any index $i \in \langle n\rangle$, let

$$\begin{aligned} M_i \,= \ \langle n\rangle \setminus \{\pi (s): 1 \le s < i\}, \end{aligned}$$

i.e., $\langle n\rangle = M_1 \supset M_2 \supset \cdots \supset M_n = \{\pi (n)\}$, and $\# M_i = n+1-i$. Let $1 \le t_1< t_2< \cdots < t_k = n$ be those indices $i$ such that $\pi (i) = \min (M_i)$. Then,

$$\begin{aligned} \sigma \,= \ \left( \pi (1), \ldots , \pi (t_1) \right) _{\textrm{c}} \circ \left( \pi (t_1+1),\ldots ,\pi (t_2) \right) _{\textrm{c}} \circ \cdots \circ \left( \pi (t_{k-1}+1), \ldots , \pi (t_k) \right) _{\textrm{c}} \end{aligned}$$

defines a permutation of $\langle n\rangle$ with standardized cycle representation. This is essentially the construction used by Feller (1945) to investigate the number of cycles of a random permutation. Moreover,

$$\begin{aligned} \sigma ^* \,= \ \left( \pi (1), \pi (2), \ldots , \pi (n) \right) _{\textrm{c}} \end{aligned}$$

defines a permutation in ${\mathcal {S}}_n^*$ such that

$$\begin{aligned} \left\{ i \in \langle n\rangle : \sigma (i) \ne \sigma ^*(i) \right\} \ = \ {\left\{ \begin{array}{ll} \emptyset &{} \text {if} \ k = 1, \\ \{\pi (t_1),\ldots ,\pi (t_k)\} &{} \text {if} \ k \ge 2. \end{array}\right. } \end{aligned}$$
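The construction is easy to trace on small examples. The sketch below is an illustration under the assumption that $\pi$ is given as a tuple $(\pi (1),\ldots ,\pi (n))$ and that permutations are returned as dicts; the function name `coupling` is hypothetical. It builds $\sigma$, $\sigma ^*$ and the indices $t_1< \cdots < t_k$, and then checks for $n = 4$ that $\pi \mapsto \sigma$ hits every permutation exactly once and that $\sigma$ and $\sigma ^*$ differ at no point if $k = 1$ and at exactly $k$ points otherwise.

```python
from itertools import permutations


def coupling(pi):
    """Return (sigma, sigma_star, t) for pi = (pi(1), ..., pi(n))."""
    n = len(pi)
    remaining = set(range(1, n + 1))  # the set M_i before pi(i) is removed
    sigma, sigma_star, t = {}, {}, []
    start = 0  # 0-based position where the current cycle begins
    for idx, value in enumerate(pi):
        if value == min(remaining):  # i = idx + 1 is one of the t_j
            t.append(idx + 1)
            block = pi[start:idx + 1]
            # Close the cycle (pi(t_{j-1}+1), ..., pi(t_j))_c.
            for a, b in zip(block, block[1:] + block[:1]):
                sigma[a] = b
            start = idx + 1
        remaining.remove(value)
    # sigma_star is the single cycle (pi(1), ..., pi(n))_c.
    for a, b in zip(pi, pi[1:] + pi[:1]):
        sigma_star[a] = b
    return sigma, sigma_star, t


n = 4
images = set()
counts_ok = True
for pi in permutations(range(1, n + 1)):
    sigma, sigma_star, t = coupling(pi)
    images.add(tuple(sigma[i] for i in range(1, n + 1)))
    mismatches = sum(sigma[i] != sigma_star[i] for i in range(1, n + 1))
    counts_ok = counts_ok and mismatches == (0 if len(t) == 1 else len(t))
print(len(images), counts_ok)  # 24 True
```

For instance, $\pi = (2,1,3)$ gives $t_1 = 2$, $t_2 = 3$, $\sigma = (2,1)_{\textrm{c}} \circ (3)_{\textrm{c}}$ and $\sigma ^* = (2,1,3)_{\textrm{c}}$, which differ at two points.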

Suppose that $\pi$ is a random permutation with uniform distribution on ${\mathcal {S}}_n$. Then, $\sigma$ is a random permutation with uniform distribution on ${\mathcal {S}}_n$ too, because $\pi \mapsto \sigma$ is a bijection. Since the conditional distribution of $\pi (i)$, given $(\pi (s))_{1 \le s < i}$, is the uniform distribution on $M_i$, the random variables

$$\begin{aligned} Y_i \,= \ 1_{[\pi (i) = \min (M_i)]}, \quad i \in \langle n\rangle , \end{aligned}$$

are stochastically independent Bernoulli random variables with $\textrm{I}\!\textrm{P}(Y_i = 1) = (n+1-i)^{-1} = 1 - \textrm{I}\!\textrm{P}(Y_i = 0)$. Consequently,

$$\begin{aligned} \textrm{I}\!\textrm{E}\left( \# \left\{ i \in \langle n\rangle : \sigma (i) \ne \sigma ^*(i) \right\} \right) \ \le \ \sum _{i=1}^n (n+1-i)^{-1} \ = \ 1 + \sum _{j=2}^n j^{-1} \ \le \ 1 + \log (n), \end{aligned}$$

because $j^{-1} \le \int _{j-1}^j x^{-1} \, dx = \log (j) - \log (j-1)$ for $2 \le j \le n$.
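For small $n$ the expectation can be computed exactly by enumeration. The sketch below relies only on the facts stated above: $\sigma$ is uniformly distributed on ${\mathcal {S}}_n$, $k$ is its number of cycles, and the number of mismatches equals $k$ if $k \ge 2$ and $0$ otherwise. The function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import log


def cycle_count(perm):
    """Number of cycles of perm, a tuple with perm[i-1] the image of i."""
    seen, count = set(), 0
    for start in range(1, len(perm) + 1):
        if start not in seen:
            count += 1
            i = start
            while i not in seen:
                seen.add(i)
                i = perm[i - 1]
    return count


def expected_mismatches(n):
    """Exact expectation of #{i : sigma(i) != sigma*(i)} for uniform sigma."""
    total, perms = Fraction(0), 0
    for perm in permutations(range(1, n + 1)):
        k = cycle_count(perm)
        total += k if k >= 2 else 0
        perms += 1
    return total / perms


def harmonic(n):
    return sum(Fraction(1, j) for j in range(1, n + 1))


for n in range(1, 7):
    exact = expected_mismatches(n)
    print(n, exact, float(harmonic(n)), 1 + log(n))
```

In these cases the exact value equals $\sum _{j=1}^n j^{-1} - n^{-1}$, since $k = 1$ occurs with probability $1/n$, so the bound $1 + \log (n)$ is slightly conservative.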

A.2 Some inequalities related to Lindeberg-type conditions

In connection with Gaussian approximations and Stein’s method (see Stein 1986 or Barbour and Chen 2005), the quantity

$$\begin{aligned} L(X) \,= \ \textrm{I}\!\textrm{E}\left( X^2 \min (|X|,1) \right) \end{aligned}$$

for a square-integrable random variable $X$ plays an important role. Elementary considerations show that

$$\begin{aligned} h(x) \ \le \ x^2 \min (|x|,1) \ \le \ \sqrt{2} \, h(x) \quad \text {with}\quad h(x) \,= \ \frac{|x|^3}{\sqrt{1 + x^2}} \end{aligned}$$

for arbitrary $x \in {\mathbb {R}}$. Moreover, $h: {\mathbb {R}}\rightarrow [0,\infty )$ is an even, convex function such that $h(2x) \le 8 h(x)$. Consequently, for arbitrary $x,y \in {\mathbb {R}}$, Jensen’s inequality implies that

$$\begin{aligned} (x + y)^2 \min (|x + y|, 1)&\le \ \sqrt{2} \, h(x + y) \\&\le \ 2^{-1/2} \left( h(2x) + h(2y) \right) \\&\le \ \sqrt{32} \, h(x) + \sqrt{32} \, h(y) \\&\le \ 6 x^2 \min (|x|, 1) + 6 y^2 \min (|y|,1) \ \le \ 6 x^2 \min (|x|,1) + 6 y^2 . \end{aligned}$$
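These elementary inequalities can be spot-checked numerically. The following sketch evaluates them on a grid of points in $[-5,5]$; the grid and the floating-point tolerance are arbitrary choices made for illustration, and a grid check is of course no substitute for the proof.

```python
from math import sqrt


def g(x):
    """x^2 * min(|x|, 1), the integrand in L(X)."""
    return x * x * min(abs(x), 1.0)


def h(x):
    """The convex comparison function |x|^3 / sqrt(1 + x^2)."""
    return abs(x) ** 3 / sqrt(1.0 + x * x)


grid = [i / 10.0 for i in range(-50, 51)]
tol = 1e-12  # slack for floating-point rounding

# h <= g <= sqrt(2) * h
sandwich_ok = all(h(x) <= g(x) + tol and g(x) <= sqrt(2) * h(x) + tol
                  for x in grid)
# h(2x) <= 8 h(x)
doubling_ok = all(h(2 * x) <= 8 * h(x) + tol for x in grid)
# g(x + y) <= 6 g(x) + 6 g(y) <= 6 g(x) + 6 y^2
sum_ok = all(g(x + y) <= 6 * g(x) + 6 * g(y) + tol
             and g(y) <= y * y + tol
             for x in grid for y in grid)

print(sandwich_ok, doubling_ok, sum_ok)
```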

For a symmetric matrix $A \in {\mathbb {R}}^{n\times n}$, we define its row means $\bar{A}_i:= n^{-1} \sum _{j=1}^n A_{ij}$ and its overall mean $\bar{A}:= n^{-2} \sum _{i,j=1}^n A_{ij}$. Let $\tilde{A}:= (A_{ij} - \bar{A}_i - \bar{A}_j + \bar{A})_{i,j=1}^n$. Then, elementary calculations and the previous inequalities reveal that

$$\begin{aligned} 0 \ \le \ n^{-1} \sum _{i,j=1}^n A_{ij}^2 - n^{-1} \sum _{i,j=1}^n \tilde{A}_{ij}^2 \ \le \ 2 \sum _{i=1}^n \bar{A}_i^2 \end{aligned}$$

and

$$\begin{aligned} n^{-1} \sum _{i,j=1}^n \tilde{A}_{ij}^2 \min (|\tilde{A}_{ij}|,1) \ \le \ 6 n^{-1} \sum _{i,j=1}^n A_{ij}^2 \min (|A_{ij}|,1) + 12 \sum _{i=1}^n \bar{A}_i^2. \end{aligned}$$
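Both bounds can be checked on simulated symmetric matrices. The sketch below uses plain Python lists; the matrix size, the Gaussian entries, the scales and the tolerance are assumptions made for illustration, and the helper names are hypothetical.

```python
import random


def g(x):
    return x * x * min(abs(x), 1.0)


def double_center(A):
    """Return (A_tilde, row_means) with A_tilde[i][j] = A_ij - Abar_i - Abar_j + Abar."""
    n = len(A)
    row_means = [sum(row) / n for row in A]
    overall = sum(row_means) / n
    A_tilde = [[A[i][j] - row_means[i] - row_means[j] + overall
                for j in range(n)] for i in range(n)]
    return A_tilde, row_means


def check_bounds(A, tol=1e-9):
    """Evaluate the two displayed inequalities for a symmetric matrix A."""
    n = len(A)
    A_tilde, row_means = double_center(A)
    row_term = sum(m * m for m in row_means)
    mean_sq = lambda M: sum(v * v for row in M for v in row) / n
    mean_g = lambda M: sum(g(v) for row in M for v in row) / n
    gap = mean_sq(A) - mean_sq(A_tilde)
    first = -tol <= gap <= 2 * row_term + tol
    second = mean_g(A_tilde) <= 6 * mean_g(A) + 12 * row_term + tol
    return first, second


def random_symmetric(n, scale, rng):
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            A[i][j] = A[j][i] = rng.gauss(0.0, scale)
    return A


rng = random.Random(0)
results = [check_bounds(random_symmetric(6, s, rng)) for s in (0.1, 1.0, 10.0)]
print(results)
```

The three scales are chosen so that the entries fall mostly below, around, and above the threshold $1$ in $\min (|A_{ij}|,1)$.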

About this article

Cite this article

Dümbgen, L., Nordhausen, K. Approximating symmetrized estimators of scatter via balanced incomplete U-statistics. Ann Inst Stat Math 76, 185–207 (2024). https://doi.org/10.1007/s10463-023-00879-1

Received: 06 February 2023
Revised: 08 June 2023
Accepted: 03 July 2023
Published: 08 August 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s10463-023-00879-1
