1 Introduction

Covariance matrix estimation is a fundamental topic in multivariate statistical analysis. Traditionally, the sample covariance matrix is a convenient and efficient estimator when the sample size n is much larger than the dimension p. However, in recent years, more and more high-dimensional datasets with small n and large p have appeared in various applications. For instance, investors track thousands of assets in financial markets, but there are only hundreds of daily trading observations per year (Bodnar et al., 2018). For cancer diagnosis with genetic data, thousands of gene expressions can be measured simultaneously using microarray techniques, but patient cases are often rare and limited (Best et al., 2015). It is well known that the sample covariance matrix is singular when \(p>n\), whereas a valid covariance matrix must be positive-definite. This fatal flaw hampers the application of the sample covariance matrix in high-dimensional multivariate statistical analyses, including discriminant analysis and regression models. Furthermore, Johnstone (2001) showed that the sample covariance matrix distorts the eigen-structure of the population covariance matrix and is ill-conditioned when p is large. In short, the sample covariance matrix is a poor estimator in high-dimensional settings.

Although its performance is poor as a whole (Fan et al., 2016), each entry of the sample covariance matrix is still an efficient estimator of the pairwise covariance between variables. This motivates the design of a modified version that retains efficient estimation of pairwise covariances while avoiding the drawbacks. Ledoit and Wolf (2004) proposed a shrinkage method that takes a weighted linear combination of the sample covariance matrix and the identity matrix. The resulting matrix is positive-definite, invertible, and preserves the eigenvector structure. There is an existing literature on how to choose the optimal weighting parameter to obtain better asymptotic properties (Ledoit and Wolf, 2004; Mestre and Lagunas, 2005; Mestre, 2008). However, the shrinkage operation leads to a biased estimator in finite samples. If the covariance matrix is sparse, thresholding is perhaps the most intuitive idea in high-dimensional analyses. Bickel and Levina (2008) applied hard thresholding to the sample covariance matrix and showed its asymptotic consistency. Subsequently, other generalized thresholding rules were proposed, such as banding (Bickel and Levina, 2008; Wu and Pourahmadi, 2009), soft thresholding (Rothman et al., 2009), and adaptive thresholding (Cai and Liu, 2011). For further theoretical results, Cai et al. (2010) derived the optimal rate of convergence for estimating the true covariance matrix, and Cai and Zhou (2012) explored the operator norm, Frobenius norm, and \(L_1\) norm of the estimator and its inverse. Thresholding is an efficient way to obtain a sparse estimator, but it is hard to ensure positive-definiteness in finite samples. In fact, Guillot and Rajaratnam (2012) showed that a thresholded matrix can lose positive-definiteness quite easily. Fan et al. (2016) also demonstrated that the thresholding method sacrifices a great deal of the entries, and hence of the information, in the sample covariance matrix to attain positive-definiteness.

From the perspective of random matrix theory, Marzetta et al. (2011) constructed a positive-definite estimator by random dimension reduction. Tucci and Wang (2019) considered a random unitary matrix with Haar measure as an alternative random operator. In this paper, inspired by work in random matrix theory and some practical considerations, we modify the sample correlation matrix using the Bagging technique. Bagging (Bootstrap Aggregating), proposed by Breiman (1996), is an ensemble algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical inference. Surprisingly, we find that the Bagging technique can yield a positive-definite estimate when \(p>n\). Through a resampling procedure, the Bagging technique can “create” more linearly independent data and thus transform the problem into the traditional setting where n/p is large. This paper contributes to the field in the following aspects: (a) we propose a new high-dimensional correlation matrix estimator for general continuous data; (b) we prove theoretically that the Bagging estimator is positive-definite with probability one in finite samples and is consistent when p is fixed; (c) we demonstrate that the Bagging estimator is competitive with existing approaches through a large number of simulation studies in various scenarios and a real application.

This paper is organized as follows: Sect. 2 proposes the Bagging estimator. Section 3 proves some relevant theoretical results. Section 4 compares our method with existing approaches through simulation studies in various scenarios, and Sect. 5 provides a real application. Section 6 concludes the paper.

2 Bagging estimator

For a given training set D of size n, the Bagging technique first generates m new training sets \(d_1,\cdots ,d_m\), each of size n, by sampling from D uniformly with replacement. This step is called bootstrap sampling. A model is then fitted to each of the m bootstrap resampling sets separately to produce estimates \(h_1,\cdots ,h_m\). The individual estimates \(h_1,\cdots ,h_m\) are finally combined by averaging or voting to generate the final estimate \(h^{\text {Bag}}\). The procedure of the Bagging algorithm is illustrated in Fig. 1.

Fig. 1 The procedure of the Bagging algorithm
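To make the generic procedure concrete, here is a minimal R sketch under the assumption that `fit` is any estimator mapping a data matrix to a numeric estimate; the function and argument names are ours, not the paper's.

```r
# Generic Bagging (sketch): resample the rows of D with replacement m times,
# apply the estimator `fit` to each resample, and average the results.
bagging <- function(D, fit, m = 100) {
  n <- nrow(D)
  estimates <- lapply(seq_len(m), function(t) {
    idx <- sample(n, n, replace = TRUE)   # bootstrap sampling: d_t
    fit(D[idx, , drop = FALSE])           # individual estimate h_t
  })
  Reduce(`+`, estimates) / m              # combine by averaging: h^Bag
}
```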

Generally, Bagging can improve the stability and accuracy of almost every regression and classification algorithm (Breiman, 1996). In this paper, we use the Bagging technique to modify the sample correlation matrix.

Let \(\mathbf{X} =(X_{ij})_{n\times p}\) be the observed dataset. \(X_{ij}\) denotes the i-th observation for the j-th variable where \(i=1,\cdots ,n\) and \(j=1,\cdots ,p\). Assume row vectors \(\varvec{X}_i=(X_{i1},\cdots ,X_{ip})\) are i.i.d. for \(i=1,\cdots ,n\), and follow a continuous and irreducible p-dimensional distribution with mean \(\varvec{\mu }\) and positive-definite covariance matrix \(\varvec{\varSigma }\), e.g., \(\varvec{X}_i\sim N_p(\varvec{\mu },\varvec{\varSigma })\). Here an irreducible p-dimensional distribution denotes a p-dimensional distribution where the p components are irreducible (see Definition 5 for details). We are interested in estimating the \(p\times p\) covariance matrix \(\varvec{\varSigma }=(\sigma _{ij})_{p\times p}\) for fixed p and finite sample size n when \(p>n\). The sample covariance matrix is defined as

$$\begin{aligned} \mathbf{S} =\frac{1}{n-1}(\mathbf{X} -\bar{\mathbf{X }})'(\mathbf{X} -\bar{\mathbf{X }}), \end{aligned}$$

where \(\bar{\mathbf{X }}=\mathbf{1} _{n\times 1}\cdot (\frac{1}{n}\sum _{i=1}^n\varvec{X}_i)\) is the \(n\times p\) matrix each of whose rows equals the sample mean vector.

According to the variance-correlation decomposition, \(\varvec{\varSigma }=\mathbf{D} \varvec{\varLambda } \mathbf{D} \), where \(\mathbf{D} \) is the diagonal matrix of standard deviations and \(\varvec{\varLambda }\) is the correlation matrix with diagonal elements equal to 1. Thus, we may estimate \(\mathbf{D} \) and \(\varvec{\varLambda }\) separately (Barnard et al., 2000). If \(\mathbf{D} \) is estimated by the sample variance, i.e., \(\hat{\mathbf{D }}=\text {diag}(\mathbf{S} )^{1/2}\), then the problem becomes to estimate the correlation matrix \(\varvec{\varLambda }\). The corresponding sample version is defined as follows:

Definition 1

(Sample Correlation Matrix) Let \( \mathbf{Y} =(Y_{ij})_{n\times p}\) be the matrix normalized from the original dataset \( \mathbf{X} \) by columns, i.e., \(Y_{ij}=(X_{ij}-{\hat{\mu }}_{j})/{{\hat{\sigma }}}_{j}\) where \({{\hat{\mu }}}_j=\frac{1}{n}\sum _{i=1}^nX_{ij}\) and \({\hat{\sigma }}_j^2=\frac{1}{n-1}\sum _{i=1}^n(X_{ij}-{\hat{\mu }}_j)^2\). Then, the sample correlation matrix \( \mathbf{R} \) is defined as

$$\begin{aligned} \mathbf{R} =\frac{1}{n-1}\mathbf{Y} '\mathbf{Y} . \end{aligned}$$

Note that \(\text {rank}(\mathbf{R} )=n-1\) almost surely; thus \(\mathbf{R} \) is still singular when \(p>n\) and hence not a valid estimator of \(\varvec{\varLambda }\). Therefore, a modification of \(\mathbf{R} \) is necessary.
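A small R sketch (ours, with simulated Gaussian data as an assumed example) illustrates this rank deficiency:

```r
# The sample correlation matrix is rank-deficient when p > n (illustration).
set.seed(1)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), n, p)   # n i.i.d. observations of p variables
R <- cor(X)                       # p x p sample correlation matrix
qr(R)$rank                        # n - 1 = 19, far below p = 50
min(eigen(R, symmetric = TRUE, only.values = TRUE)$values)  # ~ 0: singular
```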

Definition 2

(Bagging Estimator) For a given dataset \(\mathcal {L}=\{\varvec{X}_1,\cdots , \varvec{X}_n\}\), consider a simple resample of n observations drawn with replacement, e.g., \(\mathcal {L}^{(t)}=\{\varvec{X}^{(t)}_1,\cdots ,\varvec{X}^{(t)}_n\}\). Using these resampled data, construct the matrix \( \mathbf{X} ^{(t)}\), from which a sample correlation matrix \( \mathbf{R} ^{(t)}\) is formed. Repeat this process independently T times. Then, the Bagging estimator is defined as \(\mathbf{R} ^{\text {Bag}}=\frac{1}{T}\sum _{t=1}^T\mathbf{R} ^{(t)}\).

The Bagging procedure is summarized in detail in Algorithm 1. The complete algorithm is simple, easy to implement, and requires few assumptions. Common assumptions, such as Gaussian data or a sparse covariance matrix, are unnecessary in our algorithm. Compared with approaches that rely on these assumptions, our Bagging estimator is more flexible for general continuous data.

Algorithm 1
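The exact implementation is given in the supplementary materials; the following is only a minimal R sketch of Definition 2 and Algorithm 1, with the function name `bagging_cor` and the default number of replications chosen by us.

```r
# Bagging estimator of the correlation matrix (Definition 2): average the
# sample correlation matrices of T bootstrap resamples of the rows of X.
bagging_cor <- function(X, T = 100) {
  n <- nrow(X); p <- ncol(X)
  R_bag <- matrix(0, p, p)
  for (t in seq_len(T)) {
    idx <- sample(n, n, replace = TRUE)              # bootstrap sampling of observations
    R_bag <- R_bag + cor(X[idx, , drop = FALSE])     # accumulate R^(t)
  }
  R_bag / T                                          # R^Bag = (1/T) * sum_t R^(t)
}
```

The only tuning parameter is the number of resampling replications T; its choice is discussed in Sect. 4.1.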

3 Theoretical properties

3.1 Positive-definiteness

A valid correlation matrix estimator must be positive-definite. As we shall show, our new estimator \(\mathbf{R} ^{\text {Bag}}\) is positive-definite with probability one in finite samples, although each \(\mathbf{R} ^{(t)}\) is still singular. It should be noted that this “magic” operation works only for the sample correlation matrix \(\mathbf{R} \), not for the sample covariance matrix \(\mathbf{S} \). This may partially explain why such a simple procedure has not been explored until now.

For \(\mathbf{R} ^{\text {Bag}}\), we have the following decomposition,

$$\begin{aligned} \begin{aligned} \mathbf{R} ^{\text {Bag}}&=\frac{1}{T}\sum _{t=1}^T\mathbf{R} ^{(t)} =\frac{1}{(n-1)T}\sum _{t=1}^T\mathbf{Y ^{(t)}}'{} \mathbf{Y} ^{(t)}=\frac{1}{(n-1)T}{} \mathbf{Z} '{} \mathbf{Z} , \end{aligned} \end{aligned}$$
(1)

where \(\mathbf{Y} ^{(t)}=(Y_{ij}^{(t)})_{n\times p}\) is the matrix normalized from the resampled dataset \(\mathbf{X} ^{(t)}\) by columns, i.e., \(Y_{ij}^{(t)}=(X_{ij}^{(t)}-{\hat{\mu }}_j^{(t)})/{\hat{\sigma }}_j^{(t)}\), where \({\hat{\mu }}_j^{(t)}=\frac{1}{n}\sum _{i=1}^nX_{ij}^{(t)}\) and \(({\hat{\sigma }}_j^{(t)})^2=\frac{1}{n-1}\sum _{i=1}^n(X_{ij}^{(t)}-{\hat{\mu }}_j^{(t)})^2\). Here

$$\begin{aligned} \begin{aligned} \mathbf{Z} =\left( \begin{array}{c} \mathbf{Y} ^{(1)} \\ \mathbf{Y} ^{(2)} \\ \vdots \\ \mathbf{Y} ^{(T)} \\ \end{array} \right) _{nT\times p} \end{aligned} \end{aligned}$$

is a random matrix, which contains all resampled observations.

According to Equation (1), it is sufficient to show that \(\mathrm {Pr}(\text {rank}(\mathbf{Z} )=p)=1\) for large T. First, we clarify several definitions regarding random variables for convenience.

Definition 3

(Continuous) A random variable X is said to be continuous if \(\mathrm {Pr}(X\in B)=0\) for any finite or countable set B of points of the real line.

Definition 4

(Irreducible) Let W be a continuous random variable. Given random variables \(U_1,\cdots ,U_n\), if \(W|U_1,\cdots ,U_n\) is still a continuous random variable, W is said to be irreducible given \(U_1,\cdots ,U_n\).

Definition 5

For continuous random variables \(U_1,\cdots ,U_n\), if every \(U_i\) is irreducible given the remaining random variables, we say \(U_1,\cdots ,U_n\) are irreducible.

Corollary 1

Let W be a continuous random variable. If W is independent of random variables \(U_1,\cdots ,U_n\), then W is irreducible given \(U_1,\cdots ,U_n\).

Proof

If W is independent of \(U_1,\cdots ,U_n\), then \(W|U_1,\cdots ,U_n\) is identically distributed with W and is a continuous random variable. \(\square \)

Definition 6

(Linearly Irreducible) Let W be a continuous random variable. Given random variables \(U_1,\cdots ,U_n\), if

$$\begin{aligned} \begin{aligned} \mathrm {Pr}(W=a_1U_1+\cdots +a_nU_{n}|U_1,\cdots ,U_n)=0, \end{aligned} \end{aligned}$$

for any \(a_1,\cdots ,a_n\in \mathbb {R}\), W is said to be linearly irreducible given \(U_1,\cdots ,U_n\).

Definition 7

For continuous random variables \(U_1,\cdots ,U_n\), if every \(U_i\) is linearly irreducible given the remaining random variables, we say \(U_1,\cdots ,U_n\) are linearly irreducible.

Corollary 2

Let W be a continuous random variable. If W is irreducible given \(U_1,\cdots ,U_n\), then W is linearly irreducible given \(U_1,\cdots ,U_n\).

Proof

By Definition 4, \(W|U_1,\cdots ,U_n\) is a continuous random variable. So \(\mathrm {Pr}(W=a|U_1,\cdots ,U_n)=0\) for any \(a\in \mathbb {R}\). In particular, \(\mathrm {Pr}(W=a_1U_1+\cdots +a_nU_{n}|U_1,\cdots ,U_n)=0\) for any \(a_1,\cdots ,a_n\in \mathbb {R}\). \(\square \)

The following lemma provides a criterion for being linearly irreducible (See Appendix A for detailed proofs of Lemmas and Theorems).

Lemma 1

Let \(U_1,\cdots ,U_n\) be continuous random variables. If

$$\begin{aligned} \begin{aligned} \mathrm {Pr}(a_1U_1+\cdots +a_nU_{n}=0)=0 \end{aligned} \end{aligned}$$

for any \(a_1,\cdots ,a_n\in \mathbb {R}\) which are not all zero, then \(U_1,\cdots ,U_n\) are linearly irreducible.

Inspired by the rank of the Gaussian ensemble in random matrix theory (Tao and Vu, 2010), we establish a general result on the rank of a random matrix.

Theorem 1

For random matrix \( \mathbf{M} =(M_{ij})_{q\times p}\), where \(M_{ij}\) are continuous random variables, if \( \mathbf{M} \) satisfies the following conditions: (1) By rows, \(M_{i1},\cdots ,M_{ip}\) are linearly irreducible for all i; (2) By columns, \(M_{1j},\cdots ,M_{qj}\) are linearly irreducible for all j, then we have

$$\begin{aligned} \mathrm {Pr}( \text {rank}( \mathbf{M} )=\min (q,p))=1. \end{aligned}$$

Specifically, consider the rank of random matrix \(\mathbf{Z} \),

$$\begin{aligned} \begin{aligned} \mathbf{Z} =\left( \begin{array}{c} \mathbf{Y} ^{(1)} \\ \vdots \\ \mathbf{Y} ^{(T)} \\ \end{array} \right) = \left( \begin{array}{cccc} \frac{X_{11}^{(1)}-{\hat{\mu }}_1^{(1)}}{{\hat{\sigma }}_1^{(1)}} &{} \frac{X_{12}^{(1)}-{\hat{\mu }}_2^{(1)}}{{\hat{\sigma }}_2^{(1)}} &{} \cdots &{} \frac{X_{1p}^{(1)}-{\hat{\mu }}_p^{(1)}}{{\hat{\sigma }}_p^{(1)}} \\ \vdots &{} \vdots &{} &{} \vdots \\ \frac{X_{n1}^{(1)}-{\hat{\mu }}_1^{(1)}}{{\hat{\sigma }}_1^{(1)}} &{} \frac{X_{n2}^{(1)}-{\hat{\mu }}_2^{(1)}}{{\hat{\sigma }}_2^{(1)}} &{} \cdots &{} \frac{X_{np}^{(1)}-{\hat{\mu }}_p^{(1)}}{{\hat{\sigma }}_p^{(1)}} \\ \vdots &{} \vdots &{} &{} \vdots \\ \frac{X_{11}^{(T)}-{\hat{\mu }}_1^{(T)}}{{\hat{\sigma }}_1^{(T)}} &{} \frac{X_{12}^{(T)}-{\hat{\mu }}_2^{(T)}}{{\hat{\sigma }}_2^{(T)}} &{} \cdots &{} \frac{X_{1p}^{(T)}-{\hat{\mu }}_p^{(T)}}{{\hat{\sigma }}_p^{(T)}} \\ \vdots &{} \vdots &{} &{} \vdots \\ \frac{X_{n1}^{(T)}-{\hat{\mu }}_1^{(T)}}{{\hat{\sigma }}_1^{(T)}} &{} \frac{X_{n2}^{(T)}-{\hat{\mu }}_2^{(T)}}{{\hat{\sigma }}_2^{(T)}} &{} \cdots &{} \frac{X_{np}^{(T)}-{\hat{\mu }}_p^{(T)}}{{\hat{\sigma }}_p^{(T)}} \\ \end{array} \right) _{Tn\times p}. \end{aligned} \end{aligned}$$

For simplicity, delete the redundant rows in \(\mathbf{Z} \), which does not change the rank of the matrix. The redundancy may come from identical resampling sets, i.e., \(\mathbf{Y} ^{(t_1)}\equiv \mathbf{Y} ^{(t_2)}\), or from repeated observations within the same resampling set, i.e., \(\varvec{X}^{(t)}_{i_1}\equiv \varvec{X}^{(t)}_{i_2} \equiv \varvec{X}_i\in \mathcal {L}^{(t)}\). After eliminating these redundant rows, let \({\tilde{T}}\) be the number of distinct resampling sets among the total of T resampling sets, and let \(q_t\) be the number of non-repetitive observations in \(\mathcal {L}^{(t)}\).

Note that within each resampling set there exists a perfect linear relationship among the non-repetitive rows, induced by the sample means \({\hat{\mu }}_j^{(t)}\), which decreases the degrees of freedom of the observations by one. Thus, there are only \(q_t-1\) free observations in each resampling set. Without loss of generality, assume the first \(q_t-1\) rows in each resampling set are non-repetitive. We then have the following submatrix \(\mathbf{G} \) of \(\mathbf{Z} \),

$$\begin{aligned} \begin{aligned} \mathbf{G} = \left( \begin{array}{cccc} \frac{X_{11}^{(1)}-{\hat{\mu }}_1^{(1)}}{{\hat{\sigma }}_1^{(1)}} &{} \frac{X_{12}^{(1)}-{\hat{\mu }}_2^{(1)}}{{\hat{\sigma }}_2^{(1)}} &{} \cdots &{} \frac{X_{1p}^{(1)}-{\hat{\mu }}_p^{(1)}}{{\hat{\sigma }}_p^{(1)}} \\ \vdots &{} \vdots &{} &{} \vdots \\ \frac{X_{q_1-1,1}^{(1)}-{\hat{\mu }}_1^{(1)}}{{\hat{\sigma }}_1^{(1)}} &{} \frac{X_{q_1-1,2}^{(1)}-{\hat{\mu }}_2^{(1)}}{{\hat{\sigma }}_2^{(1)}} &{} \cdots &{} \frac{X_{q_1-1,p}^{(1)}-{\hat{\mu }}_p^{(1)}}{{\hat{\sigma }}_p^{(1)}} \\ \vdots &{} \vdots &{} &{} \vdots \\ \frac{X_{11}^{({\tilde{T}})}-{\hat{\mu }}_1^{({\tilde{T}})}}{{\hat{\sigma }}_1^{({\tilde{T}})}} &{} \frac{X_{12}^{({\tilde{T}})}-{\hat{\mu }}_2^{({\tilde{T}})}}{{\hat{\sigma }}_2^{({\tilde{T}})}} &{} \cdots &{} \frac{X_{1p}^{({\tilde{T}})}-{\hat{\mu }}_p^{({\tilde{T}})}}{{\hat{\sigma }}_p^{({\tilde{T}})}} \\ \vdots &{} \vdots &{} &{} \vdots \\ \frac{X_{q_{{\tilde{T}}}-1,1}^{({\tilde{T}})}-{\hat{\mu }}_1^{({\tilde{T}})}}{{\hat{\sigma }}_1^{({\tilde{T}})}} &{} \frac{X_{q_{{\tilde{T}}}-1,2}^{({\tilde{T}})}-{\hat{\mu }}_2^{({\tilde{T}})}}{{\hat{\sigma }}_2^{({\tilde{T}})}} &{} \cdots &{} \frac{X_{q_{{\tilde{T}}}-1,p}^{({\tilde{T}})}-{\hat{\mu }}_p^{({\tilde{T}})}}{{\hat{\sigma }}_p^{({\tilde{T}})}} \\ \end{array} \right) _{\sum _{t=1}^{{\tilde{T}}}(q_t-1)\times p}. \end{aligned} \end{aligned}$$

Here submatrix \(\mathbf{G} =(G_{ij}^{(t)})\) has the same rank as \(\mathbf{Z} \), where \(G_{ij}^{(t)}=\frac{X_{ij}^{(t)}-{\hat{\mu }}_j^{(t)}}{{\hat{\sigma }}_j^{(t)}}\), \(i=1,\cdots ,q_{t}-1\), \(j=1,\cdots ,p\), \(t=1,\cdots ,{\tilde{T}}\).

Lemma 2

\(G_{ij}^{(t)}\) is a continuous random variable.

According to Theorem 1 and Lemma 2, we show \(\mathrm {Pr}(\text {rank}(\mathbf{G} )=\min (\sum _{t=1}^{{\tilde{T}}}(q_t-1),p))=1\).

Theorem 2

For random matrix \(\mathbf{G} \), we have

$$\begin{aligned} \mathrm {Pr}( \text {rank}(\mathbf{G} )=\min (\sum _{t=1}^{{\tilde{T}}}(q_t-1),p))=1. \end{aligned}$$

The total number of distinct sets is \(\left( {\begin{array}{c}n+k-1\\ k\end{array}}\right) \) if we draw k samples from n different elements with replacement (Pishro-Nik, 2016). In our Bagging algorithm, \(k=n\). Thus, the number of distinct resampling sets \({\tilde{T}}\) converges to \(\left( {\begin{array}{c}2n-1\\ n\end{array}}\right) \) with probability 1 as \(T\rightarrow \infty \).

Since there are \(q_t-1\) free observations in each resampling set and \(q_t-1\ge 1\) holds except for the n sets in which the elements are all the same, we have \(\sum _{t=1}^{{\tilde{T}}}(q_t-1)\ge {\tilde{T}}-n\). Thus, \(\sum _{t=1}^{{\tilde{T}}}(q_t-1)\ge \left( {\begin{array}{c}2n-1\\ n\end{array}}\right) -n\) as \(T\rightarrow \infty \). Even if n is small, \(\left( {\begin{array}{c}2n-1\\ n\end{array}}\right) \) can be quite large. For example, when \(n=30\), \(\left( {\begin{array}{c}2n-1\\ n\end{array}}\right) \approx 5.9\times 10^{16}\). Thus, even in the cases where \(p\gg n\), we still have \(\mathrm {Pr}(\text {rank}(\mathbf{Z} )=p)=1\) as long as \(\left( {\begin{array}{c}2n-1\\ n\end{array}}\right) -n>p\).

In practice, not many resampling replications T are needed to ensure full rank. Let \(\tau =\left( {\begin{array}{c}2n-1\\ n\end{array}}\right) \) and consider resampling p times, i.e., \(T=p\). Note that the number of distinct resampling sets with rank at least 1 is \(\tau -n\). The probability of obtaining p distinct resampling sets, each with rank at least 1, is

$$\begin{aligned} \begin{aligned}&\frac{\tau -n}{\tau }\cdot \frac{\tau -n-1}{\tau }\cdots \frac{\tau -n-p+1}{\tau }=\prod _{i=0}^{p-1}\Big (1-\frac{n+i}{\tau }\Big )\\ \ge ~&\Big (1-\frac{n+p-1}{\tau }\Big )^{p}=1-\frac{p(n+p-1)}{\tau }+o\Big (\frac{n+p-1}{\tau }\Big ), \end{aligned} \end{aligned}$$

where \(o(\frac{n+p-1}{\tau })\) denotes higher-order terms in \(\frac{n+p-1}{\tau }\). Since \(\tau \gg n\) and \(\tau \gg p\) (e.g., for \(n=30\), \(\tau \approx 5.9\times 10^{16}\)), \(\frac{n+p-1}{\tau }\) is close to 0, and hence the probability above is quite close to 1. This illustrates that, with high probability, a full-rank matrix can be obtained with only p resampling replications. Since \(\text {rank}(\mathbf{R} ^{\text {Bag}})=\text {rank}(\mathbf{Z} )\), we have \(\mathrm {Pr}(\text {rank}(\mathbf{R} ^{\text {Bag}})=p)=1\), and thus \(\mathbf{R} ^{\text {Bag}}\) is not singular.
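As a hedged numerical illustration of these counts (reusing the `bagging_cor` sketch from Sect. 2, which is our own naming):

```r
# Number of distinct resampling sets and the lower bound derived above.
n <- 20; p <- 50
tau <- choose(2 * n - 1, n)                  # tau = C(2n-1, n), about 6.9e10
prob_bound <- (1 - (n + p - 1) / tau)^p      # lower bound with only T = p draws
prob_bound                                   # extremely close to 1

# Empirical check that R^Bag has full rank with only T = p resamples.
set.seed(2)
X <- matrix(rnorm(n * p), n, p)
R_bag <- bagging_cor(X, T = p)
qr(R_bag)$rank                               # equals p with probability one (Theorem 2)
min(eigen(R_bag, symmetric = TRUE, only.values = TRUE)$values) > 0   # TRUE: positive-definite
```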

It is worth mentioning that if we estimate the covariance matrix directly rather than the correlation matrix, i.e., without the standardization step, the Bagging estimator is not positive-definite. Similarly to the decomposition in Equation (1), we have

$$\begin{aligned} \begin{aligned} \mathbf{S} ^{\text {Bag}}&=\frac{1}{T}\sum _{t=1}^T\mathbf{S} ^{(t)} =\frac{1}{(n-1)T}\sum _{t=1}^T{(\mathbf{X} ^{(t)}-\bar{\mathbf{X }}^{(t)})}'(\mathbf{X} ^{(t)}-\bar{\mathbf{X }}^{(t)}).\\ \end{aligned} \end{aligned}$$

The corresponding random matrix \(\tilde{\mathbf{Z }}\) is

$$\begin{aligned} \begin{aligned} \tilde{\mathbf{Z }}= \left( \begin{array}{cccc} X_{11}^{(1)}-{\hat{\mu }}_1^{(1)} &{} X_{12}^{(1)}-{\hat{\mu }}_2^{(1)} &{} \cdots &{} X_{1p}^{(1)}-{\hat{\mu }}_p^{(1)} \\ \vdots &{} \vdots &{} &{} \vdots \\ X_{n1}^{(1)}-{\hat{\mu }}_1^{(1)} &{} X_{n2}^{(1)}-{\hat{\mu }}_2^{(1)} &{} \cdots &{} X_{np}^{(1)}-{\hat{\mu }}_p^{(1)} \\ \vdots &{} \vdots &{} &{} \vdots \\ X_{11}^{(T)}-{\hat{\mu }}_1^{(T)} &{} X_{12}^{(T)}-{\hat{\mu }}_2^{(T)} &{} \cdots &{} X_{1p}^{(T)}-{\hat{\mu }}_p^{(T)} \\ \vdots &{} \vdots &{} &{} \vdots \\ X_{n1}^{(T)}-{\hat{\mu }}_1^{(T)} &{} X_{n2}^{(T)}-{\hat{\mu }}_2^{(T)} &{} \cdots &{} X_{np}^{(T)}-{\hat{\mu }}_p^{(T)} \\ \end{array} \right) _{Tn\times p}=\mathbf{AX} , \end{aligned} \end{aligned}$$

where \(\mathbf{A} \) is a \(Tn\times n\) constant matrix. This means \(\tilde{\mathbf{Z }}\) is only a linear transformation of \(\mathbf{X} \). We have

$$\begin{aligned} \begin{aligned} \text {rank}(\tilde{\mathbf{Z }})\le \text {rank}(\mathbf{X} )=n. \end{aligned} \end{aligned}$$

Thus, the Bagging sample covariance matrix is still singular.
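The contrast can be checked directly with a short R sketch (again with our illustrative naming and simulated data):

```r
# Without the per-resample standardization, bagging the covariance matrix
# keeps rank(S^Bag) <= rank(X) = n, so it stays singular when p > n.
set.seed(3)
n <- 20; p <- 50; n_rep <- 200
X <- matrix(rnorm(n * p), n, p)
S_bag <- matrix(0, p, p)
for (t in seq_len(n_rep)) {
  idx <- sample(n, n, replace = TRUE)
  S_bag <- S_bag + cov(X[idx, , drop = FALSE])
}
S_bag <- S_bag / n_rep
qr(S_bag)$rank                       # at most n = 20, still far below p
qr(bagging_cor(X, T = n_rep))$rank   # p = 50: the standardization step matters
```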

3.2 Mean squared error

In addition to the guarantee of positive-definiteness, our Bagging estimator \(\mathbf{R} ^{\text {Bag}}\) performs well in terms of mean squared error (MSE). The MSE of a matrix estimator is defined through the Frobenius norm, i.e.,

$$\begin{aligned} \begin{aligned} {\text {MSE}(\varvec{{\hat{\varLambda }}})=E||\varvec{{\hat{\varLambda }}}-\varvec{\varLambda }||_F^2=E\sum _{i,j} ({\hat{\lambda }}_{ij}-\lambda _{ij})^2}, \end{aligned} \end{aligned}$$

where \(||\cdot ||_F\) is the Frobenius norm of a matrix, \(\varvec{{\hat{\varLambda }}}=({\hat{\lambda }}_{ij})_{p\times p}\) and \(\varvec{\varLambda }=(\lambda _{ij})_{p\times p}\) are the estimated and true correlation matrix respectively.

For the sample correlation matrix \(\mathbf{R} =(r_{ij})_{p\times p}\), the MSE of \(\mathbf{R} \) is

$$\begin{aligned} \begin{aligned} {\text {MSE}(\mathbf{R} )=E||\mathbf{R} -\varvec{\varLambda }||^2_F=E\sum _{i,j} (r_{ij}-\lambda _{ij})^2=\sum _{i,j}\text {MSE}(r_{ij})}. \end{aligned} \end{aligned}$$

Although the sample correlation matrix performs poorly as a whole when \(p>n\) because it is singular, each of its entries is still an efficient estimator of the pairwise correlation between variables. We next show that our Bagging estimator is consistent when p is fixed.

Theorem 3

The mean squared error of \(r_{ij}^{\text {Bag}}\) is no more than the average of the mean squared errors of \(r_{ij}^{(t)}\), i.e.,

$$\begin{aligned} \begin{aligned} {\text {MSE}(r^{\text {Bag}}_{ij})\le \frac{1}{T}\sum _{t=1}^T\text {MSE}(r_{ij}^{(t)})}, \end{aligned} \end{aligned}$$

where \(r_{ij}^{(t)}\) denotes the entry in the i-th row and j-th column of \(\mathbf{R} ^{(t)}\).

Since the resampling sets \(\mathcal {L}^{(t)}\) are identically distributed, Theorem 3 directly leads to \(\text {MSE}(r^{\text {Bag}}_{ij})\le \text {MSE}(r_{ij}^{(t)})\). Thus, it is sufficient to show that \(r_{ij}^{(t)}\) is a consistent estimator, which further leads to \(\text {MSE}(r_{ij}^{(t)})\rightarrow 0\) as n goes to infinity.

For a general bivariate distribution (X, Y) with finite fourth moments, Lehmann (1999) showed that \(\sqrt{n}(r_{XY}-\rho )\) is asymptotically normal with mean 0 and constant variance, where \(r_{XY}\) is the sample correlation coefficient and \(\rho \) is the true correlation coefficient. This also implies that \(r_{XY}\) is a consistent estimator of \(\rho \). Here we propose a bootstrap version to show that \(r_{XY}^{(t)}\) is asymptotically consistent.

Theorem 4

Let \((X_1,Y_1),\cdots ,(X_n, Y_n)\) be i.i.d. according to some bivariate distribution (X, Y) with finite fourth moments, means \(E(X)=\xi \), \(E(Y)=\eta \), variances \(\text {Var}(X)=\sigma ^2\), \(\text {Var}(Y)=\tau ^2\), and correlation coefficient \(\rho \). Let \((X_1^{(t)},Y_1^{(t)}),\cdots ,(X_n^{(t)}, Y_n^{(t)})\) be the t-th bootstrap resampling set. The bootstrap sample correlation is defined as

$$\begin{aligned} r_{XY}^{(t)}=\frac{\frac{1}{n-1}\sum _{i=1}^n(X_i^{(t)}-{\bar{X}}^{(t)})(Y_i^{(t)}-{\bar{Y}}^{(t)})}{S_X^{(t)}S_Y^{(t)}}, \end{aligned}$$

where

$$\begin{aligned} \begin{aligned}&{\bar{X}}^{(t)}=\frac{1}{n}\sum _{i=1}^nX_i^{(t)}, ~~~{\bar{Y}}^{(t)}=\frac{1}{n}\sum _{i=1}^nY_i^{(t)},\\&(S_X^{(t)})^2=\frac{1}{n-1}\sum _{i=1}^n(X_i^{(t)}-{\bar{X}}^{(t)})^2,~~~(S_Y^{(t)})^2=\frac{1}{n-1}\sum _{i=1}^n(Y_i^{(t)}-{\bar{Y}}^{(t)})^2. \\ \end{aligned} \end{aligned}$$

Then, as n goes to infinity, the bootstrap sample correlation \(r_{XY}^{(t)}\) is a consistent estimator of \(\rho \).
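A brief R sketch illustrating Theorem 4, with bivariate Gaussian data as an assumed example:

```r
# One bootstrap resample: r_XY^(t) approaches rho as n grows.
set.seed(4)
rho <- 0.6
for (n in c(50, 500, 5000)) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # Corr(X, Y) = rho
  idx <- sample(n, n, replace = TRUE)         # t-th bootstrap resampling set
  cat("n =", n, " r^(t) =", cor(x[idx], y[idx]), "\n")  # tends to rho = 0.6
}
```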

By Theorems 3 & 4, we have the following corollary.

Corollary 3

Under the mild condition that the p-dimensional distribution has finite fourth moments, the MSE of the Bagging estimator converges to zero, i.e.,

$$\begin{aligned} {\text {MSE}(\mathbf{R} ^{\text {Bag}})\le \text {MSE}(\mathbf{R} ^{(t)})\rightarrow 0}, \end{aligned}$$

as \(n\rightarrow \infty \) for fixed p. This implies that the Bagging estimator \(\mathbf{R} ^{\text {Bag}}\) is consistent.

4 Simulations

In this section, simulation studies are presented to compare the performance of the Bagging estimator with other classic approaches, including the graphical lasso (glasso, Friedman et al., 2008), the hard-thresholding method (H-threshold, Bickel and Levina, 2008), the shrinkage estimator (Ledoit and Wolf, 2004), and the traditional sample correlation matrix. Two criteria are used to evaluate the performance of the estimators: the comparable log-likelihood \(\ell \) and the root-mean-square error (RMSE). The log-likelihood measures the fit to the observed data and depends on the assumed distribution; here the comparable log-likelihood \(\ell \) is the core of the log-likelihood function with common constant terms omitted. The RMSE measures the difference between the true values and the estimates. The RMSE of an estimator is defined as follows:

$$\begin{aligned} \begin{aligned} {\text {RMSE}(\varvec{{\hat{\varLambda }}})=\frac{1}{p}||\varvec{{\hat{\varLambda }}}-\varvec{\varLambda }||_F=\frac{1}{p}\sqrt{\sum _{i,j} ({\hat{\lambda }}_{ij}-\lambda _{ij})^2}}, \end{aligned} \end{aligned}$$

where \(||\cdot ||_F\) is the Frobenius norm of a matrix, \(\varvec{{\hat{\varLambda }}}=({\hat{\lambda }}_{ij})_{p\times p}\) and \(\varvec{\varLambda }=(\lambda _{ij})_{p\times p}\) are the estimated and true correlation matrix respectively.
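In R, this criterion is a one-liner (the helper name is ours):

```r
# RMSE between an estimated and the true correlation matrix: ||.||_F / p.
rmse_cor <- function(Lambda_hat, Lambda) {
  sqrt(sum((Lambda_hat - Lambda)^2)) / ncol(Lambda)
}
```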

In the following simulation studies, we synthesize data from assumed distributions with known correlation matrix. The true correlation matrix is generated as follows:

$$\begin{aligned} \varvec{\varSigma }=A'A~~~\text {and}~~~\varvec{\varLambda }=\text {diag}(\varvec{\varSigma })^{-1/2}\varvec{\varSigma }\text {diag}(\varvec{\varSigma })^{-1/2} \end{aligned}$$
(2)

where \(A=(a_{ij})_{p\times p}\) and the \(a_{ij}\sim \text {Unif}(-1,1)\) are i.i.d. for \(i,j=1,\cdots ,p\). The randomly generated correlation matrices are symmetric and positive-definite. They are general correlation matrices without any special structure.
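A sketch of this generating mechanism in R (our own naming; the detailed code is in the supplementary materials):

```r
# Random true correlation matrix via Equation (2).
gen_corr <- function(p) {
  A <- matrix(runif(p * p, min = -1, max = 1), p, p)   # a_ij ~ Unif(-1, 1), i.i.d.
  Sigma <- t(A) %*% A                                   # Sigma = A'A, positive-definite a.s.
  D_inv <- diag(1 / sqrt(diag(Sigma)))                  # diag(Sigma)^{-1/2}
  D_inv %*% Sigma %*% D_inv                             # Lambda
}
```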

Then, the estimators under comparison are computed from the generated data sets. Considering the uncertainty of Monte Carlo simulations, we repeat the experiments, including the generation of random correlation matrices and data synthesis, 100 times independently in each setting. The means and standard errors of \(\ell \) and RMSE are reported for comparison. See the supplementary materials for the detailed R code.

4.1 Case 1: multivariate Gaussian data

In this case, the data sets are generated from a multivariate Gaussian distribution with mean zero and a general correlation matrix, which is generated randomly according to Equation (2). Table 1 presents the means and standard errors of \(\ell _N\) and RMSE for \(p=50, n=20\) and \(p=200, n=100\), respectively.
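A hedged sketch of this setting, assuming the MASS package for multivariate Gaussian sampling and the helper sketches above (`gen_corr`, `bagging_cor`, `rmse_cor`, all our own naming):

```r
# Case 1 (sketch): Gaussian data with a random true correlation matrix.
library(MASS)
set.seed(5)
p <- 50; n <- 20
Lambda <- gen_corr(p)                             # Equation (2)
X <- mvrnorm(n, mu = rep(0, p), Sigma = Lambda)   # N_p(0, Lambda) sample
R_bag <- bagging_cor(X, T = 100)                  # T = 100 resampling replications
rmse_cor(R_bag, Lambda)                           # RMSE of one Monte Carlo replicate
```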

The only required parameter in the Bagging estimator is the number of resampling replications T. In practice, increasing T may improve the accuracy of estimation. Figure 2, taken from one of the following simulation studies, demonstrates the relationship between T and RMSE. The RMSE of the estimator first decays as T increases and then converges to a stable level. In the following simulation studies, T is set to 100 to balance estimation accuracy and computational cost.

Fig. 2 The RMSE of the Bagging estimator first decays as T increases and then converges to a stable level

From Table 1, we find that the hard-thresholding method sacrifices much of the information in the covariance matrix to attain positive-definiteness: the comparable log-likelihood of the thresholded estimator is quite low, although its RMSE performs well. Our Bagging estimator has significant advantages over the compared approaches in terms of the comparable log-likelihood \(\ell _N\), which demonstrates that the Bagging estimator fits the observed data better. Note that \(\ell _N\) of the sample correlation estimator would be infinite when \(p>n\) because the estimator is singular, rendering that estimator invalid. For RMSE, the performances of Bagging and glasso are close to each other and better than those of the shrinkage estimator and the sample correlation estimator, though not as good as that of the H-threshold estimator.

Table 1 The means and standard errors of two criteria across 100 independent experiments for multivariate Gaussian data

The results of more scenarios under different settings are shown in Fig. 3. Here the sample size is set to \(n=p/2\), varying with the number of variables p. In summary, the Bagging estimator strikes a better balance between RMSE and likelihood.

Fig. 3 a For comparable log-likelihood \(\ell _N\), our Bagging estimator beats others significantly across all values of p. b For RMSE, the Bagging estimator is second only to the hard-threshold method, which has the worst performance from the perspective of \(\ell _N\)

4.2 Case 2: multivariate t-distribution data

Besides traditional multivariate Gaussian data, the Bagging estimator also works for general continuous distributions, such as the multivariate t-distribution. In the following simulation studies, data are generated from a multivariate t-distribution with mean zero and a general correlation matrix, which is again randomly generated from Equation (2). The multivariate t-distribution is a generalization of Student's t-distribution to random vectors (Genz and Bretz, 2009). Its density function is

$$\begin{aligned} f(\mathbf{x} ;\varvec{\mu },\varvec{\varLambda },\upsilon )=\frac{\varGamma [(\upsilon +p)/2]}{\varGamma (\upsilon /2)\upsilon ^{p/2} \pi ^{p/2}|\varvec{\varLambda }|^{1/2}}\Big [1+\frac{1}{\upsilon }(\mathbf{x} -\varvec{\mu })^T\varvec{\varLambda }^{-1}(\mathbf{x} -\varvec{\mu })\Big ]^{-(\upsilon +p)/2}, \end{aligned}$$

where \(\varvec{\mu }\) and \(\varvec{\varLambda }\) are the mean vector parameter and the correlation matrix parameter, respectively, and \(\upsilon \) denotes the degrees of freedom of the distribution. As \(\upsilon \rightarrow \infty \), the multivariate t-distribution converges to the multivariate Gaussian distribution. Hence, the degrees of freedom are set to \(\upsilon =3\) to distinguish this setting from the Gaussian case. The number of resampling replications T is still set to 100, the same as in Sect. 4.1. Table 2 presents the means and standard errors of \(\ell _t\) and RMSE for \(p=50, n=20\) and \(p=100, n=50\).
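A corresponding data-generation sketch, assuming the mvtnorm package and the helpers above:

```r
# Case 2 (sketch): multivariate t data with df = 3 and correlation structure Lambda.
library(mvtnorm)
set.seed(6)
p <- 50; n <- 20
Lambda <- gen_corr(p)                     # Equation (2)
X <- rmvt(n, sigma = Lambda, df = 3)      # heavy-tailed; correlation matrix is Lambda
R_bag <- bagging_cor(X, T = 100)
rmse_cor(R_bag, Lambda)
```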

Table 2 The means and standard errors of two criteria across 100 independent experiments for Multivariate t-distribution data (\(\upsilon =3\))

More scenarios under different settings are explored in Fig. 4. Again, the sample size is set to \(n=p/2\).

Fig. 4 a For comparable log-likelihood \(\ell _t\), our Bagging estimator beats others significantly across all values of p. b For RMSE, the Bagging estimator is second only to the hard-threshold method, which has the worst performance from the perspective of \(\ell _t\)

Table 2 and Fig. 4 lead to conclusions similar to those drawn from Table 1 and Fig. 3. They demonstrate that our Bagging estimator is not only suitable for Gaussian data but can also be applied to non-Gaussian data.

5 Application

This section presents a real application to demonstrate the performance of our estimator. The original dataset, contributed by Bhattacharjee et al. (2001), is a well-known gene expression dataset on lung cancer patients. It contains 203 specimens, including 139 adenocarcinomas resected from the lung (“AD” samples) and 64 other samples, with 12,600 transcript sequences. Here we focus on the 139 “AD” samples (\(n=139\)) and assume they are independent, identically distributed, and follow a Gaussian distribution. For simplicity, we use a standard deviation threshold of 500 expression units to select the 186 most variable transcript sequences (\(p=186\)). Then, a subset of 70 “AD” samples is sampled randomly without replacement to form a correlation matrix estimator. We repeat the experiments and the sampling procedure 100 times independently. The comparable log-likelihood and RMSE for the different estimators are summarized in Table 3, where the RMSE is calculated against the sample correlation matrix of the full 139 samples instead of the unknown “true” correlation matrix. The results show that our Bagging estimator has significant advantages over the other estimators in terms of likelihood and is competitive in terms of RMSE.
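A hedged outline of this pipeline, assuming an expression matrix `expr` (139 “AD” samples in rows, 12,600 transcript sequences in columns) has already been loaded; all object names are ours.

```r
# Select the most variable transcripts, then compare the Bagging estimator on a
# random 70-sample subset with the full-sample "true" correlation matrix.
keep   <- apply(expr, 2, sd) > 500           # sd threshold of 500 expression units
X_all  <- expr[, keep]                       # 139 x 186 in the paper's setting
R_true <- cor(X_all)                         # "true" correlation: all 139 samples

sub   <- X_all[sample(nrow(X_all), 70), ]    # subset sampled without replacement
R_bag <- bagging_cor(sub, T = 100)
rmse_cor(R_bag, R_true)
```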

Table 3 Means and standard errors of two criteria across 100 independent experiments

Figure 5 presents the sample correlation matrix of the full 139 samples and the Bagging estimator computed on a subset of 70 samples in one of the experiments. It demonstrates that our Bagging estimator is quite close to the “true” value.

Fig. 5 a Heat map of the sample correlation matrix of the full 139 samples, which is viewed as the “true” correlation matrix for comparison. b Heat map of the Bagging estimator on a subset of 70 samples

6 Summary

In this paper, we propose a novel approach to estimating high-dimensional correlation matrices from finite samples when \(p>n\). Through the bootstrap resampling procedure, we show that the Bagging estimator is positive-definite with probability one in finite samples. Furthermore, our estimator is flexible for general continuous data under some mild conditions. Common assumptions in analogous problems, such as a sparse structure or a Gaussian distribution, are unnecessary in our framework. Through simulation studies and a real application, our method is demonstrated to strike a better balance between RMSE and likelihood. The four approaches selected for comparison represent different but classical ideas for solving the high-dimensional covariance matrix problem, so the results are representative.

It should be noted that our Bagging estimator is intended for problems with little prior knowledge. If one has prior information on the structure of the covariance matrix, e.g., a block or banded structure, specific approaches are certainly better than our general method. The choice of estimation method still depends on the specific scenario and application. Some theoretical aspects can be explored further in future research, e.g., the convergence rate of the Bagging estimator when both p and n go to infinity.