High-dimensional correlation matrix estimation for general continuous data with Bagging technique

High-dimensional covariance matrix estimation plays a central role in multivariate statistical analysis. It is well known that the sample covariance matrix is singular when the sample size is smaller than the dimension, yet a valid covariance estimate must be positive-definite. This motivates modifications of the sample covariance matrix that preserve its efficient estimation of pairwise covariances. In this paper, we modify the sample correlation matrix using the Bagging technique. The proposed Bagging estimator is flexible enough for general continuous data. Under some mild conditions, we show theoretically that the Bagging estimator is positive-definite with probability one in finite samples. We also prove the consistency of the bootstrap estimator of the Pearson correlation and the consistency of our Bagging estimator when the dimension p is fixed. Simulation results and a real application demonstrate that our method strikes a better balance between RMSE and likelihood, and is more robust, than existing estimators.


Introduction
Covariance matrix estimation is a fundamental topic in multivariate statistical analysis. Traditionally, the sample covariance matrix is a convenient and efficient estimator when the sample size n is much larger than the dimension p. However, in recent years, more and more high-dimensional datasets with small n and large p have appeared in various applications. For instance, investors track thousands of assets in financial markets, but there are only hundreds of daily trading observations per year (Bodnar et al., 2018). For cancer diagnosis with genetic data, thousands of gene expressions can be measured simultaneously using microarray techniques, but patient cases are often rare and limited (Best et al., 2015). It is well known that the sample covariance matrix is singular when p > n, but a valid covariance matrix must be positive-definite. This fatal flaw hampers the application of the sample covariance matrix in high-dimensional multivariate statistical analyses, including discriminant analysis and regression models. Furthermore, Johnstone (2001) showed that the sample covariance matrix distorts the eigen-structure of the population covariance matrix and is ill-conditioned when p is large. In short, the sample covariance matrix is a poor estimator in high-dimensional cases.
Although its performance is poor as a whole (Fan et al., 2016), each entry of the sample covariance matrix is still an efficient estimator of the pairwise covariance between variables. This motivates the design of modified versions that retain the efficient estimation of pairwise covariances while avoiding the drawbacks. Ledoit and Wolf (2004) proposed a shrinkage method that takes a weighted linear combination of the sample covariance matrix and the identity matrix. The resulting matrix is positive-definite, invertible, and preserves the eigenvector structure. There is existing literature on how to choose the optimal weight parameter to obtain better asymptotic properties (Ledoit and Wolf, 2004; Mestre and Lagunas, 2005; Mestre, 2008). However, the shrinkage operation leads to a biased estimator in finite samples. If the covariance matrix is sparse, thresholding may be the most intuitive idea in high-dimensional analyses. Bickel and Levina (2008) applied the hard-thresholding method to the sample covariance matrix and showed its asymptotic consistency. After that, other generalized thresholding rules were proposed, such as banding (Wu and Pourahmadi, 2009), soft-thresholding (Rothman et al., 2009), and adaptive thresholding (Cai and Liu, 2011). For further theoretical results, Cai et al. (2010) derived the optimal rate of convergence for estimating the true covariance matrix, and Cai and Zhou (2012) explored the operator norm, Frobenius norm, and L1 norm of the estimator and its inverse. Thresholding is an efficient way to obtain a sparse estimator, but it is hard to ensure positive-definiteness in finite samples. In fact, Guillot and Rajaratnam (2012) showed that a thresholded matrix may lose positive-definiteness quite easily. Fan et al. (2016) also demonstrated that the thresholding method sacrifices a great number of entries, and much information, in the sample covariance matrix to attain positive-definiteness.
From the perspective of random matrix theory, Marzetta et al. (2011) constructed a positive-definite estimator by random dimension reduction, and Tucci and Wang (2019) considered a random unitary matrix with Haar measure as an alternative random operator. In this paper, inspired by work in random matrix theory and some practical considerations, we modify the sample correlation matrix using the Bagging technique. Bagging (Bootstrap Aggregating), proposed by Breiman (1996), is an ensemble algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical inference. Surprisingly, we find that the Bagging technique can achieve a positive-definite estimate when p > n. Through the resampling procedure, Bagging can "create" more linearly independent data and transform the problem into the traditional setting where n/p is large. This paper contributes to the field in the following aspects: (a) we propose a new high-dimensional correlation matrix estimator for general continuous data; (b) we prove theoretically that the Bagging estimator ensures positive-definiteness with probability one in finite samples, and that the estimator is consistent when p is fixed; (c) we demonstrate that the Bagging estimator is competitive with existing approaches through a large number of simulation studies in various scenarios and a real application. This paper is organized as follows: Sect. 2 proposes the Bagging estimator. Section 3 proves some relevant theoretical results. Section 4 compares our method with existing approaches through simulation studies in various scenarios, and Sect. 5 provides a real application. Section 6 concludes the paper.

Bagging estimator
For a given training set D of size n, the Bagging technique first generates m new training sets d 1 , ⋯ , d m , each of size n, by sampling from D uniformly with replacement. This step is called bootstrap sampling. These m bootstrap resampling sets are then fitted separately to produce estimates h 1 , ⋯ , h m . The individual estimates h 1 , ⋯ , h m are then combined by averaging or voting to generate the final estimate h Bag . The procedure of the Bagging algorithm is illustrated in Fig. 1.
Generally, Bagging can improve the stability and accuracy of almost every regression and classification algorithm (Breiman, 1996). In this paper, we use the Bagging technique to modify the sample correlation matrix. Let X = (X_ij)_{n×p} be the observed dataset, where X_ij denotes the i-th observation of the j-th variable, i = 1, ⋯, n and j = 1, ⋯, p. Assume the row vectors X_i = (X_i1, ⋯, X_ip) are i.i.d. for i = 1, ⋯, n and follow a continuous and irreducible p-dimensional distribution with mean μ and positive-definite covariance matrix Σ, e.g., X_i ∼ N_p(μ, Σ). Here an irreducible p-dimensional distribution denotes a p-dimensional distribution whose p components are irreducible (see Definition 5 for details). We are interested in estimating the p × p covariance matrix Σ = (σ_ij)_{p×p} for fixed p and finite sample size n when p > n. The sample covariance matrix is defined as S = (X − X̄)⊤(X − X̄)/(n − 1), where X̄ is the matrix whose rows all equal the sample mean vector. According to the variance-correlation decomposition, Σ = D ρ D, where D is the diagonal matrix of standard deviations and ρ is the correlation matrix with diagonal elements equal to 1. Thus, we may estimate D and ρ separately (Barnard et al., 2000). If D is estimated by the sample standard deviations, i.e., D̂ = diag(S)^{1/2}, then the problem reduces to estimating the correlation matrix ρ. The corresponding sample version is the sample correlation matrix, defined as R = D̂^{-1} S D̂^{-1}. Note that rank(R) ≤ n − 1, so R is still singular when p > n and hence not a valid estimator of ρ. Therefore, a modification of R is a must.
Definition 2 (Bagging Estimator) For a given dataset L = {X_1, ⋯, X_n}, consider a simple resampling set of n observations drawn with replacement, e.g., L^(t) = {X^(t)_1, ⋯, X^(t)_n}. Using these resampled data, construct the matrix X^(t), from which the sample correlation matrix R^(t) is formed. Repeat this process independently T times. Then, the Bagging estimator is defined as

R_Bag = (1/T) Σ_{t=1}^{T} R^(t).

The Bagging algorithm is summarized in detail in Algorithm 1. The complete algorithm is simple, easy to implement, and requires few assumptions. Common assumptions, such as Gaussian data or a sparse covariance matrix, are unnecessary in our algorithm. Compared with approaches that rely on these assumptions, our Bagging estimator is more flexible for general continuous data.
Algorithm 1 Bagging Algorithm for Correlation Matrix Estimation
1: Given dataset L = {X_1, ⋯, X_n}.
2: for t = 1, ⋯, T do
3:   Resample n observations from L with replacement to construct X^(t).
4:   Normalize the matrix X^(t) by columns to obtain Y^(t).
5:   Calculate the sample correlation matrix R^(t).
6: end for
7: Average the outputs over the iterations as the Bagging estimator R_Bag.
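Algorithm 1 translates almost line for line into code. The sketch below is a minimal pure-Python illustration (the paper's own implementation is in R in the supplementary materials); the function names and the guard against degenerate resamples are our own additions, not part of the paper.

```python
import math
import random

def sample_correlation(data):
    """Sample correlation matrix of an n x p dataset given as a list of rows."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    # norms of the centered columns
    norms = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data)) for j in range(p)]
    # Step 4 of Algorithm 1: normalize columns to mean 0 and unit norm
    Y = [[(row[j] - means[j]) / norms[j] for j in range(p)] for row in data]
    # Step 5: R = Y'Y
    return [[sum(Y[i][a] * Y[i][b] for i in range(n)) for b in range(p)] for a in range(p)]

def bagging_correlation(data, T, seed=0):
    """Average the sample correlation matrices of T bootstrap resamples."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    R_bag = [[0.0] * p for _ in range(p)]
    for _ in range(T):
        boot = [data[rng.randrange(n)] for _ in range(n)]   # Step 3: resample
        while len({tuple(row) for row in boot}) < 2:        # guard: skip degenerate resamples
            boot = [data[rng.randrange(n)] for _ in range(n)]
        R_t = sample_correlation(boot)
        for a in range(p):
            for b in range(p):
                R_bag[a][b] += R_t[a][b] / T                # Step 7: average
    return R_bag

# toy usage: n = 6 observations of p = 4 variables
rng = random.Random(7)
data = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
R_bag = bagging_correlation(data, T=50)
```

Because every bootstrap correlation matrix has unit diagonal, is symmetric, and has entries in [−1, 1], the average R_bag inherits all three properties.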

Positive-definiteness
A valid correlation matrix estimator must be positive-definite. As we shall show, our new estimator R_Bag is positive-definite with probability one in finite samples, although each R^(t) is still singular. It should be noted that this "magic" operation works only for the sample correlation matrix R, not for the sample covariance matrix S. This may partially explain why this simple procedure has not been explored until now. For R_Bag, we have the following decomposition:

R_Bag = (1/T) Σ_{t=1}^{T} R^(t) = (1/T) Σ_{t=1}^{T} Y^(t)⊤ Y^(t) = (1/T) G⊤G,   (1)

where G = (Y^(1)⊤, ⋯, Y^(T)⊤)⊤ is a Tn × p random matrix containing all resampled (normalized) observations. According to Equation (1), it is sufficient to show that Pr(rank(G) = p) = 1 for large T. First, we clarify several definitions regarding random variables for convenience.
Definition 3 (Continuous) A random variable X is said to be continuous if Pr(X ∈ B) = 0 for any finite or countable set B of points of the real line.
Definition 4 (Irreducible) Let W be a continuous random variable. Given random variables U 1 , ⋯ , U n , if W|U 1 , ⋯ , U n is still a continuous random variable, W is said to be irreducible given U 1 , ⋯ , U n .
Definition 5 For continuous random variables U 1 , ⋯ , U n , if every U i is irreducible given the remaining random variables, we say U 1 , ⋯ , U n are irreducible.
Corollary 1 Let W be a continuous random variable. If W is independent of random variables U 1 , ⋯ , U n , then W is irreducible given U 1 , ⋯ , U n .
Proof If W is independent of U_1, ⋯, U_n, then W | U_1, ⋯, U_n is identically distributed with W and hence is a continuous random variable. ◻

Definition 6 (Linearly Irreducible) Let W be a continuous random variable. Given random variables U_1, ⋯, U_n, if

Pr(W = a_1 U_1 + ⋯ + a_n U_n | U_1, ⋯, U_n) = 0

for any a_1, ⋯, a_n ∈ ℝ, then W is said to be linearly irreducible given U_1, ⋯, U_n.

Definition 7 For continuous random variables U_1, ⋯, U_n, if every U_i is linearly irreducible given the remaining random variables, we say U_1, ⋯, U_n are linearly irreducible.

Corollary 2 Let W be a continuous random variable. If W is irreducible given U_1, ⋯, U_n, then W is linearly irreducible given U_1, ⋯, U_n.

Proof Since W | U_1, ⋯, U_n is a continuous random variable, Pr(W ∈ B | U_1, ⋯, U_n) = 0 for any finite or countable set B. In particular, Pr(W = a_1 U_1 + ⋯ + a_n U_n | U_1, ⋯, U_n) = 0 for any a_1, ⋯, a_n ∈ ℝ. ◻

The following lemma provides a criterion for being linearly irreducible (see Appendix A for detailed proofs of Lemmas and Theorems).
Lemma 1 Let U_1, ⋯, U_n be continuous random variables. If Pr(a_1 U_1 + ⋯ + a_n U_n = 0) = 0 for any a_1, ⋯, a_n ∈ ℝ that are not all zero, then U_1, ⋯, U_n are linearly irreducible.
Inspired by the rank of the Gaussian ensemble in random matrix theory (Tao and Vu, 2010), we show a general result for the rank of a random matrix.

Theorem 1 Let G = (G_ij)_{q×p} be a random matrix of continuous random variables satisfying: (1) for each row i, G_{i1}, ⋯, G_{ip} are linearly irreducible; (2) for each column j, G_{1j}, ⋯, G_{qj} are linearly irreducible. Then Pr(rank(G) = min(q, p)) = 1.

Specifically, consider the rank of the random matrix G in Equation (1). For simplicity, delete the redundant rows in G, which does not change the rank of the matrix. The redundancy may come from identical resampling sets, i.e., L^(t_1) ≡ L^(t_2), or from repeated observations within the same resampling set, i.e., X^(t)_{i_1} ≡ X^(t)_{i_2}. After eliminating these redundant rows, let T̃ be the number of distinct resampling sets among the T resampling sets, and let q_t be the number of non-repeated observations in L^(t).
Note that within each resampling set there exists a perfect linear relationship among the non-repeated rows, induced by the sample mean μ̂^(t)_j used for centering, which decreases the degrees of freedom of the observations by one. Thus, there are only q_t − 1 free observations in each resampling set. Without loss of generality, assume the first q_t − 1 rows in each resampling set are non-repeated, and let G̃ be the submatrix of G consisting of the first q_t − 1 rows of each of the T̃ distinct resampling sets.

Lemma 2 Each entry G^(t)_ij of G̃ is a continuous random variable.
According to Theorem 1 and Lemma 2, we obtain the following result.

Theorem 2 For the random matrix G̃, we have Pr(rank(G̃) = min(Σ_{t=1}^{T̃}(q_t − 1), p)) = 1.
The total number of distinct sets when drawing k samples from n different elements with replacement is C(n+k−1, k) (Pishro-Nik, 2016). Here k = n in our Bagging algorithm. Thus, the number of distinct resampling sets T̃ goes to C(2n−1, n) with probability one as T → ∞.
Since there are q_t − 1 free observations in each resampling set, and q_t − 1 ≥ 1 holds except for the n degenerate sets whose elements are all the same, we have Σ_{t=1}^{T̃}(q_t − 1) ≥ T̃ − n, which can be quite large. For example, when n = 30, C(2n−1, n) ≈ 5.9 × 10^16. Thus, even in cases where p ≫ n, we still have Pr(rank(G̃) = p) = 1 as long as C(2n−1, n) − n > p.
In practice, not many resampling iterations T are needed to ensure full rank. Let M = C(2n−1, n) and consider resampling p times, i.e., T = p. Note that the number of distinct resampling sets with at least one free observation is M − n. The probability of obtaining p distinct resampling sets, each contributing rank at least 1, is at least

(1 − (n+p−1)/M)^p = 1 − p(n+p−1)/M + o((n+p−1)/M),

where o((n+p−1)/M) denotes a higher-order term of (n+p−1)/M. Since M ≫ n and M ≫ p (e.g., for n = 30, M ≈ 5.9 × 10^16), (n+p−1)/M is close to 0, so this probability is quite close to 1. This illustrates that we can obtain a full-rank matrix with only p resampling iterations with high probability. Since rank(R_Bag) = rank(G⊤G) = rank(G), we have Pr(rank(R_Bag) = p) = 1, and thus R_Bag is not singular.
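These magnitudes are easy to verify directly. The short sketch below computes C(2n−1, n) with Python's `math.comb` and evaluates a worst-case lower bound on the probability of drawing p usable distinct resampling sets (each of the p draws must avoid the n degenerate sets and the at most p − 1 sets already drawn); the bound form follows the counting argument in the text.

```python
import math

n, p = 30, 200
M = math.comb(2 * n - 1, n)        # number of distinct resampling sets of size n
print(M)                           # about 5.9e16 for n = 30

# each draw succeeds with probability at least 1 - (n + p - 1) / M
prob_lower_bound = (1 - (n + p - 1) / M) ** p
print(prob_lower_bound)            # extremely close to 1
```

Even with p = 200 variables and only n = 30 observations, the bound is indistinguishable from 1 at double precision.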
It is worth mentioning that if we estimate the covariance matrix directly rather than the correlation matrix, i.e., without the standardization step, the Bagging estimator is not positive-definite. Similarly to the decomposition in Equation (1), we have S_Bag = (1/T) Z⊤Z, where the corresponding random matrix Z satisfies Z = C X, with C a Tn × n matrix determined by the resampling indices and the centering operation. This means Z is only a linear transformation of X, so rank(Z) ≤ rank(X) ≤ n < p. Thus, the Bagging sample covariance matrix is still singular.
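Both claims are easy to check numerically. The sketch below (pure Python, with our own helper names) builds a p > n example and uses a Cholesky attempt as a positive-definiteness test: the single sample correlation matrix and the bagged covariance matrix (bootstrapping without the standardization step) fail the test, while the bagged correlation matrix passes.

```python
import math
import random

def centered(data):
    n, p = len(data), len(data[0])
    m = [sum(r[j] for r in data) / n for j in range(p)]
    return [[r[j] - m[j] for j in range(p)] for r in data]

def cov(data):
    C = centered(data)
    n, p = len(C), len(C[0])
    return [[sum(C[i][a] * C[i][b] for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]

def corr(data):
    C = centered(data)
    n, p = len(C), len(C[0])
    s = [math.sqrt(sum(row[j] ** 2 for row in C)) for j in range(p)]
    Y = [[row[j] / s[j] for j in range(p)] for row in C]
    return [[sum(Y[i][a] * Y[i][b] for i in range(n)) for b in range(p)] for a in range(p)]

def cholesky_pd(M, tol=1e-9):
    """True iff M admits a Cholesky factorization with all pivots > tol."""
    p = len(M)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = M[i][i] - s
                if d <= tol:
                    return False
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return True

rng = random.Random(1)
n, p, T = 5, 8, 200                     # p > n: each resample gives a singular matrix
data = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]

R_bag = [[0.0] * p for _ in range(p)]
S_bag = [[0.0] * p for _ in range(p)]
for _ in range(T):
    boot = [data[rng.randrange(n)] for _ in range(n)]
    while len({tuple(r) for r in boot}) < 2:    # skip the degenerate resamples
        boot = [data[rng.randrange(n)] for _ in range(n)]
    Rt, St = corr(boot), cov(boot)
    for a in range(p):
        for b in range(p):
            R_bag[a][b] += Rt[a][b] / T
            S_bag[a][b] += St[a][b] / T

print(cholesky_pd(corr(data)))   # sample correlation is singular when p > n
print(cholesky_pd(R_bag))        # bagged correlation becomes positive-definite
print(cholesky_pd(S_bag))        # bagged covariance stays singular
```

The contrast between the last two lines is exactly the point of this subsection: normalization makes each bootstrap matrix a nonlinear function of the data, which breaks the rank bound that traps the covariance version.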

Mean squared error
In addition to the guarantee of positive-definiteness, our Bagging estimator R_Bag performs well in terms of mean squared error (MSE). The MSE of a matrix estimator is defined via the Frobenius norm, i.e.,

MSE(R̂) = E ||R̂ − ρ||²_F = Σ_{i,j} E(r̂_ij − ρ_ij)²,

where || ⋅ ||_F is the Frobenius norm of a matrix, and R̂ = (r̂_ij)_{p×p} and ρ = (ρ_ij)_{p×p} are the estimated and true correlation matrices, respectively.
For the sample correlation matrix R = (r_ij)_{p×p}, the MSE of R is MSE(R) = Σ_{i,j} E(r_ij − ρ_ij)² = Σ_{i,j} MSE(r_ij). Although the performance of the sample correlation matrix is poor as a whole when p > n due to singularity, each of its entries is still an efficient estimator of the pairwise correlation between variables. We next show that our Bagging estimator is consistent when p is fixed.

Theorem 3 The mean squared error of r^Bag_ij is no more than the average of the mean squared errors of r^(1)_ij, ⋯, r^(T)_ij, i.e.,

MSE(r^Bag_ij) ≤ (1/T) Σ_{t=1}^{T} MSE(r^(t)_ij),

where r^(t)_ij denotes the (i, j)-th entry of R^(t).
Since the resampling sets L^(t) are identically distributed, Theorem 3 leads to MSE(r^Bag_ij) ≤ MSE(r^(t)_ij). It therefore suffices to show that r^(t)_ij is a consistent estimator, i.e., that MSE(r^(t)_ij) → 0 as n goes to infinity. For a general bivariate distribution (X, Y) with finite fourth moments, Lehmann (1999) showed that the limiting distribution of √n(r_XY − ρ) is normal with mean 0 and constant variance, where r_XY is the sample correlation coefficient and ρ is the true correlation coefficient. This also implies that r_XY is a consistent estimator of ρ. Here we prove a bootstrap version of this result, showing that the bootstrap correlation r^(t)_XY is asymptotically consistent.
Combining Theorem 3 with the consistency of the bootstrap correlation, we obtain MSE(R_Bag) → 0 as n → ∞ for fixed p. This implies that the Bagging estimator R_Bag is consistent.
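The consistency mechanism can be checked numerically: the Monte Carlo MSE of the bootstrap correlation shrinks as n grows. The sketch below is a pure-Python illustration with our own helper names; it draws bivariate Gaussian pairs with true correlation ρ = 0.5, takes one bootstrap resample per dataset, and averages the squared error of the bootstrap correlation.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def boot_corr_mse(n, rho, reps, seed=0):
    """Monte Carlo MSE of the bootstrap correlation r^(t) around rho."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x]
        idx = [rng.randrange(n) for _ in range(n)]       # one bootstrap resample
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        total += (r - rho) ** 2
    return total / reps

mse_small = boot_corr_mse(30, 0.5, 300)
mse_large = boot_corr_mse(300, 0.5, 300)
print(mse_small > mse_large)   # the MSE shrinks roughly like 1/n
```

With ten times the sample size, the bootstrap correlation's MSE drops by about an order of magnitude, consistent with the √n rate quoted above.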

Simulations
In this section, simulation studies are presented to compare the performance of the Bagging estimator with other classic approaches, including the graphical lasso (glasso, Friedman et al., 2008), the hard-threshold method (H-threshold, Bickel and Levina, 2008), the shrinkage estimator (Ledoit and Wolf, 2004), and the traditional sample correlation matrix. Two criteria are used to evaluate the performance of the estimators: comparable log-likelihood and root-mean-square error (RMSE). The log-likelihood measures the fit to the observed data and depends on the assumed distribution; here the comparable log-likelihood is the core of the log-likelihood function with common constant terms omitted. RMSE measures the difference between the true values and the estimates. The RMSE of an estimator is defined as

RMSE(R̂) = ||R̂ − ρ||_F / p,

where || ⋅ ||_F is the Frobenius norm of a matrix, and R̂ = (r̂_ij)_{p×p} and ρ = (ρ_ij)_{p×p} are the estimated and true correlation matrices, respectively.
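Reading the (garbled in our copy) RMSE formula as the Frobenius distance divided by p, i.e., the root of the mean squared entry error, a direct transcription is:

```python
import math

def rmse(R_hat, R):
    """Entrywise root-mean-square error: ||R_hat - R||_F / p."""
    p = len(R)
    sq = sum((R_hat[i][j] - R[i][j]) ** 2 for i in range(p) for j in range(p))
    return math.sqrt(sq) / p
```

For example, against a 2 × 2 identity target, an estimate with off-diagonal entries 0.5 has RMSE √0.5 / 2 ≈ 0.354.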
In the following simulation studies, we synthesize data from assumed distributions with a known correlation matrix. The true correlation matrix is generated as

ρ = diag(AA⊤)^{−1/2} (AA⊤) diag(AA⊤)^{−1/2},   (2)

where A = (a_ij)_{p×p} with a_ij ∼ Unif(−1, 1) i.i.d. for i, j = 1, ⋯, p. The randomly generated correlation matrices are positive-definite and symmetric. They are general correlation matrices without any special structure.
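One construction matching this description (a square A with i.i.d. Unif(−1, 1) entries, with AA⊤ rescaled to unit diagonal) can be sketched as follows; the function name is ours.

```python
import math
import random

def random_correlation(p, seed=0):
    """Draw A with Unif(-1, 1) entries and rescale AA' to unit diagonal."""
    rng = random.Random(seed)
    A = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(p)]
    S = [[sum(A[i][k] * A[j][k] for k in range(p)) for j in range(p)] for i in range(p)]
    d = [math.sqrt(S[i][i]) for i in range(p)]
    return [[S[i][j] / (d[i] * d[j]) for j in range(p)] for i in range(p)]
```

The output is symmetric with unit diagonal by construction, and positive-definite with probability one because a random square A is almost surely nonsingular.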
Then, we obtain the estimated correlation matrices from the generated datasets. Considering the uncertainty of Monte Carlo simulations, we repeat the experiments, including the generation of random correlation matrices and the data synthesis, 100 times independently in each setting. The means and standard errors of the comparable log-likelihood and RMSE are reported for comparison. See the supplementary materials for the detailed R code.

Case 1: multivariate Gaussian data
In this case, the datasets are generated from a multivariate Gaussian distribution with mean zero and a general correlation matrix, randomly generated according to Equation (2). Table 1 presents the means and standard errors of the comparable Gaussian log-likelihood ℓ_N and RMSE for p = 50, n = 20 and for p = 200, n = 100.
The only tuning parameter of the Bagging estimator is the number of resampling iterations T. In practice, increasing T may improve the accuracy of estimation. Figure 2, taken from one of the following simulation studies, shows the relationship between T and RMSE: the RMSE of the estimator first decays as T increases and then converges to a stable level. In the following simulation studies, T is set to 100 to balance estimation accuracy and computational cost.
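The qualitative behaviour in Fig. 2 can be reproduced in miniature. The sketch below is our own pure-Python setup (not the paper's simulation code): the columns are independent, so the true correlation matrix is the identity, and we compare the average RMSE of the Bagging estimator at a small and a moderate T on the same datasets.

```python
import math
import random

def corr(data):
    n, p = len(data), len(data[0])
    m = [sum(r[j] for r in data) / n for j in range(p)]
    s = [math.sqrt(sum((r[j] - m[j]) ** 2 for r in data)) for j in range(p)]
    Y = [[(r[j] - m[j]) / s[j] for j in range(p)] for r in data]
    return [[sum(Y[i][a] * Y[i][b] for i in range(n)) for b in range(p)] for a in range(p)]

def bagging(data, T, rng):
    n, p = len(data), len(data[0])
    R = [[0.0] * p for _ in range(p)]
    for _ in range(T):
        boot = [data[rng.randrange(n)] for _ in range(n)]
        while len({tuple(r) for r in boot}) < 2:    # skip degenerate resamples
            boot = [data[rng.randrange(n)] for _ in range(n)]
        Rt = corr(boot)
        for a in range(p):
            for b in range(p):
                R[a][b] += Rt[a][b] / T
    return R

def rmse_to_identity(R):
    p = len(R)
    return math.sqrt(sum((R[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(p) for j in range(p))) / p

rng = random.Random(3)
n, p, reps = 10, 6, 20
avg = {5: 0.0, 100: 0.0}
for _ in range(reps):
    data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    for T in (5, 100):
        avg[T] += rmse_to_identity(bagging(data, T, rng)) / reps

print(avg[5] > avg[100])   # larger T averages out the extra resampling noise
```

The averaging explains the plateau in Fig. 2: once the resampling noise (of order 1/T) is small relative to the sampling error in the data, further increases in T stop helping.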
From Table 1, we find that the hard-threshold method sacrifices much of the information in the correlation matrix to attain positive-definiteness: the comparable log-likelihood of the thresholded estimator is quite low, though its RMSE performs well. Our Bagging estimator has significant advantages over the compared approaches in comparable log-likelihood ℓ_N, which demonstrates that the Bagging estimator fits the observed data better. Note that ℓ_N of the sample correlation estimator is not finite when p > n because the estimator is singular, making it invalid. For RMSE, the performances of Bagging and glasso are close, better than the shrinkage estimator and the sample correlation estimator, but not as good as the H-threshold estimator.
The results of more scenarios under different settings are shown in Fig. 3, where the sample size is set as n = p/2, varying with the number of variables p. In summary, the Bagging estimator strikes a better balance between RMSE and likelihood.

Case 2: multivariate t-distribution data
Besides traditional multivariate Gaussian data, the Bagging estimator also works for general continuous distributions, such as multivariate t-distributions. In the following simulation studies, data are generated from the multivariate t-distribution with mean zero and a general correlation matrix, still randomly generated from Equation (2). The multivariate t-distribution generalizes Student's t-distribution to random vectors (Genz and Bretz, 2009). The density function is

f(x) = Γ((ν+p)/2) / [Γ(ν/2) (νπ)^{p/2} |ρ|^{1/2}] · [1 + (x − μ)⊤ρ^{−1}(x − μ)/ν]^{−(ν+p)/2},

where μ and ρ are the mean vector parameter and the correlation matrix parameter, respectively, and ν denotes the degrees of freedom of the distribution. As ν → ∞, the multivariate t-distribution converges to the multivariate Gaussian distribution, so the degrees of freedom is set to ν = 3 to distinguish these cases from the Gaussian ones. The number of resampling iterations T is still set to 100, as in Sect. 4.1. Table 2 presents the means and standard errors of ℓ_t and RMSE for p = 50, n = 20 and p = 100, n = 50.
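Multivariate t draws can be generated as a Gaussian scale mixture: a correlated Gaussian vector divided by √(w/ν) with w ~ χ²_ν. The bivariate sketch below (our own helper, not the paper's generator) also checks that the mixture preserves the correlation parameter, since for ν = 3 the second moments are finite.

```python
import math
import random

def mvt_pair(rho, nu, rng):
    """One draw from a bivariate t with correlation parameter rho and nu df:
    z bivariate normal with correlation rho, divided by sqrt(w / nu), w ~ chi^2_nu."""
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    w = sum(rng.gauss(0, 1) ** 2 for _ in range(nu))   # chi-square with integer nu df
    s = math.sqrt(w / nu)
    return z1 / s, z2 / s

rng = random.Random(0)
draws = [mvt_pair(0.6, 3, rng) for _ in range(4000)]
x = [d[0] for d in draws]
y = [d[1] for d in draws]
mx, my = sum(x) / len(x), sum(y) / len(y)
r = (sum((a - mx) * (b - my) for a, b in zip(x, y))
     / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)))
print(r)   # close to 0.6: the scale mixture preserves correlation
```

The shared divisor √(w/ν) is what creates the heavy tails while leaving the correlation structure intact, which is exactly why a correlation estimator, rather than a Gaussian likelihood, is the right target here.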
More scenarios under different settings are explored in Fig. 4, where the sample size is again set as n = p/2. Table 2 and Fig. 4 lead to conclusions similar to those from Table 1 and Fig. 3: our Bagging estimator is not only suitable for Gaussian data but can also be applied to non-Gaussian data.
Fig. 3 (a) For comparable log-likelihood ℓ_N, our Bagging estimator significantly beats the others across all values of p. (b) For RMSE, the Bagging estimator is second only to the hard-threshold method, which has the worst performance from the perspective of ℓ_N.

Application
This section presents a real application to demonstrate the performance of our estimator. The original dataset, contributed by Bhattacharjee et al. (2001), is a well-known gene expression dataset on lung cancer patients. It contains 203 specimens, including 139 adenocarcinomas resected from the lung ("AD" samples) and 64 other samples, with 12,600 transcript sequences. Here we focus on the 139 "AD" samples (n = 139) and assume they are independent, identically distributed, and follow a Gaussian distribution. For simplicity, we use a standard deviation threshold of 500 expression units to select the 186 most variable transcript sequences (p = 186). Then, a subset of 70 "AD" samples is drawn randomly without replacement to form a correlation matrix estimate. We repeat the experiment, including the sampling procedure, 100 times independently. The comparable log-likelihood and RMSE of the different estimators are summarized in Table 3, where RMSE is calculated against the sample correlation matrix of the full 139 samples in place of the unknown "true" correlation matrix. The table shows that our Bagging estimator has significant advantages over the other estimators in terms of likelihood, and is competitive in terms of RMSE. Figure 5 presents the sample correlation matrix of the full 139 samples and the Bagging estimator based on a subset of 70 samples in one of the experiments. It demonstrates that our Bagging estimator is quite close to the "true" value.

Fig. 4 (a) For comparable log-likelihood ℓ_t, our Bagging estimator significantly beats the others across all values of p. (b) For RMSE, the Bagging estimator is second only to the hard-threshold method, which has the worst performance from the perspective of ℓ_t.
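The selection-and-subsampling protocol can be sketched as follows. The data here are a synthetic stand-in (random numbers with hypothetical scales), not the Bhattacharjee et al. (2001) expression matrix, and the sizes of the synthetic matrix are our own choices.

```python
import random
import statistics

rng = random.Random(0)
n_samples, n_transcripts = 139, 1000          # stand-in sizes, not the real 12,600
# synthetic "expression matrix": some columns made much more variable than others
scales = [rng.choice([50.0, 800.0]) for _ in range(n_transcripts)]
X = [[rng.gauss(0.0, scales[j]) for j in range(n_transcripts)]
     for _ in range(n_samples)]

# keep transcripts whose sample standard deviation exceeds 500 units
keep = [j for j in range(n_transcripts)
        if statistics.stdev([row[j] for row in X]) > 500.0]
X_sel = [[row[j] for j in keep] for row in X]

# draw 70 samples without replacement for one estimation run
subset_idx = rng.sample(range(n_samples), 70)
subset = [X_sel[i] for i in subset_idx]
print(len(subset), len(subset[0]))
```

Each repetition of the experiment redraws `subset_idx`, estimates the correlation matrix on the 70 retained rows, and scores it against the full-sample correlation matrix.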

Summary
In this paper, we propose a novel approach for estimating high-dimensional correlation matrices with finite samples when p > n. Through the bootstrap resampling procedure, we show that the Bagging estimator ensures positive-definiteness with probability one in finite samples. Furthermore, our estimator is flexible for general continuous data under some mild conditions; common assumptions in analogous problems, such as a sparse structure or a Gaussian distribution, are unnecessary in our framework. Through simulation studies and a real application, our method is demonstrated to strike a better balance between RMSE and likelihood. The four approaches selected for comparison represent different but classic ideas for solving the high-dimensional covariance matrix problem, so the results are representative.
It should be noted that our Bagging estimator is intended for problems with little prior knowledge. If one has prior information on the structure of the covariance matrix, e.g., a block or banded structure, approaches tailored to that structure are certainly better than our general method. The choice of estimation method still depends on the specific scenario and application. Some theoretical aspects can be explored further in future research, e.g., the convergence rate of the Bagging estimator when both p and n go to infinity.

Appendix A: proofs of Lemmas & Theorems
Proof of Lemma 1 Without loss of generality, assume a_1 ≠ 0, so the event {a_1U_1 + ⋯ + a_nU_n = 0} equals {U_1 = −(a_2/a_1)U_2 − ⋯ − (a_n/a_1)U_n}. Since this event has probability zero and the conditional probability is nonnegative, Pr(U_1 = b_2U_2 + ⋯ + b_nU_n | U_2, ⋯, U_n) = 0 almost surely for any b_2, ⋯, b_n ∈ ℝ, i.e., by Definition 6, U_1 is linearly irreducible given the remaining random variables. Similarly, every U_i is linearly irreducible given the remaining random variables. Thus, U_1, ⋯, U_n are linearly irreducible. ◻

Proof of Theorem 1
If q ≤ p, we need to show that Pr(rank(G) = q) = 1. Construct a square q × q submatrix using the first q columns of G. It suffices to show that this square submatrix is singular with probability 0, i.e., that its determinant vanishes with probability 0.
The square submatrix is singular if and only if G_i lies in the span of G_1, ⋯, G_{i−1} for some i, where G_i may be a row or a column vector of the submatrix. Since the theorem is symmetric in rows and columns, we assume without loss of generality that the G_i are row vectors. Thus,

Pr(the submatrix is singular) ≤ Σ_{i=2}^{q} Pr(G_i ∈ V_i | G_1, ⋯, G_{i−1}), where V_i = span(G_1, ⋯, G_{i−1}).

According to condition (2), G_{1j}, ⋯, G_{qj} are linearly irreducible for all j. So, for any 1 < i ≤ q, G_{1j}, ⋯, G_{ij} are linearly irreducible for all j; in particular, G_{ij} is linearly irreducible given G_{1j}, ⋯, G_{i−1,j} for all j. By Definition 6, we have

Pr(G_{ij} = a_1 G_{1j} + ⋯ + a_{i−1} G_{i−1,j} | G_{1j}, ⋯, G_{i−1,j}) = 0

for all j and any a_1, ⋯, a_{i−1} ∈ ℝ. Thus, Pr(G_i ∈ V_i | G_1, ⋯, G_{i−1}) = 0 holds for any 1 < i ≤ q.
If q > p, we take the first p rows of G to construct a square p × p submatrix. Similarly, the submatrix has rank p with probability 1. Since p = rank(submatrix) ≤ rank(G) ≤ p, it follows that Pr(rank(G) = p) = 1. ◻

Proof of Lemma 2 Note that

G^(t)_{ij} = (X^(t)_{ij} − X̄^(t)_j) / √( Σ_{k} (X^(t)_{kj} − X̄^(t)_j)² ),

which is a function of X^(t)_{1j}, ⋯, X^(t)_{q_t,j}, the independent continuous random variables in the t-th resampling set. Writing X = X^(t)_{1j} and conditioning on X^(t)_{2j}, ⋯, X^(t)_{q_t,j}, we can express

G^(t)_{ij} = (AX + B) / √(CX² + DX + E),

where CX² + DX + E > 0 for all X, A ≠ 0, and A, B, C, D, E are constants. For any value b, the equation G^(t)_{ij} = b has at most two roots in X, so the solution set is finite. Since X = X^(t)_{1j} is independent of X^(t)_{2j}, ⋯, X^(t)_{q_t,j} and X is a continuous random variable, by Corollary 1, X | X^(t)_{2j}, ⋯, X^(t)_{q_t,j} is a continuous random variable, so X falls in any finite set with probability 0. So we have

Pr(G^(t)_{ij} = b | X^(t)_{2j}, ⋯, X^(t)_{q_t,j}) = 0.

Then, for any finite or countable set B of points of the real line, Pr(G^(t)_{ij} ∈ B) = 0. ◻
(1): Note that if X^(t)_i = (X^(t)_{i1}, ⋯, X^(t)_{ip}) follows a continuous and irreducible p-dimensional distribution, then X^(t)_{i1}, ⋯, X^(t)_{ip} are irreducible. Since the X^(t)_i, i = 1, ⋯, q_t, are independent random vectors, according to Corollary 1, the entries of the whole resampling set are irreducible. For any a_1, ⋯, a_p ∈ ℝ, not all zero, we examine the probability of the event

a_1 G^(t)_{i1} + ⋯ + a_p G^(t)_{ip} = 0.

Without loss of generality, assume a_1 ≠ 0, and let X = X^(t)_{11}. Given X^(t) \ X^(t)_{11}, we have

a_1 G^(t)_{i1} + ⋯ + a_p G^(t)_{ip} = (AX + B) / √(CX² + DX + E) + F,

where CX² + DX + E > 0 for all X, A ≠ 0, and A, B, C, D, E, F are constants. Since X is irreducible given X^(t) \ X^(t)_{11}, arguing as in the proof of Lemma 2, we have

Pr(a_1 G^(t)_{i1} + ⋯ + a_p G^(t)_{ip} = 0 | X^(t) \ X^(t)_{11}) = 0.

By integrating X^(t) \ X^(t)_{11} out, we have Pr(a_1 G^(t)_{i1} + ⋯ + a_p G^(t)_{ip} = 0) = 0. According to Lemma 2, the G^(t)_{ij} are continuous random variables. Then, by Lemma 1, G^(t)_{i1}, ⋯, G^(t)_{ip} are linearly irreducible for all i and t. The random matrix G̃ satisfies condition (1).
(2): Within the t-th resampling set, there are q_t distinct independent samples X^(t)_1, ⋯, X^(t)_{q_t}. For any t and column j, the centering by the sample mean induces one perfect linear relationship among G^(t)_{1j}, ⋯, G^(t)_{q_t,j}, so, without loss of generality, we show that the first q_t − 1 elements G^(t)_{1j}, ⋯, G^(t)_{q_t−1,j} are linearly irreducible. For any a^(t)_1, ⋯, a^(t)_{q_t−1} ∈ ℝ that are not all zero, the induced coefficients on X^(t)_{1j}, ⋯, X^(t)_{q_t−1,j} are not all zero either, and according to Lemma 1, G^(t)_{1j}, ⋯, G^(t)_{q_t−1,j} are linearly irreducible for any t and j.

Between different resampling sets, e.g., the t_1-th set and the t_2-th set, let X = X_{ij} be an original observation in column j shared by both sets. For any a^(t_1)_1, ⋯, a^(t_1)_{q_{t_1}−1}, a^(t_2)_1, ⋯, a^(t_2)_{q_{t_2}−1} ∈ ℝ that are not all zero, there are two cases. If the coefficients from one of the sets are all zero, the claim reduces to the within-set case above. Otherwise, given the remaining entries we can write

Σ_s a^(t_1)_s G^(t_1)_{sj} + Σ_s a^(t_2)_s G^(t_2)_{sj} = (A_1X + B_1)/√(C_1X² + D_1X + E_1) + (A_2X + B_2)/√(C_2X² + D_2X + E_2),

where A_1, B_1 are not both zero and A_2, B_2 are not both zero, since G^(t_1)_{1j}, ⋯, G^(t_1)_{q_{t_1}−1,j} are linearly irreducible and G^(t_2)_{1j}, ⋯, G^(t_2)_{q_{t_2}−1,j} are linearly irreducible. Here C_1X² + D_1X + E_1 > 0 and C_2X² + D_2X + E_2 > 0 for all X, and A_1, B_1, C_1, D_1, E_1, A_2, B_2, C_2, D_2, E_2 are constants. The equation above has at most 4 zero points in its solution set. Since X is a continuous random variable and, by Corollary 1, remains continuous given the other entries, X falls in any finite set with probability 0. So we have

Pr(Σ_s a^(t_1)_s G^(t_1)_{sj} + Σ_s a^(t_2)_s G^(t_2)_{sj} = 0 | X^(t_1)_{⋅j} ∪ X^(t_2)_{⋅j} \ X) = 0.

By integrating X^(t_1)_{⋅j} ∪ X^(t_2)_{⋅j} \ X out, we have Pr(Σ_s a^(t_1)_s G^(t_1)_{sj} + Σ_s a^(t_2)_s G^(t_2)_{sj} = 0) = 0. According to Lemma 2, each G^(t)_{ij} is a continuous random variable. Then, by Lemma 1, G^(t_1)_{1j}, ⋯, G^(t_1)_{q_{t_1}−1,j}, G^(t_2)_{1j}, ⋯, G^(t_2)_{q_{t_2}−1,j} are linearly irreducible for all j. Similarly, the result generalizes so that G^(1)_{1j}, ⋯, G^(1)_{q_1−1,j}, ⋯, G^(T̃)_{1j}, ⋯, G^(T̃)_{q_{T̃}−1,j} are linearly irreducible for all j. The random matrix G̃ satisfies condition (2). ◻

Proof of Theorem 3 Note that r^Bag_ij = (1/T) Σ_{t=1}^{T} r^(t)_ij. Applying Jensen's inequality to the convex function x ↦ (x − ρ_ij)²,

MSE(r^Bag_ij) = E( (1/T) Σ_{t=1}^{T} (r^(t)_ij − ρ_ij) )² ≤ (1/T) Σ_{t=1}^{T} E(r^(t)_ij − ρ_ij)² = (1/T) Σ_{t=1}^{T} MSE(r^(t)_ij). ◻