Estimation of autocovariance matrices for high dimensional linear processes

In this paper under some mild restrictions upper bounds on the rate of convergence for estimators of p×p\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p\times p$$\end{document} autocovariance and precision matrices for high dimensional linear processes are given. We show that these estimators are consistent in the operator norm in the sub-Gaussian case when p=Onγ/2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p={\mathcal {O}}\left( n^{\gamma /2}\right) $$\end{document} for some γ>1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\gamma >1$$\end{document}, and in the general case when p2/β(n-1logp)1/2→0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ p^{2/\beta }(n^{-1} \log p)^{1/2}\rightarrow 0$$\end{document} for some β>2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\beta >2$$\end{document} as p=p(n)→∞\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ p=p(n)\rightarrow \infty $$\end{document} and the sample size n→∞\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n\rightarrow \infty $$\end{document}. In particular our results hold for multivariate AR processes. We compare our results with those previously obtained in the literature for independent and dependent data. We also present non-asymptotic bounds for the error probability of these estimators.


Introduction
Estimation of covariance matrices in a high dimensional setting has been one of the fundamental statistical issues in the last decade. Some statistical applications of estimation of covariance matrices have been presented for ridge regression in Hoerl and Kennard (1970), in regularized discriminant analysis- (Friedman 1989) and in principal component analysis- (Johnstone and Lu 2009). For an overview of this topic and its applications see (Bickel and Levina 2008b;Birnbaum and Nadler 2012;Chen et al. 2013;Fan et al. 2006;Rothman et al. 2009). The problem of estimating covariance matrices for dependent data has been recently investigated by (Chen et al. 2013;Bhattacharjee and Bose 2014a, b;Guo et al. 2016;Jentsch and Politis 2015;McMurry and Politis 2010), and Wu and Pourahmadi (2009). The estimation of the inverse covariance matrix is used in the recovery of the true unknown structure of undirected B Konrad Furmańczyk konrad_furmanczyk@sggw.edu.pl 1 Institute of Information Technology, Warsaw University of Life Sciences (SGGW), Nowoursynowska 159, 02-776 Warsaw, Poland graphical models, especially in Gaussian graphical models, where a zero entry of the inverse covariance matrix is associated with a missing edge between two vertices in the graph. The recovery of undirected graphs on the basis of the estimation of the precision matrices for a general class of nonstationary time series is considered in Xu et al. (2020).
For example, condition (SGauss) is satisfied for bounded sequences ε i,l i.e. sup i,l ε i,l ≤ M for some M > 0. We can observe that (SGauss) is implied by sub-Gaussian condition for the vectorization vec ε i ⊗ ε j of the Kronecker product for some σ 2 > 0. Condition (NGa β ) is a moment condition for the innovation process without assumptions on the dependency of the coordinates of the innovation process (ε i ).
Let k be the kth order autocovariance matrix, The matrix k will be estimated from the sample X 1 , ..., X n because in practice we do not know the matrices j and . Define the sample autocovariance matrix of order k for 0 ≤ k ≤ n − 1, and the banded version ofˆ k (as in Bhattacharjee and Bose (2014b)) is given by for some sequence of thresholds l n → ∞ as n → ∞, where 1(·) is the indicator function. We will assume that p = p(n) → ∞ as n → ∞.
The main contribution of our paper is that in Theorem 1 we obtain the rate in the operator norm of B l n (ˆ k ) − k for high dimensional linear process for some c n → 0 and some α > 0. Under the sub-Gaussian condition ((Gauss) and (SGauss)) we obtain the same rate O P ((n −1 log p) α/2(α+1) ) as Bhattacharjee and Bose (2014b) but under weaker assumptions for matrices of coefficients j and . In particular, under causality, our results include vector autoregressive AR(r ) processes. Similar results (Corollary 1) we obtain for the precision matrix An interesting problem is to obtain lower bounds and the optimal rate of convergence. Cai et al. (2010) obtained the minimax bound for i.i.d. observations for tapering estimators of 0 . For dependent data this problem is still open. Below we briefly present the state of research related to the estimation of the covariance matrix for independent and dependent observations. The sample covariance matrixˆ 0 = (γ 0 i j ) performs poorly in a high dimensional setting. In the Gaussian case when 0 = I is the identity matrix and p/n → c ∈ (0, 1), the empirical distribution of the eigenvalues of the sample covariance matrix 0 follows the Marchenko and Pastur (1967) law, which is supported on the interval Bickel and Levina (2008a) proposed thresholding of the sample covariance matrix and obtained rates of convergence for the thresholding estimators for a proper choice of the threshold λ n , where the estimator is given bŷ Rothman et al. (2009) considered a class of universal thresholding rules with more general thresholding functions than hard thresholding. An interesting generalization of this method can be found in Cai and Liu (2011) for sparse covariance matrices, where an adaptive thresholding estimator is given bŷ where S λ i j (·) is a general thresholding function with data-driven thresholds λ i j . For other interesting results in this area, see (Birnbaum and Nadler 2012;Cai et al. 2010;Fan et al. 2006;Furrer and Bengtsson 2007;Huang et al. 2006). There are few results for high-dimensional dependent data: (Bhattacharjee and Bose 2014a, b;Chen et al. 2013;Jentsch and Politis 2015) and Guo et al. (2016). Bhattacharjee and Bose (2014a) considered the estimation of the high dimensional variance-covariance matrix under a general Gaussian model with weak dependence in both rows and columns of the data matrix. They showed that the bounded and tapered sample variance-covariance matrices are consistent under a suitable column dependence model. But their conditions do not allow control of the first few autocovariances, only they control higher order autocovariances. Bhattacharjee and Bose (2014b) showed that under suitable assumptions for the linear process, the banded sample autocovariance matrices are consistent in the high dimensional setting. Chen et al. (2013) obtained the rate of convergence for a banded autocovariance estimator in operator norm for a general dependent model. A similar result under more restricted assumptions was obtained by (Jentsch and Politis 2015;Guo et al. 2016) established similar results for multivariate sparse autoregressive AR processes.
The rest of the paper is organized as follows. In "The rate of convergence of autocovariance estimation" section we deal with the problem of estimating the kth order autocovariance matrix k by (3) in high dimensions (Theorem 1). The rate of convergence for the estimator of the precision matrix −1 k is given in Corollary 1. In Proposition 1 we obtain bounds on the error probability of the estimation of the k th order autocovariance. In "Comparison of our results with previous studies" section we compare our results with the results obtained by (Bickel and Levina 2008a, b) for independent normal and nonnormal data, with the minimax upper bound for tapering estimator in Cai et al. (2010) and with the results for dependent data obtained by (Chen et al. 2013;Bhattacharjee and Bose 2014b;Guo et al. 2016) and Jentsch and Politis (2015). In the special case of a multi-dimensional linear process we obtain a sharper bound (Theorem 1) than Chen et al. (2013) and Jentsch and Politis (2015). We also obtain a better rate for the estimation error for multivariate sparse AR processes than Guo et al. (2016).
Finally, the conclusions are presented in Sect. 2.5. All the proofs and auxiliary lemmas are given in the "Appendix".

The rate of convergence of autocovariance estimation
For any p × p matrix M = m i j we define the following matrix norms which are convenient for comparison with other results of autocovariance estimation: We consider the following conditions on the matrices j (see (1)), k and for all t > 0 and j = 0, 1, ... : These conditions are restrictions on the parameter space. It is obvious that (A3) implies (A4) but the converse is not true. If the covariance matrix =(γ i j ) is such that γ i j ≤ C |i − j| −α−1 for all i, j for some α > 0, then T ( , t) ≤ C 2 t −α and (A2) holds. Conditions (A1)-(A2) are tapering conditions and specify the rate of decay for the matrices j and away from the diagonal.
Of course, if (A3) holds then (1,1) < ∞ implies that the series in (1) converges almost surely. Next, in "Remarks on the condition (A1) for AR(1) processes" and "Remarks on the condition (A1) for AR(r ) processes" sections in a few remarks we will give a discussion about condition (A1) is fulfilled for vector AR(1) and AR(r ) processes. In "Comparison of our results with previous studies" section we compare our main result Theorem 1 with the previous studies. In "Conclusions" section we summarize results related to Theorem 1 available in the literature.

The main results
The main results in this section concern the rate of convergence in operator norm for the kth order autocovariance matrix k and for the precision matrix −1 k .
The proof of Theorem 1 is similar to the proofs of Theorems 1-2 in Bhattacharjee and Bose (2014b), but instead of their Lemmas 3, 5 we use a maximal inequality for the (SGauss) case and Pisier's inequality (see (Pisier 1983)) for the (NGa β ) case.
Corollary 1 Under the assumptions of Theorem 1 and when −1 k exists and −1 where l n and c n are defined in Theorem 1.
Corollary 1 is a simple consequence of Theorem 1. We also obtain the rate of convergence ofˆ k in supremum norm, which will be used in the proof of Theorem 1.
Next, we obtain non-asymptotic bounds on the error probability of the estimation of the kth order autocovariance.

Remarks on the condition (A1) for AR(r) processes
Now, we consider a multivariate AR(r ) process of order r given by t ≥ 1, where the p× p matrices A i , i = 1, . . . , r , are called parameter matrices. Under some regularity condition (for more details see ((Bhattacharjee and Bose 2014b) (4), p. 264) and Brockwell and Davis (2002)) this process has the representation ( 1), where 0 = I and We consider the class A of r -sequences of matrices, defined by Therefore, putting t = 1/2, we get Similarly, we may obtain A i (1,1) ≤ (2 α C 2 + C)δ i .

Proposition 2 Under the condition of class
and for j = 1, 2, . . .

Remark B
Immediately from Proposition 2, we have that under the condition of class A condition (A1) holds for d j = jδ j for multivariate AR(r ) processes.
A sharper rate than in (16) for a nonnormal case was obtained by Chen et al. (2013), where data come from a general weak dependence model. In particular, when data come from a linear process (1) with (NGa β ) for β = 2q for some q > 2, and the coefficient matrices for some γ > 1/2 − 1/q, 0 ∈ G r (M) and M < p, then one can obtain (16) for q−2r (for more details see discussion preceding Corollary 2.7 in Chen et al. (2013)).
Under condition (Gauss) or (SGauss), we obtain the same rate of convergence as in Bickel and Levina (2008a) for the covariance matrix 0 for normal data and as in Bhattacharjee and Bose (2014b) for k . In particular, if 0 ∈ G r (M) then (1,1) ≤ M, and from Theorem 1 we have If M ∼ p η for some η ∈ [0, 1), then the r.h.s. of (16) for u = u * equals max p which is greater than the r.h.s. of (17), p η n −1 log p α 2(α+1) . Thus, our rate of convergence is better than the rate (16) for u = u * in Chen et al. (2013) for a linear process. Guo et al. (2016) worked with a bounded covariance estimator for l n = C log(n/ log p) for some C > 0 for a sparse multivariate AR model. The rate of convergence for operator norm was O P (log(n/ log p) n −1 log p). In a special case, from our Theorem 1 we obtain the better rate O P (log α (n −1 log p)). Jentsch and Politis (2015) dealt with so called flat-top tapered covariance matrix estimation for a multivariate strictly stationary time series process. In the special case when the observations come from a p dimensional linear process whereˆ k,l is a flat-top tapered covariance matrix estimator of k , where l = o √ n is the banding parameter. If ψ i, j (h) ≤ Ch −(1+α) for some C > 0 and α > 0, where h = ψ i, j (h) , then (A1) holds and under (A2) and (SGauss) we deduce from Theorem 1 that the r.h.s. of (4) is of order O P ((n −1 log p) α/2(α+1) ). Therefore γ i, j (h) = O h −α−1 and for l = n 1/2−κ for some κ ∈ (0, 1/2) we have , which is worse rate than in Theorem 1. Similarly when (NGa β ) and p = O n γ /2 for some γ > 1, then from Theorem 1 the r.h.s. of (4) is of order O P (( p 2/β n −1 log p) α/(α+1) ) and this rate is sharper than the r.h.s. of (19).

Conclusions
Our main result (Theorem 1) was compared with related results available in the literature. In the special case of a multidimensional linear process we obtained a better rate of convergence of our covariance estimator in operator norm than (Chen et al. 2013;Jentsch and Politis 2015) and Guo et al. (2016). Our result is similar to that of Bhattacharjee and Bose (2014b), but under milder assumptions for the noise process (ε t ) and for the admissible class matrices k .
Comparing the results of estimating the covariance matrix is difficult because those results are for different assumptions on the class of covariance matrices and for independent or dependent data. For independent data Cai et al. (2010) obtained the optimal rate for tapering estimators of 0 . In contrast such results do not exist for dependent data and the problem of finding the optimal rate of convergence in Theorem 1 is still open.
Proof By a simple calculation, From (SGauss) and (A3) we conclude that Hence, from the maximal inequality for sub-Gaussian r.v.'s we obtain Hence, we have For fixed i, j, l, s the sequence (ζ l,s m,i, j ) m is i.i.d. and using the moment bound ( Petrov 1995) and (NGa β ), we obtain Hence, we get (23), as desired.

Remark C From Lemma A.3 (Bickel and Levina 2008b
) under the assumption that (ε t ) is Gaussian and (A5), we may deduce that for some δ > 0 and any |η| ≤ δ, for some constants C * 1 , C * 2 > 0. Reasoning as in the proof of Lemma 1, we may obtain (6) for c n = n −1 log p.
Proof of Theorem 1 From the inequality in (Bhattacharjee and Bose (2014b), p. 280), we find that From Lemma 3, we have for any l n → ∞ as n → ∞. From Lemma 1 (see also Remark C when (Gauss) holds), we get for c n as in Lemma 1. Consequently, due to (25) Proof of Corollary 1 Reasoning as in (Bhattacharjee and Bose (2014b), Section 6.2), we Put A = k and B = B l n (ˆ k ). Since −1 k 2 = O 1 and from Theorem 1, B l n (ˆ k ) − k 2 = O P a n as a n → 0, it follows that for large n we have A −1 A − B ≤ 1. Therefore for some C > 0 and large n, we find that Hence, directly from (4), we have (5). (25), we have for any η > 0, P B l n (ˆ k ) − k 2 > η ≤ P (2l n + 1) ˆ k − k ∞ + T k , l n > η .
By a similar reasoning, we obtain T ( j , t) ≤ C 2 jδ j t −α . Hence, by induction the proof of (12) is complete.

Proof
Clearly for j = 1 inequality (28) holds. Now, we assume that (28) is true for some j. From the induction assumption and Lemma 2 (ii), By induction, we see that (28) holds for all positive integers j.