Precision matrix estimation using penalized Generalized Sylvester matrix equation

Estimating a precision matrix is an important problem in several research fields when dealing with large-scale data. Under high-dimensional settings, one of the most popular approaches is optimizing a Lasso or $\ell_1$ norm penalized objective loss function. This penalization promotes sparsity in the estimated matrix and improves accuracy under a proper calibration of the penalty parameter. In this paper, we demonstrate that the problem of minimizing the Lasso penalized D-trace loss can be seen as solving a penalized Sylvester matrix equation. Motivated by this observation, we propose estimating the precision matrix using penalized generalized Sylvester matrix equations. In our method, we develop a particular estimating equation and a new convex loss function constructed through this equation, which we call the generalized D-trace loss. We assess the performance of the proposed method using detailed numerical analysis, including simulated and real data. Extensive results show the advantage of the proposed method compared to other estimation approaches in the literature.


Introduction
The precision or inverse covariance matrix plays an important role in statistical learning and data analysis. Its applications span different research fields, including genetics, brain studies, finance and psychology. Moreover, under high-dimensional settings, the accurate estimation of a precision matrix is crucial for several statistical methodologies, including classification and forecasting, among others.
The precision matrix captures the partial correlations among normally distributed variables. Under the assumption of multivariate normality, a zero entry $\omega_{ij} = 0$ of the precision matrix $\Omega = [\omega_{ij}]_{1 \le i,j \le p} \in \mathbb{R}^{p \times p}$ indicates the conditional independence between the variables $X_i$ and $X_j$, given all the other variables, i.e., $X_i \perp\!\!\!\perp X_j \mid X_{-(i,j)}$ (Dempster 1972). In this way, the precision matrix is closely related to Gaussian graphical models (GGMs), which are a useful technique to visualize the conditional independence structure of the variables (e.g., gene interaction networks). The GGM is an undirected graph $G = (N, E)$, where the set of nodes $N = \{1, \ldots, p\}$ represents the variables. The set of edges $E \subseteq N \times N$ consists of the index pairs $(i, j)$ corresponding to the "active" entries $\omega_{ij} \neq 0$ for $1 \le i, j \le p$ (Lauritzen 1996).
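The link between the entries of $\Omega$ and the GGM can be made concrete in a few lines of code. The sketch below is a NumPy illustration (not part of the original paper; the helper names are ours): it extracts the edge set and the implied partial correlations $\rho_{ij} = -\omega_{ij}/\sqrt{\omega_{ii}\,\omega_{jj}}$ from a given precision matrix.

```python
import numpy as np

def partial_correlations(Omega):
    """Partial correlation matrix implied by a precision matrix Omega:
    rho_ij = -omega_ij / sqrt(omega_ii * omega_jj), with unit diagonal."""
    d = np.sqrt(np.diag(Omega))
    R = -Omega / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R

def ggm_edges(Omega, tol=1e-10):
    """Edge set of the GGM: pairs (i, j), i < j, with omega_ij != 0."""
    p = Omega.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(Omega[i, j]) > tol]

Omega = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
edges = ggm_edges(Omega)          # only X1 -- X2 are conditionally dependent
R = partial_correlations(Omega)
```

Here the single edge $(1, 2)$ reflects the only nonzero off-diagonal entry of $\Omega$.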
The estimation of large-scale precision matrices and the corresponding GGMs has received substantial attention in the literature. Under high-dimensional settings, one of the most commonly employed approaches is the Lasso (least absolute shrinkage and selection operator) or $\ell_1$ norm penalization (Tibshirani 1996) of a certain loss function, which induces sparsity in the estimated precision matrix. In particular, Banerjee et al. (2006) introduce the $\ell_1$ norm penalized log-likelihood maximization approach, also known as the graphical Lasso or GLASSO estimator (see also Yuan and Lin 2007; Friedman et al. 2008; Scheinberg et al. 2010; Rothman et al. 2008; Ravikumar et al. 2011; Hsieh et al. 2014, for theoretical analyses and different solving algorithms for this estimator). Other methods that employ the $\ell_1$ norm penalization approach include neighborhood selection for selecting the GGMs (Meinshausen and Bühlmann 2006), sparse partial correlation estimation or SPACE (Peng et al. 2009), constrained $\ell_1$ norm minimization for inverse matrix estimation or CLIME (Cai et al. 2011), $\ell_1$ norm penalized D-trace loss minimization (Zhang and Zou 2014), and the sparse column-wise inverse operator or SCIO (Liu and Luo 2015), among several others. Despite the popularity of the $\ell_1$ norm penalty, the literature also considers other penalties, such as the Adaptive $\ell_1$ norm (Fan et al. 2009; Avagyan et al. 2018), SCAD (Fan et al. 2009), Elastic-Net (Ryali et al. 2012), Ridge (i.e., squared Frobenius or $\ell_2$ norm) (van Wieringen and Peeters 2016) and Generalized Ridge penalties (van Wieringen 2019). Note that the methods based on the penalization framework require a proper selection (i.e., calibration) of the tuning parameter that controls the strength (i.e., intensity) of the employed penalty. This can be done empirically through several techniques, such as penalized goodness-of-fit criteria (e.g., the Bayesian Information Criterion) and cross-validation methods.
In this paper, we employ both techniques for selecting the penalty parameters of the considered methods. For a further review of methods to estimate precision matrices, we refer to Fan et al. (2016), among others.
In line with the literature mentioned above, we propose a precision matrix estimation method using $\ell_1$ norm penalized convex minimization. Our introduced method is motivated by the D-trace framework of Zhang and Zou (2014). First, we show that the D-trace loss function is closely related to the Sylvester matrix equation. In other words, minimizing a penalized D-trace loss function is equivalent to solving a penalized Sylvester equation. We note that the Sylvester equation is a particular case of the matrix equation family called generalized Sylvester equations. Next, we discuss the estimation of the precision matrix through a particular penalized generalized Sylvester equation. Furthermore, we construct a loss function based on the selected penalized equation. We call this function the generalized D-trace loss. Using extensive numerical analysis, we show that the estimated precision matrix obtained by solving the introduced penalized generalized Sylvester equation (i.e., minimizing the $\ell_1$ norm penalized generalized D-trace loss) provides more favorable performance in finite samples than the existing methods. We evaluate the performances of the considered estimators in terms of several statistical measures for different models (i.e., patterns) of the true precision matrix $\Omega$.
The manuscript is organized as follows. In Sect. 2, we describe the proposed methodology. In Sect. 3, we evaluate the statistical performance of the proposed methodology and compare it with that of other approaches. In Sect. 4, we provide a real data application: classification of breast cancer patients using linear discriminant analysis. We provide our conclusions in Sect. 5. Finally, we provide technical details in "Appendix A" and the required solving algorithm in "Appendix B."

Proposed methodology
We use the following notation throughout the paper. For any $p$-dimensional vector $a = (a_1, \ldots, a_p)^T \in \mathbb{R}^p$, we define the $\ell_2$ norm by $\|a\|_2 = \sqrt{\sum_{j=1}^p a_j^2}$. For any symmetric matrix $A = [a_{ij}]_{1 \le i,j \le p} \in \mathbb{R}^{p \times p}$, we denote the Frobenius norm by $\|A\|_F = \sqrt{\sum_{i=1}^p \sum_{j=1}^p a_{ij}^2}$, the $\ell_\infty$ norm by $\|A\|_\infty = \max_{1 \le i,j \le p} |a_{ij}|$, the matrix $\ell_1$ norm by $\|A\|_{L_1} = \max_{1 \le j \le p} \sum_{i=1}^p |a_{ij}|$, the componentwise $\ell_1$ norm by $\|A\|_1 = \sum_{i=1}^p \sum_{j=1}^p |a_{ij}|$, the off-diagonal $\ell_1$ norm by $\|A\|_{1,\mathrm{off}} = \sum_{i=1}^p \sum_{j=1, j \neq i}^p |a_{ij}|$, and the spectral norm by $\|A\|_{sp} = \sup_{\|x\|_2 \le 1} \|Ax\|_2$. Furthermore, we assume that $X_{n \times p}$ is a mean-centered observed sample data matrix, where each row $X_i = (X_{i1}, \ldots, X_{ip})$ is a $p$-dimensional normal random vector, i.i.d. for $i = 1, \ldots, n$, with covariance matrix $\Sigma = \Omega^{-1}$.
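For reference, these norms map directly onto standard numerical primitives. The following NumPy sketch (an illustration; function and variable names are ours) mirrors the definitions above.

```python
import numpy as np

def matrix_norms(A):
    """The matrix norms defined above, for a symmetric matrix A."""
    return {
        "frobenius": np.sqrt((A ** 2).sum()),
        "max":       np.abs(A).max(),                      # ell_infinity norm
        "matrix_l1": np.abs(A).sum(axis=0).max(),          # max absolute column sum
        "l1":        np.abs(A).sum(),                      # componentwise ell_1
        "l1_off":    np.abs(A).sum() - np.abs(np.diag(A)).sum(),
        "spectral":  np.linalg.norm(A, 2),                 # largest singular value
    }

A = np.array([[2.0, -1.0],
              [-1.0, 3.0]])
norms = matrix_norms(A)
```

For a symmetric matrix, the spectral norm equals the largest absolute eigenvalue, which is what `np.linalg.norm(A, 2)` returns.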
Many studies focus on the estimation of a precision matrix under high-dimensional settings. Among the proposed approaches, graphical Lasso (or GLASSO) is one of the most popular and well-studied estimators. This estimator is obtained by minimizing the $\ell_1$ norm penalized negative log-likelihood function of a multivariate normal distribution (Banerjee et al. 2006; Friedman et al. 2008):
$$\widehat{\Omega}_{GLasso} = \arg\min_{\Omega \succ 0} \; \mathrm{trace}(S\Omega) - \log\det(\Omega) + \nu \|\Omega\|_{1,\mathrm{off}}, \quad (1)$$
where $S = \frac{1}{n}\sum_{i=1}^n X_i^T X_i$ is the sample covariance matrix and $\nu > 0$ is the associated tuning (or penalty) parameter that controls the accuracy and the sparsity of the precision matrix estimator. As an alternative to the log-likelihood function in (1), Zhang and Zou (2014) introduce a new loss function called D-trace:
$$f_{DT}(\Omega, S) = \frac{1}{2}\,\mathrm{trace}(\Omega^2 S) - \mathrm{trace}(\Omega), \quad (2)$$
which is a quadratic function of the precision matrix. Furthermore, the authors propose a precision matrix estimation method based on minimizing the $\ell_1$ norm penalized D-trace loss function (hereafter, DT):
$$\widehat{\Omega}_{DT} = \arg\min_{\Omega \succeq \epsilon I} \; \frac{1}{2}\,\mathrm{trace}(\Omega^2 S) - \mathrm{trace}(\Omega) + \tau \|\Omega\|_{1,\mathrm{off}}, \quad (3)$$
where $\tau > 0$ is the associated penalty parameter. The constraint $\Omega \succeq \epsilon I$ guarantees that the solution of (3) is positive definite, where $\epsilon > 0$ is a small positive value (we set $\epsilon = 10^{-8}$ in the numerical analyses). This problem can be easily solved through an algorithm developed by the same authors, which is based on the alternating direction method (see also Wang and Jiang 2020, for a similar solving algorithm).
According to the first-order condition, the solution of (3) satisfies the following penalized matrix equation:
$$\frac{1}{2}(S\Omega + \Omega S) - I + \mathrm{Pen}_\tau(\Omega) = 0. \quad (4)$$
Here, $\mathrm{Pen}_\tau(\Omega)$ is the penalty term of the estimating equation and is defined as $\mathrm{Pen}_\tau(\Omega) = \tau Z(\Omega)$, where $Z(\Omega)$ is a subgradient of $\|\Omega\|_{1,\mathrm{off}}$, i.e., $Z(\Omega)_{ij} = \mathrm{sign}(\omega_{ij})$ for $\omega_{ij} \neq 0$ with $i \neq j$, $Z(\Omega)_{ij} \in [-1, 1]$ for $\omega_{ij} = 0$ with $i \neq j$, and $Z(\Omega)_{ii} = 0$. Note that if there is no penalty, i.e., $\tau = 0$, (4) is known as the Sylvester equation. This is a matrix equation with the following definition:
$$R\Omega + \Omega L + C = 0, \quad (5)$$
where $R$, $L$ and $C$ are known matrices and $\Omega$ is the unknown. In this way, DT estimation can be seen as solving a Sylvester equation (using $R = L = \frac{1}{2}S$ and $C = -I$) with an imposed $\mathrm{Pen}_\tau(\Omega)$ penalty.
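The unpenalized case ($\tau = 0$) can be checked numerically: solving the Sylvester equation with $R = L = \frac{1}{2}S$ recovers $S^{-1}$ whenever $S$ is invertible. Below is a minimal sketch using SciPy's general-purpose Sylvester solver; this is an illustration only (the paper's algorithm does not rely on this routine).

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(0)
p = 5
G = rng.standard_normal((p, p))
S = G @ G.T + p * np.eye(p)      # a well-conditioned covariance-like matrix

# Unpenalized D-trace first-order condition: (1/2)(S*Omega + Omega*S) = I,
# i.e., a Sylvester equation A X + X B = Q with A = B = S/2 and Q = I.
Omega = solve_sylvester(S / 2, S / 2, np.eye(p))
```

Since $S/2$ is positive definite, the spectra of $S/2$ and $-S/2$ are disjoint, so the Sylvester equation has a unique solution, and it coincides with $S^{-1}$.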
The Sylvester equation (5) is closely related to the matrix equation family known as generalized Sylvester equations, which are defined as
$$\sum_{i=1}^k L_i \Omega R_i + C = 0, \quad (6)$$
where $L_i$, $R_i$ ($i = 1, \ldots, k$) and $C$ are known matrices (see, for instance, Li et al. 2010; De Terán and Iannazzo 2016). Here, we assume that these matrices guarantee that the resulting estimator of $\Omega$ is symmetric. Note that the classical Sylvester equation (5) is a special case of (6). Motivated by the DT estimator (4), we focus on estimating the precision matrices through penalized generalized Sylvester equations. In other words, we introduce an additional penalty term in (6), under the assumption of positive definiteness of the estimated precision matrix:
$$\sum_{i=1}^k L_i \Omega R_i + C + \mathrm{Pen}_\lambda(\Omega) = 0. \quad (7)$$
Note that different choices of those matrices will lead to different equations. In this paper, we restrict ourselves to one special case, by choosing the $L_i$ and $R_i$ proportional to $S$, $I$ and $S^{1/2}$, with $C = -I$. We demonstrate the estimation of the precision matrix using the following penalized matrix equation:
$$\frac{1}{4}(S\Omega + \Omega S) + \frac{1}{2} S^{1/2} \Omega S^{1/2} - I + \lambda Z(\Omega) = 0, \quad (8)$$
where $\lambda > 0$ is the penalty parameter and $\Omega$ is required to be positive definite. The term $Z(\Omega)$ is defined earlier. In contrast to (4), the equation (8) has an extra weighted component $S^{1/2}\Omega S^{1/2}$, which imposes additional balanced constraints on the estimand. We hypothesize that this may provide additional benefits on the estimated precision matrix. On the other hand, similar to (4), when $\lambda = 0$, the solution of (8) is $\Omega = S^{-1}$ (provided $S$ is invertible). Notice that the solution of the penalized matrix equation (8) is the minimizer of the following optimization problem:
$$\widehat{\Omega}_{GDT} = \arg\min_{\Omega \succeq \epsilon I} \; \frac{1}{4}\,\mathrm{trace}(\Omega S \Omega) + \frac{1}{4}\,\mathrm{trace}(\Omega S^{1/2} \Omega S^{1/2}) - \mathrm{trace}(\Omega) + \lambda \|\Omega\|_{1,\mathrm{off}}. \quad (9)$$
Again, the constraint $\Omega \succeq \epsilon I$ is added to guarantee the positive definiteness of $\widehat{\Omega}$. We call the proposed precision matrix estimator the generalized D-trace estimator $\widehat{\Omega}_{GDT}$. Correspondingly, we define a new loss function as
$$f_{GDT}(\Omega, S) = \frac{1}{4}\,\mathrm{trace}(\Omega S \Omega) + \frac{1}{4}\,\mathrm{trace}(\Omega S^{1/2} \Omega S^{1/2}) - \mathrm{trace}(\Omega),$$
which we call the generalized D-trace loss function. Notice that because of the additional term $\mathrm{trace}(\Omega S^{1/2} \Omega S^{1/2})$, the $f_{GDT}$ function is no longer quadratic in $\Omega$ in the same simple form as (2) (but rather "quasi-quadratic").
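As a sanity check on the loss just defined, the gradient of $f_{GDT}$ evaluated at the true covariance should vanish at $\Omega = \Sigma^{-1}$. The sketch below is a numerical illustration (assuming the coefficient convention of Eq. (8); `gdt_gradient` is our own illustrative helper, not code from the paper).

```python
import numpy as np
from scipy.linalg import sqrtm

def gdt_gradient(Omega, S):
    """Gradient of the generalized D-trace loss:
    (1/4)(S*Omega + Omega*S) + (1/2)*S^{1/2}*Omega*S^{1/2} - I."""
    Sh = sqrtm(S).real               # matrix square root of a PD matrix
    p = S.shape[0]
    return 0.25 * (S @ Omega + Omega @ S) + 0.5 * (Sh @ Omega @ Sh) - np.eye(p)

rng = np.random.default_rng(1)
p = 4
G = rng.standard_normal((p, p))
Sigma = G @ G.T + p * np.eye(p)      # a positive definite "true" covariance
grad_at_inverse = gdt_gradient(np.linalg.inv(Sigma), Sigma)
```

Indeed, $\frac{1}{4}(\Sigma\Sigma^{-1} + \Sigma^{-1}\Sigma) + \frac{1}{2}\Sigma^{1/2}\Sigma^{-1}\Sigma^{1/2} = \frac{1}{2}I + \frac{1}{2}I = I$, so the gradient is zero at $\Sigma^{-1}$.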
Nevertheless, $f_{GDT}(\Omega, \Sigma)$ is a convex function of $\Omega$ and has a unique minimizer at $\Sigma^{-1}$ (see "Appendix A" for more details on the proposed loss function). Finally, in order to solve the optimization problem (9), we use an algorithm based on the alternating direction method, similar to the DT method (see "Appendix B" for a detailed description of the algorithm).
The following remarks are in order. First, this paper does not aim at discussing the dominance of one loss function over another, or which loss function should be used under different circumstances. Rather, in this article, we discuss the advantages of employing a penalized generalized Sylvester equation for estimating precision matrices by focusing on the particular case given in (8). As mentioned earlier, different generalized Sylvester equations can be introduced based on different choices of $L_i$, $R_i$ and $C$ in (6). Notice that these matrices should be selected properly in order to guarantee the symmetry of the estimating function (i.e., the left side of Eq. (7)) and the corresponding precision matrix estimate. For example, in (8), the symmetry of the estimating function is ensured. We leave the selection of those matrices for the optimal estimating equation (7) for future research. Second, in this article, we focus only on the $\ell_1$ norm penalized generalized Sylvester equation. We note that, for instance, the adaptive (i.e., weighted) $\ell_1$ norm penalization can also be employed in (8) (see, for instance, Avagyan et al. 2018, for an Adaptive $\ell_1$ norm penalized D-trace loss minimization approach).

Simulation analysis
We demonstrate the numerical performance of the proposed approach based on simulated data generated using different models for the true precision matrix $\Omega$, in terms of several statistical measures. In our study, we consider the most popular techniques for selecting the penalty parameters of the considered methods. We compare $\widehat{\Omega}_{GDT}$ with the D-trace estimator $\widehat{\Omega}_{DT}$ (Zhang and Zou 2014), the graphical Lasso estimator $\widehat{\Omega}_{GLasso}$ (Banerjee et al. 2006) and the CLIME estimator $\widehat{\Omega}_{CLIME}$ (Cai et al. 2011).

Performance evaluation
We generate multivariate normal random samples with zero mean and covariance matrix $\Sigma = \Omega^{-1}$ for each model over 100 replications. We evaluate the quality of a precision matrix estimator based on popular statistical losses and sparsity pattern prediction measures, previously considered in the literature. In particular, we consider the Kullback-Leibler loss (KLL) or entropy loss (see, for instance, Yuan 2010; Yin and Li 2013), the reverse Kullback-Leibler loss (RKLL) (see Avagyan 2021), the relative trace error (RTE) and matrix losses (the Frobenius norm, the spectral norm, and the matrix $\ell_1$ norm) (see, for instance, Cai et al. 2011; Zhang and Zou 2014; van Wieringen and Peeters 2016), defined as
$$\mathrm{KLL} = \mathrm{trace}(\Sigma \widehat{\Omega}) - \log\det(\Sigma \widehat{\Omega}) - p, \qquad \mathrm{RKLL} = \mathrm{trace}(\widehat{\Omega}^{-1} \Omega) - \log\det(\widehat{\Omega}^{-1} \Omega) - p,$$
$$\mathrm{RTE} = \frac{|\mathrm{trace}(\widehat{\Omega}\Sigma) - p|}{p}, \qquad \|\widehat{\Omega} - \Omega\|_F, \qquad \|\widehat{\Omega} - \Omega\|_{sp}, \qquad \|\widehat{\Omega} - \Omega\|_{L_1}.$$
For the sparsity pattern selection, we consider the Specificity, the Sensitivity and the Matthews correlation coefficient (MCC):
$$\mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \qquad \mathrm{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$
$$\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}.$$
Here, TP is the number of true positives (i.e., correctly selected nonzero entries), TN is the number of true negatives (i.e., correctly selected zero entries), FP is the number of false positives (i.e., incorrectly selected nonzero entries), and FN is the number of false negatives (i.e., incorrectly selected zero entries). The MCC is a popular approach in statistics for measuring the quality of binary classifications: the closer the MCC is to one, the better the overall classification (see Chicco and Jurman 2020, for more details).
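The sparsity-pattern measures are straightforward to compute from the supports of the estimated and true matrices. The NumPy sketch below is an illustration (helper names are ours; entries below a small tolerance are treated as zero, and only off-diagonal entries are counted).

```python
import numpy as np

def support_measures(Omega_hat, Omega_true, tol=1e-8):
    """Sensitivity, Specificity and MCC of the off-diagonal support recovery."""
    p = Omega_true.shape[0]
    off = ~np.eye(p, dtype=bool)                  # off-diagonal mask
    est = np.abs(Omega_hat[off]) > tol
    tru = np.abs(Omega_true[off]) > tol
    tp = np.sum(est & tru)
    tn = np.sum(~est & ~tru)
    fp = np.sum(est & ~tru)
    fn = np.sum(~est & tru)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sens, spec, mcc
```

For example, an estimate that recovers every true edge but adds one spurious edge has Sensitivity 1 and Specificity below 1.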
In this paper, we focus on both groups of measures, i.e., statistical prediction and sparsity pattern selection. Usually, improving one group of measures may deteriorate the other. Therefore, we aim at achieving a desirable performance in terms of both criteria, instead of focusing on either one. Moreover, this performance should remain consistent over different settings.

Simulation study 1
Our first simulation study is based on the following models. • Model 3 A block-diagonal matrix, with four equally sized blocks along the diagonal. Each block is defined as $\omega_{ij} = 0.6^{|i-j|}$ (previously considered in Cai et al. 2011; Fan et al. 2009, etc.). • Model 4 A random positive definite matrix, with approximately 50% of nonzero entries, generated using the MATLAB command sprandsym. The matrix is further standardized to have unit diagonal.
We set n = 200, p = 200, 400. For this study, we select the penalty parameters for the considered methods using Bayesian Information Criterion (BIC) (Yuan and Lin 2007).
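As an illustration of BIC-based calibration, one common form of the criterion for precision matrix estimates is a Gaussian fit term plus $\log(n)$ times the number of free nonzero parameters. The exact convention varies across implementations, so the sketch below is an assumption-laden illustration rather than the paper's definition.

```python
import numpy as np

def bic_score(Omega_hat, S, n, tol=1e-8):
    """One common BIC variant for a precision matrix estimate:
    n * [trace(S @ Omega) - log det(Omega)] + log(n) * (# nonzero upper-tri entries).
    The penalty-count convention is an assumption; implementations differ."""
    p = Omega_hat.shape[0]
    sign, logdet = np.linalg.slogdet(Omega_hat)
    fit = n * (np.trace(S @ Omega_hat) - logdet)
    k = np.sum(np.abs(Omega_hat[np.triu_indices(p)]) > tol)
    return fit + np.log(n) * k
```

In practice, one would evaluate `bic_score` over a grid of penalty parameters and keep the estimate with the smallest score.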

Simulation study 2
Our second study is based on the following models previously considered in Zhang and Zou (2014).
In line with Zhang and Zou (2014), we set n = 400, p = 500 for models 5 and 6 and p = 484 for model 7. For this study, we select the penalty parameters using fivefold cross-validation technique, consistent with the study of Zhang and Zou (2014).

Discussion of results
Tables 1, 2, 3, 4, 5, 6, and 7 report the averages (and standard deviations) of the measures over 100 replications. Bold letters indicate the best performance. First, we observe that GDT outperforms DT in terms of KLL, RKLL, RTE, the Frobenius norm and the spectral norm for all models; in terms of the matrix $\ell_1$ norm for models 1, 2, 3, 4; in terms of Specificity for models 1, 2 (when p = 400), 3 (when p = 200), 4, 5, 6, 7; in terms of Sensitivity for models 3, 4, 5; and in terms of MCC for models 1, 2 (when p = 400), 3, 4, 5, 6, 7. Both GDT and DT provide similar Sensitivity for models 1 and 2. On the other hand, the DT method performs better than GDT in terms of the matrix $\ell_1$ norm for models 5, 6, 7; in terms of Specificity for models 2 (when p = 200) and 3 (when p = 400); in terms of Sensitivity for models 6, 7; and in terms of MCC for model 2 (when p = 200). Comparing our proposed method with the GLASSO and CLIME methods, we see that, in general, GDT provides better results, especially in terms of the statistical losses. However, GLASSO provides the best overall results in terms of the KLL for models 6, 7; in terms of Specificity for models 3 (when p = 200) and 4; in terms of Sensitivity for models 4, 5, 6, 7; and in terms of MCC for model 4. In addition, we observe that the CLIME estimator provides the best overall results in terms of Specificity for models 2 (when p = 400), 4, 5, 6, 7; in terms of Sensitivity for model 3; and in terms of MCC for models 2 (when p = 400), 5, 6, 7. Note that these results are not surprising because GLASSO generally provides denser solutions (thus, higher sensitivity but lower specificity), whereas CLIME provides sparser solutions (thus, higher specificity but lower sensitivity). This can be observed in the recorded numbers of nonzero entries of these estimators applied to a real dataset (see Fig. 1).
Moreover, the results indicate that the good performance of GLASSO and CLIME in terms of GGM selection usually comes at the cost of deteriorated matrix prediction performance (i.e., statistical losses). In sum, the proposed GDT estimator generally provides better performance than the DT, GLASSO and CLIME methods for most of the models in terms of most statistical losses and GGM prediction measures. Moreover, GDT shows a better trade-off between matrix prediction and sparsity pattern identification, i.e., outperformance in terms of one criterion does not diminish the other.
In addition, we conduct a study with smaller off-diagonal nonzero entries in models 1-3 (not provided). The results support our discussion above (i.e., the comparison of the considered methods remains roughly the same) and show the robustness of our simulation study.
In "Appendix A," we briefly discuss the required theoretical assumption for the model selection, called the irrepresentability condition. Although we do not discuss theoretical properties of the considered method, we provide a simple numerical example (previously used in the literature) where the irrepresentability condition holds for the GDT estimator, but it fails for the DT estimator.

Real data analysis
In this section, we demonstrate the performance of the considered methods in an empirical analysis. We focus on classifying breast cancer patients with pathological complete response (PCR) and patients with residual disease (RD). This is an important problem because the PCR condition after treatment (e.g., chemotherapy) may potentially lead to a cancer-free life (Kuerer et al. 1999). We use a dataset (available at http://bioinformatics.mdanderson.org/pubdata.html) which contains 22,283 gene expression levels of 133 patients (subjects) with breast cancer. Among these, 34 patients have PCR and 99 patients have RD.
Following Cai et al. (2011), we randomly divide the data into a training set and a testing set with sizes 112 and 21, respectively. The testing set consists of five subjects with PCR and 16 subjects with RD. The training set contains the remaining subjects. Next, we apply a two-sample t test between the two groups using the training set, and we select the 150 most significant genes, i.e., those with the smallest p-values. Using only the selected genes, we estimate the precision matrix from the training set. For the sake of computational time, the penalty parameters are selected using the BIC technique. Finally, the estimated precision matrix is used in the linear discriminant score
$$\delta_t(Y) = Y^T \widehat{\Omega}\, \hat{\mu}_t - \frac{1}{2} \hat{\mu}_t^T \widehat{\Omega}\, \hat{\mu}_t + \log \hat{\pi}_t,$$
with $t = 1$ for PCR and $t = 2$ for RD, where $\hat{\pi}_t$ is the proportion of class $t$ in the training set.
Here, $\hat{\mu}_t = \frac{1}{n_t} \sum_{i \in \text{class } t} x_i$ is the within-group average calculated using the training set. We use $\delta_t(Y)$ to classify the subject $Y$ from the testing set. The classification rule is $\hat{t} = \arg\max_t \delta_t(Y)$. We repeat this process 100 times. In order to measure the overall classification accuracy, we use the MCC defined earlier in Sect. 3.1. We consider TP and TN as the numbers of correctly predicted PCR and RD, respectively, and FP and FN as the numbers of wrongly predicted PCR and RD, respectively. Our calculations show that the average MCC over 100 replications obtained using GDT is 0.445, whereas for DT, GLASSO, and CLIME, the average MCC values are 0.422, 0.338 and 0.406, respectively. This indicates that GDT provides a better overall classification of the subjects with the PCR condition compared to the other considered methods. In addition, we record the number of nonzero entries of the estimated precision matrix over 100 replications. Figure 1 shows that CLIME produces the sparsest precision matrix. On the other hand, our proposed GDT approach produces fewer nonzero entries on average than the DT and GLASSO methods.
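The classification step can be sketched as follows. This is an illustration assuming the standard LDA score with estimated class proportions as priors; the helper name is ours.

```python
import numpy as np

def lda_classify(Y, Omega_hat, means, priors):
    """Assign Y to the class with the largest linear discriminant score
    delta_t(Y) = Y' Omega mu_t - 0.5 * mu_t' Omega mu_t + log(pi_t)."""
    scores = [Y @ Omega_hat @ mu - 0.5 * mu @ Omega_hat @ mu + np.log(pi)
              for mu, pi in zip(means, priors)]
    return int(np.argmax(scores))
```

For instance, with identity $\widehat{\Omega}$, class means $(-1, 0)$ and $(1, 0)$, and equal priors, an observation near $(1, 0)$ is assigned to the second class.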

Conclusions
The current research presents a new method for estimating high-dimensional precision matrices. The proposed method is based on the $\ell_1$ norm penalization of a new generalized D-trace loss function. Our introduced loss function is motivated by the D-trace loss. We show that the D-trace loss function is based on the Sylvester equation, whereas our proposed loss function is constructed through generalized Sylvester equations. In this article, we restrict ourselves to only one particular penalized equation, although various other versions can be proposed. Selecting the optimal equation and the corresponding loss function is an open question; we consider this one of the future research directions. We provide an extensive numerical analysis using simulated data. Our study is based on several statistical measures and settings, including those previously used in the literature. The proposed method performs favorably compared to other estimators in the literature. Moreover, we demonstrate a better performance of our proposed method in an empirical application of breast cancer patient classification using linear discriminant analysis.

Appendix A: Technical details
In this appendix, we confirm the properties of the GDT loss function. First, we show that $f_{GDT}(\Omega, \Sigma)$ is a convex function of $\Omega$. For any symmetric direction $\Delta$, the second-order directional derivative of $f_{GDT}$ is
$$\frac{d^2}{dt^2} f_{GDT}(\Omega + t\Delta, \Sigma) = \frac{1}{2}\,\mathrm{trace}(\Delta \Sigma \Delta) + \frac{1}{2}\,\mathrm{trace}(\Delta \Sigma^{1/2} \Delta \Sigma^{1/2}),$$
which is nonnegative, given that for any symmetric matrix $A$ and a positive definite matrix $B \succ 0$, the product $ABA$ is positive semidefinite. This confirms the convexity of $f_{GDT}$. Next, we show that $f_{GDT}$ has a unique minimizer at $\Sigma^{-1}$. We check the sign of the Hessian matrix of $f_{GDT}$:
$$\nabla^2 f_{GDT} = \frac{1}{4}\left(\Sigma \otimes I + I \otimes \Sigma\right) + \frac{1}{2}\left(\Sigma^{1/2} \otimes \Sigma^{1/2}\right),$$
where $\otimes$ is the Kronecker product. For $\Sigma \succ 0$, the Hessian matrix is always positive definite. Finally, we have
$$\nabla f_{GDT}(\Omega, \Sigma) = \frac{1}{4}(\Sigma\Omega + \Omega\Sigma) + \frac{1}{2}\Sigma^{1/2}\Omega\Sigma^{1/2} - I.$$
By setting the first-order derivative to 0, we can see that the minimum of $f_{GDT}$ occurs at $\Omega = \Sigma^{-1}$. Note that the Hessian matrix corresponding to the DT loss function (3) is defined as
$$\nabla^2 f_{DT} = \frac{1}{2}\left(\Sigma \otimes I + I \otimes \Sigma\right).$$
In general, the Hessian matrix plays an important role for the model selection consistency. More specifically, it is assumed that
$$\max_{e \in S^c} \left\| \Gamma_{e,S} \left(\Gamma_{S,S}\right)^{-1} \right\|_1 < 1,$$
which is known as the irrepresentability condition. Here, we denote by $\Gamma$ the value of the Hessian matrix at the true (unknown) covariance matrix (e.g., in case of the DT estimator, $\Gamma = \nabla^2 f_{DT}(\Sigma)$). Next, $S = \{(i, j) \mid \omega_{ij} \neq 0\}$ is the set of nonzero entries (i.e., the support) and $S^c$ is its complement. We define $\Gamma_{S_1, S_2}$ as a sub-matrix of $\Gamma \in \mathbb{R}^{p^2 \times p^2}$ with rows and columns indexed by the subsets $S_1, S_2 \subseteq \{1, \ldots, p\} \times \{1, \ldots, p\}$.
In this article, we do not provide theoretical properties for our proposed estimator. However, we suppose that, in order to establish the model selection consistency of the GDT estimator, a similar irrepresentability condition would be required based on $\nabla^2 f_{GDT}(\Sigma)$. In general, irrepresentability conditions are difficult to compare theoretically, i.e., to show that the condition for one method is always weaker or stronger than that for another method for a certain class of precision matrices. This remains an open question, and we plan to study this problem in a separate paper. Nevertheless, we compare the irrepresentability conditions of the DT and GDT estimators on a simple example, previously used by Ravikumar et al. (2011) and Zhang and Zou (2014). Consider $\Omega = [\omega_{ij}]_{1 \le i,j \le 4} \in \mathbb{R}^{4 \times 4}$, with $\omega_{11} = 1$, $\omega_{14} = \omega_{41} = 2c^2$, $\omega_{23} = \omega_{32} = 0$ and $\omega_{ij} = c$ otherwise, where we assume that the values of $c$ guarantee the positive definiteness of the matrix. It can be checked that for the DT estimator, the irrepresentability condition holds for $|c| \le 0.315$. On the other hand, the irrepresentability condition based on $\Gamma = \nabla^2 f_{GDT}(\Sigma)$ holds for $|c| \le 0.335$, which shows that for $c \in (0.315, 0.335]$ the irrepresentability condition holds for the GDT estimator but fails for the DT estimator.

Appendix B: Solving algorithm

The subproblem (B.4) takes the form of a linear matrix equation in $\Omega$ with $A = S + 4I$ and $B = S^{1/2}$, where the right-hand side $C$ is formed from the identity and the current primal and dual iterates. Let $A = U V U^T$ and $B = U W U^T$ be the eigendecompositions of these matrices (it is easy to see that $A$ and $B$ have the same eigenvectors). Assume that the eigenvalues are ordered: $V_1 \ge \cdots \ge V_p$ and $W_1 \ge \cdots \ge W_p$. It can be checked that the solution of (B.4) is given as $\widehat{\Omega} = U\{(U^T B U) \circ C\}U^T$, where $\circ$ denotes the Hadamard product. The solution of (B.6) is given as $\Omega_1^{t+1} = [\Omega^{t+1} + \Lambda_1^t]_+$, where, for any symmetric matrix $A$ with an eigendecomposition $A = U\,\mathrm{diag}(\alpha_1, \ldots, \alpha_p)\,U^T$, the operator $[A]_+$ truncates the eigenvalues from below at $\epsilon$, i.e., $[A]_+ = U\,\mathrm{diag}(\max(\alpha_1, \epsilon), \ldots, \max(\alpha_p, \epsilon))\,U^T$. The iterations are stopped when
$$\frac{\|\Omega^{t+1} - \Omega^t\|_2}{\max\left(1, \|\Omega^t\|_2, \|\Omega^{t+1}\|_2\right)} < 10^{-7}.$$
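The core computational step of such alternating-direction schemes is a linear matrix equation whose coefficient matrices share the eigenvectors of $S$, so it can be solved entrywise in the eigenbasis with a Hadamard division. The sketch below is an illustration under our own assumptions (the $\frac{1}{4}, \frac{1}{2}$ weights from Eq. (8) plus a hypothetical augmented-Lagrangian parameter $\rho$), not the paper's exact update.

```python
import numpy as np

def solve_gdt_subproblem(S, M, rho):
    """Solve (1/4)(S X + X S) + (1/2) S^{1/2} X S^{1/2} + rho * X = M.
    Since S and S^{1/2} share eigenvectors, the equation diagonalizes:
    in the eigenbasis of S, entry (i, j) is scaled by
    0.25*(v_i + v_j) + 0.5*sqrt(v_i * v_j) + rho."""
    v, U = np.linalg.eigh(S)                      # S = U diag(v) U^T, v >= 0
    M_t = U.T @ M @ U                             # transform the right-hand side
    denom = (0.25 * (v[:, None] + v[None, :])
             + 0.5 * np.sqrt(np.outer(v, v)) + rho)
    return U @ (M_t / denom) @ U.T                # Hadamard division, back-transform
```

The eigendecomposition is computed once per call, so each solve costs one $O(p^3)$ factorization plus elementwise operations.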
It is important to note that we can significantly reduce the computational time of the algorithm by discarding the constraint $\Omega \succeq \epsilon I$ in the initial optimization problem. This enables us to drop $\Omega_1$ and omit the problem (B.6). If the reduced algorithm provides a positive definite outcome $\widetilde{\Omega}$ (such that $\widetilde{\Omega} \succeq \epsilon I$), then we stop the calculation and set $\widehat{\Omega} = \widetilde{\Omega}$. Otherwise, we repeat the complete algorithm provided earlier with $\widetilde{\Omega}$ as an initial start.