Sparse Reduced-Rank Regression for Simultaneous Rank and Variable Selection via Manifold Optimization

We consider the problem of constructing a reduced-rank regression model whose coefficient parameter is represented as a singular value decomposition with sparse singular vectors. The traditional estimation procedure for the coefficient parameter often fails when the true rank of the parameter is high. To overcome this issue, we develop an estimation algorithm with rank and variable selection via sparse regularization and manifold optimization, which enables us to obtain an accurate estimation of the coefficient parameter even if the true rank of the coefficient parameter is high. Using sparse regularization, we can also select an optimal value of the rank. We conduct Monte Carlo experiments and real data analysis to illustrate the effectiveness of our proposed method.

In recent years, the number of response and predictor variables has been increasing.
This causes difficulty in the estimating of parameters when the sample size is smaller than the number of the parameters included in the model. One approach for overcoming this problem is to apply a regularization method. During previous decades, sparse regularization methods, such as lasso (Tibshirani, 1996), has been the focus of attention, because  2015)). Co-sparse factor regression (SFAR; Mishra et al. (2017)) was proposed in one such study. SFAR is based on both RRR and a factor analysis model by assuming that the coefficient parameter can be decomposed by singular value decomposition with both a low-rank constraint and sparsity for the singular vectors. For the estimation of parameters, Mishra et al. (2017) proposed the sequential factor extraction via co-sparse unit-rank estimation (SeCURE) algorithm. The SeCURE algorithm sequentially estimates the parameters with orthogonality and sparsity for each factor. However, the SeCURE algorithm fails to estimate the parameters when the number of latent factors is large, because the algorithm is a greedy estimation method based on the classical Gram-Schmidt orthogonalization algorithm and it is well known that the classical method does not guarantee that the optimal solution will be obtained (Björck, 1967).
To overcome this problem, we propose a factor extraction algorithm with rank and variable selection via sparse regularization and manifold optimization (RVSManOpt). Manifold optimization has demonstrated excellent performance over decades of study (Bakır al., 2004;Mishra et al., 2013;Tan et al., 2019). The minimization problem of the SFAR model can be reformulated in terms of manifold optimization. Manifold optimization enables us to solve the minimization problem by taking the geometric structure of the SFAR model into consideration. By estimating the parameters on the manifold, we simultaneously obtain all latent factors. In addition, in order to select the optimal value of the rank, we introduce a regularizer which induces a hard-thresholding operator.
The remainder of the paper is organized as follows. In Section 2, we introduce RRR and derive the SFAR model from the factor regression model. In Section 3, we reformulate the minimization problem of the SFAR model based on manifold optimization. In Section 4, we provide the estimation algorithm based on manifold optimization and discuss the selection of tuning parameters. In Sections 5, Monte Carlo experiments and real data analysis support the efficacy of RVSManOpt. Concluding remarks which summarize our study are presented in Section 6. Supplementary materials and source codes of our proposed method are available at https://github.com/yoshikawa-kohei/RVSManOpt.

Preliminaries
Suppose that we obtain n independent observations {(y i , x i ); i = 1, . . . , n}, where y i = [y i1 , . . . , y iq ] T ∈ R q is a q-dimensional vector of response variables and x i = [x i1 , . . . , x ip ] T ∈ R p is a p-dimensional vector of predictor variables. When we set Y = [y 1 , . . . , y n ] T ∈ R n×q and X = [x 1 , . . . , x n ] T ∈ R n×p , RRR (Anderson, 1951;Izenman, 1975;Reinsel and Velu, 1998) is formulated as where C ∈ R p×q is the coefficient matrix, which has rank at most r = min (rank (X) , q), and E = [e 1 , . . . , e n ] T ∈ R n×q is the error matrix, which consists of independent random error vectors e i with mean E [e i ] = 0 and covariance matrix Cov [e i ] = Σ (i = 1, . . . , n).
The estimator of the coefficient matrix C can be obtained by solving the minimization where · F denotes the Frobenius norm.
Mishra et al. (2017) proposed SFAR by extending RRR in terms of factor analysis.
Before introducing SFAR, we describe the relationship between RRR and factor analysis.
First, we consider the RRR model with a coefficient matrix C that is decomposed as where U ∈ R p×r andṼ ∈ R q×r . Then we obtain the RRR model reformulated by The equation (2.4) is related to a factor analysis model: XU can be regarded as a common factor matrix andṼ can be regarded as a loading matrix. Furthermore, if we assume in turn gives the following SFAR model.
Here, the coefficient matrix is C = UDV T .
The estimator of SFAR is obtained by solving the minimization problem where u ij , v ij are elements of U and V, respectively, w ij are adaptive weights with positive values proposed by Zou (2006), and λ 1 , λ 2 > 0 are regularization parameters. The second and third terms are penalty functions inducing elementwise sparsity (Tibshirani, 1996). By solving this minimization problem, we obtain the estimator of the coefficient The minimization problem is solved under orthogonality and sparsity of the parameters. However, it is difficult to estimate the parameters directly. For this reason, Mishra et al. (2017) proposed the SeCURE algorithm. The SeCURE algorithm sequentially solves the minimization problem for the k-th latent factor given by where k = 1, . . . , r, u k and v k are the k-th column vector of U and V, respectively, and in which d j is the j-th diagonal element of D and Y 1 = Y. By sequentially solving the minimization problem (2.7), we obtain the solutionsd k ,û k , andv k which satisfy orthogonality and sparsity. Whenû k = 0 orv k = 0, the SeCURE algorithm updates d k = 0. This means that the updates are terminated. In addition, the index k that terminates the updates is regarded as the optimal value of the rank of the coefficient matrix C. It should be noted that the estimation method for the minimization problem 3 Minimization problem of co-sparse factor regres-

sion via manifold optimization
The SeCURE algorithm fails to estimate the parameters for the k-th latent factor when k is large, because the algorithm is based on the classical Gram-Schmidt orthogonalization algorithm. Note that the classical Gram-Schmidt orthogonalization algorithm does not produce an optimal solution, owing to rounding errors (Björck, 1967). To overcome this problem, we reconsider this minimization problem in terms of manifold optimization.

Reformulation of the minimization problem as manifold optimization
To consider the minimization problem (2.6) in terms of manifold optimization, we use the fundamental geometric structure given by where q ≥ r. Here, St (r, q) is called the Stiefel manifold, which is the set of orthogonal matrices of size q × r. Furthermore, we also use the generalized Stiefel manifold given by where p ≥ r and G ∈ R p×p is a symmetric positive definite matrix. In this paper, we use The minimization problem (3.3) is an unconstrained optimization problem, and solving it allows us to estimate all the parameters for all the latent factors at once.

Rank selection with sparse regularization
The reformulation of the minimization problem (2.6) gives us the unconstrained optimization problem (3.3). However, we cannot select the optimal value of the rank of the coefficient matrix C because of not using a sequential estimating procedure, such as SeCURE. To overcome this drawback, we propose the following minimization problem: where 1(·) is an indicator function that returns 1 if the condition is true and returns 0 if the condition is false, w  Letting U * ∈ R p×r and V * and V * * ∈ R q×r denote variables for splitting nondifferentiable penalty terms from the minimization problem (3.4), we consider a minimization problem with equality constraints as follows: where u * ij , v * ij are the (i, j)-th elements of U * and V * , respectively, and v * * i is an i-th column vector of V * * . When we let Ω ∈ R p×r and Φ and Ψ ∈ R q×r denote the dual variables, we obtain a scaled augmented Lagrangian (Boyd et al., 2011) as follows: where ρ 1 , ρ 2 , ρ 3 > 0 are penalty parameters. For this study, we fixed ρ 1 = ρ 2 = ρ 3 = 1. M-ADMM alternately updates each parameter to minimize the augmented Lagrangian. The estimators of elements in U * and V * indicate whether each element of the parameter is zero. The estimators of column vectors in V * * indicate whether each vector of the parameter is a zero vector. In the M-ADMM procedure, we initialize the parameters by using and the k-th column vector ofṼ is the k-th eigenvalue of (1/n)Y T X(X T X) − X T Y.
We set the adaptive weights w The parameters U and V are estimated by a gradient descent algorithm based on manifold optimization. For example, the procedure for estimating U can be represented by the following.
1. At a given iteration s, calculate the Euclidean gradient ∇L U (s) .
2. Project ∇L U (s) onto the tangent space T U (s) GSt (p, r) using orthogonal projection P U (s) (·) to obtain the gradient gradL U (s) on the manifold.
3. Update the parameter U (s) by retraction R U (s) (−t gradL U (s) ) to obtain the parameter U (s+1) , where t ∈ R is an Armijo step size described in Absil et al. (2008).
The necessary notation is shown in Table 1. In the same way, we estimate the parameter V on the manifold. The detailed calculation of the updates is described in the Appendix.
This algorithm is called the factor extraction algorithm with rank and variable selection via sparse regularization and manifold optimization (RVSManOpt). RVSManOpt is summarized as Algorithm 1.

Selection of tuning parameters
We have six tuning parameters: λ 1 , λ 2 , α, γ u , γ v , and γ d . To avoid a high computational cost, α, γ u , γ v , and γ d are fixed in advance. We set the values of these tuning parameters according to the situation. The tuning parameter α is set to a large value when a sparse regularization is more important than a regularization for selecting the rank of the coefficient matrix C. Larger values of tuning parameters γ u , γ v , and γ d correspond to a higher data dependence. To select the remaining two tuning parameters, λ 1 and λ 2 , we use the Bayesian information criterion (BIC) given by where SSE λ 1 ,λ 2 is the sum of squared errors of prediction defined by and df λ 1 ,λ 2 is the degree of freedom which evaluates the sparsity of the estimatesÛ and V defined by 5 Numerical study

Monte Carlo simulations
We conducted Monte Carlo simulations to illustrate the efficacy of RVSManOpt. In our simulation study, we generated 50 datasets from the model: where Y ∈ R n×q is a response matrix, X ∈ R n×p is a predictor matrix, C ∈ R p×q is a coefficient matrix, and E = [e 1 , . . . e n ] T ∈ R n×q is an error matrix. Each row of X followed a multivariate normal distribution N (0, Γ), where Γ = [γ ij ] is a p × p covariance matrix with γ ij = 0.5 |i−j| for i, j = 1, . . . , p. We generated each row of E by e i i.i.d.
where ∆ = [δ ij ] is a q × q matrix with δ ij = ρ |i−j| and σ is determined according to the signal-to-noise ratio defined by SNR = d r Xu r v T 2 / E 2 = 0.5. We considered the ranks of the coefficient matrix as follows: r ∈ {3, 5, 7, 10, 12}. We generated the coefficient Specifically, we set where rep(a, b) represents the vector of length b with all elements having the value a. We considered four cases. In Cases 1 and 2, we set n = 400, p = 80, and q = 50 in common, and we set the correlation as ρ = 0.3 (Case 1) or ρ = 0.5 (Case 2). In Cases 3 and 4, we set n = 400, p = 120, and q = 60 in common, and we set the correlation as ρ = 0.3 (Case 3) or ρ = 0.5 (Case 4).
To demonstrate the efficacy of RVSManOpt, we compared RVSManOpt with the Se-CURE with an adaptive lasso (SeCURE(AL)), and the SeCURE with an adaptive elastic net (SeCURE(AE)). For 50 datasets, we measured the estimation accuracy Er(XC) and the selected rank absolute error Er(r). These are defined as where C (k) is the true coefficient matrix, r (k) is the true rank of the coefficient matrix C (k) , C (k) is an estimated coefficient matrix, andr (k) is the selected rank of coefficient matrix C (k) for the k-th dataset. In order to evaluate the sparsity, we computed the F-measure defined by where Recall (k) and Precision (k) are defined by , ij are respectively elements of the estimated U and V for the k-th dataset and |{·}| is the count of the elements of set {·}. All implementations were done in R (ver. 3.6) (R Core Team, 2018).

Application to yeast cell cycle dataset
We applied RVSManOpt to yeast cell cycle data (Spellman et al., 1998).  Table 6 gives the results of a real data analysis. In RVSManOpt, the proportion of experimentally confirmed TFs is larger than both SeCURE(AL) and SeCURE(AE).
This result means that RVSManOpt may capture the latent structure of the yeast cell cycle data more precisely by identifying 5 latent factors.  Fig. 2 indicates that the estimated transcription levels followed two cycles. It was experimentally confirmed that the transcription levels in the cell cycle did cover a two cycle time period. Thus, RVSManOpt was demonstrated to accurately estimate the cycles of data.

Concluding Remarks
We proposed a minimization problem of SFAR on a Stiefel manifold and developed the factor extraction algorithm with rank and variable selection via sparse regularization and manifold optimization (RVSManOpt). RVSManOpt surpassed the traditional estimation procedure, which fails when the rank of the coefficient matrix is high. Numerical comparisons including Monte Carlo simulations and a real data analysis supported the usefulness of RVSManOpt.
In general, it is challenging to estimate parameters while preserving both orthogonality and sparsity. Mishra et al. (2017) indicates that enforcing orthogonality collapses sparsity and does not work from the viewpoint of prediction. Therefore, it may be unnecessary to construct a model with perfect orthogonality if we focus on prediction. Also, the recent paper by Absil and Hosseini (2019) discusses a theory of manifold optimization for nonsmooth functions. It would be interesting to develop RVSManOpt based on this theory.
We leave these as future topics.
Appendix: Detailed description of update procedures for the parameters Formulas for updating U and V The Euclidean gradient ∇L U can be calculated as follows: The formula for updating U is given bŷ where R U is the retraction mapping on a generalized Stiefel manifold, t u is the Armijo step size, and gradL U is the gradient on the generalized Stiefel manifold. gradL U can be obtained by projecting the Euclidean gradient ∇L U into the tangent space T U GSt (p, r) by using projection operator P U (·).
In a similar way, the Euclidean gradient ∇L V can be calculated as follows: The formula for updating V is given bŷ where R V is the retraction mapping on a Stiefel manifold, t v is the Armijo step size, and gradL V is the gradient on the Stiefel manifold. gradL V can be obtained by projecting the Euclidean gradient ∇L V into the tangent space T V St (q, r) by using projection operator

Formula for updating D
The Euclidean gradient ∇L D is given by When ∇L D = 0, the optimal solution ofD is given by Therefore, the formula for updating D is given bŷ Formulas for updating U * and V * The augmented Lagrangian with respect to U * is given by The partial derivative of L(U * ) is calculated as follows: where ∂| · | is the subderivative operator defined as When this partial derivative is equal to 0, the element of U * is represented as Thus, the formula for updating U * can be obtained as follows: This formula can be simplified using the soft-thresholding operator S(·, ·) as follows: where ω ij is the (i, j)-th element of Ω and S(·, ·) is the soft-thresholding operator S (x, λ) = sign(x)(|x| − λ) + , (x) + = max {x, 0} , In a similar way to the updating of U * , the formula for updating V * can be obtained as follows: where φ ij is the (i, j)-th element of Φ.

Formula for updating V * *
The augmented Lagrangian with respect to V * * is given by Here, we consider the augmented Lagrangian for every column vector v * * i , i = 1, . . . , r, as follows: This equation can be divided into v * * i = 0 and v * * i = 0 cases as follows: When v * * i = 0, the optimal solution v * * i can be obtained as follows: v * * i = v i + ψ i .
When we substitute v i + ψ i in for v * * i in L(v * * i ), the value of L(v * * i ) is √ q(1 − α)λ 2 w (d) i . It is necessary to satisfy the following condition The formula for updating v * i can be obtained as follows: . This formula is simplified by using the hard-thresholding operator H(·, ·) as follows: where ψ ij is the (i, j)-th element of Ψ and H(·, ·) is the hard-thresholding operator