1 Introduction

Reduced-rank regression (RRR), a useful tool in statistics, is a multivariate linear regression model with a low-rank constraint on the coefficient matrix. RRR reduces the number of parameters in the model and makes it easier to interpret the relationship between the response and predictor variables. For these reasons, RRR is used in various fields of research, including genomics, signal processing, and econometrics. To date, various extensions of RRR have been proposed: high-dimensional RRR with a rank selection criterion (Bunea et al. 2011), RRR with nuclear norm penalization (Yuan et al. 2007; Negahban and Wainwright 2011), reduced-rank ridge regression and its kernel extensions (Mukherjee and Zhu 2011), and reduced-rank stochastic regression with sparse singular value decomposition (Chen et al. 2013).

In recent years, the numbers of response and predictor variables have been increasing in fields of research such as genomics. This makes parameter estimation difficult when the sample size is smaller than the number of parameters in the model. One approach to overcoming this problem is to apply a regularization method. Over the past few decades, sparse regularization methods such as the lasso (Tibshirani 1996) have attracted attention, because they can estimate parameters and exclude irrelevant variables simultaneously. Various studies have considered a multivariate linear regression model with some form of sparse regularization (see, e.g., Rothman et al. 2010; Peng et al. 2010; Li et al. 2015). Co-sparse factor regression (SFAR; Mishra et al. 2017) was proposed in one such study. SFAR combines RRR with a factor analysis model by assuming that the coefficient matrix admits a singular value decomposition with both a low-rank constraint and sparsity in the singular vectors. To estimate the parameters, Mishra et al. (2017) proposed the sequential factor extraction via co-sparse unit-rank estimation (SeCURE) algorithm. The SeCURE algorithm sequentially estimates the parameters with orthogonality and sparsity for each factor. However, the SeCURE algorithm fails to estimate the parameters when the number of latent factors is large, because it is a greedy estimation method based on the classical Gram–Schmidt orthogonalization algorithm, which is well known not to guarantee an optimal solution (Björck 1967).

To overcome this problem, we propose a factor extraction algorithm with rank and variable selection via sparse regularization and manifold optimization (RVSManOpt). Manifold optimization has demonstrated excellent performance over decades of study (Bakır et al. 2004; Mishra et al. 2013; Tan et al. 2019). The minimization problem of the SFAR model can be reformulated as a manifold optimization problem, which allows us to solve it while taking the geometric structure of the SFAR model into account. By estimating the parameters on the manifold, we obtain all latent factors simultaneously. In addition, to select the optimal value of the rank, we introduce a regularizer that induces a hard-thresholding operator.

The remainder of the paper is organized as follows. In Sect. 2, we introduce RRR and derive the SFAR model from the factor regression model. In Sect. 3, we reformulate the minimization problem of the SFAR model in terms of manifold optimization. In Sect. 4, we provide the estimation algorithm based on manifold optimization and discuss the selection of tuning parameters. In Sect. 5, Monte Carlo experiments and a real data analysis support the efficacy of RVSManOpt. Concluding remarks summarizing our study are presented in Sect. 6. Supplementary materials and the source code of our proposed method are available at https://github.com/yoshikawa-kohei/RVSManOpt.

2 Preliminaries

Suppose that we obtain n independent observations \(\left\{ (\mathbf {y}_i, \mathbf {x}_i); i=1,\dots ,n \right\} \), where \(\mathbf {y}_i = {\left[ y_{i1}, \ldots , y_{iq} \right] }^\mathsf {T} \in \mathbb {R}^{q}\) is a q-dimensional vector of response variables and \(\mathbf {x}_i = {\left[ x_{i1}, \ldots , x_{ip} \right] }^\mathsf {T} \in \mathbb {R}^p\) is a p-dimensional vector of predictor variables. When we set \(\mathbf {Y} = {\left[ \mathbf {y}_1, \ldots , \mathbf {y}_n \right] }^\mathsf {T} \in \mathbb {R}^{n \times q}\) and \(\mathbf {X} = {\left[ \mathbf {x}_1, \ldots , \mathbf {x}_n \right] }^\mathsf {T} \in \mathbb {R}^{n \times p}\), RRR (Anderson 1951; Izenman 1975; Reinsel and Velu 1998) is formulated as

$$\begin{aligned} \mathbf {Y} = \mathbf {X}\mathbf {C} + \mathbf {E},\quad \quad \mathrm {s.t.}\ \mathrm {rank}\left( \mathbf {C}\right) \le r, \end{aligned}$$
(1)

where \(\mathbf {C} \in \mathbb {R}^{p \times q}\) is the coefficient matrix, which has rank at most \(r = \min \left( \mathrm {rank}\left( \mathbf {X}\right) , q \right) \), and \(\mathbf {E} = {\left[ \mathbf {e}_1, \ldots , \mathbf {e}_n \right] }^\mathsf {T} \in \mathbb {R}^{n \times q}\) is the error matrix, which consists of independent random error vectors \(\mathbf {e}_i\) with mean \(\mathrm {E}\left[ \mathbf {e}_i\right] = \mathbf {0}\) and covariance matrix \(\mathrm {Cov}\left[ \mathbf {e}_i \right] = \varvec{\Sigma } \ (i = 1,\dots ,n)\). The estimator of the coefficient matrix \(\mathbf {C}\) can be obtained by solving the minimization problem

$$\begin{aligned} \min _{\mathbf {C}}\ \left\Vert \mathbf {Y} - \mathbf {X} \mathbf {C}\right\Vert _F^2, \quad \mathrm {s.t.}\ \mathrm {rank}\left( \mathbf {C} \right) \le r, \end{aligned}$$
(2)

where \(\left\Vert \cdot \right\Vert _F\) denotes the Frobenius norm.
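For concreteness, problem (2) admits the classical closed-form solution in which the ordinary least-squares fit is projected onto the top-\(r\) eigenvectors of \({\mathbf {Y}}^\mathsf {T} \mathbf {X} ({\mathbf {X}}^\mathsf {T} \mathbf {X})^{-} {\mathbf {X}}^\mathsf {T} \mathbf {Y}\) (Reinsel and Velu 1998). The following is a minimal R sketch of this classical estimator; the function name rrr_fit and the use of MASS::ginv for the pseudo-inverse are illustrative choices.

```r
# Minimal sketch of the classical closed-form RRR estimator: project the
# ordinary least-squares fit onto its top-r right eigen-directions.
rrr_fit <- function(X, Y, r) {
  XtX_pinv <- MASS::ginv(crossprod(X))            # (X^T X)^-
  C_ols <- XtX_pinv %*% crossprod(X, Y)           # least-norm OLS coefficients
  M <- crossprod(Y, X) %*% C_ols                  # Y^T X (X^T X)^- X^T Y  (q x q)
  V_r <- eigen(M, symmetric = TRUE)$vectors[, seq_len(r), drop = FALSE]
  C_ols %*% V_r %*% t(V_r)                        # rank-r coefficient estimate
}
```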

Mishra et al. (2017) proposed SFAR by extending RRR in terms of factor analysis. Before introducing SFAR, we describe the relationship between RRR and factor analysis. First, we consider the RRR model with a coefficient matrix \(\mathbf {C}\) that is decomposed as

$$\begin{aligned} \mathbf {C} = \mathbf {U} {{\tilde{\mathbf {V}}}}^\mathsf {T}, \end{aligned}$$
(3)

where \(\mathbf {U} \in \mathbb {R}^{p \times r}\) and \(\tilde{\mathbf {V}} \in \mathbb {R}^{q \times r}\). Then the RRR model can be reformulated as

$$\begin{aligned} \mathbf {Y} = \mathbf {X} \mathbf {U} {{\tilde{\mathbf {V}}}}^\mathsf {T} + \mathbf {E}. \end{aligned}$$
(4)

Equation (4) is related to a factor analysis model: \(\mathbf {XU}\) can be regarded as a common factor matrix and \(\tilde{\mathbf {V}}\) as a loading matrix. Furthermore, if we assume \(\mathrm {E}[\mathbf {x}_i] = \mathbf {0}\) and \(\mathrm {cov}[\mathbf {x}_i] = \varvec{\Gamma } \ (i=1,\dots ,n)\), then \(\mathrm {cov}[{{\mathbf {U}}}^\mathsf {T} \mathbf {x}] = {{\mathbf {U}}}^\mathsf {T} \varvec{\Gamma } \mathbf {U}\), which we require to equal \(\mathbf {I}_r\) so that the common factors are standardized. When we decompose \(\tilde{\mathbf {V}} = \mathbf {V} \mathbf {D}\) such that \({{\mathbf {V}}}^\mathsf {T} \mathbf {V} = \mathbf {I}_{r}\) and \(\mathbf {D} = \mathrm {diag}( d_1, \dots , d_r )\) with \(d_1> \dots> d_r > 0\), we obtain the following SFAR model:

$$\begin{aligned} \mathbf {Y} = \mathbf {X} \mathbf {U} \mathbf {D} {{\mathbf {V}}}^\mathsf {T} + \mathbf {E}, \quad \mathrm {s.t.}\ {{\mathbf {U}}}^\mathsf {T} \varvec{\Gamma } \mathbf {U} = \mathbf {I}_{r}, {{\mathbf {V}}}^\mathsf {T} \mathbf {V} = \mathbf {I}_{r}. \end{aligned}$$
(5)

Here, the coefficient matrix is \(\mathbf {C} = \mathbf {U} \mathbf {D} {{\mathbf {V}}}^\mathsf {T}\).

The estimator of SFAR is obtained by solving the minimization problem

$$\begin{aligned} \min _{\mathbf {U}, \mathbf {D},\mathbf {V}}\ \frac{1}{2} \left\Vert \mathbf {Y} - \mathbf {X} \mathbf {U} \mathbf {D} {{\mathbf {V}}}^\mathsf {T} \right\Vert _F^2 + \lambda _1 h_1 (\mathbf {U}) + \lambda _2 h_2 (\mathbf {V}),\nonumber \\ \mathrm {s.t.}\ {{\mathbf {U}}}^\mathsf {T} \left( \frac{{\mathbf {X}}^\mathsf {T} \mathbf {X}}{n} \right) \mathbf {U}= \mathbf {I}_{r}, {{\mathbf {V}}}^\mathsf {T} \mathbf {V} = \mathbf {I}_{r}, \end{aligned}$$
(6)

where \(\lambda _1, \lambda _2 > 0\) are regularization parameters and \(h_1 (\cdot ), h_2 (\cdot )\) are regularization functions that induce sparsity in the parameters \(\mathbf {U}\) and \(\mathbf {V}\), respectively (Tibshirani 1996). The parameter \(\mathbf {U}\) constructs latent variables as linear combinations of the predictor variables; estimating \(\mathbf {U}\) sparsely means that the latent variables are built only from the predictor variables that contribute to prediction. The parameter \(\mathbf {V}\) contains the coefficients of the latent variables; estimating \(\mathbf {V}\) sparsely selects the latent variables that contribute to prediction. By solving the minimization problem (6), we obtain the estimator of the coefficient matrix \(\hat{\mathbf {C}} = \hat{\mathbf {U}} \hat{\mathbf {D}} {\hat{{\mathbf {V}}}}^\mathsf {T}\).

The minimization problem must be solved under both orthogonality and sparsity constraints on the parameters, which makes direct estimation difficult. For this reason, Mishra et al. (2017) proposed the SeCURE algorithm, which sequentially solves the minimization problem for the kth latent factor given by

$$\begin{aligned} \min _{d_k, \mathbf {u}_k, \mathbf {v}_k} \frac{1}{2} \left\Vert \mathbf {Y}_k - d_k \mathbf {X} \mathbf {u}_k {\mathbf {v}_k}^\mathsf {T} \right\Vert _F^2 + \lambda \sum _{i = 1}^p \sum _{j = 1}^q w_{ij} |d_k u_{ki} v_{kj}|,\nonumber \\ \mathrm {s.t.}\ d_k \ge 0, \mathbf {u}_k^\mathsf {T} {\mathbf {X}}^\mathsf {T} \mathbf {X} \mathbf {u}_k = n, \mathbf {v}_k^\mathsf {T}\mathbf {v}_k = 1, \end{aligned}$$
(7)

where \(u_{ij}\) and \(v_{ij}\) are the elements of \(\mathbf {U}\) and \(\mathbf {V}\), respectively, \(\mathbf {u}_k\) and \(\mathbf {v}_k\) for \(k=1,\dots ,r\) are the kth column vectors of \(\mathbf {U}\) and \(\mathbf {V}\), respectively, and \(w_{ij}\) is a positive adaptive weight as proposed by Zou (2006). The role of the adaptive weights is to control the strength of the regularization for each parameter. Here, \(\mathbf {Y}_k\) is defined by

$$\begin{aligned} \mathbf {Y}_k = \mathbf {Y} - \sum _{j=1}^{k-1} d_j \mathbf {X} \mathbf {u}_j \mathbf {v}_j^\mathsf {T}, \end{aligned}$$
(8)

in which \(d_j\) is the jth diagonal element of \(\mathbf {D}\) and \(\mathbf {Y}_1 = \mathbf {Y}\). By sequentially solving the minimization problem (7) with respect to each \(d_k, \mathbf {u}_k\), and \(\mathbf {v}_k\), we obtain the solutions \(\hat{d}_k, \hat{\mathbf {u}}_k\), and \(\hat{\mathbf {v}}_k\), which satisfy orthogonality and sparsity: \(\hat{\mathbf {u}}_k^\mathsf {T} {\mathbf {X}}^\mathsf {T} \mathbf {X} \hat{\mathbf {u}}_h = 0\) and \(\hat{\mathbf {v}}_k^\mathsf {T}\hat{\mathbf {v}}_h = 0\) for \(k \ne h\). When \(\hat{\mathbf {u}}_k = \mathbf {0}\) or \(\hat{\mathbf {v}}_k = \mathbf {0}\), the SeCURE algorithm sets \(d_k = 0\) and terminates the updates, and the index at which the updates terminate determines the selected rank of the coefficient matrix \(\mathbf {C}\). It should be noted that the minimization problem (7) is solved by the block coordinate descent algorithm proposed by Chen et al. (2012).
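To illustrate the sequential nature of SeCURE, the following R sketch implements the deflation step (8); the function name deflate and its argument layout are hypothetical.

```r
# Hypothetical sketch of the deflation step (8): the first k-1 extracted
# unit-rank factors are removed from Y before the k-th factor is estimated.
# d is the vector of extracted d_j, U and V hold the extracted columns.
deflate <- function(Y, X, d, U, V, k) {
  if (k == 1) return(Y)                           # Y_1 = Y
  fitted <- matrix(0, nrow(Y), ncol(Y))
  for (j in seq_len(k - 1)) {
    fitted <- fitted + d[j] * (X %*% U[, j]) %*% t(V[, j])
  }
  Y - fitted                                      # Y_k
}
```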

3 Minimization problem of co-sparse factor regression via manifold optimization

The SeCURE algorithm fails to estimate the parameters for the kth latent factor when k is large, because the algorithm is based on the classical Gram–Schmidt orthogonalization algorithm, which does not produce an optimal solution owing to rounding errors (Björck 1967). To overcome this problem, we reconsider the minimization problem in terms of manifold optimization.

3.1 Reformulation of the minimization problem as manifold optimization

To consider the minimization problem (6) in terms of manifold optimization, we use the fundamental geometric structure given by

$$\begin{aligned} \mathrm {St}(r, q)&:= \left\{ \mathbf {V} \in \mathbb {R}^{q \times r} \mid {\mathbf {V}}^\mathsf {T} \mathbf {V} = \mathbf {I}_r \right\} , \end{aligned}$$
(9)

where \(q \ge r\). Here, \(\mathrm {St}(r, q)\) is called the Stiefel manifold, the set of \(q \times r\) matrices with orthonormal columns. Furthermore, we also use the generalized Stiefel manifold given by

$$\begin{aligned} \mathrm {GSt}(r, p)&:= \left\{ \mathbf {U} \in \mathbb {R}^{p \times r} \mid {\mathbf {U}}^\mathsf {T} \mathbf {G} \mathbf {U} = \mathbf {I}_r \right\} , \end{aligned}$$
(10)

where \(p \ge r\) and \(\mathbf {G} \in \mathbb {R}^{p \times p}\) is a symmetric positive definite matrix. In this paper, we use \(\mathbf {G} = {\mathbf {X}}^\mathsf {T} \mathbf {X} /n\).
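As a quick illustration of the two constraint sets (9) and (10), the following R sketch performs numerical membership checks; the function names and tolerance are illustrative.

```r
# Sketch: numerical membership checks for the Stiefel manifold (9) and the
# generalized Stiefel manifold (10).
on_stiefel <- function(V, tol = 1e-8) {
  max(abs(crossprod(V) - diag(ncol(V)))) < tol      # checks V^T V = I_r
}
on_gen_stiefel <- function(U, G, tol = 1e-8) {
  max(abs(t(U) %*% G %*% U - diag(ncol(U)))) < tol  # checks U^T G U = I_r
}
# For the model in this paper, G = crossprod(X) / nrow(X).
```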

By utilizing the geometric structures (9) and (10), the minimization problem (6) can be reformulated as

$$\begin{aligned} \min _{\begin{array}{c} \mathbf {U} \in \mathrm {GSt}(r, p),\\ \mathbf {D},\\ \mathbf {V}\in \mathrm {St}(r, q) \end{array}} \frac{1}{2} \left\Vert \mathbf {Y} - \mathbf {X} \mathbf {U} \mathbf {D} {\mathbf {V}}^\mathsf {T} \right\Vert _F^2 + n\lambda _1 \sum _{i=1}^{p} \sum _{j=1}^{r} w^{(u)}_{ij}|{u}_{ij}| + n\lambda _2 \sum _{i=1}^{q} \sum _{j=1}^{r} w^{(v)}_{ij}|{v}_{ij}|. \end{aligned}$$
(11)

Here, the penalties \(h_1(\mathbf {U})\) and \(h_2(\mathbf {V})\) in (6) are chosen as the adaptive lasso penalties

$$\begin{aligned} h_1(\mathbf {U}) = n\sum _{i=1}^p \sum _{j=1}^r w_{ij}^{(u)} |u_{ij}|,\quad h_2(\mathbf {V}) = n\sum _{i=1}^q \sum _{j=1}^r w_{ij}^{(v)} |v_{ij}|. \end{aligned}$$
(12)

The minimization problem (11) is an unconstrained optimization problem, and solving it allows us to estimate all the parameters for all the latent factors at once.

3.2 Rank selection with sparse regularization

The reformulation of the minimization problem (6) gives us the unconstrained optimization problem (11). However, it does not allow us to select the optimal value of the rank of the coefficient matrix \(\mathbf {C}\), because it does not use a sequential estimation procedure such as SeCURE. To overcome this drawback, we propose the following minimization problem:

$$\begin{aligned}&\min _{\begin{array}{c} {\mathbf{U}} \in {{\mathrm GSt}}(r, p),\\ {\mathbf{D}},\\ {\mathbf{V}}\in {{\mathrm St}}(r, q) \end{array}} \frac{1}{2} \left\Vert {\mathbf{Y}} - {\mathbf{X}} {\mathbf{U}} {\mathbf{D}} {\mathbf{V}}^{\textsf {T}} \right\Vert _F^2 + n\lambda _1 \sum _{i=1}^{p} \sum _{j=1}^{r} w^{(u)}_{ij}|{u}_{ij}| \nonumber \\&\qquad + n\alpha \lambda _2 \sum _{i=1}^{q} \sum _{j=1}^{r} w^{(v)}_{ij}|{v}_{ij}| + n\sqrt{q}(1-\alpha )\lambda _2 \sum _{i=1}^r w^{(d)}_{i} \mathbbm {1} ({\mathbf{v}}_i \ne {\mathbf{0}}), \end{aligned}$$
(13)

where \(\mathbbm {1} ( \cdot )\) is an indicator function that returns 1 if the condition is true and 0 otherwise, \(w^{(d)}_{i}\) is a positive adaptive weight as proposed by Zou (2006), and \(\alpha \) is a tuning parameter taking a value between zero and one. The group selection induced by the fourth term plays the role of rank selection for the coefficient matrix \(\mathbf {C}\). The tuning parameter \(\alpha \) adjusts the trade-off between the third and fourth terms; introducing \(\alpha \) facilitates the interpretation of the regularizations imposed on the model without losing the generality of the optimization problem. Together, these two terms can be regarded as a sparse group lasso penalty (Wu and Lange 2008; Puig et al. 2009; Simon et al. 2013). The fourth term is a regularizer that induces a hard-thresholding operator. By imposing this regularization, we can estimate some column vectors of \(\mathbf {V}\) as zero vectors, so that the model is constructed with a small number of latent factors. In that sense, the indicator function plays the role of selecting the rank of the coefficient matrix \(\mathbf {C}\).

The reason why we do not apply the group lasso, which induces a soft-thresholding operator (Yuan and Lin 2006), is to avoid a double shrinking effect on the parameter \(\mathbf {V}\). If the fourth term were a group lasso penalty, the columns of \(\mathbf {V}\) would be shrunk twice: element-wise by the third term and group-wise by the fourth. This double shrinking reduces the variance of the model but excessively increases its bias. To prevent it, we use a regularizer that induces a hard-thresholding operator, since it does not shrink the value of a parameter that is retained.
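The distinction can be made concrete with the following R sketch of the two group-wise operators acting on a column of \(\mathbf {V}\); the function names and the threshold parameter tau are illustrative, not the exact proximal operators derived in the Appendix.

```r
# Illustrative group-wise thresholding operators acting on a column v of V.
soft_threshold_col <- function(v, tau) {          # group lasso: shrink or kill
  nrm <- sqrt(sum(v^2))
  if (nrm <= tau) rep(0, length(v)) else (1 - tau / nrm) * v
}
hard_threshold_col <- function(v, tau) {          # keep unshrunk, or kill
  if (sqrt(sum(v^2)) <= tau) rep(0, length(v)) else v
}
```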

4 Implementation

4.1 Computational algorithm

To estimate the parameters, we employ a manifold optimization method (Edelman et al. 1998; Absil et al. 2008). Manifold optimization can be performed for differentiable functions. However, the minimization problem (13) includes nondifferentiable penalty terms. For this reason, we handle the nondifferentiability by applying the manifold alternating direction method of multipliers (M-ADMM) proposed by Kovnatsky et al. (2016) to the minimization problem (13).

Letting \(\mathbf {U}^* \in \mathbb {R}^{p \times r}\) and \(\mathbf {V}^*, \mathbf {V}^{**} \in \mathbb {R}^{q \times r}\) denote variables that split the nondifferentiable penalty terms off from the minimization problem (13), we consider a minimization problem with equality constraints as follows:

$$\begin{aligned}&\min _{\begin{array}{c} \mathbf {U} \in \mathrm {GSt}(r, p),\ \mathbf {D},\ \mathbf {V}\in \mathrm {St}(r, q),\\ \mathbf {U}^*,\ \mathbf {V}^*,\ \mathbf {V}^{**} \end{array}} \frac{1}{2} \left\Vert \mathbf {Y} - \mathbf {X} \mathbf {U} \mathbf {D} {\mathbf {V}}^\mathsf {T} \right\Vert _F^2 + n\lambda _1 \sum _{i=1}^{p} \sum _{j=1}^{r} w^{(u)}_{ij}|{u}^*_{ij}| \nonumber \\&\qquad + n\alpha \lambda _2 \sum _{i=1}^{q} \sum _{j=1}^{r} w^{(v)}_{ij}|{v}^*_{ij}| + n\sqrt{q}(1-\alpha )\lambda _2 \sum _{i=1}^r w^{(d)}_{i} \mathbbm {1} ({\mathbf {v}}^{**}_i \ne {\mathbf {0}}), \nonumber \\&\quad \mathrm {s.t.}\ \mathbf {U} = \mathbf {U}^*,\ \mathbf {V} = \mathbf {V}^*,\ \mathbf {V} = \mathbf {V}^{**}, \end{aligned}$$
(14)

where \(u_{ij}^*\) and \(v_{ij}^*\) are the (i, j)th elements of \(\mathbf {U}^*\) and \(\mathbf {V}^*\), respectively, and \(\mathbf {v}_i^{**}\) is the ith column vector of \(\mathbf {V}^{**}\). Letting \(\varvec{\Omega } \in \mathbb {R}^{p \times r}\) and \(\varvec{\Phi }, \varvec{\Psi } \in \mathbb {R}^{q \times r}\) denote the dual variables, we obtain a scaled augmented Lagrangian (Boyd et al. 2011) as follows:

$$\begin{aligned}&\mathcal {L}_{\xi _1, \xi _2, \xi _3} \left( \mathbf {U}, \mathbf {D}, \mathbf {V}, \mathbf {U}^*, \mathbf {V}^*, \mathbf {V}^{**}, \varvec{\Omega }, \varvec{\Phi }, \varvec{\Psi } \right) \nonumber \\&\quad = \frac{1}{2} \left\Vert \mathbf {Y} - \mathbf {X} \mathbf {U} \mathbf {D} {\mathbf {V}}^\mathsf {T} \right\Vert _F^2 + n\lambda _1 \sum _{i=1}^{p} \sum _{j=1}^{r} w^{(u)}_{ij}|{u}^*_{ij}| + n\alpha \lambda _2 \sum _{i=1}^{q} \sum _{j=1}^{r} w^{(v)}_{ij}|{v}^*_{ij}| \nonumber \\&\qquad + n\sqrt{q}(1-\alpha )\lambda _2 \sum _{i=1}^r w^{(d)}_{i} \mathbbm {1} ({\mathbf {v}}^{**}_i \ne {\mathbf {0}}) + \frac{\xi _1}{2} \left( \left\Vert \mathbf {U} - \mathbf {U}^* + \varvec{\Omega } \right\Vert _F^2 - \left\Vert \varvec{\Omega } \right\Vert _F^2 \right) \nonumber \\&\qquad + \frac{\xi _2}{2} \left( \left\Vert \mathbf {V} - \mathbf {V}^* + \varvec{\Phi } \right\Vert _F^2 - \left\Vert \varvec{\Phi } \right\Vert _F^2 \right) + \frac{\xi _3}{2} \left( \left\Vert \mathbf {V} - \mathbf {V}^{**} + \varvec{\Psi } \right\Vert _F^2 - \left\Vert \varvec{\Psi } \right\Vert _F^2 \right) , \end{aligned}$$
(15)

where \(\xi _1, \xi _2, \xi _3 > 0 \) are penalty parameters; in this study, we fixed \(\xi _1 = \xi _2 = \xi _3 = 1\). M-ADMM alternately updates each parameter to minimize the augmented Lagrangian. The estimators of the elements of \(\mathbf {U}^*\) and \(\mathbf {V}^*\) indicate whether the corresponding elements of \(\mathbf {U}\) and \(\mathbf {V}\) are zero, and the estimators of the column vectors of \(\mathbf {V}^{**}\) indicate whether the corresponding columns of \(\mathbf {V}\) are zero vectors. In the M-ADMM procedure, we initialize the parameters by \(\tilde{\mathbf {U}} \in \mathbb {R}^{p \times r}, \tilde{\mathbf {D}} = \mathrm {diag} (\tilde{d}_1, \dots , \tilde{d}_r)\), and \(\tilde{\mathbf {V}} \in \mathbb {R}^{q \times r}\). Here, \(\tilde{\mathbf {U}}\) is calculated as \(({\mathbf {X}}^\mathsf {T} \mathbf {X})^{-} {\mathbf {X}}^\mathsf {T} \mathbf {Y} \tilde{\mathbf {V}} \tilde{\mathbf {D}}^{-1}\), where the kth diagonal element of \(\tilde{\mathbf {D}}^2\) is the kth eigenvalue of \((1/n) {\mathbf {Y}}^\mathsf {T} \mathbf {X} ({\mathbf {X}}^\mathsf {T} \mathbf {X})^{-} {\mathbf {X}}^\mathsf {T} \mathbf {Y}\) and the kth column vector of \(\tilde{\mathbf {V}}\) is the corresponding eigenvector.
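A minimal R sketch of this initialization is given below, assuming a Moore–Penrose pseudo-inverse for \(({\mathbf {X}}^\mathsf {T} \mathbf {X})^{-}\); the function name init_rvs is illustrative.

```r
# Sketch of the initialization of (U, D, V) described above.
init_rvs <- function(X, Y, r) {
  n <- nrow(X)
  XtX_pinv <- MASS::ginv(crossprod(X))
  M <- crossprod(Y, X) %*% XtX_pinv %*% crossprod(X, Y) / n  # (1/n) Y^T X (X^T X)^- X^T Y
  eig <- eigen(M, symmetric = TRUE)
  V0 <- eig$vectors[, seq_len(r), drop = FALSE]              # k-th column = k-th eigenvector
  D0 <- diag(sqrt(pmax(eig$values[seq_len(r)], 0)), r)       # D~^2 carries the eigenvalues
  U0 <- XtX_pinv %*% crossprod(X, Y) %*% V0 %*% solve(D0)    # U~ = (X^T X)^- X^T Y V~ D~^{-1}
  list(U = U0, D = D0, V = V0)
}
```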

We set the adaptive weights \(w_{ij}^{(u)}, w_{ij}^{(v)}, w_{i}^{(d)}\) as

$$\begin{aligned} w^{(u)}_{ij}&= \frac{1}{\left| \tilde{u}_{ij} \right| ^{{\kappa ^u}}}, \quad i=1,\dots ,p, j=1,\dots ,r, \end{aligned}$$
(16)
$$\begin{aligned} w^{(v)}_{ij}&= \frac{1}{\left| \tilde{v}_{ij} \right| ^{{\kappa ^v}}}, \quad i=1,\dots ,q, j=1,\dots ,r, \end{aligned}$$
(17)
$$\begin{aligned} w^{(d)}_{i}&= \frac{1}{|\tilde{d}_{i} |^{{\kappa ^d}}}, \quad i=1,\dots ,r, \end{aligned}$$
(18)

where \(\kappa ^u\), \(\kappa ^v\), \(\kappa ^d > 0\) are tuning parameters.
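For reference, the weights (16)–(18) can be computed from the initial estimates as in the following R sketch; the default exponents are placeholders, since \(\kappa ^u, \kappa ^v\), and \(\kappa ^d\) are fixed in advance as described in Sect. 4.2.

```r
# Sketch of the adaptive weights (16)-(18) computed from the initial
# estimates U~, V~ and the diagonal d~ of D~; default exponents are placeholders.
adaptive_weights <- function(U0, V0, d0, kappa_u = 1, kappa_v = 1, kappa_d = 1) {
  list(w_u = abs(U0)^(-kappa_u),    # p x r matrix of w_ij^(u)
       w_v = abs(V0)^(-kappa_v),    # q x r matrix of w_ij^(v)
       w_d = abs(d0)^(-kappa_d))    # length-r vector of w_i^(d)
}
```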

The parameters \(\mathbf {U}\) and \(\mathbf {V}\) are estimated by a gradient descent algorithm based on manifold optimization. For example, the procedure for estimating \(\mathbf {U}\) proceeds as follows.

  1.

    At a given iteration s, calculate the Euclidean gradient \(\nabla \mathcal {L}_{\mathbf {U}^{(s)}}\).

  2.

    Project \(\nabla \mathcal {L}_{\mathbf {U}^{(s)}}\) onto the tangent space \(\mathcal {T}_{\mathbf {U}^{(s)}} \mathrm {GSt}(r, p)\) using the orthogonal projection \(\mathcal {P}_{\mathbf {U}^{(s)}}(\cdot )\) to obtain the gradient \(\mathrm {grad} \mathcal {L}_{\mathbf {U}^{(s)}}\) on the manifold.

  3.

    Update the parameter \(\mathbf {U}^{(s)}\) by the retraction \(\mathcal {R}_{\mathbf {U}^{(s)}}(- t\ \mathrm {grad} \mathcal {L}_{\mathbf {U}^{(s)}})\) to obtain \(\mathbf {U}^{(s+1)}\), where \(t \in \mathbb {R}\) is an Armijo step size as described in Absil et al. (2008).

The necessary notation is shown in Table 1. In the same way, we estimate the parameter \(\mathbf {V}\) on the manifold. The detailed calculation of the updates is described in the Appendix. This algorithm is called the factor extraction algorithm with rank and variable selection via sparse regularization and manifold optimization (RVSManOpt). RVSManOpt is summarized as Algorithm 1.
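As an illustration of the projection–retraction pattern in steps 1–3, the following R sketch performs one gradient step on the plain Stiefel manifold \(\mathrm {St}(r, q)\) (the case of \(\mathbf {V}\)) using the standard tangent-space projection and a QR retraction; the generalized Stiefel case additionally involves \(\mathbf {G}\), the fixed step size stands in for the Armijo line search, and the exact projections and retractions used by RVSManOpt are given in the Appendix.

```r
# Sketch of one Riemannian gradient-descent step on St(r, q): project the
# Euclidean gradient onto the tangent space at V, then retract back onto the
# manifold via a thin QR decomposition.
sym <- function(A) (A + t(A)) / 2

stiefel_step <- function(V, egrad, step = 1e-2) {
  rgrad <- egrad - V %*% sym(crossprod(V, egrad))   # tangent-space projection
  Z <- V - step * rgrad                             # move along -grad
  qr.Q(qr(Z))[, seq_len(ncol(V)), drop = FALSE]     # QR retraction onto St(r, q)
}
```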

We have analyzed the computational complexity of RVSManOpt. The complexity of the algorithm is \(O(sp^3)\), where s denotes the number of iterations. The dominant cost comes from the matrix inversion required when performing the retraction.

Table 1 Notation for the manifold optimization algorithm

4.2 Selection of tuning parameters

We have six tuning parameters: \(\lambda _1, \lambda _2, \alpha , \kappa ^u, \kappa ^v\), and \(\kappa ^d\). To avoid a high computational cost, \(\alpha , \kappa ^u, \kappa ^v\), and \(\kappa ^d\) are fixed in advance, with values chosen according to the situation. The tuning parameter \(\alpha \) is set to a large value when sparse regularization is more important than the regularization that selects the rank of the coefficient matrix \(\mathbf {C}\). Larger values of \(\kappa ^u, \kappa ^v\), and \(\kappa ^d\) correspond to a stronger dependence of the weights on the data. To select the remaining two tuning parameters, \(\lambda _1\) and \(\lambda _2\), we use the Bayesian information criterion (BIC) given by

$$\begin{aligned} \mathrm {BIC} = \log \left\{ \mathrm {SSE}_{\lambda _1, \lambda _2} / nq \right\} + \left\{ \log (qn)/(nq) \right\} df_{\lambda _1, \lambda _2}, \end{aligned}$$
(19)

where \(\mathrm {SSE}_{\lambda _1, \lambda _2}\) is the sum of squared errors of prediction defined by

$$\begin{aligned} \mathrm {SSE}_{\lambda _1, \lambda _2} = \left\Vert \mathbf {Y} - \mathbf {X} \hat{\mathbf {U}} \hat{\mathbf {D}} {\hat{\mathbf {V}}}^\mathsf {T} \right\Vert _F^2, \end{aligned}$$
(20)

and \(df_{\lambda _1, \lambda _2}\) is the degrees of freedom, which evaluates the sparsity of the estimates \(\hat{\mathbf {U}}\) and \(\hat{\mathbf {V}}\) and is defined by

$$\begin{aligned} df_{\lambda _1, \lambda _2} = \sum _{i=1}^{p} \sum _{j=1}^r \mathbbm {1}(\hat{u}_{ij} \ne 0) + \sum _{i=1}^{q} \sum _{j=1}^r \mathbbm {1}(\hat{v}_{ij} \ne 0) - 1. \end{aligned}$$
(21)

We select the tuning parameters \(\lambda _1\) and \(\lambda _2\) that minimize the BIC. The candidate values of \(\lambda _1\) and \(\lambda _2\) are taken from equally spaced points in the interval \([\lambda _{\min }, \lambda _{\max }]\). In our numerical studies, we set \(\lambda _{\max }=1\) and \(\lambda _{\min }=10^{-15}\) and divide the interval into 50 equal parts.
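A minimal R sketch of the criterion (19)–(21) for a fitted triplet \((\hat{\mathbf {U}}, \hat{\mathbf {D}}, \hat{\mathbf {V}})\) follows; the function name bic_rvs is illustrative.

```r
# Sketch of the BIC (19) with SSE (20) and degrees of freedom (21).
bic_rvs <- function(Y, X, U_hat, D_hat, V_hat) {
  n <- nrow(Y); q <- ncol(Y)
  sse <- norm(Y - X %*% U_hat %*% D_hat %*% t(V_hat), type = "F")^2   # (20)
  df  <- sum(U_hat != 0) + sum(V_hat != 0) - 1                        # (21)
  log(sse / (n * q)) + (log(q * n) / (n * q)) * df                    # (19)
}
```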

5 Numerical study

5.1 Monte Carlo simulations

We conducted Monte Carlo simulations to illustrate the efficacy of RVSManOpt. In our simulation study, we generated 50 datasets from the model:

$$\begin{aligned} \mathbf {Y} = \mathbf {XC} + \mathbf {E}, \end{aligned}$$
(22)

where \(\mathbf {Y} \in \mathbb {R}^{n \times q}\) is a response matrix, \(\mathbf {X} \in \mathbb {R}^{n \times p}\) is a predictor matrix, \(\mathbf {C} \in \mathbb {R}^{p \times q}\) is a coefficient matrix, and \(\mathbf {E} = {[\mathbf {e}_1, \dots , \mathbf {e}_n]}^\mathsf {T} \in \mathbb {R}^{n \times q}\) is an error matrix. Each row of \(\mathbf {X}\) followed a multivariate normal distribution \(\mathcal {N}(\mathbf {0}, \varvec{\Gamma })\), where \(\varvec{\Gamma } = [\gamma _{ij}]\) is a \(p \times p\) covariance matrix with \(\gamma _{ij} = 0.5^{|i-j|}\) for \(i,j = 1,\dots ,p\). We generated each row of \(\mathbf {E}\) by \(\mathbf {e}_i \overset{\mathrm{i.i.d.}}{\sim } \mathcal {N}(\mathbf {0}, \sigma ^2 \varvec{\Delta })\), where \(\varvec{\Delta } = [\delta _{ij}]\) is a \(q \times q\) matrix with \(\delta _{ij} = \rho ^{|i-j|}\) and \(\sigma \) is determined by the signal-to-noise ratio \(\mathrm {SNR} = \left\Vert d_r \mathbf {X} \mathbf {u}_r {\mathbf {v}_r}^\mathsf {T} \right\Vert _2 / \left\Vert \mathbf {E} \right\Vert _2 = 0.5\). We considered ranks of the coefficient matrix \(r \in \{3, 5, 7, 10, 12 \}\). We generated the coefficient matrix \(\mathbf {C} = \mathbf {UD} {\mathbf {V}}^\mathsf {T}\), where \(\mathbf {U} = [\mathbf {u}_1, \dots , \mathbf {u}_r]\), \(\mathbf {D} = \mathrm {diag}(d_1, \dots , d_r)\), and \(\mathbf {V} = [\mathbf {v}_1, \dots , \mathbf {v}_r]\). Specifically, we set

$$\begin{aligned} d_k&= 5 + 0.1(k-1), \quad k = 1, \dots ,r,\\ \mathbf {u}_k&= \bar{\mathbf {u}}_k/\left\Vert \bar{\mathbf {u}}_k \right\Vert _2,\\ \bar{\mathbf {u}}_1&= {[\check{\mathbf {u}}, \mathrm {rep}(0, p-8)]}^\mathsf {T}, \bar{\mathbf {u}}_k = {[\mathrm {rep}(0,5(k-1)),\check{\mathbf {u}}, \mathrm {rep}(0, p-(5k+3))]}^\mathsf {T},\\ \check{\mathbf {u}}&= [1, -1, 1, -1, 0.5, -0.5, 0.5, -0.5],\\ \mathbf {v}_k&= \bar{\mathbf {v}}_k/\left\Vert \bar{\mathbf {v}}_k \right\Vert _2,\\ \bar{\mathbf {v}}_1&= {[\check{\mathbf {v}}, \mathrm {rep}(0, q-4)]}^\mathsf {T}, \bar{\mathbf {v}}_k = {[\mathrm {rep}(0,4(k-1)), \check{\mathbf {v}}, \mathrm {rep}(0, q-4k)]}^\mathsf {T},\\ \check{\mathbf {v}}&= [1, -1, 0.5, -0.5], \end{aligned}$$

where \(\mathrm {rep}(a, b)\) represents the vector of length b with all elements having the value a. We considered four cases. In Cases 1 and 2, we set \(n=400, p=120\), and \(q=60\) in common, and we set the correlation as \(\rho =0.3\) (Case 1) or \(\rho =0.5\) (Case 2). In Cases 3 and 4, we set \(n=400, p=80\), and \(q=50\) in common, and we set the correlation as \(\rho =0.3\) (Case 3) or \(\rho =0.5\) (Case 4).
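The following R sketch reproduces this data-generating design for a single dataset (Case 1 with \(r = 3\) is shown); here \(\sigma \) is a placeholder, whereas in the actual experiments it is determined from the stated SNR of 0.5.

```r
# Sketch of the simulation design in Sect. 5.1 (Case 1, r = 3; sigma is a placeholder).
library(MASS)  # for mvrnorm

rep0 <- function(b) rep(0, b)
make_coef <- function(p, q, r) {
  u_chk <- c(1, -1, 1, -1, 0.5, -0.5, 0.5, -0.5)
  v_chk <- c(1, -1, 0.5, -0.5)
  U <- sapply(seq_len(r), function(k) {
    u_bar <- if (k == 1) c(u_chk, rep0(p - 8))
             else c(rep0(5 * (k - 1)), u_chk, rep0(p - (5 * k + 3)))
    u_bar / sqrt(sum(u_bar^2))                    # u_k = u_bar / ||u_bar||
  })
  V <- sapply(seq_len(r), function(k) {
    v_bar <- if (k == 1) c(v_chk, rep0(q - 4))
             else c(rep0(4 * (k - 1)), v_chk, rep0(q - 4 * k))
    v_bar / sqrt(sum(v_bar^2))                    # v_k = v_bar / ||v_bar||
  })
  D <- diag(5 + 0.1 * (seq_len(r) - 1), r)        # d_k = 5 + 0.1 (k - 1)
  U %*% D %*% t(V)                                # C = U D V^T
}

n <- 400; p <- 120; q <- 60; r <- 3; rho <- 0.3; sigma <- 1
Gamma <- 0.5^abs(outer(1:p, 1:p, "-"))            # predictor covariance
Delta <- rho^abs(outer(1:q, 1:q, "-"))            # error correlation structure
X <- mvrnorm(n, rep(0, p), Gamma)
E <- mvrnorm(n, rep(0, q), sigma^2 * Delta)
Y <- X %*% make_coef(p, q, r) + E
```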

To demonstrate the efficacy of RVSManOpt, we compared it with RRR (Mukherjee et al. 2015), SeCURE with the adaptive lasso (SeCURE(AL)), and SeCURE with the adaptive elastic net (SeCURE(AE)). Over the 50 datasets, we measured the estimation accuracy of the coefficient matrix, \(\mathrm {Er}(\mathbf {C})\), the prediction accuracy, \(\mathrm {Er}(\mathbf {XC})\), and the absolute error of the selected rank, \(\mathrm {Er}(r)\). These are defined as

$$\begin{aligned} \mathrm {Er}(\mathbf {C})&= \frac{1}{50} \sum _{k=1}^{50}\frac{\left\Vert \hat{\mathbf {C}}^{(k)} - \mathbf {C}^{(k)} \right\Vert _F^2}{pq}, \end{aligned}$$
(23)
$$\begin{aligned} \mathrm {Er}(\mathbf {XC})&= \frac{1}{50} \sum _{k=1}^{50}\frac{\left\Vert \varvec{\Gamma }^{\frac{1}{2}}(\hat{\mathbf {C}}^{(k)} - \mathbf {C}^{(k)}) \right\Vert _F^2}{nq}, \end{aligned}$$
(24)
$$\begin{aligned} \mathrm {Er}(r)&= \frac{1}{50} \sum _{k=1}^{50}|\hat{r}^{(k)} - r^{(k)}|, \end{aligned}$$
(25)

where \(\mathbf {C}^{(k)}\) is the true coefficient matrix, \(r^{(k)}\) is the true rank of the coefficient matrix \(\mathbf {C}^{(k)}\), \(\hat{\mathbf {C}}^{(k)}\) is an estimated coefficient matrix, and \(\hat{r}^{(k)}\) is the selected rank of coefficient matrix \(\hat{\mathbf {C}}^{(k)}\) for the kth dataset. In order to evaluate the sparsity, we computed Precision, Recall, and F-measure defined by

$$\begin{aligned} \text {Precision}&= \frac{1}{50} \sum _{k=1}^{50} \mathrm {Precision}^{(k)},\\ \text {Recall}&= \frac{1}{50} \sum _{k=1}^{50} \mathrm {Recall}^{(k)},\\ \text {F-measure}&= \frac{1}{50} \sum _{k=1}^{50} 2 \cdot \frac{\mathrm {Recall}^{(k)} \cdot \mathrm {Precision}^{(k)}}{\mathrm {Recall}^{(k)} + \mathrm {Precision}^{(k)}}, \end{aligned}$$

where \(\mathrm {Precision}^{(k)}\) and \(\mathrm {Recall}^{(k)}\) are defined by

$$\begin{aligned} \text {Precision}^{(k)}&= \frac{ \sum _{ij} \left| \left\{ u_{ij} \ne 0 \wedge \hat{u}_{ij}^{(k)} \ne 0 \right\} \right| }{ \sum _{ij} \left| \left\{ \hat{u}_{ij}^{(k)} \ne 0\right\} \right| } + \frac{ \sum _{ij} \left| \left\{ v_{ij} \ne 0 \wedge \hat{v}_{ij}^{(k)} \ne 0 \right\} \right| }{ \sum _{ij} \left| \left\{ \hat{v}_{ij}^{(k)} \ne 0\right\} \right| },\\ \text {Recall}^{(k)}&= \frac{ \sum _{ij} \left| \left\{ u_{ij} \ne 0 \wedge \hat{u}_{ij}^{(k)} \ne 0 \right\} \right| }{ \sum _{ij} \left| \left\{ u_{ij} \ne 0\right\} \right| } + \frac{ \sum _{ij} \left| \left\{ v_{ij} \ne 0 \wedge \hat{v}_{ij}^{(k)} \ne 0 \right\} \right| }{ \sum _{ij} \left| \left\{ v_{ij} \ne 0\right\} \right| }, \end{aligned}$$

for which \(\hat{u}_{ij}^{(k)}\) and \(\hat{v}_{ij}^{(k)}\) are the elements of the estimated \(\mathbf {U}\) and \(\mathbf {V}\), respectively, for the kth dataset, and \(|\{\cdot \}|\) denotes the number of elements of the set \(\{\cdot \}\). All implementations were done in R (ver. 3.6) (R Core Team 2018).
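For completeness, the following R sketch computes \(\mathrm {Precision}^{(k)}\), \(\mathrm {Recall}^{(k)}\), and the F-measure for a single dataset exactly as defined above; the function names are illustrative.

```r
# Sketch of the support-recovery metrics for one dataset, following the
# definitions above (each is the sum of a U-part and a V-part).
precision_k <- function(U, U_hat, V, V_hat) {
  sum(U != 0 & U_hat != 0) / sum(U_hat != 0) +
    sum(V != 0 & V_hat != 0) / sum(V_hat != 0)
}
recall_k <- function(U, U_hat, V, V_hat) {
  sum(U != 0 & U_hat != 0) / sum(U != 0) +
    sum(V != 0 & V_hat != 0) / sum(V != 0)
}
fmeasure_k <- function(prec, rec) 2 * rec * prec / (rec + prec)
```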

Tables 2, 3, 4, and 5 summarize the results for Cases 1 to 4 of the Monte Carlo simulations, respectively. In Cases 1 and 2, when the rank of the coefficient matrix \(\mathbf {C}\) is high, RVSManOpt outperforms the other methods in terms of \(\mathrm {Er}(\mathbf {C})\), \(\mathrm {Er}(\mathbf {XC})\), and \(\mathrm {Er}(r)\). In contrast, when the rank of the coefficient matrix \(\mathbf {C}\) is low, the performances of all methods are approximately the same. Moreover, the F-measure is almost the same for RVSManOpt, SeCURE(AL), and SeCURE(AE); since RRR does not assume sparsity in the parameters, its F-measure is small. In Cases 3 and 4, RRR outperforms the other methods in terms of \(\mathrm {Er}(\mathbf {C})\), \(\mathrm {Er}(\mathbf {XC})\), and \(\mathrm {Er}(r)\), whereas it is inferior to the other methods in Cases 1 and 2. This shows that the estimation performance of RRR depends strongly on the true structure of the model. In contrast, RVSManOpt achieves performance that is superior to or competitive with the other methods across the various situations.

Figure 1 shows box-plots of \(\mathrm {Er}(\mathbf {XC})\) for Case 1. Since the box-plots for the other cases are essentially the same, except for RRR, they are provided in the supplementary materials. When the rank of the coefficient matrix \(\mathbf {C}\) is high, we observe many outliers in the box-plots of RRR, SeCURE(AL), and SeCURE(AE), indicating that these methods frequently fail to estimate the parameters. On the other hand, RVSManOpt produces few outliers, and hence it outperforms the other methods in terms of the stability of estimation.

Table 2 Results for the Monte Carlo simulations in Case 1. For readability, \(\mathrm {Er}(\mathbf {C})\) and \(\mathrm {Er}(\mathbf {XC})\) are multiplied by \(10^4\). (sd) denotes the standard deviation of the quantity in the column to its left
Fig. 1

Box-plots of scaled \(\mathrm {Er}(\mathbf {XC})\) for each rank r in Case 1

Table 3 Results for the Monte Carlo simulations in Case 2. For readability, \(\mathrm {Er}(\mathbf {C})\) and \(\mathrm {Er}(\mathbf {XC})\) are multiplied by \(10^4\). (sd) denotes the standard deviation of the quantity in the column to its left
Table 4 Results for the Monte Carlo simulations in Case 3. For readability, \(\mathrm {Er}(\mathbf {C})\) and \(\mathrm {Er}(\mathbf {XC})\) are multiplied by \(10^4\). (sd) denotes the standard deviation of the quantity in the column to its left
Table 5 Results for the Monte Carlo simulations in Case 4. For readability, \(\mathrm {Er}(\mathbf {C})\) and \(\mathrm {Er}(\mathbf {XC})\) are multiplied by \(10^4\). (sd) denotes the standard deviation of the quantity in the column to its left

5.2 Application to yeast cell cycle dataset

We applied RVSManOpt to the yeast cell cycle data (Spellman et al. 1998), available in the secure package (Mishra et al. 2017) in R. Analysis of the yeast cell cycle enables us to identify transcription factors (TFs) that regulate ribonucleic acid (RNA) levels within the eukaryotic cell cycle. The dataset contains two components: chromatin immunoprecipitation (ChIP) data and eukaryotic cell cycle data. The ChIP data include the binding information of a subset of 1790 genes and 113 TFs (Lee et al. 2002). The cell cycle data were obtained by measuring the RNA levels every 7 minutes for 119 minutes, a total of 18 time points covering two cell cycles. Since the dataset contained missing values, we imputed them using the imputeMissings package in R. After imputation, we can use all \(n = 1790\) genes and analyze the relationship between the RNA levels at the \(q = 18\) time points and the \(p = 113\) TFs. We compared RVSManOpt with SeCURE(AL) and SeCURE(AE) by counting the experimentally confirmed TFs among the selected TFs and computing the proportion of experimentally confirmed TFs. It is known that 21 TFs have been experimentally confirmed to be involved in cell cycle regulation (Wang et al. 2007).

Table 6 gives the results of the real data analysis. For RVSManOpt, the proportion of experimentally confirmed TFs is larger than for either SeCURE(AL) or SeCURE(AE). RVSManOpt estimated \(\hat{r} = 5\), while SeCURE(AL) and SeCURE(AE) estimated \(\hat{r} = 4\). This result suggests that RVSManOpt may capture the latent structure of the yeast cell cycle data more precisely by identifying five latent factors.

Figure 2 shows the estimated transcription levels of three of the experimentally confirmed TFs selected by RVSManOpt; plots for the remaining 12 experimentally confirmed TFs are available in the supplementary materials. Figure 2 indicates that the estimated transcription levels follow two cycles, and it has been experimentally confirmed that the transcription levels over the measurement period do cover two cell cycles. Thus, RVSManOpt accurately estimates the cyclic behavior of the data.

Table 6 Results of analysis of yeast cell cycle dataset
Fig. 2

Plots of the estimated transcription levels of 3 experimentally confirmed TFs selected by RVSManOpt. The vertical axis shows the estimated transcription level and the horizontal axis shows the measurement time

6 Concluding remarks

We proposed a minimization problem of SFAR on a Stiefel manifold and developed the factor extraction algorithm with rank and variable selection via sparse regularization and manifold optimization (RVSManOpt). RVSManOpt surpassed the traditional estimation procedure, which fails when the rank of the coefficient matrix is high. Numerical comparisons including Monte Carlo simulations and a real data analysis supported the usefulness of RVSManOpt.

In general, it is challenging to estimate parameters while preserving both orthogonality and sparsity. Mishra et al. (2017) indicate that enforcing orthogonality can collapse sparsity and harm predictive performance; therefore, it may be unnecessary to construct a model with perfect orthogonality if the focus is on prediction. In addition, the recent paper by Absil and Hosseini (2019) discusses a theory of manifold optimization for non-smooth functions, and it would be interesting to develop RVSManOpt based on this theory. Furthermore, a convergence analysis needs to be established for a more detailed understanding of RVSManOpt. We leave these as future topics.