CenetBiplot: a new proposal of sparse and orthogonal biplot methods by means of elastic net CSVD

In this work, a new mathematical algorithm for sparse and orthogonal constrained biplots, called CenetBiplots, is proposed. Biplots provide a joint representation of the observations and variables of a multidimensional matrix in the same reference system, a subspace in which the relationships between them can be interpreted in terms of geometric elements. CenetBiplots projects a matrix onto a low-dimensional space generated simultaneously by sparse and orthogonal principal components. Sparsity is desired to select variables automatically, and orthogonality is necessary to keep the geometrical properties that ensure the graphical interpretation of biplots. To this end, the present study focuses on two different objectives: 1) the extension of the constrained singular value decomposition to incorporate an elastic net sparsity constraint (CenetSVD), and 2) the implementation of CenetBiplots using CenetSVD. The usefulness of the proposed methodologies for analysing high-dimensional and low-dimensional matrices is shown. Our method is implemented in R software and available for download from https://github.com/ananieto/SparseCenetMA.


Introduction
Principal component analysis (PCA) is the most widely used multivariate statistical technique for projecting a data set onto a lower-dimensional space while preserving as much variability as possible (Jolliffe et al. 2016). The basis vectors of this new subspace, known as principal components (PCs), are obtained as linear combinations of the original variables; the coefficients of those combinations are called loadings. Each PC is calculated in terms of all the variables, so the results can be difficult to interpret. Therefore, several alternatives have been proposed to produce modified PCs with some zero loadings (named sparse loadings). These alternatives are known as sparse PCA (Jolliffe et al. 2003; Zou et al. 2006; Shen and Huang 2008; Journée et al. 2010; Li et al. 2016), and they achieve sparsity by adding sparsity-promoting constraints to the optimization problem. Different constraint techniques have been proposed in the literature, but among the most widely used are Ridge (Hoerl and Kennard 1988) and Lasso (Tibshirani 1996). Ridge shrinks the coefficients towards zero and encourages highly correlated variables to have similar coefficients. Lasso, on the other hand, sets some of them exactly to zero, but tends to choose a single variable from a set of highly correlated variables, discarding the others. To overcome this, Zou and Hastie (2005) proposed the elastic net (enet), which combines Lasso and Ridge to preserve both favourable properties. In addition, the enet is particularly useful when the number of variables is higher than the number of observations (Zou and Hastie 2005).
It is important to emphasise that in some of these sparse PCA techniques, the orthogonality of the loading matrix is lost at the expense of sparsity. Thus, some authors, such as Trendafilov (2014) and Genicot et al. (2015), have proposed methods that provide sparse and orthogonal components simultaneously.
The coordinates of the observations and the variables on the first components are used to represent them graphically in the score and loading plots, respectively. To visualize both on the same reference system simultaneously, Gabriel (1971) and Galindo (1986) proposed the use of biplot methods, which have been applied in several fields (Xavier et al. 2018; Amor-Esteban et al. 2019; Carrasco et al. 2019; Bernal et al. 2020). Biplots define a common reference system where the rows and the columns of a matrix can be jointly displayed, so that the relationships between them can be interpreted by means of geometric elements in a Euclidean space (distances, angles, projections, etc.) (Gabriel 1971; Galindo 1986).
In the case of sparse biplots, only two techniques related to sparse loadings are mentioned in the literature: the CDBiplot (Nieto-Librero et al. 2017), based on the CDPCA of Vichi and Saporta (2009), and the Elastic-net HJ-Biplot (Cubilla-Montilla et al. 2021), based on the SPCA of Zou et al. (2006). On the one hand, the CDBiplot extracts disjoint PCs, in which each original variable contributes to the construction of only one dimension. On the other hand, the Elastic-net HJ-Biplot does not provide orthogonal sparse PCs, even though orthogonality is necessary to keep the geometrical properties that allow the interpretation of biplots. Moreover, in that work the biplot coordinates are estimated only after the sparse loading matrix has been obtained from the SPCA (Zou et al. 2006). Nevertheless, as some authors have pointed out, it is important to obtain the results in a single optimization process and not in a tandem analysis (Vichi and Saporta 2009; Nieto-Librero et al. 2017).
All things considered, our main objective is to propose a new mathematical technique, called CenetBiplots, that simultaneously incorporates the orthogonality of the PCs and the selection of variables by means of the enet sparsity constraint. Since the biplot solution is obtained from the singular value decomposition (SVD) (Gabriel 1971; Galindo 1986), our research focuses on the sparse and orthogonal SVD via Lasso proposed by Guillemot et al. (2019), imposing the enet constraint instead to overcome the disadvantages mentioned above.
Therefore, this work is structured as follows. Section 2 includes the notation, and Sect. 3 defines the extension of the constrained singular value decomposition as the solution of a convex optimization problem with enet and orthogonality restrictions. Section 3 also presents the algorithm used to solve CenetSVD, extending the projection onto convex sets (POCS) algorithm in the sense of a divide-and-conquer algorithm. Section 4 presents the implementation of the sparse and orthogonal biplot methods, known as the CenetBiplots. The selection of the sparsity parameters is proposed in Sect. 5. Section 6 shows the usefulness of these methodologies by analysing high-dimensional real genomic data and low-dimensional psychometric data. Finally, Sect. 7 includes a discussion and the main conclusions of the study.

Notation
We present below the notation and terminology used in this manuscript. $\mathbf{X}_{IJ}$ denotes a matrix with the information of $I$ observations in the rows and $J$ variables in the columns. The elements of a matrix $\mathbf{X}$ are denoted as $x_{ij}$. The transpose of a matrix $\mathbf{X}$ is denoted by $\mathbf{X}^T$, and its inverse by $\mathbf{X}^{-1}$. The $\ell_2$ norm of a matrix $\mathbf{X}$ is the Frobenius norm $\|\mathbf{X}\|_F$. The $\ell_2$ norm of a vector $\mathbf{x} = \{x_j\}_{j=1}^{J}$ is computed as $\|\mathbf{x}\|_2 = \sqrt{\sum_j x_j^2}$, and the $\ell_1$ norm as $\|\mathbf{x}\|_1 = \sum_j |x_j|$. A vector is normalized when it is divided by its $\ell_2$ norm. Constraint balls are defined as the regions $B_2(c) = \{\mathbf{x} : \|\mathbf{x}\|_2 \le c\}$, $B_1(c) = \{\mathbf{x} : \|\mathbf{x}\|_1 \le c\}$ and, for the elastic net, $B_{1+2}(\tau) = \{\mathbf{x} : (1-\alpha)\|\mathbf{x}\|_1 + \alpha\|\mathbf{x}\|_2^2 \le \tau\}$ with $\alpha \in [0, 1)$.
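For readers who prefer code, the notation above translates directly into a few R helpers. These are our own illustrative definitions, with the enet ball `in_B1p2` written in the form consistent with the feasible interval $\tau \in [1, (1-\alpha)\sqrt{J}+\alpha]$ used in Sect. 5.2.

```r
# Illustrative R helpers for the notation (our own definitions).
l2_norm <- function(x) sqrt(sum(x^2))          # ||x||_2
l1_norm <- function(x) sum(abs(x))             # ||x||_1
normalize <- function(x) x / l2_norm(x)        # unit l2 norm

# Membership checks for the constraint balls used throughout the paper.
in_B2   <- function(x, c) l2_norm(x) <= c
in_B1   <- function(x, c) l1_norm(x) <= c
in_B1p2 <- function(x, tau, alpha)             # elastic net ball
  (1 - alpha) * l1_norm(x) + alpha * l2_norm(x)^2 <= tau
```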

Singular value decomposition
Given a matrix $\mathbf{X}_{IJ}$ of rank $R \le \min(I, J)$, the SVD of $\mathbf{X}$ is defined as the product

$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T, \quad (1)$$

where $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_I]$ and $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_J]$ are orthonormal matrices, $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ and $\mathbf{V}^T\mathbf{V} = \mathbf{I}$. $\mathbf{U}$ contains the left-singular vectors of the SVD in columns, $\mathbf{V}$ contains the right-singular vectors, and $\mathbf{D}$ is a diagonal matrix containing the singular values $d_r$ of $\mathbf{X}$ ($r = 1, \ldots, R$), conveniently ordered so that $d_1 \ge d_2 \ge \cdots \ge d_R \ge 0$. For a given $Q \le R$, the SVD provides the best rank-$Q$ approximation $\mathbf{X}_Q$ of $\mathbf{X}$ in the least squares sense, minimizing the $\ell_2$ norm of the difference between the initial and the reconstructed matrices (Eckart and Young 1936; Shen and Huang 2008). $\mathbf{X}_Q$ is defined as

$$\mathbf{X}_Q = \sum_{q=1}^{Q} d_q \mathbf{u}_q \mathbf{v}_q^T = \mathbf{U}_Q \mathbf{D}_Q \mathbf{V}_Q^T. \quad (2)$$

Frequently, the orthogonal singular vectors of the SVD are computed using the power iteration algorithm together with a deflation approach. Instead, Guillemot et al. (2019) suggest obtaining the singular vectors by the projection onto convex sets (POCS) algorithm (Bauschke and Combettes 2017).
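As a minimal base-R illustration of (1) and (2), the following lines compute the best rank-$Q$ approximation of a toy matrix; the data and the choice $Q = 2$ are arbitrary.

```r
# Best rank-Q least squares approximation of X from its (unconstrained) SVD.
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20)                    # toy data: I = 20, J = 5
s <- svd(X)                                              # X = U D V^T, d1 >= d2 >= ...
Q <- 2
X_Q <- s$u[, 1:Q] %*% diag(s$d[1:Q]) %*% t(s$v[, 1:Q])   # rank-Q reconstruction
sqrt(sum((X - X_Q)^2))                                   # Frobenius error ||X - X_Q||_F
```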

Extension of constrained singular value decomposition to elastic net (CenetSVD)
CenetSVD provides a factorization of $\mathbf{X}_{IJ}$ by means of sparse and orthogonal singular vectors (called pseudo-singular vectors) and pseudo-singular values. The key point of CenetSVD is the calculation of sparse and orthogonal vectors simultaneously. The CenetSVD formulation is based on the constrained optimization problem of the CSVD proposed by Guillemot et al. (2019), replacing the Lasso by the enet restriction:

$$\min_{\mathbf{U}, \mathbf{D}, \mathbf{V}} \left\| \mathbf{X} - \mathbf{U}\mathbf{D}\mathbf{V}^T \right\|_F^2 \quad \text{s.t.} \quad \mathbf{U}^T\mathbf{U} = \mathbf{I}, \; \mathbf{V}^T\mathbf{V} = \mathbf{I}, \; \mathbf{u}_q \in B_{1+2}(\tau_{1,q}), \; \mathbf{v}_q \in B_{1+2}(\tau_{2,q}), \quad (3)$$

where $\tau_{1,q}, \tau_{2,q} > 0$ are the shrinkage parameters that control the degree of sparsity included in the constrained model: the higher $\tau_1$ or $\tau_2$ is, the fewer sparse coefficients there are. It is important to remark that only some values of $\tau_{1,q}$ and $\tau_{2,q}$ lead to feasible solutions (see Sect. 5.2 and Online Resource 2A). The parameter $\alpha \in [0, 1)$ defines the amount of the Lasso or the Ridge constraint included in the enet restriction.
To find the solution of (3), an iterative process is defined (Guillemot et al. 2019). First, it is necessary to establish an equivalent form of the previous minimization problem. Equation (3) is equivalent to:

$$\max_{\mathbf{u}, \mathbf{v}} \; \mathbf{u}^T \mathbf{X} \mathbf{v} \quad \text{s.t.} \quad \mathbf{u} \in B_{1+2}(\tau_{1,q}) \cap B_2(1) \cap \mathbf{U}^{\perp}, \;\; \mathbf{v} \in B_{1+2}(\tau_{2,q}) \cap B_2(1) \cap \mathbf{V}^{\perp}, \quad (4)$$

with the vectors $\mathbf{v}_{q'}$ and $\mathbf{u}_{q'}$ previously calculated for $0 \le q' < q$ and $q \ge 1$. Equation (4) is solved by block relaxation, an iterative process that alternates between two parts:

(1) Maximize (4) to find the vector solution $\mathbf{u}$ with $\mathbf{v}$ fixed:

$$\mathbf{u}^{t+1} = \mathrm{proj}_{B_{1+2}(\tau_1) \cap B_2(1) \cap \mathbf{U}^{\perp}}\left(\mathbf{X} \mathbf{v}^{t}\right),$$

where $\mathbf{U}^{\perp}$ is the orthogonal complement of the space spanned by the columns of the matrix $\mathbf{U}$. These constraints involve projections of a vector onto a convex space. In addition, it should be noted that the intersection of two convex spaces is also convex; thus, the problem can be solved by using the POCS and the PL1L2 algorithms.

(2) Maximize (4) to find the vector solution $\mathbf{v}$ with $\mathbf{u}$ fixed:

$$\mathbf{v}^{t+1} = \mathrm{proj}_{B_{1+2}(\tau_2) \cap B_2(1) \cap \mathbf{V}^{\perp}}\left(\mathbf{X}^T \mathbf{u}^{t+1}\right),$$

where $\mathbf{V}^{\perp}$ is the orthogonal complement of the space spanned by the columns of the matrix $\mathbf{V}$. This projection is carried out in the same manner as for (1).
The global optimization problem of CenetSVD is handled using the POCS algorithm (Table 1). Lines 6 and 7 are modified from Guillemot et al. (2019) to address the problem of the projection onto the $B_{1+2} \cap B_2$ space.

Table 1: Algorithm for the implementation of CenetSVD based on the POCS algorithm.
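To fix ideas, the sketch below outlines one block-relaxation sweep in R under stated assumptions. It is not the authors' PL1L2 implementation: the projection onto $B_{1+2}(\tau) \cap B_2(1)$ is approximated by a bisection search on a soft-threshold level, in the spirit of Witten et al. (2009), and all function names are hypothetical.

```r
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)
enet_value <- function(x, alpha) (1 - alpha) * sum(abs(x)) + alpha * sum(x^2)

# Approximate projection onto B_1+2(tau) intersected with B_2(1):
# bisection on the soft-threshold level, renormalizing to unit l2 norm.
proj_enet_ball <- function(x, tau, alpha, eps = 1e-12) {
  x <- x / max(sqrt(sum(x^2)), eps)                  # project onto B_2(1)
  if (enet_value(x, alpha) <= tau) return(x)
  lo <- 0; hi <- max(abs(x))
  for (k in 1:50) {
    lam <- (lo + hi) / 2
    z <- soft_threshold(x, lam)
    z <- z / max(sqrt(sum(z^2)), eps)
    if (enet_value(z, alpha) > tau) lo <- lam else hi <- lam
  }
  z <- soft_threshold(x, hi)                         # feasible threshold level
  z / max(sqrt(sum(z^2)), eps)
}

# One block-relaxation sweep for the q-th triplet; Uq and Vq hold the
# previously computed vectors (orthogonality via projection onto U-perp, V-perp).
update_uv <- function(X, v, Uq, Vq, tau1, tau2, alpha) {
  u <- drop(X %*% v)
  if (ncol(Uq) > 0) u <- u - drop(Uq %*% (t(Uq) %*% u))
  u <- proj_enet_ball(u, tau1, alpha)
  v <- drop(t(X) %*% u)
  if (ncol(Vq) > 0) v <- v - drop(Vq %*% (t(Vq) %*% v))
  v <- proj_enet_ball(v, tau2, alpha)
  list(u = u, v = v, d = drop(t(u) %*% X %*% v))     # pseudo-singular value
}
```

In this outline, sweeping `update_uv` to convergence for each dimension $q$ in turn mirrors the alternating projections (1) and (2) above.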

Extension of CenetSVD to the biplot methods
The results of traditional PCA and biplot methods are often calculated from the SVD of a matrix. Consequently, sparse and orthogonal constrained PCA (CenetPCA) and sparse and orthogonal constrained biplots (CenetBiplots) are obtained from the results of the CenetSVD. Set the CenetSVD of $\mathbf{X}$ as the low $Q$-rank approximation of the original matrix, $\mathbf{X} \approx \mathbf{U}_{enet} \mathbf{D} \mathbf{V}_{enet}^T$, where $\mathbf{U}_{enet}$ and $\mathbf{V}_{enet}$ are sparse and orthonormal matrices.

Sparse and orthogonal PCA
The main objective of CenetPCA is to project the original matrix onto a subspace determined by a new set of $Q < J$ sparse and orthogonal PCs. Given $\mathbf{X}_{IJ}$, the sparse $q$-th CenetPC is defined as a linear combination of the initial variables:

$$\mathbf{Y}_{IQ} = \mathbf{X} \mathbf{V}_{enet},$$

where $\mathbf{Y}_{IQ}$ is the score matrix, which contains the coordinates of the observations in the new subspace, and $\mathbf{V}_{enet}$ is the sparse loading matrix. A flowchart of CenetPCA is shown in Fig. 1a.
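As a toy illustration of this projection, the lines below compute the score matrix from a placeholder sparse loading matrix; in practice $\mathbf{V}_{enet}$ would be obtained from a CenetSVD routine such as the one implemented in SparseCenetMA.

```r
# Scores as the projection of the centred data onto sparse loadings.
set.seed(2)
X <- scale(matrix(rnorm(100 * 4), nrow = 100), center = TRUE, scale = FALSE)
V_enet <- cbind(c(0.8, 0.6, 0, 0),          # toy sparse, orthonormal loadings
                c(0, 0, 1, -1) / sqrt(2))   # (stand-in for a CenetSVD result)
Y <- X %*% V_enet                           # I x Q score matrix
```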

Sparse and orthogonal Biplots
Biplot methods are optimal tools to visualize multivariate data in a low-dimensional space. The relationships between the observations and the variables can be interpreted in a scatterplot thanks to the properties of the inner product. Gabriel (1971) proposed the JK and GH biplots, which assign an optimal quality of representation to the rows (JK-Biplot) or to the columns (GH-Biplot) in the same Euclidean space. In order to provide the highest-quality representation for both rows and columns in the same reference system, Galindo (1986) proposed the HJ-Biplot. To this end, the original matrix is factorized as $\mathbf{X} \approx \mathbf{A}\mathbf{B}^T$, so that the inner product $\mathbf{a}_i^T \mathbf{b}_j$ approximates the element $x_{ij}$ as closely as possible. $\mathbf{A}_{IQ}$ and $\mathbf{B}_{JQ}$ are the row and column marker matrices, respectively. The pseudo-singular vectors and values obtained from the CenetSVD of $\mathbf{X}$ are used to implement the sparse and orthogonal biplots: the CenetJK-Biplot sets the row and column markers as $\mathbf{A} = \mathbf{U}_{enet}\mathbf{D}$ and $\mathbf{B} = \mathbf{V}_{enet}$; the CenetGH-Biplot sets $\mathbf{A} = \mathbf{U}_{enet}$ and $\mathbf{B} = \mathbf{V}_{enet}\mathbf{D}$; and the CenetHJ-Biplot sets $\mathbf{A} = \mathbf{U}_{enet}\mathbf{D}$ and $\mathbf{B} = \mathbf{V}_{enet}\mathbf{D}$ (Fig. 1b).
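A compact way to form the three sets of markers from a CenetSVD result is sketched below; the `cenet` list layout (`$U`, `$d`, `$V`) is an assumed interface, not necessarily that of SparseCenetMA.

```r
# Row (A) and column (B) markers for the three CenetBiplots.
markers <- function(cenet, type = c("HJ", "JK", "GH")) {
  type <- match.arg(type)
  D <- diag(cenet$d, nrow = length(cenet$d))
  switch(type,
    JK = list(A = cenet$U %*% D, B = cenet$V),         # optimal for rows
    GH = list(A = cenet$U,       B = cenet$V %*% D),   # optimal for columns
    HJ = list(A = cenet$U %*% D, B = cenet$V %*% D))   # optimal for both
}
```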
To interpret the CenetBiplots graphical representation, it is important to note that: i) observations are represented by dots and variables by vectors in the graph; ii) distances between points reflect the dissimilarities between observations; iii) the lengths of the vectors refer to the variability of the variables; iv) relationships between variables are interpreted from the cosine of the angles between the corresponding vectors (obtuse: inverse relationship; acute: direct relationship; right angle: linear independence); and v) the projections of the points onto the direction of a vector approximate the values of the variables for those observations.
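As a numerical companion to point iv), the cosines of the angles between variable vectors can be read off the column markers; here `B` is the $J \times Q$ column marker matrix, and variables whose loadings are entirely zero yield `NaN` rows that can be ignored.

```r
# Cosines of the angles between variable vectors (rows of B):
# values near 1 suggest a direct relationship, near -1 an inverse one,
# and near 0 approximate linear independence.
variable_cosines <- function(B) {
  norms <- sqrt(rowSums(B^2))
  (B %*% t(B)) / outer(norms, norms)
}
```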

Selection of α and (1 − α)
The parameter $\alpha \in [0, 1)$ defines the amount of Lasso or Ridge constraint included in the enet restriction. Usually, $\alpha$ is set to 0.5 to add the same amount of Lasso and Ridge constraint to the model. In practice, the parameter $\alpha$ can be selected manually, but we also suggest selecting the $\alpha$ that minimizes the cross-validation error (James et al. 2014) through an iterative process. This process starts by defining an increasing sequence of possible values for $\alpha$ and segmenting the matrix into $n = 10$ folds. For each of the $n$ folds, the training matrix is formed by the observations of the submatrices in the remaining $(n-1)$ folds; the held-out observations constitute the test matrix. Then, the iterative process for each of the values $\alpha = (\alpha_1, \ldots, \alpha_t)$ is initiated. The CenetSVD of the training matrix is carried out, and the matrix $\mathbf{V}_{enet}$ is used to calculate the mean reconstruction error $MSE_{\alpha_1,1}$ of the test matrix:

$$MSE_{\alpha, k} = \frac{1}{I_{test}\, J} \left\| \mathbf{X}_{test} - \mathbf{X}_{test} \mathbf{V}_{enet} \mathbf{V}_{enet}^T \right\|_F^2 .$$

Reconstruction errors $MSE_{\alpha_1,1}, \ldots, MSE_{\alpha_1,n}$ are obtained for each of the folds; hence, the final $MSE_{\alpha_1}$ is computed as the mean of the errors $MSE_{\alpha_1,1:n}$.
This step is repeated for the whole sequence of $\alpha_t$ values. The optimum $\alpha$ is the one that provides the minimum $MSE = \min\{MSE_{\alpha_1}, \ldots, MSE_{\alpha_t}\}$.
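The cross-validation loop can be sketched in R as follows, assuming a hypothetical `cenet_svd(X, alpha, tau, Q)` wrapper that returns the sparse loading matrix as `$V`; the reconstruction error follows the definition above.

```r
# 10-fold cross-validation over a grid of alpha values (sketch).
select_alpha <- function(X, alphas, tau, Q = 2, n_folds = 10) {
  folds <- sample(rep(seq_len(n_folds), length.out = nrow(X)))
  cv_mse <- sapply(alphas, function(a) {
    mean(sapply(seq_len(n_folds), function(k) {
      X_train <- X[folds != k, , drop = FALSE]
      X_test  <- X[folds == k, , drop = FALSE]
      V <- cenet_svd(X_train, alpha = a, tau = tau, Q = Q)$V
      mean((X_test - X_test %*% V %*% t(V))^2)   # test reconstruction error
    }))
  })
  alphas[which.min(cv_mse)]
}
```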

Selection of τ
The shrinkage parameter $\tau$ inversely controls the degree of sparsity; that is, the larger its value is, the fewer zero loadings there are. In this work, we propose to select $\tau$ using the Bayesian information criterion (BIC), as in Guo and James (2010) and Croux et al. (2013):

$$BIC(\tau) = \frac{\left\| \mathbf{X} - \mathbf{X}\mathbf{V}_{enet}\mathbf{V}_{enet}^T \right\|_F^2}{\left\| \mathbf{X} - \mathbf{X}\mathbf{V}\mathbf{V}^T \right\|_F^2} + df(\tau)\, \frac{\log(IJ)}{IJ},$$

where $\mathbf{X}_{IJ}$ is the original matrix, $\mathbf{V}_{enet}$ is the right-pseudo-singular vector matrix obtained from the CenetSVD, $\mathbf{V}$ is the right-singular vector matrix obtained from the unconstrained SVD, and $df(\tau)$ is the number of non-zero elements in $\mathbf{V}_{enet}$. The parameter $\tau$ that minimizes $BIC(\tau)$ is selected from a sequence of possible values $\tau \in [1, (1-\alpha)\sqrt{J} + \alpha]$. As stated by Guillemot et al. (2019) and Witten et al. (2009), only some values of the constraints lead to solutions (see Online Resource 2).
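Under the BIC expression above, the selection of $\tau$ could be sketched as follows, reusing the hypothetical `cenet_svd()` interface from the previous sketch; the grid spans the feasible interval $[1, (1-\alpha)\sqrt{J}+\alpha]$.

```r
# BIC-guided choice of tau over its feasible interval (sketch).
select_tau <- function(X, alpha, Q = 2, n_grid = 20) {
  taus <- seq(1, (1 - alpha) * sqrt(ncol(X)) + alpha, length.out = n_grid)
  V_full <- svd(X)$v[, 1:Q, drop = FALSE]            # unconstrained loadings
  rss_full <- sum((X - X %*% V_full %*% t(V_full))^2)
  bic <- sapply(taus, function(tau) {
    V <- cenet_svd(X, alpha = alpha, tau = tau, Q = Q)$V
    rss <- sum((X - X %*% V %*% t(V))^2)
    df <- sum(V != 0)                                # non-zero loadings
    rss / rss_full + df * log(nrow(X) * ncol(X)) / (nrow(X) * ncol(X))
  })
  taus[which.min(bic)]
}
```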

Analysing data with Cenet methods
In this section, we illustrate the performance of CenetPCA and the CenetHJ-Biplot by analysing two matrices in different contexts: 1) the Object-Spatial Imagery Questionnaire (OSIQ) data ($J < I$) and 2) the MILE gene expression data ($J \gg I$). All data analyses were performed using the R software (R Core Team 2020). All figures were produced using the ggplot2 library (Wickham 2016).
Analysing the OSIQ psychometric data: a toy example

To show the usefulness of the CenetHJ-Biplot in a toy example in which the number of variables is smaller than the number of observations, a random selection of 100 simulated participants who responded to the items of the OSIQ scale (Blajenkova et al. 2006) was carried out. This scale consists of 30 items rated on a 5-point Likert scale, structured around two latent dimensions: the spatial and object imagery scales. The correlation matrix between items is shown in Online Resource 1.
Two-dimensional solutions of the traditional HJ-Biplot and the CenetHJ-Biplot are shown in Fig. 2. To obtain a low sparsity degree, the CenetHJ-Biplot was employed with a shrinkage parameter $\tau_2 = (0.5\sqrt{J} + 0.5) \cdot (3/4)$ (low level of sparsity); no sparsity restriction was imposed on individuals. In both plots, participants are represented by dots and questionnaire items by arrows. The cosine of the angles between the vectors reflects the relationships between items: items of the same theoretical dimension are strongly and directly correlated (acute angles). As can be seen, when the CenetHJ-Biplot is applied (Fig. 2b), the relationships of the items in the two constructed dimensions, even with low constraints, reflect the theoretical psychometric structure more clearly than in the case of the HJ-Biplot (Fig. 2a). Additionally, in the HJ-Biplot results (Fig. 2a), item o15 correlates directly and weakly with the items of its own scale and inversely and strongly with the items of the other scale; the CenetHJ-Biplot tends to maintain these relationships (Fig. 2b). The same can be observed with items s06, s11 and o10, which do not correlate with the items of their own dimension but whose relationships to the items of the other subscale are preserved.

Fig. 2: 2D plots of components 1-2 of the HJ-Biplot (a) and the sparse and orthogonal CenetHJ-Biplot (b). Labels starting with "o" refer to object imagery scale items, and labels starting with "s" indicate spatial imagery scale items.

Analysing leukaemia gene expression data: a genomic example
The use of statistical methods involving variable selection to analyse microarray data has become increasingly important in the context of tumour classification (Algamal and Lee 2015). In gene expression data, it is common for the number of genes to exceed the number of samples. Frequently, only a few genes carry relevant information, and therefore the use of sparse constraints for the automatic selection of genes is necessary.
To show the usefulness of CenetPCA and the CenetHJ-Biplot, in this section we analyse 216 leukaemia patients randomly selected from the GSE13204 series available at the Gene Expression Omnibus (GEO) repository. Among our samples, 32.9% had been classified as acute lymphoblastic leukaemia (ALL, n = 71) and 34.3% as chronic lymphocytic leukaemia (CLL, n = 74); the remaining 32.9% were control patients (n = 71). RNA gene expression data were extracted from the Affymetrix Human Genome U133 Plus 2.0 Array platform (HGU133Plus2). Data preprocessing was carried out using RMA normalization. For illustrative purposes, the 2,000 genes showing the greatest variability according to their CUR decomposition leverage scores were selected (see Mahoney and Drineas 2009 for details).
Traditional PCA was performed on the centred matrix of 216 samples and 2,000 probe sets. The two retained components explain 45.5% of the data variability. Figure 3a shows the score plot of components 1-2. The scores reveal three clearly differentiated subgroups of samples corresponding to the control (rectangle), ALL (circle) and CLL (triangle) samples. The first PC discriminates between the control and CLL samples, with the CLL samples located on the positive side of axis 1. The ALL samples are differentiated from the CLL and control samples by their coordinates on axis 2. Figure 3a (middle, bottom) shows that each PC is a linear combination of a large number of genes, making the PCs hard to interpret despite their good discriminatory capacity.
To assess the effects of sparsification, CenetPCA was performed using $\alpha = 0.5$ (i.e., giving equal weight to the Lasso and Ridge constraints) and $\tau = 5.86$; the latter was selected according to the BIC criterion and the previously observed separation of the groups. The two retained sparse components explain 25.8% of the data variability. Figure 3b shows the score plot of the samples on the first two sparse PCs; the same classification is achieved as shown in Fig. 3a (top). The loading plots of CenetPCA show the genes automatically selected by this technique (Fig. 3b, middle and bottom). The genes with higher loadings in the PCs are also those with higher loadings in the sparse CenetPCs, while the variables with lower loadings in the PCs have loadings of zero (or close to zero) in the sparse CenetPCs. This simplifies the interpretation of the constrained components obtained. Of the 2,000 gene probes considered, 1,752 present zero loadings in both CenetPCs, and of the remaining 248 non-zero loadings, 43 gene probes present loadings lower than 0.01 along both constrained components.
Finally, to characterise the influence of the selected gene probes on the separation of the above-mentioned groups, the centred gene expression matrix of the 216 observations and the 205 gene probes resulting from CenetPCA was analysed via the CenetHJ-Biplot. The sparsity constraint $\tau_2$ was fixed at a medium level of sparsity (Guillemot et al. 2019). The two retained sparse components explain 27.6% of the data variability. The CenetHJ-Biplot results are shown in Fig. 4. The first axis shows the separation of the CLL samples from the control and ALL samples. Axis 2 is a gradient direction differentiating the ALL samples from the rest. The genes that do not appear on the plane have null coordinates. In addition, the CenetHJ-Biplot representation makes it possible to recognize a genetic characterization of each of the subgroups. In this sense, control samples are characterized by a high expression of S100A12 and S100A9; these genes are responsible for discriminating between control and tumour samples. CLL samples are differentiated from

Fig. 3: Each shape refers to one of the three groups of samples (rectangle: control; triangle: CLL; circle: ALL). Panel (a) contains the score plot of PCs 1-2 (top) and the loading plots of PC1 (middle) and PC2 (bottom); each PC is computed as a linear combination of the 2,000 gene probes measured. Panel (b) contains the score plot (top) and the loading plots of the constrained PCs (middle, bottom). Sparse and orthogonal PCs distinguish the three groups considered, although in this case many gene probes present zero loadings.

Conclusions
In this work, the extension of CSVD (Guillemot et al. 2019) to the enet ball is proposed, integrating sparse and orthogonal vectors simultaneously. This method is based on the projection of a vector onto the convex intersection of the enet and $\ell_2$ balls. Here, the enet constraint is shown to be a suitable approach, shrinking coefficients to zero while ensuring that correlated variables have similar coefficients, a desirable property in disciplines such as genomics or psychometrics. Our approach using CenetSVD is useful for analysing large-scale problems with $J \gg I$ as well as datasets with $I > J$. Additionally, CenetSVD is extended to the sparse and orthogonal constrained CenetPCA and the sparse and orthogonal constrained CenetBiplots. These techniques make it possible to recognize groups with similar patterns and the variables associated with them. In addition, they are variable selection techniques that improve the interpretation of results thanks to the sparse components obtained. Furthermore, this work provides a sparsity parameter selection procedure based on cross-validation and the BIC, as well as the possibility of manually establishing distinct levels of sparsity.
Future lines of research may consider applying other types of constraints within the CSVD framework, or even proposing other algorithms for projecting a vector onto non-convex sets based on the corresponding mathematical theory. Additionally, statistical techniques for two-way and three-way data analysis could be developed through CenetSVD. We conclude that our proposed methods are promising tools for conducting multivariate analysis and are applicable to a wide range of research areas.