Least squares and maximum likelihood estimation of sufficient reductions in regressions with matrix-valued predictors

We propose methods to estimate sufficient reductions in matrix-valued predictors for regression or classification. We assume that the first moment of the predictor matrix given the response can be decomposed into a row and a column component via a Kronecker product structure. We obtain least squares and maximum likelihood estimates of the sufficient reductions in the matrix predictors, derive statistical properties of the resulting estimates and present fast computational algorithms with assured convergence. The performance of the proposed approaches in regression and classification is compared in simulations. We illustrate the methods on two examples, using longitudinally measured serum biomarker and neuroimaging data.


Introduction
In many applications, predictors are matrix-valued. For example, in cohort studies conducted to study diseases, multiple correlated biomarkers are measured repeatedly during follow-up. It is of interest to assess their associations with disease outcomes to aid understanding of the biological underpinnings of disease and to use them individually or in combinations in diagnostic or prognostic models. Neuroimaging studies use data from electroencephalography (EEG), which records electrical activity of the brain over time, to predict cognitive outcomes and to identify brain regions associated with a clinical response. In these examples, the predictors are naturally matrix-valued. (E. B. and D. K. gratefully acknowledge the support of the Austrian Science Fund, FWF P 30690-N35.) Multivariate statistical methods can be used to analyze matrix-valued predictors by mapping them into vectors, but frequently this is not feasible for data sets of realistic size. For instance, treating EEG data measured at 60 channels each for 256 time points as a vector in a regression model would require estimating 15360 regression parameters, necessitating impractically large samples. Moreover, vectorizing a matrix destroys the inherent structure of the predictors, which may contain important modeling information.
Only a few statistical approaches accommodate a matrix structure of the predictors. Dimension folding [29] extends moment-based sufficient dimension reduction (SDR) methods to matrix-valued predictors by reducing the predictors' row and column dimensions simultaneously without loss of information on the response. [34] proposed and studied first-moment-based SDR methods for combining several longitudinally measured predictors into a composite score for prediction or regression modeling. They assumed that the means and the second moments of the predictors can be separated into a predictor-specific and a time-specific component via a Kronecker product structure and proposed an estimation approach, longitudinal sliced inverse regression (LSIR), based on empirical moments of the predictors given the outcome. The Kronecker product structure substantially reduces the complexity of the first-moment-based dimension reduction subspace. The resulting score yielded better predictive accuracy than standard first-moment-based SDR methods, such as sliced inverse regression (SIR) [31], applied to the vectorized predictors.
[17] developed model-based methods, dimension folding principal component analysis (PCA) and dimension folding principal fitted components (PFC), that extend conventional PCA and PFC [12] to matrix-valued data. They require the predictors be normally distributed with Kronecker product covariance structure. In the context of classification, [32] proposed a discriminant analysis model to predict a categorical response for mixed categorical and tensor-valued predictors. The method reduces the dimension of the tensor predictor within each group defined by the categorical covariates.
In the machine learning literature, methods proposed for matrix-valued predictors that do not use information on the outcome, i.e., unsupervised dimension reduction methods, include 2DPCA [41], generalized 2D principal component analysis (G2DPCA) [27], (2D)^2 PCA [45], GLRAM [42], unified PCA [37] and probabilistic higher-order PCA [44]. Regression approaches include reduced-rank generalized linear models using a mixture of array-valued and vector-valued predictors [47] and a tensor partial least squares algorithm for the regression of a continuous response on tensor-valued predictors [46]. Both focus on the forward regression, which they assume is linear in the vector-, matrix- or tensor-valued predictors. These methods frequently suffer from lack of convergence and do not yield closed-form solutions.
In this paper, we propose least squares and maximum likelihood-based approaches to estimate the sufficient reductions in matrix-valued predictors under a Kronecker product structure for the predictor means given the response without requiring a specific structure for the covariance in contrast to previous methods [17,34]. By casting the estimation problem in a linear model framework, we obtain least squares-based estimates that are asymptotically optimal and competitive with maximum likelihood estimates (MLEs) for practically relevant sample sizes.

Background on sufficient dimension reduction
Let X = (X_1, . . . , X_p)^T ∈ R^p be a vector of p predictors and Y ∈ R denote the outcome variable. Sufficient dimension reduction (SDR) [9] aims to find a function or "reduction" of X, R : R^p → R^d with d ≤ p, which contains the same information as X about the response Y. That is, F(Y | X) = F(Y | R(X)), where F is the conditional distribution function of Y given X. This version of dimension reduction is called sufficient because the lower-dimensional R(X) (d < p) replaces the predictor vector X without any loss of information on Y. The dimension d of the sufficient reduction R(X) is the dimension of the regression of Y on X.
Model-based SDR is based on the important result that if R(X) is a sufficient reduction for the forward regression Y | X, then it is also a sufficient statistic for the inverse regression X | Y [11]. Exploiting this, both linear and nonlinear sufficient reductions for the regression of Y on X have been derived by requiring the distribution of X | Y be in the elliptically contoured or exponential family [4,5,[11][12][13]].

First-moment-based SDR subspace
In this paper, we focus on inference on the first-moment-based SDR subspace (FMSDR), which is the span of the centered mean of the inverse regression of X on Y, E(X | Y) − E(X), scaled by the inverse of the marginal covariance of X, Σ_x. That is, we let

S_FMSDR = Σ_x^{-1} span{μ_Y − μ : Y ∈ S_Y},   (1)

where μ_Y = E(X | Y) and μ = E(X). If the predictors X satisfy the linearity condition [9, p. 188], which requires that E(X | η^T X) be linear in η^T X for η such that F(Y | X) = F(Y | η^T X), then S_FMSDR ⊆ S(η). The linearity condition refers exclusively to the marginal distribution of X. It holds when X has an elliptical distribution, such as the multivariate normal or multivariate t, and also holds approximately when p is very large [24,38]. Under the linearity condition, any matrix whose column space coincides with S_FMSDR can be used to either exhaustively or partially estimate S(η). SDR methods based on the first conditional moment of the inverse predictors X | Y, such as SIR [31], use the kernel matrix Σ_x^{-1} Var(E(X | Y)). [3] proposed parametric inverse regression (PIR) to obtain a least squares estimate of S(η) by fitting the multivariate linear inverse regression model

X = μ + B f_Y + ε,   (2)

where f_Y is an r × 1 vector of functions of Y with E(f_Y) = 0 and the p × r parameter matrix B is unconstrained. Regressing X on F, the n × r matrix with ith row f_{y_i}^T, yields the ordinary least squares (OLS) estimate

B̂ = X^T F (F^T F)^{-1}   (3)

for B in model (2), where X denotes the n × p matrix of centered predictor observations. Letting P_F = F(F^T F)^{-1} F^T denote the projection matrix onto the space spanned by the columns of F and Q_F = I_n − P_F, an estimate of the conditional covariance matrix Δ_Y is

Δ̂ = X^T Q_F X / (n − rank(F)).   (4)

Equations (1) and (2) imply that dim(S_FMSDR) = rank(B) ≤ p.
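As a numerical illustration of the least squares computations in model (2), the following sketch (toy dimensions and a quadratic f_Y of our choosing, not from the paper) fits the inverse regression of X on F by OLS and forms the residual-based estimate of Δ_Y in (4):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 2000, 4, 2

# Toy inverse regression X = B f(Y) + eps with f_Y = (Y, Y^2), centered.
y = rng.normal(size=n)
F = np.column_stack([y, y**2])
F -= F.mean(axis=0)                      # so that f_Y has mean zero empirically
B = rng.normal(size=(p, r))
X = F @ B.T + rng.normal(size=(n, p))
X -= X.mean(axis=0)

# OLS estimate of B from regressing X on F (model (2)).
B_hat = np.linalg.solve(F.T @ F, F.T @ X).T          # p x r

# Residual-based estimate of Delta_Y as in (4), with Q_F = I - P_F.
P_F = F @ np.linalg.solve(F.T @ F, F.T)
Q_F = np.eye(n) - P_F
Delta_hat = X.T @ Q_F @ X / (n - np.linalg.matrix_rank(F))
```

With independent N(0, 1) errors, B_hat approaches B and Delta_hat approaches the identity as n grows.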
The first model-based SDR method for the estimation of S_FMSDR in (1), principal fitted components (PFC [12]), requires that X follow model (2) and be conditionally normally distributed given Y; that is,

X = μ + Γ γ f_Y + ε, with ε ∼ N_p(0, Δ),   (5)

where Γ ∈ R^{p×d} is an orthogonal basis of the linear space span{μ_y − μ : y ∈ S_Y}, with S_Y the sample space of Y, and γ ∈ R^{d×r} is an unrestricted rank-d parameter matrix with d ≤ r. Thus, PFC is a constrained version of PIR [3] in that it additionally requires X | Y be normal with constant variance Δ, and the rank d of B in (2) be known, so that B = Γγ. Under (5), [12] showed that S_FMSDR = Σ_x^{-1} span(Γ) and derived the maximum likelihood estimate (MLE) of S_FMSDR to be

Ŝ_FMSDR = span(Δ̂_MLE^{-1} Γ̂),   (6)

where

Δ̂_MLE = Δ̂_res^{1/2} V̂ (I + K̂) V̂^T Δ̂_res^{1/2}.   (7)

In (7), Δ̂_res is obtained by multiplying (4) by (n − rank(F))/n, and Δ̂_fit = X^T X/n − Δ̂_res = X^T X/n − X^T Q_F X/n = X^T P_F X/n. The eigenvectors of Δ̂_res^{-1/2} Δ̂_fit Δ̂_res^{-1/2} form the columns of V̂, and K̂ = diag(0, . . . , 0, λ̂_{d+1}, . . . , λ̂_r, 0, . . . , 0), with λ̂_1 ≥ . . . ≥ λ̂_r its ordered nonzero eigenvalues. When d = r, (7) reduces to Δ̂_MLE = Δ̂_res. The MLE of the sufficient reduction is

R̂(X) = Γ̂^T Δ̂_MLE^{-1} X.   (8)

Matrix-valued predictors
For ease of exposition, we present the model in the longitudinal setting, where the p × 1 predictor vector X is measured at T different time points. Specifically, for sample i with response variable Y_i ∈ R, i = 1, . . . , n, the predictors can be represented as the p × T matrix

X_i = (X_{i1}, . . . , X_{iT}),   (9)

which corresponds to the pT × 1 vector vec(X_i) = (X_{i1}^T, . . . , X_{iT}^T)^T, comprised of the columns of X_i in (9) stacked one after another. We assume that all samples have measurements for all predictors at the same time points.
To accommodate the longitudinal structure of X | Y, we assume that the centered first moment of X decomposes into a time and a predictor component as in [34], and write the linear inverse regression model (2) as bilinear in the rows and columns of X,

X = μ + β f_Y α^T + ε,   (10)

where f_Y is a k × r matrix of functions in Y with E(f_Y) = 0, α ∈ R^{T×r}, and β ∈ R^{p×k}. In vector form, model (10) is written as

vec(X) = vec(μ) + (α ⊗ β) vec(f_Y) + vec(ε).   (11)

The T × r parameter matrix α captures the mean structure over time, and the p × k matrix β captures the mean structure of the predictors regardless of time. The error ε satisfies E(ε) = 0. Model (11) is analogous to model (2), with the difference that vec(f_Y) in (11) is a kr × 1 vector and the parameter matrix B is replaced by the Kronecker product of α and β, which induces sparsity in the sense of reducing the number of parameters to estimate from pTkr to pk + Tr. Letting Σ_x denote the pT × pT covariance matrix of vec(X) and Δ = E(Δ_Y), [34] showed that

S_FMSDR = Σ_x^{-1} span(α ⊗ β),   (12)

with dimension dim(S_FMSDR) = rank(α) rank(β).
For the PFC version of model (11), we use the corresponding parameterization of the two parameter matrices α ∈ R^{T×r} and β ∈ R^{p×k}, which are both unconstrained. Assuming rank(α) = d_1 and rank(β) = d_2, we let α = Γ_1 γ_1, where Γ_1 is a T × d_1 semi-orthogonal matrix whose columns form a basis for the d_1-dimensional span(α), and γ_1 is an unconstrained d_1 × r matrix of rank d_1. Similarly, there exists a p × d_2 semi-orthogonal matrix Γ_2 whose columns form a basis for the d_2-dimensional subspace span(β), and a d_2 × k rank-d_2 unconstrained matrix γ_2, so that β = Γ_2 γ_2. Using this parameterization, model (11) becomes

vec(X) = vec(μ) + (Γ_1 γ_1 ⊗ Γ_2 γ_2) vec(f_Y) + vec(ε).   (13)

As a consequence, (12) yields

S_FMSDR = Σ_x^{-1} span(Γ_1 ⊗ Γ_2).   (14)

When Σ_x is separable, i.e., Σ_x = Σ_1 ⊗ Σ_2, or, slightly less restrictively, when Δ_y = Var(vec(X) | Y = y) = Δ_{1y} ⊗ Δ_{2y}, then Σ_x^{-1}(Γ_1 ⊗ Γ_2) = (Σ_1^{-1} Γ_1) ⊗ (Σ_2^{-1} Γ_2). In this case, the number of parameters needed to estimate S_FMSDR in (14) is further reduced.
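The equivalence between the bilinear form (10) and the vectorized form (11) rests on the identity vec(β f α^T) = (α ⊗ β) vec(f). A quick numerical check (toy dimensions of our choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
p, T, k, r = 5, 4, 2, 3
beta = rng.normal(size=(p, k))     # predictor (row) component
alpha = rng.normal(size=(T, r))    # time (column) component
f = rng.normal(size=(k, r))        # f_Y, a k x r matrix of functions of Y

M = beta @ f @ alpha.T                          # bilinear mean, p x T
lhs = M.reshape(-1, order="F")                  # vec(M): columns stacked
rhs = np.kron(alpha, beta) @ f.reshape(-1, order="F")
assert np.allclose(lhs, rhs)                    # vec(B X A^T) = (A ⊗ B) vec(X)
```

The Kronecker factor α ⊗ β here is Tp × rk, matching the dimensions of the vectorized model (11).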

Estimating S FMSDR using matrix-valued predictors
We propose several approaches to estimate S FMSDR in (14) by estimating the component matrices α and β and Γ 1 and Γ 2 in models (11) and (13). We assume that the dimension d is known and comment on inference on d for all approaches in Sect. 8.

Least squares Kronecker parametric inverse regression, (K-PIR (ls))
To obtain least squares (ls)-based estimates of S_FMSDR under model (11), we assume that the predictors are centered around their overall mean μ. Using the sample-level notation defined in Sect. 2.1 and letting X̃_i = X_i − X̄, i = 1, . . . , n, the model becomes

X = F_y (α ⊗ β)^T + ε,   (15)

where X : n × pT with ith row vec(X̃_i)^T, and F_y : n × rk with ith row vec(f_{y_i})^T. The following theorem, proved in "Appendix," summarizes the approach and the properties of the resulting estimates.

Theorem 1 Assume the data X follow model (15). Let B̂ = (F_y^T F_y)^{-1} F_y^T X denote the ordinary least squares estimate in the unconstrained model X = F_y B + ε. The matrices α̂ and β̂ defined as

(α̂, β̂) = arg min_{α, β} ‖B̂^T − α ⊗ β‖_F,   (16)

and estimated using algorithm 2 in [40], converge in probability to α and β in the constrained model (11); that is, α̂ ⊗ β̂ converges in probability to α ⊗ β. When the distribution of X | Y belongs to the exponential family, α̂ and β̂ are asymptotically normal.
We refer to any matrices obtained as solutions to (16) as VLP (Van Loan and Pitsianis [40]) approximations. The algorithm is described in "Appendix." Given α̂ and β̂, the least squares-based estimate of Δ = E(Var(vec(X) | Y)) is

Δ̂ = (X − F_y(α̂ ⊗ β̂)^T)^T (X − F_y(α̂ ⊗ β̂)^T)/n.   (18)
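A minimal sketch of the nearest Kronecker product step (16): as in Van Loan and Pitsianis, rearranging the blocks of the target matrix turns the problem into a best rank-one approximation, solvable by one SVD. The helper name and toy dimensions below are ours, not the paper's.

```python
import numpy as np

def nearest_kronecker(B, shape_a, shape_b):
    """Minimize ||B - kron(A, C)||_F via the SVD of a rearrangement of B
    whose best rank-one approximation is vec(A) vec(C)^T (Van Loan-Pitsianis)."""
    (m1, n1), (m2, n2) = shape_a, shape_b
    assert B.shape == (m1 * m2, n1 * n2)
    # Rearrange: row (i1, j1), column (i2, j2) holds B[i1*m2 + i2, j1*n2 + j2].
    R = B.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    C = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, C

# An exact Kronecker product is recovered (up to the scale trade-off
# between the two factors, which leaves kron(A, C) unchanged).
rng = np.random.default_rng(2)
alpha = rng.normal(size=(3, 2))
beta = rng.normal(size=(4, 5))
A, C = nearest_kronecker(np.kron(alpha, beta), (3, 2), (4, 5))
assert np.allclose(np.kron(A, C), np.kron(alpha, beta))
```

Applied to a noisy B̂, the same call returns the closest Kronecker-structured approximation in Frobenius norm.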

ML Kronecker parametric inverse regression (K-PIR (mle))
We derive the MLEs for α and β in model (11) under the additional assumption that

vec(X) | Y = y ∼ N_pT (vec(μ) + (α ⊗ β) vec(f_y), Δ),   (19)

where f_{y_i} is defined in Eq. (15). The corresponding log-likelihood is

l(μ, α, β, Δ) = −(n/2) log |2πΔ| − (1/2) Σ_{i=1}^n (vec(X_i − μ) − (α ⊗ β) vec(f_{y_i}))^T Δ^{-1} (vec(X_i − μ) − (α ⊗ β) vec(f_{y_i})).   (20)

The MLE of μ when the other parameters are fixed is the sample mean X̄. We substitute X̄ for μ and use the centered observations X̃_i = X_i − X̄ in what follows. For fixed α and β, solving the corresponding score equation for Δ yields

Δ̂ = (1/n) Σ_{i=1}^n (vec(X̃_i) − (α ⊗ β) vec(f_{y_i})) (vec(X̃_i) − (α ⊗ β) vec(f_{y_i}))^T.   (21)

The score equations for α and β, however, do not yield closed-form solutions, and we employ the following iterative algorithm for estimation.

K-PIR MLE Algorithm:
1. Initialize Δ at the value Δ̂_0 from least squares in (18).
2. Compute α̂_1, β̂_1 by optimizing the log-likelihood in (20) numerically, with starting values (α̂_ls, β̂_ls), the approximate ls solution computed from (16).
3. Compute Δ̂_1 from (21) with α = α̂_1 and β = β̂_1.

Repeat steps 2 and 3 until convergence, e.g., until the change in the log-likelihood falls below a prespecified tolerance.
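The alternating scheme above can be sketched as follows. This is a toy illustration with hypothetical dimensions; the numerical search in step 2 uses a generic quasi-Newton optimizer, and the ls starting values are computed by a rank-one SVD, which coincides with the VLP solution here since r = k = 1.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p, T, k, r = 500, 3, 2, 1, 1
alpha0 = rng.normal(size=(T, r))
beta0 = rng.normal(size=(p, k))
F = rng.normal(size=(n, r * k))                      # rows vec(f_{y_i})^T
X = F @ np.kron(alpha0, beta0).T + rng.normal(size=(n, p * T))

def neg2loglik(theta, Delta_inv):
    # Twice the negative log-likelihood (20), up to constants, with Delta fixed.
    alpha = theta[: T * r].reshape(T, r)
    beta = theta[T * r :].reshape(p, k)
    E = X - F @ np.kron(alpha, beta).T
    return np.einsum("ij,jk,ik->", E, Delta_inv, E)  # tr(E Delta^{-1} E^T)

# ls starting values: rank-one factorization of the unconstrained OLS fit.
B_ols = np.linalg.solve(F.T @ F, F.T @ X).reshape(T, p)
U, s, Vt = np.linalg.svd(B_ols)
theta = np.concatenate([np.sqrt(s[0]) * U[:, 0], np.sqrt(s[0]) * Vt[0]])

Delta = np.eye(p * T)                                # stand-in for step 1
for _ in range(5):
    theta = minimize(neg2loglik, theta, args=(np.linalg.inv(Delta),)).x  # step 2
    alpha = theta[: T * r].reshape(T, r)
    beta = theta[T * r :].reshape(p, k)
    E = X - F @ np.kron(alpha, beta).T
    Delta = E.T @ E / n                              # step 3, as in (21)
```

Since α and β are identified only up to a reciprocal scale factor, convergence should be judged on α̂ ⊗ β̂ and the log-likelihood rather than on the individual factors.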
Assuming that d_1 and d_2 are known, we estimate S_FMSDR in (12) with

Ŝ_FMSDR = Σ̂_x^{-1} span(Γ̂_1 ⊗ Γ̂_2),

where Γ̂_1 and Γ̂_2 contain the first d_1 and d_2 left singular vectors of α̂ and β̂, respectively.

Kronecker principal fitted components (K-PFC)
The log-likelihood under model (13) with ε ∼ N_pT(0, Δ) has a different mean structure from (20): the mean of vec(X) | Y = y is vec(μ) + Γ γ vec(f_y), where Γ = Γ_1 ⊗ Γ_2 and γ is a d × kr matrix of rank d, but otherwise unconstrained. [12] computed the MLEs of μ, Γ, and γ in model (5). We show in "Appendix" that the Kronecker product structure constraint on the parameter matrix B = α^T ⊗ β^T does not alter the formulae for the MLEs until the last step; that is, the MLEs of μ, Γ, and γ have the same form as in the unconstrained PFC model, and Γ̂_1 and Γ̂_2 are subsequently obtained from Γ̂. The expression for Δ̂_MLE is given in equation (7). In the full-rank setting, i.e., when d_1 = r and d_2 = k, (7) simplifies to Δ̂_MLE = Δ̂_res, since K̂ is then a matrix of zeros.

Remark 1
In the standard MLE approach of Sect. 4.2, the number of unknown parameters in α and β is Tr + pk, whereas in the PFC parameterization it is Td_1 + pd_2, which can be significantly smaller in the non-full-rank setting where d_1 < r and d_2 < k.
[Flowchart of the K-PIR and K-PFC algorithms; in the last step, Γ̂_1 and Γ̂_2 are computed from Γ̂ using VLP.]

K-PIR (mle) also estimates (14), but requires X | Y be normal as in (19). The three K-PFC methods use model (13) under the assumption of normality of X | Y. K-PIR (ls) and K-PFC1, K-PFC2, K-PFC3 estimate (14) using the Van Loan and Pitsianis (VLP) [40] least squares approximation algorithm applied to different parameter matrices.

Variable selection: sparse K-PIR and K-PFC
In addition to reducing the dimension of the predictors, it is desirable to identify those associated with the outcome and remove irrelevant and redundant ones when computing sufficient reductions. We adapt results of [7], a version of group lasso [6], to the Kronecker product setting.
One can show that the coordinate-independent sparse sufficient dimension reduction estimator (CISE) of Γ minimizes an objective of the form (30), which augments the estimation criterion with a group lasso penalty λ Σ_j ‖Γ_j‖_2 on the rows Γ_j of Γ, where λ ≥ 0 is a regularization parameter, Δ̂_fit is given in (27), and ‖·‖_2 denotes the L_2 norm.
The minimization of (30) is a Grassmann manifold optimization problem. Since ‖·‖_2 is non-differentiable at zero, traditional Grassmann manifold optimization techniques [see [19]] cannot be applied directly. [7] proposed a computational algorithm based on local quadratic approximation [20], and [48] proved that CISE with BIC-based tuning parameter selection identifies the true model consistently; i.e., it has the oracle property.
We use the fast penalized orthogonal iteration (fast POI) optimization algorithm in [26] to implement CISE. Fast POI is a new algorithm for sparse estimation of eigenvectors in generalized eigenvalue problems, which is much faster and easier to implement than the algorithm in [7]. Fast POI-C, the coordinate-wise version of the algorithm, is guaranteed to converge to the optimal solution [26,39].
To simultaneously carry out variable selection and dimension reduction in the least squares-based approaches, we first solve (30) to obtain Γ̂ and then minimize ‖Γ̂ − Γ_1 ⊗ Γ_2‖_F via the VLP approximation to find Γ̂_1 and Γ̂_2. The sparse estimate of the sufficient reduction is then obtained by substituting Γ̂_1 and Γ̂_2 into (14). Coordinate-wise SDR selects whole rows (corresponding to particular markers) and whole columns (corresponding to particular time points) separately, which are then removed from the model. It does not remove a particular marker only for select time points.

Simulations
We assessed the performance of K-PIR (ls) in Sect. 4.1, K-PIR (mle) in Sect. 4.2, and the K-PFC least squares algorithms in Sect. 4.3 for estimating the sufficient reduction subspace S_FMSDR using simulations, for both continuous and binary outcomes Y. As mentioned in the Introduction, there are very few regression or classification approaches that apply to matrix-valued predictors. The only directly comparable published methods are folded SIR [29] and longitudinal sliced inverse regression (LSIR) [34]. We excluded folded SIR from the simulations due to the instability of its estimation algorithm [see the analysis of the EEG data in Sect. 7]. LSIR [34] assumes both the first and second conditional moments of X | Y have Kronecker product structure; i.e., E(X | Y) − E(X) = α ⊗ β and Var(X) = Σ_α ⊗ Σ_β, where α captures the time and β the biomarker structure of the predictors. The estimation of the sufficient reduction is based on discretizing the response variable Y, if it is not categorical, and using the group sample means to estimate S_FMSDR in (1). LSIR is the Kronecker product version of linear discriminant analysis for matrix-valued data.
For continuous outcomes Y, we additionally compared our methods to (2D)^2 principal component regression, which we denote as (2D)^2 PCR, our adaptation of (2D)^2 PCA [45] and GLRAM [42] to regression with matrix-valued predictors in analogy to principal component regression (PCR) [25]. PCR computes linear combinations, principal components (PCs), of vector-valued predictors, using as coefficients the elements of the eigenvectors of the predictor sample covariance matrix, arranged with respect to its eigenvalues in decreasing order. We let U_α = (U_{1,α}, . . . , U_{T,α}) and U_β = (U_{1,β}, . . . , U_{p,β}) denote the column and row eigenvector matrices of the p × T predictor X, respectively. The columns of U_α are the eigenvectors of the T × T sample column covariance matrix Σ̂_α = Σ_{j=1}^n (X_j − X̄)^T(X_j − X̄)/n, and those of U_β are the eigenvectors of the p × p sample row covariance matrix Σ̂_β = Σ_{j=1}^n (X_j − X̄)(X_j − X̄)^T/n. We define the (2D)^2 PCs of X to be X̃_i = U_β^T X_i U_α, for i = 1, . . . , n, and call the regression of the response Y on X̃_i "(2D)^2 PCR." The (2D)^2 PCA estimate of α ⊗ β in (11) is U_{α,d_1} ⊗ U_{β,d_2}, where U_{α,d_1} = (U_{1,α}, . . . , U_{d_1,α}) and U_{β,d_2} = (U_{1,β}, . . . , U_{d_2,β}).
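The (2D)^2 PC computation described above can be sketched as follows (toy data; dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, T, d1, d2 = 100, 10, 8, 2, 3
X = rng.normal(size=(n, p, T))             # n matrix-valued predictors
Xc = X - X.mean(axis=0)

# Sample column (T x T) and row (p x p) covariance matrices.
Sigma_alpha = np.einsum("ipt,ips->ts", Xc, Xc) / n
Sigma_beta = np.einsum("ipt,iqt->pq", Xc, Xc) / n

# Eigenvectors ordered by decreasing eigenvalue; keep the leading ones.
U_alpha = np.linalg.eigh(Sigma_alpha)[1][:, ::-1][:, :d1]   # T x d1
U_beta = np.linalg.eigh(Sigma_beta)[1][:, ::-1][:, :d2]     # p x d2

# (2D)^2 PCs: reduced d2 x d1 matrices fed into a downstream regression on Y.
scores = np.stack([U_beta.T @ Xi @ U_alpha for Xi in X])
```

The reduced scores replace the original p × T matrices in the regression, mirroring how PCR replaces vector predictors with their leading PCs.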

Data generation for continuous outcome Y
To generate data from the model in equation (10), we first generated y i ∼ N (0, 1) for i = 1, . . . , n, and then computed the ith row f y i = g y i −ḡ of the n × rk matrix F y , where g y i is a vector of Fourier basis functions, vec(g y i ) = (cos(2π y i ), sin(2π y i ), . . . , cos(2π sy i ), sin(2π sy i )) T , with 2s = rk. The n × pT matrix of error terms was generated from the multivariate normal N npT (0, Δ ⊗ I n ), where Δ was a positive definite matrix with ones on the diagonal to ensure that all variables have the same scale. We then let α = Γ 1 γ 1 and β = Γ 2 γ 2 , where Γ 1 ∈ R T ×d 1 and Γ 2 ∈ R p×d 2 , and computed X = F y (α ⊗ β) T + ε, using the parameterization in (15).
We let p = 10, T = 8 with r = k = 6 for d_1 = d_2 = 2, d_1 = d_2 = 4, and d_1 = d_2 = 6, to assess the impact of the dimension on the estimation procedures. For each setting, we generated 500 data sets of sample sizes n = 500 and n = 5000 and report means over the 500 repetitions in Tables 1 and 2.

Data generation for binary outcome Y
We generated X from two multivariate normal distributions with equal covariance matrices, (X_k | Y = i) ∼ N(α_i ⊗ β, Δ), k = 1, . . . , n_i, i = 0, 1, with n_0 = n_1 = n/2 for n = 500, 1000 and 2000, and with p = 10 and T = 5. Each α_i, i = 0, 1, was a vector of length T and β was a vector of length p; that is, the dimension is d = d_1 d_2 = 1. We let β = p^{-1/2}(1, . . . , 1), the entries of α_0 equal 0, and the entries of α_1 equal α_1[k] = (T − k + 1)^{-1}. When T denotes time from study baseline, this choice of the α_1 coefficients leads to later time points, i.e., measurements more proximal in time to Y, contributing more to the discrimination of the two groups. The variance matrix of the predictors was separable, Σ_x = Var(vec(X)) = Σ_1 ⊗ Σ_2. We imposed an AR(1) structure on both components of Σ_x; that is, cor(X_ij, X_ik) = ρ_T^{|k−j|} for Σ_1 and cor(X_ij, X_kj) = ρ_p^{|k−j|} for Σ_2, for various choices of ρ_T and ρ_p. The covariance matrix Δ was then computed using the identity Δ = E(Var(vec(X) | Y)) = Σ_x − Var(E(vec(X) | Y)).
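The binary-outcome design above can be generated along these lines (a sketch; the AR(1) helper and variable names are ours):

```python
import numpy as np

def ar1(dim, rho):
    """AR(1) correlation matrix with cor(Z_j, Z_k) = rho**|j - k|."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(5)
p, T, n_per_group = 10, 5, 250

beta = np.ones(p) / np.sqrt(p)                   # beta = p^{-1/2}(1, ..., 1)
alpha0 = np.zeros(T)
alpha1 = 1.0 / (T - np.arange(1, T + 1) + 1)     # later time points weigh more

Sigma_x = np.kron(ar1(T, 0.5), ar1(p, 0.3))      # separable: Sigma_1 ⊗ Sigma_2
L = np.linalg.cholesky(Sigma_x)

# Rows of X are N(alpha_i ⊗ beta, Sigma_x) for groups i = 0, 1.
Z = rng.normal(size=(2 * n_per_group, p * T)) @ L.T
X = Z + np.concatenate([np.kron(alpha0, beta)[None].repeat(n_per_group, 0),
                        np.kron(alpha1, beta)[None].repeat(n_per_group, 0)])
y = np.repeat([0, 1], n_per_group)
```

Here ρ_T = 0.5 and ρ_p = 0.3 are one example choice of the correlation parameters.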

Performance evaluation for estimation of the subspace
To evaluate bias, we computed the relative differences between the estimated and the true matrix values, E_1 = ‖α̂ ⊗ β̂ − α ⊗ β‖ / ‖α ⊗ β‖ and E_2 = ‖Δ̂ − Δ‖ / ‖Δ‖, along with their standard deviations. As a measure of variability, we calculated V_1, the trace of the empirical covariance matrix of vec(α̂_i ⊗ β̂_i), a pTrk × 1 vector, for i = 1, . . . , N = 500 repetitions of each simulation setting. Similarly, we computed V_2, the trace of the empirical covariance matrix of vec(Δ̂), as a measure of variability of the estimates of the covariance matrix Δ.

Table 1 Continuous outcomes Y with p = 10, T = 8, r = k = 6 and rank(α) = rank(β) = 6

Table 2 Continuous outcomes Y with p = 10, T = 8, r = k = 6 and rank(α) = rank(β) < 6

The accuracy of the estimation is assessed by the Frobenius norm of the difference of the projections onto the spans of the true and the estimated dimension reduction matrices. We report averages over 500 replicates of φ = ‖P_Γ̂ − P_Γ‖ and φ_i = ‖P_Γ̂_i − P_Γ_i‖, i = 1, 2, where P_A = A(A^T A)^{-1} A^T is the orthogonal projection onto the span of a full-rank matrix A.
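The subspace distance φ can be computed as follows (a sketch; helper names are ours):

```python
import numpy as np

def proj(A):
    """Orthogonal projection onto span(A) for a full-column-rank matrix A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def subspace_dist(A, B):
    """Frobenius norm of the difference of the two projections."""
    return np.linalg.norm(proj(A) - proj(B))

rng = np.random.default_rng(6)
G = rng.normal(size=(8, 2))
# The distance depends only on the spans: a change of basis gives zero.
assert np.isclose(subspace_dist(G, G @ rng.normal(size=(2, 2))), 0)
```

Basis invariance is the reason projections are compared rather than the matrices themselves: α ⊗ β is identified only up to post-multiplication by a nonsingular matrix.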

Data generation
To assess the performance of the variable selection method in Sect. 4.4, we generated continuous outcome data by first generating y_i ∼ N(0, 1) for i = 1, . . . , n, and then computing the ith row f_{y_i} = g_{y_i} − ḡ of the n × rk matrix F_y, where g_y = (1, y, y^2). The n × pT matrix of error terms ε was generated from the multivariate normal N_{npT}(0, Δ ⊗ I_n), where Δ was a positive definite matrix with ones on the diagonal. We then computed X = F_y(α ⊗ β)^T + ε, where the 2 × T matrix α had entries α_11 = α_22 = 1 and all other entries α_ij, i = 1, 2, j = 1, . . . , T, equal to zero, and β was a vector of length p with β_1 = 1 and β_i = 0, i = 2, . . . , p, for p = 10 and T = 5.
We evaluated the influence of the sample size, n, and magnitude of noise, by multiplying the error term ε in the linear model by a constant factor, called "Scale" in Table 4.

Performance criteria for variable selection
We computed how often markers (rows) and time points (columns) of X were correctly selected on average.
The following quantities are reported. False positives (FPs): an FP occurs when α_ij = 0 but its estimate α̂_ij ≠ 0. The FP rate for α_ij is the percentage of times an FP occurs for α_ij, and the overall FP rate (FPR) is the average of the FP rates across all zero coefficients of α. False negatives (FNs): an FN occurs when α_ij ≠ 0 but its estimate α̂_ij = 0. The FN rate for α_ij is the percentage of times an FN occurs for α_ij, and the overall FN rate (FNR) is the average of the FN rates across all nonzero coefficients of α. The total error rate is the sum of the number of times a nonzero coefficient of α was estimated to be zero and the number of times a zero coefficient was estimated to be nonzero, divided by the total number of elements in α.
The corresponding FPR, FNR and total error rate for β are reported separately.
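The error rates can be computed from the true and estimated coefficient matrices as follows (a sketch; the helper name and the thresholding at exact zero are ours):

```python
import numpy as np

def error_rates(truth, est, tol=1e-8):
    """FPR, FNR and total error rate for one coefficient matrix."""
    true_zero = np.abs(truth) <= tol
    est_zero = np.abs(est) <= tol
    fp = true_zero & ~est_zero          # zero coefficient estimated nonzero
    fn = ~true_zero & est_zero          # nonzero coefficient estimated zero
    fpr = fp.sum() / max(true_zero.sum(), 1)
    fnr = fn.sum() / max((~true_zero).sum(), 1)
    total = (fp.sum() + fn.sum()) / truth.size
    return fpr, fnr, total

alpha = np.array([[1.0, 0.0], [0.0, 0.0]])
alpha_hat = np.array([[0.9, 0.2], [0.0, 0.0]])
fpr, fnr, total = error_rates(alpha, alpha_hat)   # 1/3, 0, 1/4
```

The same function applied to β gives the rates reported separately for the two matrices.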

Results for continuous outcome Y
We present results for p = 10 and T = 8 in Tables 1 and 2. Results for other values of p and T were qualitatively similar. Table 1 shows summary performance statistics when r = k = 6 and both α and β have full rank 6, for n = 500, 5000. In this setting, the K-PIR (mle) estimates of α ⊗ β had lower bias (E_1) and smaller distances between subspaces (φ, φ_1 and φ_2) than all other algorithms for n = 500. K-PFC1 and K-PIR (ls) estimates of α ⊗ β were similar with respect to all measures, but K-PIR (ls) estimates of Δ had larger bias and more variability than those of K-PFC1. K-PFC2 and K-PFC3 resulted in significantly larger bias and lower estimation accuracy for both sample sizes. (2D)^2 PCA-based estimates of α ⊗ β had much larger bias and variability than all other methods, but the resulting estimates had smaller distance to the true subspace than those of K-PFC2 and K-PFC3. Table 2 shows results for r = k = 6 in the non-full-rank case. While the general patterns were similar to the full-rank setting, all methods had poorer performance. For rank(α) = rank(β) = 4 and n = 500, K-PIR (mle) yielded the least biased estimates of α ⊗ β and the smallest distances φ, φ_1 and φ_2. K-PFC1 was slightly better than K-PIR (ls) in terms of bias of α ⊗ β. For n = 5000, the K-PIR (ls), K-PIR (mle) and K-PFC1 estimates all had the same performance.
When rank(α) = rank(β) = 2, however, K-PFC1-based estimates of α ⊗ β had much lower bias, variability and distance to the true subspace and also better estimated Δ than all other methods.
For all parameter settings and sample sizes, K-PFC2-and K-PFC3-based estimates were very similar and resulted in poorer estimation than the other three methods. (2D) 2 PCA does not yield estimates for α and β in the non-full-rank case. With respect to other measures, it behaved similarly to the full-rank case.

Results for binary outcome Y
We present results for p = 10 and T = 5 in Table 3. Findings were qualitatively similar for other choices of p and T. The sample size n refers to the number of samples in each of the Y = 0 and Y = 1 groups. Interestingly, and in contrast to the results for continuous outcomes, for all sample sizes the estimates of α ⊗ β and Δ from K-PFC2 and K-PFC3 had the lowest bias and the smallest variance of all methods. The K-PFC2- and K-PFC3-based estimates also had slightly better performance in estimating the subspaces for smaller sample sizes, but for larger n all methods resulted in similar performance. LSIR-based estimates [34] had larger bias and variance compared to those from K-PFC2 and K-PFC3, but smaller compared to estimates from K-PIR (ls), K-PIR (mle) and K-PFC1, for all sample sizes. However, LSIR had worse performance than all other methods in estimating the subspaces for all sample sizes.

Results for variable selection
In Table 4, we present results on the accuracy of our variable selection approach. For both n = 100 and n = 500 with p = 10 and T = 5, the false negative rate (FNR) was 0 for α and β at low noise-to-signal ratios. For n = 100 and the highest noise-to-signal ratio we report, the FNR jumped to 29.5% for α and 18.8% for β, with lower false positive rates (FPR = 14.1% for α and FPR = 13.4% for β). The total error rates were 20.2% and 13.4% for α and β, respectively. For the more realistic setting of noise with 3 times the magnitude of the mean parameters, all error rates were less than 7% for both matrices.
When the sample size was increased to n = 500, even when the noise standard deviation was 5 times larger than the magnitude of the mean parameters, all error rates for both matrices were below 5%, indicating excellent performance in variable selection.

Serially measured pre-diagnostic levels of serum biomarkers and risk of brain cancer
To illustrate our methods, we used data from 128 individuals diagnosed with glioma, a type of brain cancer (cases, Y = 1), and 111 healthy individuals (controls, Y = 0) from a study that assessed the associations of fourteen serially measured biomarkers with glioma risk in individuals sampled from active component military personnel [2].

Table 4 Sparse case: FNR = false negative rate, FPR = false positive rate. The nonzero entries of α and β had values equal to one. "Scale" corresponds to a term that multiplied the standard deviation of the noise term in the data and reflects the noise-to-signal ratio

The markers were measured in serum obtained at three time points prior to diagnosis for cases, or selection for controls. The serum was typically what remained after routine, periodic HIV testing or required pre- and post-deployment samples. On average, samples were available every two years for a given person. We analyzed the log-transformed values of 13 markers, including several interleukins (ILs), IL-12p40, IL-15, IL-16, IL-7, IL-10, monocyte chemoattractant protein (MCP1), thymus and activation regulated chemokine (TARC), placental growth factor (PLGF), vascular endothelial growth factor (VEGF), tumor necrosis factor alpha (TNFa), hepatocyte growth factor (HGF), interferon gamma (IFNγ) and transforming growth factor beta (TGFb1). One marker (IL8) that had a highly non-normal distribution, even after log transformation, was excluded from the original panel in order to allow comparison with K-PIR (mle), resulting in (p, T) = (13, 3). We also compared all proposed methods with LSIR [34].
The discriminatory ability of the linear combinations from the various approaches to distinguish the two groups Y = 0 and Y = 1 was assessed by the area under the receiver operating characteristic curve, AUC [33, p. 67]. We used leave-one-out cross-validation to obtain an unbiased AUC estimate. That is, we removed person i from the data set, estimated the parameters of the respective model from the remaining samples and computed the projections of X_i onto the respective SDR subspace for person i. We repeated these steps, letting i range from 1 to the total sample size, to obtain unbiased predictions. For binary Y, all methods estimate at most a single direction in the central subspace; i.e., S_FMSDR is spanned by a single vector. We thus used the projections onto the space spanned by the core matrices of the methods directly as a scalar diagnostic score in computing the AUC and its variance with the R package pROC [36]. Table 5 reports AUC values and their standard deviations. All of our proposed methods had the same discriminatory ability, with an AUC value of 0.66 for K-PIR (ls), K-PFC1, K-PFC2 and K-PFC3, and for K-PIR (mle). LSIR, which assumes the Kronecker product structure for both the first and the second moments of X, had the highest AUC (0.69), highlighting the impact of further reducing the complexity of estimating the central subspace, especially in settings of limited sample size.
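The leave-one-out AUC computation can be sketched as follows, with a simple LDA-style score standing in for the estimated sufficient reduction (all names, data and dimensions are hypothetical, not the study data):

```python
import numpy as np

def auc(scores, labels):
    """AUC as the normalized Mann-Whitney U statistic (ties count 1/2)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

rng = np.random.default_rng(7)
n, d = 60, 5
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d)) + y[:, None]     # group 1 shifted upward

# Leave-one-out scores: refit the scoring direction without subject i,
# then score the held-out subject.
scores = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xt, yt = X[keep], y[keep]
    w = np.linalg.solve(np.cov(Xt.T), Xt[yt == 1].mean(0) - Xt[yt == 0].mean(0))
    scores[i] = X[i] @ w
```

Under this mean separation, the leave-one-out AUC is well above 0.5; refitting inside the loop is what removes the optimism of in-sample scoring.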

EEG Data
For the second example, we analyzed EEG data from a small study of 77 alcoholic and 45 control subjects (http://kdd.ics.uci.edu/databases/eeg/eeg.data.html). The data for each study subject consisted of a 64 × 256 matrix, with each column representing a time point and each row a channel. The measurements were obtained by exposing each individual to visual stimuli and measuring voltage values from 64 electrodes placed on the subject's scalp. To facilitate comparison of our results with other published analyses, we used only a single stimulus condition (S1), and for each subject we took the average of all the trials under that condition. That is, we used (X_i, Y_i), i = 1, . . . , 122, where X_i is a 64 × 256 matrix, i.e., p = 64, T = 256, with each entry representing the mean voltage value of subject i at a combination of a time point and a channel, averaged over all trials under the S1 stimulus condition, and Y was a binary outcome variable with Y = 1 for an alcoholic and Y = 0 for a control subject. The pT × pT = 16384 × 16384 sample variance-covariance matrix of the predictors (Σ̂_x) is singular, since the sample size is 122.
We carried out two separate analyses. First, to bypass the large p, small n issue, we applied the same pre-screening procedure as in [29], which is a version of (2D)²PCA [45], to reduce the order to (p̃, T̃) = (30, 20), (15, 15) and (4, 3). The pre-screened data were computed by replacing the matrix predictors with their (2D)²PCs, setting X̃_i = U_βᵀ X_i U_α : p̃ × T̃, i = 1, ..., n, as described in Sect. 5.
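The pre-screening step can be sketched as follows. This is a plausible Python/NumPy rendering of a (2D)²PCA-style reduction, under the assumption that U_β and U_α collect the leading eigenvectors of the row- and column-wise covariances of the centered predictors; the exact variant used in [45] and [29] may differ in details such as centering.

```python
import numpy as np

def twod2pca(X, p_tilde, T_tilde):
    """(2D)^2 PCA-style pre-screening of an (n, p, T) predictor array.

    Returns (X_reduced, U_beta, U_alpha) with
    X_reduced[i] = U_beta.T @ X[i] @ U_alpha of size p_tilde x T_tilde.
    """
    Xc = X - X.mean(axis=0)                         # center over subjects
    n = len(X)
    # Row (channel) covariance, p x p: average of Xc_i @ Xc_i.T
    G_row = np.einsum("ipt,iqt->pq", Xc, Xc) / n
    # Column (time) covariance, T x T: average of Xc_i.T @ Xc_i
    G_col = np.einsum("ipt,ips->ts", Xc, Xc) / n
    # eigh returns ascending eigenvalues; flip to take the leading vectors.
    U_beta = np.linalg.eigh(G_row)[1][:, ::-1][:, :p_tilde]
    U_alpha = np.linalg.eigh(G_col)[1][:, ::-1][:, :T_tilde]
    X_red = np.einsum("pk,ipt,tl->ikl", U_beta, X, U_alpha)
    return X_red, U_beta, U_alpha
```

For the EEG data this maps each 64 × 256 matrix to, e.g., a 15 × 15 matrix before the SDR methods are applied.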
(2D)²PCR linear combinations had highly variable performance, ranging from an AUC of 0.83 for (p̃, T̃) = (4, 3) to 0.50 for (p̃, T̃) = (30, 20). AUC values did not decrease monotonically (unreported results), indicating a lack of stability of the method. LSIR [34] linear combinations resulted in the best discriminatory performance and higher AUC values than all other methods for our choices of (p̃, T̃). [29] analyzed these data with their method, folded SIR, also using (p̃, T̃) = (15, 15). In contrast to our algorithms, folded SIR uses starting values for α and β that are random draws from two multivariate normal distributions, which results in different estimates every time the method is applied. We repeated the analysis using folded SIR several times and obtained consistently lower AUC values than with our methods, ranging from 0.61 to 0.70, which also reflects the numerical instability of the folded SIR estimation algorithm.
We also applied coordinate-wise sparse SDR without preprocessing the data, as described in Sect. 4.4, to simultaneously identify important variables and sufficient reductions. We report the average AUC values and corresponding standard deviations from tenfold cross-validation (used instead of leave-one-out due to the computational burden) for fast POI-C [26] in the last row of Table 6. The average AUC value was 0.63, much lower than the AUCs from all other estimation methods. No components of β were consistently dropped, indicating that no specific sensor was found to be insignificant. In contrast, approximately 40% of the later time points were consistently dropped. That is, sparse SDR identified the earlier time measurements as more predictive of alcoholism status.

Discussion
In this paper, we propose methods for regression and classification with matrix-valued predictors that yield consistent estimators, which are also asymptotically optimal when the predictors given the outcome have exponential family distributions. The least squares estimation algorithms are fast with guaranteed convergence. Our methods can incorporate simultaneous variable selection in estimating the sufficient dimension reduction, which further reduces complexity.
The dimensions d 1 and d 2 of span(α) and span(β), respectively, are assumed to be known in our computations. Their estimation can be carried out, for example, via AIC and BIC [17]. Our methodology can be extended to regressions with multidimensional array-valued predictors.
The R code that implements the methods in this paper can be downloaded from https://git.art-ist.cc/daniel/tensor_predictors/releases.

Van Loan and Pitsianis Matrix Approximation

Van Loan and Pitsianis [40] proposed a singular value decomposition-based algorithm to efficiently find the factor matrices B and C that minimize the Frobenius norm ‖A − B ⊗ C‖_F, where A : p × q, B : p₁ × q₁, C : p₂ × q₂ with p = p₁p₂ and q = q₁q₂. They write A as a p₁p₂ × q₁q₂ block matrix with blocks A_ij : p₂ × q₂, i = 1, ..., p₁, j = 1, ..., q₁, and show that the Kronecker product approximation is equivalent to finding a nearest rank-1 matrix to the rearrangement R(A), whose rows are the vectorized blocks vec(A_ij)ᵀ. This problem can be solved by singular value decomposition [22], as follows. If the SVD of R = R(A) is UᵀRV = Λ = diag(λ₁, ..., λ_min(p₁q₁, p₂q₂)), the optimal vec(B) equals √λ₁ U₁ and the optimal vec(C), √λ₁ V₁, where U₁, V₁ are the first columns of U and V, respectively.
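The rearrangement-plus-SVD recipe can be sketched as follows, a minimal Python/NumPy illustration (not the paper's R code). Rows of R(A) are the column-major vectorized blocks, ordered to match vec(B), so the leading singular triple of R(A) reshapes directly into the factors.

```python
import numpy as np

def nearest_kron(A, p1, q1, p2, q2):
    """Nearest Kronecker product: B (p1 x q1), C (p2 x q2)
    minimizing ||A - kron(B, C)||_F, via SVD of the rearrangement R(A)."""
    R = np.empty((p1 * q1, p2 * q2))
    r = 0
    # Row order (j outer, i inner) matches column-major vec(B).
    for j in range(q1):
        for i in range(p1):
            block = A[i * p2:(i + 1) * p2, j * q2:(j + 1) * q2]
            R[r] = block.flatten(order="F")   # vec(A_ij)
            r += 1
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Leading singular triple gives vec(B) and vec(C), up to a shared sign.
    B = np.sqrt(s[0]) * U[:, 0].reshape((p1, q1), order="F")
    C = np.sqrt(s[0]) * Vt[0].reshape((p2, q2), order="F")
    return B, C
```

When A is exactly a Kronecker product, R(A) has rank 1 and the factors are recovered exactly (up to the usual sign/scale indeterminacy of the pair, which leaves B ⊗ C unchanged).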
Proof of Theorem 1 Suppose the true parameter matrix has the form Bᵀ = α ⊗ β, where α ∈ R^{T×r} and β ∈ R^{p×k}. Thus Bᵀ is a block matrix with blocks B_ij = α_ij β : p × k, as in (33). In this proof, we assume the estimates α̂ and β̂ are computed using the algorithm in Section 4 of [40], which is an alternating least squares algorithm for the calculation of the largest singular value of R(Bᵀ), as required in the Van Loan and Pitsianis Kronecker product matrix approximation [40]. That is, for fixed β̂,

α̂_ij = tr(B̂_ijᵀ β̂) / tr(β̂ᵀ β̂),

where B̂_ij is the least squares estimate of the corresponding true B_ij in (33). The approximation algorithm for α̂ and β̂ is an alternating least squares algorithm and enjoys both global and local convergence [15]. Since the unconstrained least squares estimate B̂ is consistent for B, we obtain that B̂_ij is consistent for B_ij, for all i, j, and

α̂_ij →_p tr(B_ijᵀ β) / tr(βᵀ β) = tr(α_ij βᵀ β) / tr(βᵀ β) = α_ij tr(βᵀ β) / tr(βᵀ β) = α_ij,

and similarly β̂_ij →_p β_ij. Therefore, α̂ ⊗ β̂ →_p α ⊗ β. The unconstrained least squares estimate B̂ is also asymptotically normal [3]. Therefore, each of its elements and any of its block matrices are asymptotically normal. Alternating least squares is a special case of iteratively reweighted least squares (IRLS), which yields MLEs for the normal distribution, as well as for all members of the exponential family, because IRLS is equivalent to Fisher's scoring method [16, 23]. Thus, α̂ and β̂ are also asymptotically normal.
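The consistency argument above rests on the closed-form block update tr((α_ij β)ᵀ β)/tr(βᵀβ) = α_ij. A quick numerical check of this identity (illustrative Python, not the paper's R implementation; `als_alpha_update` is a hypothetical helper applying one alternating least squares step to the blocks of the unconstrained estimate):

```python
import numpy as np

def als_alpha_update(B_blocks, beta):
    """One ALS step: given the (i, j) blocks B_ij of the unconstrained
    estimate and a fixed beta, recover alpha_ij = tr(B_ij^T beta) / tr(beta^T beta)."""
    denom = np.trace(beta.T @ beta)
    return np.array([[np.trace(Bij.T @ beta) / denom for Bij in row]
                     for row in B_blocks])
```

Applied to exact blocks B_ij = α_ij β, the update returns α exactly, which is the population limit used in the proof.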