Matrix quadratic risk of orthogonally invariant estimators for a normal mean matrix

For estimation of a normal mean matrix under the matrix quadratic loss, we develop a general formula for the matrix quadratic risk of orthogonally invariant estimators. The derivation is based on several formulas for matrix derivatives of orthogonally invariant functions of matrices. As an application, we calculate the matrix quadratic risk of a singular value shrinkage estimator motivated by Stein's proposal, made 50 years ago, for improving on the Efron--Morris estimator.


Introduction
Suppose that we have a matrix observation X ∈ R^{n×p} whose entries are independent normal random variables X_{ij} ∼ N(M_{ij}, 1), where M ∈ R^{n×p} is an unknown mean matrix. In this setting, we consider estimation of M under the matrix quadratic loss [1,6]
L(M, M̂) = (M̂ − M)^⊤ (M̂ − M) ∈ R^{p×p}, (1)
with matrix quadratic risk R(M, M̂) = E_M[L(M, M̂)]. An estimator M̂_1 is said to dominate another estimator M̂_2 under this loss if c^⊤ R(M, M̂_1) c ≤ c^⊤ R(M, M̂_2) c for every M and c ∈ R^p. In particular, each column of M̂_1 then dominates the corresponding column of M̂_2 as an estimator of the corresponding column of M under quadratic loss. Recently, [6] investigated shrinkage estimation in this setting by introducing a concept called matrix superharmonicity, which can be viewed as a generalization of Stein's theory [7] for a normal mean vector. Note that shrinkage estimation of a normal mean matrix under the Frobenius loss, which is the trace of the matrix quadratic loss, has been well studied, e.g. [5,8,9,10,11,12].

Many common estimators of a normal mean matrix are orthogonally invariant; namely, they satisfy M̂(PXQ) = P M̂(X) Q for any orthogonal matrices P ∈ O(n) and Q ∈ O(p). This property can be viewed as a generalization of the rotational invariance of estimators for a normal mean vector, which is satisfied by many minimax shrinkage estimators [3]. We focus on orthogonally invariant estimators given by
M̂ = X + ∇h(X), (2)
where h(X) is an orthogonally invariant function satisfying h(PXQ) = h(X) for any orthogonal matrices P ∈ O(n) and Q ∈ O(p), and ∇ is the matrix gradient operator defined by (3). For example, the maximum likelihood estimator M̂ = X corresponds to h(X) = 0, and the Efron-Morris estimator [2]
M̂_EM = X (I_p − (n − p − 1)(X^⊤X)^{-1})
corresponds to h(X) = −((n − p − 1)/2) log det(X^⊤X). This estimator can be viewed as a matrix generalization of the James-Stein estimator, and it is minimax under the Frobenius loss [2] as well as under the matrix quadratic loss [6]. We will provide another example of an orthogonally invariant estimator of the form (2) in Section 4. Note that an estimator of the form (2) is called a pseudo-Bayes estimator [3], because it coincides with the (generalized) Bayes estimator when h is given by the logarithm of the marginal distribution of X with respect to some prior on M (Tweedie's formula).
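As a concrete illustration (a minimal NumPy sketch, not part of the paper; the function name efron_morris is ours), the Efron-Morris estimator can be computed directly and its orthogonal invariance M̂(PXQ) = P M̂(X) Q checked numerically:

```python
import numpy as np

def efron_morris(X):
    """Efron-Morris estimator M_hat = X (I_p - (n - p - 1) (X^T X)^{-1})."""
    n, p = X.shape
    return X @ (np.eye(p) - (n - p - 1) * np.linalg.inv(X.T @ X))

rng = np.random.default_rng(0)
n, p = 10, 3
M = rng.normal(size=(n, p))
X = M + rng.normal(size=(n, p))

# Orthogonal invariance: M_hat(P X Q) should equal P M_hat(X) Q
P = np.linalg.qr(rng.normal(size=(n, n)))[0]
Q = np.linalg.qr(rng.normal(size=(p, p)))[0]
print(np.allclose(efron_morris(P @ X @ Q), P @ efron_morris(X) @ Q))  # True
```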
In this study, to further the theory of shrinkage estimation under the matrix quadratic loss, we develop a general formula for the matrix quadratic risk of orthogonally invariant estimators of the form (2). First, we prepare several matrix derivative formulas in Section 2. Then, we derive the formula for the matrix quadratic risk in Section 3. Finally, we present an example in Section 4, motivated by Stein's proposal, made 50 years ago, for improving on the Efron-Morris estimator [7].

Matrix derivative formulas
Here, we develop matrix derivative formulas based on [7]. Note that, whereas [7] considered a setting where X is a p × n matrix, here we take X to be an n × p matrix. In the following, the subscripts a, b, . . . run from 1 to n and the subscripts i, j, . . . run from 1 to p. We denote the Kronecker delta by δ_{ij}.
We employ the following notation for matrix derivatives, introduced in [6].

Definition 2.1. For a function f : R^{n×p} → R, its matrix gradient ∇f : R^{n×p} → R^{n×p} is defined as
(∇f(X))_{ai} = ∂f(X)/∂X_{ai}. (3)

Definition 2.2. For a C² function f : R^{n×p} → R, its matrix Laplacian Δ̃f : R^{n×p} → R^{p×p} is defined as
(Δ̃f(X))_{ij} = Σ_{a=1}^{n} ∂²f(X)/(∂X_{ai} ∂X_{aj}). (4)

Let
X^⊤X = V Λ V^⊤ (5)
be a spectral decomposition of X^⊤X, where V = (v_1, . . . , v_p) is an orthogonal matrix and Λ = diag(λ_1, . . . , λ_p) is a diagonal matrix. Then, the derivatives of λ and V are obtained as follows. For the eigenvalues,
∂λ_i/∂X_{ab} = 2 (X v_i)_a (v_i)_b. (6)
Thus,
∇λ_i = 2 X v_i v_i^⊤, (7)
where v_i is the i-th column vector of V.
Proof. By differentiating V^⊤V = I_p and using (dV)^⊤V = (V^⊤dV)^⊤, we obtain
V^⊤ dV + (V^⊤ dV)^⊤ = 0, (8)
which means that V^⊤ dV is antisymmetric.
Taking the differential of (5), we have
d(X^⊤X) = (dV) Λ V^⊤ + V (dΛ) V^⊤ + V Λ (dV)^⊤. (9)
Then, multiplying (9) on the left by V^⊤ and on the right by V, we obtain
V^⊤ d(X^⊤X) V = (V^⊤ dV) Λ + dΛ + Λ (dV)^⊤ V. (10)
Since Λ and dΛ are diagonal and (dV)^⊤V = (V^⊤dV)^⊤ = −V^⊤dV by (8), the (i, j)-th entry of (10) yields
(V^⊤ d(X^⊤X) V)_{ij} = (λ_j − λ_i)(V^⊤ dV)_{ij} + δ_{ij} dλ_i. (11)
On the other hand, from d(X^⊤X) = (dX)^⊤ X + X^⊤ dX,
(V^⊤ d(X^⊤X) V)_{ij} = (X v_j)^⊤ (dX) v_i + (X v_i)^⊤ (dX) v_j. (12)
By taking i = j in (11),
dλ_i = (V^⊤ d(X^⊤X) V)_{ii}.
Then, by using (12),
dλ_i = 2 (X v_i)^⊤ (dX) v_i = 2 Σ_{a,b} (X v_i)_a (v_i)_b dX_{ab}.
Thus, we obtain (6) and it leads to (7).
Similarly, the derivative of V is obtained as follows. For i ≠ j, the off-diagonal entries of (11) give (V^⊤ dV)_{ij} = (V^⊤ d(X^⊤X) V)_{ij} / (λ_j − λ_i). Then, by using (12) and dV = V (V^⊤ dV), we obtain
∂V_{kj}/∂X_{ab} = Σ_{l≠j} V_{kl} [ (X v_j)_a V_{bl} + (X v_l)_a V_{bj} ] / (λ_j − λ_l), (13)
where the summation index was renamed to l in the last step.
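The eigenvalue derivative formula (7) can be checked numerically by finite differences. The following sketch is only an illustrative sanity check under the reconstruction above, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, eps = 6, 3, 1e-6
X = rng.normal(size=(n, p))

lam, V = np.linalg.eigh(X.T @ X)                    # eigenvalues in ascending order
i = p - 1                                           # check the largest eigenvalue
grad_formula = 2 * np.outer(X @ V[:, i], V[:, i])   # right-hand side of (7)

# Forward finite differences of lambda_i with respect to each entry of X
grad_fd = np.zeros((n, p))
for a in range(n):
    for b in range(p):
        Xp = X.copy()
        Xp[a, b] += eps
        grad_fd[a, b] = (np.linalg.eigvalsh(Xp.T @ Xp)[i] - lam[i]) / eps
print(np.max(np.abs(grad_formula - grad_fd)))       # small (finite-difference error)
```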
A function h is said to be orthogonally invariant if it satisfies h(PXQ) = h(X) for any orthogonal matrices P ∈ O(n) and Q ∈ O(p). Such a function can be written as h(X) = H(λ), where λ = (λ_1, . . . , λ_p) are the eigenvalues of X^⊤X as given by (5), and its derivatives are calculated as follows:
∂h(X)/∂X_{ab} = 2 Σ_{i=1}^{p} (∂H/∂λ_i) (X v_i)_a (v_i)_b. (14)
Thus,
∇h(X) = X V D V^⊤, (15)
where D is the p × p diagonal matrix given by D_{ii} = 2 ∂H/∂λ_i.
Proof. From (7) and the chain rule,
∂h(X)/∂X_{ab} = Σ_i (∂H/∂λ_i) ∂λ_i/∂X_{ab} = 2 Σ_i (∂H/∂λ_i) (X v_i)_a (v_i)_b,
which yields (14). Then, rewriting (14) in matrix form as 2 Σ_i (∂H/∂λ_i) X v_i v_i^⊤ = X V D V^⊤ yields (15).
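For a concrete instance of (15), take H(λ) = Σ_i log λ_i, so that h(X) = log det(X^⊤X) and the gradient also has the closed form 2 X (X^⊤X)^{-1}. The following sketch (illustrative, not from the paper) compares the two expressions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 3
X = rng.normal(size=(n, p))

# H(lambda) = sum_i log(lambda_i), i.e. h(X) = log det(X^T X)
lam, V = np.linalg.eigh(X.T @ X)
D = np.diag(2.0 / lam)                          # D_ii = 2 dH/dlambda_i = 2 / lambda_i
grad_spectral = X @ V @ D @ V.T                 # spectral form (15)
grad_closed = 2 * X @ np.linalg.inv(X.T @ X)    # closed form of the gradient of log det(X^T X)
print(np.allclose(grad_spectral, grad_closed))  # True
```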
The matrix Laplacian of h is obtained similarly:
Δ̃h(X) = V D V^⊤, (16)
where D is the p × p diagonal matrix given by
D_{ii} = 4 λ_i ∂²H/∂λ_i² + 2(n − p + 1) ∂H/∂λ_i + 2 Σ_{j≠i} (λ_i ∂H/∂λ_i − λ_j ∂H/∂λ_j)/(λ_i − λ_j).
Proof. Differentiating (14) once more with respect to X_{aj}, substituting the eigenvector derivative formula (13), taking the sum over a, and renaming the summation indices, we obtain (16).
By taking the trace of the matrix Laplacian (16), we have
Δh = tr(Δ̃h) = tr(D) = Σ_i [ 4 λ_i ∂²H/∂λ_i² + 2(n − p + 1) ∂H/∂λ_i + 4 λ_i (∂H/∂λ_i) Σ_{j≠i} 1/(λ_i − λ_j) ], (19)
where we used
Σ_i Σ_{j≠i} (λ_i ∂H/∂λ_i − λ_j ∂H/∂λ_j)/(λ_i − λ_j) = 2 Σ_i Σ_{j≠i} λ_i (∂H/∂λ_i)/(λ_i − λ_j).
This coincides with the Laplacian formula in [7].
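As a sanity check of the scalar Laplacian, for H(λ) = Σ_i log λ_i the Laplacian reduces to Δh = 2(n − p − 1) Σ_i 1/λ_i (this closed form follows from the standard identity div(X(X^⊤X)^{-1}) = (n − p − 1)(X^⊤X)^{-1} and does not depend on the presentation above). The following sketch compares it with a finite-difference Laplacian; it is an illustration, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, eps = 6, 3, 1e-4
X = rng.normal(size=(n, p))

def h(X):
    # h(X) = log det(X^T X), i.e. H(lambda) = sum_i log(lambda_i)
    return np.log(np.linalg.det(X.T @ X))

# Numerical Laplacian: sum of second partial derivatives over all entries of X
lap_fd = 0.0
for a in range(n):
    for i in range(p):
        Xp, Xm = X.copy(), X.copy()
        Xp[a, i] += eps
        Xm[a, i] -= eps
        lap_fd += (h(Xp) - 2 * h(X) + h(Xm)) / eps ** 2

lam = np.linalg.eigvalsh(X.T @ X)
print(lap_fd, 2 * (n - p - 1) * np.sum(1.0 / lam))  # the two values should be close
```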

Risk formula
Now, we derive a general formula for the matrix quadratic risk of orthogonally invariant estimators of the form (2).
Theorem 3.1. Let h(X) = H(λ) be an orthogonally invariant function. Then, the matrix quadratic risk of the estimator M̂ = X + ∇h(X) is given by
R(M, M̂) = n I_p + E_M[ V D V^⊤ ], (20)
where D is the p × p diagonal matrix given by
D_{ii} = 4 λ_i (∂H/∂λ_i)² + 8 λ_i ∂²H/∂λ_i² + 4(n − p + 1) ∂H/∂λ_i + 4 Σ_{j≠i} (λ_i ∂H/∂λ_i − λ_j ∂H/∂λ_j)/(λ_i − λ_j).
Proof. From [6], the matrix quadratic risk of an estimator M̂ = X + g(X) with a weakly differentiable function g is
R(M, M̂) = n I_p + E_M[ div g(X) + (div g(X))^⊤ + g(X)^⊤ g(X) ],
where the matrix divergence div g : R^{n×p} → R^{p×p} of a function g : R^{n×p} → R^{n×p} is defined as (div g(X))_{ij} = Σ_{a=1}^{n} ∂g_{aj}(X)/∂X_{ai}. Applying this formula with g = ∇h, and noting that div(∇h) = Δ̃h, which is given by (16), while (∇h)^⊤(∇h) = V D̃ Λ D̃ V^⊤ with D̃ the diagonal matrix in (15), we obtain (20) by combining the diagonal entries.
By taking the trace of (20) and using (19), we obtain the following formula for the Frobenius risk of orthogonally invariant estimators, which coincides with the one given by [7].
Corollary 3.2. Let h(X) = H(λ) be an orthogonally invariant function. Then, the Frobenius risk of the estimator M̂ = X + ∇h(X) is given by
E_M[ ‖M̂ − M‖_F² ] = n p + E_M[ Σ_i ( 4 λ_i (∂H/∂λ_i)² + 8 λ_i ∂²H/∂λ_i² + 4(n − p + 1) ∂H/∂λ_i + 8 λ_i (∂H/∂λ_i) Σ_{j≠i} 1/(λ_i − λ_j) ) ].
We derived the risk formula for orthogonally invariant estimators of the form (2), which are called pseudo-Bayes estimators [3]. The class of pseudo-Bayes estimators includes all Bayes and generalized Bayes estimators. It is an interesting problem for future work to extend the current result to general orthogonally invariant estimators. Extension to the unknown covariance case is also an important future problem; note that Section 6.6.2 of [9] derived a risk formula for a class of estimators in that setting.
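The following Monte Carlo sketch illustrates the risk identity used in the proof, specialized to the Efron-Morris shrinkage term g(X) = −(n − p − 1) X (X^⊤X)^{-1}, for which div g = −(n − p − 1)² (X^⊤X)^{-1} by the standard divergence identity mentioned above. It is an illustration under the reconstruction above, not the paper's code; the replication count and the choice M = 0 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 10, 3, 20000
M = np.zeros((n, p))        # an arbitrary fixed mean; zero keeps the sketch simple

lhs = np.zeros((p, p))      # empirical E[(Mhat - M)^T (Mhat - M)]
rhs = np.zeros((p, p))      # empirical n I_p + E[g^T g + div g + (div g)^T]
for _ in range(reps):
    X = M + rng.normal(size=(n, p))
    S_inv = np.linalg.inv(X.T @ X)
    g = -(n - p - 1) * X @ S_inv            # Efron-Morris shrinkage term
    Mhat = X + g
    div_g = -(n - p - 1) ** 2 * S_inv       # uses div(X (X^T X)^{-1}) = (n-p-1)(X^T X)^{-1}
    lhs += (Mhat - M).T @ (Mhat - M) / reps
    rhs += (n * np.eye(p) + g.T @ g + div_g + div_g.T) / reps

print(np.round(lhs, 2))
print(np.round(rhs, 2))     # the two matrices should agree up to Monte Carlo error
```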

Example
We provide an example of the application of Theorem 3.1. Let X = U Σ V^⊤ with U ∈ R^{n×p}, Σ = diag(σ_1, . . . , σ_p) and V ∈ R^{p×p} be a singular value decomposition of X, where U^⊤U = V^⊤V = I_p and σ_1 ≥ · · · ≥ σ_p ≥ 0 are the singular values of X. We consider the orthogonally invariant estimator given by
M̂ = Σ_{k=1}^{p} (1 − c_k/σ_k²) σ_k u_k v_k^⊤, (22)
where c_1, . . . , c_p ≥ 0 and u_k, v_k denote the k-th columns of U and V. Note that λ_k = σ_k².
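A minimal NumPy sketch of the estimator (22) is given below (the helper name sv_shrinkage is ours, not the paper's); it also checks that the choice c_k ≡ n − p − 1 reproduces the Efron-Morris estimator.

```python
import numpy as np

def sv_shrinkage(X, c):
    """Singular value shrinkage estimator (22): sigma_k is replaced by (1 - c_k / sigma_k^2) sigma_k."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s is in descending order
    return U @ np.diag(s - c / s) @ Vt

rng = np.random.default_rng(5)
n, p = 10, 3
X = rng.normal(size=(n, p))

c_em = np.full(p, n - p - 1)                           # c_k = n - p - 1
M_em = X @ (np.eye(p) - (n - p - 1) * np.linalg.inv(X.T @ X))
print(np.allclose(sv_shrinkage(X, c_em), M_em))        # True: this choice recovers Efron-Morris
```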
Theorem 4.2. The matrix quadratic risk of the estimator (22) is given by
R(M, M̂) = n I_p + E_M[ V D V^⊤ ], (23)
where D is the p × p diagonal matrix given by
D_{kk} = ( c_k² − 2(n − p − 1) c_k )/λ_k + 2 Σ_{l≠k} (c_l − c_k)/(λ_k − λ_l).
Proof. To apply Theorem 3.1, let
H(λ) = −(1/2) Σ_k c_k log λ_k,
for which M̂ = X + ∇h(X) in (15) coincides with (22). We have
∂H/∂λ_k = −c_k/(2 λ_k),   ∂²H/∂λ_k² = c_k/(2 λ_k²).
Thus, the diagonal entries in Theorem 3.1 become
D_{kk} = c_k²/λ_k + 4 c_k/λ_k − 2(n − p + 1) c_k/λ_k + 2 Σ_{l≠k} (c_l − c_k)/(λ_k − λ_l) = ( c_k² − 2(n − p − 1) c_k )/λ_k + 2 Σ_{l≠k} (c_l − c_k)/(λ_k − λ_l).
Therefore, we obtain (23) from Theorem 3.1.
The Efron-Morris estimator [2] corresponds to (22) with c_k ≡ n − p − 1. In this case,
D_{kk} = −(n − p − 1)²/λ_k.
Thus, its matrix quadratic risk (23) is
R(M, M̂_EM) = n I_p − (n − p − 1)² E_M[ (X^⊤X)^{−1} ]. (24)
This coincides with the result in [6]. Motivated by Stein's proposal [7] for improving on the Efron-Morris estimator, we consider the estimator (22) with c_k = n + p − 2k − 1. In the following, we call it "Stein's estimator" for convenience. Stein [7] stated that the positive-part of Stein's estimator dominates the positive-part of the Efron-Morris estimator under the Frobenius loss, where "positive-part" means the modification of (22) given by
M̂_+ = Σ_{k=1}^{p} (1 − c_k/σ_k²)_+ σ_k u_k v_k^⊤, (25)
where (a)_+ = max(0, a). It is known that the estimator (22) is dominated by its positive-part (25) under the Frobenius loss [8].
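The positive-part modification (25) and Stein's choice c_k = n + p − 2k − 1 can be implemented in the same way; the following sketch is illustrative (the function name is ours):

```python
import numpy as np

def sv_shrinkage_pp(X, c):
    """Positive-part version (25): the shrinkage factors (1 - c_k / sigma_k^2) are truncated at zero."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - c / s, 0.0)) @ Vt

n, p = 10, 3
k = np.arange(1, p + 1)
c_stein = n + p - 2 * k - 1          # Stein's choice c_k = n + p - 2k - 1 (decreasing in k)
c_em = np.full(p, n - p - 1)         # Efron-Morris choice

rng = np.random.default_rng(6)
X = rng.normal(size=(n, p))
print(sv_shrinkage_pp(X, c_stein))
print(sv_shrinkage_pp(X, c_em))
```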
Proposition 4.3. The matrix quadratic risk of Stein's estimator is given by (23) with the p × p diagonal matrix D given by
D_{kk} = (n + p − 2k − 1)(3p − 2k + 1 − n)/λ_k + 4 Σ_{l≠k} (k − l)/(λ_k − λ_l).
Thus, Stein's estimator dominates the maximum likelihood estimator under the matrix quadratic loss when n ≥ 3p − 1.
Proof. By substituting c_k = n + p − 2k − 1 into Theorem 4.2,
D_{kk} = (n + p − 2k − 1)(3p − 2k + 1 − n)/λ_k + 4 Σ_{l≠k} (k − l)/(λ_k − λ_l).
The second term is nonpositive since λ_1 ≥ λ_2 ≥ · · · ≥ λ_p. When n ≥ 3p − 1, the first term is also nonpositive, and thus R(M, M̂) ⪯ n I_p for every M.
Numerical results indicate that the bound on n in Proposition 4.3 may be relaxed to n ≥ p + 2, which is the same bound as for the Efron-Morris estimator. See the Appendix.
Finally, we present simulation results comparing Stein's estimator and the Efron-Morris estimator. Figure 1 compares the Frobenius risk of Stein's estimator and the Efron-Morris estimator when n = 10 and p = 3. It suggests that Stein's estimator dominates the Efron-Morris estimator under the Frobenius loss. Both estimators attain constant risk reduction when some singular values of M are small, regardless of the magnitude of the other singular values. Thus, both estimators work well for low-rank matrices. See [6] for related discussion. Figure 2 plots the three eigenvalues λ_1 ≥ λ_2 ≥ λ_3 of the matrix quadratic risk of Stein's estimator and the Efron-Morris estimator in the same setting as Figure 1. Since all eigenvalues are less than n = 10, the matrix quadratic risk R(M, M̂) satisfies R(M, M̂) ⪯ n I_p for every M. Thus, both estimators dominate the maximum likelihood estimator under the matrix quadratic loss, which is compatible with (24) and Proposition 4.3. Also, each eigenvalue for Stein's estimator is smaller than the corresponding one for the Efron-Morris estimator, which suggests that Stein's estimator dominates the Efron-Morris estimator even under the matrix quadratic loss. Developing a rigorous theory for this is an interesting direction for future work.
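A hedged sketch of such a Monte Carlo comparison is given below; the mean matrix, replication count, and function names are illustrative choices and not the paper's exact simulation settings.

```python
import numpy as np

def sv_shrinkage(X, c):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(s - c / s) @ Vt

def frobenius_risk(M, c, reps=2000, seed=0):
    """Monte Carlo estimate of E || Mhat - M ||_F^2 for the estimator (22)."""
    rng = np.random.default_rng(seed)
    n, p = M.shape
    total = 0.0
    for _ in range(reps):
        X = M + rng.normal(size=(n, p))
        total += np.sum((sv_shrinkage(X, c) - M) ** 2)
    return total / reps

n, p = 10, 3
k = np.arange(1, p + 1)
c_em = np.full(p, n - p - 1)
c_stein = n + p - 2 * k - 1

# A rank-one mean with one large singular value, in the spirit of the low-rank discussion above
M = np.zeros((n, p))
M[0, 0] = 10.0
print("Efron-Morris:", frobenius_risk(M, c_em))
print("Stein       :", frobenius_risk(M, c_stein))
print("MLE (exact) :", n * p)
```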