The Global Convergence of the Nonlinear Power Method for Mixed-Subordinate Matrix Norms

We analyze the global convergence of the power iterates for the computation of a general mixed-subordinate matrix norm. We prove a new global convergence theorem for a class of entrywise nonnegative matrices that generalizes and improves a well-known result for mixed-subordinate ℓ^p matrix norms. In particular, exploiting the Birkhoff–Hopf contraction ratio of nonnegative matrices, we obtain novel and explicit global convergence guarantees for a range of matrix norms whose computation has been recently proven to be NP-hard in the general case, including mixed-subordinate norms induced by vector norms built as sums of ℓ^p norms of subsets of entries.


Introduction
Let A be an m × n matrix and consider the matrix norm

‖A‖_{β→α} = max_{x ≠ 0} ‖Ax‖_α / ‖x‖_β,

where ‖·‖_α and ‖·‖_β are vector norms.
Computing ‖A‖_{β→α} is a classical problem in computational mathematics, as norms of this kind arise naturally in many situations, such as approximation theory, estimation of matrix condition numbers and approximation of relative residuals [26]. Attention to the problem of computing ‖A‖_{β→α} has been growing in recent years. Matrix norms of this type can be used in combinatorial optimization and sparse data recovery, to approximate generalized Grothendieck and restricted isometry constants [1,6,16,30]; in scientific computing, to estimate the largest entries of large matrices [27]; in data mining and learning theory, to minimize empirical risks or obtain robust nonnegative graph embeddings [9,41]; and in quantum information theory and the study of Khot's unique games conjecture, where the computational complexity of evaluating ‖A‖_{β→α} plays an important role [2]. Moreover, it was observed by Lim [33] that the notions of tensor norm and tensor spectrum relate to ‖A‖_{β→α} in a very natural way, and relevant advances on the problem of computing ‖A‖_{β→α} when A is entrywise nonnegative and ‖·‖_α, ‖·‖_β are ℓ^p norms have recently been obtained as a consequence of a number of new nonlinear Perron–Frobenius-type theorems for higher-order maps [15,19,21,22].
Closed-form solutions and efficient algorithms are known for some special ℓ^p norms, for instance the case where ‖·‖_α = ‖·‖_β and they coincide with either the ℓ^1, the ℓ^2, or the ℓ^∞ norm, or the case where p ≤ 1 ≤ q and ‖·‖_α and ‖·‖_β are ℓ^p and ℓ^q (semi-)norms, respectively (cf. [10,32,36]). However, the computation of ‖A‖_{β→α} is NP-hard in general [23,38].
The best known method for the computation of ‖A‖_{β→α} is the (nonlinear) power method, essentially introduced by Boyd [4] and then further analyzed and extended, for instance, in [3,15,25,39]. When the considered vector norms are ℓ^p norms, the power method can rely on a fundamental global convergence result which ensures convergence to the matrix norm ‖A‖_{β→α} for a class of entrywise nonnegative matrices A and for a range of ℓ^p norms. We discuss the method and its convergence in detail in Sect. 2.
The convergence of the method is a consequence of an elegant fixed point argument that involves a nonlinear operator S_A and its Lipschitz contraction constant. However, the convergence analysis of this method has two main gaps. On the one hand, all the work done so far addresses only the case of ℓ^p norms, whereas almost nothing is known about the global convergence behavior of the power iterates for more general norms. On the other hand, even for the case of ℓ^p norms, known upper bounds on the contraction constant of S_A are not sharp, especially for positive matrices. In this work we provide novel results that address and improve both these directions.
Consider for example the case where ‖·‖_α is defined as

‖x‖_α = ‖(x_1, …, x_k)‖_{p_1} + ‖(x_{k+1}, …, x_n)‖_{p_2},    (1)

where k is a positive integer not larger than the dimension of x and ‖·‖_{p_i} are ℓ^{p_i} norms. Of course, one can extend this idea by looking at any family of subsets of the entries of x and any set of ℓ^p norms, in order to generate arbitrary new norms. Norms of this form are natural modifications of ℓ^p norms and are used, for instance, to define the generalized Grothendieck constants as in [30], or in graph matching problems to build continuous relaxations of the set of permutation matrices [11,34]. However, even for this case, extending the result of Boyd is not straightforward. In this work we consider general pairs of monotonic and differentiable vector norms and provide a thorough convergence analysis of the power method for the computation of the corresponding induced matrix norm ‖A‖_{β→α}. Our result is based on a novel nonlinear Perron–Frobenius theorem for this kind of norms and ensures global convergence of the power method provided that the Birkhoff contraction ratio of the power iterator is smaller than one.
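As a concrete illustration (ours, not part of the original text), norms built from ℓ^{p_i} norms of subsets of the entries are straightforward to evaluate. The following sketch treats the index subsets and exponents as free parameters; the function name and block encoding are our own conventions:

```python
import numpy as np

def mixed_lp_norm(x, blocks, ps):
    """Sum of l^{p_i} norms of subsets of the entries of x:
    ||x||_alpha = sum_i || x[J_i] ||_{p_i}.
    `blocks` is a list of index arrays J_i and `ps` the matching
    exponents; both are illustrative parameters, not notation
    taken from the paper."""
    return sum(np.linalg.norm(x[J], ord=p) for J, p in zip(blocks, ps))

x = np.array([3.0, 4.0, 1.0, 1.0])
# ||(x_1, x_2)||_2 + ||(x_3, x_4)||_1 = 5 + 2
val = mixed_lp_norm(x, [np.array([0, 1]), np.array([2, 3])], [2, 1])
```

Monotonicity of each ℓ^{p_i} norm and of the outer sum makes the resulting norm monotonic, which is the structural property exploited later in the paper.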
When applied to the case ‖A‖_{q→p} of ℓ^p norms, our result not only implies the current convergence result, but actually significantly improves the range of values of p and q for which global convergence can be ensured. This is particularly interesting from a complexity viewpoint. For example, although the computation of ‖A‖_{q→p} is well known to be NP-hard for p > q, we show that for a non-trivial class of nonnegative matrices the power method converges to ‖A‖_{q→p} in polynomial time even for p significantly larger than q. To our knowledge this is the first global optimality result for this problem that does not require the condition p ≤ q.
In the general case ‖A‖_{β→α}, a main computational drawback of the power method is the computation of the dual norm ‖·‖_{β*}. In fact, if ‖·‖_β is not an ℓ^p norm, the corresponding dual norm may be challenging to compute [14]. In practice, evaluating ‖·‖_{α*} from ‖·‖_α can be done via convex optimization, and Corollary 7 of [14] proves that ‖·‖_{α*} can be evaluated in polynomial time (resp. is NP-hard) if and only if ‖·‖_α can be evaluated in polynomial time (resp. is NP-hard). There are norms for which an explicit expression in terms of arithmetic operations is available for ‖·‖_α by construction (resp. modeling), but no such expression is available for the dual ‖·‖_{α*}. As we discuss in Sect. 5.1, examples of this type include, for instance, ‖x‖_α = (‖x‖_p^2 + ‖x‖_q^2)^{1/2}. A further main result of this work addresses this issue for the particular case of norms of the type (1). For this family of norms we provide an explicit convergence bound and an explicit formula for the power iterator for the computation of the corresponding matrix norm ‖A‖_{β→α}. To illustrate possible applications of the result, we list in Corollaries 3–8 relatively sophisticated and non-standard matrix norms together with an explicit condition for their computability.
We organize the discussion as follows. In Sect. 2 we review the nonlinear power method and its main convergence properties. In Sect. 3 we review relevant preliminary cone-theoretic results and notation. Then, in Sect. 4, we propose a novel and detailed global convergence analysis of the method based on a Perron–Frobenius-type result for the map x ↦ ‖Ax‖_α / ‖x‖_β, in the case of entrywise nonnegative matrices and monotonic norms ‖·‖_α, ‖·‖_β. We derive new conditions for the global convergence to ‖A‖_{β→α} that, in particular, help shed new light on the NP-hardness of the problem, and we propose a new explicit bound on the linear convergence rate of the power iterates. In Sect. 5 we focus on the particular case of norms of the same form as (1). We show how to practically implement the power method for this type of norms, we prove a specific convergence criterion that gives a-priori global convergence guarantees, and we discuss the complexity of the method. Finally, in Sect. 6 we illustrate the behaviour of the nonlinear power method on some example matrix norms.

Boyd's Nonlinear Power Method
Let ‖·‖_p, ‖·‖_q be the usual ℓ^p and ℓ^q vector norms and consider the induced matrix norm ‖A‖_{q→p} = max_{x ≠ 0} ‖Ax‖_p / ‖x‖_q. Well-known explicit formulas hold for the ℓ^1 and ℓ^∞ matrix norms ‖A‖_{1→1}, ‖A‖_{∞→∞}. However, while the mixed norm ‖A‖_{1→∞} equals max_{ij} |a_{ij}|, the computation of ‖A‖_{∞→1} is NP-hard [36]. More generally, when p is any rational number with p ≠ 1, 2, computing the norm ‖A‖_{p→p} is NP-hard for a general matrix A [23], and the same holds for any norm ‖A‖_{q→p} with 1 ≤ p < q ≤ ∞ [38]. The best known technique to compute ‖A‖_{q→p} is a form of nonlinear power method that we review in what follows.
Consider the nonnegative function f_A(x) = ‖Ax‖_p / ‖x‖_q. The norm ‖A‖_{q→p} is the global maximum of f_A. By analyzing the optimality conditions of f_A for differentiable ℓ^p norms ‖·‖_p and ‖·‖_q, we note that every critical point of f_A with positive value satisfies A^T Φ_p(Ax) = λ Φ_q(x) for some λ > 0, where Φ_r(x)_i = |x_i|^{r−2} x_i. The associated fixed point iteration

x_{k+1} = S_A(x_k),   S_A(x) = J_{q*}( A^T J_p(Ax) ),    (2)

with q* = q/(q − 1) and J_r(x) = Φ_r(x)/‖x‖_r^{r−1} the gradient of the ℓ^r norm, defines what we call the (nonlinear) power method for ‖A‖_{q→p}. Although, in practice, the method applied to ‖A‖_{p→p} for p ≠ 1, ∞ often seems to converge to the global maximum (see e.g. [24]), no guarantees exist for the general case. For differentiable ℓ^p norms and nonnegative matrices, instead, conditions can be established in order to guarantee that the power iterates always converge to a global maximizer of f_A. The idea is that when the power method is started in the positive orthant then, provided A has an appropriate non-zero pattern, each iterate of the method stays in this orthant until convergence. Then, a nonlinear Perron–Frobenius-type result guarantees that there exists only one critical point of f_A in this region and that this point is a global maximizer of f_A. While this idea was already known to Perron himself in the Euclidean case p = q = 2, to our knowledge the first version of this result for norms different from the Euclidean norm was proved by Boyd [4]. However, Boyd did not prove the uniqueness of positive critical points, but only that they are global maximizers of f_A, under the assumption that A^T A is irreducible and 1 < p ≤ q < ∞. This work was then revisited by Bhaskara and Vijayaraghavan [3], who proved uniqueness for positive matrices A and 1 < p ≤ q < ∞. Independently, Friedland, Gaubert and Han proved in [15] similar results for 1 < p ≤ 2 ≤ q < ∞ and any nonnegative A such that the matrix [0 A; A^T 0] is irreducible. Their result was then extended to 1 < p ≤ q < ∞ in [18] under the assumption that A^T A is irreducible. Finally, all these results have been improved in [22], leading to the following:

Theorem 1 Let A ∈ R^{m×n} be a matrix with nonnegative entries such that A^T A is irreducible, and let 1 < p ≤ q < ∞. Then f_A has a unique positive critical point x_+ (up to scalar multiples), x_+ is a global maximizer of f_A, and the fixed point iteration (2) converges to x_+ for every positive starting point.
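To make the iteration concrete, here is a minimal sketch (our own illustration) of the power method for ℓ^p norms; the function name, starting point and stopping rule are our choices, not prescriptions from the paper:

```python
import numpy as np

def power_method_qp(A, p, q, tol=1e-12, max_iter=10000):
    """Nonlinear power method for ||A||_{q->p}, 1 < p, q < inf.
    Uses Phi_r(z)_i = |z_i|^{r-2} z_i; the duality map J_r equals
    Phi_r up to a positive scalar, which the per-step normalization
    below absorbs."""
    q_star = q / (q - 1.0)                    # Hoelder conjugate of q
    Phi = lambda z, r: np.sign(z) * np.abs(z) ** (r - 1.0)
    x = np.ones(A.shape[1])                   # positive starting point
    x /= np.linalg.norm(x, q)
    for _ in range(max_iter):
        y = Phi(A.T @ Phi(A @ x, p), q_star)  # S_A(x) up to scaling
        y /= np.linalg.norm(y, q)
        if np.linalg.norm(y - x, np.inf) < tol:
            x = y
            break
        x = y
    return np.linalg.norm(A @ x, p)
```

For p = q = 2 the iteration reduces to the classical power method on A^T A, so the output can be checked against the spectral norm.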
In this work we consider the case of a matrix norm defined in terms of arbitrary vector norms ‖·‖_α and ‖·‖_β and we prove Theorem 4 below, a new version of Theorem 1 holding for general vector norms, provided that suitable and mild differentiability and monotonicity conditions are satisfied. We stress that Theorems 1 and 4 are not corollaries of each other, in the sense that there are cases where exactly one, both, or neither applies. However, when both apply, Theorem 4 is more informative. We discuss these discrepancies in detail in Sect. 4.1 and give examples there to illustrate them. In particular, a noticeable difference is that, for positive matrices A, the newly proposed Theorem 4 ensures uniqueness and maximality for choices of 1 < p, q < ∞ that include the range p > q. This is, to our knowledge, the first global optimality result for this problem that includes such a range of values.
The key to our approach is the use of cone-geometric techniques and the Birkhoff–Hopf theorem, which we recall below.

Cone-theoretic Background
We start by recalling concepts from conic geometry. Let R^n_+ be the nonnegative orthant in R^n, that is, x ∈ R^n_+ if x_i ≥ 0 for every i = 1, …, n. The cone R^n_+ induces a partial ordering on R^n as follows: for every x, y ∈ R^n we write x ≤ y if y − x ∈ R^n_+, i.e. x_i ≤ y_i for every i. Furthermore, x, y ∈ R^n_+ are comparable, and we write x ∼ y, if there exist c, C > 0 such that cy ≤ x ≤ Cy. Clearly, ∼ is an equivalence relation, and the equivalence classes in R^n_+ are called the parts of R^n_+. For example, if n = 2 and x = (1, 0), then the equivalence class of x in R^2_+ is given by {(y_1, 0) : y_1 > 0}. For simplicity, from now on we say that a vector is nonnegative (resp. positive) if its entries are nonnegative (resp. positive). The same nomenclature is used for matrices.
We recall that a norm ‖·‖ on R^n is monotonic if ‖x‖ ≤ ‖y‖ for every x, y ∈ R^n such that |x| ≤ |y|, where the absolute value is taken componentwise, and it is strongly monotonic if ‖x‖ < ‖y‖ for every x, y ∈ R^n with |x| ≤ |y| and |x| ≠ |y|.
One of the key tools for our main result is the Hilbert projective metric d_H : R^n_+ × R^n_+ → [0, ∞], defined as follows: for x, y ∈ R^n_+ \ {0}, let M(x/y) = inf{t > 0 : x ≤ t y}; then

d_H(x, y) = ln( M(x/y) M(y/x) )  if x ∼ y,   d_H(x, y) = ∞  otherwise,

and d_H(0, 0) = 0. We collect in the following lemma some useful properties of d_H. Most of these results are known and can be found in [31]. Moreover, similarly to what is observed in Theorem 3 of [20], we prove a direct relation between the infinity norm and the Hilbert metric, which is useful for deriving explicitly computable convergence rates for the power method.
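Before stating the lemma, here is a small computational sketch (ours, for illustration only) of d_H on the nonnegative orthant:

```python
import numpy as np

def hilbert_metric(x, y):
    """d_H(x, y) = ln( M(x/y) * M(y/x) ) with M(x/y) = inf{t > 0 : x <= t y}.
    Returns inf when x and y are not comparable (different zero patterns)
    and 0 for two zero vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if not np.array_equal(x > 0, y > 0):
        return np.inf                 # x and y lie in different parts
    support = x > 0
    if not support.any():
        return 0.0                    # both vectors are zero
    r = x[support] / y[support]
    # M(x/y) = max r, M(y/x) = 1 / min r, so d_H = ln(max r / min r)
    return float(np.log(r.max() / r.min()))
```

Note that d_H(λx, y) = d_H(x, y) for every λ > 0, so d_H is a metric on rays of a part rather than on individual vectors.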
Lemma 1 Let ‖·‖ be a monotonic norm on R^n, let P be a part of R^n_+ and let M = {x ∈ P : ‖x‖ = 1}. Then d_H(λx, μy) = d_H(x, y) for all x, y ∈ P and λ, μ > 0, the pair (M, d_H) is a complete metric space and, for all x, y ∈ M,

‖x − y‖_∞ ≤ r d_H(x, y),    (3)

where r = inf{t > 0 : x_i ≤ t for all x ∈ M and i = 1, …, n}.

Proof The scale invariance and the completeness of (M, d_H) can be found in [31]. We prove (3). If P = {0} the result is trivial, so we assume P ≠ {0} and let i_1, …, i_m be such that, for any z ∈ R^n_+, z ∈ P if and only if z_{i_1}, …, z_{i_m} > 0. Let x, y ∈ M. Then x ≤ M(x/y) y and, exchanging the roles of x and y, we obtain |ln(x_{i_j}) − ln(y_{i_j})| ≤ d_H(x, y) for every j = 1, …, m. By the definition of r > 0, we have ln(x_{i_j}), ln(y_{i_j}) ∈ (−∞, ln(r)] for every j = 1, …, m. Furthermore, by the mean value theorem applied to the exponential function,

|x_{i_j} − y_{i_j}| ≤ r |ln(x_{i_j}) − ln(y_{i_j})| ≤ r d_H(x, y).

As the entries of x and y vanish outside i_1, …, i_m, this proves ‖x − y‖_∞ ≤ r d_H(x, y).

Observe that if r is defined as in Lemma 1 and ‖·‖ is strongly monotonic, then

r ≤ max_{j = 1, …, n} ‖e_j‖^{−1}.    (4)

Indeed, if y ∈ M is such that there exists j ∈ {1, …, n} with y_j > ‖e_j‖^{−1}, then 1 = ‖y‖ ≥ ‖y_j e_j‖ = y_j ‖e_j‖ > 1, which is not possible.

The proof of our main theorem is based on the Banach contraction principle. Thus, for a map F : R^n_+ → R^m_+ we consider the Birkhoff contraction ratio κ_H(F) ∈ [0, ∞] of F, defined as the smallest Lipschitz constant of F with respect to d_H:

κ_H(F) = inf{ C ≥ 0 : d_H(F(x), F(y)) ≤ C d_H(x, y) for all x, y ∈ R^n_+ with x ∼ y }.

Clearly, if there exist x, y ∈ R^n_+ such that x ∼ y and F(x) ≁ F(y), then κ_H(F) = ∞. However, such a situation never happens when F is a linear map, in which case κ_H(F) ≤ 1 always holds. Indeed, if A ∈ R^{m×n} is a nonnegative matrix, x, y ∈ R^n_+ and x ∼ y, then x ≤ M(x/y) y implies Ax ≤ M(x/y) Ay. Similarly, we have Ay ≤ M(y/x) Ax and thus Ax ∼ Ay. These inequalities also imply that κ_H(A) ≤ 1. This upper bound is not tight in many cases. However, thanks to the Birkhoff–Hopf theorem, a better estimate of κ_H(A) can be obtained by computing the projective diameter Δ(A) ∈ [0, ∞] of A, defined as

Δ(A) = sup{ d_H(Ax, Ay) : x, y ∈ R^n_+ with Ax ∼ Ay }.

This is formalized in the following theorem, whose proof can be found in Theorems 3.5 and 3.6 of [12].

Theorem 2 (Birkhoff–Hopf) Let A ∈ R^{m×n} be a matrix with nonnegative entries. Then κ_H(A) = tanh( Δ(A)/4 ), with the convention tanh(∞) = 1.
The above theorem is particularly useful when combined with the following consequence of Theorem 6.2 in [12] and Theorem 3.12 in [37]:

Theorem 3 Let A ∈ R^{m×n} be a matrix with nonnegative entries and let e_1, …, e_n be the canonical basis of R^n. If there exists I ⊂ {1, …, n} such that Ae_i ∼ Ae_j for all i, j ∈ I and Ae_i = 0 for all i ∉ I, then

Δ(A) = max_{i, j ∈ I} d_H(Ae_i, Ae_j) = max_{i, j ∈ I} ln( max_{k, l} (A_{ki} A_{lj}) / (A_{kj} A_{li}) ).

Unfortunately, such simple formulas for the Birkhoff contraction ratio are, to our knowledge, not known for general nonlinear mappings. We refer, however, to Corollary 2.1 in [35] and Corollary 3.9 in [17] for general characterizations of this ratio.
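For matrices with all entries positive, the cross-ratio formula of Theorem 3 is directly computable. The following sketch (our own illustration) evaluates Δ(A) and the resulting Birkhoff–Hopf bound κ_H(A) = tanh(Δ(A)/4):

```python
import numpy as np

def projective_diameter(A):
    """Delta(A) for an entrywise positive matrix, via the cross-ratio
    formula: Delta(A) = max_{i,j} ln max_{k,l} (A_ki A_lj)/(A_kj A_li)."""
    A = np.asarray(A, dtype=float)
    d = 0.0
    for i in range(A.shape[1]):
        for j in range(A.shape[1]):
            cross = np.outer(A[:, i], A[:, j]) / np.outer(A[:, j], A[:, i])
            d = max(d, np.log(cross.max()))
    return d

def birkhoff_contraction(A):
    """Birkhoff-Hopf contraction ratio kappa_H(A) = tanh(Delta(A)/4),
    which is strictly smaller than 1 for entrywise positive A."""
    return np.tanh(projective_diameter(A) / 4.0)
```

On a rank-one positive matrix all columns are projectively equal, so Δ(A) = 0 and κ_H(A) = 0; any entrywise positive matrix has finite Δ(A) and hence κ_H(A) < 1.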

Nonlinear Perron–Frobenius Theorem for ‖A‖_{β→α}
In this section we consider the matrix norm ‖A‖_{β→α} = max_{x ≠ 0} ‖Ax‖_α / ‖x‖_β, where ‖·‖_α and ‖·‖_β are arbitrary vector norms on C^m and C^n, respectively. Then, as for the case of ℓ^p norms, consider the function f_A(x) = ‖Ax‖_α / ‖x‖_β. For an arbitrary, possibly non-differentiable, vector norm ‖·‖ it holds (see e.g. [13])

∂‖x‖ = { y : ‖y‖_* ≤ 1 and ⟨x, y⟩ = ‖x‖ },

where ∂ denotes the subdifferential and ‖·‖_* is the dual norm of ‖·‖, defined as ‖y‖_* = max_{x ≠ 0} ⟨x, y⟩ / ‖x‖. For notational convenience, given the vector norm ‖·‖_α, we introduce the set-valued operator J_α defined by J_α(x) = ∂‖x‖_α. The definition of the dual norm implies the generalized Hölder inequality ⟨x, y⟩ ≤ ‖x‖ ‖y‖_*. Thus, for a nonzero vector x and a norm ‖·‖_α, the set J_α(x) coincides with the set of vectors in the unit sphere of the dual norm of ‖·‖_α for which equality holds in the Hölder inequality. In fact, by Asplund's theorem (see e.g. [7]), the subdifferential J_α is strictly related to the duality mapping induced by the norm. It is well known that the subgradient of a convex function f is a singleton if and only if f is Fréchet differentiable. Therefore J_α is single-valued if and only if ‖·‖_α is a Fréchet differentiable norm. The assumption that the duality maps involved are single-valued will be crucial for our main result. For this reason, throughout we make the following assumptions on the norms we are considering:

Assumption 1 The norms ‖·‖_α and ‖·‖_β we consider are such that:
1. ‖·‖_α is Fréchet differentiable on R^m \ {0};
2. the dual norm ‖·‖_{β*} is Fréchet differentiable on R^n \ {0};
3. both ‖·‖_α and ‖·‖_{β*} are strongly monotonic.

Remark 1
Recall that every monotonic norm ‖·‖ is also absolute (see e.g. [28, Thm. 1]), that is, ‖|x|‖ = ‖x‖ for every x, where |x| denotes the entrywise absolute value. This implies, in particular, that a monotonic norm is Fréchet differentiable at every x ∈ R^n \ {0} if and only if it is Fréchet differentiable at every x ∈ R^n_+ \ {0}.
Points (1) and (2) of Assumption 1 ensure that the following nonlinear mapping is single-valued:

S_A(x) = J_{β*}( A^T J_α(Ax) ).

Point (3) ensures that for nonnegative matrices the maximum of f_A is attained at a nonnegative vector and that, if A^T A is irreducible, then this maximizer has positive entries. Overall, these assumptions allow us to prove the fundamental preliminary Lemmas 2–6. First, we discuss the critical points of f_A. If ‖·‖_α and ‖·‖_β satisfy Assumption 1, then f_A may not be differentiable. Indeed, the differentiability of ‖·‖_{β*} does not imply that of ‖·‖_β (see for instance [7, Chapter II]). Hence, in the following, we use Clarke's generalized gradient [8] to discuss the critical points of f_A. In particular, let us recall that, by [8, Prop. 2.2.7], the generalized gradient of a convex function coincides with its subgradient. Moreover, it can be verified that f_A is locally Lipschitz near every x ∈ R^n \ {0}, so that its generalized gradient is well defined there.
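The general iteration can be sketched with the duality maps supplied as callables; everything below (the function name, the chosen exponents, the example matrix and the closed-form power maps for ℓ^p norms) is our own illustration under Assumption 1, not code from the paper:

```python
import numpy as np

def S_A(A, J_alpha, J_beta_star, x):
    """One step of the map S_A(x) = J_{beta*}(A^T J_alpha(A x)),
    with the (single-valued) duality maps passed as callables."""
    return J_beta_star(A.T @ J_alpha(A @ x))

# Illustration with alpha = l^p, beta = l^q (assumed exponents below).
# The duality maps are taken up to positive scaling, which the per-step
# normalization makes irrelevant on the positive orthant.
p, q = 2.0, 3.0
q_star = q / (q - 1.0)
J_alpha = lambda z: np.sign(z) * np.abs(z) ** (p - 1.0)
J_beta_star = lambda z: np.sign(z) * np.abs(z) ** (q_star - 1.0)

A = np.array([[1.0, 2.0], [3.0, 1.0]])
x = np.ones(2)                       # positive starting point
for _ in range(300):
    x = S_A(A, J_alpha, J_beta_star, x)
    x /= np.linalg.norm(x, q)
approx_norm = np.linalg.norm(A @ x, p)   # approximates ||A||_{q->p}
```

Since A is entrywise positive, the iterates stay in the interior of the positive orthant, matching the invariance discussed above.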

Lemma 2 If x is a critical point of f_A, then it is a fixed point of S_A. Conversely, if x is a fixed point of S_A and ‖·‖_β is differentiable, then x is a critical point of f_A.
Proof First, assume that 0 ∈ ∂f_A(x). As ‖·‖_α and ‖·‖_β are Lipschitz functions and ‖x‖_β = 1, Proposition 2.3.14 of [8] implies that

∂f_A(x) ⊆ { A^T u − ‖Ax‖_α v : u ∈ J_α(Ax), v ∈ ∂‖x‖_β }.    (10)

Hence there exist u ∈ J_α(Ax) and v ∈ ∂‖x‖_β with A^T u = ‖Ax‖_α v; applying J_{β*} and using its homogeneity of degree zero, we obtain x = J_{β*}(A^T J_α(Ax)) = S_A(x), i.e. x is a fixed point of S_A. If ‖·‖_β is differentiable, then f_A is differentiable at x and the sets in (10) are equal (and singletons); reversing the argument, it follows that 0 ∈ ∂f_A(x).

Lemma 3 Let A ∈ R^{m×n} be a matrix with nonnegative entries and let P be a part of R^n_+ such that A^T Ax ∈ P for every x ∈ P. Furthermore, let ‖·‖_α and ‖·‖_β satisfy Assumption 1. If κ_H(S_A) ≤ τ < 1, then S_A has a unique fixed point z in P and, for every positive integer k and every x ∈ P,

d_H( S_A^k(x), z ) ≤ τ^k d_H(x, z).

Proof Let M = {x ∈ P : ‖x‖_β = 1}. Since (M, d_H) is complete by Lemma 1, it follows from the Banach fixed point theorem (see for instance Theorem 3.1 in [29]) that S_A has a unique fixed point z in M and that the above estimate holds for every y ∈ M.
Since d_H(λx, y) = d_H(x, y) for every λ > 0, the convergence rate is a direct consequence of the above inequality and Lemma 1.
We remark that this result does not guarantee that the unique fixed point z of S_A in P is a global maximizer of f_A, and in fact this is not always true. Indeed, if A is a 2 × 2 diagonal matrix which is not a multiple of the identity and ‖·‖_α = ‖·‖_2, ‖·‖_β = ‖·‖_3, then κ_H(S_A) ≤ 1/2 and S_A leaves all the parts of R^2_+ invariant, but some of them do not contain a global maximizer of f_A. Moreover, as R^n_+ has 2^n parts, testing each part of the cone is computationally too expensive for large n. Therefore, in the remainder of the section, we derive conditions ensuring that the power iterates converge to a global maximizer of f_A.

Lemma 4 Let A ∈ R^{m×n} be a matrix with nonnegative entries and let ‖·‖_α and ‖·‖_β be monotonic norms. Then f_A(|x|) ≥ f_A(x) for every x ≠ 0. In particular, f_A attains its global maximum ‖A‖_{β→α} at a nonnegative vector.
Proof Let x ≠ 0. Since A has nonnegative entries, it holds |Ax| ≤ A|x|. Thus, as monotonic norms are also absolute, we have

f_A(x) = ‖Ax‖_α / ‖x‖_β = ‖|Ax|‖_α / ‖|x|‖_β ≤ ‖A|x|‖_α / ‖|x|‖_β = f_A(|x|).

In the forthcoming Lemma 6, we use the strong monotonicity required in Point (3) of Assumption 1 to prove that if A^T A is irreducible, then the nonnegative maximizer of Lemma 4 has positive entries. To this end, however, we need one additional preliminary result, which characterizes strongly monotonic norms in terms of the zero pattern of J_γ and which we prove in the following:

Lemma 5 Let ‖·‖_γ be a monotonic norm on R^n. Then ‖·‖_γ is strongly monotonic if and only if J_γ(x) ∼ x for all x ∈ R^n_+.

Proof Suppose that ‖·‖_γ is strongly monotonic and let x ∈ R^n_+. If x = 0, then J_γ(0) = 0 by construction. Suppose that x ≠ 0; the strong monotonicity is then used to prove the existence of an element of J_γ(x) with the same zero pattern as x, so that J_γ(x) ∼ x. For the reverse implication, suppose that J_γ(x) ∼ x for all x ∈ R^n_+. Let x, y ∈ R^n_+ be such that x ≤ y and x ≠ y. If x = 0, then ‖x‖_γ = 0 < ‖y‖_γ. Suppose that x ≠ 0. As x ≤ y and x ≠ y, there exist i and t_0 > 0 such that x + t e_i ≤ y for all t ∈ (0, t_0). For t ∈ (t_0/2, t_0), we have

‖y‖_γ ≥ ‖x + t e_i‖_γ ≥ ‖x + (t_0/2) e_i‖_γ + (t − t_0/2) ⟨ J_γ(x + (t_0/2) e_i), e_i ⟩,

where the second inequality follows from the convexity of ‖·‖_γ. By assumption, we have J_γ(x + (t_0/2) e_i) ∼ x + (t_0/2) e_i and thus J_γ(x + (t_0/2) e_i)_i > 0. Since ‖x + (t_0/2) e_i‖_γ ≥ ‖x‖_γ by monotonicity, it follows that ‖y‖_γ > ‖x‖_γ, i.e. ‖·‖_γ is strongly monotonic.

Lemma 6
Let ‖·‖_α and ‖·‖_β satisfy Assumption 1. Let A be a matrix with nonnegative entries and suppose that A^T A is irreducible. Then S_A(x) is positive for every positive x, and every nonnegative critical point of f_A is positive.
Proof S_A maps positive vectors to positive vectors: since A^T A is irreducible, A^T Ax is positive for every positive x and, by Lemma 5, the duality maps J_α and J_{β*} preserve comparability and hence positivity. Next, note that A^T A is symmetric positive semi-definite and therefore all its eigenvalues are nonnegative. It follows that A^T A is primitive (see e.g. Theorem 1 in [40]). By the same theorem, there exists a positive integer k such that (A^T A)^k is a matrix with positive entries. Consequently, S_A^k(x) is strictly positive for every nonzero, nonnegative x. Finally, suppose that y ∈ R^n_+ is a critical point of f_A; then y is a fixed point of S_A by Lemma 2, and thus y = S_A^k(y) is strictly positive.
We are now ready to state the main theorem of this section. It provides conditions on A, ‖·‖_α and ‖·‖_β that ensure the existence of a unique positive maximizer x_+ such that ‖Ax_+‖_α / ‖x_+‖_β = ‖A‖_{β→α}, and that govern the convergence of the power sequence to x_+. As announced, this result is essentially a fixed point theorem for S_A, and thus the Birkhoff contraction ratio κ_H(S_A), together with any τ that well approximates κ_H(S_A) from above, plays a central role.

Theorem 4 Let A ∈ R^{m×n} be a matrix with nonnegative entries and suppose that A^T A is irreducible. Let ‖·‖_α and ‖·‖_β satisfy Assumption 1. If κ_H(S_A) ≤ τ < 1, then f_A has a unique positive critical point x_+ (up to scalar multiples), which satisfies f_A(x_+) = ‖A‖_{β→α}, and for every positive starting point x_0 the normalized power sequence x_{k+1} = S_A(x_k)/‖S_A(x_k)‖_β satisfies

d_H(x_k, x_+) ≤ τ^k d_H(x_0, x_+)  and  (1 − c τ^k) ‖A‖_{β→α} ≤ ‖Ax_k‖_α ≤ ‖A‖_{β→α},

for an explicit constant c > 0 depending on A, the norms and x_0. In particular, x_k → x_+ as k → ∞.
Proof Lemma 4 implies that f_A has a maximizer x_+ ∈ R^n_+. Lemma 6 implies that x_+ is positive and that the interior of R^n_+ is left invariant by S_A. Hence, all statements except the bounds on ‖Ax_k‖_α follow by a direct application of Lemma 3 and Eq. (4). We conclude with a proof of the estimates for ‖Ax_k‖_α. Clearly, ‖Ax_k‖_α ≤ ‖A‖_{β→α} always holds. The lower bound follows by combining the convergence rate in d_H with the norm comparison of Lemma 1, which concludes the proof.
Note that the condition that A^T A be irreducible is in general weaker than requiring the matrix A itself to be irreducible, as A^T A may be irreducible even when A is reducible. This is also observed in the numerical examples in Sect. 6.
Theorem 4 holds for any upper bound τ of κ_H(S_A), and a somewhat natural choice for such a τ is

τ(S_A) = κ_H(J_{β*}) κ_H(A^T) κ_H(J_α) κ_H(A),    (12)

which dominates κ_H(S_A) by the submultiplicativity of Lipschitz constants. This coefficient is particularly useful in practice since, thanks to the Birkhoff–Hopf theorem, in many circumstances one can provide explicit bounds for τ(S_A).

Examples and Comparison with Previous Work
When ‖·‖_α and ‖·‖_β are ℓ^p norms, Theorem 4 implies the following:

Corollary 1 Let A ∈ R^{m×n} be a matrix with nonnegative entries and suppose that A^T A is irreducible. Let 1 < p, q < ∞ and consider

τ = (p − 1)/(q − 1) · tanh( Δ(A)/4 ) · tanh( Δ(A^T)/4 ).

If τ < 1, then ‖A‖_{q→p} can be approximated to arbitrary precision with the fixed point iteration (2).
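As a numerical illustration of the corollary (under our reading of the bound τ above, which combines the factor (p − 1)/(q − 1) from the duality maps with one Birkhoff–Hopf factor per linear map), a nearly rank-one positive matrix can satisfy τ < 1 even when p > q; the example matrix below is our own:

```python
import numpy as np

def projective_diameter(A):
    # cross-ratio formula of Theorem 3, valid for entrywise positive A
    A = np.asarray(A, dtype=float)
    d = 0.0
    for i in range(A.shape[1]):
        for j in range(A.shape[1]):
            d = max(d, np.log((np.outer(A[:, i], A[:, j]) /
                               np.outer(A[:, j], A[:, i])).max()))
    return d

def tau(A, p, q):
    """Explicit upper bound on kappa_H(S_A) for l^p / l^q norms:
    (p-1)/(q-1) from the duality maps, tanh(Delta/4) per linear factor."""
    A = np.asarray(A, dtype=float)
    return ((p - 1.0) / (q - 1.0)
            * np.tanh(projective_diameter(A) / 4.0)
            * np.tanh(projective_diameter(A.T) / 4.0))

A = np.array([[1.0, 1.2], [1.1, 1.0]])   # positive and nearly rank-one
t = tau(A, p=4.0, q=2.0)                 # p > q, yet t < 1
```

The closer the positive matrix is to rank one, the smaller its projective diameter, and hence the larger the admissible gap between p and q.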
In the case of ℓ^p norms, both Theorem 1 and Corollary 1 apply. In order to compare them, let us compute the Birkhoff contraction ratio for some simple but explanatory cases. Let ε ≥ 0 and let A ∈ R^{3×2}, B ∈ R^{2×2}, C ∈ R^{3×3} be suitable example matrices depending on ε; due to Theorem 3, their Birkhoff contraction ratios are easily computed. If p ≤ q and ε > 0, then Theorem 1 implies that f_X has a unique positive maximizer x_+, which is global, and the power sequence (11) converges to x_+. However, if ε = 0, then Theorem 1 ensures that every positive critical point of f_B is a global maximizer, but uniqueness and convergence are only guaranteed under the assumption p < q. Now we look at the implications of Theorem 4. For instance, uniqueness and global maximality of a positive maximizer of f_A are guaranteed by Theorem 4 under the assumption 9(p − 1) < 400(q − 1), which includes the known global convergence range of values p < q but is a much weaker assumption. Next, note that for ε ≥ 1 we have τ(S_B) < 1 if and only if (ε − 1)^2 (p − 1) < (ε + 1)^2 (q − 1). This assumption is less restrictive than p ≤ q for every ε ≥ 1, as p ≤ q corresponds to the asymptotic case ε → ∞. If ε = 1, Theorem 4 applies for every 1 < p, q < ∞. The analysis for 0 < ε < 1 is similar. However, we note that if ε = 0, then Theorem 4 does not provide any information about f_B in the case p = q, in contrast with Theorem 1. When ε = 0 and p < q, both theorems imply the same result. Finally, note that τ(S_C) < 1 if and only if p < q, and so Theorem 1 is more useful here as it also covers the case p = q.
More generally, when the considered matrix A has finite projective diameter Δ(A), Theorem 2 implies that κ_H(A) < 1, and thus Theorem 4 ensures that for any p > 1 the matrix norm ‖A‖_{q→p} can be approximated in polynomial time to arbitrary precision for any choice of q > κ_H(A)^2 (p − 1) + 1, without the requirement q ≥ p. Figure 1 shows that the value of κ_H(A) for matrices with positive entries is often substantially smaller than one, enhancing the relevance of Theorem 4.

On the Sharpness of the New Convergence Condition
As we observed earlier, the key property behind the global convergence of the power iterates is the fact that, when κ_H(S_A) < 1, the mapping S_A has a unique positive fixed point x_+. By Lemma 2, this is equivalent to observing that, in this case, x_+ is the unique positive critical point of f_A, up to scalar multiples. In what follows we show that this is no longer the case if κ_H(S_A) > 1. In particular, we limit our attention to the case of ℓ^p norms and exhibit a one-parameter family of 2 × 2 positive and symmetric matrices A_ε for which a unique positive critical point of f_{A_ε} exists if and only if κ_H(S_{A_ε}) ≤ 1. Moreover, we show that for this family of matrices it holds τ(S_{A_ε}) = κ_H(S_{A_ε}), where τ(S_{A_ε}) is the estimate of κ_H(S_{A_ε}) discussed in Eq. (12). As f_A is scale invariant, here and in the rest of this section uniqueness of critical points is meant up to scalar multiples.
For ε > 0 and p, q ∈ (1, ∞), let A_ε ∈ R^{2×2} be the positive symmetric matrix of the family under consideration and let f_{A_ε} : R^2 → R_+ be the associated function f_{A_ε}(x) = ‖A_ε x‖_p / ‖x‖_q. The main result of this section is the following theorem, whose proof is postponed to the end of the section. This result shows that, unlike the previous Theorem 1, Theorem 4 is tight, in the sense that when κ_H(S_A) > 1 there might be multiple distinct fixed points of S_A in R^2_+, and thus convergence of the power sequence to a prescribed fixed point cannot be ensured globally without restrictions on the starting point x_0 ∈ R^2_+. We subdivide the proof of Theorem 5 into a number of preliminary results. Before proceeding, we recall that for p ∈ (1, ∞) the map Φ_p : R^n → R^n is defined entrywise by Φ_p(x)_i = |x_i|^{p−2} x_i for all i. We first compute τ(S_{A_ε}) and κ_H(S_{A_ε}).

Lemma 7 For every ε > 0, we have κ_H(S_{A_ε}) = τ(S_{A_ε}).

Proof By (12), κ_H(S_{A_ε}) ≤ τ(S_{A_ε}), so it suffices to exhibit pairs of comparable points whose contraction ratio approaches τ(S_{A_ε}). For t > 1, let x = (1, 1)^T, y(t) = (t, 1)^T and

h(t) = d_H( S_{A_ε}(x), S_{A_ε}(y(t)) ) / d_H( x, y(t) ).

Then we have h(t) ≤ κ_H(S_{A_ε}) for every t > 1. To conclude the proof, we show that lim_{t→1^+} h(t) = τ(S_{A_ε}). A direct computation shows that d_H(x, y(t)) = ln(t), and an explicit expression for d_H(S_{A_ε}(x), S_{A_ε}(y(t))) follows from the closed form of S_{A_ε}.
With the above computations, lim_{t→1^+} h(t) is a quotient of two quantities that both vanish at t = 1, by continuity. Hence L'Hôpital's rule applies and, after rearrangement, we finally obtain lim_{t→1^+} h(t) = τ(S_{A_ε}), which concludes the proof.

Now we prove that the nonnegative critical points of f_{A_ε} are positive, and we then characterize them in terms of a real parameter t. As critical points are defined up to scalar multiples, we restrict our attention to the line {x ∈ R^2 : x_1 + x_2 = 1}.

Lemma 8 Let x ∈ R^2_+ with x_1 + x_2 = 1. Then x is a critical point of f_{A_ε} if and only if there exists t ∈ (0, 1) such that x = (t, 1 − t)^T and ψ(t) = 0, where ψ is the scalar function defined in (14).
Proof As we already observed, f_{A_ε} attains a global maximum on R^2_+. Furthermore, the critical points of f_{A_ε} satisfy the optimality condition (15). As A_ε is positive, (15) implies that every nonnegative critical point of f_{A_ε} is positive. It follows that, for positive vectors x, (15) is equivalent to (16). Thus, x_1 + x_2 = 1 and x_1, x_2 > 0 imply the existence of t ∈ (0, 1) such that x_1 = t and x_2 = 1 − t; substituting into (16), we finally obtain the claimed result.
A direct consequence of Lemma 8 is that (1, 1)^T / 2 is a critical point of f_{A_ε}. Moreover, by symmetry, ψ(t) = −ψ(1 − t) for every t ∈ (0, 1).

Lemma 9 If τ(S_{A_ε}) > 1, then f_{A_ε} has more than one positive critical point.

Proof Let h(t) = ψ(t), where ψ is defined as in (14). The critical points of f_{A_ε} correspond to the zeros of h in (0, 1/2]. Indeed, by Lemma 8, these points are in bijection with the zeros of h on (0, 1), and h(t) = −h(1 − t) for every t ∈ (0, 1). We have already observed that h(t_0) = 0 with t_0 = 1/2. We now show that there exists t_1 ∈ (0, t_0) such that h(t_1) = 0. The existence of such a t_1 implies that (t_1, 1 − t_1)^T is a critical point of f_{A_ε} distinct from (1/2, 1/2)^T. To construct t_1, we first prove that our assumption τ(S_{A_ε}) > 1 is equivalent to the condition h′(t_0) > 0. Since h(t_0) = 0 and h′(t_0) > 0, there exists s ∈ (0, t_0) such that h(s) < 0. As lim_{t→0} h(t) = ε^{p−1} + ε > 0, the intermediate value theorem implies the existence of t_1 ∈ (0, s) such that h(t_1) = 0. As observed above, this concludes the proof.

Lemma 10
If τ (S A ε ) = 1, then f A ε has a unique nonnegative critical point.
Proof Let F : R^2_+ → R^2_+ be defined by F(x) = Φ_{q*}( A_ε^T Φ_p(A_ε x) ), where q* = q/(q − 1) denotes the Hölder conjugate of q. Then, for 1 = (1, 1)^T and u = 1/2 · 1, we have F(u) = λu for some λ > 0. Hence, u is a fixed point of S_{A_ε} and, as ‖·‖_q is differentiable, it follows from Lemma 2 that u is a critical point of f_{A_ε}. Moreover, u is a fixed point of the normalized map G : R^2_+ → R^2_+, G(x) = F(x) / (F(x)_1 + F(x)_2). Note that the fixed points of G coincide, up to scaling, with those of S_{A_ε}. To conclude, we prove that u is the unique fixed point of G.
As τ(S_{A_ε}) = 1, we have d_H(G(x), G(y)) ≤ d_H(x, y) for all comparable x, y, and so G is non-expansive with respect to d_H. Now, Theorem 6.4.1 in [31] implies that u is the unique fixed point of G provided that z − G′(u)z ≠ 0 for every nonzero z with z_1 + z_2 = 0, where G′(u) denotes the Jacobian matrix of G evaluated at u. Moreover, as F(u) = λu, Lemma 6.4.2 in [31] implies that F′(u)u = λu.
Suppose by contradiction that there exists z ∈ R^2 \ {0} with z_1 + z_2 = 0 such that z − G′(u)z = 0. Since A_ε and u are invariant under the coordinate swap, F′(u) preserves the subspace {z : z_1 + z_2 = 0}, so ⟨1, F′(u)z⟩ = 0 and z = G′(u)z implies F′(u)z = λz. As F′(u) is entrywise positive, the classical Perron–Frobenius theorem implies that z is a scalar multiple of u. However, u_1 + u_2 > 0, which contradicts the assumption z_1 + z_2 = 0. Hence z − G′(u)z ≠ 0 for every z ≠ 0 such that z_1 + z_2 = 0, and u is the unique fixed point of G, which concludes the proof.
Combining the last two lemmas allows us to conclude.

Proof of Theorem 5 By Lemmas 9 and 10, we only need to address the case τ(S_{A_ε}) < 1. This is a direct consequence of Lemma 3: as A_ε is entrywise positive, the nonnegative fixed points of S_{A_ε} are positive and, if τ(S_{A_ε}) < 1, then S_{A_ε} is a strict contraction with respect to d_H, so it has a unique fixed point, which is also the unique positive maximizer of f_{A_ε} on R^2_+.

Matrix Norms Induced by Sums of Weighted ℓ^p Norms
The Birkhoff contraction ratios κ_H(J_α) and κ_H(J_{α*}) are easy to compute when ‖·‖_α is a weighted ℓ^p norm. More precisely, we have the following:

Proposition 1 Let ‖x‖_α = ‖Dx‖_p for some p ∈ (1, ∞) and some diagonal matrix D with positive diagonal entries. Then ‖x‖_{α*} = ‖D^{−1}x‖_{p*}, where p* = p/(p − 1), and κ_H(J_α) = p − 1, κ_H(J_{α*}) = p* − 1 = 1/(p − 1).

Proof The equality ‖x‖_{α*} = ‖D^{−1}x‖_{p*} follows from Theorem 6 below. To conclude, note that J_α(x) = D Φ_p(Dx) / ‖Dx‖_p^{p−1}: on the positive orthant this is the entrywise power map x ↦ x^{p−1} composed with d_H-isometries, whence κ_H(J_α) = p − 1; the computation for J_{α*} is analogous.

While the above Proposition 1 makes the computation of the Birkhoff constant of weighted ℓ^p norms particularly easy, computing κ_H(J_α) or κ_H(J_{α*}) for a general strongly monotonic norm ‖·‖_α can be a difficult task. There are norms for which an explicit expression in terms of arithmetic operations is available for ‖·‖_α by construction (resp. modeling), but no such expression is available for the dual ‖·‖_{α*}. Examples include ‖x‖_α = (‖x‖_p^3 + ‖x‖_q^3)^{1/3}, as shown by Theorem 6 below. On the other hand, as discussed in the introduction, monotonic norms different from the standard ℓ^p norms arise quite naturally in several applications.
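The duality pair ‖x‖_α = ‖Dx‖_p, ‖y‖_{α*} = ‖D^{−1}y‖_{p*} can be sanity-checked numerically; the sketch below (our own, with an arbitrarily chosen D and y) verifies Hölder's inequality together with its attainment at the closed-form maximizer x* = D^{−1} Φ_{p*}(D^{−1} y):

```python
import numpy as np

p = 3.0
p_star = p / (p - 1.0)                   # Hoelder conjugate of p
D = np.diag([1.0, 2.0, 0.5])
D_inv = np.diag(1.0 / np.diag(D))

alpha = lambda x: np.linalg.norm(D @ x, p)         # ||x||_alpha = ||Dx||_p
alpha_star = lambda y: np.linalg.norm(D_inv @ y, p_star)

Phi = lambda z, r: np.sign(z) * np.abs(z) ** (r - 1.0)
y = np.array([1.0, -2.0, 3.0])
w = D_inv @ y
x_star = D_inv @ Phi(w, p_star)          # maximizer of <x, y> / ||x||_alpha
attained = abs(x_star @ y) / alpha(x_star)
```

By the substitution z = Dx, the supremum of ⟨x, y⟩/‖Dx‖_p equals the supremum of ⟨z, D^{−1}y⟩/‖z‖_p, which is exactly ‖D^{−1}y‖_{p*}.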
Motivated by the above observations, we devote the rest of the section to the study of a particular class of monotonic norms of the form ‖x‖_α = ‖(‖x‖_{α_1}, …, ‖x‖_{α_d})‖_γ, where all the involved norms are monotonic and where we also allow ‖x‖_{α_i} to measure only a subset of the coordinates of x.

Composition of Monotonic Norms and Its Dual
Let d be a positive integer. We consider norms of the following form, where ‖·‖_γ is a monotonic norm on R^d, ‖·‖_{α_i} is a norm on R^{n_i}, and P_i ∈ R^{n_i×n} is a "weight matrix" for all i = 1, …, d. For ‖·‖_α to be a norm, we assume that M = [P_1^T, …, P_d^T]^T ∈ R^{(n_1+…+n_d)×n} has rank n. Note that the monotonicity of ‖·‖_γ implies that ‖·‖_α satisfies the triangle inequality.
Let us first discuss particular cases of (17). First, note that for two norms ‖·‖_{α_1}, ‖·‖_{α_2} on R^n, the norm ‖·‖_{α_+} is obtained from (17) with d = 2, ‖·‖_γ = ‖·‖_p, and P_1 = P_2 = I, with I ∈ R^{n×n} being the identity matrix. It is also possible to model norms acting on different coordinates of the vectors. For example, if (x, y) ∈ R^{2n}, then the norm ‖·‖_{α_×} can be obtained from (17) with d = 2, ‖·‖_γ = ‖·‖_p, P_1 = diag(1, …, 1, 0, …, 0) ∈ R^{2n×2n} and P_2 = diag(0, …, 0, 1, …, 1) ∈ R^{2n×2n}. The dual of ‖·‖_{α_×} is discussed in Lemma 11 below and has a particularly elegant description. More complicated weight matrices P_i can also be used. For example, if ñ is an integer not smaller than n and P ∈ R^{ñ×n} has rank n, then the norm ‖·‖_{α_P} can be obtained with d = 1, ‖·‖_γ = |·|, ‖·‖_{α_1} = ‖·‖_p and P_1 = P. Note that if ñ = n, then P is square and invertible, and this property can be used to simplify the evaluation of the dual norm of ‖·‖_{α_P}. Consequences of such additional structure are discussed in Corollary 2.
In the next Theorem 6 we provide a characterization of the dual norm of ‖·‖_α in its general form as defined in (17). We first need the following lemma, which addresses the particular case where P_1, …, P_d are projections.

Lemma 11
Let n_1, …, n_d be positive integers and, for i = 1, …, d, let ‖·‖_{α_i} be a norm on R^{n_i}. Furthermore, let ‖·‖_γ be a monotonic norm on R^d. Let V = R^{n_1} × … × R^{n_d} and, for all (u_1, …, u_d) ∈ V, define ‖(u_1, …, u_d)‖_V = ‖(‖u_1‖_{α_1}, …, ‖u_d‖_{α_d})‖_γ. Then ‖·‖_V is a norm on V and the induced dual norm ‖·‖_{V*} satisfies ‖(u_1, …, u_d)‖_{V*} = ‖(‖u_1‖_{α*_1}, …, ‖u_d‖_{α*_d})‖_{γ*}.

Proof The fact that ‖·‖_V is a norm follows from a direct verification. Let (u_1, …, u_d) ∈ V. Then, for every (y_1, …, y_d) ∈ V, we have ⟨(y_1, …, y_d), (u_1, …, u_d)⟩ = Σ_i ⟨y_i, u_i⟩ ≤ Σ_i ‖y_i‖_{α_i} ‖u_i‖_{α*_i} ≤ ‖(‖y_1‖_{α_1}, …, ‖y_d‖_{α_d})‖_γ ‖(‖u_1‖_{α*_1}, …, ‖u_d‖_{α*_d})‖_{γ*}, which shows that ‖(u_1, …, u_d)‖_{V*} ≤ ‖(‖u_1‖_{α*_1}, …, ‖u_d‖_{α*_d})‖_{γ*}. For the reverse inequality, let v = (‖u_1‖_{α*_1}, …, ‖u_d‖_{α*_d}). As ‖·‖_γ is monotonic, by Proposition 5.2 in [7, Chapter 1] there exists w ∈ R^d_+ such that ‖w‖_γ ≤ 1 and ⟨v, w⟩ = ‖v‖_{γ*}. Let us denote by w_1, …, w_d ∈ R_+ and v_1, …, v_d ∈ R respectively the components of w and v in the canonical basis of R^d. Now, let y_1 ∈ R^{n_1}, …, y_d ∈ R^{n_d} be such that ‖y_i‖_{α_i} ≤ 1 and ⟨y_i, u_i⟩ = ‖u_i‖_{α*_i} for all i = 1, …, d. Then, as ‖·‖_γ is monotonic with respect to R^d_+ and ‖y_i‖_{α_i} ≤ 1 for all i, we have ‖(w_1 y_1, …, w_d y_d)‖_V ≤ ‖w‖_γ ≤ 1. Note that ⟨(w_1 y_1, …, w_d y_d), (u_1, …, u_d)⟩ = Σ_i w_i v_i = ‖v‖_{γ*}. It follows that ‖(u_1, …, u_d)‖_{V*} ≥ ‖v‖_{γ*}, which, together with (18), concludes the proof.

Theorem 6 Let d be a positive integer. For i = 1, …, d, let P_i ∈ R^{n_i×n} and let ‖·‖_{α_i} be a norm on R^{n_i}. Suppose that M = [P_1^T, …, P_d^T]^T ∈ R^{(n_1+…+n_d)×n} has rank n. Furthermore, let ‖·‖_γ be a monotonic norm on R^d and, for every x ∈ R^n, define ‖x‖_α as in (17). Then ‖·‖_α is a norm on R^n and the induced dual norm is given by ‖x‖_{α*} = inf { ‖(‖u_1‖_{α*_1}, …, ‖u_d‖_{α*_d})‖_{γ*} : P_1^T u_1 + … + P_d^T u_d = x }, where ‖·‖_{α*_i} is the dual norm induced by ‖·‖_{α_i} and ‖·‖_{γ*} is the dual norm induced by ‖·‖_γ.

Proof Let u_1 ∈ R^{n_1}, …, u_d ∈ R^{n_d} be such that P_1^T u_1 + … + P_d^T u_d = x. Such vectors always exist as M has full rank. Then, for every y ∈ R^n, it holds ⟨y, x⟩ = Σ_i ⟨P_i y, u_i⟩ ≤ ‖(P_1 y, …, P_d y)‖_V ‖(u_1, …, u_d)‖_{V*} = ‖y‖_α ‖(‖u_1‖_{α*_1}, …, ‖u_d‖_{α*_d})‖_{γ*}. It follows that ‖x‖_{α*} is bounded above by the claimed infimum. Now, we prove the reverse inequality. To this end, consider the vector space V = R^{n_1} × … × R^{n_d} endowed with the norm ‖·‖_V defined as in Lemma 11. As V is a finite product of finite-dimensional vector spaces, we can identify V* with V and, by Lemma 11, we know that the dual norm ‖·‖_{V*} induced by ‖·‖_V satisfies ‖(u_1, …, u_d)‖_{V*} = ‖(‖u_1‖_{α*_1}, …, ‖u_d‖_{α*_d})‖_{γ*}. Consider now the vector subspace W = {(P_1 y, …, P_d y) | y ∈ R^n} ⊂ V. Note that we can identify W with the image of M, i.e. W = {My | y ∈ R^n}. Let M† ∈ R^{n×(n_1+…+n_d)} be the Moore–Penrose inverse of M. Then, as M is full rank, we have M†My = y for all y ∈ R^n. Let φ : W → R be defined as φ(w) = ⟨x, M†w⟩. For every (u_1, …, u_d) ∈ W, there exists y ∈ R^n such that (u_1, …, u_d) = My, i.e. u_i = P_i y for all i = 1, …, d, and thus φ(u_1, …, u_d) = ⟨x, y⟩. By the Hahn–Banach theorem (see e.g. Corollary 1.2 of [5]), φ extends to a linear functional on all of V with the same norm; that is, there exists (u_1, …, u_d) ∈ V representing this extension and satisfying ‖(u_1, …, u_d)‖_{V*} ≤ ‖x‖_{α*}, where we used ‖My‖_V = ‖y‖_α. Next, let y ∈ R^n; then My = (P_1 y, …, P_d y) ∈ W and, with (19), we have ⟨P_1 y, u_1⟩ + … + ⟨P_d y, u_d⟩ = φ(My) = ⟨y, x⟩. As the above is true for all y ∈ R^n, it follows that P_1^T u_1 + … + P_d^T u_d = x. Hence, the claimed infimum is at most ‖(u_1, …, u_d)‖_{V*} ≤ ‖x‖_{α*}, which concludes the proof of the formula for ‖·‖_{α*}.
As a consequence of the above Theorem 6, the duals of the norms ‖·‖_{α_+}, ‖·‖_{α_×}, ‖·‖_{α_P} considered at the beginning of this section can be computed explicitly, with p* = p/(p − 1). Note that ‖·‖_{α*_×} does not involve an infimum. The infimum can also be removed in ‖x‖_{α*_P} if P is square and invertible, in which case it holds ‖x‖_{α*_P} = ‖P^{−T} x‖_{p*}. We discuss more general examples in the next result.
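As a quick numerical illustration of the α_× case (a sketch of ours, with hypothetical helper names), the dual of the block norm ‖(x, y)‖ = ‖(‖x‖_p, ‖y‖_q)‖_r is ‖(‖u‖_{p*}, ‖v‖_{q*})‖_{r*}, and the value is attained without any infimum:

```python
import numpy as np

def conj(p):
    """Hölder conjugate p* = p / (p - 1)."""
    return p / (p - 1)

def block_norm(x, y, p, q, r):
    """||(x, y)||_alpha = || (||x||_p, ||y||_q) ||_r on disjoint blocks."""
    return np.linalg.norm(np.array([np.linalg.norm(x, p),
                                    np.linalg.norm(y, q)]), r)

def block_dual(u, v, p, q, r):
    """Dual norm via Theorem 6: || (||u||_{p*}, ||v||_{q*}) ||_{r*}."""
    return np.linalg.norm(np.array([np.linalg.norm(u, conj(p)),
                                    np.linalg.norm(v, conj(q))]), conj(r))

def holder_maximizer(u, p):
    """z with ||z||_p = 1 attaining <z, u> = ||u||_{p*}."""
    ps = conj(p)
    z = np.sign(u) * np.abs(u) ** (ps - 1)
    return z / np.linalg.norm(z, p)

# Build a maximizer on the unit alpha-ball attaining the dual value:
# t maximizes the outer r-norm pairing, each block its own Hölder pairing.
u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
p, q, r = 3.0, 1.5, 2.0
t = holder_maximizer(np.array([np.linalg.norm(u, conj(p)),
                               np.linalg.norm(v, conj(q))]), r)
zx, zy = t[0] * holder_maximizer(u, p), t[1] * holder_maximizer(v, q)
```

The pair (zx, zy) has unit composed norm and pairs with (u, v) to exactly the dual value, which mirrors the two-layer (inner α_i, outer γ) structure of the proof of Lemma 11.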
Corollary 2 Under the same assumptions as Theorem 6, we have:
1. If P_1, …, P_d are all square invertible matrices, then the infimum in the expression for ‖·‖_{α*} can be removed.
2. If every x ∈ R^n can be uniquely written as x = x_{P_1} + … + x_{P_d} with x_{P_i} ∈ Im(P_i^T) for all i = 1, …, d (i.e. R^n is the direct sum of the ranges of P_1^T, …, P_d^T), then the dual norm decomposes accordingly. If, additionally, n_i = dim(Im(P_i^T)) for all i = 1, …, d, then the decomposition can be written in terms of (P_i^T)†, the Moore–Penrose inverse of P_i^T.

The Power Method for Compositions of ℓ^p Norms
We discuss here consequences of Theorems 4 and 6 when applied to a special family of norms defined in terms of subsets of entries of the initial vector, i.e. the case where each P_i is a nonnegative diagonal matrix. For a nonnegative weight vector ω ∈ R^m and a coefficient p ∈ (1, ∞), let ‖·‖_{ω,p} be the ω-weighted ℓ^p (semi)norm on R^m, defined as ‖x‖_{ω,p} = (Σ_{i=1}^m ω_i |x_i|^p)^{1/p}. To express the dual of ‖x‖_{ω,p} and compositions thereof, let ω* be the weight vector with entries ω*_i = ω_i^{−p*/p}, where p* = p/(p − 1). If ω is entrywise positive, then ‖·‖_{ω,p} is a norm and it holds (‖x‖_{ω,p})* = ‖x‖_{ω*,p*} by Proposition 1.
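Both identities can be checked directly. The sketch below (our own illustration; it assumes ω*_i = ω_i^{−p*/p}, which follows from Proposition 1 with D = diag(ω^{1/p}) since ‖x‖_{ω,p} = ‖Dx‖_p):

```python
import numpy as np

def weighted_p(x, w, p):
    """||x||_{w,p} = (sum_i w_i |x_i|^p)^(1/p)."""
    return float((w * np.abs(x) ** p).sum() ** (1.0 / p))

def dual_weight(w, p):
    """w*_i = w_i^(-p*/p) with p* = p/(p-1), for entrywise positive w."""
    ps = p / (p - 1)
    return w ** (-ps / p)

# With D = diag(w^(1/p)) the weighted norm is a rescaled l^p norm,
# and the dual weight reproduces ||D^{-1} x||_{p*}.
w = np.array([1.0, 4.0, 0.25])
x = np.array([2.0, -1.0, 3.0])
p = 2.5
ps = p / (p - 1)
D = np.diag(w ** (1.0 / p))
```

This reduces every ω-weighted ℓ^p computation to a diagonal rescaling, which is the mechanism behind Proposition 1.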

Lemma 12
Let ‖·‖_α be as in (22). Then ‖·‖_α is differentiable if either s > 1, or s = 1 and ω_i has at least two positive entries for every i = 1, …, d.
Proof As p_k > 1, ‖·‖_{ω_k,p_k} is differentiable if ω_k has at least two positive entries; if it has only one positive entry, then ‖·‖_{ω_k,p_k} is just a weighted absolute value. Hence, if s > 1, the differentiability of ‖·‖_α follows from that of the ℓ^s norm, while if s = 1 and ω_i has at least two positive entries for every i = 1, …, d, then ‖·‖_α is a sum of differentiable norms.
If ‖·‖_α is differentiable, then J_α admits an explicit expression, and the following lemma provides an upper bound for κ_H(J_α).
As a consequence, we have the following.

Theorem 7 Let A ∈ R^{m×n} be a nonnegative matrix such that A^T A is irreducible. Let ‖·‖_α and ‖·‖_β be as in (22) and (25), respectively, and let τ be the contraction bound obtained from Lemma 13. If τ < 1 and ‖·‖_α, ‖·‖_β are differentiable, then ‖A‖_{β*→α} can be approximated to precision ε in O(C(S_A) ln(1/ε)) arithmetic operations with the power sequence (11).
Proof Besides the complexity bound, the result is a direct consequence of Theorem 4 and the upper bounds for κ_H(J_α) and κ_H(J_β) obtained in Lemma 13. Let us now estimate the total number of operations required by the fixed point sequence (11). Let C be as in Theorem 4. We have Cτ^k < ε if and only if k > (ln(ε) − ln(C))/ln(τ). As (ln(ε) − ln(C))/ln(τ) ∈ O(ln(1/ε)) for ε → 0, we deduce that ‖A‖_{q→p} − ε ≤ ‖Ax_k‖_p after O(ln(1/ε)) iterations of S_A, leading to a total complexity of O(C(S_A) ln(1/ε)).
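For plain ℓ^p norms and a positive matrix, the power sequence takes a particularly simple form. The sketch below is our own Boyd-type implementation, not the paper's code (the function name is ours): it alternates multiplication by A, an entrywise power p − 1, multiplication by A^T, an entrywise power 1/(q − 1), and a normalization:

```python
import numpy as np

def matrix_norm_qp(A, q, p, tol=1e-12, max_iter=10000):
    """Approximate ||A||_{q->p} = max_{x != 0} ||Ax||_p / ||x||_q for an
    entrywise positive matrix A and p, q in (1, inf), via a nonlinear
    power iteration (a sketch of the power sequence for l^p norms)."""
    m, n = A.shape
    x = np.full(n, 1.0)                 # positive starting point
    x /= np.linalg.norm(x, q)
    norm_prev, norm_cur = 0.0, 0.0
    for _ in range(max_iter):
        y = A @ x                       # y > 0 since A > 0 and x > 0
        w = y ** (p - 1.0)              # gradient direction of ||.||_p^p
        u = A.T @ w
        x = u ** (1.0 / (q - 1.0))      # inverse duality map for ||.||_q
        x /= np.linalg.norm(x, q)       # stay on the unit q-sphere
        norm_cur = np.linalg.norm(A @ x, p)
        if abs(norm_cur - norm_prev) < tol:
            break
        norm_prev = norm_cur
    return norm_cur
```

For p = q = 2 this reduces to the classical power method on A^T A, so the returned value is the largest singular value; in general each iterate yields a valid lower bound ‖Ax_k‖_p / ‖x_k‖_q.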
We conclude the section by proving a number of corollaries of Theorem 7 that illustrate the richness of the class of problems that can be addressed via that theorem. For simplicity, in the statements we assume that the involved matrices are square and positive. However, more general statements involving irreducible and rectangular matrices can be easily derived by reproducing the proof of the corresponding corollary.

Corollary 3
Let A ∈ R^{n×n} be a positive matrix and let ω, μ ∈ R^n be positive weights. If τ < 1, then ‖A‖_{β→α} can be computed to precision ε in O(nnz(A) ln(1/ε)) operations.

Numerical Experiments
In this section we illustrate the numerical behaviour of the power sequence (11) on some example matrices and some choices of the norms ‖·‖_α and ‖·‖_β. In particular, we consider both a dense and a sparse matrix example.

Fig. 2 Comparison between the true error ‖x_k − x_+‖_∞ and the upper bound (τ_{p,q,ε})^k C in (26), against the number of iterations, for the power method (11) applied to ‖A_ε‖_{q→p} with ε = 1/3, q = 2 and five values of p ranging within the interval [1/15, 1/κ_H(A_ε)] · (q − 1) + 1, i.e. chosen so that τ_{p,q,ε} = κ_H(S_{A_ε}) < 1

Dense Matrix Example: Tightness of the Convergence Bound
We verify the convergence bound of Theorem 4 for the family of matrix norms analyzed in Sect. 4.2, that is, for the power sequence (11) applied to the computation of ‖A_ε‖_{q→p} for various 1 < p, q < ∞, with A_ε defined as in (13).
hence the power sequence converges to x_+ when τ_{p,q,ε} < 1 and, by Theorem 4, if τ_{p,q,ε} < 1, then the error bound (26) holds. We use δ = 10^{−12} and x_0 = (δ, 1 − δ)^T in our experiments. The choice of x_0 is motivated by the fact that it is far from the limit point x_+ in the Hilbert metric, so as to model a worst-case scenario. In Fig. 2, we plot the true error ‖x_k − x_+‖_∞ against the number of iterations k and compare it with the upper bound (τ_{p,q,ε})^k C, for the choice ε = 1/3, q = 2 and five increasing values of p chosen so that τ_{p,q,ε} = κ_H(S_{A_ε}) < 1. We observe that the method converges linearly, as expected, and that the upper bound captures the decay slope well. Moreover, even though larger values of p yield larger values of the contraction constant τ_{p,q,ε}, the upper bound still tracks the true error, up to a multiplicative constant, as p grows.

Sparse Matrix Example
For this experiment, we consider two families of matrices of growing size which are not irreducible but satisfy the requirement of Theorem 4, namely that A^T A is irreducible. More precisely, let A_1 ∈ R^{3×3} and B_1 ∈ R^{4×2} be given by the nonnegative matrices displayed above. Clearly, neither A_1 nor B_1 is irreducible; however, A_1^T A_1 and B_1^T B_1 have strictly positive entries and are therefore irreducible. Then, for s ≥ 2, we consider the matrices A_s ∈ R^{3^s×3^s}, B_s ∈ R^{4^s×2^s} obtained by taking s times the Kronecker product of A_1, resp. B_1, with itself, i.e. A_s = A_1 ⊗ A_{s−1} and B_s = B_1 ⊗ B_{s−1}. Note that A_{10} ∈ R^{3^{10}×3^{10}} with 3^{10} ≈ 6 · 10^4, and B_{10} ∈ R^{4^{10}×2^{10}} with 4^{10} ≈ 10^6. For all s ≥ 1, A_s has at least one full row of zero entries and therefore cannot be irreducible. On the other hand, it holds A_s^T A_s = (A_1^T A_1) ⊗ … ⊗ (A_1^T A_1), and thus A_s^T A_s has positive entries, since it is the Kronecker product of s positive matrices. The same observation holds for the sequence B_s. Furthermore, Theorem 3 implies that κ_H(A_s) = κ_H(B_s) = 1 for all s ≥ 1; hence, we have τ(S_{A_s}) = τ(S_{B_s}) = (p − 1)/(q − 1) for all s.

In our experiments we analyze the number of iterations until convergence of the power sequences associated to the computation of ‖A_s‖_{q→p} and ‖B_s‖_{q→p}, where p, q are fixed so that τ = τ(S_{A_s}) = τ(S_{B_s}) = 3/4. For each fixed p, q and s, we try 5000 different starting points drawn uniformly from (0, 1)^n with n = 3^s or n = 2^s. The boxplots in Fig. 3 show the number of iterations required until the stopping criterion is met, for both A_s (the two panels in the top row) and B_s (the two panels in the bottom row), for δ = 10^{−10}. Note that, due to Theorem 4, if (27) holds at step k, then we are guaranteed to be δ-close to the true solution x_+, i.e. the computed approximation x_k is such that ‖x_k − x_+‖_∞ < δ. Moreover, since ‖x‖_p ≤ n^{1/p} ‖x‖_∞ for all x ≠ 0, we also have ‖x_k − x_+‖_p ≤ n^{1/p} δ. While Fig. 3 shows the steps required to guarantee approximation of the true solution, we emphasize that in practice the number of steps needed to reach floating-point precision on two consecutive iterates is typically much smaller.
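The Kronecker construction used above is easy to reproduce. The sketch below uses a hypothetical 3 × 3 seed A_1 of our own choosing (the paper's exact A_1, B_1 are not reproduced here) that shares the key properties: a zero row, so that A_s is never irreducible, while A_s^T A_s = (A_1^T A_1) ⊗ … ⊗ (A_1^T A_1) is entrywise positive:

```python
import numpy as np

# Hypothetical seed: one zero row (so A1 is not irreducible),
# yet the Gram matrix A1^T A1 is entrywise positive.
A1 = np.array([[1.0, 1.0, 1.0],
               [1.0, 2.0, 1.0],
               [0.0, 0.0, 0.0]])

def kron_power(A, s):
    """s-fold Kronecker power A ⊗ A ⊗ ... ⊗ A, i.e. A_s = A_1 ⊗ A_{s-1}."""
    As = A
    for _ in range(s - 1):
        As = np.kron(A, As)
    return As

A2 = kron_power(A1, 2)   # A_2 ∈ R^{9×9}
```

The identity (A_1 ⊗ A_1)^T (A_1 ⊗ A_1) = (A_1^T A_1) ⊗ (A_1^T A_1) is exactly the mixed-product property of the Kronecker product that the argument above relies on.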

Conclusions
On top of being a classical problem in numerical analysis, computing the matrix norm ‖A‖_{β→α} is a problem that appears in many recent applications in data mining and optimization. However, except for a few choices of ‖·‖_α and ‖·‖_β, computing such a matrix norm to arbitrary precision is generally infeasible for large matrices, as this is known to be an NP-hard problem. The situation is different when the matrix has nonnegative entries, in which case ‖A‖_{q→p} is known to be computable for ℓ^p norms such that p ≤ q. In this paper we have both (a) refined this result, by showing that the condition p < q is not necessarily required, and (b) extended it to vector norms ‖·‖_α and ‖·‖_β much more general than ℓ^p norms. In particular, we have shown how to compute matrix norms induced by monotonic norms of the form ‖x‖_α = ‖(‖x‖_{α_1}, …, ‖x‖_{α_d})‖_γ, where we also allow ‖x‖_{α_i} to measure only a subset of the coordinates of x. Using these kinds of norms we can globally solve in polynomial time quite sophisticated nonconvex optimization problems, as we discuss in the example corollaries at the end of Sect. 5.
Funding Open access funding provided by Gran Sasso Science Institute -GSSI within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.