1 Introduction

Suppose that \({{{\mathcal {M}}}}\) is a statistical manifold [2, 3], i.e., \({{{\mathcal {M}}}}\) is given a dualistic structure \((g,\nabla , \nabla ^*)\) where \(g,\nabla \) and \(\nabla ^*\) are, respectively, a Riemannian metric and a pair of torsion-free affine connections satisfying

$$\begin{aligned} Xg(Y,Z)=g(\nabla _X Y, Z)+g(Y, \nabla ^*_X Z), \quad \forall X, Y, Z \in {\mathcal {X}}({{{\mathcal {M}}}}). \end{aligned}$$

Here, \({\mathcal {X}}({{{\mathcal {M}}}})\) denotes the set of all vector fields on \({{{\mathcal {M}}}}\). Recall that for a statistical manifold, one can introduce a one-parameter family of affine connections called the \(\alpha \)-connection:

$$\begin{aligned} \nabla ^{(\alpha )}=\frac{1+\alpha }{2}\nabla + \frac{1-\alpha }{2} \nabla ^*, \quad \alpha \in \textbf{R} \end{aligned}$$

and \((g,\nabla ^{(\alpha )}, \nabla ^{(-\alpha )})\) is also a dualistic structure on \({{{\mathcal {M}}}}\).

In a statistical manifold, we can naturally introduce a submanifold \({{{\mathcal {N}}}}\) that is simultaneously autoparallel with respect to (w.r.t., in short) both \(\nabla \) and \(\nabla ^*\) [31, 43].

Definition 1

Let \(({{{\mathcal {M}}}}, g, \nabla , \nabla ^*)\) be a statistical manifold and \({{{\mathcal {N}}}}\) be its submanifold. We call \({{{\mathcal {N}}}}\) doubly autoparallel (DA, in short) in \({{{\mathcal {M}}}}\) if it holds that

$$\begin{aligned} \nabla _X Y \in {\mathcal {X}}({{{\mathcal {N}}}}), \; \nabla ^*_X Y \in {\mathcal {X}}({{{\mathcal {N}}}}), \quad \forall X, Y \in {\mathcal {X}}({{{\mathcal {N}}}}). \end{aligned}$$

In other words, the second fundamental forms (Euler–Schouten embedding curvatures) w.r.t. \(\nabla \) and \(\nabla ^*\), respectively denoted by \(H_{{{\mathcal {N}}}}(X,Y)\) and \(H_{{{\mathcal {N}}}}^*(X,Y)\), simultaneously vanish for all \(X, Y \in {\mathcal {X}}({{{\mathcal {N}}}})\).

We immediately see that DA submanifolds \({{{\mathcal {N}}}}\) possess the following common properties. Note that statement 5) holds when \(({{{\mathcal {M}}}}, g, \nabla , \nabla ^*)\) is dually flat.

Proposition 1

The following statements are equivalent:

  1) a submanifold \({{{\mathcal {N}}}}\) is DA,

  2) a submanifold \({{{\mathcal {N}}}}\) is autoparallel w.r.t. \(\nabla ^{(\alpha )}\) for two distinct \(\alpha \),

  3) a submanifold \({{{\mathcal {N}}}}\) is autoparallel w.r.t. \(\nabla ^{(\alpha )}\) for all \(\alpha \),

  4) each \(\alpha \)-geodesic curve passing through an arbitrary point p in \({{{\mathcal {N}}}}\) that is tangent to \({{{\mathcal {N}}}}\) at p lies in \({{{\mathcal {N}}}}\) (i.e., \({{{\mathcal {N}}}}\) is \(\nabla ^{(\alpha )}\)-totally geodesic for all \(\alpha \)),

  5) a submanifold \({{{\mathcal {N}}}}\) is affinely constrained in both \(\nabla \)- and \(\nabla ^*\)-affine coordinates of \({{{\mathcal {M}}}}\).

In particular, let \({{{\mathcal {M}}}}\) be a parametric statistical model that is dually flat w.r.t. the Fisher metric g, the exponential connection \(\nabla =\nabla ^{({\textrm{e}})}\) and the mixture connection \(\nabla ^*=\nabla ^{({\textrm{m}})}\) [2, 3]. If \({{{\mathcal {N}}}}\) is DA, the \(\alpha \)-projection onto \({{{\mathcal {N}}}}\) is unique for each \(\alpha \). Further, \({{{\mathcal {N}}}}\) is simultaneously an exponential and a mixture family.

Since a DA submanifold is equipped with the above attractive characteristics, it can be expected to give useful insights into a wide range of applications, not only in information geometry but also in information and statistical science.

For example, suppose that an ambient manifold \({{{\mathcal {M}}}}\) is the set of positive definite real symmetric matrices, where the standard \(\nabla \)- and \(\nabla ^*\)-affine coordinates are, respectively, the entries of a matrix and those of its inverse [34]. Hence, by property 5) of Proposition 1, a DA submanifold \({{{\mathcal {N}}}}\) in \({{{\mathcal {M}}}}\) is a set of matrices which, together with their inverses, are simultaneously constrained affinely (see examples later). In studies of structured (patterned) covariance estimation [4, 5, 9, 37], the significance of such structure has already been recognized, independently of research on doubly autoparallelism. The reason for its significance there is that doubly autoparallelism of \({{{\mathcal {N}}}}\) implies the unique existence of the maximum likelihood estimate (MLE) solution on \({{{\mathcal {N}}}}\). For a different application, in [31] the convex optimization problem on \({{{\mathcal {M}}}}\) called a semidefinite program (SDP) [30] is proved to possess a closed-form solution if \({{{\mathcal {N}}}}\), the relative interior of the feasible region, is DA. As another aspect, when \({{{\mathcal {M}}}}\) is, more generally, a symmetric cone, a common and interesting property of various means on a DA submanifold \({{{\mathcal {N}}}}\) is proved [32].

In the case where \({{{\mathcal {M}}}}\) is (the relative interior of) the probability simplex, Nagaoka [29] has pointed out the importance of this notion in the study of statistical models. He has characterized models statistically equivalent to \({{{\mathcal {M}}}}\) by doubly autoparallelism. On the other hand, in [33] a classification, in addition to a different characterization, of DA submanifolds in \({{{\mathcal {M}}}}\) is derived in terms of the Hadamard algebra. Further, it should also be mentioned that Matúš and Ay have studied the corresponding statistical models based on research motivations from learning theory [6, 24]. As another aspect, it is interesting to note that a DA statistical model, i.e., one that is simultaneously e-autoparallel and m-autoparallel, is a mixture model on which the MLE uniquely exists.

In the existing literature mentioned above, the ambient statistical manifold \({{{\mathcal {M}}}}\) is a symmetric cone [12] or a closely associated manifold. A symmetric cone is regarded as a generalization of the set of positive definite matrices, and the probability simplex is an intersection of a hyperplane and the positive orthant \(\textbf{R}_+^n\), which is also a symmetric cone. In this case the structure of DA submanifolds can be studied algebraically in terms of Jordan algebras [16, 23, 31, 43] (for the relation between a symmetric cone and a Jordan algebra, refer to [12, 19]). To the best knowledge of the authors, purely geometrical arguments on doubly autoparallelism for the case of a general statistical manifold have not been given except in the recent literature [11, 36].

The first purpose of the paper is to show an algebraic characterization of DA submanifolds in a symmetric cone, which is stated as a corollary of a result for more general connected submanifolds in semi-simple Jordan algebras. The characterization is based on property 5) of Proposition 1. The result is not only tractable compared to the existing work [43], in the sense that we only need to check the obtained condition at an arbitrary single point of the submanifold, but also complete, in the sense that it is not restricted to subcones [16, 23, 43] of an ambient symmetric cone. We also demonstrate two applications: MLE of structured covariances and SDP. It is seen that doubly autoparallelism yields explicit representations of the solutions to these problems.

On the other hand, as the second purpose, we investigate the following integrals of quantities related to \(H_{{{\mathcal {N}}}}\) and \(H^*_{{{\mathcal {N}}}}\) along a smooth curve \(C=\{x(t) \,|\, t_1 \le t \le t_2 \}\) on a submanifold \({{{\mathcal {N}}}} \subset {{{\mathcal {M}}}}\):

$$\begin{aligned} \int _{t_1}^{t_2} \Vert H_{{{\mathcal {N}}}}(X, X) \Vert ^{1/2}_{x(t)} dt, \quad \int _{t_1}^{t_2} \Vert H^*_{{{\mathcal {N}}}}(X, X) \Vert ^{1/2}_{x(t)} dt, \quad X(t):=dx(t)/dt, \end{aligned}$$

where \(\Vert Y \Vert _x:=\sqrt{g_x(Y,Y)}\) for \(Y \in {{{\mathcal {X}}}}({{{\mathcal {M}}}})\). These integrals carry global information on how the tangent direction of the curve C deviates from \({{{\mathcal {N}}}}\) in the sense of \(\nabla \)- or \(\nabla ^*\)-parallelism. We call them curvature integrals. The introduction of such integrals is motivated by the analysis of computational effort in an interior-point algorithm for SDP in the case where the submanifolds concerned fail to be doubly autoparallel. We prove that the integrals, surprisingly, measure the iteration-complexity of a path-following method in convex optimization. In the extreme case where both \(H_{{{\mathcal {N}}}}\) and \(H^*_{{{\mathcal {N}}}}\) vanish everywhere (i.e., \({{{\mathcal {N}}}}\) is DA), the existence of the closed-form solution is recovered (cf. Corollary 3). For this second part, omitted proofs can be found in our unpublished technical report [35].

The paper is organized as follows: Sect. 2 describes one of the main results, which characterizes connected DA submanifolds in semi-simple Jordan algebras, including the case of symmetric cones. In particular, the result for associative algebras is also given. A typical and important example is \(\textbf{R}^n_+\), which involves the associative Hadamard algebra. In Sect. 3, we show examples and two applications of DA submanifolds in positive definite matrices. We see that the DA structure gives closed-form solutions in these applications. Section 4 explains preliminaries of the path-following interior-point method from the viewpoint of dually flat structure. In Sect. 5 we propose a new idea for a path-following algorithm using several concepts of dually flat structure. Section 6 gives the second main result, which relates the curvature integral to the iteration-complexity of the path-following algorithm. In Sect. 7 we apply the results obtained in Sect. 6 to analyze the complexity of the currently standard algorithm and show that the curvature integrals satisfy an interesting relation (cf. Theorem 5). Concluding remarks are given in Sect. 8.

2 Doubly autoparallel structure and Jordan algebra

2.1 Algebraic results

A commutative algebra \({{{\mathcal {A}}}}\) over \(\textbf{R}\) or \(\textbf{C}\) with a bilinear product \(*\) satisfying

$$\begin{aligned} x^2*(x*y)=x*(x^2*y), \quad \text{ where } x^{k+1}=x*x^{k}, \; (k=1,2,\dots ) \end{aligned}$$

for all \(x, y \in {{{\mathcal {A}}}}\) is called a Jordan algebra [12, 19]. Let \(({{{\mathcal {A}}}},*)\) be a finite dimensional real Jordan algebra with a unit element e. For \(x \in {{{\mathcal {A}}}}\), if \(x*y = e\) with \(y = a_0 e + a_1 x + \dots + a_n x^n\,\,\,(a_0, a_1, \dots , a_n \in \textbf{R})\) for a positive integer n, then y is called the inverse element of x and we write \(y = x^{-1}\). In this case we say that x is invertible. Note that \(x *y = e\) alone does not imply \(y = x^{-1}\) (see [12, p. 30]). Let \({{{\mathcal {I}}}}\) be the set of invertible elements of \(({{{\mathcal {A}}}}, *)\). A linear subspace \(V \subset {{{\mathcal {A}}}}\) is said to be a subalgebra of the Jordan algebra \(({{{\mathcal {A}}}}, *)\) if \(x * y \in V\) for any \(x, y \in V\), where we do not assume that \(e \in V\). For a set \({{{\mathcal {M}}}}\) we write \({{{\mathcal {M}}}}^{-1}:=\{x^{-1}\;|\; x \in {{{\mathcal {M}}}}\}\).

Lemma 1

Let \(({{{\mathcal {A}}}}_0, *)\) be a real Jordan algebra with a unit element e, let \({{{\mathcal {I}}}}\) denote its set of invertible elements, and let V be a subalgebra of \(({{{\mathcal {A}}}}_0, *)\). If \(x \in V\) and \(e + x \in {{{\mathcal {I}}}}\), then \((e+x)^{-1} \in e+V\).

Proof

Since \((e+x)^{-1}\) is expressed as a polynomial function of \(e+x\) by definition, it is also a polynomial of x, that is, \((e+x)^{-1} = a_0 e + a_1 x + \dots + a_n x^n\) with some \(a_0, a_1, \dots , a_n \in \textbf{R}\). Then \((e+x)^{-1} * x = a_0 x + a_1 x^2 + \dots + a_n x^{n+1} \in V\). On the other hand, we observe that

$$\begin{aligned} e - (e+x)^{-1} * x = e - (e+x)^{-1} * \{ (e+x) - e\} = e - e + (e+x)^{-1} = (e+x)^{-1}. \end{aligned}$$

Hence, it holds that \( (e+x)^{-1} = e - (e+x)^{-1} * x \in e + V. \) \(\square \)

For \(a \in {{{\mathcal {A}}}}\), we denote by L(a) the multiplication operator \({{{\mathcal {A}}}} \ni x \mapsto a *x \in {{{\mathcal {A}}}}\), and by P(a) the linear operator \(2\,L(a)^2 - L(a^2) \in \textrm{End}({{{\mathcal {A}}}})\), which is called the quadratic representation. If a is invertible, we have \(P(a^{-1}) = P(a)^{-1}\) and \(P(a)a^{-1} = a\). For \(a, b \in {{{\mathcal {A}}}}\), define \(P(a,b):= L(a) L(b) +L(b) L(a) - L(a*b)\), so that \(P(a)= P(a, a)\) and \(2 P(a,b) = P(a+b) - P(a) - P(b)\). Let us recall the following basic formulas (axioms of a quadratic Jordan algebra, [12, p. 40]):

$$\begin{aligned}&P(e) = {I}, \quad P(a, e) b= P(a, b)e, \end{aligned}$$
(1)
$$\begin{aligned}&P(P(a) b ) = P(a) P(b) P(a), \end{aligned}$$
(2)
$$\begin{aligned}&P(a) P(b,c)a = P(P(a)b, a)c \end{aligned}$$
(3)

for all \(a, b, c \in {{{\mathcal {A}}}}\). We note that (2) yields

$$\begin{aligned} P(P(a)b, P(a)c) = P(a) P(b, c) P(a). \end{aligned}$$
(4)

For each \(a \in {{{\mathcal {A}}}}\), we introduce a bilinear product \(\perp _a\) on \({{{\mathcal {A}}}}\) given by

$$\begin{aligned} x \perp _a y:= P(x,y) a \quad (x, y \in {{{\mathcal {A}}}}). \end{aligned}$$
(5)

It is known that \(({{{\mathcal {A}}}},\, \perp _a)\) forms a Jordan algebra, which is called the mutation of \(({{{\mathcal {A}}}}, *)\) [19]. If a is invertible, then \(a^{-1}\) is the unit element of \(({{{\mathcal {A}}}}, \perp _a)\). In this case, we denote by \({\textrm{Inv}}_{\perp _a}(x)\) the inverse element of \(x \in {{{\mathcal {A}}}}\) in the Jordan algebra \(({{{\mathcal {A}}}},\, \perp _a)\).
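For concreteness, the following minimal numerical sketch (ours; it anticipates the Jordan algebra \(({\textrm{Sym}}(n),*)\) with \(X*Y=(XY+YX)/2\) treated formally in Sect. 3.1) verifies the definition \(P(a)=2L(a)^2-L(a^2)\) and the mutation product (5):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
def sym(Y):                      # random elements of Sym(n)
    return (Y + Y.T) / 2
a, x, y = (sym(rng.standard_normal((n, n))) for _ in range(3))

def L(a):                        # multiplication operator L(a)x = a*x = (ax+xa)/2
    return lambda x: (a @ x + x @ a) / 2

def P(a):                        # quadratic representation P(a) = 2 L(a)^2 - L(a^2)
    return lambda x: 2 * L(a)(L(a)(x)) - L(a @ a)(x)

# On Sym(n), P(a) acts as x -> a x a.
assert np.allclose(P(a)(x), a @ x @ a)

# Mutation product (5): x ⟂_a y = P(x,y)a with 2 P(x,y) = P(x+y) - P(x) - P(y).
mut = (P(x + y)(a) - P(x)(a) - P(y)(a)) / 2
assert np.allclose(mut, (x @ a @ y + y @ a @ x) / 2)
```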

Lemma 2

Let \(a \in {{{\mathcal {I}}}}\). If \(x \in {{{\mathcal {A}}}}\) is invertible in \(({{{\mathcal {A}}}},\, *)\), then x is invertible in \(({{{\mathcal {A}}}},\, \perp _a)\), and one has

$$\begin{aligned} {\textrm{Inv}}_{\perp _a}(x) = P(a^{-1}) x^{-1}. \end{aligned}$$
(6)

Proof

Let \(({{{\mathcal {A}}}}_\textbf{C},\, *)\) be the complexification of \(({{{\mathcal {A}}}}, *)\). Noting that x is invertible if and only if the unit element e belongs to the linear space spanned by \(x^k\,\,\,(k=1,2, \ldots )\), we see that invertibility in \(({{{\mathcal {A}}}}, *)\) and in \(({{{\mathcal {A}}}}_\textbf{C}, *)\) coincide. Since a is invertible, there exists an invertible \(b \in {{{\mathcal {A}}}}_\textbf{C}\) for which \(a = b^2\) ([12, Proposition VIII.3.4]). Then, for \(x, y \in {{{\mathcal {A}}}}_\textbf{C}\), we observe that

$$\begin{aligned} P(b)(x \perp _a y)&= P(b)P(x,y)a = P(b)P(x,y) P(b) e = P(P(b)x, P(b)y)e\\&= (P(b)x) * (P(b)y), \end{aligned}$$

where we have used (4) at the third equality. Thus P(b) gives a Jordan algebra isomorphism from \(({{{\mathcal {A}}}}_\textbf{C},\, \perp _a)\) onto \(({{{\mathcal {A}}}}_\textbf{C},\, *)\). In particular, an element \(y \in {{{\mathcal {A}}}}_\textbf{C}\) is the inverse element of x in \(({{{\mathcal {A}}}}_\textbf{C},\, \perp _a)\) if and only if P(b)y is the inverse element of P(b)x in \(({{{\mathcal {A}}}}_\textbf{C},\, *)\). Since \((P(b)x)^{-1}= P(b)^{-1} x^{-1}\) by [12, Proposition 3.3 (ii)], we obtain \(P(b)y = P(b)^{-1} x^{-1}\), so that

$$\begin{aligned} y = P(b)^{-1} P(b)^{-1} x^{-1} = P(a)^{-1} x^{-1} = P(a^{-1}) x^{-1}, \end{aligned}$$

which in fact belongs to \({{{\mathcal {A}}}}\). \(\square \)

For \(u \in {{{\mathcal {A}}}}\), we write \(D_u\) for the directional derivative along u. Namely, for a vector-valued function f defined on a neighborhood of \(x \in {{{\mathcal {A}}}}\), we have \(D_u f(x):= (\frac{d}{dt})_{t=0} f(x + t u)\). For instance, if \(x \in {{{\mathcal {A}}}}\) is invertible, the derivative of an \({{{\mathcal {A}}}}\)-valued function \(x \mapsto x^{-1}\) is given by

$$\begin{aligned} D_u x^{-1} = -P(x)^{-1}u \quad (u \in {{{\mathcal {A}}}}). \end{aligned}$$
(7)

Lemma 3

If \(x \in {{{\mathcal {I}}}}\), one has

$$\begin{aligned} D_u D_v x^{-1} = 2 P(x^{-1}) (u \perp _{x^{-1}} v) \qquad (u,v \in {{{\mathcal {A}}}}). \end{aligned}$$

Proof

Taking a derivative of (7), we have

$$\begin{aligned} D_v D_u x^{-1}&= -\left( \frac{d}{dt}\right) _{t=0} P(x+ tv)^{-1} u\\&= \lim _{{t \rightarrow 0}} \frac{1}{t} P(x+tv)^{-1}\{ P(x+ tv) - P(x) \} P(x)^{-1} u\\&= P(x)^{-1} \{ D_v P(x) \} P(x)^{-1} u\\&= 2P(x^{-1}) P(v, x) P(x^{-1}) u. \end{aligned}$$

By (4), the last term equals \(2P(P(x^{-1})v,\, P(x^{-1}) x) u = 2P(P(x^{-1})v,\, x^{-1}) u\), which is equal to \(2 P(x^{-1}) P(v, u) x^{-1}\) by (3). Therefore, by definition of the mutation, we obtain \( D_u D_v x^{-1} = 2 P(x^{-1}) P(u, v) x^{-1} = 2 P(x^{-1}) (u \perp _{x^{-1}} v). \) \(\square \)
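As a sanity check, Lemma 3 can be verified numerically on \({\textrm{Sym}}(n)\), where \(P(x^{-1})w=x^{-1}wx^{-1}\) and \(u \perp _{x^{-1}} v=(ux^{-1}v+vx^{-1}u)/2\); a minimal finite-difference sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
def sym(Y): return (Y + Y.T) / 2
A = rng.standard_normal((n, n))
X = A @ A.T + np.eye(n)                      # an invertible (PD) point
U, V = sym(rng.standard_normal((n, n))), sym(rng.standard_normal((n, n)))
inv, eps = np.linalg.inv, 1e-5

# Mixed second-order finite difference of x -> x^{-1} in directions U, V.
lhs = (inv(X + eps*U + eps*V) - inv(X + eps*U - eps*V)
       - inv(X - eps*U + eps*V) + inv(X - eps*U - eps*V)) / (4 * eps**2)

# Lemma 3: D_u D_v x^{-1} = 2 P(x^{-1}) (u ⟂_{x^{-1}} v).
Xi = inv(X)
rhs = 2 * Xi @ ((U @ Xi @ V + V @ Xi @ U) / 2) @ Xi
assert np.allclose(lhs, rhs, atol=1e-4)
```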

Now we state our main result in this subsection.

Theorem 1

Let \(a \in {{{\mathcal {I}}}}\), and W and \(W'\) be linear subspaces of \({{{\mathcal {A}}}}\) of the same dimension. Let \(U \subset W\) be a neighborhood of \(0 \in W\) such that \(a + U \subset {{{\mathcal {I}}}}\). Then

$$\begin{aligned} (a + U)^{-1} \subset a^{-1} + W' \end{aligned}$$
(8)

if and only if the following two conditions hold:

$$\begin{aligned} {\mathrm{(i)}} \; W'=P(a)^{-1}W, \quad {\mathrm{(ii)}} \; W \text{ is a subalgebra of } ({{{\mathcal {A}}}},\, \perp _{a^{-1}}). \end{aligned}$$

Proof

First we show the ‘only if’ part. We assume that (8) holds. Take any \(u \in W\). Then for \(t \in \textbf{R}\) with sufficiently small |t|, we have \(a + tu \in a + U\), so that \((a+ tu)^{-1} \in a^{-1} + W'\). Furthermore, if \(t \ne 0\), then \((1/t) \{ (a+ tu)^{-1} - a^{-1} \} \in W'\). Thus, by (7), we have \(D_u a^{-1} = - P(a^{-1}) u \in W'\), which means that \(P(a^{-1}) W \subset W'\). Since \(\dim W' =\dim W\), we have the condition (i). Similarly, we have \(D_v D_u a^{-1} \in W'\) for \(v \in W\), and Lemma 3 leads us to \(2 P(a^{-1}) (u \perp _{a^{-1}} v) \in W'\). Therefore \(u \perp _{a^{-1}} v \in W\) by (i), which implies the condition (ii).

Next we show the ‘if’ part. Assume that W is a subalgebra of \(({{{\mathcal {A}}}},\, \perp _{a^{-1}})\). Let \(x \in U\). Note that \(({{{\mathcal {A}}}},\, \perp _{a^{-1}})\) is a real Jordan algebra with unit element a, and that \(a+x \in {\mathcal {I}}\) is invertible in this Jordan algebra by Lemma 2. Thus, applying Lemma 1 to \(({{{\mathcal {A}}}},\, \perp _{a^{-1}})\), we see that \({\textrm{Inv}}_{\perp _{a^{-1}}}(a+x)\) belongs to \(a + W\). Then Lemma 2 tells us that

$$\begin{aligned} (a+x)^{-1} = P(a)^{-1} {\textrm{Inv}}_{\perp _{a^{-1}}}(a+x) \in P(a)^{-1} (a+W) = a^{-1} + W'. \end{aligned}$$

Therefore we have \((a+U)^{-1} \subset a^{-1} + W'\). \(\square \)

Corollary 1

Let \(a \in {{{\mathcal {I}}}}\) and \(W, W'\) be linear subspaces of \({{{\mathcal {A}}}}\). Then \(\{(a + W) \cap {{{\mathcal {I}}}} \}^{-1} = (a^{-1} + W') \cap {{{\mathcal {I}}}}\) if and only if the conditions (i) and (ii) in Theorem 1 hold.

Let \({{{\mathcal {R}}}}\) be a real associative algebra with a unit element e. If we define \(x * y:= (x \cdot y + y \cdot x)/2\) for \(x, y \in {{{\mathcal {R}}}}\), we have a Jordan algebra \(({{{\mathcal {R}}}}, *)\) with e being a unit element. We see that the inverse element in the associative algebra \(({{{\mathcal {R}}}},\, \cdot )\) coincides with the one in the Jordan algebra \(({{{\mathcal {R}}}}, *)\). Therefore Theorem 1 yields the following.

Theorem 2

Let \(a \in {{{\mathcal {R}}}}\) be an invertible element, and W and \(W'\) be linear subspaces of \({{{\mathcal {R}}}}\) of the same dimension. Let \(U \subset W\) be a neighborhood of \(0 \in W\) such that \(a + x\) is invertible for all \(x \in U\). Then \((a + U)^{-1} \subset a^{-1} + W'\) if and only if the following two conditions hold:

$$\begin{aligned} {\mathrm{(i)}} \; W'=a^{-1} \cdot W \cdot a^{-1}, \quad {\mathrm{(ii)}} \; (x \cdot a^{-1} \cdot y + y\cdot a^{-1}\cdot x)/2 \in W \text{ for all } x,y \in W. \end{aligned}$$

2.2 Dually flat structure on semi-simple Jordan algebras

It is known [12, Prop. II.2.1] that there exist polynomial functions \(a_i(x), \; i=1,\dots ,r\) and a positive integer r depending on \(({{{\mathcal {A}}}},*)\) such that

$$\begin{aligned} f(\lambda )=\lambda ^r-a_1(x)\lambda ^{r-1}+a_2(x) \lambda ^{r-2}+ \cdots +(-1)^r a_r(x) \end{aligned}$$

is the minimal polynomial of every element x of a certain open and dense subset of \({{{\mathcal {A}}}}\). We denote \({{{\textrm{tr}}}}(x):=a_1(x)\) and \(\det (x):=a_r(x)\).

In what follows, we assume that the Jordan algebra \(({{{\mathcal {A}}}}, *)\) is semi-simple [12, 19], that is, the bilinear form \(\langle x,\, y \rangle := \textrm{tr}\,(x*y)\,\,\,(x,y \in {{{\mathcal {A}}}})\) is non-degenerate on \({{{\mathcal {A}}}}\). It is known that \({{{\mathcal {A}}}}\) is semi-simple if and only if \({{{\mathcal {A}}}}\) is isomorphic to a direct sum of simple Jordan algebras.

Recalling that \({{{\mathcal {I}}}} =\{x \in {{{\mathcal {A}}}}| \det (x) \not = 0 \}\), we introduce a smooth function \(\psi : {{{\mathcal {I}}}} \rightarrow \textbf{R}\) by \(\psi (x):= - \log |\det (x)|\,\,\,(x \in {{{\mathcal {I}}}})\).

Lemma 4

For \(x \in {{{\mathcal {I}}}}\) and \(u,\, v,\, w \in {{{\mathcal {A}}}}\), one has

  (i) \(D_w \psi (x) = -\langle x^{-1},\, w\rangle \),

  (ii) \(D_u D_w \psi (x) = \langle P(x^{-1})u,\, w \rangle \),

  (iii) \(D_u D_v D_w \psi (x) = -2 \langle P(x^{-1})(u \perp _{x^{-1}} v),\, w \rangle \).

Proof

The assertion (ii) follows from (i) and (7). The assertion (iii) is deduced from (ii) and Lemma 3. \(\square \)

In view of Lemma 4 (ii), we define a pseudo-Riemannian metric h on \({{{\mathcal {I}}}}\) by \(h_x(u,v):= D_u D_v \psi (x)\) \((x \in {{{\mathcal {I}}}},\, u, v \in {{{\mathcal {A}}}})\). The signature of \(h_x\) is constant on each component of \({{{\mathcal {I}}}}\). By the general theory of Hessian geometry [38], the Levi-Civita connection \(\nabla ^{(0)}\) on \(({{{\mathcal {I}}}},\, h)\) is described as \(h_x(\nabla ^{(0)}_{{\textbf{u}}} \textbf{v},\, \textbf{w}) = (1/2) D_u D_v D_w \psi (x)\) for \(x \in {{{\mathcal {I}}}}\), where \({\textbf{u}}, {\textbf{v}},\, {\textbf{w}}\) are constant vector fields on \({{{\mathcal {I}}}}\) corresponding to \(u, v, w \in {{{\mathcal {A}}}}\). Then we see from Lemma 4 (ii) and (iii) that

$$\begin{aligned} (\nabla ^{(0)}_{{\textbf{u}}} {\textbf{v}})_x = - (u \perp _{x^{-1}} v). \end{aligned}$$

More generally, for a real number \(\alpha \), we define the \(\alpha \)-connection \(\nabla ^{(\alpha )}\) on \({{{\mathcal {I}}}}\) in such a way that \(h_x(\nabla ^{(\alpha )}_{{\textbf{u}}} \textbf{v},\, \textbf{w}) = \displaystyle \frac{1-\alpha }{2} D_u D_v D_w \psi (x)\) for \(x \in {{{\mathcal {I}}}}\), so that

$$\begin{aligned} (\nabla ^{(\alpha )}_{{\textbf{u}}} {\textbf{v}})_x = (\alpha - 1) (u \perp _{x^{-1}} v). \end{aligned}$$
(9)

Then \(\nabla ^{(\alpha )}\) and \(\nabla ^{(-\alpha )}\) are mutually dual w.r.t. the pseudo-metric h. In particular, \(\nabla ^{(1)}=D\), which is the canonical flat affine connection on \({{{\mathcal {A}}}}\), and the dual connection \(\nabla ^{{(-1)}}\) is flat, too. Indeed, \(\nabla ^{(-1)}=D'\), which is the pullback \(D':=\iota ^*D\) of D via the gradient map \(\iota : {{{\mathcal {I}}}} \ni x \mapsto -{\textrm{grad}}\, \psi (x) = x^{-1} \in {{{\mathcal {I}}}}\) in view of Lemma 4 (i). In this way, we have a dually flat structure \((h,\, D,\, D')\) on \({{{\mathcal {I}}}}\). Note that x and \(x^{-1}\) are, respectively, D- and \(D'\)-affine coordinates.

2.3 Doubly autoparallelism on semi-simple Jordan algebras and symmetric cones

Let us consider a doubly autoparallel submanifold \({{{\mathcal {M}}}}\) of \({{{\mathcal {I}}}}\) and a point \(a \in {{{\mathcal {M}}}}\). Since D- and \(D'\)-affine coordinates are, respectively, (entries of) x and \(x^{-1}=\iota (x)\) in \({{{\mathcal {I}}}}\), we see that a connected submanifold \({{{\mathcal {M}}}} \subset {{{\mathcal {I}}}}\) is D-autoparallel if and only if there exist a linear subspace \(W \subset {{{\mathcal {A}}}}\) and a neighbourhood \(U \subset W\) of \(0 \in W\) such that \({{{\mathcal {M}}}} = a+ U\). Hence, by statement 5) in Proposition 1, we derive the condition that \({{{\mathcal {M}}}}^{-1}\) is affinely constrained in the \(D'\)-affine coordinates.

Theorem 3

A connected submanifold \({{{\mathcal {M}}}} = a+U \subset {{{\mathcal {I}}}}\) with \(0 \in U \subset W\) is doubly autoparallel if and only if

$$\begin{aligned} u \perp _{a^{-1}} v \in W \qquad (u,\,v \in W), \end{aligned}$$
(10)

i.e., W is a subalgebra of \(({{{\mathcal {A}}}},\perp _{a^{-1}})\). When \({{{\mathcal {M}}}}\) is doubly autoparallel, it holds that \({{{\mathcal {M}}}} \subset (a^{-1}+W' )^{-1}\) with \(W'=P(a)^{-1}W\).

Proof

The ‘only if’ part is clear by Definition 1 of doubly autoparallelism and (9), i.e., \((D'_\textbf{u} \textbf{v})_a = -2 (u \perp _{a^{-1}} v) \in W\). To show the ‘if’ part, we assume that W is a subalgebra of \(({{{\mathcal {A}}}},\perp _{a^{-1}})\). Then Theorem 1 tells us that the image \(\iota ({{{\mathcal {M}}}})\) of the gradient map \(\iota (x) = x^{-1}\) is an open subset of \(a^{-1} + W'\), which is D-autoparallel. Therefore, \({{{\mathcal {M}}}}\) is autoparallel w.r.t. the pull-back \(\iota ^* D = D'\). \(\square \)

Remark 1

Theorem 3 claims that, if a connected manifold \({{{\mathcal {M}}}} \subset {{{\mathcal {I}}}}\) is D-autoparallel and (10) holds at some point \(a \in {{{\mathcal {M}}}}\), then (10) holds at all points of \({{{\mathcal {M}}}}\). The same holds for Corollary 2 below. This tractability improves the characterization obtained in [43, Lem 4.2], which is hard to check in practice.

In particular, assume now that the Jordan algebra \(({{{\mathcal {A}}}}, *)\) is Euclidean, that is, the bilinear form \(\langle x,\, y \rangle = {\textrm{tr}}\,(x*y),\,\,\,(x,y \in {{{\mathcal {A}}}})\) is positive definite on \({{{\mathcal {A}}}}\). Then the connected component \(\varOmega \) of \({{{\mathcal {I}}}}\) containing e is a symmetric cone [12]. Indeed, the pseudo-metric \(h_x\) on \({{{\mathcal {I}}}}\) gives a Riemannian metric on \(\varOmega \), which makes \(\varOmega \) a Riemannian symmetric space. On the other hand, it is known that for \(a \in \varOmega \), there exists a unique \(b \in \varOmega \) for which \(a = b^2\). As already seen in the proof of Lemma 2, the linear map P(b) gives a Jordan algebra isomorphism from \(({{{\mathcal {A}}}}, *)\) to \(({{{\mathcal {A}}}}, \perp _{a^{-1}})\). Thus, the following characterization for the case of a symmetric cone \(\varOmega \) is a straightforward consequence of Theorems 1 and 3.

Corollary 2

Let W be a linear subspace in a Euclidean Jordan algebra \(({{{\mathcal {A}}}},*)\) and \(a \in \varOmega \). A submanifold \({{{\mathcal {M}}}}=(a+W) \cap \varOmega \) is doubly autoparallel if and only if \(P(b)^{-1}W\) is a subalgebra of \(({{{\mathcal {A}}}},*)\) for \(b^2=a\). When \({{{\mathcal {M}}}}\) is doubly autoparallel it holds that \({{{\mathcal {M}}}}=\{(a^{-1}+W')\cap \varOmega \}^{-1}\) with \(W'=P(a)^{-1}W\).

Therefore, in view of Corollary 2, we conclude that a classification of doubly autoparallel submanifolds of \(\varOmega \) is reduced to a classification of subalgebras of a Euclidean Jordan algebra \(({{{\mathcal {A}}}}, *)\).

3 Applications of doubly autoparallel structure on real symmetric positive definite matrices

3.1 Doubly autoparallel structure on positive definite matrices

Consider the set of n by n real symmetric positive definite matrices denoted by PD(n), which is a typical and familiar example of symmetric cones. The ambient Euclidean Jordan algebra \({{{\mathcal {A}}}}\) of PD(n) is Sym(n), the set of real symmetric matrices, which is equipped with the product \(*\) defined by

$$\begin{aligned} X*Y=\frac{XY+YX}{2}, \quad X,Y \in {\textrm{Sym}}(n). \end{aligned}$$

The unit element and the inverse element of X in \(({\textrm{Sym}}(n),*)\) are, respectively, the identity matrix I and the usual matrix inverse \(X^{-1}\). The inner product is \(\langle X, Y \rangle ={{{\textrm{tr}}}}(XY)\).

The standard dually flat structure \((g,D,D')\) on PD(n) [34] is derived from the potential function \(\psi (P):=-\log \det (P)\) for \(P \in {\textrm{PD}}(n)\). Take a basis \(\{E_i\}_{i=1}^N\) of Sym(n) with \(N:=n(n+1)/2\). When \({\textrm{Sym}}(n) \ni X=\sum _{i=1}^N x^iE_i\), we can regard \((x^i)\) as a D-affine coordinate system for the canonical flat connection D on Sym(n), i.e., \(D_{\partial /\partial x^i} \partial /\partial x^j=0\). Defining an operator \({\textrm{grad}}\) by \(\langle {\textrm{grad}}f(X), Y \rangle =D_Y f(X)\) for a smooth function f, we let \(\iota \) denote the gradient map \(\iota (P):=-{\textrm{grad}}\psi (P)=P^{-1}\). As in Sect. 2, we let \(D'\) be the pullback \(\iota ^* D\). Then \((g=Dd\psi ,D,D')\) is a dually flat structure on PD(n). Since \(\iota (P)=P^{-1}\) for \(P \in {\textrm{PD}}(n)\), by taking another basis \(\{F^i\}_{i=1}^N\) of Sym(n) and expressing \(P^{-1}=\sum _{i=1}^N s_iF^i\), we see that \((s_i)\) is a \(D'\)-affine coordinate system.
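As a quick numerical illustration (a sketch of ours, not from the paper), one can confirm \(\iota (P)=-{\textrm{grad}}\,\psi (P)=P^{-1}\) by comparing a finite-difference directional derivative of \(\psi =-\log \det \) with the pairing \(\langle P^{-1}, U\rangle ={{{\textrm{tr}}}}(P^{-1}U)\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
P = A @ A.T + np.eye(n)                            # a point of PD(n)
U = rng.standard_normal((n, n)); U = (U + U.T) / 2  # a direction in Sym(n)

psi = lambda X: -np.linalg.slogdet(X)[1]           # psi(P) = -log det(P)
eps = 1e-6
D_U_psi = (psi(P + eps * U) - psi(P - eps * U)) / (2 * eps)

# grad is defined by <grad f(X), Y> = D_Y f(X) with <X, Y> = tr(XY),
# so iota(P) = -grad psi(P) should equal P^{-1}.
assert np.isclose(D_U_psi, np.trace(-np.linalg.inv(P) @ U), atol=1e-5)
```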

For the dually flat structure \((g,D,D')\) on PD(n), the property 5) of Proposition 1 and Corollary 2 imply that the following three statements are equivalent:

  • A submanifold \({{{\mathcal {M}}}} \subset {\textrm{PD}}(n)\) is doubly autoparallel

  • There exist matrices \(E_0\) and \(F^0\), and two sets of linearly independent matrices \(\{E_i\}_{i=1}^m\) and \(\{F^i\}_{i=1}^m\) in Sym(n) for \(m <N\) such that \({{{\mathcal {M}}}}\) is simultaneously represented by

    $$\begin{aligned} {{{\mathcal {M}}}} = \left\{ P \,\left| \, P=E_0 + \sum _{i=1}^m x^i E_i, \right. \; \exists x=(x^i) \in {\textbf {R}}^m \right\} \cap {\text {PD}}(n) \end{aligned}$$
    (11)
    $$\begin{aligned} {{{\mathcal {M}}}} = \left\{ P \,\left| \, P^{-1} = F^0 + \sum _{i=1}^m s_i F^i \right. , \; \exists s=(s_i) \in {\textbf {R}}^m \right\} \cap {\text {PD}}(n). \end{aligned}$$
    (12)
  • If \(A:=E_0 \in {\textrm{PD}}(n)\) for \({{{\mathcal {M}}}}\) defined in (11), then \({{{\mathcal {P}}}}(B)^{-1}{{{\mathcal {W}}}}\) is a subalgebra of Sym(n), where B is an arbitrary matrix in Sym(n) with \(A=B^2\) and \({{{\mathcal {W}}}}:={\textrm{span}}\{E_i \}_{i=1}^m\). Here \({{{\mathcal {P}}}}(B)={{{\mathcal {P}}}}(B,B)\) is the quadratic representation of B defined in Sect. 2.1, which on Sym(n) maps \(X\) to \({{{\mathcal {P}}}}(B)X=BXB\), so that \({{{\mathcal {P}}}}(B)^{-1}X=B^{-1}XB^{-1}\) (see the numerical sketch after this list).
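The last condition is directly machine-checkable. Below is a minimal numerical sketch (ours; the function names are illustrative) that tests whether \({{{\mathcal {M}}}}=(A+{{{\mathcal {W}}}}) \cap {\textrm{PD}}(n)\) is doubly autoparallel by checking that \(B^{-1}{{{\mathcal {W}}}}B^{-1}\) is closed under the Jordan product, with B the positive definite square root of A:

```python
import numpy as np

def jordan(X, Y):
    """Jordan product on Sym(n): X*Y = (XY + YX)/2."""
    return (X @ Y + Y @ X) / 2

def is_subalgebra(basis, tol=1e-9):
    """Is span(basis) in Sym(n) closed under the Jordan product?"""
    M = np.column_stack([E.reshape(-1) for E in basis])
    for Ei in basis:
        for Ej in basis:
            v = jordan(Ei, Ej).reshape(-1)
            coef = np.linalg.lstsq(M, v, rcond=None)[0]   # project onto the span
            if np.linalg.norm(M @ coef - v) > tol:
                return False
    return True

def is_doubly_autoparallel(A, basis, tol=1e-9):
    """Corollary 2 for M = (A + W) ∩ PD(n): P(B)^{-1}W must be a
    Jordan subalgebra, where B in PD(n) is the square root of A."""
    w, V = np.linalg.eigh(A)                          # spectral decomposition of A
    Binv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T        # B^{-1}
    return is_subalgebra([Binv @ E @ Binv for E in basis], tol)
```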

[Examples of doubly autoparallel submanifolds in PD(n)]

(I) Assume that the identity matrix I (the unit element of \(({\textrm{Sym}}(n),*)\)) is in \({{{\mathcal {W}}}}={\textrm{span}}\{E_i \}_{i=1}^m\). If \(E_0 \in {{{\mathcal {W}}}}\), then \({{{\mathcal {M}}}}={{{\mathcal {W}}}} \cap {\textrm{PD}}(n)=(I+{{{\mathcal {W}}}}) \cap {\textrm{PD}}(n)\). Since the quadratic representation \({{{\mathcal {P}}}}(I)\) is the identity map, Corollary 2 means that \({{{\mathcal {M}}}}\) is doubly autoparallel if and only if such a \({{{\mathcal {W}}}}\) is a Jordan subalgebra of Sym(n). Hence, for a subalgebra \({{{\mathcal {W}}}}\) containing I, the resulting intersection \({{{\mathcal {M}}}}={{{\mathcal {W}}}} \cap {\textrm{PD}}(n)\) is doubly autoparallel. Typical examples of such Jordan subalgebras of Sym(n) are the following:

  i) The set of doubly symmetric matrices defined by

    $$\begin{aligned} {{{\mathcal {W}}}}=\{X \,|\, X = (x_{ij}),\; x_{ij}=x_{ji}, \; x_{ij}=x_{n+1-j \; n+1-i} \}, \end{aligned}$$
    (13)

    i.e., the set of matrices that are symmetric w.r.t. both the main diagonal and the anti-diagonal,

  ii) The set of symmetric matrices that have n prescribed (common) eigenvectors,

and so on. For further information on Jordan subalgebras of Sym(n), including their classification and examples, consult [13, 15, 22, 23, 27].

This type of doubly autoparallel submanifold coincides with the traditionally studied ones [16, 23] in structured covariance estimation. Note that the corresponding submanifolds are intersections of PD(n) with linear (rather than general affine) subspaces, i.e., they are subcones of PD(n) whose boundaries contain the origin.
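For instance, the doubly symmetric subalgebra (13) passes this test; a short usage sketch (ours), reusing is_subalgebra and is_doubly_autoparallel from the sketch above:

```python
import numpy as np

n = 4
J = np.fliplr(np.eye(n))                   # anti-diagonal exchange matrix
# Basis of W = {X in Sym(n) : JXJ = X}, the doubly symmetric matrices (13).
basis = []
for i in range(n):
    for j in range(i, n):
        E = np.zeros((n, n)); E[i, j] = E[j, i] = 1.0
        basis.append((E + J @ E @ J) / 2)  # symmetrize w.r.t. the anti-diagonal
assert is_subalgebra(basis)                # W is a Jordan subalgebra (contains I)
assert is_doubly_autoparallel(np.eye(n), basis)   # E_0 = I in W: a subcone case
```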

(II) Next, we give examples of doubly autoparallel submanifolds that are not subcones. Let \({\mathcal {V}}\) be a Jordan subalgebra of \(\textrm{Sym}(k)\). Let \(A \in \textrm{Sym}(n)\) and \(B \in {\textbf{R}}^{n\times k}\) be constant matrices such that \(\det A \ne 0\) and \(B^T A^{-1} B \in {\mathcal {V}}\), where \(\cdot ^T\) denotes the transpose of a matrix. Then the submanifold

$$\begin{aligned} {\mathcal {M}} = \{ A - B X B^T \,|\, X \in {\mathcal {V}} \} \cap \textrm{PD}(n) \end{aligned}$$
(14)

is doubly autoparallel. In order to prove it, let us check that the space \({\mathcal {W}} =\{ B X B^T \, | \, X \in {\mathcal {V}} \}\) is a subalgebra of the mutation \((\textrm{Sym}(n),\, \perp _{A^{-1}} )\), where the product \(\perp _{Z}\) is given via (5) by \( {{{\mathcal {P}}}}(U,W)Z=U \perp _Z W= (U Z W+ WZU)/2 \) for \(U, W, Z \in \textrm{Sym}(n)\). Let \(B XB^T\) and \(B Y B^T\) be two elements of \({\mathcal {W}}\) with \(X, Y \in {\mathcal {V}}\). We have

$$\begin{aligned}&B X B^T \perp _{A^{-1}} B YB^T \\&\quad = \frac{1}{2}\{ (B X B^T) A^{-1} (BYB^T) + (B Y B^T) A^{-1} (B X B^T) \} \\&\quad = \frac{1}{2} B ( X(B^T A^{-1} B) Y + Y (B^T A^{-1} B) X ) B^T \\&\quad = B (X \perp _{B^T A^{-1}B} Y) B^T, \end{aligned}$$

which belongs to \({\mathcal {W}}\) because \(X \perp _{B^T A^{-1} B} Y \in {\mathcal {V}}\) by the assumptions \(X, \,Y, \,B^T A^{-1}B \in {\mathcal {V}}\) and the definition of \({{{\mathcal {P}}}}(X,Y)\). Therefore Theorem 3 implies that \({\mathcal {M}}\) is doubly autoparallel. Note that \({\mathcal {M}}\) is not a cone if \(A \not \in {\mathcal {W}}\).

An alternative proof can be given as follows, under the assumption that the k-th order identity matrix \(I_k\) is in \({\mathcal {V}}\). The well-known Sherman–Morrison–Woodbury formula implies that

$$\begin{aligned} (A-BXB^T)^{-1}=A^{-1}+A^{-1}B(X^{-1}-B^TA^{-1}B)^{-1}B^TA^{-1} \end{aligned}$$

for invertible X and \(X^{-1}-B^TA^{-1}B\). By the assumption on \({{{\mathcal {V}}}}\), if \(X \in {{{\mathcal {V}}}}\) is invertible, then \(X^{-1} \in {{{\mathcal {V}}}}\) [23, Lem 2.4.2 (i)], and hence \((X^{-1}-B^TA^{-1}B)^{-1} \in {{{\mathcal {V}}}}\) similarly. Since the set of invertible matrices is open and dense in such a \({{{\mathcal {V}}}}\), the formula holds on \({{{\mathcal {V}}}}\) by continuity. Thus, \((A-BXB^T)^{-1}\) is also affinely constrained.
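A small numerical check of example (II) (our sketch, with illustrative choices): take \(A=I_n\) and B with orthonormal columns, so that \(B^TA^{-1}B=I_k \in {\mathcal {V}}=\textrm{Sym}(k)\); then \((A-BXB^T)^{-1}-A^{-1}\) must lie in \(W'=\{BYB^T \,|\, Y \in \textrm{Sym}(k)\}\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 3
B = np.linalg.qr(rng.standard_normal((n, k)))[0]   # orthonormal columns
Y = rng.standard_normal((k, k))
X = 0.05 * (Y + Y.T)                               # small X in V = Sym(k)

P = np.eye(n) - B @ X @ B.T                        # a point of M, PD for small X
assert np.all(np.linalg.eigvalsh(P) > 0)

Delta = np.linalg.inv(P) - np.eye(n)               # should have the form B Y' B^T
assert np.allclose(B @ (B.T @ Delta @ B) @ B.T, Delta, atol=1e-10)
```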

As a simple extension, the following submanifold

$$\begin{aligned} {\mathcal {M}}=\left\{ \left. A - \sum _{i=1}^q B_i X_i B_i^T \, \right| \, X_i \in {\mathcal {V}}_i\,\,(i=1, \dots , q) \right\} \cap \text {PD}(n) \end{aligned}$$

is doubly autoparallel, where \({\mathcal {V}}_i \subset \textrm{Sym}(k_i)\,\,\,(i=1, \dots , q)\) is a Jordan subalgebra, and \(A \in \textrm{Sym}(n)\) and \(B_i \in {\textbf{R}}^{n \times k_i}\,\,\, (i=1, \dots , q)\) are such that \(\det A \ne 0\), \(B_i^T A^{-1} B_i \in {\mathcal {V}}_i\,\,(i=1, \dots , q)\) and \(B_i^T A^{-1} B_j = 0 \,\,(i \ne j)\).

One can find further examples of subalgebras of \(\textrm{Sym}(n)\) in [13].

3.2 Maximum likelihood estimation of structured covariance

Suppose that finite observed data \(\{z_{1},z_{2},\ldots ,z_{q}\}\) follow the n-dimensional multivariate normal distribution with zero mean and covariance matrix \(\varSigma \)

$$\begin{aligned} p(z)=(2\pi )^{-n/2}(\det \varSigma )^{-1/2}\exp \left\{ -\frac{1}{2} z^{T}{\varSigma }^{-1}z\right\} \end{aligned}$$

and consider the problem of estimating \(\varSigma \) from the observed data. The structured covariance estimation problem assumes that the true \(\varSigma \) lies in a certain submanifold \({{{\mathcal {M}}}}\) of PD(n), depending on the mechanism of data generation.

For example, it is often the case that several entries of \(\varSigma \) are known beforehand. In statistical signal processing, data observed from stationary time series or images involve (block) Toeplitz or Hankel structure in their covariance matrices [9, 27]. These are examples of linear constraints on \(\varSigma \).

On the other hand, in the field of statistical estimation called graphical Gaussian modelling or covariance selection [21, 42], the structure in which several entries of \(\varSigma ^{-1}\) are specified to be zero plays an important role. In this case the constraints are inversely-linear, i.e., linear w.r.t. the entries of \(\varSigma ^{-1}\). Further, in the area of factor analysis, more complicated structures may be imposed on \(\varSigma \). Thus, attention has long been paid to structured covariance estimation as an important problem in practice [5, 9].

Now consider the maximum likelihood estimation (MLE) of a covariance matrix \(\varSigma \) on a model \({{{\mathcal {M}}}}\) that is only affinely constrained as in (11), which is basic and important in practice, e.g., in factor analysis or for Toeplitz structure in signal processing. From the likelihood function \(L(\varSigma )=\prod _{i=1}^q p(z_i)\) we have

$$\begin{aligned} \log L(\varSigma )&= -\frac{qn}{2}\log {2\pi }+\frac{q}{2}\left\{ -\log \det \varSigma -\frac{1}{q}\sum _{i=1}^{q}z_{i}^{T}\varSigma ^{-1}z_{i}\right\} \\&= -\frac{qn}{2}\log {2\pi }+\frac{q}{2}\{-\log \det \varSigma -{{{\textrm{tr}}}}(\varSigma ^{-1}S)\}, \end{aligned}$$

where S is the sample covariance matrix \( S=(1/q)\sum _{i=1}^{q} z_{i}z_{i}^{T} \), which is generally not in \({{{\mathcal {M}}}}\).

Hence, finding the maximizer of \(\log L(\varSigma )\) on \({{{\mathcal {M}}}}\), i.e., the MLE for the covariance structure \({{{\mathcal {M}}}}\), is equivalent to the optimization:

$$\begin{aligned} {\textrm{minimize }}\, f(\varSigma ), \;\text{ s.t. } \varSigma \in {{{\mathcal {M}}}}, \;\;\text{ where } f(\varSigma ):=-\log \det \varSigma ^{-1} +{{{\textrm{tr}}}}(\varSigma ^{-1}S). \end{aligned}$$
(15)

Unfortunately, the function \(f(\varSigma )\) is not convex in \(\varSigma \in {{{\mathcal {M}}}}\). Hence, in general we can only expect to find local minimizers.

However, suppose \({{{\mathcal {M}}}}\) is doubly autoparallel, i.e., \({{{\mathcal {M}}}}\) is simultaneously expressed by (11) and (12); then the MLE problem (15) becomes a convex optimization problem via the inversely-affine parametrization \(\varSigma (s):=\left( F^0 + \sum _{i=1}^m s_i F^i \right) ^{-1}\), \(s \in \textbf{R}^m\). The problem reduces to the following:

$$\begin{aligned} {\textrm{minimize }}\, {{\tilde{f}}}(s), \quad {{\tilde{f}}}(s):= -\log \det [\left( \varSigma (s)\right) ^{-1}] +{{{\textrm{tr}}}}[\left( \varSigma (s) \right) ^{-1}S]. \end{aligned}$$
(16)

Note that the optimization (16) is unconstrained because \({{\tilde{f}}}\) is a barrier function of \({{{\mathcal {M}}}}\), i.e., \({{\tilde{f}}}(s) \rightarrow +\infty \) as \(\varSigma (s)\) approaches the boundary \(\partial {{{\mathcal {M}}}}\). Further, since \({{\tilde{f}}}\) is self-concordant ([30], cf. Sect. 4.2), \({{\tilde{f}}}\) is strictly convex and the minimizer is unique if it exists. Thus, the minimizer s is a solution to the optimality equations \(\partial {{\tilde{f}}}/\partial s_i=0\), \((i=1,\dots ,m)\), which can be transformed, using the original affine parametrization \(\varSigma (x)=E_0 + \sum _{i=1}^m x^i E_i\), into linear equations in x:

$$\begin{aligned} -{{{\textrm{tr}}}}\left( \varSigma (s) F^i\right) +{{{\textrm{tr}}}}(F^iS)&= -{{{\textrm{tr}}}}\left( \varSigma (x) F^i\right) +{{{\textrm{tr}}}}(F^iS) \\&= -\sum _{j=1}^m {{{\textrm{tr}}}}(E_jF^i)x^j + {{{\textrm{tr}}}}\left( F^i(S-E_0)\right) =0, \quad i=1,\dots ,m. \end{aligned}$$

By the linear independence assumptions on \(\{E_i\}_{i=1}^m\) and \(\{F^i\}_{i=1}^m\), the Gramian matrix \(({{{\textrm{tr}}}}(E_jF^i))\) is nonsingular. Hence, the solution x gives the unique minimizer if \(\varSigma (x) \in {\textrm{PD}}(n)\).
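The whole computation reduces to one linear solve; the following is a minimal sketch (ours; the representation matrices \(E_0\), \(\{E_i\}\), \(\{F^i\}\) of a DA model and the sample covariance S are assumed given as numpy arrays):

```python
import numpy as np

def structured_mle(E0, E_list, F_list, S):
    """MLE on a doubly autoparallel covariance model (sketch of Sect. 3.2).
    Solves sum_j tr(E_j F^i) x^j = tr(F^i (S - E_0)) for x and returns
    Sigma(x) = E_0 + sum_i x^i E_i, the unique MLE when Sigma(x) is PD."""
    G = np.array([[np.trace(Ej @ Fi) for Ej in E_list] for Fi in F_list])
    rhs = np.array([np.trace(Fi @ (S - E0)) for Fi in F_list])
    x = np.linalg.solve(G, rhs)          # Gramian nonsingular by assumption
    Sigma = E0 + sum(xi * Ei for xi, Ei in zip(x, E_list))
    # The solution is the unique minimizer provided Sigma stays in PD(n).
    assert np.all(np.linalg.eigvalsh(Sigma) > 0), "solution left PD(n)"
    return x, Sigma
```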

3.3 Doubly autoparallel feasible region of semi-definite program

In this subsection we demonstrate that the convex optimization problem called semi-definite program (SDP) [46] can be solved explicitly if the interior of its feasible region is doubly autoparallel. The results extend similarly to the general class of problems called conic programs [30] (cf. Sect. 4), which includes symmetric cone programs.

We briefly introduce the primal path-following method for general convex optimization from a viewpoint of information geometry. Since every convex optimization problem can be equivalently reformulated as one with a linear objective function, we consider the following optimization without loss of generality:

$$\begin{aligned} {\textrm{minimize}}\; c^T x, \quad \text{ s.t. } \; x \in \overline{{{\mathcal {M}}}}, \end{aligned}$$
(17)

where \(c \in \textbf{R}^n\) and \({{{\mathcal {M}}}} \subset \textbf{R}^n\) are, respectively, a constant vector and an open convex set. The closure of \({{{\mathcal {M}}}}\), denoted by \(\overline{{{\mathcal {M}}}}\), is called the feasible region of (17).

Let \(\varPsi (x)\) be a smooth convex barrier of \({{{\mathcal {M}}}}\), i.e., \(\varPsi (x) \rightarrow +\infty \) as \(x \rightarrow \partial {{{\mathcal {M}}}}\). We assume that the Hessian matrix h of \(\varPsi \) is positive definite. To solve (17), the idea of the commonly used affine-scaling method is to numerically follow the solution x(t) of the gradient system of \(c^Tx\) on the Riemannian manifold \(({{{\mathcal {M}}}},h)\):

$$\begin{aligned} \dot{x}= \frac{dx}{dt}=-h(x)^{-1}c, \quad x(t_0) \in {{{\mathcal {M}}}}, \qquad t \ge t_0. \end{aligned}$$
(18)

It is known [1, 30, 41] that x(t) stays in \({{{\mathcal {M}}}}\) for all \(t \ge t_0\). Further, if the set of minimizers of (17) is nonempty and bounded, and \(\varPsi \) is a self-concordant barrier [30] (cf. Sect. 4.2), then x(t) converges to this set as \(t \rightarrow +\infty \). The curve \(\gamma _{{\textrm{AS}}}=\{x(t) \,|\, t \ge t_0 \}\) is called the affine-scaling trajectory. In particular, when the initial point \(x(t_0)\) is the minimizer of \(t_0 c^Tx+\varPsi \) with \(t_0>0\), the trajectory is called the central trajectory (cf. Sect. 4.3).

Using \(\varPsi (x)\) as a potential, we can define a dually flat structure \((h,D,D')\) (i.e., a Hessian structure [38]) on \({{{\mathcal {M}}}}\). As before, we take the canonical flat affine connection D on \(\textbf{R}^n\) and \(D':=\iota ^* D\), the pull-back of D via the gradient map \(\iota :=- {\textrm{grad}}\, \varPsi \). From the viewpoint of the dually flat structure on \({{{\mathcal {M}}}}\), it is of interest to see that (18) is equivalent to

$$\begin{aligned} {\dot{s}}=-c, \quad s_i= \frac{\partial \varPsi }{\partial x^i}, \; i=1, \dots , n, \end{aligned}$$

in the dual coordinate \(s=(s_i)\), i.e., we have the following nice property:

Proposition 2

[7, 8, 40] The affine-scaling trajectory \(\gamma _{{\textrm{AS}}}=\{s(t)\,|\,s(t)=-c(t-t_0)+s(t_0),\, t \ge t_0 \}\) coincides with the \(D'\)-geodesic curve up to parametrization.

Hence, \(\gamma _{{\textrm{AS}}}\) is linearized, and we can obtain the minimizer \({{\widehat{x}}}\) of (17) in a theoretically simple way:

$$\begin{aligned} {{\widehat{x}}}={\textrm{grad}}\, \varPsi ^* ({{\widehat{s}}}), \quad {{\widehat{s}}}:=-\lim _{t \rightarrow +\infty } c(t-t_0)+s(t_0), \end{aligned}$$
(19)

where \(\varPsi ^*\) is the Legendre conjugate of \(\varPsi \).
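To make Proposition 2 and the formula (19) concrete, here is a tiny numerical sketch (ours) on the positive orthant with \(\varPsi (x)=-\sum _i \log x^i\), where \(s_i=\partial \varPsi /\partial x^i=-1/x^i\) and the affine-scaling trajectory is exactly a straight line in the dual coordinates:

```python
import numpy as np

n = 3
c = np.array([1.0, 2.0, 0.5])       # minimize c^T x over the closed orthant
x0 = np.ones(n)                     # interior initial point at t0 = 0
s0 = -1.0 / x0                      # dual coordinates s = grad Psi(x) = -1/x

for t in [0.0, 1.0, 10.0, 1e6]:
    s = s0 - c * t                  # Proposition 2: sdot = -c, a straight line
    x = -1.0 / s                    # back to primal: x = grad Psi^*(s)
    assert np.all(x > 0)            # the trajectory stays inside the cone

# Formula (19): the limit of x(t) as t -> +infty is the minimizer, here 0.
assert np.allclose(-1.0 / (s0 - c * 1e12), np.zeros(n), atol=1e-9)
```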

In spite of the nearly direct expression of the minimizer \({{\widehat{x}}}\) in (19), one general difficulty in practice is obtaining an explicit form of the conjugate function \(\varPsi ^*\) to calculate (19). The alternative way, solving the system of nonlinear equations \({\textrm{grad}}\, \varPsi ({{\widehat{x}}})={{\widehat{s}}}\), is indirect and might be hard when n is large. Hence, the formula (19) is not normally invoked. Instead, the affine-scaling method is usually performed on a convex feasible set \(\overline{{{{\mathcal {M}}}}}\) realized as

$$\begin{aligned} {{{\mathcal {M}}}}=(a+W) \cap \varOmega , \quad a \in \textbf{R}^N, \; n <N \end{aligned}$$
(20)

in a suitable subspace \(W \subset \textbf{R}^N\) and an ambient open convex set \(\varOmega \subset \textbf{R}^N\) equipped with a convex barrier \(\psi \). The barrier \(\psi \) should have good properties, such as self-concordance, for numerically tracing the trajectory, because the restriction \(\varPsi =\psi |_{{{\mathcal {M}}}}\) is usually used as the barrier for \({{{\mathcal {M}}}}\) [41].

However, by applying properties of the dually flat structure, we see that the formula (19) can be effectively activated again in the case where \({{{\mathcal {M}}}}=(a+W) \cap \varOmega \) is doubly autoparallel in a symmetric cone \(\varOmega \) with the self-concordant barrier \(\psi =- \log \det \). Note that in this case the Legendre transform \(-\iota ^{-1}={\textrm{grad}}\psi ^*\) is nothing but (an affine transform of) the inversion \(s \mapsto s^{-1}\) (recall the definition of the differential operator grad in Sect. 3.1). We demonstrate this by taking SDP as an example, where \(\varOmega ={\textrm{PD}}(n)\).

We keep using the notation in (11) for the set of linearly independent matrices \(\{E_i\}_{i=1}^m\) in Sym(n) with \({{{\mathcal {W}}}}:= \text{ span }\{E_i\}_{i=1}^m\), and formulate SDP in the following variant of the dual canonical form without loss of generality:

$$\begin{aligned} {\mathop {{\textrm{minimize}}}\limits _x}\; \langle C, P \rangle , \; \text{ s.t. } \; P=E_0 + \sum _{i=1}^m x^i E_i \in \overline{{{{\mathcal {M}}}}} =\overline{(E_0+{{{\mathcal {W}}}}) \cap {\textrm{PD}}(n)} \end{aligned}$$
(21)

for a constant \(C \in {\textrm{Sym}}(n)\). The dually flat structure \((g, D,D')\) on \({\textrm{PD}}(n)\) is the same one defined in Sect. 3.1 using \(\psi (P)=-\log \det (P)\), hence, \(\iota (P)=P^{-1}\). Note that \({{{\mathcal {M}}}}\) is readily D-autoparallel.

Assume that \({{{\mathcal {M}}}}\) is doubly autoparallel in \({\textrm{PD}}(n)\). Recalling Proposition 2, we see that the affine-scaling trajectory \(\gamma _{{\textrm{AS}}}\) stays on \({{{\mathcal {M}}}}\) for \(t \ge t_0\) by the totally geodesic property 4) in Proposition 1. Hence, we can naturally consider the following procedure to obtain the minimizer \({{\widehat{P}}}\) explicitly using the formula (19).

Step 1:

Since \({{{\mathcal {M}}}}\) is doubly autoparallel, there exist common \(F^0\) and \(\{F^i\}_{i=1}^m\), for which \(P \in {{{\mathcal {M}}}}\) is also represented by (12)

$$\begin{aligned} P^{-1} = F^0 + \sum _{i=1}^m s_i F^i. \end{aligned}$$

For an arbitrarily fixed \({{\tilde{P}}} \in {{{\mathcal {M}}}}\), they are given via the differential \((\iota _*)_{{{\tilde{P}}}}: T_{{{\tilde{P}}}} {{{\mathcal {M}}}} \rightarrow T_{\iota ({{\tilde{P}}})} \iota ({{{\mathcal {M}}}})\), with an identification of \(T_{{{\tilde{P}}}} {{{\mathcal {M}}}}\) with \({{{\mathcal {W}}}}\), by

$$\begin{aligned} F^0={{\tilde{P}}}^{-1}, \quad F^i=(\iota _*)_{{{\tilde{P}}}}(E_i)=-{{\tilde{P}}}^{-1}E_i {{\tilde{P}}}^{-1}, \quad i=1,\dots ,m. \end{aligned}$$
Step 2:

Find \({{\widetilde{C}}} \in {\textrm{span}}\{F^i\}_{i=1}^m\) that satisfies

$$\begin{aligned} \forall P \in {{{\mathcal {M}}}}, \quad \langle C, P \rangle = \langle {{\widetilde{C}}}, P \rangle + {\mathrm{const.}} \end{aligned}$$

This is given by \({{\widetilde{C}}}=\sum _{i=1}^m {{\widetilde{c}}}_i F^i\) using the solution \({{\widetilde{c}}}=({{\widetilde{c}}}_i) \in \textbf{R}^m\) to the following linear equations:

$$\begin{aligned} \sum _{j=1}^m \langle F^j, E_i \rangle {{\widetilde{c}}}_j = \langle C, E_i \rangle , \quad i=1, \dots , m. \end{aligned}$$

The Gramian matrix \((\langle F^j, E_i \rangle )\) is nonsingular by the assumption.

Step 3:

Compute the spectral decomposition of \({{\widetilde{C}}}\):

$$\begin{aligned} {{\widetilde{C}}} = \left( \begin{array}{cc} V_1&V_2 \end{array} \right) \left( \begin{array}{cc} \varSigma _1 &{} O \\ O &{} O \end{array} \right) \left( \begin{array}{c} V_1^T \\ V_2^T \end{array} \right) = V_1 \varSigma _1 V_1^T, \end{aligned}$$

where \(\varSigma _1\) is the diagonal matrix whose diagonal entries are the nonzero eigenvalues of \({{\widetilde{C}}}\).

Step 4:

Take an initial point \(P_0\) in \({{{\mathcal {M}}}}\) arbitrarily. Instead of the usual entrywise Legendre transforms \((s_i)=(\partial \psi /\partial x^i)\) and \((x^i)=(\partial \psi ^*/\partial s_i)\), we use their affine transforms \(\iota \) and \(\iota ^{-1}\), i.e., the matrix inversions \(P \mapsto S=P^{-1}\) and \(S \mapsto P=S^{-1}\). Note that the affine-scaling trajectory \(\gamma _{{\textrm{AS}}}\) is again a half straight line even if modified by affine transforms. Applying the formula (19) to \(S_0=P_0^{-1}\), we obtain the trajectory \(\gamma _{{\textrm{AS}}}=\{S(t)\,| \,S(t)=-{{\widetilde{C}}}(t-t_0)+S_0,\, t\ge t_0 \}\) expressed in the dual coordinate system. Hence, the minimizer \({{\widehat{P}}}\) is

$$\begin{aligned} {{\widehat{P}}}=\lim _{t \rightarrow \infty } S(t)^{-1} =\lim _{t \rightarrow \infty } (-{{\widetilde{C}}}t+S_0)^{-1} =P_0-P_0V_1(V_1^T P_0 V_1)^{-1}V_1^TP_0. \end{aligned}$$
(22)

The last equality follows again from the Sherman–Morrison–Woodbury formula. The parameters \(x=(x^i)\) in (21) of \({{\widehat{P}}}\) are extracted via \({{\widehat{x}}}^i=\langle E^i, {{\widehat{P}}} \rangle -\langle E^i, E_0 \rangle \), where \(\{E^i\}_{i=1}^m\) is the dual basis with \(\langle E^j, E_i \rangle =\delta _i^j\).
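The four steps above admit a direct implementation; the following is a minimal sketch (ours, with illustrative names), assuming \({{{\mathcal {M}}}}\) is doubly autoparallel so that (22) applies:

```python
import numpy as np

def solve_da_sdp(C, E0, E_list, P0, tol=1e-10):
    """Explicit SDP minimizer via (22) for min <C,P> over (E0 + span{E_i}) ∩ PD(n),
    assuming this region is doubly autoparallel and the minimizer set is bounded."""
    # Step 1: dual representation at P~ = P0, i.e., F^i = -P0^{-1} E_i P0^{-1}.
    P0inv = np.linalg.inv(P0)
    F_list = [-P0inv @ Ei @ P0inv for Ei in E_list]
    # Step 2: C~ = sum_i c~_i F^i with <F^j, E_i> c~_j = <C, E_i>.
    G = np.array([[np.trace(Fj @ Ei) for Fj in F_list] for Ei in E_list])
    c_t = np.linalg.solve(G, np.array([np.trace(C @ Ei) for Ei in E_list]))
    C_t = sum(ci * Fi for ci, Fi in zip(c_t, F_list))
    # Step 3: spectral decomposition of C~, keeping the nonzero eigenvalues.
    w, V = np.linalg.eigh(C_t)
    V1 = V[:, np.abs(w) > tol]
    # Step 4: closed form (22).
    return P0 - P0 @ V1 @ np.linalg.solve(V1.T @ P0 @ V1, V1.T @ P0)
```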

The consequence of the above procedure is summarized as follows:

Proposition 3

Assume that minimizers of the SDP (21) exist and that the set of them is bounded. If \({{{\mathcal {M}}}}\) is doubly autoparallel in \(({\textrm{PD}}(n),g,D,D')\), then one of the minimizers of the SDP has the explicit formula given in (22) for any initial point \(P_0 \in {{{\mathcal {M}}}}\) and any objective function \(\langle C,P \rangle \) satisfying the assumptions.

Thus, by the doubly autoparallelism of \({{{\mathcal {M}}}}\) in SDP, we are able to obtain the minimizer \({{\widehat{P}}}\) (or \({{\widehat{x}}}\)) without any iterative computations, using just a spectral decomposition and matrix inversions (cf. Corollary 3). This is similar to the case of the MLE on structured covariances in Sect. 3.2.

Remark 2

Note that Proposition 3 allows the linear objective function and the initial point to be arbitrary. Since an SDP is a combination of an objective function and a feasible region \(\overline{{{\mathcal {M}}}}\), imposing doubly autoparallelism on \({{{\mathcal {M}}}}\) would be too severe a requirement just to use the formula (22). It is at least sufficient that there exists a \(D'\)-autoparallel submanifold \({{{\mathcal {N}}}}\) in \({{{\mathcal {M}}}}\) that contains a trajectory \(\gamma _{{\textrm{AS}}}\), which depends on C and \(P_0\), even if \({{{\mathcal {M}}}}\) itself is not doubly autoparallel. We temporarily call such an \({{{\mathcal {M}}}}\) directionally doubly autoparallel (DDA).

There exist several studies of classes of explicitly solvable SDPs [28, 44, 45]. The feasible region \(\overline{{{\mathcal {M}}}}\) of the class discussed in [44] coincides with the example given in (14). The case of [45] is closely connected to DDA. On the other hand, the relation between the class in [28] and doubly autoparallelism is not yet clear.

4 Curvature integrals on trajectories

As observed in Remark 2 at the end of Sect. 3.3, in order to follow an affine-scaling trajectory \(\gamma _{{\textrm{AS}}}\) without iterations, doubly autoparallelism of the whole \({{{\mathcal {M}}}}\) is not actually necessary; what is needed is the existence of a \(D'\)-autoparallel submanifold \({{{\mathcal {N}}}}\) in \({{{\mathcal {M}}}}\) containing the trajectory (i.e., DDA). However, a criterion for the existence of such a convenient trajectory and \({{{\mathcal {N}}}}\) would generally be difficult to obtain.

Alternatively, in the remaining part of the paper, we study an integral of the second fundamental form (curvature integral) along the trajectory, which is proved to asymptotically measure the necessary iteration-complexity of the path-following algorithm. Among affine-scaling trajectories, we select the special one called the central trajectory, which is known to be advantageous in many respects and plays a significant role in the interior-point method [46]. In addition, the purpose of the succeeding sections is to give an information geometric understanding of the interior-point algorithms for convex programming developed in [35]. Detailed proofs and derivations omitted here can be found therein.

4.1 Geometric framework of conic linear programs

We formulate a general convex optimization (17) as the following primal problem (23) (or the dual problem (24)) in a conic linear program. Let E be a real vector space of dimension n and \(\varOmega \) be a proper (pointed) open convex cone in E. We denote by \(\langle s,x \rangle \) the pairing of \(x \in E\) and \(s \in E^*\), the dual space of E. For the open dual cone \(\varOmega ^*:=\{s \in E^* | \langle s,x \rangle >0, \forall x \in {\overline{\varOmega }}\backslash \{0\}\}\), we consider the primal and dual pair of conic linear programming problems:

$$\begin{aligned} \text{ minimize } \quad \langle c,x \rangle , \; \text{ s.t. } \; x \in \overline{{{\mathcal {P}}}}, \quad \text{ where } {{{\mathcal {P}}}}:= (d+T) \cap \varOmega , \end{aligned}$$
(23)
$$\begin{aligned} \text{ minimize } \quad \langle s,d \rangle , \; \text{ s.t. } \; s \in \overline{{{\mathcal {D}}}}, \quad \text{ where } {{{\mathcal {D}}}}:= (c+T^*) \cap \varOmega ^*. \end{aligned}$$
(24)

Here, \(c \in E^*\), \(d \in E\) and \(T \subset E\) are, respectively, given elements and a given subspace of dimension \(n-m\). By \(T^*\) we denote the subspace consisting of all \(s \in E^*\) satisfying \(\langle s, x \rangle =0\) for all \(x \in T\). We assume that \({{{\mathcal {P}}}}\) and \({{{\mathcal {D}}}}\) are nonempty, which implies that both sets of optimal solutions to (23) and (24) are nonempty and bounded [30].

To see the relevance to the previous section, one may read the feasible region \(\overline{{{\mathcal {M}}}}\) in (20) as the primal feasible region \(\overline{{{\mathcal {P}}}}\) in what follows.

4.2 Dually flat structure on \(\varOmega \)

Denote by D the canonical flat affine connection on E and let \(\{x^1, \ldots , x^n\}\) be one of its affine coordinate systems, i.e., \(D_{\partial /\partial x^i}\partial /\partial x^j=0\). For a smooth function \(\psi \) on \(\varOmega \), the Hessian is

$$\begin{aligned} Dd\psi =\sum _{i,j} \frac{\partial ^2 \psi }{\partial x^i \partial x^j}dx^idx^j. \end{aligned}$$

We simply write \(D^2d \psi (X,Y,Z)\) instead of \((D_X Dd\psi )(Y,Z)\) for vector fields X, Y and Z on \(\varOmega \), since \(D^2d \psi \) is symmetric, i.e., \((D_X Dd\psi )(Y,Z)=(D_YDd\psi )(X,Z)\).

Let \(\psi \) be smooth and convex, and satisfy the following two conditions at each \(x \in \varOmega \):

$$\begin{aligned} \psi (tx) = \psi (x)-\vartheta \log t, \end{aligned}$$
(25)
$$\begin{aligned} |(D^2 d \psi )_x(X,X,X)| \le 2 \left( (Dd \psi )_x(X,X)\right) ^{3/2} \end{aligned}$$
(26)

for a real parameter \(\vartheta \ge 1\), all \(t>0\) and all \(X \in T_x \varOmega \cong E\). If, in addition, \(\psi (x_i) \rightarrow +\infty \) for each sequence \(\{x_i \,|\, x_i \in \varOmega \}\) that converges to \(\partial \varOmega \), the function \(\psi \) is called a \(\vartheta \)-normal barrier on \(\varOmega \); such a barrier always exists, and its Hessian \(Dd\psi \) is positive definite [30, Prop.2.3.5 and Chap.5]. For example, the logarithmic characteristic function of \(\varOmega \) defined by

$$\begin{aligned} \psi (x):=\log \int _{\varOmega ^*} e^{-\langle s,x \rangle } ds, \end{aligned}$$

is a \(\vartheta \)-normal barrier on \(\varOmega \) for some \(\vartheta \) [14]. Note that a convex function satisfying only (26) is called a 1-self-concordant function [30]. We introduce a Riemannian metric g on \(\varOmega \) as the Hessian of a \(\vartheta \)-normal barrier \(\psi \); in other words, \(g=Dd\psi \).
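For instance, for the log barrier \(\psi (x)=-\sum _i \log x^i\) on \(\varOmega =\textbf{R}^n_+\) one has \(\vartheta =n\); a minimal numerical sketch (ours) of the homogeneity condition (25):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.5, 2.0, size=5)        # a point of the positive orthant
psi = lambda z: -np.sum(np.log(z))       # the log barrier
t = 3.0
# (25) with theta = n: psi(t x) = psi(x) - n log t.
assert np.isclose(psi(t * x), psi(x) - x.size * np.log(t))
```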

Let \(\{s_1,\ldots ,s_n \}\) be the dual affine coordinate system on \(E^*\) w.r.t. \(\{x^1, \ldots , x^n\}\), i.e., for \(x \in E\) and \(s \in E^*\) it holds that \(\langle s,x \rangle = \sum _i s_i(s) x^i(x)\). Using the dual affine coordinates we shall identify \(\varOmega ^*\) with \(\varOmega \) via the diffeomorphism \(\iota :\varOmega \rightarrow \varOmega ^*\) [38] defined by

$$\begin{aligned} s_i \circ \iota =-\frac{\partial \psi }{\partial x^i}. \end{aligned}$$
(27)

Note that the differential \(\iota _*\) satisfies the relation:

$$\begin{aligned} \langle \iota _*(X),Y \rangle =-g_x(X,Y), \end{aligned}$$
(28)

for all \(x \in \varOmega \) and \(X,Y \in T_x \varOmega \cong E\).

Consider another torsion-free affine connection \(D'\) on \(\varOmega \) defined by

$$\begin{aligned} X g(Y,Z)= g(D_X Y,Z)+g(Y, D'_X Z) \end{aligned}$$
(29)

for arbitrary vector fields X, Y and Z on \(\varOmega \). It is known that the symmetry of \(D^2d\psi \) together with (29) assures that \(D'\) is flat [2, 3, 38]. The affine connection \(D^*\) on \(\varOmega ^*\) induced from \(D'\) by \(\iota \), i.e., \(D^*_{\iota _*(X)} \iota _*(Y)=\iota _* (D'_X Y)\), is flat, and \(\{s_1,\ldots ,s_n \}\) is its affine coordinate system. The triple \((g,D,D')\) is called a dually flat structure [2, 3] or a Hessian structure [38] on \(\varOmega \).

Let \(\psi ^*\) be the Legendre conjugate of \(\psi \) defined on \(\varOmega ^*\) by

$$\begin{aligned} \psi ^*(s)= \max _{x \in \varOmega } \{-\langle s,x \rangle -\psi (x) \} \end{aligned}$$

with the sign convention of (27); \(\psi ^*\) is also a \(\vartheta \)-normal barrier on \(\varOmega ^*\) with the same \(\vartheta \) [30, Thm 2.4.4]. Hence, the Hessian \(g^*:=D^*d \psi ^*\) can be regarded as a Riemannian metric on \(\varOmega ^*\), the pull-back of which by \(\iota \) is g, i.e., \(g=\iota ^* g^*\). We denote the local length of \(X \in T_x \varOmega \) by

$$\begin{aligned} \Vert X \Vert _x:=\Vert Y \Vert _s:=\sqrt{g_x(X,X)} =\sqrt{g^*_s(Y, Y)}, \end{aligned}$$
(30)

where \(s=\iota (x)\) and \(Y=\iota _*(X)\).
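The identification \(\iota \) and the isometry (30) can be made concrete in the same example: for \(\psi (x)=-\sum _i \log x^i\) one has \(\iota (x)=1/x\) componentwise and \(\psi ^*(s)=-\sum _i \log s_i - n\). The sketch below verifies (27), (28) and (30) under these assumptions; it is an illustration, not the general construction.

```python
import numpy as np

# For psi(x) = -sum(log x): iota(x) = -grad psi(x) = 1/x, cf. (27).
x = np.array([0.4, 1.0, 2.5])
s = 1.0 / x                              # (27): s_i = -d psi / d x^i
g  = np.diag(1.0 / x**2)                 # g  = D d psi   at x
gs = np.diag(1.0 / s**2)                 # g* = D* d psi* at s = iota(x)
iota_star = lambda V: -V / x**2          # differential of iota at x

X = np.array([1.0, -2.0, 0.5])
Y = np.array([0.3,  1.0, -1.0])
# (28): <iota_*(X), Y> = -g_x(X, Y)
assert np.isclose(iota_star(X) @ Y, -(X @ g @ Y))
# (30): ||X||_x = ||iota_*(X)||_s, i.e. g is the pull-back of g*
assert np.isclose(np.sqrt(X @ g @ X),
                  np.sqrt(iota_star(X) @ gs @ iota_star(X)))
```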

4.3 Central trajectory and autoparallel submanifolds

For each \(t > 0\), let \(\gamma _{{{\mathcal {P}}}}(t)\) be the point of \({{{\mathcal {P}}}}\) where \(x(t):=\gamma _{{{\mathcal {P}}}}(t)\) is defined as the unique minimizer of the following convex optimization problem:

$$\begin{aligned} {\textrm{minimize}} \; t \langle c,x \rangle + \psi (x), \text{ s.t. } x \in \overline{{{\mathcal {P}}}}. \end{aligned}$$
(31)

We call the curve \(\gamma _{{{\mathcal {P}}}}:=\{x(t) \in {{{\mathcal {P}}}} | t > 0 \}\) the central trajectory (central path) for the primal problem (23). The uniqueness of x(t) for each t follows from the assumptions on \({{{\mathcal {P}}}}\) and \({{{\mathcal {D}}}}\) at the end of Sect. 4.1 and from Lemma 5, and x(t) converges to the set of optimal solutions to (23) as \(t \rightarrow +\infty \) [30]. Tracing \(\gamma _{{{\mathcal {P}}}}\) numerically, by generating points in a neighborhood of the trajectory while increasing t, is one of the standard interior-point algorithms to solve (23).
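As a small illustration, the following sketch computes one point x(t) of \(\gamma _{{{\mathcal {P}}}}\) by a feasible-start Newton method applied to (31); the toy LP data (A, b, c) and the barrier \(\psi (x)=-\sum _i \log x^i\) are assumptions of ours.

```python
import numpy as np

# One central-trajectory point x(t) of (31) for a toy instance:
# minimize t <c, x> + psi(x)  s.t.  Ax = b, with psi(x) = -log x1 - log x2.
A = np.array([[1.0, 1.0]]); b = np.array([1.0]); c = np.array([1.0, 0.0])
t = 10.0
x = np.array([0.5, 0.5])                   # strictly feasible start
for _ in range(50):
    grad = t * c - 1.0 / x                 # gradient of t<c,x> + psi
    H = np.diag(1.0 / x**2)                # Hessian D d psi
    K = np.block([[H, A.T], [A, np.zeros((1, 1))]])   # KKT system
    step = np.linalg.solve(K, np.concatenate([-grad, np.zeros(1)]))[:2]
    alpha = 1.0
    while np.any(x + alpha * step <= 0):   # damp to stay inside Omega
        alpha *= 0.5
    x = x + alpha * step
print(x)   # x(t); as t grows, x(t) tends to the optimum (0, 1)
```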

Now represent the subspace T as the null space of a surjective linear operator \(A:E \rightarrow \textbf{R}^m\); then we have

$$\begin{aligned} {{{\mathcal {P}}}}= & {} \{x \in \varOmega | Ax=b \}, \end{aligned}$$
(32)
$$\begin{aligned} {{{\mathcal {D}}}}= & {} \{s \in \varOmega ^* | s=c-A^*y, \; \exists y \in \textbf{R}^m \}, \end{aligned}$$
(33)

where \(A^*: \textbf{R}^m \rightarrow E^*\) is the operator satisfying \(y^T(Ax)=\langle A^*y, x \rangle \), and \(b:=Ad \in \textbf{R}^m\). Hence, \(T^*={\textrm{im}}A^*\). Note that \({{{\mathcal {P}}}}\) and \({{{\mathcal {D}}}}\) are, respectively, D- and \(D^*\)-autoparallel. The homogenization (or conic hull) of \({{{\mathcal {D}}}}\) defined by

$$\begin{aligned} {\textrm{Hom}}({{{\mathcal {D}}}})&:=\bigcup _{t>0} t{{{\mathcal {D}}}}, \\ t{{{\mathcal {D}}}}&:=\{s \in \varOmega ^* | s=t {{\tilde{s}}}, \; {{\tilde{s}}} \in {{{\mathcal {D}}}} \} =\{s \in \varOmega ^* | s=tc -A^*y, \; \exists y \in \textbf{R}^m \} \end{aligned}$$

is a \(D^*\)-autoparallel submanifold of dimension \(m+1\).

Generally the submanifold \({{{\mathcal {P}}}}\) is not \(D'\)-autoparallel in \((\varOmega ,D')\). We denote by \(H_{{{\mathcal {P}}}}^*\) the second fundamental form (Euler-Schouten embedding curvature) of \({{{\mathcal {P}}}}\) w.r.t. \(D'\). For V and \(W \in T_x {{{\mathcal {P}}}}\), it is expressed by

$$\begin{aligned} (H_{{{\mathcal {P}}}}^*(V,W))_x =\varPi _x^\perp (D'_V W) \in T_x \varOmega , \end{aligned}$$

where \(\varPi _x^\perp \) is the orthogonal projection from \(T_x \varOmega \simeq E\) to \(T^\perp :=({\textrm{ker}}A)^\perp \subset T_x \varOmega \) w.r.t. g, i.e.,

$$\begin{aligned} \varPi _x^\perp :=(\iota _*)^{-1} \circ A^* \circ (A \circ (\iota _*)^{-1} \circ A^*)^{-1} \circ A. \end{aligned}$$

Recalling (28) and defining a linear operator \(G_s:T_s \varOmega ^* \rightarrow T_x \varOmega \) with \(s=\iota (x)\) by

$$\begin{aligned} G_s:=-(\iota _*)^{-1}, \end{aligned}$$

we simply write, by omitting the symbol of the composition \(\circ \),

$$\begin{aligned} \varPi _x^\perp =G_s A^* (A G_s A^*)^{-1} A. \end{aligned}$$

Similarly, the orthogonal projection from \(T_s \varOmega ^* \simeq E^*\) to \(T^*={\textrm{im}}A^*=\iota _* T^\perp \subset T_s \varOmega ^*\) w.r.t. \(g^*\), denoted by \(\varPi _s^\perp \), is

$$\begin{aligned} \varPi _s^\perp :=\iota _* \varPi _x^\perp (\iota _*)^{-1} = G_s^{-1} \varPi _x^\perp G_s. \end{aligned}$$
(34)
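The projections above can be instantiated in the LP setting of Sect. 7, where \(G_s={\textrm{diag}}(1/s_i^2)={\textrm{diag}}((x^i)^2)\); the following sketch, with an illustrative choice of A and x, checks the characteristic properties of \(\varPi _x^\perp \) and (34).

```python
import numpy as np

# Pi_x^perp = G_s A^T (A G_s A^T)^{-1} A and Pi_s^perp = G_s^{-1} Pi_x^perp G_s,
# cf. (34), for the barrier psi(x) = -sum(log x) (an assumed instance).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
x = np.array([0.5, 1.0, 2.0]); s = 1.0 / x
G = np.diag(1.0 / s**2)                    # G_s = diag(x^2)
Pi_x = G @ A.T @ np.linalg.solve(A @ G @ A.T, A)
Pi_s = np.linalg.inv(G) @ Pi_x @ G         # (34)

g = np.linalg.inv(G)                       # metric g at x
assert np.allclose(Pi_x @ Pi_x, Pi_x)      # idempotent
assert np.allclose(Pi_s @ Pi_s, Pi_s)
assert np.allclose(A @ (np.eye(3) - Pi_x), 0.0)  # id - Pi_x^perp maps onto T = ker A
assert np.allclose(g @ Pi_x, (g @ Pi_x).T)       # self-adjoint w.r.t. g
```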

5 Geometric predictor-corrector algorithm

5.1 Characterization of central trajectory in dual cone

We propose a new path-following algorithm to trace \(\gamma _{{{\mathcal {P}}}} \subset {{{\mathcal {P}}}}\). The novelty is that the algorithm follows the curve \(\iota (\gamma _{{{\mathcal {P}}}})\) instead of \(\gamma _{{{\mathcal {P}}}}\), generating points sequentially on \({\textrm{Hom}}({{{\mathcal {D}}}}) \subset \varOmega ^*\) by iterating pairs of predictor and corrector steps.

The basic idea is as follows: in a predictor step, \(s(t):=\iota (x(t))\) for \(x(t) \in \gamma _{{{\mathcal {P}}}}\) moves to \(s_L(t+\varDelta t)\) along the tangent direction of \(\iota (\gamma _{{{\mathcal {P}}}})\) for some \(\varDelta t\), and an ideal corrector step brings \(s_L(t+\varDelta t)\) back to \(s(t+\varDelta t)\) on \(\iota (\gamma _{{{\mathcal {P}}}})\) (cf. Fig. 1). Through this setup, the number of iterations of such pairs in the algorithm is evaluated in terms of the embedding curvature \(H^*_{{{\mathcal {P}}}}\) on \({{{\mathcal {P}}}}\). Since taking such an exact corrector step onto \(\iota (\gamma _{{{\mathcal {P}}}})\) is actually impossible, we instead introduce a tubular neighborhood of \(\iota (\gamma _{{{\mathcal {P}}}})\) and confine the resultant corrector step to that neighborhood. Then we consider the situation where the radius of the tube is reduced to zero.

Fig. 1 Predictor and corrector steps in \({\textrm{Hom}}({{{\mathcal {D}}}})\) following \(\iota (\gamma _{{{\mathcal {P}}}})\)

Let L(x, y) and \(y \in \textbf{R}^m\) be the Lagrange function and its multiplier associated with the problem (31), defined by

$$\begin{aligned} L(x,y):= t \langle c,x \rangle + \psi (x)+ y^T(b-Ax). \end{aligned}$$

Using (27), we have the optimality condition:

$$\begin{aligned} \frac{\partial L}{\partial x}= tc-s-A^*y=0, \quad \text{ where } \; s=\iota (x) \; \text{ for } x \in {{{\mathcal {P}}}}, \end{aligned}$$
(35)

which is equivalent to \(s \in t{{{\mathcal {D}}}}\), and

$$\begin{aligned} \min _x L(x,y)=y^Tb+\min _x\{\psi (x)+ \langle s,x \rangle \} =y^Tb-\psi ^*(s). \end{aligned}$$

Thus, the dual convex problem of (31) is

$$\begin{aligned} {\mathop {{\textrm{maximize}}}\limits _{y}} \; b^T y-\phi (y), \quad \text{ where }\; \phi (y):=\psi ^*(tc-A^*y), \end{aligned}$$
(36)

or equivalently rewritten as

$$\begin{aligned} {\mathop {{\textrm{minimize}}}\limits _s} \; F(s):= \langle s,d \rangle + \psi ^*(s), \text{ s.t. } s \in t {{\mathcal {D}}}. \end{aligned}$$
(37)

Hence, we have the following characterization of \(\gamma _{{{\mathcal {P}}}}\):

Lemma 5

The central trajectory \(\gamma _{{{\mathcal {P}}}}\) and a point \(s(t)=\iota (x(t))\) where \(x(t) \in \gamma _{{{\mathcal {P}}}}\) are respectively expressed by

$$\begin{aligned} \iota (\gamma _{{{\mathcal {P}}}})=\iota ({{{\mathcal {P}}}}) \cap {\textrm{Hom}}({{{\mathcal {D}}}}), \quad s(t)=\iota ({{{\mathcal {P}}}}) \cap t{{{\mathcal {D}}}}. \end{aligned}$$

Note that by the definitions \(\iota ({{{\mathcal {P}}}})\) and \(t {{{\mathcal {D}}}}\) are orthogonal at s(t) w.r.t. \(g^*\) for each t.

5.2 Predictor step

We first derive the differential equation that governs \(\iota (\gamma _{{{\mathcal {P}}}})\). Differentiating the optimality condition (35) with respect to t, it holds that

$$\begin{aligned} c-{\dot{s}} =A^* {\dot{y}}. \end{aligned}$$
(38)

Multiplication by \(A(\iota _*)^{-1}\) from the left yields

$$\begin{aligned} A(\iota _*)^{-1}c-A{\dot{x}}=A(\iota _*)^{-1}A^*{\dot{y}}. \end{aligned}$$

Since \(A{\dot{x}}=0\) because of the constraint \(\gamma _{{{\mathcal {P}}}} \subset {{{\mathcal {P}}}}\), we find

$$\begin{aligned} {\dot{y}}=(A(\iota _*)^{-1}A^*)^{-1}A(\iota _*)^{-1}c =(AG_s A^*)^{-1}AG_s c. \end{aligned}$$

Substituting this relation into (38) and recalling (34) and (35), the tangent vector \({\dot{s}}(t)\) of \(\iota (\gamma _{{{\mathcal {P}}}})\) is represented by

$$\begin{aligned} \dot{s}=({\textrm{id}}-\varPi _s^\perp )c=\frac{1}{t}({\textrm{id}}-\varPi _s^\perp )s. \end{aligned}$$
(39)

Note that the vector field \(({\textrm{id}}-\varPi _{{{\bar{s}}}}^\perp )c\) is tangent to \({\textrm{Hom}}({{{\mathcal {D}}}})\) at any \({{\bar{s}}} \in {\textrm{Hom}}({{{\mathcal {D}}}})\) because

$$\begin{aligned} ({\textrm{id}}-\varPi _{{{\bar{s}}}}^\perp )c=c-A^*z \; \; \text{ with } z:=(AG_{{{\bar{s}}}}A^*)^{-1}AG_{{{\bar{s}}}}c. \end{aligned}$$

Consider a point \({{\bar{s}}} \in t{{{\mathcal {D}}}}\) that is sufficiently close to \(s(t) \in \iota (\gamma _{{{\mathcal {P}}}})\) in the sense of the Newton decrement (cf. (42) below). A predictor forwards \({{\bar{s}}}\) to \({{\bar{s}}}_L(t+\varDelta t)\) along the tangent direction \({\dot{s}}(t)\) given in (39), i.e.,

$$\begin{aligned} {{\bar{s}}}_L(t+\varDelta t):={{\bar{s}}}+ \varDelta t ({\textrm{id}}-\varPi _{{{\bar{s}}}}^\perp )c \in (t+\varDelta t){{{\mathcal {D}}}} \end{aligned}$$
(40)

for a suitable increment \(\varDelta t\) (cf. the dotted arrow in Fig. 1).
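A minimal sketch of the predictor (40) follows, assuming the LP barrier of Sect. 7 and a toy instance; it also confirms the membership \({{\bar{s}}}_L(t+\varDelta t) \in (t+\varDelta t){{{\mathcal {D}}}}\) noted above.

```python
import numpy as np

# Predictor (40): move from s_bar in t*D along (id - Pi_s^perp)c.
A = np.array([[1.0, 1.0]]); c = np.array([1.0, 0.0])
t, y = 2.0, np.array([-1.0])
s_bar = t * c - A.T @ y                     # a point of t*D in Omega* = R^2_+
G = np.diag(1.0 / s_bar**2)                 # G_s for psi*(s) = -sum(log s)
z = np.linalg.solve(A @ G @ A.T, A @ G @ c)
d = c - A.T @ z                             # (id - Pi_s^perp) c, tangent to Hom(D)
dt = 0.5
s_L = s_bar + dt * d                        # predictor (40)
# s_L lies in (t + dt)*D:  s_L = (t + dt) c - A^T (y + dt z)
assert np.allclose(s_L, (t + dt) * c - A.T @ (y + dt * z))
```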

5.3 Corrector step

With Lemma 5, the point \(s(t) \in \iota (\gamma _{{{\mathcal {P}}}})\), which is the ideal goal of each corrector for a given t, is characterized by the minimizer of the dual problem (37). The Newton direction \(N \in {{{\mathcal {X}}}}(t{{{\mathcal {D}}}})\) for the problem (37) is the vector field on \(t{{{\mathcal {D}}}}\) defined by

$$\begin{aligned} D^*dF(X,N)=-dF(X),\; \forall X \in {{{\mathcal {X}}}}(t{{{\mathcal {D}}}}). \end{aligned}$$

Since F and \(\psi ^*\) differ only by a linear function, we have \(D^*dF=D^*d \psi ^*\), which coincides with the Riemannian metric \(g^*\), i.e., \(g^*=D^*dF\). Since t is fixed in the problem (37), N is given by

$$\begin{aligned} N:=-A^* {{\hat{N}}}, \end{aligned}$$
(41)

where \({{\hat{N}}} =({{\hat{N}}}^i)\) is the Newton direction vector field on \(\textbf{R}^m\) for the equivalent unconstrained problem (36) defined by

$$\begin{aligned} {{\hat{N}}}^i=- \sum _{j=1}^m \left( \frac{\partial ^2 \phi }{\partial y^i \partial y^j} \right) ^{-1} \left( -b_j+\frac{\partial \phi }{\partial y^j}\right) , \quad i=1,\ldots ,m. \end{aligned}$$

A standard measure of the deviation from the minimizer in convex optimization via the Newton method is the Newton decrement [30]. For the problem (37) the Newton decrement \(\delta (s)\) at \(s \in t{{{\mathcal {D}}}}\) from the minimizer \(s(t) \in \iota (\gamma _{{{\mathcal {P}}}})\) of the function F(s) is obtained by

$$\begin{aligned} \delta (s):=\Vert N \Vert _{s}. \end{aligned}$$
(42)

In particular, for convex functions satisfying the self-concordance condition (26) (hence, in particular, for F(s) and \(\phi (y)\)), it is generally known [30, ii) of Thm 2.2.2] that, by a single Newton step \(s^+:= s+N_s\), the decrement \(\delta (s)\) reduces quadratically as follows:

$$\begin{aligned} \delta (s) \le \beta \le 1/4 \; \Rightarrow \; \delta (s^+) \le 16 \beta ^2/9. \end{aligned}$$
(43)

In order to approximately follow the trajectory \(\iota (\gamma _{{{\mathcal {P}}}})\) in \({\textrm{Hom}}({{{\mathcal {D}}}})\) we set the tubular neighborhood \({{{\mathcal {N}}}}(\beta ) \subset {\textrm{Hom}}({{{\mathcal {D}}}})\) of \(\iota (\gamma _{{{\mathcal {P}}}})\) defined by

$$\begin{aligned} {{{\mathcal {N}}}}(\beta ):=\bigcup _{t \in (0,\infty )}{{{\mathcal {N}}}}_t(\beta ), \end{aligned}$$

where \({{{\mathcal {N}}}}_t(\beta ):=\{s \in t {{{\mathcal {D}}}}| \delta (s) \le \beta \}\). The fact (43) implies that if a predictor \({{\bar{s}}}_L(t+\varDelta t)\) is in \({{{\mathcal {N}}}}_{t+\varDelta t}(\beta )\) for \(\beta \le 1/4\), then the single Newton step can correct it to a point

$$\begin{aligned} {{\bar{s}}}^+_L(t+\varDelta t):= {{\bar{s}}}_L(t+\varDelta t)+N_{{{\bar{s}}}_L(t+\varDelta t)} \end{aligned}$$
(44)

in a smaller vicinity \({{{\mathcal {N}}}}_{t+\varDelta t}(16\beta ^2/9)\) of \(s(t+\varDelta t) \in \iota (\gamma _{{{\mathcal {P}}}})\). Hence, as the actual corrector, we use \({{\bar{s}}}^+_L(t+\varDelta t)\) defined by (44) with a single Newton step.
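The next sketch computes the Newton decrement (42) via (41) and performs the single Newton step (44), again in the LP specialization \(\psi ^*(s)=-\sum _i \log s_i\) of Sect. 7 with a toy instance; the printed values illustrate the reduction (43).

```python
import numpy as np

# Newton decrement (42) and corrector (44) on t*D; here grad phi(y) = A(1/s)
# and Hess phi(y) = A G_s A^T with G_s = diag(1/s^2).
A = np.array([[1.0, 1.0]]); b = np.array([1.0]); c = np.array([1.0, 0.0])
t = 2.0
s = t * c - A.T @ np.array([-1.0])      # a point of t*D, here s = (3, 1)

def newton_data(s):
    G = np.diag(1.0 / s**2)             # Hessian of psi*, i.e. the metric g*
    N_hat = np.linalg.solve(A @ G @ A.T, b - A @ (1.0 / s))
    N = -A.T @ N_hat                    # Newton direction (41); stays in t*D
    return np.sqrt(N @ G @ N), N        # Newton decrement (42) and N

delta, N = newton_data(s)
s_plus = s + N                          # single Newton step (44)
delta_plus, _ = newton_data(s_plus)
print(delta, delta_plus)                # approx. 0.316 -> 0.087, cf. (43)
```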

To sum up, the algorithm alternately generates predictors and correctors in \({{{\mathcal {N}}}}(\beta )\) as follows (a numerical sketch is given after the list):

1. Let \(0< \beta \le 1/4\) and \(0 < \eta \le 1/2\), and initialize \(t > 0\) and \({{\bar{s}}} \in t {{{\mathcal {D}}}}\) such that \({{\bar{s}}} \in {{{\mathcal {N}}}}_{t}(16\beta ^2/9)\).

2. (Predictor step) Compute the predictor \({{\bar{s}}}_L(t+\varDelta t)\) via (40) by choosing \(\varDelta t\) that satisfies

$$\begin{aligned} (1-\eta )\beta \le \delta ({{\bar{s}}}_L(t+\varDelta t)) \le \beta . \end{aligned}$$
(45)

The increment \(\varDelta t\) is determined by a line search to meet (45).

3. (Corrector step) Compute the corrector \({{\bar{s}}}^+_L(t+\varDelta t)\) from \({{\bar{s}}}_L(t+\varDelta t)\) using the single Newton step (44).

4. Set \({{\bar{s}}}:={{\bar{s}}}^+_L(t+\varDelta t)\), \(t:= t+ \varDelta t\) and return to step 2.
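A compact end-to-end sketch of steps 1–4 is given below, assuming the LP specialization of Sect. 7 and our toy instance; the line search enforces (45) only approximately, by halving and doubling \(\varDelta t\).

```python
import numpy as np

# Predictor-corrector path-following on Hom(D) for a toy LP instance.
A = np.array([[1.0, 1.0]]); b = np.array([1.0]); c = np.array([1.0, 0.0])

def G_of(s):                      # Hessian of psi*(s) = -sum(log s), i.e. g*
    return np.diag(1.0 / s**2)

def newton_dir(s):                # Newton direction (41) on t*D
    G = G_of(s)
    return -A.T @ np.linalg.solve(A @ G @ A.T, b - A @ (1.0 / s))

def decrement(s):                 # Newton decrement (42)
    N = newton_dir(s)
    return np.sqrt(N @ G_of(s) @ N)

def pred_dir(s):                  # (id - Pi_s^perp) c, cf. (39) and (40)
    G = G_of(s)
    return c - A.T @ np.linalg.solve(A @ G @ A.T, A @ G @ c)

beta, eta = 0.25, 0.5             # step 1: parameters and initialization
t = 1.0
s = t * c - A.T @ np.array([-1.0])
while decrement(s) > 16 * beta**2 / 9:
    s = s + newton_dir(s)         # correct onto N_t(16 beta^2 / 9)

iters = 0
while t < 1e6:
    d, dt = pred_dir(s), 1.0      # step 2: predictor with a crude line search
    while np.any(s + dt * d <= 0) or decrement(s + dt * d) > beta:
        dt *= 0.5
    while np.all(s + 2 * dt * d > 0) and decrement(s + 2 * dt * d) <= beta:
        dt *= 2.0
    s_L = s + dt * d              # predictor (40)
    s = s_L + newton_dir(s_L)     # step 3: corrector (44)
    t += dt                       # step 4: advance the parameter
    iters += 1

x = 1.0 / s                       # primal point via iota^{-1}
print(iters, t, c @ x)            # c @ x approaches the optimal value 0
```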

6 Curvature integrals and iteration-complexity

6.1 Three preliminary results

The properties of self-concordant functions [30] imply the following compatibility of \(\Vert s(t)-{{\bar{s}}} \Vert _{{{\bar{s}}}}\) and \(\delta ({{\bar{s}}})\) for \({{\bar{s}}} \in t{{{\mathcal {D}}}}\) [35, Prop. 3.4]: if \(\delta ({{\bar{s}}}) < 1/16\), then it holds that

$$\begin{aligned} \left\{ \begin{array}{l} \delta ({{\bar{s}}})(1-8\delta ({{\bar{s}}})) \le \Vert s(t)-{{\bar{s}}} \Vert _{{{\bar{s}}}} \le \delta ({{\bar{s}}})(1+8\delta ({{\bar{s}}})), \\ \Vert s(t)-{{\bar{s}}} \Vert _{{{\bar{s}}}}(1-22\Vert s(t)-{{\bar{s}}} \Vert _{{{\bar{s}}}}) \le \delta ({{\bar{s}}}) \le \Vert s(t)-{{\bar{s}}} \Vert _{{{\bar{s}}}}(1+22\Vert s(t)-{{\bar{s}}} \Vert _{{{\bar{s}}}}). \end{array} \right. \end{aligned}$$

Using these relations, we can express deviation of the predictor \({{\bar{s}}}_L(t+\varDelta t)\) from \(s(t+\varDelta t) \in \iota (\gamma _{{{\mathcal {P}}}})\) (cf. Fig. 1) in terms of \(\delta ({{\bar{s}}})\) and \(\varDelta t\) [35, Lem. 5.2], i.e.,

$$\begin{aligned} \delta ({{\bar{s}}}_L (t+\varDelta t))&= \Vert s(t+\varDelta t)-{{\bar{s}}}_L (t+\varDelta t) \Vert _{{{\bar{s}}}_L(t+\varDelta t)} +r_4 \\&= \frac{(\varDelta t)^2}{2} \left\| \ddot{s}(t) \right\| _{s(t)} +\delta ({{\bar{s}}})+r_1+r_2+r_3, \end{aligned}$$
(46)

where the remainder terms \(r_i\) are bounded by \(|r_1| \le M_1 (\varDelta t)^3\), \(|r_2| \le M_2 \delta ({{\bar{s}}})\), \(|r_3| \le M_3 [(\varDelta t)^2+\delta ({{\bar{s}}})]^2\) and \(|r_4| \le M_4 [(\varDelta t)^2+\delta ({{\bar{s}}})]^2\) for constants \(M_i \, (i=1,\ldots ,4)\). In view of (46), consider w defined by

$$\begin{aligned} w:=\delta ({{\bar{s}}}_L (t+\varDelta t))-r_2-\delta ({{\bar{s}}}) = \frac{(\varDelta t)^2}{2} \left\| \ddot{s}(t) \right\| _{s(t)} +r_1+r_3. \end{aligned}$$

By (45) and \(\delta ({{\bar{s}}})=O(\beta ^2)\) we have the first preliminary results [35, pp. 33–34]:

$$\begin{aligned} \sqrt{w}-\sqrt{M_3}\, \delta ({{\bar{s}}}) \le \frac{\varDelta t}{\sqrt{2}} \left\| \ddot{s}(t) \right\| ^{1/2}_{s(t)} +\sqrt{|r_1|} +\sqrt{M_3}(\varDelta t)^2, \end{aligned}$$
(47)
$$\begin{aligned} \frac{\varDelta t}{\sqrt{2}} \left\| \ddot{s}(t) \right\| ^{1/2}_{s(t)}-\sqrt{|r_1|}-\sqrt{M_3}(\varDelta t)^2 \le \sqrt{w} +\sqrt{M_3}\, \delta ({{\bar{s}}}), \end{aligned}$$
(48)
$$\begin{aligned} \sqrt{(1-\eta ) \beta }\,(1-O(\sqrt{\beta })) \le \sqrt{w} \pm \sqrt{M_3}\, \delta ({{\bar{s}}}) \le \sqrt{\beta }\,(1+O(\sqrt{\beta })) \end{aligned}$$
(49)

for sufficiently small \(\varDelta t\) and \(\beta \).

Secondly, differentiating \(\varPi ^\perp _s\) with respect to t and using (39), we have

$$\begin{aligned} \ddot{s} =-{{\dot{\varPi }}}^\perp _s c = -\varPi _s^\perp G_s^{-1} \dot{G}_s({\textrm{id}}-\varPi _s^\perp )c =-\varPi _s^\perp G_s^{-1} \dot{G}_s {\dot{s}}, \end{aligned}$$

which implies that \(\ddot{s}(t) \in (T_s \iota ({{{\mathcal {P}}}}))^\perp =T^*\). Hence, it holds that

$$\begin{aligned} \ddot{s} = D^*_{{\dot{s}}} {\dot{s}}= \iota _*(H^*_{{{\mathcal {P}}}}(\dot{x},\dot{x})), \quad \text{ where } {\dot{s}}=\iota _* \dot{x}, \; x(t) \in \gamma _{{{\mathcal {P}}}}. \end{aligned}$$
(50)

Note that the second equality shows that the central trajectory \(\iota (\gamma _{{{\mathcal {P}}}})=\{s(t) | \; t >0 \}\), mapped into \(\iota ({{{\mathcal {P}}}})\), is a \(D^*\)-autoparallel curve on the submanifold \({\textrm{Hom}}({{{\mathcal {D}}}})\) w.r.t. the affine connection induced from \((\varOmega ^*, D^*)\).

Finally, it is known [30, Appendix 1] that (26) implies

$$\begin{aligned} |(D^2 d\psi )_x(X,Y,Z)| \le 2 \Vert X \Vert _x\Vert Y \Vert _x\Vert Z \Vert _x. \end{aligned}$$
(51)

Differentiating (25) with respect to t at \(t=1\), we obtain

$$\begin{aligned} \langle s,x \rangle =\vartheta , \end{aligned}$$
(52)

and applying \(Z \in T_s \varOmega ^*\) to (52) we have

$$\begin{aligned} \langle Z,x \rangle - g^*_s(s,Z)=0, \end{aligned}$$
(53)

where \(E^*\) is identified with \(T_s \varOmega ^*\).

6.2 Asymptotic iteration-complexity analysis by curvature integral

The number of predictor steps necessary to reach an approximate point of the optimum within a specified accuracy is an important performance measure for general interior-point algorithms. The next theorem states asymptotic results on the number of predictor steps to follow the central trajectory. The theorem elucidates the relation between the iteration-complexity and the second fundamental form \(H^*_{{{\mathcal {P}}}}\) along the central trajectory \(\gamma _{{{\mathcal {P}}}}\).

We assume that \(\beta \rightarrow 0\) implies \(\varDelta t \rightarrow 0\) in the algorithm, which holds in almost all cases. (The case where this assumption fails is characterized by the double autoparallelism [43] of \(\gamma _{{{\mathcal {P}}}}\).) Hereafter, we simply write \(\gamma _{{{\mathcal {P}}}}(t)\) for a point on \(\gamma _{{{\mathcal {P}}}}\) instead of \(x(t) \in \gamma _{{{\mathcal {P}}}}\).

Theorem 4

Let \(0< t_1 < t_2\), \(0 <\beta \le 1/4\) and \(0 < \eta \le 1/2\). Suppose \(s_1 \in {{{\mathcal {N}}}}(\beta ) \cap t_1{{{\mathcal {D}}}}\). Let \(\#(s_1,t_2,\beta )\) be the number of predictor steps in the proposed algorithm that starts from \(s_1\) to find a point \(s_2 \in {{{\mathcal {N}}}}(\beta ) \cap t_2{{{\mathcal {D}}}}\), generating points in \({{{\mathcal {N}}}}(\beta )\). Then we have the following relation:

$$\begin{aligned} \lim _{\beta \rightarrow 0} \frac{\sqrt{\beta }\times \#(s_1,t_2,\beta )}{I_{{{\mathcal {P}}}}(t_1,t_2)}=1, \end{aligned}$$

where

$$\begin{aligned} I_{{{\mathcal {P}}}}(t_1,t_2):= \frac{1}{\sqrt{2}} \int _{t_1}^{t_2} \Vert H_{{{\mathcal {P}}}}^* ({{\dot{\gamma }}}_{{{\mathcal {P}}}}(t), {{\dot{\gamma }}}_{{{\mathcal {P}}}}(t)) \Vert ^{1/2} _{\gamma _{{{\mathcal {P}}}}(t)} dt. \end{aligned}$$

Proof

Let \(({{\bar{s}}}^{(k)}, t^{(k)}), \, (k=1, \ldots , \#(s_1,t_2,\beta ))\) be the sequence of pairs of correctors and parameters generated by the algorithm and define \(\varDelta t_{\max }:=\max _k \{t^{(k+1)}-t^{(k)}\}\). Taking summations of (47) and (48), respectively, evaluated at each \(({{\bar{s}}}^{(k)}, t^{(k)})\) for sufficiently small \(\varDelta t_{\max }\), and bounding them with (49), we have

$$\begin{aligned} \sqrt{(1-\eta ) \beta } \sum _{k=1}^{\#(s_1,t_2,\beta )} (1-O(\sqrt{\beta }))&\le \frac{1}{\sqrt{2}} \int _{t_1}^{t_2} \left\| \ddot{s}(t) \right\| ^{1/2}_{s(t)}dt +M' \sqrt{\varDelta t_{\max }}, \\ \frac{1}{\sqrt{2}} \int _{t_1}^{t_2} \left\| \ddot{s}(t) \right\| ^{1/2}_{s(t)}dt -M' \sqrt{\varDelta t_{\max }}&\le \sqrt{\beta } \sum _{k=1}^{\#(s_1,t_2,\beta )} (1+O(\sqrt{\beta })) \end{aligned}$$

for a positive constant \(M'\) independent of \(\beta \), and sufficiently small \(\beta \) that controls \(\varDelta t_{\max }\). From the assumption it holds that \(M' \sqrt{\varDelta t_{\max }} \rightarrow 0\) when \(\beta \rightarrow 0\). Recalling (50) and the relation (30), we see that the statement follows. \(\square \)

Remark 3

Theorem 4 implies that if we take a tubular neighborhood \({{{\mathcal {N}}}}(\beta )\) with a sufficiently small radius \(\beta \) in the proposed algorithm, the number of iterations necessary to follow the central trajectory \(\gamma _{{{\mathcal {P}}}}\) from \(\gamma _{{{\mathcal {P}}}}(t_1)\) to \(\gamma _{{{\mathcal {P}}}}(t_2)\) through \({{{\mathcal {N}}}}(\beta )\) is approximately \(I_{{{\mathcal {P}}}}(t_1,t_2)/\sqrt{\beta }\).
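The curvature integral of Theorem 4 can be estimated numerically. The sketch below does so for the toy LP used in the earlier sketches, exploiting (50) so that the integrand equals \(2^{-1/2}\Vert \ddot{s}(t)\Vert ^{1/2}_{s(t)}\); the closed-form central path is special to this instance.

```python
import numpy as np

# Estimate I_P(t1, t2) for the toy LP: min x1 s.t. x1 + x2 = 1, x >= 0.
def s_of(t):
    x1 = ((t + 2.0) - np.sqrt(t**2 + 4.0)) / (2.0 * t)  # from t = 1/x1 - 1/(1 - x1)
    return 1.0 / np.array([x1, 1.0 - x1])               # s(t) = iota(x(t))

def integrand(t, h=1e-3):
    sdd = (s_of(t + h) - 2.0 * s_of(t) + s_of(t - h)) / h**2  # finite-diff s-ddot
    gs = np.diag(1.0 / s_of(t)**2)                      # metric g* at s(t)
    return np.sqrt(np.sqrt(sdd @ gs @ sdd)) / np.sqrt(2.0)   # ||s-ddot||_s^{1/2}/sqrt(2)

t1, t2 = 1.0, 100.0
ts = np.linspace(t1, t2, 4000)
vals = np.array([integrand(t) for t in ts])
I = float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ts)))  # trapezoidal rule
print(I, np.sqrt(2.0) * np.log(t2 / t1))  # I stays below the Prop. 4 bound (vartheta = 2)
```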

The proposed algorithm can be regarded as a solver of \(\gamma _{{{\mathcal {P}}}}\) on the manifold \(\iota ({{{\mathcal {P}}}})\) governed by the ODE (39). Since \(I_{{{\mathcal {P}}}}(t_1,t_2)\) is nothing but an integral of the norm of \(H_{{{\mathcal {P}}}}^*\) along \(\gamma _{{{\mathcal {P}}}}\), we expect that \(\#(s_1,t_2,\beta )\) is zero for any \(\beta \) and any objective \(\langle c,x \rangle \) when \(H_{{{\mathcal {P}}}}^*\) vanishes on the whole of \({{{\mathcal {P}}}}\), i.e., when the feasible region \({{{\mathcal {P}}}}\) is \(D'\)-autoparallel. This is equivalent to \({{{\mathcal {P}}}}\) being doubly autoparallel because \({{{\mathcal {P}}}}\) is automatically D-autoparallel.

Corollary 3

If \({{{\mathcal {P}}}}\) is doubly autoparallel in \((\varOmega ,g,D,D')\), then it is possible to take \(\varDelta t_{\max } \rightarrow +\infty \) in the algorithm regardless of \(\beta \) and \(\langle c, x \rangle \), and the minimizer of (23) is obtained with no iterations.

The following proposition gives the upper bound:

$$\begin{aligned} I_{{{\mathcal {P}}}}(t_1,t_2) \le \sqrt{\vartheta } \log (t_2/ t_1), \end{aligned}$$

which reflects the iteration-complexity of the proposed algorithm; it follows immediately by substituting the pointwise estimate of Proposition 4 into the definition of \(I_{{{\mathcal {P}}}}(t_1,t_2)\) and integrating \(\sqrt{\vartheta }/t\).

Proposition 4

It holds that

$$\begin{aligned} \Vert H_{{{\mathcal {P}}}}^* ({{\dot{\gamma }}}_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}}) \Vert _{\gamma _{{{\mathcal {P}}}}(t)}^{1/2} \le \frac{\sqrt{2\vartheta }}{t}. \end{aligned}$$

Proof

We use, only in this proof, the abbreviated notation \(H^*_{{{\mathcal {P}}}}:=H^*_{{{\mathcal {P}}}}({{\dot{\gamma }}}_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}})\). Set \(x=\gamma _{{{\mathcal {P}}}}(t)\) and \(s=\iota (x)\) for an arbitrarily fixed t. Recalling that \((H^*_{{{\mathcal {P}}}}({{\dot{\gamma }}}_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}}))_x = \varPi _x^\perp (D'_{{{\dot{\gamma }}}_{{{\mathcal {P}}}}} {{\dot{\gamma }}}_{{{\mathcal {P}}}})_x\) and \(\varPi _x^\perp \) is a projection w.r.t. \(g_x\), we have

$$\begin{aligned} \Vert H^*_{{{\mathcal {P}}}} \Vert _x^2&= g_x(\varPi _x^\perp (D'_{{{\dot{\gamma }}}_{{{\mathcal {P}}}}} {{\dot{\gamma }}}_{{{\mathcal {P}}}})_x, \varPi _x^\perp (D'_{{{\dot{\gamma }}}_{{{\mathcal {P}}}}} {{\dot{\gamma }}}_{{{\mathcal {P}}}})_x) =g_x(H^*_{{{\mathcal {P}}}}, D'_{{{\dot{\gamma }}}_{{{\mathcal {P}}}}} {{\dot{\gamma }}}_{{{\mathcal {P}}}}) \\&= -g_x(D_{{{\dot{\gamma }}}_{{{\mathcal {P}}}}} H^*_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}}). \end{aligned}$$

The last equality follows from \(g(H^*_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}})=0\) and (29). On the other hand, since \({{{\mathcal {P}}}}\) is D-autoparallel and \( H^*_{{{\mathcal {P}}}}\) is orthogonal to \({{{\mathcal {P}}}}\) at each x, it holds that

$$\begin{aligned} (D^2d\psi )(H^*_{{{\mathcal {P}}}},{{\dot{\gamma }}}_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}})&= (D_{{{\dot{\gamma }}}_{{{\mathcal {P}}}}} g) (H^*_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}}) \\&= {{\dot{\gamma }}}_{{{\mathcal {P}}}} g(H^*_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}}) -g(D_{{{\dot{\gamma }}}_{{{\mathcal {P}}}}} H^*_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}}) -g(H^*_{{{\mathcal {P}}}}, D_{{{\dot{\gamma }}}_{{{\mathcal {P}}}}} {{\dot{\gamma }}}_{{{\mathcal {P}}}}) \\&= -g(D_{{{\dot{\gamma }}}_{{{\mathcal {P}}}}} H^*_{{{\mathcal {P}}}}, {{\dot{\gamma }}}_{{{\mathcal {P}}}}) \end{aligned}$$

via the formula of covariant derivative for (0,2) tensors. Thus, it follows from (51) that \( \Vert H^*_{{{\mathcal {P}}}}\Vert _x \le 2 \Vert {{\dot{\gamma }}}_{{{\mathcal {P}}}} \Vert _x^2 \). Finally, (39), (52) and (53) lead to

$$\begin{aligned} \Vert {{\dot{\gamma }}}_{{{\mathcal {P}}}} \Vert _x^2=\Vert {\dot{s}} \Vert _s^2 = \frac{1}{t^2} \Vert ({\textrm{id}}-\varPi _s^\perp )s \Vert _s^2 \le \frac{1}{t^2} \Vert s \Vert _s^2 = \frac{1}{t^2} g_s^*(s, s) = \frac{1}{t^2} \langle s, x \rangle = \frac{1}{t^2} \vartheta . \end{aligned}$$

This completes the proof. \(\square \)

Remark 4

Analogous results to Theorem 4 and Proposition 4 hold for the dual central trajectory \(\gamma _{{{\mathcal {D}}}}(t)=s(t)=\iota ^{-1}({{{\mathcal {D}}}}) \cap t{{{\mathcal {P}}}}\) defined by the minimizer of

$$\begin{aligned} {\textrm{minimize}} \; t \langle s,d \rangle + \psi ^*(s), \text{ s.t. } s \in \overline{{{\mathcal {D}}}}, \end{aligned}$$
(54)

which plays a similar role in solving the dual problem (24). In those statements, \(I_{{{\mathcal {P}}}}(t_1,t_2)\) is replaced by

$$\begin{aligned} I_{{{\mathcal {D}}}}(t_1,t_2):= \frac{1}{\sqrt{2}} \int _{t_1}^{t_2} \Vert H_{{{\mathcal {D}}}} ({{\dot{\gamma }}}_{{{\mathcal {D}}}}(t), {{\dot{\gamma }}}_{{{\mathcal {D}}}}(t)) \Vert ^{1/2} _{\gamma _{{{\mathcal {D}}}}(t)} dt, \end{aligned}$$

where \(H_{{{\mathcal {D}}}}\) is the second fundamental form of \(\iota ^{-1}({{{\mathcal {D}}}})\) w.r.t. the canonical flat affine connection D on E.

7 Application to iteration-complexity analysis of the primal-dual methods for linear programming

In practice the mainstream of interior-point methods is to solve the conic programs (23) and (24) simultaneously by following the so-called primal-dual central trajectory (see below). Such methods are called primal-dual interior-point methods [10, 20], while the algorithm proposed in the previous section is classified as a primal method because it follows only the primal central trajectory \(\gamma _{{{\mathcal {P}}}}\). One advantage of the primal-dual methods is that they need not compute the Newton decrement, which is relatively expensive, several times during the line search of a predictor to measure the deviation from the central trajectory. However, the geometrically obtained basic result in Theorem 4 is still useful for investigating the iteration-complexity of primal-dual methods.

In this section, by applying the result to the primal-dual interior-point methods for linear programs (LP's), we derive in detail several interesting properties, including a relation reminiscent of the Pythagorean theorem (cf. Theorem 5). Note that the relation for LP's demonstrated here can be generalized [17, 18] to the class of symmetric cone programs, including semidefinite programs (SDP's), in terms of Jordan algebras.

Taking the positive orthant \(\textbf{R}^n_+\) for \(\varOmega \) and \(\varOmega ^*\), i.e., \(\varOmega =\varOmega ^*=\textbf{R}^n_{+}\), with an n-normal barrier \(\psi (x)=-\sum _{i=1}^n \log x^i\) in (23), (24), (32) and (33), consider LP’s:

$$\begin{aligned} \mathop {{\textrm{minimize }}}\limits _x \;&c^T x, \; \text{ s.t. } Ax=b, \; x \in \overline{\textbf{R}_{+}^n}, \end{aligned}$$
(55)
$$\begin{aligned} \mathop {{\textrm{maximize }}}\limits _{(s,y)} \;&b^T y, \; \text{ s.t. } c- A^T y=s, \; s \in \overline{\textbf{R}_{+}^n}, \; y \in \textbf{R}^m, \end{aligned}$$
(56)

where \(A \in \textbf{R}^{m \times n}, c \in \textbf{R}^n\) and \(b \in \textbf{R}^m\). We assume that the rows of A are linearly independent.

Let \((x_{PD}(\nu ), s_{PD}(\nu ), y_{PD}(\nu )) \in {{{\mathcal {P}}}} \times {{{\mathcal {D}}}} \times \textbf{R}^m\) be the primal-dual central trajectory defined by the unique solution to the following system of equations for a given parameter \(\nu >0\):

$$\begin{aligned} x \circ s = \nu e, \; Ax=b, \; c-A^Ty=s, \; x \in \textbf{R}_{+}^n, \; s \in \textbf{R}_{+}^n, \; y \in \textbf{R}^m. \end{aligned}$$
(57)

Here, \(\circ \) denotes the entrywise product of two vectors, called the Hadamard product, and e is its unit element. The parameter \(\nu \) is called the normalized duality gap, which represents the precision of optimality; when \(\nu \rightarrow 0\), the solution approaches an optimal one. The primal-dual algorithms trace \((x_{PD}(\nu ), s_{PD}(\nu ))\) in \({{{\mathcal {P}}}} \times {{{\mathcal {D}}}}\) [20]. By the optimality conditions of \(\gamma _{{{\mathcal {P}}}}\) and \(\gamma _{{{\mathcal {D}}}}\) respectively defined for (31) and (54), we see that, for \(\nu =1/t\),

$$\begin{aligned} x_{PD}(\nu )=\gamma _{{{\mathcal {P}}}}(t), \qquad s_{PD}(\nu )=\gamma _{{{\mathcal {D}}}}(t). \end{aligned}$$
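This identification can be checked directly on the toy LP used in the earlier sketches; the closed-form expressions below are special to that instance and serve only as an illustration.

```python
import numpy as np

# For min x1 s.t. x1 + x2 = 1, x >= 0: gamma_P(t) from (31) solves the
# primal-dual system (57) with nu = 1/t, i.e. x_PD(nu) = gamma_P(t).
t = 10.0
x1 = ((t + 2.0) - np.sqrt(t**2 + 4.0)) / (2.0 * t)  # gamma_P(t), closed form
x = np.array([x1, 1.0 - x1])
y_bar = t * 1.0 - 1.0 / x1        # multiplier in (35): t c - 1/x = A^T y_bar
y = y_bar / t                     # LP multiplier of (57): s = c - A^T y
s = np.array([1.0, 0.0]) - y      # here A^T y = (y, y)
nu = 1.0 / t
assert np.allclose(x * s, nu * np.ones(2))   # x o s = nu e, cf. (57)
```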

In the analysis of predictor-corrector primal-dual algorithms [26, 39, 47] it has been known that the following quantity for \(\nu _f \le \nu _i\), also called a curvature integral, plays an important role:

$$\begin{aligned} I_{PD}(\nu _f,\nu _i):= \int _{\nu _f}^{\nu _i} \frac{1}{\sqrt{\nu }} \left\| \frac{d x_{PD}}{d \nu } \circ \frac{d s_{PD}}{d \nu } \right\| _2^{1/2} d\nu , \end{aligned}$$

where \(\Vert \cdot \Vert _2\) is the Euclidean norm. In particular, the following interesting results have been proved in [26]:

Proposition 5

(i) Similarly to Theorem 4, \(I_{PD}(\nu _f,\nu _i)/\sqrt{\beta }\) approaches the number of iterations to follow the primal-dual central trajectory from \(\nu _i\) to \(\nu _f\) when \(\beta \rightarrow 0\).

(ii) \(I_{PD}(0,\infty )\) is bounded and is a quantity of \(O(n^{3.5} {\textrm{size}}(A))\). Here \({\textrm{size}}(A)\) is the input bit size of the matrix A and \({\textrm{size}}(A) \le mn\) if \(A \in \{0,1 \}^{m \times n}\).

The fact (ii) is related to the existence of a strongly polynomial-time algorithm to solve an LP whose coefficient matrix A is binary.

Actually the quantity \(I_{PD}(\nu _f,\nu _i)/\sqrt{\beta }\) approximates the number of iterations well, similarly to Theorem 4. An example is shown in Fig. 2, where a famous real-world problem, DFL001 from NETLIB, a standard benchmark library for linear programming, is solved by the primal-dual predictor-corrector method with various \(\beta \), starting from a common initial point. The size of this instance is \(n=12230\), \(m=6072\). The left figure plots the number of iterations versus \(\log _{10}(1/t)\). It is seen that the algorithm takes more iterations for smaller \(\beta \), but all the curves look similar. In the right figure, where the number of iterations is multiplied by \(\sqrt{\beta }\), all the iterative processes seem to lie on a single curve. This is because the number of iterations multiplied by \(\sqrt{\beta }\) approximates the integral itself, which does not depend on \(\beta \). This fact is observed for various other instances from NETLIB [18].

Fig. 2 The number of iterations versus \(\log _{10}\hbox {(normalized duality gap)}\) (left) and (the number of iterations) \(\times \sqrt{\beta }\) versus \(\log _{10}\hbox {(normalized duality gap)}\) (right); the normalized duality gap equals \(\nu =1/t\)

Define \(t:=1/\nu \); then, using (57), we can rewrite \(I_{PD}\) [35, Prop. 6.2] as

$$\begin{aligned} I_{PD}(\nu _f,\nu _i)= \int _{t_i}^{t_f} h_{PD}(t)^{1/2} dt, \end{aligned}$$

where \(h_{PD}(t)\) is given by

$$\begin{aligned} h_{PD}(t):=\frac{1}{t^2} \left\| ((I_n-Q(t))e) \circ (Q(t)e) \right\| _2. \end{aligned}$$

Here, \(I_n\) denotes the n by n identity matrix, and

$$\begin{aligned} Q(t):=G_s^{1/2}A^T(AG_sA^T)^{-1}AG_s^{1/2} =G_s^{-1/2}\varPi _x^{\perp } G_s^{1/2}, \end{aligned}$$

for the Hessian matrix \(G_s\) of the n-normal barrier \(\psi ^*(s)=-\sum _{i=1}^n \log s_i\).
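The sketch below evaluates Q(t) and \(h_{PD}(t)\) for the toy LP and checks them by finite differences against the defining integrand of \(I_{PD}\); the identity \(h_{PD}(t)=t^{-3}\Vert dx_{PD}/d\nu \circ ds_{PD}/d\nu \Vert _2\) used in the assertion is our rewriting of the definition via \(\nu =1/t\), not a formula stated above.

```python
import numpy as np

# Q(t) and h_PD(t) for the toy LP: min x1 s.t. x1 + x2 = 1, x >= 0.
A = np.array([[1.0, 1.0]])

def x_of(t):
    x1 = ((t + 2.0) - np.sqrt(t**2 + 4.0)) / (2.0 * t)
    return np.array([x1, 1.0 - x1])

def s_of(t):                       # dual slack on the central path: x o s = e/t
    return 1.0 / (t * x_of(t))

def h_PD(t):
    # G_s^{1/2} = diag(1/s) = t diag(x); Q is invariant under this scaling
    R = np.diag(x_of(t)) @ A.T
    Q = R @ np.linalg.solve(R.T @ R, R.T)
    e = np.ones(2)
    return np.linalg.norm(((np.eye(2) - Q) @ e) * (Q @ e)) / t**2

t = 10.0; nu = 1.0 / t; h = 1e-6
xd = (x_of(1.0 / (nu + h)) - x_of(1.0 / (nu - h))) / (2.0 * h)   # dx_PD/dnu
sd = (s_of(1.0 / (nu + h)) - s_of(1.0 / (nu - h))) / (2.0 * h)   # ds_PD/dnu
assert np.isclose(h_PD(t), np.linalg.norm(xd * sd) / t**3, rtol=1e-4)
```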

Under the above setup, the following result [35, Thm 6.4] shows that \(I_{PD}\) rigorously deserves the terminology “curvature integral” from a differential-geometric viewpoint.

Theorem 5

It holds that

$$\begin{aligned} h_{PD}(t)^2=\left( \frac{1}{2} \Vert H_{{{\mathcal {P}}}}^* ({{\dot{\gamma }}}_{{{\mathcal {P}}}}(t), {{\dot{\gamma }}}_{{{\mathcal {P}}}}(t)) \Vert _{\gamma _{{{\mathcal {P}}}}(t)} \right) ^2 + \left( \frac{1}{2} \Vert H_{{{\mathcal {D}}}} ({{\dot{\gamma }}}_{{{\mathcal {D}}}}(t), {{\dot{\gamma }}}_{{{\mathcal {D}}}}(t)) \Vert _{\gamma _{{{\mathcal {D}}}}(t)} \right) ^2. \end{aligned}$$

In the next theorem, statement (i) is a direct consequence of Theorem 5, and statement (ii) is proved in [35, Thm 6.3] using (ii) of Proposition 5.

Theorem 6

The following hold:

$$\begin{aligned}&{\mathrm{(i)}}\ \max \{I_{{{\mathcal {P}}}}(t_1,t_2),I_{{{\mathcal {D}}}}(t_1,t_2) \} \le I_{PD}(t_1,t_2) \le I_{{{\mathcal {P}}}}(t_1,t_2)+I_{{{\mathcal {D}}}}(t_1,t_2), \\&{\mathrm{(ii)}}\ \max \{I_{{{\mathcal {P}}}}(0,\infty ), I_{{{\mathcal {D}}}}(0,\infty )\} = O(n^{3.5} {\textrm{size}}(A)). \end{aligned}$$

Remark 5

Note that (ii) is stronger than what Proposition 4 suggests, since the bound derived from Proposition 4 diverges as \(t_2 \rightarrow \infty \), whereas (ii) is uniform over the whole trajectory.

One important consequence of Theorems 5 and 6 is that the purely primal, or the purely dual, path-following algorithm performs better than the primal-dual path-following algorithm when we focus only on the curvature integral, i.e., on the iteration-complexity. Another conclusion of Theorems 4 and 5 is that the iteration-complexities of the primal and dual path-following methods are in a trade-off relation.

8 Conclusions

We have given an algebraic characterization of doubly autoparallel submanifolds in semi-simple Jordan algebras and associative algebras equipped with dually flat structure. It is proved that the structure of such submanifolds is closely related to that of subalgebras of the ambient algebras. As a special case, we have characterized the doubly autoparallel structure on symmetric cones, where many applications to statistical science and mathematical programming can be expected.

We have observed that two applications in convex optimization each admit closed-form solutions when a doubly autoparallel structure is present. Inspired by this fact, we have introduced curvature integrals and proved that they asymptotically coincide with the number of iterations needed to solve conic convex optimization problems in the case where the feasible regions are not doubly autoparallel. In addition, we have shown that \(H^*_{{{\mathcal {P}}}}\) and \(H_{{{\mathcal {D}}}}\) relate to the iteration-complexity of the primal-dual path-following algorithms in an elegant form. It is of interest that the concept of iteration-complexity in the field of optimization algorithms is unexpectedly connected to such global geometric properties of the central trajectories \(\gamma _{{{\mathcal {P}}}}(t)\) and \(\gamma _{{{\mathcal {D}}}}(t)\) or of the submanifolds \({{{\mathcal {P}}}}\) and \({{{\mathcal {D}}}}\).

Research problems on double autoparallelism for general statistical manifolds, such as characterizations or attractive applications, remain largely unsolved, and the authors believe that attacking them would contribute to the theory of information geometry and to statistical and information science.