1 Introduction

1.1 Overview, Motivation

Learning the parameters of large neural networks from training data constitutes a basic problem in imaging science, machine learning and other fields. The prevailing approach utilizes gradient descent or approximations thereof based on automatic differentiation [5] and corresponding software tools, like PyTorch [19] and TensorFlow [1]. This kind of software support has dramatically spurred research in imaging science and machine learning. However, merely relying on numerical schemes and their automatic differentiation tends to thwart attempts to shed light on the often-criticized black-box behavior of deep networks and to better understand the internal representation and function of parameters and their adaptive dynamics.

In this paper, we explore a different route. Adopting the linearized assignment flow approach introduced by [27], we focus on a corresponding large system of linear ODEs of the form

$$\begin{aligned} \dot{V} = A(\Omega ) V + B, \end{aligned}$$
(1.1)

and study a geometric approach to learning the regularization parameters \(\Omega \) by Riemannian gradient descent of a loss function

$$\begin{aligned} \Omega \mapsto \mathcal {L}(V(T;\Omega )) \end{aligned}$$
(1.2)

constrained by the dynamical system (1.1). Here, we exploit the crucial property that the solution to (1.1) can be specified in closed form (2.24) and can be computed efficiently using exponential integration ([27] and Sect. 2.4). Matrix \(V \in {\mathbb {R}}^{|I|\times c}\) represents a tangent vector of the so-called assignment manifold, |I| is the number of nodes \(i\in I\) of the underlying graph, and c is the number of labels (classes) that have to be assigned to data observed at nodes \(i\in I\). Specifically,

  • we derive a formula—see Theorem 3.8—for the Euclidean parameter gradient \(\partial _{\Omega }\mathcal {L}(V(T;\Omega ))\) in closed form;

  • we show that a low-rank representation of this gradient can be used to approximate the closed-form gradient efficiently and accurately; neither backpropagation, nor automatic differentiation, nor the solution of adjoint equations is required;

  • we highlight that the resulting parameter estimation algorithm, in terms of a Riemannian gradient descent iteration (3.7) on the parameter manifold, can be implemented without any specialized software support and with modest computational resources.

The significance of our work reported in this paper arises in a broader context. The linearized assignment flow approach also comprises the equation

$$\begin{aligned} W(T) = {{\,\textrm{Exp}\,}}_{{\mathbb {1}}_{\mathcal {W}}}(V(T)) \end{aligned}$$
(1.3)

that yields the labeling in terms of almost integral assignment vectors \(W_{i}\in {\mathbb {R}}_{+}^{c},\; i\in I\) that form the rows of the matrix W, depending on the solution V(t) of (1.1) for a sufficiently large time \(t=T\). Both Eqs. (1.3) and (1.1) together constitute a linearization of the full nonlinear assignment flow [3]

$$\begin{aligned} \dot{W} = R_{W}S(W) \end{aligned}$$
(1.4)

at the barycenter \({\mathbb {1}}_{\mathcal {W}}\) of the assignment manifold. Choosing an arbitrary sequence of time intervals (step sizes) \(h_{1}, h_{2}, \dotsc \) and setting

$$\begin{aligned} W^{(0)}={\mathbb {1}}_{\mathcal {W}},\qquad W^{(k)}=W(h_{k}),\qquad k\in {\mathbb {N}}, \end{aligned}$$
(1.5)

a sequence of linearized assignment flows

$$\begin{aligned} W^{(k+1)}&= {{\,\textrm{Exp}\,}}_{{\mathbb {1}}_{\mathcal {W}}}(V^{(k)}), \end{aligned}$$
(1.6a)
$$\begin{aligned} V^{(k+1)}&= V^{(k)}+V\big (h_{k};\Omega ^{(k)},W^{(k)}\big ),\quad k=0,1,2,\dots \end{aligned}$$
(1.6b)

can be computed in order to approximate (1.4) more closely, where \(V\big (h_{k};\Omega ,W^{(k)}\big )\) solves the corresponding updated ODE (1.1) of the form

$$\begin{aligned} \dot{V}&= A(\Omega ^{(k)}; W^{(k)}) V + \Pi _{0}S(W^{(k)}). \end{aligned}$$
(1.6c)

The time-discrete equations (1.6) reveal two basic ingredients of deep networks (or neural ODEs) which the full assignment flow (1.4) embodies in a continuous-time manner: coupling a pointwise nonlinearity (1.6a) and diffusion (1.6b), (1.6c) enhances the expressivity of network models for data analysis.

The key point motivating the work reported in this paper is that our results apply to learning the parameters \(\Omega ^{(k)}\) in each step of the iterative scheme (1.6). We expect that the gradient, and its low-dimensional subspace representations, will aid the further study of how each ingredient of (1.6) affects the predictive power of assignment flows. Furthermore, ‘deep’ extensions of (1.4) and (1.6) are equally feasible within the same mathematical framework (cf. Sect. 5.2).

1.2 Related Work

Assignment flows were introduced by [3]. For a survey of prior and recent related work, we refer to [23]. Linearized assignment flows were introduced by [27] as part of a comprehensive study of numerical schemes for the geometric integration of the assignment flow equation (1.4).

While the bulk of these schemes is based on a Lie group action (cf. [14]) on the assignment manifold, which makes it possible to apply established theory and algorithms for the numerical integration of ODEs that evolve in a Euclidean space [11], the linearity of the ODE (1.1) specifically allows its solution to be represented in closed form by the Duhamel (or variation-of-constants) formula [24]. Corresponding extensions to nonlinear ODEs rely on exponential integration [12, 13]. Iteration (1.6) combines such an iterative scheme with the tangent-space-based parametrization (1.3) of the linearized assignment flow.

A key computational step of the latter class of methods is the evaluation of an analytical matrix-valued function, like the matrix exponential and similar functions [8, Section 10]. While basic methods [17] only work for problems of small and medium size, dedicated methods using Krylov subspaces [2, 10] and established numerical linear algebra [20, 21] can be applied to larger problems. The algorithm that results from our approach employs such methods.

Machine learning requires computing gradients of loss functions that take solutions of ODEs as arguments. This constitutes an enormous computational task and explains why automatic differentiation and corresponding software tools are applied almost exclusively. Dedicated recent alternatives like [16] focus on a special problem structure, viz. the action of the differential of the matrix exponential on a rank-one matrix. Our closed-form formula for the parameter gradient also involves the differential of a matrix exponential. Yet, we wish to evaluate the gradient itself rather than its action on another matrix. The special problem structure that we can exploit is the Kronecker sum of matrices. Accordingly, our approach is based on the recent corresponding work [6] and an additional subsequent low-rank approximation.

1.3 Contribution, Organization

We derive a closed-form expression of the gradient of any \(C^{1}\) loss function of the form (1.2) that depends on the solution V(t) of the linear system of ODEs (1.1) at some arbitrary but fixed time \(t=T\). In addition, we develop a numerical method that enables efficient evaluation of this gradient for the large problem sizes common in image labeling. We apply the method to optimal parameter estimation by Riemannian gradient descent and validate our approach by a series of proof-of-concept experiments. This includes a comparison with automatic differentiation applied to two numerical schemes for integrating the linearized assignment flow: geometric explicit Euler and exponential integration. It turns out that our method is as accurate and efficient as highly optimized automatic differentiation software, like PyTorch [19] and TensorFlow [1]. We point out that, to our knowledge, automatic differentiation has not been applied to exponential integration so far.

This paper extends the conference paper [26] in that all parameter dependencies of the loss function, constrained by the linearized assignment flow, are taken into account (cf. diagram (3.15)). In addition, a complete proof of the corresponding main result (Theorem 3.8) is provided. The space complexity of the various gradient approximations is specified in a series of Remarks. The approach is validated numerically and more comprehensively by comparing to automatic differentiation and by examining the influence of all parameters.

The plan for this paper is as follows. Section 2 summarizes the assignment flow approach, the linearized assignment flow and exponential integration for integrating the latter flow. Section 3 details the derivation of the exact gradient of any loss function of the flow with respect to the weight parameters that regularize the flow. Furthermore, a low-rank approximation of the gradient is developed for evaluating the gradient efficiently. We also sketch how automatic differentiation is applied to two numerical schemes in order to solve the parameter estimation problem in alternative ways. Numerical experiments are reported in Sect. 4 for comparing the methods and for inspecting quantitatively the gradient approximation and properties of the estimated weight patches that parametrize the linearized assignment flow. We conclude in Sect. 5 and point out further directions of research.

2 Preliminaries

2.1 Basic Notation

We set \([n]=\{1,2,\dotsc ,n\}\) for \(n\in {\mathbb {N}}\). The cardinality of a finite set S is denoted by |S|, e.g., \(|[n]|=n\). \({\mathbb {R}}^{n}_{+}\) denotes the nonnegative orthant and \({\mathbb {R}}_{>}^{n}\) its interior. \({\mathbb {1}}=(1,1,\dotsc ,1)^{\top }\) has dimension depending on the context that we specify sometimes by a subscript, e.g., \({\mathbb {1}}_{n}\in {\mathbb {R}}^{n}\). Similarly, we set \(0_{n}=(0,0,\dotsc ,0)^{\top }\in {\mathbb {R}}^{n}\). \(\{e_{i}:i\in [n]\}\) is the canonical basis of \({\mathbb {R}}^{n}\) and \(I_{n}=(e_{1},\dotsc ,e_{n})\in {\mathbb {R}}^{n\times n}\) the identity matrix.

The support of a vector \(x\in {\mathbb {R}}^{n}\) is denoted by \({{\,\textrm{supp}\,}}(x) = \{i\in [n]:x_{i}\ne 0\}\). \(\Delta _{n}=\{p\in {\mathbb {R}}_{+}^{n}:\langle {\mathbb {1}}_{n},p\rangle =1\}\) is the probability simplex whose points represent discrete distributions on [n]. Distributions with full support [n] form the relative interior \(\mathring{\Delta }_{n}=\Delta _{n}\cap {\mathbb {R}}_{>}^{n}\). \(\langle \cdot ,\cdot \rangle \) is the Euclidean inner product of vectors and matrices. In the latter case, this reads \(\langle A, B\rangle = \textrm{tr}(A^{\top } B)\) with the trace \(\textrm{tr}(A)=\sum _{i}A_{ii}\). The induced Frobenius norm is denoted by \(\Vert A\Vert =\sqrt{\langle A,A\rangle }\), and other matrix norms like the spectral norm \(\Vert A\Vert _{2}\) are indicated by subscripts. The mapping \({{\,\textrm{Diag}\,}}:{\mathbb {R}}^{n}\rightarrow {\mathbb {R}}^{n\times n}\) sends a vector x to the diagonal matrix \({{\,\textrm{Diag}\,}}(x)\) with entries x. \(A\otimes B\) denotes the Kronecker product of matrices A and B [7, 25] and \(\oplus \) the Kronecker sum

$$\begin{aligned}&A \oplus B = A \otimes I_{n} + I_{m} \otimes B \in {\mathbb {R}}^{m n\times m n},\nonumber \\&\qquad A \in {\mathbb {R}}^{m\times m},\quad B\in {\mathbb {R}}^{n\times n}. \end{aligned}$$
(2.1)

We have

$$\begin{aligned} (A\otimes B)(C\otimes D) = (A C)\otimes (B D) \end{aligned}$$
(2.2)

for matrices of compatible dimensions. The operator \({{\,\textrm{vec}\,}}_{r}\) turns a matrix into a vector by stacking its row vectors. It satisfies

$$\begin{aligned} {{\,\textrm{vec}\,}}_{r}(A B C) = (A \otimes C^{\top }){{\,\textrm{vec}\,}}_{r}(B). \end{aligned}$$
(2.3)

The Kronecker product \(v \otimes w \in {\mathbb {R}}^{mn}\) of two vectors \(v \in {\mathbb {R}}^m\) and \(w \in {\mathbb {R}}^n\) is defined by viewing the vectors as matrices with only one column and applying the definition of Kronecker products for matrices. We have

$$\begin{aligned} v \otimes w = {{\,\textrm{vec}\,}}_{r}(v w^\top ). \end{aligned}$$
(2.4)
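The identities (2.2)–(2.4) are straightforward to verify numerically. The following sketch (ours; the matrix sizes are arbitrary illustration choices) checks the row-stacking rule (2.3) and the vector identity (2.4) with NumPy:

import numpy as np

rng = np.random.default_rng(0)

def vec_r(X):
    # row-stacking operator: stack the rows of X into one long vector
    return X.reshape(-1)

m, n, p, q = 3, 4, 5, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, q))

# (2.3): vec_r(A B C) = (A kron C^T) vec_r(B)
assert np.allclose(vec_r(A @ B @ C), np.kron(A, C.T) @ vec_r(B))

# (2.4): v kron w = vec_r(v w^T)
v, w = rng.standard_normal(m), rng.standard_normal(q)
assert np.allclose(np.kron(v, w), vec_r(np.outer(v, w)))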

The matrix exponential of a square matrix A is given by [8, Ch. 10]

$$\begin{aligned} {{\,\textrm{expm}\,}}(A) = \sum _{k\ge 0} \frac{A^{k}}{k!}. \end{aligned}$$
(2.5)

\(L(\mathcal {E}_{1},\mathcal {E}_{2})\) denotes the space of all linear bounded mappings from \(\mathcal {E}_{1}\) to \(\mathcal {E}_{2}\).

2.2 Assignment Flow

Let \(G=(I,E)\) be a given undirected graph with vertices \(i \in I\) indexing data

$$\begin{aligned} \mathcal {F}_{I} = \{f_{i} :i \in I\} \subset \mathcal {F} \end{aligned}$$
(2.6)

given in a metric space \((\mathcal {F},d)\). In this paper, we focus primarily on the application of image labeling in which the graph G is a grid graph equipped with a \(3 \times 3\) or larger neighborhood \(\mathcal {N}_{i} = \{k \in I :ik=ki \in E\} \cup \{i\}\) at each pixel \( i\in I\). The linearized assignment flow and the learning approach in this paper can, however, also be applied to the case of data labeling on arbitrary graphs.

Along with \(\mathcal {F}_{I}\), prototypical data (labels) \(\mathcal {L}_{J} = \{l_{j} \in \mathcal {F} :j \in J\}\) are given that represent classes \(j = 1,\dotsc ,|J|\). Supervised image labeling denotes the task of assigning precisely one prototype \(l_{j}\) to each datum \(f_{i}\) at every vertex i in a coherent way, depending on the label assignments in the neighborhoods \(\mathcal {N}_{i}\). These assignments at i are represented by probability vectors

$$\begin{aligned} W_{i} \in \mathring{\Delta }_{|J|},\quad i \in I. \end{aligned}$$
(2.7)

The set \(\mathring{\Delta }_{|J|}\) becomes a Riemannian manifold denoted by \(\mathcal {S}:= (\mathring{\Delta }_{|J|},g_{{\scriptscriptstyle FR}})\) when endowed with the Fisher–Rao metric \(g_{{\scriptscriptstyle FR}}\). Collecting all assignment vectors as rows defines the strictly positive row-stochastic assignment matrix

$$\begin{aligned} W= & {} {(W_{1},\dotsc ,W_{|I|})}^{\top } \in \mathcal {W} = \mathcal {S} \times \dots \times \mathcal {S} \subset {\mathbb {R}}^{|I| \times |J|},\nonumber \\ \end{aligned}$$
(2.8)

that we regard as point on the product assignment manifold \(\mathcal {W}\). Image labeling is accomplished by geometrically integrating the assignment flow W(t) solving

$$\begin{aligned} \dot{W}= & {} R_{W}\big (S(W)\big ),\nonumber \\ \qquad W(0)= & {} {\mathbb {1}}_{\mathcal {W}} := \frac{1}{|J|} {\mathbb {1}}_{|I|} {\mathbb {1}}_{|J|}^{\top }\qquad (\text {barycenter}), \end{aligned}$$
(2.9)

where \(R_{W}\) and S(W) are defined in (2.11b) resp. (2.17). The assignment flow provably converges toward a binary matrix [28], i.e., \(\lim _{t\rightarrow \infty }W_{i}(t)=e_{j(i)}\), for every \(i\in I\) and some \(j(i)\in J\), which yields the label assignment \(f_{i} \mapsto l_{j(i)}\). In practice, geometric integration is terminated when W(t) is \(\varepsilon \)-close to an integral point using the entropy criterion from [3], followed by trivial rounding, due to the existence of basins of attraction around each integral point [28].

We specify the right-hand side of the differential equation in (2.9)—see (2.14) and (2.17) below—and refer to [3, 23] for more details and the background. With the tangent space

$$\begin{aligned} T_{0}=T_{p}\mathcal {S} = \{v\in {\mathbb {R}}^{|J|}:\langle {\mathbb {1}},v\rangle =0\},\qquad \forall p\in \mathcal {S}, \end{aligned}$$
(2.10)

that does not depend on the base point \(p \in \mathcal {S}\), we define

$$\begin{aligned}&\Pi _{0} :{\mathbb {R}}^{|J|} \rightarrow T_{0}, \quad z \mapsto \Pi _{0} z = \Big (I_{|J|}-\frac{1}{|J|}{\mathbb {1}}_{|J|}{\mathbb {1}}_{|J|}^{\top }\Big ) z, \end{aligned}$$
(2.11a)
$$\begin{aligned}&R_{p} :{\mathbb {R}}^{|J|} \rightarrow T_{0}, \quad z \mapsto R_{p}(z)=\big ({{\,\textrm{Diag}\,}}(p)-p p^{\top }\big ) z, \end{aligned}$$
(2.11b)
$$\begin{aligned}&{{\,\textrm{Exp}\,}}:\mathcal {S} \times T_{0} \rightarrow \mathcal {S}, \quad (p,v) \mapsto {{\,\textrm{Exp}\,}}_{p}(v) = \frac{e^{\frac{v}{p}}}{\langle p, e^{\frac{v}{p}} \rangle } p, \end{aligned}$$
(2.11c)
$$\begin{aligned}&{{\,\textrm{Exp}\,}}^{-1} :\mathcal {S} \times \mathcal {S} \rightarrow T_{0}, \quad (p,q) \mapsto {{\,\textrm{Exp}\,}}_{p}^{-1}(q) = R_{p}\log \frac{q}{p}, \end{aligned}$$
(2.11d)
$$\begin{aligned}&\exp :\mathcal {S} \times {\mathbb {R}}^{|J|} \rightarrow \mathcal {S},(p,z) \mapsto \exp _{p}(z) = {{\,\textrm{Exp}\,}}_{p}\circ R_{p}(z) = \frac{p e^{z}}{\langle p,e^{z}\rangle }, \end{aligned}$$
(2.11e)

where multiplication, division, exponentiation \(e^{(\cdot )}\) and \(\log (\cdot )\) apply component-wise to vectors. Corresponding maps
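For concreteness, the following sketch (ours; the function names are ad hoc) implements the maps (2.11) for a single point \(p\in \mathcal {S}\) and checks that \(\exp _{p}={{\,\textrm{Exp}\,}}_{p}\circ R_{p}\) and that (2.11c) and (2.11d) are inverse to each other:

import numpy as np

def Pi0(z):
    # (2.11a): orthogonal projection onto T_0 = {v : <1, v> = 0}
    return z - z.mean()

def R_p(p, z):
    # (2.11b): R_p(z) = (Diag(p) - p p^T) z
    return p * z - p * (p @ z)

def Exp_p(p, v):
    # (2.11c)
    q = p * np.exp(v / p)
    return q / q.sum()

def Exp_p_inv(p, q):
    # (2.11d): Exp_p^{-1}(q) = R_p log(q / p)
    return R_p(p, np.log(q / p))

def exp_p(p, z):
    # (2.11e): exp_p = Exp_p o R_p = p e^z / <p, e^z>
    q = p * np.exp(z)
    return q / q.sum()

rng = np.random.default_rng(1)
c = 5
p = rng.random(c) + 0.1; p /= p.sum()          # point in the open simplex
q = rng.random(c) + 0.1; q /= q.sum()
z = rng.standard_normal(c)

assert np.allclose(exp_p(p, z), Exp_p(p, R_p(p, z)))   # (2.11e)
assert np.allclose(Exp_p(p, Exp_p_inv(p, q)), q)       # inverse relation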

$$\begin{aligned} R_{W}, \qquad {{\,\textrm{Exp}\,}}_{W}, \qquad \exp _{W} \end{aligned}$$
(2.12)

in connection with the product manifold (2.8) are defined analogously, and likewise the tangent space

$$\begin{aligned} \mathcal {T}_{0}=T_{0} \times \dots \times T_{0} = T_{W}\mathcal {W},\qquad \forall W\in \mathcal {W} \end{aligned}$$
(2.13)

and the extension of the orthogonal projection (2.11a) onto \(\mathcal {T}_{0}\), again denoted by \(\Pi _{0}\). For example, regarding (2.9), with \(W \in \mathcal {W}\) and \(S(W)\in \mathcal {W}\) (or more generally \(S \in {\mathbb {R}}^{|I|\times |J|}\)), we have

$$\begin{aligned} R_{W}S(W)&= \big (R_{W_{1}} S_{1}(W),\dotsc , R_{W_{|I|}} S_{|I|}(W)\big )^{\top } \nonumber \\&= {{\,\textrm{vec}\,}}_{r}^{-1}\big ({{\,\textrm{Diag}\,}}(R_{W}){{\,\textrm{vec}\,}}_{r}\big (S(W)\big )\big ) \end{aligned}$$
(2.14a)

with

$$\begin{aligned} {{\,\textrm{Diag}\,}}(R_{W})&:= \begin{pmatrix} R_{W_{1}} &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad R_{W_{2}} &{}\quad &{}\quad \vdots \\ \vdots &{}\quad &{}\quad \ddots &{}\quad 0 \\ 0 &{}\quad \cdots &{}\quad &{}\quad R_{W_{|I|}} \end{pmatrix}. \end{aligned}$$
(2.14b)

Given data \(\mathcal {F}_{I}\) are taken into account as distance vectors

$$\begin{aligned} D_{i}=\big (d(f_{i},l_{1}),\dotsc ,d(f_{i},l_{|J|})\big )^{\top },\quad i\in I \end{aligned}$$
(2.15)

and mapped to \(\mathcal {W}\) by

$$\begin{aligned} L(W)= & {} \exp _{W}(-\tfrac{1}{\rho }D) \in \mathcal {W}, \nonumber \\ L_{i}(W_{i})= & {} \exp _{W_{i}}(-\tfrac{1}{\rho }D_{i})\nonumber \\= & {} \frac{W_{i}e^{-\frac{1}{\rho } D_{i}}}{\langle W_{i}, e^{-\frac{1}{\rho } D_{i}}\rangle }, \end{aligned}$$
(2.16)

where \(\rho > 0\) is a user parameter for normalizing the scale of the data. These likelihood vectors represent data terms in conventional variational approaches: Each individual flow \(\dot{W}_{i} = R_{W_{i}} L_{i}(W_{i})\), \(W_{i}(0)={\mathbb {1}}_{\mathcal {S}}\) converges to \(e_{j(i)}\) with \(j(i)=\arg \min _{j\in J} D_{ij}\) and in this sense maximizes the local data likelihood.

The vector field defining the assignment flow (2.9) arises by coupling the flows for individual pixels through geometric averaging within the neighborhoods \(\mathcal {N}_{i},\,i\in I\), conforming to the underlying Fisher–Rao geometry

$$\begin{aligned} S(W)&= \begin{pmatrix} \vdots \\ {S_{i}(W)}^{\top } \\ \vdots \end{pmatrix} = \mathcal {G}^{\Omega }\big (L(W)\big ) \in \mathcal {W},\qquad \end{aligned}$$
(2.17a)
$$\begin{aligned} S_{i}(W)&= \mathcal {G}^{\Omega }_{i}\big (L(W)\big )\nonumber \\ {}&= {{\,\textrm{Exp}\,}}_{W_{i}} \left( \sum _{k \in \mathcal {N}_{i}} \omega _{ik} {{\,\textrm{Exp}\,}}_{W_{i}}^{-1}\big (L_{k}(W_{k})\big )\right) ,\quad i \in I. \end{aligned}$$
(2.17b)

The similarity vectors \(S_{i}(W)\) are parametrized by strictly positive weight patches \((\omega _{ik})_{k\in \mathcal {N}_{i}}\), centered at \(i\in I\) and indexed by local neighborhoods \(\mathcal {N}_{i}\subset I\), that in turn define the weight parameter matrix

$$\begin{aligned} \Omega= & {} {(\Omega _{i})}_{i\in I} \in {\mathbb {R}}_{+}^{|I|\times |I|},\nonumber \\ \qquad \Omega _{i}|_{\mathcal {N}_{i}}= & {} {(\omega _{ik})}_{k\in \mathcal {N}_{i}} \in \mathring{\Delta }_{|\mathcal {N}_{i}|},\nonumber \\ \qquad \sum _{k\in \mathcal {N}_{i}}\omega _{ik}= & {} 1,\;\forall i\in I. \end{aligned}$$
(2.18)

The matrix \(\Omega \) comprises all regularization parameters satisfying the latter linear constraints. Flattening these weight patches defines row vectors \(\Omega _{i}|_{\mathcal {N}_{i}},\,i\in I\) and, by complementing with 0, entries of the sparse row vectors \(\Omega _{i}\) of the matrix \(\Omega \). Note that the positivity assumption \(\omega _{ik}>0\) is reflected by the membership \(\Omega _{i}|_{\mathcal {N}_{i}} \in \mathring{\Delta }_{|\mathcal {N}_{i}|}\). Throughout this paper, we assume that all pixels have neighborhoods of equal size

$$\begin{aligned} |\mathcal {N}|:= |\mathcal {N}_{i}|,\quad \forall i\in I \end{aligned}$$
(2.19)

and therefore simply write \(\Omega _{i}|_{\mathcal {N}} = \Omega _{i}|_{\mathcal {N}_{i}}\). These parameters are used in the linearized assignment flow, to be introduced next. We explain a corresponding parameter estimation approach in Sect. 3 and a parameter predictor in Sect. 4.4.
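The following sketch (ours; the chain graph, the uniform weights and the random data merely serve as illustration) makes the construction (2.15)–(2.18) concrete by computing the likelihood matrix (2.16) and the similarity matrix (2.17) at the barycenter:

import numpy as np

rng = np.random.default_rng(2)
I, J, rho = 6, 3, 0.5                          # 6 nodes, 3 labels (illustration only)
D = rng.random((I, J))                         # distance matrix (2.15)
W = np.full((I, J), 1.0 / J)                   # barycenter 1_W, cf. (2.9)
nbrs = [sorted({max(i - 1, 0), i, min(i + 1, I - 1)}) for i in range(I)]   # chain graph
Omega = np.zeros((I, I))                       # weight matrix (2.18)
for i, N in enumerate(nbrs):
    Omega[i, N] = 1.0 / len(N)                 # uniform weights, rows sum to 1

def exp_p(p, z):                               # lifting map (2.11e)
    q = p * np.exp(z)
    return q / q.sum()

def Exp_p(p, v):                               # exponential map (2.11c)
    return exp_p(p, v / p)

def Exp_p_inv(p, q):                           # inverse map (2.11d)
    z = np.log(q / p)
    return p * z - p * (p @ z)

# likelihood matrix (2.16)
L = np.array([exp_p(W[i], -D[i] / rho) for i in range(I)])

# similarity matrix (2.17): geometric averaging over the neighborhoods
S = np.array([
    Exp_p(W[i], sum(Omega[i, k] * Exp_p_inv(W[i], L[k]) for k in nbrs[i]))
    for i in range(I)
])
assert np.allclose(S.sum(axis=1), 1.0) and (S > 0).all()   # rows of S(W) lie in the simplex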

2.3 Linearized Assignment Flow

The linearized assignment flow, introduced by [27], approximates (2.9) by

$$\begin{aligned} \dot{W} = R_{W}\left( S(W_{0}) + dS_{W_{0}}R_{W_{0}} \log \frac{W}{W_{0}}\right),\qquad W(0)=W_{0} \in \mathcal {W} \end{aligned}$$
(2.20)

around any point \(W_{0}\). In what follows, we only consider the barycenter

$$\begin{aligned} W_{0}={\mathbb {1}}_{\mathcal {W}} \end{aligned}$$
(2.21)

which is the initial point of (2.9). The differential equation (2.20) is still nonlinear but can be parametrized by a linear ODE on the tangent space

$$\begin{aligned} W(t)&= {{\,\textrm{Exp}\,}}_{W_{0}}\big (V(t)\big ), \end{aligned}$$
(2.22a)
$$\begin{aligned} \dot{V}&= R_{W_{0}}\big (S(W_{0}) + dS_{W_{0}} V\big ) =: B_{W_0} + A(\Omega )V,\quad \nonumber \\ V(0)&=0, \end{aligned}$$
(2.22b)

where matrix \(A(\Omega )\) linearly depends on the parameters \(\Omega \) of (2.17). The action of \(A(\Omega )\) on V is explicitly given by [27, Prop. 4.4]

$$\begin{aligned} A(\Omega ) V&=R_{W_{0}}dS_{W_{0}}V = R_{S(W_{0})}\Omega V \nonumber \\&={{\,\textrm{vec}\,}}_{r}^{-1}\big ({{\,\textrm{Diag}\,}}(R_{S(W_{0})}){{\,\textrm{vec}\,}}_{r}(\Omega V)\big ) \end{aligned}$$
(2.23a)
$$\begin{aligned}&=\left( R_{S_{1}(W_{0})}\sum _{k\in \mathcal {N}_{1}}\omega _{1k}V_{k},\dotsc ,R_{S_{|I|}(W_{0})}\sum _{k\in \mathcal {N}_{|I|}}\omega _{|I|k}V_{k}\right) ^{\top }, \end{aligned}$$
(2.23b)

where \({{\,\textrm{Diag}\,}}(R_{S(W_{0})})\) is defined by (2.14b) and we took into account (2.21). The linear ODE (2.22b) admits a closed-form solution which in turn enables a different numerical approach (Sect. 2.4) and a novel approach to parameter learning (Sect. 3).

2.4 Exponential Integration

The solution to (2.22b) is given by a high-dimensional integral (Duhamel’s formula) whose value in closed form is given by

$$\begin{aligned} V(t;\Omega ) = t \varphi \big (tA(\Omega )\big ) B_{W_0},\qquad \varphi (x) = \frac{e^{x}-1}{x}=\sum _{k=0}^{\infty } \frac{x^{k}}{(k+1)!}, \end{aligned}$$
(2.24)

where the entire function \(\varphi \) is extended to matrix arguments as the limit of an absolutely convergent power series in the matrix space [9, Theorem 6.2.8]. Since the matrix A is very large even for medium-sized images, however, it is not feasible in practice to compute \(\varphi (tA)\) in this way. Exponential integration [10, 18] was therefore used by [27] for approximately evaluating (2.24), as sketched next.

Applying the row-stacking operator (2.3) to both sides of (2.22b) and (2.24), respectively, yields with

$$\begin{aligned} v = {{\,\textrm{vec}\,}}_{r}(V) \end{aligned}$$
(2.25)

the ODE (2.22b) in the form

$$\begin{aligned} \dot{v}&= b + A^{J}(\Omega ) v, v(0) =0, \quad b = b(\Omega ) = {{\,\textrm{vec}\,}}_{r}(B_{W_0}) \in {\mathbb {R}}^{n}, \end{aligned}$$
(2.26a)
$$\begin{aligned}&\quad A^J(\Omega ) = {\big (A^J_{ik}(\Omega )\big )}_{i,k \in I} \in {\mathbb {R}}^{n\times n},\nonumber \\&\quad A^J_{ik}(\Omega ) = {\left\{ \begin{array}{ll} \omega _{ik} R_{S_{i}(W_{0})}, &{} k \in \mathcal {N}_{i}, \\ 0, &{} k \not \in \mathcal {N}_{i}. \end{array}\right. } \end{aligned}$$
(2.26b)
$$\begin{aligned}&\quad v(t;\Omega ) = t\varphi \big (t A^{J}(\Omega )\big ) b,\nonumber \\&\quad n := \dim v(t;\Omega ) = |I| |J|, \end{aligned}$$
(2.26c)

where \(A^{J}(\Omega )\) results from

$$\begin{aligned} {{\,\textrm{vec}\,}}_{r}\big (A(\Omega ) V\big )&\overset{({2.23})}{=}&{{\,\textrm{Diag}\,}}(R_{S(W_{0})}){{\,\textrm{vec}\,}}_{r}(\Omega V)\nonumber \\= & {} {{\,\textrm{Diag}\,}}(R_{S(W_{0})})(\Omega \otimes I_{|J|}) v \end{aligned}$$
(2.27a)
$$\begin{aligned}= & {} A^{J}(\Omega ) v. \end{aligned}$$
(2.27b)

Using the Arnoldi iteration [21] with initial vector \(q_{1}=b/\Vert b\Vert \), we determine an orthonormal basis \(Q_{m}=(q_{1},\dotsc ,q_{m}) \in {{\mathbb {R}}}^{n\times m}\) of the Krylov space \(\mathcal {K}_m(A^{J}, b)\) of dimension m. As will be validated in Sect. 4, choosing \(m\le 10\) yields sufficiently accurate approximations of the actions of the matrix exponential \({{\,\textrm{expm}\,}}\) and the \(\varphi \) operator on a vector, respectively, that are given by

$$\begin{aligned} {{\,\textrm{expm}\,}}\big (tA^{J}(\Omega )\big )b&\approx \Vert b\Vert Q_{m} {{\,\textrm{expm}\,}}(t H_{m})e_1, \nonumber \\ \qquad H_{m}&= Q_{m}^{\top }A^{J}(\Omega )Q_{m}, \end{aligned}$$
(2.28a)
$$\begin{aligned} \quad t \varphi \big (tA^{J}(\Omega )\big )b&\approx t \Vert b\Vert Q_{m} \varphi (t H_{m})e_1. \end{aligned}$$
(2.28b)

The expression \(\varphi (t H_{m})e_1\) results from computing the left-hand side of the relation [8, Section 10.7.4]

$$\begin{aligned} {{\,\textrm{expm}\,}}\begin{pmatrix} t H_{m} &{}\quad e_{1} \\ 0 &{}\quad 0 \end{pmatrix} = \begin{pmatrix} {{\,\textrm{expm}\,}}(t H_{m}) &{}\quad \varphi (t H_{m}) e_{1} \\ 0 &{}\quad 1 \end{pmatrix} \end{aligned}$$
(2.29)

and extracting the upper-right vector. Since \(H_{m}\) is a small matrix, any standard method [17] can be used for computing the matrix exponential on the left-hand side.
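The following sketch (ours, not the authors' implementation) assembles a small \(A^{J}(\Omega )\) according to (2.26b), runs the Arnoldi iteration and compares the Krylov approximation (2.28b) of \(t\varphi (tA^{J})b\) with a direct evaluation via the augmented matrix exponential (2.29):

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
I, J, t, m = 20, 3, 1.0, 10
n = I * J

# random row-stochastic S(W_0) and sparse weight matrix Omega (cyclic 3-neighborhoods)
S = rng.random((I, J)); S /= S.sum(axis=1, keepdims=True)
Omega = np.zeros((I, I))
for i in range(I):
    N = [(i - 1) % I, i, (i + 1) % I]
    Omega[i, N] = 1.0 / len(N)

# A^J(Omega) = Diag(R_{S(W_0)}) (Omega kron I_J), cf. (2.26b), (2.27)
R_blocks = [np.diag(S[i]) - np.outer(S[i], S[i]) for i in range(I)]
AJ = np.vstack([
    np.hstack([Omega[i, k] * R_blocks[i] for k in range(I)]) for i in range(I)
])
b = rng.standard_normal(n)

def arnoldi(A, v, m):
    # orthonormal basis Q_m of K_m(A, v) and Hessenberg matrix H_m = Q_m^T A Q_m
    Q = np.zeros((A.shape[0], m + 1)); H = np.zeros((m + 1, m))
    Q[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ Q[:, j]
        for i in range(j + 1):
            H[i, j] = Q[:, i] @ w
            w -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]
    return Q[:, :m], H[:m, :m]

def phi_times(M, v):
    # phi(M) v via the augmented matrix exponential (2.29)
    k = M.shape[0]
    Aug = np.zeros((k + 1, k + 1)); Aug[:k, :k] = M; Aug[:k, k] = v
    return expm(Aug)[:k, k]

exact = phi_times(t * AJ, t * b)                       # t phi(t A^J) b, cf. (2.24)
Qm, Hm = arnoldi(AJ, b, m)
approx = np.linalg.norm(b) * Qm @ phi_times(t * Hm, t * np.eye(m)[:, 0])   # (2.28b)
print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))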

3 Parameter Estimation

Section 3.1 details our approach for learning optimal weight parameters for a given image and ground truth labeling: Riemannian gradient descent is performed with respect to a loss function that depends on the solution of the linearized assignment flow. A closed-form expression of this gradient is derived in Sect. 3.2 along with a low-rank approximation in Sect. 3.3 that can be computed efficiently. As an alternative and baseline, we outline in Sect. 3.4 two gradient approximations based on numerical schemes for integrating the linearized assignment flow and automatic differentiation.

3.1 Learning Procedure

Let

$$\begin{aligned} P_{\Omega } = \{\Omega \in {\mathbb {R}}_{+}^{|I|\times |I|}:\Omega \;\text {satisfies}~(2.18)\} \end{aligned}$$
(3.1)

denote the space of weight parameter matrices that parametrize the similarity mapping (2.17). Due to (2.18) and (2.19), the restrictions \(\Omega _{i}|_{\mathcal {N}}\) are strictly positive probability vectors, as are the assignment vectors \(W_{i}\) defined by (2.7). Therefore, similar to \(W_{i}\in \mathcal {S}\), we consider each \(\Omega _{i}|_{\mathcal {N}}\) as point on a corresponding manifold \((\Delta _{|\mathcal {N}|},g_{{\scriptscriptstyle FR}})\) equipped with the Fisher–Rao metric and—in this sense—regard \(P_{\Omega }\) in (3.1) as corresponding product manifold.

Let \(W^{*}\in \mathcal {W}\) denote the ground truth labeling for a given image, and let \(V^{*} = \Pi _0 W^{*} \in \mathcal {T}_{0}\) be a tangent vector such that \(\lim _{s\rightarrow \infty } {{\,\textrm{Exp}\,}}_{{\mathbb {1}}_{\mathcal {W}}}(s V^{*}) = W^{*}\). Our objective is to determine \(\Omega \) such that, for some specified time \(T>0\), the vector

$$\begin{aligned} V_{T}(\Omega ):= V(T;\Omega ), \end{aligned}$$
(3.2)

given by (2.24) and corresponding to the linearized assignment flow, approximates the direction of \(V^{*}\) and hence

$$\begin{aligned} \lim _{s\rightarrow \infty }{{\,\textrm{Exp}\,}}_{\mathbb {1}_{\mathcal {W}}}\big (s V_{T}(\Omega )\big ) = W^{*}. \end{aligned}$$
(3.3)

In this formula, only the direction of the vector \(V_{T}(\Omega )\) is relevant, not its magnitude. A distance function that depends only on this direction is given by

$$\begin{aligned} f_{\mathcal {L}}:\mathcal {T}_{0}\rightarrow {\mathbb {R}},\qquad V\mapsto 1-\frac{\langle V^{*},V\rangle }{\Vert V^{*}\Vert \Vert V\Vert }. \end{aligned}$$
(3.4)

In addition, we consider a regularizer

$$\begin{aligned}{} & {} \mathcal {R}:P_{\Omega } \rightarrow {\mathbb {R}},\quad \Omega \mapsto \frac{\tau }{2} \sum _{i\in I}\Vert t_{i}(\Omega )\Vert ^{2}, \nonumber \\{} & {} \qquad t_{i}(\Omega ) = \exp _{{\mathbb {1}}_{\Omega }}^{-1}(\Omega _{i}|_{\mathcal {N}}),\qquad \tau > 0 \end{aligned}$$
(3.5)

and define the loss function

$$\begin{aligned} \mathcal {L}:P_{\Omega }\rightarrow {\mathbb {R}},\qquad \mathcal {L}(\Omega ) = f_{\mathcal {L}}\big (V_{T}(\Omega )\big ) + \mathcal {R}(\Omega ), \end{aligned}$$
(3.6)

with \(V_{T}(\Omega )\) from (3.2). \(\Omega \) is determined by the Riemannian gradient descent sequence

$$\begin{aligned} \Omega ^{(k+1)}{} & {} = \exp _{\Omega ^{(k)}}\big (-h\nabla \mathcal {L}(\Omega ^{(k)})\big ),\quad k\ge 0, \nonumber \\ \Omega ^{(0)}_{i}|_{\mathcal {N}}{} & {} = {\mathbb {1}}_{|\mathcal {N}|},\quad i\in I \end{aligned}$$
(3.7)

with step size \(h>0\). Here

$$\begin{aligned} \nabla \mathcal {L}(\Omega )=R_{\Omega }\partial \mathcal {L}(\Omega ) \end{aligned}$$
(3.8)

is the Riemannian gradient with respect to the Fisher–Rao metric. \(R_{\Omega }\) is given by (2.12) and (2.11b) and effectively applies to the restrictions \(\Omega _{i}|_{\mathcal {N}}\) of the row vectors with all remaining components equal to 0. It remains to compute the Euclidean gradient \(\partial \mathcal {L}(\Omega )\) of the loss function (3.6) which is presented in the subsequent Sect. 3.2.
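A single iteration of (3.7), restricted to one weight patch, then takes the following form (a minimal sketch, assuming the corresponding row euc_grad of the Euclidean gradient \(\partial \mathcal {L}(\Omega )\) has already been computed; in the full algorithm all patches are updated simultaneously):

import numpy as np

def riemannian_step(omega_patch, euc_grad, h):
    # omega_patch: strictly positive weights on N_i summing to 1, cf. (2.18)
    p = omega_patch
    rgrad = p * euc_grad - p * (p @ euc_grad)   # Riemannian gradient (3.8) via R_p (2.11b)
    q = p * np.exp(-h * rgrad)                  # retraction exp_p(-h * rgrad), cf. (2.11e)
    return q / q.sum()

omega = np.full(9, 1.0 / 9)                     # uniform initialization on a 3x3 patch, cf. (3.7)
omega = riemannian_step(omega, np.random.default_rng(4).standard_normal(9), h=0.1)
assert np.isclose(omega.sum(), 1.0) and (omega > 0).all()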

3.2 Loss Function Gradient

In Sect. 3.2.2, we derive a closed-form expression for the loss function gradient (Theorem 3.8), after introducing some basic calculus rules for representing and computing differentials of matrix-valued mappings in Sect. 3.2.1.

3.2.1 Matrix Differentials

Let \(F :{\mathbb {R}}^{m_{1}\times m_{2}} \rightarrow {\mathbb {R}}^{n_{1}\times n_{2}}\) be a smooth mapping. Using the canonical identification \(T\mathcal {E} \cong \mathcal {E}\) of the tangent spaces of any Euclidean space \(\mathcal {E}\) with \(\mathcal {E}\) itself, we both represent and compute the differential

$$\begin{aligned} dF:{\mathbb {R}}^{m_{1}\times m_{2}} \rightarrow L({\mathbb {R}}^{m_{1}\times m_{2}},{\mathbb {R}}^{n_{1}\times n_{2}}) \end{aligned}$$
(3.9)

in terms of a vector-valued mapping f, which is defined by F according to the commutative diagram

(3.10)

In formulas, this means that based on the equation

$$\begin{aligned} {{\,\textrm{vec}\,}}_{r}\big (F(X)\big )=f\big ({{\,\textrm{vec}\,}}_{r}(X)\big ),\quad \forall X\in {\mathbb {R}}^{m_{1}\times m_{2}}, \end{aligned}$$
(3.11)

we set

$$\begin{aligned}{} & {} {{\,\textrm{vec}\,}}_{r}\big (dF(X)Y) = df\big ({{\,\textrm{vec}\,}}_{r}(X)\big ){{\,\textrm{vec}\,}}_{r}(Y), \nonumber \\{} & {} \qquad \forall X, Y \in {\mathbb {R}}^{m_{1}\times m_{2}} \end{aligned}$$
(3.12)

and hence define and compute the differential (3.9) as matrix-valued mapping

$$\begin{aligned} dF:= df \circ {{\,\textrm{vec}\,}}_{r}. \end{aligned}$$
(3.13)

The corresponding linear actions on \(Y\in {\mathbb {R}}^{m_{1}\times m_{2}}\) and \({{\,\textrm{vec}\,}}_{r}(Y)\in {\mathbb {R}}^{m_{1}m_{2}}\), respectively, are given by (3.12). We state an auxiliary result required in the next subsection, which also provides a first concrete instance of the general relation (3.12).

Lemma 3.1

(differential of the matrix exponential) If \(F = {{\,\textrm{expm}\,}}:{\mathbb {R}}^{n\times n}\rightarrow {\mathbb {R}}^{n\times n}\), then (3.12) reads

$$\begin{aligned} {{\,\textrm{vec}\,}}_{r}\big (d{{\,\textrm{expm}\,}}(X) Y\big ) = \big ({{\,\textrm{expm}\,}}(X)\otimes I_{n}\big )\,\varphi (-X\oplus X^{\top })\,{{\,\textrm{vec}\,}}_{r}(Y),\qquad Y\in {\mathbb {R}}^{n\times n}, \end{aligned}$$
(3.14)

with \(\varphi \) given by (2.24).

Proof

The result follows from [8, Thm. 10.13] where columnwise vectorization is used, after rearranging so as to conform to the row-stacking mapping \({{\,\textrm{vec}\,}}_{r}\) used in this paper. \(\square \)
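Formula (3.14) is easy to check against a finite-difference approximation of \(d{{\,\textrm{expm}\,}}(X)Y\); the following sketch (ours) does so for small random matrices, evaluating \(\varphi \) by the block trick (2.29) with \(e_{1}\) replaced by the identity:

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))

def vec_r(M):
    return M.reshape(-1)

def phi(M):
    # phi(M): top-right block of expm([[M, I], [0, 0]])
    d = M.shape[0]
    Aug = np.zeros((2 * d, 2 * d))
    Aug[:d, :d] = M
    Aug[:d, d:] = np.eye(d)
    return expm(Aug)[:d, d:]

# right-hand side of (3.14): (expm(X) kron I_n) phi(-X ⊕ X^T) vec_r(Y)
kron_sum = np.kron(-X, np.eye(n)) + np.kron(np.eye(n), X.T)
rhs = np.kron(expm(X), np.eye(n)) @ phi(kron_sum) @ vec_r(Y)

# central finite-difference approximation of d expm(X) Y
eps = 1e-6
lhs = vec_r((expm(X + eps * Y) - expm(X - eps * Y)) / (2 * eps))
print(np.linalg.norm(lhs - rhs) / np.linalg.norm(rhs))   # small relative error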

3.2.2 Closed-Form Gradient Expression

We separate the computation of \(\mathcal {L}(\Omega )\) and the gradient \(\partial \mathcal {L}(\Omega )\) into several operations that were introduced in Sects. 2 and 3.1. We illustrate their composition and accordingly the process from parameters \(\Omega \) to a loss \(\mathcal {L}(\Omega )\) in the following flow diagram that refers to quantities in (2.26) and (2.27) related to the linearized assignment flow, after vectorization.

(3.15)

In what follows, we traverse this diagram from top-left to bottom-right and collect each partial result by a corresponding lemma or proposition. Theorem 3.8 assembles all results and provides a closed-form expression of the loss function gradient \(\partial \mathcal {L}(\Omega )\). To enhance readability, the proofs of most lemmata are listed in Appendix A.1.

We focus on mapping (M1) in diagram (3.15).

Lemma 3.2

The differential of the function

$$\begin{aligned}{} & {} f_{1} :{\mathbb {R}}^{|I| \times |I|} \rightarrow {\mathbb {R}}^{|I| \times |J|},\nonumber \\{} & {} \qquad \Omega \mapsto f_{1}(\Omega ):= S(W_{0}) = \exp _{{\mathbb {1}}_{\mathcal {W}}}\Big (-\frac{1}{\rho } \Omega D\Big ),\nonumber \\{} & {} \qquad D\in {\mathbb {R}}^{|I|\times |J|} \end{aligned}$$
(3.16)

and its transpose are given by

$$\begin{aligned} df_{1}(\Omega ) Y&= -\frac{1}{\rho } R_{f_{1}(\Omega )} (Y D),\quad \quad \forall Y\in {\mathbb {R}}^{|I|\times |I|}, \end{aligned}$$
(3.17a)
$$\begin{aligned} df_{1}(\Omega )^{\top } Z&= -\frac{1}{\rho }R_{f_{1}(\Omega )}(Z) D^{\top }, \quad \forall Z\in {\mathbb {R}}^{|I|\times |J|}, \end{aligned}$$
(3.17b)

with \(R_{f_{1}(\Omega )}\) defined by (2.14).

Proof

see Appendix A.1.

We consider mapping (M2) of diagram (3.15), taking into account mapping (M4) and notation (3.16).

Lemma 3.3

The differential of the function

$$\begin{aligned}{} & {} f_{2}:{\mathbb {R}}^{|I|\times |I|} \rightarrow {\mathbb {R}}^{|I| |J|}, \nonumber \\{} & {} \Omega \mapsto f_{2}(\Omega ):= b(\Omega ) = {{\,\textrm{vec}\,}}_{r}\big (R_{W_{0}}f_{1}(\Omega )\big ) \end{aligned}$$
(3.18)

and its transpose are given by

$$\begin{aligned} df_{2}(\Omega ) Y&= {{\,\textrm{vec}\,}}_{r}\big (R_{W_{0}}df_{1}(\Omega )Y\big ),\qquad \forall Y\in {\mathbb {R}}^{|I|\times |I|} \end{aligned}$$
(3.19a)
$$\begin{aligned} df_{2}(\Omega )^{\top } Z&= df_{1}(\Omega )^{\top }(R_{W_{0}} Z),\qquad \qquad \forall Z\in {\mathbb {R}}^{|I|\times |J|}. \end{aligned}$$
(3.19b)

Proof

see Appendix A.1.

We note that \(d f_{2}(\Omega )^{\top }\) should act on a vector \({{\,\textrm{vec}\,}}_{r}(Z)\in {\mathbb {R}}^{|I| |J|}\). We prefer the more compact and equivalent non-vectorized expression (3.19b).

We turn to mapping (M3) of diagram (3.15), again using the notation (3.16).

Lemma 3.4

The differential of the mapping

$$\begin{aligned}{} & {} f_{3}:{\mathbb {R}}^{|I|\times |I|}\rightarrow {\mathbb {R}}^{n\times n}, \nonumber \\{} & {} \quad \Omega \mapsto f_{3}(\Omega ):= A^{J}(\Omega )= {{\,\textrm{Diag}\,}}(R_{f_{1}(\Omega )})(\Omega \otimes I_{|J|}), n=|I| |J|\nonumber \\ \end{aligned}$$
(3.20)

is given by

$$\begin{aligned} df_{3}(\Omega ) Y&= {{\,\textrm{Diag}\,}}(d R_{f_{1}(\Omega )} Y)(\Omega \otimes I_{|J|}) \nonumber \\&\quad + {{\,\textrm{Diag}\,}}(R_{f_{1}(\Omega )})(Y\otimes I_{|J|}), \nonumber \\&\qquad \forall Y\in {\mathbb {R}}^{|I|\times |I|}. \end{aligned}$$
(3.21a)

Here, \({{\,\textrm{Diag}\,}}(d R_{f_{1}(\Omega )} Y)\in {\mathbb {R}}^{n\times n}\) is defined analogously to (2.14b), with |I| diagonal blocks of size \(|J|\times |J|\) of the form

$$\begin{aligned} d R_{f_{1i}(\Omega )} Y&= {{\,\textrm{Diag}\,}}\big (d f_{1i}(\Omega ) Y\big ) -\big (df_{1i}(\Omega ) Y\big )f_{1i}(\Omega )^{\top } \nonumber \\&\quad - f_{1i}(\Omega )\big (df_{1i}(\Omega ) Y\big )^{\top },\quad i \in I, \end{aligned}$$
(3.21b)

where \(d R_{f_{1i}(\Omega )} Y\) acts blockwise according to

$$\begin{aligned} (d R_{f_{1i}(\Omega )} Y) S_{i}&= \big ((d R_{f_{1}(\Omega )} Y) S\big )_{i},\quad i\in I \end{aligned}$$
(3.21c)

for any \(S = (\dotsc ,S_{i},\dotsc )^{\top }\in {\mathbb {R}}^{|I|\times |J|}\), and \(d f_{1i}(\Omega ) Y\) denotes the i-th row of \(d f_{1}(\Omega ) Y\) given by (3.17a).

Proof

see Appendix A.1.

We focus on the differential of the vector-valued mapping \(v_{T}(\Omega )\in {\mathbb {R}}^{n}\) of (3.15) with n given by (2.26c). We utilize the fact that analogous to (2.29), the vector

$$\begin{aligned} v_{T}(\Omega )&=T\varphi (T A^{J}(\Omega ))b(\Omega ) = (I_{n},0_{n}) {{\,\textrm{expm}\,}}\big (\mathcal {A}(\Omega )\big )e_{n+1} \end{aligned}$$
(3.22a)

can be extracted from the last column of the matrix

$$\begin{aligned} {{\,\textrm{expm}\,}}\big (\mathcal {A}(\Omega )\big )&= \begin{pmatrix}{{\,\textrm{expm}\,}}\big (T A^{J}(\Omega )\big ) &{} v_{T}(\Omega ) \\ 0_{n}^{\top } &{} 1 \end{pmatrix},\nonumber \\ \mathcal {A}(\Omega )&= \begin{pmatrix} T A^{J}(\Omega ) &{}\quad T b(\Omega ) \\ 0_{n}^{\top } &{}\quad 0 \end{pmatrix}. \end{aligned}$$
(3.22b)
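Relation (3.22) is convenient computationally, since a single matrix exponential of the augmented matrix yields \(v_{T}(\Omega )\). A small sanity check (ours; random dense stand-ins replace \(A^{J}(\Omega )\) and \(b(\Omega )\)):

import numpy as np
from scipy.linalg import expm
from math import factorial

rng = np.random.default_rng(6)
n, T = 8, 2.0
AJ = rng.standard_normal((n, n)) / n          # stand-in for A^J(Omega)
b = rng.standard_normal(n)                    # stand-in for b(Omega)

# augmented matrix (3.22b)
Acal = np.zeros((n + 1, n + 1))
Acal[:n, :n] = T * AJ
Acal[:n, n] = T * b

# (3.22a): v_T = (I_n, 0_n) expm(Acal) e_{n+1}, i.e., the top part of the last column
v_T = expm(Acal)[:n, n]

# compare with T phi(T A^J) b via the truncated power series of phi, cf. (2.24)
phi_TA = sum(np.linalg.matrix_power(T * AJ, k) / factorial(k + 1) for k in range(30))
assert np.allclose(v_T, T * phi_TA @ b)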

By means of relation (3.11), we associate a vector-valued function \(f_{\mathcal {A}}\) with the matrix-valued mapping \(\mathcal {A}\) through

$$\begin{aligned} {{\,\textrm{vec}\,}}_{r}\big (\mathcal {A}(\Omega )\big ) = f_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big ) \end{aligned}$$
(3.23)

and record for later that, for any matrix \(Y\in {\mathbb {R}}^{|I|\times |I|}\), Eq. (3.12) implies

$$\begin{aligned} {{\,\textrm{vec}\,}}_{r}\big (d\mathcal {A}(\Omega )Y\big ) = df_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big ){{\,\textrm{vec}\,}}_{r}(Y). \end{aligned}$$
(3.24)

Lemma 3.5

The differential of the mapping \(\mathcal {A}\) in (3.22b) is given by

$$\begin{aligned}{} & {} d\mathcal {A}(\Omega ) Y \nonumber \\ {}{} & {} \quad = T \begin{pmatrix} d f_{3}(\Omega ) &{}\quad d f_{2}(\Omega ) \\ 0_{n}^{\top } &{}\quad 0 \end{pmatrix} \left( \begin{pmatrix} 1 \\ 1 \end{pmatrix} \otimes Y\right) ,\quad \forall Y\in {\mathbb {R}}^{|I|\times |I|}.\nonumber \\ \end{aligned}$$
(3.25)

Proof

Equation (3.25) is immediate due to

$$\begin{aligned} d\mathcal {A}(\Omega ) Y = \begin{pmatrix} T\, d A^{J}(\Omega ) Y &{}\quad T\, d b(\Omega ) Y \\ 0_{n}^{\top } &{}\quad 0 \end{pmatrix} \end{aligned}$$
(3.26)

and Lemmata 3.3 and 3.4. \(\square \)

Now we are in the position to specify the differential of the solution to the linearized assignment flow with respect to the regularizing weight parameters.

Proposition 3.6

Let

$$\begin{aligned} f_{4}(\Omega ):= v_{T}(\Omega ):=v(T;\Omega ) \end{aligned}$$
(3.27)

denote the vectorized solution (2.26c) of the ODE (2.22b). Then the differential is given, according to the convention (3.13), by

$$\begin{aligned}&d f_{4}(\Omega )Y\nonumber \\ {}&\quad = T \Big ( d\big (\varphi \big (T A^{J}(\Omega )\big )b(\Omega )\big ) + \varphi \big (T A^{J}(\Omega )\big ) d f_{2}(\Omega )\Big ) Y \end{aligned}$$
(3.28a)

where

$$\begin{aligned}&d\big (\varphi \big (T A^{J}(\Omega )\big )b(\Omega )\big ) Y \end{aligned}$$
(3.28b)
$$\begin{aligned}&\quad = \Big (\big ({{\,\textrm{expm}\,}}(T A^{J}(\Omega )),v_{T}(\Omega )\big )\otimes e_{n+1}^{\top }\Big ) \nonumber \\&\quad \varphi \big (-\mathcal {A}(\Omega )\oplus \mathcal {A}(\Omega )^{\top }\big ) \cdot df_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big ){{\,\textrm{vec}\,}}_{r}(Y), \end{aligned}$$
(3.28c)
$$\begin{aligned}&\qquad \forall Y\in {\mathbb {R}}^{|I|\times |I|}, \end{aligned}$$
(3.28d)

where \(A^{J}(\Omega )\) is given by (2.26b), \(\mathcal {A}(\Omega )\) by (3.22b), \(d f_{\mathcal {A}}\) by (3.24) and Lemma 3.5, and \(df_{2}\) by Lemma 3.3.

Proof

Equation (3.28a) follows directly from Eq. (2.26c), and Lemma 3.3 makes the second summand on the right-hand side explicit. It remains to compute the first summand. Using (3.22) and the chain rule, we have, for any \(Y\in {\mathbb {R}}^{|I|\times |I|}\),

$$\begin{aligned} d\big (T\varphi (T A^{J}(\Omega ))b(\Omega )\big ) Y = (I_{n},0_{n})\, d{{\,\textrm{expm}\,}}\big (\mathcal {A}(\Omega )\big )\big (d\mathcal {A}(\Omega )Y \big ) e_{n+1}. \end{aligned}$$
(3.29a)

Applying \({{\,\textrm{vec}\,}}_{r}\) to both sides, which does not change the vector on the left-hand side, yields by (2.3)

$$\begin{aligned}&d\big (T\varphi (T A^{J}(\Omega ))b(\Omega )\big ) Y = \big ((I_{n},0_{n}) \nonumber \\&\quad \otimes e_{n+1}^{\top }\big ){{\,\textrm{vec}\,}}_{r}\big (d{{\,\textrm{expm}\,}}\big (\mathcal {A}(\Omega )\big )(d\mathcal {A}(\Omega ) Y)\big ). \end{aligned}$$
(3.29b)

Applying Lemma 3.1 and (3.24), we obtain

$$\begin{aligned}&d\big (T\varphi (T A^{J}(\Omega ))b(\Omega )\big ) Y = \big ((I_{n},0_{n})\otimes e_{n+1}^{\top }\big ) \nonumber \\&\quad \big ({{\,\textrm{expm}\,}}\big (\mathcal {A}(\Omega )\big )\otimes I_{n+1}\big ) \varphi \big (-\mathcal {A}(\Omega )\oplus \mathcal {A}(\Omega )^{\top }\big ) \end{aligned}$$
(3.29c)
$$\begin{aligned}&\qquad \cdot df_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big ){{\,\textrm{vec}\,}}_{r}(Y) \end{aligned}$$
(3.29d)

and using (2.2) and (3.22b)

$$\begin{aligned}&= \Big (\big ({{\,\textrm{expm}\,}}(T A^{J}(\Omega )),v_{T}(\Omega )\big )\otimes e_{n+1}^{\top }\Big )\nonumber \\&\qquad \varphi \big (-\mathcal {A}(\Omega )\oplus \mathcal {A}(\Omega )^{\top }\big ) \cdot df_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big ){{\,\textrm{vec}\,}}_{r}(Y). \end{aligned}$$
(3.29e)

\(\square \)

We finally consider the regularizing mapping \(\mathcal {R}(\Omega )\), defined by (3.5) and corresponding to mapping (M5) in diagram (3.15). Here, we have to take into account the constraints (2.18) imposed on \(\Omega \). Accordingly, we define the corresponding set of tangent matrices

$$\begin{aligned} \mathcal {Y}_{\Omega } = \big \{Y\in {\mathbb {R}}^{|I|\times |I|}:\langle {\mathbb {1}}_{\mathcal {N}}, Y_{i}|_{\mathcal {N}}\rangle = 0,\;\forall i\in I\big \}. \end{aligned}$$
(3.30)

Lemma 3.7

The differential of the mapping \(\mathcal {R}\) in (3.5) is given by

$$\begin{aligned} d\mathcal {R}(\Omega ) Y = \tau \sum _{i\in I}\Big \langle t_{i}(\Omega ), \Pi _{0}\Big (\frac{Y_{i}}{\Omega _{i}}\Big )\Big |_{\mathcal {N}}\Big \rangle ,\quad \forall Y\in \mathcal {Y}_{\Omega }.\nonumber \\ \end{aligned}$$
(3.31)

Proof

see Appendix A.1.

Putting all results together, we state the main result of this section.

Theorem 3.8

(loss function gradient) Let

$$\begin{aligned} \mathcal {L}(\Omega ) = f_{\mathcal {L}}\big (v_{T}(\Omega )\big )+\mathcal {R}(\Omega ) \end{aligned}$$
(3.32)

be a continuously differentiable loss function, where \(v_{T}(\Omega )\), given by (2.26c), is the vectorized solution to the linearized assignment flow (2.22b) at time \(t=T\). Then its gradient \(\partial \mathcal {L}(\Omega )\) is given by

$$\begin{aligned} \langle \partial \mathcal {L}(\Omega ),Y\rangle&= d\mathcal {L}(\Omega ) Y,\qquad \forall Y\in \mathcal {Y}_{\Omega } \end{aligned}$$
(3.33a)

with

$$\begin{aligned} d\mathcal {L}(\Omega )Y&= \big \langle \partial f_{\mathcal {L}}\big (v_{T}(\Omega )\big ), d f_{4}(\Omega ) Y\big \rangle + d\mathcal {R}(\Omega ) Y \end{aligned}$$
(3.33b)

and \(d f_{4}(\Omega )\) given by (3.28) and \(d\mathcal {R}(\Omega ) Y\) by Lemma 3.7.

Proof

The claim (3.33) follows from applying the definition of the gradient in (3.33a) and evaluating the right-hand side using the chain rule and Proposition 3.6, to obtain (3.33b). \(\square \)

3.3 Gradient Approximation

In this section, we discuss the complexity of the evaluation of the loss function gradient \(\partial \mathcal {L}(\Omega )\) as given by (3.33), and we develop a low-rank approximation (3.47) that is computationally feasible and efficient.

3.3.1 Motivation

We reconsider the gradient \(\partial \mathcal {L}\) given by (3.33). The gradient involves the term \(df_{4}(\Omega )Y\), given by (3.28), which comprises two summands. We focus on the computationally expensive first summand on the right-hand side of (3.28a) given by (3.28b)-(3.28c), i.e., the term

$$\begin{aligned} \underbrace{ \Big (\big ({{\,\textrm{expm}\,}}(T A^{J}(\Omega )),v_{T}(\Omega )\big )\otimes e_{n+1}^{\top }\Big )\varphi \big (-\mathcal {A}(\Omega )\oplus \mathcal {A}(\Omega )^{\top }\big ) \cdot df_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big ) }_{=: C(\Omega )} {{\,\textrm{vec}\,}}_{r}(Y). \end{aligned}$$
(3.34)

In order to evaluate the corresponding component of \(\partial \mathcal {L}(\Omega )\) based on (3.33b), the matrix \(C(\Omega )\) is transposed and multiplied with \(\partial f_{\mathcal {L}}(v_{T}(\Omega ))\),

$$\begin{aligned}&C(\Omega )^{\top } \partial f_{\mathcal {L}}(v_{T}(\Omega )) \end{aligned}$$
(3.35a)
$$\begin{aligned}&\quad = df_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big )^{\top } \varphi \big (-\mathcal {A}(\Omega )^{\top }\oplus \mathcal {A}(\Omega )\big ) \nonumber \\ {}&\qquad \cdot \Big (\big ({{\,\textrm{expm}\,}}(T A^{J}(\Omega )),v_{T}(\Omega )\big )^{\top }\otimes e_{n+1}\Big )\partial f_{\mathcal {L}}(v_{T}(\Omega )) \end{aligned}$$
(3.35b)
$$\begin{aligned}&\quad = df_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big )^{\top } \varphi \big (-\mathcal {A}(\Omega )^{\top }\oplus \mathcal {A}(\Omega )\big ) \nonumber \\&\qquad \cdot \Big (\big ({{\,\textrm{expm}\,}}(T A^{J}(\Omega )),v_{T}(\Omega )\big )^{\top }\otimes e_{n+1}\Big )\nonumber \\&\qquad \cdot \big (\partial f_{\mathcal {L}}(v_{T}(\Omega )) \otimes (1)\big ) \end{aligned}$$
(3.35c)
$$\begin{aligned}&\quad \overset{(2.2)}{=} df_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big )^{\top } \varphi \big (-\mathcal {A}(\Omega )^{\top }\oplus \mathcal {A}(\Omega )\big ) \nonumber \\ {}&\qquad \cdot \Big (\big ({{\,\textrm{expm}\,}}(T A^{J}(\Omega )),v_{T}(\Omega )\big )^{\top } \partial f_{\mathcal {L}}(v_{T}(\Omega )) \otimes e_{n+1}\Big ). \end{aligned}$$
(3.35d)

Thus, the matrix-valued function \(\varphi \) defined by (2.24) has to be evaluated at a Kronecker sum of matrices, and the result has to be multiplied by a vector. The structure of this expression has the general form

$$\begin{aligned} f(M_{1}\oplus&M_{2}) (b_1 \otimes b_2),\qquad M_1, M_2 \in {{\mathbb {R}}}^{k \times k},\quad b_1, b_2 \in {{\mathbb {R}}}^k, \end{aligned}$$
(3.36)

where in our case we have

$$\begin{aligned} M_1&= -{\mathcal {A}(\Omega )}^{\top },\ M_2 = \mathcal {A}(\Omega ),\ k = n+1= |I||J| + 1, \end{aligned}$$
(3.37a)
$$\begin{aligned} b_1&= \big ({{\,\textrm{expm}\,}}(T A^{J}(\Omega )),\ v_{T}(\Omega )\big )^\top \partial f_{\mathcal {L}}\big (v_{T}(\Omega )\big ),\end{aligned}$$
(3.37b)
$$\begin{aligned} b_2&= e_{n+1}, f = \varphi . \end{aligned}$$
(3.37c)

As the following discussion also holds in the general setting (3.36), we derive our gradient approximation in this full generality and afterward return to our setting in the final gradient approximation (3.47). First, we discuss two ways to compute (3.36):

Direct Computation Compute the Kronecker sum \(M_{1} \oplus M_{2}\), evaluate the matrix function \(\varphi \) and multiply the vector \(b_1 \otimes b_2\). This approach has space and time complexity of at least \(\mathcal {O}(k^4)\), with k given by (3.37a). The complexity might be even higher depending on how the function f is evaluated.

Krylov Subspace Approximation Use the Krylov space \(\mathcal {K}_m(M_{1}\oplus M_{2}, b_1 \otimes b_2)\) for approximating (3.36), as explained in Sect. 2.4. This approach has space complexity \(\mathcal {O}(k^2 m^2)\) and time complexity \(\mathcal {O}(k^2(m+1))\) [22, p. 132].

Remark 3.9

(space complexity) Consider an image with \(512 \times 512\) pixels (\(|I| = 262\,144\)), \(|J|=10\) labels (i.e., \(k = |I||J| + 1 = 2\,621\,441\)) and 8 bytes per number. Then the direct computation requires storing more than \(10^{14}\) terabytes of data. The Krylov subspace approximation (with \(m=10\)) is significantly cheaper, but still requires storing more than 5000 terabytes. Hence both methods are computationally infeasible, especially in view of the fact that (3.36) has to be recomputed in every step of the gradient descent procedure (3.7).

3.3.2 An Approximation by Benzi and Simoncini

To reduce the memory footprint, we employ an approximation for computing (3.36), first discussed by Benzi and Simoncini [6], and refine it by an additional approximation in Sect. 3.3.3. In the following, the notation from Benzi and Simoncini is slightly adapted to our definition (2.1) of the Kronecker sum, which differs from theirs (\(A \oplus B = B \otimes I + I \otimes A\)).

The approach uses the Arnoldi iteration [21] to determine orthonormal bases \(P_m\), \(Q_m\) and the corresponding Hessenberg matrices \(T_1\) and \(T_2\) of the two Krylov subspaces \(\mathcal {K}(M_1, b_1)\), \(\mathcal {K}(M_2, b_2)\). The matrices are connected by a standard relation of Krylov subspaces [8, Section 13.2.1],

$$\begin{aligned} M_1 P_{m}&= P_{m} T_{1} + t_{1} p_{m+1} e_{m}^{\top }, \end{aligned}$$
(3.38a)
$$\begin{aligned} M_2 Q_{m}&= Q_{m} T_{2} + t_{2} q_{m+1} e_{m}^{\top }, \end{aligned}$$
(3.38b)

where \(t_1 \in {\mathbb {R}}\), \(p_{m+1} \in {\mathbb {R}}^k\) (resp. \(t_2 \in {\mathbb {R}}\), \(q_{m+1} \in {\mathbb {R}}^k\)) refer to the entries of the Hessenberg matrices and the orthonormal bases in the next step of the Arnoldi iteration. With these formulas, we deduce

$$\begin{aligned}{} & {} (M_1 \oplus M_2) (P_m \otimes Q_m)\nonumber \\{} & {} \overset{({2.1})}{=} (M_1P_m \otimes Q_m) + (P_m \otimes M_2 Q_m) \end{aligned}$$
(3.39a)
$$\begin{aligned}{} & {} \overset{({3.38})}{=} (P_{m} T_{1} + t_{1} p_{m+1} e_{m}^{\top } \otimes Q_m)\nonumber \\{} & {} \quad \qquad + (P_m \otimes Q_{m} T_{2} + P_m \otimes t_{2} q_{m+1} e_{m}^{\top }) \end{aligned}$$
(3.39b)
$$\begin{aligned}{} & {} \quad = (P_m \otimes Q_m)(T_1 \oplus T_2) +( t_{1} p_{m+1} e_{m}^{\top } \otimes Q_m) \nonumber \\{} & {} \qquad + (P_m \otimes t_{2} q_{m+1} e_{m}^{\top }). \end{aligned}$$
(3.39c)

Ignoring the last two summands and multiplying by \((P_m \otimes Q_m)^\top \) yields the approximation

$$\begin{aligned} (M_1 \oplus M_2)&\approx (P_m \otimes Q_m)(T_1 \oplus T_2)(P_m \otimes Q_m)^\top , \end{aligned}$$
(3.40)

which, after applying f and multiplying by \(b_1 \otimes b_2\), leads to the approximation

$$\begin{aligned}&f(M_{1}\oplus M_{2}) (b_1 \otimes b_2)\nonumber \\ {}&\quad \approx (P_m \otimes Q_m)f(T_1 \oplus T_2)(P_m \otimes Q_m)^\top (b_1 \otimes b_2) \end{aligned}$$
(3.41)

of the expression (3.36) as proposed by Benzi and Simoncini. We note that, due to the orthonormality of the bases \(P_m\) and \(Q_m\) and their relation to the vectors \(b_{1}, b_{2}\) that generate the subspaces \(\mathcal {K}(M_1, b_1)\), \(\mathcal {K}(M_2, b_2)\), the approximation simplifies to

$$\begin{aligned}&f(M_{1}\oplus M_{2}) (b_1 \otimes b_2) \nonumber \\ {}&\quad \approx \Vert b_1\Vert \Vert b_2\Vert (P_m \otimes Q_m)f(T_1 \oplus T_2) e_1 \end{aligned}$$
(3.42a)
$$\begin{aligned}&\quad = \Vert b_1\Vert \Vert b_2\Vert {{\,\textrm{vec}\,}}_r\left( P_m \ {{\,\textrm{vec}\,}}_r^{-1}\big ( f(T_1 \oplus T_2) e_1 \big ) Q_m^\top \right) , \end{aligned}$$
(3.42b)

where \(e_1 \in {\mathbb {R}}^{m^2}\) denotes the first unit vector.
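In code, the approximation (3.42b) amounts to two independent Arnoldi iterations followed by one small matrix function evaluation. The following self-contained sketch (ours; random data, \(f=\varphi \)) compares it with the exact value of (3.36):

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(7)
k, m = 30, 8
M1 = rng.standard_normal((k, k)) / k
M2 = rng.standard_normal((k, k)) / k
b1 = rng.standard_normal(k)
b2 = rng.standard_normal(k)

def arnoldi(A, v, m):
    Q = np.zeros((A.shape[0], m + 1)); H = np.zeros((m + 1, m))
    Q[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ Q[:, j]
        for i in range(j + 1):
            H[i, j] = Q[:, i] @ w
            w -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]
    return Q[:, :m], H[:m, :m]

def phi(M):
    # phi(M): top-right block of expm([[M, I], [0, 0]]), cf. (2.29)
    d = M.shape[0]
    Aug = np.zeros((2 * d, 2 * d)); Aug[:d, :d] = M; Aug[:d, d:] = np.eye(d)
    return expm(Aug)[:d, d:]

def ksum(A, B):
    # Kronecker sum (2.1)
    return np.kron(A, np.eye(B.shape[0])) + np.kron(np.eye(A.shape[0]), B)

# exact value of (3.36) for f = phi
exact = phi(ksum(M1, M2)) @ np.kron(b1, b2)

# two-sided Krylov approximation (3.42b)
P, T1 = arnoldi(M1, b1, m)
Q, T2 = arnoldi(M2, b2, m)
small = phi(ksum(T1, T2))[:, 0].reshape(m, m)          # vec_r^{-1}(phi(T1 ⊕ T2) e_1)
approx = np.linalg.norm(b1) * np.linalg.norm(b2) * (P @ small @ Q.T).reshape(-1)
print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))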

Remark 3.10

(complexity of the approximation (3.42b)) Computing and storing the matrices \(P_m\), \(Q_m\), \(T_1\) and \(T_2\) has space complexity \(\mathcal {O}(2 k m^2)\) and a time complexity of \(\mathcal {O}(2k(m+1))\) [22, p. 132]. Storing the matrices \(T_1 \oplus T_2\) and \(f(T_1 \oplus T_2)\) has complexity \(\mathcal {O}(m^4)\). Finally, multiplying the three matrices \(P_m \in {{\mathbb {R}}}^{k \times m}\), \({{\,\textrm{vec}\,}}_r^{-1}\left( f(T_1 \oplus T_2) e_1 \right) \in {{\mathbb {R}}}^{m \times m}\) and \(Q_m^\top \in {{\mathbb {R}}}^{m \times k}\) has time complexity \(\mathcal {O}(k^2m + km^2)\) and space complexity \(\mathcal {O}(k^2 + km)\).

Ignoring negligible terms (recall \(m \ll k\)), the entire approximation has computational complexity \(\mathcal {O}(k^2m)\) and storage complexity \(\mathcal {O}(k^2)\). Compared to the Krylov subspace approximation of (3.36) discussed in the preceding section, this is a reduction of space complexity by a factor \(m^2\).

Consider an image with \(512 \times 512\) pixels (\(|I| = 262\,144\)) and \(|J|=10\) labels as in Remark 3.9. Then the approximation (3.42b) requires storing a bit more than 50 terabytes. While this is a huge improvement compared to the 5000 terabytes of the Krylov approximation (see Remark 3.9), it is still computationally infeasible. This motivates the additional low-rank approximation introduced below, which yields a computationally feasible and efficient gradient approximation.

3.3.3 Low-Rank Approximation

We consider again the approximation (3.42b)

$$\begin{aligned}&f(M_{1}\oplus M_{2}) (b_1 \otimes b_2)\nonumber \\ {}&\quad \approx \Vert b_1\Vert \Vert b_2\Vert {{\,\textrm{vec}\,}}_r\left( P_m \ {{\,\textrm{vec}\,}}_r^{-1}\big ( f(T_1 \oplus T_2) e_1 \big ) Q_m^\top \right) \end{aligned}$$
(3.43)

and decompose the matrix \({{\,\textrm{vec}\,}}_r^{-1}\big ( f(T_1 \oplus T_2) e_1 \big ) \in {{\mathbb {R}}}^{m \times m}\) using the singular value decomposition (SVD)

$$\begin{aligned} {{\,\textrm{vec}\,}}_r^{-1}\big ( f(T_1 \oplus T_2) e_1 \big ) = \sum _{i \in [m] }\sigma _{i} y_{i}\otimes z_{i}^{\top }, \end{aligned}$$
(3.44)

with \(y_{i}, z_{i} \in {{\mathbb {R}}}^{m}\) and the singular values \(\sigma _{i} \in {{\mathbb {R}}},\; i\in [m]\). As m is generally quite small, computing the SVD is neither computationally nor storage-wise expensive. We accordingly rewrite the approximation in the form

$$\begin{aligned} \Vert b_1\Vert \Vert b_2\Vert&{{\,\textrm{vec}\,}}_r\left( P_m \ {{\,\textrm{vec}\,}}_r^{-1}\big ( f(T_1 \oplus T_2) e_1 \big ) Q_m^\top \right) \end{aligned}$$
(3.45a)
$$\begin{aligned} =~&\Vert b_1\Vert \Vert b_2\Vert {{\,\textrm{vec}\,}}_r\left( P_m \Big ( \sum _{i \in [m] }\sigma _{i} y_{i}\otimes z_{i}^{\top } \Big ) Q_m^\top \right) \end{aligned}$$
(3.45b)
$$\begin{aligned} =~&\Vert b_1\Vert \Vert b_2\Vert \sum _{i \in [m] } \sigma _{i} (P_m y_{i}) \otimes (Q_m z_{i}). \end{aligned}$$
(3.45c)

Remark 3.11

(space complexity) While the factorized form (3.45c) is equal to the approximation (3.42b), it requires only a fraction of the storage space: the intermediate results require storing m singular values and k numbers for each \(P_m y_{i}\) and \(Q_m z_{i}\), and the final approximation has an additional storage requirement of \(\mathcal {O}(2 k m)\). In total, \(\mathcal {O}(4 k m)\) numbers need to be stored.

For a \(512 \times 512\) pixel image with 10 labels (see Remark 3.9), storing this approximation requires at most a gigabyte of memory.

In practice, this can be further improved: Numerical experiments show that the singular values decay very rapidly, such that just the first singular value can be used to obtain the gradient approximation

$$\begin{aligned} f(M_{1}\oplus M_{2}) (b_1 \otimes b_2) \approx \Vert b_1\Vert \Vert b_2\Vert \sigma _{1} (P_m y_{1}) \otimes (Q_m z_{1}). \end{aligned}$$
(3.46)

Numerical results in Sect. 4 demonstrate that this approximation is sufficiently accurate.
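The low-rank step (3.44)–(3.46) only involves the small \(m\times m\) matrix and the Krylov bases. A sketch of the truncation (ours; the matrix G below is a synthetic stand-in for \({{\,\textrm{vec}\,}}_r^{-1}\big (f(T_1\oplus T_2)e_1\big )\) with rapidly decaying singular values, and the orthonormal matrices stand in for \(P_m, Q_m\)):

import numpy as np

rng = np.random.default_rng(8)
k, m = 30, 8
P = np.linalg.qr(rng.standard_normal((k, m)))[0]         # stand-in for P_m
Q = np.linalg.qr(rng.standard_normal((k, m)))[0]         # stand-in for Q_m
G = rng.standard_normal((m, m)) * (0.1 ** np.arange(m))  # stand-in with decaying spectrum

full = (P @ G @ Q.T).reshape(-1)                         # approximation (3.43), up to ||b1|| ||b2||

# SVD (3.44) and rank-one truncation (3.46)
U, sigma, Vt = np.linalg.svd(G)
y1, z1 = U[:, 0], Vt[0]
rank1 = sigma[0] * np.kron(P @ y1, Q @ z1)               # sigma_1 (P_m y_1) ⊗ (Q_m z_1)

print(np.linalg.norm(full - rank1) / np.linalg.norm(full))   # error of the rank-one truncation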

Remark 3.12

(space complexity) The term \(\Vert b_1\Vert \Vert b_2\Vert \sigma _{1} (P_my_{1}) \otimes (Q_m z_{1})\) requires storing \(\mathcal {O}(2k)\) numbers, i.e., about twice as much storage space as the original image. In total, we need to store \(\mathcal {O}(2k + 2 k m)\) numbers. The required storage for the running example (see Remark 3.9) now adds up to less than 500 megabytes.

We conclude this section by returning to our problem using the notation (3.37) and state the proposed low-rank approximation of the loss function gradient. By (3.33), (3.35), (3.37) and (3.46), we have

$$\begin{aligned} \partial \mathcal {L}(\Omega ) \approx c(\Omega ) \cdot {{\,\textrm{vec}\,}}_{r}^{-1}\Big ( df_{\mathcal {A}}\big ({{\,\textrm{vec}\,}}_{r}(\Omega )\big )^{\top } \big (\sigma _{1} (P_m y_{1}) \otimes (Q_m z_{1})\big )\Big ), \end{aligned}$$
(3.47a)

where

$$\begin{aligned} c(\Omega )&= \big \Vert \big ({{\,\textrm{expm}\,}}(T A^{J}(\Omega )),v_{T}(\Omega )\big )^\top \partial f_{\mathcal {L}}\big (v_{T}(\Omega )\big )\big \Vert , \end{aligned}$$
(3.47b)
$$\begin{aligned} v_{T}(\Omega )&= v(T;\Omega )\qquad \text {(cf.~(2.26c))} \end{aligned}$$
(3.47c)
$$\begin{aligned} \sigma _{1}y_{1}\otimes z_{1}^{\top }&\approx {{\,\textrm{vec}\,}}_{r}^{-1}\big (\varphi (T_{1}\oplus T_{2})e_{1}\big ).\nonumber \\ {}&\quad \text {(largest singular value and vectors)} \end{aligned}$$
(3.47d)

Here, the matrices \(P_{m}, Q_{m}, T_{1}, T_{2}\) result from the Arnoldi iteration, cf. (3.38), that returns the two Krylov subspaces used to approximate the matrix vector product \(\varphi (-\mathcal {A}(\Omega )^{\top }\oplus \mathcal {A}(\Omega )) b_{1}\), with \(b_{1}\) given by (3.37b).
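For reference, the following is a minimal, generic sketch of such a Krylov subspace approximation of a \(\varphi \)-matrix-vector product (NumPy/SciPy; a textbook implementation, not the optimized code used in our experiments). The large matrix is accessed only through matrix-vector products, and \(\varphi \) of the small Hessenberg matrix is evaluated via the standard augmented-matrix trick.

```python
import numpy as np
from scipy.linalg import expm

def phi1(H):
    # phi(z) = (e^z - 1)/z evaluated at a small matrix H via the augmented block matrix
    # expm([[H, I], [0, 0]]) = [[expm(H), phi(H)], [0, I]].
    m = H.shape[0]
    aug = np.zeros((2 * m, 2 * m))
    aug[:m, :m] = H
    aug[:m, m:] = np.eye(m)
    return expm(aug)[:m, m:]

def arnoldi(matvec, b, m):
    # Standard Arnoldi iteration: orthonormal Krylov basis P (n x m) and
    # Hessenberg matrix H (m x m) with H ~ P^T A P.
    n = b.shape[0]
    P = np.zeros((n, m))
    H = np.zeros((m + 1, m))
    P[:, 0] = b / np.linalg.norm(b)
    for j in range(m):
        w = matvec(P[:, j])
        for i in range(j + 1):
            H[i, j] = P[:, i] @ w
            w = w - H[i, j] * P[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if j + 1 < m and H[j + 1, j] > 1e-12:
            P[:, j + 1] = w / H[j + 1, j]
    return P, H[:m, :m]

def krylov_phi(matvec, b, m=10):
    # Krylov approximation  phi(A) b  ~  ||b|| P_m phi(H_m) e_1.
    P, H = arnoldi(matvec, b, m)
    return np.linalg.norm(b) * (P @ phi1(H)[:, 0])

# Example on a moderately sized dense test matrix; the relative error is typically
# already tiny for small Krylov dimensions.
rng = np.random.default_rng(0)
n = 400
A = rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)
v = krylov_phi(lambda x: A @ x, b, m=20)
print(np.linalg.norm(v - phi1(A) @ b) / np.linalg.norm(phi1(A) @ b))
```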

3.4 Computing the Gradient Using Automatic Differentiation

An entirely different approach to computing the gradient \(\partial \mathcal {L}(\Omega )\) of the loss function (3.6) is to not use an approximation of the exact gradient given in closed form by (3.8), but to replace the solution \(v_{T}(\Omega )\) to the linearized assignment flow in (3.33b) by an approximation determined by a numerical integration scheme and to compute the exact gradient therefrom. Thus, one replaces a differentiate-then-approximate approach by an approximate-then-differentiate alternative. We numerically compare these two approaches in Sect. 4.

We sketch the latter alternative. Consider again the loss function (3.6) evaluated at the linearized assignment flow integrated up to time T

$$\begin{aligned} \mathcal {L}(\Omega ) = f_{\mathcal {L}}\big (v_{T}(\Omega )\big ). \end{aligned}$$
(3.48)

Gradient approximations determined by automatic differentiation depend on the numerical scheme used. We pick two basic choices from the broad range of suitable schemes studied in [27]. In both cases, we implemented the loss function \(f_{\mathcal {L}}\) in PyTorch together with the functions \(\Omega \mapsto A^{J}(\Omega )\) and \(\Omega \mapsto b(\Omega )\) given by (2.26). Two approximations can now be distinguished, depending on how the mapping \((A^{J}(\Omega ), b(\Omega )) \mapsto v_{T}(\Omega )=v(T;\Omega )\) is implemented.

Automatic Differentiation Based on the Explicit Euler Scheme We partition the interval [0, T] into T/h subintervals with some step size \(h>0\) and use the iterative scheme

$$\begin{aligned} v^{(k+1)} = v^{(k)} + h \big ( A^J(\Omega ) v^{(k)} + b(\Omega ) \big ), \qquad v^{(0)} = 0, \end{aligned}$$
(3.49)

in order to approximate the solution of the linearized assignment flow ODE (2.26a) by \(v_{T}(\Omega ) \approx v^{(T/h)}\). As the computations only involve basic linear algebra, PyTorch is able to compute the gradient using automatic differentiation.
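The following self-contained PyTorch sketch illustrates this approach; the mappings A_J_of, b_of and the loss f_L are random differentiable stand-ins for (2.26) and (3.4), introduced only so that the snippet runs.

```python
import torch
torch.manual_seed(0)

n = 64                                        # illustrative problem size
M = torch.randn(n, n) / n                     # fixed random data: stand-ins only,
c = torch.randn(n)                            # the real A^J(Omega), b(Omega) are given by (2.26)
target = torch.randn(n)

def A_J_of(Omega):                            # placeholder mapping, differentiable in Omega
    return M * Omega.mean()

def b_of(Omega):                              # placeholder mapping
    return c * Omega.mean()

def f_L(v):                                   # placeholder for the loss (3.4)
    return ((v - target) ** 2).sum()

def loss_via_euler(Omega, T=1.0, h=0.01):
    # Explicit Euler scheme (3.49): v^(k+1) = v^(k) + h (A^J(Omega) v^(k) + b(Omega)), v^(0) = 0.
    A, b = A_J_of(Omega), b_of(Omega)
    v = torch.zeros_like(b)
    for _ in range(int(round(T / h))):
        v = v + h * (A @ v + b)
    return f_L(v)

Omega = torch.full((3, 3), 1.0 / 9.0, requires_grad=True)   # uniform 3x3 weight patch
loss_via_euler(Omega).backward()                            # reverse-mode autodiff
print(Omega.grad)                                           # Euclidean gradient w.r.t. Omega
```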

Automatic Differentiation Based on Exponential Integration The second approximation utilizes the numerical integration scheme developed in Sect. 2.4. Again, only basic operations of linear algebra are involved, so that PyTorch can compute the gradient using automatic differentiation. The matrix exponential (2.29) specific to this scheme is computed by PyTorch using a Taylor polynomial approximation [4].
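A corresponding sketch for this variant, reusing the placeholder mappings A_J_of, b_of, f_L from the previous snippet; for brevity, the \(\varphi \)-function applied to a vector is evaluated here via the full augmented matrix exponential with torch.linalg.matrix_exp, which PyTorch differentiates as well, whereas the scheme of Sect. 2.4 uses a Krylov approximation.

```python
def loss_via_expm(Omega, T=1.0):
    # Exponential integration: v(T) = T * phi(T A) b with phi(z) = (e^z - 1)/z,
    # obtained from the augmented matrix exponential
    #   expm([[T A, T b], [0, 0]]) = [[expm(T A), T phi(T A) b], [0, 1]].
    A, b = A_J_of(Omega), b_of(Omega)
    n = b.shape[0]
    top = torch.cat([T * A, (T * b).unsqueeze(1)], dim=1)    # (n, n+1)
    aug = torch.cat([top, torch.zeros(1, n + 1)], dim=0)     # (n+1, n+1)
    v_T = torch.linalg.matrix_exp(aug)[:n, n]
    return f_L(v_T)

Omega2 = torch.full((3, 3), 1.0 / 9.0, requires_grad=True)
loss_via_expm(Omega2).backward()                             # autodiff through matrix_exp
print(Omega2.grad)
```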

Both approaches determine an approximation of the Euclidean gradient \(\partial \mathcal {L}(\Omega )\) which we subsequently convert into an approximation of the Riemannian gradient using Eq. (3.8).

4 Experiments

In this section, we report and discuss a series of experiments illustrating our novel gradient approximation (3.47) and the applicability of the linearized assignment flow to the image labeling problem.

We start with a discussion of the data generation (Sect. 4.1) and the general experimental setup (Sect. 4.2), before discussing properties of the gradient approximation (Sect. 4.3). In order to illustrate a complete pipeline that can also label previously unseen images, we trained a simple parameter predictor and report its application in Sect. 4.4.

4.1 Data Generation

As for the experiments, we focused on the image labeling scenarios depicted in Fig. 1a, b. Each scenario consists of a set of five \(128 \times 128\) pixel images with random Voronoi structure, designed to mimic low-dimensional structure that has to be separated from the background in noisy data. This task occurs frequently in applications and cannot be solved without adaptive regularization.

For the design of the parameter predictor (Sect. 4.4), we used all patches of five additional unseen images for validation. In all cases, we report the mean over all labeled pixels of the five training and five validation images, respectively. In order to test the resilience to noise, we added Gaussian noise to the images. The ground truth labeling is, in both labeling scenarios, given by the noiseless version of the images.

In the first scenario illustrated in Fig. 1a, we want to separate the boundary of the cells (black label) from their interior (white label). The main difficulty here is to preserve the thin line structures even in the presence of image noise. Weight patches with uniform (uninformed) weights average out most of the lines as Fig. 1c shows.

In the second scenario illustrated in Fig. 1b, we label the Voronoi cells according to their color represented by 8 labels. Due to superimposed noise, a pixelwise local rounding to the nearest label yields about 50% wrongly labeled pixels, see Fig. 1d.
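For concreteness, the following NumPy sketch shows one plausible way to generate such randomized Voronoi test data; the cell count and noise level are arbitrary illustrative choices, not the exact settings used for Fig. 1.

```python
import numpy as np

def random_voronoi_labels(size=128, n_cells=20, seed=0):
    # Assign each pixel to its nearest random seed point: a random Voronoi partition.
    rng = np.random.default_rng(seed)
    seeds = rng.uniform(0, size, (n_cells, 2))
    yy, xx = np.mgrid[0:size, 0:size]
    d2 = (yy[..., None] - seeds[:, 0]) ** 2 + (xx[..., None] - seeds[:, 1]) ** 2
    return d2.argmin(axis=-1)                    # (size, size) cell index per pixel

labels = random_voronoi_labels()

# Scenario (b): one random color per cell, plus additive Gaussian noise.
rng = np.random.default_rng(1)
colors = rng.uniform(0.0, 1.0, (labels.max() + 1, 3))
clean = colors[labels]
noisy = clean + 0.4 * rng.standard_normal(clean.shape)

# Scenario (a): cell boundaries (black) vs. interior (white); np.roll makes the
# boundary detection periodic at the image border, which suffices for a sketch.
boundary = (labels != np.roll(labels, 1, axis=0)) | (labels != np.roll(labels, 1, axis=1))
```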

Fig. 1

Randomized scenarios for training and testing. Two randomly generated images for the two respective scenarios that were used to evaluate weight parameter estimation and prediction. a Random line structure whose accurate labeling requires adapting weight parameters. b Random Voronoi cells to be labeled by pixelwise assignment of one of eight colors. In both cases, Gaussian noise was added. The resulting noisy images are shown in the lower part of either panel (rescaled in the color channels to avoid color clipping). c The amount of noise is chosen quite large, such that a labeling with uniform (“uninformed”, non-adaptive) weights completely destroys the thin line structure in a. d A pixelwise local nearest-label assignment yields around 50% wrongly labeled pixels for the labeling scenario depicted in b. Both of these naive parameter settings indicate the need for a more structured choice of the weight patches, taking into account local image features in a local spatial neighborhood

4.2 Experimental Setup

Features and Parametrization For simplicity, we used the raw image data in a \(3 \times 3\) window around each pixel as feature (2.6) for this pixel. Weight patches \((\omega _{ik})_{k\in \mathcal {N}_{i}}\) in the \(\Omega \)-matrix (2.18) also had the size of \(3 \times 3\) pixels in all experiments. While the linearized assignment flow works with arbitrary features and also with larger neighborhood sizes for the weight parameters, the above setup suffices to illustrate and substantiate the contribution of this paper.
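For concreteness, extracting such \(3 \times 3\) raw-intensity features can be sketched as follows; the reflective border padding is an assumption made only for this illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

img = np.random.default_rng(0).random((128, 128, 3))         # stand-in RGB image

# Per-pixel features: raw intensities in a 3x3 window, flattened to 3*|N| = 27 numbers.
padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="reflect")
patches = sliding_window_view(padded, (3, 3), axis=(0, 1))    # shape (128, 128, 3, 3, 3)
features = patches.reshape(128, 128, -1)                      # shape (128, 128, 27)

# Uniform ("uninformed") 3x3 weight patch used as initialization:
omega_uniform = np.full((3, 3), 1.0 / 9.0)
```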

Performance Measure All labelings were evaluated on the tangent space of the assignment manifold using the loss function \(f_\mathcal {L}\) given by (3.4). Since the values of this function are rather abstract, however, we report the percentage of wrongly labeled pixels in all performance plots.

Gradient Computation We evaluated the loss function and approximated its Riemannian gradient in three different ways, as further detailed in Sect. 4.3, throughout using uniform (uninformed) weight patches as initialization. Other common parameter update schemes, like Adam or AdaMax [15], could be used in conjunction with our approach as well. In addition, we compared gradient approximations based on our approach with the results of automatic differentiation, as implemented by PyTorch [19].

Parameter Prediction Parameter prediction for labeling novel data relies on the relation between features extracted from training data and the corresponding parameters estimated by the Riemannian gradient descent (3.7). For any feature extracted from novel data, the predictor specifies the parameters, which are then used for labeling the data by integrating the linearized assignment flow with the predicted parameters substituted. Details are provided in Sect. 4.4.

4.3 Properties of the Gradient Approximation

In this section, we report results that empirically validate our novel gradient approximation (3.47) by means of parameter estimation for the linearized assignment flow.

First, we compared our gradient approximation with two methods based on automatic differentiation (backpropagation), see also Sect. 3.4. To this end, we implemented in PyTorch [19] the simple explicit Euler scheme (3.49) for integrating the linearized assignment flow and computed the gradient of the loss function \(\mathcal {L}(\Omega )\) (3.6) with respect to \(\Omega \) using automatic differentiation. Similarly, the Krylov subspace approximation (2.28b) of the solution of the linearized assignment flow was implemented in PyTorch. As all involved computations in this approximation are basic linear algebra operations, PyTorch is able to apply automatic differentiation for evaluating the gradient.

These gradients are used for carrying out the gradient descent iteration (3.7) in order to optimize the weight parameters. Figure 2 illustrates the comparison of the three approaches. Although they rely on quite different principles, we observe a remarkable comparability of the three approaches with respect to the reduction of the percentage of wrongly labeled pixels per training iteration, for both noisy and noiseless images. In particular, our low-rank approximation based on the closed-form loss function gradient expression is competitive. In view of the minor differences between the curves, we point out that changing hyperparameters, like the step size in the gradient descent or the scale parameter \(\tau \) of the regularizer \(\mathcal {R}\) in (3.5), has a greater effect on the training performance than the choice of either of the three approaches. Overall, these results validate the closed-form formulas in Sect. 3.2 and, in particular, Theorem 3.8, and the subsequent low-rank approximation in Sect. 3.3. We point out, however, that only our approach reveals the data-dependent low-dimensional subspaces in which the essential parameters of the linearized assignment flow reside.

Fig. 2

Comparing gradient approximation and automatic differentiation. Both figures show, for the second scenario depicted in Fig. 1b, the effect of parameter learning in terms of the labeling error during the training procedure (3.7). a Shows the result for noisy input data, b for noiseless input data. Note the different scales of the two ordinates. As is exemplarily shown here by both figures, we generally observed very similar results for all three algorithms, which validates the closed-form formulas in Sect. 3.2 and the subsequent subspace approximation in Sect. 3.3

Next, we compared our gradient approximation to the exact gradient on a per-pixel basis. However, as the exact gradient is computationally infeasible, we used the gradient produced by automatic differentiation of the explicit Euler scheme with a very small step size as surrogate. Figure 3a demonstrates the high accuracy of our gradient approximation. A pixelwise illustration of the gradient approximation, at the initial step of the training procedure for adapting the parameters, is provided in Fig. 3b. The set of pixels with nonzero loss function gradient concentrates around the line structure, since here weight adaption is required to achieve a proper labeling.

Fig. 3

Checking the gradient approximation at each pixel. We evaluated our gradient approximation (3.47), at the first step of the training iteration and at each pixel, for the scenario depicted in Fig. 1a. As a proxy for the exact but computationally infeasible gradient, we used the gradient produced by automatic differentiation of the explicit Euler scheme with a very small step size. Then, we compared both gradients at each pixel using the cosine similarity, i.e., the value 1 means that the gradients point exactly in the same direction, whereas 0 signals orthogonality and \(-1\) means that they point in opposite directions. a More than 99% of the pixels have a value of 0.9 or more, corresponding to an angle of \(26^\circ \) or less between the gradient directions. This illustrates excellent agreement between our gradient approximation and the exact gradient. Disagreements with the exact gradient occur rarely and randomly at isolated pixels throughout the image. b The norms of the gradients are displayed at each pixel. Non-vanishing norms indicate where parameter learning (adaption) occurs. Since the initial weight parameter patches are uniform, no adaption—corresponding to zero norms of gradients—occurs in the interior of each Voronoi cell, because parameters are already optimal in such homogeneous regions

Our last three experiments regarding the gradient approximation, illustrated in Fig. 4, concern

  • the influence of the Krylov dimension m,

  • the rank of our approximation, and

  • the time T up to which the linearized assignment flow is integrated.

We observe, according to Fig. 4a, that Krylov subspaces of small dimension \(m \approx 10\) already suffice for computing linearized assignment flows and learning their parameters. Similarly, the final rank-one approximation of the gradient according to Eq. (3.46) suffices for parameter estimation, as illustrated in Fig. 4b. These experiments show that quite low-dimensional representations suffice for representing the information required for optimal regularization of dynamic image labeling. We point out that such insights cannot be gained from automatic differentiation.

The influence of the time T used for integrating the linearized assignment flow on parameter learning is illustrated in Fig. 4c. For the considered parameter estimation setup, we observe that small integration times T already yield good training results, whereas large times T yield slower convergence. A possible explanation is that, in the latter case, the linearized assignment flow is close to an integral solution which, when erroneous, is more difficult to correct.

Fig. 4

Influence of Krylov subspace dimension, rank of the gradient approximation and integration time. The setup of Fig. 2 was used to demonstrate the influence of the Krylov subspace dimension, the low-rank approximation and the integration time T on our gradient approximation for parameter learning. a In general, we observed that Krylov dimensions of 5 to 10 are sufficient for most experiments. Larger Krylov dimensions only increase the computation time without any noticeable improvement in accuracy. b Training curves for different low-rank approximations coincide. This illustrates that just selecting the largest singular value and vectors in (3.47), according to the final rank-one approximation (3.46), suffices for parameter learning. c For small integration times T, the convergence rates of training do not differ much. Only for larger time points T do we observe slower convergence of training, presumably because almost hard decisions are more difficult to correct by changing the parameters of the underlying dynamical system

4.4 Parameter Prediction

Besides parameter learning, parameter prediction for unseen test data defines another important task. This task amounts to modeling and representing the relation between local features and optimal weight parameters, as a basis for predicting proper weights for unseen test data as a function of the corresponding local features.

We illustrate this for the scenario depicted in Fig. 1a using the following simple end-to-end learned approach to parameter prediction. We trained a predictor that produces a weight patch \(\widehat{\Omega }_{i}\) given the features \(f_i\) at vertex i of novel unseen data. The predictor is parameterized with \(N=50\) by

$$\begin{aligned}&p_j \in {\mathbb {R}}^{3 |\mathcal {N}|},\; j\in [N] \quad \text {feature prototypes,} \end{aligned}$$
(4.1a)
$$\begin{aligned}&\nu _j \in T_0,\; j\in [N] \quad \text {tangent vectors representing prototypical weight patches,} \end{aligned}$$
(4.1b)

and a scale parameter \(\sigma \in {\mathbb {R}}\). Similar to the assignment vectors (2.7), the to-be-predicted weight patches \(\widehat{\Omega }_{i}\) are elements of the probability simplex \(\mathring{\Delta }_{|\mathcal {N}_{i}|}\), see (2.18). Accordingly, we use tangent vectors \(\nu _{j}\in T_0\) to represent weight patches. In particular, the tangent vector of a predicted weight patch results from a weighted average of the vectors \(\{\nu _{j}\}_{j\in [N]}\), and the predicted weight patch itself results from lifting, see (4.4).

We initialize \(\sigma = 1\) and initialize the \(p_j,\,j\in [N]\) by clustering noise-free patches extracted from the training images. Given \(p_{j}\), we initialize \(\nu _j\) such that it is directed toward the label of the corresponding prototypical patch,

$$\begin{aligned} \nu _{j}&= \Pi _0 \begin{pmatrix} e^{ - \Vert p_{j,1} - p_{j,\text {center pixel}} \Vert }, \dots , e^{ - \Vert p_{j,|\mathcal {N}|} - p_{j,\text {center pixel}} \Vert } \end{pmatrix}^\top ,\quad j\in [N]. \end{aligned}$$
(4.2)
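A minimal sketch of this initialization, assuming that \(\Pi _0\) denotes the orthogonal projection onto zero-mean vectors and that each prototype \(p_j\) is reshaped into a \(|\mathcal {N}|\times 3\) array of RGB values; both are illustrative assumptions.

```python
import numpy as np

def pi_0(v):
    # Projection onto the tangent space T_0; assumed here to be the orthogonal
    # projection onto zero-mean vectors, Pi_0 = I - (1/n) 1 1^T.
    return v - v.mean()

def init_nu(p):
    # Eq. (4.2): p is a prototype patch reshaped to (|N|, 3); the tangent vector
    # points towards patch entries similar to the center pixel.
    center = p[p.shape[0] // 2]
    return pi_0(np.exp(-np.linalg.norm(p - center, axis=1)))

# Example with a random prototype (|N| = 9 pixels, RGB):
rng = np.random.default_rng(0)
nu_j = init_nu(rng.random((9, 3)))      # lies in T_0: nu_j.sum() is (numerically) zero
```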
Fig. 5

Parameter predictor. We learned a weight patch predictor as described in Sect. 4.4 for the scenario depicted in Fig. 1a. In order to assess the predicted parameters by comparison, we also estimated weight patches for the noise-free test data in the same way as for the training data. a Section of a noise-free test image. b The corresponding section of the noisy test image that is used as input data for prediction. c The training and validation accuracy during the training of the predictor. d Weight patches estimated for the noise-free data (a). e Predicted weight patches based on the noisy data (b). f The labeled (section of the) test image using the predicted weight patches (e). Comparing this result to the result depicted in Fig. 1c shows the effect of predicted parameter adaption. Last row: further labelings on unseen noisy random test images

The predictor is trained by the following gradient descent iteration. As the change in the number of wrongly labeled pixels was small, we stopped the iteration after 100 steps, see Fig. 5c.

  (1)

    We compute the similarities

    $$\begin{aligned} s_{ij} = e^{-\sigma \Vert f_i-p_{j}\Vert },\qquad j\in [N] \end{aligned}$$
    (4.3)

    for each feature \(f_i\) at pixel i in all training images.

  (2)

    We predict the corresponding weight patches as the lifted weighted average of the tangent vectors \(\nu _{j}\)

    $$\begin{aligned} \widehat{\Omega }_{i}(\nu , p, \sigma ) = \exp _{{\mathbb {1}}_{\Omega }}\left( \sum _{j\in [N]}\frac{s_{ij}}{\sum _{k\in [N]}s_{ik}} \nu _{j}\right) . \end{aligned}$$
    (4.4)
  (3)

    Substituting \(\widehat{\Omega }\) for \(\Omega \), we run the linearized assignment flow and evaluate the distance function (3.4).

  (4)

    The gradient of this function with respect to the predictor parameters \((\nu ,p,\sigma )\) results from composing the differential due to Theorem 3.8 and the differential of (4.4).

  (5)

    The gradient is used to update the predictor parameters, and all steps are repeated.

During training, the accuracy of the predictor is monitored, as illustrated in Fig. 5c. The iteration terminates when the slope of the validation curve, which measures label changes, is sufficiently flat.
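A minimal sketch of the prediction steps (1)–(2), i.e., Eqs. (4.3) and (4.4), assuming that the lifting map \(\exp _{{\mathbb {1}}_{\Omega }}\) at the barycenter reduces to a softmax; all predictor parameters below are random stand-ins used only to make the snippet runnable.

```python
import numpy as np

def lift(v):
    # Lifting map exp_{1_Omega} at the barycenter; assumed here to reduce to a softmax,
    # mapping a tangent vector to a weight patch on the simplex.
    e = np.exp(v - v.max())
    return e / e.sum()

def predict_weight_patch(f_i, p, nu, sigma):
    # Steps (1)-(2), eqs. (4.3)-(4.4): similarity-weighted average of the prototype
    # tangent vectors nu_j, lifted to the simplex.
    s = np.exp(-sigma * np.linalg.norm(p - f_i, axis=1))     # similarities (4.3)
    v = (s / s.sum()) @ nu                                   # weighted average in T_0
    return lift(v)                                           # predicted patch (4.4)

# Illustration with random predictor parameters: N = 50 prototypes, 27-dimensional
# features, 3x3 = 9 weights per patch.
rng = np.random.default_rng(0)
p = rng.random((50, 27))
nu = rng.standard_normal((50, 9))
nu -= nu.mean(axis=1, keepdims=True)                         # put prototypes into T_0
patch = predict_weight_patch(rng.random(27), p, nu, sigma=1.0)   # entries sum to 1
```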

After the training of the predictor, the linearized assignment flow is parametrized in a data-driven way so as to reliably separate line structure in noisy data for arbitrary random instances, as depicted in Fig. 5: panel (f) and last row. This result should be compared to the non-adaptive labeling result in Fig. 1c.

5 Conclusion and Further Work

5.1 Conclusion

We presented a novel approach for learning the parameters of the linearized assignment flow for image labeling. Based on the exact formula for the parameter gradient of a loss function subject to the ODE constraint, an approximation of the gradient was derived using exponential integration and a Krylov subspace based low-rank approximation, which is memory-efficient and sufficiently accurate. Experiments demonstrate that our research implementation is on par with highly tuned machine learning toolboxes. Unlike the latter, however, our approach additionally returns the essential information for image labeling in terms of a low-dimensional parameter subspace.

5.2 Future Work

Our future work will study generalizations of the linearized assignment flow. Since this can be done within the overall mathematical framework of the assignment flow approach, the result presented in this paper is applicable. We briefly indicate this for the continuous-time ODE (1.1) that we write down here again with an index 0,

$$\begin{aligned} \dot{V}_{0} = A_{0}(\Omega _{0}) V_{0} + B_{0}. \end{aligned}$$
(5.1)

Recall that \(B_{0}\), given by \(B_{W_{0}}\) of (2.22b), represents the input data (2.15) via the mappings (2.16) and (2.17). Now suppose the data are represented in another way and denoted by \(B_{1}\). Then, consider the additional system

$$\begin{aligned} \dot{V}_{1} = A_{1}(\Omega _{1}) V_{1} + B_{1} + V_{0}(T_{0}) L, \end{aligned}$$
(5.2)

where the solution \(V_{0}(T_{0})\) to (5.1) at time \(t=T_{0}\), possibly transformed to a tangent subspace by a linear mapping L, modifies the data term \(B_{1}\) of (5.2). Applying (2.24) to (5.1) at time \(t=T_{0}\) and to (5.2) at time \(t=T_{1}\) yields the solution

$$\begin{aligned} V_{1}(T_{1}) = T_{1}\varphi \big (T_{1} A_{1}(\Omega _{1})\big )\Big (B_{1}+T_{0}\varphi \big (T_{0} A_{0}(\Omega _{0})\big )B_{0} L\Big ), \end{aligned}$$
(5.3)

which is a composition of linearized assignment flows and hence linear, too, due to the sequential coupling of (5.1) and (5.2). Parallel coupling of the dynamical systems is feasible as well and leads to a larger, structured matrix argument of \(\varphi \) that depends linearly on the components \(A_{0}(\Omega _{0}), A_{1}(\Omega _{1}), L\). Designing larger networks of this sort by repeating these steps is straightforward.

In either case, the overall basic structure of (1.1), (1.3) is preserved. This enables us to broaden the scope of assignment flows for applications and to study, in a controlled manner, various mathematical aspects of deep networks in terms of sequences of generalized linearized assignment flows, analogous to (1.6).