Learning Adaptive Regularization for Image Labeling Using Geometric Assignment

We study the inverse problem of model parameter learning for pixelwise image labeling, using the linear assignment flow and training data with ground truth. This is accomplished by a Riemannian gradient flow on the manifold of parameters that determines the regularization properties of the assignment flow. Using the symplectic partitioned Runge–Kutta method for numerical integration, it is shown that deriving the sensitivity conditions of the parameter learning problem and its discretization commute. A convenient property of our approach is that learning is based on exact inference. Carefully designed experiments demonstrate the performance of our approach, the expressiveness of the mathematical model as well as its limitations, from the viewpoint of statistical learning and optimal control.


INTRODUCTION
1.1.Overview and Scope.The image labeling problem, i.e. the problem to classify images pixelwise depending on the spatial context, has been thoroughly investigated during the last two decades.While the evaluation (inference) of such models is well understood [KAH + 15], learning the parameters of such models has remained elusive, in particular for models with higher connectivity of the underlying graph.Various sampling-based and other approximation methods exist (cf., e.g.[ZL02] and references therein), but the relation between approximations of the learning problem on the one hand, and approximations of the subordinate inference problem on the other hand, is less understood [Wai06].
In this paper, we focus on parameter learning for contextual pixelwise image labeling based on the assignment flow introduced by [ ÅPSS17].In comparison with discrete graphical models, an antipodal viewpoint was adopted by [ ÅPSS17] for the design of the assignment flow approach: Rather than performing nonsmooth convex outer relaxation and programming, followed by subsequent rounding to integral solutions that is common when working with large-scale discrete graphical models, the assignment flow provides a smooth nonconvex interior relaxation that performs rounding to integral solutions simultaneously.In [HS ÅS18] it was shown that the assignment flow can emulate a given discrete graphical model in terms of smoothed local Wasserstein distances, that evaluate the edge-based parameters of the graphical model.In comparison to established belief propagation iterations [YFW05,WJW05], the assignment flow driven by 'Wasserstein messages' [HS ÅS18] continuously takes into account basic constraints, which enables to compute good suboptimal solutions just by numerically integrating the flow in a proper way [ZSPS18].We refer to [Sch19] for summarizing recent work based on the assignment flow and a discussion of further aspects.
In this paper, we ignore the connection to discrete graphical models and focus on the parameter learning problem for the assignment flow directly.This problem was raised in [ ÅPSS17, Section 5 and Fig. 14].The present paper provides a detailed solution.Adopting the linear assignment flow as introduced by [ZSPS18], enables to cast the parameter estimation problem into the general form min p∈P C x(T, p) (1.1a) s.t.ẋ(t) = f (x(t), p, t), t ∈ [0, T ], x(0) = x 0 , (1.1b) where the parameters p determine the vector field of the linear assignment flow (1.1b) whose unique solution is evaluated at some point of time T by a suitable loss function (1.1a).This problem formulation has a range of advantages.
• Inference (labeling) that always defines a subroutine of a learning procedure, can be carried out exactly by means of numerically solving (1.1b).In other words, errors of approximate inference (e.g. as they occur with graphical models) are absent and cannot compromise the effectiveness of parameter learning.
• In addition, discretization effects can be handled in the most convenient way: we show that the operations of (i) deriving the optimality conditions of (1.1) and (ii) problem discretization commute if a proper numerical scheme is used.
As a result, we obtain a well-defined and relatively simple algorithm for parameter learning that is easy to implement and enables reproducible research.We report the results of a careful numerical evaluation in order to highlight the scope of our approach and its limitations.
We next discuss in a broader context our specific contributions that considerably elaborate a related conference paper [HSPS19].
1.2.Related Work, Contribution and Organization.The task to optimize parameters of a dynamical system (1.1) is a familiar one in the communities of scientific computing and optimal control [TV72, CLPS03], but may be less known to the imaging community.Therefore, we provide the necessary background in Section 2.1.
Geometric numerical integration of ODEs on manifolds is a mature field as well [HLW06].Here we have to distinguish between the integration of the assignment flow [ZSPS18] and integration schemes for numerically solving (1.1).The task to design the latter schemes faces the 'optimize-then-discretize' vs. 'discretizethen-optimize' dilemma.Conditions and ways to resolve this dilemma have been studied in the optimal control literature [Hag00,Ros06].See also the recent survey [SS16] and references therein.We provide the corresponding background in Sections 2.2 and 2.3 including detailed proof of Theorem 2.7 that is merely outlined in [SS16].The application to the linear assignment flow (Section 3) requires considerable work, taking into account that the state equation (1.1b) derives from the full nonlinear geometric assignment flow (Section 4).Section 4 concludes with specifying Algorithms 1 and 2 whose implementation realizes our approach.
From a more distant viewpoint, our work ties in with research on networks from a dynamical systems point of view, that emanated from [HZRS16] in computer science and has also been promoted recently in mathematics [E17].The recent work [HR17], for example, studied stability issues of discrete-time network dynamics using techniques of numerical ODE integration.The authors adopted the discretize-then-differentiate viewpoint on the parameter estimation problem and suggested symplectic numerical integration in order to achieve better stability.As mentioned above, our work contrasts in that inference is always exact during learning, unlike the more involved architecture of [HR17] where learning is based on approximate inference.Furthermore, in our case, symplectic numerical integration is a consequence of making the diagram of Figure 2.2 (page 8) commute.This property qualifies our approach as a proper (though rudimentary) method of optimal control (cf.[Ros06]).
We numerically evaluate our approach in Section 5, using a class of computer-generated random images such that learning the regularization parameters is necessary for accurately labeling each image pixelwise.It is demonstrated in Section 5.1 that, for each given ground truth labeling, the parameter estimation problem can be solved exactly.As a consequence, the performance of the assignment flow solely depends on the prediction map, i.e. the ability to map features extracted from novel data to proper weights as regularization parameters, using as examples both features and optimal parameters computed during the training phase.For statistical reasons, this task becomes feasible if the domain of the prediction map is restricted to local contexts, in terms of features observed within local windows.We discuss consequences for future work in Section 6.Finally, in Section 5.2, we conduct an experiment that highlights the remarkable model expressiveness of the assignment flow as well as limitations that result from learning constant parameters.
We conclude in Section 6.
1.3.Basic Notation.For the clarity of exposition we use general mathematical notation in Section 2 that should be standard, whereas specific notation related to the assignment flow is introduced in Section 3.
We set [n] = {1, 2, . . ., n} for n ∈ N and 1 n = (1, 1, . . ., 1) ∈ R n .For a matrix A ∈ R m×n , the i-th row vector is denoted by A i , i ∈ [m] and its transpose by A ∈ R n×m .a, b denotes the Euclidean inner product of a, b ∈ R n and A, B = i∈[n] A i , B i the (Frobenius) inner product between two matrices A, B ∈ R m×n .The probability simplex is denoted by Various orthogonal projections onto a convex set are generally denoted by Π and distinguished by a corresponding subscript, like Π n , Π P , • • • , etc.
We assume the reader to be familiar with elementary notions of Riemannian geometry as found, e.g., in [Lee13,Jos17].Specifically, given a Riemannian manifold (M, g) with metric g, and a smooth function f : M → R, the Riemannian gradient of f is denoted by grad f and given by where X denotes any smooth vector field on M, that returns the tangent vector X p ∈ T p M when evaluated at p ∈ M. The right-hand side of (1.2) denotes the differential df of f , acting on X.More generally, for a map F : M → N between manifolds, we write dF (p In the Euclidean case f : R n → R, the gradient is a column vector and denoted by ∂f .For F : R n → R m , we identify the differential dF ∈ R m×n with the Jacobian matrix.
with respect to the parameter vector x i is denoted by

SENSITIVITY ANALYSIS FOR DYNAMICAL SYSTEMS
In this section, we consider the constrained optimization problem (1.1) with a smooth objective function C : R nx → R. The constraints are given by a general initial value problem (IVP), which consist of a system of ordinary differential equations (ODEs) (1.1b) that is parametrized by a vector p ∈ P ⊂ R np , and an initial value x 0 ∈ R nx .To ensure existence, uniqueness and continuous differentiability of the solution trajectory x(t) on the whole time horizon [0, T ], we assume that f (•, p, •) of (1.1b) is Lipschitz continuous on R nx × [0, T ], for any p.
Since we assume the initial value x 0 and the time horizon [0, T ] to be fixed, the objective function (1.1a) effectively is a function of the parameter p, i.e.Φ : R np → R. In order to solve (1.1) with a gradient based method, we have to compute the gradient The term d p x(T, p) -called sensitivity -measures the sensitivity of the the solution trajectory x(t) at time T with respect to changes in the parameter p. Two basic approaches for determining (2.2) are stated in Section 2.1, and we briefly highlight why using one of them, the adjoint approach, is advantageous for computing sensitivities.In Section 2.2, we recall symplectic Runge-Kutta methods and conditions for preserving quadratic invariants.The latter property relates to the derivation of a class of numerical methods such that evaluating (2.2), which derives from the time-continuous problem (1.1), is identical to first discretizing (1.1) followed by computing the corresponding derived expression (2.2).Two specific instances of the general numerical scheme are detailed in Section 2.4.
2.1.Sensitivity Analysis.In this section we show that the sensitivity d p x(T, p) can be determined by solving two initial value problems defined below: the variational system and the adjoint system. ) with δ(t) ∈ R nx×np .If the initial value x(0) of (1.1b) depends on the parameters p, the initial value (2.4b) has to be adjusted as δ(0) = d p x(0).
Proof.A detailed proof can be found in [HNW08, Chapter I.14, Theorem 14.1].In order to make this paper self-contained, a sketch follows.
The integral representation of the solution to (1.1b) is given by x(t, p) = x 0 + t 0 f (x(s), p, s)ds.Differentiating with respect to p and exchanging integration and differentiation by the theorem of Lebesgue yields which is the integral representation of the trajectory δ(t) solving (2.4).
For the computation of the variational system (2.4) the solution x(t) is required.Since the variational system (2.4) is a matrix-valued system of dimension n x ×n p , the size of the system grows with the number of parameters n p .For small n p , solving the variational system is efficient.In practice, it can be simultaneously integrated numerically together with the system (1.1b).
Theorem 2.2 (Adjoint System).Suppose that the derivatives d x f and d p f exist and are continuous in the neighborhood of the solution x(t) for t ∈ [0, T ].Then, the sensitivity with respect to the parameters is given by where λ(t) ∈ R nx×nx solves the adjoint system ) Proof.This proof is elaborated on in a broader context in Section 2.3.
Similar to the variational system of Theorem 2.1, solving the adjoint system (2.8) requires the solution x(t).The adjoint system is matrix-valued of dimension n x × n x , in contrast to the variational system which is of dimension n x ×n p .Thus, if n p n x as will be the case in our scenario, it is more efficient to solve (2.8) instead of (2.4).Another major difference is that the adjoint system is defined backwards in time, starting from the endpoint T .This has important computational advantages for our setting.In view of the required gradient (2.2), we are not interested in the full sensitivity but rather in the derivative along the direction η := ∂ x C(x(T, p)), i.e. d p x(T, p) η.This can be achieved by exploiting the structure of the adjoint system, by multiplying (2.8) from the right by η and setting λ(t) := λ(t)η.The resulting IVP is again an adjoint system, no longer being matrix-valued but vector-valued λ(t) ∈ R nx , with λ(T ) = η ∈ R nx .Thus, from now on, we consider the latter case and denote λ(t) again by λ(t), which is vector-valued.
As a consequence, we will focus on the adjoint system (2.8) in the remainder of this paper.In particular, (2.7) will be used to estimate parameters p by solving (1.1) using a gradient descent flow.This requires to solve the adjoint system numerically.However, a viable alternative to this 'optimize-then-discretize' approach is to reverse this order, that is to discretize problem (1.1) first, and then to derive a corresponding time-discrete adjoint system.It turns out that both ways are equivalent if a proper class of numerical integration scheme is chosen for discretizing the system in time.This will be shown in Section 2.3 after collecting required background material in Section 2.2.
2.2.Symplectic Partitioned Runge-Kutta Methods.In this section, we recall basic concepts of numerical integration from [HLW06,SS16] in order to prepare Section 2.3.Symplectic schemes are typically applied to Hamiltonian systems in order to conserve certain quantites, often with a physical background.The pseudo-Hamiltonian defined below by (2.19) will play a similar role, albeit there is no physical background for our concrete scenario to be studied in subsequent Sections.
A general s-stage Runge-Kutta (RK) method with s ∈ N is given by [HNW93] The coefficients a ij , b i , c i ∈ R can be arranged in a so-called Butcher tableau (Fig. there exists a unique solution of (2.9), which can be obtained by iteration.If f (x, p, t) is q times differentiable, the functions k i (as functions of h) are also in C q .
Proof.A detailed proof can be found in [HNW93, Chapter II, Theorem 7.2].
Suppose that the given system (1.1b) is partitioned into two parts with x = (q , p ) , f = (f 1 , f 2 ) and q = f 1 (q, p, t), (2.12a) ṗ = f 2 (q, p, t). (2.12b) Partitioned Runge-Kutta methods integrate (2.12) using two different sets of coefficients (2.13) The following theorems state conditions under which RK methods preserve certain quantities that should be invariant under the flow of the system that is integrated numerically.In this sense, such RK schemes are called symplectic.
Theorem 2.4 (Symplectic Runge-Kutta Method; [HLW06, Chapter VI, Theorems 7.6 and 7.10]).Assume that the system, (1.1b) has a quadratic invariant I, i.e.I(•, •) is a real-valued bilinear mapping such that (d/dt)I(x(t), x(t)) = 0, for each t and x 0 .If the coefficients of a Runge-Kutta method (2.9) satisfy then the value I(x n , x n ) does not depend on n.
Theorem 2.5 (Symplectic Partitioned RK Method; [SS16, Theorems 2.4 and 2.6]).Assume that S(•, •) is a real-valued bilinear mapping such that (d/dt)S(q(t), p(t)) = 0 for each t and x 0 of the solution x(t) = [q(t) , p(t) ] of (2.12).If the coefficients of the partitioned Runge-Kutta method (2.13) satisfy then the value S(q n , p n ) does not depend on n.
as coefficients for the second n-variables (2.12b).This construction results in an overall symplectic PRK method of the partitioned system (2.12).

Computing Adjoint
Sensitivities.There are two basic approaches for computing (2.2), the differentiatethen-discretize approach and the discretize-then-differentiate approach.Figure 2.2 illustrates both approaches by paths colored with blue and violet, respectively.Details are worked out in this section.Our main objective is to make this diagram commutative by adopting a class of numerical schemes as outlined in the preceding section.
In the following, we drop the dependency of x(t) on the parameter p, to simplify notation by just writing x(t).The following theorem details the blue path of Figure 2 where x(t), λ(t) solve the two-point boundary value problem In terms of the pseudo-Hamiltonian the system has the following form ) Next, we differentiate this problem and state the result in the following theorem.
Theorem 2.7 (Adjoint Sensitivity: Discretize-then-Differentiate Approach).Suppose the step-size h n satisfies condition (2.11).Then, the gradient of the objective function Φ(p) = C(x N (p)) from (2.21) with respect to parameter p is given by where the discrete adjoint variables are given by ) where the internal stages X n,i are given by (2.21d).This scheme is a general Runge-Kutta method (2.9a)-(2.9c)applied to the adjoint system (2.18b) with b i = 0, i = 1, . . ., s, and coefficients Proof.An outline of the proof can be found in [ 2.4.Two Specific Numerical Schemes.We complement and illustrate the general result of the preceding section by specifying two numerical schemes.
2.4.1.Adjoint Sensitivity: Explicit Euler method.We integrate the forward dynamic (2.18a) with the explicit Euler method [HNW08].The straightforward use of (2.16) leads to another Runge-Kutta method for integrating the adjoint system (2.18b).The forward and backward coefficients of this overall symplectic partitioned Runge-Kutta method are then given by Table 1. 1. Symplectic PRK coefficients induced by the explicit Euler method.
By substituting the backward coefficients a 11 , b 1 and c 1 into (2.23),we derive the concrete formulas of the discrete adjoint method (2.25c) Note, that (2.25c) coincides with (2.25a), that is by traversing from n + 1 to n, we can rewrite (2.25) in the form (2.27) 2.4.2.Adjoint Sensitivity: Heun's method.We integrate the forward dynamic (2.18a) with Heun's method [HNW08].The straightforward use of (2.16) leads to another Runge-Kutta method for integrating the adjoint system (2.18b).The forward and backward coefficients of this overall symplectic partitioned Runge-Kutta method are then given by Table 2 c forward coefficients backward coefficients TABLE 2. Symplectic PRK coefficients induced by Heun's method.
Although the butcher tableau of the backward coefficients (see Table 2, right matrix) is completely dense, the final update formulas are explicit, as we will show below.Again, we derive the concrete formulas of the discrete adjoint method by substituting the backward coefficients into (2.23) Note, that (2.28e) coincides with (2.28a), which implies the equations Using (2.29), we reformulate (2.28d) (2.30c) Formula (2.30c) is an explicit Euler step traversing backwards from n + 1 to n.Thus, we can rewrite the overall scheme (2.28) as Again, this is an explicit method traversing backwards from n + 1 to n. Formula (2.22) for the gradient of Φ(p) = C(x N (p)) from (2.21) has the form (2.32)

IMAGE LABELING USING GEOMETRIC ASSIGNMENT
In this section, we summarize material from [ ÅPSS17] and [ZSPS18] required in the remainder of this paper.
Let G = (V, E) be a given undirected graph with m := |V| vertices and let be data on the graph given in a metric space (F, d).We call F V image data given by features f i extracted from a raw image at pixel i ∈ V in a preprocessing step.Along with f we assume prototypical data to be given, henceforth called labels.Each label j represents the data of class j.Image labeling denotes the problem of finding an assignment V → X assigning class labels to nodes depending on the image data F V and the local context encoded by the graph structure G.We refer to [HS ÅS18] for more details and background on the image labeling problem.
G may be a grid graph (with self-loops) as in low-level image processing or a less structured graph, with arbitrary connectivity in terms of the neighborhoods where ik is a shorthand for the undirected edge {i, k} ∈ E. We require these neighborhoods to satisfy the relations (3.4) We associate with each neighborhood N i from (3.3) weights ω ik ∈ R for all k ∈ N i , satisfying (3.5) These weights parametrize the regularization property of the assignment flow below.Learning these weights from given data is the subject of the remainder of this paper.
3.1.Assignment Manifold.The probabilistic assignment of labels X at one node i ∈ V are represented by the manifold of discrete probability distributions with full support with constant tangent space Throughout this paper, we only work with T n .The probability space S is turned into a Riemannian manifold (S n , g) by equipping it with the Fisher-Rao (information) metric Furthermore, we have the uniform distribution of labels (3.9) the orthogonal projection onto the tangent space with respect to the standard Euclidean structure of R n and the replicator operator given by the linear map (3.12) The Riemannian gradient of a smooth function f : S → R is denoted by grad f : S → T n and relates to the Euclidean gradient ∂f by [ ÅPSS17, Prop.1] as Adopting the α-connection with α = 1, also called e-connection, from information geometry [AN00, Section 2.3], [AJLS17], the exponential map based on the corresponding affine geodesics reads (3.14a) Specifically, we define Applying the map exp p to a vector in R n = T n ⊕ R1 n does not depend on the constant component of the argument, due to (3.12).
Remark 3.1.The map Exp corresponds to the e-connection of information geometry [AN00], rather than to the exponential map of the Riemannian connection.Accordingly, the affine geodesics (3.14a) are not length-minimizing with respect to the Riemannian structure.But locally, they provide a close approximation [ ÅPSS17, Prop.3] and are more convenient for numerical computations.
Global label assignments on the whole set of nodes V are represented as points on the assignment manifold, given by the product with constant tangent space and Riemannian structure (W, g) given by the Riemannian product metric.We identify W with the embedding into R m×n Thus, points W ∈ W are row-stochastic matrices W ∈ R m×n with row vectors W i ∈ S n , i ∈ V representing the label assignments for every i ∈ V. Due to this embedding of W, the tangent space T W can be identified with and therefore for V ∈ T W every row vector V i is contained in T n for every i ∈ V.The global uniform distribution, given by the uniform distribution in every row, again called barycenter, is denoted by where the second equality is due to the embedding (3.18).The mappings (3.10)-(3.14a)naturally extend to the assignment manifold W and exp W , Exp −1 W , exp −1 W are similarly defined based on (3.15a), (3.14b) and (3.15b).Due to (3.13) , the Riemannian gradient and the Euclidean gradient of a smooth function f : W → R are also related by (3.22) 3.2.Assignment Flow.Based on the given data (3.1) and labels (3.2), the i-th row of the distance matrix D ∈ R m×n is defined by This distance information is lifted onto the manifold by the following likelihood matrix where ρ > 0 is a scaling parameter to normalize the a-priori unknown scale of the distances induced by the features f i depending on the application at hand.This representation of the data is regularized by weighted geometric averaging in the local neighborhoods (3.3) using the weights (3.5), to obtain the similarity matrix S(W ) ∈ W, with i-th row defined by If Exp W i were the exponential map of the Riemannian (Levi-Civita) connection, then the sum inside the outer brackets of the right-hand side in (3.25) would just be the negative Riemannian gradient with respect to W i of the objective function used to define the Riemannian center of mass, i.e. the weighted sum of the squared Riemannian distances between W i and L k [Jos17, Lemma 6.9.4].In view of Remark 3.1, this interpretation is only approximately true mathematically, but still correct informally: S i (W ) moves W i towards the normalized geometric mean of the likelihood vectors L k , k ∈ N i .The similarity matrix induces the assignment flow through a system of spatially coupled nonlinear ODEs which evolves the assignment vectors Integrating this flow numerically [ZSPS18] yields curves W i (t) ∈ S n for every pixel i ∈ V emanating from W i (0) = 1 Sn , which approach some vertex (unit vector) of S n = ∆ n and hence a unique label assignment after a trivial rounding W i (t) for sufficiently large t > 0.
3.3.Linear Assignment Flow.The linear assignment flow, introduced by [ZSPS18], uses the exponential map with respect to the e-connection (3.14a) in order to approximate the mapping (3.25) as part of the assignment flow (3.26a) by This linear assignment flow (3.27) is still nonlinear but admits the following parametrization [ZSPS18, Prop.4.2] (3.28) where the latter ODE is linear and defined on the vector space T W . Fixing S(W 0 ) in the following, (3.28) is linear with respect to both the tangent vector V and the parameters ω ik in the differential dS(W 0 ) (see (3.30) and Remark 4.1 below), that makes this approach attractive for parameter estimation.
It can be shown that S i (W ) from (3.25) can equivalently be expressed with exp 1 Sn as A standard calculation shows, that the i-th component of the differential dS(W ) : T W → T W is given by (3.30) 3.4.Numerical Integration of the Flow.Setting Λ(V, W ) := exp W (V ) gives an action Λ : T W ×W → W of the vector space T W viewed as an additive group on the assignment manifold W. In [ZSPS18] this action is used to numerically integrate the assignment flow by applying geometric Runge-Kutta methods.The resulting method for an arbitrary vector field F : W → T W is as follows.Suppose, the ODE on the assignment manifold is given.Then the parametrization W (t) = exp 1 W (V (t)) yields an equivalent reparametrized ODE = 0 (3.32) purely evolving on the vector space T W , where standard Runge-Kutta methods (cf.Section 2) can now be used for numerical integration.Translating these update schemes back onto W, yields geometric Runge-Kutta methods on W induced by the Lie-group action Λ = exp.Remark 3.2.Notice, that the assumption F (W ) ∈ T W is crucial because the transformation of the ODE (3.31) onto T W in (3.32) uses the inverse of R W , which only exists for elements of T W but not for R m×n .However, this is no limitation.Suppose any vector field F : W → R m×n is given.Due to R W = R W • Π by (3.12), we may consider F (W ) := Π[ F (W )] ∈ T W instead, without changing the underlying ODE (3.31) for W (t).
In the following, we mainly use the Euler method to numerically integrate the flow (3.32) on the vector space T W , i.e.
with step-size h k > 0. Due to the Lie-group action, this update scheme translates to the geometric Euler integration on W given by with step-size h k > 0.

LEARNING ADAPTIVE REGULARIZATION PARAMETERS
In this section, we study the parameter learning approach (4.1), which is a specific instance of the general formulation (1.1).The goal is to adapt the regularization of the linear assignment flow (3.27) on the fixed time horizon [0, T ] controlled by the weights (3.5), in the following collectively denoted by Ω, so as to preserve important image structure in a supervised manner.During learning, the image structure is prescribed by given ground truth labeling information W * , where every row W * i is some unit basis vector e k i of R n representing the ground truth label l k i at node i ∈ V.The adaptivity of the weights with respect to the desired image structure is measured by C in terms of the discrepancy between ground truth W * and the labeling induced by V (T ) = V (T, Ω) at fixed time T .The corresponding optimization problem reads It is important to note that the dependency of C(V (T, Ω)) on the weights Ω is only implicitly given through the solution V (T ) = V (T, Ω) of (4.1b).In Section 4.4 we therefore present a numerical first-order scheme for optimizing (4.1)where the gradient of C(V (T, Ω)) with respect to the parameter Ω is calculated using the sensitivity analysis from Section 2.
4.1.Parameter Manifold.In the following, we define the parameter manifold representing the weights ω ik from (3.5) associated to the neighborhood N i , i ∈ V. Based on this parametrization, we can compute the differential dS(W 0 ) and thus describe the linear assignment flow (3.28) on the tangent space by a corresponding expression in Lemma 4.2 below.
To simplify the exposition, we assume that all neighborhoods N i have the same size Due to the constraints (3.5), the weight vector Ω i := (ω i1 , . . ., ω iN ) can be viewed as a point in S N .Accordingly, we define the parameter manifold as feasible set for learning the weights, which has the form of an assignment manifold and thus also has a Riemannian structure (P, g), given by the Fisher-Rao metric.We use the identification Points Ω ∈ P now represent the global choice of weights with Ω i representing the weights ω ik associated to the neighborhood N i in (3.5).The constant tangent space of P is denoted by T P and the corresponding orthogonal projection by Next, we give a global expression for the differential dS(W ) which will simplify following formulas and calculations.For this, we define the averaging matrix A Ω ∈ R m×m with weights Ω ∈ P by where δ k∈N i is the Kronecker-Delta with value 1 if k ∈ N i and 0 otherwise.We observe that the averaging matrix A Ω linearly depends on the weight parameters.Thus, A Ω parametrizes averages depending on the corresponding weights Ω with respect to the underlying graph structure, given by the neighborhoods (3.3).For a matrix M ∈ R m×n , the averages of the row vectors with weights Ω are then just given by the matrix multiplication A Ω M , with the i-th row vector given by (4.7) For later use, we record the following formula for the adjoint of A Ω as a linear map with respect to Ω.
Lemma 4.1.If the averaging matrix is viewed as a linear map A : R m×N → R m×m , Ω → A Ω , then the adjoint map A : R m×m → R m×N , B → A B is given by Proof.For arbitrary B ∈ R m×m and Ω ∈ R m×N , we obtain A Ω , B = i,j∈V δ j∈N i Ω ik B ik = Ω, A B due to (4.6).
Using A Ω with Ω ∈ P, it follows from (3.30) that dS(W ) can be expressed as As a result, the linear assignment flow (3.28) on the vector space T W can be parametrized as follows.
Lemma 4.2.Using the parametrization V := nV , the linear assignment flow (3.28) takes the form Proof.At p = 1 Sn , the linear map (3.12) takes the form where I ∈ R n×n denotes the identity matrix.Because of Using (4.9) together with V W 0 = nV = V , the linear assignment flow (3.28) takes the form As a consequence of ΠR S(W 0 ) = R S(W 0 ) by (3.12), the right-hand side of (4.10) follows after multiplying the equation by n and using n V = V .
Remark 4.1.To simplify notation, we will write V for V below.Equation (4.10) highlights the importance to fix S(W 0 ) in order to obtain a model that is linear in both the state vector V and the parameters Ω.

4.2.
Modified Linear Assignment Flow.We now return to our objective to estimate the weight parameters Ω ∈ P controlling the linear assignment flow on the fixed time interval [0, T ], in the supervised scenario (4.1).In this formulation, the data represented by the likelihood matrix (3.24),only influence the linear assignment flow (3.27), or equivalently (4.10), through the constant similarity matrix S(W 0 ) that comprises averaged data information depending on the initial choice of the weights Ω 0 .However, since the initial weights are in general not adapted to any specific image structure, this can lead to a loss of desired structural information through S(W 0 ) at the outset, that cannot be recovered afterwards.
To avoid this problem, we slightly modify the linear assignment flow in (4.10) to obtain an explicit data term, independent of the choice of initial weights.This is done through replacing the constant term S(W 0 ) by the lifted distances L(W 0 ), which results in the modified linear assignment flow (4.14) Remark 4.2.We point out that, strictly speaking, that the similarity matrix S(W 0 ) is involved in two ways, in the constant term of (4.10) and in the expression R S(W 0 ) of the differential dS(W 0 ) (cf. (4.9)).However, the effect of the latter with respect to the initial weights is negligible, and the former appearance only causes the above mentioned loss of initial data information.We note again that (4.14) is linear with respect to both the tangent vector V and the parameters Ω only if S(W 0 ) is kept constant.
Proposition 4.3.The differential of the map F : T W ×P → T W on the right-hand side of (4.14) with respect to the first and second argument are given by d Ω F (V, Ω) : The corresponding adjoint mappings with respect to the standard Euclidean structure of R m×n are d Ω F (V, Ω) : with the adjoint A (•) from Lemma 4.1.
Proof.Let V, X ∈ T W and set γ(t) := V + tX ∈ T 0 for all t ∈ R. Then (4.17) Similarly, for Ω ∈ P and Ψ ∈ T P , let η(t) := Ω + tΨ ∈ P be a curve with t ∈ (−ε, ε) for sufficiently small ε > 0. The linearity of the averaging operator A Ω with respect to Ω gives We now determine the adjoint differentials.Consider arbitrary X, Y ∈ T W and note that the linear map R S (W 0 ) is symmetric, since every component map R S i (W 0 ) is symmetric by (3.11).Thus, . Now let arbitrary Ψ ∈ T P and X ∈ T P be given.Then which proves the expression for the corresponding adjoint.
4.3.Objective Function.Let W = exp 1 W (V ) ∈ W be an assignment induced by V ∈ T W . Accumulating the KL-divergence between the ground truth W * i and W i for every node i ∈ V, results in a measure of the global deviation between W induced by V and the ground truth Remark 4.3.It is important to note that C does not explicitly depend on the weights Ω ∈ P. In the problem formulation (4.1a), this dependency is only given implicitly through the evaluation of C at V (T, Ω), where V (t, Ω) is the object depending on the parameter Ω as solution of the modified linear assignment flow (4.14).
Proposition 4.4.The Euclidean gradient of objective (4.22) for fixed W * ∈ W is given by Hence the KL-divergence between W * i and the induced assignment and results in the following expression for C from (4.22), log( 1, e V i ).
(4.26) Take X ∈ R m×n and set γ(t) := V + tX for t ∈ R. The above formula for C then implies Since X ∈ R m×n was arbitrary, the expression (4.23) follows.
4.4.Numerical Optimization.With the above definitions of C and F , the optimization problem (4.1) for adapting the weights of the modified linear assignment flow (4.14) takes the form Our strategy for parameter learning is to follow the Riemannian gradient descent flow on the parameter manifold induced by the potential Due to (3.22), this Riemannian gradient flow on P takes the form where R Ω given by (3.21b) on P and Ω(0) = 1 W represents an unbiased initialization, i.e. uniform weights at every patch N i at i ∈ V.
We discretize (4.30) using the geometric explicit Euler scheme (3.34) from Section 3.4 with constant step-size h > 0, which results in Algorithm 1.
Data: Initial weights Ω (0) = 1 W , objective function Φ(Ω) = C V (T, Ω) Result: Weight parameter estimates Ω * // geometric Euler integration 1 for k = 0, . . ., K do 2 compute ∂Φ(Ω (k) ) ; // Algorithm 2 3 Algorithm 1 calls Algorithm 2 that we explain next.As pointed out in Remark 4.3, the dependency of Φ(Ω) = C(V (T, Ω)) on Ω is only implicitly given through the solution V (t, Ω) of the modified linear assignment flow (4.14), evaluated at time T .According to (2.2), the gradient of Φ decomposes as where d Ω V (T, Ω) is the sensitivity of the solution V (T, Ω) with respect to Ω. Thus, the major task is to determine the sensitivity of V (T, Ω) in order to obtain the gradient ∂Φ(Ω), which in turn drives the Riemannian gradient descent flow and adapts the weights Ω.To this end, we choose the discretize-thendifferentiate approach (2.22) -recall the commutative diagram of Fig. 2.2 and relations summarized as Remark 2.3 -with the explicit Euler method and constant step-size h > 0, which results in Algorithm 2.

EXPERIMENTS
In this section, we demonstrate and evaluate our approach.In Section 5.1, we consider a scenario with 3 labels and curvilinear line structure, that has to be detected and labeled explicitly in noisy data.Just using uniform weights for regularization must fail.In addition to the noise, the actual image structure is randomly generated as well and defines a class of images.We demonstrate empirically that learning the weights to adapt within local neighborhoods from example data solves this problem.
In Section 5.2, we adopt a different viewpoint and focus on pattern formation, rather than on pattern detection and recovery.We demonstrate the modeling expressiveness of the assignment flow with respect to pattern formation.In fact, even when using the linear assignment flow as in the present paper, label information can be flexibly transported across the image domain under certain conditions.The experiments just indicate what can be done, in principle, in order to stimulate future work.We return to this point in Section 6. 5.1.Adaptive Regularization of Curvilinear Line Structures.We consider a collection of images containing line structures induced by random Voronoi diagrams (Fig. 5.1, panel (a)).The goal is pixel-accurate labeling of any given image with three labels representing: thin line structure, homogeneous region and texture.In the figures below these labels are encoded by the three colors , , = {line, homogeneous, texture}.As usual in supervised machine learning, our approach is first applied during a training phase in order to learn weight adaptivity from ground truth labelings, and subsequently evaluated in a test phase using novel unseen data.Feature Vectors.The basis of our feature vectors are the outputs of simple 7 × 7 first-and second-order derivative filters, which are tuned to orientations at 0, 15, . . ., 180 degrees (we took absolute values of filter outputs to eliminate the 180 ∼ 360 degree symmetry).We reduced the dimension of the resulting feature vectors from 24 to 12 by taking the maximum of the first-order and second-order filter outputs, for each orientation.To incorporate more spatial information, we extracted 3 × 3 patches from this 12-dimensional feature vector field.Thus, our feature vectors f i , i ∈ V had dimension 3 × 3 × 12 = 108 and were given as a point set in the Euclidean feature space F = R 108 .

Label
(5.1) The quality of this distance information is illustrated by  Optimization.For each input image of the training set, we solved problem (4.1) using Algorithms 1 and 2 and the following parameter values: |N i | = 9 × 9 (size of local neighborhoods, for every i), ρ = 1 (scaling parameter for distance matrix, cf.(3.24)), h = 0.5 (constant step-size for computing the gradient with Alg. 2), and T = 6 (end of time horizon).As for optimization on the parameter manifold P through the Riemannian gradient flow (Alg.1), we used an initial value of h = 0.0125 together with backtracking for adapting the step-size, for a maximal number of 100 iterations, and we terminated the iteration once the relative change of the objective function Φ Ω (k) = C V (N ) (Ω (k) ) dropped below 0.001.Results. Figure 5.3 shows two results obtained during the training phase.They illustrate that non-adaptive regularization using uniform weights, which results in blurred partitions and fails completely to detect and label the line structures (panel (c)).On the other hand, the adapted regularizer preserves and restores the structure nearly perfect (panel (d)), i.e. the optimal weights steered the linear assignment flow towards the given ground-truth labeling.
Figure 5.4 shows a close-up view of a 10×10 pixel region together with the corresponding 10×10 optimal weight patches, extracted from Ω * .The top row depicts (a) the training data, (b) the corresponding ground truth, (c) the local label assignments, and (d) the labeling obtained when using the learned weights Ω * .Plot (e) shows the corresponding optimal weight patches Ω * i = (ω i1 , . . ., ω iN ) associated to every pixel i in the 10 × 10 pixel region, where small and large weights are indicated by dark and bright gray values, respectively.These weight patches illustrate the result of the learning process for adapting the weights.Close to the line structure, the regularizer increases the influence (with larger weights) of neighbors whose distance information matches the prescribed ground truth label.Away from the line structure, the regularizer has learned to suppress (with small weights) neighbors that belong to a line structure.5.1.2.Test Phase.During the training phase, optimal weights were associated with all training features through optimization, based on ground truth and a corresponding objective function.In the test phase with novel data and features, appropriate weights have to be precicted because ground truth no longer is available.This was done by extracting a coreset [Phi16] from the output generated by Algorithm 1 during the training phase, and constructing a map from novel features to weights, as desribed next.
Coreset.Let Ω * ∈ P denote the set of optimal weight patches generated by Algorithm 1, and let P * denote the set of all 15 × 15 patches of local label assignments based on the corresponding training features and distance (5.1).We partitioned P * into three classes: thin line structures, homogeneous regions and texture, and extracted for each class separately 225 prototypical patches by k-means clustering.To each of these patches and the corresponding cluster, a prototypical weight patch was assigned, namely the weighted geometric mean of all optimal weight patches in Ω * belonging to that cluster.As weights for the averaging we used the Euclidean distance between the respective patches of local label assignments and the corresponding cluster centroid.
Figure 5.5 depicts 10 pairs of patches of prototypical label assignments and weights, for each of the three classes: line, homogenous, texture.Comparing these weight patches with the optimal patches depicted by Figure 5.4, we observe that the former are regularized (smoothed) by geometric averaging and, in this sense, summarize and represent all optimal weights computed during the training phase.
Mapping features to weights.For each novel test image, we extracted features using the same procedure as done in the training phase and computed at each pixel i the patch of local label assignments.For the latter patch, the closest patch of local label assignments of the coreset was determined, and the corresponding weight patch was assigned to pixel i.
Note that the patch size 15 × 15 of local label assignments was chosen larger as the patch size 9 × 9 of the weights that was used both during training and for testing.The former larger neighborhood defines the local 'feature context' that is used to predict weights for novel data.
Inference (labeling novel data).In the test phase, we used the modified linear assignment flow and all parameter values in the same way, as was done during training.The only difference is that predicted weight patches were used for regularization, as described above.
Results. Figure 5.6 shows a result of the test phase.Since all data are randomly generated, this result is representative for the entire image class.The top row shows the input data (panels (a) and (c)), whereas ground truth (b) is only shown for visual comparison.The bottom row shows the results obtained using uniform weights (d) and predicted weights (e).The latter result clearly demonstrated the impact of weight adaptivity.This aspect is further illustrated in panel (f).
Figure 5.7 shows predicted weight patches for novel test data in the same format as Figure 5.4 depicts optimal weight patches computed during training.The similarity of the behaviour of predicted and optimal weights for pixels close and away from local line structure, demonstrates that the approach generalizes well to novel data.Since these data are randomly generated, this performance is achieved for any image data in this class.The corresponding optimal weight patches (10 × 10 grid), one patch for each pixel.Close to the line structure, the regularizer increases the influence of neighbors on the geometric averaging of assignments whose distances match the prescribed ground truth labels.Away from the line structure, the regularizer has learned to suppress with small weights neighbors belonging to a line structure.   .This shows that our approach generalizes to novel data.

Pattern Formation by Label Transport.
In this section, we illustrate the model expressiveness of the assignment flow.Specifically, we choose quite different labelings as input and target data, respectively, and show that our learning approach can determine weights that 'connect' these patterns by the assignment flow.This shows that the weights which determine the regularization properties of the assignment flow actually encode information for pattern formation.Finally, we briefly point out and illustrate limitations of our approach.The rightmost panel in the top row of Figure 5.8 shows, for each pixel, the deviation of the optimal weight patch form uniform weights.While it is obvious that the 'source labeling' of the input data receive large weights, the spatial arrangement of weights at all other locations is hard to predict beforehand, by humans.This is why learning them is necessary.The Riemannian gradient flow on the parameter manifold effectively steers the flow to the target labeling.BOTTOM ROW: Label assignments of the nonlinear assignment flow using the optimal weights that were estimated using the linear assignment flow.Closeness of both labeling patterns at the final point of time T = 5 demonstrates that the linear assignment flow provides a good approximation of the full nonlinear flow.

5.2.2.
Transporting and Enlarging Label Assignments.We repeated the experiment of the previous section using the academic scenario depicted by Figure 5.9.A major difference is that locations of the input labeling do not form a subset of the locations of the target labeling.As a consequence, the corresponding 'mass' of assignments has to be both transported and enlarged.
The results shown by Figure 5.9 closely resemble those of Figure 5.8, such that the corresponding comments apply likewise.We just point out again the following: Looking at the optimal weight patches in terms of their deviation from uniform weights, as depicted by the rightmost panel in the top row of Figure 5.9, it is both interesting and not too difficult to understand -after convergence and informally by visual inspectionhow these weights encode this particular 'label transport'.However, predicting these weights and certifying their optimality beforehand, seems to be an infeasible task.For example, it is hard to predict that the creation of intermediate locations where assignment mass temporarily accumulates (clearly visible in Fig. 5.9), effectively optimizes the constrained functional (4.1).Learning these weights, on the other hand, just requires to apply our approach.Applying our approach to (4.1) effectively solves the problem.BOTTOM ROW: Inserting the optimal weights that are computed using the linear assignment flow, into the nonlinear assignment flow, gives a similar result and underlines the good approximation property of the linear assignment flow.It is interesting to obverse that computing the Riemannian gradient flow on the parameter manifold entails 'intermediate locations' where assignment mass accumulates temporarily.This underlines the necessity of learning, since it seems hard to predict such an optimal regularization strategy beforehand.
5.2.3.Parameter Learning vs. Optimal Control. Figure 5.10 illustrates limitations of our parameter learning approach.In this experiment, we simply exceeded the time horizon in order to inspect labelings induced by the linear assignment flow after the point of time T , that was used for determining optimal weights in the training phase.Starting with T , Figure 5.10 shows these labelings for both experiments corresponding to Figures 5.8 and 5.9.Unlike the fern pattern (top row) where the initial label locations formed a subset of the target locations and were imposed as constraints, the 'moving mass pattern' (bottom row) is unsteady in the following quite natural sense: the linear assignment flow simply continues transporting mass beyond time T .As a result, assignments to the white label are transported to locations of the black target pattern.Hence, the target pattern is first created up to time T and destroyed afterwards.
This behaviour is not really a limitation, but a consequence of merely learning constant weight parameters.Due to the formulation of the optimization problem (4.1), optimal weights not only encode the 'knowledge' how to steer the assignment flow in order to solve the problem, but also the time period after which the task has to be completed.Fixing this issue requires a higher-level of adaptivity: weight functions depending on time and the current state of assignments would have to be estimated, that may be adjusted online through feedback in order to control the assignment flow in a more flexibel way.differentiation for the training problem commute.An additional convenient property of our approach is that the parameter manifold has mathematically the structure of an assignment manifold, such that Riemannian gradient descent can be used for effectively solving the training problem.
The output of the training phase is a database containing features extracted from training data, together with the respective optimal weights.In order to complete the parameter learning task, a mapping has to be specified that predicts optimal weights for novel unseen data.We solved this task simply by nearest-neighbor prediction after partitioning the database using k-means clustering and geometric averaging of the weights, separately for each cluster.We evaluated this approach for a 3-label scenario involving line structures where just using uniform weights inevitably fails.We additionally conducted experiments that highlight the model expressiveness of the assignment flow and also limitations caused by merely learning constant parameters.
Our main insights include the following.Regarding numerical optimization for parameter learning in connection with image labeling, our approach is more satisfying than working with discrete graphical models, where parameter learning requires to evaluate the partition function, which is a much more involved task when working with cyclic grid graphs.This latter problem of computational statistics shows up in our scenario in similar form as the problem to design the prediction map from features to weight parameters.A key difference of these two scenarios is that by restricting the scope to statistical predictions at a local scale, i.e.only within small windows, the prediction task becomes manageable, since regarding numerical optimization, no further approximations are involved at all.
Regarding future work, we mention two directions.The natural way for broading the scope of the prediction map and the class of images that the assignment flow can represent, is the composition of two or several assignment flows in a hierarchical fashion.This puts our work closer to current mainstream research on deep networks, whose parametrizations and internal representations are not fully understood, however.We hope that using the assignment flow can help to understand hierarchical architectures better.
The second line of research concerns the learning of weight functions, rather than constant parameters, as motivated in Section 5.2.3, since this would also enhance model expressiveness and adaptivity considerably.
A key problem then is to clarify the role of these functions and the choice of an appropriate time scale, as part of an hierarchical composition of assignment flows.
(2) Lagrange multiplier: The vector λ contains all Lagrange multipliers in (A.11) belonging to the constraints (2.21b)-(2.21d).The multipliers are stacked and arranged as follows In our situation, φ is the concatenation of the forward Runge-Kutta evaluation, which we express using the Kronecker-product as where Ψ n,1 ∈ R snx and Ψ n,2 ∈ R nx , as well as (2) Sparse block structure: The overall Jacobian d γ φ consists of N − 1 blocks (one for each iteration) of the form (A.16) and is given by Since, the kernel of A n is trivial, A n is invertible and consequently the overall Jacobian d γ φ as well.
Now we are in a position to apply Lemma A.1.More precisely, (A.8b) tells us that the vector λ is uniquely determined by the linear system d γ φ λ = −∂ γ C. In our situation, this system is given by  .24) ) with components P parameter manifold, representing the weights ω ik from (3.5), see Section 4.1.F (V, Ω) modified version of the linear assignment flow (4.14), see Section 4.2.C a objective function measuring the discrepancy to the ground truth, see Section 4.3.
5.1.1.Training Phase.We used 20 randomly generated images together with ground truth as training data: Figure 5.1(a) shows one of these images and Figure 5.1(b) the corresponding groune truth.Using these data we learned how to adapt the regularization parameter of the modified linear assignment flow (4.14) by solving problem (4.1), with the specific form given by (4.28).
FIGURE 5.1.Training data.The training data consist of 20 pairs of randomly generated images: (a) an input scene, and (b) the corresponding ground truth.The ground truth images encode the labels with colors , , = {line, homogeneous, texture}.Even though the global image structure can be easily assessed by the human eye, assigning correct labels pixelwise by an algorithm requires context-sensitive decisions, as the close-up view illustrates.
Extraction.Using ground truth information, we divided all feature vectors extracted from the training data into three classes: thin line structure, homogeneous region and texture.We computed 200 prototypical feature vectors l jc ∈ F, j ∈ [200], in each class c ∈ {line, homogeneous, texture} by k-means clustering.Thus, each label (line, homogeneous, texture) was represented by 200 feature vectors in F. Distance Matrix.Even though in the original formulation (3.2) labels are represented by a single feature vector, multiple representatives can be taken into account as well by modifying the distance matrix (3.23) accordingly.With the identification c ∈ {line, homogeneous, texture} = {1, 2, 3}, we defined the entries of the distance matrix D ic , for every i ∈ V, as the distance between f i and the best fitting representative l jc for class c, i.e.D ic := min j∈[200] Figure 5.2(b) that shows the labeling obtained by local rounding, i.e. by assigning to each pixel i the label c = min c D ic .Although the result looks similar to the ground truth shown as Figure 5.1(b), it is actually quite noisy when looking to single pixels in the close-up view of Figure 5.2(b).

FIGURE 5 . 2 .
FIGURE 5.2.Input data and local label assignments.The plots illustrate the input data and the quality of the distances (5.1) between extracted feature vectors.Panel (a) shows a randomly generated input image from which features are extracted, as described in the text.Panel (b) shows the labeling obtained by local rounding, i.e. by assigning to each pixel the label minimizing the corresponding distance.Comparing the close-up views of panel (b) and Fig. 5.1(b) (ground truth) shows that label assignments to individual pixels are noisy and incomplete.({ , , } = {line, homogeneous, texture}) FIGURE 5.3.Training phase: Labeling results.This figure shows results of the training phase.Panel (a) shows the given input scene and panel (b) the corresponding locally rounded distance information.The labeling with uniform regularization (panel (c)) returns smoothed over regions and completely fails to preserve the line structures.The adaptive regularizer preserves the line structure nearly perfect (panel (d)), i.e. the optimal weights are able to steer the linear assignment flow successfully towards the given ground-truth labeling.({ , , } = {line, homogeneous, texture})

FIGURE 5 . 4 .
FIGURE 5.4.Training phase: Optimal weight patches.TOP ROW: (a) Close-up view of training data (10 × 10 pixel region).(b) The corresponding ground truth section.(c) Local label assignments.(d) Correct labeling using adapted optimal weights.BOTTOM ROW: (e)The corresponding optimal weight patches (10 × 10 grid), one patch for each pixel.Close to the line structure, the regularizer increases the influence of neighbors on the geometric averaging of assignments whose distances match the prescribed ground truth labels.Away from the line structure, the regularizer has learned to suppress with small weights neighbors belonging to a line structure.

FIGURE 5 . 5 .
FIGURE 5.5.Coreset visualization.This plot shows 3 × 10 prototypical patches of local label assignments and the corresponding weight patches of the coreset, for each of the 3 classes.(a) 10 prototypical pairs of the class line.Weight patches 'know' to which neighbors large weights have to be assigned, such that the local line structure is labeled correctly.(b) Weight patches of the homogeneous label class are almost uniform, which is plausible, because the noisy assignments can be filtered most effectively.(c) The weight patches of the texture label are comparable to the homogeneous ones and almost uniform, for the same reason.(Color code , , = {line, homogeneous, texture}).
FIGURE 5.6.Test phase: Labeling results.TOP ROW: (a) Randomly generated novel input data, (b) the corresponding ground truth (c) the local label assignments.BOTTOM ROW: (d) Labeling using uniform weights fails to detect and label line structures.(e) Adaptive regularizer based on predicted weights yields a result that largely agrees with ground truth.Panel (f) illustrates weights adaptivity at each pixel in terms of the distance of the predicted weight patch to the uniform weight.
FIGURE 5.7.Test phase: Predicted weight patches.TOP ROW: (a) Close-up view of novel data (10 × 10 pixel window).(b) Corresponding ground truth section (just for visual comparison, not used in the experiment).(c) Local label assignment.(d) Labeling result using adaptive regularization with predicted weights.BOTTOM: (e) Corresponding predicted weight patches (10 × 10 grid), one patch for each pixel of the test data (a).The predicted weight patches behave similar to the optimal weight patches depicted by Fig. 5.4, that were computed during the training phase (for different data).This shows that our approach generalizes to novel data.

5. 2 . 1 .
Pattern Completion.The top row of Figure5.8 shows input and target labelings.The second row illustrates our approach to weight parameter learning using the linear assignment flow: Starting with uniform weights and imposing the very sparse information of the input labeling as constraint, adapting the weights by the Riemannian gradient flow on the parameter manifold effectively steers the assignment flow to the target labeling.Having obtained the optimal weights Ω * after convergence, we inserted them into the original nonlinear assignment flow.The evolution corresponding label assignments is shown by the third row of Figure5.8.The fact that the label assignment at the final time T is close to the target labeling which the linear assignment flow reaches exactly, confirms the remarkably close approximation of the nonlinear flow by the linear assignment flow, as already demonstrated in[ZSPS18] in a completely different way.

FIGURE 5 . 8 .
FIGURE 5.8.Pattern completion.This figure illustrates the model expressiveness of the assignment flow.TOP ROW: Input and target labelings.The task was to estimate weights in order to steer the assignment flow to the target labeling.The rightmore panel illustrates, for each pixel, the distance of the optimal weight patch from uniform weights.MIDDLE ROW: Label assignments of the linear assignment flow during weight parameter estimation.The Riemannian gradient flow on the parameter manifold effectively steers the flow to the target labeling.BOTTOM ROW: Label assignments of the nonlinear assignment flow using the optimal weights that were estimated using the linear assignment flow.Closeness of both labeling patterns at the final point of time T = 5 demonstrates that the linear assignment flow provides a good approximation of the full nonlinear flow.

FIGURE 5 . 9 .
FIGURE 5.9.Transporting and Enlarging Label Assignments.See Fig.5.8 for the setup.TOP ROW: Label locations of the input data do not form a subset of the target locations.Thus, 'mass' of label assignments has to be both transported and enlarged.Rightmost panel: Distance of the optimal weight patch from uniform weights, for every pixel.MIDDLE ROW: Applying our approach to (4.1) effectively solves the problem.BOTTOM ROW: Inserting the optimal weights that are computed using the linear assignment flow, into the nonlinear assignment flow, gives a similar result and underlines the good approximation property of the linear assignment flow.It is interesting to obverse that computing the Riemannian gradient flow on the parameter manifold entails 'intermediate locations' where assignment mass accumulates temporarily.This underlines the necessity of learning, since it seems hard to predict such an optimal regularization strategy beforehand.

F
n (X n , p) =    f (X n,1 , p, t n + c 1 h n ) . . .f (X n,s , p, t n + c s h n )    , X n = 1 s ⊗ x n + h n (A ⊗ I nx )k n =

D
N −1 A N −1 −I nx B N −1 I nx Invertibility of d γ φ: A matrix M is invertible if det M = 0. Since the matrix d γ φ is lower block diagonal, its determinant is given by det d γ φ = det A 1 • . . .• det A N −1 (A.19)Thus, we only need to show that det A n = 0 for all n = 1, . . ., N − 1. Equation (A.17a) reads in a more compact formA n = (I snx − h n M ), with M := d x F n (X n , p)(A ⊗ I nx ).(A.20)We show det A n = 0 by using the equivalent statement ker(A n ) = {0}.Now, let x ∈ R snx \{0}, then x ≤ x − h n M x + h n M x ≤ A n x + h n M x .(A.21)By using the row-sum norm • ∞ , we haveh n M ∞ (A.20) = h n d x F n (X n , p)(A ⊗ I nx ) ∞ ≤ h n d x F n (X n , p) ∞ (A ⊗ I nx ) ∞ < h n L maxi=1,...L denotes the Lipschitz constant of f and the step-size h n satisfies the assumption (2.11).Substituting (A.22) into (A.21)gives x < A n x + x ⇐⇒ 0 < A n x ⇐⇒ x ∈ ker(A n ).(A.23) D N −1 −I nx A N −1 B N −1 Theorem 2.1 (Variational System; [HNW08, Chapter I.14, Theorem 14.1]).Suppose the derivatives d x f and d p f exist and are continuous in the neighborhood of the solution x(t) for t ∈ [0, T ].
Remark 2.1.Assume the first set of Runge-Kutta coefficients are given and denoted by a ij , b i , c i with indices i, j ∈ [s].This method is used for the first n-variables (2.12a).Furthermore, let b i = 0 for all stages i ∈ [s].In view of condition (2.15), we can construct a symplectic PRK method by choosing .2. Illustration of the methodological part of this section.Our approach satisfies the commuting diagram, i.e. identical results are obtained either if the continuous problem is differentiated first and than discretized (blue path), or the other way around (violet path).