Journal of Mathematical Imaging and Vision, Volume 58, Issue 2, pp 211–238

Image Labeling by Assignment

  • Freddie Åström
  • Stefania Petra
  • Bernhard Schmitzer
  • Christoph Schnörr

DOI: 10.1007/s10851-016-0702-4


Abstract

We introduce a novel geometric approach to the image labeling problem. Abstracting from specific labeling applications, a general objective function is defined on a manifold of stochastic matrices, whose elements assign prior data that are given in any metric space, to observed image measurements. The corresponding Riemannian gradient flow entails a set of replicator equations, one for each data point, that are spatially coupled by geometric averaging on the manifold. Starting from uniform assignments at the barycenter as natural initialization, the flow terminates at some global maximum, each of which corresponds to an image labeling that uniquely assigns the prior data. Our geometric variational approach constitutes a smooth non-convex inner approximation of the general image labeling problem, implemented with sparse interior-point numerics in terms of parallel multiplicative updates that converge efficiently.

Keywords

Image labeling · Assignment manifold · Fisher–Rao metric · Riemannian gradient flow · Replicator equations · Information geometry · Neighborhood filters · Nonlinear diffusion

Mathematics Subject Classification

62H35 · 65K05 · 68U10 · 62M40

1 Introduction

1.1 Motivation

Image Labeling is a basic problem of variational low-level image analysis. It amounts to determining a partition of the image domain by uniquely assigning to each pixel a single element from a finite set of labels. Most applications require such decisions to be made depending on other decisions. This gives rise to a global objective function whose minima correspond to favorable label assignments and partitions. Because computing globally optimal partitions is generally NP-hard, only relaxations of the variational problem yield computationally feasible optimization approaches.

Continuous Models and relaxations of the image labeling problem were studied, e.g., in [13, 32], including the specific binary case, where only two labels are assigned [14] and the convex relaxation is tight, such that the global optimum can be determined by convex programming. Discrete models prevail in the field of computer vision. They lead to polyhedral relaxations of the image partitioning problem that are tighter than those obtained from continuous models after discretization. We refer to [22] for a comprehensive survey and evaluation. Similar to the continuous case, the binary partition problem can be solved efficiently and globally optimally using a subclass of binary discrete models [29].

Relaxations of the variational image labeling problem fall into two categories: convex and non-convex relaxations. The dominant convex approach is based on the local polytope relaxation, a particular linear programming (LP-) relaxation [49]. This has spurred a lot of research on developing specific algorithms for efficiently solving large problem instances, as they often occur in applications. We mention [28] as a prominent example and otherwise refer again to [22]. Yet, models with higher connectivity in terms of objective functions with local potentials that are defined on larger cliques are still difficult to solve efficiently. A major reason that has been largely motivating our present work is the non-smoothness of optimization problems resulting from convex relaxation—the price to pay for convexity.

Major classes of non-convex relaxations are based on the mean-field approach [39], [47, Section 5] or on approximations of the intractable entropy of the probability distribution whose negative logarithm equals the functional to be minimized [50]. Examples for early applications of relaxations of the former approach include [15, 18]. The basic instance of the latter class of approaches is known as the Bethe approximation. In connection with image labeling, all these approaches amount to non-convex inner relaxations of the combinatorially complex set of feasible solutions (the so-called marginal polytope), in contrast to the convex outer relaxations in terms of the local polytope discussed above. As a consequence, the non-convex approaches provide a mathematically valid basis for probabilistic inference like computing marginal distributions, which in principle enables a more sophisticated data analysis than mere energy minimization or maximum a posteriori inference, to which energy minimization corresponds from a probabilistic viewpoint.

On the other hand, like non-convex optimization problems in general, these relaxations are plagued by the problem of avoiding poor local minima. Although attempts were made to tame this problem by local convexification [16], the class of convex relaxation approaches has become dominant in the field, because the ability to solve the relaxed problem for a global optimum is a much better basis for research on algorithms and also results in more reliable software for users and applications.

Both classes of convex and non-convex approaches to the image labeling problem motivate the present work as an attempt to address the following two issues.
  • Smoothness versus Non-Smoothness Regarding convex approaches and the development of efficient algorithms, a major obstacle stems from the inherent non-smoothness of the corresponding optimization problems. This issue becomes particularly visible in connection with decompositions of the optimization task into simpler problems by dropping complicating constraints, at the cost of a non-smooth dual master problem where these constraints have to be enforced. Advanced bundle methods [23] then seem to be among the most efficient methods. Yet, how to make rapid progress in a systematic way does not seem obvious. On the other hand, since the early days of linear programming, e.g., [4, 5], it has been known that endowing the feasible set with a proper smooth geometry enables efficient numerics. Yet, such interior-point methods [38] are considered not applicable for large-scale problems of variational image analysis, due to dense numerical linear algebra steps that are both too expensive and too memory intensive.

    In view of these aspects, our approach may be seen as a smooth geometric approach to image labeling based on first-order, sparse numerical operations.

  • Local versus Global Optimality Global optimality distinguishes convex approaches from other ones and is the major argument for the former ones. Yet, having computed a global optimum of the relaxed problem, it has to be projected to the feasible set of combinatorial solutions (labelings) in a post-processing step. While the inherent suboptimality of this step can be bounded [31], and although progress has been made to recover the true combinatorial optimum at least partially [46], it is clear that the benefit of global optimality of convex optimization has to be relativized when it constitutes a relaxation of an intractable optimization problem. Turning to non-convex problems, on the other hand, raises the two well-known issues: local optimality of solutions instead of global optimality, and susceptibility to initialization. In view of these aspects, our approach enjoys the following properties. While being non-convex, there is only a single natural initialization, which obviates the need to search for a good initialization. Furthermore, the approach returns a global optimum (out of many), which corresponds to an image labeling (combinatorial solution) without the need of further post-processing. Clearly, the latter property is typical for concave minimization formulations of combinatorial optimization problems [19] where solutions of the latter problem are enforced by weighting the concave penalty sufficiently large. Yet, in such cases, and in particular when working in high dimensions as in image analysis, the problem of determining good initializations and of carefully designing the numerics (search direction, step-size selection, etc.) persists, in order to ensure convergence and a reasonable convergence rate.

Fig. 1

Overview of the variational approach. Given data and prior features in a metric space \({\mathscr {F}}\), inference corresponds to a Riemannian gradient flow with respect to an objective function J(W) on the assignment manifold \({\mathscr {W}}\). The curve of matrices W(t) assigns at each t prior data \({\mathscr {P}}_{{\mathscr {F}}}\) to observed data f and terminates at a global maximum \(W^{*}\) that constitutes a labeling, i.e., a unique assignment of a single prior datum to each data point. Spatial coherence of the labeling field is enforced by geometric averaging over spatial neighborhoods. The entire dynamic process on the assignment manifold achieves a MAP labeling in a smooth, geometrical setting, realized with sparse interior-point numerics in terms of parallel multiplicative updates

1.2 Approach: Overview

Figure 1 illustrates our setup and the approach. We distinguish the feature space \({\mathscr {F}}\) that models all application-specific aspects, and the assignment manifold \({\mathscr {W}}\) used for modeling the image labeling problem and for computing a solution. This distinction avoids mixing up physical dimensions, specific data formats, etc., with the representation of the inference problem. It ensures broad applicability to any application domain that can be equipped with a metric which properly reflects data similarity. It also makes it possible to normalize the representation used for inference, so as to remove any bias toward a solution that is not induced by the data at hand.

We consider image labeling as the task to assign to the image data an arbitrary prior data set \({\mathscr {P}}_{{\mathscr {F}}}\), provided the distance of its elements to any given data element can be measured by a distance function \(d_{{\mathscr {F}}}\), which the user has to supply. Basic examples for the elements of \({\mathscr {P}}_{{\mathscr {F}}}\) include prototypical feature vectors, patches, etc. Collecting all pairwise distance data into a distance matrix D, which could be computed on the fly for extremely large problem sizes, provides the input data to the inference problem.

The mapping \(\exp _{W}\) lifts the distance matrix to the assignment manifold \({\mathscr {W}}\). The resulting likelihood matrix L constitutes a normalized version of the distance matrix D that reflects the initial feature space geometry as given by the distance function \(d_{{\mathscr {F}}}\). Each point on \({\mathscr {W}}\), like the matrices L, S, and W, is a stochastic matrix with strictly positive entries, that is, with row vectors that are discrete probability distributions having full support. Each such row vector indexed by i represents the assignment of prior elements of \({\mathscr {P}}_{{\mathscr {F}}}\) to the given datum at location i, in other words the labeling of datum i. We equip the set of all such matrices with the geometry induced by the Fisher–Rao metric and call it the assignment manifold.

The inference task (image labeling) is accomplished by geometric averaging in terms of Riemannian means of assignment vectors over spatial neighborhoods. This step transforms the likelihood matrix L into the similarity matrix S. It also induces a dependency of labeling decisions on each other, akin to the prior (regularization) terms of the established variational approaches to image labeling, as discussed in the preceding section. These dependencies are resolved by maximizing the correlation (inner product) between the assignment in terms of the matrix W and the similarity matrix S, where the latter matrix is induced by W as well. The Riemannian gradient flow of the corresponding objective function J(W), which is highly nonlinear but smooth, evolves W(t) on the manifold \({\mathscr {W}}\) until a fixed point is reached which terminates the loop on the right-hand side of Fig. 1. The resulting fixed point corresponds to an image labeling which uniquely assigns to each datum a prior element of \({\mathscr {P}}_{{\mathscr {F}}}\).

Adopting a probabilistic Bayesian viewpoint, this fixed-point iteration may be viewed as maximum a posteriori inference carried out in a geometric setting with multiplicative, sparse, and highly parallel numerical operations.

1.3 Further Related Work

Besides current research on image labeling, there are further classes of approaches that resemble our approach. We briefly sketch each of them in turn and highlight similarities and differences.
  • Neighborhood Filters. A large class of approaches to denoising of given image data f are defined in terms of neighborhood filters that iteratively perform operations of the form
    $$\begin{aligned} u_{i}^{(k+1)} \!= \!\sum _{j} \frac{K\left( x_{i}, x_{j}, u_{i}^{(k)}, u_{j}^{(k)}\right) }{\sum _{l} K\left( x_{i}, x_{l}, u_{i}^{(k)}, u_{l}^{(k)}\right) } u_{j}^{(k)},\!\quad u(0)\!=\!f,\!\quad \forall i, \end{aligned}$$
    (1.1)
    where K is a nonnegative kernel function that is symmetric with respect to the two indexed locations (e.g., ij in the numerator) and may depend on both the spatial distance \(\Vert x^{i}-x^{j}\Vert \) and the values \(|u_{i}-u_{j}|\) of pairs of pixels. Maybe the most prominent example is the non-local means filter [9] where K depends on the distance of patches centered at i and j, respectively. We refer to [35] for a recent survey. Noting that (1.1) is a linear operation with a row-normalized nonnegative (i.e., stochastic) matrix, a similar situation would be
    $$\begin{aligned} u_{i} = \sum _{j} L_{ij}(W) u_{j}, \end{aligned}$$
    (1.2)
    with the likelihood matrix from Fig. 1, if we replaced the prior data \({\mathscr {P}}_{{\mathscr {F}}}\) with the given image data f itself and adopted a distance function \(d_{{\mathscr {F}}}\) that mimics the kernel function K of (1.1). In our approach, however, the likelihood matrix along with its nonlinear geometric transformation, the similarity matrix S(W), evolves along with the assignment matrix W, so as to determine a labeling with unique assignments to each pixel i, rather than convex combinations as required for denoising. Furthermore, the prior data set \({\mathscr {P}}_{{\mathscr {F}}}\) that is assigned in our case may be very different from the given image data and, accordingly, the assignment matrix may have any rectangular shape rather than being a square \(m \times m\) matrix. Conceptually, we are concerned with decision making (labeling, partitioning, unique assignments) rather than with mapping one image to another one. Whenever the prior data \({\mathscr {P}}_{{\mathscr {F}}}\) comprise a finite set of prototypical image values or patches, such that a mapping of the form
    $$\begin{aligned} u_{i} = \sum _{j} W_{ij} f_{j}^{*},\qquad f_{j}^{*} \in {\mathscr {P}}_{{\mathscr {F}}},\qquad \forall i, \end{aligned}$$
    (1.3)
    is well defined, then this does result in a transformed image u after having reached a fixed point of the evolution of W. This result then should not be considered as a denoised image, however. Rather, it merely illustrates the interpretation of the given data f in terms of the prior data \({\mathscr {P}}_{{\mathscr {F}}}\) and a corresponding optimal assignment.
  • Nonlinear Diffusion. Neighborhood filters are closely related to iterative algorithms for numerically solving discretized diffusion equations. Just think of the basic 5-point stencil of the discrete Laplacian, the iterative averaging of nearest neighbor differences, and the large class of adaptive generalizations in terms of nonlinear diffusion filters [48]. More recent work directly addressing this connection includes [10, 36, 44]. The author of [36], for instance, advocates the approximation of the matrix of (1.1) by a symmetric (hence, doubly stochastic) positive-definite matrix, in order to enable interpretations of the denoising operation in terms of the spectral decomposition of the assignment matrix, and to make the connection to diffusion mappings on graphs. The connection to our work is implicitly given by the discussion of the previous point, the relation of our approach to neighborhood filters. Roughly speaking, the application of our approach in the specific case of assigning image data to image data may be seen as some kind of nonlinear diffusion that results in an image whose degrees of freedom are given by the cardinality of the prior set \({\mathscr {P}}_{{\mathscr {F}}}\). We plan to explore the exact nature of this connection in more detail in our future work.

  • Replicator Dynamics. Replicator dynamics and the corresponding equations are well known [17]. They play a major role in models of various disciplines, including theoretical biology and applications of game theory to economy. In the field of image analysis, such models have been promoted by Pelillo and co-workers, mainly to efficiently determine by continuous optimization techniques good local optima of intractable problems, like matchings through maximum-clique search in an association graph [42]. Although the corresponding objective functions are merely quadratic, the analysis of the corresponding equations is rather involved [8]. Accordingly, clever heuristics have been suggested to tame related problems of non-convex optimization [7].

    Regarding our approach, we aim to get rid of these issues—see the discussion of “Global optimality” in Sect. 1.1—through three ingredients: (1) a unique natural initialization, (2) spatial averaging that removes spurious local effects of noisy data, and (3) adopting the Riemannian geometry which determines the structure of the replicator equations, for both geometric spatial averaging and numerical optimization.

  • Relaxation Labeling. The task of labeling primitives in images was formulated as a problem of contextual decision making already 40 years ago [20, 43]. Originally, update rules were merely formulated in order to find mutually consistent individual label assignments. Subsequent research related these labeling rules to optimization tasks. We refer to [41] for a concise account of the literature and for putting the approach on mathematically solid ground. Specifically, the so-called Baum–Eagon theorem was applied in order to show that updates increase the mutual consistency of label assignments. Applications include pairwise clustering [40], which boils down to determining a local optimum by continuous optimization of a non-convex quadratic form, similar to the optimization tasks considered in [8, 42]. We attribute the limited adoption of these approaches to the problems of non-convex optimization discussed above.

    The measure of mutual consistency of our approach is non-quadratic, and the Baum–Eagon theorem about polynomial growth transforms does not apply. Increasing consistency follows from the Riemannian gradient flow that governs the evolution of label assignments. Regarding the non-convexity from the viewpoint of optimization, we believe that the setup of our approach displayed by Fig. 1 significantly alleviates these problems, in particular through the geometric averaging of assignments that emanates from a natural initialization.

We address again some of these points that are relevant for our future work, in Sect. 5.

1.4 Organization

Section 2 summarizes the geometry of the probability simplex in order to define the assignment manifold, which is the basis of our variational approach. The approach is presented in Sect. 3 by revisiting the discussion of Fig. 1, together with the mathematical details. Finally, several numerical experiments are reported in Sect. 4. They are academic, yet non-trivial, and are intended to illustrate properties of the approach as claimed in the preceding sections. Specific applications of image labeling are not within the scope of this paper. We conclude and indicate further directions of research in Sect. 5.

Major symbols and the basic notation used in this paper are listed in “Appendix 1.” In order not to disrupt the flow of reading and reasoning, proofs and technical details, all of which are elementary but essentially complement the presentation and make this paper self-contained, are collected in “Appendix 2.”

2 The Assignment Manifold

In this section, we define the feasible set for representing and computing image labelings in terms of assignment matrices \(W \in {\mathscr {W}}\), the assignment manifold \({\mathscr {W}}\). The basic building block is the open probability simplex \({\mathscr {S}}\) equipped with the Fisher–Rao metric. We collect below and in “Proofs of Section 2 of Appendix 2” corresponding definitions and properties.

For background reading and much more details on information and Riemannian geometry, we refer to [1, 21].

2.1 Geometry of the Probability Simplex

The relative interior \({\mathscr {S}}=\mathring{\varDelta }_{n-1}\) of the probability simplex given by (6.8a) becomes a differentiable Riemannian manifold when endowed with the Fisher–Rao metric. In the present particular case, it reads (cf. the notation (6.16))
$$\begin{aligned} \langle u, v \rangle _{p} := \big \langle \frac{u}{\sqrt{p}}, \frac{v}{\sqrt{p}} \big \rangle ,\qquad \forall u,v \in T_{p}{\mathscr {S}}, \end{aligned}$$
(2.1)
with tangent spaces given by
$$\begin{aligned} T_{p}{\mathscr {S}} = \{ v \in \mathbb {R}^{n} :\langle {\mathbbm {1}}, v \rangle = 0 \},\qquad p \in {\mathscr {S}}. \end{aligned}$$
(2.2)
We regard the scaled sphere \({\mathscr {N}}=2{\mathbb {S}}^{n-1}\) as a manifold with the Riemannian metric induced by the Euclidean inner product of \(\mathbb {R}^{n}\). The following diffeomorphism \(\psi \) between \({\mathscr {S}}_{n}\) and the open subset \(\psi ({\mathscr {S}}_{n}) \subset {\mathscr {N}}\) was suggested, e.g., by [27, Section 2.1] and [1, Section 2.5].

Definition 1

(Sphere Map) We call the diffeomorphism
$$\begin{aligned} \psi :{\mathscr {S}} \rightarrow {\mathscr {N}},\qquad p \mapsto s = \psi (p) := 2 \sqrt{p}, \end{aligned}$$
(2.3)
sphere map (see Fig. 2).

The sphere map enables computing the geometry of \({\mathscr {S}}\) from the geometry of the 2-sphere.

Lemma 1

The sphere map \(\psi \) (2.3) is an isometry, i.e., the Riemannian metric is preserved. Consequently, lengths of tangent vectors and curves are preserved as well.

Proof

See “Proofs of Section 2” in Appendix 2. \(\square \)

In particular, geodesics as critical points of length functionals are mapped by \(\psi \) to geodesics. As a consequence, we have

Lemma 2

[Riemannian Distance on \({\mathscr {S}}\)] The Riemannian distance on \({\mathscr {S}}\) is given by
$$\begin{aligned} d_{{\mathscr {S}}}(p,q) = 2 \arccos \bigg (\sum _{i \in [n]} \sqrt{p_{i} q_{i}} \bigg ) \in [0,\pi ). \end{aligned}$$
(2.4)
Fig. 2

The triangle encloses the image \(\psi ({\mathscr {S}}_{2}) \subset 2 {\mathbb {S}}^{2}\) of the simplex \({\mathscr {S}}_{2}\) under the sphere map (2.3)

Fig. 3

Geometry of the probability simplex induced by the Fisher–Rao Metric. The left panel shows Euclidean (black) and non-Euclidean geodesics (brown) connecting the barycenter (red) and the blue points, along with the corresponding Euclidean and Riemannian means: In comparison with Euclidean means, geometric averaging pushes toward the boundary. The right panel shows contour lines of points that have the same Riemannian distance from the respective center point (black dots). The different sizes of these regions indicate that geometric averaging causes a larger effect around the barycenters of both the simplex and its faces, where such points represent fuzzy labelings, and a smaller effect within regions close to the vertices (unit vectors)

Fig. 4

Each curve, from bottom to top, represents the Riemannian distances \(d_{\overline{{\mathscr {S}}}_{n}}\big (p(0),p(t)\big )\) (normalized to [0,1]; Eq. (2.4)) of points on the curve \(\{p(t)\}_{t \in [0,1]}\) that linearly (i.e., Euclidean) interpolates between the fixed vertex \(p(0)=e^{1}\) of the simplex \(\overline{{\mathscr {S}}}_{n}=\varDelta _{n-1}\) and the barycenter \(p(1)=\overline{p}\), for dimensions \(n = 2^{k},\,k \in \{1,2,3,4,8\}\). As the dimension n grows, the barycenter is located as far away from \(e^{1}\) as all other boundary points \(e^{i},\, t e^{i}+(1-t) e^{j},\, t \in [0,1],\, i,j \ne 1\), etc., which have disjoint supports. This entails a normalizing effect on the Riemannian mean of points that are far away, unlike with Euclidean averaging where this influence increases with the Euclidean distance

The objective function for computing Riemannian means (geometric averaging; see Definition 2 and Eq. (2.8) below) is based on the distance (2.4). Figure 3 visualizes corresponding geodesics and level sets on \({\mathscr {S}}_{3}\) that differ for discrete distributions \(p \in {\mathscr {S}}_{3}\) close to the barycenter and for low-entropy distributions close to the vertices. See also the caption of Fig. 3.

It is well known from the literature (e.g., [3, 30]) that geometries may considerably change in higher dimensions. Figure 4 displays the Riemannian distances of points on curves that connect the barycenter and vertices on \(\overline{{\mathscr {S}}}_{n}\) (to which the distance (2.4) extends), depending on the dimension n. The normalizing effect on geometric averaging, further discussed in the caption, increases with n and is relevant to image labeling, where large values of n may occur in applications.
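
For concreteness, the distance (2.4) is simple to evaluate numerically. The following minimal sketch (NumPy; the function name is ours, not part of the paper) assumes both arguments are strictly positive vectors summing to one:

```python
import numpy as np

def simplex_distance(p, q):
    """Riemannian distance (2.4) on the open probability simplex."""
    # Clip the argument to [-1, 1] to guard against round-off before arccos.
    c = np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0)
    return 2.0 * np.arccos(c)

# Example: the barycenter versus a low-entropy distribution.
p = np.full(4, 0.25)
q = np.array([0.97, 0.01, 0.01, 0.01])
print(simplex_distance(p, q))
```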

Let \({\mathscr {M}}\) be a smooth Riemannian manifold (see the paragraph around Eq. (6.14) introducing our notation). The Riemannian gradient \(\nabla _{{\mathscr {M}}} f(p) \in T_{p}{\mathscr {M}}\) of a smooth function \(f :{\mathscr {M}} \rightarrow \mathbb {R}\) at \(p \in {\mathscr {M}}\) is the tangent vector defined by [21, p. 89]
$$\begin{aligned} \langle \nabla _{{\mathscr {M}}} f(p), v \rangle _{p} = Df(p)[v] = \langle \nabla f(p), v \rangle ,\;\;\; \forall v \in T_{p}{\mathscr {M}}. \end{aligned}$$
(2.5)
We consider next the specific case \({\mathscr {M}}={\mathscr {S}}={\mathscr {S}}_{n}\).

Proposition 1

(Riemannian Gradient on \({\mathscr {S}}_{n}\)) For any smooth function \(f :{\mathscr {S}} \rightarrow \mathbb {R}\), the Riemannian gradient of f at \(p \in {\mathscr {S}}\) is given by
$$\begin{aligned} \nabla _{{\mathscr {S}}} f(p) = p\big (\nabla f(p)-\langle p, \nabla f(p) \rangle {\mathbbm {1}}\big ). \end{aligned}$$
(2.6)

Proof

See “Proofs of Section 2” in Appendix 2. \(\square \)
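
As a sanity check of (2.6), the Riemannian gradient can be computed in one line; the sketch below (NumPy, our own function names) also verifies that the result is a tangent vector in the sense of (2.2):

```python
import numpy as np

def riemannian_gradient(p, grad):
    """Riemannian gradient (2.6) on the simplex: center the Euclidean
    gradient with respect to p, then scale componentwise by p."""
    return p * (grad - np.dot(p, grad))

# The result lies in the tangent space (2.2): its entries sum to zero.
p = np.array([0.2, 0.3, 0.5])
g = np.array([1.0, -2.0, 0.5])
print(riemannian_gradient(p, g).sum())   # ~0 up to round-off
```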

The exponential map associated with the open probability simplex \({\mathscr {S}}\) is detailed next.

Proposition 2

(Exponential Map (Manifold \({\mathscr {S}}\))) The exponential mapping
$$\begin{aligned}&{{\mathrm{Exp}}}_{p} :V_{p} \rightarrow {\mathscr {S}},\quad v \mapsto {{\mathrm{Exp}}}_{p}(v) = \gamma _{v}(1), \quad p \in {\mathscr {S}},\nonumber \\ \end{aligned}$$
(2.7a)
is given by
$$\begin{aligned} \gamma _{v}(t)= & {} \frac{1}{2} \Big (p + \frac{v_{p}^{2}}{\Vert v_{p}\Vert ^{2}}\Big ) +\,\frac{1}{2}\Big (p - \frac{v_{p}^{2}}{\Vert v_{p}\Vert ^{2}}\Big ) \cos \big (\Vert v_{p}\Vert t\big ) \end{aligned}$$
(2.7b)
$$\begin{aligned}&+ \frac{v_{p}}{\Vert v_{p}\Vert } \sqrt{p} \sin \big (\Vert v_{p}\Vert t\big ), \end{aligned}$$
(2.7c)
with \(t=1\), \(v_{p} = v/\sqrt{p},\, p = \gamma (0)\), \(\dot{\gamma }_{v}(0)=v\), and
$$\begin{aligned} V_{p} = \big \{v \in T_{p}{\mathscr {S}} :\gamma _{v}(t) \in {\mathscr {S}},\; t \in [0,1]\big \}. \end{aligned}$$
(2.7d)

Proof

See “Proofs of Section 2” of Appendix 2. \(\square \)
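
A direct transcription of (2.7) into code reads as follows (NumPy; function names are ours). It is only valid for tangent vectors v contained in the set \(V_{p}\) of (2.7d):

```python
import numpy as np

def geodesic(p, v, t):
    """Geodesic gamma_v(t) of (2.7) emanating from p with initial velocity v
    (a tangent vector, i.e. its entries sum to zero)."""
    v_p = v / np.sqrt(p)                       # v_p = v / sqrt(p), componentwise
    nrm = np.linalg.norm(v_p)
    a = v_p**2 / nrm**2
    return (0.5 * (p + a)
            + 0.5 * (p - a) * np.cos(nrm * t)
            + (v_p / nrm) * np.sqrt(p) * np.sin(nrm * t))

def exp_map(p, v):
    """Exponential map (2.7a): Exp_p(v) = gamma_v(1), for v in V_p (2.7d)."""
    return geodesic(p, v, 1.0)
```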

Remark 1

Checking the inclusion \(v \in V_{p}\) due to (2.7d), for a given tangent vector \(v \in T_{p}{\mathscr {S}}\), is inconvenient for applications. Therefore, the mapping \(\exp \) is defined below by Eq. (3.8a) which approximates the exponential mapping \({{\mathrm{Exp}}}\), with the feasible set \(V_{p}\) replaced by the entire space \(T_{p}{\mathscr {S}}\) (Lemma 3).

Accordingly, geometric averaging as defined next (Sect. 2.2) based on \({{\mathrm{Exp}}}\) can be approximated as well using the mapping \(\exp \). This is discussed in Sect. 3.3.2.

2.2 Riemannian Means

The Riemannian center of mass is commonly called Karcher mean or Fréchet mean in the more recent literature, in particular outside the field of mathematics. We prefer—cf. [26]—the former notion and use the shorter term Riemannian mean.

Definition 2

(Riemannian Mean, Geometric Averaging) The Riemannian mean \(\overline{p}\) of a set of points \(\{p^{i}\}_{i \in [N]} \subset {\mathscr {S}}\) with corresponding weights \(w \in \varDelta _{N-1}\) minimizes the objective function
$$\begin{aligned} p \mapsto \frac{1}{2} \sum _{i \in [N]} w_{i} d_{{\mathscr {S}}}^{2}(p,p^{i}) \end{aligned}$$
(2.8)
and satisfies the optimality condition [21, Lemma 4.8.4]
$$\begin{aligned} \sum _{i \in [N]} w_{i} {{\mathrm{Exp}}}_{\overline{p}}^{-1}(p^{i}) = 0, \end{aligned}$$
(2.9)
with the inverse of the exponential mapping \({{\mathrm{Exp}}}^{-1}_{p} :{\mathscr {S}} \rightarrow T_{p}{\mathscr {S}}\). We denote the Riemannian mean by
$$\begin{aligned} {\mathrm {mean}}_{{\mathscr {S}},w}({\mathscr {P}}),\qquad w \in \varDelta _{N-1},\quad {\mathscr {P}}=\{p^{1},\ldots ,p^{N}\}, \end{aligned}$$
(2.10)
and drop the subscript w in the case of uniform weights \(w = \frac{1}{N} {\mathbbm {1}}_{N}\).

Lemma 3

The Riemannian mean (2.10) defined as minimizer of (2.8) is unique for any data \({\mathscr {P}} = \{p^{i}\}_{i \in [N]} \subset {\mathscr {S}}\) and weights \(w \in \varDelta _{N-1}\).

Proof

Using the isometry \(\psi \) given by (2.3), we may consider the scenario transferred to the domain on the 2-sphere depicted in Fig. 2. Due to [25, Thm. 1.2], the objective (2.8) is convex along geodesics and has a unique minimizer within any geodesic ball \({\mathbb {B}}_{r}\) with diameter upper bounded by \(2 r \le \frac{\pi }{2 \sqrt{\kappa }}\), where \(\kappa \) upper bounds the sectional curvatures in \({\mathbb {B}}_{r}\). For the sphere \({\mathscr {N}}\) of radius 2, the sectional curvature is constant, \(\kappa = 1/4\), and hence the inequality is satisfied for the domain \(\psi ({\mathscr {S}}) \subset {\mathscr {N}}\), which has geodesic diameter \(\pi \). \(\square \)

We call the computation of Riemannian means geometric averaging. The implementation of this iterative operation and its efficient approximation by a closed-form expression are addressed in Sect. 3.3.

2.3 Assignment Matrices and Manifold

A natural question is how to extend the geometry of \({\mathscr {S}}\) to stochastic matrices \(W \in \mathbb {R}^{m \times n}\) with \(W_{i} \in {\mathscr {S}},\, i \in [m]\), so as to preserve the information-theoretic properties induced by this metric (that we do not discuss here—cf. [1, 12]).

This problem was recently studied by [37]. The authors suggested three natural definitions of manifolds. It turned out that all of them are slight variations of taking the product of \({\mathscr {S}}\), differing only by the scaling of the resulting product metric. As a consequence, we make the following

Definition 3

(Assignment Manifold) The manifold of assignment matrices, called assignment manifold, is the set
$$\begin{aligned} {\mathscr {W}} = \{W \in \mathbb {R}^{m \times n} :W_{i} \in {\mathscr {S}},\, i \in [m]\}. \end{aligned}$$
(2.11)
According to this product structure and based on (2.1), the Riemannian metric is given by
$$\begin{aligned} \langle U, V \rangle _{W} := \sum _{i \in [m]} \langle U_{i}, V_{i} \rangle _{W_{i}},\qquad U, V \in T_{W}{\mathscr {W}}. \end{aligned}$$
(2.12)

Note that \(V \in T_{W}{\mathscr {W}}\) means \(V_{i} \in T_{W_{i}} {\mathscr {S}},\, i \in [m]\).
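
Since the metric (2.12) is just the sum of the Fisher–Rao inner products (2.1) of corresponding rows, it reduces to a single array expression; a minimal sketch (NumPy, hypothetical function name):

```python
import numpy as np

def assignment_metric(U, V, W):
    """Riemannian metric (2.12) on the assignment manifold: the sum over all
    rows i of the Fisher-Rao inner products (2.1) <U_i, V_i>_{W_i}."""
    return np.sum(U * V / W)
```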

Remark 2

We call stochastic matrices contained in \({\mathscr {W}}\) assignment matrices, due to their role in the variational approach (Sect. 3).

3 Variational Approach

We introduce in this section the basic components of the variational approach and the corresponding optimization task, as illustrated in Fig. 1.

3.1 Basic Components

3.1.1 Features, Distance Function, Assignment Task

Let
$$\begin{aligned} f :{\mathscr {V}} \rightarrow {\mathscr {F}},\qquad i \mapsto f_{i},\qquad i \in {\mathscr {V}}=[m], \end{aligned}$$
(3.1)
denote any given data, either raw image data or features extracted from the data in a preprocessing step. In any case, we call f a feature. At this point, we do not make any assumption about the feature space \({\mathscr {F}}\) except that a distance function
$$\begin{aligned} d_{{\mathscr {F}}} :{\mathscr {F}} \times {\mathscr {F}} \rightarrow \mathbb {R}, \end{aligned}$$
(3.2)
is specified. We assume that a finite subset of \({\mathscr {F}}\)
$$\begin{aligned} {\mathscr {P}}_{{\mathscr {F}}} := \{f^{*}_{j}\}_{j \in [n]}, \end{aligned}$$
(3.3)
additionally is given, called prior set. We are interested in the assignment of the prior set to the data in terms of an assignment matrix
$$\begin{aligned} W \in {\mathscr {W}} \subset \mathbb {R}^{m \times n}, \end{aligned}$$
(3.4)
with the manifold \({\mathscr {W}}\) defined by (2.11). Thus, by definition, every row vector \(0 < W_{i} \in {\mathscr {S}}\) is a discrete distribution with full support \({{\mathrm{supp}}}(W_{i})=[n]\). The element
$$\begin{aligned} W_{ij} = \Pr (f^{*}_{j}|f_{i}),\qquad i \in [m],\quad j \in [n], \end{aligned}$$
(3.5)
quantifies the assignment of prior item \(f^{*}_{j}\) to the observed data point \(f_{i}\). We may think of this number as the posterior probability that \(f^{*}_{j}\) generated the observation \(f_{i}\).

The assignment task asks for determining an optimal assignment \(W^{*}\), considered as “explanation” of the data based on the prior data \({\mathscr {P}}_{{\mathscr {F}}}\). We discuss next the ingredients of the objective function that will be used to solve assignment tasks.

3.1.2 Distance Matrix

Given \({\mathscr {F}}, d_{{\mathscr {F}}}\) and \({\mathscr {P}}_{{\mathscr {F}}}\), we compute the distance matrix
$$\begin{aligned} D \in \mathbb {R}^{m \times n},\quad D_{i} \in \mathbb {R}^{n},\quad&D_{ij} = \frac{1}{\rho } d_{{\mathscr {F}}} (f_{i},f^{*}_{j}), \end{aligned}$$
(3.6a)
$$\begin{aligned}&\rho >0, \quad i \in [m],\quad j \in [n], \end{aligned}$$
(3.6b)
where \(\rho \) is the first of two user parameters to be set. This parameter serves two purposes. It accounts for the unknown scale of the data f that depends on the application and hence cannot be known beforehand. Furthermore, its value determines what subset of the prior features \(f^{*}_{j},\, j \in [n]\) effectively affects the process of determining the assignment matrix W. This becomes explicit through the definition of the next processing stage, given by Eq. (3.12) below, which uses D as input. We call \(\rho \) the selectivity parameter.
Furthermore, we set the initial value
$$\begin{aligned} W = W(0),\qquad W_{i}(0) := \frac{1}{n} {\mathbbm {1}}_{n},\quad i \in [m]. \end{aligned}$$
(3.7)
of the flow (3.21) determining W(t) that is introduced and discussed below in Sect. 3.2.3.

Note that W is initialized with the uninformative uniform assignment that is not biased toward a solution in any way.
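
In code, these two ingredients amount to filling an \(m \times n\) array with scaled distances and a uniform matrix; the following sketch (NumPy; function names and the argument d_F are ours, the latter standing for the user-supplied distance function \(d_{{\mathscr {F}}}\)) illustrates (3.6) and (3.7):

```python
import numpy as np

def distance_matrix(features, prior, d_F, rho):
    """Distance matrix D of (3.6): D_ij = d_F(f_i, f*_j) / rho, with the
    selectivity parameter rho."""
    D = np.empty((len(features), len(prior)))
    for i, f_i in enumerate(features):
        for j, f_star_j in enumerate(prior):
            D[i, j] = d_F(f_i, f_star_j) / rho
    return D

def initial_assignment(m, n):
    """Uniform (barycentric) initialization W(0) of (3.7)."""
    return np.full((m, n), 1.0 / n)
```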

3.1.3 Likelihood Matrix

The next processing step is based on the following

Definition 4

(Lifting Map (Manifolds \({\mathscr {S}}, {\mathscr {W}}\))) The lifting mapping is defined by
$$\begin{aligned} \exp&:T{\mathscr {S}} \rightarrow {\mathscr {S}},&(p,u)&\!\mapsto \!\exp _{p}(u) \!=\! \frac{p e^{u}}{\langle p, e^{u} \rangle }, \end{aligned}$$
(3.8a)
$$\begin{aligned} \exp&:T{\mathscr {W}} \rightarrow {\mathscr {W}},&(W,U)&\!\mapsto \!\exp _{W}(U) \!=\! \begin{pmatrix} \exp _{W_{1}}(U_{1}) \\ \vdots \\ \exp _{W_{m}}(U_{m}) \end{pmatrix}, \end{aligned}$$
(3.8b)
where \(U_{i}, W_{i},\, i \in [m]\), index the row vectors of the matrices U, W, and where the argument decides which of the two mappings \(\exp \) applies.
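
A minimal sketch of the lifting map (3.8), in NumPy and with our own function names, is given below; the shift of the argument only stabilizes the exponential and, as noted in Remark 4 below, does not change the result:

```python
import numpy as np

def lift(p, u):
    """Lifting map (3.8a): exp_p(u) = p * e^u / <p, e^u>."""
    q = p * np.exp(u - u.max())        # the constant shift leaves the result unchanged
    return q / q.sum()

def lift_rows(W, U):
    """Row-wise lifting map (3.8b) acting on assignment matrices."""
    Q = W * np.exp(U - U.max(axis=1, keepdims=True))
    return Q / Q.sum(axis=1, keepdims=True)
```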

Remark 3

After replacing the arbitrary point \(p \in {\mathscr {S}}\) by the barycenter \(\frac{1}{n} {\mathbbm {1}}_{n}\), readers will recognize the softmax function in (3.8a), i.e., \(\langle \frac{1}{n} {\mathbbm {1}}_{n}, e^{u} \rangle ^{-1} \big (\frac{1}{n} {\mathbbm {1}}_{n} e^{u}\big ) = \frac{e^{u}}{\langle {\mathbbm {1}}, e^{u} \rangle }\). This function is widely used in various application fields of applied statistics (e.g., [45]), ranging from parametrizations of distributions, e.g., for logistic classification [6], to other problems of modeling [34] not related to our approach.

The lifting mapping generalizes the softmax function through the dependency on the base point p. In addition, it approximates geodesics and accordingly the exponential mapping \({{\mathrm{Exp}}}\), as stated next. We therefore use the symbol \(\exp \) as a mnemonic. Unlike \({{\mathrm{Exp}}}_{p}\), the mapping \(\exp _{p}\) is defined on the entire tangent space, cf. Remark 1.

Proposition 3

Let
$$\begin{aligned} v = \big ({{\mathrm{Diag}}}(p)-p p^{\top }\big ) u,\qquad v \in T_{p}{\mathscr {S}}. \end{aligned}$$
(3.9)
Then \(\exp _{p}(u t)\) given by (3.8a) solves
$$\begin{aligned} \dot{p}(t) = p(t) u - \langle p(t), u \rangle p(t),\qquad p(0)=p, \end{aligned}$$
(3.10)
and provides a first-order approximation of the geodesic \(\gamma _{v}(t)\) from (2.7a)
$$\begin{aligned} \exp _{p}(u t) \approx p + v t,\qquad \Vert \gamma _{v}(t)-\exp _{p}(u t)\Vert = {\mathscr {O}}(t^{2}). \end{aligned}$$
(3.11)

Proof

See “Proofs of Section 3 and Further Details” of Appendix 2. \(\square \)

Figure 5 illustrates the approximation of geodesics \(\gamma _{v}\) and the exponential mapping \({{\mathrm{Exp}}}_{p}\), respectively, by the lifting mapping \(\exp _{p}\).

Remark 4

Note that adding any constant vector \(c {\mathbbm {1}},\, c \in \mathbb {R}\) to a vector u does not change \(\exp _{p}(u)\): \(\frac{p e^{u+c {\mathbbm {1}}}}{\langle p, e^{u+c {\mathbbm {1}}} \rangle } = \frac{p (e^{c}{\mathbbm {1}}) e^{u}}{\langle p, (e^{c}{\mathbbm {1}}) e^{u} \rangle } = \frac{p e^{u}}{\langle p, e^{u} \rangle } = \exp _{p}(u)\). Accordingly, the same vector v is generated by (3.9). While the definition (3.8a) removes this ambiguity, there is no need to remove the mean of the vector u in numerical computations.

Fig. 5

Illustration of Proposition 3. Various geodesics \(\gamma _{v^{i}}(t),\,i \in [k],\, t \in [t,t_{\max }]\) (solid lines) emanating from p (red point) with the same speed \(\Vert v^{i}\Vert _{p}=\Vert v^{j}\Vert _{p},\, \forall i,j\), are displayed together with the curves \(\exp _{p}(u^{i} t),\,i \in [k],\, t \in [t,t_{\max }]\), where the vectors \(u^{i}, v^{i},\, i \in [k]\) satisfy (3.9)

Given D and W as described in Sect. 3.1.2, we lift the matrix D to the manifold \({\mathscr {W}}\) by
$$\begin{aligned} L = L(W)&:= \exp _{W}(-U) \in {\mathscr {W}}, \end{aligned}$$
(3.12a)
$$\begin{aligned} U_{i}&= D_{i}-\frac{1}{n} \langle {\mathbbm {1}}, D_{i} \rangle {\mathbbm {1}},\quad i \in [m], \end{aligned}$$
(3.12b)
with \(\exp \) defined by (3.8b). We call L the likelihood matrix because the row vectors are discrete probability distributions which separately represent the similarity of each observation \(f_{i}\) to the prior data \({\mathscr {P}}_{{\mathscr {F}}}\), as measured by the distance \(d_{{\mathscr {F}}}\) in (3.6).

Note that the operation (3.12) depends on the assignment matrix \(W \in {\mathscr {W}}\).
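
In terms of the row-wise lifting sketched after Definition 4, the likelihood matrix (3.12) can be computed as follows (NumPy; again a sketch with our own names):

```python
import numpy as np

def likelihood_matrix(W, D):
    """Likelihood matrix L(W) of (3.12): center each row of D as in (3.12b)
    and lift the negated result to the manifold at the current point W."""
    U = D - D.mean(axis=1, keepdims=True)     # (3.12b)
    Q = W * np.exp(-U)                        # row-wise lifting (3.8b) of -U
    return Q / Q.sum(axis=1, keepdims=True)   # normalize rows to the simplex
```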

3.1.4 Similarity Matrix

Based on the likelihood matrix L, we define the similarity matrix
$$\begin{aligned} S&= S(W) \in {\mathscr {W}}, \end{aligned}$$
(3.13a)
$$\begin{aligned} S_{i}&= {\mathrm {mean}}_{{\mathscr {S}}}\{L_{j}\}_{j \in \tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i)},\quad i \in [m], \end{aligned}$$
(3.13b)
where each row is the Riemannian mean (2.10) (using uniform weights) of the likelihood vectors, indexed by the neighborhoods as specified by the underlying graph \({\mathscr {G}}=({\mathscr {V}},{\mathscr {E}})\),
$$\begin{aligned} \tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i) = \{i\} \cup {\mathscr {N}}_{{\mathscr {E}}}(i),\quad {\mathscr {N}}_{{\mathscr {E}}}(i) = \{j \in {\mathscr {V}} :ij \in {\mathscr {E}}\}. \end{aligned}$$
(3.14)
Thus, S represents the similarity of the data within a local spatial neighborhood to the prior data \({\mathscr {P}}_{{\mathscr {F}}}\).

Note that S depends on W because L does so by (3.12). The size of the neighborhoods \(|\tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i)|\) is the second user parameter, besides the selectivity parameter \(\rho \) for scaling the distance matrix (3.6). Typically, each \(\tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i)\) indexes the same local “window” around pixel location i. We then call the window size \(|\tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i)|\) the scale parameter.

Remark 5

In basic applications, the distance matrix D will not change once the features and the feature distance \(d_{{\mathscr {F}}}\) are determined. On the other hand, the likelihood matrix L(W) and the similarity matrix S(W) have to be recomputed as the assignment W evolves, as part of any numerical algorithm used to compute an optimal assignment \(W^{*}\).

We point out, however, that more general scenarios are conceivable—without essentially changing the overall approach—where \(D = D(W)\) depends on the assignment as well and hence has to be updated too, as part of the optimization process. Section 4.5 provides an example.

3.2 Objective Function, Optimal Assignment

We specify next the objective function as criterion for assignments and the gradient flow on the assignment manifold, to compute an optimal assignment \(W^{*}\). Finally, based on \(W^{*}\), the so-called assignment mapping is defined.

3.2.1 Objective Function

Getting back to the interpretation from Sect. 3.1.1 of the assignment matrix \(W \in {\mathscr {W}}\) as posterior probabilities,
$$\begin{aligned} W_{ij} = \Pr (f^{*}_{j}|f_{i}), \end{aligned}$$
(3.15)
of assigning prior feature \(f^{*}_{j}\) to the observed feature \(f_{i}\), a natural objective function to be maximized is
$$\begin{aligned} \max _{W \in {\mathscr {W}}} J(W),\qquad J(W) := \langle S(W),W \rangle . \end{aligned}$$
(3.16)
The functional J together with the feasible set \({\mathscr {W}}\) formalizes the following objectives:
  1. Assignments W should maximally correlate with the feature-induced similarities \(S = S(W)\), as measured by the inner product which defines the objective function J(W).

  2. Assignments of prior data to observations should be made in a spatially coherent way. This is accomplished by geometric averaging of likelihood vectors over local spatial neighborhoods, which turns the likelihood matrix L(W) into the similarity matrix S(W), depending on W.

  3. Maximizers \(W^{*}\) should define image labelings in terms of rows \(\overline{W}_{i}^{*} = e^{k_{i}} \in \{0,1\}^{n},\; i \in [m],\, k_{i} \in [n]\), that are indicator vectors. While the latter matrices are not contained in the assignment manifold \({\mathscr {W}}\) as feasible set, we compute in practice assignments \(W^{*} \approx \overline{W}^{*}\) arbitrarily close to such points. It will turn out below that the geometry enforces this approximation. As a consequence, in view of (3.15), such points \(W^{*}\) maximize posterior probabilities, akin to the interpretation of MAP inference with discrete graphical models by minimizing corresponding energy functionals. As discussed in Sect. 1, however, the mathematical structure of the optimization task of our approach and the way of fusing data and prior information are quite different.

The following statement formalizes the discussion of the form of desired maximizers \(W^{*}\).

Lemma 4

We have
$$\begin{aligned} \sup _{W \in {\mathscr {W}}} J(W) = m, \end{aligned}$$
(3.17)
and the supremum is attained at the extreme points
$$\begin{aligned} \overline{{\mathscr {W}}}^{*} := \big \{\overline{W}^{*}&\in \{0,1\}^{m \times n} :\overline{W}^{*}_{i} = e^{k_{i}}, \end{aligned}$$
(3.18a)
$$\begin{aligned} i&\in [m],\; k_{1},\ldots ,k_{m} \in [n]\big \} \subset \overline{{\mathscr {W}}}, \end{aligned}$$
(3.18b)
corresponding to matrices with unit vectors as row vectors.

Proof

See “Proofs of Section 3 and Further Details” of Appendix 2. \(\square \)

3.2.2 Assignment Mapping

Regarding the feature space \({\mathscr {F}}\), no assumptions were made so far, except for specifying a distance function \(d_{{\mathscr {F}}}\). We have to be more specific about \({\mathscr {F}}\) only if we wish to synthesize the approximation to the given data f, in terms of an assignment \(W^{*}\) that optimizes (3.16) and the prior data \({\mathscr {P}}_{{\mathscr {F}}}\). We denote the corresponding approximation by
$$\begin{aligned} u :{\mathscr {W}} \rightarrow {\mathscr {F}}^{|{\mathscr {V}}|},\qquad W \mapsto u(W),\qquad u^{*} := u(W^{*}), \end{aligned}$$
(3.19)
and call it assignment mapping.
A trivial example of such a mapping concerns cases where prototypical feature vectors \(f^{*j},\, j\in [n]\) are assigned to data vectors \(f^{i},\, i \in [m]\): the mapping \(u(W^{*})\) then simply replaces each data vector by the convex combination of prior vectors assigned to it,
$$\begin{aligned} u^{*i} = \sum _{j \in [n]} W_{ij}^{*} f^{*j},\qquad i \in [m]. \end{aligned}$$
(3.20)
And if \(W^{*}\) approximates a global maximum \(\overline{W}^{*}\) as characterized by Lemma 4, then each \(f_{i}\) is (almost) uniquely replaced by some \(u^{*k_{i}} = f^{*k_{i}}\).

A less trivial example is the case of prior information in terms of patches. We specify the mapping u for this case and further concrete scenarios in Sect. 4.
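
For the case of prototypical feature vectors just described, the assignment mapping (3.20) is a single matrix product; a minimal sketch (NumPy, hypothetical names), together with the hard assignment obtained near a global maximum:

```python
import numpy as np

def assignment_mapping(W, prototypes):
    """Assignment mapping (3.20) for vector-valued prototypes f*_j, stacked
    as rows of an (n, d) array: row i of the result is sum_j W_ij f*_j."""
    return W @ prototypes

def hard_labels(W):
    """Near a global maximum the rows of W are almost unit vectors, so the
    argmax reads off the (almost) unique label assignment per data point."""
    return np.argmax(W, axis=1)
```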

3.2.3 Optimization Approach

The optimization task (3.16) does not admit a closed-form solution. We therefore compute the assignment by the Riemannian gradient ascent flow on the manifold \({\mathscr {W}}\),
$$\begin{aligned} \dot{W}_{ij}&= \big (\nabla _{{\mathscr {W}}} J(W)\big )_{ij} \end{aligned}$$
(3.21a)
$$\begin{aligned}&= W_{ij} \Big (\big (\nabla _{i}J(W)\big )_{j} - \big \langle W_{i},\nabla _{i}J(W) \big \rangle \Big ), \end{aligned}$$
(3.21b)
$$\begin{aligned} W_{i}(0)&= \frac{1}{n} {\mathbbm {1}}, \quad j \in [n], \end{aligned}$$
(3.21c)
with
$$\begin{aligned} \nabla _{i}J(W)&:= \frac{\partial }{\partial W_{i}} J(W) \end{aligned}$$
(3.21d)
$$\begin{aligned}&= \Big (\frac{\partial }{\partial W_{i1}} J(W),\ldots ,\frac{\partial }{\partial W_{in}} J(W)\Big )^{\top }, \; i \in [m], \end{aligned}$$
(3.21e)
which results from applying (2.6) to the objective (3.16). The flows (3.21), for \(i \in [m]\), are not independent as the product structure of \({\mathscr {W}}\) (cf. Sect. 2.3) might suggest. Rather, they are coupled through the gradient \(\nabla J(W)\) which reflects the interaction of the distributions \(W_{i},\,i \in [m]\), due to the geometric averaging which results in the similarity matrix (3.13).
Observe that, by (3.21a) and \(\langle {\mathbbm {1}}, W_{i} \rangle =1\),
$$\begin{aligned} \langle {\mathbbm {1}}, \dot{W}_{i} \rangle&= \langle {\mathbbm {1}}, W_{i} \nabla _{i} J(W) \rangle \end{aligned}$$
(3.22a)
$$\begin{aligned}&\qquad - \langle W_{i}, \nabla _{i} J(W) \rangle \langle {\mathbbm {1}}, W_{i} \rangle = 0,\quad i \in [m], \end{aligned}$$
(3.22b)
that is \(\nabla _{{\mathscr {W}}} J(W) \in T_{W}{\mathscr {W}}\), and thus the flow (3.21a) evolves on \({\mathscr {W}}\). Let \(W(t) \in {\mathscr {W}},\, t \ge 0\) solve (3.21a). Then, with the Riemannian metric (2.12),
$$\begin{aligned} \frac{\hbox {d}}{\hbox {d}t} J\big (W(t)\big )&= \big \langle \nabla _{{\mathscr {W}}}J\big (W(t)\big ), \dot{W}(t) \big \rangle _{W(t)} \end{aligned}$$
(3.23a)
$$\begin{aligned}&\overset{(3.21a)}{=} \big \Vert \nabla _{{\mathscr {W}}}J\big (W(t)\big )\big \Vert _{W(t)}^{2} \ge 0, \end{aligned}$$
(3.23b)
that is, the objective function value increases until a stationary point is reached where the Riemannian gradient vanishes. Clearly, we expect W(t) to approximate a global maximum due to Lemma 4; all global maxima satisfy the condition for stationary points \(\overline{W}\),
$$\begin{aligned} 0 = \dot{\overline{W}}_{i} = \overline{W}_{i}\big (\nabla _{i} J(\overline{W}) - \langle \overline{W}_{i}, \nabla _{i} J(\overline{W}) \rangle {\mathbbm {1}}\big ),\quad i \in [m], \end{aligned}$$
(3.24)
because replacing \(\overline{W}_{i}\) in (3.24) by \(\overline{W}_{i}^{*} = e^{k_{i}}\) for some \(k_{i} \in [n]\) makes the bracket vanish for the \(k_{i}\)-th equation, whereas all other equations indexed by \(j \ne k_{i},\, j \in [n]\) are satisfied due to \(\overline{W}_{ij}^{*}=0\).
Regarding interior stationary points \(\overline{W} \in {\mathscr {W}}\), which satisfy \(\overline{W} > 0\) by the definition of \({\mathscr {W}}\), all brackets \((\cdots )\) on the r.h.s. of (3.24) must vanish, which can only happen if the Euclidean gradient satisfies
$$\begin{aligned} \nabla _{i} J(\overline{W}) = \langle \overline{W}_{i}, \nabla _{i} J(\overline{W}) \rangle {\mathbbm {1}},\qquad i \in [m] \end{aligned}$$
(3.25)
including the case \(\nabla J(\overline{W})=0\). Inspecting the gradient of the objective function (3.16), we get
$$\begin{aligned} \frac{\partial }{\partial W_{ij}} J(W)&= \frac{\partial }{\partial W_{ij}} \langle S(W), W \rangle = \sum _{k,l} \frac{\partial }{\partial W_{ij}} \big (S_{kl}(W) W_{kl}\big ) \end{aligned}$$
(3.26a)
$$\begin{aligned}&= \sum _{k,l} \Big (\frac{\partial }{\partial W_{ij}} S_{kl}(W)\Big ) W_{kl} + S_{ij}(W) \end{aligned}$$
(3.26b)
$$\begin{aligned}&= \langle T^{ij}(W), W \rangle + S_{ij}(W), \end{aligned}$$
(3.26c)
where both matrices S(W) and \(T^{ij}(W) = \frac{\partial }{\partial W_{ij}} S(W)\) depend in a smooth way on the data (3.1) and the prior set (3.3) through the distance matrix (3.6), the likelihood matrix (3.12) and the geometric averaging (3.13) which forms the similarity matrix S(W). Regarding the second term on the r.h.s. of (3.26b), a computation relegated to “Proofs of Section 3 and Further Details of Appendix 2” yields
$$\begin{aligned} \langle T^{ij}(W), W \rangle = \sum _{k,l} -\Big (\big (H^{k}(W)\big )^{-1} h^{k,ij}(W)\Big )_{l} W_{kl}. \end{aligned}$$
(3.27)
The way to compute the somewhat unwieldy explicit form of the r.h.s. is explained by (7.14f) and the corresponding appendix. In terms of these quantities, condition (3.25) for stationary interior points translates to
$$\begin{aligned} \langle T^{ij}(\overline{W}), \overline{W} \rangle&+ S_{ij}(\overline{W}) \end{aligned}$$
(3.28a)
$$\begin{aligned}&= \sum _{j} \big (\langle T^{ij}(\overline{W}), \overline{W} \rangle + S_{ij}(\overline{W})\big ) \overline{W}_{ij}, \end{aligned}$$
(3.28b)
$$\begin{aligned}&\qquad \forall i \in [m], \quad \forall j \in [n] \end{aligned}$$
(3.28c)
including the special case \(S_{ij}(W) = -\langle T^{ij}(W), W \rangle \), \(\forall i \in [m]\), \(j \in [n]\), corresponding to \(\nabla J(\overline{W})=0\). Note that condition (3.28) requires that for every \(i \in [m]\), the l.h.s. takes the same value for every \(j \in [n]\), such that averaging with respect to \(W_{i}\) on the r.h.s. causes no change.

We cannot rule out the existence of specific data configurations for which the flow (3.21) may reach such very specific interior stationary points. Any such point, however, will not be a maximum and will be isolated, by virtue of the local strict convexity of the objective function (2.8) for Riemannian means (cf. Lemma 3), which determines the similarity matrix (3.13). Consequently, any perturbation (e.g., by numerical computation) will let the flow escape from such a point, in order to maximize the objective due to (3.23).

We summarize this reasoning by the

Conjecture 1

For any data (3.1) and prior sets (3.3), up to a subset of \({\mathscr {W}}\) of measure zero, the flow W(t) generated by (3.21) approximates a global maximum as defined by (3.18) in the sense that, for any \(0 < \varepsilon \ll 1\), there is a \(t=t(\varepsilon )\) such that
$$\begin{aligned} \big \Vert W\big (t(\varepsilon )\big ) - \overline{W}^{*}\big \Vert \le \varepsilon ,\qquad \text {for some}\quad \overline{W}^{*} \in \overline{{\mathscr {W}}}^{*}. \end{aligned}$$
(3.29)

Remark 6

  1. Since \(\overline{{\mathscr {W}}}^{*} \not \in {\mathscr {W}}\), the flow W(t) cannot converge to a global maximum, and numerical problems arise when (3.29) holds for \(\varepsilon \) very close to zero. Our strategy to avoid such problems is described in Sect. 3.3.1.

  2. Although global maxima are not attained, we agree to call a point \(W^{*}=W(t)\) that satisfies (3.29) for some fixed small \(\varepsilon \) a maximum and an optimal assignment. The criterion which terminates our algorithm is specified in Sect. 3.3.4.

  3. Our numerical approximation of the flow (3.21) is detailed in Sect. 3.3.3.

3.3 Implementation

We discuss in this section specific aspects of the implementation of the variational approach.

3.3.1 Assignment Normalization

Because each vector \(W_{i}\) approaches some vertex \(\overline{W}^{*} \in \overline{{\mathscr {W}}}^{*}\) by construction, and because the numerical computations are designed to evolve on \({\mathscr {W}}\), we avoid numerical issues by checking for each \(i \in [m]\) every entry \(W_{ij},\, j \in [n]\), after each iteration of the algorithm (3.36) below. Whenever an entry drops below \(\varepsilon =10^{-10}\), we rectify \(W_{i}\) by
$$\begin{aligned} W_{i}&\leftarrow \frac{1}{\langle {\mathbbm {1}}, \tilde{W}_{i} \rangle } \tilde{W}_{i}, \end{aligned}$$
(3.30a)
$$\begin{aligned} \tilde{W}_{i}&= W_{i} - \min _{j \in [n]} W_{ij} + \varepsilon ,\qquad \varepsilon = 10^{-10}. \end{aligned}$$
(3.30b)
In other words, the number \(\varepsilon \) plays the role of 0 in our implementation. Our numerical experiments (Sect. 4) showed that this operation removed any numerical issues without affecting convergence in terms of the criterion specified in Sect. 3.3.4.
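
A vectorized version of the rectification (3.30) may look as follows (NumPy; a sketch under the stated threshold \(\varepsilon =10^{-10}\)):

```python
import numpy as np

def rectify_rows(W, eps=1e-10):
    """Assignment normalization (3.30): rows with an entry below eps are
    shifted so that their minimum equals eps and then renormalized."""
    W = W.copy()
    bad = (W < eps).any(axis=1)                              # rows to rectify
    Wt = W[bad] - W[bad].min(axis=1, keepdims=True) + eps    # (3.30b)
    W[bad] = Wt / Wt.sum(axis=1, keepdims=True)              # (3.30a)
    return W
```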

3.3.2 Computing Riemannian Means

Computation of the similarity matrix S(W) due to Eq. (3.13) involves the computation of Riemannian means. In view of Definition 2, we compute the Riemannian mean \({\mathrm {mean}}_{{\mathscr {S}}}({\mathscr {P}})\) of given points \({\mathscr {P}}=\{p^{i}\}_{i \in [N]} \subset {\mathscr {S}}\), using uniform weights, as fixed point \(p^{(\infty )}\) by iterating the following steps.
$$\begin{aligned} (1)\quad&\text {Set}\; p^{(0)}=\frac{1}{n} {\mathbbm {1}}. \end{aligned}$$
(3.31a)
Given \(p^{(k)},\; k \ge 0\), compute (cf. the explicit expressions (7.16b) and (2.7))
$$\begin{aligned} (2)\quad&v^{i}= {{\mathrm{Exp}}}_{p^{(k)}}^{-1}(p^{i}),\quad i \in [N], \end{aligned}$$
(3.31b)
$$\begin{aligned} (3)\quad&v = \frac{1}{N} \sum _{i \in [N]} v^{i}, \end{aligned}$$
(3.31c)
$$\begin{aligned} (4)\quad&p^{(k+1)} = {{\mathrm{Exp}}}_{p^{(k)}}(v), \end{aligned}$$
(3.31d)
and continue with step (2) until convergence. In view of the optimality condition (2.9), our implementation returns \(p^{(k+1)}\) as a result if after carrying out step (3) the condition \(\Vert v\Vert _{\infty } \le 10^{-3}\) holds.
We point out that numerical problems arise at step (2) if identical vectors are averaged, as the expression (7.16b) shows. Such situations may occur, e.g., when computer-generated images are processed. Setting \(\varepsilon =1-\langle \sqrt{p},\sqrt{q}\rangle \) for two vectors \(p, q \in {\mathscr {S}}\), we replace the expression (7.16b) by
$$\begin{aligned} \begin{aligned} {{\mathrm{Exp}}}_{p}^{-1}(q)&\approx \frac{9 \varepsilon ^{2}+40 \varepsilon + 480}{240 \sqrt{1-\varepsilon /2}} (\sqrt{p q}-(1-\varepsilon ) p) \\&\qquad \text {if}\; \varepsilon < 10^{-3}. \end{aligned} \end{aligned}$$
(3.32)
Although the iteration (3.31) converges quickly, carrying out such iterations as a subroutine, at each pixel and iterative step of the outer iteration (3.36), increases runtime (of non-parallel implementations) noticeably. In view of the approximation of the exponential map \({{\mathrm{Exp}}}_{p}(v) = \gamma _{v}(1)\) by (3.11), it seems natural to approximate the Riemannian mean as well by modifying steps (2) and (4) above accordingly.

Lemma 5

Replacing in the iteration (3.31) above the exponential mapping \({{\mathrm{Exp}}}_{p}\) by the lifting map \(\exp _{p}\) (3.8a) yields the closed-form expression
$$\begin{aligned} \begin{aligned} {\mathrm {mean}}_{{\mathscr {S}}}({\mathscr {P}}) \approx&\frac{{\mathrm {mean}}_{g}({\mathscr {P}})}{\langle {\mathbbm {1}}, {\mathrm {mean}}_{g}({\mathscr {P}}) \rangle }, \\&{\mathrm {mean}}_{g}({\mathscr {P}}) = \Big (\prod _{i \in [N]} p^{i}\Big )^{\frac{1}{N}} \end{aligned} \end{aligned}$$
(3.33)
as approximation of the Riemannian mean \({\mathrm {mean}}_{{\mathscr {S}}}({\mathscr {P}})\), with the geometric mean \({\mathrm {mean}}_{g}({\mathscr {P}})\) applied componentwise to the vectors in \({\mathscr {P}}\).

Proof

See “Proofs of Section 3 and Further Details” of Appendix 2. \(\square \)

Remark 7

Taking into account non-uniform weights \(w \in \varDelta _{N-1}\), according to Definition 2, is straightforward. We briefly take up this point in Sect. 5: see Eq. (5.2) and the corresponding paragraph together with Fig. 14.
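In code, the approximation (3.33) reduces to a normalized componentwise geometric mean; the optional weights realize the generalization (5.2) mentioned in Remark 7. A small NumPy sketch, with array layout of our own:

```python
import numpy as np

def approx_riemannian_mean(points, weights=None):
    """Closed-form approximation (3.33) of the Riemannian mean: the normalized
    componentwise geometric mean of the given points on the simplex.
    Non-uniform weights (Remark 7, Eq. (5.2)) enter as exponents."""
    P = np.asarray(points)                        # shape (N, n), strictly positive
    if weights is None:
        weights = np.full(len(P), 1.0 / len(P))   # uniform weights 1/N
    logm = weights @ np.log(P)                    # weighted log-average
    m = np.exp(logm)                              # componentwise geometric mean
    return m / m.sum()                            # projection back onto the simplex
```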

3.3.3 Optimization Algorithm

A thorough analysis of various discrete schemes for numerically integrating the gradient flow (3.21), including stability estimates, is beyond the scope of this paper and will be separately addressed in follow-up work (see Sect. 5 for a short discussion).

Here, we merely adopted the following basic strategy from [33] that has been widely applied in the literature and performed remarkably well in our experiments. Approximating the flow (3.21) for each vector \(W_{i},\, i \in [m]\), by the time-discrete scheme
$$\begin{aligned} \frac{W_{i}^{(k+1)}-W_{i}^{(k)}}{t_{i}^{(k+1)}-t_{i}^{(k)}}&= W_{i}^{(k)} \big (\nabla _{i}J(W^{(k)})-\langle W_{i}^{(k)}, \nabla _{i} J(W^{(k)}) \rangle {\mathbbm {1}}\big ), \end{aligned}$$
(3.34a)
$$\begin{aligned} W_{i}^{(k)}&:= W_{i}\left( t_{i}^{(k)}\right) , \end{aligned}$$
(3.34b)
and choosing the adaptive step sizes \(t_{i}^{(k+1)}-t_{i}^{(k)} = \frac{1}{\langle W_{i}^{(k)}, \nabla _{i} J\left( W^{(k)}\right) \rangle }\), yields the multiplicative updates
$$\begin{aligned} W_{i}^{(k+1)} = \frac{W_{i}^{(k)} \big (\nabla _{i}J(W^{(k)})\big )}{\langle W_{i}^{(k)}, \nabla _{i} J(W^{(k)}) \rangle },\qquad i \in [m]. \end{aligned}$$
(3.35)
We further simplify this update in view of the explicit expression (3.26) of the gradient \(\nabla _{i} J(W)\) of the objective function, which comprises two terms. The first one contributes the derivative of S(W) with respect to \(W_{i}\) and is significantly smaller than the second term \(S_{i}(W)\) of (3.26), because \(S_{i}(W)\) results from averaging (3.13) the likelihood vectors \(L_{j}(W_{j})\) over spatial neighborhoods and hence changes slowly. As a consequence, we simply drop this first term, which, as a by-product, avoids the numerical evaluation of the expensive expressions (3.27) that specify it.
Thus, for computing the numerical results reported in this paper, we used the fixed-point iteration
$$\begin{aligned} W_{i}^{(k+1)} = \frac{W_{i}^{(k)} \big (S_{i}(W^{(k)})\big )}{\langle W_{i}^{(k)}, S_{i}(W^{(k)}) \rangle },\quad W_{i}^{(0)} = \frac{1}{n} {\mathbbm {1}}, \quad i \in [m] \end{aligned}$$
(3.36)
together with the approximation due to Lemma 5 for computing Riemannian means, which, via (3.13), define the similarity matrices \(S(W^{(k)})\). Note that this requires recomputing the likelihood matrices (3.12) as well at each iteration k (see Fig. 1).

3.3.4 Termination Criterion

Algorithm (3.36) was terminated if the average entropy
$$\begin{aligned} -\frac{1}{m} \sum _{i \in [m]} \sum _{j \in [n]} W_{ij}^{(k)} \log W_{ij}^{(k)} \end{aligned}$$
(3.37)
dropped below a threshold. For example, a threshold value of \(10^{-3}\) means in practice that, up to a tiny fraction of indices \(i \in [m]\) that should not matter for a subsequent analysis, all vectors \(W_{i}\) are very close to unit vectors, thus indicating an almost unique assignment of prior items \(f_{j}^{*},\, j \in [n]\), to the data \(f_{i},\, i \in [m]\). Note that this termination criterion conforms to Conjecture 1 and was met in all experiments.
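The following sketch assembles the pieces of Sects. 3.3.1–3.3.4 into one loop: the multiplicative update (3.36), the rectification (3.30), and the entropy criterion (3.37). The function `likelihood` is a placeholder standing in for the likelihood matrix (3.12), which we do not spell out here; `rectify` and `approx_riemannian_mean` are the sketches given above; names and data layout are our own:

```python
import numpy as np

def assignment_flow(D, neighbors, rho, likelihood, max_iter=200, ent_tol=1e-3):
    """Fixed-point iteration (3.36) with the safeguard (3.30) and the entropy
    criterion (3.37). W is the (m x n) assignment matrix, initialized at the
    barycenter; `likelihood` stands in for the likelihood matrix (3.12) and is
    recomputed in every iteration; the similarity S_i is the approximate
    Riemannian mean (3.33) of the likelihood vectors over the neighborhood of i."""
    m, n = D.shape
    W = np.full((m, n), 1.0 / n)
    for _ in range(max_iter):
        L = likelihood(D, W, rho)                              # (3.12)
        S = np.stack([approx_riemannian_mean(L[neighbors[i]])  # (3.13) via Lemma 5
                      for i in range(m)])
        W = W * S
        W /= W.sum(axis=1, keepdims=True)                      # update (3.36)
        W = rectify(W)                                         # safeguard (3.30)
        avg_entropy = -np.mean(np.sum(W * np.log(W), axis=1))  # criterion (3.37)
        if avg_entropy < ent_tol:
            break
    return W
```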

4 Illustrative Applications and Discussion

We focus in this section on a few academic, yet non-trivial, numerical examples to illustrate and discuss basic properties of the approach. Elaborating any specific application is outside the scope of this paper.
Fig. 6

Parameter influence on labeling. The top row shows a ground-truth image and noisy input data. Both images and the prior data set \({\mathscr {P}}_{{\mathscr {F}}}\) are composed of 31 color vectors. Each color vector is encoded as a vertex of the simplex \(\varDelta _{30}\). This results in unit distances between all colors and thus enables an unbiased assessment of the impact of geometric averaging and of the two parameter values \(\rho , |{\mathscr {N}}_{\varepsilon }|\). The remaining panels show the assignments \(u(W^{*})\) for various parameter values, where \(W^{*}\) maximizes the objective function (3.16). The spatial scale \(|{\mathscr {N}}_{\varepsilon }|\) increases from left to right, and the selectivity parameter \(\rho \) increases from top to bottom. The results illustrate the compromise between sensitivity to noise and sensitivity to the geometry of signal transitions. If \(\rho \) is chosen too small, there is a tendency toward noise-induced oversegmentation, in particular at small spatial scales \(|{\mathscr {N}}_{\varepsilon }|\). Depending on the application, however, the ability to separate the physical and the spatial scale, recognizing outliers with small spatial support while performing diffusion at a larger spatial scale as in the panels of the left column, may be beneficial

Fig. 7

Parameter values and convergence rate. Average entropy (3.37) of the assignment vectors \(W_{i}^{(k)}\) as a function of the iteration counter k and of the two parameters \(\rho \) and \(|{\mathscr {N}}_{\varepsilon }|\), for the labeling task illustrated in Fig. 6. The left panel shows that, despite high selectivity in terms of a small value of \(\rho \), small spatial scales necessitate resolving more conflicting assignments by propagating information through geometric spatial averaging. As a consequence, more iterations are needed to achieve convergence and a labeling. The right panel, on the other hand, shows that at a fixed spatial scale \(|{\mathscr {N}}_{\varepsilon }|\), higher selectivity leads to faster convergence, because outliers are simply removed from the averaging process, whereas low selectivity leads to an assignment (labeling) that takes all data into account

4.1 Parameters, Empirical Convergence Rate

Figure 6 shows a color image and a noisy version of it. The latter image was used as input data of a labeling problem. Both images comprise 31 color vectors forming the prior data set \({\mathscr {P}}_{{\mathscr {F}}} = \{f^{1*},\ldots ,f^{31*}\}\). The labeling task is to assign these vectors in a spatially coherent way to the input data so as to recover the ground-truth image.

This task should not be confused with image denoising in the traditional sense [9], where noise has to be removed from real-valued image data. Rather, the experiment depicted in Fig. 6 represents difficult classification tasks where the assignment process is essential in order to cope with the high noise level.

Every color vector was encoded by a vertex of the simplex \(\varDelta _{30}\), that is, by one of the unit vectors \(\{e^{1},\ldots ,e^{31}\} \subset \{0,1\}^{31}\). Choosing the distance \(d_{{\mathscr {F}}}(f^{i},f^{j}) := \Vert f^{i}-f^{j}\Vert _{1}\), this results in unit distances between all pairs of data points and hence allows us to assess most clearly the impact of geometric spatial averaging and the influence of the two parameters \(\rho \) and \(|{\mathscr {N}}_{\varepsilon }|\), introduced in Sects. 3.1.2 and 3.1.4, respectively. We refer to the caption for a brief discussion of the selectivity parameter \(\rho \) and of the spatial scale in terms of \(|{\mathscr {N}}_{\varepsilon }|\).

The reader familiar with total variation-based denoising, where only a single parameter is used to control the influence of regularization, may ask why two parameters are used in the present approach and whether they are necessary. We refer again to Fig. 6 and the caption, where the separation of the physical and spatial scale based on different parameter choices is demonstrated. The total variation measure couples these scales, as the co-area formula explicitly shows. As a consequence, only a single parameter is needed. On the other hand, larger values of this parameter lead to the well-known loss-of-contrast effect, which the present approach avoids by properly choosing the parameters \(\rho \) and \(|{\mathscr {N}}_{\varepsilon }|\) corresponding to these two scales.

Figure 7 shows how convergence of the iterative algorithm (3.36) is affected by these two parameters. It also demonstrates that a few tens of massively parallel outer iterations suffice to reach the termination criterion of Sect. 3.3.4. A parallel implementation only has to take into account the spatial neighborhood (3.14) in which pixel locations directly interact in order to compute the similarity matrix (3.13) by geometric averaging.

All results were computed using the assignment mapping (3.20) without rounding. This shows that the termination criterion of Sect. 3.3.4, illustrated in Fig. 7, leads to (almost) unique assignments.

4.2 Vector-Valued Data

Let \(f^{i} \in \mathbb {R}^{d}\) denote vector-valued image data or extracted feature vectors at locations \(i \in [m]\), and let
$$\begin{aligned} {\mathscr {P}}_{{\mathscr {F}}} = \{f^{*1},\ldots ,f^{*n}\} \end{aligned}$$
(4.1)
denote the prior information given by prototypical feature vectors. In the example that follows, \(f^{i}\) will be an RGB color vector. It should be clear, however, that any feature vector of arbitrary dimension d could be used instead, depending on the application at hand. We used the distance function
$$\begin{aligned} d_{{\mathscr {F}}}(f^{i},f^{*j}) = \frac{1}{d} \Vert f^{i}-f^{*j}\Vert _{1}, \end{aligned}$$
(4.2)
with the normalizing factor \(1/d\), which makes the choice of the parameter \(\rho \) insensitive to the dimension d of the feature space. Given an optimal assignment matrix \(W^{*}\) as solution to (3.16), the prior information assigned to the data is given by the assignment mapping
$$\begin{aligned} u^{i} = u^{i}(W^{*}) = {\mathbb {E}}_{W_{i}^{*}}[{\mathscr {P}}_{{\mathscr {F}}}],\qquad i \in [m], \end{aligned}$$
(4.3)
which merely replaces each data vector \(f^{i}\) by the prior vector \(f^{*j}\) assigned to it through \(W_{i}^{*}\).
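For concreteness, a short NumPy sketch of the normalized \(\ell _{1}\) distance (4.2) and of the assignment mapping (4.3); the row-wise layout of the data features F and of the prototypes P is chosen here for illustration:

```python
import numpy as np

def feature_distance(F, P):
    """Distance matrix (4.2): D[i, j] = (1/d) * ||f_i - p_j||_1 between data
    features F (m x d) and prototypes P (n x d)."""
    d = F.shape[1]
    return np.abs(F[:, None, :] - P[None, :, :]).sum(axis=2) / d

def assign_prototypes(W, P):
    """Assignment mapping (4.3): u_i = E_{W_i}[P_F], i.e., each row of W weights
    the prototypes; for (almost) unique assignments this returns (essentially)
    the prototype with the largest weight."""
    return W @ P
```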
Figure 8 shows the assignment of 20 prototypical color vectors to a color image for various values of the spatial scale parameter \(|{\mathscr {N}}_{\varepsilon }|\), while keeping the selectivity parameter \(\rho \) fixed. As a consequence, the induced assignments and image partitions exhibit a natural coarsening effect in the spatial domain.
Fig. 8

Image labeling at different spatial scales. The two rightmost columns show the same information using a random color code for the assignment of the 20 prior vectors to pixel locations, to highlight the induced image partitions. Increasing the spatial scale \(|{\mathscr {N}}_{\varepsilon }|\) for a fixed value of the selectivity parameter \(\rho \) induces a natural coarsening of the assignments and the corresponding image partitions along the spatial scale. a Input image (left panel) and a section of it. Twenty color vectors (right panel) forming the prior data set \({\mathscr {P}}_{{\mathscr {F}}}\) according to Eq. (4.1). b Assignment \(u(W^{*})\), \(|{\mathscr {N}}_{\varepsilon }|=3 \times 3,\, \rho =0.01\). c Assignment \(u(W^{*})\), \(|{\mathscr {N}}_{\varepsilon }|=7 \times 7,\, \rho =0.01\). d Assignment \(u(W^{*})\), \(|{\mathscr {N}}_{\varepsilon }|=11 \times 11,\, \rho =0.01\)

4.3 Patches

Let \(f^{i}\) denote a patch of raw image data (or, more generally, a patch of feature vectors)
$$\begin{aligned} f^{ij} \in \mathbb {R}^{d},\qquad j \in {\mathscr {N}}_{p}(i),\qquad i \in [m], \end{aligned}$$
(4.4)
centered at location \(i \in [m]\) and indexed by \({\mathscr {N}}_{p}(i) \subset {\mathscr {V}}\) (subscript p indicates neighborhoods for patches). With each entry \(j \in {\mathscr {N}}_{p}(i)\), we associate the Gaussian weight
$$\begin{aligned} w^{p}_{ij} := G_{\sigma }(\Vert x^{i}-x^{j}\Vert ),\qquad i,j \in {\mathscr {N}}_{p}(i), \end{aligned}$$
(4.5)
where the vectors \(x^{i}, x^{j}\) correspond to the locations in the image domain indexed by \(i, j \in {\mathscr {V}}\). Specifically, \(w^{p}\) is chosen to be the discrete impulse response of a Gaussian low-pass filter supported on \({\mathscr {N}}_{p}(i)\), so that the scale \(\sigma \) directly depends on the patch size and does not need to be chosen by hand. Such downweighting of values farther from the center location of a patch is an established elementary technique for reducing boundary and ringing effects of patch (“window”)-based image processing.
The prior information is given in terms of n prototypical patches
$$\begin{aligned} {\mathscr {P}}_{{\mathscr {F}}} = \{f^{*1},\ldots ,f^{*n}\}, \end{aligned}$$
(4.6)
and a corresponding distance
$$\begin{aligned} d_{{\mathscr {F}}}(f^{i},f^{*j}),\qquad i \in [m],\quad j \in [n]. \end{aligned}$$
(4.7)
There are many ways to choose this distance, depending on the application at hand. We refer to Examples 1 and 2 below. Expression (4.7) is based on the tacit assumption that patch \(f^{*j}\) is centered at i and indexed by \({\mathscr {N}}_{p}(i)\) as well.
Fig. 9

a A patch supposed to represent prior knowledge about the structure of an image f (b). The dictionary \({\mathscr {P}}_{{\mathscr {F}}}\) of Eq. (4.6) was generated by all translations of (a) and assigned to the image (b), using a distance \(d_{{\mathscr {F}}}\) that adapts the two gray values of each template to the data; see Eqs. (4.14) and (4.15). The resulting assignment \(u(W^{*})\) is depicted in (c). d The residual image \(v(W^{*}) := f-u(W^{*})\), obtained by subtracting (c) from (b) (rescaled for better visibility)

Given an optimal assignment matrix \(W^{*}\), it remains to specify how prior information is assigned to every location \(i \in {\mathscr {V}}\), resulting in a vector \(u^{i} = u^{i}(W^{*})\) that is the overall result of processing the input image f. Location i is affected by patches that overlap with i. Let us denote the indices of these patches by
$$\begin{aligned} {\mathscr {N}}_{p}^{i \leftarrow j} := \{ j \in {\mathscr {V}} :i \in {\mathscr {N}}_{p}(j) \}. \end{aligned}$$
(4.8)
Every such patch is centered at location j to which prior patches are assigned by
$$\begin{aligned} {\mathbb {E}}_{W_{j}^{*}}[{\mathscr {P}}_{{\mathscr {F}}}] = \sum _{k \in [n]} W_{jk}^{*} f^{*k}. \end{aligned}$$
(4.9)
Let location i be indexed by \(i_{j}\) in patch j (local coordinate inside patch j). Then, by summing over all patches indexed by \({\mathscr {N}}_{p}^{i \leftarrow j}\) whose supports include location i, and by weighting the contributions to location i by the corresponding weights (4.5), we obtain the vector
$$\begin{aligned}&u^{i} = u^{i}(W^{*}) = \frac{1}{\sum _{j' \in {\mathscr {N}}_{p}^{i \leftarrow j}} w^{p}_{j' i_{j}}} \sum _{j \in {\mathscr {N}}_{p}^{i \leftarrow j}} w^{p}_{j i_{j}} \nonumber \\&\quad \sum _{k \in [n]} W_{jk}^{*} f^{*ki_{j}} \;\in \mathbb {R}^{d}, \end{aligned}$$
(4.10)
that is assigned by \(W^{*}\) to location i. This expression looks more clumsy than it actually is. In words, the vector \(u^{i}\) assigned to location i is the convex combination of vectors contributed by patches overlapping with i, which are themselves formed as convex combinations of prior patches. In particular, if we consider the common case of equal patch supports \({\mathscr {N}}_{p}(i)\) for every i that additionally are symmetric with respect to the center location i, then \({\mathscr {N}}_{p}^{i \leftarrow j} = {\mathscr {N}}_{p}(i)\). As a consequence, due to the symmetry of the weights (4.5), the first sum of (4.10) sums up all weights \(w^{p}_{ij}\). Hence, the normalization factor on the right-hand side of (4.10) equals 1, because the low-pass filter \(w^{p}\) preserves the zero-order moment (mean) of signals. Furthermore, it then makes sense to denote by \((-i)\) the location \(i_{j}\) corresponding to i in patch j. Thus (4.10) becomes
$$\begin{aligned} u^{i} = u^{i}(W^{*}) = \sum _{j \in {\mathscr {N}}_{p}(i)} w^{p}_{j (-i)} \sum _{k \in [n]} W_{jk}^{*} f^{*k(-i)}. \end{aligned}$$
(4.11)
Introducing in view of (4.9) the shorthand
$$\begin{aligned} {\mathbb {E}}^{i}_{W_{j}^{*}}[{\mathscr {P}}_{{\mathscr {F}}}] := \sum _{k \in [n]} W_{jk}^{*} f^{*k(-i)} \end{aligned}$$
(4.12)
for the vector assigned to i by the convex combination of prior patches assigned to j, we finally rewrite (4.10), due to the symmetry \(w^{p}_{j(-i)} = w^{p}_{ji} = w^{p}_{ij}\), in the more handy form\(^{1}\)
$$\begin{aligned} u^{i} = u^{i}(W^{*}) = {\mathbb {E}}_{w^{p}}\big [{\mathbb {E}}^{i}_{W_{j}^{*}}[{\mathscr {P}}_{{\mathscr {F}}}]\big ]. \end{aligned}$$
(4.13)
The inner expression represents the assignment of prior vectors to location i by fitting prior patches to all locations \(j\in {\mathscr {N}}_{p}(i)\). The outer expression fuses the assigned vectors. If they were all the same, the outer operation would have no effect, of course.
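A possible NumPy sketch of the patch fusion (4.11)/(4.13) for scalar-valued data on a regular grid with equal, symmetric square patch supports. Boundaries are handled periodically for simplicity (cf. the footnote on adapting \(w^{p}\) near the image boundary); array shapes and names are our own conventions:

```python
import numpy as np

def fuse_patch_assignments(W, prior_patches, height, width, sigma):
    """Patch fusion (4.11)/(4.13): every pixel i receives a Gaussian-weighted
    convex combination of the values predicted for it by all patches that
    overlap i. W: (m x n) assignment matrix with m = height*width pixels in
    row-major order; prior_patches: (n, r, r) scalar prior patches, r odd."""
    n, r, _ = prior_patches.shape
    half = r // 2
    # Gaussian window (4.5) on the patch support, normalized to sum to 1
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    w = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    w /= w.sum()
    # expected prior patch at every center location j: E_{W_j}[P], cf. (4.9)
    expected = (W @ prior_patches.reshape(n, -1)).reshape(height, width, r, r)
    u = np.zeros((height, width))
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            # the patch centered at j = i + (dy, dx) predicts pixel i at local
            # index (half - dy, half - dx); the weight depends only on |i - j|
            src = np.roll(expected[:, :, half - dy, half - dx],
                          (-dy, -dx), axis=(0, 1))
            u += w[half + dy, half + dx] * src
    return u
```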

We discuss further properties of this approach by concrete examples.

Example 1

(Patch Assignment) Figure 9 shows an image f and the corresponding assignment \(u(W^{*})\) based on a patch dictionary \({\mathscr {P}}_{{\mathscr {F}}}\) that was formed as explained in the caption.

We chose the distance \(d_{{\mathscr {F}}}\) of Eq. (4.2),
$$\begin{aligned} d_{{\mathscr {F}}}(f^{i},f^{*j}) = \frac{1}{|{\mathscr {N}}_{p}(i)|} \Vert f^{i}-f^{*j(i)}\Vert _{1}, \end{aligned}$$
(4.14)
where the arguments \(f^{i}, f^{*j}\) stand for the vectorized scalar-valued patches centered at location i, after adapting each prior template \(f^{*j}\) at each pixel location i to the data f, denoted by \(f^{*j}=f^{*j(i)}\) in (4.14). Each such template takes two values that were adapted to the data patch \(f^{i}\) to which it is compared, i.e.,
$$\begin{aligned} f^{*j(i)}_{k}&\in \left\{ f^{i}_{\text {low}}, f^{i}_{\text {high}}\right\} , \quad \forall k, \end{aligned}$$
(4.15a)
where
$$\begin{aligned} f^{i}_{\text {low}}&= \!{\mathrm {median}}\left\{ f^{i}_{j} :j \!\in \! {\mathscr {N}}_{p}(i),\; f^{i}_{j} \!<\! {\mathrm {median}}\{f^{i}_{j}\}_{j \!\in \! {\mathscr {N}}_{p}(i)}\right\} , \end{aligned}$$
(4.15b)
$$\begin{aligned} f^{i}_{\text {high}}&= \!{\mathrm {median}}\big \{f^{i}_{j} :j \!\in \! {\mathscr {N}}_{p}(i),\; f^{i}_{j} \!\ge \!{\mathrm {median}}\{f^{i}_{j}\}_{j \in {\mathscr {N}}_{p}(i)}\big \}. \end{aligned}$$
(4.15c)
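A sketch of the adapted template distance (4.14)–(4.15), assuming each two-valued prior template is represented by a Boolean mask over its patch support (our own convention); the mask entries are filled with the medians (4.15b,c) before the normalized \(\ell _{1}\) distance (4.14) is taken:

```python
import numpy as np

def adapted_template_distance(patch, template_mask):
    """Distance (4.14) with the two-valued template adapted to the data patch
    via (4.15): the template's binary pattern (template_mask, Boolean array of
    the patch shape) is filled with the medians of the lower and upper halves
    of the patch values before taking the normalized l1 distance."""
    vals = patch.ravel()
    med = np.median(vals)
    low_vals = vals[vals < med]
    high_vals = vals[vals >= med]
    f_low = np.median(low_vals) if low_vals.size else med      # (4.15b)
    f_high = np.median(high_vals)                              # (4.15c)
    template = np.where(template_mask, f_high, f_low)          # (4.15a)
    return np.abs(patch - template).sum() / patch.size         # (4.14)
```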

The result in Fig. 9c illustrates how the approximation \(u(W^{*})\) of f is restricted by the prior knowledge, leading to normalized signal transitions regarding both the spatial geometry and the signal values. By maximizing the objective (3.16), a patch-consistent and dense cover of the image is computed. It induces a strong nonlinear image filtering effect by fusing, through assignment, more than 200 predictions of possible values for each single pixel, based on the patch dictionary \({\mathscr {P}}_{{\mathscr {F}}}\).

The approach enables modeling additive image decompositions
$$\begin{aligned} f = u(W^{*}) + v(W^{*}), \end{aligned}$$
(4.16)
that is, image = geometry + texture & noise, for specific image classes, which are implicitly represented by the dictionary \({\mathscr {P}}_{{\mathscr {F}}}\). Such a decomposition appears to be more discriminative than additive image decompositions achieved by convex variational approaches (see, e.g., [2]) that employ various regularizing norms for this purpose.
Fig. 10

Analysis of the local signal structure of image a by patch assignment. This process is non-local in two ways: (i) through the assignment of \(3 \times 3\) patches (center row) and \(7 \times 7\) patches (bottom row), respectively, and (ii) due to the gradient flow (3.21) that promotes the spatially coherent assignment of patches corresponding to different orientations of signal transitions, in order to maximize the similarity objective (3.16). a Input image f. b Contour plot of a smooth image computed and subtracted from f as a preprocessing step. c Prior patches representing binary signal transitions at orientations \(0^{\circ }, 30^{\circ },\ldots \) (top row), and the corresponding translation-invariant dictionary (bottom row). Each row of patches constitutes an equivalence class of patches. d Color code indicating oriented bright-to-dark signal transitions. e Assignment \(u(W^{*})\) of \(3 \times 3\) patches to image f from (a) (\(\rho =0.02\)). f Class labels of assigned patches encoded according to (d). Black means assignment of the constant template that was added to the dictionary (c). g Residual image \(v(W^{*})=f-u(W^{*})\) (rescaled for visualization). h Assignment \(u(W^{*})\) of \(7 \times 7\) patches to image f from (a) (\(\rho =0.02\)). i Class labels of assigned patches encoded according to (d). j Residual image \(v(W^{*})=f-u(W^{*})\) (rescaled for visualization)

Example 2

(Patch Assignment) Figure 10 shows a fingerprint image characterized by two gray values \(f^{*}_{\text {dark}}, f^{*}_{\text {bright}}\) that were extracted from the histogram of f after removing a smooth function of the spatially varying mean value (panel (b)). The latter was computed by interpolating the median values for each patch of a coarse \(16 \times 16\) partition of the entire image.

Figure 10c shows the dictionary of patches modeling the remaining binary signal transitions. An essential difference from Example 1 is the subdivision of the dictionary into classes of equivalent patches corresponding to each orientation. The averaging process was set up to distinguish only the assignment of patches of different patch classes and to treat patches of the same class equally. This makes geometric averaging particularly effective if signal structures conform to a single class on larger spatially connected supports. Moreover, it reduces the problem size to merely 13 class labels: 12 orientations at \(k \cdot 30^{\circ },\, k \in [12]\), together with the single constant patch complementing the dictionary.

The distance \(d_{{\mathscr {F}}}(f^{i},f^{*j})\) between the image patch centered at i and the j-th prior patch was chosen depending on both the prior patch and the data patch it was compared to: For the constant prior patch, the distance was
$$\begin{aligned} d_{{\mathscr {F}}}&(f^{i},f^{*j}) = \frac{1}{|{\mathscr {N}}_{p}(i)|} \Vert f^{i}-f^{*}_{i} f^{*j}\Vert _{1} \end{aligned}$$
(4.17a)
with
$$\begin{aligned} f^{*}_{i}&= {\left\{ \begin{array}{ll} f^{*}_{\mathrm{dark}}, &{} \text {if}\; {\mathrm {med}}\{f^{i}_{j}\}_{j \in {\mathscr {N}}_{p}(i)} \le \frac{1}{2}\left( f^{*}_{\mathrm{dark}} + f^{*}_{\mathrm{bright}}\right) , \\ f^{*}_{\mathrm{bright}}, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(4.17b)
For all other prior patches, the distance was
$$\begin{aligned} d_{{\mathscr {F}}}\left( f^{i},f^{*j}\right) = \frac{1}{|{\mathscr {N}}_{p}(i)|} \Vert f^{i}-f^{*j}\Vert _{1}. \end{aligned}$$
(4.18)
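The two distance rules (4.17) and (4.18) can be sketched as follows; the flag `is_constant`, the gray levels, and the array layout are illustrative choices of ours, not part of the paper's notation:

```python
import numpy as np

def fingerprint_distance(patch, template, is_constant, f_dark, f_bright):
    """Distances (4.17)/(4.18): for the constant prior patch, the gray level is
    switched between f_dark and f_bright depending on the median of the data
    patch (4.17b); for the oriented prior patches, the plain normalized l1
    distance (4.18) is used."""
    size = patch.size
    if is_constant:
        med = np.median(patch)
        level = f_dark if med <= 0.5 * (f_dark + f_bright) else f_bright  # (4.17b)
        return np.abs(patch - level * template).sum() / size              # (4.17a)
    return np.abs(patch - template).sum() / size                          # (4.18)
```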
Fig. 11

Unsupervised assignment of uniform noise a to itself in terms of a uniform discretization of the RGB color cube \([0,1]^{3}\) that does not include the color gray \(0.5\,(1,1,1)^{\top }\). The assignment selects the 8 colors (d) closest to gray, with random frequencies (c) and a spatially random partition (b) (rescaled to highlight the partition). a Uniform noise. b Sparse assignment \(u(W^{*})\) (displayed after rescaling) of \(6^{3}\) color vectors, corresponding to a uniform discretization of the RGB cube \([0,1]^{3}\), to the image (a) yields a noise-induced random piecewise constant partition through geometric averaging (parameters: \(|{\mathscr {N}}_{\varepsilon }|=7 \times 7, \rho =0.01\)). c Relative frequencies of assignment of the prior color vectors \(f^{*j},\, j \in [6^3]\). The 8 nonzero frequencies correspond to the vectors indicated in the color cube (d). d Only the 8 color vectors (out of \(6^3\)) closest to gray (all with equal distance) were assigned to (a), resulting in (b). These colors look different in (b) due to rescaling the image \(u(W^{*})\) to \([0,1]^{3}\) for better visibility

The center and bottom rows in Fig. 10 show the assignment \(u(W^{*})\) of the dictionary of \(3 \times 3\) patches (center row) and of \(7 \times 7\) patches (bottom row), respectively. The center panels (f) and (i) depict the class labels of these assignments according to the color code of panel (d). These images display the interpretation of the image structure of f from panel (a). While the assignment of patches of size \(3 \times 3\) is slightly noisy, which becomes visible through the assignment of the constant template marked by black in panel (f), the assignment of \(5 \times 5\) or \(7 \times 7\) patches results in a robust, spatially coherent, and accurate representation of the local image structure. The corresponding pronounced nonlinear filtering effect is due to the consistent assignment of a large number of patches at each pixel location and the fusion of the corresponding predicted values.

Panels (g) and (j) show the resulting additive image decompositions (4.16), which seem difficult to achieve using established convex variational approaches (see, e.g., [2]) that employ various regularizing norms and duality for this purpose.

Finally, we point out that it would be straightforward to add to the dictionary further patches modeling minutiae and other features relevant to fingerprint analysis. We do not consider in this paper any application-specific aspects, however.

Fig. 12

Scenario for evaluating the approach of Sect. 4.5. Panel (f) illustrates the set of all rectangles and corresponding subsets (c, d). Unlike (d), the rectangles in (c) do not intersect. Sampling the rectangles from both (c) and (d), shown together in (a, b), produced the input data (e). The task is to recognize among (f) all foreground objects (c) based on unary features (coverage of points) and disjunctive constraints (rectangles should not intersect). Panel (g) shows the result. a Collection of rectangular areas that result in (e) after uniform point sampling. b Decomposition of the rectangles (a) into foreground (dark, cf. c) and background (light, cf. d). c Randomly oriented foreground rectangles that do not intersect. d Arbitrary sample of background rectangles from (f). e Input data: point pattern resulting from uniformly sampling the rectangles (a). f All possible rectangles densely cover the domain, as indicated in the center region (not completely shown for better visibility). g Assignment (labeling) of the rectangles (f) based on the data (e): recognized foreground objects from (c) (black) and recognized background objects from (d) (dashed). Two foreground objects were erroneously labeled as background (gray). All remaining rectangles from (f) also belong to the background, four of which were erroneously labeled as foreground (white)

4.4 Unsupervised Assignment

We consider the case that no prior information is available.

The simplest way to handle the absence of prior information is to use the given data themselves as prior information along with a suitable constraint, to enforce selection of the most important parts by self-assignment.

In order to illustrate this mechanism clearly, Fig. 11 shows as an example the assignment of uniform noise to itself. As prior data \({\mathscr {P}}_{{\mathscr {F}}}\), we uniformly discretized the RGB color cube \([0,1]^{3}\) at \(0, 0.2, 0.4, \ldots , 1\) along each axis, resulting in \(|{\mathscr {P}}_{{\mathscr {F}}}| = 6^{3} = 216\) color vectors. Because there is no preference for any of these vectors, spatial diffusion of uniform noise at any spatial scale will inherently end up with the average color gray, which, however, is excluded from the prior set by construction. Accordingly, the process terminated with a spatially random assignment of the 8 color vectors closest to gray (Fig. 11b, rescaled, and Fig. 11d), induced solely by the input noise and geometric averaging at a certain scale. Figure 11c depicts the relative frequencies with which each prior vector is assigned to some location. Except for the 8 aforementioned vectors, all others are ignored.

Unsupervised scenarios based on our approach, for both vector- and patch-valued data, will be elaborated in detail in our follow-up work (Sect. 5).

4.5 Labeling with Adaptive Distances

In this section, we consider a simple instance of the more general class of scenarios where the distance matrix (3.6) \(D = D(W)\) depends on the assignment matrix W, in addition to the likelihood matrix L(W) and the similarity matrix S(W).

Figure 12e displays a point pattern that was generated by sampling a foreground and background process of randomly oriented rectangles, as explained by the remaining panels in Fig. 12. The task is to recover the foreground process among all possible rectangles (Fig. 12f) based on (1) unary features given by the fraction of points covered by each rectangle, and on (2) the prior knowledge that unlike background rectangles, elements of the foreground process do not intersect. Rectangles of the background process were slightly less densely sampled than foreground rectangles so as to make the unary features indicative. Due to the overlap of many rectangles (Fig. 12a), however, these unary features are noisy (“weak”).

As a consequence, exploiting the prior knowledge that foreground rectangles do not intersect becomes decisive. This is done by determining the intersection pattern of all rectangles (Fig. 12f) in terms of Boolean values that are arranged into matrices \(R_{ij}\), for each edge ij of the grid graph whose vertices correspond to the centroids of the rectangles in Fig. 12f: \((R_{ij})_{k,l}=1\) if rectangle k at position i intersects with rectangle l at position j, and \((R_{ij})_{k,l}=0\) otherwise. Due to the geometry of the rectangles, a rectangle at position i may only intersect with \(8 \times 18 = 144\) rectangles located within an 8-neighborhood \(j \in {\mathscr {N}}_{\varepsilon }(i)\). Generalizations to other geometries are straightforward.

The inference task to recover the foreground rectangles (Fig. 12c) from the point pattern (Fig. 12e) may be seen as a multi-labeling problem based on an asymmetric Potts-like model: Labels correspond to equally oriented rectangles and have to be determined so as to maximize the coverage of points, subject to the pairwise constraints that selected rectangles do not intersect. Alternatively, we may think of binary “off–on” variables that are assigned to each rectangle in Fig. 12f, which have to be determined subject to disjunctive constraints: At each location, at most a single variable may become active, and pairwise active variables have to satisfy the intersection constraints. Note that in order to suppress intersecting rectangles, penalizing costs are only encountered if (a subset of) pairs of variables receive the same value 1 (=active and intersecting). This violates the submodularity constraint [29, Eq. (7)] and hence rules out global optimization using graph cuts.

Taking all ingredients into account, we define the distance vector field
$$\begin{aligned} D_{i}&= D_{i}(W) = \frac{1}{\rho } \begin{pmatrix} \tilde{D}_{i}(W) \\ \sigma \end{pmatrix}, \end{aligned}$$
(4.19a)
$$\begin{aligned} \tilde{D}_{i}(W)&= -p^{i} + \frac{\lambda }{|{\mathscr {N}}_{\varepsilon }(i)|} \sum _{j \in {\mathscr {N}}_{\varepsilon }(i)} R_{ij} W_{j}, \quad \lambda ,\sigma > 0, \end{aligned}$$
(4.19b)
where \(\rho > 0\) is the selectivity parameter from (3.6), \(\sigma > 0\) represents the cost of the additional label “no rectangle,” the vector \(p^{i}\) collects the fractions of points covered by the rectangles at position i, and \(\lambda > 0\) weights the influence of the intersection prior. This latter term is defined by the matrices \(R_{ij}\) discussed above and is given by the gradient with respect to W of the penalty \((\lambda /|{\mathscr {N}}_{\varepsilon }(i)|) \sum _{ij \in {\mathscr {E}}} \langle W_{i}, R_{ij} W_{j} \rangle \).
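A sketch of how the adaptive distance field (4.19) could be assembled; we assume the assignment vectors \(W_{i}\) carry \(n+1\) entries (the n rectangle labels plus the extra “no rectangle” label), and the dictionary R of Boolean intersection matrices, the neighborhood lists, and all names are our own conventions:

```python
import numpy as np

def adaptive_distances(W, coverage, R, neighbors, rho, lam, sigma):
    """Distance field (4.19): for each position i, the distance vector contains,
    per rectangle label, the negative point coverage plus the averaged
    intersection penalty against the neighbors' current assignments, and an
    extra entry sigma for the 'no rectangle' label; everything scaled by 1/rho.
    coverage: (m x n) fractions p^i of points covered by each rectangle,
    R: dict mapping (i, j) to the Boolean intersection matrix R_ij (n x n)."""
    m, n = coverage.shape
    D = np.empty((m, n + 1))
    for i in range(m):
        penalty = sum(R[i, j] @ W[j, :n] for j in neighbors[i]) / len(neighbors[i])
        D[i, :n] = (-coverage[i] + lam * penalty) / rho     # (4.19b), scaled
        D[i, n] = sigma / rho                               # extra-label cost
    return D
```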
In [24], a continuous optimization approach using DC (difference of convex functions) programming was proposed to compute local minimizers of non-convex functionals similar to \(\langle D(W), W \rangle \), with D given by (4.19). This “Euclidean approach”, in contrast to the geometric approach proposed here, entails providing a DC decomposition of the intersection penalty just discussed and explicitly taking into account the affine constraints \(W_{i} \in \varDelta _{n-1}\). As a result, the DC approach computes a local minimizer by solving a sequence of convex quadratic programs.
Fig. 13

Two instances, shown on the left in (a) and (b), adopted from [13, 32] to study the tightness of convex outer relaxations of the image labeling problem. The task is both to inpaint and to label the gray regions. Our smooth non-convex approach constitutes an inner approximation that yields the labeling results shown on the right in (a) and (b), without the need of a separate rounding post-processing step that projects the solution of convex relaxations onto the feasible set of label assignments (parameters: \(\rho =1\), \(|{\mathscr {N}}_{{\mathscr {E}}}(i)|=3\times 3\)). a Inpainting of the regions marked in gray through assignment leads to the result on the right. b Inpainting of the regions marked in gray through assignment leads to the result on the right

In order to apply our present approach instead, we bypass the averaging step (3.13), because labels will most likely differ at adjacent vertices i in our random scenario, and we thus set \(S(W) = L(W)\) with L(W) given by (3.12) based on (4.19). Applying algorithm (3.36) then handles all constraints implicitly through the geometric flow and computes a local minimizer by multiplicative updates, within a small fraction of the runtime that the DC approach would need, and without compromising the quality of the solution (Fig. 12g).

4.6 Image Inpainting

Inpainting denotes the task of filling in a given region, where no image data were observed or the data are known to be corrupted, based on the surrounding region and prior information.

Once the feature metric \(d_{{\mathscr {F}}}\) is fixed, we assign to each pixel in the region to be inpainted, as datum, the uninformative feature vector f which has the same distance \(d_{{\mathscr {F}}}(f,f^{*}_{j})\) to every prior feature vector \(f^{*}_{j} \in {\mathscr {P}}_{{\mathscr {F}}}\). Note that there is no need to explicitly compute this data vector f. It merely represents the rule for evaluating the distance \(d_{{\mathscr {F}}}\) if one of its arguments belongs to a region to be inpainted.
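In an implementation, the uninformative datum need not be constructed explicitly; it suffices to overwrite the distance rows of pixels in the inpainting region with a constant, as in the following sketch (the mask layout and the constant are our own illustrative choices):

```python
import numpy as np

def inpainting_distances(D, mask, value=1.0):
    """Distance matrix for inpainting: rows of D belonging to pixels in the
    region to be inpainted (mask == True) are replaced by a constant value, so
    that the 'uninformative' datum is equidistant to every prior vector and the
    assignment there is driven purely by geometric averaging over neighbors."""
    D = D.copy()
    D[mask] = value
    return D
```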

Figure 13 shows two basic examples that were used by the authors of [13, 32], respectively, to examine numerically the tightness of convex relaxations of the image labeling problem. Unlike convex relaxations, which constitute outer approximations of the combinatorially complex feasible set of assignments, our smooth non-convex approach may be considered an inner approximation that yields results without the need of further rounding, i.e., without a post-processing step for projecting the solution of a convex relaxed problem onto the feasible set.
Fig. 14

Illustration of the influence of using non-uniform weights for geometric averaging (2.8), based on the approximation (5.2). a Image structure where only patch similarity enables the recognition of pixel similarity. b Noisy input image to which the three prior vectors red, green, and blue are assigned. The \(\ell _{1}\) distance between data and prior vectors was used as distance function \(d_{{\mathscr {F}}}\). c Uniform labeling with weights \(w_{j} = \frac{1}{|{\mathscr {N}}_{{\mathscr {E}}}|}\) completely fails to recover the fine image structure (a). d Using non-uniform weights based on the comparison of \(7 \times 7\) patches of the noisy input data (b) considerably enhances the labeling. Errors naturally occur in the center image region and along the diagonals, where patch similarity is not sufficiently supported by other pixel locations. Parameters for (c, d): \(\rho =0.1\), \(|{\mathscr {N}}_{{\mathscr {E}}}|=7 \times 7\)

5 Conclusion and Further Work

We presented a novel approach to image labeling, formulated in a smooth geometric setting. The approach contrasts with established convex and non-convex relaxations of the image labeling problem through smoothness and geometric averaging. The numerics boil down to parallel sparse updates that maximize the objective along an interior path in the feasible set of assignments and finally return a labeling. Although only an elementary first-order approximation of the gradient flow was used, the convergence rate seems competitive. In particular, a large number of labels, as in Sect. 4.4, does not slow down convergence, as is the case for convex relaxations. All aspects specific to an application domain are represented by a single distance matrix D and a single user parameter \(\rho \). This flexibility and the absence of ad hoc tuning parameters whose values do not have an intrinsic meaning should promote applications of the approach to various image labeling problems.

Aspects and open points to be addressed in future work include the following.
  • Numerics Many alternatives exist to the simple algorithm detailed in Sect. 3.3.3. An alternative first-order example is the exponential multiplicative update [11] that results from an explicit Euler discretization of the flow (3.21), rewritten in the form
    $$\begin{aligned} \frac{\hbox {d}}{\hbox {d}t}\log \big (W_{i}(t)\big ) = \nabla _{i} J(W) - \langle W_{i}, \nabla _{i} J(W) \rangle {\mathbbm {1}},\qquad i \in [m]. \end{aligned}$$
    (5.1)
    Of course, higher-order schemes respecting the geometry are conceivable as well. We point out that the inherent smoothness of our problem formulation paves the way for systematic progress.
  • Non-uniform geometric averaging So far, we did not exploit the degrees of freedom offered by the weights \(w_{i},\, i \in [N]\) that define the Riemannian means by the objective (2.8). By doing so, the approximation of these means due to formula (3.33) generalizes in that the geometric mean has to be replaced by the weighted geometric mean
    $$\begin{aligned} {\mathrm {mean}}_{g,w}({\mathscr {P}}) = \prod _{j \in [N]} (p^{j})^{w_{j}},\quad w \in \varDelta _{N-1} \end{aligned}$$
    (5.2)
    that is applied componentwise to the vectors \(p^{j} \in {\mathscr {P}}\). Figure 14 illustrates the influence of these weights \(w_{j}\). They were computed in a preprocessing step for each pixel i within the neighborhood \({\mathscr {N}}_{{\mathscr {E}}}\) as follows: the distance \(d_{p}(p_{i}, p_{j})\), defined as the mean of the \(\ell _{2}\)-distances of the respective color vectors, was computed between the \(7 \times 7\) noisy data patches \(p_{i}, p_{j}\) centered at i and j, respectively, and the normalized weights \(w_{j} = \frac{\tilde{w}_{j}}{\langle {\mathbbm {1}}, \tilde{w} \rangle }\), \(\tilde{w}_{j} = \exp {\big (-d_{p}(p_{i}, p_{j})/\rho \big )}\), were obtained (see the sketch after this list). Turning this data-driven adaptivity of the assignment process through non-uniform weights into a solution-driven adaptivity, by replacing the data f by u(W) due to (3.19), which evolves with W, enables an even more general way of further enhancing the assignment process.
  • Connection to nonlinear diffusion Referring to the discussion of neighborhood filters and nonlinear diffusion in Sect. 1.3, research making these connections explicit is attractive because, apparently, our approach is not covered by existing work.

  • Unsupervised scenarios The nonexistence of a prior data set \({\mathscr {P}}_{{\mathscr {F}}}\) in applications was only briefly addressed in Sect. 4.4. In particular, the emergence of labels along with assignments, and a corresponding generalization of our approach, deserve attention.

  • Learning and updating prior information. This fundamental problem ties in with the preceding point: How can we learn and evolve prior information from many assignments over time?
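The patch-based weights described in the second bullet above can be sketched as follows; they feed directly into the weighted geometric mean (5.2), e.g., via the `weights` argument of the earlier `approx_riemannian_mean` sketch. Names and data layout are illustrative:

```python
import numpy as np

def patch_similarity_weights(patches, i, neighborhood, rho):
    """Non-uniform weights for the geometric averaging (5.2), computed in a
    preprocessing step: w_j is proportional to exp(-d_p(p_i, p_j) / rho), where
    d_p is the mean l2 distance between the noisy data patches centered at i
    and j. `patches` maps a pixel index to its (r, r, channels) patch."""
    p_i = patches[i]
    d = np.array([np.linalg.norm(patches[j] - p_i, axis=-1).mean()
                  for j in neighborhood])
    w = np.exp(-d / rho)
    return w / w.sum()
```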

We hope for a better mathematical understanding of such models and that our work will stimulate further research.
Footnotes
1

For locations i close to the boundary of the image domain where patch supports \({\mathscr {N}}_{p}(i)\) shrink, the definition of the vector \(w^{p}\) has to be adapted accordingly.

 

Acknowledgements

Support by the German Research Foundation (DFG), Grant GRK 1653, is gratefully acknowledged.

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Heidelberg Collaboratory for Image Processing, Heidelberg University, Heidelberg, Germany
  2. Mathematical Imaging Group, Heidelberg University, Heidelberg, Germany
  3. CEREMADE, University Paris-Dauphine, Paris, France
  4. Image and Pattern Analysis Group, Heidelberg University, Heidelberg, Germany
