# Image Labeling by Assignment

DOI: 10.1007/s10851-016-0702-4

Cite this article as: Åström, F., Petra, S., Schmitzer, B. et al. J Math Imaging Vis (2017) 58: 211. doi:10.1007/s10851-016-0702-4


## Abstract

We introduce a novel geometric approach to the image labeling problem. Abstracting from specific labeling applications, a general objective function is defined on a manifold of stochastic matrices, whose elements assign prior data that are given in any metric space, to observed image measurements. The corresponding Riemannian gradient flow entails a set of replicator equations, one for each data point, that are spatially coupled by geometric averaging on the manifold. Starting from uniform assignments at the barycenter as natural initialization, the flow terminates at some global maximum, each of which corresponds to an image labeling that uniquely assigns the prior data. Our geometric variational approach constitutes a smooth non-convex inner approximation of the general image labeling problem, implemented with sparse interior-point numerics in terms of parallel multiplicative updates that converge efficiently.

### Keywords

Image labeling · Assignment manifold · Fisher–Rao metric · Riemannian gradient flow · Replicator equations · Information geometry · Neighborhood filters · Nonlinear diffusion

### Mathematics Subject Classification

62H35 · 65K05 · 68U10 · 62M40

## 1 Introduction

### 1.1 Motivation

*Image Labeling* is a basic problem of variational low-level image analysis. It amounts to determining a *partition* of the image domain by uniquely assigning to each pixel a single element from a finite set of labels. Most applications require such decisions to be made depending on other decisions. This gives rise to a global objective function whose minima correspond to favorable label assignments and partitions. Because the problem of computing globally optimal partitions is NP-hard in general, only *relaxations* of the variational problem define computationally feasible optimization approaches.

*Continuous Models* and relaxations of the image labeling problem were studied, e.g., in [13, 32], including the specific binary case, where only two labels are assigned [14] and the convex relaxation is tight, such that the global optimum can be determined by convex programming. *Discrete models* prevail in the field of computer vision. They lead to polyhedral relaxations of the image partitioning problem that are tighter than those obtained from continuous models after discretization. We refer to [22] for a comprehensive survey and evaluation. Similar to the continuous case, the binary partition problem can be solved efficiently and globally optimally for a subclass of binary discrete models [29].

Relaxations of the variational image labeling problem fall into two categories: *convex and non-convex relaxations*. The dominant *convex approach* is based on the local polytope relaxation, a particular linear programming (LP) relaxation [49]. This has spurred a lot of research on developing specific algorithms for efficiently solving the large problem instances that often occur in applications. We mention [28] as a prominent example and otherwise refer again to [22]. Yet, models with higher connectivity, in terms of objective functions with local potentials defined on larger cliques, are still difficult to solve efficiently. A major issue, which has largely motivated our present work, is the *non-smoothness* of optimization problems resulting from convex relaxation—the price to pay for convexity.

Major classes of *non-convex relaxations* are based on the mean-field approach [39], [47, Section 5] or on approximations of the intractable entropy of the probability distribution whose negative logarithm equals the functional to be minimized [50]. Examples for early applications of relaxations of the former approach include [15, 18]. The basic instance of the latter class of approaches is known as the Bethe approximation. In connection with image labeling, all these approaches amount to *non-convex inner* relaxations of the combinatorially complex set of feasible solutions (the so-called marginal polytope), in contrast to the *convex outer* relaxations in terms of the local polytope discussed above. As a consequence, the non-convex approaches provide a mathematically valid basis for *probabilistic inference* like computing marginal distributions, which in principle enables a more sophisticated data analysis than mere energy minimization or maximum a posteriori inference, to which energy minimization corresponds from a probabilistic viewpoint.

On the other hand, like non-convex optimization problems in general, these relaxations are plagued by the problem of avoiding poor local minima. Although attempts were made to tame this problem by local convexification [16], the class of *convex* relaxation approaches has become dominant in the field, because the ability to solve the relaxed problem for a global optimum is a much better basis for research on algorithms and also results in more reliable software for users and applications.

**Smoothness versus Non-Smoothness** Regarding convex approaches and the development of efficient algorithms, a major obstacle stems from the inherent non-smoothness of the corresponding optimization problems. This issue becomes particularly visible in connection with decompositions of the optimization task into simpler problems by dropping complicating constraints, at the cost of a non-smooth dual master problem where these constraints have to be enforced. Advanced bundle methods [23] then seem to be among the most efficient methods. Yet, how to make rapid progress in a systematic way does not seem obvious. On the other hand, since the early days of linear programming, e.g., [4, 5], it has been known that endowing the feasible set with a proper *smooth* geometry enables efficient numerics. Yet, such *interior-point* methods [38] are considered not applicable to large-scale problems of variational image analysis, due to dense numerical linear algebra steps that are both too expensive and too memory intensive. In view of these aspects, **our approach** may be seen as a *smooth geometric approach* to image labeling based on *first-order, sparse* numerical operations.

**Local versus Global Optimality** Global optimality distinguishes convex approaches from other ones and is the major argument in their favor. Yet, having computed a global optimum of the relaxed problem, it has to be projected to the feasible set of combinatorial solutions (labelings) in a post-processing step. While the inherent suboptimality of this step can be bounded [31], and although progress has been made toward recovering the true combinatorial optimum at least partially [46], it is clear that the benefit of global optimality of convex optimization has to be put into perspective when it constitutes a relaxation of an intractable optimization problem. Turning to non-convex problems, on the other hand, raises two well-known issues: local optimality of solutions instead of global optimality, and susceptibility to initialization. In view of these aspects, **our approach** enjoys the following properties. While being non-convex, there is only a *single natural* initialization, which makes a search for a good initialization obsolete. Furthermore, the approach returns a *global* optimum (out of many), which corresponds to an image labeling (combinatorial solution) without the need for further post-processing. Clearly, the latter property is typical for concave minimization formulations of combinatorial optimization problems [19], where solutions of the latter problem are enforced by weighting the concave penalty sufficiently strongly. Yet, in such cases, and in particular when working in high dimensions as in image analysis, the problem of determining good initializations and of carefully designing the numerics (search direction, step-size selection, etc.) persists, in order to ensure convergence and a reasonable convergence rate.

### 1.2 Approach: Overview

Figure 1 illustrates our setup and the approach. We distinguish the feature space \({\mathscr {F}}\), which models all application-specific aspects, and the assignment manifold \({\mathscr {W}}\), used for modeling the image labeling problem and for computing a solution. This distinction avoids mixing up physical dimensions, specific data formats, etc., with the representation of the inference problem. It ensures broad applicability to any application domain that can be equipped with a metric which properly reflects data similarity. It also enables normalizing the representation used for inference, so as to remove any bias toward a solution *not* induced by the data at hand.

We consider *image labeling* as the task of assigning to the image data an arbitrary prior data set \({\mathscr {P}}_{{\mathscr {F}}}\), provided the distance of its elements to any given data element can be measured by a distance function \(d_{{\mathscr {F}}}\), which the user has to supply. Basic examples for the elements of \({\mathscr {P}}_{{\mathscr {F}}}\) include prototypical feature vectors, patches, etc. Collecting all pairwise distance data into a distance matrix *D*, which for extremely large problem sizes could be computed on the fly, provides the input data of the inference problem.

The mapping \(\exp _{W}\) lifts the distance matrix to the assignment manifold \({\mathscr {W}}\). The resulting likelihood matrix *L* constitutes a normalized version of the distance matrix *D* that reflects the initial feature space geometry as given by the distance function \(d_{{\mathscr {F}}}\). Each point on \({\mathscr {W}}\), like the matrices *L*, *S*, and *W*, is a *stochastic matrix* with strictly positive entries, that is, with row vectors that are discrete probability distributions having full support. Each such row vector indexed by *i* represents the *assignment* of prior elements of \({\mathscr {P}}_{{\mathscr {F}}}\) to the given datum at location *i*, in other words the *labeling* of datum *i*. We equip the set of all such matrices with the geometry induced by the Fisher–Rao metric and call it the *assignment manifold*.

The inference task (image labeling) is accomplished by *geometric averaging* in terms of Riemannian means of assignment vectors over spatial neighborhoods. This step transforms the likelihood matrix *L* into the similarity matrix *S*. It also induces a dependency of labeling decisions on each other, akin to the prior (regularization) terms of the established variational approaches to image labeling, as discussed in the preceding section. These dependencies are resolved by maximizing the correlation (inner product) between the assignment in terms of the matrix *W* and the similarity matrix *S*, where the latter matrix is induced by *W* as well. The Riemannian gradient flow of the corresponding objective function *J*(*W*), which is highly nonlinear but smooth, evolves *W*(*t*) on the manifold \({\mathscr {W}}\) until a fixed point is reached, which terminates the loop on the right-hand side of Fig. 1. The resulting fixed point corresponds to an *image labeling* which *uniquely* assigns to each datum a prior element of \({\mathscr {P}}_{{\mathscr {F}}}\).

Adopting a probabilistic Bayesian viewpoint, this fixed-point iteration may be viewed as maximum a posteriori inference carried out in a geometric setting with multiplicative, sparse, and highly parallel numerical operations.

### 1.3 Further Related Work

**Neighborhood Filters.** A large class of approaches to *denoising* of given image data *f* are defined in terms of neighborhood filters that iteratively perform operations of the form
$$\begin{aligned} u_{i}^{(k+1)} = \sum _{j} \frac{K\left( x_{i}, x_{j}, u_{i}^{(k)}, u_{j}^{(k)}\right) }{\sum _{l} K\left( x_{i}, x_{l}, u_{i}^{(k)}, u_{l}^{(k)}\right) } u_{j}^{(k)},\quad u^{(0)}=f,\quad \forall i, \end{aligned}$$
(1.1)
where *K* is a nonnegative kernel function that is symmetric with respect to the two indexed locations (e.g., *i*, *j* in the numerator) and may depend on both the spatial distance \(\Vert x_{i}-x_{j}\Vert \) and the values \(|u_{i}-u_{j}|\) of pairs of pixels. Maybe the most prominent example is the non-local means filter [9], where *K* depends on the distance of *patches* centered at *i* and *j*, respectively. We refer to [35] for a recent survey. Noting that (1.1) is a linear operation with a row-normalized nonnegative (i.e., stochastic) matrix, a similar situation would be
$$\begin{aligned} u_{i} = \sum _{j} L_{ij}(W) u_{j}, \end{aligned}$$
(1.2)
with the likelihood matrix from Fig. 1, if we were to replace the prior data \({\mathscr {P}}_{{\mathscr {F}}}\) with the given image data *f* itself and adopt a distance function \(d_{{\mathscr {F}}}\) that mimics the kernel function *K* of (1.1). In our approach, however, the likelihood matrix along with its nonlinear geometric transformation, the similarity matrix *S*(*W*), evolves along with the evolution of the assignment matrix *W*, so as to determine a labeling with *unique* assignments at each pixel *i*, rather than the convex combinations required for denoising. Furthermore, the prior data set \({\mathscr {P}}_{{\mathscr {F}}}\) that is assigned in our case may be very different from the given image data and, accordingly, the assignment matrix may have any rectangular shape rather than being a square \(m \times m\) matrix.
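As a concrete illustration of (1.1), here is a minimal sketch of the iteration with a Gaussian kernel in both space and value; the kernel widths `sigma_x`, `sigma_u` and the test signal are illustrative choices, not taken from the paper:

```python
import numpy as np

def neighborhood_filter_step(x, u, sigma_x=0.1, sigma_u=0.5):
    """One iteration of (1.1): averaging u with a row-normalized kernel matrix."""
    # Symmetric, nonnegative kernel depending on spatial and tonal distances.
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma_x ** 2)
               - (u[:, None] - u[None, :]) ** 2 / (2 * sigma_u ** 2))
    P = K / K.sum(axis=1, keepdims=True)   # row-stochastic matrix
    return P @ u                           # convex combinations of pixel values

x = np.linspace(0.0, 1.0, 50)              # pixel locations
rng = np.random.default_rng(0)
f = np.where(x < 0.5, 0.0, 1.0) + 0.05 * rng.normal(size=x.size)
u = f.copy()
for _ in range(5):
    u = neighborhood_filter_step(x, u)
```

Since each output pixel is a convex combination of the current values, the iterates stay within the range of the input, in line with the interpretation of (1.1) as multiplication by a stochastic matrix.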
Conceptually, we are concerned with *decision making* (labeling, partitioning, unique assignments) rather than with mapping one image to another one. Whenever the prior data \({\mathscr {P}}_{{\mathscr {F}}}\) comprise a finite set of *prototypical* image values or patches, such that a mapping of the form
$$\begin{aligned} u_{i} = \sum _{j} W_{ij} f_{j}^{*},\qquad f_{j}^{*} \in {\mathscr {P}}_{{\mathscr {F}}},\qquad \forall i, \end{aligned}$$
(1.3)
is well defined, this does result in a transformed image *u* after a fixed point of the evolution of *W* has been reached. This result should not be considered as a denoised image, however. Rather, it merely illustrates the interpretation of the given data *f* in terms of the prior data \({\mathscr {P}}_{{\mathscr {F}}}\) and a corresponding optimal assignment.

**Nonlinear Diffusion.** Neighborhood filters are closely related to iterative algorithms for numerically solving discretized diffusion equations. Just think of the basic 5-point stencil of the discrete Laplacian, the iterative averaging of nearest-neighbor differences, and the large class of adaptive generalizations in terms of nonlinear diffusion filters [48]. More recent work directly addressing this connection includes [10, 36, 44]. The author of [36], for instance, advocates the approximation of the matrix of (1.1) by a *symmetric* (hence, doubly stochastic) *positive-definite* matrix, in order to enable interpretations of the denoising operation in terms of the spectral decomposition of the assignment matrix, and to make the connection to diffusion mappings on graphs. The connection to our work is implicitly given by the discussion of the previous point, the relation of our approach to neighborhood filters.

Roughly speaking, the application of our approach in the *specific* case of assigning image data to image data may be seen as some kind of nonlinear diffusion that results in an image whose degrees of freedom are given by the cardinality of the prior set \({\mathscr {P}}_{{\mathscr {F}}}\). We plan to explore the exact nature of this connection in more detail in future work.

**Replicator Dynamics.** Replicator dynamics and the corresponding equations are well known [17]. They play a major role in models of various disciplines, including theoretical biology and applications of game theory to economics. In the field of image analysis, such models have been promoted by Pelillo and co-workers, mainly to efficiently determine, by continuous optimization techniques, good local optima of intractable problems, like matchings through maximum-clique search in an association graph [42]. Although the corresponding objective functions are merely quadratic, the analysis of the corresponding equations is rather involved [8]. Accordingly, clever heuristics have been suggested to tame the related problems of non-convex optimization [7]. Regarding our approach, we aim to get rid of these issues (see the discussion of "Local versus Global Optimality" in Sect. 1.1) through three ingredients: (1) a unique natural initialization, (2) spatial averaging that removes spurious local effects of noisy data, and (3) adopting the Riemannian geometry which determines the structure of the replicator equations, for both geometric spatial averaging and numerical optimization.
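For a quadratic consistency measure \(w^{\top } A w\), the replicator dynamics reduce to a simple multiplicative update that preserves the simplex. A minimal sketch with an explicit Euler discretization and an illustrative symmetric payoff matrix `A` (both are assumptions for illustration, not from the paper):

```python
import numpy as np

def replicator_step(w, A, dt=0.1):
    """Discrete-time replicator update for payoff matrix A:
    w_i <- w_i * (1 + dt * (f_i - <w, f>)), with fitness f = A w.
    The update is multiplicative and leaves the simplex invariant."""
    f = A @ w
    return w * (1.0 + dt * (f - w @ f))

# Illustrative symmetric payoff matrix (largest diagonal entry: strategy 1).
A = np.array([[1.0, 0.2, 0.2],
              [0.2, 1.5, 0.2],
              [0.2, 0.2, 1.2]])
w = np.full(3, 1.0 / 3.0)   # barycenter: the natural uninformative start
for _ in range(500):
    w = replicator_step(w, A)
```

Starting from the barycenter, the iterates stay strictly positive and concentrate on the strategy with the highest payoff, illustrating how replicator dynamics drive a distribution toward a vertex of the simplex.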

**Relaxation Labeling.** The task of labeling primitives in images was formulated as a problem of contextual decision making already 40 years ago [20, 43]. Originally, update rules were merely formulated in order to find mutually consistent individual label assignments. Subsequent research related these labeling rules to optimization tasks. We refer to [41] for a concise account of the literature and for putting the approach on mathematically solid ground. Specifically, the so-called Baum–Eagon theorem was applied in order to show that updates increase the mutual consistency of label assignments. Applications include pairwise clustering [40], which boils down to determining a local optimum by continuous optimization of a non-convex quadratic form, similar to the optimization tasks considered in [8, 42]. We attribute the fact that these approaches have not been widely applied to the problems of non-convex optimization discussed above. The measure of mutual consistency of our approach is non-quadratic, and the Baum–Eagon theorem about polynomial growth transforms does not apply. Increasing consistency follows instead from the Riemannian gradient flow that governs the evolution of label assignments. Regarding the non-convexity from the viewpoint of optimization, we believe that the setup of our approach displayed by Fig. 1 significantly alleviates these problems, in particular through the geometric averaging of assignments that emanates from a natural initialization.

### 1.4 Organization

Section 2 summarizes the geometry of the probability simplex in order to define the assignment manifold, which is the basis of our variational approach. The approach is presented in Sect. 3 by revisiting the discussion of Fig. 1, together with the mathematical details. Finally, several numerical experiments are reported in Sect. 4. They are academic, yet non-trivial, and intended to illustrate properties of the approach claimed in the preceding sections. Specific applications of image labeling are not within the scope of this paper. We conclude and indicate further directions of research in Sect. 5.

Major symbols and the basic notation used in this paper are listed in "Appendix 1." In order not to disrupt the flow of reading and reasoning, proofs and technical details, all of which are elementary but essentially complement the presentation and make this paper self-contained, are collected in "Appendix 2."

## 2 The Assignment Manifold

In this section, we define the feasible set for representing and computing image labelings in terms of assignment matrices \(W \in {\mathscr {W}}\): the assignment manifold \({\mathscr {W}}\). The basic building block is the open probability simplex \({\mathscr {S}}\) equipped with the Fisher–Rao metric. We collect below and in "Proofs of Section 2" of Appendix 2 the corresponding definitions and properties.

For background reading and much more details on information and Riemannian geometry, we refer to [1, 21].

### 2.1 Geometry of the Probability Simplex

### Definition 1

(*Sphere Map*) We call the diffeomorphism
$$\begin{aligned} \psi :{\mathscr {S}} \rightarrow {\mathscr {N}} := \psi ({\mathscr {S}}),\qquad \psi (p) = 2 \sqrt{p}, \end{aligned}$$
(2.3)
which maps the open probability simplex onto an open subset of the sphere of radius 2, the *sphere map* (see Fig. 2).

The sphere map enables us to compute the geometry of \({\mathscr {S}}\) from the geometry of the 2-sphere.

### Lemma 1

The sphere map \(\psi \) (2.3) is an isometry, i.e., the Riemannian metric is preserved. Consequently, lengths of tangent vectors and curves are preserved as well.

### Proof

See “Proofs of Section 2” in Appendix 2. \(\square \)

In particular, geodesics as critical points of length functionals are mapped by \(\psi \) to geodesics. As a consequence, we have
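A quick numerical sanity check of the isometry, assuming the sphere map has the standard form \(\psi (p) = 2\sqrt{p}\) (consistent with the constant sectional curvature \(\kappa = 1/4\) invoked in the proof of Lemma 3 below): the pushforward \(d\psi _{p}(v) = v/\sqrt{p}\) turns the Fisher–Rao inner product \(\langle u, v \rangle _{p} = \sum _{i} u_{i} v_{i} / p_{i}\) into the Euclidean one.

```python
import numpy as np

def sphere_map(p):
    """psi(p) = 2*sqrt(p): maps the simplex onto part of the sphere of radius 2."""
    return 2.0 * np.sqrt(p)

def d_sphere_map(p, v):
    """Differential of psi at p applied to a tangent vector v: v / sqrt(p)."""
    return v / np.sqrt(p)

def fisher_rao(p, u, v):
    """Fisher-Rao inner product on the tangent space at p."""
    return float(np.sum(u * v / p))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(4))            # point on the open simplex
u = rng.normal(size=4); u -= u.mean()    # tangent vectors: entries sum to zero
v = rng.normal(size=4); v -= v.mean()
```

The checks below confirm that \(\psi (p)\) lies on the radius-2 sphere, that the pushforward is tangent to it, and that inner products are preserved.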

### Lemma 2

The Riemannian (geodesic) distance on \({\mathscr {S}}\) induced by the Fisher–Rao metric is given by
$$\begin{aligned} d_{{\mathscr {S}}}(p,q) = 2 \arccos \Big ( \sum _{i} \sqrt{p_{i} q_{i}} \Big ). \end{aligned}$$
(2.4)

The objective function for computing Riemannian means (geometric averaging; see Definition 2 and Eq. (2.8) below) is based on the distance (2.4). Figure 3 visualizes corresponding geodesics and level sets on \({\mathscr {S}}_{3}\) that differ for discrete distributions \(p \in {\mathscr {S}}_{3}\) close to the barycenter and for low-entropy distributions close to the vertices. See also the caption of Fig. 3.

It is well known from the literature (e.g., [3, 30]) that geometries may considerably change in higher dimensions. Figure 4 displays the Riemannian distances of points on curves that connect the barycenter and vertices on \(\overline{{\mathscr {S}}}_{n}\) (to which the distance (2.4) extends), depending on the dimension *n*. The normalizing effect on geometric averaging, further discussed in the caption, increases with *n* and is relevant to image labeling, where large values of *n* may occur in applications.

### Proposition 1

(*Riemannian Gradient*) The Riemannian gradient of a smooth function *f* at \(p \in {\mathscr {S}}\) is given by
$$\begin{aligned} \nabla _{{\mathscr {S}}} f(p) = p \big ( \nabla f(p) - \langle p, \nabla f(p) \rangle {\mathbbm {1}} \big ), \end{aligned}$$
(2.5)
where \(\nabla f(p)\) denotes the Euclidean gradient of *f* at *p* and multiplication by *p* is understood componentwise.

### Proof

See “Proofs of Section 2” in Appendix 2. \(\square \)
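The gradient formula can be verified numerically. Assuming it takes the form \(\nabla _{{\mathscr {S}}} f(p) = p(\nabla f(p) - \langle p, \nabla f(p)\rangle {\mathbbm {1}})\) (the standard Fisher–Rao gradient on the simplex), the result is tangent to \({\mathscr {S}}\) (its entries sum to zero), and its Fisher–Rao inner product with any tangent vector reproduces the Euclidean directional derivative:

```python
import numpy as np

def riemannian_grad(p, grad_f):
    """Fisher-Rao (simplex) gradient: p * (grad_f - <p, grad_f> * 1)."""
    return p * (grad_f - np.dot(p, grad_f))

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(5))            # point on the open simplex
g = rng.normal(size=5)                   # Euclidean gradient of some f at p
v = rng.normal(size=5); v -= v.mean()    # arbitrary tangent vector
rg = riemannian_grad(p, g)
```

Both defining properties of the Riemannian gradient hold up to machine precision.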

The exponential map associated with the open probability simplex \({\mathscr {S}}\) is detailed next.

### Proposition 2

(*Geodesics, Exponential Map*) The exponential map of \({\mathscr {S}}\) is given by evaluating the geodesic at \(t=1\),
\({{\mathrm{Exp}}}_{p}(v) = \gamma _{v}(t)\big |_{t=1}\), with \(v_{p} = v/\sqrt{p}\), \(p = \gamma _{v}(0)\), \(\dot{\gamma }_{v}(0)=v\), and with the explicit expression for \(\gamma _{v}(t)\) and its domain \(V_{p} \subset T_{p}{\mathscr {S}}\) given by (2.7).

### Proof

See “Proofs of Section 2” of Appendix 2. \(\square \)

### Remark 1

Checking the inclusion \(v \in V_{p}\) due to (2.7d), for a given tangent vector \(v \in T_{p}{\mathscr {S}}\), is inconvenient for applications. Therefore, the mapping \(\exp \) is defined below by Eq. (3.8a) which approximates the exponential mapping \({{\mathrm{Exp}}}\), with the feasible set \(V_{p}\) replaced by the entire space \(T_{p}{\mathscr {S}}\) (Lemma 3).

Accordingly, geometric averaging as defined next (Sect. 2.2) based on \({{\mathrm{Exp}}}\) can be approximated as well using the mapping \(\exp \). This is discussed in Sect. 3.3.2.

### 2.2 Riemannian Means

The *Riemannian center of mass* is commonly called *Karcher mean* or *Fréchet mean* in the more recent literature, in particular outside the field of mathematics. We prefer—cf. [26]—the former notion and use the shorter term *Riemannian mean*.

### Definition 2

(*Riemannian Mean, Geometric Averaging*) The *Riemannian mean* \(\overline{p} = {{\mathrm{mean}}}_{w}({\mathscr {P}})\) of a set of points \({\mathscr {P}} = \{p^{i}\}_{i \in [N]} \subset {\mathscr {S}}\) with corresponding weights \(w \in \varDelta _{N-1}\) minimizes the objective function
$$\begin{aligned} \frac{1}{2} \sum _{i \in [N]} w_{i}\, d_{{\mathscr {S}}}^{2}(p, p^{i}). \end{aligned}$$
(2.8)
We drop the subscript *w* in the case of uniform weights \(w = \frac{1}{N} {\mathbbm {1}}_{N}\).

### Lemma 3

The Riemannian mean (2.10) defined as minimizer of (2.8) is unique for any data \({\mathscr {P}} = \{p^{i}\}_{i \in [N]} \subset {\mathscr {S}}\) and weights \(w \in \varDelta _{N-1}\).

### Proof

Using the isometry \(\psi \) given by (2.3), we may consider the scenario transferred to the spherical domain depicted in Fig. 2. Due to [25, Thm. 1.2], the objective (2.8) is convex along geodesics and has a unique minimizer within any geodesic ball \({\mathbb {B}}_{r}\) with diameter upper bounded by \(2 r \le \frac{\pi }{2 \sqrt{\kappa }}\), where \(\kappa \) upper bounds the sectional curvatures in \({\mathbb {B}}_{r}\). For the sphere \({\mathscr {N}}\) of radius 2, we have constant \(\kappa = 1/4\), and hence the inequality is satisfied for the domain \(\psi ({\mathscr {S}}) \subset {\mathscr {N}}\), which has geodesic diameter \(\pi \). \(\square \)

We call the computation of Riemannian means *geometric averaging*. The implementation of this iterative operation and its efficient approximation by a closed-form expression are addressed in Sect. 3.3.
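A minimal sketch of this iterative operation, under the assumption that the computation is carried to the unit sphere via \(p \mapsto \sqrt{p}\) (the sphere map up to scale), where Karcher's fixed-point iteration with the spherical Exp/Log maps applies; since rescaling the metric rescales all distances uniformly, the minimizer on the simplex is unaffected:

```python
import numpy as np

def sphere_log(q, s):
    """Inverse exponential map on the unit sphere at q."""
    c = np.clip(q @ s, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(q)
    return theta / np.sin(theta) * (s - c * q)

def sphere_exp(q, v):
    """Exponential map on the unit sphere at q."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return q
    return np.cos(t) * q + np.sin(t) * (v / t)

def riemannian_mean(points, weights, iters=100):
    """Karcher fixed-point iteration for simplex points, via p -> sqrt(p)."""
    S = np.sqrt(points)                   # rows lie on the unit sphere
    q = S[0].copy()
    for _ in range(iters):
        v = sum(wi * sphere_log(q, si) for wi, si in zip(weights, S))
        q = sphere_exp(q, v)
        q /= np.linalg.norm(q)            # guard against numerical drift
    return q ** 2                         # map back to the simplex

P = np.array([[0.7, 0.2, 0.1],           # three points: cyclic permutations
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])
w = np.full(3, 1.0 / 3.0)
pbar = riemannian_mean(P, w)
```

Because the three points are cyclic permutations of each other and the mean is unique (Lemma 3), the result must be invariant under cyclic permutation, i.e., the barycenter.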

### 2.3 Assignment Matrices and Manifold

A natural question is how to extend the geometry of \({\mathscr {S}}\) to stochastic matrices \(W \in \mathbb {R}^{m \times n}\) with \(W_{i} \in {\mathscr {S}},\, i \in [m]\), so as to preserve the information-theoretic properties induced by this metric (that we do not discuss here—cf. [1, 12]).

This problem was recently studied by [37]. The authors suggested three natural definitions of manifolds. It turned out that all of them are slight variations of taking the product of \({\mathscr {S}}\), differing only by the scaling of the resulting product metric. As a consequence, we make the following

### Definition 3

(*Assignment Manifold*) The manifold of assignment matrices, called *assignment manifold*, is the set
$$\begin{aligned} {\mathscr {W}} = \big \{ W \in \mathbb {R}^{m \times n} :W_{i} \in {\mathscr {S}},\; i \in [m] \big \}, \end{aligned}$$
i.e., the *m*-fold product of the manifold \({\mathscr {S}}\).

Note that \(V \in T_{W}{\mathscr {W}}\) means \(V_{i} \in T_{W_{i}} {\mathscr {S}},\, i \in [m]\).

### Remark 2

We call stochastic matrices contained in \({\mathscr {W}}\) *assignment matrices*, due to their role in the variational approach (Sect. 3).

## 3 Variational Approach

We introduce in this section the basic components of the variational approach and the corresponding optimization task, as illustrated in Fig. 1.

### 3.1 Basic Components

#### 3.1.1 Features, Distance Function, Assignment Task

Let
$$\begin{aligned} f_{i} \in {\mathscr {F}},\qquad i \in [m], \end{aligned}$$
(3.1)
denote the given image data, where each \(f_{i}\) is called a *feature*. At this point, we do not make any assumption about the *feature space* \({\mathscr {F}}\) except that a *distance function* \(d_{{\mathscr {F}}} :{\mathscr {F}} \times {\mathscr {F}} \rightarrow \mathbb {R}\) is specified. Furthermore, let
$$\begin{aligned} {\mathscr {P}}_{{\mathscr {F}}} = \{f^{*}_{j}\}_{j \in [n]} \subset {\mathscr {F}} \end{aligned}$$
(3.3)
be a given *prior set*. We are interested in the assignment of the prior set to the data in terms of an *assignment matrix* \(W \in {\mathscr {W}}\), where each entry \(W_{ij}\) is interpreted as the *posterior probability* that \(f^{*}_{j}\) generated the observation \(f_{i}\).

The *assignment task* asks for determining an optimal assignment \(W^{*}\), considered as “explanation” of the data based on the prior data \({\mathscr {P}}_{{\mathscr {F}}}\). We discuss next the ingredients of the objective function that will be used to solve assignment tasks.

#### 3.1.2 Distance Matrix

Given the data and the prior set, we compute the *distance matrix*
$$\begin{aligned} D \in \mathbb {R}^{m \times n},\qquad D_{ij} = \frac{1}{\rho }\, d_{{\mathscr {F}}}(f_{i}, f^{*}_{j}),\qquad \rho > 0, \end{aligned}$$
(3.6)
where \(\rho \) is one of the *user parameters* to be set. This parameter serves two purposes. It accounts for the unknown scale of the data *f* that depends on the application and hence cannot be known beforehand. Furthermore, its value determines what subset of the prior features \(f^{*}_{j},\, j \in [n]\), effectively affects the process of determining the assignment matrix *W*. This becomes explicit through the definition of the next processing stage, given by Eq. (3.12) below, that uses *D* as input. We call \(\rho \) the *selectivity parameter*.

The assignment matrix *W* itself is determined by the evolution *W*(*t*) that is introduced and discussed below in Sect. 3.2.3.

Note that *W* is initialized with the uninformative *uniform assignment* that is not biased toward a solution in any way.

#### 3.1.3 Likelihood Matrix

The next processing step is based on the following

### Definition 4

(*Lifting Map* (**Manifolds** \({\mathscr {S}}, {\mathscr {W}}\))) The lifting mapping is defined by
$$\begin{aligned} \exp :{\mathscr {S}} \times T{\mathscr {S}} \rightarrow {\mathscr {S}},\qquad \exp _{p}(u) = \frac{p e^{u}}{\langle p, e^{u} \rangle }, \end{aligned}$$
(3.8a)
and, applied row-wise, by
$$\begin{aligned} \exp :{\mathscr {W}} \times T{\mathscr {W}} \rightarrow {\mathscr {W}},\qquad \exp _{W}(U) = \big ( \exp _{W_{1}}(U_{1}), \dotsc , \exp _{W_{m}}(U_{m}) \big ), \end{aligned}$$
(3.8b)
for vectors *u* and matrices *U*, *W*, where the argument decides which of the two mappings \(\exp \) applies.

### Remark 3

After replacing the arbitrary point \(p \in {\mathscr {S}}\) by the barycenter \(\frac{1}{n} {\mathbbm {1}}_{n}\), readers will recognize the *softmax function* in (3.8a), i.e., \(\langle \frac{1}{n} {\mathbbm {1}}_{n}, e^{u} \rangle ^{-1} \big (\frac{1}{n} {\mathbbm {1}}_{n} e^{u}\big ) = \frac{e^{u}}{\langle {\mathbbm {1}}, e^{u} \rangle }\). This function is widely used in various application fields of applied statistics (e.g., [45]), ranging from parametrizations of distributions, e.g., for logistic classification [6], to other problems of modeling [34] not related to our approach.

The lifting mapping generalizes the softmax function through the dependency on the base point *p*. In addition, it approximates geodesics and accordingly the exponential mapping \({{\mathrm{Exp}}}\), as stated next. We therefore use the symbol \(\exp \) as mnemonic. Unlike \({{\mathrm{Exp}}}_{p}\), the mapping \(\exp _{p}\) is defined on the entire tangent space, cf. Remark 1.

### Proposition 3

### Proof

See “Proofs of Section 3 and Further Details” of Appendix 2. \(\square \)

Figure 5 illustrates the approximation of geodesics \(\gamma _{v}\) and the exponential mapping \({{\mathrm{Exp}}}_{p}\), respectively, by the lifting mapping \(\exp _{p}\).

### Remark 4

Note that adding any constant vector \(c {\mathbbm {1}},\, c \in \mathbb {R}\) to a vector *u* does not change \(\exp _{p}(u)\): \(\frac{p e^{u+c {\mathbbm {1}}}}{\langle p, e^{u+c {\mathbbm {1}}} \rangle } = \frac{p (e^{c}{\mathbbm {1}}) e^{u}}{\langle p, (e^{c}{\mathbbm {1}}) e^{u} \rangle } = \frac{p e^{u}}{\langle p, e^{u} \rangle } = \exp _{p}(u)\). Accordingly, the same vector *v* is generated by (3.9). While the definition (3.8a) removes this ambiguity, there is no need to remove the mean of the vector *u* in numerical computations.
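Since the expression for \(\exp _{p}(u)\) is given explicitly in Remark 3, the lifting map is straightforward to implement. The sketch below also exploits the shift invariance of Remark 4 for numerical stability, subtracting the maximum entry of *u* before exponentiating:

```python
import numpy as np

def lift(p, u):
    """exp_p(u) = p * e^u / <p, e^u>  (3.8a); by Remark 4, shifting u by
    a constant vector c*1 leaves the result unchanged, which we exploit
    for numerical stability."""
    e = np.exp(u - u.max())
    return p * e / np.dot(p, e)

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(4))   # base point on the simplex
u = rng.normal(size=4)
q = lift(p, u)
```

At the barycenter \(p = \frac{1}{n}{\mathbbm {1}}_{n}\), the lifting map reduces to the softmax function, as noted in Remark 3.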

Given the matrices *D* and *W* as described in Sect. 3.1.2, we lift the matrix *D* to the manifold \({\mathscr {W}}\) by
$$\begin{aligned} L = L(W) := \exp _{W}(-D). \end{aligned}$$
(3.12)
We call *L* the *likelihood matrix*, because the row vectors are discrete probability distributions which separately represent the similarity of each observation \(f_{i}\) to the prior data \({\mathscr {P}}_{{\mathscr {F}}}\), as measured by the distance \(d_{{\mathscr {F}}}\) in (3.6).

Note that the operation (3.12) depends on the assignment matrix \(W \in {\mathscr {W}}\).

#### 3.1.4 Similarity Matrix

Based on the likelihood matrix *L*, we define the *similarity matrix*
$$\begin{aligned} S = S(W),\qquad S_{i}(W) := {{\mathrm{mean}}}\big ( \{ L_{j}(W) \}_{j \in \tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i)} \big ),\qquad i \in [m], \end{aligned}$$
(3.13)
in terms of Riemannian means (Definition 2) of the likelihood vectors over local spatial neighborhoods \(\tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i)\). Each row vector of *S* represents the similarity of the data within a local spatial neighborhood to the prior data \({\mathscr {P}}_{{\mathscr {F}}}\).

Note that *S* depends on *W* because *L* does so by (3.12). The *size* of the neighborhoods, \(|\tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i)|\), is the *second* user parameter, besides the selectivity parameter \(\rho \) for scaling the distance matrix (3.6). Typically, each \(\tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i)\) indexes the same local "window" around pixel location *i*. We then call the window size \(|\tilde{{\mathscr {N}}}_{{\mathscr {E}}}(i)|\) the *scale parameter*.
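As an illustration of how such a similarity matrix can be formed in closed form, the sketch below replaces the Riemannian mean by the entrywise geometric mean of likelihood rows, renormalized to the simplex. This surrogate is an assumption chosen for illustration, not necessarily the approximation developed in Sect. 3.3:

```python
import numpy as np

def similarity_from_likelihood(L, neighborhoods):
    """Entrywise geometric mean of the likelihood rows over each neighborhood,
    renormalized to the simplex: an illustrative surrogate for the Riemannian
    mean in (3.13)."""
    S = np.empty_like(L)
    for i, nbrs in enumerate(neighborhoods):
        g = np.exp(np.log(L[nbrs]).mean(axis=0))   # geometric mean of rows
        S[i] = g / g.sum()                         # project back to the simplex
    return S

rng = np.random.default_rng(4)
m, n = 6, 3
L = rng.dirichlet(np.ones(n), size=m)              # rows: likelihood vectors
nbhd = [[max(i - 1, 0), i, min(i + 1, m - 1)] for i in range(m)]
S = similarity_from_likelihood(L, nbhd)
```

Like the Riemannian mean, this surrogate returns rows on the simplex and leaves a set of identical rows unchanged.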

### Remark 5

In basic applications, the distance matrix *D* will not change once the features and the feature distance \(d_{{\mathscr {F}}}\) are determined. On the other hand, the likelihood matrix *L*(*W*) and the similarity matrix *S*(*W*) have to be recomputed as the assignment *W* evolves, as part of any numerical algorithm used to compute an optimal assignment \(W^{*}\).

We point out, however, that more general scenarios are conceivable (without essentially changing the overall approach) where \(D = D(W)\) depends on the assignment as well and hence has to be updated, too, as part of the optimization process. Section 4.5 provides an example.

### 3.2 Objective Function, Optimal Assignment

We specify next the objective function as criterion for assignments and the gradient flow on the assignment manifold, to compute an optimal assignment \(W^{*}\). Finally, based on \(W^{*}\), the so-called assignment mapping is defined.

#### 3.2.1 Objective Function

Taking into account the interpretation of the entries \(W_{ij}\) of assignment matrices as *posterior probabilities*, the *objective function* to be maximized is
$$\begin{aligned} \max _{W \in {\mathscr {W}}} J(W),\qquad J(W) := \langle S(W), W \rangle . \end{aligned}$$
(3.16)
The function *J* together with the feasible set \({\mathscr {W}}\) formalizes the following objectives:

1. Assignments *W* should *maximally correlate* with the feature-induced similarities \(S = S(W)\), as measured by the inner product which defines the objective function *J*(*W*).
2. Assignments of prior data to observations should be done in a *spatially coherent* way. This is accomplished by *geometric averaging* of likelihood vectors over local spatial neighborhoods, which turns the likelihood matrix *L*(*W*) into the similarity matrix *S*(*W*), *depending* on *W*.
3. Maximizers \(W^{*}\) should define *image labelings* in terms of rows \(\overline{W}_{i}^{*} = e^{k_{i}} \in \{0,1\}^{n},\; i \in [m],\, k_{i} \in [n]\), that are indicator vectors. While the latter matrices are not contained in the assignment manifold \({\mathscr {W}}\) as feasible set, we compute in practice assignments \(W^{*} \approx \overline{W}^{*}\) arbitrarily close to such points. It will turn out below that the *geometry enforces* this approximation. As a consequence, in view of (3.15), such points \(W^{*}\) *maximize posterior probabilities*, akin to the interpretation of MAP inference with discrete graphical models by minimizing corresponding energy functionals. As discussed in Sect. 1, however, the mathematical structure of the optimization task of our approach and the way of fusing data and prior information are quite different.

### Lemma 4

### Proof

See “Proofs of Section 3 and Further Details” of Appendix 2. \(\square \)

#### 3.2.2 Assignment Mapping

We *synthesize* the approximation to the given data *f* in terms of an assignment \(W^{*}\) that optimizes (3.16) and the prior data \({\mathscr {P}}_{{\mathscr {F}}}\). We denote the corresponding approximation by \(u(W^{*})\) and call it the *assignment mapping*.

A less trivial example is the case of prior information in terms of patches. We specify the mapping *u* for this case and further concrete scenarios in Sect. 4.

#### 3.2.3 Optimization Approach

We compute an optimal assignment \(W^{*}\) by following the *Riemannian gradient ascent flow* (3.21) on the manifold \({\mathscr {W}}\). The flows of the individual row vectors \(W_{i}\) are *not* independent, as the product structure of \({\mathscr {W}}\) (cf. Sect. 2.3) might suggest. Rather, they are coupled through the gradient \(\nabla J(W)\), which reflects the interaction of the distributions \(W_{i},\,i \in [m]\), due to the geometric averaging which results in the similarity matrix (3.13).

Along the flow, the objective function value *increases* until a stationary point is reached where the Riemannian gradient vanishes. Clearly, we expect *W*(*t*) to approximate a global maximum due to Lemma 4; global maxima all satisfy the condition (3.24) for stationary points \(\overline{W}\). For *interior* stationary points \(\overline{W} \in {\mathscr {W}}\), with \(\overline{W} > 0\) due to the definition of \({\mathscr {W}}\), all brackets \((\cdots )\) on the r.h.s. of (3.24) must vanish, which imposes a restrictive condition on the Euclidean gradient.

Both *S*(*W*) and \(T^{ij}(W) = \frac{\partial }{\partial W_{ij}} S(W)\) depend in a smooth way on the data (3.1) and the prior set (3.3), through the distance matrix (3.6), the likelihood matrix (3.12) and the geometric averaging (3.13) which forms the similarity matrix *S*(*W*). Regarding the second term on the r.h.s. of (3.26b), a computation relegated to “Proofs of Section 3 and Further Details” of Appendix 2 yields an expression that takes the *same* value for every \(j \in [n]\), such that averaging with respect to \(W_{i}\) on the r.h.s. causes no change.

We cannot rule out the existence of specific data configurations for which the flow (3.21) reaches such very specific interior stationary points. Any such point, however, will not be a maximum and will be isolated, by virtue of the local strict convexity of the objective function (2.8) for Riemannian means (cf. Lemma 3 below), which determines the similarity matrix (3.13). Consequently, any perturbation (e.g., by numerical computation) will let the flow escape from such a point, in order to maximize the objective due to (3.23).
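The replicator structure of the flow can be made concrete in a few lines. The sketch below (our naming, a hedged illustration rather than the paper's code) evaluates the row-wise right-hand side \(W_{ij}\big((\nabla J(W))_{ij} - \langle W_{i}, \nabla _{i} J(W)\rangle \big)\); the brackets vanish exactly when each gradient row is constant on the support of \(W_{i}\), and each row of the field sums to zero, so the flow stays on the simplex.

```python
import numpy as np

def replicator_field(W, grad):
    # Right-hand side of the replicator equations, row-wise:
    #   dW_ij/dt = W_ij * ( grad_ij - <W_i, grad_i> ).
    # At an interior point (W > 0) this vanishes iff every bracket
    # grad_ij - <W_i, grad_i> is zero, i.e. grad_i is constant.
    inner = np.sum(W * grad, axis=1, keepdims=True)
    return W * (grad - inner)
```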

We summarize this reasoning with the following conjecture.

### Conjecture 1

The flow *W*(*t*) generated by (3.21) approximates a global maximum as defined by (3.18), in the sense that for any \(0 < \varepsilon \ll 1\) there is a \(t=t(\varepsilon )\) such that (3.29) holds.

### Remark 6

1. Since \(\overline{{\mathscr {W}}}^{*} \not \in {\mathscr {W}}\), the flow *W*(*t*) cannot converge to a global maximum, and numerical problems arise when (3.29) holds for \(\varepsilon \) very close to zero. Our strategy to avoid such problems is described in Sect. 3.3.1.

2. Although global maxima are not attained, we agree to call a point \(W^{*}=W(t)\) a *maximum* and an *optimal assignment* if it satisfies (3.29) for some fixed small \(\varepsilon \). The criterion which terminates our algorithm is specified in Sect. 3.3.4.

### 3.3 Implementation

We discuss in this section specific aspects of the implementation of the variational approach.

#### 3.3.1 Assignment Normalization

#### 3.3.2 Computing Riemannian Means

Computing the similarity matrix *S*(*W*) due to Eq. (3.13) involves the computation of Riemannian means. In view of Definition 2, we compute the Riemannian mean \({\mathrm {mean}}_{{\mathscr {S}}}({\mathscr {P}})\) of given points \({\mathscr {P}}=\{p^{i}\}_{i \in [N]} \subset {\mathscr {S}}\), using uniform weights, as the fixed point \(p^{(\infty )}\) obtained by iterating the following steps.

Numerical issues can arise when nearly *identical* vectors are averaged, as the expression (7.16b) shows. Such situations may occur, e.g., when computer-generated images are processed. Setting \(\varepsilon =1-\langle \sqrt{p},\sqrt{q}\rangle \) for two vectors \(p, q \in {\mathscr {S}}\), we replace the expression (7.16b) by a numerically stable approximation for small \(\varepsilon \).
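Under the Fisher–Rao metric, the open simplex is isometric (up to a constant factor) to a portion of the unit sphere via \(p \mapsto \sqrt{p}\). The following fixed-point iteration is an illustrative sketch of computing the Riemannian mean in that sphere representation, using the standard spherical log/exp maps with uniform weights; it is a reimplementation under that assumption, not the paper's exact scheme (7.16).

```python
import numpy as np

def karcher_mean_simplex(P, iters=20):
    # Riemannian (Karcher) mean of points on the probability simplex,
    # computed in the sphere representation p -> sqrt(p) via the
    # fixed-point iteration  m <- exp_m( mean_i log_m(s_i) ).
    S = np.sqrt(np.asarray(P, dtype=float))       # points on the unit sphere
    m = S.mean(axis=0)
    m /= np.linalg.norm(m)                        # initialization on the sphere
    for _ in range(iters):
        T = np.zeros_like(m)                      # mean of log maps at m
        for s in S:
            c = np.clip(np.dot(m, s), -1.0, 1.0)
            theta = np.arccos(c)
            if theta > 1e-12:
                T += theta * (s - c * m) / np.sin(theta)
        T /= len(S)
        n = np.linalg.norm(T)
        if n < 1e-12:                             # fixed point reached
            break
        m = np.cos(n) * m + np.sin(n) * (T / n)   # exp map at m
        m /= np.linalg.norm(m)
    return m**2 / np.sum(m**2)                    # back to the simplex
```

Note that averaging two identical vectors makes the log map degenerate (\(\theta = 0\)), which is exactly the numerical corner case addressed above.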

### Lemma 5

### Proof

See “Proofs of Section 3 and Further Details” of Appendix 2. \(\square \)

#### 3.3.3 Optimization Algorithm

A thorough analysis of various discrete schemes for numerically integrating the gradient flow (3.21), including stability estimates, is beyond the scope of this paper and will be separately addressed in follow-up work (see Sect. 5 for a short discussion).

The first term on the r.h.s. of (3.26) involves the derivative of *S*(*W*) with respect to \(W_{i}\), and it is significantly smaller than the second term \(S_{i}(W)\) of (3.26), because \(S_{i}(W)\) results from *averaging* (3.13) the likelihood vectors \(L_{j}(W_{j})\) over spatial neighborhoods and hence changes slowly. As a consequence, we simply drop this first term which, as a by-product, avoids the numerical evaluation of the expensive expressions (3.27) specifying the first term.

The resulting fixed-point iteration (3.36) updates all assignment vectors in parallel at every step *k* (see Fig. 1).
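Dropping the first gradient term suggests a particularly simple discrete scheme. The following sketch assumes the multiplicative fixed-point form \(W_{i} \leftarrow W_{i}\, S_{i}(W)/\langle W_{i}, S_{i}(W)\rangle \) implied by the replicator structure; `similarity` is a user-supplied callable \(W \mapsto S(W)\), and the clipping threshold is our own interior-point safeguard in the spirit of the normalization of Sect. 3.3.1, not the paper's exact rule.

```python
import numpy as np

def assignment_iteration(W, similarity, iters=50, eps=1e-10):
    # Parallel multiplicative updates: each row is multiplied
    # componentwise by its similarity vector and renormalized
    # onto the probability simplex.
    for _ in range(iters):
        S = similarity(W)
        W = W * S
        W /= W.sum(axis=1, keepdims=True)
        W = np.clip(W, eps, None)                 # keep iterates interior
        W /= W.sum(axis=1, keepdims=True)
    return W
```

With a fixed similarity vector, the iterates concentrate geometrically fast on the largest entry, illustrating how the geometry drives assignments toward indicator vectors.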

#### 3.3.4 Termination Criterion

## 4 Illustrative Applications and Discussion

### 4.1 Parameters, Empirical Convergence Rate

Figure 6 shows a color image and a noisy version of it. The latter image was used as input data of a labeling problem. Both images comprise 31 color vectors forming the prior data set \({\mathscr {P}}_{{\mathscr {F}}} = \{f^{1*},\ldots ,f^{31*}\}\). The labeling task is to assign these vectors in a spatially coherent way to the input data so as to recover the ground-truth image.

This task should not be confused with image denoising in the traditional sense [9], where noise has to be removed from *real-valued* image data. Rather, the experiment depicted by Fig. 6 represents difficult *classification* tasks where the assignment process is essential in order to cope with the high noise level.

Every color vector was encoded by a vertex of the simplex \(\varDelta _{30}\), that is, by one of the unit vectors \(\{e^{1},\ldots ,e^{31}\} \subset \{0,1\}^{31}\). Choosing the distance \(d_{{\mathscr {F}}}(f^{i},f^{j}) := \Vert f^{i}-f^{j}\Vert _{1}\), this results in unit distances between all pairs of distinct data points and hence makes it possible to assess most clearly the impact of geometric spatial averaging and the influence of the two parameters \(\rho \) and \(|{\mathscr {N}}_{\varepsilon }|\), introduced in Sects. 3.1.2 and 3.1.4, respectively. We refer to the caption for a brief discussion of the selectivity parameter \(\rho \) and the spatial scale in terms of \(|{\mathscr {N}}_{\varepsilon }|\).

The reader familiar with total variation-based denoising, where only a *single* parameter is used to control the influence of regularization, may ask why *two* parameters are used in the present approach and whether they are necessary. We refer again to Fig. 6 and the caption, where the separation of the physical and the spatial scale based on different parameter choices is demonstrated. The total variation measure couples these scales, as the co-area formula explicitly shows. As a consequence, only a single parameter is needed. On the other hand, larger values of this parameter lead to the well-known loss-of-contrast effect, which the present approach avoids by properly choosing the parameters \(\rho , |{\mathscr {N}}_{\varepsilon }|\) corresponding to these two scales.

Figure 7 shows how convergence of the iterative algorithm (3.36) is affected by these two parameters. It also demonstrates that a few tens of massively parallel outer iterations suffice to reach the termination criterion of Sect. 3.3.4. A parallel implementation only has to take into account the spatial neighborhood (3.14) where pixel locations directly interact in order to compute the likelihood matrix by geometric averaging (3.13).

All results were computed using the assignment mapping (3.20) *without* rounding. This shows that the termination criterion of Sect. 3.3.4, illustrated in Fig. 7, leads to (almost) unique assignments.

### 4.2 Vector-Valued Data

While color vectors were used here, *any* feature vector of arbitrary dimension *d* could be used instead, depending on the application at hand. We used a distance function \(d_{{\mathscr {F}}}\) chosen so as to make the choice of the parameter \(\rho \) insensitive with respect to the dimension *d* of the feature space. Given an optimal assignment matrix \(W^{*}\) as solution to (3.16), the prior information assigned to the data is given by the assignment mapping.

### 4.3 Patches

We denote patch neighborhoods by \({\mathscr {N}}_{p}(i)\) (the subscript *p* indicates neighborhoods for patches). With each entry \(j \in {\mathscr {N}}_{p}(i)\), we associate the Gaussian weight (4.5). The prior set \({\mathscr {P}}_{{\mathscr {F}}}\) consists of *n* prototypical patches, of the same size as the patches centered at each pixel *i* and indexed by \({\mathscr {N}}_{p}(i)\) as well.

The assignment mapping synthesizes an approximation of the given data *f*. Location *i* is affected by patches that overlap with *i*. Let us denote by \({\mathscr {N}}_{p}^{i \leftarrow j}\) the indices of the patches whose supports include location *i*, let *j* denote a location to which prior patches are assigned, and let location *i* be indexed by \(i_{j}\) in patch *j* (local coordinate inside patch *j*). Then, by summing over all patches indexed by \({\mathscr {N}}_{p}^{i \leftarrow j}\) whose supports include location *i*, and by weighting the contributions to location *i* by the corresponding weights (4.5), we obtain the vector \(u^{i}\) (4.10) assigned to location *i*. This expression looks more clumsy than it actually is. In words, the vector \(u^{i}\) assigned to location *i* is the convex combination of vectors contributed from patches overlapping with *i*, which themselves are formed as convex combinations of prior patches.

In particular, if we consider the common case of *equal* patch supports \({\mathscr {N}}_{p}(i)\) for every *i* that additionally are *symmetric* with respect to the center location *i*, then \({\mathscr {N}}_{p}^{i \leftarrow j} = {\mathscr {N}}_{p}(i)\). As a consequence, due to the symmetry of the weights (4.5), the first sum of (4.10) sums up all weights \(w^{p}_{ij}\). Hence, the normalization factor on the right-hand side of (4.10) equals 1, because the low-pass filter \(w^{p}\) preserves the zero-order moment (mean) of signals. Furthermore, it then makes sense to denote by \((-i)\) the location \(i_{p}\) corresponding to *i* in patch *j*. Denoting the vector assigned to location *i* as the convex combination of prior patches assigned to *j*, we finally rewrite (4.10), due to the symmetry \(w^{p}_{j(-i)} = w^{p}_{ji} = w^{p}_{ij}\), in the more handy form (4.11).\(^{1}\)

The inner operations of (4.11) predict vectors for location *i* by fitting prior patches to all locations \(j\in {\mathscr {N}}(i)\). The outer expression fuses the assigned vectors. If they were all the same, the outer operation would have no effect, of course.
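In code, the two-level convex combination just described reduces to a single weighted average once the per-patch predictions are formed; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def fuse_patch_predictions(preds, weights):
    # Fuse the values predicted for one location i by all patches
    # overlapping i: preds[k] is the prediction of the k-th overlapping
    # patch (itself a convex combination of prior patches), weights[k]
    # its Gaussian weight w^p; the weights are normalized to sum to 1,
    # so the result is a convex combination.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(preds, dtype=float)
```

If all overlapping patches predict the same value, the fusion leaves it unchanged, mirroring the remark about the outer operation above.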

We discuss further properties of this approach by concrete examples.

### Example 1

(Patch Assignment) Figure 9 shows an image *f* and the corresponding assignment \(u(W^{*})\) based on a patch dictionary \({\mathscr {P}}_{{\mathscr {F}}}\) that was formed as explained in the caption.

The assignment was computed at each pixel location *i* after adapting each prior template \(f^{*j}\) to the data *f* at that location, denoted by \(f^{*j}=f^{*j(i)}\) in (4.14). Each such template takes two values that were adapted to the template \(f^{i}\) to which it is compared.

The result in Fig. 9c illustrates how the approximation \(u(W^{*})\) of *f* is restricted by the prior knowledge, leading to normalized signal transitions regarding both the spatial geometry and the signal values. By maximizing the objective (3.16), a patch-consistent and dense cover of the image is computed. It induces a strong nonlinear image filtering effect by fusing, through assignment, more than 200 predictions of possible values for each single pixel, based on the patch dictionary \({\mathscr {P}}_{{\mathscr {F}}}\).

### Example 2

(Patch Assignment) Figure 10 shows a fingerprint image characterized by two gray values \(f^{*}_{\text {dark}}, f^{*}_{\text {bright}}\) that were extracted from the histogram of *f* after removing a smooth function of the spatially varying mean value (panel (b)). The latter was computed by interpolating the median values for each patch of a coarse \(16 \times 16\) partition of the entire image.

Figure 10c shows the dictionary of patches modeling the remaining binary signal transitions. An essential difference from Example 1 is the *subdivision of the dictionary into classes of equivalent patches* corresponding to each orientation. The averaging process was set up to distinguish only assignments of patches from *different* patch classes and to treat patches of the same class equally. This makes geometric averaging particularly effective if signal structures conform to a single class on larger spatially connected supports. Moreover, it reduces the problem size to merely 13 class labels: 12 orientations at \(k \cdot 30^{\circ },\, k \in [12]\), together with the single constant patch complementing the dictionary.

The distance between the data patch at location *i* and the *j*-th prior patch was chosen depending on both the prior patch and the data patch it was compared to; in particular, a separate distance was used for the constant prior patch.

The center and bottom rows in Fig. 10, respectively, show the assignment \(u(W^{*})\) of the dictionary of \(3 \times 3\) patches (center row) and of \(7 \times 7\) patches (bottom row). The center panels (f) and (i) depict the class labels of these assignments according to the color code of panel (d). These images display the interpretation of the image structure of *f* from panel (a). While the assignment of patches of size \(3 \times 3\) is slightly noisy, which becomes visible through the assignment of the constant template marked by black in panel (f), the assignment of \(5 \times 5\) or \(7 \times 7\) patches results in a robust, spatially coherent and accurate representation of the local image structure. The corresponding pronounced nonlinear filtering effect is due to the consistent assignment of a large number of patches at each pixel location and the fusion of the corresponding predicted values.

Panels (g) and (j) show the resulting additive image decompositions (4.16), which seem difficult to achieve with established convex variational approaches that employ various regularizing norms and duality for this purpose (see, e.g., [2]).

Finally, we point out that it would be straightforward to add to the dictionary further patches modeling minutiae and other features relevant to fingerprint analysis. We do not consider in this paper any application-specific aspects, however.

### 4.4 Unsupervised Assignment

We consider the case that no prior information is available.

The simplest way to handle the absence of prior information is to use the given data themselves as prior information along with a suitable constraint, to enforce selection of the most important parts by *self-assignment*.

In order to illustrate this mechanism clearly, Fig. 11 shows as an example the assignment of uniform noise to itself. As prior data \({\mathscr {P}}_{{\mathscr {F}}}\), we uniformly discretized the rgb color cube \([0,1]^{3}\) at \(0, 0.2, 0.4, \ldots , 1\) along each axis, resulting in \(|{\mathscr {P}}_{{\mathscr {F}}}| = 6^{3} = 216\) color vectors. Because there is no preference for any of these vectors, spatial diffusion of uniform noise at any spatial scale will inherently end up with the average color gray, which, however, is excluded from the prior set by construction. Accordingly, the process terminated with a spatially random assignment of the 8 color vectors closest to gray (Figs. 11b, rescaled, and 11d), solely induced by the input noise and geometric averaging at a certain scale. Figure 11c depicts the relative frequency with which each prior vector is assigned to some location. Except for the 8 aforementioned vectors, all others are ignored.
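The prior construction for this experiment is elementary; a sketch of building the discretized rgb cube (variable names ours):

```python
import itertools
import numpy as np

# Discretize the rgb cube [0,1]^3 at 0, 0.2, ..., 1 along each axis,
# yielding the |P_F| = 6^3 = 216 prior color vectors of this experiment.
levels = np.linspace(0.0, 1.0, 6)
prior = np.array(list(itertools.product(levels, repeat=3)))
```

Note that the average color gray, \((0.5, 0.5, 0.5)\), is not among the \(6^{3}\) grid points, which is the sense in which it is excluded from the prior set by construction.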

A detailed elaboration of unsupervised scenarios based on our approach, for both vector- and patch-valued data, will be studied in our follow-up work (Sect. 5).

### 4.5 Labeling with Adaptive Distances

In this section, we consider a simple instance of the more general class of scenarios where the distance matrix (3.6) \(D = D(W)\) depends on the assignment matrix *W*, in addition to the likelihood matrix *L*(*W*) and the similarity matrix *S*(*W*).

Figure 12e displays a point pattern that was generated by sampling a foreground and background process of randomly oriented rectangles, as explained by the remaining panels in Fig. 12. The task is to recover the foreground process among all possible rectangles (Fig. 12f) based on (1) unary features given by the fraction of points covered by each rectangle, and on (2) the prior knowledge that unlike background rectangles, elements of the foreground process do *not* intersect. Rectangles of the background process were slightly less densely sampled than foreground rectangles so as to make the unary features indicative. Due to the overlap of many rectangles (Fig. 12a), however, these unary features are noisy (“weak”).

As a consequence, exploiting the prior knowledge that foreground rectangles do not intersect becomes decisive. This is done by determining the intersection pattern of all rectangles (Fig. 12f) in terms of Boolean values that are arranged into matrices \(R_{ij}\), for each edge *ij* of the grid graph whose vertices correspond to the centroids of the rectangles in Fig. 12f: \((R_{ij})_{k,l}=1\) if rectangle *k* at position *i* intersects with rectangle *l* at position *j*, and \((R_{ij})_{k,l}=0\) otherwise. Due to the geometry of the rectangles, a rectangle at position *i* may only intersect with \(8 \times 18 = 144\) rectangles located within an 8-neighborhood \(j \in {\mathscr {N}}_{\varepsilon }(i)\). Generalizations to other geometries are straightforward.
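For axis-aligned rectangles (a simplification of the randomly oriented rectangles of Fig. 12, used here only for illustration), the Boolean matrices \(R_{ij}\) can be built as follows; all names are ours:

```python
def rects_intersect(a, b):
    # a, b: (x0, y0, x1, y1) axis-aligned rectangles with x0 < x1, y0 < y1.
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Two rectangles intersect iff they overlap on both axes.
    return not (ax1 <= bx0 or bx1 <= ax0 or ay1 <= by0 or by1 <= ay0)

def intersection_matrix(rects_i, rects_j):
    # Boolean matrix (R_ij)_{k,l} = 1 iff rectangle k at position i
    # intersects rectangle l at position j.
    return [[int(rects_intersect(a, b)) for b in rects_j] for a in rects_i]
```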

The inference task to recover the foreground rectangles (Fig. 12c) from the point pattern (Fig. 12e) may be seen as a multi-labeling problem based on an asymmetric Potts-like model: Labels correspond to equally oriented rectangles and have to be determined so as to maximize the coverage of points, subject to the pairwise constraints that selected rectangles do not intersect. Alternatively, we may think of binary “off–on” variables that are assigned to each rectangle in Fig. 12f, which have to be determined subject to *disjunctive* constraints: At each location, at most a single variable may become active, and pairwise active variables have to satisfy the intersection constraints. Note that in order to suppress intersecting rectangles, penalizing costs are only encountered if (a subset of) pairs of variables receive the *same* value 1 (=active and intersecting). This violates the submodularity constraint [29, Eq. (7)] and hence rules out global optimization using graph cuts.

The distance matrix (4.19) comprises the unary data term for each location *i*, and \(\lambda > 0\) weights the influence of the intersection prior. This latter term is defined by the matrices \(R_{ij}\) discussed above and is given by the gradient with respect to *W* of the penalty \((\lambda /|{\mathscr {N}}_{\varepsilon }(i)|) \sum _{ij \in {\mathscr {E}}} \langle W_{i}, R_{ij} W_{j} \rangle \).

A natural alternative would be to minimize a corresponding non-convex energy with the distance matrix *D* given by (4.19) directly, by DC (difference of convex functions) programming. This “Euclidean approach”, in contrast to the geometric approach proposed here, entails providing a DC decomposition of the intersection penalty just discussed and *explicitly* taking into account the affine constraints \(W_{i} \in \varDelta _{n-1}\). As a result, the DC approach computes a local minimizer by solving a *sequence* of convex quadratic programs.

In order to apply our present approach instead, we bypass the averaging step (3.13), because labels will most likely differ at adjacent vertices *i* in our random scenario, and we thus set \(S(W) = L(W)\) with *L*(*W*) given by (3.12) based on (4.19). Applying algorithm (3.36) then *implicitly* handles all constraints through the geometric flow and computes a local minimizer by multiplicative updates, within a small fraction of the runtime that the DC approach would need, and without compromising the quality of the solution (Fig. 12g).

### 4.6 Image Inpainting

*Inpainting* denotes the task of filling in a given region where no image data were observed, or where the data are known to be corrupted, based on the surrounding region and prior information.

Once the feature metric \(d_{{\mathscr {F}}}\) is fixed, we assign to each pixel in the region to be inpainted, *as datum*, the *uninformative feature vector* *f* which has the *same* distance \(d_{{\mathscr {F}}}(f,f^{*}_{j})\) to *every* prior feature vector \(f^{*}_{j} \in {\mathscr {P}}_{{\mathscr {F}}}\). Note that there is no need to explicitly compute this data vector *f*. It merely represents the rule for evaluating the distance \(d_{{\mathscr {F}}}\) if one of its arguments belongs to a region to be inpainted.
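The evaluation rule can be sketched as a one-line wrapper around the feature distance; the constant `c` and all names are hypothetical, since only the equality of the distance across all prior vectors matters:

```python
def masked_distance(d, f_i, f_star, in_inpaint_region, c=1.0):
    # Rule for evaluating d_F: inside the inpainting region, return the
    # same constant distance c to *every* prior vector, so the data term
    # is uninformative there; otherwise evaluate the feature distance d.
    return c if in_inpaint_region else d(f_i, f_star)
```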

While established convex relaxations constitute *outer* approximations of the combinatorially complex feasible set of assignments, our smooth non-convex approach may be considered as an *inner* approximation that yields results without the need of further rounding, i.e., without a post-processing step that projects the solution of a convex relaxed problem onto the feasible set.

## 5 Conclusion and Further Work

We presented a novel approach to image labeling, formulated in a smooth geometric setting. The approach contrasts with established convex and non-convex relaxations of the image labeling problem through smoothness and geometric averaging. The numerics boil down to parallel sparse updates that maximize the objective along an interior path in the feasible set of assignments and finally return a labeling. Although only an elementary first-order approximation of the gradient flow was used, the convergence rate seems competitive. In particular, a large number of labels, as in Sect. 4.4, does not slow down convergence, as is the case for convex relaxations. All aspects specific to an application domain are represented by a single distance matrix *D* and a single user parameter \(\rho \). This flexibility and the absence of ad hoc tuning parameters whose values lack an intrinsic meaning should promote applications of the approach to various image labeling problems.

**Numerics** Many alternatives exist to the simple algorithm detailed in Sect. 3.3.3. An alternative first-order example is the exponential multiplicative update [11] that results from an explicit Euler discretization of the flow (3.21) rewritten in the form
$$\begin{aligned} \frac{\hbox {d}}{\hbox {d}t}\log \big (W_{i}(t)\big ) = \nabla _{i} J(W) - \langle W_{i}, \nabla _{i} J(W) \rangle {\mathbbm {1}},\qquad i \in [m]. \end{aligned}$$
(5.1)
Of course, higher-order schemes respecting the geometry are conceivable as well. We point out that the inherent *smoothness* of our problem formulation paves the way for *systematic* progress.

**Non-uniform geometric averaging** So far, we did not exploit the degrees of freedom offered by the weights \(w_{i},\, i \in [N]\) that define the Riemannian means by the objective (2.8). By doing so, the approximation of these means due to formula (3.33) generalizes in that the geometric mean has to be replaced by the weighted geometric mean
$$\begin{aligned} {\mathrm {mean}}_{g,w}({\mathscr {P}}) = \prod _{j \in [N]} (p^{j})^{w_{j}},\quad w \in \varDelta _{N-1}, \end{aligned}$$
(5.2)
that is applied componentwise to the vectors \(p^{j} \in {\mathscr {P}}\). Figure 14 illustrates the influence of these weights \(w_{j}\), which were computed in a preprocessing step for each pixel *i* within the neighborhood \({\mathscr {N}}_{{\mathscr {E}}}\) by computing the distance \(d_{p}(p_{i}, p_{j})\) (defined as the mean of the \(\ell _{2}\)-distances of the respective color vectors) between \(7 \times 7\) noisy data patches \(p_{i}, p_{j}\) centered at *i* and *j*, respectively, to obtain the normalized weights \(w_{j} = \frac{\tilde{w}_{j}}{\langle {\mathbbm {1}}, \tilde{w} \rangle }\), \(\tilde{w}_{j} = \exp {\big (-d_{p}(p_{i}, p_{j})/\rho \big )}\).

Turning this *data*-driven adaptivity of the assignment process through non-uniform weights into a *solution*-driven adaptivity, by replacing the data *f* by *u*(*W*) due to (3.19), which evolves with *W*, enables an even more general way to further enhance the assignment process.

**Connection to nonlinear diffusion** Referring to the discussion of neighborhood filters and nonlinear diffusion in Sect. 1.3, research making these connections explicit is attractive because, apparently, our approach is not covered by existing work.

**Unsupervised scenarios** The nonexistence of a prior data set \({\mathscr {P}}_{{\mathscr {F}}}\) in applications was only briefly addressed in Sect. 4.4. In particular, the emergence of labels along with assignments, and a corresponding generalization of our approach, deserves attention.

**Learning and updating prior information** This fundamental problem ties in with the preceding point: how can we learn and evolve prior information from many assignments over time?
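Formula (5.2), combined with the normalization used for the geometric mean approximation (3.33), amounts to a componentwise weighted geometric mean; a sketch assuming strictly positive vectors:

```python
import numpy as np

def weighted_geometric_mean(P, w):
    # Componentwise weighted geometric mean (5.2):
    #   mean_{g,w}(P) = prod_j (p^j)^{w_j},  w in the simplex,
    # followed by normalization back onto the probability simplex.
    P = np.asarray(P, dtype=float)   # N x n, rows strictly positive
    w = np.asarray(w, dtype=float)   # N nonnegative weights summing to 1
    m = np.exp(w @ np.log(P))        # prod_j (p^j)^{w_j}, componentwise
    return m / m.sum()
```

Uniform weights recover the plain geometric mean of (3.33), while data-dependent weights as above shift the mean toward similar patches.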

For locations *i* close to the boundary of the image domain where patch supports \({\mathscr {N}}_{p}(i)\) shrink, the definition of the vector \(w^{p}\) has to be adapted accordingly.

## Acknowledgements

Support by the German Research Foundation (DFG), Grant GRK 1653, is gratefully acknowledged.