A Cutting-Plane Method for Sublabel-Accurate Relaxation of Problems with Product Label Spaces

Many problems in imaging and low-level vision can be formulated as nonconvex variational problems. A promising class of approaches to tackle such problems are convex relaxation methods, which consider a lifting of the energy functional to a higher-dimensional space. However, they come with increased memory requirements due to the lifting. The present paper is an extended version of the earlier conference paper by Ye et al. (in: DAGM German conference on pattern recognition (GCPR), 2021) which combined two recent approaches to make lifting more scalable: product-space relaxation and sublabel-accurate discretization. Furthermore, it is shown that a simple cutting-plane method can be used to solve the resulting semi-infinite optimization problem. This journal version extends the previous conference work with additional experiments, a more detailed outline of the complete algorithm and a user-friendly introduction to functional lifting methods.

(1) The set Γ = {(γ_1, . . . , γ_k) ∈ R^N : γ_i ∈ Γ_i, i = 1 . . . k} is defined by k individual submanifolds Γ_i ⊂ R^{N_i} with N = N_1 + · · · + N_k. The individual Γ_i are required to be bounded subsets of R^{N_i}.
Since the focus of this paper is on imaging applications, we assume Ω ⊂ R^2 to be a rectangular domain, but the approach is easily generalized to higher-dimensional or nonrectangular domains.
We make no special assumptions on the cost c : Ω × Γ → R_{≥0} in (1) and allow it to be a general nonnegative nonconvex function. This turns (1) into an overall nonconvex optimization problem, which can be challenging to solve using standard gradient-based methods. Moreover, we do not assume that we are able to compute gradients, projections or proximal operators of the cost function c(x, u(x)). Our approach only requires function evaluations. This allows us to consider degenerate costs that are out of reach for gradient-based approaches.
The regularizer in (1) is a separable total variation regularization TV(u_i) on the individual components u_i : Ω → R^{N_i}, weighted by a parameter λ_i > 0. The total variation (TV) (Rudin et al. 1992; Chambolle et al. 2010) encourages a spatially smooth but edge-preserving solution. It can be defined as

TV(u_i) = sup { ∫_Ω ⟨u_i(x), Div p(x)⟩ dx : p ∈ C_c^∞(Ω; R^{N_i×2}), ‖p(x)‖_* ≤ 1 ∀x } = ∫_Ω ‖∇u_i(x)‖ dx,   (2)

where by ∇u_i(x) ∈ R^{N_i×2} we denote the Jacobian matrix and by ‖·‖_* the dual norm of ‖·‖. The last equality in (2) holds for sufficiently smooth u_i. The convex relaxation approach we use in this paper works for general convex and nonconvex regularizers which depend on the Jacobian ∇u_i, see Pock et al. (2009, 2010), Möllenhoff and Cremers (2019), Vogt et al. (2020). However, the main focus of this paper is an efficient implementation of the data cost c, and therefore we consider only the separable total variation (2).

Motivation and applications
To motivate Problem (1), let us consider some practical applications in low-level vision and imaging. One example is the variational estimation of optical flow between two RGB images I_1, I_2 : Ω → R^3, see Horn and Schunck (1981). In that case, Γ_1 = Γ_2 = [a, b] ⊂ R models the displacement between the two images and the cost function is given by a photometric error.
Often, Γ_i is a curved manifold, see, e.g., the applications presented by Weinmann et al. (2014). Examples include Γ_i = S^2 for normal field processing, SO(3) for motion estimation (Görlitz et al. 2019), or the circle S^1 for processing of cyclic data (Steinke et al. 2010).
Many real-world applications require estimating multiple quantities in a joint fashion. This naturally leads to the product-space formulation considered in (1), where Γ = Γ_1 × · · · × Γ_k. In this case, each Γ_i models one quantity of interest that one aims to estimate.
A prominent approach to address joint optimization problems of this form are alternating procedures such as expectation maximization (Dempster et al. 1977), block-coordinate descent, and alternating direction-type methods (Boyd et al. 2011). There, the idea is to estimate a single quantity while holding the others fixed. However, these approaches usually depend on a good initialization and can easily get stuck in poor local optima. Therefore, the goal of this paper is to instead consider a convex relaxation of Problem (1). The relaxed problem can then be solved to global optimality with standard proximal methods such as the primal-dual algorithm. These methods are usually well parallelizable and can thus be efficiently implemented on GPUs, allowing large-scale problems to be solved in reasonable time.
Contributions The main difficulty with convex approaches to (1) is the large memory requirement inherent to the lifted problem formulation which renders the problem convex. To improve the memory efficiency of relaxations, two disparate ideas have been considered in previous work: sublabel-accurate liftings and product-space relaxations (Goldluecke et al. 2013). In this paper, we combine both approaches and present a sublabel-accurate implementation of Goldluecke et al. (2013). Unlike previous liftings (Möllenhoff and Cremers 2017; Vogt et al. 2020), our approach does not require epigraphical projections and can therefore be applied in a black-box fashion, requiring only evaluations of the cost c.
Our main contribution is a simple way to implement the resulting semi-infinite optimization problem with a cutting-plane method. Moreover, we show that using this method, we can achieve a lower energy than the product-space lifting (Goldluecke et al. 2013) on optical flow estimation and manifold-valued denoising problems.
This journal paper is an extended version of the conference paper (Ye et al. 2021). In particular, we offer the following contributions over the conference version:
• We have added additional background and explanations on the basics of functional lifting to Sects. 2 and 3.
• In Sect. 4 we added the detailed update equations for the primal-dual algorithm.
• We added additional figures and explanations to Sect. 5 to illustrate and provide an intuition of our algorithm on a simple example.
• We added additional experiments on optical flow and manifold-valued image denoising to Sect. 5 and evaluated our method on a larger set of images.
Overview of the paper After this introduction, we provide in Sect. 2 an introduction to functional lifting methods for Problem (1), review existing works and explain our contributions relative to them. We summarize the convex relaxation for (1) and its discretization in Sect. 3. The proposed cutting-plane method and sampling strategy which we use to implement the discretized relaxation are presented in Sect. 4. In Sect. 5 we evaluate our method on a toy problem and several real-world imaging applications. Conclusions are drawn in Sect. 6.

Fig. 1 Traditional optimization versus optimization over a space of probability distributions. In the top row, going from left to right, we illustrate a traditional gradient-based local optimization method on a nonconvex problem of the form (3). Given a bad initialization, the algorithm might get stuck in a poor local optimum. The bottom row illustrates an optimization procedure on the relaxed problem (4). Due to convexity of the objective and search space of probability measures, the solution concentrates at a Dirac distribution centered around a global optimum

An Introduction to Functional Lifting
Let us first consider a simplified version of Problem (1) where Ω consists only of a single point, i.e., the nonconvex minimization of one data term:

min_{γ ∈ Γ} c(γ).   (3)

A well-known approach to the global optimization of (3) is a lifting or stochastic relaxation procedure, which has been considered in diverse fields such as polynomial optimization (Lasserre 2000), continuous Markov random fields (Fix and Agarwal 2014; Peng et al. 2011; Bauermeister et al. 2021), variational methods (Pock et al. 2008), and black-box optimization (de Boer et al. 2005; Ollivier et al. 2017; Schaul 2011). The idea is to relax the search space in (3) from γ ∈ Γ to probability distributions u ∈ P(Γ)^1 and solve

min_{u ∈ P(Γ)} ∫_Γ c(γ) du(γ).   (4)

Due to linearity of the integral wrt. u and convexity of the relaxed search space, this is a convex problem for arbitrary cost c. Moreover, the minimizers of (4) concentrate at the optima of c and can hence be identified with solutions to (3). However, if Γ is a continuum, this problem is infinite-dimensional and therefore challenging. We illustrate the conceptual difference between the formulations (3) and (4) on a one-dimensional example in Fig. 1.

^1 P(Γ) is the set of nonnegative Radon measures on Γ with total mass u(Γ) = 1.

Discrete/traditional multilabeling In the context of Markov random fields (Ishikawa 2003; Kappes et al. 2013) and multilabel optimization (Chambolle et al. 2012; Lellmann et al. 2009; Lellmann and Schnörr 2011; Zach et al. 2008) one typically discretizes Γ into a finite set of points (called the labels) {v^1, . . . , v^ℓ}. This turns
(4) into a finite-dimensional linear program min_{u ∈ Δ_ℓ} ⟨c, u⟩, where c ∈ R^ℓ_{≥0} denotes the vector of label costs and Δ_ℓ ⊂ R^ℓ is the (ℓ−1)-dimensional unit simplex. If we evaluate the cost at the labels, this program upper bounds the continuous problem (3), since instead of all possible solutions one considers a restricted subset determined by the labels. Since the solution will be attained at one of the labels, typically a fine meshing is needed. Similar to black-box and zero-order optimization methods, this strategy suffers from the curse of dimensionality: when each Γ_i is discretized into ℓ labels, the overall number of labels is ℓ^k, which quickly becomes intractable, since many labels are required for a smooth solution. This limits the applicability of the method to practical problems. Additionally, for pairwise or regularizing terms, often a large number of dual constraints has to be implemented. In that context, prior work considers a constraint pruning strategy as an offline preprocessing.
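To make the discretized lifting concrete, the following minimal sketch solves the lifted linear program for a hypothetical one-dimensional double-well cost (the cost, labels and names are illustrative, not from the paper):

```python
def lifted_minimum(cost, labels):
    """Solve the lifted LP min_{u in simplex} <c, u>: since the objective is
    linear, the minimum is attained at a simplex vertex, i.e. a single label."""
    c = [cost(g) for g in labels]        # the cost is only *evaluated* at labels
    j = min(range(len(labels)), key=lambda i: c[i])
    u = [0.0] * len(labels)
    u[j] = 1.0                           # Dirac (one-hot) solution at a vertex
    return u, labels[j]

cost = lambda g: (g * g - 1.0) ** 2      # nonconvex double well, minima at +/-1
labels = [-2.0, -1.0, 0.0, 1.0, 2.0]     # coarse labeling
u, gamma = lifted_minimum(cost, labels)  # gamma == -1.0, a global minimizer
```

Here the coarse mesh happens to contain a true minimizer; in general the solution is restricted to the labels, which is why a fine meshing is typically required.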

Sublabel-accurate multilabeling
The discrete-continuous MRF (Fix and Agarwal 2014; Zach 2013; Zach and Kohli 2012) and lifting methods (Möllenhoff et al. 2016; Möllenhoff and Cremers 2017) attempt to find a more label-efficient convex formulation. These approaches can be understood through duality (Fix and Agarwal 2014; Möllenhoff and Cremers 2017). Applied to (3), the idea is to replace the cost c : Γ → R with a dual variable q : Γ → R:

min_{u ∈ P(Γ)} sup_{q : q(γ) ≤ c(γ) ∀γ ∈ Γ} ∫_Γ q(γ) du(γ).   (5)

The inner supremum in the formulation (5) maximizes the lower bound q. If the dual variable is sufficiently expressive, this problem is actually equivalent to (4). Approximating q, for example with piecewise linear functions on Γ, one arrives at a lower bound to the nonconvex problem (3). It has been observed in a recent series of works (Möllenhoff et al. 2016; Möllenhoff and Cremers 2017; Vogt et al. 2020; Zach and Kohli 2012) that piecewise linear dual variables can lead to smooth solutions even when q (and therefore also u) is defined on a rather coarse mesh. As remarked by Fix and Agarwal (2014), Laude et al. (2016), Möllenhoff et al. (2016), for an affine dual variable this strategy corresponds to minimizing the convex envelope of the cost, min_{γ ∈ Γ} c**(γ), where c** denotes the Fenchel biconjugate of c.
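For intuition on the biconjugate: for a one-dimensional cost sampled at finitely many points, c** corresponds to the lower convex hull of the sampled graph. A minimal sketch using the monotone-chain construction (illustrative only, not the paper's implementation):

```python
def lower_convex_hull(points):
    """Lower convex hull of (x, c(x)) points with x sorted ascending:
    a discrete analogue of the convex envelope / Fenchel biconjugate c**."""
    hull = []
    for (px, py) in points:
        # pop while the last two hull points and p fail to make a convex turn
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (py - y1) - (px - x1) * (y2 - y1) <= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull

xs = [i / 10 - 2 for i in range(41)]          # grid on [-2, 2]
pts = [(x, (x * x - 1.0) ** 2) for x in xs]   # double-well cost
hull = lower_convex_hull(pts)
# the envelope is flat (value 0) between the two wells at -1 and +1
```

All sample points strictly between the two wells lie above the flat chord and are discarded, which is exactly the behavior of the convex envelope.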
The implementation of the constraints in (5) can be challenging even in the case of piecewise-linear q. This is partly because Problem (5) is a semi-infinite optimization problem (Blankenship and Falk 1976), i.e., an optimization problem with infinitely many constraints. The works of Zach and Kohli (2012) and others implement the constraints via projections onto the epigraph of the (restricted) conjugate function of the cost within a proximal optimization framework. Such projections are available in closed form only for some choices of c and are expensive to compute if the dimension is larger than one. This limits the applicability in a "plug-and-play" fashion.

Product-space liftings
The product-space lifting approach (Goldluecke et al. 2013) attempts to overcome the aforementioned exponential memory requirements of labeling methods in a way orthogonal to the sublabel-based methods. The main idea is to exploit the product-space structure in (1) and optimize over the k marginal distributions of the probability measure u ∈ P(Γ), which we denote by u_i ∈ P(Γ_i). Applying (Goldluecke et al. 2013) to the single data term (3), one arrives at the following relaxation:

min_{u_i ∈ P(Γ_i)} sup_{q_i : Γ_i → R} { Σ_{i=1}^k ∫_{Γ_i} q_i(γ_i) du_i(γ_i) : Σ_{i=1}^k q_i(γ_i) ≤ c(γ) ∀γ = (γ_1, . . . , γ_k) ∈ Γ }.   (6)

Since one only has to discretize the individual Γ_i, this substantially reduces the memory requirements from ℓ^k to k · ℓ label variables. While at first glance it seems that the curse of dimensionality is lifted, the difficulty is moved to the dual, where we still have a large (or even infinite) number of constraints. A global implementation of the constraints with Lagrange multipliers as proposed in Goldluecke et al. (2013) again leads to the same exponential dependency on the dimension.
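The memory argument can be made concrete with a quick count (illustrative numbers, assuming ℓ labels per factor):

```python
def full_lifting_labels(ell, k):
    """Joint lifting over Gamma = Gamma_1 x ... x Gamma_k: ell^k labels."""
    return ell ** k

def product_space_labels(ell, k):
    """Marginals u_1, ..., u_k only: k * ell labels."""
    return k * ell

# e.g. optical flow (k = 2) with 32 labels per displacement component:
full_lifting_labels(32, 2)     # -> 1024 primal unknowns per pixel
product_space_labels(32, 2)    # -> 64 primal unknowns per pixel
```

The gap widens rapidly with k, which is what makes the marginal formulation attractive for joint estimation problems.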
As a side note, readers familiar with optimal transport may notice that the supremum in (6) is a multi-marginal transportation problem (Carlier 2003;Villani 2008) with transportation cost c. This view is mentioned by Bach (2019) where relaxations of form (6) are analyzed under submodularity assumptions.
In summary, the sublabel-accurate lifting methods, discrete-continuous MRFs (Zach and Kohli 2012;Möllenhoff et al. 2016) and product-space liftings (Goldluecke et al. 2013) all share a common difficulty: implementation of an exponential or even infinite number of constraints on the dual variables.

Summary of our contribution
Our main contribution is a simple way to implement the dual constraints in an online fashion with a random sampling strategy which we present in Sect. 4. This allows a black-box implementation, which only requires an evaluation of the cost c and no epigraphical projection operations as in Möllenhoff et al. (2016), Zach and Kohli (2012). Moreover, the sampling approach allows us to propose and implement a sublabel-accurate variant of the product-space relaxation (Goldluecke et al. 2013) which we describe in the following section.

Product-Space Relaxation
Our starting point is the convex relaxation of (1) presented in Goldluecke et al. (2013), Strekalovskiy et al. (2014). In these works, each Γ_i ⊂ R is chosen to be an interval. Following Vogt et al. (2020), we consider a generalization to manifolds Γ_i ⊂ R^{N_i}, which leads us to the following relaxation:

min_u sup_{p, q} Σ_{i=1}^k ∫_Ω ∫_{Γ_i} ( q_i(x, γ_i) − Div_x p_i(x, γ_i) ) du_x^i(γ_i) dx   (8)

subject to, for all x ∈ Ω and γ = (γ_1, . . . , γ_k) ∈ Γ:

‖P_{T_{γ_i}} ∇_{γ_i} p_i(x, γ_i)‖_* ≤ λ_i,   (9)
Σ_{i=1}^k q_i(x, γ_i) ≤ c(x, γ).   (10)

This formulation appears similar to (6) from the previous section, but there are two differences. First, we now have marginal distributions u_i(x) for every x ∈ Ω, since we no longer consider only a single data term. The notation du_x^i in (8) denotes integration against the probability measure u_i(x) ∈ P(Γ_i). The variables q_i play the same role as in (6) and lower-bound the cost under constraint (10). The second difference is the introduction of additional dual variables p_i and the term −Div_x p_i in (8). Together with the constraint (9), this can be shown to implement the total variation regularization as in Vogt et al. (2020). Following Vogt et al. (2020), the derivative ∇_{γ_i} p_i(x, γ_i) in (9) denotes the (N_i × 2)-dimensional Jacobian considered in the Euclidean sense and P_{T_{γ_i}} the projection onto the tangent space of Γ_i at the point γ_i.
To get an intuition for the total variation regularization, we use a concrete example to illustrate how (8) and (9) implement it. Consider the case where the labeling variable u_i^x = δ_{u_i(x)} is given as a Dirac measure at every point x, and let k = 1 for simplicity. The term in (8) then simplifies to

∫_Ω −Div_x p(x, u(x)) dx = ∫_Ω ⟨∇_γ p(x, u(x)), ∇u(x)⟩ dx,   (12)

which follows by applying the chain rule and the fact that p has compact support. Finally, taking a point-wise supremum over p inside the integral in (12) under the dual-norm constraint (9) gives us the (λ-weighted) total variation of u, λ ∫_Ω ‖∇u(x)‖ dx, matching the definition in (2).

Finite-Element Discretization
We approximate the infinite-dimensional problem (8) by restricting u_i, p_i and q_i to be piecewise functions on a discrete meshing of Ω × Γ_i. The considered discretization is a standard finite-element approach and largely follows the work of Vogt et al. (2020). Unlike the forward differences considered in Vogt et al. (2020), we use lowest-order Raviart-Thomas elements (see, e.g., Caillaud and Chambolle 2020, Section 5) in Ω, which are specifically tailored to the considered total variation regularization.

Discrete mesh
We approximate each d_i-dimensional manifold Γ_i ⊂ R^{N_i} with a simplicial manifold Γ_i^h, given by the union of a collection of d_i-dimensional simplices T_i. We denote the number of vertices ("labels") in the triangulation of Γ_i by ℓ_i and the set of labels by L_i. The domain Ω ⊂ R^2 is a rectangle which we split into a set of faces F of edge length h_x with edge set E. The number of faces and edges are denoted by F = |F| and E = |E|.

Data term and the u_i, q_i variables

We assume the cost c to be constant in x on each face and denote its value as c(x(f), γ) for f ∈ F, where x(f) ∈ Ω denotes the midpoint of the face f. Similarly, we also assume the variables u_i and q_i to be constant in x ∈ Ω on each face but continuous piecewise linear functions in γ_i. They are represented by coefficient vectors u_i^h, q_i^h ∈ R^{F·ℓ_i}, i.e., we specify the values on the labels and linearly interpolate inbetween. This is done by the interpolation operator W_{i,f,γ_i} : R^{F·ℓ_i} → R which, given an index 1 ≤ i ≤ k, face f, and (continuous) label position γ_i ∈ Γ_i, computes the function value based on barycentric coordinates. Note that after discretization, u_i is only defined on Γ_i^h, but we can uniquely associate to each γ_i ∈ Γ_i^h a point on Γ_i.

Divergence and p_i variables Our variable p_i is represented by coefficients p_i^h ∈ R^{E·ℓ_i} which live on the edges in E and the labels in Γ_i^h. The vector p_i(x, γ_i) ∈ R^2 is obtained by linearly interpolating the coefficients on the vertical and horizontal edges of the face and using the interpolated coefficients to evaluate the piecewise-linear function on Γ_i^h. Under this approximation, a discrete divergence operator Div_x^h : R^{E·ℓ_i} → R^{F·ℓ_i} can be defined.
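For a one-dimensional Γ_i, the interpolation operator reduces to linear interpolation between the two labels adjacent to the sublabel position. A minimal sketch of this 1D case (hypothetical names; the actual operator works with barycentric coordinates on simplices):

```python
def interpolate(coeffs, labels, gamma):
    """1D analogue of the operator W: evaluate the piecewise-linear function
    with nodal values `coeffs` at the labels, at a continuous sublabel
    position gamma (clamped to the label range)."""
    if gamma <= labels[0]:
        return coeffs[0]
    if gamma >= labels[-1]:
        return coeffs[-1]
    # find the simplex (here: interval) containing gamma
    j = max(i for i in range(len(labels)) if labels[i] <= gamma)
    t = (gamma - labels[j]) / (labels[j + 1] - labels[j])  # barycentric coord.
    return (1 - t) * coeffs[j] + t * coeffs[j + 1]

labels = [0.0, 0.5, 1.0]
coeffs = [1.0, 0.0, 2.0]            # nodal values q^h on the labels
interpolate(coeffs, labels, 0.25)   # -> 0.5, halfway between the first two nodes
```

Note that the operator is linear in the coefficient vector for fixed gamma, which is what allows the constraints to be written via a sparse matrix.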

Total variation constraint
Computing the operator P_{T_{γ_i}} ∇_{γ_i} is largely inspired by Vogt et al. (2020), Section 2.2, and is implemented by a linear map. Here, f ∈ F and α ∈ [0, 1]^2 correspond to a point x ∈ Ω, while t ∈ T_i is the simplex containing the point corresponding to γ_i ∈ Γ_i. First, the operator computes the coefficients in R^{ℓ_i} of two piecewise-linear functions on the manifold by linearly interpolating the values on the edges based on the face index f ∈ F and α ∈ [0, 1]^2. For each function, the derivative in simplex t ∈ T_i on the triangulated manifold is given by the gradient of an affine extension. Projecting the resulting vector onto the d_i-dimensional tangent space for both functions leads to a d_i × 2 matrix, whose dual norm is bounded by λ_i in the discretized problem.

Final discretized problem
Plugging our discretized u_i, q_i, p_i into (8), we arrive at a finite-dimensional optimization problem of the form

min_{u^h} sup_{p^h, q^h} Σ_{i=1}^k ⟨u_i^h, q_i^h − Div_x^h p_i^h⟩ + Σ_{i=1}^k Σ_{f ∈ F} ι{u_i^h(f, ·) ∈ Δ_{ℓ_i}}   (13)

subject to

‖(D_i p_i^h)(f, α, t)‖_* ≤ λ_i   ∀i, f ∈ F, α ∈ [0, 1]^2, t ∈ T_i,   (14)
Σ_{i=1}^k W_{i,f,γ_i} q_i^h ≤ c(x(f), γ)   ∀f ∈ F, γ = (γ_1, . . . , γ_k) ∈ Γ,   (15)

where ι{·} is the indicator function and D_i denotes the discrete linear operator implementing P_{T_{γ_i}} ∇_{γ_i}. In our applications, we found it sufficient to enforce the constraint (14) at the corners of each face, which corresponds to choosing α ∈ {0, 1}^2. Apart from the infinitely many constraints in (15), this is a finite-dimensional convex-concave saddle-point problem and can be tackled by many numerical optimization algorithms.

Solution Recovery
Before presenting in the next section how we propose to implement the constraints (15), we briefly discuss how a primal solution {u_i^h} of the above problem is turned into an approximate solution to (1). To that end, we follow Vogt et al. (2020) and compute the Riemannian center of mass via the iteration τ = 1, . . . , T:

u_i^{τ+1} = exp_{u_i^τ} ( Σ_{γ_i ∈ L_i} u_i^h(f, γ_i) log_{u_i^τ}(γ_i) ).   (18)

Here, u_i^0 ∈ Γ_i is initialized by the label with the highest probability according to u_i^h(f, ·). log_{u_i^τ} and exp_{u_i^τ} denote the logarithmic and exponential mappings between Γ_i and its tangent space at u_i^τ ∈ Γ_i, which are both available in closed form for the manifolds we consider here. In our case, T = 20 was enough to reach convergence. For flat manifolds, T = 1 suffices, as both mappings boil down to the identity and (18) computes a weighted Euclidean mean.
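For Γ_i = S^1, both mappings have simple closed forms, and the fixed-point iteration can be sketched as follows (a minimal sketch with angles in radians; names and the uniform-weight example are illustrative):

```python
import math

def exp_map(base, v):
    """Exponential map on the circle S^1: walk distance v from base."""
    return (base + v) % (2 * math.pi)

def log_map(base, angle):
    """Logarithmic map on S^1: signed geodesic displacement from base."""
    return (angle - base + math.pi) % (2 * math.pi) - math.pi

def circular_mean(angles, weights, init, T=20):
    """Riemannian center of mass via a fixed-point iteration analogous to
    the recovery step (18); init should be the highest-weight angle."""
    u = init
    for _ in range(T):
        v = sum(w * log_map(u, a) for a, w in zip(angles, weights))
        u = exp_map(u, v)
    return u

# weighted mean of two angles straddling the wrap-around point 0 == 2*pi:
angles = [0.1, 2 * math.pi - 0.1]
m = circular_mean(angles, [0.5, 0.5], init=angles[0])
# the mean lies at the seam (~0.0), not at the Euclidean average ~pi
```

This is exactly the behavior needed for cyclic quantities such as hue: the weighted mean respects the identification of 0 and 2π.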
In general, there is no theory showing that u^T(x) = (u_1^T(x), . . . , u_k^T(x)) from (18) is a global minimizer of (1). Tightness of the relaxation in the special case k = 1 and Γ ⊂ R is shown in Pock et al. (2010). For higher-dimensional Γ, the tightness of related relaxations is ongoing research; see Ghoussoub, Kim, Lavenant, and Palmer (2021) for results on the Dirichlet energy. By computing a-posteriori optimality gaps, solutions of (8) were shown to be typically near the global optimum of Problem (1); see, e.g., Goldluecke et al. (2013).

Implementation of the Constraints
Though the optimization variables in (13) are finite-dimensional, the energy is still difficult to optimize because of the infinitely many constraints in (15).
Before we present our approach, let us first describe what we refer to as the baseline method for the remainder of this paper. As the baseline approach, we consider the direct solution of (13), where we implement the constraints only at the label/discretization points L_1 × · · · × L_k via Lagrange multipliers [this strategy is also employed by the global variant of the product-space approach (Goldluecke et al. 2013)]. This baseline corresponds to a single outer iteration (N_it = 1) of the proposed Algorithm 1 with a large number M_it of inner iterations.
We aim for a framework that allows solving a better approximation of (15) than the baseline while being of similar memory complexity. To this end, Algorithm 1 alternates the following two steps: (1) Sampling Based on the current solution, we prune previously considered but now-feasible constraints and sample a new subset of the infinitely many constraints in (15). From all currently sampled constraints, we keep the most violated ones for each face, add one sample at the current solution, and discard the rest.
(2) Solving the subsampled problem Considering the current finite subset of constraints, we solve Problem (13) using a primal-dual method with diagonal preconditioning (Pock and Chambolle 2011). Both constraints (14) and (15) are implemented using Lagrange multipliers.
These two phases are performed alternately, with the aim of eventually approaching the solution of the continuous problem (13). In practice, a fixed number of outer iterations N_it is set. While we do not prove convergence of the overall algorithm, convergence results for related procedures exist; see, e.g., Blankenship and Falk (1976), Theorem 2.4.
The detailed procedure is given in Algorithm 1. The cost matrix C is constructed by evaluating c(x(f), γ) at the proposed samples S_f. We denote by ξ and ν the Lagrange multipliers. The multiplier ξ is initialized by a warm-start strategy, i.e., it is kept if the same sample was already proposed in the previous outer iteration. The prox of a function g with step size τ is defined as prox_{τg}(x) = argmin_y (1/2)‖y − x‖^2 + τ g(y).

[Algorithm 1 (Proposed algorithm for problem (1)): in each outer iteration, construct the interpolation matrix W_i^it and cost matrix C^it based on the samples S_f^it from Algorithm 2; initialize ξ_i^it with the warm-start strategy; construct the diagonal preconditioners by Pock and Chambolle (2011); run M_it preconditioned primal-dual steps; finally obtain the new samples S_f^{it+1} for each face by Algorithm 2.]

Our constraint sampling strategy is detailed in Algorithm 2. For each face in F, it generates a finite set of "sublabels" S_f ⊂ Γ at which we implement the constraints (15). Next, we provide the motivation behind each line of the algorithm.

[Algorithm 2 (Sampling strategy at face f ∈ F): draw uniform samples, add local perturbations of the current solution, prune feasible samples and keep the r most violated, and always add one sample at the current solution: S_f ← S_f ∪ {u}; return S_f.]

Random uniform sampling (Line 1) To have a global view of the cost function, we consider a uniform sampling on the label space Γ. The parameter n > 0 determines the number of samples for each face.

Local perturbation around the mean (Line 2) Besides the global information, we apply a local perturbation around the current solution u. In case the current solution is close to the optimal one, this strategy allows us to refine it with these samples. The parameter δ > 0 determines the size of the local neighbourhood. In our experiments, we always used a Gaussian perturbation with δ = 0.1.

Pruning strategy (Lines 3-4) Most samples from previous iterations are discarded because the corresponding constraints are already satisfied. We prune all currently feasible constraints as in Blankenship and Falk (1976). Similarly, the two random sampling strategies (Lines 1 and 2) might return some samples for which the constraints are already fulfilled. Therefore, we only consider the samples with violated constraints and pick the r most violated among them. This pruning strategy is essential for a memory-efficient implementation, as shown later.

Sampling at u (Line 5) Finally, we add one sample exactly at the current solution u ∈ Γ to have at least one guaranteed sample per face.

In the next section, we illustrate the behavior of Algorithm 1 on a toy problem and evaluate its performance on real-world imaging problems.
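The per-face sampling step can be sketched compactly. The sketch below assumes a one-dimensional label space and a callable `violation(g)` returning the constraint violation q(g) − c(g) at sublabel g (positive means violated); all names are hypothetical:

```python
import random

def sample_constraints(u, bounds, violation, n=10, r=1, delta=0.1, old=()):
    """Sketch of the sampling strategy (Algorithm 2) for one face."""
    lo, hi = bounds
    # Line 1: uniform samples over the label space for a global view
    cands = [random.uniform(lo, hi) for _ in range(n)]
    # Line 2: Gaussian perturbation around the current solution u
    cands += [min(hi, max(lo, random.gauss(u, delta))) for _ in range(n)]
    # Line 3: keep previously sampled constraints only if still violated
    cands += [g for g in old if violation(g) > 0]
    # Line 4: among all violated candidates, keep the r most violated
    violated = sorted((g for g in cands if violation(g) > 0),
                      key=violation, reverse=True)[:r]
    # Line 5: always keep one guaranteed sample at the current solution u
    return violated + [u]
```

The pruning in Lines 3-4 is what keeps the active constraint set, and hence the memory footprint, roughly constant across outer iterations.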

Numerical Validation
Our approach and the baseline are implemented in PyTorch. Code for reproducing the following experiments can be found at: https://github.com/zhenzhangye/sublabel_meets_product_space. Note that one of the runtime bottlenecks of our sampling strategy is creating the samples and picking the r most violated, as shown in Algorithm 2. Additionally, the sparse matrix operations and the PDHG updates can be implemented more efficiently in CUDA, as all PDHG updates can be executed in a single CUDA kernel, compared to PyTorch's multiple kernel calls. Therefore, a specialized implementation as in Goldluecke et al. (2013) would allow the method to gain a factor of 10-100× in runtime.

Illustration of the Algorithm
First of all, we consider a simplistic minimization problem with a single nonconvex data term and 5 labels to illustrate the behavior of both the baseline algorithm and our sampling strategies. Figure 2 depicts the baseline's behavior. Since it only evaluates the energy on the labels, five samples are considered, as illustrated by the red dots in Fig. 2a. Figure 2b shows the dual variable q_i^h after it iterations. Since the algorithm is maximizing q_i^h, the green and red dots should overlay (e.g., at the second label) upon convergence. Despite the convergence, the resulting q^h violates the constraints significantly close to the optimal solution ("green > blue"). Therefore, to attain the globally optimal solution, the baseline approach needs more labels, which requires more memory.

[Fig. 2 caption: The dual variable q satisfies the constraints on the samples. However, q is not globally optimal, as it violates the constraint at the optimal solution ("green > blue").]

The motivation of random uniform sampling (Line 1, Algorithm 2) and local perturbation around the mean (Line 2, Algorithm 2) in our sampling strategy is intuitively clear. However, as demonstrated later in the experiment with a truncated quadratic energy, sampling at u (Line 5, Algorithm 2) is critical for the stability of our method. The comparison in Fig. 3 shows the necessity of this strategy. We ran two experiments with identical settings except for the sampling at u. After a given number of iterations, the dual variable q^h is approximately optimal, as indicated in Fig. 3a. Our pruning strategy (Line 3, Algorithm 2) then removes all of the proposed samples, since all of them satisfy the constraints. As a result, the subproblem becomes unconstrained in the dual variable q^h and its update has no significance, Fig. 3b.
To solve this problem, we propose to always keep at least one sample at u, even when q^h is nearly optimal, cf. Fig. 3c. As illustrated in Fig. 3d, this avoids the degeneration of q^h, since it remains constrained.
Finally, the complete sampling strategy is illustrated in Fig. 4. As shown in Fig. 4a, the primal-dual method obtains the optimal q^h for the sampled subproblem. Our sampling strategy provides the necessary samples and prunes the feasible ones, cf. Fig. 4b. These few but meaningful samples lead q^h to global optimality, cf. Fig. 4c.

Truncated Quadratic Energy
In this section, we study the numerical effect of each line in Algorithm 2. We evaluate our method on the truncated quadratic energy c(x, u(x)) = min(‖u(x) − f(x)‖^2, ν), where f : Ω → R is the input data shown in Fig. 5. For this specific experiment, we generate a 64 × 64 gray image degraded with Gaussian noise of standard deviation σ = 0.05 and 5% salt-and-pepper noise. The parameters are chosen as ν = 0.025, λ = 0.25, N_it = 10, M_it = 200, n = 10 and r = 1. To reduce the effect of randomness, we run each algorithm 20 times and report mean and standard deviation of the final energy for different numbers of labels in Table 1. We want to emphasize that more labels benefit both the baseline and our algorithm. Nevertheless, the proposed approach reaches lower energies with the same number of labels and similar memory requirements.
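A minimal sketch of such a data term, assuming the truncation acts as a pointwise minimum with threshold ν (the exact form used in the experiments may differ in details):

```python
def truncated_quadratic(gamma, f, nu=0.025):
    """Truncated quadratic data cost min((gamma - f)^2, nu): quadratic near
    the measurement f, saturating at nu so outliers pay a bounded penalty."""
    return min((gamma - f) ** 2, nu)

truncated_quadratic(0.0, 1.0)    # -> 0.025: a salt-and-pepper outlier saturates
truncated_quadratic(0.5, 0.51)   # quadratic regime, cost ~1e-4
```

The bounded penalty is what makes the energy robust to the salt-and-pepper noise in this experiment, and also what makes it nonconvex.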
As can be seen in this table, adding uniform sampling and picking the most violated constraint per face (Lines 1 and 4 of Algorithm 2) already decreases the final energy significantly. We also consider local exploration around the mean (Line 2), which helps to find better energies at the expense of higher memory requirements. The pruning strategy (Line 3) circumvents this memory issue; however, the energy then deteriorates dramatically, because some faces could end up having no samples after pruning. Keeping the current solution as a sample per face (Line 5) prevents the energy from degrading. Including all these sampling strategies, the proposed method achieves the best energy and runtime at comparable memory usage to the baseline method. We further compare the baseline and our proposed method in terms of iterations and time in Fig. 6. Because samples are replaced, the energy exhibits a peak right after each sampling phase but converges again immediately, leading to an overall decreasing trend.
Additionally, we compare our method to the baseline on the more practical CBSD dataset from Martin et al. (2001). This dataset contains 68 images, which we degrade with additive white Gaussian noise (5% in this experiment). The number of labels is 7 for both methods. 5K iterations are performed for the baseline method, while we set N_it = 10 and M_it = 500 to get a fair comparison. The other parameters are chosen as λ = 0.25, n = 50 and r = 1. The results are shown in Fig. 7. Our method outperforms the baseline on all images regarding both energy and peak signal-to-noise ratio (PSNR).

[Table 1 caption: Numbers in parentheses indicate the standard deviation across 20 runs.]

[Fig. 6 caption: Comparison between the baseline and our approach on a 64 × 64 grayscale image shown in Fig. 5, degraded with Gaussian and salt-and-pepper noise. Our approach finds lower energies in fewer iterations and less time than the baseline, which implements the constraints only at the label points.]

Manifold-Valued Denoising
To show the flexibility of our algorithm, we next evaluate it on a manifold-valued denoising problem in HSV color space. The hue component of this space is a circle, i.e., Γ_1 = S^1, while the saturation and value components are intervals, Γ_2 = Γ_3 = [0, 1]. The data term of this experiment is again a truncated quadratic energy, where for the hue component the distance is taken on the circle S^1. The input images (Baker et al. 2011; Martin et al. 2001) are degraded with the same settings as above.
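For the hue component, the distance must wrap around the circle. A minimal sketch with hue encoded as an angle in radians (illustrative, not the paper's implementation; the truncation threshold ν is a hypothetical default):

```python
import math

def hue_distance(a, b):
    """Geodesic distance on the hue circle S^1: the jump from 2*pi to 0 is free."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def truncated_hue_cost(gamma, f, nu=0.025):
    """Truncated quadratic data term with the distance taken on S^1."""
    return min(hue_distance(gamma, f) ** 2, nu)

hue_distance(0.05, 2 * math.pi - 0.05)   # ~0.1, not the Euclidean ~6.18
```

Two nearly identical red hues on either side of the seam thus incur almost no cost, which is the behavior visible in the hue plots of Fig. 9.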
Both the baseline and our method use 7 labels. First, we evaluate the impact of the number r of most violated samples. As shown in Fig. 8, the maximum difference in energy and memory is only 0.8% and 0.06%, respectively, which can be considered almost constant wrt. r. Therefore, we pick r = 5. To get an equal number of total iterations, 30K iterations are performed for the baseline, while we set N_it = 100 outer iterations with M_it = 300 inner primal-dual steps for our method. Other parameters are chosen as λ = 0.015 and n = 30. As shown in Fig. 9, our method achieves a lower energy than the baseline. Qualitatively, since our method implements the constraints not only at the labels but also inbetween, there is less label bias.

[Fig. 7 caption: Quantitative results on the CBSD dataset (Martin et al. 2001). We ordered both results from best to worst. The difference is calculated as (ours − baseline)/baseline. Our method consistently outperforms the baseline across all images.]

[Fig. 8 caption: Comparison between different numbers of most violated constraints in terms of energy and memory: (a) the change in energy with r = 1 as the basis; (b) the change in memory with r = 1 as the basis. Both the energy and memory vary very little (0.8% and 0.06%, respectively) across different r.]

Optical Flow
Given two input images I 1 , I 2 , we compute the optical flow u : Ω → R 2 for the label space Γ = [a, b] 2 . We set N it = 50 and M it = 1000 for a fair comparison. Additionally, the parameters are chosen as n = 20 and r = 1 in Algorithm 2. We consider three methods for this experiment: the baseline, the method from Laude et al. (2016) and ours. The method from Laude et al. (2016) approximates the data term in a piecewise convex manner and requires a specific epigraph projection for the data term. Though it can attain lower energies, our approach requires less memory and can tackle any cost function in a simple black-box fashion. Table 2 summarizes the quantitative results obtained on the Middlebury dataset (Baker et al. 2011), while the detailed absolute numbers can be found in the appendix.

Fig. 9 Denoising of images (Baker et al. 2011; Martin et al. 2001) in HSV color space (Γ1 = S1, Γ2 = Γ3 = [0, 1]) using our method and the baseline with 7 labels. The hue component plot demonstrates that our approach is able to handle the manifold setting where the jump from 2π to 0 is permitted. Since our approach implements the constraints adaptively in between the labels (gray lines), it reaches a lower energy with less label bias.

Table 2 notes: Given an equal number of labels/memory, our sampling strategy performs favorably to an implementation of the constraints at the labels. The relative numbers for energy, memory, average endpoint error (aep) and average angular error (aae) are calculated as "mean(Ours/Baseline)" across all 8 datasets. The numbers in parentheses represent the standard deviation. The detailed table with all results can be found in the "Appendix".

Fig. 10 Visualization of the optical flow on Grove3 and RubberWhale from the Middlebury dataset (Baker et al. 2011), using the baseline, the method from Laude et al. (2016) and our method for a varying number of labels. OOM stands for out of memory. More qualitative results can be found in the appendix.
This table shows how our approach performs relative to the baseline, i.e., "mean(Ours/Baseline)": for three labels our energy is 50.91% of the baseline energy across all 8 datasets, while using 99.84% of the baseline's memory and reaching an average end point error (aep) and average angular error (aae) of 91.94% and 82.34% of the baseline error metrics, respectively. To enable a qualitative comparison, we visualize in Fig. 10 the results on two of the datasets. The remaining qualitative results on the Middlebury dataset (Baker et al. 2011) are shown in the appendix. Our method outperforms the baseline approach regarding energy for the same number of labels while requiring the same amount of memory. Because Laude et al. (2016) use a tighter relaxation of the label space, they can achieve lower energies with a smaller number of labels. However, their method runs out of memory easily, while the proposed method scales better w.r.t. memory consumption.
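The relative statistics of the form "mean(Ours/Baseline)" with standard deviation can be computed with a small helper; the function name and arguments below are our own illustration, assuming one value per dataset:

```python
import numpy as np

def relative_stats(ours, baseline):
    """Mean and sample standard deviation of the per-dataset ratio
    ours/baseline, expressed in percent."""
    ratio = 100.0 * np.asarray(ours, dtype=float) / np.asarray(baseline, dtype=float)
    return ratio.mean(), ratio.std(ddof=1)
```

For instance, energies of half the baseline value on every dataset give a mean of 50% with zero standard deviation.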

Conclusion and Limitations
In this paper we made functional lifting methods more scalable by combining two advances, namely product-space relaxations (Goldluecke et al. 2013) and sublabel-accurate discretizations (Möllenhoff and Cremers 2017; Vogt et al. 2020). This combination is enabled by adapting a cutting-plane method from semi-infinite programming (Blankenship and Falk 1976). This allows an implementation of sublabel-accurate methods without difficult epigraphical projections.
Moreover, our approach makes sublabel-accurate functional-lifting methods applicable to any cost function in a simple black-box fashion. In experiments, we demonstrated the effectiveness of the approach over a baseline based on the product-space relaxation (Goldluecke et al. 2013) and provided a proof-of-concept experiment showcasing the method in the manifold-valued setting.
Future work will concentrate on applying and adapting the presented framework to solve large inverse problems in computer vision with multiple data terms, different regularizers and several manifold-valued optimization variables in a joint fashion. However, it is not obvious if our presented cutting plane approach is easily applicable for such large problems or if novel ideas have to be pursued.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix

Table 3 shows the energy and memory requirements as well as the average endpoint error (aep) and average angular error (aae) for the baseline, the method from Laude et al. (2016) and our approach across different numbers of labels for all 8 used Middlebury datasets (Baker et al. 2011). Figures 11 and 12 visualize the remaining optical flow results on the Middlebury dataset (Baker et al. 2011) using the baseline, the method from Laude et al. (2016) and ours for a varying number of labels. Note that the approach from Laude et al. (2016) leverages a tighter discretization of the label space and is implemented in CUDA. Though their approach achieves better energies with fewer labels, ours has better scalability. We denote the out-of-memory error as OOM. Given an equal number of labels/memory, our sampling strategy performs favorably compared to an implementation of the constraints at the labels. Additionally, our method scales better than the one from Laude et al. (2016).