1 Introduction

The graphical framework has become a popular setting for classification [8, 25, 92, 100,101,102] and filtering [31, 34, 63, 85, 88, 89] of high-dimensional data. Some of the best performing classification algorithms are based on solving variational problems on graphs [3, 10, 14, 16, 17, 25, 39, 44, 45, 66, 68, 82, 86, 101]. In simple terms, these algorithms attempt to group the data points into classes in such a way that pairs of data points with different class memberships are as dissimilar as possible with respect to a certain feature. In order to avoid the computational complexity of working with fully connected graphs, approximations, such as those based on spectral graph theory [10, 38, 66] or nearest neighbors [17, 33, 68], are typically employed. For example, [10] and [66] employ spectral approaches along with the Nyström extension method [36] to efficiently calculate the eigendecomposition of a dense graph Laplacian. Works such as [17, 24, 33, 39, 68, 99] use the ‘nearest neighbor’ approach to sparsify the graph for computational efficiency. Variational problems on graphs have also become popular for the processing of 3D point clouds [30, 33, 45, 55, 60, 62].

When the classification task is cast as the minimization of similarity of point pairs with different class membership, extra information is necessary to avoid the trivial global minimizer of value zero where all points are assigned to the same class. In semi-supervised classification methods, a small set of the data points are given as training data in advance, and their class memberships are imposed as hard constraints in the optimization problem. In unsupervised classification methods, one typically enforces the sizes of each class not to deviate too far from each other, examples including the normalized cut [78] and Cheeger ratio cut problems [26].

Most of the computational methods for semi-supervised and unsupervised classification obtain the solution by computing the local minimizer of a non-convex energy functional. Examples of such algorithms are those based on phase fields [10] and the MBO scheme [38, 65,66,67]. PDEs on graphs for semi-supervised classification also include the Eikonal Equation [30] and tug-of-war games related to the infinity Laplacian Equation [34]. Unsupervised problems with class size constraints are inherently the most difficult to handle from an optimization viewpoint, as the convex envelope of the problem has a trivial constant function as a minimizer [16, 78, 82]. Various ways of simplifying the energy landscape have been proposed [17, 43, 89]. Our recent work [68] showed that semi-supervised classification problems with two classes could be formulated in a completely convex framework and also presented efficient algorithms that could obtain global minimizers.

Image segmentation is a special classification problem where the objective is to assign each pixel to a region. Algorithms based on energy minimization are among the most successful image segmentation methods, and they have historically been divided into ‘region-based’ and ‘contour-based.’

Region-based methods attempt to find a partition of the image so that the pixels within each region as a whole are as similar as possible. Additionally, some regularity is imposed on the region boundaries to favor spatial grouping of the pixels. The similarity is usually measured in the statistical sense. In the simplest case, the pixels within each region should be similar to the mean intensity of each region, as proposed in the Chan–Vese [23] and Mumford–Shah [71] models. Contour-based methods [50, 93] instead seek the best suited locations of the region boundaries, typically at locations of large jumps in the image intensities, indicating the interface between two objects.

More recently, it has been shown that the combination of region and contour-based terms in the energy function can give qualitatively very good results [15, 39, 48], especially when non-local operators are used in the contour terms [30, 39, 48]. There now exist efficient algorithms for solving the resulting optimization problems that can avoid getting stuck in a local minimum, including both combinatorial optimization algorithms [12, 13, 53] and more recent convex continuous optimization algorithms [6, 7, 15, 20, 58, 75, 96,97,98]. The latter have been shown to be advantageous in several aspects, such as the fact that they require less memory and have a greater potential for parallel implementation on graphics processing units (GPUs), but special care is needed in case of non-local variants of the differential operators (e.g., [76]).

Most of the current data segmentation methods [10, 14, 16, 17, 44, 66, 68, 82] can be viewed as ‘contour-based,’ since they seek an optimal location of the boundaries of each region. Region-based variational image segmentation models with two classes were generalized to graphs for data segmentation in [59] and for 3D point cloud segmentation in [59, 60, 84] in a convex framework. The region terms could be constructed directly from the point geometry and/or be constructed from a color vector defined at the points. Concrete examples of the latter were used for experiments on point cloud segmentation. Region terms have also been proposed in the context of Markov Random Fields for 3D point cloud segmentation [2, 72, 87], where the weights were learned from training data using associative Markov networks. The independent preprint [94] proposed to use region terms for multiclass semi-supervised classification in a convex manner, where the region terms were inferred from the supervised points by diffusion.

Contributions This paper proposes a convex relaxation and an efficient algorithmic optimization framework for a general set of graph-based data classification problems that exhibit non-trivial global minimizers. It extends the convex approach for semi-supervised classification with two classes given in our previous work [68] to a much broader range of problems, including multiple classes, novel and more practically useful incorporation of class size information, and a novel unsupervised segmentation model for 3D point clouds acquired by a LaDAR.

The same basic relaxation for semi-supervised classification also appeared in the independent preprint [94]. The main distinctions of this work compared to the preprint [94] are: we also incorporate class size information in the convex framework; we give a mathematical and experimental analysis of the close relation between the convex relaxed and original problems; we propose a different duality-based ‘max-flow’-inspired algorithm; we incorporate information of the supervised points in a different way; and we consider unsupervised segmentation of 3D point clouds.

The contributions can be summarized more specifically as follows:

  • We specify a general set of classification problems that are suitable for being approximated in a convex manner. The general set of problems involves minimization of a multiclass graph cut term together with supervised constraints, region homogeneity terms and novel constraints or penalty terms acting on the class sizes. Special cases include semi-supervised classification of high-dimensional data and unsupervised segmentation of 3D point clouds.

  • A convex relaxation is proposed for the general set of problems, and its approximation properties are analyzed thoroughly in theory and experiments. This extends the work on multiregion image segmentation [6, 58, 98] to data clustering on graphs and to cases where there are constraints or penalty terms acting on the class sizes. Since the introduction of either multiple classes or size constraints causes the general problem to become NP-hard, the relaxation can (probably) not be proved to be exact. Instead, conditions are derived for when an exact global minimizer can be obtained from a dual solution of the relaxed problem. The strongest conditions are derived in case there are no constraints on the class sizes, but the theoretical results in both cases show that very close approximations are expected. These theoretical results also agree well with experimental observations.

  • The convex relaxed problems are formulated as equivalent dual problems that are structurally similar to the ‘max-flow’ problem over the graph. This extends our work [68] to multiple classes and the work on image segmentation proposed in [95] to data clustering on graphs. We use a conceptually different proof than that of [68, 95], which relates ‘max-flow’ to another more direct dual formulation of the problem. Furthermore, it is shown that the size constraints and penalty term can also be incorporated naturally in the max-flow problem by modifying the flow conservation condition, such that there should be a constant flow excess at each node.

  • As in our previous work [68, 95], an augmented Lagrangian algorithm is developed based on the new ‘max-flow’ dual formulations of the problems. A key advantage compared to related primal–dual algorithms [21] in imaging, such as the one considered in the preprint [94], is that all constraints on the labeling function are handled implicitly. This includes constraints on the class sizes, which are dealt with by separate dual variables indicating the flow excess at the nodes. Consequently, projections onto the constraint set of the labeling function, which tend to decrease the accuracy and put strict restrictions on the step sizes, are avoided.

  • We propose an unsupervised segmentation model for unstructured 3D point clouds acquired by a LaDAR within the general framework. It extends the models of [59, 60, 84] to multiple classes and gives concrete examples of region terms constructed purely based on geometrical information of the unlabeled points, in order to distinguish classes such as vegetation, the ground plane and human-made structures in an outdoor scene. We also propose a graph total variation term that favors alignment of the region boundaries along ‘edges’ indicated by discontinuities in the normal vectors or the depth coordinate. In contrast to [2, 41, 72, 87], our model does not rely on any training data.

  • Extensive experimental evaluations on semi-supervised classification indicate consistently higher accuracies than related local minimization approaches, and considerably so when the training data are not uniformly distributed among the data set. The accuracies are also highly competitive against a wide range of other established methods on three benchmark data sets. The accuracies can be improved further if an estimate of the approximate class sizes is given in advance. Experiments on 3D point clouds acquired by a LaDAR in outdoor scenes demonstrate that the scenes can accurately be segmented into object classes such as vegetation, the ground plane and regular structures. The experiments also demonstrate fast and highly accurate convergence of the algorithms, and show that the approximation difference between the convex and original problems vanishes or becomes extremely low in practice.

Organization This paper starts by formulating the general set of problems mathematically in Sect. 2. Section 3 formulates a convex relaxation of the general problem and analyzes the quality of the relaxation from a dual perspective. Section 4 reformulates the dual problem as a ‘max-flow’ type of problem and derives an efficient algorithm. Applications to semi-supervised classification of high-dimensional data are presented in Sect. 5.1, and applications to segmentation of unstructured 3D point clouds are described in Sect. 5.2, including specific constructions of each term in the general model. Section 5 also presents a detailed experimental evaluation for both applications (Fig. 1).

2 Data Segmentation as Energy Minimization Over a Graph

Assume we are given N data points in \(\mathbb {R}^M\). In order to formulate the segmentation of the data points as a minimization problem, the points are first organized in an undirected graph. Each data point is represented by a node in the graph. The edges in the graph, denoted by E, consist of pairs of data points. Weights w(x, y) on the edges \((x,y) \in E\) measure the similarity between data points x and y. A high value of w(x, y) indicates that x and y are similar and a low value indicates that they are dissimilar. A popular choice for the weight function is the Gaussian

$$\begin{aligned} w(x,y)= e^{-\frac{\mathrm{d}(x,y)^{2}}{\sigma ^{2}}}, \end{aligned}$$
(1)

where d(x, y) is the distance, in some sense, between x and y. For example, the distance between two 3D points x and y is naturally their Euclidean distance. In order to reduce the computational burden of working with fully connected graphs, one often only considers the set of edges where w(x, y) is largest. For instance, edges may be constructed between each vertex in V and its k-nearest neighbors. More formally, for each \(x \in V\), one constructs an edge \((x,y) \in E\) for the k points with the shortest distance d(x, y) to x. Such a graph can be constructed efficiently by using kd trees in \(O(N k\, \text {log}(N k))\) time [9, 46]. Note that the number of edges incident to some nodes in the resulting graph may be larger than k, as illustrated in Fig. 2 where \(k=2\), due to symmetry of the undirected graph. The construction of the graph itself provides a basic segmentation of the nodes; for instance in Fig. 2, it can be observed that the graph contains three different connected components. This fact has been exploited in basic graph-based classification methods [1].
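To make the construction concrete, the following is a minimal sketch, not part of the original exposition, of how such a k-nearest-neighbor graph with Gaussian weights (1) could be assembled with a kd-tree; the use of SciPy and the function name are our own illustrative choices.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix

def knn_gaussian_graph(points, k=10, sigma=1.0):
    """Sparse symmetric weight matrix W for the k-NN graph of `points`
    (N x M array), with Gaussian weights w(x,y) = exp(-d(x,y)^2 / sigma^2)."""
    n = points.shape[0]
    tree = cKDTree(points)
    # query k+1 neighbors because the nearest neighbor of a point is itself
    dist, idx = tree.query(points, k=k + 1)
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()            # drop the self-neighbor in column 0
    vals = np.exp(-(dist[:, 1:].ravel() ** 2) / sigma ** 2)
    W = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
    # symmetrize: keep an edge (x,y) if it appears in either direction
    return W.maximum(W.T)
```

As in Fig. 2, the final symmetrization step may leave some nodes with more than k incident edges.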

In several recent works, the classification problem has been formulated as finding an optimal partition \(\{V_i\}_{i=1}^n\) of the nodes V in the graph G. The most basic objective function can be formulated as

$$\begin{aligned} \min _{\{V_i\}_{i=1}^n}&\sum _{i=1}^n \sum _{\begin{array}{c} (x,y) \in E \; : \\ x \in V_i, \, y \notin V_i \end{array}} w(x,y), \end{aligned}$$
(2)
$$\begin{aligned} \text {s.t.} \,&\cup _{i=1}^n V_i \, = \, V \, , \quad V_k \cap V_l \, = \, \emptyset \, , \, \forall k \ne l \,, \end{aligned}$$
(3)

where the constraint (3) imposes that there should be no vacuum or overlap between the subsets \(\{V_i\}_{i=1}^n\). If \(n =2\), then (2) is the so-called graph cut [69]. The motivation behind the model (2) is to group the vertices into classes in such a way that pairs of vertices with different class memberships are as dissimilar as possible, indicated by a low value of w.

2.1 Size Constraints and Supervised Constraints

Extra assumptions are necessary to avoid the trivial global minimizer of (2), where \(V_i = V\) for some i and \(V_j = \emptyset \) for all \(j \ne i\). There are two common ways to incorporate extra assumptions. In semi-supervised classification problems, the class membership of a small set of the data points is given as training data in advance by the constraints

$$\begin{aligned} V_i \supseteq T_i, \quad i \in I = \{1,...,n\}, \end{aligned}$$
(4)

where \(T_i\) is the set of ‘training’ points known to belong to class i. For notational convenience, the set of all class indices \(\{1,...,n\}\) is denoted by I in the rest of this paper.

In unsupervised classification problems, one usually assumes that the regions should have approximately equal sizes. The simplest way to achieve this is to impose that each class \(V_i\) should have a given size \(a_i \in \mathbb {N}\):

$$\begin{aligned} ||V_i|| = a_i, \; \quad i \in I, \end{aligned}$$
(5)

where \(\sum _{i=1}^n a_i = ||V||\). We focus on the case that the norm \(||V_i||\) is the number of nodes in \(V_i\) for simplicity. As an alternative, \(||V_i||\) could be the sum of degrees of each node in \(V_i\), where the degree of a node is the number of edges incident to that node. If size constraints are introduced, the problem cannot generally be solved exactly due to NP-hardness. This will be discussed in more detail in Sect. 3.

Usually, a more flexible option is preferred of modifying the energy function such that partitions of equal sizes have lower energy. In case of two classes, the energy (2) becomes \(\text {cut}(V_1,V_2)\) \(= \sum _{x,y} w(x,y)\), where \(x \in V_1\) and \(y \in V_2\). Several different ways of normalizing the energy by the class sizes have been proposed, which can be summarized as follows

$$\begin{aligned} \text {cut}(V_1,V_2) \left( \frac{1}{|V_1|} + \frac{1}{|V_2|}\right) , \quad \frac{\text {cut}(V_1,V_2)}{\text {min}(|V_1|,|V_2|)}. \end{aligned}$$
(6)

The expression on the left is called the ratio cut in case of the norm \(|V| = \sum _{x \in V} 1\) and the normalized cut in case of \(|V|= \sum _{x \in V} \text {degree}(x)\). The expression on the right is called the Cheeger ratio cut with the norm \(|V| = \sum _{x \in V} 1\).

The energy functions (6) are highly non-convex, but ways to simplify the energy landscape have been proposed [16, 17, 44, 82] in order to reduce the number of local minima.

2.2 New Flexible Constraint and Penalty Term on Class Sizes

In this paper, we aim to provide a broader set of constraints and penalty terms acting on the class sizes that can be handled in a completely convex manner. They are designed to achieve the same net result as the ratio energies (6) of promoting classes of equal sizes, but in a completely convex way. They can also promote any other size relations between the class sizes. We will consider flexible size constraints of the form

$$\begin{aligned} S_i^\ell \le ||V_i|| \le S_i^u, \; \quad i \in I, \end{aligned}$$
(7)

where \(S_i^u \in \mathbb {N}\) is an upper bound on the size of class i and \(S_i^\ell \in \mathbb {N}\) is a lower bound. Such types of constraints have previously been proposed for image segmentation in [52]. In case one only knows an estimate of the expected class sizes, such constraints can be used to enforce the sizes to lie within some interval of those estimates. To be well defined, it is obviously required that \(\sum _{i=1}^n S_i^\ell \le ||V||\) and \(\sum _{i=1}^n S_i^u \ge ||V||\). Note that if \(S_i^\ell = S_i^u = a_i\), then (7) becomes equivalent to (5).

To avoid imposing absolute upper and lower bounds on the class sizes, we also propose appending a piecewise linear penalty term \(\sum _{i=1}^n P_\gamma (||V_i||)\) to the energy function (2), defined as

$$\begin{aligned} P_\gamma (||V_i||) \, = \, \left\{ \begin{array}{ll} \gamma \left( S_i^\ell - ||V_i||\right) , \, &{} ||V_i|| < S_i^\ell \\ 0 , \, &{} S_i^\ell \le ||V_i|| \le S_i^u \\ \gamma \left( ||V_i|| - S_i^u\right) , \, &{} ||V_i|| > S_i^u \end{array} \right. \end{aligned}$$
(8)

An illustration of \(P_\gamma (||V_i||)\) is given in Fig. 1. In the limit as \(\gamma \rightarrow \infty \), the penalty term becomes an equivalent representation of the hard constraints (7). Note that quadratic or higher-order penalty terms, although they are convex, are not well suited for the convex relaxation, because they tend to encourage non-binary values of the labeling functions. In fact, we believe the set of constraints and penalty terms given here is complete when it comes to being suited for completely convex relaxations.
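For concreteness, the penalty can be evaluated per class as in the following small sketch (ours): it is zero inside the size interval and grows linearly with slope \(\gamma \) outside it.

```python
def size_penalty(size, S_lower, S_upper, gamma):
    """Piecewise linear penalty P_gamma(|V_i|) of (8): zero on [S_lower, S_upper],
    slope gamma outside; gamma -> infinity recovers the hard constraints (7)."""
    return gamma * max(0.0, S_lower - size) + gamma * max(0.0, size - S_upper)
```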

One major contribution of this paper is an efficient algorithmic framework that handles size constraints of the form (7) and the penalty term (8) naturally, with almost no additional computational efforts.

Fig. 1 Illustration of the penalty term \(P_\gamma (||V_i||)\)

2.3 Region Homogeneity Terms

The classification problem (2) involves the minimization of an energy on the boundary of the classes. The energy is minimized if the data points on each side of the boundary are as dissimilar as possible. These classification models are therefore similar to edge-based image segmentation models, which align the boundary of the regions along edges in the image where the intensity changes sharply. By contrast, region-based image segmentation models, such as the ‘Chan–Vese’ model, incorporate region homogeneity terms in the energy function that measure how well each pixel fits each region.

Fig. 2 Example of segmentation of a graph of 2D points (with number of neighbors \(k=2\)) into regions of low density (yellow), high degree of correlation of coordinates between neighboring points (red), medium correlation (blue) and low correlation (green). Dashed edges indicate those that contribute to the energy (Color figure online)

A graph extension of variational segmentation problems with two classes was formulated in [59, 60, 62, 84], using a non-local total variation term together with a region term promoting homogeneity of a vertex function. The vertex function could be constructed directly from point geometry and/or from external information such as a color vector defined at each point. We extend the general problem to multiple classes and optional constraints as follows:

$$\begin{aligned} \min _{\{V_i\}_{i=1}^n} \,&\, \sum _{i=1}^n \sum _{x \in V_i} f_i(x) + \sum _{i=1}^n \sum _{\begin{array}{c} (x,y) \in E \; : \\ x \in V_i, \, y \notin V_i \end{array}} w(x,y) \,, \nonumber \\ \text {s.t.} \,&\cup _{i=1}^n V_i \, = \, V \, , \quad V_k \cap V_l \, = \, \emptyset \, , \, \forall k \ne l \, \end{aligned}$$
(9)

under optional supervised constraints (4) and/or size constraints (7)/penalty term (8). In [59, 60, 62, 84], the region terms \(f_i(x)\) were defined in terms of a general vertex function \(f^0\), which could depend on a color vector or the point geometry. Experimental results on point clouds were shown in case \(f^0\) was a color vector defined at each point. In this work, we will give concrete constructions of \(f_i\) for point cloud segmentation purely based on the geometry of the 3D points themselves. For example, the eigenvalues and eigenvectors obtained from a local PCA around each point carry useful information for describing the homogeneity within each class. Concrete examples are given in Sect. 5.2. An illustrative example is given in Fig. 2, where each node is a 2D point and the region terms have been constructed to distinguish points with different statistical relations to their neighboring points.
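As an illustration of how such geometric region terms might be built, the sketch below (ours) computes the eigenvalues of a local PCA, i.e., of the covariance matrix of the k nearest neighbors of each 3D point; the feature names suggested in the trailing comments (planarity, scatter) are our own assumptions and not the exact terms used in Sect. 5.2.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_pca_features(points, k=20):
    """For each 3D point, return the eigenvalues (sorted in descending order)
    of the covariance matrix of its k nearest neighbors. Ratios of these
    eigenvalues can be turned into region terms f_i(x) distinguishing, e.g.,
    scatter-like vegetation, planar ground and linear structures."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    feats = np.zeros((points.shape[0], 3))
    for i, nb in enumerate(idx):
        local = points[nb] - points[nb].mean(axis=0)
        cov = local.T @ local / k
        feats[i] = np.sort(np.linalg.eigvalsh(cov))[::-1]  # lambda1 >= lambda2 >= lambda3
    return feats

# Hypothetical region indicators built from the eigenvalues:
#   planarity = (lambda2 - lambda3) / lambda1   -> large on the ground plane
#   scatter   = lambda3 / lambda1               -> large in vegetation
```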

The independent preprint [94] proposed to use region terms in the energy function for semi-supervised classification, where the region terms were inferred from the supervised points by diffusion. In contrast, the region terms in this work do not rely on any supervised points, but, as mentioned, are only specified and demonstrated for the application of 3D point cloud segmentation.

3 Convex Relaxation of Minimization Problem and Analysis Based on Duality

In this section, the classification problems are formulated as optimization problems in terms of binary functions instead of sets. The binary representations are used to derive convex relaxations. First, some essential mathematical concepts are introduced, such as various differential operators on graphs. These concepts are used extensively to formulate the binary and convex problems and the algorithms.

3.1 Differential Operators on Graphs

Our definitions of operators on graphs are based on the theory in [33, 42, 91]. More information is found in these papers.

Consider two Hilbert spaces, \({\mathcal {V}}\) and \({\mathcal {E}}\), which are associated with the sets of vertices and edges, respectively, and the following inner products and norms:

$$\begin{aligned} \langle u,\gamma \rangle _{\mathcal {V}}&= \sum _{x \in V}u(x) \gamma (x) \mathrm{d}(x)^r, \nonumber \\ \langle \psi ,\phi \rangle _{{\mathcal {E}}}&= \frac{1}{2} \sum _{x,y \in V} \psi (x,y) \phi (x,y) w(x,y)^{2q-1},\nonumber \\ \left||u\right||_{\mathcal {V}}&= \sqrt{\langle u, u \rangle _{\mathcal {V}}}= \sqrt{ \sum _{x \in V} u(x)^2\, \mathrm{d}(x)^r},\nonumber \\ \left||\phi \right||_{\mathcal {E}}&= \sqrt{\langle \phi , \phi \rangle _{\mathcal {E}}}= \sqrt{ \frac{1}{2} \sum _{x,y \in V} \phi (x,y)^2\, w(x,y)^{2q-1}},\nonumber \\ \left||\phi \right||_{\mathcal {E},\infty }&= \max _{x,y \in V}{|\phi (x,y)|}, \end{aligned}$$
(10)

for some \(r \in [0,1]\) and \(q \in [\frac{1}{2},1]\). Above, d(x) is the degree of node x (its number of incident edges) and \(w(\cdot ,\cdot )\) is the weighting function.

From these definitions, we can define the gradient operator \(\nabla \) and the Dirichlet energy as

$$\begin{aligned} (\nabla u)_w(x,y)= & {} w(x,y)^{1-q}(u(y)-u(x)), \end{aligned}$$
(11)
$$\begin{aligned} \frac{1}{2} \left||\nabla u\right||^2_{\mathcal {E}}= & {} \frac{1}{4} \sum _{x,y \in V} w(x,y) \left( u(y)-u(x)\right) ^2. \end{aligned}$$
(12)

We use the equation \(\langle \nabla u, \phi \rangle _{\mathcal {E}} = - \langle u, {{\mathrm{div}}}_w \phi \rangle _{\mathcal {V}}\) to define the divergence:

$$\begin{aligned} ({{\mathrm{div}}}_w \phi )(x)= \frac{1}{2{\mathrm d}(x)^r} \sum _{y \in V} w(x,y)^q(\phi (x,y)- \phi (y,x)),\nonumber \\ \end{aligned}$$
(13)

where we have exploited symmetry \(w(x,y)=w(y,x)\) of the undirected graph in the derivation of the operator.

Using divergence, a family of total variations \(TV_{w}:\mathcal {V} \rightarrow \mathbb {R}\) can now be defined:

$$\begin{aligned} TV_{w}(u)= & {} \sup \left\{ \langle {{\mathrm{div}}}_w \phi , u \rangle _{\mathcal {V}} : \phi \in \mathcal {E}, \left||\phi \right||_{\mathcal {E},\infty } \le 1 \right\} \nonumber \\= & {} \frac{1}{2} \sum _{x,y \in V} w(x,y)^q |u(y)-u(x)|. \end{aligned}$$
(14)

The definition of a family of graph Laplacians \(\triangle _w = {{\mathrm{div}}}_w \circ \nabla : \mathcal {V} \rightarrow \mathcal {V}\) is:

$$\begin{aligned} (\triangle _w u)(x)= \sum _{y \in V} \frac{w(x,y)}{d(x)^r}(u(y)-u(x)). \end{aligned}$$
(15)
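To make the discrete operators concrete, the following is a minimal sketch (ours, using dense NumPy arrays and the particular choices \(q=1\), \(r=0\)) of the gradient (11), divergence (13), total variation (14) and Laplacian (15) for a symmetric weight matrix W; sparse storage would be used in practice.

```python
import numpy as np

def graph_gradient(u, W):
    """(nabla u)(x,y) = u(y) - u(x) on edges with w(x,y) > 0 (q = 1 case of (11))."""
    mask = (W > 0).astype(float)
    return mask * (u[None, :] - u[:, None])

def graph_divergence(phi, W):
    """(div_w phi)(x) = 0.5 * sum_y w(x,y) (phi(x,y) - phi(y,x)); q = 1, r = 0 in (13)."""
    return 0.5 * np.sum(W * (phi - phi.T), axis=1)

def graph_tv(u, W):
    """TV_w(u) = 0.5 * sum_{x,y} w(x,y) |u(y) - u(x)|, i.e. (14) with q = 1."""
    return 0.5 * np.sum(W * np.abs(u[None, :] - u[:, None]))

def graph_laplacian(u, W):
    """(Delta_w u)(x) = sum_y w(x,y) (u(y) - u(x)), i.e. (15) with r = 0."""
    return W @ u - W.sum(axis=1) * u
```

For a binary indicator function \(u_i\), graph_tv recovers the corresponding cut term in (19).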

3.2 Binary Formulation of Energy Minimization Problem

A partition \(\{V_i\}_{i=1}^n\) of V satisfying the no vacuum and overlap constraint

$$\begin{aligned} \cup _{i=1}^n V_i \, = \, V \, , \quad V_k \cap V_l \, = \, \emptyset , \, \quad \forall k \ne l \, \end{aligned}$$
(16)

can be described by a binary vector function \(u = (u_1,...,u_n) : \; V \mapsto \{0,1\}^n\) defined as

$$\begin{aligned} u_i(x) \, := \, \left\{ \begin{array}{ll} 1 , \, &{} x \in V_i \\ 0 , \, &{} x \notin V_i \end{array} \right. \, , \quad i=1,\ldots ,n \, . \end{aligned}$$
(17)

In other words, \(u(x) = e_i\) if and only if \(x \in V_i\), where \(e_i\) is the i-th unit vector, which is 1 in the i-th component and 0 in all other components. The no vacuum and overlap constraint (16) can be expressed in terms of u as

$$\begin{aligned} \sum _{i=1}^n u_i(x) = 1 \, , \quad \forall x \in V. \end{aligned}$$
(18)

Moreover, note that the minimization term of (2) can be naturally related to total variation (14) for \(q=1\). In fact,

$$\begin{aligned} \sum _{i=1}^n \sum _{\begin{array}{c} (x,y) \in E \; : \\ x \in V_i, \, y \notin V_i \end{array}} w(x,y) = \sum _{i=1}^n TV_w(u_i). \end{aligned}$$
(19)

This connection between the two terms was used in several recent works to derive, utilizing the graphical framework, efficient unsupervised algorithms for clustering. For example, [16, 89] formulate rigorous convergence results for two methods that solve the relaxed Cheeger cut problem, using non-local total variation. Moreover, [82] provides a continuous relaxation of the Cheeger cut problem, and derives an efficient algorithm for finding good cuts. The authors of [82] relate the Cheeger cut to total variation and then present a split Bregman approach of solving the problem. In [88] the continuum limit of total variation on point clouds was derived.

The general set of problems (9) can now be formulated in terms of u as

$$\begin{aligned} \min _{u \in \mathcal {B}} E^P(u) = \sum _{i=1}^n \sum _{x \in V} C_i(x) u_i(x) + \sum _{i=1}^n TV_w(u_i) \end{aligned}$$
(20)

where

$$\begin{aligned} \mathcal {B} = \left\{ u : V \mapsto \{0,1\}^n, \quad \sum _{i=1}^n u_i(x) = 1, \;\; \forall x \in V \right\} \end{aligned}$$
(21)

is the set of binary functions indicating the partition. The superscript P stands for ‘primal.’ The optional size constraints (7) can be imposed in terms of u as

$$\begin{aligned} S_i^\ell \le ||u_i|| \le S_i^u, \; \quad i \in I, \end{aligned}$$

where \(||u_i|| = \sum _{x \in V} u_i(x)\). The size penalty term (8) can be imposed by appending the energy function (20) with the term \(\sum _{i=1}^n P_{\gamma }(||u_i||)\).

In case of semi-supervised classification, \(C_i(x)\) takes the form of

$$\begin{aligned} C_i(x) = \eta (x) \sum _{j=1}^n |(e_i)_j - u^0_j(x)|^2, \end{aligned}$$
(22)

where \(u^0_i\) is a binary function taking the value of 1 in \(T_i\) and 0 elsewhere, and \(\eta (x)\) is a function that takes on a large constant value \(\eta \) on supervised points \(\cup _{i=1}^n T_i\) and zero elsewhere. If \(\eta \) is chosen sufficiently large, it can be guaranteed that the solution u satisfies the supervised constraints. The algorithm to be presented in this work does not require the selection of an appropriate value for the parameter \(\eta \), as the ideal case where \(\eta = \infty \) can be handled naturally without introducing numerical instabilities.
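As an illustration, for a supervised point of class k the fidelity term (22) reduces to a constant penalty (proportional to \(\eta \)) on every class \(i \ne k\) and zero on class k. The following sketch (ours, with labels stored as integers and \(-1\) marking unsupervised points) builds such a matrix directly, matching the constant-value initialization of \(C_i\) described in Sect. 5.1.

```python
import numpy as np

def fidelity_term(labels, n_classes, eta=500.0):
    """Fidelity C of shape (N, n_classes): a large constant penalty eta for
    assigning a supervised point to any class other than its known label,
    and zero everywhere for unsupervised points (marked by label -1)."""
    C = np.zeros((labels.shape[0], n_classes))
    sup = labels >= 0
    C[sup, :] = eta
    C[sup, labels[sup]] = 0.0
    return C
```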

Region homogeneity terms can be imposed by setting \(C_i(x) = f_i(x)\). More generally, region homogeneity terms and supervised data points can be combined by setting

$$\begin{aligned} C_i(x) = \eta (x) \sum _{j=1}^n |(e_i)_j - u^0_j(x)|^2 + f_i(x). \end{aligned}$$
(23)

The total variation term is defined as in (14) with \(q=1\).

If the number of supervised points is very low and there is no additional region term, the global minimizer of (20) may become the trivial solution where, for one of the classes, say k, \(u_k(x) = 1\) everywhere except at the supervised points of the other classes, and for each other class i, \(u_i(x) = 1\) at the supervised points of class i and 0 elsewhere. This tends to occur when less than roughly \(2.5 \%\) of the points are supervised. As in our previous work [68], this problem can be countered by increasing the number of edges incident to supervised points in comparison with other points. Doing so will increase the cost of the trivial solution without significantly influencing the desired global minimizer. An alternative, proposed in the preprint [94], is to create region terms in a preprocessing step by diffusing information of the supervised points into their neighbors.

3.3 Convex Relaxation of Energy Minimization Problem

Due to the binary constraints (21), the problem (20) is non-convex. As in several recent works on variational image segmentation [6, 18, 58, 61, 77, 98] and MRF optimization [2, 11, 51, 54], we replace the indicator constraint set (21) by the convex unit simplex

$$\begin{aligned} \mathcal {B}' = \left\{ u \, : \, V \mapsto [0,1]^n, \quad \sum _{i=1}^n u_i(x) = 1, \;\; \forall x \in V \right\} .\nonumber \\ \end{aligned}$$
(24)

Hence, we are interested in solving the following convex relaxed problem

$$\begin{aligned} \min _{u \in \mathcal {B}'} E^P(u) = \sum _{i=1}^n \sum _{x \in V} C_i(x) u_i(x) + \sum _{i=1}^n TV_w(u_i). \end{aligned}$$
(25)

under optional size constraints (7) or penalty term (8). In case \(n=2\) and no size constraints, the relaxation is exact, as proved for image segmentation in [22, 80] and classification problems on graphs in [59, 68]. In case \(n>2\), the problem becomes equivalent to a multiway cut problem, which is known to be NP-hard [28]. In case size constraints are imposed, the problem becomes NP-hard even when \(n=2\), as it becomes equivalent to a knapsack problem [64] in the special case of no TV term.

In this paper, we are interested in using the convex relaxation (25) to solve the original problem approximately. Under certain conditions, the convex relaxation gives an exact global minimizer of the original problem. For instance, it can be straightforwardly shown that

Proposition 1

Let \(u^*\) be a solution of the relaxed problem (25), with optional size constraints (7) or penalty term (8). If \(u^* \in \mathcal {B}\), then \(u^*\) is a global minimizer of the original non-convex problem (20).

Proof

Let \(E^P(u)\) be the energy function defined in (25) with or without the size penalty term (8). Since \(\mathcal {B} \subset \mathcal {B}'\), it follows that \(\min _{u \in \mathcal {B}'} E^P(u) \le \min _{u \in \mathcal {B}} E^P(u)\). Therefore, if \(u^* = \hbox {arg min}_{u \in \mathcal {B}'} E^P(u)\) and \(u^* \in \mathcal {B}\), it follows that \(E^P(u^*) = \min _{u \in \mathcal {B}} E^P(u)\). The size constraints (7) can be regarded as a special case by choosing \(\gamma = \infty \). \(\square \)

If the computed solution of (25) is not completely binary, one way to obtain an approximate binary solution that exactly indicates the class membership of each point is to select the binary function as the nearest vertex in the unit simplex by the threshold

$$\begin{aligned} u^T(x) = e_\ell (x), \; \text {where} \; \ell = \hbox {arg max}_{i \in I} u_i(x). \end{aligned}$$
(26)

As an alternative to the threshold scheme (26), binary solutions of the convex relaxation (25) can also be obtained from a dual solution of (25), which has a more solid theoretical justification if some conditions are fulfilled. The dual problem also gives insight into why the convex relaxation is expected to closely approximate the original problem. This is the topic of the next section.

3.4 Analysis of Convex Relaxation Through a Dual Formulation

We will now derive theoretical results which indicate that the multiclass problem (20) is closely approximated by the convex relaxation (25). The following results extend those given in [6] from image domains to graphs. In contrast to [6], we also incorporate size constraints or penalty terms in the analysis. In fact, the strongest results given near the end of the section are only valid for problems without such size constraints/terms. This observation agrees well with our experiments, although in both cases very close approximations are obtained.

We start by deriving an equivalent dual formulation of (25). Note that this dual problem is different from the ‘max-flow’ type dual problem on graphs proposed in our previous work [68] in case of two classes. Its main purpose is theoretical analysis, not algorithmic development. In fact, its relation to flow maximization will be the subject of the next section. Dual formulations on graphs have also been proposed in [45] for variational multiscale decomposition of graph signals.

Theorem 1

The convex relaxed problem (25) can equivalently be formulated as the dual problem

$$\begin{aligned} \sup _{q,\rho ^1,\rho ^2} \sum _{x \in V} \min _{i\in I}&\left( C_i(x) + ({{\mathrm{div}}}_w q_i)(x) + \rho _i^2 - \rho _i^1 \right) \nonumber \\ +&\sum _{i=1}^n \left( \rho _i^1 S^\ell _i - \rho _i^2 S^u_i \right) , \end{aligned}$$
(27)

subject to

$$\begin{aligned}&(q_1,...,q_n) \in S^n_\infty , \end{aligned}$$
(28)
$$\begin{aligned}&\rho _i^1,\rho _i^2 \in [0,\gamma ], \quad i=1,...,n, \end{aligned}$$
(29)

where the above set of infinity norm spheres is defined as

$$\begin{aligned}&S^n_\infty = \left\{ (q_1,...,q_n) \, : \mathcal {E} \mapsto \mathbb {R}^n \; \text {s.t.} \; \left||q_i\right||_{\mathcal {E},\infty } \le 1 \quad \forall i\right\} . \end{aligned}$$
(30)

No size information is incorporated by choosing \(\gamma = 0\). The size penalty term (8) is incorporated by choosing \(0< \gamma < \infty \). Size constraints (7) are incorporated by choosing \(\gamma = \infty \).

Proof

By using the definition of total variation (14), the problem (25) with size penalty term (8) can be expressed in primal–dual form as

$$\begin{aligned}&\min _{u \in \mathcal {B}'} \, \sup _{q \in S^n_\infty } \; \sum _{i=1}^n P_\gamma (||u_i||) \nonumber \\&+ \sum _{i=1}^n \sum _{x \in V} u_i(x) \left( C_i(x) \, +\, ({{\mathrm{div}}}_w q_i)(x) \right) , \end{aligned}$$
(31)

where \(S^n_\infty \) is defined in (30). It will be shown that the size constraints (7) or penalty term (8) can be implicitly incorporated by introducing the dual variables \(\rho _i^1,\rho _i^2 \in \mathbb {R}_+\), \(i=1,...,n\) as

$$\begin{aligned}&\min _{u \in \mathcal {B}'} \, \sup _{q \in S^n_\infty , \rho ^1,\rho ^2 \in [0,\gamma ]^n } E(u;q,\rho ^1,\rho ^2) \nonumber \\&= \sum _{i=1}^n \sum _{x \in V} u_i(x) \left\{ C_i(x) \, +\, ({{\mathrm{div}}}_w q_i)(x) + \rho _i^2 - \rho _i^1 \right\} \nonumber \\&\quad + \left( \rho _i^1 S^\ell _i - \rho _i^2 S^u_i \right) , \end{aligned}$$
(32)

The primal–dual problem (32) satisfies all the conditions of the mini-max theorem (see, e.g., Chapter 6, Proposition 2.4 of [32]). The constraint sets for \(q,\rho ^1,\rho ^2\) and u are compact and convex, and the energy function \(E(u;q,\rho ^1,\rho ^2)\) is convex l.s.c. for fixed \((q,\rho ^1,\rho ^2)\) and concave u.s.c. for fixed u. This implies the existence of at least one primal–dual solution (saddle point) of finite energy value.

For a given u, the terms involving \(\rho ^1\) and \(\rho ^2\) can be rearranged as

$$\begin{aligned}&\sup _{ 0 \le \rho _i^1 \le \gamma } \rho ^1_i \big ( S^\ell _i - \sum _{x \in V} u_i(x) )&\nonumber \\&\quad = \left\{ \begin{array}{ll} 0 \quad &{} \text{ if } \sum _{x \in V} u_i(x) \ge S^\ell _i \\ \gamma \left( S^\ell _i-\sum _{x \in V}u_i(x)\right) \quad &{} \text { if } \sum _{x \in V} u_i(x) < S^\ell _i \end{array}\right. \end{aligned}$$
(33)
$$\begin{aligned}&\sup _{0 \le \rho _i^2 \le \gamma } \rho _i^2 \big ( \sum _{x \in V} u_i(x) - S^u_i )&\nonumber \\&\quad = \left\{ \begin{array}{ll} 0 \quad &{} \text{ if } \sum _{x \in V} u_i(x) \le S^u_i \\ \gamma \left( \sum _{x \in V} u_i(x)-S^u_i\right) \quad &{} \text { if } \sum _{x \in V} u_i(x) > S^u_i \end{array}\right. \end{aligned}$$
(34)

Consider the above three choices for \(\gamma \). In case \(\gamma = 0\), the class sizes do not contribute to the energy. In case \(0< \gamma < \infty \), the sum of the two terms above is exactly equal to the size penalty term \(P_\gamma (||u_i||)\). In case \(\gamma = \infty \), the constraint set on \(\rho ^1,\rho ^2\) is no longer compact, but we can apply Sion’s generalization of the mini-max theorem [79], which allows either the primal or dual constraint set to be non-compact. It follows that if the size constraints (7) are not satisfied, the energy would be infinite, contradicting existence of a primal–dual solution.

From the mini-max theorems, it also follows that the inf and sup operators can be interchanged as follows

$$\begin{aligned}&\min _{u \in \mathcal {B}'} \, \sup _{q \in S^n_\infty , \rho ^1,\rho ^2 \in [0,\gamma ]^n} \; E(u;q,\rho ^1,\rho ^2) \, = \, \nonumber \\&\sup _{q \in S^n_\infty , \rho ^1,\rho ^2 \in [0,\gamma ]^n} \,\min _{u \in \mathcal {B}'} \; E(u;q,\rho ^1,\rho ^2). \end{aligned}$$
(35)

For notational convenience, we denote the unit simplex pointwise as

$$\begin{aligned} \Delta ^n_+ = \{(u_1,...,u_n) \in [0,1]^n \, : \; \sum _{i=1}^n u_i = 1 \} \end{aligned}$$
(36)

For an arbitrary vector \(F = (F_1, \ldots , F_n) \in \mathbb {R}^n\), observe that

$$\begin{aligned} \min _{(u_1, \ldots , u_n) \in \triangle _{+}} \, \sum _{i=1}^n u_i F_i \, = \, \min (F_1, \ldots ,F_n) \,. \end{aligned}$$
(37)

Therefore, the inner minimization of (35) can be solved analytically at each position \(x \in V\), and we obtain the dual problem

$$\begin{aligned}&\sup _{q \in S^n_\infty , \rho ^1,\rho ^2 \in [0,\gamma ]^n} \sum _{i=1}^n \big ( \rho _i^1 S^\ell _i - \rho _i^2 S^u_i \big ) \nonumber \\&\quad + \sum _{x \in V} \min _{u(x) \in \triangle _{+}} \sum _{i=1}^n u_i \big \{ C_i \, +\, {{\mathrm{div}}}_w q_i + \rho _i^2 - \rho _i^1 \big \}(x) \nonumber \\&= \sup _{q \in S^n_\infty , \rho ^1,\rho ^2 \in [0,\gamma ]^n} \sum _{i=1}^n \big ( \rho _i^1 S^\ell _i - \rho _i^2 S^u_i \big ) \nonumber \\&\quad + \sum _{x \in V} \min _{i \in I}\{C_i(x) + ({{\mathrm{div}}}_w q_i)(x) + \rho _i^2 - \rho _i^1 \}. \end{aligned}$$

\(\square \)

Assuming a solution of the dual problem \(q^*,{\rho ^1}^*,{\rho ^2}^*\) has been obtained, the following theorem characterizes the corresponding primal variable \(u^*\).

Theorem 2

There exists a maximizer \(q^{*},{\rho ^1}^*,{\rho ^2}^*\) to the dual problem (27). At the point \(x \in V\), let \(I_m(x) = \{i_1,...,i_k\}\) be the set of indices such that

$$\begin{aligned} I_m(x) = \hbox {arg min}_{i \in I}\Big ( C_i(x) + ({{\mathrm{div}}}_w q^*_i)(x) + {\rho _i^2}^* - {\rho _i^1}^* \Big ).\nonumber \\ \end{aligned}$$
(38)

There exists a solution \(u^*\) to the primal problem (25) such that \((u^*;q^*,{\rho ^1}^*,{\rho ^2}^*)\) is a primal–dual pair. At the point x, \(u^*(x)\) must satisfy

$$\begin{aligned} \sum _{i \in I_m(x)} u_{i}^{*}(x) \, = \, 1 \; \text { and} \quad u_j^{*}(x) \, = \, 0 \, , \, j \notin I_m(x). \end{aligned}$$
(39)

If the minimizer (38) is unique at the point \(x \in V\), then the corresponding primal solution \(u^*\) at the point x must be valued

$$\begin{aligned} u^*_i(x) = \, \left\{ \begin{array}{ll} 1 , \, &{} \text {if}\;i = I_m(x) \\ 0 , \, &{} \text {if}\;i \ne I_m(x) \end{array} \right. \, , \quad i=1,\ldots ,n \, . \end{aligned}$$
(40)

If the minimizer (38) is unique at every point \(x \in V\), then the corresponding primal solution \(u^*\), given by the formula (40), is an exact global binary minimizer of the original non-convex problem (20).

Proof

Since all conditions of the mini-max theorem [32, 79] are satisfied (c.f. proof of Theorem 1), there must exist a maximizer \(q^*,{\rho ^1}^*,{\rho ^2}^*\) of the dual problem (27) and a minimizer \(u^*\) of the primal problem (25) such that \((u^*,q^*)\) is a solution of the primal–dual problem (31) (see, e.g., [32]). For arbitrary vectors \(u \in \Delta _+^n\) and \(F \in \mathbb {R}^n\), it must hold that \(\sum _{i\in I} u_i F_i \ge \min _{i \in I} F_i\). Therefore, at the point x, \(u^*\) must satisfy

$$\begin{aligned}&\sum _{i\in I} u^*_i(x)\big ( (C_i + {{\mathrm{div}}}_w q^*_i)(x) + {\rho _i^2}^* - {\rho _i^1}^* \big ) \\&\quad = \min _{i \in I} \big ( (C_i + {{\mathrm{div}}}_w q_i)(x) + {\rho _i^2}^* - {\rho _i^1}^* \big ); \end{aligned}$$

otherwise, the primal–dual energy would exceed the dual energy, contradicting strong duality. The above expression can be further decomposed as follows

$$\begin{aligned}&= \sum _{i\in I_m(x)} u^*_i(x)\left( (C_i + {{\mathrm{div}}}_w q^*_i)(x) + {\rho _i^2}^* - {\rho _i^1}^* \right) \\&\quad + \sum _{i\notin I_m(x)} u^*_i(x)\left( (C_i + {{\mathrm{div}}}_w q^*_i)(x) + {\rho _i^2}^* - {\rho _i^1}^* \right) \\&= \left( \sum _{i\in I_m(x)} u^*_i(x)\right) \min _{i \in I}\left( (C_i + {{\mathrm{div}}}_w q^*_i)(x) + {\rho _i^2}^* - {\rho _i^1}^* \right) \\&\quad + \sum _{i\notin I_m(x)} u^*_i(x)\left( (C_i + {{\mathrm{div}}}_w q^*_i)(x) + {\rho _i^2}^* - {\rho _i^1}^* \right) \end{aligned}$$

Since \((C_j + {{\mathrm{div}}}_w q^*_j)(x) + {\rho _j^2}^* - {\rho _j^1}^* > \min _{i \in I}\big ( (C_i + {{\mathrm{div}}}_w q^*_i)(x) + {\rho _i^2}^* - {\rho _i^1}^* \big )\) for all \(j \notin I_{m}(x)\), the above can only be true provided that \(\sum _{i \in I_m(x)} u^*_i(x) = 1\) and \(u_i^*(x) = 0\) for \(i \notin I_{m}(x)\).

If the minimizer \(I_m(x)\) is unique, it follows directly from (39) that \(u^*_i(x)\) must be the indicator vector (40).

If the minimizer \(I_m(x)\) is unique at every point \(x \in V\), then the corresponding primal solution \(u^*\) given by (40) is contained in the binary set \(\mathcal {B}\). By Proposition 1, \(u^*\) is a global minimizer of (20). \(\square \)

It can also be shown that an exact binary primal solution exists if there are two non-unique minimal components to the vector

$$\begin{aligned} (C(x) + {{\mathrm{div}}}_w q^{*}(x) + {\rho ^2}^* - {\rho ^1}^*) \end{aligned}$$

but this result only holds in case there are no constraints acting on the class sizes.

Theorem 3

Assume that \(q^*\) is a maximizer of the dual problem (27) with \(\gamma = 0\), i.e., no class size constraints. If (38) has at most two minimal components for all \(x \in V\), then there exists a corresponding binary primal solution to the convex relaxed primal problem (25), which is a global minimizer of the original non-convex problem (20).

A constructive proof of Theorem 3 is given in “Appendix.”

If the vector \((C(x) + {{\mathrm{div}}}_w q^{*}(x) + {\rho ^2}^* - {\rho ^1}^*)\) has three or more minimal components, it cannot in general be expected that a corresponding binary primal solution exists, reflecting that one can probably not obtain an exact solution to the NP-hard problem (20) in general by a convex relaxation. Experiments indicate that this very rarely, if ever, happens in practice for the classification problem (20).

As an alternative thresholding scheme, \(u^T\) can be selected based on the formula (40) after a dual solution to the convex relaxation has been obtained. If there are multiple minimal components to the vector \((C + {{\mathrm{div}}}q^*)(x)\), one can select \(u^T(x)\) to be one for an arbitrary one of those indices, just as for the ordinary thresholding scheme (26). Experiments will demonstrate and compare both schemes in Sect. 5.

4 ‘Max-Flow’ Formulation of Dual Problem and Algorithm

A drawback of the dual model (27) is the non-smoothness of the objective function, which is also a drawback of the original primal formulation of the convex relaxation. This section reformulates the dual model in a structurally similar way to a max-flow problem, which is smooth and facilitates the development of a very efficient algorithm based on the augmented Lagrangian theory.

The resulting dual problem can be seen as a multiclass variant of the max-flow model proposed in our work [68] for two classes, and a graph analogue of the max-flow model given for image domains in [95]. Note that our derivations differ conceptually from [68, 95], because we directly utilize the dual problem derived in the last section. Furthermore, the new flexible size constraint (7) and penalty term (8) are incorporated naturally in the max-flow problem by a modified flow conservation condition, which indicates that there should be a constant flow excess at each node. The amount of flow excess is expressed with a few additional optimization variables in the algorithm, and they can be optimized over with very little additional computational cost.

4.1 ‘Max-Flow’ Reformulation of the Dual Problem

We now derive alternative dual and primal–dual formulations of the convex relaxed problem that are more beneficial for computations. The algorithm will be presented in the next section.

Proposition 2

The dual problem (27) can equivalently be formulated as the dual ‘max-flow’ problem:

$$\begin{aligned} \sup _{p_s, p, q,\rho ^1,\rho ^2} \; \,\sum _{x \in V} \, p_s(x) + \sum _{i=1}^n \big ( \rho _i^1 S^\ell _i - \rho _i^2 S^u_i \big )\, \end{aligned}$$
(41)

subject to, for all \(i \in I\),

$$\begin{aligned}&\left|q_i(x,y)\right| \, \le \, 1, \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \, \, \, \forall (x,y) \in E, \end{aligned}$$
(42)
$$\begin{aligned}&\quad p_i(x) \, \le \, C_i(x), \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \, \, \forall x \in V, \end{aligned}$$
(43)
$$\begin{aligned}&\quad \big ( {{\mathrm{div}}}_w q_i - p_s + p_i \big )(x) \, = \, \rho _i^1 - \rho _i^2, \quad \forall x \in V, \end{aligned}$$
(44)
$$\begin{aligned}&\quad 0 \le \rho _i^1,\rho _i^2 \le \gamma . \end{aligned}$$
(45)

Proof

By introducing the auxiliary variable \(p_s \, : \; V \mapsto \mathbb {R}\), the dual problem (27) can be reformulated as follows

$$\begin{aligned}&\sup _{q, \rho ^1,\rho ^2 } \, \sum _{x \in V} \min _{i\in I} \left( C_i(x) + {{\mathrm{div}}}_w q_i(x) + \rho _i^2 - \rho _i^1 \right) \,\nonumber \\&\quad + \sum _{i=1}^n \left( \rho _i^1 S^\ell _i - \rho _i^2 S^u_i \right) \nonumber \\&\text {subject to, for all }i \in I,\nonumber \\&\quad \left||q_i\right||_{\mathcal {E},\infty } \le 1, \nonumber \\&\quad 0 \le \rho _i^1,\rho _i^2 \le \gamma . \nonumber \\&\quad = \sup _{p_s, q, \rho ^1,\rho ^2} \sum _{x \in V} p_s(x) + \sum _{i=1}^n \big ( \rho _i^1 S^\ell _i - \rho _i^2 S^u_i \big )\, \nonumber \\&\text {subject to, for all }i \in I,\nonumber \\&\quad p_s(x) \le (C_i + {{\mathrm{div}}}_w q_i)(x) + \rho _i^2 - \rho _i^1 \;\; \forall x \in V, \nonumber \\&\quad \left||q_i\right||_{\mathcal {E},\infty } \le 1, \nonumber \\&\quad 0 \le \rho _i^1,\rho _i^2 \le \gamma . \end{aligned}$$
(46)

By adding another set of auxiliary variables \(p_i \, : \; V \mapsto \mathbb {R}\), \(i =1,...,n\), the constraints (46) can be formulated as

$$\begin{aligned}&p_s(x) = p_i(x) + {{\mathrm{div}}}_w q_i(x) + \rho _i^2 - \rho _i^1,\nonumber \\&\quad p_i(x) \le C_i(x), \end{aligned}$$
(47)

for all \(x \in V\) and all \(i \in I\). Rearranging the terms in constraint (47), and using the definition of the infinity norm (10), leads to the ‘max-flow’ model (41) subject to (42)–(45). \(\square \)

Problem (41) with constraints (42)–(45) is structurally similar to a max-flow problem over n copies of the graph G, \((V_1,E_1) \times ... \times (V_n,E_n)\), where \((V_i,E_i) = G\) for \(i \in I\). The aim of the max-flow problem is to maximize the flow from a source vertex to a sink vertex under flow capacity at each edge and flow conservation at each node. The variable \(p_s(x)\) can be regarded as the flow on the edges from the source to the vertex x in each of the subgraphs \((V_1,E_1), ..., (V_n,E_n)\), which have unbounded capacities. The variables \(p_i(x)\) and \(C_i(x)\) can be regarded as the flow and capacity on the edge from vertex x in the subgraph \((V_i,E_i)\) to the sink. Constraint (47) is the flow conservation condition. Observe that in case of size constraints/terms, instead of being conserved, there should be a constant excess flow \(\rho _i^1 - \rho _i^2\) for each node in the subgraph \((V_i,E_i)\). The objective function (41) is a measure of the total amount of flow in the graph.

Utilizing results from Sect. 3.4, we now show that the convex relaxation (25) is the equivalent dual problem to the max-flow problem (41).

Theorem 4

The following problems are equivalent to each other:

  1.

    The max-flow problem (41), subject to (42)–(45);

  2.

    The primal–dual problem:

    $$\begin{aligned}&\min _{u} \sup _{p_s, p, q,\rho ^1,\rho ^2} \left\{ E(p_s, p, q, \rho ^1,\rho ^2; u) \right. \nonumber \\&= \sum _{x \in V} p_s(x) + \sum _{i=1}^n \left( \rho _i^1 S^\ell _i - \rho _i^2 S^u_i \right) \nonumber \\&\quad \left. + \sum _{i=1}^n \sum _{x \in V} u_i \left( {{\mathrm{div}}}_w q_i - p_s + p_i + \rho _i^2 - \rho _i^1 \right) (x) \right\} \end{aligned}$$
    (48)

    subject to (42), (43) and (45), where u is the relaxed region indicator function.

  3.

    The convex relaxed problem (25) with size constraint (7) if \(\gamma = \infty \), size penalty term (8) if \(0< \gamma < \infty \) and no size constraints if \(\gamma = 0\).

Proof

The equivalence between the primal–dual problem (48) and the max-flow problem (41) follows directly as \(u_i\) is an unconstrained Lagrange multiplier for the flow conservation constraint (47). Existence of the Lagrange multipliers follows as: (1) (41) is upper bounded, since it is equivalent to (27), which by Theorem 2 admits a solution, and (2) the constraints (44) are linear and hence differentiable.

The equivalence between the primal–dual problem (48), the max-flow problem (41) and the convex relaxed problem (25) now follows: By Proposition 2 the ‘max-flow’ problem (41) is equivalent to the dual problem (27). By Theorem 1, the dual problem (27) is equivalent to the convex relaxed problem (25) with size constraints (7) if \(\gamma = \infty \), size penalty term (8) if \(0< \gamma < \infty \) and no size constraints if \(\gamma = 0\). \(\square \)

Note an important distinction between the primal–dual problem (48) and the primal–dual problem (35) derived in the last section: The primal variable u is unconstrained in (48). The simplex constraint \(\mathcal {B}'\) is handled implicitly. It may not seem obvious from the proof how the constraints on u are encoded in the primal–dual problem; therefore, we give some further insights: For a given primal variable u, the maximization with respect to \(p_s\) of the primal–dual problem (48) at the point x can be rearranged as

$$\begin{aligned} \sup _{p_s(x)}&\left( \left( 1- \sum _{i=1}^n u_i \right) p_s\right) (x) \nonumber \\ =&\left\{ \begin{array}{ll} 0 \quad &{} \text{ if } \sum _{i=1}^n u_i(x) = 1 \\ \infty \quad &{} \text { if } \sum _{i=1}^n u_i(x) \ne 1 \end{array}\right. \end{aligned}$$
(49)

If u does not satisfy the sum to one constraint at x, then the primal–dual energy would be infinite, contradicting boundedness from above. In a similar manner, the optimization with respect to \(p_i\) can be expressed as

$$\begin{aligned} \sup _{p_i(x) \le C_i(x)}&u_i(x) p_i(x) = \left\{ \begin{array}{ll} (u_i C_i)(x) \, &{} \; \text{ if } u_i(x) \ge 0 \\ \infty \, &{} \; \text{ if } u_i(x) < 0 . \end{array}\right. \end{aligned}$$
(50)

which would be infinite if u does not satisfy the non-negativity constraints. If \(u(x) = e_i\), the indicator function of class i, the value would be \(C_i(x)\), which is indeed the pointwise cost of assigning x to class i.

4.2 Augmented Lagrangian Max-flow Algorithm

This section derives an efficient algorithm, which exploits the fact that all constraints on u are handled implicitly in the primal–dual problem (48). The algorithm is based on the augmented Lagrangian theory, where u is updated as a Lagrange multiplier by a gradient descent step each iteration. Since no subsequent projection of u is necessary, the algorithm tolerates a wide range of step sizes and converges with high accuracy. The advantages of related ‘max-flow’ algorithms for ordinary 2D imaging problems over, e.g., Arrow–Hurwicz type primal–dual algorithms have been demonstrated in [5, 97].

From the primal–dual problem (48), we first construct the augmented Lagrangian functional:

$$\begin{aligned}&L_{c} = \sum _{x \in V} p_{s} + \sum _{i=1}^n \left( \rho _i^1 S^\ell _i - \rho _i^2 S^u_i \right) \nonumber \\&\quad \quad + \sum _{i=1}^n \sum _{x \in V} u_i(x) \left( {{\mathrm{div}}}_w q_i - p_{s}+ p_{i} + \rho _i^2 - \rho _i^1 \right) (x) \nonumber \\&\quad \quad - \frac{c}{2} \sum _{i=1}^n \left||{{\mathrm{div}}}_w q_i - p_{s}+ p_{i} + \rho _i^2 - \rho _i^1\right||_2^2. \end{aligned}$$
(51)

An augmented Lagrangian algorithm based on the above functional is given in Algorithm 1 below; it alternately maximizes \(L_c\) with respect to the dual variables and then updates the Lagrange multiplier u.

Note that if there are no constraints on the class sizes, \(\gamma = 0\), then \({\rho ^1}^k = {\rho ^2}^k \equiv 0\) for every iteration k. The algorithm can in this case be simplified by setting \({\rho ^1}^k = {\rho ^2}^k \equiv 0\) for all k and ignoring all steps involving \(\rho ^1\) and \(\rho ^2\).

Algorithm 1 Augmented Lagrangian max-flow algorithm (subproblems (52)–(56) and the update of the multiplier u)

The optimization problem (52) can be solved by a few steps of the projected gradient method as follows:

$$\begin{aligned} q^{k+1}_i= \Pi _{W} \left( q^k_i + c \, \nabla _w({{\mathrm{div}}}_w q^{k}_i - F^k)\right) , \end{aligned}$$
(57)

Above, \(\Pi _{W}\) is a projection operator which is defined as

$$\begin{aligned} \Pi _{W} (s(x,y))= & {} \nonumber \\&{\left\{ \begin{array}{ll} s(x,y) &{}\text{ if } |s(x,y)| \le 1, \\ \text {sgn}(s(x,y)) &{} \text{ if } |s(x,y)| > 1, \end{array}\right. } \end{aligned}$$
(58)

where sgn is the sign function. There are extended convergence theories for the augmented Lagrangian method in the case when one of the subproblems is solved inexactly; see, e.g., [35, 40]. In our experience, one gradient ascent iteration leads to the fastest overall speed of convergence.

The subproblems (53) and (54) can be solved by

$$\begin{aligned} p_s(x)= & {} G^k(x)+ \frac{1}{c}, \end{aligned}$$
(59)
$$\begin{aligned} p_i(x)= & {} \min \left( H^k(x), C_i(x)\right) . \end{aligned}$$
(60)

Consider now the subproblems (55) and (56). In case no constraints are imposed on \(\rho ^1\) and \(\rho ^2\), the maximizers over the sums of the concave quadratic terms can be computed as the averages of the maximizers of the individual terms:

$$\begin{aligned} \text{ mean } \left( - J^k + \frac{S^\ell _i }{c \,||V||}\right) , \quad \text{ mean } \left( - M^k - \frac{S_i^u}{c\,||V||} \right) , \end{aligned}$$

respectively, for \(\rho ^1\) and \(\rho ^2\). Since the objective function is concave, and the maximization variable is just a constant, an exact solution to the constrained maximization problem can now be obtained by a projection onto that constraint as follows

$$\begin{aligned}&{\rho _i^1}^{k+1}= \min \left( \max \left( \text{ mean }\left( - J^k + \frac{S^\ell _i }{c\,||V||}\right) , 0 \right) ,\gamma \right) , \end{aligned}$$
(61)
$$\begin{aligned}&\quad {\rho _i^2}^{k+1}= \min \left( \max \left( \text{ mean }\left( - M^k - \frac{S_i^u}{c\,||V||}\right) , 0 \right) ,\gamma \right) . \end{aligned}$$
(62)

Algorithm 1 is suitable for parallel implementation on GPU, since the subproblems at each substep can be solved pointwise independently of each other using simple floating point arithmetic. The update formula (57) for q only requires access to the values of neighboring nodes at the previous iterate. As discussed in Sect. 2, the number of neighbors may vary for nodes across the graph; therefore, special considerations should be taken when declaring memory. We have implemented the algorithm on CPU for experimental evaluation for simplicity.
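For illustration only, the sketch below (ours) collects the explicit update formulas (57)–(62) for a single class i. The arrays F, G, H, J and M denote the auxiliary quantities computed inside Algorithm 1 (not reproduced here) and are treated as given inputs, so this is a structural sketch rather than a complete implementation.

```python
import numpy as np

def project_unit_ball(s):
    """Pointwise projection Pi_W of (58): clip edge values to [-1, 1]."""
    return np.clip(s, -1.0, 1.0)

def dual_updates(q, F, G, H, C, J, M, S_low, S_up, c, gamma, grad_w, div_w):
    """One sweep of the dual updates (57)-(62) for a single class i.
    grad_w / div_w are the graph gradient and divergence operators;
    F, G, H, J, M are the auxiliary node quantities defined in Algorithm 1
    (assumed given here)."""
    # (57): one projected-gradient step for the edge flows q_i
    q_new = project_unit_ball(q + c * grad_w(div_w(q) - F))
    # (59): source flow p_s
    p_s = G + 1.0 / c
    # (60): sink flows p_i, capped by the capacities C_i
    p = np.minimum(H, C)
    # (61)-(62): flow-excess variables rho_i^1, rho_i^2 controlling the class size
    n_nodes = J.shape[0]        # |V|, assuming J and M are node-indexed arrays
    rho1 = np.clip(np.mean(-J + S_low / (c * n_nodes)), 0.0, gamma)
    rho2 = np.clip(np.mean(-M - S_up / (c * n_nodes)), 0.0, gamma)
    return q_new, p_s, p, rho1, rho2
```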

5 Applications and Experiments

We now focus on specific applications of the convex framework. Experimental results on semi-supervised classification of high-dimensional data are presented in Sect. 5.1. Section 5.2 proposes specific terms in the general model (9) for segmentation of unstructured 3D point clouds and presents experimental results on LaDAR data acquired in outdoor scenes. In both cases, we give a thorough examination of accuracy of the results, tightness of the convex relaxations and convergence properties of the algorithms.

A useful quality measure of the convex relaxation is to what extent the computed solution is binary. Proposition 1 indicates that if the computed solution is completely binary, it is also an exact global minimizer to the original non-convex problem. Let \({u^k}^T\) be a thresholding of \(u^k(x)\) in the sense that each row of \(u^k\) is modified to be the closest vertex in the unit simplex according to the scheme (26). As a quality measure of the solution \(u^k\) at each iteration k of Algorithm 1, we calculate the average difference between \(u^k\) and its thresholded version \({u^k}^T\) as follows:

$$\begin{aligned} b(u^k) = \frac{1}{2n||V||} \left( \sum _{i=1}^n \sum _{x \in V} |{u^k}^T_i(x) - u^k_i(x)|\right) , \end{aligned}$$
(63)

where ||V|| is the number of nodes and n is the number of classes. We call \(b(u^k)\) the ‘binary difference’ of \(u^k\) at iteration k. Naturally, we want \(b(u^k)\) to become as low as possible as the algorithm converges.
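The binary difference (63) and the simplex thresholding it relies on can be computed as in the following sketch. For a row that already lies in the unit simplex, the closest simplex vertex is the one-hot vector at the row's largest entry, which is the convention assumed here; the function names are ours.

```python
import numpy as np

def threshold_to_simplex_vertex(u):
    """Round each row of u (nodes x classes) to the nearest unit-simplex vertex."""
    u_T = np.zeros_like(u)
    u_T[np.arange(u.shape[0]), np.argmax(u, axis=1)] = 1.0
    return u_T

def binary_difference(u):
    """Average difference between u and its thresholded version, eq. (63)."""
    n_nodes, n_classes = u.shape
    return np.abs(threshold_to_simplex_vertex(u) - u).sum() / (2.0 * n_classes * n_nodes)
```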

5.1 Semi-supervised Classification Results

In this section, we describe the semi-supervised classification results, obtained using the algorithm with and without the size constraints (5), (7) or the penalty term (8).

We compare the accuracy of the results with respect to the ground truth. The results are also compared against other local minimization approaches in terms of the final total variation energies:

$$\begin{aligned} E(u) = \frac{1}{2} \sum _{i=1}^n \sum _{x,y \in V} w(x,y)|u_i(x) - u_i(y)|, \end{aligned}$$

where n is the number of classes. A lower value of E is better. The energy contribution from the fidelity term is ignored because the solution satisfies the supervised constraints by construction, thus giving zero contribution from those terms.

To compute the weights for the data sets, we use the Zelnik-Manor and Perona weight function [74]. The function is defined as:

$$\begin{aligned} w(x,y)=\exp \left( -\frac{\mathrm{d}(x,y)^{2}}{\sqrt{\tau (x)\tau (y)}}\right) , \end{aligned}$$
(64)

where d(x, y) is a distance measure between vertices x and y, and \(\sqrt{\tau (x)}\) is the distance between vertex x and its Mth closest neighbor. If y is not among the M nearest neighbors of x, then w(x, y) is set to 0. After the graph is computed, we symmetrize it by setting

$$\begin{aligned} w(x,y)= \max (w(x,y),w(y,x)). \end{aligned}$$

Here, M is a parameter to be chosen. The weight function will be defined more specifically for each application.
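A minimal sketch of the weight construction (64) with M-nearest-neighbor sparsification and max-symmetrization is shown below. It computes dense pairwise distances, which is only adequate for moderately sized data sets; a k-d tree or approximate nearest-neighbor search would be used at larger scale. The function name is illustrative.

```python
import numpy as np

def zelnik_manor_perona_weights(X, M):
    """X: (n_points, n_features) feature vectors; M: local-scaling neighbor index."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    order = np.argsort(d2, axis=1)                              # column 0 is the point itself
    tau = d2[np.arange(n), order[:, M]]                         # tau(x) = d(x, x_M)^2
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = order[i, 1:M + 1]                                # M nearest neighbors of i
        W[i, nbrs] = np.exp(-d2[i, nbrs] / np.sqrt(tau[i] * tau[nbrs]))
    return np.maximum(W, W.T)                                   # symmetrize: max(w(x,y), w(y,x))
```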

We run the minimization procedure until the following stopping criterion is satisfied:

$$\begin{aligned} \frac{1}{||V||} \left( \sum _i \sum _{x \in V} |u_i(x) - u^{old}_i(x)|\right) < \delta , \end{aligned}$$

where \(u^{old}\) is the u from the previous iteration, and the value of \(\delta \) varies depending on the data set (anywhere from \(10^{-12}\) to \(10^{-10}\)).

All experiments were performed on a 2.4 GHz Intel Core 2 Quad CPU. We initialize \(C_i(x)\) to a constant (set to 500 in our case) if x is a supervised point of any class other than class i, and to 0 otherwise, for all \(i \in I\). The variables u, \(q_i\), \(\rho _i^1\), \(\rho _i^2\) are initialized to zero for all \(i \in I\). The variable \(p_s\) is initialized to \(C_n\), where n is the number of classes. We set \(p_i=p_s \quad \forall i \in I\).

In the following, we give details about the setup and results for each data set, before we draw some general conclusions in the end.

5.1.1 MNIST

The MNIST data set [56], affiliated with the Courant Institute of New York University, consists of 70,000 \(28 \times 28\) images of handwritten digits 0 through 9. Some of the images in the database are shown in Fig. 3. The objective is, of course, to assign the correct digit to each image; thus, this is a 10-class segmentation problem.

Fig. 3 Examples of digits from the MNIST database

Fig. 4 MNIST results. These plots visualize the values of the first versus the sixth eigenvector (of the graph Laplacian) for the nodes of classes 4 and 9 only. The blue and red regions represent nodes of classes 4 and 9, respectively; the green region represents misclassified points. a Ground truth; b proposed result (randomly selected supervised points); c proposed result (non-randomly selected supervised points); d MBO result (randomly selected supervised points); e MBO result (non-randomly selected supervised points) (Color figure online)

We construct the graph as follows: each image is a node, described by the feature vector of 784 pixel intensity values in the image. These feature vectors are used to compute the weights for pairs of nodes. The weight matrix is computed using the Zelnik–Manor and Perona weight function (64) with local scaling using the 8th closest neighbor. We note that no preprocessing of the data is needed to obtain an accurate classification, and we do not perform any. The parameter c used was 0.05.

The average accuracy over 100 runs with randomly chosen supervised points is shown in Table 1 for the case of no size constraints. We note that the new approaches consistently reach higher accuracies and lower energies than related local minimization approaches, and that incorporating size information can improve the accuracies further. The computation times are low, but not quite as fast as MBO, which uses only 10 iterations to solve the problem in an approximate manner. The \(\text {Log}_{10}\) plots of the binary difference versus iteration, depicted in Fig. 7, show that the binary difference converges to an extremely small number.

The results of the data set are visualized in Fig. 4. For the visualization procedure, we use the first and the sixth eigenvector of the graph Laplacian. The dimension of each of the eigenvectors is \(N \times 1\), and each node of the data set is associated with a value of each of the vectors. One way to visualize a classification of a data set such as MNIST, which consists of a collection of images, is to plot the values of one eigenvector of the graph Laplacian versus another and use colors to differentiate classes in a given segmentation. In this case, the plots in Fig. 4 graph the values of the first versus the sixth eigenvector (of the graph Laplacian) for the nodes of classes 4 and 9 only. The blue and red regions represent nodes of classes 4 and 9, respectively; the green region represents misclassified points.

Moreover, we compare our results to those of other methods in Table 1, where our method’s name is written in bold. Note that algorithms such as linear and nonlinear classifiers, boosted stumps, support vector machines and both neural and convolutional nets are all supervised learning approaches, which use around 60,000 of the images as a training set (\(86\%\) of the data) and 10,000 images for testing. However, we use only \(3.57\%\) (or less) of our data as supervised training points and obtain classification results that are either competitive with or better than those of some of the best methods. Moreover, note that no preprocessing was performed on the data, as was needed for some of the methods we compare with; we worked with the raw data directly.

5.1.2 Three Moons Data Set

We created a synthetic data set, called the three moons data set, to test our method. The set is constructed as follows. First, consider three half circles in \(\mathbb {R}^2\). The first two are top halves of unit circles with centers at (0, 0) and (3, 0). The third is a bottom half circle with radius 1.5 and center at (1.5, 0.4). A thousand points are sampled from each of the three half circles and embedded in \(\mathbb {R}^{100}\) by adding Gaussian noise with standard deviation 0.14 to each of the 100 components of each embedded point. The goal is to segment the circles using a small number of supervised points from each class; thus, this is a 3-class segmentation problem. The noise and the fact that the points are embedded in high-dimensional space make this difficult.
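The construction just described can be reproduced along the lines of the following sketch; the parameter defaults follow the text, while the function name and the random seed are ours.

```python
import numpy as np

def three_moons(n_per_moon=1000, dim=100, noise_std=0.14, seed=0):
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, size=(3, n_per_moon))
    moons = [
        np.column_stack((np.cos(t[0]), np.sin(t[0]))),                          # top half, center (0, 0), radius 1
        np.column_stack((3.0 + np.cos(t[1]), np.sin(t[1]))),                    # top half, center (3, 0), radius 1
        np.column_stack((1.5 + 1.5 * np.cos(t[2]), 0.4 - 1.5 * np.sin(t[2]))),  # bottom half, center (1.5, 0.4), radius 1.5
    ]
    X = np.zeros((3 * n_per_moon, dim))
    X[:, :2] = np.vstack(moons)
    X += rng.normal(scale=noise_std, size=X.shape)   # Gaussian noise in all 100 components
    labels = np.repeat(np.arange(3), n_per_moon)
    return X, labels
```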

We construct the graph as follows: each point is a node, described by the feature vector consisting of the 100 coordinates of the point. To compute the distance component of the weight function for pairs of nodes, we use these feature vectors. The weight matrix is computed using the Zelnik–Manor and Perona weight function (64) with local scaling using the 10th nearest neighbor. The parameter c was 0.1.

Fig. 5 Three moons results. a Ground truth; b proposed result; c MBO result

The results for this data set are visualized in Fig. 5, and the accuracies are shown in Table 1. This is the only data set where the proposed approach obtained a lower accuracy than MBO. For this particular example, the global minimizer does not seem to be the best in terms of accuracy, which is a fault of the model rather than of the optimization procedure.

Fig. 6 COIL results. These graphs visualize the values of the first versus the third eigenvector of the graph Laplacian. The results of the classification are labeled by different colors: a Ground truth; b MBO result (non-randomly selected supervised points); c proposed result (non-randomly selected supervised points) (Color figure online)

5.1.3 COIL

We evaluated our performance on the benchmark COIL data set [25, 73] from the Columbia University Image Library. This is a set of color \(128 \times 128\) images of 100 objects, taken at different angles. The red channel of each image was downsampled to \(16 \times 16\) pixels by averaging over blocks of \(8 \times 8\) pixels. Then, 24 of the objects were randomly selected and partitioned into six classes. Discarding 38 images from each class leaves 250 per class, giving a data set of 1500 data points and 6 classes.

We construct the graph as follows: each image is a node. We apply PCA to project each image onto 241 principal components; these components form the feature vectors. The vectors are used to calculate the distance component of the weight function. The weight matrix is computed using the Zelnik–Manor and Perona weight function (64) with local scaling using the 4th nearest neighbor. The parameter c used was 0.03.
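The preprocessing described above (block-averaging the red channel to \(16 \times 16\) and projecting onto 241 principal components) could be reproduced along the lines of the following sketch; images_red is a hypothetical array of shape (n_images, 128, 128) standing in for the real red channels.

```python
import numpy as np

def downsample_block_mean(images_red, block=8):
    """Average over non-overlapping block x block patches: 128x128 -> 16x16."""
    n, h, w = images_red.shape
    return images_red.reshape(n, h // block, block, w // block, block).mean(axis=(2, 4))

def pca_project(X, n_components=241):
    """Project flattened images onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

images_red = np.random.rand(1500, 128, 128)          # stand-in data for illustration only
features = pca_project(downsample_block_mean(images_red).reshape(1500, -1))
```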

Resulting accuracies are shown in Table 1, indicating that our method outperforms local minimization approaches and is comparable to or better than some of the other best existing methods. The results of the data set are visualized in Fig. 6; the procedure used is similar to that of the MNIST data set visualization procedure. The plots in the figure graph the values of the first versus the third eigenvector of the graph Laplacian. The results of the classification are labeled by different colors.

5.1.4 Landsat Satellite Data Set

We also evaluated our performance on the Landsat Satellite data set, obtained from the UCI Machine Learning Repository [4]. This is a hyperspectral data set which is composed of sets of multispectral values of pixels in 3 \(\times \) 3 neighborhoods in a satellite image; the portions of the electromagnetic spectrum covered include near-infrared. The goal is to predict the classification of the central pixel in each element of the data set. The six classes are red soil, cotton crop, gray soil, damp gray soil, soil with vegetation stubble and very damp gray soil. There are 6435 nodes in the data set.

We construct the graph as follows. The UCI Web site provides a 36-dimensional feature vector for each node. The feature vectors are used to calculate the distance component of the weight function. The weight matrix is computed using the Zelnik–Manor and Perona weight function (64) with local scaling using the 4th nearest neighbor. The parameter c used was 0.3.

Table 1 includes a comparison of our method to some of the best methods (most cited in [70]). One can see that our results are of higher accuracy. Note that, except for the GL and MBO algorithms, all other algorithms we compare against on the Landsat satellite data set are supervised learning methods, which use 80% of the data for training and 20% for testing. Our method was able to outperform these algorithms while using a very small percentage of the data set (10%) as supervised points. Even with 5.6% supervised points, it outperforms all but one of the aforementioned methods.

5.1.5 Non-uniform Distribution of Supervised Points

In all previous experiments, the supervised points have been sampled randomly from all the data points. To test the algorithms in more challenging scenarios, we introduce some bias in the sampling of the supervised points, which is also a more realistic situation in practice. We used two different data sets for this test: the MNIST data set and the COIL data set.

In the case of the MNIST data set, we chose the supervised points non-randomly for digits 4 and 9 only. To obtain the non-randomness, we allowed a point to be chosen as supervised only if it had a particular range of values for the second eigenvector. This resulted in a biased distribution of the supervised points. The results for this experiment were the following: For the max-flow algorithm, the overall accuracy was \(97.734\%\), while for digits 4 and 9, it was \(96.85\%\). For comparison, the non-convex MBO algorithm [38] gave an accuracy of \(95.60\%\) overall, but \(89.71\%\) for digits 4 and 9. The MBO method was also a bit more unstable in its accuracy with respect to different distributions of the supervised points. The max-flow algorithm was very stable, with a very small standard deviation for a set of accuracies for different supervised point distributions.

In the case of the COIL data set, we chose the supervised points non-randomly for classes 2 and 6. The non-randomness was achieved in the same way as for the MNIST data set. The results were the following: The overall accuracy of the max-flow was \(92.69\%\), while for classes 2 and 6, it was \(90.89\%\). The MBO algorithm [38] gave an accuracy of \(83.90\%\) overall, but \(77.24\%\) for classes 2 and 6.

These results are summarized in Table 2 and are visualized in Figs. 4 and 6 for MNIST and COIL data sets, respectively.

5.1.6 Experiments with Size Constraints and Penalty Term

The exact size constraints (5) could improve the accuracies if knowledge of the exact class sizes is available. However, exact knowledge of the class sizes is rarely available in practice, and this was the motivation behind developing the flexible constraints (7) and the penalty term (8). In order to simulate the case where only an estimate of the class sizes is known, we perturb the exact class sizes by a random number ranging between 1 and 20 \(\%\) of ||V|| / n. The lower and upper bounds in (7) and (8) are centered around the perturbed class size, and the difference between them is chosen based on the uncertainty of the estimation, which we assume to be known. More specifically, denoting the exact class size \(c_i\), the perturbed class size \({\tilde{c}}_i\) is chosen as a random number in the interval \([c_i-p,c_i+p]\). In experiments, we select p as 1, 10 and 20 \(\%\) of ||V|| / n. The lower and upper bounds in the flexible size constraint (7) and the penalty term (8) are chosen as \(S^\ell _i = {\tilde{c}}_i - p\) and \(S^u_i = {\tilde{c}}_i +p\). The parameter \(\gamma \) in the penalty term is set to 10 for all data sets.
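The perturbation of the class sizes and the resulting bounds \(S^\ell _i\), \(S^u_i\) can be generated as in the following sketch, where class_sizes holds the exact sizes \(c_i\), n_nodes is ||V|| and frac is p expressed as a fraction of ||V||/n; the function name and seed are ours.

```python
import numpy as np

def perturbed_size_bounds(class_sizes, n_nodes, frac=0.10, seed=0):
    """Return lower/upper bounds S^l_i, S^u_i around randomly perturbed class sizes."""
    rng = np.random.default_rng(seed)
    class_sizes = np.asarray(class_sizes, dtype=float)
    p = frac * n_nodes / len(class_sizes)                # p as 1%, 10% or 20% of |V|/n
    c_tilde = class_sizes + rng.uniform(-p, p, size=len(class_sizes))
    return c_tilde - p, c_tilde + p
```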

We run the algorithm for each choice of p several times, with a different random selection of the perturbed class size \({\tilde{c}}_i\) each time. The average accuracies over all the runs for each choice of p are shown in Table 3. The flexible size constraints or penalty term improve the accuracy compared to the case when no size information is given, as shown in Table 1. Note that the accuracy also improves in cases of great uncertainty in the class size estimates (\(p = 20 \%\)). The exact size constraints, by contrast, are not suitable when the exact class sizes are not known, as imposing them significantly reduces the accuracies in those cases (Table 4).

Table 1 Accuracy compared to ground truth of the proposed algorithm versus other algorithms
Table 2 Accuracies in case of non-uniformly distributed supervised points
Table 3 Accuracies for experiments with class size incorporation
Table 4 Timing results (in seconds)

5.1.7 Summary of Experimental Results

Experimental results on the benchmark data sets, shown in Table 1, indicate a consistently higher accuracy of the proposed convex algorithm than related local minimization approaches based on the MBO or Ginzburg–Landau scheme. The improvements are especially significant when the supervised points are not uniformly distributed among the data set, as shown in Table 2. On one synthetic data set, ‘three moons,’ the accuracy of the new algorithm was slightly worse, indicating that the global minimizer was not the best in terms of accuracy for this particular toy example. Table 5 shows that the new algorithm reaches the lowest energy in all of the experiments, further indicating that MBO and Ginzburg–Landau are not able to converge to the global minimum. Table 1 shows that the accuracies of the proposed algorithm are also highly competitive against a wide range of other established algorithms, even when substantially less training data is used than by those algorithms. Table 3 shows that the flexible size constraints (7) and penalty term (8) can improve the accuracy if a rough estimate of the class sizes is given.

The binary difference (63), plotted on log-scale against the iteration count, is depicted for each experiment in Fig. 7. For experiments without any size information, the average binary difference tends to less than \(10^{-16}\), which is vanishingly low and more or less indicates that an exact global minimizer has been obtained. For experiments with size constraints or penalty terms, the binary difference also gets very low, although not as low. This indicates convergence to at least a very close approximation of the global minimizer. These observations agree well with the theoretical results in Sect. 3.4, where the strongest results were also obtained in case of no size information.

Table 5 Initial and final energy
Fig. 7 \(\text {Log}_{10}\) plot of binary difference \(b(u^k)\) versus iteration count. a Algorithm without size constraints; b algorithm with flexible constraints (7) and penalty term (8) acting on class sizes

Note that far more iterations than necessary were used to generate the binary difference plots. In practice, the algorithm reaches sufficient stability in 100–300 iterations. The CPU times, summarized in Table 4, indicate fast convergence of the new algorithm, much faster than GL, although not quite as fast as the MBO scheme. It must be noted that MBO is an extremely fast front propagation algorithm that uses only a few (e.g., 10) iterations, but its accuracy is limited due to the large step sizes. A deeper discussion of the number of iterations needed to reach the exact solution after thresholding is given at the end of the next section on point cloud segmentation.

5.2 Segmentation of 3D Point Clouds

The energy function (9) that combines region homogeneity terms and dissimilarity across region boundaries will be demonstrated for segmentation of unstructured 3D point clouds, where each point is a vertex in V. Point clouds arise for instance through laser-based range imaging or multiple view scene reconstruction. The results of point cloud segmentation are easy to visualize and the choice of each term in the energy function will have a clear intuitive meaning that may be translated to other graph-based classification problems in the future. We focus especially on point clouds acquired through the concept of laser detection and ranging (LaDAR) in outdoor scenarios. A fundamental computer vision task is to segment such scenes into classes of similar objects. Roughly, some of the most common object classes in an outdoor scene are the ground plane, vegetation and human-made ‘objects’ with a certain regular structure.

5.2.1 Construction of the Energy Function

We construct the graph by connecting each node to its k-nearest neighbors (kNN) based on the Euclidean distance, as described at the beginning of Sect. 2. In experiments, we set \(k=20\). We construct region terms that favor homogeneity of geometrical features based on a combination of point coordinates, normal vectors and variation of normal vectors. The construction is a concrete realization of the general region terms introduced in [59, 60, 84]. We also propose to use a contour term that favors alignment of the region boundaries at ‘edges,’ indicated by sharp discontinuities of the normal vectors. Our model can be seen as a point cloud analogue of variational models for traditional image segmentation that combine region- and edge-based features in a single energy functional [15, 39, 48]. In contrast to the works [2, 72, 87], our model does not rely on training data.

Normal vectors in a point cloud can be estimated from principal component analysis locally around each point, as in, e.g., [31, 55, 60]. For each point \(x \in V\), let \(y^1,...,y^m\) denote the set of neighboring points and define for notational convenience \(y^0 = x\). Define the centered vectors \({\bar{y}}^i = y^i - \text {mean}(y^0,y^1,...,y^m)\) for \(i=0,1,...,m\) and construct the matrix

$$\begin{aligned} \mathbf{Y } = [{\bar{y}}^0 {\bar{y}}^1 {\bar{y}}^2 ... {\bar{y}}^m]. \end{aligned}$$
(65)

Let \(\mathbf{v }^1(x),\mathbf{v }^2(x),\mathbf{v }^3(x)\) be the eigenvectors and \(\lambda ^1(x),\lambda ^2(x),\lambda ^3(x)\) the corresponding eigenvalues, sorted in increasing order, of the correlation matrix \(\mathbf{Y } \mathbf{Y }^T\). The first eigenvector \(\mathbf{v }^1(x)\) points in the direction of least variation between the points \({\bar{y}}^1 , ... , {\bar{y}}^m\), and the first eigenvalue \(\lambda ^1(x)\) indicates the variation along the direction of \(\mathbf{v }^1(x)\).

The vector \(\mathbf{v }^1(x)\) is consequently a discrete estimate of the normal vector at x, and the first eigenvalue \(\lambda ^1(x)\) indicates to what extent the normal vectors vary locally around the point x. If all the points were lying in a plane, then \(\lambda ^1(x)\) would be zero and \(\mathbf{v }^1(x)\) would be the normal vector of the plane.
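A minimal sketch of this local PCA estimation is given below: for each point, the matrix (65) is formed from the centered neighborhood and the eigenvector of \(\mathbf{Y}\mathbf{Y}^T\) with smallest eigenvalue is taken as the normal. The array neighbors is assumed to hold the kNN indices of each point; the names are illustrative.

```python
import numpy as np

def estimate_normals(points, neighbors):
    """points: (n, 3) coordinates; neighbors: (n, k) integer array of kNN indices."""
    normals = np.zeros_like(points)
    lambda1 = np.zeros(len(points))
    for i in range(len(points)):
        Y = points[np.r_[i, neighbors[i]]]          # the point and its neighbors
        Y = (Y - Y.mean(axis=0)).T                  # columns are centered vectors, eq. (65)
        evals, evecs = np.linalg.eigh(Y @ Y.T)      # eigenvalues in ascending order
        normals[i] = evecs[:, 0]                    # direction of least variation: v^1(x)
        lambda1[i] = evals[0]                       # local variation: lambda^1(x)
    return normals, lambda1
```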

The region term for region \(V_i\) can be constructed to be small at the point x if the value of \(\lambda ^1(x)\) is close to the expected value \(\lambda _i\) of region i, and be large otherwise. This can be achieved by requiring the following term to be small

$$\begin{aligned} \big | \lambda ^1(x) - \lambda _i \big |^2 \, , \quad \forall x \in V, \quad i \in \{v, h, g\}. \end{aligned}$$
(66)

For instance, \(\lambda _v\), \(\lambda _h\) and \(\lambda _g\) for vegetation, human-made objects and the ground plane can be estimated from measurements. Note that their particular values depend on characteristics of the LaDAR, such as the angular scanning resolution and depth resolution. If an estimate of \(\lambda _i\) is not known, \(\lambda _i\) could be made part of the minimization problem, in a similar way to the mean intensity values in the Chan–Vese model [23].

Furthermore, region terms can be constructed for discriminating regions where the normal vectors are oriented either parallel with or perpendicular to a specific direction \(\mathbf{n }^i\), by requiring the following terms to be small, respectively,

$$\begin{aligned} - |\mathbf{v }^1(x) \cdot \mathbf{n }^i|, \quad \quad |\mathbf{v }^1(x) \cdot \mathbf{n }^i|. \end{aligned}$$

For instance, the normal vectors of the ground plane are expected to point predominantly in the upward direction. The ground plane can also be characterized by its height, defined by the \(x_3\)-coordinate of the points, which is generally lower than the heights of other objects in the nearby surroundings. Assuming a rough estimate of the local height of the ground plane \(h^*(x)\) at the point x is known, the fidelity term (9) can be modified to take into account both normal vectors and height by requiring the following term to be small

$$\begin{aligned} - |\mathbf{v }^1(x) \cdot \mathbf{n }^i(x)| + H(x_3, h^*(x)), \end{aligned}$$
(67)

where H is an increasing function in \(x_3\) and, in addition, \(H(h^*(x),h^*(x)) = 0\). We have used the term \(H(x_3, h^*(x)) = \theta (x_3 - h^*(x))\) and simply estimated \(h^*(x)\) as the average \(x_3\) coordinate of the points in the neighborhood of x.

The weight function w is constructed to encourage spatial grouping of the points, and so that it is favorable to align the border between regions at locations where the normal vectors change from pointing upwards to pointing outwards, i.e., where the scene is convex-shaped. In contrast, locations where the scene is concave, such as the transition from the side of a building to its roof, should be unfavorable for the region boundaries. Such assumptions can be incorporated by modifying the Gaussian weight function (1) as follows:

$$\begin{aligned} w(x,y)= e^{-\frac{\mathrm{d}(x,y)^{2}}{\sigma ^{2}} + \gamma \frac{v^1_3(y)-v^1_3(x)}{d(x,y)}\text {SIGN}(y_1 - x_1)} \end{aligned}$$
(68)

Here \(v_1^1(x)\) and \(v_3^1(x)\) are the first and third components of the vector \(\mathbf{v }^1(x)\), and a coordinate system has been assumed where the positive \(x_1\) axis points outwards from the view direction and the positive \(x_3\) axis points upwards. An illustration is given in Fig. 8, where edges at convex parts of the scene are given a low energy value, marked by the color code of light blue.
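A single edge weight of the form (68) can be evaluated as in the sketch below, where x, y are the 3D coordinates of two connected points and n3_x, n3_y are the third components \(v^1_3\) of their estimated normal vectors; the function name is illustrative and distinct points are assumed so that d(x, y) > 0.

```python
import numpy as np

def point_cloud_weight(x, y, n3_x, n3_y, sigma, gamma):
    """Edge weight (68): Gaussian distance term plus convexity-sensitive term."""
    d = np.linalg.norm(y - x)
    convexity = gamma * (n3_y - n3_x) / d * np.sign(y[0] - x[0])
    return np.exp(-d ** 2 / sigma ** 2 + convexity)
```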

Fig. 8 Illustration of the construction of a graph to separate the ground plane from human-made structures, viewed from the side. The edges are assigned a low energy at convex parts of the scene, marked in light blue, making it favorable to place the boundary between the regions at such locations (Color figure online)

Taking the above information into account, an example of how the different region terms can be constructed for the ground plane, human-made structures and vegetation, respectively, is

$$\begin{aligned} f_g(x)&= (1-C)\big | \lambda ^1(x) - \lambda _g \big |^2 \nonumber \\&\quad + C \big \{ - |\mathbf{v }^1(x) \cdot \mathbf{n }^g(x)| + H(x_3,h^*(x)) \big \}, \end{aligned}$$
(69)
$$\begin{aligned} f_h(x)&= (1-C) \big | \lambda ^1(x) - \lambda _h \big |^2 \nonumber \\&\quad + C |\mathbf{v }^1(x) \cdot \mathbf{n }^g(x)| , \end{aligned}$$
(70)
$$\begin{aligned} f_v(x)&= C \big | \lambda ^1(x) - \lambda _v \big |^2. \end{aligned}$$
(71)

Here, \(C \in (0,1)\) is a parameter that balances considerations between variation and direction/height. In experiments, we set \(\lambda _g = \lambda _h\) and set C to a low value so that only vegetation is distinguished from the other regions by the value of \(\lambda ^1\). In some experiments, we also use two regions for vegetation with two different values of \(\lambda _i\). This makes it possible to distinguish different kinds of vegetation: vegetation with leaves or needles tends to have a lower mean value \(\lambda _i\) than vegetation without them. Smoke can also be characterized by its irregular surface, and its region term is constructed as in (66) with an intermediate value of \(\lambda _i\).
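Putting the pieces together, the region terms (69)–(71) can be evaluated for all points at once as in the following sketch, which reuses the normals and \(\lambda ^1\) values computed above and assumes the upward direction (0, 0, 1) for \(\mathbf{n }^g\) and the linear height penalty \(H(x_3, h^*) = \theta \,(x_3 - h^*)\) used in the text; parameter names are ours.

```python
import numpy as np

def region_terms(points, normals, lambda1, h_star, lam_g, lam_h, lam_v, C=0.1, theta=1.0):
    """Region terms (69)-(71) for ground plane, human-made structures and vegetation."""
    n_g = np.array([0.0, 0.0, 1.0])             # assumed upward direction n^g
    align = np.abs(normals @ n_g)               # |v^1(x) . n^g|
    H = theta * (points[:, 2] - h_star)         # height penalty H(x_3, h*(x))
    f_g = (1 - C) * (lambda1 - lam_g) ** 2 + C * (-align + H)
    f_h = (1 - C) * (lambda1 - lam_h) ** 2 + C * align
    f_v = C * (lambda1 - lam_v) ** 2
    return f_g, f_h, f_v
```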

5.2.2 Experiments

Some illustrative experiments are shown in Figs. 9 and 10. Ordinary photographs of the scenes are shown on the top and the red rectangles indicate the areas that have been scanned by the LaDAR. The point clouds have been segmented into three regions as described above and the results are visualized by brown color for points assigned to the ground plane region, green color for points assigned to the vegetation region and blue color for points assigned to the region of human-made objects. In Fig. 11, vegetation with and without leaves is indicated by dark and light green, respectively.

Fig. 9 a Scanning area of LaDAR. b, c Segmentation of the acquired point cloud, consisting of 93,641 points, into three regions: ground plane (brown), vegetation (green) and human-made objects (blue). a Scanning area; b segmentation, view from front; c segmentation, view from top (Color figure online)

Fig. 10 a Scanning area of LaDAR. b, c Segmentation of the acquired point cloud, consisting of 80,937 points, into three regions: ground plane (brown), vegetation (green) and human-made objects (blue). a Scanning area; b segmentation, view from front; c segmentation, view from top (Color figure online)

Fig. 11 Top: Scanning area of LaDAR. Bottom: Segmentation of the acquired point cloud, consisting of 14,002 points, into four regions: ground plane (brown), human-made objects (blue), and vegetation with (light green) and without (dark green) leaves/needles (Color figure online)

Fig. 12 Top: Scanning area of LaDAR. Bottom: Segmentation of a point cloud (81,551 points) into smoke (gray), vegetation (green) and human-made structures (blue). a Scanning area; b segmentation result (Color figure online)

It can be observed that the algorithm leads to consistent results even though these scenes are particularly challenging: the tilt and height of the ground plane vary greatly across the scene due to the hilly landscape, and some of the trees and bushes are completely aligned with, and touch, the buildings. Note that buildings hidden behind vegetation get detected, since the laser pulses are able to partially penetrate the leaves. A misassignment can be observed in the middle of Fig. 10, where only the roof of one of the buildings is visible due to occlusions. Since no points are observed from the wall of the building, the roof gets assigned to the ground plane region. Some large rocks in Fig. 10 also get assigned to the blue region due to their steep and smooth surfaces (Fig. 12).

As was the case for the semi-supervised classification experiments, the approximation errors of the convex relaxation practically vanish. Figure 13c depicts the binary difference (63) as a function of the iteration count for the experiment shown in Fig. 10. As can be seen, the solution of the convex relaxation converges to a binary function; after 10,000 iterations, the average binary difference (63) was \(5.74\times 10^{-10}\). Note, however, that far fewer iterations are necessary before the thresholded function stabilizes at the global minimum. Figure 13 (left) depicts the energy evolution as a function of the iteration count for the relaxed solution (blue), the thresholded solution with scheme (26) (red) and the thresholded solution with scheme (40) (green). Figure 13 (right) depicts a log plot of the relative energy precision \(\frac{|E^i - E^*_{relaxed}|}{E^*_{relaxed}}\), where \(E^*_{relaxed}\) is the global minimum of the relaxed problem, estimated by 10,000 iterations of the algorithm, and \(E^i\) is the energy at iteration i of the relaxed solution (blue), the thresholded solution with scheme (26) (red) and the thresholded solution with scheme (40) (green). This plot demonstrates that the binary solution obtained by the thresholding scheme (26) stabilizes after about 300 iterations, after which the energy is within \(10^{-16}\) of the energy of the ground truth solution of the relaxed problem estimated at iteration 10,000. The threshold scheme (40) takes more iterations before stabilizing, but also eventually converges to the correct solution, after about 3500 iterations. The CPU times of the experiments were in the range 5–15 s on an Intel i5-4570 3.2 GHz CPU for point clouds with around 80,000 points. For comparison, the inference step of the related MRF approaches [2, 72, 87] took around 9 minutes for a scan with around 30,000 points, as reported in [72], although of course on older hardware. The proposed algorithm is also suitable for parallel implementation on GPU, as discussed at the end of Sect. 4.2.

Fig. 13 a, b Energy evolution of u (blue), \(u^T\) with threshold scheme (26) (red), and \(u^T\) with threshold scheme (40) (green) for the experiment in Fig. 10. a Primal energies; b Log of \(\frac{|E^i - E^*_{relaxed}|}{E^*_{relaxed}}\); c Binary difference (63) (Color figure online)

6 Conclusions

Variational models on graphs have been shown to be highly competitive for various data classification problems, but they are inherently difficult to handle from an optimization perspective, due to NP-hardness except in some restricted special cases. This work has developed an efficient convex algorithmic framework for a set of classification problems with multiple classes involving graph total variation, region homogeneity terms, supervised information and certain constraints or penalty terms acting on the class sizes. Particular problems that could be handled as special cases included semi-supervised classification of high-dimensional data and unsupervised segmentation of unstructured 3D point clouds. The latter involved minimization of a novel energy function enforcing homogeneity of point coordinate-based features within each region, together with a term aligning the region boundaries along edges. Theoretical and experimental analysis revealed that the convex algorithms were able to produce vanishingly close approximations to the global minimizers of the original problems in practice.

Experiments on benchmark data sets for semi-supervised classification resulted in higher accuracies of the new algorithm compared to related local minimization approaches. The accuracies were also highly competitive against a wide range of other established algorithms. The advantages of the proposed algorithm were particularly prominent in case of sparse or non-uniformly distributed training data. The accuracies could be improved further if an estimate of the approximate class sizes were given in advance. Experiments also demonstrated that 3D point clouds acquired by a LaDAR in outdoor scenes could be segmented into object classes with a high degree of accuracy, purely based on the geometry of the points and without relying on training data. The computational efficiency was at least an order of magnitude faster than related work reported in the literature.

In the future, it would be interesting to investigate region homogeneity terms for general unsupervised classification problems. In addition to avoiding the problem of trivial global minimizers, the region terms may improve the accuracy compared to models based primarily on boundary terms. Region homogeneity may, for instance, be defined in terms of the eigendecomposition of the covariance matrix or graph Laplacian.