Assessing hierarchies by their consistent segmentations

Current approaches to generic segmentation start by creating a hierarchy of nested image partitions and then specifying a segmentation from it. Our first contribution is to describe several ways, most of them new, for specifying segmentations using the hierarchy elements. Then, we consider the best hierarchy-induced segmentation specified by a limited number of hierarchy elements. We focus on a common quality measure for binary segmentations, the Jaccard index (also known as IoU). Optimizing the Jaccard index is highly non-trivial, and yet we propose an efficient approach for doing exactly that. This way we get algorithm-independent upper bounds on the quality of any segmentation created from the hierarchy. We found that the obtainable segmentation quality varies significantly depending on the way that the segments are specified by the hierarchy elements, and that representing a segmentation with only a few hierarchy elements is often possible. (Code is available).


Introduction
Generic (i.e., non-semantic) image segmentation is widely used in various tasks of image analysis and computer vision. A variety of image segmentation methods have been proposed in the literature, including the watershed method [1], the level-set method [2], normalized cuts [3], and many others. Modern generic segmentation algorithms use (deep) edge detectors and watershed-like merging [4]. Augmenting the detected edges with region descriptors improves segmentations [5]. Note that generic image segmentation, the topic of this paper, is different from semantic image segmentation, which segments objects from specific classes with the help of (deep) image classifiers [6,7].
Segmentation (generic or semantic) is useful for numerous applications, such as image enhancement [8], image analysis [9], and medical image analysis [10].
The dominant generic segmentation algorithms (e.g., [4]) are hierarchical and built as follows: first, an oversegmentation is carried out, specifying superpixels as the elements to be grouped. Then a hierarchical structure (usually represented by a tree) is constructed with the superpixels as its smallest elements (i.e., leaves). The regions specified by the hierarchy are the building blocks from which the final segmentation is decided. Restricting the building blocks to the elements of the hierarchy yields simple, effective algorithms at a low computational cost. Most segmentation methods build the segmentation from the hierarchy by choosing a cut from a limited cut set. Our first contribution is to generalize this choice. We systematically consider all possible ways of specifying a segmentation using set operations on elements of the hierarchy. Most of these methods are new.
We are also interested in the limitations imposed on the segmentation quality by using the hierarchy-based approach. These limitations depend on (1) the quality of the hierarchy, (2) the number of hierarchy elements (nodes) that may be used, and (3) the way that these elements are combined. We investigate all these factors in this paper. The quality is also influenced by the oversegmentation quality, which was studied elsewhere [11].
The number of hierarchy elements determines the complexity of specifying a segmentation. Lower complexity is advantageous by the minimum description length (MDL) principle, which minimizes a cost composed of the description cost and the approximation cost, and relies on statistical justifications [12][13][14][15][16]. Moreover, representation by a small number of elements opens possibilities for a new type of segmentation algorithms that are based on search, for example, in contrast to the current greedy algorithms. The number of elements needed also indicates, in a sense, how much information about the segmentation is included in the hierarchy. Thus, it provides a measure of quality for the hierarchy as an image descriptor, as well as a global measure of the associated boundary operator.
To investigate the hierarchy-induced limitations, we optimize the segmentation constructed from elements of a given hierarchy. We consider binary segmentation and use the Jaccard index (IoU) measure of quality [17]. More precisely, we use image-dependent oversegmentation and hierarchies produced by algorithms that have access only to the image. However, we allow the final stage, which constructs the segmentation from the hierarchy elements, to have access to the ground-truth segmentation. As a result, the imperfections of the optimized segmentation correspond only to its input, i.e., to the hierarchy. Thus, the results we obtain are upper bounds on the quality that may be achieved by any realistic algorithm that does not have access to the ground truth but relies on the same hierarchy.
Optimizing the Jaccard index is highly non-trivial, but we provide a framework that optimizes it exactly and efficiently. Earlier studies either use simplistic quality measures or rely on facilitating constraints [18,19].
The contributions of this work are:
1. Four different methods for specifying a hierarchy-induced segmentation. These methods are denoted (segmentation-to-hierarchy) consistencies.
2. Efficient and exact algorithms for finding the best segmentation (in the sense of maximizing the Jaccard index) that is consistent with a given hierarchy. We provide four algorithms, one for each consistency. The algorithms are fast, even for large hierarchies.
3. A characterization of the limits of hierarchy-induced segmentation. Notably, this characterization is also a measure of the hierarchy quality.
This paper considers segmentation of images, but all the results apply as well to the partition of general data sets.
The paper continues as follows. First, we describe the terms and notations required for specifying the task (Section 2). In Section 3, we present our goal and discuss the notion of consistencies, which is central to this paper. In Section 4, we review several related works. In Section 5, we develop an indirect optimization approach that relies on the notion of co-optimality and enables us to optimize certain quality measures. Section 6 provides particular optimization algorithms and the corresponding upper bounds for the Jaccard index and the different consistencies. The bounds are evaluated empirically in Section 7, which also provides some typical hierarchy-based segmentations. Finally, we conclude and suggest some extensions in Section 8.


Preliminaries

Hierarchies
The following definitions and notations are standard, but are presented here for the sake of completeness. Recall that a partition of a set I is a set of non-empty subsets of I, such that every element of I is in exactly one of these subsets (i.e., I is a disjoint union of the subsets). In this paper, these subsets are referred to as regions. Moreover, all examples use connected regions, but the connectivity constraint is not needed for the theory and algorithms. Let π 1 and π 2 be two partitions of a pixel set I. Partition π 1 is finer than partition π 2 , denoted π 1 ≤ π 2 , if each region of π 1 is included in a region of π 2 . In this case, we also say that π 2 is coarser than π 1 . A hierarchy of partitions is a chain of nested partitions Π = (π 0 , π 1 , . . . , π n ), π 0 ≤ π 1 ≤ · · · ≤ π n , where π 0 is the finest partition and π n is the trivial partition of I into a single region: π n = {I}. A hierarchy T is a pool of regions of I, called nodes, that are provided by the elements of Π: T = π 0 ∪ π 1 ∪ · · · ∪ π n . For any two partitions from Π, one is finer than the other; hence, any two nodes are either disjoint or nested. Let N 1 and N 2 be two different nodes of T . We say that N 1 is the parent of N 2 if N 2 ⊂ N 1 and there is no other node N 3 ∈ T such that N 2 ⊂ N 3 ⊂ N 1 . In this case, we also say that N 2 is a child of N 1 . Note that every node has exactly one parent, except I ∈ π n , which has no parent. Hence, for every node N ∈ T , there is a unique chain of nested nodes leading from N to the root: N ⊂ · · · ⊂ I. Thus, the parenthood relation induces a representation of T by a tree, in which the nodes of π 0 are the leaves and the single node of π n is the root; see Figure 1. Hence, we also refer to T as a tree. When each non-leaf node of T has exactly two children, T is a binary partition tree (BPT) [18][19][20][21]. In this paper, we focus on BPTs, but our results hold for non-binary trees as well.
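The definitions above can be checked mechanically. A minimal sketch in Python, using a made-up toy pixel set and partitions (the names `finer`, `pi0`, etc. are illustrative, not from the paper):

```python
# A toy check of the "finer than" relation and of the disjoint-or-nested
# property of hierarchy nodes; regions are frozensets of pixel indices.

def finer(p1, p2):
    """True iff partition p1 <= p2: every region of p1 lies inside a region of p2."""
    return all(any(r1 <= r2 for r2 in p2) for r1 in p1)

# A chain of partitions of I = {0, 1, 2, 3}: pi0 <= pi1 <= pi2.
pi0 = [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({3})]
pi1 = [frozenset({0, 1}), frozenset({2, 3})]
pi2 = [frozenset({0, 1, 2, 3})]          # trivial partition {I}

assert finer(pi0, pi1) and finer(pi1, pi2)

# The hierarchy T pools the nodes of all the partitions; any two nodes are
# either disjoint or nested:
T = set(pi0) | set(pi1) | set(pi2)
assert all(a <= b or b <= a or not (a & b) for a in T for b in T)
```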
A hierarchy T can be represented by a dendrogram, and every possible partition of I corresponds to a set of T 's nodes and may be obtained by "cutting" the dendrogram; see Figure 1. In the literature, any partition of I into nodes of T is called a cut of the hierarchy [22,23]. Every π i ∈ Π is a horizontal cut of the hierarchy, but there are many other ways to cut the hierarchy, and each cut specifies a partition of I. As we shall see later, a hierarchy may induce other partitions of I.
Pruning a tree at some node N is the removal from the tree of the entire subtree rooted at N, except N itself, which becomes a leaf. Each cut of a hierarchy represents a tree T ′ obtained by pruning T , by specifying the leaves of T ′ ; see Figure 1. The converse is also true: the leaves of a tree obtained by pruning T are a cut of the hierarchy. That is, a subset of nodes N ⊂ T is a cut of the hierarchy if and only if N is the set of leaves of a tree obtained by pruning T . More precisely, N ⊂ T is a cut of the hierarchy if and only if, for every leaf of T , the only path between it and the root contains exactly one node from N . Often, a segmentation is obtained by searching for the best pruning of T . However, the cardinality of the set of all prunings of T grows exponentially with the number of leaves of T [18]. Thus, it is infeasible to scan this set exhaustively by brute force.
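The exponential growth of the pruning set can be seen in a small recursive enumeration. A sketch, with a made-up toy tree encoding (internal nodes map to their two children; leaves are absent from the dict):

```python
# Enumerating all cuts (equivalently, prunings) of a BPT.
def cuts(tree, node):
    """All cuts of the subtree rooted at `node`, each as a tuple of nodes."""
    result = [(node,)]                      # prune the subtree at `node` itself
    if node in tree:                        # internal node: combine children's cuts
        left, right = tree[node]
        result += [cl + cr for cl in cuts(tree, left) for cr in cuts(tree, right)]
    return result

# Full BPT of depth 3 (8 leaves): node i has children 2i+1 and 2i+2.
tree = {i: (2 * i + 1, 2 * i + 2) for i in range(7)}
n_cuts = len(cuts(tree, 0))
# The count follows the recurrence c(leaf) = 1, c(N) = 1 + c(left) * c(right),
# giving 1, 2, 5, 26 for depths 0..3: exponential in the number of leaves.
```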

Coarsest partitions
We use the following notations. The cardinality of a set is denoted by | · |. The initial partition π 0 of I, which is the set of leaves of the tree T , is denoted by L. Let N ∈ T ; we denote by T N ⊂ T (resp. L N ⊂ L) the subset of nodes of T (resp. L) included in N . Note that T N is represented by the subtree of T rooted at N ; hence, we refer to T N also as a subtree, and to L N as the leaves of this subtree.
Let Y ⊂ I be a pixel subset. We refer to any partition of Y into nodes of T (namely, a subset of disjoint nodes of T whose union is Y ) as a T-partition of Y . Note that a T-partition of Y does not necessarily exist. We refer to the smallest subset of disjoint nodes of T whose union is Y as the coarsest T-partition of Y . Obviously, N ⊂ T is a cut of the hierarchy if and only if N is a T-partition of I. N ⊂ T is a T-partition of a node N ∈ T if and only if N is the set of leaves of a tree obtained by pruning T N . Obviously, the coarsest T-partition of a node N ∈ T is {N }.
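A sketch of the coarsest T-partition, using the disjoint-or-nested property: the maximal nodes of T contained in Y are pairwise disjoint, so it suffices to collect them and check that they cover Y. The toy hierarchy below is made up for illustration:

```python
# Coarsest T-partition of a pixel subset Y (nodes are frozensets of pixels).
def coarsest_T_partition(T, Y):
    """Smallest set of disjoint nodes of T whose union is Y,
    or None if no T-partition of Y exists."""
    inside = [N for N in T if N <= Y]
    maximal = [N for N in inside if not any(N < M for M in inside)]
    covered = frozenset().union(*maximal) if maximal else frozenset()
    return maximal if covered == Y else None

T = [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({3}),
     frozenset({0, 1}), frozenset({2, 3}), frozenset({0, 1, 2, 3})]
part = coarsest_T_partition(T, frozenset({0, 1, 2}))
# {0, 1, 2} is coarsest-partitioned into the two nodes {0, 1} and {2}.
```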
Figure 2 illustrates several ways of representing a region using a hierarchy and the corresponding coarsest partition.
Problem formulation

The general task
As discussed in the introduction, we consider only segmentations that are consistent in some way with a given hierarchy, and aim to find limitations on their quality.
Obviously, the quality improves with the number of regions. More precisely:
General task: Given a hierarchy and a measure for estimating segmentation quality, we want to find a segmentation that has the best quality, is consistent with the hierarchy, and uses no more than a given number of regions from it.
To make this task well-defined, we now specify and formalize the notion of segmentation consistency with a hierarchy.

Various types of consistency between a segmentation and a hierarchy
We consider segmentations whose (not necessarily connected) segments are specified by set operations on the nodes of T . The intersection between two nodes in a hierarchy is either empty or one of the nodes. Therefore, we are left with union and set-difference. Complementing (with respect to I) is allowed as well. By further restricting the particular operations and the particular node subsets on which they act, we get different, non-traditional ways for specifying a segmentation from a hierarchy. We denote these different ways as (hierarchy-to-segmentation) consistencies.
Remark 1. Consistency of some type, with the subset N s , implies consistency of a later, more general type, with the same subset N s .
Remark 2. We argue that these four consistency types systematically cover all possibilities. The first choice is whether the node subset N s should be limited to a hierarchy cut or not. For the cut case (which is the popular choice in the literature), union is the only set operation that makes sense, because the set-difference between disjoint nodes is empty, and the cut covers the full image, making the complement empty as well. For the more general case, where the subset N s is not necessarily a cut, both the union and the set-difference are relevant. Union without set-difference is an important special case that is simpler both conceptually and computationally. Set-difference between two nodes without additional unions does not seem to justify another consistency type (and is included, of course, in d-consistency).
Figure 3 illustrates the different consistencies. The a-consistency is used in most hierarchy-based segmentation algorithms, where some cut is chosen and all its leaves are specified as segments; see [19,24]. To the best of our knowledge, the b-, c-, and d-consistencies have not been used in the context of hierarchical segmentation; see, however, [25] for (c-consistency-like) node selection in a hierarchy of components.
As specified above, the subset N s specified for a segmentation s is not necessarily unique; see Figures 2 and 3(c,d).
Proof sketch: Following Remark 1, consistency of a segmentation according to one type implies its consistency according to the more general types. The converse is also true. Consider a d-consistent segmentation s. Recall that every node in T is a union of disjoint nodes of the initial partition L. A set-difference of nested nodes is still a union of nodes of L (which are a T-partition of this set-difference); see Figure 2(a-c). Hence, there is a subset N s consisting of disjoint nodes. By Property 2, s is also c-consistent.
A subset N s consisting of disjoint nodes can be completed to a partition of I by adding some T-partition of I \ ∪N s ; see Figure 2(d-e). Hence, there is another subset N s that is a cut of the hierarchy. By Property 2, s is also b-consistent. □
Lemma 2 states the somewhat surprising result that the b/c/d consistencies are equivalent. Thus, the set of segmentations consistent with T , using either b-, c-, or d-consistency, is common. Denote this set by S. Note that the set of a-consistent segmentations, S 1 ⊂ S, is smaller. The consistencies may differ significantly, however, in the number of nodes needed to specify a segmentation. The proof of the following lemma is straightforward.
Note that a segmentation is consistent with T (in some consistency a/b/c/d) if and only if each of its segments is a union of nodes from T ; that is, there is a T-partition for each segment. For a segmentation s ∈ S, we refer to the union of the coarsest T-partitions of the segments of s (i.e., N b s ) as the coarsest cut of the hierarchy for s. Lemma 3 implies that for every s ∈ S there is a unique coarsest cut of the hierarchy. The converse is not true: a cut of a hierarchy can be the coarsest for several segmentations. For example, the same cut is the coarsest for the different segmentations (a) and (b) in Figure 3.

Previous work
The task considered here (and in [18,26]) is to estimate the limitation associated with hierarchy-based segmentation. That is, to find the best s ∈ S, maximizing the quality M(s), that is consistent with the hierarchy for a limited size of N s , |N s | ⩽ k. This upper bound on the segmentation quality is a function of the consistency type and k. We refer to the segmentation maximizing the quality as a/b/c/d-optimal.
First, we emphasize again that this task is different from the common evaluation of hierarchy-dependent segmentations, which provides precision-recall curves and chooses the best segmentation from them; see, e.g., [24,27,28]. This approach considers only the easily enumerable set of segmentations associated with horizontal cuts, which are parameterized by a single scalar parameter. Here, on the other hand, we find the best possible segmentation from much more general segmentation sets, and provide an upper bound on its quality measure. The best segmentations from these larger sets often have significantly better quality; see [28].
Only a few papers address such upper bounds. Most of the upper bounds were derived for local measures. A local measure M(s) of a segmentation s may be written as a sum of functions defined over the components of the cut defining s.
Local measures are considered in [18]. An elegant dynamic programming algorithm provides upper bounds on these measures for segmentations that are a-consistent with a given BPT hierarchy.
Fig. 3: Examples of segmentations of various consistency types, all consistent with the hierarchy described in Figure 1, shown also in (e). All segmentations are specified by three nodes (although sometimes fewer nodes suffice). Note that a segment is not necessarily connected. Except for the a-consistency, the nodes in a cut of the hierarchy specifying each segmentation are shaded with the colors of the segments in which they are included.

Unlike that work, we consider binary segmentation, for which the a-consistent segmentation is trivial. We extend this work by working with b-, c-, and d-consistent segmentations and by optimizing each one for a non-local measure: the Jaccard index.
The boundary-based F b measure [29] was considered in [19]. A method to evaluate the a-consistency performance of a BPT hierarchy is proposed. The optimization was modeled as a Linear Fractional Combinatorial Optimization problem [30] and was solved for every possible size of a cut of a hierarchy (from 1 to |L|). This process is computationally expensive, and is therefore limited to moderate-size hierarchies.
Extending those previous works, a hierarchy evaluation framework was proposed [28]. It includes various types of upper bounds corresponding to boundaries and regions, and further extends the analysis to supervised, marker-based segmentation. More recently, the study described in [31] introduced some new measures that quantify the match between a hierarchy and the ground truth. Neither paper [28,31] addresses the exact optimization of the Jaccard index or the advanced (b, c, d) consistencies.

A co-optimality tool for optimization
Given a quality measure M over a set S, we want to find s ∈ S with the best score, M(s).
Optimizing the quality measures over all possible node subsets may be computationally hard. One approach could be to optimize an equivalent measure Q(s) instead. Measures are equivalent if they rank objects identically. For example, the Jaccard index and the object-based F-measure are equivalent [24] because they are functionally related by a monotonically increasing function.

Algorithm 1: Generic optimization scheme
Data: A quality measure M, a set S, a family of measures {Q ω }, and an initial value ω 0 ∈ [0, 1]
1 s ω ← argmax s∈S Q ω 0 (s); ω ← M(s ω )
2 do
3   ω 0 ← ω
4   s ω ← argmax s∈S Q ω (s)
5   ω ← M(s ω )
6 while ω > ω 0
7 return s ω
An equivalent measure Q may, however, be as difficult to optimize. Recalling that we are interested only in the maximum of M and not in the ranking of all subsets, we may turn to a weaker, easier-to-optimize form of equivalence.
Definition 2. Let S M ⊂ S be the subset of the elements optimizing M. We refer to measures M and Q as co-optimal over S if S M = S Q .
We now propose an optimization approach that is valid for general finite sets S, including but not limited to hierarchical segmentations. Algorithm 1 uses a family of measures {Q ω }, ω ∈ [0, 1], over S. It works by iteratively alternating between assigning values to ω and optimizing Q ω (s). As Theorem 1 below shows, under some conditions on the family {Q ω }, the algorithm returns the segmentation that maximizes the quality measure M, and the corresponding maximal value M.
Theorem 1. Let M be a quality measure over a finite set S, receiving its values in [0, 1]. Let M be the (unknown) maximal value of M over S. Let {Q ω }, ω ∈ [0, 1], be a family of measures over S, satisfying the following conditions:
1. Q ω=M and M are co-optimal measures over S.
2. For every ω ∈ [0, M), any s ′ maximizing Q ω satisfies M(s ′ ) > ω.
Then Algorithm 1 returns s ∈ S M after a finite number of iterations.
Proof. Suppose first that ω 0 ∈ [0, M]. In each iteration, the scheme finds s ω ∈ S Q ω and assigns the new value M(s ω ) to ω. Condition 2 is fulfilled trivially for s ′ = s ω , since s ω maximizes Q ω . Hence, ω < M(s ω ); i.e., ω strictly increases from iteration to iteration while ω < M. S is finite; hence, ω reaches M after a finite number of iterations. When that happens, s ω=M ∈ S M , since, by condition 1, Q ω=M and M are co-optimal. The iterations stop when ω no longer increases; hence, to prove the theorem, it remains to note that ω does not change after it reaches M.
Suppose now that the condition assumed above, ω 0 ∈ [0, M], is not satisfied (i.e., ω 0 > M). Then, line 1 returns some ω, which must be lower than the maximal value M. The algorithm then proceeds and reaches the optimum, as shown above. □
Given a quality measure 0 ⩽ M ⩽ 1 over S, we refer to a family {Q ω }, ω ∈ [0, 1], of measures over S as a family of auxiliary measures for M if {Q ω } contains at least one measure Q ω ′ that is co-optimal with M over S, and there is some iterative process that finds Q ω ′ from {Q ω }. We refer to Q ω ′ as a co-optimal auxiliary measure, and we refer to an algorithm that can optimize every member of {Q ω } as an auxiliary algorithm.
In Algorithm 1, the auxiliary algorithm is written in its most general form: argmax Q ω . In the next section, we provide a family of auxiliary measures and corresponding auxiliary algorithms, suitable for optimizing the Jaccard index, for the different consistencies and constraints on the node set size.
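The generic scheme can be sketched in a few lines. The toy ratio measure and point set below are illustrative assumptions (they anticipate the Jaccard optimization of Section 6, where the auxiliary measures are linear):

```python
# Sketch of the generic scheme: alternate between maximizing Q_omega and
# updating omega <- M(s_omega), stopping when omega no longer increases.
def optimize(M, argmax_Q, omega0=0.0):
    omega = omega0
    while True:
        s = argmax_Q(omega)
        if M(s) <= omega:            # omega stopped increasing
            return s, omega
        omega = M(s)

# Toy instance: S is a finite set of points (b, f); M(s) = f / (b + F) is a
# ratio measure, and Q_omega(s) = f - omega * b is a linear auxiliary family.
F = 10.0
S = [(0.0, 4.0), (2.0, 8.0), (5.0, 9.0)]
M = lambda s: s[1] / (s[0] + F)
best, value = optimize(M, lambda w: max(S, key=lambda s: s[1] - w * s[0]))
```

On this instance the scheme converges in two updates of ω, ending at the point with the maximal ratio.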

Optimizing the Jaccard index
After setting the framework and developing the necessary new optimization tool, we shall now turn to the main goal of this paper: Finding a tight upper bound on the obtainable Jaccard index.

The Jaccard index
The Jaccard index (or the intersection-over-union measure) is a popular segmentation quality measure, applicable to a simple segmentation into two parts: foreground (or object) and background. Let (B GT , F GT ) and (B s , F s ) be two foreground-background partitions corresponding to the ground truth and a segmentation s ∈ S. The Jaccard index J is given by:
J(s) = |F GT ∩ F s | / |F GT ∪ F s |.
Given a hierarchy, we shall find, for each consistency and node subset size |N s |, the segmentation that maximizes the Jaccard index. This segmentation also maximizes the object-based F-measure, as the two measures are equivalent [24].
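As a quick illustration, the index can be computed directly from the foreground pixel sets (toy sets, made up here):

```python
# Jaccard index of a binary segmentation against ground truth.
def jaccard(F_gt, F_s):
    return len(F_gt & F_s) / len(F_gt | F_s)

F_gt = {1, 2, 3, 4}
F_s = {3, 4, 5}
J = jaccard(F_gt, F_s)     # |{3, 4}| / |{1, 2, 3, 4, 5}| = 2/5
```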
For two-part segmentation, only one segmentation is a-consistent with a BPT hierarchy: the two children of the root. We ignore this trivial case in the following discussion.

Segmentation dimensions
Let S 2 ⊂ S be the subset of all possible 2-segment segmentations consistent with the hierarchy. Denote the areas of the ground-truth parts by (B, F) = (|B GT |, |F GT |). A segmentation s ∈ S 2 is specified by one of its segments, X s . Considering this segment as foreground, denote its areas inside the ground-truth's parts by (b s , f s ) = (|X s ∩ B GT |, |X s ∩ F GT |). The Jaccard index is then
J(s) = f s / (F + b s ).
Alternatively, the foreground can be specified by the complementary segment I \ X s . The corresponding areas inside the ground-truth's parts are (B − b s , F − f s ). The Jaccard index associated with this foreground is
J c (s) = (F − f s ) / (F + B − b s ).
Optimizing J(s) for b-consistency provides a cut in the tree, N s . Both F s and B s are unions of nodes of this cut. The c/d consistencies allow one segment to be specified as the complement of the other. The hierarchy may match either F GT or B GT better. Thus, we optimize both J(s) and J c (s) (for the same size of N s ) and choose the better result.
The values (b s , f s ) are the main optimization variables. We refer to them as segmentation dimensions.
6.3 Applying co-optimality for optimizing J(s)

Geometrical interpretation
Our goal is to find the segmentation maximizing J over S 2 . A key idea is to observe that the J(s) value may be interpreted geometrically using the graph of segmentation dimensions (b, f). Selecting the segment X s , every segmentation s ∈ S 2 corresponds to a point (b s , f s ) inside the rectangle (0, B) × (0, F). J(s) is tan(α s ), where α s is the angle between the b axis and the line connecting the point (b s , f s ) with the point (−F, 0); see Figure 4. The geometry implies that tan(α s ) ∈ [0, 1], consistently with tan(α s ) = J(s) ∈ [0, 1].

A family of auxiliary measures
For every ω ∈ [0, 1], let
P ω (s) = f s − ω · b s
be a measure over S 2 . Note that, geometrically, P ω (s) is the oblique projection, at the arctan(ω) angle, of the point (b s , f s ) onto the f axis. The following two observations imply that J(s) and the projection (at the arctan(J) angle) P J (s) are co-optimal measures.
1. J(s) and P J(s) (s) are equivalent measures.
2. P J(s) (s) and P J (s) are co-optimal measures.
The first observation is clear: ranking the elements of S 2 by J(s) = tan(α s ) is equivalent to ranking them by their projection at angle α s , i.e., by P J(s) (s); see Figure 4.
The second observation states that there is a constant angle arctan(ω), with ω = J (not depending on s), such that the projection P ω (s) at this angle and P J(s) (s) are co-optimal. By the first observation, P J(s) (s) is maximized by the optimal segmentation ŝ. Every non-optimal segmentation s corresponds to a point below the line [(−F, 0) − (b ŝ , f ŝ )], and its constant-angle projection satisfies P ω (s) < P ω (ŝ). P ω (s) is maximized only by points lying on this line, as is also the case with P J(s) (s).
Thanks to these two observations, the family {P ω } is a family of auxiliary measures for the Jaccard index. The optimization process (Algorithm 1) maximizes this auxiliary measure in every iteration. Note that P ω (s) is linear in (b s , f s ), which simplifies its maximization. To use Algorithm 1 to find the optimal segmentation (and the maximal value J), the second condition of Theorem 1, which guarantees that ω strictly increases at every iteration while ω < J, must be met as well.
Remark 3. Optimizing J c (s) is similarly done.
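The two observations can be verified numerically on a toy point set, assuming the oblique-projection form P ω (s) = f s − ω·b s (the point coordinates below are made up):

```python
# Check: the maximizer of J over a set of segmentation dimensions (b, f)
# also maximizes the fixed-angle projection P_omega at omega = max J.
F = 10.0
points = [(0.0, 4.0), (2.0, 8.0), (5.0, 9.0), (8.0, 9.5)]   # (b_s, f_s)
J = lambda p: p[1] / (p[0] + F)            # tan(alpha_s)
P = lambda w, p: p[1] - w * p[0]           # oblique projection onto the f axis

J_max = max(J(p) for p in points)
best_by_J = max(points, key=J)
best_by_P = max(points, key=lambda p: P(J_max, p))   # co-optimal at omega = J_max
```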

Optimizing J(s) for hierarchical segmentation
Using Algorithm 1 reduces the optimization of J(s) to iterations in which auxiliary measures are optimized.
The auxiliary algorithm provides a foreground-background segmentation s ∈ S 2 whose dimensions (b s , f s ) maximize the auxiliary measure corresponding to the current iteration. In this work, we use the hierarchy for this optimization, and the auxiliary algorithm returns the best segmentation s ∈ S 2 together with the corresponding subset N s ⊂ T ; both depend on the required consistency of s with T .

Specifying N s and the segmentation s for various consistencies
Here, we specify the relation between the hierarchy and the segmentation for each of the different consistencies considered in this paper. Let N ⊂ T be a subset of nodes. Because the nodes belong to a hierarchy, each pair of nodes is either disjoint or nested. Denote by K 0 N ⊂ N the subset of disjoint nodes that are not nested in any other node from N . Recursively, denote by K i N ⊂ N the subset of disjoint nodes that are not nested in any other node from N \ {∪ i−1 j=0 K j N }. We refer to each K i N as a layer of N . Note that each subsequent layer is nested in any previous layer. Let i N N be the index of the layer in which the node N lies. Note that the set of layers is a partition of N , i.e., every node N ∈ N is associated with exactly one index i N N ; moreover, i N N is the number of nodes in N in which N is nested. Let i max N be the largest index corresponding to a non-empty layer. The segmentation is specified from the layers as follows. Each D i N is the set-difference of the (unions of the) layers 2·i and 2·i + 1. Since each subsequent layer is nested in any previous layer, all D i N are disjoint. The segments of s are X s = ∪ i D i N and its complement I \ X s .
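A sketch of the layer construction and the resulting d-consistent segment, on a made-up nested chain of toy nodes (the layer index is computed as the number of containing nodes, per the definition above):

```python
# Layers of a node subset and the segment built from alternating
# set-differences D_i = (union of layer 2i) \ (union of layer 2i+1).
def layer_index(N, nodes):
    """i_N: the number of nodes in `nodes` in which N is (strictly) nested."""
    return sum(1 for M in nodes if N < M)

def d_segment(nodes):
    depth = max(layer_index(N, nodes) for N in nodes)
    layers = [frozenset().union(*[N for N in nodes if layer_index(N, nodes) == i])
              for i in range(depth + 1)]
    seg = frozenset()
    for i in range(0, depth + 1, 2):
        seg |= layers[i] - (layers[i + 1] if i + 1 <= depth else frozenset())
    return seg

# Nested chain A ⊃ B ⊃ C gives the segment (A \ B) ∪ C:
A, B, C = frozenset({0, 1, 2, 3}), frozenset({1, 2}), frozenset({2})
seg = d_segment([A, B, C])
```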

Calculation of segmentation dimensions for various consistencies
To calculate the segmentation dimensions (b s , f s ) efficiently, we use a tree data structure to represent the hierarchy. For each node N of the tree, we store the areas of the node inside the ground-truth's parts, (b N , f N ) = (|N ∩ B GT |, |N ∩ F GT |). Similarly to the segmentation dimensions specified above, we refer to these values as node dimensions. Note that the dimensions of a union of disjoint nodes are equal to the sum of the dimensions of all nodes from the union, and the dimensions of a set-difference of two nested nodes are equal to the difference of their dimensions. Given a segmentation s = (X s , I \ X s ) ∈ S 2 , the calculation of its dimensions (b s , f s ) (which are the dimensions of the segment X s ) from the dimensions of the nodes of a subset N s depends on the required consistency of s with N s . For the b/c consistencies, the dimensions are the sum of the dimensions of the (disjoint) nodes of N s . For the d-consistency (7), the dimensions (b s , f s ) are calculated as the sum of the dimensions of all nodes from N s , each multiplied by an appropriate sign: −1 to the power of i N Ns .

A unified expression of segmentation dimensions: More formally, we can write
(b s , f s ) = Σ N ∈N s (−1)^(i N Ns) (b N , f N ).   (8)
Note that for the b/c consistencies, N s consists of a single layer: i N Ns = 0 for all N ∈ N s . Therefore, this expression is valid for all (b/c/d) consistencies.
Remark 4. Since for a segmentation s ∈ S 2 the subset N s is not necessarily unique, we could ask whether expression (8) is well-defined, i.e., whether we get the same dimensions (b s , f s ) for every subset N s specifying s. The answer to this question is positive, due to the properties of node dimensions for the union of disjoint nodes and for the set-difference of nested nodes.
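The signed sum of expression (8) can be sketched directly on toy nodes (node sets and dimensions below are made up; the sign is (−1) to the layer index):

```python
# Segmentation dimensions as a signed sum of node dimensions (Eq. 8).
def segment_dims(nodes, dims):
    """(b_s, f_s) from per-node dimensions dims[N] = (b_N, f_N)."""
    def i_N(N):                     # layer index: number of containing nodes
        return sum(1 for M in nodes if N < M)
    b = sum((-1) ** i_N(N) * dims[N][0] for N in nodes)
    f = sum((-1) ** i_N(N) * dims[N][1] for N in nodes)
    return (b, f)

# d-consistency example: the segment A \ B, with B nested in A.
A, B = frozenset(range(5)), frozenset({0, 1})
dims = {A: (5, 7), B: (2, 3)}
# dims(A \ B) = dims(A) - dims(B) = (3, 4)
```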

Auxiliary measures additivity
A particularly useful property of the auxiliary measures is their additivity. Consider some attribute defined on every node in the tree. If the attribute of each non-leaf node is the sum of the attributes of the node's children, then we say that this attribute is additive.
For a specific projection P ω , the two dimensions of a node may be merged into one attribute, A P ω (N ) = f N − ω · b N . With it, we get a closed-form, simplified linear expression for the auxiliary measure of the segmentation s. We may refer to this measure, alternatively, as the benefit of the corresponding node set N s :
P ω (s) = Σ N ∈N s (−1)^(i N Ns) A P ω (N ).   (9)
Note that each non-leaf node N is the union of its disjoint children; hence, the dimensions (b N , f N ) are the sum of the dimensions of the children of N, which implies the additivity of A P ω (N ).
The additivity property holds for all projections.
For simplification, we refer to the attribute of N as A(N ). The auxiliary algorithms search for the subset of nodes maximizing the benefit (9). These optimization tasks are performed under the constraint |N s | ⩽ k. While (9) provides a general expression for all consistencies, in practice we use the following consistency-dependent expressions, which are equivalent and more explicit.
Here and below, we prefer to use the more general N (over N s ), when the discussion applies to general sets of nodes from the tree.
The proposed auxiliary algorithms (described below) are not restricted to the auxiliary measures discussed above; they would work for any additive measure Q. The additivity is crucial, because otherwise the score Q(s) is ill-defined, i.e., it may result in different score values for different subsets N s specifying the same s ∈ S 2 .
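The additivity of the merged attribute can be illustrated on a single parent node (toy dimensions, made up here):

```python
# A(N) = f_N - omega * b_N is additive because node dimensions are sums of
# the children's dimensions (root = left ∪ right, disjoint children).
omega = 0.5
dims = {"left": (2, 4), "right": (4, 5)}
dims["root"] = (dims["left"][0] + dims["right"][0],
                dims["left"][1] + dims["right"][1])
A = {k: f - omega * b for k, (b, f) in dims.items()}
# A["root"] == A["left"] + A["right"], for any omega.
```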

Using the tree structure for maximizing the auxiliary measures
The maximization of the benefit (property 3) results in a subset of nodes, subject to the consistency constraints, with the maximal benefit in T .
The key observation in this maximization is that a subset with the maximal benefit in a subtree T N can be obtained from subsets with the maximal benefit in the subtrees of the children of N. That is, we can use the recursive structure of the tree T to maximize the benefit. Let N ′ ⊂ N ⊂ T . We say that N ′ is best if it has the highest benefit relative to every other subset of N with the same number of nodes.
Depending on the context, N ′ should also have the properties associated with the consistency, i.e., being a partition (for b-consistency) or belonging to a single layer (for c-consistency). Interestingly, we also need the notion of worst subsets: N ′ is worst if it has the minimal benefit relative to other subsets of N of the same size.
Remark 5. Note that within the same consistency type, there can be several best/worst subsets in N , having the same benefit but not necessarily of the same size.
Thus, a subset N maximizes the benefit (Property 3) if and only if N is a best subset in T . Below, when we refer to N as best without specifying in which subset of T it is best, we mean that N is best in the entire tree T .
The following claim readily follows from the additivity properties of the dimensions (Section 6.4.3).
Lemma 4. (a) Let N 1 and N 2 be disjoint subsets of nodes; then B[N 1 ∪ N 2 ] = B[N 1 ] + B[N 2 ]. (b) Let N be a node and N be a subset of nodes nested within N; then the benefit of {N } ∪ N follows from the benefits of {N } and N .
Lemma 4(b) applies only to the d-consistency, in the case where N and N are not disjoint. The set of nodes {N } ∪ N corresponds to a segment that is the set difference between N and the segment specified by N , which leads to the claim on the benefit.
The children of a non-leaf node are disjoint and nested in the node, which implies the following claim.
Lemma 5. Let N ∈ T be a non-leaf node: N = N r ∪ N l , where N r (right) and N l (left) are its children. Let N N be a best/worst subset of T N and let N Nr , N N l be the (possibly empty) subsets of N N from T Nr and T N l , respectively. Then N Nr and N N l are best/worst subsets in T Nr and T N l .

Proof Assume the opposite about any N Nr , N N l . Lemma 4 implies that N N can be improved/worsened, which contradicts N N being best/worst. □

Lemma 5 specifies a necessary condition for a best/worst subset in T N . With its help, the search for the best subsets in T N can be significantly reduced, making this search feasible: for finding a best subset in T N , it is enough to examine only those subsets that are best/worst in the subtrees of the children of N. The (trivial) sufficient condition for a best/worst subset in T N , given in Lemma 6, is to examine all possible candidates.

We can now describe the auxiliary algorithms. From a high-level point of view, they work as follows. At the outset of the run, each auxiliary algorithm specifies each leaf of T as both the best and the worst subset (of size 1) in the trivial subtree of the leaf. Then, each auxiliary algorithm visits all non-leaf nodes of T once, in a postorder tree traversal, which guarantees visiting every node after visiting its children. When visiting a node N, each algorithm finds the best/worst subsets in T N using Lemma 6.
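The high-level flow described above can be sketched as follows. This is only a skeleton: `combine` stands in for the consistency-specific application of Lemma 6, and the node type and table layout are our own assumptions, not the paper's actual data structures.

```python
from collections import namedtuple

# A minimal node type for the sketch; `attr` stands for the additive attribute A(N).
Node = namedtuple("Node", ["attr", "left", "right"])

def postorder(node):
    """Yield nodes so that children always precede their parent."""
    if node is None:
        return
    yield from postorder(node.left)
    yield from postorder(node.right)
    yield node

def run_auxiliary(root, k, leaf_table, combine):
    """Skeleton of the auxiliary algorithms: every leaf starts as the best
    (and worst) subset of size 1; every non-leaf node combines the tables
    already computed for its children. `combine` stands in for the
    consistency-specific rule of Lemma 6."""
    tables = {}                      # node -> best-benefit table (sizes <= k)
    for n in postorder(root):
        if n.left is None:           # a leaf: the only subset of size 1
            tables[n] = leaf_table(n)
        else:
            tables[n] = combine(n, tables[n.left], tables[n.right], k)
    return tables[root]
```

For instance, with a `combine` that pairs the children's size-1 entries, the root's table holds the benefit of subsets of sizes 1 and 2, computed bottom-up.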

Preliminaries
Generally, the algorithm works as follows. Starting from the hierarchy leaves, the algorithm calculates the maximal auxiliary quality measure for every node and for every budget (up to k) in its subtree. When reaching the root, the decision about the particular nodes used for the optimal auxiliary measure is already encoded in the hierarchy nodes and is then explicitly extracted. Like [18], it is a dynamic programming algorithm.
The following variables and notations are used within the algorithms:
1. N 1 , N 2 , N 3 , . . ., N |T | is the set of all nodes, ordered in a post-order tree traversal.
2. N l (left) and N r (right) are the children of a non-leaf node N.
3. A(N ) is an additive attribute of a node N.
4. t(N ) is the allowed size of a best/worst subset in T N , which is limited by k or by the number of leaves in T N .
5. r is the number of nodes in the node subset associated with the right child of N . It depends on N and is optimized by the algorithms. The range of r values is denoted ( r min , r max ).
6. H N + (i) / H N − (i) are the best/worst subsets of size i in T N ; the best subset in T , denoted H, is the output of the auxiliary algorithm, maximizing the benefit (see, however, Remark 7). These subsets are used to describe the algorithm, but are not variables of the algorithm.
7. B N + (i) / B N − (i) are vector variables stored in node N, holding the benefits of H N + (i) / H N − (i).
8. R N + (i) / R N − (i) are vector variables stored in node N, holding the number of those nodes in H N + (i) / H N − (i) which belong to T Nr (the subtree of N r ). The number of nodes in T N l follows.
9. Q is a queue data structure, used to obtain H from the vectors R N + / R N − .

To find the best subset consisting of a single layer (as is the case for b/c consistency), we need to examine only the corresponding best subsets and disregard the worst subsets. In this case, we simplify the notation and use H N (i), B N , and R N .

Remark 6. Different optimal subsets for different k are associated with different ω parameter values. Therefore, the subsets {H Root (i); i < k} obtained alongside the best subset H Root (k) are, in general, not optimal.

b-consistency
The auxiliary algorithm for b-consistency is formally given in Algorithm 2.
A best subset for b-consistency (Definition 3), H N (i), must be a T -partition of N . B N [1] is the benefit of the node N itself. To calculate B N [i] for i > 1, we need only condition (a) of Lemma 6, which implies that B N [i] = B Nr [r] + B N l [i − r], maximized over all possible values of r. This part is carried out in lines 1-5 of Algorithm 2.
The best subset H = H Root [k] and its subset of nodes with a positive attribute, denoted G, are specified from the vectors R N . The number of nodes in H that belong to the subtree of the root's right child is R Root [k] (recall that t(Root) = k), and their number in the subtree of the left child is k − R Root [k]. The same consideration is applied recursively to every node N, stopping when R N [i] is equal to zero. This part is carried out in lines 6-16 of Algorithm 2.
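The recursive budget splitting described above might be sketched as follows, assuming a table `R[node][i]` with the convention that a zero entry stops the recursion at that node (the names and data layout here are hypothetical, not the paper's):

```python
from collections import deque, namedtuple

# A minimal node type for the sketch.
Node = namedtuple("Node", ["name", "left", "right"])

def extract_best_subset(root, k, R):
    """Sketch of the extraction step (lines 6-16 of Algorithm 2, as described
    in the text). R[node][i] is the stored number of nodes of the best size-i
    subset lying in the right child's subtree; R[node][i] == 0 signals that
    the recursion stops and the node itself is taken."""
    H, q = [], deque([(root, k)])
    while q:
        node, i = q.popleft()
        r = R[node][i]
        if r == 0:                   # recursion stops: take the node itself
            H.append(node)
        else:                        # split the budget between the children
            q.append((node.right, r))
            q.append((node.left, i - r))
    return H
```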

Notes:
1. The range ( r min , r max ) is calculated as follows: i is the number of nodes in the subset associated with N . The number of nodes, r, associated with the right child should satisfy 1 ⩽ r ⩽ t(N r ). (Note that the lower limit is 1 and not 0, because for b-consistency N s is a cut.) The number of nodes, i − r, associated with the left child should satisfy 1 ⩽ i − r ⩽ t(N l ), which implies that r should satisfy i − t(N l ) ⩽ r ⩽ i − 1. Therefore r should be in the range ( max(1, i − t(N l )), min(i − 1, t(N r )) ).
2. H is a cut of T (Section 2.1), which implies that the deepest node in H is no deeper than |H| − 1. Hence, Algorithm 2 can be accelerated by processing only those nodes whose depth is less than k.
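Note 1 can be turned into a small helper; the function name and the convention that t(·) denotes the allowed subset size of a child are assumptions made for the sketch:

```python
def r_range(i, t_right, t_left):
    """Range of r (nodes assigned to the right child) when i nodes are
    assigned to node N under b-consistency: each child must receive at
    least one node, and neither more than its own limit t(.)."""
    r_min = max(1, i - t_left)
    r_max = min(i - 1, t_right)
    return r_min, r_max

# e.g., splitting i = 5 nodes when each child can host at most 3:
# r ranges over max(1, 5 - 3) = 2 .. min(4, 3) = 3
```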

c-consistency
A similar auxiliary algorithm for c-consistency is given in Algorithm 3. Note that a c-best subset H N (i) consists of disjoint nodes, but their union is not necessarily N.
For example, for H N (1) there are three possibilities: the best node from T Nr , the best node from T N l , and N itself, which are marked by the ad-hoc values 1, 0, and -1, respectively, in R N [1].
Notes:
1. The calculation of ( r min , r max ) is as above, but the ranges of both children start from 0.
2. For coding convenience, we added a cell B N [0], which always takes the value 0.
3. The algorithm should preferably select only nodes with a positive attribute. If the number of nodes with a positive attribute (in one layer) is less than k, then nodes with a nonpositive attribute are selected as well. In this case, however, there is a subset with fewer nodes and a bigger benefit, which can be specified from ( B Root , R Root ); see Remark 7.
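The three candidates for H N (1) and their markers can be sketched as follows (the function and the `benefit` callback are hypothetical; the markers 1 / 0 / -1 follow the text):

```python
def best_single(node, best_right, best_left, benefit):
    """For c-consistency, the best size-1 subset in T_N is one of three
    candidates: the best node from T_Nr, the best node from T_Nl, or N
    itself. The returned marker (1 / 0 / -1) mimics the ad-hoc values
    stored in R_N[1]; `benefit` maps a node to its benefit."""
    candidates = [(benefit(best_right), 1, best_right),
                  (benefit(best_left), 0, best_left),
                  (benefit(node), -1, node)]
    _, marker, chosen = max(candidates, key=lambda c: c[0])
    return marker, chosen
```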

d-consistency
The auxiliary algorithm for d-consistency is formally given in Algorithm 4.
Unlike the other consistencies, a d-best subset may contain nested nodes, which requires additional variables. Unlike the algorithms for the other consistencies, here we use all the vectors B N + , B N − , R N + , and R N − , including the additional cells which always take the value 0. By Lemma 6, and using the notations introduced in Section 6.5.1, H N + (i) is the subset having the maximal benefit among the candidates of Lemma 6, over all possible values of r.
For getting H N − (i), the subset having the minimal benefit, we use similar expressions; see Algorithm 4. It may happen that this calculation requires two passes through all values of i, the second pass being in decreasing order of i (see lines 6-18 in Algorithm 4).
The d-best subset H is specified from R N as before. However, since both a node N and nodes nested in it may be included in H , we added indicators Belong N + [i] / Belong N − [i], i = 1, . . ., t(N ) (boolean vector variables stored in a node N ), indicating whether N belongs to H N + (i) / H N − (i). In addition, for every node N ∈ H, the index i N H (Section 6.4.1) is calculated, so Algorithm 4 returns a subset H that is a set of pairs (N , i N H ).
H is not necessarily of minimal size, and in the extreme case, when the number of nodes with a positive attribute is too small, it does not provide the best benefit (see Remark 5 and Note 3 in Section 6.5.3). The best subset with the best benefit and minimal size is always associated with the maximal value in B Root . It can be specified by running the queue starting from Q.Enqueue( Root , k ′ ) (where k ′ replaces t(Root) ), and k ′ is the minimal index such that the value B Root [k ′ ] is the maximal in B Root .
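The choice of k ′ amounts to taking the first index attaining the maximum of B Root ; a minimal sketch (using 0-based Python indexing, unlike the 1-based indexing of the text):

```python
def minimal_best_index(B_root):
    """Index k' of the maximal value in B_Root, taking the smallest such
    index, so the associated subset has the best benefit and minimal size."""
    best = max(B_root)
    return B_root.index(best)   # list.index returns the first (minimal) position
```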

Time complexity
The best segmentation specified by a subset of unlimited size

Sometimes, we are interested in a segmentation s ∈ S achieving the best score M(s), regardless of the size of the subset N s . Then, the auxiliary algorithm becomes linear and significantly simpler. Lemma 2 implies that in this case, optimizing M yields s ∈ S with the same score M(s) for each of the consistency types b/c/d. By simply discarding the node-subset size parts, the b-consistency algorithm can be simplified to be particularly efficient.
Algorithm 5 provides the full description.
In every node N, we store only the maximal benefit over all b-best subsets in T N , regardless of their sizes. That is, we need only a scalar variable p N , storing the maximal value of the vector B N in Algorithm 2. After the values p N are calculated for all nodes, the b-best subset H is found as the optimal cut of T . In this case, H has the minimal size (see Remark 7); i.e., there is no smaller b-best subset in T with the same benefit.
Processing each node takes O(1) time; hence, the time complexity of this algorithm is O(|T |) = O(|L|). Note that Algorithm 5 returns two subsets: H and G ⊂ H (the nodes with a positive attribute).
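A minimal sketch of this simplified algorithm, under the assumption that the benefit of choosing a node is max(A(N), 0) and that a cut either takes the node itself or recurses into both children (the node type and names are hypothetical, not the paper's Algorithm 5):

```python
from collections import namedtuple

Node = namedtuple("Node", ["attr", "left", "right"])

def postorder(node):
    if node is None:
        return
    yield from postorder(node.left)
    yield from postorder(node.right)
    yield node

def best_unlimited_cut(root):
    """p[N] is the maximal benefit of any cut of T_N (only positive
    attributes contribute). Ties prefer the node itself, so the returned
    cut H is of minimal size; G collects the nodes of H with a positive
    attribute."""
    p = {}
    for n in postorder(root):
        own = max(n.attr, 0)
        p[n] = own if n.left is None else max(own, p[n.left] + p[n.right])
    H, G, stack = [], [], [root]
    while stack:
        n = stack.pop()
        if n.left is None or max(n.attr, 0) >= p[n.left] + p[n.right]:
            H.append(n)                  # n itself is optimal here: stop
            if n.attr > 0:
                G.append(n)
        else:                            # descend into both children
            stack += [n.left, n.right]
    return H, G
```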

Auxiliary algorithms' correctness
Theorem 3. The auxiliary algorithms optimize the auxiliary measure (9) subject to the corresponding consistency, and the constraint on the maximal number of nodes in N s .
As each of the auxiliary algorithms recursively applies Lemma 6, the proof readily follows by induction on Height(T ).
Algorithm 5: Auxiliary algorithm for finding the best segmentation specified by an unlimited subset.

A note on the implementation
Each of the auxiliary algorithms calculates the benefit of node subsets by performing arithmetic operations with the natural numbers b N , f N and the real number ω. To avoid accumulating numerical error, we use integer arithmetic: we represent the benefit with two natural numbers, each of which is a linear combination of b N and f N values with ±1 coefficients. To compare the benefits of different subsets, we then need only a single operation involving ω.
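A sketch of this exact-arithmetic idea, assuming (as an illustration, not the paper's exact formulation) that the benefit has the form F − ω·B with natural numbers F and B; the class and method names are our own:

```python
class Benefit:
    """Benefit value F - omega*B, stored as the integer pair (F, B), so that
    accumulation stays exact and omega enters only at comparison time."""
    def __init__(self, F, B):
        self.F, self.B = F, B

    def __add__(self, other):
        # accumulation uses integers only: no floating-point error
        return Benefit(self.F + other.F, self.B + other.B)

    def greater_than(self, other, omega):
        # F1 - omega*B1 > F2 - omega*B2  <=>  F1 - F2 > omega*(B1 - B2):
        # a single operation involving omega
        return self.F - other.F > omega * (self.B - other.B)
```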

Experiments
The contribution of this paper is mostly theoretical: providing, for the first time, effective algorithms for bounding the obtainable Jaccard-index quality of a segmentation. These bounds, which depend on the hierarchy, the consistency, and the number of nodes, are experimentally illustrated below.
To the best of our knowledge, optimization of the Jaccard index has not been considered before, which prevents us from comparing our empirical results with prior work.
For the experiments, we consider four BPT hierarchies. The first, denoted geometric tree, is image independent and serves as a baseline. The other three hierarchies are created as the minimum spanning tree of a superpixel graph, where the nodes are SLIC superpixels [11].

Fig. 8: An illustration of the maximal Jaccard index obtainable for a given number of nodes. Each plot corresponds to one of the segmentation-hierarchy consistencies. The curves correspond to averages over all images in the Weizmann DB, for the four hierarchies. Filtered hierarchies are used. Using d-consistency clearly requires a significantly lower number of nodes for the same quality than c-consistency, which, in turn, requires fewer nodes than b-consistency. Also, as expected, the hierarchy built using the HED edge detector gives better results than the other hierarchies demonstrated here.

1. Geometric Tree - starting from the root (the whole image), each node is split into two equal parts (the node's children), horizontally or vertically, depending on whether the height or the width of the node is larger. Note that the geometric tree is independent of the image content.
2. L2 Tree - based on a traditional, low-quality, non-learned gradient: the L2 difference between the RGB color vectors.

3. SED Tree - based on learned Structured Forests Edge detection, which can be considered medium quality [32].
4. HED Tree - based on the modern, high-quality, deep-learning-based Holistically-Nested Edge Detector [33].
A common issue with hierarchical image segmentation is the presence of small regions (containing few pixels) at lower depths of the hierarchy. These small regions occur more frequently when generating the HED and SED trees, as their gradients generally contain thick boundaries. It is therefore common to filter the hierarchy and remove such small unwanted regions; see, e.g., the implementation of [34] and [35,36]. We followed this practice and used the Higra [37] area-based filtering algorithm proposed in [36].
The leaves of the image-independent geometric tree are the image pixels, which makes this tree large (and regular). The other trees are smaller, as they use superpixels, and also benefit from the filtering process when it is applied.
We calculated the best segmentations that match the different hierarchies, and show how they depend on the particular hierarchy used and on the consistency type. First, we show several examples of such best segmentations, corresponding to the same image, using the HED hierarchy; see Figure 7. As expected, the segmentation quality improves with the number of nodes used and with the consistency type (b < c < d).
Figure 8 confirms this observation, and shows that the average Jaccard index, over an image dataset, grows with the number of hierarchy nodes. It also shows that requiring d-consistency allows us to use a relatively small number of nodes to get a good segmentation, with a high Jaccard index. C-consistency follows, and b-consistency is last. The differences between the consistencies are clearly seen in Figure 9. It is also clear that better hierarchies, obtained with more accurate edge detectors, provide much higher segmentation quality with a lower number of nodes. These plots show the average Jaccard index over the 100 images of the Weizmann database [38]. Every image in this database contains a single object over a background, which matches the applicability of the Jaccard index.
The average Jaccard index curves are smooth. We observed, however, that for particular images the curves have a stair-like behavior, implying that the same Jaccard index is achieved for different k values; this happens, e.g., when adding a few nodes to N s does not change the foreground specification.

Fig. 9: Comparing the segmentation quality obtainable using the three segmentation-hierarchy consistencies. The comparison is carried out for each of the four different types of hierarchies. Note that the performance with the c- and d-consistencies is similar for the best, HED, tree. For better visibility, we used a different scale on the x-axis for each of the hierarchies.
Note that for b-consistency the geometric, image-independent tree is better than, say, the L2 tree. This happens because the L2 tree contains many spurious small nodes that are close to the root, even after the filtering process. B-consistency chooses a set of nodes that is a cut of the tree; when taking a cut that contains the important nodes needed to approximate the GT segment, some spurious nodes must be included, which significantly increases the node count (k).
The best results (in terms of lowest node count) are achieved with d-consistency. This holds for all hierarchies. For the higher-quality hierarchies, the node count needed for excellent quality is remarkably low (only four on average). We found that even if the hierarchy contains errors, such as incorrect merges and small nodes near the root, segmentations specified with d-consistency still require a small node count. To illustrate this robustness property, consider the case where one incorrect merge was made; see Figure 10. This merge leads to a sequence of nodes that are not purely foreground or background. In this example, the b-consistent foreground segment is specified by the cut containing 6 nodes (A, B, . . ., F), c-consistency requires one node less (A, B, . . ., E), while d-consistency requires only 2 nodes (K and F). This robustness is significant because the hierarchy is usually constructed by an error-prone, greedy process. By using d-consistency, the harm done by the greedy process can be compensated for to some extent. Note that when using the geometric tree, the segmentation qualities obtained by the c- and d-consistencies are not very different; see Figure 9. The merging errors made by the geometric tree are numerous and happen at all hierarchy levels; therefore, they cannot be corrected by a few set-difference operations.
The experiments are meant only to be illustrative and are not the main contribution of this paper. Several surprising findings were observed, however. First, it turned out that for approximating a segment, in the Jaccard index sense, the geometric tree provides reasonable results, which are often as good as those of some of the other trees (but not of the modern HED tree). Note that while all the nodes in this case are image-independent rectangles, the nodes selected for the approximation are based on the (image-dependent) ground truth segmentation. We also found that the hierarchies based on the SED edge detector are not as good as we could have expected. This was somewhat surprising because previous evaluations of SED show good results (F-measure = 0.75 on BSDS [32]). Overall, these results imply that hierarchies built greedily are sensitive to the gradient that is used.

Conclusions
This paper considered the relation between the hierarchical representation of an image and the segmentation of this image. It proposed that a segmentation may depend on the hierarchy in four different ways, denoted consistencies. The higher-level consistencies are more robust to hierarchy errors, which allows us to describe the segmentation in a more economical way, using fewer nodes, relative to the lower-level consistencies that are commonly used.
While the common a-consistency requires that every segment be a separate node in a hierarchy cut, b-consistency allows describing segments that were split between different branches of the hierarchy. The c- and d-consistencies no longer require that the segmentation be specified by a cut, and can thereby ignore unimportant small nodes. The d-consistency can even compensate for incorrect merges that occurred in the (usually greedy) construction of the hierarchy. We found, for example, that fairly complicated segments can be represented by only 3-5 nodes of the tree, using the hierarchy built with a modern edge detector (HED [33]) and d-consistency. This efficient segment representation opens the way to new algorithms for analyzing segmentations and searching for the best one. Developing such algorithms seems nontrivial and is left for future work.
The number of nodes required to describe a segmentation is a measure of the quality of the hierarchy.A segmentation may be accurately described by a large number of leaves of almost any hierarchy.For describing the segmentation with a few nodes, however, the hierarchy should contain nodes that correspond to the true segments, or at least to a large fraction of them.Thus, this approach is an addition to the variety of existing tools that were proposed for hierarchy evaluation.
Technically, most of this paper was dedicated to deriving rigorous and efficient algorithms for optimizing the Jaccard index.For this complex optimization, the co-optimality tool was introduced.We argue that with this tool, other measures of segmentation quality, such as the boundary-based F b measure [29] considered in [19], may be optimized more efficiently, and propose that for future work as well.

Fig. 1 :
Fig. 1: (a) The true segmentation (GT). (b-f ) A chain of image partitions Π = {π 0 , π 1 , π 2 , π 3 , π 4 }, which yields a hierarchy T = {N 1 , . . ., N 15 }. Each π i is represented in the Binary Partition Tree (representing T ) by a set of colored nodes. (g) Another partition of the image, denoted π ′ . The nodes representing π ′ (shaded red) are a cut of the hierarchy T and are the leaves of a tree T ′ obtained by pruning T . (h) The dendrogram representing T . Each of the above partitions is represented by a cut of the dendrogram (red dashed lines).

Fig. 2 :
Fig. 2: A few examples illustrating how regions may be represented by a hierarchy. We use the hierarchy described in Figure 1. (a) Set-difference of nodes: N 15 \N 11 . The nodes covering the part of I that is included in this set-difference are shaded green. (b) A possible T -partition of N 15 \N 11 is shaded red. (c) The unique coarsest T -partition of N 15 \N 11 , which is {N 4 , N 14 }, is shaded red. (d) A possible T -partition of the complement I\(N 4 ∪ N 14 ) is shaded blue. (e) The unique coarsest T -partition of the complement I\(N 4 ∪ N 14 ), which is {N 11 }, is shaded blue.
Fig. 3: Examples of segmentations of various consistency types, all consistent with the hierarchy described in Figure 1, shown also in (e). All segmentations are specified by three nodes (although sometimes fewer nodes suffice). Note that a segment is not necessarily connected. Except for the a-consistency, the nodes in a cut of the hierarchy specifying each segmentation are shaded with the colors of the segments in which they are included. (a) An a-consistent segmentation, into three segments, specified by a cut of the hierarchy: {N 4 , N 11 , N 14 }. (b) A segmentation that is b-consistent with the same cut of the hierarchy as in (a). The nodes N 4 and N 14 are merged into one segment. (c) A segmentation, denoted s, that is c-consistent with the subset N s = {N 1 , N 6 , N 9 }. The burgundy segment is • N s and the turquoise segment is the complement I\ • N s . Note that the T -partition of the burgundy segment is non-coarsest. Hence, N s ̸ = N c s and the specified cut of the hierarchy is non-coarsest for s. The minimal number of disjoint nodes required to cover the turquoise segment is four, while it is only two for the burgundy segment; hence, N c s = {N 6 , N 11 }. (d) The same segmentation s is d-consistent with the subset N s = {N 4 , N 6 , N 14 }. The turquoise segment is specified by N 4 ∪ {N 14 \N 6 }, and the burgundy segment is the rest of the image: I\ • N s . The specified cut of the hierarchy is the coarsest for s. Note that representing this segment with N c s = {N 6 , N 11 }, as specified above, is valid and more node-economical.

Fig. 4 :
Fig. 4: A geometrical interpretation: the Jaccard index J(s) is the tangent of the angle α s .


Fig. 6 :
Fig. 6: Specification of a segmentation that is d-consistent with the subset N = {N 4 , N 7 , N 10 , N 14 }, consisting of nodes from the hierarchy described in Figure 1. The segments are • D N and the complement I\ • D N . Here, the maximal non-empty layer corresponds to the index i max N = 2. The layers are: (a) K 0 N = {N 4 , N 14 }, (b) K 1 N = {N 10 }, (c) K 2 N = {N 7 }. The set differences between subsequent layers are (d) D 0 N = K 0 N \K 1 N , (e) D 1 N = K 2 N \∅. The final segmentation is specified by (f ) • D N = D 0 N ∪ D 1 N ; see Section 6.4.1.

Property 3 .
(Equivalent benefit expressions) b-consistency: N is a partition of I and B[N ] = Σ N ∈ N : A(N )>0 A(N ). c-consistency: N consists of disjoint nodes and B[N ] is the analogous sum over the nodes N ∈ N with A(N ) > 0.

Lemma 6 .
Let N ∈ T be a non-leaf node. The subset N ⊂ T N having the largest/smallest benefit among the following is a best/worst subset in T N : (a) the union of best/worst subsets in T Nr and in T N l , having the maximal/minimal benefit among all such unions of size |N |; (b) N itself, together with the union of worst/best subsets in T Nr and in T N l , having the minimal/maximal benefit among all such unions of size |N | − 1.

For our auxiliary algorithms, the vector variable size is bounded by k. The vector variables of a node N may be calculated in O(min(k, |L Nr |) • min(k, |L N l |)) time. For the common case, where k << |L|, this amounts to O(k 2 ) and is independent of the tree size. The algorithm linearly scans all the nodes and requires O(|L| • (min(k, log|L|)) 2 ) time. This includes the time required to get the best subset from the node vectors. The full algorithm starts by calculating the node dimensions (b N , f N ). First, these dimensions are calculated for the leaves of T in O(|I|) time, and then propagated to the rest of the nodes in linear time. Overall, this calculation takes O(|I| + |T |) = O(|I| + |L|) time. Thus, the total time complexity is O(|I| + n • |L| • (min(k, log|L|)) 2 ), where n is the number of iterations made by scheme 1. The straightforward (and least tight) upper bound on n is the number of segmentations s ∈ S with different scores M(s) (the measure maximized in scheme 1), since M(s) strictly increases from iteration to iteration (Section 5). However, in practice, we found that only a few iterations are required (no more than five).

Fig. 7 :
Fig. 7: Hierarchy-consistent optimal segmentations for the HED hierarchy. (a) The original image (a cat image from the Weizmann database), (b) the ground truth, (c) the saliency map for the HED hierarchy. The segmentations are calculated for the three b-, c-, and d-consistencies and for several numbers of nodes. Note that for a low number of nodes (e.g., 5) the b-consistent segmentation (g) is of lower quality than the other segmentations (h),(i). Note also that the c-consistent segmentation (h) is slightly worse than the d-consistent one (i). The differences decrease when the number of nodes increases.

Fig. 10 :
Fig. 10: Consistency robustness against incorrect mergings - an example of a hierarchy with several nodes in the foreground (A, . . ., E) and one node in the background (F), which was merged incorrectly with E. Expressing the foreground using this hierarchy requires 6, 5, and 2 nodes for the b-, c-, and d-consistency, respectively; see text.
From this point forward in this paper, N s is considered as the minimal set, so that all nodes in it are required to specify s.