1 Introduction

Generic (i.e., non-semantic) image segmentation is widely used in various tasks of image analysis and computer vision. A variety of image segmentation methods have been proposed in the literature, including the watershed method [1], the level-set method [2], normalized cuts [3], and many others. Modern generic segmentation algorithms use (deep) edge detectors and watershed-like merging [4]. Augmenting the detected edges with region descriptors improves segmentations [5]. Note that generic image segmentation, the topic we focus on in this paper, is different from semantic image segmentation, which segments objects from specific classes with the help of (deep) image classifiers [6, 7].

Segmentation (generic or semantic) is useful for numerous applications, such as image enhancement [8], image analysis [9], and medical image analysis [10].

The dominant generic segmentation algorithms (e.g., [4]) are hierarchical and built as follows: first, an oversegmentation is carried out, specifying superpixels as the elements to be grouped. Then, a hierarchical structure (usually represented by a tree) is constructed with the superpixels as its smallest elements (i.e., leaves). The regions specified by the hierarchy are the building blocks from which the final segmentation is decided. Restricting the building blocks to the elements of the hierarchy yields simple, effective algorithms at a low computational cost. Most segmentation methods build the segmentation from the hierarchy by choosing a cut from a limited cut set. Our first contribution is to generalize this choice. We systematically consider all possible ways for specifying a segmentation, using set operations on elements of the hierarchy. Most of these methods are new.

We are also interested in the limitations imposed on the segmentation quality by using the hierarchy-based approach. These limitations depend on (1) the quality of the hierarchy, (2) the number of hierarchy elements (nodes) that may be used, and (3) the way that these elements are combined. We investigate all these causes in this paper. The quality is also influenced by the oversegmentation quality, which was studied elsewhere [11].

The number of hierarchy elements determines the complexity of specifying a segmentation. Lower complexity is advantageous by the minimum description length (MDL) principle, which minimizes a cost composed of the description cost and the approximation cost, and relies on statistical justifications [12,13,14,15,16]. Moreover, representation by a small number of elements opens possibilities for a new type of segmentation algorithms that are based on search, for example, in contrast to the greedy current algorithms. The number of elements needed also indicates, in a sense, how much information about the segmentation is included in the hierarchy, and thus, it provides a measure of quality for the hierarchy as an image descriptor, as well as a global measure of the associated boundary operator.

To investigate the hierarchy-induced limitations, we optimize the segmentation from elements of a given hierarchy. We consider binary segmentation, and use the Jaccard index (IoU) measure of quality [17]. More precisely, we use image-dependent oversegmentation and hierarchies produced by algorithms that have access only to the image. However, we allow the final stage, which constructs the segmentation from the hierarchy elements, to have access to the ground truth segmentation. As a result, the imperfections of the optimized segmentation correspond only to its input, i.e., to the hierarchy. Thus, the results we obtain are upper bounds on the quality that may be achieved by any realistic algorithm that does not have access to the ground truth but relies on the same hierarchy.

Optimizing the Jaccard index is highly nontrivial, but we provide a framework that optimizes it exactly and effectively. Earlier studies either use simplistic quality measures or rely on facilitating constraints [18, 19].

The contributions of this work are as follows:

  1.

    Four different methods for specifying a hierarchy-induced segmentation. These methods are denoted (segmentation to hierarchy) consistencies.

  2.

    Efficient and exact algorithms for finding the best segmentation (in the sense of maximizing the Jaccard index) that is consistent with a given hierarchy. We provide four algorithms, one for each consistency. The algorithms are fast, even for large hierarchies.

  3.

    A characterization of the limits of hierarchy-induced segmentation. Notably, this characterization is also a measure of the hierarchy quality.

This paper considers segmentation of images, but all the results apply as well to the partition of general data sets. The paper continues as follows. First, we describe terms and notations required for specifying the task (Sect. 2). In Sect. 3, we present our goal and discuss the notion of consistencies, which is central to this paper. In Sect. 4, we review several related works. In Sect. 5, we develop an indirect optimization approach that relies on the notion of co-optimality and enables us to optimize certain quality measures. Section 6 provides particular optimization algorithms and the corresponding upper bounds for the Jaccard index and the different consistencies. The bounds are evaluated empirically in Sect. 7, which also provides some typical hierarchy-based segmentations. Finally, we conclude and suggest some extensions in Sect. 8.

Fig. 1

a The true segmentation (GT). (b–f) A chain of the image partitions: \(\Pi = \{\pi _0, \, \pi _1, \, \pi _2, \, \pi _3, \, \pi _4\},\) which yields a hierarchy \(\mathcal{T} = \{N_1, \dots , N_{15}\}\). Each \(\,\pi _i\,\) is represented in the Binary Partition Tree (representing \(\,\mathcal T\)) by a set of colored nodes. (g) Another partition of the image, denoted \(\,\pi '\). The nodes representing \(\,\pi '\,\) (shaded red) are a cut of the hierarchy \(\,\mathcal T,\) and are the leaves of a tree \(\,\mathcal T'\) obtained by pruning \(\,\mathcal T.\) (h) The dendrogram representing \(\,\mathcal T.\) Each of the above partitions is represented by a cut of the dendrogram (red dashed lines).

2 Preliminaries

2.1 Hierarchies

The following definitions and notations are standard, but are presented here for the sake of completeness. Recall that a partition of a set I is a set of non-empty subsets of I, such that every element of I is in exactly one of these subsets (i.e., I is a disjoint union of the subsets). In this paper, these subsets are referred to as regions. Moreover, all examples use connected regions, but the connectivity constraint is not needed for the theory and algorithms.

Let \(\pi _1\) and \(\pi _2\) be two partitions of a pixel set I. Partition \(\pi _1\) is finer than partition \(\pi _2,\) denoted \(\pi _1\le \pi _2,\) if each region of \(\pi _1\) is included in a region of \(\pi _2.\) In this case, we also say that \(\pi _2\) is coarser than \(\pi _1.\) Let \(\Pi \) be a finite chain of partitions \(\Pi \!=\!\{ \pi _i \,\, | \,\, 0 \le i \le j \le n \implies \pi _i \le \pi _j \}\) where \(\pi _0\) is the finest partition and \(\pi _n\) is the trivial partition of \(\,I\,\) into a single region: \(\pi _n = \{I\}\). A hierarchy \(\,\mathcal {T}\) is a pool of regions of \(\,I,\) called nodes, that are provided by elements of \(\Pi \!:\,\,\mathcal{T} \!=\! \{\, N \! \subset I \,\, | \,\, \exists \, \pi _i \in \Pi \,: \, N \! \in \! \pi _i \,\}.\) For any two partitions from \(\Pi ,\) one is finer than the other, hence, any two nodes \(N_1, N_2 \!\in \! \mathcal T\) are either nested \((N_1 \!\subset \! N_2\) or \(N_2 \!\subset \! N_1),\) or disjoint \((N_1 \!\cap \! N_2 \!= \emptyset );\) see Fig. 1.

Let \(N_1\) and \(N_2\) be two different nodes of \(\,\mathcal T\). We say that \(N_1\) is the parent of \(N_2\,\) if \(\,N_2 \subset N_1\) and there is no other node \(N_3 \! \in \! \mathcal{T}\) such that \(N_2 \! \subset \! N_3 \! \subset \! N_1\). In this case, we also say that \(N_2\) is a child of \(N_1\). Note that every node has exactly one parent, except \(I \in \pi _n,\) which has no parent. Hence, for every node \(N \! \in \! \mathcal{T}\), there is a unique chain: \(N \! = \! N_1 \! \subset \! \dots \! \subset \! N_k \! = \! I\,,\) where \(\,N_i\,\) is the parent of \(\,N_{i-1}\). Thus, the parenthood relation induces a representation of \(\,\mathcal T\) by a tree, in which the nodes of \(\,\pi _0\,\) are the leaves, and the single node of \(\,\pi _n\,\) is the root; see Fig. 1. Hence, we also refer to \(\,\mathcal T\) as a tree. When each non-leaf node of \(\,\mathcal T\) has exactly two children, \(\mathcal T\) is a binary partition tree (BPT) [18,19,20,21]. In this paper, we focus on BPTs, but our results hold for non-binary trees as well.

A hierarchy \(\mathcal T\) can be represented by a dendrogram, and every possible partition of I corresponds to a set of \(\mathcal T\)’s nodes and may be obtained by “cutting” the dendrogram; see Fig. 1. In the literature, any partition of I into nodes of \(\mathcal T\) is called a cut of the hierarchy [22, 23]. Every \(\pi _i \in \Pi \) is a horizontal cut of the hierarchy, but there are many other ways to cut the hierarchy, and each cut specifies a partition of \(\,I.\) As we shall see later, a hierarchy may induce other partitions of I.

Pruning a tree at some node N removes the entire subtree rooted at N, except N itself, which becomes a leaf. Each cut of a hierarchy represents a tree \(\,\mathcal T'\) obtained by pruning \(\,\mathcal T,\) by specifying the leaves of \(\,\mathcal T'\); see Fig. 1. The converse is also true: the leaves of a tree obtained by pruning \(\,\mathcal T\) are a cut of the hierarchy. That is, a subset of nodes \(\,\mathcal N \! \subset \! \mathcal T\) is a cut of the hierarchy, if and only if \(\,\mathcal N\) is the set of leaves of a tree obtained by pruning \(\,\mathcal T\). More precisely, \(\,\mathcal N \! \subset \! \mathcal T\) is a cut of the hierarchy, if and only if for every leaf in \(\,\mathcal T\), the only path between it and the root contains exactly one node from \(\,\mathcal N\). Often, a segmentation is obtained by searching for the best pruning of \(\,\mathcal T\). However, the cardinality of the set of all prunings of \(\,\mathcal T\) grows exponentially with the number of leaves in \(\,\mathcal T\) [18]. Thus, it is infeasible to scan this set exhaustively by brute force.
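
To make the cut criterion concrete, here is a minimal Python sketch (the data-structure names are ours, not from the paper): the hierarchy is held as parent pointers, and a node subset is accepted as a cut exactly when every leaf-to-root path contains one of its nodes.

```python
# A minimal sketch (hypothetical names): the hierarchy as parent pointers,
# and the cut criterion of Sect. 2.1: a node subset is a cut iff every
# leaf-to-root path contains exactly one of its nodes.

def is_cut(parent, leaves, candidate):
    """parent: dict mapping each node to its parent (root maps to None);
    leaves: the leaf nodes of T; candidate: the node subset to test."""
    candidate = set(candidate)
    for leaf in leaves:
        hits, node = 0, leaf
        while node is not None:         # walk the unique path leaf -> root
            hits += node in candidate
            node = parent[node]
        if hits != 1:                   # the path must meet the subset once
            return False
    return True
```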

2.2 Coarsest Partitions

We use the following notations. The cardinality of a set is denoted by \(|\, \cdot \,|\). The initial partition \(\pi _0\) of \(\,I,\) which is the set of leaves of the tree \(\,\mathcal T,\) is denoted by \(\,\mathcal L.\) Let \(N \! \in \! \mathcal T; \,\) we denote by \(\mathcal{T}^N \!\! \subset \! \mathcal T\) (resp. \(\mathcal{L}^N \!\! \subset \! \mathcal L\,\)) the subset of nodes of \(\,\mathcal T\) (resp. \(\mathcal L\)) included in N. Note that \(\,\mathcal{T}^N\) is represented by the subtree of \(\,\mathcal T\) rooted in N; hence, we refer to \(\,\mathcal{T}^N\) also as a subtree, and to \(\,\mathcal{L}^N\) as the leaves of this subtree.

Let \(\,Y \! \subset I\) be a pixel subset. We refer to any partition of Y into nodes of \(\mathcal T\) (namely, a subset of disjoint nodes of \(\,\mathcal T\) whose union is \(\,Y\)) as a \(\,\mathcal T\!\)-partition of \(\,Y\). Note that a \(\mathcal T\)-partition of \(\,Y\) does not necessarily exist. We refer to the smallest subset of disjoint nodes of \(\,\mathcal T\) whose union is \(\,Y\,\) as the coarsest \(\,\mathcal T\!\)-partition of \(\,Y\). Obviously, \(\,\mathcal N \! \subset \! \mathcal T\,\) is a cut of the hierarchy if and only if \(\,\mathcal N\,\) is a \(\,\mathcal T\)-partition of \(\,I\). \(\,\mathcal N \! \subset \! \mathcal T\,\) is a \(\,\mathcal T\)-partition of a node \(N \! \in \! \mathcal T,\,\) if and only if \(\,\mathcal N\) is the set of leaves of a tree obtained by pruning \(\mathcal{T}^N\!\). Obviously, the coarsest \(\,\mathcal T\)-partition of a node \(N \! \in \! \mathcal T\) is \(\{N\}.\)

Figure 2 illustrates several ways of representing a region using a hierarchy and the corresponding coarsest partition.
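
One way to compute the coarsest \(\mathcal T\)-partition is to collect, top-down, the maximal nodes included in \(Y\) (by Lemma 1, the result is unique whenever it exists). A sketch, under the simplifying assumption that \(Y\) is given as a set of leaf regions, so that a \(\mathcal T\)-partition of its union exists:

```python
# A sketch of computing the coarsest T-partition (hypothetical names).
# Y is given as a set of leaf regions, so a T-partition of its union
# exists; the coarsest one consists of the maximal nodes included in Y.

def coarsest_partition(node, Y):
    """node: a tree node with attributes .leaves (set of its leaves)
    and .children (empty sequence for a leaf); Y: a set of leaves."""
    if node.leaves <= Y:                # node entirely inside Y: keep it whole
        return [node]
    if not (node.leaves & Y):           # node disjoint from Y: nothing to add
        return []
    parts = []                          # node straddles Y: descend
    for child in node.children:
        parts += coarsest_partition(child, Y)
    return parts
```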

Property 1

A non-coarsest \(\,\mathcal T\)-partition of a node \(N \! \in \! \mathcal T\) is a union of \(\,\mathcal T\)-partitions of its children.

In Fig. 1g, for example, the subset \(\{N_4, N_5, N_8, N_{10}, N_{11}\}\) is a non-coarsest \(\mathcal T\)-partition of \(N_{15}\,\) whereas, \(\{N_4, N_{11}\}\) and \(\{N_5, N_8, N_{10}\}\) are \(\mathcal T\)-partitions of the children of \(N_{15}: \, N_{13}\) and \(N_{14},\) respectively.

Lemma 1

(See Appendix 1 for the proof.)

(i)

A \(\,\mathcal T\)-partition of a pixel subset \(\,Y \! \subset I\,\) is non-coarsest, if and only if it contains a non-coarsest \(\,\mathcal T\)-partition of some node \(\,N \! \in \! \mathcal T\) that is included in \(\,Y\) (\(\,N \! \subset Y\)).

(ii)

When the coarsest \(\,\mathcal T\)-partition of a pixel subset \(\,Y \! \subset I\,\) exists, it is unique.

3 Problem Formulation

3.1 The General Task

As discussed in the introduction, we consider only segmentations that are consistent in some way with a given hierarchy, and aim to find limitations on their quality.

Obviously, the quality improves with the number of regions. More precisely,

General task: Given a hierarchy and a measure for estimating segmentation quality, we want to find a segmentation that has the best quality, is consistent with the hierarchy, and uses no more than a given number of regions from it.

To make this task well-defined, we now specify and formalize the notion of segmentation consistency with a hierarchy.

3.2 Various Types of Consistency Between a Segmentation and a Hierarchy

We consider segmentations whose (not necessarily connected) segments are specified by set operations on the nodes of \(\,\mathcal T\). The intersection between two nodes in a hierarchy is either empty or one of the nodes. Therefore, we are left with union and set difference. Complementing (with respect to I) is allowed as well. By further restricting the particular operations and the particular node subsets on which they act, we get different, non-traditional ways of specifying a segmentation from a hierarchy. We denote these different ways as (hierarchy to segmentation) consistencies.

Definition 1

Consistency. Let \(\,\mathcal Y\,\) be a set of pixel subsets; we denote by \(\bigcup \mathcal{Y}\) the union of all elements of \(\,\mathcal Y.\) We say that:

  (a)

    A segmentation s is a-consistent with \(\mathcal T\) if there is a subset \(\,\mathcal{N}_s \! \subset \! \mathcal T\,\) such that each segment in s is a single node of \(\,\mathcal{N}_s.\)

  (b)

    A segmentation s is b-consistent with \(\mathcal T\,\) if there is a subset \(\,\mathcal{N}_s \! \subset \! \mathcal T\,\) such that each segment in s is a union of some nodes of \(\,\mathcal{N}_s.\)

  (c)

    A segmentation s is c-consistent with \(\mathcal T\,\) if there is a subset \(\,\mathcal{N}_s \! \subset \! \mathcal T\,\) such that each segment in s, except at most one, is a union of some nodes of \(\,\mathcal{N}_s.\) One complement segment, if it exists, is \(I \backslash \bigcup \mathcal{N}_s.\)

  (d)

    A segmentation s is d-consistent with \(\mathcal T\,\) if there is a subset \(\,\mathcal{N}_s \! \subset \! \mathcal T\,\) such that each segment, except at most one, is obtained by unions and/or differences of nodes of \(\,\mathcal{N}_s.\) One complement segment, if it exists, is the complement (in \(\,I\)) of the union of the other segments.

Remark 1

Consistency of some type, with the subset \(\mathcal{N}_s\), implies consistency of a later, more general type, with the same subset \(\mathcal{N}_s\).

Remark 2

We argue that these four consistency types systematically cover all possibilities. The first choice is whether the node subset \(\mathcal{N}_s\) should be limited to a hierarchy cut or not. For the cut case (which is the popular choice in the literature), union is the only set operation that makes sense, because set difference between disjoint nodes is empty and the cut covers the full image, making the complement empty as well. For the more general case, where the subset \(\mathcal{N}_s\) is not necessarily a cut, both the union and the set difference are relevant. Unions without set difference form an important special case that is simpler both conceptually and computationally. Set difference between two nodes without additional unions does not seem to justify another consistency type (and is included, of course, in d-consistency).

Figure 3 illustrates the different consistencies. The a-consistency is used in most hierarchy-based segmentation algorithms, where some cut is chosen and all its leaves are specified as segments; see [19, 24]. To the best of our knowledge, the b-, c-, and d-consistencies have not been used in the context of hierarchical segmentation; see, however, [25] for (c-consistency-like) node selection in a hierarchy of components.

As specified above, the subset \(\,\mathcal{N}_s\), specified for a segmentation s, is not necessarily unique; see Figs. 2 and 3c, d. From this point on, \(\,\mathcal{N}_s\) is taken to be minimal, i.e., every node in it is required to specify s. As a result,

Property 2

If \(\mathcal{N}_s \!\subset \! \mathcal T\) is a subset associated with some consistency type a/b/c/d of a segmentation s,  then

  (a)

    \(s\,\) is a-consistent with \(\,\mathcal T,\,\) if and only if \(\,\mathcal{N}_s\,\) is a cut of \(\,\mathcal T\) such that each segment of s is a single node of \(\,\mathcal{N}_s.\) The subset \(\,\mathcal{N}_s\,\) associated with a-consistency of s is unique.

  (b)

    \(s\,\) is b-consistent with \(\,\mathcal T,\,\) if and only if \(\,\mathcal{N}_s\,\) is a cut of \(\,\mathcal T.\)

  (c)

    \(s\,\) is c-consistent with \(\,\mathcal T,\,\) if and only if \(\,\mathcal{N}_s\,\) consists of disjoint nodes of \(\,\mathcal T.\)

  (d)

    \(s\,\) is d-consistent with \(\,\mathcal T,\,\) if and only if \(\,\mathcal{N}_s\,\) consists of (possibly overlapping) nodes of \(\,\mathcal T.\)

Lemma 2

Every segmentation that is consistent with a hierarchy in one of the types b/c/d is also consistent with the hierarchy in the other two types.

Proof sketch:

Following Remark 1, consistency of a segmentation according to one type implies its consistency according to the more general types. The converse is also true.

Consider a d-consistent segmentation \(\,s.\) Recall that every node in \(\mathcal T\) is a union of disjoint nodes of the initial partition \(\mathcal L.\) A set difference of nested nodes is still a union of nodes of \(\,\mathcal L\) (which are a \(\mathcal T\)-partition of this set difference); see Fig. 2a–c. Hence, there is a subset \(\mathcal{N}_s\) consisting of disjoint nodes. By Property 2, \(s\,\) is also c-consistent.

A subset \(\mathcal{N}_s\) consisting of disjoint nodes can be completed to a partition of \(\,I\) by adding some \(\mathcal T\)-partition of the complement \(I \backslash \bigcup \mathcal{N}_s\); see Fig. 2d–e. Hence, there is another subset \(\mathcal{N}_s\) that is a cut of the hierarchy. By Property 2, s is also b-consistent. \(\square \)

Lemma 2 states the somewhat surprising result that the b/c/d consistencies are equivalent. Thus, the set of segmentations consistent with \(\,\mathcal T\), using either b-, c-, or d-consistency, is the same. Denote this set by \(\mathcal S\). Note that the set of a-consistent segmentations, \(\mathcal{S}_1 \!\subset \! \mathcal S\), is smaller. The consistencies may differ significantly, however, in the \(\mathcal{N}_s\,\) subsets. Let \(\,\mathcal{N}_s^a\,\) (resp. \(\mathcal{N}_s^b\,, \,\mathcal{N}_s^c\,, \,\mathcal{N}_s^d\,\)) be the smallest subset such that \(\,s \!\in \! \mathcal S\) is a- (resp. b-, c-, d-) consistent with this subset.

Fig. 2

A few examples illustrating how regions may be represented by a hierarchy. We use the hierarchy described in Fig. 1. a Set difference of nodes: \(N_{15} \!\backslash N_{11}\). The nodes covering the part of \(\,I\) included in this set difference are shaded green. b A possible \(\,\mathcal T\)-partition of \(\,N_{15} \!\backslash N_{11}\,\) is shaded red. c The unique coarsest \(\,\mathcal T\)-partition of \(\,N_{15} \!\backslash N_{11}\,\), which is \(\{N_4, N_{14}\}\), is shaded red. d A possible \(\mathcal T\)-partition of the complement \(I \backslash (N_4 \! \cup \! N_{14})\) is shaded blue. e The unique coarsest \(\mathcal T\)-partition of the complement \(I \backslash (N_4 \!\cup \! N_{14}),\) which is \(\{N_{11}\},\) is shaded blue.

The proof of the following lemma is straightforward.

Lemma 3

Let \(\,s \! \in \! \mathcal S\,\), then \(|\mathcal{N}_s^b| \, \geqslant |\mathcal{N}_s^c| \, \geqslant |\mathcal{N}_s^d|\). Furthermore, \(\,\mathcal{N}_s^b\,\) is unique, but not necessarily \(\,\mathcal{N}_s^c, \,\mathcal{N}_s^d.\) Moreover, if s is a-consistent, then \(\mathcal{N}_s^a= \mathcal{N}_s^b\).

Note that a segmentation is consistent with \(\,\mathcal T\) (in some consistency a/b/c/d), if and only if each of its segments is a union of nodes from \(\,\mathcal T\); that is, there is a \(\,\mathcal T\)-partition for each segment. For a segmentation \(s \! \in \! \mathcal S,\) we refer to the union of the coarsest \(\,\mathcal T\)-partitions of the segments of \(\,s\) (i.e., \(\mathcal{N}_s^b\)) as the coarsest cut of the hierarchy for \(\,s.\) Lemma 3 implies that for every \(s \! \in \! \mathcal S\) there is a unique coarsest cut of the hierarchy. The converse is not true. A cut of a hierarchy can be the coarsest for several segmentations. For example, the same cut is the coarsest for different segmentations (a) and (b) in Fig. 3.

Fig. 3

Examples of segmentations of various consistency types, all consistent with the hierarchy described in Fig. 1, shown also in e. All segmentations are specified by three nodes (although sometimes fewer nodes suffice). Note that a segment is not necessarily connected. Except for the a-consistency, the nodes in a cut of the hierarchy specifying each segmentation are shaded with the colors of the segments in which they are included. a An a-consistent segmentation, into three segments, specified by a cut of the hierarchy: \(\{N_4, N_{11}, N_{14}\}\). b A segmentation that is b-consistent with the same cut of the hierarchy as in (a). The nodes \(N_4\) and \(N_{14}\) are merged into one segment. c A segmentation, denoted s, that is c-consistent with the subset \(\mathcal{N}_s = \{N_1, N_6, N_9\}.\) The burgundy segment is \(N_1 \!\cup N_6 \!\cup N_9\) and the turquoise segment is the complement \(I \backslash (N_1 \!\cup N_6 \!\cup N_9).\) Note that the \(\,\mathcal T\)-partition of the burgundy segment is non-coarsest. Hence, \(\,\mathcal{N}_s \!\ne \! \mathcal{N}_s^c\,\) and the specified cut of the hierarchy is non-coarsest for s. The minimal number of disjoint nodes required to cover the turquoise segment is four, while it is only two for the burgundy segment; hence, \(\mathcal{N}_s^c \!=\! \{N_6, N_{11}\}.\) d The same segmentation \(s\,\) is d-consistent with the subset \(\,\mathcal{N}_s = \{N_4, N_6, N_{14}\}.\) The turquoise segment is specified by \(N_4 \cup (N_{14} \!\backslash N_6),\) and the burgundy segment is the rest of the image: \(I \backslash (N_4 \cup (N_{14} \!\backslash N_6)).\) The specified cut of the hierarchy is the coarsest for s. Note that representing this segment with \(\mathcal{N}_s^c \!=\! \{N_6, N_{11}\},\) as specified above, is valid and more node-economical.

4 Previous Work

The task considered here (and in [18, 26]) is to estimate the limitation associated with hierarchy-based segmentation. That is, to find the best \(s \!\in \! \mathcal S\) (maximizing the quality \(\mathcal{M}(s)\)) that is consistent with the hierarchy for a limited size of \(\,\mathcal{N}_s,\) \(\,|\mathcal{N}_s| \,\leqslant k.\) This upper bound on the segmentation quality is a function of the consistency type and k. We refer to the segmentation maximizing the quality as a/b/c/d-optimal.

First, we emphasize again that this task is different from the common evaluation of hierarchy-dependent segmentations, which provides precision-recall curves and chooses the best segmentation from them; see, e.g., [24, 27, 28]. This approach considers only the easily enumerable set of segmentations associated with horizontal cuts, which are parameterized by a single scalar parameter. Here, on the other hand, we find the best possible segmentation from much more general segmentation sets, and provide an upper bound on its quality measure. The best segmentations from these larger sets often have significantly better quality; see [28].

Only a few papers address such upper bounds. Most of the upper bounds were derived for local measures. A local measure \(\mathcal{M}(s)\) of a segmentation s may be written as a sum of functions defined over the components of the cut defining s.

Local measures are considered in [18]. An elegant dynamic programming algorithm provides upper bounds on these measures for segmentations that are a-consistent with a given BPT hierarchy. Unlike that work, we consider binary segmentation, for which the a-consistent segmentation is trivial. We extend this work by considering b-, c-, and d-consistent segmentations and by optimizing each one for a non-local measure: the Jaccard index.

The boundary-based \(F_b\) measure [29] was considered in [19], where a method to evaluate the a-consistency performance of a BPT hierarchy is proposed. The optimization was modeled as a Linear Fractional Combinatorial Optimization problem [30] and was solved for every possible size of a cut of a hierarchy (from 1 to \(|\mathcal L|\)). This process is computationally expensive and is therefore limited to moderate-size hierarchies.

Extending those previous works, a hierarchy evaluation framework was proposed [28]. It includes various types of upper bounds corresponding to boundaries and regions, and further extends the analysis to supervised, marker-based segmentation. More recently, the study described in [31] introduced some new measures that quantify the match between a hierarchy and the ground truth. Both papers [28, 31] address neither the exact optimization of the Jaccard index nor the advanced (b, c, d) consistencies.

5 A Co-optimality Tool for Optimization

Given a quality measure \(\mathcal M\) over a set \(\mathcal S,\) we want to find \(s \! \in \! \mathcal S\) with the best score, \(\mathcal{M}(s).\) Optimizing the quality measures over all possible node subsets may be computationally hard. One approach could be to optimize an equivalent measure \(\mathcal{Q}(s)\) instead. Measures are equivalent if they rank objects identically. For example, the Jaccard index and the object-based F-measure are equivalent [24] because they are functionally related by a monotonically increasing function.

An equivalent measure \(\mathcal Q\) may, however, be as difficult to optimize. Recalling that we are interested only in the maximum of \(\mathcal M\) and not in the ranking of all subsets, we may turn to a weaker, easier-to-optimize form of equivalence.

Definition 2

Let \(\,\mathcal{S}_{_{\!\!\mathcal{M}}}\! \subset \mathcal S\) be the subset of the elements optimizing \(\mathcal M.\) We refer to measures \(\mathcal M\) and \(\mathcal Q\,\) as co-optimal over \(\mathcal S,\) if \(\,\mathcal{S}_{_{\!\!\mathcal{M}}}\! = \mathcal{S}_{_{\!\!\mathcal{Q}}}.\)

We now propose an optimization approach that is valid for general finite sets \(\mathcal S\), including but not limited to hierarchical segmentations. Algorithm 1 uses a family of measures \(\{\mathcal{Q}_{_{\omega }}\},\omega \!\in \! [0,1]\) over \(\,\mathcal S\). It works by iteratively alternating between assigning values to \(\omega \) and optimizing \(\mathcal{Q}_{_{\omega }}(s)\). As Theorem 1 below shows, under some conditions on the family \(\{\mathcal{Q}_{_{\omega }}\}\), the algorithm returns the segmentation that maximizes the quality measure \(\mathcal M\), and the corresponding maximal value \(\widehat{\mathcal{M}}\).

Algorithm 1 (the iterative scheme): initialize \(\omega \leftarrow \omega _0\); repeat \({s}_{_{\!\omega }} \leftarrow \mathop {\textrm{argmax}}_{s \,\in \, \mathcal S} \mathcal{Q}_{_{\omega }}(s)\) and update \(\omega \leftarrow \mathcal{M}({s}_{_{\!\omega }})\), until \(\omega \) no longer increases; return \({s}_{_{\!\omega }}\) and \(\widehat{\mathcal{M}} = \omega \).
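
In code, the scheme may be sketched as follows (Python; the two callables are our stand-ins: `M` evaluates the quality measure, and `argmax_Q` is the auxiliary algorithm returning a maximizer of \(\mathcal{Q}_{_{\omega }}\) over \(\mathcal S\)):

```python
# A sketch of the iterative scheme (Algorithm 1), assuming two callables:
# M(s) evaluates the quality measure, and argmax_Q(omega) is the auxiliary
# algorithm returning a maximizer of Q_omega over S.

def maximize_measure(M, argmax_Q, omega0=0.0):
    omega = M(argmax_Q(omega0))        # initial auxiliary optimization
    while True:
        s = argmax_Q(omega)            # optimize the auxiliary measure
        new_omega = M(s)               # re-evaluate the true measure
        if new_omega <= omega:         # omega stopped increasing: done
            return s, new_omega
        omega = new_omega              # Theorem 1: strictly increases
```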

Theorem 1

Let \(\mathcal M\) be a quality measure over a finite set \(\,\mathcal S,\) receiving its values in [0, 1]. Let \(\widehat{\mathcal{M}}\) be the (unknown) maximal value of \(\mathcal M\) over \(\mathcal S\). Let \(\{\mathcal{Q}_{_{\omega }}\},\omega \!\in \! [0,1]\) be a family of measures over \(\,\mathcal S,\) satisfying the following conditions:

  1.

    \(\mathcal{Q}_{\omega = \widehat{\mathcal{M}}}\,\,\text { and }\,\mathcal M\,\) are co-optimal measures over \(\mathcal S\).

  2.

    For \(0 \leqslant \omega < \widehat{\mathcal{M}}\) and \( s' \!\in \mathcal{S},\) if there is \(\,s \in \mathcal{S}_{_{\!\!\mathcal{M}}}\, \text {s.t.}\,\mathcal{Q}_{_{\omega }}(s) \leqslant \mathcal{Q}_{_{\omega }}(s')\), then \(\mathcal{M}(s') > \omega \).

Then Algorithm 1 returns \(\,s \in \mathcal{S}_{_{\!\!\mathcal{M}}}\) after a finite number of iterations.

Proof

Suppose that \(\omega _0\in [0,\widehat{\mathcal{M}}]\). Then the iterative scheme in each iteration finds \(\,{s}_{_{\!\omega }} \!\in \! \mathcal{S}_{_{\!\!\mathcal{Q}_{_\omega }}}\!\) and specifies a new value for \(\omega \) to be \(\mathcal{M}({s}_{_{\!\omega }}).\) Condition  2 is fulfilled trivially for \(s' \!=\! {s}_{_{\!\omega }}\) since \({s}_{_{\!\omega }}\) maximizes \(\mathcal{Q}_{_{\omega }}.\) Hence, \(\,\omega < \mathcal{M}({s}_{_{\!\omega }});\) i.e., \(\omega \) strictly increases from iteration to iteration while \(\,\omega < \widehat{\mathcal{M}}\,.\) \(\mathcal S\) is finite, hence, \(\omega \) reaches \(\widehat{\mathcal{M}}\) after a finite number of iterations. When that happens, \(\,\widehat{s}_{\omega = \widehat{\mathcal{M}}} \in \mathcal{S}_{_{\!\!\mathcal{M}}}\,\) since, by condition 1, \(\mathcal{Q}_{\omega = \widehat{\mathcal{M}}}\,\) and \(\mathcal{M}\) are co-optimal. The iterations stop when \(\,\omega \) no longer increases. Hence, to prove the theorem, we show that \(\,\omega \,\) does not change after it reaches \(\widehat{\mathcal{M}}\).

\(\pmb \implies \):

Suppose that \(\,\omega \!=\! \widehat{\mathcal{M}}.\) Since \(\widehat{s}_{\omega = \widehat{\mathcal{M}}}\) maximizes both \(\,\mathcal{M},\,\mathcal{Q}_{\omega = \widehat{\mathcal{M}}}\,,\) we have \(\,\widehat{\mathcal{M}} = \mathcal{M}({s}_{_{\!\omega }}) \quad \Rightarrow \quad \omega = \widehat{\mathcal{M}} = \mathcal{M}({s}_{_{\!\omega }}).\)

\(\pmb \impliedby \):

Conversely, suppose that \(\,\omega \!=\! \mathcal{M}({s}_{_{\!\omega }})\), i.e., \(\omega \) does not change at the update step of the algorithm. All values of \(\,\omega \,\) specified in scheme 1 are values of \(\mathcal M;\) hence, \(\,\omega \!\leqslant \! \widehat{\mathcal{M}}.\) If \(\,\omega \!<\! \widehat{\mathcal{M}},\) then, by condition 2, \(\,\omega \!<\! \mathcal{M}({s}_{_{\!\omega }}),\) which contradicts the current assumption. Hence, \(\,\omega = \widehat{\mathcal{M}}.\)

Suppose now that the condition required above, \(\omega _0\in [0,\widehat{\mathcal{M}}]\), is not satisfied (i.e., \(\omega _0 > \widehat{\mathcal{M}}\)). Then, the first iteration returns some \(\omega \), which cannot exceed \(\widehat{\mathcal{M}}\) (every value assigned to \(\omega \) is a value of \(\mathcal M\)). The algorithm then proceeds and reaches the optimum according to the proof above. \(\square \)

Given a quality measure \(\,0 \!\leqslant \! \mathcal{M} \!\leqslant \! 1\) over \(\mathcal S,\) we refer to a family \(\{\mathcal{Q}_{_{\omega }}\},\ \omega \!\in \! [0,1]\) of measures over \(\mathcal S,\) as a family of auxiliary measures for \(\mathcal M\) if \(\{\mathcal{Q}_{_{\omega }}\}\) contains at least one measure \(\mathcal{Q}_{_{\!\omega '}}\) that is co-optimal with \(\mathcal M\) over \(\mathcal S,\) and there is some iterative process that finds \(\mathcal{Q}_{_{\!\omega '}}\) from \(\{\mathcal{Q}_{_{\omega }}\}.\) We refer to \(\mathcal{Q}_{_{\!\omega '}}\) as a co-optimal auxiliary measure, and we refer to an algorithm that can optimize every member of \(\{\mathcal{Q}_{_{\omega }}\}\) as an auxiliary algorithm.

In scheme 1,  the auxiliary algorithm is written in the most general form: \(\mathop {\textrm{argmax}}\limits \,\mathcal{Q}_{_{\omega }}.\) In the next section, we provide a family of auxiliary measures and corresponding auxiliary algorithms, suitable for optimizing the Jaccard index, for different consistencies and constraints of the node set size.

6 Optimizing the Jaccard Index

After setting the framework and developing the necessary new optimization tool, we shall now turn to the main goal of this paper: Finding a tight upper bound on the obtainable Jaccard index.

6.1 The Jaccard Index

The Jaccard index (or the intersection over union measure) is a popular segmentation quality measure, applicable to a simple segmentation into two parts: foreground (or object) and background.

Let \(\,(\mathcal{B}_{_{\textrm{GT}}}, \mathcal{F}_{_{\textrm{GT}}})\,\) and \(\,(\mathcal{B}_{s},\mathcal{F}_{s})\,\) be two foreground-background partitions corresponding to the ground truth and a segmentation \(\,s \! \in \! \mathcal S\,\). The Jaccard index J is given by:

$$\begin{aligned} J(s)= \frac{|\mathcal{F}_{_{\textrm{GT}}}\cap \mathcal{F}_s|}{|\mathcal{F}_{_{\textrm{GT}}}\cup \mathcal{{F}}_s|} \end{aligned}$$
(1)

Given a hierarchy, we shall find, for each consistency and node subset size \(|\mathcal{N}_s|\), the segmentation that maximizes the Jaccard index. This segmentation also maximizes the object-based F-measure, as the two measures are equivalent [24].

For two-part segmentation, only one segmentation is a-consistent with a BPT hierarchy: the two children of the root. We ignore this trivial case in the following discussion.

6.2 Segmentation Dimensions

Let \(\mathcal{S}^{^{_2}} \!\!\subset \! \mathcal S\) be the subset of all possible 2-segment segmentations, consistent with the hierarchy. Denote the areas of the ground truth parts by \(B \! = \! |\mathcal{B}_{_\textrm{GT}}|\,, \,F \!=\! |\mathcal{F}_{_\textrm{GT}}|.\) Let \(\,\mathcal{X}_s\) be one segment of a segmentation \(\,s \! \in \! \mathcal{S}^{^{_2}}\). Considering this segment as foreground, denote its areas inside the ground truth’s parts by \((b_s \!=\! |\mathcal{X}_s \cap \mathcal{B}_{_\textrm{GT}}| \,, f_s \!=\! |\mathcal{X}_s \cap \mathcal{F}_{_\textrm{GT}}|).\) The Jaccard index is then

$$\begin{aligned} J(s) = \frac{|\mathcal{F}_{_\textrm{GT}}\cap \mathcal{X}_s|}{|\mathcal{F}_{_\textrm{GT}}\cup \mathcal{X}_s|} = \frac{f_s}{F + b_s} = \Psi (b_s,f_s). \end{aligned}$$
(2)

Alternatively, the foreground can be specified by the complementary segment \(\,I \backslash \mathcal{X}_{s}\,\). The corresponding areas inside the ground truth’s parts are \(\,(B-b_s,F-f_s).\) The Jaccard index associated with this foreground is

$$\begin{aligned} J^c(s) = \Psi (B - b_s,F - f_s) = \frac{F - f_s}{F + B - b_s}. \end{aligned}$$
(3)

Optimizing J(s) for b-consistency provides a cut of the tree, \(\,\mathcal{N}_s\). Both \(\,\mathcal{F}_s\) and \(\,\mathcal{B}_s\) are unions of nodes of this cut. The c/d consistencies allow one segment to be specified as the complement of the other. The hierarchy may match either \(\mathcal{F}_{_\textrm{GT}}\) or \(\mathcal{B}_{_\textrm{GT}}\) better. Thus, we optimize both \(\,J(s)\,\) and \(J^c(s)\,\) (for the same size of \(\,\mathcal{N}_s\)) and choose the better result.

The values \((b_s,f_s)\) are the main optimization variables. We refer to them as segmentation dimensions.
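
In code, the two scores follow directly from the dimensions; a small sketch (parameter names are ours):

```python
# The two Jaccard scores from the segmentation dimensions (a sketch;
# parameter names are ours). B, F: ground-truth background/foreground
# areas; b, f: the segment's areas inside each ground-truth part.

def jaccard(b, f, B, F):
    return f / (F + b)                     # Eq. (2): Psi(b, f)

def jaccard_complement(b, f, B, F):
    return (F - f) / (F + B - b)           # Eq. (3): Psi(B - b, F - f)
```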

6.3 Applying Co-optimality for Optimizing J(s)

6.3.1 Geometrical Interpretation

Fig. 4

A geometrical interpretation: the Jaccard index J(s) is the tangent of the angle \(\alpha _s\)

Our goal is to find

$$\begin{aligned} (\, \widehat{J} \,, \, \widehat{s} \, ) = \mathop {\mathrm {(\, max \, , \, argmax \,)}}\limits _{\!\!\! s \, \in \, \mathcal{S}^{^{_2}}} J(s). \end{aligned}$$
(4)

A key idea is to observe that the value J(s) may be interpreted geometrically using the graph of segmentation dimensions (b, f). After selecting the segment \(\,\mathcal{X}_s,\) every segmentation \(\,s \!\in \! \mathcal{S}^{^{_2}}\,\) corresponds to a point \((b_s, f_s)\) inside the rectangle \((0,B)\times (0,F)\). \(J(s)\) is \(\tan (\alpha _s),\) where \(\alpha _s\) is the angle between the \(\,b\,\) axis and the line connecting the point \((b_s, f_s)\) with the point \((-F, 0)\); see Fig. 4. Indeed, \(\tan (\alpha _s) = f_s / (b_s - (-F)) = f_s / (F + b_s) = J(s)\). The geometry implies that \(\tan (\alpha _s) \in [0,1]\), consistently with \(J(s) \in [0,1]\).

6.3.2 A Family of Auxiliary Measures

For every \(\omega \in [0, 1]\), let

$$\begin{aligned} P_{_{\!\omega }}(s) = f_s - \omega \cdot b_s \end{aligned}$$
(5)

be a measure over \(\mathcal{S}^2\). Note that, geometrically, \(P_{_{\!\omega }}(s)\) is the oblique projection at the \(\arctan (\omega )\) angle of point \((b_s, f_s)\) onto the f axis. The following two observations imply that J(s) and the projection (at \(\arctan (\widehat{J}\,)\) angle) \(P_{_{\!\widehat{J}}}(s)\) are co-optimal measures.

  1.

    \(J(s)\,\) and \(\,P_{_{\!\!J(s)\!}}(s)\,\) are equivalent measures.

  2.

    \(\,P_{_{\!\!J(s)\!}}(s)\,\) and \(\,P_{_{\!\widehat{J}}}(s)\) are co-optimal measures.

The first observation is clear: ranking the elements of \(\mathcal{S}^{^{_2}}\) by \(J(s) = \tan (\alpha _s)\) is equivalent to ranking them by their projection at angle \(\,\alpha _s,\) i.e., by \(P_{_{\!\!J(s)\!}}(s)\); see Fig. 4.

The second observation states that there is a constant angle \(\arctan (\omega )\) with \(\omega = \widehat{J} \) (not depending on s), such that the projection \(P_{_{\!\omega }}(s) \) at this angle and \(P_{_{\!\!J(s)\!}}(s)\) are co-optimal. By the first observation, \(P_{_{\!\!J(s)\!}}(s)\) is maximized by \(\widehat{s}\). Every non-optimal segmentation corresponds to a point below the line \([(-F,0)-(b_{\widehat{s}},f_{\widehat{s}})]\) and its constant angle projection satisfies \(P_{_{\!\omega }}(s) < P_{_{\!\omega }}(\widehat{s})\). \(P_{_{\!\omega }}(s)\) is maximized only by points lying on this line, as is also the case with \(P_{_{\!\!J(s)\!}}(s)\).

Thanks to these two observations, the family \(\{P_\omega \}\) is a family of auxiliary measures for the Jaccard index. The optimization process (Algorithm 1) maximizes this auxiliary measure in every iteration:

$$\begin{aligned} (\, \widehat{P}_{_{\!\omega }} \,, \, \widehat{s}_{_{\!\omega }} \, ) = \mathop {\mathrm {(\, max \, , \, argmax \,)}}\limits _{\!\!\! s \, \in \, \mathcal{S}^{^{_2}}} P_{_{\!\omega }}(s) \end{aligned}$$
(6)

Note that \(P_{\omega }(s)\) is linear in \((b_s,f_s),\) which simplifies its maximization. To use scheme 1 to find \(\widehat{s}\) (and \(\,\widehat{J}\,),\) the second condition of Theorem 1, which guarantees that \(\omega \) strictly increases at every iteration while \(\omega \!<\! \widehat{J}\), must be met as well.

Figure 5 geometrically proves this property. Indeed, let \(\omega \!\in \! [0,\widehat{J})\) and \(\,\widehat{s},s' \!\in \! \mathcal{S}^{^{_2}}\) such that \(\,P_{_{\!\omega }}(\widehat{s}) \!\leqslant \! P_{_{\!\omega }}(s').\) Observe that the angle \(\,\alpha _{s'}\) must be larger than the projection angle of \(\,P_{_{\!\omega }},\) i.e., \(\arctan (\omega ).\) The detailed proof is left to the reader.

Fig. 5

An illustration showing that \(\,\omega \,\) strictly increases from iteration to iteration, while \(\omega < \widehat{J}.\)

Therefore, by Theorem 1:

Theorem 2

For \(\mathcal{M} \!=\! J\), \(\{\mathcal{Q}_{_{\omega }}\} \!=\! \{P_{_{\!\omega }}\}\), and \(\omega \!\in \! [0,1]\), scheme 1 (starting from \(\omega _0\in [0,\widehat{J}]\)) returns the best segmentation \(\,\widehat{s}\,\) after a finite number of iterations.

Remark 3

Optimizing \(J^c(s)\) is done similarly.
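
To illustrate Theorem 2, here is a toy instantiation of scheme 1 that reuses the hypothetical `maximize_measure` sketch from Sect. 5. The explicit candidate list of \((b_s, f_s)\) pairs is for illustration only; the algorithms of Sect. 6.5 maximize \(P_{_{\!\omega }}\) over the hierarchy instead of enumerating.

```python
# A toy instantiation of scheme 1 for the Jaccard index, reusing the
# maximize_measure sketch from Sect. 5. The candidates are illustrative
# (b_s, f_s) dimensions of segmentations in S^2.
B, F = 500.0, 100.0
candidates = [(40.0, 60.0), (10.0, 50.0), (120.0, 95.0)]

def argmax_P(omega):                       # auxiliary algorithm for P_omega
    return max(candidates, key=lambda bf: bf[1] - omega * bf[0])

def J(bf):                                 # the Jaccard index, Eq. (2)
    b, f = bf
    return f / (F + b)

s_best, J_best = maximize_measure(J, argmax_P)  # -> (120.0, 95.0), ~0.153
```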

6.4 Optimizing J(s) for Hierarchical Segmentation

Using scheme 1 reduces the optimization of J(s) to iterations where auxiliary measures are optimized. The auxiliary algorithm provides a foreground-background segmentation \(s \!\in \! \mathcal{S}^{^{_2}}\) whose dimensions (\(b_s,f_s\)) maximize the auxiliary measure corresponding to the current iteration. In this work, we use the hierarchy for this optimization, and the auxiliary algorithm returns the best segmentation \(\,s \!\in \! \mathcal{S}^{^{_2}}\) together with the corresponding subset \(\,\mathcal{N}_s \!\subset \! \mathcal T,\,\) which both depend on the required consistency of \(\,s\) with \(\mathcal T.\)

6.4.1 Specifying \(\mathcal{N}_s\) and the Segmentation s for Various Consistencies

Here, we specify the relation between the hierarchy and the segmentation for each of the different consistencies considered in this paper.

a-consistency: Trivial and not considered here, as discussed above.

b-consistency: \(\mathcal{N}_s\,\) is a partition of \(\,I.\,\) A segmentation \(\,s\,\) is specified by assigning some nodes of \(\,\mathcal{N}_s\) to the foreground \(\,\mathcal{F}_s,\) and the rest to the background \(\,\mathcal{B}_s.\)

c-consistency: The nodes of \(\,\mathcal{N}_s\,\) are disjoint, but their union is not the full image. The segments of \(\,s\,\) are \(\bigcup \mathcal{N}_s\) (the union of the regions corresponding to the nodes in \(\mathcal{N}_s\)) and the complement \(I \backslash \bigcup \mathcal{N}_s.\)

d-consistency: Not all nodes of \(\,\mathcal{N}_s\,\) are necessarily disjoint, and their union is not necessarily the full image. The segments of \(\,s\) are specified as follows:

Let \(\,\mathcal N \!\subset \! \mathcal T\,\) be a subset of nodes. Because the nodes belong to a hierarchy, each pair of nodes is either disjoint or nested. Denote by \(\,\mathcal{K}^{^{_0}}_{_{\!\mathcal N}} \!\subset \! \mathcal N\) the subset of disjoint nodes that are not nested in any other node from \(\mathcal N.\) Recursively, denote by \(\mathcal{K}^i_{_{\!\mathcal N}} \!\subset \! \mathcal N\,\) the subset of disjoint nodes that are not nested in any other node from \(\mathcal{N} \backslash \{ \cup _{j = 0}^{i-1} \, \mathcal{K}^j_{_{\!\mathcal N}} \}.\) We refer to each \(\,\mathcal{K}^i_{_{\!\mathcal N}}\,\) as a layer of \(\,\mathcal N.\,\) Note that \(\bigcup \mathcal{K}^{i}_{_{\!\mathcal N}} \subset \bigcup \mathcal{K}^{i-1}_{_{\!\mathcal N}}\) (each subsequent layer is nested in any previous layer); hence, the layers are totally ordered by nesting. Let \(\,i^{^{_N}}_{_\mathcal{N}}\,\) be the index of the layer in which the node N lies. Note that the set of layers is a partition of \(\,\mathcal N,\) i.e., every node \(\,N \! \in \! \mathcal N\) is associated with exactly one index \(\,i^{^{_N}}_{_\mathcal{N}}.\) Note that \(\,i^{^{_N}}_{_\mathcal{N}}\) is the number of nodes in \(\mathcal N\) in which N is nested. Let \(i_{_\mathcal{N}}^{max}\) be the largest index corresponding to a nonempty layer. The segmentation is specified from

$$\begin{aligned} D^i_{_{\!\mathcal N}} = \Bigl ( \bigcup \mathcal{K}^{2i}_{_{\!\mathcal N}} \Bigr ) \backslash \Bigl ( \bigcup \mathcal{K}^{2i+1}_{_{\!\mathcal N}} \Bigr ), \quad i = 0, \dots , \big \lfloor i_{_\mathcal{N}}^{max} / 2 \big \rfloor \end{aligned}$$
(7)

Each \(D^i_{_{\!\mathcal N}}\) is the set difference of the layers \(\,2 \!\cdot \! i\,\) and \(\,2 \!\cdot \! i + 1.\) Since each subsequent layer is nested in any previous layer, all \(D^i_{_{\!\mathcal N}}\) are disjoint. The segments of s are \(\bigcup _i D^i_{_{\!\mathcal N}}\) and the complement \(I \backslash \bigcup _i D^i_{_{\!\mathcal N}}\); see Fig. 6.

Fig. 6

Specification of a segmentation that is d-consistent with the subset \(\,\mathcal{N} \! = \! \{N_4, N_7, N_{10}, N_{14}\}\,\) consisting of nodes from the hierarchy described in Fig. 1. The segments are \(\bigcup _i D^i_{_{\!\mathcal N}}\) and the complement \(I \backslash \bigcup _i D^i_{_{\!\mathcal N}}\); this example corresponds to the index \(i_{_\mathcal{N}}^{max} \! = 2.\,\) The layers are as follows: a \(\mathcal{K}^{^{_0}}_{_{\!\mathcal N}} \! = \! \{N_4, N_{14}\}\,\) b \(\mathcal{K}^{^{_1}}_{_{\!\mathcal N}} \! = \! \{N_{10}\}\,\) c \(\mathcal{K}^{^{_2}}_{_{\!\mathcal N}} \! = \! \{N_7\}.\,\) The set differences between subsequent layers are d \(D^{^{_0}}_{_{\!\mathcal N}} \! = \! \mathcal{K}^{^{_0}}_{_{\!\mathcal N}} \! \backslash \mathcal{K}^{^{_1}}_{_{\!\mathcal N}}\,\) e \(D^{^{_1}}_{_{\!\mathcal N}} \! = \! \mathcal{K}^{^{_2}}_{_{\!\mathcal N}} \! \backslash \varnothing .\,\) f The final segmentation, specified by \(D^{^{_0}}_{_{\!\mathcal N}} \!\cup D^{^{_1}}_{_{\!\mathcal N}}\); see Sect. 6.4.1

6.4.2 Calculation of Segmentation Dimensions for Various Consistencies

To calculate the segmentation dimensions \((b_s,f_s)\) efficiently, we use a tree data structure to represent the hierarchy. For each node N of the tree, we store the area of the node inside the ground truth's parts \((b^{^{_N}} \!\! = \!\! |N \cap \mathcal{B}_{_\textrm{GT}} |\,, \,f^{^{_N}} \!\! = \!\! |N \cap \mathcal{F}_{_\textrm{GT}}|).\) Similarly to the segmentation dimensions specified above, we refer to these values as node dimensions. Note that the dimensions of a union of disjoint nodes are equal to the sum of the dimensions of all nodes from the union, and the dimensions of a set difference of two nested nodes are equal to the difference of their dimensions. Given a segmentation \(s=(\mathcal{X}_s\,,\,I \backslash \mathcal{X}_{s}) \!\in \! \mathcal{S}^{^{_2}},\) the calculation of its dimensions \((b_s,f_s)\) (which are the dimensions of the segment \(\mathcal{X}_s\)) from the dimensions of the nodes of a subset \(\mathcal{N}_s\) depends on the required consistency of s with \(\mathcal{N}_s\,\):

b-consistency: \((\mathcal{X}_s \!=\! \mathcal{F}_s).\) \((b_s,f_s)\) are calculated as the sum of the dimensions of the nodes assigned to \(\,\mathcal{F}_s.\)

c-consistency: \((b_s,f_s)\) are calculated as the sum of the dimensions of the nodes of \(\mathcal{N}_s.\)

d-consistency: By the observations above about the sums and differences of dimensions and Equation (7), the dimensions \((b_s,f_s)\) are calculated as the sum of the dimensions of all nodes from \(\mathcal{N}_s\,,\) each multiplied by an appropriate sign: \(-1\) to the power of \(\,i^{^{_N}}_{_{\mathcal{N}_s}}\!.\) More formally, we can write \((b_s,f_s)\) as the expression below. Note that for the b/c consistencies, \(\mathcal{N}_s\) consists of a single layer: \(i^{^{_N}}_{_{\mathcal{N}_s}} \!=\! 0\) for all \(N \!\in \! \mathcal{N}_s\). Therefore, this expression is valid for all (b/c/d) consistencies.

A unified expression of segmentation dimensions:

$$\begin{aligned} b_s = \sum \limits _{\begin{array}{c} N \in \, \mathcal{N}_s:\\ \mathcal{N}_s \,\text {specifies}\,\,\mathcal{X}_s \end{array}} \! b^{^{_N}} \! \cdot (-1)^{^{_{i^{^N}_{\mathcal{N}_s}}}} \end{aligned}$$
(8a)
$$\begin{aligned} f_s = \sum \limits _{\begin{array}{c} N \in \, \mathcal{N}_s:\\ \mathcal{N}_s \,\text {specifies}\,\,\mathcal{X}_s \end{array}} \! f^{^{_N}} \! \cdot (-1)^{^{_{i^{^N}_{\mathcal{N}_s}}}} \end{aligned}$$
(8b)

Remark 4

Since for a segmentation \(s \!\in \! \mathcal{S}^{^{_2}}\) the subset \(\,\mathcal{N}_s\) is not necessarily unique, we could ask whether the expression (8) is well-defined, i.e., whether we get the same area (\(b_s \! = \! |\mathcal{X}_s \cap \mathcal{B}_{_\textrm{GT}}| \,, f_s \! = \! |\mathcal{X}_s \cap \mathcal{F}_{_\textrm{GT}}|\)) for different subsets \(\mathcal{N}_s.\) The answer to this question is positive, due to the properties of node dimensions for the union of disjoint nodes and for the set difference of nested nodes.
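
A sketch of expression (8) in code: the layer index of a node is computed by counting the nodes of \(\mathcal{N}_s\) in which it is strictly nested, and the node dimensions (`n.b`, `n.f`) and leaf sets (`n.leaves`) are assumed precomputed.

```python
# A sketch of expression (8): signed sums of node dimensions.
# Each node carries its dimensions n.b, n.f and its leaf set n.leaves
# (assumed precomputed); the layer index of n is the number of nodes
# of N_s in which n is strictly nested.

def is_nested(n, a):
    return n.leaves < a.leaves             # strict nesting via leaf sets

def layer_index(n, nodes):
    return sum(1 for a in nodes if a is not n and is_nested(n, a))

def segment_dimensions(nodes):
    """nodes: the subset N_s specifying the segment X_s."""
    sign = {n: (-1) ** layer_index(n, nodes) for n in nodes}
    b = sum(n.b * sign[n] for n in nodes)
    f = sum(n.f * sign[n] for n in nodes)
    return b, f
```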

6.4.3 Auxiliary Measures Additivity

A particularly useful property of the auxiliary measures is their additivity. Consider some attribute defined on every node in the tree. If the attribute of each non-leaf node is the sum of the attributes of the node’s children, then we say that this attribute is additive.

For a specific projection \(P_{_{\!\omega }}\), the two dimensions of a node may be merged into one attribute, \( A^{\!^{P_{_{\!\omega }}}}\!(N) = f^{^{_N}} \! - \omega \cdot b^{^{_N}}\). By inserting (8a) and (8b) into \(P_{_{\!\omega }}(s) = f_s \, - \, \omega \cdot b_s \) (5), we get a closed-form, simplified linear expression for the auxiliary measure of the segmentation s. We may refer to this measure, alternatively, as the benefit of the corresponding node set \(\mathcal{N}_s\):

$$\begin{aligned} P_{_{\!\omega }}(s) = B[\mathcal{N}_s] = \sum \limits _{\begin{array}{c} N \in \, \mathcal{N}_s:\\ \mathcal{N}_s \,\text {specifies}\,\,\mathcal{X}_s \end{array}} \! A^{\!^{P_{_{\!\omega }}}}\!(N) \cdot (-1)^{^{_{i^{^N}_{\mathcal{N}_s}}}} \end{aligned}$$
(9)

Note that each non-leaf node N is the union of its disjoint children; hence, the dimensions \((b^{^{_N}}, f^{^{_N}})\) are the sum of the dimensions of the children of N,  which implies the additivity for \(A^{\!^{P_{_{\!\omega }}}}\!(N).\) The additivity property holds for all projections. For simplification, we refer to the attribute of N as A(N).

The auxiliary algorithms search for the subset of nodes maximizing the benefit (9). These optimization tasks are performed under the constraint: \(|\mathcal{N}_s| \,\leqslant k\).

While (9) provides a general expression for all consistencies, in practice we use the following consistency-dependent expressions, which are equivalent and more explicit.

Property 3

(Equivalent benefit expressions)

  b-consistency:

    \(\mathcal{N}\) is a partition of \(I\) and \(B[\mathcal{N}] = \sum \limits _{\begin{array}{c} N \in \,\mathcal{N}:\\ A(N)\,>\,0 \end{array}} A(N)\)

  c-consistency:

    \(\mathcal{N}\) consists of disjoint nodes and \(B[\mathcal{N}] = \sum \limits _{N \in \,\mathcal N} A(N) \)

  d-consistency:

    \( B[\mathcal{N}] = \sum \limits _{N \in \,\mathcal N} A(N) \cdot (-1)^{^{_{i^{^N}_{_\mathcal{N}}}}}\)

Here and below, we prefer to use the more general \(\mathcal{N}\) (over \(\mathcal{N}_s\)), when the discussion applies to general sets of nodes from the tree.
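
For concreteness, the three equivalent benefit expressions of Property 3 in code form (a sketch; `A` maps a node to its additive attribute, and `layer_index` is the helper from the previous sketch):

```python
# The three equivalent benefit expressions of Property 3 (a sketch;
# A maps a node to its additive attribute; layer_index is the helper
# from the previous sketch).

def benefit_b(nodes, A):        # N is a cut: only positive attributes count
    return sum(A(n) for n in nodes if A(n) > 0)

def benefit_c(nodes, A):        # N consists of disjoint nodes
    return sum(A(n) for n in nodes)

def benefit_d(nodes, A):        # general N: signs alternate with the layer
    return sum(A(n) * (-1) ** layer_index(n, nodes) for n in nodes)
```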

The proposed auxiliary algorithms (described below) are not restricted to the auxiliary measures discussed above; they would work for any additive measure \(\mathcal Q.\) The additivity is crucial, because otherwise the score \(\mathcal{Q}(s)\) is ill-defined, i.e., it may result in different score values for different subsets \(\mathcal{N}_s\) specifying the same \(s \!\in \! \mathcal{S}^{^{_2}}\!.\)

6.4.4 Using the Tree Structure for Maximizing the Auxiliary Measures

The maximization of the benefit (property 3) results in a subset of nodes subject to the consistency constraints, with the maximal benefit in \(\,\mathcal T.\) The key observation in this maximization is that a subset with the maximal benefit in a subtree \(\mathcal{T}^N\!\) can be obtained from subsets with the maximal benefit in the subtrees of children of N. That is, we can use the recursive structure of the tree \(\mathcal T\) to maximize the benefit.

Let \(\,\mathcal N' \!\subset \! \mathcal N \!\subset \! \mathcal T.\) We say that \(\,\mathcal N'\) is best if it has the highest benefit relative to every other subset of \(\mathcal N\) with the same number of nodes. Depending on the context, \(\,\mathcal N'\) should also have the properties associated with the consistency, i.e., being a partition (for b-consistency) or belonging to a single layer (c-consistency). Interestingly, we also need the notion of worst subsets. \(\,\mathcal N'\) is worst if it has the minimal benefit relative to other subsets of \(\,\mathcal N\) of the same size.

Remark 5

Note that within the same consistency type, there can be several best/worst subsets in \(\,\mathcal N,\) having the same benefit but not necessarily of the same size.

Thus, a subset \(\mathcal N\) maximizes the benefit (property 3), if and only if \(\mathcal N\) is a best subset in \(\mathcal T.\) Below, by referring to \(\mathcal N\) as best without specifying in which subset of \(\mathcal T\) the subset \(\mathcal N\) is best, we mean that \(\mathcal N\) is best in the entire \(\mathcal T.\)

The following claim readily follows from the additivity properties of the dimensions (Sect. 6.4.3).

Lemma 4

  (a)

    Let \(\mathcal{N}_1\!\) and \(\mathcal{N}_2\) be subsets of nodes, such that \(\bigcup \mathcal{N}_1\) and \(\bigcup \mathcal{N}_2\) are disjoint; then: \(B[\mathcal{N}_1 \cup \mathcal{N}_2] \!=\! B[\mathcal{N}_1] + B[\mathcal{N}_2]\)

  (b)

    Let N be a node and \(\,\mathcal N\) be a subset of nodes, such that \(\bigcup \mathcal N \!\subset \! N;\) then: \(B[{\{N\} \cup \mathcal N}] \!=\! A(N) - B[\mathcal N]\)

Lemma 4(b) applies only to the d-consistency, in the case where \(\bigcup \mathcal N\) and N are not disjoint. The set of nodes \({\{N\} \cup \mathcal N}\) corresponds to a segment that is the set difference between N and the segment specified by \(\mathcal{N}\), which leads to the claim on the benefit.

The children of a non-leaf node are disjoint and nested in the node, which implies the following claim.

Lemma 5

Let \(N \!\in \! \mathcal T\) be a non-leaf node: \(N \!=\! N_r \cup N_l\,,\) where \(N_r\!\) (right), \(N_l\!\) (left) are its children. Let \(\,\mathcal{N}^N\!\) be a subset of \(\,\mathcal{T}^N\!\) and let \(\mathcal{N}^{N_r}\!,\,\mathcal{N}^{N_l}\!\) be (possibly empty) subsets of \(\,\mathcal{N}^N\!\) from \(\,\mathcal{T}^{N_r}\!\) and \(\mathcal{T}^{N_l}\!\) respectively. Then:

  (a)

    If \(\,N \notin \mathcal{N}^N\) then: \(\quad \mathcal{N}^N\) is best/worst in \(\,\mathcal{T}^N \quad \Rightarrow \quad \mathcal{N}^{N_r}\!\), \(\mathcal{N}^{N_l}\) are best/worst in \(\,\mathcal{T}^{N_r}\!\), \(\mathcal{T}^{N_l}\)

  (b)

    If \(\,N \in \mathcal{N}^N\) then: \(\quad \mathcal{N}^N\) is best/worst in \(\,\mathcal{T}^N \quad \Rightarrow \quad \mathcal{N}^{N_r}\!\), \(\mathcal{N}^{N_l}\) are worst/best in \(\,\mathcal{T}^{N_r}\!\), \(\mathcal{T}^{N_l}\)

Proof

Assume the opposite about any \(\,\mathcal{N}^{N_r},\,\mathcal{N}^{N_l}.\) Lemma 4 implies that \(\,\mathcal{N}^N\!\) can be improved/worsened, which contradicts \(\,\mathcal{N}^N\!\) being best/worst. \(\square \)

Lemma 5 specifies a necessary condition for a best/worst subset in \(\,\mathcal{T}^N\!.\) With its help, the search for the best subsets in \(\,\mathcal{T}^N\!\) can be significantly reduced, making this search feasible. Namely, for finding a best subset in \(\,\mathcal{T}^N\!,\) it is enough to examine only those subsets that are best/worst in the subtrees of the children of \(\,N.\) The following (trivial) sufficient condition for a best/worst subset in \(\,\mathcal{T}^N\!\) examines all possible candidates.

Lemma 6

Let \(N \!\in \! \mathcal T\) be a non-leaf node. The subset \(\mathcal{N} \!\subset \! \mathcal{T}^N\!\) having the largest/smallest benefit from the following is a best/worst subset in \(\mathcal{T}^N\!:\)

  (a)

    The union of best/worst subsets in \(\,\mathcal{T}^{N_r}\!\) and in \(\mathcal{T}^{N_l}\!,\) having the maximal/minimal benefit among all such unions of size \(|\mathcal{N}|.\)

  (b)

    N itself and the union of worst/best subsets in \(\,\mathcal{T}^{N_r}\!\) and in \(\mathcal{T}^{N_l}\!,\) having the minimal/maximal benefit among all such unions of size \(|\mathcal{N}| \,-\, 1.\)

We can now describe the auxiliary algorithms. From a high-level point of view, they work as follows: at the outset of the run, each auxiliary algorithm specifies each leaf of \(\,\mathcal T\) as both the best and the worst subset (of size 1) in the trivial subtree of the leaf. Then, each auxiliary algorithm visits all non-leaf nodes of \(\,\mathcal T\) once, in a post-order tree traversal, which guarantees visiting every node after visiting its children. When visiting a node N, each algorithm finds the best/worst subsets in \(\mathcal{T}^N\!\) using Lemma 6.

6.5 The Auxiliary Algorithms

6.5.1 Preliminaries

Generally, the algorithm works as follows: starting from the hierarchy leaves, it calculates the maximal auxiliary quality measure for every node and for every budget (up to k) in its subtree. When reaching the root, the decision about the particular nodes used for the optimal auxiliary measure is already encoded in the hierarchy nodes and is then explicitly extracted. Like [18], it is a dynamic programming algorithm.

The following variables and notations are used within the algorithms:

  1.

    \(N_1\), \(N_2\), \(N_3, \dots , N_{|\mathcal{{T}}|}\) are the nodes of \(\mathcal T\), ordered in a post-order tree traversal.

  2.

    \(N_l\) (left) and \(N_r\) (right) are the children of a non-leaf node N.

  3.

    A(N) is an additive attribute of a node N. Recall that \(A(N) = A(N_r) + A(N_l)\).

  4.

    \(k \! \in \! \mathbb {N}\) is a constraint specifying the maximal size of the best subset \( (\,1 \!\leqslant \! k \!\leqslant \! |\mathcal L|\,)\).

  5.

    \(t(N) = \min (\,k,\,|\mathcal{{L}}^N|\,)\,\) is the maximal allowed size of a best/worst subset in \(\,\mathcal{T}^N\!,\) which is limited by k or by the number of leaves in \(\,\mathcal{T}^N\!.\)

  6.

    r is the number of nodes in the node subset associated with the right child of N. It depends on N and is optimized by the algorithms. The range of r values is denoted \((\, r_{min} \,, \, r_{max} \,)\).

  7.

    \(\mathcal{H}_+^N\!(i)\,/\,\mathcal{H}_-^N\!(i) \,\, i \!=\! 1, \dots , t(N)\,\) are best/worst subsets of size \(\,i\,\) in \(\mathcal{T}^N\!.\) The best subset \(\mathcal{H}_+^{Root}[k]\), denoted \(\mathcal H,\) is the output of the auxiliary algorithm, maximizing the benefit; see however remark 7. These subsets are used to describe the algorithm, but are not variables of the algorithm.

  8. 8.

    \(B_+^N[i]\,/\,B_-^N[i] \quad i \!=\! 1 \dots t(N)\,\) are vector variables stored in node N,  holding the benefits of \(\mathcal{H}_+^N\!(i)\,/\,\mathcal{H}_-^N\!(i).\)

  9. 9.

    \(R_+^N[i]\,/\,R_-^N[i] \quad i \!=\! 1 \dots t(N)\,\) are vector variables stored in node N,  holding the number of those nodes in \(\mathcal{H}_+^N\!(i)\,/\,\mathcal{H}_-^N\!(i),\) which belong to \(\,\mathcal{T}^{N_r}\) (the subtree of \(N_r\)). The number of nodes in \(\,\mathcal{T}^{N_l}\) follows.

  10. 10.

    \(Q\,\) is queue data structure, used to obtain \(\mathcal H\) from the vectors \(R_+^N\,/\,R_-^N\).

To find the best subset consisting of a single layer (as is the case for b/c consistency), we need to examine only the corresponding best subsets and disregard the worst subsets. In this case, we simplify the notation and use \(B^N,R^N,\mathcal{H}^N\!\) instead of \(B_+^N,R_+^N,\mathcal{H}_+^N.\)
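For concreteness, the sketches accompanying the algorithm descriptions below use the following minimal Python node record and post-order helper. The names (Node, benefit, postorder) are ours and purely illustrative; in particular, benefit is a hypothetical stand-in for the \(\omega \)-dependent auxiliary measure (9), whose exact form is not repeated here.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)            # eq=False keeps nodes hashable by identity
class Node:
    b: int = 0                  # node dimensions: background count b^N
    f: int = 0                  # and foreground count f^N
    num_leaves: int = 1         # |L^N|, the number of leaves in T^N
    left: Optional["Node"] = None    # N_l
    right: Optional["Node"] = None   # N_r
    B: dict = field(default_factory=dict)  # benefit vector B^N[i]
    R: dict = field(default_factory=dict)  # right-child counts R^N[i]

def benefit(node: Node, omega: float) -> float:
    # Hypothetical stand-in for the omega-dependent benefit of one node;
    # the exact form of the auxiliary measure (9) should be substituted.
    return node.f - omega * node.b

def postorder(root: Node):
    # Yield N_1, ..., N_|T|: every node appears after both of its children.
    stack, order = [root], []
    while stack:
        n = stack.pop()
        order.append(n)
        if n.left is not None:
            stack.extend([n.left, n.right])
    return reversed(order)
```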

Remark 6

Different optimal subsets for different k are associated with different values of the \(\omega \) parameter. Therefore, the subsets \(\{\mathcal{H}^{Root}(i); \ i < k\}\), obtained as a by-product of computing the best subset \(\mathcal{H}^{Root}(k)\), are, in general, not the optimal subsets for those smaller values of k.

6.5.2 b-consistency

The auxiliary algorithm for b-consistency is formally given in Algorithm 2.

A best subset for b-consistency (Def. 3), \(\mathcal{H}^N\!(i)\), must be a \(\mathcal T\)-partition of \(N\). Hence, \(N \!\in \! \mathcal{H}^N\!(i)\) if and only if \(i \!=\! 1.\) Thus, \(B^N[1]\) is the benefit of the node N itself. To calculate \(B^N[i]\,\) for \(\,i \!>\! 1,\) we need only condition (a) of Lemma 6, which implies that \(B^N[i]\) is the maximum of \(\,B^{N_r}[r] + B^{N_l}[i-r]\) over all possible values of \(\,r.\) This part is carried out in lines 1–5 of Algorithm 2.

The best subset \(\mathcal{H} = \mathcal{H}^{Root}[k]\) and its subset of nodes with a positive attribute, denoted \(\mathcal G\), are specified from the vectors \(R^N\!.\) The number of nodes in \(\mathcal H\) that belong to the subtree of the root's right child is \(R^{Root}[k]\) (recall that \(\,t(Root) \!=\! k\)), and their number in the subtree of the left child is \(\,k- R^{Root}[k].\) The same consideration is applied recursively to every node \(N\), stopping when \(R^{N}[i]\) equals zero. This part is carried out in lines 6–16 of Algorithm 2.

[Algorithm 2: pseudocode]

Notes:

1. The range \((\, r_{min} \,, \, r_{max} \,)\) is calculated as follows: \(i\) is the number of nodes in the subset associated with \(N\). The number of nodes, \(r\), associated with the right child must satisfy \(1 \leqslant r \leqslant t(N_r)\). (Note that the lower limit is 1 and not 0, because for b-consistency \(\mathcal{N}_s\) is a cut.) The number of nodes \(i-r\) associated with the left child must satisfy \(1 \leqslant i-r \leqslant t(N_l)\), which implies that \(r\) must satisfy \(i-t(N_l) \leqslant r \leqslant i-1\). Therefore, \(r\) lies in the range \((\, r_{min} \,, \, r_{max} \,) = (\, \max ( \, 1 \,, \, i-t(N_l) \,) \,, \, \min (\, t(N_r) \,, \, i-1 \,) \,)\).

2. \(\mathcal H\) is a cut of \(\mathcal T\) (Sect. 2.1), which implies that its deepest node is no deeper than \(|\mathcal{H}| - 1\). Hence, Algorithm 2 can be accelerated by processing only those nodes whose depth is less than k.
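A minimal Python sketch of this procedure follows (illustrative only; Algorithm 2 remains the formal reference). It reuses the Node, benefit, and postorder helpers introduced in Sect. 6.5.1: the bottom-up fill of \(B^N\) and \(R^N\) follows condition (a) of Lemma 6 with the range of Note 1, and the queue-based extraction follows the recursion on \(R^N\).

```python
from collections import deque

def b_consistency_best_subset(root: Node, k: int, omega: float):
    # Bottom-up fill (cf. lines 1-5 of Algorithm 2).
    for N in postorder(root):
        N.B, N.R = {1: benefit(N, omega)}, {1: 0}   # i = 1: the cut {N}
        if N.left is None:
            continue
        t_r, t_l = len(N.right.B), len(N.left.B)    # t(N_r), t(N_l)
        for i in range(2, min(k, N.num_leaves) + 1):
            r_min, r_max = max(1, i - t_l), min(t_r, i - 1)   # Note 1
            r_best = max(range(r_min, r_max + 1),
                         key=lambda r: N.right.B[r] + N.left.B[i - r])
            N.B[i] = N.right.B[r_best] + N.left.B[i - r_best]
            N.R[i] = r_best
    # Queue-based extraction of H (cf. lines 6-16 of Algorithm 2).
    H, Q = [], deque([(root, min(k, root.num_leaves))])
    while Q:
        N, i = Q.popleft()
        if N.R[i] == 0:                 # stop: N itself belongs to the cut
            H.append(N)
        else:
            Q.append((N.right, N.R[i]))
            Q.append((N.left, i - N.R[i]))
    return H                            # a cut of T with at most k nodes
```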

6.5.3 c-consistency

A similar auxiliary algorithm for c-consistency is given in Algorithm 3.

Note that a c-best subset \(\mathcal{H}^N\!(i)\) consists of disjoint nodes, but their union is not necessarily \(N\). For example, for \(\mathcal{H}^N\!(1),\) there are three possibilities: the best node from \(\mathcal{T}^{N_r}\!,\) the best node from \(\mathcal{T}^{N_l}\!,\) and \(N\) itself, which are marked by the ad-hoc values 1, 0, and \(-1\), respectively, in \(R^N[1].\)

[Algorithm 3: pseudocode]

Notes:

1. The calculation of \((\, r_{min} \,, \, r_{max} \,)\) is as above, except that the ranges of both children start from 0.

2. For coding convenience, we add a cell \(B^N[0],\) which always takes the value 0.

3. The algorithm should preferably select only nodes with a positive attribute. If the number of nodes with a positive attribute (in one layer) is smaller than k, then nodes with a non-positive attribute are selected as well. In this case, however, there is a subset with fewer nodes and a larger benefit, which can be specified from \((B^{Root}, \, R^{Root});\) see Remark 7.
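A matching sketch of the benefit computation for c-consistency follows (illustrative only; the backtracking vectors \(R^N\), with the ad-hoc \(1/0/-1\) marks of \(R^N[1]\), are omitted for brevity).

```python
def c_consistency_benefits(root: Node, k: int, omega: float):
    for N in postorder(root):
        if N.left is None:
            N.B = {0: 0, 1: benefit(N, omega)}      # B^N[0] = 0 (Note 2)
            continue
        t_r, t_l = len(N.right.B) - 1, len(N.left.B) - 1
        # i = 1: the best single node is N itself or comes from one child
        N.B = {0: 0,
               1: max(benefit(N, omega), N.right.B[1], N.left.B[1])}
        for i in range(2, min(k, N.num_leaves) + 1):
            # as in Algorithm 2, but both child ranges start from 0 (Note 1)
            r_min, r_max = max(0, i - t_l), min(t_r, i)
            N.B[i] = max(N.right.B[r] + N.left.B[i - r]
                         for r in range(r_min, r_max + 1))
    return root.B       # root.B[i]: best benefit of i disjoint nodes
```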

6.5.4 d-consistency

The auxiliary algorithm for d-consistency is formally given in Algorithm 4.

Unlike the other consistencies, a d-best subset may contain nested nodes, and this requires additional variables. Here, in contrast to the algorithms for the other consistencies, we use all four vectors \(B_+^N\), \(B_-^N\), \(R_+^N\), and \(R_-^N\), including the additional cells \(B_+^N[0],\) \(B_-^N[0],\) \(R_+^N[0]\), and \(R_-^N[0]\), which always take the value 0. By Lemma 6, and using the notation introduced in Sect. 6.5.1, \(\mathcal{H}_+^N\!(i)\) is the subset having the maximal benefit from \(C_i^{1+} \!\!=\! \mathcal{H}_+^{N_r}\!(r) \cup \mathcal{H}_+^{N_l}\!(i\!-\!r)\) and \(C_i^{2+} \!\!=\! \{N\} \cup \mathcal{H}_-^{N_r}\!(r) \cup \mathcal{H}_-^{N_l}\!(i\!-\!1\!-\!r),\) over all possible values of \(\,r.\)

To obtain \(\mathcal{H}_-^N\!(i),\) the subset having the minimal benefit, we use similar expressions; see Algorithm 4. It may happen that the calculation of \(B_+^N[i]\) using \(B_-^N[i-1]\) counts \(A(N)\) twice. To avoid this problem, we calculate \(B_+^N[i],B_-^N[i]\) in two passes over all values of \(\,i\), the second pass being in decreasing order of \(\,i\) (see lines 6–18 in Algorithm 4).
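In display form, and under our reading that nesting a family one level deeper flips the sign of its contribution, the maximization performed for \(B_+^N[i]\) can be written as

\[ B_+^N[i] \;=\; \max \Bigl(\; \max_{r} \bigl( B_+^{N_r}[r] + B_+^{N_l}[i-r] \bigr), \;\; A(N) \,-\, \min_{r} \bigl( B_-^{N_r}[r] + B_-^{N_l}[i-1-r] \bigr) \Bigr), \]

where the second candidate corresponds to \(C_i^{2+}\) and the inner minimization realizes the worst subsets of the children.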

[Algorithm 4: pseudocode]

The d-best subset \(\mathcal H\) is specified from \(R^N\) as before. However, since both a node \(N\) and nodes nested in it may be included in \(\mathcal H,\) we add indicators \(Belong_+^N[i]\,/\,Belong_-^N[i]\), \(i \!=\! 1, \dots , t(N)\) (Boolean vector variables stored in node \(N\)), indicating whether \(N\) belongs to \(\mathcal{H}_+^N\!(i)\,/\,\mathcal{H}_-^N\!(i).\) In addition, for every node \(N \!\in \! \mathcal H,\) the index \(i^{^{_N}}_{_\mathcal{H}}\) (Sect. 6.4.1) is calculated, so Algorithm 4 returns \(\,\widetilde{\mathcal{H}},\) a set of pairs \((N,\, i^{^{_N}}_{_\mathcal{H}}).\)

Remark 7

\(\mathcal H\) is not necessarily of minimal size, and in extreme cases, when the number of nodes with a positive attribute is too small, it does not provide the best benefit (see Remark 5 and Note 3 in Sect. 6.5.3). The best subset with the best benefit and minimal size is always associated with the maximal value in \(B^{Root}\!.\) It can be specified by starting the queue process from \(Q.Enqueue(\, Root \,, \, k' \,)\) (\(\,k'\) replaces \(t(Root)\,\)), where \(k'\) is the minimal index such that \(B^{Root}[k']\) is maximal in \(B^{Root}\!.\)
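In terms of the sketches above, \(k'\) can be read off the root's benefit vector as follows (a short illustration only).

```python
t_root = max(root.B)                 # t(Root): the largest filled index
best = max(root.B[i] for i in range(1, t_root + 1))
k_prime = min(i for i in range(1, t_root + 1) if root.B[i] == best)
# the queue process is then started from Q.Enqueue(Root, k_prime)
```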

6.5.5 Time Complexity

For our auxiliary algorithms, the vector variable size is bounded by k. The vector variables of a node N may be calculated in \(O(\min (k,|\mathcal{L}^{N_r}|) \!\cdot \! \min (k,|\mathcal{L}^{N_l}|))\) time. For the common case where \(k \ll |\mathcal{L}|\), this amounts to \(O(k^2)\) and is independent of the tree size. The algorithm linearly scans all the nodes and requires \(O(|\mathcal{L}| \cdot (\min (k, \log |\mathcal{L}|))^2\,)\) time, which includes the time required to extract the best subset from the node vectors.

The full algorithm starts by calculating the node dimensions \((b^{^{_N}},f^{^{_N}}).\) First, these dimensions are calculated for the leaves of \(\,\mathcal T\) in O(|I|) time, and then propagated to the rest of the nodes in linear time. Overall, this calculation takes \(O(|I|+|\mathcal{T}|) = O(|I|+|\mathcal{L}|)\) time.

Thus, the total time complexity is \(O(|I| \, + \, n \cdot |\mathcal{L}| \cdot (\min (k, \log |\mathcal{L}|))^2\,)\), where n is the number of iterations made by Scheme 1. The straightforward (and least tight) upper bound on \(\,n\,\) is the number of segmentations \(s \!\in \! \mathcal S\) with different scores \(\mathcal{M}(s)\) (the measure maximized in Scheme 1), since \(\mathcal{M}(s)\) strictly increases from iteration to iteration (Sect. 5). In practice, however, we found that only a few iterations (no more than five) are required.

6.5.6 The Best Segmentation Specified by a Subset of Unlimited Size

Sometimes, we are interested in a segmentation \(s \!\in \! \mathcal S\) achieving the best score \(\mathcal{M}(s),\) regardless of the size of the subset \(\mathcal{N}_s.\) In this case, the auxiliary algorithm becomes linear and significantly simpler. Lemma 2 implies that, in this case, optimizing \(\,\mathcal M\) yields an \(s \!\in \! \mathcal S\) with the same score \(\mathcal{M}(s)\) for each of the consistency types b/c/d. By simply discarding the parts that track the node subset size, the b-consistency algorithm can be simplified to be particularly efficient; Algorithm 5 provides the full description.

In every node \(N\), we store only the maximal benefit over all b-best subsets in \(\mathcal{T}^N\!,\) regardless of their sizes. That is, we need only a scalar variable \(p^N\!,\) storing the maximal value of the vector \(B^N\) of Algorithm 2. After the values \(p^N\!\) are calculated for all nodes, the b-best subset \(\mathcal H\) is found as the optimal cut of \(\,\mathcal T.\) In this case, \(\mathcal H\) has minimal size (see Remark 7), i.e., there is no smaller b-best subset in \(\mathcal T\) with the same benefit.

Each node is processed in \(\,O(1)\) time; hence, the time complexity of this algorithm is \(O(|\mathcal{T}|) = O(|\mathcal{L}|)\). Note that Algorithm 5 returns two subsets: \(\mathcal H\) and \(\mathcal G \!\subset \! \mathcal H\) (the nodes with a positive attribute).

[Algorithm 5: pseudocode]
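A minimal Python sketch of this simplified variant follows (illustrative only; Algorithm 5 remains the formal reference). Ties are resolved in favor of the node itself, so the returned cut has minimal size; \(\mathcal G\) is obtained by keeping the nodes with a positive attribute, for which our stand-in benefit is used.

```python
def unconstrained_b_best(root: Node, omega: float):
    # bottom-up: p^N is the maximal benefit over all cuts of T^N
    p = {}
    for N in postorder(root):
        p[N] = benefit(N, omega)
        if N.left is not None:
            p[N] = max(p[N], p[N.right] + p[N.left])
    # top-down: keep N wherever its own benefit realizes p^N
    H, stack = [], [root]
    while stack:
        N = stack.pop()
        if N.left is None or p[N] == benefit(N, omega):
            H.append(N)                   # N itself realizes p^N
        else:
            stack.extend([N.left, N.right])
    G = [N for N in H if benefit(N, omega) > 0]
    return H, G
```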

6.5.7 Auxiliary Algorithms’ Correctness

Theorem 3

The auxiliary algorithms optimize the auxiliary measure (9) subject to the corresponding consistency, and the constraint on the maximal number of nodes in \(\mathcal{N}_s\).

As each of the auxiliary algorithms recursively applies Lemma 6, the proof readily follows by induction on \(Height(\mathcal T).\)

6.5.8 A Note on the Implementation

Each of the auxiliary algorithms calculates the benefit of node subsets by performing arithmetic operations on the natural numbers \(b^{^{_N}}\!,f^{^{_N}}\!\) and the real number \(\omega \). To avoid accumulating numerical error, we use integer arithmetic: we represent the benefit by two natural numbers, each a linear combination of the \(\,b^{^{_N}}\!\) and \(f^{^{_N}}\!\) values with \(\,\pm 1\) coefficients. To compare the benefits of different subsets, we then need only a single operation involving \(\omega \).
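The comparison can be illustrated as follows (the \((u, v)\) decomposition reflects our reading of the representation; Fraction stands in for an exactly represented \(\omega \)).

```python
from fractions import Fraction

def benefit_greater(u1: int, v1: int, u2: int, v2: int,
                    omega: Fraction) -> bool:
    # Benefits are stored as integer pairs (u, v), read as u - omega * v,
    # where u and v are +/-1 integer combinations of the b^N, f^N counts.
    # Comparing u1 - omega*v1 > u2 - omega*v2 reduces to a single operation
    # involving omega; all accumulation stays in exact integer arithmetic.
    return u1 - u2 > omega * (v1 - v2)
```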

7 Experiments

The contribution of this paper is mostly theoretical: it provides, for the first time, effective algorithms for bounding the Jaccard index quality obtainable from a given hierarchy. These bounds, which depend on the hierarchy, the consistency, and the number of nodes, are illustrated experimentally below.

To the best of our knowledge, optimizing the Jaccard index in this setting has not been considered before, which prevents us from comparing our empirical results with prior work.

Fig. 7 Hierarchy-consistent optimal segmentations for the HED hierarchy: (a) the original image (a cat image from the Weizmann database), (b) the ground truth, (c) the saliency map of the HED hierarchy. The segmentations are calculated for the three consistencies (b, c, and d) and for several numbers of nodes. Note that for a low number of nodes (e.g., 5), the b-consistent segmentation (g) is of lower quality than the other segmentations (h, i), and that the c-consistent segmentation (h) is slightly worse than the d-consistent one (i). The differences decrease as the number of nodes increases

Fig. 8 An illustration of the maximal Jaccard index obtainable for a given number of nodes. Each plot corresponds to one of the segmentation-hierarchy consistencies, and the curves show averages over all images in the Weizmann DB for the four hierarchies (filtered hierarchies are used). Using d-consistency clearly requires significantly fewer nodes for the same quality than c-consistency, which, in turn, requires fewer nodes than b-consistency. Also, as expected, the hierarchy built using the HED edge detector gives better results than the other hierarchies demonstrated here

For the experiments, we consider four BPT hierarchies. The first, denoted the geometric tree, is image independent and serves as a baseline. The other three hierarchies are created as the minimum spanning tree of a superpixel graph, whose nodes are SLIC superpixels [11]. The weights of the graph are specified by different types of gradients. More specifically, we consider the following hierarchies:

1. Geometric tree (image-independent baseline): starting from the root node (the entire image), each node is split into two equal parts (the node's children), horizontally or vertically, depending on whether the node's height or width is larger; a minimal sketch of this construction is given after this list. Note that the geometric tree is independent of the image content.

2. L2 tree: based on a traditional, low-quality, non-learned gradient, namely the L2 difference between RGB color vectors.

3. SED tree: based on the learned Structured Forests edge detector, which can be considered of medium quality [32].

4. HED tree: based on the modern, high-quality, deep-learning-based Holistically-Nested Edge Detector [33].
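The geometric-tree construction of item 1 can be sketched as follows (names and representation are ours; for odd side lengths, the two "equal" halves necessarily differ by one pixel).

```python
def geometric_tree(x: int, y: int, w: int, h: int) -> dict:
    node = {"rect": (x, y, w, h), "children": []}
    if w == 1 and h == 1:
        return node                       # leaf: a single pixel
    if h >= w:                            # split along the larger dimension
        m = h // 2                        # horizontal split
        node["children"] = [geometric_tree(x, y, w, m),
                            geometric_tree(x, y + m, w, h - m)]
    else:
        m = w // 2                        # vertical split
        node["children"] = [geometric_tree(x, y, m, h),
                            geometric_tree(x + m, y, w - m, h)]
    return node

root = geometric_tree(0, 0, 64, 64)       # tree for a 64x64 image
```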

A common issue with hierarchical image segmentation is the presence of small regions (containing few pixels) at low depths of the hierarchy (close to the root). Such small regions occur more frequently in the HED and SED trees, as their gradients generally contain thick boundaries. It is therefore common to filter the hierarchy and remove such small, unwanted regions; see, e.g., the implementation of [34] and [35, 36]. We follow this practice and use the area-based filtering algorithm of [36], as implemented in Higra [37].
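For illustration, a filtering sketch using Higra is given below. It assumes Higra's documented entry points hg.get_4_adjacency_graph, hg.bpt_canonical, and hg.filter_small_nodes_from_tree; the gradient weights are random stand-ins, and the names should be verified against the installed Higra version.

```python
import numpy as np
import higra as hg  # assumed API, per the Higra documentation [37]

h, w = 64, 64
graph = hg.get_4_adjacency_graph((h, w))
edge_weights = np.random.rand(graph.num_edges())   # stand-in for a gradient
tree, altitudes = hg.bpt_canonical(graph, edge_weights)
# area-based filtering [36]: remove nodes smaller than the size threshold
filtered_tree, filtered_altitudes = hg.filter_small_nodes_from_tree(
    tree, altitudes, 100)
```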

The leaves of the image-independent geometric tree are the image pixels, which makes this tree large (and regular). The other trees are smaller, as they use superpixels, and they also benefit from the filtering process when it is applied.

We calculated the best segmentations that match the different hierarchies and show how they depend on the particular hierarchy used and on the consistency type. First, we show several examples of such best segmentations, all for the same image, using the HED hierarchy; see Fig. 7. As expected, the segmentation quality improves with the number of nodes used and with the consistency type (\(b<c<d\)).

Figure 8 confirms this observation and shows that the average Jaccard index, over an image dataset, grows with the number of hierarchy nodes. It also shows that requiring d-consistency allows using a relatively small number of nodes to obtain a good segmentation with a high Jaccard index; c-consistency follows, and b-consistency is last. The differences between the consistencies are clearly seen in Fig. 9. It is also clear that better hierarchies, obtained with more accurate edge detectors, provide much higher segmentation quality with fewer nodes. These plots show the average Jaccard index over 100 images of the Weizmann database [38]. Every image in this database contains a single object over a background, which matches the applicability of the Jaccard index.

Fig. 9 Comparing the segmentation quality obtainable using the three segmentation-hierarchy consistencies. The comparison is carried out for each of the four types of hierarchies. Note that the performance with the c- and d-consistencies is similar for the best (HED) tree. For better visibility, a different x-axis scale is used for each hierarchy

The average Jaccard index curves are smooth. We observed, however, that for particular images the curves have a stair-like behavior, implying that the same Jaccard index is achieved for several values of k; this happens, e.g., when adding a few nodes to \(\mathcal{N}_s\) does not change the foreground specification.

Note that for b-consistency, the geometric, image-independent tree is better than, say, the L2 tree. This happens because the L2 tree contains many spurious small nodes close to the root, even after the filtering process. B-consistency chooses a set of nodes that is a cut of the tree; a cut that contains the important nodes needed to approximate the GT segment must also include some of these spurious nodes, which significantly increases the node count (k).

The best results (in terms of the lowest node count) are achieved with d-consistency. This holds for all hierarchies. For the higher-quality hierarchies, the node count needed for excellent quality is remarkably low (only four on average). We found that even if the hierarchy contains errors, such as incorrect merges and small nodes near the root, segmentations specified by d-consistency still require a small node count. To illustrate this robustness property, consider the case where one incorrect merge was made; see Fig. 10. This merge leads to a sequence of nodes that are not purely foreground or background. In this example, the b-consistent foreground segment is specified by the cut containing 6 nodes (A, B, ..., F), c-consistency requires one node less (A, B, ..., E), while d-consistency requires only 2 nodes (K and F). This robustness is significant because the hierarchy is usually constructed by an error-prone, greedy process; by using d-consistency, the harm done by this greedy process can be compensated for to some extent. Note that when using the geometric tree, the segmentation qualities obtained by the c- and d-consistencies are not very different; see Fig. 9. The merging errors made by the geometric tree are numerous and occur at all hierarchy levels; therefore, they cannot be corrected by a few set-difference operations.

The experiments are meant only to be illustrative and are not the main contribution of this paper. Several surprising findings were observed, however. First, it turns out that for approximating a segment in the Jaccard index sense, the geometric tree provides reasonable results, often as good as some of the other trees (though not as good as the modern HED tree). Note that while all the nodes in this case are image-independent rectangles, the nodes selected for the approximation are chosen using the (image-dependent) ground truth segmentation. We also found that the hierarchies based on the SED edge detector are not as good as we would have expected. This was somewhat surprising, because previous evaluations of SED show good results (F-measure = 0.75 on BSDS [32]). Overall, these results imply that greedily built hierarchies are sensitive to the gradient that is used.

Fig. 10 Consistency robustness against incorrect merges: an example of a hierarchy with several nodes in the foreground (A, ..., E) and one node in the background (F), which is incorrectly merged with E. Expressing the foreground using this hierarchy requires 6, 5, and 2 nodes under b-, c-, and d-consistency, respectively; see text

8 Conclusions

This paper considered the relation between the hierarchical representation of an image and the segmentation of this image. It proposed that a segmentation may depend on the hierarchy in four different ways, denoted consistencies. The higher-level consistencies are more robust to hierarchy errors, which allows describing a segmentation more economically, with fewer nodes, relative to the lower-level consistencies that are commonly used.

While the common a-consistency requires that every segment be a separate node in a hierarchy cut, b-consistency makes it possible to describe segments that are split between different branches of the hierarchy. The c- and d-consistencies no longer require that the segmentation be specified by a cut, and can thereby ignore unimportant small nodes. The d-consistency can even compensate for incorrect merges that occurred during the (usually greedy) construction of the hierarchy. We found, for example, that fairly complicated segments can be represented by only 3–5 nodes of the tree, using the hierarchy built with a modern edge detector (HED [33]) and d-consistency. This efficient segment representation opens the way to a new type of algorithm that analyzes segmentations and searches for the best one. Developing such algorithms seems nontrivial and is left for future work.

The number of nodes required to describe a segmentation is a measure of the quality of the hierarchy. A segmentation may be described accurately by a large number of leaves of almost any hierarchy. To describe the segmentation with only a few nodes, however, the hierarchy should contain nodes that correspond to the true segments, or at least to a large fraction of them. Thus, this approach adds to the variety of existing tools proposed for hierarchy evaluation.

Technically, most of this paper was dedicated to deriving rigorous and efficient algorithms for optimizing the Jaccard index. For this complex optimization, the co-optimality tool was introduced. We argue that, with this tool, other measures of segmentation quality, such as the boundary-based \(F_b\) measure [29] considered in [19], may be optimized more efficiently; we propose this for future work as well.