Abstract
This paper presents an approach for bottom-up hierarchical instance segmentation. We propose an end-to-end model to estimate the energies of regions in a hierarchical region tree. To this end, we introduce a Convolutional Tree-LSTM module to leverage the tree-structured network topology. To construct the hierarchical region tree, we use the accurate boundaries predicted by a pre-trained convolutional oriented boundary network. We evaluate our model on the PASCAL VOC 2012 dataset and show that it obtains a good trade-off between segmentation accuracy and the time taken to process a single image.
K. V. Manohar: work done while the author was an intern at Preferred Networks Inc., Japan.
1 Introduction
In this work we address the task of instance segmentation, which involves segmenting each individual instance of a semantic class in an image. Many top-down approaches to this problem are based on object detection pipelines [1, 2], in which each detected box is refined to generate a segmentation. These methods do not consider the entire image but rather independent proposals and, as a result, cannot handle occlusions between different objects. Because they rely on initial detections, they also cannot recover from false detections, which motivates an approach that reasons globally.
A key aspect of our approach is to leverage hierarchical segmentation trees [3] to sample potential object instances. To this end, we propose a new bottom-up approach to parse the regions in a hierarchical region tree. At the core of our approach lies a Convolutional Tree-LSTM module which estimates the energies of the regions, taking into account the entire image and tracking temporal relations across regions through different levels of the tree. Unlike MCG [4], which uses hand-engineered features to generate object candidates, we exploit rich features learnt by Convolutional Neural Networks to sample object instances. Further, MCG relies on a complex pipeline of proposal generation and ranking. The resulting system is slow, taking more than 9.9 s for candidate generation alone. Ours, on the other hand, is trained end-to-end and takes 0.06 s on average at test time.
Our paper is outlined as follows. We begin by reviewing related work in Sect. 2. In Sect. 3 we describe the details of our approach. In Sect. 4, we delve into implementation details. We investigate the performance of our method both qualitatively and quantitatively in Sect. 5. Finally, we conclude in Sect. 6.
2 Related Work
Our work is closely related to bottom-up methods exploiting superpixels [5]. Pham et al. [6] proposed a dynamic programming based approach to image segmentation by constructing a hierarchical segmentation tree. A unified energy function jointly quantifies geometric goodness-of-fit and an objectness measure. A top-down traversal through the tree, comparing the energy of the current node with that of its subtree, results in the optimal tree cut. Kirillov et al. [7] impose a graph structure on the superpixels and formulate instance estimation as a MultiCut problem. One limitation of this method, however, is that it cannot find instances that are formed by disconnected regions in the image. Unlike these methods, by training our model end-to-end we can find such instances as discussed in Sect. 6.
3 Method
Given an input image \(\mathcal {I}\), our goal is to segment the image into semantically meaningful non-overlapping regions. Figure 1 depicts the overview of our method. Henceforth, we adopt the following notation. For a given \(\mathcal {I}\), let \(\mathcal {T}\), \(L = \{1, 2, \dots , l_{max}\}\), \(\mathcal {R} = \{r_1, r_2, \dots , r_N\}\), \(\mathcal {F} = \{F_{r_{1}}, F_{r_{2}}, \dots , F_{r_{N}}\}\) and \(\mathcal {C} = \{C_{r_{1}}, C_{r_{2}}, \dots , C_{r_{N}}\}\) represent the hierarchical tree, set of distinct levels, set of regions in the tree, corresponding features for the regions and children of the regions in the tree respectively. For each level \(0 < l \le l_{max}\), we denote the set of regions, corresponding features and the threshold at this level as \(\mathcal {R}_l = \{r^l_1, r^l_2, \dots , r^l_{N_l}\} \subseteq \mathcal {R}\), \(\mathcal {F}_l = \{F_{r^l_{1}}, F_{r^l_{2}}, \dots , F_{r^l_{N_l}}\}\) and \(\alpha _l\) respectively. Tree cut at a level \(l^{'}\) for a horizontal cut-threshold \(\lambda _{cut} = \alpha _{l^{'}}\) results in a new set of levels \(L^{'} = \{l | l \ge l^{'}\}\).
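The effect of a horizontal cut on the set of levels can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `levels_after_cut`, the list layout of the thresholds, and the assumption that the thresholds \(\alpha _l\) increase with the level index are all hypothetical.

```python
from bisect import bisect_left


def levels_after_cut(alphas, lambda_cut):
    """Return (l_prime, L_prime) for a horizontal cut at lambda_cut.

    alphas[i] holds the threshold alpha_l of level l = i + 1.
    Assumption of this sketch: alphas is sorted in increasing order.
    """
    # Index of the lowest level whose threshold reaches the cut threshold.
    idx = bisect_left(alphas, lambda_cut)
    if idx == len(alphas):
        raise ValueError("cut threshold is above every level threshold")
    l_prime = idx + 1
    # L' = {l | l >= l'}
    return l_prime, list(range(l_prime, len(alphas) + 1))


# Hypothetical thresholds for a tree with five levels.
alphas = [0.1, 0.3, 0.5, 0.7, 0.9]
l_prime, kept_levels = levels_after_cut(alphas, 0.5)
```

With these hypothetical thresholds, a cut at 0.5 keeps the third level and every level above it, so a higher cut-threshold leaves fewer levels for the model to process.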
3.1 Feature Extraction
We first extract features \(\mathbf {F}\) by passing the input image \(\mathcal {I}\) through a series of convolutions. For a given region \(r \in \mathcal {R}\) in the tree, we generate the tightest bounding box \(b_r\) covering the non-linear boundary of r. We then extract a feature map \(F_{r}^{*}\) of fixed spatial size (e.g., \(7 \times 7\)) from \(\mathbf {F}\) corresponding to \(b_r\). Our approach to extracting \(F_{r}^{*}\) is similar to the ROIAlign layer [1]. Additionally, we mask out the features corresponding to the region \(b_r \setminus r\), giving rise to the final feature map \(F_r\).
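The box-and-mask step can be illustrated with a small sketch. It is a simplification under stated assumptions: the feature map has a single channel, the region mask is given at the feature-map resolution, and the ROIAlign-style resampling to a fixed size is omitted. The helper names `tight_bbox` and `masked_crop` are hypothetical.

```python
def tight_bbox(mask):
    """Tightest box (top, left, bottom, right), inclusive, around the region."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("empty region")
    return rows[0], cols[0], rows[-1], cols[-1]


def masked_crop(feature, mask):
    """Crop the feature map to b_r and zero the part of b_r outside r."""
    top, left, bottom, right = tight_bbox(mask)
    return [
        [feature[r][c] if mask[r][c] else 0.0 for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


# Toy single-channel feature map and an L-shaped region mask.
feature = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0],
           [9.0, 10.0, 11.0, 12.0]]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0]]
crop = masked_crop(feature, mask)
```

The cell inside the box but outside the region is set to zero, so features from neighbouring regions do not leak into \(F_r\).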
3.2 Convolutional Tree-LSTM Module
The motivation behind the method is to estimate how the probability distribution over the categories changes when a new region is added to the region under consideration in the subsequent levels. The model implicitly learns the temporal relations which lead to the formation of a given region.
We process the hierarchical tree \(\mathcal {T}\) starting from level \(l^{'}\), which corresponds to the initial cut-threshold \(\lambda _{cut} = \alpha _{l^{'}}\), using the Convolutional Tree-LSTM to predict softmax probabilities for each region \(r \in \mathcal {R}_l\) at all the levels \(l \in L^{'}\) in order. The inputs to the LSTM at each level l are the features \(\mathcal {F}_l\). Equations 1–7 summarize the forward propagation through the LSTM module. For the jth region at level l,
where \(*\) and \(\odot \) denote the convolution operation and the Hadamard product respectively. We do the above for each region j and \(\forall \) \(l \in L^{'}\). For a region j at level l, \(c^l_k, h^l_k\) \(\forall k \in C_{r^l_{j}}\) are initialized to zeros if the children are leaves of the tree; for all other regions, \(c^l_k, h^l_k\) are given by Eqs. 6 and 7 respectively. Figure 2 analyzes how the sequence length and the number of regions considered vary for different horizontal cuts.
On top of the LSTM module, we apply a series of convolutions and fully connected layers which take \(h^l_j\) as input and predict class probabilities.
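Equations 1–7 are not reproduced in this text. For orientation only, the standard child-sum Tree-LSTM, with its matrix products replaced by convolutions, takes the form below. This is a sketch under that assumption and may differ from the authors' exact formulation; the kernel and bias symbols \(W\), \(U\) and \(b\) and the summed child state \(\tilde{h}^l_j\) are notation introduced here.

```latex
\begin{align*}
\tilde{h}^l_j &= \sum_{k \in C_{r^l_j}} h^l_k \\
i^l_j &= \sigma\left(W_i * F_{r^l_j} + U_i * \tilde{h}^l_j + b_i\right) \\
f^l_{jk} &= \sigma\left(W_f * F_{r^l_j} + U_f * h^l_k + b_f\right) \\
o^l_j &= \sigma\left(W_o * F_{r^l_j} + U_o * \tilde{h}^l_j + b_o\right) \\
u^l_j &= \tanh\left(W_u * F_{r^l_j} + U_u * \tilde{h}^l_j + b_u\right) \\
c^l_j &= i^l_j \odot u^l_j + \sum_{k \in C_{r^l_j}} f^l_{jk} \odot c^l_k \\
h^l_j &= o^l_j \odot \tanh\left(c^l_j\right)
\end{align*}
```

In this form the last two lines update the cell and hidden states, and each child k has its own forget gate, so the model can weight the contribution of every merged sub-region separately.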
3.3 Objective Formulation
For a given image \(\mathcal {I}\), let \(\mathcal {M} = \{m_1, m_2, \dots m_M\}, L^G = \{l_1, l_2, \dots l_M\}\) be the set of ground truth masks and one-hot labels respectively. For each mask \(m_i\), we construct the positive set \(\mathcal {P}^+_i = \{p^i_1, p^i_2, \dots p^i_{N_{i}}\}\) which consists of probabilities of regions from \(\mathcal {R}\) whose IoU with \(m_i\) is greater than \(\lambda _+\). Similarly, we construct \(\mathcal {P^-} = \{p^-_1, p^-_2, \dots p^-_{N_{-}}\}\) consisting of probabilities of regions from \(\mathcal {R}\) whose IoU with all \(m_i\) is less than \(\lambda _-\). We then formulate the loss as follows,
where \(I^b_c\) is 1 if class c corresponds to the background label b, and T denotes the transpose of a vector. The hyperparameter \(\lambda \) in Eq. 8 controls the balance between positive and negative regions.
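The construction of the positive and negative sets can be sketched as follows. This is a minimal illustration that returns region indices rather than predicted probabilities and does not attempt to reproduce the loss of Eq. 8. The function names `iou` and `split_regions` and the nested-list mask format are assumptions.

```python
def iou(a, b):
    """Intersection over union of two binary masks given as nested lists."""
    pairs = [(bool(x), bool(y)) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    inter = sum(1 for x, y in pairs if x and y)
    union = sum(1 for x, y in pairs if x or y)
    return inter / union if union else 0.0


def split_regions(region_masks, gt_masks, lam_pos=0.7, lam_neg=0.3):
    """Return (positives, negatives) as lists of region indices.

    positives[i] lists regions whose IoU with ground-truth mask m_i exceeds
    lam_pos; negatives lists regions whose IoU with every m_i is below lam_neg.
    """
    positives = [[] for _ in gt_masks]
    negatives = []
    for j, region in enumerate(region_masks):
        scores = [iou(region, m) for m in gt_masks]
        for i, score in enumerate(scores):
            if score > lam_pos:
                positives[i].append(j)
        if all(score < lam_neg for score in scores):
            negatives.append(j)
    return positives, negatives


# One ground-truth mask and three candidate regions on a 1 x 4 toy image.
gt_masks = [[[1, 1, 0, 0]]]
region_masks = [[[1, 1, 0, 0]], [[0, 0, 1, 1]], [[1, 1, 1, 0]]]
positives, negatives = split_regions(region_masks, gt_masks)
```

Regions whose best IoU falls between the two thresholds, such as the third region here, belong to neither set.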
4 Implementation Details
4.1 Network Architecture
We use the pre-trained COB network, which is a ResNet50 model, to estimate contours. Features \(\mathbf {F}\) are extracted from the res3 layer of the ResNet50 model and have a spatial resolution of \(28\times 28\). ROIAlign extracts features with a fixed spatial resolution of \(7\times 7\). All the convolutions within the LSTM have a kernel size of \(3\times 3\), stride 1 and use zero-padding. On top of the convolutional LSTM, we have two \(3\times 3\) convolutions and two fully connected layers predicting softmax probabilities.
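The reason zero-padding matters here can be checked with the usual output-size formula for a convolution. The sketch below assumes a padding width of 1, which the text does not state; the helper name `conv_out_size` is hypothetical.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# 3 x 3 kernel, stride 1, zero-padding of 1 (assumed width).
same = conv_out_size(7, kernel=3, stride=1, padding=1)
# The same kernel without padding, for contrast.
valid = conv_out_size(7, kernel=3, stride=1, padding=0)
```

Under this assumption the \(7\times 7\) resolution is preserved through the gates, so the hidden states of children and parents can be summed and multiplied element-wise; without padding each convolution would shrink the map.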
4.2 Training Details
We set the parameters \(\lambda _+\), \(\lambda _-\) and \(\lambda \) to 0.7, 0.3 and 0.2 respectively in all our experiments. We train the Convolutional LSTM and subsequent layers from scratch with a batch size of 1 and an initial learning rate of 0.001, which is decayed by a factor of 0.1 after every 20 epochs. We experiment with initial cut-thresholds from \(\lambda _{cut} = 0.3\) to \(\lambda _{cut} = 0.9\) in steps of 0.1.
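The step schedule and the sweep over cut-thresholds described above can be written out as a short sketch. The function name `learning_rate` and the convention of counting epochs from zero are assumptions.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.1, step=20):
    """Step decay: multiply the rate by gamma after every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# Initial cut-thresholds from 0.3 to 0.9 in steps of 0.1.
cut_thresholds = [round(0.3 + 0.1 * k, 1) for k in range(7)]
```

Under this convention, epochs 0 to 19 use the initial rate of 0.001, epochs 20 to 39 use one tenth of it, and so on.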
5 Experiments
We use the pre-trained COB network, which was trained on the PASCAL Context dataset, to predict the contours. We train our Convolutional Tree-LSTM and subsequent layers on the PASCAL VOC 2012 dataset. We evaluate our model on the PASCAL VOC 2012 val set using average precision, Jaccard Index and the time taken to process an image as evaluation metrics. Table 1 compares the time taken to process a single image by different methods. Figure 3 shows the precision-recall curves for all the classes.
On the VOC 2012 val set, our best performing model scores 48% mAP. Our model struggles on categories like bicycle and chair. However, on categories like train and plane, our model achieves higher performance. Table 2 summarizes the average precision for all the categories. We further compare the Jaccard Index of our method against MCG in Table 3 (Fig. 4).
6 Conclusions
We proposed a unique approach for bottom-up instance segmentation which overcomes the limitations of the current bottom-up and top-down approaches. Our method produces comparable results with a good trade-off between segmentation accuracy and processing time. We would like to further investigate an end-to-end network that predicts contours in tandem with the estimation of the energies of regions. This should lead to the prediction of semantically accurate contours, resulting in a high-quality hierarchical region tree that further aids the estimation of energies.
References
1. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, October 2017
2. Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y.: Fully convolutional instance-aware semantic segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4438–4446 (2017)
3. Arbeláez, P., Maire, M., Fowlkes, C.C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 898–916 (2011)
4. Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., Malik, J.: Multiscale combinatorial grouping. In: Computer Vision and Pattern Recognition (2014)
5. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vis. 59(2), 167–181 (2004)
6. Pham, T., Do, T.T., Sünderhauf, N., Reid, I.: SceneCut: joint geometric and object segmentation for indoor scenes. In: 2018 IEEE International Conference on Robotics and Automation (ICRA) (2018)
7. Kirillov, A., Levinkov, E., Andres, B., Savchynskyy, B., Rother, C.: InstanceCut: from edges to instances with multicut. In: CVPR (2017)
© 2019 Springer Nature Switzerland AG
Manohar, K.V., Niitani, Y. (2019). An End-to-End Tree Based Approach for Instance Segmentation. In: Leal-Taixé, L., Roth, S. (eds) Computer Vision – ECCV 2018 Workshops. ECCV 2018. Lecture Notes in Computer Science(), vol 11133. Springer, Cham. https://doi.org/10.1007/978-3-030-11021-5_30
DOI: https://doi.org/10.1007/978-3-030-11021-5_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11020-8
Online ISBN: 978-3-030-11021-5