1 Introduction

Weakly Supervised Object Detection (WSOD) [3, 7, 10, 17, 18, 20, 21, 23, 32, 33, 35, 42, 43, 44] aims to detect objects using only image-level annotations for supervision. Despite remarkable progress, existing approaches still have difficulty accurately identifying tight boxes around target objects from such weak supervision, so their performance remains inferior to that of fully supervised counterparts [6, 13, 22, 25, 28, 29, 30].

Fig. 1.

Comparison of MIL-based approaches and our target. MIL-based approaches tend to assign high confidence to discriminative parts (blue boxes) of target objects. Our target is to alleviate such cases and lift the confidence of the tight ones (yellow boxes). Best viewed in color. (Color figure online)

To localize objects with weak supervision, one popular solution is to apply Multiple Instance Learning (MIL) to mine high-confidence region proposals [34, 47] with positive image-level annotations. However, MIL usually discovers the most discriminative part of the target object (e.g. the head of a cat) rather than the entire object region, as shown in Fig. 1. This inability to cover the complete object severely limits its effectiveness for WSOD. To address this issue, Li et al. [21] exploited the contrastive relationship between a selected region and its mask-out image for proposal selection. Nevertheless, the mask-out strategy fails in multi-instance cases: the selector is easily confused by remaining instances with high responses, even though the correct object has been masked out.

Recently, some weakly supervised semantic segmentation approaches [19, 36, 38, 40] have demonstrated promising performance. Utilizing the inferred segmentation confidence maps, Diba et al. [10] presented a cascaded approach that leverages segmentation knowledge to filter noisy proposals and achieves competitive detection results. However, we argue that their solution is sub-optimal: it only considers the segmentation confidence inside the proposal boxes and is thus unable to filter out high-response fragments covering only object parts, as shown by the magenta boxes in Fig. 2 (b).

In this work, we propose a principled and more effective approach than [10] for mining tight object boxes, exploiting segmentation confidence maps in a creative way to address the challenging WSOD problem. Our approach is motivated by the following observations, illustrated by two examples in Fig. 2 (a). We use blue and yellow to encode two kinds of boxes, which partially and tightly cover objects respectively. On the semantic segmentation confidence maps obtained in a weakly supervised manner, many pixels surrounding the blue boxes have high predicted segmentation confidence, while very few high-confidence pixels fall in the surrounding context of the tighter yellow ones. We find that a desirable tight object box generally needs to satisfy two properties based on segmentation context:

  • Purity: most pixels inside the box should have high confidence scores, which guarantees that the box is located around the target object;

  • Completeness: very few pixels in the surrounding context of the box should have high confidence scores, which guarantees that the box covers the object completely.

Fig. 2.

(a) Motivation of the proposed TS2C: enlarging a high-quality candidate box (yellow) includes fewer high-response pixels on the segmentation confidence map than enlarging a partial bounding box (blue). (b) Comparison of the rank-1 proposal selected by the strategy of [10] (magenta boxes) and by ours (yellow boxes). Best viewed in color. (Color figure online)

Based on these properties, we devise a simple yet effective approach, named Tight box mining with Surrounding Segmentation Context (TS2C), to efficiently select high-quality object candidates from thousands of proposals. Specifically, the proposed TS2C examines two kinds of regions for evaluating the tightness of a bounding box: (1) the region inside the box and (2) the region surrounding the box. It computes an objectness score for each region by averaging the corresponding pixel confidence values on the segmentation maps. A tight box is expected to have a high objectness score for the inside region and a low one for the surrounding region simultaneously. The difference between the two objectness scores is therefore taken as the tightness metric for ranking object candidates. Figure 2 (b) shows the top-1 object candidate inferred by the proposed TS2C. We can see that our approach is more effective for mining tight object boxes than [10]. Moreover, the proposed TS2C is generic and can be easily integrated into any WSOD framework by introducing a parallel semantic segmentation branch for class-specific confidence map prediction. Benefiting from TS2C, we achieve 48.0% and 44.4% mAP on the challenging Pascal VOC 2007 and VOC 2012 benchmarks, setting a new state-of-the-art for WSOD.

2 Related Work

Multiple Instance Learning (MIL) provides a suitable way to formulate and solve WSOD. Specifically, if an image is annotated with a specific class, at least one proposal instance in the image is positive for this class, and no proposal instance is positive for unlabeled classes. Previous works applying MIL to WSOD can be roughly categorized into two-step [7, 17, 21, 35] and end-to-end [3, 10, 18, 20, 32, 33] approaches.

Two-Step Approaches. These approaches first extract proposal representations leveraging hand-crafted features or pre-trained CNN models, and then employ MIL to select the best object candidates for learning the object detector. For instance, Wang et al. [35] presented a latent semantic clustering approach to select the most discriminative cluster for each category. Cinbis et al. [7] learned a multi-fold MIL detector by re-labeling proposals and re-training the object classifier iteratively. Li et al. [21] first trained a multi-label classification network on entire images and then selected class-specific proposal candidates using a mask-out strategy, followed by MIL for learning a Fast R-CNN detector. Recently, Jie et al. [17] took a similar strategy to Li et al. [21] and proposed a more robust self-taught approach that learns a detector by harvesting more accurate supportive proposals in an online manner. However, splitting WSOD into two steps results in a non-convex optimization problem, making such approaches prone to getting trapped in local optima.

End-to-End Approaches. These approaches combine CNNs and MIL into a unified framework for addressing WSOD. Oquab et al. [27] and Wei et al. [39] adopted a similar strategy to learn a multi-label classification network with max-pooling MIL; the learned classification model was then applied to coarse object localization [27]. Bilen et al. [3] proposed a Weakly Supervised Deep Detection Network (WSDDN) with two key streams, one for classification and the other for object localization, whose outputs are combined for better rating the objectness of proposals. Based on WSDDN, Kantorov et al. [18] proposed to learn a context-aware CNN with contrast-based contextual modeling. Both [18] and our approach employ proposal context to identify high-quality proposals. However, [18] exploits the inside/outside context features of each bounding box for learning the classifier; in contrast, we leverage objectness scores derived from segmentation confidence maps to pick out tight candidates. Recently, Tang et al. [32] also employed WSDDN as the basic network and augmented it with several Online Instance Classifier Refinement (OICR) branches, achieving the state-of-the-art on the challenging WSOD task. In this work, we employ both WSDDN and OICR to develop our framework, in which the proposed TS2C is leveraged to further improve performance. Both [10] and our approach utilize object segmentation knowledge to benefit WSOD. However, Diba et al. [10] only considered the confidence of pixels inside the bounding box for rating the proposal objectness, which is not as effective as our approach.

Beyond the above related works, some fully supervised object detection approaches [5, 12, 22, 46] also exploit the contextual information of bounding boxes to benefit detection. Both Chen et al. [5] and Li et al. [22] leveraged enlarged contextual proposals to enhance the accuracy of the classifier. Zhu et al. [46] proposed to use a pool of segments obtained in a bottom-up manner to produce better detection boxes. Our TS2C differs from these works in both motivation and methodology: we employ surrounding segmentation context to suppress false-positive object parts, and our approach can be easily embedded into any WSOD framework for further performance improvement.

Fig. 3.

Overview of the proposed TS2C for weakly supervised object detection. Several convolutional layers are leveraged to extract the intermediate features of an input image. The feature maps are first fed into a Classification branch to produce object localization maps corresponding to the image-level labels. We then employ the localization maps to generate segmentation masks, which serve as supervision for learning the Segmentation branch. Based on the segmentation confidence maps, TS2C evaluates the objectness scores of proposals according to their purity and completeness, and collaborates with the OICR [32] for training the Detection branch.

3 The Proposed Approach

We show the overall architecture of the proposed approach in Fig. 3. It consists of three key branches, i.e. image classification, semantic segmentation and object detection. In particular, the Classification branch is employed to generate class-specific localization maps. Following previous weakly supervised semantic segmentation approaches [37], we leverage the inferred localization maps to produce pseudo segmentation masks of the training images, which are then used as supervision to train the Segmentation branch. The segmentation confidence maps from the Segmentation branch are employed to evaluate the objectness scores of proposals according to the proposed TS2C, which finally collaborates with the Detection branch for learning an improved object detector. The overall framework is trained with stochastic gradient descent by minimizing the composite loss function over the three branches:

$$\begin{aligned} L = L_{cls} + L_{seg} + L_{det}. \end{aligned}$$
(1)

We will introduce each branch below and then elaborate on details of TS2C.

3.1 Classification for Object Localization

Inspired by [10, 24, 45], a fully convolutional network followed by a Global Average Pooling (GAP) operation can generate class-specific activation maps, which provide a coarse object localization prior. We conduct experiments on the Pascal VOC benchmarks, in which each training image is annotated with one or several labels. We thus treat classification as a separate binary classification problem for each class. Following [27], the loss function \(L_{cls}\) is defined as the sum of C binary logistic regression losses.
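To make this concrete, the following is a minimal sketch of such a GAP-based classification head (PyTorch-style; the class name, tensor shapes and reduction details are our assumptions rather than the authors' Caffe implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationBranch(nn.Module):
    """Sketch of a GAP-based classification head producing localization maps."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        # One activation map per class; these serve as the localization maps.
        self.score = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feats, labels=None):
        cams = self.score(feats)          # (B, C, H, W) class activation maps
        logits = cams.mean(dim=(2, 3))    # Global Average Pooling -> (B, C)
        if labels is None:
            return cams
        # Sum of C binary logistic regression losses, averaged over the batch.
        loss_cls = F.binary_cross_entropy_with_logits(
            logits, labels.float(), reduction='sum') / logits.size(0)
        return cams, loss_cls
```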

3.2 Weakly Supervised Semantic Segmentation

The Classification branch produces localization cues for foreground objects. We assign the corresponding class label to pixels whose values on the class-specific confidence map exceed a pre-defined normalized threshold (i.e. \(\ge \) 0.78). Beyond the object regions, background localization cues are also needed for training the Segmentation branch. Motivated by [19, 36, 38, 40], we leverage a saliency detection technique [41] to produce a saliency map for each training image and choose the pixels with low normalized saliency values (i.e. \(\le \) 0.06) as background. However, neither the class-specific confidence map nor the saliency map is accurate enough to guarantee a high-quality segmentation mask. To alleviate the negative effect of falsely assigned pixels, we ignore ambiguous pixels when training the Segmentation branch, including (1) pixels that are not assigned any semantic label, (2) foreground pixels of different categories that are in conflict, and (3) low-saliency pixels that fall inside foreground regions. With the produced pseudo segmentation masks, we train the Segmentation branch with the pixel-wise cross-entropy loss \(L_{seg}\) widely adopted by fully supervised schemes [4, 26].
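The pseudo-mask generation described above can be sketched as follows (a numpy illustration under our assumptions: confidence and saliency maps normalized to [0, 1], label 0 reserved for background, and 255 used as the ignore label; the function name is ours):

```python
import numpy as np

IGNORE = 255  # ambiguous pixels, skipped by the cross-entropy loss

def make_pseudo_mask(cams, saliency, image_labels, fg_thr=0.78, bg_thr=0.06):
    """cams: (C, H, W) normalized class-specific confidence maps in [0, 1];
    saliency: (H, W) normalized saliency map; image_labels: annotated class ids."""
    _, H, W = cams.shape
    mask = np.full((H, W), IGNORE, dtype=np.uint8)   # (1) unassigned -> ignored
    mask[saliency <= bg_thr] = 0                     # low saliency -> background
    fg_count = np.zeros((H, W), dtype=np.int32)
    for c in image_labels:
        fg = cams[c] >= fg_thr
        mask[fg] = c + 1                             # label 0 reserved for background
        fg_count += fg
    mask[fg_count > 1] = IGNORE                      # (2) conflicting foreground
    mask[(fg_count >= 1) & (saliency <= bg_thr)] = IGNORE  # (3) low saliency in fg
    return mask
```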

3.3 Learning Object Detection with TS2C

For each training or test image, Selective Search [34] is employed to generate object proposals, and Spatial Pyramid Pooling (SPP) [15] is leveraged to produce fixed-size feature maps for the different proposals. Our TS2C aims to select high-quality object candidates from thousands of proposals to improve the effectiveness of training, and can be easily integrated into any WSOD framework. We choose the state-of-the-art Online Instance Classifier Refinement (OICR) [32] as the backbone of the Detection branch, which collaborates with the proposed TS2C for learning a better object detector. In the following, we first briefly introduce OICR and then explain how to leverage TS2C to benefit the learning process of WSOD.

OICR. As shown in Fig. 3, OICR mainly includes two modules, i.e. multiple instance classification and instance refinement. The multiple instance classification module is inspired by [3] and includes two streams that process the SPP-pooled input features in parallel, as shown in Fig. 4 (a). The upper stream applies a softmax over classes to each individual proposal for classification. The bottom stream applies a softmax over all candidate proposals for each class, which indicates the contribution of each proposal to the classifier decision. These two streams therefore provide classification-based and localization-based scores for each proposal. The two scores are then fused by element-wise product and finally aggregated into an image-level prediction by sum-pooling over all proposals. With the supervision of image-level annotations, the multiple instance classification module is learned with binary logistic regression losses as detailed in Sect. 3.1.
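The two-stream fusion can be summarized in a few lines (a PyTorch-style paraphrase of the WSDDN formulation [3]; the function name is ours):

```python
import torch.nn.functional as F

def two_stream_fusion(cls_logits, loc_logits):
    """cls_logits, loc_logits: (R, C) scores for R proposals and C classes."""
    s_cls = F.softmax(cls_logits, dim=1)  # softmax over classes, per proposal
    s_loc = F.softmax(loc_logits, dim=0)  # softmax over proposals, per class
    scores = s_cls * s_loc                # element-wise product fusion, (R, C)
    image_pred = scores.sum(dim=0)        # sum-pooling over proposals -> (C,)
    return scores, image_pred
```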

Fig. 4.

Details of the (a) multiple instance classification module and (b) instance refinement module in our framework.

Using the multiple instance classification module as a basic classifier to obtain initial classification scores for each proposal, progressive refinement is then conducted via the instance refinement module, as detailed in Fig. 4 (b). In particular, the instance refinement module first selects the top-scoring proposal for each image-level label. Proposals with high spatial overlap with the top-scoring one are then labeled correspondingly. The idea behind this module is that the top-scoring proposal may contain only part of a target object, while its adjacent proposals may cover more of the object. Benefiting from both modules of OICR, each proposal is assigned a pseudo class label, which is then employed as supervision for learning detection with the softmax cross-entropy loss [13, 14, 29]. To address the initialization issue (i.e. the classifier cannot recognize proposals well with randomly initialized parameters at the beginning of training), OICR adopts a weighted loss that assigns different weights to different proposals across training iterations. Thus, \(L_{det}\) is composed of binary logistic regression losses for image-level classification and a softmax cross-entropy loss for proposal-level classification. Please refer to [32] for more details.
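The labeling step can be sketched as follows (a simplified numpy illustration of OICR's refinement; the IoU threshold and confidence-based weighting follow [32], while the helper names and the exact weighting scheme are our simplifications):

```python
import numpy as np

def iou_with(boxes, ref):
    """IoU of each (x1, y1, x2, y2) box in `boxes` (R, 4) with one `ref` box."""
    x1 = np.maximum(boxes[:, 0], ref[0]); y1 = np.maximum(boxes[:, 1], ref[1])
    x2 = np.minimum(boxes[:, 2], ref[2]); y2 = np.minimum(boxes[:, 3], ref[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    ref_area = (ref[2] - ref[0]) * (ref[3] - ref[1])
    return inter / (areas + ref_area - inter)

def refine_labels(boxes, scores, image_labels, iou_thr=0.5):
    """boxes: (R, 4); scores: (R, C) from the previous classifier stage.
    Returns per-proposal pseudo labels (0 = background) and loss weights."""
    labels = np.zeros(len(boxes), dtype=np.int64)
    weights = np.ones(len(boxes), dtype=np.float32)
    for c in image_labels:
        top = int(scores[:, c].argmax())        # top-scoring proposal for class c
        near = iou_with(boxes, boxes[top]) >= iou_thr
        labels[near] = c + 1                    # neighbors inherit the class label
        weights[near] = scores[top, c]          # weight by classifier confidence
    return labels, weights
```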

Problems. However, the progressive refinement of OICR relies heavily on the quality of the initial object candidates produced by the multiple instance classification module. Without reasonable initial candidates, the refinement strategy cannot find correct proposals with high IoU over the ground-truth bounding boxes. This brings a critical risk: if the multiple instance classification module fails to produce reasonable object candidates, OICR has no hope of recalling the missed objects. We propose to reduce this risk by designing an objectness rating approach from a totally new perspective. In the following, we detail how the proposed TS2C rates the objectness of proposals from the segmentation view.

Fig. 5.

Motivation of the conditional average strategy: only a small number of pixels in the surrounding region belong to objects. To promote the objectness score of the surrounding context, we only employ pixels with large confidence values (highlighted in red) for the average calculation. Best viewed in color. (Color figure online)

TS\(^{2}\)C for Learning Detection. As shown in Fig. 3, TS2C uses the segmentation confidence maps from the Segmentation branch to rate proposal objectness. Consider a proposal \(x_i\) (\(i = 1, \cdots , n\)) from a given training image annotated with class c, and let \(H_c\) denote the confidence map of category c predicted by the Segmentation branch. For \(x_i\), we calculate objectness scores of both the region inside the box, \(P_I\), and the surrounding context between \(x_i\) and its enlarged counterpart, \(P_S\). Let \(avg(H_c, x_i)\) denote the operation for computing \(P_I\), which averages over all pixel values included in \(x_i\); a large \(P_I\) guarantees that \(x_i\) is located around the target object. To obtain a robust surrounding objectness score \(P_S\), we adopt a conditional average strategy \(\hat{avg}(H_c, x_i)\). As shown in Fig. 5, the surrounding regions of many negative candidates include a large number of unrelated (i.e. background) pixels with low confidence scores. The resulting objectness score would therefore be small if we averaged all pixel values to compute \(P_S\) in the same way as for \(P_I\). However, we expect \(P_S\) to be large in such cases, so that these negative candidates can be suppressed by \(P_I-P_S\). To this end, we rank the pixels in the surrounding region by their confidence scores, and the conditional average only employs the top 50% of pixels for calculating the objectness score. The objectness score \(O(x_i)\) of the proposed TS2C is then calculated as

$$\begin{aligned} O(x_i) = P_I - P_S = avg(H_c, x_i) - \hat{avg}(H_c, x_i). \end{aligned}$$
(2)

We rank all object candidates according to \(O(x_i)\) and build a candidate pool from the top two hundred proposals, which collaborates with the OICR for learning a better detector. As shown in Fig. 3, \(\oplus \) indicates that the OICR only selects object candidates from the pool produced by TS2C during the subsequent training process.
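A numpy sketch of the full TS2C scoring and pool construction is given below (box coordinates are assumed to be integer pixel indices; the enlargement ratio of 1.2 and the top-50% conditional average follow Sects. 3.3 and 4.2, while the function names and boundary handling are ours):

```python
import numpy as np

def ts2c_objectness(H_c, box, ratio=1.2, keep_frac=0.5):
    """O(x_i) = P_I - P_S on the confidence map H_c (H, W) of the annotated
    class, for one proposal box = (x1, y1, x2, y2) in integer pixel coords."""
    h, w = H_c.shape
    x1, y1, x2, y2 = box
    P_I = H_c[y1:y2, x1:x2].mean()            # plain average inside the box
    # Enlarge the box around its center and clip to the image boundary.
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    hw, hh = (x2 - x1) * ratio / 2.0, (y2 - y1) * ratio / 2.0
    ex1, ey1 = max(int(cx - hw), 0), max(int(cy - hh), 0)
    ex2, ey2 = min(int(cx + hw), w), min(int(cy + hh), h)
    surround = H_c[ey1:ey2, ex1:ex2].astype(np.float64)
    surround[y1 - ey1:y2 - ey1, x1 - ex1:x2 - ex1] = np.nan  # mask inner box
    ring = surround[~np.isnan(surround)]
    if ring.size == 0:                        # box fills the (clipped) context
        return P_I
    # Conditional average: only the top keep_frac most confident ring pixels.
    k = max(int(ring.size * keep_frac), 1)
    P_S = np.sort(ring)[-k:].mean()
    return P_I - P_S

def build_candidate_pool(H_c, boxes, top_k=200):
    """Rank all proposals by O(x_i) and keep the top-200 pool for the OICR."""
    scores = np.array([ts2c_objectness(H_c, b) for b in boxes])
    return np.argsort(-scores)[:top_k]
```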

During the testing stage, we ignore the Classification and Segmentation branches, and leverage the classification outputs from the instance refinement module to obtain the final detection results.

4 Experiments

4.1 Datasets and Evaluation Metrics

Datasets. We conduct experiments on the Pascal VOC 2007 and 2012 datasets [11], the two most widely used benchmarks for weakly supervised object detection. For VOC 2007, we train the model on the trainval set (5,011 images) and evaluate on the test set (4,952 images); we also conduct extensive ablation analyses on VOC 2007 to verify the effectiveness of our design choices. For VOC 2012, we train the model on the trainval set (11,540 images) and evaluate on the test set (10,991 images) by submitting the results to the evaluation server.

Metrics. Following [10, 17, 32], we adopt two evaluation metrics, i.e. mean average precision (mAP) and correct localization (CorLoc) [9], computed on the test and trainval sets respectively. Both metrics use the same bounding-box overlap criterion with ground-truth boxes, i.e. IoU \(\ge \) 0.5.

4.2 Implementation Details

We use the object proposals generated by Selective Search [34] and adopt the VGG16 network [31] pre-trained on ImageNet [8] as the backbone of the proposed framework. We employ the Deeplab-CRF-LargeFOV [4] model to initialize the corresponding layers in the Segmentation branch. The parameters of the newly added layers are randomly initialized from a Gaussian distribution \(\mathcal {N}(\mu , \sigma )\) with \(\mu =0, \sigma =0.01\). We use a mini-batch size of 2 images and set the learning rate to 0.001 for the first 40K iterations and 0.0001 for the following 30K iterations. During training, we use five image scales \(\{480, 576, 688, 864, 1200\}\) for data augmentation. For TS2C, we adopt an enlargement ratio of 1.2 to obtain the surrounding context, which is employed for evaluating the completeness of object candidates. Our experiments build on the OICR [32] code, implemented on the publicly available Caffe [16] deep learning framework, and are run on NVIDIA TITAN X PASCAL GPUs.
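For reference, the hyper-parameters above can be summarized in one place (a plain Python summary; the key names are ours, not from the released code):

```python
# Training configuration summarized from Sect. 4.2 (key names are ours).
config = dict(
    backbone='VGG16, ImageNet pre-trained',
    seg_init='Deeplab-CRF-LargeFOV',
    new_layer_init=dict(type='gaussian', mean=0.0, std=0.01),
    batch_size=2,                                  # images per mini-batch
    lr_schedule=[(40000, 1e-3), (30000, 1e-4)],    # (iterations, learning rate)
    train_scales=[480, 576, 688, 864, 1200],       # multi-scale augmentation
    enlarge_ratio=1.2,                             # surrounding context ratio
    surround_keep_frac=0.5,                        # conditional average (top 50%)
    candidate_pool_size=200,                       # top proposals kept by TS2C
)
```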

Table 1. Comparison of detection average precision (AP) (%) on PASCAL VOC.
Table 2. Comparison of detection AP (%) by training FRCNN detectors.
Fig. 6.

Examples of our object detection results on VOC 2007 test set. Ground-truth annotations, predictions of OICR and ours are indicated by red, green and blue bounding boxes respectively. Best viewed in color. (Color figure online)

Table 3. Comparison of correct localization (CorLoc) (%) on PASCAL VOC.

4.3 Comparison with Other State-of-the-arts

We compare our approach with both two-step [7, 17, 21, 35] and end-to-end [3, 10, 18, 20, 32, 33] approaches; the top three results are highlighted in color in the tables. Table 1 shows the comparison in terms of AP on VOC 2007. The proposed TS2C is effective and outperforms all the other approaches. In particular, we adopt the OICR of Tang et al. [32] as the detection backbone in our framework, and our approach outperforms OICR by 3.1%. The gains mainly come from using both the purity and completeness metrics to filter noisy object candidates. We also compare our approach with other state-of-the-arts on Pascal VOC 2012 in terms of AP: our result outperforms the baseline (i.e. Tang et al. [32]) and the state-of-the-art approach (i.e. Jie et al. [17]) by 2.1% and 1.7%, respectively.

Following [32], we also train an FRCNN [13] detector using the top-scoring proposals produced by TS2C as pseudo ground-truth bounding boxes. As shown in Table 2, the performance is further enhanced to 48.0% and 44.4% on VOC 2007 and 2012, respectively. These single-model results are much better than those of [32] obtained by fusing multiple models (e.g. VGG16 and VGG-M). In addition, we conduct additional experiments using CorLoc as the evaluation metric. Table 3 shows the comparison on VOC 2007 and 2012: our approach achieves 61.0% and 64.4% CorLoc, which is competitive with the state-of-the-arts. We visualize some successful detection results (blue boxes) on VOC 2007 in Fig. 6, with results from OICR (green boxes) and ground truth (red boxes) shown for comparison. It can be seen that our approach effectively reduces false positives such as partial objects.

4.4 Ablation Experiments

We conduct extensive ablation analyses of the proposed TS2C, covering the influence of the enlargement scale used to obtain the surrounding context and of the proposed tightness criteria (i.e. purity and completeness). All experiments are based on the VOC 2007 benchmark.

Table 4. Ablation study on PASCAL VOC 2007.

Purity and Completeness. One of our main contributions is the proposed criteria of purity and completeness for measuring the tightness of object candidates based on semantic segmentation confidence maps. To validate the effectiveness of our approach (i.e. \(P_I-P_S\)), we test the popular alternative setting in which only purity (i.e. \(P_I\)) is taken into account. Specifically, we first leverage the two metrics to rank object candidates for each annotated class. For example, if an image is annotated with two labels, we produce two rankings according to the segmentation confidence maps of the two classes, which are then employed for evaluating recall. As shown in Fig. 7, we vary the number of top-ranked object candidates kept from each ranking. Since our evaluation takes only one object candidate per annotated category in the top-1 case, the upper bound of the recall is 57.9% due to the existence of multi-instance images. Despite the apparent simplicity, the recall scores of our proposed \(P_I-P_S\) significantly outperform those of \(P_I\) across different numbers of top-ranked candidates, demonstrating that the completeness metric is effective for reducing noisy object candidates. More visualizations of rank-1 boxes produced by \(P_I-P_S\) and \(P_I\) are shown in Fig. 8: our approach successfully discovers the tight boxes from thousands of candidates. To further validate the effectiveness of the proposed TS2C, we also conduct experiments using purity alone (i.e. \(P_I\)) to rank object candidates, as adopted in [10] for proposal selection, which yields 42.2% mAP. By simultaneously taking purity and completeness into account, i.e. \(P_I-P_S\), the result surpasses this purity-only baseline by 2.1%, as shown in Table 4.

Fig. 7.

Comparison of recall scores (%) between the proposed TS2C (\(P_I-P_S\)) and the purity strategy (\(P_I\)).

Influence of Enlargement Scale. To evaluate the completeness of object candidates, we need to enlarge the original box by a specific ratio. As shown in Table 4, we examine four ratios (from 1.1 to 1.4) for obtaining the surrounding context of object candidates, which is then used to calculate objectness scores with the proposed TS2C. All models trained with the proposed TS2C outperform the baseline by more than 1.4%, and the best result is achieved with a ratio of 1.2. Enlarging the ratio further decreases the performance; the reason may be that some training images contain multiple instances of the same semantics, so the completeness score of each object candidate is influenced by adjacent instances when larger ratios are used.

Influence of the Conditional Average Strategy. As shown in Table 4, we also examine the threshold of the conditional average strategy. The best result is achieved by employing the top 50% most confident pixels to calculate the objectness score of the surrounding region.

Fig. 8.

Rank 1 object candidates inferred by the proposed TS2C (yellow boxes) and the strategy only using purity metric for ranking (magenta boxes). Some failure cases are given in the last row. Best viewed in color. (Color figure online)

Discussion. Some failure cases are shown in the last row of Fig. 8. These samples share similar characteristics: low-quality segmentation predictions, or many semantically identical instances linked together. For instance, in the middle image of the last row, the Segmentation branch makes a false prediction for the object under the bird, leading to an incorrect inference by our approach. Such cases can be expected to improve with the development of weakly supervised semantic segmentation techniques. For the other failure samples, although the Segmentation branch provides high-quality confidence maps, the overlap between objects results in false predictions by our TS2C. Addressing this case may require effective instance-level semantic segmentation approaches trained in a weakly supervised manner.

However, the limitation of TS2C in dealing with overlapping objects of the same semantics does not undermine its good performance on WSOD. We do not take only the top-1 proposal according to the objectness score as the object candidate, but build a candidate pool from the top two hundred proposals. Tight boxes may therefore still be recalled even when they do not receive the largest tightness score. The effectiveness of TS2C is well demonstrated by the performance gains on VOC 2007 and 2012 over [32].

5 Conclusion and Future Work

In this work, we proposed a simple approach, i.e. TS2C, for mining tight boxes by exploiting surrounding segmentation context. TS2C is effective at suppressing low-quality object candidates and promoting high-quality ones that tightly cover the target object. Based on the segmentation confidence maps, TS2C introduces two simple criteria, i.e. purity and completeness, to evaluate the objectness scores of object candidates. Despite its apparent simplicity, the proposed TS2C effectively filters thousands of noisy candidates and can be easily embedded into any end-to-end weakly supervised framework for performance improvement. In the future, we plan to design more effective metrics for mining tight boxes to further boost our current approach.