This paper presents a simple and effective nonparametric approach to the problem of image parsing, or labeling image regions (in our case, superpixels produced by bottom-up segmentation) with their categories. This approach is based on lazy learning, and it can easily scale to datasets with tens of thousands of images and hundreds of labels. Given a test image, it first performs global scene-level matching against the training set, followed by superpixel-level matching and efficient Markov random field (MRF) optimization for incorporating neighborhood context. Our MRF setup can also compute a simultaneous labeling of image regions into semantic classes (e.g., tree, building, car) and geometric classes (sky, vertical, ground). Our system outperforms the state-of-the-art nonparametric method based on SIFT Flow on a dataset of 2,688 images and 33 labels. In addition, we report per-pixel rates on a larger dataset of 45,676 images and 232 labels. To our knowledge, this is the first complete evaluation of image parsing on a dataset of this size, and it establishes a new benchmark for the problem. Finally, we present an extension of our method to video sequences and report results on a video dataset with frames densely labeled at 1 Hz.
Keywords: Scene understanding · Image parsing · Image segmentation
1 Introduction

This paper addresses the problem of image parsing, or segmenting all the objects in an image and identifying their categories. Many approaches to this problem have been proposed recently, including ones that estimate labels pixel by pixel (He et al. 2004; Ladicky et al. 2010; Shotton et al. 2006, 2008), ones that aggregate features over segmentation regions (Galleguillos et al. 2010; Gould et al. 2009; Hoiem et al. 2007; Malisiewicz and Efros 2008; Rabinovich et al. 2007; Socher et al. 2011), and ones that predict object bounding boxes (Divvala et al. 2009; Felzenszwalb et al. 2008; Heitz and Koller 2008; Russell et al. 2007). Most of these methods operate with a few pre-defined classes and require a generative or discriminative model to be trained in advance for each class (and sometimes even for each training exemplar (Malisiewicz and Efros 2008, 2011)). Training can take days and must be repeated from scratch if new training examples or new classes are added to the dataset. In most cases (with the notable exception of Shotton et al. 2008), processing a test image is also quite slow, as it involves steps like running multiple object detectors over the image, performing graphical model inference, or searching over multiple segmentations.
While most existing methods are tailored for “closed universe” datasets, a new generation of “open universe” datasets is beginning to take over. An example open-universe dataset is LabelMe (Russell et al. 2008), which consists of complex, real-world scene images that have been segmented and labeled by multiple users (sometimes incompletely or noisily). There is no pre-defined set of class labels; the dataset is constantly expanding as people upload new photos or add annotations to current ones. In order to cope with such datasets, vision algorithms must have much faster training and testing times, and they must make it easy to continuously update the visual models with new classes or new images.
Recently, a few researchers have begun advocating nonparametric, data-driven approaches suitable for open-universe datasets (Hays and Efros 2008; Torralba et al. 2008; Liu et al. 2011a, 2011b). Such approaches do not do any training at all. Instead, for each new test image, they try to retrieve the most similar training images and transfer the desired information from the training images to the query. Liu et al. (2011a) have proposed a nonparametric label transfer method based on estimating “SIFT flow,” or a dense deformation field between images. The biggest drawback of this method is that the optimization problem for finding the SIFT flow is fairly complex and expensive to solve. Moreover, the formulation of scene matching in terms of estimating a dense per-pixel flow field is not necessarily in accord with our intuitive understanding of scenes as collections of discrete objects defined by their spatial support and class identity.
The prevailing consensus in the recognition community is that image parsing requires context (Divvala et al. 2009; Galleguillos and Belongie 2010; Heitz and Koller 2008; Hoiem et al. 2007; Rabinovich et al. 2007). However, learning and inference with most existing contextual models are slow and inexact. Therefore, in keeping with our goal of developing a scalable system, we restrict ourselves to efficient forms of context that do not need training and that can be cast in an MRF framework amenable to optimization by fast graph cut algorithms (Boykov and Kolmogorov 2004; Boykov et al. 2001; Kolmogorov and Zabih 2004). Our in-depth analysis presented in Sect. 3 demonstrates that such simple context is sufficient for good performance provided the local feature representation is powerful enough. We also investigate geometric/semantic context in the manner of Gould et al. (2009). Namely, for each superpixel in the image, we simultaneously estimate a semantic label (e.g., building, car, person, etc.) and a geometric label (sky, ground, or vertical surface) while making sure the two types of labels assigned to the same region are consistent (e.g., a building has to be vertical, a road horizontal, and so on). Our experiments show that enforcing this coherence improves the performance of both labeling tasks.
Our system exceeds the results reported in Liu et al. (2011a) on a dataset of 2,688 images and 33 labels. Moreover, to demonstrate the scalability of our method, we present per-pixel and per-class rates on a subset from the LabelMe and SUN (Xiao et al. 2010) datasets totaling 45,676 images and 232 labels. To our knowledge, we are the first to report complete recognition results on a dataset of this size. Thus, one of the contributions of our work is to establish a new benchmark for large-scale image parsing. Note that unlike other popular benchmarks for image parsing (e.g., Gould et al. 2009; Hoiem et al. 2007; Liu et al. 2011a; Shotton et al. 2006), our LabelMe+SUN dataset contains both outdoor and indoor images. As will be discussed in Sect. 3.3, indoor imagery currently appears to be much more challenging for general-purpose image parsing systems than outdoor imagery, due in part to the greater diversity of indoor scenes, as well as to the smaller amount of training data available for them.
As another contribution, we extend our parsing approach to video and show that we can take advantage of motion cues and temporal consistency to improve performance. Existing video parsing approaches (Brostow et al. 2008; Zhang et al. 2010) use structure from motion to obtain either sparse point clouds or dense depth maps, and extract geometry-based features that can be combined with appearance-based features or used on their own to achieve greater accuracy. We take a simpler approach and only use motion cues to segment the video into temporally consistent regions (Grundmann et al. 2010), or supervoxels. This helps to better separate moving objects from one another, especially when there is no high-contrast edge between them. Our results in Sect. 4 show that the incorporation of motion cues from video can significantly help parsing performance even without the explicit reconstruction of scene geometry.
A previous version of this work has been published in Tighe and Lazebnik (2010). The main advances over Tighe and Lazebnik (2010) are: an in-depth analysis of the various parameters of our system, an evaluation on the new LabelMe+SUN dataset containing both outdoor and indoor images, and an extension of our system to video parsing. Our code and data can be found at http://www.cs.unc.edu/SuperParsing.
2 System Description
Given a query image, our system performs the following steps (a minimal pseudocode sketch of the pipeline follows this list):
1. Find a retrieval set of images similar to the query image (Sect. 2.1).
2. Segment the query image into superpixels and compute feature vectors for each superpixel (Sect. 2.2).
3. For each superpixel and each feature type, find the nearest-neighbor superpixels in the retrieval set according to that feature, and compute a likelihood score for each class based on the superpixel matches (Sect. 2.3).
4. Use the computed likelihoods together with pairwise co-occurrence energies in a Markov random field (MRF) framework to compute a global labeling of the image (Sect. 2.4). Alternatively, with modifications, the MRF framework can simultaneously solve for both semantic and geometric class labels (Sect. 2.5).
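To make the flow of data explicit, here is a minimal pseudocode-style sketch of this pipeline. Every name below is a hypothetical placeholder for the component described in the corresponding section, not our actual code; the `components` bundle stands in for whatever implementations of those steps are used.

```python
def parse_image(query, training_set, components):
    """Sketch of the parsing pipeline; `components` bundles the pieces of Sects. 2.1-2.5."""
    retrieval_set = components.build_retrieval_set(query, training_set)   # Sect. 2.1
    superpixels = components.segment(query)                               # Sect. 2.2
    feats = {sp: components.features(query, sp) for sp in superpixels}
    # Sect. 2.3: per-class likelihood scores from nearest-neighbor superpixel matches
    scores = {sp: components.class_likelihoods(feats[sp], retrieval_set)
              for sp in superpixels}
    # Sects. 2.4-2.5: MRF with pairwise co-occurrence terms over adjacent superpixels,
    # optionally solving jointly for semantic and geometric labels
    return components.mrf_labeling(superpixels, scores)
```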
2.1 Retrieval Set
Table 1 A complete list of features used in our system

(a) Global features for retrieval set computation (Sect. 2.1):
- Spatial pyramid (3 levels, SIFT dictionary of size 200)
- Gist (3-channel RGB, 3 scales with 8, 8, and 4 orientations)
- Color histogram (3-channel RGB, 8 bins per channel)

(b) Superpixel features (Sect. 2.2):
- Mask of superpixel shape over its bounding box (8×8)
- Bounding box width/height relative to image width/height
- Superpixel area relative to the area of the image
- Mask of superpixel shape over the image
- Top height of bounding box relative to image height
- Texton histogram; texton histogram of the superpixel dilated by 10 pixels
- Quantized SIFT histogram; quantized SIFT histogram of the superpixel dilated by 10 pixels
- Left/right/top/bottom boundary quantized SIFT histograms
- RGB color mean and standard deviation
- Color histogram (RGB, 11 bins per channel); color histogram of the superpixel dilated by 10 pixels
- Color thumbnail (8×8)
- Masked color thumbnail
- Grayscale gist over superpixel bounding box
We examine the contributions of different global features and the effect of changing the retrieval set size K in the experiments of Sect. 3.3.
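As a concrete illustration, one simple way (not necessarily the exact rule used by our system) to combine several global descriptors with very different scales is to rank the training images separately by each feature's distance to the query and then order them by average rank; the function and argument names below are illustrative.

```python
import numpy as np

def retrieval_set_by_average_rank(query_feats, train_feats, K=200):
    """Rank training images by each global feature and keep the top K by average rank.

    query_feats: {feature_name: 1-D descriptor of the query image}
    train_feats: {feature_name: (N, d) array of descriptors for N training images}
    Combining ranks rather than raw distances sidesteps the very different scales
    of gist, spatial pyramid, and color histogram distances.
    """
    n_train = next(iter(train_feats.values())).shape[0]
    avg_rank = np.zeros(n_train)
    for name, q in query_feats.items():
        dists = np.linalg.norm(train_feats[name] - q, axis=1)  # distance to every training image
        ranks = np.empty(n_train)
        ranks[np.argsort(dists)] = np.arange(n_train)          # 0 = most similar
        avg_rank += ranks / len(query_feats)
    return np.argsort(avg_rank)[:K]                            # indices of the retrieval set
```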
2.2 Superpixel Features
We wish to label the query image based on the content of the retrieval set, but assigning labels on a per-pixel basis as in He et al. (2004) or Liu et al. (2011a, 2011b) tends to be inefficient. Instead, like Hoiem et al. (2007), Malisiewicz and Efros (2008), and Rabinovich et al. (2007), we choose to assign labels to superpixels, or regions produced by bottom-up segmentation. This not only reduces the complexity of the problem, but also gives better spatial support for aggregating features that could belong to a single object than, say, fixed-size square windows centered on every pixel in the image. We obtain superpixels using the fast graph-based segmentation algorithm of Felzenszwalb and Huttenlocher (2004) and describe their appearance using 20 different features similar to those of Malisiewicz and Efros (2008), with some modifications and additions. A complete list of the features is given in Table 1(b). In particular, we compute histograms of textons and dense SIFT descriptors over the superpixel region, as well as a version of that region dilated by 10 pixels. For SIFT features, which are more powerful than textons, we have also found it useful to compute left, right, top, and bottom boundary histograms. To do this, we find the boundary region as the difference between the superpixel dilated and eroded by 5 pixels, and then obtain the left/right/top/bottom parts of the boundary by cutting it with an “X” drawn over the superpixel bounding box. All of the features are computed for each superpixel in the training set and stored together with their class labels. We associate a class label with a training superpixel if 50 % or more of the superpixel overlaps with a ground truth segment mask with that label.
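For concreteness, here is a minimal NumPy/SciPy sketch of this boundary construction. The function name, the handling of the dilation/erosion widths, and the use of normalized bounding-box coordinates are illustrative choices, not our actual implementation.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_masks(sp_mask, width=5):
    """Split a superpixel's boundary ring into left/right/top/bottom parts.

    sp_mask: boolean (H, W) mask of one superpixel. Feature histograms (SIFT,
    textons, color) can then be accumulated over each part separately.
    """
    ring = binary_dilation(sp_mask, iterations=width) & ~binary_erosion(sp_mask, iterations=width)

    ys, xs = np.nonzero(sp_mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    yy, xx = np.mgrid[0:sp_mask.shape[0], 0:sp_mask.shape[1]]
    u = (xx - x0) / max(x1 - x0, 1)   # normalized bounding-box coordinates
    v = (yy - y0) / max(y1 - y0, 1)

    # The two diagonals of an "X" drawn over the bounding box carve the plane
    # into top / bottom / left / right wedges.
    return {
        "top":    ring & (v <= u) & (v <= 1 - u),
        "bottom": ring & (v >= u) & (v >= 1 - u),
        "left":   ring & (u <= v) & (u <= 1 - v),
        "right":  ring & (u >= v) & (u >= 1 - v),
    }
```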
2.3 Local Superpixel Labeling
At this point, we can obtain a labeling of the image by simply assigning to each superpixel the class that maximizes Eq. (1). As shown in Table 2, the resulting classification rates already come within 2.5 % of those of Liu et al. (2011a).
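For illustration, the sketch below shows one plausible form of such a nearest-neighbor score, not necessarily the exact form of Eq. (1): for each feature type, a class is scored by how often it occurs among the retrieved nearest neighbors relative to how often it occurs in the retrieval set overall, and the per-feature log-ratios are summed under a naive Bayes assumption. The smoothing constant and all names are illustrative.

```python
import numpy as np

def class_scores(neighbor_labels_per_feature, retrieval_counts, classes, eps=1.0):
    """Naive-Bayes-style combination of per-feature nearest-neighbor votes.

    neighbor_labels_per_feature: {feature_name: list of class labels of the
        nearest-neighbor superpixels retrieved for that feature}
    retrieval_counts: {class: number of superpixels of that class in the retrieval set}
    Returns {class: score}; the local labeling takes the argmax over classes.
    """
    total = sum(retrieval_counts.values())
    scores = {c: 0.0 for c in classes}
    for labels in neighbor_labels_per_feature.values():
        for c in classes:
            n_c = labels.count(c)                    # neighbors of class c
            n_rest = len(labels) - n_c               # neighbors of all other classes
            prior_c = retrieval_counts.get(c, 0)     # overall frequency of c
            prior_rest = total - prior_c
            ratio_c = (n_c + eps) / (prior_c + eps)
            ratio_rest = (n_rest + eps) / (prior_rest + eps)
            scores[c] += np.log(ratio_c / ratio_rest)
    return scores
```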
2.4 Contextual Inference
[Table: Performance on the LM+SUN dataset broken down by outdoor and indoor test images. Per-pixel classification rate is listed first, followed by the average per-class rate in parentheses.]
We have also experimented with a contrast-sensitive per-pixel MRF similar to that of Liu et al. (2011a), but have found that our per-superpixel formulation is faster and achieves the same per-pixel and per-class performance. One reason for this may be that the per-superpixel MRF makes it easier to converge to a better minimum by flipping labels over larger areas of the image. A per-pixel MRF does, however, produce more visually pleasing labelings, but we chose the superpixel-based MRF due to its superior speed.
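For reference, a superpixel MRF of this kind has the standard pairwise form (the exact data and smoothing potentials are those defined by our equations, e.g., the edge term referenced as Eq. (5) in Sect. 4.1; the expression below is generic rather than a restatement of them):

E(c) = \sum_i D(s_i, c_i) + \lambda \sum_{(i,j) \in \mathcal{A}} V(c_i, c_j),

where D(s_i, c_i) is a data cost derived from the local likelihood scores of Sect. 2.3, \mathcal{A} is the set of adjacent superpixel pairs, V(c_i, c_j) penalizes unlikely label co-occurrences, and \lambda controls the amount of smoothing. Energies of this form can be minimized efficiently by the graph cut algorithms cited above (Boykov et al. 2001; Boykov and Kolmogorov 2004; Kolmogorov and Zabih 2004).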
2.5 Simultaneous Classification of Semantic and Geometric Classes
3 Image Parsing Results
3.1 SIFT Flow Dataset
The first dataset in our experiments, referred to as “SIFT Flow dataset” in the following, comes from Liu et al. (2011a). It is composed of the 2,688 images that have been thoroughly labeled by LabelMe users. Liu et al. (2011a) have split this dataset into 2,488 training images and 200 test images and used synonym correction to obtain 33 semantic labels. We use the same training/test split as Liu et al. (2011a).
As explained in Sect. 2.5, our system labels each superpixel with a semantic class (the original 33 labels) and a geometric class of sky, horizontal, or vertical. Because the number of geometric classes is small and fixed, we have trained a boosted decision tree (BDT) classifier as in Hoiem et al. (2007) to distinguish between them. We use a tree depth of 8 and train 100 trees for each class. This classifier outputs a likelihood ratio score comparable to the one produced by our nonparametric scheme (Eq. (1)), but achieves about 2 % higher accuracy on geometric classification (Sect. 3.3 will present a detailed comparison of nearest-neighbor classifiers and BDTs). Apart from this, local and MRF classification for geometric classes proceeds as described in Sects. 2.3 and 2.4, and we also put the semantic and geometric likelihood ratios into a joint contextual classification framework as described in Sect. 2.5.
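As a rough stand-in for this classifier (our actual implementation follows Hoiem et al. 2007), a similar one-vs-all setup can be written with scikit-learn's gradient-boosted trees using the settings quoted above (100 trees of depth 8); everything else in the sketch is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_geometric_classifiers(X, y, classes=("sky", "horizontal", "vertical")):
    """One binary boosted-tree classifier per geometric class, trained one-vs-all.

    X: (N, d) array of superpixel features (Table 1(b)); y: length-N array of
    geometric class names. Settings mirror those quoted above (100 trees of
    depth 8), but this is a scikit-learn stand-in, not our implementation.
    """
    y = np.asarray(y)
    clfs = {}
    for c in classes:
        clf = GradientBoostingClassifier(n_estimators=100, max_depth=8)
        clf.fit(X, y == c)          # binary one-vs-all problem for class c
        clfs[c] = clf
    return clfs

def geometric_scores(clfs, X):
    # decision_function yields a log-odds-style score, comparable in spirit to
    # the likelihood ratio produced by the nonparametric scheme (Eq. (1))
    return {c: clf.decision_function(X) for c, clf in clfs.items()}
```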
[Table 2: Performance on the SIFT Flow dataset for our system and three state-of-the-art approaches. Per-pixel classification rate is listed first, followed by the average per-class rate in parentheses.]
Our final system on the SIFT Flow dataset achieves a classification rate of 77.0 %. Thus, we outperform Liu et al. (2011a), who report a rate of 76.7 % on the same test set with a more complex pixel-wise MRF (without the pixel-wise MRF, their rate is 66.24 %). Liu et al. (2011a) also cite a rate of 82.72 % for the top seven object categories; our corresponding rate is 84.7 %. Table 2 also reports results of two new approaches (Eigen and Fergus 2012; Farabet et al. 2012) that build on and compare to the earlier version of our system (Tighe and Lazebnik 2010). Eigen and Fergus (2012) are able to improve on our average per-class rate, while Farabet et al. (2012) are able to improve on the overall rate through the use of more sophisticated learning techniques.
Sample output of our system on several SIFT Flow test images can be seen in Fig. 13.
3.2 LM+SUN Dataset
Our second dataset (“LM+SUN” in the following) is derived by combining the SUN dataset (Xiao et al. 2010) with a complete download of LabelMe (Russell et al. 2008) as of July 2011. We cull from this dataset any duplicate images and any images from video surveillance (about 10,000), and use manual synonym correction to obtain 232 labels. This results in 45,676 images, of which 21,182 are indoor and 24,494 are outdoor. We split the dataset into 45,176 training images and 500 test images by randomly selecting test images that have at least 90 % of their pixels labeled and at least 3 unique labels (a total of 13,839 images in the dataset meet these criteria). Beyond its sheer size, the inclusion of indoor images makes this dataset very challenging.
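For reference, a minimal sketch of the test-image eligibility criterion described above; the label-map representation and function name are illustrative.

```python
import numpy as np

def eligible_test_image(label_map, unlabeled=0, min_labeled_frac=0.9, min_unique=3):
    """Check whether an image qualifies as a test image.

    label_map: integer array of per-pixel ground-truth labels, with `unlabeled`
    marking pixels that carry no annotation. Test images are then drawn at
    random from the pool of images passing this check.
    """
    labeled = label_map != unlabeled
    if labeled.mean() < min_labeled_frac:                      # at least 90 % of pixels labeled
        return False
    return len(np.unique(label_map[labeled])) >= min_unique    # at least 3 unique labels
```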
Our final system achieves a classification rate of 54.9 % across all scene types (as compared to 77 % for SIFT Flow); the respective rates for outdoor and indoor images are 60.8 % and 32.1 %. Figure 5(b) gives a breakdown of rates for the 50 most common classes, and Fig. 14 shows the output of our system on a few example images. It is clear that indoor scenes are currently much more challenging than outdoor ones, due at least in part to their greater diversity and sparser training data. In fact, we are not aware of any system that can produce accurate dense labelings of indoor scenes; most existing work dealing with indoor scenes tries to leverage specialized geometric information and focuses on only a few target classes of interest. For example, Hedau et al. (2009) infer the “box” of a room and then leverage the geometry of that box to align object detectors to the scene (Hedau and Hoiem 2010). Gupta et al. (2011) infer the possible use of a space rather than directly labeling the objects. There has also been recent interest in indoor parsing with the help of structured light sensors (Silberman and Fergus 2011; Janoch et al. 2011; Lai et al. 2011) as a way to combat the ambiguity present in cluttered indoor scenes. It is clear that our system, which relies on generic appearance-based matching, cannot currently achieve very high levels of performance on indoor imagery. However, our reported numbers can serve as a useful baseline for more advanced future approaches. The challenges of indoor classification will be further studied in Sect. 3.3.
3.3 Detailed System Evaluation
This section presents a detailed evaluation of various components of our system. Unless otherwise noted, the evaluation is conducted on both the SIFT Flow dataset and LM+SUN, no MRF smoothing is done, and only semantic classification rates are reported.
3.3.1 Retrieval Set Selection
[Table: Evaluation of global image features for retrieval set generation (retrieval set size 200). Features compared: Gist (G), Spatial Pyramid (SP), Color Hist. (CH), and the combinations G + SP and G + SP + CH. “Maximum Label Overlap” is the upper bound that we get by selecting retrieval set images that are the most semantically consistent with the query (see text).]
[Table: Effect of retrieval set size on local superpixel labeling, from small retrieval sets up to the entire training set. Note that the entire LM+SUN training set is too large for our hardware to store in memory.]
[Table: Accuracy of local superpixel labeling obtained by restricting the set of possible classes in the test image to different “shortlists” (see text), including the classes present in the retrieval set and the 10 most common classes.]
[Table: Effect of indoor/outdoor separation on the accuracy of local superpixel labeling on LM+SUN. “Local labeling” corresponds to our default system with no separation between outdoor and indoor training images (the numbers are the same as in line 1 of Table 3). “Ground truth” uses the ground truth label for the query image to determine if the retrieval set should consist of indoor or outdoor images, while “Classifier” uses a trained indoor/outdoor SVM classifier (see text).]
Note that performing automatic indoor/outdoor image classification and then using the inferred scene type to constrain the interpretation of the image is conceptually similar to performing a geometric labeling of the image and using the inferred geometric classes of regions to constrain the semantic classes. In both cases we are taking advantage of the high accuracy that can be achieved on relatively easier two- and three-class problems to improve the accuracy on a harder many-class problem.
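As an illustration of this idea (a generic scikit-learn sketch under our own assumptions, not the actual classifier setup evaluated above), the predicted scene type can simply be used to restrict the pool of training images from which the retrieval set is built:

```python
from sklearn.svm import LinearSVC

def train_scene_type_classifier(train_global_feats, train_is_indoor):
    """Binary indoor/outdoor classifier on global image features (hypothetical setup)."""
    return LinearSVC(C=1.0).fit(train_global_feats, train_is_indoor)

def constrained_retrieval_pool(scene_clf, query_feat, train_is_indoor):
    """Keep only training images whose indoor/outdoor type matches the query's prediction."""
    query_indoor = bool(scene_clf.predict(query_feat.reshape(1, -1))[0])
    return [i for i, indoor in enumerate(train_is_indoor) if indoor == query_indoor]
```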
3.3.2 Superpixel Classification
It is also interesting to compare the curves for three versions of our system: local superpixel labeling, separate semantic and geometric MRF, and joint semantic/geometric MRF. Consistent with the results reported in Tables 2 and 3, the separate MRF tends to lower the average per-class rate due to over-smoothing, and the joint MRF then brings it back up. More surprisingly, Fig. 8 reveals that both types of our contextual models have a much greater impact when they are applied on top of a relatively weak local model, i.e., one with fewer features. As we add more local features, the improvements afforded by non-local inference gradually diminish. The important message here is that “local features” and “context” are, to a great degree, interchangeable. For one, many of our features are not truly “local” since they include information from outside the spatial support of the superpixel. But also, it seems that contextual inference can fix relatively few labeling mistakes that cannot just as easily be fixed by a more powerful local model. This is important to keep in mind when critically evaluating contextual models proposed in the literature: a big improvement over a local baseline does not necessarily prove that the proposed form of context is especially powerful; the local features may simply not be strong enough.
[Table: Comparison of our nearest-neighbor classifier to boosted decision trees. While the boosted decision trees consistently perform better on the relatively balanced geometric labels, they have worse per-class rates on semantic labels with heavily skewed label counts.]
3.3.3 Running Time
[Table: The average timing in seconds of the different stages in our system (excluding file I/O), along with the training set size and average number of superpixels, for the SIFT Flow and LM+SUN datasets; the total per-image time excluding feature computation is 4.4 ± 2.3 s on SIFT Flow and 16.6 ± 11.7 s on LM+SUN. While the runtime is significantly longer for the LM+SUN dataset, this is primarily due to the change in image size and not the number of images.]
4 Video Parsing
This section presents the extension of our system to video. Video sequences provide richer information that should be useful for better scene understanding. Intuitively, motion cues can improve object segmentation, and being able to observe the same objects in multiple frames, possibly at different angles or scales, can help us build a better model of the objects' shape and appearance. On the other hand, the large volume of video data makes parsing computationally very challenging.
Previous approaches have tried a variety of strategies for exploiting the cues contained in video data. Brostow et al. (2008), Sturgess et al. (2009), and Zhang et al. (2010) extract 3D structure (sparse point clouds or dense depth maps) from the video sequences and then use the 3D information as a source of additional features for parsing individual frames. Xiao and Quan (2009) run a region-based parsing system on each frame and enforce temporal coherence between regions in adjacent frames as a post-processing step.
We pre-process the video using a spatiotemporal segmentation method (Grundmann et al. 2010) that gives 3D regions or supervoxels that are spatially coherent within each frame (i.e., have roughly uniform color and optical flow) as well as temporally coherent between frames. The hope is that these regions will contain the same object from frame to frame. We then compute local likelihood scores for possible object labels over each supervoxel, and finally, construct a single graph for each video sequence where each node is a supervoxel and edges connect adjacent supervoxels. We perform inference on this graph using the same MRF formulation as in Sect. 2.5. Section 4.1 will give details of our video parsing approach, and Sect. 4.2 will show that this approach significantly improves the performance compared to parsing each frame independently.
4.1 System Description
Finally, we construct an MRF for the entire video sequence where nodes represent supervoxels and edges connect pairs of supervoxels that are spatially adjacent in at least one frame. We define the edge energy term in the same way as in the 2D case, using Eq. (5). We do this for both semantic and geometric classes and solve for them simultaneously using the same joint formulation as in Eq. (6). For the video sequences in our experiments, which range from 1,500 to 4,000 frames, we typically obtain graphs of 10,000 to 30,000 nodes, which are very tractable.
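For illustration, here is a minimal sketch of assembling the edge set of this graph from per-frame supervoxel label maps; the representation (one integer id image per frame) and the function name are assumptions rather than our actual data structures.

```python
import numpy as np

def supervoxel_adjacency(frames):
    """Pairs of supervoxels that are spatially adjacent (4-connected) in at least one frame.

    frames: list of 2-D integer arrays, one per video frame, where each pixel
    holds the id of the supervoxel it belongs to.
    """
    edges = set()
    for labels in frames:
        # horizontally and vertically adjacent pixel pairs with different supervoxel ids
        for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
            diff = a != b
            pairs = np.stack([a[diff], b[diff]], axis=1)
            for i, j in np.unique(np.sort(pairs, axis=1), axis=0):
                edges.add((int(i), int(j)))
    return edges
```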
We test our video parsing approach on the standard CamVid dataset (Brostow et al. 2008), which consists of daytime and dusk videos taken from a car driving through Cambridge, England. There are a total of five video sequences. We follow the training/test split of Brostow et al. (2008), with two daytime and one dusk sequence used for training, and one daytime and one dusk sequence used for testing. The sequences are densely labeled at one frame per second with 11 class labels: Building, Tree, Sky, Car, Sign-Symbol, Road, Pedestrian, Fence, Column-Pole, Sidewalk, and Bicyclist. There are a total of 701 labeled frames in the dataset, with 468 used for training and 233 for testing. Note that while we evaluate the accuracy of our output on only the labeled test frames, we obtain dense labels for all frames in the test videos.
[Table 10: CamVid dataset results. (a) Still image segmentation baseline. (b) Results with spatiotemporal segmentation (see text). (c) Competing state-of-the-art approaches: Brostow et al. (2008), Sturgess et al. (2009), Zhang et al. (2010), and Ladicky et al. (2010). As before, per-pixel classification rate is followed by the average per-class rate in parentheses.]
The sixth and seventh rows of Table 10(b) show the performance of the temporally coherent setup following contextual MRF smoothing and joint semantic/geometric inference. Somewhat disappointingly, both versions of the MRF give only a minimal improvement. This is likely due to a number of factors. First, MRF energy minimization on the spatiotemporal graph appears to be a harder problem, and the solutions tend to show a much greater tendency to oversmooth. Second, we gain a big improvement in object boundaries by incorporating motion cues into the segmentation, and this likely diminishes the subsequent power of the MRF. Recall that in Sect. 3.3 we observed a similar effect: as we made the local appearance model more powerful by adding features, the improvement afforded by the MRF diminished (Fig. 8). Finally, joint semantic/geometric inference introduces very few new constraints, since the CamVid dataset has only three non-vertical classes (sky, road, and sidewalk).
[Table: Per-class performance on the CamVid (Brostow et al. 2008) dataset for still image parsing (joint sem./geom.), spatiotemporal parsing (joint sem./geom.), and the methods of Brostow et al. (2008), Sturgess et al. (2009), Zhang et al. (2010), and Ladicky et al. (2010).]
While we do not outperform current state-of-the-art methods on the CamVid dataset, our results are encouraging because our system is the simplest and most scalable. Note that we use the motion information in video only to improve our segmentation, not to change our features. By contrast, Brostow et al. (2008), Sturgess et al. (2009), and Zhang et al. (2010) use features derived from 3D point clouds or depth maps, while Ladicky et al. (2010) incorporate sliding window object detectors. Overall, our experiments on video confirm the flexibility and broad applicability of our image parsing framework, and give us additional insights into its strengths and weaknesses that complement our findings on still image datasets.
This paper has presented a superpixel-based approach to image parsing that can take advantage of datasets consisting of tens of thousands of images annotated with hundreds of labels. Our underlying feature representation, based on multiple appearance descriptors computed over segmentation regions, is simple and allows new features to be easily incorporated. We also use efficient MRF optimization to capture label co-occurrence context, and to jointly label regions with semantic and geometric classes.
We have demonstrated state-of-the-art results on the SIFT Flow and LM+SUN datasets with a nonparametric version of our system based on a two-stage approach (global retrieval set matching followed by superpixel matching). This framework does not need any training, except for computation of basic statistics such as label co-occurrence probabilities, and it relies on just a few constants that are kept fixed for all datasets. In principle, it is applicable to “open universe” datasets where the set of training examples and target classes may evolve over time. In particular, our results on the LM+SUN dataset, which has 45,676 images and 232 labels, constitute an important baseline for future approaches. To our knowledge, it is currently the largest dense per-pixel image parsing dataset and, unlike most other general-purpose image parsing benchmarks, it includes both outdoor and indoor images. As we have shown, the latter pose severe recognition challenges, and deserve more study in the future.
Besides the nonparametric “open universe” regime, our system has the flexibility to operate with offline pre-trained classifiers, such as boosted decision trees. The use of these may be preferable for static datasets with smaller numbers of classes and a more balanced class distribution (see the conference version of this paper (Tighe and Lazebnik 2010) for additional results on the small-scale Geometric Context (Hoiem et al. 2007) and Stanford Background (Gould et al. 2009) datasets).
Finally, we have demonstrated an extension of our system to video. This extension segments the video into spatiotemporal “supervoxels” and uses a simple heuristic to combine local appearance cues across frames. The resulting approach does not exploit all the motion information that is potentially available in video (in particular, it does not attempt to extract 3D geometry), but it still affords a big improvement over incoherent frame-by-frame parsing.
Through the extensive analysis of Sect. 3.3, we have identified two major limitations of our system. First, the scene matching step for obtaining the retrieval set suffers from an inability of low-level global features such as GIST to retrieve semantically similar scenes, resulting in incoherent interpretations (e.g., indoor and outdoor class labels mixed together). We plan to investigate supervised feature learning methods for improving the semantic consistency of retrieval sets. Second, our reliance on bottom-up segmentation really hurts our performance on “thing” classes. Traditionally, such classes are handled using sliding window detectors, and there exists work (e.g., Ladicky et al. 2010) attempting to incorporate such detectors into region-based parsing. We are interested in exploring the idea of per-exemplar detectors (Malisiewicz and Efros 2011) to complement our superpixel-based approach in a manner that still allows for lazy learning in the “open universe” mode.
We set K = 200 and σ = 0.8.
Note that our original system (Tighe and Lazebnik 2010) did not use the sigmoid nonlinearity, but in our subsequent work (Tighe and Lazebnik 2011) we found it necessary to successfully perform more complex multi-level inference. We have also found that the sigmoid is a good way of making the output of the nonparametric classifier comparable to that of other classifiers, for example, boosted decision trees (see Sect. 3.1).
Since the videos were taken from a forward-moving camera, we have found the segmentation results to be better if we run the videos through the system backwards.
This research was supported in part by NSF grants IIS-0845629 and IIS-0916829, DARPA Computer Science Study Group, Microsoft Research Faculty Fellowship, and Xerox.
- Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In Proceedings European conference computer vision (pp. 1–15).
- Divvala, S., Hoiem, D., Hays, J., Efros, A., & Hebert, M. (2009). An empirical study of context in object detection. In Proceedings IEEE conference computer vision and pattern recognition (pp. 1271–1278).
- Eigen, D., & Fergus, R. (2012). Nonparametric image parsing using adaptive neighbor sets. In Proceedings IEEE conference computer vision and pattern recognition.
- Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2012). Scene parsing with multiscale feature learning, purity trees, and optimal covers. arXiv preprint.
- Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Proceedings IEEE conference computer vision and pattern recognition.
- Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2), 167–181.
- Galleguillos, C., McFee, B., Belongie, S., & Lanckriet, G. (2010). Multi-class object localization by combining local contextual interactions. In Proceedings IEEE conference computer vision and pattern recognition.
- Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In Proceedings IEEE international conference computer vision.
- Grundmann, M., Kwatra, V., Han, M., & Essa, I. (2010). Efficient hierarchical graph-based video segmentation. In Proceedings IEEE conference computer vision and pattern recognition.
- Gu, C., Lim, J. J., Arbeláez, P., & Malik, J. (2009). Recognition using regions. In Proceedings IEEE conference computer vision and pattern recognition.
- Gupta, A., Satkin, S., Efros, A. A., & Hebert, M. (2011). From 3D scene geometry to human workspace. In Proceedings IEEE conference computer vision and pattern recognition.
- Hays, J., & Efros, A. A. (2008). IM2GPS: estimating geographic information from a single image. In Proceedings IEEE conference computer vision and pattern recognition.
- He, X., Zemel, R. S., & Carreira-Perpinan, M. A. (2004). Multiscale conditional random fields for image labeling. In Proceedings IEEE conference computer vision and pattern recognition.
- Hedau, V., & Hoiem, D. (2010). Thinking inside the box: using appearance models and context based on room geometry. In Proceedings European conference computer vision (pp. 1–14).
- Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In Proceedings IEEE international conference computer vision.
- Heitz, G., & Koller, D. (2008). Learning spatial context: using stuff to find things. In Proceedings European conference computer vision (pp. 1–14).
- Hoiem, D., Efros, A. A., & Hebert, M. (2007). Recovering surface layout from an image. International Journal of Computer Vision, 75(1).
- Janoch, A., Karayev, S., Jia, Y., Barron, J. T., Fritz, M., Saenko, K., & Darrell, T. (2011). A category-level 3-D object dataset: putting the Kinect to work. In ICCV workshop.
- Ladicky, L., Sturgess, P., Alahari, K., Russell, C., & Torr, P. H. S. (2010). What, where and how many? Combining object detectors and CRFs. In Proceedings European conference computer vision (pp. 424–437).
- Lai, K., Bo, L., Ren, X., & Fox, D. (2011). A scalable tree-based approach for joint object and pose recognition. In AAAI conference on artificial intelligence.
- Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In Proceedings IEEE conference computer vision and pattern recognition (Vol. 2).
- Malisiewicz, T., & Efros, A. A. (2008). Recognition by association via learning per-exemplar distances. In Proceedings IEEE conference computer vision and pattern recognition (pp. 1–8).
- Malisiewicz, T., & Efros, A. A. (2011). Ensemble of exemplar-SVMs for object detection and beyond. In Proceedings IEEE international conference computer vision (pp. 89–96).
- Nowozin, S., Rother, C., Bagon, S., Sharp, T., Yao, B., & Kohli, P. (2011). Decision tree fields. In Proceedings IEEE international conference computer vision (pp. 1668–1675).
- Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., & Belongie, S. (2007). Objects in context. In Proceedings IEEE international conference computer vision (pp. 1–8).
- Ren, X., & Malik, J. (2003). Learning a classification model for segmentation. In Proceedings IEEE international conference computer vision.
- Russell, B. C., Torralba, A., Liu, C., Fergus, R., & Freeman, W. T. (2007). Object recognition by scene alignment. In Advances in neural information processing systems.
- Shotton, J., Johnson, M., & Cipolla, R. (2008). Semantic texton forests for image categorization and segmentation. In Proceedings IEEE conference computer vision and pattern recognition.
- Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings European conference computer vision (pp. 1–14).
- Silberman, N., & Fergus, R. (2011). Indoor scene segmentation using a structured light sensor. In Proceedings IEEE international conference computer vision workshop.
- Socher, R., Lin, C. C. Y., Ng, A. Y., & Manning, C. D. (2011). Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the international conference on machine learning.
- Sturgess, P., Alahari, K., Ladicky, L., & Torr, P. H. S. (2009). Combining appearance and structure from motion features for road scene understanding. In British machine vision conference (pp. 1–11).
- Tighe, J., & Lazebnik, S. (2010). SuperParsing: scalable nonparametric image parsing with superpixels. In Proceedings European conference computer vision.
- Tighe, J., & Lazebnik, S. (2011). Understanding scenes on many levels. In Proceedings IEEE international conference computer vision (pp. 335–342).
- Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: large-scale scene recognition from abbey to zoo. In Proceedings IEEE conference computer vision and pattern recognition (pp. 3485–3492).
- Xiao, J., & Quan, L. (2009). Multiple view semantic segmentation for street view images. In Proceedings IEEE international conference computer vision.
- Xu, C., & Corso, J. J. (2012). Evaluation of super-voxel methods for early video processing. In Proceedings IEEE conference computer vision and pattern recognition.
- Zhang, C., Wang, L., & Yang, R. (2010). Semantic segmentation of urban scenes using dense depth maps. In Proceedings European conference computer vision (pp. 708–721).