Abstract
This paper presents a simple and effective nonparametric approach to the problem of image parsing, or labeling image regions (in our case, superpixels produced by bottom-up segmentation) with their categories. This approach is based on lazy learning, and it can easily scale to datasets with tens of thousands of images and hundreds of labels. Given a test image, it first performs global scene-level matching against the training set, followed by superpixel-level matching and efficient Markov random field (MRF) optimization for incorporating neighborhood context. Our MRF setup can also compute a simultaneous labeling of image regions into semantic classes (e.g., tree, building, car) and geometric classes (sky, vertical, ground). Our system outperforms the state-of-the-art nonparametric method based on SIFT Flow on a dataset of 2,688 images and 33 labels. In addition, we report per-pixel rates on a larger dataset of 45,676 images and 232 labels. To our knowledge, this is the first complete evaluation of image parsing on a dataset of this size, and it establishes a new benchmark for the problem. Finally, we present an extension of our method to video sequences and report results on a video dataset with frames densely labeled at 1 Hz.
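The pipeline summarized above (scene-level retrieval, superpixel-level matching, MRF smoothing) can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: the real system uses global features such as GIST and spatial pyramids, many superpixel features, and graph-cut MRF inference, whereas here the features are generic vectors, the scoring is a plain nearest-neighbor vote, and the contextual smoothing is a simple ICM-style stand-in.

```python
import numpy as np

def retrieval_set(query_feat, train_feats, k):
    """Scene-level matching: indices of the k training images closest to the query."""
    d = np.linalg.norm(train_feats - query_feat, axis=1)
    return np.argsort(d)[:k]

def superpixel_scores(sp_feats, ret_feats, ret_labels, n_labels, knn=3):
    """Superpixel-level matching: each test superpixel collects label votes
    from its nearest neighbors among the retrieval set's superpixels."""
    scores = np.zeros((len(sp_feats), n_labels))
    for i, f in enumerate(sp_feats):
        d = np.linalg.norm(ret_feats - f, axis=1)
        for j in np.argsort(d)[:knn]:
            scores[i, ret_labels[j]] += 1.0
    return scores

def smooth_labels(scores, edges, lam=0.5, iters=5):
    """Toy ICM stand-in for the MRF: pick each superpixel's label to trade off
    its own data score against disagreement with adjacent superpixels."""
    labels = scores.argmax(axis=1)
    n_labels = scores.shape[1]
    for _ in range(iters):
        for i in range(len(labels)):
            cost = -scores[i].copy()
            for a, b in edges:
                if a == i:
                    cost += lam * (np.arange(n_labels) != labels[b])
                elif b == i:
                    cost += lam * (np.arange(n_labels) != labels[a])
            labels[i] = cost.argmin()
    return labels
```

In this sketch a superpixel with a weak, isolated label vote gets pulled toward the label of its neighbors, which is the qualitative effect of the neighborhood-context term in the paper's MRF.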
Notes
We set K = 200 and σ = 0.8.
Note that our original system (Tighe and Lazebnik 2010) did not use the sigmoid nonlinearity, but in our subsequent work (Tighe and Lazebnik 2011) we found it necessary in order to successfully perform more complex multi-level inference. We have also found that the sigmoid is a good way of making the output of the nonparametric classifier comparable to that of other classifiers, such as boosted decision trees (see Sect. 3.1).
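The calibration role of the sigmoid mentioned in this note can be illustrated as follows; the raw scores below are made-up log-likelihood-ratio values, chosen only to show how the unbounded nonparametric scores are squashed into a (0, 1) range comparable to probabilistic outputs of other classifiers.

```python
import numpy as np

def sigmoid(x):
    """Map an unbounded score (e.g., a log-likelihood ratio) to a (0, 1) value."""
    return 1.0 / (1.0 + np.exp(-x))

# Raw nonparametric scores live on an unbounded scale, which makes them hard
# to combine with other classifiers' outputs; after the sigmoid they lie in
# (0, 1) while preserving their ordering.
raw = np.array([-3.0, 0.0, 2.5])
calibrated = sigmoid(raw)
```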
Since the videos were taken from a forward-moving camera, we have found the segmentation results to be better if we run the videos through the system backwards.
References
Boykov, Y., & Kolmogorov, V. (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1124–1137.
Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11), 1222–1239.
Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In Proceedings European conference computer vision (pp. 1–15).
Divvala, S., Hoiem, D., Hays, J., Efros, A., & Hebert, M. (2009). An empirical study of context in object detection. In Proceedings IEEE conference computer vision and pattern recognition (pp. 1271–1278).
Eigen, D., & Fergus, R. (2012). Nonparametric image parsing using adaptive neighbor sets. In Proceedings IEEE conference computer vision and pattern recognition.
Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2012). Scene parsing with multiscale feature learning, purity trees, and optimal covers. arXiv preprint.
Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Proceedings IEEE conference computer vision and pattern recognition.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2), 167–181.
Galleguillos, C., & Belongie, S. (2010). Context based object categorization: a critical survey. Computer Vision and Image Understanding, 114(6), 712–722.
Galleguillos, C., Mcfee, B., Belongie, S., & Lanckriet, G. (2010). Multi-class object localization by combining local contextual interactions. In Proceedings IEEE conference computer vision and pattern recognition.
Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In Proceedings IEEE international conference computer vision.
Grundmann, M., Kwatra, V., Han, M., & Essa, I. (2010). Efficient hierarchical graph-based video segmentation. In Proceedings IEEE conference computer vision and pattern recognition.
Gu, C., Lim, J. J., Arbeláez, P., & Malik, J. (2009). Recognition using regions. In Proceedings IEEE conference computer vision and pattern recognition.
Gupta, A., Satkin, S., Efros, A. A., & Hebert, M. (2011). From 3D scene geometry to human workspace. In Proceedings IEEE conference computer vision and pattern recognition.
Hays, J., & Efros, A. A. (2008). IM2GPS: estimating geographic information from a single image. In Proceedings IEEE conference computer vision and pattern recognition.
He, X., Zemel, R. S., & Carreira-Perpinan, M. A. (2004). Multiscale conditional random fields for image labeling. In Proceedings IEEE conference computer vision and pattern recognition.
Hedau, V., & Hoiem, D. (2010). Thinking inside the box: using appearance models and context based on room geometry. In Proceedings European conference computer vision (pp. 1–14).
Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In Proceedings IEEE international conference computer vision.
Heitz, G., & Koller, D. (2008). Learning spatial context: using stuff to find things. In Proceedings European conference computer vision (pp. 1–14).
Hoiem, D., Efros, A. A., & Hebert, M. (2007). Recovering surface layout from an image. International Journal of Computer Vision, 75(1), 151–172.
Janoch, A., Karayev, S., Jia, Y., Barron, J. T., Fritz, M., Saenko, K., & Darrell, T. (2011). A category-level 3-D object dataset: putting the Kinect to work. In ICCV workshop.
Kolmogorov, V., & Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 147–159.
Ladicky, L., Sturgess, P., Alahari, K., Russell, C., & Torr, P. H. S. (2010). What, where and how many? Combining object detectors and CRFs. In Proceedings European conference computer vision (pp. 424–437).
Lai, K., Bo, L., Ren, X., & Fox, D. (2011). A scalable tree-based approach for joint object and pose recognition. In AAAI conference on artificial intelligence.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In Proceedings IEEE conference computer vision and pattern recognition (Vol. 2).
Liu, C., Yuen, J., & Torralba, A. (2011a). Nonparametric scene parsing via label transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12), 2368–2382.
Liu, C., Yuen, J., & Torralba, A. (2011b). SIFT flow: dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 978–994.
Malisiewicz, T., & Efros, A. A. (2008). Recognition by association via learning per-exemplar distances. In Proceedings IEEE conference computer vision and pattern recognition (pp. 1–8).
Malisiewicz, T., & Efros, A. A. (2011). Ensemble of exemplar-SVMs for object detection and beyond. In Proceedings IEEE international conference computer vision (pp. 89–96).
Nowozin, S., Rother, C., Bagon, S., Sharp, T., Yao, B., & Kohli, P. (2011). Decision tree fields. In Proceedings IEEE international conference computer vision (pp. 1668–1675).
Oliva, A., & Torralba, A. (2006). Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research, 155, 23–36.
Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., & Belongie, S. (2007). Objects in context. In Proceedings IEEE international conference computer vision (pp. 1–8).
Ren, X., & Malik, J. (2003). Learning a classification model for segmentation. In Proceedings IEEE international conference computer vision.
Russell, B. C., Torralba, A., Liu, C., Fergus, R., & Freeman, W. T. (2007). Object recognition by scene alignment. In Advances in neural information processing systems.
Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3), 157–173.
Shotton, J., Johnson, M., & Cipolla, R. (2008). Semantic texton forests for image categorization and segmentation. In Proceedings IEEE conference computer vision and pattern recognition.
Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings European conference computer vision (pp. 1–14).
Silberman, N., & Fergus, R. (2011). Indoor scene segmentation using a structured light sensor. In Proceedings IEEE international conference computer vision workshop.
Socher, R., Lin, C. C. Y., Ng, A. Y., & Manning, C. D. (2011). Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the international conference on machine learning.
Sturgess, P., Alahari, K., Ladicky, L., & Torr, P. H. S. (2009). Combining appearance and structure from motion features for road scene understanding. In British machine vision conference (pp. 1–11).
Tighe, J., & Lazebnik, S. (2010). SuperParsing: scalable nonparametric image parsing with superpixels. In Proceedings European conference computer vision.
Tighe, J., & Lazebnik, S. (2011). Understanding scenes on many levels. In Proceedings IEEE international conference computer vision (pp. 335–342).
Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings IEEE conference computer vision and pattern recognition (pp. 3485–3492).
Xiao, J., & Quan, L. (2009). Multiple view semantic segmentation for street view images. In Proceedings IEEE international conference computer vision.
Xu, C., & Corso, J. J. (2012). Evaluation of super-voxel methods for early video processing. In Proceedings IEEE conference computer vision and pattern recognition.
Zhang, C., Wang, L., & Yang, R. (2010). Semantic segmentation of urban scenes using dense depth maps. In Proceedings European conference computer vision (pp. 708–721).
Acknowledgements
This research was supported in part by NSF grants IIS-0845629 and IIS-0916829, DARPA Computer Science Study Group, Microsoft Research Faculty Fellowship, and Xerox.
Cite this article
Tighe, J., Lazebnik, S. Superparsing. Int J Comput Vis 101, 329–349 (2013). https://doi.org/10.1007/s11263-012-0574-z