International Journal of Computer Vision, Volume 101, Issue 2, pp 329–349

Superparsing

Scalable Nonparametric Image Parsing with Superpixels

Abstract

This paper presents a simple and effective nonparametric approach to the problem of image parsing, or labeling image regions (in our case, superpixels produced by bottom-up segmentation) with their categories. This approach is based on lazy learning, and it can easily scale to datasets with tens of thousands of images and hundreds of labels. Given a test image, it first performs global scene-level matching against the training set, followed by superpixel-level matching and efficient Markov random field (MRF) optimization for incorporating neighborhood context. Our MRF setup can also compute a simultaneous labeling of image regions into semantic classes (e.g., tree, building, car) and geometric classes (sky, vertical, ground). Our system outperforms the state-of-the-art nonparametric method based on SIFT Flow on a dataset of 2,688 images and 33 labels. In addition, we report per-pixel rates on a larger dataset of 45,676 images and 232 labels. To our knowledge, this is the first complete evaluation of image parsing on a dataset of this size, and it establishes a new benchmark for the problem. Finally, we present an extension of our method to video sequences and report results on a video dataset with frames densely labeled at 1 Hz.
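To make the pipeline described above concrete, the sketch below walks through the three stages in order: scene-level retrieval of similar training images, nonparametric superpixel-level matching, and contextual smoothing. It is illustrative only, not the authors' implementation: the random arrays stand in for the global (e.g., GIST, spatial pyramid) and superpixel descriptors used in the paper, every function and parameter name is hypothetical, and a simple iterated-conditional-modes pass replaces the alpha-expansion graph-cut MRF optimization used in the actual system.

```python
# Minimal sketch of a lazy-learning image-parsing pipeline, assuming
# precomputed global and superpixel descriptors. All names and data are
# hypothetical; ICM below is a stand-in for graph-cut MRF inference.
import numpy as np

def retrieval_set(global_query, global_train, k=200):
    """Scene-level matching: keep the k training images whose global
    descriptors are closest to the query (Euclidean distance)."""
    d = np.linalg.norm(global_train - global_query, axis=1)
    return np.argsort(d)[:k]

def superpixel_log_ratios(sp_feats, ret_feats, ret_labels, n_classes, knn=20, eps=1.0):
    """Superpixel-level matching: for each query superpixel, find its
    nearest superpixels inside the retrieval set and score each class by
    a smoothed log-likelihood ratio of class frequencies."""
    prior = np.bincount(ret_labels, minlength=n_classes) + eps
    prior = prior / prior.sum()
    scores = np.zeros((len(sp_feats), n_classes))
    for i, f in enumerate(sp_feats):
        d = np.linalg.norm(ret_feats - f, axis=1)
        nn = ret_labels[np.argsort(d)[:knn]]
        freq = np.bincount(nn, minlength=n_classes) + eps
        freq = freq / freq.sum()
        scores[i] = np.log(freq / prior)  # >0 favors the class, <0 penalizes it
    return scores

def smooth_labels(scores, edges, smooth=1.0, iters=10):
    """Contextual smoothing: a simple ICM pass over an energy with a unary
    term (negated score) and a Potts penalty for disagreeing neighbors.
    `edges` lists pairs of adjacent superpixels."""
    labels = scores.argmax(axis=1)
    nbrs = [[] for _ in range(len(scores))]
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for _ in range(iters):
        for i in range(len(scores)):
            pairwise = np.array(
                [sum(labels[j] != c for j in nbrs[i]) for c in range(scores.shape[1])])
            cost = -scores[i] + smooth * pairwise
            labels[i] = cost.argmin()
    return labels

# Toy usage with random data standing in for real features.
rng = np.random.default_rng(0)
g_train, g_query = rng.normal(size=(1000, 512)), rng.normal(size=512)
ret = retrieval_set(g_query, g_train, k=200)  # indices of retrieval-set images
# Superpixel features/labels would be pooled from the `ret` images; faked here.
ret_feats, ret_labels = rng.normal(size=(5000, 64)), rng.integers(0, 33, 5000)
sp_feats = rng.normal(size=(120, 64))
scores = superpixel_log_ratios(sp_feats, ret_feats, ret_labels, n_classes=33)
edges = [(i, i + 1) for i in range(119)]  # hypothetical superpixel adjacency
print(smooth_labels(scores, edges, smooth=0.5)[:10])
```

In the actual system the ratio scores and the smoothing term define an MRF energy minimized with alpha-expansion; the ICM loop above only illustrates how the unary and pairwise terms interact.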

Keywords

Scene understanding · Image parsing · Image segmentation

Acknowledgements

This research was supported in part by NSF grants IIS-0845629 and IIS-0916829, DARPA Computer Science Study Group, Microsoft Research Faculty Fellowship, and Xerox.

References

  1. Boykov, Y., & Kolmogorov, V. (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1124–1137.
  2. Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11), 1222–1239.
  3. Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In Proceedings European conference computer vision (pp. 1–15).
  4. Divvala, S., Hoiem, D., Hays, J., Efros, A., & Hebert, M. (2009). An empirical study of context in object detection. In Proceedings IEEE conference computer vision and pattern recognition (pp. 1271–1278).
  5. Eigen, D., & Fergus, R. (2012). Nonparametric image parsing using adaptive neighbor sets. In Proceedings IEEE conference computer vision and pattern recognition.
  6. Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2012). Scene parsing with multiscale feature learning, purity trees, and optimal covers. arXiv preprint.
  7. Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Proceedings IEEE conference computer vision and pattern recognition.
  8. Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2), 167–181.
  9. Galleguillos, C., & Belongie, S. (2010). Context based object categorization: a critical survey. Computer Vision and Image Understanding, 114(6), 712–722.
  10. Galleguillos, C., McFee, B., Belongie, S., & Lanckriet, G. (2010). Multi-class object localization by combining local contextual interactions. In Proceedings IEEE conference computer vision and pattern recognition.
  11. Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In Proceedings IEEE international conference computer vision.
  12. Grundmann, M., Kwatra, V., Han, M., & Essa, I. (2010). Efficient hierarchical graph-based video segmentation. In Proceedings IEEE conference computer vision and pattern recognition.
  13. Gu, C., Lim, J. J., Arbeláez, P., & Malik, J. (2009). Recognition using regions. In Proceedings IEEE conference computer vision and pattern recognition.
  14. Gupta, A., Satkin, S., Efros, A. A., & Hebert, M. (2011). From 3D scene geometry to human workspace. In Proceedings IEEE conference computer vision and pattern recognition.
  15. Hays, J., & Efros, A. A. (2008). IM2GPS: estimating geographic information from a single image. In Proceedings IEEE conference computer vision and pattern recognition.
  16. He, X., Zemel, R. S., & Carreira-Perpinan, M. A. (2004). Multiscale conditional random fields for image labeling. In Proceedings IEEE conference computer vision and pattern recognition.
  17. Hedau, V., Hoiem, D., & Forsyth, D. (2010). Thinking inside the box: using appearance models and context based on room geometry. In Proceedings European conference computer vision (pp. 1–14).
  18. Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In Proceedings IEEE international conference computer vision.
  19. Heitz, G., & Koller, D. (2008). Learning spatial context: using stuff to find things. In Proceedings European conference computer vision (pp. 1–14).
  20. Hoiem, D., Efros, A. A., & Hebert, M. (2007). Recovering surface layout from an image. International Journal of Computer Vision, 75(1), 151–172.
  21. Janoch, A., Karayev, S., Jia, Y., Barron, J. T., Fritz, M., Saenko, K., & Darrell, T. (2011). A category-level 3-D object dataset: putting the Kinect to work. In Proceedings IEEE international conference computer vision workshops.
  22. Kolmogorov, V., & Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 147–159.
  23. Ladicky, L., Sturgess, P., Alahari, K., Russell, C., & Torr, P. H. S. (2010). What, where and how many? Combining object detectors and CRFs. In Proceedings European conference computer vision (pp. 424–437).
  24. Lai, K., Bo, L., Ren, X., & Fox, D. (2011). A scalable tree-based approach for joint object and pose recognition. In Proceedings AAAI conference on artificial intelligence.
  25. Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In Proceedings IEEE conference computer vision and pattern recognition (Vol. 2).
  26. Liu, C., Yuen, J., & Torralba, A. (2011a). Nonparametric scene parsing via label transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12), 2368–2382.
  27. Liu, C., Yuen, J., & Torralba, A. (2011b). SIFT flow: dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 978–994.
  28. Malisiewicz, T., & Efros, A. A. (2008). Recognition by association via learning per-exemplar distances. In Proceedings IEEE conference computer vision and pattern recognition (pp. 1–8).
  29. Malisiewicz, T., & Efros, A. A. (2011). Ensemble of exemplar-SVMs for object detection and beyond. In Proceedings IEEE international conference computer vision (pp. 89–96).
  30. Nowozin, S., Rother, C., Bagon, S., Sharp, T., Yao, B., & Kohli, P. (2011). Decision tree fields. In Proceedings IEEE international conference computer vision (pp. 1668–1675).
  31. Oliva, A., & Torralba, A. (2006). Building the gist of a scene: the role of global image features in recognition. Progress in Brain Research, 155, 23–36.
  32. Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., & Belongie, S. (2007). Objects in context. In Proceedings IEEE international conference computer vision (pp. 1–8).
  33. Ren, X., & Malik, J. (2003). Learning a classification model for segmentation. In Proceedings IEEE international conference computer vision.
  34. Russell, B. C., Torralba, A., Liu, C., Fergus, R., & Freeman, W. T. (2007). Object recognition by scene alignment. In Advances in neural information processing systems.
  35. Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3), 157–173.
  36. Shotton, J., Johnson, M., & Cipolla, R. (2008). Semantic texton forests for image categorization and segmentation. In Proceedings IEEE conference computer vision and pattern recognition.
  37. Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings European conference computer vision (pp. 1–14).
  38. Silberman, N., & Fergus, R. (2011). Indoor scene segmentation using a structured light sensor. In Proceedings IEEE international conference computer vision workshops.
  39. Socher, R., Lin, C. C. Y., Ng, A. Y., & Manning, C. D. (2011). Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the international conference on machine learning.
  40. Sturgess, P., Alahari, K., Ladicky, L., & Torr, P. H. S. (2009). Combining appearance and structure from motion features for road scene understanding. In British machine vision conference (pp. 1–11).
  41. Tighe, J., & Lazebnik, S. (2010). SuperParsing: scalable nonparametric image parsing with superpixels. In Proceedings European conference computer vision.
  42. Tighe, J., & Lazebnik, S. (2011). Understanding scenes on many levels. In Proceedings IEEE international conference computer vision (pp. 335–342).
  43. Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.
  44. Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: large-scale scene recognition from abbey to zoo. In Proceedings IEEE conference computer vision and pattern recognition (pp. 3485–3492).
  45. Xiao, J., & Quan, L. (2009). Multiple view semantic segmentation for street view images. In Proceedings IEEE international conference computer vision.
  46. Xu, C., & Corso, J. J. (2012). Evaluation of super-voxel methods for early video processing. In Proceedings IEEE conference computer vision and pattern recognition.
  47. Zhang, C., Wang, L., & Yang, R. (2010). Semantic segmentation of urban scenes using dense depth maps. In Proceedings European conference computer vision (pp. 708–721).

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Computer Science Department, University of North Carolina, Chapel Hill, USA
