Scaling Up Multi-domain Semantic Segmentation with Sentence Embeddings

Published in: International Journal of Computer Vision

Abstract

State-of-the-art semantic segmentation methods achieve impressive performance on predefined, closed-set individual datasets, but they generalize poorly to zero-shot domains and unseen categories. Because labeling a large-scale dataset is challenging and expensive, training a robust semantic segmentation model across multiple domains has drawn much attention. However, inconsistent taxonomies hinder the naive merging of current publicly available annotations. To address this, we propose a simple solution for scaling up multi-domain semantic segmentation datasets with less human effort: instead of manually tuning a consistent label space, we replace each class label with a sentence embedding, a vector-valued embedding of a short paragraph describing the class. This enables merging multiple datasets from different domains, each with varying class labels and semantics. We merged publicly available noisy and weak annotations with the most finely annotated data, over 2 million images in total, which enabled training a model that matches state-of-the-art supervised methods on 7 benchmark datasets despite not using any images from them. By fine-tuning the model on standard semantic segmentation datasets, we also achieve a significant improvement over state-of-the-art supervised segmentation on NYUD-V2 (Silberman et al., in: European conference on computer vision, Springer, pp. 746–760, 2012) and PASCAL-Context (Everingham et al. in Int J Comput Vis 111(1):98–136, 2015), at \(60\%\) and \(65\%\) mIoU, respectively. Our method can segment unseen labels based on the closeness of language embeddings, showing strong generalization to unseen image domains and labels, and it enables impressive performance improvements in some adaptation applications, such as depth estimation and instance segmentation.
Code is available at https://github.com/YvanYin/SSIW.
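The core idea of the abstract, assigning each pixel to the class whose sentence embedding is closest to the pixel's predicted embedding, can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for real sentence embeddings of class descriptions, and names such as `class_embeds` and `pixel_embeds` are hypothetical.

```python
import numpy as np

# Hypothetical stand-ins: each class label is replaced by a sentence
# embedding of a short description (e.g. "a paved surface on which
# vehicles drive"). Random unit vectors are used here for illustration.
rng = np.random.default_rng(0)
dim = 512
class_names = ["road", "sidewalk", "sky"]
class_embeds = rng.normal(size=(len(class_names), dim))
class_embeds /= np.linalg.norm(class_embeds, axis=1, keepdims=True)

# A segmentation model would output one embedding per pixel; here we
# fake a 4x4 "image" of pixel embeddings.
pixel_embeds = rng.normal(size=(4, 4, dim))
pixel_embeds /= np.linalg.norm(pixel_embeds, axis=2, keepdims=True)

# Each pixel takes the class whose sentence embedding has the highest
# cosine similarity; an unseen class can be added at test time simply
# by embedding a new description and appending a row to class_embeds.
similarity = pixel_embeds @ class_embeds.T   # shape (4, 4, num_classes)
prediction = similarity.argmax(axis=2)       # shape (4, 4), class indices
print(prediction.shape)  # (4, 4)
```

Because classification reduces to nearest-neighbor search in the embedding space, merging datasets with incompatible taxonomies only requires embedding each dataset's class descriptions into the same space.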


References

  • Baek, D., Oh, Y., & Ham, B. (2021). Exploiting a joint embedding space for generalized zero-shot semantic segmentation. In International conference on computer vision (pp. 9536–9545).

  • Benenson, R., Popov, S., & Ferrari, V. (2019). Large-scale interactive object segmentation with human annotators. In IEEE conference on computer vision and pattern recognition.

  • Bevandić, P., Krešo, I., Oršić, M., & Šegvić, S. (2019). Simultaneous semantic segmentation and outlier detection in presence of domain shift. In German conference on pattern recognition (pp. 33–47). Springer.

  • Brostow, G., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In European conference on computer vision (pp. 44–57). Springer.

  • Bucher, M., Vu, T.-H., Cord, M., & Pérez, P. (2019). Zero-shot semantic segmentation. Advances in Neural Information Processing Systems, 32, 468–479.

  • Bujwid, S., & Sullivan, J. (2021). Large-scale zero-shot image classification from rich and diverse textual descriptions. arXiv Computing Research Repository.

  • Butler, D. J., Wulff, J., Stanley, G. B., & Black, M. J. (2012). A naturalistic open source movie for optical flow evaluation. In European conference on computer vision (pp. 611–625).

  • Cao, J., Leng, H., Lischinski, D., Cohen-Or, D., Tu, C., & Li, Y. (2021). Shapeconv: Shape-aware convolutional layer for indoor RGB-D semantic segmentation. arXiv Computing Research Repository. arXiv:2108.10528

  • Chen, W., Qian, S., Fan, D., Kojima, N., Hamilton, M., & Deng, J. (2020). Oasis: A large-scale dataset for single image 3D in the wild. In IEEE conference on computer vision and pattern recognition (pp. 679–688).

  • Chen, X., Girshick, R., He, K., & Dollár, P. (2019). Tensormask: A foundation for dense object segmentation. In International conference on computer vision (pp. 2061–2069).

  • Chen, H., Sun, K., Tian, Z., Shen, C., Huang, Y., & Yan, Y. (2020). BlendMask: Top-down meets bottom-up for instance segmentation. In IEEE conference on computer vision and pattern recognition.

  • Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder–decoder with atrous separable convolution for semantic image segmentation. In European conference on computer vision (pp. 801–818). Springer.

  • Chen, W., Fu, Z., Yang, D., & Deng, J. (2016). Single-image depth perception in the wild. Advances in Neural Information Processing Systems, 29, 730–738.

  • Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1290–1299).

  • Cheng, B., Schwing, A., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34, 17864–17875.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In IEEE conference on computer vision and pattern recognition (pp. 3213–3223).

  • Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). Scannet: Richly-annotated 3D reconstructions of indoor scenes. In IEEE conference on computer vision and pattern recognition (pp. 5828–5839).

  • Ding, J., Xue, N., Xia, G.-S., & Dai, D. (2022). Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11583–11592).

  • Erkent, Ö., & Laugier, C. (2020). Semantic segmentation with unsupervised domain adaptation under varying weather conditions for autonomous vehicles. IEEE Robotics and Automation Letters, 5(2), 3580–3587.

  • Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.

  • Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32, 1231–1237.

  • Gu, Z., Zhou, S., Niu, L., Zhao, Z., & Zhang, L. (2020). Context-aware feature generation for zero-shot semantic segmentation. In ACM international conference on multimedia (pp. 1921–1929).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In International conference on computer vision (pp. 2961–2969).

  • Hua, Y., Kohli, P., Uplavikar, P., Ravi, A., Gunaseelan, S., Orozco, J., & Li, E. (2020). Holopix50k: A large-scale in-the-wild stereo image dataset. In IEEE conference on computer vision and pattern recognition workshop.

  • Huang, Y., Jia, W., He, X., Liu, L., Li, Y., & Tao, D. (2021). Channelized axial attention for semantic segmentation. arXiv Computing Research Repository, 2101.07434.

  • Hu, P., Sclaroff, S., & Saenko, K. (2020). Uncertainty-aware learning for zero-shot semantic segmentation. Advances in Neural Information Processing Systems, 3, 5.

  • Kim, Y., Jung, H., Min, D., & Sohn, K. (2018). Deep monocular depth estimation via integration of global and local predictions. IEEE Transactions on Image Processing, 27(8), 4131–4144.

  • Lambert, J., Liu, Z., Sener, O., Hays, J., & Koltun, V. (2020). MSeg: A composite dataset for multi-domain semantic segmentation. In IEEE conference on computer vision and pattern recognition.

  • Li, B., Weinberger, K. Q., Belongie, S., Koltun, V., & Ranftl, R. (2022). Language-driven semantic segmentation. In International conference on learning representations.

  • Li, Z., & Snavely, N. (2018). Megadepth: Learning single-view depth prediction from internet photos. In IEEE conference on computer vision and pattern recognition (pp. 2041–2050).

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740–755). Springer.

  • Liu, C., Chen, L.-C., Schroff, F., Adam, H., Hua, W., Yuille, A. L., & Fei-Fei, L. (2019). Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In IEEE conference on computer vision and pattern recognition (pp. 82–92).

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In International conference on computer vision.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In IEEE conference on computer vision and pattern recognition (pp. 3431–3440).

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111–3119.

  • Miller, G. A. (1995). Wordnet: A lexical database for English. Communications of the ACM, 38(11), 39–41.

  • Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., Urtasun, R., & Yuille, A. (2014). The role of context for object detection and semantic segmentation in the wild. In IEEE conference on computer vision and pattern recognition (pp. 891–898).

  • Neuhold, G., Ollmann, T., Rota Bulo, S., & Kontschieder, P. (2017). The mapillary vistas dataset for semantic understanding of street scenes. In International conference on computer vision (pp. 4990–4999).

  • Porzi, L., Rota Bulò, S., Colovic, A., & Kontschieder, P. (2019). Seamless scene segmentation. In IEEE conference on computer vision and pattern recognition.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & Krueger, G. (2021). Learning transferable visual models from natural language supervision. arXiv Computing Research Repository, 2103.00020.

  • Ranftl, R., Bochkovskiy, A., & Koltun, V. (2021). Vision transformers for dense prediction. In International conference on computer vision (pp. 12179–12188).

  • Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 1623–1637.

  • Ros, G., Stent, S., Alcantarilla, P. F., & Watanabe, T. (2016). Training constrained deconvolutional networks for road scene semantic segmentation. arXiv Computing Research Repository, 1604.01545.

  • Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., & Sun, J. (2019). Objects365: A large-scale, high-quality dataset for object detection. In International conference on computer vision (pp. 8430–8439).

  • Shi, H., Li, H., Wu, Q., & Song, Z. (2019). Scene parsing via integrated classification model and variance-based regularization. In IEEE conference on computer vision and pattern recognition (pp. 5307–5316).

  • Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In European conference on computer vision (pp. 746–760). Springer.

  • Song, S., Lichtenberg, S. P., & Xiao, J. (2015). Sun RGB-D: A RGB-D scene understanding benchmark suite. In IEEE conference on computer vision and pattern recognition (pp. 567–576).

  • Sun, K., Xiao, B., Liu, D., & Wang, J. (2019). Deep high-resolution representation learning for human pose estimation. In IEEE conference on computer vision and pattern recognition.

  • Tian, Z., Shen, C., & Chen, H. (2020). Conditional convolutions for instance segmentation. In European conference on computer vision. Springer.

  • Valada, A., Mohan, R., & Burgard, W. (2020). Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, 128(5), 1239–1285.

  • Vandenhende, S., Georgoulis, S., & Van Gool, L. (2020). MTI-net: Multi-scale task interaction networks for multi-task learning. In European conference on computer vision (pp. 527–543). Springer.

  • Varma, G., Subramanian, A., Namboodiri, A., Chandraker, M., & Jawahar, C. (2019). IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In IEEE winter conference on applications of computer vision (pp. 1743–1751). IEEE.

  • Vasiljevic, I., Kolkin, N., Zhang, S., Luo, R., Wang, H., Dai, F. Z., Daniele, A. F., Mostajabi, M., Basart, S., Walter, M. R., & Shakhnarovich, G. (2019). DIODE: A Dense Indoor and Outdoor DEpth Dataset. arXiv Computing Research Repository. arXiv:1908.00463

  • Wang, X., Zhang, R., Kong, T., Li, L., & Shen, C. (2020). SOLOv2: Dynamic and fast instance segmentation. Advances in Neural Information Processing Systems, 33, 17721–17732.

  • Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., Li, R., & Luo, Z. (2018). Monocular relative depth perception with web stereo data supervision. In IEEE conference on computer vision and pattern recognition (pp. 311–320).

  • Xian, K., Zhang, J., Wang, O., Mai, L., Lin, Z., & Cao, Z. (2020). Structure-guided ranking loss for single image depth prediction. In IEEE conference on computer vision and pattern recognition (pp. 611–620).

  • Xian, Y., Choudhury, S., He, Y., Schiele, B., & Akata, Z. (2019). Semantic projection network for zero-and few-label semantic segmentation. In IEEE conference on computer vision and pattern recognition (pp. 8256–8265).

  • Xie, E., Sun, P., Song, X., Wang, W., Liu, X., Liang, D., Shen, C., & Luo, P. (2020). Polarmask: Single shot instance segmentation with polar representation. In IEEE conference on computer vision and pattern recognition (pp. 12193–12202).

  • Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., & Luo, P. (2021). SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems.

  • Yang, L., Fan, Y., & Xu, N. (2019a). Video instance segmentation. arXiv Computing Research Repository. arXiv:1905.04804

  • Yang, L., Fan, Y., & Xu, N. (2019b). Video instance segmentation. In International conference on computer vision (pp. 5188–5197).

  • Yin, W., Liu, Y., & Shen, C. (2021). Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 7282–7295.

  • Yin, W., Wang, X., Shen, C., Liu, Y., Tian, Z., Xu, S., Sun, C., & Renyin, D. (2020). Diversedepth: Affine-invariant depth prediction using diverse data. arXiv Computing Research Repository, 2002.00569.

  • Yin, W., Zhang, C., Chen, H., Cai, Z., Yu, G., Wang, K., Chen, X., & Shen, C. (2023). Metric3d: Towards zero-shot metric 3d prediction from a single image. In International conference on computer vision.

  • Yin, W., Zhang, J., Wang, O., Niklaus, S., Mai, L., Chen, S., & Shen, C. (2021). Learning to recover 3d scene shape from a single image. In IEEE conference on computer vision and pattern recognition.

  • Yin, W., Zhang, J., Wang, O., Niklaus, S., Chen, S., Liu, Y., & Shen, C. (2022). Towards accurate reconstruction of 3D scene shape from a single monocular image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 6480–6494.

  • Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., & Darrell, T. (2020). Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In IEEE conference on computer vision and pattern recognition.

  • Yuan, Y., Chen, X., & Wang, J. (2020). Object-contextual representations for semantic segmentation. In European conference on computer vision (pp. 173–190). Springer.

  • Zamir, A., Sax, A., Shen, W., Guibas, L., Malik, J., & Savarese, S. (2018). Taskonomy: Disentangling task transfer learning. In IEEE conference on computer vision and pattern recognition.

  • Zendel, O., Honauer, K., Murschitz, M., Steininger, D., & Dominguez, G. F. (2018). Wilddash-creating hazard-aware benchmarks. In European conference on computer vision (pp. 402–416). Springer.

  • Zhao, H., Puig, X., Zhou, B., Fidler, S., & Torralba, A. (2017). Open vocabulary scene parsing. In International conference on computer vision (pp. 2002–2010).

  • Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In IEEE conference on computer vision and pattern recognition.

  • Zhu, Y., Sapra, K., Reda, F. A., Shih, K. J., Newsam, S., Tao, A., & Catanzaro, B. (2019). Improving semantic segmentation via video propagation and label relaxation. In IEEE conference on computer vision and pattern recognition (pp. 8856–8865).

Acknowledgements

Part of this work was done when Wei Yin was an intern at Amazon.

Author information

Correspondence to Chunhua Shen.

Additional information

Communicated by Takayuki Okatani.

About this article

Cite this article

Yin, W., Liu, Y., Shen, C. et al. Scaling Up Multi-domain Semantic Segmentation with Sentence Embeddings. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02060-4
