SegStereo: Exploiting Semantic Information for Disparity Estimation

  • Guorun Yang
  • Hengshuang Zhao
  • Jianping Shi
  • Zhidong DengEmail author
  • Jiaya Jia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11211)


Disparity estimation for binocular stereo images finds a wide range of applications. Traditional algorithms may fail on featureless regions, which could be handled by high-level clues such as semantic segments. In this paper, we suggest that appropriate incorporation of semantic cues can greatly rectify prediction in commonly-used disparity estimation frameworks. Our method conducts semantic feature embedding and regularizes semantic cues as the loss term to improve learning disparity. Our unified model SegStereo employs semantic features from segmentation and introduces semantic softmax loss, which helps improve the prediction accuracy of disparity maps. The semantic cues work well in both unsupervised and supervised manners. SegStereo achieves state-of-the-art results on KITTI Stereo benchmark and produces decent prediction on both CityScapes and FlyingThings3D datasets.


Disparity estimation Semantic cues Semantic feature embedding Softmax loss regularization 



This work was supported in part by the National Key R&D Program of China under Grant No. 2017YFB1302200 and by Joint Fund of NORINCO Group of China for Advanced Research under Grant No. 6141B010318.

Supplementary material

474212_1_En_39_MOESM1_ESM.pdf (11.7 mb)
Supplementary material 1 (pdf 12028 KB)


  1. 1.
    Alhaija, H.A., Mustikovela, S.K., Mescheder, L., Geiger, A., Rother, C.: Augmented reality meets deep learning for car instance segmentation in urban scenes. In: BMVC (2017)Google Scholar
  2. 2.
    Bai, M., Luo, W., Kundu, K., Urtasun, R.: Exploiting semantic information and deep matching for optical flow. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 154–170. Springer, Cham (2016). Scholar
  3. 3.
    Barron, J.T.: A more general robust loss function. arXiv preprint arXiv:1701.03077 (2017)
  4. 4.
    Behl, A., Jafari, O.H., Mustikovela, S.K., Alhaija, H.A., Rother, C., Geiger, A.: Bounding boxes, segmentations and object coordinates: how important is recognition for 3D scene flow estimation in autonomous driving scenarios? In: ICCV (2017)Google Scholar
  5. 5.
    Brown, M., Hua, G., Winder, S.: Discriminative learning of local image descriptors. TPAMI 33(1), 43–57 (2011)CrossRefGoogle Scholar
  6. 6.
    Chang, J., Chen, Y.: Pyramid stereo matching network. In: CVPR (2018)Google Scholar
  7. 7.
    Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs (2015)Google Scholar
  8. 8.
    Chen, Z., Sun, X., Wang, L., Yu, Y., Huang, C.: A deep visual correspondence embedding model for stereo matching costs. In: ICCV (2015)Google Scholar
  9. 9.
    Cheng, J., Tsai, Y.H., Wang, S., Yang, M.H.: SegFlow: joint learning for video object segmentation and optical flow. In: ICCV (2017)Google Scholar
  10. 10.
    Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)Google Scholar
  11. 11.
    Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: ICCV (2015)Google Scholar
  12. 12.
    Flynn, J., Neulander, I., Philbin, J., Snavely, N.: DeepStereo: learning to predict new views from the world’s imagery. In: CVPR (2016)Google Scholar
  13. 13.
    Garg, R., Vijay Kumar, B.G., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: geometry to the rescue. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 740–756. Springer, Cham (2016). Scholar
  14. 14.
    Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The kitti vision benchmark suite. In: CVPR (2012)Google Scholar
  15. 15.
    Geiger, A., Roser, M., Urtasun, R.: Efficient large-scale stereo matching. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010. LNCS, vol. 6492, pp. 25–38. Springer, Heidelberg (2011). Scholar
  16. 16.
    Gidaris, S., Komodakis, N.: Detect, replace, refine: deep structured prediction for pixel wise labeling. In: CVPR (2017)Google Scholar
  17. 17.
    Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR (2017)Google Scholar
  18. 18.
    Guney, F., Geiger, A.: Displets: resolving stereo ambiguities using object knowledge. In: CVPR (2015)Google Scholar
  19. 19.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  20. 20.
    Heise, P., Jensen, B., Klose, S., Knoll, A.: Fast dense stereo correspondences by binary locality sensitive hashing. In: ICRA (2015)Google Scholar
  21. 21.
    Hirschmuller, H.: Stereo processing by semiglobal matching and mutual information. TPAMI 30(2), 328–341 (2008)CrossRefGoogle Scholar
  22. 22.
    Hirschmuller, H., Scharstein, D.: Evaluation of stereo matching costs on images with radiometric differences. TPAMI 31(9), 1582–1599 (2009)CrossRefGoogle Scholar
  23. 23.
    Yu, J.J., Harley, A.W., Derpanis, K.G.: Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 3–10. Springer, Cham (2016). Scholar
  24. 24.
    Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: ACM MM (2014)Google Scholar
  25. 25.
    Kendall, A., et al.: End-to-end learning of geometry and context for deep stereo regression. In: ICCV (2017)Google Scholar
  26. 26.
    Kong, D., Tao, H.: A method for learning matching errors for stereo computation. In: BMVC (2004)Google Scholar
  27. 27.
    Kuznietsov, Y., Stückler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: CVPR (2017)Google Scholar
  28. 28.
    Liang, Z., et al.: Learning for disparity estimation through feature constancy. In: CVPR (2018)Google Scholar
  29. 29.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)Google Scholar
  30. 30.
    Luo, W., Schwing, A.G., Urtasun, R.: Efficient deep learning for stereo matching. In: CVPR (2016)Google Scholar
  31. 31.
    Mayer, N., et al.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: CVPR (2016)Google Scholar
  32. 32.
    Meister, S., Hur, J., Roth, S.: UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In: AAAI (2018)Google Scholar
  33. 33.
    Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: CVPR (2015)Google Scholar
  34. 34.
    Pang, J., Sun, W., Ren, J., Yang, C., Yan, Q.: Cascade residual learning: a two-stage convolutional neural network for stereo matching. In: ICCV Workshop (2017)Google Scholar
  35. 35.
    Ren, Z., Sun, D., Kautz, J., Sudderth, E.B.: Cascaded scene flow prediction using semantic segmentation. In: ICCV Workshop (2017)Google Scholar
  36. 36.
    Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: DeepMatching: hierarchical deformable dense matching. IJCV 120(3), 300–323 (2016)MathSciNetCrossRefGoogle Scholar
  37. 37.
    Seki, A., Pollefeys, M.: Patch based confidence prediction for dense disparity map. In: BMVC (2016)Google Scholar
  38. 38.
    Shaked, A., Wolf, L.: Improved stereo matching with constant highway networks and reflective confidence learning. In: CVPR (2017)Google Scholar
  39. 39.
    Vijay, B., Alex, K., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. TPAMI 39(12), 2481–2495 (2017)CrossRefGoogle Scholar
  40. 40.
    Xie, J., Girshick, R., Farhadi, A.: Deep3D: fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 842–857. Springer, Cham (2016). Scholar
  41. 41.
    Yamaguchi, K., McAllester, D., Urtasun, R.: Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 756–771. Springer, Cham (2014). Scholar
  42. 42.
    Yu, L., Wang, Y., Wu, Y., Jia, Y.: Deep stereo matching with explicit cost aggregation sub-architecture. In: AAAI (2018)Google Scholar
  43. 43.
    Zbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network to compare image patches. JMLR 17(1), 2287–2318 (2016)zbMATHGoogle Scholar
  44. 44.
    Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)Google Scholar
  45. 45.
    Zhou, C., Zhang, H., Shen, X., Jia, J.: Unsupervised learning of stereo matching. In: ICCV (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Guorun Yang
    • 1
  • Hengshuang Zhao
    • 2
  • Jianping Shi
    • 3
  • Zhidong Deng
    • 1
    Email author
  • Jiaya Jia
    • 2
    • 4
  1. 1.Department of Computer Science, State Key Laboratory of Intelligent Technology and Systems, Beijing National Research Center for Information Science and TechnologyTsinghua UniversityBeijingChina
  2. 2.The Chinese University of Hong KongShatinHong Kong
  3. 3.SenseTime ResearchBeijingChina
  4. 4.Tencent YouTu LabShenzhenChina

Personalised recommendations