Season-Invariant Semantic Segmentation with a Deep Multimodal Network

Kim, Dong-Ki; Maturana, Daniel; Uenoyama, Masashi; Scherer, Sebastian

doi:10.1007/978-3-319-67361-5_17

Part of the book series: Springer Proceedings in Advanced Robotics (SPAR, volume 5)

Abstract

Semantic scene understanding is a useful capability for autonomous vehicles operating in off-road environments. While cameras are the most common sensor used for semantic classification, the performance of camera-based methods may suffer when the training and testing sets differ significantly because of illumination, weather, and seasonal variations. In contrast, 3D information from active sensors such as LiDAR is comparatively invariant to these factors, which motivates us to investigate whether it can improve performance in this scenario. In this paper, we propose a novel multimodal Convolutional Neural Network (CNN) architecture consisting of two streams, 2D and 3D, which are fused by projecting 3D features into image space to achieve robust pixelwise semantic segmentation. We evaluate the proposed method on a novel off-road terrain classification benchmark and show a 25% improvement in mean Intersection over Union (IoU) on navigation-related semantic classes, relative to an image-only baseline.
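For readers unfamiliar with the metric, the following is a minimal sketch of how mean IoU is commonly computed from per-pixel labels, together with the arithmetic for a relative gain over a baseline. It is not the authors' evaluation code: the function names `mean_iou` and `relative_improvement` are hypothetical, the choice to average only over classes that actually occur is an assumption, and the numbers in the usage note are illustrative rather than results from the paper.

```python
from typing import Sequence


def mean_iou(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1  # false negative for the true class
            union[p] += 1  # false positive for the predicted class
    # Assumption: classes absent from both sequences are skipped.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no class appears in either sequence")
    return sum(ious) / len(ious)


def relative_improvement(baseline: float, proposed: float) -> float:
    """Gain of `proposed` over `baseline`, as a fraction of the baseline."""
    return (proposed - baseline) / baseline
```

As a purely illustrative reading, if the reported figure is a relative gain, a baseline mean IoU of 0.40 and a multimodal mean IoU of 0.50 would give `relative_improvement(0.40, 0.50)`, which is approximately 0.25.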

Notes

1. For performance reasons, we simplify the point cloud network by replacing the dilation and asymmetric layers with regular convolution layers. We also replace each deconvolution layer with an upsampling layer followed by a \(3 \times 3 \times 3\) convolutional layer with stride 1. For simplicity, we still refer to this combination as “deconvolution”.
2. The point cloud is represented as a 3D voxel grid because a convolutional architecture requires a regular input format.
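The two notes above can be illustrated with a small, dependency-free sketch: binning points into a voxel grid, then applying an upsampling step followed by a \(3 \times 3 \times 3\) stride-1 convolution in place of a deconvolution. This is an illustration under stated assumptions, not the authors' network. The function names, the binary occupancy encoding, nearest-neighbour upsampling, zero padding, and the single-channel fixed kernel are all assumptions made here; in a trained network the kernel weights would be learned and there would be many channels.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Voxel = Tuple[int, int, int]
Grid = List[List[List[float]]]


def voxelize(points: Iterable[Sequence[float]], origin: Sequence[float],
             voxel_size: float, dims: Voxel) -> Dict[Voxel, int]:
    """Count points per voxel; points outside the grid are dropped."""
    counts: Dict[Voxel, int] = {}
    for point in points:
        idx = tuple(int((c - o) // voxel_size) for c, o in zip(point, origin))
        if all(0 <= i < d for i, d in zip(idx, dims)):
            counts[idx] = counts.get(idx, 0) + 1
    return counts


def occupancy_grid(counts: Dict[Voxel, int], dims: Voxel) -> Grid:
    """Dense grid with 1.0 for occupied voxels (assumed encoding)."""
    nx, ny, nz = dims
    return [[[1.0 if (x, y, z) in counts else 0.0 for z in range(nz)]
             for y in range(ny)] for x in range(nx)]


def upsample_nearest(grid: Grid, factor: int = 2) -> Grid:
    """Nearest-neighbour upsampling: each voxel is repeated `factor` times per axis."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    return [[[grid[x // factor][y // factor][z // factor]
              for z in range(nz * factor)]
             for y in range(ny * factor)]
            for x in range(nx * factor)]


def conv3x3x3(grid: Grid, kernel: Grid) -> Grid:
    """Stride-1 3x3x3 cross-correlation with zero padding (output keeps input size)."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                acc = 0.0
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        for dz in (-1, 0, 1):
                            xi, yi, zi = x + dx, y + dy, z + dz
                            if 0 <= xi < nx and 0 <= yi < ny and 0 <= zi < nz:
                                acc += kernel[dx + 1][dy + 1][dz + 1] * grid[xi][yi][zi]
                out[x][y][z] = acc
    return out


def upsample_then_conv(grid: Grid, kernel: Grid, factor: int = 2) -> Grid:
    """Upsample, then convolve: the substitute for deconvolution described in note 1."""
    return conv3x3x3(upsample_nearest(grid, factor), kernel)
```

A typical call chain would be `voxelize` on the raw points, `occupancy_grid` to obtain the regular input that note 2 refers to, and `upsample_then_conv` wherever a decoder stage would otherwise use a deconvolution.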

Acknowledgements

We thank the Yamaha Motor Corporation for supporting this research.

Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, USA
Dong-Ki Kim, Daniel Maturana & Sebastian Scherer
Yamaha Motor Corporation, Cypress, USA
Masashi Uenoyama

Corresponding author

Correspondence to Dong-Ki Kim.

Editor information

Editors and Affiliations

ETH Zurich, Zürich, Switzerland
Marco Hutter
ETH Zurich, Zürich, Switzerland
Roland Siegwart

About this paper

Cite this paper

Kim, DK., Maturana, D., Uenoyama, M., Scherer, S. (2018). Season-Invariant Semantic Segmentation with a Deep Multimodal Network. In: Hutter, M., Siegwart, R. (eds) Field and Service Robotics. Springer Proceedings in Advanced Robotics, vol 5. Springer, Cham. https://doi.org/10.1007/978-3-319-67361-5_17

DOI: https://doi.org/10.1007/978-3-319-67361-5_17
Published: 03 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67360-8
Online ISBN: 978-3-319-67361-5
eBook Packages: Engineering, Engineering (R0)
