
Deep Long Term Prediction for Semantic Segmentation in Autonomous Driving

  • Conference paper
  • In: Advanced Analytics and Learning on Temporal Data (AALTD 2023)

Abstract

Temporal prediction is an important function in autonomous driving (AD) systems, as it forecasts how the environment will change in the next few seconds. Humans have an inherent prediction capability that extrapolates a present scenario into the future. In this paper, we present a novel approach to look further into the future using a standard semantic segmentation representation and time series networks of varying architectures. An important property of our approach is its flexibility to predict an arbitrary time horizon into the future. We perform prediction in the semantic segmentation domain, where the inputs are semantic segmentation masks. We present extensive results and a discussion of different data dimensionalities that can prove beneficial for prediction on longer time horizons (up to \(2\,\textrm{s}\)). We also show results of our approach on two widely employed datasets in AD research, i.e., Cityscapes and BDD100K. We report two types of mIoU: one computed with self-generated ground truth labels (mIoU\(^\textrm{seg}\)) for both of our datasets, and one computed with actual ground truth labels (mIoU\(^\textrm{gt}\)) for a specific split of the Cityscapes dataset. Our method achieves \(57.12\%\) and \(83.95\%\) mIoU\(^\textrm{seg}\), respectively, on the validation splits of BDD100K and Cityscapes for short-term time horizon predictions (up to \(0.2\,\textrm{s}\) and \(0.06\,\textrm{s}\)), outperforming the current state of the art on Cityscapes by \(13.71\%\) absolute. For long-term predictions (up to \(2\,\textrm{s}\) and \(0.6\,\textrm{s}\)), we achieve \(37.96\%\) and \(63.65\%\) mIoU\(^\textrm{seg}\), respectively, for BDD100K and Cityscapes. Specifically, on the validation split of Cityscapes with perfect ground truth annotations, we achieve \(67.55\%\) and \(63.60\%\) mIoU\(^\textrm{gt}\), outperforming the current state of the art by \(1.45\%\) absolute and \(4.2\%\) absolute with time horizon predictions up to \(0.06\,\textrm{s}\) and \(0.18\,\textrm{s}\), respectively.


References

  1. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: Proceedings of ICIP, Melbourne, VIC, Australia, pp. 3464–3468, September 2016
  2. Breitenstein, J., Termöhlen, J.A., Lipinski, D., Fingscheidt, T.: Systematization of corner cases for visual perception in automated driving. In: Proceedings of IV, Las Vegas, NV, USA, pp. 986–993, October 2020
  3. Cordts, M., et al.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of CVPR, Las Vegas, NV, USA, pp. 3213–3223, June 2016
  4. Duwek, H.C., Shalumov, A., Tsur, E.E.: Image reconstruction from neuromorphic event cameras using Laplacian-prediction and Poisson integration with spiking and artificial neural networks. In: Proceedings of CVPR Workshops, Virtual, pp. 1333–1341, June 2021
  5. Fingscheidt, T., Gottschalk, H., Houben, S. (eds.): Deep Neural Networks and Data for Automated Driving: Robustness, Uncertainty Quantification, and Insights Towards Safety. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-01233-4
  6. Guen, V.L., Thome, N.: Disentangling physical dynamics from unknown factors for unsupervised video prediction. In: Proceedings of ICCV, Los Alamitos, CA, USA, pp. 11471–11481, June 2020
  7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  8. Jin, X., et al.: Predicting scene parsing and motion dynamics in the future. In: Proceedings of NeurIPS, Long Beach, CA, USA, December 2017
  9. Kalman, R.E.: A new approach to linear filtering and prediction problems. Trans. ASME-J. Basic Eng. 82(Series D), 35–45 (1960)
  10. Kwon, Y.H., Park, M.G.: Predicting future frames using retrospective cycle GAN. In: Proceedings of CVPR, Long Beach, CA, USA, pp. 1811–1820, June 2019
  11. Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: Proceedings of ICCV, Venice, Italy, pp. 4463–4471, October 2017
  12. Lotter, W., Kreiman, G., Cox, D.D.: Deep predictive coding networks for video prediction and unsupervised learning. arXiv, August 2016
  13. Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: Proceedings of ICCV, Venice, Italy, pp. 648–657, October 2017
  14. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of IJCAI, Vancouver, BC, Canada, pp. 674–679, August 1981
  15. Maas, A., Hannun, A., Ng, A.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, Atlanta, GA, USA, June 2013
  16. Mahjourian, R., Wicke, M., Angelova, A.: Geometry-based next frame prediction from monocular video. In: Proceedings of IV, pp. 1700–1707, June 2017
  17. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: Proceedings of ICLR, San Juan, Puerto Rico, pp. 1–14, May 2016
  18. Nabavi, S.S., Rochan, M., Wang, Y.: Future semantic segmentation with convolutional LSTM. In: Proceedings of BMVC, Newcastle, UK, pp. 1–12, September 2018
  19. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of CVPR, Las Vegas, NV, USA, pp. 779–788, June 2016
  20. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of CVPR, Honolulu, HI, USA, July 2017
  21. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
  22. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Proceedings of NIPS, Montreal, QC, Canada, pp. 802–810, December 2015
  23. Walker, J., Razavi, A., van den Oord, A.: Predicting video with VQVAE. CoRR abs/2103.01950 (2021). https://arxiv.org/abs/2103.01950
  24. Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 107–122. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_7
  25. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: simple and efficient design for semantic segmentation with transformers. In: Proceedings of NeurIPS, Virtual Conference, pp. 12077–12090, December 2021
  26. Yu, F., et al.: BDD100K: a diverse driving dataset for heterogeneous multitask learning. In: Proceedings of CVPR, Seattle, WA, USA, pp. 1–14, June 2020
  27. Zhao, H., Zhang, S., Wu, G., Moura, J.M.F., Costeira, J.P., Gordon, G.J.: Adversarial multiple source domain adaptation. In: Proceedings of NeurIPS, Montréal, QC, Canada, pp. 8568–8579, December 2018
  28. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of CVPR, Honolulu, HI, USA, pp. 2881–2890, July 2017
  29. Zhao, H., et al.: PSANet: point-wise spatial attention network for scene parsing. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 270–286. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_17


Author information

Correspondence to Bidya Dash.

Ethics declarations

Disclaimer

The results, opinions and conclusions expressed in this publication are not necessarily those of Volkswagen Aktiengesellschaft.

A Supplementary Material

Table 4. Re-ordering of semantic classes in BDD100K

A.1 Qualitative Results

In this section, we show qualitative results of our method on sequences of BDD100K, \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\), and Cityscapes, \(\mathcal {D}^\mathrm {CS-vid}_\textrm{val}\). In Fig. 5, we show qualitative results of our prediction method on a sequence of \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\). We can observe that, for increasing time steps, i.e., \(\varDelta {t}=\{1, 2, 5, 10\}\), the prediction worsens for dynamic objects. This can be inferred from the increase in white regions in the absolute difference visualizations (bottom row), defined as \(\hat{\textbf{d}}_{t}=|\hat{\textbf{m}}_{t}-\overline{\textbf{m}}_{t}|\). Also, the majority of prediction errors occur at the boundaries between different classes, e.g., where car pixels meet road pixels, or where sidewalk pixels meet road pixels.
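For reference, the following is a minimal sketch of how this absolute difference visualization could be computed, assuming the prediction \(\hat{\textbf{m}}_{t}\) and the pseudo ground truth \(\overline{\textbf{m}}_{t}\) are available as NumPy arrays of class indices with identical resolution; the function and variable names are illustrative and not taken from the paper's code.

```python
# Sketch of the absolute difference map d_hat_t = |m_hat_t - m_bar_t|
# between predicted and (pseudo) ground truth class-index masks.
# Nonzero entries mark class disagreements and appear as bright regions.
import numpy as np


def abs_difference_map(m_hat: np.ndarray, m_bar: np.ndarray) -> np.ndarray:
    assert m_hat.shape == m_bar.shape, "masks must share the same resolution"
    return np.abs(m_hat.astype(np.int32) - m_bar.astype(np.int32)).astype(np.uint8)


if __name__ == "__main__":
    # Hypothetical example with S=19 classes and an arbitrary resolution.
    rng = np.random.default_rng(0)
    m_bar = rng.integers(1, 20, size=(256, 512))            # pseudo ground truth
    m_hat = m_bar.copy()
    m_hat[:, 256:] = rng.integers(1, 20, size=(256, 256))   # corrupt one half
    d_hat = abs_difference_map(m_hat, m_bar)
    print("disagreeing pixels:", int(np.count_nonzero(d_hat)))
```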

Specifically for the Cityscapes dataset, in Fig. 6, the predictions (middle row) are on par with their corresponding 20\(^\textrm{th}\)-frame ground truth annotations (top row) for \(\varDelta {t}=\{1, 2, 5\}\), obtained from the dataset. The semantic class boundaries are well captured, with noise suppression, in the predictions (middle row). However, for \(\varDelta {t}=10\), we can see that the prediction focuses more on the static classes than on the finer dynamic class boundary details, as is visible from the missing sidewalk in the left region of \(\hat{\textbf{m}}_{t+10}\) that is present in the ground truth annotation \(\overline{\textbf{m}}_{t+10}\) (in pink) of Fig. 6.

Fig. 5. Output predictions for a sequence of the BDD100K validation split, \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\). The top row depicts the pseudo ground truth \(\overline{\textbf{m}}_{t}\), \(\overline{\textbf{m}}_{t+1}\), \(\overline{\textbf{m}}_{t+2}\), \(\overline{\textbf{m}}_{t+5}\), \(\overline{\textbf{m}}_{t+10}\) generated by PSANet. In the middle row, we show the input semantic segmentation \(\overline{\textbf{m}}_{t}\) along with the predictions \(\hat{\textbf{m}}_{t+1}\), \(\hat{\textbf{m}}_{t+2}\), \(\hat{\textbf{m}}_{t+5}\), \(\hat{\textbf{m}}_{t+10}\) from the prediction network. The bottom row portrays the absolute difference \(\hat{\textbf{d}}_{t+1}\), \(\hat{\textbf{d}}_{t+2}\), \(\hat{\textbf{d}}_{t+5}\), \(\hat{\textbf{d}}_{t+10}\) between the ground truth and prediction frames.

Fig. 6. Output predictions for a sequence of the Cityscapes validation split, \(\mathcal {D}^\mathrm {CS-vid}_\textrm{val}\). The top row depicts the actual 20\(^\textrm{th}\)-frame ground truth annotations available in the dataset for \(\overline{\textbf{m}}_{t}\), \(\overline{\textbf{m}}_{t+1}\), \(\overline{\textbf{m}}_{t+2}\), \(\overline{\textbf{m}}_{t+5}\), \(\overline{\textbf{m}}_{t+10}\). In the middle row, we show the input semantic segmentation \(\overline{\textbf{m}}_{t}\) along with the predictions \(\hat{\textbf{m}}_{t+1}\), \(\hat{\textbf{m}}_{t+2}\), \(\hat{\textbf{m}}_{t+5}\), \(\hat{\textbf{m}}_{t+10}\) from the prediction network. The bottom row portrays the absolute difference \(\hat{\textbf{d}}_{t+1}\), \(\hat{\textbf{d}}_{t+2}\), \(\hat{\textbf{d}}_{t+5}\), \(\hat{\textbf{d}}_{t+10}\) between the ground truth and prediction frames.

Fig. 7. A semantic segmentation input mask \(\overline{\textbf{m}}_{t}\) from \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\) showing different semantic classes. The left white encircled region portrays the proximal occurrence of class sidewalk (\(s=2\)) near class road (\(s=1\)). Similarly, the right white encircled region portrays the proximal occurrence of class road (\(s=1\)) near class car (\(s=14\)).

A.2 Prediction Invariance on the Ordering of Semantic Classes

Table 5. Confusion matrix on BDD100K: scores of all \({S}=19\) classes in BDD100K with the original ordering. The cells with the highest true score are highlighted in red and the cells with the second highest true score are marked in light red.
Table 6. Confusion matrix on BDD100K: scores of all \({S}=19\) classes in BDD100K with the first re-ordering. The cells with the highest true score are highlighted in red and the cells with the second highest true score are marked in light red.
Table 7. Confusion matrix on BDD100K: scores of all \({S}=19\) classes in BDD100K with the second re-ordering. The cells with the highest true score are highlighted in red and the cells with the second highest true score are marked in light red.

We investigate the behavior of our model when 1-channel inputs are fed to our predictor network, i.e., the generated pseudo ground truth \(\overline{\textbf{m}}_{t} \in \mathcal {S}^{H \times W \times 1}\) for the BDD100K dataset \(\mathcal {D}^\mathrm {BDD-MOTS}\). The semantic segmentation mask \(\overline{\textbf{m}}_{t}\) contains class indices \(s \in \mathcal {S}=\{1, 2, ..., S\}\), where \(S=19\). Here, each semantic class corresponds to a specific class index \(s\), e.g., \(s=1\) for class road and \(s=12\) for class person. If we consider a small region of a semantic segmentation mask, we usually find pixels of one semantic class in close proximity to pixels of another semantic class, e.g., class road pixels (\(s=1\)) almost always occur adjacent to class car pixels (\(s=14\)), and class sidewalk pixels (\(s=2\)) are almost always adjacent to class road pixels (\(s=1\)), as can be seen in Fig. 7. To investigate the proposed method's performance and robustness when the original class ordering is changed, we conducted experiments in which the class indices in the generated pseudo ground truth frames \(\overline{\textbf{m}}_{t}\) are shuffled. For instance, the same scene would then contain class road (\(s=5\)) adjacent to class car (\(s=16\)) and class sidewalk (\(s=9\)) adjacent to class road (\(s=5\)). Note that the semantic classes remain the same; only the class indices are shuffled randomly. In Table 4, we show the original class order along with the re-ordered class indices, where semantic classes are marked by their actual defined colors in \(\mathcal {D}^\mathrm {BDD-MOTS}\). This is an important investigation to show that our predictor model learns the proximal relationship between the semantic classes rather than the numerical class index values, i.e., our model learns that class road pixels are most likely to occur near class car pixels and vice versa, irrespective of their class index values. Hence, we performed experiments by re-ordering the class indices of \(\mathcal {D}^\mathrm {BDD-MOTS}\) in such a way that classes that were close to each other in terms of class index distance, e.g., road (\(s=1\)) and sidewalk (\(s=2\)), are now placed further apart, e.g., road (\(s=5\)) and sidewalk (\(s=9\)), as can be seen in Table 4.
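A minimal sketch of such a class index re-ordering is given below, assuming the masks store class indices \(s \in \{1, \dots, 19\}\) as NumPy arrays; the random permutation is illustrative and does not reproduce the exact re-orderings of Table 4.

```python
# Sketch: shuffle the numeric class indices of a segmentation mask via a
# lookup table; the spatial layout of the semantic classes is untouched.
import numpy as np

S = 19                                          # number of semantic classes
rng = np.random.default_rng(42)

# Lookup table: old class index -> new class index (index 0 left unused).
lut = np.zeros(S + 1, dtype=np.int64)
lut[1:] = rng.permutation(np.arange(1, S + 1))


def reorder_classes(mask: np.ndarray) -> np.ndarray:
    return lut[mask]


# Example: road (s=1), sidewalk (s=2), and car (s=14) keep their pixel
# positions; only their numeric labels change.
mask = np.array([[1, 1, 2],
                 [1, 14, 2]])
print(reorder_classes(mask))
```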

Table 5 shows the confusion matrix for the original class order of \(\mathcal {D}^\mathrm {BDD-MOTS}\). The confusion matrix represents how each class in the prediction is confused with respect to all classes present in the ground truth and vice versa. It can be observed that every class is predicted well, with the highest score for itself (see diagonal), except class rider (\(s=13\)), which is predicted as class person (\(s=12\)) with a score of 0.49; this is plausible, as rider fits into the broader category of person. Similarly, class motorcycle (\(s=18\)) gets confused for class road (\(s=1\)) with a score of 0.33. This could be attributed to the fact that the class road heavily dominates the pixel distribution in all scenes, whereas the class motorcycle has very minimal occupancy in most scenes. In Table 6, we show the confusion matrix for the first re-ordering of classes. It can be observed that the predictor still confuses class rider (\(s=17\)) for class person (\(s=11\)) with a score of 0.34 and class motorcycle (\(s=19\)) for class road (\(s=5\)) with a score of 0.38. Similarly, in Table 7, we show the confusion matrix for the second re-ordering of classes. The predictor once again interprets class rider (\(s=5\)) as class person (\(s=13\)) with a score of 0.37 and class motorcycle (\(s=11\)) as class road (\(s=8\)) with a score of 0.38. It can be inferred that the predictor, regardless of class index ordering, behaves very similarly for all class predictions. Thus, the predictor can safely be considered invariant to the class ordering.
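As an illustration, a row-normalized confusion matrix of the kind reported in Tables 5, 6, and 7 could be computed as in the following sketch, assuming ground truth and prediction masks with class indices \(1, \dots, 19\); the function and variable names are our own, not those of the paper's implementation.

```python
# Sketch: row-normalized confusion matrix between ground truth and predicted
# class-index masks. Entry (i, j) is the fraction of ground truth pixels of
# class i+1 that were predicted as class j+1.
import numpy as np

S = 19  # number of semantic classes, indexed 1..S


def confusion_matrix(gt: np.ndarray, pred: np.ndarray) -> np.ndarray:
    cm = np.zeros((S, S), dtype=np.float64)
    np.add.at(cm, (gt.ravel() - 1, pred.ravel() - 1), 1.0)
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)


# Hypothetical usage: the score of ground truth class 13 (rider, original
# ordering) being predicted as class 12 (person) would be read off as
# confusion_matrix(gt_mask, pred_mask)[12, 11].
```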


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dash, B., Bilagi, S., Breitenstein, J., Schomerus, V., Bagdonat, T., Fingscheidt, T. (2023). Deep Long Term Prediction for Semantic Segmentation in Autonomous Driving. In: Ifrim, G., et al. Advanced Analytics and Learning on Temporal Data. AALTD 2023. Lecture Notes in Computer Science(), vol 14343. Springer, Cham. https://doi.org/10.1007/978-3-031-49896-1_7


  • DOI: https://doi.org/10.1007/978-3-031-49896-1_7
  • Publisher Name: Springer, Cham
  • Print ISBN: 978-3-031-49895-4
  • Online ISBN: 978-3-031-49896-1
