
Deep Long Term Prediction for Semantic Segmentation in Autonomous Driving

  • Conference paper
  • In: Advanced Analytics and Learning on Temporal Data (AALTD 2023)

Abstract

Temporal prediction is an important function in autonomous driving (AD) systems, as it forecasts how the environment will change in the next few seconds. Humans have an inherent prediction capability that extrapolates a present scenario into the future. In this paper, we present a novel approach to look further into the future using a standard semantic segmentation representation and time series networks of varying architectures. An important property of our approach is its flexibility to predict an arbitrary time horizon into the future. We perform prediction in the semantic segmentation domain, where the inputs are semantic segmentation masks. We present extensive results and a discussion of different data dimensionalities that can prove beneficial for prediction on longer time horizons (up to \(2\,\textrm{s}\)). We also show results of our approach on two widely employed datasets in AD research, i.e., Cityscapes and BDD100K. We report two types of mIoU: one computed with self-generated ground truth labels (mIoU\(^\textrm{seg}\)) for both of our datasets, and one computed with actual ground truth labels (mIoU\(^\textrm{gt}\)) for a specific split of the Cityscapes dataset. Our method achieves \(57.12\%\) and \(83.95\%\) mIoU\(^\textrm{seg}\), respectively, on the validation splits of BDD100K and Cityscapes for short-term time horizon predictions (up to \(0.2\,\textrm{s}\) and \(0.06\,\textrm{s}\)), outperforming the current state of the art on Cityscapes by \(13.71\%\) absolute. For long-term predictions (up to \(2\,\textrm{s}\) and \(0.6\,\textrm{s}\)), we achieve \(37.96\%\) and \(63.65\%\) mIoU\(^\textrm{seg}\), respectively, for BDD100K and Cityscapes. Specifically, on the validation split of Cityscapes with perfect ground truth annotations, we achieve \(67.55\%\) and \(63.60\%\) mIoU\(^\textrm{gt}\), outperforming the current state of the art by \(1.45\%\) absolute and \(4.2\%\) absolute with time horizon predictions up to \(0.06\,\textrm{s}\) and \(0.18\,\textrm{s}\), respectively.


References

  1. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: Proceedings of ICIP, Melbourne, VIC, Australia, pp. 3464–3468, September 2016
  2. Breitenstein, J., Termöhlen, J.A., Lipinski, D., Fingscheidt, T.: Systematization of corner cases for visual perception in automated driving. In: Proceedings of IV, Las Vegas, NV, USA, pp. 986–993, October 2020
  3. Cordts, M., et al.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of CVPR, Las Vegas, NV, USA, pp. 3213–3223, June 2016
  4. Duwek, H.C., Shalumov, A., Tsur, E.E.: Image reconstruction from neuromorphic event cameras using Laplacian-prediction and Poisson integration with spiking and artificial neural networks. In: Proceedings of CVPR Workshops, Virtual, pp. 1333–1341, June 2021
  5. Fingscheidt, T., Gottschalk, H., Houben, S. (eds.): Deep Neural Networks and Data for Automated Driving: Robustness, Uncertainty Quantification, and Insights Towards Safety. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-01233-4
  6. Guen, V.L., Thome, N.: Disentangling physical dynamics from unknown factors for unsupervised video prediction. In: Proceedings of ICCV, Los Alamitos, CA, USA, pp. 11471–11481, June 2020
  7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  8. Jin, X., et al.: Predicting scene parsing and motion dynamics in the future. In: Proceedings of NeurIPS, Long Beach, CA, USA, December 2017
  9. Kalman, R.E.: A new approach to linear filtering and prediction problems. Trans. ASME-J. Basic Eng. 82(Series D), 35–45 (1960)
  10. Kwon, Y.H., Park, M.G.: Predicting future frames using retrospective cycle GAN. In: Proceedings of CVPR, Long Beach, CA, USA, pp. 1811–1820, June 2019
  11. Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: Proceedings of ICCV, Venice, Italy, pp. 4463–4471, October 2017
  12. Lotter, W., Kreiman, G., Cox, D.D.: Deep predictive coding networks for video prediction and unsupervised learning. arXiv, August 2016
  13. Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: Proceedings of ICCV, Venice, Italy, pp. 648–657, October 2017
  14. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of IJCAI, Vancouver, BC, Canada, pp. 674–679, August 1981
  15. Maas, A., Hannun, A., Ng, A.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, Atlanta, GA, USA, June 2013
  16. Mahjourian, R., Wicke, M., Angelova, A.: Geometry-based next frame prediction from monocular video. In: Proceedings of IV, pp. 1700–1707, June 2017
  17. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: Proceedings of ICLR, San Juan, Puerto Rico, pp. 1–14, May 2016
  18. Nabavi, S.S., Rochan, M., Wang, Y.: Future semantic segmentation with convolutional LSTM. In: Proceedings of BMVC, Newcastle, UK, pp. 1–12, September 2018
  19. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of CVPR, Las Vegas, NV, USA, pp. 779–788, June 2016
  20. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of CVPR, Honolulu, HI, USA, July 2017
  21. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
  22. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Proceedings of NIPS, Montreal, QC, Canada, pp. 802–810, December 2015
  23. Walker, J., Razavi, A., van den Oord, A.: Predicting video with VQVAE. CoRR abs/2103.01950 (2021). https://arxiv.org/abs/2103.01950
  24. Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 107–122. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_7
  25. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: simple and efficient design for semantic segmentation with transformers. In: Proceedings of NeurIPS, Virtual Conference, pp. 12077–12090, December 2021
  26. Yu, F., et al.: BDD100K: a diverse driving dataset for heterogeneous multitask learning. In: Proceedings of CVPR, Seattle, WA, USA, pp. 1–14, June 2020
  27. Zhao, H., Zhang, S., Wu, G., Moura, J.M.F., Costeira, J.P., Gordon, G.J.: Adversarial multiple source domain adaptation. In: Proceedings of NeurIPS, Montréal, QC, Canada, pp. 8568–8579, December 2018
  28. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of CVPR, Honolulu, HI, USA, pp. 2881–2890, July 2017
  29. Zhao, H., et al.: PSANet: point-wise spatial attention network for scene parsing. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 270–286. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_17


Author information

Correspondence to Bidya Dash.

Ethics declarations

Disclaimer

The results, opinions and conclusions expressed in this publication are not necessarily those of Volkswagen Aktiengesellschaft.

A Supplementary Material

Table 4. Re-ordering of semantic classes in BDD100K

A.1 Qualitative Results

In this section, we show qualitative results of our method on sequences of BDD100K, \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\), and Cityscapes, \(\mathcal {D}^\mathrm {CS-vid}_\textrm{val}\). In Fig. 5, we show qualitative results of our prediction method on a sequence of \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\). We can observe that, for increasing time steps, i.e., \(\varDelta {t}=\{1, 2, 5, 10\}\), the prediction worsens for dynamic objects. This can be inferred from the increase in white regions in the absolute difference visualizations (bottom row), defined as \(\hat{\textbf{d}}_{t}=|\hat{\textbf{m}}_{t}-\overline{\textbf{m}}_{t}|\). Also, the majority of prediction errors occur at the boundaries between different classes, e.g., where car pixels meet road pixels, or where sidewalk pixels meet road pixels.
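For reference, the following is a minimal sketch of how this absolute difference visualization could be computed, assuming the prediction \(\hat{\textbf{m}}_{t}\) and the pseudo ground truth \(\overline{\textbf{m}}_{t}\) are available as NumPy arrays of class indices with identical resolution; the function and variable names are illustrative and not taken from the paper's code.

```python
# Sketch of the absolute difference map d_hat_t = |m_hat_t - m_bar_t|
# between predicted and (pseudo) ground truth class-index masks.
# Nonzero entries mark class disagreements and appear as bright regions.
import numpy as np


def abs_difference_map(m_hat: np.ndarray, m_bar: np.ndarray) -> np.ndarray:
    assert m_hat.shape == m_bar.shape, "masks must share the same resolution"
    return np.abs(m_hat.astype(np.int32) - m_bar.astype(np.int32)).astype(np.uint8)


if __name__ == "__main__":
    # Hypothetical example with S=19 classes and an arbitrary resolution.
    rng = np.random.default_rng(0)
    m_bar = rng.integers(1, 20, size=(256, 512))            # pseudo ground truth
    m_hat = m_bar.copy()
    m_hat[:, 256:] = rng.integers(1, 20, size=(256, 256))   # corrupt one half
    d_hat = abs_difference_map(m_hat, m_bar)
    print("disagreeing pixels:", int(np.count_nonzero(d_hat)))
```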

Specifically for the Cityscapes dataset, in Fig. 6, the predictions (middle row) are on par with their corresponding 20\(^\textrm{th}\)-frame ground truth annotations (top row) for \(\varDelta {t}=\{1, 2, 5\}\), obtained from the dataset. The semantic class boundaries are well captured, with noise suppression, in the predictions (middle row). However, for \(\varDelta {t}=10\), we can see that the prediction focuses more on the static classes than on the finer dynamic class boundary details, as is visible from the missing sidewalk in the left region of \(\hat{\textbf{m}}_{t+10}\) that is present in the ground truth annotation \(\overline{\textbf{m}}_{t+10}\) (in pink) of Fig. 6.

Fig. 5. Output predictions for a sequence of the BDD100K validation split, \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\). The top row depicts the pseudo ground truth \(\overline{\textbf{m}}_{t}\), \(\overline{\textbf{m}}_{t+1}\), \(\overline{\textbf{m}}_{t+2}\), \(\overline{\textbf{m}}_{t+5}\), \(\overline{\textbf{m}}_{t+10}\) generated by PSANet. In the middle row, we show the input semantic segmentation \(\overline{\textbf{m}}_{t}\) along with the predictions \(\hat{\textbf{m}}_{t+1}\), \(\hat{\textbf{m}}_{t+2}\), \(\hat{\textbf{m}}_{t+5}\), \(\hat{\textbf{m}}_{t+10}\) from the prediction network. The bottom row portrays the absolute difference \(\hat{\textbf{d}}_{t+1}\), \(\hat{\textbf{d}}_{t+2}\), \(\hat{\textbf{d}}_{t+5}\), \(\hat{\textbf{d}}_{t+10}\) between the ground truth and prediction frames.

Fig. 6. Output predictions for a sequence of the Cityscapes validation split, \(\mathcal {D}^\mathrm {CS-vid}_\textrm{val}\). The top row depicts the actual 20\(^\textrm{th}\)-frame ground truth annotations available in the dataset for \(\overline{\textbf{m}}_{t}\), \(\overline{\textbf{m}}_{t+1}\), \(\overline{\textbf{m}}_{t+2}\), \(\overline{\textbf{m}}_{t+5}\), \(\overline{\textbf{m}}_{t+10}\). In the middle row, we show the input semantic segmentation \(\overline{\textbf{m}}_{t}\) along with the predictions \(\hat{\textbf{m}}_{t+1}\), \(\hat{\textbf{m}}_{t+2}\), \(\hat{\textbf{m}}_{t+5}\), \(\hat{\textbf{m}}_{t+10}\) from the prediction network. The bottom row portrays the absolute difference \(\hat{\textbf{d}}_{t+1}\), \(\hat{\textbf{d}}_{t+2}\), \(\hat{\textbf{d}}_{t+5}\), \(\hat{\textbf{d}}_{t+10}\) between the ground truth and prediction frames.

Fig. 7. A semantic segmentation input mask \(\overline{\textbf{m}}_{t}\) from \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\) showing different semantic classes. The left white encircled region portrays the proximal occurrence of class sidewalk (\(s=2\)) near class road (\(s=1\)). Similarly, the right white encircled region portrays the proximal occurrence of class road (\(s=1\)) near class car (\(s=14\)).

A.2 Prediction Invariance on the Ordering of Semantic Classes

Table 5. Confusion matrix on BDD100K: scores of all \({S}=19\) classes in BDD100K with the original ordering. The cells with the highest true score are highlighted in red and the cells with the second highest true score are marked in light red.
Table 6. Confusion matrix on BDD100K: scores of all \({S}=19\) classes in BDD100K with the first re-ordering. The cells with the highest true score are highlighted in red and the cells with the second highest true score are marked in light red.
Table 7. Confusion matrix on BDD100K: scores of all \({S}=19\) classes in BDD100K with the second re-ordering. The cells with the highest true score are highlighted in red and the cells with the second highest true score are marked in light red.

We investigate the behavior of our model when 1-channel inputs are fed to our predictor network, i.e., the generated pseudo ground truth \(\overline{\textbf{m}}_{t} \in \mathcal {S}^{H \times W \times 1}\) for the BDD100K dataset \(\mathcal {D}^\mathrm {BDD-MOTS}\). The semantic segmentation mask \(\overline{\textbf{m}}_{t}\) contains class indices \(s \in \mathcal {S}=\{1, 2, ..., S\}\), where \(S=19\). Here, each semantic class corresponds to a specific class index \(s\), e.g., \(s=1\) for class road and \(s=12\) for class person. If we consider a small region of a semantic segmentation mask, we usually find pixels of one semantic class in close proximity to pixels of another semantic class, e.g., class road pixels (\(s=1\)) almost always occur adjacent to class car pixels (\(s=14\)), and class sidewalk pixels (\(s=2\)) are almost always adjacent to class road pixels (\(s=1\)), as can be seen in Fig. 7. To investigate the proposed method's performance and robustness when the original class ordering is changed, we conducted experiments in which the class indices in the generated pseudo ground truth frames \(\overline{\textbf{m}}_{t}\) are shuffled. For instance, the same scene would then contain class road (\(s=5\)) adjacent to class car (\(s=16\)) and class sidewalk (\(s=9\)) adjacent to class road (\(s=5\)). Note that the semantic classes remain the same; only the class indices are shuffled randomly. In Table 4, we show the original class order along with the re-ordered class indices, where semantic classes are marked by their actual defined colors in \(\mathcal {D}^\mathrm {BDD-MOTS}\). This is an important investigation to show that our predictor model learns the proximal relationship between the semantic classes rather than the numerical class index values, i.e., our model learns that class road pixels are most likely to occur near class car pixels and vice versa, irrespective of their class index values. Hence, we performed experiments by re-ordering the class indices of \(\mathcal {D}^\mathrm {BDD-MOTS}\) in such a way that classes that were close to each other in terms of class index distance, e.g., road (\(s=1\)) and sidewalk (\(s=2\)), are now placed further apart, e.g., road (\(s=5\)) and sidewalk (\(s=9\)), as can be seen in Table 4.
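A minimal sketch of such a class index re-ordering is given below, assuming the masks store class indices \(s \in \{1, \dots, 19\}\) as NumPy arrays; the random permutation is illustrative and does not reproduce the exact re-orderings of Table 4.

```python
# Sketch: shuffle the numeric class indices of a segmentation mask via a
# lookup table; the spatial layout of the semantic classes is untouched.
import numpy as np

S = 19                                          # number of semantic classes
rng = np.random.default_rng(42)

# Lookup table: old class index -> new class index (index 0 left unused).
lut = np.zeros(S + 1, dtype=np.int64)
lut[1:] = rng.permutation(np.arange(1, S + 1))


def reorder_classes(mask: np.ndarray) -> np.ndarray:
    return lut[mask]


# Example: road (s=1), sidewalk (s=2), and car (s=14) keep their pixel
# positions; only their numeric labels change.
mask = np.array([[1, 1, 2],
                 [1, 14, 2]])
print(reorder_classes(mask))
```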

Table 5 shows the confusion matrix for the original class order of \(\mathcal {D}^\mathrm {BDD-MOTS}\). The confusion matrix represents how each class in the prediction is confused with respect to all classes present in the ground truth and vice versa. It can be observed that every class is predicted well, with the highest score for itself (see diagonal), except class rider (\(s=13\)), which is predicted as class person (\(s=12\)) with a score of 0.49; this is plausible, as rider fits into the broader category of person. Similarly, class motorcycle (\(s=18\)) gets confused for class road (\(s=1\)) with a score of 0.33. This could be attributed to the fact that the class road heavily dominates the pixel distribution in all scenes, whereas the class motorcycle has very minimal occupancy in most scenes. In Table 6, we show the confusion matrix for the first re-ordering of classes. It can be observed that the predictor still confuses class rider (\(s=17\)) for class person (\(s=11\)) with a score of 0.34 and class motorcycle (\(s=19\)) for class road (\(s=5\)) with a score of 0.38. Similarly, in Table 7, we show the confusion matrix for the second re-ordering of classes. The predictor once again interprets class rider (\(s=5\)) as class person (\(s=13\)) with a score of 0.37 and class motorcycle (\(s=11\)) as class road (\(s=8\)) with a score of 0.38. It can be inferred that the predictor, regardless of class index ordering, behaves very similarly for all class predictions. Thus, the predictor can safely be considered invariant to the class ordering.
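As an illustration, a row-normalized confusion matrix of the kind reported in Tables 5, 6, and 7 could be computed as in the following sketch, assuming ground truth and prediction masks with class indices \(1, \dots, 19\); the function and variable names are our own, not those of the paper's implementation.

```python
# Sketch: row-normalized confusion matrix between ground truth and predicted
# class-index masks. Entry (i, j) is the fraction of ground truth pixels of
# class i+1 that were predicted as class j+1.
import numpy as np

S = 19  # number of semantic classes, indexed 1..S


def confusion_matrix(gt: np.ndarray, pred: np.ndarray) -> np.ndarray:
    cm = np.zeros((S, S), dtype=np.float64)
    np.add.at(cm, (gt.ravel() - 1, pred.ravel() - 1), 1.0)
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)


# Hypothetical usage: the score of ground truth class 13 (rider, original
# ordering) being predicted as class 12 (person) would be read off as
# confusion_matrix(gt_mask, pred_mask)[12, 11].
```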


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dash, B., Bilagi, S., Breitenstein, J., Schomerus, V., Bagdonat, T., Fingscheidt, T. (2023). Deep Long Term Prediction for Semantic Segmentation in Autonomous Driving. In: Ifrim, G., et al. Advanced Analytics and Learning on Temporal Data. AALTD 2023. Lecture Notes in Computer Science(), vol 14343. Springer, Cham. https://doi.org/10.1007/978-3-031-49896-1_7


  • DOI: https://doi.org/10.1007/978-3-031-49896-1_7
  • Publisher Name: Springer, Cham
  • Print ISBN: 978-3-031-49895-4
  • Online ISBN: 978-3-031-49896-1
