
Multi-task Perception for Autonomous Driving

Chapter in the book Autonomous Driving Perception

Abstract

Multi-task learning has become a popular paradigm for tackling multiple tasks simultaneously with less inference time and fewer computational resources. Recently, many self-supervised pre-training methods have been proposed and have achieved impressive performance on a range of computer vision tasks; however, their ability to generalize to multi-task scenarios remains largely unexplored. Moreover, most multi-task algorithms are designed for specific tasks that usually fall outside the scope of autonomous driving, which makes it difficult to compare state-of-the-art multi-task learning methods in this domain. In this chapter, we divide multi-task perception in autonomous driving into 2D perception and 3D perception. For 2D perception, we extensively investigate the transfer ability of various self-supervised methods and reproduce several popular multi-task methods. We then introduce a simple and effective pretrain-adapt-finetune paradigm for multi-task learning, together with a novel adapter, LV-Adapter, which reuses the powerful knowledge of the Contrastive Language-Image Pre-training (CLIP) model pre-trained on image-text pairs. We further present an effective multi-task framework for autonomous driving, GT-Prompt, which learns general prompts and generates task-specific prompts to guide the model to capture both task-invariant and task-specific information. For 3D perception, we investigate both multi-modality fusion and multi-task learning, and introduce an effective multi-level gradient calibration framework that operates across tasks and modalities during optimization.
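The idea of calibrating gradients across tasks can be illustrated with a minimal sketch. This is not the chapter's actual multi-level scheme (which also spans modalities); it is a simplified PCGrad-style rule, stated here as an assumption for illustration: when two task gradients conflict (negative dot product), the conflicting component of one is projected out against the other before the shared parameters are updated. The gradient values and task names below are invented for the example.

```python
import numpy as np

def calibrate(grads):
    """PCGrad-style calibration: for each task gradient, remove the
    component that conflicts with any other task's gradient, then
    average the calibrated gradients for the shared parameters."""
    calibrated = []
    for i, g in enumerate(grads):
        g = g.astype(float).copy()
        for j, other in enumerate(grads):
            if i == j:
                continue
            dot = g @ other
            if dot < 0:  # the two tasks pull in conflicting directions
                g -= dot / (other @ other) * other  # project out the conflict
        calibrated.append(g)
    return np.mean(calibrated, axis=0)

# Two hypothetical task gradients on shared parameters
g_det = np.array([1.0, 0.0])   # e.g. a detection head
g_seg = np.array([-1.0, 1.0])  # e.g. a segmentation head
update = calibrate([g_det, g_seg])  # no longer conflicts with either task
```

After calibration, the shared update has a non-negative dot product with every per-task gradient, so no single task's loss is pushed uphill by the joint step.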


Notes

  1. https://github.com/open-mmlab/mmselfsup.


Author information

Correspondence to Xiaodan Liang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Liang, X., Liang, X., Xu, H. (2023). Multi-task Perception for Autonomous Driving. In: Fan, R., Guo, S., Bocus, M.J. (eds) Autonomous Driving Perception. Advances in Computer Vision and Pattern Recognition. Springer, Singapore. https://doi.org/10.1007/978-981-99-4287-9_9


  • DOI: https://doi.org/10.1007/978-981-99-4287-9_9


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4286-2

  • Online ISBN: 978-981-99-4287-9

  • eBook Packages: Computer Science (R0)
