Abstract
Multi-task learning has become a popular paradigm for tackling multiple tasks simultaneously with reduced inference time and computational resources. Recently, many self-supervised pre-training methods have been proposed and have achieved impressive performance on a range of computer vision tasks; however, their ability to generalize to multi-task scenarios remains largely unexplored. Moreover, most multi-task algorithms are designed for specific tasks that usually fall outside the scope of autonomous driving, which makes it difficult to compare state-of-the-art multi-task learning methods in this domain. In this chapter, we divide multi-task perception in autonomous driving into 2D perception and 3D perception. For 2D perception, we extensively investigate the transfer ability of various self-supervised methods and reproduce several popular multi-task methods. We then introduce a simple and effective pretrain-adapt-finetune paradigm for multi-task learning, together with a novel adapter, LV-Adapter, which reuses the powerful knowledge of the Contrastive Language-Image Pre-training (CLIP) model pre-trained on image-text pairs. We further present an effective multi-task framework for autonomous driving, GT-Prompt, which learns general prompts and generates task-specific prompts to guide the model in capturing both task-invariant and task-specific information. For 3D perception, we investigate both multi-modality fusion and multi-task learning, and introduce an effective multi-level gradient calibration framework that operates across tasks and modalities during optimization.
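The abstract describes LV-Adapter only at a high level. As a hedged illustration of the general adapter idea it alludes to (reusing frozen pre-trained features through a small trainable module), the following is a minimal, hypothetical sketch in the style of CLIP-Adapter-like residual feature adapters; the class and parameter names (`ResidualAdapter`, `w_down`, `w_up`, `ratio`) are illustrative and are not taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class ResidualAdapter:
    """Illustrative bottleneck adapter: a small two-layer MLP transforms
    frozen backbone features, and the result is blended with the original
    features via a residual ratio (ratio=0 leaves features unchanged)."""

    def __init__(self, dim, bottleneck, ratio=0.2):
        # Small random weights stand in for learned parameters.
        self.w_down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.w_up = rng.normal(0.0, 0.02, (bottleneck, dim))
        self.ratio = ratio  # weight on the adapted features

    def __call__(self, feats):
        adapted = relu(feats @ self.w_down) @ self.w_up
        return self.ratio * adapted + (1.0 - self.ratio) * feats

# Frozen CLIP-like image features for a batch of 4 samples.
feats = rng.normal(size=(4, 512))
adapter = ResidualAdapter(dim=512, bottleneck=64)
out = adapter(feats)
print(out.shape)  # (4, 512)
```

The residual blend lets the adapted branch start close to the frozen representation and shift it only as much as training warrants, which is one common way lightweight adapters preserve pre-trained knowledge.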
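The multi-level gradient calibration framework is likewise only summarized above. As a simplified stand-in for the idea of reconciling conflicting task gradients during joint optimization, here is a PCGrad-style projection sketch; this is not the chapter's actual algorithm, and `calibrate_gradients` is a made-up helper for illustration only:

```python
import numpy as np

def calibrate_gradients(grads):
    """PCGrad-style conflict removal: for each task gradient, project out
    the component along any other task gradient it conflicts with
    (negative dot product), then average the calibrated gradients."""
    grads = [np.asarray(g, dtype=float) for g in grads]
    calibrated = []
    for i, g in enumerate(grads):
        g = g.copy()
        for j, h in enumerate(grads):
            if i == j:
                continue
            dot = g @ h
            if dot < 0:  # conflicting directions: remove the projection
                g = g - (dot / (h @ h)) * h
        calibrated.append(g)
    return np.mean(calibrated, axis=0)

g_det = np.array([1.0, 0.0])   # hypothetical detection-task gradient
g_seg = np.array([-1.0, 1.0])  # hypothetical segmentation-task gradient
g = calibrate_gradients([g_det, g_seg])
```

After calibration, the averaged update no longer opposes either task's direction, which is the basic property such gradient-surgery methods aim for.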
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Liang, X., Liang, X., Xu, H. (2023). Multi-task Perception for Autonomous Driving. In: Fan, R., Guo, S., Bocus, M.J. (eds) Autonomous Driving Perception. Advances in Computer Vision and Pattern Recognition. Springer, Singapore. https://doi.org/10.1007/978-981-99-4287-9_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4286-2
Online ISBN: 978-981-99-4287-9