Abstract
Deep learning has revolutionized the field of artificial intelligence. Building on the statistical correlations uncovered by deep learning-based methods, computer vision applications such as autonomous driving and robotics are advancing rapidly. Although such correlations are the basis of deep learning, they depend strongly on the distribution of the original data and are susceptible to uncontrolled factors. Without the guidance of prior knowledge, statistical correlations alone cannot correctly reflect the underlying causal relations and may even introduce spurious correlations. As a result, researchers are now trying to enhance deep learning-based methods with causal theory, which can model the intrinsic causal structure unaffected by data bias and thereby effectively avoid spurious correlations. This paper comprehensively reviews the existing causal methods in typical vision and vision-language tasks such as semantic segmentation, object detection, and image captioning. The advantages of causality and the approaches for building causal paradigms are summarized. Future roadmaps are also proposed, including facilitating the development of causal theory and its application in other complex scenarios and systems.
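As a small illustration of the gap between statistical correlation and causal effect described above, the following sketch applies the standard backdoor adjustment, P(Y | do(X)) = Σ_z P(Y | X, z) P(z), to a toy count table. The counts, the variable roles (z as a confounding context, x as a feature, y as a label), and the helper names are hypothetical and chosen only for illustration; they are not taken from the paper or from any method it reviews.

```python
from fractions import Fraction

# Hypothetical counts indexed by (z, x, y):
# z = confounding context, x = feature present, y = label.
# Within each z stratum, y is independent of x by construction.
COUNTS = {
    (0, 0, 0): 64, (0, 0, 1): 16,
    (0, 1, 0): 16, (0, 1, 1): 4,
    (1, 0, 0): 4,  (1, 0, 1): 16,
    (1, 1, 0): 16, (1, 1, 1): 64,
}


def total(**fixed):
    """Sum the counts of all cells that match the fixed variable values."""
    return sum(
        n
        for (z, x, y), n in COUNTS.items()
        if all({"z": z, "x": x, "y": y}[k] == v for k, v in fixed.items())
    )


def p_y1_given_x(x):
    """Observational quantity P(y=1 | x), ignoring the confounder."""
    return Fraction(total(x=x, y=1), total(x=x))


def p_y1_do_x(x):
    """Backdoor-adjusted P(y=1 | do(x)): average P(y=1 | x, z) over P(z)."""
    n = total()
    return sum(
        Fraction(total(z=z), n) * Fraction(total(z=z, x=x, y=1), total(z=z, x=x))
        for z in (0, 1)
    )


observational_gap = p_y1_given_x(1) - p_y1_given_x(0)
interventional_gap = p_y1_do_x(1) - p_y1_do_x(0)
print(observational_gap, interventional_gap)
```

In this toy table the observational gap is nonzero even though the feature has no effect on the label within either context, while the adjusted gap vanishes. This is the kind of spurious correlation that a purely correlational model could pick up from biased data.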
Additional information
This work was supported by the National Natural Science Foundation of China (Grant Nos. 62233005 and 62293502), the Programme of Introducing Talents of Discipline to Universities (the 111 Project, Grant No. B17017), the Fundamental Research Funds for the Central Universities (Grant No. 222202317006), and Shanghai AI Lab.
Cite this article
Zhang, K., Sun, Q., Zhao, C. et al. Causal reasoning in typical computer vision tasks. Sci. China Technol. Sci. 67, 105–120 (2024). https://doi.org/10.1007/s11431-023-2502-9
DOI: https://doi.org/10.1007/s11431-023-2502-9