Causal Reasoning Meets Visual Representation Learning: A Prospective Study

Visual representation learning is ubiquitous in various real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. With the emergence of huge amounts of heterogeneous multi-modal spatial, temporal, and spatial-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization has become a central challenge for existing visual models. Most existing methods tend to fit the original data/variable distributions while ignoring the essential causal relations behind the multi-modal knowledge, and there is little unified guidance or analysis on why modern visual representation learning methods easily collapse into data bias and exhibit limited generalization and cognitive abilities. Inspired by the strong inference ability of human-level agents, recent years have therefore witnessed great effort in developing causal reasoning paradigms to realize robust representation and model learning with good cognitive ability. In this paper, we conduct a comprehensive review of existing causal reasoning methods for visual representation learning, covering fundamental theories, models, and datasets. The limitations of current methods and datasets are also discussed. Moreover, we propose some prospective challenges, opportunities, and future research directions for benchmarking causal reasoning algorithms in visual representation learning. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, and bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications.


Introduction
With the emergence of huge amounts of heterogeneous multi-modal data, including images [1−3], videos [4−7], texts/languages [8−10], audios [11−14], and multi-sensor [15−19] data, deep learning based methods have shown promising performance on various computer vision and machine learning tasks, e.g., visual comprehension [20−23], video understanding [24−27], visual-linguistic analysis [28−30], and multi-modal fusion [31−33]. However, the existing methods rely heavily upon fitting the data distributions and tend to capture spurious correlations across modalities, and thus fail to learn the essential causal relations behind the multi-modal knowledge that would provide good generalization and cognitive abilities. Since most data in the computer vision community are assumed to be independent and identically distributed (i.i.d.), a substantial body of literature [34−37] has adopted data augmentation, pre-training, self-supervision, and novel architectures to improve the robustness of state-of-the-art deep neural network architectures. However, it has been argued that such strategies only learn correlation-based patterns (statistical dependencies) from data and may not generalize well without the guarantee of the i.i.d. setting [38].
Due to its powerful ability to uncover the underlying structural knowledge about data-generating processes, which allows interventions and generalizes well across different tasks and environments, causal reasoning [39−41] offers a promising alternative to correlation learning. Recently, causal reasoning has attracted increasing attention in a myriad of high-impact domains of computer vision and machine learning, such as interpretable deep learning [42−47], causal feature selection [48−60], visual comprehension [61−70], visual robustness [71−78], visual question answering [79−84], and video understanding [85−92]. A common challenge for these causal methods is how to build a strong cognitive model that can fully discover causality and spatial-temporal relations.
In this paper, we aim to provide a comprehensive overview of causal reasoning for visual representation learning, attract attention, encourage discussions, and bring to the forefront the urgency of developing novel causality-guided visual representation learning methods.
Although there are some surveys [40, 41, 93−95] on causal reasoning, these works are intended for general representation learning tasks such as deconfounding, out-of-distribution (OOD) generalization, and debiasing. In contrast, our paper focuses on a systematic and comprehensive survey of related works, datasets, insights, and future challenges and opportunities for causal reasoning, visual representation learning, and their integration. To present the review concisely and clearly, this paper selects and cites related work by considering the source, publication year, impact, and coverage of the different aspects of the topic surveyed. The overview of the structure of this paper is shown in Fig. 1. Overall, the main contributions of this paper are as follows.
Firstly, this paper presents the basic concepts of causality, the structural causal model (SCM), the independent causal mechanism (ICM) principle, causal inference, and causal intervention. Then, based on this analysis, the paper gives some directions for conducting causal reasoning on visual representation learning tasks. To the best of our knowledge, this paper is the first to propose potential research directions for causal visual representation learning.
Secondly, a prospective review is introduced to systematically and structurally survey the existing works according to their efforts along the directions outlined above for conducting causal visual representation learning more efficiently. We focus on the relation between visual representation learning and causal reasoning, provide a better understanding of why and how existing causal reasoning methods can be helpful in visual representation learning, and offer inspiration for future research and studies.
Thirdly, this paper explores and discusses future research areas and open problems related to using causal reasoning methods to tackle visual representation learning. This can encourage and support the broadening and deepening of research in the related fields.
The remainder of this paper is organized as follows. Section 2 provides the preliminaries, including the basic concepts of causality, the SCM, the ICM principle, causal inference, and causal intervention. Section 3 discusses ways to use causal reasoning to learn robust features, which is a key technique for visual representation learning. Section 4 reviews some recent visual learning tasks, including visual understanding, action detection and recognition, and visual question answering, together with discussions of the existing challenges of these visual learning methods. Section 5 systematically reviews the related causality-based visual representation learning works. Section 6 provides a review of existing causal datasets for visual learning. Section 7 proposes and discusses some future research directions, and finally, Section 8 concludes the paper.

Causal learning and reasoning
As the saying "correlation is not causation" goes, the fact that two variables are correlated does not mean that one of them causes the other. Statistical learning models the correlations in data: by observing a sufficient amount of i.i.d. data, a statistical learning method can perform considerably well under i.i.d. settings. However, when facing problems that do not satisfy the i.i.d. assumption, the performance of these methods often degrades (e.g., image recognition models tend to predict "bird" when seeing "sky" in an image, since birds and sky usually appear simultaneously in the dataset). Causal learning [39] differs from statistical learning in that it aims to discover causal relationships beyond statistical relations. Learning causality requires machine learning methods not only to predict the outcome of i.i.d. experiments but also to reason from a causal perspective. Causal reasoning can be divided into three levels. The first level is association; the statistical machine learning methods mentioned above belong to this level. A typical question of association is "How would the weather change when the sky is turning grey?", which asks about the association between "weather" and "the appearance of the sky". The second level is intervention. An intervention-based question asks about the effect of an intervention (e.g., "Would I become stronger if I go to the gym every day?"). Intervention-based questions require us to answer what the outcome would be when taking a specific treatment, which cannot be answered by only learning data associations (e.g., if we only learn associations and observe that a man who goes to the gym every day may not be stronger than a professional athlete, we may conclude that going to the gym does not always make you stronger).

Fig. 1 Overview of the structure of this paper, including the discussion of related methods, datasets, challenges, and the relations among causal reasoning, visual representation learning, and their integration
The third level is counterfactual. A typical form of a counterfactual question is "What if I had ···", which focuses on the outcome when the condition was not realized. Counterfactual inference aims to compare different outcomes under the same condition, where the antecedent of the counterfactual question did not actually occur.

Structural causal model

The SCM gives a mathematical formulation of causality. Assume that we have a set of variables X_1, X_2, ..., X_n, where each variable is a vertex of a causal graph (i.e., a directed acyclic graph (DAG) that describes the causal relations among the variables). Then, each variable can be written as the outcome of a function:

X_i = f_i(PA_i, U_i), i = 1, 2, ..., n    (1)

where PA_i indicates the parents of X_i in the causal graph, and U_i refers to unmeasured factors such as noise. The deterministic function f_i gives a mathematical form of the effect of the direct causes PA_i on the variable X_i. Using the graphical causal model and the SCM language, we can express joint distributions as follows:

P(X_1, X_2, ..., X_n) = ∏_i P(X_i | PA_i).    (2)

Equation (2) is called a product decomposition of the joint distribution. After the decomposition and graphical modeling, the causal relations and effects in a dataset can be represented by the causal graph and the joint distribution.
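As a concrete illustration, the product decomposition in (2) can be evaluated directly once the causal graph and its conditional probability tables are given. The sketch below uses a made-up three-variable chain Z → X → Y over binary variables; all names and probabilities are illustrative, not taken from the survey:

```python
import itertools

# Toy SCM: the chain Z -> X -> Y, given as parent sets plus conditional
# probability tables (CPTs). All numbers here are illustrative.
parents = {"Z": [], "X": ["Z"], "Y": ["X"]}

# cpt[var][parent_values] = P(var = 1 | parents)
cpt = {
    "Z": {(): 0.4},
    "X": {(0,): 0.2, (1,): 0.9},
    "Y": {(0,): 0.3, (1,): 0.7},
}

def prob(assignment):
    """Joint probability via the product decomposition: prod_i P(X_i | PA_i)."""
    p = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p1 = cpt[var][pa_vals]                    # P(var = 1 | parents)
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p

# Sanity check: the factorized joint sums to 1 over all 2^3 assignments.
total = sum(prob(dict(zip(parents, vals)))
            for vals in itertools.product([0, 1], repeat=len(parents)))
```

Any joint probability query then reduces to multiplying one factor per variable, which is the computational benefit of the decomposition.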

Independent causal mechanism
The independent causal mechanism principle [40] can be expressed as follows.

ICM principle. The causal generative process of a system′s variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other conditional distributions.
The ICM principle describes the independence of causal mechanisms. If we conceive that the real world is composed of modules in variable styles, then the modules can represent the physically independent mechanisms of the world. When applying the ICM principle to the disentangled factorization (2), it can be stated as [40]:

1) Changing (or performing an intervention upon) one mechanism P(X_i | PA_i) does not change any of the other mechanisms P(X_j | PA_j) (j ≠ i).
2) Knowing some other mechanisms P(X_j | PA_j) (j ≠ i) does not give us information about the mechanism P(X_i | PA_i).

The ICM principle guarantees that our intervention on one mechanism does not affect the others, which further reveals the possibility of transferring knowledge across domains that share the same modules.
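The transferability implied by the ICM principle can be checked on a toy simulation: shifting the distribution of a root cause (a domain change) leaves the downstream mechanism unchanged and estimable in either domain. The chain structure Z → X → Y and all numbers below are illustrative assumptions:

```python
import random

random.seed(0)

# Fixed mechanisms of the chain Z -> X -> Y; only the "environment" P(Z)
# differs between the two domains below. All numbers are illustrative.
pX1_given_Z = {0: 0.2, 1: 0.9}   # P(X = 1 | Z)
pY1_given_X = {0: 0.3, 1: 0.7}   # P(Y = 1 | X), the mechanism of interest

def estimate_mechanism(p_z1, n=200_000):
    """Estimate P(Y = 1 | X = x) from samples drawn with P(Z = 1) = p_z1."""
    counts = {0: [0, 0], 1: [0, 0]}   # counts[x] = [n_x, n_{x and y=1}]
    for _ in range(n):
        z = 1 if random.random() < p_z1 else 0
        x = 1 if random.random() < pX1_given_Z[z] else 0
        y = 1 if random.random() < pY1_given_X[x] else 0
        counts[x][0] += 1
        counts[x][1] += y
    return {x: c[1] / c[0] for x, c in counts.items()}

mech_a = estimate_mechanism(p_z1=0.1)   # domain A
mech_b = estimate_mechanism(p_z1=0.9)   # domain B: P(Z) has been intervened upon
```

The estimated mechanism P(Y = 1 | X) agrees across the two domains (up to sampling noise), which is exactly the modularity that makes a causal mechanism reusable after a domain shift.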

Causal inference
The purpose of causal inference is to estimate the outcome shift (or effect) of different treatments. Let T denote a treatment, i.e., an action that applies to a unit. For example, if we have a medicine A, let T = 1 denote applying medicine A and T = 0 denote not applying it; then T is a treatment, and the recovery of the patient is the outcome Y of the treatment T. Under this condition, the aim of causal inference is to uncover the effect of applying treatment T. A counterfactual outcome is the potential outcome of an action that has not been taken. For example, if we take treatment T = 1, then the outcome of T = 0 is counterfactual. The average treatment effect (ATE) of treatment T can then be written as

ATE = E[Y(T = 1)] − E[Y(T = 0)]    (3)

where Y(T = t) denotes the potential outcome under treatment T = t. If we have taken treatment T = 1, then Y(T = 0) is the counterfactual outcome. The goal of causal inference is to estimate the treatment effects given observational data, which are usually incomplete in real-world scenarios due to cost and ethical constraints. From a counterfactual perspective, we cannot obtain the no-treatment outcome of a unit once the treatment has been applied to it. Thus, we need causal inference to analyze the effect of a certain treatment.
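A minimal simulation (with made-up potential outcomes) illustrates the fundamental problem: each unit reveals only one of its two potential outcomes, yet under randomized treatment assignment the difference of observed means still recovers the ATE:

```python
import random

random.seed(1)

# Each unit has two potential outcomes Y(T=0) and Y(T=1), but we only ever
# observe the one matching the treatment actually taken. The numbers are
# made up: the true ATE is fixed at 2.0 by construction.
n = 20_000
y0 = [random.gauss(0.0, 1.0) for _ in range(n)]   # potential outcome without treatment
y1 = [v + 2.0 for v in y0]                        # potential outcome with treatment

true_ate = sum(b - a for a, b in zip(y0, y1)) / n  # = 2.0 by construction

# Randomized assignment: the difference of observed means estimates the ATE,
# even though no unit contributes both potential outcomes.
t = [random.random() < 0.5 for _ in range(n)]
treated = [y1[i] for i in range(n) if t[i]]
control = [y0[i] for i in range(n) if not t[i]]
est_ate = sum(treated) / len(treated) - sum(control) / len(control)
```

Without randomization (e.g., if sicker patients take the medicine more often), the same difference of means would be confounded, which is what the adjustment strategies of the next section address.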

Causal intervention
Causal intervention for machine learning aims to capture the causal effects of interventions on variables and to take advantage of the causal relations in datasets to improve model performance and generalization ability. The basic idea of causal intervention is to use an adjustment strategy that modifies the graphical model and manipulates conditional probabilities to discover the causal relationships among variables. In this section, we review two adjustment strategies: back-door adjustment and front-door adjustment.

Back-door adjustment

Assume that we want to gauge the causal effect of X on Y. By Bayes′ rule, we have

P(Y | X) = Σ_z P(Y | X, z) P(z | X).    (4)

This conditional distribution cannot represent the true causal effect of X on Y, due to the existence of the back-door path X ← Z → Y. The variable Z here is a confounder that affects not only the pre-intervention variable X but also the outcome Y, which makes the conditional distribution a collective effect of X and Z and thus leads to spurious correlation. To eliminate the spurious correlation introduced by the back-door path, the back-door adjustment uses the do-operator to calculate the intervened probability P(Y | do(X)) instead of the conditional probability P(Y | X):

P(Y | do(X)) = Σ_z P(Y | X, z) P(z)    (5)

where the conditional distribution P(z | X) is replaced with the marginal distribution P(z), as illustrated in Fig. 2.
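For discrete variables, the back-door formula can be verified numerically against the ordinary conditional distribution. The confounded toy model below (confounder Z, treatment X, outcome Y) uses made-up probabilities:

```python
import numpy as np

def backdoor(joint, x):
    """P(Y | do(X = x)) = sum_z P(Y | x, z) P(z), for a joint table joint[x, z, y]."""
    p_z = joint.sum(axis=(0, 2))                                   # marginal P(Z)
    p_y_given_xz = joint[x] / joint[x].sum(axis=1, keepdims=True)  # P(Y | x, z), shape (z, y)
    return p_z @ p_y_given_xz                                      # vector over y

def conditional(joint, x):
    """Observational P(Y | X = x) for comparison."""
    return joint[x].sum(axis=0) / joint[x].sum()

# Toy SCM with confounder Z -> X, Z -> Y and effect X -> Y; numbers illustrative.
p_z = np.array([0.7, 0.3])
p_x1_given_z = np.array([0.2, 0.7])        # P(X = 1 | z)
p_y1 = np.array([[0.1, 0.5], [0.4, 0.9]])  # P(Y = 1 | x, z)

joint = np.zeros((2, 2, 2))
for x in range(2):
    for z in range(2):
        p_xz = p_z[z] * (p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z])
        joint[x, z, 1] = p_xz * p_y1[x, z]
        joint[x, z, 0] = p_xz * (1 - p_y1[x, z])
```

On these numbers, the observational P(Y = 1 | X = 1) is 0.70 while the interventional P(Y = 1 | do(X = 1)) is 0.55: conditioning on X mixes in the confounder's effect, whereas the do-probability reweights by the marginal P(z).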

Front-door adjustment
The back-door criterion may not be satisfied in some causal graph patterns (e.g., no back-door path exists in the causal graph, or the variables that block the back-door paths are unobserved). In such cases, the front-door adjustment can be applied to estimate causal effects. As Fig. 3 shows, assume that the confounder Z is an unobserved variable; the back-door adjustment then becomes invalid because the marginal distribution P(z) is not observed. However, if we have an observed mediator variable M on the front-door path X → M → Y, then we can identify the effect of X on M directly, since the back-door path from X to M, X ← Z → Y ← M, is blocked by the collider at Y:

P(M | do(X)) = P(M | X).    (6)

Note that there is a back-door path from M to Y, M ← X ← Z → Y, which can be blocked by applying the back-door adjustment on X:

P(Y | do(M)) = Σ_x P(Y | M, x) P(x).    (7)

The total effect of X on Y can then be written by summing over M:

P(Y | do(X)) = Σ_m P(m | do(X)) P(Y | do(m)).    (8)

Then, the front-door adjustment formulation is obtained by applying (6)−(8):

P(Y | do(X)) = Σ_m P(m | X) Σ_x P(Y | m, x) P(x).    (9)

The front-door adjustment identifies the effect of X on Y by applying the do-operator twice, once at the mediator variable M and once at the variable X that blocks the back-door path from M to Y. In this way, the unobserved confounder Z can be bypassed in the intervention.
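The front-door computation can likewise be checked on a small discrete example in which the confounder Z is deliberately hidden: only the observed joint over (X, M, Y) is used, yet the formula recovers the interventional distribution computed from the full mechanisms. All numbers are illustrative:

```python
import numpy as np

def frontdoor(joint, x):
    """P(Y | do(X=x)) = sum_m P(m|x) sum_x' P(Y|m,x') P(x'), for joint[x, m, y]."""
    p_x = joint.sum(axis=(1, 2))                          # marginal P(X)
    p_m_given_x = joint[x].sum(axis=1) / joint[x].sum()   # P(M | X = x)
    p_y = np.zeros(joint.shape[2])
    for m in range(joint.shape[1]):
        # P(Y | m, x'), shape (x', y)
        p_y_given_mx = joint[:, m, :] / joint[:, m, :].sum(axis=1, keepdims=True)
        p_y += p_m_given_x[m] * (p_x @ p_y_given_mx)
    return p_y

# Toy SCM: hidden confounder Z -> X and Z -> Y, with mediator X -> M -> Y.
# The observed joint over (X, M, Y) marginalizes Z out; numbers illustrative.
p_z = np.array([0.5, 0.5])
p_x1_given_z = np.array([0.2, 0.8])        # P(X = 1 | z)
p_m1_given_x = np.array([0.3, 0.7])        # P(M = 1 | x)
p_y1 = np.array([[0.1, 0.4], [0.5, 0.8]])  # P(Y = 1 | m, z)

joint = np.zeros((2, 2, 2))
for z in range(2):
    for x in range(2):
        for m in range(2):
            p = (p_z[z]
                 * (p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z])
                 * (p_m1_given_x[x] if m == 1 else 1 - p_m1_given_x[x]))
            joint[x, m, 1] += p * p_y1[m, z]
            joint[x, m, 0] += p * (1 - p_y1[m, z])
```

On these numbers, P(Y = 1 | do(X = 1)) comes out to 0.53 and P(Y = 1 | do(X = 0)) to 0.37, matching the values computed directly from the (normally unobservable) mechanisms involving Z.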

Back-door or front-door?
The back-door adjustment requires us to determine what the confounder is in advance. Thus, the back-door adjustment is effective when the confounder is observable. However, in visual domains, data biases are complex, and it is hard to identify and disentangle the different types of confounders. This is especially true for challenging tasks like visual-linguistic question reasoning, where the confounders in the visual and linguistic modalities are not always observable. Therefore, the front-door causal intervention gives a feasible way to calculate P(Y | do(X)) when we cannot explicitly represent the confounder.

Causality-aware feature learning
Traditional feature learning methods usually learn spurious correlations introduced by confounders. This reduces the robustness of models and makes them hard to generalize across domains. Causal reasoning, a learning paradigm that reveals the real causality behind the outcome, overcomes this essential defect of correlation learning and learns robust, reusable, and reliable features. In this section, we review recent representative causal reasoning methods for general feature learning, which fall into three main paradigms: 1) embedding a structural causal model (SCM), 2) applying causal intervention/counterfactual inference, and 3) Markov boundary (MB) based feature selection.
For embedding the SCM, Mitrovic et al. [96] proposed representation learning via invariant causal mechanisms (RELIC) to address self-supervised learning problems and achieved competitive performance in terms of robustness and out-of-distribution generalization on ImageNet. Shen et al. [97] proposed a disentangled generative causal representation (DEAR) learning method for causal controllable generation on both synthesized and real datasets.
To apply causal intervention or counterfactual inference for feature learning, Huang et al. [63] proposed a causal intervention-based deconfounded visual grounding method to eliminate the confounding bias. Zhang et al. [65] presented a causal inference based weakly-supervised semantic segmentation framework. Tang et al. [67] presented a causal inference framework that disentangles the paradoxical effects of momentum to remove the confounder in long-tailed classification. Chen et al. [84] proposed a counterfactual critic multi-agent training (CMAT) approach to learn the visual context properly.
Causal feature selection aims to find a subset of features from a large number of predictive features, both to reduce the computational cost and to build predictive models for variables of interest. Recent causality-based feature selection methods use the Bayesian network (BN) and the Markov boundary (MB) to identify potential causal features. A BN is a DAG representing the causal relations between variables, and the MB implies the local causal relationships between the class variable and the features in its MB. Since the BN of all variables may be very large and hard to compute, current causality-based methods focus on identifying the MB of a variable of interest or a subset of the MB. For example, Wu et al. [59] introduced the PCMasking concept to explain a type of incorrect conditional independence (CI) tests in MB discovery and proposed a cross-check and complement MB discovery (CCMB) algorithm to solve the incorrect test problem. Yu et al. [60] presented theoretical analyses of the conditions for MB discovery in multiple interventional datasets and designed an algorithm for learning MBs from multiple interventional datasets. Yu et al. [58] formulated the causal feature selection problem with multiple datasets as a search problem, gave the upper and lower bounds of the invariant set, and then proposed a multi-source feature selection algorithm. Yang et al. [55] proposed the concept of N-structures and then designed an MB discovery subroutine that integrates MB learning with N-structures to discover the MB while distinguishing direct causes from direct effects. Yu et al. [53] proposed a multi-label feature selection algorithm, multi-label feature selection to causal structure learning (M2LC), which learns the causal mechanism behind the data and is able to select causally informative features and visualize common features. Guo et al. [51] proposed an error-aware Markov blanket learning algorithm to solve the conditional independence test error problem in causal feature selection. Ling et al. [57] proposed an efficient local causal structure learning algorithm, local causal structure learning by feature selection (LCS-FS), which speeds up parent and children discovery by employing feature selection without searching for conditioning sets. Yu et al. [50] proposed a multiple imputation MB framework (MimMB) for causal feature selection with missing data. MimMB integrates data imputation with MB learning in a unified framework to enable the two key components to engage with each other.
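As a sketch of how MB discovery works in the linear-Gaussian case, the classic grow-shrink scheme adds variables that are conditionally dependent on the target and then prunes false positives, with conditional independence tested here via a Fisher-z test on partial correlations. This is a simplified illustration on synthetic data, not a reimplementation of any of the surveyed algorithms:

```python
import numpy as np

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in `cond`."""
    x, y = data[:, i], data[:, j]
    if cond:
        Z = np.column_stack([data[:, k] for k in cond] + [np.ones(len(data))])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # residualize on Z
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(x, y)[0, 1]

def dependent(data, i, j, cond, thresh=4.0):
    """Fisher-z conditional (in)dependence test for Gaussian data (conservative threshold)."""
    r = np.clip(partial_corr(data, i, j, cond), -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r))
    return abs(z) * np.sqrt(len(data) - len(cond) - 3) > thresh

def grow_shrink_mb(data, target):
    """Grow-shrink Markov boundary discovery (simplified sketch)."""
    others = [i for i in range(data.shape[1]) if i != target]
    mb, changed = [], True
    while changed:                         # grow phase: add dependent variables
        changed = False
        for i in others:
            if i not in mb and dependent(data, target, i, mb):
                mb.append(i)
                changed = True
    for i in list(mb):                     # shrink phase: prune false positives
        rest = [j for j in mb if j != i]
        if not dependent(data, target, i, rest):
            mb.remove(i)
    return sorted(mb)

# Synthetic linear-Gaussian data: T = X0 + X1 + noise, Y = T + noise,
# X3 independent of everything. The MB of T is {X0, X1, Y} (parents and child).
rng = np.random.default_rng(0)
n = 5000
x0, x1, x3 = rng.normal(size=(3, n))
t = x0 + x1 + 0.5 * rng.normal(size=n)
y = t + 0.5 * rng.normal(size=n)
data = np.column_stack([x0, x1, t, y, x3])   # columns: 0:X0 1:X1 2:T 3:Y 4:X3
mb_of_t = grow_shrink_mb(data, target=2)
```

Real MB discovery algorithms such as those surveyed above add machinery for test reliability, spouses, and efficiency, but the grow-then-shrink skeleton is the same.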
Finding causal features improves the explanatory capability and robustness of models. Causal feature selection methods can provide a more convincing explanation for prediction than correlation-based methods. As the ICM principle implies, the underlying mechanism of the class variable can be learned from causal relations and thus can be transferred across different settings or environments. Although the existing causal feature learning methods achieve promising performance, most of them focus on general feature learning without considering a more specific problem, visual representation learning.

Visual representation learning: State-of-the-art
Visual representation learning has made great progress in recent years. It utilizes spatial and/or temporal information to complete specific tasks, including visual understanding (object detection, scene graph generation, visual grounding, visual commonsense reasoning), action detection and recognition, and visual question answering. In this section, we introduce these representative visual learning tasks and discuss the existing challenges and the necessity of applying causal reasoning to visual representation learning.
Object detection aims to determine where objects are located in a given image (object localization), to determine which category each object belongs to (object classification), and to label the objects with rectangular bounding boxes (BBs) together with confidence scores. Deep learning frameworks for image object detection fall into two types. The first type follows the traditional object detection pipeline, generating region proposals first and then classifying each proposal into an object class. The other type treats object detection as a regression or classification problem and adopts a unified framework to directly obtain the final predictions (category and location). Region proposal-based methods mainly include R-CNN [98], spatial pyramid pooling (SPP-Net) [99], Fast R-CNN [100], Faster R-CNN [101], feature pyramid network (FPN) [102], region-based fully convolutional network (R-FCN) [103], and Mask R-CNN [104], some of which are interrelated (e.g., SPP-Net modifies R-CNN with an SPP layer). Regression/classification-based methods mainly include MultiBox [105], AttentionNet [106], G-CNN [107], YOLO [108], single shot MultiBox detector (SSD) [109], YOLOv2 [110], deeply supervised object detector (DSOD) [111], and deconvolution single shot detector (DSSD) [112]. The two pipelines are connected by the anchors introduced in Faster R-CNN. In video salient object detection, extending state-of-the-art saliency detectors from images to videos is challenging. Li et al. [113] presented a flow-guided recurrent neural encoder (FGRNE), which enhances the temporal coherence of the per-frame features by exploiting both motion information in terms of optical flow and sequential feature evolution encoding in terms of long short-term memory (LSTM) networks. Li et al. [114] developed a multi-task motion-guided video salient object detection network, which learns to accomplish two sub-tasks using two sub-networks, one for salient object detection in still images and the other for motion saliency detection in optical flow images. Yan et al. [115] presented an effective video saliency detector that consists of a spatial refinement network and a spatiotemporal module. By utilizing the generated pseudo-labels together with a part of the manual annotations, the detector can learn spatial and temporal cues for both contrast inference and coherence enhancement. For video salient object detection, how to effectively take object motion into consideration and obtain robust spatial-temporal information is crucial. However, factors such as non-object regions, occlusion, motion blur, and lens movement make it hard for a model to concentrate on the truly interesting object area.
Scene graph generation (SGG) aims to describe object instances and the relations between objects in a scene. With its powerful representation ability, SGG can encode images [116,117] and videos [118,119] as abstract semantic elements without any restrictions on the attributes, types, and relations between objects. The task of SGG is therefore to build a graph structure that associates its nodes and edges well with the objects in the scene and their relations, where the key challenge is to detect/recognize the relations between objects. Currently, SGG methods can be divided into two classes: 1) methods using facts alone and 2) methods introducing prior information. Most attention has been paid to methods using facts alone, including CRF-based (conditional random field) SGG [117,120], VTransE-based (visual translation embedding) SGG [121,122], RNN/LSTM-based SGG [123,124], Faster R-CNN-based SGG [125,126], graph neural network (GNN) based SGG [127,128], etc. Other SGG methods add different types of prior information, such as language priors [129], knowledge priors [130,131], visual contextual information [132], visual cues [133], etc. Fig. 4 shows the related work on SGG; it can be clearly seen that most methods use a GNN model or introduce relevant prior information when conducting SGG. Existing SGG methods are still far from building a practical knowledge base, and there exists a serious conditional distribution bias of the relationship in SGG methods. For example, knowing that the subject and object are "person" and "head", it is easy to guess that the relationship is that the person has the head.

Visual grounding usually involves two modalities, visual and linguistic data. This task aims to locate the target object in an image according to the corresponding object description (title or description) and the given image.
When locating the target object, it is necessary to understand the input description and to integrate the information of the visual modality for localization prediction. Currently, visual grounding methods can be classified into three types: fully supervised [134−141], weakly supervised [142], and unsupervised [143]. First, fully supervised methods rely on box annotations with object-phrase correspondences. These methods can be further divided into two-stage methods [135,136,138,141] and one-stage methods [139]. The two-stage approach extracts candidate proposals and their features in the first stage through a region proposal network (RPN) [100] or traditional algorithms (EdgeBoxes [144], Selective Search [145]). Second, weakly supervised methods [146,147] only have images and the corresponding sentences, with no box annotations for the object-phrases in the sentences. Due to the lack of mappings between phrases and boxes, weakly supervised methods additionally design many loss functions, such as reconstruction losses, losses introducing external knowledge, and losses based on image-caption matching. Third, unsupervised methods use no image-sentence information. Wang and Specia [148] used off-the-shelf approaches to detect objects, scenes, and colors in images and explored different approaches to measuring the semantic similarity between the categories of detected visual elements and the words in phrases. To locate the object instance described by a natural language referring expression in an image, several referring expression comprehension methods have been proposed. Yang et al. [149] proposed a dynamic graph attention network to perform multi-step reasoning by modeling the relationships among the objects in the image and the linguistic structure of the expression. Yang et al. [134] proposed a cross-modal relationship extractor (CMRE) to adaptively highlight objects and relationships with a cross-modal attention mechanism, and represented the extracted information as a language-guided visual relation graph. Furthermore, Yang et al. [22] proposed a cross-modal relationship extractor to adaptively highlight objects and relationships (spatial and semantic relations) related to the given expression with a cross-modal attention mechanism, and represented the extracted information as a language-guided visual relation graph. Yang et al. [150] proposed a scene graph-guided modular network (SGMN), which performs reasoning over a semantic graph and a scene graph with neural modules under the guidance of the linguistic structure of the expression. However, due to the existence of linguistic and visual biases, most visual grounding models are heavily dependent on specific datasets, without good transfer ability or generalization performance.
Due to the success of BERT-related models in the field of NLP, researchers have begun to focus on a more challenging multi-modal reasoning task, visual commonsense reasoning (VCR). Given an image containing a series of labeled bounding boxes, the VCR task needs to combine the image information with the understanding of questions, and to obtain the correct answer as well as the reasoning process based on commonsense. In general, VCR can be divided into two sub-tasks: the Q→A task, choosing an answer based on the question; and the QA→R task, reasoning based on the question and the answer, explaining why the answer was chosen. Due to the challenging nature of VCR, there are relatively few existing studies. Some of them resort to designing specific model architectures [151−155]. Recognition to cognition networks (R2C) [151] implemented this task with a three-step approach: associating text with the objects involved, linking answers with the corresponding questions and objects, and finally reasoning about the shared representations. Inspired by brain neuron connectivity, CCN [152] dynamically modeled the visual neuron connectivity, which is contextualized by the queries and responses. Heterogeneous graph learning (HGL) [153] leveraged visual answering and dual question answering heterogeneous graphs to seamlessly connect vision and language. Zhang et al. [155] proposed a multi-level counterfactual contrastive learning network for VCR by jointly modeling the hierarchical visual contents and the inter-modality relationships between the visual and linguistic domains. Recently, BERT-based pre-training methods have been extensively explored in the vision and language domains. In general, most of them adopt a pre-training-then-transfer scheme and achieve significant performance improvements on the VCR benchmark [156−158]. These models are usually pre-trained on large-scale multi-modal datasets (e.g., Conceptual Captions [159]) and then fine-tuned on VCR.
At present, the promising performance on VCR is generally attributed to pre-trained big models and prior external knowledge. Compared with simpler vision-linguistic tasks, the introduction of external knowledge brings new challenges: 1) how to retrieve the limited supporting knowledge from external knowledge bases that contain massive data; 2) how to effectively integrate external knowledge with visual and linguistic features; and 3) the reasoning process that provides interpretability needs supporting facts, which depend heavily on the design of the language structure.
The task of action detection and recognition includes two aspects: one is to identify all action instances in a video, and the other is to localize actions spatially and temporally. Nowadays, spatial-temporal action detection and recognition models can be divided into two categories: the first [6, 7, 160−167] models spatial-temporal relationships based on convolutional neural networks (CNNs), and the other [168−171] is based on video transformer structures. Besides, skeleton-based models [172−175] have recently attracted great attention. Sun et al. [161] proposed an actor-centric relation network (ACRN), which uses a two-stream structure to extract the central character feature and the global background information from the input clip, and then performs feature fusion for action classification. Feichtenhofer et al. [165] proposed a two-stream model named SlowFast networks that contains a Slow pathway and a Fast pathway. Bertasius et al. [169] extended the ViT [176] design to video by proposing several scalable schemes for space-time self-attention. Arnab et al. [177] proposed pure-transformer architectures for video classification, including several variants of the model obtained by factorizing the spatial and temporal dimensions of the input video. Although great progress has been made in spatial-temporal action detection and recognition based on CNN or transformer models, there remain some critical problems in terms of the robustness and transferability of the models. Existing action detection and recognition models rely heavily on scenes and objects: when a model is well-trained on one dataset, it is hard to generalize it to another dataset with different scenes. Additionally, the methods easily focus on static appearance or background information rather than the true motion area, due to the essentially correlation-based learning in most of the models.
This may harm the reliability of the model, as well as the robustness of the learned spatial-temporal representations. Causal reasoning has the powerful ability to uncover the underlying structural knowledge about human actions, helping to build a strong cognitive model that can fully discover causality and spatial-temporal relations.
Visual question answering (VQA) is a vision-language task that has received much attention recently. The objective of VQA is: given an image/video and a related question, a machine needs to reason over visual elements and general knowledge to infer the correct answer. The attention mechanism is widely used in VQA models, aiming to focus on the critical parts of the image and question and to find cross-modality correlations. The UpDn [178] framework is a typical conventional attention-based VQA method, which uses a top-down attention LSTM [179] for the fusion of visual and linguistic features. Besides LSTM, the transformer [180] can also be adapted to the VQA task, thanks to its powerful scaled dot-product attention block. Visual-language pre-training (VLP) models based on BERT [181] show remarkable performance on the VQA task. ViLBERT [156] is a BERT-based visual and language pre-training framework, which uses a self-attention transformer block [180] to model in-modality relations and develops a co-attention transformer block to compute cross-modality attention scores; it achieved the then state-of-the-art on four visual-language tasks including VQA. Compared with image QA [29,30,178,182,183], the video question answering (VideoQA) task is much more challenging due to the extra temporal information. To solve the VideoQA problem, the model needs to capture spatial, temporal, visual, and linguistic relations to reason about the answer. To explore relational reasoning in VideoQA, Xu et al. [184] proposed an attention mechanism to exploit appearance and motion knowledge with the question as guidance. Later on, hierarchical attention and co-attention based methods were proposed to learn appearance-motion and question-related multi-modal interactions. Le et al. [185] proposed a hierarchical conditional relation network (HCRN) to construct sophisticated structures for representation and reasoning over videos.
Jiang and Han [186] introduced a heterogeneous graph alignment (HGA) network. Huang et al. [187] proposed a location-aware graph convolutional network to reason over detected objects. Lei et al. [188] employed sparse sampling to build a transformer-based model named ClipBERT and achieve end-to-end video-and-language understanding. Liu et al. [189] proposed a hierarchical visual-semantic relational reasoning (HAIR) framework to perform hierarchical relational reasoning. Although hierarchical attention mechanisms successfully improve visual-language task performance, these models still rely strongly on modality bias [28,190] and tend to capture spurious linguistic or visual correlations within the images/videos, and thus fail to learn multi-modal knowledge with good generalization ability and interpretability.

Causality-aware visual representation learning
According to the above-discussed visual representation learning methods, current machine learning, especially representation learning, faces several challenges: 1) lack of interpretability, 2) poor generalization ability, and 3) over-reliance on correlations in the data distribution. Causal reasoning offers a promising alternative to address these challenges. The discovery of causality helps to uncover the causal mechanism behind the data, allowing the machine to better understand why, and to make decisions through intervention or counterfactual reasoning.
Since Section 3 has reviewed recent causal reasoning methods for general feature learning, it provides a good theoretical basis for further research on causal reasoning for specific visual representation learning tasks. In this section, we summarize recent approaches for causal visual representation learning, as shown in Table 1. Causal visual representation learning is an emerging research topic that has appeared since the 2020s. The related tasks can be roughly categorized into three main aspects: 1) causal visual understanding, 2) causal visual robustness, and 3) causal visual question answering. In this section, we discuss these three representative causal visual representation learning tasks.

Causal visual understanding
Visual understanding contains several tasks, such as object detection, scene graph generation (SGG), visual grounding, and visual commonsense reasoning (VCR). However, these tasks face several challenges. 1) For image/video salient object detection, non-object regions, occlusion, motion blur, and lens movement make it hard for the model to concentrate on the object region of interest. To this end, causal reasoning can make the model focus on the essential object region by learning robust and reliable visual representations. 2) The SGG problem suffers from superficial bias and insufficient generalization ability, which causal reasoning can mitigate well. For example, a towel is used for bathing in the bathroom but for washing the face in the office. Introducing causal reasoning into SGG can generalize the functionality of an item to different scenarios. 3) Due to linguistic and visual biases, most visual grounding models depend heavily on specific datasets and lack transferability and generalization. This problem can be mitigated by causal reasoning methods, which learn robust and transferable features to reduce the visual and linguistic biases. 4) For visual commonsense reasoning, linguistic biases may directly affect reasoning performance. The superficial correlations captured by existing VCR models can generally be mitigated by introducing causality, which integrates external knowledge with visual and linguistic features into a robust and discriminative representation space. Non-causal visual understanding methods are easily affected by confounders in visual content: illumination, position, background, co-occurrence of objects, and other visual factors are confounders that are inevitable in common settings. With traditional correlation learning, the spurious correlations introduced by these confounders degrade the robustness of representation learning.
For example, since the co-occurrence of "bird" and "sky" is high, a model would learn a strong correlation between them. Thus, when seeing a picture of a floating balloon that also contains "sky", it would confidently predict that it is a picture of a bird.
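The bird-and-sky shortcut can be reproduced in a few lines. The sketch below is a toy illustration (all numbers and feature names are invented, not taken from any cited work): a plain logistic regression trained on confounded data assigns more weight to the "sky" shortcut than to the noisier causal feature, and then confidently calls a wingless object in the sky a bird.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: feature 0 = "has wings" (causal but noisy, e.g. occlusion),
# feature 1 = "sky background" (a shortcut that merely co-occurs with the
# "bird" label in the biased training set).
n = 2000
y = rng.integers(0, 2, n)                        # 1 = bird, 0 = not bird
wings = np.where(rng.random(n) < 0.80, y, 1 - y) # agrees with label 80%
sky = np.where(rng.random(n) < 0.95, y, 1 - y)   # co-occurs 95% of the time
X = np.stack([wings, sky], axis=1).astype(float)

# Plain logistic regression fitted by gradient descent: pure correlation
# learning, with no notion of which feature is causal.
w, b = np.zeros(2), 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 1.0 * (X.T @ g) / n
    b -= 1.0 * g.mean()

# A floating balloon: no wings, but photographed against the sky.
balloon = np.array([0.0, 1.0])
p_bird = 1.0 / (1.0 + np.exp(-(balloon @ w + b)))
print(f"shortcut weight {w[1]:.2f} > causal weight {w[0]:.2f}")
print(f"P(bird | wingless object in the sky) = {p_bird:.2f}")
```

The stronger (but non-causal) co-occurrence wins under pure likelihood fitting, which is exactly the failure mode that intervention-based training is meant to remove.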
Causal reasoning provides a good solution to the above problem. By replacing the conditional distribution with the intervened distribution, the spurious correlation can be eliminated and the machine can learn the real causality. Applying intervention in the training procedure is a widely used implementation of causal intervention. In the visual recognition task, Wang et al. [61] combined adversarial training with causal intervention: they modeled the different causal effects of mediators and confounders and designed an adversarial training pipeline to enhance the effect of mediators while suppressing that of confounders. Yue et al. [62] applied counterfactual inference to zero-shot and open-set visual recognition by proposing a generative causal model to generate counterfactual samples. A confounding effect also exists in the visual grounding task; Huang et al. [63] proposed a deconfounded visual grounding framework by conducting interventions on linguistic features. For the weakly-supervised semantic segmentation task, Zhang et al. [65] used a structural causal model to formulate the causalities between components, then constructed a confounder set and removed confounders by back-door adjustment. Prior bias is also a non-trivial problem in the scene graph generation (SGG) task. To reduce the negative impact of training bias in scene graph generation, Tang et al. [66] built a causal graph and extracted counterfactual causality from the trained graph to infer the causal effects of the training bias and then remove the negative bias. Besides, causal reasoning can replace traditional re-weighting and re-sampling methods in resolving long-tailed distribution problems. Tang et al. [67] showed that the momentum in stochastic gradient descent (SGD) introduces an unbalanced sample distribution and proposed to use counterfactual inference at test time to detect and remove the causal effect of the momentum term. Wang et al.
[68] proposed an unsupervised commonsense learning framework to learn intervened visual features by back-door adjustment, which can be used in downstream tasks such as image captioning, visual question answering, and visual commonsense reasoning. Liu et al. [195] established a SCM to uncover the causal relevance among contextual prior, object feature, contextual bias, and final prediction in multi-target visual tasks. Liu et al. [196] introduced a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables, namely invariant variables, style confounders, and spurious features. Lin et al. [198] proposed a causal graph to analyze the confounding effect of the pseudo label generation process for unsupervised video anomaly detection. Lin et al. [199] proposed a causal-based debiasing framework to disentangle unsupervised salient object detection from the impact of contrast distribution bias and spatial distribution bias.

Table 1  Summary of recent causal visual representation learning methods

Method | Year | Task | Model | Causal technique
Agarwal et al. [191] | 2020 | Visual question answering | GAN [192] | Counterfactual sample synthesizing
Chen et al. [84] | 2020 | Visual question answering | CCS [84] | Counterfactual sample synthesizing
Zhang et al. [65] | 2020 | Weakly-supervised semantic segmentation | Pseudo-mask generation | Backdoor adjustment
Tang et al. [66] | 2020 | Scene graph generation | Unbiased training | Causal inference
Tang et al. [67] | 2020 | Long-tailed classification | De-confounded training | Causal inference
Wang et al. [68] | 2020 | – | – | –
Yue et al. [74] | 2020 | Few-shot learning | – | Backdoor adjustment
Hu et al. [72] | 2021 | Class incremental learning | – | Causal inference
Yue et al. [73] | – | – | – | –
Huang et al. [63] | 2022 | Visual grounding | CNN, BERT [181], Attention [180] | Causal inference
Zhang et al. [193] | 2020 | Visual question answering | BERT [181] | Backdoor adjustment
Yang et al. [82] | 2021 | Visual question answering | Attention [180] | Front-door adjustment
Niu et al. [81] | 2021 | Visual question answering | CF-VQA [81] | Counterfactual inference
Li et al. [194] | 2022 | Video question answering | IGV [194] | Invariant grounding
Liu et al. [195] | 2022 | Visual recognition | CNN/Transformer, CCD [195] | Causal context debiasing
Liu et al. [196] | 2022 | Motion forecasting | Encoder-decoder [196] | Causal invariant learning
Lv et al. [197] | 2022 | Domain generalization | CIRL [197] | Causal intervention
Lin et al. [198] | 2022 | Video anomaly detection | UVAD [198] | Causal intervention, counterfactual inference
Lin et al. [199] | 2022 | Salient object detection | USOD [199] | Causal intervention
Liu et al. [200] | 2022 | Event-level video question answering | Transformer, CMCIR [200] | Causal intervention

Causal visual robustness
The ubiquitous spurious correlations learned by deep learning models reduce model robustness, which is a potential vulnerability of the conventional deep learning paradigm. From this perspective, the causal learning paradigm can be introduced to avoid confounding effects and make models more robust [201].
Confounders are widespread in visual robustness problems, including few-shot learning, class-incremental learning, domain adaptation, generative modeling, etc. Yue et al. [74] uncovered that pre-trained knowledge is a confounder in few-shot learning and developed a few-shot learning paradigm that introduces back-door adjustment to control the pre-trained knowledge. The confounding effect can also be leveraged by attackers; Tang et al. [71] proposed an instrumental variable [202] estimation-based causal regularization method for adversarial defense. Hu et al. [72] explained the catastrophic forgetting effect in class-incremental learning in terms of causality: the causal effect of old data is zero; they then proposed distilling the causal effect of old data by controlling the collider effect in the causal graph. As the independent causal mechanisms (ICM) principle implies, causal mechanisms can be invariant across domains; hence, learning invariant causal knowledge is likely to be superior in robustness. To learn cross-domain knowledge, Yue et al. [73] disentangled semantic attributes in images into causal factors and used CycleGAN [203] to generate counterfactual samples in the counterpart domain, then exploited the counterfactual samples and a latent variable encoded by a variational autoencoder (VAE) [204] as proxy variables of an unobserved attribute for intervention. Apart from generating counterfactual samples, intervention can also be implemented by generative methods. Mao et al. [77] argued that conventional randomized controlled trials and intervention approaches can hardly be applied to naturally collected images, and introduced a framework that performs interventions on realistic images by steering generative models to produce the intervened distribution. Lv et al. [197] introduced a causality-inspired representation learning (CIRL) algorithm that enforces the representations to satisfy three properties and then uses them to simulate causal factors.
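The back-door adjustment mentioned above can be sketched numerically. Assuming a discrete confounder z (for instance, pre-trained knowledge split into strata, in the spirit of Yue et al. [74]), the adjustment P(y|do(x)) = Σ_z P(y|x,z)P(z) replaces the confounded conditional P(y|x) = Σ_z P(y|x,z)P(z|x). All numbers below are hypothetical:

```python
import numpy as np

# Minimal back-door adjustment sketch with a discrete confounder z.
# Rows: strata of z; columns: classes y. All numbers are hypothetical.
p_y_given_xz = np.array([[0.9, 0.1],   # P(y | x, z=0)
                         [0.2, 0.8]])  # P(y | x, z=1)
p_z = np.array([0.5, 0.5])             # marginal P(z)
p_z_given_x = np.array([0.9, 0.1])     # P(z | x): x mostly co-occurs with z=0

# Confounded conditional: weights each stratum by how often it co-occurs with x.
p_y_given_x = p_y_given_xz.T @ p_z_given_x
# Intervened distribution: weights each stratum by its population frequency.
p_y_do_x = p_y_given_xz.T @ p_z

print("P(y | x)     =", p_y_given_x)   # dominated by the z=0 stratum
print("P(y | do(x)) =", p_y_do_x)      # stratum influence equalized
```

The gap between the two outputs is exactly the spurious part of the correlation that the confounder injects into the conditional distribution.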

Causal visual question answering
For visual question answering, most existing methods neglect the real causality behind the visual-linguistic modalities and the interaction between appearance-motion and language knowledge. In recent works, causality has been introduced into visual question answering mainly to reduce language bias. Strong correlations between the question and the answer make VQA models rely on spurious correlations without attending to visual knowledge. For example, since the answer to the question "What is the color of the apple?" is "red" in most cases, a VQA model easily learns the correlation between the word "apple" and the word "red". Thus, when given an image of a green apple, the model still predicts the answer "red" with strong confidence. Although simply balancing the dataset [28,205] can partly mitigate the linguistic bias, the spurious correlation still exists in the model. From this perspective, causality-based solutions are better than simply balancing the data, since causal reasoning cuts off the superficial correlations and makes VQA models focus on the real causality.
Constructing a confounder set is a common practice for causal intervention. VC-RCNN [68] constructed an object-level visual confounder set for performing back-door adjustment in a visual task. Following VC-RCNN, DeVLBert [193] treated nouns in the linguistic modality as confounders and constructed language confounder sets using their average BERT representation vectors. Besides, DeVLBert incorporated the intervention into BERT′s [181] pre-training process and combined the mask modeling objective with causal intervention. As another implementation of the intervention, Yang et al. [82] designed the In-Sample attention and Cross-Sample attention modules to conduct front-door adjustment, where the In-Sample attention module approximates the probability P(Z|X) of the mediator given the input, and the Cross-Sample attention module approximates the prior probability P(X). Using these attention modules, a cross-modality causal attention network was proposed for the VQA task by combining causal attention with the previous LXMERT [206] framework. Counterfactual-based solutions are also worth noting. Agarwal et al. [191] proposed a counterfactual sample synthesizing method based on generative adversarial networks (GANs) [192]. To overcome the complexity of the GAN-based synthesizing method, Chen et al. [84] replaced critical objects and critical words with mask tokens and reassigned answers to synthesize counterfactual QA pairs. Apart from sample synthesizing methods, Niu et al. [81] developed a counterfactual VQA framework that reduces multi-modality bias by using the total indirect effect (TIE) [39] for final inference. By blocking the direct effect of one modality, the TIE measures the total causal effect of the question and visual information, thus reducing language bias in VQA. Li et al. [194] proposed invariant grounding for VideoQA (IGV) to shield the answering process from the negative influence of spurious correlations, which significantly improves reasoning ability.
Liu et al. [200] proposed a causality-aware event-level visual question answering framework named cross-modal causal relational reasoning (CMCIR) to discover true causal structures via causal intervention on the integration of visual and linguistic modalities.
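The two causal estimands used in this subsection can be written explicitly. In the front-door formula of Yang et al. [82], Z denotes the mediator (the attended knowledge); in the counterfactual formulation of Niu et al. [81], Y_{v,q} denotes the answer score with visual input v and question q, and v* the blocked (counterfactual) visual input. The notation below follows the standard causal-inference convention and is a sketch of the published formulations rather than a verbatim reproduction:

```latex
% Front-door adjustment: the In-Sample attention estimates P(z \mid X),
% the Cross-Sample attention estimates P(x).
P(Y \mid do(X)) \;=\; \sum_{z} P(z \mid X) \sum_{x} P(x)\, P(Y \mid x, z)

% Counterfactual VQA: subtracting the natural direct effect (NDE) of the
% question from the total effect (TE) leaves the total indirect effect (TIE).
\mathrm{TIE} \;=\; \mathrm{TE} - \mathrm{NDE} \;=\; Y_{v,q} - Y_{v^{*},q}
```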

Comparisons and discussions
To summarize the development line and current state of causal visual representation learning, Fig. 5 shows its past, present, and future directions. Although the above-mentioned methods successfully apply causal reasoning to uncover causal mechanisms and achieve promising results, causal reasoning for visual representation learning is still in its infancy and faces many challenges. Firstly, existing causal visual representation tasks are limited to several computer vision tasks and have not been applied to more diverse and challenging tasks such as video understanding, human-computer interaction, and urban computing. It should be noted that recent large pre-trained vision-language models like contrastive language-image pre-training (CLIP) [207] have shown great potential in learning representations that transfer across a wide range of downstream tasks. Different from traditional representation learning, which is based mostly on discretized labels, popular prompt learning adopts vision-language pre-training and aligns images/videos/texts in a common feature space, which allows zero-shot transfer to downstream tasks via prompting. Therefore, how to apply causality-aware knowledge to prompt learning may be a potential direction. Secondly, although causal reasoning has burgeoned for many visual learning tasks, the existing evaluation datasets are still traditional datasets designed for correlation learning, without proper large-scale benchmark datasets and pipelines specified for causal reasoning. Thirdly, most existing methods focus on causality discovery in either the visual or the linguistic modality without considering both. Therefore, a more in-depth analysis of the relations between causal reasoning and visual representation learning is required.

Related causal datasets
Correlation-based models may perform well on existing datasets, not because they have strong reasoning capability, but because these datasets cannot fully support the evaluation of reasoning capability. Spurious correlations in these datasets can be exploited by a model to cheat: the model merely concentrates on superficial correlation learning and approximates the distribution of the dataset, rather than performing real causal reasoning. For example, on the VQA v1.0 [182] dataset, a model that simply answers "yes" whenever it sees a question of the form "Do you see a ..." achieves nearly 90% accuracy. Due to this shortcoming of current datasets, researchers need to build benchmarks that can evaluate the true causal reasoning capability of models. In this section, we take image question answering and video question answering benchmarks as examples to analyze the current state of related causal reasoning datasets and give some future directions.
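The "always answer yes" shortcut can be made concrete with a prior-only baseline. The toy records below are invented for illustration; the point is that a model that never looks at the image can still score high on a biased split:

```python
from collections import Counter

# Made-up toy QA records: (question type, ground-truth answer).
qa = [("do you see a", "yes")] * 90 + [("do you see a", "no")] * 10 \
   + [("what color is", "red")] * 40 + [("what color is", "green")] * 20

# A "blind" baseline: answer the most frequent answer for each question
# type, without ever looking at the image.
majority = {
    qtype: Counter(a for q, a in qa if q == qtype).most_common(1)[0][0]
    for qtype in {q for q, _ in qa}
}

accuracy = sum(majority[q] == a for q, a in qa) / len(qa)
print(f"blind-baseline accuracy: {accuracy:.1%}")
```

Benchmarks built with balanced answer distributions (or counterfactual pairs) are designed precisely to drive such blind baselines down to chance.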

Image question answering
Image question answering benchmarks evaluate a model′s capability to answer natural language questions about a corresponding image. Recent image question answering benchmarks try to collect or generate balanced QA pairs to make the question distribution of the dataset more balanced. VQA v2.0 [28] collects complementary QA pairs by replacing the image and the answer in QA pairs. VQA-CP [190] re-splits the VQA v1 and VQA v2 datasets to construct two new datasets, VQA-CP v1 and VQA-CP v2. As Fig. 6 shows, Agarwal et al. [191] constructed the IV-VQA and CV-VQA datasets using semantic editing to generate images, which are then re-examined by humans. Li et al. [208] proposed a human-machine adversarial approach to collect robust QA pairs; Fig. 7 illustrates the adversarial data collection procedure and shows example QA pairs in AVQA [208]. In Table 2, we summarize these datasets in terms of image source, split numbers, whether collected, and whether rebalanced. Current image question answering benchmarks use various approaches to overcome the bias introduced by unbalanced data. However, there is still a lack of large-scale benchmark datasets that support fair and transparent evaluation of the causality behind the data and the reasoning ability of a method. Introducing causal concepts and methods, such as confounders and causal interventions, when building benchmark datasets may help resolve the lack of specific causal reasoning benchmarks.

Video question answering
The video question answering task is more complex than image question answering due to the ubiquitous correlation between spatial and temporal information, i.e., the introduction of complex temporal relations. Improving the spatial-temporal causal reasoning ability of models can therefore improve performance on this task, while simply approximating data distributions usually does not work. Hence, some recently released benchmark datasets aim to evaluate whether a model has the reasoning ability to understand the causal relations within the visual and linguistic content, as shown in Table 3.
CLEVRER [210] contains synthesized videos and automatically generated questions describing collisions of geometric objects. A typical video and the question types from CLEVRER are shown in Fig. 8. It is a balanced, synthetic dataset that contains diagnostic annotations and counterfactuals. VQuAD [211] is also a diagnostic synthesized dataset. It is constructed as a balanced dataset by separating objects into attributes like texture and color and balancing the data distribution over these attributes. A brief overview of the VQuAD objects is shown in Fig. 9. VQuAD is a diagnostic dataset that can be used to evaluate the extent of the reasoning abilities of various video QA methods. ComPhy [212] is a video QA dataset that focuses on understanding object-centric and relational physical properties hidden from visual appearance. As shown in Fig. 10, ComPhy studies objects′ intrinsic physical properties from their interactions and how these properties affect their motions in future and counterfactual scenes in order to answer the corresponding questions. Action genome question answering (AGQA) [213] includes numerous automatically generated QA pairs. An overview of AGQA is shown in Fig. 11. The QA pairs in AGQA are generated by parsing videos into scene graphs and using language composition over the scene graphs to generate QA pairs. SUTD-TrafficQA [214] is a traffic video question answering dataset with six challenging reasoning tasks, including basic understanding, event forecasting, reverse reasoning, counterfactual inference, introspection, and attribution analysis, to analyze a model′s reasoning ability. Fig. 12 shows an example of a counterfactual traffic video question answering process from SUTD-TrafficQA. Notably, the counterfactual task in Fig. 12 asks for the outcome of a hypothesis that does not occur in the video.
To accurately reason about imagined events under a designated condition, the model is required not only to conduct relational reasoning in a hierarchical way but also to fully explore the causal, logical, and spatial-temporal structures of the visual and linguistic content. NExT-QA [215] is a video question answering benchmark targeting the explanation of video content, which requires a deeper understanding of videos and reasoning about causal and temporal actions from rich object interactions in daily activities. As shown in Fig. 13, the NExT-QA dataset contains rich object interactions and requires causal and temporal action reasoning in realistic videos. It challenges QA models to reason about causal and temporal actions and to understand rich object interactions in daily activities.

Extensive applications
Causal reasoning with visual representation learning has a variety of applications. Modeling causal reasoning for a variety of tasks can achieve a better perception of the real world. In this section, we introduce the applications from five aspects: image/video analysis, explainable artificial intelligence, recommendation systems, human-computer dialog and interaction, and crowd intelligence analysis. We also discuss how causal reasoning benefits various real-world applications, as shown in Fig. 14.

Table 3 (continued)
SUTD-TrafficQA [214] | Traffic events | 62K | MC&OC | Human | ✓ | ✓ | ✓
NExT-QA [215] | Causal and temporal interactions | 52K | MC&OG | Human | ✓ | ✓ | ✓

Fig. 9 Illustration of an instance of the VQuAD dataset [211], which shows various questions generated for the created video and the difference in complexity in terms of hops for the questions

Fig. 10 Sample reference videos, target video, and question-answer pairs from the ComPhy dataset [212]

Fig. 11 An overview of the AGQA dataset [213]

Fig. 12 An example of a counterfactual question-answer pair in the SUTD-TrafficQA dataset [214]

Most existing image/video analysis relies on learning data correlations rather than causal structures, and the superficial correlations within image and video data make models vulnerable to visual changes in the dataset. Therefore, a causality-aware feature learning strategy is required so that the model learns the essential causal structures behind the data and is robust to different data distributions. One of the main methods of dealing with superficial data correlations is causal intervention. Assume that commonsense knowledge exists in the visual features, but that this commonsense might be confused by false observation bias. For example, the words "cup", "table", and "stool" have high co-occurrence frequencies because they commonly appear together in daily life, so commonsense knowledge often wrongly predicts the class as "table" due to this observation bias. To reduce the observation bias, a causality-aware visual commonsense model is required, which regards the object category as a confounding factor and directly maximizes the likelihood after the intervention to learn the visual feature representation. By eliminating observation bias, the learned visual features are robust for image and video analysis tasks. A representative task is weakly supervised object localization, detection, and grounding [216−218], which aims to localize objects described in a sentence to visual regions in the image/video. Despite recent progress, existing methods may suffer from severe spurious associations: 1) the association is not object-relevant but extremely ambiguous due to weak supervision, and 2) the association is unavoidably confounded by observational bias when statistics-based methods are used. Therefore, a unified causal framework is required to learn the deconfounded object-relevant association for accurate and robust video object localization, detection, and grounding.
With the development of deep learning across industries and disciplines, the application of deep learning models in real-world scenes requires a high degree of robustness, interpretability, and transparency. Unfortunately, the black-box properties of deep neural networks are still not fully explainable, and many machine decisions remain poorly understood [219]. In recent years, causal interpretability has received increasing attention, and several works [220−226] have made progress in explainable artificial intelligence based on it. For example, during the COVID-19 pandemic, causal mediation analysis helped disentangle the different effects contributing to case fatality rates when an instance of Simpson′s paradox was observed [227]. Learning the best treatment rule for each patient is one of the promising goals of applying explainable treatment effect estimation in the medical field. Since the effects of the different available drugs can be estimated and explained, doctors can prescribe better drugs accordingly.
At present, some causal reasoning works [228−235] have been applied to recommendation systems. A recommendation system is in fact a causal reasoning problem [228]: the user embedding represents what type of person the user is, and the system infers the user′s preferences based on the user′s attributes. The causal effect of a recommendation is whether the user is satisfied with it. Superficial bias exists because the recommendation system is trained on biased samples (both users and items). An example is personalized recommendation, where we wish to model a customer′s shopping interest through various data sources, such as web browser records and shopping history. However, if we train a recommendation system on customers′ records in controlled settings, the system may provide little additional insight compared to the customers′ mental states and emotions, and thus may fail when deployed. While it may be useful to automate certain decisions, understanding causality may be necessary to recommend commodities that are personalized and reliable. A general approach to removing survival bias is to construct counterfactual mirror users: build similarity measures using unbiased information and construct matches from low-active to high-active users. In this way, we can alleviate users′ dissatisfaction with previously recommended content as well as low user activity.

Fig. 13 Examples of multi-choice QA in the NExT-QA dataset [215]

For human-computer dialog and interaction, some emerging tasks involve the interaction between vision and language. Additionally, there exists multi-modal spatial-temporal information with complex relations captured by various devices.
Most existing work relies on data correlations rather than causally relevant evidence, and spurious correlations in the data make models vulnerable to language biases in the questions. Take a VQA task as an example: if we remove visual objects that are unrelated to answering the question, the model's prediction is not expected to change, which prevents the model from relying on superficial data correlations. Conversely, when objects related to the question are changed, the model is expected to change its answer accordingly. Adjusting question-related objects thus encourages the model to predict based on causality-aware objects. For a better user experience, a human-computer dialog and interaction system is required to understand people's purposes and make reliable decisions. Causal reasoning benefits the pursuit of reliable human-computer interaction by uncovering and modeling heterogeneous spatial-temporal information in a reliable and explainable way. This is especially true for robot interaction [91, 233−235] , where the relevant environmental features are not known in advance and prior knowledge is a good candidate source of causal structures. The strong relation between causal reasoning and the ability to intervene in the world suggests that causal reasoning can greatly address this challenge for robotics, benefiting robotic applications significantly.
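The object-removal test described above can be phrased as a do-style intervention on the visual input: zero out question-irrelevant object features and measure how often the predicted answer stays the same. The sketch below is a minimal illustration under our own assumptions; `vqa_model` and the irrelevance mask are stand-ins for a real VQA model and object-relevance annotations.

```python
import numpy as np

def counterfactual_consistency(vqa_model, obj_feats, question, irrelevant):
    """Fraction of examples whose predicted answer is unchanged when
    question-irrelevant object features are zeroed out.

    obj_feats:  (batch, num_objects, dim) visual object features.
    irrelevant: (batch, num_objects) boolean mask, True for objects
                unrelated to answering the question.
    """
    full = vqa_model(obj_feats, question).argmax(-1)
    # Intervention: remove (zero out) the question-irrelevant objects.
    cf_feats = obj_feats * (~irrelevant)[..., None]
    cf = vqa_model(cf_feats, question).argmax(-1)
    return float((full == cf).mean())  # 1.0 = fully consistent
```

A causality-aware model should score near 1.0 on this check while still changing its answer when question-critical objects are intervened on.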
The applications mentioned above usually focus on a single subject, whereas crowd intelligence analysis [236] addresses sensing and cognitive tasks for multiple subjects and their interactions. In recent years, we have witnessed the explosive growth of multi-modal heterogeneous spatial/temporal/spatial-temporal data from different kinds of sensors. Urban computing [237] is an example of crowd intelligence analysis that aims to tackle traffic congestion, energy consumption, and pollution by using the data generated by the large number of vehicles in cities (e.g., traffic flow, human mobility, and geographical data). Huge amounts of heterogeneous traffic data come from various sources, both static and dynamic, such as traffic road networks, geographic information system (GIS) data, traffic flow, traffic mobility, and traffic energy consumption. Moreover, this heterogeneous spatial-temporal traffic data contains a large number of useful traffic rules with strong causal relations. Therefore, discovering the complex and entangled causal relations among different heterogeneous spatial-temporal data sources would greatly benefit urban computing and crowd intelligence analysis.

More detailed discussions
Some researchers have successfully applied causal reasoning to visual representation learning to discover causality and visual relations. However, causal reasoning for visual representation learning is still in its infancy, and many issues remain unsolved. Therefore, this section highlights several possible research directions and open problems to inspire further extensive and in-depth research on this topic. Potential research directions for causal visual representation learning can be summarized as: 1) more reasonable causal relation modeling; 2) more precise approximation of intervention distributions; 3) more proper counterfactual synthesizing processes; 4) large-scale benchmarks and evaluation pipelines.

More reasonable causal relation modeling
Reasonable causality modeling is the basis of causal inference. Real-world data such as visual information is usually unstructured, and the effect of a causal relation may be unobserved. For example, momentum is likely to be detrimental under long-tailed data distributions [67] , and there is no consensus on how to properly model causality in many tasks because the real causality may be more complicated than expected. For the VQA task, Yang et al. [82] treated the visual and language features as a single vertex in the causal graph, whereas Niu et al. [81] considered the visual and linguistic features separately. However, these methods focus on causality discovery in either the visual or the linguistic modality rather than both. Therefore, future work should consider: 1) in-depth analysis of the relations between causal reasoning and visual representation learning; 2) comprehensive and reasonable causal relation modeling.
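The two graph-design choices contrasted above can be made concrete with a toy adjacency encoding. The node names below are our own shorthand reading of the text (a single multi-modal vertex M versus separate vision V and question Q vertices with a fused-knowledge vertex K), not the papers' exact notation.

```python
# Causal graphs as adjacency dicts: node -> list of children.
graph_82 = {"M": ["A"]}                                # one multi-modal vertex -> answer
graph_81 = {"V": ["K", "A"], "Q": ["K", "A"], "K": ["A"]}  # separate modality vertices

def parents(graph, node):
    """Return the sorted parents of `node` in an edge-list adjacency dict."""
    return sorted(src for src, children in graph.items() if node in children)

# In the single-vertex design the answer has one cause; in the separated
# design the answer is affected by vision, language, and fused knowledge,
# so each modality's direct effect can be reasoned about individually.
print(parents(graph_82, "A"))  # ['M']
print(parents(graph_81, "A"))  # ['K', 'Q', 'V']
```

Separating the vertices is what makes modality-specific interventions and counterfactuals expressible at all; with a single fused vertex, the individual effects are not identifiable from the graph.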

More precise approximation of intervention distributions
A precise estimation of the intervention distribution helps the implementation of a certain causal model. Most current methods for approximating intervention distributions focus on identifying all confounders for a certain task, and these confounders are usually defined as the average of object features in visual tasks [61,68,193] . However, average features may not properly describe a given confounder, especially for complex heterogeneous visual data. Thus, approximating the confounders more accurately is a key direction that needs to be further considered for causal intervention methods.
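The confounder-dictionary strategy referred to above is typically implemented as a backdoor adjustment, P(Y | do(X)) = Σ_z P(Y | X, z) P(z), where each z is the average feature of one object class. The sketch below is a generic illustration of that computation; `logits_fn` and all shapes are our own assumptions, not a specific cited model.

```python
import numpy as np

def backdoor_adjust(logits_fn, x, confounders, priors):
    """Approximate P(Y | do(X=x)) = sum_z P(Y | X=x, z) P(z).

    confounders: (K, d) dictionary, each row the *average* feature of one
                 object class (the coarse approximation discussed above).
    priors:      (K,) class frequencies P(z), summing to 1.
    logits_fn:   assumed model head producing answer logits from (x, z).
    """
    probs = 0.0
    for z, pz in zip(confounders, priors):
        logits = logits_fn(x, z)
        p = np.exp(logits - logits.max())   # stable softmax -> P(Y | X, z)
        probs = probs + pz * p / p.sum()    # weight by P(z) and accumulate
    return probs

# Toy usage: a linear head over the concatenated [x; z] features.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))                 # d_x + d_z = 8, 3 candidate answers
head = lambda x, z: np.concatenate([x, z]) @ W
out = backdoor_adjust(head, rng.normal(size=4),
                      confounders=rng.normal(size=(5, 4)),
                      priors=np.full(5, 0.2))
print(out, out.sum())                        # a proper distribution over answers
```

Replacing the averaged rows of `confounders` with richer per-class representations is exactly the refinement this subsection calls for; the adjustment formula itself stays unchanged.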

More proper counterfactual synthesizing process
Counterfactual inference-based methods usually focus on refining the training procedure, i.e., embedding the counterfactual inference process into training. Counterfactual synthesizing methods [62,77,84,191] have proved effective in many tasks, and embedding counterfactual inference into models can effectively eliminate data bias; a novel counterfactual framework [81] gives insight into this potential. However, visual data is often entangled and heterogeneous, which makes the data bias hard to understand and model. Therefore, designing a proper counterfactual synthesizing process is a promising direction for data debiasing in visual representation learning.
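One widely used form of counterfactual inference at test time, in the spirit of the counterfactual VQA framework [81], subtracts the prediction of a language-only (no-vision) branch from the full prediction, retaining only the effect that genuinely depends on the image. The numbers below are illustrative, and the exact branch fusion of [81] is more elaborate than this minimal sketch.

```python
import numpy as np

def debiased_logits(fused_logits, question_only_logits):
    """Total effect minus the language-prior direct effect: keep the
    part of the prediction that changes when vision is present."""
    return fused_logits - question_only_logits

# Illustrative logits over three candidate answers.
fused = np.array([2.0, 1.0, 0.5])    # full model: vision + question
q_only = np.array([1.8, 0.1, 0.1])   # counterfactual branch: question only
# The full model's top answer (index 0) is driven by the language prior;
# subtracting the counterfactual branch shifts the choice to index 1.
print(debiased_logits(fused, q_only).argmax())
```

Intuitively, answer 0 looks best only because the question alone already predicts it; after removing that shortcut, the vision-grounded answer wins.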

Large-scale benchmarks and evaluation pipeline
Although causal reasoning has burgeoned in many visual learning tasks, most existing evaluation datasets are still traditional datasets designed for correlation learning, without proper large-scale benchmark datasets and pipelines to support fair and transparent evaluation of emerging research contributions. The few existing causal datasets discussed in the above sections are limited in scale and lack comprehensive evaluation standards for causal reasoning. Therefore, more large-scale benchmark datasets and pipelines for specific visual representation learning tasks should be developed in future research.
Generally, causal visual representation learning is still an emerging and challenging research topic. Causal modeling, intervention distribution approximation, counterfactual inference, large-scale benchmarks, and evaluation pipelines all have great potential for further exploration.

Conclusions
This paper has provided a comprehensive survey of causal reasoning for visual representation learning, focusing on a prospective review of related works, datasets, insights, and future challenges and opportunities for causal reasoning, visual representation learning, and their integration. We mathematically present the basic concepts of causality, the SCM, the ICM principle, causal inference, and causal intervention. Based on this analysis, we further give some directions for conducting causal reasoning in visual representation learning tasks. We also review some recent popular visual learning tasks, including visual understanding, action detection and recognition, and visual question answering, together with discussions of the existing challenges of these visual learning methods. In addition, the related causality-based visual representation learning works and datasets are discussed systematically. Finally, extensive applications and some potential future research directions are provided for further exploration. We hope that this survey can help attract attention, encourage discussions, and bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.