Meta-supervision for Attention Using Counterfactual Estimation

The neural attention mechanism has been used as a form of explanation for model behavior. Users can either passively consume such explanations or actively disagree with them and supervise attention toward more proper values (attention supervision). Though attention supervision has been shown to be effective in some tasks, we find that the existing attention supervision is biased. To debias it, we propose a counterfactual method that estimates the missing observations and augments the existing supervision with them, which contributes to accuracy gains. We validate the effectiveness of our counterfactual supervision on two widely adopted personal image benchmark datasets: CUFED and PEC.


Introduction
The neural attention mechanism has gained interest due to its contribution toward enhancing both accuracy and explainability. By generating a heatmap over attended regions [1] or highlighting a word of importance [2], the decision of the underlying model can be explained in a human-interpretable manner. However, such work treats attention only as a by-product of prediction or as latent variables for explanation [3,4], while attention coefficients can also be considered as output variables, which can be human supervised.
We study the latter problem of attention supervision (AS). The existing work suggests that, when such explanation coincides with human perception, accuracy also improves [5][6][7][8][9][10]. We illustrate our problem with an image attention supervision scenario for event-type annotation [11].
Specifically, given a folder of unannotated personal images, our task is to predict its event type out of E types. Given the first row of images in Fig. 1, the model is tasked to predict the event type ThemePark for the given album.
For this prediction, neural attention [4] may identify that the images of a Ferris wheel and an animal contribute highly to the machine prediction. In the AS problem, a human can supervise attention by giving a scalar importance score to each image in the context of the ThemePark type. CUFED [12] is a dataset annotating such human supervision, where annotators are asked to give a scalar importance score to each image for the given event type: for the first row, the images of the Ferris wheel and the zebra were annotated as important for detecting the ThemePark event, with a high scalar score of 1.5, shown as a bar and a number in Fig. 1. Such a score is low for the image of the sky.
Our key claim is that the CUFED attention supervision S of image I is a biased observation toward the given event type y. This observation can be debiased if we can observe (or estimate) its counterfactuals: the unobserved supervision for the image in events ỹ ≠ y. A closely related problem is obtaining an unbiased relevance estimation [13] from click observations biased toward the ranking presented to the user.
One way to debias is to collect the counterfactual observations themselves, as in online A/B testing. In our problem setting, we would ask for annotations of the same image for all E event types, which multiplies the annotation overhead E-fold.
In contrast, we propose to estimate counterfactual observations to keep the human annotation cost as low as AS, analogous to offline A/B testing. That is, in Fig. 1, we estimate the dotted distribution considering the attention distribution for all types, where only the value shown in the bar is observed. This distributional attention view (which we name DistAS) allows us to recognize a bias: the zebra image seems critical for detecting the Zoo event, given the high observed value shown in the bar. However, the estimated annotation suggests that this image is relevant to many other types as well and cannot contribute much to concluding the event type. In other words, its importance needs to be debiased to a lower value. This is similar in spirit to propensity weighting [14], which has been a standard approach to correct for item selection bias: interactions are biased toward the documents presented at annotation time.
Our key contribution is to leverage image semantics for propensity weighting: for example, in Fig. 1, to estimate the importance of the zebra in the Zoo event, we can consider observed annotations for a similar image (such as the horse in the second row) with high weights. If two given images are similar, we force the two images to have similar distributions of supervision across event types.
We validate the effectiveness of our estimation directly, by comparing with human annotations of image importance, and indirectly, by the accuracy of event-type prediction. In both tasks, our proposed models, purposely built upon simple RNN and CNN models, outperform more complex state-of-the-art methods [4,11,12] by leveraging counterfactual supervision. Specifically, our proposed models outperform the existing methods by up to 10.6% points on two personal image benchmark datasets: CUFED and PEC.

Fig. 1 Ground-truth importance score in the CUFED dataset [12] (shown as a bar) and the estimated counterfactual supervisions (shown as a distribution). The first row is from a ThemePark album, and the second is from a Zoo album.

This work builds on and extends [15] in the following way: • Extensive Evaluation. To further isolate the effect of our proposed similarity-based counterfactual estimation, we report an ablation study on the CUFED dataset. Our study confirms that all components consistently contribute to the performance improvement (Sect. 5).

Problem Formulation
We aim to solve the task of event-type recognition for a set of unannotated images (an album). For event recognition, we are tasked to train a recognition function f ∶ ℝ^{T×d} → ℝ^E, which predicts the correct event type y ∈ {y_1, y_2, …, y_E} among E event types. Our goal is to improve the neural attention mechanism by learning attention α ∈ ℝ^T to follow the gold importance S = {S_1, S_2, …, S_T} as closely as possible (which we call attention supervision). This can be evaluated in two ways: by comparing event-type prediction and event-specific image ranking with human annotations. For the second evaluation, we regard the attention as an alternative importance scoring function g ∶ ℝ^{T×d} → ℝ^T, employed in the recognition function for weighting purposes. For the sake of this discussion and without loss of generality, we consider a decomposition of the recognition network into two functional components: an album feature extractor X → z (weighted by attention α) and a decision network z → ŷ. The former combines the image features X into an album representation z by weighting the image features with attention α. In the latter, the album representation z is used to make the event-type prediction ŷ. Our intention is to keep this decision network as simple as possible to make the point that, with advanced attention supervision, simple models can beat more complex state-of-the-art models. We thus consider simple CNN- and RNN-based models below.
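The two-component decomposition above can be sketched in a few lines. The shapes (T images with d-dimensional features, E event types) follow the text; the random parameters, the softmax choice, and the specific layer forms are placeholder assumptions, not the paper's trained model.

```python
import numpy as np

def softmax(v, axis=-1):
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def album_representation(X, alpha):
    """Feature extractor: combine image features X (T x d) into an
    album vector z via the attention weights alpha (T,)."""
    return alpha @ X  # attention-weighted sum of image features

def decision_network(z, W):
    """Decision network: map album vector z (d,) to log-probabilities
    over the E event types (a single linear layer here, for brevity)."""
    return np.log(softmax(z @ W))

T, d, E = 4, 8, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))            # per-image features
alpha = softmax(rng.normal(size=T))    # stand-in attention distribution
W = rng.normal(size=(d, E))            # placeholder decision weights
log_probs = decision_network(album_representation(X, alpha), W)
```

The sketch only fixes interfaces: any attention module that emits a distribution over the T images can slot into `album_representation`.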

CNN-Att
Dependent on the event type, the importance of images varies, and more important images should contribute more to the album representation, which can be modeled with neural attention [2,4]. Specifically, attentions α = {α_1, α_2, …, α_T} are computed with a feed-forward network. We model the attentions as a probability distribution over all the images via a softmax layer. Then, z is defined as a weighted sum of image features according to their attentions. We denote this model variant as CNN-Att. Specifically,

α_i = softmax_i(e^⊤ tanh(W[x_i; x_avg] + b)),  z = Σ_i α_i x_i,

where x_avg denotes the average of all image features in the given album and [;] means concatenation of features. W, b, and e are learnable parameters. Intuitively, attention measures the relative importance of an image with regard to the whole album. The context vector e represents a latent query asking for image importance for the given event.
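A minimal sketch of CNN-Att's scoring, assuming the common additive-attention form e^⊤ tanh(W[x_i; x_avg] + b) described above; the hidden size h and the random parameter values are our assumptions.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def cnn_att(X, W, b, e):
    """CNN-Att: score each image against the album average via a
    one-layer feed-forward net queried by context vector e, then
    normalize the scores into an attention distribution."""
    x_avg = X.mean(axis=0)
    scores = np.array([e @ np.tanh(W @ np.concatenate([x, x_avg]) + b)
                       for x in X])
    alpha = softmax(scores)      # probability distribution over images
    z = alpha @ X                # attention-weighted album representation
    return alpha, z

T, d, h = 5, 16, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(T, d))
W = rng.normal(size=(h, 2 * d))  # acts on [x_i; x_avg]
b = rng.normal(size=h)
e = rng.normal(size=h)           # latent "image importance" query
alpha, z = cnn_att(X, W, b, e)
```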

RNN-Att
Alternatively, some event types have strong temporal dependence, such that input features are better represented by recurrent models, such as LSTM [16] and GRU [17]. We thus employ a bidirectional GRU network in our attention architecture, named RNN-Att. Specifically, input images are first sorted in chronological order and fed into the BiGRU network, and the hidden states of the recurrent network are used as input for the attention computation.

The decision network takes the album representation z and predicts log-probabilities over output classes (E event types).

Approach
Our next task is to supervise such attentions for accurate prediction of both event type and importance, using public annotations, known as CUFED [12], of the gold event label y and the event-specific importance S for each album.

Baseline: ScalarAS
Formally, we design a model to predict not only the event type with minimal error (represented by the objective function L_cls), but also the event-specific importance (represented as L_ScalarAS). First, for L_cls, all models are trained with the classification objective, minimizing the categorical cross-entropy loss between the ground-truth y and the predicted event-type label ŷ:

L_cls = − Σ_A log ŷ_y,

where A ranges over the entire set of albums in the training set.
Second, for L_ScalarAS, the objective is to ensure that the distribution of attention α is close to the target distribution β. Following [12], we focus on the relative importance of each image in the given album, rather than directly predicting the exact importance scores, due to the hardness of learning a reliable absolute importance. We turn the importance scores (i.e., supervisions) into a probability distribution

β_i = softmax_i(γ ⋅ S_i),

where γ is a positive hyper-parameter that controls the score contrast: as γ increases, the distribution of target attention becomes more skewed, guiding the model to attend to a few of the more important images. We then set the total loss as the weighted sum of the two loss terms: L = λ_cls ⋅ L_cls + λ_AS ⋅ L_ScalarAS, where λ_cls and λ_AS denote the balancing coefficients between the two terms. We apply this loss function to CNN-Att and RNN-Att, respectively, and denote these variants as CNN-ScalarAS and RNN-ScalarAS.
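The score-to-distribution conversion can be sketched as follows. The KL divergence used here for the attention-supervision term is an assumption for illustration, since this excerpt does not show the exact form of L_ScalarAS.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def target_distribution(S, gamma):
    """Turn raw importance scores S into a target attention distribution.
    Larger gamma sharpens the distribution toward the top images."""
    return softmax(gamma * np.asarray(S, dtype=float))

def scalar_as_loss(alpha, beta, eps=1e-12):
    """KL(beta || alpha): an assumed divergence between the target beta
    and predicted attention alpha (zero when they match)."""
    return float(np.sum(beta * (np.log(beta + eps) - np.log(alpha + eps))))

S = [1.5, 1.5, 0.2]                       # e.g. Ferris wheel, zebra, sky
beta_sharp = target_distribution(S, gamma=3.0)   # skewed target
beta_flat = target_distribution(S, gamma=0.5)    # softer contrast
```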

Distributional Attention Supervision (DistAS)
This section questions whether the CUFED annotation S is an optimal supervision for the attention α. Rather, we propose that the supervision vector S ∈ ℝ^T should be expanded into a matrix S* ∈ ℝ^{T×E}, to annotate unobserved image importance for other event types as well. That is, the CUFED annotation can only sparsely supervise such a matrix, annotating S*_{iy}, the importance of each image I_i for the gold event y, and is thus biased toward the prediction. The same image is not considered for other types, such that S*_{ik} = 0 where k ≠ y. Now the question is: can we replace the zero entries S*_{ik} = 0 for k ≠ y with better estimates? Existing frameworks leverage labeled data from other event types by introducing a Siamese structure looking at multiple types [12] or by iterative convergence [11], as implicit data augmentation. Instead, we keep the structures simple and augment annotations into S*_{ik} (replacing 0 with a counterfactual estimate), which we discuss later.
Given the expanded target supervision matrix S*, our attention supervision goal is formally stated as minimizing a divergence L_DistAS between the event-wise attention weights α*_{·k} and the target distributions β*_{ik} = softmax_i(γ ⋅ S*_{ik}) for each event type k. Note that we apply the softmax function across the images for each event type, which aims to preserve the observed ranking information within the event type. Because S*_{ik} is zero-initialized, the softmax yields a uniform distribution β*_{ik} = 1/T for k ≠ y. To accept the expanded supervisions S*, our attention architecture needs to be expanded to have multiple context vectors e_k, one per event type, intuitively querying for "important images for the k-th event". This modification yields event-wise attention weights α*_{ik} = softmax_i(e_k^⊤ tanh(W[x_i; x_avg] + b)). The overall architecture of our proposed model is presented in Fig. 2.
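The expansion of the scalar supervision S into the sparse matrix S*, with a per-event softmax across images, can be sketched as follows (the small shapes are illustrative):

```python
import numpy as np

def softmax_cols(M):
    """Softmax over axis 0: across the T images, separately per event."""
    e = np.exp(M - M.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def expand_supervision(S, y, E):
    """Build S* (T x E): the observed column y holds the CUFED scores,
    and every counterfactual column k != y starts at zero (to be
    replaced later by counterfactual estimates)."""
    S = np.asarray(S, dtype=float)
    S_star = np.zeros((len(S), E))
    S_star[:, y] = S
    return S_star

# Three images annotated for gold event y = 0, out of E = 4 event types.
S_star = expand_supervision([1.5, 1.5, 0.2], y=0, E=4)
beta_star = softmax_cols(S_star)  # zero columns become uniform (1/T)
```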

Counterfactual Supervision Estimation
From the observed importance S*_{iy}, our goal is to estimate the unobserved importance S*_{ik} for k ≠ y at training time. The zero entries S*_{ik} = 0 may mean either that the image is absolutely unimportant in the given event, or that it is important yet unobserved. In contrast to ScalarAS, built on only the former assumption, we take the latter assumption by treating the missing supervisions S*_{ik} as optimization variables, which can be estimated from the observed importance in other events.
Inspired by propensity weighting [14], we propose a (propensity-)weighted aggregation of observed importance for debiasing, based on the following intuition: if two images in different events have similar image features (or propensity), they have similar importance distributions across multiple events S*_i ∈ ℝ^E (a row vector of matrix S*). In other words, the human annotation on the given image for an unobserved event type, S*_{iỹ}, is close to the annotation on another similar image presented for S*_{jỹ}. Formally, we set our goal as minimizing the difference between two image similarity metrics, obtained from image features and from importance distributions, respectively. In order to efficiently introduce this objective into the existing training process, we additionally sample an album Ã whose gold event is ỹ (≠ y) and build two matrices of image similarities, M^feat ∈ ℝ^{T×T} and M^imp ∈ ℝ^{T×T}, by comparing the two albums A and Ã:

M^feat_{ij} = sim(x_i, x_j),  M^imp_{ij} = sim(S*_i, S*_j),

where j denotes the index of an image in album Ã. In this work, we use cosine similarity as the similarity measure, i.e., sim(a, b) = cos(a, b).
Meanwhile, the above estimation provides small yet nonzero scores for dissimilar pairs, such as (x_elephant, x_mountain), generating noisy supervision. We thus redefine M^feat with a threshold τ, where an entry smaller than τ becomes 0:

M^feat_{ij} = sim(x_i, x_j) if sim(x_i, x_j) ≥ τ, and 0 otherwise,

where we empirically set the threshold τ to 0.8. Such thresholding allows our model to deal with poor estimations of feature-based image similarity. From the two similarity matrices, we define a new estimation loss L_sim as the Frobenius norm of the error matrix M^feat − M^imp:

L_sim = ‖M^feat − M^imp‖_F = (Σ_{i∈[1,T]} Σ_{j∈[1,T]} |M^feat_{ij} − M^imp_{ij}|^2)^{1/2}.

In summary, our attention module is jointly trained with the following two objectives: • L_sim, estimating S*, which is initially a sparse matrix with many unobserved importances S*_{iỹ}; • L_DistAS, supervising α*_{ik}, attached to the CNN or RNN models, to follow the estimated supervision S*_{ik}.
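A sketch of this similarity-matching objective, using the stated cosine similarity, threshold τ = 0.8, and Frobenius-norm loss; the pairing of album A with the sampled album Ã is simplified to two equal-length feature matrices, and nonzero feature rows are assumed.

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return An @ Bn.T

def l_sim(X_a, X_b, Imp_a, Imp_b, tau=0.8):
    """Frobenius-norm mismatch between feature similarity (thresholded
    at tau) and importance-distribution similarity, comparing album A
    (X_a, Imp_a) with the sampled album A-tilde (X_b, Imp_b)."""
    M_feat = cosine_matrix(X_a, X_b)
    M_feat[M_feat < tau] = 0.0  # drop noisy signal from dissimilar pairs
    M_imp = cosine_matrix(Imp_a, Imp_b)
    return float(np.linalg.norm(M_feat - M_imp, ord='fro'))

T, d, E = 3, 6, 4
rng = np.random.default_rng(2)
loss = l_sim(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
             rng.random((T, E)) + 0.1, rng.random((T, E)) + 0.1)
```

When the feature similarities and importance similarities agree exactly (e.g., identical one-hot rows on both sides), the loss is zero.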
The entire model is trained with a new loss function: L = λ_cls ⋅ L_cls + λ_AS ⋅ L_DistAS + λ_sim ⋅ L_sim. We introduce an additional coefficient λ_sim for L_sim, balancing the S* estimation in the loss function. Training the expanded attention view α* with the counterfactual supervisions S*, we name these models CNN-DistAS and RNN-DistAS in the later experiments.

Debiased Ranking from Attention Distribution
In this section, we discuss how we generate the debiased ranking from our attention distribution. A naive prediction treating the maximum of the attention distribution as the importance score could be inherently biased toward the observed event. We argue that a better debiased ranking can be achieved by learning to discount images that are important in many event types but do not show discriminative parts, like the zebra image in Fig. 1.

Fig. 2 The overall architecture of RNN-DistAS.
For evaluation of debiased ranking, we employ the concept of Inverse Propensity Scoring (IPS) [18], penalizing images that appear as highly relevant across multiple event types (multiple documents). Specifically, the event-specific importance, namely the relevance R_i of the given image I_i in the album A, should be the probability of the image being relevant in the gold event y, normalized by its relevance in the other events ỹ, which we define as the propensity of the image. It can be estimated with α* as

R_i = max_k α*_{ik} / ((1/E) Σ_{k=1}^{E} α*_{ik}).

In this work, we treat the similarity between two images (x_i, x_j) as a propensity score of x_i over different events. This design targets inference time, when we are not aware of the gold event type; it is not guaranteed that the maximum attention is at the gold event y. However, by maximizing the attention score of the gold event at training time, where the target supervision S*_{iy} is initialized only at the gold event, this metric can achieve correct guidance.
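The IPS-style discounting can be sketched in a few lines. Since the excerpt states only that relevance in the gold event is "normalized by it being relevant in other events", the exact normalization below (peak attention over mean attention across events) is our assumption.

```python
import numpy as np

def debiased_relevance(alpha_star):
    """IPS-style relevance per image: attention at the most-attended
    event (gold event unknown at inference), discounted by the average
    attention across all E events, which acts as the propensity."""
    peak = alpha_star.max(axis=1)          # relevance at the top event
    propensity = alpha_star.mean(axis=1)   # how attended overall
    return peak / propensity

# Image 0 is specific to one event; image 1 is attended everywhere
# (zebra-like: relevant to ThemePark, Zoo, Safari, ...).
alpha_star = np.array([[0.9, 0.05, 0.05],
                       [0.5, 0.5, 0.5]])
r = debiased_relevance(alpha_star)  # image 0 outranks image 1
```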
Finally, we obtain the album representation z according to the normalized coefficients r_i via a softmax layer: r_i = softmax_i(R_i), z = Σ_i r_i x_i. For event-specific ranking, we use the debiased relevance score r_i as the sorting criterion for the image I_i.

Dataset
To evaluate the effectiveness of DistAS, we conduct experiments on two public benchmark datasets, the CUration of Flickr Events Dataset (CUFED) [12] and the Personal Events Collection (PEC) [19], for event recognition and event-specific ranking. Since no ranking annotations are available in PEC, we report only event recognition results on it, to show the effectiveness of our counterfactual approach and debiased ranking. The PEC dataset can be regarded as an extreme scenario of no human annotation.

Baselines
We compare the proposed approach DistAS with the current state-of-the-art baselines.
• Siamese-CNN [12] is trained to predict the difference in importance scores between a pair of images with a piece-wise ranking loss. At evaluation time, the output of the CNN is used as the sorting criterion. • Iterative-CNN-LSTM [11] consists of three different modules: (1) a CNN for image-level event recognition, (2) an LSTM for album-level event recognition, and (3) Siamese networks for importance prediction from Wang et al. [12]. The same ResNet architecture is used as the base network in modules 1 and 3. The prediction is iteratively improved by updating the outputs of modules 1 and 2 with the importance predicted by module 3.

Model Configuration
Due to the page limitation, we report the hyper-parameter settings for the CUFED dataset only; the details for the PEC dataset are available with our experiment code. We use ResNet50 [20] features as the image feature x_i, with dimension size 2048. The size of the context vector (e and e_k) is set to 128. For recurrent models, the size of the hidden states is fixed to 512, yielding 1024 in the bidirectional model. The decision network, i.e., the last fully connected layers, contains two feed-forward layers of 300 dimensions with a 0.2 dropout rate.
Regarding the other hyper-parameters: γ is empirically set to 3.0, making a clearer contrast between important and unimportant images. We observe that λ_AS works differently in the two AS approaches: 0.2 for ScalarAS and 0.8 for DistAS. We posit that this difference stems from the fact that the target attention β* used in DistAS already contains rich information about the gold event label y. For the counterfactual estimation, we observe that λ_sim = 0.01 works well. Training was unstable when we used larger values for λ_sim, such as 0.1 or 1. One possible reason is that the randomly sampled Ã introduces unnecessary training signals at the beginning of training, before useful ranking information has been learned.

Training Details
Following [21], training follows the same protocol of extracting multiple subsets from an album: we extract 16 images (i.e., T = 16) 20 times per album. To diminish the side effects of such sampling, we report the average performance over 5 runs. We use the Adam [22] optimizer with a learning rate of 0.001. Models are trained for 50 epochs to ensure convergence of the training loss, with a batch size of 64. All models are evaluated at the point of their best ranking performance on the validation set.

Direct Evaluation: Event-Specific Ranking
We begin the assessment of our model with a direct evaluation to show its superiority. For this evaluation, we follow the protocol of Wang et al. [11], reporting the precision@K% metric, which tells how many of the images with the highest predicted importance scores are ranked in the top K% of images ordered by the ground-truth importance.
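The metric can be sketched as follows; the exact tie-handling and rounding of K% to an image count are our assumptions about a common reading of precision@K%.

```python
import numpy as np

def precision_at_k_percent(pred_scores, gold_scores, k_percent):
    """Fraction of the top-K% images by predicted importance that also
    appear in the top-K% by ground-truth importance."""
    T = len(pred_scores)
    k = max(1, int(round(T * k_percent / 100)))
    top_pred = set(np.argsort(pred_scores)[::-1][:k])
    top_gold = set(np.argsort(gold_scores)[::-1][:k])
    return len(top_pred & top_gold) / k

# Toy album of 10 images: predicted vs. ground-truth importance.
pred = [0.9, 0.1, 0.8, 0.2, 0.5, 0.3, 0.4, 0.6, 0.7, 0.05]
gold = [1.0, 0.0, 0.9, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 0.05]
p10 = precision_at_k_percent(pred, gold, 10)  # compares the single top image
```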
The experimental results are shown in Table 1. Our findings can be summarized as twofold. First, as expected, our attention supervision approaches are better able to rank images than the state-of-the-art baselines. In particular, RNN-DistAS achieves 40.6% on the P@10% metric, outperforming the previous state-of-the-art Iterative-CNN-LSTM model by 10.6% points. Notably, we observe substantial improvements even in the weakest of our proposed models, CNN-ScalarAS, which gains 7.6% at P@20% and 10.3% at P@30% over Iterative-CNN-LSTM. This demonstrates the effectiveness of our problem formulation, employing the supervised attention as an internal ranking function.
Second, we demonstrate the effectiveness of our counterfactual supervisions, particularly at P@10%. CNN-DistAS achieves a 5.7% improvement over CNN-ScalarAS, and RNN-DistAS achieves a 6.2% point gain over RNN-ScalarAS. This shows that debiasing the importance of images that are important in multiple events is essential for selecting the most representative image in a given album.
For further analysis, we show qualitative examples in Fig. 3, presenting the top-8 images ranked by each model. As discussed, we observe that RNN-ScalarAS incorrectly gives high scores to irrelevant images, such as the flower image in Architecture (more important in a NatureTrip album). Meanwhile, our approach better highlights more discriminative images. Even when the ranked images are not optimally correlated with the human-ordered images, RNN-DistAS consistently shows a reasonable ordering, such as the tiger image at top-1 in the Zoo event, compared to the building image of RNN-ScalarAS.

Indirect Evaluation: Album Event Recognition
The main objective of our work is to investigate the impact of counterfactual supervisions. Following the direct evaluation, here we evaluate our models in terms of their contribution to the event recognition task. The results of album event recognition on the two datasets are provided in Table 2. From the table, we observe trends similar to the direct evaluation, showing the strength of the debiased ranking in the attention mechanism.
Our best-performing model, RNN-DistAS, reaching an accuracy of 75.7%, shows a 3.4% improvement over the state-of-the-art baseline Iterative-CNN-LSTM on the CUFED dataset. At the same time, RNN-DistAS achieves a better performance of 91.1% than Hierarchical-CNN-Att at 90.1%, showing the strength of debiased ranking even in the extreme scenario of no human annotation.

Ablation Study (CUFED)
To show the effectiveness of each component of DistAS, we conduct an ablation study on the CUFED dataset (Table 3). First, the effectiveness of the counterfactual estimation L_sim is tested in model 1: when we remove L_sim by setting λ_sim = 0, the performance drops significantly, to 36.3 at P@10%. Second, in model 2, we simply replace R_i with the maximum value of α*_i, rather than the IDF relevance (Eq. 16). The lower performance of model 2 demonstrates that the relevance modeling works in a meaningful way. Lastly, in model 3, we discard the filtering of unnecessary training signals in Eq. 13 by setting the threshold τ to 0. The result shows that the feature-based image similarity sim(x_i, x_j) is not an optimal measure, such that simple thresholding contributes significantly to the performance.
From these results, we have the following observations: (1) all components, including the counterfactual estimation L_sim, the IDF relevance, and the threshold τ, consistently contribute to the performance improvement; (2) the counterfactual augmentation is the most important component, leading to more substantial improvement than the other components.

Multi-label Analysis on S*
In this section, we conduct further analysis on the counterfactual estimation, specifically focusing on multi-label cases.
Recently, Wang et al. [11] studied the problem of ambiguous albums in the CUFED dataset, which carry multiple event types as labels (multi-label), and manually disambiguated the albums by re-annotating the dataset. Table 4 shows examples of the most frequently appearing event-type pairs of two-label albums, e.g., (Birthday, CasualFamilyGather). For comparison, we find such frequent event-type pairs from two-label images in the CUFED training set by selecting images of higher gold importance (included in the top 10%) that are nevertheless discounted by our IDF relevance ranking (excluded from the top 10%), obtained from the counterfactual supervisions.
In Table 4, we observe that our counterfactually identified event-type pairs are comparable with the manually identified pairs (overlapping pairs are marked in bold), even though we did not use any human annotations in training. These results show that the counterfactual supervisions correlate highly with human perception of personal events. We stress that the multi-label characteristics of the CUFED dataset can be found automatically by our approach, whereas [11] required human effort. Figure 4 shows real examples of counterfactual supervisions extracted from RNN-DistAS. As discussed above, there are closely related event pairs (e.g., BeachTrip and Cruise), and some images are visually similar and important in both events (e.g., swimming, with 1.6 importance). Their visual similarity makes it possible to augment the CUFED annotations in a counterfactual way, represented as a distribution. At the same time, it lets the IDF relevance effectively decrease the importance of multi-label images that do not show the discriminative parts of a specific event. We found a total of 1088 multi-label images in the CUFED dataset; this significant number of such images emphasizes the necessity of our counterfactual approach, which requires no human effort.

Vision Tasks
This paper raises a bias problem in existing attention supervision, whereas previous literature assumes no such bias. Das et al. [23] is the pioneering work introducing the inconsistency between human and machine attention in the Visual Question Answering (VQA) task. Toward attention that is plausible to human insight, Gan et al. [8] and Yu et al. [10] use human attention annotations, i.e., human gaze, to supervise the attention of neural architectures in vision tasks. However, they incur expensive human annotation overheads, such that methods for replacing human annotations have been explored [7,9,24,25].
Although several works propose to supervise neural attention for specific tasks, to the best of our knowledge, our work is the first to study the augmentation of counterfactual supervisions for improved attention, without increasing the annotation overhead on the human side.

Language Tasks
This paper studies how to machine-enhance the quality and quantity of human attention supervision. A related concept in language tasks is faithfulness [26,27], stating that the attention weights of unsupervised attention are too poorly correlated with the contribution of each word to the machine decision (i.e., unfaithful). Our work can be considered a means of enhancing faithfulness with machine self-supervision.
Another related concept is plausibility [28], requiring more expensive human annotations, namely rationales, for sample-specific annotation. Our work can be viewed as an alternative direction of leveraging machine self-supervision and keeping human annotation at the vocabulary level. Such human annotation overhead can even be replaced by existing pre-annotated resources: Zou et al. [29] consider a sentiment lexicon dictionary such as SentiWordNet for a related task. Our contribution is to show that simple human annotations (often replaced by public resources), with machine augmentation, can improve the accuracy and robustness of a model. There have been several works using attention supervision for different language tasks. Mi et al. [6] and Liu et al. [5] employ an explicit aligner as an attention prior in machine translation, and [30] leverages user-authenticated domains to narrow down the scope of attention. Strubell et al. [31] inject word dependency relations to recognize semantic roles in text. These supervision mechanisms mainly focus on injecting task-specific knowledge. In contrast, our distinction lies in improving the given attention supervision with sample-specific adaptations.

Event-Specific Ranking and Recognition
The goal of event recognition is to assign labels (e.g., CasualFamilyGather and Birthday) to a given image or album. With recent advances in image understanding [20,32], many event recognition approaches use deep learning models, such as CNNs, to capture the semantics of a single image (or of multiple images in an album). For example, to represent an album by effectively combining single-image features, neural attention was introduced by Guo et al. [4], which we adopt as a baseline. A key distinction of our work is that we study the task of supervising such attention, which contributes to boosting representation quality.

Conclusion
In this paper, we study the problem of counterfactual attention supervision in personal album recognition and ranking tasks. We propose to augment attention supervision by estimating the missing image importance in counterfactual events, without additional annotation overhead. This augmented supervision can be combined with simple models, improving event-specific relevance modeling and outperforming more sophisticated state-of-the-art methods.