1 Introduction

Deep models are well-known for their excellent performance in many challenging domains, as well as for their black-box nature. To interpret the prediction of a deep model, a number of explanation algorithms (Bach et al., 2015; Lundberg & Lee, 2017; Ribeiro et al., 2016; Smilkov et al., 2017; Sundararajan et al., 2017; Zhou et al., 2016) have recently been proposed to attribute the importance of every input feature in a given sample with respect to the model’s output. For example, given an image classification model, LIME (Ribeiro et al., 2016) and SmoothGrad (Smilkov et al., 2017) can attribute an importance score to every superpixel/pixel in an image with respect to the model’s prediction. In this way, one can gain insights into models’ behaviors by visualizing the important features used by the model for prediction.

We take image classification models as the research target. Using interpretation tools, we find that even though deep models make the same and correct predictions on the same image, they may rely on different sets of input features to solve the task. Our work uses LIME (or, similarly, SmoothGrad) to explain a number of image classification models trained on the same set of images, all of which make correct predictions. The explanation algorithm obtains (slightly to moderately) different explanations for these models, with examples shown later in Figs. 2 and 3. While the features used by these models are not exactly the same, we can still find a set of features that the majority of models use. We name them common features. We are thus particularly interested in two research questions: (1) What are the common features used by various models in an image? (2) Do models with better performance favor those common features? Answering these questions helps us better understand the behaviors behind black-box models.

To answer these two questions, we propose to study the common features across deep models and measure the similarity between the set of common features and the one used by each individual model. Specifically, as illustrated in Fig. 1, we generalize an electoral system: we first form a committee with a number of deep models, obtain the explanations for a given image based on one trustworthy explanation algorithm, then call for voting to get the cross-model consensus of explanations, or consensus for short, and finally compute a similarity score between the consensus and the explanation result of each deep model, denoted as the consensus score. Through extensive experiments using 80+ models on five datasets/tasks, we find that (1) the consensus is aligned with the ground truth of image semantic segmentation; (2) a model in the committee with a higher consensus score usually performs better in terms of testing accuracy; and (3) models’ consensus scores potentially correlate with their interpretability.

The contributions of this paper can be summarized as follows. To the best of our knowledge, this work is the first to investigate the common features used and shared by a large number of deep models for image classification by incorporating trustworthy explanation algorithms. We propose the cross-model consensus of explanations to characterize the common features and connect the consensus score to model performance and interpretability. Finally, we derive three observations from the experiments, together with thorough analyses and discussions.

Fig. 1: Illustration of the proposed framework that consists of three steps: (1) preparing a set of trained models as a committee, (2) aggregating explanation results across the committee to get the consensus, and (3) computing the similarity score of each explanation to the consensus

2 Related work

In this section, we first review the explanation algorithms and the approaches to evaluating their trustworthiness. Then we introduce some works related to our observations on the positive correlation between model performance and the proposed consensus score.

2.1 Explanation algorithms and evaluations

Many algorithms have been proposed to visualize the activated regions of feature maps in the intermediate layers (Chattopadhay et al., 2018; Selvaraju et al., 2020; Wang et al., 2020; Zhou et al., 2016), to gain insights for understanding the internals of convolutional networks. Apart from investigating the inside of complex deep networks, simple linear or tree-based surrogate models have been used as “out-of-box explainers” to explain the predictions made by the deep model over the dataset through local or global approximations (Ahern et al., 2019; Ribeiro et al., 2016; van der Linden et al., 2019; Zhang et al., 2019). Instead of using surrogates for deep models, investigations on the gradients for differentiable models have also been proposed to estimate the input feature importance with respect to the model predictions, such as SmoothGrad (Smilkov et al., 2017), Integrated Gradients (Sundararajan et al., 2017), DeepLIFT (Shrikumar et al., 2017) etc. Note that there are many other explanation algorithms (Afrabandpey et al., 2020; Atanasova et al., 2020; Bach et al., 2015; Kim et al., 2018; Looveren & Janis, 2020) and we mainly focus on those that are related to feature attributions and suitable for image classification models in this work.

There are few works analyzing explanations across models. For example, Fisher et al. (2019) theoretically studied the variable importance for machine learning models of the same family. Agarwal et al. (2021) aggregate rankings from several classifiers for time series classification tasks, instead of averaging explanation results. In this work, we investigate the cross-model explanations for image classification models and relate the consensus to the ability of localizing visual objects, model performance, and interpretability.

Evaluations of the trustworthiness of explanation algorithms aim to qualify their fidelity to the models and to avoid misunderstanding of models’ behaviors. For example, Adebayo et al. (2018) found that some algorithms are independent of both the model and the data generating process, which should be avoided because such algorithms do not explain the model, and thus proposed a sanity-check framework that perturbs parts of the models’ parameters. Quantitative metrics for trustworthiness evaluations include measuring the performance drop caused by perturbation of important features (Hooker et al., 2019; Petsiuk et al., 2018; Samek et al., 2016; Vu et al., 2019), model trojaning attacks (Chen et al., 2017; Gu et al., 2017; Lin et al., 2020), infidelity and sensitivity to similar samples in the neighborhood (Ancona et al., 2018; Yeh et al., 2019), crafted datasets (Yang & Kim, 2019), and user-study experiments (Jeyakumar et al., 2020; Lage et al., 2019).

Towards building more interpretable and explainable AI systems, trustworthy explanation algorithms are the first step. Evaluations of model interpretability, indicating which model is more interpretable, are also needed. However, such evaluations across deep models are scarce. Bau et al. (2017) proposed Network Dissection, which builds an additional dataset with dense annotations of a number of visual concepts for evaluating the interpretability of convolutional neural networks. Given a convolutional model, Network Dissection recovers the intermediate-layer feature maps used by the model for the classification. It then measures the overlap between the activated regions in the feature maps and the densely human-labeled visual concepts to estimate the interpretability of the model. Note that elaborately designed user-study experiments are also a common solution for evaluating deep model interpretability.

In this work, we do not directly evaluate the interpretability across deep models. Instead, we experimentally show that the consensus score is positively correlated to the generalization performance of deep models and related to the interpretability. We will discuss more details with analyses later. Based on the explanations, our proposed framework and the consensus score could help to understand the deep models better.

2.2 Explanation and semantic segmentation

Explanations are also useful to improve model performance (Kim et al., 2020), robustness (Ross et al., 2018), or interpretability (Chen et al., 2019). One related direction is weakly supervised semantic segmentation, which trains a deep model with image-wise annotations and aims to predict pixel-wise segmentation results. The explanation results help bridge the gap between the two levels of labels (Jo et al., 2021; Wang et al., 2020). The proposed consensus, obtained across models from another perspective, confirms this connection, as described in our first observation.

3 Framework of cross-model consensus of explanations

In this section, we first recall the two explanation algorithms that are used to validate our proposed framework, i.e., LIME (Ribeiro et al., 2016) and SmoothGrad (Smilkov et al., 2017). Then we introduce the proposed approach that generalizes the electoral system to provide the consensus of explanations across various deep models.

3.1 Recall LIME and SmoothGrad

Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016) searches for an interpretable model, usually a linear one, to approximate the output of a deep model around an individual data point, so that LIME obtains a weighted linear combination of features encoding the importance of every feature for classifying that data point. To explain a deep model on vision tasks, LIME (Ribeiro et al., 2016) first performs a superpixel segmentation (Vedaldi et al., 2008) of a given image to reduce the number of input features (pixels aggregated into superpixels), then generates interpolated samples by randomly masking some superpixels and computes the prediction outputs of the generated samples through the original model, and finally fits a linear regression model to the outputs with the presence/absence of superpixels as inputs. The linear weights are then used to indicate the contributions of the superpixels as the explanation result.
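The sketch below illustrates this procedure for a single image. It is a minimal re-implementation for illustration only, not the authors’ code: the `model` callable (assumed to map a batch of images to class probabilities), the zero masking value, and the Ridge surrogate are assumptions.

```python
# Minimal LIME-style sketch: segment, perturb, and fit a linear surrogate.
import numpy as np
from skimage.segmentation import slic
from sklearn.linear_model import Ridge

def lime_explain(image, model, target_class, num_samples=1000, n_segments=50):
    segments = slic(image, n_segments=n_segments)        # superpixel segmentation
    seg_ids = np.unique(segments)
    # Randomly switch superpixels on/off to create interpolated samples.
    masks = np.random.randint(0, 2, size=(num_samples, len(seg_ids)))
    outputs = []
    for mask in masks:
        perturbed = image.copy()
        for seg_id, keep in zip(seg_ids, mask):
            if not keep:
                perturbed[segments == seg_id] = 0.0      # mask out the superpixel
        # `model` is assumed to return class probabilities for a batch.
        outputs.append(model(perturbed[None])[0, target_class])
    # Fit a linear surrogate: presence/absence of superpixels -> model output.
    surrogate = Ridge(alpha=1.0).fit(masks, np.array(outputs))
    return surrogate.coef_                               # importance per superpixel
```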

LIME is model-agnostic and places no requirement on the internals of models. In contrast, another family of explanation algorithms is based on gradients and thus requires the models to be differentiable. The gradients of the model output w.r.t. the input can partly identify influential pixels, but due to the saturation of activation functions in deep networks, the vanilla gradient is usually noisy. SmoothGrad (Smilkov et al., 2017) reduces the visual noise by repeatedly adding small random noise to the image, computing the gradients for the noised inputs, and then averaging these gradients to smooth out the noise and obtain the final explanation result. In this work, we conduct experiments based on LIME and SmoothGrad to validate our proposed approach. Note that other post-hoc explanation algorithms for interpreting individual samples are also applicable to our proposed framework.
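A minimal SmoothGrad sketch follows, assuming a differentiable PyTorch `model` that outputs class logits for a batch of images; the noise level, the number of samples, and the channel aggregation are illustrative choices rather than the paper’s exact settings.

```python
# Minimal SmoothGrad sketch: average input gradients over noisy copies.
import torch

def smoothgrad(model, image, target_class, n_samples=50, noise_std=0.15):
    grads = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image + noise_std * torch.randn_like(image)).requires_grad_(True)
        score = model(noisy.unsqueeze(0))[0, target_class]
        score.backward()                      # gradient of the class score w.r.t. the noisy input
        grads += noisy.grad
    # One simple way to aggregate channels into a single saliency map.
    return (grads / n_samples).abs().sum(dim=0)
```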

3.2 Steps to computing cross-model consensus of explanations

Based on one of the explanation algorithms, our proposed framework computes the cross-model consensus of explanations and the consensus score, with the three specific steps as follows.

Algorithm 1
Algorithm 2

Step 1: Committee formation with deep models

Given m deep models trained for solving a target task (image classification in our experiments) on a visual dataset where each image contains one primary object, the approach first forms the given deep models into a committee, denoted as \({\mathcal {M}}\), and relies on the variety of models in the committee to establish the consensus for comparisons and evaluations.


Step 2: Committee voting for consensus achievement

With a committee of deep models for the image classification task, our proposed framework leverages a trustworthy interpretation tool \({\mathcal {A}}\), e.g. LIME (Ribeiro et al., 2016) or SmoothGrad (Smilkov et al., 2017), to obtain the explanation on each image in the dataset. Given a sample, denoted as \(d_i\), from the dataset, we denote the obtained explanation results of all models in the committee as \({\varvec{L}}\), where \({\varvec{L}}_{j}\) indicates the explanation given by the j-th model. Then, we propose a voting procedure that aggregates \(\{{\varvec{L}}_j\}_{j=1,\dots ,m}\) to reach the cross-model consensus of explanations, i.e., the consensus, \({\varvec{c}}\) for \(d_i\). Specifically, the k-th element of the consensus \({\varvec{c}}\) is \({\varvec{c}}_k = \frac{1}{m} \sum _{j=1}^m \frac{{\varvec{L}}_{jk}^2}{\Vert {\varvec{L}}_{j} \Vert },\ \forall 1\le k\le K\) for LIME, where K refers to the dimension of an explanation result, and \({\varvec{c}}_k = \frac{1}{m} \sum _{j=1}^m \frac{{\varvec{L}}_{jk} - \min ({\varvec{L}}_{j})}{\max ({\varvec{L}}_{j}) - \min ({\varvec{L}}_{j})},\ \forall 1\le k\le K\) for SmoothGrad, following the conventional normalization-averaging procedure (Ahern et al., 2019; Ribeiro et al., 2016; Smilkov et al., 2017). In the end, the consensus is reached for every sample in the target dataset through committee voting.
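A direct implementation of these two voting rules is sketched below; the `explanations` argument (a list of per-model explanation vectors for one sample) is an illustrative interface, not the authors’ code.

```python
# Committee voting: the normalization-averaging rules stated above.
import numpy as np

def vote_lime(explanations):
    # c_k = (1/m) * sum_j L_jk^2 / ||L_j||
    stacked = np.stack(explanations)                       # shape (m, K)
    norms = np.linalg.norm(stacked, axis=1, keepdims=True)
    return (stacked ** 2 / norms).mean(axis=0)

def vote_smoothgrad(explanations):
    # c_k = (1/m) * sum_j (L_jk - min L_j) / (max L_j - min L_j)
    stacked = np.stack(explanations)
    mins = stacked.min(axis=1, keepdims=True)
    maxs = stacked.max(axis=1, keepdims=True)
    return ((stacked - mins) / (maxs - mins)).mean(axis=0)
```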

Step 3: Consensus-based similarity score

Given the reached consensus, our approach calculates the similarity score between the explanation result of every model in the committee and the consensus, as the consensus score. Specifically, for the explanations and the consensus based on LIME (visual feature importance at the superpixel level), the cosine similarity between the flattened explanation vector of each model and the consensus is used. For the results based on SmoothGrad (visual feature importance at the pixel level), a similar procedure is followed, where the proposed algorithm uses a Radial Basis Function (RBF, \(exp({-\frac{1}{2}(||{\varvec{a}}- {\varvec{b}}||/\sigma )^2})\)) for the similarity measurement. The difference in similarity computations is due to the facts that (1) the dimensions of LIME explanations vary across data samples while those of SmoothGrad do not, and (2) the scales of LIME explanation results vary much more than those of SmoothGrad. Thus cosine similarity is more suitable for LIME while RBF is for SmoothGrad. Eventually, the framework computes a quantitative but relative score for each model in the committee using their similarity to the consensus.
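The two similarity measures can be sketched as follows; the RBF bandwidth `sigma` is a hyper-parameter whose value is not specified in the text and is therefore an assumption here.

```python
# Consensus score: cosine similarity for LIME, RBF kernel for SmoothGrad.
import numpy as np

def consensus_score_lime(explanation, consensus):
    a, b = explanation.ravel(), consensus.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def consensus_score_smoothgrad(explanation, consensus, sigma=1.0):
    dist = np.linalg.norm(explanation.ravel() - consensus.ravel())
    return float(np.exp(-0.5 * (dist / sigma) ** 2))
```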

In summary, the proposed method first selects a number of deep models as a committee. Then, given a data sample from the dataset, the proposed method computes the explanation using one interpretation algorithm (e.g. LIME or SmoothGrad) for each deep model in the committee, and obtains the cross-model consensus of explanations through a voting process. Finally, given the reached consensus, our approach calculates for each model, as its consensus score, the similarity between the explanation result of that model and the consensus. Such a consensus score is computed for each data sample, and the final score is averaged across all data samples in the dataset. For further clarity, these three steps of the proposed framework are illustrated in Fig. 1 and formalized in Algorithm 1, while the definitions of the elementary functions (i.e., the interpretation algorithm, the voting procedure, and the similarity score) are omitted in Algorithm 1 but given in Algorithm 2.
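Putting the pieces together, a compact sketch of the overall pipeline (in the spirit of Algorithm 1, not a transcription of it) could look as follows, reusing the helper functions sketched above; all names and interfaces are illustrative.

```python
# End-to-end sketch: committee explanations -> voting -> per-model consensus scores.
import numpy as np

def consensus_scores(committee, dataset, explain, vote, similarity):
    scores = np.zeros(len(committee))
    for sample in dataset:
        explanations = [explain(model, sample) for model in committee]   # Step 2a
        consensus = vote(explanations)                                    # Step 2b
        scores += [similarity(e, consensus) for e in explanations]        # Step 3
    return scores / len(dataset)   # consensus score per model, averaged over samples
```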

4 Overall experiments and results

In this section, we start by introducing the experiment setups. We use image classification as the target task and follow the proposed framework to obtain the consensus and compute the consensus scores. Through the experiments, we have found (1) a good alignment between the consensus and image semantic segmentation, (2) positive correlations between the consensus score and model performance, and (3) potential correlations between the consensus score and model interpretability. We end this section with robustness analyses of the framework.

4.1 Evaluation setups

4.1.1 Datasets

For overall evaluations and comparisons, we use ImageNet (Deng et al., 2009) for general visual object recognition and CUB-200-2011 (Welinder et al., 2010) for bird recognition respectively. Note that ImageNet provides the class label for every image, and the CUB-200-2011 dataset includes the class label and pixel-level segmentation for the bird in every image, where the pixel annotations of visual objects are found to align with the consensus.

4.1.2 Committee formation with deep models

Our main experiments and results include models of the two committees based on ImageNet and CUB-200-2011, respectively. Both of them target the image classification task, with each image being labeled to one category. For complete comparisons, we use more than 80 deep models trained on ImageNet that are publicly available.

There are over 100 deep models (Footnote 1) available at the moment we initiate the experiments. We first exclude some very large models that take much more computation resources. Then, for the consistency of computing superpixels, we include only the models that take images of size 224\(\times \)224 as input, resulting in 81 models for the committee based on ImageNet. Note that other models can also be included by an additional step of aligning the superpixels in images of different sizes. However, in our experiments, we choose to ignore this small set of models since a large number of models are already available.

As for CUB-200-2011 (Welinder et al., 2010), we similarly first exclude the very large models. Then we follow the standard procedures (Sermanet et al., 2014) for fine-tuning ImageNet-pretrained models on CUB-200-2011. We choose the default hyper-parameter setting to conduct the fine-tuning experiments on CUB-200-2011 for all models, i.e., learning rate 0.01, batch size 64, SGD optimizer with momentum 0.9, resizing images so that the short edge is 256, and randomly cropping images to the size of 224\(\times \)224, and obtain 85 well-trained models. Different hyper-parameters may help to improve the performance of some specific networks, but for a fair comparison across model structures and for economic reasons, we choose not to do hyper-parameter tuning.
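For reference, the recipe above corresponds roughly to the following PyTorch-style configuration; it is a sketch of the stated hyper-parameters, not the authors’ training script, and the data loading and training loop are omitted.

```python
# Fine-tuning configuration sketch for CUB-200-2011 (illustrative only).
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),        # resize so that the short edge is 256
    transforms.RandomCrop(224),    # random 224x224 crops
    transforms.ToTensor(),
])

def make_optimizer(model):
    # learning rate 0.01, SGD with momentum 0.9; batch size 64 is set in the DataLoader
    return torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```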

4.1.3 Explanation algorithms

As we previously introduced, we consider two explanation algorithms, LIME and SmoothGrad. Specifically, LIME surrogates the explanation as the assignment of visual feature importance to superpixels, and SmoothGrad outputs the explanations as the visual feature importance over pixels. In this way, we can validate the flexibility of the proposed framework over explanation results from diverse sources (i.e., linear surrogates vs. input gradients) and in multiple granularities (i.e., feature importance in superpixel/pixel-levels).

4.1.4 Computation costs

We here report the computation costs of preparing the committee and the explanations for reference, tested on one V100 GPU. Fine-tuning each model on CUB-200-2011 takes roughly one hour, depending on the model size. LIME takes around 15 s per sample on average (varying across models) and SmoothGrad takes around 3 s per sample. For practical usage of the cross-model consensus of explanations, a committee of 15 models is suggested, while a smaller one of 5 models may work as well.

4.2 Alignment between the consensus and image segmentation

The image segmentation task seeks pixel-wise classifications of images. The cross-model consensus of explanations for image classification is well aligned with image segmentation, especially when only one object is contained in the image. This observation partially demonstrates the effectiveness of most deep models in extracting visual objects from input images. We show two examples using both LIME and SmoothGrad in Figs. 2 and 3, from ImageNet and CUB-200-2011 respectively. For both examples, we find that the explanation algorithms reveal the models’ predictions by highlighting some parts of the target objects, while the cross-model consensus shows a much better alignment with the objects than individual models. This observation can be found in more examples, as shown in the Appendix.

Fig. 2: Visual comparisons between the consensus and the interpretation results of CNNs using LIME (upper row) and SmoothGrad (lower row) based on an image from ImageNet, where the ground truth of segmentation is not available

Fig. 3: Visual comparisons between the consensus and the explanation results of deep models using LIME (upper row) and SmoothGrad (lower row) based on an image from CUB-200-2011, where the ground truth of segmentation is available as pixel-wise annotations and the mean Average Precision (mAP) is measured

We confirm this alignment using the Average Precision (AP) score between the cross-model consensus of explanations and the image segmentation ground truth, where the latter is available on CUB-200-2011. We take the mean of AP scores (mAP) over the dataset to compare with the overall consensus scores. Higher mAP scores indicate better alignment between the explanations and the image segmentation ground truth. The quantitative results are shown in Fig. 4, where the consensus achieves higher mAP scores than any individual network. Both visual comparisons and quantitative results validate the closeness of consensus to the ground truth of image segmentation.
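The alignment measurement can be sketched as follows, assuming the per-image consensus maps and binary segmentation masks are available as arrays; the function and variable names are illustrative.

```python
# mAP of the consensus against the binary segmentation ground truth.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(consensus_maps, segmentation_masks):
    aps = [
        average_precision_score(mask.ravel().astype(int), cons.ravel())
        for cons, mask in zip(consensus_maps, segmentation_masks)
    ]
    return float(np.mean(aps))
```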

Fig. 4: Correlation between model performance and mAP scores against the segmentation ground truth using LIME (left) and SmoothGrad (right) on CUB-200-2011 over 85 models. Pearson correlation coefficients are 0.927 (p value 4e−37) for LIME and 0.916 (p value 9e−35) for SmoothGrad. Taking "AlexNet" as an example, this model achieves a 0.507 accuracy score on CUB-200-2011, and the alignment between its LIME explanations and the ground truth of semantic segmentation is measured by an mAP score of 0.343 (0.571 for SmoothGrad). These numeric results are reported in the Appendix. Moreover, the "Consensus" points here refer to the testing accuracy of the ensemble of networks in the committee by probability averaging and voting (y-axis), as well as the mAP between the consensus and the ground truth (x-axis). For conciseness, models in the same family are represented by the same symbol. Best viewed in color and with zoom-in (Color figure online)

4.3 Positive correlations between consensus scores and model performance

Raw input features are not always useful: some are discriminative while others are not. We use discriminative features to refer to those that can be used by models to well separate samples from different categories; usually they are the important features for solving the learning task. Based on this, we can reasonably assume that (1) for classification tasks on single-object images, the discriminative features are the pixels/superpixels of the target object in the image, and (2) if the key features used by a deep model (which can be revealed by trustworthy explanation algorithms) are aligned with the discriminative ones, the model is more likely to produce correct predictions and thus achieve better performance. We showed previously that the cross-model consensus of explanations is aligned with object segmentation, implicitly indicating that the common features may be aligned with the discriminative ones. Here we show the positive correlations between the consensus score and model performance.

Specifically, in Fig. 5, we present the consensus scores (x-axis) using LIME (left) and SmoothGrad (right) on ImageNet (a) and CUB-200-2011 (b), against model performance (y-axis). High correlation coefficients are observed across the dataset-explanation combinations, though in some local areas of Fig. 5 (a, right), the correlation between the consensus score and model performance is weaker. We thus conclude that, overall, the evaluation results based on the consensus score using both LIME and SmoothGrad over the two datasets correlate significantly with model performance. This is further supported by experiments on other datasets with random subsets of deep models, as shown in Fig. 11 (Appendix 2).
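The reported coefficients can be obtained with a few lines, assuming the per-model consensus scores and test accuracies are available as arrays; the names are illustrative.

```python
# Pearson correlation between consensus scores and testing accuracies.
from scipy.stats import pearsonr

def report_correlation(consensus_scores, accuracies):
    r, p_value = pearsonr(consensus_scores, accuracies)
    return r, p_value   # e.g., r around 0.9 with a very small p value in Fig. 5 (b)
```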

Fig. 5: Model performance vs. consensus scores using LIME (left) and SmoothGrad (right) over 81 models on ImageNet (a) and 85 models on CUB-200-2011 (b). Pearson correlation coefficients are (a) 0.809 and 0.783, (b) 0.908 and 0.880. For conciseness, networks in the same family are represented by the same symbol. Best viewed in color and with zoom-in (Color figure online)

4.4 Potential correlations between consensus scores and model interpretability

Deep model interpretability measures the ability to present the model’s behavior in understandable terms to a human (Doshi-Velez & Kim, 2017). Network Dissection (Bau et al., 2017) and user-study experiments are two possible methods to measure the interpretability of deep models quantitatively. Network Dissection (Bau et al., 2017) provided a dataset named Broden, which pre-defines a set of semantics, including colors, patterns, materials, textures, object parts, etc., and provides manually annotated pixel-wise labels in each image. Network Dissection uses this dataset to count the number of semantic neurons in the intermediate layers of deep models as the interpretability measure. User-study evaluations measure the interpretability through designed experiments with humans’ interactions and the collected statistics.

This subsection shows that the consensus scores are correlated with interpretability scores, measured by Network Dissection and user-study evaluations. Note that the consensus scores are computed based on explanation results, but they are not a direct estimator or metric of the model interpretability.

Table 1: Rankings (and scores) of five deep models, evaluated by Network Dissection (Bau et al., 2017) (Net.Dis.), user-study evaluations (User-Study), and the consensus with LIME and SmoothGrad (C.LIME and C.SG, respectively).

4.4.1 Consensus versus network dissection

We compare the results of the proposed framework with those from Network Dissection (Bau et al., 2017). Based on the Broden dataset, Network Dissection reported a ranking list of five models (w.r.t. model interpretability), shown in Table 1, by counting the semantic neurons, where a neuron is defined as semantic if its activated feature maps overlap with human-annotated visual concepts. With our proposed framework, we report the consensus scores using LIME and SmoothGrad in Table 1, which are consistent with Fig. 5 (a, LIME) and (a, SmoothGrad). The three ranking lists are almost identical, except for the comparison between DenseNet161 and ResNet152: in both lists based on the consensus scores, DenseNet161 is similar to ResNet152 with marginally higher consensus scores, while Network Dissection considers ResNet152 more interpretable than DenseNet161.

We believe the results from our proposed framework and Network Dissection are close enough from the perspectives of ranking lists. The difference may be caused by the different ways that our framework and Network Dissection perform the evaluations. The consensus score measures the similarity to the consensus on images, while Network Dissection counts the number of neurons in the intermediate layers activated by all the visual concepts, including objects, object parts, colors, materials, textures, and scenes. Furthermore, Network Dissection evaluates the interpretability of deep models using the Broden dataset with densely labeled visual objects and patterns (Bau et al., 2017), while the consensus score does not need additional datasets or the ground truth of semantics. In this way, the results of our proposed framework and Network Dissection might be slightly different.

4.4.2 Consensus versus user-study evaluations

In order to further validate the effectiveness of the proposed framework, we have also conducted user-study experiments on these five models and report the results in the second row of Table 1. The experimental settings of the user-study evaluations are as follows. For each image, we randomly choose two of the five models and present the LIME (or SmoothGrad, respectively) explanations of the two models, without giving the model information to users. Users are then requested to choose which one better reveals the model’s reasoning in making predictions according to their understanding, or to mark them equal if the two interpretations are equally bad or good. Each pair of models is repeated three times and presented to different users. The better one in each pair gets three points and the other one gets zero; in the equal case, both get one point. Finally, for each model, the total is normalized by dividing by the number of images and the number of repeats (i.e. 3). The user-study evaluations yield scores indicating the model interpretability, as shown in Table 1. The results confirm that our proposed framework is capable of approximating the model interpretability.
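The scoring rule can be summarized in code as follows; the `votes` data structure is an illustrative representation of the collected pairwise judgments, not the authors’ actual logging format.

```python
# Pairwise user-study scoring: 3 points for the preferred model, 0 for the
# other, 1 each on a tie; totals normalized by (#images x #repeats).
from collections import defaultdict

def user_study_scores(votes, num_images, num_repeats=3):
    # `votes` is a list of (model_a, model_b, winner) tuples,
    # where winner is model_a, model_b, or the string "equal".
    points = defaultdict(float)
    for model_a, model_b, winner in votes:
        if winner == "equal":
            points[model_a] += 1
            points[model_b] += 1
        else:
            points[winner] += 3
    norm = num_images * num_repeats
    return {model: score / norm for model, score in points.items()}
```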

We note that this is a small-scale user study with around thirty users. Since Network Dissection provides a ranking list of only five models with which we can compare, our experiments here aim to validate the effectiveness of the proposed framework by approximately evaluating model interpretability. The scores obtained in the user study may not be highly accurate, but the ranking list is roughly valid.

4.5 Robustness analyses of consensus

In this subsection, we investigate several factors that might affect the evaluation results with consensus, including basic explanation algorithms (e.g., LIME and SmoothGrad), the size of the committee, and the candidate pool for models in the committee.

4.5.1 Consistency between LIME and SmoothGrad

Even though the granularity of the explanation results from LIME and SmoothGrad differs, which causes mismatches in the mAP scores against the segmentation ground truth, the consensus scores based on the two algorithms are generally consistent. The consistency is confirmed by Fig. 6, which shows the consensus scores based on LIME against those based on SmoothGrad. The correlation coefficients are 0.825 and 0.854 respectively, indicating strong correlations over all models on both datasets. This shows that the proposed framework can work well with a broad spectrum of basic explanation algorithms.

Fig. 6: Consistency between LIME and SmoothGrad. This figure shows the similarity to the consensus of SmoothGrad interpretations vs. the similarity to the consensus of LIME interpretations on the ImageNet committee (a) and the CUB-200-2011 committee (b). Pearson correlation coefficients are (a) 0.825 and (b) 0.854. For conciseness, networks in the same family are represented by the same symbol. Best viewed in color and with zoom-in (Color figure online)

4.5.2 Consistency across committees

In real-world applications, committee-based estimations and evaluations may produce inconsistent results from committee to committee. In this work, we are interested in whether the consensus score estimations are consistent against changes of the committee. Given 16 ResNet models as the targets, we form 20 independent committees by combining the 16 ResNet models with 10–20 models randomly drawn from the remaining networks. In each of these 20 independent committees, we compute the consensus scores of the 16 ResNet models. We then estimate the Pearson correlation coefficients between any of these 20 results and the one in Fig. 5 (a, LIME), where the mean correlation coefficient is 0.96 with a standard deviation of 0.04. To visually show the low variance, we present the consensus scores and performance (Footnote 2) of these 16 ResNet models based on the complete committee (81 models) and on themselves alone (16 ResNet models) in Fig. 7. No large difference is observed between these two extreme cases. Thus, we can say the consensus score evaluation is consistent against randomly picked committees.
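This consistency check could be organized as in the sketch below, where `score_fn` stands for the framework of Sect. 3.2 applied to a committee; the committee sizes and trial count follow the text, while the function interfaces are assumptions.

```python
# Form random committees around the 16 ResNet targets and correlate their
# consensus scores with the reference scores from the full committee.
import random
import numpy as np
from scipy.stats import pearsonr

def committee_consistency(resnets, other_models, dataset, score_fn,
                          reference_scores, num_trials=20):
    correlations = []
    for _ in range(num_trials):
        extra = random.sample(other_models, random.randint(10, 20))
        committee = resnets + extra
        scores = score_fn(committee, dataset)[:len(resnets)]  # scores of the 16 ResNets
        r, _ = pearsonr(scores, reference_scores)
        correlations.append(r)
    return np.mean(correlations), np.std(correlations)
```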

Fig. 7: Model performance vs. similarity to the consensus of LIME on the ResNet family. The consensus in the left plot is voted by the ResNet family (16 models), while the right one is voted by the complete committee on ImageNet (81 models). Best viewed in color and with zoom-in (Color figure online)

4.5.3 Convergence over committee sizes

To understand the effect of the committee size on the consensus score estimation, we run the proposed framework using committees of various sizes formed by deep models randomly picked from the pool. In Fig. 8, we plot and compare the performance of the consensus with increasing committee sizes, where we estimate the mAP between the ground truth and the consensus reached by random committees of different sizes, with 20 random trials conducted independently for every size. The curve of mAP quickly converges to that of the complete committee, and the consensus based on a small proportion of the committee (e.g., 15 networks) already works well compared to the complete committee of 85 networks.

Fig. 8: Convergence of mAP between the ground truth and the consensus results based on committees of increasing sizes, using LIME on CUB-200-2011. The green lines and orange triangles are, respectively, the mean values and the median values of 20 random trials. The red dashed line is the mAP of the consensus reached by the complete committee of the original 85 models

5 Discussions: limitations and strengths with future works

In this section, we discuss several limitations and strengths in our studies, with interesting directions for future works.

5.1 Limitations

First of all, our studies are based on the explanation algorithms. We propose to study the features used by deep models via the explanation results (i.e., the importance of superpixels/pixels in the image for prediction). The correctness of these explanation algorithms might affect our results. However, we use two independent algorithms, LIME (Ribeiro et al., 2016) and SmoothGrad (Smilkov et al., 2017), which attribute feature importance at two different scales, i.e., superpixels and pixels. Both algorithms lead to the same observations and conclusive results (see Sect. 4.5 for the consistency between results obtained by LIME and SmoothGrad). Thus, we believe the explanation algorithms here are (almost) trustworthy and it is appropriate to use explanation results as a proxy to analyze features. For future research, we would include more advanced explanation algorithms to confirm our observations.

We obtain some interesting observations from our experiments and make conclusions using multiple datasets. However, the image classification datasets used in our experiments have a limitation: every image in the dataset contains only one visual object for classification. It is reasonable to doubt that when multiple visual objects (other than the target for classification) and complicated visual patterns in the background (Chen et al., 2017; Koh & Liang, 2017) co-exist in an image, the cross-model consensus of explanations may no longer overlap with the ground truth semantic segmentation. To showcase our approach, we include an example from the COCO dataset (Lin et al., 2014) in Fig. 9, where multiple objects co-exist in the image and the consensus partly matches the segmentation. To address this issue, our future work would focus on datasets with multiple visual objects and complicated backgrounds for object detection, segmentation, and multi-label classification tasks.

Fig. 9: Visualization of an image from the MS-COCO dataset (Lin et al., 2014) showing the strengths of the cross-model consensus of explanations, where the predicted label with probability is noted. Models are not adapted to the COCO dataset

Finally, only well-known models with good performance have been included in the committee. This probably introduces some bias in our analysis, but it would not cause many problems in practice because these models are among the first choices and are frequently used in many applications. Moreover, if the committee consisted of a large number of random-guess models, the consensus would become a constant matrix. To avoid this case and simplify the analyses, we consider well-known models with good performance in this work. In our future work, we would include more models with diverse performance to seek more observations and will try to explain more complicated models such as Transformers (Yuan et al., 2021).

5.2 Strengths

In addition to the limitations, we demonstrate several strengths of cross-model consensus of explanations for further studies.

As shown in Fig. 8, with a larger committee, the consensus slowly converges to a stable set of common features that aligns with the segmentation ground truth of the dataset. This experiment further demonstrates the capacity of the consensus to precisely locate the visual objects for classification. Thus, in our future work, we would like to use the consensus based on a committee of image classification models to detect the visual objects in an image.

Model performance is one of the most critical metrics in most practical scenarios. Estimations of model performance are needed in these situations, especially when there are no (or few) validation samples. Our second observation that a model in the committee with a higher consensus score usually performs better in terms of testing accuracy, would be helpful to relatively estimate the performance of models.

We believe that the cross-model consensus of explanations, or the common features, is an explanation of data, instead of explanations for individual models, that aim to explain the model. Informally, we consider the explanations as a conditional probability (of importance) of features f given a trained deep model \({\varvec{M}}\), denoted as \(p(f \vert {\varvec{M}})\). Higher values of \(p(f \vert {\varvec{M}})\) indicate that the features f are (supposed to be) more important to solve the task, from the view of the given model \({\varvec{M}}\). Then intuitively, the cross-model consensus of explanations is to marginalize out the variable of models, i.e., \(p(f) = \int _{{\varvec{M}}} p(f \vert {\varvec{M}}) p({\varvec{M}}) \text {d} {\varvec{M}}\), to indicate the feature importance from the view of data. In practice, the intractable integration is approximated by the discrete summation (15 models approximate well, cf Sect. 4.5). The common features found by the consensus are approximately equivalent to the features that are discriminative for solving the task, which resides in data. Therefore, the cross-model consensus of explanations would be capable of identifying the discriminative features.
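As a minimal worked form of this approximation, assuming a uniform prior over the m committee members, we have \(p(f) = \int _{{\varvec{M}}} p(f \vert {\varvec{M}}) p({\varvec{M}}) \text {d} {\varvec{M}} \approx \frac{1}{m} \sum _{j=1}^{m} p(f \vert {\varvec{M}}_j)\); with the (normalized) LIME or SmoothGrad explanations standing in for \(p(f \vert {\varvec{M}}_j)\), the right-hand side is exactly the normalization-averaging rule used to compute the consensus \({\varvec{c}}\) in Sect. 3.2.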

Following the previous notations, individual models are supposed to use these discriminative features to achieve good performance, and thus the consensus score measures the quantity of discriminative features that a model uses to make predictions. A higher consensus score indicates that the model is more likely to achieve good performance. More investigations along this line may lead to theoretical proofs of our second observation.

Furthermore, our experiments with both explanation algorithms on all datasets have found that consensus scores are correlated with the interpretability scores of the models, even though the interpretability scores were evaluated in totally different ways, i.e., Network Dissection (Bau et al., 2017) and user studies. Network Dissection evaluates the interpretability of a model by matching its activation maps in intermediate layers with the ground truth segmentation of visual concepts in the image. A model with higher interpretability should have more convolutional filters activated at the visual patterns/objects for classification. In this way, we particularly measure the similarity between the explanation results obtained for every model and the segmentation ground truth of images. We found that the models’ segmentation-explanation similarity significantly correlates with their consensus scores (see Fig. 10). This observation might encourage us to further study the connections between interpretability and consensus scores in future work.

Fig. 10: Correlation between mAP scores against the segmentation ground truth and the consensus scores using (a) LIME and (b) SmoothGrad on the CUB-200-2011 dataset over the 85 models of the committee. Pearson correlation coefficients are 0.885 (p value 3e−29) for LIME and 0.906 (p value 8e−33) for SmoothGrad. For conciseness, networks in the same family are represented by the same symbol. Best viewed in color and with zoom-in (Color figure online)

6 Conclusion

In this paper, we study the common features shared by various deep models for image classification. Specifically, given the explanation results obtained by explanation algorithms, we propose to aggregate the explanation results from different models and obtain the cross-model consensus of explanations through voting. To understand features used by every model and the common ones, we measure the consensus scores as the similarity between the consensus and the explanation of individual models.

Our empirical studies based on comprehensive experiments using 80+ deep models on five datasets/tasks find that (i) the consensus aligns with the ground truth semantic segmentation of the visual objects for classification; (ii) models with higher consensus scores enjoy better testing accuracy; and (iii) the consensus scores correlate with the interpretability scores. In addition to the main claims, we also include additional experiments to demonstrate the robustness and consistency of the proposed cross-model consensus of explanations with respect to the choice of explanation algorithm, the formed committees, the committee size, and random model selections on various datasets. All these studies confirm the applicability of the consensus as a proxy to study and analyze the common features shared by different models. Furthermore, several open issues and strengths have been discussed, with future directions introduced. Hereby, we are encouraged to adopt the consensus and consensus scores for better understanding the behaviors of deep models.