1 Introduction

Computer vision researchers have long tried to emulate the biology of primate vision. Visual recognition methods based on features such as the widely adopted Scale Invariant Feature Transform (SIFT) [28] are inspired by computations that take place in the early visual cortex, while early convolutional neural networks such as the HMAX of [30] mimic the simple and complex cell hierarchy first described in the seminal work of Hubel and Wiesel [17]. Convolutional neural networks (CNNs), which add further steps that likely occur in human vision, such as nonlinearities, and are trained on millions of images, are currently employed by neuroscientists to produce plausible computational models not only of lower but also of higher visual cortical areas [39].

In order to understand the functioning of CNNs, much effort is being made to dissect them, mainly by visualizing or labelling the features learned at the hidden layers. This has been achieved by looking for input patterns that maximize the activation of hidden units [12, 42], or by trying to identify salient image features through back-propagation [25, 32]. As a result, it has emerged that salient regions extracted from the top layers of CNNs tend to have semantic meaning, i.e. they correspond to objects or subjects relevant to humans [13, 42, 44]; similarly, from a neuroscience perspective, top layers of hierarchical neural networks are highly predictive of neural responses in the higher visual cortex [40]. Recently, some authors have been looking at ways to improve the attention maps generated by CNNs: by direct guidance on the attention maps generated by a weakly supervised deep neural network [23], by attribute-based textual explanation [38] and, inspired by the human visual system, by task-specific top-down signals combined with visual stimuli [10].

Increasing knowledge of the primate visual cortex has, on the other hand, led to several saliency models that try to predict where humans look in a scene [4, 15, 18] and, more recently, the application of CNNs to the definition of saliency models has improved prediction performance [22]. It seems that the gap between computational visual recognition methods and the primate visual system is narrowing. In order to determine the extent of their similarity, we think some of the fundamental questions we need to answer are whether deep convolutional neural networks actually “look” where humans look in an image and whether they are getting any closer than earlier biologically inspired models to where humans look. In this paper, we address these questions by proposing a methodology to establish the similarity between human fixations and the features used (or learned) by biologically inspired computational visual recognition methods. These questions have been partially addressed in [8], where human fixations were compared to some relevant points of two CNNs, and earlier in [11], where SIFT, SURF [1] and the Harris Corner Detector (HCD) [33] were compared to human fixations. Our proposed method draws inspiration from both works and improves on them in several ways: with a reliable definition of the regions of interest starting from human fixations or interest points, and with the definition of two comparison protocols, one global and one local, used to establish region similarity. Furthermore, the approach is applied to a large variety of handcrafted features and CNNs. To evaluate feature similarity, we run several experiments on the MIT eye tracking dataset (ETD) [20], comparing human fixations to a variety of CNN and handcrafted interest points, and also comparing intraclass and interclass points among handcrafted interest points and CNNs. A thorough statistical analysis shows that there is a positive correlation between human fixations and handcrafted features, overturning the results in [11]. The results show that the highest correlations occur between intraclass features and that human attention regions tend to be contained in the regions of interest defined by most features.

2 Related work

Prior to the introduction of CNNs, the authors of [11] investigated the correlation between human fixations and interest points extracted with SIFT, SURF and HCD, concluding that the similarity is not much different from that obtained with randomly generated interest points, with the exception of SURF points, although no tests were conducted to determine whether the differences in similarity were statistically significant.

After the development of high-performance CNNs, much effort has been dedicated to visualizing their behaviour at the individual unit level. Erhan et al. [12] were the first to look for input image patterns that maximize the activation of hidden units. Zeiler et al. [42] introduced a visualization technique that reveals which input stimuli excite individual feature maps: the activity of feature maps from hidden layers is mapped back to the original input image by reversing the process from the considered layer to the input (deconvnet). In the proposed method, we also reverse the process to extract our features, but since we are interested in determining the point coordinates on the input image rather than in passing the feature maps through a deconvnet layer to visualize the features, we simply back-propagate the coordinates of the maximum-response points to the original image. Yosinski et al. [41] released an interactive software tool to visualize the activations produced in each layer of a DNN when an image is processed. Both studies, by Zeiler et al. [42] and Yosinski et al. [41], reveal the hierarchical nature of the features: shallow layers respond to corners, edges or colours, or combinations of them, intermediate layers tend to be class specific, so that, for instance, they respond to animal faces or legs, while the deepest layers respond to entire objects, e.g. a dog. A further step towards interpreting DNN representations was made in [43], where the alignment between individual units and visual semantic concepts is evaluated. Again, it is confirmed that the deeper the layer, the higher the capacity to represent concepts of high semantic complexity, such as entire objects or scene parts.

Attention modules in vision have also been applied to enhance the interpretability of neural networks [37]. For example, Chen et al. [10] utilised a human saliency dataset to boost their network’s performance, while the recent Transformer architecture [36] relies on attention to achieve state-of-the-art results. Besides providing good performance, attention allows the network to dynamically assign relative importance to the features, allowing greater transparency. However, none of these works involving attention conducts qualitative or quantitative evaluations against human fixations, which is the main focus of this work.

Mopuri et al. [26] proposed a method for conducting evidence tracing from the prediction layer to the image in order to identify discriminative pixel locations. While their method also produces a set of fixation points similar to our work, they do not evaluate the similarity between the obtained CNN points and human fixation points. Instead, their work focuses mainly on weakly supervised object localization and caption grounding.

Several other network dissection works focus on the quality and accuracy of visualizations rather than on similarity with human saliency. In [29], although there is an evaluation against human attention for the task of Visual Question Answering (VQA), the emphasis is again on visualization quality and, importantly, there is no comparison of different CNN architectures on the basis of similarity with human saliency.

3 Interest points extraction

In this section, we describe the interest points we use to define visual attention regions. While human fixations can be acquired by eye tracking devices and handcrafted feature points are determined by the feature extraction algorithms, the process of extracting interest points from CNNs needs to be defined.

3.1 Human fixations

Human fixations are defined as the image points where eye gaze is stable, or where its speed is below a set threshold (see [27]). The human fixations we use in this paper were collected at the Massachusetts Institute of Technology as part of a project focused on visual attention [20]. The dataset, called the MIT eye tracking dataset (ETD), contains the fixations of 15 users asked to free-view 1003 images randomly selected from Flickr. Images were shown in succession to each viewer; each image was displayed on screen for 3 s, with a grey screen interval of one second between consecutive images. Among the available human fixation datasets, the MIT-ETD is the best suited to our purposes because of the high number of images and the fact that they are randomly picked and have different resolution, orientation and content (779 are landscape and 228 are portrait). For each image I in the dataset, we consider the set of cumulative fixations of all 15 users, and we call it HFix(I).
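For concreteness, the following minimal Python sketch illustrates the speed-threshold idea behind fixation detection. The ETD fixations were produced by the dataset authors’ own processing, so the function name, the sampling format and the threshold value here are illustrative assumptions, not the pipeline used in [20].

```python
import numpy as np

def fixations_from_gaze(gaze_xy, timestamps, max_speed_px_per_s=50.0):
    """Group raw gaze samples into fixation points by a speed threshold.

    gaze_xy:    (N, 2) array of gaze coordinates in pixels.
    timestamps: (N,) array of sample times in seconds.
    The threshold value is illustrative only.
    """
    d_xy = np.diff(gaze_xy, axis=0)
    d_t = np.diff(timestamps)
    speed = np.linalg.norm(d_xy, axis=1) / np.maximum(d_t, 1e-6)
    slow = speed < max_speed_px_per_s        # samples where the gaze is (almost) still

    fixations, start = [], None
    for i, is_slow in enumerate(slow):
        if is_slow and start is None:
            start = i
        elif not is_slow and start is not None:
            # a fixation is summarised by the centroid of the slow run
            fixations.append(gaze_xy[start:i + 1].mean(axis=0))
            start = None
    if start is not None:
        fixations.append(gaze_xy[start:].mean(axis=0))
    return np.asarray(fixations)
```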

3.2 Handcrafted features

SIFT, SURF and the Harris corner detector (HCD) have been extensively used to perform a variety of tasks, from face recognition [5] to object recognition [7] and object tracking [45]. SIFT and SURF both derive from the Hessian of the scale space representation of the image but, as shown in [6], the SIFT descriptor tends to locate points around edges while the SURF descriptor locates them around corners; since they do not look at exactly the same image points, it is interesting to analyse both of them to evaluate the extent of their correlation. Interest points for the three descriptors are extracted from all images in the dataset according to the original implementations in [1, 28, 33], yielding, for a given image I, the interest point sets SIFT(I), SURF(I) and HCD(I). Although the cardinality of these sets is often higher than that of human fixations, there is no a priori criterion for selecting highly significant interest points among SIFT, SURF and HCD, so all the extracted points have been considered.
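As an illustration only, the sketch below shows how comparable interest point sets could be obtained with OpenCV; the experiments in this paper use the original implementations in [1, 28, 33], so the OpenCV calls, the image path and the thresholds are assumptions (SURF in particular requires an opencv-contrib build with the non-free modules enabled).

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical path

# SIFT keypoints (OpenCV >= 4.4 ships SIFT in the main module)
sift_pts = np.array([kp.pt for kp in cv2.SIFT_create().detect(img, None)])

# SURF needs an opencv-contrib build compiled with the non-free algorithms
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
surf_pts = np.array([kp.pt for kp in surf.detect(img, None)])

# Harris corner detector: threshold the corner response map
response = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)
ys, xs = np.where(response > 0.01 * response.max())   # illustrative threshold
hcd_pts = np.stack([xs, ys], axis=1)
```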

3.3 CNN’s features

We consider 7 pretrained deep convolutional neural networks from 6 different families: AlexNet [21], VGG-19 [31] and VGG-F [9], InceptionV3 [34], ResNetV2-50 [14], DenseNet-201 [16], and EfficientNet-b7 [35]. All networks were pretrained on more than a million images from the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012 classification dataset with 1,000 object categories. To locate the image points that are significant to a deep neural network, we look for the points that have maximum filter response and map them back onto the original image. We do not resize or crop the images we feed into the networks. This allows us to maintain sufficient accuracy in the point positions, which is critical since accuracy is also affected by the reduction of the images as they are processed by the networks, something that intrinsically causes a degree of localization uncertainty when the point coordinates are mapped back to the original image. Since we are interested in selecting the points at the convolution stages, before the fully connected layers, feeding images of different sizes is not an issue, nor is it an issue in terms of point significance, since the networks are trained on images of objects captured at different scales and from different viewpoints.

Fig. 1 Scheme of feature extraction from VGG-19

As an example, Fig. 1 depicts the general VGG-19 architecture with the feature maps from which the interest points are selected. As the figure shows, we extract interest points from the feature maps obtained after each of the five blocks of convolutions that precede the max pooling steps; in particular, in Fig. 1 the first set of feature maps is obtained at the level relu\(1\_{2}\), after the first block of convolutions. There are 64 such maps, resulting from the convolution conv\(1\_{2}\) with 64 filters. From each of these maps, we extract the point that has the maximum filter response, obtaining 64 maximum-response points. The point coordinates are mapped back through the inverse scaling function, leading to 64 interest points on the original image.

The points of highest response to each filter will certainly be retained by the subsequent max pooling step and will affect the result of the following convolution steps; in other words, they are highly significant for the network. While it would be perfectly reasonable to extract more than one point from each feature map, we choose to select the global maximum since the eye fixations we compare them with were collected from viewers who were shown each image for 3 s, a short time in which the human eye can scan only a subset of the areas it might find attractive.

We carry out this process at the stages relu\(1\_{2}\), relu\(2\_{2}\), relu\(3\_{4}\), relu\(4\_{4}\) and relu\(5\_{4}\) in Fig. 1, obtaining 5 sets of 64, 128, 256, 512 and 512 points, respectively, which will be denoted by VGG-19\(_{Ci}\), for \(i=1, \ldots , 5\). The set that contains all the sets of interest points VGG-19\(_{Ci}\), for \(i=1, \ldots , 5\), will be denoted by VGG-19. The same interest point extraction is carried out for the other networks; in particular, the points are extracted at the end of each block of convolutional layers.
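The following is a minimal sketch of this extraction step, assuming torchvision’s pretrained VGG-19 (torchvision ≥ 0.13) rather than the implementation actually used for the experiments: it collects the ReLU output right before each max pooling layer, takes the per-channel maximum-response location, and maps it back to input coordinates through the inverse of the cumulative down-sampling factor (the exact inverse scaling function is not prescribed by the text, so the centre-of-cell mapping below is an assumption).

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF
from PIL import Image

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

img = Image.open("image.jpg").convert("RGB")       # hypothetical path; no resizing or cropping
x = TF.normalize(TF.to_tensor(img),
                 mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).unsqueeze(0)

interest_points = []                               # one list of (x, y) points per extraction stage
with torch.no_grad():
    feat = x
    for layer in vgg.features:
        if isinstance(layer, torch.nn.MaxPool2d):
            # 'feat' is the ReLU output right before this pooling step
            _, c, h, w = feat.shape
            flat_idx = feat[0].reshape(c, -1).argmax(dim=1)    # per-channel maximum response
            rows, cols = flat_idx // w, flat_idx % w
            scale_y, scale_x = img.height / h, img.width / w   # inverse of the down-sampling
            pts = [((col.item() + 0.5) * scale_x, (row.item() + 0.5) * scale_y)
                   for row, col in zip(rows, cols)]            # map cell centres back to the image
            interest_points.append(pts)
        feat = layer(feat)
```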

The feature maps of the first layers are quite close to the original input image, and interest points extracted from them can be mapped back to the original image through the inverse of the image scaling function that results from the convolution, ReLU and max pooling steps. In fact, in [25], by inverting the network representation obtained in the first few layers, an image is obtained that is slightly fuzzier but otherwise a visually faithful reproduction of the original.

Fig. 2 Interest points. CNN interest points are from some of the deepest layers: \(\hbox {Alex}_{C5}\), VGG-19\(_{C5}\), VGG-F\(_{C3}\), DenseNet\(_{C3}\), EfficientNet\(_{b6}\), Inception\(_{C6}\), Resnet\(_{C1}\)

Figure 2 contains an image from the ETD with the interest points just defined. Notice that human fixations concentrate on the face, the trophy, the microphone and the hand of the woman; the same areas are mostly targeted by the seven CNNs. SIFT and SURF detect the same areas, but they also capture the pattern of the background, while, as expected, the HCD detects corners and edges. The CNN features shown in the image correspond to the last but one layer and, as expected, they are mostly located on areas of semantic meaning, as human fixations are.

4 Interest regions modelling

In this section, we define the region of interest corresponding to a set of interest points. We want to determine which areas of the image are highly likely to contain human fixations, CNN features and handcrafted features. We use the same methodology adopted in [8], based on a nonparametric estimate of the density of a given set of interest points, which offers more flexibility than parametric methods in modelling our point distributions. In [11], a nonparametric estimate is also adopted, specifically a kernel density estimation (KDE) method with a radial basis function kernel. However, this kind of kernel lacks local adaptivity, which in our case can lead to spurious bumps; moreover, as discussed in [2], classical bandwidth estimators rely on the “plug-in” method [19], which requires that the data are approximately normal, an assumption that again is not satisfied by our data. To overcome these problems and prevent inaccurate comparisons, we model the density of the sets of points with the KDE based on a linear diffusion model developed in [2], which also has the advantage of using a bandwidth selector that does not assume data normality. In Fig. 3, we can see some human fixation points (red dots) over the central part of an image (top figure) with the level sets of the density surface generated via diffusion. On the bottom row, the KDE via diffusion (on the left) and that via a Gaussian kernel (on the right) are shown for the whole image. As can be seen, the KDE via Gaussian kernel leads to a bumpy surface in the central area, where there is a high density of points, but these bumps are clearly spurious, since they do not correspond to any significant clusters of points (see the red points in the central yellow area of the top image). This high-density area is much better modelled by a single peak, which characterises the surface generated by the KDE via diffusion, shown in the top image as level curves and as the highest peak in the bottom left image.

Fig. 3 KDE estimation. On top, the points with the diffusion method surface represented by level curves (warmer colours indicate high density). On the bottom, on the left the KDE via diffusion and on the right the KDE via radial basis functions (Gaussian kernel) (color figure online)

Given an image I and a set of features or fixations \(F_I\), the density surface associated with \(F_I\) and evaluated at an image point \(\mathbf{x} = (x,y)\) will be called \(f_{F_I}(\mathbf{x})\); so, for instance, the density of a set of human fixations for image I will be denoted by \(f_{\mathrm{HFix}(I)}(\mathbf{x})\). In Fig. 4, we show an image from the MIT fixations dataset, with the human fixations on the top left and the AlexNet interest points on the top right. On the bottom, we show the estimated densities for the human fixations (left) and for the AlexNet points (right). As can be seen, the automated bandwidth selection is able to accurately model both densities: one arising from spread-out points and one from clustered points.

Fig. 4 KDE estimation. On top, the human fixation points (left) and the AlexNet interest points (right). On the bottom, the respective densities estimated via diffusion

5 Distributions comparison

To compare the obtained point densities, we need to take several aspects into account. The first is the difference between the cardinalities of the feature sets we want to compare, which can vary from an average of 66 in the case of human fixations to an average of 1696 in the case of HCD. The second is the variability of image content, which, for the sake of generality, is not restricted to any particular class. The third is that the available human fixations result from exposing each image to the users for 3 s, so users’ attention tends to concentrate, when present, on the areas that are most significant to humans, such as faces, bodies and text. Other potentially interesting image areas are not reached by the users because of the time limit, whereas they can be explored by feature extractors or CNNs. On the basis of these premises, to compare point sets we propose three (global) indexes of the difference between any two density distributions and the local index defined in [8], which can reveal whether the image areas targeted by one set of points are a subset of the areas targeted by the other. The first global index we use is the Bray–Curtis similarity [3], which is widely used in ecology to quantify the similarity between two sample populations and is well suited to assess the global similarity of two point distributions. The same index is also used in [11], which allows us to compare the results we obtain on the similarity of human fixations and handcrafted features with the ones in [11]. The second global index we use is the Jensen–Shannon divergence [24], and the third is the well-known Spearman rank correlation coefficient \(\rho\).

Given an image I and two densities \(f_1\) and \(f_{2}\), the Bray–Curtis similarity index is defined as \(BC_{1,2}=1-\frac{\sum _{i=1}^{n}|f_1(\mathbf{x}_{i})-f_{2}(\mathbf{x}_{i})|}{\sum _{i=1}^{n}\left( f_1(\mathbf{x}_{i})+f_{2}(\mathbf{x}_{i})\right) },\) where the sums run over all \(n\) image pixels \(\mathbf{x}_{i}\). The Jensen–Shannon divergence is defined as

$$\begin{aligned} JSD_{1,2} =\sum _{\mathbf{x}_{i}}(f_1(\mathbf{x}_{i})-f_{2}(\mathbf{x}_{i}))\log \frac{f_1(\mathbf{x}_{i})}{f_{2}(\mathbf{x}_{i})} \end{aligned}$$

To be able to directly compare the two indexes, we turn the Jensen–Shannon divergence into a similarity measure by defining \(JS_{1,2}=1-JSD_{1,2}\).
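A minimal sketch of the three global scores, computed from two normalised density surfaces defined on the same pixel grid, is given below; the small epsilon guarding the logarithm and the function name are implementation assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def global_similarities(f1, f2, eps=1e-12):
    """Global similarity scores between two normalised density surfaces
    defined on the same image grid (sketch; eps guards the logarithm)."""
    f1, f2 = f1.ravel() + eps, f2.ravel() + eps

    bc = 1.0 - np.abs(f1 - f2).sum() / (f1 + f2).sum()   # Bray-Curtis similarity
    jsd = np.sum((f1 - f2) * np.log(f1 / f2))            # divergence as defined in the text
    js = 1.0 - jsd                                       # turned into a similarity
    rho = spearmanr(f1, f2).correlation                  # Spearman rank correlation

    return bc, js, rho
```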

While the global indexes quantify how similar the two densities are globally, they cannot determine whether one of the two sets of points is contained in the density distribution of the other set. Establishing this local similarity is particularly meaningful in the cases in which the points concentrate on small regions of the image, which is what happens for the human fixations of the ETD dataset, where each image was shown to each person for only 3 s.

To compare the densities at a local level, we adopt the measure defined in [8] in terms of a “two-way” binary classification.

Let \(F_1\), \(F_{2}\) be two sets of interest points for an image I and \(f_{F_1}\), \(f_{F_{2}}\) their respective densities. Let us assume that all points in \(F_1\) are \(F_{2}\) points (positives) and that all remaining image pixels are not (negatives). Given \(\mathbf{x}\in F_1\), we can ask what the probability is that \(\mathbf{x}\) is a point of type \(F_{2}\) by evaluating \(f_{F_{2}}(\mathbf{x})\). By setting a threshold \(\tau\), for all \(i=1,\ldots , |F_1|\) we can classify an interest point \(\mathbf{x}_{i}\in F_1\) as an \(F_{2}\) interest point (a true positive) if \(f_{F_{2}}(\mathbf{x}_{i})\ge \tau\). All points in \(F_1\) that do not satisfy the condition will be false negatives. All pixels \(\mathbf{w}\in I\setminus F_1\) that are not \(F_1\) points and for which \(f_{F_{2}}(\mathbf{w})\ge \tau\) will be false positives, while those for which \(f_{F_{2}}(\mathbf{w})<\tau\) will be true negatives. We can determine the true and false positive rates for a given threshold \(\tau\) and, by varying \(\tau\), build a ROC curve. The area under the ROC curve, \(AUC(F_1,f_{F_{2}})\), is the probability that the classifier will rank a (randomly selected) \(F_1\) point higher than a (randomly selected) \(I\setminus F_1\) pixel, so the ability of the classifier to correctly classify the \(F_1\) points tells us to what extent the set \(F_1\) is contained in the density \(f_{F_{2}}\), or equivalently, to what extent the interest points \(F_1\) are a “subset” of \(F_{2}\). By switching the two sets of points, we can evaluate the probability \(f_{F_1}(\mathbf{x})\) that a point \(\mathbf{x}\in F_{2}\) has of being a point of type \(F_1\), and so we can evaluate whether the points in \(F_{2}\) are contained in \(f_{F_1}\) with the index \(AUC(F_{2},f_{F_1})\). The two indexes describe how the two densities intersect. Notice that \(AUC(F_1,f_{F_{2}}) = AUC(F_{2},f_{F_1}) = 1\) can never be achieved, since even in high-density \(f_{F_1}\) areas there will always be some pixels that are not \(F_{2}\) points. Nevertheless, high values of \(AUC(F_1,f_{F_{2}})\) mean that a high number of \(F_1\) points are contained in \(f_{F_{2}}\); to see if the reverse is true, we need to look at \(AUC(F_{2},f_{F_1})\): if it is smaller (bigger), it means that \(F_1\setminus F_{2}\) covers a smaller (bigger) area than \(F_{2}\setminus F_1\); if the two indexes have similar values, the areas covered by \(F_1\setminus F_{2}\) and \(F_{2}\setminus F_1\) are similar. The meaning of the two indexes is schematised in terms of intersections of density areas in Fig. 5.
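A sketch of one direction of this two-way index, using scikit-learn’s ROC machinery and assuming the density surface has already been estimated (e.g. with the KDE sketch of Sect. 4): pixels hosting \(F_1\) points are labelled positive, all remaining pixels negative, and \(f_{F_{2}}\) provides the classification score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def containment_auc(points_xy, f_other):
    """AUC(F1, f_{F2}): probability that a random F1 point receives a higher
    f_{F2} value than a random non-F1 pixel (sketch of the index described above)."""
    h, w = f_other.shape
    labels = np.zeros((h, w), dtype=int)
    xs = np.clip(np.round(np.asarray(points_xy)[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(np.asarray(points_xy)[:, 1]).astype(int), 0, h - 1)
    labels[ys, xs] = 1                         # F1 points are the positive class
    return roc_auc_score(labels.ravel(), f_other.ravel())

# Two-way comparison, with density_surface as sketched in Sect. 4:
# auc_12 = containment_auc(F1_points, density_surface(F2_points, w, h))
# auc_21 = containment_auc(F2_points, density_surface(F1_points, w, h))
```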

Fig. 5 Similarity indexes relationship in terms of areas of interest intersection

6 Experiments and results

We conducted a series of experiments to compare human fixations with CNN and handcrafted features and, more generally, to evaluate the intraclass and interclass similarity of the handcrafted and CNN families. We use the 1003 images of the ETD, the eye tracking database at MIT [20]. As outlined in Sect. 3.1, for each image \(I\in\) ETD, the database provides a set of point coordinates that results from the union of the fixations of 15 users, which, for image I and following the notation of Sect. 3.1, we call HFix(I). As illustrated in Sect. 3, for each image \(I\in\) ETD, we extract the features from each CNN: for instance, VGG-19\(_{Ci}\) refers to the interest points of VGG-19 extracted at layer Ci. We then extract the interest points SIFT(I), SURF(I) and HCD(I). For each of these sets of interest points, we estimate a probability density function as described in Sect. 4, which we use to establish point similarity.

6.1 Correlation between human fixations and interest points

We first compare human fixations to interest points using the global similarity indexes. Let us denote by \(F_j(I)\) any of the feature sets F over image I, and by \(f_{F_j(I)}\) its density. For each image \(I\in\) ETD, we compare \(f_{F_j(I)}\) with \(f_{HFix(I)}\) using the global indexes defined in Sect. 5. This results in three similarity scores: the Bray–Curtis \(BC_{f_{F_j},f_{HFix}}(I)\) and Jensen–Shannon \(JS_{f_{F_j},f_{HFix}}(I)\) similarities and the Spearman rank correlation coefficient \(\rho\) between the sets \(F_j(I)\) and HFix(I) over image I. Averaging the similarity scores as I ranges over the dataset ETD, we obtain the similarity between the set of features F and the human fixations HFix. The average scores are reported in Table 1.

Table 1 Similarity scores between interest points and human fixations

As a baseline experiment, the human fixation densities of each of the 1003 images in the ETD database are compared to 100 random densities and the obtained similarity indexes are averaged. The average random scores for the similarity indexes BC and JS are reported in the first line of Table 1. To see whether the differences between the average similarity scores and the random scores are statistically significant, a two-tailed Wilcoxon rank-sum test was run for each comparison in the table rows. The resulting p values were well under the set threshold of 0.05 for all experiments except the ones reported in bold in the table.
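A sketch of this significance test, assuming the per-image similarity scores of a feature set and of the random baseline have already been collected in two lists (the variable names are illustrative); SciPy’s ranksums implements the two-sided Wilcoxon rank-sum test.

```python
from scipy.stats import ranksums

# bc_scores_feature: per-image BC scores of a feature set vs. human fixations
# bc_scores_random:  per-image BC scores of the random baseline vs. human fixations
stat, p_value = ranksums(bc_scores_feature, bc_scores_random)
significant = p_value < 0.05   # threshold used in the paper
```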

The three similarity indexes are coherent most of the time, with the Spearman rank correlation coefficients lying between the Bray–Curtis and the Jensen–Shannon indexes. We can see, in particular, that among handcrafted features, SURF has the highest similarity with human fixations. Among CNNs, the shallowest layers correlate poorly with human fixations, while the middle to deep layers show the highest correlation. The deep layers of EfficientNet have the highest correlation of all interest points, followed by Resnet, VGG-19, DenseNet and AlexNet. Inception has a somewhat peculiar behaviour, with low correlation at the shallow layers which further decreases as the layers deepen (and even becomes negative). This might be due to the different architecture of the network, which is based on convolving the image with filters of multiple sizes at the same layer. The extracted features thus follow a different pattern from that of the other networks, which progress from low-level features to high-level ones.

It is interesting to compare the results we obtained for the handcrafted points with the ones in [11], where no significant differences between feature/fixation similarities and random point/fixation similarities were found. In particular, limiting the comparison to the BC index (the only similarity measure used in that work), [11] reports higher similarity scores between the distributions of random points and human fixations, coupled with lower scores between the distributions of SIFT, SURF and HCD and the distributions of human fixations. This might be due to the different kernel density estimation techniques, since an accurate bandwidth selection is essential for robust density estimation, and possibly to a different, undocumented methodology for estimating the random densities. Contrary to the conclusions in [11], we can affirm that the global similarities between human fixations and SIFT, SURF and HCD, although not high, are statistically significant.

6.2 Local similarity assessment

To further explore the correlation between interest points and human fixations, we compare them at a local level using the local similarity AUC indexes defined in Sect. 5. Given an image I from the ETD dataset, the sets \(F_{i}(I)\) and HFix(I) of interest points and human fixations and the two densities \(f_{F_{i}(I)}\) and \(f_{HFix(I)}\), we determine the ROC curve deriving from the classification of the \(F_{i}(I)\) points with the human fixation density \(f_{HFix(I)}\) and calculate the indicator \(AUC_{F_{i},f_{HFix}}(I)\), and, conversely, the ROC curve and the index \(AUC_{HFix,f_{F_{i}}}(I)\) deriving from the classification of the human fixations HFix(I) with the density estimated from the interest points \(F_{i}(I)\). We repeat the procedure for each image \(I\in\) ETD and average the indexes \(AUC_{F_{i},f_{HFix}}(I)\) and \(AUC_{HFix,f_{F_{i}}}(I)\) to get \(AUC_{F_{i},f_{HFix}}\) and \(AUC_{HFix,f_{F_{i}}}\), which reveal to what extent the areas covered by the interest points of type \(F_{i}\) are contained in the areas covered by human fixations and vice versa.

The results can be seen in Table 2. For all feature sets, both AUC indicators are well above the 50% value corresponding to a random classifier, showing that there is a significant intersection with human fixations. Furthermore, in most cases, the first indicator (second column of the table) is greater than the second (third column of the table). This means that, on average, the image areas targeted by humans tend to be contained in those occupied by the other features. Only some shallow layers of some of the CNNs (AlexNet, VGG-19, DenseNet) have features that are contained in the attention areas of humans. The highest value of 79.25% is given by EfficientNet\(_{b6}\), while among handcrafted features, SURF has the highest score of 73.28%. The higher values of the first indicator could be explained by the way the human fixations were collected. Indeed, the time limit of 3 s for each image exposure means that fixations concentrate on the image areas that most capture human attention; other possible areas of interest targeted by local descriptors or CNNs might not be observed by humans simply because of lack of time.

Table 2 AUC indexes of the comparison between interest points and human fixations

Figures 6 and 7 show the images of the ETD that produce, respectively, the highest and lowest AUC scores for SIFT points (the red crosses in the first image of both figures) and human fixations (the blue crosses). For the image in Fig. 6, we have the values \(AUC_{SIFT,f_{HFix}}=85.21\%\) and \(AUC_{HFix,f_{SIFT}}=97.19\%\), indicating that the human fixations are almost completely contained in the density of the SIFT points, as confirmed by the image on the bottom right. On the other hand, the smaller value of \(AUC_{SIFT,f_{HFix}}\) indicates that there are areas that contain SIFT points that are not looked at by humans, as the image on the bottom left shows. Notice how human fixations tend to concentrate on the writing. In this case, the value of the global index was \(60.65\%\), which hints that the two sets are correlated but does not shed light on how they intersect.

Fig. 6 SIFT-human fixations local correlation maximum. Top image: red crosses are SIFT points, blue crosses are human fixations. Bottom images: (left) planar projection of the human density and SIFT points (red crosses), (right) SIFT density and human fixations (red crosses) (color figure online)

For the image in Fig. 7, the values of the indexes are \(AUC_{SIFT,f_{HFix}}=37.01\%\) and \(AUC_{HFix,f_{SIFT}}=43.37\%\), both low values indicating that the two sets of points hardly intersect, as shown in the two images at the bottom of the figure. In this case, the global BC similarity has a value of \(23.45\%\), below the random value of about \(26\%\).

Fig. 7 SIFT-human fixations local correlation minimum. Top image: red crosses are SIFT points, blue crosses are human fixations. Bottom images: (left) planar projection of the human density and SIFT points (red crosses), (right) SIFT density and human fixations (red crosses) (color figure online)

In Figs. 8 and 9, we show the images for which the local correlation between AlexNet points and human fixations is, respectively, maximum and minimum. The local correlation index values for the image in Fig. 8 are \(AUC_{AlexNet_{C3},f_{HFix}}=88.05\%\) and \(AUC_{HFix,f_{AlexNet_{C3}}}=56.81\%\), while the global BC similarity index is \(38.98\%\). As can be seen from the density images, the proportion of \(\hbox {AlexNet}_{C3}\) points contained in the human fixation density is higher than the proportion of human fixations contained in the \(\hbox {AlexNet}_{C3}\) density, which is correctly reflected by the AUC scores.

Fig. 8 \(\hbox {AlexNet}_{C3}\)-human fixations local correlation maximum. Left image: original. Middle: \(\hbox {AlexNet}_{C3}\) points (red), human fixations (blue). Right images: (top) planar projection of \(\hbox {AlexNet}_{C3}\) density and human fixations (red crosses), (bottom) human density and \(\hbox {AlexNet}_{C3}\) points (red crosses) (color figure online)

The local correlation indexes for the image in Fig. 9 are \(AUC_{AlexNet_{C3},f_{HFix}}=36.38\%\) and \(AUC_{HFix,f_{AlexNet_{C3}}}=34.29\%\), the minimum local correlation between AlexNet\(_{C3}\) points and human fixations in the ETD dataset. The BC similarity index has a very low value of \(9.08\%\). The two sets of points hardly intersect, in accordance with the low AUC values.

Fig. 9 \(\hbox {AlexNet}_{C3}\)-human fixations local correlation minimum. Left image: \(\hbox {AlexNet}_{C3}\) points (red) and human fixations (blue). Right images: (top) planar projection of \(\hbox {AlexNet}_{C3}\) density and human fixations (red crosses), (bottom) human density and \(\hbox {AlexNet}_{C3}\) points (red crosses) (color figure online)

6.3 Similarity assessment with CNN fixations extracted with other methods

The correlation between humans and CNNs might depend on the way CNN interest regions are selected. While our main intent is to analyse the correlation between human fixations and CNNs at all layers, by unveiling what the filters learn stage by stage, it is also interesting to see how human fixations compare with CNN interest regions extracted with other methods that also take the fully connected layers into account (see the last block in Fig. 1). To do this, we extracted the VGG-16 fixations from all images in the ETD dataset using the technique described in [26], where discriminative pixel locations that guide the network prediction are obtained by considering feature dependencies between pairs of consecutive layers. The resulting pixel locations, or interest image points, are fed into our KDE module to produce the regions of interest that are compared to the regions of interest generated by human fixations. Tables 3 and 4 report the similarity scores (global and local, respectively) between human fixations and VGG-16 interest points extracted with our method from each layer before the max pooling step (lines 2–6), and the similarity scores between human fixations and VGG-16 fixations extracted according to [26] (line 7).

Table 3 Similarity scores between human fixations and VGG-16 interest points extracted with our method from layers \(C1, \ldots , C5\) and with CNN fixations as in [26]
Table 4 AUC indexes of the comparison between human fixations and VGG-16 interest points extracted with our method from layers \(C1, \ldots , C5\) and with CNN fixations as in [26]

The correlation between the VGG-16 fixations of [26] and human fixations is similar to the correlation between the VGG-16 interest points of the last two layers and human fixations. This was somewhat predictable, since the features extracted from the last layers tend to have semantic meaning. The slightly higher correlation human fixations have with the VGG-16 fixations of [26] is probably due to outlier removal.

6.4 Similarity across features types

Having established the similarity between features and human fixations, we investigate the intraclass and interclass feature correlation. This question has hardly been explored in the past, even for handcrafted features. In [5, 6], there is evidence that SIFT and SURF points only partially overlap, but the experiments are limited to images of human faces. To have an idea of how the various interest points correlate, we select a representative of each CNN family, namely the set of features extracted from the deepest layer. For each pair of feature sets, we calculate the similarity indexes across all images of the ETD database. In Table 5, the three global similarity scores are reported for all pairs of feature sets compared, arranged in decreasing order from top to bottom. Similarity scores of human fixations with the last layer of the CNNs and with the handcrafted features are reported again for ease of comparison.

Table 5 Similarity scores between all pairs of interest points, in decreasing order

On a global level, interest points appear to be more correlated among themselves than with human fixations. The greatest correlations occur between DenseNet and Resnet (\(BC = 93.91\%\), \(JS = 99.47\%\), \(\rho = 75.06\%\)), Inception and Resnet (\(BC = 85.45\%\), \(JS = 97.33\%\), \(\rho = 41.97\%\)), DenseNet and Inception (\(BC = 85.51\%\), \(JS = 97.38\%\), \(\rho = 40.86\%\)), VGG-F and AlexNet (\(BC = 77.28\%\), \(JS = 92.47\%\), \(\rho = 78.70\%\)), and SIFT and SURF (\(BC = 77.16\%\), \(JS = 92.72\%\), \(\rho = 77.66\%\)). We can generally see that interest points from the CNN family correlate highly with each other, as do sets from the handcrafted points family, while inter-family correlations (humans included) are weaker.

Global scores can be better interpreted together with local scores. To this end, the AUC values for all pairs of feature types of Table 5 are calculated, and the results are shown in the plot of Fig. 11, where, for two feature sets \((F_{i}, F_j)\), the two local indexes are displayed as a point of coordinates (\(AUC_{F_{i},f_{F_j}}\), \(AUC_{F_j,f_{F_{i}}}\)). As the plot shows, the human fixation scores (see Table 2) are all above the bisector, indicating that a notably higher fraction of human fixations is contained in all other features, as discussed in Sect. 6.2. This also holds for Harris corner points, which tend to be contained in all other sets except SIFT and SURF. Looking at CNNs, it is surprising to discover that the pairs with high global similarity, such as DenseNet and Resnet (58.19%, 58.04%), Inception and Resnet (56.14%, 56.84%), and DenseNet and Inception (56.56%, 55.85%), do not have particularly high local similarity indexes, although they all lie near the bisector. This is due to the fact that the interest points of the deepest layers of the mentioned CNNs are quite spread out (and not numerous) over the images, so they generate densities that share a similar support and shape. The global indexes, based on the shape of the densities, score high values, while the local similarity is more subtle: the low probability values of the densities cause more false negatives, and the many pixels sitting in the density support but not belonging to the interest point sets are false positives, thus leading to low local similarity scores. How the features are spread over the last layers of DenseNet-201 and Resnet-V2-50 can be seen in Fig. 10. The small size of the feature maps at these two deepest layers contributes to the “grid” effect.

Fig. 10 (Left) DenseNet-201 layer c5 interest points and (right) Resnet-V2-50 layer b4 interest points. Notice how the points are evenly spread at these two deepest layers

On the contrary, as Fig. 11 and Table 2 show, the local measure tells us that human fixations correlate highly with EfficientNet, VGG, AlexNet, SIFT and SURF, while tending to be subsets of them.

Human fixations tend to be located in areas that are targeted by local descriptors and CNNs and, at the same time, there are areas targeted by local descriptors and CNNs that humans do not seem to look at.

Fig. 11 Local similarity indexes between all interest points. For an ordered pair (\(F_{i}=\) Feature1, \(F_j=\) Feature2) of interest point sets in the legend, the corresponding point in the graph has coordinates (\(AUC_{F_{i},f_{F_j}}\), \(AUC_{F_j,f_{F_{i}}}\)). The origin of the axes corresponds to the scores of a random classifier (0.5, 0.5). Points above the bisector imply that Feature1 tends to be contained in Feature2; the converse holds for points below. Points near the bisector correspond to features that share a similar fraction of common surface area, which is small for points near the origin and increasingly larger as points move further from it
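A possible matplotlib sketch of the plotting convention just described; the data structure holding the AUC pairs is an assumption.

```python
import matplotlib.pyplot as plt

def plot_local_similarity(pairs):
    """pairs: dict mapping a (Feature1, Feature2) label to its two AUC indexes
    (AUC_{F1,f_{F2}}, AUC_{F2,f_{F1}}). Sketch of the Fig. 11 convention."""
    fig, ax = plt.subplots()
    for (name1, name2), (auc_12, auc_21) in pairs.items():
        ax.scatter(auc_12, auc_21, label=f"{name1} vs {name2}")
    ax.plot([0.5, 1.0], [0.5, 1.0], "k--", linewidth=0.8)  # bisector: equal mutual containment
    ax.set_xlim(0.5, 1.0)                                  # (0.5, 0.5) = random classifier
    ax.set_ylim(0.5, 1.0)
    ax.set_xlabel("AUC(Feature1, f_Feature2)")
    ax.set_ylabel("AUC(Feature2, f_Feature1)")
    ax.legend(fontsize="small")
    plt.show()
```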

7 Conclusion

The more we learn about the mechanisms of the primate visual system, the more these mechanisms can be embedded in computational models to improve their performance. While earlier local image descriptors were inspired by the mechanisms of the early visual cortex, today’s CNNs embed processes that take place in the higher cortical areas of the primate visual system. It is then natural to ask whether these computational models rely on the same image areas humans focus their attention on when performing recognition or classification tasks. In this paper, we measured the similarity between the attention areas of humans, of handcrafted features (the three local image descriptors SIFT, SURF and HCD) and of seven deep CNNs from the families AlexNet, VGG, Resnet, Inception, DenseNet and EfficientNet. To do so, we used three global similarity measures and a local one to compare the areas of interest of the three classes: humans, local image descriptors and CNNs. Extensive experiments were carried out on the ETD dataset to establish intraclass and interclass similarities. The obtained results indicate that human fixations positively correlate with SIFT, SURF and HCD. Slightly higher correlations can be seen with the deepest layers of some of the networks, notably EfficientNet, Resnet, DenseNet, VGG and AlexNet, while there is weak or no correlation with shallow layers. Only Inception learns features across its layers that do not follow this pattern: its correlation with humans is always weak or negative, especially at the intermediate layers. Local comparisons highlight that human attention areas tend to be contained in the areas relevant to local descriptors and to most CNN layers. This might be due to the fact that human fixations were collected by viewing images for only 3 s; further investigations on how the correlation changes when humans can view images for longer would shed light on the full extent of the correlations. Moving on to intraclass similarities, we can see that SIFT and SURF correlate highly, as do most of the networks, while interclass similarities are predictably lower, the most relevant being the ones between humans and CNNs.