1 Introduction

Psychological research has shown that human emotions change in response to different visual stimuli, such as images and videos. Building on these findings, a growing number of researchers have begun to predict people's emotional responses from visual content. This emerging research topic, called affective image computing, has received increasing attention in recent years. However, owing to the complexity and subjectivity of emotion, analyzing images at the affective level is more challenging than analyzing them at the semantic level [1]. Although there have been many studies in social media sentiment analysis, including facial expression recognition [2, 3] and speech emotion recognition [4], progress in image emotion computing has been limited.

Designing a comprehensive representation that bridges the "affective gap" between measurable image features and high-level affective semantics is challenging. Figure 1 shows samples from different datasets that evoke or express different emotions. Image emotions are related to complex visual features ranging from low-level details to high-level local or global views. For example, the same scene or object rendered in different colors may evoke different emotional experiences in the viewer, while different scenes or objects may evoke the same emotion. The task therefore requires affective-level reasoning over images that is closely tied to many visual features, and it is difficult to design a single discriminative representation that covers all affective factors. In addition, some images, such as abstract pictures and artistic paintings, are not easily characterized by precise rules [5, 6]. Therefore, extracting discriminative affective representations is vital for affective image recognition.

Fig. 1 Affective image samples that evoke positive emotion (top) and negative emotion (bottom)

Many approaches have been proposed to address this highly challenging problem. Low-level visual features, such as color, shape, and texture, were first adopted for affective image classification [7]. Zhao et al. proposed that image emotion is highly related to artistic principles; following this idea, mid-level features representing artistic principles, such as visual balance, emphasis, and harmony, have been applied to image sentiment classification [8]. Borth et al. argued that semantic content, such as beautiful flowers or red cars, significantly affects the viewer's emotional response [9,10,11]. They constructed a visual emotion ontology of 1,200 Adjective–Noun Pairs (ANPs), which opened up a new visual-concept approach to understanding image emotion. However, most previous studies rely on hand-crafted features that are manually designed based on common sense and observation. It is difficult for such methods to cover all the essential factors related to affect: image semantics, artistic principles, and low-level visual features.

Currently, CNNs are widely used in many fields and have achieved outstanding breakthroughs in tasks such as image classification [12, 13], object detection [14, 15], and semantic segmentation [16, 17]. Researchers have designed various neural network models whose learned representations make target-object information clearer along the processing hierarchy [18]. Many studies have applied deep networks to affective image computing. Experimental results show that CNNs outperform hand-crafted features in image emotion recognition [19, 20], indicating the outstanding potential of deep representations for this challenging task. However, the relationship between deep features and emotion, and why deep CNNs perform so remarkably on this task, have not been well explored. Moreover, the VGG-like stacked layered structure may weaken low-level features, such as texture, color, and shape, that are conducive to emotion recognition. This suggests that analyzing features from different layers has practical significance.

To understand how CNNs designed for image classification behave on affective image computing tasks, we examine the deep features along the emotion recognition processing hierarchy and verify our hypothesis through visualization. Our study shows that image emotion can be represented by semantic information derived from deep inference. On the other hand, we also find that the hierarchical structure loses low-level visual features. In some cases, non-graphic components such as color and texture may evoke stronger affective responses than content [6], as shown in Fig. 2. Therefore, we study the correlation between low-level visual features and evoked emotions and show that shallow features are also significant for affective recognition.

Fig. 2 Web images rely more on high-level content semantics, while abstract paintings rely more on low-level visual features and mid-level aesthetics. From left to right, the red and green boxes mark negative and positive emotions, respectively

To address the above challenges, this paper proposes a novel feature extraction method that learns multi-level deep representations, combining low-level and deep-level features for affective image recognition. Unlike traditional hierarchical CNNs designed to classify objects at the center of an image, our model obtains more discriminative hybrid representations. Figure 3 gives an overview of the proposed method. Traditional hierarchical CNNs focus on transforming shallow features into deep representations, which makes the extraction of low- and mid-level features inefficient. To learn features at different levels from the entire image in an end-to-end manner, we propose a multi-level model with side branches that extract discriminative representations. The multi-level representations are integrated through a fusion layer for image sentiment classification. Finally, we observe a class imbalance problem in several image emotion datasets that cannot be ignored for sentiment recognition; therefore, a new loss function that accounts for sample imbalance is adopted to optimize the CNN model.

Fig. 3 Network architecture. The proposed model consists of two branches: the backbone for deep semantic representation and the Gram-matrix-guided branches for shallow visual features. We adopt ResNet with 50 layers as the backbone CNN architecture. Detailed convolutional layers are omitted for readability

The main contributions of this paper are summarized as follows: (1) We explore the working mechanism of deep representations in hierarchical CNNs for emotion recognition through visualization. Our study indicates that deep models rely mainly on deep semantic information while ignoring shallow visual details. (2) We propose a new multi-level model that combines shallow visual features and deep semantic representations for image sentiment classification, covering low-level visual details, mid-level aesthetics, and high-level semantics. By combining features at different depths, our model extracts emotional representations effectively. (3) To address the sample imbalance in image emotion datasets, a class imbalance loss is adopted to optimize the deep model. Extensive experiments on five benchmark image emotion datasets show that our model outperforms state-of-the-art methods that use deep or hand-crafted features, especially on social images and abstract paintings.

2 Related work

There is currently considerable interest in developing computational models for affective computing of multimedia data, including text, audio, video, and images [22]. For affective image classification, we divide existing work into hand-crafted feature engineering and deep representations according to the type of features used, and we review these methods in this section.

(1) Hand-crafted feature engineering for affective image recognition

Early work mainly focused on the role of visual descriptors in predicting the emotions an image conveys to the observer. For hand-crafted methods, designing and extracting features closely related to emotion is crucial for sentiment recognition. The visual representations used for sentiment classification are constructed at different levels. Early researchers applied standard computer vision features, such as lines, shapes, and textures, to image sentiment analysis, but with limited effect. Others studied the affective level of images from the perspective of composition and aesthetics, or extracted scenes, objects, and their relationships from image content for affective computing.

Yanulevskaya et al. [7] extracted Gabor and Wiccest visual features and used a support vector machine (SVM) for image emotion recognition. Machajdik et al. [5] extracted color features grounded in art theory for the affective analysis of abstract paintings. Brightness, saturation, and similar attributes also have a significant impact on image emotion: the psychological study of Valdez et al. [21] shows a linear relationship between the brightness and saturation of visual features and emotion. Zhao et al. [8] proposed nine types of features based on artistic principles, including balance, harmony, and gradation, for image emotion recognition. Rao et al. [11] presented a visual bag-of-words built from image blocks as a mid-level feature for image sentiment analysis, which is more intuitively related to emotion. Adjective–noun pairs (ANPs), high-level semantic features such as "beautiful flowers," opened up a new visual-concept approach to understanding image emotion; Borth et al. constructed a visual emotion ontology with 1,200 ANPs to describe image emotion [9, 23]. Ali et al. [24] proposed high-level concepts (HLCs), such as object and scene information, and modeled the relationship between these concepts and each emotion category for image sentiment analysis. Compared with features extracted by CNN models, these hand-crafted features concentrate mainly on low-level visual cues, while high-level semantics remain underexploited.

(2) Deep representations for affective image recognition

CNNs have been applied to affective image recognition and have achieved significant performance owing to their excellent high-level representations [25, 26]. Peng et al. [27] first attempted to apply the CNN model of [28] to emotion recognition. They fine-tuned a CNN pre-trained on ImageNet [29] and showed that it outperforms previous hand-crafted methods on the Emotion6 dataset. Campos et al. [19] conducted a layer-by-layer analysis of AlexNet [28] through a series of ablation experiments and showed that CNNs capture higher-level affective features than manual methods. Some researchers combine high-level semantic information with low-level visual features in different ways to guide sentiment classification. Zhu et al. [30] designed a CNN + RNN model that extracts visual and semantic features through the base convolutional layers and then aggregates them with a bidirectional recurrent neural network (BiRNN). Rao et al. [25] proposed MldrNet, a multi-layer deep network that fuses features from different layers with maximum and average aggregation functions, arguing that image emotion can be recognized from image semantics, image aesthetics, and low-level features.

In addition, detecting the regions of an image that evoke emotion is also an interesting research direction. Peng et al. [27] proposed a supervised approach for detecting affective regions. They constructed the Emotion6 dataset and expressed the contribution of each pixel to the induced emotion through an emotion stimuli map (ESM). This study comprehensively considers the influence of both background and foreground regions on evoking the viewer's emotions. The experimental results show that their method predicts emotion-inducing regions better than saliency detection and region detection. Many subsequent works adopted similar benchmarks to evaluate affective region detection [6, 31, 32]. Fan et al. [33] used physical equipment, such as eye trackers, to discover visual attention areas, and combined this with a CNN to explain the relationship between human emotional responses and visual attention.

However, the relationship between deep representations and image emotion, and why deep CNNs perform so well on this task, have not been well explored. Moreover, these widely used CNN models cannot effectively classify images whose emotions are caused mainly by low-level visual features and mid-level aesthetic features. To this end, this paper illustrates the underlying mechanism through visualization and provides an inference model that extracts a fused representation of shallow visual features and deep image content for sentiment recognition. In addition, we discuss the class imbalance issue in several image emotion datasets and introduce a new loss function to address this problem.

3 The proposed inference method

As mentioned earlier, different affective features should be considered and combined for affective recognition. This section develops a model that combines global-view and local-view representations for affective image classification.

3.1 Fine-tuning CNN for affective image recognition

Before introducing our method, we first fine-tune several CNN models that are widely used in computer vision. As a transfer learning strategy, fine-tuning makes it possible to apply deep learning to small datasets and often achieves good results [34, 35]. Therefore, we first study the affective recognition performance of several classic CNNs after fine-tuning and explore how typical CNNs work for affective image recognition.

Given a training sample \(\left\{ {\left( {x,y} \right)} \right\}\), where \(x\) is the image and \(y\) is the associated label, a CNN extracts layer-wise representations of the input image using convolutional and fully connected layers. Following a softmax layer, the output of the last fully connected layer is transformed into a probability distribution \({\mathbf{p}} \in {\mathbb{R}}^{n}\) over \(n\) emotion categories. In this work, \(n = 2\), corresponding to two emotion categories. The probability that the image belongs to a particular emotion category is defined as follows:

$${\varvec{p}}_{i} = \frac{\exp \left( {\varvec{h}}_{i} \right)}{\sum\nolimits_{j = 1}^{n} \exp \left( {\varvec{h}}_{j} \right)}, \quad i = 1, \ldots ,n$$
(1)

where \({{\varvec{h}}}_{i}\) is the output of the last fully connected layer. The loss of the predicted probability is measured using the cross-entropy:

$$L_{cls} = - \mathop \sum \limits_{i = 1}^{n} {\varvec{y}}_{i} {\text{log}}\left( {{\varvec{p}}_{i} } \right)$$
(2)

where \(y = \{ {\varvec{y}}_{i} |{\varvec{y}}_{i} \in \left\{ {0,1} \right\},i = 1,...,n,\sum\nolimits_{i = 1}^{n} {\varvec{y}}_{i} = 1\}\) indicates the true (one-hot) emotion label of the image.
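As a concrete illustration, the following is a minimal fine-tuning sketch in which a pre-trained CNN receives a new two-way classifier head; the cross-entropy criterion corresponds to Eqs. (1)–(2). Our experiments use MXNet (Sect. 4.2), so this PyTorch-style code is only an illustrative stand-in, and the function and variable names are ours.

```python
import torch
import torch.nn as nn
from torchvision import models

n = 2                                              # two emotion categories
model = models.resnet50(weights="IMAGENET1K_V1")   # ImageNet pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, n)      # new emotion classifier head

criterion = nn.CrossEntropyLoss()                  # softmax + cross-entropy, Eqs. (1)-(2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    """One fine-tuning step on a mini-batch (images: (B,3,224,224), labels: (B,))."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```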

In addition, to explore the behavior of classic CNN models on affective classification, we analyze and discuss the learned feature representations through visualization, as shown in Fig. 4. Taking ResNet [36] with 50 layers as the backbone CNN, we visualize the model's intermediate features after fine-tuning. We find that the shallow layers preserve a faithful, photographic rendering of the image, despite increasing blurriness, whereas visual details are progressively discarded at higher levels of the network. Based on these observations, we infer that image emotion recognition requires an additional representation that recovers the missing visual information.
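Such visualizations can be reproduced, for example, by recording intermediate activations with forward hooks, as in the sketch below. This is illustrative PyTorch code under our stated assumptions; apart from selecting the most active channel per stage (as in Fig. 4), the exact rendering procedure is not prescribed here.

```python
import torch
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1").eval()
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()   # keep the stage's feature maps
    return hook

# Register hooks on the four residual stages of ResNet-50.
for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(save_activation(name))

x = torch.randn(1, 3, 224, 224)               # stand-in for a preprocessed input image
with torch.no_grad():
    model(x)

for name, feat in activations.items():
    # Pick the channel with the highest mean activation for visualization.
    channel = feat[0].mean(dim=(1, 2)).argmax().item()
    print(name, tuple(feat.shape), "most active channel:", channel)
```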

Fig. 4 Visualization of the convolution filters that produce the activation map with the highest activation in several convolutional layers. From left to right and top to bottom: the input image and the representations in the first, second, third, and fourth convolutional blocks, respectively

3.2 Deep network learning multi-level representations

Affect-related features can be roughly divided into low-level visual features, such as lines, colors, and textures; mid-level aesthetic features, such as composition and visual balance; and high-level semantic features. However, current CNNs mainly adopt a stacked structure: the deeper a layer lies in the hierarchy, the higher the level of the representation it learns [37], which leads to the loss of shallow detail features. We aim to extract multi-level affective representations, including shallow visual features, image aesthetics, and deep semantics. The framework of the proposed method, shown in Fig. 3, contains two branches: the backbone network extracts high-level semantic representations, and the shallow branches extract visual features and image aesthetics. Finally, the deep representations from different levels are integrated through a fusion layer for classification.

(1) Feature extractor module: We employ ResNet [36] with 50 layers as the backbone CNN in this work. Each residual module uses \(3\times 3\) and \(1\times 1\) filters to capture visual features, and downsampling is performed by pooling layers with a stride of 2. The backbone ends with global average pooling followed by a fully connected layer that generates the final semantic representations.

We extract the visual representation and image aesthetics from the shallow layers, where visual details are still available. Feature maps from different layers of the backbone CNN are turned into shallow representations by a Gram-matrix-guided visual representation extractor, as shown in Fig. 5. A \(1\times 1\) convolutional layer is employed for cross-channel interaction and information integration. However, extracting visual details directly from shallow feature maps introduces redundant information, since these maps also contain content beyond low-level details [38]. Inspired by work on texture synthesis and style transfer [39, 40], we adopt a Gram matrix to capture the correlations between feature maps and form a Gram matrix representation. To let the Gram matrix representation guide the \(1\times 1\) convolutional layer in generating the shallow representation, both are projected by fully connected layers to the same dimension and then integrated through element-wise multiplication.

Fig. 5 Illustration of the Gram-matrix-guided visual representation extractor module. Detailed parameters are marked under each component

Let \({{\varvec{x}}}_{i}\in {\mathbb{R}}^{H\times W}\) denote the \(i\)th channel of the \(l\mathrm{th}\) feature map \({{\varvec{F}}}_{l}\in {\mathbb{R}}^{H\times W\times C}\) in the backbone network, where \(H\) and \(W\) are the height and width of the feature map. Each \({{\varvec{x}}}_{i}\) is first flattened into a vector of length \(HW\); the vectorized feature maps are then stacked into a matrix \({\varvec{F}}\in {\mathbb{R}}^{C\times HW}\), and the Gram matrix \({\varvec{G}}\in {\mathbb{R}}^{C\times C}\) of the convolutional layer can be written as:

$${\varvec{G}} = {\varvec{F}}{\varvec{F}}^{T} .$$
    (3)

    Specifically, each element \({{\varvec{G}}}_{i,j}\) in the Gram matrix is the inner product between the vectorized feature maps \({{\varvec{x}}}_{i}\) and \({{\varvec{x}}}_{j}\) in the layer:

$${\varvec{G}}_{i,j} = \frac{1}{CHW}\sum\limits_{k} {\varvec{x}}_{ik} {\varvec{x}}_{jk}$$
    (4)

where \(\frac{1}{CHW}\) is a normalization factor that prevents excessively large values from destabilizing training. After normalization, the Gram matrix representation is merged with the output of the \(1\times 1\) convolutional layer and finally converted into a visual representation \({\varvec{V}}\) through a fully connected layer with \({\varvec{n}}\) neurons (a minimal sketch of this extractor is given at the end of this subsection).

Moreover, Fig. 6 shows some examples of emotional images and their Gram feature representations. It can be observed that the Gram features better reflect the low-level, pixel-level visual characteristics of the image, which may play an important role in evoking emotion.

Fig. 6 Emotional image (top) and its Gram feature representation (bottom), generated following [42]

(2) Fusion layer: Unlike the deep semantic representation, the visual representation describes low-level visual features. To include more shallow visual information for image emotion recognition, a set of visual representations \(\left\{ {{\varvec{V}}_{1} ,{\varvec{V}}_{2} ,{\varvec{V}}_{3} , \ldots } \right\}\) from different layers is concatenated and aggregated by a fully connected layer. To combine the different levels of deep representation effectively, the fusion function must be chosen carefully; we evaluate the most commonly used fusion functions, including \(concatenation,{ }max\left( \cdot \right),{ }avg\left( \cdot \right)\), \(add\left( \cdot \right)\), and \(mul\left( \cdot \right)\). At the end of the proposed model, deep image semantics and shallow visual details are concatenated to form the multi-level feature representation, as sketched below.
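To make the extractor and fusion concrete, the following is a minimal sketch under stated assumptions: it follows the description above (Gram matrix of Eqs. (3)–(4), a \(1\times 1\) convolution, projection to a common dimension, element-wise multiplication, and final concatenation), but the module name GramGuidedExtractor, the dimension d, and the use of PyTorch rather than our MXNet implementation are illustrative choices, not the exact code.

```python
import torch
import torch.nn as nn

class GramGuidedExtractor(nn.Module):
    """Sketch of the Gram-matrix-guided visual representation extractor.
    The output dimension d and layer sizes are illustrative assumptions."""
    def __init__(self, in_channels, height, width, d=512):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.fc_gram = nn.Linear(in_channels * in_channels, d)      # Gram matrix -> d
        self.fc_feat = nn.Linear(in_channels * height * width, d)   # 1x1-conv features -> d

    def forward(self, feat):                                  # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        flat = feat.view(b, c, h * w)                         # vectorize each channel
        gram = torch.bmm(flat, flat.transpose(1, 2)) / (c * h * w)   # Eqs. (3)-(4)
        g = self.fc_gram(gram.view(b, -1))
        f = self.fc_feat(self.conv1x1(feat).view(b, -1))
        return g * f                                          # element-wise multiplication -> V

def fusion_layer(visual_reps, semantic):
    """Fusion layer: aggregate {V1, V2, V3, ...} and concatenate with the deep semantics."""
    shallow = torch.cat(visual_reps, dim=1)
    return torch.cat([shallow, semantic], dim=1)
```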

3.3 Loss function

Currently, publicly available datasets for image sentiment analysis are relatively limited. Moreover, because manual annotation of sentiment labels is expensive, existing affective datasets, including ArtPhoto [5], Abstract [5], Twitter I [41], and Emotion6 [27], typically contain fewer than two thousand images. You et al. [42] used the metadata provided by uploaders to weakly label images into two categories, forming the FI dataset, which is currently the large-scale dataset used to train deep learning models. Table 1 summarizes the sizes of these image emotion datasets. We find that the class proportions of these sentiment datasets are unbalanced, which leads to two problems: (1) training is inefficient, since the classifier tends to favor the majority class of the dataset; and (2) easy-to-classify samples can overwhelm training and cause the model to degenerate [43].

Table 1 Statistics of the available image affective datasets

Considering the class imbalance, we propose a novel loss function to address this issue. For ease of description, \({{\varvec{p}}}_{i}\) denotes the predicted probability of the ground-truth class of the \({\varvec{i}}\mathrm{th}\) image, and \({{\varvec{C}}}_{i}\) denotes the total number of training images in that class. Based on these definitions, the class imbalance loss function is defined as:

$$L_{cis} = - \alpha {\varvec{w}}_{i} \left( {1 - {\varvec{p}}_{i} } \right)^{\gamma } \log \left( {{\varvec{p}}_{i} } \right)$$
(5)

where \({\varvec{w}}_{i} = \frac{1/{\varvec{C}}_{i}}{\sum\nolimits_{j = 1}^{n} 1/{\varvec{C}}_{j}}\) is the category weight coefficient and \(\alpha\) is a balance factor that scales the loss value to accelerate convergence.

On the one hand, setting \(\gamma > 0\) reduces the loss contributed by easy-to-classify samples, so the model pays more attention to hard samples, improving the generalization ability of the deep model. On the other hand, \({\varvec{w}}_{i}\) gives images from under-represented categories a larger loss value, balancing the uneven sample ratio. To verify the effectiveness of the proposed loss, we compare it in Sect. 4 with similar optimization losses, including the class weight algorithm and the Class-Balanced \(softmax\) Cross-Entropy Loss (\({\text{CB}}_{softmax}\)) [44]. Specifically, the class weight algorithm multiplies the cross-entropy loss by a per-class weight, the reciprocal of the ratio of that class to the total number of training samples. The hyperparameter \(\beta\) of \({\text{CB}}_{softmax}\) is set to 0.9999.
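A minimal sketch of Eq. (5) is given below. It assumes the focal-style reading described above, with the weight indexed by each sample's ground-truth class; the function name and the example class counts are hypothetical, and the code is written in PyTorch rather than our MXNet implementation. The defaults mirror the values chosen in Sect. 4.4 (\(\gamma = 2\), \(\alpha = 10\)).

```python
import torch

def class_imbalance_loss(logits, labels, class_counts, alpha=10.0, gamma=2.0):
    """Sketch of the class-imbalance loss in Eq. (5).
    class_counts[i] is C_i, the number of training images in class i."""
    inv = 1.0 / class_counts                     # 1 / C_i
    w = inv / inv.sum()                          # category weight coefficients w_i
    probs = torch.softmax(logits, dim=1)
    p_t = probs.gather(1, labels.unsqueeze(1)).squeeze(1)   # probability of the true class
    w_t = w[labels]                              # weight of each sample's class
    loss = -alpha * w_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-12))
    return loss.mean()

# Hypothetical usage with an imbalanced binary dataset:
counts = torch.tensor([16000.0, 7000.0])
logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(class_imbalance_loss(logits, labels, counts))
```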

4 Experiments

In this section, we present our experiments and evaluate our method against the state-of-the-art approaches to validate the effectiveness of our framework for sentiment classification on different datasets.

4.1 Dataset

We evaluate our proposed method on five widely used datasets: ArtPhoto [5], Abstract [5], Twitter I [41], Emotion6 [27], and FI [42]. Among them, FI, collected by You et al., is significantly larger than the others. We therefore divide the datasets into large-scale and small-scale groups according to the number of samples; Table 1 shows the details.

ArtPhoto contains 806 art photographs from photo-sharing sites, in which the emotion is determined by the artist who uploaded the photograph. Abstract, in contrast, contains 228 peer-rated abstract paintings that convey emotion mainly through color and texture. Twitter I, collected from social networking sites and labeled by Amazon Mechanical Turk (AMT) workers into two categories (positive and negative), contains 1,269 images. We test our method on all three subsets of Twitter I: "five agree," "at least four agree," and "at least three agree," where "five agree" means that all five AMT workers gave the same sentiment label to a given image. Emotion6 was created as an emotion prediction benchmark; it contains 1,980 images collected from Flickr across six emotion categories. The authors assumed that each pixel's contribution to the emotion-inducing region is proportional to the number of annotated rectangles covering that pixel and collected 15 AMT responses per image to build the ground-truth emotion-stimuli maps.

FI is currently the largest well-labeled dataset. The original FI dataset contains 90,000 noisy images collected from Flickr and Instagram by querying eight affective keywords. The weakly labeled images were further annotated by 225 Amazon Mechanical Turk (AMT) workers selected through qualification tests, and the 23,308 images that received at least three agreeing votes from the five assigned AMT workers were retained. We convert FI into a binary dataset in the same way as Twitter I.

Figure 7 shows some example images from these datasets. Since we focus on binary sentiment prediction, we convert the multi-class emotion labels into positive and negative labels according to their valence for all datasets except Twitter I, which is already labeled with binary sentiment. Specifically, for ArtPhoto, Abstract Paintings, and FI, we binarize the labels according to [45], which treats amusement, awe, contentment, and excitement as positive sentiments and anger, disgust, fear, and sadness as negative sentiments. Emotion6 is labeled with six emotions (anger, disgust, fear, joy, sadness, and surprise) along with valence–arousal scores; anger, disgust, fear, and sadness are likewise treated as negative. Since the mean valence of the joy and surprise images is higher than that of the negative images, we treat them as positive.
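The label conversion described above reduces to a simple mapping; the sketch below spells it out (the function name is illustrative).

```python
# Mapping from the eight emotion categories of [45] to binary sentiment labels,
# as used here for ArtPhoto, Abstract, and FI.
POSITIVE = {"amusement", "awe", "contentment", "excitement"}
NEGATIVE = {"anger", "disgust", "fear", "sadness"}

def to_binary(emotion: str) -> int:
    """Return 1 for positive sentiment, 0 for negative sentiment."""
    if emotion in POSITIVE:
        return 1
    if emotion in NEGATIVE:
        return 0
    raise ValueError(f"unknown emotion label: {emotion}")

# For Emotion6, joy and surprise are additionally treated as positive (see text above).
```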

Fig. 7 Example images from popular affective datasets

4.2 Implementation details

Our experiments are implemented with the Amazon-backed open-source deep learning framework MXNet [46]. The model parameters are optimized by mini-batch gradient descent (MBGD) [47]. A momentum strategy [48] is used to accelerate convergence, with the momentum coefficient set to 0.9. Because the datasets differ in size, the batch size is 16 or 32. Weights are initialized from a uniform distribution, i.e., \(\upomega \sim {\text{U}}\left( { - 0.07,0.07} \right)\), biases are initialized to 0, and the learning rate is 0.001.

As in Refs. [36, 49], data augmentation is adopted to reduce the model's dependence on particular attributes during training and thereby improve generalization. Specifically, input images are resized to \(256 \times 256\) and randomly cropped back to \(224 \times 224\) from the center and corners, with random horizontal flips and color and tone adjustments applied with a certain probability. Finally, the features are normalized with the channel means and variances during training. All experiments are carried out on two NVIDIA RTX 2080 Ti GPUs with 64 GB of CPU memory.
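For reference, the augmentation pipeline corresponds roughly to the sketch below. Our code uses MXNet; the torchvision equivalents and the particular jitter strengths shown here are assumptions for illustration only.

```python
from torchvision import transforms

# Illustrative stand-in for the augmentation pipeline described above.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),                  # random crops from center and corners
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet channel statistics
                         std=[0.229, 0.224, 0.225]),
])
```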

4.3 Ablation studies

(1) Network architecture. Deep CNNs have shown excellent performance in many computer vision tasks and tend to become deeper and deeper. Motivated by the importance of depth, we first analyze the influence of several classic CNNs on image sentiment analysis, including AlexNet [28], VGG [50], DenseNet [51], and ResNet [36], chosen for their excellent performance on visual classification tasks and relatively consistent structures. We obtain pre-trained weights from the MXNet model zoo [46] and fine-tune the models.

Table 2 shows the performance of several deep CNN models on the large-scale FI dataset. ResNet-152 achieves the best performance, followed by ResNet-50, while AlexNet achieves the worst results because it has only five convolutional layers. Within the same network family, emotion recognition accuracy increases with depth. However, beyond a certain depth, the large number of additional parameters does not significantly improve performance; for example, ResNet with 50 layers and with 152 layers perform almost the same. Another interesting finding is that DenseNet with 121 layers does not perform well, mainly due to the limited number of convolution filters in each layer, even though it has far more convolutional layers than ResNet-50. As expected, simply increasing the number of layers does not yield a better affective representation, so more discriminative emotional cues are needed. Moreover, the VGG networks (VGG-16 and VGG-19) achieve better performance than ResNet-18, which has more convolutional layers. This may be because VGG has multiple fully connected layers that map the learned "distributed feature representation" to the sample label space.

    Table 2 Emotion classification accuracy for deep CNNs with different convolutional layers on FI dataset

For this reason, we attempt to add similar fully connected layers to the other networks. The experimental results in this table support our conjecture: the fully connected layers at the end of the network help affective image recognition, as they provide additional parameters for constructing high-level concepts for affective reasoning. Finally, we employ ResNet-50 as the backbone network, since ResNet-152 offers no obvious performance gain but brings many more parameters.

(2) Multi-level feature aggregation. We explore multi-level representations for emotion recognition from the CNN. Generally speaking, convolutional layers that produce output feature maps of the same size are treated as belonging to the same stage. For a given deep CNN, we use it as the backbone to extract high-level semantic representations, while visual feature representations, including low-level details and mid-level aesthetics, are extracted from different network stages. In this work, we first compute the Gram matrix of the shallow features and transform it into a one-dimensional vector with a fully connected layer. At the same time, a \(1 \times 1\) convolutional layer is added to reduce computation and enable cross-channel interaction and information integration; the two are then multiplied by broadcasting to form the shallow visual representation.

As shown in Table 3, adding shallow visual features further improves performance, indicating that high-level semantics alone are not sufficient to express image emotion. We also observe that combining the visual representations \(\left\{ {{\varvec{V}}_{1} ,{\varvec{V}}_{2} ,{\varvec{V}}_{3} , \ldots } \right\}\) with the backbone network yields the best accuracy, meaning that visual features extracted from the shallow layers help image semantics in emotion classification. However, obtaining visual representations from higher-level feature maps degrades performance, because deeper feature maps retain only a small portion of the low-level visual features and mostly encode image content; extracting visual details from them therefore introduces redundant information.

Table 3 Evaluation of multi-level representation on FI dataset

The fusion technique is also a critical component of our multi-level model. We compare several fusion functions, including element-wise addition, element-wise multiplication, channel pooling, and concatenation, to explore their impact on affective classification performance, as shown in Table 4. All combinations help capture information from the overall and regional views, and concatenation is the most effective because it retains all of the deep representations.

Table 4 Emotion classification accuracy of our multi-level representation model with different fusion functions
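The fusion functions compared in Table 4 can be summarized by the sketch below (illustrative; v and s denote the aggregated shallow representation and the deep semantic representation, assumed to have equal dimensions except for concatenation).

```python
import torch

def fuse(v, s, mode="concatenation"):
    """Illustrative fusion functions; v and s are (B, d) feature tensors."""
    if mode == "concatenation":
        return torch.cat([v, s], dim=1)       # retains both representations in full
    if mode == "add":
        return v + s
    if mode == "mul":
        return v * s
    if mode == "max":
        return torch.maximum(v, s)            # element-wise channel pooling (max)
    if mode == "avg":
        return (v + s) / 2                    # element-wise channel pooling (average)
    raise ValueError(mode)
```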

Furthermore, we verify the impact of the ratio between the shallow visual representation and the high-level semantic representation on emotion recognition, as shown in Fig. 8. Let \(r\) denote the ratio of the visual representation dimension to the semantic feature dimension. Note that the semantic dimension is 2048 for ResNet, 1024 for DenseNet, and 512 for VGG. ResNet and VGG achieve the best performance when \(r=1\), i.e., when the shallow representation and the deep semantics have equal dimensions, while DenseNet performs best at \(r=0.75\). This indicates that low-level visual representations can assist image semantic features in affective recognition; still, giving too much weight to shallow features introduces redundant information and hurts performance.
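In practice, \(r\) simply fixes the dimension of the aggregated shallow representation relative to the backbone output, e.g. (illustrative values only):

```python
# Setting the visual-branch dimension from the ratio r (illustrative sketch).
semantic_dim = 2048                       # ResNet-50 backbone output dimension
r = 1.0                                   # best-performing ratio for ResNet and VGG (Fig. 8)
visual_dim = int(r * semantic_dim)        # dimension of the aggregated shallow representation
fused_dim = semantic_dim + visual_dim     # classifier input size after concatenation
```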

Fig. 8 Impact of different \(r\) on the validation set of the FI dataset. We set \(r=1\) in the remaining experiments

4.4 Class imbalance and loss function

In this section, we examine how class imbalance in image sentiment datasets affects recognition accuracy. As mentioned in Sect. 3, the class imbalance issue cannot be ignored when training deep networks. Considering this challenge, a new loss function is proposed, as shown in Eq. (5). The proposed model and loss are evaluated using ResNet and VGG as backbone networks.

Figure 9 shows the impact of the balance factor \(\alpha\) and the modulation factor \(\gamma\) on the performance of the multi-level model. Note that when \(\gamma =0\), the loss function degenerates to the cross-entropy loss. As shown in the figure, training ResNet-50, VGG, and our proposed model with the proposed loss (Eq. 5) improves performance, and the proposed model achieves the best results when \(\gamma = 2\).

Fig. 9 Impact of different \(\gamma\) on model performance on the FI dataset. The dashed and solid lines correspond to the balance factor \(\alpha\) taking values 1, 10, and 20. We set \(\gamma =2\) and \(\alpha =10\) in the remaining experiments

4.5 Comparison to state of the arts

To demonstrate the effectiveness of our model, we compare it with state-of-the-art affective image recognition methods, including several hand-crafted feature algorithms and deep models.

(1) Hand-crafted features: We employ several low-level visual features extracted by Yang et al. [52], including SIFT, HOG, and GIST. ColorName uses the algorithm in [5] to count the pixels of each of 11 primary colors in the query image. SentiBank [23] is a concept detector library based on the ANP visual ontology, using 1,200-dimensional features as a mid-level representation. Zhao et al. [53] adopt principles-of-art-based affective features (PAEF) for image affective computing, a combination of features derived from artistic principles, including balance, emphasis, harmony, variety, gradation, and movement.

(2) Deep models: You et al. [41] proposed PCNN, a progressive CNN architecture that exploits large-scale weakly labeled data to improve generalization. DeepSentiBank [54] is a CNN-based classifier of visual sentiment concepts for discovering ANPs; the pre-trained DeepSentiBank is adopted to extract 2,089 ANPs as mid-level representations. We also report the performance of the deep visual features of fine-tuned CNNs on the sentiment datasets for several architectures, namely AlexNet [28], VGG [50], DenseNet [51], and ResNet-50 [36]. Yang et al. [52] proposed the affective regions detection method AR, which uses a dual-branch strategy to combine local information with holistic representations for sentiment classification. Xiong et al. [55] proposed R-CNNGSR, which automatically detects affective regions by combining low-level and emotional features and then fuses the whole image with the detected regions to predict emotion. Moreover, we employ the class weight algorithm and the Class-Balanced \(softmax\) Cross-Entropy Loss (\({\mathrm{CB}}_{softmax}\)) [44] to verify the effectiveness of the proposed loss function.

We first conduct experiments on the large-scale FI dataset and compare our model's performance with several related techniques, using four different loss functions to optimize the model. As shown in Table 5, the proposed method outperforms deep CNNs designed for object classification, such as AlexNet, VGG-16, DenseNet, and ResNet-50, indicating that it benefits from multi-level feature representation and a more suitable optimization loss. Compared with other deep models designed for affective classification, including PCNN [42], DeepSentiBank [54], and AR [52], our multi-level model achieves better performance thanks to its multi-level representation. Moreover, the proposed class imbalance loss optimizes the deep affective model better than the other loss functions.

Table 5 Comparison with state-of-the-art methods on FI dataset

To further evaluate the effectiveness of our model, we validate it on the small-scale datasets, including ArtPhoto, Abstract Paintings, Twitter I, and Emotion6, as shown in Table 6. The results show that our multi-level feature model outperforms several hand-crafted feature methods such as PAEF [53] and SentiBank [23], and it also improves upon other deep CNN models. Notably, compared to Yang [52] and Xiong [55], the proposed method achieves better performance on Abstract. This is consistent with our initial conjecture: abstract paintings rely more on the expression of visual details and aesthetics, whereas Web images lean more on image semantics, as shown in Fig. 2. Furthermore, our method also yields clear performance gains on the Twitter I and Emotion6 datasets, benefiting from the more suitable optimization loss.

Table 6 Comparison with state-of-the-art methods on small-scale datasets

Finally, time consumption is also an interesting topic. Since different fusion strategies incur different computational costs, we compare the running time of the proposed model under different fusion strategies in detail. For example, concatenation enlarges the dimension of the final fully connected layer and therefore adds parameters, whereas the other four fusion strategies do not have this problem. Figure 10 shows the running time of the proposed multi-level model compared with the backbone network. The proposed model incurs little extra time over the base network because it operates at the block level rather than throughout the hierarchy.

Fig. 10 Comparison of running time on FI. The base model is the fine-tuned model from the model zoo; ours-multi and ours-concatenation denote the proposed model with different fusion strategies

5 Conclusion

This paper proposes a multi-level representation model that learns and integrates the deep semantics and shallow visual details of images for affective recognition. The visualization of deep representations and the experimental results show that incorporating shallow features benefits image emotion classification. Moreover, our studies indicate that class imbalance degrades performance, as the dominant class overwhelms training and causes the model to degenerate. Therefore, a new loss function was introduced to optimize the deep CNN model.