1 Introduction

Psychological research has shown that human emotions change in response to different visual stimuli, such as images and videos. Building on these findings, a growing number of researchers have begun to predict people's emotional responses from visual content. This emerging research topic, called affective image computing, has received increasing attention in recent years. However, owing to the complexity and subjectivity of emotion, analyzing images at the affective level is more challenging than analyzing them at the semantic level [1]. Although there have been many studies in social media sentiment analysis, including facial expression recognition [2, 3] and speech emotion recognition [4], progress in image emotion computing has been limited.

Designing a comprehensive representation that bridges the "affective gap" between measurable image features and high-level affective semantics is challenging. Figure 1 shows samples from different datasets that evoke or express different emotions. Image emotions are related to complex visual features ranging from low-level details to high-level local or global views. For example, the same scene or object rendered in different colors may evoke different emotional experiences in the viewer, while different scenes or objects may evoke the same emotion. The task therefore requires affective-level reasoning over images that is closely tied to many visual features, and it is difficult to design a single discriminative representation that covers all affective factors. In addition, some images, such as abstract pictures and artistic paintings, are not easily characterized by precise rules [5, 6]. Therefore, extracting discriminative affective representations is vital for affective image recognition.

Fig. 1 Affective image samples that evoke positive emotion (top) and negative emotion (bottom)

Many approaches have been proposed to address this highly challenging problem. Low-level visual features, such as color, shape, and texture, were first adopted for affective image classification [7]. Zhao et al. proposed that image emotion is highly related to artistic principles; following this idea, mid-level features representing artistic principles, such as visual balance, emphasis, and harmony, have been applied to image sentiment classification [8]. Borth et al. argued that semantic content, such as beautiful flowers or red cars, significantly affects the viewer's emotional response [9,10,11]. They constructed a visual emotion ontology of 1,200 Adjective–Noun Pairs (ANPs), which opened up a new visual-concept approach to understanding image emotion. However, most previous studies rely on hand-crafted features that are manually designed based on common sense and observation. It is difficult for such methods to cover all the essential factors related to affect: image semantics, artistic principles, and low-level visual features.

Currently, CNNs are widely used in many fields and have achieved outstanding breakthroughs in tasks such as image classification [12, 13], object detection [14, 15], and semantic segmentation [16, 17]. Researchers have designed various neural network models whose learned representations make target-object information clearer along the processing hierarchy [18]. Many studies have applied deep networks to affective image computing. Experimental results show that CNNs outperform hand-crafted features in image emotion recognition [19, 20], indicating the outstanding potential of deep representations for this challenging task. However, the relationship between deep features and emotion, and why deep CNNs perform so remarkably on this task, have not been well explored. Moreover, the VGG-like stacked layered structure may weaken low-level features, such as texture, color, and shape, that are conducive to emotion recognition. This suggests that analyzing features from different layers has practical significance.

To understand how CNNs designed for image classification behave on affective image computing tasks, we examine the deep features along the emotion recognition processing hierarchy and verify our hypothesis through visualization. Our study shows that image emotion can be represented by semantic information derived from deep inference. On the other hand, we also find that the hierarchical structure loses low-level visual features. In some cases, non-graphic components such as color and texture may evoke stronger affective responses than content [6], as shown in Fig. 2. Therefore, we study the correlation between low-level visual features and evoked emotions and show that shallow features are also significant for affective recognition.

Fig. 2 Web images rely more on high-level content semantics, while abstract paintings rely more on low-level visual features and mid-level aesthetics. From left to right, the red and green boxes mark negative and positive emotions, respectively

To address the above challenges, this paper proposes a novel feature extraction method that learns multi-level deep representations, combining low-level and deep-level features for affective image recognition. Unlike traditional hierarchical CNNs designed to classify objects at the center of an image, our model obtains more discriminative hybrid representations. Figure 3 gives an overview of the proposed method. Traditional hierarchical CNNs focus on transforming shallow features into deep representations, which makes the extraction of low- and mid-level features inefficient. To learn features at different levels from the entire image in an end-to-end manner, we propose a multi-level model with side branches that extract discriminative representations. The multi-level representations are integrated through a fusion layer for image sentiment classification. Finally, we observe a class imbalance problem in several image emotion datasets that cannot be ignored for sentiment recognition; therefore, a new loss function that accounts for sample imbalance is adopted to optimize the CNN model.

Fig. 3 Network architecture. The proposed model consists of two branches: the backbone for deep semantic representation and the Gram-matrix-guided branches for shallow visual features. We adopt ResNet with 50 layers as the backbone CNN architecture. Detailed convolutional layers are omitted for readability

The main contributions of this paper are summarized as follows: (1) We explore the working mechanism of deep representations in hierarchical CNNs for emotion recognition through visualization. Our study indicates that deep models rely mainly on deep semantic information while ignoring shallow visual details. (2) We propose a new multi-level model that combines shallow visual features and deep semantic representations for image sentiment classification, covering low-level visual details, mid-level aesthetics, and high-level semantics. By combining features at different depths, our model extracts emotional representations effectively. (3) To address the sample imbalance in image emotion datasets, a class imbalance loss is adopted to optimize the deep model. Extensive experiments on five benchmark image emotion datasets show that our model outperforms state-of-the-art methods that use deep or hand-crafted features, especially on social images and abstract paintings.

2 Related work

There is currently considerable interest in developing computational models for affective computing of multimedia data, including text, audio, video, and images [22]. For affective image classification, we divide existing work into hand-crafted feature engineering and deep representations according to the type of features used, and we review these methods in this section.

(1) Hand-crafted feature engineering for affective image recognition

Early work mainly focused on the role of visual descriptors in predicting the emotions an image conveys to the observer. For hand-crafted methods, designing and extracting features closely related to emotion is crucial for sentiment recognition. The visual representations used for sentiment classification are constructed at different levels. Early researchers applied standard computer vision features, such as lines, shapes, and textures, to image sentiment analysis, but with limited effect. Others studied the affective level of images from the perspective of composition and aesthetics, or extracted scenes, objects, and their relationships from image content for affective computing.

Yanulevskaya et al. [7] extracted Gabor and Wiccest visual features and used a support vector machine (SVM) for image emotion recognition. Machajdik et al. [5] extracted color features grounded in art theory for the affective analysis of abstract paintings. Brightness, saturation, and similar attributes also have a significant impact on image emotion: the psychological study of Valdez et al. [21] shows a linear relationship between the brightness and saturation of visual features and emotion. Zhao et al. [8] proposed nine types of features based on artistic principles, including balance, harmony, and gradation, for image emotion recognition. Rao et al. [11] presented a visual bag-of-words built from image blocks as a mid-level feature for image sentiment analysis, which is more intuitively related to emotion. Adjective–noun pairs (ANPs), high-level semantic features such as "beautiful flowers," opened up a new visual-concept approach to understanding image emotion; Borth et al. constructed a visual emotion ontology with 1,200 ANPs to describe image emotion [9, 23]. Ali et al. [24] proposed high-level concepts (HLCs), such as object and scene information, and modeled the relationship between these concepts and each emotion category for image sentiment analysis. Compared with features extracted by CNN models, these hand-crafted features concentrate mainly on low-level visual cues, while high-level semantics remain underexploited.

(2) Deep representations for affective image recognition

CNNs have been applied to affective image recognition and have achieved significant performance owing to their excellent high-level representations [25, 26]. Peng et al. [27] first attempted to apply the CNN model of [28] to emotion recognition. They fine-tuned a CNN pre-trained on ImageNet [29] and showed that it outperforms previous hand-crafted methods on the Emotion6 dataset. Campos et al. [19] conducted a layer-by-layer analysis of AlexNet [28] through a series of ablation experiments and showed that CNNs capture higher-level affective features than manual methods. Some researchers combine high-level semantic information with low-level visual features in different ways to guide sentiment classification. Zhu et al. [30] designed a CNN + RNN model that extracts visual and semantic features through the base convolutional layers and then aggregates them with a bidirectional recurrent neural network (BiRNN). Rao et al. [25] proposed MldrNet, a multi-layer deep network that fuses features from different layers with maximum and average aggregation functions, arguing that image emotion can be recognized from image semantics, image aesthetics, and low-level features.

In addition, detecting the regions of an image that evoke emotion is also an interesting research direction. Peng et al. [27] proposed a supervised approach for detecting affective regions. They constructed the Emotion6 dataset and expressed the contribution of each pixel to the induced emotion through an emotion stimuli map (ESM). This study comprehensively considers the influence of both background and foreground regions on evoking the viewer's emotions. The experimental results show that their method predicts emotion-inducing regions better than saliency detection and region detection. Many subsequent works adopted similar benchmarks to evaluate affective region detection [6, 31, 32]. Fan et al. [33] used physical equipment, such as eye trackers, to discover visual attention areas, and combined this with a CNN to explain the relationship between human emotional responses and visual attention.

However, the relationship between deep representations and image emotion, and why deep CNNs perform so well on this task, have not been well explored. Moreover, these widely used CNN models cannot effectively classify images whose emotions are caused mainly by low-level visual features and mid-level aesthetic features. To this end, this paper illustrates the underlying mechanism through visualization and provides an inference model that extracts a fused representation of shallow visual features and deep image content for sentiment recognition. In addition, we discuss the class imbalance issue in several image emotion datasets and introduce a new loss function to address this problem.

3 The proposed inference method

As mentioned earlier, different affective features should be considered and combined for affective recognition. This section develops a model that combines global-view and local-view representations for affective image classification.

3.1 Fine-tuning CNN for affective image recognition

Before introducing our method, we first fine-tune several CNN models that are widely used in computer vision. As a transfer learning strategy, fine-tuning makes it possible to apply deep learning to small datasets and often achieves good results [34, 35]. Therefore, we first study the affective recognition performance of several classic CNNs after fine-tuning and explore how typical CNNs work for affective image recognition.

Given a training sample \(\left\{ {\left( {x,y} \right)} \right\}\), where \(x\) is the image and \(y\) is the associated label, a CNN extracts layer-wise representations of the input image using convolutional and fully connected layers. Following a softmax layer, the output of the last fully connected layer is transformed into a probability distribution \({\mathbf{p}} \in {\mathbb{R}}^{n}\) over \(n\) emotion categories. In this work, \(n = 2\), corresponding to two emotion categories. The probability that the image belongs to a particular emotion category is defined as follows:

$${\varvec{p}}_{i} = \frac{\exp \left( {\varvec{h}}_{i} \right)}{\sum\nolimits_{j = 1}^{n} \exp \left( {\varvec{h}}_{j} \right)}, \quad i = 1, \ldots ,n$$
(1)

where \({{\varvec{h}}}_{i}\) is the output of the last fully connected layer. The loss of the predicted probability is measured using the cross-entropy:

$$L_{cls} = - \mathop \sum \limits_{i = 1}^{n} {\varvec{y}}_{i} {\text{log}}\left( {{\varvec{p}}_{i} } \right)$$
(2)

where \(y = \{ {\varvec{y}}_{i} |{\varvec{y}}_{i} \in \left\{ {0,1} \right\},i = 1,...,n,\sum\nolimits_{i = 1}^{n} {\varvec{y}}_{i} = 1\}\) indicates the true (one-hot) emotion label of the image.
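As a concrete illustration, the following is a minimal fine-tuning sketch in which a pre-trained CNN receives a new two-way classifier head; the cross-entropy criterion corresponds to Eqs. (1)–(2). Our experiments use MXNet (Sect. 4.2), so this PyTorch-style code is only an illustrative stand-in, and the function and variable names are ours.

```python
import torch
import torch.nn as nn
from torchvision import models

n = 2                                              # two emotion categories
model = models.resnet50(weights="IMAGENET1K_V1")   # ImageNet pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, n)      # new emotion classifier head

criterion = nn.CrossEntropyLoss()                  # softmax + cross-entropy, Eqs. (1)-(2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    """One fine-tuning step on a mini-batch (images: (B,3,224,224), labels: (B,))."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```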

In addition, to explore the behavior of classic CNN models on affective classification, we analyze and discuss the learned feature representations through visualization, as shown in Fig. 4. Taking ResNet [36] with 50 layers as the backbone CNN, we visualize the model's intermediate features after fine-tuning. We find that the shallow layers preserve a faithful, photographic rendering of the image, despite increasing blurriness, whereas visual details are progressively discarded at higher levels of the network. Based on these observations, we infer that image emotion recognition requires an additional representation that recovers the missing visual information.
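Such visualizations can be reproduced, for example, by recording intermediate activations with forward hooks, as in the sketch below. This is illustrative PyTorch code under our stated assumptions; apart from selecting the most active channel per stage (as in Fig. 4), the exact rendering procedure is not prescribed here.

```python
import torch
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1").eval()
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()   # keep the stage's feature maps
    return hook

# Register hooks on the four residual stages of ResNet-50.
for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(save_activation(name))

x = torch.randn(1, 3, 224, 224)               # stand-in for a preprocessed input image
with torch.no_grad():
    model(x)

for name, feat in activations.items():
    # Pick the channel with the highest mean activation for visualization.
    channel = feat[0].mean(dim=(1, 2)).argmax().item()
    print(name, tuple(feat.shape), "most active channel:", channel)
```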

Fig. 4 Visualization of the convolution filters that produce the activation map with the highest activation in several convolutional layers. From left to right and top to bottom: the input image and the representations in the first, second, third, and fourth convolutional blocks, respectively

3.2 Deep network learning multi-level representations

Affect-related features can be roughly divided into low-level visual features, such as lines, colors, and textures; mid-level aesthetic features, such as composition and visual balance; and high-level semantic features. However, current CNNs mainly adopt a stacked structure: the deeper a layer lies in the hierarchy, the higher the level of the representation it learns [37], which leads to the loss of shallow detail features. We aim to extract multi-level affective representations, including shallow visual features, image aesthetics, and deep semantics. The framework of the proposed method, shown in Fig. 3, contains two branches: the backbone network extracts high-level semantic representations, and the shallow branches extract visual features and image aesthetics. Finally, the deep representations from different levels are integrated through a fusion layer for classification.

(1) Feature extractor module: We employ ResNet [36] with 50 layers as the backbone CNN in this work. Each residual module uses \(3\times 3\) and \(1\times 1\) filters to capture visual features, and downsampling is performed by pooling layers with a stride of 2. The backbone ends with global average pooling followed by a fully connected layer that generates the final semantic representations.

We extract the visual representation and image aesthetics from the shallow layers, where visual details are still available. Feature maps from different layers of the backbone CNN are turned into shallow representations by a Gram-matrix-guided visual representation extractor, as shown in Fig. 5. A \(1\times 1\) convolutional layer is employed for cross-channel interaction and information integration. However, extracting visual details directly from shallow feature maps introduces redundant information, since these maps also contain content beyond low-level details [38]. Inspired by work on texture synthesis and style transfer [39, 40], we adopt a Gram matrix to capture the correlations between feature maps and form a Gram matrix representation. To let the Gram matrix representation guide the \(1\times 1\) convolutional layer in generating the shallow representation, both are projected by fully connected layers to the same dimension and then integrated through element-wise multiplication.

Fig. 5 Illustration of the Gram-matrix-guided visual representation extractor module. Detailed parameters are marked under each component

Let \({{\varvec{x}}}_{i}\in {\mathbb{R}}^{H\times W}\) denote the \(i\)th channel of the \(l\mathrm{th}\) feature map \({{\varvec{F}}}_{l}\in {\mathbb{R}}^{H\times W\times C}\) in the backbone network, where \(H\) and \(W\) are the height and width of the feature map. Each \({{\varvec{x}}}_{i}\) is first flattened into a vector of length \(HW\); the vectorized feature maps are then stacked into a matrix \({\varvec{F}}\in {\mathbb{R}}^{C\times HW}\), and the Gram matrix \({\varvec{G}}\in {\mathbb{R}}^{C\times C}\) of the convolutional layer can be written as:

$${\varvec{G}} = {\varvec{F}}{\varvec{F}}^{T} .$$
    (3)

    Specifically, each element \({{\varvec{G}}}_{i,j}\) in the Gram matrix is the inner product between the vectorized feature maps \({{\varvec{x}}}_{i}\) and \({{\varvec{x}}}_{j}\) in the layer:

$${\varvec{G}}_{i,j} = \frac{1}{CHW}\sum\limits_{k} {\varvec{x}}_{ik} {\varvec{x}}_{jk}$$
    (4)

where \(\frac{1}{CHW}\) is a normalization factor that prevents excessively large values from destabilizing training. After normalization, the Gram matrix representation is merged with the output of the \(1\times 1\) convolutional layer and finally converted into a visual representation \({\varvec{V}}\) through a fully connected layer with \({\varvec{n}}\) neurons (a minimal sketch of this extractor is given at the end of this subsection).

Moreover, Fig. 6 shows some examples of emotional images and their Gram feature representations. It can be observed that the Gram features better reflect the low-level, pixel-level visual characteristics of the image, which may play an important role in evoking emotion.

Fig. 6 Emotional image (top) and its Gram feature representation (bottom), generated following [42]

(2) Fusion layer: Unlike the deep semantic representation, the visual representation describes low-level visual features. To include more shallow visual information for image emotion recognition, a set of visual representations \(\left\{ {{\varvec{V}}_{1} ,{\varvec{V}}_{2} ,{\varvec{V}}_{3} , \ldots } \right\}\) from different layers is concatenated and aggregated by a fully connected layer. To combine the different levels of deep representation effectively, the fusion function must be chosen carefully; we evaluate the most commonly used fusion functions, including \(concatenation,{ }max\left( \cdot \right),{ }avg\left( \cdot \right)\), \(add\left( \cdot \right)\), and \(mul\left( \cdot \right)\). At the end of the proposed model, deep image semantics and shallow visual details are concatenated to form the multi-level feature representation, as sketched below.
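To make the extractor and fusion concrete, the following is a minimal sketch under stated assumptions: it follows the description above (Gram matrix of Eqs. (3)–(4), a \(1\times 1\) convolution, projection to a common dimension, element-wise multiplication, and final concatenation), but the module name GramGuidedExtractor, the dimension d, and the use of PyTorch rather than our MXNet implementation are illustrative choices, not the exact code.

```python
import torch
import torch.nn as nn

class GramGuidedExtractor(nn.Module):
    """Sketch of the Gram-matrix-guided visual representation extractor.
    The output dimension d and layer sizes are illustrative assumptions."""
    def __init__(self, in_channels, height, width, d=512):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.fc_gram = nn.Linear(in_channels * in_channels, d)      # Gram matrix -> d
        self.fc_feat = nn.Linear(in_channels * height * width, d)   # 1x1-conv features -> d

    def forward(self, feat):                                  # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        flat = feat.view(b, c, h * w)                         # vectorize each channel
        gram = torch.bmm(flat, flat.transpose(1, 2)) / (c * h * w)   # Eqs. (3)-(4)
        g = self.fc_gram(gram.view(b, -1))
        f = self.fc_feat(self.conv1x1(feat).view(b, -1))
        return g * f                                          # element-wise multiplication -> V

def fusion_layer(visual_reps, semantic):
    """Fusion layer: aggregate {V1, V2, V3, ...} and concatenate with the deep semantics."""
    shallow = torch.cat(visual_reps, dim=1)
    return torch.cat([shallow, semantic], dim=1)
```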

3.3 Loss function

Currently, publicly available datasets for image sentiment analysis are relatively limited. Moreover, because manual annotation of sentiment labels is expensive, existing affective datasets, including ArtPhoto [5], Abstract [5], Twitter I [41], and Emotion6 [27], typically contain fewer than two thousand images. You et al. [42] used the metadata provided by uploaders to weakly label images into two categories, forming the FI dataset, which is currently the large-scale dataset used to train deep learning models. Table 1 summarizes the sizes of these image emotion datasets. We find that the class proportions of these sentiment datasets are unbalanced, which leads to two problems: (1) training is inefficient, since the classifier tends to favor the majority class of the dataset; and (2) easy-to-classify samples can overwhelm training and cause the model to degenerate [43].

Table 1 Statistics of the available image affective datasets

Considering the class imbalance, we propose a novel loss function to address this issue. For ease of description, \({{\varvec{p}}}_{i}\) denotes the predicted probability of the ground-truth class of the \({\varvec{i}}\mathrm{th}\) image, and \({{\varvec{C}}}_{i}\) denotes the total number of training images in that class. Based on these definitions, the class imbalance loss function is defined as:

$$L_{cis} = - \alpha {\varvec{w}}_{i} \left( {1 - {\varvec{p}}_{i} } \right)^{\gamma } \log \left( {{\varvec{p}}_{i} } \right)$$
(5)

where \({\varvec{w}}_{i} = \frac{1/{\varvec{C}}_{i}}{\sum\nolimits_{j = 1}^{n} 1/{\varvec{C}}_{j}}\) is the category weight coefficient and \(\alpha\) is a balance factor that scales the loss value to accelerate convergence.

On the one hand, setting \(\gamma > 0\) reduces the loss contributed by easy-to-classify samples, so the model pays more attention to hard samples, improving the generalization ability of the deep model. On the other hand, \({\varvec{w}}_{i}\) gives images from under-represented categories a larger loss value, balancing the uneven sample ratio. To verify the effectiveness of the proposed loss, we compare it in Sect. 4 with similar optimization losses, including the class weight algorithm and the Class-Balanced \(softmax\) Cross-Entropy Loss (\({\text{CB}}_{softmax}\)) [44]. Specifically, the class weight algorithm multiplies the cross-entropy loss by a per-class weight, the reciprocal of the ratio of that class to the total number of training samples. The hyperparameter \(\beta\) of \({\text{CB}}_{softmax}\) is set to 0.9999.
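A minimal sketch of Eq. (5) is given below. It assumes the focal-style reading described above, with the weight indexed by each sample's ground-truth class; the function name and the example class counts are hypothetical, and the code is written in PyTorch rather than our MXNet implementation. The defaults mirror the values chosen in Sect. 4.4 (\(\gamma = 2\), \(\alpha = 10\)).

```python
import torch

def class_imbalance_loss(logits, labels, class_counts, alpha=10.0, gamma=2.0):
    """Sketch of the class-imbalance loss in Eq. (5).
    class_counts[i] is C_i, the number of training images in class i."""
    inv = 1.0 / class_counts                     # 1 / C_i
    w = inv / inv.sum()                          # category weight coefficients w_i
    probs = torch.softmax(logits, dim=1)
    p_t = probs.gather(1, labels.unsqueeze(1)).squeeze(1)   # probability of the true class
    w_t = w[labels]                              # weight of each sample's class
    loss = -alpha * w_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-12))
    return loss.mean()

# Hypothetical usage with an imbalanced binary dataset:
counts = torch.tensor([16000.0, 7000.0])
logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(class_imbalance_loss(logits, labels, counts))
```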

4 Experiments

In this section, we present our experiments and evaluate our method against the state-of-the-art approaches to validate the effectiveness of our framework for sentiment classification on different datasets.

4.1 Dataset

We evaluate our proposed method on five widely used datasets: ArtPhoto [5], Abstract [5], Twitter I [41], Emotion6 [27], and FI [42]. Among them, FI, collected by You et al., is significantly larger than the others. We therefore divide the datasets into large-scale and small-scale groups according to the number of samples; Table 1 shows the details.

ArtPhoto contains 806 art photographs from photo-sharing sites, in which the emotion is determined by the artist who uploaded the photograph. Abstract, in contrast, contains 228 peer-rated abstract paintings that convey emotion mainly through color and texture. Twitter I, collected from social networking sites and labeled by Amazon Mechanical Turk (AMT) workers into two categories (positive and negative), contains 1,269 images. We test our method on all three subsets of Twitter I: "five agree," "at least four agree," and "at least three agree," where "five agree" means that all five AMT workers gave the same sentiment label to a given image. Emotion6 was created as an emotion prediction benchmark; it contains 1,980 images collected from Flickr across six emotion categories. The authors assumed that each pixel's contribution to the emotion-inducing region is proportional to the number of annotated rectangles covering that pixel and collected 15 AMT responses per image to build the ground-truth emotion-stimuli maps.

FI is currently the largest well-labeled dataset. The original FI dataset contains 90,000 noisy images collected from Flickr and Instagram by querying eight affective keywords. The weakly labeled images were further annotated by 225 Amazon Mechanical Turk (AMT) workers selected through qualification tests, and the 23,308 images that received at least three agreeing votes from the five assigned AMT workers were retained. We convert FI into a binary dataset in the same way as Twitter I.

Figure 7 shows some example images from these datasets. Since we focus on binary sentiment prediction, we convert the multi-class emotion labels into positive and negative labels according to their valence for all datasets except Twitter I, which is already labeled with binary sentiment. Specifically, for ArtPhoto, Abstract Paintings, and FI, we binarize the labels according to [45], which treats amusement, awe, contentment, and excitement as positive sentiments and anger, disgust, fear, and sadness as negative sentiments. Emotion6 is labeled with six emotions (anger, disgust, fear, joy, sadness, and surprise) along with valence–arousal scores; anger, disgust, fear, and sadness are likewise treated as negative. Since the mean valence of the joy and surprise images is higher than that of the negative images, we treat them as positive.
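The label conversion described above reduces to a simple mapping; the sketch below spells it out (the function name is illustrative).

```python
# Mapping from the eight emotion categories of [45] to binary sentiment labels,
# as used here for ArtPhoto, Abstract, and FI.
POSITIVE = {"amusement", "awe", "contentment", "excitement"}
NEGATIVE = {"anger", "disgust", "fear", "sadness"}

def to_binary(emotion: str) -> int:
    """Return 1 for positive sentiment, 0 for negative sentiment."""
    if emotion in POSITIVE:
        return 1
    if emotion in NEGATIVE:
        return 0
    raise ValueError(f"unknown emotion label: {emotion}")

# For Emotion6, joy and surprise are additionally treated as positive (see text above).
```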

Fig. 7 Example images from popular affective datasets

4.2 Implementation details

Our experiments are implemented with the Amazon-backed open-source deep learning framework MXNet [46]. The model parameters are optimized by mini-batch gradient descent (MBGD) [47]. A momentum strategy [48] is used to accelerate convergence, with the momentum coefficient set to 0.9. Because the datasets differ in size, the batch size is 16 or 32. Weights are initialized from a uniform distribution, i.e., \(\upomega \sim {\text{U}}\left( { - 0.07,0.07} \right)\), biases are initialized to 0, and the learning rate is 0.001.

As in Refs. [36, 49], data augmentation is adopted to reduce the model's dependence on particular attributes during training and thereby improve generalization. Specifically, input images are resized to \(256 \times 256\) and randomly cropped back to \(224 \times 224\) from the center and corners, with random horizontal flips and color and tone adjustments applied with a certain probability. Finally, the features are normalized with the channel means and variances during training. All experiments are carried out on two NVIDIA RTX 2080 Ti GPUs with 64 GB of CPU memory.
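For reference, the augmentation pipeline corresponds roughly to the sketch below. Our code uses MXNet; the torchvision equivalents and the particular jitter strengths shown here are assumptions for illustration only.

```python
from torchvision import transforms

# Illustrative stand-in for the augmentation pipeline described above.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),                  # random crops from center and corners
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet channel statistics
                         std=[0.229, 0.224, 0.225]),
])
```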

4.3 Ablation studies

(1) Network architecture. Deep CNNs have shown excellent performance in many computer vision tasks and tend to become deeper and deeper. Motivated by the importance of depth, we first analyze the influence of several classic CNNs on image sentiment analysis, including AlexNet [28], VGG [50], DenseNet [51], and ResNet [36], chosen for their excellent performance on visual classification tasks and relatively consistent structures. We obtain pre-trained weights from the MXNet model zoo [46] and fine-tune the models.

Table 2 shows the performance of several deep CNN models on the large-scale FI dataset. ResNet-152 achieves the best performance, followed by ResNet-50, while AlexNet achieves the worst results because it has only five convolutional layers. Within the same network family, emotion recognition accuracy increases with depth. However, beyond a certain depth, the large number of additional parameters does not significantly improve performance; for example, ResNet with 50 layers and with 152 layers perform almost the same. Another interesting finding is that DenseNet with 121 layers does not perform well, mainly due to the limited number of convolution filters in each layer, even though it has far more convolutional layers than ResNet-50. As expected, simply increasing the number of layers does not yield a better affective representation, so more discriminative emotional cues are needed. Moreover, the VGG networks (VGG-16 and VGG-19) achieve better performance than ResNet-18, which has more convolutional layers. This may be because VGG has multiple fully connected layers that map the learned "distributed feature representation" to the sample label space.

    Table 2 Emotion classification accuracy for deep CNNs with different convolutional layers on FI dataset

For this reason, we attempt to add similar fully connected layers to the other networks. The experimental results in this table support our conjecture: the fully connected layers at the end of the network help affective image recognition, as they provide additional parameters for constructing high-level concepts for affective reasoning. Finally, we employ ResNet-50 as the backbone network, since ResNet-152 offers no obvious performance gain but brings many more parameters.

(2) Multi-level feature aggregation. We explore multi-level representations for emotion recognition from the CNN. Generally speaking, convolutional layers that produce output feature maps of the same size are treated as belonging to the same stage. For a given deep CNN, we use it as the backbone to extract high-level semantic representations, while visual feature representations, including low-level details and mid-level aesthetics, are extracted from different network stages. In this work, we first compute the Gram matrix of the shallow features and transform it into a one-dimensional vector with a fully connected layer. At the same time, a \(1 \times 1\) convolutional layer is added to reduce computation and enable cross-channel interaction and information integration; the two are then multiplied by broadcasting to form the shallow visual representation.

As shown in Table 3, adding shallow visual features further improves performance, indicating that high-level semantics alone are not sufficient to express image emotion. We also observe that combining the visual representations \(\left\{ {{\varvec{V}}_{1} ,{\varvec{V}}_{2} ,{\varvec{V}}_{3} , \ldots } \right\}\) with the backbone network yields the best accuracy, meaning that visual features extracted from the shallow layers help image semantics in emotion classification. However, obtaining visual representations from higher-level feature maps degrades performance, because deeper feature maps retain only a small portion of the low-level visual features and mostly encode image content; extracting visual details from them therefore introduces redundant information.

Table 3 Evaluation of multi-level representation on FI dataset

The fusion technique is also a critical component of our multi-level model. We compare several fusion functions, including element-wise addition, element-wise multiplication, channel pooling, and concatenation, to explore their impact on affective classification performance, as shown in Table 4. All combinations help capture information from the overall and regional views, and concatenation is the most effective because it retains all of the deep representations.

Table 4 Emotion classification accuracy of our multi-level representation model with different fusion functions
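The fusion functions compared in Table 4 can be summarized by the sketch below (illustrative; v and s denote the aggregated shallow representation and the deep semantic representation, assumed to have equal dimensions except for concatenation).

```python
import torch

def fuse(v, s, mode="concatenation"):
    """Illustrative fusion functions; v and s are (B, d) feature tensors."""
    if mode == "concatenation":
        return torch.cat([v, s], dim=1)       # retains both representations in full
    if mode == "add":
        return v + s
    if mode == "mul":
        return v * s
    if mode == "max":
        return torch.maximum(v, s)            # element-wise channel pooling (max)
    if mode == "avg":
        return (v + s) / 2                    # element-wise channel pooling (average)
    raise ValueError(mode)
```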

Furthermore, we verify the impact of the ratio between the shallow visual representation and the high-level semantic representation on emotion recognition, as shown in Fig. 8. Let \(r\) denote the ratio of the visual representation dimension to the semantic feature dimension. Note that the semantic dimension is 2048 for ResNet, 1024 for DenseNet, and 512 for VGG. ResNet and VGG achieve the best performance when \(r=1\), i.e., when the shallow representation and the deep semantics have equal dimensions, while DenseNet performs best at \(r=0.75\). This indicates that low-level visual representations can assist image semantic features in affective recognition; still, giving too much weight to shallow features introduces redundant information and hurts performance.
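In practice, \(r\) simply fixes the dimension of the aggregated shallow representation relative to the backbone output, e.g. (illustrative values only):

```python
# Setting the visual-branch dimension from the ratio r (illustrative sketch).
semantic_dim = 2048                       # ResNet-50 backbone output dimension
r = 1.0                                   # best-performing ratio for ResNet and VGG (Fig. 8)
visual_dim = int(r * semantic_dim)        # dimension of the aggregated shallow representation
fused_dim = semantic_dim + visual_dim     # classifier input size after concatenation
```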

Fig. 8 Impact of different \(r\) on the validation set of the FI dataset. We set \(r=1\) in the remaining experiments

4.4 Class imbalance and loss function

In this section, we examine how class imbalance in image sentiment datasets affects recognition accuracy. As mentioned in Sect. 3, the class imbalance issue cannot be ignored when training deep networks. Considering this challenge, a new loss function is proposed, as shown in Eq. (5). The proposed model and loss are evaluated using ResNet and VGG as backbone networks.

Figure 9 shows the impact of the balance factor \(\alpha\) and the modulation factor \(\gamma\) on the performance of the multi-level model. Note that when \(\gamma =0\), the loss function degenerates to the cross-entropy loss. As shown in the figure, training ResNet-50, VGG, and our proposed model with the proposed loss (Eq. 5) improves performance, and the proposed model achieves the best results when \(\gamma = 2\).

Fig. 9 Impact of different \(\gamma\) on model performance on the FI dataset. The dashed and solid lines correspond to the balance factor \(\alpha\) taking values 1, 10, and 20. We set \(\gamma =2\) and \(\alpha =10\) in the remaining experiments

4.5 Comparison to state of the arts

To demonstrate the effectiveness of our model, we compare it with state-of-the-art affective image recognition methods, including several hand-crafted feature algorithms and deep models.

(1) Hand-crafted features: We employ several low-level visual features extracted by Yang et al. [52], including SIFT, HOG, and GIST. ColorName uses the algorithm in [5] to count the pixels of each of 11 primary colors in the query image. SentiBank [23] is a concept detector library based on the ANP visual ontology, using 1,200-dimensional features as a mid-level representation. Zhao et al. [53] adopt principles-of-art-based affective features (PAEF) for image affective computing, a combination of features derived from artistic principles, including balance, emphasis, harmony, variety, gradation, and movement.

(2) Deep models: You et al. [41] proposed PCNN, a progressive CNN architecture that exploits large-scale weakly labeled data to improve generalization. DeepSentiBank [54] is a CNN-based classifier of visual sentiment concepts for discovering ANPs; the pre-trained DeepSentiBank is adopted to extract 2,089 ANPs as mid-level representations. We also report the performance of the deep visual features of fine-tuned CNNs on the sentiment datasets for several architectures, namely AlexNet [28], VGG [50], DenseNet [51], and ResNet-50 [36]. Yang et al. [52] proposed the affective regions detection method AR, which uses a dual-branch strategy to combine local information with holistic representations for sentiment classification. Xiong et al. [55] proposed R-CNNGSR, which automatically detects affective regions by combining low-level and emotional features and then fuses the whole image with the detected regions to predict emotion. Moreover, we employ the class weight algorithm and the Class-Balanced \(softmax\) Cross-Entropy Loss (\({\mathrm{CB}}_{softmax}\)) [44] to verify the effectiveness of the proposed loss function.

We first conduct experiments on the large-scale FI dataset and compare our model's performance with several related techniques, using four different loss functions to optimize the model. As shown in Table 5, the proposed method outperforms deep CNNs designed for object classification, such as AlexNet, VGG-16, DenseNet, and ResNet-50, indicating that it benefits from multi-level feature representation and a more suitable optimization loss. Compared with other deep models designed for affective classification, including PCNN [42], DeepSentiBank [54], and AR [52], our multi-level model achieves better performance thanks to its multi-level representation. Moreover, the proposed class imbalance loss optimizes the deep affective model better than the other loss functions.

Table 5 Comparison with state-of-the-art methods on FI dataset

To further evaluate the effectiveness of our model, we validate it on the small-scale datasets, including ArtPhoto, Abstract Paintings, Twitter I, and Emotion6, as shown in Table 6. The results show that our multi-level feature model outperforms several hand-crafted feature methods such as PAEF [53] and SentiBank [23], and it also improves upon other deep CNN models. Notably, compared to Yang [52] and Xiong [55], the proposed method achieves better performance on Abstract. This is consistent with our initial conjecture: abstract paintings rely more on the expression of visual details and aesthetics, whereas Web images lean more on image semantics, as shown in Fig. 2. Furthermore, our method also yields clear performance gains on the Twitter I and Emotion6 datasets, benefiting from the more suitable optimization loss.

Table 6 Comparison with state-of-the-art methods on small-scale datasets

Finally, time consumption is also an interesting topic. Since different fusion strategies incur different computational costs, we compare the running time of the proposed model under different fusion strategies in detail. For example, concatenation enlarges the dimension of the final fully connected layer and therefore adds parameters, whereas the other four fusion strategies do not have this problem. Figure 10 shows the running time of the proposed multi-level model compared with the backbone network. The proposed model incurs little extra time over the base network because it operates at the block level rather than throughout the hierarchy.

Fig. 10 Comparison of running time on FI. The base model is the fine-tuned model from the model zoo; ours-multi and ours-concatenation denote the proposed model with different fusion strategies

5 Conclusion

This paper proposes a multi-level representation model that learns and integrates the deep semantics and shallow visual details of images for affective recognition. The visualization of deep representations and the experimental results show that incorporating shallow features benefits image emotion classification. Moreover, our studies indicate that class imbalance degrades performance, as the dominant class overwhelms training and causes the model to degenerate. Therefore, a new loss function was introduced to optimize the deep CNN model.