Deep Transfer Learning for Image Emotion Analysis: Reducing Marginal and Joint Distribution Discrepancies Together

Image emotion analysis has attracted considerable research attention in recent years. Meanwhile, as convolutional neural networks (CNNs) have achieved great success in computer vision, many researchers have started to employ CNNs to discriminate image emotions. However, training a CNN depends on sufficient labeled data, so a CNN rarely performs well in an image domain with scant labeled information. In this paper, we propose a deep transfer learning method for image emotion analysis. The method transfers rich emotion knowledge from a source domain to the target domain by reducing both the marginal and the joint domain distribution discrepancies at the fully-connected layers. In this way, we can extract more transferable features and improve the performance of CNNs on poorly labeled emotion-image domains.


Introduction
Different visual content can evoke different human emotions, which directly influence our cognition and decisions. Therefore, more researchers have started to investigate and interpret the human emotion conveyed by image content [30]. Most conventional methods design manually crafted features based on art and psychology theory and then recognize human emotions by discriminating these features [8,17,19,37].
Deep learning has developed significantly in recent years, and the performance of convolutional neural networks (CNNs) on many computer vision tasks is now comparable to that of humans. Meanwhile, large-scale image datasets boost CNN-based feature learning. For example, a CNN pre-trained on ImageNet can extract more representative features for general visual tasks. In the image emotion analysis field, studies have shown that CNN-based features are more discriminative than traditional manually crafted features [29].
Yuwei He (hyw16@mails.tsinghua.edu.cn), Guiguang Ding (dinggg@tsinghua.edu.cn). Tsinghua University School of Life Sciences, Beijing, People's Republic of China.
However, CNNs still have limitations in image emotion analysis. Firstly, training a CNN depends on massive labeled data, but in many emotion-image domains the number of labeled images is limited and manually labeling them is prohibitively expensive [18]. Moreover, the scalability of CNNs is limited because different emotion-image domains exhibit different image styles, which leads to different domain distributions. Therefore, even if a CNN performs well in one emotion-image domain, it may not achieve comparable performance in another.
Transfer learning aims at transferring information from a richly labeled source domain to a poorly labeled target domain [26]. The key technical problem is how to reduce the distribution discrepancy between the two domains. Recently, deep transfer learning methods have been widely applied in computer vision [14,15,24,25]. One important reason is that a deep model can learn more domain-invariant features [29]. As deep models tend to learn domain-specific features at the top layers, the main bottleneck of deep transfer learning methods is reducing the shift between the two domain distributions at these layers.
In order to generalize CNNs to different emotion-image domains, in this paper we design a novel deep transfer learning method to improve CNN-based emotion classifiers on small-scale image domains. Its advantages for image emotion analysis are as follows: 1) Our method requires the two domains to share the same CNN. As different emotion-image domains contain similar elements at the pixel level, sharing the same CNN allows it to learn higher-quality image features at the first layers. 2) We consider both the marginal distribution discrepancy at the same layers [14] and the joint distribution discrepancy across different layers [16]. The layers in deep models are trained jointly, so we should consider not only the marginal distribution $P(Z^l)$ of one layer, but also the joint distribution $P(Z^1, \ldots, Z^l)$ of several layers. A proper trade-off between the two discrepancies can advance transferability between two domains.

Related Work
Psychological research shows that humans experience different emotions in response to different visual content [9,11]. With the development of social networks, more and more people upload their images, which increases the number of images available for research. As a result, researchers have shifted their attention from psychological analysis to image emotion analysis. Some works even extend the analysis from dominant emotion to personalized emotion [32,34,35]. Traditional methods classify the emotion contained in images based on low-level crafted features [8,17,19,37]. For example, Machajdik et al. [18] designed 8 kinds of emotion-related features. Zhao et al. [31] proposed principles-of-art based features for discriminating emotions.
Recently, deep learning has developed rapidly [6,13] and convolutional neural networks (CNNs) are widely applied in computer vision [5,10,22]. One important reason is that the appearance of large-scale datasets, such as ImageNet [1], boosts the feature learning of CNNs. In visual emotion analysis, You et al. [28] utilized weakly labeled images to train a CNN and learned a binary image emotion classifier. They then built a large-scale dataset for image emotion analysis [29], on which the CNN-based emotion classifier outperformed classifiers based on manually crafted features [29]. However, training a CNN requires massive labeled data, which many emotion-image domains lack. Although some methods were designed to ease the problem, such as generating images similar to those of the target domain [33,36], the generation procedure is cumbersome and the quality of the generated images cannot be guaranteed.
In this paper, we aim at alleviating the data scarcity problem with transfer learning. Transfer learning focuses on transferring knowledge from a source domain with rich label information to the target domain [26]. Traditional transfer learning methods learn domain-invariant models based on shallow features [2,7,20]. Recent studies have demonstrated that deep models can learn more transferable features between two domains [27]. For example, when a CNN extracts features from different image domains, the first-layer features all tend to resemble Gabor filters or color blobs.
However, as CNNs tend to learn domain-specific features at the top layers, the distributions of different domains exhibit relatively large discrepancies at these layers. Therefore, many researchers add specific transfer modules to reduce the discrepancies in a layer-wise way [14,15,24]. These methods improve the effect of deep transfer learning, but they do not account for the dependencies between layers. Long et al. [16] proposed the joint adaptation network (JAN), which was the first to consider the joint distribution of all the top fully-connected layers.

Maximum Mean Discrepancy
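Maximum Mean Discrepancy (MMD) is used to test whether two distributions $P(X^s)$ and $Q(X^t)$ are the same [4]. Its null hypothesis is $P = Q$. It is now commonly used to measure the similarity of two distributions, and it is defined as:
$$D(P, Q) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x^s \sim P}[f(x^s)] - \mathbb{E}_{x^t \sim Q}[f(x^t)] \right),$$
where $\mathcal{F}$ is a set of functions.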

Reproducing Kernel Hilbert Space
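MMD can be represented as a distance in a reproducing kernel Hilbert space (RKHS) [4]. Whereas a Euclidean space $V$ is a finite-dimensional vector space, a Hilbert space $\mathcal{H}$ is typically viewed as an infinite-dimensional function space with an orthogonal basis [3,21]. Let $\phi(x)$ denote the feature map of $x$ in $\mathcal{H}$. For any $f \in \mathcal{H}$ we find that:
$$f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}}.$$
This is the reproducing property of $\mathcal{H}$. We denote $\mathbb{E}_{x \sim P}[\phi(x)]$ and $\mathbb{E}_{x \sim Q}[\phi(x)]$ as $\mu_x(P)$ and $\mu_x(Q)$ respectively [4]. Now MMD can be presented as:
$$D(P, Q) = \sup_{f} \langle f, \mu_x(P) - \mu_x(Q) \rangle_{\mathcal{H}}.$$
If we only select $f$ which satisfies $\|f\|_{\mathcal{H}} = 1$, $D(P, Q)$ can be calculated as:
$$D(P, Q) = \left\| \mu_x(P) - \mu_x(Q) \right\|_{\mathcal{H}}.$$
Now we can define a kernel function $k(x, y)$ to replace $\langle \phi(x), \phi(y) \rangle$. The kernel function need not be a plain scalar product; other choices such as the Gaussian kernel are also possible. This method is widely employed in tasks such as density estimation and two-sample testing [4,23]. Given $n_s$ and $n_t$ instances sampled from $P$ and $Q$ respectively, the kernel embeddings are calculated by:
$$\hat{\mu}_x(P) = \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x_i^s), \qquad \hat{\mu}_x(Q) = \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x_j^t).$$
As $k(x, \cdot) = \phi(x)$, $\mu_x(P)$ and $\mu_x(Q)$ are called kernel embeddings here. Now MMD can be estimated as the distance between the two kernel embeddings:
$$\hat{D}(P, Q) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x_i^s) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x_j^t) \right\|_{\mathcal{H}}.$$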
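To make the empirical estimate concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a Gaussian kernel with a hypothetical bandwidth `sigma` and computes the squared distance between the two kernel embeddings, which expands via the kernel trick into three averaged kernel sums (source-source, target-target, and source-target). The function names and the toy samples are illustrative assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs: Sequence[Vector], xt: Sequence[Vector],
                sigma: float = 1.0) -> float:
    """Biased empirical estimate of the squared MMD between two samples."""
    n_s, n_t = len(xs), len(xt)
    # <mu(P), mu(P)>, <mu(Q), mu(Q)> and <mu(P), mu(Q)> written with kernels.
    k_ss = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / (n_s * n_s)
    k_tt = sum(gaussian_kernel(a, b, sigma) for a in xt for b in xt) / (n_t * n_t)
    k_st = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xt) / (n_s * n_t)
    return k_ss + k_tt - 2.0 * k_st


# Toy 2-D samples (assumed data): one target close to the source, one far away.
source = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.2]]
target_near = [[0.05, 0.0], [0.0, 0.1], [-0.1, 0.1]]
target_far = [[3.0, 3.0], [3.1, 2.9], [2.9, 3.2]]

print("near:", mmd_squared(source, target_near))
print("far: ", mmd_squared(source, target_far))
```

In this toy setting the estimate for the nearby target sample is smaller than for the distant one, which is the behaviour a discrepancy measure should show. The bandwidth strongly affects the estimate, and this sketch leaves its selection open.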

Transfer Learning for Image Emotion Analysis
Given a source emotion-image domain $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and a target emotion-image domain $D_t = \{(x_j^t, y_j^t)\}_{j=1}^{n_t}$, where $n_s \gg n_t$, our task is to employ a transfer learning method to optimize a CNN with $D_s$ and $D_t$ and improve its classification performance in $D_t$. Specifically, the method reduces the domain distribution discrepancy at the fully-connected layers while training the CNN with $D_s$ and $D_t$ simultaneously.
We choose a CNN as the base transfer learning model for two reasons: (1) Compared with conventional manually crafted features, features extracted by CNNs are more suitable for image emotion analysis; (2) Recent studies show that CNNs can learn more transferable image features at the first layers.
The CNN is trained with a classification objective in which $J$ is a cross-entropy loss function. Intuitively, if we hope to utilize $D_s$ to improve the performance of a CNN on $D_t$, we can employ both $D_s$ and $D_t$ to train the same CNN together. However, in the image emotion analysis field, there is always a discrepancy between the domain distributions $P(X^s)$ and $Q(X^t)$. Meanwhile, image features transition from general to domain-specific along a CNN, which means that transferability decreases at the fully-connected (FC) layers. Our transfer learning method minimizes the domain shift at the FC layers from two perspectives: (1) reducing the marginal distribution discrepancy between $P(Z^{s,i})$ and $Q(Z^{t,i})$, $i \in G$, in a layer-wise way; (2) reducing the joint distribution discrepancy between $P(Z^{s,1}, \ldots, Z^{s,|G|})$ and $Q(Z^{t,1}, \ldots, Z^{t,|G|})$. Here $\{Z^{s,i}\}_{i \in G}$ and $\{Z^{t,i}\}_{i \in G}$ are the features at the FC layers, and $G$ is the set of fully-connected layers selected to be aligned for the joint distribution. Usually, $G$ contains all the fully-connected layers of the CNN.

Joint Maximum Mean Discrepancy
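To decrease the joint distribution discrepancy of two domains, Long et al. [16] designed a measure of joint distribution discrepancy analogous to MMD, called Joint Maximum Mean Discrepancy (JMMD). JMMD is defined as:
$$D_G(P, Q) = \left\| C_{Z^{s,1:|G|}}(P) - C_{Z^{t,1:|G|}}(Q) \right\|^2_{\otimes_{l=1}^{|G|} \mathcal{H}^l},$$
where $C_{Z^{s,1:|G|}}(P)$ and $C_{Z^{t,1:|G|}}(Q)$ are the joint feature embeddings in the Hilbert space:
$$C_{Z^{*,1:|G|}} = \mathbb{E}_{Z^{*,1:|G|}} \left[ \otimes_{l=1}^{|G|} \phi^l(z^{*,l}) \right],$$
where $* \in \{s, t\}$. If we make use of the kernel trick, $D_G(P, Q)$ can be estimated as:
$$\hat{D}_G(P, Q) = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \prod_{l=1}^{|G|} k^l(z_i^{s,l}, z_j^{s,l}) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \prod_{l=1}^{|G|} k^l(z_i^{t,l}, z_j^{t,l}) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \prod_{l=1}^{|G|} k^l(z_i^{s,l}, z_j^{t,l}).$$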
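The kernel estimate of JMMD can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than the implementation of [16]: it uses a Gaussian kernel per layer with hypothetical bandwidths `sigmas`, and it stores features as `layers[l][i]`, the layer-`l` activation of sample `i`. The toy data are chosen so that each layer has identical marginal samples in both domains while the pairing across layers differs, which is the situation a layer-wise marginal measure cannot detect.

```python
import math
from typing import Sequence

Vector = Sequence[float]
LayerFeatures = Sequence[Sequence[Vector]]  # indexed as [layer][sample]


def gaussian_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def joint_kernel(za: LayerFeatures, i: int, zb: LayerFeatures, j: int,
                 sigmas: Sequence[float]) -> float:
    """Product over layers of the per-layer kernels for samples i and j."""
    prod = 1.0
    for feats_a, feats_b, sigma in zip(za, zb, sigmas):
        prod *= gaussian_kernel(feats_a[i], feats_b[j], sigma)
    return prod


def jmmd_squared(zs: LayerFeatures, zt: LayerFeatures,
                 sigmas: Sequence[float]) -> float:
    """Biased empirical JMMD estimate over the given layers."""
    n_s, n_t = len(zs[0]), len(zt[0])
    k_ss = sum(joint_kernel(zs, i, zs, j, sigmas)
               for i in range(n_s) for j in range(n_s)) / (n_s * n_s)
    k_tt = sum(joint_kernel(zt, i, zt, j, sigmas)
               for i in range(n_t) for j in range(n_t)) / (n_t * n_t)
    k_st = sum(joint_kernel(zs, i, zt, j, sigmas)
               for i in range(n_s) for j in range(n_t)) / (n_s * n_t)
    return k_ss + k_tt - 2.0 * k_st


# Toy features for two layers (assumed data). Per layer, both domains contain
# the values {0, 1}; only the pairing across layers differs.
zs = [[[0.0], [1.0]],   # layer 1, source samples 0 and 1
      [[0.0], [1.0]]]   # layer 2, source samples 0 and 1
zt = [[[0.0], [1.0]],   # layer 1, target samples 0 and 1
      [[1.0], [0.0]]]   # layer 2, target samples 0 and 1

print("JMMD:", jmmd_squared(zs, zt, sigmas=[1.0, 1.0]))
```

Here the joint estimate is strictly positive even though the per-layer samples coincide, which illustrates why the joint discrepancy carries information beyond the marginal one.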

Deep Transfer Learning Model
We integrate both MMD and JMMD into the FC layers of the CNN, where MMD measures the marginal discrepancy and JMMD measures the joint discrepancy between the two domains. The optimization process minimizes the MMD and JMMD of the fully-connected layers while fine-tuning the CNN with $D_s$ and $D_t$. The loss function is as follows:
$$L = L_s + L_t + \lambda D_G(P, Q) + \eta \sum_{i} D_i(P, Q),$$
where $L_s$ and $L_t$ are the classification loss functions for $D_s$ and $D_t$, i.e., the cross-entropy loss $J$ averaged over the labeled samples of each domain, $D_G(P, Q)$ is the JMMD loss, and $D_i(P, Q)$ is the MMD loss at the $i$-th FC layer. $\lambda$ and $\eta$ are two trade-off parameters. The overall architecture is shown in Fig. 1.
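One way to read the objective described above is as a weighted sum of the two classification losses and the two discrepancy terms. The sketch below assumes exactly that additive form, with $\lambda$ weighting JMMD and $\eta$ weighting the summed layer-wise MMD terms; the function names, the toy probabilities, and the toy discrepancy values are hypothetical, and the defaults 0.2 and 0.3 are the values quoted in the experiments.

```python
import math
from typing import Sequence


def mean_cross_entropy(probs: Sequence[Sequence[float]],
                       labels: Sequence[int]) -> float:
    """Average cross-entropy over a batch of predicted class probabilities."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def total_loss(loss_s: float, loss_t: float, jmmd: float,
               layer_mmds: Sequence[float],
               lam: float = 0.2, eta: float = 0.3) -> float:
    """Assumed objective: L_s + L_t + lam * JMMD + eta * sum of layer MMDs."""
    return loss_s + loss_t + lam * jmmd + eta * sum(layer_mmds)


# Toy softmax outputs and labels for a source batch and a target batch.
probs_s = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels_s = [0, 1]
probs_t = [[0.5, 0.3, 0.2]]
labels_t = [0]

loss_s = mean_cross_entropy(probs_s, labels_s)
loss_t = mean_cross_entropy(probs_t, labels_t)

# Toy discrepancy values; in practice they would come from MMD/JMMD estimates
# computed on the FC-layer activations of the two batches.
objective = total_loss(loss_s, loss_t, jmmd=0.15, layer_mmds=[0.05])
print("objective:", objective)
```

In a real training loop each term would be computed on mini-batches by a differentiable framework so that the gradient of the whole sum can update the shared CNN; this sketch only shows how the terms are combined.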

Experiment
The experiments focus on the image emotion classification problem. Their purpose is to evaluate whether our transfer learning method can better generalize a CNN trained in a large-scale emotion-image domain to another small-scale one.
Datasets. FI [29] contains 22700 emotion images in 8 categories. The images were collected from Flickr and Instagram using 8 emotion keywords and then labeled using Amazon Mechanical Turk (AMT). ArtPhoto [18] consists of 806 photos from professional artists. The labels of the photos are provided by the image owners.
IAPS-Subset is a subset of the International Affective Picture System (IAPS) [12]. This dataset and ArtPhoto share the same 8 categories as FI. Table 1 shows the statistics of the two datasets. For ArtPhoto and IAPS-Subset, the table shows that the numbers of images per category are imbalanced and that the totals are both much smaller than that of FI. Therefore, we take FI as the source domain whenever it is included in a task.
Based on the 3 datasets, we construct 4 cross-domain emotion-image classification tasks: F → A, F → I, I → A, and A → I. For example, F → A means taking FI as the source domain and ArtPhoto as the target domain.
We randomly split the target-domain data into training, validation, and test sets with fractions of 80%, 5%, and 15%. We perform 5-fold cross-validation to obtain the results.
Architecture. We choose ResNet50 [1] as the base CNN. A fully-connected layer and a softmax layer are added after the convolutional layers. We fine-tune the whole network in an end-to-end way. We measure the marginal distribution discrepancy at the FC layer with MMD and the joint discrepancy of the FC layer and the softmax layer with JMMD. The λ and η in Eq. 12 are 0.2 and 0.3 respectively.
Baselines. CTD [29]: the CNN model is fine-tuned only with labeled data in the target domain. This is the basic method used for image emotion classification.
The classification accuracies are reported in Table 2, where the accuracies in bold are the highest ones in their corresponding tasks. Table 2 reveals the following observations: (1) CBD outperforms CTD on most tasks, which demonstrates the transferability of CNNs; (2) The deep transfer learning methods perform better than CBD. This validates that integrating transfer modules into a CNN helps it learn more transferable features; (3) Our method outperforms DAN and JAN in most cases, which demonstrates that reducing the marginal and joint distribution discrepancies together can further improve the transferability of the CNN; (4) On task F → A, CTD performs best, which shows that the degree of domain shift influences the feasibility of transfer learning. When the domain shift is large, the information transferred from the source domain may act as noise.
Figures 2 and 3 show the accuracy of each emotion category on tasks F → I and I → A. We do not report the result for the emotion anger because its data are scant. The results reveal that the CNN classifiers with transfer modules consistently outperform the conventional CNN classifier. Furthermore, DAN and JAN outperform our method on some categories, which suggests that the most suitable ratio between the marginal and joint distribution discrepancies differs across categories. On most categories, however, our method performs the best. Therefore, considering the two discrepancies together is necessary.
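The random 80%/5%/15% split of the target-domain data mentioned in the experimental setup can be sketched as follows. This is only an illustration: the function name, the seed handling, and the rounding rule are assumptions, and the text does not specify how the split is combined with the 5-fold cross-validation, so that part is left out.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_target(items: Sequence[T],
                 fractions: Tuple[float, float, float] = (0.80, 0.05, 0.15),
                 seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and cut them into training, validation and test lists."""
    rng = random.Random(seed)  # local generator keeps the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder forms the test set
    return train, val, test


# Example with 806 placeholder image ids, the size reported for ArtPhoto.
image_ids = list(range(806))
train, val, test = split_target(image_ids, seed=1)
print(len(train), len(val), len(test))
```

Assigning the remainder to the test set guarantees that every image lands in exactly one subset even when the fractions do not divide the dataset size evenly.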

Parameter Analysis
We now examine the sensitivity to the proportion between the JMMD parameter λ and the MMD parameter η in Eq. 12. The value of η varies in {0, 0.1, 0.2, 0.3, 0.4, 0.5} and λ = 1 − η. The results are shown in Fig. 4. The curves are bell-shaped, which supports our motivation that a proper trade-off between the marginal and joint distribution discrepancies can advance the transferability of CNNs.
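The sweep described above can be written as a small grid search. The sketch below builds the (λ, η) pairs with λ = 1 − η and selects the pair with the highest score from a caller-supplied evaluation function. The names `tradeoff_grid`, `best_tradeoff`, and `toy_accuracy` are hypothetical, and `toy_accuracy` is a made-up bell-shaped stand-in for training and evaluating the model, not a reported result.

```python
from typing import Callable, List, Sequence, Tuple

ETAS: Tuple[float, ...] = (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)


def tradeoff_grid(etas: Sequence[float] = ETAS) -> List[Tuple[float, float]]:
    """Return (lambda, eta) pairs with lambda = 1 - eta."""
    # Rounding removes floating-point noise such as 0.30000000000000004.
    return [(round(1.0 - eta, 10), eta) for eta in etas]


def best_tradeoff(evaluate: Callable[[float, float], float],
                  etas: Sequence[float] = ETAS) -> Tuple[float, float]:
    """Pick the (lambda, eta) pair that maximizes the evaluation score."""
    return max(tradeoff_grid(etas), key=lambda pair: evaluate(*pair))


def toy_accuracy(lam: float, eta: float) -> float:
    """Placeholder score peaking at eta = 0.3; stands in for a real run."""
    return 0.6 - (eta - 0.3) ** 2


for lam, eta in tradeoff_grid():
    print(f"lambda={lam:.1f} eta={eta:.1f} score={toy_accuracy(lam, eta):.3f}")

best_lam, best_eta = best_tradeoff(toy_accuracy)
print("best:", best_lam, best_eta)
```

In practice `evaluate` would fine-tune the network with the given weights and return the validation accuracy, so each grid point corresponds to one full training run.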

Conclusion
In this paper, we propose a deep transfer learning method for image emotion analysis. Our purpose is to improve the classification performance of CNNs on a small-scale emotion-image domain by transferring label information from another large-scale one. During the transfer process, we decrease the marginal and the joint distribution discrepancies together. The experimental results demonstrate the promise of our method for discriminating image emotions. In future work, we will explore how to transfer information from features based on art and psychological theory to CNN-based features.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.