Learning to Compose Diversified Prompts for Image Emotion Classification

Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained vision-language models. Although CLIP has recently shown its superior power on a wide range of downstream vision-language tasks such as Visual Question Answering, it remains underexplored for Image Emotion Classification (IEC). Adapting CLIP to the IEC task poses three significant challenges: a tremendous training objective gap between pretraining and IEC, suboptimal prompts, and invariant prompts shared by all instances. In this paper, we propose a general framework that shows how CLIP can be effectively applied to IEC. We first introduce a prompt tuning method that mimics the pretraining objective of CLIP and thus can leverage the rich image and text semantics entailed in CLIP. Then we automatically compose instance-specific prompts by conditioning them on the categories and image contents of instances, diversifying prompts and avoiding suboptimal problems. Evaluations on six widely used affective datasets demonstrate that our proposed method outperforms the state-of-the-art methods by a large margin (e.g., up to 9.29% accuracy gain on the EmotionROI dataset) on IEC tasks, with only a few parameters trained. Our code will be publicly available for research purposes.


Introduction
Image Emotion Classification (IEC) aims to recognize the emotions evoked by images. Previous methods approach this challenging but essential task by first loading a backbone initialized on datasets with fixed label sets (e.g., ImageNet), then designing various architectures, gating mechanisms, or attention mechanisms to compose discriminative emotion features. Benefiting from the powerful feature composition ability of deep learning, these methods have achieved great success [Zhao et al., 2022].
Recently, vision-language pre-training such as CLIP has emerged as a promising alternative [Radford et al., 2021]. The main idea is to align images and raw text using two separate encoders. Compared to traditional vision-only pretraining methods, the large-scale, easily accessible training data and diverse natural language descriptions enable CLIP to learn more fine-grained open-set visual concepts. Benefiting from this broader range of visual concepts, CLIP has strong generalization ability, and more recent work has shown that it can be readily transferred to downstream vision-language tasks with greatly improved performance. Adapting these recent techniques to IEC can be highly valuable: human emotion is highly abstract, and figuring out the emotions carried in images requires a deep understanding of various details and concepts, whereas traditional vision pretraining methods can access only a fixed set of concepts [Zhao et al., 2021].
However, adapting CLIP to IEC has three significant challenges. (1) Tremendous gap between pretraining and IEC: Unlike vision-language tasks, IEC has only image data during model training (as depicted in (a) and (b) of Figure 1), which is dramatically different from the CLIP training scenario and makes it difficult to effectively utilize the rich knowledge entailed in CLIP. (2) Suboptimal prompts: CLIP proposed manually designing text prompts to transfer knowledge to downstream tasks. However, we observe that a slight change in wording can make a massive difference in performance, as illustrated in (c) and (d) of Figure 1. (3) Shared invariant prompts: Some work in NLP treats prompts as sequences of virtual tokens [Li and Liang, 2021] and learns prompts automatically by parameterizing these tokens [Lester et al., 2021], avoiding the suboptimal problem to some extent [Liu et al., 2021]. However, these methods employ shared prompts across all instances, regardless of the fact that instances of different categories share similar features while also having their own distinct characteristics. As shown in (e) of Figure 1, although the left picture and the middle one are from different categories, there are some associations between them, e.g., the color. Conversely, although the middle and the right ones come from the same category, they have their own peculiarities.
To tackle the above three challenges, we propose a novel Prompt Tuning method with Diversified Prompt Composition (PT-DPC) based on CLIP, which learns to compose unique prompts for each image. Specifically, we treat prompts as a sequence of tunable virtual tokens and obtain text representations by feeding them to the text encoder of CLIP. These virtual tokens are trained end-to-end with the CLIP weights fixed, and can condense the signal from a fully labeled dataset. We further condition these virtual tokens on the classes and image contents of instances. More specifically, we employ different virtual tokens for each class to obtain class-specific prompts. Then we integrate image contents with all class-specific prompts to compose a diversified prompt, forming an instance-specific prompt and capturing associations between possible classes.
We evaluate our model on six widely used image emotion classification benchmarks, namely, FI 8, EmotionROI 6, EmotionROI 2, FI 6, FI 2, Twitter I, Twitter II. Experimental results show that our model outperforms state-of-the-art methods by a large margin. For example, our method achieves a 9.29% absolute accuracy gain on the EmotionROI 6 dataset.
The contributions of this work are summarized as follows:
• We propose a novel prompt tuning method, PT-DPC, addressing three challenges of adapting the CLIP model to the IEC task. To the best of our knowledge, this is the first work to introduce a prompt tuning method for image emotion classification tasks.
• To avoid suboptimal problems for the fixed prompt tuning of CLIP, we propose a diversified prompt composition by utilizing both image contents and all class-specific virtual tokens.
• The experimental results on six popular affective image datasets demonstrate that our proposed framework can outperform the state-of-the-art methods for emotion classification.

Related Work
In this section, we review related work, including emotion representation models, image emotion classification, large-scale pre-trained models, and prompt tuning methods in NLP.

Emotion Representation Models
In psychology, emotion is mainly measured by two representative models: dimensional emotion space (DES) and categorical emotion states (CES). DES models represent emotions in a continuous 2D, 3D, or higher-dimensional Cartesian space, such as valence-arousal (VA) [Hanjalic, 2006] and valence-arousal-dominance (VAD) [Gunes and Schuller, 2013]. VAD is the most popular DES model, where valence represents pleasantness ranging from negative to positive, arousal represents the intensity of emotion ranging from calm to excited, and dominance represents the degree of control ranging from in control to controlled. In practice, dominance is difficult to measure and is often omitted, leading to the commonly used two-dimensional VA space. Theoretically, every emotion can be represented as a coordinate point in the Cartesian space. [Zhao et al., 2016; Balouchian et al., 2019] based their research on this representation model and made impressive contributions. However, the absolute continuous values are difficult for users to distinguish, which constrains the use of DES models.
In contrast, although a limited set of emotion categories cannot fully reflect the complexity and subtlety of emotions, CES models are easy for users to understand, explain, and annotate [Zhao et al., 2021]. CES models classify emotions into a few basic categories. The simplest CES model is the binary sentiment model, which includes only negative and positive. In many cases, "emotion" is called "sentiment", which sometimes also includes neutral. Since sentiment is too coarse-grained, some relatively fine-grained emotion models have been proposed, such as Mikels' emotion wheel model (amusement, anger, awe, contentment, disgust, excitement, fear, and sadness) [Mikels et al., 2005] and Ekman's emotion model (anger, disgust, fear, happiness, sadness, surprise) [Ekman, 1992]. In this paper, we propose an image emotion classification method that represents emotion using the CES model.

Image Emotion Classification
Image emotion classification is usually formulated as an emotion feature extraction problem, which represents the emotion of an image in a CES model. Learning discriminative emotion features facilitates classification performance. To approach this task, in the earlier years [Machajdik and Hanbury, 2010; Zhao et al., 2014] designed many types of hand-crafted representations to bridge the affective gap between low-level features and abstract emotions.
With the blooming of convolutional neural networks (CNNs) on different tasks, researchers mainly designed CNN-based architectures [Krizhevsky et al., 2012; He et al., 2016], pretrained on a fixed set of labels, to boost classification performance [You et al., 2015; Yang et al., 2018a; Deng et al., 2021].
Because image emotion is abstract, it is not easy to obtain sufficiently discriminative features from the image itself. To boost classification performance, a few efforts turn to enriching feature representations by incorporating external knowledge, such as building a well-designed sentiment dictionary [Borth et al., 2013; Chen et al., 2014; Wu et al., 2021] or introducing different kinds of dataset-specific information [Yang et al., 2017; Rao et al., 2019; Zhang and Xu, 2020].

Large-scale Pre-trained Models
In recent years, deep neural networks, such as CNNs and attention neural networks, have been widely applied to various artificial intelligence tasks [Jaderberg et al., 2015; Wang et al., 2017]. Neural models can automatically learn representations from data, thereby getting rid of complex feature engineering. However, they are prone to overfitting and generalize poorly when sufficient training data is lacking [Xu et al., 2020]. Moreover, it is expensive and time-consuming to manually annotate large-scale data for complex tasks. Thus, how to train effective deep neural models for specific tasks with limited human-annotated data has become a critical research issue [Han et al., 2021].
To tackle these problems, massive efforts have been devoted to manually constructing high-quality datasets [Deng et al., 2009; Bojar et al., 2014], which triggered a wave of transfer learning [Pan and Yang, 2009; Thrun and Pratt, 2012]. Transfer learning formalizes a two-phase learning framework: a pre-training phase to capture knowledge from one or more source tasks, and a fine-tuning phase to transfer the captured knowledge to target tasks. Owing to the wealth of knowledge obtained in the pre-training phase, the fine-tuning phase enables models to handle target tasks well with limited samples [Han et al., 2021].
Many works exploring pre-trained models (PTMs) have been applied to artificial intelligence research, especially in CV and NLP. A series of CNNs have been successfully used for almost all CV tasks, such as image classification [He et al., 2016] and object detection [Redmon et al., 2016]. Researchers from the NLP community have developed deeper PTMs, proposing a series of powerful transformer architectures to capture the semantic meaning of text, such as BERT [Devlin et al., 2018] and GPT [Brown et al., 2020].

Prompt Tuning Method in NLP
Nowadays, many downstream applications have achieved significant improvements by finetuning on top of pre-trained models. However, due to the large scale of model parameters, finetuning brings a large computational and storage burden. Instead of finetuning the full model, prompt tuning methods transfer the knowledge entailed in a large pre-trained model by designing a textual prompt that reformulates downstream tasks to look more like pretraining tasks. This reduces the gap between pretraining and downstream tasks, so that the knowledge entailed in pretrained models can be transferred to downstream tasks easily with only a few annotated examples. Therefore, how to obtain text prompts becomes a vital issue. Currently, prompt tuning methods can be roughly divided into two categories: manually crafted and automatically learned. While manually crafting prompts [Brown et al., 2020; Radford et al., 2021] is intuitive, creating and experimenting with these prompts takes time and experience, and even experienced prompt designers may fail to manually discover optimal prompts [Jiang et al., 2020]. To automate prompt engineering, [Li and Liang, 2021; Lester et al., 2021; Zhou et al., 2021] parameterized the prompts by treating them as virtual tokens and performing prompting directly in the embedding space.
Our proposed model belongs to the second line of work. Instead of using a shared prompt for all instances, we condition prompts on the classes and contents of instances. This diversifies the prompts, models associations between possible classes, and fits the pretraining scenario better.

Problem Definition
A CLIP model carries two separate encoders, an image encoder $M_{img}$ and a text encoder $M_{txt}$. The preprocessors of the image and text inputs are $E_I$ and $E_T$, respectively. Normally, when a CLIP model is deployed on an image classification task, it is given an image $x$ with its corresponding label $y \in Y$ as input, where $Y = \{y^{(1)}, y^{(2)}, ..., y^{(C)}\}$ includes all $C$ categories of the dataset to which $x$ belongs. $f_{img}(E_I(x))$ is the feature representation of $x$ obtained by $M_{img}$, and the text features $f_{txt}(p_t^{(i)})$ are obtained by $M_{txt}$, where $p_t^{(i)}$ is the prompt $p_t$ instantiated for class $y^{(i)}$, formulated by concatenating a series of tokens and the class embedding $E_T(y^{(i)})$. $S$ denotes the cosine similarity, computed as in Eq. 1.
$$S(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|} \quad (1)$$
Then the classification result $y_{pred}$ is formed as in Eq. 2.
$$y_{pred} = \arg\max_{y^{(i)} \in Y} S\big(f_{img}(E_I(x)),\, f_{txt}(p_t^{(i)})\big) \quad (2)$$
We aim to find a text prompt $p_t$ that maximizes the likelihood $P(y_{pred} = y \mid p_t)$.
Generally, $p_t$ is manually crafted, which may lead to suboptimal prompts. Inspired by [Li and Liang, 2021], we parameterize $p_t$ by $\theta$, which can be updated, thereby avoiding the suboptimal problem.
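The decision rule in this section (score each class by the cosine similarity between the image feature and that class's text feature, then pick the highest-scoring class) can be illustrated with a minimal sketch. The feature vectors below are made-up toy values rather than real CLIP outputs, and the function names `cosine_similarity` and `classify` are hypothetical helpers introduced only for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity S(a, b) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(image_feature, text_features):
    """Return (index of the most similar class, list of all similarity scores)."""
    scores = [cosine_similarity(image_feature, t) for t in text_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy stand-ins for the image feature and the C = 3 per-class text features.
image_feature = [1.0, 0.0, 1.0]
text_features = [
    [0.0, 1.0, 0.0],  # class 0: orthogonal to the image feature
    [2.0, 0.0, 2.0],  # class 1: same direction as the image feature
    [1.0, 1.0, 0.0],  # class 2: partially aligned
]
predicted, scores = classify(image_feature, text_features)
print(predicted, scores)
```

Because cosine similarity ignores vector length, class 1 wins here even though its text feature has a larger norm than the image feature.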

The task then becomes training the parameters of the class-specific prompts $p_c$ to maximize the likelihood $P(y_{pred} = y \mid p_c)$. However, as shown in part (e) of Figure 1, the contents of different affective images differ significantly even when they belong to the same class, so the prompts should not be the same for different images. Thus we propose a diversified prompt composition method that utilizes both instance-specific and class-specific features, as shown in Figure 3.

Diversified Prompt Composition
The initial parameters are the same for different categories. The effect of different initialization methods is discussed in Section 4.4. After the class-specific prompts $p_c$ are initialized, we calculate the cosine similarity between the output feature of the image and each virtual token, and weight each virtual token by its similarity score. We then obtain the diversified prompt $p_d$ as in Eq. 3, where $p_d^{(j)}$ is the $j$-th virtual token of the diversified prompt $p_d$.
After the diversified prompt $p_d$ is concatenated with the class embedding of each $y^{(i)}$ to compose the full diversified prompts, the inputs of the text encoder are finalized.
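One possible reading of the composition step is sketched below. It assumes that, for each token position j, the similarity-weighted virtual tokens of all classes are summed without further normalization; the text only states that each virtual token is weighted by its similarity to the image feature, so the aggregation across classes is an assumption of this sketch. The helper name `compose_diversified_prompt` and all numeric values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def compose_diversified_prompt(image_feature, class_prompts):
    """class_prompts[i][j] is the j-th virtual token (a vector) of class i.

    Returns L tokens, where token j is assumed to be
    sum_i S(image_feature, class_prompts[i][j]) * class_prompts[i][j].
    """
    num_tokens = len(class_prompts[0])
    dim = len(image_feature)
    diversified = []
    for j in range(num_tokens):
        token = [0.0] * dim
        for prompt in class_prompts:
            weight = cosine_similarity(image_feature, prompt[j])
            for d in range(dim):
                token[d] += weight * prompt[j][d]
        diversified.append(token)
    return diversified

# Toy setting: C = 2 classes, L = 2 virtual tokens, feature dimension 2.
image_feature = [1.0, 0.0]
class_prompts = [
    [[1.0, 0.0], [0.0, 1.0]],  # class 0, tokens 0 and 1
    [[0.0, 2.0], [3.0, 0.0]],  # class 1, tokens 0 and 1
]
p_d = compose_diversified_prompt(image_feature, class_prompts)
print(p_d)
```

In this toy case each position keeps only the token aligned with the image feature, since the orthogonal tokens receive zero weight. A normalized variant (for example, a softmax over the class weights) would be an equally plausible reading of "attentional filtering".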

Training
The final prediction $P(y_{pred})$ is obtained as in Eq. 4. The entire model can be trained by maximizing the likelihood $P(y_{pred} = y \mid p_d)$ via backpropagation. The parameters of the whole original CLIP model are fixed, and gradients are only applied to update $p_d$. We adopt the cross-entropy loss as the classification loss to optimize the model, as in Eq. 5.
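As a rough illustration of this objective, the sketch below turns per-class similarity scores into a probability distribution with a softmax and evaluates the cross-entropy loss for the true class. The use of a softmax and the `temperature` parameter are assumptions of this sketch (CLIP-style models commonly scale similarities before the softmax), and the scores are made-up values.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; `temperature` is a hypothetical scaling factor."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[target_index])

# Toy cosine similarities between one image and C = 3 class prompts.
scores = [0.2, 0.9, 0.1]
probs = softmax(scores)
loss = cross_entropy(probs, target_index=1)
print(probs, loss)
```

In the setting described here, only the prompt parameters would receive gradients from this loss, while both encoders stay frozen.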

Datasets
We perform experiments on four datasets with six settings, including Flickr and Instagram (FI), EmotionROI, FI 2, EmotionROI 2, Twitter I, and Twitter II. Following previous studies [Yang et al., 2018a], we adopt accuracy as the metric to evaluate our proposed method and use the same dataset splits for fair comparison.
• Twitter II [Borth et al., 2013] is a small-scale dataset that contains 603 images of two different categories.

Implementation Details
We build our framework on CLIP-ViT-B/32 [Radford et al., 2021], which is trained on 400 million image-text pairs and reports impressive performance on several zero-shot downstream tasks. The CLIP text and image encoders are fixed throughout all experiments.
We employ the SGD optimizer with 0.9 momentum to tune the trainable part, using 0.1, 0.01, or 0.001 as the initial learning rate depending on the dataset. The learning rate is updated by a StepLR scheduler with a step size of 3 and a gamma of 0.9.
All our experiments are carried out on an NVIDIA RTX3090 GPU with 32GB of CPU memory using the PyTorch framework [Paszke et al., 2019]. Images are resized and center cropped to 224 × 224, channel converted, and normalized by the original CLIP project's preprocessor. We train with a batch size of 64 for 10 epochs.
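To make the learning-rate schedule concrete, the following sketch reproduces the step-decay rule that a StepLR scheduler applies: the rate is multiplied by gamma once every `step_size` epochs. It assumes the scheduler is stepped once per epoch, which the text does not state explicitly, and `step_lr` is a hypothetical helper rather than a PyTorch call.

```python
def step_lr(initial_lr, epoch, step_size=3, gamma=0.9):
    """Learning rate at a given epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

# Schedule for the largest reported initial rate over the 10 training epochs.
schedule = [step_lr(0.1, epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.5f}")
```

Under these settings the rate decays three times within 10 epochs (at epochs 3, 6, and 9), so the final rate is 0.9^3 = 0.729 of the initial one.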

Classification Performance
In this section, we review recent works on image emotion classification and compare them with our method. The compared methods include two hand-crafted-feature-based methods, three CNN-based finetuning methods, and several methods with specialized pipeline designs built on CNN backbones in recent years, including the SOTA methods on each dataset.
• In the early years, researchers explored emotion classification in terms of hand-crafted features at the image art level [Zhao et al., 2014] or using a sentiment dictionary and simple classifiers [Borth et al., 2013].
• With the rise of deep learning, image emotion classification methods turned to using CNNs as backbones, obtaining better performance. ResNet is a widely used CNN baseline structure. It is pre-trained on the ImageNet dataset [Deng et al., 2009] and fine-tuned after modifying its FC layer.
• Based on the ResNet101 backbone, WSCNet [Yang et al., 2018a] realized end-to-end image emotion classification by coupling global and local features according to the detected salient regions in the image, and is the best method among content-based image emotion classification methods.
• For comparison with WSCNet, ECWA [Deng et al., 2021] proposed an emotion class-wise aware loss on the same backbone. By only fine-tuning the backbone, without any other structure, it achieved better performance than WSCNet on all datasets.
• Zhu et al. [Zhu et al., 2017] explored a unified CNN-RNN architecture for visual emotion recognition.
• MldrNet [Rao et al., 2020] provided a CNN architecture based on AlexNet with a side branch to utilize hierarchical features.
• MSRCA [Zhang et al., 2022] proposed a novel multi-level sentiment region correlation analysis model.
• Because some datasets provide label probabilities, [Rao et al., 2019] incorporated the label probabilities of affective images into the training loss function to leverage the ambiguity and subjectivity of class labels. This work achieved the best classification performance on several benchmark datasets [Zhao et al., 2021].
• [Zhang and Xu, 2020] proposed an end-to-end network for IEC that leverages weakly supervised emotion intensity learning, achieving SOTA performance on FI 2 and the two-category setting of the EmotionROI dataset.
• Based on the object information detected in the image, [Wu et al., 2021] built a graph convolutional network based on a sentiment dictionary to explore the relationships among the objects in the image, which achieved better performance on the sentiment polarity classification datasets.
The experimental results are shown in Table 1. Despite having better interpretability, early hand-crafted feature methods, such as Zhao et al. [Zhao et al., 2014] and Sentibank [Borth et al., 2013], are generally less effective than deep learning methods. With the rise of deep learning, the performance of deep feature methods on IEC improves with the depth of the network architecture and the number of model parameters, as with AlexNet [Krizhevsky et al., 2012], VGG-16 [Simonyan and Zisserman, 2014], and ResNet101 [He et al., 2016].
With the rapid development of deep learning architectures, more and more researchers have managed to capture internal sentimental factors [You et al., 2015; Yang et al., 2018a; Deng et al., 2021] or utilize external knowledge [Chen et al., 2014; Yang et al., 2017; Rao et al., 2019; Zhang and Xu, 2020; Wu et al., 2021] to improve classification performance. They obtain better results than the original deep models. Our method, PT-DPC, which utilizes a large-scale pre-trained model with richer knowledge, achieves competitive results on all six commonly used datasets; for example, it achieves about 2.9% improvement on the FI 2 dataset and 9.29% on EmotionROI 6 over the state-of-the-art methods. Only on the Twitter II dataset does it rank second, by a 0.18% difference. This is mainly because of the tiny data scale, with only 603 images in total.

Ablation Study
The diversified prompt utilizes instance-specific information and class-specific information to leverage the knowledge from the large-scale pre-trained model. To evaluate the effectiveness of these two components, we conduct an ablation study on all six datasets. The experimental results are shown in Table 2.
Effectiveness of the instance-specific information. The first row in Table 2 denotes the baseline single prefix tuning method like [Lester et al., 2021], which does not use any instance-specific or class-specific information. The second row means we adopt the instance-specific component, which means the prompt templates are initialized separately and the diversified prompt is obtained by using the corresponding image as the query. The results show that utilizing the instance-specific information moderately improves classification performance on most datasets.
Effectiveness of the class-specific information. We show the model performance when utilizing class-specific information in the third row of Table 2, while the fourth row means we further utilize class-specific information on top of instance-specific information. We can see that the class-specific information improves model performance on all but one of the six datasets.
Overall, utilizing instance-specific information improves accuracy, and leveraging class-specific information further improves performance, which indicates the necessity of considering both kinds of information. Our proposed PT-DPC achieves the best performance by considering both the instance-specific and class-specific information, which indicates that both the image content and the sentiment concept are important for prompt tuning on the IEC task.

Sensitivity analysis about the Prompt Initialization
The performance of previous prompt tuning methods is highly sensitive to prompts, so it is necessary to analyze the sensitivity of PT-DPC to prompt initialization.
There are many types of templates that can be used to initialize the prompt. Inspired by prior prompt tuning work, we chose a few standard prompt templates for this study.
• PT-DPC-1 uses the template "a photo seems to express a feeling of [label word]", which is similar to the original CLIP template. We use this template for the performance comparison in Section 4.3.
• PT-DPC-2 uses the template "an image to express a feeling like [label word]", in which some words are replaced by synonyms.
• PT-DPC-3 uses the template "a picture seems to express some feelings like [label word]", in which words are replaced by others with the same meaning.
Besides the classification results of each initialization template, we also calculate the standard deviation (Std) of these results on each dataset to intuitively display the degree of influence of different templates. The smaller the Std, the less sensitive PT-DPC is to the template on the corresponding dataset.
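The sensitivity measure can be computed as sketched below. The accuracies are made-up placeholders, not values from Table 3, and the sketch assumes the population standard deviation; the text does not say whether the population or the sample form is used. `template_sensitivity` is a hypothetical helper name.

```python
import statistics

def template_sensitivity(accuracies):
    """Population standard deviation of accuracies obtained with different templates."""
    return statistics.pstdev(accuracies)

# Made-up accuracies (%) of three templates on one dataset.
accuracies = [70.0, 71.0, 72.0]
print(round(template_sensitivity(accuracies), 4))
```

With only three templates per dataset, the sample form (`statistics.stdev`) would give noticeably larger values, so the choice between the two should be stated whenever such numbers are compared across papers.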
As shown in Table 3, different initializations bring only small changes, and the changes are not tied to a particular dataset. The Std results show that the initialization method has only a small influence. It has almost no influence on the FI and Twitter II datasets with polarity classification. Although further tuning of the initialization words might help, PT-DPC can still be considered robust to different initialization templates.

Impact of Vision Encoder
From the earlier CNNs to Vision Transformers, many vision encoder architectures have been developed to capture discriminative feature representations. Thus we conduct classification experiments with the different vision encoder architectures employed in the compared methods. In detail, to perform classification on each dataset, we change the dimension of the output layer and train only this fully-connected layer while keeping the rest of the encoder fixed.
The experimental results are shown in Table 4. The CNNs model performs better with the deeper depth the mod-

Visualization
To demonstrate the performance of PT-DPC on the image sentiment classification task more intuitively, we show confusion matrices in Figure 4. The numbers in each matrix indicate how many images of each ground-truth category in the corresponding test set are predicted into each category. The higher the number, the brighter the corresponding grid cell. The bright diagonal cells in the figure indicate that we obtain excellent classification results on the two primary datasets.
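A confusion matrix of this kind can be built as in the minimal sketch below, where rows index the ground-truth category and columns the predicted category. The labels are made-up toy values, and `confusion_matrix` here is a hypothetical hand-written helper, not a library function.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts test images of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

# Toy test set with three categories.
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)

# Accuracy is the sum of the diagonal divided by the number of test images.
accuracy = sum(matrix[i][i] for i in range(3)) / len(true_labels)
for row in matrix:
    print(row)
print(accuracy)
```

The off-diagonal cells show which categories are confused with each other, which is the information a single accuracy number hides.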

Remaining Challenges
The successful application of large models is highly effective for recognizing semantically complex affective images. However, a dedicated domain adaptation design is needed for small-scale datasets with large domain biases, such as the Twitter II dataset, to bridge this gap more adequately. In addition, emotion is complicated, and affective categories are not mutually exclusive. In Figure 5, one image from each dataset is exhibited in a row. They show that different factors cause images to be attributed to different emotional categories. However, ignoring some factual elements can lead to erroneous judgments. For example, without the concept of disaster, the third picture is misclassified as positive because it is identified as a spectacular scene. At the same time, since some images could belong to more than one emotion category, they produce predictions that are counted as wrong but seem acceptable, like the second image. Future work could therefore study inter-class relationships or LDL-related problems with PT-DPC, or exploit larger-scale pre-trained models beyond CLIP.

Conclusion
In this paper, we propose a general framework that adapts CLIP to the image emotion classification task, prompt tuning with diversified prompt composition (PT-DPC), to effectively leverage the rich image and text semantics entailed in CLIP. Besides addressing the challenge of the training objective gap, PT-DPC automatically composes instance-specific prompts by conditioning them on the categories and image contents of instances, diversifying prompts and avoiding suboptimal problems. Compared with state-of-the-art methods, PT-DPC performs better on several widely used datasets, covering both binary and multi-category settings. Furthermore, although research on multi-modal methods is popular, there are still many cases where only a single modality is available. We hope that the ideas in this article can inspire work on other resource-constrained tasks like image emotion classification and the development of more novel multi-modal methods for traditional single-modal tasks.

Figure 1: The challenges of adapting CLIP to the IEC task. (a) is the pretraining data form of CLIP, while (b) is that of the IEC task (e.g., the FI dataset). (c) and (d) show the classification results of different manually designed prompts on two affective image datasets. The left one in (e) is a picture of the contentment category in the FI dataset, and the middle one and the right one belong to the awe category.

Figure 2: Illustration of the PT-DPC framework. Given an initialized template and the class words of the dataset, we generate the class-specific prompts and the class embeddings by using the same text preprocessor, which includes a text tokenizer and an embedding lookup table. The input image is processed by the image preprocessor and image encoder to generate image features. Then the diversified prompt is obtained by attentional filtering of the class-specific prompts with image features as queries. After concatenating the diversified prompt with each class embedding, the full diversified prompts are sent to the text encoder to get text features. Then the similarity scores are obtained by calculating the cosine similarity between text features and image features. We use the cross-entropy loss to estimate the parameters. During model training, both the text and image encoders are frozen and only the diversified prompts are optimized. Finally, the category with the maximum score becomes the predicted result.

Following [Lester et al., 2021], we initialize the class-specific prompts $p_c$ by a template $s$, which is processed by the tokenizer and embedding layer of the original CLIP model. The task then becomes training the parameters of $p_c$. After initialization, we get $C$ series of $L$-long class-specific prompt embeddings, whose tokens are denoted $p_c^{(i)(j)}$ ($j \in [1, L]$). Each token has the same dimension as the input and output features of the CLIP model.

Figure 4: Confusion matrices of the classification results of the PT-DPC method on the FI 8 and EmotionROI 6 datasets. Different colors indicate different accuracy levels, as shown in the color bar. (a) shows the results on the FI 8 dataset, while (b) presents those on EmotionROI 6.
• DeepSentibank [Chen et al., 2014] employed CNNs to discover ANPs and realized visual sentiment concept classification.
• PCNN [You et al., 2015] proposed a novel progressive CNN architecture based on VGGNet.
• Yang et al. [Yang et al., 2018b] employed an object detection technique to produce the "Affective Regions" and proposed three fusion strategies to generate the final predictions on VGGNet.

Table 1: Classification accuracy of PT-DPC on different datasets compared with baseline methods

Table 2: Ablation study results of PT-DPC on different datasets

Table 3: Sensitivity analysis results of PT-DPC on different datasets

Table 4: Classification accuracy of different vision encoders on the affective image datasets