Meme Sentiment Analysis Enhanced with Multimodal Spatial Encoding and Facial Embedding

Internet memes are characterised by the interspersing of text amongst visual elements. State-of-the-art multimodal meme classifiers do not account for the relative positions of these elements across the two modalities, despite the latent meaning associated with where text and visual elements are placed. Against two meme sentiment classification datasets, we systematically show performance gains from incorporating the spatial position of visual objects, faces, and text clusters extracted from memes. In addition, we also present facial embedding as an impactful enhancement to image representation in a multimodal meme classifier. Finally, we show that incorporating this spatial information allows our fully automated approaches to outperform their corresponding baselines that rely on additional human validation of OCR-extracted text.


Introduction
The sentiment polarity classification task traditionally entailed analysing a piece of natural language text to classify its sentiment as negative, positive, or neutral. Sentiment analysis was initially performed on text. The growth of user-generated multimodal content (e.g., videos, image-caption pairs) has motivated the extension of affective computing techniques to input types beyond text [8]. Multimodal sentiment analysis poses the same questions as its text-only predecessor, but is extended to inputs comprising multiple modalities simultaneously. When faced with multimodal inputs, Poria et al. [8] describe unimodal encoders as crucial building blocks of multimodal systems, each encoder directly contributing to the Fig. 1: Sample memes with a) Positive sentiment [7] and b) Negative sentiment [7], c) hateful spatial analogies [12], and d) spatial segments [13]. resultant performance. Furthermore, the fusion of unimodal representations also plays a key role by providing "surplus information" to the classifier [8].
Along with the advent of other multimodal formats of user-generated content, Internet memes (or simply "memes") have proliferated. Memes are commonly found in various online communities to communicate ideas, incite humour, and express emotions. Automated analysis of memes allows for: including memes in automated opinion mining processes [8], taking action against meme-based hate speech [12,6], identifying disinformation campaigns [1], and investigating social and political cultures [5]. This work contributes to the underlying problem of sentiment polarity classification of a meme: "Given a meme in a visual format, comprising an image I with embedded text T , classify the meme as having the overall sentiment of either Negative (e.g., Fig. 1b), Positive (e.g., Fig. 1a), or Neutral ".
Recent work has shown that incorporating additional relevant information improves the performance of meme affective classifiers [10], amongst which is positional information of words within text and visual objects within an image [12,16]. Unlike many other forms of multimodal content, the text within a meme is interspersed into its image, often either superimposed on the image or comprising a segment of the meme image, creating a shared visual medium. Meme authors intentionally position a grouping of words ("text clusters") to convey meaning, such as implying hateful analogies [12] (e.g., Fig. 1c); text clusters can be paired with image segments, with each pair signifying a different sentiment (e.g., Fig. 1d). Current approaches that use positional information in meme sentiment classification opt to omit intermodal positional relations, i.e. they consider the position of a word amongst text but not its position in relation to the meme image or vice versa.
This work proposes injecting the spatial information of features from both modalities of a meme into a deep learning multimodal classifier to improve sentiment classification performance. Crucially, we account for the interspersing of showing the image and text modalities, I and T , as given in the dataset; b) bounding boxes generated for local features: text clusters (t 1 and t 2 ), objects (o 1 detected as "Dog" and o 2 as "Person"), and faces (f 1 ); and c) the coordinate system used to generate the spatial encoding for each bounding box (e.g. the vertices of o 1 , p o1 ).
visual objects and text clusters by representing the spatial position of each on a shared coordinate system ("spatial encoding"). We append the spatial encoding of visual objects (e.g. o 1 , o 2 in Fig. 2b), faces (e.g. f 1 in Fig. 2b), and text clusters (e.g. t 1 , t 2 in Fig. 2b) to their local representations prior to multimodal fusion and classification. The performance implication of spatial encodings and local representations are systematically evaluated on two benchmark datasets using the seven models described in Section 3.2. To the best of our knowledge, this work is the first to use shared coordinate spatial encoding and deep representation of faces to tackle the sentiment classification of memes.

Meme Affective Classifiers
Memes are distinct from other multimodal user-generated content types in several key ways. First, the text and image of a meme share a common visual medium, unlike the more common image-caption pairs. Text in memes is often intentionally located amongst other visual content to create meaning [12]. Second, memes use short text pieces and few foreground visual objects, relying on intermodal relations to convey meaning. Kiela et al. [6] show how harmless images and texts could be combined to create hateful memes. Furthermore, slight changes in either modality can change a hateful meme into a harmless one and vice versa. Therefore, meme classifiers must be able to learn subtle intermodal relationships with very limited input.
Architecturally, the current literature suggests that various affective classification tasks can be applied to memes without requiring entirely distinct approaches. Most apparently, Bucur et al.'s [3] winning submission of the Memotion 2022 Challenge [7], was trained to simultaneously classify sentiment polarity, offensiveness, sarcasm, humour, and motivational intent. Their findings suggest that meme classification architectures exhibit adaptability across different affective computing tasks. Furthermore, Pramanick et al. [9], who reported the best-performing sentiment classification solution to the Memotion 1.0 dataset [13], showed that the same architecture outperforms all, or all but one, competing solution when individually trained on eight affect dimensions.
A typical approach to building a multimodal meme classifier is to generate unimodal representations of each modality before fusing these representations into a multimodal representation of the meme, such as in [3,10,12,9]. Furthermore, the literature presents a wide range of deep learning representations used for each visual and textual modality [6,7,13], with no clear evidence that any of the options would consistently outperform all others.

Positional Encoding
Positional encoding plays a central role in the Transformers architecture [14] and has seen wide adoption in tackling various natural language tasks. It describes the position of tokens, such as a word in a sentence or a region in an image, within the input. However, since most multimodal meme classifiers employ unimodal encoders, the positions of text and visual elements are encoded separately.
To the best of our knowledge, a positional encoding that is shared between the text and image modalities on a common spatial coordinate system (a "spatial encoding") has not been applied to classifying meme sentiment. None of the architectures reportedly used to learn meme sentiment classification in [13] and [7] did so using a positional information from a coordinate system shared between modalities. Further, we were not able to find a pre-trained multimodal Transformer that readily supports such a shared encoding.
In this task, Pramanick et al. [9] showed performance gains by segmenting the text modality into text clusters but did not explicitly represent the spatial position of each cluster. To classify hateful memes, Zhu [16] employed a patch detector to divide each meme into "image regions". They then appended each text token with a representation of its surrounding image patch. However, they did not present the performance gains solely attributable to this approach. Further, we posit that such a patch-based definition of position would not be suitable where multiple text clusters are placed within the same image patch (e.g., Fig.  1c) or where a patch consists only of text (e.g. Fig 1a).
Shang et al. [12] proposed a more general representation of spatial position by appending the spatial encoding of extracted visual objects and text clusters prior to input into an intermodal co-attentive pooling module based on a design from [?]. They attributed their model's outperforming of other leading hateful meme classifiers to its "awareness" of offensive intermodal analogies: the purposeful superimposing of a text cluster near to a visual object is used to represent an offensive conceptual comparison. While their approach is predicated solely on offensive spatial analogies, we posit that this approach could capture a broader category of intermodal spatial relationships, including those captured by Pramanick et al.'s [9] and Zhu's [16] approaches.

Visual Feature Representations
While the image modality is commonly represented by passing the entire meme image through an image encoder [7], enhancing this representation with that of extracted visual objects has proven beneficial in classifying hateful memes [10,12,16]. One such approach is to input the meme image into Google Cloud Vision API's Web Entity Detection to create a corresponding description or set of attributes in text format [10,16]. Zhu [16] also demonstrated further performance improvement with the inclusion of Race and Gender tags for each face using a pre-trained FairFace classifier. Pramanick et al. [10] also showed improved performance by representing cropped images of visual objects and faces with VGG-19. Shang et al. [12] also found that their multimodal classifiers perform best when global and local visual feature representations are available.
The use of faces to convey sentiment is neither new nor unique to memes. Firstly, visual sentiment analysis [15] points to facial expressions as a valuable mid-level feature in classifying the sentiment conveyed by images from social networks. Second, facial expression emojis have been shown to be informative in supporting the sentiment classification of textual social media [2]. In memes, Zhu [16] argues that expecting a global image encoding to sufficiently recognise facial features that are predictive of hatefulness is unreasonable given the size of current meme datasets. Although we agree with Zhu's argument, we posit that their approach omits other information conveyed by faces that may indicate a meme's sentiment, such as emotion, expression, and identity.

Methodology
In this work, we evaluate the performance of seven novel multimodal classifier models. These models are separately trained on two competition datasets, Memotion 1.0 [13] and Memotion 2.0, [7], to classify the sentiment polarity of memes. We first designed and evaluated a multimodal deep learning model to establish baseline performance. This model is then repeatedly augmented to answer our research questions. Augmentations include incorporating spatial information of faces, visual objects, and text clusters and are described for each model in Table 3. Evaluation is conducted based on the differences in macro-averaged and weighted-averaged F1 scores -metrics prescribed by the authors of the datasets [7,13] -between pairs of models that respectively include and exclude each augmentation. This section presents details of the datasets and models used.

Dataset & Feature Extraction
This work utilises datasets presented in the SemEval 2019 Memotion 1.0 [13] ("Memo1") and AAAI 2022 Memotion 2.0 [7] challenges ("Memo2"). Both are collections of user-generated memes labelled with one of three exclusive sentiment classes. The authors of the datasets extracted text from each meme with an automated OCR tool and then manually corrected any erroneous text extraction. For our experiments, the samples from Memo1 and Memo2 are kept separate. Without filtering or pre-processing, these samples comprise our Original datasets that we use to compare our Baseline model to leading solutions. For each meme in these datasets, we localised, extracted, and represented its text clusters, faces, and visual objects using the tools listed in Fig. 3  maximum counts of text clusters, visual objects, and faces are set to 18, 10, and 5, respectively, with padding used for memes with fewer. Padding for text clusters is defined by passing an empty string into the CLIP text encoder, while that for visual objects is the CLIP encoding of a blank image, and zero-padding is used for faces.
Since this work applies to memes that contain identifiable visual objects and text clusters, we removed meme samples that do not meet these criteria to make up the Filtered datasets. This filtering is performed on all subsets of Memo1 and Memo2. As Memo1 did not contain a designated validation set, we defined one by splitting the training set -as reported by the authors of the Memo1 dataset and used in submissions to their competition [13] -with a random 85:15 sampling, weighted by the sentiment class, to maintain the target distribution. We maintained the train-validation-test splits defined for Memo2 [7]. Meme samples with identifiable visual objects but no detected faces are given face feature representation made up entirely of padding.
Finally, the Filtered-OCR datasets replace the text of each meme in Filtered with that returned in our feature extraction OCR step. Unlike in [7,9,13], we excluded any additional human validation during the OCR extraction process. All models are trained, validated, and tested on the resultant Filtered-OCR datasets. The counts of memes in each dataset and sentiment labels are shown in Table 1.

Models
This section describes the architectural characteristics of our models as listed in Table 2 and illustrated in Fig. 4. Each was built using PyTorch and trained with a triangular cyclical learning rate schedule ranging between 1e−4 and 1e−3 with a step size of 52 mini-batches of 512 samples. During training, validation performance was monitored for overfitting or until each model was trained for 100 epochs. Training is carried out using AdamW optimiser with weight decay of 5e − 1, betas of 0.1 and 0.25 to minimise negative log-likelihood loss with class weights inversely proportional to its sample count in the training dataset. All non-pretrained weights are initialised with a zero-mean Gaussian distribution with standard deviation 0.02 , while pretrained weights are not fine-tuned. The same hyperparameter settings are maintained across all models as they are separately trained on the datasets.
Leading meme sentiment classifiers use a variety of architectures with little indication of which is most optimal. For our Baseline model, we drew inspiration from the typical overall approach used in leading solutions to the Memotion 2.0 Challenge [7]: each modality is represented using a pretrained encoder. Then, these representations are fused, often with a multimodal attention mechanism, and finally passed to a fully connected layer.
To encode the meme image and text (see I and T in Fig. 2a) in our Baseline model, we opted to use the pretrained image and text encoders of CLIP [11], respectively, which has shown comparable performance to other multimodal approaches [10]. In addition, CLIP image encodings have been shown to outperform  [4,9] various other image encoders in the zero-shot classification of hateful memes [11] and are used by the winning solution of the Memo2 challenge [3]. We chose the ViT-B/16 variant of CLIP while Pramanick et al. [10] and Bucur et al. [3] did not report their chosen variant.
Since attentive fusion has been shown to perform well on several meme problems [9], we included one in our models. Our Baseline model fuses the CLIP representations of the meme image and text using Gu's [4] attentive modality fusion mechanism, as used in [10]. We defined the sizes of the four dense layers as 256, 64, 8, and 1, which produces an attention score for each modality. The attention-weighted representation of each modality is concatenated and passed into a GeLU-activated dense layer followed by a log-softmax activation to output predicted logits of each sentiment class.
This model is trained on the Original dataset to allow performance comparisons with previously published works. We then evaluated this model on the Filtered and Filtered-OCR datasets. In the latter, the content of all text clusters t n is concatenated and entered into the text encoder. The difference in the performance of this model on these two datasets allows us to measure the performance impact resulting from our OCR-based text extraction output relative to the human-curated approach used by the authors of the datasets [7,13].
The Obj-NoSpatial and Face-NoSpatial models remove the meme image and text, I and T per Baseline. As inputs, the former takes CLIP-encoded visual objects, o 1 , o 2 , ..., o j , and text clusters extracted from a meme, t 1 , t 2 , ..., t i . Instead of objects, the Face-NoSpatial model takes the FaceNet representa-tion of faces, f 1 , f 2 , ..., f k . Then, the j visual objects or k face representations are passed through co-attentive weighted pooling against i text clusters as used in [12] but without spatial encodings. This step allows the models to learn attention maps between each object/face and each text cluster; producing a onedimensional vector representing each modality. This representation replaces that of the image modality as input into the attentive fusion mechanism described for the Baseline model.
The Obj-Spatial and Face-Spatial models introduce the spatial encodings of each text cluster, p ti , as well as for visual objects, p oj , and faces, p f k , respectively. We augment the co-attentive pooling module in Obj-NoSpatial and Face-NoSpatial into the co-attentive analogy alignment module proposed in [12]. This is performed by appending each object's and cluster/face's representation vector with its spatial encoding. The padding for spatial encodings is defined as zeros for all coordinates.
The Img-Obj-Spatial and Img-Face-Spatial models each combine the CLIP representation of the meme image, I, into Obj-Spatial and Face-Spatial, respectively. Since these models make use of three representations per memeimage, text clusters and objects/faces -we extend Gu's [4] fusion mechanism to accommodate three inputs by introducing a third set of dense layers. Table 2: Goals of each experimental model.

Model
Dataset Goal

Baseline
Original Benchmarks our chosen modality encodings and fusion mechanism against leading solutions.

Filtered
Establishes baseline performance on samples with detectable text clusters and visual objects. Filtered-OCR Measures the impact of replacing humancurated text replaced with text clusters returned by automated OCR. Also, establishes a baseline for our fully automated approaches.

Obj-NoSpatial
Filtered-OCR Measures the performance of representing the image modality using only CLIP-encoded localised visual objects without spatial encodings.

Obj-Spatial
Filtered-OCR Measure the performance impact of including spatial encodings of objects and text clusters.

Img-Obj-Spatial
Filtered-OCR Maximises available visual information by augmenting image input with objects and text clusters and respective spatial encodings.

Face-NoSpatial
Filtered-OCR Measures the performance of representing the image modality using only embeddings of localised faces without spatial encodings.

Face-Spatial
Filtered-OCR Measure the performance impact of including spatial encodings of faces and text clusters.

Img-Face-Spatial
Filtered-OCR Augments image input with faces and text clusters and respective spatial encodings.

Results
Evaluating the Baseline model on the Original datasets places it within the top six highest performing solutions on each respective dataset; see Tables 3 and  4.
The performance of the Baseline model on the Original, Filtered and Filtered-OCR datasets are shown in Table 5. The lower performance of the model on the Filtered dataset than on the Original dataset likely stems from the removal of samples that contain only text on an object-less background. Classifying such samples is similar to discerning the sentiment of unimodal text inputs and is beyond the scope of this work. We attribute the performance decrease of the Baseline model on the Filtered-OCR vs. Filtered datasets to the lower quality of the text extracted with our automated OCR process relative to human-curated text. Despite this, our spatially aware models are able to overcome this performance penalty. The model that performs best on each dataset -as seen in Table 6 -constitutes fully automated approaches that outperform their respective Baseline models trained on the human-curated text from the Filtered datasets. By removing the need for manual intervention, fully  The results show that spatial encoding improves performance. Obj-Spatial and Face-Spatial each outperforms Obj-NoSpatial and Face-NoSpatial respectively. These results point to intermodal spatial information being informative for the problem task and not sufficiently represented by the CLIP encodings of the whole meme image. This finding holds significance to applying deep learning solutions on memes in particular, as the text modality is incorporated and interspersed within the image. Although the importance of token positions in leading solution architectures has been well established, the lack of a shared visual medium for image and text modalities in many other vision-language tasks has resulted in leading multimodal architectures with separate positional representations for each modality. Based on our results, we argue that spatial encodings should also be considered for other vision-language tasks where visual objects and text share a common visual medium.
The performance benefit of representing the image modality with localised visual feature representations depends on whether the features are defined as objects or faces. CLIP-encoded object representation performs worse than Baseline. This results from a reduction in the visual information available to the image encoder. However, Face-NoSpatial, which uses FaceNet embeddings to represent faces, outperforms both Obj-NoSpatial and Baseline while also suffering from the same, if not greater, reduction in available visual information. Furthermore, Obj-Spatial showed mixed results against Baseline, while Face-Spatial outperforms Baseline in both datasets. Notably, faces are not entirely excluded from models based on visual objects, as many meme samples had "Person" as a detected object. Thus, we believe that the performance difference between the two approaches arises from the more fine-grained facial embedding provided by FaceNet and the inherent exclusion of non-face visual objects that emphasises the contribution of faces to the sentiment of a meme.
We found that augmenting the meme image with local representations of either objects or faces and their spatial encodings consistently outperforms models that rely on the image alone. However, choosing between CLIP-encoded objects versus FaceNet-encoded faces as augmentations to the meme image proved inconsistent and dependent on the dataset. Although Img-Obj-Spatial and Img-Face-Spatial perform the best in the Memo1 and Memo2 datasets, respectively, their performance relative to Obj-Spatial and Face-Spatial appears to depend on the dataset. Drops in performance here may stem from redundant intermodal information (e.g. between global image and objects-based representations). Unlike in [9], we did not employ any form of learned cross-modal filtering.

Conclusions
In this work, we addressed spatial encoding and facial embedding in classifying sentiment polarity of internet memes. We developed seven novel architectures, and evaluated each on two challenge datasets. For both datasets, our proposed baseline multimodal classifier ranked within the top six of leading state-of-the-art solutions on both datasets. While we found that representing the image modality with visual objects alone does not consistently offer performance benefits, a facebased representation does. Furthermore, the incorporation of spatial information of these visual features grants performance improvements over both image-only and faces-/objects-only approaches. For each of the Memotion datasets, our top performing solution comprises augmenting the image modality with spatially encoded visual features and text clusters. We propose these solutions as fully automated competitive alternatives to current state-of-the-art solutions that rely on manual validation of OCR-based text extraction.