1 Introduction

Image memes have become established in recent years as a popular means of communication in social media. Their typical form, known as image macros, comprises images with overlay text at the top and/or bottom and is principally used to express a spectrum of concepts and emotions such as humor, irony, sarcasm and even hate. Memes and regular images have critical visual differences that make their discrimination an easy task for a human: the overlay text has a specific font size, color, family and position, and the background image usually carries a cultural reference or is otherwise memorable. In contrast, regular images may depict anything without such constraints. In Fig. 1, we show one image meme and one regular image to illustrate the differences between the two types of digital media.

Other forms of Internet memes also exist; for instance, they may be plain text [10], tweet screenshots, social statement cards, logos [18] or images reusing memorable visual elements in different creative ways, such as Bernie Sanders' mittens. In addition, the adoption of different meme forms appears to be highly platform-dependent, as community-specific vernaculars shape different meme cultures [18]. Here, we only address the detection of the typical Internet meme form, namely image macros: background images with superimposed text, as in Fig. 1a.

Fig. 1 Example image meme versus a regular image

In the framework of analyzing digital social behavior and trends, image memes have attracted research interest [5, 19, 30, 32], mostly with a focus on deep learning models for image meme classification [1, 3, 8, 25] and, more frequently, for the detection of hateful image memes [2, 13, 15, 33]. The latter works rely on datasets that already contain labeled image memes and therefore do not address whether an input image is a meme in the first place. The detection of image memes and their discrimination from regular images remains a relatively understudied topic, with only a few attempts in this direction [17, 24].

Fig. 2 The proposed visual part utilization process. The original image \(M_i\), which belongs to the set of image memes \({\textbf {M}}\), is passed through the visual part extraction algorithm, which identifies the corresponding visual part \(V_i\), crops it and adds it to the set \({\textbf {V}}\). Best viewed in color

In this work, we present MemeTector, a model for efficiently classifying images as memes or regular images. Deployed in online social environments to retrieve memes, it can facilitate the monitoring and analysis of web trends and behaviors as well as the detection of harmful practices that are sometimes carried out through memes, such as hate speech and disinformation. To force the model to focus on the critical visual cues that characterize both classes, we propose visual part utilization (VPU), a methodology for constructing an artificial dataset from an existing image meme dataset [14], and a deep learning architecture called ViTa, which employs a trainable attention mechanism on top of a Vision Transformer (ViT) [6]. Although we propose ViTa for meme detection, the architecture is general and can potentially be applied to other tasks. Regarding VPU, from each image meme instance \(M_i\) of the initial dataset we extract the largest part that contains no text and use it as a regular image instance. We denote the latter \(V_i\) and call it the visual part of image meme \(M_i\). In Fig. 2, we present the construction process for the set \({\textbf {M}}\) of image memes and the set \({\textbf {V}}\) of visual parts. In essence, a meme's background image is a regular image: humans easily tell it apart from the meme, whereas neural networks initially produce almost identical feature maps for both. VPU is therefore employed here to effectively enhance the learning of the subtle peculiarities of the two class distributions.

The paper is structured as follows. Section 2 reviews the related literature. Section 3 elaborates the proposed methodology. Section 4 describes the experimental setup. Section 5 presents the results. Section 6 concludes the paper.

2 Related work

Previous related studies focus mainly on the classification of memes into categories such as hateful or offensive [26]. Due to their multi-modal nature, image meme classification is most frequently treated as a multi-modal analysis problem that processes both visual and textual content [1]. Other studies attempt to classify image memes in terms of their sentiment (positive, negative, neutral) and type of humor (e.g., sarcastic or motivational) [22]. However, the topic of this study precedes the work of classifying image memes into certain categories: one first needs to know whether an image is a meme at all before analyzing it further.

The topic of image meme detection, i.e., automatically discriminating image memes from regular images, has not yet received considerable attention from the research community. To our knowledge, there is only one dataset for meme detection, namely the DankMemes dataset [17], which was released in 2020 but is publicly unavailable at the time of writing. DankMemes contains 2000 images related to the 2019 Italian government crisis, half of which are memes while the rest are regular images. In terms of competing approaches, only a few meme detection methods have been presented recently [7, 20, 29]. The related problem of identifying satire images on social media is addressed in [24]. Finally, a few other attempts exist on the Internet, for instance in blog posts or GitHub repositories, but they are not peer reviewed.

This is the first paper to utilize the visual part of image memes as instances of the regular image class, thus enabling: (i) a fully accurate yet automatic annotation process that reuses existing meme classification datasets, and (ii) the creation of a dataset of 40,000 images (20 times larger than the DankMemes dataset). Additionally, we are the first to employ a supplemental attention mechanism on top of a ViT architecture, which combines different levels of information granularity and provides interpretability of the results.

3 Methodology

Here, we present MemeTector’s building blocks, namely VPU and ViTa.

3.1 Visual part utilization

3.1.1 Extraction

To extract the visual part \(V_i\) of a given image meme \(M_i\) (\(i=1,\dots ,k\)), one first needs to locate the text elements in it. To this end, we use a state-of-the-art deep learning-based text detection model called TextFuseNet [31], which processes \(M_i\) and produces the set \({\textbf {B}}_i\) of detected text bounding boxes. We only keep boxes corresponding to whole words, since bounding boxes of individual letters are not useful for our task. Then, we apply Algorithm 1 to find the largest rectangle that contains no text and take that part of \(M_i\) as its visual part \(V_i\).

Algorithm 1 Visual part extraction

A rectangle R covers a fraction p of the initial image area, namely \(A_{R}=p\cdot W\cdot H\), where W and H are the width and height of the initial image. Thus, R can have width \(\frac{\sqrt{p}\cdot W}{r}\) and height \(\sqrt{p}\cdot H \cdot r\), which allows different aspect ratios r while preserving the rectangle's area \(A_R\). Given p, we can determine the upper and lower bounds of r based on the image size, i.e., \(\frac{\sqrt{p}\cdot W}{r}\le W\) and \(\sqrt{p}\cdot H\cdot r\le H\), which entail \(\sqrt{p}\le r\le \frac{1}{\sqrt{p}}\). Similarly, given p and r, we can determine the upper and lower bounds of the rectangle's center \(c_R=(f_W\cdot W, f_H\cdot H)\):

$$\begin{aligned} f_W\cdot W - \frac{W\sqrt{p}}{2r}\ge 0 \end{aligned}$$
(1)
$$\begin{aligned} f_W\cdot W + \frac{W\sqrt{p}}{2r}\le W \end{aligned}$$
(2)
$$\begin{aligned} f_H\cdot H - \frac{\sqrt{p}\cdot H\cdot r}{2}\ge 0 \end{aligned}$$
(3)
$$\begin{aligned} f_H\cdot H + \frac{\sqrt{p}\cdot H\cdot r}{2}\le H \end{aligned}$$
(4)

that entail \(\frac{\sqrt{p}}{2r}\le f_W\le 1-\frac{\sqrt{p}}{2r}\) and \(\frac{r\sqrt{p}}{2}\le f_H\le 1-\frac{r\sqrt{p}}{2}\).

Consequently, we consider 17 equidistant values for p and 10 equidistant values each for r, \(f_W\) and \(f_H\), covering the corresponding ranges. Thus, 17,000 candidate rectangles R are created per \(M_i\); among them, we first select those that do not overlap with any \(B\in {\textbf {B}}_i\), and out of those we keep the ones with the maximum area. Finally, we randomly choose one rectangle \(R_{V_i}\) and crop the corresponding part from \(M_i\) to obtain its visual part \(V_i\).
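As an illustration of this search, the following sketch enumerates the candidate rectangles over the \((p, r, f_W, f_H)\) grid and picks one maximal text-free rectangle. It assumes the text boxes in \({\textbf {B}}_i\) are given as \((x_0, y_0, x_1, y_1)\) tuples in pixel coordinates; the function names and the sampled range of p are our own choices, not the paper's implementation.

```python
import random
import numpy as np

def overlaps(rect, box):
    """Axis-aligned overlap test between two (x0, y0, x1, y1) rectangles."""
    return not (rect[2] <= box[0] or rect[0] >= box[2] or
                rect[3] <= box[1] or rect[1] >= box[3])

def find_visual_part(W, H, text_boxes, n_p=17, n_other=10):
    """Return the largest text-free rectangle (x0, y0, x1, y1), or None.

    Candidates cover a fraction p of the image area with aspect-ratio factor r
    and center (f_W * W, f_H * H), as derived in Sect. 3.1.1.
    """
    candidates = []
    for p in np.linspace(0.05, 0.85, n_p):            # area fractions (range assumed, not stated)
        sp = np.sqrt(p)
        for r in np.linspace(sp, 1.0 / sp, n_other):  # sqrt(p) <= r <= 1/sqrt(p)
            w, h = sp * W / r, sp * H * r             # candidate width and height
            for f_w in np.linspace(sp / (2 * r), 1 - sp / (2 * r), n_other):
                for f_h in np.linspace(r * sp / 2, 1 - r * sp / 2, n_other):
                    cx, cy = f_w * W, f_h * H
                    rect = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                    if not any(overlaps(rect, b) for b in text_boxes):
                        candidates.append((w * h, rect))
    if not candidates:
        return None                                   # e.g., the 16 memes fully covered by text
    best_area = max(a for a, _ in candidates)
    best = [rect for a, rect in candidates if np.isclose(a, best_area)]
    return random.choice(best)                        # random pick among maximal-area rectangles
```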

3.1.2 Utilization

We use the extracted visual parts \(V_i\) of image memes \(M_i\) as regular image instances in order to force the model to focus on the critical parts that discriminate the two classes. More precisely, we consider the set \({\textbf {M}}=\{M_i\}_{i=1}^k\) that contains image memes and the set \({\textbf {V}}=\{V_i\}_{i=1}^k\) that contains the corresponding visual parts. In Fig. 2, we illustrate the construction process of \({\textbf {M}}\) and \({\textbf {V}}\) through an example. To assess the extent to which VPU is useful, we also conduct experiments mixing instances of \({\textbf {V}}\) and web-scraped regular images for the construction of the regular image class \({\textbf {R}}\). Additionally, given the inherent presence of text in image memes, another crucial aspect is the extent to which text presence in regular images affects the model's robustness. Hence, we consider two more sets as pools for the construction of \({\textbf {R}}\), namely \({\textbf {R}}_p=\{R_i^p\}_{i=1}^k\) for web-scraped regular images with text and \({\textbf {R}}_a=\{R_i^a\}_{i=1}^k\) for web-scraped regular images without text. The model's objective is to correctly classify the instances of the two sets, \({\textbf {M}}\) and \({\textbf {R}}\), with:

$$\begin{aligned} {\textbf {R}}=\{V_i\}_{i=1}^{k\cdot (1-P_{W})}\cup \{R_i^p\}_{i=1}^{k\cdot P_{W}\cdot P_{T}}\cup \{R_i^a\}_{i=1}^{k\cdot P_{W}\cdot (1-P_{T})} \end{aligned}$$
(5)

where \(P_W\) and \(P_T\) denote the fraction of web-scraped regular images out of the total number of regular images and the fraction of web-scraped regular images with text presence out of the total number of web-scraped regular images, respectively. In that way, \({\textbf {M}}\) and \({\textbf {R}}\) preserve the same cardinality k.
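A minimal sketch of this mixing step, under the assumption that \({\textbf {V}}\), \({\textbf {R}}_p\) and \({\textbf {R}}_a\) are available as equally sized lists of image identifiers (function and variable names are ours; fractional counts are simply rounded):

```python
import random

def build_regular_class(V, R_p, R_a, P_W, P_T, seed=0):
    """Mix visual parts and web-scraped images into the regular class R (Eq. 5).

    V, R_p, R_a are lists of equal length k; P_W is the fraction of web-scraped
    images in R and P_T the fraction of those that contain text.
    """
    k = len(V)
    n_web = round(k * P_W)                   # web-scraped images in R
    n_text = round(n_web * P_T)              # ...of which contain text
    rng = random.Random(seed)
    R = (rng.sample(V, k - n_web)            # visual parts of memes
         + rng.sample(R_p, n_text)           # web images with text
         + rng.sample(R_a, n_web - n_text))  # web images without text
    return R                                 # |R| == k, matching |M|

# Example: scenario S_9 = (P_W = 67%, P_T = 100%)
# R = build_regular_class(V, R_p, R_a, P_W=0.67, P_T=1.0)
```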

3.2 Model architecture

We propose Vision Transformer with Trainable Attention (ViTa), which augments ViT [6] with an attention module that leverages information from multiple processing stages. A similar approach was first successfully applied to CNNs [12].

3.2.1 ViT

The input image \({\textbf {x}}\in {\mathbb {R}}^{H\times W\times C}\) is reshaped into a sequence of flattened 2D patches \({\textbf {x}}_p\in {\mathbb {R}}^{N \times (P^2\cdot C)}\), where H is the height, W the width, C the number of channels, P the patches' side length and \(N=HW/P^2\) the number of patches. Then, \({\textbf {x}}_p\) is linearly projected to D dimensions through a dense layer, a learnable class token is prepended to the sequence and learnable 1D position embeddings are added to the \(N+1\) tokens, resulting in the Transformer encoder input \({\textbf {z}}_0\in {\mathbb {R}}^{(N+1)\times D}\). Subsequently, L Transformer encoder layers process the input to produce the final vector representations:

$$\begin{aligned} {\textbf {z}}_l^{'}=\text {MSA}(\text {LN}({\textbf {z}}_{l-1}))+{\textbf {z}}_{l-1} \end{aligned}$$
(6)
$$\begin{aligned} {\textbf {z}}_l=\text {MLP}(\text {LN}({\textbf {z}}_l^{'}))+{\textbf {z}}_l^{'} \end{aligned}$$
(7)

where MSA is multiheaded self-attention with h heads [28], LN is layer normalization [4], MLP is a multilayer perceptron with two GELU-activated [11] layers of \(2\cdot D\) and D units, respectively, and \(l=1,\dots ,L\). Finally, a general representation \({\textbf {y}}\) describing the whole image is extracted by passing \({\textbf {z}}_L^0\), namely the class token's embedding after L Transformer encoder layers, through layer normalization:

$$\begin{aligned} {\textbf {y}}=\text {LN}({\textbf {z}}_L^0) \end{aligned}$$
(8)
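For concreteness, the following PyTorch sketch re-implements the backbone described above with the hyperparameters of Sect. 4.4. It is an illustrative reconstruction, not the authors' code: the class names are ours, the strided convolution is used as the standard equivalent of patch flattening plus dense projection, and all per-layer token embeddings are returned because the attention module of Sect. 3.2.2 needs them.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Pre-norm Transformer encoder layer: Eqs. (6) and (7)."""
    def __init__(self, D=64, heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(D)
        self.msa = nn.MultiheadAttention(D, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(D)
        self.mlp = nn.Sequential(nn.Linear(D, 2 * D), nn.GELU(), nn.Linear(2 * D, D))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]       # Eq. (6)
        return z + self.mlp(self.ln2(z))                       # Eq. (7)

class ViTBackbone(nn.Module):
    """Patch embedding + L encoder layers; returns y and all layer outputs."""
    def __init__(self, H=250, W=250, C=3, P=25, D=64, L=8, heads=4):
        super().__init__()
        N = (H // P) * (W // P)                                # number of patches (100)
        self.proj = nn.Conv2d(C, D, kernel_size=P, stride=P)   # patch flatten + projection
        self.cls = nn.Parameter(torch.zeros(1, 1, D))          # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, N + 1, D))      # learnable 1D position embeddings
        self.layers = nn.ModuleList([EncoderLayer(D, heads) for _ in range(L)])
        self.ln = nn.LayerNorm(D)

    def forward(self, x):                                      # x: (B, C, H, W)
        z = self.proj(x).flatten(2).transpose(1, 2)            # (B, N, D) patch tokens
        z = torch.cat([self.cls.expand(len(x), -1, -1), z], dim=1) + self.pos
        outputs = []
        for layer in self.layers:
            z = layer(z)
            outputs.append(z)                                  # keep every layer for ViTa
        y = self.ln(outputs[-1][:, 0])                         # Eq. (8): normalized class token
        return y, outputs
```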

3.2.2 Attention module

ViT contains multiple self-attention layers in which the class token’s embedding receives information from the patch embeddings of the same layer. However, it lacks an attention module that combines information from past layers that capture semantics of different granularity levels. To this end, we first compute a compatibility score between \({\textbf {y}}\) and the patch embeddings of odd layers \(\{{\textbf {z}}_1^{1:N}\}\), \(\{{\textbf {z}}_3^{1:N}\}\), \(\dots \), \(\{{\textbf {z}}_n^{1:N}\}\) (\(n=L-1\) if L is even and \(n=L\) if L is odd), by:

$$\begin{aligned} s_l^i = \langle {\textbf {v}},[{\textbf {y}};{\textbf {z}}_l^i]\rangle \end{aligned}$$
(9)

where \(i\in \{1,\dots ,N\}\), \(l\in \{1, 3, \dots , n\}\), \([\cdot ;\cdot ]\) denotes concatenation and \({\textbf {v}}\in {\mathbb {R}}^{2D}\) is a learnable vector. The attention weights are calculated through softmax as:

$$\begin{aligned} a_l^i=\frac{\exp (s_l^i)}{\sum _{j=1}^N \exp (s_l^j)} \end{aligned}$$
(10)

and the context vectors are simply the weighted average of the corresponding layer’s patch embeddings \({\textbf {c}}_l=\sum _{i=1}^N a_l^i\cdot {\textbf {z}}_l^i\).
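Continuing the sketch above, the attention module of Eqs. (9) and (10) can be written as follows. The class name is ours, and we read \({\textbf {v}}\) as a single vector shared across the attended layers, which is our interpretation of Eq. (9).

```python
import torch
import torch.nn as nn

class OddLayerAttention(nn.Module):
    """Trainable attention over patch embeddings of odd encoder layers (Eqs. 9-10)."""
    def __init__(self, D=64, L=8):
        super().__init__()
        self.odd = list(range(0, L, 2))                   # 0-based indices of layers 1, 3, ..., n
        self.v = nn.Parameter(torch.randn(2 * D) * 0.02)  # learnable v (read as shared across layers)

    def forward(self, y, outputs):
        contexts, weights = [], []
        for l in self.odd:
            z = outputs[l][:, 1:]                          # patch embeddings z_l^{1:N}: (B, N, D)
            yz = torch.cat([y.unsqueeze(1).expand(-1, z.size(1), -1), z], dim=-1)  # [y; z_l^i]
            s = yz @ self.v                                # compatibility scores s_l^i (Eq. 9)
            a = torch.softmax(s, dim=1)                    # attention weights a_l^i (Eq. 10)
            contexts.append((a.unsqueeze(-1) * z).sum(1))  # context c_l = sum_i a_l^i z_l^i
            weights.append(a)
        return torch.cat(contexts, dim=-1), weights        # c = [c_1; c_3; ...; c_n], plus weights
```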

3.2.3 Classification module

The concatenation of all context vectors \({\textbf {c}}= [{\textbf {c}}_1;{\textbf {c}}_3;\dots ;{\textbf {c}}_n]\in {\mathbb {R}}^{D^{'}}\) (\(D^{'}=([n/2]+1)\cdot D\), where \([\cdot ]\) denotes the integer part) is processed by three dense layers for the final prediction; the first two are GELU-activated and the last has a single sigmoid unit:

$$\begin{aligned} y=\text {sigmoid}({\textbf {w}}_3\cdot \text {GELU}({\textbf {w}}_2\cdot \text {GELU}({\textbf {w}}_1\cdot {\textbf {c}}))) \end{aligned}$$
(11)

where \({\textbf {w}}_1\in {\mathbb {R}}^{2048\times D^{'}}\), \({\textbf {w}}_2\in {\mathbb {R}}^{1024\times 2048}\), and \({\textbf {w}}_3\in {\mathbb {R}}^{1024}\).
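Putting the pieces together, a sketch of the full ViTa model with the classification head of Eq. (11) is given below (again an illustrative reconstruction built on the classes defined above; bias terms are left at PyTorch defaults).

```python
import torch
import torch.nn as nn

class MemeTectorSketch(nn.Module):
    """Backbone + trainable attention + classification head (Eq. 11)."""
    def __init__(self, D=64, L=8):
        super().__init__()
        self.backbone = ViTBackbone(D=D, L=L)
        self.attention = OddLayerAttention(D=D, L=L)
        d_ctx = ((L + 1) // 2) * D                       # D' = ([n/2] + 1) * D = 256 for L = 8
        self.head = nn.Sequential(
            nn.Linear(d_ctx, 2048), nn.GELU(),
            nn.Linear(2048, 1024), nn.GELU(),
            nn.Linear(1024, 1), nn.Sigmoid(),            # single sigmoid unit: meme probability
        )

    def forward(self, x):
        y, outputs = self.backbone(x)
        c, attn = self.attention(y, outputs)
        return self.head(c).squeeze(-1), attn

# model = MemeTectorSketch()
# prob, attn = model(torch.rand(2, 3, 250, 250))          # prob in [0, 1] per image
```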

4 Experimental setup

4.1 Datasets

Fig. 3 Example images from the Hateful Memes dataset (a, b, c) and Google's Conceptual Captions with (d, e, f) and without (g, h, i) text presence

To form a suitable dataset that can be used for the task of meme detection, we merge instances from existing datasets that contain image memes and regular images, respectively.

For the image meme class \({\textbf {M}}\), we consider the Hateful Memes Dataset [14]. This is a multimodal dataset for hateful meme detection containing 10,000 image memes. We do not take into account whether these memes are hateful or not; rather, we use them all as instances of the image meme class. Figure 3a–c illustrates three indicative examples.

For the regular image class \({\textbf {R}}\), apart from the VPU methodology explained in Sect. 3.1, which we apply on the Hateful Memes dataset to obtain \({\textbf {V}}\), we also consider part of the widely used web-scraped Google's Conceptual Captions dataset [21]. Specifically, we randomly sample images to construct \({\textbf {R}}_p\) and \({\textbf {R}}_a\) (see Sect. 3.1.2), with text presence automatically determined by TextFuseNet [31]: an image is considered to contain text if at least one text instance is detected. Figure 3 presents three indicative examples with (d–f) and without (g–i) text presence.

4.2 Sample mixing and splitting

As a starting point for the image meme class \({\textbf {M}}\), we consider the 10,000 instances of the Hateful Memes dataset. Then, we extract the visual parts of these image memes, resulting in 9,984 images considered as regular, which form the set \({\textbf {V}}\). The mean area fraction across all \(V_i\in {\textbf {V}}\) is 64.3%. For the remaining 16 images, the VPU algorithm was unable to find a rectangle that does not overlap with text. Thus, for all four sets \({\textbf {M}}\), \({\textbf {V}}\), \({\textbf {R}}_p\) and \({\textbf {R}}_a\) we consider the same size of k=9,984 instances. To do so, we discard the corresponding 16 image memes from \({\textbf {M}}\) and sample from Google's Conceptual Captions k=9,984 instances to form \({\textbf {R}}_p\) and another k=9,984 instances to form \({\textbf {R}}_a\).

Table 1 Composition scenarios for regular images class (\({\textbf {R}}\)) construction based on \(P_W\) and \(P_T\)

For sample mixing, in order to form the regular image class \({\textbf {R}}\), we only need to determine \(P_W\) and \(P_T\) (see Eq. 5). To assess the impact of both VPU and text presence on the model's performance, we consider several dataset composition scenarios \(S_i\)=\((P_W,P_T)\), with i=\(1,\dots ,13\), for \({\textbf {R}}\), presented in Table 1.

Furthermore, we apply the same scenarios to both the training and test sets and experiment with crossed scenarios \((S_i, S_j)\); e.g., the model is trained on \(S_1\)=(\(P_W\)=0%, \(P_T\)=0%) but evaluated on \(S_{13}\)=(\(P_W\)=100%, \(P_T\)=100%), resulting in 13\(\cdot \)13=169 crossed scenarios. For sample splitting, we select 85% training, 5% validation and 10% test samples for each set \({\textbf {M}}\), \({\textbf {V}}\), \({\textbf {R}}_p\), \({\textbf {R}}_a\) and construct \({\textbf {R}}\) per split. For \({\textbf {M}}\) and \({\textbf {V}}\) we use the same index split so as not to include the visual part \(V_i\) in one split, e.g., training, and the original image meme \(M_i\) in another, e.g., test. The training and validation sets always derive from the same scenario \(S_i\), while evaluation is performed on all test scenarios \(S_j\).
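The scenario grid and the crossed train/test pairs can be enumerated as in the sketch below. The exact contents of Table 1 are not reproduced here, so the grid of \(P_W\) and \(P_T\) values is our reconstruction from the scenarios quoted in the text (with \(P_T\) fixed to 0% when \(P_W\)=0%, which yields 13 scenarios).

```python
from itertools import product

# Scenario grid reconstructed from the text (values in %): P_T is only
# meaningful when web-scraped images are present, i.e., when P_W > 0.
P_VALUES = [0, 33, 67, 100]
SCENARIOS = [(0, 0)] + [(pw, pt) for pw in P_VALUES[1:] for pt in P_VALUES]
assert len(SCENARIOS) == 13                       # S_1 ... S_13

# Crossed scenarios: train on S_i, evaluate on S_j -> 13 * 13 = 169 runs.
CROSSED = list(product(range(13), repeat=2))
i, j = CROSSED[0]
print(f"train S_{i + 1}={SCENARIOS[i]}  ->  test S_{j + 1}={SCENARIOS[j]}")
```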

4.3 Competitive models

Meme detection is an understudied research topic. However, a few published approaches exist, leveraging ImageNet pre-trained ResNet features [7], combining ResNet, AlexNet and DenseNet fine-tuned on the task [20], or fine-tuning VGG16 and ResNet on the task [29]. These works consider text representations as additional inputs yet do not assess their impact. In this study, we focus only on the visual signal, for three reasons:

  1. Meme detection is a visually driven task: the critical information regarding the embedded text is the font size, color, family and position rather than the actual text content.

  2. Reliably recognizing text in image memes (let alone regular images) is a challenging task on its own, so an error-prone text recognition component would add another layer of unnecessary complexity.

  3. Manual text recognition by humans, as in the dataset used by the aforementioned studies, is unrealistic for automatic meme detection.

Hence, we consider state-of-the-art and baseline models from the image classification domain as competing methods. More precisely, we consider VGG16 [23], ResNet50 [9], EfficientNetB5 [27], and ViT [6].

4.4 Training details

For ViT and ViTa, we consider input size (H, W, C)=\((250, 250, 3)\) and patch size P=25, which entail N=100 patches. We use projection dimension D=64, L=8 Transformer encoder layers and h=4 MSA heads. In total, ViTa has 3.4 M parameters, all trained from scratch; ViT is likewise trained from scratch. For EfficientNetB5, ResNet50 and VGG16, we use ImageNet pre-trained weights, discard the last layer, add a dense sigmoid-activated layer and train only this newly added layer.

All models are trained for 20 epochs with batch size 64, using the AdamW optimizer [16] with a weight decay of 1e-3 and the binary cross-entropy loss. The first 10% of the iterations are linear warm-up steps, with learning rate \(\lambda (t)\) increasing from 0 to 1e-3, which then decays as follows:

$$\begin{aligned} \lambda (t) = \frac{\text {1e-3}}{(1+d\cdot t\cdot 1.001^t)} \end{aligned}$$
(12)

where t is the iteration and d=1e-3/20. We checkpoint models based on validation accuracy. For ViT and ViTa, we preprocess input images by first normalizing values to [0,1] and then standardizing them, i.e., subtracting the mean and dividing by the standard deviation per channel (computed on the training set). For the other models, we use the standard preprocessing pipeline provided by Keras.
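The learning rate schedule can be sketched as below. Whether t in Eq. (12) restarts after the warm-up phase is not specified, so the sketch simply uses the global iteration index.

```python
def learning_rate(t, total_steps, base_lr=1e-3, d=1e-3 / 20):
    """Linear warm-up for the first 10% of iterations, then the decay of Eq. (12)."""
    warmup = int(0.1 * total_steps)
    if t < warmup:
        return base_lr * (t + 1) / warmup        # linear warm-up from 0 to 1e-3
    return base_lr / (1.0 + d * t * 1.001 ** t)  # Eq. (12)

# Example: with ~2 * 9984 * 0.85 training images and batch size 64,
# an epoch has roughly 266 iterations.
# lrs = [learning_rate(t, total_steps=20 * 266) for t in range(20 * 266)]
```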

All models are evaluated on various test settings using the binary accuracy metric.

5 Results

Table 2 MemeTector’s performance in terms of binary accuracy over all crossed scenarios

5.1 Ablation study

Table 3 Impact of VPU usage in terms of aggregated accuracy

In Table 2, we present the performance of the proposed MemeTector model on all crossed scenarios between training and test settings. For easier interpretation, the best performance per test scenario is shown in bold. First, we observe that in almost all crossed scenarios MemeTector obtains high accuracy, ranging between 89.0% and 97.8% (mean 94.98%, standard deviation 1.47). The \(S_9\)=(\(P_W\)=67%, \(P_T\)=100%) training scenario not only provides the best performance on average and the highest accuracy in most test scenarios (10 out of 13), but also preserves the model's robustness across all test scenarios. On the contrary, the highest variability is observed for the \(S_2\)=(\(P_W\)=33%, \(P_T\)=0%) training scenario, which yields high accuracy in test scenarios with \(P_T\)=0% but low accuracy in test scenarios with text presence. The best performance is achieved in the (\(S_7\), \(S_1\)) scenario, where MemeTector is trained on (\(P_W\)=67%, \(P_T\)=33%) and evaluated on (\(P_W\)=0%, \(P_T\)=0%). The worst performance is obtained with (\(S_1\), \(S_{13}\)), where MemeTector is trained on (\(P_W\)=0%, \(P_T\)=0%) and evaluated on (\(P_W\)=100%, \(P_T\)=100%). This is expected: training on samples of diverse nature helps the model generalize to easy scenarios such as \(S_1\), while evaluating on a setting dissimilar to and harder than the training one leads to lower performance. It is also remarkable that, when evaluating on the \(S_{10}\)–\(S_{13}\) scenarios, training on \(S_9\) outperforms training on \(S_{10}\)–\(S_{13}\) themselves. This showcases the usefulness of the VPU methodology in training set construction.

In Table 3, we present comparative results regarding VPU. The first four columns show the maximum performance of models trained without VPU, and the last row shows the corresponding results of MemeTector trained with VPU. The last column of Table 3 provides the average performance. We observe that VPU improves our model's performance; moreover, when VPU is incorporated, MemeTector outperforms the competition in 3 out of 4 test scenarios as well as on average.

5.2 Comparative study

Based on the analysis presented in Sect. 5.1, we employ our best configuration for the comparison with the competing models presented in Sect. 4.3. The most robust training scenario, which also provides the best performance in most evaluation scenarios, is \(S_9\)=(\(P_W\)=67%, \(P_T\)=100%). Hence, we use this configuration for the comparison with the state of the art, namely the proposed MemeTector architecture trained with 33% VPU-created regular images and 67% web-scraped images that all contain text. Additionally, we train and evaluate VGG16, ResNet50, EfficientNetB5 and ViT on all crossed scenarios.

In Table 4, we present the average performance of each model over all crossed scenarios, where the proposed methodology achieves the highest score. Moreover, Table 5 presents the fraction of crossed scenarios in which MemeTector surpasses each competing model. Although the average difference between our model and the second best, namely ViT, is only 0.81%, MemeTector actually outperforms the latter in 87.57% of the crossed scenarios. Similarly, the proposed methodology outperforms the baselines in the majority of cases.

Table 4 Model performance in terms of average binary accuracy on all crossed scenarios
Table 5 Fraction of crossed scenarios where the MemeTector model surpasses competitive models
Fig. 4 Attention plots from MemeTector's trainable attention mechanism. All predictions are correct. The upper 10 images are memes, while the lower 10 are regular images. Of the regular images, the first 5 contain no text while the last 5 do

5.3 Attention plots

Fig. 5 Twitter images classified by MemeTector: a through d are correct predictions, while e through h are wrong predictions. a, b, g and h are image memes, while c, d, e and f are regular images. The ground truth label is shown to the left of the arrow and the MemeTector prediction to the right

Figure 4 illustrates attention plots from MemeTector's trainable attention mechanism. Since this mechanism attends back to four layers, we show the attention weights averaged across these four layers. We observe that, for image memes, MemeTector attends mostly to the areas where text is present, almost ignoring the background content. However, it does not attend to whole sentences but only to a few seemingly random parts of them, which suggests that analyzing the font morphology (which is the same throughout each sentence) provides sufficient information for accurate discrimination. Similarly, humans do not need to focus on every text-containing part of an image meme to make an informed and accurate decision. In regular images, MemeTector focuses on the main depicted concepts, as well as on the text if it is present and prominent. Presumably, the reason for not classifying regular images with prominent text as image memes is the morphology and position of the fonts.
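Such a plot can be produced from the weights returned by the attention module sketched earlier; the upsampling and normalization choices below are ours, for illustration only.

```python
import numpy as np
import torch

def attention_map(weights, H=250, W=250, P=25):
    """Average the per-layer attention weights and upsample to image resolution.

    `weights` is the list returned by OddLayerAttention: one (B, N) tensor per
    attended layer, with N = (H // P) * (W // P) patches.
    """
    avg = torch.stack(weights).mean(0)[0]                  # average over layers, first image in batch
    grid = avg.reshape(H // P, W // P).detach().cpu().numpy()  # 10 x 10 patch grid
    heatmap = np.kron(grid, np.ones((P, P)))               # nearest-neighbour upsample to 250 x 250
    return heatmap / heatmap.max()                         # normalized map to overlay on the image
```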

5.4 Use case on Twitter images

We also evaluate MemeTector on images from Twitter in order to assess its applicability in a practical use case. Specifically, we consider three relevant queries, namely "meme," "dankmemes" and "memesdaily," and download 19,502 recent tweets on 15 April 2022. Of the collected tweets, only 6,256 contain an image: 2,071 from the "meme," 1,660 from the "dankmemes" and 2,525 from the "memesdaily" query, respectively. We download these images and drop duplicates, leaving 3,199 images; seven of these are gray-scale and are also discarded, since our model has been trained on RGB images only. The remaining images are provided as input to MemeTector.

The model detected 1,342 memes (42%) and 1,850 regular images (58%). To obtain quantitative results, we manually labeled the Twitter images, compared them with MemeTector's predictions and computed the model's accuracy. We found TP=877, FP=396, TN=1,454 and FN=465, which amounts to 73% accuracy. Note that, although the queries are related to image memes, regular images are also retrieved. As expected, performance drops on this uncontrolled and noisy real-world data, but it can still be considered successful, especially given the quite different characteristics of the images used for training the model.
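For reference, the reported accuracy follows directly from these confusion counts:

```python
# Confusion counts reported for the Twitter use case
TP, FP, TN, FN = 877, 396, 1454, 465
total = TP + FP + TN + FN           # 3,192 analyzed images
accuracy = (TP + TN) / total        # (877 + 1454) / 3192 ~= 0.73
print(f"{accuracy:.1%} accuracy over {total} images")
```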

In Fig. 5, we present indicative correct and erroneous predictions from the experiment on Twitter images. The correctly classified image meme in Fig. 5a has a format similar to MemeTector's training samples, while the format of Fig. 5b, with text over objects or persons, is not present in the training set; the model still recognizes it from the font morphology and the background image semantics. There are many other meme formats not present in the training set that the model recognizes as well. The correctly classified regular images do not confuse the model even though they contain text, thanks to the incorporation of text-containing regular images at training time. The misclassified regular images contain meme-like overlay text and text at the top, which appears to mislead MemeTector. The first misclassified image meme (Fig. 5g) contains only two numbers as overlay text, which is not a common meme form in the training set. The second (Fig. 5h) has a structure similar to the training samples and MemeTector focuses on the fonts, but their morphology is different, which might be the reason for the miss.

6 Conclusions

In this work, we address the problem of image meme detection. We introduce a novel artificial dataset creation process termed visual part utilization (VPU), which extracts the visual part of an image meme and utilizes this new image as an instance of the regular image class. Additionally, we propose a trainable attention mechanism on top of a ViT architecture that combines different levels of information granularity, leading not only to improved performance but also to interpretability of the model's choices. The findings show that our model surpasses the state of the art and demonstrate the usefulness of incorporating VPU in the training of MemeTector. Finally, we validated the proposed methodology on a practical use case involving the retrieval and classification of image memes from Twitter.