1 Introduction

Image memes have become established in recent years as a popular means of communication in social media. Their typical form, known as image macros, comprises images with overlay text at the top and/or bottom and is principally used to express a spectrum of concepts and emotions such as humor, irony, sarcasm and even hate. Memes and regular images have critical visual differences that make their discrimination an easy task for a human: the overlay text has a specific font size, color, family and position, and the background image usually carries a cultural reference or is otherwise memorable. In contrast, regular images may depict anything without such constraints. In Fig. 1, we show one image meme and one regular image to illustrate the differences between the two types of digital media.

Other forms of Internet memes also exist; for instance, they may be plain text [10], tweet screenshots, social statement cards, logos [18] or images reusing memorable visual elements in different creative ways, such as Bernie Sanders' mittens. In addition, the adoption of different meme forms appears to be highly platform-dependent, as community-specific vernaculars shape different meme cultures [18]. Here, we only address the detection of the typical Internet meme form, namely image macros: background images with superimposed text, as in Fig. 1a.

Fig. 1 Example image meme versus a regular image

In the framework of analyzing digital social behavior and trends, image memes have attracted research interest [5, 19, 30, 32], mostly with a focus on deep learning models for image meme classification [1, 3, 8, 25] and, more frequently, for the detection of hateful image memes [2, 13, 15, 33]. The latter works rely on datasets that already contain labeled image memes and therefore do not address whether an input image is a meme in the first place. The detection of image memes and their discrimination from regular images remains a relatively understudied topic, with only a few attempts in this direction [17, 24].

Fig. 2 The proposed visual part utilization process. The original image \(M_i\), which belongs to the set of image memes \({\textbf {M}}\), is passed through the visual part extraction algorithm, which identifies the corresponding visual part \(V_i\), crops it and adds it to the set \({\textbf {V}}\). Best viewed in color

In this work, we present MemeTector, a model for efficiently classifying images as memes or regular images. Deployed in online social environments to retrieve memes, it can facilitate the monitoring and analysis of web trends and behaviors as well as the detection of harmful practices that are sometimes carried out through memes, such as hate speech and disinformation. To force the model to focus on the critical visual cues that characterize both classes, we propose visual part utilization (VPU), a methodology for constructing an artificial dataset from an existing image meme dataset [14], and a deep learning architecture called ViTa, which employs a trainable attention mechanism on top of a Vision Transformer (ViT) [6]. Although we propose ViTa for meme detection, the architecture is general and can potentially be applied to other tasks. Regarding VPU, from each image meme instance \(M_i\) of the initial dataset we extract the largest part that contains no text and use it as a regular image instance. We denote the latter \(V_i\) and call it the visual part of image meme \(M_i\). In Fig. 2, we present the construction process for the set \({\textbf {M}}\) of image memes and the set \({\textbf {V}}\) of visual parts. In essence, a meme's background image is a regular image: humans easily tell it apart from the meme, whereas neural networks initially produce almost identical feature maps for both. VPU is therefore employed here to effectively enhance the learning of the subtle peculiarities of the two class distributions.

The paper is structured as follows. Section 2 reviews the related literature. Section 3 elaborates the proposed methodology. Section 4 describes the experimental setup. Section 5 presents the results. Section 6 concludes the paper.

2 Related work

Previous related studies focus mainly on the classification of memes into categories such as hateful or offensive [26]. Due to their multi-modal nature, image meme classification is most frequently treated as a multi-modal analysis problem that processes both visual and textual content [1]. Other studies attempt to classify image memes in terms of their sentiment (positive, negative, neutral) and type of humor (e.g., sarcastic or motivational) [22]. However, the topic of this study precedes the work of classifying image memes into certain categories: one first needs to know whether an image is a meme at all before analyzing it further.

The topic of image meme detection, i.e., automatically discriminating image memes from regular images, has not yet received considerable attention from the research community. To our knowledge, there is only one dataset for meme detection, namely the DankMemes dataset [17], which was released in 2020 but is publicly unavailable at the time of writing. DankMemes contains 2000 images related to the 2019 Italian government crisis, half of which are memes while the rest are regular images. In terms of competing approaches, only a few meme detection methods have been presented recently [7, 20, 29]. The related problem of identifying satire images on social media is addressed in [24]. Finally, a few other attempts exist on the Internet, for instance in blog posts or GitHub repositories, but they are not peer reviewed.

This is the first paper to utilize the visual part of image memes as instances of the regular image class, thus enabling: (i) a fully accurate yet automatic annotation process that reuses existing meme classification datasets, and (ii) the creation of a dataset of 40,000 images (20 times larger than the DankMemes dataset). Additionally, we are the first to employ a supplemental attention mechanism on top of a ViT architecture, which combines different levels of information granularity and provides interpretability of the results.

3 Methodology

Here, we present MemeTector’s building blocks, namely VPU and ViTa.

3.1 Visual part utilization

3.1.1 Extraction

To extract the visual part \(V_i\) of a given image meme \(M_i\) (\(i=1,\dots ,k\)), one first needs to locate the text elements in it. To this end, we use a state-of-the-art deep learning-based text detection model called TextFuseNet [31], which processes \(M_i\) and produces the set \({\textbf {B}}_i\) of detected text bounding boxes. We only keep boxes corresponding to whole words, since bounding boxes of individual letters are not useful for our task. Then, we apply Algorithm 1 to find the largest rectangle that contains no text and take that part of \(M_i\) as its visual part \(V_i\).

Algorithm 1 Visual part extraction

A rectangle R covers a fraction p of the initial image area, namely \(A_{R}=p\cdot W\cdot H\), where W and H are the width and height of the initial image. Thus, R can have width \(\frac{\sqrt{p}\cdot W}{r}\) and height \(\sqrt{p}\cdot H \cdot r\), which allows different aspect ratios r while preserving the rectangle's area \(A_R\). Given p, we can determine the upper and lower bounds of r based on the image size, i.e., \(\frac{\sqrt{p}\cdot W}{r}\le W\) and \(\sqrt{p}\cdot H\cdot r\le H\), which entail \(\sqrt{p}\le r\le \frac{1}{\sqrt{p}}\). Similarly, given p and r, we can determine the upper and lower bounds of the rectangle's center \(c_R=(f_W\cdot W, f_H\cdot H)\):

$$\begin{aligned} f_W\cdot W - \frac{W\sqrt{p}}{2r}\ge 0 \end{aligned}$$
(1)
$$\begin{aligned} f_W\cdot W + \frac{W\sqrt{p}}{2r}\le W \end{aligned}$$
(2)
$$\begin{aligned} f_H\cdot H - \frac{\sqrt{p}\cdot H\cdot r}{2}\ge 0 \end{aligned}$$
(3)
$$\begin{aligned} f_H\cdot H + \frac{\sqrt{p}\cdot H\cdot r}{2}\le H \end{aligned}$$
(4)

that entail \(\frac{\sqrt{p}}{2r}\le f_W\le 1-\frac{\sqrt{p}}{2r}\) and \(\frac{r\sqrt{p}}{2}\le f_H\le 1-\frac{r\sqrt{p}}{2}\).

Consequently, we consider 17 equidistant values for p and 10 equidistant values each for r, \(f_W\) and \(f_H\), covering the corresponding ranges. Thus, 17,000 candidate rectangles R are created per \(M_i\); among them, we first select those that do not overlap with any \(B\in {\textbf {B}}_i\), and out of those we keep the ones with the maximum area. Finally, we randomly choose one rectangle \(R_{V_i}\) and crop the corresponding part from \(M_i\) to obtain its visual part \(V_i\).
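As an illustration of this search, the following sketch enumerates the candidate rectangles over the \((p, r, f_W, f_H)\) grid and picks one maximal text-free rectangle. It assumes the text boxes in \({\textbf {B}}_i\) are given as \((x_0, y_0, x_1, y_1)\) tuples in pixel coordinates; the function names and the sampled range of p are our own choices, not the paper's implementation.

```python
import random
import numpy as np

def overlaps(rect, box):
    """Axis-aligned overlap test between two (x0, y0, x1, y1) rectangles."""
    return not (rect[2] <= box[0] or rect[0] >= box[2] or
                rect[3] <= box[1] or rect[1] >= box[3])

def find_visual_part(W, H, text_boxes, n_p=17, n_other=10):
    """Return the largest text-free rectangle (x0, y0, x1, y1), or None.

    Candidates cover a fraction p of the image area with aspect-ratio factor r
    and center (f_W * W, f_H * H), as derived in Sect. 3.1.1.
    """
    candidates = []
    for p in np.linspace(0.05, 0.85, n_p):            # area fractions (range assumed, not stated)
        sp = np.sqrt(p)
        for r in np.linspace(sp, 1.0 / sp, n_other):  # sqrt(p) <= r <= 1/sqrt(p)
            w, h = sp * W / r, sp * H * r             # candidate width and height
            for f_w in np.linspace(sp / (2 * r), 1 - sp / (2 * r), n_other):
                for f_h in np.linspace(r * sp / 2, 1 - r * sp / 2, n_other):
                    cx, cy = f_w * W, f_h * H
                    rect = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                    if not any(overlaps(rect, b) for b in text_boxes):
                        candidates.append((w * h, rect))
    if not candidates:
        return None                                   # e.g., the 16 memes fully covered by text
    best_area = max(a for a, _ in candidates)
    best = [rect for a, rect in candidates if np.isclose(a, best_area)]
    return random.choice(best)                        # random pick among maximal-area rectangles
```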

3.1.2 Utilization

We use the extracted visual parts \(V_i\) of image memes \(M_i\) as regular image instances in order to force the model to focus on the critical parts that discriminate the two classes. More precisely, we consider the set \({\textbf {M}}=\{M_i\}_{i=1}^k\) that contains image memes and the set \({\textbf {V}}=\{V_i\}_{i=1}^k\) that contains the corresponding visual parts. In Fig. 2, we illustrate the construction process of \({\textbf {M}}\) and \({\textbf {V}}\) through an example. To assess the extent to which VPU is useful, we also conduct experiments mixing instances of \({\textbf {V}}\) and web-scraped regular images for the construction of the regular image class \({\textbf {R}}\). Additionally, given the inherent presence of text in image memes, another crucial aspect is the extent to which text presence in regular images affects the model's robustness. Hence, we consider two more sets as pools for the construction of \({\textbf {R}}\), namely \({\textbf {R}}_p=\{R_i^p\}_{i=1}^k\) for web-scraped regular images with text and \({\textbf {R}}_a=\{R_i^a\}_{i=1}^k\) for web-scraped regular images without text. The model's objective is to correctly classify the instances of the two sets, \({\textbf {M}}\) and \({\textbf {R}}\), with:

$$\begin{aligned} {\textbf {R}}=\{V_i\}_{i=1}^{k\cdot (1-P_{W})}\cup \{R_i^p\}_{i=1}^{k\cdot P_{W}\cdot P_{T}}\cup \{R_i^a\}_{i=1}^{k\cdot P_{W}\cdot (1-P_{T})} \end{aligned}$$
(5)

where \(P_W\) and \(P_T\) denote the fraction of web-scraped regular images out of the total number of regular images and the fraction of web-scraped regular images with text presence out of the total number of web-scraped regular images, respectively. In that way, \({\textbf {M}}\) and \({\textbf {R}}\) preserve the same cardinality k.
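A minimal sketch of this mixing step, under the assumption that \({\textbf {V}}\), \({\textbf {R}}_p\) and \({\textbf {R}}_a\) are available as equally sized lists of image identifiers (function and variable names are ours; fractional counts are simply rounded):

```python
import random

def build_regular_class(V, R_p, R_a, P_W, P_T, seed=0):
    """Mix visual parts and web-scraped images into the regular class R (Eq. 5).

    V, R_p, R_a are lists of equal length k; P_W is the fraction of web-scraped
    images in R and P_T the fraction of those that contain text.
    """
    k = len(V)
    n_web = round(k * P_W)                   # web-scraped images in R
    n_text = round(n_web * P_T)              # ...of which contain text
    rng = random.Random(seed)
    R = (rng.sample(V, k - n_web)            # visual parts of memes
         + rng.sample(R_p, n_text)           # web images with text
         + rng.sample(R_a, n_web - n_text))  # web images without text
    return R                                 # |R| == k, matching |M|

# Example: scenario S_9 = (P_W = 67%, P_T = 100%)
# R = build_regular_class(V, R_p, R_a, P_W=0.67, P_T=1.0)
```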

3.2 Model architecture

We propose Vision Transformer with Trainable Attention (ViTa), which augments ViT [6] with an attention module that leverages information from multiple processing stages. A similar approach was first successfully applied to CNNs [12].

3.2.1 ViT

The input image \({\textbf {x}}\in {\mathbb {R}}^{H\times W\times C}\) is reshaped into a sequence of flattened 2D patches \({\textbf {x}}_p\in {\mathbb {R}}^{N \times (P^2\cdot C)}\), where H is the height, W the width, C the number of channels, P the patches' side length and \(N=HW/P^2\) the number of patches. Then, \({\textbf {x}}_p\) is linearly projected to D dimensions through a dense layer, a learnable class token is prepended to the sequence and learnable 1D position embeddings are added to the \(N+1\) tokens, resulting in the Transformer encoder input \({\textbf {z}}_0\in {\mathbb {R}}^{(N+1)\times D}\). Subsequently, L Transformer encoder layers process the input to produce the final vector representations:

$$\begin{aligned} {\textbf {z}}_l^{'}=\text {MSA}(\text {LN}({\textbf {z}}_{l-1}))+{\textbf {z}}_{l-1} \end{aligned}$$
(6)
$$\begin{aligned} {\textbf {z}}_l=\text {MLP}(\text {LN}({\textbf {z}}_l^{'}))+{\textbf {z}}_l^{'} \end{aligned}$$
(7)

where MSA is multiheaded self-attention with h heads [28], LN is layer normalization [4], MLP is a multilayer perceptron with two GELU-activated [11] layers of \(2\cdot D\) and D units, respectively, and \(l=1,\dots ,L\). Finally, a general representation \({\textbf {y}}\) describing the whole image is extracted by passing \({\textbf {z}}_L^0\), namely the class token's embedding after L Transformer encoder layers, through layer normalization:

$$\begin{aligned} {\textbf {y}}=\text {LN}({\textbf {z}}_L^0) \end{aligned}$$
(8)
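For concreteness, the following PyTorch sketch re-implements the backbone described above with the hyperparameters of Sect. 4.4. It is an illustrative reconstruction, not the authors' code: the class names are ours, the strided convolution is used as the standard equivalent of patch flattening plus dense projection, and all per-layer token embeddings are returned because the attention module of Sect. 3.2.2 needs them.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Pre-norm Transformer encoder layer: Eqs. (6) and (7)."""
    def __init__(self, D=64, heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(D)
        self.msa = nn.MultiheadAttention(D, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(D)
        self.mlp = nn.Sequential(nn.Linear(D, 2 * D), nn.GELU(), nn.Linear(2 * D, D))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]       # Eq. (6)
        return z + self.mlp(self.ln2(z))                       # Eq. (7)

class ViTBackbone(nn.Module):
    """Patch embedding + L encoder layers; returns y and all layer outputs."""
    def __init__(self, H=250, W=250, C=3, P=25, D=64, L=8, heads=4):
        super().__init__()
        N = (H // P) * (W // P)                                # number of patches (100)
        self.proj = nn.Conv2d(C, D, kernel_size=P, stride=P)   # patch flatten + projection
        self.cls = nn.Parameter(torch.zeros(1, 1, D))          # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, N + 1, D))      # learnable 1D position embeddings
        self.layers = nn.ModuleList([EncoderLayer(D, heads) for _ in range(L)])
        self.ln = nn.LayerNorm(D)

    def forward(self, x):                                      # x: (B, C, H, W)
        z = self.proj(x).flatten(2).transpose(1, 2)            # (B, N, D) patch tokens
        z = torch.cat([self.cls.expand(len(x), -1, -1), z], dim=1) + self.pos
        outputs = []
        for layer in self.layers:
            z = layer(z)
            outputs.append(z)                                  # keep every layer for ViTa
        y = self.ln(outputs[-1][:, 0])                         # Eq. (8): normalized class token
        return y, outputs
```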

3.2.2 Attention module

ViT contains multiple self-attention layers in which the class token’s embedding receives information from the patch embeddings of the same layer. However, it lacks an attention module that combines information from past layers that capture semantics of different granularity levels. To this end, we first compute a compatibility score between \({\textbf {y}}\) and the patch embeddings of odd layers \(\{{\textbf {z}}_1^{1:N}\}\), \(\{{\textbf {z}}_3^{1:N}\}\), \(\dots \), \(\{{\textbf {z}}_n^{1:N}\}\) (\(n=L-1\) if L is even and \(n=L\) if L is odd), by:

$$\begin{aligned} s_l^i = \langle {\textbf {v}},[{\textbf {y}};{\textbf {z}}_l^i]\rangle \end{aligned}$$
(9)

where \(i\in \{1,\dots ,N\}\), \(l\in \{1, 3, \dots , n\}\), \([\cdot ;\cdot ]\) denotes concatenation and \({\textbf {v}}\in {\mathbb {R}}^{2D}\) is a learnable vector. The attention weights are calculated through softmax as:

$$\begin{aligned} a_l^i=\frac{\exp (s_l^i)}{\sum _{j=1}^N \exp (s_l^j)} \end{aligned}$$
(10)

and the context vectors are simply the weighted average of the corresponding layer’s patch embeddings \({\textbf {c}}_l=\sum _{i=1}^N a_l^i\cdot {\textbf {z}}_l^i\).
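Continuing the sketch above, the attention module of Eqs. (9) and (10) can be written as follows. The class name is ours, and we read \({\textbf {v}}\) as a single vector shared across the attended layers, which is our interpretation of Eq. (9).

```python
import torch
import torch.nn as nn

class OddLayerAttention(nn.Module):
    """Trainable attention over patch embeddings of odd encoder layers (Eqs. 9-10)."""
    def __init__(self, D=64, L=8):
        super().__init__()
        self.odd = list(range(0, L, 2))                   # 0-based indices of layers 1, 3, ..., n
        self.v = nn.Parameter(torch.randn(2 * D) * 0.02)  # learnable v (read as shared across layers)

    def forward(self, y, outputs):
        contexts, weights = [], []
        for l in self.odd:
            z = outputs[l][:, 1:]                          # patch embeddings z_l^{1:N}: (B, N, D)
            yz = torch.cat([y.unsqueeze(1).expand(-1, z.size(1), -1), z], dim=-1)  # [y; z_l^i]
            s = yz @ self.v                                # compatibility scores s_l^i (Eq. 9)
            a = torch.softmax(s, dim=1)                    # attention weights a_l^i (Eq. 10)
            contexts.append((a.unsqueeze(-1) * z).sum(1))  # context c_l = sum_i a_l^i z_l^i
            weights.append(a)
        return torch.cat(contexts, dim=-1), weights        # c = [c_1; c_3; ...; c_n], plus weights
```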

3.2.3 Classification module

The concatenation of all context vectors \({\textbf {c}}= [{\textbf {c}}_1;{\textbf {c}}_3;\dots ;{\textbf {c}}_n]\in {\mathbb {R}}^{D^{'}}\) (\(D^{'}=([n/2]+1)\cdot D\), where \([\cdot ]\) denotes the integer part) is processed by three dense layers for the final prediction; the first two are GELU-activated and the last has a single sigmoid unit:

$$\begin{aligned} y=\text {sigmoid}({\textbf {w}}_3\cdot \text {GELU}({\textbf {w}}_2\cdot \text {GELU}({\textbf {w}}_1\cdot {\textbf {c}}))) \end{aligned}$$
(11)

where \({\textbf {w}}_1\in {\mathbb {R}}^{2048\times D^{'}}\), \({\textbf {w}}_2\in {\mathbb {R}}^{1024\times 2048}\), and \({\textbf {w}}_3\in {\mathbb {R}}^{1024}\).
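Putting the pieces together, a sketch of the full ViTa model with the classification head of Eq. (11) is given below (again an illustrative reconstruction built on the classes defined above; bias terms are left at PyTorch defaults).

```python
import torch
import torch.nn as nn

class MemeTectorSketch(nn.Module):
    """Backbone + trainable attention + classification head (Eq. 11)."""
    def __init__(self, D=64, L=8):
        super().__init__()
        self.backbone = ViTBackbone(D=D, L=L)
        self.attention = OddLayerAttention(D=D, L=L)
        d_ctx = ((L + 1) // 2) * D                       # D' = ([n/2] + 1) * D = 256 for L = 8
        self.head = nn.Sequential(
            nn.Linear(d_ctx, 2048), nn.GELU(),
            nn.Linear(2048, 1024), nn.GELU(),
            nn.Linear(1024, 1), nn.Sigmoid(),            # single sigmoid unit: meme probability
        )

    def forward(self, x):
        y, outputs = self.backbone(x)
        c, attn = self.attention(y, outputs)
        return self.head(c).squeeze(-1), attn

# model = MemeTectorSketch()
# prob, attn = model(torch.rand(2, 3, 250, 250))          # prob in [0, 1] per image
```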

4 Experimental setup

4.1 Datasets

Fig. 3 Example images from the Hateful Memes dataset (a, b, c) and Google's Conceptual Captions with (d, e, f) and without (g, h, i) text presence

To form a suitable dataset that can be used for the task of meme detection, we merge instances from existing datasets that contain image memes and regular images, respectively.

For the image meme class \({\textbf {M}}\), we consider the Hateful Memes Dataset [14]. This is a multimodal dataset for hateful meme detection containing 10,000 image memes. We do not take into account whether these memes are hateful or not; rather, we use them all as instances of the image meme class. Figure 3a–c illustrates three indicative examples.

For the regular image class \({\textbf {R}}\), apart from the VPU methodology explained in Sect. 3.1, which we apply on the Hateful Memes dataset to obtain \({\textbf {V}}\), we also consider part of the widely used web-scraped Google's Conceptual Captions dataset [21]. Specifically, we randomly sample images to construct \({\textbf {R}}_p\) and \({\textbf {R}}_a\) (see Sect. 3.1.2), with text presence automatically determined by TextFuseNet [31]: an image is considered to contain text if at least one text instance is detected. Figure 3 presents three indicative examples with (d–f) and without (g–i) text presence.

4.2 Sample mixing and splitting

As a starting point for the image meme class \({\textbf {M}}\), we consider the 10,000 instances of the Hateful Memes dataset. Then, we extract the visual parts of these image memes, resulting in 9,984 images considered as regular, which form the set \({\textbf {V}}\). The mean area fraction across all \(V_i\in {\textbf {V}}\) is 64.3%. For the remaining 16 images, the VPU algorithm was unable to find a rectangle that does not overlap with text. Thus, for all four sets \({\textbf {M}}\), \({\textbf {V}}\), \({\textbf {R}}_p\) and \({\textbf {R}}_a\) we consider the same size of k=9,984 instances. To do so, we discard the corresponding 16 image memes from \({\textbf {M}}\) and sample from Google's Conceptual Captions k=9,984 instances to form \({\textbf {R}}_p\) and another k=9,984 instances to form \({\textbf {R}}_a\).

Table 1 Composition scenarios for regular images class (\({\textbf {R}}\)) construction based on \(P_W\) and \(P_T\)

For sample mixing, in order to form the regular image class \({\textbf {R}}\), we only need to determine \(P_W\) and \(P_T\) (see Eq. 5). To assess the impact of both VPU and text presence on the model's performance, we consider several dataset composition scenarios \(S_i\)=\((P_W,P_T)\), with i=\(1,\dots ,13\), for \({\textbf {R}}\), presented in Table 1.

Furthermore, we apply the same scenarios to both the training and test sets and experiment with crossed scenarios \((S_i, S_j)\); e.g., the model is trained on \(S_1\)=(\(P_W\)=0%, \(P_T\)=0%) but evaluated on \(S_{13}\)=(\(P_W\)=100%, \(P_T\)=100%), resulting in 13\(\cdot \)13=169 crossed scenarios. For sample splitting, we select 85% training, 5% validation and 10% test samples for each set \({\textbf {M}}\), \({\textbf {V}}\), \({\textbf {R}}_p\), \({\textbf {R}}_a\) and construct \({\textbf {R}}\) per split. For \({\textbf {M}}\) and \({\textbf {V}}\) we use the same index split so as not to include the visual part \(V_i\) in one split, e.g., training, and the original image meme \(M_i\) in another, e.g., test. The training and validation sets always derive from the same scenario \(S_i\), while evaluation is performed on all test scenarios \(S_j\).
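The scenario grid and the crossed train/test pairs can be enumerated as in the sketch below. The exact contents of Table 1 are not reproduced here, so the grid of \(P_W\) and \(P_T\) values is our reconstruction from the scenarios quoted in the text (with \(P_T\) fixed to 0% when \(P_W\)=0%, which yields 13 scenarios).

```python
from itertools import product

# Scenario grid reconstructed from the text (values in %): P_T is only
# meaningful when web-scraped images are present, i.e., when P_W > 0.
P_VALUES = [0, 33, 67, 100]
SCENARIOS = [(0, 0)] + [(pw, pt) for pw in P_VALUES[1:] for pt in P_VALUES]
assert len(SCENARIOS) == 13                       # S_1 ... S_13

# Crossed scenarios: train on S_i, evaluate on S_j -> 13 * 13 = 169 runs.
CROSSED = list(product(range(13), repeat=2))
i, j = CROSSED[0]
print(f"train S_{i + 1}={SCENARIOS[i]}  ->  test S_{j + 1}={SCENARIOS[j]}")
```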

4.3 Competitive models

Meme detection is an understudied research topic. However, a few published approaches exist, leveraging ImageNet pre-trained ResNet features [7], combining ResNet, AlexNet and DenseNet fine-tuned on the task [20], or fine-tuning VGG16 and ResNet on the task [29]. These works consider text representations as additional inputs yet do not assess their impact. In this study, we focus only on the visual signal, for three reasons:

  1. Meme detection is a visually driven task: the critical information regarding the embedded text is the font size, color, family and position rather than the actual text content.

  2. Reliably recognizing text in image memes (let alone regular images) is a challenging task on its own, so an error-prone text recognition component would add another layer of unnecessary complexity.

  3. Manual text recognition by humans, as in the dataset used by the aforementioned studies, is unrealistic for automatic meme detection.

Hence, we consider state-of-the-art and baseline models from the image classification domain as competing methods. More precisely, we consider VGG16 [23], ResNet50 [9], EfficientNetB5 [27], and ViT [6].

4.4 Training details

For ViT and ViTa, we consider input size (H, W, C)=\((250, 250, 3)\) and patch size P=25, which entail N=100 patches. We use projection dimension D=64, L=8 Transformer encoder layers and h=4 MSA heads. In total, ViTa has 3.4 M parameters, all trained from scratch; ViT is likewise trained from scratch. For EfficientNetB5, ResNet50 and VGG16, we use ImageNet pre-trained weights, discard the last layer, add a dense sigmoid-activated layer and train only this newly added layer.

All models are trained for 20 epochs with batch size 64, using the AdamW optimizer [16] with a weight decay of 1e-3 and the binary cross-entropy loss. The first 10% of the iterations are linear warm-up steps, with learning rate \(\lambda (t)\) increasing from 0 to 1e-3, which then decays as follows:

$$\begin{aligned} \lambda (t) = \frac{\text {1e-3}}{(1+d\cdot t\cdot 1.001^t)} \end{aligned}$$
(12)

where t is the iteration and d=1e-3/20. We checkpoint models based on validation accuracy. For ViT and ViTa, we preprocess input images by first normalizing values to [0,1] and then standardizing them, i.e., subtracting the mean and dividing by the standard deviation per channel (computed on the training set). For the other models, we use the standard preprocessing pipeline provided by Keras.
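The learning rate schedule can be sketched as below. Whether t in Eq. (12) restarts after the warm-up phase is not specified, so the sketch simply uses the global iteration index.

```python
def learning_rate(t, total_steps, base_lr=1e-3, d=1e-3 / 20):
    """Linear warm-up for the first 10% of iterations, then the decay of Eq. (12)."""
    warmup = int(0.1 * total_steps)
    if t < warmup:
        return base_lr * (t + 1) / warmup        # linear warm-up from 0 to 1e-3
    return base_lr / (1.0 + d * t * 1.001 ** t)  # Eq. (12)

# Example: with ~2 * 9984 * 0.85 training images and batch size 64,
# an epoch has roughly 266 iterations.
# lrs = [learning_rate(t, total_steps=20 * 266) for t in range(20 * 266)]
```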

All models are evaluated on various test settings using the binary accuracy metric.

5 Results

Table 2 MemeTector’s performance in terms of binary accuracy over all crossed scenarios

5.1 Ablation study

Table 3 Impact of VPU usage in terms of aggregated accuracy

In Table 2, we present the performance of the proposed MemeTector model on all crossed scenarios between training and test settings. For easier interpretation, the best performance per test scenario is shown in bold. First, we observe that in almost all crossed scenarios MemeTector obtains high accuracy, ranging between 89.0% and 97.8% (mean 94.98%, standard deviation 1.47). The \(S_9\)=(\(P_W\)=67%, \(P_T\)=100%) training scenario not only provides the best performance on average and the highest accuracy in most test scenarios (10 out of 13), but also preserves the model's robustness across all test scenarios. On the contrary, the highest variability is observed for the \(S_2\)=(\(P_W\)=33%, \(P_T\)=0%) training scenario, which yields high accuracy in test scenarios with \(P_T\)=0% but low accuracy in test scenarios with text presence. The best performance is achieved in the (\(S_7\), \(S_1\)) scenario, where MemeTector is trained on (\(P_W\)=67%, \(P_T\)=33%) and evaluated on (\(P_W\)=0%, \(P_T\)=0%). The worst performance is obtained with (\(S_1\), \(S_{13}\)), where MemeTector is trained on (\(P_W\)=0%, \(P_T\)=0%) and evaluated on (\(P_W\)=100%, \(P_T\)=100%). This is expected: training on samples of diverse nature helps the model generalize to easy scenarios such as \(S_1\), while evaluating on a setting dissimilar to and harder than the training one leads to lower performance. It is also remarkable that, when evaluating on the \(S_{10}\)–\(S_{13}\) scenarios, training on \(S_9\) outperforms training on \(S_{10}\)–\(S_{13}\) themselves. This showcases the usefulness of the VPU methodology in training set construction.

In Table 3, we present comparative results regarding VPU. The first four columns show the maximum performance of models trained without VPU, and the last row shows the corresponding results of MemeTector trained with VPU. The last column of Table 3 provides the average performance. We observe that VPU improves our model's performance; moreover, when VPU is incorporated, MemeTector outperforms the competition in 3 out of 4 test scenarios as well as on average.

5.2 Comparative study

Based on the analysis presented in Sect. 5.1, we employ our best configuration for the comparison with the competing models presented in Sect. 4.3. The most robust training scenario, which also provides the best performance in most evaluation scenarios, is \(S_9\)=(\(P_W\)=67%, \(P_T\)=100%). Hence, we use this configuration for the comparison with the state of the art, namely the proposed MemeTector architecture trained with 33% VPU-created regular images and 67% web-scraped images that all contain text. Additionally, we train and evaluate VGG16, ResNet50, EfficientNetB5 and ViT on all crossed scenarios.

In Table 4, we present the average performance of each model over all crossed scenarios, where the proposed methodology achieves the highest score. Moreover, Table 5 presents the fraction of crossed scenarios in which MemeTector surpasses each competing model. Although the average difference between our model and the second best, namely ViT, is only 0.81%, MemeTector actually outperforms the latter in 87.57% of the crossed scenarios. Similarly, the proposed methodology outperforms the baselines in the majority of cases.

Table 4 Model performance in terms of average binary accuracy on all crossed scenarios
Table 5 Fraction of crossed scenarios where the MemeTector model surpasses competitive models
Fig. 4 Attention plots from MemeTector's trainable attention mechanism. All predictions are correct. The upper 10 images are memes, while the lower 10 are regular images. Of the regular images, the first 5 contain no text while the last 5 do

5.3 Attention plots

Fig. 5 Twitter images classified by MemeTector: a through d are correct predictions, while e through h are wrong predictions. a, b, g and h are image memes, while c, d, e and f are regular images. The ground truth label is shown to the left of the arrow and the MemeTector prediction to the right

Figure 4 illustrates attention plots from MemeTector's trainable attention mechanism. Since this mechanism attends back to four layers, we show the attention weights averaged across these four layers. We observe that, for image memes, MemeTector attends mostly to the areas where text is present, almost ignoring the background content. However, it does not attend to whole sentences but only to a few seemingly random parts of them, which suggests that analyzing the font morphology (which is the same throughout each sentence) provides sufficient information for accurate discrimination. Similarly, humans do not need to focus on every text-containing part of an image meme to make an informed and accurate decision. In regular images, MemeTector focuses on the main depicted concepts, as well as on the text if it is present and prominent. Presumably, the reason for not classifying regular images with prominent text as image memes is the morphology and position of the fonts.
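Such a plot can be produced from the weights returned by the attention module sketched earlier; the upsampling and normalization choices below are ours, for illustration only.

```python
import numpy as np
import torch

def attention_map(weights, H=250, W=250, P=25):
    """Average the per-layer attention weights and upsample to image resolution.

    `weights` is the list returned by OddLayerAttention: one (B, N) tensor per
    attended layer, with N = (H // P) * (W // P) patches.
    """
    avg = torch.stack(weights).mean(0)[0]                  # average over layers, first image in batch
    grid = avg.reshape(H // P, W // P).detach().cpu().numpy()  # 10 x 10 patch grid
    heatmap = np.kron(grid, np.ones((P, P)))               # nearest-neighbour upsample to 250 x 250
    return heatmap / heatmap.max()                         # normalized map to overlay on the image
```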

5.4 Use case on Twitter images

We also evaluate MemeTector on images from Twitter in order to assess its applicability in a practical use case. Specifically, we consider three relevant queries, namely "meme," "dankmemes" and "memesdaily," and download 19,502 recent tweets on 15 April 2022. Of the collected tweets, only 6,256 contain an image: 2,071 from the "meme," 1,660 from the "dankmemes" and 2,525 from the "memesdaily" query, respectively. We download these images and drop duplicates, leaving 3,199 images; seven of these are gray-scale and are also discarded, since our model has been trained on RGB images only. The remaining images are provided as input to MemeTector.

The model detected 1,342 memes (42%) and 1,850 regular images (58%). To obtain quantitative results, we manually labeled the Twitter images, compared them with MemeTector's predictions and computed the model's accuracy. We found TP=877, FP=396, TN=1,454 and FN=465, which amounts to 73% accuracy. Note that, although the queries are related to image memes, regular images are also retrieved. As expected, performance drops on this uncontrolled and noisy real-world data, but it can still be considered successful, especially given the quite different characteristics of the images used for training the model.
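For reference, the reported accuracy follows directly from these confusion counts:

```python
# Confusion counts reported for the Twitter use case
TP, FP, TN, FN = 877, 396, 1454, 465
total = TP + FP + TN + FN           # 3,192 analyzed images
accuracy = (TP + TN) / total        # (877 + 1454) / 3192 ~= 0.73
print(f"{accuracy:.1%} accuracy over {total} images")
```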

In Fig. 5, we present indicative correct and erroneous predictions from the experiment on Twitter images. The correctly classified image meme in Fig. 5a has a format similar to MemeTector's training samples, while the format of Fig. 5b, with text over objects or persons, is not present in the training set; the model still recognizes it from the font morphology and the background image semantics. There are many other meme formats not present in the training set that the model recognizes as well. The correctly classified regular images do not confuse the model even though they contain text, thanks to the incorporation of text-containing regular images at training time. The misclassified regular images contain meme-like overlay text and text at the top, which appears to mislead MemeTector. The first misclassified image meme (Fig. 5g) contains only two numbers as overlay text, which is not a common meme form in the training set. The second (Fig. 5h) has a structure similar to the training samples and MemeTector focuses on the fonts, but their morphology is different, which might be the reason for the miss.

6 Conclusions

In this work, we address the problem of image meme detection. We introduce a novel artificial dataset creation process termed visual part utilization (VPU), which extracts the visual part of an image meme and utilizes this new image as an instance of the regular image class. Additionally, we propose a trainable attention mechanism on top of a ViT architecture that combines different levels of information granularity, leading not only to improved performance but also to interpretability of the model's choices. The findings show that our model surpasses the state of the art and demonstrate the usefulness of incorporating VPU in the training of MemeTector. Finally, we validated the proposed methodology on a practical use case involving the retrieval and classification of image memes from Twitter.