1 Introduction

Sentiment analysis, also known as opinion mining, plays a crucial role in natural language processing. Its objective is to analyze the sentiments, opinions, evaluations, attitudes, and emotions expressed in user-generated content online, such as tweets. Aspect-Based Sentiment Analysis (ABSA) is a fine-grained sentiment analysis task that attracts significant attention due to its ability to offer detailed sentiment information, making it applicable to various scenarios. Many previous works on ABSA have focused on analyzing the entities that express user sentiment and their interrelationships in text, such as aspect terms, opinion terms, and sentiment polarities (Chen et al. 2020; Li et al. 2022; Liang et al. 2023). Users often express opinions through multimodal posts containing both text and images, rather than text alone, and analyzing these multimodal inputs enables more effective aspect-based sentiment analysis. The recently introduced multimodal aspect-based sentiment analysis (MABSA) task determines sentiment polarities towards different aspects mentioned in text-image pairs (Xu et al. 2019).

Previous research on MABSA typically divides it into three subtasks: Multimodal Aspect Term Extraction (MATE) (Wang et al. 2022; Chen et al. 2022), Multimodal Aspect-oriented Sentiment Classification (MASC) (Khan and Fu 2021), and Joint Multimodal Aspect-Sentiment Analysis (JMASA) (Ju et al. 2021; Ling et al. 2022). Given a text-image pair as input, MATE aims to extract all the aspect terms mentioned in the text, MASC focuses on detecting the sentiment corresponding to specific aspect terms, and JMASA is designed to jointly extract aspect terms and their corresponding sentiments. For example, considering the two image-text pairs shown in Fig. 1, the goal of JMASA is to identify all aspect-sentiment pairs, such as (Brent Seabrook, Positive) and (Blackhawks, Negative) in (a), as well as (Jesse Eisenberg, Positive) in (b).

Fig. 1 Two examples of the MABSA task

Since the output of the JMASA task includes the results of both MATE and MASC, it typically poses a greater challenge than either of them individually. Ju et al. (2021) used the global correlation between sentences and images to decide how strongly visual cues should be integrated into textual representations, almost completely neglecting the role of object-level visual information. Such coarse-grained text-image relevance judgments are insufficient: the text in JMASA tasks comprises multiple complete aspect terms, while images often provide information related to only one or a few of these aspects, so the information across different modalities is not always entirely consistent. Subsequently, Yang et al. (2022) leveraged auxiliary tasks to capture highly sensitive multimodal representations. However, focusing too much on emotionally sensitive regions in the visual modality may introduce noise from unrelated images during cross-modal interactions. As shown in Fig. 1(a), the celebration action, serving as a region of interest (ROI) feature, exhibits low relevance to the Blackhawks, yet it still affects the sentiment prediction for this aspect, whose ground truth is negative. Recently, Ling et al. (2022) devised a task-specific pre-training framework for MABSA closely associated with downstream tasks. Nevertheless, their work solely considers aligning fine-grained object-level visual information with text (Chen et al. 2022), overlooking hierarchical alignment of the vision and language modalities at multiple granularities. For instance, in Fig. 1(b), visual information is distributed across the global image, local faces, and text embedded in the image, each corresponding to image descriptions at a different level. In this complex multimodal context, achieving cross-modal alignment and fusion between text and visual information at various levels is a significant challenge.

In this paper, we propose a multi-level textual-visual alignment and fusion network called MTVAF for the JMASA task to address the aforementioned limitations. We first construct an image-text alignment module that translates hierarchical visual information into the textual space by leveraging multi-granularity visual information. Next, the textual (T) input is concatenated with the multi-level visual context to form the textual+visual (T+V) input, which is fed into a text-based pre-trained language model. To inject visual sentiment knowledge from low to high levels into the network, we design a multi-scale visual aspect-opinion fusion module that dynamically integrates additional visual features into the text-modal model using a dynamic attention mechanism. Additionally, we extract the top-N visual aspect-opinion pairs, providing explicit fine-grained visual cues. Finally, we adopt a text-centered multimodal training approach using multi-scale visual data, enhancing the robustness of the proposed model by minimizing the KL divergence between the probability distributions of the T input space and the T+V input space. Our main contributions are summarized as follows:

  1. We propose a multi-level textual-visual alignment and fusion network that bridges the semantic gap between text and images, which aligns them at multiple granularities and fully integrates hierarchical visual features into the textual space as context.

  2. Building upon modal alignment, we further devise a multi-scale visual aspect-opinion fusion module that effectively incorporates significant visual information into the Transformer model, adaptively learning fine-grained knowledge from images.

  3. Extensive experiments on two MABSA benchmark datasets demonstrate that the proposed model consistently outperforms existing unimodal and multimodal approaches and achieves state-of-the-art results.

The remainder of this paper is organized as follows. Section 2 gives a brief review of related work. Section 3 formally defines the task and describes our proposed MTVAF model. In Sect. 4, we compare and discuss the experimental results to verify the effectiveness of the proposed models. Finally, Sect. 5 concludes the paper and outlines future work.

2 Related work

2.1 Unimodal aspect-based sentiment analysis

Unimodal aspect-based sentiment analysis involves classifying the sentiment polarity of aspect terms extracted within a single modality, typically text (Chen and Qian 2019; Chen et al. 2020). Early studies primarily concentrated on two independent subtasks: Aspect Term Extraction (ATE) and Aspect Sentiment Classification (ASC). ATE was initially formulated as a sequence labeling task, with approaches emphasizing representation learning methods to enhance word embeddings and employing Conditional Random Field (CRF) models (Luo et al. 2019). Advancements in deep learning led to the increased popularity of neural networks such as Convolutional Neural Networks (CNN) (Xue and Li 2018), Recurrent Neural Networks (RNN) (Ding et al. 2017), and Recursive Neural Networks (RecNN) (Wang and Pan 2020) in ATE research.

ASC tasks are typically categorized based on the aspect type: aspect term sentiment classification and aspect category sentiment classification. Many existing approaches address these problems by designing models that leverage attention mechanisms to capture positional information and obtain aspect-specific representations for sentiment detection (Tang et al. 2016). These models effectively integrate the relationships between aspects (terms/categories) and sentence contexts. Additionally, researchers have explored methods based on Graph Neural Network (GNN) to explicitly leverage the syntactic structural relations between aspects and corresponding opinions for sentiment polarity detection (Sun et al. 2019).

Recognizing that aspects provide valuable cues for sentiment classification and polarity detection, the joint ABSA task, which extracts both aspect terms and their corresponding sentiment polarities, has gained substantial attention (Chen et al. 2020). Recent ABSA research has increasingly focused on extracting more fine-grained information, encompassing aspect sentiment triplet extraction, aspect category sentiment detection, and aspect sentiment quad prediction. While unimodal ABSA has made significant progress, it overlooks a key fact: real-world sentiments are often expressed through integrating information from multiple modalities. This includes not just words, but also visual cues.

2.2 Multimodal aspect-based sentiment analysis

In addition to the coarse-grained multimodal sentiment analysis conducted at the sentence level (Zadeh et al. 2017; Yu et al. 2020; Gandhi et al. 2023), the MABSA task aims to extract more detailed information from text-image pairs. This includes tasks such as multimodal aspect term extraction (MATE), multimodal aspect-sentiment classification (MASC), and joint multimodal aspect-sentiment analysis (JMASA). Regarding the MATE task, similar to information extraction, sequential annotation methods such as CRF (Wang et al. 2022) and GNN (Zhang et al. 2021) are employed for entity extraction. In contrast, the MASC task is typically approached as a sequence classification problem, and various neural network-based models, including Bidirectional Long Short-Term Memory (BiLSTM) (Zhou et al. 2021) and BERT (Khan and Fu 2021), have been proposed.

Recent MABSA research has been focused on acquiring crucial visual features to enhance the semantic representation of entities in complex scenes. Yu et al. (2022) introduce a hierarchical interaction module with assisted image reconstruction. However, this module primarily emphasizes local interactions and overlooks global semantic relationships and modality heterogeneity. Another approach, pioneered by Khan and Fu (2021), involves converting images to textual descriptions to avoid bias arising from cross-modal interactions. Despite aiding cross-modal alignment, this approach often results in neutral descriptions of images, introducing noise when learning the diverse emotional cues in the visual modality. Consequently, Yang et al. (2022) employ facial emotions as a supervised signal for learning visual emotions. Nonetheless, this method overlooks scenarios where facial expressions are absent in images, making it challenging to capture emotional visual cues.

The JMASA task proposed by Ju et al. (2021) aims to simultaneously address both subtasks: extracting aspect terms and classifying their corresponding sentiments while also considering text-image relation detection through annotated datasets. Nonetheless, this approach primarily emphasizes cross-modal global interactions and does not extensively examine fine-grained aspects and sentiments. VLP-MABSA (Ling et al. 2022) simulates text-based ABSA and designs specific pre-trained tasks for images to achieve cross-modal alignment. However, this approach demands significant computational resources and relies on extensive pre-trained labeled data. To tackle this challenge, CMMT (Yang et al. 2022) introduces multi-aspect and sentiment detection tasks for cross-modal interaction learning, using a unified label. However, few studies have delved into bridging the semantic gap between modalities and efficiently harnessing visual information for coarse-grained to fine-grained alignment. Different from them, the goal of our model is to effectively translate different granularities of image information into the textual space, thereby achieving accurate alignment between images and text.

2.3 Pretrained vision-language models

Inspired by the success of foundational pre-trained language models like BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), BART (Yan et al. 2021), and GPT (Radford et al. 2019) across various domains in natural language processing, Vision-Language models have gained prominence. Pre-trained vision-language models have been trained on large-scale image-text pair datasets. These models have shown impressive generalization abilities in various vision-language tasks. For example, in image-text retrieval, visual question answering, and image captioning (Tu et al. 2021, 2022), pre-trained models have achieved state-of-the-art performance.

Early Vision-Language Pretraining (VLP) approaches, such as Oscar (Li et al. 2020) and Uniter (Chen et al. 2020), heavily relied on pre-trained object detectors for visual feature extraction during pretraining. This reliance resulted in limited generalization capabilities and a strong dependence on prior knowledge. To address these issues, CLIP (Radford et al. 2021), ALBEF (Li et al. 2021), and others proposed incorporating the Vision Transformer architecture in VLP (Li et al. 2022; Zhan et al. 2023), enabling the learning of more abstract and versatile visual representations. More recently, models like BLIP (Li et al. 2022) and BEiT-3 (Wang et al. 2022) introduced Transformer-based encoder-decoder architectures that unify almost all visual-language understanding and generation tasks. In this paper, we utilize CLIP-based image captioning to translate global image information into the textual feature space.

Vision-Language models typically adopt coarse-grained pre-training tasks like Masked Language Modeling (MLM), Masked Region Classification (MRC), and Image-Text Matching (ITM). These tasks aim to understand and learn from images and text equally. However, in the case of MABSA tasks, the focus shifts to prioritizing aligned visual information to enhance text representations, especially for aspects and emotions. VAL (Chen et al. 2020) and VLP-MABSA (Ling et al. 2022) align visual and language information at multi-grained and fine-grained levels, respectively, to obtain expressive representations. Considering the limited data resources for MABSA tasks, and in order to effectively utilize alignment information, our work proposes a method for integrating multi-scale image information as prompts into text representations.

3 Methodology

Task definition: Following previous works (Ju et al. 2021; Yang et al. 2022), we formulate the JMASA task as a sequence labeling problem with a unified tagging scheme. Formally, we consider a sentence S = (\(s_1\), \(s_2\), \(\cdots\), \(s_n\)) comprising n words, along with its corresponding global image G. The objective of JMASA is to identify the aspect terms within the sentence S and their corresponding sentiment polarities. In particular, we aim to derive a joint label sequence \(y = (y_1, y_2, \cdots , y_n)\), where \(y_i \in \{\)B-POS, B-NEU, B-NEG, I-POS, I-NEU, I-NEG, O\(\}\). The labels B, I, and O signify the beginning, inside, and outside of an aspect, respectively. POS, NEU, and NEG represent positive, neutral, and negative sentiments towards the aspect. For example, when the word \(s_i\) is tagged with the label B-POS, it indicates that the word is the beginning of an aspect with a positive sentiment.
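To make the unified tagging scheme concrete, the short Python sketch below decodes aspect-sentiment pairs from such a joint label sequence. The sentence and its tags are illustrative only; the aspect Brent Seabrook is borrowed from Fig. 1(a).

```python
# Minimal sketch of decoding the unified tagging scheme used for JMASA.
LABELS = ["O", "B-POS", "I-POS", "B-NEU", "I-NEU", "B-NEG", "I-NEG"]

def decode_pairs(words, tags):
    """Recover (aspect term, sentiment) pairs from a joint BIO-style tag sequence."""
    pairs, current, sentiment = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):                 # a new aspect starts here
            if current:
                pairs.append((" ".join(current), sentiment))
            current, sentiment = [word], tag[2:]
        elif tag.startswith("I-") and current:   # continuation of the current aspect
            current.append(word)
        else:                                    # an O tag closes any open aspect
            if current:
                pairs.append((" ".join(current), sentiment))
            current, sentiment = [], None
    if current:
        pairs.append((" ".join(current), sentiment))
    return pairs

words = ["Congrats", "to", "Brent", "Seabrook", "!"]
tags  = ["O", "O", "B-POS", "I-POS", "O"]
print(decode_pairs(words, tags))  # [('Brent Seabrook', 'POS')]
```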

3.1 Model overview

As mentioned before, two challenges of MABSA are to extract effective emotional visual features at various levels of granularity and bridge the gap between different modalities. To tackle these challenges, we propose a multi-level textual-visual alignment and fusion network. Figure 2 provides an overview of the proposed framework. Our approach leverages hierarchical visual information from ResNet (He et al. 2016) and transforms it into the visual prompt, facilitating the seamless integration of richer visual information into both the T input space and T+V input space.

Fig. 2 The overall architecture of our proposed MTVAF

As shown in Fig. 2, the proposed MTVAF consists of the following three main components:

  1. Multi-granularity visual translation alignment: This component aims to align textual and visual spaces by translating the image into visual context.

  2. Multi-scale visual aspect-opinion fusion: It is designed to project multi-scale visual features into the same low-dimensional space and dynamically fuse them into each layer of the Transformer. By incorporating visual aspect-opinion supervision, it enables the acquisition of fine-grained visual information throughout the model.

  3. Text-centered multimodal training: To enhance robustness, this component minimizes the KL divergence over the output distributions of the two inputs, effectively denoising the visual modality. Furthermore, it employs CRF for generating the JMASA output.

3.2 Multi-granularity visual translation alignment

Although recent multimodal research has explored different ways to align textual and visual spaces (Yang et al. 2022; Ling et al. 2022), these studies have largely neglected the need for precise alignment across levels ranging from coarse to fine granularity. This oversight might hinder the recognition of visual clues that could enhance textual representations.

To address this issue, we propose a multi-granularity visual translation alignment module, as shown in Fig. 3, which transforms images into visual context inputs at different granularity levels. These inputs are combined with the T input and subsequently fed into a stacked bidirectional Transformer with multi-layer attention modules. This method indirectly accomplishes multi-granularity alignment between images and text, covering three levels of granularity alignment: global, local, and character-level.

Fig. 3 The workflow of multi-granularity visual translation alignment

3.2.1 Global coarse-grained alignment

Our primary objective is to establish a coarse-grained alignment between global images and text. We use an image captioning model to create a comprehensive connection between the visual and textual modalities, reducing the impact of irrelevant images. This process strives to generate meaningful and accurate image descriptions that convey the semantic information of the visual content at a coarse-grained level. In particular, we apply the image captioning tool ClipCap (Mokady et al. 2021), which generates a high-quality caption for the scene, denoted as C:

$$\begin{aligned} C = Caption(G) \end{aligned}$$
(1)

where C denotes the comprehensive description of the image generated from Caption, serving as a coarse-grained text-aligned mapping of the entire image.
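As a rough illustration of the Caption(·) step, the sketch below wraps a generic image-to-text pipeline. ClipCap itself is distributed as a research codebase rather than a packaged API, so the captioning model named here is only a stand-in, not the one used in our experiments.

```python
from PIL import Image
from transformers import pipeline

# Stand-in for Caption(G) in Eq. (1); any off-the-shelf captioner can be plugged in here.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def caption(image_path: str) -> str:
    """Return a coarse-grained textual description C of the whole image G."""
    image = Image.open(image_path).convert("RGB")
    outputs = captioner(image)               # e.g. [{"generated_text": "a hockey player ..."}]
    return outputs[0]["generated_text"].strip()
```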

3.2.2 Local fine-grained alignment

Building upon the global coarse-grained alignment, local fine-grained alignment focuses on translating local facial features into the textual space to obtain aligned descriptions of faces. Facial expressions, being a direct means for humans to convey emotions, are invaluable for the exact identification of emotions at the object level in images (Fan et al. 2018). Our observations of the Twitter-2017 dataset reveal that facial expressions appear in over half of the images within tweets.

Hence, to extract fine-grained emotional visual information, we first employ the LightFace face detector to identify all faces and transform them into textual facial attributes. Following this, we adopt the facial expression description template proposed by Yang et al. (2022) to generate face descriptions:

$$\begin{aligned} D = Face\_Description(G) \end{aligned}$$
(2)

where \(D = \left\{ D_1, D_2,..., D_d\right\}\), and d represents the number of facial descriptions.

After obtaining the facial attributes in textual form, we sort them in descending order based on the prediction confidence of the face detector, allowing us to filter out attributes with low prediction confidence. This step is essential for focusing on emotionally relevant information from local regions in images and achieving fine-grained alignment between different modalities.
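The following sketch illustrates this step. The detector output format and the description template are hypothetical placeholders rather than the exact ones of Yang et al. (2022); only the confidence-based filtering and descending sort mirror the procedure described above.

```python
def face_descriptions(faces, min_confidence=0.9):
    """Sketch of Face_Description(G) in Eq. (2).

    `faces` is assumed to be a list of detector outputs such as
    {"gender": "man", "age": "young", "emotion": "happy", "confidence": 0.97}.
    """
    kept = sorted(
        (f for f in faces if f["confidence"] >= min_confidence),
        key=lambda f: f["confidence"],
        reverse=True,                          # descending prediction confidence
    )
    template = "a {age} {gender} with a {emotion} expression"  # placeholder template
    return [template.format(**f) for f in kept]

faces = [
    {"gender": "man", "age": "young", "emotion": "happy", "confidence": 0.97},
    {"gender": "woman", "age": "adult", "emotion": "neutral", "confidence": 0.42},
]
print(face_descriptions(faces))  # ['a young man with a happy expression']
```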

3.2.3 Optical character-grained alignment

In addition, images that include text offer valuable semantic information that enhances the visual content (Wang et al. 2022; Yao et al. 2023). Conventional image encoders often face challenges in comprehending such information, for example slogans in advertisements, text within emojis, and famous quotes on posters. Thus, we apply Google's Tesseract OCR engine, an advanced lightweight open-source OCR system, to extract text from images. Through accurate text identification and extraction from images, our method achieves character-level alignment, greatly improving its sensitivity to the emotional information conveyed by these images.

$$\begin{aligned} O_c = OCR(G) \end{aligned}$$
(3)

where \(O_c\) represents the concatenated sequence of English words extracted by the OCR model.
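A minimal sketch of this OCR step using pytesseract, the Python wrapper around the Tesseract engine; the whitespace normalization and the 100-character cap follow the implementation details reported in Sect. 4.1.

```python
from PIL import Image
import pytesseract  # Python wrapper for Google's Tesseract OCR engine

def ocr_text(image_path: str, max_chars: int = 100) -> str:
    """O_c in Eq. (3): concatenated words recognized in the image, truncated to max_chars."""
    raw = pytesseract.image_to_string(Image.open(image_path))
    words = raw.split()                 # drop line breaks and stray whitespace
    return " ".join(words)[:max_chars]
```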

To mitigate the noise caused by irrelevant images, we concatenate \(O_c\) with C and D to generate a textual-visual alignment of visual context called \(V_c = (C, [SEP], D, [SEP], O_c, [SEP])\). At this stage, visual information is mapped into the text space, and after concatenating it with the T input, a T+V input is formed. As with the segments of \(V_c\), we insert a [SEP] token between the T input (S) and the visual context (\(V_c\)). Both T+V and T pass through a Transformer-based model to acquire the final hidden representations \({H}^{L}_{T+V}\) and \({H}^{L}_{T}\), which are fed into the CRF layer. For a label sequence \(y = (y_1, y_2,..., y_n)\), we define the probability of the tag sequence y based on the hidden representations \({H}^{L}\) as follows:

$$\begin{aligned} \begin{aligned} s({H}^{L},y)&= \sum _{j=0}^{n} M_{y_{j},y_{j+1}} + \sum _{j=1}^{n} P_{j,y_j}\\ p(y|{H}^{L})&= Softmax(s({H}^{L},y)) \end{aligned} \end{aligned}$$
(4)

where \(M_{y_{j},y_{j+1}}\) is the randomly initialized transition matrix entry from label \(y_{j}\) to \(y_{j+1}\), and \(P_{j,y_j}\) denotes the emission score of label \(y_{j}\), obtained by a linear transformation of \({H}^{L}\).
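The assembly of the T and T+V inputs can be sketched as follows; the helper arguments correspond to C, D, and O_c above, and writing literal [SEP] markers into the visual context string is an implementation assumption (the BERT tokenizer maps them to its special token).

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def build_inputs(sentence, caption, face_descs, ocr_words):
    """Build the T input and the T+V input with V_c = (C, [SEP], D, [SEP], O_c, [SEP])."""
    v_c = " [SEP] ".join([caption, " ".join(face_descs), ocr_words]) + " [SEP]"
    t_input = tokenizer(sentence, return_tensors="pt")
    tv_input = tokenizer(sentence + " [SEP] " + v_c, return_tensors="pt")
    return t_input, tv_input
```

Both encoded inputs are run through the same BERT backbone, and the resulting hidden representations are scored by the CRF layer of Eq. (4).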

3.3 Multi-scale visual aspect-opinion fusion

In this module, three primary subtasks are addressed. Firstly, the multi-scale visual feature subtask is dedicated to converting the input image into visual features at various scales. Secondly, the top-N visual aspect-opinion subtask aims to acquire detailed aspect-opinion information from visual data. To achieve this goal, it employs Adjective-Noun Pairs (ANPs) (Borth et al. 2013) and predicts their top-N probabilities, which serve as supervision signals. Finally, the prompt-based dynamic visual fusion subtask primarily focuses on dynamically integrating multi-scale visual cues, acting as key-value prompt information in the multi-layer bidirectional Transformer (BERT) (Devlin et al. 2019) for two inputs: T input and T+V input.

3.3.1 Multi-scale visual feature

Recent research (Tian et al. 2023) has shown the ability of convolutional neural networks (CNNs) to hierarchically extract target features: shallow and deep layers possess distinct receptive fields suitable for processing objects of various sizes. This is particularly important when dealing with images of varying granularity. Our module's objective is to capture multi-scale visual features and obtain the corresponding hierarchical visual representations.

To accomplish this, we incorporate global and regional images as supplementary visual information. Global images help capture large-scale abstract concepts such as entity context and overall emotional clues. On the other hand, regional images act as vital visual cues for smaller-scale details, guiding visual feature learning. By combining semantic and spatial information from deep and shallow features, we obtain multi-scale visual features. Specifically, we utilize a four-block structured ResNet (He et al. 2016) as the visual encoder and YOLOv5x6 as the object detector. We retain at most z regions \(O_b=(O_1, O_2,..., O_z)\) with the highest confidence scores.

The multi-scale image inputs are fed into the visual encoder, where deep information is upsampled and element-wise added with shallow information. This process extracts multi-scale feature maps \(F=(F_1, F_2,..., F_r)\), which are then fused. Following this, an average pooling operation is executed to enhance the recognition capability of visual aspects within the image:

$$\begin{aligned} \begin{aligned} \left[ F_1,F_2,...,F_r\right] _G;\left[ F_1,F_2,...,F_r\right] _{O_{b}}&= \text {Visual\_Encoder}([G];[O_b]) \\ \hat{F_i}&= \text {Ave}(F_i) \end{aligned} \end{aligned}$$
(5)

where \(\left[ F_1, F_2,..., F_r\right] _G\) and \(\left[ F_1, F_2,..., F_r\right] _{O_{b}}\) represent the visual features obtained from the fusion of multi-scale feature maps, including global image features and object features. Ave denotes the average pooling layer, which transforms \(F_i\) into the same dimension.
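A sketch of this multi-scale encoder is shown below. The 1x1 lateral convolutions that bring the four ResNet blocks to a shared channel width, the 7x7 pooled grid, and nearest-neighbor upsampling are implementation assumptions; the text above only specifies that deep features are upsampled, element-wise added to shallow ones, and average-pooled.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet152

class MultiScaleVisualEncoder(nn.Module):
    """Sketch of Eq. (5): fuse the four ResNet blocks top-down, then average-pool."""
    def __init__(self, dim=800):
        super().__init__()
        backbone = resnet152(weights=None)  # load ImageNet weights in practice
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.blocks = nn.ModuleList([backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        self.lateral = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in (256, 512, 1024, 2048)])

    def forward(self, images):                        # images: (B, 3, H, W), global or object crops
        x, feats = self.stem(images), []
        for block, lateral in zip(self.blocks, self.lateral):
            x = block(x)
            feats.append(lateral(x))                  # project every scale to a shared width
        for i in range(len(feats) - 1, 0, -1):        # upsample deep maps, add to shallow ones
            feats[i - 1] = feats[i - 1] + F.interpolate(feats[i], size=feats[i - 1].shape[-2:])
        pooled = [F.adaptive_avg_pool2d(f, (7, 7)) for f in feats]   # Ave(F_i): unify spatial size
        return [p.flatten(2).transpose(1, 2) for p in pooled]        # r=4 tensors of shape (B, 49, dim)
```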

3.3.2 Top-N visual aspect-opinion

When integrating a significant amount of fine-grained visual information into the two inputs, it is essential to explore the fine-grained relationships among the visual features obtained from the multi-scale network. Inspired by VLP-MABSA (Ling et al. 2022), we utilize Adjective-Noun Pairs (ANPs) as supervision for visual aspects and opinions. These ANPs are derived from the pre-trained ANP detector DeepSentiBank (Chen et al. 2014), which predicts the class distribution of 2089 ANPs within the entire image, reflecting visual aspect-opinion information.

Since depending solely on the single predicted ANP causes an error propagation issue and utilizing the entire distribution for supervision creates redundant noise, we propose using the adjective-noun pairs with the top-N prediction probabilities to guide the model. For instance, in the middle right part of Fig. 2, the top-N adjective-noun pairs from the input image form an ordered list relevant to the visual aspect-opinion, potentially guiding our model's focus toward fine-grained visual information. The distribution of the top-N prediction P is computed as:

$$\begin{aligned} P = Softmax(W^T (\frac{1}{r} \sum _{i=1}^{r} (\hat{F_i})) + b) \end{aligned}$$
(6)

where r=4, \(W \in \mathbb {R}^{d\times N}\) and \(b \in \mathbb {R}^{N}\) are trainable parameters, and d is the dimension of the text representation in BERT.

To bring the predicted distribution P closer to the ground-truth top-N adjective-noun pair distribution A, we employ the standard cross-entropy loss to learn fine-grained information from the image input:

$$\begin{aligned} L_V = - A log(P) \end{aligned}$$
(7)

This loss function is designed to minimize the discrepancy between the predicted and ground-truth distributions, thereby enhancing the model’s capability to capture the needed visual aspect-opinion relationship.
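A minimal PyTorch sketch of Eqs. (6)-(7) follows; the list-of-feature-grids input and the spatial mean-pooling before the classifier are assumptions carried over from the multi-scale encoder sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopNVisualAspectOpinion(nn.Module):
    """Predict a distribution over the N retained adjective-noun pairs and supervise it."""
    def __init__(self, dim=800, top_n=10):
        super().__init__()
        self.classifier = nn.Linear(dim, top_n)   # W in R^{d x N}, b in R^N

    def forward(self, feats, anp_target):
        # feats: list of r tensors (B, m, dim); anp_target: (B, top_n) ground-truth distribution A
        pooled = torch.stack([f.mean(dim=1) for f in feats]).mean(dim=0)  # (1/r) * sum of F_i hat
        log_p = F.log_softmax(self.classifier(pooled), dim=-1)            # log P, Eq. (6)
        return -(anp_target * log_p).sum(dim=-1).mean()                   # L_V = -A log P, Eq. (7)
```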

3.3.3 Prompt-based dynamic visual fusion

In multimodal aspect-based sentiment analysis, the textual modality remains crucial for identifying entities and sentiments, even though the visual modality also plays a significant role (Zhan et al. 2023). Therefore, we propose a submodule that employs a dynamic attention mechanism (Chen et al. 2018) to project multi-level visual information, including both full-image details and object-level information, as prompts into the l-th layer of BERT within the textual modality. These visual prompts are concatenated with the keys and values in each layer during multi-head attention calculations, mitigating noise interference from irrelevant visual information. This dynamic projector calculates multiple normalized vectors that control the extent of visual feature transformation for each BERT block. Firstly, we calculate the normalized weights \(\alpha ^l_i\) of the projecting signal:

$$\begin{aligned} \begin{aligned} e_{i}&= MLP(\hat{F_i}) \\ \alpha _{i}^l&= \frac{exp(e_{i})}{\sum _{k=1}^{r} (exp(e_{k}))} \end{aligned} \end{aligned}$$
(8)

where MLP represents a layer that appropriately reduces the feature dimensionality.

To integrate the textual and visual features, we employ multi-head self-attention. This process combines the visual prompts with the key/value vectors of the contextual representations in each BERT layer. Here, \(V^l\) represents the transformed visual features, which are fed into the l-th layer of BERT.

$$\begin{aligned} \begin{aligned} V^l = [V_G^l; V_{O_b}^l]&= \sum _{i=1}^{r}(\alpha _i^l \cdot \hat{F_i}) \\ [\delta ^l_k; \delta ^l_v]&= W^l_{\delta } V^l \end{aligned} \end{aligned}$$
(9)

where the visual prompts \(\delta ^l_k, \delta ^l_v \in \mathbb {R}^{(z+1)hw \times d}\), and \((z+1) hw\) denotes the length of the visual features. The multi-scale visual features undergo a linear transformation \(W^l_{\delta } \in \mathbb {R}^{d \times 2 \times d}\), which projects them into the same embedding space as the textual representations.

The equal-length visual prompts are then concatenated with the original key and value vectors from the previous BERT layer, acting as the new keys and values during the attention process. Formally, the fusion of visual prompts with text-based attention is calculated as follows:

$$\begin{aligned} Fusion\_Attention = Softmax\left( \frac{W^l_Q H^{l-1} \cdot [\delta ^l_k; W_K^l H^{l-1}]}{ \sqrt{d} }\right) [\delta ^l_v; W_V^l H^{l-1}] \end{aligned}$$
(10)

where \(W_Q^l H^{l-1}\), \([\delta ^l_k; W_K^l H^{l-1}]\) and \([\delta ^l_v; W_V^l H^{l-1}]\) represent the query, key, and value in the new attention matrices.
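A single-layer sketch of this prompt-based fusion is given below. The one-linear-layer projector, the spatial mean-pooling used to score each scale, and single-head attention are simplifications for illustration; inside BERT the same idea is applied with multi-head attention at every layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptBasedVisualFusion(nn.Module):
    """Sketch of Eqs. (8)-(10) for one layer: weight the r scales, build key/value prompts,
    and prepend them to the layer's keys and values."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # stands in for the MLP of Eq. (8)
        self.to_kv = nn.Linear(dim, 2 * dim)  # W_delta^l of Eq. (9)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, h_prev, feats):
        # h_prev: (B, n, dim) previous-layer text states; feats: list of r tensors (B, m, dim)
        e = torch.stack([self.score(f.mean(dim=1)) for f in feats], dim=1)  # (B, r, 1) logits e_i
        alpha = F.softmax(e, dim=1)                                         # alpha_i^l
        v_l = (alpha.unsqueeze(-1) * torch.stack(feats, dim=1)).sum(dim=1)  # V^l: (B, m, dim)
        delta_k, delta_v = self.to_kv(v_l).chunk(2, dim=-1)                 # visual key/value prompts
        q = self.q(h_prev)
        k = torch.cat([delta_k, self.k(h_prev)], dim=1)                     # prepend prompts to keys
        v = torch.cat([delta_v, self.v(h_prev)], dim=1)                     # ... and to values
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                                     # fused text representation
```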

3.4 Text-centered multimodal training

When the diverse, multi-level visual information is fed into the multi-layer bidirectional Transformer, there is a risk that the lengthy visual context receives excessive attention, overshadowing the primary role of textual information during backpropagation. Furthermore, the absence of annotated labels for supervising the alignment between textual and multimodal information poses a challenge. Thus, we minimize the KL divergence between the two probability distributions obtained by feeding the two inputs into the Transformer-based model of Eq. 10. When the T+V distribution is treated as a fixed target, this is equivalent to minimizing the cross-entropy between the two distributions:

$$\begin{aligned} L_{T+V} = KL(p(y|H^L_{T+V}) \Vert p(y|H^L_T)) = \sum _{y\in Y} p(y|H^L_{T+V}) \log \frac{p(y|H^L_{T+V})}{p(y|H^L_T)} \end{aligned}$$
(11)

where \(p(y|H^L_{T+V})\) and \(p(y|H^L_T)\) are T+V and T probability distributions derived from two inputs through Eq. 4.

As the T+V input introduces noise into the aspect-sentiment pair extraction process of the MABSA model, we adopt a text-centered approach that transfers crucial information from the T+V context while backpropagating only through \(p(y|H^L_T)\). The main loss is the negative log probability of the ground-truth label sequence:

$$\begin{aligned} L_T = - \sum _{i=1}^{n} log(p(y|H^L_T)) \end{aligned}$$
(12)

The final combined objective function is defined as follows:

$$\begin{aligned} L_{MTVAF} = \lambda \cdot L_T + \mu \cdot L_V + \gamma \cdot L_{T+V} \end{aligned}$$
(13)

where \(\lambda\), \(\mu\), and \(\gamma\) \(\in [0, 1]\) are trade-off hyper-parameters to control the contribution of each module.
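The combined objective can be sketched as follows. Using per-tag logits instead of the full CRF sequence distribution and detaching the T+V distribution are simplifications; the detach reflects the text-centered training above, in which only the T branch receives gradients from the KL term.

```python
import torch.nn.functional as F

def mtvaf_loss(logits_t, logits_tv, nll_t, loss_v, lam=1.0, mu=1.0, gamma=0.3):
    """Sketch of Eq. (13): lam * L_T + mu * L_V + gamma * L_{T+V}.

    nll_t  : CRF negative log-likelihood of the gold tag sequence from the T input (Eq. 12)
    loss_v : top-N visual aspect-opinion loss (Eq. 7)
    logits_t, logits_tv : per-tag scores from the T and T+V inputs, shape (B, n, num_tags)
    """
    p_tv = F.softmax(logits_tv, dim=-1).detach()            # fixed target distribution
    log_p_t = F.log_softmax(logits_t, dim=-1)
    l_tv = F.kl_div(log_p_t, p_tv, reduction="batchmean")   # KL(p_{T+V} || p_T), Eq. (11)
    return lam * nll_t + mu * loss_v + gamma * l_tv
```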

4 Experiments

4.1 Experimental settings

Datasets: We demonstrate the effectiveness of our approach on the Twitter-2015 and Twitter-2017 datasets (Yu et al. 2019) for MABSA. These datasets comprise text-image pairs extracted from tweets spanning the years 2014 to 2017. The statistics of the two datasets are presented in Table 1, and the detailed statistics of the multi-granularity visual translation alignment are shown in Table 2.

Table 1 Statistics of two benchmark datasets (All: number of all aspects, \(\#\)S: number of sentences, Mean: mean length of sentences, Max: maximum length of sentences)

Evaluation metrics: Following previous works (Ju et al. 2021; Ling et al. 2022; Yang et al. 2022; Yang et al. 2023), we adopted Precision (P), Recall (R), and F1 score as the evaluation metrics to assess the performance of different methods in the MABSA task.

$$\begin{aligned} Precision = \frac{\#true}{\#prediction} \end{aligned}$$
(14)
$$\begin{aligned} Recall = \frac{\#true}{\#ground\ truth} \end{aligned}$$
(15)
$$\begin{aligned} F1\ score = \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$
(16)

where \(\#prediction\) and \(\#ground\ truth\) denote the number of predicted and ground truth aspect-sentiment pairs, respectively. The amount of correct predictions in aspect-sentiment pairs is represented by \(\#true\), implying that both the aspect boundary and sentiment classification are correct.
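For completeness, a small sketch of how these pair-level metrics can be computed; representing each sentence's predictions and gold annotations as sets of (aspect, sentiment) pairs is an implementation assumption.

```python
def pair_f1(pred_pairs, gold_pairs):
    """Micro P/R/F1 over exact (aspect span, sentiment) matches, as in Eqs. (14)-(16)."""
    n_pred = sum(len(p) for p in pred_pairs)                          # number of predicted pairs
    n_gold = sum(len(g) for g in gold_pairs)                          # number of ground-truth pairs
    n_true = sum(len(p & g) for p, g in zip(pred_pairs, gold_pairs))  # exact matches
    precision = n_true / n_pred if n_pred else 0.0
    recall = n_true / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy example: one sentence, the prediction misses one gold pair
print(pair_f1([{("Brent Seabrook", "POS")}],
              [{("Brent Seabrook", "POS"), ("Blackhawks", "NEG")}]))  # -> (1.0, 0.5, ~0.667)
```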

Table 2 Statistics of the number of sentences with different-granularity visual contexts, together with their mean and maximum lengths

Implementation details: To ensure fair comparisons, we employed BERT-base-uncased as our textual backbone and ResNet152 as the visual encoder, consistent with recent studies. The Transformer model had 12 attention heads and a dropout rate of 0.1. Training lasted for 30 epochs, with evaluation after the 16th epoch. For optimization, we used the AdamW optimizer with a weight decay of 0.01. Additionally, the learning rate was linearly warmed up to its maximum value during the first 1\(\%\) of training steps. For the Twitter-2015 dataset, we used a learning rate of 2e-5 and a batch size of 16. For the Twitter-2017 dataset, a learning rate of 1.5e-5 and a batch size of 4 were employed. The length of the prompt was set to 4, and its dimensionality was reduced to 800. The number of image objects was limited to no more than 3, and the OCR text length was limited to 100 characters. Besides, we consider the top-N visual aspect-opinion pairs as fine-grained concepts, with N set to 10. Performance is not particularly sensitive to the trade-off hyperparameters \(\lambda\) and \(\mu\), both of which are set to 1 in our model. Conversely, the hyperparameter \(\gamma\) has a strong impact on performance. In subsection 4.5.2, we determine the optimal value of \(\gamma\) via a grid-search strategy, resulting in values of 0.3 and 0.2 for the Twitter-2015 and Twitter-2017 datasets, respectively. We implemented all our methods using PyTorch and executed them on a single NVIDIA Tesla V100 GPU.
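An optimizer setup consistent with these settings might look as follows; the single parameter group and the linear decay after warm-up are assumptions, since the text above only specifies AdamW, the weight decay, and the 1% warm-up.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, train_steps, lr=2e-5, warmup_ratio=0.01, weight_decay=0.01):
    """AdamW with a linear warm-up over the first 1% of steps
    (lr=2e-5 for Twitter-2015, 1.5e-5 for Twitter-2017)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * train_steps),
        num_training_steps=train_steps,
    )
    return optimizer, scheduler
```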

4.2 Baselines

In this subsection, we conduct a comprehensive comparison of our MTVAF model with two groups of competitive baselines: unimodal (text-based) and multimodal methods.

Text-based methods:

  1. SpanABSA (Hu et al. 2019) is a span-based hierarchical method for textual ABSA.

  2. D-GCN (Chen et al. 2020) proposes the concept of second-order proximity information, which is used to extend the receptive field of the convolution operation to extract more features on the directed graph.

  3. GPT-2 (Radford et al. 2019) adopts a Transformer-based architecture using only a decoder structure, enabling end-to-end applications in textual ABSA through text generation.

  4. RoBERTa (Liu et al. 2019) is an advanced pre-trained transformer-based model, which feeds the contextualized text representation into a CRF layer for sequence labeling.

  5. BART (Yan et al. 2021) is an Encoder-Decoder Transformer architecture that combines contextual information and autoregressive features, formulating textual ABSA as an index generation task.

Multimodal methods:

  1. UMT-collapsed (Yu et al. 2020), OSCGA-collapsed (Wu et al. 2020), and RpBERT-collapsed (Sun et al. 2021) were originally designed for the multimodal named entity recognition (MNER) task and are adapted to MABSA with a collapsed tagging scheme. Note that UMT-collapsed uses a cross-modal Transformer to model the interaction between text and images, OSCGA-collapsed combines object-level visual information with textual information, and RpBERT-collapsed uses the confidence of the image-text relationship to fuse the two modalities.

  2. CLIP (Radford et al. 2021) employs contrastive pretraining to encode rich semantic representations of both images and text, which can be applied to MABSA.

  3. JML (Ju et al. 2021) performs span-based hierarchical joint learning while introducing an auxiliary cross-modal relation detection task to integrate appropriate visual information.

  4. CMMT (Yang et al. 2022) uses a gating mechanism to control the contribution of text and image representations and to capture the interaction between them, with two unimodal auxiliary tasks.

  5. VLP-MABSA (Ling et al. 2022) designs multiple distinct vision-language pre-training tasks on an extra pre-labeled dataset containing over 17,500 image-text pairs. This approach aims to bridge the gap between a fine-grained MABSA task with limited resources and a general pre-training task.

  6. GMP (Yang et al. 2023) utilizes multimodal encoders and decoders to automatically generate aspect-oriented and sentiment-oriented prompts for MABSA in text-image few-shot scenarios.

  7. AoM (Zhou et al. 2023) reduces inter-modal noise for fine-grained sentiment analysis by jointly modeling aspect-level semantics and guided sentiment aggregation.

4.3 Main results

In Table 3, we compare our approach with the baselines on the Twitter-2015 and Twitter-2017 datasets. The proposed model demonstrates superior performance, and the following observations can be made.

Table 3 A comparison of our MTVAF model and other competitive baselines for MABSA

Incorporating additional visual content enhances the model's understanding of the correlation between aspects and sentiments, thereby improving performance on the MABSA task. Multimodal models that leverage visual information clearly outperform text-only methods in MABSA. A notable comparison can be drawn by contrasting the unimodal baseline BART with its corresponding multimodal method VLP-MABSA, both of which are based on pre-training. The latter outperforms the former, with F1 scores increased by 3.4\(\%\) and 2.4\(\%\) on the two datasets, respectively. This finding further emphasizes the valuable role of visual information in aspect-level sentiment analysis in a multimodal setting.

The representation capacity of the model can be improved by incorporating relevant auxiliary tasks that filter out irrelevant image and text information, such as multimodal relation detection and late-stage weighted fusion. Among the multimodal baseline models, UMT-collapsed and OSCGA-collapsed perform poorly because they do not consider the relevance to downstream tasks and simply fuse visual features without alignment. Directly fusing image and text features in RpBERT-collapsed through naive fusion results in markedly inferior performance. On the other hand, JML and CMMT design auxiliary tasks to explore coarse-grained or fine-grained alignment information in visual content, yielding better results. However, they do not fully exploit the diverse range of semantic information present in images, resulting in the omission of crucial visual cues. In contrast, MTVAF leverages multi-level visual-textual alignment and uses visual representations as prompts. The results in the table show that our method effectively mitigates interference from irrelevant global and local image information, providing a more comprehensive approach for visual-language alignment and fusion in MABSA.

Multimodal pre-trained models generally require task-specific pre-training or prompt-based learning to provide supervised signals for model fine-tuning. Although CLIP can capture semantically powerful multimodal representations through contrastive learning, its performance is significantly lower than VLP-MABSA guided by proper pre-training tasks. In comparison, GMP may help model the correlations between input images, text, and output sentiments better by automatically generating customized multimodal prompts for MABSA. This also highlights the effectiveness of visual prompt-based fusion for inter-modality interaction in MTVAF.

As shown in Table 3, it is evident that the proposed MTVAF outperforms the state-of-the-art AoM model by 1.8\(\%\), 3.8\(\%\), and 2.8\(\%\) in terms of Precision, Recall, and F1 score, respectively, on the Twitter-2015 dataset. This observation indicates that even selectively attending to aspect-relevant visual-textual contents and aggregating associated sentiment signals may still be insufficient to completely filter out misaligned visual noise. For the MABSA task, our approach leverages multi-level visual knowledge aligned with text, which proves to be more beneficial in training text-based models.

4.4 Ablation study

Table 4 Ablation study of the MTVAF

In Table 4, we conducted ablation experiments to further analyze the contributions of each module in our model. (1) only global/local/character-grained alignment: Retain only the alignment translation at the specified level. (2) w/o global/local/character alignment: Remove alignment translation at the specified level (various combinations of multi-granularity alignment). (3) w/o multi-granularity visual translation alignment: Completely remove T+V input and KL divergence in Fig. 2 top left. (4) w/o top-N visual aspect-opinion: Remove the visual aspect-opinion loss in Eq. 7. (5) w/o prompt-based dynamic visual fusion: Exclude the fusion of image information via visual prompts, only use translated visual contexts as visual modality input. (6) rep. text-centered multimodal training: Replace text probability distribution with T+V distribution in Eq. 12 for backpropagation, eliminating the two-distribution loss in Eq. 11.

We conducted a series of experiments to validate the vital role played by the various granularity alignment methods and their combinations. The performance drop of only global coarse-grained alignment on Twitter-2015 is less significant than on Twitter-2017. This difference could be due to the dominance of neutral samples in Twitter-2015, as shown in Table 1 and Table 2, where face descriptions, which account for sentiment clues, make up only 34.41%. During training, the model may tend to focus more on objective global coarse-grained alignment in such cases. Meanwhile, on Twitter-2017, the F1 of w/o character alignment is slightly higher than that of the complete MTVAF model. There could be two reasons for this result. First, OCR has weaker recognition capability on low-quality or complex images, which may introduce irrelevant information during multi-granularity visual translation alignment. Second, a great deal of the OCR text in the Twitter datasets exhibits little semantic association with the linked sentences; such OCR text not only fails to enhance visual alignment but also introduces noise. Subsequently, we removed all visual context (w/o multi-granularity visual translation alignment). Performance degradation was observed on both datasets, with larger decreases of 3.7% and 2.6%, respectively. This reveals that translating visual information from coarse-grained to fine-grained levels into the textual space can bridge the semantic gap between the image and text modalities.

To demonstrate the practicality of the multi-scale visual fusion, we conducted experiments removing two components as ablative models: the top-N adjective-noun pairs (w/o top-N visual aspect-opinion) and the multi-scale fusion (w/o prompt-based dynamic visual fusion). When we eliminated the auxiliary supervision, there was a decrease of 2.2% and 1.9% in F1 scores on the two Twitter datasets, indicating that the predictions of ANPs may capture the visual aspect-opinion information. Furthermore, when we excluded the multi-scale visual cues, there was a significant decline in performance. This emphasizes the importance of our proposed MTVAF, which facilitates comprehensive and informative interactions within the textual model by integrating visual cues of different scales.

Besides, removing the loss over the two distributions in Eq. 11 (rep. text-centered multimodal training) leads to a decline on both datasets, indicating that this loss effectively reduces the noise introduced by the redundant visual context.

4.5 In-depth analysis

We perform an in-depth analysis on the two Twitter datasets to investigate the effects of the number of adjective-noun pairs, the hyper-parameter settings, and the number of layers of the image encoder on performance, which further demonstrates the validity of the proposed MTVAF model.

4.5.1 Analysis of the sensitivity of the top-N visual aspect-opinion

At the top-N adjective-noun pairs stage, we investigated the influence of different settings of N. We explored values of 0, 1, 5, 10, 25, and 100, as well as the entire distribution (2089), to understand their impact. Our tests revealed that performance is sensitive to the number of adjective-noun pairs. Figure 4(a) shows that using adjective-noun pairs generally enhances the performance of MABSA compared to top-0 (w/o top-N visual aspect-opinion).

Fig. 4 The results of different top-N adjective-noun pairs and the contribution of the multi-granularity visual translation alignment module for MTVAF

Moreover, using only the adjective-noun pair with the highest predicted probability (N = 1) may lead to error propagation, while employing the distribution data of a large number of adjective-noun pairs (whole distribution) may introduce noise. Through our observations, setting the number of adjective-noun pairs to 10 yielded the best result. This finding highlights that adopting an appropriate number of adjective-noun pairs can effectively alleviate error propagation and visual noise problems to a certain extent, consequently leading to improved performance.

4.5.2 Analysis of the sensitivity of the hyper-parameter

Based on the final performance on the development set, we conduct a hyperparameter experiment to determine how strongly MTVAF should utilize the aligned image information obtained by translating the input image into three granularities, as depicted in Fig. 4(b). For the hyperparameter \(\gamma\) in Eq. 13, we experimented with settings ranging from 0.1 to 1.0 in increments of 0.1 and found that the optimal values were 0.3 and 0.2 for the two Twitter datasets, yielding the best performance.

As previously emphasized, reducing irrelevant image information is crucial in the context of the MABSA task. The trade-off parameter \(\gamma\) inherently signifies the degree of influence exerted by the aligned image content. When \(\gamma\) is set to a larger value, redundant T+V information is introduced, potentially biasing the training process and decreasing robustness. Conversely, a smaller \(\gamma\) reduces the contribution of the T+V input, which also negatively impacts overall performance. Therefore, to strike a balance between incorporating sufficiently aligned visual information and mitigating the impact of noise, it is advisable to assign a relatively small value to the contribution level of the T+V input, optimizing the model's performance.

4.5.3 Analysis of the effectiveness of the image encoder

To examine the importance of the number of layers in the image encoder, as emphasized in Fig. 5, we conducted experiments replacing ResNet152 with alternative ResNet variants featuring varying numbers of layers. Notably, we observed a consistent decrease in F1 scores across the Twitter-2015 and Twitter-2017 datasets as the number of layers diminished. This observation underscores the importance of increasing the number of residual blocks within each module, as it enables the capture of more comprehensive bottom-up syntactic and semantic information within the images, despite the overall similarity in structure.

Fig. 5 Performance comparison of image encoders on the Twitter datasets

4.6 Case study

In this subsection, we selected three representative test examples in Table 5 for comparison among four models: the textual baseline BART, the multimodal benchmark model JML, VLP-MABSA, and our MTVAF framework.

Table 5 Three examples of the predictions by different methods

As shown in Table 5, in example (a), we find that JML failed to identify the aspect term (Brandon Carr), while BART and VLP-MABSA additionally identified an aspect word the, and BART also made an incorrect sentiment prediction. MTVAF likely takes advantage of the image caption and relevant object-level images, which carry no specific emotional tendencies, aiding entity recognition and sentiment prediction. In example (b), owing to the lack of image input, BART did not identify the aspect term (LFC). JML extracted only one aspect word Steven, and VLP-MABSA made an incorrect sentiment prediction. MTVAF, on the other hand, aligns the second entity by combining the facial description of an angry expression with adjective-noun pairs such as tough face and extreme violence to capture emotional visual contextual cues. As a result, MTVAF correctly extracts the two aspect terms and classifies their sentiments as neutral and negative. In example (c), BART and JML extracted only partial entities, with JML identifying an additional aspect word Show compared to BART. Although VLP-MABSA accurately predicted the correct aspect term (The Seth Leibsohn Show), it failed to classify the sentiment, probably due to its equal treatment of images and text.

In the case of using images as an auxiliary modality, the MTVAF model not only correctly recognizes the aspect term (The Seth Leibsohn Show) through character-grained visual context, but also predicts a positive sentiment in combination with image information. These case results demonstrate that our MTVAF model can obtain all correct aspect terms and their associated sentiments by comprehensively aligning and fusing textual information with relevant images at different scales.

4.7 Comparison with human assessment

Is MTVAF consistent with human assessment? To gain a deeper understanding of the correlation and discrepancies between our approach and human perception of multimodal data, we engaged three graduate students (majoring in computer science and technology) to independently annotate each text along with its associated image. Their task was to assess whether the image enhances aspect recognition and sentiment detection in the text. To reduce semantic discrepancies and ensure dataset quality, we randomly sampled three groups of 100 samples from the Twitter-2015 test set using different seeds (12, 43, 100). Cohen's Kappa (Cohen 1960) was adopted to measure inter-annotator agreement, and the highest average Kappa score among the three groups is presented in Table 6. This subset, demonstrating agreement at a fundamental level, was chosen as the evaluation dataset. The majority label among the three annotations was then adopted as the ground truth label.
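A small sketch of this agreement and majority-vote procedure; the binary annotation format (1 if the image supports aspect and sentiment detection, 0 otherwise) is an assumption for illustration.

```python
from collections import Counter
from sklearn.metrics import cohen_kappa_score

def agreement_and_majority(ann1, ann2, ann3):
    """Pairwise Cohen's Kappa between the three annotators and the per-sample majority label."""
    kappas = {
        "G1-G2": cohen_kappa_score(ann1, ann2),
        "G1-G3": cohen_kappa_score(ann1, ann3),
        "G2-G3": cohen_kappa_score(ann2, ann3),
    }
    majority = [Counter(votes).most_common(1)[0][0] for votes in zip(ann1, ann2, ann3)]
    return kappas, majority

# toy example with five annotated samples
kappas, labels = agreement_and_majority([1, 0, 1, 1, 0], [1, 0, 1, 0, 0], [1, 1, 1, 1, 0])
```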

Table 6 Agreement between every pair of three graduate annotators (G1, G2, G3)

We evaluated our model's performance in both unimodal and multimodal settings on this data subset, as shown in Table 7. When annotators considered the visual modality supportive for ABSA tasks, our model exhibited a significant improvement upon incorporating the visual modality, indicating its effective utilization of additional image information, in alignment with human judgment. However, in the evaluation of the 19 unsupported samples, MTVAF showed only a marginal improvement of 4.3%. This observation underscores that even for human annotators, subject to inherent subjectivity and limited prior knowledge, the assessment of image support in image-text pairs can vary, as discussed by Lake et al. (2017) and others.

Table 7 Performance on human-annotated datasets for MTVAF and its ablative variants

Our model may mitigate the semantic gap between images and text during modality alignment and reduce interference from irrelevant images during fusion. Although there is a modest decline in performance compared to cases where judgments align with image-text correlation, the multi-level image information may still provide implicit semantic knowledge for MABSA. Therefore, aligning and fusing information from different modalities into an effective and robust multimodal representation holds great potential. This is not only because it aligns with human cognitive processes but also because it enables a more comprehensive understanding and richer representation.

5 Conclusions

In this paper, we present a novel multi-level textual-visual alignment and fusion network (MTVAF) for performing joint multimodal aspect-sentiment analysis (JMASA). Our approach enables comprehensive interaction between the textual and visual modalities by integrating multi-granularity alignment and multi-scale fusion techniques. Moreover, we introduce a text-centered multimodal training strategy to effectively address the noise introduced by the extensive visual context. Experimental results on two benchmark MABSA datasets demonstrate that our proposed model outperforms the state-of-the-art baselines in the MABSA task. Furthermore, in-depth analysis validates the effectiveness of our proposed model and the appropriateness of our chosen hyperparameters, highlighting its ability to accurately handle the complexities of JMASA in a comprehensive manner.

In future work, we aim to delve into more refined modeling approaches, extending the proposed method to cater to a broader range of multimodal tasks in practical applications, such as multimodal aspect sentiment triplet extraction. Furthermore, we plan to design alignment mechanisms for incorporating relevant multi-granularity visual contexts into our model training to reduce reliance on external alignment tools. This will enhance the overall robustness of our approach and enable more sophisticated integration of visual information into the analysis process.