1 Introduction

Disaster management is the process of preparing for, responding to, and learning from the repercussions of major disasters. It encompasses how individuals and organizations cope with the human, material, economic, or environmental consequences of a disaster. Throughout history, natural catastrophes and warfare have consistently produced peaks in mortality and morbidity.

Social media networks receive billions of images and text posts every second, documenting a wide range of events in our immediate environment. Large-scale image recognition and text understanding now make it feasible to recognize and categorize occurrences worldwide in real time.

During emergencies, crucial information for both first responders and the public often arrives as image-text pairs. However, as events unfold, users may upload fragmented or contradictory information, making it difficult to identify important occurrences accurately and, in turn, to automatically assess the severity of disasters. While computer vision and natural language processing techniques can classify such data effectively, incomplete data remains a significant obstacle [1]. Therefore, augmenting the data is essential to ensure accurate classification and enable effective learning from the information available.

Fig. 1 Tweet text and image pairs from different disaster events with complementary information

Figure 1 displays tweets and accompanying photographs from three recent major disasters. Relying on a single modality often results in the loss of critical information. For instance, although the tweet text in Fig. 1(d) mentions a 7.1-magnitude earthquake in Southern Mexico, the extent of the damage cannot be deduced from the text alone. Upon examining the picture accompanying this tweet, however, the immense damage caused by the earthquake becomes evident.

In situations like this, learning from multiple modalities of data can significantly enhance the analysis [2]. Incorporating diverse sources such as images and text yields a more comprehensive and nuanced understanding of the situation. This approach enables the model to cross-reference information, filling gaps [3] and validating data across modalities.

Multimodal deep learning is a machine learning paradigm that combines several data types, such as images, text, audio, and numerical data, with multiple processing methods to produce better results. Such models understand a situation more effectively by drawing on many modalities, since some cues are present only in particular modalities. Despite the successes of multimodal learning in other fields, multimodal analysis of social media data has received comparatively little attention. Supporting humanitarian organizations' planning, mitigation, response, and recovery activities while leveraging social media data for social good requires time-critical analysis of the multimedia information produced during a crisis event.

The motivation for employing multimodal learning in disaster assessment arises from the limited exploration of automatically detecting crisis events using both visual and textual information. Despite the acknowledged importance of utilizing AI for social good and the significant impact of social media, there has been minimal research on this specific intersection. The prevalence of social media in crisis contexts and the extensive history of interdisciplinary research in humanitarian crisis efforts underscore the need for a more comprehensive approach. Previous methods concentrating on either images or text alone have prompted the exploration of multimodal learning as a more holistic and effective strategy for crisis event detection.

To address these issues, the proposed work assesses the severity of the situation based on the following tasks: Task 1(a), Informativeness: whether the social media post can be used to provide humanitarian relief in an emergency; Task 1(b), Disaster Type: identifying the type of disaster; Task 2, Humanitarian Category: categorizing the specific type of humanitarian information provided in the tweet; and Task 3, Severity: determining the severity of the emergency based on the damage represented in the image and text. The contributions of the work are as follows:

  • A novel multimodal intermediate (middle) fusion based framework is introduced for classifying multimodal data in the crisis domain. For each image-text pair, the approach extracts image features from the picture and generates word embeddings for the text.

  • A cross-modal attention technique is proposed to fuse information from both modalities, followed by the utilization of a self-attention mechanism to generate a set of features with internal correlations, which are then used to derive the classification results. This cross-modal attention approach avoids transferring negative knowledge between modalities, while the self-attention mechanism helps the model learn the correlations within the fused data.

The paper is organized as follows. Section 2 discusses the related studies that form the literature base for this work, Section 3 puts forth the proposed multimodal deep learning architecture, Section 4 tabulates the results obtained and compares them with previous works, Section 5 discusses the results and assesses the impact of the research, and Section 6 concludes the paper, followed by Section 6.1, where the study challenges, limitations, and future work are discussed.

Table 1 Related works

2 Related works

A survey of the related works proposing unimodal and multimodal approaches for disaster assessment is tabulated in Table 1.

3 Methodology

The proposed multimodal model architecture is illustrated in Fig. 2, and the model layers and parameters are tabulated in Table 5.

Fig. 2 Proposed multimodal model architecture

3.1 Data collection

The proposed methodology is trained and tested on the publicly available CrisisMMD [17] dataset. This multimodal dataset consists of tweets and associated images collected during seven natural disasters that took place in 2017: Hurricane Irma, Hurricane Harvey, Hurricane Maria, the Mexico earthquake, the California wildfires, the Iraq-Iran earthquakes, and the Sri Lanka floods. The dataset and its feature labels are described as follows:

  1. General Category: It is sub-tasked in the following manner.

     (a) Informativeness (Informative and Non-informative)

     (b) Disaster Type (Earthquake; Flood; Hurricane; Non-informative; Wildfire)

  2. Humanitarian Categories (Affected individuals; Infrastructure and utility damage; Not humanitarian; Other relevant information; Rescue, volunteering, or donation effort)

  3. Damage Severity Assessment (Little or no damage; Mild damage; Severe damage)

The distribution of text and images in the CrisisMMD dataset is tabulated in Table 2. The dataset includes tweet text and image pairs with independently annotated labels for each task; however, only the subset in which the text and image carry the same label is used in this study. For the damage severity assessment task, which includes labels only for the images, the image labels are used for the corresponding tweets as well [4, 18, 19].

Table 3 puts forth the training, test, and validation splits for the various tasks experimented on in the study. A detailed breakdown of the number of samples used is presented in Table 4.

Table 2 Data distribution of the CrisisMMD dataset
Table 3 No. of samples and percentage of dataset used for all tasks
Table 4 Data distribution of the data for each train, dev and test splits

3.2 Data preprocessing

3.2.1 Text

Text from tweets frequently contains noisy data, such as emoticons, symbols, and invisible characters. Therefore, the textual input was preprocessed using NLTK libraries to remove stop words, URLs, hashtag signs, and non-ASCII characters. Punctuation marks were replaced with whitespace, and emoticons were replaced with their associated meanings. The majority of misspelt words were corrected, and abbreviations were expanded appropriately (Fig. 3).
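The cleaning steps above can be sketched as follows. This is an illustrative NLTK-based example rather than the exact implementation used in the study; the emoticon dictionary shown is a small placeholder.

```python
# Illustrative sketch of the tweet-cleaning steps described above (assumes the
# NLTK "stopwords" and "punkt" corpora have already been downloaded); the
# emoticon dictionary is a placeholder, not the full mapping used in the study.
import re
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
EMOTICON_MAP = {":)": "happy", ":(": "sad"}           # placeholder emoticon dictionary

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)          # remove URLs
    text = text.replace("#", "")                      # drop hashtag signs, keep the word
    text = text.encode("ascii", "ignore").decode()    # drop non-ASCII characters
    for emoticon, meaning in EMOTICON_MAP.items():    # replace emoticons with meanings
        text = text.replace(emoticon, meaning)
    text = text.translate(str.maketrans(string.punctuation,
                                        " " * len(string.punctuation)))  # punctuation -> whitespace
    tokens = [t for t in word_tokenize(text.lower()) if t not in STOP_WORDS]
    return " ".join(tokens)
```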

3.2.2 Image

For image feature extraction, the input images are resized and cropped to 224x224 pixels, scaled so that pixel values lie in the range 0 to 1, and each channel is normalized with respect to the ImageNet dataset statistics.
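A minimal torchvision sketch of this preprocessing is given below; resizing to 256 pixels before the 224x224 centre crop is an assumption, since the exact resize strategy is not stated.

```python
# Assumed torchvision pipeline for the preprocessing described above.
from torchvision import transforms

image_transform = transforms.Compose([
    transforms.Resize(256),                              # assumed resize before cropping
    transforms.CenterCrop(224),                          # crop to 224x224
    transforms.ToTensor(),                               # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet channel means
                         std=[0.229, 0.224, 0.225]),     # ImageNet channel stds
])
```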

Fig. 3 Flowchart for unimodal model selection and multimodal model processing

3.3 Feature extraction

3.3.1 Text

The input text \(X_t\) is first tokenized [20, 21] and then encoded using different variants of the BERT model, namely BERT [22], RoBERTa [23], and BERTweet [24]. This produces a sequence of text features \(h_t\) that capture the relevant semantic and contextual information of the input text. The RoBERTa variant is selected for the final multimodal architecture because it achieves the highest text-classification accuracy, ensuring robust and precise integration of the textual component within the broader multimodal framework. The text encoding step using the selected RoBERTa model is mathematically represented in (1).

$$\begin{aligned} h_t = RoBERTa(X_t) \end{aligned}$$
(1)

where \(X_t\) is the input text and \(h_t \in \mathbb {R}^{128 \times 1024}\) is the text feature map obtained from the RoBERTa model.
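A hedged Hugging Face Transformers sketch of this encoding step is shown below; the roberta-large checkpoint and a maximum sequence length of 128 are assumptions consistent with the stated feature-map shape of \(128 \times 1024\).

```python
# Sketch of the text-encoding step in (1); checkpoint and max_length are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
text_encoder = AutoModel.from_pretrained("roberta-large")

def encode_text(tweet: str) -> torch.Tensor:
    inputs = tokenizer(tweet, padding="max_length", truncation=True,
                       max_length=128, return_tensors="pt")
    with torch.no_grad():
        h_t = text_encoder(**inputs).last_hidden_state    # shape (1, 128, 1024)
    return h_t.squeeze(0)                                 # text feature map, 128 x 1024
```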

3.3.2 Image

The input image \(X_i\) is passed through an image encoder, which produces a set of image features \(h_i\) that capture the essential visual information in the input image. The Vision Transformer (ViT) [25] is selected over the CNN architectures [3] tested, ResNet [26] and DenseNet [27], because of its higher performance, which is critical for effectively extracting the features required for disaster classification. This choice enables strong integration of the visual cues required for identifying disaster-related patterns within the larger multimodal framework. The image encoding step is mathematically represented in (2).

$$\begin{aligned} h_i = ViT(X_i) \end{aligned}$$
(2)

where \(X_i\) is the input image and \(h_i \in \mathbb {R}^{197 \times 768}\) is the image feature map obtained from the ViT model.
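A corresponding sketch for the image encoder is given below; the google/vit-base-patch16-224 checkpoint is an assumption consistent with the stated feature-map shape of \(197 \times 768\) (196 patch tokens plus the [CLS] token).

```python
# Sketch of the image-encoding step in (2); the checkpoint name is an assumption.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel   # older versions use ViTFeatureExtractor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

def encode_image(path: str) -> torch.Tensor:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        h_i = image_encoder(**inputs).last_hidden_state   # shape (1, 197, 768)
    return h_i.squeeze(0)                                 # image feature map, 197 x 768
```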

3.4 Multimodal fusion

The proposed work incorporates two primary forms of multimodal sequence data: text (t) and image (i). To fuse the features from both modalities, a cross-modal attention mechanism is employed. Cross-modal attention is a fusion approach that leverages attention to blend features from multiple modalities, and it is particularly useful when complementary information is spread across modalities. This fusion method has been prominent in recent cross-modal learning approaches [28,29,30,31,32,33].

In the dataset employed, the textual data provides context by describing the location and severity of the damage, while the images depict the extent of the damage visually. Cross-modal attention allows these different forms of data to be merged effectively, improving the overall understanding of the event.

Furthermore, it facilitates semantic interpretation and improves the accuracy of disaster-scene assessment by matching semantically related features across modalities. This helps identify the most relevant parts of the text given the image \((t \in \{t, i\})\), and the most relevant parts of the image given the text \((i \in \{t, i\})\).

In simpler terms, cross-modal attention captures connections between different types of information, like images and text. It helps understand how they relate to each other. For example, it can help identify the relevant regions in the images that correspond to the mentioned information in the text by attending to specific visual features.

To ensure that both sequences are of the same dimension, the output features are passed through feed-forward neural encoders, \(E_t\) and \(E_i\), as represented in (3)-(6).

$$\begin{aligned} E_t(t) = tanh(Linear(tanh(Conv1d(t)))) \end{aligned}$$
(3)
$$\begin{aligned} \bar{h}_{t} = E_t(h_t) \end{aligned}$$
(4)

where the encoder \(E_t\) comprises a 1D convolution layer and a linear layer with tanh activation, and \(\bar{h}_t \in \mathbb {R}^{128}\) is the textual hidden vector.

$$\begin{aligned} E_i(i) = tanh(Linear(tanh(Conv1d(i)))) \end{aligned}$$
(5)
$$\begin{aligned} \bar{h}_{i} = E_i(h_i) \end{aligned}$$
(6)

where the encoder \(E_i\) comprises a 1D convolution layer and a linear layer with tanh activation, and \(\bar{h}_i \in \mathbb {R}^{128}\) is the image hidden vector.

A fully connected layer is used as a shared encoder to help identify the relevant parts of each modality given the other, as shown in (7) and (8).

$$\begin{aligned} E_s(s) = tanh(Linear(s)) \end{aligned}$$
(7)
$$\begin{aligned} \bar{h}_{t \in \{t, i\}}, \bar{h}_{i \in \{t, i\}} = E_s(\{\bar{h}_t, \bar{h}_i\}) \end{aligned}$$
(8)

where the encoder \(E_s\) comprises a linear layer with tanh activation, and \(\bar{h}_{t \in \{t, i\}} \in \mathbb {R}^{128}\) and \(\bar{h}_{i \in \{t, i\}} \in \mathbb {R}^{128}\) are the shared text and image hidden vectors.

Cross-modal attention uses the information interaction between the text and image modalities to produce a set of weighted features of both modalities that capture the important information in each modality. These representations are then stacked into a matrix, \(X_m \in \mathbb {R}^{4 \times 128}\), as shown in (9).

$$\begin{aligned} X_m = [\bar{h}_{t}, \bar{h}_{i}, \bar{h}_{t \in \{t, i\}}, \bar{h}_{i \in \{t, i\}}] \end{aligned}$$
(9)
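The sketch below shows one way the encoders in (3)-(8) and the stacking in (9) could be realized in PyTorch; the Conv1d kernel size and the mean pooling used to collapse the sequence dimension are illustrative assumptions, since those details are not specified above.

```python
# Assumed PyTorch realization of E_t, E_i, E_s and the stacked matrix X_m in (9).
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """E_t / E_i: tanh(Linear(tanh(Conv1d(.)))) producing a 128-d hidden vector."""
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden_dim, kernel_size=1)   # kernel size assumed
        self.linear = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, feat_dim); Conv1d expects (batch, channels, seq_len)
        x = torch.tanh(self.conv(h.transpose(1, 2)))     # (batch, 128, seq_len)
        x = x.mean(dim=-1)                               # assumed pooling over the sequence
        return torch.tanh(self.linear(x))                # (batch, 128)

class SharedEncoder(nn.Module):
    """E_s: tanh(Linear(.)) producing the shared cross-modal hidden vectors."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h_bar: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.linear(h_bar))

E_t, E_i, E_s = ModalityEncoder(1024), ModalityEncoder(768), SharedEncoder()
h_t = torch.randn(1, 128, 1024)                          # dummy RoBERTa features
h_i = torch.randn(1, 197, 768)                           # dummy ViT features
h_t_bar, h_i_bar = E_t(h_t), E_i(h_i)                    # eqs (4) and (6)
X_m = torch.stack([h_t_bar, h_i_bar,
                   E_s(h_t_bar), E_s(h_i_bar)], dim=1)   # eq (9): shape (1, 4, 128)
```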

3.5 Multimodal association

The proposed work adopts the self-attention mechanism of the Transformer [34], i.e., scaled dot-product attention, in which the dot products are scaled down by \(\sqrt{d_k}\). By leveraging this attention mechanism, the model can effectively prioritize relevant information while down-weighting less important elements, improving its ability to process and understand complex data patterns. This enables the model to efficiently attend to key features within the input sequences, enhancing its overall performance in tasks such as disaster assessment, where capturing intricate relationships between modalities is crucial for accurate analysis. It is calculated via (10).

$$\begin{aligned} Attention(Q, K, V) = SoftMax\left( \frac{QK^T}{\sqrt{d_k}}\right) \cdot V \end{aligned}$$
(10)

where Q, K, and V are the query, key, and value matrices, respectively.

The multimodal data with computed internal correlations, \(X_{Att}\), is given by (11).

$$\begin{aligned} X_{Att} = Attention(X_m) = [\hat{h}_{t}, \hat{h}_{i}, \hat{h}_{t \in \{t, i\}}, \hat{h}_{i \in \{t, i\}}] \end{aligned}$$
(11)

where \(X_{Att} \in \mathbb {R}^{4 \times 128}\) is the Transformer output, and \(\hat{h}_{t}, \hat{h}_{i}, \hat{h}_{t \in \{t, i\}}, \hat{h}_{i \in \{t, i\}} \in \mathbb {R}^{128}\) are the computed internal correlation vectors.
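A compact sketch of (10)-(11) is shown below; taking Q = K = V = X_m is the standard self-attention setting assumed here, and the learned query/key/value projections of a full Transformer block are omitted for brevity.

```python
# Scaled dot-product self-attention over the stacked multimodal matrix X_m.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, 4, 4) attention weights
    return torch.softmax(scores, dim=-1) @ V            # (batch, 4, 128)

X_m = torch.randn(1, 4, 128)                            # stacked hidden vectors from (9)
X_att = scaled_dot_product_attention(X_m, X_m, X_m)     # internal correlation vectors, (11)
```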

By computing internal correlations, we aim to capture the interdependencies and relationships within the data, allowing our model to effectively understand the complex interactions between modalities. This multimodal representation serves as the foundation for our fusion strategy, enabling us to leverage the complementary information across modalities for tasks such as disaster assessment (Table 5).

Table 5 Model layers and parameters

A joint-vector, \(X_{classify} \in \mathbb {R}^{512}\), is constructed by concatenating the attention output, \(X_{Att}\), as given in (12).

$$\begin{aligned} X_{classify} = [\hat{h}_{t} \oplus \hat{h}_{i} \oplus \hat{h}_{t \in \{t, i\}} \oplus \hat{h}_{i \in \{t, i\}}] \end{aligned}$$
(12)

3.6 Output

The multimodal fused feature data, \(X_{classify}\), is processed by a decoder, D, to derive the classification results, as given in (13) and (14).

$$\begin{aligned} D(x) = Linear(tanh(Linear(x))) \end{aligned}$$
(13)
$$\begin{aligned} y = D(X_{classify}) \end{aligned}$$
(14)

where the decoder, D, comprises two linear layers with a tanh activation, \(X_{classify} \in \mathbb {R}^{512}\) is the aggregated joint-vector, and \(y \in \mathbb {R}^{n\_classes}\) is the classification result.
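The sketch below illustrates (12)-(14): the four 128-dimensional attention outputs are flattened into a 512-dimensional joint vector and decoded by two linear layers; the width of the intermediate layer is an assumption, as it is not stated above.

```python
# Assumed decoder realizing D(x) = Linear(tanh(Linear(x))) over the joint vector.
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, n_classes: int, hidden_dim: int = 256):   # hidden width assumed
        super().__init__()
        self.fc1 = nn.Linear(512, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_classes)

    def forward(self, x_classify: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.tanh(self.fc1(x_classify)))        # class logits

X_att = torch.randn(1, 4, 128)              # Transformer output from (11)
X_classify = X_att.flatten(start_dim=1)     # concatenation in (12), shape (1, 512)
y = Decoder(n_classes=2)(X_classify)        # e.g. the binary informativeness task
```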

Table 6 Model performance comparison for text modality

4 Results

4.1 Text modality

The text dataset was trained using three BERT models for a comparative study: BERT, RoBERTa, and BERTweet.

BERT [22], short for Bidirectional Encoder Representations from Transformers, has revolutionized natural language understanding by capturing contextual information from both preceding and subsequent words in a sentence. BERT can evaluate tweets containing information about disasters by taking into account the surrounding context, which helps it understand the nature, location, impact, and severity of the event. By pre-training on a wide corpus of text, including tweets, BERT can recognize the subtle linguistic cues characteristic of disaster-related content, allowing it to identify terms, expressions, and background information that denote crises or disasters.

BERT merely has to be fine-tuned for disaster assessment, usually by adding an output layer customized for the particular task at hand. Because it can quickly and efficiently adapt its learned representations to the subtleties of disaster-related language without requiring significant task-specific architectural changes, BERT is a strong choice for analysing disaster-related tweets.

RoBERTa [23], building upon BERT's foundation, enhances pretraining techniques to better capture language context. Its optimizations make it more effective at understanding the subtle language used in tweets. RoBERTa's refinements include removing the Next Sentence Prediction (NSP) task, introducing dynamic masking, and using larger batch sizes, which together offer better contextual understanding for applications such as disaster assessment.

BERTweet [24] is an adaptation of BERT fine-tuned and trained on tweets to learn tweet-specific features and linguistic nuances. Unlike regular BERT or RoBERTa, BERTweet also includes tweet-specific pre-processing steps, such as dedicated handling of URLs, hashtags, and mentions.

The models were trained with a batch size of 16 for 10 epochs with early stopping and a learning rate of \(10^{-6}\). Table 6 presents the accuracy, precision, recall, and F1 score of the models on the test data. From Table 6, we can conclude that RoBERTa performs much better than BERT and BERTweet on the informativeness and damage severity tasks, while BERTweet performs better on the humanitarian task.

Table 7 Model performance comparison for image modality

4.2 Image modality

To perform a comparative study, the image dataset was trained using two CNN models, ResNet-50 and DenseNet-121, and the Vision Transformer [35].

ResNet-50 [26], a variant of the Residual Neural Network (ResNet) architecture, is designed to tackle the challenge of training very deep networks. It acts as a feature extractor, identifying important visual features in disaster images. By traversing its layers, ResNet-50 extracts hierarchical features ranging from low-level details like edges to high-level concepts such as objects and scenes. This feature extraction is beneficial for tasks like classifying or detecting disaster-related elements. Pretraining ResNet-50 on large-scale image datasets like ImageNet allows it to learn generic visual representations. Fine-tuning on a smaller set of disaster images further customizes its learned features to better suit disaster scenes. ResNet-50 excels in object detection and localization tasks, enabling the identification and classification of damaged buildings, infrastructure, or vehicles within disaster imagery. Additionally, ResNet-50 aids in scene understanding by categorizing images into different types or severity levels of disasters. This categorization facilitates the development of efficient response strategies.

DenseNet-121 [27], a variant of the Densely Connected Convolutional Network (DenseNet) family, is also analysed in this study to test its efficacy in capturing vital disaster-related image features. DenseNet-121 employs dense connectivity, where each layer receives input not just from its immediate predecessor but from all preceding layers. This dense inter-layer connection fosters enhanced feature reuse and more efficient information flow throughout the network. Unlike ResNet-50, which utilizes skip connections for layer interactions, DenseNet-121's connectivity pattern promotes a more direct propagation of information, potentially mitigating challenges related to vanishing gradients and facilitating effective learning, especially in deeper architectures.

Table 8 Model performance comparison for multimodal fusion

In the domain of disaster assessment through image analysis, Vision Transformers (ViTs) [25] present a novel approach distinct from traditional convolutional neural network (CNN) architectures like ResNet-50 and DenseNet-121. ViTs leverage the transformer architecture, originally designed for natural language processing tasks, to directly process image data as sequences of tokens. This fundamentally differs from CNNs, which process images through a series of convolutional and pooling layers. One significant difference between ViTs and CNNs lies in their attention mechanisms. ViTs utilize self-attention mechanisms to capture global dependencies between different parts of the image, allowing them to consider long-range interactions and contextual relationships. In contrast, CNNs like ResNet-50 and DenseNet-121 primarily rely on local receptive fields and hierarchical feature extraction, which may limit their ability to capture global context efficiently.

Table 9 Summary of results - Task 1(a)
Table 10 Summary of results - Task 1(b)

The models were trained with a batch size of 10 for 10 epochs with early stopping. The learning rate was optimized to minimize loss and gradient. Table 7 presents the accuracy, precision, recall, and F1 score of the models on the test data. From Table 7, we can conclude that Vision Transformer performs much better than ResNet and DenseNet at extracting relevant information from the images for disaster assessment.

Table 11 Summary of results - Task 2
Table 12 Summary of results - Task 3

4.3 Multimodal fusion

The model was built using the architecture proposed in Section 3 and trained with a batch size of 16 for 10 epochs with early stopping. Cross-entropy loss was chosen as the loss function, along with the Adam optimizer [36] with a learning rate of \(10^{-5}\). For a comparative study of the performance of the proposed model, the accuracy, precision, recall, and F1 score of the proposed model, score fusion, and other best-performing models on the test data are tabulated in Table 8.
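The training configuration can be summarized in the following self-contained sketch; the stand-in model, the synthetic batch, and the early-stopping patience are placeholders rather than the actual implementation.

```python
# Condensed sketch of the training setup: cross-entropy loss, Adam (lr = 1e-5),
# batch size 16, up to 10 epochs with simple early stopping on validation loss.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.Tanh(), nn.Linear(256, 2))   # stand-in model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

best_val, patience, bad_epochs = float("inf"), 2, 0                        # patience assumed
for epoch in range(10):
    features = torch.randn(16, 512)                # one synthetic batch of fused features
    labels = torch.randint(0, 2, (16,))
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()

    val_loss = loss.item()                         # placeholder for the validation loss
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # early stopping
            break
```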

In score fusion, the probability outputs of the text-only and image-only classifiers are combined to obtain the final predictions. Score fusion is used here as a baseline for comparing the performance of the proposed multimodal model.
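A minimal sketch of the score-fusion baseline is given below; averaging the two classifiers' softmax outputs with equal weights is an assumed combination rule.

```python
# Score fusion: combine the probability outputs of the unimodal classifiers.
import torch

text_logits = torch.tensor([[2.1, -0.3]])     # dummy text-only classifier output
image_logits = torch.tensor([[0.4, 1.2]])     # dummy image-only classifier output

fused = 0.5 * torch.softmax(text_logits, dim=-1) + \
        0.5 * torch.softmax(image_logits, dim=-1)
prediction = fused.argmax(dim=-1)             # final class prediction
```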

The proposed approach is the best-performing model for Task 1(a), Informativeness, achieving a higher accuracy (91.53%) than the previously best-performing models.

Fig. 4 Task 1(a): confusion matrix

Fig. 5 Task 1(a): precision-recall curve

5 Discussion

5.1 Summary of results

This section compares the performance of the best-performing text and image models and the proposed multimodal fusion model for each task. The evaluation metrics include Accuracy, Precision, Recall, F1 score, Area Under the Receiver Operating Characteristic curve (AUC-ROC), Area Under the Precision-Recall curve (AUC-PR), Logarithmic loss, Matthews Correlation Coefficient (MCC), Specificity, Cohen's Kappa, Balanced accuracy, Youden's J statistic, and the positive and negative likelihood ratios.

Tables 9, 10, 11, and 12 show the performance of the text-only classifier, the image-only classifier, and the proposed multimodal model on the various tasks. Figures 4, 6, 8, and 10 show the confusion matrices, while Figs. 5, 7, 9, and 11 plot the precision-recall curves of the models on the various tasks.

Fig. 6 Task 1(b): confusion matrix

Fig. 7 Task 1(b): precision-recall curve

Fig. 8 Task 2: confusion matrix

Fig. 9 Task 2: precision-recall curve

In general, the analysis of the text-only classifier reveals promising results across multiple evaluation metrics, with the image-only classifier performing comparably. The text-only classifier performs better on the informativeness and humanitarian tasks, suggesting that such information can be better inferred from language, while the image-only classifier performs better on the damage severity task, highlighting its advantage in identifying the severity of damage from images rather than text. The proposed multimodal model, however, outperforms both the text-only and image-only classifiers.

Notably, the multimodal approach achieves higher accuracy, precision, recall, and F1 score, indicating its effectiveness in integrating textual and visual information for improved disaster assessment. AUC-ROC and AUC-PR metrics demonstrate improved discrimination between the class labels, reflecting the model’s ability to leverage complementary features from both modalities. Logarithmic loss suggests reduced predictive uncertainty compared to individual modalities, while MCC signifies substantial agreement between predicted and actual outcomes. Specificity, Cohen’s Kappa, balanced accuracy, and Youden’s J statistic highlight the multimodal model’s superior performance in correctly identifying non-disaster instances and distinguishing true negatives. Positive and negative likelihood ratios further underscore the model’s capacity to differentiate between positive and negative class labels with heightened certainty and reliability, affirming the efficacy of multimodal deep learning in disaster assessment.

Fig. 10 Task 3: confusion matrix

Fig. 11 Task 3: precision-recall curve

5.2 Discussion

The proposed multimodal fusion technique is a novel approach that leverages both cross-modal and self-attention mechanisms. This innovation goes beyond previous approaches, which solely relied on cross-modal attention, to achieve superior information extraction at the intermediate fusion level. This section highlights the justification for employing the proposed key components in constructing the architecture. The justification is as follows:

  1. Capturing Cross-Modal Dependencies: Cross-modal attention allows the model to efficiently incorporate input from both the text and visual modalities. In disaster assessment, access to both visual (image) and contextual (textual) information is essential for a complete understanding of the event. Cross-modal attention captures the interdependencies between linguistic and visual signals by allowing the model to focus on relevant portions of the image while also taking into consideration the textual descriptions or captions that correspond with those areas, and vice versa.

  2. Self-Attentive Mechanism: The self-attentive mechanism, also referred to as intra-modal attention, captures long-range dependencies within a single modality. Applying self-attention over the combined text and image features lets the model capture the inherent relationships and connections within each modality.

  3. Intermediary Feature Fusion: Cross-modal attention combined with the self-attentive (scaled dot-product) mechanism serves as an intermediary feature fusion method. It allows the model to combine information from the image and text domains at a finer level, identifying connections and correlations between them. For instance, while an image may show debris and destruction, the accompanying text might provide insights into the specific location or type of disaster. This kind of fusion enhances contextual understanding of the situation by mitigating the limitations owing to ambiguous or incomplete information in any individual modality.

The results demonstrate that the proposed approach is particularly effective in capturing detailed dependencies within and between different modalities, enabling it to make more informed decisions, especially in disaster assessment tasks.

The proposed intermediate fusion provides a more contextually rich and nuanced representation than early and late fusion. Early fusion integrates modalities at the input level, potentially leading to information loss or oversimplification of the multimodal data. Late fusion, on the other hand, merges modalities at a higher level of representation, potentially missing subtle interactions between them. In essence, intermediate fusion improves the interpretability and transparency of the model's decision-making process by providing a greater understanding of how different modalities affect the model's predictions. The proposed framework captures the complex interactions and dependencies across modalities required for tasks like damage detection and situational awareness. This thorough integration of multimodal data allows the model to make more accurate and informed judgments during calamities [37, 38].

5.3 Significance of the work

In this era of social media, employing AI for disaster assessment within social media contributes substantially to a more holistic understanding of the situation. For example, by analyzing both textual information and visual content, analysts can pinpoint areas with high levels of damage or where help is urgently needed. These approaches offer thorough data analysis by leveraging information from multiple sources, thereby enhancing the model's performance even when dealing with the incomplete data often encountered in online social networks (OSNs). Most significantly, by cross-referencing information offered by multiple modalities, these multimodal fusion methodologies provide substantial information to disaster rescue teams, enabling timely assistance.

6 Conclusion

In this highly interconnected world, harnessing social networks to advance real-time crisis management operations is a breakthrough. Multimodal systems have proven highly useful in assessing disaster intensity. Over the course of this research, we demonstrated that the proposed intermediate fusion approach outperforms early and late fusion methods when using multimodal learning models for tasks such as damage detection, resource allocation, and situational awareness during a crisis. By combining textual and visual data and integrating information at a granular level through cross-modal attention, self-attentive mechanisms, and intermediary feature fusion, the model synthesizes a holistic understanding of catastrophic occurrences.

The proposed research holds significant implications by offering a deeper comprehension of disaster scenarios, thereby substantially improving disaster response and recovery endeavors. This enhanced understanding enables responders to make better-informed decisions, resulting in more efficient allocation of resources and coordinated efforts. Beyond disaster assessment, the proposed method holds promise for possible extension in diverse domains such as medical diagnosis, autonomous systems, and multimedia content analysis. In the field of medical diagnostics, utilizing data from many sources like scans, textual data, and patient records has tremendous potential for improving knowledge and aiding the early identification of diseases. Similarly, the proposed approach can be applied to improve decision-making in autonomous systems such as robots and surveillance. Furthermore, in multimedia content analysis, the proposed method can be extended to improve the relevance and accuracy of results for tasks such as content-based retrieval and cross-modal information retrieval.

6.1 Study challenges, limitations and future work

During the research, various challenges related to model training and dataset limitations were encountered. The Transformer-based architectures used in the study, such as RoBERTa and the Vision Transformer, showed a tendency to overfit, and the deep learning classifiers implemented were prone to catastrophic forgetting during knowledge transfer. To address these issues, a range of learning rates, regularization settings, and other hyperparameters were explored to find the optimal configuration, self-attention mechanisms were employed to mitigate the problems associated with incremental learning, and metrics such as validation loss were used to measure the performance of the classifiers [39].

The damage severity task relies heavily on visual information. In addition, the data exhibits class imbalance, with fewer samples in the little-to-no and mild damage classes compared to severe damage. In Task 2 as well, suitable message pairs containing both tweets and images were not available in equal proportions across all categories. These factors led to poorer performance of the methodology in Task 3 relative to the other tasks. This limitation can be tackled in future research through additional preprocessing techniques, such as data augmentation, SMOTE-based resampling, appropriate class weighting, balanced batch generation, and synthetic data creation using GANs.

Research involving data from online social networks carries a considerable degree of bias because of differences in data representation, noise, and a dearth of ground-truth labels. Furthermore, the abundance of misinformation in news-related data on social media presents considerable difficulty, and the profusion of inconsistent data across different social network components adds another challenge to the analysis. One significant shortcoming of the proposed study is that it does not handle cases in which the textual and visual information convey conflicting facts about a particular event, nor does it address the verification of misinformation conveyed through the various modalities.

Analyzing discriminative features or the correlation of characteristics across modalities could address the aforementioned limitations. Techniques such as information relevance [40] can be explored to enhance model learning. To capture the complementary nature of visual and textual information, studies could evaluate image-text relations and confirm these relations using appropriate metrics [41]. Additionally, AI-assisted labeling can be utilized to improve annotation quality, particularly when labels are inconsistent. Techniques such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) can be employed to validate inconsistencies in the data while preserving both its local and global structure.

Future research in this direction can include methods to improve the generalizability of the model. This might include adding additional interpretability to deep learning models or adopting explainable AI approaches [42] to give explanations or reasons for AI models’ judgments. Additionally, future research could explore methods for verifying the accuracy of information and incorporating privacy metrics, such as ensuring the anonymity of users involved in accidents during disasters.

In upcoming research endeavors, more intricate multimodal fusion methods, such as hierarchical structures, hybrid architectures, and improved attention mechanisms [43,44,45] that improve image topic representations, can be investigated to increase the accuracy of disaster assessments while incorporating textual and visual elements more effectively. Additionally, weakly supervised and semi-supervised learning strategies can be employed in the future to take advantage of the wealth of unlabeled social media data and label it, thereby reducing the dependence on manually labeled datasets. Furthermore, including additional dimensions such as temporal and spatial information could offer valuable insights for improving disaster severity assessment.