Abstract
Real-time detection of global events, particularly catastrophic ones, has benefited significantly from the ubiquitous adoption of social media platforms and from advances in image classification and natural language processing. Social media is a rich repository of multimedia content during disasters, encompassing reports on casualties, infrastructure damage, and information about missing individuals. While previous research has predominantly concentrated on either textual or image analysis, the proposed study presents a multimodal middle fusion paradigm that combines cross-modal attention and self-attention to improve learning from both image and text modalities. Through rigorous experimentation, we validate the effectiveness of the proposed middle fusion paradigm in leveraging complementary information from textual and visual sources. The proposed intermediate design outperforms current late and early fusion structures, achieving accuracies of 91.53% and 91.07% in the informativeness and disaster type recognition categories, respectively. This study is among the few that examine all three tasks in the CrisisMMD dataset by combining textual and image analysis, demonstrating an improvement of about 2% in prediction accuracy compared to similar studies on the same dataset. Additionally, ablation studies indicate that it outperforms the best-selected unimodal classifiers, with a 3-5% increase in prediction accuracy across various tasks. Thus, the method aims to bolster emergency response capabilities by offering more precise insights into evolving events.
1 Introduction
The process of preparing for, responding to, and learning from the repercussions of big failures is known as disaster management. It encompasses how individuals and organizations cope with the human, material, economic, or environmental implications of a disaster. Natural catastrophes and military warfare have consistently created peaks in death and morbidity, marking human life throughout history.
Social media networks receive enormous volumes of photos and text every second, documenting a wide range of events in our immediate environment. With large-scale image recognition and textual understanding as foundational tools, it is now feasible to recognize and categorize occurrences worldwide in real time.
During emergencies, crucial information for both first responders and the public often comes through image-text pairs. However, just before disasters, users may upload fragmented or contradictory information as events unfold, making it challenging to identify important occurrences accurately. This leads to difficulties in automatically assessing the severity of disasters. While computer vision algorithms and natural language processing techniques can classify data effectively, incomplete data presents a significant obstacle [1]. Therefore, augmenting data is essential to ensure accurate classification and enable effective learning from the information available.
Figure 1 displays tweets from three recent significant disasters and related photographs. It is observed that relying solely on one modality often results in the loss of critical information. For instance, even if the tweet text in Fig. 1(d) mentions an earthquake in Southern Mexico with a 7.1 magnitude, the extent of the damage caused by this earthquake cannot be deduced from the language. However, upon examining the picture accompanying this tweet, the immense damage caused by the earthquake becomes evident.
In situations like this, learning from different modalities of data can significantly enhance analysis [2]. A comprehensive, nuanced understanding of the situation can be achieved by incorporating diverse sources such as images and text. This approach enables the model to cross-reference information, filling gaps [3] and validating data across multiple modalities.
Multimodal deep learning is a machine learning paradigm that combines several data types, such as pictures, text, audio, and numerical data, with multiple processing methods to provide better results. Such models analyze a situation more effectively because some signals are present only in particular modalities. Multimodal analysis of social media data has received little attention, despite the successes of multimodal learning in other fields. Supporting humanitarian organizations’ planning, mitigation, response, and recovery activities with social media data requires time-critical analysis of the multimedia information produced during a crisis event.
The motivation for employing multimodal learning in disaster assessment arises from the limited exploration of automatically detecting crisis events using both visual and textual information. Despite the acknowledged importance of utilizing AI for social good and the significant impact of social media, there has been minimal research on this specific intersection. The prevalence of social media in crisis contexts and the extensive history of interdisciplinary research in humanitarian crisis efforts underscore the need for a more comprehensive approach. Previous methods concentrating on either images or text alone have prompted the exploration of multimodal learning as a more holistic and effective strategy for crisis event detection.
To address these issues, the proposed work assesses the severity of the situation based on the following labels: 1) (a) Informativeness: whether the social media post may be used to provide humanitarian relief in an emergency, (b) Disaster Type: identifying the type of disaster, 2) Humanitarian Category: categorizing the specific type of humanitarian information provided in the tweet, and 3) Severity: determining the severity of the emergency based on the damage represented in the image and text. The contributions of the work are as follows:
- A novel, multimodal middle fusion based framework is introduced for classifying multimodal data in the crisis domain. The approach involves extracting image features from the picture and generating word embeddings for the text for each image-text combination.
- A cross-modal attention technique is proposed to fuse information from both modalities, followed by the utilization of a self-attention mechanism to generate a set of features with internal correlations, which are then used to derive the classification results. This cross-modal attention approach avoids transferring negative knowledge between modalities, while the self-attention mechanism helps the model learn the correlations within the fused data.
The paper is arranged as follows. Section 2 discusses the related studies that form the literature base for this work. Section 3 puts forth the proposed multimodal deep learning architecture. Section 4 tabulates the results obtained and compares them to previous works. Section 5 discusses the results and assesses the impact of the research. Section 6 concludes the paper, followed by Section 6.1, where the study challenges, limitations, and future work are discussed.
2 Related works
A survey of the related works proposing unimodal and multimodal approaches for disaster assessment is tabulated in Table 1.
3 Methodology
The proposed multimodal model architecture is illustrated in Fig. 2, and the model layers and parameters are tabulated in Table 5.
3.1 Data collection
The proposed methodology is trained and tested on the publicly-available CrisisMMD [17] dataset. This multimodal dataset consists of tweets and associated images collected during seven different natural disasters that took place in 2017: Hurricane Irma, Hurricane Harvey, Hurricane Maria, the Mexico earthquake, the California wildfires, the Iraq-Iran earthquakes, and the Sri Lanka floods. The description of the dataset with the feature labels is as follows:
1. General Category: It is sub-tasked in the following manner.
   1. Informativeness (Informative and Non-informative)
   2. Disaster Type (Earthquake; Flood; Hurricane; Non-informative; Wildfire)
3. Humanitarian Categories (Affected individuals; Infrastructure and utility damage; Not humanitarian; Other relevant information; Rescue, volunteering, or donation effort)
4. Damage Severity Assessment (Little or no damage; Mild damage; Severe damage)
The distribution of text and images in the CrisisMMD dataset is tabulated in Table 2. The dataset includes tweet text and image pairs with independently annotated labels for each task. However, only the subset of pairs in which the text and the image carry the same label is used in the study. For the damage severity assessment task, which only includes labels for the images, the image labels are used for the corresponding tweets as well [4, 18, 19].
Table 3 puts forth the training, test, and validation split for the various tasks covered in the study. A detailed breakdown of the number of samples used in the study is presented in Table 4.
3.2 Data preprocessing
3.2.1 Text
Text from tweets frequently contains noisy data, such as emoticons, symbols, and invisible characters. Therefore, using NLTK libraries, the textual input was preprocessed to remove stop words, URLs, hashtag signs, and non-ASCII characters. Punctuation marks were replaced with whitespace. Emoticons were swapped out for their associated meanings. The majority of misspelt words were corrected, and abbreviations were expanded appropriately (Fig. 3).
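The cleaning steps above can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library: the stop word set and the emoticon map are tiny hypothetical stand-ins for the NLTK resources used in the study, the order of the steps is an assumption, and spelling correction and abbreviation expansion are omitted.

```python
import re
import string

# Hypothetical stand-ins: the study used NLTK resources, not these tiny sets.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and"}
EMOTICONS = {":(": "sad", ":)": "happy"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Maps every ASCII punctuation mark (including '#') to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps described above (assumed order)."""
    text = URL_RE.sub(" ", text)                        # drop URLs
    for emoticon, meaning in EMOTICONS.items():         # emoticon -> meaning
        text = text.replace(emoticon, f" {meaning} ")
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII
    text = text.translate(PUNCT_TABLE)                  # punctuation -> whitespace
    tokens = [tok for tok in text.split() if tok.lower() not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_tweet("Earthquake in #Mexico :( https://t.co/abc the damage é")
```

URLs are removed before punctuation is replaced so that a link is dropped as a whole rather than split into fragments.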
3.2.2 Image
For image feature extraction, the input images are resized and cropped to 224 × 224 pixels, scaled so that pixel values lie in the range 0 to 1, and each channel is normalized with respect to the ImageNet dataset statistics.
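A minimal NumPy sketch of the cropping, scaling, and normalization steps follows. It assumes the image has already been resized so that both sides are at least 224 pixels, uses a center crop (the text does not say which crop was used), and uses the commonly quoted ImageNet per-channel mean and standard deviation; the function names are illustrative.

```python
import numpy as np

# Commonly used ImageNet channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def center_crop(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Crop the central size x size window from an (H, W, 3) image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]


def normalize(img_uint8: np.ndarray) -> np.ndarray:
    """Scale uint8 pixels to [0, 1], then standardize each channel."""
    scaled = img_uint8.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD


def preprocess(img_uint8: np.ndarray, size: int = 224) -> np.ndarray:
    return normalize(center_crop(img_uint8, size))


# Example: an all-white 256 x 300 RGB image.
white = np.full((256, 300, 3), 255, dtype=np.uint8)
out = preprocess(white)
```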
3.3 Feature extraction
3.3.1 Text
The input text \(X_t\) is first tokenized [20, 21] and then encoded using different variants of the BERT model, namely BERT [22], RoBERTa [23], and BERTweet [24]. This produces a sequence of text features \(h_t\) that capture the relevant semantic and contextual information of the input text. The RoBERTa variant is selected for the final multimodal architecture because it exhibits the highest accuracy in text classification, which supports a robust and precise integration of the textual component within the broader multimodal framework. The working of the base BERT model is mathematically represented in (1).
where \(X_t\) is the input text and \(h_t \in \mathbb {R}^{128 \times 1024}\) is the text feature map obtained from the RoBERTa model.
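Reading (1) from the definitions above, a consistent form (an assumption, not necessarily the authors' exact notation) is:

\[ h_t = \mathrm{RoBERTa}(X_t), \qquad h_t \in \mathbb{R}^{128 \times 1024} \]

Under this reading, 128 would be the token sequence length and 1024 the hidden size per token. The latter matches the hidden size of RoBERTa-large, although the text does not state which model size was used.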
3.3.2 Image
The input image \(X_i\) is passed through an image encoder, which produces a set of image features \(h_i\) that capture the essential visual information in the input image. Vision Transformer (ViT) [25] is selected over the CNN architectures [3] tested, ResNet [26] and DenseNet [27], because of its higher performance, which is critical for effectively finding and selecting the features required for disaster classification. This approach allows the strong integration of the visual components required for identifying disaster-related patterns within the larger multimodal framework. The mathematical depiction of the operation of the basic ViT model is found in (2).
where \(X_i\) is the input image and \(h_i \in \mathbb {R}^{197 \times 768}\) is the image feature map obtained from the ViT model.
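Reading (2) from the definitions above, a consistent form (an assumption, not necessarily the authors' exact notation) is:

\[ h_i = \mathrm{ViT}(X_i), \qquad h_i \in \mathbb{R}^{197 \times 768} \]

The shape agrees with a ViT-Base model using 16 × 16 patches on a 224 × 224 input: \(14 \times 14 = 196\) patch tokens plus one class token give 197 tokens, each with hidden size 768. The text does not name the exact ViT configuration, so this correspondence is an inference.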
3.4 Multimodal fusion
The proposed work incorporates two primary forms of multimodal sequence data: Text (t) and Image (i). To fuse the features from both modalities, a cross-modal fusion attention mechanism is employed. Cross-modal attention is a fusion approach that leverages attention processes to blend features from multiple modalities. It is especially useful when the modalities provide complementary information. This fusion method has been prominent in recent cross-modal learning approaches [28,29,30,31,32,33].
In the data set that was employed, textual data gives context by explaining the location and severity of the damage, while images depict the extent of the damage visually. Through the use of cross-modal attention processes, these various forms of data may be effectively merged, enhancing our comprehension of the horrific event as a whole.
Furthermore, it facilitates semantic interpretation and raises the accuracy of assessments of the disaster scene by matching semantically related features across modalities. This helps identify the most relevant parts of the text given the image \((t \in \{t, i\})\), and the most relevant parts of the image given the text \((i \in \{t, i\})\).
In simpler terms, cross-modal attention captures connections between different types of information, like images and text. It helps understand how they relate to each other. For example, it can help identify the relevant regions in the images that correspond to the mentioned information in the text by attending to specific visual features.
To ensure that both sequences are of the same dimension, the output features are passed through feed-forward neural encoders, \(E_t\) and \(E_i\), as represented in (3)-(5).
where the encoder \(E_t\) comprises a 1D convolution layer and a linear layer with a tanh activation function, and \(\bar{h}_t \in \mathbb {R}^{128}\) is the textual hidden vector.
where the encoder \(E_i\) comprises a 1D convolution layer and a linear layer with a tanh activation function, and \(\bar{h}_i \in \mathbb {R}^{128}\) is the image hidden vector.
A fully connected layer is used as an encoder to help identify relevant parts of each modality given the other (7).
where encoder, \(E_s\), is comprised of a linear layer with tanh activation function, and \(\bar{h}_{t \in \{t, i\}} \in \mathbb {R}^{128}\) and \(\bar{h}_{i \in \{t, i\}} \in \mathbb {R}^{128}\) are the shared text and image hidden vectors.
Cross-modal attention uses the information interaction between the text and image modalities to produce a set of weighted features of both modalities that capture the important information in each modality. These representations are then stacked into a matrix, \(X_m \in \mathbb {R}^{4 \times 128}\), as shown in (9).
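The shape flow from the encoder outputs to \(X_m\) can be sketched as follows. This is only a dimensional illustration under several assumptions that the text does not confirm: the 1D convolution is taken to have kernel size 1 and a single output channel, the shared encoder \(E_s\) is applied to each modality's hidden vector, all weights are random rather than trained, and the attention weighting itself is not modeled. The row order of \(X_m\) follows the order in which the four vectors are listed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128  # shared hidden size stated in the text


def make_encoder(seq_len: int, feat_dim: int):
    """Assumed encoder: kernel-size-1 conv (one output channel), then linear + tanh."""
    conv_w = rng.normal(scale=feat_dim ** -0.5, size=feat_dim)    # (feat_dim,)
    lin_w = rng.normal(scale=seq_len ** -0.5, size=(seq_len, D))  # (seq_len, D)

    def encode(h: np.ndarray) -> np.ndarray:
        pooled = h @ conv_w             # (seq_len,): one value per token
        return np.tanh(pooled @ lin_w)  # (D,)

    return encode


E_t = make_encoder(128, 1024)  # text:  h_t in R^{128 x 1024}
E_i = make_encoder(197, 768)   # image: h_i in R^{197 x 768}

W_s = rng.normal(scale=D ** -0.5, size=(D, D))


def E_s(v: np.ndarray) -> np.ndarray:
    """Assumed shared encoder: one linear layer with tanh."""
    return np.tanh(v @ W_s)


h_t = rng.normal(size=(128, 1024))  # placeholder for RoBERTa features
h_i = rng.normal(size=(197, 768))   # placeholder for ViT features

hbar_t, hbar_i = E_t(h_t), E_i(h_i)
X_m = np.stack([hbar_t, hbar_i, E_s(hbar_t), E_s(hbar_i)])  # (4, 128)
```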
3.5 Multimodal association
In the proposed work, the self-attention mechanism of the Transformer [34], also known as scaled dot-product attention, is adopted. By leveraging this attention mechanism, the model can prioritize relevant information while downweighting less important elements, improving its ability to process and understand complex data patterns. The scaling keeps the dot products in a range where the softmax remains well behaved, and the mechanism enables the model to attend to key features within the input sequences, which matters in tasks such as disaster assessment where capturing intricate relationships between modalities is crucial for accurate analysis. It is calculated via (10).
Q, K, V are the query, key, and value matrices respectively.
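For reference, scaled dot-product attention as defined for the Transformer [34], which (10) refers to, is usually written as below. The symbol \(d_k\), the dimensionality of the key vectors, is introduced here and is not defined in the surrounding text.

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \]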
The multimodal data with computed internal correlations, \(X_{Att}\), is given by (11).
where \(X_{Att} \in \mathbb {R}^{4 \times 128}\) is the Transformer output, and \(\hat{h}_{t}, \hat{h}_{i}, \hat{h}_{t \in \{t, i\}}, \hat{h}_{i \in \{t, i\}} \in \mathbb {R}^{128}\) are the computed internal correlation vectors.
By computing internal correlations, we aim to capture the interdependencies and relationships within the data, allowing our model to effectively understand the complex interactions between modalities. This multimodal representation serves as the foundation for our fusion strategy, enabling us to leverage the complementary information across modalities for tasks such as disaster assessment (Table 5).
A joint-vector, \(X_{classify} \in \mathbb {R}^{512}\), is constructed by concatenating the attention output, \(X_{Att}\), as given in (12).
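A small NumPy sketch of this step is given below: scaled dot-product self-attention over the four rows of \(X_m\), followed by flattening into the 512-dimensional joint vector. The projection matrices `W_q`, `W_k`, and `W_v` are random placeholders for learned weights, a single attention head is assumed, and other parts of a full Transformer layer (residual connections, normalization, feed-forward sublayer) are left out.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    shifted = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the weights."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (4, 4): row i attends to all rows
    return weights @ V, weights


rng = np.random.default_rng(0)
X_m = rng.normal(size=(4, 128))  # placeholder for the stacked modality vectors

# Random stand-ins for the learned query, key, and value projections.
W_q, W_k, W_v = (rng.normal(scale=128 ** -0.5, size=(128, 128)) for _ in range(3))

X_att, weights = scaled_dot_product_attention(X_m @ W_q, X_m @ W_k, X_m @ W_v)
X_classify = X_att.reshape(-1)   # concatenate the four 128-d rows -> 512-d
```

Each row of `weights` sums to one, so every output row is a convex combination of the value vectors of all four inputs. This is how each modality-specific vector picks up information from the others.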
3.6 Output
The multimodal fused feature data, \(X_{classify}\), is computed by a decoder, D, to derive the classification results using (13).
where decoder, D, is comprised of two linear layers with tanh activation function, \(X_{classify} \in \mathbb {R}^{512}\) is the aggregated joint-vector and \(y \in \mathbb {R}^{n\_classes}\) is the classification result.
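The decoder can be sketched as a two-layer feed-forward network. The hidden width (128 here) is hypothetical, the tanh is assumed to sit between the two linear layers, and the softmax that turns the output scores into probabilities is an assumption motivated by the cross-entropy loss used for training; the weights are random placeholders.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()


def decode(x_classify: np.ndarray, params) -> np.ndarray:
    """Two linear layers with a tanh in between; returns one score per class."""
    W1, b1, W2, b2 = params
    hidden = np.tanh(x_classify @ W1 + b1)
    return hidden @ W2 + b2


rng = np.random.default_rng(0)
n_classes = 5      # e.g. the five disaster type labels
hidden_dim = 128   # hypothetical; not stated in the text

params = (
    rng.normal(scale=512 ** -0.5, size=(512, hidden_dim)),
    np.zeros(hidden_dim),
    rng.normal(scale=hidden_dim ** -0.5, size=(hidden_dim, n_classes)),
    np.zeros(n_classes),
)

x_classify = rng.normal(size=512)  # placeholder joint vector
y = decode(x_classify, params)     # class scores
probs = softmax(y)                 # class probabilities
predicted_class = int(probs.argmax())
```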
4 Results
4.1 Text modality
The text dataset was trained using three BERT models for a comparative study: BERT, RoBERTa, and BERTweet.
BERT [22], short for Bidirectional Encoder Representations from Transformers, has revolutionized natural language understanding by capturing contextual information from both preceding and subsequent words in a sentence. BERT can evaluate tweets containing information on disasters by taking into account the surrounding context, which helps it understand the nature, location, impact, and severity of the event. Because it is pre-trained on a large text corpus, BERT can pick up the subtle linguistic cues that are characteristic of disaster-related content once it is fine-tuned on such data. This allows BERT to recognise terms, expressions, and background information that denote crises or disasters.
BERT only has to be fine-tuned for disaster assessment, usually by adding an output layer that is customized for the particular task at hand. Because it can rapidly and efficiently adjust its learned representations to the subtleties of disaster-related language without requiring significant task-specific architectural changes, BERT is an excellent choice for analysing tweets linked to catastrophes.
RoBERTa [23], building upon BERT’s foundation, refines the pre-training procedure to better capture language context. These optimizations make it more effective at understanding the subtle language used in tweets. The refinements in RoBERTa include removing the Next Sentence Prediction (NSP) task, introducing dynamic masking, and using larger batch sizes, which together offer better contextual understanding in applications such as disaster assessment.
BERTweet [24] shares BERT's architecture but is pre-trained on English tweets using the RoBERTa pre-training procedure, so that it learns tweet-specific features and linguistic nuances. In addition, BERTweet, unlike regular BERT or RoBERTa, includes tweet-specific pre-processing stages such as an alternate mechanism for handling URLs, hashtags, and mentions.
The models were trained with a batch size of 16 for 10 epochs with early stopping and a learning rate of \(10^{-6}\). Table 6 presents the accuracy, precision, recall, and F1 score of the models on the test data. From Table 6, we can conclude that RoBERTa performs much better than BERT and BERTweet on the informativeness and damage severity tasks, while BERTweet performs better on the humanitarian task.
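Early stopping can be illustrated with a short helper. The monitored quantity (validation loss), the `patience` value, and the `min_delta` threshold are assumptions for illustration; the text does not state the criterion that was used.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs. If that never
    happens, all epochs are run.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, stale = loss, 0   # improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses for a 10-epoch budget.
losses = [0.90, 0.70, 0.60, 0.65, 0.66, 0.50, 0.49, 0.48, 0.47, 0.46]
stop_at = early_stopping_epoch(losses)
```

With these made-up losses, training stops after epoch 5, even though a lower loss appears later. This is the usual trade-off when choosing a small patience.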
4.2 Image modality
To perform a comparative study, the image dataset was trained using two CNN models, ResNet-50 and DenseNet-121, and Vision Transformer [35].
ResNet-50 [26], a variant of the Residual Neural Network (ResNet) architecture, is designed to tackle the challenge of training very deep networks. It acts as a feature extractor, identifying important visual features in disaster images. By traversing its layers, ResNet-50 extracts hierarchical features ranging from low-level details like edges to high-level concepts such as objects and scenes. This feature extraction is beneficial for tasks like classifying or detecting disaster-related elements. Pretraining ResNet-50 on large-scale image datasets like ImageNet allows it to learn generic visual representations. Fine-tuning on a smaller set of disaster images further customizes its learned features to better suit disaster scenes. ResNet-50 excels in object detection and localization tasks, enabling the identification and classification of damaged buildings, infrastructure, or vehicles within disaster imagery. Additionally, ResNet-50 aids in scene understanding by categorizing images into different types or severity levels of disasters. This categorization facilitates the development of efficient response strategies.
DenseNet-121 [27], a variant of the Densely Connected Convolutional Network (DenseNet) family, is also analysed in this study to test its efficacy in capturing vital disaster-related image features. DenseNet-121 employs dense connectivity, where each layer receives input not just from its immediate predecessor but from all preceding layers. This dense inter-layer connection fosters enhanced feature reuse and more efficient information flow throughout the network. Unlike ResNet-50, which utilizes skip connections for layer interactions, DenseNet-121’s connectivity pattern promotes a more direct propagation of information, potentially mitigating challenges related to vanishing gradients and facilitating effective learning, especially in deeper architectures.
In the domain of disaster assessment through image analysis, Vision Transformers (ViTs) [25] present a novel approach distinct from traditional convolutional neural network (CNN) architectures like ResNet-50 and DenseNet-121. ViTs leverage the transformer architecture, originally designed for natural language processing tasks, to directly process image data as sequences of tokens. This fundamentally differs from CNNs, which process images through a series of convolutional and pooling layers. One significant difference between ViTs and CNNs lies in their attention mechanisms. ViTs utilize self-attention mechanisms to capture global dependencies between different parts of the image, allowing them to consider long-range interactions and contextual relationships. In contrast, CNNs like ResNet-50 and DenseNet-121 primarily rely on local receptive fields and hierarchical feature extraction, which may limit their ability to capture global context efficiently.
The models were trained with a batch size of 10 for 10 epochs with early stopping. The learning rate was optimized to minimize loss and gradient. Table 7 presents the accuracy, precision, recall, and F1 score of the models on the test data. From Table 7, we can conclude that Vision Transformer performs much better than ResNet and DenseNet at extracting relevant information from the images for disaster assessment.
4.3 Multimodal fusion
The model was built using the architecture proposed in Section 3. The model was trained with a batch size of 16 for 10 epochs with early stopping. Cross Entropy Loss was chosen as the loss function along with Adam optimizer [36] with a learning rate of \(10^{-5}\). To perform a comparative study of the performance of the proposed model, the accuracy, precision, recall, and F1 score of the proposed model, score fusion, and other best-performing models are tabulated on the test data in Table 8.
In score fusion, the probability outputs of the text-only and image-only classifiers are combined to get the final predictions. The score fusion technique is being used here as a baseline for comparing the performance of the proposed multimodal model.
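A minimal sketch of score fusion is shown below. The text only says that the probability outputs are combined; the weighted average used here, with equal weights by default, is one common choice and is an assumption, as is the parameter name `w_text`.

```python
import numpy as np


def score_fusion(p_text, p_image, w_text=0.5):
    """Late fusion: weighted average of per-class probabilities, then argmax."""
    p_text, p_image = np.asarray(p_text), np.asarray(p_image)
    fused = w_text * p_text + (1.0 - w_text) * p_image
    return fused, fused.argmax(axis=-1)


# Two samples, two classes (e.g. non-informative vs. informative); made-up values.
p_text = [[0.6, 0.4], [0.3, 0.7]]
p_image = [[0.2, 0.8], [0.4, 0.6]]
fused, preds = score_fusion(p_text, p_image)
```

In the first sample the two unimodal classifiers disagree, and the more confident image classifier decides the fused prediction.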
The proposed approach is the best-performing model for Task 1(a), Informativeness, achieving a higher accuracy (91.53%) than the previously best-performing models.
5 Discussion
5.1 Summary of results
This section compares the performance of the best-performing text and image models and the proposed multimodal fusion model for each task. The evaluation metrics include Accuracy, Precision, Recall, F1 score, Area Under the Receiver Operating Characteristic curve (AUC-ROC), Area Under the Precision-Recall curve (AUC-PR), Logarithmic loss, Matthews Correlation Coefficient (MCC), Specificity, Cohen’s Kappa, Balanced accuracy, Youden’s J statistic, and Positive and Negative likelihood ratios.
Tables 9, 10, 11, and 12 show the performance of the text-only classifier, image-only classifier, and the proposed multimodal model on the various tasks. Figures 4, 6, 8, and 10 show the confusion matrices, while Figs. 5, 7, 9, and 11 plot the Precision-Recall curves of the models on the various tasks.
In general, the analysis of the text-only classifier reveals promising results across multiple evaluation metrics, with the performance of the image-only classifier being comparable. The text-only classifier performs better on the informativeness and humanitarian tasks, suggesting that such information can be better inferred from language. The image-only classifier, in contrast, performs better on the damage severity task, highlighting its advantage in identifying the severity of damage from images rather than textual data. However, the proposed multimodal model shows superior performance compared to both text-only and image-only classifiers.
Notably, the multimodal approach achieves higher accuracy, precision, recall, and F1 score, indicating its effectiveness in integrating textual and visual information for improved disaster assessment. AUC-ROC and AUC-PR metrics demonstrate improved discrimination between the class labels, reflecting the model’s ability to leverage complementary features from both modalities. Logarithmic loss suggests reduced predictive uncertainty compared to individual modalities, while MCC signifies substantial agreement between predicted and actual outcomes. Specificity, Cohen’s Kappa, balanced accuracy, and Youden’s J statistic highlight the multimodal model’s superior performance in correctly identifying non-disaster instances and distinguishing true negatives. Positive and negative likelihood ratios further underscore the model’s capacity to differentiate between positive and negative class labels with heightened certainty and reliability, affirming the efficacy of multimodal deep learning in disaster assessment.
5.2 Discussion
The proposed multimodal fusion technique is a novel approach that leverages both cross-modal and self-attention mechanisms. This innovation goes beyond previous approaches, which solely relied on cross-modal attention, to achieve superior information extraction at the intermediate fusion level. This section highlights the justification for employing the proposed key components in constructing the architecture. The justification is as follows:
1. Capturing Cross-Modal Dependencies: By using cross-modal attention, the model can efficiently incorporate input from both text and visual modalities. In disaster assessment circumstances, access to both visual (image) and contextual (textual) information is essential for a complete understanding of the event. By employing cross-modal attention, the model can capture the interdependencies between language and visual signals. This is achieved by allowing the model to focus on relevant portions of the image while also taking into consideration the textual descriptions or captions that correspond with those areas, and vice versa.
2. Self-Attentive Mechanism: Long-range dependencies can be captured within a single modality with the help of the self-attentive mechanism, also referred to as intra-modal attention. Through the application of self-attention in the combined text and image domains, the model can capture the inherent relationships and connections within each modality.
3. Intermediary Feature Fusion: Cross-modal attention, combined with the self-attentive mechanism based on scaled dot-product attention, serves as an intermediary feature fusion method. It allows the model to combine information from the image and text domains at a finer level, identifying connections and correlations between them. For instance, while an image may show debris and destruction, accompanying text might provide insights into the specific location or type of disaster. This kind of fusion enhances contextual understanding of the situation by mitigating the limitations that arise from ambiguous or incomplete information in any individual modality.
The results demonstrate that the proposed approach is particularly effective in capturing detailed dependencies within and between different modalities, enabling it to make more informed decisions, especially in disaster assessment tasks.
The proposed intermediate fusion provides a more contextually rich and nuanced representation than early and late fusion. Early fusion integrates modalities at the input level, potentially leading to information loss or oversimplification of the multimodal data. Late fusion, on the other hand, merges modalities at a higher level of representation, perhaps missing subtle interactions between them. In essence, intermediate fusion improves the interpretability and transparency of the model’s decision-making process by providing a greater understanding of how different modalities affect the model’s predictions. The proposed framework reflects the complicated interactions and dependencies across modalities required for tasks like damage detection and situational awareness. This thorough integration of multimodal data allows the model to make more accurate and informed judgments during calamities [37, 38].
5.3 Significance of the work
In this era of social media, applying AI to disaster assessment on social media contributes substantially to a more holistic understanding of the situation. For example, by analyzing both textual information and visual content, analysts can pinpoint areas with high levels of damage or where help is urgently needed. These approaches offer a thorough data analysis by leveraging information from multiple sources, thereby enhancing the model’s performance even when dealing with incomplete data, which is often encountered in online social networks (OSNs). Most significantly, by cross-referencing information offered by multiple modalities, these multimodal fusion methodologies provide substantial information to disaster rescue teams, enabling timely assistance.
6 Conclusion
In this highly interconnected world, harnessing social networks to advance real-time crisis management operations is a breakthrough. The use of multimodal systems has proven to be quite useful in assessing disaster intensity. In the course of our research, we demonstrated that the proposed intermediate fusion approach outperforms early and late fusion methods when using multimodal learning models for tasks such as damage detection, resource allocation, and situational awareness during a crisis. By combining textual and visual data at a granular level through cross-modal attention, self-attentive mechanisms, and intermediary feature fusion, the model synthesizes a holistic understanding of catastrophic occurrences.
The proposed research holds significant implications by offering a deeper comprehension of disaster scenarios, thereby substantially improving disaster response and recovery endeavors. This enhanced understanding enables responders to make better-informed decisions, resulting in more efficient allocation of resources and coordinated efforts. Beyond disaster assessment, the proposed method holds promise for possible extension in diverse domains such as medical diagnosis, autonomous systems, and multimedia content analysis. In the field of medical diagnostics, utilizing data from many sources like scans, textual data, and patient records has tremendous potential for improving knowledge and aiding the early identification of diseases. Similarly, the proposed approach can be applied to improve decision-making in autonomous systems such as robots and surveillance. Furthermore, in multimedia content analysis, the proposed method can be extended to improve the relevance and accuracy of results for tasks such as content-based retrieval and cross-modal information retrieval.
6.1 Study challenges, limitations and future work
During the research, various challenges related to model training and dataset limitations were encountered. Transformer-based architectures used in the study, such as RoBERTa and Vision Transformer, showed a tendency to overfit. Additionally, the deep learning classifiers implemented were prone to catastrophic forgetting during knowledge transfer. To address these issues, a range of learning rates, regularization and other hyperparameters were explored to find the optimal configuration. Self-attention mechanisms were employed to mitigate the problems associated with incremental learning. Metrics such as validation loss were used to measure the performance of the classifiers [39]. The damage severity task heavily relies on visual information. Additionally, the data exhibits class imbalance, with fewer samples in the low and mid-severity classes compared to severe damage. Even in Task 2, the number of suitable message pairs containing both tweets and images was not found in equal proportions across all categories. These factors particularly led to poor performance of the methodology in Task 3 relative to other tasks. This limitation can be tackled in future research by implementing additional preprocessing techniques. These may include strategies like data augmentation, applying SMOTE for resampling, assigning appropriate class weights, generating balanced batches, and creating synthetic data using GANs.
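Of the remedies listed above, class weighting is the simplest to illustrate. The sketch below uses the common inverse-frequency heuristic \(w_c = N / (K \cdot n_c)\), where \(N\) is the number of samples, \(K\) the number of classes, and \(n_c\) the count of class \(c\). The heuristic and the label counts are assumptions for illustration, not the configuration or data of this study.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Made-up, imbalanced damage severity labels.
labels = ["severe"] * 6 + ["mild"] * 3 + ["little_or_none"] * 1
weights = balanced_class_weights(labels)
```

Rare classes receive larger weights, so their misclassifications contribute more to a weighted loss. Weighted by class count, the weights sum to the number of samples, which keeps the overall loss scale unchanged.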
Research that relies on data from online social networks is subject to considerable bias because of differences in data representation, noise, and a lack of ground-truth labels. Furthermore, the abundance of misinformation in news-related data on social media poses a considerable difficulty. An additional challenge for the analysis is the large amount of inconsistent data across different social network components. One significant shortcoming of the proposed study is that it does not handle cases in which the textual and visual information convey conflicting facts about a particular event. The study also does not address the verification of misinformation conveyed through the various modalities, if any.
Analyzing discriminative features or the correlation of characteristics across modalities could address the aforementioned limitations. Techniques such as information relevance [40] can be explored to enhance model learning. To capture the complementary roles of visual and textual content, studies could evaluate image-text relations and confirm these relations using appropriate metrics [41]. Additionally, AI-assisted labeling can be used to improve annotation quality, particularly when labels are inconsistent. Techniques such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) can be employed to inspect inconsistencies in the data while preserving its local and global structure.
Future research in this direction can include methods to improve the generalizability of the model. It might also add interpretability to the deep learning models or adopt explainable AI approaches [42] to provide explanations or reasons for the models' judgments. Additionally, future research could explore methods for verifying the accuracy of information and for incorporating privacy metrics, such as ensuring the anonymity of users involved in accidents during disasters.
In future work, more intricate multimodal fusion methods, such as hierarchical structures, hybrid architectures, and improved attention mechanisms [43,44,45] that mainly improve image topic representations, can be investigated to improve the accuracy of disaster assessments while incorporating textual and visual elements more effectively. Additionally, weakly supervised and semi-supervised learning strategies can be adopted to take advantage of the wealth of unlabeled social media data and to label it, thereby reducing the dependence on manually labeled datasets. Furthermore, including additional dimensions such as temporal and spatial information could offer valuable insights for improving disaster severity assessment.
Data and Code Availability
The data supporting the conclusions of this study can be accessed at https://crisisnlp.qcri.org/crisismmd. The code can be obtained from the corresponding authors upon a reasonable request.
References
Kumar A, Sangwan SR, Nayyar A (2020) Multimedia social big data: mining. In: Multimedia Big Data Computing for IoT Applications: Concepts, Paradigms and Solutions, pp 289–321
Cai Q, Wang H, Li Z, Liu X (2019) A survey on multimodal data-driven smart healthcare systems: approaches and applications. IEEE Access 7:133583–133599
Layek AK, Chatterjee A, Chatterjee D, Biswas S (2020) Detection and classification of earthquake images from online social media. In: Computational Intelligence in Pattern Recognition: Proceedings of CIPR 2019, pp 345–355. Springer
Abavisani M, Wu L, Hu S, Tetreault J, Jaimes A (2020) Multimodal categorization of crisis events in social media. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14679–14689
Sirbu I, Sosea T, Caragea C, Caragea D, Rebedea T (2022) Multimodal semi-supervised learning for disaster tweet classification. In: Proceedings of the 29th international conference on computational linguistics, pp 2711–2723
Zou Z, Gan H, Huang Q, Cai T, Cao K (2021) Disaster image classification by fusing multimodal social media data. ISPRS Int J Geo-Inf 10(10):636
Aamir M, Ali T, Irfan M, Shaf A, Azam MZ, Glowacz A, Brumercik F, Glowacz W, Alqhtani S, Rahman S (2021) Natural disasters intensity analysis and classification based on multispectral images using multi-layered deep convolutional neural network. Sensors 21(8):2648
Zhang M, Huang Q, Liu H (2022) A multimodal data analysis approach to social media during natural disasters. Sustainability 14(9):5536
Ofli F, Alam F, Imran M (2020) Analysis of social media data using multimodal deep learning for disaster response. arXiv preprint arXiv:2004.11838
Belcastro L, Marozzo F, Talia D, Trunfio P, Branda F, Palpanas T, Imran M (2021) Using social media for sub-event detection during disasters. J Big Data 8(1):1–22
Ponce-López V, Spataru C (2022) Social media data analysis framework for disaster response. Discov Artif Intell 2(1):10
Asif A, Khatoon S, Hasan MM, Alshamari MA, Abdou S, Elsayed KM, Rashwan M (2021) Automatic analysis of social media images to identify disaster type and infer appropriate emergency response. J Big Data 8(1):83
Ochoa KS, Comes T (2021) A machine learning approach for rapid disaster response based on multi-modal data. The case of housing & shelter needs. arXiv preprint arXiv:2108.00887
Yang L, Cervone G (2019) Analysis of remote sensing imagery for disaster assessment using deep learning: a case study of flooding event. Soft Comput 23(24):13393–13408
Khalaf M, Alaskar H, Hussain AJ, Baker T, Maamar Z, Buyya R, Liatsis P, Khan W, Tawfik H, Al-Jumeily D (2020) IoT-enabled flood severity prediction via ensemble machine learning models. IEEE Access 8:70375–70386
Jena R, Pradhan B, Beydoun G, Alamri AM, Sofyan H et al (2020) Earthquake hazard and risk assessment using machine learning approaches at Palu, Indonesia. Sci Total Environ 749:141582
Alam F, Ofli F, Imran M (2018) CrisisMMD: multimodal Twitter datasets from natural disasters. In: Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM)
Agarwal M, Leekha M, Sawhney R, Shah RR (2020) Crisis-DIAS: towards multimodal damage analysis - deployment, challenges and assessment. Proceedings of the AAAI Conference on Artificial Intelligence 34(01):346–353. https://doi.org/10.1609/aaai.v34i01.5369
Alam F, Ofli F, Imran M (2018) Processing social media images by combining human and machine computing during crises. Int J Hum Comput Interact 34(4):311–327. https://doi.org/10.1080/10447318.2018.1427831
Schuster M, Nakajima K (2012) Japanese and Korean voice search. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 5149–5152. IEEE
Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach
Nguyen DQ, Vu T, Nguyen AT (2020) BERTweet: a pre-trained language model for English Tweets
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: transformers for image recognition at scale
He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition
Huang G, Liu Z, Maaten L, Weinberger KQ (2018) Densely connected convolutional networks
Kiela D, Bhooshan S, Firooz H, Perez E, Testuggine D (2020) Supervised multimodal bitransformers for classifying images and text
Li LH, Yatskar M, Yin D, Hsieh C-J, Chang K-W (2019) VisualBERT: a simple and performant baseline for vision and language
Lu J, Batra D, Parikh D, Lee S (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2020) VL-BERT: pre-training of generic visual-linguistic representations
Xi C, Lu G, Yan J (2020) Multimodal sentiment analysis based on multi-head attention mechanism. In: Proceedings of the 4th international conference on machine learning and soft computing, pp 34–39
Hazarika D, Zimmermann R, Poria S (2020) MISA: modality-invariant and -specific representations for multimodal sentiment analysis
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need
Gaurav Bhardwaj S, Agarwal R (2023) Two-tier feature extraction with metaheuristics-based automated forensic speaker verification model. Electronics 12(10):2342
Kingma DP, Ba J (2017) Adam: a method for stochastic optimization
Gaw N, Yousefi S, Gahrooei MR (2022) Multimodal data fusion for systems improvement: a review. Handbook of scholarly publications from the Air Force Institute of Technology (AFIT) 1(2000–2020):101–136
Radhika K, Oruganti VRM (2021) Deep multimodal fusion for subject-independent stress detection. In: 2021 11th International conference on cloud computing, data science & engineering (confluence), pp 105–109. IEEE
Malik MSI, Younas MZ, Jamjoom MM, Ignatov DI (2024) Categorization of tweets for damages: infrastructure and human damage assessment using fine-tuned BERT model. PeerJ Comput Sci 10:e1859
Chen D, Su W, Wu P, Hua B (2023) Joint multimodal sentiment analysis based on information relevance. Inf Process Manage 60(2):103193. https://doi.org/10.1016/j.ipm.2022.103193
Otto C, Springstein M, Anand A, Ewerth R (2020) Characterization and classification of semantic image-text relations. Int J Multimed Inf Retr 9(1):31–45. https://doi.org/10.1007/s13735-019-00187-6
Saranya A, Subhashini R (2023) A systematic review of explainable artificial intelligence models and applications: recent developments and future trends. Decis Anal J 7:100230. https://doi.org/10.1016/j.dajour.2023.100230
Shi L, Luo J, Cheng G, Liu X, Xie G (2021) A multifeature complementary attention mechanism for image topic representation in social networks. Sci Program 2021:5304321. https://doi.org/10.1155/2021/5304321
Wang H, Guo P, Zhou P, Xie L (2024) MLCA-AVSR: multi-layer cross attention fusion based audio-visual speech recognition. arXiv:2401.03424
Luo Y, Guo X, Dong M, Yu J (2023) Learning modality complementary features with mixed attention mechanism for RGB-T tracking. Sensors 23(14):6609. https://doi.org/10.3390/s23146609
Funding
Open access funding provided by Manipal Academy of Higher Education, Manipal. No funds, grants, or other support was received for conducting this study.
Author information
Contributions
Dr. Nisha P Shetty and Mrs. Jayashree Shetty conceived and designed the study. Mr. Yash Bijalwan and Mr. Pranav Chaudhari conducted the data analysis and coding. Dr. Balachandra Muniyal provided supervision for the research. The initial draft of the manuscript was penned by Dr. Nisha P Shetty, with input from all authors on earlier drafts. All authors reviewed and endorsed the final version of the manuscript.
Ethics declarations
Competing Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Shetty, N.P., Bijalwan, Y., Chaudhari, P. et al. Disaster assessment from social media using multimodal deep learning. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19818-0
DOI: https://doi.org/10.1007/s11042-024-19818-0