1 Introduction

Meteorologists have always faced the challenge of predicting extreme precipitation events (Lazo et al. 2009). According to recent surveys of the public’s use of weather forecasts, the most widely used element of standard forecasts is precipitation prediction (e.g., where, when, and how much rain will fall) (Lazo et al. 2009). End users in fields such as transportation, water resources, flood control, and emergency management require reliable precipitation forecasts. In the U.S. Weather Research Program’s (USWRP) community planning report, experts in hydrology, transportation, and emergency management discussed how quantitative precipitation forecasts (QPF) relate to their specific communities (Ralph 2005). On average, users receive weather information four to five times daily through mobile phone apps, TV, newspapers, tweets, etc. (Purwandari et al. 2021). When assessing climate data, a discrete–continuous option problem can arise because of the dynamic nature of precipitation and the variety of physical forms involved, making rainfall forecasts challenging (Purwandari et al. 2021). Determining QPF at short, mid, and long range involves different levels of difficulty, and QPF is usually delivered with varying uncertainty (Purwandari et al. 2021).

Meteorologists have always tried to assimilate as many observations as possible to enhance their forecast skill, drawing on satellite, airborne, and ground-based sources. The more reliable observations, even qualitative ones, are examined, processed, and eventually assimilated into models, the better the forecast. The increasing use of the Internet, particularly social media streams, has created a wealth of information that holds the potential to improve weather models. However, much of the data available on the Internet is unstructured: it lacks the identifiable tabular organization required by traditional data analysis methods, which diminishes its potential (Gandomi and Haider 2015). Although unstructured data, such as Web pages, emails, and mobile phone records, may contain numerical and quantitative information (e.g., dates), they are usually text-heavy. Unlike numbers, textual data are inherently imprecise and ambiguous; according to Britton (1978), at least 32% of the words used in English text are lexically ambiguous. Because textual data are often unstructured, researchers have found it difficult to use them to enhance meteorological models. Nevertheless, the sheer volume of textual data provides new opportunities for urban researchers to investigate people’s perceptions, attitudes, and behaviors, which helps them better understand the impact of natural hazards. Jang and Kim (2019) demonstrated that crowd-sourced text data gathered from social media can effectively represent the collective identity of urban spaces. Conventional methods of collecting data, such as surveys, focus groups, and interviews, are often time-consuming and expensive. Raw text data without predetermined purposes can be compelling if used wisely and can complement purposefully designed data collection strategies.

Machines can analyze and comprehend human language thanks to a process known as natural language processing (NLP). It is at the heart of technologies we use daily, including search engines, chatbots, spam filters, grammar checkers, voice assistants, and social media monitoring tools (Chowdhary 2020). By applying NLP, it is possible to better grasp the syntax, semantics, pragmatics, and morphology of human language. Computer science then uses this language understanding to create rule-based and machine learning algorithms that can solve particular problems and carry out specific tasks (Chowdhary 2020). NLP has demonstrated tremendous capabilities in harvesting the abundance of textual data available. Hirschberg and Manning (2015) define it as a form of artificial intelligence, akin to deep learning and machine learning, that uses computational algorithms to learn, understand, and produce human language content. Basic NLP procedures involve processing text data, converting text into features, and identifying semantic relationships (Ghosh and Gunning 2019). In addition to structuring large volumes of unstructured data, NLP can improve the accuracy of text processing and analysis because it applies rules and criteria consistently. A wide range of fields has been shown to benefit from NLP. Guetterman et al. (2018) conducted an experiment comparing the results of traditional text analysis with those of an NLP analysis and found that NLP could identify the major themes manually summarized by conventional text analysis. Syntactic and semantic analysis is frequently employed in NLP to break human discourse into machine-readable segments (Chowdhary 2020). Syntactic analysis, commonly called parsing or syntax analysis, detects a text’s syntactic structure and the dependencies between words, as shown in a parse tree diagram (Chowdhary 2020). Semantic analysis aims to determine what the language means or, put another way, to extract the exact or dictionary meaning from the text. However, semantics is regarded as one of the most challenging domains in NLP because of language’s polysemy and ambiguity (Chowdhary 2020). Semantic tasks examine sentence structure, word interactions, and related ideas to grasp the topic of a text and the meaning of words. One of the main reasons for NLP’s complexity is the ambiguity of human language. For instance, sarcasm is problematic for NLP (Suhaimin et al. 2017). It would be challenging to teach a machine to understand the irony in the statement “I was excited about the weekend, but then my neighbor rained on my parade,” or the idiom “it is raining cats and dogs,” yet humans grasp them instantly. Researchers have worked on instructing NLP systems to look beyond word meanings and word order to thoroughly understand context, word ambiguities, and other intricacies of communication. However, they must also consider factors such as culture, background, and gender while adjusting natural language processing models; idioms, including those related to weather, can vary significantly from one nation to the next.
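To make these building blocks concrete, the short sketch below uses the open-source spaCy library to tokenize a weather-related sentence, tag parts of speech, and expose the dependency relations that a parse tree encodes. It is purely illustrative and not drawn from any of the reviewed studies; it assumes the en_core_web_sm model has already been downloaded (python -m spacy download en_core_web_sm).

```python
# Minimal sketch of syntactic analysis with spaCy (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Heavy rain flooded the streets of Houston on Monday.")

# Tokenization, part-of-speech tagging, and dependency parsing
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities recovered by the statistical model
for ent in doc.ents:
    print(ent.text, ent.label_)
```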

NLP can help address the data challenges of extreme weather events by analyzing large amounts of weather data for patterns and trends (Kahle et al. 2022). These data can support more accurate forecasts and early warning systems for extreme weather events (Kitazawa and Hale 2021; Rossi et al. 2018; Vayansky et al. 2019; Zhou et al. 2022). NLP can also monitor social media for information on extreme weather events, allowing the detection of local events that may not be reported through official channels (Kitazawa and Hale 2021; Zhou et al. 2022). Additionally, NLP can be used to build automated chatbots that provide information to those affected by extreme weather events, such as directions to shelters, medical assistance, and other resources.

This article comprehensively reviews how researchers have used NLP in extreme weather event assessment. To our knowledge, the present study is the first attempt to synthesize the opportunities and challenges of adopting natural language processing for extreme event assessment research. The methodology section details the approach, the selection method, and the search terms used for article selection. The search results are then summarized and categorized. Next, the role of NLP and its challenges in supporting extreme event assessment are discussed. Finally, the limitations of this literature study are listed.

2 Methodology

First, the protocol registration and information sources were determined. This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Page et al. 2021). The protocol (Tounsi 2022) was registered with the Open Science Framework on June 6, 2022. We searched peer-reviewed publication databases to identify articles within this systematic literature review’s scope and eligibility criteria.

Then, the search strategy was defined. Our search terms were developed systematically to ensure that all related and eligible papers in the databases were captured. A preliminary literature review produced an initial set of keywords, which were then refined based on feedback from content experts and a librarian. We also adopted a collaborative search strategy to ensure that all papers on the use of NLP for the assessment of extreme weather events were captured. We searched four databases: IEEE Xplore, Web of Science, ScienceDirect, and Scopus.

We grouped the query keywords to identify relevant studies meeting our scope and inclusion criteria, combining them with AND/OR operators. For example, we paired broad terms such as “NLP OR Natural Language Processing” with narrower terms such as “Precipitation OR Rainfall.” Figure 1 shows all the combinations of search terms used in the keyword search.

Fig. 1 Conceptual framework of the search terms used to query the databases for the literature review
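For illustration, the sketch below shows how such grouped terms translate into a boolean query string; the term lists are abbreviated examples, not the full set shown in Fig. 1.

```python
# Illustrative construction of a boolean database query from grouped
# search terms (abbreviated example lists, not the full Fig. 1 set).
nlp_terms = ["NLP", "Natural Language Processing", "text mining"]
event_terms = ["precipitation", "rainfall", "flood", "hurricane"]

query = "({}) AND ({})".format(
    " OR ".join(f'"{t}"' for t in nlp_terms),
    " OR ".join(f'"{t}"' for t in event_terms),
)
print(query)
# ("NLP" OR "Natural Language Processing" OR "text mining")
#   AND ("precipitation" OR "rainfall" OR "flood" OR "hurricane")
```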

This study focused on peer-reviewed publications satisfying two primary conditions: (a) applying at least one natural language processing technique to solve the problem stated in the study and reporting the results of the model used, not merely suggesting it, and (b) reporting results related to the assessment of precipitation-related extreme weather events. Papers that did not meet these conditions were excluded. For example, studies that focused only on developing numerically based deep learning models were excluded, as was secondary research such as reviews, commentaries, and conceptual articles. The search was limited to English-language papers published up to June 2022.

Two authors screened the publications in parallel and decided whether to include each of them using consensus methods. First, we screened the publications by title and abstract and removed duplicates. We then finalized the selection by reading the full text of the remaining papers. To minimize selection bias, all discrepancies were resolved by discussion requiring consensus from both reviewers. For each paper, standardized information was recorded using a data abstraction form.

3 Results

3.1 Search results

The flowchart of the article selection procedure for this systematic literature review is shown in Fig. 2. A total of 1225 documents were found in the initial search using the set of queries. We used EndNote to manage filtering and duplicate removal, eliminating duplicates and all review, opinion, and perspective papers. Two authors then conducted a second filtering by reading titles and abstracts (n = 846). Three hundred sixteen (316) documents remained after screening against the inclusion criteria for full-text examination. An additional 281 articles outside the scope of the study were removed after reading the full text. For example, some studies suggested NLP as a solution but did not implement it; such studies could not be included because NLP was not demonstrated as an effective solution to the problem. Other examples include studies that examined the consequences of different extreme weather events (including rain-related ones) on other fields, such as construction or maritime transportation.

Fig. 2 PRISMA flow diagram of the search and selection process

Consequently, the final number of studies considered for the systematic review is 35, with consensus from both authors. We used a systematic approach to extract the information listed in Table 1 from each eligible article. Overall, 26 were journal articles and nine were conference proceedings (Table 2).

Table 1 Summary of included literature
Table 2 Opportunities and challenges of the included works of literature

3.2 NLP areas of application

The authors of the selected literature have explored numerous NLP topics, as shown in Table 1. In summary, researchers have used NLP in four areas: (1) social influence and trend analysis, (2) event impact assessment and mapping, (3) event detection, and (4) disaster resilience. Social influence and trend analysis is the most dominant topic (37% of all literature) and includes crowdsourcing, sentiment analysis, topic modeling, and citizen engagement. Researchers have also used NLP to study the impact assessment and mapping of extreme events (25% of all literature), covering disaster response, event impact on infrastructure, and flood mapping. Event detection is another popular area of research (21% of all literature), in which authors used NLP to detect and monitor extreme weather events and design early warning systems. Lastly, researchers adopted NLP models for disaster resilience (17% of all literature).

3.3 Data

Multiple data sources have been used in the selected studies. Figure 3 shows the distribution of data sources used in the literature. They span social media (Twitter, Weibo, and Facebook), publications (newspapers and weather bulletins), the Disaster Risk Reduction Knowledge Service, numerical climate and gauge data, generic ontologies, public encyclopedias and services (the Chinese wiki and Google Maps), and posted photographs. Social media plays an essential role in data sourcing for the studies. For example, Twitter is used by 57.1% of the studies, which emphasizes both the role and the potential of this platform as a real-time and historical public data provider with features and functionality that support collecting more precise, complete, and unbiased datasets. Typically, researchers take the content of social media posts and evaluate it along with the geolocation data. Authors also used NLP to process other data sources such as weather bulletins, gauge data, online photographs, and newspapers. Data size varies from small datasets (dozens of photographs, hundreds of weather bulletins) to large ones (millions of tweets). It is important to note that when predictive modeling was the goal of a study, it was common practice to compare the accuracy of NLP findings against data from authoritative sources.

Fig. 3 Pie chart of the data sources used within the literature
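As a hypothetical illustration of how such Twitter datasets are assembled, the sketch below uses the tweepy client for the Twitter v2 recent-search endpoint. The bearer token is a placeholder credential, the query is an assumption, and access is subject to the rate limitations discussed later in this review.

```python
# Hypothetical sketch of collecting flood-related tweets with tweepy
# (Twitter API v2); "BEARER_TOKEN" is a placeholder, not a real key.
import tweepy

client = tweepy.Client(bearer_token="BEARER_TOKEN")
response = client.search_recent_tweets(
    query="(flood OR flooding) lang:en -is:retweet",  # assumed query
    tweet_fields=["created_at", "geo"],
    max_results=100,
)
for tweet in response.data or []:
    print(tweet.created_at, tweet.geo, tweet.text[:80])
```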

3.4 NLP tasks and models in the reviewed literature

All of the studies in this review applied at least one NLP task involving either syntactic or semantic analysis. As studies deal with extensive unstructured social media data, clustering was used to categorize tweets and understand what was being discussed on social platforms. Clustering has proven a highly effective machine learning approach for identifying structures in labeled and unlabeled datasets. The studies used K-means and graph-based clustering.
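A minimal sketch of this kind of tweet clustering, assuming TF-IDF features and scikit-learn’s K-means (the tweets and cluster count below are toy examples, not data from the reviewed studies):

```python
# Toy example: cluster short disaster-related tweets with TF-IDF + K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = [
    "Streets underwater near the river, avoid downtown",
    "Power outage on 5th avenue after the storm",
    "Shelter open at the community center tonight",
    "Flooding reported on the highway exit",
]

# Vectorize the tweets, then assign each to one of two clusters
X = TfidfVectorizer(stop_words="english").fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for tweet, label in zip(tweets, labels):
    print(label, tweet)
```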

In addition, pre-trained models (PTMs) for NLP are deep learning models trained on large datasets to perform specific NLP tasks. PTMs can learn universal language representations when trained on a significant corpus, which aids in solving downstream NLP tasks and avoids training new models from scratch. Several studies in the literature have used PTMs such as Bidirectional Encoder Representations from Transformers (BERT), spaCy NER, Stanford NLP, NeuroNER, FlauBERT, and CamemBERT. These models have been used for named entity recognition or classification.
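As an illustration of how such PTMs are applied, the sketch below runs named entity recognition through the Hugging Face pipeline API; the checkpoint name is an assumption, and any BERT-style NER checkpoint could be substituted.

```python
# Named entity recognition with a pre-trained transformer checkpoint
# (illustrative; checkpoint choice is an assumption).
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER",
               aggregation_strategy="simple")
text = "Hurricane Harvey flooded parts of Houston, Texas in 2017."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```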

Moreover, the automated extraction of the subjects discussed in vast amounts of text is another application addressed in the literature. Topic modeling is a statistical technique for discovering and separating these subjects in an unsupervised manner from a massive volume of documents. The authors used models such as Latent Dirichlet Allocation (LDA) and Correlation Explanation (CorEx) (Barker and Macleod 2019; Chen and Ji 2021; Karimiziarani et al. 2022; Xin et al. 2019; Zhou et al. 2021). Furthermore, other NLP subtasks such as tokenization, part-of-speech tagging, dependency parsing, and lemmatization and stemming have also been applied in several studies to deal with data-related problems.
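A minimal sketch of LDA topic modeling, here using scikit-learn’s implementation on a toy corpus (the documents and topic count are illustrative assumptions, not the reviewed studies’ data):

```python
# Toy LDA example: discover two topics in a tiny flood-related corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "flood water rising in the river district",
    "rescue teams evacuate residents by boat",
    "rain gauge recorded extreme precipitation overnight",
    "volunteers deliver supplies to the shelter",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top five terms per topic
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {', '.join(top)}")
```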

3.5 Study area

Studies utilizing NLP have been conducted in different countries. Most research concentrated on metropolitan areas like Beijing, China, and New York City, USA. This may be because these cities are heavily populated, which increases the probability of data availability (more social media users, for example): the USA leads all countries in the number of Twitter users, with more than 76.9 million as of January 2022, while 252 million Chinese users actively use Weibo daily. In addition, both cities lie in the path of frequent tropical cyclones and can be affected by several types of flooding, such as coastal flooding due to storm surges, pluvial flooding due to heavy precipitation over a short time, or fluvial flooding due to proximity to a river. For example, on July 21, 2012, a flash flood struck the city of Beijing for over twenty hours. As a result, 56,933 people were evacuated within a day of the flooding, which caused 79 fatalities, at least 10 billion yuan in damages, and destroyed at least 8,200 dwellings. Figure 4 shows the distribution of the studies by country. Some researchers compared data from various cities, and the analysis may be done at multiple scales, from a single town to a whole continent.

Fig. 4 Distribution of studies by country

3.6 Types of extreme weather events

Three types of extreme events were addressed in the surveyed literature: hurricanes and storms, typhoons, and flooding. Hurricanes, which form in the North Atlantic, the northeastern Pacific, the Caribbean Sea, or the Gulf of Mexico, can cause significant damage over wide, densely populated areas, which explains the focus on this type of extreme event and its presence on social media. More than 48% of the studies concentrated in one way or another on hurricanes. Hurricanes have hazardous impacts: storm surges and large waves produced by hurricanes pose the greatest threat to life and property along the coast (Rappaport et al. 2009). In emergencies, governments have to invest financial and human resources to support the affected areas and populations and to help disseminate updates and warnings (Vanderford et al. 2007).

Flooding is a challenging and complex phenomenon, as it may occur at different spatial and temporal scales (Istomina et al. 2005). Floods can result from intense rainfall, ocean surges, rapid snowmelt, or the failure of dams or levees (Istomina et al. 2005). The most dangerous floods are flash floods, which combine extreme speed with a flood’s devastating strength (Sene 2016). Because flooding is considered the deadliest type of severe weather, decision-makers must use every possible data source to confront it. Flooding was the second hazard addressed in the literature; floods vary in magnitude from a few inches of water to several feet and may come on quickly or build gradually. Nearly 42% of the reviewed literature covered flooding.

Three studies, representing 10% of the literature, covered typhoons, which develop in the northwestern Pacific and usually affect Asia. Typhoons can cause enormous casualties and economic losses, and governments and decision-makers have difficulty gathering data on typhoon crisis scenarios. At the same time, modern social media platforms like Twitter and Weibo offer access to almost real-time disaster-related information (Jiang et al. 2019).

4 Discussion

Hurricanes, storms, and floods are the extreme events most frequently addressed in the literature. The duration of weather hazards varies greatly, from a few hours for some powerful storms to years or decades for protracted droughts. The occurrence of weather hazards usually raises awareness and therefore triggers higher sensitivity to extreme events and a tendency among the public and Internet users to report them. Even short-lived catastrophes may leave a long-lasting mark in the public’s mind and remain referenced online as landmark events. Using ground-based sensors to understand the dynamics of weather hazards is often constrained by limited resources, leading to sparse networks and data scarcity. Thus, a better understanding and monitoring of these extreme events can only be ensured by expanding the dataset to other structured and unstructured data sources. In this regard, only through NLP can this kind of data be valorized and made available to weather modelers. This systematic review addressed a critical gap in the literature by exploring the applications of NLP in assessing extreme events.

4.1 Role of NLP in supporting extreme events assessment

4.1.1 Hurricanes

NLP can help decision-makers learn from structured and unstructured textual data from reliable sources and take preventive and corrective measures in emergencies. We found in this study that many different NLP models are used together with multiple data sources. For example, topic modeling was used by 12 studies to support hurricane- and storm-related decision-making, applying LDA and CorEx models to social media data (Facebook and Twitter). Vayansky et al. (2019) used sentiment analysis to measure changes in Twitter users’ emotions during natural disasters. Their work can help the authorities limit the damage from natural disasters, specifically hurricanes and storms, and, beyond corrective recovery measures, adjust future response efforts accordingly (Vayansky et al. 2019). Yuan et al. (2021) investigated the differences in sentiment polarity between racial/ethnic and gender groups, examining the themes of concern in their expressions, the popularity of these themes, and the sentiment toward them, in order to better understand the social aspects of disaster resilience through disparities in disaster response. Such findings can assist crisis response managers in identifying the most sensitive and vulnerable groups and in targeting the appropriate demographic groups with catastrophe evolution reports and relief resources. On another note, Yuan et al. (2020) looked at how often people posted on social media and used the yearly average sentiment as a baseline. LDA was used to determine the sentiment and weights for various subjects in public discourse. A better understanding of the public’s specific worries and panics as catastrophes progress may assist crisis response managers in creating and carrying out successful response methods. In addition to protecting people’s lives, NLP can be used to monitor post-disaster infrastructure conditions. Chen and Ji (2021) used the CorEx topic model to capture infrastructure condition-related topics by incorporating domain knowledge into the correlation explanation, and they examined spatiotemporal patterns of topic engagement levels to systematically sense infrastructure functioning, damage, and restoration conditions. To help practitioners maintain essential infrastructure after catastrophes, Chen and Ji (2021) offered a systematic situational assessment of the infrastructure. Additionally, the suggested method examined how people and infrastructure systems interact, advancing human-centered infrastructure management (Chen and Ji 2021).
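As an illustration of lexicon-based sentiment scoring of short disaster-related posts, the sketch below uses NLTK’s VADER analyzer; it is a generic example in the spirit of the studies above, not a reproduction of their methods.

```python
# Lexicon-based sentiment scoring of short disaster tweets with VADER
# (illustrative tweets; not data from the reviewed studies).
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

tweets = [
    "Power is back and the water is receding, huge thanks to the crews!",
    "Still trapped on the second floor, water keeps rising, please help",
]
for tweet in tweets:
    scores = analyzer.polarity_scores(tweet)
    print(scores["compound"], tweet)  # compound in [-1, 1]
```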

Other studies twinned topic modeling with other methods, such as clustering and named entity recognition (Barker and Macleod 2019; Fan et al. 2018; Shannag and Hammo 2019; Sit et al. 2019). This enabled the authors to develop more advanced analytical frameworks than a single-model pipeline. Sit et al. (2019) used an analytical framework for Twitter analysis that could recognize and classify tweets about disasters, identify impact locations and time frames, and determine the relative importance of each category of disaster-related information over time and geography. Throughout the disaster’s temporal course, their analysis revealed possible places with significant densities of affected people and infrastructure damage (Sit et al. 2019). Their approach has enormous potential for real-time damage and emergency information detection during a disaster and for making informed judgments by analyzing the circumstances in affected areas (Sit et al. 2019). Barker and Macleod (2019) created a prototype national-scale Twitter data mining pipeline for better stakeholder situational awareness during flooding occurrences across Great Britain. By automatically detecting tweets using Paragraph Vectors and a logistic regression-based classifier, the study can be implemented as a national-scale, real-time product that responds to requests for better crisis situational awareness. In another study, Fan et al. (2018) detected infrastructure-related topics in tweets posted during disasters, and their evolution during hurricanes, by integrating several NLP and statistical models such as LDA and K-means clustering (Fan et al. 2018). The study made it possible to trace the progression of conditions during various crisis phases and to summarize key points (Fan et al. 2018). The proposed framework’s analytics components can help decision-makers recognize infrastructure performance through text-based representation and provide evidence for measures that can be taken immediately (Fan et al. 2018).

Topic modeling is not the only NLP technique that can support decision-making; information extraction (IE) can as well. Many hurricane-related social media reactions are contained in natural language text, but using them effectively in this format is extremely difficult. IE extracts information from these unstructured textual sources, finds the relevant entities (words and tokens related to the topic of interest), and classifies and stores them in a database (Grishman 2015). In this review, four studies used information extraction-related models to extract valuable information from unstructured text (Devaraj et al. 2020; Chao Fan et al. 2020a, b; Zhou et al. 2022). Zhou et al. (2022) created VictimFinder models based on cutting-edge NLP algorithms, including BERT, to identify tweets asking for help rescuing people. The study presents a handy application promoting social media use for rescue operations in future disaster events: Web apps can be created to offer near-real-time rescue request locations that emergency responders and volunteers may use as a guide for dispatching assistance, and the best model can also be integrated into GIS tools (Zhou et al. 2022). On another subject, to analyze location-specific catastrophe circumstances, Chao Fan et al. (2020a, b) suggested an integrated framework to parse social media data and evaluate location-event-actor meta-networks. The study’s outcomes highlighted the potential of the proposed framework to enhance social sensing of crisis conditions and to prioritize relief and rescue operations based on the severity of the events and local requirements (Chao Fan et al. 2020a, b).

Devaraj et al. (2020) considered whether valuable information for first responders could be successfully extracted from public tweets during a hurricane. As the use of social media constantly increases, people now turn to platforms like Twitter to make urgent requests for help. The study shows that urgent requests posted on social media can be identified using machine learning models (Devaraj et al. 2020). As hurricanes develop, emergency services or other relevant relief parties might use these broad models to automatically identify pleas for assistance on social media in real time (Devaraj et al. 2020).
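A minimal sketch of such an urgency classifier, assuming a TF-IDF representation and logistic regression in scikit-learn (the labeled examples are toy assumptions; the cited studies used larger annotated corpora and, in some cases, transformer models):

```python
# Toy binary classifier for urgent help requests in tweets.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "please send a boat we are stranded on the roof",
    "need medical help for my grandmother water is rising",
    "stay safe everyone big storm tonight",
    "beautiful clouds before the hurricane arrives",
]
labels = [1, 1, 0, 0]  # 1 = urgent request, 0 = other

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)
print(clf.predict(["we are trapped please help", "what a windy day"]))
```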

4.1.2 Typhoons

When a disaster strikes, as has been demonstrated in numerous cases, citizens can quickly organize themselves and start sharing disaster information (Jiang et al. 2019). Three studies within the review scope dealt with typhoons and suggested both topic modeling and information extraction as approaches. Kitazawa and Hale (2021) studied how the general population reacts online to warnings of typhoons and heavy rains. The study suggests that such insights can assist authorities in creating more focused social media strategies that reach the public more rapidly and extensively than traditional communication channels, improve the circulation of information to the public at large, and gather more detailed disaster data (Kitazawa and Hale 2021). Lam et al. (2017) suggested a strategy for annotating tweets on typhoons when only a tiny annotated subset is available; governments and other concerned bodies can use such tweet classifiers to find additional information about a tragedy on social media (Lam et al. 2017). To evaluate the degree of harm suggested by social media texts, Yuan et al. (2021) presented a model that focuses on the in-depth interpretation of social media texts while requiring less manual labor in a semi-supervised setting. The damage extent map developed by the authors largely matches the actual damage recorded by authorities, demonstrating that the suggested approach can correctly estimate typhoon damage with minimal manual labor (Yuan et al. 2021).

4.1.3 Flooding

In 14 studies from this review’s literature, authors developed topic modeling-based frameworks and products that can help assess flood-related situations (Barker and Macleod 2019; Gründer-Fahrer et al. 2018; Rahmadan et al. 2020). Barker and Macleod (2019) presented a prototype social geodata machine learning pipeline that combined recent developments in word embedding NLP with real-time environmental data at the national level to identify flood-related tweets throughout Great Britain. By automatically detecting tweets using Paragraph Vectors and a logistic regression-based classifier, the study supports requests for better crisis situational awareness (Barker and Macleod 2019). The approach is an important finding, as it holds considerable potential for application to other countries and other emergencies (Barker and Macleod 2019). In their investigation of the thematic and temporal structure of German social media communication, Gründer-Fahrer et al. (2018) looked at the types of content shared on social media during the event, the evolution of topics over time, and the use of temporal clustering techniques to automatically identify the defining phases of communication. According to the study, social media material has significant potential for the factual, organizational, and psychological aspects of disasters and throughout all phases of the disaster management life cycle (Gründer-Fahrer et al. 2018). Within their methodological inquiry, the authors assert that topic model analysis, when paired with proper optimization approaches, showed great relevance for thematic and temporal social media analysis in crisis management (Gründer-Fahrer et al. 2018). Social media is heavily utilized for warnings and for disseminating current information about many factual components of an event (such as weather, water levels, and traffic hindrances). Social media may thus support situational awareness and the promptness of early warnings in the planning and response stages.

NLP changes how we view social media: from the perspective of crisis management, it becomes the ideal way to interact with the volunteer movement immediately and to shift it from its current stage of autonomous individual involvement to organized participation (Gründer-Fahrer et al. 2018). From another angle, Rahmadan et al. (2020) identified the subjects mentioned during a flood crisis by applying LDA topic modeling, together with a lexicon-based approach to examine the sentiment expressed by the public when floods strike. Related parties can use their work and the information it provides to design disaster management plans, map at-risk floodplains, assess the causes of floods, and monitor the effects after a flood catastrophe (Rahmadan et al. 2020).

Despite the proven adverse financial, economic, and humanitarian effects of floods, databases of flood risk reduction measures are sparse in detail, limited in scope, or privately owned by commercial entities. Given that the amount of Internet data is constantly increasing, several studies from this review highlighted the emergence of information extraction methods applied to unstructured text (Kahle et al. 2022; Lai et al. 2022). For example, to extract information from newspapers, Lai et al. (2022) used NLP to build a hybrid named entity recognition (NER) model that combines a domain-specific machine learning model, linguistic characteristics, and rule-based matching. The study’s methodology builds upon earlier comparable efforts by extending the geographical scope and extracting information from massive documents with little accuracy loss (Lai et al. 2022). The work offers new information that climate researchers may use to recognize and map flood patterns and assess the efficacy of current flood models (Lai et al. 2022). Zhang et al. (2021) used a BERT-BiLSTM-CRF model in a social sensing strategy for identifying flooding sites by separating waterlogged places from ordinary locations. The authors created a “City Portrait” from semantic data to depict the city’s functional regions (Zhang et al. 2021); the same approach would be very valuable for decision-makers seeking to validate regions with high flooding rates. In another study, Wang et al. (2020) used computer vision (CV) and NeuroNER methods to extract information from the visual and linguistic content of social media and create a deep learning-based framework for flood monitoring. The work can provide thorough situational awareness and create a passive hotline to coordinate rescue and search efforts (Wang et al. 2020).
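As a rough illustration of such a hybrid setup, the sketch below extends a statistical spaCy pipeline with a rule-based EntityRuler pattern; the label and pattern are assumptions for illustration, not those of Lai et al. (2022).

```python
# Hybrid NER sketch: statistical spaCy model plus a rule-based pattern.
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    # Hypothetical label and pattern for flood-event mentions
    {"label": "FLOOD_EVENT", "pattern": [{"LOWER": "flash"}, {"LOWER": "flood"}]},
])

doc = nlp("A flash flood hit Ellicott City, Maryland on May 27, 2018.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```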

4.2 Challenges of NLP in supporting extreme events assessment

NLP provides a fascinating new path with great potential to support a better understanding of natural hazards. Nevertheless, the technique brings uncertainties. Models, data, and hydrology applications are three possible sources of uncertainty in using NLP for extreme event assessment research. Several data-related challenges were reported in the reviewed literature. Several studies share the observation that social media data are either noisy (containing fake news, disinformation, automated messaging, adverts, and unrelated cross-event subjects) or hard to obtain due to rate limitations at the source (Twitter REST API, Twitter Streaming API) (C. Fan et al. 2020a, b; Xin et al. 2019; Zhou et al. 2021). Because they frequently contain higher degrees of slang, URLs, and linguistic variety compared with generic text or other social media posts, tweets are challenging to examine in a comparative context with other media (Alam et al. 2020; Xin et al. 2019).

Moreover, several studies that tried to work with geotagged tweets reported their scarcity; even when available, the GPS precision of crowd-sourced data is on the order of meters, while Twitter-based locations are only accurate to the level of street names (Wang et al. 2018). Despite the exponential growth of social media-based research in different fields, the number of studies included here is relatively low. Many factors, such as data issues, can explain this: for example, Twitter recently added more restrictions on data access, which may be behind the slow growth of research using social media to assess the consequences of extreme weather events. Finally, the data might lack critical elements, both disaster-related (such as severity, length, magnitude, and kind of catastrophe) and non-disaster-related (such as regional socioeconomic variations and day/night context).

Despite the advances made in NLP, most common applications are still limited. While the end goal of NLP is for algorithms to arrange meaning through computer logic and establish the links between words and grammar in human language, existing methodologies cannot yet resolve natural language as humans do. Several named entity recognition applications require human interaction as a post-processing step to correct model output (Alam et al. 2020). In addition, several models performed poorly owing to data scarcity and quality, leading to failures in detecting fake and spam social media messages (Vayansky et al. 2019). Moreover, several studies reported low performance of NLP models in languages other than English (Gründer-Fahrer et al. 2018); more research is needed on this matter. On another note, several studies lacked the time to better optimize and tune their models (de Bruijn et al. 2020). Finally, topic modeling algorithms such as latent Dirichlet allocation cannot produce reliable results when used with microblogging platforms like Twitter because of their unique features, particularly the brevity of tweets (Shannag and Hammo 2019); LDA has been shown to perform well on long texts but less well on short ones.

Although these models can solve plenty of hydrology challenges, some issues remain unsolved and need further investigation. Most of the developed methods are not ready for operational use, despite their capability to communicate preventive and corrective states for extreme weather event assessment (Alam et al. 2020). In addition, public response, especially from those directly impacted by a natural hazard, might be relatively limited during disasters, when data are most needed, as Internet users will prioritize their response to the hazard and its aftermath over reporting online (Vayansky et al. 2019). Moreover, it is still challenging to obtain complete situational awareness to support disaster management because of the unpredictable nature of natural catastrophes (Maulana and Maharani 2021). On another note, there is a lack of studies in several critical research areas, such as the geographical disparity in the spatial distribution of flood research at the global and intercontinental scales (Zhang and Wang 2022) and how the public’s attitudes change over time when disasters occur (Reynard and Shirgaokar 2019).

4.3 Study limitations

Although the search strategy was designed to be systematic and comprehensive, it has some limitations. First, a language bias was present because only English-language studies were included; studies in other languages, even those with English abstracts, were excluded. Secondly, the retrieval method may have missed studies labeled only with technique-specific terms. For instance, keywords such as latent Dirichlet allocation (LDA), a statistical model used in NLP, were not part of the search criteria, so literature indexed only under such terms would not have been returned. In addition, several studies appeared relevant from the title and abstract and seemed to fall within the scope of the paper; however, on full-text reading, these studies merely mentioned NLP as a solution to their problem without implementing any model.

Moreover, several studies tackled problems unrelated to natural hazards, such as infrastructure or transportation logistics, while still applying NLP models, which places them outside the scope of this study. This review excluded dissertations, theses, books, reports, and working papers because only peer-reviewed journal articles and conference papers were included; the quality and quantity of literature were traded off here. Finally, we considered only recent studies published after 2018.

5 Conclusion

NLP techniques hold huge potential to process and analyze natural language content. By leveraging NLP, it is possible to convert unstructured textual data into structured data that can be used for further analysis. In this study, natural language processing algorithms have proven their ability to support the assessment of hurricane and flood events. The benefits of NLP in evaluating extreme weather events include adding social media platforms, newspapers, and other textual sources to hydrological datasets, broadening study locations and scales, and lowering research expenses. There are many different applications for NLP. First, natural language processing can be used in the data collection phase: news articles and social media data can be collected using NLP techniques to monitor and study extreme weather events, validate numerical models’ predictions, identify potential trends, and explore future risk management strategies inspired by the hazards assessed. With its proper use, meteorologists can provide more accurate and timely warnings to the public, helping to reduce the risk of injury or death due to extreme weather events. In addition, NLP enables faster decision-making by automating the analysis of data and the generation of reports, which can help decision-makers quickly access the information they need to make informed decisions and take appropriate action in a timely manner.

According to this systematic evaluation of the literature, further research is needed to advance the use of NLP in analyzing extreme weather occurrences. Information extraction, topic modeling, categorization, and clustering are all examples of NLP modeling techniques that have been tested and assessed. Although this new potential is promising, hydrologists and meteorologists should have reasonable expectations of what NLP can achieve and recognize its limitations. In future studies, researchers should focus on methods to overcome data inadequacy, limited accessibility, and non-representativeness, as well as immature NLP approaches and computing constraints.