1 Introduction

Year after year, flood disasters unleash severe and widespread societal and humanitarian consequences (UNDRR 2020, 2022). Flood models are indispensable for effectively mitigating and managing this risk. These models help distinguish flood-prone areas in relation to specific return intervals, for instance to locate exposed communities and assets. In recent years, there has been rapid progress in the development of global flood models (GFMs), primarily driven by advancements in numerical algorithms, computational capabilities, and remote sensing (Wood et al. 2011; Dottori et al. 2016). Despite being relatively crude compared to more detailed local models, GFMs allow for risk assessments in a coherent and systematic manner across large scales. The flood maps generated by GFMs are, therefore, particularly valuable for international scenario analysis and risk assessments (Ward et al. 2015).

A multitude of individual GFMs are currently available, yet their performance is generally not well understood due to validation challenges over large spatial scales (Wing et al. 2017; Bernhofen et al. 2018; Hawker et al. 2020; Devitt et al. 2021; Mester et al. 2021). Due to this, the evaluation of GFMs has primarily taken an indirect approach by cross-comparing model outputs (Trigg et al. 2016, 2021; Refice et al. 2018; Bernhofen et al. 2018; Hoch and Trigg 2019; Schumann 2019; Aerts et al. 2020; Lindersson et al. 2021). Trigg et al. (2016) conducted a comparative analysis of flood hazard maps for Africa using six GFMs. The findings indicated a 30–40% level of agreement in flood extent, highlighting significant differences. This is in line with the findings of Aerts et al. (2020), which unveiled substantial variability in inundated areas when comparing eight GFMs, including proprietary GFMs for insurance and consultancy purposes. Lindersson et al. (2021) also conducted an inter-model comparison, revealing a wide range of agreement, with particular disagreement in arid regions and in very steep or flat terrains. Efforts have also been made to validate GFMs against official flood hazard maps; however, this approach is limited to countries where such maps are available (Sampson et al. 2015; Dottori et al. 2016).

However, recent years have also seen a notable uptick in efforts to validate GFMs by leveraging remotely sensed imagery of past flood events. This approach involves employing change detection analysis to discern differences in surface water extent before and during flood events. Some of these validation endeavours draw upon historical flood maps stored in databases such as the Dartmouth Flood Observatory (DFO), the Global Flood Database (GFD) and UNOSAT (Horritt 2006; Winsemius et al. 2013; Rudari et al. 2015; Dottori et al. 2016; Bernhofen et al. 2018; Bhattacharya et al. 2019; Hawker et al. 2020; Tellman et al. 2021). Bernhofen et al. (2018) evaluated the performance of six GFMs by comparing their flood extent predictions with historical satellite observations, finding a level of agreement ranging from 45 to 70%.

Indeed, the validation of GFMs using readily available flood maps from open databases is an important step towards improving model performance. One notable drawback of this approach, however, lies in its inherent inflexibility, as validation is confined to the events already catalogued within the databases. Furthermore, these databases may incorporate flood maps derived from various satellite sensors, which may impede consistent validation across events. Given these limitations, there is a need to develop alternative, more flexible, and consistent methods for GFM validation.

One promising approach to address this gap involves the performance of change detection analysis of high-resolution satellite imagery on cloud-based geospatial analysis platforms such as the Google Earth Engine (GEE) (Gorelick et al. 2017). An illustrative example of this approach can be found in the recommended practice endorsed by the United Nations Office for Outer Space Affairs (UN-SPIDER) for mapping historical flood events (Canty and Nielsen 2017; Notti et al. 2018; UN-SPIDER 2021). This practice entails performing change detection analysis on synthetic aperture radar (SAR) imagery obtained from the Sentinel-1 satellite, using GEE for the analysis (Canty and Nielsen 2017; Ali et al. 2018; Notti et al. 2018; DeVries et al. 2020). What sets this particular approach, henceforth referred to as CD-SAR, apart is its flexibility and relatively straightforward implementation. CD-SAR could thus enable validation of models across numerous flood events at a consistent spatial resolution.

Related to this, there has recently been a growing number of studies adopting the analysis of Sentinel-1 imagery on cloud-based analysis platforms for flood mapping in specific regions (Vanama et al. 2020; Singha et al. 2020; Tiwari et al. 2020; Lal et al. 2020). Prior research has specifically focused on evaluating methods akin to CD-SAR by comparing their outputs with flood maps derived from optical satellite imagery (Clement et al. 2018). For instance, Tripathy and Malladi (2022) employed a similar approach to CD-SAR for deriving flood maps from Sentinel-1 images, subsequently comparing them with flood maps generated using Sentinel-2 multispectral instrumentation and ground-based observations. The study revealed the advantages of SAR as it excels in capturing flood inundation, especially in areas with changing water extents. The inconsistency of optical data in identifying water across diverse regions is linked to spectral changes from debris and mud in floodwater, while SAR data remains unaffected by false alarms caused by thin clouds and atmospheric effects.

Nevertheless, CD-SAR remains an unexplored potential tool for the validation of GFM performance.

The primary aim of our study is to assess the potential of utilising CD-SAR data for validating Global Flood Models (GFMs). This paper is structured around two main objectives to reach this aim. First, this study seeks to collectively validate four widely adopted GFMs with flood maps generated through the CD-SAR approach. This approach differs from previous validation efforts, which have primarily relied on model inter-comparisons or readily available flood maps from open databases. This validation analysis has been conducted across eight distinct large river basins on four continents, encompassing a diverse range of hydro-climatic environments. Consequently, this analysis sheds light on the performance of GFMs under various geographical conditions. Second, this study aims to compare CD-SAR-derived flood maps with those obtained from alternative remote sensing sources. These comparative results offer valuable insights into the reliability of CD-SAR data as a validation tool, more specifically how it compares with flood maps generated by other remote sensing techniques.

2 Data and study areas

The following section describes the study areas and data used to fulfil the aim of this paper. Taken together, the validation analysis was conducted on four individual GFMs for eight past flood events across a range of geographic conditions. The comparison of satellite-derived flood maps from CD-SAR and alternative sources was subsequently conducted on two of these sites.

2.1 Global flood models

The outputs from four fluvial flood models were validated in this study: the Joint Research Centre (JRC) model (Dottori et al. 2016), the Global Assessment Report (GAR) model (Rudari et al. 2015), the Catchment-Based Macro‐scale Floodplain (CaMa-Flood) model (Yamazaki et al. 2011; Zhou et al. 2021), and the Fathom Global Ltd. (Fathom) model (Sampson et al. 2015). These models have previously been included in multiple inter-comparison and validation studies (e.g., Trigg et al. 2016; Bernhofen et al. 2018; Mester et al. 2021), and can be argued to represent the state-of-the-art of publicly available GFMs. While many different return periods are available for the outputs of these GFMs, two return periods, 20/25-year and 100-year, were chosen and tested for each GFM in the validation analysis.

The GFMs can be arranged into two main model structures (Table 1), a cascade model type (JRC, CaMa), and a gauged flow data model type (GAR, Fathom). A cascade model type uses a precipitation time series from global climate reanalysis data driving a hydrological model, which produces flows across a river network. The gauged flow data model type uses global gauged flow data and regional flow frequency analysis to determine the flood flow magnitude (Trigg et al. 2016). All four models assume a constant return period across the modelled domain, but they differ in complexity ranging from 2D hydrodynamic modelling to 1D hydraulic modelling and differ in modelled output resolutions.

Table 1 Characteristics of the selected Global Flood Models (GFMs). This table has been adapted from Bernhofen et al. (2022)

2.2 Observational data from Sentinel-1

The basis of CD-SAR is satellite imagery from the Sentinel-1 (hereafter called S1) mission, which performs C-band synthetic aperture radar (SAR) imaging and is the first of five missions that the European Space Agency is developing for the Copernicus initiative (Torres et al. 2012). The S1 SAR data was chosen for multiple reasons. Radar imagery, in general, is advantageous for being independent of weather conditions and sunlight (Fischell et al. 2018). Because of this, SAR imagery often complements already existing rapid-response disaster management maps from optical sources (Canty et al. 2019), and has been used in a wide range of environmental analyses, including flood mapping (Twele et al. 2016; Meyer et al. 2018; Cao et al. 2019; Ban et al. 2020; Tavus et al. 2020).

The data from the S1 satellite, more specifically, is readily available on GEE with a relatively high spatial resolution (10 m) in a pre-processed version – that includes thermal-noise removal, radiometric calibration, and terrain correction – which makes it particularly suitable for flood mapping. This study used S1 data of both single-polarization (VV, vertical transmit/vertical receive) and dual-polarization (VH, vertical transmit/horizontal receive).

Two constraints associated with the S1 data relate to its temporal limitations. The S1 mission is relatively young, spanning only from 2014 onwards. Furthermore, the varying revisit intervals, ranging from 6 to 12 days, pose challenges in matching data retrieval with flood peaks, especially for rapidly evolving flash floods. The failure of Sentinel-1B in late 2021 also means that the data coverage is limited to Sentinel-1A, resulting in doubled revisit times (Roth et al. 2023). It is worth noting, however, that this latter limitation is not unique to S1 but is a common constraint of data from all publicly available satellite missions. In addition to these temporal considerations, the use of radar imagery for surface water detection introduces numerous error sources that can result in false positives. These “water look-alikes” require careful scrutiny, and may encompass features such as tarmacs and roads, dry and frozen soil, wet snow, and agricultural fields under certain conditions (Shen et al. 2019).

2.3 Study areas

As mentioned earlier, this validation study encompasses the analysis of eight distinct flood events occurring under a diverse set of geographical conditions. Some of these events were entirely contained within specific countries, while others spanned across national borders. The events under examination are denoted based on the region where the majority of their extent was situated, namely: Netherlands, Myanmar West, Myanmar East, Paraguay/Argentina, Nigeria, Zambia, Ethiopia, and Australia (Fig. 1).

These selected events collectively satisfy four essential sampling criteria. First, all eight events fall within the coverage of the four GFMs and were detectable using S1 data. Second, only inland fluvial flood events were considered, due to the variable capacity among the GFMs to take into account storm surges and tidal flooding (Bernhofen et al. 2022). Third, a wide range of events were chosen, varying in flood size and exhibiting diverse climatological, topological, and morphological characteristics, allowing for an exploration of how findings may differ across different geographical conditions and flood characteristics. Lastly, urban areas were omitted from the analysis due to the challenges encountered by synthetic aperture radar (SAR) in mapping floods in such areas. These challenges arise from issues like layover effects, shadows created by buildings, and backscattering effects (Twele et al. 2016).

Fig. 1

World map showing the locations of the eight different events. The flood events are located in the Netherlands and Belgium (1), Myanmar (2), Myanmar (3), Argentina and Paraguay (4), Nigeria (5), Zambia (6), Ethiopia (7), and Australia (8)

The Meuse River [1] spans multiple countries and its basin is primarily characterized by a concentration of agricultural activity. The flood event in focus here, which occurred in 2021, extends predominantly from Belgium to the Netherlands, encompassing a relatively flat terrain. In Myanmar, two distinct events during 2020 were examined: Myanmar West [2] has more of a meandering nature compared to Myanmar East [3]. The western part is situated in mountainous, forested terrain, whereas the eastern section comprises vast plains. Both Myanmar events feature agricultural zones. The Paraná River is located on the border between Paraguay and Argentina [4] and was hit by a flood at the beginning of 2016. The river is characterised by its braided nature due to its high sediment load. The surrounding region primarily consists of narrow strips of forest, shrubs, and agricultural zones. The floodplain along the Niger River south of Idah in Nigeria [5], which was flooded in 2018, presents a relatively flat topography, including an expansive floodplain and several minor tributaries. The Luapula River, located in the southern portion of Lake Mweru in the northern region of Zambia, borders the Democratic Republic of Congo [6]. This region primarily comprises a vast delta originating from an upstream wetland and was flooded after a severe drought in 2020. The Shabelle River, situated in the eastern Ethiopian highlands [7], experiences tropical arid to dry and sub-humid conditions and was flooded during 2020. Lastly, the Clarence River, situated on the east coast of Australia [8], was flooded during 2022. The river flows through a low-lying, flat alluvial plain, characterized by lagoons, channels, creeks, and agricultural zones.

2.4 Alternative observational data

To assess the capability of CD-SAR for flood map generation, two alternative sources were considered for two of the selected flood events, hereafter referred to as ‘the reference maps’. For the Myanmar East event [3], flood maps derived from MODIS imagery at a 1 km resolution were retrieved from the UNOSAT flood database (available at: https://unosat.org/products). In the case of the Paraguay/Argentina event [4], data were obtained from the Global Flood Database (GFD) provided by Cloud To Street (Tellman et al. 2017, 2021), which also relies on MODIS imagery but at a finer 250 m resolution.

It is noteworthy that observational data for the remaining six major flood events were absent from the three databases we examined for this part of the analysis, namely UNOSAT, GFD, and the Dartmouth Flood Observatory (DFO) database (available at https://floodobservatory.colorado.edu/). This underscores the inherent challenges associated with relying on readily available observational databases for GFM validation.

3 Methodology

The analysis in this study can be divided into three main parts. First, S1 data were imported for all eight areas of interest and subjected to the change detection method CD-SAR. Second, the data were harmonized to ensure consistency, which included resampling and masking of raster layers. Third, map agreements were quantified to facilitate the validation and comparative analysis. The S1 data were processed in GEE (Gorelick et al. 2017), and the subsequent data homogenisation and spatial analysis were conducted in ArcGIS Pro (v2.9).

3.1 CD-SAR

The change detection method of this study, CD-SAR, as outlined in the recommended practice by UN-SPIDER (2021) and Canty and Nielsen (2017), was executed in several steps, as described in this section. The flood events were first delineated based on their associated river basins. This delineation process hinged on three different sub-basin sizes extracted from the HydroSHEDS dataset (Lehner et al. 2008), chosen according to the size of the flood extent captured in the S1 data. The basin sizes, classified from level 5 (largest) to level 7 (smallest), were identified using Pfafstetter codes (Lehner et al. 2008; Lehner and Grill 2013). Worth noting is that the basin encompassing the Netherlands flood consists of two merged level 7 basins.

Subsequently, both a pre- and a post-flood image were imported from the S1 dataset (ESA 2022) at a resolution of 10 m. The pre-flood image represents the area in its non-flooded state, whereas the post-flood image shows the area when inundation has occurred. The delineated area often encompassed two or, in some instances, more S1 images that were merged to encompass the entire flood-affected area. Table 2 summarizes location and size of the eight river basins, date of the captured Sentinel-1 images and estimated flood extents.

Table 2 Summary of the locations and the Sentinel-1 data used for each region. The estimated flood extent is based on CD-SAR 10 m resolution

The pre- and post-flood images were divided by one another to produce a difference image. In this difference image, pixels take a higher value where a difference between the two images is detected, highlighting the area most likely to be flooded. The larger the difference (i.e., bright versus dark), the higher the value.

A thresholding approach was applied to the difference images, resulting in a binary layer distinguishing flooded from non-flooded areas. Various methods and algorithms exist for determining the appropriate threshold value, often rooted in probabilistic or classification approaches (Long et al. 2014; Schlaffer et al. 2017; Clement et al. 2018). In this study, a single threshold value was adopted for all events to ensure applicability across different scenarios, although this approach introduces some level of uncertainty as the threshold value can be influenced by various factors, including weather conditions and the presence of “water look-alikes.” Initially, a range of static threshold values from 1.0 to 1.5 was tested. Subsequently, following a visual examination of the imagery (Notti et al. 2018), a threshold level of 1.25 was selected. This means that all pixels in the difference image with values exceeding 1.25 were designated as flooded. Notably, this chosen threshold value aligns with the recommended practices advocated by UN-SPIDER.
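The ratio-and-threshold step can be illustrated with a minimal sketch in plain Python. It is not the GEE workflow used in the study: the function name `flood_mask`, the toy backscatter values, and the choice to divide the post-flood image by the pre-flood image are assumptions made for illustration. Only the threshold of 1.25 is taken from the text.

```python
from typing import List

Raster = List[List[float]]

THRESHOLD = 1.25  # static threshold reported in the text


def flood_mask(pre: Raster, post: Raster, threshold: float = THRESHOLD) -> List[List[int]]:
    """Return 1 where the post/pre ratio exceeds the threshold, else 0."""
    mask = []
    for pre_row, post_row in zip(pre, post):
        row = []
        for pre_val, post_val in zip(pre_row, post_row):
            # Assumption: ratio = post / pre; cells with pre == 0 are left dry.
            ratio = post_val / pre_val if pre_val != 0 else 0.0
            row.append(1 if ratio > threshold else 0)
        mask.append(row)
    return mask


# Toy backscatter values in dB (hypothetical, not from the study).
pre_flood = [[-16.0, -15.0], [-17.0, -16.0]]
post_flood = [[-24.0, -15.5], [-17.0, -21.0]]
mask = flood_mask(pre_flood, post_flood)
```

With these toy values, only the two pixels whose backscatter drops markedly between the acquisitions (ratios of 1.5 and about 1.31) exceed the threshold and are flagged as flooded.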

As previously mentioned, the analysis considered two polarizations within the S1 dataset, and their performance exhibited negligible differences across the selected events. However, the VH (vertical transmit/horizontal receive) dual-polarization was ultimately chosen due to its superior performance in the majority of cases, with the added benefit of its capacity to distinguish between open water and inundated vegetation (Irwin et al. 2018). This choice aligns with suggestions from the recommended practices by UN-SPIDER, as VH is widely suggested for flood mapping since it is more sensitive to changes on the land surface (Notti et al. 2018). Depending on the availability of the satellite images and the revisit times, both descending and ascending directions for S1 were tested for all events and the direction with the best coverage was chosen.

Noise-like speckle is a common characteristic of SAR data, and a filter is usually applied to smooth it out (Shen et al. 2019). In this case, a mean filter was applied to both the pre- and post-flood images before computing the difference between them. This entailed moving a 50 m circular matrix across the raster and replacing each pixel with the mean value of the matrix centred on it. This reduces the speckle and smooths the image (Mansourpour and Blais 2006).
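As a rough illustration of such a focal mean, the sketch below averages all pixels that fall within a circular window around each pixel. The function name `circular_mean_filter`, the window size expressed in pixels (`radius_px`), and the handling of raster edges are assumptions for this example; how the 50 m window maps onto 10 m pixels depends on the implementation used and is not specified here.

```python
from typing import List

Raster = List[List[float]]


def circular_mean_filter(image: Raster, radius_px: int) -> Raster:
    """Replace each pixel with the mean of the pixels within radius_px of it."""
    n_rows, n_cols = len(image), len(image[0])
    # Offsets (dr, dc) that fall inside a circle of the given radius.
    offsets = [
        (dr, dc)
        for dr in range(-radius_px, radius_px + 1)
        for dc in range(-radius_px, radius_px + 1)
        if dr * dr + dc * dc <= radius_px * radius_px
    ]
    smoothed = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Neighbours outside the raster are skipped (an edge-handling assumption).
            values = [
                image[r + dr][c + dc]
                for dr, dc in offsets
                if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols
            ]
            row.append(sum(values) / len(values))
        smoothed.append(row)
    return smoothed


# A single bright speckle in an otherwise uniform 3 x 3 patch (toy values).
speckled = [[1.0, 1.0, 1.0], [1.0, 10.0, 1.0], [1.0, 1.0, 1.0]]
smoothed = circular_mean_filter(speckled, radius_px=1)
```

In this toy patch, the isolated bright pixel is pulled towards its surroundings, which is the smoothing effect the filter is meant to achieve.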

Apart from the mean filter used to reduce speckles, several other filters were also applied to the final image. To further suppress noise, pixel connectivity was assessed, and pixels connected to eight or fewer neighbours were marked as non-flooded (UN-SPIDER 2021; McCormack et al. 2022). This step not only contributed to noise reduction but also served to eliminate smaller flooded areas stemming from pluvial flooding, thus unrelated to fluvial flooding. To exclude perennial water bodies, a water mask derived from the JRC Global Surface Water Mapping dataset (Pekel et al. 2016) was applied. This mask identified areas with water detected for more than 10 months per year, effectively removing them from the analysis. Areas with a slope exceeding 5% were also masked out using the HydroSHEDS Digital Elevation Model (DEM) (Lehner et al. 2008) to exclude areas where water accumulation was improbable.
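The connectivity filter can be sketched as follows. This is one possible reading of the rule, under stated assumptions: flooded pixels are grouped into 8-connected patches, and a patch is dropped when each of its pixels is connected to eight or fewer other flooded pixels. The function name `remove_small_patches` and the toy mask are hypothetical, and the pixel-counting convention of the GEE implementation may differ.

```python
from collections import deque
from typing import List

Mask = List[List[int]]


def remove_small_patches(mask: Mask, max_neighbours: int = 8) -> Mask:
    """Set flooded pixels to 0 when their 8-connected patch is too small."""
    n_rows, n_cols = len(mask), len(mask[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    cleaned = [row[:] for row in mask]
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if mask[r0][c0] != 1 or seen[r0][c0]:
                continue
            # Breadth-first search collects one 8-connected patch.
            patch = []
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                patch.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < n_rows and 0 <= cc < n_cols
                                and mask[rr][cc] == 1 and not seen[rr][cc]):
                            seen[rr][cc] = True
                            queue.append((rr, cc))
            # Each pixel is connected to len(patch) - 1 other flooded pixels.
            if len(patch) - 1 <= max_neighbours:
                for r, c in patch:
                    cleaned[r][c] = 0
    return cleaned


# Toy mask: a 12-pixel flooded block and one isolated flooded pixel.
noisy = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
cleaned = remove_small_patches(noisy)
```

Here the 12-pixel block is retained, whereas the isolated pixel in the lower-right corner is reclassified as non-flooded.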

Lastly, further delineation of the flood extent was achieved through a distance mask, customised to specific ranges spanning from 3.5 km to 40 km from the main river reach. The selection of these ranges was contingent on factors such as basin size and the extent of the flood, aiming to exclude areas deemed unrelated to the fluvial floods. For river data, the HydroSHEDS free-flowing rivers dataset (Grill et al. 2019) was used, and only the main river reaches were considered, excluding the tributaries.

3.2 Data homogenisation

Outputs from the GFMs with depth-indicating pixels were initially converted to a binary format distinguishing wet and dry pixels. All maps from the GFMs, as well as the reference maps, were then delineated and masked in the same manner as the CD-SAR flood maps, as described above.

It has been suggested that the most pragmatic approach to compare model outputs is against high-resolution datasets (Sampson et al. 2015). As all the GFMs use DEMs with a ∼90 m resolution, this resolution was judged to be the best basis for a high-resolution comparison. It can also be argued that this is a more conservative test for the coarser-resolution GFMs and the reference maps. All outputs that did not have a ∼90 m resolution were resampled using the nearest neighbour method, as it is suitable for categorical data (Esri; McRoberts 2012). When resampling datasets, there is a risk of introducing false accuracy errors. In the case of resampling to a higher resolution, this risk can be dismissed, as all the pixels are binary and resampling does not produce new values (Bernhofen et al. 2018). Geospatial overlap errors may occur regardless of the target resolution (Bernhofen et al. 2018). These are, however, unlikely to affect the resampling to a higher resolution to any great degree.
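The two homogenisation steps, binarisation and nearest neighbour resampling, can be sketched in a few lines. The sketch makes simplifying assumptions that are not taken from the study: any positive depth counts as wet, the grids are aligned, and the resolution changes by an integer factor. The function names `to_binary` and `resample_nearest` are hypothetical.

```python
from typing import List


def to_binary(depth: List[List[float]]) -> List[List[int]]:
    """Mark cells with a positive modelled depth as wet (1), others as dry (0)."""
    # Assumption: any depth above zero counts as wet.
    return [[1 if d > 0 else 0 for d in row] for row in depth]


def resample_nearest(grid: List[List[int]], factor: int) -> List[List[int]]:
    """Upsample by an integer factor; each new cell copies its source cell."""
    out = []
    for row in grid:
        fine_row = [value for value in row for _ in range(factor)]
        out.extend([fine_row[:] for _ in range(factor)])
    return out


# Toy depth grid in metres (hypothetical values).
depth = [[0.0, 1.2], [0.4, 0.0]]
binary = to_binary(depth)
fine = resample_nearest(binary, factor=2)
```

Because each fine cell simply copies the value of the coarse cell that contains it, the resampled raster still holds only the values 0 and 1, which is why no new values arise when resampling binary maps to a higher resolution.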

3.3 Model evaluation

Two performance metrics were chosen for the model evaluation analysis. The first metric is the Critical Success Index (CSI), which has been used in previous validation efforts (Bernhofen et al. 2018). To calculate the CSI, the CD-SAR outputs were combined pairwise with each of the GFMs and the reference maps to create a raster for each comparison. The pixel values generated represent various categories of agreement (Eq. 1) where the resulting value varies between 0 (no agreement/worst) and 1 (maximum agreement/best) (Bernhofen et al. 2018):

$$CSI=\frac{a}{a+b+c}$$
(1)

Where \(a, b\) and \(c\) are the numbers of pixels for each flood event according to Table 3 (i.e., the denominator is the union of both datasets, and the numerator is the intersection). This index ignores large dry areas that could give a false impression of agreement and does not assume that any of the models is correct. It is purely an agreement measure for wet areas.

Table 3 Contingency table for the pairwise agreement evaluation where the variables a, b, and c relate to the number of pixels for each flood event

The second metric is based on the aggregated performance metric used by Sampson et al. (2015). A 5 × 5 km grid is set up over the entire basin and the fraction of wet area is calculated for each individual cell, giving a value between 0 and 1. Cells in the grid containing no data from any of the datasets are excluded. This was done for CD-SAR, the GFMs, and the reference maps. The Mean Absolute Error (MAE) is then calculated between the CD-SAR and the GFM/reference maps using Eq. (3) where the resulting value varies between 0 (full agreement) and 1 (no agreement):

$$MAE=\frac{\sum_{i=1}^{N}\left|M_{i}-C_{i}\right|}{N}$$
(3)

Where \(N\) is the number of cells in the grid containing data, \(M_{i}\) is the fraction of the flooded area in cell \(i\) for the model/reference map, and \(C_{i}\) is the fraction of the flooded area in cell \(i\) for CD-SAR.
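The grid-based MAE can be sketched as below. The sketch rests on assumptions that are not taken from the study: the rasters are aligned, the grid cell size is given in pixels (`cell`), missing data are encoded as `None`, and a cell is skipped when either map has no valid pixel in it. The function names and toy grids are hypothetical.

```python
from typing import List, Optional

Grid = List[List[Optional[int]]]  # 1 = wet, 0 = dry, None = no data


def wet_fractions(grid: Grid, cell: int) -> List[Optional[float]]:
    """Wet fraction of valid pixels in each cell x cell block (None if no data)."""
    fractions = []
    for r0 in range(0, len(grid), cell):
        for c0 in range(0, len(grid[0]), cell):
            block = [
                v
                for row in grid[r0:r0 + cell]
                for v in row[c0:c0 + cell]
                if v is not None
            ]
            fractions.append(sum(block) / len(block) if block else None)
    return fractions


def mean_absolute_error(model: Grid, cd_sar: Grid, cell: int) -> float:
    """MAE between per-cell wet fractions, skipping cells without data."""
    pairs = [
        (m, c)
        for m, c in zip(wet_fractions(model, cell), wet_fractions(cd_sar, cell))
        if m is not None and c is not None
    ]
    if not pairs:
        raise ValueError("no grid cell has data in both maps")
    return sum(abs(m - c) for m, c in pairs) / len(pairs)


# Toy rasters split into two 2 x 2 pixel cells (hypothetical values).
model_map = [[1, 1, 0, 0], [1, 0, 0, 0]]
cd_sar_map = [[1, 0, 0, 1], [0, 0, 0, 0]]
mae = mean_absolute_error(model_map, cd_sar_map, cell=2)  # (0.5 + 0.25) / 2
```

In this toy case the per-cell fractions M and C are 0.75 and 0.25 in the first cell and 0 and 0.25 in the second, so the result depends on how much of each cell is flooded rather than on which exact pixels overlap.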

This metric was added as it does not take the specific resolutions and the size of the flood extent into consideration. This is contrary to the agreement index, which is dependent on the number of pixels in the mapped extent. It can also give an indication of how well the inundation boundary and the flood-edge locations of the two different datasets compare. This can further clarify the differences between the datasets as the agreement index might give a lower value when the flood extents are large and contain a higher number of pixels.

4 Results and discussion

The section below begins with a presentation of the GFM validation analysis using CD-SAR, aligning with the first objective of this study. Subsequently, the section delves into the comparative examination of CD-SAR flood maps against those derived from optical remote sensing technology (i.e. the reference maps), aligning with the second objective of the study.

4.1 Model evaluation using CD-SAR

The CSI scores between CD-SAR and the four GFMs reveal substantial variation both among the models and across flood events (Fig. 2, Supplementary Fig. 1). Across all flood events, the JRC maps demonstrated the highest agreement with CD-SAR, with an average CSI score of 0.34 for the 100-year return period. CaMa and Fathom closely follow with average scores of 0.30, while GAR yields the lowest average score at 0.18. Examining median values reinforces this pattern, with JRC achieving a median score of 0.40 for the same return period, followed by CaMa at 0.36, Fathom at 0.30, and GAR at 0.21. In most cases, the CSI scores are relatively insensitive to the choice of return period, with the exception of the high-resolution flood maps of Fathom (Fig. 2).

Previous GFM validation studies have shown that a CSI score of 0.7 can be achieved by some GFMs in some regions but can also go as low as 0.02 (Trigg et al. 2016; Bernhofen et al. 2018, 2022; Mester et al. 2021). Generally, a CSI score exceeding 0.7 is considered indicative of a good model performance, whereas scores below 0.5 are regarded as poor (Bernhofen et al. 2018). It is worth noting that the CSI scores in this study are slightly lower than the ranges in previous studies, such as the one conducted by Bernhofen et al. (2018). In this study, Fathom had the highest specific score (0.55) for the Myanmar East region at a 20-year return period (Fig. 2), which could be considered to be a moderate performance. However, for the most part, the scores of this study tend to fall within the lower end of the spectrum, indicating poorer agreement. It is important to note that the inconsistencies in the scores cannot be solely attributed to the performance of the models, but also to the efficacy of the CD-SAR approach across distinct regions. The ability of CD-SAR to map flood outlines compared to other remote sensing alternatives will be further analysed in the next subsection.

Fig. 2

MAE and CSI scores between the GFMs and the CD-SAR flood maps, considering two return periods and eight distinct flood events. The GFMs are arranged in ascending order based on their spatial resolution, with the finest resolution positioned at the top

In essence, the maps of JRC, CaMa and Fathom generally demonstrate similar levels of agreement with CD-SAR, while GAR consistently exhibits the lowest agreement. The case study of Nigeria, however, stands as an exception to this pattern, where GAR shows the highest agreement with CD-SAR among the models (Supplementary Figs. 10–11). This may be related to the general difficulty of modelling flood outlines in flat, unconfined topographies, as is the case for the flood in the Niger River. In contrast, the confined floodplains of the Myanmar flood events exhibit the highest average CSI scores of all study areas and across all models (Fig. 2, Supplementary Figs. 5–7). The raster outputs from the four GFMs with a 20/25-year return period and CD-SAR in Myanmar East can be seen in Fig. 3. This comparatively high agreement may exemplify how models generally perform better in confined floodplains with limited vegetation cover. Furthermore, the absence of substantial vegetation could result in minimal interference with the radar signal, thereby facilitating accurate detection of the flood extent by CD-SAR.

Fig. 3

Comparison between the individual GFMs and CD-SAR for the flood in Myanmar East, with the models using a 20/25-year return period

The performance of CaMa generally aligns with that of Fathom and JRC across the case studies, with one notable exception found in Ethiopia, where CaMa displays substantially lower agreement with CD-SAR (Supplementary Figs. 14–15). One reason for this could be that this arid region may be difficult to represent with the floodplain storage elevation relationships used by CaMa, as opposed to the hydrodynamic modelling used by Fathom and JRC. It should also be noted, however, that arid regions may contain “water look-alikes”, which could also lower performance of CD-SAR. The difficulty of modelling flood outlines in arid regions is widespread across models, as indicated by overall lower scores in dry climate regions compared to the more humid ones (such as Myanmar) and shown in previous studies, such as Lindersson et al. (2021).

The floods in Australia and Paraguay/Argentina stand out as exhibiting particularly low agreement scores for GAR (Fig. 4, Supplementary Figs. 8–9, 16). The low values of 0.00–0.06 mean that there is barely any agreement between GAR and CD-SAR for these events. One reason for this could be that GAR does not model the flood closest to the rivers, as illustrated for Australia in Fig. 4. The inundation solver of GAR uses a 1D Manning approach, which may have difficulties mapping meandering dynamics and wide floodplains with their associated side flows, as is the case for the braided Paraná River and the flat Clarence River. This might explain Fathom’s higher CSI score (especially for the Paraná River), as Fathom is based on a 2D hydrodynamic model.

Fig. 4

Comparison between the individual GFMs and CD-SAR for the flood in Australia, with the models using a 100-year return period

Across the eight flood events, the lowest agreement was found in the Netherlands (Fig. 2, Supplementary Figs. 3–4). This is likely due to a widespread use of flood mitigation measures in the Netherlands, in the form of dikes and levees. The ways in which the GFMs incorporate (or not) flood defenses vary considerably, which means that most of the GFMs will overestimate the flood extents in these contexts. This particular river basin is also a relatively flat and unconfined floodplain, which may also contribute to overestimated flood extents, as previously discussed.

Turning now to the pairwise MAE scores, these can offer insights into how effectively the models capture inundation boundaries compared to the observational data. Lower MAE values indicate a better model fit. Overall, in contrast to the CSI scores, the MAE scores tend to align with the spatial resolution of the models (Fig. 2, Supplementary Fig. 2). Moreover, the MAE scores exhibit a higher sensitivity to the choice of return period than the CSI scores.

Among the GFMs, Fathom emerges as having the lowest MAE across all cases, which is particularly evident for Myanmar West in the context of a 20-year return period. For a 100-year return period, Fathom still performs best, with an average MAE score of 0.22 across all cases, followed by CaMa (0.23), JRC (0.28), and GAR (0.38). Notably, GAR demonstrates consistently poor performance, both in CSI and MAE scores. It should be noted that the low MAE for 25-year GAR in the Netherlands is due to the model only mapping an exceedingly small part of the basin, which lowers the MAE score.
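As an illustration of how the two metrics can diverge, the following minimal Python sketch computes CSI and MAE for small hypothetical binary grids (1 = flooded, 0 = dry). It assumes the standard CSI definition (hits divided by the sum of hits, false alarms and misses) and an MAE taken as the mean absolute cell-wise difference over the whole grid; the exact MAE formulation and any spatial aggregation used in this study may differ. The grids and function names are invented for illustration only.

```python
def contingency(model, reference):
    """Count hits, false alarms and misses for two equally sized binary grids."""
    hits = false_alarms = misses = 0
    for m_row, r_row in zip(model, reference):
        for m, r in zip(m_row, r_row):
            if m and r:
                hits += 1          # both maps flooded
            elif m and not r:
                false_alarms += 1  # model flooded, reference dry
            elif r and not m:
                misses += 1        # reference flooded, model dry
    return hits, false_alarms, misses


def csi(model, reference):
    """Critical success index: hits / (hits + false alarms + misses)."""
    hits, false_alarms, misses = contingency(model, reference)
    total = hits + false_alarms + misses
    return hits / total if total else float("nan")


def mae(model, reference):
    """Mean absolute cell-wise difference over all cells (assumed definition)."""
    diffs = [abs(m - r)
             for m_row, r_row in zip(model, reference)
             for m, r in zip(m_row, r_row)]
    return sum(diffs) / len(diffs)


# Hypothetical 4 x 4 grids: the reference shows a narrow flooded strip.
reference = [[0, 1, 0, 0]] * 4
model_wide = [[1, 1, 1, 0]] * 4    # captures the strip but floods too widely
model_sparse = [[0, 0, 0, 0]] * 4  # maps (almost) no flooding at all

for name, grid in (("wide", model_wide), ("sparse", model_sparse)):
    print(name, round(csi(grid, reference), 2), round(mae(grid, reference), 2))
```

In this toy case the empty model map obtains the lower MAE (0.25 versus 0.50) although its CSI is zero, which mirrors the effect noted above for a model that maps only a very small part of a basin.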

Across all GFMs, the average MAE scores exhibit their lowest values for the flood events in Myanmar and Zambia (Fig. 2). This aligns with the higher CSI scores also observed for at least the Myanmar flood events. The river basin in Zambia consists of large wetland areas, which can generally pose challenges when modelling the connectivity of the main channel and the floodplain (Supplementary Figs. 12, 13). Furthermore, GFMs are generally limited in their ability to accurately capture river channel bathymetry beneath the water surface, due to a lack of observed/surveyed data (Neal et al. 2021). This makes it difficult to determine the water level at which flooding occurs, which increases model errors in unconfined floodplains or where there is a dense channel network. This could explain why the overall CSI scores are poor in this case, despite the models capturing the inundation boundaries relatively well. Taken together, the regions with the highest MAE scores tend to be those characterized by flat terrain and numerous small channels. Nigeria, for instance, stands out as a flood event with particularly low inundation boundary agreement (Fig. 2). This finding yet again underscores the overarching challenge faced by flood models in accurately representing such environments.

4.2 Comparison with alternative reference maps

To evaluate the CD-SAR methodology, outputs from CD-SAR and the GFMs were subsequently compared with the reference flood maps generated from optical MODIS data. This assessment was carried out for two specific regions, namely Myanmar East and Paraguay/Argentina. For the Paraguay/Argentina region, the alternative flood map was retrieved from the GFD database, relying on MODIS data at 250 m resolution. The reference map for the Myanmar East region was obtained from the UNOSAT database, relying on MODIS data at 1 km resolution. Figure 5 shows the results from this pairwise evaluation between these reference flood maps and those of CD-SAR and the GFMs.

Fig. 5

MAE and CSI scores between the alternative observational data and flood maps from CD-SAR and the GFMs, considering two flood events. The CD-SAR and GFMs are arranged in ascending order based on their spatial resolution, with the finest resolution positioned at the top

Turning first to how CD-SAR compares to these reference flood maps, it is evident that the overall agreement is only marginally improved with the higher-resolution reference map (Fig. 5). Both the 250 m and the 1 km resolution reference maps exhibit moderate CSI scores when compared to CD-SAR, at values around 0.5. The low MAE scores, specifically 0.09 and 0.14, indicate a similar level of representation of inundation boundaries between CD-SAR and the reference maps.

When comparing the GFMs to the reference maps, the results are notably varied (Fig. 5). In the Paraguay/Argentina case, the CaMa model stands out as having the highest level of agreement with the 250 m reference map, both in terms of CSI and MAE scores. This could potentially be attributed to CaMa having the spatial resolution most similar to that of the reference map. A similar tendency is observed for the Myanmar East case, where the JRC model demonstrates the highest level of agreement with the reference map, as both the model and the reference map have a spatial resolution of 1 km. It is worth noting, however, that for this particular case study the high-resolution maps of Fathom and CaMa performed slightly better in terms of MAE scores.

The CSI scores between the three best-performing GFMs (JRC, CaMa and Fathom) and the reference flood maps range from 0.46 to 0.62, which corresponds to moderate agreement (Bernhofen et al. 2018). A key point here is that the average CSI score between the reference maps and these three GFMs (across both cases and return periods) stands at 0.54, surpassing the average CSI score of CD-SAR, which is 0.50. This points to how flood maps from change detection analysis of different sources of remotely sensed data can exhibit disagreement levels comparable to those found for the GFMs themselves, underscoring the importance of not labelling these remotely sensed flood maps as ‘ground truth’. This discrepancy among individual remotely sensed flood maps can be partly attributed to individual challenges associated with distinct satellite sensors, the spatial resolution of the data, the timing of the data acquisition, as well as the assumptions and limitations of the subsequent change detection analysis.

This analysis, however, indicates that the capability of CD-SAR to delineate flood extents is on par with that of commonly used reference maps, including those relying on MODIS data. To illustrate, the range of agreement between the GFMs and CD-SAR exhibits similar levels to those obtained with the alternative reference maps. This suggests that CD-SAR could serve as a valuable tool for validation purposes, akin to other reference maps. The results of this study furthermore highlight the central influence of spatial resolution on the overall level of agreement between GFMs and observational data. This supports the use of the relatively high-resolution SAR data used by the CD-SAR methodology, particularly as GFMs continue to advance in detail.

4.3 Limitations of the CD-SAR methodology

The following section describes key sources of uncertainty of the CD-SAR methodology, relating to both limitations of the underlying radar data and assumptions of the subsequent change detection analysis. The limitations of the radar data arise from the inherent characteristics of radar imagery and the limitations imposed by the revisit times of the satellites.

A major uncertainty that needs to be addressed when using CD-SAR is the general challenge of using SAR data to detect flood water in vegetation-rich areas, tropical wetlands, and forest areas with evergreen canopies. Because of the scattering effects of the radar reflection, CD-SAR is unlikely to capture all flooded areas in these types of regions. This could be the case for most events located in Africa, South America, and Southeast Asia. The same issue arises with water look-alikes such as agricultural fields, which is an obstacle in nearly all the events, or sand in, for example, Ethiopia. Due to shadowing, uncertainties also arise when using CD-SAR in mountainous areas.

Another shortcoming of CD-SAR is that the timing of the data acquisition does not necessarily correspond to the timing of the flood peak. This uncertainty is especially relevant for rapid flash floods, which might not be detected by CD-SAR at all, as the flood might only be visible for a few days or a few hours. However, underestimation could also occur for floods with longer durations. This is most prominent in areas with large wetlands and river deltas where flood water could be retained for longer periods of time, such as Nigeria and Zambia. In most of the cases, CD-SAR underestimated the flooding further downstream in the catchment, which could also be due to CD-SAR being unsuccessful in capturing the maximum flood extent of floods with longer durations. Various techniques can be employed to enhance the ability of CD-SAR to capture flood peaks. One such approach involves combining both ascending and descending scenes derived from the Sentinel-1 data (Tripathy and Malladi 2022). As the GFMs are modelled to show the maximum extent of the flood event for a given return period, CD-SAR will most likely underpredict the flood extent when compared to these GFMs in most cases.
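The idea of combining ascending and descending scenes can be illustrated with a minimal Python sketch. It assumes that each pass has already been classified into a binary flood mask on a common grid, and that the passes are merged by a simple cell-wise union so that a cell counts as flooded if either pass detects water. This is only one possible way of combining the scenes and is not necessarily the procedure of the cited work; the masks and the function name are hypothetical.

```python
def combine_passes(ascending, descending):
    """Cell-wise union of two equally sized binary flood masks (1 = flooded)."""
    return [
        [1 if (a or d) else 0 for a, d in zip(a_row, d_row)]
        for a_row, d_row in zip(ascending, descending)
    ]


# Hypothetical masks from two acquisitions taken at different times.
ascending_mask = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
descending_mask = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
]

combined = combine_passes(ascending_mask, descending_mask)
flooded_cells = sum(sum(row) for row in combined)
print(combined, flooded_cells)
```

Because the union can only add flooded cells, the combined extent is at least as large as that of either pass, which moves the map closer to a maximum flood extent at the cost of also accumulating any false detections from both passes.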

In the change detection analysis of this study, a distance mask was kept constant for each region. However, this particular setup choice may benefit certain models, particularly when a model covers a larger area than was encompassed by the distance mask. It might also be a disadvantage for models containing a higher density of modelled river network compared to CD-SAR. This can be seen in Fig. 3, where the comparison with Fathom might produce spurious overestimation. As mentioned earlier, there are also numerous methods of choosing threshold levels in change detection analysis. A static threshold, as used in this study, will increase the uncertainty, as the reflectance will change depending on the region and the local weather conditions. Lastly, the evaluation metrics used in this study may also be biased toward some of the regions. The use of a bias score, quantifying the over- or under-prediction of the models, could help clarify the influence of these biases.
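To make the suggested bias score concrete, the sketch below uses one common definition from forecast verification, the frequency bias: the number of cells flooded in the model divided by the number of cells flooded in the reference, that is (hits + false alarms) / (hits + misses). Whether this exact formulation would be the most suitable for the present evaluation is an assumption, and the grids and function name are hypothetical.

```python
def bias_score(model, reference):
    """Frequency bias: modelled flooded cells / reference flooded cells."""
    hits = false_alarms = misses = 0
    for m_row, r_row in zip(model, reference):
        for m, r in zip(m_row, r_row):
            if m and r:
                hits += 1
            elif m and not r:
                false_alarms += 1
            elif r and not m:
                misses += 1
    observed = hits + misses
    return (hits + false_alarms) / observed if observed else float("nan")


# Hypothetical 4 x 4 binary grids (1 = flooded, 0 = dry).
reference = [[0, 1, 0, 0]] * 4
model_over = [[1, 1, 1, 0]] * 4   # floods three times the reference area
model_under = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

print(bias_score(model_over, reference))   # above 1: over-prediction
print(bias_score(model_under, reference))  # below 1: under-prediction
```

A value of 1 indicates that the model and the reference flood the same number of cells, although not necessarily the same cells, so such a score would complement rather than replace CSI and MAE.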

5 Conclusion

The validation analysis of the four GFMs of this study has unveiled a substantial variability in model performance, both among different models and across the eight flood events considered. Overall, the flood maps generated by JRC, CaMa, and Fathom demonstrated comparable levels of moderate agreement with the observational data, whereas GAR consistently displayed lower levels of agreement. This assessment also reaffirms previous findings regarding the general challenges that flood models encounter in flat unconfined floodplains, braided river systems, arid climates and regions characterised by dense vegetation. Some of these characteristics, such as heavy vegetation and aridity, also pose challenges when detecting surface water with remotely sensed data, which further complicates validation efforts. Furthermore, this study underscores the limitations of GFMs in regions where flood mitigation measures are in place. Taken together, this segment of the analysis underscores the critical need for sustained validation efforts of GFMs.

This study has also shed light on how various factors of the validation analysis itself can influence the results, including the choice of performance metrics and the spatial resolution of the reference maps. For instance, the results indicate that while the CSI and MAE scores generally correlate, some distinctions between the two metrics emerge. Specifically, MAE scores exhibit greater sensitivity to the choice of the modelled return period than the CSI scores – which tended to be relatively stable across return periods, except for the higher-resolution flood maps of the Fathom model. Additionally, MAE scores display greater sensitivity to the spatial resolution of the GFM than the CSI scores – models with higher spatial resolutions tend to yield lower (indicative of better) MAE scores. The CSI scores, on the other hand, tended to benefit when the spatial resolution of the model matched that of the reference map.

A key outcome of this study is also the observation that the overall agreement between CD-SAR and the alternative reference maps does not surpass the agreement between the GFMs and these alternative reference maps. This notable discrepancy among individual remotely sensed flood maps can be attributed to individual challenges associated with distinct satellite sensors, the spatial resolution of the data, the timing of the data acquisition, as well as the inherent assumptions and limitations of the subsequent change detection analysis. Taken together, this highlights the importance of refraining from labelling these remotely sensed flood maps as ‘ground truth’.

So, does CD-SAR demonstrate potential for generating reference maps to validate GFMs? Indeed, this study established CD-SAR as a viable, accessible, and flexible tool for mapping historical flood events, with GFM validation being one of its potential applications. Notably, the findings suggest that CD-SAR is capable of generating flood maps that are on par with alternative reference maps. While it is essential to acknowledge that CD-SAR, like any flood detection methodology, comes with inherent uncertainties and limitations (encompassing both the SAR data and the change detection analysis), this particular methodology holds a number of advantages compared to the lower-resolution flood maps in readily available databases. The flexibility and high resolution of CD-SAR allow it to cover events across a diverse range of geographic conditions and sizes, even permitting near real-time mapping. This characteristic of CD-SAR facilitates early assessments of how GFMs perform under specific circumstances. Furthermore, the results of this study underscore the pivotal influence of spatial resolution on validation outcomes, favouring the use of the relatively high-resolution SAR data of the CD-SAR methodology – particularly as GFMs continue to advance in detail.

Looking ahead, future studies could expand upon the approach employed in this study by systematically conducting validation efforts across an even broader range of scenarios and events. Continued comparative assessments will also be essential as new GFMs emerge, or as existing models are updated with increased coverage and resolution. This could be done by implementing and improving on the CD-SAR approach or by using similar, potentially valuable tools, such as the GloFAS Global Flood Monitoring tool (Salamon et al. 2021; Roth et al. 2023). The results of this study highlight the need for heightened validation efforts, not only for GFMs but also for the observational flood maps themselves.