“Grey swan” storm surges pose a greater coastal flood hazard than climate change

In this paper, we show that over the next few decades, the natural variability of mid-latitude storm systems is likely to be a more important driver of coastal extreme sea levels than either mean sea level rise or climatically induced changes to storminess. Due to their episodic nature, the variability of local sea level response, and our short observational record, understanding the natural variability of storm surges is at least as important as understanding projected long-term mean sea level changes due to global warming. Using the December 2013 North Atlantic Storm Xaver as a baseline, we used a meteorological forecast modification tool to create “grey swan” events, whilst maintaining key physical properties of the storm system. Here we define “grey swan” to mean an event which is expected on the grounds of natural variability but is not within the observational record. For each of these synthesised storm events, we simulated storm tides and waves in the North Sea using hydrodynamic models that are routinely used in operational forecasting systems. The grey swan storms produced storm surges that were consistently higher than those experienced during the December 2013 event at all analysed tide gauge locations along the UK east coast. The additional storm surge elevations obtained in our simulations are comparable to high-end projected mean sea level rises for the year 2100 for the European coastline. Our results indicate strongly that mid-latitude storms, capable of generating more extreme storm surges and waves than ever observed, are likely due to natural variability. We confirmed previous observations that more extreme storm surges in semi-enclosed basins can be caused by slowing down the speed of movement of the storm, and we provide a novel explanation in terms of slower storm propagation allowing the dynamical response to approach equilibrium. We did not find any significant changes to maximum wave heights at the coast, with changes largely confined to deeper water. Many other regions of the world experience storm surges driven by mid-latitude weather systems. Our approach could therefore be adopted more widely to identify physically plausible, low probability, potentially catastrophic coastal flood events and to assist with major incident planning.


Introduction
Coastal flooding poses a significant risk to life, property, and infrastructure, with wide-ranging social, economic, and environmental impacts. In coastal cities worldwide, flood exposure is increasing due to the changing climate, subsidence and population growth, and infrastructure development in lowlying coastal areas. Hallegatte et al. (2013) suggest that, even allowing for investment in adaptation measures (e.g. increasing flood defences), global flood insurance losses in 136 of the world's largest coastal cities will rise from US$ 6 billion per year in 2005 to US$ 60-63 billion per year in 2050. Whilst mean sea level is known to be rising (IPCC 2013) and at an accelerated rate during the present century (e.g. Kopp et al. 2016;Jevrejeva et al. 2014;Dangendorf et al. 2019), coastal flood events are caused by extreme sea levels (ESL) associated with some combination of high astronomical tides, storm surges and extreme waves. Global studies of trends in ESLs conclude that they result from changes to mean sea level rather than changes in storminess (e.g. Marcos et al. 2015;Mawdsley and Haigh 2016;Menendez and Woodworth 2010).
Storm surges are the increase in sea level caused by low atmospheric pressure, and strong winds combining with Earth's rotation to drive water against a coastline. They can episodically raise sea level by up to 4m when caused by extra-tropical weather systems (e.g. Wadey et al. 2015) and over 9m when caused by tropical systems (hurricanes, tropical cyclones). Storm surges are amongst the most costly and deadly natural hazards. In Bangladesh in 1970, a storm surge caused by tropical cyclone Bhola killed approximately 300,000 people (Frank and Husain 1971). Storm surges resulting from Hurricane Katrina (Jonkman et al. 2009) and Hurricane Sandy (Blake et al. 2013) are two of the most costly natural disasters in US history. In Europe, the North Sea storm surge of 1953 killed over 2000 people (McRobie et al. 2005).
Coastal defences are well adapted to extreme tides and-in the short term-to relatively slow changes in mean sea level so it is storm surges that are responsible for nearly all incidents of coastal flooding either directly, or indirectly by providing a water level that allows strong wave fields to overtop or breach defences. Annual average economic damages from coastal flooding in the UK are estimated to be £540 million today, and are expected to more than double to £1.2-1.9 billion by the 2080s due to sea-level rise (Sayers et al. 2015). For the wider European coastline, expected annual damages due to coastal flooding are projected to increase by two to three orders of magnitude (from €1.25 billion today) by 2100 (Vousdoukas et al. 2018). As a semi-enclosed marginal sea, exposed to the North Atlantic storm track, the North Sea is a focus for storm surges and large waves, and has had a long history of major coastal flood events (Haigh et al. 2015. Considering the key components of extreme sea levels in the future, there is scientific consensus around mean sea level changes (IPCC 2013) and tides are deterministic-although subject to possible small future changes (e.g. Haigh et al. 2020). Since there is no scientific consensus about climatically induced changes to future storminess, an improved understanding of the natural variability of storm surges and waves caused by mid-latitude depressions is crucial for coastal management.
All projections of future storminess are limited by the lack of consensus between different climate models and the also by the capability of even regional climate models (with higher resolution) to accurately simulate extreme winds (IPCC 2012;Wolf et al. 2020). Neither is there any consistent observational evidence for long-term trends in either storminess across the UK or resultant storm surges (Palmer et al. 2018). A systematic review of storminess over the North Atlantic and northwest Europe (Feser et al. 2015) concluded that trends in storm activity depends strongly on the period analysed; studies based on measurements or reanalyses generally do not show any changes in storminess. Allen et al. (2008) showed that changes in UK storm frequency over the second half of the twentieth century were dominated by the natural variability of weather systems. Multi-decadal records of sea level from European tide gauges provide few examples of the most extreme storm surges that pose a hazard. For the North Sea, two noteworthy storm surges had more extreme consequences than all others (Wadey et al. 2015); these occurred on 31st January to 1st February 1953 and 5th to 6th December 2013. The more recent event resulted in the highest sea levels ever recorded for certain UK locations, which begs the obvious question-could small but plausible changes to the atmospheric forcing have produced ESLs larger than those observed? Furthermore, it is important to know whether unobserved natural variability invalidates the statistics used to guide coastal defences (Batstone et al. 2013).
To explore this, we undertook a detailed assessment of past storm characteristics over the North Sea. Based on this we then created six synthetic but dynamically plausible midlatitude weather systems where we artificially deepened the central pressure, altered the storm speed, and adjusted the storm track. We then used these to force storm surge and wave models of the North Sea. We refer to these synthesised events as grey swans by analogy with a similar approach taken for tropical cyclones (Lin and Emanuel 2016). Whereas "black swan" is a metaphor for an event that is unexpected, we use grey swan to imply a high-impact event that we might expect on the grounds of natural variability but outside the observational record. For tropical cyclones, it is possible to generate many thousands of synthetic events in a computational efficient way by embedding relatively simple cyclone models within large-scale climate or reanalysis models (Lin and Emanuel 2016;Bloemendaal et al. 2020). An alternative approach is a multi-model downscaling such as that used by Ninomiya et al. (2017) who embedded the Weather Research and Forecasting (WRF) model in a general circulation model and then further downscaled using empirical models to create a large ensemble of an exemplar tropical cyclone impact at Nagoya Port, Japan. For extra-tropical low pressure systems, it is more difficult to generate a large sample for probabilistic impact studies due to the size and spatial complexity of extra-tropical storms making them less amenable to simple parametric models. In fact, even tropical cyclones have asymmetry not captured by parametric models (e.g. Olfateh et al. 2017) and for tropical cyclones moving into mid-latitudes extratropical transition means that statistical relations for parametric model parameters may no longer be appropriate (Vickery et al. 2000).
In operational forecasting, the spatial complexity of midlatitude weather systems means that most storm surge and wave models are forced by full numerical weather prediction (NWP) tools. Here we used modified fields from the UK Met Office operational forecasting system to provide the atmospheric forcing for the storm surge and wave models. Since we wish to synthesise storm variability beyond the range of a single operational ensemble, we perturbed the 5-6 December 2013 storm (so-called storm Xaver) using software (Carroll and Hewson 2005) that allows user-driven modifications to wind and pressure fields whilst preserving the dynamical balance of the weather system. This allowed us to change the speed of movement of the storm, its central pressure (and hence associated wind fields), and its direction of travel whilst maintaining the key physical properties of the storm system. All adjustments made were constrained by a detailed analysis of North Atlantic weather systems since 1950 causing large storm surges. A simpler approach to perturbing North Atlantic storm tracks for wave modelling was previously used by Wolf and Woolf (2006) but did not attempt to preserve the dynamical consistency of the perturbed wind field.

Methods
The analysis is undertaken in three main stages, described in turn below.

Meteorological analysis
In the first stage of analysis, we extracted the tracks and meteorological characteristics of past storms that have generated large storm surges along the UK east coast of the North Sea. To do this we used complete high-frequency (15 minute) sea level records from nine sites from the UK National Tide Gauge Network. We downloaded the data from the British Oceanographic Data Centre (BODC) and excluded all values flagged as suspect. We undertook a separate tidal analysis (with 67 tidal constituents) for each calendar year using the Matlab T-Tide harmonic analysis software (Pawlowicz et al. 2002) to derive the astronomical tidal component. We then extracted all twice-daily, measured and predicted tidal highwater levels at each site, and from this derived time-series of skew surge. Skew surge is the difference between the maximum observed sea level and the maximum predicted tidal level, regardless of their timing, during any tidal cycle (de Vries et al. 1995). The advantage of using skew surge is that it is a simple and unambiguous measure of the effect of a storm surge integrated over a tidal cycle.
At each of the nine sites, which varied in record length (the total span of the records was 1952 to 2016), we extracted the 25 highest skew surge values, giving a total of 225 values and corresponding dates. We then identified distinct storm events that produced the 225 highest skew surge values, across the nine sites, following the "event-based analysis" approach of Haigh et al. (2016). This resulted in a total of 113 distinct storm events. We then digitised the storm track and stored the central mean sea level pressure of each of the 113 storm events ( Fig. 1a and b). To do this we used the tracking algorithm developed by Haigh et al. (2015) which uses gridded mean sea level pressure and near-surface wind fields from the 20th Century global meteorological Reanalysis, Version 2 (Compo et al. 2011).
From the digitised track we calculated the speed of each storm, every 6 h (Fig. 1c). We calculated the mean storm track over the UK, and the variance around this, by computing the average and standard deviation of all storm latitudes in 5degree longitudinal bins. North Atlantic storms typically start south of Nova-Scotia, Canada, and move in a north-easterly direction towards Europe. It is typically northerly winds behind the cold front that generate large storm surge events in the North Sea. We created a 5-by-5 degree grid over the UK, counted the number of storms that cross the area and then calculated the minimum, mean, maximum and standard derivation of the central mean sea level pressure and storm speed, in each grid cell.

Scenario generation
In the second stage of analysis we created six synthetic but dynamically realistic mid-latitude weather systems. We found that the largest skew surges recorded in the North Sea, over the period of available tide gauge measurements, resulted from the 5-6 December 2013 storm event, storm Xaver. This Atlantic storm tracked eastwards around the north of Scotland, bringing very strong winds across northern parts of the UK. Gusts of 60-80mph were recorded across the region and exposed mountain sites registered gusts over 100mph. The low pressure system explosively deepened, dropping 32mb in 18 h from 999mb at 18Z 4th to 967mb at 12Z 5th (see Fig. 2). The associated strong north-westerly winds led to a storm surge on both the west and east coasts of the UK. The event was coincident with spring tides and therefore it caused significant flooding along the UK east coast (Spencer et al. 2015;Wadey et al. 2015) and the German coastline (Dangendorf et al. 2016).
We used this as our baseline event and applied perturbations to the weather system that were constrained by the analysis in Section 2.1 (i.e. within the range of variation of previously observed storms). We synthesised six new sets of atmospheric forcing, with realistic spatially and temporally varying mean sea level pressure and wind fields, using the NWP grid editing tool (Carroll and Hewson 2005;Carroll 1997). Prior to the routine use of ensemble prediction systems, this tool was widely used by forecasters at the UK Met Office to intensify, weaken, or reposition depressions as well as adjusting features (such as fronts) in response to new observational data. Userdefined adjustments are made to the quasi-geostrophic potential vorticity (QGPV) field. Potential vorticity is a conserved quantity in fluid dynamics and is also invertible allowing a recalculation of the geopotential height in the atmosphere. QGPV inversion is achieved using a successive overrelaxation method, full details which are given in Carroll (1997). The spatial fields of geopotential height, temperature and geostrophic wind are retrieved subject to hydrostatic and geostrophic balance assumptions. Then the winds are recalculated Fig. 1 (a) Tracks (blue dot shows the start of the storm, and red dot the location of the storm at maximum skew surge; (b) mean sea level pressure and (c) speeds of the 113 storm events by reintroducing the ageostrophic component and adding it to the recalculated geostrophic part of the wind, using a simple log-law boundary layer model to recalculate low-level winds. Since potential vorticity is conserved following geostrophic flow we believe that our approach has more credibility than altering the properties of weather systems arbitrarily (we acknowledge the adjustments we have used are still ad hoc to a degree, albeit guided by climatology). The six grey swan scenarios are summarised in Table 1; the time series of sea level pressures and 10m wind fields from the adjusted weather systems were used as boundary conditions for storm surge and wave models.

Storm surge and wave modelling
In the third stage of analysis, we used the UK operational forecasting model (the Continental Shelf 3 or CS3 model) as our tide and storm surge model. This model is an evolution of finite difference numerical models for tides and storm surges developed at the National Oceanography Centre in the UK and run operationally by the Met Office. It saw extensive use for forecasting from 1991 to 2006 and is one of the most validated operational storm surge forecasting models in the world (see Annex A for validation details). CS3 is based on a finitedifference discretisation of the fully non-linear, depth-averaged Navier-Stokes equations following original work of Proctor and Flather (1983) and Flather et al. (1991). The model covers the entire northwest European continental shelf (Fig. 3) on a 1/9°l atitude by 1/6°longitude grid, giving a resolution of approximately 12km in the North Sea. Tidal forcing was applied at the domain lateral boundaries using the 15 largest constituents derived from an harmonic analysis of a larger area ocean model. Wind stress was calculated using a quadratic stress law where the drag coefficient is derived from observations using the Smith and Banke (1975) parameterisation. At the time of this study, the model used operationally 1 by the UK Met Office for storm surge forecasting was actually a version with a slightly enlarged domain compared to that here, but with similar validation skill for the North Sea. The widened CS3X domain (Flowerdew et al. 2010) was introduced to the UK system specifically to capture storm surges originating in the Bay of Biscay (Williams et al. 2005). We used the smaller operational predecessor CS3 domain here for consistency with the atmospheric fields generated by the NWP grid editing software.
The UK Met Office has been producing ensemble operational storm surge forecasts since 2008 (Flowerdew et al. 2010) based on an atmospheric model ensemble (Bowler et al. 2008;Tennant et al. 2011). The dynamical core of the underlying atmospheric model (the Unified Model, or UM) solves the compressible, nonhydrostatic equations of motion globally, using semi-Lagrangian advection and semi-implicit time stepping, on a latitudelongitude grid system with a rotated pole. Recent updates to the UM are described in detail by Wood et al. (2014). Regional 1 At the time of writing, the UK operational system is under transition to the NEMO model (Madec 2008) as described by Furner et al. (2016). atmospheric forecasts are nested within a global atmospheric ensemble (MOGREPS-G) which is configured using an Ensemble Transform Kalman Filter (ETKF) to generate initial condition perturbations centred around high-resolution analysis (Bishop et al. 2001). During 2013, the operational system ran four times per day, using atmospheric forcing from the MOGREPS-G forecast system at 33km horizontal resolution . In this work, the hindcast MOGREPS-G atmospheric model run was perturbed to create the field adjustments used as forcing for the six new model runs described in Table 1. The same hindcast MOGREPS-G atmospheric run was used to spin-up both the storm surge and wave simulations.
For waves, we used the WAVEWATCH III (WW3) wave model (Tolman 2009), implemented for the northwest European Continental Shelf domain with approximately 12km resolution (using the same domain and resolution of the storm surge model). As a reference (and also to provide initial conditions), we ran the wave model using hourly archived winds data from the MOGREPS-G hindcast atmospheric model (at approximately 11km resolution) for the period November-December 2013. Wave boundary conditions for both the baseline and perturbed runs (Table 1)  Very good correlation was found between simulations and observations with a correlation coefficient consistently above 0.9 (average root mean square error is 0.3312 m and average bias is −0.1624 m).

Storm surges
Firstly, we compare the storm surges generated by the six grey swan weather systems to those of the control run. All model runs as described in Table 1 were performed with both tidal and atmospheric forcing and then again with tidal forcing only. The spatial patterns in Fig. 4 are therefore non-tidal residuals-derived by subtracting the tide only run from the fully forced run (we use non-tide residual in this one plot rather than skew surge since it is more spatially coherent, but we note that a component of this signal may be due to tidal phase shift). The maximum non-tidal residual at each cell for the full run is then extracted (note that the maximum values displayed in Fig. 4 do not necessarily occur at the same time step in the model runs). Figure 4a shows the maximum non-tidal residual attained at any time for the control run and then the same quantity extracted from the deterministic operational model run of 12Z UTC on 4 December 2013 (Fig. 4b). The spatial similarity between the two figures confirms that our control run reproduces the same pattern of storm surge produced by the event. The differences in magnitude (a maximum of 15cm along the coastline of the Netherlands) are due to the use of different choices for wind stress parameterisation in the two models (see Annex A for a detailed explanation). Whilst recognising this limitation, our primary interest is the difference between the control run and the grey swan runs; all model outputs are adjusted to the observed levels in our subsequent analysis to account for the contribution of the real tide to the event. This approach gives the most accurate estimation of the hypothetical impact of the grey swan simulations (i.e. we are comparing one model run with another rather than commenting on the operational accuracy; even the operational model run was not 100% accurate on the day). Figure 4 shows the difference between the maximum nontidal residual from each of the grey swan model runs and that obtained in the control run. Since Storm Xaver caused the highest sea levels ever recorded for some locations around the southern North Sea, it is clear from Fig. 4 that these synthetic but meteorologically plausible storms can produce significantly larger storm surges (over 1m higher across a large area-Runs 2, 3, and 6) than have ever been observed in the North Sea. It is clear that the most extreme changes are in Runs 2 (Fig. 4d) and Run 3 (Fig. 4e). For Run 2, the central pressure of the depression was lowered to generate 50-year return period winds (but still within dynamical constraints) and this run produced larger storm surges along most of the east coast of the UK, and in the German Bight. In Run 3, the forward speed of the storm was decreased by 1 standard deviation (constrained by the meteorological reanalysis) and this perturbation produced significantly larger storm surges in the most southern part of the North Sea, including the Thames Estuary in the UK. Table 2 compares the total water levels that would have been obtained using our grey swan model runs with the observed water levels at four tide gauges on the east coast of the UK (see Fig. 3 for locations). These locations were chosen North Sea storm surges, and those observations are necessary to contextualise extreme events. To ensure a valid comparison with the observed total water levels, we use the skew surge from model runs and add this to the harmonically predicted tide at each port. This is how total water levels are obtained for operational flood warning in the UK, and provides the only accurate prediction of total water level. To understand the significance and potential impact of the grey swan storm surges, we adjusted the total water level values for each tide gauge location in Table 2 by the difference between the local observations and those that the control run predicted (i.e. we adjusted so that the control run would have predicted the skew surge perfectly at every location and then add the harmonically predicted tide); this allows a clear visualisation of the higher water levels implied in the context of historical total water levels, and is also essential to understand the return periods of the synthetic storm surges which are based on the extreme value analysis of total water levels. Table 2 shows that, at all these locations, the grey swan simulations produced a higher total water level than either the 1953 or 2013 storm surges. In the most extreme case, at Sheerness in the Thames Estuary, the maximum level obtained (from Run 3) was 1.55m higher than the 2013 event and 0.95m higher than the disastrous 1953 storm surge. The same simulation, where storm speed was slowed, also produced levels that were significantly higher (0.44m higher than 1953) at Lowestoft. Whilst our simulations did produce higher levels at the two more northern tide gauges (Whitby and Immingham), the differences compared to the 2013 storm were not so great. This is because the unique properties of Storm Xaver resulted in it having greater impact in the northern part of the North Sea than the 1953 storm. The differences between these two events, and a spatial analysis of their impacts, are described thoroughly by Wadey et al. (2015).
The return periods of the water levels observed on 5 December 2013, derived from a joint probability method, are also shown in Table 2. Joint probability methods produce probability distributions of two or more conditions occurring simultaneously. For coastal flooding, the distribution of extreme sea levels for a given location is obtained by combining statistically the separate distributions of tides and storm surges. Extreme sea levels derived in this way are assigned a return period (RP) which describes the probability of a particular level being exceeded in a given number of years. The values shown in Table 2 are derived from the most up to date UK guidance (Environment Agency 2018) and based on the method of Batstone et al. (2013). We can use our grey swan simulations to test the robustness of these statistical estimates (whose very purpose is to try and quantify the severity of events that have never occurred). The reliability of statistical estimates of extreme sea levels and their confidence limits, which inform coastal defence investment decisions, is crucial for coastal managers and policy makers.
The penultimate column of Table 2 shows the RP for the observed water levels on 5 December 2013, which reemphasises the fact that Storm Xaver had its greatest impact at the more northerly locations. The final column is the water level corresponding to the 95% confidence limit (CL) of that same return period. When we compare this with the highest total water levels obtained from our grey swan simulation we find that at Whitby and Immingham the 95% CL return level is greater than any of our simulations; this implies that the additional variability of our grey swans is already anticipated by the statistical model. However, for Sheerness and Lowestoft (the more southerly locations), our simulations produced water levels that are larger than the 95% CL levels, suggesting that they are not anticipated by the statistical model. At Sheerness, our highest value was 1.35m greater than the Fig. 4 (a) Maximum non-tidal residual from control run; (b) the operational CS3X equivalent field; (c-h) the differences between the maximum nontidal residual from the respective run (as indicated in the thumbnail) and that obtained in the control run 95% CL. This level of 5.65m exceeds the estimated 10,000 year return period for Sheerness which implies that the statistical model for that location (and arguably other southern North Sea locations) has insufficient data to produce reliable estimates of extreme events, or a good prescription of the uncertainty in those estimates.
Our results show that a slower speed of storm movement has the strongest effect in the southern, enclosed part of the North Sea. This phenomenon has been reported previously, using storm tracks whose speeds were arbitrarily altered. Here we provide a physical explanation for the first time. In a semienclosed basin, a slower speed of storm movement allows a longer time to approach equilibrium between the sea surface slope and the wind stress. Byrne (2019) showed that during storm surges in the North Sea there is a flux of water across open model boundaries as the North Sea dynamically adjusts. Although a body of water could reach a sea slope equilibrium without any external flux, the fact that there is such a flux (in this case from the North Atlantic) gives useful insight in the transition towards equilibrium during a storm surge. We introduce the concept of residual volume, V r , which is defined as: where N is the number of model cells for the North Sea, A is the area of each grid cell, η total is the height of the free surface of each cell in a full-forced model run and η tide is the corresponding elevation during a tide only run. In a set of idealised wind forcing experiments using the CS3X storm surge model, a constant wind stress of 1Nm −2 (corresponding to a wind speed of 18.7 ms −1 ) was applied in 8 different compass directions to investigate the attainment of equilibrium in the absence of tide (see Fig. 5). For northerly winds (solid black line) and north-westerly winds (dotted blue line) there is an influx of V r over 20-30 h, then reaching an equilibrium (to first order) by about 60 h. For real North Atlantic weather systems (mid-latitude depressions), the wind direction is not constant; nor does the wind direction remain in any point of the compass for times similar to the equilibrium timescales shown in Fig. 5.

Waves
The maximum significant wave heights obtained for each of the model runs are shown in Fig. 6, along with the difference in maximum significant wave height between the designated simulation and the control run. The greatest changes in significant wave height (to the control run) were obtained with Runs 2 (Fig. 6c) and 4 (Fig. 6e). In the case of Run 2 this is simply due to the increased wind strengths that resulted from lowering the central pressure in the NWP editing tool. In Run 4, the Table 2 Maximum total water levels (high tide plus skew surge) of the 1953 and 2013 North Sea storm surges at four tide gauges (see Fig. 3 for locations), and the maximum bias-corrected levels from the grey swan model runs. The table also shows the median (50% percentile) return period of the observed maximum water level during Storm Xaver and the level corresponding to the 95% percentile confidence limit of that return period.
Port (see location insert in Fig. 1 (2018) where statistical estimates of extreme water levels were made using a joint probability method based on long tide gauge records speed of movement of the storm was increased by 1 standard deviation and this resulted in a large area of the eastern North Sea, near Germany and Norway, being affected by waves that were not present in the control simulation. It is interesting that increasing the speed of movement can increase wind waves, albeit in specific regions, whereas the storm surge is increased by a slower moving storm. This may be related to the dynamic fetch, in which the effective wind input is increased if the storm moves at about the same speed as the waves (Wolf and Woolf 2006) but this has not been analysed here.
Although some changes to significant wave heights (SWH) were obtained in the simulations, these changes are largely confined to deeper water in the centre of the North Sea. The maximum wave height generated in the control run was 12m, on the eastern side of the North Sea. Increases in offshore wave height of up to 6m (>50% larger than for the control run) are seen in Run 2 (deeper low pressure, and increased winds) and Run 4 (increased storm translation speed). The panels (b-g) in Fig. 6 show that coastal locations were affected by smaller absolute changes to wave heights (which is not surprising since waves are increasingly attenuated as they shoal). Most of the significant wave heights obtained were smaller than extreme values previously obtained from wave observations (since wind directions did not develop extreme waves at the coast). Santos et al. (2017) performed an extreme value analysis to derive estimates of extreme wave heights from wave buoy observations at four locations on the east coast of the UK (Blakeney Overfalls, North Well, Hornsea and Tyne Tees) where we are able to compare our results. These locations are also shown in Fig. 3. Figure 7 shows the growth of significant wave height from our grey swan simulations. Although higher values of SWH were obtained from Fig. 5 Residual volume, V r , for idealised storm surge model simulations where constant wind stress of 1 Nm −2 is applied for 100 h, from eight different directions Fig. 6 (a) Significant wave heights obtained from the control run; (b-g) difference between maximum significant wave height obtained for the designated run and that of the control run the different runs, compared to the control run (black line), these heights were not that extreme in most cases (for Run 2 at Blakeney Overfalls, late on 6/12/2013, the 5-year return period was obtained and Hornsea almost reaches the same RP).
However, Tyne Tees reaches the 100-year RP, with wave height over 8m, which is a very significant enhancement over the control run, for which the wave heights were not exceptional, although reaching 4m. In the North Sea, a typical  Fig. 3 for locations). Lines depict the change in SWH with time (dark black is the model control run and other coloured lines and symbols are the synthetic runs). The fainter black line shows wave observations from three of the locations (excluding Hornsea) during the actual event. The horizontal dashed lines show the height of the wave with the return period of that lines label (on the top right axis). Return levels were derived by Santos et al. (2017) based on their analysis of wave buoy data feature of weather systems giving rise to large storm surges is strong northerly and north-westerly winds (and see isobars in Fig. 2). This wind direction is less favourable for creating extreme onshore waves along the east coast of the UK, which explains the results in Fig. 7. Brown et al. (2010) have likewise shown that for the Irish Sea different wind directions are responsible for the largest surges and waves.

Discussion and conclusions
With these grey swan storms, we were able to produce storm surges in the North Sea that were higher than those experienced during the December 2013 event, at all locations where we analysed tide gauge data, and significantly larger than any member of the operational ensemble forecast at the time. One synthesised storm surge was nearly 1 m higher than levels recorded during the disastrous 1953 storm. At Sheerness, the highest simulated level of 5.65m is identical to the value obtained when one adds the biggest skew surge in the observational record to the highest astronomical tide (designated HATMOS in Table 2). This is coincidental but it is not surprising that the HATMOS level should be reached or exceeded since Williams et al. (2016) showed that there is no significant correlation between the magnitude of high water and the size of the most extreme observed skew surges (since weather systems are unique). For Lowestoft, one simulation (Run 3) produced a perturbed storm surge which exceeded the HATMOS value by 0.38m once added to the highest astronomical tide. This simple measure, combined with the results of our simulations, suggest that it is very likely that a larger storm surge than yet observed is feasible for the southern part of the North Sea. Furthermore, since these simulated storm surges are dynamically credible, it follows that even higher levels could be reached (albeit with some physical limit) if such a real-but as yet unexperienced-weather system were to generate storm surges coincident with the highest astronomical tide. Our analysis (Table 2) shows that for northern parts of the North Sea, our most extreme synthetic storm surges were within the 95% confidence limit of the extreme value estimates (Environment Agency 2018), whereas for the southern part of the domain our simulations exceeded those limits. All data-driven statistical estimates are constrained by the length of the data record and without a far larger sample, we are unable to characterise our grey swan simulations in terms of the recurrence interval of the storm. Nevertheless, our results show that even where statistical estimates are based on high quality data, planners and coastal policy makers must consider the higher range confidence limits for extreme events.
We found that the most extreme storm surges in the southern North Sea were caused by slowing down the speed of movement of the storm. This effect was first reported in the Irish Sea (Maskell et al. 2013) who found that in the relatively shallow eastern Irish Sea, wind-generated surge magnitude is influenced by the propagation speed of the depression, which controls the timing of momentum input with respect to tidal depth variations. This has been confirmed in the North Sea by Wei et al. (2019) who showed that the storm propagation speed is an important factor, but the peak surge did not monotonically increase with decreasing storm propagation speed. Wei et al. (2019) suggested that for the North Sea, the maximum surge is found when the wind set-up caused oscillations that match the resonance frequency of the basin. Our simple, idealised experiments demonstrate that a slower speed of storm movement allows a longer time to approach equilibrium between the sea surface slope and the wind stress. Wind blows across a larger fraction of the basin for longer, and in this sense, the slower storms are, by analogy, like having a larger fetch and longer duration for wind waves. Making some simplifying assumptions, such as a constant wind from a constant direction, Pugh and Woodworth (2014) explain that given sufficient time, the wind stress terms in the equations of motion would reach equilibrium with the sea surface pressure gradient. In reality, this equilibrium is never obtained because winds vary enormously with time and the weather system will pass through the region too quickly. Our idealised simulations, shown in Fig. 5, suggest that real mid-latitude depressions, where wind direction changes typically from southwesterly then veering to northerly winds over a 24-48 h period are unlikely to allow the sea surface slope to reach equilibrium. However, slower moving systems are more likely to keep winds acting on the sea surface to generate stress for longer. Note that in our reanalysis-constrained simulations we only perturbed storm speed by one standard deviation. Furthermore, we only perturbed one controlling variable (speed, pressure, track) at a time. In reality more significant changes are possible, which supports the idea that a far worse storm surge is feasible due to natural variability.
Waves are often a contributory factor in the destruction caused by storms e.g. Wolf and Flather (2005), where the waves in the North Sea were also extreme. For the Xaver storm in 2013, waves were not exceptional. Increases of up to 50% in the maximum off-shore wave heights were seen, in the perturbation analysis described above, mainly in the intensified wind case of Run 2 and the enhanced translation speed of Run 4. However, the full contribution of waves to coastal water levels, through wave setup, runup and overtopping (which is determined by offshore wave height as well as coastal bathymetry and instantaneous water levels) was not calculated here. The wave-driven contribution to the total water level generally varies with shoreline morphology and seastate properties and can be a fraction of the offshore Hs that ranges from 10 to 200% (Dodet et al. 2019). Although some changes to significant wave heights were obtained in the simulations, these changes are largely confined to deeper water in the centre of the North Sea. The changes in storm track and intensity, which enhance the surge, do not necessarily increase the offshore wave heights, the enhancement of which often requires a different wind direction or storm speed. Only Run 2, where the low pressure was deepened, thus increasing the wind speed, produced increased wave heights as well as surges.
The biggest limitation of this study is that we were only able to synthesise a relatively small sample of grey swan storms (limited by the computationally demanding, semimanual NWP grid editing tool). Part of this limitation is that, whilst our six perturbed model runs gave spatially differing results, they were all derived from the same baseline storm. It is probably the case that our sample does not account for a full range of variability in storm structure. To extend our approach demands techniques that provide a much larger statistical sample of weather systems; those techniques could then be applied more widely to all coastal regions affected by mid-latitude storm surges. For instance, if one could demonstrate that simplified axisymmetric representations of mid-latitude weather systems (e.g. Wolf and Woolf 2006) produced spatially and temporally accurate storm surges and waves (i.e. despite the idealisation of the atmospheric system it gives the same sea levels at tide gauges), then it would be possible to synthesise a far larger set of synthetic storms, as has been used to understand extreme sea levels due to tropical storms and hurricanes (e.g. Hallegatte 2007; Lin and Emanuel 2016; Cui and Caracoglia 2019). One approach to tuning the parameters of idealised axisymmetric storms would be to minimise the cost functions of adjoint models for storm surges at tide gauges (e.g. Wilson et al. 2013;Warder et al. 2021). For storm surges and waves, which demand an accurate simulation of extreme wind speeds and extreme low pressures, such a probabilistic approach to synthesising mid-latitude storms is likely to be more insightful than full NWP ensemble simulations from high-resolution climate models (which currently lack the resolution to generate extreme wind speeds and low pressures).
In climate modelling, ensemble simulations are the preferred approach for exploring the internal variability of weather systems. Thompson et al. (2017) used a large ensemble of high-resolution climate simulations to understand the exceptional rainfall throughout the 2013/2014 winter in the UK, arguing that climate models provide a larger sample of events that are meteorologically plausible, potentially providing better estimates of the occurrence of extreme events than statistical analyses of (often short period and/or inhomogeneous) observations. Their UNSEEN (UNprecedented Simulated Extremes using ENsembles) methodology used the Hadley Centre global climate model (HadGEM3-GC2) at 60km atmospheric resolution with a 0.25°ocean. In UNSEEN, 40 ensemble members were able to simulate 22,400 months of winter rainfall giving many more "unprecedented" monthly rainfall totals for south east England in the model simulations. Thompson et al. (2017) found a 7% chance of a rainfall total greater than the current observed record in at least one month of any given winter. At even higher resolution it may be possible to generate the relevant meteorological detail for storm surges and waves, but this is not currently computationally feasible. A large (26 teams, 11 models) multi-model ensemble of European regional climate models down to 12.5km horizontal resolution was carried out by the EURO-CORDEX initiative (Jacob et al. 2014) but their focus was on precipitation and temperature with no reported results on wind speeds. An approach that has been successful in projecting longer term trends in tropical cyclone impacts is the use of very large ensembles over 60 year time slice runs (Mori et al. 2019). Using a 90 member ensemble forced by climatological warming parameters, Mori et al. (2019) created an equivalent of 5000 years of data, suggesting strong changes to storm surges in east Asia, with projections of 0.3-0.45 m increases in 100-year return period storm surges for Tokyo and Osaka Bay in Japan.
Long-term mean sea-level rise remains an important driver of future coastal flood risk, with the UKCP18 Marine Report (Palmer et al. 2018) advising of (central estimate) mean sealevel rises for the east Coast of the UK of between 0.2 and 0.8m by the year 2100. For storm surges and waves, it is well recognised that projections of future storminess are limited by the lack of consistency between climate models and the capability of even regional climate models to accurately simulate extreme winds. The multi-decadal variability of winter storms-and therefore of storm surges-is dominated by natural variability. This variability, and its influence of European extreme winters, will be controlled in part by the response of the North Atlantic storm track to climate change (e.g. Woollings et al. 2012) and the influence of a warming Arctic on the Atlantic polar front (e.g. Hall et al. 2015). Other factors that influence storminess on all timescales are Arctic amplification and the expansion of subtropical cells.
Our new research shows that, for short term (i.e. 10-30 years) planning purposes, the greatest threat from coastal extreme sea levels is the potential underestimation of the hazard due to our incomplete knowledge of mid-latitude storm variability. For regions affected by mid-latitude storm surges, the natural variability of weather over the next century could be as important as mean sea level change in determining extreme sea levels. For comparison with projected mean sea level rises, the additional storm surge heights in our simulations are comparable to the mean sea-level rise by the year 2100 in a high emissions RCP 8.5 scenario (Palmer et al. 2018). We have used an extreme weather event as the baseline for our synthetic storms (one that produced an 800-year return period storm surge at a single location) and it is therefore possible that the storms we have created have a recurrence period comparable with the timescale of long-term sea level change. However, the point we make here is valid since projected sea level rise for the year 2100 will (subject to all its uncertainties) not be reached significantly before then, whereas the extreme storms we are considering could occur during any winter.
Regional wave projections for the NE Atlantic given by Bricheno and Wolf (2018) show that mean wave heights may reduce but extremes may increase over the same period. The UK has developed a series of Major Incident Plans (MIPs) designed to support the preparation for, response to, and recovery from major incidents. Within MIPs, incidents are described at four different levels of emergencies: (1) elevated risk of a major incident; (2) significant incident; (3) serious incident; and (4) catastrophic incident. The south and west coast of the UK have not been subjected to a "Catastrophic" event in recent history, unlikely the east coast which experienced the 1953 event. The analysis we have undertaken here would provide a framework to theorise what a catastrophic coastal flood incident for the UK might look like, and would therefore help improve the planning and preparation for such events. The same approach could be adopted more widely around those parts of the world affected by episodic events caused by mid-latitude weather systems.