An Observation-Based Dataset of Global Sub-Daily Precipitation Indices (GSDR-I)

Pritchard, David; Lewis, Elizabeth; Blenkinsop, Stephen; Patino Velasquez, Luis; Whitford, Anna; Fowler, Hayley J.

doi:10.1038/s41597-023-02238-4

An Observation-Based Dataset of Global Sub-Daily Precipitation Indices (GSDR-I)

Data Descriptor
Open access
Published: 22 June 2023

Volume 10, article number 393, (2023)
Cite this article

Download PDF

You have full access to this open access article

Scientific Data

An Observation-Based Dataset of Global Sub-Daily Precipitation Indices (GSDR-I)

Download PDF

2987 Accesses
3 Altmetric
Explore all metrics

Abstract

Precipitation indices based on daily gauge observations are well established, openly available and widely used to detect and understand climate change. However, in many areas of climate science and risk management, it has become increasingly important to understand precipitation characteristics, variability and extremes at shorter (sub-daily) durations. Yet, no unified dataset of sub-daily indices has previously been available, due in large part to the lesser availability of suitable observations. Following extensive efforts in data collection and quality control, this study presents a new global dataset of sub-daily precipitation indices calculated from a unique database of 18,591 gauge time series. Developed together with prospective users, the indices describe sub-daily precipitation variability and extremes in terms of intensity, duration and frequency properties. The indices are published for each gauge where possible, alongside a gridded data product based on all gauges. The dataset will be useful in many fields concerned with variability and extremes in the climate system, as well as in climate model evaluation and management of floods and other risks.

HYADES - A Global Archive of Annual Maxima Daily Precipitation

Article Open access 15 March 2024

Evaluating extreme precipitation in gridded datasets with a novel station database in Morocco

Article 10 April 2023

Variations in sub-daily precipitation at centennial scale

Article Open access 03 April 2020

Background & Summary

Datasets providing time series of climate indices are essential for understanding climate variability and change¹. Focusing on precipitation and temperature extremes, the WMO/WCRP/JCOMM Expert Team on Climate Change Detection and Indices (ETCCDI) developed a set of indices around two decades ago that is still in wide use today in global climate change detection studies^2,3 and climate model evaluations⁴. A key advance made by the ETCCDI was to standardise both the definitions and the software used in index calculation. These steps enabled direct comparison between analyses and wider sharing of vital observed climate information, as some data holders are able to derive and share indices more readily than raw climate data. A succession of open, station-based and gridded observational datasets has followed from the introduction of the ETCCDI indices, including HadEX⁵, HadEX2⁶, HadEX3⁷ and GHCNDEX⁸.

As the ETCCDI indices are calculated from daily data, they do not provide any information on variability, changes or extremes of precipitation at sub-daily timescales. Indeed, sub-daily precipitation data from most countries around the world are generally not openly available. Yet, intense precipitation at short durations can have very large impacts, including flash flooding⁹, landslides¹⁰ and soil erosion¹¹. Available observations and high-resolution (convection-permitting) climate modelling indicate that these high-impact extremes will intensify in a warmer and moister atmosphere^{12,13,14,15,16}, which may lead to increases in the associated risks to life and higher economic costs. At large scales some regions may experience sub-daily precipitation intensity increases greater than the ~7% K⁻¹ implied by the Clausius-Clapeyron relation^17,18, although the interplay between cloud processes, local dynamics and large-scale circulation exerts a complex modulating influence on the thermodynamic scaling of precipitation extremes with warming^19,20.

It is therefore critical to increase the availability of open sub-daily precipitation information to support both climate research and risk management. As such, this study presents a new global dataset of sub-daily precipitation indices (GSDR-I) calculated from a unique database of gauge observations. This work was conducted in the INTENSE project (‘INTElligent use of climate models for adaptatioN to non-Stationary hydrological Extremes’)²¹. INTENSE represented a major international effort to improve understanding of changing sub-daily precipitation extremes around the world. The project led the global research effort in this area within the Global Energy and Water Exchanges Hydroclimatology Panel Cross-Cutting project on sub-daily precipitation as part of the World Climate Research Programme.

Figure 1 outlines the approach taken to construct GSDR-I. The first step was to collate and quality control as much global sub-daily precipitation data as possible. This culminated in the Global Sub-Daily Rainfall (GSDR) dataset²². Building on previous work, an open-source automated quality control procedure (GSDR-QC) was developed, evaluated and applied to the GSDR dataset²³. In parallel, the INTENSE project facilitated discussions with international experts and data users to help define a set of sub-daily precipitation indices for GSDR-I. Python code was then developed to calculate these indices for one or more gauges. This software was shared with partners who wished to calculate the indices themselves and contribute them to GSDR-I. Finally, the indices calculated from gauge records were gridded, which enabled the incorporation of gauges whose indices cannot be released directly due to licensing restrictions. The gridded indices form a data product comparable to the ETCCDI-related indices datasets (HadEX3, GHCNDEX) but focused on sub-daily timescales.

The resulting indices dataset will be useful in many areas of research and applied risk management. It will provide climate researchers with the most comprehensive observation-based dataset on the climatology and variability of sub-daily precipitation available. Climate modellers will be able to use the dataset to help evaluate and improve models operating at a range of scales, from global to regional and local (including convection-permitting models). In addition, through the new indices dataset, engineers and risk managers will have access to unprecedented open information on observed short-duration precipitation extremes relevant for assessing hazards and designing resilience measures. For example, the indices describing monthly, seasonal and annual maxima will be directly applicable in extreme value analysis and the construction of intensity-duration-frequency curves, which are used to design drainage systems and flood resilience measures. Further updates to the indices dataset over time will also ultimately enhance its role in climate change detection.

Methods

Gauge observations

The INTENSE project undertook a major effort to collate sub-daily precipitation gauge records from around the world. This activity culminated in the construction of the GSDR hourly precipitation dataset²², which provides the majority of the data underpinning the GSDR-I sub-daily indices calculations. National meteorological and hydrological services provided most of the data for the GSDR, although other environmental agencies and researchers also contributed substantially. Gauges from the Integrated Surface Database (ISD)²⁴ were also included if they were not present in data obtained from national hydrometeorological services. Compared with the initial version described by Lewis et al.²², the GSDR now contains data for 18,591 gauges, due to the addition of hourly records for Brazil (765 gauges; https://portal.inmet.gov.br/), Canada (1705 gauges; https://climate.weather.gc.ca/), India (65 gauges; https://mausam.imd.gov.in/), Ireland (36 gauges; https://www.met.ie/climate/available-data/historical-data) and New Zealand (453 gauges; https://cliflo.niwa.co.nz/).

Figure 2a shows the effective record lengths (i.e. equivalent number of years of record if missing data are removed) of gauges used to build the GSDR-I dataset. Only hourly gauge records with effective lengths exceeding one year were used in the indices calculations. Figure 2 indicates that data coverage is highly variable, with long records available mainly in parts of Europe, Asia, Australasia and North America. The number of stations available in these regions varies over time (Fig. 2b), which partly reflects changing measurement networks but also difficulties in obtaining up-to-date records globally. Peak gauge availability globally in GSDR-I is found in 2008, when 7,150 relatively complete gauges (<20% missing) contribute to the dataset simultaneously. Data is very limited in much of South America and Africa, as well as parts of Asia and the Middle East. Nevertheless, with 12,104 relatively complete (<20% missing) gauge records of more than one year in length, the GSDR is the most comprehensive dataset of global hourly precipitation observations and hence the basis for GSDR-I.

It should also be noted here that indices calculations for additional gauges (not held in GSDR) were performed and shared by meteorological services from all member countries of the PannEX initiative²⁵. PannEX is a GEWEX Regional Hydroclimatological Project (RHP) focusing on the hydroclimatology of the Pannonian basin through collaboration by Austria, Croatia, Czech Republic, Hungary, Romania, Serbia, Slovakia and Slovenia. PannEX contributed 264 of the total 18,591 gauges in the GSDR-I indices dataset.

Quality control

Errors in precipitation observations arise from various sources, including equipment malfunctions (e.g. gauge blockages) and data/metadata transmission or processing errors (e.g. missing values infilled with zeros, implausible values, etc.). For sub-daily precipitation, some specific additional issues tend to recur in data series, such as entry of daily accumulations at single hours or by uniform disaggregation. However, only some of the data collected for GSDR had associated time series of quality flags. Therefore, given the range of potential quality issues and the volume of data in GSDR, the INTENSE project developed a largely automated quality control procedure to identify and remove suspicious data. This procedure (GSDR-QC²³) builds on previous work on automated checking of daily²⁴ and sub-daily^26,27 precipitation data to underpin an open-source Python package (intense) for community use.

The GSDR-QC procedure comprises several steps, with full details and evaluation given by Lewis et al.²³. First, station metadata checks identify and exclude implausible locations and duplicated records. Second, a set of 25 time series checks are used to flag suspicious data values or periods. These checks cover a wide range of possible errors, including implausible extreme values, entry of daily accumulations into hourly series, suspicious streaks of repeated values, dubiously long dry periods relative to local climate, record inhomogeneity, excessive intermittency, and inconsistency in relation to neighbouring reference gauges, amongst others. These checks draw on consistency with well verified auxiliary data in the form of HadEX2 and GHCNDEX gridded ETCCDI daily indices, as well as with the high quality daily and monthly precipitation observations curated by the Global Precipitation Climatology Centre (GPCC)²⁸.

Values and periods flagged as very suspicious in the second step are removed from the time series based on a set of 11 rules explained in Lewis et al.²³. The rules focus on excluding the following errors from the time series: suspected daily and monthly accumulations; long streaks of repeated values; implausible extremes relative to known records, daily extreme totals and neighbouring gauges (after applying a safety margin); and long dry periods not corroborated by ETCCDI datasets or neighbouring gauges. These rules combine multiple time series checks to enhance their robustness. The performance of the GSDR-QC procedure “rule base” is discussed in the Technical Validation and Usage Notes sections below. A final manual review then identifies and removes gauges whose overall records are of particularly poor quality, based on examination of the GSDR-QC summary information. After this review for both GSDR gauges and those gauges contributed by PannEX, 18,591 gauges remained available for use in construction of the GSDR-I indices dataset.

Definitions of indices

Two objectives guided the development of the indices and statistics included in GSDR-I. The first objective was compatibility with the ETCCDI indices as far as possible. This principle will enable users to compare GSDR-I indices with longer duration (daily and longer) precipitation indices, while also making the new sub-daily indices readily understandable. The second objective was to ensure user/stakeholder participation in the design of new indices. To this end, the INTENSE project held a workshop in September 2016 involving climate scientists, modellers and flood risk specialists to identify and define appropriate, useful indices and statistics. This process resulted in the GSDR-I indices summarised in Table 1.

Table 1 Overview of time series indices in GSDR-I.

Full size table

Table 1 shows that the GSDR-I time series indices build on common and typically simple concepts, such as maxima and counts over thresholds, which generally have counterparts in the ETCCDI indices. For example, GSDR-I contains an Rx1hr index defined as a time series of 1-hour precipitation maxima, whereas the ETCCDI has Rx1day to describe a 1-day maxima series. Similarly, the GSDR-I dataset (like the ETCCDI-based datasets) also incorporates explicit frequency-related indices to help users quantify exceedances of fixed thresholds, which are often relevant indicators for physical risk managers. Other indices provide information on the contribution of heavy precipitation to total amounts on scales ranging from daily to annual, as well as wet spell durations and mean/total precipitation.

All of the time series indices in Table 1 are calculated on a monthly, seasonal and annual basis to provide flexibility to the data users. Seasons are defined as December to February (DJF), March to May (MAM), June to August (JJA), and September to November (SON). A threshold of 0.1 mm is used to distinguish wet hours (or multi-hour intervals) in the relevant calculations. All of the indices are calculated using a 1-hour accumulation interval (i.e. using hourly gauge time series) and, with the exceptions of Rx1hrP, RTot and the duration-related indices (NWH, MeLWS and MxLWS), most indices are also calculated using 3-, 6-, 12- and 24-hour accumulation intervals (i.e. after aggregating the hourly (1-hour) gauge time series) to facilitate as wide a range of user applications as possible. For most indices, aggregations of the 1-hour series use fixed windows beginning at 00:00 local time, e.g. 00:00–03:00, 03:00–06:00, etc. However, maxima (RxHhr) are identified using a sliding window, which provides a more accurate approximation.

Following the ETCCDI indices, we also provide formal definitions of the GSDR-I time series indices in Table 1. Let X_ij be the precipitation total for the i-th successive hour in the j-th period (month, season or year) of an hourly precipitation time series, X. Let n_j denote the number of hours (wet and dry) in period, j. Using Iverson bracket notation, the number of wet hours in period j can be obtained from

$$NW{H}_{j}={\sum }_{i=1}^{{n}_{j}}\left[{X}_{ij}\ge 0.1\right]$$

(1)

where [X_ij ≥ 0.1] is an indicator function mapping X_ij to one if it equals or exceeds 0.1 mm and to zero otherwise, i.e.

$$\left[{X}_{ij}\ge 0.1\right]=\left\{\begin{array}{c}1,{X}_{ij}\ge 0.1\\ 0,{X}_{ij} < 0.1\end{array}\right.$$

(2)

The total precipitation in wet hours for period j can be identified from

$$RTo{t}_{j}={\sum }_{i=1}^{{n}_{j}}{X}_{ij}\left[{X}_{ij}\ge 0.1\right]$$

(3)

where X_ij [X_ij ≥ 0.1] indicates that X_ij is only included in the summation if it is greater than or equal to 0.1 mm.

If LWS_rj is the length (in hours) of continuous wet spell r (in which each hour equals or exceeds 0.1 mm) in period j, then the maximum wet spell length of all w_j wet spells in period j can be written as

$$MxLW{S}_{j}=\left\{\begin{array}{cc}\mathop{\max }\limits_{r=1,\ldots ,{w}_{j}}LW{S}_{rj}, & {w}_{j} > 0\\ 0, & {w}_{j}=0\end{array}\right.$$

(4)

and the mean wet spell length is calculated from

$$MeLW{S}_{j}=\left\{\begin{array}{cc}\frac{{\sum }_{r=1}^{{w}_{j}}LW{S}_{rj}}{{w}_{j}}, & {w}_{j} > 0\\ 0, & {w}_{j}=0\end{array}\right.$$

(5)

Now let H ∈ {1, 3, 6, 12, 24} be an accumulation interval in hours and Y_kj be the precipitation total for the k-th successive (non-overlapping) H-hour interval in the j-th period (month, season or year) – i.e. after aggregation (summation) of the original hourly series (X) using fixed H-hour intervals starting at midnight local time to obtain series Y (which thus has a H-hour interval or time step). There are m_j successive H-hour intervals in Y_j. For a threshold T ∈ {10, 20, 30, 50} (in mm) and using the indicator function [Y_kj ≥ T] to map Y_kj to one if it is greater than or equal to T (otherwise it is mapped to zero), we can identify counts of exceedances from

$$RHhrTm{m}_{j}={\sum }_{k=1}^{{m}_{j}}\left[{Y}_{kj}\ge T\right]$$

(6)

In a similar way, mean precipitation intensity in wet H-hour intervals is defined as

$$SPIIHh{r}_{j}=\begin{array}{cc}\frac{RTo{t}_{j}}{{\sum }_{k=1}^{{m}_{j}}\left[{Y}_{kj}\ge 0.1\right]}, & RTo{t}_{j}\ge 0.1\\ 0, & RTo{t}_{j} < 0.1\end{array}$$

(7)

Then let Y_Q be a threshold defined as the Q-th percentile (with Q∈{95, 99}) of the precipitation distribution calculated from wet H-hour intervals (≥0.1 mm) using all months, seasons or years in the time series that share the same the month, season or year of period j (e.g. all Januarys). The contribution of H-hour intervals exceeding Y_Q to the total precipitation in wet intervals for period j can be calculated as

$$RQpwHhr{P}_{j}=\left\{\begin{array}{cc}\frac{{\sum }_{k=1}^{{m}_{j}}{Y}_{kj}\left[{Y}_{kj}\ge {Y}_{Q}\right]}{RTo{t}_{j}}\times 100 \% , & RTo{t}_{j}\ge 0.1\\ 0, & RTo{t}_{j} < 0.1\end{array}\right.$$

(8)

where Y_kj [Y_kj ≥ Y_Q] indicates that Y_kj is only included in the summation if it is greater than or equal to Y_Q.

Finally, for the maxima-related indices, we take H∈{1, 3, 6, 12, 24} again as an accumulation interval in hours and let Z_hj be the precipitation total for the H-hour interval ending at hour h in the j-th period (month, season or year) of an hourly time series X_j. Z_j is thus an hourly time series based on a sliding window aggregation, in which the value at each hour h is a sum of itself and the H-1 preceding hours. With n_j as the total number of hours (wet and dry) in period j, the H-hour maximum can be determined from the sliding window series by

$$RxHh{r}_{j}=\mathop{\max }\limits_{h=H,H+1,\ldots ,{n}_{j}}{Z}_{hj}$$

(9)

If D_j is the 24-hour precipitation total on the calendar day (00:00 – 00:00 local time) in which the 1-hour maximum precipitation Rx1hr_j occurred, then

$$Rx1hr{P}_{j}=\left\{\begin{array}{cc}\frac{Rx1h{r}_{j}}{{D}_{j}}\times 100 \% , & Rx1h{r}_{j} > 0\\ 0, & Rx1h{r}_{j}=0\end{array}\right.$$

(10)

If more than one hour in period j has the value Rx1hr_j then one of those hours is chosen at random to identify D_j.

Definitions of supplementary statistics

In addition to the time series indices, the GSDR-I dataset also contains a small number of supplementary statistics calculated as single values based on full record periods (e.g. percentiles), rather than as time series. Detailed in Table 2, these statistics describe aspects of short-duration precipitation climatology – specifically extreme precipitation percentiles and the diurnal cycle/profile of precipitation. The supplementary statistics are provided because they contain useful information for interpreting the time series indices (i.e. percentiles) or for particular applications, such as climate model evaluation, while requiring longer, climatology-length record periods for their definition.

Table 2 Overview of supplementary statistics.

Full size table

Table 2 documents the GSDR-I supplementary statistics, each of which comprise climatological values rather than time series. The supplementary statistics include all-hour and wet-hour percentiles (i.e. as opposed to the time series indices of percentage contributions above percentile thresholds – see Table 1). Providing both percentile variants enables flexibility for different applications, while ensuring that influences of frequency variability can be considered in analyses of intensity²⁹. Wet hours (or multi-hour intervals) are again defined using a 0.1 mm threshold. The diurnal cycle is also calculated on a climatological basis, in acknowledgement of the fact that some climate regions experience few wet days, at least seasonally. The full diurnal cycle (i.e. mean hourly precipitation by hour of day) is thus calculated as a climatology using wet days throughout the available record period. Wet days have been defined using a 1 mm threshold (DiCyc1mm). Days are defined as calendar days, i.e. 00:00 to 00:00 local time. The number of wet days in a period is recorded as metadata (see Data Records section). The supplementary statistics are calculated on a monthly, seasonal and annual basis (e.g. pooling all Januarys together).

Calculation details

Several additional details are important in the practical calculation of the GSDR-I indices and statistics for individual gauges. Firstly, for all of the supplementary statistics in Table 2, calculations use full available record periods, rather than a common reference period (e.g. 1981–2010). This approach is taken because of the highly variable length of record periods within GSDR, as well as the substantial differences in start and end years of records. It is very difficult to define a common reference period across the GSDR dataset without eliminating a substantial amount of data or having potentially low record completeness. Sub-daily gauge records are typically much shorter than daily records, such that it is important to retain as much information as possible. Therefore, to help users understand the length and timing of the records used, the metadata includes the record period of each gauge and its completeness.

It was also necessary to handle missing data in the calculations. For the time series indices, percentage completeness is recorded alongside the index value for each month, season or year calculated. The user can then filter time series indices using a data completeness threshold of their choice. For the supplementary statistics, missing data was accounted for by performing two sets of calculations: (1) using all available data, regardless of completeness over the record period (i.e. a completeness threshold of 0%), and (2) performing the calculation only on those months, seasons or years with more than 80% completeness for the period required. The number of data points available for calculation is given as metadata in both cases. This approach gives some flexibility to the dataset user, who can examine the impact of restricting calculations to relatively complete periods on their results.

The GSDR-I index calculations were implemented in Python (see Code Availability section below). Similar to the software for the ETCCDI indices, the code enables consistent calculations by different users. This is a very useful property in cases where national meteorological services are unable to share full data records. Indeed, sharing the Python package enabled the contribution of indices for 264 gauges not held in the underlying GSDR dataset by the PannEX initiative²⁵, as outlined above.

Gridding: Angular distance weighting

Gridding the GSDR-I gauge-based indices represents a way to include the maximum amount of data and information in an open dataset, given that indices for some gauges cannot be shared directly due to data licence restrictions. The motivation for gridded indices within GSDR-I is thus similar to that for the GHCNDEX and the HadEX family of gridded indices. These datasets also guided the order of operations for gridding in GSDR-I. Specifically, indices were first calculated for each gauge record and then the resulting indices were gridded (i.e. index values were gridded, not anomalies).

This processing sequence remains the current standard for indices. As precipitation is highly variable in space and time but gauge networks are often sparse, it can be difficult to grid the gauge observations themselves in a satisfactory way³⁰, whereas indices tend to be better correlated over larger distances⁷. Nevertheless, this sequence can lead to potentially significant differences in gridded index values compared with the reverse approach of gridding gauge data first and then calculating indices^1,31. Therefore care is required in comparison with indices calculated from other data types (e.g. reanalyses, climate models or satellite)³² that tend to represent grid cell averages, as highlighted in the Usage Notes.

For comparability with GHCNDEX and the HadEX datasets, a modified angular distance weighting (ADW)^7,33 method was employed for gridding. ADW has been shown to exhibit both good performance for gridding indices and robustness to parametric and structural uncertainties³⁴, especially for precipitation³². The ADW method estimates index values at the centre of each grid cell using a weighted combination of values from surrounding gauges. Weights for the selected gauges are based first on their distances from the cell centre. The distance-based component of the weights are then scaled by a direction-dependent term, which is proportional to the bearing of each gauge from the cell centre relative to the bearings of all other selected gauges. The method is thus designed so that gauges gain higher weights if they are close to the grid cell centre and if they are relatively isolated rather than close to multiple other gauges. Full equations for the weights are provided in Appendix A1 in Dunn et al.⁷.

The indices were gridded on a regular 1° latitude-longitude grid. Gridding was undertaken using gauges with more than 80% completeness for the relevant period, i.e. the month/season/year for time series indices or the record period for supplementary statistics. A value was estimated at a grid cell if any stations with valid index values were available within the search radius explained below. Cell-wise metadata is provided on the number of gauges used in the gridding calculation (along with the number of gauges in each individual cell) for each field generated, which gives a user the option to set a higher threshold. Further details of the metadata are given below in the Data Records section. The gridding calculations were implemented using Python (see Code Availability section).

Gridding: Search radius

For a given index and grid cell, gauges are selected for inclusion in ADW gridding using a search radius, which is determined from the rate of decay in correlations between pairs of stations as the separation distance between stations increases. The search radius is taken as the decorrelation length scale (DLS) or e-folding distance of this relationship. The DLS is estimated by binning the correlations for station pairs and fitting a relationship of the form $R=A{e}^{-x/b}+c$ to the bin-wise median correlation, where R is correlation, x is separation distance, and A, b and c are parameters to be estimated. Variable bin widths were used to resolve the relationship more accurately at relatively short separation distances. The first bin used was 50 km wide and the final bins were 500 km wide. The DLS is defined as the distance at which correlation drops to A/e. Only gauges with record lengths of 15 years or more were used to help identify the DLS and a maximum of 300 stations were used from each country for computational tractability.

For the time series indices, the DLS is calculated separately for each index and on a monthly, seasonal and annual basis (e.g. one DLS for all Januarys). It is also calculated separately for different latitudinal bands, which helps to account for the strong influence of latitude on precipitation climatology. The specific latitudinal bands overlap a little and differ from those used in HadEX3⁷, in order to account for the lesser availability and specific distribution of sub-daily precipitation gauge records. The bands used are: (1) 65°S to 25°S, (2) 30°S to 30°N and (3) 25°N to 65°N. The resulting DLS values were linearly interpolated to the latitudes of the grid cell centres to reduce boundary effects between latitudinal zones. Below 65°S and above 65°N the DLS of the closest mid-latitude band was used, given the very low gauge density at high latitudes.

Reflecting the potential for low spatial correlation of short duration precipitation and extremes, minimum and maximum DLS values are set at 100 km and 500 km, respectively. If insufficient data are available to calculate the DLS, a default value of 100 km is used for the time series indices. These values were determined after analysing the DLS calculations and testing a small number of alternatives. During gridding, if only one gauge was available within the search radius, an additional criterion was added to check whether this value was within 100 km of the cell centre. The supplementary statistics all use a DLS of 200 km in gridding, as correlations cannot be calculated because the statistics are not time-variant. The larger value of 200 km was selected after initial testing and reflects the higher spatial correlation of climatological quantities (e.g. percentiles) compared with index values for particular months, seasons or years. This testing also supported previous findings that the gauge density is a more significant factor than the DLS in interpolation or gridding accuracy^34,35.

Data Records

The data records presented in the GSDR-I dataset fall into two groups – gauge and gridded records. The first (gauge) group contains metadata, time series indices and supplementary statistics for each individual gauge. These data records are provided for 80% (14,832) of the total 18,591 gauges available to the study (i.e. with the total gauges being those used from GSDR plus the indices and statistics shared by partner organisations). Unfortunately, the indices and statistics cannot be released for all of the individual gauges, due to licensing restrictions. Of the 14,832 gauges for which indices are provided, 27% (4,008 gauges) are in Europe (primarily western Europe and Scandinavia) and 58% (8,614 gauges) are in North or Central America (including 6,813 in the United States and 1,801 in Canada). The remaining gauges with open indices are in Japan (1,757 gauges) and New Zealand (453 gauges).

Table 3 summarises the information available for the first (gauge) group of data records, i.e. the metadata, indices and statistics available for individual gauges. These data records lend themselves to a tabular structure and are published in.csv file format. Metadata for each gauge include information such as a gauge identifier, coordinates, elevation, record period and record completeness. The indices and statistics files contain basic coordinate metadata for convenience, alongside the index/statistic values and data completeness information. As noted above, data completeness for time series indices indicates the completeness of the underlying data series for a given month, season or year. For supplementary statistics, data completeness indicates the threshold used for inclusion of a given month, season or year in calculations, with the number of valid data points also provided. To keep individual file sizes manageable, there are separate files for each aggregation period (monthly, seasonal or annual).

Table 3 Overview of table fields (columns) in each of the data records available for individual gauges in GSDR-I.

Full size table

The second group of records in GSDR-I is the gridded indices and statistics, which bring in information from all available gauges. The gridded data records are published as Network Common Data Form (NetCDF) files following the NetCDF Climate and Forecast (CF) Metadata Convention (CF1-8). These files are the modern standard for many spatiotemporal climate datasets, such as climate model outputs and meteorological reanalyses, as well as other gridded indices datasets like HadEX3 and GHCNDEX. NetCDF files are platform-independent and self-describing, with each file containing headers and metadata as key/value attributes. In addition to the gridded indices and statistics themselves, the files contain spatial fields with additional metadata (see Table 4), including the number of gauges used in gridding calculations at each grid cell. All of the indices listed in Tables 1, 2 are provided in the gridded data record on a monthly, seasonal and annual basis.

Table 4 Overview of key variables in NetCDF files of gridded indices (in addition to standard variables of time, latitude and longitude).

Full size table

Both the gauge and gridded components of the GSDR-I dataset are available in one data repository³⁶: https://doi.org/10.5281/zenodo.7492812.

Technical Validation

There are three aspects to the technical validation of the GSDR-I dataset: (1) verification of the quality control procedure applied to gauge records, (2) validation of the indices calculations, and (3) assessment of the gridding methodology. Lewis et al.²³ considered the first aspect in detail, finding that the GSDR-QC procedure improved correspondence between GSDR gauges and high-quality daily reference gauge series archived by the GPCC. Lewis et al. also evaluated the GSDR-QC procedure against a sample of 300 manually-checked GSDR gauges. The GSDR-QC procedure showed over 99% agreement with manual judgments of hourly data quality. More specifically, the false positive rate (i.e. incorrect removal of reasonable values) was low at 0.4%, while the true positive rate (i.e. correct removal of erroneous values) was 71%. There are inevitable trade-offs in automated QC procedures, but GSDR-QC achieves a good balance between removing erroneous data and retaining reasonable data, with prioritisation of a low false positive rate to avoid unduly removing important extremes³⁷.

Validation of the indices calculations was achieved by selecting a sample of gauges and checking each of the calculations/outputs manually in detail. After calculation of the indices for all gauges, preliminary maps and time series plots were also generated to check for outliers and ensure that the spatial/temporal patterns were coherent with conceptual understanding of spatiotemporal climatic variation. These plots are available in a repository³⁸ at https://doi.org/10.5281/zenodo.7492848 (see also examples in Fig. 3). The majority of the indices build on simple and common concepts (e.g. maxima, counts of threshold exceedances, etc.), such that the software and methodological complexity of indices calculations is generally lower than for both the quality control and gridding procedures.

The gridded indices were evaluated first by qualitative comparison of maps of gauge indices with gridded indices for a selection of different examples. This comparison provides an overall sense check on the gridding, as the fields should show good qualitative consistency with the gauge values. One example is given in Fig. 3, which shows clear consistency in maps of gauge (Fig. 3a) and gridded (Fig. 3b) 1-hour precipitation maxima (Rx1hr) for a selected month (January 2011) in the South-East Asia and Oceania region. Selecting two example grid cells, Fig. 3c shows a good degree of correspondence between the time series for the grid cells and the time series for (the median of) the gauges present within the grid cells. Note that a perfect match is not expected here, in part because the gridding procedure draws additionally on gauges outside of the grid cell to produce its estimates. Coherence was also found in the other examples checked and available in the validation figures repository³⁸.

Following previous evaluations of the angular distance weighting method³⁵, further validation of the gridded indices was undertaken through cross-validation. Each gauge was left out of the indices dataset in turn and the angular distance weighting method was used to estimate the index values at the gauge location. Differences (residuals) between the estimated and observed index values provide a measure of the accuracy and uncertainty of the interpolation method. The cross-validation procedure was repeated for each index and each month, season and year using all available gauge index values with less than 20% missing data in the underlying records. The period 2001 to 2010 was used for cross-validation of times series indices. As noted in prior studies³⁵, angular distance weighting is not an exact interpolator, which should be borne in mind when interpreting the residuals. However, cross-validation is still expected to provide a strong indication of the performance of the method, particularly in terms of bias and relative variations in skill according to season and location.

The distribution of residuals for one example index (monthly Rx1hr) for four major regions (see Fig. 2) is shown in Fig. 4. Distributions are provided for all months pooled together, as well as separately for four example months. In all cases the distributions of residuals peak close to zero, which indicates that the gridding method does not exhibit large overall biases. The interquartile range of the residuals is mostly small at around 5 mm or less. However, Fig. 4 also shows notable spatial and seasonal variation in the degree of spread in the residuals. In most regions, the distribution of residuals is peakiest in February and most widely dispersed in August. The contrast likely reflects seasonal variation in the balance between different types of precipitation (e.g. stratiform and convective – with the former typically exhibiting much larger correlation length scales than the latter).

Spatially, the European region shows the tightest concentration of residuals around zero in Fig. 4, especially in winter. This suggests the gridding method is likely to work particularly well here. Performance for the North America region is likely to be a little better than in Asia/Oceania, whereas the more data-sparse South/Central America region shows the largest dispersion. In all regions it is noticeable that, although most residuals are relatively small, there are some notably larger ones present (i.e. in the tails of the distributions). These large residuals are often associated with areas of sparse gauge density or complex topography, which provide longstanding challenges when attempting to derive representative index values³². That said, other example plots of residuals determined through cross-validation (available in the validation figures repository³⁸) confirm that overall the gridding method works well for different indices.

Usage Notes

Specific points to bear in mind when using the GSDR-I dataset are:

1.
The GSDR-QC quality control procedure focuses on errors arising from instrumentation, transmission or recording problems. It does not deal with issues affecting gauge representativeness (e.g. gauge type or siting) or systematic biases, such as those arising from wind-induced undercatch. This is not a shortcoming specific to GSDR-I; other precipitation indices and datasets are also typically unadjusted for these issues (including HadEX and GHCNDEX gridded ETCCDI indices). However, it is a point of uncertainty to bear in mind when analysing the dataset.
2.
Data records for the supplementary statistics are based on the available record periods for each gauge (including only months, seasons or years when the data completeness threshold is met). This means that the record periods vary between gauges. It is possible for the user to calculate summaries of time series indices to focus on specific reference periods, e.g. 1961–1990. Depending on the region under consideration, this may rule out a substantial amount of data.
3.
Gridding in the GSDR-I dataset proceeds by first calculating indices based on gauge time series and then gridding the resulting indices. The “order of operations” is thus the same as for the gridded GHCNDEX and HadEX family of ETCCDI indices. However, the reverse order (first gridding precipitation observations and then calculating indices from the gridded fields) is considered to be more directly comparable with climate model outputs, as well as other datasets in which grid cells represent spatial averages (e.g. meteorological reanalyses, satellite precipitation datasets, etc.)¹. The difference between the two approaches can vary depending on region, season and the quantity of interest (e.g. absolute values compared with an indication of variability or anomalies), but it may be significant³¹. It is therefore important to consider this point when using GSDR-I in comparisons with other data products or model outputs.
4.
The gridding method is likely to give a generally representative value, but it is not an explicit grid cell average. This consideration is additional to the order of operations point above, as the grid cell value may exhibit lesser representativeness in areas of complex terrain or highly variable precipitation patterns, e.g. places or seasons dominated by convective rainfall. Further research could also explore different indices gridding methods in more detail^7,32, but this is a large task beyond the scope of this work.
5.
Great care is required for any change or trend analysis undertaken with GSDR-I time series. Sub-daily precipitation records are typically much shorter than daily precipitation records, such that many of the gauges within GSDR are intrinsically unsuitable for climate change detection at present. It should also be noted that homogeneity checks are conducted on gauge time series with GSDR-QC, but these are not currently used to remove inhomogeneous records or to attempt to homogenise time series. Users should consult the available metadata before performing analyses sensitive to homogeneity issues. In general, the GSDR-I data records are better suited to characterising the spatial, seasonal and inter-annual variation in sub-daily precipitation variability and extremes at present, until such time as long records become available for more than a small number of countries.
6.
In addition, the gridded indices use all available data to provide the best estimate for each month, season or year at a grid cell, so it should be remembered that the number of gauges incorporated often varies through time. This may introduce inhomogeneity into time series in the gridded dataset. Again, the gauge and gridded indices may therefore provide a better guide to variability rather than long-term changes – at least at this stage in the development of sub-daily indices. Further updates to the dataset over time will address this point by extending record lengths and opening up the possibility of producing a more homogeneous version of GSDR-I to support climate change detection work.
7.
It is hoped that updates and extensions to the GSDR-I dataset will be carried out every few years, depending on resources, uptake and community contributions. As more observation data becomes openly accessible, these updates will ideally include the collation of additional gauge records that are not currently held in GSDR. This would help to fill in some of the gaps in spatial coverage across parts of Asia, Africa and South America. Over the longer term, it would be useful to automate the collection and processing of the data as far as possible, so that the GSDR-I indices dataset could be more frequently updated.

Code availability

The Python code for indices calculation based on gauge records and subsequent gridding is available in the code repository³⁹ here: https://doi.org/10.5281/zenodo.7492877.

References

Alexander, L. V. et al. On the use of indices to study extreme precipitation on sub-daily and daily timescales. Environ. Res. Lett. 14, 125008 (2019).
Article ADS Google Scholar
Westra, S., Alexander, L. V. & Zwiers, F. W. Global increasing trends in annual maximum daily precipitation. J. Clim. 26, 3904–3918 (2013).
Article ADS Google Scholar
Donat, M. G., Lowry, A. L., Alexander, L. V., O’Gorman, P. A. & Maher, N. More extreme precipitation in the world’s dry and wet regions. Nat. Clim. Chang. 6, 508–513 (2016).
Article ADS Google Scholar
Sillmann, J., Kharin, V. V., Zhang, X., Zwiers, F. W. & Bronaugh, D. Climate extremes indices in the CMIP5 multimodel ensemble: Part 1. Model evaluation in the present climate. J. Geophys. Res. Atmos. 118, 1716–1733 (2013).
Article ADS Google Scholar
Alexander, L. V. et al. Global observed changes in daily climate extremes of temperature and precipitation. J. Geophys. Res. Atmos. 111, D05109 (2006).
Article ADS Google Scholar
Donat, M. G. et al. Updated analyses of temperature and precipitation extreme indices since the beginning of the twentieth century: The HadEX2 dataset. J. Geophys. Res. Atmos. 118, 2098–2118 (2013).
Article ADS Google Scholar
Dunn, R. J. H. et al. Development of an updated global land in situ-based data set of temperature and precipitation extremes: HadEX3. J. Geophys. Res. Atmos. 125, e2019JD032263 (2020).
Article ADS Google Scholar
Donat, M. G. et al. Global land-based datasets for monitoring climatic extremes. Bull. Am. Meteorol. Soc. 94, 997–1006 (2013).
Article ADS Google Scholar
Archer, D. R. & Fowler, H. J. Characterising flash flood response to intense rainfall and impacts using historical information and gauged data in Britain. J. Flood Risk Manag. 11, S121–S133 (2015).
Article Google Scholar
Van Asch, T. W. J., Buma, J. & Van Beek, L. P. H. A view on some hydrological triggering systems in landslides. Geomorphology 30, 25–32 (1999).
Article ADS Google Scholar
Nearing, M. A. et al. Modeling response of soil erosion and runoff to changes in precipitation and cover. CATENA 61, 131–154 (2005).
Article Google Scholar
Guerreiro, S. B. et al. Detection of continental-scale intensification of hourly rainfall extremes. Nat. Clim. Chang. 8, 803–807 (2018).
Article ADS Google Scholar
Ban, N., Schmidli, J. & Schär, C. Heavy precipitation in a changing climate: does short-term summer precipitation increase faster? Geophys. Res. Lett. 42, 1165–1172 (2015).
Article ADS Google Scholar
Vergara-Temprado, J., Ban, N. & Schär, C. Extreme sub-hourly precipitation intensities scale close to the Clausius-Clapeyron rate over Europe. Geophys. Res. Lett. 48, e2020GL089506 (2021).
Article ADS Google Scholar
Prein, A. F. et al. The future intensification of hourly precipitation extremes. Nat. Clim. Chang. 7, 48–52 (2017).
Article ADS Google Scholar
Kendon, E. J. et al. Heavier summer downpours with climate change revealed by weather forecast resolution model. Nat. Clim. Chang. 4, 570–576 (2014).
Article ADS Google Scholar
Lenderink, G., Barbero, R., Loriaux, J. M. & Fowler, H. J. Super-Clausius–Clapeyron scaling of extreme hourly convective precipitation and its relation to large-scale atmospheric conditions. J. Clim. 30, 6037–6052 (2017).
Article ADS Google Scholar
Ali, H., Fowler, H. J., Lenderink, G., Lewis, E. & Pritchard, D. Consistent large-scale response of hourly extreme precipitation to temperature variation over land. Geophys. Res. Lett. 48, e2020GL090317 (2021).
Article ADS Google Scholar
Fowler, H. J. et al. Anthropogenic intensification of short-duration rainfall extremes. Nat. Rev. Earth Environ. 2, 107–122 (2021).
Article ADS Google Scholar
Fowler, H. J. et al. Towards advancing scientific knowledge of climate change impacts on short-duration rainfall extremes. Philos. Trans. R. Soc. A 379, 20190542 (2021).
Article ADS Google Scholar
Blenkinsop, S. et al. The INTENSE project: using observations and models to understand the past, present and future of sub-daily rainfall extremes. Adv. Sci. Res. 15, 117–126 (2018).
Article Google Scholar
Lewis, E. et al. GSDR: a global sub-daily rainfall dataset. J. Clim. 32, 4715–4729 (2019).
Article ADS Google Scholar
Lewis, E. et al. Quality control of a global hourly rainfall dataset. Environ. Model. Softw. 144, 105169 (2021).
Article Google Scholar
Smith, A., Lott, N. & Vose, R. The Integrated Surface Database: recent developments and partnerships. Bull. Am. Meteorol. Soc. 92, 704–708 (2011).
Article ADS Google Scholar
Lakatos, M. et al. Analysis of sub-daily precipitation for the PannEx region. Atmosphere (Basel). 12, 838 (2021).
Article ADS Google Scholar
Blenkinsop, S., Lewis, E., Chan, S. C. & Fowler, H. J. Quality-control of an hourly rainfall dataset and climatology of extremes for the UK. Int. J. Climatol. 37, 722–740 (2017).
Article PubMed Google Scholar
Lewis, E. et al. A rule based quality control method for hourly rainfall data and a 1 km resolution gridded hourly rainfall dataset for Great Britain: CEH-GEAR1hr. J. Hydrol. 564, 930–943 (2018).
Article ADS Google Scholar
Schneider, U. et al. GPCC’s new land surface precipitation climatology based on quality-controlled in situ data and its role in quantifying the global water cycle. Theor. Appl. Climatol. 115, 15–40 (2014).
Article ADS Google Scholar
Schär, C. et al. Percentile indices for assessing changes in heavy precipitation events. Clim. Change 137, 201–216 (2016).
Article ADS Google Scholar
Hofstra, N., New, M. & McSweeney, C. The influence of interpolation and station network density on the distributions and trends of climate variables in gridded daily data. Clim. Dyn. 35, 841–858 (2010).
Article Google Scholar
Avila, F. B. et al. Systematic investigation of gridding-related scaling effects on annual statistics of daily temperature and precipitation maxima: a case study for south-east Australia. Weather Clim. Extrem. 9, 6–16 (2015).
Article Google Scholar
Dunn, R. J. H., Donat, M. G. & Alexander, L. V. Comparing extremes indices in recent observational and reanalysis products. Front. Clim. 4, 989505 (2022).
Article Google Scholar
Shepard, D. A two-dimensional interpolation function for irregularly-spaced data. in Proceedings of the 1968 23rd ACM national conference 517–524 (1968).
Dunn, R. J. H., Donat, M. G. & Alexander, L. V. Investigating uncertainties in global gridded datasets of climate extremes. Clim. Past 10, 2171–2199 (2014).
Article Google Scholar
Hofstra, N. & New, M. Spatial variability in correlation decay distance and influence on angular-distance weighting interpolation of daily precipitation over Europe. Int. J. Climatol. 29, 1872–1880 (2009).
Article Google Scholar
Pritchard, D. et al. GSDR-I Global Sub-Daily Precipitation Indices - Dataset. Zenodo https://doi.org/10.5281/zenodo.7492812 (2022).
Durre, I., Menne, M. J. & Vose, R. S. Strategies for Evaluating Quality Assurance Procedures. J. Appl. Meteorol. Climatol. 47, 1785–1791 (2008).
Article ADS Google Scholar
Pritchard, D. et al. GSDR-I Global Sub-Daily Precipitation Indices - Validation Figures. Zenodo https://doi.org/10.5281/zenodo.7492848 (2022).
Pritchard, D. et al. GSDR-I Global Sub-Daily Precipitation Indices - Code. Zenodo https://doi.org/10.5281/zenodo.7492877 (2022).

Download references

Acknowledgements

The authors are extremely grateful to the many institutions and individuals who contributed data to the INTENSE project. This includes all of the organisations listed in Table A1 in Lewis et al.²². We would also very much like to thank the PannEX initiative for their efforts in collating data and conducting/sharing indices calculations. Very important data contributions were also made by Alex Cannon and Hylke Beck (for gauges in Canada and Brazil, respectively), for which the authors are very grateful. We would also like to acknowledge the very constructive inputs of all participants who attended the workshop on sub-daily precipitation indices held in Newcastle upon Tyne (UK) in September 2016. DP, EL, SB and HJF were funded by the European Research Council grant INTENSE (ERC-2013-CoG-617329). DP was also funded by UK Research and Innovation (UKRI) (ES/P011373/1) and HJF by the Wolfson Foundation and the Royal Society as a Royal Society Wolfson Research Merit Award holder (WM140025). LPV was funded by the UKRI Engineering and Physical Science Research Council (EP/S023577/1) and the Water Security & Sustainable Development Hub (ES/S008179/1). AW was funded by the UKRI Natural Environment Research Council (NE/S007512/1).

Author information

Authors and Affiliations

Newcastle University, Newcastle upon Tyne, UK
David Pritchard, Elizabeth Lewis, Stephen Blenkinsop, Luis Patino Velasquez, Anna Whitford & Hayley J. Fowler

Authors

David Pritchard
View author publications
You can also search for this author in PubMed Google Scholar
Elizabeth Lewis
View author publications
You can also search for this author in PubMed Google Scholar
Stephen Blenkinsop
View author publications
You can also search for this author in PubMed Google Scholar
Luis Patino Velasquez
View author publications
You can also search for this author in PubMed Google Scholar
Anna Whitford
View author publications
You can also search for this author in PubMed Google Scholar
Hayley J. Fowler
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

H.J.F., S.B. and E.L. initiated and supervised the development of the dataset. D.P. and E.L. wrote the code for indices calculations, with L.P.V. and D.P. implementing the gridding method. D.P. carried out the calculations for gauges in the GSDR dataset and conducted the dataset validation. A.W. conducted analysis to support the refinement of the dataset and index definitions. D.P. drafted the manuscript and figures, with the other authors providing comments and edits.

Corresponding author

Correspondence to David Pritchard.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Pritchard, D., Lewis, E., Blenkinsop, S. et al. An Observation-Based Dataset of Global Sub-Daily Precipitation Indices (GSDR-I). Sci Data 10, 393 (2023). https://doi.org/10.1038/s41597-023-02238-4

Download citation

Received: 30 December 2022
Accepted: 15 May 2023
Published: 22 June 2023
DOI: https://doi.org/10.1038/s41597-023-02238-4
Springer Nature Limited

An Observation-Based Dataset of Global Sub-Daily Precipitation Indices (GSDR-I)

Abstract

Similar content being viewed by others

HYADES - A Global Archive of Annual Maxima Daily Precipitation

Evaluating extreme precipitation in gridded datasets with a novel station database in Morocco

Variations in sub-daily precipitation at centennial scale

Background & Summary