
Frequency based imputation of precipitation

  • Original Paper

Published in Stochastic Environmental Research and Risk Assessment

Abstract

Changing climate and precipitation patterns make the estimation of precipitation, which exhibits two-dimensional and sometimes chaotic behavior, more challenging. In recent decades, numerous data-driven methods have been developed and applied to estimate precipitation; however, these methods rely on one-dimensional approaches, lack generality, require neighboring stations and have low sensitivity. This paper aims to implement the first generally applicable, highly sensitive two-dimensional data-driven model of precipitation. The model, named frequency based imputation (FBI), relies on non-continuous monthly precipitation time series data. It requires no determination of input parameters and no data preprocessing, and it provides multiple estimations (from the most to the least probable) of each missing data point using the series itself. A total of 34,330 monthly total precipitation observations from 70 stations in 21 basins within Turkey were used to assess the success of the method by removing and estimating observation series in annual increments. Comparisons with expectation maximization and multiple linear regression models illustrate that the FBI method is superior in its estimation of monthly precipitation. This paper also provides a link to the software code for the FBI method.




Acknowledgements

I would like to thank The General Directorate of the State Hydraulic Works of Turkey for providing the data used in this study and the editors and reviewers for their valuable contributions and comments, which greatly improved the manuscript.

Author information

Correspondence to Fatih Dikbas.

Appendices

Appendix 1: Determination of range clusters

For various reasons, there are generally gaps in any time series dataset, and the reliable estimation of the missing data has great value. In the FBI method, the missing data value at the center of the matrix in Fig. 8 (cell i, j) has temporal and quantitative relationships with nearby cells.

Fig. 8 Pairs to be searched in the data matrix

To estimate the probable range of the missing value at node i, j, the value ranges of all existing observations in the dataset should be determined. First, the observed data are sorted in ascending order, and a three-dimensional vector containing the sorted values and their associated coordinates in the data matrix is generated. The coordinate of each data point used in this study is the observed month (column) and year (row) of the data and is unique for each observation. The coordinate information is crucial because the observation time of a given value affects the temporal and quantitative investigation of time series data. Sorting a variable and investigating its statistical relationships without considering the observation time of each individual value means ignoring information about the temporal relationships between observations.
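To make this step concrete, the following Python sketch (an illustration only; the published software is written in Visual Basic) builds the sorted vector of value–coordinate triples from a two-dimensional data matrix, assuming NaN marks missing cells.

```python
import numpy as np

def sorted_value_vector(data):
    """Return the observations sorted in ascending order, each paired with
    its (row, column) coordinate in the data matrix.

    data : 2-D float array, rows = years, columns = months; np.nan marks gaps.
    """
    triples = [
        (data[r, c], r, c)
        for r in range(data.shape[0])
        for c in range(data.shape[1])
        if not np.isnan(data[r, c])
    ]
    # Sort by the observed value while keeping the coordinates attached,
    # so the temporal position of every observation is preserved.
    triples.sort(key=lambda t: t[0])
    return triples
```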

After sorting the observations, the observed time series range is divided into 2 to n range clusters to evaluate and estimate the possible clusters into which the missing data point may fall. The value of n may increase with the amount of available data; this increase would provide more precise results, as the value range for each cluster would be narrower. The number of clusters should be chosen so that the distribution of the observed values is sufficiently represented. Currently, the maximum number of clusters is determined by running the software with various numbers of clusters. Note that the selected cluster number may not be optimal, although the method can still produce successful results with a suboptimal choice. A practical approach to determining the maximum number of clusters is to start with a high number of clusters (e.g., 50) and then, by inspecting the generated frequency tables, choose the cluster number that produces sufficient frequency values and cluster ranges. Future studies should propose a method for determining the optimum number of clusters based on the number and variability of observations to further improve the estimation of missing values.
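The paper leaves the exact choice to trial runs. The minimal sketch below shows one possible occupancy check, assuming that "sufficient frequency values" is interpreted as every equal-count cluster holding at least a few observations; the starting value of 50 and the `min_members` threshold are illustrative assumptions, not part of the published method.

```python
def largest_usable_cluster_count(n_d, start=50, min_members=3):
    """Scan candidate cluster counts downward from `start` and return the
    largest count for which every equal-count cluster (Eq. 1) would still
    hold at least `min_members` observations; n_d is the number of data."""
    for n_cl in range(min(start, n_d), 1, -1):
        if n_d // n_cl >= min_members:  # size of the smallest cluster
            return n_cl
    return 2
```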

Clusters may be generated using two different approaches. In the first approach, each cluster has as equal a number of elements as possible (the clusters have varying ranges). Observed values are assigned to clusters using Eq. (1).

$$Cl_{i} = int\left( \frac{i \cdot n_{cl}}{n_{d}} \right) + 1$$
(1)

In the second approach, range values are equalized (the clusters have a varying number of elements). The bounds of each cluster are the lowest and highest observations belonging to that cluster. Observed values are assigned to clusters using Eq. (2).

$$Cl_{i} = int\left( \frac{\left( X_{i} - X_{min} \right) \cdot n_{cl}}{X_{max} - X_{min}} \right) + 1$$
(2)

In the above equations, \(n_{d}\) is the total number of observations in the sorted data vector; \(i\) is the rank (index number) of an observation in the sorted data vector (ranging from 1 to \(n_{d}\)); \(n_{cl}\) is the number of clusters used to divide the sorted data vector; \(Cl_{i}\) is the cluster index assigned to the i-th observation (ranging from 1 to \(n_{cl}\)); \(int()\) is the function converting a decimal number into an integer; \(X_{i}\) is the i-th observation in the sorted data series; and \(X_{min}\) and \(X_{max}\) are the minimum and maximum observations.
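As a minimal illustration of the two assignment rules (in Python rather than the author's Visual Basic), the sketch below implements Eqs. (1) and (2) for a sorted data series; the highest-ranked value is clamped into the top cluster so that the index stays within 1 to \(n_{cl}\).

```python
def clusters_equal_count(n_d, n_cl):
    """Eq. (1): each cluster receives (almost) the same number of elements;
    ranks i run from 1 to n_d."""
    return [min(int(i * n_cl / n_d) + 1, n_cl) for i in range(1, n_d + 1)]


def clusters_equal_range(values, n_cl):
    """Eq. (2): each cluster spans (almost) the same value range;
    `values` is the data series sorted in ascending order."""
    x_min, x_max = values[0], values[-1]
    return [min(int((x - x_min) * n_cl / (x_max - x_min)) + 1, n_cl)
            for x in values]
```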

Each approach has advantages and disadvantages relative to the other, and selection of the appropriate clustering method depends entirely on the diversity of the observed time series. For example, if the number of elements in certain clusters becomes too high compared to the other clusters, it is better to generate clusters with an equal number of elements. For the precipitation data used in this paper, the first approach was used, so each cluster included a similar number of elements. For example, for station 07-016, the first 11 clusters cover the range 0.0–80.3 mm, while the 12th cluster covers the range 80.8–204.8 mm (1.54 times greater than the cumulative range of the first 11 clusters).

Appendix 2: Generation of the cluster frequency table

The clustering process explained in Appendix 1 assigns a cluster index to each observation. The cluster index value of each cell is the key to finding the cluster value of the missing cell. When the observed range is divided into two clusters, the first cluster includes the lower values and has a cluster index of 1, and the second cluster includes the higher values and has a cluster index value of 2. All adjacent cluster pairs in the data matrix near the missing cell are searched. Frequency values for the probable clusters are set to zero prior to the initiation of the search process. At the first clustering step, there are two possible clusters (1 or 2) into which the missing data may fall. When a match for a cluster pair is found in the matrix, the frequency of the cluster value at the relative location of the missing data point is increased by one. The maximum number of unique cluster pairs near the missing data point is 158. This number decreases if there is more than one missing data point in the neighborhood. The following rules provide three examples of the 158 unique rules used to find matching cluster pairs.

  1. If [Cl(Xi,j−2) = a & Cl(Xi,j−1) = b] and if [Cl(Xp,q−2) = a & Cl(Xp,q−1) = b & Cl(Xp,q) = c], then freq(c) = freq(c) + 1.

  2. If [Cl(Xi−2,j) = a & Cl(Xi−1,j) = b] and if [Cl(Xp−2,q) = a & Cl(Xp−1,q) = b & Cl(Xp,q) = c], then freq(c) = freq(c) + 1.

  3. If [Cl(Xi−2,j−2) = a & Cl(Xi−1,j−1) = b] and if [Cl(Xp−2,q−2) = a & Cl(Xp−1,q−1) = b & Cl(Xp,q) = c], then freq(c) = freq(c) + 1.

In the above rules, Cl(X) is the cluster index of the observed value X; i and j are the row and column numbers of the missing node at the center of the 7 × 7 cell field; p and q are the row and column numbers of the cell at the relative location of the missing data at i, j; and a, b and c are the cluster numbers of the related cells. When the entire dataset is divided into two clusters, a, b and c may take values of 1 or 2; for n clusters, they may take values between 1 and n. The values of a, b and c may differ for each rule because they may represent different locations within the data matrix. The three rules above represent the horizontal cluster pair to the left of the missing node, the vertical cluster pair above the missing node and the diagonal cluster pair to the top left of the missing node, as shown in Fig. 9a in orange, yellow and green, respectively. Figure 9b shows the location of the first pair match for the first rule. With the first match, the frequency of the cluster number of the cell at the relative location of the missing data point (the cell at p, q, shown in pink) is increased by one. This is done because the cluster value at cell p, q is a probable value for the missing node at i, j, given that both cells have the same cluster pair to their left. The search for the same pair then continues until all matching pairs are found, and the frequencies of the clusters at the corresponding cells p, q are increased by one (for each match, the values of p and q may differ because the matching pairs will be at different locations within the data matrix).
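To make the search concrete, here is a Python sketch that applies only the three example rules above to a matrix of cluster indices (with 0 marking missing cells); it is an illustrative subset, since the full method scans all 158 unique pairs around the missing node.

```python
import numpy as np

# Offsets of the three example pairs relative to the target cell:
# rule 1 uses the two cells to the left, rule 2 the two cells above,
# rule 3 the two cells diagonally to the top left.
EXAMPLE_RULES = [
    ((0, -2), (0, -1)),    # horizontal pair to the left
    ((-2, 0), (-1, 0)),    # vertical pair above
    ((-2, -2), (-1, -1)),  # diagonal pair to the top left
]


def cluster_frequencies(cl, i, j, n_cl, rules=EXAMPLE_RULES):
    """Count, for each probable cluster 1..n_cl, how often the cluster
    pairs observed next to the missing cell (i, j) reappear next to a
    known cell (p, q) elsewhere in the cluster-index matrix `cl`."""
    rows, cols = cl.shape
    freq = {c: 0 for c in range(1, n_cl + 1)}
    for (dr1, dc1), (dr2, dc2) in rules:
        r1, c1 = i + dr1, j + dc1
        if not (0 <= r1 < rows and 0 <= c1 < cols):
            continue  # the pair falls outside the matrix for this rule
        a, b = cl[r1, c1], cl[i + dr2, j + dc2]
        if a == 0 or b == 0:
            continue  # the pair next to the missing cell is itself incomplete
        for p in range(rows):
            for q in range(cols):
                if (p, q) == (i, j) or cl[p, q] == 0:
                    continue
                pr1, pc1 = p + dr1, q + dc1
                if not (0 <= pr1 < rows and 0 <= pc1 < cols):
                    continue  # the nearer cell of the pair is then in bounds too
                if cl[pr1, pc1] == a and cl[p + dr2, q + dc2] == b:
                    freq[cl[p, q]] += 1  # matching pair found: cell (p, q) votes
    return freq
```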

Fig. 9 a The cluster pairs (orange, yellow and green) for which rules 1, 2 and 3 are written; b a matching cluster pair for the first rule

Fig. 10 Primary steps in the FBI method

After the search for the first cluster pair is completed, the above process is repeated for the next pair until all pairs near the missing data point have been searched and the total frequencies for each probable cluster determined. The clusters with the highest frequencies will be the most likely clusters into which the missing node will fall. Some cluster frequencies might remain at zero, indicating that it is unlikely that the missing data point will fall within that cluster.

In the first step, the observed data range is divided into two clusters. After the frequencies of both clusters are determined, the observed range is divided into three clusters, and the cells in the data matrix take cluster values ranging from 1 to 3. The process applied above for two clusters is repeated for three clusters: the frequencies of the three probable clusters are initialized to zero, all cluster pairs near the missing data point are searched, and the frequency of the cluster found at the relative location of the missing data point is increased by one for each match. This clustering, searching and frequency-determination process continues until it has been applied for the greatest number of clusters. During this process, a cluster frequency table is generated to show the frequencies of the clusters determined at each clustering step. The highest frequency values in this table indicate the most likely clusters into which the missing data point will fall.
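Putting the pieces together, the outer loop that repeats clustering and pair searching for every cluster count from 2 to the chosen maximum could look like the sketch below; `sorted_value_vector`, `clusters_equal_count` and `cluster_frequencies` are the hypothetical helpers from the earlier sketches, not the published code.

```python
import numpy as np

def frequency_table(data, i, j, max_clusters):
    """Repeat clustering and the pair search for every cluster count from
    2 to `max_clusters` and collect the resulting frequencies for the
    missing cell (i, j) into a {n_cl: {cluster: frequency}} table."""
    triples = sorted_value_vector(data)        # (value, row, col), ascending
    table = {}
    for n_cl in range(2, max_clusters + 1):
        cl = np.zeros(data.shape, dtype=int)   # 0 keeps marking missing cells
        indices = clusters_equal_count(len(triples), n_cl)
        for (value, r, c), k in zip(triples, indices):
            cl[r, c] = k                       # write the cluster index back
        table[n_cl] = cluster_frequencies(cl, i, j, n_cl)
    return table
```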

A dataset might have more than one missing value. The above method can be applied to each missing data point in the set and a frequency table generated for each missing cell. As the locations of the missing data points in the matrix will be different from one another, the neighbors of each missing cell will be unique; consequently, the frequency table for each missing data point will also be unique. To avoid repetition, cluster frequency table samples and details about how the estimates are calculated using the cluster frequencies are presented in the Application of the FBI Method section.

Appendix 3: The frequency based imputation software

The software developed to implement the method used in this study was written in Visual Basic in the Microsoft Visual Studio environment. The software is a console application that makes use of the interoperability feature, which enables synchronous operation of Microsoft Visual Basic and Microsoft Excel. The flowchart in Fig. 10 shows the general application procedure of the developed method and the software.

The first step in the application of the method is to read all observed values in the selected time series from the input file. The file is an Excel spreadsheet containing a two-dimensional matrix of the observed data. In this study, the columns in the data file represent months and the rows represent years. For each run, all observed data for a single station is evaluated. The method requires no preprocessing of data and uses all observed values from a station to generate the frequency tables for each observation; estimations are then made for the entire series. No observations are ignored and no smoothing occurs.
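The published software reads the spreadsheet through the Visual Basic/Excel interoperability layer; a rough Python equivalent of this loading step (an assumption for illustration, not the author's code) is shown below, with a hypothetical file name.

```python
import pandas as pd

# Load the monthly precipitation matrix (rows = years, columns = months)
# from the first sheet of the workbook; the file name and sheet layout
# are assumptions made for this illustration only.
data = pd.read_excel("station_07016.xlsx", header=None).to_numpy(dtype=float)
# Missing observations come through as NaN and are left untouched: the
# method uses every observed value as-is, with no preprocessing or smoothing.
```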

The software generates four output files containing the frequency tables, the estimations and their correlations with the removed observations, and statistical measures comparing the observed and estimated series. Conditional formatting is used in the output files to visualize the differences between the values. The code is separated into distinct sections, and explanations of how the software implements the method are provided in the code itself.

The frequency based imputation software is distributed under the terms of the GNU General Public License version 3, and a copyright notice is provided at the beginning of the code. The software code may be downloaded using the following link: https://www.dropbox.com/s/l9eavvjiywipl19/FrequencyBasedImputation.vb?dl=0.

Cite this article

Dikbas, F. Frequency based imputation of precipitation. Stoch Environ Res Risk Assess 31, 2415–2434 (2017). https://doi.org/10.1007/s00477-016-1356-x
