
Frequency based imputation of precipitation

  • Original Paper

Published in Stochastic Environmental Research and Risk Assessment

Abstract

Changing climate and precipitation patterns make the estimation of precipitation, which exhibits two-dimensional and sometimes chaotic behavior, more challenging. In recent decades, numerous data-driven methods have been developed and applied to estimate precipitation; however, these methods rely on one-dimensional approaches, lack generality, require neighboring stations and have low sensitivity. This paper aims to implement the first generally applicable, highly sensitive two-dimensional data-driven model of precipitation. The model, named frequency based imputation (FBI), relies on non-continuous monthly precipitation time series data. It requires no determination of input parameters and no data preprocessing, and it provides multiple estimations (from the most to the least probable) of each missing data point using the series itself. A total of 34,330 monthly total precipitation observations from 70 stations in 21 basins within Turkey were used to assess the success of the method by removing and estimating observation series in annual increments. Comparisons with expectation maximization and multiple linear regression models illustrate that the FBI method is superior in its estimation of monthly precipitation. This paper also provides a link to the software code for the FBI method.




Acknowledgements

I would like to thank The General Directorate of the State Hydraulic Works of Turkey for providing the data used in this study and the editors and reviewers for their valuable contributions and comments, which greatly improved the manuscript.

Author information

Correspondence to Fatih Dikbas.

Appendices

Appendix 1: Determination of range clusters

For various reasons, there are generally gaps in any time series dataset, and the reliable estimation of the missing data has great value. In the FBI method, the missing data value at the center of the matrix in Fig. 8 (cell i, j) has temporal and quantitative relationships with nearby cells.

Fig. 8 Pairs to be searched in the data matrix

To estimate the probable range of the missing value at node i, j, the value ranges of all existing observations in the dataset should be determined. First, the observed data are sorted in ascending order, and a three-dimensional vector containing the sorted values and their associated coordinates in the data matrix is generated. The coordinate of each data point used in this study is the observed month (column) and year (row) of the data and is unique for each observation. The coordinate information is crucial because the observation time of a given value affects the temporal and quantitative investigation of time series data. Sorting a variable and investigating its statistical relationships without considering the observation time of each individual value means ignoring information about the temporal relationships between observations.
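To make this step concrete, the following Python sketch (an illustration only; the published software is written in Visual Basic) builds the sorted vector of value–coordinate triples from a two-dimensional data matrix, assuming NaN marks missing cells.

```python
import numpy as np

def sorted_value_vector(data):
    """Return the observations sorted in ascending order, each paired with
    its (row, column) coordinate in the data matrix.

    data : 2-D float array, rows = years, columns = months; np.nan marks gaps.
    """
    triples = [
        (data[r, c], r, c)
        for r in range(data.shape[0])
        for c in range(data.shape[1])
        if not np.isnan(data[r, c])
    ]
    # Sort by the observed value while keeping the coordinates attached,
    # so the temporal position of every observation is preserved.
    triples.sort(key=lambda t: t[0])
    return triples
```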

After sorting the observations, the observed time series range is divided into 2 to n range clusters to evaluate and estimate the possible clusters into which the missing data point may fall. The value of n may increase with the amount of available data; this increase would provide more precise results, as the value range for each cluster would be narrower. The number of clusters should be chosen so that the distribution of the observed values is sufficiently represented. Currently, the maximum number of clusters is determined by running the software with various numbers of clusters. Note that the selected cluster number may not be optimal, although the method can still produce successful results with a suboptimal choice. A practical approach to determining the maximum number of clusters is to start with a high number of clusters (e.g., 50) and then, by inspecting the generated frequency tables, choose the cluster number that produces sufficient frequency values and cluster ranges. Future studies should propose a method for determining the optimum number of clusters based on the number and variability of observations to further improve the estimation of missing values.
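The paper leaves the exact choice to trial runs. The minimal sketch below shows one possible occupancy check, assuming that "sufficient frequency values" is interpreted as every equal-count cluster holding at least a few observations; the starting value of 50 and the `min_members` threshold are illustrative assumptions, not part of the published method.

```python
def largest_usable_cluster_count(n_d, start=50, min_members=3):
    """Scan candidate cluster counts downward from `start` and return the
    largest count for which every equal-count cluster (Eq. 1) would still
    hold at least `min_members` observations; n_d is the number of data."""
    for n_cl in range(min(start, n_d), 1, -1):
        if n_d // n_cl >= min_members:  # size of the smallest cluster
            return n_cl
    return 2
```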

Clusters may be generated using two different approaches. In the first approach, each cluster has as equal a number of elements as possible (the clusters have varying ranges). Observed values are assigned to clusters using Eq. (1).

$$Cl_{i} = int\left( \frac{i \cdot n_{cl}}{n_{d}} \right) + 1$$
(1)

In the second approach, range values are equalized (the clusters have a varying number of elements). The bounds of each cluster are the lowest and highest observations belonging to that cluster. Observed values are assigned to clusters using Eq. (2).

$$Cl_{i} = int\left( \frac{\left( X_{i} - X_{min} \right) \cdot n_{cl}}{X_{max} - X_{min}} \right) + 1$$
(2)

In the above equations, \(n_{d}\) is the total number of observations in the sorted data vector; \(i\) is the rank (index number) of an observation in the sorted data vector (ranging from 1 to \(n_{d}\)); \(n_{cl}\) is the number of clusters used to divide the sorted data vector; \(Cl_{i}\) is the cluster index assigned to the i-th observation (ranging from 1 to \(n_{cl}\)); \(int()\) is the function converting a decimal number into an integer; \(X_{i}\) is the i-th observation in the sorted data series; and \(X_{min}\) and \(X_{max}\) are the minimum and maximum observations.
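As a minimal illustration of the two assignment rules (in Python rather than the author's Visual Basic), the sketch below implements Eqs. (1) and (2) for a sorted data series; the highest-ranked value is clamped into the top cluster so that the index stays within 1 to \(n_{cl}\).

```python
def clusters_equal_count(n_d, n_cl):
    """Eq. (1): each cluster receives (almost) the same number of elements;
    ranks i run from 1 to n_d."""
    return [min(int(i * n_cl / n_d) + 1, n_cl) for i in range(1, n_d + 1)]


def clusters_equal_range(values, n_cl):
    """Eq. (2): each cluster spans (almost) the same value range;
    `values` is the data series sorted in ascending order."""
    x_min, x_max = values[0], values[-1]
    return [min(int((x - x_min) * n_cl / (x_max - x_min)) + 1, n_cl)
            for x in values]
```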

Each approach has advantages and disadvantages relative to the other, and selection of the appropriate clustering method depends entirely on the diversity of the observed time series. For example, if the number of elements in certain clusters becomes too high compared to the other clusters, it is better to generate clusters with an equal number of elements. For the precipitation data used in this paper, the first approach was used, so each cluster included a similar number of elements. For example, for station 07-016, the first 11 clusters cover the range 0.0–80.3 mm, while the 12th cluster covers the range 80.8–204.8 mm (1.54 times greater than the cumulative range of the first 11 clusters).

Appendix 2: Generation of the cluster frequency table

The clustering process explained in Appendix 1 assigns a cluster index to each observation. The cluster index value of each cell is the key to finding the cluster value of the missing cell. When the observed range is divided into two clusters, the first cluster includes the lower values and has a cluster index of 1, and the second cluster includes the higher values and has a cluster index value of 2. All adjacent cluster pairs in the data matrix near the missing cell are searched. Frequency values for the probable clusters are set to zero prior to the initiation of the search process. At the first clustering step, there are two possible clusters (1 or 2) into which the missing data may fall. When a match for a cluster pair is found in the matrix, the frequency of the cluster value at the relative location of the missing data point is increased by one. The maximum number of unique cluster pairs near the missing data point is 158. This number decreases if there is more than one missing data point in the neighborhood. The following rules provide three examples of the 158 unique rules used to find matching cluster pairs.

  1. If [Cl(Xi,j−2) = a & Cl(Xi,j−1) = b] and if [Cl(Xp,q−2) = a & Cl(Xp,q−1) = b & Cl(Xp,q) = c], then freq(c) = freq(c) + 1.

  2. If [Cl(Xi−2,j) = a & Cl(Xi−1,j) = b] and if [Cl(Xp−2,q) = a & Cl(Xp−1,q) = b & Cl(Xp,q) = c], then freq(c) = freq(c) + 1.

  3. If [Cl(Xi−2,j−2) = a & Cl(Xi−1,j−1) = b] and if [Cl(Xp−2,q−2) = a & Cl(Xp−1,q−1) = b & Cl(Xp,q) = c], then freq(c) = freq(c) + 1.

In the above rules, Cl(X) is the cluster index of the observed value X; i and j are the row and column numbers of the missing node at the center of the 7 × 7 cell field; p and q are the row and column numbers of the cell at the relative location of the missing data at i, j; and a, b and c are the cluster numbers of the related cells. When the entire dataset is divided into two clusters, a, b and c may take values of 1 or 2; for n clusters, they may take values between 1 and n. The values of a, b and c may differ for each rule because they may represent different locations within the data matrix. The three rules above represent the horizontal cluster pair to the left of the missing node, the vertical cluster pair above the missing node and the diagonal cluster pair to the top left of the missing node, as shown in Fig. 9a in orange, yellow and green, respectively. Figure 9b shows the location of the first pair match for the first rule. With the first match, the frequency of the cluster number of the cell at the relative location of the missing data point (the cell at p, q, shown in pink) is increased by one. This is done because the cluster value at cell p, q is a probable value for the missing node at i, j, given that both cells have the same cluster pair to their left. The search for the same pair then continues until all matching pairs are found, and the frequencies of the clusters at the corresponding cells p, q are increased by one (for each match, the values of p and q may differ because the matching pairs will be at different locations within the data matrix).
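To make the search concrete, here is a Python sketch that applies only the three example rules above to a matrix of cluster indices (with 0 marking missing cells); it is an illustrative subset, since the full method scans all 158 unique pairs around the missing node.

```python
import numpy as np

# Offsets of the three example pairs relative to the target cell:
# rule 1 uses the two cells to the left, rule 2 the two cells above,
# rule 3 the two cells diagonally to the top left.
EXAMPLE_RULES = [
    ((0, -2), (0, -1)),    # horizontal pair to the left
    ((-2, 0), (-1, 0)),    # vertical pair above
    ((-2, -2), (-1, -1)),  # diagonal pair to the top left
]


def cluster_frequencies(cl, i, j, n_cl, rules=EXAMPLE_RULES):
    """Count, for each probable cluster 1..n_cl, how often the cluster
    pairs observed next to the missing cell (i, j) reappear next to a
    known cell (p, q) elsewhere in the cluster-index matrix `cl`."""
    rows, cols = cl.shape
    freq = {c: 0 for c in range(1, n_cl + 1)}
    for (dr1, dc1), (dr2, dc2) in rules:
        r1, c1 = i + dr1, j + dc1
        if not (0 <= r1 < rows and 0 <= c1 < cols):
            continue  # the pair falls outside the matrix for this rule
        a, b = cl[r1, c1], cl[i + dr2, j + dc2]
        if a == 0 or b == 0:
            continue  # the pair next to the missing cell is itself incomplete
        for p in range(rows):
            for q in range(cols):
                if (p, q) == (i, j) or cl[p, q] == 0:
                    continue
                pr1, pc1 = p + dr1, q + dc1
                if not (0 <= pr1 < rows and 0 <= pc1 < cols):
                    continue  # the nearer cell of the pair is then in bounds too
                if cl[pr1, pc1] == a and cl[p + dr2, q + dc2] == b:
                    freq[cl[p, q]] += 1  # matching pair found: cell (p, q) votes
    return freq
```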

Fig. 9 a The cluster pairs (orange, yellow and green) for which rules 1, 2 and 3 are written; b a matching cluster pair for the first rule

Fig. 10 Primary steps in the FBI method

After the search for the first cluster pair is completed, the above process is repeated for the next pair until all pairs near the missing data point have been searched and the total frequencies for each probable cluster determined. The clusters with the highest frequencies will be the most likely clusters into which the missing node will fall. Some cluster frequencies might remain at zero, indicating that it is unlikely that the missing data point will fall within that cluster.

In the first step, the observed data range is divided into two clusters. After the frequencies of both clusters are determined, the observed range is divided into three clusters, and the cells in the data matrix take cluster values ranging from 1 to 3. The process applied above for two clusters is repeated for three clusters: the frequencies of the three probable clusters are initialized to zero, all cluster pairs near the missing data point are searched, and the frequency of the cluster found at the relative location of the missing data point is increased by one for each match. This clustering, searching and frequency-determination process continues until it has been applied for the greatest number of clusters. During this process, a cluster frequency table is generated to show the frequencies of the clusters determined at each clustering step. The highest frequency values in this table indicate the most likely clusters into which the missing data point will fall.
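Putting the pieces together, the outer loop that repeats clustering and pair searching for every cluster count from 2 to the chosen maximum could look like the sketch below; `sorted_value_vector`, `clusters_equal_count` and `cluster_frequencies` are the hypothetical helpers from the earlier sketches, not the published code.

```python
import numpy as np

def frequency_table(data, i, j, max_clusters):
    """Repeat clustering and the pair search for every cluster count from
    2 to `max_clusters` and collect the resulting frequencies for the
    missing cell (i, j) into a {n_cl: {cluster: frequency}} table."""
    triples = sorted_value_vector(data)        # (value, row, col), ascending
    table = {}
    for n_cl in range(2, max_clusters + 1):
        cl = np.zeros(data.shape, dtype=int)   # 0 keeps marking missing cells
        indices = clusters_equal_count(len(triples), n_cl)
        for (value, r, c), k in zip(triples, indices):
            cl[r, c] = k                       # write the cluster index back
        table[n_cl] = cluster_frequencies(cl, i, j, n_cl)
    return table
```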

A dataset might have more than one missing value. The above method can be applied to each missing data point in the set and a frequency table generated for each missing cell. As the locations of the missing data points in the matrix will be different from one another, the neighbors of each missing cell will be unique; consequently, the frequency table for each missing data point will also be unique. To avoid repetition, cluster frequency table samples and details about how the estimates are calculated using the cluster frequencies are presented in the Application of the FBI Method section.

Appendix 3: The frequency based imputation software

The software developed to implement the method used in this study was written in Visual Basic in the Microsoft Visual Studio environment. The software is a console application that makes use of the interoperability feature, which enables synchronous operation of Microsoft Visual Basic and Microsoft Excel. The flowchart in Fig. 10 shows the general application procedure of the developed method and the software.

The first step in the application of the method is to read all observed values in the selected time series from the input file. The file is an Excel spreadsheet containing a two-dimensional matrix of the observed data. In this study, the columns in the data file represent months and the rows represent years. For each run, all observed data for a single station is evaluated. The method requires no preprocessing of data and uses all observed values from a station to generate the frequency tables for each observation; estimations are then made for the entire series. No observations are ignored and no smoothing occurs.
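The published software reads the spreadsheet through the Visual Basic/Excel interoperability layer; a rough Python equivalent of this loading step (an assumption for illustration, not the author's code) is shown below, with a hypothetical file name.

```python
import pandas as pd

# Load the monthly precipitation matrix (rows = years, columns = months)
# from the first sheet of the workbook; the file name and sheet layout
# are assumptions made for this illustration only.
data = pd.read_excel("station_07016.xlsx", header=None).to_numpy(dtype=float)
# Missing observations come through as NaN and are left untouched: the
# method uses every observed value as-is, with no preprocessing or smoothing.
```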

The software generates four output files containing the frequency tables, the estimations and their correlations with the removed observations, and statistical measures comparing the observed and estimated series. Conditional formatting is used in the output files to visualize the differences between the values. The code is separated into distinct sections, and explanations of how the software implements the method are provided in the code itself.

The frequency based imputation software is distributed under the terms of the GNU General Public License version 3, and a copyright notice is provided at the beginning of the code. The software code may be downloaded using the following link: https://www.dropbox.com/s/l9eavvjiywipl19/FrequencyBasedImputation.vb?dl=0.

Cite this article

Dikbas, F. Frequency based imputation of precipitation. Stoch Environ Res Risk Assess 31, 2415–2434 (2017). https://doi.org/10.1007/s00477-016-1356-x
