Abstract
Like almost all fields of science, hydrology has benefited greatly over the last decades from improvements in scientific instruments that can collect long time series and from increases in available computational power and storage capacity. Many model applications and statistical analyses (e.g., extreme value analysis) are based on these time series. Consequently, the quality and completeness of these time series are essential, and preprocessing raw data sets by filling data gaps is a necessary step. Several interpolation techniques of differing complexity are available, ranging from rather simple to extremely challenging approaches. In this paper, various imputation methods available to hydrological researchers are reviewed with regard to their suitability for filling gaps in the context of solving hydrological questions. The methodological approaches include arithmetic mean imputation, principal component analysis, regression-based methods and multiple imputation methods. In particular, autoregressive conditional heteroscedasticity (ARCH) models, which originate from finance and econometrics, are discussed regarding their applicability to hydrological data series characterized by non-constant volatility and heteroscedasticity. The review shows that methodological advances driven by other fields of research are relevant to hydrology and warrant more intensive use there. Up to now, the hydrological community has paid little attention to the imputation ability of time series models in general and ARCH models in particular.
Notes
As described in the “Multiple imputation” section, multiple imputation generates several datasets containing imputed values that are augmented by a random error term. The desired statistical analyses are then carried out on each of these datasets and their results are aggregated. This approach yields more appropriate standard errors for the estimates of the desired parameters.
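The procedure in this note can be illustrated with a minimal sketch. It is not the implementation of any method reviewed in the paper, and it rests on several assumptions: the function name `multiple_imputation_mean` is hypothetical, gaps are marked as NaN, a fully observed auxiliary series `x` (for example a neighbouring gauge) is linearly related to the gappy series `y`, the parameter of interest is simply the mean of `y`, and the values are treated as independent when the sampling variance is computed. The results are aggregated with Rubin's rules. For brevity the regression coefficients are held fixed across imputations, so the uncertainty in the coefficients themselves is not propagated.

```python
import math
import random
import statistics


def multiple_imputation_mean(y, x, m=20, seed=0):
    """Pool the mean of y (gaps marked as NaN) over m imputed datasets.

    Hypothetical helper: x is a fully observed auxiliary series that is
    assumed to be linearly related to y.
    """
    if m < 2:
        raise ValueError("need at least two imputations")
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if not math.isnan(yi)]
    if len(observed) < 3:
        raise ValueError("need at least three observed pairs")

    # Ordinary least squares fit y = intercept + slope * x on observed pairs.
    mean_x = statistics.fmean(xi for xi, _ in observed)
    mean_y = statistics.fmean(yi for _, yi in observed)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in observed)
    sigma = math.sqrt(rss / (len(observed) - 2))

    estimates, variances = [], []
    for _ in range(m):
        # Fill each gap with the regression prediction plus a random error.
        filled = [
            intercept + slope * xi + rng.gauss(0.0, sigma)
            if math.isnan(yi) else yi
            for xi, yi in zip(x, y)
        ]
        # Run the analysis on this dataset: the mean and its sampling variance.
        estimates.append(statistics.fmean(filled))
        variances.append(statistics.variance(filled) / len(filled))

    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    return pooled, math.sqrt(within + (1 + 1 / m) * between)
```

The between-imputation term is what widens the pooled standard error relative to a single imputed dataset, which is the sense in which the standard errors become more appropriate.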
About this article
Cite this article
Gao, Y., Merz, C., Lischeid, G. et al. A review on missing hydrological data processing. Environ Earth Sci 77, 47 (2018). https://doi.org/10.1007/s12665-018-7228-6
DOI: https://doi.org/10.1007/s12665-018-7228-6