A review on missing hydrological data processing

Gao, Yongbo; Merz, Christoph; Lischeid, Gunnar; Schneider, Michael

doi:10.1007/s12665-018-7228-6

Original Article
Published: 19 January 2018

Volume 77, article number 47 (2018)

Environmental Earth Sciences

Yongbo Gao (ORCID: orcid.org/0000-0003-3283-0538)^1,2,
Christoph Merz^1,2,
Gunnar Lischeid^1,3 &
Michael Schneider^2

Abstract

Like almost all fields of science, hydrology has benefited to a large extent from the tremendous improvements in scientific instruments that are able to collect long-term data series and from an increase in available computational power and storage capabilities over the last decades. Many model applications and statistical analyses (e.g., extreme value analysis) are based on these time series. Consequently, the quality and the completeness of these time series are essential. Preprocessing of raw data sets by filling data gaps is thus a necessary procedure. Several interpolation techniques of differing complexity are available, ranging from rather simple to extremely challenging approaches. In this paper, various imputation methods available to hydrological researchers are reviewed with regard to their suitability for filling gaps in the context of solving hydrological questions. The methodological approaches include arithmetic mean imputation, principal component analysis, regression-based methods and multiple imputation methods. In particular, autoregressive conditional heteroscedasticity (ARCH) models, which originate from finance and econometrics, are discussed regarding their applicability to data series characterized by non-constant volatility and heteroscedasticity in hydrological contexts. The review shows that methodological advances driven by other fields of research bear relevance for a more intensive use of these methods in hydrology. Up to now, the hydrological community has paid little attention to the imputation ability of time series models in general and ARCH models in particular.

Notes

As described in the “Multiple imputation” section, multiple imputation generates multiple datasets containing imputed values that are augmented with a random error term. The desired statistical analyses are then carried out on each of these datasets and their results are aggregated. This approach yields more appropriate standard errors for the estimates of the desired parameters.
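
The procedure described in this note can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `multiple_imputation_mean`, the synthetic "neighbouring gauge" predictor `x`, the number of imputations `m = 20`, and the choice of the series mean as the target statistic are all assumptions made for illustration. Gaps are filled by a linear regression prediction plus a random error term, the analysis is repeated on each completed dataset, and the results are aggregated with Rubin's combining rules (total variance = within-imputation variance + (1 + 1/m) × between-imputation variance). For brevity the regression coefficients are held fixed across imputations, which a full multiple imputation would not do.

```python
import numpy as np


def multiple_imputation_mean(y, x, m=20, seed=0):
    """Pool the mean of y (NaN marks gaps) over m stochastically imputed datasets.

    Hypothetical helper for illustration; returns (pooled_estimate, pooled_std_error).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    obs = ~np.isnan(y)
    n = y.size

    # Fit y = a + b * x by least squares on the observed pairs only.
    design = np.column_stack([np.ones(obs.sum()), x[obs]])
    coef, *_ = np.linalg.lstsq(design, y[obs], rcond=None)
    resid = y[obs] - design @ coef
    sigma = np.sqrt(resid @ resid / (obs.sum() - 2))  # residual std. deviation

    estimates, variances = [], []
    for _ in range(m):
        filled = y.copy()
        pred = coef[0] + coef[1] * x[~obs]
        # Imputed value = regression prediction + random error term.
        filled[~obs] = pred + rng.normal(0.0, sigma, size=pred.size)
        estimates.append(filled.mean())             # analysis on this dataset
        variances.append(filled.var(ddof=1) / n)    # its squared standard error

    # Rubin's rules: aggregate the m analyses.
    pooled = np.mean(estimates)
    within = np.mean(variances)
    between = np.var(estimates, ddof=1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, np.sqrt(total)


# Synthetic demonstration data (assumed, not from the article).
rng = np.random.default_rng(42)
x = rng.normal(10.0, 2.0, size=200)             # complete predictor series
y = 1.5 * x + rng.normal(0.0, 1.0, size=200)    # target series before gaps
y_gappy = y.copy()
y_gappy[rng.choice(200, size=40, replace=False)] = np.nan  # 20 % missing

est, se = multiple_imputation_mean(y_gappy, x)
print(f"pooled mean = {est:.3f}, pooled standard error = {se:.3f}")
```

The between-imputation term is what distinguishes this from a single regression fill: it adds the uncertainty caused by the gaps to the reported standard error instead of treating imputed values as if they had been observed.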

Author information

Authors and Affiliations

Leibniz Centre for Agricultural Landscape Research (ZALF), Eberswalder Str. 84, 15374, Müncheberg, Germany
Yongbo Gao, Christoph Merz & Gunnar Lischeid
Institute of Geological Sciences, Workgroup Hydrogeology, Freie Universität Berlin, Malteser Str. 74-100, 12249, Berlin, Germany
Yongbo Gao, Christoph Merz & Michael Schneider
Department of Earth and Environmental Science, University of Potsdam, Karl-Liebknecht-Str. 24-25, 14476, Potsdam-Golm, Germany
Gunnar Lischeid

Corresponding author

Correspondence to Yongbo Gao.

About this article

Cite this article

Gao, Y., Merz, C., Lischeid, G. et al. A review on missing hydrological data processing. Environ Earth Sci 77, 47 (2018). https://doi.org/10.1007/s12665-018-7228-6

Received: 09 September 2016
Accepted: 03 January 2018
Published: 19 January 2018
DOI: https://doi.org/10.1007/s12665-018-7228-6
