
Reconstructing missing data sequences in multivariate time series: an application to environmental data

  • Original Paper
  • Statistical Methods & Applications

Abstract

Missing data arise in many statistical analyses, often because of faults in data acquisition, and can significantly affect the conclusions drawn from the data. In environmental data, for example, the standard approach adopted by Environmental Protection Agencies is to delete the observations with incomplete information from the study, which leads to a massive underestimation of many of the indexes used to evaluate air quality. In multivariate time series, moreover, not only isolated values but also long sequences of some of the components may be missing. In such cases, it is practically impossible to reconstruct the missing sequences from the serial dependence structure alone. In this work, we propose a new procedure that reconstructs the missing sequences by exploiting the spatial correlation and the serial correlation of the multivariate time series simultaneously. The proposed procedure is based on a spatial-dynamic model and imputes the missing values using a linear combination of the contemporaneous observations at neighboring locations and their lagged values. It is specifically oriented to spatio-temporal data, although it is general enough to be applied to generic stationary multivariate time series. In this paper, the procedure is applied to pollution data, where the problem of missing sequences is of serious concern, with remarkably satisfactory performance.


References

  • Aga E, Samoli E, Touloumi G, Anderson HR, Cadum E, Forsberg B (2003) Short-term effects of ambient particles on mortality in the elderly: results from 28 cities in the APHEA2 project. Eur Resp J Suppl 40:28s–33s

  • Anselin L (1988) Spatial econometrics: methods and models. Kluwer Academic, Dordrecht

  • Biggeri A, Baccini M, Accetta G, Lagazio C (2002) Estimates of short-term effects of air pollutants in Italy. Epidemiologia e Prevenzione 26:203–205

  • Calculli C, Fassò A, Finazzi F, Pollice A, Turnone A (2015) Maximum likelihood estimation of the multivariate hidden dynamic geostatistical model with application to air quality in Apulia, Italy. Environmetrics 26:406–417

  • Cameletti M, Ignaccolo R, Bande S (2011) Comparing spatio-temporal models for particulate matter in Piemonte. Environmetrics 22:985–996

  • Dou B, Parrella ML, Yao Q (2016) Generalized Yule–Walker estimation for spatio-temporal models with unknown diagonal coefficients. J Econom 194:369–382

  • Fitri MDNF, Ramli NA, Yahaya AS, Sansuddin N, Ghazali NA, Al Madhoun W (2010) Monsoonal differences and probability distribution of \(PM_{10}\) concentration. Environ Monit Assess 163:655–667

  • Honaker J, King G, Blackwell M (2011) Amelia II: a program for missing data. J Stat Softw 45(7):1–47

  • Josse J, Husson F (2016) missMDA: a package for handling missing values in multivariate data analysis. J Stat Softw 70(1):1–31

  • Junninen H, Niska H, Tuppurrainen K, Ruuskanen J, Kolehmainen M (2004) Methods for imputation of missing values in air quality data sets. Atmos Environ 38:2895–2907

  • Kowarik A, Templ M (2016) Imputation with the R package VIM. J Stat Softw 74(7):1–16

  • Lee LF, Yu J (2010a) Estimation of spatial autoregressive panel data models with fixed effects. J Econom 154:165–185

  • Lee LF, Yu J (2010b) Some recent developments in spatial panel data models. Reg Sci Urban Econ 40:255–271

  • Liu S, Molenaar PC (2014) iVAR: a program for imputing missing data in multivariate time series using vector autoregressive models. Behav Res Method 46(4):1138–1148

  • Moritz S, Bartz-Beielstein T (2017) imputeTS: time series missing value imputation in R. R J 9:207–218

  • Norazian MN, Shukri YA, Azam RN, Al Bakri AMM (2008) Estimation of missing values in air pollution data using single imputation techniques. ScienceAsia 34:341–345

  • Oehmcke S, Zielinski O, Kramer O (2016) kNN ensembles with penalized DTW for multivariate time series imputation. In: International joint conference on neural networks (IJCNN), IEEE

  • Pollice A, Lasinio GJ (2009) Two approaches to imputation and adjustment of air quality data from a composite monitoring network. J Data Sci 7:43–59

  • Raaschou-Nielsen O, Andersen ZJ, Beelen R, Samoli E, Stafoggia M, Weinmayr G (2013) Air pollution and lung cancer incidence in 17 European cohorts: prospective analyses from the European Study of Cohorts for Air Pollution Effects (ESCAPE). Lancet Oncol 14(9):813–822

  • van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67

Author information

Corresponding author

Correspondence to Giuseppina Albano.

Appendix A: Proof of Theorem 1


For the sake of simplicity, we assume here a process with zero mean value. First, we consider the case in which the parameters \(\varvec{\lambda }_0, \varvec{\lambda }_1, \varvec{\lambda }_2\) are known. For \(s\ge 1\) we have

$$\begin{aligned} {\widetilde{{\mathbf y}}}^{(s+1)}_t-{\widetilde{{\mathbf y}}}^{(s)}_t=(\mathbf{1}-\varvec{\delta }_t)\circ \left( {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t\right) \end{aligned}$$

and \(\Vert {\widetilde{{\mathbf y}}}^{(s+1)}_t- {\widetilde{{\mathbf y}}}^{(s)}_t\Vert _2 \le \Vert {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t\Vert _2\), so we focus on \({\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t\). We can write

$$\begin{aligned} {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t = D(\varvec{\lambda }_0){\mathbf W}\left( {\widetilde{{\mathbf y}}}^{(s)}_t-{\widetilde{{\mathbf y}}}^{(s-1)}_t\right) + \left[ D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\right] \left( {\widetilde{{\mathbf y}}}^{(s)}_{t-1}-{\widetilde{{\mathbf y}}}^{(s-1)}_{t-1}\right) , \end{aligned}$$

and

$$\begin{aligned} \Vert {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t\Vert _2\le & {} \Vert D(\varvec{\lambda }_0){\mathbf W}({\widetilde{{\mathbf y}}}^{(s)}_t-{\widetilde{{\mathbf y}}}^{(s-1)}_t)\Vert _2\nonumber \\&+\Vert \left[ D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\right] ({\widetilde{{\mathbf y}}}^{(s)}_{t-1}-{\widetilde{{\mathbf y}}}^{(s-1)}_{t-1})\Vert _2\nonumber \\\le & {} \Vert D(\varvec{\lambda }_0){\mathbf W}\Vert _2 \Vert {\widetilde{{\mathbf y}}}^{(s)}_t-{\widetilde{{\mathbf y}}}^{(s-1)}_t\Vert _2\nonumber \\&+\Vert D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\Vert _2 \Vert {\widetilde{{\mathbf y}}}^{(s)}_{t-1}-{\widetilde{{\mathbf y}}}^{(s-1)}_{t-1}\Vert _2. \end{aligned}$$
(12)
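The recursion bounded in (12) can be tried numerically. The following is a minimal sketch, assuming known parameters, a zero-mean series and missing entries initialised at zero; the function name `impute_known_params`, the ring-shaped weight matrix and the parameter values are illustrative choices, not taken from the paper. The first time point has no lagged term in this sketch.

```python
import numpy as np

def impute_known_params(y, observed, W, lam0, lam1, lam2, n_iter=30):
    """Iterate y_tilde = delta*y + (1-delta)*y_hat with known lambdas.

    y: (T, p) array (NaN allowed where missing); observed: (T, p) boolean mask.
    Returns the filled series and the norms of successive differences.
    """
    A0 = np.diag(lam0) @ W                     # D(lambda_0) W
    A1 = np.diag(lam1) + np.diag(lam2) @ W     # D(lambda_1) + D(lambda_2) W
    y_tilde = np.where(observed, y, 0.0)       # zero-mean start for missing entries
    steps = []
    for _ in range(n_iter):
        y_hat = y_tilde @ A0.T                 # contemporaneous neighbours
        y_hat[1:] += y_tilde[:-1] @ A1.T       # lagged own and neighbour values
        y_new = np.where(observed, y, y_hat)   # keep observed data untouched
        steps.append(np.linalg.norm(y_new - y_tilde))
        y_tilde = y_new
    return y_tilde, steps

# Hypothetical setup: 4 sites on a ring, each with two equally weighted neighbours.
rng = np.random.default_rng(0)
T, p = 200, 4
W = np.zeros((p, p))
for i in range(p):
    W[i, (i - 1) % p] = 0.5
    W[i, (i + 1) % p] = 0.5
lam0, lam1, lam2 = np.full(p, 0.3), np.full(p, 0.3), np.full(p, 0.2)

y = rng.standard_normal((T, p))
observed = rng.random((T, p)) > 0.2            # roughly 20% missing
y[~observed] = np.nan

filled, steps = impute_known_params(y, observed, W, lam0, lam1, lam2)
# Sum of the two spectral norms appearing on the right-hand side of (12).
c = (np.linalg.norm(np.diag(lam0) @ W, 2)
     + np.linalg.norm(np.diag(lam1) + np.diag(lam2) @ W, 2))
print("contraction constant:", c)
print("first and last step size:", steps[0], steps[-1])
```

With these illustrative values the sum of the two norms is below one, so each successive difference is at most that constant times the previous one, which is the geometric decay used in the rest of the proof.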

Defining the vector operator \(\Delta ^j({\mathbf x}_t)=(\mathbf{1}-\varvec{\delta }_{t-j})\circ {\mathbf x}_{t-j}\) and iterating the inequality in (12), we obtain

$$\begin{aligned} \Vert {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t\Vert _2\le & {} \sum _{j=0}^{s-1}\left( {\begin{array}{c}s-1\\ j\end{array}}\right) \Vert D(\varvec{\lambda }_0){\mathbf W}\Vert _2^{s-1-j}\Vert D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\Vert _2^{j}\, \left\| \Delta ^j\!\left( {\widehat{{\mathbf y}}}^{(2)}_t- {\widehat{{\mathbf y}}}^{(1)}_t \right) \right\| _2 \\\le & {} \left( \Vert D(\varvec{\lambda }_0){\mathbf W}\Vert _2 + \Vert D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\Vert _2 \right) ^{s-1}\max _j\left\| \Delta ^j\!\left( {\widehat{{\mathbf y}}}^{(2)}_t- {\widehat{{\mathbf y}}}^{(1)}_t \right) \right\| _2. \end{aligned}$$
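The last inequality above rests on the binomial theorem. As a short worked step, write \(a=\Vert D(\varvec{\lambda }_0){\mathbf W}\Vert _2\) and \(b=\Vert D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\Vert _2\) (shorthand introduced here only for illustration) and bound every norm in the sum by its maximum over j:

$$\begin{aligned} \sum _{j=0}^{s-1}\left( {\begin{array}{c}s-1\\ j\end{array}}\right) a^{s-1-j}\,b^{j} = (a+b)^{s-1}. \end{aligned}$$

Hence, whenever \(a+b<1\), the bound decays geometrically in s. Assumption A4 is not restated in this appendix; the sketch presumes that it delivers a condition of this kind.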

In the extreme case when \(\alpha =0\), we have \(\Delta ^j(\cdot )\equiv \mathbf{0}\) for all t and j, because all the vectors \((\mathbf{1}-\varvec{\delta }_t)\) are zero, and the convergence of the iterative procedure is trivially proved. In the opposite case when \(\alpha =1\) (only theoretically, of course), we have \(\Delta ^j(\cdot )\equiv \mathbf{0}\) for all j because all \({\widehat{{\mathbf y}}}_t^{(s)}\) are zero vectors, so again the convergence of the iterative algorithm is trivially proved. In the remaining, more realistic cases when \(\alpha \in (0,1)\), the convergence is guaranteed by assumption A4. All this implies that the iterative procedure always converges to a limit \({\widetilde{{\mathbf y}}}_t^{(\infty )}\), for any T and \(\alpha \), under the assumption of known parameters \(\varvec{\lambda }_j\).

Note that A4 is a sufficient condition to get the convergence of (12) for any T, and could be relaxed if one considers that \(\max _j\left\| \Delta ^j\!\left( {\widehat{{\mathbf y}}}^{(2)}_t- {\widehat{{\mathbf y}}}^{(1)}_t \right) \right\| _2\) also converges to zero in probability as \(T\rightarrow \infty \), with a rate depending on the percentage of missing values, \(\alpha \). However, the analysis of the exact rate goes beyond the aim of this paper.

In order to deal with the estimated parameters, we consider the following Taylor expansion of \({\widehat{{\mathbf y}}}_t^{(s)}\)

$$\begin{aligned} {\widehat{{\mathbf y}}}_t^{(s)}= & {} D(\varvec{\lambda }_0){\mathbf W}{\widetilde{{\mathbf y}}}_t^{(s-1)}+D(\varvec{\lambda }_1){\widetilde{{\mathbf y}}}_{t-1}^{(s-1)}+D(\varvec{\lambda }_2){\mathbf W}{\widetilde{{\mathbf y}}}_{t-1}^{(s-1)} \\&+ D({\widehat{\varvec{\lambda }}}_0^{(s-1)}-\varvec{\lambda }_0){\mathbf W}{\widetilde{{\mathbf y}}}_t^{(s-1)}+D\left( {\widehat{\varvec{\lambda }}}_1^{(s-1)}-\varvec{\lambda }_1\right) {\widetilde{{\mathbf y}}}_{t-1}^{(s-1)} \\&+D({\widehat{\varvec{\lambda }}}_2^{(s-1)}-\varvec{\lambda }_2){\mathbf W}{\widetilde{{\mathbf y}}}_{t-1}^{(s-1)} \end{aligned}$$

where the first row of the equality is exactly the quantity analysed in (12), whereas the other rows depend on the differences \({\widehat{\varvec{\lambda }}}_j^{(s)}-\varvec{\lambda }_j, j=0,1,2\). Now, recalling (4), \({\widehat{\varvec{\lambda }}}_j^{(0)}-\varvec{\lambda }_j\) converges to zero in probability for \(T\rightarrow \infty \) by Theorem 1 in Dou et al. (2016), for all j. So, assuming that \({\widehat{\varvec{\lambda }}}_j^{(s-1)}\) and \({\widetilde{{\mathbf y}}}_t^{(s-1)}\) are consistent and following the iterative algorithm, we can conclude by induction that \({\widehat{\varvec{\lambda }}}_j^{(s)}-\varvec{\lambda }_j\) also converges to zero in probability for \(T\rightarrow \infty \). Combining this with the previous result in (12), we finally have that \({\widehat{{\mathbf y}}}_t^{(s)}\) and \({\widetilde{{\mathbf y}}}_t^{(s)}\) converge in probability to the limit \({\mathbf y}_t^{*}\), for \(T\rightarrow \infty \) and \(s\rightarrow \infty \).

Of course, the convergence rates of all the previous stochastic limits are expected to depend on the percentage of missing values, \(\alpha \). Again, the evaluation of the exact rates goes beyond the aims of this paper. Here, instead, we analyse heuristically the “quality” of the imputation output, i.e. how near the final imputed values are to the true latent ones, as a function of the proportion of missing values in the time series.

By simple algebra, we can write

$$\begin{aligned} {\mathbf y}_t-{\widehat{{\mathbf y}}}_t^{(s+1)}= & {} D(\varvec{\lambda }_0){\mathbf W}\left( {\mathbf y}_t-{\widetilde{{\mathbf y}}}_t^{(s)}\right) +\left[ D(\varvec{\lambda }_1)+D(\varvec{\lambda }_2){\mathbf W}\right] \left( {\mathbf y}_{t-1}-{\widetilde{{\mathbf y}}}_{t-1}^{(s)}\right) \\&+ \quad D\left( {\widehat{\varvec{\lambda }}}^{(s)}_0-\varvec{\lambda }_0\right) {\mathbf W}{\widetilde{{\mathbf y}}}_t^{(s)}+\left[ D\left( {\widehat{\varvec{\lambda }}}_1^{(s)}-\varvec{\lambda }_1\right) +D\left( {\widehat{\varvec{\lambda }}}_2^{(s)}-\varvec{\lambda }_2\right) {\mathbf W}\right] {\widetilde{{\mathbf y}}}_{t-1}^{(s)} + {\varvec{\varepsilon }}_t \\= & {} \quad L_1(\alpha )+L_2(\alpha )+{\varvec{\varepsilon }}_t. \end{aligned}$$

In the extreme case when all data are missing (only theoretically, of course), \(\alpha =1\) and the algorithm imputes zero to all data, since \({\widetilde{{\mathbf y}}}_t^{(s)}\equiv \mathbf{0}\) for all s and t. In such a case, \(L_2(\alpha )\equiv \mathbf{0}\) and the imputation error is

$$\begin{aligned} {\mathbf y}_t-{\widehat{{\mathbf y}}}_t^{(s)}= D(\varvec{\lambda }_0){\mathbf W}{\mathbf y}_t+\left[ D(\varvec{\lambda }_1)+D(\varvec{\lambda }_2){\mathbf W}\right] {\mathbf y}_{t-1} + {\varvec{\varepsilon }}_t\qquad \mathrm{for\ }\alpha =1, \forall s,\forall T, \end{aligned}$$

which is very far from the desired result (worst imputation quality): the error is still centered around zero but has higher variability. In the other cases, assuming \(\alpha \in [0,1)\) approximately fixed for \(T\rightarrow \infty \), we have \(({\widehat{\varvec{\lambda }}}^{(s)}_j-\varvec{\lambda }_j){\mathop {\rightarrow }\limits ^{p}} 0\) by Theorem 1 of Dou et al. (2016) and, then, \(({\mathbf y}_t-{\widetilde{{\mathbf y}}}_t^{(s)}){\mathop {\rightarrow }\limits ^{p}} 0\) by (12). Therefore, \(L_1(\alpha ){\mathop {\rightarrow }\limits ^{p}} \mathbf{0}\) and \(L_2(\alpha ){\mathop {\rightarrow }\limits ^{p}} \mathbf{0}\) for \(T\rightarrow \infty \) and the imputation error converges to

$$\begin{aligned} {\mathbf y}_t-{\widehat{{\mathbf y}}}_t^{(s)}{\mathop {\longrightarrow }\limits ^{p}} {\varvec{\varepsilon }}_t \qquad \mathrm{for\ }\alpha \in [0,1), s\rightarrow \infty \mathrm{\ and\ } T\rightarrow \infty , \end{aligned}$$
(13)

as desired (best imputation quality), although the convergence in (13) is expected to be faster the closer the proportion \(\alpha \) is to zero. The fastest convergence rate is derived in Dou et al. (2016) and is reached when \(\alpha =0\). \(\bullet \)
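The heuristic dependence of imputation quality on \(\alpha \) can be explored with a small simulation. The sketch below is only illustrative: it assumes known parameters (so the \(L_2(\alpha )\) term vanishes), Gaussian innovations, a ring-shaped weight matrix and values missing completely at random, none of which are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p, T = 4, 2000

# Hypothetical ring of 4 sites, two equally weighted neighbours each.
W = np.zeros((p, p))
for i in range(p):
    W[i, (i - 1) % p] = 0.5
    W[i, (i + 1) % p] = 0.5
A0 = 0.3 * W                        # D(lambda_0) W with lambda_0 = 0.3
A1 = 0.3 * np.eye(p) + 0.2 * W      # D(lambda_1) + D(lambda_2) W

# Simulate y_t = A0 y_t + A1 y_{t-1} + eps_t through its reduced form.
B = np.linalg.inv(np.eye(p) - A0)
eps = rng.standard_normal((T, p))
y = np.zeros((T, p))
for t in range(1, T):
    y[t] = B @ (A1 @ y[t - 1] + eps[t])

def impute(y, observed, n_iter=50):
    """Iterative imputation with known parameters, missing values started at 0."""
    y_tilde = np.where(observed, y, 0.0)
    for _ in range(n_iter):
        y_hat = y_tilde @ A0.T
        y_hat[1:] += y_tilde[:-1] @ A1.T
        y_tilde = np.where(observed, y, y_hat)
    return y_tilde

rmse = {}
for alpha in (0.1, 0.3, 0.5, 1.0):
    observed = rng.random((T, p)) >= alpha   # alpha = 1.0 leaves nothing observed
    err = (y - impute(y, observed))[~observed]
    rmse[alpha] = float(np.sqrt(np.mean(err ** 2)))
    print(f"alpha={alpha:.1f}  RMSE on missing entries = {rmse[alpha]:.3f}")
```

For \(\alpha =1\) every imputed value stays at zero, so the error equals the series itself, matching the worst case described above. For smaller \(\alpha \) one would expect the error on the missing entries to move towards the scale of the innovations, in the spirit of (13).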


Cite this article

Parrella, M.L., Albano, G., La Rocca, M. et al. Reconstructing missing data sequences in multivariate time series: an application to environmental data. Stat Methods Appl 28, 359–383 (2019). https://doi.org/10.1007/s10260-018-00435-9


  • DOI: https://doi.org/10.1007/s10260-018-00435-9
