Reconstructing missing data sequences in multivariate time series: an application to environmental data

Parrella, Maria Lucia; Albano, Giuseppina; La Rocca, Michele; Perna, Cira

doi:10.1007/s10260-018-00435-9

Original Paper
Published: 18 August 2018

Volume 28, pages 359–383, (2019)

Statistical Methods & Applications

Maria Lucia Parrella¹,
Giuseppina Albano ORCID: orcid.org/0000-0002-2317-0331¹,
Michele La Rocca¹ &
…
Cira Perna¹

Abstract

Missing data arise in many statistical analyses, often because of faults in data acquisition, and can significantly affect the conclusions drawn from the data. In environmental data, for example, a standard approach adopted by Environmental Protection Agencies is to delete the observations with incomplete information from the study, which leads to a massive underestimation of many of the indexes used to evaluate air quality. In multivariate time series, moreover, not only isolated values but also long sequences of some of the components may be missing. In such cases, it is practically impossible to reconstruct the missing sequences on the basis of the serial dependence structure alone. In this work, we propose a new procedure that reconstructs the missing sequences by exploiting the spatial correlation and the serial correlation of the multivariate time series simultaneously. The proposed procedure is based on a spatial-dynamic model and imputes the missing values using a linear combination of the contemporaneous observations at neighboring locations and their lagged values. It is specifically oriented to spatio-temporal data, although it is general enough to be applied to generic stationary multivariate time series. In this paper, the procedure is applied to pollution data, where the problem of missing sequences is of serious concern, with remarkably satisfactory performance.

References

Aga E, Samoli E, Touloumi G, Anderson HR, Cadum E, Forsberg B (2003) Short-term effects of ambient particles on mortality in the elderly: results from 28 cities in the APHEA2 project. Eur Resp J Suppl 40:28s–33s
Anselin L (1988) Spatial econometrics: methods and models. Kluwer Academic, Dordrecht
Biggeri A, Baccini M, Accetta G, Lagazio C (2002) Estimates of short-term effects of air pollutants in Italy. Epidemiologia e Prevenzione 26:203–205
Calculli C, Fassò A, Finazzi F, Pollice A, Turnone A (2015) Maximum likelihood estimation of the multivariate hidden dynamic geostatistical model with application to air quality in Apulia, Italy. Environmetrics 26:406–417
Cameletti M, Ignaccolo R, Bande S (2011) Comparing spatio-temporal models for particulate matter in Piemonte. Environmetrics 22:985–996
Dou B, Parrella ML, Yao Q (2016) Generalized Yule–Walker estimation for spatio-temporal models with unknown diagonal coefficients. J Econom 194:369–382
Fitri MDNF, Ramli NA, Yahaya AS, Sansuddin N, Ghazali NA, Al Madhoun W (2010) Monsoonal differences and probability distribution of $PM_{10}$ concentration. Environ Monit Assess 163:655–667
Honaker J, King G, Blackwell M (2011) Amelia II: a program for missing data. J Stat Softw 45(7):1–47
Josse J, Husson F (2016) missMDA: a package for handling missing values in multivariate data analysis. J Stat Softw 70(1):1–31
Junninen H, Niska H, Tuppurrainen K, Ruuskanen J, Kolehmainen M (2004) Methods for imputation of missing values in air quality data sets. Atmos Environ 38:2895–2907
Kowarik A, Templ M (2016) Imputation with the R package VIM. J Stat Softw 74(7):1–16
Lee LF, Yu J (2010a) Estimation of spatial autoregressive panel data models with fixed effects. J Econom 154:165–185
Lee LF, Yu J (2010b) Some recent developments in spatial panel data models. Reg Sci Urban Econ 40:255–271
Liu S, Molenaar PC (2014) iVAR: a program for imputing missing data in multivariate time series using vector autoregressive models. Behav Res Method 46(4):1138–1148
Moritz S, Bartz-Beielstein T (2017) imputeTS: time series missing value imputation in R. R J 9:207–218
Norazian MN, Shukri YA, Azam RN, Al Bakri AMM (2008) Estimation of missing values in air pollution data using single imputation techniques. ScienceAsia 34:341–345
Oehmcke S, Zielinski O, Kramer O (2016) kNN ensembles with penalized DTW for multivariate time series imputation. In: International joint conference on neural networks (IJCNN), IEEE
Pollice A, Lasinio GJ (2009) Two approaches to imputation and adjustment of air quality data from a composite monitoring network. J Data Sci 7:43–59
Raaschou-Nielsen O, Andersen ZJ, Beelen R, Samoli E, Stafoggia M, Weinmayr G (2013) Air pollution and lung cancer incidence in 17 European cohorts: prospective analyses from the European Study of Cohorts for Air Pollution Effects (ESCAPE). Lancet Oncol 14(9):813–822
van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67

Author information

Authors and Affiliations

Dip. di Scienze Economiche e Statistiche, University of Salerno, Fisciano (Salerno), Italy
Maria Lucia Parrella, Giuseppina Albano, Michele La Rocca & Cira Perna

Corresponding author

Correspondence to Giuseppina Albano.

Appendix A: Proof of Theorem 1

For the sake of simplicity, we assume here a process with zero mean value. First, we consider the case in which the parameters $\varvec{\lambda }_0, \varvec{\lambda }_1, \varvec{\lambda }_2$ are known. For $s\ge 1$ we have

$$\begin{aligned} {\widetilde{{\mathbf y}}}^{(s+1)}_t-{\widetilde{{\mathbf y}}}^{(s)}_t=(\mathbf{1}-\varvec{\delta }_t)\circ \left( {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t\right) \end{aligned}$$

and $\Vert {\widetilde{{\mathbf y}}}^{(s+1)}_t- {\widetilde{{\mathbf y}}}^{(s)}_t\Vert _2 \le \Vert {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t\Vert _2$, so we focus on ${\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t$. We can write

$$\begin{aligned} {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t = D(\varvec{\lambda }_0){\mathbf W}\left( {\widetilde{{\mathbf y}}}^{(s)}_t-{\widetilde{{\mathbf y}}}^{(s-1)}_t\right) + \left[ D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\right] \left( {\widetilde{{\mathbf y}}}^{(s)}_{t-1}-{\widetilde{{\mathbf y}}}^{(s-1)}_{t-1}\right) , \end{aligned}$$

and

$$\begin{aligned} \Vert {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t\Vert _2\le & {} \Vert D(\varvec{\lambda }_0){\mathbf W}({\widetilde{{\mathbf y}}}^{(s)}_t-{\widetilde{{\mathbf y}}}^{(s-1)}_t)\Vert _2\nonumber \\&+\Vert \left[ D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\right] ({\widetilde{{\mathbf y}}}^{(s)}_{t-1}-{\widetilde{{\mathbf y}}}^{(s-1)}_{t-1})\Vert _2\nonumber \\\le & {} \Vert D(\varvec{\lambda }_0){\mathbf W}\Vert _2 \Vert {\widetilde{{\mathbf y}}}^{(s)}_t-{\widetilde{{\mathbf y}}}^{(s-1)}_t\Vert _2\nonumber \\&+\Vert D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\Vert _2 \Vert {\widetilde{{\mathbf y}}}^{(s)}_{t-1}-{\widetilde{{\mathbf y}}}^{(s-1)}_{t-1}\Vert _2. \end{aligned}$$

(12)
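As an illustration of the recursion analysed in (12), the following Python sketch iterates the known-parameter update on a simulated series and records the size of each successive change. It is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `impute_known_params`, the zero starting values for the missing entries, the treatment of the first time point (no lagged term), the equal-weight matrix `W` and the coefficient values are all hypothetical choices made here. The rule that keeps observed entries and overwrites only the missing ones is inferred from the first display of this appendix.

```python
import numpy as np

def impute_known_params(y, delta, W, lam0, lam1, lam2, n_iter=50):
    """Iterate the imputation recursion with known parameters.

    y     : (T, p) data array; entries where delta == 0 are never used
    delta : (T, p) indicator array, 1 = observed, 0 = missing
    W     : (p, p) spatial weight matrix
    lam0, lam1, lam2 : (p,) coefficient vectors
    Returns the filled series and the norm of each successive change.
    """
    A = np.diag(lam0) @ W                      # D(lambda_0) W
    B = np.diag(lam1) + np.diag(lam2) @ W      # D(lambda_1) + D(lambda_2) W
    y_tilde = np.where(delta == 1, y, 0.0)     # assumed start: zeros (zero-mean process)
    changes = []
    for _ in range(n_iter):
        y_hat = y_tilde @ A.T                  # contemporaneous term, all t at once
        y_hat[1:] += y_tilde[:-1] @ B.T        # lagged term, available only for t >= 1
        y_new = np.where(delta == 1, y, y_hat) # keep observed values, overwrite missing ones
        changes.append(np.linalg.norm(y_new - y_tilde))
        y_tilde = y_new
    return y_tilde, np.array(changes)

# Hypothetical example: p = 5 locations with equal weights and zero diagonal.
rng = np.random.default_rng(0)
T, p = 200, 5
W = (np.ones((p, p)) - np.eye(p)) / (p - 1)
lam0, lam1, lam2 = np.full(p, 0.3), np.full(p, 0.3), np.full(p, 0.2)
A = np.diag(lam0) @ W
B = np.diag(lam1) + np.diag(lam2) @ W

# Simulate y_t = A y_t + B y_{t-1} + eps_t through its reduced form.
S = np.linalg.inv(np.eye(p) - A)
y = np.zeros((T, p))
for t in range(1, T):
    y[t] = S @ (B @ y[t - 1] + rng.normal(size=p))

delta = (rng.random((T, p)) > 0.2).astype(int)  # roughly 20% of entries missing
y_filled, changes = impute_known_params(y, delta, W, lam0, lam1, lam2)

# Sum of the two spectral norms that appear in (12).
rate = np.linalg.norm(A, 2) + np.linalg.norm(B, 2)
print("sum of norms:", rate)
print("first changes:", changes[:5])
```

With these hypothetical values the two spectral norms sum to approximately 0.8, so each full-series change is at most about 0.8 times the previous one, and the observed entries are never modified.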

Defining the vector operator $\Delta ^j({\mathbf x}_t)=(\mathbf{1}-\varvec{\delta }_{t-j})\circ {\mathbf x}_{t-j}$ and iterating the inequality in (12), we obtain

$$\begin{aligned} \Vert {\widehat{{\mathbf y}}}^{(s+1)}_t- {\widehat{{\mathbf y}}}^{(s)}_t\Vert _2\le & {} \sum _{j=0}^{s-1}\left( {\begin{array}{c}s-1\\ j\end{array}}\right) \Vert D(\varvec{\lambda }_0){\mathbf W}\Vert _2^{s-1-j}\Vert D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\Vert _2^{j}\, \left\| \Delta ^j\!\left( {\widehat{{\mathbf y}}}^{(2)}_t- {\widehat{{\mathbf y}}}^{(1)}_t \right) \right\| _2 \\\le & {} \left( \Vert D(\varvec{\lambda }_0){\mathbf W}\Vert _2 + \Vert D({\varvec{\lambda }_1}) + D(\varvec{\lambda }_2){\mathbf W}\Vert _2 \right) ^{s-1}\max _j\left\| \Delta ^j\!\left( {\widehat{{\mathbf y}}}^{(2)}_t- {\widehat{{\mathbf y}}}^{(1)}_t \right) \right\| _2. \end{aligned}$$
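The second inequality in the display above rests on the binomial theorem: once every norm of $\Delta ^j(\cdot )$ is replaced by its maximum over j, the binomial weights of the two operator norms add up to a power of their sum. The short Python check below illustrates this step; the function name `binomial_weight_sum` and the values 0.3 and 0.5 used for the two norms are hypothetical.

```python
from math import comb, isclose

def binomial_weight_sum(a, b, s):
    """Sum over j = 0..s-1 of C(s-1, j) * a**(s-1-j) * b**j."""
    return sum(comb(s - 1, j) * a ** (s - 1 - j) * b ** j for j in range(s))

a, b = 0.3, 0.5  # hypothetical values of the two spectral norms
for s in (2, 5, 10, 20):
    # The two printed numbers agree up to rounding and shrink as s grows when a + b < 1.
    print(s, binomial_weight_sum(a, b, s), (a + b) ** (s - 1))
```

When the two norms sum to less than one, the bound therefore decays geometrically in the iteration index s.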

In the extreme case when $\alpha =0$, we have $\Delta ^j(\cdot )\equiv \mathbf{0}$ for all t, j, because all the vectors $(\mathbf{1}-\varvec{\delta }_t)$ are zero, and the convergence of the iterative procedure is trivially proved. In the opposite case when $\alpha =1$ (only theoretically, of course), we have $\Delta ^j\equiv \mathbf{0}$ for all j because all ${\widehat{{\mathbf y}}}_t^{(s)}$ are zero vectors, so again the convergence of the iterative algorithm is trivially proved. In the remaining more realistic cases when $\alpha \in (0,1)$, the convergence is guaranteed by assumption A4. All this implies that the iterative procedure always converges to a limit ${\widetilde{{\mathbf y}}}_t^{(\infty )}$, for any T and $\alpha $, under the assumption of known parameters $\varvec{\lambda }_j$.

Note that A4 is a sufficient condition to get the convergence of (12) for any T, and could be relaxed if one considers that $\max _j\left\| \Delta ^j\!\left( {\widehat{{\mathbf y}}}^{(2)}_t- {\widehat{{\mathbf y}}}^{(1)}_t \right) \right\| _2$ also converges to zero in probability as $T\rightarrow \infty $, with a rate depending on the percentage of missing values, $\alpha $. However, the analysis of the exact rate goes beyond the aim of this paper.

In order to deal with the estimated parameters, we consider the following Taylor expansion of ${\widehat{{\mathbf y}}}_t^{(s)}$

$$\begin{aligned} {\widehat{{\mathbf y}}}_t^{(s)}= & {} D(\varvec{\lambda }_0){\mathbf W}{\widetilde{{\mathbf y}}}_t^{(s-1)}+D(\varvec{\lambda }_1){\widetilde{{\mathbf y}}}_{t-1}^{(s-1)}+D(\varvec{\lambda }_2){\mathbf W}{\widetilde{{\mathbf y}}}_{t-1}^{(s-1)} \\&+ D({\widehat{\varvec{\lambda }}}_0^{(s-1)}-\varvec{\lambda }_0){\mathbf W}{\widetilde{{\mathbf y}}}_t^{(s-1)}+D\left( {\widehat{\varvec{\lambda }}}_1^{(s-1)}-\varvec{\lambda }_1\right) {\widetilde{{\mathbf y}}}_{t-1}^{(s-1)} \\&+D({\widehat{\varvec{\lambda }}}_2^{(s-1)}-\varvec{\lambda }_2){\mathbf W}{\widetilde{{\mathbf y}}}_{t-1}^{(s-1)} \end{aligned}$$

where the first row of the equality is exactly the quantity analysed in (12), whereas the other rows depend on the differences ${\widehat{\varvec{\lambda }}}_j^{(s)}-\varvec{\lambda }_j, j=0,1,2$. Now, recalling (4), ${\widehat{\varvec{\lambda }}}_j^{(0)}-\varvec{\lambda }_j$ converges to zero in probability as $T\rightarrow \infty $ by Theorem 1 in Dou et al. (2016), for all j. So, assuming that ${\widehat{\varvec{\lambda }}}_j^{(s-1)}$ and ${\widetilde{{\mathbf y}}}_t^{(s-1)}$ are consistent and following the iterative algorithm, we can conclude by induction that ${\widehat{\varvec{\lambda }}}_j^{(s)}-\varvec{\lambda }_j$ also converges to zero in probability as $T\rightarrow \infty $. Combining this with the previous result in (12), we finally have that ${\widehat{{\mathbf y}}}_t^{(s)}$ and ${\widetilde{{\mathbf y}}}_t^{(s)}$ converge in probability to the limit ${\mathbf y}_t^{*}$, as $T\rightarrow \infty $ and $s\rightarrow \infty $.

Of course, the convergence rates of all the previous stochastic limits are expected to depend on the percentage of missing values, $\alpha $. Again, the evaluation of the exact rates goes beyond the aims of this paper. Here, instead, we analyse heuristically the “quality” of the imputation output, i.e. how near the final imputed values are to the true latent ones, as a function of the proportion of missing values in the time series.

By simple algebra, we can write

$$\begin{aligned} {\mathbf y}_t-{\widehat{{\mathbf y}}}_t^{(s+1)}= & {} D(\varvec{\lambda }_0){\mathbf W}\left( {\mathbf y}_t-{\widetilde{{\mathbf y}}}_t^{(s)}\right) +\left[ D(\varvec{\lambda }_1)+D(\varvec{\lambda }_2){\mathbf W}\right] \left( {\mathbf y}_{t-1}-{\widetilde{{\mathbf y}}}_{t-1}^{(s)}\right) \\&+ \quad D\left( {\widehat{\varvec{\lambda }}}^{(s)}_0-\varvec{\lambda }_0\right) {\mathbf W}{\widetilde{{\mathbf y}}}_t^{(s)}+\left[ D\left( {\widehat{\varvec{\lambda }}}_1^{(s)}-\varvec{\lambda }_1\right) +D\left( {\widehat{\varvec{\lambda }}}_2^{(s)}-\varvec{\lambda }_2\right) {\mathbf W}\right] {\widetilde{{\mathbf y}}}_{t-1}^{(s)} + {\varvec{\varepsilon }}_t \\= & {} \quad L_1(\alpha )+L_2(\alpha )+{\varvec{\varepsilon }}_t. \end{aligned}$$

In the extreme case when all data are missing (only theoretically, of course), $\alpha =1$ and the algorithm imputes zero to all data, since ${\widetilde{{\mathbf y}}}_t^{(s)}\equiv \mathbf{0}$ for all s and t. In such a case, $L_2(\alpha )\equiv \mathbf{0}$ and the imputation error is

$$\begin{aligned} {\mathbf y}_t-{\widehat{{\mathbf y}}}_t^{(s)}= D(\varvec{\lambda }_0){\mathbf W}{\mathbf y}_t+\left[ D(\varvec{\lambda }_1)+D(\varvec{\lambda }_2){\mathbf W}\right] {\mathbf y}_{t-1} + {\varvec{\varepsilon }}_t\qquad \mathrm{for\ }\alpha =1, \forall s,\forall T, \end{aligned}$$

which is very different from the desired result (worst imputation quality): it is still centered around zero but has higher variability. In the other cases, assuming $\alpha \in [0,1)$ approximately fixed as $T\rightarrow \infty $, we have $({\widehat{\varvec{\lambda }}}^{(s)}_j-\varvec{\lambda }_j){\mathop {\rightarrow }\limits ^{p}} 0$ by Theorem 1 of Dou et al. (2016) and, then, $({\mathbf y}_t-{\widetilde{{\mathbf y}}}_t^{(s)}){\mathop {\rightarrow }\limits ^{p}} 0$ by (12). Therefore, $L_1(\alpha ){\mathop {\rightarrow }\limits ^{p}} \mathbf{0}$ and $L_2(\alpha ){\mathop {\rightarrow }\limits ^{p}} \mathbf{0}$ as $T\rightarrow \infty $ and the imputation error converges to

$$\begin{aligned} {\mathbf y}_t-{\widehat{{\mathbf y}}}_t^{(s)}{\mathop {\longrightarrow }\limits ^{p}} {\varvec{\varepsilon }}_t \qquad \mathrm{for\ }\alpha \in [0,1), s\rightarrow \infty \mathrm{\ and\ } T\rightarrow \infty , \end{aligned}$$

(13)

as desired (best imputation quality), but the convergence rate in (13) is expected to be faster the closer the proportion $\alpha $ is to zero. The fastest convergence rate is derived in Dou et al. (2016) and is reached when $\alpha =0$. $\bullet $
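The two limiting situations discussed in this proof can be checked numerically when the parameters are known. The Python sketch below is an illustration under stated assumptions, not the authors' code: the function name `fill_missing`, the zero starting values, the equal-weight matrix `W` with zero diagonal and the coefficient values are hypothetical. With a single isolated missing value and a zero-diagonal `W`, the imputation error coincides with the innovation at that position, which is the limit in (13); with all values missing, every imputed value stays at zero, as in the case $\alpha =1$.

```python
import numpy as np

def fill_missing(y, delta, A, B, n_iter=200):
    """Known-parameter recursion; missing entries start at zero (assumed)."""
    y_tilde = np.where(delta == 1, y, 0.0)
    for _ in range(n_iter):
        y_hat = y_tilde @ A.T                    # D(lambda_0) W applied to each y_t
        y_hat[1:] += y_tilde[:-1] @ B.T          # lagged term for t >= 1
        y_tilde = np.where(delta == 1, y, y_hat) # overwrite only the missing entries
    return y_tilde

rng = np.random.default_rng(1)
T, p = 300, 4
W = (np.ones((p, p)) - np.eye(p)) / (p - 1)  # zero diagonal, rows sum to one
A = 0.3 * W                                   # D(lambda_0) W with equal coefficients
B = 0.3 * np.eye(p) + 0.2 * W                 # D(lambda_1) + D(lambda_2) W

# Simulate y_t = A y_t + B y_{t-1} + eps_t and keep the innovations.
S = np.linalg.inv(np.eye(p) - A)
eps = rng.normal(size=(T, p))
y = np.zeros((T, p))
for t in range(1, T):
    y[t] = S @ (B @ y[t - 1] + eps[t])

# Case 1: one isolated missing value, everything else observed.
t0, i0 = 150, 2
delta_one = np.ones((T, p), dtype=int)
delta_one[t0, i0] = 0
err_one = y[t0, i0] - fill_missing(y, delta_one, A, B)[t0, i0]

# Case 2: all values missing (alpha = 1).
delta_none = np.zeros((T, p), dtype=int)
filled_none = fill_missing(y, delta_none, A, B)

print("isolated gap: error", err_one, "innovation", eps[t0, i0])
print("all missing: largest imputed value", np.abs(filled_none).max())
```

Intermediate proportions of missing values fall between these two cases, since imputed neighbours and imputed lags then feed their own errors back into the recursion.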

About this article

Cite this article

Parrella, M.L., Albano, G., La Rocca, M. et al. Reconstructing missing data sequences in multivariate time series: an application to environmental data. Stat Methods Appl 28, 359–383 (2019). https://doi.org/10.1007/s10260-018-00435-9

Accepted: 05 August 2018
Published: 18 August 2018
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s10260-018-00435-9
