Functional outlier detection by a local depth with application to NO\(_{x}\) levels

  • Original Paper
  • Stochastic Environmental Research and Risk Assessment

Abstract

This paper proposes methods to detect outliers in functional data sets. The task of identifying atypical curves is carried out using the recently proposed kernelized functional spatial depth (KFSD). KFSD is a local depth that can be used to order the curves of a sample from the most to the least central. Since outliers are usually among the least central curves, we present a probabilistic result that allows us to select a threshold value for KFSD such that curves with depth values below the threshold are flagged as outliers. Based on this result, we propose three new outlier detection procedures. The results of a simulation study show that our proposals generally outperform a battery of competitors. We apply our procedures to a real data set consisting of daily curves of emission levels of nitrogen oxides (NO\(_{x}\)), since identifying abnormal NO\(_{x}\) levels is of interest for taking the necessary environmental policy actions.

Notes

  1. In the presence of a tie, the method with the lower false outlier detection percentage (f) is preferred.

References

  • Barnett V, Lewis T (1994) Outliers in statistical data, vol 3. Wiley, New York

  • Chakraborty A, Chaudhuri P (2014) On data depth in infinite dimensional spaces. Ann Inst Stat Math 66:303–324

  • Chen Y, Dang X, Peng H, Bart HL (2009) Outlier detection with the kernelized spatial depth function. IEEE Trans Pattern Anal Mach Intell 31:288–305

  • Cuesta-Albertos JA, Nieto-Reyes A (2008) The random Tukey depth. Comput Stat Data Anal 52:4979–4988

  • Cuevas A (2014) A partial overview of the theory of statistics with functional data. J Stat Plan Inference 147:1–23

  • Cuevas A, Fraiman R (2009) On depth measures and dual statistics. A methodology for dealing with general data. J Multivar Anal 100:753–766

  • Cuevas A, Febrero M, Fraiman R (2006) On the use of the bootstrap for estimating functions with functional data. Comput Stat Data Anal 51:1063–1074

  • Febrero M, Oviedo de la Fuente M (2012) Statistical computing in functional data analysis: the R package fda.usc. J Stat Softw 51:1–28

  • Febrero M, Galeano P, González-Manteiga W (2007) A functional analysis of NOx levels: location and scale estimation and outlier detection. Comput Stat 22:411–427

  • Febrero M, Galeano P, González-Manteiga W (2008) Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics 19:331–345

  • Ferraty F, Vieu P (2006) Nonparametric functional data analysis: theory and practice. Springer, New York

  • Fraiman R, Muniz G (2001) Trimmed means for functional data. Test 10:419–440

  • Horváth L, Kokoszka P (2012) Inference for functional data with applications. Springer, New York

  • Hyndman RJ, Shang HL (2010) Rainbow plots, bagplots, and boxplots for functional data. J Comput Graph Stat 19:29–45

  • Ignaccolo R, Franco-Villoria M, Fassò A (2015) Modelling collocation uncertainty of 3D atmospheric profiles. Stoch Environ Res Risk Assess 29:419–429

  • López-Pintado S, Romo J (2009) On the concept of depth for functional data. J Am Stat Assoc 104:718–734

  • McDiarmid C (1989) On the method of bounded differences. Surveys in combinatorics. Cambridge University Press, Cambridge, pp 148–188

  • Menafoglio A, Guadagnini A, Secchi P (2014) A kriging approach based on Aitchison geometry for the characterization of particle-size curves in heterogeneous aquifers. Stoch Environ Res Risk Assess 28:1835–1851

  • Ramsay JO, Silverman BW (2005) Functional data analysis. Springer, New York

  • Ruiz-Medina MD, Espejo RM (2012) Spatial autoregressive functional plug-in prediction of ocean surface temperature. Stoch Environ Res Risk Assess 26:335–344

  • Sguera C, Galeano P, Lillo R (2014) Spatial depth-based classification for functional data. Test 23:725–750

  • Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London

  • Sun Y, Genton MG (2011) Functional boxplots. J Comput Graph Stat 20:316–334

  • Tukey JW (1975) Mathematics and the picturing of data. Proc Int Congr Math 2:523–531

Acknowledgments

The authors would like to thank the editor in chief, the associate editor and an anonymous referee for their helpful comments. This research was partially supported by the Spanish Ministry of Science and Innovation grant ECO2011-25706 and by the Spanish Ministry of Economy and Competition grant ECO2012-38442.

Author information

Corresponding author

Correspondence to Carlo Sguera.

Appendix

1.1 From \(FSD(x, Y_{n})\) to \(KFSD(x, Y_{n})\)

To show how to pass from \(FSD(x, Y_{n})\) in (1) to \(KFSD(x, Y_{n})\) in (4), we first show that \(FSD(x, Y_{n})\) can be expressed in terms of inner products. We present this result for \(n=2\). The norm in (1) can be written as

$$\begin{aligned} \left\| \sum _{i=1}^{2}\frac{x-y_{i}}{\Vert x-y_{i}\Vert }\right\| ^{2} &= \left\| \frac{x-y_{1}}{\Vert x-y_{1}\Vert }+\frac{x-y_{2}}{\Vert x-y_{2}\Vert }\right\| ^{2}\\ &= \left\| \frac{x-y_{1}}{\sqrt{\langle x,x\rangle +\langle y_{1},y_{1}\rangle -2\langle x,y_{1}\rangle }}+\frac{x-y_{2}}{\sqrt{\langle x,x\rangle +\langle y_{2},y_{2}\rangle -2\langle x,y_{2}\rangle }}\right\| ^{2}. \end{aligned}$$

Let \(\delta _{1}=\sqrt{\langle x,x\rangle +\langle y_{1},y_{1}\rangle -2\langle x,y_{1}\rangle }\) and \(\delta _{2}=\sqrt{\langle x,x\rangle +\langle y_{2},y_{2}\rangle -2\langle x,y_{2}\rangle }\). Then,

$$\begin{aligned} \left\| \sum _{i=1}^{2}\frac{x-y_{i}}{\Vert x-y_{i}\Vert }\right\| ^{2} &= \left\| \frac{x-y_{1}}{\delta _{1}}+\frac{x-y_{2}}{\delta _{2}}\right\| ^{2} \\ &= \left\| \frac{x-y_{1}}{\delta _{1}}\right\| ^{2}+\left\| \frac{x-y_{2}}{\delta _{2}}\right\| ^{2} + \frac{2}{\delta _{1}\delta _{2}}\langle x-y_{1},x-y_{2}\rangle \\ &= 2 + \frac{2}{\delta _{1}\delta _{2}}(\langle x,x\rangle +\langle y_{1},y_{2}\rangle -\langle x,y_{1}\rangle -\langle x,y_{2}\rangle )\\ &= \sum _{i,j=1}^{2}\frac{\langle x,x\rangle +\langle y_{i},y_{j}\rangle -\langle x,y_{i}\rangle -\langle x,y_{j}\rangle }{\delta _{i}\delta _{j}},\end{aligned}$$

and we apply the embedding map \(\phi \) to all the observations in the last expression. According to (2), this is equivalent to substituting the inner product with a positive definite and stationary kernel function \(\kappa \), which explains the definition of \(KFSD(x, Y_{n})\) in (4) for \(n=2\). The generalization of this result to \(n>2\) is straightforward.
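As a numerical sanity check of the identity above, the following Python sketch compares the squared norm computed directly with its inner-product form for \(n=2\), and then swaps the inner product for a kernel. It assumes curves discretized on a common grid and stored as NumPy vectors, the plain dot product as a stand-in for the functional inner product, and a Gaussian kernel \(\kappa (u,v)=\exp (-\Vert u-v\Vert ^{2}/\sigma ^{2})\). The function names, the simulated data and the final depth formula (one minus the norm divided by \(n\)) are illustrative assumptions, since (1) and (4) are not reproduced in this appendix.

```python
import numpy as np

def spatial_norm_sq_direct(x, ys):
    """Squared norm of the sum of the unit vectors (x - y_i) / ||x - y_i||."""
    units = [(x - y) / np.linalg.norm(x - y) for y in ys]
    return float(np.linalg.norm(np.sum(units, axis=0)) ** 2)

def spatial_norm_sq_kernel(x, ys, kernel):
    """Same quantity written only through kernel(u, v) evaluations."""
    kxx = kernel(x, x)
    kxy = np.array([kernel(x, y) for y in ys])
    kyy = np.array([[kernel(a, b) for b in ys] for a in ys])
    # delta_i = sqrt(k(x,x) + k(y_i,y_i) - 2 k(x,y_i))
    delta = np.sqrt(kxx + np.diag(kyy) - 2.0 * kxy)
    # numerator (i, j) = k(x,x) + k(y_i,y_j) - k(x,y_i) - k(x,y_j)
    num = kxx + kyy - kxy[:, None] - kxy[None, :]
    return float(np.sum(num / np.outer(delta, delta)))

def linear_kernel(u, v):
    # Plain inner product: recovers the non-kernelized expression.
    return float(np.dot(u, v))

def gaussian_kernel(u, v, sigma=5.0):
    # Assumed kernel form; sigma is an arbitrary illustrative bandwidth.
    return float(np.exp(-np.sum((u - v) ** 2) / sigma ** 2))

# Hypothetical data: one curve x and n = 2 sample curves on a 20-point grid.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
ys = rng.normal(size=(2, 20))

direct = spatial_norm_sq_direct(x, ys)
via_inner = spatial_norm_sq_kernel(x, ys, linear_kernel)

# Kernelized version: assumed depth = 1 - (1/n) * sqrt(kernelized squared norm).
norm_sq_k = max(spatial_norm_sq_kernel(x, ys, gaussian_kernel), 0.0)
kfsd = 1.0 - np.sqrt(norm_sq_k) / len(ys)

print(direct, via_inner, kfsd)
```

With the linear kernel the two computations agree up to floating-point error, which is the content of the derivation; only the kernel argument changes when moving to the kernelized quantity.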

1.2 Proof of Theorem 1

As explained in Sect. 3, Theorem 1 is a functional extension of a result derived by Chen et al. (2009) for KSD. Since the two results are closely related, we report only a sketch of the proof of Theorem 1. The proof for KSD rests mostly on McDiarmid's inequality (McDiarmid 1989), which holds for general probability spaces and therefore also for functional Hilbert spaces. We state this inequality in the next lemma:

Lemma 1

(McDiarmid 1989) Let \(\Omega _{1}, \ldots , \Omega _{n}\) be probability spaces. Let \({\mathbf {\Omega }} = \prod _{j=1}^{n} \Omega _{j}\) and let \(X: {\mathbf {\Omega }} \rightarrow {\mathbb {R}}\) be a random variable. For any \(j \in \left\{ 1, \ldots , n\right\} \), let \(\left( \omega _{1}, \ldots , \omega _{j}, \ldots , \omega _{n}\right) \) and \(\left( \omega _{1}, \ldots , \hat{\omega }_{j}, \ldots , \omega _{n}\right) \) be two elements of \({\mathbf {\Omega }}\) that differ only in their \(j\)th coordinates. Assume that \(X\) is uniformly difference-bounded by \(\{c_j\}\), that is, for any \(j \in \left\{ 1, \ldots , n\right\} \),

$$\left| X\left( \omega _{1}, \ldots , \omega _{j}, \ldots , \omega _{n}\right) -X\left( \omega _{1}, \ldots , \hat{\omega }_{j}, \ldots , \omega _{n}\right) \right| \le c_{j}. \tag{12}$$

Then, if \({\mathbb {E}}[X]\) exists, for any \(\tau > 0\)

$${\mathrm {Pr}}\left( X-{\mathbb {E}}[X] \ge \tau \right) \le \exp \left( \frac{-2\tau ^2}{\sum _{j=1}^{n} c_{j}^{2}}\right) .$$
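A small Monte Carlo experiment can illustrate the lemma. The sketch below is only an illustration under assumed settings: \(X\) is taken to be the mean of \(n\) independent Uniform(0, 1) draws, so that changing a single coordinate moves \(X\) by at most \(c_{j}=1/n\); the values of `n`, `tau` and `trials` are arbitrary choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau, trials = 50, 0.1, 20000

# X = mean of n independent Uniform(0, 1) draws, one realization per row.
samples = rng.uniform(0.0, 1.0, size=(trials, n))
X = samples.mean(axis=1)

# Replacing one draw by any other value in [0, 1] changes X by at most 1/n.
c = np.full(n, 1.0 / n)

# Empirical frequency of the event {X - E[X] >= tau}, with E[X] = 0.5.
empirical = float(np.mean(X - 0.5 >= tau))

# Right-hand side of the inequality: exp(-2 tau^2 / sum_j c_j^2).
bound = float(np.exp(-2.0 * tau ** 2 / np.sum(c ** 2)))

print(empirical, bound)
```

The bound is distribution-free, so the empirical frequency is expected to sit well below it in a benign case such as this one.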

In order to apply Lemma 1 to our problem, define

$$X(z_{1},\ldots ,z_{n_{Z}}) = - \frac{1}{n_{Z}}\sum _{i=1}^{n_{Z}}g(z_{i},Y_{n_{Y}}|Y_{n_{Y}}), \tag{13}$$

whose expected value is given by

$${\mathbb {E}}[X] = {\mathbb {E}}_{z_{i}|Y_{n_{Y}}}\left[ - \frac{1}{n_{Z}}\sum _{i=1}^{n_{Z}}g(z_{i},Y_{n_{Y}}|Y_{n_{Y}})\right] = - {\mathbb {E}}_{z_{1}|Y_{n_{Y}}}\left[ g(z_{1},Y_{n_{Y}}|Y_{n_{Y}})\right] . \tag{14}$$

Now, for any \(j \in \left\{ 1, \ldots , n_{Z}\right\} \) and \(\hat{z}_{j} \in {\mathbb {H}}\), the following inequality holds

$$\left| X(z_{1},\ldots ,z_{j},\ldots ,z_{n_{Z}}) - X(z_{1},\ldots ,\hat{z}_{j},\ldots ,z_{n_{Z}})\right| \le \frac{1}{n_{Z}}, $$

and it verifies assumption (12) of Lemma 1 with \(c_{j} = 1/n_{Z}\) for every \(j\), so that \(\sum _{j=1}^{n_{Z}} c_{j}^{2} = 1/n_{Z}\). Therefore, for any \(\tau > 0\)

$${\mathrm {Pr}}\left( {\mathbb {E}}_{z_{1}|Y_{n_{Y}}}\left[ g(z_{1},Y_{n_{Y}}|Y_{n_{Y}})\right] - \frac{1}{n_{Z}}\sum _{i=1}^{n_{Z}}g(z_{i},Y_{n_{Y}}|Y_{n_{Y}}) \ge \tau \right) \le \exp \left( -2n_{Z}\tau ^{2}\right) , $$

and by the law of total probability

$$\begin{aligned} &{\mathbb {E}}\left[ {\mathrm {Pr}}\left( {\mathbb {E}}_{z_{1}|Y_{n_{Y}}}\left[ g(z_{1},Y_{n_{Y}}|Y_{n_{Y}})\right] - \frac{1}{n_{Z}}\sum _{i=1}^{n_{Z}}g(z_{i},Y_{n_{Y}}|Y_{n_{Y}}) \ge \tau \right) \right] \\ &\quad = {\mathrm {Pr}}\left( {\mathbb {E}}_{z_{1}|Y_{n_{Y}}}\left[ g(z_{1},Y_{n_{Y}})\right] - \frac{1}{n_{Z}}\sum _{i=1}^{n_{Z}}g(z_{i},Y_{n_{Y}}) \ge \tau \right) \le \exp \left( -2n_{Z}\tau ^{2}\right) . \end{aligned}$$

Next, setting \(\delta = \exp \left( -2n_{Z}\tau ^{2}\right) \), and solving for \(\tau \), the following result is obtained:

$$\tau = \sqrt{\frac{\ln 1/\delta }{2n_{Z}}}.$$
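The inversion between \(\delta \) and \(\tau \) can be checked numerically. The following sketch uses hypothetical helper names and arbitrary example values of \(\delta \) and \(n_{Z}\); it only confirms that the two expressions are inverses of each other.

```python
import math

def tau_from_delta(delta, n_z):
    """tau = sqrt(ln(1/delta) / (2 n_Z)), for 0 < delta < 1."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n_z))

def delta_from_tau(tau, n_z):
    """delta = exp(-2 n_Z tau^2), the tail bound from Lemma 1."""
    return math.exp(-2.0 * n_z * tau ** 2)

# Illustrative values: a 5% failure probability and 100 auxiliary curves.
delta, n_z = 0.05, 100
tau = tau_from_delta(delta, n_z)
print(tau, delta_from_tau(tau, n_z))
```

For a fixed \(\delta \), the slack \(\tau \) shrinks at rate \(1/\sqrt{n_{Z}}\), so quadrupling \(n_{Z}\) halves it.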

Therefore,

$${\mathrm {Pr}}\left( {\mathbb {E}}_{z_{1}|Y_{n_{Y}}}\left[ g(z_{1},Y_{n_{Y}})\right] \le \frac{1}{n_{Z}}\sum _{i=1}^{n_{Z}}g(z_{i},Y_{n_{Y}}) + \sqrt{\frac{\ln 1/\delta }{2n_{Z}}}\right) \ge 1-\delta . \tag{15}$$

However, Theorem 1 provides a probabilistic upper bound for \({\mathbb {E}}_{x|Y_{n_{Y}}}\left[ g(x,Y_{n_{Y}})\right] \), whereas (15) bounds the expectation taken with respect to \(z_{1}\). To connect the two, first recall that \(z_{1} \sim Y_{mix}\) and note that

$${\mathbb {E}}_{(z_{1} \sim Y_{mix})|Y_{n_{Y}}}\left[ g\left( z_{1},Y_{n_{Y}}\right) \right] = (1-\alpha ){\mathbb {E}}_{(z_{1} \sim Y_{nor})|Y_{n_{Y}}}\left[ g\left( z_{1},Y_{n_{Y}}\right) \right] + \alpha {\mathbb {E}}_{(z_{1} \sim Y_{out})|Y_{n_{Y}}}\left[ g\left( z_{1},Y_{n_{Y}}\right) \right] . $$

Then, since \({\mathbb {E}}_{(z_{1} \sim Y_{nor})|Y_{n_{Y}}}\left[ g\left( z_{1},Y_{n_{Y}}\right) \right] = {\mathbb {E}}_{x|Y_{n_{Y}}}\left[ g\left( x,Y_{n_{Y}}\right) \right] \), for \(\alpha >0\) we can drop the outlier term, which is non-negative for a non-negative \(g\), and obtain

$${\mathbb {E}}_{x|Y_{n_{Y}}}\left[ g\left( x,Y_{n_{Y}}\right) \right] \le \frac{1}{1-\alpha }{\mathbb {E}}_{(z_{1} \sim Y_{mix})|Y_{n_{Y}}}\left[ g\left( z_{1},Y_{n_{Y}}\right) \right] . \tag{16}$$

Consequently, combining (15) and (16), and noting that \(1/(1-\alpha ) \le 1/(1-r)\) for \(\alpha \le r < 1\), we obtain

$${\mathrm {Pr}}\left( {\mathbb {E}}_{x|Y_{n_{Y}}}\left[ g(x,Y_{n_{Y}})\right] \le \frac{1}{1-r}\left[ \frac{1}{n_{Z}}\sum _{i=1}^{n_{Z}}g(z_{i},Y_{n_{Y}}) + \sqrt{\frac{\ln 1/\delta }{2n_{Z}}}\right] \right) \ge 1-\delta ,$$

which completes the proof. \(\square \)
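To make the final bound concrete, the sketch below evaluates its right-hand side for given values \(g(z_{i},Y_{n_{Y}})\). The function name and the example inputs are hypothetical; the values of \(g\) are assumed to lie in \([0,1]\), consistent with the bounded-difference constant \(1/n_{Z}\) used in the proof.

```python
import math

def theorem1_upper_bound(g_values, r, delta):
    """Right-hand side of the final inequality:
    (1 / (1 - r)) * (mean of g_values + sqrt(ln(1/delta) / (2 n_Z)))."""
    n_z = len(g_values)
    if n_z == 0:
        raise ValueError("g_values must be non-empty")
    if not 0.0 <= r < 1.0:
        raise ValueError("r must lie in [0, 1)")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    mean_g = sum(g_values) / n_z
    tau = math.sqrt(math.log(1.0 / delta) / (2.0 * n_z))
    return (mean_g + tau) / (1.0 - r)

# Hypothetical example: 100 auxiliary curves, 10 of which have g = 1.
g_values = [1.0] * 10 + [0.0] * 90
bound = theorem1_upper_bound(g_values, r=0.05, delta=0.05)
print(bound)
```

The bound grows with \(r\) and with smaller \(\delta \), which reflects the trade-off between the assumed contamination level and the confidence \(1-\delta \).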

Cite this article

Sguera, C., Galeano, P. & Lillo, R.E. Functional outlier detection by a local depth with application to NOx levels. Stoch Environ Res Risk Assess 30, 1115–1130 (2016). https://doi.org/10.1007/s00477-015-1096-3
