Abstract
This paper proposes methods to detect outliers in functional data sets. The task of identifying atypical curves is carried out using the recently proposed kernelized functional spatial depth (KFSD). KFSD is a local depth that can be used to order the curves of a sample from the most to the least central. Since outliers are usually among the least central curves, we present a probabilistic result that allows us to select a threshold value for KFSD such that curves with depth values lower than the threshold are detected as outliers. Based on this result, we propose three new outlier detection procedures. The results of a simulation study show that our proposals generally outperform a battery of competitors. We apply our procedures to a real data set consisting of daily curves of emission levels of nitrogen oxides (NO\(_{x}\)), since identifying abnormal NO\(_{x}\) levels is of interest for taking the necessary environmental policy actions.
Notes
In the presence of a tie, the method with the lower false outlier detection percentage (f) is preferred.
References
Barnett V, Lewis T (1994) Outliers in statistical data, vol 3. Wiley, New York
Chakraborty A, Chaudhuri P (2014) On data depth in infinite dimensional spaces. Ann Inst Stat Math 66:303–324
Chen Y, Dang X, Peng H, Bart HL (2009) Outlier detection with the kernelized spatial depth function. IEEE Trans Pattern Anal Mach Intell 31:288–305
Cuesta-Albertos JA, Nieto-Reyes A (2008) The random Tukey depth. Comput Stat Data Anal 52:4979–4988
Cuevas A (2014) A partial overview of the theory of statistics with functional data. J Stat Plan Inference 147:1–23
Cuevas A, Fraiman R (2009) On depth measures and dual statistics. A methodology for dealing with general data. J Multivar Anal 100:753–766
Cuevas A, Febrero M, Fraiman R (2006) On the use of the bootstrap for estimating functions with functional data. Comput Stat Data Anal 51:1063–1074
Febrero M, Oviedo de la Fuente M (2012) Statistical computing in functional data analysis: the R package fda.usc. J Stat Softw 51:1–28
Febrero M, Galeano P, González-Manteiga W (2007) A functional analysis of NOx levels: location and scale estimation and outlier detection. Comput Stat 22:411–427
Febrero M, Galeano P, González-Manteiga W (2008) Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics 19:331–345
Ferraty F, Vieu P (2006) Nonparametric functional data analysis: theory and practice. Springer, New York
Fraiman R, Muniz G (2001) Trimmed means for functional data. Test 10:419–440
Horváth L, Kokoszka P (2012) Inference for functional data with applications. Springer, New York
Hyndman RJ, Shang HL (2010) Rainbow plots, bagplots, and boxplots for functional data. J Comput Graph Stat 19:29–45
Ignaccolo R, Franco-Villoria M, Fassò A (2015) Modelling collocation uncertainty of 3D atmospheric profiles. Stoch Environ Res Risk Assess 29:419–429
López-Pintado S, Romo J (2009) On the concept of depth for functional data. J Am Stat Assoc 104:718–734
McDiarmid C (1989) On the method of bounded differences. Surveys in combinatorics. Cambridge University Press, Cambridge, pp 148–188
Menafoglio A, Guadagnini A, Secchi P (2014) A kriging approach based on Aitchison geometry for the characterization of particle-size curves in heterogeneous aquifers. Stoch Environ Res Risk Assess 28:1835–1851
Ramsay JO, Silverman BW (2005) Functional data analysis. Springer, New York
Ruiz-Medina MD, Espejo RM (2012) Spatial autoregressive functional plug-in prediction of ocean surface temperature. Stoch Environ Res Risk Assess 26:335–344
Sguera C, Galeano P, Lillo R (2014) Spatial depth-based classification for functional data. Test 23:725–750
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London
Sun Y, Genton MG (2011) Functional boxplots. J Comput Graph Stat 20:316–334
Tukey JW (1975) Mathematics and the picturing of data. Proc Int Congr Math 2:523–531
Acknowledgments
The authors would like to thank the editor in chief, the associate editor and an anonymous referee for their helpful comments. This research was partially supported by Spanish Ministry of Science and Innovation grant ECO2011-25706 and by Spanish Ministry of Economy and Competition grant ECO2012-38442.
Appendix
1.1 From \(FSD(x, Y_{n})\) to \(KFSD(x, Y_{n})\)
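To show how to pass from \(FSD(x, Y_{n})\) in (1) to \(KFSD(x, Y_{n})\) in (4), we first show that \(FSD(x, Y_{n})\) can be expressed in terms of inner products. We present this result for \(n=2\). The norm in (1) can be written as
\[ \left\| \frac{x-y_{1}}{\Vert x-y_{1}\Vert } + \frac{x-y_{2}}{\Vert x-y_{2}\Vert } \right\| = \left\| \frac{x-y_{1}}{\sqrt{\langle x,x\rangle +\langle y_{1},y_{1}\rangle -2\langle x,y_{1}\rangle }} + \frac{x-y_{2}}{\sqrt{\langle x,x\rangle +\langle y_{2},y_{2}\rangle -2\langle x,y_{2}\rangle }} \right\| . \]
Let \(\delta _{1}=\sqrt{\langle x,x\rangle +\langle y_{1},y_{1}\rangle -2\langle x,y_{1}\rangle }\) and \(\delta _{2}=\sqrt{\langle x,x\rangle +\langle y_{2},y_{2}\rangle -2\langle x,y_{2}\rangle }\). Then,
\[ \left\| \frac{x-y_{1}}{\delta _{1}} + \frac{x-y_{2}}{\delta _{2}} \right\| = \sqrt{2 + \frac{2}{\delta _{1}\delta _{2}}\left( \langle x,x\rangle +\langle y_{1},y_{2}\rangle -\langle x,y_{1}\rangle -\langle x,y_{2}\rangle \right) }. \]
Next, apply the embedding map \(\phi \) to all the observations in the last expression. According to (2), this is equivalent to substituting the inner product with a positive definite and stationary kernel function \(\kappa \), which explains the definition of \(KFSD(x, Y_{n})\) in (4) for \(n=2\). The generalization of this result to \(n>2\) is straightforward.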
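The passage from FSD to KFSD described in this subsection can be sketched numerically. The code below is a minimal illustration, not the authors' implementation. It assumes curves observed on a common, equally spaced grid, a Riemann-sum approximation of the \(L^{2}\) inner product, a Gaussian kernel as one possible choice of \(\kappa \), and a depth of the form \(1 - \frac{1}{n}\Vert \sum _{i} (\phi (x)-\phi (y_{i}))/\delta _{i}\Vert \), with terms for which \(\delta _{i}=0\) dropped. The function names `linear_kernel`, `gaussian_kernel` and `kernel_depth` are hypothetical.

```python
import math


def linear_kernel(x, y, dt=1.0):
    """Plain inner product of two sampled curves (Riemann sum); yields FSD."""
    return sum(a * b for a, b in zip(x, y)) * dt


def gaussian_kernel(x, y, sigma=1.0, dt=1.0):
    """Gaussian kernel exp(-||x - y||^2 / sigma^2), an assumed choice of kappa."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y)) * dt
    return math.exp(-sq_dist / sigma ** 2)


def kernel_depth(x, sample, kappa):
    """Depth of curve x with respect to sample, written with kappa only.

    Uses <phi(x) - phi(y_i), phi(x) - phi(y_j)>
        = kappa(x, x) + kappa(y_i, y_j) - kappa(x, y_i) - kappa(x, y_j)
    and delta_i^2 = kappa(x, x) + kappa(y_i, y_i) - 2 * kappa(x, y_i).
    """
    kxx = kappa(x, x)
    terms = []  # (kappa(x, y_i), delta_i, y_i) for each y_i with delta_i > 0
    for y in sample:
        kxy = kappa(x, y)
        delta_sq = kxx + kappa(y, y) - 2.0 * kxy
        if delta_sq > 1e-12:  # skip curves that coincide with x
            terms.append((kxy, math.sqrt(delta_sq), y))
    total = 0.0
    for kxi, di, yi in terms:
        for kxj, dj, yj in terms:
            total += (kxx + kappa(yi, yj) - kxi - kxj) / (di * dj)
    # max() guards against tiny negative values caused by rounding
    return 1.0 - math.sqrt(max(total, 0.0)) / len(sample)
```

Passing `linear_kernel` reproduces the inner-product form of FSD, while passing `gaussian_kernel` gives the kernelized version. Under these assumptions, a curve lying far from the bulk of the sample receives a lower depth than a curve in the middle of it.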
1.2 Proof of Theorem 1
As explained in Sect. 3, Theorem 1 is a functional extension of a result derived by Chen et al. (2009) for KSD. Since the two results are closely related, we report only a sketch of the proof of Theorem 1. The proof for KSD relies mainly on McDiarmid's inequality (McDiarmid 1989), which holds in general probability spaces and therefore also in functional Hilbert spaces. We state this inequality in the next lemma:
Lemma 1
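(McDiarmid [1.2]) Let \(\Omega _{1}, \ldots , \Omega _{n}\) be probability spaces. Let \({\mathbf {\Omega }} = \prod _{j=1}^{n} \Omega _{j}\) and let \(X: {\mathbf {\Omega }} \rightarrow {\mathbb {R}}\) be a random variable. For any \(j \in \left\{ 1, \ldots , n\right\} \), let \(\left( \omega _{1}, \ldots , \omega _{j}, \ldots , \omega _{n}\right) \) and \(\left( \omega _{1}, \ldots , \hat{\omega }_{j}, \ldots , \omega _{n}\right) \) be two elements of \({\mathbf {\Omega }}\) that differ only in their \(j\)th coordinates. Assume that \(X\) is uniformly difference-bounded by \(\{c_j\}\), that is, for any \(j \in \left\{ 1, \ldots , n\right\} \),
\[ \sup _{\omega _{1}, \ldots , \omega _{n}, \hat{\omega }_{j}} \left| X\left( \omega _{1}, \ldots , \omega _{j}, \ldots , \omega _{n}\right) - X\left( \omega _{1}, \ldots , \hat{\omega }_{j}, \ldots , \omega _{n}\right) \right| \le c_{j}. \qquad (12) \]
Then, if \({\mathbb {E}}[X]\) exists, for any \(\tau > 0\),
\[ {\mathbb {P}}\left( X - {\mathbb {E}}[X] \ge \tau \right) \le \exp \left( \frac{-2\tau ^{2}}{\sum _{j=1}^{n} c_{j}^{2}}\right) . \]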
In order to apply Lemma 1 to our problem, define
whose expected value is given by
Now, for any \(j \in \left\{ 1, \ldots , n_{Z}\right\} \) and \(\hat{z}_{j} \in {\mathbb {H}}\), the following inequality holds
and it provides assumption (12) of Lemma 1. Therefore, for any \(\tau > 0\)
and by the law of total probability
Next, setting \(\delta = \exp \left( -2n_{Z}\tau ^{2}\right) \) and solving for \(\tau \), the following result is obtained:
\[ \tau = \sqrt{\frac{\ln \left( 1/\delta \right) }{2n_{Z}}}. \]
Therefore,
However, Theorem 1 provides a probabilistic upper bound for \({\mathbb {E}}_{x|Y_{n_{Y}}}\left[ g(x,Y_{n_{Y}})\right] \). First, recall that \(z_{1} \sim Y_{mix}\) and note that
Then, since \({\mathbb {E}}_{(z_{1} \sim Y_{nor})|Y_{n_{Y}}}\left[ g\left( z_{1},Y_{n_{Y}}\right) \right] = {\mathbb {E}}_{x|Y_{n_{Y}}}\left[ g\left( x,Y_{n_{Y}}\right) \right] \), for \(\alpha >0\),
Consequently, combining (15) and (16), and for \(r \ge \alpha \), we obtain
which completes the proof. \(\square \)
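The concentration step of the proof can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than part of the proof. It assumes the standard one-sided form of McDiarmid's inequality, \({\mathbb {P}}(X - {\mathbb {E}}[X] \ge \tau ) \le \exp (-2\tau ^{2}/\sum _{j} c_{j}^{2})\), which by symmetry also bounds the lower tail. It also takes \(X\) to be the mean of \(n\) independent values in \([0,1]\), so that \(c_{j} = 1/n\) and the bound reduces to \(\exp (-2n\tau ^{2})\). Bernoulli draws stand in for the bounded terms of the average, and all function names are hypothetical.

```python
import math
import random


def mcdiarmid_bound(tau, c):
    """One-sided bounded-differences bound exp(-2 tau^2 / sum_j c_j^2)."""
    return math.exp(-2.0 * tau ** 2 / sum(cj ** 2 for cj in c))


def tau_from_delta(delta, n):
    """Solve delta = exp(-2 n tau^2) for tau."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def empirical_lower_tail(n, tau, p=0.3, reps=5000, seed=0):
    """Monte Carlo estimate of P(E[X] - X >= tau), X = mean of n Bernoulli(p)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if p - mean >= tau:
            hits += 1
    return hits / reps


n, delta = 50, 0.05
tau = tau_from_delta(delta, n)
# With c_j = 1/n for every j, the bound evaluated at tau recovers delta.
bound = mcdiarmid_bound(tau, [1.0 / n] * n)
frequency = empirical_lower_tail(n, tau)
print(f"tau={tau:.4f} bound={bound:.4f} empirical={frequency:.4f}")
```

The bound is distribution-free, so the simulated frequency is typically well below \(\delta \). This is why a threshold chosen this way tends to be conservative.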
Sguera, C., Galeano, P. & Lillo, R.E. Functional outlier detection by a local depth with application to NO\(_{x}\) levels. Stoch Environ Res Risk Assess 30, 1115–1130 (2016). https://doi.org/10.1007/s00477-015-1096-3
DOI: https://doi.org/10.1007/s00477-015-1096-3