Abstract
We propose two new outlier detection methods for identifying and classifying different types of outliers in (big) functional data sets. Both are based on an existing method called Massive Unsupervised Outlier Detection (MUOD). MUOD detects and classifies outliers by computing, for each curve, three indices, all based on linear regression and correlation, which measure outlyingness in terms of shape, magnitude and amplitude relative to the other curves in the data. The first method, ‘Semifast-MUOD’, computes the indices using a sample of the observations, while the second, ‘Fast-MUOD’, computes them using the point-wise or \(L_1\) median. The classical boxplot is used to separate the indices of the outliers from those of the typical observations. Performance evaluation on simulated data shows significant improvements over MUOD in both outlier detection and computational time. We show that Fast-MUOD is especially well suited to big and dense functional data sets, requiring very little computational time compared with other methods. Further comparisons with some recent outlier detection methods for functional data also show superior or comparable outlier detection accuracy for the proposed methods. We apply the proposed methods to weather, population growth, and video data.
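The idea behind the Fast-MUOD indices can be illustrated with a minimal sketch. It assumes one plausible reading of the description above: each curve is regressed on the point-wise median curve, the shape index is taken as the distance of the Pearson correlation from 1, the amplitude index as the distance of the regression slope from 1, and the magnitude index as the absolute intercept. The function names `fast_muod_indices` and `boxplot_outliers`, the 1.5 whisker factor and the toy data are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np


def fast_muod_indices(curves):
    """Return (shape, amplitude, magnitude) indices for each curve.

    curves: array of shape (n_curves, n_points), all observed on one grid.
    Assumes no curve is constant (a constant curve has zero variance,
    which would leave the correlation undefined).
    """
    curves = np.asarray(curves, dtype=float)
    n_points = curves.shape[1]

    # Reference curve: the point-wise median across all curves.
    ref = np.median(curves, axis=0)
    ref_centered = ref - ref.mean()
    ref_var = np.mean(ref_centered ** 2)

    curve_means = curves.mean(axis=1)
    centered = curves - curve_means[:, None]
    cov = centered @ ref_centered / n_points
    curve_sd = np.sqrt(np.mean(centered ** 2, axis=1))

    corr = cov / (curve_sd * np.sqrt(ref_var))    # correlation with the median
    slope = cov / ref_var                         # slope of curve regressed on median
    intercept = curve_means - slope * ref.mean()  # intercept of the same regression

    shape = np.abs(corr - 1.0)
    amplitude = np.abs(slope - 1.0)
    magnitude = np.abs(intercept)
    return shape, amplitude, magnitude


def boxplot_outliers(index, whisker=1.5):
    """Positions whose index exceeds the upper boxplot fence Q3 + whisker * IQR."""
    q1, q3 = np.percentile(index, [25, 75])
    return np.flatnonzero(index > q3 + whisker * (q3 - q1))


# Toy data (hypothetical): 50 noisy sine curves with three planted outliers.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)
curves = np.sin(2 * np.pi * t) + rng.normal(scale=0.05, size=(50, t.size))
curves[0] += 5.0   # magnitude outlier: shifted upwards
curves[1] *= 3.0   # amplitude outlier: stretched vertically
curves[2] = np.sin(4 * np.pi * t) + rng.normal(scale=0.05, size=t.size)  # shape outlier

shape, amplitude, magnitude = fast_muod_indices(curves)
for name, index in [("shape", shape), ("amplitude", amplitude), ("magnitude", magnitude)]:
    print(name, boxplot_outliers(index))
```

Because only the upper fence is used, a curve is flagged when its index is unusually large, and the index in which it is flagged gives its outlier type. A single reference curve keeps the cost linear in the number of curves, which is consistent with the computational advantage claimed for Fast-MUOD.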
Code availability
Code for the proposed methods is available online on GitHub at https://github.com/otsegun/fastmuod.
Acknowledgements
This research was funded in part by Agencia Estatal de Investigación (AEI) grant number AEI/10.13039/501100011033. This research was also partially supported by the Regional Government of Madrid (CM) grant EdgeData-CM (P2018/TCS4499, cofunded by FSE & FEDER) and Agencia Estatal de Investigación (AEI) grant PID2019-109805RB-I00/AEI/10.13039/501100011033. The authors are grateful to the editor and the referees for their constructive and insightful comments that led to considerable improvements in this paper.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Supplementary Information
Below is the link to the electronic supplementary material.
11634_2021_460_MOESM1_ESM.pdf
Simulation results (ESM_1.pdf): Additional simulation results comparing Fast-MUOD computed with the \(L_1\) median against Fast-MUOD computed with the point-wise median, and showing the outlier detection performance of all methods considered at higher contamination rates. Also included are comparisons of different correlation coefficients for computing the shape index \(I_S\). Further simulation results using smaller sample sizes and fewer evaluation points are presented, together with a sensitivity analysis of outlier detection performance when more noise is added to the simulation models.
Cite this article
Ojo, O.T., Fernández Anta, A., Lillo, R.E. et al. Detecting and classifying outliers in big functional data. Adv Data Anal Classif 16, 725–760 (2022). https://doi.org/10.1007/s11634-021-00460-9
DOI: https://doi.org/10.1007/s11634-021-00460-9