Abstract
Archetypoid analysis (ADA) has proven to be a successful unsupervised statistical technique for identifying extreme observations on the periphery of the data cloud, both in classical multivariate data and in functional data. However, two questions remain open in this field: the use of ADA for outlier detection and its scalability. We propose using robust functional archetypoids and the adjusted boxplot to pinpoint functional outliers. Furthermore, we present a new archetypoid algorithm for obtaining results from large data sets in reasonable time. Functional time series arise in many practical problems, so this paper focuses on functional data settings. The new algorithm for detecting functional anomalies, called CRO-FADALARA, can be used with both univariate and multivariate curves. Our proposal for outlier detection is compared with all the state-of-the-art methods in a controlled study, where it shows good performance. Furthermore, CRO-FADALARA is applied to two large time series data sets, where the outlier curves are discussed and the reduction in computational time is clearly stated. A third case study with a small ECG data set is discussed, given its importance in functional data scenarios. All data, R code and a new R package are freely available.
Notes
Run these two commands in R to inspect all the results:
library(shiny); runUrl('path_to/Drift_data_app.zip')
Run these two commands in R to inspect all the results:
library(shiny); runUrl('path_to/Starlight_data_app.zip')
Acknowledgements
GV worked on the first version of the manuscript as a postdoctoral scholarship holder in international mobility at KU Leuven and acknowledges support from SBO grant HYMOP (150033) of the Research Foundation-Flanders (FWO-Vlaanderen). GV thanks: (i) Wannes Meert and Jesse Davis for the follow-up in the context of the HYMOP project and the suggestion of computing the variable importance; (ii) Jordi Fonollosa for the help with the gas sensor data; (iii) Sebastian Mair for the frame-based data factorization code. IE was supported by DPI2017-87333-R from the Spanish Ministry of Science, Innovation and Universities (AEI/FEDER, EU) and UJI-B2017-13 from Universitat Jaume I. The authors also thank the anonymous reviewers for their comments, and the UCI Machine Learning and UEA & UCR Time Series Classification repositories for providing open data.
Cite this article
Vinue, G., Epifanio, I. Robust archetypoids for anomaly detection in big functional data. Adv Data Anal Classif 15, 437–462 (2021). https://doi.org/10.1007/s11634-020-00412-9
DOI: https://doi.org/10.1007/s11634-020-00412-9