Robust archetypoids for anomaly detection in big functional data

Abstract

Archetypoid analysis (ADA) has proven to be a successful unsupervised statistical technique for identifying extreme observations on the periphery of the data cloud, in both classical multivariate data and functional data. However, two questions remain open in this field: the use of ADA for outlier detection and its scalability. We propose using robust functional archetypoids and the adjusted boxplot to pinpoint functional outliers. Furthermore, we present a new archetypoid algorithm for obtaining results from large data sets in reasonable time. Functional time series occur in many practical problems, so this paper focuses on functional data settings. The new algorithm for detecting functional anomalies, called CRO-FADALARA, can be used with both univariate and multivariate curves. Our proposal for outlier detection is compared with all the state-of-the-art methods in a controlled study, showing good performance. Furthermore, CRO-FADALARA is applied to two large time series data sets, where the outlier curves are discussed and the reduction in computational time is reported. A third case study with a small ECG data set is discussed, given its importance in functional data scenarios. All data, R code and a new R package are freely available.

Notes

  1. https://CRAN.R-project.org/package=adamethods

  2. http://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset+at+Different+Concentrations

  3. Run these two commands in R to inspect all the results:

    library(shiny); runUrl('path_to/Drift_data_app.zip')

  4. Run these two commands in R to inspect all the results:

    library(shiny); runUrl('path_to/Starlight_data_app.zip')

Acknowledgements

GV worked on the first version of the manuscript as a postdoctoral scholarship holder in international mobility at KU Leuven and acknowledges support from SBO grant HYMOP (150033) of the Research Foundation-Flanders (FWO-Vlaanderen). GV thanks: (i) Wannes Meert and Jesse Davis for the follow-up in the context of the HYMOP project and the suggestion of computing the variable importance; (ii) Jordi Fonollosa for the help with the gas sensor data; (iii) Sebastian Mair for the frame-based data factorization code. IE was supported by DPI2017-87333-R from the Spanish Ministry of Science, Innovation and Universities (AEI/FEDER, EU) and UJI-B2017-13 from Universitat Jaume I. The authors also thank the anonymous reviewers for their comments, and the UCI Machine Learning and UEA & UCR Time Series Classification repositories for providing open data.

Author information

Corresponding author

Correspondence to Guillermo Vinue.

About this article

Cite this article

Vinue, G., Epifanio, I. Robust archetypoids for anomaly detection in big functional data. Adv Data Anal Classif (2020). https://doi.org/10.1007/s11634-020-00412-9

Keywords

  • Anomaly detection
  • Functional data analysis
  • Archetypal analysis
  • Big data
  • R package

Mathematics Subject Classification

  • 62P30