Unsupervised and scalable subsequence anomaly detection in large data series

A Correction to this article was published on 31 August 2021

This article has been updated

Abstract

Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, the approaches proposed so far in the literature have severe limitations: they either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In this work, we address these problems and propose NormA, a novel approach suitable for domain-agnostic anomaly detection. NormA is based on a new data series primitive, which makes it possible to detect anomalies based on their (dis)similarity to a model that represents normal behavior. The experimental results on several real datasets demonstrate that the proposed approach correctly identifies all single and recurrent anomalies of various types, with no prior knowledge of the characteristics of these anomalies (except for their length). Moreover, it outperforms the current state-of-the-art algorithms by a large margin in terms of accuracy, while being orders of magnitude faster.
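The core idea stated in the abstract, scoring subsequences by their dissimilarity to a model of normal behavior, can be illustrated with a minimal Python sketch. This is not the NormA algorithm: the abstract does not say how the normal model is built or how distances are combined. The function names `znorm_rows` and `anomaly_scores`, the pre-supplied `normal_patterns` and `weights`, the z-normalized Euclidean distance, and the synthetic sine data are all assumptions made purely for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def znorm_rows(m):
    """Z-normalize each row; constant rows are mapped to all zeros."""
    m = np.asarray(m, dtype=float)
    mean = m.mean(axis=1, keepdims=True)
    std = m.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for flat rows
    return (m - mean) / std


def anomaly_scores(series, normal_patterns, weights, length):
    """Score every subsequence of `length` by its weighted distance to the
    closest same-length piece of each (hypothetical) normal pattern."""
    windows = znorm_rows(sliding_window_view(np.asarray(series, float), length))
    scores = np.zeros(len(windows))
    for pattern, weight in zip(normal_patterns, weights):
        # Each normal pattern is assumed to be at least `length` points long.
        pieces = znorm_rows(sliding_window_view(np.asarray(pattern, float), length))
        # Pairwise squared Euclidean distances, shape (n_windows, n_pieces).
        sq = ((windows ** 2).sum(axis=1)[:, None]
              + (pieces ** 2).sum(axis=1)[None, :]
              - 2.0 * windows @ pieces.T)
        dist = np.sqrt(np.maximum(sq, 0.0))
        scores += weight * dist.min(axis=1)
    return scores


# Synthetic example: a periodic signal with one injected anomaly (a ramp).
t = np.arange(1000)
series = np.sin(2 * np.pi * t / 50)
series[600:650] = np.linspace(-1, 1, 50)

# Hand-made normal model: two periods of the clean signal, with weight 1.
normal_patterns = [np.sin(2 * np.pi * np.arange(100) / 50)]
weights = [1.0]

scores = anomaly_scores(series, normal_patterns, weights, length=50)
print("highest-scoring subsequence starts at", int(np.argmax(scores)))
```

In this sketch the subsequence length is the only parameter tied to the anomaly itself, mirroring the abstract's statement that only the anomaly length is assumed to be known. Subsequences that overlap the injected ramp receive high scores, while the rest of the series scores close to zero.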

Notes

  1. If the dimension that imposes the ordering of the sequence is time then we talk about time series. In the rest of this paper, we will use the terms sequence, data series, and time series interchangeably.

  2. http://www.safran-group.com/.

  3. A preliminary version of this paper and a corresponding demo paper have appeared elsewhere [10, 11].

  4. The authors of these papers define the problem as kth-discord discovery.

References

  1. http://data-acoustics.com/measurements/bearing-faults/bearing-4/ (2007)

  2. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (2015)

  3. http://helios.mi.parisdescartes.fr/~themisp/norma/

  4. Abboud, D., Elbadaoui, M., Smith, W., Randall, R.: Advanced bearing diagnostics: a comparative study of two powerful approaches. MSSP 114 (2019)

  5. Abdul-Aziz, A., Woike, M.R., Oza, N.C., Matthews, B.L., Lekki, J.D.: Rotor health monitoring combining spin tests and data-driven anomaly detection methods. Struct. Health Monit. (2012)

  6. Ahmad, S., Lavin, A., Purdy, S., Agha, Z.: Unsupervised real-time anomaly detection for streaming data. Neurocomputing (2017)

  7. Antoni, J., Borghesani, P.: A statistical methodology for the design of condition indicators. Mech. Syst. Signal Process. 290–327 (2019)

  8. Bagnall, A.J., Cole, R.L., Palpanas, T., Zoumpatianos, K.: Data series management (Dagstuhl seminar 19282). Dagstuhl Rep. 9(7), 24–39 (2019)

  9. Barnett, V., Lewis, T.: Outliers in Statistical Data. Wiley, New York (1994)

  10. Boniol, P., Linardi, M., Roncallo, F., Palpanas, T.: Automated Anomaly Detection in Large Sequences. In: ICDE, pp. 1834–1837 (2020)

  11. Boniol, P., Linardi, M., Roncallo, F., Palpanas, T.: SAD: an unsupervised system for subsequence anomaly detection. In: 36th IEEE International Conference on Data Engineering, ICDE, pp. 1778–1781. IEEE (2020)

  12. Boniol, P., Palpanas, T.: Series2graph: graph-based subsequence anomaly detection for time series. Proc. VLDB Endow. 13(11), 1821–1834 (2020)

  13. Boniol, P., Palpanas, T., Meftah, M., Remy, E.: Graphan: graph-based subsequence anomaly detection. Proc. VLDB Endow. 13(12), 2941–2944 (2020)

  14. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: SIGMOD (2000)

  15. Bryant, P.G.: On the minimum description length (MDL) principle for hierarchical classifications. In: Data Science, Classification, and Related Methods (1998)

  16. Bu, Y., Chen, L., Fu, A.W.C., Liu, D.: Efficient anomaly monitoring over moving object trajectory streams. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 159–168. Association for Computing Machinery, New York, NY, USA (2009). https://doi.org/10.1145/1557019.1557043

  17. Bu, Y., Leung, O.T., Fu, A.W., Keogh, E.J., Pei, J., Meshkin, S.: WAT: finding top-k discords in time series database. In: SIAM (2007)

  18. Chiu, B.Y., Keogh, E.J., Lonardi, S.: Probabilistic discovery of time series motifs. In: KDD (2003)

  19. Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. PVLDB 2, 112–127 (2018)

  20. Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: Return of the lernaean hydra: experimental evaluation of data series approximate similarity search. PVLDB 13, 402–419 (2019)

  21. Fu, A.W., Leung, O.T., Keogh, E.J., Lin, J.: Finding time series discords based on Haar transform. In: ADMA, pp. 31–41 (2006)

  22. Gharghabi, S., Yeh, C.M., Ding, Y., Ding, W., Hibbing, P., LaMunion, S., Kaplan, A., Crouter, S.E., Keogh, E.J.: Domain agnostic online semantic segmentation for multi-dimensional time series. Data Min. Knowl. Discov. 33(1), 96–130 (2019)

  23. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000 (June 13)). Circulation Electronic Pages: http://circ.ahajournals.org/content/101/23/e215.full PMID:1085218; https://doi.org/10.1161/01.CIR.101.23.e215

  24. Grabocka, J., Schilling, N., Schmidt-Thieme, L.: Latent time-series motifs. TKDD 11(1), 6:1–6:20 (2016)

  25. Hadjem, M., Naït-Abdesselam, F., Khokhar, A.A.: ST-segment and T-wave anomalies prediction in an ECG data using RUSBoost. In: Healthcom (2016)

  26. Keogh, E., Lin, J.: Clustering of time-series subsequences is meaningless: implications for previous and future research. KAIS 8(2) (2004)

  27. Keogh, E., Lonardi, S., Ratanamahatana, C., Wei, L., Lee, S.H., Handley, J.: Compression-based data mining of sequential data. DMKD 14, 99–129 (2007)

  28. Keogh, E.J., Lin, J., Fu, A.W.: HOT SAX: efficiently finding the most unusual time series subsequence. In: ICDM (2005)

  29. Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: sortable summarizations for scalable indexes over static and streaming data series. VLDBJ 28(6) (2019)

  30. Lee, J., Han, J., Li, X.: Trajectory outlier detection: a partition-and-detect framework. In: 2008 IEEE 24th International Conference on Data Engineering, pp. 140–149 (2008)

  31. Lee, T., Gottschlich, J., Tatbul, N., Metcalf, E., Zdonik, S.: Greenhouse: a zero-positive machine learning system for time-series anomaly detection. CoRR abs/1801.03168 (2018). http://arxiv.org/abs/1801.03168

  32. Li, X., Lin, J.: Linear time motif discovery in time series. In: Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 136–144. SIAM (2019)

  33. Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: the ulisse approach. PVLDB 11, 2236–2248 (2019)

  34. Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.: Matrix profile X: Valmod - scalable discovery of variable-length motifs in data series. In: SIGMOD (2018)

  35. Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix Profile Goes MAD: variable-length motif and discord discovery in data series. In: DAMI (2020)

  36. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: ICDM (2008)

  37. Liu, Y., Chen, X., Wang, F.: Efficient detection of discords for time series stream. In: Advances in Data and Web Management (2009)

  38. Luo, W., Gallagher, M.: Faster and parameter-free discord search in quasi-periodic time series. In: Advances in Knowledge Discovery and Data Mining (2011)

  39. Malhotra, P., Vig, L., Shroff, G., Agarwal, P.: Long short term memory networks for anomaly detection in time series. In: ESANN (2015)

  40. Moody, G.B., Mark, R.G.: The impact of the MIT-BIH arrhythmia database. IEEE Eng. Med. Biol. Mag. 20, 45–50 (2001)

  41. Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B.: Exact discovery of time series motifs. In: SDM (2009)

  42. Palpanas, T.: Data series management: the road to big sequence analytics. SIGMOD Rec. 44(2), 47–52 (2015)

  43. Palpanas, T.: Evolution of a Data Series Index. In: CCIS, pp. 68–83 (2020)

  44. Palpanas, T., Beckmann, V.: Report on the first and second interdisciplinary time series analysis workshop (ITISA). SIGREC 48(3) (2019)

  45. Paparrizos, J., Gravano, L.: K-shape: efficient and accurate clustering of time series. SIGMOD Rec. 45(1), 69–76 (2016). https://doi.org/10.1145/2949741.2949758

  46. Boniol, P. (advisor: Palpanas, T.): Unsupervised subsequence anomaly detection in large sequences. In: Proceedings of the VLDB 2020 PhD Workshop colocated with the 46th International Conference on Very Large Databases (VLDB 2020), CEUR Workshop Proceedings, vol. 2652 (2020)

  47. Peng, B., Palpanas, T., Fatourou, P.: Messi: In-memory data series indexing. In: ICDE (2020)

  48. Peng, B., Palpanas, T., Fatourou, P.: Paris+: data series indexing on multi-core architectures. In: TKDE (2020)

  49. Rakthanmanon, T., Keogh, E.J., Lonardi, S., Evans, S.: Time series epenthesis: clustering time series streams requires ignoring some data. In: 2011 IEEE 11th International Conference on Data Mining, pp. 547–556 (2011)

  50. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)

  51. Safran: Personal communication with Dr. Dohy Hong (2018)

  52. Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S.: Time series anomaly discovery with grammar-based compression. In: EDBT (2015)

  53. Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S.: Grammarviz 3.0: Interactive discovery of variable-length time series patterns. TKDD 12, 1–28 (2018)

  54. Shieh, J., Keogh, E.: iSAX: disk-aware mining and indexing of massive time series datasets. DMKD 19, 24–27 (2009)

  55. Subramaniam, S., Palpanas, T., Papadopoulos, D., Kalogeraki, V., Gunopulos, D.: Online outlier detection in sensor data using non-parametric models. In: VLDB (2006)

  56. Wang, J., Balasubramanian, A., de la Vega, L.M., Green, J., Samal, A., Prabhakaran, B.: Word recognition from continuous articulatory movement time-series data using symbolic representations. In: SLPAT (2013)

  57. Wang, X., Lin, J., Patel, N., Braun, M.: A self-learning and online algorithm for time series anomaly detection, with application in CPU manufacturing. In: CIKM (2016)

  58. Whitney, C., Gottlieb, D., Redline, S., Norman, R., Dodge, R., Shahar, E., Surovec, S., Nieto, F.: Reliability of scoring respiratory disturbance indices and sleep staging. Sleep 21, 749–757 (1998)

  59. Wilcoxon, F.: Individual comparisons by ranking methods. Biom. Bull. 1(6), 80–83 (1945). http://www.jstor.org/stable/3001968

  60. Wu, Q., Qi, X., Fuller, E., Zhang, C.Q.: Follow the leader: A centrality guided clustering and its application to social network analysis. Sci. World J. (2013)

  61. Yankov, D., Keogh, E., Rebbapragada, U.: Disk aware discord discovery: finding unusual time series in terabyte sized datasets. In: ICDM (2007)

  62. Yankov, D., Keogh, E., Rebbapragada, U.: Disk aware discord discovery: finding unusual time series in terabyte sized datasets. KAIS 17(2) (2008)

  63. Yankov, D., Keogh, E.J., Medina, J., Chiu, B.Y., Zordan, V.B.: Detecting time series motifs under uniform scaling. In: KDD (2007)

  64. Yeh, C., Zhu, Y., Ulanova, L., Begum, N., Ding, Y., Dau, H., Silva, D., Mueen, A., Keogh, E.: Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: ICDM (2016)

  65. Yu, Y., Cao, L., Rundensteiner, E.A., Wang, Q.: Outlier detection over massive-scale trajectory streams. ACM Trans. Database Syst. (TODS) 42, 1–33 (2017)

  66. Zhu, Y., Zimmerman, Z., Senobari, N.S., Yeh, C.M., Funning, G., Mueen, A., Brisk, P., Keogh, E.: Matrix profile II: Exploiting a novel algorithm and GPUs to break the one hundred million barrier for time series motifs and joins. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 739–748 (2016). https://doi.org/10.1109/ICDM.2016.0085

Author information

Corresponding author

Correspondence to Paul Boniol.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Boniol, P., Linardi, M., Roncallo, F. et al. Unsupervised and scalable subsequence anomaly detection in large data series. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00655-8

Keywords

  • Data series
  • Time series
  • Anomalies discovery