Disk aware discord discovery: finding unusual time series in terabyte sized datasets

  • Regular Paper
Knowledge and Information Systems

Abstract

The problem of finding unusual time series has recently attracted much attention, and several promising methods are now in the literature. However, virtually all proposed methods assume that the data reside in main memory. For many real-world problems this is not the case. For example, in astronomy, multi-terabyte time series datasets are the norm. Most current algorithms, when faced with data that cannot fit in main memory, resort to multiple scans of the disk or tape and are thus intractable. In this work we show how one particular definition of unusual time series, the time series discord, can be discovered with a disk-aware algorithm. The proposed algorithm is exact and requires only two linear scans of the disk with a tiny buffer of main memory. Furthermore, it is very simple to implement. We use the algorithm to provide further evidence of the effectiveness of the discord definition in areas as diverse as astronomy, web query mining, and video surveillance, and show the efficiency of our method on datasets that are many orders of magnitude larger than anything else attempted in the literature.
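The abstract does not spell out the steps of the algorithm, so the following is only a minimal sketch of one way a two-scan, small-buffer search could be organised. It assumes the common definition of a discord as the series whose distance to its nearest neighbour is largest, and it assumes a user-supplied range threshold `r`. The names `select_candidates`, `refine`, `find_discord`, and the `scan` callable (which stands in for one sequential pass over the disk) are hypothetical and chosen for illustration.

```python
import math
from typing import Callable, Dict, Iterable, Optional, Sequence, Tuple

Series = Sequence[float]
# Hypothetical stand-in for a linear disk scan: each call yields every series once.
Scan = Callable[[], Iterable[Series]]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_candidates(scan: Scan, r: float) -> Dict[int, Series]:
    """Scan 1: keep a small in-memory set of series that have no neighbour
    within r among the candidates held so far. Any series whose true
    nearest-neighbour distance is at least r is guaranteed to survive."""
    candidates: Dict[int, Series] = {}
    for i, s in enumerate(scan()):
        is_candidate = True
        for j in list(candidates):
            if euclidean(s, candidates[j]) < r:
                del candidates[j]      # j has a close neighbour, so drop it
                is_candidate = False   # and so does the incoming series
        if is_candidate:
            candidates[i] = s
    return candidates


def refine(scan: Scan, candidates: Dict[int, Series], r: float) -> Dict[int, float]:
    """Scan 2: compute exact nearest-neighbour distances for the candidates,
    discarding any candidate that turns out to have a neighbour within r."""
    nn = {i: math.inf for i in candidates}
    for j, s in enumerate(scan()):
        for i in list(nn):
            if i == j:
                continue               # do not compare a series with itself
            d = euclidean(s, candidates[i])
            if d < r:
                del nn[i]              # false positive from scan 1
            elif d < nn[i]:
                nn[i] = d
    return nn


def find_discord(scan: Scan, r: float) -> Optional[Tuple[int, float]]:
    """Return (index, nearest-neighbour distance) of the top discord,
    or None if r was set too large and no candidate survives."""
    nn = refine(scan, select_candidates(scan, r), r)
    if not nn:
        return None
    best = max(nn, key=nn.get)
    return best, nn[best]


# Toy usage: four points near the origin and one far away.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [1, 1]]
result = find_discord(lambda: iter(data), r=2.0)
```

In this sketch the memory footprint is governed by the size of the candidate set, which depends on the choice of `r`: a threshold that is too small keeps many candidates in memory, while one that is too large eliminates every candidate and the search has to be repeated with a smaller value.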

Author information

Corresponding author

Correspondence to Dragomir Yankov.

About this article

Cite this article

Yankov, D., Keogh, E. & Rebbapragada, U. Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl Inf Syst 17, 241–262 (2008). https://doi.org/10.1007/s10115-008-0131-9

  • DOI: https://doi.org/10.1007/s10115-008-0131-9
