Knowledge and Information Systems, Volume 17, Issue 2, pp 241–262

Disk aware discord discovery: finding unusual time series in terabyte sized datasets

  • Dragomir Yankov
  • Eamonn Keogh
  • Umaa Rebbapragada
Regular Paper

Abstract

The problem of finding unusual time series has recently attracted much attention, and several promising methods are now in the literature. However, virtually all proposed methods assume that the data reside in main memory. For many real-world problems this is not the case. For example, in astronomy, multi-terabyte time series datasets are the norm. Most current algorithms, when faced with data which cannot fit in main memory, resort to multiple scans of the disk/tape and are thus intractable. In this work we show how one particular definition of unusual time series, the time series discord, can be discovered with a disk aware algorithm. The proposed algorithm is exact and requires only two linear scans of the disk with a tiny buffer of main memory. Furthermore, it is very simple to implement. We use the algorithm to provide further evidence of the effectiveness of the discord definition in areas as diverse as astronomy, web query mining, video surveillance, etc., and show the efficiency of our method on datasets which are many orders of magnitude larger than anything else attempted in the literature.
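The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a two-scan, range-based discord search could look. It assumes that a discord is a series whose nearest-neighbor distance is at least a user-chosen threshold `r`, that Euclidean distance is used, and that `scan` is a hypothetical callable which re-reads the dataset sequentially (standing in for a linear pass over the disk). The names `find_discords`, `scan`, and `r` are illustrative and not taken from the paper.

```python
import math
from typing import Callable, Dict, Iterable, Sequence, Tuple

Series = Sequence[float]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_discords(
    scan: Callable[[], Iterable[Tuple[int, Series]]],  # hypothetical: one sequential pass
    r: float,                                          # assumed range threshold
) -> Dict[int, float]:
    """Return {id: nearest-neighbor distance} for series whose NN distance is >= r."""
    # Scan 1: candidate selection. Only the candidate buffer is held in memory.
    candidates: Dict[int, Series] = {}
    for idx, series in scan():
        is_candidate = True
        for cid in list(candidates):
            if euclidean(series, candidates[cid]) < r:
                # Both series have a neighbor closer than r: neither can be a discord.
                del candidates[cid]
                is_candidate = False
        if is_candidate:
            candidates[idx] = series

    # Scan 2: refinement. Compute exact NN distances and drop false positives.
    nn_dist: Dict[int, float] = {cid: math.inf for cid in candidates}
    for idx, series in scan():
        for cid in list(nn_dist):
            if cid == idx:
                continue  # do not compare a series with itself
            d = euclidean(series, candidates[cid])
            if d < r:
                del nn_dist[cid]
            elif d < nn_dist[cid]:
                nn_dist[cid] = d
    return nn_dist


# Toy in-memory data standing in for a disk-resident dataset (made up for illustration).
data = [(0, [0.0, 0.0]), (1, [0.0, 1.0]), (2, [1.0, 0.0]),
        (3, [10.0, 10.0]), (4, [0.5, 0.5])]
result = find_discords(lambda: iter(data), r=5.0)
print(result)
```

On this toy data only series 3 survives both scans. In the sketch, memory use is bounded by the size of the candidate buffer, so the choice of `r` matters: a smaller threshold keeps more candidates in memory, while a threshold that is too large returns no discords at all.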

Keywords

Time series · Discords · Distance outliers · Disk aware algorithms



Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Dragomir Yankov (1)
  • Eamonn Keogh (1)
  • Umaa Rebbapragada (2)
  1. Computer Science and Engineering Department, University of California, Riverside, USA
  2. Department of Computer Science, Tufts University, Medford, USA
