Disk aware discord discovery: finding unusual time series in terabyte sized datasets

Yankov, Dragomir; Keogh, Eamonn; Rebbapragada, Umaa

doi:10.1007/s10115-008-0131-9

Regular Paper
Published: 11 March 2008

Volume 17, pages 241–262, (2008)

Knowledge and Information Systems

Dragomir Yankov¹,
Eamonn Keogh¹ &
Umaa Rebbapragada²

548 Accesses
78 Citations

Abstract

The problem of finding unusual time series has recently attracted much attention, and several promising methods are now in the literature. However, virtually all proposed methods assume that the data reside in main memory. For many real-world problems this is not the case. For example, in astronomy, multi-terabyte time series datasets are the norm. When the data cannot fit in main memory, most current algorithms resort to multiple scans of the disk or tape and are thus intractable. In this work, we show how one particular definition of unusual time series, the time series discord, can be discovered with a disk aware algorithm. The proposed algorithm is exact and requires only two linear scans of the disk with a tiny buffer of main memory. Furthermore, it is very simple to implement. We use the algorithm to provide further evidence of the effectiveness of the discord definition in areas as diverse as astronomy, web query mining, and video surveillance, and show the efficiency of our method on datasets which are many orders of magnitude larger than anything else attempted in the literature.
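
The abstract does not spell out the procedure, so the following is only a minimal sketch of how an exact search could fit into two linear scans. It rests on assumptions that are not stated above: that a discord is a series whose distance to its nearest neighbor is at least a user-chosen threshold `r`, that Euclidean distance is used, and that `scan` is a hypothetical callable returning a fresh sequential pass over the data (standing in for a disk read). The names `select_candidates`, `refine`, and `find_discords` are illustrative, not the authors'.

```python
import math
from typing import Callable, Dict, Iterable, Sequence

Series = Sequence[float]
Scan = Callable[[], Iterable[Series]]  # hypothetical: one sequential pass over the data


def select_candidates(scan: Scan, r: float) -> Dict[int, Series]:
    """Scan 1: keep a small in-memory set of series that may still be discords."""
    candidates: Dict[int, Series] = {}
    for i, s in enumerate(scan()):
        is_candidate = True
        for j in list(candidates):
            if math.dist(s, candidates[j]) < r:
                # Both series have a neighbor closer than r, so neither qualifies.
                del candidates[j]
                is_candidate = False
        if is_candidate:
            candidates[i] = s
    return candidates


def refine(scan: Scan, candidates: Dict[int, Series], r: float) -> Dict[int, float]:
    """Scan 2: compute exact nearest-neighbor distances and drop false positives."""
    nn = {j: math.inf for j in candidates}
    for i, s in enumerate(scan()):
        for j in list(nn):
            if i == j:
                continue  # a series is not its own neighbor
            d = math.dist(s, candidates[j])
            if d < r:
                del nn[j]  # a neighbor within r exists, so j is not a discord
            elif d < nn[j]:
                nn[j] = d
    return nn


def find_discords(scan: Scan, r: float) -> Dict[int, float]:
    """Return {index: nearest-neighbor distance} for every series with distance >= r."""
    return refine(scan, select_candidates(scan, r), r)


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [0.5, 0.5]]
    print(find_discords(lambda: iter(data), r=5.0))
```

Under these assumptions the first scan can never discard a true discord, because a series is only dropped when some other series lies closer than `r`; the second scan then makes the result exact. Main-memory use depends on how many candidates survive the first scan, which in turn depends on the choice of `r`.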

Author information

Authors and Affiliations

Computer Science and Engineering Department, University of California, Riverside, CA, 92521, USA
Dragomir Yankov & Eamonn Keogh
Department of Computer Science, Tufts University, Medford, MA, USA
Umaa Rebbapragada

Corresponding author

Correspondence to Dragomir Yankov.

About this article

Cite this article

Yankov, D., Keogh, E. & Rebbapragada, U. Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl Inf Syst 17, 241–262 (2008). https://doi.org/10.1007/s10115-008-0131-9

Received: 29 October 2007
Revised: 04 December 2007
Accepted: 29 January 2008
Published: 11 March 2008
Issue Date: November 2008
DOI: https://doi.org/10.1007/s10115-008-0131-9
