Exploring variable-length time series motifs in one hundred million length scale

Abstract

The discovery of repeated patterns of different lengths, known as variable-length motifs, has received considerable attention in recent years. However, existing algorithms for detecting variable-length motifs in large-scale time series are very time-consuming. In this paper, we introduce Distance-Propagation Sequitur (DP-Sequitur), a time- and space-efficient approximate algorithm for discovering variable-length motifs in large-scale time series (e.g., over one hundred million points in length). The discovered motifs can be ranked by different metrics, such as frequency or similarity, and can benefit a wide variety of real-world applications. We show that our approach can discover motifs in time series with over one hundred million points in just minutes, significantly faster than the fastest existing algorithm to date, and we demonstrate its advantage over the state of the art on several real-world time series datasets.
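To make the notion of a variable-length motif concrete, the sketch below searches a short series for the closest pair of non-overlapping subsequences over a range of lengths. It is not DP-Sequitur, whose details the abstract does not give; it is a brute-force baseline under assumptions of our own: z-normalized Euclidean distance as the similarity measure, division by the square root of the length to compare candidates of different lengths, and the hypothetical names `znorm` and `best_motif_pair`. Its cost grows quadratically with the series length for each candidate length, which is the kind of expense that makes an approximate method attractive at large scale.

```python
import math


def znorm(x):
    """Z-normalize a sequence; a constant sequence maps to all zeros."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std < 1e-12:
        return [0.0] * n
    return [(v - mean) / std for v in x]


def best_motif_pair(series, min_len, max_len):
    """Brute-force search for the closest non-overlapping subsequence pair.

    Returns (score, i, j, length), where score is the z-normalized Euclidean
    distance divided by sqrt(length) (an assumed normalization) so that
    candidates of different lengths can be compared.
    """
    best = None
    for m in range(min_len, max_len + 1):
        # All z-normalized subsequences of length m.
        subs = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
        for i in range(len(subs)):
            # Start j at i + m to skip overlapping (trivial) matches.
            for j in range(i + m, len(subs)):
                d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
                score = d / math.sqrt(m)
                if best is None or score < best[0]:
                    best = (score, i, j, m)
    return best
```

For example, `best_motif_pair(series, 4, 6)` considers every candidate of length 4, 5, and 6 in a list of numbers `series` and reports the start positions and length of the closest pair found.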

Notes

  1. Example of Canebrake Groundcreeper records containing motif A: Athanas

  2. Example of Streak-backed Canastero records containing motif B: Calderon-F

References

  1. Athanas N. Xc22831. Accessible at www.xeno-canto.org/22831. Accessed 11 Aug 2008

  2. Begum N, Keogh E (2014) Rare time series motif discovery from unbounded streams. Proc VLDB Endow 8(2):149–160

  3. Bob P, Willem-Pier V, Sander P, Jonathon J (2005) Xeno-Canto. www.xeno-canto.org. Accessed 30 May 2005

  4. Boesman P. Xc221161. Accessible at www.xeno-canto.org/221161

  5. Calderon-F D. Xc301107. Accessible at www.xeno-canto.org/301107. Accessed 13 Dec 2015

  6. Castro N, Azevedo PJ (2010) Multiresolution motif discovery in time series. In: Proceedings of the 2010 SIAM international conference on data mining. SIAM, pp 665–676

  7. Chiu B, Keogh E, Lonardi S (2003) Probabilistic discovery of time series motifs. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 493–498

  8. Gao Y, Lin J, Rangwala H (2016) Iterative grammar-based framework for discovering variable-length time series motifs. In: 15th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 7–12

  9. Gao Y, Li Q, Li X, Lin J, Rangwala H (2017) Trajviz: a tool for visualizing patterns and anomalies in trajectory. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, pp 428–431

  10. Giancarlo R, Scaturro D, Utro F (2009) Textual data compression in computational biology: a synopsis. Bioinformatics 25(13):1575–1586

  11. Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE (2000) Physiobank, physiotoolkit, and physionet components of a new research resource for complex physiologic signals. Circulation 101(23):e215–e220

  12. Hughes JF, Skaletsky H, Pyntikova T, Graves TA, van Daalen SK, Minx PJ, Fulton RS, McGrath SD, Locke DP, Friedman C et al (2010) Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature 463(7280):536

  13. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D (2002) The human genome browser at UCSC. Genome Res 12(6):996–1006

  14. Keogh E, Lonardi S, Zordan VB, Lee SH, Jara M (2005a) Visualizing the similarity of human and chimp DNA (multimedia video). http://www.cs.ucr.edu/~eamonn/DNA/

  15. Keogh E, Lin J, Fu A (2005b) Hot sax: efficiently finding the most unusual time series subsequence. In: 2005 IEEE 5th international conference on data mining (ICDM), p 8

  16. Krabbe N. Xc235579. Accessible at www.xeno-canto.org/235579

  17. Li Y, Lin J, Oates T (2012) Visualizing variable-length time series motifs. In: Proceedings of the 2012 SIAM international conference on data mining. SIAM, pp 895–906

  18. Li Y, Yiu ML, Gong Z et al (2015) Quick-motif: an efficient and scalable framework for exact motif discovery. In: 2015 IEEE 31st international conference on data engineering (ICDE). IEEE, pp 579–590

  19. Lin J, Keogh E, Lonardi S, Lankford JP, Nystrom DM (2004) Visually mining and monitoring massive time series. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 460–469

  20. Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144

  21. Lines J, Davis LM, Hills J, Bagnall A (2012) A shapelet transform for time series classification. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 289–297

  22. Liu B, Li J, Chen C, Tan W, Chen Q, Zhou M (2015) Efficient motif discovery for large-scale time series in healthcare. IEEE Trans Ind Inform 11(3):583–590

  23. Locke DP, Hillier LW, Warren WC, Worley KC, Nazareth LV, Muzny DM, Yang S-P, Wang Z, Chinwalla AT, Minx P et al (2011) Comparative and demographic analysis of orang-utan genomes. Nature 469(7331):529

  24. Mohammad Y, Nishida T (2009) Constrained motif discovery in time series. New Gener Comput 27(4):319–346

  25. Mohammad Y, Nishida T (2014a) Exact discovery of length-range motifs. In: Intelligent information and database systems. Springer, pp 23–32

  26. Mohammad Y, Nishida T (2014b) Scale invariant multi-length motif discovery. In: Modern advances in applied intelligence. Springer, pp 417–426

  27. Mueen A (2013) Enumeration of time series motifs of all lengths. In: 2013 IEEE 13th international conference on data mining (ICDM). IEEE, pp 547–556

  28. Mueen A, Keogh E (2010) Online discovery and maintenance of time series motifs. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1089–1098

  29. Mueen A, Keogh EJ, Zhu Q, Cash S, Westover MB (2009) Exact discovery of time series motifs. In: Proceedings of the 2009 SIAM international conference on data mining. SIAM, pp 473–484

  30. Mueen A, Viswanathan K, Gupta C, Keogh E (2015) The fastest similarity search algorithm for time series subsequences under Euclidean distance. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html

  31. Murray D, Liao J, Stankovic L, Stankovic V, Hauxwell-Baldwin R, Wilson C, Coleman M, Kane T, Firth S (2015) A data management platform for personalised real-time energy feedback. In: Proceedings of the 8th international conference on energy efficiency in domestic appliances and lighting, pp 1–15

  32. Nevill-Manning CG, Witten IH (1997) Identifying hierarchical structure in sequences: a linear-time algorithm. J Artif Intell Res (JAIR) 7:67–82

  33. Nunthanid P, Niennattrakul V, Ratanamahatana CA (2011) Discovery of variable length time series motif. In: 2011 8th international conference on electrical engineering/electronics, computer, telecommunications and information technology (ECTI-CON). IEEE, pp 472–475

  34. Patel P, Keogh E, Lin J, Lonardi S (2002) Mining motifs in massive time series databases. In: Proceedings of the 2002 IEEE international conference on data mining (ICDM). IEEE, pp 370–377

  35. Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 262–270

  36. Senin P, Malinchik S (2013) Sax-vsm: Interpretable time series classification using sax and vector space model. In: 2013 IEEE 13th international conference on data mining (ICDM). IEEE, pp 1175–1180

  37. Senin P, Lin J, Wang X, Oates T, Gandhi S, Boedihardjo AP, Chen C, Frankenstein S, Lerner M (2014) Grammarviz 2.0: a tool for grammar-based pattern discovery in time series. In: Machine learning and knowledge discovery in databases. Springer, pp 468–472

  38. Shieh J, Keogh E (2009) iSAX: disk-aware mining and indexing of massive time series datasets. Data Min Knowl Discov 19(1):24–57

  39. Shokoohi-Yekta M, Chen Y, Campana B, Hu B, Zakaria J, Keogh E (2015) Discovery of meaningful rules in time series. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1085–1094

  40. Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum HS, Hillier L, Brown LG, Repping S, Pyntikova T, Ali J, Bieri T et al (2003) The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423(6942):825–837

  41. Tang H, Liao SS (2008) Discovering original motifs with different lengths from time series. Knowl Based Syst 21(7):666–671

  42. Wang X, Lin J, Senin P, Oates T, Gandhi S, Boedihardjo AP, Chen C, Frankenstein S (2016) RPM: Representative pattern mining for efficient time series classification. In: 19th international conference on extending database technology (EDBT), pp 185–196

  43. Yeh C-CM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, Silva DF, Mueen A, Keogh E (2016) Matrix profile i: All pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 1317–1322

  44. Zhu Y, Schall-Zimmerman Z, Senobari NS, Yeh C-CM, Funning G, Mueen A, Brisk P, Keogh EJ (2016) Matrix profile ii: exploiting a novel algorithm and gpus to break the one hundred million barrier for time series motifs and joins. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 739–748

Author information

Corresponding author

Correspondence to Yifeng Gao.

Additional information

Responsible editors: Kurt Driessens, Dragi Kocev, Marko Robnik-Šikonja, Myra Spiliopoulou.

About this article

Cite this article

Gao, Y., Lin, J. Exploring variable-length time series motifs in one hundred million length scale. Data Min Knowl Disc 32, 1200–1228 (2018). https://doi.org/10.1007/s10618-018-0570-1

Keywords

  • Time series data mining
  • Motif discovery
  • Variable length