The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code

Data Mining and Knowledge Discovery

Abstract

The recently introduced data structure, the Matrix Profile, annotates a time series by recording the location of and distance to the nearest neighbor of every subsequence. This information trivially provides answers to queries for both time series motifs and time series discords, perhaps two of the most frequently used primitives in time series data mining. One attractive feature of the Matrix Profile is that it completely divorces the high-level details of the analytics performed from the computational “heavy lifting.” The Matrix Profile can be computed using the appropriate computational paradigm for the task at hand: CPU, GPU, FPGA, distributed computing, anytime computation, incremental computation, and so forth. However, all the details of such computation can be hidden from the analyst, who only needs to think about her analytical need. In this work, we expand on this philosophy and ask the following question: If we assume that we get the Matrix Profile for free, what interesting analytics can we do, writing at most ten lines of code? As we will show, the range of answers is surprisingly large and diverse. Our aim here is not to establish or compete with state-of-the-art results, but merely to show that we can both reproduce the results of many existing algorithms and find novel regularities in time series data collections with very little effort.
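To make the abstract's definition concrete, the following is a minimal illustrative sketch in Python, not the authors' implementation. It computes a Matrix Profile by brute force and then reads off the top motif pair and the top discord in a line or two each. The function names, the use of z-normalized Euclidean distance, and the exclusion zone of half the subsequence length are assumptions made here for illustration. The quadratic brute-force loop merely stands in for the scalable algorithms the abstract alludes to.

```python
import math


def znorm(seq):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def matrix_profile_naive(ts, m):
    """Brute-force matrix profile of ts for subsequence length m.

    Returns (profile, index): profile[i] is the distance from subsequence i
    to its nearest non-trivial neighbor, and index[i] is where that neighbor
    starts. Cost is O(n^2 * m), so this is only suitable for short series.
    """
    n_sub = len(ts) - m + 1
    excl = max(1, m // 2)  # assumed exclusion zone that skips trivial matches
    subs = [znorm(ts[i:i + m]) for i in range(n_sub)]
    profile = [math.inf] * n_sub
    index = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) < excl:
                continue  # overlapping neighbor, not a meaningful match
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            if d < profile[i]:
                profile[i], index[i] = d, j
    return profile, index


def top_motif(profile, index):
    """The motif pair is the subsequence with the smallest profile value
    together with its recorded nearest neighbor."""
    i = min(range(len(profile)), key=profile.__getitem__)
    return i, index[i]


def top_discord(profile):
    """The discord is the subsequence farthest from its nearest neighbor."""
    return max(range(len(profile)), key=profile.__getitem__)
```

Once `profile` and `index` exist, the motif and discord queries reduce to an argmin and an argmax. That is the sense in which the analytics can be separated from the computation: a faster profile routine could replace `matrix_profile_naive` without changing the two query functions.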

Notes

  1. We will revisit the idea of computationally “free” for the Matrix Profile in Sect. 4. For the case of sorting numbers, most invocations of sorting are on fewer than one million numbers, and it is possible to sort a million 32-bit numbers on a modern machine in 20 ms with essentially no space overhead. Thus, for most applications/users, it makes sense to think of sorting as a no-cost resource. Clearly, sorting can be a bottleneck for some applications, but these are rare enough that we think our claim is self-evident.

Acknowledgements

We gratefully acknowledge funding from NSF IIS-1161997 II, NASA award NNX15AM66H, USGS G16AP00034, MERL Labs and Samsung, and all the data donors.

Author information

Corresponding author

Correspondence to Yan Zhu.

Additional information

Responsible editor: Panagiotis Papapetrou.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhu, Y., Gharghabi, S., Silva, D.F. et al. The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code. Data Min Knowl Disc 34, 949–979 (2020). https://doi.org/10.1007/s10618-019-00668-6

  • DOI: https://doi.org/10.1007/s10618-019-00668-6
