ParCorr: efficient parallel methods to identify similar time series pairs across sliding windows

Abstract

Consider the problem of finding the highly correlated pairs of time series over a time window, and then sliding that window to find the highly correlated pairs over successive windows, where each window starts only a little after the previous one. Doing this efficiently and in parallel could help in applications such as sensor fusion, financial trading, and communications network monitoring. We have developed a parallel, incremental random vector (sketching) approach to this problem and compared it with the state-of-the-art nearest neighbor method iSAX. Whereas iSAX achieves 100% recall and precision for Euclidean distance, the sketching approach is, empirically, at least 10 times faster and achieves 95% recall and 100% precision on real and simulated data. For many applications this speedup is worth the minor reduction in recall. Our method scales up to 100 million time series; its cost grows linearly in the expensive steps but quadratically in the less expensive ones.
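
To make the random vector idea concrete, here is a minimal single-machine sketch in Python. It is not the authors' implementation: the function names, the choice of +/-1 random vectors, and the sizes `n` and `k` are assumptions made for illustration, and it covers neither the incremental window updates nor the parallel execution. It relies on a standard identity: for two z-normalized windows of length n, the Pearson correlation equals 1 - d^2 / (2n), where d is their Euclidean distance. A short sketch of random projections approximates d.

```python
import numpy as np


def znormalize(window):
    """Scale a window to zero mean and unit (population) standard deviation."""
    window = np.asarray(window, dtype=float)
    return (window - window.mean()) / window.std()


def sketch(window, random_vectors):
    """Project a z-normalized window onto each random +/-1 vector."""
    return random_vectors @ znormalize(window)


def estimated_correlation(sketch_x, sketch_y, window_length):
    """Estimate Pearson correlation from two sketches.

    The mean squared difference of the projections estimates the squared
    Euclidean distance d^2 between the z-normalized windows, and
    corr = 1 - d^2 / (2 * window_length).
    """
    d2 = np.mean((sketch_x - sketch_y) ** 2)
    return 1.0 - d2 / (2.0 * window_length)


# Illustrative sizes (assumptions): window length n, sketch size k.
rng = np.random.default_rng(0)
n, k = 256, 512
random_vectors = rng.choice([-1.0, 1.0], size=(k, n))

# Two noisy copies of the same signal, so they are highly correlated.
base = rng.standard_normal(n)
x = base + 0.1 * rng.standard_normal(n)
y = base + 0.1 * rng.standard_normal(n)

exact = np.corrcoef(x, y)[0, 1]
approx = estimated_correlation(
    sketch(x, random_vectors), sketch(y, random_vectors), n
)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

In this toy setup the sketch is longer than the window, so it saves nothing. In practice the sketch would be much shorter than the window, trading some accuracy for speed.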


Notes

  1. This is slightly simplified. Often, analysts consider the volume-weighted average price per time unit. So if there are 1000 shares traded at 100 and 1 million shares traded at 110 in a millisecond, then the volume-weighted average price during that millisecond is very close to 110. We compute the returns based on these volume-weighted prices.

  2. http://www.cs.ucr.edu/~eamonn/MatrixProfile.html

  3. For this discussion, we assume that ts_to_node is 1 to 1. If not, then if a node has, say, the time series groups corresponding to \(i_1\), \(i_2\), and \(i_3\), then keep those groups separate.

  4. Code and datasets available for free download at http://parcorr.gforge.inria.fr

  5. http://finance.yahoo.com
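
The arithmetic in Note 1 can be checked with a few lines of Python. This is only an illustrative sketch; the function name `vwap` and the `(volume, price)` tuple layout are assumptions, not part of the published code.

```python
def vwap(trades):
    """Volume-weighted average price of (volume, price) pairs."""
    total_volume = sum(volume for volume, _ in trades)
    return sum(volume * price for volume, price in trades) / total_volume


# The two trades from Note 1, occurring within one millisecond.
trades = [(1_000, 100.0), (1_000_000, 110.0)]
price = vwap(trades)
print(round(price, 2))
```

The result is about 109.99, which matches the note: the large trade at 110 dominates the average.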

Acknowledgements

The research leading to these results has received funding from the European Union’s Horizon 2020—The EU Framework Programme for Research and Innovation 2014–2020, under Grant Agreement No. 732051.

Author information

Corresponding author

Correspondence to Boyan Kolev.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Responsible editors Jesse Davis, Elisa Fromont, Derek Greene, Bjørn Bringmann.

About this article

Cite this article

Yagoubi, D.E., Akbarinia, R., Kolev, B. et al. ParCorr: efficient parallel methods to identify similar time series pairs across sliding windows. Data Min Knowl Disc 32, 1481–1507 (2018). https://doi.org/10.1007/s10618-018-0580-z

Keywords

  • Time series analysis
  • Data stream processing
  • Distributed computing
  • Data mining