Similarity join on time series by utilizing a dynamic segmentation index

Abstract

Similarity join on time series databases is an essential operation for data analysis applications. Because of the curse of dimensionality, traditional index techniques such as the R-tree and kd-tree are not suitable. In this paper, a dynamic segmentation index (the DSTree) is used to reduce the high comparison cost of similarity joins on time series databases. However, the DSTree is designed for similarity search and only supports bound estimation between a single time series and the batch of time series in a DSTree node. To make the DSTree suitable for similarity joins, tight bounds between nodes are needed to achieve better pruning power; the biggest challenge is that DSTree nodes may have different segmentations. To solve this problem, a segmentation alignment and synopsis evaluation method is proposed to support bound estimation between DSTree nodes, which significantly reduces the time cost by pruning unnecessary comparisons. Moreover, to make the approach I/O efficient, a caching strategy is proposed that takes advantage of both graph partitioning and the locality of the DSTree index. The efficiency and effectiveness of the proposed approaches are verified by experiments on real-life datasets.
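
To make the node-to-node bounding problem concrete, the following is a minimal sketch of one way to lower-bound the Euclidean distance between two groups of time series whose segmentations differ. It is not the paper's method: the synopsis here keeps only the range of segment means per node (the DSTree synopsis is richer), alignment is done by coarsening both nodes to their shared breakpoints, and all names (`build_synopsis`, `align`, `node_lower_bound`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Assumed synopsis layout: one (start, end, min_mean, max_mean) tuple per
# segment, with `end` exclusive. This is an illustration, not the DSTree format.
Synopsis = List[Tuple[int, int, float, float]]


def build_synopsis(group: Sequence[Sequence[float]], breaks: Sequence[int]) -> Synopsis:
    """Range of per-segment means over all series in a node.
    `breaks` lists the segment boundaries, e.g. [0, 4, 8]."""
    syn = []
    for s, e in zip(breaks, breaks[1:]):
        means = [sum(ts[s:e]) / (e - s) for ts in group]
        syn.append((s, e, min(means), max(means)))
    return syn


def align(syn: Synopsis, common: Sequence[int]) -> Synopsis:
    """Coarsen a synopsis to the boundaries in `common`, which must be a
    subset of the synopsis's own boundaries. The mean over a merged segment
    is a length-weighted average of the sub-segment means, so the weighted
    averages of the per-segment minima and maxima bound it."""
    out = []
    for s, e in zip(common, common[1:]):
        parts = [p for p in syn if s <= p[0] and p[1] <= e]
        lo = sum((p[1] - p[0]) * p[2] for p in parts) / (e - s)
        hi = sum((p[1] - p[0]) * p[3] for p in parts) / (e - s)
        out.append((s, e, lo, hi))
    return out


def node_lower_bound(syn_a: Synopsis, syn_b: Synopsis) -> float:
    """Lower bound on the Euclidean distance between any series in node A
    and any series in node B, computed from the synopses alone."""
    breaks_a = {p[0] for p in syn_a} | {syn_a[-1][1]}
    breaks_b = {p[0] for p in syn_b} | {syn_b[-1][1]}
    common = sorted(breaks_a & breaks_b)  # shared breakpoints only
    total = 0.0
    for (s, e, lo_a, hi_a), (_, _, lo_b, hi_b) in zip(align(syn_a, common),
                                                      align(syn_b, common)):
        # Smallest possible difference between the two segment means.
        gap = max(0.0, lo_a - hi_b, lo_b - hi_a)
        # Over a segment of length w, sum((x - y)^2) >= w * (mean difference)^2.
        total += (e - s) * gap * gap
    return math.sqrt(total)
```

In a join with distance threshold `eps` (also an assumed name), a pair of nodes could be skipped whenever `node_lower_bound(syn_a, syn_b) > eps`, since no pair of series drawn from the two nodes can then be within `eps`. Coarsening to shared breakpoints keeps the bound valid but loosens it; tighter alignment schemes are the subject of the paper itself.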

Funding

Funding was provided by the National Key Research and Development Program (Grant Nos. 2016YFE0100300, 2016YFB1000700) and the National Natural Science Foundation of China (Grant No. U1509213).

Author information

Corresponding author

Correspondence to Qiuhong Li.

About this article

Cite this article

Wang, J., Li, Q., Li, Z. et al. Similarity join on time series by utilizing a dynamic segmentation index. Knowl Inf Syst 61, 1517–1546 (2019). https://doi.org/10.1007/s10115-018-1317-4

Keywords

  • Time series
  • Similarity join
  • Lower bound