Similarity join on time series by utilizing a dynamic segmentation index

  • Jinhua Wang
  • Qiuhong Li
  • Zhongsheng Li
  • Peng Wang
  • Yang Wang
  • Wei Wang
  • Ningting Pan
  • Mingmin Chi
Regular Paper


Similarity join on time series databases is an essential operation for data analysis applications. Because of the curse of dimensionality, traditional index techniques such as the R-tree and kd-tree are not suitable. In this paper, a dynamic segmentation index (the DSTree) is utilized to reduce the huge comparison cost of similarity joins on time series databases. However, the DSTree is designed for similarity search and only supports bound estimation between a single time series and the batch of time series in a DSTree node. To make the DSTree suitable for similarity joins, tight bounds between pairs of nodes are needed to achieve better pruning power; the biggest challenge is that DSTree nodes may have different segmentations. To solve this problem, a segmentation alignment and synopsis evaluation method is proposed to support bound estimation between DSTree nodes, which significantly reduces the time cost by pruning unnecessary comparisons. Moreover, to make the approach I/O efficient, a caching strategy is proposed that takes advantage of both graph partitioning and the locality of the DSTree index. The efficiency and effectiveness of the proposed approaches are verified by experiments on real-life datasets.
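
To illustrate why differing segmentations complicate bound estimation, the following is a minimal sketch, not the method of the paper. It assumes each time series is summarized only by per-segment means, and it aligns two segmentations by keeping just the breakpoints they share, merging finer segments through length-weighted averaging. On the shared segmentation, the usual segment-mean bound gives a lower bound on the Euclidean distance. All function names (`segment_means`, `coarsen`, `lower_bound`) and the representation of a segmentation as a list of exclusive right endpoints are assumptions made for this example; a real DSTree node stores richer synopses over a batch of series.

```python
import math

def segment_means(series, breaks):
    """Mean of each segment; breaks are exclusive right endpoints, e.g. [2, 4, 8]."""
    means, start = [], 0
    for end in breaks:
        means.append(sum(series[start:end]) / (end - start))
        start = end
    return means

def coarsen(breaks, means, common):
    """Merge segment means onto a coarser segmentation whose breakpoints
    (common) are a subset of breaks, using length-weighted averages."""
    out, start, i = [], 0, 0
    for end in common:
        seg_start, total = start, 0.0
        while i < len(breaks) and breaks[i] <= end:
            total += means[i] * (breaks[i] - start)
            start = breaks[i]
            i += 1
        out.append(total / (end - seg_start))
    return out

def lower_bound(breaks_a, means_a, breaks_b, means_b):
    """Lower bound on the Euclidean distance between two series of equal
    length, computed from segment means only (hypothetical helper)."""
    # Alignment step: keep only the breakpoints both segmentations share.
    common = sorted(set(breaks_a) & set(breaks_b))
    ca = coarsen(breaks_a, means_a, common)
    cb = coarsen(breaks_b, means_b, common)
    total, start = 0.0, 0
    for end, ma, mb in zip(common, ca, cb):
        total += (end - start) * (ma - mb) ** 2
        start = end
    return math.sqrt(total)

# Two series of length 8 with different segmentations.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
breaks_a, breaks_b = [2, 4, 8], [4, 6, 8]
lb = lower_bound(breaks_a, segment_means(a, breaks_a),
                 breaks_b, segment_means(b, breaks_b))
true_dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

In a join with distance threshold ε, a pair whose lower bound already exceeds ε can be discarded without computing the exact distance. Coarsening to shared breakpoints keeps the bound valid but loosens it, which is one reason tighter node-level bounds are worth the extra effort.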


Time series · Similarity join · Lower bound



Funding was provided by the National Key Research and Development Program (Grant Nos. 2016YFE0100300, 2016YFB1000700) and the National Natural Science Foundation of China (Grant No. U1509213).



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China
  2. JiangNan Institute of Computing Technology, Wuxi, China
