The VLDB Journal, Volume 25, Issue 6, pp 767–790

Efficient discovery of longest-lasting correlation in sequence databases

  • Yuhong Li
  • Leong Hou U
  • Man Lung Yiu
  • Zhiguo Gong
Regular Paper

Abstract

The search for similar subsequences is a core module for various analytical tasks in sequence databases. Typically, similarity computations require users to set a subsequence length, yet there is no robust way to choose the proper length for different application needs. In this study, we examine a new query that returns the longest-lasting highly correlated subsequences in a sequence database, which is particularly helpful for analyses without prior knowledge of the query length. A baseline, yet expensive, solution is to calculate the correlations for every possible subsequence length. To boost performance, we study a space-constrained index that provides a tight correlation bound for subsequences of similar length and offset, using intra-object and inter-object grouping techniques. To the best of our knowledge, this is the first index to support a normalized distance metric for subsequences of arbitrary length. In addition, we study the use of a smart cache for disk-resident data (e.g., millions of sequence objects) and a graphics processing unit (GPU)-based parallel processing technique for frequently updated data (e.g., nonindexable streaming sequences) to compute the longest-lasting highly correlated subsequences. Extensive experimental evaluation on both real and synthetic sequence datasets verifies the efficiency and effectiveness of our proposed methods.
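The baseline mentioned in the abstract, computing the correlation for every possible subsequence length, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes Pearson correlation as the measure, compares only two sequences at the same offset, and the function name `longest_correlated_window`, the `threshold` parameter, and the `min_len` parameter are hypothetical names chosen for this example. Prefix sums make each window's correlation an O(1) computation, but the enumeration over all lengths and offsets remains quadratic.

```python
import math
from itertools import accumulate


def prefix(values):
    """Prefix sums with a leading 0, so sum(values[i:j]) == p[j] - p[i]."""
    return [0.0] + list(accumulate(values))


def longest_correlated_window(x, y, threshold, min_len=3):
    """Brute-force baseline (illustrative assumption, not the paper's index).

    Returns (length, offset, correlation) for the longest aligned window
    x[offset:offset+length], y[offset:offset+length] whose Pearson
    correlation is >= threshold, or None if no window qualifies.
    """
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    sx, sy = prefix(x), prefix(y)
    sxx = prefix([a * a for a in x])
    syy = prefix([b * b for b in y])
    sxy = prefix([a * b for a, b in zip(x, y)])

    def corr(i, j):
        # Pearson correlation of x[i:j] and y[i:j] from prefix sums.
        m = j - i
        ex, ey = sx[j] - sx[i], sy[j] - sy[i]
        cov = (sxy[j] - sxy[i]) - ex * ey / m
        vx = (sxx[j] - sxx[i]) - ex * ex / m
        vy = (syy[j] - syy[i]) - ey * ey / m
        if vx <= 1e-12 or vy <= 1e-12:
            return None  # constant window: correlation is undefined
        return cov / math.sqrt(vx * vy)

    # Try the longest lengths first; the first hit is the answer.
    for length in range(n, min_len - 1, -1):
        for offset in range(n - length + 1):
            r = corr(offset, offset + length)
            if r is not None and r >= threshold:
                return length, offset, r
    return None


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 9, 1, 7]
    b = [2, 4, 6, 8, 10, 0, 5, 1]
    print(longest_correlated_window(a, b, threshold=0.999))
```

Even with constant-time correlation per window, this sketch examines every length and offset for a single pair of sequences, which suggests why the abstract calls the baseline expensive and motivates an index that bounds the correlation for groups of similar windows.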

Keywords

  • Time series analysis
  • Similarity search
  • Longest-lasting correlated subsequences

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Yuhong Li (1)
  • Leong Hou U (1)
  • Man Lung Yiu (2)
  • Zhiguo Gong (1)
  1. Department of Computer and Information Science, University of Macau, Macau SAR, China
  2. Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, China
