PARROT: pattern-based correlation exploitation in big partitioned data series

  • Regular Paper
  • The VLDB Journal

Abstract

Approximate similarity search over data series is a basic building block for almost all analytical tasks. To speed up this operation, the prevalent approach is to construct indexes directly on the data series objects. This approach suffers from very high construction time and storage cost due to the inherent complexity of indexing high-dimensional data series objects. We instead design a new approach that leverages correlations between the high-dimensional data series objects and the (often simple) partitioning attribute(s) in distributed data series repositories. Our proposed infrastructure, called PARROT, discovers, assesses, and exploits such correlations for similarity query optimization. PARROT addresses several critical challenges, including the high dimensionality of the data series objects, the softness (uncertainty) of correlations, correlation granularity, and the lack of a proper measure for assessing correlation strength in big data series. We present scalable solutions to each of these challenges, including pattern-level indexing, exception handling strategies for soft correlations, and a new entropy-based measure for assessing the strength of correlations and judging their potential effectiveness. The PARROT query engine efficiently supports approximate kNN similarity queries by leveraging the PARROT index. A PARROT prototype is implemented on Apache Spark. Extensive experiments on real and synthetic datasets demonstrate that PARROT has substantially lower index construction costs, smaller storage overhead, and better performance and accuracy for processing similarity queries than alternative state-of-the-art solutions.

Notes

  1. In theory, any aggregation function over the segment’s values, e.g., average, min, max, or median, can be used to represent a segment. In this work, we use the average, as is common in the literature [3, 10, 28, 42, 56, 57, 58, 60, 61].

  2. A sorted list of all entries from the global index is not necessary. Since we only need a subset, a min-heap is used to incrementally keep or purge entries.

  3. https://github.com/lzhang6/parrot

  4. https://www.511ny.org/

  5. https://www.511virginia.org

  6. https://www.511.nebraska.gov/

  7. The typical range for the data series feature representation in the literature is between 8 and 12 [10, 28, 42, 50, 58, 60, 61].

  8. Mean Average Precision (MAP) is a widely used accuracy measure for centralized systems that captures and compares the order of the items in the answer sets. However, it is not explicitly reported here: in distributed platforms the answer is generated in a distributed fashion and carries no inherent order, so the final results are globally sorted. In this case, MAP becomes equivalent to Recall.
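
Notes 1, 2, and 8 describe small computational steps that can be illustrated with a minimal Python sketch. This is not the PARROT implementation: the function names `segment_averages`, `keep_best`, and `recall`, the equal-width segmentation, the `priority` callback, and the set-based definition of recall are all assumptions made for illustration.

```python
import heapq
from statistics import mean


def segment_averages(series, num_segments):
    """Note 1: represent each segment of a series by the average of its values.

    Assumes (hypothetically) equal-width segments; any remainder is spread
    across segments by the integer boundaries below.
    """
    n = len(series)
    if not 0 < num_segments <= n:
        raise ValueError("num_segments must be between 1 and len(series)")
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [mean(series[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]


def keep_best(entries, m, priority):
    """Note 2: keep the m entries with the largest priority, without a full sort.

    `priority` is a hypothetical scoring callback; pass a negated distance
    when smaller values are better.
    """
    if m <= 0:
        return []
    heap = []  # min-heap: the root is the weakest entry retained so far
    for seq, entry in enumerate(entries):
        item = (priority(entry), seq, entry)  # seq keeps tuple comparison away from entry
        if len(heap) < m:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)  # purge the weakest, keep the new entry
    return [entry for _, _, entry in sorted(heap, reverse=True)]


def recall(approximate, exact):
    """Note 8: fraction of the exact answer set present in the approximate one.

    Assumes the common set-based definition, which ignores result order.
    """
    exact_set = set(exact)
    if not exact_set:
        raise ValueError("exact answer set must not be empty")
    return len(exact_set & set(approximate)) / len(exact_set)


if __name__ == "__main__":
    print(segment_averages([1, 2, 3, 4, 5, 6, 7, 8], 4))
    candidates = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.3)]
    print(keep_best(candidates, 2, priority=lambda e: -e[1]))
    print(recall(["b", "d", "x"], ["b", "d", "c"]))
```

The bounded heap never holds more than `m` items, so selecting a subset from `n` entries takes O(n log m) time rather than the O(n log n) of a full sort, which is the saving Note 2 alludes to.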

References

  1. Apache Hive (2020). https://hive.apache.org/

  2. U.S. Geological Survey, gross primary productivity (2020). https://lpdaac.usgs.gov/products/mod17a2hv006/

  3. Alghamdi, N., Zhang, L., Eltabakh, M.Y., Rundensteiner, E.A.: Chainlink: indexing big time series data for long subsequence matching. In: ICDE, pp. 529–540. IEEE (2020)

  4. Alghamdi, N.S., Zhang, L., Rundensteiner, E.A., Eltabakh, M.Y.: Scalable time series compound infrastructure. In: SIGMOD, pp. 1685–1698. ACM (2022)

  5. Aljawarneh, S., Radhakrishna, V., Kumar, P.V., Janaki, V.: A similarity measure for temporal pattern discovery in time series data generated by IoT. In: ICEMIS, pp. 1–4. IEEE (2016)

  6. Arora, A., Sinha, S., Kumar, P., Bhattacharya, A.: HD-Index: pushing the scalability-accuracy boundary for approximate kNN search in high-dimensional spaces. PVLDB 11(8), 906–919 (2018)

  7. Aucouturier, J.J., Pachet, F., et al.: Music similarity measures: What’s the use? In: ISMIR, pp. 13–17 (2002)

  8. Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for data cleaning. In: ICDE, pp. 746–755. IEEE (2007)

  9. Brown, P.G., Haas, P.J.: BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In: PVLDB. Elsevier (2003)

  10. Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: Indexing and mining one billion time series. In: ICDE. IEEE (2010)

  11. Carrington, P.J., Scott, J., Wasserman, S.: Models and Methods in Social Network Analysis. Cambridge University Press, Cambridge (2005)

  12. Chan, N.H.: Time Series: Applications to Finance, vol. 487. Wiley, London (2004)

  13. Chu, X., Ilyas, I.F., Papotti, P.: Discovering denial constraints. PVLDB 6(13), 1498–1509 (2013)

  14. Claesen, M., De Moor, B.: Hyperparameter search in machine learning. arXiv preprint arXiv:1502.02127 (2015)

  15. Cook, A.A., Mısırlı, G., Fan, Z.: Anomaly detection for IoT time-series data: a survey. In: Internet of Things Journal, 7, pp. 6481–6494. IEEE (2019)

  16. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 2, 107–113 (2008)

  17. Ebrahimi, N., Soofi, E.S., Soyer, R.: Information measures in perspective. Int. Stat. Rev. 5, 6266 (2010)

  18. Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: Return of the Lernaean Hydra: experimental evaluation of data series approximate similarity search. PVLDB 13(3), 403–420 (2019)

  19. Eltabakh, M.Y.: Big data indexing. In: Encyclopedia of Big Data Technologies (2019)

  20. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence match in time-series databases. In: SIGMOD, vol. 23. ACM (1994)

  21. Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. In: TODS, vol. 33, pp. 1–48. ACM (2008)

  22. Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., El Abbadi, A.: Vector approximation based indexing for non-uniform high dimensional data sets. In: CIKM, pp. 202–209 (2000)

  23. Feurer, M., Hutter, F.: Hyperparameter optimization. In: Automated Machine Learning, pp. 3–33. Springer (2019)

  24. Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of things (IoT): a vision, architectural elements, and future directions. Futur. Gener. Comput. Syst. 29(7), 1645–1660 (2013)

  25. Huhtala, Y., Kärkkäinen, J., Porkka, P., Toivonen, H.: TANE: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42(2), 100–111 (1999)

  26. Ilyas, I.F., Markl, V., Haas, P., Brown, P., Aboulnaga, A.: CORDS: automatic discovery of correlations and soft functional dependencies. In: SIGMOD, pp. 647–658. ACM (2004)

  27. Kashyap, S., Karras, P.: Scalable kNN search on vertically stored time series. In: SIGKDD, pp. 1334–1342. ACM (2011)

  28. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. In: KAIS, vol. 3, pp. 263–286. Springer (2001)

  29. Kimura, H., Huo, G., Rasin, A., Madden, S., Zdonik, S.B.: Correlation maps: a compressed access method for exploiting soft functional dependencies. In: PVLDB, pp. 1222–1233 (2009)

  30. Kimura, H., Huo, G., Rasin, A., Madden, S., Zdonik, S.B.: CORADD: Correlation aware database designer for materialized views and indexes. In: PVLDB, vol. 3, pp. 1103–1113 (2010)

  31. Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: A scalable bottom-up approach for building data series indexes. In: PVLDB, vol. 11, pp. 677–690 (2018)

  32. Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: The ULISSE approach. In: PVLDB, pp. 2236–2248 (2018)

  33. Liu, H., Xiao, D., Didwania, P., Eltabakh, M.Y.: Exploiting soft and hard correlations in big data query optimization. In: PVLDB, 12, pp. 1005–1016 (2016)

  34. Liu, Y., Liu, H., Xiao, D., Eltabakh, M.Y.: Adaptive correlation exploitation in big data query optimization. In: VLDB Journal. Springer (2018)

  35. Mandros, P., Boley, M., Vreeken, J.: Discovering reliable approximate functional dependencies. In: SIGKDD, pp. 355–363. ACM (2017)

  36. Mandros, P., Boley, M., Vreeken, J.: Discovering reliable dependencies from data: Hardness and improved algorithms. In: ICDM, pp. 317–326. IEEE (2018)

  37. Miyazawa, F.K., Pedrosa, L.L., Schouery, R.C., Sviridenko, M., Wakabayashi, Y.: Polynomial-time approximation schemes for circle and other packing problems. Algorithmica 76(2), 536–568 (2016)

  38. Nehme, R.V., Rundensteiner, E.A., Bertino, E.: Self-tuning query mesh for adaptive multi-route query processing. In: EDBT, pp. 803–814 (2009)

  39. Nguyen, H.V., Müller, E., Andritsos, P., Böhm, K.: Detecting correlated columns in relational databases with mixed data types. In: SSDBM, pp. 1–12 (2014)

  40. Palpanas, T.: Big sequence management: A glimpse of the past, the present, and the future. In: SOFSEM, pp. 63–80. Springer (2016)

  41. Palpanas, T.: The parallel and distributed future of data series mining. In: HPCS, pp. 916–920. IEEE (2017)

  42. Palpanas, T.: Evolution of a data series index. In: ISIP. Springer (2019)

  43. Palpanas, T., Beckmann, V.: Report on the first and second interdisciplinary time series analysis workshop (ITISA). In: SIGMOD. ACM (2019)

  44. Park, Y., Cafarella, M., Mozafari, B.: Neighbor-sensitive hashing. PVLDB 9, 144–155 (2015)

  45. Pearson, K.: The problem of the random walk. Nature 72(1865), 5558 (1905)

  46. Peng, B., Fatourou, P., Palpanas, T.: ParIS: The next destination for fast data series indexing and query answering. In: Big Data, pp. 791–800. IEEE (2018)

  47. Peng, B., Fatourou, P., Palpanas, T.: MESSI: In-memory data series indexing. In: ICDE. IEEE (2020)

  48. Pennerath, F.: An efficient algorithm for computing entropic measures of feature subsets. In: ECML PKDD, pp. 483–499. Springer (2018)

  49. Reimherr, M., Nicolae, D.L., et al.: On quantifying dependence: A framework for developing interpretable measures. In: Statistical Science, vol. 28, pp. 116–130. Institute of Mathematical Statistics (IMS) (2013)

  50. Shieh, J., Keogh, E.: iSAX: indexing and mining terabyte sized time series. In: SIGKDD, pp. 623–631. ACM (2008)

  51. Shvachko, K., Kuang, H., Radia, S., Chansler, R., et al.: The Hadoop distributed file system. In: MSST, pp. 1–10. IEEE (2010)

  52. Stephenson, K.: Circle packing: a mathematical tale. Not. AMS 50(11), 1376–1388 (2003)

  53. Stephenson, K.: Introduction to Circle Packing: The Theory of Discrete Analytic Functions. Cambridge University Press, Cambridge (2005)

  54. Tamura, H., Yokoya, N.: Image database systems: a survey. Pattern Recogn. 17(1), 29–43 (1984)

  55. Ullman, J.D.: Principles of Database and Knowledge-Base Systems, vol. 1. Computer Science Press, Inc. (1988)

  56. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: PVLDB, vol. 98, pp. 194–205 (1998)

  57. Wu, J., Wang, P., Pan, N., Wang, C., Wang, W., Wang, J.: KV-match: A subsequence matching approach supporting normalization and time warping. In: ICDE. IEEE (2019)

  58. Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: DPiSAX: Massively distributed partitioned iSAX. In: ICDM. IEEE (2017)

  59. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud (2010)

  60. Zhang, L., Alghamdi, N., Eltabakh, M.Y., Rundensteiner, E.A.: TARDIS: Distributed indexing framework for big time series data. In: ICDE, pp. 1202–1213. IEEE (2019)

  61. Zhang, L., Alghamdi, N., Eltabakh, M.Y., Rundensteiner, E.A.: Big data series analytics using TARDIS and its exploitation in geospatial applications. In: SIGMOD, pp. 2785–2788. ACM (2020)

  62. Zoumpatianos, K., Idreos, S., Palpanas, T.: ADS: the adaptive data series index. In: VLDB Journal, vol. 25, pp. 843–866. Springer (2016)

Author information

Corresponding author

Correspondence to Liang Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, L., Alghamdi, N., Zhang, H. et al. PARROT: pattern-based correlation exploitation in big partitioned data series. The VLDB Journal 32, 665–688 (2023). https://doi.org/10.1007/s00778-022-00767-9

  • DOI: https://doi.org/10.1007/s00778-022-00767-9
