PARROT: pattern-based correlation exploitation in big partitioned data series

Zhang, Liang; Alghamdi, Noura; Zhang, Huayi; Eltabakh, Mohamed Y.; Rundensteiner, Elke A.

doi:10.1007/s00778-022-00767-9

Regular Paper
Published: 13 October 2022

Volume 32, pages 665–688, (2023)

The VLDB Journal

Liang Zhang (ORCID: orcid.org/0000-0001-7925-8706)¹, Noura Alghamdi¹, Huayi Zhang¹, Mohamed Y. Eltabakh¹ & Elke A. Rundensteiner¹

Abstract

Approximate similarity search over data series is a basic building block essential for almost all analytical tasks. To speed up this important operation, the prevalent approach is to construct indexes directly on the data series objects. This approach suffers from very high construction time and storage cost due to the inherent complexity of indexing these high-dimensional data series objects. We instead design a promising new approach that leverages the unique property of correlations between the high-dimensional data series objects and the (often simple) partitioning attribute(s) in distributed data series repositories. Our proposed infrastructure, called PARROT, discovers, assesses, and exploits such correlations for similarity query optimization. PARROT addresses several critical challenges, including the high dimensionality of the data series objects, softness (uncertainty) of correlation, correlation granularity, and the lack of a proper measure for assessing correlation strength in big data series. We present scalable solutions tackling each of these challenges, including pattern-level indexing, exception handling strategies for soft correlations, and a new entropy-based measure for assessing the strength of correlations and judging their potential effectiveness. The PARROT query engine efficiently supports approximate kNN similarity queries by leveraging the PARROT index. The PARROT prototype is implemented on Apache Spark. Extensive experiments on real and synthetic datasets demonstrate that PARROT has substantially lower index construction costs, smaller storage overhead, and better performance and accuracy for processing similarity queries compared to alternative state-of-the-art solutions.
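
To give a rough sense of what an entropy-based correlation measure can look like, the sketch below computes the uncertainty coefficient U(pattern | key) = (H(pattern) - H(pattern | key)) / H(pattern), a standard information-theoretic quantity. It is an assumption chosen for illustration and is not the measure defined in the PARROT paper, which this preview does not spell out. The function names and the toy partition keys and pattern labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def correlation_strength(keys: Sequence[Hashable],
                         patterns: Sequence[Hashable]) -> float:
    """Uncertainty coefficient U(pattern | key), a value in [0, 1].

    1.0 means the partition key fully determines the pattern (hard
    correlation); 0.0 means the key carries no information about it.
    Illustrative stand-in only, not the paper's measure.
    """
    if len(keys) != len(patterns) or len(keys) == 0:
        raise ValueError("keys and patterns must be non-empty and equally long")
    h_pattern = entropy(patterns)
    if h_pattern == 0.0:
        return 1.0  # a single pattern is trivially determined by any key
    # Group the pattern labels by partition key.
    groups: Dict[Hashable, List[Hashable]] = {}
    for key, pattern in zip(keys, patterns):
        groups.setdefault(key, []).append(pattern)
    n = len(keys)
    # Conditional entropy H(pattern | key): weighted entropy per key group.
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_pattern - h_cond) / h_pattern


# Hypothetical toy data: a partition key per object and its pattern label.
hard = correlation_strength(["a", "a", "b", "b"], ["p", "p", "q", "q"])
none = correlation_strength(["a", "a", "b", "b"], ["p", "q", "p", "q"])
soft = correlation_strength(["a"] * 4 + ["b"] * 4, ["p"] * 3 + ["q"] * 5)
```

In this toy setting, a key that fully determines the pattern scores 1, an uninformative key scores 0, and a soft correlation with a few exceptions falls strictly in between.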

Notes

In theory, any aggregation function over the segment’s values, e.g., average, min, max, and median, can be used to represent a segment. In this work, we use the average function, as is common in the literature [3, 10, 28, 42, 56, 57, 58, 60, 61].
A sorted list of all entries from the global index is not necessary. Since we only need a subset, a min-heap is used to incrementally keep or purge entries.
https://github.com/lzhang6/parrot
https://www.511ny.org/
https://www.511virginia.org
https://www.511.nebraska.gov/
The typical range for the data series feature representation in the literature is between 8 and 12 [10, 28, 42, 50, 58, 60, 61].
Mean Average Precision (MAP) is a widely used accuracy measure for centralized systems, which captures and compares the order of the items in the answer sets. However, it is not explicitly reported here since in distributed platforms there is no notion of order because the answer is generated in a distributed fashion. Thus, the final results are globally sorted. In this case, MAP becomes equivalent to Recall.
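
Three of the notes above (segment averages, heap-based selection, and recall) can be illustrated with a short Python sketch. It makes several assumptions that do not come from the article: segments have equal length, index entries are (distance, id) pairs where a smaller distance is better, and all function names are hypothetical. Python's heapq is a min-heap, so the sketch stores negated distances to keep the worst retained entry at the root, which is one common way to keep or purge entries incrementally.

```python
import heapq
from typing import Iterable, List, Sequence, Tuple


def segment_averages(series: Sequence[float], num_segments: int) -> List[float]:
    """Represent each equal-length segment of `series` by its average."""
    n = len(series)
    if n == 0 or num_segments <= 0 or n % num_segments != 0:
        raise ValueError("series length must be a positive multiple of num_segments")
    width = n // num_segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def k_smallest(entries: Iterable[Tuple[float, str]], k: int) -> List[Tuple[float, str]]:
    """Keep the k entries with the smallest distance without a full sort."""
    if k <= 0:
        return []
    heap: List[Tuple[float, str]] = []  # holds (-distance, id); root is the worst kept
    for dist, entry_id in entries:
        if len(heap) < k:
            heapq.heappush(heap, (-dist, entry_id))
        elif -dist > heap[0][0]:
            # Closer than the worst kept entry: purge that one, keep this one.
            heapq.heapreplace(heap, (-dist, entry_id))
    return sorted((-neg, entry_id) for neg, entry_id in heap)


def recall(returned: Iterable[str], truth: Iterable[str]) -> float:
    """Fraction of the true answers that appear in the returned answer set."""
    truth_set = set(truth)
    if not truth_set:
        raise ValueError("truth must not be empty")
    return len(truth_set & set(returned)) / len(truth_set)


# Hypothetical toy inputs.
features = segment_averages([1.0, 3.0, 5.0, 7.0], num_segments=2)
nearest = k_smallest([(0.9, "a"), (0.2, "b"), (0.5, "c"), (0.1, "d")], k=2)
score = recall(returned=["d", "x"], truth=["d", "b"])
```

Because recall compares answer sets and ignores order, it is unaffected by how a distributed engine assembles the final result.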

References

Apache hive (2020). https://hive.apache.org/
U.S. Geological Survey, gross primary productivity (2020). https://lpdaac.usgs.gov/products/mod17a2hv006/
Alghamdi, N., Zhang, L., Eltabakh, M.Y., Rundensteiner, E.A.: Chainlink: indexing big time series data for long subsequence matching. In: ICDE, pp. 529–540. IEEE (2020)
Alghamdi, N.S., Zhang, L., Rundensteiner, E.A., Eltabakh, M.Y.: Scalable time series compound infrastructure. In: SIGMOD, pp. 1685–1698. ACM (2022)
Aljawarneh, S., Radhakrishna, V., Kumar, P.V., Janaki, V.: A similarity measure for temporal pattern discovery in time series data generated by IoT. In: ICEMIS, pp. 1–4. IEEE (2016)
Arora, A., Sinha, S., Kumar, P., Bhattacharya, A.: Hd-index: pushing the scalability-accuracy boundary for approximate knn search in high-dimensional spaces. PVLDB 11(8), 906–919 (2018)
Aucouturier, J.J., Pachet, F., et al.: Music similarity measures: What’s the use? In: ISMIR, pp. 13–17 (2002)
Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for data cleaning. In: ICDE, pp. 746–755. IEEE (2007)
Brown, P.G., Haas, P.J.: Bhunt: Automatic discovery of fuzzy algebraic constraints in relational data. In: PVLDB. Elsevier (2003)
Camerra, A., Palpanas, T., Shieh, J., Keogh, E.: iSAX 2.0: Indexing and mining one billion time series. In: ICDE. IEEE (2010)
Carrington, P.J., Scott, J., Wasserman, S.: Models and Methods in Social Network Analysis. Cambridge University Press, Cambridge (2005)
Chan, N.H.: Time Series: Applications to Finance, vol. 487. Wiley, London (2004)
Chu, X., Ilyas, I.F., Papotti, P.: Discovering denial constraints. PVLDB 6(13), 1498–1509 (2013)
Claesen, M., De Moor, B.: Hyperparameter search in machine learning. arXiv preprint arXiv:1502.02127 (2015)
Cook, A.A., Mısırlı, G., Fan, Z.: Anomaly detection for IoT time-series data: a survey. In: Internet of Things Journal, 7, pp. 6481–6494. IEEE (2019)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 2, 107–113 (2008)
Ebrahimi, N., Soofi, E.S., Soyer, R.: Information measures in perspective. Int. Stat. Rev. 5, 6266 (2010)
Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: Return of the lernaean hydra: experimental evaluation of data series approximate similarity search. PVLDB 13(3), 403–420 (2019)
Eltabakh, M.Y.: Big data indexing. In: Encyclopedia of Big Data Technologies (2019)
Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence match in time-series databases. In: SIGMOD, vol. 23. ACM (1994)
Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. In: TODS, vol. 33, pp. 1–48. ACM (2008)
Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., El Abbadi, A.: Vector approximation based indexing for non-uniform high dimensional data sets. In: CIKM, pp. 202–209 (2000)
Feurer, M., Hutter, F.: Hyperparameter optimization. In: Automated Machine Learning, pp. 3–33. Springer (2019)
Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of things (IoT): a vision, architectural elements, and future directions. Futur. Gener. Comput. Syst. 29(7), 1645–1660 (2013)
Huhtala, Y., Kärkkäinen, J., Porkka, P., Toivonen, H.: Tane: an efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42(2), 100–111 (1999)
Ilyas, I.F., Markl, V., Haas, P., Brown, P., Aboulnaga, A.: Cords: automatic discovery of correlations and soft functional dependencies. In: SIGMOD, pp. 647–658. ACM (2004)
Kashyap, S., Karras, P.: Scalable knn search on vertically stored time series. In: SIGKDD, pp. 1334–1342. ACM (2011)
Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. In: KAIS, vol. 3, pp. 263–286. Springer (2001)
Kimura, H., Huo, G., Rasin, A., Madden, S., Zdonik, S.B.: Correlation maps: a compressed access method for exploiting soft functional dependencies. In: PVLDB, pp. 1222–1233 (2009)
Kimura, H., Huo, G., Rasin, A., Madden, S., Zdonik, S.B.: Coradd: Correlation aware database designer for materialized views and indexes. In: PVLDB, vol. 3, pp. 1103–1113 (2010)
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: A scalable bottom-up approach for building data series indexes. In: PVLDB, vol. 11, pp. 677–690 (2018)
Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: The ulisse approach. In: PVLDB, pp. 2236–2248 (2018)
Liu, H., Xiao, D., Didwania, P., Eltabakh, M.Y.: Exploiting soft and hard correlations in big data query optimization. In: PVLDB, 12, pp. 1005–1016 (2016)
Liu, Y., Liu, H., Xiao, D., Eltabakh, M.Y.: Adaptive correlation exploitation in big data query optimization. In: VLDB Journal. Springer (2018)
Mandros, P., Boley, M., Vreeken, J.: Discovering reliable approximate functional dependencies. In: SIGKDD, pp. 355–363. ACM (2017)
Mandros, P., Boley, M., Vreeken, J.: Discovering reliable dependencies from data: Hardness and improved algorithms. In: ICDM, pp. 317–326. IEEE (2018)
Miyazawa, F.K., Pedrosa, L.L., Schouery, R.C., Sviridenko, M., Wakabayashi, Y.: Polynomial-time approximation schemes for circle and other packing problems. Algorithmica 76(2), 536–568 (2016)
Nehme, R.V., Rundensteiner, E.A., Bertino, E.: Self-tuning query mesh for adaptive multi-route query processing. In: EDBT, pp. 803–814 (2009)
Nguyen, H.V., Müller, E., Andritsos, P., Böhm, K.: Detecting correlated columns in relational databases with mixed data types. In: SSDBM, pp. 1–12 (2014)
Palpanas, T.: Big sequence management: A glimpse of the past, the present, and the future. In: SOFSEM, pp. 63–80. Springer (2016)
Palpanas, T.: The parallel and distributed future of data series mining. In: HPCS, pp. 916–920. IEEE (2017)
Palpanas, T.: Evolution of a data series index. In: ISIP. Springer (2019)
Palpanas, T., Beckmann, V.: Report on the first and second interdisciplinary time series analysis workshop (ITISA). In: SIGMOD. ACM (2019)
Park, Y., Cafarella, M., Mozafari, B.: Neighbor-sensitive hashing. PVLDB 9, 144–155 (2015)
Pearson, K.: The problem of the random walk. Nature 72(1865), 5558 (1905)
Peng, B., Fatourou, P., Palpanas, T.: Paris: The next destination for fast data series indexing and query answering. In: Big Data, pp. 791–800. IEEE (2018)
Peng, B., Fatourou, P., Palpanas, T.: Messi: In-memory data series indexing. In: ICDE. IEEE (2020)
Pennerath, F.: An efficient algorithm for computing entropic measures of feature subsets. In: ECML PKDD, pp. 483–499. Springer (2018)
Reimherr, M., Nicolae, D.L., et al.: On quantifying dependence: A framework for developing interpretable measures. In: Statistical Science, vol. 28, pp. 116–130. Institute of Mathematical Statistics (IMS) (2013)
Shieh, J., Keogh, E.: isax: indexing and mining terabyte sized time series. In: SIGKDD, pp. 623–631. ACM (2008)
Shvachko, K., Kuang, H., Radia, S., Chansler, R., et al.: The hadoop distributed file system. In: MSST, pp. 1–10. IEEE (2010)
Stephenson, K.: Circle packing: a mathematical tale. Not. AMS 50(11), 1376–1388 (2003)
Stephenson, K.: Introduction to Circle Packing: The Theory of Discrete Analytic Functions. Cambridge University Press, Cambridge (2005)
Tamura, H., Yokoya, N.: Image database systems: a survey. Pattern Recogn. 17(1), 29–43 (1984)
Ullman, J.D.: Principles of database and knowledge-base systems. In: Computer Science Press, Inc., vol. 1 (1988)
Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: PVLDB, vol. 98, pp. 194–205 (1998)
Wu, J., Wang, P., Pan, N., Wang, C., Wang, W., Wang, J.: Kv-match: A subsequence matching approach supporting normalization and time warping. In: ICDE. IEEE (2019)
Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: DPiSAX: Massively distributed partitioned iSAX. In: ICDM. IEEE (2017)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud (2010)
Zhang, L., Alghamdi, N., Eltabakh, M.Y., Rundensteiner, E.A.: TARDIS: Distributed indexing framework for big time series data. In: ICDE, pp. 1202–1213. IEEE (2019)
Zhang, L., Alghamdi, N., Eltabakh, M.Y., Rundensteiner, E.A.: Big data series analytics using TARDIS and its exploitation in geospatial applications. In: SIGMOD, pp. 2785–2788. ACM (2020)
Zoumpatianos, K., Idreos, S., Palpanas, T.: ADS: the adaptive data series index. In: VLDB Journal, vol. 25, pp. 843–866. Springer (2016)

Author information

Authors and Affiliations

Data Science Department, Worcester Polytechnic Institute (WPI), 100 Institute Rd., Worcester, MA, 01609, USA
Liang Zhang, Noura Alghamdi, Huayi Zhang, Mohamed Y. Eltabakh & Elke A. Rundensteiner

Corresponding author

Correspondence to Liang Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, L., Alghamdi, N., Zhang, H. et al. PARROT: pattern-based correlation exploitation in big partitioned data series. The VLDB Journal 32, 665–688 (2023). https://doi.org/10.1007/s00778-022-00767-9

Received: 11 November 2021
Revised: 20 August 2022
Accepted: 30 September 2022
Published: 13 October 2022
Issue Date: May 2023
DOI: https://doi.org/10.1007/s00778-022-00767-9
