Coconut: sortable summarizations for scalable indexes over static and streaming data series

Kondylakis, Haridimos; Dayan, Niv; Zoumpatianos, Kostas; Palpanas, Themis

doi:10.1007/s00778-019-00573-w

Regular Paper
Published: 25 September 2019

Volume 28, pages 847–869, (2019)
The VLDB Journal

Haridimos Kondylakis¹ (ORCID: orcid.org/0000-0002-9917-4486), Niv Dayan², Kostas Zoumpatianos² & Themis Palpanas³

Abstract

Many modern applications produce massive streams of data series that need to be analyzed, requiring efficient similarity search operations. However, the state-of-the-art data series indexes that are used for this purpose do not scale well for massive datasets in terms of performance or storage costs. We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. To address this problem, we present Coconut, the first data series index based on sortable summarizations and the first efficient solution for indexing and querying streaming series. The first innovation in Coconut is an inverted, sortable data series summarization that organizes data series based on a z-order curve, keeping similar series close to each other in the sorted order. As a result, Coconut is able to use bulk loading and updating techniques that rely on sorting to quickly build and maintain a contiguous index using large sequential disk I/Os. We then explore prefix-based and median-based splitting policies for bottom-up bulk loading, showing that median-based splitting outperforms the state of the art, ensuring that all nodes are densely populated. Finally, we explore the impact of sortable summarizations on variable-sized window queries, showing that they can be supported in the presence of updates through efficient merging of temporal partitions. Overall, we show analytically and empirically that Coconut dominates the state-of-the-art data series indexes in terms of construction speed, query speed, and storage costs.
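To illustrate why a z-order curve makes summarizations sortable, the following is a minimal sketch of bit interleaving, not the authors' implementation. It assumes a toy summarization of two segments with 2-bit symbols; the function names `interleave` and `concatenate` and the four example summaries are hypothetical. Sorting on the concatenated symbols orders series by the first segment only, whereas sorting on the interleaved key compares the most significant bits of all segments first.

```python
def interleave(symbols, bits):
    """Build a z-order (Morton) key: emit the most significant bit of every
    symbol first, then the next bit of every symbol, and so on."""
    key = 0
    for level in range(bits - 1, -1, -1):  # most significant bit first
        for s in symbols:
            key = (key << 1) | ((s >> level) & 1)
    return key


def concatenate(symbols, bits):
    """Build a key by plain concatenation: all bits of the first symbol,
    then all bits of the second symbol, and so on."""
    key = 0
    for s in symbols:
        key = (key << bits) | s
    return key


BITS = 2  # assumed symbol width for this toy example

# Hypothetical summaries: (segment 0 symbol, segment 1 symbol), values 0..3.
# A and C differ by one step in segment 0; so do B and D.
summaries = {"A": (0, 0), "B": (0, 3), "C": (1, 0), "D": (1, 3)}

by_concat = sorted(summaries, key=lambda n: concatenate(summaries[n], BITS))
by_zorder = sorted(summaries, key=lambda n: interleave(summaries[n], BITS))

print(by_concat)  # ['A', 'B', 'C', 'D']
print(by_zorder)  # ['A', 'C', 'B', 'D']
```

In the concatenated order, B sits between the similar pair A and C; in the interleaved order, A and C are adjacent, as are B and D. Because the interleaved key is an ordinary integer, any sort-based bulk loading routine can consume it directly.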

Notes

Informally, a data series or data sequence is an ordered sequence of data points. If the dimension that imposes the ordering of the sequence is time, then we talk about time series, though a series can also be defined over other measures (e.g., angle in radial profiles in astronomy, mass in mass spectroscopy, position in genome sequences, etc.). For the rest of this paper, we are going to use the terms data series and sequence interchangeably.
This is analogous to sorting points in a multi-dimensional space based on one dimension.
Note that recent state-of-the-art serial scan algorithms [42, 55] are only efficient for scenarios that involve nearest neighbor operations of a short query subsequence against a very long data series. On the contrary, in this work, we are interested in finding similarities in very large collections of short sequences.
Note that SAX words are typically longer to enable more precision; we use 2-character SAX words in this example for ease of exposition.
In fact, this condition only holds as long as \(M > \sqrt{N}\) [57]. Since main memory is approximately two orders of magnitude more expensive than secondary storage, this condition holds in practice for massive datasets.
In a materialized index, the raw data series are stored alongside their summarizations within the index, whereas in a non-materialized one the index contains pointers to the raw data series that are stored in a different file.
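As a rough worked example of the \(M > \sqrt{N}\) condition mentioned in the notes above: the note does not restate which quantity the condition governs or in which units M and N are measured, so this sketch assumes both are counted in pages, with M the main memory buffer and N the dataset. The page size, dataset size, and the helper name `min_memory_pages` are all assumptions chosen for illustration.

```python
import math


def min_memory_pages(n_pages):
    """Smallest whole number of pages M that satisfies M > sqrt(N)."""
    return math.isqrt(n_pages) + 1


PAGE_BYTES = 4096     # assumed page size (4 KiB)
DATA_BYTES = 2 ** 40  # assumed dataset size (1 TiB)

n_pages = DATA_BYTES // PAGE_BYTES   # N = 2**28 pages
m_pages = min_memory_pages(n_pages)  # M = 2**14 + 1 pages
m_mib = m_pages * PAGE_BYTES / 2 ** 20

print(n_pages, m_pages, round(m_mib, 2))
```

Under these assumptions, a 1 TiB dataset needs only slightly more than 64 MiB of memory to satisfy the condition, which is consistent with the note's remark that it holds in practice for massive datasets.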

References

Adhd-200. http://fcon_1000.projects.nitrc.org/indi/adhd200/ (2017)
Aggarwal, A., Vitter, J.S.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)
Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: FODO, pp. 69–84 (1993)
Alsubaiee, S., Carey, M.J., Li, C.: Lsm-based storage and indexing: an old idea with timely benefits. In: Second International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, GeoRich@SIGMOD 2015, Melbourne, VIC, Australia, May 31, 2015, pp. 1–6 (2015). https://doi.org/10.1145/2786006.2786007
Assent, I., Krieger, R., Afschari, F., Seidl, T.: The ts-tree: efficient time series search and retrieval. In: EDBT, pp. 252–263 (2008). https://doi.org/10.1145/1353343.1353376
Bayer, R., Markl, V.: The ub-tree: performance of multidimensional range queries. Technical Report. Institut für Informatik, TU München (1998)
Camerra, A., Palpanas, T., Shieh, J., Keogh, E.J.: isax 2.0: indexing and mining one billion time series. In: ICDM, pp. 58–67 (2010). https://doi.org/10.1109/ICDM.2010.124
Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.: Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS 39(1), 123–151 (2014)
Chakrabarti, K., Keogh, E.J., Mehrotra, S., Pazzani, M.J.: Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Database Syst. 27(2), 188–228 (2002). https://doi.org/10.1145/568518.568520
Chan, K.P., Fu, A.W.: Efficient time series matching by wavelets. In: ICDE, pp. 126–133 (1999). https://doi.org/10.1109/ICDE.1999.754915
Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. (CSUR) 41(3), 15 (2009)
Dayan, N., Athanassoulis, M., Idreos, S.: Monkey: optimal navigable key-value store. In: SIGMOD, pp. 79–94 (2017). https://doi.org/10.1145/3035918.3064054
Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. In: PVLDB (2019)
Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD, pp. 419–429 (1994). https://doi.org/10.1145/191839.191925
Goldin, D.Q., Kanellakis, P.C.: On similarity queries for time-series data: constraint specification and implementation. In: Proceedings of First International Conference on Principles and Practice of Constraint Programming—CP’95, Cassis, France, September 19–22, 1995, pp. 137–153 (1995)
Grabocka, J., Schilling, N., Schmidt-Thieme, L.: Latent time-series motifs. TKDD 11(1), 6:1–6:20 (2016)
Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: SIGMOD, pp. 47–57 (1984). https://doi.org/10.1145/602259.602266
Huijse, P., Estévez, P.A., Protopapas, P., Principe, J.C., Zegers, P.: Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE CIM 9(3), 27–39 (2014)
Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: CIDR 2007, pp. 68–78 (2007). http://cidrdb.org/cidr2007/papers/cidr07p07.pdf
Incorporated Research Institutions for Seismology—Seismic Data Access. http://ds.iris.edu/data/access/ (2016)
Kashino, K., Smith, G., Murase, H.: Time-series active search for quick retrieval of audio and video. In: ICASSP, pp. 2993–2996 (1999). https://doi.org/10.1109/ICASSP.1999.757470
Kashyap, S., Karras, P.: Scalable knn search on vertically stored time series. In: SIGKDD, pp. 1334–1342 (2011). https://doi.org/10.1145/2020408.2020607
Kate, R.J.: Using dynamic time warping distances as features for improved time series classification. Data Min. Knowl. Discov. 30(2), 283–312 (2016). https://doi.org/10.1007/s10618-015-0418-x
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley, Hoboken (2009)
Keogh, E.J.: Fast similarity search in the presence of longitudinal scaling in time series databases. In: ICTAI, pp. 578–584 (1997). https://doi.org/10.1109/TAI.1997.632306
Keogh, E.J., Chakrabarti, K., Pazzani, M.J., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS 3(3), 263–286 (2001)
Keogh, E.J., Pazzani, M.J.: An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: KDD, pp. 239–243 (1998). http://www.aaai.org/Library/KDD/1998/kdd98-041.php
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: a scalable bottom-up approach for building data series indexes. PVLDB 11(6), 677–690 (2018). https://doi.org/10.14778/3184470.3184472
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut palm: static and streaming data series exploration now in your palm. In: Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30–July 5, 2019, pp. 1941–1944 (2019). https://doi.org/10.1145/3299869.3320233
Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD, pp. 289–300 (1997). https://doi.org/10.1145/253260.253332
Leutenegger, S.T., Edgington, J.M., López, M.A.: STR: a simple and efficient algorithm for r-tree packing. In: ICDE, pp. 497–506 (1997). https://doi.org/10.1109/ICDE.1997.582015
Li, C., Yu, P.S., Castelli, V.: Hierarchyscan: a hierarchical similarity search algorithm for databases of long sequences. In: ICDE, pp. 546–553 (1996). https://doi.org/10.1109/ICDE.1996.492205
Liao, T.W.: Clustering of time series data—a survey. Pattern Recognit. 38(11), 1857–1874 (2005). https://doi.org/10.1016/j.patcog.2005.01.025
Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD (2003)
Lin, J., Keogh, E.J., Truppel, W.: Clustering of streaming time series is meaningless. In: DMKD, pp. 56–65 (2003). https://doi.org/10.1145/882082.882096
Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: the ULISSE approach. PVLDB 11(13), 2236–2248 (2018)
Linardi, M., Palpanas, T.: ULISSE: ULtra compact Index for variable-length Similarity SEarch in data series. In: ICDE (2018)
Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix profile X: Valmod-scalable discovery of variable-length motifs in data series. In: SIGMOD (2018)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: BSMSP, pp. 281–297 (1967)
Mirylenka, K., Christophides, V., Palpanas, T., Pefkianakis, I., May, M.: Characterizing home device usage from wireless traffic time series. In: EDBT, pp. 551–562 (2016). https://doi.org/10.5441/002/edbt.2016.51
Morton, G.M.: A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, Ottawa (1966)
Mueen, A., Hamooni, H., Estrada, T.: Time series join on subsequence correlation. In: ICDM, pp. 450–459 (2014)
Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B., Shamlo, N.B.: A disk-aware algorithm for time series motif discovery. DAMI 22(1–2), 73–105 (2011)
Mueen, A., Nath, S., Liu, J.: Fast approximate correlation for massive time-series data. In: SIGMOD, pp. 171–182 (2010). https://doi.org/10.1145/1807167.1807188
O’Neil, P.E., Cheng, E., Gawlick, D., O’Neil, E.J.: The log-structured merge-tree (lsm-tree). Acta Inf. 33(4), 351–385 (1996). https://doi.org/10.1007/s002360050048
Palpanas, T.: Data series management: the road to big sequence analytics. SIGMOD Record 44(2), 47–52 (2015)
Palpanas, T.: Big sequence management: a glimpse of the past, the present, and the future. In: SOFSEM, pp. 63–80 (2016). https://doi.org/10.1007/978-3-662-49192-8_6
Palpanas, T.: The parallel and distributed future of data series mining. In: HPCS, pp. 916–920 (2017). https://doi.org/10.1109/HPCS.2017.155
Paparrizos, J., Gravano, L.: k-shape: efficient and accurate clustering of time series. In: SIGMOD, pp. 1855–1870 (2015). https://doi.org/10.1145/2723372.2737793
Paraskevopoulos, P., Dinh, T.C., Dashdorj, Z., Palpanas, T., Serafini, L.: Identification and characterization of human behavior patterns from mobile phone data. In: D4D Challenge Session, NetMob (2013)
Pelkonen, T., Franklin, S., Cavallaro, P., Huang, Q., Meza, J., Teller, J., Veeraraghavan, K.: Gorilla: a fast, scalable, in-memory time series database. PVLDB 8(12), 1816–1827 (2015)
Peng, B., Fatourou, P., Palpanas, T.: ParIS: the next destination for fast data series indexing and query answering. In: BIGDATA, pp. 791–800 (2018). https://doi.org/10.1109/BigData.2018.8622293
Rafiei, D.: On similarity-based queries for time series data. In: ICDE, pp. 410–417 (1999). https://doi.org/10.1109/ICDE.1999.754957
Rafiei, D., Mendelzon, A.O.: Similarity-based queries for time series data. In: SIGMOD, pp. 13–25 (1997). https://doi.org/10.1145/253260.253264
Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: SIGKDD, pp. 262–270 (2012). https://doi.org/10.1145/2339530.2339576
Rakthanmanon, T., Keogh, E.J., Lonardi, S., Evans, S.: Time series epenthesis: clustering time series streams requires ignoring some data. In: ICDM, pp. 547–556 (2011). https://doi.org/10.1109/ICDM.2011.146
Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill, New York (2003)
Ramsak, F., Markl, V., Fenk, R., Zirkel, M., Elhardt, K., Bayer, R.: Integrating the ub-tree into a database system kernel. In: VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10–14, 2000, Cairo, Egypt, pp. 263–272 (2000)
Rao, J., Ross, K.A.: Making b\(^{+}\)-trees cache conscious in main memory. In: SIGMOD, pp. 475–486 (2000). https://doi.org/10.1145/342009.335449
Ratanamahatana, C.A., Keogh, E.J.: Three myths about dynamic time warping data mining. In: SIAM, pp. 506–510 (2005). https://doi.org/10.1137/1.9781611972757.50
Ravi Kanth, K.V., Agrawal, D., Singh, A.: Dimensionality reduction for similarity searching in dynamic databases. In: SIGMOD, pp. 166–176 (1998)
Raza, U., Camerra, A., Murphy, A.L., Palpanas, T., Picco, G.P.: Practical data prediction for real-world wireless sensor networks. IEEE Trans. Knowl. Data Eng. 27(8), 2231–2244 (2015). https://doi.org/10.1109/TKDE.2015.2411594
Rodrigues, P.P., Gama, J., Pedroso, J.P.: Hierarchical clustering of time-series data streams. IEEE Trans. Knowl. Data Eng. 20(5), 615–627 (2008). https://doi.org/10.1109/TKDE.2007.190727
Shasha, D.: Tuning time series queries in finance: case studies and recommendations. IEEE Data Eng. Bull. 22(2), 40–46 (1999)
Shieh, J., Keogh, E.: iSAX: disk-aware mining and indexing of massive time series datasets. Data Min. Knowl. Discov. 19(1), 24–57 (2009)
Shieh, J., Keogh, E.J.: isax: indexing and mining terabyte sized time series. In: ACM SIGKDD, pp. 623–631 (2008). https://doi.org/10.1145/1401890.1401966
Sloan digital sky survey. https://www.sdss3.org/dr10/data_access/volume.php (2017)
Soldi, S., Beckmann, V., Baumgartner, W., Ponti, G., Shrader, C., Lubinski, P., Krimm, H., Mattana, F., Tueller, J.: Long-term variability of AGN at hard X-rays. Astron. Astrophys. 563, A57 (2014)
Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10), 793–804 (2013)
Xi, X., Keogh, E.J., Shelton, C.R., Wei, L., Ratanamahatana, C.A.: Fast time series classification using numerosity reduction. In: ICML, pp. 1033–1040 (2006). https://doi.org/10.1145/1143844.1143974
Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Dpisax: massively distributed partitioned isax. In: ICDM, pp. 1135–1140 (2017)
Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Massively distributed time series indexing and querying. TKDE (accepted for publication, 2018)
Ye, L., Keogh, E.J.: Time series shapelets: a new primitive for data mining. In: ACM SIGKDD, pp. 947–956 (2009). https://doi.org/10.1145/1557019.1557122
Zoumpatianos, K., Idreos, S., Palpanas, T.: Indexing for interactive exploration of big data series. In: SIGMOD, pp. 1555–1566 (2014)
Zoumpatianos, K., Idreos, S., Palpanas, T.: RINSE: interactive data series exploration with ADS+. PVLDB 8(12), 1912–1915 (2015)
Zoumpatianos, K., Idreos, S., Palpanas, T.: ADS: the adaptive data series index. VLDB J. 25(6), 843–866 (2016). https://doi.org/10.1007/s00778-016-0442-5
Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: ACM SIGKDD, pp. 1603–1612 (2015)
Zoumpatianos, K., Palpanas, T.: Data series management: fulfilling the need for big sequence analytics. In: ICDE (2018)

Author information

Authors and Affiliations

FORTH-ICS, Heraklion, Greece
Haridimos Kondylakis
Harvard University, Cambridge, USA
Niv Dayan & Kostas Zoumpatianos
Paris Descartes University, Paris, France
Themis Palpanas

Corresponding author

Correspondence to Haridimos Kondylakis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kondylakis, H., Dayan, N., Zoumpatianos, K. et al. Coconut: sortable summarizations for scalable indexes over static and streaming data series. The VLDB Journal 28, 847–869 (2019). https://doi.org/10.1007/s00778-019-00573-w

Received: 03 February 2019
Revised: 16 July 2019
Accepted: 17 September 2019
Published: 25 September 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s00778-019-00573-w
