Abstract
Many modern applications produce massive streams of data series that need to be analyzed, requiring efficient similarity search operations. However, the state-of-the-art data series indexes that are used for this purpose do not scale well for massive datasets in terms of performance or storage costs. We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. To address this problem, we present Coconut, the first data series index based on sortable summarizations and the first efficient solution for indexing and querying streaming series. The first innovation in Coconut is an inverted, sortable data series summarization that organizes data series based on a z-order curve, keeping similar series close to each other in the sorted order. As a result, Coconut is able to use bulk loading and updating techniques that rely on sorting to quickly build and maintain a contiguous index using large sequential disk I/Os. We then explore prefix-based and median-based splitting policies for bottom-up bulk loading, showing that median-based splitting outperforms the state of the art, ensuring that all nodes are densely populated. Finally, we explore the impact of sortable summarizations on variable-sized window queries, showing that they can be supported in the presence of updates through efficient merging of temporal partitions. Overall, we show analytically and empirically that Coconut dominates the state-of-the-art data series indexes in terms of construction speed, query speed, and storage costs.
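The idea of a sortable summarization can be illustrated with a minimal sketch. It assumes that each series has already been summarized as a short tuple of small integer symbols (one per segment, in the spirit of SAX) and compares two sort keys: plain concatenation of the symbols, and a z-order key that interleaves their bits. The function names and the toy summaries are hypothetical and are not taken from the Coconut implementation.

```python
def concat_key(symbols, bits_per_symbol):
    """Sort key that concatenates symbols segment by segment.
    The first segment dominates the order (like sorting on one dimension)."""
    key = 0
    for s in symbols:
        key = (key << bits_per_symbol) | s
    return key


def interleave_bits(symbols, bits_per_symbol):
    """Z-order (Morton) key: take the most significant bit of every symbol,
    then the next bit of every symbol, and so on."""
    key = 0
    for bit in range(bits_per_symbol - 1, -1, -1):  # MSB first
        for s in symbols:
            key = (key << 1) | ((s >> bit) & 1)
    return key


# Toy summaries: two segments, two bits per symbol (values 0..3).
summaries = [(3, 3), (0, 3), (1, 0), (0, 0)]

by_concat = sorted(summaries, key=lambda s: concat_key(s, 2))
by_zorder = sorted(summaries, key=lambda s: interleave_bits(s, 2))

print(by_concat)  # [(0, 0), (0, 3), (1, 0), (3, 3)]
print(by_zorder)  # [(0, 0), (1, 0), (0, 3), (3, 3)]
```

In this toy case the concatenated key places (0, 3) between (0, 0) and (1, 0), although the latter two differ by a single step in one segment. The interleaved key gives every segment a say in the most significant bits, so (0, 0) and (1, 0) end up adjacent in the sorted order.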
Notes
Informally, a data series or data sequence is an ordered sequence of data points. If the dimension that imposes the ordering of the sequence is time then we talk about time series, though a series can also be defined over other measures (e.g., angle in radial profiles in astronomy, mass in mass spectroscopy, position in genome sequences, etc.). For the rest of this paper, we are going to use the terms data series and sequence interchangeably.
This is analogous to sorting points in a multi-dimensional space based on one dimension.
Note that recent state-of-the-art serial scan algorithms [42, 55] are only efficient for scenarios that involve nearest neighbor operations of a short query subsequence against a very long data series. On the contrary, in this work, we are interested in finding similarities in very large collections of short sequences.
Note that SAX words are typically longer to enable more precision; we use 2-character SAX words in this example for ease of exposition.
In fact, this condition only holds as long as \(M > \sqrt{N}\) [57]. Since main memory is approximately two orders of magnitude more expensive than secondary storage, this condition holds in practice for massive datasets.
In a materialized index, the raw data series are stored alongside their summarizations within the index, whereas in a non-materialized one the index contains pointers to the raw data series that are stored in a different file.
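One way to read the bound \(M > \sqrt{N}\) in the note above is through a textbook external merge sort. The sketch below assumes that \(N\) and \(M\) are both counted in disk pages, that the first pass forms sorted runs of \(M\) pages each, and that every later pass merges up to \(M - 1\) runs at a time. Under these assumptions two passes suffice roughly when \(N \le M^2\). The function name and the page counts are hypothetical.

```python
def external_sort_passes(n_pages, mem_pages):
    """Passes over the data made by a textbook external merge sort.

    Pass 0 reads mem_pages pages at a time and writes sorted runs.
    Each later pass merges up to (mem_pages - 1) runs into one,
    keeping one page of memory as the output buffer.
    """
    if mem_pages < 3:
        raise ValueError("need at least 3 pages to merge two runs")
    runs = -(-n_pages // mem_pages)  # ceiling division
    passes = 1
    while runs > 1:
        runs = -(-runs // (mem_pages - 1))
        passes += 1
    return passes


# N = 1,000,000 pages, so sqrt(N) = 1000 pages.
print(external_sort_passes(1_000_000, 1_001))  # 2  (M slightly above sqrt(N))
print(external_sort_passes(1_000_000, 100))    # 4  (M well below sqrt(N))
```

With memory just above \(\sqrt{N}\) pages the data is read and written twice, whereas with far less memory additional merge passes are needed.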
References
Adhd-200. http://fcon_1000.projects.nitrc.org/indi/adhd200/ (2017)
Aggarwal, A., Vitter, J.S.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)
Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: FODO, pp. 69–84 (1993)
Alsubaiee, S., Carey, M.J., Li, C.: Lsm-based storage and indexing: an old idea with timely benefits. In: Second International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, GeoRich@SIGMOD 2015, Melbourne, VIC, Australia, May 31, 2015, pp. 1–6 (2015). https://doi.org/10.1145/2786006.2786007
Assent, I., Krieger, R., Afschari, F., Seidl, T.: The ts-tree: efficient time series search and retrieval. In: EDBT, pp. 252–263 (2008). https://doi.org/10.1145/1353343.1353376
Bayer, R., Markl, V.: The ub-tree: performance of multidimensional range queries. Technical Report. Institut für Informatik, TU München (1998)
Camerra, A., Palpanas, T., Shieh, J., Keogh, E.J.: isax 2.0: indexing and mining one billion time series. In: ICDM, pp. 58–67 (2010). https://doi.org/10.1109/ICDM.2010.124
Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.: Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS 39(1), 123–151 (2014)
Chakrabarti, K., Keogh, E.J., Mehrotra, S., Pazzani, M.J.: Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Database Syst. 27(2), 188–228 (2002). https://doi.org/10.1145/568518.568520
Chan, K.P., Fu, A.W.: Efficient time series matching by wavelets. In: ICDE, pp. 126–133 (1999). https://doi.org/10.1109/ICDE.1999.754915
Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. (CSUR) 41(3), 15 (2009)
Dayan, N., Athanassoulis, M., Idreos, S.: Monkey: optimal navigable key-value store. In: SIGMOD, pp. 79–94 (2017). https://doi.org/10.1145/3035918.3064054
Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. In: PVLDB (2019)
Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD, pp. 419–429 (1994). https://doi.org/10.1145/191839.191925
Goldin, D.Q., Kanellakis, P.C.: On similarity queries for time-series data: constraint specification and implementation. In: Proceedings of First International Conference on Principles and Practice of Constraint Programming—CP’95, Cassis, France, September 19–22, 1995, pp. 137–153 (1995)
Grabocka, J., Schilling, N., Schmidt-Thieme, L.: Latent time-series motifs. TKDD 11(1), 6:1–6:20 (2016)
Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: SIGMOD, pp. 47–57 (1984). https://doi.org/10.1145/602259.602266
Huijse, P., Estévez, P.A., Protopapas, P., Principe, J.C., Zegers, P.: Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE CIM 9(3), 27–39 (2014)
Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: CIDR 2007, pp. 68–78 (2007). http://cidrdb.org/cidr2007/papers/cidr07p07.pdf
Incorporated Research Institutions for Seismology—Seismic Data Access. http://ds.iris.edu/data/access/ (2016)
Kashino, K., Smith, G., Murase, H.: Time-series active search for quick retrieval of audio and video. In: ICASSP, pp. 2993–2996 (1999). https://doi.org/10.1109/ICASSP.1999.757470
Kashyap, S., Karras, P.: Scalable knn search on vertically stored time series. In: SIGKDD, pp. 1334–1342 (2011). https://doi.org/10.1145/2020408.2020607
Kate, R.J.: Using dynamic time warping distances as features for improved time series classification. Data Min. Knowl. Discov. 30(2), 283–312 (2016). https://doi.org/10.1007/s10618-015-0418-x
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley, Hoboken (2009)
Keogh, E.J.: Fast similarity search in the presence of longitudinal scaling in time series databases. In: ICTAI, pp. 578–584 (1997). https://doi.org/10.1109/TAI.1997.632306
Keogh, E.J., Chakrabarti, K., Pazzani, M.J., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS 3(3), 263–286 (2001)
Keogh, E.J., Pazzani, M.J.: An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: KDD, pp. 239–243 (1998). http://www.aaai.org/Library/KDD/1998/kdd98-041.php
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: a scalable bottom-up approach for building data series indexes. PVLDB 11(6), 677–690 (2018). https://doi.org/10.14778/3184470.3184472
Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut palm: static and streaming data series exploration now in your palm. In: Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30–July 5, 2019, pp. 1941–1944 (2019). https://doi.org/10.1145/3299869.3320233
Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD, pp. 289–300 (1997). https://doi.org/10.1145/253260.253332
Leutenegger, S.T., Edgington, J.M., López, M.A.: STR: a simple and efficient algorithm for r-tree packing. In: ICDE, pp. 497–506 (1997). https://doi.org/10.1109/ICDE.1997.582015
Li, C., Yu, P.S., Castelli, V.: Hierarchyscan: a hierarchical similarity search algorithm for databases of long sequences. In: ICDE, pp. 546–553 (1996). https://doi.org/10.1109/ICDE.1996.492205
Liao, T.W.: Clustering of time series data—a survey. Pattern Recognit. 38(11), 1857–1874 (2005). https://doi.org/10.1016/j.patcog.2005.01.025
Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD (2003)
Lin, J., Keogh, E.J., Truppel, W.: Clustering of streaming time series is meaningless. In: DMKD, pp. 56–65 (2003). https://doi.org/10.1145/882082.882096
Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: the ULISSE approach. PVLDB 11(13), 2236–2248 (2018)
Linardi, M., Palpanas, T.: ULISSE: ULtra compact Index for variable-length Similarity SEarch in data series. In: ICDE (2018)
Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix profile X: Valmod-scalable discovery of variable-length motifs in data series. In: SIGMOD (2018)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: BSMSP, pp. 281–297 (1967)
Mirylenka, K., Christophides, V., Palpanas, T., Pefkianakis, I., May, M.: Characterizing home device usage from wireless traffic time series. In: EDBT, pp. 551–562 (2016). https://doi.org/10.5441/002/edbt.2016.51
Morton, G.M.: A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, Ottawa (1966)
Mueen, A., Hamooni, H., Estrada, T.: Time series join on subsequence correlation. In: ICDM, pp. 450–459 (2014)
Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B., Shamlo, N.B.: A disk-aware algorithm for time series motif discovery. DAMI 22(1–2), 73–105 (2011)
Mueen, A., Nath, S., Liu, J.: Fast approximate correlation for massive time-series data. In: SIGMOD, pp. 171–182 (2010). https://doi.org/10.1145/1807167.1807188
O’Neil, P.E., Cheng, E., Gawlick, D., O’Neil, E.J.: The log-structured merge-tree (lsm-tree). Acta Inf. 33(4), 351–385 (1996). https://doi.org/10.1007/s002360050048
Palpanas, T.: Data series management: the road to big sequence analytics. SIGMOD Record 44(2), 47–52 (2015)
Palpanas, T.: Big sequence management: a glimpse of the past, the present, and the future. In: SOFSEM, pp. 63–80 (2016). https://doi.org/10.1007/978-3-662-49192-8_6
Palpanas, T.: The parallel and distributed future of data series mining. In: HPCS, pp. 916–920 (2017). https://doi.org/10.1109/HPCS.2017.155
Paparrizos, J., Gravano, L.: k-shape: efficient and accurate clustering of time series. In: SIGMOD, pp. 1855–1870 (2015). https://doi.org/10.1145/2723372.2737793
Paraskevopoulos, P., Dinh, T.C., Dashdorj, Z., Palpanas, T., Serafini, L.: Identification and characterization of human behavior patterns from mobile phone data. In: D4D Challenge Session, NetMob (2013)
Pelkonen, T., Franklin, S., Cavallaro, P., Huang, Q., Meza, J., Teller, J., Veeraraghavan, K.: Gorilla: a fast, scalable, in-memory time series database. PVLDB 8(12), 1816–1827 (2015)
Peng, B., Fatourou, P., Palpanas, T.: ParIS: the next destination for fast data series indexing and query answering. In: BIGDATA, pp. 791–800 (2018). https://doi.org/10.1109/BigData.2018.8622293
Rafiei, D.: On similarity-based queries for time series data. In: ICDE, pp. 410–417 (1999). https://doi.org/10.1109/ICDE.1999.754957
Rafiei, D., Mendelzon, A.O.: Similarity-based queries for time series data. In: SIGMOD, pp. 13–25 (1997). https://doi.org/10.1145/253260.253264
Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: SIGKDD, pp. 262–270 (2012). https://doi.org/10.1145/2339530.2339576
Rakthanmanon, T., Keogh, E.J., Lonardi, S., Evans, S.: Time series epenthesis: clustering time series streams requires ignoring some data. In: ICDM, pp. 547–556 (2011). https://doi.org/10.1109/ICDM.2011.146
Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill, New York (2003)
Ramsak, F., Markl, V., Fenk, R., Zirkel, M., Elhardt, K., Bayer, R.: Integrating the ub-tree into a database system kernel. In: VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10–14, 2000, Cairo, Egypt, pp. 263–272 (2000)
Rao, J., Ross, K.A.: Making b\(^{+}\)-trees cache conscious in main memory. In: SIGMOD, pp. 475–486 (2000). https://doi.org/10.1145/342009.335449
Ratanamahatana, C.A., Keogh, E.J.: Three myths about dynamic time warping data mining. In: SIAM, pp. 506–510 (2005). https://doi.org/10.1137/1.9781611972757.50
Ravi Kanth, K.V., Agrawal, D., Singh, A.: Dimensionality reduction for similarity searching in dynamic databases. In: SIGMOD, pp. 166–176 (1998)
Raza, U., Camerra, A., Murphy, A.L., Palpanas, T., Picco, G.P.: Practical data prediction for real-world wireless sensor networks. IEEE Trans. Knowl. Data Eng. 27(8), 2231–2244 (2015). https://doi.org/10.1109/TKDE.2015.2411594
Rodrigues, P.P., Gama, J., Pedroso, J.P.: Hierarchical clustering of time-series data streams. IEEE Trans. Knowl. Data Eng. 20(5), 615–627 (2008). https://doi.org/10.1109/TKDE.2007.190727
Shasha, D.: Tuning time series queries in finance: case studies and recommendations. IEEE Data Eng. Bull. 22(2), 40–46 (1999)
Shieh, J., Keogh, E.: iSAX: disk-aware mining and indexing of massive time series datasets. Data Min. Knowl. Discov. 19(1), 24–57 (2009)
Shieh, J., Keogh, E.J.: isax: indexing and mining terabyte sized time series. In: ACM SIGKDD, pp. 623–631 (2008). https://doi.org/10.1145/1401890.1401966
Sloan digital sky survey. https://www.sdss3.org/dr10/data_access/volume.php (2017)
Soldi, S., Beckmann, V., Baumgartner, W., Ponti, G., Shrader, C., Lubinski, P., Krimm, H., Mattana, F., Tueller, J.: Long-term variability of AGN at hard X-rays. Astron. Astrophys. 563, A57 (2014)
Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10), 793–804 (2013)
Xi, X., Keogh, E.J., Shelton, C.R., Wei, L., Ratanamahatana, C.A.: Fast time series classification using numerosity reduction. In: ICML, pp. 1033–1040 (2006). https://doi.org/10.1145/1143844.1143974
Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Dpisax: massively distributed partitioned isax. In: ICDM, pp. 1135–1140 (2017)
Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Massively distributed time series indexing and querying. TKDE (accepted for publication, 2018)
Ye, L., Keogh, E.J.: Time series shapelets: a new primitive for data mining. In: ACM SIGKDD, pp. 947–956 (2009). https://doi.org/10.1145/1557019.1557122
Zoumpatianos, K., Idreos, S., Palpanas, T.: Indexing for interactive exploration of big data series. In: SIGMOD, pp. 1555–1566 (2014)
Zoumpatianos, K., Idreos, S., Palpanas, T.: RINSE: interactive data series exploration with ADS+. PVLDB 8(12), 1912–1915 (2015)
Zoumpatianos, K., Idreos, S., Palpanas, T.: ADS: the adaptive data series index. VLDB J. 25(6), 843–866 (2016). https://doi.org/10.1007/s00778-016-0442-5
Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: ACM SIGKDD, pp. 1603–1612 (2015)
Zoumpatianos, K., Palpanas, T.: Data series management: fulfilling the need for big sequence analytics. In: ICDE (2018)
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Kondylakis, H., Dayan, N., Zoumpatianos, K. et al. Coconut: sortable summarizations for scalable indexes over static and streaming data series. The VLDB Journal 28, 847–869 (2019). https://doi.org/10.1007/s00778-019-00573-w
DOI: https://doi.org/10.1007/s00778-019-00573-w