Coconut: sortable summarizations for scalable indexes over static and streaming data series

Abstract

Many modern applications produce massive streams of data series that need to be analyzed, requiring efficient similarity search operations. However, the state-of-the-art data series indexes used for this purpose do not scale well to massive datasets in terms of either performance or storage costs. We trace the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. To address this problem, we present Coconut, the first data series index based on sortable summarizations and the first efficient solution for indexing and querying streaming series. The first innovation in Coconut is an inverted, sortable data series summarization that organizes data series based on a z-order curve, keeping similar series close to each other in the sorted order. As a result, Coconut can use bulk loading and updating techniques that rely on sorting to quickly build and maintain a contiguous index using large sequential disk I/Os. We then explore prefix-based and median-based splitting policies for bottom-up bulk loading, showing that median-based splitting outperforms the state of the art by ensuring that all nodes are densely populated. Finally, we explore the impact of sortable summarizations on variable-sized window queries, showing that they can be supported in the presence of updates through efficient merging of temporal partitions. Overall, we show analytically and empirically that Coconut dominates the state-of-the-art data series indexes in terms of construction speed, query speed, and storage costs.
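
As a rough illustration of the sortable summarization described in the abstract, the sketch below interleaves the bits of per-segment symbols to form a z-order key and compares the resulting sort order with a plain segment-by-segment (lexicographic) order. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `zorder_key` and `lexicographic_key`, the 2-bit alphabet, and the sample words are all hypothetical.

```python
from typing import List, Sequence, Tuple


def zorder_key(symbols: Sequence[int], bits: int) -> int:
    """Interleave the bits of all symbols, most significant bits first.

    Hypothetical helper: each symbol is assumed to be an integer in
    [0, 2**bits) that summarizes one segment of a data series.
    """
    key = 0
    for bit in range(bits - 1, -1, -1):   # walk from the top bit down
        for s in symbols:                  # take one bit from every segment
            key = (key << 1) | ((s >> bit) & 1)
    return key


def lexicographic_key(symbols: Sequence[int], bits: int) -> int:
    """Concatenate symbols segment by segment, so segment 0 dominates the order."""
    key = 0
    for s in symbols:
        key = (key << bits) | s
    return key


# Made-up two-segment words over a 2-bit alphabet (values 0..3).
words: List[Tuple[int, int]] = [(0, 3), (1, 0), (0, 0), (1, 1), (3, 3), (2, 2)]

by_zorder = sorted(words, key=lambda w: zorder_key(w, bits=2))
by_lexicographic = sorted(words, key=lambda w: lexicographic_key(w, bits=2))

print("z-order:      ", by_zorder)
print("lexicographic:", by_lexicographic)
```

In the lexicographic order, `(0, 3)` lands between `(0, 0)` and `(1, 0)` because the first segment alone decides the position. In the z-order, the words whose segments all share the same high-order bit, `(0, 0)`, `(1, 0)` and `(1, 1)`, are grouped before `(0, 3)`, since every segment contributes to the most significant part of the key.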

Notes

  1.

    Informally, a data series or data sequence is an ordered sequence of data points. If the dimension that imposes the ordering of the sequence is time, then we talk about time series, though a series can also be defined over other measures (e.g., angle in radial profiles in astronomy, mass in mass spectroscopy, position in genome sequences). For the rest of this paper, we use the terms data series and sequence interchangeably.

  2.

    This is analogous to sorting points in a multi-dimensional space based on one dimension.

  3.

    Note that recent state-of-the-art serial scan algorithms [42, 55] are only efficient for scenarios that involve nearest neighbor operations of a short query subsequence against a very long data series. In contrast, in this work we are interested in finding similarities in very large collections of short sequences.

  4.

    Note that SAX words are typically longer to enable more precision; we use 2-character SAX words in this example for ease of exposition.

  5.

    In fact, this condition only holds as long as \(M > \sqrt{N}\) [57]. Since main memory is approximately two orders of magnitude more expensive than secondary storage, this condition holds in practice for massive datasets.

  6.

    In a materialized index, the raw data series are stored alongside their summarizations within the index, whereas in a non-materialized one the index contains pointers to the raw data series that are stored in a different file.
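
Note 5 does not restate the condition it qualifies. Assuming it refers to the textbook result [57] that external merge sort finishes in two passes (run formation plus a single merge) when the memory budget \(M\) exceeds roughly \(\sqrt{N}\), with both measured in pages, the arithmetic can be sketched as follows. The function name `merge_passes` and the page counts are illustrative assumptions, not values from the paper.

```python
def merge_passes(n_pages: int, m_pages: int) -> int:
    """Count merge passes needed after run formation in external merge sort.

    Assumptions (not from the paper): the data occupy n_pages pages, memory
    holds m_pages pages, run formation produces sorted runs of m_pages pages
    each, and every merge pass combines up to m_pages - 1 runs at a time
    (one page is reserved for output).
    """
    if m_pages < 3:
        raise ValueError("need at least 3 pages to merge two runs")
    runs = -(-n_pages // m_pages)        # ceiling division: initial sorted runs
    fan_in = m_pages - 1
    passes = 0
    while runs > 1:
        runs = -(-runs // fan_in)        # each pass merges fan_in runs into one
        passes += 1
    return passes


n = 1_000_000                            # dataset size in pages, sqrt(n) = 1000
print(merge_passes(n, 1001))             # memory just above sqrt(n)
print(merge_passes(n, 100))              # memory well below sqrt(n)
```

With memory just above \(\sqrt{N}\) pages, a single merge pass suffices, so the whole sort reads and writes the data twice. With far less memory, additional merge passes are required.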

References

  1.

    Adhd-200. http://fcon_1000.projects.nitrc.org/indi/adhd200/ (2017)

  2.

    Aggarwal, A., Vitter, J.S.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)

  3.

    Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: FODO, pp. 69–84 (1993)

  4.

    Alsubaiee, S., Carey, M.J., Li, C.: Lsm-based storage and indexing: an old idea with timely benefits. In: Second International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, GeoRich@SIGMOD 2015, Melbourne, VIC, Australia, May 31, 2015, pp. 1–6 (2015). https://doi.org/10.1145/2786006.2786007

  5.

    Assent, I., Krieger, R., Afschari, F., Seidl, T.: The ts-tree: efficient time series search and retrieval. In: EDBT, pp. 252–263 (2008). https://doi.org/10.1145/1353343.1353376

  6.

    Bayer, R., Markl, V.: The ub-tree: performance of multidimensional range queries. Technical Report. Institut für Informatik, TU München (1998)

  7.

    Camerra, A., Palpanas, T., Shieh, J., Keogh, E.J.: isax 2.0: indexing and mining one billion time series. In: ICDM, pp. 58–67 (2010). https://doi.org/10.1109/ICDM.2010.124

  8.

    Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.: Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS 39(1), 123–151 (2014)

  9.

    Chakrabarti, K., Keogh, E.J., Mehrotra, S., Pazzani, M.J.: Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Database Syst. 27(2), 188–228 (2002). https://doi.org/10.1145/568518.568520

  10.

    Chan, K.P., Fu, A.W.: Efficient time series matching by wavelets. In: ICDE, pp. 126–133 (1999). https://doi.org/10.1109/ICDE.1999.754915

  11.

    Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. (CSUR) 41(3), 15 (2009)

  12.

    Dayan, N., Athanassoulis, M., Idreos, S.: Monkey: optimal navigable key-value store. In: SIGMOD, pp. 79–94 (2017). https://doi.org/10.1145/3035918.3064054

  13.

    Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. In: PVLDB (2019)

  14.

    Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD, pp. 419–429 (1994). https://doi.org/10.1145/191839.191925

  15.

    Goldin, D.Q., Kanellakis, P.C.: On similarity queries for time-series data: constraint specification and implementation. In: Proceedings of First International Conference on Principles and Practice of Constraint Programming—CP’95, Cassis, France, September 19–22, 1995, pp. 137–153 (1995)

  16.

    Grabocka, J., Schilling, N., Schmidt-Thieme, L.: Latent time-series motifs. TKDD 11(1), 6:1–6:20 (2016)

  17.

    Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: SIGMOD, pp. 47–57 (1984). https://doi.org/10.1145/602259.602266

  18.

    Huijse, P., Estévez, P.A., Protopapas, P., Principe, J.C., Zegers, P.: Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE CIM 9(3), 27–39 (2014)

  19.

    Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: CIDR 2007, pp. 68–78 (2007). http://cidrdb.org/cidr2007/papers/cidr07p07.pdf

  20.

    Incorporated Research Institutions for Seismology—Seismic Data Access. http://ds.iris.edu/data/access/ (2016)

  21.

    Kashino, K., Smith, G., Murase, H.: Time-series active search for quick retrieval of audio and video. In: ICASSP, pp. 2993–2996 (1999). https://doi.org/10.1109/ICASSP.1999.757470

  22.

    Kashyap, S., Karras, P.: Scalable knn search on vertically stored time series. In: SIGKDD, pp. 1334–1342 (2011). https://doi.org/10.1145/2020408.2020607

  23.

    Kate, R.J.: Using dynamic time warping distances as features for improved time series classification. Data Min. Knowl. Discov. 30(2), 283–312 (2016). https://doi.org/10.1007/s10618-015-0418-x

  24.

    Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley, Hoboken (2009)

  25.

    Keogh, E.J.: Fast similarity search in the presence of longitudinal scaling in time series databases. In: ICTAI, pp. 578–584 (1997). https://doi.org/10.1109/TAI.1997.632306

  26.

    Keogh, E.J., Chakrabarti, K., Pazzani, M.J., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS 3(3), 263–286 (2001)

  27.

    Keogh, E.J., Pazzani, M.J.: An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: KDD, pp. 239–243 (1998). http://www.aaai.org/Library/KDD/1998/kdd98-041.php

  28.

    Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: a scalable bottom-up approach for building data series indexes. PVLDB 11(6), 677–690 (2018). https://doi.org/10.14778/3184470.3184472

  29.

    Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut palm: static and streaming data series exploration now in your palm. In: Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30–July 5, 2019, pp. 1941–1944 (2019). https://doi.org/10.1145/3299869.3320233

  30.

    Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD, pp. 289–300 (1997). https://doi.org/10.1145/253260.253332

  31.

    Leutenegger, S.T., Edgington, J.M., López, M.A.: STR: a simple and efficient algorithm for r-tree packing. In: ICDE, pp. 497–506 (1997). https://doi.org/10.1109/ICDE.1997.582015

  32.

    Li, C., Yu, P.S., Castelli, V.: Hierarchyscan: a hierarchical similarity search algorithm for databases of long sequences. In: ICDE, pp. 546–553 (1996). https://doi.org/10.1109/ICDE.1996.492205

  33.

    Liao, T.W.: Clustering of time series data—a survey. Pattern Recognit. 38(11), 1857–1874 (2005). https://doi.org/10.1016/j.patcog.2005.01.025

  34.

    Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD (2003)

  35.

    Lin, J., Keogh, E.J., Truppel, W.: Clustering of streaming time series is meaningless. In: DMKD, pp. 56–65 (2003). https://doi.org/10.1145/882082.882096

  36.

    Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: the ULISSE approach. PVLDB 11(13), 2236–2248 (2018)

  37.

    Linardi, M., Palpanas, T.: ULISSE: ULtra compact Index for variable-length Similarity SEarch in data series. In: ICDE (2018)

  38.

    Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix profile X: Valmod-scalable discovery of variable-length motifs in data series. In: SIGMOD (2018)

  39.

    MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: BSMSP, pp. 281–297 (1967)

  40.

    Mirylenka, K., Christophides, V., Palpanas, T., Pefkianakis, I., May, M.: Characterizing home device usage from wireless traffic time series. In: EDBT, pp. 551–562 (2016). https://doi.org/10.5441/002/edbt.2016.51

  41.

    Morton, G.M.: A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, Ottawa (1966)

  42.

    Mueen, A., Hamooni, H., Estrada, T.: Time series join on subsequence correlation. In: ICDM, pp. 450–459 (2014)

  43.

    Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B., Shamlo, N.B.: A disk-aware algorithm for time series motif discovery. DAMI 22(1–2), 73–105 (2011)

  44.

    Mueen, A., Nath, S., Liu, J.: Fast approximate correlation for massive time-series data. In: SIGMOD, pp. 171–182 (2010). https://doi.org/10.1145/1807167.1807188

  45.

    O’Neil, P.E., Cheng, E., Gawlick, D., O’Neil, E.J.: The log-structured merge-tree (lsm-tree). Acta Inf. 33(4), 351–385 (1996). https://doi.org/10.1007/s002360050048

  46.

    Palpanas, T.: Data series management: the road to big sequence analytics. SIGMOD Record 44(2), 47–52 (2015)

  47.

    Palpanas, T.: Big sequence management: a glimpse of the past, the present, and the future. In: SOFSEM, pp. 63–80 (2016). https://doi.org/10.1007/978-3-662-49192-8_6

  48.

    Palpanas, T.: The parallel and distributed future of data series mining. In: HPCS, pp. 916–920 (2017). https://doi.org/10.1109/HPCS.2017.155

  49.

    Paparrizos, J., Gravano, L.: k-shape: efficient and accurate clustering of time series. In: SIGMOD, pp. 1855–1870 (2015). https://doi.org/10.1145/2723372.2737793

  50.

    Paraskevopoulos, P., Dinh, T.C., Dashdorj, Z., Palpanas, T., Serafini, L.: Identification and characterization of human behavior patterns from mobile phone data. In: D4D Challenge Session, NetMob (2013)

  51.

    Pelkonen, T., Franklin, S., Cavallaro, P., Huang, Q., Meza, J., Teller, J., Veeraraghavan, K.: Gorilla: a fast, scalable, in-memory time series database. PVLDB 8(12), 1816–1827 (2015)

  52.

    Peng, B., Fatourou, P., Palpanas, T.: ParIS: the next destination for fast data series indexing and query answering. In: BIGDATA, pp. 791–800 (2018). https://doi.org/10.1109/BigData.2018.8622293

  53.

    Rafiei, D.: On similarity-based queries for time series data. In: ICDE, pp. 410–417 (1999). https://doi.org/10.1109/ICDE.1999.754957

  54.

    Rafiei, D., Mendelzon, A.O.: Similarity-based queries for time series data. In: SIGMOD, pp. 13–25 (1997). https://doi.org/10.1145/253260.253264

  55.

    Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: SIGKDD, pp. 262–270 (2012). https://doi.org/10.1145/2339530.2339576

  56.

    Rakthanmanon, T., Keogh, E.J., Lonardi, S., Evans, S.: Time series epenthesis: clustering time series streams requires ignoring some data. In: ICDM, pp. 547–556 (2011). https://doi.org/10.1109/ICDM.2011.146

  57.

    Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill, New York (2003)

  58.

    Ramsak, F., Markl, V., Fenk, R., Zirkel, M., Elhardt, K., Bayer, R.: Integrating the ub-tree into a database system kernel. In: VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10–14, 2000, Cairo, Egypt, pp. 263–272 (2000)

  59.

    Rao, J., Ross, K.A.: Making b\(^{+}\)-trees cache conscious in main memory. In: SIGMOD, pp. 475–486 (2000). https://doi.org/10.1145/342009.335449

  60.

    Ratanamahatana, C.A., Keogh, E.J.: Three myths about dynamic time warping data mining. In: SIAM, pp. 506–510 (2005). https://doi.org/10.1137/1.9781611972757.50

  61.

    Ravi Kanth, K.V., Agrawal, D., Singh, A.: Dimensionality reduction for similarity searching in dynamic databases. In: SIGMOD, pp. 166–176 (1998)

  62.

    Raza, U., Camerra, A., Murphy, A.L., Palpanas, T., Picco, G.P.: Practical data prediction for real-world wireless sensor networks. IEEE Trans. Knowl. Data Eng. 27(8), 2231–2244 (2015). https://doi.org/10.1109/TKDE.2015.2411594

  63.

    Rodrigues, P.P., Gama, J., Pedroso, J.P.: Hierarchical clustering of time-series data streams. IEEE Trans. Knowl. Data Eng. 20(5), 615–627 (2008). https://doi.org/10.1109/TKDE.2007.190727

  64.

    Shasha, D.: Tuning time series queries in finance: case studies and recommendations. IEEE Data Eng. Bull. 22(2), 40–46 (1999)

  65.

    Shieh, J., Keogh, E.: iSAX: disk-aware mining and indexing of massive time series datasets. Data Min. Knowl. Discov. 19(1), 24–57 (2009)

  66.

    Shieh, J., Keogh, E.J.: isax: indexing and mining terabyte sized time series. In: ACM SIGKDD, pp. 623–631 (2008). https://doi.org/10.1145/1401890.1401966

  67.

    Sloan digital sky survey. https://www.sdss3.org/dr10/data_access/volume.php (2017)

  68.

    Soldi, S., Beckmann, V., Baumgartner, W., Ponti, G., Shrader, C., Lubinski, P., Krimm, H., Mattana, F., Tueller, J.: Long-term variability of AGN at hard X-rays. Astron. Astrophys. 563, A57 (2014)

  69.

    Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10), 793–804 (2013)

  70.

    Xi, X., Keogh, E.J., Shelton, C.R., Wei, L., Ratanamahatana, C.A.: Fast time series classification using numerosity reduction. In: ICML, pp. 1033–1040 (2006). https://doi.org/10.1145/1143844.1143974

  71.

    Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Dpisax: massively distributed partitioned isax. In: ICDM, pp. 1135–1140 (2017)

  72.

    Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Massively distributed time series indexing and querying. TKDE (accepted for publication, 2018)

  73.

    Ye, L., Keogh, E.J.: Time series shapelets: a new primitive for data mining. In: ACM SIGKDD, pp. 947–956 (2009). https://doi.org/10.1145/1557019.1557122

  74.

    Zoumpatianos, K., Idreos, S., Palpanas, T.: Indexing for interactive exploration of big data series. In: SIGMOD, pp. 1555–1566 (2014)

  75.

    Zoumpatianos, K., Idreos, S., Palpanas, T.: RINSE: interactive data series exploration with ADS+. PVLDB 8(12), 1912–1915 (2015)

  76.

    Zoumpatianos, K., Idreos, S., Palpanas, T.: ADS: the adaptive data series index. VLDB J. 25(6), 843–866 (2016). https://doi.org/10.1007/s00778-016-0442-5

  77.

    Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: ACM SIGKDD, pp. 1603–1612 (2015)

  78.

    Zoumpatianos, K., Palpanas, T.: Data series management: fulfilling the need for big sequence analytics. In: ICDE (2018)

Author information

Correspondence to Haridimos Kondylakis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kondylakis, H., Dayan, N., Zoumpatianos, K. et al. Coconut: sortable summarizations for scalable indexes over static and streaming data series. The VLDB Journal 28, 847–869 (2019). https://doi.org/10.1007/s00778-019-00573-w

Keywords

  • Data series
  • Indexing structures
  • Streaming data series