Fast data series indexing for in-memory data

Abstract

Data series similarity search is a core operation for several data series analysis applications across many different domains. However, state-of-the-art techniques fail to deliver the time performance required for interactive exploration or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index takes advantage of the parallelization opportunities of modern hardware (i.e., SIMD instructions, multi-socket and multi-core architectures) in order to accelerate both index construction and similarity search. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, so that it maximizes performance for in-memory operations. MESSI supports similarity search using both the Euclidean and dynamic time warping (DTW) distances. Our experiments with synthetic and real datasets demonstrate that overall MESSI is up to 4x faster at index construction and up to 11x faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100GB datasets in approximately 50 ms (30–75 ms across diverse datasets), which enables real-time, interactive data exploration on very large data series collections.

Notes

  1. A data series, or data sequence, is an ordered sequence of data points. If the ordering dimension is time, then we talk about time series, though series can be ordered over other measures (e.g., angle in astronomical radial profiles, frequency in infrared spectroscopy, mass in mass spectroscopy, position in genome sequences, etc.).

  2. http://www.airbus.com/.

  3. A preliminary version of this work has appeared elsewhere [63].

  4. A preliminary version of this paper has appeared elsewhere [63].

  5. We also tried an alternative design, where buffers were not split, so many threads could try to update each element of a buffer concurrently. Therefore, each buffer had to be protected by a lock. This design resulted in worse performance due to the contention in accessing the iSAX buffers.

  6. Parallelizing the processing inside each one of the index root subtrees would require a lot of synchronization due to node splitting. When a node is split, two new leaf nodes are created and the data of the original leaf are moved to the new leaves.

  7. We note that other lower bounds for DTW can be used as well, such as LB_Improved [45]. Even though LB_Improved can produce tighter bounds, in our experiments it also resulted in higher query answering times due to the additional computations it involves.

  8. In such a case, indexing and similarity search would not be useful anyway.

  9. MESSI can be adapted to support subsequence matching as follows: given a long series (in which we need to identify the most similar subsequence to the query), we extract subsequences from the long series by sliding a window (of the same length as the query) over the entire length of the series, and then index all these subsequences.
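
The trade-off described in Note 5 can be illustrated with a minimal sketch. It is not the MESSI implementation: the function names, the `root_key` callback, and the final merge step are hypothetical and chosen only for illustration. In the split variant each worker appends only to its own part of every buffer, so no synchronization is needed while filling; in the shared variant every append must take the lock of the target buffer.

```python
import threading
from collections import defaultdict


def fill_buffers_split(summaries, num_workers, root_key):
    """Split design: worker w writes only to parts[w], so no lock is needed."""
    # parts[w][key] is the part of buffer `key` owned by worker w (hypothetical layout).
    parts = [defaultdict(list) for _ in range(num_workers)]

    def work(w):
        # Worker w processes a strided slice of the input positions.
        for i in range(w, len(summaries), num_workers):
            parts[w][root_key(summaries[i])].append(i)

    threads = [threading.Thread(target=work, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge the per-worker parts of each buffer (done here only to compare results).
    merged = defaultdict(list)
    for part in parts:
        for key, ids in part.items():
            merged[key].extend(ids)
    return dict(merged)


def fill_buffers_locked(summaries, num_workers, root_key):
    """Alternative design: shared buffers, each protected by its own lock."""
    keys = {root_key(s) for s in summaries}
    buffers = {k: [] for k in keys}
    locks = {k: threading.Lock() for k in keys}

    def work(w):
        for i in range(w, len(summaries), num_workers):
            key = root_key(summaries[i])
            with locks[key]:  # contention point when many workers hit the same buffer
                buffers[key].append(i)

    threads = [threading.Thread(target=work, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buffers
```

Both functions produce the same buffer contents up to ordering; the difference lies only in how much the workers have to coordinate while filling them.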
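
To make the role of a DTW lower bound (Note 7) concrete, the sketch below shows how a cheap bound can prune expensive DTW computations during a nearest-neighbor scan. It is an illustration under stated assumptions, not the method used in the paper: the bound shown is the well-known LB_Keogh envelope bound, DTW is restricted to a Sakoe-Chiba band of half-width `r`, all series have equal length, and the function names are hypothetical.

```python
import math


def dtw(a, b, r):
    """DTW distance (squared point costs, square root at the end) within a band of half-width r."""
    n = len(a)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return math.sqrt(prev[n])


def lb_keogh(query, candidate, r):
    """Lower bound on dtw(query, candidate, r) from the query's upper/lower envelope."""
    n = len(query)
    total = 0.0
    for i in range(n):
        window = query[max(0, i - r): min(n, i + r + 1)]
        upper, lower = max(window), min(window)
        if candidate[i] > upper:
            total += (candidate[i] - upper) ** 2
        elif candidate[i] < lower:
            total += (candidate[i] - lower) ** 2
    return math.sqrt(total)


def nn_search_dtw(query, candidates, r):
    """Exact 1-NN scan: skip a candidate when its lower bound cannot beat the best so far."""
    best, best_idx, computed = math.inf, -1, 0
    for idx, cand in enumerate(candidates):
        if lb_keogh(query, cand, r) >= best:
            continue  # pruned: the true DTW distance is at least the lower bound
        d = dtw(query, cand, r)
        computed += 1
        if d < best:
            best, best_idx = d, idx
    return best_idx, best, computed
```

A tighter bound prunes more candidates but costs more to evaluate per candidate, which is the trade-off the note reports for LB_Improved.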
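
The sliding-window adaptation in Note 9 can be sketched as follows. This is a minimal illustration, not MESSI code: the function names are hypothetical, and a linear scan with the Euclidean distance stands in for the step where the extracted subsequences would be inserted into and searched through the index.

```python
import math


def sliding_subsequences(series, m):
    """Return (start, subsequence) pairs for every window of length m over the series."""
    return [(s, series[s:s + m]) for s in range(len(series) - m + 1)]


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_subsequence(series, query):
    """Find the start offset and distance of the window closest to the query."""
    subsequences = sliding_subsequences(series, len(query))
    # Stand-in for indexing: every extracted subsequence is compared directly.
    dist, start = min((euclidean(sub, query), start) for start, sub in subsequences)
    return start, dist
```

A series of length n yields n - m + 1 subsequences of length m, so the number of indexed objects grows almost linearly with the length of the long series.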

References

  1. Adhd-200. http://fcon_1000.projects.nitrc.org/indi/adhd200/ (2017)

  2. Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: FODO (1993)

  3. Ailamaki, A.: Databases and hardware: The beginning and sequel of a beautiful friendship. VLDB (2015)

  4. Alvarez, V., Schuhknecht, F.M., Dittrich, J., Richter, S.: Main memory adaptive indexing for multi-core systems. In: DaMoN (2014)

  5. Bagnall, A.J., Cole, R.L., Palpanas, T., Zoumpatianos, K.: Data series management (Dagstuhl seminar 19282). Dagstuhl Reports 9(7) (2019)

  6. Bagnall, A.J., Lines, J., Bostrom, A., Large, J., Keogh, E.J.: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. 31(3), 606–660 (2017). https://doi.org/10.1007/s10618-016-0483-9

  7. Binna, R., Zangerle, E., Pichl, M., Specht, G., Leis, V.: Hot: A height optimized trie index for main-memory database systems. In: SIGMOD. ACM (2018)

  8. Blanas, S.: Query processing for datacenter-scale computers. In: CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings (2017)

  9. Boniol, P., Linardi, M., Roncallo, F., Palpanas, T., Meftah, M., Remy, E.: Unsupervised and scalable subsequence anomaly detection in large data series. In: VLDBJ (2021)

  10. Boniol, P., Linardi, M., Roncallo, F., Palpanas, T.: Automated anomaly detection in large sequences. In: ICDE (2020)

  11. Boniol, P., Palpanas, T.: Series2Graph: graph-based subsequence anomaly detection for time series. In: PVLDB (2020)

  12. Boniol, P., Paparrizos, J., Palpanas, T., Franklin, M.J.: SAND in action: subsequence anomaly detection for streams. In: PVLDB (2021)

  13. Boniol, P., Paparrizos, J., Palpanas, T., Franklin, M.J.: SAND: streaming subsequence anomaly detection. In: PVLDB (2021)

  14. Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.: Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS 39(1) (2014)

  15. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. CSUR (2009)

  16. Chatzigeorgakidis, G., Skoutas, D., Patroumpas, K., Palpanas, T., Athanasiou, S., Skiadopoulos, S.: Local pair and bundle discovery over co-evolving time series. In: Proceedings of the 16th International Symposium on Spatial and Temporal Databases, SSTD (2019)

  17. Chatzigeorgakidis, G., Skoutas, D., Patroumpas, K., Palpanas, T., Athanasiou, S., Skiadopoulos, S.: Local similarity search on geolocated time series using hybrid indexing. In: Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL (2019)

  18. Chatzigeorgakidis, G., Skoutas, D., Patroumpas, K., Palpanas, T., Athanasiou, S., Skiadopoulos, S.: Twin subsequence search in time series. In: Proceedings of the 24th International Conference on Extending Database Technology, EDBT (2021)

  19. Chou, J., Wu, K., et al.: Fastquery: A parallel indexing system for scientific data. In: CLUSTER, pp. 455–464. IEEE (2011)

  20. Corporation, I.: Intel 64 and IA-32 architectures optimization reference manual (2016)

  21. Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: Return of the Lernaean hydra: experimental evaluation of data series approximate similarity search. PVLDB (2019)

  22. Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: The Lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. PVLDB (2018)

  23. Echihabi, K., Zoumpatianos, K., Palpanas, T.: Big sequence management: on scalability. In: Proceedings of the IEEE International Conference on Big Data, IEEE BigData (2020)

  24. Echihabi, K., Zoumpatianos, K., Palpanas, T.: Big sequence management: Scaling up and out. In: Proceedings of the 24th International Conference on Extending Database Technology, EDBT (2021)

  25. Fekete, J.D., Primet, R.: Progressive analytics: a computation paradigm for exploratory data analysis. CoRR (2016)

  26. Feng, K., Wang, P., Wu, J., Wang, W.: L-match: a lightweight and effective subsequence matching approach. IEEE Access 8, 71572–71583 (2020)

  27. Gepner, P., Kowalik, M.F.: Multi-core processors: new way to achieve high system performance. In: PAR ELEC (2006)

  28. Gogolou, A., Tsandilas, T., Echihabi, K., Bezerianos, A., Palpanas, T.: Data series progressive similarity search with probabilistic quality guarantees. In: Maier, D., Pottinger, R., Doan, A., Tan, W., Alawini, A., Ngo, H.Q. (eds.) Proceedings of the 2020 International Conference on Management of Data, SIGMOD (2020)

  29. Gogolou, A., Tsandilas, T., Palpanas, T., Bezerianos, A.: Progressive similarity search on time series data. In: EDBT (2019)

  30. Gowanlock, M.G., Casanova, H.: Distance threshold similarity searches: efficient trajectory indexing on the GPU. IEEE Trans. Parallel Distrib. Syst. 27(9) (2016)

  31. Grabocka, J., Schilling, N., Schmidt-Thieme, L.: Latent time-series motifs. TKDD 11(1), 6:1–6:20 (2016)

  32. Guillaume, A.: Head of Operational Intelligence Department Airbus. Personal communication (2017)

  33. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc, Revised Reprint (2012)

  34. http://helios.mi.parisdescartes.fr/~themisp/messi/ (2020)

  35. Incorporated Research Institutions for Seismology—Seismic Data Access. http://ds.iris.edu/data/access/ (2016)

  36. Kashyap, S., Karras, P.: Scalable knn search on vertically stored time series. In: SIGKDD, pp. 1334–1342 (2011)

  37. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. KAIS (2001)

  38. Keogh, E.J., Pazzani, M.J.: An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: KDD (1998)

  39. Keogh, E., Ratanamahatana, C.A.: Exact indexing of dynamic time warping. Knowledge and Information Systems (2005)

  40. Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut palm: Static and streaming data series exploration now in your palm. In: SIGMOD (2019)

  41. Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: A scalable bottom-up approach for building data series indexes. PVLDB (2018)

  42. Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: sortable summarizations for scalable indexes over static and streaming data series. VLDBJ 28(6) (2019)

  43. Laviron, P., Dai, X., Huquet, B., Palpanas, T.: Electricity demand activation extraction: From known to unknown signatures, using similarity search. In: Proceedings of the ACM International Conference on Future Energy Systems, e-Energy (2021)

  44. Leis, V., Kemper, A., Neumann, T.: The adaptive radix tree: Artful indexing for main-memory databases. In: ICDE (2013)

  45. Lemire, D.: Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recognit. 42(9), 2169–2180 (2009)

  46. Levchenko, O., Kolev, B., Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T., Shasha, D.E., Valduriez, P.: Bestneighbor: efficient evaluation of knn queries on large time series databases. Knowl. Inf. Syst. 63(2), 349–378 (2021)

  47. Li, C., Yu, P.S., Castelli, V.: Hierarchyscan: a hierarchical similarity search algorithm for databases of long sequences. In: ICDE (1996)

  48. Liao, T.W.: Clustering of time series data—a survey. Pattern Recognit. 38(11), 1857–1874 (2005)

  49. Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: The ULISSE approach. PVLDB (2019)

  50. Linardi, M., Palpanas, T.: ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series. In: ICDE (2018)

  51. Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix Profile Goes MAD: Variable-Length Motif And Discord Discovery in Data Series. In: DAMI (2020)

  52. Linardi, M., Palpanas, T.: Scalable data series subsequence matching with ULISSE. VLDB J. 29(6), 1449–1474 (2020)

  53. Lomet, D.B., Nawab, F.: High performance temporal indexing on modern hardware. In: ICDE (2015)

  54. Lomont, C.: Introduction to Intel advanced vector extensions. Intel White Paper (2011)

  55. Mueen, A., Keogh, E.J., Zhu, Q., Cash, S., Westover, M.B., Shamlo, N.B.: A disk-aware algorithm for time series motif discovery. DAMI (2011)

  56. Mueen, A., Nath, S., Liu, J.: Fast approximate correlation for massive time-series data. In: SIGMOD (2010)

  57. Palpanas, T., Beckmann, V.: Report on the first and second interdisciplinary time series analysis workshop (ITISA). SIGREC 48(3) (2019)

  58. Palpanas, T.: Data series management: The road to big sequence analytics. SIGMOD Record (2015)

  59. Palpanas, T.: Evolution of a Data Series Index. CCIS 1197 (2020)

  60. Palpanas, T.: The parallel and distributed future of data series mining. In: HPCS (2017)

  61. Pelkonen, T., Franklin, S., Cavallaro, P., Huang, Q., Meza, J., Teller, J., Veeraraghavan, K.: Gorilla: A fast, scalable, in-memory time series database. VLDB (2015)

  62. Peng, B., Fatourou, P., Palpanas, T.: SING: Sequence Indexing Using GPUs. In: ICDE (2021)

  63. Peng, B., Palpanas, T., Fatourou, P.: Messi: In-memory data series indexing. In: ICDE (2020)

  64. Peng, B., Palpanas, T., Fatourou, P.: Paris: The next destination for fast data series indexing and query answering. IEEE BigData (2018)

  65. Peng, B., Palpanas, T., Fatourou, P.: Paris+: Data series indexing on multi-core architectures. TKDE (2020)

  66. Piatov, D., Helmer, S., Dignös, A., Gamper, J.: Interactive and space-efficient multi-dimensional time series subsequence matching. Inf. Syst. 82, 121–135 (2019)

  67. Polychroniou, O., Raghavan, A., Ross, K.A.: Rethinking SIMD vectorization for in-memory databases. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31–June 4, 2015, pp. 1493–1508 (2015)

  68. Polychroniou, O., Raghavan, A., Ross, K.A.: Rethinking SIMD vectorization for in-memory databases. In: SIGMOD. ACM (2015)

  69. Polychroniou, O., Ross, K.A.: Vectorized bloom filters for advanced SIMD processors. In: Tenth International Workshop on Data Management on New Hardware, DaMoN 2014, Snowbird, UT, USA, June 23, 2014, pp. 6:1–6:6 (2014)

  70. Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: SIGKDD (2012)

  71. Rakthanmanon, T., Keogh, E.J., Lonardi, S., Evans, S.: Time series epenthesis: clustering time series streams requires ignoring some data. In: ICDM, pp. 547–556 (2011)

  72. Rodrigues, P.P., Gama, J., Pedroso, J.: Hierarchical clustering of time-series data streams. TKDE (2008)

  73. Shieh, J., Keogh, E.: iSAX: indexing and mining terabyte sized time series. In: SIGKDD (2008)

  74. Shieh, J., Keogh, E.: iSAX: disk-aware mining and indexing of massive time series datasets. DMKD (2009)

  75. Sloan digital sky survey. https://www.sdss3.org/dr10/data_access/volume.php (2017)

  76. Southwest university adult lifespan dataset (sald). http://fcon_1000.projects.nitrc.org/indi/retro/sald.html (2018)

  77. Tan, C.W., Webb, G.I., Petitjean, F.: Indexing and classifying gigabytes of time series under time warping. In: ICDM (2017)

  78. Tang, B., Yiu, M.L., Li, Y., et al.: Exploit every cycle: Vectorized time series algorithms on modern commodity CPUs. In: IMDM (2016)

  79. Tatikonda, S., Parthasarathy, S.: An adaptive memory conscious approach for mining frequent trees: implications for multi-core architectures. In: SIGPLAN. ACM (2008)

  80. Wang, Q., Palpanas, T.: Deep Learning Embeddings for Data Series Similarity Search. In: SIGKDD (2021)

  81. Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. VLDB (2013)

  82. Wu, J., Wang, P., Pan, N., Wang, C., Wang, W., Wang, J.: Kv-match: A subsequence matching approach supporting normalization and time warping. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 866–877. IEEE (2019)

  83. Xiao, L., Zheng, Y., Tang, W., Yao, G., Ruan, L.: Parallelizing dynamic time warping algorithm using prefix computations on GPU. In: HPCC_EUC. IEEE (2013)

  84. Xie, Z., Cai, Q., Chen, G., Mao, R., Zhang, M.: A comprehensive performance evaluation of modern in-memory indices. In: ICDE (2018)

  85. Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Massively distributed time series indexing and querying. IEEE Trans. Knowl. Data Eng. 32(1), 108–120 (2020)

  86. Yi, B.K., Faloutsos, C.: Fast time sequence indexing for arbitrary Lp norms. In: VLDB. Citeseer (2000)

  87. Zeuch, S., Freytag, J., Huber, F.: Adapting tree structures for processing with SIMD instructions. In: EDBT (2014)

  88. Zhou, J., Ross, K.A.: Implementing database operations using SIMD instructions. In: SIGMOD (2002)

  89. Zoumpatianos, K., Palpanas, T.: Data series management: fulfilling the need for big sequence analytics. In: ICDE (2018)

  90. Zoumpatianos, K., Idreos, S., Palpanas, T.: Ads: the adaptive data series index. VLDB J. 25, 843–866 (2016)

  91. Zoumpatianos, K., Lou, Y., Ileana, I., Palpanas, T., Gehrke, J.: Generating data series query workloads. VLDB J. 27(6), 823–846 (2018)

Acknowledgements

This work was supported by Investir l’Avenir, Univ. of Paris IDEX Emergence en Recherche ANR-18-IDEX-000, CSC, FMJH PGMO, EDF, Thales, and HIPEAC 4, and was partly performed when P. Fatourou visited LIPADE and B. Peng visited CARV, FORTH ICS.

Author information

Corresponding author

Correspondence to Botao Peng.

About this article

Cite this article

Peng, B., Fatourou, P. & Palpanas, T. Fast data series indexing for in-memory data. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00677-2

Keywords

  • Data series
  • Indexing
  • Modern hardware