ProS: data series progressive k-NN similarity search and classification with probabilistic quality guarantees

  • Regular Paper
The VLDB Journal

Abstract

Existing systems dealing with the increasing volume of data series cannot guarantee interactive response times, even for fundamental tasks such as similarity search. It is therefore necessary to develop analytic approaches that support exploration and decision making by providing progressive results before the final, exact ones have been computed. Prior works lack both efficiency and accuracy when applied to large-scale data series collections. We present and experimentally evaluate ProS, a new probabilistic learning-based method that provides quality guarantees for progressive nearest neighbor (NN) query answering. We develop our method for k-NN queries and demonstrate how it can be applied with the two most popular distance measures, namely Euclidean and dynamic time warping. We provide both initial and progressive estimates of the final answer that improve as the similarity search proceeds, as well as suitable stopping criteria for the progressive queries. Moreover, we describe how this method can be used to develop a progressive algorithm for data series classification (based on a k-NN classifier), and we additionally propose a method designed specifically for the classification task. Experiments with several diverse synthetic and real datasets demonstrate that our prediction methods constitute the first practical solutions to the problem, significantly outperforming competing approaches.
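To make the notion of a progressive k-NN answer concrete, the following is a minimal Python sketch, not the ProS method itself. It scans a collection sequentially under Euclidean distance and reports the best-so-far k neighbors after each batch. The function names, the `batch_size` parameter, and the sequential scan are assumptions made for illustration. The learned probabilistic estimates and stopping criteria described in the abstract are not modeled here.

```python
import heapq
import math
from typing import Iterator, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def progressive_knn(
    query: Sequence[float],
    collection: Sequence[Sequence[float]],
    k: int = 1,
    batch_size: int = 100,
) -> Iterator[Tuple[int, List[Tuple[float, int]]]]:
    """Yield (series_visited, best_so_far) after every batch.

    best_so_far is a list of (distance, index) pairs sorted by distance.
    Hypothetical helper: a plain sequential scan, no index, no learned model.
    """
    # Max-heap emulated with negated keys: the root is the current worst neighbor.
    heap: List[Tuple[float, int]] = []
    for i, series in enumerate(collection):
        d = euclidean(query, series)
        if len(heap) < k:
            heapq.heappush(heap, (-d, -i))
        elif d < -heap[0][0]:
            # Strictly closer than the current k-th neighbor: replace it.
            heapq.heapreplace(heap, (-d, -i))
        visited = i + 1
        if visited % batch_size == 0 or visited == len(collection):
            # Intermediate (progressive) answer; the last one is the exact answer.
            yield visited, sorted((-nd, -ni) for nd, ni in heap)
```

A caller can consume the generator and stop as soon as an intermediate answer is judged good enough. In this sketch that decision is left entirely to the caller.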

Notes

  1. If the dimension that imposes the ordering of the sequence is time, then we talk about time series. However, a series can also be defined over other measures (angle in radial profiles, frequency in infrared spectroscopy, etc.). We use the terms time series, data series, and sequence interchangeably.

  2. The dimensionality of a data series is its length, that is, the number of points in the series [45]. In our context, high-dimensional refers to series with dimensionality on the order of hundreds to thousands.

  3. We define the problem using k-NN, but for simplicity use \(k=1\) in the rest of this paper. We defer the discussion of the general case to future work.

  4. We note that other lower bounds for DTW can be used as well, such as LB_Improved [77]. Even though LB_Improved can produce tighter bounds, previous experiments have resulted in higher query answering times due to the additional computations it involves [105].

  5. When using DTW, k-NN search becomes computationally very expensive, and the time required to run all experiments with the original, large dataset sizes was prohibitive.
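Notes 4 and 5 refer to lower bounds for DTW and to the cost of k-NN search under DTW. The sketch below illustrates how a lower bound can prune full DTW computations during 1-NN search. It uses LB_Keogh, a widely used DTW lower bound, with a Sakoe-Chiba band. Which bound the paper actually uses is not stated in this excerpt, so that choice is an assumption. The function names and the `window` parameter are also hypothetical, and all series are assumed to have equal length.

```python
import math
from typing import Sequence, Tuple


def dtw(a: Sequence[float], b: Sequence[float], window: int) -> float:
    """DTW distance constrained to a Sakoe-Chiba band of half-width `window`."""
    assert len(a) == len(b), "this sketch assumes equal-length series"
    n = len(a)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0  # empty prefixes align at zero cost
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - window), min(n, i + window) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return math.sqrt(prev[n])


def lb_keogh(query: Sequence[float], candidate: Sequence[float], window: int) -> float:
    """LB_Keogh: distance from the candidate to the query's band envelope."""
    n = len(query)
    total = 0.0
    for i, c in enumerate(candidate):
        segment = query[max(0, i - window): min(n, i + window + 1)]
        upper, lower = max(segment), min(segment)
        if c > upper:
            total += (c - upper) ** 2
        elif c < lower:
            total += (lower - c) ** 2
    return math.sqrt(total)


def nn_dtw(
    query: Sequence[float],
    collection: Sequence[Sequence[float]],
    window: int,
) -> Tuple[int, float]:
    """Exact 1-NN under banded DTW, skipping candidates the bound rules out."""
    best_dist, best_idx = math.inf, -1
    for idx, series in enumerate(collection):
        # The bound never exceeds the true DTW distance, so this skip is safe.
        if lb_keogh(query, series, window) >= best_dist:
            continue
        d = dtw(query, series, window)
        if d < best_dist:
            best_dist, best_idx = d, idx
    return best_idx, best_dist
```

A tighter bound prunes more candidates but costs more to evaluate, which is the trade-off Note 4 describes for LB_Improved.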

References

  1. Angelini, M., Santucci, G., Schumann, H., Schulz, H.J.: A review and characterization of progressive visual analytics. Informatics 5, 31 (2018)

  2. Ankerst, M., Kastenmüller, G., Kriegel, H.P., Seidl, T.: Nearest neighbor classification in 3d protein databases. ISMB (1999)

  3. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM 45(6), 891–923 (1998). https://doi.org/10.1145/293347.293348

  4. Aßfalg, J., Kriegel, H., Kröger, P., Renz, M.: Probabilistic similarity search for uncertain time series. In: Scientific and Statistical Database Management, 21st International Conference, SSDBM 2009, New Orleans, LA, USA, June 2-4, 2009, Proceedings, pp. 435–443 (2009). https://doi.org/10.1007/978-3-642-02279-1_31

  5. Babenko, A., Lempitsky, V.S.: The inverted multi-index. IEEE Trans. Pattern Anal. Mach. Intell. 37(6), 1247–1260 (2015)

  6. Badam, S.K., Elmqvist, N., Fekete, J.D.: Steering the craft: Ui elements and visualizations for supporting progressive visual analytics. Comput. Graph. Forum 36(3), 491–502 (2017). https://doi.org/10.1111/cgf.13205

  7. Bagnall, A.J., Cole, R.L., Palpanas, T., Zoumpatianos, K.: Data series management (dagstuhl seminar 19282). Dagstuhl Reports 9(7), 24–39 (2019)

  8. Bagnall, A.J., Lines, J., Bostrom, A., Large, J., Keogh, E.J.: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. 31(3), 606–660 (2017)

  9. Bansal, P., Deshpande, P., Sarawagi, S.: Missing value imputation on multidimensional time series. Proc. VLDB Endow. 14(11), 2533–2545 (2021). https://doi.org/10.14778/3476249.3476300

  10. Batista, G.E., Keogh, E.J., Tataw, O.M., Souza, V.M.: Cid: An efficient complexity-invariant distance for time series. Data Min. Knowl. Discov. 28(3), 634–669 (2014)

  11. Blázquez-García, A., Conde, A., Mori, U., Lozano, J.A.: A review on outlier/anomaly detection in time series data. ACM Comput. Surv. 54(3), 1–33 (2021). https://doi.org/10.1145/3444690

  12. Boniol, P., Linardi, M., Roncallo, F., Palpanas, T.: Automated Anomaly Detection in Large Sequences. In: ICDE (2020)

  13. Boniol, P., Linardi, M., Roncallo, F., Palpanas, T., Meftah, M., Remy, E.: Unsupervised and scalable subsequence anomaly detection in large data series. VLDBJ (2021)

  14. Boniol, P., Meftah, M., Remy, E., Palpanas, T.: dcam: Dimension-wise class activation map for explaining multivariate data series classification. In: SIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12–17, 2022, pp. 1175–1189 (2022)

  15. Boniol, P., Palpanas, T.: Series2Graph: Graph-based subsequence anomaly detection for time series. PVLDB (2020)

  16. Boniol, P., Paparrizos, J., Kang, Y., Palpanas, T., Tsay, R., Elmore, A.J., Franklin, M.J.: Theseus: Navigating the Labyrinth of Subsequence Anomaly Detection. Proc. VLDB Endow. (2022)

  17. Boniol, P., Paparrizos, J., Palpanas, T., Franklin, M.J.: SAND: Streaming Subsequence Anomaly Detection. PVLDB (2021)

  18. Brin, S.: Near neighbor search in large metric spaces. In: Proceedings of the 21th International Conference on Very Large Data Bases, VLDB ’95, pp. 574–584. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1995). http://dl.acm.org/citation.cfm?id=645921.673006

  19. Buono, P., Simeone, A.L.: Interactive shape specification for pattern search in time series. In: AVI (2008)

  20. Camerra, A., Palpanas, T., Shieh, J., Keogh, E.J.: isax 2.0: Indexing and mining one billion time series. In: ICDM, pp. 58–67. IEEE Computer Society (2010)

  21. Camerra, A., Shieh, J., Palpanas, T., Rakthanmanon, T., Keogh, E.J.: Beyond one billion time series: Indexing and mining very large time series collections with isax2+. Knowl. Inf. Syst. 39(1), 123–151 (2014)

  22. Castelli, V., Li, C., Turek, J., Kontoyiannis, I.: Progressive classification in the compressed domain for large EOS satellite databases. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, ICASSP ’96, Atlanta, Georgia, USA, May 7-10, 1996, pp. 2199–2202 (1996)

  23. Chakrabarti, K., Keogh, E., Mehrotra, S., Pazzani, M.: Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Database Syst. 27(2), 188–228 (2002). https://doi.org/10.1145/568518.568520

  24. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 41(3), 15 (2009)

  25. Chatzigeorgakidis, G., Skoutas, D., Patroumpas, K., Palpanas, T., Athanasiou, S., Skiadopoulos, S.: Local similarity search on geolocated time series using hybrid indexing. In: Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL 2019, Chicago, IL, USA, November 5–8, 2019, pp. 179–188 (2019)

  26. Chatzigeorgakidis, G., Skoutas, D., Patroumpas, K., Palpanas, T., Athanasiou, S., Skiadopoulos, S.: Twin subsequence search in time series. In: Proceedings of the 24th International Conference on Extending Database Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021, pp. 475–480 (2021)

  27. Chaudhuri, S., Ding, B., Kandula, S.: Approximate query processing: No silver bullet. In: SIGMOD (2017)

  28. Chen, Y., Garcia, E.K., Gupta, M.R., Rahimi, A., Cazzanti, L.: Similarity-based classification: Concepts and algorithms. J. Mach. Learn. Res. 10, 747–776 (2009)

  29. Ciaccia, P., Nanni, A., Patella, M.: A query-sensitive cost model for similarity queries with m-tree. In: Proc. of the 10th ADC, pp. 65–76. Springer Verlag (1999)

  30. Ciaccia, P., Patella, M.: Pac nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces. In: ICDE, pp. 244–255 (2000)

  31. Ciaccia, P., Patella, M., Zezula, P.: A cost model for similarity queries in metric spaces. In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’98, pp. 59–68. ACM, New York, NY, USA (1998). https://doi.org/10.1145/275487.275495

  32. Correll, M., Gleicher, M.: The semantics of sketch: Flexibility in visual query systems for time series data. In: VAST (2016)

  33. Dallachiesa, M., Nushi, B., Mirylenka, K., Palpanas, T.: Uncertain time-series similarity: Return to the basics. PVLDB 5(11), 1662–1673 (2012)

  34. Dallachiesa, M., Palpanas, T., Ilyas, I.F.: Top-k nearest neighbor search in uncertain data series. Proc. VLDB Endow. 8(1), 13–24 (2014). https://doi.org/10.14778/2735461.2735463

  35. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR09 (2009)

  36. Ding, B., Huang, S., Chaudhuri, S., Chakrabarti, K., Wang, C.: Sample + seek: Approximating aggregates with distribution precision guarantee. In: SIGMOD (2016)

  37. Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Querying and mining of time series data: experimental comparison of representations and distance measures. Proc. VLDB Endow. 1(2), 1542–1552 (2008)

  38. Douze, M., Tolias, G., Pizzi, E., Papakipos, Z., Chanussot, L., Radenovic, F., Jenícek, T., Maximov, M., Leal-Taixé, L., Elezi, I., Chum, O., Canton-Ferrer, C.: The 2021 image similarity dataset and challenge. CoRR abs/2106.09672 (2021)

  39. Duong, T., Hazelton, M.L.: Cross-validation bandwidth matrices for multivariate kernel density estimation. Scand. J. Stat. 32(3), 485–506 (2005). https://doi.org/10.1111/j.1467-9469.2005.00445.x

  40. Duong, T., Wand, M., Chacon, J., Gramacki, A.: ks: Kernel smoothing. https://cran.r-project.org/web/packages/ks/ (2019)

  41. Echihabi, K.: Truly Scalable Data Series Similarity Search. In: VLDB PhD Workshop (2019)

  42. Echihabi, K., Fatourou, P., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: Hercules Against Data Series Similarity Search. PVLDB 15(10), 2005–2018 (2022)

  43. Echihabi, K., Palpanas, T., Zoumpatianos, K.: New trends in high-d vector similarity search: AI-driven, progressive, and distributed. Proc. VLDB Endow. 14(12), 3198–3201 (2021)

  44. Echihabi, K., Zoumpatianos, K., Palpanas, T.: Big sequence management: Scaling up and out. In: Y. Velegrakis, D. Zeinalipour-Yazti, P.K. Chrysanthis, F. Guerra (eds.) Proceedings of the 24th International Conference on Extending Database Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021, pp. 714–717. OpenProceedings.org (2021). https://doi.org/10.5441/002/edbt.2021.91

  45. Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: The lernaean hydra of data series similarity search: An experimental evaluation of the state of the art. PVLDB 12(2), 112–127 (2018)

  46. Echihabi, K., Zoumpatianos, K., Palpanas, T., Benbrahim, H.: Return of the Lernaean Hydra: experimental evaluation of data series approximate similarity search. PVLDB 13(3), 402–419 (2019)

  47. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in time-series databases. In: SIGMOD, pp. 419–429. ACM, New York, NY, USA (1994). https://doi.org/10.1145/191839.191925

  48. Fekete, J.D., Primet, R.: Progressive analytics: A computation paradigm for exploratory data analysis. CoRR abs/1607.05162 (2016). arXiv:1607.05162

  49. Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., Abbadi, A.E.: High dimensional nearest neighbor searching. Inf. Syst. 31(6), 512–540 (2006)

  50. Fisher, D., Drucker, S.M., König, A.C.: Exploratory visualization involving incremental, approximate database queries and uncertainty. IEEE CG&A 32 (2012)

  51. Gao, Y., Lin, J.: HIME: discovering variable-length motifs in large-scale time series. Knowl. Inf. Syst. 61(1), 513–542 (2019)

  52. Gao, Y., Lin, J., Brif, C.: Ensemble grammar induction for detecting anomalies in time series. In: Proceedings of the 23rd International Conference on Extending Database Technology, EDBT, pp. 85–96 (2020)

  53. Gogolou, A., Tsandilas, T., Echihabi, K., Bezerianos, A., Palpanas, T.: Data series progressive similarity search with probabilistic quality guarantees. In: Proceedings of the 2020 International Conference on Management of Data, SIGMOD (2020)

  54. Gogolou, A., Tsandilas, T., Palpanas, T., Bezerianos, A.: Comparing similarity perception in time series visualizations. IEEE TVCG 25, 523–533 (2018)

  55. Gogolou, A., Tsandilas, T., Palpanas, T., Bezerianos, A.: Progressive similarity search on time series data. In: Proceedings of the Workshops of the EDBT/ICDT 2019 Joint Conference, EDBT/ICDT 2019, Lisbon, Portugal, March 26, 2019 (2019). http://ceur-ws.org/Vol-2322/BigVis_5.pdf

  56. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000 (June 13)). Circulation Electronic Pages: http://circ.ahajournals.org/content/101/23/e215.full PMID:1085218; https://doi.org/10.1161/01.CIR.101.23.e215

  57. Goldin, D.Q., Kanellakis, P.C.: On similarity queries for time-series data: Constraint specification and implementation. In: CP (1995)

  58. Guo, Y., Binnig, C., Kraska, T.: What you see is not what you get!: Detecting simpson’s paradoxes during data exploration. In: Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, HILDA@SIGMOD (2017)

  59. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD (1997)

  60. Hellerstein, J.M., Koutsoupias, E., Papadimitriou, C.H.: On the analysis of indexing schemes. In: Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’97, pp. 249–256. Association for Computing Machinery, New York, NY, USA (1997). https://doi.org/10.1145/263661.263688

  61. Huang, T., Zhen, Z., Liu, J.: Semantic relatedness emerges in deep convolutional neural networks designed for object recognition. bioRxiv (2020). https://doi.org/10.1101/2020.07.04.188169. https://www.biorxiv.org/content/early/2020/07/06/2020.07.04.188169.1

  62. I.R.I. for Seismology: Iris seismic data access (2014). http://ds.iris.edu/data/access/

  63. Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 117–128 (2011)

  64. Jermaine, C., Arumugam, S., Pol, A., Dobra, A.: Scalable approximate query processing with the DBO engine. ACM Trans. Database Syst. 33(4), 1–54 (2008)

  65. Jing, J., Dauwels, J., Rakthanmanon, T., Keogh, E., Cash, S., Westover, M.: Rapid annotation of interictal epileptiform discharges via template matching under dynamic time warping. Journal of Neuroscience Methods 274, 179–190 (2016)

  66. Koenker, R. et al.: quantreg: Quantile regression. https://cran.r-project.org/web/packages/quantreg (2019)

  67. Kanellakis, P.C., Ramaswamy, S., Vengroff, D.E., Vitter, J.S.: Indexing for data models with constraints and classes (extended abstract). In: Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’93, pp. 233–243. Association for Computing Machinery, New York, NY, USA (1993). https://doi.org/10.1145/153850.153884

  68. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst. 3(3), 263–286 (2001). https://doi.org/10.1007/PL00011669

  69. Keogh, E., Pazzani, M.: An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: Fourth International Conference on Knowledge Discovery and Data Mining (KDD’98), pp. 239–241. ACM Press, New York City, NY (1998)

  70. Keogh, E., Ratanamahatana, C.A.: Exact indexing of dynamic time warping. Knowledge and information systems (2005)

  71. Koenker, R.: Quantile Regression. Econometric Society Monographs. Cambridge University Press (2005). https://doi.org/10.1017/CBO9780511754098

  72. Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: A scalable bottom-up approach for building data series indexes. PVLDB 11(6), 677–690 (2018). https://doi.org/10.14778/3184470.3184472

  73. Kondylakis, H., Dayan, N., Zoumpatianos, K., Palpanas, T.: Coconut: sortable summarizations for scalable indexes over static and streaming data series. VLDB J. 28(6), 847–869 (2019)

  74. Kraska, T.: Northstar: An interactive data science system. PVLDB 11(12), 2150–2164 (2018)

  75. Kwon, O.W., Lee, J.H.: Web page classification based on k-nearest neighbor approach. In: Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (2000)

  76. Laviron, P., Dai, X., Huquet, B., Palpanas, T.: Electricity demand activation extraction: From known to unknown signatures, using similarity search. In: Proceedings of the ACM International Conference on Future Energy Systems, e-Energy (2021)

  77. Lemire, D.: Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recogn. 42(9), 2169–2180 (2009)

  78. Levchenko, O., Kolev, B., Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T., Shasha, D.E., Valduriez, P.: Bestneighbor: efficient evaluation of knn queries on large time series databases. Knowl. Inf. Syst. 63(2), 349–378 (2021). https://doi.org/10.1007/s10115-020-01518-4

  79. Li, C., Zhang, M., Andersen, D.G., He, Y.: Improving Approximate Nearest Neighbor Search through Learned Adaptive Early Termination. In: SIGMOD (2020)

  80. Li, X., Lin, J., Zhao, L.: Time series clustering in linear time complexity. Data Min. Knowl. Discov. 35(6), 2369–2388 (2021)

  81. Lin, J., Keogh, E.J., Lonardi, S., Chiu, B.Y.: A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, DMKD 2003, San Diego, California, USA, June 13, 2003, pp. 2–11 (2003). https://doi.org/10.1145/882082.882086

  82. Linardi, M., Palpanas, T.: Scalable, variable-length similarity search in data series: The ulisse approach. PVLDB (2019)

  83. Linardi, M., Palpanas, T.: Scalable data series subsequence matching with ulisse. VLDBJ (2020)

  84. Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix profile X: Valmod - scalable discovery of variable-length motifs in data series. In: SIGMOD (2018)

  85. Linardi, M., Zhu, Y., Palpanas, T., Keogh, E.J.: Matrix profile goes MAD: variable-length motif and discord discovery in data series. Data Min. Knowl. Discov. 34(4), 1022–1071 (2020)

  86. Lu, Y., Wu, R., Mueen, A., Zuluaga, M.A., Keogh, E.J.: Matrix profile XXIV: scaling time series anomaly detection to trillions of datapoints and ultra-fast arriving data streams. In: KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14–18, 2022, pp. 1173–1182 (2022)

  87. Lucas, B., Shifaz, A., Pelletier, C., O’Neill, L., Zaidi, N.A., Goethals, B., Petitjean, F., Webb, G.I.: Proximity forest: an effective and scalable distance-based classifier for time series. Data Min. Knowl. Discov. 33(3), 607–635 (2019)

  88. Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42(4), 824–836 (2020)

  89. Mannino, M., Abouzied, A.: Expressive time series querying with hand-drawn scale-free sketches. In: CHI (2018)

  90. Micallef, L., Schulz, H.J., Angelini, M., Aupetit, M., Chang, R., Kohlhammer, J., Perer, A., Santucci, G.: The human user in progressive visual analytics. In: Short Paper Proceedings of EuroVis’19, pp. 19–23. Eurographics Association (2019). https://doi.org/10.2312/evs.20191164

  91. Miller, G.A.: Wordnet: A lexical database for english. Commun. ACM 38(11), 39–41 (1995). https://doi.org/10.1145/219717.219748

  92. Mirylenka, K., Dallachiesa, M., Palpanas, T.: Data series similarity using correlation-aware measures. In: SSDBM (2017)

  93. Moritz, D., Fisher, D., Ding, B., Wang, C.: Trust, but verify: Optimistic visualizations of approximate queries for exploring big data. In: CHI (2017)

  94. Moritz, D., Howe, B., Heer, J.: Falcon: Balancing interactive latency and resolution sensitivity for scalable linked visualizations. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, pp. 694:1–694:11. ACM, New York, NY, USA (2019). https://doi.org/10.1145/3290605.3300924

  95. Nielsen, J.: Response times: The 3 important limits. https://www.nngroup.com/articles/response-times-3-important-limits/

  96. Palpanas, T.: Data series management: The road to big sequence analytics. SIGMOD Record 44(2), 47–52 (2015). https://doi.org/10.1145/2814710.2814719

  97. Palpanas, T.: Evolution of a Data Series Index - The iSAX Family of Data Series Indexes. Communications in Computer and Information Science (CCIS) (2020)

  98. Palpanas, T., Beckmann, V.: Report on the First and Second Interdisciplinary Time Series Analysis Workshop (ITISA). SIGMOD Rec. 48(3), 36–40 (2019)

  99. Paparrizos, J., Boniol, P., Palpanas, T., Tsay, R.S., Elmore, A., Franklin, M.J.: Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection. PVLDB (2022)

  100. Paparrizos, J., Gravano, L.: Fast and accurate time-series clustering. ACM Trans. Database Syst. 42(2), 1–49 (2017)

  101. Paparrizos, J., Kang, Y., Boniol, P., Tsay, R., Palpanas, T., Franklin, M.J.: TSB-UAD: an end-to-end benchmark suite for univariate time-series anomaly detection. Proc. VLDB Endow. 15(8), 1697–1711 (2022)

  102. Paparrizos, J., Liu, C., Elmore, A.J., Franklin, M.J.: Debunking four long-standing misconceptions of time-series distance measures. In: Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pp. 1887–1905. ACM (2020). https://doi.org/10.1145/3318464.3389760

  103. Pelletier, C., Webb, G.I., Petitjean, F.: Temporal convolutional neural network for the classification of satellite image time series. Remote Sensing 11(5) (2019). https://doi.org/10.3390/rs11050523. https://www.mdpi.com/2072-4292/11/5/523

  104. Peng, B., Fatourou, P., Palpanas, T.: MESSI: In-Memory Data Series Indexing. In: ICDE (2020)

  105. Peng, B., Fatourou, P., Palpanas, T.: Fast data series indexing for in-memory data. VLDBJ (2021)

  106. Peng, B., Fatourou, P., Palpanas, T.: SING: Sequence Indexing Using GPUs. In: ICDE (2021)

  107. Peng, B., Palpanas, T., Fatourou, P.: Paris: The next destination for fast data series indexing and query answering. IEEE BigData (2018)

  108. Peng, B., Palpanas, T., Fatourou, P.: Paris+: Data series indexing on multi-core architectures. TKDE (2020)

  109. Petitjean, F., Forestier, G., Webb, G.I., Nicholson, A.E., Chen, Y., Keogh, E.J.: Dynamic time warping averaging of time series allows faster and more accurate classification. In: ICDM (2014)

  110. Phillips, N.: A companion to the e-book “yarrr!: The pirate’s guide to r”. https://github.com/ndphillips/yarrr (2017)

  111. Rahman, S., Aliakbarpour, M., Kong, H.K., Blais, E., Karahalios, K., Parameswaran, A., Rubinfeld, R.: I’ve seen “enough”: Incrementally improving visualizations to support rapid decision making. Proc. VLDB Endow. 10(11), 1262–1273 (2017). https://doi.org/10.14778/3137628.3137637

  112. Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: KDD, pp. 262–270. ACM (2012)

  113. Rakthanmanon, T., Campana, B.J.L., Mueen, A., Batista, G.E.A.P.A., Westover, M.B., Zhu, Q., Zakaria, J., Keogh, E.J.: Searching and mining trillions of time series subsequences under dynamic time warping. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, pp. 262–270. ACM (2012)

  114. Rakthanmanon, T., Keogh, E.J., Lonardi, S., Evans, S.: Time series epenthesis: Clustering time series streams requires ignoring some data. In: Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 547–556. IEEE (2011)

  115. Rodrigues, P.P., Gama, J., Pedroso, J.P.: Odac: Hierarchical clustering of time series data streams. In: SDM, pp. 499–503. SIAM (2006)

  116. Supplementary material (2022). https://helios2.mi.parisdescartes.fr/~themisp/pros/

  117. Saito, N.: Local Feature Extraction and its Applications using a Library of Bases, pp. 269–451 (2000). https://doi.org/10.1142/9789812813305_0005. https://www.worldscientific.com/doi/abs/10.1142/9789812813305_0005

  118. Sakoe, H., Chiba, S.: Dynamic Programming Algorithm Optimization for Spoken Word Recognition, p. 159-165. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1990)

  119. Sarangi, S.R., Murthy, K.: Dust: A generalized notion of similarity between uncertain time series. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, July 25-28, 2010, pp. 383–392 (2010). https://doi.org/10.1145/1835804.1835854

  120. Schäfer, P., Leser, U.: TEASER: early and accurate time series classification. Data Min. Knowl. Discov. 34(5), 1336–1362 (2020)

  121. Schneider, J., Wenig, P., Papenbrock, T.: Distributed detection of sequential anomalies in univariate time series. VLDBJ 30, 579–602 (2021)

  122. Schulz, H.J., Angelini, M., Santucci, G., Schumann, H.: An enhanced visualization process model for incremental visualization. IEEE Trans. Vis. Comput. Graph. 22, 1830–1842 (2016). https://doi.org/10.1109/TVCG.2015.2462356

  123. Stolper, C.D., Perer, A., Gotz, D.: Progressive visual analytics: User-driven visual exploration of in-progress analytics. IEEE TVCG 20, 1653–1662 (2014)

  124. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: K. Chaudhuri, R. Salakhutdinov (eds.) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp. 6105–6114. PMLR (2019). http://proceedings.mlr.press/v97/tan19a.html

  125. Tufte, E.R.: The Visual Display of Quantitative Information (1986)

  126. Turkay, C., Kaya, E., Balcisoy, S., Hauser, H.: Designing progressive and interactive analytics processes for high-dimensional data analysis. IEEE Trans. Vis. Comput. Graph. 23(1), 131–140 (2017). https://doi.org/10.1109/TVCG.2016.2598470

  127. University, S.: Southwest university adult lifespan dataset (sald) (2017)

  128. Vision, S.C.: Deep billion-scale indexing. http://sites.skoltech.ru/compvision/noimi (2018)

  129. Wald, A.: Sequential tests of statistical hypotheses. Ann. Math. Stat. 16(2), 117–186 (1945). https://doi.org/10.1214/aoms/1177731118

  130. Wand, M.P., Jones, M.C.: Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Am. Stat. Assoc. 88(422), 520–528 (1993). https://doi.org/10.1080/01621459.1993.10476303

  131. Wand, M.P., Jones, M.C.: Multivariate plug-in bandwidth selection. Comput. Stat. 9(2), 97–116 (1994)

  132. Wang, Q., Palpanas, T.: Deep Learning Embeddings for Data Series Similarity Search. In: SIGKDD (2021)

  133. Wang, Q., Whitmarsh, S., Navarro, V., Palpanas, T.: iEDeaL: A Deep Learning Framework for Detecting Highly Imbalanced Interictal Epileptiform Discharges. PVLDB 16(2) (2023)

  134. Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S.: A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB 6(10), 793–804 (2013)

  135. Warren Liao, T.: Clustering of time series data - a survey. Pattern Recogn. 38(11), 1857–1874 (2005)

  136. Wellenzohn, K., Böhlen, M.H., Dignös, A., Gamper, J., Mitterer, H.: Continuous imputation of missing values in streams of pattern-determining time series. In: Proceedings of the 20th International Conference on Extending Database Technology, EDBT, pp. 330–341. OpenProceedings.org (2017)

  137. Wu, S., Ooi, B.C., Tan, K.: Online aggregation. In: Advanced Query Processing, Volume 1: Issues and Trends, pp. 187–210 (2013)

  138. Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Dpisax: Massively distributed partitioned isax (2017)

  139. Yagoubi, D.E., Akbarinia, R., Masseglia, F., Palpanas, T.: Massively distributed time series indexing and querying. TKDE 32(1), 108–120 (2020)

  140. Yankov, D., Keogh, E.J., Rebbapragada, U.: Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl. Inf. Syst. 17(2), 241–262 (2008)

  141. Yeh, C.C.M., Zhu, Y., Ulanova, L., Begum, N., Ding, Y., Dau, H.A., Zimmerman, Z., Silva, D.F., Mueen, A., Keogh, E.: Time series joins, motifs, discords and shapelets: A unifying view that exploits the matrix profile. Data Mining and Knowledge Discovery pp. 1–41 (2017)

  142. Yeh, M., Wu, K., Yu, P.S., Chen, M.: Proud: A probabilistic approach to processing similarity queries over uncertain data streams. In: EDBT 2009, 12th International Conference on Extending Database Technology, Saint Petersburg, Russia, March 24-26, 2009, Proceedings, pp. 684–695 (2009). https://doi.org/10.1145/1516360.1516439

  143. Zgraggen, E., Galakatos, A., Crotty, A., Fekete, J., Kraska, T.: How progressive visualizations affect exploratory analysis. IEEE Trans. Vis. Comput. Graph. 23(8), 1977–1987 (2017). https://doi.org/10.1109/TVCG.2016.2607714

  144. Zgraggen, E., Zhao, Z., Zeleznik, R.C., Kraska, T.: Investigating the effect of the multiple comparisons problem in visual analysis. In: CHI (2018)

  145. Zoumpatianos, K., Idreos, S., Palpanas, T.: Rinse: Interactive data series exploration with ads+. PVLDB 8(12), 1912–1915 (2015). https://doi.org/10.14778/2824032.2824099

  146. Zoumpatianos, K., Idreos, S., Palpanas, T.: Ads: The adaptive data series index. VLDB J. 25(6), 843–866 (2016). https://doi.org/10.1007/s00778-016-0442-5

  147. Zoumpatianos, K., Lou, Y., Palpanas, T., Gehrke, J.: Query workloads for data series indexes. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pp. 1603–1612 (2015). https://doi.org/10.1145/2783258.2783382

Acknowledgements

We would like to thank Siddharth Grover for his contributions to the implementation of some of the algorithms in this paper. Work partially supported by program Investir l’Avenir and Univ. of Paris IDEX Emergence en Recherche ANR-18-IDEX-0001, EU project NESTOR (MSCA #748945), and FMJH Program PGMO with EDF-THALES.

Author information

Corresponding author

Correspondence to Karima Echihabi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Echihabi, K., Tsandilas, T., Gogolou, A. et al. ProS: data series progressive k-NN similarity search and classification with probabilistic quality guarantees. The VLDB Journal 32, 763–789 (2023). https://doi.org/10.1007/s00778-022-00771-z

  • DOI: https://doi.org/10.1007/s00778-022-00771-z
