Abstract
Pattern matching is one of the essential tasks in streaming time-series data mining. Its purpose is to identify all sliding windows in a streaming time series whose Euclidean distances to predefined patterns are smaller than a predetermined threshold. The patterns can be high-dimensional, and the streaming time series is frequently updated. Thus, the brute-force method, which calculates the Euclidean distance between each sliding window and every pattern, is too slow for practical applications. This paper develops a lower-bound-based method that performs pattern matching in less time while guaranteeing the same results as the brute-force method, so the speedup comes without any sacrifice in matching accuracy. Block vectors are used to calculate a lower bound on the Euclidean distance. The proposed method can safely eliminate many expensive Euclidean distance calculations between patterns and sliding windows, which improves the efficiency of pattern matching. In addition, we present an approach that obtains the block vectors on the fly in streaming scenarios to improve efficiency further. Experiments on synthetic and real-life data sets verify the efficiency and advantage of the proposed methods over the state of the art.
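To make the pruning idea concrete, the following is a minimal Python sketch and not the paper's algorithm. It assumes a block vector made of per-block means, which gives a PAA-style lower bound; the paper's exact block-vector construction and its multilevel bound may differ. All function names (`block_vector`, `lower_bound`, `match_stream`) and the parameter `n_blocks` are hypothetical. For simplicity the sketch recomputes the block vector of every window, whereas the abstract describes obtaining block vectors on the fly.

```python
import math


def block_vector(x, n_blocks):
    """Mean of each equal-length block of x (assumes len(x) % n_blocks == 0)."""
    size = len(x) // n_blocks
    return [sum(x[i * size:(i + 1) * size]) / size for i in range(n_blocks)]


def lower_bound(bx, by, block_size):
    """Block-mean lower bound on the Euclidean distance of the full vectors.

    Within each block, sum((x_j - y_j)^2) >= block_size * (mean_x - mean_y)^2
    by the Cauchy-Schwarz inequality, so this never exceeds the true distance.
    """
    return math.sqrt(block_size * sum((a - b) ** 2 for a, b in zip(bx, by)))


def euclidean(x, y):
    """Exact Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def match_stream(stream, patterns, threshold, n_blocks):
    """Find (window_start, pattern_index) pairs with distance < threshold.

    Also returns how many exact distance computations were needed, to show
    how much work the lower bound avoided.
    """
    w = len(patterns[0])  # assumption: all patterns share the window length
    size = w // n_blocks
    pattern_blocks = [block_vector(p, n_blocks) for p in patterns]
    matches, n_exact = [], 0
    for start in range(len(stream) - w + 1):
        window = stream[start:start + w]
        wb = block_vector(window, n_blocks)
        for k, p in enumerate(patterns):
            # Safe pruning: true distance >= lower bound >= threshold.
            if lower_bound(wb, pattern_blocks[k], size) >= threshold:
                continue
            n_exact += 1
            if euclidean(window, p) < threshold:
                matches.append((start, k))
    return matches, n_exact
```

Because a pair is skipped only when its lower bound already reaches the threshold, the returned matches are the same as those of a brute-force scan, up to floating-point rounding at the threshold boundary. The cheap bound costs `n_blocks` operations per pair instead of the full window length.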
Data availability
The synthetic data sets that support the findings of this study are available from the corresponding author upon reasonable request. The public online data sets are available for access on the website [47].
References
Mohammadi M, Al-Fuqaha A, Sorour S, Guizani M (2018) Deep learning for IoT big data and streaming analytics: a survey. IEEE Commun Surv Tutor 20(4):2923–2960
Gomes HM, Read J, Bifet A, Barddal JP, Gama J (2019) Machine learning for streaming data: state of the art, challenges, and opportunities. ACM SIGKDD Explor Newsl 21(2):6–22
Bodenham DA, Adams NM (2017) Continuous monitoring for changepoints in data streams using adaptive estimation. Stat Comput 27(5):1257–1270
Butler B, Pearson RG, Birtles RA (2021) Water-quality and ecosystem impacts of recreation in streams: monitoring and management. Environ Chall 5:100328
Henning S, Hasselbring W (2020) Scalable and reliable multi-dimensional sensor data aggregation in data streaming architectures. Data-Enabled Discov Appl 4(1):1–12
Lin H, Wu S, Kou NM, Gao Y, Lu D et al (2018) Finding the hottest item in data streams. Inf Sci 430:314–330
Chen L, Zou L-J, Tu L (2012) A clustering algorithm for multiple data streams based on spectral component similarity. Inf Sci 183(1):35–47
Wu J, Wang P, Pan N, Wang C, Wang W, Wang J (2019) Kv-match: a subsequence matching approach supporting normalization and time warping. In: 2019 IEEE 35th international conference on data engineering (ICDE), pp 866–877. IEEE
Alghamdi N, Zhang L, Zhang H, Rundensteiner EA, Eltabakh MY (2020) Chainlink: indexing big time series data for long subsequence matching. In: 2020 IEEE 36th international conference on data engineering (ICDE), pp 529–540. IEEE
Gong X, Fong S, Si Y-W (2019) Fast fuzzy subsequence matching algorithms on time-series. Expert Syst Appl 116:275–284
Peng B, Fatourou P, Palpanas T (2021) Fast data series indexing for in-memory data. VLDB J 1–27
Linardi M, Palpanas T (2020) Scalable data series subsequence matching with ULISSE. VLDB J 29(6):1449–1474
Lian X, Chen L, Yu JX, Han J, Ma J (2008) Multiscale representations for fast pattern matching in stream time series. IEEE Trans Knowl Data Eng 21(4):568–581
Zhou K, Hou Q, Wang R, Guo B (2008) Real-time kd-tree construction on graphics hardware. ACM Trans Graph (TOG) 27(5):1–11
Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: VLDB, vol 97, pp 426–435. Citeseer
Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: Proceedings of the 23rd international conference on machine learning, pp 97–104
Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data, pp 47–57
Almalawi AM, Fahad A, Tari Z, Cheema MA, Khalil I (2015) \(k\)NNVWC: an efficient \(k\)-nearest neighbors approach based on various-widths clustering. IEEE Trans Knowl Data Eng 28(1):68–81
Pan Y, Pan Z, Wang Y, Wang W (2020) A new fast search algorithm for exact k-nearest neighbors based on optimal triangle-inequality-based check strategy. Knowl-Based Syst 189:105088
Wang X (2011) A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. In: The 2011 international joint conference on neural networks, pp 1293–1299. IEEE
Camerra A, Palpanas T, Shieh J, Keogh E (2010) iSAX 2.0: indexing and mining one billion time series. In: 2010 IEEE international conference on data mining, pp 58–67. IEEE
Peng B, Fatourou P, Palpanas T (2020) Paris+: data series indexing on multi-core architectures. IEEE Trans Knowl Data Eng 33(5):2151–2164
Wang Y, Wang P, Pei J, Wang W, Huang S (2013) A data-adaptive and dynamic segmentation index for whole matching on time series. Proc VLDB Endow 6(10):793–804
Zoumpatianos K, Idreos S, Palpanas T (2016) ADS: the adaptive data series index. VLDB J 25(6):843–866
Shieh J, Keogh E (2008) \(i\)SAX: indexing and mining terabyte sized time series. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 623–631
Keogh E, Chakrabarti K, Pazzani M, Mehrotra S (2001) Locally adaptive dimensionality reduction for indexing large time series databases. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data, pp 151–162
Peng J, Wang H, Li J, Gao H (2016) Set-based similarity search for time series. In: Proceedings of the 2016 international conference on management of data, pp 2039–2052
Zhang H, Dong Y, Li J, Xu D (2021) An efficient method for time series similarity search using binary code representation and hamming distance. Intell Data Anal 25(2):439–461
Ye Y, Jiang J, Ge B, Dou Y, Yang K (2019) Similarity measures for time series data classification using grid representation and matrix distance. Knowl Inf Syst 60(2):1105–1134
Hwang Y, Baek M, Kim S, Han B, Ahn H-K (2018) Product quantized translation for fast nearest neighbor search. In: Proceedings of the AAAI conference on artificial intelligence, vol 32
Hwang Y, Han B, Ahn H-K (2012) A fast nearest neighbor search algorithm by nonlinear embedding. In: 2012 IEEE conference on computer vision and pattern recognition, pp 3053–3060. IEEE
Jeong S, Kim S-W, Kim K, Choi B-U (2006) An effective method for approximating the Euclidean distance in high-dimensional space. In: International conference on database and expert systems applications, pp 863–872. Springer
Li M, Zhang Y, Sun Y, Wang W, Tsang IW, Lin X (2018) An efficient exact nearest neighbor search by compounded embedding. In: International conference on database systems for advanced applications, pp 37–54. Springer
Liu Y, Wei H, Cheng H (2018) Exploiting lower bounds to accelerate approximate nearest neighbor search on high-dimensional data. Inf Sci 465:484–504
Bottesch T, Bühler T, Kächele M (2016) Speeding up k-means by approximating Euclidean distances via block vectors. In: International conference on machine learning, pp 2578–2586. PMLR
Zhang H, Dong Y, Xu D (2021) Accelerating exact nearest neighbor search in high dimensional Euclidean space via block vectors. Int J Intell Syst 37:1697–1722
Esling P, Agon C (2012) Time-series data mining. ACM Comput Surv (CSUR) 45(1):1–34
Berndt DJ, Clifford J (1996) Finding patterns in time series: a dynamic programming approach. In: Advances in knowledge discovery and data mining, pp 229–248
Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: Proceedings of the 2005 ACM SIGMOD international conference on management of data, pp 491–502
Marteau P-F (2008) Time warp edit distance with stiffness adjustment for time series matching. IEEE Trans Pattern Anal Mach Intell 31(2):306–318
Stefan A, Athitsos V, Das G (2012) The move-split-merge metric for time series. IEEE Trans Knowl Data Eng 25(6):1425–1438
Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 262–270
Kim S-W, Park S, Chu WW (2001) An index-based approach for similarity search supporting time warping in large sequence databases. In: Proceedings 17th international conference on data engineering, pp 607–614. IEEE
Keogh E, Ratanamahatana CA (2005) Exact indexing of dynamic time warping. Knowl Inf Syst 7(3):358–386
Yi B-K, Faloutsos C (2000) Fast time sequence indexing for arbitrary Lp norms
Keogh E, Chakrabarti K, Pazzani M, Mehrotra S (2001) Dimensionality reduction for fast similarity search in large time series databases. Knowl Inf Syst 3(3):263–286
Dau HA, Keogh E, Kamgar K, Yeh C-CM, Zhu Y, Gharghabi S, Ratanamahatana CA, Yanping, Hu B, Begum N, Bagnall A, Mueen A, Batista G, Hexagon-ML (2018) The UCR time series classification archive. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/
Acknowledgements
We want to express our gratitude to Dr. Eamonn Keogh for providing the data sets used in this paper. This work is supported by the Science Foundation of Zhejiang Sci-Tech University (ZSTU) under Grant No. 22232264-Y.
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
About this article
Cite this article
Zhang, H., Li, J. Speeding up pattern matching in streaming time-series via block vector and multilevel lower bound. Neural Comput & Applic 36, 3389–3403 (2024). https://doi.org/10.1007/s00521-023-09291-5
DOI: https://doi.org/10.1007/s00521-023-09291-5