Pattern discovery in data streams under the time warping distance
Open Access Article
DOI: 10.1007/s0077801202893
Abstract
Subsequence matching is a basic problem in the field of data stream mining. In recent years, there has been significant research effort spent on efficiently finding subsequences similar to a query sequence. Another challenging issue in relation to subsequence matching is how we identify common local patterns when both sequences are evolving. This problem arises in trend detection, clustering, and outlier detection. Dynamic time warping (DTW) is often used for subsequence matching and is a powerful similarity measure. However, the straightforward method using DTW incurs a high computation cost for this problem. In this paper, we propose a one-pass algorithm, CrossMatch, that achieves the above goal. CrossMatch addresses two important challenges: (1) how can we identify common local patterns efficiently without any omission? (2) how can we find common local patterns in data stream processing? To tackle these challenges, CrossMatch incorporates three ideas: (1) a scoring function, which computes the DTW distance indirectly to reduce the computation cost, (2) a position matrix, which stores starting positions to keep track of common local patterns in a streaming fashion, and (3) a streaming algorithm, which identifies common local patterns efficiently and outputs them on the fly. We provide a theoretical analysis and prove that our algorithm does not sacrifice accuracy. Our experimental evaluation and case studies show that CrossMatch can incrementally discover common local patterns in data streams within constant time (per update) and space.
Keywords
Data streams, Subsequence matching, Dynamic time warping
1 Introduction
Data streams are becoming increasingly important in several domains including financial data analysis [68], sensor network monitoring [74], moving object trajectories [15, 38], web click-stream analysis [42, 48, 59], and network traffic analysis [37]. Many applications require time-series data streams to be continuously monitored in real time, and the processing and mining of data streams are attracting increasing interest. In addition to providing SQL-like support for data stream management systems (DSMS), it is crucial to detect hidden patterns that may exist in data streams, and subsequence matching is one of the key techniques for achieving this goal.
Much of the previous work on subsequence matching over data streams has focused on finding subsequences similar to a query sequence [16, 58, 72]. In this setting, one is a fixed sequence and the other is an evolving sequence. This approach works well if we have already determined the patterns we want to find. However, we consider coevolving sequences and focus on the problem of identifying common local patterns between them. That is, our goal is to automatically detect all common local patterns over data streams without a query sequence. The problem we want to solve is as follows.
Given two data streams, determine common local patterns and their periodicities taking account of time scaling.
This problem motivates us to develop the following important techniques: (1) trend detection, which is the ability to detect the most frequently occurring patterns in data streams, (2) clustering, which is the ability to find sequences that look similar and to group them, and (3) outlier detection, which is the ability to discover anomalous patterns by comparing common patterns. These exciting techniques could also provide interpretations of clusters and anomalies by annotating them in an online fashion.

Web analysis: Web access patterns are very dynamic because of both the dynamics of web site content and structure, and the changes in the users’ interests. A continuous monitoring of web access will reveal interesting usage patterns or profiles and provide users with more suitable, customized services in real time. Webmasters may cluster users into groups based on their common characteristics for user behavioral analysis. Web site designers can use typical browsing patterns to personalize the user’s experience on the website. These groups and patterns essentially correspond to groups of common local patterns.

Motion capture: The recognition of human motion has been attracting intense interest in relation to computer animation, sports, and medical care. Motion data sequences are sampled many times per second and are data streams of high dimensionality. Humans never repeat exactly the same action patterns, and the actions tend to differ in terms of their duration. This appears as variability in the speed of human motion. For example, an actor may walk quickly or slowly. Such variability can manifest itself as time scaling, namely a stretching or shrinking of time-series data. Our approach aids trend detection, which can be used to identify particular movement styles for game creators, and outlier detection, which can be used by coaches to analyze athletes’ performance by identifying time-varying common motions (i.e., common local patterns).

Sensor network: In sensor networks, sensors send their readings frequently. Each sensor produces a stream of data, and those streams need to be monitored and combined to detect interesting changes in the environment. It is likely that users are interested in one or more sensors within a particular spatial region. These interests are expressed as trends and similar patterns, that is, common local patterns.
What are the significant challenges in terms of detecting common local patterns over data streams? Typically, DTW is applied to limited situations in an offline manner. To identify common local patterns with DTW, we have to divide data streams into all possible subsequences and compute the similarities between them because we have no advance knowledge about the patterns we are seeking. Since data streams arrive online at high bit rates and are potentially unbounded in size, the computation time and memory space increase greatly. Ideally, we need a solution that can return correct results without any omissions, even at high speeds.
Recently, the work in [65, 66] addressed the problem of finding common local patterns in data streams. The problem definition and its solution using DTW are introduced in [66], but that work does not provide any theoretical guarantees with respect to answer accuracy, and it outputs overlapping results. In [65], we modified the problem definition and devised two ideas, a scoring function (Sect. 4.2.1) and a position matrix (Sect. 4.2.2). In this paper, while we share the same goals, we present a new streaming algorithm (Sect. 4.2.3), which incorporates these ideas and at the same time provides strict guarantees for our results (Sect. 4.3). By introducing a global constraint for DTW, which is suitable for stream settings, our algorithm improves the time and space requirements. Moreover, we propose enhanced solutions for different environments (Sects. 5 and 6) and make our algorithm more robust.

We present CrossMatch, which can efficiently detect common local patterns in data streams. CrossMatch is a one-pass algorithm, which is strictly based on DTW and guarantees correct results.

In our theoretical analysis, we prove that CrossMatch does not sacrifice accuracy and detects the optimal subsequences. Moreover, we discuss the complexity in terms of computation time and memory space and show that CrossMatch significantly reduces the required amounts of these resources and achieves constant time (per update) and space.

For greater efficiency, we propose a sampling approach that introduces an approximation for CrossMatch. Our solution works properly for sampled sequences and achieves a significant reduction in resources.

We empirically demonstrate the accuracy and efficiency of CrossMatch in detecting common local patterns on several real and synthetic data sets.

We address a more challenging problem of finding common local patterns in multiple data streams and show that CrossMatch can be effectively applied to this problem.
2 Related work
Related work falls broadly into three categories: time-series similarity search, stream management, and stream mining. We review each category.
Time-series analysis and similarity search. Time-series analysis has been studied for many years. Most of the proposed methods focus on similarity queries with a query sequence. There are several distance measures for similarity queries on time-series data, for example, Euclidean distance [25], DTW [9, 54], distance based on the longest common subsequence (LCSS) [67], edit distance with real penalty (ERP) [14], and edit distance on real sequence (EDR) [15]. The distance measure is selected according to the matching strategy that the application domain requires.
To perform the similarity search efficiently, data sequences are transformed to lower-dimensional points with a dimensionality reduction technique. Agrawal et al. [2] and Faloutsos et al. [25] have utilized the discrete Fourier transform (DFT) and have inserted each point into an R-tree [8]. Other reduction techniques include discrete wavelet transform (DWT) [53], singular value decomposition (SVD) [33], piecewise aggregate approximation (PAA) [70], and adaptive piecewise constant approximation (APCA) [36]. Cao et al. [10] have proposed a data reduction technique for spatiotemporal data.
Sequence matching has attracted a lot of research interest, and very successful methods have been developed for time-series data [4, 35, 61]. MDMWP [31] is a fast ranked subsequence matching solution. Ranked subsequence matching finds the top-k similar subsequences to a query sequence from data sequences. It introduces two tight lower bounds and prunes unnecessary subsequence access requests at the index level. EBSM [5] is a method for approximate subsequence matching under DTW. The key idea is to convert subsequence matching to vector matching. For the conversion, EBSM uses precomputed alignments between database sequences and query sequences. Rakthanmanon et al. [55] have focused on time series of length one trillion, as well as several different time-series data sets of many tens of billions in length, and have proposed a method for exact search under DTW. By introducing four optimizations based on early termination of the computation and lower bounds, they have shown that their method is much faster than recent search methods for DTW. The above methods focus mainly on stored sequences.
As regards subsequence matching based on DTW in data streams, Zhou et al. [72] presented an efficient batch filtering method. They observe a special property of data streams, which is that successive subsequences in a stream often overlap to some extent, and improve the performance by utilizing such overlapping information as filters for lower and upper bounds. Sakurai et al. [58] presented SPRING, which efficiently monitors multiple numerical streams. They introduce two new ideas: star-padding and the subsequence time warping matrix. These methods can accurately detect similar subsequences in constant time without fixing the window size. On the other hand, Chen et al. [16] proposed an original distance function that supports shifting and scaling in both the time and amplitude dimensions and used it as a similarity measure for the efficient and continuous detection of patterns in a streaming time sequence. The above methods are powerful for dealing with problems where fixed-length query sequences are given. However, they scale poorly and so are ineffective with respect to our target problem.
Regarding the detection of common local patterns over data streams, the most relevant work is Toyoda et al. [66], which proposed an algorithm for finding similar subsequences. From an algorithmic perspective, a highly relevant work is Toyoda and Sakurai [65], which presents an algorithm based on DTW. However, the algorithm in [66] does not guarantee the correctness of results and outputs subsequences with redundant information. Both algorithms are linear with regard to time and space. This paper not only overcomes the issues of accuracy and complexity, but is also more efficient in detecting common local patterns.
In the field of bioinformatics, search techniques for biological sequences have been studied and the Smith–Waterman algorithm is used to find local similarities [63]. The studies in this area focus on symbol sequences, whereas our problem focuses on numerical sequences. Our method differs in that it computes the DTW distance precisely and guarantees the detection of subsequences with the minimum distance.
Continuous queries and data stream management. Broadly related work includes DSMSs. Their common goal is to provide a general-purpose infrastructure for the efficient management of data streams. Sample systems include Aurora [1], Stream [44], Telegraph [12], Gigascope [19], and OSCAR [13]. Algorithmic work includes query processing [41], scheduling [6, 11], and load shedding [20, 64]. As regards continuous queries, Arasu et al. [3] studied the memory requirements of continuous queries over relational data streams. SOLE [43] is a scalable algorithm for continuous spatiotemporal queries in data streams. To address multiple streams and queries, it provides a framework with caching of uncertainty regions and a shared operator on a shared buffer.
Approximation and adaptivity are also key features for DSMSs, such as sampling [7], sketches [17, 23, 27], statistics [21, 28], and wavelets [30]. The main goal of these methods is to estimate a global aggregate (e.g., sum, count, average) over a fixed window on the recent data.
The emphasis in the above works is to support traditional SQL queries on streams. None of them try to find patterns.
Stream mining. Many other previous studies have attempted pattern discovery in a streaming scenario. Mueen et al. [46] presented the first online motif discovery algorithm to accurately monitor and maintain motifs, which represent repeated subsequences in time series, in real time. AWSOM [50] is one of the first streaming algorithms for forecasting, and it is used to discover arbitrary periodicities in a time sequence. Zhu et al. [73] focused on monitoring multiple streams in real time and proposed StatStream, which computes pairwise correlations among all streams. The SPIRIT method [51] is used to address the problem of capturing correlations and finding hidden variables corresponding to trends in collections of coevolving data streams. BRAID [60] detects lag correlations between data streams by using geometric probing and smoothing to approximate the exact correlation. Papadimitriou et al. [52] proposed an algorithm for discovering optimal local patterns, which concisely describe the main trends in data streams. DynaMMo [40] summarizes and compresses multiple sequences and finds latent variables among them.
On the other hand, there are effective methods that address massive time-series streams as applications for data center management. Reeves et al. [56] addressed the problem of the space-efficient archiving of time-series streams and the fast processing of several statistical and data mining queries regarding that archived data. They focused on the problem that traditional database systems have addressed space-efficient archiving and query processing separately, and proposed Cypress, which preprocesses and decomposes each data stream into a small number of substreams, and answers common queries directly from a set of them rather than reconstructing the original stream. Mueen et al. [47] considered the problem of computing all-pair correlations in a warehouse containing a large number of time series. High I/O and CPU overheads make the fast computation of correlations a challenging issue. They proposed a caching algorithm to optimize overall I/O cost and two approximation algorithms to reduce CPU costs.
These techniques focus on trend detection, correlation, motif discovery, and prediction and so are not solutions for our goal, which is to find common local patterns based on DTW.
In our experiment on CrossMatch, we used scatter plots to show its outputs, which were the optimal subsequence pairs. Recurrence plot [24] and dot plot [69] have been proposed for visual sequence analysis and mining of time-series data; they focus on the visualization of the similar parts of sequences on a scatter plot. Our objective is to identify which of the subsequences of \(X\) and \(Y\) are similar by applying the DTW approach in an online manner, and so differs from their objective and approach.
3 Problem definition
Definitions of main symbols
Symbols | Definitions
\(X\) | Data sequence/stream of length \(n\)
\(Y\) | Data sequence/stream of length \(m\)
\(x_i\) | \(i\)th element of \(X\)
\(y_j\) | \(j\)th element of \(Y\)
\(X[i_s:i_e], Y[j_s:j_e]\) | Subsequences of \(X\) and \(Y\), including elements in positions \(i_s, j_s\) through \(i_e, j_e\)
\(\varepsilon \) | Distance threshold for finding qualifying subsequences
\(l_{\min }\) | Threshold of subsequence length
\(l_x\) | Length of \(X[i_s:i_e]\)
\(l_y\) | Length of \(Y[j_s:j_e]\)
\(L(l_x, l_y)\) | Function for length between \(X[i_s:i_e]\) and \(Y[j_s:j_e]\)
\(w\) | Width of warping scope
\(d(i, j)\) | Distance of \((i, j)\) in time warping matrix
\(v(i, j)\) | Score of \((i, j)\) in score matrix
\(s(i, j)\) | Starting position of \((i, j)\) in position matrix
\(\mathcal{X }\) | Sampled data sequence of \(X\)
\(\mathcal{Y }\) | Sampled data sequence of \(Y\)
\(\mathrm{x}_{i}\) | \(i\)th element of \(\mathcal{X }\)
\(\mathrm{y}_{j}\) | \(j\)th element of \(\mathcal{Y }\)
\(f_{Nq}\) | Nyquist frequency
\(T_x\) | Fixed sampling period of \(X\)
\(T_y\) | Fixed sampling period of \(Y\)
3.1 Preliminaries
DTW requires \(O(nm)\) time since the time warping matrix consists of \(nm\) cells. Note that the space complexity is \(O(m)\) (or \(O(n)\)) since the algorithm needs only two columns (i.e., the current and previous columns) of the time warping matrix to compute the DTW distance. By using the warping scope, the time complexity is reduced to \(O(nw+mw)\). The space complexity is \(O(w)\) because we need only \(2w\) cells.
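The two-column computation mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `dtw_distance` is hypothetical, and the use of the absolute difference as the distance between two elements is an assumption.

```python
def dtw_distance(a, b):
    """DTW distance between sequences a and b, keeping only two columns.

    Assumption: the element-wise distance is the absolute difference.
    """
    inf = float("inf")
    # Column for the empty prefix of a: only the (0, 0) corner is reachable.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of the three neighboring alignments.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr  # the current column becomes the previous one
    return prev[len(b)]


print(dtw_distance([12, 6, 10, 3], [11, 9, 4, 2]))
```

For the subsequences \(X[2:5] = (12, 6, 10, 3)\) and \(Y[1:4] = (11, 9, 4, 2)\) used in Example 4, this sketch returns 7, which agrees with the distance reported there.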
3.2 Cross-similarity
Equation (2) allows us to detect subsequence pairs without regard to the subsequence length. In practice, however, we might detect shorter and meaningless matching pairs due to the influence of noise. We introduce the concept of subsequence match length to enable us to discard such meaningless pairs and to detect the optimal pairs that satisfy ‘real’ user requirements. We formally define the ‘cross-similarity’ between \(X\) and \(Y\), which indicates common local patterns.
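Definition 1
(Cross-similarity) Given two sequences \(X\) and \(Y\), a distance threshold \(\varepsilon \), and a length threshold \(l_{\min }\), the subsequences \(X[i_s:i_e]\) and \(Y[j_s:j_e]\) have the property of cross-similarity if \(D(X[i_s:i_e], Y[j_s:j_e]) \le \varepsilon (L(l_x, l_y) - l_{\min })\).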
The minimum length \(l_{\min }\) of subsequence matches should be given by the users. The subsequences that satisfy this equation are guaranteed to have lengths exceeding \(l_{\min }\). We also agree that the user should select the length function \(L\) as well as \(l_{\min }\) to obtain desirable results.
We should also mention the following point: Whenever a subsequence pair matches, there will be several other matches that strongly overlap the ‘local minimum’ best match. Specifically, an overlap is simply the relation that two subsequence pairs have a common alignment, which is defined as follows:
Definition 2
(Overlap) Given two warping paths for subsequence pairs of \(X\) and \(Y\), their overlap is defined as the condition where the paths share at least one element.
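Definition 2 reduces to a set-intersection test on the cells of two warping paths. The following sketch illustrates this; the function name `paths_overlap` and the representation of a path as a list of \((i, j)\) tuples are assumptions made for the example.

```python
def paths_overlap(path_a, path_b):
    """Return True if two warping paths share at least one cell (i, j)."""
    return not set(path_a).isdisjoint(path_b)


# Hypothetical warping paths, each a list of (i, j) matrix cells.
path_1 = [(2, 1), (3, 2), (4, 3)]
path_2 = [(3, 1), (3, 2), (4, 2)]   # shares cell (3, 2) with path_1
path_3 = [(5, 5), (6, 6)]           # shares no cell with path_1

print(paths_overlap(path_1, path_2))
print(paths_overlap(path_1, path_3))
```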
Overlaps provide the user with redundant information and would slow down the algorithm since all useless ‘solutions’ are tracked and reported. Our solution is to detect the local best subsequences from the set of overlapping subsequences. Thus, our goal is to find the best match of cross-similarity.
Problem 1
1. \(X[i_s:i_e]\) and \(Y[j_s:j_e]\) have the property of cross-similarity.
2. \(D(X[i_s:i_e], Y[j_s:j_e]) - \varepsilon (L(l_x, l_y) - l_{\min })\) is the minimum value among the set of overlapping subsequence pairs that satisfies the first condition.
Hereafter, we use ‘qualifying’ subsequence pairs to refer to pairs that satisfy the first condition, and we use ‘optimal’ subsequence pairs to refer to pairs that satisfy both conditions.
New elements in data streams, that is, those that have occurred recently, are usually more significant than those in the distant past [18]. To limit the cells computed in the matrix and focus on recent elements, we utilize a global constraint for DTW, namely the Sakoe–Chiba band [57]. More specifically, for each sequence \(X\) and \(Y\), we compute the cells from the most recent element (e.g., \(x_n\) or \(y_m\)) back to the element that arrived \(w\) positions earlier, where \(w\) is the warping scope. If \(m = n\), the warping scope is exactly equal to the Sakoe–Chiba band.
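One way to read the warping scope, consistent with the cell ranges listed in Example 4, is that a cell \((i, j)\) is computed only when \(|i - j| \le w\) and both elements have already arrived. The sketch below lists the cells touched when a new element arrives; this reading and the function names `cells_on_x_arrival` and `cells_on_y_arrival` are assumptions, not the paper's definitions.

```python
def cells_on_x_arrival(i, m, w):
    """Cells (i, j) to update when x_i arrives and y_1..y_m are available.

    Assumption: the warping scope keeps cells with |i - j| <= w.
    """
    lo, hi = max(1, i - w), min(m, i + w)
    return [(i, j) for j in range(lo, hi + 1)]


def cells_on_y_arrival(j, n, w):
    """Cells (i, j) to update when y_j arrives and x_1..x_n are available."""
    lo, hi = max(1, j - w), min(n, j + w)
    return [(i, j) for i in range(lo, hi + 1)]


# With w = 3 and alternating arrivals, x_4 arrives when m = 3,
# and y_4 arrives when n = 4.
print(cells_on_x_arrival(4, 3, 3))
print(cells_on_y_arrival(4, 4, 3))
```

Under this reading, each arrival touches a number of cells bounded by the scope width, which is what keeps the per-update cost independent of the stream length.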
4 Proposed method
In this section, we describe a straightforward solution to find the best match of cross-similarity in data streams and also present our one-pass algorithm, CrossMatch.
4.1 Naive solution
The most straightforward solution to this problem is to consider all possible subsequences \(X[i_s:i_e] \ (1 \le i_s < i_e \le n)\) and all possible subsequences \(Y[j_s:j_e] \ (1 \le j_s < j_e \le m)\) in the warping scope and apply the standard DTW dynamic programming algorithm. We call this method Naive.
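The exhaustive enumeration can be sketched as below. The sketch makes several assumptions that should be read as such: the element distance is the absolute difference, the length function is \(L = (l_x + l_y)/2\), a pair qualifies when \(D \le \varepsilon (L - l_{\min })\), and the warping scope is ignored for brevity. The names `dtw_distance` and `naive_qualifying_pairs` are hypothetical.

```python
def dtw_distance(a, b):
    """Standard DTW with absolute difference as element distance (assumed)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def naive_qualifying_pairs(X, Y, eps, l_min):
    """Enumerate every subsequence pair and keep the qualifying ones.

    Positions are 1-based and inclusive, as in the text.
    Assumed criterion: D <= eps * (L - l_min) with L = (l_x + l_y) / 2.
    """
    n, m = len(X), len(Y)
    result = []
    for i_s in range(1, n):
        for i_e in range(i_s + 1, n + 1):
            for j_s in range(1, m):
                for j_e in range(j_s + 1, m + 1):
                    dist = dtw_distance(X[i_s - 1:i_e], Y[j_s - 1:j_e])
                    length = ((i_e - i_s + 1) + (j_e - j_s + 1)) / 2
                    if dist <= eps * (length - l_min):
                        result.append((i_s, i_e, j_s, j_e, dist))
    return result


X = [5, 12, 6, 10, 3, 18]
Y = [11, 9, 4, 2, 9, 13]
pairs = naive_qualifying_pairs(X, Y, eps=14, l_min=2)
print(len(pairs))
```

Every subsequence pair triggers a full DTW computation here, which illustrates why this approach does not scale to streams.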
4.2 CrossMatch
As mentioned in the previous section, the naive solution creates too many matrices because it computes the distance values between all possible subsequences. The distance threshold is proportional to the subsequence length (cf. Definition 1). The naive solution attempts to find the subsequence pairs semipermanently in each matrix. If we prune dissimilar subsequence pairs and reduce the number of matrices, the distance computations become much more efficient. Our method is motivated by this idea.
4.2.1 Scoring function
To identify the dissimilar subsequences early, we propose computing the DTW distance indirectly by using a scoring function. The scoring function has the following two characteristics: (a) it provides a nonnegative cumulative score, and (b) its operation is reversible with respect to the DTW distance.
The scoring function is essentially based on the dynamic programming approach. Whereas the DTW computes the minimum cumulative distance, our function computes the maximum cumulative score corresponding to the DTW distance with a score matrix. The score is determined by accumulating the difference between the threshold and the distance between the elements in the score matrix. Thus, we can recognize a dissimilar subsequence pair since the score has a negative value if the subsequence pair does not satisfy the first condition of Problem 1.
The scoring function selects the cell with the maximum cumulative score from the neighboring cells, and if the score is negative, the function initializes the score to zero and then restarts the computation from the cell. This operation allows us to discard unqualifying, nonoptimal subsequence pairs.
Definition 3
Example 1
Assume that we have two sequences of \(X=(5, 12, 6, 10, 3, 18)\), \(Y=(11, 9, 4, 2, 9, 13)\), and \(\varepsilon = 14\), \(l_{\min } = 2\), and \(w = 3\). Figure 5a shows the score matrix. The dark cell, which has the highest score, shows the optimal subsequence pair and indicates that the score is \(\varepsilon b_d - \Vert x_5 - y_4\Vert + v(4,3)=49\) and the end position is \((i_e, j_e)=(5, 4)\). The light cells show qualifying subsequence pairs. The cells that contain zero identify dissimilar subsequence pairs.
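The scoring idea described above (accumulate the threshold minus the element distance, take the maximum over the neighboring cells, and reset negative scores to zero) can be sketched as follows. The exact recurrence, the weights \(b_v = b_h = 1/2\) and \(b_d = 1\), the absolute-difference distance, and the name `score_matrix` are assumptions made for illustration; the sketch ignores the warping scope and is not claimed to reproduce Fig. 5a.

```python
def score_matrix(X, Y, eps, b_v=0.5, b_h=0.5, b_d=1.0):
    """Cumulative score matrix v with a zero border (1-based cells).

    Assumed recurrence: each step adds (eps * weight - distance) to the
    best neighboring score, and a negative result is reset to zero.
    """
    n, m = len(X), len(Y)
    v = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(X[i - 1] - Y[j - 1])  # assumed element distance
            v[i][j] = max(
                0.0,
                eps * b_v - d + v[i][j - 1],      # advance along Y only
                eps * b_h - d + v[i - 1][j],      # advance along X only
                eps * b_d - d + v[i - 1][j - 1],  # advance along both
            )
    return v


v = score_matrix([1, 2], [1, 2], eps=2)
print(v[2][2])
```

In this toy input the two sequences are identical, so the DTW distance is 0 and the score at cell \((2, 2)\) equals \(\varepsilon L(2, 2) = 4\) under the assumed length function \(L = (l_x + l_y)/2\).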
4.2.2 Position matrix
The scoring function tells us (a) where the subsequence match ends and (b) what the resulting score is. However, we lose the information about the starting position of the subsequence. This is the motivation behind our second idea, a position matrix: We store the starting position to keep track of the qualifying subsequence pair in a streaming fashion.
Definition 4
The starting position is described as a coordinate value; \(s(i_e, j_e)\) indicates the starting position \((i_s, j_s)\) of the subsequence pair \(X[i_s:i_e]\) and \(Y[j_s:j_e]\). We update the starting position in the position matrix as well as the score in the score matrix. We can identify the optimal subsequence that gives the maximum score during stream processing since exactly the same warping path is maintained in the score and position matrices. Moreover, the starting position of the shared cell is maintained through the subsequent alignments because we repeat the operation, which maintains the starting position of the selected previous cell. Thus, we know the overlapping subsequence pairs from the fact that the starting positions match.
Example 2
Figure 5b shows the position matrix corresponding to the score matrix in Fig. 5a. In cell \((5,4)\), the starting position \((2,1)\) is maintained because the scoring function selects the score of cell \((4,3)\) in the score matrix. By combining both matrices, we can identify the position of the optimal subsequence pair \(X[2:5]\) and \(Y[1:4]\). On the other hand, there are many overlapping subsequence pairs that have the same starting position \((2,1)\). Of these, we select the subsequence pair with the highest score as the optimal pair because we can determine the overlapping subsequence pairs from the position matrix.
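The interplay between the score and position matrices can be sketched by carrying the starting position along with the score of the selected neighboring cell. As before, the recurrence, the weights, the absolute-difference distance, and the name `score_and_position` are assumptions made for illustration, and the warping scope is omitted.

```python
def score_and_position(X, Y, eps, b_v=0.5, b_h=0.5, b_d=1.0):
    """Return (v, s): score matrix and starting-position matrix.

    Assumption: a cell inherits the starting position of the neighbor it
    selects; if that neighbor has no start (score zero), a new match
    starts at the cell itself.
    """
    n, m = len(X), len(Y)
    v = [[0.0] * (m + 1) for _ in range(n + 1)]
    s = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(X[i - 1] - Y[j - 1])
            candidates = [
                (eps * b_d - d + v[i - 1][j - 1], s[i - 1][j - 1]),
                (eps * b_h - d + v[i - 1][j], s[i - 1][j]),
                (eps * b_v - d + v[i][j - 1], s[i][j - 1]),
            ]
            best, start = max(candidates, key=lambda c: c[0])
            if best > 0:
                v[i][j] = best
                s[i][j] = start if start is not None else (i, j)
            # Otherwise the cell keeps score 0 and no starting position.
    return v, s


# Toy input: the pattern (1, 2) is embedded in X between two outliers.
v, s = score_and_position([9, 1, 2, 9], [1, 2], eps=2)
print(v[3][2], s[3][2])
```

In this toy input the highest score is found at cell \((3, 2)\) with starting position \((2, 1)\), that is, \(X[2:3]\) matched with \(Y[1:2]\). Cells sharing the starting position \((2, 1)\) correspond to overlapping candidates, of which only the highest-scoring one would be kept.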
Next, we show how subsequence pairs are pruned. The pruned subsequence pairs fall into one of the following two categories: (1) subsequence pairs that are absolutely not reflected in two matrices (i.e., the score and the position matrices) and (2) subsequence pairs that are pruned during the computation process. In any case, our method is designed so that we can evaluate the cross-similarity between sequences from the score value, and guarantees that the pruned subsequence pairs are not optimal by using the fact that the overlapping subsequence pairs in cell \((i,j)\) have the same warping paths in the subsequent alignments (we will provide detailed proofs in Sect. 4.3).
Example 3
An example of case (1) corresponds to the subsequence pairs starting at \((3,2)\) in Fig. 5. In cell \((3,2)\), our method has to select one pair from neighboring cells since all neighboring cells include positive scores, and it prunes the subsequence pairs that have the starting position \((3,2)\). An example of case (2) corresponds to the subsequence pairs starting at \((1,3)\). In cell \((4,5)\), our method chooses the subsequence pair starting at \((2,1)\) because the pair has the maximum value. That is, the subsequence pair that has the starting position \((1,3)\) is pruned although the score indicates a positive value.
4.2.3 Streaming algorithm
CrossMatch requires three parameters, \(l_\mathrm{min}\), \(w\), and \(\varepsilon \). The subsequence length \(l_\mathrm{min}\) and the parameter \(\varepsilon \) are set based on the pattern the user wants to search. It is desirable to set the values according to the applications. The warping scope \(w\) determines the computation range in each matrix. At the same time, it asks the user how far back into the past the algorithm needs to go. If the user wants to search for subsequence pairs during the present time-tick and a time-tick in the relatively distant past, it is better to set a large \(w\). In our experiments, we simply use reasonable values for every data set, and we show that this way of setting parameters is sufficient for CrossMatch to verify the detection of the optimal subsequence pairs.
Example 4
Again, assume two sequences of \(X=(5, 12, 6, 10, 3, 18)\), \(Y=(11, 9, 4, 2, 9, 13)\), and \(\varepsilon = 14\), \(l_\mathrm{min} = 2\), and \(w = 3\) in Fig. 5. To simplify the example of our algorithm with no loss of generality, we assume that \(x_i\) and \(y_j\) arrive alternately. At each time-tick, the algorithm updates the scores and the starting positions. At \(i=4\), we update the cells from \((4, 1)\) to \((4, 3)\) and identify a candidate subsequence, \(X[2:4]\) and \(Y[1:3]\), starting at \((2, 1)\), whose score \(v(4,3)=36\) is greater than \(\varepsilon l_\mathrm{min}\). At \(j=4\), we update the cells from \((1, 4)\) to \((4, 4)\). Although no subsequences satisfying the condition are detected, we do not report the subsequence of \(X[2:4]\) and \(Y[1:3]\) since it is possible that this pair could be replaced by upcoming subsequences. We then capture the optimal subsequence pair of \(X[2:5]\) and \(Y[1:4]\) at \(i=5\). We finally report the subsequence as the optimal subsequence at \(j=6\) since we can confirm that none of the upcoming subsequences can be optimal. Figure 6 shows the time warping matrix starting at \((2,1)\) in the naive solution, which includes the optimal subsequence pair in Fig. 5. In the score and the position matrices, the subsequence pairs that have the starting position \((2,1)\) correspond to the pairs on the time warping matrix in Fig. 6. From Eq. (6), we have \(\varepsilon L(4,4) - V(X[2:5],Y[1:4]) = 14 \cdot 4 - 49 = 7 = D(X[2:5],Y[1:4])\).
Comparison with positions of optimal and foremost subsequence pairs
Pairs | Subseq. #1 \(i_s\) | Subseq. #1 \(i_e\) | Subseq. #1 \(j_s\) | Subseq. #1 \(j_e\) | Subseq. #2 \(i_s\) | Subseq. #2 \(i_e\) | Subseq. #2 \(j_s\) | Subseq. #2 \(j_e\)
Optimal subsequence pairs | 1 | 15,000 | 6 | 22,629 | 21,008 | 24,505 | 31,146 | 38,013
Foremost subsequence pairs | 1 | 11,250 | 6 | 16,938 | 21,008 | 24,188 | 31,146 | 36,949
4.3 Theoretical analysis
We introduce a brief theoretical analysis that confirms the accuracy and complexity of CrossMatch.
4.3.1 Accuracy
Lemma 1
1. \(V(X[i_s:i_e],Y[j_s:j_e]) \ge \varepsilon l_\mathrm{min}\)
2. \(V(X[i_s:i_e],Y[j_s:j_e]) - \varepsilon l_\mathrm{min}\) is the maximum value in each group of subsequence pairs that the warping path crosses.
Proof
See Appendix 1. \(\square \)
Lemma 2
CrossMatch guarantees the output of the optimal subsequence pairs.
Proof
See Appendix 2. \(\square \)
4.3.2 Complexity
Let \(X\) and \(Y\) be evolving sequences of lengths \(n\) and \(m\), respectively.
Lemma 3
The naive solution requires \(O(nw^2 + mw^2)\) time (per update) and space to discover cross-similarity.
Proof
See Appendix 3. \(\square \)
Lemma 4
CrossMatch requires \(O(w)\) (i.e., constant) time (per update) and space to discover cross-similarity.
Proof
See Appendix 4. \(\square \)
5 Sampling approach
As mentioned above, CrossMatch detects cross-similarity in constant time and space. The next question is what we can do in the highly likely case that the users need more efficient solutions given that, in practice, they require high accuracy, not a theoretical guarantee. This is our motivation for introducing an approximation for CrossMatch.
What approximation techniques are suitable for CrossMatch? An effective idea is to reduce the data in each sequence. Optimal alignments of DTW correspond to matching the elements in time. To find optimal subsequence pairs by approximation, we therefore keep a sequence obtained by data reduction performed in the time domain. As an extended version of CrossMatch, we propose compressing the matrices using a sampling approach. As we show later, this decision significantly improves both space cost and response time, with negligible effect on the mining results.
As the first step, we consider the following theorem.
Theorem 1
(Sampling theorem) If a continuous function contains no frequencies higher than \(f_{high}\), it is completely determined by its value at a series of points less than \(1/(2f_{high})\) apart.
Proof
See [39]. \(\square \)
In the theorem, the minimum sampling frequency, \(f_{Nq} = 2f_{high}\), is called the Nyquist frequency. We utilize this theorem for sampling the sequences. That is, we use coarse sequences yielded by sampling based on the theorem and detect the cross-similarity. Since the original sequence is sampled at the frequency \(f_{Nq}\), that is, once every \(T=1/f_{Nq}\), we greatly reduce the size of the matrix.
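As a rough illustration of the sampling step, the sketch below derives a sampling period from the highest frequency and keeps one element per period. It assumes that \(f_{high}\) is expressed in cycles per time-tick and that the period is rounded down to a whole number of time-ticks; the names `sampling_period` and `downsample` are hypothetical.

```python
import math


def sampling_period(f_high):
    """Sampling period T = 1 / f_Nq in time-ticks, with f_Nq = 2 * f_high.

    Assumption: f_high is given in cycles per time-tick, and T is rounded
    down so that samples are never farther apart than 1 / f_Nq.
    """
    f_nq = 2.0 * f_high
    return max(1, math.floor(1.0 / f_nq))


def downsample(seq, period):
    """Keep one element out of every `period` elements."""
    return seq[::period]


T = sampling_period(0.125)          # f_Nq = 0.25, so T = 4
print(T)
print(downsample(list(range(10)), T))
```

With a period of 4, a warping scope that covered \(w\) elements of the original sequence covers about \(w/4\) sampled elements, which is the source of the reduction in matrix size.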
5.1 Scoring function
How do we compute the score between sampled sequences? Intuitively, the key idea is that when we select one of the neighboring cells for score computation, we interpolate the distance values that were dropped by sampling. In the score computation between sampled sequences, there are \(T\!-\!1\) hidden cells that represent the missing values between the current cell (i.e., the cell that we should compute now) and its neighboring cells. We approximate the distance values, which should be provided by the hidden cells, by using the distance value in the current cell. Since the sampled sequences are obtained based on the sampling theorem, this is a suitable approximation.
5.2 Position matrix
5.3 Streaming algorithm
Algorithm 2 shows a detailed description of our sampling approach. The algorithm reflects the information about the skipped elements in the next computation and approximately computes the score and the position of the subsequence pair. The basic procedure is the same as that of the original version of CrossMatch (i.e., Algorithm 1); however, we can greatly reduce the space requirement and the computation cost by the sampling, which faithfully reconstructs the original sequence.
Lemma 5
Let \(T\) be the sampling period. With the sampling approach, CrossMatch requires \(O(w/T)\) time (per update) and space.
Proof
See Appendix 5. \(\square \)
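The effect of a fixed sampling period can be sketched in a few lines. The code below is illustrative rather than a reproduction of Algorithm 2: the helper names are hypothetical, and `step_distance` encodes one plausible reading of the hidden-cell approximation, namely that the \(T-1\) hidden cells are each charged the distance of the current cell.

```python
import math


def downsample(seq, period):
    """Keep every `period`-th element of the sequence, starting with the first."""
    return seq[::period]


def cells_per_update(w, period):
    """Number of cells touched per arriving sample once the scope w is sampled."""
    return math.ceil(w / period)


def step_distance(d_current, period):
    """Distance charged for one step between sampled cells.

    Assumption: the current cell and the period - 1 hidden cells behind it
    all contribute the distance value of the current cell.
    """
    return d_current * period
```

For example, with \(w = 100\) and \(T = 5\) the sketch touches 20 cells per update instead of 100, which matches the \(O(w/T)\) bound.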
5.4 Adaptive sampling approach
Let \(\mathcal{X } \!=\! (\mathrm{x}_{1},\ldots , \mathrm{x}_{i},\ldots , \mathrm{x}_{n^{\prime }})\) be the sampled sequence of \(X\), and let \(\mathcal{T }_{x} \!=\! (t_{\mathrm{x}_{1}},\ldots , t_{\mathrm{x}_{i}}, \ldots , t_{\mathrm{x}_{n^{\prime }}})\) be the sampling periods of \(\mathcal{X }\) in each timetick. In the adaptive sampling approach, the number of hidden cells varies according to the sampling period in each timetick. We compute the appropriate sampling period in each cell and approximate the distance values of the hidden cells accordingly. In addition, we determine the weights of each direction dynamically, since the current \(L\) value is determined by the sampling period in each timetick. For \(L\!=\!(l_x\!+\!l_y)/2\), the weight of the horizontal direction in cell \((i,j)\) is \(b_h=t_{\mathrm{x}_{i}}/2\), and the others are similarly set by the sampling periods. In the adaptive sampling approach, we always use the sampling period that reflects the sequence over recent timeticks. Thus, this approach would be more powerful when the sequence consists of both high and low frequencies.
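The direction weights under adaptive sampling can be sketched as differences of the length function. Only \(b_h = t_{\mathrm{x}_{i}}/2\) is stated in the text; the vertical and diagonal weights computed below are an assumption obtained by applying the same difference rule, and the function names and the default lengths are hypothetical.

```python
def length_fn(lx, ly):
    """Length function L(l_x, l_y) = (l_x + l_y) / 2."""
    return (lx + ly) / 2


def adaptive_weights(t_x, t_y, lx=100, ly=100):
    """Weights (b_v, b_h, b_d) for a cell whose sampling periods are t_x and t_y.

    A horizontal step advances l_x by t_x, a vertical step advances l_y by t_y,
    and a diagonal step advances both. Each weight is the resulting change in L.
    The lengths lx and ly are arbitrary here because L is linear.
    """
    b_h = length_fn(lx, ly) - length_fn(lx - t_x, ly)
    b_v = length_fn(lx, ly) - length_fn(lx, ly - t_y)
    b_d = length_fn(lx, ly) - length_fn(lx - t_x, ly - t_y)
    return b_v, b_h, b_d
```

With \(t_x = 4\) and \(t_y = 6\), the horizontal weight is \(4/2 = 2\), in agreement with \(b_h=t_{\mathrm{x}_{i}}/2\).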
Incremental algorithms have been proposed for computing the frequency in the stream sense (e.g., [29, 49]). CrossMatch can utilize any and all of these solutions to compute the frequency efficiently. However, this research topic is beyond the scope of this paper.
Neither sampling approach comes with a theoretical error bound, because the alignment of DTW depends on the data sequence and changes if the data sequence is sampled with a different sampling period. However, we show that their errors are very small in Sect. 7.3.
6 Discovery of groupsimilarity
So far, we have assumed the problem of crosssimilarity between two data streams. For more generality, we would like to make CrossMatch more flexible. We now tackle a more challenging problem: How do we efficiently identify common local patterns among multiple data streams? A useful feature of CrossMatch is that it can be effectively extended to this case.
Given multiple data streams (more than two sequences), we want to find ‘groupsimilarity,’ which means the crosssimilarity among them. The work in [62] has addressed the problem of similarity groupby that supports grouping based on tuples in a database. On the other hand, groupsimilarity provides grouping based on similar patterns. For example, in sensor networks, measurement values arriving from many different sensors have to be examined dynamically. CrossMatch makes it possible to reduce a large number of streams to just a handful of common patterns that compactly describe the key features. More importantly, the time and space requirements are constant per update.
We formally define groupsimilarity below. To simplify our presentation, we focus on three sequences \(X\), \(Y\), and \(Z\). We first present the DTW distance for the three sequences.
Definition 5
We compute the DTW distance among three sequences and detect the subsequences whose lengths are greater than \(l_{\min }\). As with crosssimilarity, we face the overlap problem. The number of overlapping subsequences increases significantly with the number of sequences. We detect the best match of groupsimilarity as follows.
Problem 2
 1.
\(X[i_s\!:\!i_e]\), \(Y[j_s\!:\!j_e]\), and \(Z[k_s\!:\!k_e]\) have the property of groupsimilarity.
 2.
\(D(X[i_s\!:\!i_e], Y[j_s\!:\!j_e], Z[k_s\!:\!k_e]) \!-\! \varepsilon (L(l_x, l_y, l_z)\!-\!l_{\min })\) is the minimum value from a set of overlapping subsequences that satisfies the first condition.
Lemma 6
The naive solution requires \(O(n_1w^3\!+\!n_2w^3\!+\!n_3w^3)\) time (per update) and space to discover the groupsimilarity for three sequences.
Proof
See Appendix 6. \(\square \)
Lemma 7
 1.
\(V(X[i_s\!:\!i_e],Y[j_s\!:\!j_e], Z[k_s\!:\!k_e]) \ge \varepsilon l_{\min }\)
 2.
\(V(X[i_s\!:\!i_e],Y[j_s\!:\!j_e], Z[k_s\!:\!k_e]) \!-\! \varepsilon l_{\min }\) is the maximum value in each group of subsequences that the warping path crosses.
Proof
See Appendix 7. \(\square \)
From Lemma 7, Eq. (11) holds for any reported subsequences. As with Lemma 2, it is obvious that CrossMatch reports the optimal subsequences from the set of overlapping subsequences. Thus, CrossMatch guarantees the correctness of the result for groupsimilarity.
Lemma 8
CrossMatch requires \(O(w^2)\) time (per update) and \(O(w^2)\) space to discover the groupsimilarity for three sequences.
Proof
See Appendix 8. \(\square \)
Although the complexity of groupsimilarity is still quadratic in the warping scope \(w\) for three sequences, CrossMatch is much faster in practice than the naive solution and enables the examination of very large collections of sequences.
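To give a feel for the gap between the bounds in Lemmas 3, 4, 6, and 8, the following sketch counts the cells each method updates per arriving element, ignoring constant factors. The function names are hypothetical, and the formulas for a general number of sequences \(k\) are an extrapolation; only \(k=2\) and \(k=3\) are backed by the lemmas.

```python
def naive_cells(lengths, w):
    """Order-of-magnitude cell count per update for the naive solution.

    k = 2 gives (n + m) * w**2 and k = 3 gives (n1 + n2 + n3) * w**3.
    """
    k = len(lengths)
    return sum(lengths) * w ** k


def crossmatch_cells(k, w):
    """Order-of-magnitude cell count per update for CrossMatch.

    k = 2 gives w and k = 3 gives w**2.
    """
    return w ** (k - 1)


# Hypothetical sizes chosen for illustration only.
two_streams = naive_cells([1000, 1000], 10) // crossmatch_cells(2, 10)
three_streams = naive_cells([1000, 1000, 1000], 10) // crossmatch_cells(3, 10)
```

The naive cost grows with the stream lengths, whereas the CrossMatch cost depends only on \(w\), so the ratio keeps widening as the streams evolve.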
7 Experimental evaluation
 1.
How well does CrossMatch provide the optimal subsequences without redundant information?
 2.
How successful is CrossMatch in capturing crosssimilarity?
 3.
How effective is the sampling approach in capturing crosssimilarity?
 4.
How well does CrossMatch scale with the sequence length in terms of computation time and memory space?
 5.
How well does CrossMatch work in highdimensional data streams?
 6.
How well does CrossMatch identify groupsimilarity?
7.1 Filtering redundant information
We compared CrossMatch with the previous algorithm [66] to investigate its effectiveness in filtering redundant information.^{7} We used a synthetic data set, Sines, which consists of discontinuous sine waves with white noise (see Fig. 8a), and for our previous algorithm and CrossMatch we set \(l_\mathrm{min}\) at 15 % of the sequence length, \(\varepsilon \) at \(1.0\times 10^{2}\), and \(w\) at 50 % of the sequence length.
Figure 8b plots the sequence length versus the number of detected subsequence pairs for the two algorithms. In the previous algorithm, increases in sequence length trigger a large increase in the number of detected subsequence pairs. CrossMatch, on the other hand, detects fewer subsequence pairs than the previous algorithm.
7.2 Detecting crosssimilarity between two sequences
Details of data sets and parameter settings

| Data set | Sequence length (Seq. #1) | Sequence length (Seq. #2) | \(\varepsilon \) |
|---|---|---|---|
| RandomSines | 25000 | 25000 | 1.0e-4 |
| Spikes | 28000 | 28000 | 5.0e-6 |
| Humidity | 26779 | 40831 | 8.0e-1 |
| Automobile traffic | 16000 | 16000 | 8.5e+4 |
| Web | 32000 | 32000 | 4.0e+4 |
| Sunspots | 18000 | 18000 | 3.0e+2 |
| Temperature | 28000 | 24000 | 2.0e-1 |
7.2.1 RandomSines
We used a synthetic data set, RandomSines, which consists of discontinuous sine waves with white noise (see Fig. 9a). This data set includes differentlength intervals between the sine waves, which were generated using a random walk function. We varied the period of each sine wave and the intervals between these sine waves in the sequence.
As shown in the right figure of Fig. 9a, CrossMatch perfectly identifies all the sine waves and their timevarying periodicities. In this figure, the difference in the period of each sine wave appears as a difference in the slope.
7.2.2 Spikes
7.2.3 Humidity
Figure 1 shows the detected subsequence pairs for the humidity data set. CrossMatch captures common patterns except for the dissimilar sections. Our method is designed to find similar subsequence pairs. However, when it is applied to sequences that are roughly similar, it can also be used to discover the dissimilar sections.
7.2.4 Automobile traffic
Figure 10a shows timeseries data of automobile traffic, which has a daily period. Each day contains other distinct patterns for the morning and afternoon rush hours. Hourly traffic is bursty, and we can regard it as white noise.
CrossMatch is successful in accurately detecting the daily period without being deceived by the highfrequency hourly traffic. Consecutive lines and their regular intervals indicate periodicity. Moreover, the intervals between the consecutive lines correspond to the daily period, and we can confirm that the characteristics of the data are revealed by the crosssimilarity thus detected.
7.2.5 Web
Figure 10b shows access counts for mail and blog sites obtained every 10 seconds. We observe the daily periodicity of sequences, which increases from morning to night and reaches a peak.
The right figure in Fig. 10b confirms that CrossMatch identified the periodicity. The figure shows winding lines, unlike Automobile. This indicates that CrossMatch aligned the elements of sequences that were stretched along the time axis. Crosssimilarity is detected by the timescaling property of CrossMatch.
7.2.6 Sunspots
Figure 10c shows the sunspots data set, which is recorded on a daily basis. This is a wellknown data set whose timevarying periodicity is related to sun activity. The average number of visible sunspots increases when the sun is active and decreases when the sun is inactive. This change occurs with a regular period of about 11 years.
CrossMatch distinguishes the increase and decrease in the average number and captures similar periods.
7.2.7 Temperature
Despite the missing measurement values and the difference in the period, CrossMatch successfully detected the pattern.
7.3 Effect of sampling approach
In this section, we show the results we obtained with the sampling and adaptive sampling approaches. We used four real data sets for the experiment. To determine the sampling period for each data set in the sampling approach, we computed a power spectrum from the normalized sequences. Real data sets often include high frequencies with very low energy, so the Nyquist frequencies for such data sets could be extremely high. Since such a frequency limit is widely used in various fields (e.g., audio processing and network analysis), we disregard the highfrequency components whose power is very small when determining the Nyquist frequency. We set the power value threshold at \(5.0\times 10^{4}\) in this experiment. The main energy of Traffic #1 is distributed in the \(0 \! \le \! f \! \le \!1478\) frequency range and the Nyquist frequency is \(f_{Nq} = 2956/n\). Similarly, the main energy of Traffic #2 is distributed in the \(0 \! \le \! f \! \le \! 1438\) frequency range and the Nyquist frequency is \(f_{Nq} = 2876/m\). Thus, the sampling periods are \(T_x = 5\) and \(T_y = 6\).^{8} With the adaptive sampling approach, the sampling rate varies depending on the frequency range.
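The relation between the retained frequency range and the sampling period can be checked numerically. The sketch below assumes that the period is \(T = 1/f_{Nq}\) with \(f_{Nq} = 2f_{high}/n\), rounded to the nearest integer; the rounding rule is an assumption inferred from the reported values, and `sampling_period` is a hypothetical helper. The sequence lengths are taken from the table of data sets in Sect. 7.2.

```python
def sampling_period(length, f_high):
    """Sampling period in timeticks for a sequence of the given length.

    Uses T = length / (2 * f_high), rounded to the nearest integer
    (assumed rounding rule).
    """
    return int(length / (2 * f_high) + 0.5)


# (sequence, length, highest retained frequency index, reported period)
reported = [
    ("Traffic #1", 16000, 1478, 5),
    ("Traffic #2", 16000, 1438, 6),
    ("Mail", 32000, 63, 254),
    ("Blog", 32000, 80, 200),
    ("Sunspots #1", 18000, 1431, 6),
    ("Sunspots #2", 18000, 1497, 6),
    ("Temperature #1", 28000, 201, 70),
    ("Temperature #2", 24000, 137, 88),
]

computed = [sampling_period(n, f) for _, n, f, _ in reported]
```

Under this assumption, every computed period agrees with the value reported for the corresponding sequence.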
Diarization error rate in two sampling approaches

| Data set | Sampling approach \(DER_x\) | Sampling approach \(DER_y\) | Adaptive sampling \(DER_x\) | Adaptive sampling \(DER_y\) |
|---|---|---|---|---|
| Automobile traffic | 0.066 | 0.094 | 0.074 | 0.095 |
| Web | 0.039 | 0.042 | 0.039 | 0.036 |
| Sunspots | 0.105 | 0.121 | 0.093 | 0.106 |
| Temperature | 0.068 | 0.084 | 0.002 | 0.015 |
7.4 Performance
We compared two CrossMatch approaches (i.e., the original and sampling approaches^{9}) with the naive solution and an existing method, SPRING, in terms of computation time and memory space. SPRING detects highsimilarity subsequences that are similar to a fixed length query sequence under the DTW distance [58]. SPRING is not intended to be used for finding crosssimilarity, but we can apply this method to evaluate the efficiency and to verify the complexity of CrossMatch. SPRING requires \(O(n+m)\) matrices; thus, an algorithm with SPRING for finding crosssimilarity requires \(O(nw+mw)\) time (per update) and space. We used Temperature for this experiment.
7.5 Extension to highdimensional data streams
\(X\) and \(Y\) are multidimensional time sequences, and our goal is to find matching subsequences between \(X\) and \(Y\). Intuitively, if \(X\) and \(Y\) include the same motions, we want to find these motions.
We used sequences obtained from the CMU motion capture database.^{10} We selected the data for limbs from the original data and used them as 8dimensional data. Each motion is listed in the tables in Fig. 14. The data set has two motions in common (i.e., walking and jumping upward), and the length of each motion is different. We set \(l_{\min }\) at 240, which corresponds to about two seconds, \(\varepsilon \) at 10, and \(w\) at 50 % of the sequence length.
The result shown in Fig. 14 reveals that CrossMatch can accurately capture the two motions. We can confirm that the walking motion yields high crosssimilarity. There are many shifted sequences, because walking is a repetitive behavior in which the limbs move back and forth. CrossMatch works for highdimensional data sets and detects the repetitive motion as crosssimilarity.
7.6 Detecting groupsimilarity
We performed an experiment to discover groupsimilarity. We used the Sensor data set for this experiment (see Fig. 15). Sensor consists of three streams that represent temperature readings from sensors within several buildings. Each sensor provides a reading every 4 min. Overall, the data set fluctuates greatly at different timeticks but the three sensors exhibit a similar fluctuation pattern.
8 Conclusions

In contrast to the naive solution, CrossMatch greatly improves performance and can be processed at a high speed.

CrossMatch requires constant space (per update) to detect crosssimilarity or groupsimilarity, and it consumes only a small quantity of resources.

Despite the highspeed processing, CrossMatch guarantees correct results.

CrossMatch works efficiently for highdimensional data streams.
For \(L(l_x, l_y)\!=\!\max (l_x, l_y)\), each weight is set as follows. \(b_d\!=\!b_h\!=\!1\) and \(b_v\!=\!0\) if \(l_x\!>\!l_y\). \(b_d\!=\!b_v\!=\!1\) and \(b_h\!=\!0\) if \(l_x\!<\!l_y\). \(b_d\!=\!1\) and \(b_v\!=\!b_h\!=\!0\) if \(l_x\!=\!l_y\). Formally, each weight is defined as follows: \(b_v \!=\! L(l_x, l_y) - L(l_x, l_y\!-\!1)\). \(b_h \!=\! L(l_x, l_y) - L(l_x\!-\!1, l_y)\). \(b_d \!=\! L(l_x, l_y) - L(l_x\!-\!1, l_y\!-\!1)\).
The other settings for DTW are as follows. \(d(0,0,0) = 0\). \(d(i,0,0) = d(0,j,0) = d(0,0,k) = \infty \). \(d(i, j, 0) = d(0,j,k) = d(i,0,k) = \infty \). \(i = 1,\ldots , n_1, j = 1, \ldots , n_2, k = 1, \ldots , n_3\). \((n_1\!-\!w \le i \le n_1) \wedge (n_2\!-\!w \le j \le n_2) \wedge (n_3\!-\!w \le k \le n_3)\).
Here, we focus on a thirdorder tensor for time warping, namely the time warping tensor for three sequences. However, for simplicity, we shall use the term “time warping matrix” in this paper.
The other settings are as follows. \(d_{i,j,k}(0,0,0) = 0\), \(d_{i,j,k}(p,0,0) = d_{i,j,k}(0,q,0) = d_{i,j,k}(0,0,r) = \infty \). \(d_{i,j,k}(p,q,0) = d_{i,j,k}(0,q,r) = d_{i,j,k}(p,0,r) = \infty \). \(i = 1, \ldots , n_1\), \(p = 1, \ldots , n_1\!-\!i\!+\!1\), \(j = 1, \ldots , n_2\), \(q = 1, \ldots , n_2\!-\!j\!+\!1\), \(k= 1, \ldots , n_3\), \(r = 1, \ldots , n_3\!-\!k\!+\!1\). \((n_1\!-\!w \le i\!+\!p \le n_1) \wedge (n_2\!-\!w \le j\!+\!q \le n_2) \wedge (n_3\!-\!w \le k\!+\!r \le n_3)\).
The other settings for the scoring function are as follows. \(v(0,0,0) = v(i,0,0) = v(0,j,0) = v(0,0,k) = 0\). \(v(i, j, 0) = v(0, j, k) = v(i, 0, k) = 0\). \((n_1\!-\!w \le i \le n_1) \wedge (n_2\!-\!w \le j \le n_2) \wedge (n_3\!-\!w \le k \le n_3)\).
The previous algorithm does not introduce the warping scope \(w\). To ensure the validity of the experiment, we modified the algorithm and introduced the warping scope.
The frequency components \(f\) and the sampling periods \(T\) of the other data sets are as follows. Mail: \(0\! \le \! f \! \le \! 63\), \(T_x\!=\!254\). Blog: \(0 \! \le \! f \! \le \! 80\), \(T_y\!=\!200\). Sunspots #1: \(0 \! \le \! f \! \le \! 1431\), \(T_x\!=\!6\). #2: \(0 \! \le \! f \! \le \! 1497\), \(T_y\!=\!6\). Temperature #1: \(0 \! \le \! f \! \le \! 201\), \(T_x\!=\!70\). #2: \(0 \! \le \! f \! \le \! 137\), \(T_y\!=\!88\).
We show only the result for the sampling approach since the average sampling periods were almost the same between two approaches.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Appendix 1: Proof of Lemma 1
Appendix 2: Proof of Lemma 2
 1.
Any reported subsequence pairs must satisfy the property of crosssimilarity (i.e., Definition 1).
 2.
Each reported subsequence pair must be the optimal pair among the set of overlapping subsequence pairs.
 3.
If a subsequence pair that satisfies the property of crosssimilarity is not reported, another overlapping subsequence pair is reported where \(D(X[i_s\!:\!i_e],Y[j_s\!:\!j_e])\!-\!\varepsilon (L(l_x, l_y)\!-\!l_{\min })\) is the minimum value.
Property 2: From Lemma 1, the subsequence pair that minimizes \(D(X[i_s\!:\!i_e],Y[j_s\!:\!j_e])\!-\!\varepsilon (L(l_x,l_y)\!-\!l_{\min })\) is equivalent to the pair that maximizes \(V(X[i_s\!:\!i_e],Y[j_s\!:\!j_e])\!-\!\varepsilon l_{\min }\). We assume that two overlapping subsequence pairs that have different starting positions \((i_s, j_s)\) and \((i_s^{\prime }, j_s^{\prime })\) share cell \((i,j)\). \(D(X[i_s\!:\!i],Y[j_s\!:\!j])\) is the minimum distance in the alignment from \((i_s,j_s)\) to \((i,j)\) of the time warping matrix starting at \((i_s,j_s)\). Similarly, \(D(X[i_s^{\prime }\!:\!i],Y[j_s^{\prime }\!:\!j])\) is the minimum distance in the time warping matrix starting at \((i_s^{\prime },j_s^{\prime })\). The two pairs share a common warping path in the subsequent alignment from \((i,j)\) to \((i_e, j_e)\) because DTW computes the cumulative minimum distance. Thus, the subsequence pair that minimizes \(D(X[i_s\!:\!i],Y[j_s\!:\!j])\!-\!\varepsilon (L(i\!-\!i_s\!+\!1,j\!-\!j_s\!+\!1)\!-\!l_{\min })\) is equivalent to the pair that maximizes \(V(X[i_s\!:\!i],Y[j_s\!:\!j])\!-\!\varepsilon l_{\min }\). CrossMatch selects the pair with the maximum score in each cell. Therefore, the matrices that CrossMatch prunes, that is, the time warping matrices that are never reflected in the score and position matrices during the computation, do not include the optimal pair. As a result, CrossMatch always reports the optimal pair from the overlapping pairs. \(\square \)
Property 3: From property 2, the overlapping subsequence pairs share the same starting position through the operation of CrossMatch. When the subsequence pair satisfying Eq. (15) is detected, CrossMatch checks the pair with the same starting position in the candidate array. If the score of the detected pair is greater than that of the pair in the candidate array, CrossMatch updates the candidate pair by using the pair with the maximum score. This process is performed for every pair with a different starting position. Thus, if a subsequence pair that satisfies the property of crosssimilarity is not reported, there is another better candidate pair. \(\square \)
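The equivalence between minimizing the distance-based quantity and maximizing the score-based quantity, which Property 2 relies on, can be written out in one step. The derivation below assumes the relation \(V = \varepsilon L - D\) between the score and the DTW distance; this form is inferred from how Lemma 1 is used in the proof and is stated here as an assumption rather than quoted from the lemma.

```latex
\begin{align*}
V - \varepsilon l_{\min}
  &= \varepsilon L - D - \varepsilon l_{\min} \\
  &= -\bigl( D - \varepsilon (L - l_{\min}) \bigr).
\end{align*}
```

Under this assumption, the pair that maximizes \(V - \varepsilon l_{\min }\) is exactly the pair that minimizes \(D - \varepsilon (L - l_{\min })\), and \(V \ge \varepsilon l_{\min }\) holds if and only if \(D \le \varepsilon (L - l_{\min })\).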
Appendix 3: Proof of Lemma 3
The naive solution has to maintain \(O(nw + mw)\) time warping matrices. It updates the \(O(w)\) values between \(x_i\) and the corresponding elements of \(Y\) (i.e., the elements from \(y_{i-w}\) to \(y_i\)) in \(O(nw)\) matrices if we receive \(x_i\) at timetick \(i\). Similarly, it updates the \(O(w)\) values in \(O(mw)\) matrices if we receive \(y_j\) at timetick \(j\). Therefore, it requires \(O(nw^2 + mw^2)\) time per timetick. Since the naive solution maintains two arrays of \(w\) numbers for each matrix, it requires, in total, \(O(nw^2 + mw^2)\) space. \(\square \)
Appendix 4: Proof of Lemma 4
CrossMatch maintains two matrices (i.e., score and position matrices). It updates the \(O(w)\) values if we receive \(x_i\) or \(y_j\). Each matrix maintains two arrays of \(w\) numbers. Thus, it requires \(O(w)\) time (per update) and space.
Appendix 5: Proof of Lemma 5
The sampled sequences compress the original sequences to the size of \(1/T\). CrossMatch updates the \(O(w/T)\) values, which requires \(O(w/T)\) time (per update) and space.
Appendix 6: Proof of Lemma 6
Given three sequences \(X\), \(Y\), and \(Z\) whose lengths are \(n_1\), \(n_2\), and \(n_3\), the naive solution has to maintain \(O(n_1w\!+\!n_2w\!+\!n_3w)\) time warping matrices and updates the \(O(w^2)\) values for each matrix. Therefore, it requires \(O(n_1w^3\!+\!n_2w^3\!+\!n_3w^3)\) time per update. Since the naive solution maintains two planes of \(w^2\) numbers for each matrix, it requires \(O(n_1w^3\!+\!n_2w^3\!+\!n_3w^3)\) space.
Appendix 7: Proof of Lemma 7
Appendix 8: Proof of Lemma 8
CrossMatch maintains \(2w^2\) arrays (i.e., previous and current planes, which have \(w*w\) arrays per plane) for sequences \(X\), \(Y\), and \(Z\) in the score and position matrices. It updates \(O(w^2)\) numbers to identify the optimal subsequences if we receive \(x_i\) at timetick \(i\), \(y_j\) at timetick \(j\), or \(z_k\) at timetick \(k\). Therefore, it requires \(O(w^2)\) time (per update) and \(O(w^2)\) space.