# Trajectories, Discovering Similar

**DOI:** https://doi.org/10.1007/978-3-319-17885-1_1401

## Synonyms

## Definition

Trajectories of moving objects are typically collected by inaccurate sensing devices and therefore contain *outliers*, or in other words incorrect data measurements (unlike, for example, stock data, which contain no errors whatsoever). An example of two trajectories is shown in Fig. 1.

Many data mining tasks, such as clustering and classification, require a *distance function* that estimates the *similarity* or *dissimilarity* between any two objects in the database. Furthermore, in order to provide efficient solutions to many data mining tasks, a method that can quickly retrieve the data objects most similar to a given query object (or a set of objects) is required. Therefore, to perform data mining on trajectories of moving objects, the following problem must be addressed: given a database \(\mathcal{D}\) of trajectories and a query \(\mathcal{Q}\) (not already in the database), the system has to find the trajectory \(\mathcal{T}\) that is closest to \(\mathcal{Q}\). In order to solve this problem, two important sub-problems must be addressed: (i) defining a realistic and appropriate distance function and (ii) designing an indexing scheme for answering nearest neighbor queries efficiently.

## Historical Background

Trajectories are modeled as multi-dimensional time series. Most of the related work on time-series data analysis has concentrated on the use of some *L*_{ p } norm metric. The *L*_{ p } norm distance between two n-dimensional vectors \(\bar{x}\) and \(\bar{y}\) is defined as \(L_{p}(\bar{x},\bar{y}) = (\sum _{i=1}^{n}(\vert x_{i} - y_{i}\vert )^{p})^{1/p}\). For *p* = 2 it is the well-known Euclidean distance and for *p* = 1 the Manhattan distance. The advantage of this simple metric is that it allows efficient indexing with a dimensionality reduction technique (Agrawal et al. 1993; Faloutsos et al. 1994; Keogh et al. 2001). On the other hand, the model cannot deal well with outliers and is very sensitive to small distortions in the time axis (Das et al. 1997). There are a number of interesting extensions to the above model to support various transformations such as scaling (Rafiei and Mendelzon 2000), shifting (Goldin and Kanellakis 1995), normalization (Goldin and Kanellakis 1995) and moving average (Rafiei and Mendelzon 2000).
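The *L*_{ p } norm formula above can be computed directly; a minimal sketch in Python (the function name is illustrative):

```python
def lp_norm(x, y, p=2):
    """L_p distance between two equal-length numeric vectors x and y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# p = 2 gives the Euclidean distance, p = 1 the Manhattan distance:
# lp_norm([0, 0], [3, 4], p=2)  ->  5.0
```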

Other techniques to define time series similarity are based on extracting certain features (landmarks (Perng et al. 2000) or signatures (Faloutsos et al. 1997)) from each time-series and then using these features to define the similarity. Another approach is to represent a time series using the direction of the sequence at regular time intervals (Qu et al. 1998).

Although the vast majority of database/data mining research on time series has focused on the Euclidean distance, virtually all real-world systems that use time-series matching as a subroutine utilize a similarity measure that allows warping. In retrospect, this is not surprising, since most real-world processes, particularly biological processes, can evolve at varying rates. For example, in bioinformatics it is well understood that functionally related genes express themselves in similar ways, but possibly at different rates. Therefore, the Dynamic Time Warping (DTW) distance has been used for many datasets of this type. The method to compute the DTW between two sequences is based on dynamic programming (Berndt and Clifford 1994) and is more expensive than computing *L*_{ p } norms. Approaches that mitigate the large computational cost of DTW have appeared in Keogh (2002) and Zhu and Shasha (2003), where lower bounding functions are used to speed up its execution. Furthermore, an approach that combines the benefits of warping distances and *L*_{ p } norms has been proposed in Chen and Ng (2004).
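The basic DTW computation can be sketched with the standard dynamic-programming table (a minimal illustration using squared point differences, one common convention; it does not include the lower-bounding speedups of Keogh (2002) or Zhu and Shasha (2003)):

```python
def dtw(a, b):
    """Dynamic Time Warping distance between 1-D sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of: diagonal match, step in a, step in b
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]
```

Because one point may align with several points of the other sequence, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0.0 even though the sequences have different lengths.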

The flexibility provided by DTW is very important; however, its accuracy deteriorates for noisy data, since by matching all the points it also matches the outliers, distorting the true distance between the sequences. An alternative approach is the *Longest Common Subsequence* (*LCSS*), a variation of the edit distance (Levenshtein 1966). The basic idea is to match two sequences by allowing them to stretch, without rearranging the order of the elements, but allowing some elements to remain *unmatched*. Using the *LCSS* of two sequences, one can define the distance using the length of this subsequence (Bollobás et al. 1997; Das et al. 1997).

## Scientific Fundamentals

First, some definitions are provided and then the similarity functions based on the appropriate models are presented. It is assumed that objects are points that move on the (*x*, *y*)-plane and time is discrete.

Let *A* and *B* be two trajectories of moving objects with size *n* and *m* respectively, where *A* = ((*a*_{x, 1}, *a*_{y, 1}), *…*, (*a*_{x, n}, *a*_{y, n})) and *B* = ((*b*_{x, 1}, *b*_{y, 1}), *…*, (*b*_{x, m}, *b*_{y, m})). For a trajectory *A*, let *Head(A)* be the sequence *Head*(*A*) = ((*a*_{x, 1}, *a*_{y, 1}), *…*, (*a*_{x, n−1}, *a*_{y, n−1})).

Given an integer *δ* and a real number 0 < *ε* < 1, the *LCSS*_{δ, ε}(*A*, *B*) is defined as follows:

$$LCSS_{\delta,\epsilon}(A,B) = \begin{cases} 0 & \text{if } A \text{ or } B \text{ is empty,} \\ 1 + LCSS_{\delta,\epsilon}(Head(A),Head(B)) & \text{if } \vert a_{x,n} - b_{x,m}\vert <\epsilon,\ \vert a_{y,n} - b_{y,m}\vert <\epsilon \text{ and } \vert n - m\vert \leq \delta, \\ \max \left(LCSS_{\delta,\epsilon}(Head(A),B),\ LCSS_{\delta,\epsilon}(A,Head(B))\right) & \text{otherwise.} \end{cases}$$

The constant *δ* controls how far in time one can go in order to match a given point from one trajectory to a point in the other trajectory. The constant *ɛ* is the matching threshold (see Fig. 2).

The first similarity function is based on the *LCSS*, and the idea is to allow time stretching. Then, objects that are close in space at different time instants can be matched if the time instants themselves are not too far apart.

The similarity function *S*1 between two trajectories *A* and *B*, given *δ* and *ɛ*, is defined as follows:

$$S1(\delta,\epsilon,A,B) = \frac{LCSS_{\delta,\epsilon }(A,B)}{\min (n,m)}$$

The normalization by min(*n*, *m*) allows *S*1 to compare the *LCSS* value between sequences of different lengths.

The *S*1 function is used to define another similarity measure that is more suitable for trajectories. Consider the set of all translations. A translation simply shifts a trajectory in space by a different constant in each dimension. Let \(\mathcal{F}\) be the family of translations. Then a function *f*_{c, d} belongs to \(\mathcal{F}\) if *f*_{c, d}(*A*) = ((*a*_{x, 1} + *c*, *a*_{y, 1} + *d*), *…*, (*a*_{x, n} + *c*, *a*_{y, n} + *d*)). Using this family of translations, the following distance function is defined.

Given *δ*, *ɛ* and the family \(\mathcal{F}\) of translations, the similarity function *S*2 between two trajectories *A* and *B* is defined as follows:

$$S2(\delta,\epsilon,A,B) =\max _{f_{c,d}\in \mathcal{F}}S1(\delta,\epsilon,A,f_{c,d}(B))$$

Both *S*1 and *S*2 range from 0 to 1. Therefore, the distance function between two trajectories can be estimated as follows:

Given *δ*, *ɛ* and two trajectories *A* and *B*:

$$D1(\delta,\epsilon,A,B) = 1 - S1(\delta,\epsilon,A,B)\qquad \text{and}\qquad D2(\delta,\epsilon,A,B) = 1 - S2(\delta,\epsilon,A,B)$$

Both *D*1 and *D*2 are *symmetric*.

This is because *LCSS*_{δ, ε}(*A*, *B*) is equal to *LCSS*_{δ, ε}(*B*, *A*), and the transformation used in *D*2 is a translation, which preserves the symmetric property.

By allowing translations, similarities between movements that are parallel in space can be detected. In addition, the *LCSS* model allows stretching and displacement in time, so it can detect similarities in movements that happen with different speeds, or at different times.

Given the definitions above, efficient methods to compute the distance functions are presented next.

### Computing the Similarity Function S1

To compute the similarity functions *S*1, *S*2 an *LCSS* computation is needed. The *LCSS* can be computed by a dynamic programming algorithm in *O*(*n*^{2}) time. However, if matchings are allowed only when the difference in the indices is at most *δ*, a faster algorithm is possible. The following result has been shown in Berndt and Clifford (1994) and Das et al. (1997): Given two trajectories *A* and *B*, with | *A* | = *n* and | *B* | = *m*, the *LCSS*_{δ, ε}(*A*, *B*) can be found in *O*(*δ*(*n* + *m*)) time.
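A minimal Python sketch of this dynamic program for 2-D trajectories follows; for simplicity it fills the full *O*(*nm*) table with the *δ* constraint folded into the match test, rather than the *O*(*δ*(*n* + *m*)) banded version (function names are illustrative):

```python
def lcss(A, B, delta, eps):
    """LCSS_{delta,eps} of 2-D trajectories A and B (lists of (x, y) points):
    two points may match if both coordinates differ by less than eps and
    their indices differ by at most delta."""
    n, m = len(A), len(B)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            (ax, ay), (bx, by) = A[i - 1], B[j - 1]
            if abs(i - j) <= delta and abs(ax - bx) < eps and abs(ay - by) < eps:
                L[i][j] = L[i - 1][j - 1] + 1       # match the two endpoints
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])  # skip one endpoint
    return L[n][m]

def s1(A, B, delta, eps):
    """Similarity S1 = LCSS / min(n, m); the distance D1 is 1 - S1."""
    return lcss(A, B, delta, eps) / min(len(A), len(B))
```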

If *δ* is small, the dynamic programming algorithm is very efficient. However, for some applications, *δ* may need to be large. For that case, the above computation can be improved using random sampling.

By taking a sufficiently small number of random samples from the original data, it can be shown that, with high probability, the random sample preserves the properties (shape, structure, average value, etc.) of the original population. The random sampling method gives an approximate result, but with a probabilistic guarantee on the error. In particular, it can be shown that, given two trajectories *A* and *B* with length *n*, two constants *δ* and *ɛ*, and a random sample *RA* of *A* with | *RA* | = *s*, an approximation of the *LCSS*(*δ*, *ε*, *A*, *B*) can be computed such that the approximation error is less than *β* with probability at least 1 −*ρ*, in *O*(*ns*) time, where *s* = *f*(*ρ*, *β*). To give a practical perspective of the random sampling approach: to be within 0.1 of the true similarity of two trajectories *A* and *B* with probability 90%, when the similarity between them is around 0.8, *A* should be sampled at 250 locations. Notice that this number is independent of the lengths of both *A* and *B*. To accurately capture the similarity between less similar trajectories (e.g., with 0.4 similarity), more sample points must be used (e.g., 500 points).
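The sampling idea can be illustrated with a simple estimator: compute the LCSS of a random subset of *A*'s points against *B* and scale the result back up. This estimator and its handling of *δ* on the compressed indices are illustrative assumptions, not the paper's algorithm, and the exact form of *s* = *f*(*ρ*, *β*) is omitted:

```python
import random

def lcss(A, B, delta, eps):
    """delta,eps-constrained LCSS dynamic program for 2-D trajectories."""
    n, m = len(A), len(B)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (abs(i - j) <= delta
                    and abs(A[i - 1][0] - B[j - 1][0]) < eps
                    and abs(A[i - 1][1] - B[j - 1][1]) < eps):
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

def approx_similarity(A, B, delta, eps, s, seed=0):
    """Estimate LCSS(A, B) / min(|A|, |B|) from a random sample RA of s
    points of A, scaling the sampled LCSS up by n/s (illustrative only)."""
    n = len(A)
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(n), min(s, n)))   # keep temporal order
    RA = [A[i] for i in idx]
    # Note: delta is applied on the compressed sample indices here, which
    # is itself an approximation of the original time constraint.
    est = lcss(RA, B, delta, eps) * (n / len(RA))
    return min(1.0, est / min(n, len(B)))
```

The cost is *O*(*ns*) instead of *O*(*nm*), at the price of a probabilistic error bound rather than an exact answer.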

### Computing the Similarity Function S2

Consider now the more complex similarity function *S*2. Here, given two sequences *A*, *B* and constants *δ*, *ε*, the translation *f*_{c, d} that maximizes the length of the longest common subsequence of *A* and *f*_{c, d}(*B*), i.e., *LCSS*_{δ, ε}(*A*, *f*_{c, d}(*B*)), over all possible translations must be found.

Let the lengths of trajectories *A* and *B* be *n* and *m*, respectively. Assume also that \(f_{c_{1},d_{1}}\) is the translation that, when applied to *B*, gives a longest common subsequence \(LCSS_{\delta,\epsilon }(A,f_{c_{1},d_{1}}(B)) = a\), i.e., the translation that maximizes the length of the longest common subsequence: \(LCSS_{\delta,\epsilon }(A,f_{c_{1},d_{1}}(B)) =\max _{c,d\in \mathcal{R}}LCSS_{\delta,\epsilon }(A,f_{c,d}(B))\).

The key observation is that, although there is an infinite number of translations that can be applied to *B*, each translation *f*_{c, d} results in a longest common subsequence between *A* and *f*_{c, d}(*B*), and there is only a finite set of possible longest common subsequences. Therefore, it is possible to enumerate a finite set of translations that provably includes a translation maximizing the length of the longest common subsequence of *A* and *f*_{c, d}(*B*). Based on this idea, it has been shown in Vlachos et al. (2002) that, given two trajectories *A* and *B*, with | *A* | = *n* and | *B* | = *m*, the *S*2(*δ*, *ε*, *A*, *B*) can be computed in *O*((*n* + *m*)^{3}*δ*^{3}) time.

Furthermore, a more efficient approximate algorithm has been proposed that achieves a running time of *O*((*m* + *n*)*δ*^{3}∕*β*^{2}), given a constant 0 < *β* < 1. Its approximation *AS*2_{δ, β}(*A*, *B*) is related to the exact value by the formula *S*2(*δ*, *ε*, *A*, *B*) − *AS*2_{δ, β}(*A*, *B*) < *β*.
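As a brute-force illustration of searching over translations, one can evaluate *S*1 only at translations that map some point of *B* exactly onto some point of *A*. This is a heuristic, not the enumeration algorithm above: because the matching regions are open boxes of width 2*ε*, it may underestimate the true *S*2, but it recovers pure shifts exactly:

```python
def lcss(A, B, delta, eps):
    """delta,eps-constrained LCSS dynamic program for 2-D trajectories."""
    n, m = len(A), len(B)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (abs(i - j) <= delta
                    and abs(A[i - 1][0] - B[j - 1][0]) < eps
                    and abs(A[i - 1][1] - B[j - 1][1]) < eps):
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

def s2_heuristic(A, B, delta, eps):
    """Heuristic lower bound for S2: try every translation f_{c,d} that
    brings one point of B exactly onto one point of A, and keep the best
    normalized LCSS over those candidates."""
    n, m = len(A), len(B)
    best = 0.0
    for (ax, ay) in A:
        for (bx, by) in B:
            c, d = ax - bx, ay - by
            shifted = [(x + c, y + d) for (x, y) in B]
            best = max(best, lcss(A, shifted, delta, eps) / min(n, m))
    return best
```

For example, if *B* is *A* shifted by a constant vector, one of the candidate translations undoes the shift and the heuristic returns 1.0.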

### Indexing for LCSS-Based Similarity

Even though the approximation algorithm for the *D*2 distance significantly reduces the computational cost over the exact algorithm, it can still be costly when one is interested in similarity search over massive trajectory databases. Thus, a hierarchical clustering algorithm using the distance *D*2 is provided that can be used to answer similarity queries efficiently.

A problem with *D*2 is that it is not a metric, since it does not obey the triangle inequality. This makes the use of traditional indexing techniques difficult. Indeed, it is easy to construct examples with trajectories *A*, *B* and *C*, where *D*2(*δ*, *ε*, *A*, *C*) > *D*2(*δ*, *ε*, *A*, *B*) + *D*2(*δ*, *ε*, *B*, *C*). Such an example is shown in Fig. 3, where *D*2(*δ*, *ε*, *A*, *B*) = *D*2(*δ*, *ε*, *B*, *C*) = 0 (since the similarity is 1), and *D*2(*δ*, *ε*, *A*, *C*) = 1 (because the similarity within *ɛ* in space is zero).

However, a weaker version of the triangle inequality can be shown to hold for any trajectories *A*, *B*, *C*:

$$LCSS_{\delta,\epsilon,\mathcal{F}}(A,C) \geq LCSS_{\delta,\epsilon,\mathcal{F}}(A,B) + LCSS_{\delta,\epsilon,\mathcal{F}}(B,C) -\vert B\vert$$

where | *B* | is the length of sequence *B*.

To create the indexing structure, the set of trajectories is partitioned into groups according to their length, so that the longest trajectory in each group is at most *a* times the shortest (typically *a* = 2 is used). Then, a hierarchical clustering algorithm is applied on each group, and the tree that the algorithm produces is used as follows:

For every node *C* of the tree, the medoid (*M*_{ C }) of the cluster represented by this node is stored. The medoid is the trajectory that has the minimum distance (or maximum *LCSS*) to every other trajectory in the cluster, i.e., the trajectory achieving \(\max _{v_{i}\in C}\min _{v_{j}\in C}LCSS_{\delta,\epsilon,\mathcal{F}}(v_{i},v_{j})\). However, keeping only the medoid is not enough; a method is needed to efficiently prune parts of the tree during the search procedure. Namely, given the tree and a query sequence *Q*, the algorithm should decide whether or not to follow the subtree that is rooted at *C*. From the previous lemma it is known that for any sequence *B* in *C*:

$$LCSS_{\delta,\epsilon,\mathcal{F}}(Q,B) \leq LCSS_{\delta,\epsilon,\mathcal{F}}(Q,M_{C}) +\vert B\vert -LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B)$$

Therefore, along with the medoid, the trajectory *r*_{ C } in the cluster that maximizes the quantity \(\vert B\vert -LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B)\) is stored. Using this trajectory, a lower bound on the distance between the query and any trajectory in the subtree can be estimated.

Next, the search function that uses the index structure discussed above is presented. It is assumed that the tree contains trajectories with minimum length *minl* and maximum length *maxl*. For simplicity, only the algorithm for the 1-Nearest Neighbor query is presented.

The search procedure takes as input a node *N* in the tree, the query *Q*, and the distance to the closest trajectory found so far (Fig. 4). For each of the children *C* of *N*, it is checked whether it is a trajectory or a cluster. If it is a trajectory, its distance to *Q* is compared with that of the current nearest trajectory. If it is a cluster, first the length of the query is checked and the appropriate value for min( | *B* |, | *Q* | ) is chosen. Then, a lower bound *L* on the distance of the query to any trajectory in the cluster is computed and compared with the distance *mindist* of the current nearest neighbor. The cluster is examined only if *L* is smaller than *mindist*. In the scheme above, the approximate algorithm to compute the \(LCSS_{\delta,\epsilon,\mathcal{F}}\) is used. Consequently, the computed value of \((LCSS_{\delta,\epsilon,\mathcal{F}}(M_{C},B))/(\min (\vert B\vert,\vert Q\vert ))\) can be up to *β* times higher than the exact value. Therefore, since the approximate algorithm for *S*2 described above is used, the quantity (*β* ∗ min( | *M*_{ C } |, | *B* | ))∕(min( | *B* |, | *Q* | )) should be subtracted from the bound for *D*2(*δ*, *ε*, *B*, *Q*) to get correct results.
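The traversal just described can be sketched as a branch-and-bound search. How each cluster's lower bound is derived from its stored *M*_{ C } and *r*_{ C } is abstracted behind a callable here; the node layout and names are illustrative assumptions, not the paper's data structure:

```python
def nearest_trajectory(node, Q, dist, mindist=float("inf"), best=None):
    """Branch-and-bound 1-Nearest-Neighbor search over a cluster tree.
    Leaves are {'trajectory': T}; internal nodes are {'children': [...],
    'lower_bound': f}, where f(Q) lower-bounds dist(Q, T) for every
    trajectory T under that node.  `dist` is the trajectory distance
    (e.g., D2).  Returns (mindist, best)."""
    for child in node['children']:
        if 'trajectory' in child:
            # a single trajectory: compare it directly with the query
            d = dist(Q, child['trajectory'])
            if d < mindist:
                mindist, best = d, child['trajectory']
        else:
            # a cluster: descend only if its lower bound can beat mindist
            if child['lower_bound'](Q) < mindist:
                mindist, best = nearest_trajectory(child, Q, dist,
                                                  mindist, best)
    return mindist, best
```

A cluster whose lower bound already exceeds the current *mindist* is skipped entirely, which is where the stored medoid information pays off.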

## Key Applications

### Sciences

Trajectory data with the characteristics discussed above (multi-dimensional and noisy) arise in many scientific datasets. In environmental, earth-science and biological data analysis, scientists may be interested in identifying similar patterns (e.g., weather patterns), clustering related objects or subjects based on their trajectories, and retrieving subjects with similar movements (e.g., in animal migration studies). Similar problems occur in medical applications, for example, when multiple-attribute response curves in drug therapy are analyzed.

### Transportation and Monitoring Applications

In many monitoring applications, detecting movements of objects or subjects that exhibit similarity in space and time can be useful. These movements may have been reconstructed from a set of sensors, including cameras and movement sensors, and are therefore inherently noisy. Another set of applications arises from cell phone and mobile communication applications, where mobile users are tracked over time, and patterns and clusters of these users can be used to improve the quality of the network (e.g., by allocating appropriate bandwidth over time and space).

## Future Directions

So far, it has been assumed that objects are points that move in a multi-dimensional space, ignoring their shapes in space. However, there are many applications where the extent of each object is also important. Therefore, a future direction is to design similarity models for moving objects with extents, where both the locations and the extents of the objects change over time.

Another direction is to design a more general indexing scheme for distance functions that are similar to the *LCSS* and can work for multiple distance functions and datasets.

## Cross-References

## References

- Agrawal R, Faloutsos C, Swami A (1993) Efficient similarity search in sequence databases. In: Proceedings of FODO, pp 69–84
- Berndt D, Clifford J (1994) Using dynamic time warping to find patterns in time series. In: AAAI workshop on knowledge discovery in databases, pp 229–248
- Bollobás B, Das G, Gunopulos D, Mannila H (1997) Time-series similarity problems and well-separated geometric sets. In: Proceedings of SCG
- Chen L, Ng RT (2004) On the marriage of lp-norms and edit distance. In: Proceedings of VLDB, pp 792–803
- Das G, Gunopulos D, Mannila H (1997) Finding similar time series. In: Proceedings of PKDD, pp 88–100
- Faloutsos C, Jagadish HV, Mendelzon A, Milo T (1997) Signature technique for similarity-based queries. In: Proceedings of SEQUENCES
- Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time series databases. In: Proceedings of ACM SIGMOD, May 1994
- Goldin D, Kanellakis P (1995) On similarity queries for time-series data. In: Proceedings of CP '95, Sept 1995
- Keogh E (2002) Exact indexing of dynamic time warping. In: Proceedings of VLDB
- Keogh E, Chakrabarti K, Mehrotra S, Pazzani M (2001) Locally adaptive dimensionality reduction for indexing large time series databases. In: Proceedings of ACM SIGMOD, pp 151–162
- Levenshtein V (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710
- Perng S, Wang H, Zhang S, Parker DS (2000) Landmarks: a new model for similarity-based pattern querying in time series databases. In: Proceedings of IEEE ICDE, pp 33–42
- Qu Y, Wang C, Wang XS (1998) Supporting fast search in time series for movement patterns in multiple scales. In: Proceedings of ACM CIKM, pp 251–258
- Rafiei D, Mendelzon A (2000) Querying time series data based on similarity. IEEE Trans Knowl Data Eng 12(5):675–693
- Vlachos M, Kollios G, Gunopulos D (2002) Discovering similar multidimensional trajectories. In: Proceedings of IEEE ICDE, pp 673–684
- Zhu Y, Shasha D (2003) Query by humming: a time series database approach. In: Proceedings of ACM SIGMOD