Using the minimum description length to discover the intrinsic cardinality and dimensionality of time series
Abstract
Many algorithms for data mining or indexing time series data do not operate directly on the raw data, but instead use alternative representations that include transforms, quantization, approximation, and multiresolution abstractions. Choosing the best representation and abstraction level for a given task/dataset is arguably the most critical step in time series data mining. In this work, we investigate the problem of discovering the natural intrinsic representation model, dimensionality and alphabet cardinality of a time series. The ability to automatically discover these intrinsic features has implications beyond selecting the best parameters for particular algorithms, as characterizing data in such a manner is useful in its own right and an important subroutine in algorithms for classification, clustering and outlier discovery. We will frame the discovery of these intrinsic features in the Minimal Description Length framework. Extensive empirical tests show that our method is simpler, more general and more accurate than previous methods, and has the important advantage of being essentially parameter-free.
Keywords
Time Series · MDL · Dimensionality reduction

1 Introduction
Most algorithms for indexing or mining time series data operate on higher-level representations of the data, which include transforms, quantization, approximations and multiresolution approaches. For instance, Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), Adaptive Piecewise Constant Approximation (APCA) and Piecewise Linear Approximation (PLA) are models that all have their advocates for various data mining tasks and each has been used extensively (Ding et al. 2008). However, the question of choosing the best abstraction level and/or representation of the data for a given task/dataset still remains. In this work, we investigate this problem by discovering the natural intrinsic model, dimensionality and (alphabet) cardinality of a time series. We will frame the discovery of these intrinsic features in the Minimal Description Length (MDL) framework (Grünwald et al. 2005; Kontkanen and Myllymäki 2007; Pednault 1989; Rissanen et al. 1992). MDL is the cornerstone of many bioinformatics algorithms (Evans et al. 2007; Rissanen 1989), and has had some impact in data mining; however, it is arguably underutilized in time series data mining (Jonyer et al. 2004; Papadimitriou et al. 2005).

Before we define more precisely what we mean by actual versus intrinsic cardinality, we should elaborate on the motivations behind our considerations. Our objective is generally not simply to save memory:^{1} if we are wastefully using eight bytes per time point instead of using the mere three bytes required by the intrinsic cardinality, the memory space saved is significant; however, memory is getting cheaper, and is rarely a bottleneck in data mining tasks. Instead, there are many other reasons why we may wish to find the true intrinsic model, cardinality and dimensionality of the data. For example, there is an increasing interest in using specialized hardware for data mining (Sart et al. 2010). However, the complexity of implementing data mining algorithms in hardware typically grows superlinearly with the cardinality of the alphabet. For example, FPGAs usually cannot handle cardinalities greater than 256 (Sart et al. 2010).

Some data mining algorithms benefit from having the data represented in the lowest meaningful cardinality. As a trivial example, consider the time series: …, 0, 0, 1, 0, 0, 1, 0, 0, 1. We can easily find the rule that a ‘1’ follows two appearances of ‘0’. However, notice that this rule is not apparent in this string: …, 0, 0, 1.0001, 0.0001, 0, 1, 0.000001, 0, 1, even though it is essentially the same time series.
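For illustration, the effect can be sketched in a few lines of Python (ours, not the paper's code): quantizing the noisy string back to its intrinsic cardinality of two recovers the clean series, and with it the rule.

```python
# Rounding each value to the nearest integer maps the noisy series back to
# its intrinsic two-symbol alphabet, making the "a '1' follows two '0's"
# rule visible again.
noisy = [0, 0, 1.0001, 0.0001, 0, 1, 0.000001, 0, 1]
clean = [round(x) for x in noisy]
```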

Most time series indexing algorithms critically depend on the ability to reduce the dimensionality (Ding et al. 2008) or the cardinality (Lin et al. 2007) of the time series (or both Assent et al. 2008; Camerra et al. 2010) and search over the compacted representation in main memory. However, setting the best level of representation remains a “black art.”

In resource-limited devices, it may be helpful to remove the spurious precision induced by a cardinality/dimensionality that is too high. We elaborate on this issue with a concrete example below.

Knowing the intrinsic model, cardinality and dimensionality of a dataset allows us to create very simple outlier detection models. We simply look for data where the parameters discovered in new data differ from our expectations learned on training data. This is a simple idea, but it can be very effective as we show in our experimental section.
1.1 A concrete example
For concreteness, we present a simple scenario that shows the utility of understanding the intrinsic cardinality/dimensionality of data. Suppose we wish to build a time series classifier into a device with a limited memory footprint such as a cell phone, pacemaker or “smart-shoe” (Vahdatpour and Sarrafzadeh 2010). Let us suppose we have only 20 kB available for the classifier, and that (as is the case with the benchmark dataset TwoPat; Keogh et al. 2006) each time series exemplar has a dimensionality of 128 and takes 4 bytes per value.
One could choose decision trees or Bayesian classifiers because they are space efficient; however, recent evidence suggests that nearest neighbor classifiers can be difficult to beat for time series problems (Ding et al. 2008). If we had simply stored forty random samples in the memory for our nearest neighbor classifier, the average accuracy over fifty runs would be a respectable 58.7 % for a four-class problem. However, we could also downsample the dimensionality by a factor of two, either by skipping every second point, or by averaging pairs of points (as in SAX, Lin et al. 2007), and place eighty reduced-quality samples in memory. Or perhaps we could instead reduce the alphabet cardinality by reducing the precision of the original four bytes to just one byte, thus allowing 160 reduced-fidelity objects to be placed in memory. Many other combinations of dimensionality and cardinality reduction could be tested, which would trade reduced fidelity to the original data for more exemplars stored in memory. In this case, a dimensionality of 32 and a cardinality of 6 allow us to place 852 objects in memory and achieve an accuracy of about 90.75 %, a remarkable improvement in accuracy given the limited resources. As we shall see, this combination of parameters can be found using our MDL technique.
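The memory arithmetic behind the first three configurations can be sketched as follows (a minimal illustration assuming 20 kB = 20 × 1024 bytes; the function name is ours):

```python
def exemplars_that_fit(budget_bytes, dimensionality, bytes_per_value):
    """How many training exemplars fit in a fixed memory budget."""
    return budget_bytes // (dimensionality * bytes_per_value)

BUDGET = 20 * 1024  # the 20 kB available for the classifier

full_fidelity = exemplars_that_fit(BUDGET, 128, 4)  # raw exemplars
half_dims = exemplars_that_fit(BUDGET, 64, 4)       # dimensionality halved
one_byte = exemplars_that_fit(BUDGET, 128, 1)       # cardinality cut to one byte
```

Each halving of the per-exemplar footprint doubles the number of exemplars that fit, which is exactly the fidelity-versus-count trade-off discussed above.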
In general, testing all of the combinations of parameters is computationally infeasible. Furthermore, while in this case we have class labels to guide us through the search of parameter space, this would not be the case for other unsupervised data mining algorithms, such as clustering, motif discovery (Lin et al. 2002), outlier discovery (Chandola et al. 2009; Vereshchagin and Vitanyi 2010; Yankov et al. 2008), etc.
As we shall show, our MDL framework allows us to automatically discover the parameters that reflect the intrinsic model/cardinality/dimensionality of the data without requiring external information or expensive cross validation search.
2 Definitions and notation
We begin with the definition of a time series:
Definition 1
A time series T is an ordered list of numbers. \(T=t_{1},t_{2},...,t_{m}\). Each value \(t_{i}\) is a finite precision number and m is the length of the time series T.
Before continuing, we must justify the decision of (slightly) quantizing the time series. MDL is only defined for discrete values,^{2} but most time series are real-valued. The cardinality of a set is a measure of the number of its elements. In mathematics, sets of discrete values have at most countable cardinality, while the real numbers have uncountable cardinality. When dealing with values stored in a digital computer, this distinction can be problematic, as even real numbers must be limited to a finite cardinality. Here we simply follow the convention that numbers of very high cardinality can be considered essentially real-valued; thus we need to cast the “effectively infinite” \(2^{64}\) cardinality we typically encounter into a more obviously discrete cardinality to allow MDL to be applied.
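One common way to perform such a cast is uniform quantization onto \(2^{b}\) integer levels; the following sketch is offered under that assumption and is not necessarily the paper's exact discretization function:

```python
def discretize(series, b=8):
    """Uniformly quantize a real-valued series onto 2^b integer levels
    spanning its range. The specific formula here is an illustrative
    assumption, not necessarily the paper's exact definition."""
    cardinality = 2 ** b
    lo, hi = min(series), max(series)
    if hi == lo:  # degenerate constant series: all values map to level 0
        return [0] * len(series)
    scale = (cardinality - 1) / (hi - lo)
    return [round((x - lo) * scale) for x in series]
```

With the default b = 8 this yields the 256-cardinality version of a series used throughout this work.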
The figure illustrates that all of the points fall close to the diagonal, and thus the quantization makes no perceptible difference. Beyond this subjective visual test, we also reproduced the heavily cited UCR time series classification benchmark experiments (Keogh et al. 2006), replacing the original data with the 256-cardinality version. In all cases the difference in classification accuracy was less than one tenth of 1 % (full details are at www.cs.ucr.edu/~bhu002/MDL/MDL.html). Based on these considerations, in this work we reduce all of the time series data to its 256-cardinality version by using a discretization function:
Definition 2
Given a time series \(T\), we are interested in estimating its minimum description length, i.e., the smallest number of bits it takes to represent it.
Definition 3
In the current literature, the number of bits required to store the time series depends on the idiosyncrasies of the data format or hardware device, not on any intrinsic properties of the data or domain. Here we are instead interested in knowing the minimum number of bits to exactly represent the data, i.e., the intrinsic amount of information in the time series. The general problem of determining the smallest program that can reproduce the time series, known as Kolmogorov complexity, is not computable (Li 1997). However, the Kolmogorov complexity can be approximated by using general-purpose data compression methods, like Huffman coding (Grünwald et al. 2005; Vereshchagin and Vitanyi 2010; Zwally and Gloersen 1977). The (lossless) compressed file size is an upper bound to the Kolmogorov complexity of the time series (De Rooij and Vitányi 2012).
Observe that in order to decompress the data HuffmanCoding\((T)\), the Huffman tree (or the symbol frequencies) is needed; thus the description length could be more precisely defined as \(DL(T) = |HuffmanCoding(T)| + |HuffmanTree(T)|\). One could use a simple binary representation to encode the Huffman tree; however, the most efficient way to encode it is to assume that the tree is in canonical form (see, e.g., Chapter 2 of Witten et al. 1999). In a canonical Huffman code all the codes of a given length are assigned their values sequentially. Instead of storing the structure of the Huffman tree explicitly, only the lengths of the codes (i.e., the number of bits) are required. Since the longest possible Huffman code over \(2^{b}\) symbols is at most \(2^{b} - 1\) bits long (when symbol frequencies are Fibonacci numbers), the number of bits necessary to transmit its length is at most \(b\). Thus, the number of bits necessary to represent HuffmanTree\((T)\) in canonical form can be bounded by \(O(b2^{b})\).
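A sketch of this computation follows. Here the tree cost is charged at \(b\) bits per distinct symbol actually present, a simplification of the canonical-form bound above; the function names are ours.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Code length in bits for each distinct symbol under an optimal
    (Huffman) prefix code."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tiebreaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def description_length(series, b=8):
    """DL(T) = |HuffmanCoding(T)| + |HuffmanTree(T)| in bits, charging the
    tree b bits per distinct symbol (a simplified canonical-form cost)."""
    lengths = huffman_code_lengths(series)
    freq = Counter(series)
    coding_bits = sum(freq[s] * lengths[s] for s in freq)
    tree_bits = b * len(freq)  # one code-length entry per distinct symbol
    return coding_bits + tree_bits
```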

The size of the tree is negligible compared to the number of bits required to represent the time series because \(m \gg b\).

The size of \(HuffmanTree(T)\) has very low variance, and thus can be regarded as a “constant” term. This is especially true when comparing similar models, for example, a model with a dimensionality of ten to a model with a dimensionality of nine or eleven. When comparing vastly different models, for example a model with a dimensionality of ten to a model with a dimensionality of one hundred, the differences in the sizes of the relevant Huffman trees are greater, but this difference is dwarfed by the bit savings gained by discovering the true dimensionality.
One of the key steps in finding the intrinsic cardinality and/or dimensionality requires one to convert a given time series to another representation or model, e.g., by using DFT or DWT. We call this representation a hypothesis:
Definition 4
A hypothesis \(H\) is a representation of a discrete time series \(T\) after applying a transformation \(M\).
In general, there are many possible transforms. Examples include DWT, DFT, APCA, and Piecewise Linear Approximation (PLA), among others (Ding et al. 2008). Figure 8 shows three illustrative examples, DFT, APCA, and PLA. In this paper, we demonstrate our ideas using these three most commonly used representations, but our ideas are not restricted to these time series models (see Ding et al. 2008 for a survey of time series representations).
Henceforth, we will use the term model interchangeably with the term hypothesis.
Definition 5
The reduced description length of a time series \(T\) given a hypothesis \(H\) is the sum of two terms, \(DL(T,H) = DL(H) + DL(T\vert H)\). The first term, \(DL(H)\), is called the model cost and represents the number of bits required to store the hypothesis \(H\). For instance, the model cost for the PLA would include the bits needed to encode the mean, slope and length of each linear segment.
The second term, \(DL(T\vert H)\), called the correction cost (in some works it is called the description cost or error term), is the number of bits required to rebuild the entire time series \(T\) from the given hypothesis \(H\).
There are many possible ways to encode \(T\) given \(H\). Perhaps the simplest way is to store the differences (i.e., the difference vector) between \(T\) and \(H\): one can easily reconstruct the time series \(T\) exactly from \(H\) and the difference vector. Thus, we simply use \(DL(T\vert H) = DL(T-H)\).
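This encoding can be sketched as follows, using an entropy estimate as a stand-in for the Huffman-coded size of the difference vector (both function names are ours):

```python
import math
from collections import Counter

def entropy_bits(seq):
    """Entropy-based estimate of the bits needed to encode a discrete
    sequence (a stand-in for its Huffman-coded size)."""
    freq = Counter(seq)
    n = len(seq)
    return sum(-c * math.log2(c / n) for c in freq.values())

def reduced_description_length(T, H, model_cost_bits):
    """DL(T,H) = DL(H) + DL(T|H), taking DL(T|H) as the cost of the
    difference vector T - H, the simple encoding described in the text."""
    diff = [t - h for t, h in zip(T, H)]
    return model_cost_bits + entropy_bits(diff)
```

A perfect hypothesis yields an all-zero difference vector, so the correction cost vanishes and only the model cost remains.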
We will demonstrate how to calculate the reduced description length in more detail in the next section.
3 MDL modeling of time series
3.1 An intuitive example of our basic idea
A review of the following facts may help make our contributions more intuitive. The quality of an approximate representation of a time series is measured by the reconstruction error (Ding et al. 2008; Keogh and Kasetty 2003). This is simply the Euclidean distance between the model and the raw data. For example, in Fig. 4 we see the reconstruction error of the single-segment model is 18.78, while the reconstruction error of the two-segment model in Fig. 5 is just 8.42. This suggests a general truth: for the APCA, PLA, and DFT approximations of time series, it is always the case that the \(d\)-dimensional model has a reconstruction error greater than or equal to that of the \((d+1)\)-dimensional model (Ding et al. 2008; Palpanas et al. 2008). Note that this is only true on average for the DWT, SAX, IPLA, and PAA approximations (Ding et al. 2008). However, even for these representations, violations of this rule are very rare and minor.
The reader may have noted that in the example discussed, the range of the error vector (i.e., {max(\(e_{i}\)) \(-\) min(\(e_{i}\))}) also decreased, from 12 (7 to \(-\)4) in the former case to just 7 (3 to \(-\)3) in the latter. This is not necessarily the case; the relevant algorithms are minimizing the global error of the model, not the maximum error on any individual segment/coefficient, and one can certainly construct synthetic datasets for which this does not hold. However, it is almost always the case and, in the limit, necessarily so: recall that as \(d\) approaches \(m\), this range approaches zero. Further note that this range is an upper bound on the number of unique values that the compression algorithm must encode in the description length.
We note that this tight relationship between Euclidean distance and MDL has been observed/exploited before. In Fig. 12 of Rakthanmanon et al. (2012), the authors show a scatterplot illustrating the extraordinarily high correlation between the Euclidean distance of a pair of subsequences and the MDL description length when using one subsequence to encode the other (using essentially the same MDL formulation that we use here). Thus we can see the naturalness of using MDL to score our model choice, even though the underlying model fitting is solely concerned with minimizing the Euclidean distance between the model and the original data.
Intuitively, the alternative is much worse, vastly overestimating the mean of the original data. However, on what basis could MDL make this distinction? If our MDL formulation considered the Y-axis values to be categorical variables, then there would be no reason to prefer either model.
However, note that the sum of the magnitudes of the residuals is much greater for Fig. 6—right. This is true by definition, as using the mean minimizes this value. However, nothing in our model description length explicitly accounts for this. An obvious solution to this issue is to encode a term that accounts for the range of numbers required to be modeled in the description length, in addition to their entropy. This issue is unique to ordinal data, and does not occur with categorical data. For example, when dealing with categorical data, there is no cost difference between say \(s_{\mathrm{x}} = \mathbf{a \ a \ a \ b}\) and \(s_{\mathrm{y}} = \mathbf{m\, m\, m\, n}\). However, in our domain there is a significant difference between say \(e_\mathrm{x} = \mathbf{1\, 1\, 1\, 2}\) and \(e_\mathrm{y} = \mathbf{3\, 3\, 3\, 4}\), because the latter condemns us to consider values in a \(\log_{2}(4)\) range in the description length for the model, whereas the former allows us to consider values in only the smaller \(\log_{2}(2)\) range.
The reader can now appreciate why our “solution” to this issue was to simply ignore it. Because the underlying dimensionality reduction algorithms we are using (APCA, DFT, PLA) attempt to minimize the residual error,^{5} they also implicitly minimize the range of residuals. As shown in Fig. 7, if we explicitly added a term for the range of residuals it would have no effect, as the dimensionality reduction algorithm has already minimized it.
As we apply our ideas to each representation, we must be careful to correctly “charge” each model for the number of parameters used in the model. For example, each APCA segment requires the mean value and length, whereas PLA segments require the mean value, segment length and slope. Each DFT coefficient can be represented by the amplitude and phase of each sine wave; however, because of the complex conjugate property, we get a “free” coefficient for each one we store (Camerra et al. 2010; Ding et al. 2008). In previous comparisons of the indexing performance of various time series representations, many authors have given an unfair advantage to one representation by counting the cost to represent an approximation incorrectly (Keogh and Pazzani 2000). The ideas in this work explicitly assume a fair comparison. Fortunately, the community seems to have become more aware of this issue in recent years (Camerra et al. 2010; Palpanas et al. 2008).
In the next section we give both the generic version of the MDL model discovery for time series algorithms and three concrete instantiations for DFT, APCA, and PLA.
3.2 Generic MDL for time series algorithms
In the previous section, we used a toy example to demonstrate how to compute the reduced description length of a time series under a competing hypothesis. In this section, we present a detailed generic version of our algorithm, and then explain in detail how we apply it to the three most commonly used time series representations.
Our algorithm not only discovers the intrinsic cardinality and dimensionality of an input time series, but it can also be used to find the right model or data representation for a given time series. Table 1 shows a highlevel view of our algorithm for discovering the best model, cardinality, and dimensionality which will minimize the total number of bits required to store the input time series.
Because MDL is the core of our algorithm, the first step is to quantize a real-valued time series into a discrete-valued (but still fine-grained) time series, \(T\) (line 1). Next, we consider each model, cardinality, and dimensionality one by one (lines 3–5). Then, a hypothesis \(H\) is created based on the selected model and parameters (line 6). For example, a hypothesis \(H\), shown in Fig. 5, is created when the model \(M\) = APCA, cardinality c = 16, and dimensionality d = 2; note that, in that case, the length of the input time series was m = 24.
Generic MDL algorithm for time series
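The search over models and parameters can be sketched as follows; the model-fitting and scoring interfaces (`fit` and `rdl`) are assumptions for illustration, not the paper's code:

```python
def best_model(T, models, cardinalities, dimensionalities, rdl):
    """Try every (model, cardinality, dimensionality) triple and keep the
    one whose hypothesis minimizes the reduced description length.

    `models` maps a model name to a fitting function fit(T, c, d) -> H,
    and `rdl(T, H, c, d)` scores a hypothesis in bits; both interfaces
    are assumed here for the sake of the sketch."""
    best_bits, best_choice = float("inf"), None
    for name, fit in models.items():
        for c in cardinalities:
            for d in dimensionalities:
                H = fit(T, c, d)
                bits = rdl(T, H, c, d)
                if bits < best_bits:
                    best_bits, best_choice = bits, (name, c, d)
    return best_choice, best_bits
```

In practice the fitting calls should be hoisted out of the inner loops, as noted later in the text.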
For concreteness, we will now consider three specific versions of our generic algorithm.
3.3 Adaptive Piecewise Constant Approximation
Our algorithm specific to APCA
Note that if the dimensionality were more than m/2, some segments would contain only one point. Then, a hypothesis \(H\) would be created using the values of cardinality c and dimensionality d, as shown in Fig. 5, where c = 16 and d = 2. The model contains \(d\) constant segments, so the model cost is the number of bits required for storing d constant numbers, and d  1 pointers to indicate the offset of the end of each segment (line 6). The difference between \(T\) and \(H\) is also required to rebuild \(T\). The correction cost is computed; then the reduced description length is calculated from the combination of the model cost and the correction cost (line 7). Finally, the hypothesis that minimizes this value is returned as an output of the algorithm (lines 8–13).
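The APCA model cost described above can be written down directly; in this sketch the \(\log_{2}m\) bit-width for each segment-end pointer is an assumption:

```python
import math

def apca_model_cost(d, c, m):
    """Bits to store an APCA hypothesis: d constant values at log2(c) bits
    each, plus d - 1 end-of-segment pointers at log2(m) bits each (the
    pointer width is an assumption for illustration)."""
    return d * math.log2(c) + (d - 1) * math.log2(m)
```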
3.4 Piecewise Linear Approximation
An example of a PLA model is shown in Fig. 8—right. In contrast to APCA, a PLA hypothesis is more complex because each segment contains a line of arbitrary slope, instead of the constant line in APCA. The algorithm used to discover the intrinsic cardinality and dimensionality for PLA is shown in Table 3; it is similar to the algorithm for APCA, except for the code in lines 5 and 6.
A PLA hypothesis \(H\) is created by the external module PLA (line 5). To represent each segment in hypothesis \(H\), we record the starting value, ending value, and ending offset (line 6). The slope is not kept because storing a real number is more expensive than the \(\log_{2}c\) bits needed for a quantized value.
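This segment encoding implies the following model cost; as before, the \(\log_{2}m\) bit-width for each ending offset is an assumption for illustration:

```python
import math

def pla_model_cost(d, c, m):
    """Bits for a PLA hypothesis under the encoding in the text: each of
    the d segments stores a starting value and an ending value (log2(c)
    bits each) plus an ending offset (log2(m) bits assumed). The slope is
    recomputed from these two values, so it costs nothing to store."""
    return d * (2 * math.log2(c) + math.log2(m))
```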
Our algorithm specific to PLA
3.5 Discrete Fourier Transform
A data representation in DFT space is simply a linear combination of sine waves, as shown in Fig. 8—left. Table 4 presents our algorithm specific to DFT. After we quantize the input time series to a discrete time series \(T\) (line 1), the external module DFT is called to return the list of sine wave coefficients that represent \(T\). The coefficients in DFT form a set of complex conjugates, so we store only half of the coefficients, without their conjugates, in a list called half_coef (line 5). When half_coef is provided, it is trivial to compute the conjugates and obtain all of the original coefficients.
Our algorithm specific to DFT
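The half-coefficient storage rests on the conjugate symmetry of the DFT of a real signal (\(X_{n-k} = \overline{X_{k}}\)); a generic sketch of storing and restoring the coefficients (ours, not the paper's code):

```python
def half_coef(coefficients):
    """Keep the first half of a real signal's DFT coefficients (including
    the DC term); the rest are complex conjugates of these."""
    n = len(coefficients)
    return coefficients[: n // 2 + 1]

def restore(half, n):
    """Rebuild all n coefficients from the stored half via conjugacy:
    X[k] = conj(X[n - k]) for k past the midpoint."""
    full = list(half)
    for k in range(n // 2 + 1, n):
        full.append(half[n - k].conjugate())
    return full
```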
For simplicity we placed the external modules APCA, PLA, and DFT inside two for-loops; however, to improve performance, they should be moved outside the loops.
3.6 A mixed polynomial degree model
For a given time series \(T\), we want to know the representation that minimizes the reduced description length for \(T\). We have shown how to achieve this goal by applying the MDL principle to three different models (APCA, PLA and DFT). However, for some complex time series, using only one of the above models may not be sufficient to achieve the most parsimonious representation, as measured by bit cost or by our subjective understanding of the data (Lemire 2007; Palpanas et al. 2008). It has been shown that, averaged over many highly diverse datasets, there is not much difference among the different representations (Palpanas et al. 2008). However, it is possible that within a single dataset, the specific model used could make a significant difference. For example, consider each of the two time series that form the trajectory of an automobile as it drives through Manhattan. These time series are comprised of a combination of straight lines and curves. We could choose just one of these possibilities, either representing the automobile’s turns with many piecewise linear segments, or representing the long straight sections with a degenerate “curve.” However, a mixed polynomial degree model is clearly more natural here.
Several works propose that it may be fruitful to use a combination of different models within one time series (Keogh et al. 2011; Lemire 2007; Palpanas et al. 2008). For example, Lemire (2007) proposes a mixed model wherein the polynomial degree of each interval in one time series can vary. The polynomial degree can be zero, one, two or higher. The goal of Lemire (2007) is to minimize the Euclidean error between the model and the original data for a given number of segments. However, note that Lemire (2007) requires the user to state the desired dimensionality, something we obviously wish to avoid. Minimizing the Euclidean error between the model and the original data is a useful objective function for some tasks, but it is not necessarily the same as discovering the intrinsic dimensionality, which is our stated goal. In the following, we show that our proposed algorithm returns the intrinsic model by minimizing the reduced description length using MDL. Moreover, our algorithm is essentially parameterfree.
Our algorithm specific to the mixed polynomial degree model
Bottom-up mixed polynomial degree model algorithm
Table 6 shows the bottom-up mixed polynomial degree model algorithm. By choosing the minimum description cost as the objective function, the algorithm shown in Table 6 generalizes the bottom-up algorithm for generating the PLA introduced in Keogh et al. (2011). There are two main differences between our bottom-up mixed model algorithm and the bottom-up algorithm described in Keogh et al. (2011). The first is a minor pragmatic point: instead of using two points in the finest possible approximation, the algorithm shown in Table 6 uses three points. This is because when the polynomial degree of a representation is two, each segment must contain at least three points. Second, instead of using Euclidean distance as the objective function, the algorithm in Table 6 uses the MDL cost. For each segment, the algorithm calculates the MDL costs of three polynomial representations, of degree zero, one and two respectively, and chooses the one that minimizes the cost (the description length). The algorithm begins by creating the finest possible approximation of the input time series, so for a time series of length n there are n/3 segments after this step, as shown in Table 6, lines 2–4. Then the cost of merging each pair of adjacent segments is calculated, as shown in lines 5–7. To minimize the merging cost for the two input segments, the calculate_MDL_cost function computes the MDL costs for the three polynomial degree representations and chooses the minimum one as the merging cost (line 6). After this step, the algorithm iteratively merges the lowest-cost pair until a stopping criterion is met. In this scenario, the stopping criterion is the input number of segments; that is, the algorithm continues merging as long as the current number of segments is larger than the input number of segments.
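The bottom-up merging can be sketched as follows; here calculate_MDL_cost is abstracted into a caller-supplied `mdl_cost` function (assumed to return the best description length of a run of points over polynomial degrees zero, one and two):

```python
def bottom_up_mixed(T, target_segments, mdl_cost):
    """Greedy bottom-up merging in the style of Table 6: start from the
    finest three-point segments, then repeatedly merge the adjacent pair
    whose merged segment has the lowest MDL cost, until only
    `target_segments` segments remain."""
    # Finest possible approximation: non-overlapping three-point segments.
    segments = [T[i : i + 3] for i in range(0, len(T) - len(T) % 3, 3)]
    while len(segments) > target_segments:
        # Cost of merging each adjacent pair, under the supplied MDL cost.
        costs = [mdl_cost(segments[i] + segments[i + 1])
                 for i in range(len(segments) - 1)]
        i = costs.index(min(costs))
        segments[i : i + 2] = [segments[i] + segments[i + 1]]
    return segments
```

Note that merged segments only ever grow: there are join operators but no split operators, matching the greedy behavior discussed below.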
It is important to note that, like the algorithm of Keogh et al. (2011), our algorithm is greedy in the sense that once two regions have been joined into a single segment, they remain together in that segment (which may grow as it is iteratively joined with other segments). There are only join operators; there are no split operators. However, if a region is initially assigned a polynomial of a particular degree, this does not mean it cannot later be subsumed into a larger segment of a different degree. In other words, a tiny region that locally may consider itself, say, linear can later become part of a constant or quadratic segment as it obtains a more “global” view.
4 Experimental evaluation
To ensure that our experiments are easily reproducible, we have set up a website which contains all data and code, together with the raw spreadsheets of the results (www.cs.ucr.edu/~bhu002/MDL/MDL.html). In addition, this website contains additional experiments that are omitted here for brevity.
4.1 A detailed example on a famous problem
We start with a simple sanity check on the classic problem of specifying the correct time series model, cardinality and dimensionality, given an observation of a corrupted version of it. While this problem has received significant attention in the literature (Donoho and Johnstone 1994; Salvador and Chan 2004; Sarle 1999), our MDL method has two significant advantages over existing works. First, there is no explicit parameter to set, whereas most other methods require several parameters to be set. Second, MDL helps to specify the model, cardinality and dimensionality, whereas other methods typically only consider the model and/or dimensionality.
The task is challenging because some of the piecewise constant sections are very short and thus easily dismissed during a model search. Dozens of algorithms have been tested on this time series (indeed, on this exact instance of data) in the last decade: which should we compare to? Most of these methods have several parameters, in some cases as many as six (Firoiu and Cohen 2002; García-López and Acosta-Mesa 2009). We argue that comparisons to such methods are inappropriate, since our explicit aim is to introduce a parameter-free method. The most cited parameter-free method addressing this problem is the L-Method (Salvador and Chan 2004). In essence, the L-Method is a “knee-finding” algorithm. It attempts to explain the residual error vs. size-of-model curve using all possible pairs of two regression lines. Figure 11—top shows one such pair of lines, from one to ten and from eleven to the end. The location that produces the minimum sum of the residual errors of these two curves, R, is offered as the optimal model. As we can see in Fig. 11—bottom, this occurs at location ten, a reasonable estimate of the true value of 12.
The figure above uses a cardinality of 256, but the same answer is returned for (at least) every cardinality from 8 to 256.
Here MDL indicates a cardinality of ten, which is the correct answer (Sarle 1999). We also reimplemented the most referenced recent paper on time series discretization (GarcíaLópez and AcostaMesa 2009). The algorithm is stochastic, and requires the setting of five parameters. In one hundred runs over multiple parameters we found it consistently underestimated the cardinality of the data (the mean cardinality was 7.2).
4.2 An example application in physiology
We test this binary assumption by using MDL to find the model, dimensionality and cardinality. The results for the model and dimensionality are objectively correct, as we might have expected given the results in the previous section, but the results for cardinality, shown in Fig. 16—left, are worth examining.
4.3 An example application in astronomy
In this section (and the one following) we consider the possible utility of MDL scoring as an anomaly detector. Building an anomaly detector using MDL is very simple. We simply record the best model, dimensionality and/or cardinality discovered for the training data, and then flag future observations whose learned parameters differ significantly from these expectations. We can illustrate this idea with an example in astronomy. We begin by noting that we are merely demonstrating an additional possible application of our ideas; we are only showing that we can reproduce the utility of existing works. However, note that our technique is at least as fast as existing methods (Protopapas et al. 2006; Rebbapragada et al. 2009), and does not require any parameter tuning, an important advantage for exploratory data mining.
We then took a test set of 8,124 objects, known to contain at least one anomaly, and measured the intrinsic DFT dimensionality of all of its members, and discovered that one had a value of 31. As shown in Fig. 17—bottom, the offending curve looks different from the other data, and is labeled RRL_OGLE053803.42695656.4.I.folded ANOM. This curve is a previously known anomaly. In this case, we are simply able to reproduce the anomaly finding ability of previous work (Protopapas et al. 2006; Rebbapragada et al. 2009). However, we achieved this result without extensive parameter tuning, and we can do so very efficiently.
4.4 An example application in cardiology
Once again, here we are simply reproducing a result that could be produced by other methods (Yankov et al. 2008). However, we reiterate that we are doing so without any parameter tuning. Moreover, it is interesting to note the cases where our algorithm does not flag innocuous data (i.e., does not produce false positives). Consider the two adjacent heartbeats labeled A and B in Fig. 18—bottom. It happens that the completely normal heartbeat B has significantly more noise than heartbeat A. Such non-stationary noise presents great difficulties for distance-based and density-based outlier detection methods (Yankov et al. 2008), but MDL is essentially invariant to it. Likewise, the significant wandering baseline (not illustrated) in parts of this dataset has no medical significance and is ignored by MDL, but it is the bane of many ECG anomaly detection methods (Chandola et al. 2009).
4.5 An example application in geosciences
Globalscale Earth observation satellites such as the Defense Meteorological Satellite Program (DMSP) Special Sensor Microwave/Imager (SSM/I) have provided temporally detailed information about the Earth’s surface since 1978, and the National Snow and Ice Data Center (NSIDC) in Boulder, Colorado makes this data available in real time. Such archives are a critical resource for scientists studying climate change (Picard et al. 2007). In Fig. 19, we show a brightness temperature time series from a region in Antarctica, using SSM/I daily observations over the 2001–2002 austral summer.
After consulting some polar climate experts, the following explanation emerges. For most of the year the location in question is covered in snow. The introduction of a small amount of liquid water significantly changes the reflective properties of the ground cover, allowing the absorption of more heat from the sun and thus producing more liquid water in a rapid positive feedback cycle. This explains why the data does not show a sinusoidal shape or a gradual (say, linear) rise, but rather a fast phase change, from a mean of about 155 Kelvin to a ninety-day summer of about 260 Kelvin.
4.6 An example application in hydrology and environmental science
In this section we show two applications of our algorithm in hydrological and environmental domains.
4.7 An example application in biophysics
4.8 An example application in prognostics
In this section, we demonstrate our framework’s ability to aid in clustering problems.
Recently, the field of prognostics for engineering systems has attracted a huge amount of attention due to its ability to provide early warning of system failures, forecast maintenance as needed, and estimate the remaining useful life of a system (Goebel et al. 2008; Prognostics Center of Excellence, National Aeronautics and Space Administration (NASA) 2012; Vachtsevanos et al. 2006; Wang et al. 2008; Wang and Lee 2006). Data-driven prognostics are more useful than model-driven prognostics, since the latter require incorporating a physical understanding of the system (Goebel et al. 2008). This is especially true when we have access to large amounts of data, a situation that is becoming more and more common.
There may be thousands of sensors in a single engineering system. Consider, for example, a typical oil-drilling platform, which can have 20,000–40,000 sensors on board, all streaming data about the health of the system (IBM 2012). Among this huge number of variables are some, called operational sensors, that have a substantial effect on system performance. In order to perform a prognostic analysis, the operational variables must first be separated from the non-operational variables, which merely respond to the operational ones (Wang et al. 2008).
Although data from the two kinds of variables look very similar (Fig. 28), our results in Fig. 29 show that there is a significant difference between them: the operational variables lie in the lower left corner of Fig. 29, with low intrinsic cardinalities and small reduced description lengths, whereas the non-operational variables lie in the upper right corner, with high cardinalities and large reduced description lengths. This implies that the data from the operational variables is relatively 'simple' compared to the data from the non-operational variables. This result was confirmed by a prognostics expert: the hypothesis behind filtering out the operational variables is that their data tends to have simpler behavior, since there are only a few crucial states for the engines (Heimes and BAE Systems 2008; PHM Data Challenge Competition 2008; Prognostics Center of Excellence, National Aeronautics and Space Administration (NASA) 2012; Wang et al. 2008; Wang and Lee 2006). Note that in our experiment we did not need to tune any parameters, while most of the related literature for this dataset uses multilayer perceptron neural networks (Heimes and BAE Systems 2008; Wang et al. 2008; Wang and Lee 2006), which carry the overhead of parameter tuning and are prone to overfitting.
4.9 Testing the mixed polynomial degree model
4.10 An example application in aeronautics
4.11 Quantifiable experiments

Noise The results shown in Sects. 4.1 and 4.2 suggest that our framework is at least somewhat robust to noise, but it is natural to ask at what point it breaks down, and how gracefully it degrades.

Sampling rate For many applications, the ubiquity of cheap sensors and memory means that the data is sampled at a rate higher than any practical application needs. For example, in the last decade most ECG data has gone from being sampled at 256 Hz to sampling rates of up to 10 kHz, even though there is little evidence that this aids analysis in any way. Nevertheless, there are clearly situations in which the data may be sampled at a lower rate than the ideal, and again we should consider how gracefully our method degrades.

Model assumptions While we have attempted to keep our algorithm as free of assumptions/parameters as possible, we still must specify the model class(es) to search over, i.e., DFT, APCA, and PLA. Clearly, even if we had noise-free data, the data may not be exactly a "platonic ideal" created from pure components of our chosen model. Thus we must ask ourselves how far our model assumptions can be violated before our algorithm degrades.

The Root-Mean-Square Error (RMSE) This is the square root of the mean of the squared differences between \(P\) and the predicted output of our algorithm; it essentially summarizes the lengths of the gray hatch lines shown in Fig. 5. While zero clearly indicates a perfect recreation of the data, the absolute value of the RMSE otherwise has little intuitive value. However, the rate at which it changes under a distortion is of interest here. This measure is shown with blue lines in Fig. 34.

Correct cardinality prediction This is a binary outcome, either our algorithm predicted the correct cardinality or it did not. This is shown with black lines in Fig. 34.

Correct dimensionality prediction This is also a binary outcome, either our algorithm predicted the correct dimensionality or it did not. Note that we only count a correct prediction of dimensionality if every segment endpoint is within three data points of the true location. This measure is shown with red lines in Fig. 34.
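Of the three measures above, only the RMSE requires any computation; a trivial sketch follows, with `P_hat` standing for our algorithm's reconstruction of \(P\) (both names are ours):

```python
import numpy as np

def rmse(P, P_hat):
    """Square root of the mean squared point-wise difference between the
    prototype signal P and the reconstruction P_hat (the gray hatch
    lines of Fig. 5)."""
    P, P_hat = np.asarray(P, dtype=float), np.asarray(P_hat, dtype=float)
    return float(np.sqrt(np.mean((P - P_hat) ** 2)))
```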

In Fig. 34—top we start with the initial prototype signal \(P\), which has no noise, and we add noise until the signal-to-noise ratio (SNR) is \(-\)4.0. The SNR is calculated according to the standard equation (Signal to Noise Ratio, http://en.wikipedia.org/wiki/Signal-to-noise_ratio). As we can see, the cardinality prediction fails at an SNR of about \(-\)1.7, and the dimensionality prediction shortly thereafter.

In Fig. 34—middle we start with the initial prototype signal \(P\) with an SNR of \(-\)0.22, which is the published DJB data with the medium-noise setting (Sarle 1999). We progressively resample the data from 2,048 data points (the original data) down to just 82 data points. Both the cardinality and dimensionality predictions fail as we move from 320 down to 300 data points.

In Fig. 34—bottom we again start with the initial prototype signal \(P\) with an SNR of \(-\)0.22; this time we gradually add a global linear trend, from 0 to 0.45 as measured by the gradient of the data. Both the dimensionality and cardinality predictions fail as the gradient is increased past 0.3.
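The noise experiment above requires corrupting \(P\) at a controlled SNR. A minimal sketch, assuming the standard definition of SNR as the ratio of signal power to noise power expressed in decibels; the helper names are our own:

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, seed=None):
    """Add white Gaussian noise so that the resulting signal-to-noise
    ratio, SNR(dB) = 10*log10(P_signal / P_noise), equals snr_db."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(p_noise), signal.shape)

def measure_snr_db(signal, noisy):
    """Recover the realized SNR in decibels from a clean/noisy pair."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noisy, dtype=float) - signal
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))
```

Sweeping `snr_db` downward and re-running the cardinality/dimensionality discovery at each level reproduces the degradation curves of the kind shown in Fig. 34—top.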
It is important to reiterate that the experiment that added a linear trend to the data, but only considered a constant model, was deliberately testing the mismatch between assumptions and reality. In particular, if we repeat the technique shown in Fig. 14 of testing over all model spaces in {DFT, APCA, PLA}, our algorithm correctly predicts that the data in Fig. 35c consists of piecewise linear segments, and still correctly predicts the cardinality and the dimensionality.
5 Time and space complexity
The space complexity of our algorithm is linear in the size of the original data. The time complexity of the algorithms that use APCA and PLA as the representation in Tables 2 and 3 is \(O(m^{2})\), and the time complexity of the algorithm using DFT as the approximation in Table 4 is \(O(m\log m)\). Although we have two for loops in each of the above three tables, the for loops add only constant factors; they do not increase the polynomial degree of the time complexity. This is because the outer for loop ranges over the cardinalities \(c\) from 2 to 256 and the inner for loop ranges over the dimensionalities \(d\) from 2 to 64.
Our framework thus achieves a time complexity of \(O(m^{2})\). Note that the data in Sect. 4.5 were obtained over a year and the datasets in Sect. 4.6 were obtained over more than 80 years. Thus, compared to how long it takes to collect the data, our algorithm's execution time (a few seconds) is inconsequential for most applications.
Nevertheless, we can use the following two methods to speed up the search by pruning from the search space those combinations of \(c\) and \(d\) that are very unlikely to be fruitful.
In order to calculate the intrinsic dimensionality for DJB, we first fix the cardinality at 256 and then find the MDL cost with the dimensionality ranging from 2 to 64 (the inner for loop in Table 2). In this example, the time complexity for finding the intrinsic dimensionality is \(O(m^{2})\). After we discover that the intrinsic dimensionality is 12, we fix the dimensionality at 12 and calculate the MDL cost with the cardinality ranging from 2 to 256. Thus, the time complexity for finding the intrinsic cardinality is also \(O(m^{2})\). Using this method, there is no need to calculate the MDL cost for every combination of \(c\) and \(d\).
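This two-stage pruning heuristic is straightforward to express in code. Below is a sketch in which `mdl_cost(c, d)` is assumed to be any callable returning the MDL score of the data under cardinality \(c\) and dimensionality \(d\); the function name is our own:

```python
def two_stage_search(mdl_cost, max_card=256, max_dim=64):
    """Fix the cardinality at its maximum and search over dimensionality,
    then fix the winning dimensionality and search over cardinality.
    This evaluates O(max_dim + max_card) cells of the (c, d) grid
    rather than all O(max_dim * max_card) of them."""
    best_d = min(range(2, max_dim + 1), key=lambda d: mdl_cost(max_card, d))
    best_c = min(range(2, max_card + 1), key=lambda c: mdl_cost(c, best_d))
    return best_c, best_d
```

For the DJB example, the first stage would recover \(d = 12\) with \(c\) pinned at 256, and the second stage would then recover the intrinsic cardinality with \(d\) pinned at 12.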
6 Discussion and related work
We took the opportunity to show an initial draft of this manuscript to many respected researchers in this area, and this paper greatly benefits from their input. However, many researchers passionately argued often mutually exclusive points related to MDL that we felt were orthogonal to our work and irrelevant distractions from our claims. We will briefly address these points here.
The first issue is who should be credited with the invention of the basic idea we are exploiting, that the shortest overall twopart message is most likely the correct explanation for the data. Experts in complexity theory advocate passionately for Andrey Kolmogorov, Chris Wallace, Ray Solomonoff, Jorma Rissanen or Gregory Chaitin, etc. Obviously, our work is not weighing in on such a discussion, and we refer to Li (1997) as a good neutral starting point for historical context. We stand on the shoulders of all such giants.
One researcher felt that MDL models could only be evaluated in terms of the prediction of future events, not on post-hoc explanations of the models discovered (as we did in Fig. 15, for example). However, we have carried out prediction experiments. For example, in the introduction we used our MDL technique to predict which of approximately 700 combinations of settings of the cardinality/dimensionality/number of exemplars would produce the most accurate classifier under the given constraints. Clearly the 90.75 % accuracy we achieved significantly outperforms the default settings, which gave only 58.70 %. Moreover, a brute force search shows that our predicted model produced the best possible result (three similar settings of the parameters tied with the 90.75 % accuracy). Likewise, the experiment shown in Fig. 18 can be cast in a prediction framework: "predict which of these heartbeats a cardiologist is most likely to state is abnormal." To summarize, we do not feel that the prediction/explanation dichotomy is of particular relevance here.
There are many works that use MDL in the context of real-valued time series; however, our parameter-free method is novel. For example, Firoiu and Cohen (2002) use MDL to help guide a PLA segmentation of time series; however, their method also uses both hybrid neural networks and hidden Markov models, requiring at least six parameters to be set (and a significant amount of computational overhead). Similarly, Molkov et al. (2009) use MDL in the context of neural networks, inheriting the utility of MDL but also inheriting the difficulty of learning the topology and parameters of a neural network.
Likewise, the authors of Davis et al. (2008) use MDL to "find breaks" (i.e., segments) in a time series, but their formulation uses a genetic algorithm, which requires a large computational overhead and the careful setting of seven parameters. Finally, there are now several research efforts that use MDL for time series (Vatauv 2012; Vespier et al. 2012) that were inspired by the original conference version of this work (Hu et al. 2011).
There are also examples of research efforts using MDL to help cluster or carry out motif discovery in time series; however, to the best of our knowledge, this is the first work to show a completely parameterfree method for the discovery of the cardinality/dimensionality/model of a time series.
7 Conclusions
We have shown that a simple yet powerful methodology based on MDL can robustly identify the intrinsic model, cardinality and dimensionality of time series data in a wide variety of domains. Our method has significant advantages over existing methods in that it is more general and is essentially parameter-free. We have further shown applications of our ideas to resource-limited classification and anomaly detection. We have made all of our (admittedly very simple) code and datasets freely available at www.cs.ucr.edu/~bhu002/MDL/MDL.html so that others can confirm and build on our results.
A reader may assert that our claim to be parameter-free is unwarranted because we "choose" to use a binary computer instead of, say, a ternary computer,^{8} we use Huffman coding rather than Shannon–Fano coding, and we hard-code the maximum cardinality of time series to 256. However, a pragmatic data miner will still see our work as a way to explore time series data free from the need to adjust parameters. In that sense our work is truly parameter-free.
In addition to the above, we need to acknowledge other shortcomings and limitations of our work. Our ideas, while built upon the solid theoretical foundation of MDL, are heuristic; we have not proved any properties of our algorithms. Moreover, our method is essentially a scoring function; as such, it will inherit any limitations of the search function used (cf. Table 3). For example, while there is an optimal algorithm for finding the cheapest (in the sense of lowest root-mean-squared error) PLA of a time series for any desired \(d\), this algorithm is too slow for most practical purposes, and thus we (like virtually all the rest of the community) must content ourselves with an approximate PLA construction algorithm (Keogh et al. 2011).
Footnotes
 1.
However, Sect. 1.1 shows an example where this is useful.
 2.
The closely related technique of MML (Minimum Message Length; Wallace and Boulton 1968) does allow for continuous real-valued data. However, here we stick with the more familiar MDL formulation.
 3.
This slightly awkward formula is necessary because we use the symmetric range \([-128, 127]\). If we use the range [1, 256] instead, we get the more elegant: \({\textit{Discretization}}(T)={\textit{round}}\left( {\frac{(T-{\textit{min}})}{({\textit{max}}-{\textit{min}})}}\times (2^{s}-1)\right) +1.\)
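The one-based variant of this footnote can be sketched as follows. We assume the multiplication by \(2^{s}-1\) takes place before rounding, so that the output actually spans the integers 1 to \(2^{s}\); the function name is our own:

```python
import numpy as np

def discretize(T, s=8):
    """Linearly map a real-valued series T onto the integers 1..2**s
    (the one-based variant; s=8 gives the paper's maximum cardinality
    of 256)."""
    T = np.asarray(T, dtype=float)
    tmin, tmax = T.min(), T.max()
    return (np.round((T - tmin) / (tmax - tmin) * (2 ** s - 1)) + 1).astype(int)
```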
 4.
The word error has a pejorative meaning not intended here; some authors prefer to use correction cost.
 5.
DFT does minimize the residual error at any desired dimensionality given its set of basis functions. For both APCA and PLA, while there are algorithms that can minimize the residual error, they are too slow to use in practice. We use greedy approximation algorithms that are known to produce near optimal results (Keogh et al. 2011; Rakthanmanon et al. 2012).
 6.
The values for \(k\) = 3, 4 or 5 do not differ by more than 1 %.
 7.
In Kehagias (2004) the author claimed that, to obtain the optimal segmentation, the number of segments should be five. This claim is rather subjective, simply because this "optimal" segmentation is only with respect to the total deviation from the segment means. Moreover, there is no hydrological interpretation of the five segments with regard to the real data.
 8.
Of course, no commercial ternary computers exist; however, they are at least a logical possibility.
Notes
Acknowledgments
This project was supported by the Department of the United States Air Force, Air Force Research Laboratory under Contract FA875010C0160, and by NSF grants IIS1161997. We would like to thank the reviewers for their helpful comments, which have greatly improved this work.
References
 Assent I, Krieger R, Afschari F, Seidl T (2008) The TS-Tree: efficient time series search and retrieval. In: EDBT. ACM, New York
 Bronson JE, Fei J, Hofman JM, Gonzalez RL, Wiggins CH (2009) Learning rates and states from biophysical time series: a Bayesian approach to model selection and single-molecule FRET data. Biophys J 97:3196–3205
 Camerra A, Palpanas T, Shieh J, Keogh E (2010) \(i\)SAX 2.0: indexing and mining one billion time series. In: International conference on data mining
 Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41:3
 Davis RA, Lee TCM, Rodriguez-Yam G (2008) Break detection for a class of nonlinear time series models. J Time Ser Anal 29:834–867
 De Rooij S, Vitányi P (2012) Approximating rate-distortion graphs of individual data: experiments in lossy compression and denoising. IEEE Trans Comput 61(3):395–407
 Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. In: VLDB, Auckland, pp 1542–1552
 Donoho DL, Johnstone IM (1994) Ideal spatial adaptation via wavelet shrinkage. Biometrika 81:425–455
 Evans SC et al (2007) MicroRNA target detection and analysis for genes related to breast cancer using MDLcompress. EURASIP J Bioinform Syst Biol 1–16
 Firoiu L, Cohen PR (2002) Segmenting time series with a hybrid neural networks hidden Markov model. In: Proceedings of the 18th national conference on artificial intelligence, p 247
 García-López D, Acosta-Mesa H (2009) Discretization of time series dataset with a genetic search. In: MICAI. Springer, Berlin, pp 201–212
 Goebel K, Saha B, Saxena A (2008) A comparison of three data-driven techniques for prognostics. In: Failure prevention for system availability, 62nd meeting of the MFPT Society, pp 119–131
 Grünwald PD, Myung IJ, Pitt MA (2005) Advances in minimum description length: theory and applications. MIT Press, Cambridge
 Heimes FO, BAE Systems (2008) Recurrent neural networks for remaining useful life estimation. In: International conference on prognostics and health management
 Hu B, Rakthanmanon T, Hao Y, Evans S, Lonardi S, Keogh E (2011) Discovering the intrinsic cardinality and dimensionality of time series using MDL. In: ICDM
 International Business Machines (IBM) (2012) Harness the power of big data. www.public.dhe.ibm.com/common/ssi/ecm/en/imm14100usen/IMM14100USEN.PDF. Accessed 7 Nov 2012
 Jonyer I, Holder LB, Cook DJ (2004) Attribute-value selection based on minimum description length. In: International conference on artificial intelligence
 Kehagias Ath (2004) A hidden Markov model segmentation procedure for hydrological and environmental time series. Stoch Environ Res Risk Assess 18:117–130
 Keogh E, Chu S, Hart D, Pazzani M (2011) An online algorithm for segmenting time series. In: KDD
 Keogh E, Kasetty S (2003) On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Min Knowl Discov 7(4):349–371
 Keogh E, Pazzani MJ (2000) A simple dimensionality reduction technique for fast similarity search in large time series databases. In: PAKDD, pp 122–133
 Keogh E, Zhu Q, Hu B, Hao Y, Xi X, Wei L, Ratanamahatana CA (2006) The UCR time series classification/clustering. www.cs.ucr.edu/~eamonn/time_series_data/
 Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. In: Proceedings of the eleventh international workshop on artificial intelligence and statistics
 Lemire D (2007) A better alternative to piecewise linear time series segmentation. In: SDM
 Li M (1997) An introduction to Kolmogorov complexity and its applications, 2nd edn. Springer, Berlin
 Lin J, Keogh E, Lonardi S, Patel P (2002) Finding motifs in time series. In: Proceedings of the 2nd workshop on temporal data mining
 Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing SAX: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144
 Linacre E, Geerts B (2011) Resources in atmospheric science, 2002. http://www-das.uwyo.edu/~geerts/cwx/notes/chap15/global_temp.html. Accessed 1 Dec 2011
 Malatesta K, Beck S, Menali G, Waagen E (2005) The AAVSO data validation project. J Am Assoc Variable Star Observ (JAAVSO) 78:31–44
 Molkov YI, Mukhin DN, Loskutov EM, Feigin AM (2009) Using the minimum description length principle for global reconstruction of dynamic systems from noisy time series. Phys Rev E 80:046207
 Mörchen F, Ultsch A (2005) Optimizing time series discretization for knowledge discovery. In: KDD
 National Aeronautics and Space Administration (2011) GISS surface temperature analysis. http://data.giss.nasa.gov/gistemp/. Accessed 1 Dec 2011
 Palpanas T, Vlachos M, Keogh E, Gunopulos D (2008) Streaming time series summarization using user-defined amnesic functions. IEEE Trans Knowl Data Eng 20(7):992–1006
 Papadimitriou S, Gionis A, Tsaparas P, Väisänen A, Mannila H, Faloutsos C (2005) Parameter-free spatial data mining using MDL. In: ICDM
 Pednault EPD (1989) Some experiments in applying inductive inference principles to surface reconstruction. In: IJCAI, pp 1603–1609
 PHM Data Challenge Competition (2008) phmconf.org/OCS/index.php/phm/2008/challenge
 Picard G, Fily M, Gallee H (2007) Surface melting derived from microwave radiometers: a climatic indicator in Antarctica. Ann Glaciol 47:29–34
 Protopapas P, Giammarco JM, Faccioli L, Struble MF, Dave R, Alcock C (2006) Finding outlier light curves in catalogs of periodic variable stars. Mon Not R Astron Soc 369:677–696
 Prognostics Center of Excellence, National Aeronautics and Space Administration (NASA) (2012) ti.arc.nasa.gov/tech/dash/pcoe/prognostic-data-repository/. Accessed 7 Nov 2012
 Project URL. www.cs.ucr.edu/~bhu002/MDL/MDL.html. This URL contains all data and code used in this paper, as well as many additional experiments omitted for brevity
 Rakthanmanon T, Keogh E, Lonardi S, Evans S (2012) MDL-based time series clustering. Knowl Inf Syst 33(2):371–399
 Rebbapragada U, Protopapas P, Brodley CE, Alcock CR (2009) Finding anomalous periodic time series. Mach Learn 74(3):281–313
 Rissanen J (1989) Stochastic complexity in statistical inquiry. World Scientific, Singapore
 Rissanen J, Speed T, Yu B (1992) Density estimation by stochastic complexity. IEEE Trans Inf Theory 38:315–323
 Salvador S, Chan P (2004) Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In: International conference on tools with artificial intelligence, pp 576–584
 Sarle W (1999) Donoho–Johnstone benchmarks: neural net results. ftp.sas.com/pub/neural/dojo/dojo.html
 Sart D, Mueen A, Najjar W, Niennattrakul V, Keogh E (2010) Accelerating dynamic time warping subsequence search with GPUs and FPGAs. In: IEEE international conference on data mining, pp 1001–1006
 Signal to Noise Ratio. http://en.wikipedia.org/wiki/Signal-to-noise_ratio
 US Environmental Protection Agency (2011) Climate change science. www.epa.gov/climatechange/science/recenttc.html. Accessed 6 Dec 2011
 Vachtsevanos G, Lewis FL, Roemer M, Hess A, Wu B (2006) Intelligent fault diagnosis and prognosis for engineering systems, 1st edn. Wiley, Hoboken
 Vahdatpour A, Sarrafzadeh M (2010) Unsupervised discovery of abnormal activity occurrences in multidimensional time series, with applications in wearable systems. In: SIAM international conference on data mining
 Vatauv R (2012) The impact of motion dimensionality and bit cardinality on the design of 3D gesture recognizers. Int J Hum–Comput Stud 71(4):387–409
 vbFRET Toolbox (2012) www.vbFRET.sourceforge.net. Accessed 8 Nov 2012
 Vereshchagin N, Vitanyi P (2010) Rate distortion and denoising of individual data using Kolmogorov complexity. IEEE Trans Inf Theory 56(7):3438–3454
 Vespier U, Knobbe A, Nijssen S, Vanschoren J (2012) MDL-based analysis of time series at multiple time-scales. Lecture notes in computer science (LNCS), vol 7524. Springer, Berlin
 Wallace CS, Boulton DM (1968) An information measure for classification. Comput J 11(2):185–194
 Wang T, Lee J (2006) On performance evaluation of prognostics algorithms. In: Proceedings of MFPT, pp 219–226
 Wang T, Yu J, Siegel D, Lee J (2008) A similarity-based prognostics approach for remaining useful life estimation of engineered systems. In: International conference on prognostics and health management
 Witten IH, Moffat A, Bell TC (1999) Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, San Francisco
 Yankov D, Keogh E, Rebbapragada U (2008) Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl Inf Syst 17(2):241–262
 Zhao Q, Hautamaki V, Franti P (2008) Knee point detection in BIC for detecting the number of clusters. In: ACIVS, vol 5259, pp 664–673
 Zwally HJ, Gloersen P (1977) Passive microwave images of the polar regions and research applications. Polar Rec 18:431–450
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.