
1 Introduction

There are numerous problems from different fields that produce time series data, including chemical engineering [27], intrusion detection [31], economic forecasting [28], gene expression analysis [21], hydrology [23], social network analysis [32], and fault detection [11]. Fortunately, there are just as many algorithms available for analyzing time series data [9]. These algorithms involve tasks including queries [9, 10], anomaly detection [2], clustering [4, 20], classification [3, 9], motif discovery [9, 24], and segmentation [16]. From a practical point of view, these algorithms share basic data processing steps: pre-processing and normalization [9], representation [17], and similarity computation [9, 15].

In addition to the large body of algorithms available for mining time series data, there is an additional set of techniques available for visualization of time series [1, 22, 26]. These techniques belong to the field of Visual Analytics, or sometimes Interactive Visual Analytics [14, 30], and include methods such as Parallel Coordinates [12], multiple views, brushing, selection, and iteration. Researchers in Visual Analytics have called out the need for greater integration with underlying algorithms [5].

In between these two fields of research there is a smaller body of work which investigates the interactive visualization of multivariate time series data [18, 25, 29]. Most of this work focuses on visualization and interaction with multivariate time series plots. Our work fits within this area, but with an emphasis on the algorithms used in the visualization. We provide a layer of abstraction by providing an interactive visual summary of the data, rather than just looking at the time series themselves.

In this paper, we describe a lightweight system for analyzing multivariate time series data called Dial-A-Cluster (DAC). DAC is designed to provide a straightforward set of algorithms focused on allowing an analyst to visualize and interactively explore a multivariate time series dataset. DAC requires pre-computed distance matrices so it can exploit a large number of available algorithms related to time series representation and similarity comparison [9]. The DAC interface uses multidimensional scaling [6] to provide a visualization of the dataset. The analyst can adjust the visualization by interactively weighting the distance measures for each time series. A modification of Fisher’s discriminant [8] can be used to rank the importance of each time series. Finally, an optimized weighting scheme for the visualization can be used to maximally correlate the data with analyst-specified metadata.

DAC is implemented as a plugin for Slycat (slycat.readthedocs.org) [7], a system which provides a web server, a database, and a Python infrastructure for remote computation on the web server. The Slycat DAC plugin is a web application which provides the previously described time series analysis algorithms. It requires no installation and is platform independent. In addition, DAC supports (via Slycat) management of multiple users, multiple datasets, and access control, thereby encouraging collaboration while maintaining data privacy. Slycat and DAC are implemented using JavaScript and Python. Slycat is open source (github.com/sandialabs/slycat).

2 Algorithms

The primary goal of DAC is to provide a no-install, interactive user interface which can be used to organize and query multivariate time series data according to the interests of the analyst. There are three algorithms which support this goal: visualization using multidimensional scaling, identifying time series most responsible for differences in analyst selected clusters, and optimizing the visualization according to analyst specified metadata.

2.1 Multidimensional Scaling

DAC uses classical multidimensional scaling (MDS) to compute coordinates for a dataset, where each datapoint is a set of time series measurements. To be precise, suppose we have a dataset \(\{ x_i \}\), where \(x_i\) is a datapoint, for example an experiment or a test. Each datapoint consists of a number of time series measurements, which we write as a vector \(x_i = [ \mathbf {t}_{ik} ]\), where \(\mathbf {t}_{ik}\) is the kth time series vector for datapoint \(x_i\). Note that we are abusing notation here, because each vector \(\mathbf {t}_{ik}\) may have a different length, though we require that the \(\mathbf {t}_{ik}\) have the same length for a given k. We also assume that we are given distance matrices

$$ D_k = \begin{bmatrix} d_k(x_1, x_1) & d_k(x_1, x_2) & \cdots \\ d_k(x_2, x_1) & d_k(x_2, x_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} $$

for each time series measurement, where \(d_k (x_i, x_j)\) gives a distance between datapoint \(x_i\) and \(x_j\) for time series k. For example using Euclidean distance we would have

$$ d_k(x_i, x_j) = d(\mathbf {t}_{ik}, \mathbf {t}_{jk}) = \sqrt{\sum _l (t_{ikl} - t_{jkl})^2 }, $$

where \(\mathbf {t}_{ik} = [t_{ikl}]\) is the kth time series vector \(\mathbf {t}_{ik}\) for datapoint \(x_i\) indexed by l. Other distances can be used, so that each time series distance metric can be tailored to the type of measurement taken.
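As a minimal illustration (not DAC's actual implementation), the Euclidean distance matrix \(D_k\) for one time series measurement can be sketched in NumPy, the library used by DAC's backend (Sect. 3); the function name here is ours:

```python
import numpy as np

def euclidean_distance_matrix(series):
    """Pairwise Euclidean distances between rows of `series`.

    `series` is an (n_datapoints, series_length) array holding the kth
    time series t_ik for every datapoint; all rows have equal length,
    matching the requirement that the t_ik agree in length for fixed k.
    """
    # Squared norm of each row.
    sq = np.sum(series ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b; clip tiny negatives from round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * series @ series.T, 0.0)
    return np.sqrt(d2)
```

Any other per-series distance (e.g. one based on dynamic time warping) would simply replace this computation while producing a matrix of the same shape.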

Now let \(\varvec{\alpha } = [ \alpha _k ]\) be a vector of scalars, with \(\alpha _k \in [0,1]\). The vector \(\varvec{\alpha }\) contains weights so that we may compute weighted versions of our datapoints, defined as \(\mathbf {\Phi } (x_i) = [ \alpha _k \mathbf {t}_{ik} ]\), where we are again abusing notation since the vectors \(\mathbf {t}_{ik}\) are allowed to have different lengths. Now we define a distance matrix D of pairwise weighted distances between every datapoint, where the entries of D are given by

$$ d^2 (x_i, x_j) = \Vert \mathbf {\Phi }(x_i) - \mathbf {\Phi }(x_j) \Vert ^2 = \sum _k \alpha _k^2 d_k^2 (x_i, x_j). $$

The matrix D is the matrix of pairwise distances between datapoints used as input to MDS within DAC. The weights \(\alpha _k\) are adjustable by the analyst. Note that, with the squares and square root taken entrywise, these definitions give

$$D = \sqrt{\sum _k \alpha _k^2 D_k^2}.$$
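The weighted combination above is a small entrywise computation; a sketch in NumPy (with an illustrative function name of our choosing) follows:

```python
import numpy as np

def weighted_distance(Dks, alpha):
    """Combine per-series distance matrices D_k into the weighted matrix D.

    Dks   : list of (n, n) distance matrices, one per time series k
    alpha : analyst-adjustable weights alpha_k in [0, 1], one per time series
    Implements D = sqrt(sum_k alpha_k^2 D_k^2), squares taken entrywise.
    """
    D2 = sum(a ** 2 * Dk ** 2 for a, Dk in zip(alpha, Dks))
    return np.sqrt(D2)
```

Because the \(D_k\) are pre-computed, re-evaluating this sum is the only work needed when the analyst moves a weight slider, which is what keeps the interaction real-time.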

For completeness, we describe the MDS algorithm operating on the matrix D. First, we double center the distance matrix, obtaining

$$B = -\frac{1}{2}H D^2 H,$$

where \(D^2\) is the componentwise square of D, and \(H = I - \mathbf {1}\mathbf {1}^T/n\), with \(\mathbf {1}\) the all-ones vector and n the size of D. Next, we perform an eigenvalue decomposition of B, keeping only the two largest positive eigenvalues \(\lambda _1, \lambda _2\) and corresponding eigenvectors \(\mathbf {e}_1, \mathbf {e}_2\). The MDS coordinates are given by the columns of \(E \varLambda ^{1/2}\), where E is the matrix containing the two eigenvectors \(\mathbf {e}_1, \mathbf {e}_2\) and \(\varLambda \) is the diagonal matrix containing the two eigenvalues \(\lambda _1, \lambda _2\).
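The classical MDS steps just described can be sketched compactly in NumPy; this is an illustrative implementation under our own naming, not DAC's code:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical MDS on a distance matrix D, keeping `dim` coordinates.

    Double-centers the entrywise-squared distances with H = I - 11^T/n,
    then keeps the largest positive eigenvalues and eigenvectors of
    B = -1/2 H D^2 H.
    """
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (D ** 2) @ H
    # eigh returns eigenvalues of the symmetric matrix B in ascending order.
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]      # largest eigenvalues first
    lam = np.maximum(vals[idx], 0.0)        # guard against round-off negatives
    return vecs[:, idx] * np.sqrt(lam)      # columns of E Lambda^{1/2}
```

For a distance matrix generated by points on a line, the recovered coordinates reproduce the original pairwise distances exactly, which is a convenient sanity check.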

Finally, we note that the eigenvectors computed by MDS are unique only up to sign. This fact can manifest itself as disconcerting coordinate flips in the DAC interface given even small changes in \(\varvec{\alpha }\) by the analyst. To minimize these flips, we use the Kabsch algorithm [13] to compute an optimal rotation so that the newly computed coordinates are as closely aligned to the existing coordinates as possible. The Kabsch algorithm uses the Singular Value Decomposition (SVD) to compute the optimal rotation matrix. If we assume that matrices P and Q have columns containing the previous and new MDS coordinates, then we form \(A = P^TQ\) and use the SVD to obtain \(A = U \Sigma V^T\). If we denote \(r = \text{ sign } (\det (VU^T))\) then the rotation matrix is given by

$$R = V \begin{bmatrix} 1&\ 0 \\ 0&\ r \\ \end{bmatrix} U^T.$$
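A sketch of this rotation computation in NumPy follows (function name ours); applying R on the right maps the rows of the new coordinates Q onto the previous coordinates P:

```python
import numpy as np

def kabsch_rotation(P, Q):
    """Optimal rotation aligning new MDS coordinates Q to previous ones P.

    P, Q are (n, 2) arrays of coordinates (one row per datapoint). Forms
    A = P^T Q, takes the SVD A = U S V^T, and corrects for reflections
    using r = sign(det(V U^T)), as in the text.
    """
    A = P.T @ Q
    U, _, Vt = np.linalg.svd(A)
    V = Vt.T
    r = np.sign(np.linalg.det(V @ U.T))
    # R = V diag(1, r) U^T
    return V @ np.diag([1.0, r]) @ U.T
```

In the noise-free case, where Q is exactly a rotation of P, the recovered R undoes that rotation; with real coordinate perturbations it gives the least-squares-optimal alignment.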

2.2 Time Series Differences

In addition to using MDS to visualize the relationships between the datapoints, DAC allows the user to select subsets of the dataset and upon request ranks the time series according to how well each time series separates those subsets. DAC allows two different selections and ranks the time series according to Fisher’s Discriminant [8].

To be precise, for each distance matrix \(D_k\), we compute the values of Fisher’s Discriminant \(J_k (u,v)\), where \(u,v \subset \{x_i\}\) are two groups that we wish to contrast. By definition,

$$J_k(u,v) = \frac{\Vert \bar{u} - \bar{v} \Vert ^2}{S_u^2 + S_v^2},$$

where \(S_u^2 = \sum _i \Vert u_i - \bar{u} \Vert ^2, S_v^2 = \sum _j \Vert v_j - \bar{v} \Vert ^2\), and \(\bar{u}, \bar{v}\) are averages over the sets \(\{u_1, \dots , u_n \}, \{v_1, \dots , v_m \}\). Although we do not provide the algebraic derivation, we claim that

$$\Vert \bar{u} - \bar{v} \Vert ^2 = \frac{1}{n} \frac{1}{m} \sum _{ij} d^2 (u_i, v_j) - \frac{1}{2 n^2} \sum _{ik} d^2 (u_i, u_k) - \frac{1}{2 m^2} \sum _{jk} d^2 (v_j, v_k),$$

where k varies over i for \(\sum _{ik}\) and k varies over j for \(\sum _{jk}\). We similarly claim that \(S_u^2 = \frac{1}{2n} \sum _{ik} d^2 (u_i, u_k)\) and \(S_v^2 = \frac{1}{2m} \sum _{jk} d^2(v_j, v_k)\). Now we can compute \(J_k(u,v)\) using only submatrices of the distance matrices \(D_k\).
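Using the identities above, \(J_k(u,v)\) can be evaluated directly from submatrices of \(D_k\); a NumPy sketch (with a function name of our choosing) is:

```python
import numpy as np

def fisher_discriminant(Dk, u_idx, v_idx):
    """Fisher's Discriminant J_k(u, v) from submatrices of D_k.

    Dk           : (n, n) distance matrix for time series k
    u_idx, v_idx : index lists for the two analyst-selected groups
    Uses the distance-only identities from the text; no coordinates needed.
    """
    Duv = Dk[np.ix_(u_idx, v_idx)] ** 2
    Duu = Dk[np.ix_(u_idx, u_idx)] ** 2
    Dvv = Dk[np.ix_(v_idx, v_idx)] ** 2
    n, m = len(u_idx), len(v_idx)
    # ||u_bar - v_bar||^2 via the between/within decomposition
    mean_sep = (Duv.sum() / (n * m)
                - Duu.sum() / (2 * n ** 2)
                - Dvv.sum() / (2 * m ** 2))
    Su2 = Duu.sum() / (2 * n)
    Sv2 = Dvv.sum() / (2 * m)
    return mean_sep / (Su2 + Sv2)
```

Ranking the time series then amounts to evaluating this quantity once per \(D_k\) and sorting in descending order.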

DAC ranks the time series in descending order of the values of \(J_k(u,v)\). Since a higher value of Fisher’s Discriminant \(J_k(u,v)\) indicates a greater separation between the selections, this ranking reveals the time series which exhibit the greatest differences between the subsets.

2.3 Clustering by Metadata

Often an analyst will be interested in metadata describing the datapoints. Questions might include: Does the dataset cluster relative to a particular metadata variable? Can we make the dataset cluster relative to that variable by adjusting the \(\varvec{\alpha }\) weights of the time series? Which time series are most affected by a metadata variable? To address these questions, we incorporate a supervised optimization of the visualization which correlates the distances between time series with the distances between metadata values.

Specifically, we compute \(\varvec{\alpha }\) such that the distances in \(D^2 = \sum _k \alpha _k^2 D_k^2\) are as close as possible to the distances in the matrix \(D_p^2\), where \(D_p\) is a pairwise distance matrix for a given metadata property p. In other words, we want to solve

$$ \begin{array}{rcl} \min \nolimits _{\varvec{\alpha }} &{} &{} \sum \nolimits _{ij} (\sum \nolimits _k \alpha _k^2 d_k^2(x_i, x_j) - d_p^2(x_i, x_j))^2 \\ \text{ s.t. } &{} &{} \alpha _k \in [0,1], \end{array}$$

where \(d_p(x_i, x_j)\) is the property distance between \(x_i\) and \(x_j\), i.e. \(d_p(x_i, x_j) = |p_i - p_j|\), where \(p_i\) is the metadata property of \(x_i\) and \(p_j\) is the metadata property of \(x_j\). Note that for MDS, we can scale \(\varvec{\alpha }\) by a positive scalar with no effect, so that the constraint \(\alpha _k \in [0,1]\) is unnecessary. If we let \(\beta _k = \alpha _k^2\) we have

$$ \begin{array}{rcl} \min \nolimits _{\varvec{\beta }} &{} &{} \sum \nolimits _{ij} (\sum \nolimits _k \beta _k d_k^2(x_i, x_j) - d_p^2(x_i, x_j))^2 \\ \text{ s.t. } &{} &{} \beta _k \ge 0. \end{array}$$

In the Frobenius matrix norm, we have

$$ \begin{array}{rcl} \min \nolimits _{\varvec{\beta }} &{} &{} \Vert \sum \nolimits _k \beta _k D_k^2 - D_p^2 \Vert _F^2 \\ \text{ s.t. } &{} &{} \beta _k \ge 0. \end{array}$$

Now if we let \(U = [D_1^2, D_2^2, \cdots ]\), where each \(D_k^2\) is written as a column vector, and \(V=[D_p^2]\), where \(D_p^2\) is written as a column vector, then we have

$$ \begin{array}{rcl} \min \nolimits _{\varvec{\beta }} &{} &{} \Vert U\varvec{\beta } - V \Vert ^2 \\ \text{ s.t. } &{} &{} \varvec{\beta } \ge 0. \end{array}$$

This is known as a non-negative least squares problem [19]. Once we compute \(\varvec{\beta }\) we can obtain time series weights \(\varvec{\alpha }\) corresponding to an MDS visualization optimized to a particular metadata property value.
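The whole optimization reduces to a few lines given the pre-computed matrices. The sketch below uses SciPy's non-negative least squares solver; note that SciPy is our assumption here (the paper mentions only NumPy), and the function name is illustrative:

```python
import numpy as np
from scipy.optimize import nnls

def optimize_weights(Dks, Dp):
    """Solve min_beta ||U beta - V||^2 s.t. beta >= 0, then alpha = sqrt(beta).

    Dks : list of per-series distance matrices D_k
    Dp  : pairwise metadata distance matrix for property p
    Each entrywise-squared D_k is flattened into a column of U; the
    squared metadata distances form the target vector V.
    """
    U = np.column_stack([(Dk ** 2).ravel() for Dk in Dks])
    V = (Dp ** 2).ravel()
    beta, _ = nnls(U, V)
    return np.sqrt(beta)
```

When the metadata distances are exactly reproducible by one time series, the solver puts all the weight on that series, which matches the intuition that the optimized visualization emphasizes the measurements most correlated with the property.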

3 User Interface

The DAC user interface allows access to the algorithms discussed in Sect. 2. DAC assumes that the time series data has been pre-processed, metadata has been collected, and distance matrices have been computed. These assumptions allow flexibility in terms of representing the time series data and computing similarities, two steps in time series analysis served by a wide variety of different algorithms [9]. In addition, pre-computing the distance matrices ensures that DAC will operate in real-time for reasonable dataset sizes (up to \(\sim \)5000 datapoints).

The DAC user interface consists of sliders to adjust \(\alpha _k\) values (labelled using variable names meaningful to analysts), a canvas to display the MDS visualization, traditional time series plots, and a table displaying metadata. The interface is shown in Fig. 1.

Fig. 1.

DAC user interface. Here we show DAC running in Firefox on a Windows PC. Reading the labels counter-clockwise from the upper right: (A) time series data is displayed in the traditional manner; (B) MDS is used to provide a visual representation of the datapoints in the dataset, shown as circles; (C) Fisher’s Discriminant can be used to order the time series to maximize the difference between analyst selected red and blue groups; (D) time series measurements can be weighted to adjust the visualization according to analyst preference; (E) the weights can be computed optimally to correlate with an analyst chosen metadata field; and (F) metadata can be examined. (Color figure online)

Fig. 2.

DAC weather data. On the left (A), we show the DAC MDS visualization of the cities in the weather dataset. In the middle (B), the visualization is colored by average temperature, where yellow is low and brown is high. On the right (C), the visualization is colored by annual precipitation, where yellow is again low and brown is high. (Color figure online)

Fig. 3.

DAC analyst selections. Here we show two selections made by the analyst, cities on the left hand side of the visualization are selected in blue, and cities in the upper right are selected in red. The same coloring scheme is automatically reflected in the metadata table and colored time series plots are shown on the right. The blue cities include Madison, WI and Milwaukee, WI, and the red cities include Mesa, AZ. By pushing the difference button, DAC ranks and orders the time series plots in the right hand panel of the interface. In this case, humidity gives the greatest difference between the red and blue cities, followed by temperature. (Color figure online)

Fig. 4.

DAC optimal MDS. Here we show the optimal MDS coordinates correlated with latitude, computed according to the algorithm in Sect. 2.3. The \(\varvec{\alpha }\) values are automatically adjusted to show that temperature, dew point, and sea level pressure are best suited to represent latitude, and the previous city selections show that the cold wet cities tend to be in the north and the hot dry cities tend to be in the south. (Color figure online)

The DAC interface is a Slycat plugin (slycat.readthedocs.org) [7]. Slycat supports the management of multiple users, multiple datasets, and access controls. Both Slycat and DAC are implemented using JavaScript and Python. DAC is written in JavaScript using jQuery for the controls. The time series and MDS plots are rendered and animated using D3, and the metadata table uses SlickGrid. Calculations are performed on the Slycat webserver using Python and NumPy. Slycat is open source (github.com/sandialabs/slycat) and DAC will be released as open source in the near future.

4 Example

To demonstrate how DAC might be used by an analyst, we provide an example using publicly available weather data. The data consists of weather time series data from Weather Underground (www.wunderground.com) during the year 2014 for the 100 most populated cities in the United States. The time series measurements include temperature, dew point, humidity, sea level pressure, visibility, wind speed, precipitation, cloud cover, and wind direction. Metadata for the cities includes city name, state, time zone, population, latitude, longitude, average temperature, average humidity, annual precipitation, average wind speed, average visibility, and average cloud cover.

Upon starting, DAC produces an MDS visualization of the dataset assuming \(\varvec{\alpha } = \mathbf {1}\). For the weather data, this visualization is shown in Fig. 2(A). Among the simplest functions provided by DAC is the ability to color the datapoints according to analyst selected metadata. A coloring of the weather data by average temperature is shown in Fig. 2(B) and by annual precipitation in Fig. 2(C).

From the coloring, it appears that cities on the left hand side of the visualization are cold and wet, while cities on the upper right are hot and dry. This can be confirmed by selecting cities in these areas of the visualization and examining their metadata and time series, as shown in Fig. 3. The selections show that cities on the left (blue selections) are indeed cold and wet and are located in the northern and eastern parts of the country, while the cities in the upper right (red selections) are hot and dry and are located in Arizona and Nevada. By pushing the difference button, Fisher’s Discriminant is computed against the red and blue selections to rank the time series plots in the right hand panel of the DAC interface, showing that humidity and temperature give the greatest differences between the two selections.

Finally, the analyst might speculate that the latitude has a significant correlation with the MDS coordinate visualization. Coloring by latitude and pushing the cluster button produces the visualization shown in Fig. 4. This visualization is computed according to the optimization in Sect. 2.3 to obtain the MDS coordinates that best correlate with latitude. The analyst’s speculation is confirmed in that the red cities are positioned on the upper right and the blue cities are positioned on the lower left. In addition, the \(\varvec{\alpha }\) values computed show that temperature, dew point, and sea level pressure are the most significant weights in the optimized MDS coordinates. Unsurprisingly, temperature is the main influence.

5 Conclusion

Interactive visualization of multivariate time series data is a challenging problem. In addition to organizing what can be large quantities of data for display, there are many potential algorithms available for analyzing the data. We have designed a lightweight web application to bridge the gap between these two problems.

Our system, Dial-A-Cluster (DAC), allows an expert data mining practitioner to pick and choose among the available algorithms for time series representation and similarity comparison to pre-compute distance matrices for use with DAC. (Alternatively, a novice practitioner can use very simple pre-processing and Euclidean distance to compute the matrices for DAC.)

DAC in turn provides a subject matter expert a lightweight, no-installation, platform independent interface for examining the data. DAC implements a real-time MDS coordinate based abstraction for the dataset, as well as an interactive interface for examining the actual time series data and metadata. DAC uses Fisher’s Discriminant to rank and order the time series according to analyst selections. Finally, DAC provides an optimized computation for determining which time series measurements are correlated with metadata of interest to the analyst.

Instead of making the analyst an evaluator of the data mining results, DAC provides an easy to use interface which encourages the analyst to explore the data independently. Further, since DAC is implemented as a Slycat plugin, management of multiple datasets, multiple users, and access controls is also provided, encouraging collaboration between multiple analysts while maintaining data privacy.