# Graph-based predictable feature analysis


## Abstract

We propose graph-based predictable feature analysis (GPFA), a new method for unsupervised learning of predictable features from high-dimensional time series, where high predictability is understood very generically as low variance in the distribution of the next data point given the previous ones. We show how this measure of predictability can be understood in terms of graph embedding as well as how it relates to the information-theoretic measure of predictive information in special cases. We confirm the effectiveness of GPFA on different datasets, comparing it to three existing algorithms with similar objectives—namely slow feature analysis, forecastable component analysis, and predictable feature analysis—to which GPFA shows very competitive results.

## Keywords

Unsupervised learning, Dimensionality reduction, Feature learning, Representation learning, Graph embedding, Predictability

## 1 Introduction

When we consider the problem of an agent (artificial or biological) interacting with its environment, its signal processing is naturally embedded in time. In such a scenario, a feature’s ability to predict the future is a necessary condition for it to be useful in any behaviorally relevant way: A feature that does not hold information about the future is out-dated the moment it is processed and any action based on such a feature can only be expected to have random effects.

As a practical example, consider a robot interacting with its environment. When its stream of sensory input is high-dimensional (e.g., the pixel values from a camera), we are interested in mapping this input to a lower-dimensional representation to make subsequent machine learning steps and decision making more robust and efficient. At this point, however, it is crucial not to throw away information that the input stream holds about the future as any subsequent decision making will depend on this information. The same holds for time series like video, weather, or business data: when performing classification or regression on the learned features, or when the data is modelled for instance by a (hidden) Markov model, we are mostly interested in features that have some kind of predictive power.

Standard algorithms for dimensionality reduction (DR), like PCA, however, are designed to preserve properties of the data that are not (or at least not explicitly) related to predictability and thus are likely to waste valuable information that could be extracted from the data’s temporal structure. In this paper we will therefore focus on the unsupervised learning of predictable features for high-dimensional time series, that is, given a sequence of data points in a high-dimensional vector space we are looking for the projection into a sub-space which makes predictions about the future most reliable.

While aspects of predictability are (implicitly) dealt with through many different approaches in machine learning, only few algorithms have addressed this problem of finding subspaces for multivariate time series suited for predicting the future. The recently proposed *forecastable component analysis* (ForeCA) (Goerg 2013) is based on the idea that predictable signals can be recognized by their low entropy in the power spectrum while white noise in contrast would result in a power spectrum with maximal entropy. *Predictable feature analysis* (PFA) (Richthofer and Wiskott 2013) focuses on signals that are well predictable through autoregressive processes. Another DR approach that was not designed to extract predictable features but explicitly takes into account the temporal structure of the data is *slow feature analysis* (SFA) (Wiskott and Sejnowski 2002). Still, the resulting slow features can be seen as a special case of predictable features (Creutzig and Sprekeler 2008). For reinforcement learning settings, *predictive projections* (Sprague 2009) and *robotic priors* (Jonschkowski and Brock 2015) learn mappings where actions applied to similar states result in similar successor states. Also, there are recent variants of PCA that at least allow for weak statistical dependence between samples (Han and Liu 2013).

All in all, however, the field of unsupervised learning of predictable subspaces for time series is largely unexplored. Our contribution consists of a new measure of the predictability of learned features as well as of an algorithm for learning those. The proposed measure has the advantage of being very generic, of making only few assumptions about the data at hand, and of being easy to link to the information-theoretic quantity of *predictive information* (Bialek et al. 2001), that is, the mutual information between past and future. The proposed algorithm, *graph-based predictable feature analysis* (GPFA), not only shows very competitive results in practice but also has the advantage of being very flexible, and of allowing for a variety of future extensions. Through its formulation in terms of a graph embedding problem, it can be straightforwardly combined with many other, mainly geometrically motivated objectives that have been formulated in the graph embedding framework (Yan et al. 2007)—like Isomap (Tenenbaum et al. 2000), Locally Linear Embedding (LLE, Roweis and Saul 2000), Laplacian Eigenmaps (Belkin and Niyogi 2003), and Locality Preserving Projections (LPP, He and Niyogi 2004). Moreover, GPFA could make use of potential speed-ups like spectral regression (Cai et al. 2007), include additional label information in its graph like in (Escalante et al. 2013), or could be applied to non-vectorial data like text. Kernelization and other approaches to use GPFA in a non-linear way are discussed in Sect. 5.

The remainder of the paper is structured as follows. In Sect. 2 we derive the GPFA algorithm. We start by introducing a new measure of predictability (Sect. 2.1), a consistent estimate for it (Sect. 2.2), and a simplified version of the estimate which is used by the proposed algorithm as an intermediate step (Sect. 2.3). Then the link to the graph embedding framework is established in Sects. 2.4 and 2.5. After describing three useful heuristics in Sect. 2.6, the core algorithm is summarized in Sect. 2.7 and an iterated version of the algorithm is described in Sect. 2.8. Afterwards the algorithm is analyzed with respect to its objective’s close relation to predictive information (Sect. 2.9) and with respect to its time complexity (Sect. 2.10). Section 3 summarizes the most closely related approaches for predictable feature learning (namely SFA, ForeCA, and PFA) and Sect. 4 describes experiments on different datasets. We end with a discussion of limitations, open questions, and ideas for future research in Sect. 5 and with a conclusion in Sect. 6.

## 2 Graph-based predictable feature analysis

Given is a time series \(\mathbf {x}_t \in \mathbb {R}^N\), \(t = 1,\dots , S\), as training data that is assumed to be generated by a stationary stochastic process \((\varvec{X}_t)_{t}\) of order *p*. The goal of GPFA is to find a lower-dimensional feature space for that process by means of an orthogonal transformation \(\mathbf {A} \in \mathbb {R}^{N \times M}\), leading to projected random variables \(\varvec{Y}_t = \mathbf {A}^T \varvec{X}_t\) with low average variance given the state of the *p* previous time steps. To simplify notation, we use \(\varvec{X}_{t}^{(p)}\) to denote the concatenation \((\varvec{X}_{t}^T, \dots , \varvec{X}_{t-p+1}^T)^T\) of the *p* predecessors of \(\varvec{X}_{t+1}\). The corresponding state values are vectors in \(\mathbb {R}^{N \cdot p}\) and denoted by \(\mathbf {x}_t^{(p)}\).

### 2.1 Measuring predictability

Intuitively, the lower the variance of the next value of a signal given its *p*-step past, the higher the predictability. We measure this through the expected covariance matrix of \(\varvec{Y}_{t+1}\) given \(\varvec{Y}_{t}^{(p)}\) and minimize it in terms of its trace, i.e., we minimize the sum of variances in all principal directions. Formally, we look for the projection matrix \(\mathbf {A}\) leading to a projected stochastic process \((\varvec{Y}_t)_t\) with minimum

\[ {{\mathrm{trace}}}\;( \mathbb {E}_{\varvec{Y}_t^{(p)}}[ {{\mathrm{cov}}}(\varvec{Y}_{t+1} | \varvec{Y}_t^{(p)} ) ] ). \tag{1} \]

For non-Gaussian conditional distributions we assume the variance to function as a useful proxy for quantifying the uncertainty of the next step. Note, however, that assuming Gaussianity for the conditional distributions does not imply or require Gaussianity of \(\varvec{X}_t\) or of the joint distributions \(p(\varvec{X}_s, \varvec{X}_t)\), \(s \ne t\), which makes the predictability measure applicable to a wide range of stochastic processes.

### 2.2 Estimating predictability

In practice, the training data typically contains only a single successor for each observed state \(\mathbf {y}_{t}^{(p)}\), which is not enough to estimate the conditional covariances in (1) directly. We therefore use a *k*-nearest neighbor (kNN) estimate instead. Intuitively, the sample size is increased by also considering the *k* points that are most similar (e.g., in terms of Euclidean distance) to \(\mathbf {y}_{t}^{(p)}\), assuming that a distribution \(p(\varvec{Y}_{t+1} | \varvec{Y}_{t}^{(p)} = \mathbf {y'}_{t}^{(p)})\) is similar to \(p(\varvec{Y}_{t+1} | \varvec{Y}_{t}^{(p)} = \mathbf {y}_{t}^{(p)})\) if \(\mathbf {y'}_{t}^{(p)}\) is close to \(\mathbf {y}_{t}^{(p)}\). In other words, we group together signals that are similar in their past *p* steps. To that end, a set \(\mathcal {K}_{t}^{(p)}\) is constructed, containing the indices of all *k* nearest neighbors of \(\mathbf {y}_{t}^{(p)}\) (plus the 0-th neighbor, *t* itself), i.e., \(\mathcal {K}_{t}^{(p)} := \{i \, | \, \mathbf {y}_{i}^{(p)} \text { is kNN of } \mathbf {y}_{t}^{(p)}, i = 1,\dots ,S \} \cup \{ t \}\). The covariance is finally estimated based on the successors of these neighbors. Formally, the *k*-nearest neighbor estimate of (1) is given by

\[ {{\mathrm{trace}}}\;( \langle {{\mathrm{cov}}}( \{ \mathbf {y}_{i+1} \, | \, i \in \mathcal {K}_{t}^{(p)} \} ) \rangle _t ), \tag{2} \]

where \({{\mathrm{cov}}}(\cdot )\) applied to a set denotes its sample covariance matrix and \(\langle \cdot \rangle _t\) denotes the average over *t*. Note that the distance measure used for the *k* nearest neighbors does not necessarily need to be Euclidean. Think for instance of “perceived similarities” of words or faces.
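The estimate (2) can be made concrete with a short, dependency-free Python sketch. It is only an illustration under stated assumptions: the function names, the 0-based indexing, the brute-force neighbor search, and the \(1/(n-1)\) normalization of the sample covariance are choices made for this example and are not prescribed by the text.

```python
import math


def delay_states(y, p):
    """Map t -> concatenation of y[t], y[t-1], ..., y[t-p+1] (0-based t with a successor)."""
    return {t: [v for s in range(p) for v in y[t - s]] for t in range(p - 1, len(y) - 1)}


def knn_predictability(y, p=1, k=2):
    """kNN estimate of the trace of the averaged conditional covariance.

    y is a list of M-dimensional points (lists of floats); lower means more predictable.
    """
    states = delay_states(y, p)
    total = 0.0
    for t, s_t in states.items():
        # indices of the k states closest to the state at time t (Euclidean distance)
        others = sorted((i for i in states if i != t),
                        key=lambda i: math.dist(states[i], s_t))
        hood = others[:k] + [t]                    # k neighbours plus t itself
        succ = [y[i + 1] for i in hood]            # successors of the neighbours
        n, dim = len(succ), len(succ[0])
        mean = [sum(v[d] for v in succ) / n for d in range(dim)]
        # trace of the sample covariance = summed per-dimension variance
        total += sum((v[d] - mean[d]) ** 2 for v in succ for d in range(dim)) / (n - 1)
    return total / len(states)


# A signal with period 4: the value 0 is followed by +1 or -1 depending on the step before.
signal = [[float(v)] for v in [0, 1, 0, -1] * 6]
print(knn_predictability(signal, p=1, k=2))   # positive: one past step is ambiguous
print(knn_predictability(signal, p=2, k=2))   # zero: two past steps determine the next value
```

The toy signal shows why the assumed order *p* matters: with one past step the successors of similar states disagree, with two past steps they coincide.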

While we introduce the kNN estimate here to assess the uncertainty inherent in the stochastic process, we note that it may be of practical use in a deterministic setting as well. For a deterministic dynamical system the kNN estimate includes nearby points belonging to nearby trajectories in the dataset. Thus, the resulting feature space may be understood as one with small divergence of neighboring trajectories (as measured through the Lyapunov exponent, for instance).

### 2.3 Simplifying predictability

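Finding the transformation \(\mathbf {A}\) that minimizes (2) is difficult because the neighborhoods \(\mathcal {K}_{t}^{(p)}\) are defined in the feature space and can therefore only be computed *after* \(\mathbf {A}\) has been fixed. The circular nature of this optimization problem motivates the iterated algorithm described in Sect. 2.8. As a helpful intermediate step we define a weaker measure of predictability that is conditioned on the input \(\varvec{X}_t\) instead of the features \(\varvec{Y}_t\) and has a closed-form solution, namely minimizing

\[ {{\mathrm{trace}}}\;( \mathbb {E}_{\varvec{X}_t^{(p)}}[ {{\mathrm{cov}}}(\varvec{Y}_{t+1} | \varvec{X}_t^{(p)} ) ] ). \]

The corresponding *k*-nearest neighbor estimate is

\[ {{\mathrm{trace}}}\;( \langle {{\mathrm{cov}}}( \{ \mathbf {y}_{i+1} \, | \, i \in \tilde{\mathcal {K}}_{t}^{(p)} \} ) \rangle _t ), \tag{3} \]

where \(\tilde{\mathcal {K}}_{t}^{(p)}\) contains the indices of the *k* nearest neighbors of \(\mathbf {x}_{t}^{(p)}\) plus *t* itself. Under certain mild mixing assumptions for the stochastic process, the textbook results on *k*-nearest neighbor estimates can be applied to auto-regressive time series as well (Collomb 1985). Thus, in the limit of \(S \rightarrow \infty \), \(k \rightarrow \infty \), \(k/S \rightarrow 0\), the estimated covariance \({{\mathrm{cov}}}( \{ \mathbf {y}_{i+1} \, | \, i \in \tilde{\mathcal {K}}_{t}^{(p)} \} )\) converges to \({{\mathrm{cov}}}(\varvec{Y}_{t+1} | \varvec{X}_t^{(p)} = \mathbf {x}_t^{(p)})\), i.e., the estimate is consistent.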

When measuring predictability, one assumption made about the process \((\varvec{X}_t)_{t}\) in the following is that it is already white, i.e., \(\mathbb {E}[\varvec{X}_t] = \mathbf {0}\) and \({{\mathrm{cov}}}(\varvec{X}_t) = \mathbf {I}\) for all *t*. Otherwise components with lower variance would tend to have higher predictability *per se*.

### 2.4 Predictability as graph

By incrementing the weights^{2} of the edges \(\{\mathbf {y}_{i+1}, \mathbf {y}_{j+1}\}\) for all \(i,j \in \tilde{\mathcal {K}}_t^{(p)}\), \(t = p, \dots , S-1\), minimizing the graph-embedding objective (4) directly leads to the minimization of (3).

Note that for the construction of the graph, the data actually does not need to be represented by points in a vector space. Data points also could, for instance, be words from a text corpus as long as there are either enough samples per word or there is an applicable distance measure to determine “neighboring words” for the *k*-nearest neighbor estimates.
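The step from a covariance-based estimate such as (3) to pairwise graph edges plausibly rests on a standard identity: the trace of the sample covariance of \(n\) points (with \(1/n\) normalization) equals the sum of squared distances over all ordered pairs divided by \(2n^2\). The following Python sketch checks this identity numerically; the function names and the example points are illustrative assumptions.

```python
def trace_of_covariance(points):
    """Trace of the (1/n-normalised) sample covariance of a list of vectors."""
    n, dim = len(points), len(points[0])
    mean = [sum(v[d] for v in points) / n for d in range(dim)]
    return sum((v[d] - mean[d]) ** 2 for v in points for d in range(dim)) / n


def scaled_pairwise_distances(points):
    """Sum of squared distances over all ordered pairs, divided by 2 n^2."""
    n = len(points)
    total = sum((a - b) ** 2 for u in points for v in points for a, b in zip(u, v))
    return total / (2 * n * n)


# Hypothetical successors of one neighbourhood (corners of a square).
successors = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
print(trace_of_covariance(successors), scaled_pairwise_distances(successors))
```

Because every pair inside a neighborhood contributes one squared distance, adding one edge per pair makes the weighted sum of squared edge lengths proportional to the summed variances.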

### 2.5 Graph embedding

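With the data matrix \(\mathbf {X} = (\mathbf {x}_1, \dots , \mathbf {x}_S) \in \mathbb {R}^{N \times S}\), the diagonal degree matrix \(\mathbf {D}\) with \(\mathrm {D}_{ii} = \sum _j \mathrm {W}_{ij}\), and the graph Laplacian \(\mathbf {L} = \mathbf {D} - \mathbf {W}\), the orthogonal transformation that minimizes the graph-embedding objective is given by the first (“smallest”) *M* eigenvectors of the eigenvalue problem

\[ \mathbf {X} \mathbf {L} \mathbf {X}^T \mathbf {a} = \lambda \mathbf {a}. \]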

### 2.6 Additional heuristics

The following three heuristics proved to be useful for improving the results in practice.

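#### 2.6.1 Normalized graph embedding

First, the embedding is computed in its normalized form, i.e., as the solution of the generalized eigenvalue problem

\[ \mathbf {X} \mathbf {L} \mathbf {X}^T \mathbf {a} = \lambda \mathbf {X} \mathbf {D} \mathbf {X}^T \mathbf {a}, \tag{8} \]

which is the problem solved in step 4 of the algorithm in Sect. 2.7.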

#### 2.6.2 Minimizing variance of the past

Second, while not being directly linked to the above measure of predictability, results benefit significantly when the variance of the past is minimized simultaneously to that of the future. To be precise, additional edges \(\{\mathbf {y}_{i-p}, \mathbf {y}_{j-p}\}\) are added to the graph for all \(i,j \in \tilde{\mathcal {K}}_t^{(p)}\), \(t=p+1,\dots ,S\). The proposed edges here have the effect of mapping states with *similar* futures to *similar* locations in feature space. In other words, states are represented with respect to what is expected in the next steps (not with respect to their past). Conceptually this is related to the idea of *causal states* (Shalizi and Crutchfield 2001), where all (discrete) states that share the same conditional distribution over possible futures are mapped to the same causal state (also see Still (2009) for a closely related formulation in interactive settings).

#### 2.6.3 Star-like graph structure

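Third, the fully connected neighborhoods can be replaced by a star-like structure: instead of connecting all pairs of successors, only the edges \(\{\mathbf {y}_{i+1}, \mathbf {y}_{t+1}\}\) for \(i \in \tilde{\mathcal {K}}_t^{(p)} \setminus \{t\}\) are added, which gives \(\mathbf {y}_{t+1}\) the central role in its neighborhood (cf. variant (2) of steps 2 and 3 in Sect. 2.7).

We refer to the resulting algorithms as GPFA (1) and GPFA (2), corresponding to the graphs defined through (5) and (9), respectively. See Fig. 1 for an illustration of both graphs. The differences in performance are empirically evaluated in Sect. 4.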

### 2.7 Algorithm

1. **Calculate neighborhood** For every \(\mathbf {x}_t^{(p)}\), \(t=p,\dots , S\), calculate index set \(\tilde{\mathcal {K}}_t^{(p)}\) of *k* nearest neighbors (plus *t* itself).
2. **Construct graph (future)** Initialize connection matrix \(\mathbf {W}\) to zero. For every \(t=p,\dots , S-1\), add edges, according to either
   - (1) \(\mathrm {W}_{i+1, j+1} \leftarrow \mathrm {W}_{i+1, j+1} + 1 \, \quad \forall i,j \in \tilde{\mathcal {K}}_t^{(p)}\) or
   - (2) \(\mathrm {W}_{i+1, t+1} \leftarrow \mathrm {W}_{i+1, t+1} + 1\) and \(\mathrm {W}_{t+1, i+1} \leftarrow \mathrm {W}_{t+1, i+1} + 1 \, \quad \forall i \in \tilde{\mathcal {K}}_t^{(p)} \setminus \{t\}\).
3. **Construct graph (past)** For every \(t=p+1,\dots ,S\), add edges, according to either
   - (1) \(\mathrm {W}_{i-p, j-p} \leftarrow \mathrm {W}_{i-p, j-p} + 1 \, \quad \forall i,j \in \tilde{\mathcal {K}}_t^{(p)}\) or
   - (2) \(\mathrm {W}_{i-p, t-p} \leftarrow \mathrm {W}_{i-p, t-p} + 1\) and \(\mathrm {W}_{t-p, i-p} \leftarrow \mathrm {W}_{t-p, i-p} + 1 \,\quad \forall i \in \tilde{\mathcal {K}}_t^{(p)} \setminus \{t\}\).
4. **Linear graph embedding** Calculate \(\mathbf {L}\) and \(\mathbf {D}\) as defined in Sect. 2.5. Find the first (“smallest”) *M* solutions to \(\mathbf {X} \mathbf {L} \mathbf {X}^T \mathbf {a} = \lambda \mathbf {X} \mathbf {D} \mathbf {X}^T \mathbf {a}\) and normalize them, i.e., \(||\mathbf {a}|| = 1\).
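The four steps above can be sketched in Python with NumPy. This is a minimal illustration rather than the authors' implementation: the function name `gpfa_core`, the `star` flag (selecting variant (2) instead of variant (1)), the 0-based indexing, the brute-force neighbor search, the skipping of neighbors whose successor or predecessor falls outside the sequence, and the Cholesky-based solution of the generalized eigenvalue problem are all assumptions made for this example.

```python
import numpy as np


def gpfa_core(x, p=1, k=5, m=2, star=True):
    """One pass of steps 1-4 for whitened data x of shape (S, N); returns A of shape (N, m)."""
    S, N = x.shape
    # States x_t^(p): concatenation of x[t], x[t-1], ..., x[t-p+1] for t = p-1 .. S-1.
    ts = np.arange(p - 1, S)
    states = np.hstack([x[ts - s] for s in range(p)])

    # Step 1: brute-force kNN; the diagonal is set to -1 so that t itself sorts first.
    d2 = ((states[:, None, :] - states[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, -1.0)
    hoods = ts[np.argsort(d2, axis=1)[:, :k + 1]]

    # Steps 2 and 3: add edges between successors (shift +1) and predecessors (shift -p).
    W = np.zeros((S, S))
    for row, t in zip(hoods, ts):
        for shift in (1, -p):
            center = t + shift
            if not 0 <= center < S:
                continue
            idx = row + shift
            idx = idx[(idx >= 0) & (idx < S)]
            if star:                       # variant (2): edges to the centre node only
                others = idx[idx != center]
                W[others, center] += 1
                W[center, others] += 1
            else:                          # variant (1): fully connected neighbourhood
                W[np.ix_(idx, idx)] += 1

    # Step 4: generalized eigenproblem X L X^T a = lambda X D X^T a.
    D = np.diag(W.sum(axis=1))
    L = D - W
    X = x.T                                # columns are data points, as in the text
    lhs, rhs = X @ L @ X.T, X @ D @ X.T
    C_inv = np.linalg.inv(np.linalg.cholesky(rhs))        # rhs = C C^T
    _, evecs = np.linalg.eigh(C_inv @ lhs @ C_inv.T)      # eigenvalues in ascending order
    A = C_inv.T @ evecs[:, :m]
    return A / np.linalg.norm(A, axis=0)
```

For whitened data `x` of shape `(S, N)`, `y = x @ gpfa_core(x, p=1, k=5, m=2)` would give the extracted features. The dense \(S \times S\) matrices keep the sketch short but limit it to small training sets; a sparse representation of \(\mathbf {W}\) would be the natural choice otherwise.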

### 2.8 Iterated GPFA

As shown in Sect. 2.4, the core algorithm above produces features \((\varvec{Y}_t)_t\) with low \({{\mathrm{trace}}}\;( \mathbb {E}_{\varvec{X}_t^{(p)}}[ {{\mathrm{cov}}}(\varvec{Y}_{t+1} | \varvec{X}_t^{(p)} ) ] )\). In many cases these features may already be predictable in themselves, that is, they have a low \({{\mathrm{trace}}}\;( \mathbb {E}_{\varvec{Y}_t^{(p)}}[ {{\mathrm{cov}}}(\varvec{Y}_{t+1} | \varvec{Y}_t^{(p)} ) ] )\). There are, however, cases where the results of both objectives can differ significantly (see Fig. 2 for an example of such a case). Also, the *k*-nearest neighbor estimates of the covariances become increasingly unreliable in higher-dimensional spaces.

Therefore, we propose an iterated version of the core algorithm as a heuristic to address these problems. First, an approximation of the desired covariances \({{\mathrm{cov}}}(\varvec{Y}_{t+1} | \varvec{Y}_t^{(p)} = \mathbf {y}_t^{(p)})\) can be achieved by rebuilding the graph according to neighbors of \(\mathbf {y}_t^{(p)}\), not \(\mathbf {x}_t^{(p)}\). This in turn may change the whole optimization problem, which is the reason to repeat the whole procedure several times. Second, calculating the sample covariance matrices based on the *k* nearest neighbors of \(\mathbf {y}_t^{(p)} \in \mathbb {R}^{M \cdot p}\) instead of \(\mathbf {x}_t^{(p)} \in \mathbb {R}^{N \cdot p}\) counteracts the problem of unreliable *k*-nearest neighbor estimates in high-dimensional spaces, since \(M \cdot p \ll N \cdot p\).

- (a) Calculate neighborhoods \(\tilde{\mathcal {K}}_t^{(p)}\) of \(\mathbf {x}_t^{(p)}\) for \(t=p, \dots , S-1\).
- (b) Perform steps 2–4 of GPFA as described in Sect. 2.7.
- (c) Calculate projections \(\mathbf {y}_t = \mathbf {A}^T \mathbf {x}_t\) for \(t=1, \dots , S\).
- (d) Calculate neighborhoods^{3} \(\mathcal {K}_t^{(p)}\) of \(\mathbf {y}_t^{(p)}\) for \(t=p, \dots , S-1\).
- (e) Start from step (b), using \(\mathcal {K}_t^{(p)}\) instead of \(\tilde{\mathcal {K}}_t^{(p)}\).

The procedure is repeated for *R* iterations or until convergence.

The dimensionality *M* of the intermediate projections \(\mathbf {y}_t \in \mathbb {R}^M\) is chosen to be the same as for the final feature space.

### 2.9 Relationship to predictive information

*Predictive information*—that is, the mutual information between past states and future states—has been used as a natural measure of how well-predictable a stochastic process is (e.g., Bialek and Tishby 1999; Shalizi and Crutchfield 2001). In this section we discuss under which conditions the objective of GPFA corresponds to extracting features with maximal predictive information.

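Consider again the stationary stochastic process \((\varvec{X}_t)_{t}\) of order *p* and its extracted features \(\varvec{Y}_t = \mathbf {A}^T \varvec{X}_t\). Their predictive information is given by

\[ I(\varvec{Y}_{t+1} ; \varvec{Y}_t^{(p)}) = H(\varvec{Y}_{t+1}) - H(\varvec{Y}_{t+1} | \varvec{Y}_t^{(p)}), \]

where \(H\) denotes the (differential) entropy.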

If we assume \(\varvec{Y}_{t+1}\) to be normally distributed (which can be justified by the fact that it corresponds to a mixture of a potentially high number of distributions from the original high-dimensional space), then its differential entropy is given by \(H( \varvec{Y}_{t+1} ) = \frac{1}{2} \log \{ (2 \pi e)^M \} + \frac{1}{2} \log \{ | {{\mathrm{cov}}}( \varvec{Y}_{t+1} ) | \}\) and is thus a strictly increasing function of the determinant of its covariance. Now recall that \((\varvec{X}_t)_{t}\) is assumed to have zero mean and covariance \(\mathbf {I}\). Thus, \({{\mathrm{cov}}}(\varvec{Y}_{t+1}) = \mathbf {I}\) holds independently of the selected transformation \(\mathbf {A}\), which makes \(H(\varvec{Y}_{t+1})\) independent of \(\mathbf {A}\) too.

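What remains is the conditional entropy \(H(\varvec{Y}_{t+1} | \varvec{Y}_t^{(p)})\). If we assume the conditional distribution to be Gaussian with a covariance \(\mathbf {\Sigma }_{\varvec{Y}_{t+1} | \varvec{Y}_t^{(p)}}\) that does not depend on the value of \(\varvec{Y}_t^{(p)}\), this entropy is a strictly increasing function of the determinant \(| \mathbf {\Sigma }_{\varvec{Y}_{t+1} | \varvec{Y}_t^{(p)}} |\). Minimizing the trace of this covariance amounts to selecting the principal directions that correspond to the *M* smallest eigenvalues. Thereby the determinant \(| \mathbf {\Sigma }_{\varvec{Y}_{t+1} | \varvec{Y}_t^{(p)}} |\) is minimized as well, since, like for the trace, its minimization only depends on the selection of the smallest eigenvalues. Thus, GPFA produces features with the maximum predictive information under this assumption of a prediction error \(\mathbf {\Sigma }_{\varvec{Y}_{t+1} | \varvec{Y}_t^{(p)}}\) independent of the value of \(\varvec{Y}_t^{(p)}\) (and to the degree that the iterated heuristic in Sect. 2.8 minimizes (1)).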
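Under the assumptions discussed in this section (Gaussian \(\varvec{Y}_{t+1}\), whitened input, and a Gaussian conditional distribution whose covariance \(\mathbf {\Sigma }\) does not depend on the value of \(\varvec{Y}_t^{(p)}\)), the predictive information can be written out explicitly. The following is a sketch under exactly these assumptions; \(\lambda _i\) denotes the eigenvalues of \(\mathbf {\Sigma }\), and the one-dimensional number in the final comment is purely illustrative.

```latex
\begin{aligned}
I(\varvec{Y}_{t+1}; \varvec{Y}_t^{(p)})
  &= H(\varvec{Y}_{t+1}) - H(\varvec{Y}_{t+1} \mid \varvec{Y}_t^{(p)}) \\
  &= \tfrac{1}{2} \log \{ (2 \pi e)^M \, |\mathbf{I}| \}
   - \tfrac{1}{2} \log \{ (2 \pi e)^M \, |\mathbf{\Sigma}| \} \\
  &= -\tfrac{1}{2} \log |\mathbf{\Sigma}|
   = -\tfrac{1}{2} \sum_{i=1}^{M} \log \lambda_i .
\end{aligned}
% Example: M = 1 and lambda_1 = 1/4 give I = (1/2) log 4 = log 2, i.e., one bit.
```

Each term \(-\frac{1}{2} \log \lambda _i\) grows as the corresponding conditional variance shrinks, so choosing the directions with the smallest eigenvalues maximizes the sum.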

### 2.10 Time complexity

In the following we derive GPFA’s asymptotic time complexity as a function of the number of training samples *S*, input dimensions *N*, process order *p*, output dimensions *M*, number of iterations *R*, and the neighborhood size *k*.

#### 2.10.1 *k*-nearest-neighbor search

The first computationally expensive step of GPFA is the *k*-nearest-neighbor search. When we naively assume a brute-force approach, it can be realized in \(\mathcal {O}(N p S)\) for a single data point. This search is repeated for each of the *S* data points and for each of the *R* iterations (in \(N \cdot p\) dimensions for the first iteration and in \(M \cdot p\) for all others). Thus, the *k*-nearest-neighbor search in the worst case has a time complexity of

\[ \mathcal {O}(N p S^2 + R M p S^2). \]

Note, however, that more efficient approaches to *k*-nearest-neighbor search exist.

#### 2.10.2 Matrix multiplications

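The second expensive step consists of the matrix multiplications in (8). For the multiplication of two dense matrices of sizes \(l \times m\) and \(m \times n\) we assume a complexity of \(\mathcal {O}(l m n)\). For the multiplication of a sparse matrix with a dense matrix with *n* columns, with *L* being the number of non-zero elements of the sparse matrix, we assume \(\mathcal {O}(L n)\). This gives us a complexity of \(\mathcal {O}(N^2 S + L N)\) for the left-hand side of (8). For GPFA (1) there is a maximum of \(L = 2 k^2 S\) non-zero elements (corresponding to the edges added to the graph, which are not all unique), for GPFA (2) there is a maximum of \(L = 2 k S\). The right-hand side of (8) then can be ignored since its complexity of \(\mathcal {O}(N^2 S + S N)\) is completely dominated by the left-hand side. Factoring in the number of iterations *R*, we finally have computational costs of

\[ \mathcal {O}(R N^2 S + R L N). \]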

#### 2.10.3 Eigenvalue decomposition

For solving the eigenvalue problem (8) *R* times we assume an additional time complexity of \(\mathcal {O}(R N^3)\). This is again a conservative guess because only the first *M* eigenvectors need to be calculated.

#### 2.10.4 Overall time complexity

Taking together the components above, GPFA has a time complexity of \(\mathcal {O}(N p S^2 + R M p S^2 + R N^2 S + R L N + R N^3)\) with \(L = k^2 S\) for GPFA (1) and \(L = k S\) for GPFA (2). In terms of the individual variables, that is: \(\mathcal {O}(S^2)\), \(\mathcal {O}(N^3)\), \(\mathcal {O}(M)\), \(\mathcal {O}(p)\), \(\mathcal {O}(R)\), and \(\mathcal {O}(k^2)\) or \(\mathcal {O}(k)\) for GPFA (1) or GPFA (2), respectively.
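To get a feeling for which term dominates in a given setting, the individual terms of the bound can simply be evaluated. The helper below is a hypothetical illustration that ignores all constant factors; the function name and the example values of \(N\), \(p\), and \(k\) are assumptions, while \(S = 10{,}000\), \(M = 5\), and \(R = 50\) mirror settings mentioned in Sect. 4.

```python
def gpfa_cost_terms(S, N, M, p, R, k, star=True):
    """Evaluate the terms of the stated bound (constant factors ignored)."""
    L = k * S if star else k * k * S          # non-zero graph weights, GPFA (2) vs. GPFA (1)
    return {
        "knn_first_iteration": N * p * S ** 2,
        "knn_later_iterations": R * M * p * S ** 2,
        "matrix_products": R * (N ** 2 * S + L * N),
        "eigendecomposition": R * N ** 3,
    }


terms = gpfa_cost_terms(S=10_000, N=100, M=5, p=2, R=50, k=10)
for name, value in sorted(terms.items(), key=lambda item: -item[1]):
    print(f"{name:>22}: {value:.1e}")
```

With these example values the brute-force neighbor search is the largest term, which is consistent with the quadratic dependence on \(S\) in the bound.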

## 3 Related methods

In this section we briefly summarize the algorithms most closely related to GPFA, namely SFA, ForeCA, and PFA.

### 3.1 SFA

Although SFA was originally developed to model aspects of the visual cortex, it has been successfully applied to different problems in technical domains as well (see Escalante et al. 2012 for a short overview), for example to state-of-the-art age estimation (Escalante et al. 2016). It is one of the few DR algorithms that considers the temporal structure of the data. In particular, slowly varying signals can be seen as a special case of predictable features (Creutzig and Sprekeler 2008). It is also possible to reformulate the slowness principle implemented by SFA in terms of graph embedding, for instance to incorporate label information into the optimization problem (Escalante et al. 2013).

Adopting the notation from above, SFA finds an orthogonal transformation \(\mathbf {A} \in \mathbb {R}^{N \times M}\) such that the extracted signals \(\mathbf {y}_t = \mathbf {A}^T \mathbf {x}_t\) have minimum temporal variation \(\langle || \mathbf {y}_{t+1} - \mathbf {y}_t ||^2 \rangle _t\). The input vectors \(\mathbf {x}_t\)—and thus \(\mathbf {y}_t\) as well—are assumed to be white.
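The slowness objective \(\langle || \mathbf {y}_{t+1} - \mathbf {y}_t ||^2 \rangle _t\) is easy to evaluate for a given feature sequence. The following Python sketch only computes this measure (it does not perform the SFA optimization); the function name and the two toy signals are assumptions, and a fair comparison between signals would additionally require them to be whitened as stated above.

```python
def temporal_variation(y):
    """Mean squared difference between consecutive feature vectors."""
    diffs = [sum((b - a) ** 2 for a, b in zip(y[t], y[t + 1]))
             for t in range(len(y) - 1)]
    return sum(diffs) / len(diffs)


slow = [[0.0], [0.5], [1.0], [0.5], [0.0]]      # changes by 0.5 per step
fast = [[1.0], [-1.0], [1.0], [-1.0]]           # flips sign at every step
print(temporal_variation(slow), temporal_variation(fast))
```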

### 3.2 ForeCA

In case of ForeCA (Goerg 2013), \((\varvec{X}_t)_{t}\) is assumed to be a stationary second-order process and the goal of the algorithm is finding an extraction vector \(\mathbf {a}\) such that the projected signals \(Y_t = \mathbf {a}^T \varvec{X}_t\) are as *forecastable* as possible, that is, having a low entropy in their power spectrum. Like SFA, ForeCA has the advantage of being completely model- and parameter-free.

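For the formal definition of *forecastability*, first consider the signal’s autocovariance function \(\gamma _Y(l) = \mathbb {E}[(Y_t - \mu _Y)(Y_{t-l} - \mu _Y)]\), with \(\mu _Y\) being the mean value, and the corresponding autocorrelation function \(\rho _Y(l) = \gamma _Y(l) / \gamma _Y(0)\). The spectral density of the process can be calculated as the Fourier transform of the autocorrelation function, i.e., as

\[ f_Y(\lambda ) = \frac{1}{2 \pi } \sum _{j=-\infty }^{\infty } \rho _Y(j) \, e^{i j \lambda }, \quad \lambda \in [-\pi , \pi ]. \]

Since \(f_Y(\lambda ) \ge 0\) and \(\int _{-\pi }^{\pi } f_Y(\lambda ) \, d\lambda = 1\), the spectral density can be interpreted as a probability density function with differential entropy \(H(Y_t) = -\int _{-\pi }^{\pi } f_Y(\lambda ) \log f_Y(\lambda ) \, d\lambda \). White noise has a flat spectral density and thus the maximum entropy \(\log (2 \pi )\), which motivates the definition of *forecastability* as

\[ \varOmega (Y_t) := 1 - \frac{H(Y_t)}{\log (2 \pi )} \in [0, 1]. \]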
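A discrete analogue of this measure can be sketched in plain Python. Note the simplifications, which are assumptions of this example and not part of ForeCA itself: the spectral density is replaced by the raw periodogram at the non-zero Fourier frequencies, normalized to sum to one, and the entropy is normalized by the logarithm of the number of frequency bins. ForeCA as published relies on a proper spectral density estimate.

```python
import cmath
import math
import random


def forecastability(signal):
    """Discrete analogue of 1 - H(spectrum) / log(number of frequency bins)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    # Periodogram at the non-zero Fourier frequencies 1 .. n-1 (naive O(n^2) DFT).
    power = []
    for f in range(1, n):
        coeff = sum(v * cmath.exp(-2j * math.pi * f * t / n)
                    for t, v in enumerate(centered))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    probs = [pw / total for pw in power]
    entropy = -sum(q * math.log(q) for q in probs if q > 0)
    return 1 - entropy / math.log(len(probs))


sine = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(64)]
print(forecastability(sine), forecastability(noise))
```

The sine concentrates its power in two mirrored frequency bins and therefore scores close to one, whereas the noise spreads its power over all bins and scores much lower.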

### 3.3 PFA

PFA is based on the assumption that the signal can be modeled by a linear autoregressive process of order *p*. Let \(\mathbf {\zeta }_t = (\mathbf {x}_{t-1}^T, \dots , \mathbf {x}_{t-p}^T)^T \in \mathbb {R}^{N \cdot p}\) denote the *p*-step history of \(\mathbf {x}_t\). Let further \(\mathbf {W} \in \mathbb {R}^{N \times N \cdot p}\) contain the coefficients that minimize the error of predicting \(\mathbf {x}_t\) from its own history, i.e., \(\langle \Vert \mathbf {x}_t - \mathbf {W} \mathbf {\zeta }_t \Vert ^2 \rangle _t\). Then minimizing \(\langle \Vert \mathbf {A}^T \mathbf {x}_t - \mathbf {A}^T \mathbf {W} \mathbf {\zeta }_t \Vert ^2 \rangle _t\) with respect to \(\mathbf {A}\) corresponds to a PCA (in the sense of finding the directions of smallest variance) on that prediction error. Minimizing this prediction error, however, does not necessarily lead to features \(\mathbf {y}_t = \mathbf {A}^T \mathbf {x}_t\) that are best for predicting their own future because the calculated prediction was based on the history of \(\mathbf {x}_t\), not \(\mathbf {y}_t\) alone. Therefore an additional heuristic is proposed that is based on the intuition that the inherited errors of *K* times repeated autoregressive predictions create an even stronger incentive to avoid unpredictable components.

Like the other algorithms, PFA includes a preprocessing step to whiten the data. So far, PFA has been shown to work on artificially generated data. For further details about PFA, see Richthofer and Wiskott (2013).

## 4 Experiments

We conducted experiments^{4} on different datasets to compare GPFA to SFA, ForeCA, and PFA. As a baseline, we compared the features extracted by all algorithms to features that were created by projecting into an arbitrary (i.e., randomly selected) *M*-dimensional subspace of the data’s *N*-dimensional vector space.

For all experiments, first the training set was whitened and then the same whitening transformation was applied to the test set. After training, the learned projection was used to extract the most predictable *M*-dimensional signal from the test set with each of the algorithms. The extracted signals were evaluated in terms of their empirical predictability (2). The neighborhood size used for this evaluation is called *q* in the following to distinguish it from the neighborhood size *k* used during the training of GPFA. Since there is no natural choice for the different evaluation functions that effectively result from different *q*, we arbitrarily chose \(q=10\) but also include plots on how results change with the value of *q*. The size of training and test set will be denoted by \(S_{train}\) and \(S_{test}\), respectively. The plots show mean and standard deviation for 50 repetitions of each experiment.^{5}

### 4.1 Toy example (“predictable noise”)

We created a small toy data set to demonstrate performance differences of the different algorithms. The data set contains a particular kind of predictable signals which are challenging to identify for most algorithms. Furthermore, the example is suited to get an impression for running time constants of the different algorithms that are not apparent from the big \(\mathcal {O}\) notation in Sect. 2.10.

*K*, which was therefore set to \(K=0\).

Figure 3 shows the predictability of the signals extracted by the different algorithms and how it varies in \(S_{train}\), *N*, and *k*. Only ForeCA and GPFA are able to distinguish the two components of predictable noise from the unpredictable ones, as can be seen from reaching a variance of about 1, which corresponds to the variance of the two generated, partly predictable components. As Fig. 3b shows, the performance of both versions of GPFA (as of all other algorithms) declines with a higher number of input dimensions (but for GPFA (2) less than for GPFA (1)). At this point, a larger number of training samples is necessary to produce more reliable results (experiments not shown). The results do not differ much with the choice of *k* though.

The running time of ForeCA scales unfavorably with the number of input dimensions *N*, so that it becomes very computationally expensive to apply to time series with more than a few dozen dimensions. For that reason we excluded ForeCA from the remaining, high-dimensional experiments.

### 4.2 Auditory data

In the second set of experiments we focused on short-time Fourier transforms (STFTs) of audio files. Three public domain audio files (a silent film piano soundtrack, ambient sounds from a bar, and ambient sounds from a forest) were re-sampled to 22kHz mono. The STFTs were calculated with the Python library stft with a frame length of 512 and a cosine window function, resulting in three datasets with 26,147, 27,427, and 70,433 frames, respectively, each with 512 dimensions (after discarding complex-conjugates and representing the remaining complex values as two real values each). For each repetition of the experiment, \(S_{train} = 10{,}000\) successive frames were picked randomly as training set and \(S_{test} = 5000\) distinct and successive frames were picked as test set. PCA was calculated for each training set to preserve 99% of the variance and this transformation was applied to training and test set alike.

The critical parameters *p* and *k*, defining the assumed order of the process and the neighborhood size respectively, were selected through cross-validation to be a good compromise between working well for all values of *M* and not treating one of the algorithms unfavourably. PFA and GPFA tend to benefit from the same values for *p*. The number of iterations *R* for GPFA was found to be not very critical and was set to \(R=50\). The iteration parameter *K* of PFA was selected by searching for the best result in \(\{0\dots 10\}\), leaving all other parameters fixed.

Figures 5f, 6f, and 7f show the results for the three datasets for different numbers of output dimensions *M*. The other plots show how the results change with the individual parameters. Increasing the number of past time steps *p* tends to improve the results first but may let them decrease later (see Figs. 5a, 6a, 7a), presumably because higher values of *p* make the models more prone to overfitting. The neighborhood size *k* had to be selected carefully for each of the different datasets. While its choice was not critical on the first dataset, the second dataset benefited from low values for *k* and the third one from higher values (see Figs. 5b, 6b, 7b). Similarly, the neighborhood size *q* for calculating the final predictability of the results had different effects for different datasets (see Figs. 5c, 6c, 7c). At this point it is difficult to favor one value over another, which is why we kept *q* fixed to \(q=10\). As expected, results tend to improve with increasing numbers of training samples \(S_{train}\) (see Figs. 5d, 6d, 7d). Similarly, results first improve with the number of iterations *R* for GPFA and then remain stable (see Figs. 5e, 6e, 7e). We take this as evidence for the viability of the iteration heuristic motivated in Sect. 2.8.

To gauge the statistical reliability of the results, we applied the Wilcoxon signed-rank test, testing the null hypothesis that the results for different pairs of algorithms actually come from the same distribution. We tested this hypothesis for each data set for the experiment with default parameters, i.e., for the results shown in Figs. 5f, 6f, 7f with \(M=5\). As can be seen from the *p*-values in Table 1, the null hypothesis can be rejected with certainty in many cases, which confirms that GPFA (2) learned the most predictable features on two of three datasets. For GPFA (1) the results are clear for the first dataset as well as for the second in comparison to PFA. There remains a small probability, however, that the advantage compared to SFA on the second dataset is only due to chance. For the large third dataset, all algorithms produce relatively similar results with high variance between experiments. It depends on the exact value of *M* whether SFA or GPFA produces the best results. For \(M=5\) GPFA happened to find slightly more predictable results (not highly significant though, as can be seen in Table 1). But in general we do not see a clear advantage of GPFA on the third dataset.

### 4.3 Visual data

A third experiment was conducted on a visual dataset. We modified the simulator from the *Mario AI challenge* (Karakovskiy and Togelius 2012) to return raw visual input in gray-scale without text labels. The raw input was scaled from \(320 \times 240\) down to \(160 \times 120\) dimensions and then the final data points were taken from a small window of \(20 \times 20 = 400\) pixels at a position where much of the game dynamics happened (see Fig. 8 for an example). As with the auditory datasets, for each experiment \(S_{train} = 10{,}000\) successive training and \(S_{test} = 5000\) non-overlapping test frames were selected randomly and PCA was applied to both, preserving \(99\%\) of the variance. Eventually, *M* predictable components were extracted by each of the algorithms and evaluated with respect to their predictability (2). Parameters *p* and *k* again were selected from a range of candidate values to yield the best results (see Fig. 9a, b).

*M* (see Fig. 9f). Again, this observation is highly significant with a Wilcoxon *p*-value of 0.00 for \(M=12\).

**Table 1** *p*-values for the Wilcoxon signed-rank test, which tests the null hypothesis that a pair of samples come from the same distribution

| | STFT #1: SFA | STFT #1: PFA | STFT #2: SFA | STFT #2: PFA | STFT #3: SFA | STFT #3: PFA |
|---|---|---|---|---|---|---|
| GPFA (1) | | | 0.18 | | 0.43 | 0.09 |
| GPFA (2) | | | | | 0.38 | 0.17 |

## 5 Discussion and future work

In the previous section we saw that GPFA produced the most predictable features on a toy example with a certain kind of predictable noise as well as on two auditory datasets. However, on a third auditory dataset as well as on a visual dataset, GPFA did not show a clear advantage compared to SFA. This matches our experience with other visual datasets (not shown here). We hypothesize that SFA’s assumption that the most relevant signals are the slow ones may be especially suited to the characteristics of visual data. This also matches the fact that SFA was originally designed for, and has already proved to work well for, signal extraction from visual datasets. A detailed analysis of which algorithm and corresponding measure of predictability is best suited for what kind of data or domain remains a subject of future research.

In practical terms we conclude that GPFA (2) has some advantages over GPFA (1). First, its linear time complexity in *k* (see Sect. 2.10.2) makes a notable difference in practice (see Sect. 4.1). Second, GPFA (2) consistently produced better results (see Sect. 4) which is a bit surprising given that the fully connected graph of GPFA (1) is theoretically more sound and also matches the actual evaluation criterion (2). Our intuition here is that it is beneficial to give \(\mathbf {y}_{t+1}\) a central role in the graph because it is a more reliable estimate of the true mean of \(p(\varvec{Y}_{t+1} | \varvec{Y}_t = \mathbf {y}_t)\) than the empirical mean of all data points (stemming from different distributions) in the fully connected graph.
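This intuition can be illustrated numerically. The sketch below is not the GPFA implementation; it only contrasts two ways of measuring how much the successors of the *k* nearest neighbours of \(\mathbf {y}_t\) scatter: around their own empirical mean (which is what pairwise distances in a fully connected graph measure, up to a constant factor) and around the observed successor \(\mathbf {y}_{t+1}\) (a star-shaped arrangement). The function name `successor_spread`, the neighbourhood definition, and the toy signals are assumptions made for the example.

```python
import numpy as np

def successor_spread(Y, k):
    """Average spread of the neighbours' successors, measured in two ways.

    Returns (spread around their empirical mean, spread around y_{t+1}).
    """
    S = len(Y)
    around_mean, around_next = [], []
    for t in range(S - 1):
        # k nearest neighbours of y_t among all points that have a successor
        dists = np.sum((Y[:S - 1] - Y[t]) ** 2, axis=1)
        neighbours = np.argsort(dists)[:k]      # includes t itself
        successors = Y[neighbours + 1]
        centre = successors.mean(axis=0)
        around_mean.append(np.mean(np.sum((successors - centre) ** 2, axis=1)))
        around_next.append(np.mean(np.sum((successors - Y[t + 1]) ** 2, axis=1)))
    return np.mean(around_mean), np.mean(around_next)

rng = np.random.default_rng(0)
phase = 0.1 * np.arange(400)
rotation = np.column_stack([np.sin(phase), np.cos(phase)])
rotation += 0.05 * rng.normal(size=rotation.shape)      # predictable signal plus noise
white = rng.normal(scale=np.sqrt(0.5), size=rotation.shape)  # unpredictable signal

print(successor_spread(rotation, k=5))   # small: successors of neighbours agree
print(successor_spread(white, k=5))      # large: successors are unrelated
```

For any set of successors, the spread around \(\mathbf {y}_{t+1}\) equals the spread around the empirical mean plus the squared distance between the two centres, so the second value is never smaller than the first.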

In the form described above, GPFA performs linear feature extraction. However, we point out three strategies for extending the current algorithm to non-linear feature extraction. The first strategy is very straightforward and can be applied to the other linear feature extractors as well: in a preprocessing step, the data is expanded in a non-linear way, for instance through all polynomials up to a certain order. Afterwards, application of a linear feature extractor implicitly results in non-linear feature extraction. This strategy is usually applied to SFA, often in combination with hierarchical stacking of SFA nodes, which further increases the non-linearities while at the same time regularizing spatially (on visual data) (Escalante et al. 2012).
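The expansion step of the first strategy can be sketched in a few lines. The helper name `polynomial_expansion` is hypothetical; any linear extractor applied to its output is a polynomial function of the original input.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_expansion(X, degree=2):
    """Expand each row of X into all monomials of order 1..degree."""
    n_features = X.shape[1]
    columns = []
    for order in range(1, degree + 1):
        # Every multiset of `order` feature indices yields one monomial.
        for idx in combinations_with_replacement(range(n_features), order):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)

X = np.array([[1.0, 2.0, 3.0]])
Z = polynomial_expansion(X, degree=2)
print(Z)   # x1, x2, x3, x1*x1, x1*x2, x1*x3, x2*x2, x2*x3, x3*x3
```

Note that the number of monomials grows quickly with the input dimension and the order, which is one reason such expansions are usually preceded by dimensionality reduction.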

The other two approaches to non-linear feature extraction build upon the graph embedding framework. We already mentioned above that kernel versions of graph embedding are readily available (Yan et al. 2007; Cai et al. 2007). Another approach to non-linear graph embedding was described for an algorithm called *hierarchical generalized SFA*: A given graph is embedded by first expanding the data in a non-linear way and then calculating a lower-dimensional embedding of the graph on the expanded data. This step is repeated—each time with the original graph—resulting in an embedding for the original graph that is increasingly non-linear with every repetition (see Sprekeler 2011 for details).

Regarding the analytical understanding of GPFA, we have shown in Sect. 2.9 under which assumptions GPFA can be understood as finding the features with the highest predictive information, for instance when the underlying process is assumed to be deterministic but its states are disturbed by independent Gaussian noise. If our general goal were to minimize the coding length of the extracted signals (which would correspond to high predictive information) rather than their next-step variance, then the covariances in GPFA’s main objective (1) would need to be weighted logarithmically. Such an adaptation, however, would not be straightforward to include in the graph structure.
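The difference between the two objectives can be made explicit for the Gaussian case. As a sketch, assume that the next-step noise is Gaussian with covariance \(\varvec{\Sigma }\) in *M* dimensions and eigenvalues \(\lambda _1, \dots , \lambda _M\) (symbols introduced here for illustration only). Its differential entropy, which governs the coding length, is

\[ h = \frac{1}{2} \log \left( (2 \pi e)^M \det \varvec{\Sigma } \right) = \frac{M}{2} \log (2 \pi e) + \frac{1}{2} \sum _{i=1}^{M} \log \lambda _i , \]

whereas a variance-based objective of the form \(\mathrm {Tr}(\varvec{\Sigma }) = \sum _{i=1}^{M} \lambda _i\) sums the eigenvalues directly. Under these assumptions, the two criteria differ only in that the former weights each eigenvalue logarithmically.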

Another information-theoretic concept relevant in this context (besides predictive information) is that of information bottlenecks (Tishby et al. 2000). Given two random variables \(\varvec{A}\) and \(\varvec{B}\), an information bottleneck is a compressed variable \(\varvec{T}\) that solves the problem \(\min _{p(\mathbf {t} | \mathbf {a})} I(\varvec{A}; \varvec{T}) - \beta I(\varvec{T}; \varvec{B})\). Intuitively, \(\varvec{T}\) encodes as much information from \(\varvec{A}\) about \(\varvec{B}\) as possible while being restricted in complexity. When this idea is applied to time series such that \(\varvec{A}\) represents the past and \(\varvec{B}\) the future, then \(\varvec{T}\) can be understood as encoding the most predictable aspects of that time series. In fact, SFA has been shown to implement a special case of such a past-future information bottleneck for Gaussian variables (Creutzig and Sprekeler 2008). The relationship between GPFA and (past-future) information bottlenecks shall be investigated in the future.
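For discrete variables the bottleneck objective can be evaluated directly. The following sketch computes \(I(\varvec{A}; \varvec{T}) - \beta I(\varvec{T}; \varvec{B})\) for a given encoder \(p(\mathbf {t} | \mathbf {a})\); it only evaluates the objective and does not perform the minimization. The toy distribution and the function names are assumptions made for the example.

```python
import numpy as np

def mutual_information(p_xy):
    """Mutual information (in nats) of a discrete joint distribution."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0                       # 0 * log(0) is treated as 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))

def ib_objective(p_ab, p_t_given_a, beta):
    """I(A;T) - beta * I(T;B), where T depends on B only through A."""
    p_a = p_ab.sum(axis=1)
    p_at = p_a[:, None] * p_t_given_a     # joint p(a, t)
    p_tb = p_t_given_a.T @ p_ab           # joint p(t, b) = sum_a p(t|a) p(a, b)
    return mutual_information(p_at) - beta * mutual_information(p_tb)

# Toy example: A is uniform over four states and B = A // 2.
p_ab = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])
identity = np.eye(4)                                        # T copies A
compressed = np.array([[1, 0], [1, 0], [0, 1], [0, 1.0]])   # T keeps only A // 2

print(ib_objective(p_ab, identity, beta=2.0))
print(ib_objective(p_ab, compressed, beta=2.0))
```

In this toy case both encoders retain the same information about \(\varvec{B}\), but the compressed encoder pays less for \(I(\varvec{A}; \varvec{T})\) and therefore attains the lower objective.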

In Sect. 2.6 we introduced the heuristic of reducing the variance of the past in addition to that of the future. Effectively, this groups together parts of the feature space that have similar expected futures. This property may be especially valuable for interactive settings like reinforcement learning. For an agent navigating its environment, it is usually less relevant to know how it reached a certain state than where it can go from there. For this reason, state representations encoding the agent’s future generalize better and allow for more efficient learning of policies than state representations that encode the agent’s past (Littman et al. 2001; Rafols et al. 2005). To better address interactive settings, multiple actions may be incorporated into GPFA, for instance by conditioning the kNN search on actions. Additional edges in the graph could also allow grouping together features with similar expected rewards. We see such extensions of GPFA as an interesting avenue of future research.
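One conceivable way of conditioning the kNN search on actions is to restrict the candidate neighbours of a time step to those steps at which the same action was taken. The sketch below illustrates only this idea; it is a hypothetical variant that is not part of GPFA as described here, and the function name `knn_same_action` as well as the discrete action encoding are assumptions.

```python
import numpy as np

def knn_same_action(Y, actions, t, k):
    """Indices of the k nearest neighbours of Y[t] among steps with the same action."""
    candidates = np.flatnonzero(actions == actions[t])
    candidates = candidates[candidates != t]      # exclude the query itself
    dists = np.sum((Y[candidates] - Y[t]) ** 2, axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))             # stand-in for extracted features
actions = rng.integers(0, 3, size=200)    # stand-in for three discrete actions

neighbours = knn_same_action(Y, actions, t=10, k=5)
print(neighbours, actions[neighbours])
```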

## 6 Conclusion

We presented *graph-based predictable feature analysis* (GPFA), a new algorithm for unsupervised learning of predictable features from high-dimensional time series. We proposed to use the variance of the conditional distribution of the next time point given the previous ones to quantify the predictability of the learned representations and showed how this quantity relates to the information-theoretic measure of predictive information. As demonstrated, searching for the projection that minimizes the proposed predictability measure can be reformulated as a problem of graph embedding. Experimentally, GPFA produced very competitive results, especially on auditory STFT datasets, which makes it a promising candidate for every problem of dimensionality reduction (DR) in which the data is inherently embedded in time.

## Footnotes

1. Note that in the Gaussian case the covariance not only covers the distribution’s second moments but is sufficient to describe the higher-order moments as well.
2. All edge weights are initialized with zero.
3. Of course, this step is not necessary for the last iteration.
4. GPFA and experiments have been implemented in Python 2.7. Code and datasets will be published upon acceptance.
5. Note that while the algorithms themselves do not depend on any random effects, the data set generation does.

## References

- Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. *Neural Computation*, *15*(6), 1373–1396.
- Bialek, W., Nemenman, I., & Tishby, N. (2001). Predictability, complexity, and learning. *Neural Computation*, *13*(11), 2409–2463.
- Bialek, W., & Tishby, N. (1999). *Predictive information*. e-print arXiv:cond-mat/9902341, February 1999.
- Cai, D., He, X., & Han, J. (2007). Spectral regression: A unified approach for sparse subspace learning. In *7th IEEE International Conference on Data Mining (ICDM 2007)* (pp. 73–82). IEEE.
- Collomb, G. (1985). Non parametric time series analysis and prediction: Uniform almost sure convergence of the window and k-nn autoregression estimates. *Statistics: A Journal of Theoretical and Applied Statistics*, *16*(2), 297–307.
- Creutzig, F., & Sprekeler, H. (2008). Predictive coding and the slowness principle: An information-theoretic approach. *Neural Computation*, *20*(4), 1026–1041.
- Escalante, B., Alberto, N., & Wiskott, L. (2012). Slow feature analysis: Perspectives for technical applications of a versatile learning algorithm. *Künstliche Intelligenz (Artificial Intelligence)*, *26*(4), 341–348.
- Escalante, B., Alberto, N., & Wiskott, L. (2013). How to solve classification and regression problems on high-dimensional data with a supervised extension of slow feature analysis. *Journal of Machine Learning Research*, *14*(1), 3683–3719.
- Escalante, B., Alberto, N., & Wiskott, L. (2016). *Improved graph-based SFA: Information preservation complements the slowness principle*. e-print arXiv:1601.03945, January 2016.
- Goerg, G. (2013). Forecastable component analysis. In *Proceedings of the 30th International Conference on Machine Learning (ICML 2013)* (Vol. 28, pp. 64–72). JMLR Workshop and Conference Proceedings.
- Han, F., & Liu, H. (2013). Principal component analysis on non-gaussian dependent data. In *Proceedings of the 30th International Conference on Machine Learning (ICML 2013)* (Vol. 28, pp. 240–248). JMLR Workshop and Conference Proceedings.
- He, X., & Niyogi, P. (2004). Locality preserving projections. In T. Sebastian, K. S. Lawrence, & S. Bernhard (Eds.), *Advances in neural information processing systems* (Vol. 16, pp. 153–160). Cambridge, MA: MIT Press.
- Jonschkowski, R., & Brock, O. (2015). Learning state representations with robotic priors. *Autonomous Robots*, *39*(3), 407–428.
- Karakovskiy, S., & Togelius, J. (2012). The Mario AI benchmark and competitions. *IEEE Transactions on Computational Intelligence and AI in Games*, *4*(1), 55–67.
- Littman, M. L., Sutton, R. S., & Singh, S. (2001). Predictive representations of state. In *Advances in neural information processing systems (NIPS)* (Vol. 14, pp. 1555–1561). Cambridge, MA: MIT Press.
- Rafols, E. J., Ring, M. B., Sutton, R. S., & Tanner, B. (2005). Using predictive representations to improve generalization in reinforcement learning. In *Proceedings of the 19th international joint conference on artificial intelligence, IJCAI’05* (pp. 835–840). San Francisco, CA: Morgan Kaufmann Publishers Inc.
- Richthofer, S., & Wiskott, L. (2013). *Predictable feature analysis*. e-print arXiv:1311.2503, November 2013.
- Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. *Science*, *290*(5500), 2323–2326.
- Shalizi, C. R., & Crutchfield, J. P. (2001). Computational mechanics: Pattern and prediction, structure and simplicity. *Journal of Statistical Physics*, *104*(3–4), 817–879.
- Sprague, N. (2009). Predictive projections. In *Proceedings of the 21st international joint conference on artificial intelligence (IJCAI 2009)* (pp. 1223–1229). San Francisco, CA: Morgan Kaufmann Publishers Inc.
- Sprekeler, H. (2011). On the relation of slow feature analysis and Laplacian eigenmaps. *Neural Computation*, *23*(12), 3287–3302.
- Still, S. (2009). Information-theoretic approach to interactive learning. *Europhysics Letters*, *85*(2), 28005.
- Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. *Science*, *290*(5500), 2319–2323.
- Tishby, N., Pereira, F. C., & Bialek, W. (2000). *The information bottleneck method*. e-print arXiv:physics/0004057, April 2000.
- von Luxburg, U. (2007). A tutorial on spectral clustering. *Statistics and Computing*, *17*(4), 395–416.
- Wiskott, L., & Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances. *Neural Computation*, *14*(4), 715–770.
- Yan, S., Dong, X., Zhang, B., Zhang, H.-J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: A general framework for dimensionality reduction. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *29*(1), 40–51.