1 Introduction

Sequential data classification is a fundamental problem in machine learning. Classification requires quantifying the degree of dissimilarity between sequences. If the sequences are of equal length, it is sufficient to treat them as numerical vectors and apply conventional machine learning methods. Kernel methods can be efficient and may achieve satisfactory performance [18], provided that the sequences are not long. In reality, however, a large amount of sequential data is variable-length.

To deal with sequential data that are variable-length and possibly long, many algorithms have been proposed, e.g. dynamic time warping [1], the autoregressive kernel [8], and spectral analysis [11].

Searching for a global alignment between variable-length sequences is one way to handle such data. This methodology of non-linearly warping and matching segments of two sequences is exemplified by dynamic time warping (DTW) [21]. However, due to the non-linear warping, the triangle inequality, one of the requisites of a valid metric, is not satisfied. The DTW measurement is therefore not a true metric, and the experimental results lack a geometric interpretation [9].
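The failure of the triangle inequality can be checked on a small example. The sketch below is a textbook dynamic-programming DTW with an absolute-difference local cost (an assumption; the formulation in [21] may use a different local cost or step pattern), and the three toy sequences are made up for illustration.

```python
def dtw_distance(a, b):
    """Textbook DTW between two 1-D sequences, O(len(a) * len(b)) time.

    Local cost is |a[p] - b[q]| (an illustrative choice, not taken from the paper).
    """
    m_i, m_j = len(a), len(b)
    inf = float("inf")
    prev = [inf] * (m_j + 1)
    prev[0] = 0.0                      # empty prefixes align at zero cost
    for p in range(1, m_i + 1):
        cur = [inf] * (m_j + 1)
        for q in range(1, m_j + 1):
            cost = abs(a[p - 1] - b[q - 1])
            # extend the cheapest of: insertion, deletion, match
            cur[q] = cost + min(prev[q], cur[q - 1], prev[q - 1])
        prev = cur
    return prev[m_j]


# Hypothetical toy sequences of different lengths.
x = [0, 0, 0]
y = [0, 1]
z = [1, 1, 1]

d_xy = dtw_distance(x, y)   # 1.0
d_yz = dtw_distance(y, z)   # 1.0
d_xz = dtw_distance(x, z)   # 3.0, which exceeds d_xy + d_yz
```

Here the direct cost from x to z is larger than the cost of going through y, so the measurement cannot be a metric.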

The Fisher Kernel [12] fits a single generative model (a Hidden Markov Model) to the sequences and measures how much a new incoming sequence “stretches” the average model trained with past sequences. It defines the Fisher score as the gradient of the log-likelihood, \(\log p(\varvec{x}|\theta )\), with regard to the hidden parameters. As the Fisher Kernel trains the generative model under the maximum likelihood principle, it may lead to sub-optimal results: a generative model that fits the data well may sit at a local optimum of its log-likelihood, where the gradient representation of the data is (nearly) zero [17].

The Fisher Kernel of sequences \( \varvec{s}_i \) and \( \varvec{s}_j \) is defined as:

$$\begin{aligned} K(\varvec{s}_i,\varvec{s}_j) = \nabla _{\varvec{\theta }}^T \log p(\varvec{s}_i|\theta )\, \mathcal {I}^{-1}\nabla _{\varvec{\theta }} \log p(\varvec{s}_j|\theta ) \end{aligned}$$

where \( \mathcal {I} \) is the Fisher information matrix. Computing the Fisher Kernel involves inverting the Fisher information matrix, which can be time-consuming. A routinely adopted way to bypass this difficulty is to replace the Fisher information matrix with the identity matrix, at the cost of some precision in the approximation [22].

Fisher Kernel learning [17] leverages label information so that objective functions in the same class have similar gradients. It applies ideas from metric learning to improve performance. Both methods are effective but inefficient in obtaining representations of the data, because computing the gradients is costly even when the Fisher information matrix is assumed to be the identity matrix.

The Autoregressive Kernel (AR) [8] employs a likelihood profile as the features of a sequence. The likelihood profile is generated by a Vector Autoregressive Model under different parametric settings, and the dissimilarity between sequences is computed with a Bayesian method. It can be verified that this measurement is a valid Hilbertian metric [8]. AR relaxes the constraint of using a single generative model to explain the whole data, as is done in the Fisher Kernel and Fisher Kernel learning. However, AR does not use the timestamps in a sequence to improve the prediction [20].

Chen et al. approximated time series via echo state networks (ESN) [4, 5] and demonstrated that the readout weights of ESNs offer discriminant features for sequences. Under the representation provided by a topologically fixed reservoir shared by the whole data set, the readout weights, the only trained part, capture the uniqueness of a specific sequence, which brings more versatility and flexibility. It was demonstrated that ESN is able to handle continuous sequences in complicated scenarios [6]. In addition, a co-learning strategy was devised to strengthen its representation capability on continuous sequences [3]. In this paper, we further extend this methodology to process binary data, and demonstrate the improvement in performance obtained by using a liquid state machine (LSM). In an LSM, each individual node (neuron) has its own “state memory” and responds based on its own history and the current input signal, while nodes in an ESN respond based merely on their current state. The replacement enhances the “memory” of the reservoir and is shown to be beneficial by experiments.

In this paper, we propose a novel approach to representing sequences, which might be of different lengths and of different characteristics, in a higher dimensional space. In this approach, each sequence is represented by an LSM, which approximates the conditional probability underlying the likelihood of the sequence. After the models are obtained, classification is conducted on the models rather than on the sequences directly. We discuss measurements under different assumptions on the “model distributions”. The model set, along with the defined measurements, offers a novel space for classification and other possible learning tasks. This space is referred to as the model space of a given data set in this paper.

2 Discriminant Learning in the Model Space

LSM incorporates time into the neural network model to enhance the level of realism in the simulation, and has emerged as a new computational model [16]. Apart from the input layer, an LSM consists of two parts. A large collection of randomly connected nodes makes up the reservoir. Each node receives inputs from the input layer as well as from other nodes. The spatio-temporal pattern of node activations is read out by the final layer as linear combinations for performing a given task. The final layer is the only part that needs training.

Fig. 1 illustrates the schematic diagram of the model space and LSM. In the figure, LSMs are used to approximate the sequences, and the set of LSMs is in turn used by the learning algorithms.

Fig. 1. The schematic diagram for LSM and model space. LSMs provide representations for two sequences. The model space is seen as a high dimensional space, in which the readout weights of LSMs are assembled.

The form of the LSM [14] is generalized as follows:

$$\begin{aligned} \left\{ \begin{aligned} \varvec{x}(t)&= Q(R\varvec{x}(t-1)+V\varvec{s}(t)) \\ \varvec{y}(t)&= f(\varvec{x}(t))= W\varvec{x}(t) \end{aligned} \right. \end{aligned}$$

where \(\varvec{x}(t) \in \mathfrak {R}^{n}\) is the real-valued state vector and n is the number of reservoir nodes. The input \(\varvec{s}(t)\in \mathfrak {R}^{d+1}\) has been augmented by adding a bias as one of its components. R and V are appropriately defined coefficient matrices. \(\varvec{y}(t)\in \mathfrak {R}^{n'}\) and W denote the output and the readout weights respectively, and \(n'\) is the dimensionality of the output. \(Q(\cdot )\) is the response function defined on the internal nodes.

An LSM is trained to predict the present value from past values. The readout weights \(W \in \mathfrak {R}^{n'\times n}\) are trained so that \(W\varvec{x}(t) \approx \varvec{s}(t+1) \). The dimensionality \( n' \) satisfies \( n' = d \) in this scenario.
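The state update and readout training above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CSIM-based implementation used in the experiments: the response function \(Q\) is replaced by `tanh` (an ESN-style stand-in for the spiking nodes of an LSM), the readout is fitted by ridge regression with a hypothetical parameter `lam`, and the matrix scalings and the sine-wave input are made up.

```python
import numpy as np


def run_reservoir(seq, R, V, Q=np.tanh):
    """Collect states x(t) = Q(R x(t-1) + V [s(t); 1]) for every time step."""
    T, n = seq.shape[0], R.shape[0]
    x = np.zeros(n)
    states = np.empty((T, n))
    for t in range(T):
        s_aug = np.append(seq[t], 1.0)   # augment the input with a bias component
        x = Q(R @ x + V @ s_aug)
        states[t] = x
    return states


def fit_readout(seq, R, V, lam=1e-2):
    """Ridge-regression readout W (d x n) such that W x(t) approximates s(t+1)."""
    X = run_reservoir(seq, R, V)[:-1]    # states x(0), ..., x(T-2)
    Y = seq[1:]                          # targets s(1), ..., s(T-1)
    n = X.shape[1]
    # Solve (X^T X + lam I) W^T = X^T Y
    W_T = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ Y)
    return W_T.T


rng = np.random.default_rng(0)
n, d = 50, 1                                      # reservoir size, input dimension
R = rng.normal(size=(n, n))
R *= 0.9 / np.max(np.abs(np.linalg.eigvals(R)))   # hypothetical spectral-radius scaling
V = rng.uniform(-0.5, 0.5, size=(n, d + 1))       # input weights, including the bias column
seq = np.sin(0.2 * np.arange(300)).reshape(-1, 1)

W = fit_readout(seq, R, V)
pred = run_reservoir(seq, R, V)[:-1] @ W.T
mse = float(np.mean((pred - seq[1:]) ** 2))
```

Because R and V are fixed once and shared, two sequences fitted this way differ only in their readout matrices W, which is what makes W usable as a per-sequence representation.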

We consider an arbitrary sequence \(\varvec{s}=\{s_0,s_1,\cdots ,s_n\}\in \mathfrak {R}^{d}\), where d is the dimensionality of the sequence. We also use \(\varvec{s}(t)\) to denote a sequence which is indexed by t. We assume that the index starts from 0 unless otherwise stated.

The likelihood of a sequence \( \varvec{s} \) is expressed as:

$$\begin{aligned} \ell (\varvec{s})=\ell (\{s_0,s_1,\cdots ,s_n\} ) \end{aligned}$$

which can be further factorized into

$$\begin{aligned} \ell (\varvec{s})= \mathcal {P}_0(s_0)\mathcal {P}_1(s_1|s_0)\mathcal {P}_2(s_2|s_1,s_0)\cdots \cdots \mathcal {P}_n(s_n|s_{n-1},\cdots ,s_{0}) \end{aligned}$$

where \( \mathcal {P}_i(s_i|s_{i-1}\cdots s_0),i=0\cdots n \) is the conditional probability.

In most cases, it is too strong an assumption that the conditional probability \(\mathcal {P}_i(\cdot |\cdot )\) of a sequence can be generalized and formulated explicitly. Assumptions on the form of \(\mathcal {P}_i(\cdot |\cdot )\) might lead to sub-optimal results.

In our approach, we make use of the universal approximating ability [16] of LSM under a weak assumption on the conditional probability distribution, namely that \( \mathcal {P}_i(\cdot |\cdot ) \) is time-invariant, i.e. \( \mathcal {P}_i(\cdot |\cdot ) = \mathcal {P}(\cdot |\cdot ) \). The universal approximating ability states that, given enough variety in the interior nodes, nonlinear input-output mappings can be approximated by an LSM trained on sufficiently long input sequences. Our approach bases the approximation to \( \mathcal {P}(\cdot |\cdot ) \) on this ability and therefore uses models rather than simplified formulations in the classification algorithm.

2.1 Measurement of Dissimilarity Between Models in the Model Space

The dissimilarity of two sequences is judged from the divergence between the two fitted LSMs. Given two sequences \( \varvec{s}_i \) and \( \varvec{s}_j \), a general measurement of dissimilarity is formulated as follows:

$$\begin{aligned} \mathcal {D}(\varvec{s}_i,\varvec{s}_j)&= (\int _{\varvec{x} \in \mathcal {I}}||f_i - f_j||^2 d\mu (\varvec{x}))^{1/2} \\&=\left( \int _{\varvec{x} \in \mathcal {I}} {(W_i\varvec{x}-W_j\varvec{x})^T(W_i\varvec{x}-W_j\varvec{x})}d\mu (\varvec{x}) \right) ^{1/2} \nonumber \end{aligned}$$

\(||\cdot ||\) is the norm that measures the disagreement between two model outputs. \(\mathcal {I}\) is the interval over which the state vector \(\varvec{x}\) ranges. \(\mu (\varvec{x})\) is the probability distribution of \(\varvec{x}\).

The uniform distribution over \( \varvec{x} \) is the simplest case, in which the probability distribution \( \mu (\varvec{x}) \) is assumed to depend only on the interval \( \mathcal {I} \). Later, this assumption will be relaxed and more general cases will be discussed.

Under the assumption of the uniform distribution, the dissimilarity between sequences \(\varvec{s}_i\) and \(\varvec{s}_j\) is simplified into

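$$\begin{aligned} \mathcal {D}(\varvec{s}_i,\varvec{s}_j)&=\left( \int {(W_i\varvec{x}-W_j\varvec{x})^T(W_i\varvec{x}-W_j\varvec{x})}d\mu (\varvec{x}) \right) ^{1/2} \\&= \mathcal {C}||W_i-W_j||. \nonumber \end{aligned}$$

where the terms that do not depend on the readout weights in the last formula of Eq. (4) are absorbed into the constant \(\mathcal {C} \). This form is exact when every component of \(\varvec{x}\) ranges over the same interval symmetric about the origin, so that the second-moment matrix of \(\varvec{x}\) is a multiple of the identity.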

In more general cases, where \(\varvec{x}\) is not evenly distributed but its distribution does not change dramatically, we use a Gaussian mixture model to approximate the probability distribution \( \mu (\varvec{x}) \), i.e. \(\mu (\varvec{x})\) is fitted with a finite mixture of Gaussian distributions:

$$\begin{aligned} \mu (\varvec{x}) = \sum \alpha _i N(\theta _i,\varSigma _i) \end{aligned}$$

where \(\alpha _i\) is the mixture coefficient of the i-th Gaussian distribution. The coefficients sum to 1, \( \sum \alpha _i = 1 \). Parameters \( \theta _i \) and \( \varSigma _i \) are the mean and covariance of the i-th Gaussian distribution.

Substituting the Gaussian mixture model for \(\mu (\varvec{x})\) in Eq. (3), the dissimilarity between two sequences is formulated as:

$$\begin{aligned} \mathcal {D}(\varvec{s}_i,\varvec{s}_j) = \left( \sum _k \alpha _k \left[ \mathrm {trace}(\varDelta W^T\varDelta W\varSigma _k) + \theta _k^T \varDelta W^T\varDelta W \theta _k \right] \right) ^{1/2}, \quad \varDelta W = W_i - W_j \end{aligned}$$
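As a sanity check on the Gaussian mixture case, the sketch below evaluates the second moment of \((W_i - W_j)\varvec{x}\) in closed form and compares it with a Monte Carlo estimate. It assumes the dissimilarity is the square root of that second moment, as in Eq. (3); the function name, the mixture parameters and the readout matrices are all hypothetical toy values.

```python
import numpy as np


def gmm_squared_dissimilarity(W_i, W_j, alphas, means, covs):
    """E[||(W_i - W_j) x||^2] for x ~ sum_k alpha_k N(theta_k, Sigma_k)."""
    dW = W_i - W_j
    M = dW.T @ dW                      # n x n, symmetric positive semi-definite
    total = 0.0
    for alpha, theta, sigma in zip(alphas, means, covs):
        # For one Gaussian component: E[x^T M x] = trace(M Sigma) + theta^T M theta
        total += alpha * (np.trace(M @ sigma) + theta @ M @ theta)
    return float(total)


# Toy mixture and readout weights (made-up values).
rng = np.random.default_rng(0)
n, d = 3, 2
W_i, W_j = rng.normal(size=(d, n)), rng.normal(size=(d, n))
alphas = np.array([0.3, 0.7])
means = [rng.normal(size=n) for _ in alphas]
covs = []
for _ in alphas:
    A = rng.normal(size=(n, n))
    covs.append(A @ A.T + np.eye(n))   # guarantees a valid covariance matrix

closed_form = gmm_squared_dissimilarity(W_i, W_j, alphas, means, covs)

# Monte Carlo estimate of the same expectation.
counts = rng.multinomial(200_000, alphas)
samples = np.vstack([rng.multivariate_normal(m, c, size=k)
                     for m, c, k in zip(means, covs, counts)])
monte_carlo = float(np.mean(np.sum((samples @ (W_i - W_j).T) ** 2, axis=1)))

dissimilarity = closed_form ** 0.5     # square root, following Eq. (3)
```

The closed form needs only the mixture parameters, so its cost does not depend on the sequence length once the mixture has been fitted.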

Sampling, as a natural alternative to the above approximation methods, makes no assumptions on the form of \(\mu (\cdot )\). An asymptotically optimal estimate of the probability distribution \( \mu (\cdot ) \) is guaranteed by the law of large numbers. This estimate may lead to more robust results if no prior information on \( \mu (\cdot ) \) exists. Applying sampling to Eq. (3) is straightforward:

$$\begin{aligned} \mathcal {D}(\varvec{s}_i,\varvec{s}_j)\approx \frac{1}{m}\sum _k||W_i\varvec{x}_k-W_j\varvec{x}_k|| \end{aligned}$$

where m denotes the number of sample points.
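The sampling estimate translates directly into code. The sketch below assumes the sample points \(\varvec{x}_k\) are reservoir states stored row-wise in an array; the function name and the toy data are hypothetical.

```python
import numpy as np


def sampling_dissimilarity(W_i, W_j, states):
    """(1/m) * sum_k ||W_i x_k - W_j x_k|| over m sampled state vectors x_k."""
    diffs = states @ (W_i - W_j).T               # row k holds (W_i - W_j) x_k
    return float(np.mean(np.linalg.norm(diffs, axis=1)))


# Toy readout weights and sampled states (made-up values).
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(500, 4))   # m = 500 samples, n = 4 nodes
W_a = rng.normal(size=(2, 4))
W_b = rng.normal(size=(2, 4))

d_ab = sampling_dissimilarity(W_a, W_b, states)
d_ba = sampling_dissimilarity(W_b, W_a, states)
d_aa = sampling_dissimilarity(W_a, W_a, states)
```

The estimate is symmetric in the two models and vanishes when they coincide, independently of how the samples were drawn.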

Assume that the deviation \( \varepsilon (t) \) between the output of an LSM, \(\varvec{y}(t) = W\varvec{x}(t)\), and the desired output \( \varvec{s}(t+1) \) follows a zero-mean Gaussian distribution, \( \varepsilon (t) \sim \mathcal {N}(0,\delta ^2 I)\). When the methodology of the Fisher Kernel is applied, the conditional probability of observing \( \varvec{s}(t+1) \) given past values is formulated as:

$$\begin{aligned}&\mathcal {P}(\varvec{s}(t+1)| \varvec{s}(1\cdots t)) = (2\pi \delta ^2)^{-d/2}\exp \big (-\frac{||\varvec{s}(t+1) - W\varvec{x}(t)||^2}{2\delta ^2}\big ) \end{aligned}$$

The Fisher kernel between \(\varvec{s}_i\) and \(\varvec{s}_j\) takes the form of an inner product of two Fisher scores U, i.e. derivatives of the log-likelihood with regard to the hidden parameters. The derivative quantifies how the model adjusts its current parametric setting in order to fit a new sequence. Differentiating the logarithm of \( \mathcal {P}(\cdot |\cdot ) \) with respect to W gives:

$$\begin{aligned} U&= \frac{\partial \log {\mathcal {P}(\varvec{s}(1 \cdots l))}}{\partial W}\\&= \sum _{t=1}^l \frac{\varvec{s}(t)\varvec{x}(t-1)^T - W\varvec{x}(t-1)\varvec{x}(t-1)^T}{\delta ^2} \nonumber \end{aligned}$$

The dissimilarity between \( \varvec{s}_i \) and \( \varvec{s}_{j} \) is expressed as:

$$\begin{aligned} \mathcal {D}(\varvec{s}_i,\varvec{s}_j) = \varvec{1}U_i.*U_j\varvec{1}^T \end{aligned}$$

where \(.*\) denotes element-wise multiplication and \(\varvec{1}\) is the all-one vector.
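The Fisher score and the element-wise product can be sketched as follows. The array layout (one time step per row), the function names and the random toy data are assumptions made for illustration; which W each score is evaluated at is not fixed by this sketch.

```python
import numpy as np


def fisher_score(S_next, X_prev, W, delta=1.0):
    """U = sum_t (s(t) - W x(t-1)) x(t-1)^T / delta^2, with the same shape as W.

    S_next[t] holds s(t) and X_prev[t] holds x(t-1), one time step per row.
    """
    resid = S_next - X_prev @ W.T            # rows: s(t) - W x(t-1)
    return resid.T @ X_prev / delta ** 2


def fisher_inner_product(U_i, U_j):
    """1 (U_i .* U_j) 1^T, i.e. the sum of all element-wise products."""
    return float(np.sum(U_i * U_j))


# Toy states, targets and readout weights (made-up values).
rng = np.random.default_rng(0)
l, n, d = 100, 5, 2
X_prev = rng.normal(size=(l, n))
S_next = rng.normal(size=(l, d))
W = rng.normal(size=(d, n))
U = fisher_score(S_next, X_prev, W)

# At the least-squares readout the score vanishes (normal equations).
W_ls = np.linalg.lstsq(X_prev, S_next, rcond=None)[0].T
U_ls = fisher_score(S_next, X_prev, W_ls)
```

The last two lines illustrate the remark from the introduction: at a readout that already fits the sequence in the least-squares sense, the gradient representation is (numerically) zero.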

Extending to Binary Data. Sequential data recorded in binary digits \( \{0,1\} \) are frequently encountered in clinical research, e.g. heartbeat signals and signals from neurons. For binary or discrete data, the traditional approach of minimizing the mean square error (MSE), as is done for numerical sequences, is infeasible. It relies on the gradient of the objective function to infer the parameters, while the MSE on binary data is non-smooth and thus no gradient exists. LSM is extended to process binary data by replacing the MSE with the exponential van Rossum metric [23].

A general exponential van Rossum metric \(\psi (t, t_0)\) can be formulated as:

$$\begin{aligned} \psi (t,t_0) = \left\{ \begin{aligned}&-(t-t_0) \frac{e^{-(t-t_0)/\tau }}{\tau }&0 \le t < \varDelta t + t_0\\&+\infty&\text {otherwise} \\ \end{aligned} \right. \end{aligned}$$

where \(t_0\) is the expected index. \(\varDelta t\) is a threshold that restricts the comparison to the vicinity of \( t_0 \). The argument \(\tau \) is a penalty on the deviation.
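The piecewise definition of \(\psi (t,t_0)\) can be transcribed literally. The sketch below follows the formula exactly as stated above; the function and argument names and the numerical values are hypothetical.

```python
import math


def psi(t, t0, tau, delta_t):
    """Literal transcription of psi(t, t0) as defined above."""
    if 0 <= t < delta_t + t0:
        return -(t - t0) * math.exp(-(t - t0) / tau) / tau
    return math.inf                    # outside the window the cost is infinite


at_expected = psi(5.0, 5.0, tau=2.0, delta_t=3.0)      # zero when t equals t0
one_tau_late = psi(7.0, 5.0, tau=2.0, delta_t=3.0)     # equals -exp(-1)
outside_window = psi(9.0, 5.0, tau=2.0, delta_t=3.0)   # infinite, since t >= t0 + delta_t
```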

3 Experimental Study

This section presents experiments conducted on synthetic binary data and classifications on benchmark univariate and multivariate data. For a given task, the topology (200 interior nodes) and the interior weights between nodes were initialized and kept fixed. In this way, the randomness in LSM was controlled as an invariant factor for comparison purposes. The restart strategy was adopted in the experiments (Footnote 1).

The implementation of LSM made use of CSIM [19], a software package for simulating neural microcircuits. The parameters were set with reference to the examples attached to the software.

In the implementation of the Gaussian mixture model, the number of Gaussian distributions was determined automatically by the method proposed in [10]. In the sampling, since some training sequences were not sufficiently long, the circular block bootstrap was applied. The block length was determined automatically by the method proposed in [15].
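A circular block bootstrap resamples contiguous blocks that are allowed to wrap around the end of the series. The sketch below is a generic version with a fixed, user-supplied block length; the automatic block-length selection of [15] is not reproduced, and what exactly is resampled in the experiments (taken here to be any time-indexed list) is an assumption.

```python
import random


def circular_block_bootstrap(seq, block_len, out_len, rng=None):
    """Resample `out_len` items from `seq` in contiguous, wrap-around blocks."""
    rng = rng or random.Random()
    n = len(seq)
    out = []
    while len(out) < out_len:
        start = rng.randrange(n)                  # uniformly chosen block start
        # indices are taken modulo n, so a block may wrap past the end
        out.extend(seq[(start + k) % n] for k in range(block_len))
    return out[:out_len]


rng = random.Random(0)
series = list(range(10))                          # toy time-indexed series
resampled = circular_block_bootstrap(series, block_len=3, out_len=20, rng=rng)
```

Keeping blocks contiguous preserves the short-range dependence within each block, which plain item-wise resampling would destroy.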

LIBSVM [2] was adopted in the classification algorithm. Multi-class data were classified via its default strategy, one-against-one.

The proposed methods were compared with common methods, including Dynamic Time Warping (DTW), Autoregressive Kernel (AR), Fast Fisher Kernel (Fisher), and Reservoir model (RV) proposed in [4, 5].

The parameters in the proposed algorithms (regression parameter \( \lambda \)), the support vector machine (bandwidth \(\theta \) and cost \(\mathcal {C}\)), and the comparison algorithms were tuned with 5-fold cross-validation (Footnote 2). The search ranges for the parameters are detailed in Table 1.

Table 1. The parameters and search ranges

The four classification methods defined with Eqs. (4)–(7) are named LSM with \( L^2 \) norm (\( L^2 \)-LSM), LSM with Gaussian mixture model (Gaussian-LSM), LSM with sampling method (Sampling-LSM), and LSM with Fisher methodology (Fisher-LSM).

3.1 Synthetic Data

Synthetic binary data were generated following a Poisson distribution \( p(t) = \frac{\lambda ^te^{-\lambda }}{t!} \). The merit of using the Poisson distribution is that it makes the events (bars in Fig. 2) evenly distributed and ensures that no two events happen at the same time. The synthetic data were labeled into three classes, and each class was generated under a slightly different parameter setting.

For each parametric setting, the simulation lasted 2 s with time unit \(10^{-3}\) s, generating a sequence of length 2000. We generated 55 sequences for each class. In addition, all the sequences were corrupted with Gaussian white noise (\(mean=0, \varSigma =0.02 \varvec{I}\)). Eq. (8) was adopted as the cost function in the training algorithm. Figure 2 shows parts of the binary sequences of the three classes. From this figure, it is not easy to distinguish the class labels.
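One way to produce data of this shape is sketched below: a homogeneous Poisson process is simulated through exponential inter-event times, binned at \(10^{-3}\) s over 2 s, and then corrupted with zero-mean Gaussian noise of variance 0.02. The generator itself and the event rate `rate_hz` are assumptions; the paper does not list the per-class parameter values.

```python
import math
import random


def poisson_spike_train(rate_hz, duration_s=2.0, dt=1e-3, rng=None):
    """Binary sequence with one bin per time unit; 1 marks a bin containing an event."""
    rng = rng or random.Random()
    n_bins = int(round(duration_s / dt))
    train = [0] * n_bins
    t = rng.expovariate(rate_hz)                  # time of the first event
    while t < duration_s:
        train[min(int(t / dt), n_bins - 1)] = 1   # events sharing a bin collapse to one
        t += rng.expovariate(rate_hz)             # exponential inter-event time
    return train


def add_gaussian_noise(train, variance=0.02, rng=None):
    """Add zero-mean Gaussian white noise with the given variance."""
    rng = rng or random.Random()
    sd = math.sqrt(variance)
    return [v + rng.gauss(0.0, sd) for v in train]


rng = random.Random(0)
clean = poisson_spike_train(rate_hz=20.0, rng=rng)   # hypothetical rate for one class
noisy = add_gaussian_noise(clean, rng=rng)
```

Different classes would then correspond to slightly different values of `rate_hz`.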

Fig. 2. Parts of the synthetic binary sequences. The data were generated following a Poisson distribution and were corrupted with additive Gaussian white noise. The horizontal axis denotes the index. Different classes are drawn in different colors and are separated by a dashed line.

The model space in this experiment, which is populated by the readout weights of the fitted models, is depicted in Fig. 3. In order to visualize the model space, multidimensional scaling (MDS) was used to reduce its dimensionality. MDS preserves the original between-object distances as faithfully as possible in a lower dimensional space. Although it was hard to distinguish class labels in the binary data as depicted in Fig. 2, after representing the sequences in the model space, they became separable in Fig. 3.
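For reference, a classical (Torgerson) MDS embedding can be computed from a matrix of pairwise dissimilarities as sketched below. The paper does not state which MDS variant was used, so the classical variant is an assumption, and the toy points are made up.

```python
import numpy as np


def classical_mds(D, k=3):
    """Embed n objects in k dimensions from an n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]             # indices of the k largest
    vals_k = np.clip(vals[idx], 0.0, None)       # guard against tiny negative values
    return vecs[:, idx] * np.sqrt(vals_k)


# Toy check: points that already live in 3-D are recovered up to rotation.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
Y = classical_mds(D, k=3)
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

In the model space setting, D would hold the pairwise dissimilarities between the fitted models rather than Euclidean distances between known points.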

Fig. 3. The model space of synthetic binary data in a 3-dimensional coordinate. Parts of the data are depicted in Fig. 2. The model space was constructed by fitting LSMs to the binary data and extracting data-specific features, i.e. the readout weights, from LSMs. Each point offers representation to an individual binary sequence. Different classes are denoted with different markers.

The sensitivity of the proposed method to additive Gaussian noise was also investigated, in comparison with AR and Fast Fisher Kernel (Fisher) (Footnote 3). The classifications were conducted on data with various amplitudes of Gaussian noise. The experimental results are depicted in Fig. 4.

An overall advantage of the proposed methods can be observed from Fig. 4. Not surprisingly, Fisher-LSM has the best performance in terms of classification accuracy and robustness to noise among all the methods. Fisher-LSM assumes that the deviation between the observation and the true value follows a zero-mean Gaussian distribution, which coincides with the noise used in this experiment. Sampling-LSM appears less robust to the added noise: its classification accuracy drops after the data are corrupted with noise. However, as the amplitude of the noise grows, its influence on the performance of Sampling-LSM decreases.

Fig. 4. The classification accuracy versus different amplitudes of Gaussian noise. The horizontal axis denotes the amplitude of noise, and the vertical axis is the classification accuracy on the synthetic data. Overall, the proposed methods demonstrate clear advantages in handling binary sequences in this experimental setting.

3.2 Benchmark Data

The benchmark data sets were obtained from the UCR time series classification archive [7] and the UCI machine learning repository (Footnote 4). Table 2 gives a summary of all data sets. In order to eliminate the influence of different units, all data sets were rescaled into the interval \([-1,1] \).

Table 2. Summary description of univariate data sets.
Table 3. Classification accuracy of Dynamic Time Warping (DTW), Autoregressive Kernel (AR), Fisher Kernel Learning (Fisher), Reservoir Model (RV), \(L^2\)-LSM, Gaussian-LSM, sampling-LSM and Fisher-LSM. The best results are marked in bold.

The experimental results of 5 runs are listed in Table 3. From this table, a general advantage of classification carried out in the model space of LSM over the comparison algorithms can be observed. Among all the proposed methods, learning based on sampling achieved the best performance. The better performance of Sampling-LSM is largely attributable to the weak hypothesis it imposes on the probability distribution \( \mu (\cdot ) \). However, Fisher-LSM underperformed on all sequences. A possible reason for its deficiency lies in its assumption on the deviation: when strong autocorrelation exists, the assumption of zero-mean Gaussian noise is unlikely to hold. Given its good performance on the synthetic binary sequences, Fisher-LSM is better suited to binary or discrete sequences.

As the number of Gaussian distributions was determined automatically in Gaussian-LSM and the bootstrap was adopted in Sampling-LSM, the computational complexities of the proposed approaches are difficult to analyze. We therefore used experiments to illustrate the actual time consumption on benchmark data. Experiments were conducted on the sequential data set PEMS (Footnote 5). By truncating the sequences and recording the time consumed in obtaining the dissimilarities between pairs of sequences, we obtained tuples of time consumption (Footnote 6) versus sequence length. The results are plotted in Fig. 7.

In Fig. 7, the time consumption of all proposed approaches grows slowly once the sequence length becomes large (beyond 1800). The curves of Gaussian-LSM and \( L^2 \)-LSM grow in a similar pattern. However, Sampling-LSM maintains a (roughly) constant time usage, even when the training sequences are short. The reason is that, to compensate for the approximation loss when the training sequences are not sufficiently long, more sampling has to be done. The computation of Fisher-LSM involves matrix multiplication, which makes its time grow (roughly) linearly with the sequence length in our experiments (not shown).

In contrast, the time complexity of DTW is \(O(m_i m_j)\), where \( m_i \) is the length of the i-th sequence. An improved variation [13] speeds up DTW by using a piece-wise approximation of the time series with segments of length c, and is reported to have time complexity \(O(\frac{m_im_j}{c^2})\). The autoregressive kernel [8] has time complexity \(O((m_i+m_j-2p)^3)\), where p, the order of the employed model, is far less than \( \min (m_i,m_j)\). Compared with the above algorithms, Gaussian-LSM and \( L^2 \)-LSM therefore show a computational advantage.

Multivariate Sequences. Classification experiments were conducted on three multivariate data sets: Brazilian sign language (Libras), handwritten characters, and Australian sign language (AUSLAN). Notably, handwritten and AUSLAN are also variable-length. A summary of the three data sets is given in Table 4.

In this experiment, we compared \( L^2 \)-LSM against the comparison algorithms on multivariate data. The experimental results of 5 runs are plotted in Fig. 6.

Fig. 5. The time consumptions on three multivariate data sets. The vertical axis denotes the time consumption. It is measured in the unit of CPU time (\( \sec \)).

From Fig. 6, \(L^2\)-LSM outperforms all comparison algorithms on handwritten. It also gains a slight advantage on data set Libras. On high dimensional data set AUSLAN, \( L^2 \)-LSM only surpasses AR. The hypothesis of uniform distribution fails to hold in the high dimensional data set AUSLAN, which leads to a suboptimal result for \( L^2 \)-LSM.

Fig. 6. The classification accuracies carried out on three multivariate data sets.

Fig. 7. The time consumptions under different length of sequential data. The vertical axis denotes the time consumption. It is measured in the unit of CPU time (\( \sec \)). The horizontal axis denotes the sequential length.

Fig. 8. The approximating errors under different regression coefficients \(\xi \).

Fig. 9. The classification accuracies under different regression coefficients \(\xi \).

Table 4. The description of multivariate data sets.

The time consumption of \( L^2 \)-LSM and the comparison algorithms is plotted in Fig. 5. Generally, \( L^2 \)-LSM demonstrates an advantage in efficiency. Its time usage is comparable to that of RV. On the high dimensional data set AUSLAN, \( L^2 \)-LSM and RV build a classifier in less time than the other algorithms, and the difference between these two algorithms is not obvious.

Parametric Sensitivity Analysis. The performance achieved in the model space of LSM is jointly determined by two factors, i.e. the representations offered by LSMs to the sequences and the separation of LSMs in the corresponding space. An unsettled issue is the relationship between these two goals. In our approach, the regression coefficient \(\xi \) is the parameter that needs careful tuning for a better trade-off between the approximation to the sequences and the separation of LSMs in the model space.

In this experiment, three data sets were used as benchmark data sets, and experiments were conducted with different settings of the parameter \(\xi \). For simplicity, we assume a uniform distribution for \( \mu (\cdot ) \). The classification accuracies versus \( \xi \) are plotted in Fig. 9, and the approximation errors versus \(\xi \) are plotted in Fig. 8.

Comparing Figs. 8 and 9, we observe that a higher classification accuracy and a lower approximation error are likely to occur jointly, which suggests that the two goals may not be conflicting objectives with regard to \(\xi \). A joint optimization procedure for \(\xi \) may be feasible.

4 Conclusion

This paper proposes model space learning for sequential data on the basis of LSM. LSM is used as a universal approximating tool to fit the conditional probability of a sequence. The models offer representations for the training sequences. As a result, learning is carried out in the model space instead of on the original data. From the experiments, the benefits of replacing the “memoryless” response function with nodes that have their own “history” are clear. Fisher-LSM is shown to be robust and effective in processing binary data. An overall improvement in classification accuracy on benchmark data has been observed in the experiments. Sampling-LSM is recommended when the dimensionality of the training data is not high.

This paper also discusses measuring the dissimilarity between two LSMs in the model space. A set of models, instead of a single model, is used to approximate the training data. Learning in the model space relaxes the requirement to use a single model to explain the whole data. The relationship between the approximating capability on sequences and the separation of LSMs is studied. The result shows the feasibility of jointly optimizing two seemingly conflicting targets.

In general, this paper proposes an approach to constructing data representations without the need to assume a parametric formulation. Its application to lower dimensional data has been demonstrated to be effective. Promising future work includes improving model space learning on high dimensional data without sacrificing its efficiency.