Abstract
In classification, prior knowledge is incorporated in a Bayesian framework by assuming that the feature-label distribution belongs to an uncertainty class of feature-label distributions governed by a prior distribution. A posterior distribution is then derived from the prior and the sample data. An optimal Bayesian classifier (OBC) minimizes the expected misclassification error relative to the posterior distribution. From an application perspective, prior construction is critical. The prior distribution is formed by mapping a set of mathematical relations among the features and labels, the prior knowledge, into a distribution governing the probability mass across the uncertainty class. In this paper, we consider prior knowledge in the form of stochastic differential equations (SDEs). We consider a vector SDE in integral form involving a drift vector and dispersion matrix. Having constructed the prior, we develop the optimal Bayesian classifier between two models and examine, via synthetic experiments, the effects of uncertainty in the drift vector and dispersion matrix. We apply the theory to a set of SDEs for the purpose of differentiating the evolutionary history between two species.
Introduction
A purely data-driven classifier design with small samples encounters a fundamental conundrum: the error rate of a classifier quantifies its predictive accuracy, the salient epistemic attribute of any classifier, yet error estimation by resampling strategies such as cross-validation and bootstrap is generally very inaccurate on small samples due to excessive variance and lack of regression with the true error [1]. The inability to satisfactorily estimate the error with model-free methods on small samples implies that classifier error estimation is virtually impossible without the use of prior information. Prior knowledge can be incorporated in a Bayesian framework by assuming that the feature-label distribution belongs to an uncertainty class of feature-label distributions governed by a prior distribution [2, 3]. Given the latter, in conjunction with sample data, one can optimally estimate the error of any classifier, relative to the mean square error (MSE) between the true and estimated errors, where expectations are taken with respect to a posterior distribution derived from the prior distribution and the data [4, 5]. Hence, optimality is with respect to our prior knowledge and the data. Furthermore, one can derive an optimal classifier relative to the expected error of the classifier over the posterior distribution, this being called the optimal Bayesian classifier (OBC) [6, 7]. Closed-form solutions have been developed for multinomial and Gaussian models. In other situations, Markov Chain Monte Carlo (MCMC) methods can be used [8].
Having developed the statistical theory, one is confronted with an engineering problem: transform scientific knowledge given in some mathematical form into a prior distribution. Intuitively, given a set of mathematical relations among the features and labels, these relations constrain the uncertainty class of feature-label distributions that could potentially govern the classification, and the relative strengths of the relations can be transformed so as to determine the probability mass of the prior distribution. For instance, in phenotype classification based on gene expression, genetic regulatory pathways constitute graphical prior knowledge, and this prior knowledge can be employed to formulate a prior distribution governing the uncertainty class of feature-label distributions [9, 10]. Another genomic application involves using prior knowledge concerning RNA-seq data to form sequence-based classifiers [8].
From a general perspective, when using Bayesian methods, prior construction constitutes the highest hurdle. A half century ago, E. T. Jaynes remarked,
“Bayesian methods, for all their advantages, will not be entirely satisfactory until we face the problem of finding the prior probability squarely” [11].
The aim of this paper is to utilize prior knowledge in the form of stochastic differential equations (SDEs) to classify time-series data. Although we will confine ourselves to a Gaussian problem so that we can take advantage of existing closed-form OBC representations, one can envision further applications using MCMC methods. Hence, the approach taken in the present paper may lead to utilizing SDEs across a number of time-series classification problems, keeping in mind that SDEs play a major role in many disciplines including physics, biology, finance, and chemistry. Vector SDEs, our concern here, have various applications. Not only do they arise naturally in many systems with vector-valued states, but they also arise in many systems where the process is restricted to lie on certain manifolds [12].
In the stochastic setting, training data are collected over time processes. Given certain Gaussian assumptions, classification in the SDE setting takes the same form as ordinary classification in the Gaussian model and we can apply the optimal Bayesian classification theory once we have a prior distribution constructed in accordance with known stochastic equations. In this paper, we provide the mathematical framework to synthesize an OBC in the presence of prior knowledge induced in the form of SDEs governing the dynamics of the system. We consider a vector SDE in integral form involving a drift vector and dispersion matrix, develop the OBC between two models, and examine via synthetic experiments the effects of uncertainty in the drift vector and dispersion matrix.
We compare the performance of the OBC with quadratic discriminant analysis (QDA), a classical approach to building classifiers in the Gaussian model (see Additional file 2: Section I for definition of QDA). Such comparisons are useful because, even though the OBC is optimal given the uncertainty, its optimality is on average across the uncertainty class, so that its performance advantage varies for different feature-label distributions in the uncertainty class (and can be disadvantageous for some distributions, although these will have small probability mass in the posterior distribution). Comparison to QDA is instructive because, as we will explain in the next section, QDA is a sample-based approximation to the optimal classifier for the true feature-label distribution. In addition to synthetic experiments, we apply optimal Bayesian classification using a form of the Ornstein-Uhlenbeck process that has been employed for modeling the evolutionary change of species; specifically, we use a set of SDEs to construct a classifier to differentiate the evolutionary history between two species.
Background
Classification
In two-class classification, the population is characterized by a feature-label distribution F for a random pair (X,Y), where X is a vector of p features and Y is the binary label, 0 or 1, of the class containing X. The prior class probabilities are defined by c _{ j }=P(Y=j) and the class-conditional densities by p _{ j }(x)=p(x∣Y=j), for j=0,1. To avoid trivialities, we assume min{c _{0},c _{1}}≠0. A classifier is a function ψ(X) assigning a binary label to each feature vector X. The error, ε[ψ], of ψ is the probability P(ψ(X)≠Y), which can be decomposed into ε=c _{0} ε ^{0}+c _{1} ε ^{1}, where ε ^{j}=P(ψ(X)=1−j∣Y=j), for j=0,1. A classifier with minimum error among all classifiers is known as a Bayes classifier for F. The minimum error is called the Bayes error. Epistemologically, the error is the key issue since it quantifies the predictive capacity.
In practice, F is unknown and a classification rule Ψ is used to design a classifier ψ _{ n } from a random sample S _{ n }={(X _{1},Y _{1}),(X _{2},Y _{2}),…,(X _{ n },Y _{ n })} of pairs drawn from F. If feature selection is involved, then it is part of the classification rule. Since the true classifier error ε[ψ _{ n }] depends on F, which is unknown, ε[ψ _{ n }] is unknown. The true error must be estimated by an estimation rule, Ξ. Thus, the random sample S _{ n } yields a classifier ψ _{ n }=Ψ(S _{ n }) and an error estimate \(\hat{\varepsilon}[\psi_{n}]=\Xi(S_{n})\) (see Additional file 2: Section II for more information).
When a large amount of data is available, the sample can be split into independent training and test sets, the classifier being designed on the training data and its error being estimated by the proportion of errors on the test data; however, when data are limited, the sample cannot be split without leaving too little data to design a good classifier. Hence, training and error estimation must take place on the same data set. As noted in Section 1, accurate error estimation is virtually impossible with small samples in the absence of distributional assumptions.
Optimal Bayesian classification
Distributional assumptions can be imposed by defining a prior distribution over an uncertainty class of feature-label distributions. This results in a Bayesian approach with the uncertainty class being given a prior distribution and the data being used to construct a posterior distribution.
Let Π _{0} and Π _{1} denote the class-0 and class-1 conditional distributions, respectively; let c be the probability of a point coming from Π _{0} (the “mixing” probability); and let Π _{0} and Π _{1} be parameterized by θ _{0} and θ _{1}, respectively. The overall model is parameterized by θ=(c,θ _{0},θ _{1}) with prior distribution π(θ). Given a random sample, S _{ n }, a classifier ψ _{ n } is designed and we wish to minimize the MSE between its true error, ε, and an error estimate, \(\widehat{\varepsilon}\). The minimum mean square error (MMSE) error estimator is the expected true error, \(\widehat{\varepsilon}(\psi_{n},S_{n})=\mathrm{E}_{\theta}[\varepsilon(\psi_{n},\theta)\mid S_{n}]\). The expectation given the sample is over the posterior density of θ, denoted by π ^{∗}(θ). Thus, we write the Bayesian MMSE error estimator as \(\widehat{\varepsilon}=\mathrm{E}_{\pi^{\ast}}[\varepsilon]\).
The Bayesian error estimate is not guaranteed to be the optimal error estimate for any particular feature-label distribution; rather, for a given sample, and assuming the parameterized model and prior probabilities, it is optimal on average with respect to MSE and unbiased when averaged over all parameters and samples. To facilitate analytic representations, we assume c, θ _{0}, and θ _{1} are all mutually independent prior to observing the data. Denote the marginal priors of c, θ _{0}, and θ _{1} by π(c), π(θ _{0}), and π(θ _{1}), respectively, and suppose data are used to find each posterior, π ^{∗}(c), π ^{∗}(θ _{0}), and π ^{∗}(θ _{1}), respectively. Independence is preserved, i.e., π ^{∗}(c,θ _{0},θ _{1})=π ^{∗}(c)π ^{∗}(θ _{0})π ^{∗}(θ _{1}) [4].
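If ψ _{ n } is a trained classifier given by ψ _{ n }(x)=0 if x∈R _{0} and ψ _{ n }(x)=1 if x∈R _{1}, where R _{0} and R _{1} are measurable sets partitioning the sample space, then the Bayesian MMSE error estimator can be found from effective class-conditional densities, which are derived by taking the expectations of the individual class-conditional densities with respect to the posterior distribution,
\[ f_{\Theta}(\mathbf{x}\mid y)=\int_{\Theta_{y}} f_{\theta_{y}}(\mathbf{x}\mid y)\,\pi^{\ast}(\theta_{y})\,d\theta_{y}, \tag{1} \]
where \(f_{\theta_{y}}(\mathbf{x}\mid y)\) is the class-y conditional density under parameter θ _{ y } and \(\Theta_{y}\) is the parameter space of θ _{ y }.
Using these [6] (see Additional file 2: Section III for more information),
\[ \widehat{\varepsilon}(\psi_{n},S_{n})=\mathrm{E}_{\pi^{\ast}}[c]\int_{R_{1}} f_{\Theta}(\mathbf{x}\mid 0)\,d\mathbf{x}+\left(1-\mathrm{E}_{\pi^{\ast}}[c]\right)\int_{R_{0}} f_{\Theta}(\mathbf{x}\mid 1)\,d\mathbf{x}. \tag{2} \]
In the context of a prior distribution, an optimal Bayesian classifier, ψ _{OBC}, is any classifier satisfying
\[ \mathrm{E}_{\pi^{\ast}}[\varepsilon(\psi_{\text{OBC}},\theta)]\leq \mathrm{E}_{\pi^{\ast}}[\varepsilon(\psi,\theta)] \tag{3} \]
for all \(\psi \in \mathcal {C}\), where \(\mathcal {C}\) is an arbitrary family of classifiers. Under the Bayesian framework, this is equivalent to minimizing the probability of error,
\[ P(\psi(\mathbf{X})\neq Y\mid S_{n})=\mathrm{E}_{\pi^{\ast}}[P(\psi(\mathbf{X})\neq Y\mid \theta,S_{n})]=\mathrm{E}_{\pi^{\ast}}[\varepsilon(\psi,\theta)]=\widehat{\varepsilon}(\psi,S_{n}). \tag{4} \]
If \(\mathcal {C}\) is the set of all classifiers with measurable decision regions (which we will assume), then an optimal Bayesian classifier, ψ _{OBC}, satisfying (3) for all \(\psi \in \mathcal {C}\) exists and is given pointwise by [6]
\[ \psi_{\text{OBC}}(\mathbf{x})=\begin{cases} 0, & \text{if } \mathrm{E}_{\pi^{\ast}}[c]\,f_{\Theta}(\mathbf{x}\mid 0)\geq \left(1-\mathrm{E}_{\pi^{\ast}}[c]\right) f_{\Theta}(\mathbf{x}\mid 1),\\ 1, & \text{otherwise}. \end{cases} \tag{5} \]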
In many applications, especially in biomedicine, the sample S _{ n } is obtained by first deciding how many sample points will be taken from each class and then randomly sampling from each class separately, the resulting sample being said to be “separately sampled.” With separate sampling, the data cannot be used to generate a posterior distribution for c, so that c must be known. Stratified sampling is a special case of separate sampling in which the sample is drawn so that the proportion of sample points from class 0 is equal to c. In such a case, there is no posterior expectation \(\mathrm{E}_{\pi^{\ast}}[c]\), and \(\mathrm{E}_{\pi^{\ast}}[c]\) is replaced by c in (5). We will utilize stratified sampling in our examples.
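The pointwise decision rule referred to as (5) can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names `obc_label` and `normal_pdf`, the tie-breaking toward class 0, and the univariate normal densities used as stand-ins for effective class-conditional densities are all hypothetical choices made so that the example runs.

```python
import math

def normal_pdf(x, mean, std):
    """Univariate normal density, used here only as a stand-in density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def obc_label(x, eff_density0, eff_density1, expected_c):
    """Compare the weighted effective class-conditional densities at x.

    expected_c plays the role of the posterior expectation of c; under
    stratified sampling it is simply the known value c.
    Ties are assigned to class 0 (an assumed convention).
    """
    score0 = expected_c * eff_density0(x)
    score1 = (1.0 - expected_c) * eff_density1(x)
    return 0 if score0 >= score1 else 1

# Hypothetical effective densities for the two classes.
density0 = lambda x: normal_pdf(x, 0.0, 1.0)
density1 = lambda x: normal_pdf(x, 2.0, 1.0)

labels = [obc_label(x, density0, density1, expected_c=0.5)
          for x in (-1.0, 0.5, 1.5, 3.0)]
```

Raising `expected_c` shifts the decision boundary toward class 1, so that more points are labeled 0, which is the role the mixing probability plays in the rule.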
Binary classification of Gaussian processes
In this section, we frame the setting in which we are working and then define the problem of binary classification in the context of Gaussian processes. To begin with, a collection {X _{ t }:t∈T} of \(\mathbb {R}^{p}\)-valued random variables defined on a common probability space \((\Omega,\mathcal {F},P)\), indexed by a parameter \( t\in \mathbf {T}\subset \mathbb {R}\) (here assumed to be time), with \(\mathcal {F}\) being a σ-algebra of subsets of the sample space Ω (events), constitutes a stochastic process X with state space \(\mathbb {R} ^{p}\). Throughout this work, we consider \(\mathcal {F}\) as the σ-algebra of Borel subsets of \(\mathbb {R}^{p}\). A stochastic process X is adapted to an increasing family of σ-algebras \(\{\mathcal {F}_{t}:t\geq 0\}\) (a filtration) if for each t≥0, X _{ t } is \(\mathcal {F}_{t}\)-measurable.
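We study classification in the context of multivariate Gaussian processes (see Additional file 2: Section IV for a review of literature pertaining to classification of stochastic processes). Consider the p-dimensional column random vectors \(\mathbf {X}_{t_{1}}\), \(\mathbf {X}_{t_{2}}\), ..., \(\mathbf {X}_{t_{N}}\). A random process X is a multivariate Gaussian process if any finite-dimensional vector \(\left [\mathbf {X}_{t_{1}}^{T},\mathbf {X}_{t_{2}}^{T},...,\mathbf {X}_{t_{N}}^{T}\right ]^{T} \) possesses a multivariate normal distribution \(\mathcal {N}\left (\boldsymbol {\mu }_{\mathbf {t}_{N}},\boldsymbol {\Sigma }_{\mathbf {t}_{N}}\right)\), where
\[ \boldsymbol{\mu}_{\mathbf{t}_{N}}=\left[\boldsymbol{\mu}_{t_{1}}^{T},\boldsymbol{\mu}_{t_{2}}^{T},\ldots,\boldsymbol{\mu}_{t_{N}}^{T}\right]^{T}, \tag{6} \]
with \(\boldsymbol{\mu}_{t_{i}}=E[\mathbf{X}_{t_{i}}]\), and \(\boldsymbol{\Sigma}_{\mathbf{t}_{N}}\) is the Np×Np covariance matrix dependent on t _{ N }=[t _{1},t _{2},...,t _{ N }]^{T} and structured as
\[ \boldsymbol{\Sigma}_{\mathbf{t}_{N}}=\begin{bmatrix} \boldsymbol{\Sigma}_{t_{1},t_{1}} & \boldsymbol{\Sigma}_{t_{1},t_{2}} & \cdots & \boldsymbol{\Sigma}_{t_{1},t_{N}}\\ \boldsymbol{\Sigma}_{t_{2},t_{1}} & \boldsymbol{\Sigma}_{t_{2},t_{2}} & \cdots & \boldsymbol{\Sigma}_{t_{2},t_{N}}\\ \vdots & \vdots & \ddots & \vdots\\ \boldsymbol{\Sigma}_{t_{N},t_{1}} & \boldsymbol{\Sigma}_{t_{N},t_{2}} & \cdots & \boldsymbol{\Sigma}_{t_{N},t_{N}} \end{bmatrix}, \tag{7} \]
where
\[ \boldsymbol{\Sigma}_{t_{i},t_{j}}=E\left[\left(\mathbf{X}_{t_{i}}-\boldsymbol{\mu}_{t_{i}}\right)\left(\mathbf{X}_{t_{j}}-\boldsymbol{\mu}_{t_{j}}\right)^{T}\right]. \tag{8} \]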
We refer to t _{ N } as the observation time vector. For any fixed ω∈Ω, a sample path is a collection {X _{ t }(ω):t∈T}. We denote a realization of X at sample path ω and time vector t _{ N } by \(\mathbf{x}_{\mathbf{t}_{N}}(\omega)\).
We consider a general framework, referred to as binary classification of Gaussian processes (BCGP). Consider two independent multivariate Gaussian processes X ^{0} and X ^{1}, where for any t _{ N }, X ^{0} and X ^{1} possess mean and covariance \(\boldsymbol {\mu }_{\mathbf {t}_{N}}^{0}\) and \(\boldsymbol {\Sigma }_{\mathbf {t}_{N}}^{0}\), and \(\boldsymbol {\mu }_{\mathbf {t}_{N}}^{1}\) and \(\boldsymbol {\Sigma }_{\mathbf {t}_{N}}^{1}\), respectively. For y=0,1, \( \boldsymbol {\mu }_{\mathbf {t}_{N}}^{y}\) is defined similarly to (6) with \(\boldsymbol {\mu }_{{t}_{i}}^{y}=E\left [\mathbf {X}_{t_{i}}^{y}\right ]\) and \(\boldsymbol {\Sigma }_{\mathbf {t}_{N}}^{y}\) is defined similarly to (7) with
\[ \boldsymbol{\Sigma}_{t_{i},t_{j}}^{y}=E\left[\left(\mathbf{X}_{t_{i}}^{y}-\boldsymbol{\mu}_{t_{i}}^{y}\right)\left(\mathbf{X}_{t_{j}}^{y}-\boldsymbol{\mu}_{t_{j}}^{y}\right)^{T}\right]. \tag{9} \]
Let \(\mathbf {S}_{\mathbf {t}_{N}}^{y}\) denote a set of n ^{y} sample paths from process X ^{y} at t _{ N },
\[ \mathbf{S}_{\mathbf{t}_{N}}^{y}=\left\{\mathbf{x}_{\mathbf{t}_{N}}^{y}(\omega_{1}),\mathbf{x}_{\mathbf{t}_{N}}^{y}(\omega_{2}),\ldots,\mathbf{x}_{\mathbf{t}_{N}}^{y}(\omega_{n^{y}})\right\}. \tag{10} \]
We assume that t _{ N } is the same for both classes. Let \(\mathbf {X}_{\mathbf {t}_{N}}^{y}(\omega _{s})\) denote a future test sample path observed on the same observation time vector as the training sample paths, where y∈{0,1} indicates the label of the class-conditional process the sample path is coming from, either X ^{0} or X ^{1}. Note that, as compared with the classical probabilistic definition of classification where the sample points are observations of dimension p, here we define stochastic-process classification in connection with a set of sample paths, which can be considered as observations of dimension Np. A classification problem arises from the fact that the experimenter is blind to the class label of \(\mathbf {X}_{\mathbf {t}_{N}}^{y}(\omega _{s})\), i.e., to y, and desires a discriminant \(\psi_{\mathbf{t}_{N}}(\cdot)\) such that
Other types of classification could be defined. For example, one might be interested in classifying a test sample path \(\mathbf {x}_{\mathbf {t}_{N+M}}^{y}(\omega _{s})\) where the observation time vector of the test sample path is obtained by augmenting t _{ N } by another vector [t _{ N+1},t _{ N+2},...,t _{ N+M }], where M is a positive integer. In this case, the time of observation for the future sample path is extended. Similarly, one may define problems where the future time of observation is shrunk to a subset of time points in t _{ N } or problems where the future observation time vector is a set of time points totally or partially different from time points in t _{ N }. Throughout this work, we are mainly concerned with solving the classification problem as defined in (11), which we refer to as the standard type, and we discuss the feasibility of solving other cases when possible.
General presentation of stochastic differential equations (SDEs)
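To define SDEs, we consider a diffusion process, the most fundamental being the Wiener process. For a general definition of a q-dimensional Wiener process, see the Appendix. Let W={W _{ t }:t≥0} be a q-dimensional Wiener process. For each sample path and for 0≤t _{0}≤t≤T, we consider a vector SDE in the integral form as follows:
\[ \mathbf{X}_{t}(\omega)=\mathbf{X}_{t_{0}}(\omega)+\int_{t_{0}}^{t}\mathbf{f}(s,\omega)\,ds+\int_{t_{0}}^{t}\mathbf{G}(s,\omega)\,d\mathbf{W}_{s}(\omega), \tag{12} \]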
where \(\mathbf {f}:[0,T]\times \Omega \rightarrow \mathbb {R}^{p}\) (the p-dimensional drift vector) and \(\mathbf {G}:[0,T]\times \Omega \rightarrow \mathbb {R}^{p\times q}\) (the p×q dispersion matrix). The first integral in (12) is an ordinary Lebesgue integral, and throughout this work, we assume Itô integration for the second integral. With slightly more work, the results can be extended to Stratonovich integration. Let \(\mathcal {L}\) be the σ-algebra of Lebesgue subsets of \(\mathbb {R}\). A function h(t,ω) defined on a probability space \((\Omega,\mathcal {F},P)\) belongs to \(\mathcal {L}_{T}^{\omega }\) if it is jointly \(\mathcal {L}\times \mathcal {F}\)-measurable, h(t,·) is \(\mathcal {F}_{t}\)-measurable for each t∈[0,T], and with probability 1, \({\int _{0}^{T}}h(s,\omega)^{2}ds<\infty \). Let f ^{i} and g ^{i,j} denote the components of f and G, respectively. If we assume X _{0}(ω) is \(\mathcal {F}_{0}\)-measurable and if \(\sqrt {|f^{i}|}\in \mathcal {L}_{T}^{\omega }\) and \(g^{i,j}\in \mathcal {L}_{T}^{\omega }\), then each component of the p-dimensional process X _{ t }(ω) is \(\mathcal {F}_{t}\)-measurable [12]. The \(\mathcal {F}_{t}\)-measurability of X _{ t }(ω) along with the martingale property of W indicates “non-anticipativeness” of X _{ t }(ω) in general.
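The integral Eq. (12) is commonly written in symbolic form as
\[ d\mathbf{X}_{t}=\mathbf{f}(t,\mathbf{X}_{t})\,dt+\mathbf{G}(t,\mathbf{X}_{t})\,d\mathbf{W}_{t}, \tag{13} \]
which is the representation of a vector SDE.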
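To make the symbolic form concrete, the following is a minimal sketch of the Euler-Maruyama discretization for a vector SDE with Itô interpretation, the scheme the text later cites for numerical solution. The function name `euler_maruyama`, its argument layout, and the example coefficients are illustrative assumptions, not part of the original work.

```python
import math
import random

def euler_maruyama(f, G, x0, t0, T, n_steps, q, rng):
    """Simulate one path of dX_t = f(t, X_t) dt + G(t, X_t) dW_t.

    f(t, x) returns a length-p list (drift vector); G(t, x) returns a
    p x q nested list (dispersion matrix); W has q independent components.
    Returns a list of (time, state) pairs of length n_steps + 1.
    """
    dt = (T - t0) / n_steps
    t, x = t0, list(x0)
    path = [(t, list(x))]
    for _ in range(n_steps):
        # Wiener increments over a step of length dt are independent N(0, dt).
        dW = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(q)]
        drift = f(t, x)
        disp = G(t, x)
        x = [x[i] + drift[i] * dt
             + sum(disp[i][j] * dW[j] for j in range(q))
             for i in range(len(x))]
        t += dt
        path.append((t, list(x)))
    return path

# Hypothetical two-dimensional example with linear drift and constant dispersion.
A = [[-1.0, 0.5], [0.0, -2.0]]
drift = lambda t, x: [sum(A[i][k] * x[k] for k in range(2)) for i in range(2)]
dispersion = lambda t, x: [[0.3, 0.0], [0.0, 0.3]]
example_path = euler_maruyama(drift, dispersion, [1.0, 1.0], 0.0, 1.0, 200, 2,
                              random.Random(0))
```

With the dispersion set to zero the scheme reduces to the explicit Euler method for the corresponding ordinary differential equation, which gives a simple sanity check.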
SDE prior knowledge in the BCGP model
Prior knowledge in the form of a set of stochastic differential equations constrains the possible behavior of the dynamical system to an uncertainty class. If such prior knowledge is available, then it can be used in the BCGP model to improve classification performance. The core underlying assumption of the BCGP model is that the data are generated from two Gaussian processes for which binary classification is desired. In this regard, we define valid prior knowledge (in the form of SDEs) as a set of SDEs with a unique solution that does not contradict the Gaussianity assumption of the dynamics of the model. For f(t,X _{ t }) and G(t,X _{ t }) nonlinear with respect to the state X _{ t }, the solution of SDE (13) is generally a non-Gaussian process. Fortunately, under a wide class of linear functions, the SDE solutions are Gaussian. Such SDEs therefore constitute valid prior knowledge for each class-conditional process defined in the BCGP model. Henceforth, we focus on this type of SDE.
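For class label y=0,1, the linear classes of SDEs that we consider are defined by taking
\[ \mathbf{f}^{y}(t,\mathbf{X}_{t})=\mathbf{A}^{y}(t)\mathbf{X}_{t}+\mathbf{a}^{y}(t),\qquad \mathbf{G}^{y}(t,\mathbf{X}_{t})=\mathbf{B}^{y}(t) \tag{14} \]
in (13), with A ^{y}(t) (a p×p matrix), a ^{y}(t) (a p×1 vector), and B ^{y}(t) (a p×q matrix), these being measurable and bounded on [t _{0},T]. This results in
\[ d\mathbf{X}_{t}^{y}=\left(\mathbf{A}^{y}(t)\mathbf{X}_{t}^{y}+\mathbf{a}^{y}(t)\right)dt+\mathbf{B}^{y}(t)\,d\mathbf{W}_{t},\qquad \mathbf{X}_{t_{0}}^{y}=\mathbf{c}^{y}. \tag{15} \]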
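This initial value problem has a unique solution that is a Gaussian stochastic process if and only if the initial conditions c ^{y} are constant or normally distributed (Theorem 8.2.10 [13]). Note that in this model, G ^{y}(t,X _{ t }) (i.e., B ^{y}(t)) is independent of ω. Under this model, it can be shown that the mean (at a time index t _{ i }) and the covariance matrix (at t _{ i } and t _{ j }) of the Gaussian process \(\mathbf {X}_{t}^{y}\) are given by [13]
\[ \mathbf{m}_{t_{i}}^{y}=\boldsymbol{\Phi}^{y}(t_{i})\left(E[\mathbf{c}^{y}]+\int_{t_{0}}^{t_{i}}\boldsymbol{\Phi}^{y}(s)^{-1}\mathbf{a}^{y}(s)\,ds\right) \tag{16} \]
and
\[ \boldsymbol{\Psi}_{t_{i},t_{j}}^{y}=\boldsymbol{\Phi}^{y}(t_{i})\left(\mathrm{Cov}[\mathbf{c}^{y}]+\int_{t_{0}}^{t_{i}}\boldsymbol{\Phi}^{y}(u)^{-1}\mathbf{B}^{y}(u)\mathbf{B}^{y}(u)^{T}\left(\boldsymbol{\Phi}^{y}(u)^{-1}\right)^{T}du\right)\boldsymbol{\Phi}^{y}(t_{j})^{T}, \tag{17} \]
where t _{0}≤t _{ i }≤t _{ j }≤T and Φ ^{y}(t) is the fundamental matrix of the deterministic equation
\[ \dot{\mathbf{X}}_{t}=\mathbf{A}^{y}(t)\mathbf{X}_{t}, \tag{18} \]
normalized so that Φ ^{y}(t _{0}) is the identity matrix.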
SDEs as perfect representatives for the dynamics of classconditional processes
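If the SDE model presented in (15) could perfectly represent the dynamics of the underlying stochastic processes of the BCGP model, then there would be no need for training sample paths. To see this, note that in this case \(\boldsymbol{\mu}_{\mathbf{t}_{N}}^{y}\) and \(\boldsymbol{\Sigma}_{\mathbf{t}_{N}}^{y}\) defined in (6) and (7) are obtained by
\[ \boldsymbol{\mu}_{\mathbf{t}_{N}}^{y}=\mathbf{m}_{\mathbf{t}_{N}}^{y},\qquad \boldsymbol{\Sigma}_{\mathbf{t}_{N}}^{y}=\boldsymbol{\Psi}_{\mathbf{t}_{N}}^{y}, \tag{19} \]
where
\[ \mathbf{m}_{\mathbf{t}_{N}}^{y}=\left[\mathbf{m}_{t_{1}}^{y\,T},\mathbf{m}_{t_{2}}^{y\,T},\ldots,\mathbf{m}_{t_{N}}^{y\,T}\right]^{T} \tag{20} \]
and
\[ \boldsymbol{\Psi}_{\mathbf{t}_{N}}^{y}=\begin{bmatrix} \boldsymbol{\Psi}_{t_{1},t_{1}}^{y} & \boldsymbol{\Psi}_{t_{1},t_{2}}^{y} & \cdots & \boldsymbol{\Psi}_{t_{1},t_{N}}^{y}\\ \boldsymbol{\Psi}_{t_{2},t_{1}}^{y} & \boldsymbol{\Psi}_{t_{2},t_{2}}^{y} & \cdots & \boldsymbol{\Psi}_{t_{2},t_{N}}^{y}\\ \vdots & \vdots & \ddots & \vdots\\ \boldsymbol{\Psi}_{t_{N},t_{1}}^{y} & \boldsymbol{\Psi}_{t_{N},t_{2}}^{y} & \cdots & \boldsymbol{\Psi}_{t_{N},t_{N}}^{y} \end{bmatrix}, \tag{21} \]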
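where \({\mathbf {m}_{t_{i}}^{y\,T}}\) and \(\boldsymbol {\Psi }_{{t}_{i},{t} _{j}}^{y}\) are obtained from (16) and (17), respectively. Therefore, one can obtain the exact (or at least approximately exact) values of the means and autocovariances used to characterize the Gaussian processes involved in the BCGP model. To obtain \({\mathbf {m}_{t_{i}}^{y}}\) and \(\boldsymbol {\Psi }_{{t}_{i},{t}_{j}}^{y}\), two approaches can be taken. First, one may analytically solve (18) where possible and then use numerical methods to evaluate the integrals in (16) and (17). For example, if A ^{y}(t)=A ^{y}, i.e., if it is independent of t, the solution of (18) is given by a matrix exponential as
\[ \boldsymbol{\Phi}^{y}(t)=\exp\left(\mathbf{A}^{y}(t-t_{0})\right), \tag{22} \]
which can be used in (16) and (17). In general, where one may not be able to analytically solve (18), numerical methods such as the Euler-Maruyama scheme [14] can be used to directly solve for \( \mathbf {X}_{t}^{y}(\omega)\) and obtain
\[ \hat{\mathbf{m}}_{\mathbf{t}_{N}}^{y}=\frac{1}{l^{y}}\sum_{i=1}^{l^{y}}\mathbf{x}_{\mathbf{t}_{N}}^{y,\text{SDE}}(\omega_{i}),\qquad \hat{\boldsymbol{\Psi}}_{\mathbf{t}_{N}}^{y}=\frac{1}{l^{y}-1}\sum_{i=1}^{l^{y}}\left(\mathbf{x}_{\mathbf{t}_{N}}^{y,\text{SDE}}(\omega_{i})-\hat{\mathbf{m}}_{\mathbf{t}_{N}}^{y}\right)\left(\mathbf{x}_{\mathbf{t}_{N}}^{y,\text{SDE}}(\omega_{i})-\hat{\mathbf{m}}_{\mathbf{t}_{N}}^{y}\right)^{T}, \tag{23} \]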
where \(\mathbf {x}_{\mathbf {t}_{N}}^{y,\text {SDE}}(\omega _{i}),i=1,2,...,l^{y}\), are the generated sample paths obtained from solving the SDEs. Since there is no restriction on generating an arbitrary number of sample paths from \(\mathbf {X}_{t}^{y}(\omega)\), one can take \(l^{y}\gg Np\) to have a positive definite \( \hat {{\boldsymbol {\Psi }}}_{\mathbf {t}_{N}}^{y}\) and, at the same time, obtain an accurate estimate of the actual values of \({\mathbf {m}}_{\mathbf {t}_{N}}^{y}\) and \({{\boldsymbol {\Psi }}}_{\mathbf {t}_{N}}^{y}\). In this approach, (16) and (17) guarantee that the limits \({\lim }_{l^{y}\rightarrow \infty }\,\hat {\mathbf {m}}_{\mathbf {t}_{N}}^{y}\) and \({\lim }_{l^{y}\rightarrow \infty }\,\hat {{\boldsymbol {\Psi }}}_{\mathbf {t}_{N}}^{y}\) exist, which justifies generating more sample paths, since \({\lim }_{l^{y}\rightarrow \infty }\,\hat {\mathbf {m}}_{\mathbf {t}_{N}}^{y}={\mathbf {m}}_{\mathbf {t}_{N}}^{y}\) and \({\lim }_{l^{y}\rightarrow \infty }\,\hat {{\boldsymbol {\Psi }}}_{\mathbf {t}_{N}}^{y}={{\boldsymbol {\Psi }}}_{\mathbf {t}_{N}}^{y}\).
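The two routes described above (closed-form moments versus estimates from generated sample paths) can be compared on a scalar special case. The sketch below assumes p=q=1, constant coefficients, t _{0}=0, and a constant initial condition, for which the mean and covariance of a linear SDE have elementary closed forms; all function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random

def ou_mean(A, a, c, t):
    """Mean of X_t for dX = (A X + a) dt + b dW, X_0 = c (constant), A != 0."""
    return math.exp(A * t) * c + (a / A) * (math.exp(A * t) - 1.0)

def ou_cov(A, b, t_i, t_j):
    """Covariance of X_{t_i} and X_{t_j} for the same scalar model, A != 0."""
    s = min(t_i, t_j)
    return (b * b * math.exp(A * (t_i + t_j))
            * (1.0 - math.exp(-2.0 * A * s)) / (2.0 * A))

def simulate_path(A, a, b, c, times, substeps, rng):
    """Euler-Maruyama path observed at increasing positive times."""
    x, t, obs = c, 0.0, []
    for t_obs in times:
        dt = (t_obs - t) / substeps
        for _ in range(substeps):
            x += (A * x + a) * dt + b * rng.gauss(0.0, math.sqrt(dt))
        t = t_obs
        obs.append(x)
    return obs

def sample_moments(paths):
    """Sample mean vector and (unbiased) sample covariance matrix."""
    l, d = len(paths), len(paths[0])
    mean = [sum(p[k] for p in paths) / l for k in range(d)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in paths) / (l - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov

rng = random.Random(1)
times = [0.5, 1.0]                       # observation time vector, so Np = 2
paths = [simulate_path(-1.0, 0.0, 1.0, 2.0, times, 50, rng)
         for _ in range(2000)]           # number of paths far exceeds Np
m_hat, Psi_hat = sample_moments(paths)
m_exact = [ou_mean(-1.0, 0.0, 2.0, t) for t in times]
Psi_exact = [[ou_cov(-1.0, 1.0, ti, tj) for tj in times] for ti in times]
```

The estimates carry both Monte Carlo error, which shrinks as more paths are generated, and discretization error from the Euler-Maruyama step size, so agreement with the closed-form values is approximate rather than exact.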
In any case, we can assume exact (approximately exact) values of \(\mathbf {m}_{t_{i}}^{0}\), \(\mathbf {m}_{t_{i}}^{1}\), \({\boldsymbol {\Psi }}_{t_{i},t_{j}}^{0}\), and \({\boldsymbol {\Psi }}_{t_{i},t_{j}}^{1}\) are available. The optimal discriminant in this case is obtained by using the conventional quadratic discriminant analysis (QDA), which is now defined by using the following statistic in (11):
The use of (24) is justified by the fact that the BCGP classification reduces to differentiating independent observations of dimension Np generated from two multivariate Gaussian distributions. Therefore, taking the same set of machinery as in [15] results in (24). We restate that in this case, where (19) holds, there is no need for utilizing the sample path measurements (training sample paths) in finding the discriminant (24). This is due to the fact that the statistical properties of a Gaussian process at t _{ N } are solely determined by \({\mathbf {m}}_{\mathbf {t}_{N}}^{y}\) and \({\boldsymbol {\Psi }}_{\mathbf {t}_{N}}^{y}\) and, as mentioned before, either closed-form solutions of these are available or they can be approximated element-wise with an arbitrarily small error by generating a sufficiently large number of sample paths.
The optimal solution proposed in (24) is, in fact, a function of the observation time vector of future sample paths. Therefore, if a future sample point \(\mathbf {x}_{\mathbf {t}_{L}}^{y}(\omega _{s})\) is measured at an arbitrary time vector t _{ L }, which can be partially or totally different from t _{ N }, then the optimal discriminant \(\psi _{\mathbf {t}_{L}}\left (\mathbf {x}_{\mathbf {t}_{L}}^{y}(\omega _{s})\right)\) is obtained by determining the solution of SDEs at t _{ L } and replacing \({\mathbf {m}}_{\mathbf {t}_{N}}^{y}\) and \({\boldsymbol {\Psi }}_{\mathbf {t}_{N}}^{y}\) with \({\mathbf {m}}_{\mathbf {t}_{L}}^{y}\) and \({ \boldsymbol {\Psi }}_{\mathbf {t}_{L}}^{y}\), respectively, in (24).
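As an illustration of a quadratic discriminant built from known means and covariances, the sketch below evaluates the textbook log-ratio of two weighted multivariate Gaussian densities using a Cholesky factorization. It is a sketch under stated assumptions: the sign convention (a positive statistic favors class 0), the tie rule, and the names `qda_statistic` and `qda_label` are assumptions and may differ from the convention used in (11) and (24).

```python
import math

def cholesky(S):
    """Lower-triangular L with L L^T = S for a symmetric positive definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def quad_and_logdet(x, m, S):
    """Return (x - m)^T S^{-1} (x - m) and log det S."""
    L = cholesky(S)
    z = []
    for i in range(len(x)):
        # Forward substitution solving L z = x - m.
        z.append(((x[i] - m[i]) - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(len(x)))
    return quad, logdet

def qda_statistic(x, m0, S0, m1, S1, alpha0):
    """log(alpha0 N(x; m0, S0)) - log((1 - alpha0) N(x; m1, S1))."""
    q0, ld0 = quad_and_logdet(x, m0, S0)
    q1, ld1 = quad_and_logdet(x, m1, S1)
    return -0.5 * (q0 - q1) - 0.5 * (ld0 - ld1) + math.log(alpha0 / (1.0 - alpha0))

def qda_label(x, m0, S0, m1, S1, alpha0):
    """Assign class 0 when the statistic is positive, class 1 otherwise."""
    return 0 if qda_statistic(x, m0, S0, m1, S1, alpha0) > 0.0 else 1
```

Here `x` stands for a sample path stacked into a vector of dimension Np, and `m0`, `S0`, `m1`, `S1` stand for the stacked means and block covariance matrices of the two processes at the relevant observation time vector.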
SDEs as prior information for the dynamics of classconditional processes
In practice, the SDEs usually do not provide a complete description and are then viewed as prior knowledge concerning the underlying dynamics of the BCGP model. Since we assume that a Gaussian process governs both the dynamics of each class-conditional process (BCGP model in Section 3) and its corresponding set of SDEs (by using model (15)), incompleteness of the SDEs results from the fact that (19) does not necessarily hold. We make the following assumptions on the nature of the prior information to which the set of SDEs corresponding to each class gives rise: (i) before observing the sample paths at an observation time vector, the SDEs characterize the only information that we have about the system and (ii) the statistical properties of all Gaussian processes that may generate the data are on average (over the parameter space) equivalent to the statistical properties determined from the SDEs. The latter statement will subsequently be formalized.
Assume that the parameters \(\boldsymbol {\mu }_{\mathbf {t}_{N}}^{y}\) and \(\boldsymbol {\Sigma }_{\mathbf {t}_{N}}^{y}\) defining the BCGP model constitute a realization of the random vector \(\boldsymbol{\theta}_{\mathbf {t}_{N}}^{y}=\left [\boldsymbol {\mu }_{\mathbf {t}_{N}}^{y},\boldsymbol {\Sigma }_{\mathbf {t}_{N}}^{y}\right ]\), where \(\boldsymbol{\theta}_{\mathbf {t}_{N}}^{y}\) has a prior distribution \(\pi (\boldsymbol{\theta}_{\mathbf {t}_{N}}^{y})\) parameterized by a set \(\left \{\breve {\mathbf {m}}_{\mathbf {t}_{N}}^{y},\breve {\boldsymbol {\Psi }}_{\mathbf {t}_{N}}^{y},\nu _{\mathbf {t}_{N}}^{y},\kappa _{\mathbf {t}_{N}}^{y}\right \}\) of hyperparameters. The quantities \(\nu _{\mathbf {t}_{N}}^{y}\) and \(\kappa _{\mathbf {t}_{N}}^{y}\) define our certainty about the prior knowledge (here, the set of SDEs representing the dynamics of the model). If we take the conjugate priors for mean and covariance when the sampling is Gaussian, i.e., a normal-inverse-Wishart distribution (which depends on t _{ N }), then
\[ \boldsymbol{\mu}_{\mathbf{t}_{N}}^{y}\mid \boldsymbol{\Sigma}_{\mathbf{t}_{N}}^{y}\sim \mathcal{N}\left(\breve{\mathbf{m}}_{\mathbf{t}_{N}}^{y},\frac{\boldsymbol{\Sigma}_{\mathbf{t}_{N}}^{y}}{\nu_{\mathbf{t}_{N}}^{y}}\right),\qquad \boldsymbol{\Sigma}_{\mathbf{t}_{N}}^{y}\sim \text{inverse-Wishart}\left(\breve{\boldsymbol{\Psi}}_{\mathbf{t}_{N}}^{y},\kappa_{\mathbf{t}_{N}}^{y}\right), \tag{25} \]
with \(\boldsymbol {\mu }_{\mathbf {t}_{N}}^{y}\) and \(\boldsymbol {\Sigma }_{\mathbf {t}_{N}}^{y}\) defined in (6) and (7). Therefore, the above assumption (ii) on the nature of the prior information means that
\[ \breve{\mathbf{m}}_{\mathbf{t}_{N}}^{y}=\mathbf{m}_{\mathbf{t}_{N}}^{y},\qquad \breve{\boldsymbol{\Psi}}_{\mathbf{t}_{N}}^{y}=\left(\kappa_{\mathbf{t}_{N}}^{y}-Np-1\right)\boldsymbol{\Psi}_{\mathbf{t}_{N}}^{y}, \tag{26} \]
with \(\mathbf {m}_{\mathbf {t}_{N}}^{y}\) defined by (16) and (20) and \({\boldsymbol {\Psi }}_{\mathbf {t}_{N}}^{y}\) defined by (17) and (21). To see (26), note that from (25) we have \(E_{\pi }\left [\boldsymbol {\mu }_{\mathbf {t}_{N}}^{y}\right ]=\breve {\mathbf {m}}_{\mathbf {t}_{N}}^{y}\) and \( E_{\pi }\left [\boldsymbol {\Sigma }_{\mathbf {t}_{N}}^{y}\right ]=\frac {\breve {{\boldsymbol {\Psi }}}_{\mathbf {t}_{N}}^{y}}{\kappa _{\mathbf {t}_{N}}^{y}-Np-1} \) (the latter is the mean of an inverse-Wishart distribution, which exists when \(\kappa_{\mathbf{t}_{N}}^{y}>Np+1\)). The more confident we are about an a priori set of SDEs that is supposed to represent the underlying stochastic processes at t _{ N }, the larger we might choose the values of \(\nu _{\mathbf {t}_{N}}^{y}\) and \(\kappa _{\mathbf {t}_{N}}^{y}\) and the more concentrated the priors of the mean and covariance become about \(\mathbf {m}_{\mathbf {t}_{N}}^{y}\) and \({\boldsymbol {\Psi }}_{\mathbf {t}_{N}}^{y}\), respectively. To ensure a proper prior distribution, we assume \(\breve {\boldsymbol {\Psi }}_{\mathbf {t}_{N}}^{y}\) is positive definite, \(\kappa _{\mathbf {t}_{N}}^{y}>Np-1\), and \(\nu _{\mathbf {t}_{N}}^{y}>0\) for all t _{ N } (cf. p. 126 in [16], p. 178 in [17], and p. 427 in [3]).
Given the preceding framework for uncertainty in the BCGP model, the optimal Bayesian classification theory can be directly adapted. Specifically, the normal-inverse-Wishart prior as defined in (25) and the independence of \(\mathbf {X}_{\mathbf {t}_{N}}^{y}(\omega _{s})\) from training sample paths match the conditions in [6], i.e., having a normal-inverse-Wishart prior and independence of future data points from training data points. As a result, we can follow the same set of machinery to find the effective class-conditional distributions of the processes (similar to equation (64) in [6]) and from there obtain the optimal discriminant. Therefore, extending the dimensionality of the problem to Np and using the set of parameters \(\left \{\breve {\mathbf {m}}_{\mathbf {t}_{N}}^{y},\breve {\boldsymbol {\Psi }}_{\mathbf {t}_{N}}^{y},\nu _{\mathbf {t}_{N}}^{y},\kappa _{ \mathbf {t}_{N}}^{y}\right \}\) in the discriminant presented by Eq. (65) in [6] yields
where
with
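where \(\breve {\mathbf {m}}_{\mathbf {t}_{N}}^{y}\) and \(\breve {{\boldsymbol { \Psi }}}_{\mathbf {t}_{N}}^{y}\) are determined from (26), and \(\hat {\boldsymbol {\mu }}_{\mathbf {t}_{N}}^{y}\) and \(\hat {\boldsymbol {\Sigma }}_{\mathbf {t}_{N}}^{y}\) are the sample mean and sample covariance matrix obtained by using the sample path training sets \(\mathbf {S}_{\mathbf {t}_{N}}^{0}\) and \(\mathbf {S}_{\mathbf {t}_{N}}^{1}\) as follows:
\[ \hat{\boldsymbol{\mu}}_{\mathbf{t}_{N}}^{y}=\frac{1}{n^{y}}\sum_{i=1}^{n^{y}}\mathbf{x}_{\mathbf{t}_{N}}^{y}(\omega_{i}),\qquad \hat{\boldsymbol{\Sigma}}_{\mathbf{t}_{N}}^{y}=\frac{1}{n^{y}-1}\sum_{i=1}^{n^{y}}\left(\mathbf{x}_{\mathbf{t}_{N}}^{y}(\omega_{i})-\hat{\boldsymbol{\mu}}_{\mathbf{t}_{N}}^{y}\right)\left(\mathbf{x}_{\mathbf{t}_{N}}^{y}(\omega_{i})-\hat{\boldsymbol{\mu}}_{\mathbf{t}_{N}}^{y}\right)^{T}. \tag{30} \]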
As opposed to Section 4.1, where the discriminant can be applied to any future sample path with an arbitrary observation time vector, here the discriminant depends on both the future and training observation time vectors. Thus, if the future observation time vector t _{ L } ^{y} contains only time points t _{ i } with t _{ i }∈t _{ N } ^{y}, one may easily apply the optimal discriminant. This is done by reducing the dimensionality of the problem, considering the training sample paths only at t _{ L } ^{y}, i.e., by discarding the training sample points at those time points of t _{ N } ^{y} not in t _{ L } ^{y} (denoted by t _{ N } ^{y}∖t _{ L } ^{y}). However, solving the case where t _{ L } ^{y} includes time points not included in t _{ N } ^{y} is more difficult and requires further study. In this case, although one is able to construct the class of prior knowledge for t _{ L } ^{y} (i.e., constructing \({\boldsymbol {\mu }} _{\mathbf {t}_{N}}^{y}\) and \(\boldsymbol {\Psi }_{\mathbf {t}_{N}}^{y}\)), the absence of training sample paths at t _{ L } ^{y}∖t _{ N } ^{y} does not permit employing (27).
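The way the training sample paths combine with SDE-derived hyperparameters can be illustrated with the standard conjugate update for a normal-inverse-Wishart prior under Gaussian sampling. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the dimension d stands for Np, and whether the displays (28)–(29) take exactly this form is an assumption rather than something confirmed by the text.

```python
def prior_from_sde(m_sde, Psi_sde, kappa):
    """Hyperparameters whose prior means match the SDE moments.

    Chooses the scale matrix so that the inverse-Wishart mean,
    Psi_prior / (kappa - d - 1), equals Psi_sde.
    """
    d = len(m_sde)
    if kappa <= d + 1:
        raise ValueError("kappa must exceed d + 1 for the prior mean to exist")
    scale = kappa - d - 1
    return list(m_sde), [[scale * v for v in row] for row in Psi_sde]

def niw_posterior(m_prior, Psi_prior, nu, kappa, samples):
    """Standard normal-inverse-Wishart update given n Gaussian observations."""
    n, d = len(samples), len(m_prior)
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    # Scatter matrix, equal to (n - 1) times the sample covariance matrix.
    scatter = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples)
                for j in range(d)] for i in range(d)]
    nu_post = nu + n
    kappa_post = kappa + n
    m_post = [(nu * m_prior[i] + n * mean[i]) / nu_post for i in range(d)]
    w = nu * n / nu_post
    Psi_post = [[Psi_prior[i][j] + scatter[i][j]
                 + w * (mean[i] - m_prior[i]) * (mean[j] - m_prior[j])
                 for j in range(d)] for i in range(d)]
    return m_post, Psi_post, nu_post, kappa_post
```

Larger values of `nu` and `kappa` pull the posterior mean and scale toward the SDE-derived values, which mirrors the interpretation of these quantities as certainty about the prior knowledge.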
Performance analysis
In this section, we analyze the effect of prior knowledge in the form of stochastic differential equations on the performance of the stochastic discriminant, \(\psi _{\mathbf {t}_{N}}^{\text{OBC}}\left (\mathbf {x}_{\mathbf {t} _{N}}^{y}(\omega _{s})\right)\), defined by (27)–(29). As the metric of performance, we take the true error averaged over the sampling space. The true error of a discriminant trained on an observation time vector t _{ N }, i.e., \(\psi_{\mathbf{t}_{N}}(\cdot)\), is the probability of misclassification, which by considering (11) is defined as
where X ^{y} denotes the class-conditional process that generates the future sample path \(\mathbf {X}_{\mathbf {t}_{N}}^{y}(\omega _{s})\) (we assume independence of future sample paths from training sample paths), \(\mathbf {S}_{\mathbf {t}_{N}}^{y}\) denotes the set of training sample paths from class y, and \(\alpha _{\mathbf {t}_{N}}^{y}\) is the mixing probability of the class-conditional process.
Recall that in this work, we consider a separate sampling scheme. With separate sampling in a classical binary classification problem where sample points are generated from two class-conditional densities, there is no sensible estimate of the prior probabilities of the classes from the sample [15]. In that case, either the ratio of the number of sample points in each class to the total sample size needs to reflect the corresponding prior probability of the class, or the prior probabilities need to be known a priori; otherwise, classification rules or error estimation rules suffer performance degradation [15, 18, 19]. The same argument applies to this work, in which we consider a binary classification of sample paths that are generated from two class-conditional processes under a separate sampling scheme. In this regard, we assume that the prior probability \(\alpha _{\mathbf {t}_{N}}^{y}\) is known a priori.
Taking expectation over the sample space, that is over the mixture of Gaussian processes with the means and covariance matrices defined by (16), (20), (17), and (21), yields
As benchmarks for evaluating the performance of \(\psi _{\mathbf {t}_{N}}^{\text {OBC}} \left (\mathbf {x}_{\mathbf {t}_{N}}^{y}(\omega _{s})\right)\), we compare its performance to (1) the performance of the stochastic QDA, \(\psi _{\mathbf {t}_{N}}^{\text {QDA}}\left (\mathbf {X}_{\mathbf {t}_{N}}^{y}(\omega _{s})\right)\), which is defined by (23) and (24), where l ^{y}=n ^{y}, with n ^{y} indicating the number of available sample paths, and (2) the performance of a Bayes classifier obtained by plugging (16), (17), (20), and (21) into (24).
Synthetic experiments
Experimental setup
The following steps are used to set up the experiments:

1.
To fix the ground-truth model governing the underlying dynamics of the data, we consider a set of three-dimensional SDEs (p=3) defined by (15) along with the following set of parameters:
$$ \begin{aligned} \mathbf{A}^{0}(t)&=\mathbf{A}^{1}(t)=[0.01,0.01,0.01]^{T},\\ \mathbf{a}^{0}(t)&=\mathbf{a}^{1}(t)=[0,0,0]^{T}, \\ \mathbf{X}_{t_{0}}^{0}(\omega)&=[0,0,0]^{T},\quad \mathbf{X}_{t_{0}}^{1}(\omega)=[0.25,0.25,0.25]^{T}, \\ \mathbf{B}^{0}(t)&=\mathbf{B}^{1}(t)=0.1\times \left\{ \begin{array}{ll} \sigma^{2}=1 &\text{diagonal elements} \\ \rho =0.4 & \text{otherwise} \end{array}\right. \end{aligned} \tag{33} $$
The only difference between the SDEs describing X ^{0} and X ^{1} is in the constant initial conditions. Figure 1 presents a single sample path of these two three-dimensional processes for 0≤t≤100.

2.
Use the ground-truth set of SDEs to generate a set of training sample paths, \(\mathbf {S}_{\mathbf {t}_{N}}^{y}\), of size n ^{y} for class y=0,1. We let n ^{0}=n ^{1}=n, where n∈, let the length of the observation time vector be N=20, and take [ t _{1},t _{2},...,t _{ N }] such that t _{ i }−t _{ i−1}=1, i=2,…,20.

3.
Use the ground-truth set of SDEs to generate a set of test sample paths, \(\mathbf {S}_{\mathbf {t}_{N}}^{y,\,\text {test}}\), of size n ^{y, test}=2,000 for class y=0,1, where n ^{0, test}=n ^{1, test}=n ^{test}.

4.
Use \(\mathbf {S}_{\mathbf {t}_{N}}^{0}\cup \mathbf {S}_{\mathbf {t} _{N}}^{1}\) to train the stochastic QDA, \(\psi _{\mathbf {t}_{N}}^{\text {QDA}}\left (\mathbf {x}_{\mathbf {t}_{N}}^{y}(\omega _{s})\right)\), which is defined by (23) and (24) with l ^{y}=n ^{y}. Apply the trained classifier to the set of test sample paths, \(\mathbf {S}_{\mathbf {t} _{N}}^{0,\,\text {test}}\cup \mathbf {S}_{\mathbf {t}_{N}}^{1,\,\text {test}}\), to determine the true error, \(\epsilon _{\mathbf {t}_{N}}^{\text {QDA}}\), which is defined by replacing \(\psi _{\mathbf {t}_{N}}\left (\mathbf {X}_{\mathbf {t}_{N}}^{y}(\omega _{s})\right)\) with \(\psi _{\mathbf {t}_{N}}^{\text {QDA}}\left (\mathbf {X}_{\mathbf {t}_{N}}^{y}(\omega _{s})\right)\) in (31). This procedure obtains an accurate estimate of true error.

5.
Assume a set of SDEs obtained from prior knowledge (a priori SDEs). Let this a priori set of SDEs be presented by replacing A ^{y}(t), B ^{y}(t), a ^{y}(t), and \(\mathbf {X} _{t_{0}}^{y}(\omega)\) in (15) with \(\tilde {\mathbf {A}}^{y}(t)\), \(\tilde {\mathbf {B}}^{y}(t)\), \(\tilde {\mathbf {a}}^{y}(t)\), and \(\tilde {\mathbf { X}}_{t_{0}}^{y}(\omega)\), respectively. To examine the effects of deviations in the drift vector and dispersion matrix in the a priori set of SDEs from the ground-truth model introduced in (33), we assume

\(\tilde {\mathbf {A}}^{0}(t)={\mathbf {A}}^{0}(t)\), \(\tilde {\mathbf {B}} ^{0}(t)={\mathbf {B}}^{0}(t)\), \(\tilde {\mathbf {X}}^{0}_{t_{0}}(\omega)= \mathbf {X}_{t_{0}}^{0}(\omega)\), \(\tilde {\mathbf {X}}^{1}_{t_{0}}(\omega)= \mathbf {X}_{t_{0}}^{1}(\omega)\), \(\tilde {\mathbf {a}}^{0}(t)=\mathbf {a} ^{0}(t),\tilde {\mathbf {a}}^{1}(t)=\mathbf {a}^{1}(t)\).

To study the effect of shift in the drift vector, we take \(\tilde { \mathbf {A}}^{1}(t)={\mathbf {A}}^{1}(t)+[\!\Delta \mu,\Delta \mu,\Delta \mu ]^{T}\), where Δ μ=0,0.1,0.2,0.3. Here we assume \(\tilde {\mathbf {B}} ^{1}(t)={\mathbf {B}}^{1}(t)\).

To study the effect of shift in the dispersion matrix, we assume the off-diagonal elements of \(\tilde {\mathbf {B}}^{1}(t)\) are defined by replacing ρ with ρ _{ d } in (33), where ρ _{ d }−ρ=Δ ρ=0,0.03,0.06,0.1. Here we assume \(\tilde {\mathbf {A}} ^{1}(t)={\mathbf {A}}^{1}(t)\).

The hyperparameters defining our uncertainty about the specific choice of a priori SDEs (in fact, about the resultant prior distributions) are \(\nu _{\mathbf {t}_{N}}^{0}=\nu _{\mathbf {t}_{N}}^{1}=\kappa _{\mathbf {t} _{N}}^{0}=\kappa _{\mathbf {t}_{N}}^{1}=Np+\kappa \). The choice of N p+κ, κ=20,50,100,500, is made to have proper prior distributions (see Section 4.2).


6.
Generate 2,000 sample paths from the a priori set of SDEs introduced in Step 5. These sample paths are used to calculate the hyperparameters \(\mathbf {m}_{\mathbf {t}_{N}}^{y}\) and \({\boldsymbol {\Psi }}_{ \mathbf {t}_{N}}^{y}\) being used in (26) (alternatively, one may solve (16), (17), (20), and (21) directly and use them in (26)).

7.
Use \(\mathbf {m}_{\mathbf {t}_{N}}^{y}\) and \({\boldsymbol {\Psi }}_{ \mathbf {t}_{N}}^{y}\) obtained from Step 6 along with \(\mathbf {S}_{\mathbf {t} _{N}}^{0}\cup \mathbf {S}_{\mathbf {t}_{N}}^{1}\) to train \(\psi _{\mathbf {t} _{N}}^{\text {OBC}}\left (\mathbf {x}_{\mathbf {t}_{N}}^{y}(\omega _{s})\right)\), which is defined in (27). Apply the trained classifier to the set of test sample paths, \(\mathbf {S}_{\mathbf {t}_{N}}^{0,\,\text {test}}\cup \mathbf {S}_{ \mathbf {t}_{N}}^{1,\,\text {test}}\), to determine the true error, \(\epsilon _{ \mathbf {t}_{N}}^{\text {OBC}}\), which is defined by replacing \(\psi _{\mathbf {t}_{N}} \left (\mathbf {X}_{\mathbf {t}_{N}}^{y}(\omega _{s})\right)\) with \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}\left (\mathbf {x}_{\mathbf {t}_{N}}^{y}(\omega _{s})\right)\) in (31).

8.
Repeat Steps 2 through 7 a total of T=1,000 times to estimate \( E\left [\epsilon _{\mathbf {t}_{N}}^{\text {QDA}}\right ]\) and \(E\left [\epsilon _{\mathbf {t}_{N}}^{\text {OBC}}\right ] \).

9.
Generate 2,000 sample paths from the ground-truth set of SDEs introduced in (33). Use these sample paths to train the stochastic QDA, \(\psi _{\mathbf {t}_{N}}^{\text {QDA}}\left (\mathbf {x}_{\mathbf {t} _{N}}^{y}(\omega _{s})\right)\), which is defined by (23) and (24) with l ^{y}=2,000. This provides an accurate estimate of the Bayes (optimal) classifier. Apply this classifier to \(\mathbf {S}_{\mathbf {t} _{N}}^{0,\,\text {test}}\cup \mathbf {S}_{\mathbf {t}_{N}}^{1,\,\text {test}}\) to obtain the Bayes error, which is a lower bound on the error of any classifier. Note that in our experiments obtaining the Bayes error is possible since we have complete knowledge of the underlying ground-truth models.
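The path-generation parts of the protocol above (Steps 1 to 3 and Step 6) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: since (15) is not reproduced in this section, the sketch assumes the linear form dX = (A X + a) dt + B dW and reads the three listed values of A as the diagonal of a 3×3 matrix. The training size n = 80, the number of Euler-Maruyama sub-steps, the choice t _{1} = 1, and the helper names `simulate_paths` and `path_moments` are likewise hypothetical. The a priori SDEs are taken equal to the ground truth (Δ μ = 0, Δ ρ = 0), and the sample moments computed here are only the raw ingredients for (26), which is not shown.

```python
import numpy as np

P, N = 3, 20                               # feature dimension, number of time points
A = np.diag([0.01, 0.01, 0.01])            # ASSUMPTION: listed values read as diag of A
a = np.zeros(P)
B = 0.1 * (0.4 * np.ones((P, P)) + 0.6 * np.eye(P))   # 1 on diagonal, 0.4 elsewhere
X0 = {0: np.zeros(P), 1: np.full(P, 0.25)}

def simulate_paths(n, x0, A, a, B, n_obs=N, substeps=10, rng=None):
    """Euler-Maruyama for the ASSUMED form dX = (A X + a) dt + B dW,
    observed at t = 1, ..., n_obs. Returns an array of shape (n, n_obs, p)."""
    rng = np.random.default_rng() if rng is None else rng
    dt = 1.0 / substeps
    x = np.tile(x0, (n, 1)).astype(float)          # shape (n, p)
    out = np.empty((n, n_obs, len(x0)))
    for i in range(n_obs):
        for _ in range(substeps):
            dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
            x = x + (x @ A.T + a) * dt + dW @ B.T
        out[:, i, :] = x
    return out

def path_moments(paths):
    """Sample mean and covariance of the stacked (N*p)-dimensional path vectors."""
    flat = paths.reshape(paths.shape[0], -1)
    return flat.mean(axis=0), np.cov(flat, rowvar=False)

rng = np.random.default_rng(1)
# Training paths from the ground-truth model (hypothetical size n = 80 per class).
train = {y: simulate_paths(80, X0[y], A, a, B, rng=rng) for y in (0, 1)}
# 2,000 paths per class from the a priori SDEs (here identical to the ground truth).
prior_paths = {y: simulate_paths(2000, X0[y], A, a, B, rng=rng) for y in (0, 1)}
m, Psi = zip(*(path_moments(prior_paths[y]) for y in (0, 1)))
print(train[0].shape, m[0].shape, Psi[0].shape)  # (80, 20, 3) (60,) (60, 60)
```

Stacking each path into a single vector of length N p = 60 mirrors the way the discriminants operate on whole sample paths; the classifiers themselves are omitted because their defining equations, (23), (24), and (27), are not available here.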
Results
Figure 2 shows the effect of a shift in the drift vector from the ground-truth model via plots of the expected true error of \(\psi _{\mathbf {t}_{N}}^{\text {QDA}}(.)\) and \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\) as functions of the size of training sample paths and κ for y=0,1, \(\tilde {\mathbf {B}}^{y}(t)={\mathbf {B}}^{y}(t)\), \(\tilde {\mathbf {X}} ^{y}_{t_{0}}(\omega)=\mathbf {X}_{t_{0}}^{y}(\omega)\), \(\tilde {\mathbf {A}} ^{0}(t)={\mathbf {A}}^{0}(t)\), and \(\tilde {\mathbf {A}}^{1}(t)={\mathbf {A}} ^{1}(t)+[\Delta \mu,\Delta \mu,\Delta \mu ]^{T}\), where Δ μ=0,0.1,0.2,0.3. If the set of a priori SDEs is equivalent or close to the ground-truth model, e.g., Δ μ=0 or Δ μ=0.1, then \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\) outperforms \(\psi _{\mathbf {t}_{N}}^{\text {QDA}}(.)\) for a wide range of training sample sizes and κ. The more the prior distribution generated from the set of a priori SDEs is concentrated about the true underlying parameters of the model, and the larger κ is, the better the performance achieved by using \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\).
Figure 3 presents the effect of the discrepancy between the dispersion matrix of the ground-truth model and that of the a priori set of SDEs. Again, the closer the prior knowledge is to the ground-truth model, and the larger κ is, the better the performance achieved by using \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\).
An experiment inspired by a model of the evolutionary process
In this section, we use a form of an Ornstein-Uhlenbeck process introduced in [20] for modeling the evolutionary change of species. This model has been recently employed by [21] to simulate quantitative trait data as a function of single nucleotide polymorphism (SNP) states. The model is presented by the following SDE:
where \({X}_{t}^{y}\) is the quantitative trait value in a species y, θ ^{y} is the primary target value of the trait, \({X_{a}^{y}}\) is the mean state in an ancestor a, and \({W}_{t}^{y}\) represents Brownian motion. The parameter β ^{y} is the rate of adaptation of species y to the target value—a low rate of adaptation means very slow evolution while a large β ^{y} practically indicates an instantaneous adaptation. The parameter σ ^{y} is an indicator of perturbation due to random selective factors such as random mutations and environmental fluctuations [20]. Similar to [21], we assume the value of the primary target is constant over the history of the species. Nevertheless, the model in (34) can be extended to include situations where the primary target can change over the evolutionary history of the species (see [20]).
Using the model of (34), we generate the evolutionary histories of a quantitative trait of two species, 0 and 1, over a time span of 30 million years with time steps of 1 million years. Similarly to [20, 21], to fix the ground-truth model that generates the data, we vary values of β ^{y}, take σ ^{y}=1, and assume θ ^{0}=80 and θ ^{1}=85. Furthermore, we assume both species have a common ancestor at the state \({X_{a}^{y}}=1\). Figure 4 presents 20 sample paths from each of these evolutionary processes for the case where β ^{0}=β ^{1}=β, β=0.1 (Fig. 4 a) and β=0.15 (Fig. 4 b). A larger β indicates a faster adaptation of species to the target value. The problem considered here is to use a set of a priori SDEs in constructing a classifier to differentiate the evolutionary history of an n-size population of species 0 from an n-size population of species 1, where n∈[ 60,140].
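To make the data-generation step concrete, the following sketch simulates such evolutionary histories with the Euler-Maruyama scheme. It assumes that (34) has the usual Ornstein-Uhlenbeck form dX = β(θ − X) dt + σ dW with initial state X _{a}, which matches the parameter descriptions above but is not copied from the paper; the function name `simulate_ou`, the number of sub-steps, and the random seed are illustrative choices.

```python
import numpy as np

def simulate_ou(n, beta, theta, sigma=1.0, x_a=1.0, n_steps=30, substeps=20, rng=None):
    """Euler-Maruyama paths of the ASSUMED model dX = beta*(theta - X) dt + sigma dW,
    started at X_0 = x_a.

    Returns an array of shape (n, n_steps): the trait value after each
    1-million-year step (time is measured in millions of years).
    """
    rng = np.random.default_rng() if rng is None else rng
    dt = 1.0 / substeps
    x = np.full(n, float(x_a))
    out = np.empty((n, n_steps))
    for i in range(n_steps):
        for _ in range(substeps):
            x = x + beta * (theta - x) * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
        out[:, i] = x
    return out

rng = np.random.default_rng(2)
species0 = simulate_ou(20, beta=0.1, theta=80.0, rng=rng)   # 20 paths, as in Fig. 4 a
species1 = simulate_ou(20, beta=0.1, theta=85.0, rng=rng)

# Under the assumed SDE the mean is theta + (x_a - theta) * exp(-beta * t).
expected_end0 = 80.0 + (1.0 - 80.0) * np.exp(-0.1 * 30)
print(species0.shape, round(expected_end0, 2))  # (20, 30) 76.07
```

With β=0.1 the mean trait value after 30 million years is still a few units short of the target θ, which is consistent with the remark that a larger β indicates faster adaptation.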
The general protocol for evaluating the performance of \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\) is similar to that of Section 5.1, except that the ground-truth model (33) is replaced with (34) and the following step is used instead of Step 5:

Assume a set of SDEs obtained from prior knowledge (a priori SDEs). Let this a priori set of SDEs be presented by replacing β ^{y}, θ ^{y}, \({X_{a}^{y}}\), and σ ^{y} by \(\tilde {\beta } ^{y}\), \(\tilde {\theta }^{y}\), \(\tilde {X}_{a}^{y}\), and \(\tilde {\sigma }^{y}\), respectively, in (34). To examine the effect of deviation of the adaptation rate in the a priori set of SDEs from the ground-truth model, we let \( \tilde {\theta }^{y}={\theta }^{y}\), \(\tilde {\sigma }^{y}={\sigma }^{y}\), \( \tilde {X}_{a}^{y}={X}_{a}^{y}\), and \(\tilde {\beta }^{0}={\beta }^{0}\) and take \( \tilde {\beta }^{1}={\beta }^{1}+\Delta \beta \).
Results
Figures 5 and 6 (β=0.1 and β=0.15, respectively) show the effect of a deviation from the true rate of adaptation to the target value by considering \(\tilde {\beta }^{1}={\beta } ^{1}+\Delta \beta,\) where Δ β=0, 0.02, 0.04, 0.06. They provide plots of the expected true error of \(\psi _{\mathbf {t}_{N}}^{\text {QDA}} (.)\) and \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\) as functions of the size of training sample paths and κ. In both figures, the closer the prior knowledge is to the ground-truth evolutionary models, the better the performance achieved by using \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\). The performance deteriorates and eventually becomes worse than that of \(\psi _{ \mathbf {t}_{N}}^{\text {QDA}}(.)\) as the prior knowledge diverges from the ground-truth model and the certainty about the prior knowledge increases (a bad combination when utilizing prior knowledge). In addition, comparing Figs. 5 and 6 shows that the smaller the true value of β, the more destructive a fixed deviation of prior knowledge from the true β.
Conclusions
This paper provides the first instance in which prior knowledge in the form of SDEs is used to construct a prior distribution over an uncertainty class of feature-label distributions for the purpose of optimal classification. Given the ubiquity of small samples in biomedicine and other areas where sample data is expensive, time-consuming, limited by regulation, or simply unavailable, we have previously made the point that prior knowledge is the only avenue available. To achieve the mapping of SDE prior knowledge into a prior distribution, we have taken advantage of the form and Gaussianity of (12). This mapping is heavily dependent on the form of the SDEs, and one can expect widely varying mappings for different SDE settings.
In general, all parameters used in the a priori set of SDEs can affect the performance of \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\). These parameters include every element of the matrices \(\tilde {\mathbf {A}}^{y}(t)\) and \(\tilde {\mathbf {B}}^{y}(t)\) and all the elements of the vectors \(\tilde { \mathbf {a}}^{y}(t)\) and \(\tilde {\mathbf {X}}^{y}_{t_{0}}(\omega)\) used in the presentation of the SDEs in (15). For example, in the experiment of the evolutionary change of species considered in (34), a deviation in any of the parameters, namely \(\tilde {\beta }^{y}\), \(\tilde {\sigma } ^{y} \), \(\tilde {\theta }^{y}\), and \(\tilde {X}_{a}^{y},\) can affect the performance of \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\). Although simulation studies can elucidate the effects of deviation of prior knowledge from the ground-truth model (as done herein), it would be beneficial to analytically characterize the performance of \(\psi _{\mathbf {t}_{N}}^{\text {OBC}} (.)\) in terms of all the hyperparameters; however, this may be very difficult to accomplish. One possible approach may be to use an asymptotic Bayesian framework [22] to characterize the performance of \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}(.)\) in terms of sample size, dimensionality, and hyperparameters.
Recognizing that the construction of robust classifiers is simply a special case of optimal Bayesian classification where there are no sample data, so that the “posterior” is identical to the prior [7], the application of SDEs in this paper is at once applicable to optimal robust classification in a stochastic setting. Beyond that, one can consider the more general setting of optimal Bayesian robust filtering of random processes, where optimization across an uncertainty class of random processes, ideal and observed, is relative to process characteristics such as the auto- and cross-correlation functions [23]. Whereas in this paper we have considered using SDE prior knowledge to construct prior distributions governing uncertainty classes of feature-label distributions, it seems feasible to use SDE knowledge to construct prior distributions governing uncertainty classes of random-process characteristics in the case of optimal filtering. Of course, one must confront the increased abstraction presented by canonical representation of random processes [24, 25]; nevertheless, so long as one remains in the framework of second-order canonical expansions, it should be doable.
Appendix
Definition of q-dimensional Wiener process
A one-dimensional Wiener process over [ 0,T] is a Gaussian process W={W _{ t }:t≥0} satisfying the following properties:

For 0≤t _{1}<t _{2}<T, \(W_{t_{2}}-W_{t_{1}}\) is distributed as \( \sqrt {t_{2}-t_{1}}\,N\left (0,\sigma ^{2}\right)\), where σ>0 (for the standard Wiener process, σ=1).

For 0≤t _{1}<t _{2}<t _{3}<t _{4}<T, \(W_{t_{4}}-W_{t_{3}}\) is independent of \(W_{t_{2}}-W_{t_{1}}\).

W _{0}=0 with probability 1.

The sample paths of W are almost surely continuous everywhere.
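The first three properties can be checked numerically on simulated paths. The sketch below builds Wiener paths from independent Gaussian increments on a uniform grid; the function name `wiener_paths`, the grid, and the sample sizes are illustrative assumptions.

```python
import numpy as np

def wiener_paths(n_paths, n_steps, T=1.0, q=1, sigma=1.0, rng=None):
    """Sample q-dimensional Wiener paths on a uniform grid over [0, T].

    Returns an array of shape (n_paths, n_steps + 1, q) with W_0 = 0.
    """
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    # Independent Gaussian increments with variance sigma^2 * dt.
    dW = sigma * np.sqrt(dt) * rng.normal(size=(n_paths, n_steps, q))
    W = np.concatenate([np.zeros((n_paths, 1, q)), np.cumsum(dW, axis=1)], axis=1)
    return W

W = wiener_paths(20000, 100, T=2.0, q=2, rng=np.random.default_rng(4))
inc = W[:, 80, :] - W[:, 30, :]          # W_{t2} - W_{t1} with t2 - t1 = 1.0
print(inc.var(axis=0))                   # each entry should be close to sigma^2 * 1.0
```

Increments over disjoint intervals, such as W[:, 20, :] - W[:, 10, :] and `inc`, should show sample correlations near zero, in line with the independence property.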
In general, a q-dimensional Wiener process is defined using the homogeneous Markov process X _{ t } for t∈[t _{0},T]. Let \(P(t_{1},x;t_{2}, B)=P(\mathbf {X}_{t_{2}} \in B\mid \mathbf {X}_{t_{1}}=x)\) denote the transition probabilities of a Markov process X _{ t } for t _{1}<t _{2}. For fixed values of t _{1}, x, and t _{2}, P(t _{1},x;t _{2},.) is a probability function (measure) on the σ-algebra \(\mathcal {B}\) of Borel subsets of the sample space R ^{q}. Intuitively, P(t _{1},x;t _{2},B) is the probability that the process is in the set \(B\in \mathcal {B}\) at time t _{2} given that it was in state x at time t _{1}. A Markov process is homogeneous with respect to t if its transition probability P(t _{1},x;t _{2},B) is stationary. That is, for t _{0}<t _{1}<t _{2}<T and t _{0}<t _{1}+u<t _{2}+u<T, it satisfies
$$ P(t_{1}+u,x;t_{2}+u,B)=P(t_{1},x;t_{2},B). $$
In this case P(t _{1},x;t _{2},B) is commonly denoted by P(t _{2}−t _{1},x;B). A q-dimensional Wiener process is a q-dimensional homogeneous Markov process defined on [0,∞) with stationary transition probability defined by a multivariate Gaussian distribution as follows:
Therefore, each dimension of a q-dimensional Wiener process is a one-dimensional Wiener process per se.
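For reference, a common textbook form of this stationary transition probability in the standard case (σ=1) is given below. It is stated here as an assumption rather than as the paper's exact display, with \(\mathbf {y}\) a dummy integration variable and \(\Vert \cdot \Vert \) the Euclidean norm on R ^{q}:

$$ P(t,\mathbf{x};B)=\int_{B}(2\pi t)^{-q/2}\exp\left(-\frac{\Vert \mathbf{y}-\mathbf{x}\Vert^{2}}{2t}\right)d\mathbf{y},\qquad t>0. $$

Because this Gaussian density factorizes into a product over the q coordinates, the coordinates evolve independently, each with the one-dimensional Gaussian transition density of variance t.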
Computational complexity
The computational complexity of the algorithm is determined by the computational cost of solving the set of SDEs with the Euler-Maruyama scheme (see Section 4.1) along with the computational cost of evaluating (27). The computational cost of the Euler-Maruyama scheme per sample path is inversely proportional to Δ t [26], where Δ t=T/N, with T and N being defined in Section 3. Thus, for l=l ^{0}+l ^{1} sample paths, it is O(l/Δ t). In (27), the computational cost of evaluating \(\mathbf {x}_{\mathbf {t}_{N}}^{y}(\omega _{s})-{\mathbf {m}}_{\mathbf {t}_{N}}^{0\;\ast },\) with y=0,1, breaks down to a computation of \(\breve {\mathbf {m}}_{\mathbf {t}_{N}}^{y}\) and \(\hat {\boldsymbol {\mu }}_{\mathbf {t}_{N}}^{y}\), which are operations with computational costs of O(l ^{y} N p) and O(n ^{y} N p), respectively.
Computation of \({{\boldsymbol {\Pi }}_{\mathbf {t}_{N}}^{y\;-1}}\) in (27) by Gaussian elimination is an O(max{n ^{y},N p}N ^{2} p ^{2})+O(l ^{y} N ^{2} p ^{2}) operation (cf. Section 3.7.2 in [27]). This can be simplified further. To have a positive definite \(\breve {{ \boldsymbol {\Psi }}}_{\mathbf {t}_{N}}^{y}\), we assume that many sample paths are generated by solving the set of SDEs, so that l ^{y}≫N p (see Section 4.1); since \({\boldsymbol {\Psi }}_{\mathbf {t}_{N}}^{y\;\ast }\) and \({{ \boldsymbol {\Pi }}_{\mathbf {t}_{N}}^{y\;-1}}\) defined in (29) then become positive definite, we do not need to impose the condition n ^{y}>N p. Under a realistic assumption on the number of available sample paths, we can assume l ^{y}≫n ^{y}, and therefore the computation of \({{\boldsymbol {\Pi } }_{\mathbf {t}_{N}}^{y\;-1}}\) becomes an O(l ^{y} N ^{2} p ^{2}) calculation. Furthermore, the product of \(\mathbf {x}_{\mathbf {t}_{N}}^{y}(\omega _{s})-{ \mathbf {m}}_{\mathbf {t}_{N}}^{0\;\ast }\) with \({{\boldsymbol {\Pi }}_{\mathbf { t}_{N}}^{y\;-1}}\) is an O(N ^{2} p ^{2}) calculation. Altogether, by assuming 1/Δ t<(N p)^{2} and k ^{0}+k ^{1}+N p<(N p)^{2}, the overall computational cost of \(\psi _{\mathbf {t}_{N}}^{\text {OBC}}\left (\mathbf {x}_{\mathbf {t}_{N}}^{y}(\omega _{s})\right)\) is O(max{l ^{0},l ^{1}}N ^{2} p ^{2}).
Using a similar approach, we see that the computational cost of QDA, which is solely constructed by using n ^{0}+n ^{1} training sample paths from classes 0 and 1 (i.e., no prior knowledge) is O(max{n ^{0},n ^{1}}N ^{2} p ^{2}). We also note that for computing QDA we need to have min{n ^{0},n ^{1}}>N p because, otherwise, the sample covariance matrices used in QDA are not invertible.
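The invertibility condition for QDA can be illustrated numerically: a sample covariance matrix built from n vectors of length N p has rank at most n−1, so it is singular whenever n≤N p. The sketch below uses i.i.d. Gaussian vectors purely as a stand-in for stacked sample paths; the helper name `sample_cov_rank` and the sample sizes are illustrative assumptions.

```python
import numpy as np

def sample_cov_rank(n, dim, rng):
    """Rank of the sample covariance of n i.i.d. Gaussian vectors of length dim."""
    x = rng.normal(size=(n, dim))
    return np.linalg.matrix_rank(np.cov(x, rowvar=False))

rng = np.random.default_rng(5)
N, p = 20, 3                      # dim = N * p = 60, as in the synthetic experiments
dim = N * p
rank_small = sample_cov_rank(40, dim, rng)    # n < N p: rank is at most n - 1
rank_large = sample_cov_rank(100, dim, rng)   # n > N p: full rank
print(rank_small, rank_large)
```

A rank below N p means the covariance estimate cannot be inverted, which is why QDA requires min{n ^{0},n ^{1}}>N p while the OBC, which combines the data with prior-based hyperparameters, does not.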
References
 1
Braga-Neto, UM, & Dougherty, ER. (2015). Error Estimation for Pattern Recognition. New York: Wiley-IEEE Press.
 2
Kay, S. (1993). Fundamentals of Statistical Signal Processing: Estimation Theory. New Jersey: Prentice-Hall.
 3
Carlin, BP, & Louis, TA. (2008). Bayesian Methods for Data Analysis. Boca Raton: CRC Press.
 4
Dalton, L, & Dougherty, ER (2011). Bayesian minimum mean-square error estimation for classification error–part I: definition and the Bayesian MMSE error estimator for discrete classification. IEEE Transactions on Signal Processing, 59(1), 115–129.
 5
Dalton, L, & Dougherty, ER (2011). Bayesian minimum mean-square error estimation for classification error–part II: linear classification of Gaussian models. IEEE Transactions on Signal Processing, 59(1), 130–144.
 6
Dalton, L, & Dougherty, ER (2013). Optimal classifiers with minimum expected error within a Bayesian framework – part I: discrete and Gaussian models. Pattern Recognition, 46, 1301–1314.
 7
Dalton, L, & Dougherty, ER (2013). Optimal classifiers with minimum expected error within a Bayesian framework – part II: properties and performance analysis. Pattern Recognition, 46, 1288–1300.
 8
Knight, J, Ivanov, I, Dougherty, ER (2014). MCMC implementation of the optimal Bayesian classifier for non-Gaussian models: model-based RNA-Seq classification. BMC Bioinformatics, 15. doi:10.1186/s12859-014-0401-3.
 9
Esfahani, MS, & Dougherty, ER (2014). Incorporation of biological pathway knowledge in the construction of priors for optimal Bayesian classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11, 202–218.
 10
Esfahani, MS, & Dougherty, ER (2015). An optimization-based framework for the transformation of incomplete biological knowledge into a probabilistic structure and its application to the utilization of gene/protein signaling pathways in discrete phenotype classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics. doi:10.1109/TCBB.2015.2424407.
 11
Jaynes, ET (1968). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, 4, 227–241.
 12
Kloeden, PE, & Platen, E. (1995). Numerical Solution of Stochastic Differential Equations. New York: Springer.
 13
Arnold, L. (1974). Stochastic Differential Equations: Theory and Applications. New York: Wiley.
 14
Higham, D (2001). An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Review, 43, 525–546.
 15
Anderson, TW (1951). Classification by multivariate analysis. Psychometrika, 16, 31–50.
 16
Murphy, KP. (2012). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press.
 17
DeGroot, MH. (1970). Optimal Statistical Decisions. New York: McGraw-Hill.
 18
Esfahani, MS, & Dougherty, ER (2014). Effect of separate sampling on classification accuracy. Bioinformatics, 30, 242–250.
 19
Braga-Neto, UM, Zollanvari, A, Dougherty, ER (2014). Cross-validation under separate sampling: strong bias and how to correct it. Bioinformatics, 30, 3349–3355.
 20
Hansen, TF (1997). Stabilizing selection and the comparative analysis of adaptation. Evolution, 51, 1341–1351.
 21
Thompson, K, & Kubatko, LS (2013). Using ancestral information to detect and localize quantitative trait loci in genome-wide association studies. BMC Bioinformatics, 14. doi:10.1186/1471-2105-14-200.
 22
Zollanvari, A, & Dougherty, ER (2014). Moments and root-mean-square error of the Bayesian MMSE estimator of classification error in the Gaussian model. Pattern Recognition, 47, 2178–2192.
 23
Dalton, L, & Dougherty, ER (2014). Intrinsically optimal Bayesian robust filtering. IEEE Transactions on Signal Processing, 62(3), 657–670.
 24
Pugachev, VS. (1965). Theory of Random Functions and Its Applications to Control Problems. Oxford: Pergamon.
 25
Dougherty, ER. (1999). Random Processes for Image and Signal Processing. New York: SPIE Press and IEEE Presses.
 26
Higham, DJ (2015). An introduction to multilevel Monte Carlo for option valuation. International Journal of Computer Mathematics, 92(12).
 27
Duda, RO, Hart, PE, Stork, DG. (2000). Pattern Classification. New York: Wiley.
Additional information
Competing interests
The authors declare that they have no competing interests.
Additional file
Additional file 1
Supplementary information. I. Definition of QDA in a classical setting. II. Error estimation accuracy. III. Bayesian MMSE error estimator. IV. Review of literature pertaining to classification of stochastic processes.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Zollanvari, A., Dougherty, E.R. Incorporating prior knowledge induced from stochastic differential equations in the classification of stochastic observations. J Bioinform Sys Biology 2016, 2 (2016). https://doi.org/10.1186/s13637-016-0036-y
Keywords
 Classification
 Gaussian processes
 Stochastic differential equations
 Optimal Bayesian classifier