
Bellman residuals minimization using online support vector machines

Published in: Applied Intelligence 47, 670–704 (2017). doi:10.1007/s10489-017-0910-7

Abstract

In this paper we present and theoretically study an Approximate Policy Iteration (API) method, called \(API_{BRM_{\epsilon }}\), which uses a very effective implementation of incremental Support Vector Regression (SVR) to approximate the value function and is able to generalize over Reinforcement Learning (RL) problems with continuous (or large) state spaces. \(API_{BRM_{\epsilon }}\) is presented as a non-parametric regularization method, based on an outcome of Bellman Residual Minimization (BRM), able to minimize the variance of the problem. The proposed method can be cast as incremental and may be applied in the on-line agent-interaction framework of RL. Moreover, being based on SVR, which rests on convex optimization, it is able to find the global solution of the problem. \(API_{BRM_{\epsilon }}\) using SVR can be seen as a regularization problem with the 𝜖-insensitive loss; compared to the standard squared loss also used in regularization, this naturally builds a sparse solution for the approximation function. We extensively analyze the statistical properties of \(API_{BRM_{\epsilon }}\), deriving a bound which controls the performance loss of the algorithm under some assumptions on the kernel and assuming that the collected samples are non-i.i.d., following a β-mixing process. Experimental evidence and performance on well-known RL benchmarks are also presented.


References

  1. Farahmand Am, Szepesvári C (2012) Regularized least-squares regression: Learning from a β-mixing sequence. Journal of Statistical Planning and Inference 142(2):493–505. http://www.sciencedirect.com/science/article/pii/S0378375811003181

  2. Antos A, Szepesvári C, Munos R (2008) Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Mach Learn 71:89–129

  3. Baird L (1995) Residual algorithms: Reinforcement learning with function approximation. In: Proceedings of the 12th international conference on machine learning. Morgan Kaufmann, pp 30–37

  4. Bertsekas DP (2007) Dynamic programming and optimal control, vol II. Athena Scientific, Boston

  5. Bethke BM (2010) Kernel-based approximate dynamic programming using bellman residual elimination. Ph.D. thesis, Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, Cambridge MA. http://acl.mit.edu/papers/BethkePhD.pdf

  6. Busoniu L, Ernst D, De Schutter B, Babuska R (2010) Online least-squares policy iteration for reinforcement learning control. In: Proceedings of the American Control Conference (ACC-10), Baltimore, pp 486–491

  7. Busoniu L, Lazaric A, Ghavamzadeh M, Munos R, Babuska R, De Schutter B (2012) Least-squares methods for policy iteration. In: Wiering M, van Otterlo M (eds) Reinforcement learning. Adaptation, learning, and optimization, vol 12. Springer, Berlin Heidelberg, pp 75–109. doi:10.1007/978-3-642-27645-3_3

  8. Carrasco M, Chen X (2002) Mixing and moment properties of various GARCH and stochastic volatility models. Econ Theory 18:17–39

  9. Cherkassky V, Ma Y (2004) Comparison of loss functions for linear regression. In: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, 2004, vol 1, p 400. doi:10.1109/IJCNN.2004.1379938

  10. Christmann A, Steinwart I (2004) On robust properties of convex risk minimization methods for pattern recognition. J Mach Learn Res 5:1007–1034

  11. Cucker F, Smale S (2002) On the mathematical foundations of learning. Bull Am Math Soc 39:1–49

  12. Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J Mach Learn Res 6:503–556

  13. Davydov YA (1973) Mixing conditions for Markov chains. Teor Veroyatnost i Primenen 18:321–338

  14. Esposito G (2015) Regularized approximate policy iteration using kernel for on-line reinforcement learning. Ph.D. thesis, Universitat Politecnica de Catalunya

  15. Evgeniou T, Pontil M, Poggio T (1999) A unified framework for regularization networks and support vector machines. Tech. rep., MIT, Cambridge, MA, USA. http://www.ncstrl.org:8900/ncstrl/servlet/search?formname=detail&id=oai

  16. Farahmand Am (2011) Regularization in reinforcement learning. Ph.D. thesis, University of Alberta

  17. Farahmand Am, Munos R, Szepesvári C (2010) Error propagation for approximate policy and value iteration. In: Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A (eds) NIPS. Curran Associates, Inc, pp 568–576

  18. van de Geer S (2009) Empirical Processes in M-Estimation. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press. http://books.google.es/books?id=0VEcQAAACAAJ

  19. Györfi L, Kohler M, Krzyzak A, Walk H (2002) A distribution-free theory of nonparametric regression. Springer

  20. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366

  21. Jung T, Polani D (2006) Least squares SVM for least squares TD learning. In: Proceedings of the 17th European conference on artificial intelligence, pp 499–503

  22. Karandikar RL, Vidyasagar M (2004) Probably approximately correct learning with beta mixing input sequences

  23. Kohler M, Krzyzak A, Schäfer D (2000) Application of structural risk minimization to multivariate smoothing spline regression estimates

  24. Lagoudakis MG, Parr R (2003) Least-squares policy iteration. J Mach Learn Res 4:1107–1149. http://dblp.uni-trier.de/db/journals/jmlr/jmlr4.html#LagoudakisP03

  25. Lazaric A, Ghavamzadeh M, Munos R (2012) Finite-sample analysis of least-squares policy iteration. J Mach Learn Res 13:3041–3074

  26. Lee DH, Kim JJ, Lee JJ (2010) Online support vector regression based actor-critic method. In: IECON 2010 - 36th annual conference on IEEE industrial electronics society, pp 193–198

  27. Maillard OA, Munos R, Lazaric A, Ghavamzadeh M (2010) Finite-sample analysis of Bellman residual minimization. In: Sugiyama M, Yang Q (eds) Proceedings of ACML, JMLR.org, pp 299–314

  28. Martin M (2002) On-line support vector machine regression. In: Proceedings of the 13th European conference on machine learning, ECML '02. Springer-Verlag, London, UK, pp 282–294. http://dl.acm.org/citation.cfm?id=645329.650050

  29. Meir R, Hellerstein L (2000) Nonparametric time series prediction through adaptive model selection. Mach Learn 39:5–34

  30. Mohri M, Rostamizadeh A (2010) Stability bounds for stationary β-mixing and α-mixing processes. J Mach Learn Res 11:789–814. http://dl.acm.org/citation.cfm?id=1756006.1756032

  31. Moore AW, Atkeson CG (1995) The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Mach Learn 21(3):199–233

  32. Randlov J, Alstrom P (1998) Learning to drive a bicycle using reinforcement learning and shaping. In: Proceedings of the 15th international conference on machine learning, pp 463–471

  33. Schölkopf B, Herbrich R, Smola AJ (2001) A generalized representer theorem. In: Helmbold DP, Williamson B (eds) COLT/EuroCOLT, lecture notes in computer science, vol 2111. Springer, pp 416–426. http://dblp.uni-trier.de/db/conf/colt/colt2001.html#ScholkopfHS01

  34. van Seijen H, van Hasselt H, Whiteson S, Wiering M (2009) A theoretical and empirical analysis of expected Sarsa. In: ADPRL 2009: Proceedings of the IEEE symposium on adaptive dynamic programming and reinforcement learning, pp 177–184

  35. Steinwart I, Christmann A (2008) Sparsity of svms that use the epsilon-insensitive loss. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) NIPS. Curran Associates, Inc., pp 1569–1576. http://dblp.uni-trier.de/db/conf/nips/nips2008.html#SteinwartC08

  36. Steinwart I, Christmann A (2008) Support vector machines, 1st edn. Springer Publishing Company, Incorporated

  37. Steinwart I, Hush D, Scovel C (2009) Learning from dependent observations. J Multivar Anal 100(1):175–194. doi:10.1016/j.jmva.2008.04.001

  38. Taylor G, Parr R (2009) Kernelized value function approximation for reinforcement learning. In: Proceedings of the 26th annual international conference on machine learning, pp 1017–1024

  39. Wu Q, Ying Y, Zhou DX (2006) Learning rates of least-square regularized regression. Found Comput Math 6(2):171–192. doi:10.1007/s10208-004-0155-9

  40. Yu B (1994) Rates of convergence for empirical processes of stationary mixing sequences. Ann Probab 22(1):94–116. doi:10.1214/aop/1176988849

  41. Zhou DX, Smale S (2003) Estimating the approximation error in learning theory. Anal Appl 1(1):1–49. doi:10.1142/S0219530503000089

Acknowledgments

This work was partially supported by the FI-DGR programme of AGAUR ECO/1551/2012.

Author information

Correspondence to Gennaro Esposito.

Appendix

A.1 Reproducing kernel Hilbert spaces

A kernel function κ(⋅,⋅) plays an important role in any kernel-based learning method: it maps any two elements of the space of input patterns (here the state-action space of the cMDP) to a real number, and can be thought of as a similarity measure on the input space. In the derivation of kernel methods, the kernel function arises naturally as an inner product in a high-dimensional feature space. Hence, the kernel satisfies several important properties of an inner product: it must be symmetric and positive semidefinite, meaning that the associated Gram matrix \(K_{ij}=\kappa ((s_{i},a_{i}),(s_{j},a_{j}))\) must be positive semidefinite. A kernel that satisfies these properties is said to be admissible. The key idea of the kernel technique is to invert the chain of arguments, choosing a kernel rather than a mapping before applying a learning algorithm. Of course, not every symmetric function κ can serve as a kernel. One fundamental characterization comes with Mercer's kernels (from section 4.5 of [36])

Proposition 2

(Mercer’s Kernel) The function \(\kappa \: : \: (S\times A) \: \times \: (S\times A)\: \rightarrow \: \mathbb {R}\) is a Mercer kernel if and only if for each \(n\in \mathbb {N}\) and \(Z_{n}=\{(s_{1},a_{1}),\ldots ,(s_{n},a_{n})\}\), the n × n Gram matrix \(K_{ij}=\kappa ((s_{i},a_{i}),(s_{j},a_{j}))\) is positive semidefinite.
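To make the admissibility requirement concrete, the following is a minimal numpy sketch that builds a Gram matrix from state-action samples and checks Mercer's positive semidefiniteness condition numerically; the RBF kernel, the random sample data, and the tolerance are illustrative assumptions, not part of the original method.

```python
import numpy as np

def rbf_kernel(z1, z2, gamma=1.0):
    # Gaussian RBF kernel on stacked (state, action) feature vectors;
    # a standard example of an admissible Mercer kernel.
    return np.exp(-gamma * np.sum((z1 - z2) ** 2))

def gram_matrix(Z, kernel):
    # Gram matrix K_ij = kappa((s_i, a_i), (s_j, a_j)) over n input pairs.
    n = len(Z)
    return np.array([[kernel(Z[i], Z[j]) for j in range(n)] for i in range(n)])

def is_admissible(K, tol=1e-10):
    # Mercer's condition: the Gram matrix must be positive semidefinite,
    # i.e. its smallest eigenvalue is (numerically) non-negative.
    return np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))       # 20 samples of stacked (s, a) features
assert is_admissible(gram_matrix(Z, rbf_kernel))
```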

Given a kernel function, the smallest feature space which can serve as a canonical feature space is the RKHS, defined as (adapted from [11]):

Definition 6

(Reproducing Kernel Hilbert Spaces) Consider a subset of measurable functions \(\mathcal {F}\: : \: S\rightarrow \: \mathbb {R}\) and a subset of vector-valued measurable functions \(\mathcal {F}^{|A|}\: : \: S \times A \rightarrow \: \mathbb {R}^{|A|}\) such that

$$\begin{array}{@{}rcl@{}} \mathcal{F}^{|A|}=\{(Q_{1},...,Q_{|A|} )\: | \: Q_{i}\in \mathcal{F}, \: i=1,...,|A|\} \end{array} $$
(86)

called the hypothesis space \(\mathcal {H}\). A natural choice for \(\mathcal {H}\) is within the framework of RKHS \(\mathcal {H} \: : \: S\times A \: \rightarrow \: \mathbb {R}\), which is a Hilbert space defined on S × A with the inner product 〈⋅,⋅〉 and characterized by a symmetric positive definite function \(\kappa \: : \: (S\times A) \: \times \: (S\times A)\: \rightarrow \: \mathbb {R}\), called the reproducing kernel, continuous in S × A, such that for each (s, a)∈S × A the following reproducing property holds:

$$\begin{array}{@{}rcl@{}} \forall \: (s,a)\in S\times A \quad Q(s,a)=\langle Q(\cdot), \kappa(\cdot,s,a) \rangle_{\mathcal{H}} \quad \kappa(\cdot,s,a)\in \mathcal{H} \end{array} $$
(87)

assuming as measurable function space \(\mathcal {F}^{|A|}=\mathcal {H}\). \(\mathcal {H}\) is the closure of the linear span of the set of functions \({\Phi }_{span} = \{ {\Phi }(s,a) = \kappa (\cdot ,s,a) \: | \: (s,a)\in S\times A\}\), considering the map \({\Phi } \: : \: S\times A \rightarrow C^{0}(S\times A)\) which assigns to \((s_{t},a_{t})\in S\times A\) the function \((s,a) \mapsto \kappa (s_{t},a_{t},s,a)\), with \(C^{0}(S\times A)\) the space of continuous functions on S × A.

RKHS spaces have the important property that norm convergence implies point-wise convergence. The full power of RKHS can be expressed by the following Theorem ( adapted from [33]):

Theorem 4

(Representer Theorem) Let κ be a Mercer kernel and \(D_{n}\) a training set, let \(R_{emp}\: : \: Z\: \times \: \mathbb {R}^{n} \rightarrow \mathbb {R} \cup \{\infty \}\) be an arbitrary empirical risk function, and let the regularizer be a strictly monotonically increasing function \(\mathbb {R} \: \rightarrow [0,\infty )\) of the norm. Define \(\mathcal {H}_{K}\) as the RKHS induced by κ. Then any \(Q\: \in \mathcal {H}_{K}\) minimizing the regularized risk \(R_{reg}(Q)=R_{emp}(Q)+\lambda \| Q\|_{\mathcal {H}}^{2}\) admits a representation of the form \(Q(\cdot )={\sum }_{t=1}^{n} \beta _{t} \kappa (\cdot ,s_{t},a_{t})\)

Let \(\bar {\kappa }=\sup _{(s,a)\in S\times A} \sqrt {\kappa (s,a,s,a)}\); then the reproducing property tells us that \(\| Q \|_{\infty } \leq \bar {\kappa } \| Q \|_{\mathcal {H}} \quad \forall \: Q\in \mathcal {H}\), assuming as Hilbert norm the expression

$$\|Q(\cdot,a) \|^{2}_{\mathcal{H}}=\sum\limits_{i,j=1}^{n} \beta_{i} \beta_{j} \kappa(s_{i},a,s_{j},a)$$

Using the Representer Theorem, any function in \(\mathcal {H}_{K}\) can be expressed as a linear combination of the elements of the span \({\Phi }_{span}=\{{\Phi }(s,a)=\kappa (\cdot ,s,a) \: | \: (s,a)\in S\times A\}\), that is \(Q(s,a)={\sum }_{t} \beta _{t} \kappa (s,a,s_{t},a_{t})\). Hereafter we introduce regularized regression using RKHS as function spaces and the quadratic Hilbert norm \(\|Q\|^{2}_{\mathcal {H}} \) as regularizer term. In fact, since we want the regularizer to measure the complexity of the action-value function Q(s, a), more complex functions should have a larger regularizer, which means a larger norm. Moreover, we point out that when dealing with a finite action space one should define the norm of Q(⋅,a), a ∈ A, in a proper way. As a mild condition, the complexity of Q should upper bound the complexity of Q(⋅,a) for all a ∈ A, and in a RKHS this can be achieved by defining

$$\begin{array}{@{}rcl@{}} \|Q\|^{2}_{\mathcal{H}}=\frac{1}{|A|}\sum\limits_{a\in A} \|Q(\cdot,a)\|^{2}_{\mathcal{H}} \end{array} $$
(88)

with |A| the cardinality of A.
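As an illustration of how the representer form and the norm (88) are used in practice, here is a small Python sketch; the kernel signature, the per-action coefficient vectors, and the Gram matrices are hypothetical names introduced for the example only.

```python
import numpy as np

def q_value(s, a, data, beta, kernel):
    # Representer form: Q(s,a) = sum_t beta_t * kappa((s,a), (s_t,a_t)),
    # with data = [(s_1,a_1), ..., (s_n,a_n)] the collected samples.
    return sum(b * kernel((s, a), (s_t, a_t)) for b, (s_t, a_t) in zip(beta, data))

def q_norm_sq(betas_by_action, grams_by_action):
    # Eq. (88): ||Q||^2_H = (1/|A|) * sum_a ||Q(.,a)||^2_H, where each slice norm
    # is the quadratic form beta_a^T K_a beta_a with (K_a)_ij = kappa(s_i,a,s_j,a).
    return float(np.mean([beta @ K @ beta
                          for beta, K in zip(betas_by_action, grams_by_action)]))
```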

A useful concept related to the complexity of the function space is represented by the covering numbers (from definition 9.3 of [19]):

Definition 7

(Covering Number) Let 𝜖 > 0, \(\mathcal {H}\) be a set of real-valued functions defined on X, and ν x a probability measure on X. Every finite collection \(N_{\epsilon }=\{f_{1},...,f_{N_{\epsilon }}\}\) of functions defined on X with the property that for every \(f\in \mathcal {H}\) there is a function \(f^{\prime }\in N_{\epsilon }\) such that \(\|f-f^{\prime }\|^{q}_{p,\nu _{x}}\leq \epsilon \) is called an 𝜖-cover of \(\mathcal {H}\) w.r.t. \(\|\cdot \|^{q}_{p,\nu _{x}}\). Let \(\mathcal {N}_{p}(\epsilon , \mathcal {H},\|\cdot \|^{q}_{p,\nu _{x}})\) be the size of the smallest 𝜖-cover of \(\mathcal {H}\) w.r.t. \(\|\cdot \|^{q}_{p,\nu _{x}}\); if no finite 𝜖-cover exists, set \(\mathcal {N}_{p}(\epsilon , \mathcal {H},\|\cdot \|^{q}_{p,\nu _{x}})=\infty \). Then \(\mathcal {N}_{p}(\epsilon , \mathcal {H},\|\cdot \|^{q}_{p,\nu _{x}})\) is called the 𝜖-covering number of \(\mathcal {H}\) and \(\log \: \mathcal {N}_{p}(\epsilon , \mathcal {H},\|\cdot \|^{q}_{p,\nu _{x}}) \) is called the metric entropy of \(\mathcal {H}\). Considering the empirical norm \(\|\cdot \|^{q}_{p,n}\) based on the sequence of random variables \(X_{n} = \{X_{1},...,X_{n}\}\), we may define the empirical covering number as \(\mathcal {N}_{p}(\epsilon , \mathcal {H},\|\cdot \|^{q}_{p,n})\)
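Covering numbers are rarely available in closed form, but an empirical cover can be estimated greedily. The sketch below is a hedged illustration rather than part of the original analysis: it upper-bounds the empirical covering number of a finite family of functions represented by their values on the sample \(X_{1},...,X_{n}\).

```python
import numpy as np

def empirical_cover_size(F_vals, eps, p=2):
    # F_vals: (m, n) array whose k-th row is f_k evaluated on X_1..X_n.
    # Greedily builds an eps-cover w.r.t. the empirical norm
    # ||f||_{p,n} = ((1/n) * sum_i |f(X_i)|^p)^(1/p); the size of a greedy
    # cover upper-bounds the empirical covering number N_p(eps, F, ||.||_{p,n}).
    remaining = list(range(len(F_vals)))
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        dist = (np.abs(F_vals[remaining] - F_vals[c]) ** p).mean(axis=1) ** (1.0 / p)
        remaining = [k for k, d in zip(remaining, dist) if d > eps]
    return len(centers)
```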

Given an admissible kernel κ and the RKHS \(\mathcal {H}\), for each R > 0 let \(\mathcal {H}_{R}=\{ \:f\in \mathcal {H} \: :\: \|f\|_{\mathcal {H}}\leq R \}\); then there exist a constant C > 0 and 0 < α < 1 such that for any (R, 𝜖) the following metric entropy condition is satisfied (from section 6.3 of [36])

$$\begin{array}{@{}rcl@{}} \log \: \mathcal{N}_{p}(\epsilon, \mathcal{H}_{R},\|f\|^{q}_{p,n}) \leq C\left(\frac{R}{\epsilon}\right)^{2\alpha} \end{array} $$
(89)

Some of the presented results involve expectations of suprema over uncountable sets; it is therefore useful to introduce the following notion (from section 7.3 of [36]):

Definition 8

(Caratheodory Set) Let (T, d) be a metric space and (Z, σ Z ) a measurable space. A family of measurable maps \(\{f_{t}\}_{t\in T}\) is called a Caratheodory family if \(t\mapsto f_{t}(z)\) is continuous for all z ∈ Z. Moreover, if T is separable or complete, we say that \(\{f_{t}\}_{t\in T}\) is separable or complete, respectively. A set \(\mathcal {F}\) of measurable functions on Z (separable or complete) is a Caratheodory set if there exists a (separable or complete) metric space (T, d) and a Caratheodory family \(\{f_{t}\}_{t\in T}\) such that \(\mathcal {F} = \{f_{t} \: : t\in T\}\). Note that, by the continuity of \(t\mapsto f_{t}(z)\), Caratheodory sets satisfy

$$\begin{array}{@{}rcl@{}} \sup\limits_{f\in \mathcal{F}} \: f(z)=\sup\limits_{t\in T} \: f_{t}(z)=\sup\limits_{t\in T^{\prime}} \: f_{t}(z), \quad z\in Z \end{array} $$
(90)

for all dense \(T^{\prime }\subset T\). For separable Caratheodory sets \(\mathcal {F}\), there exists a countable and dense \(T^{\prime }\subset T\), and hence the map \(z\mapsto \sup _{t\in T} f_{t}(z)\) is measurable for such \(\mathcal {F}\). Also, for a complete Caratheodory set \(\mathcal {F}\) the map \((z,t) \mapsto f_{t}(z)\) is measurable.

A.2 Regularized non-parametric regression

Estimation of a real-valued function from a finite set of noisy samples is an important problem in statistical learning. Let \(X\subset \mathbb {R}^{d} \) be a measurable space, with \(d\in \mathbb {N}\), and \(Y \subseteq \mathbb {R}\) a Polish space (separable completely metrizable topological space). In non-parametric regression the goal is to estimate a functional relationship between an input random variable X and an output random variable Y under the assumption that the joint distribution P(X, Y) is (almost) completely unknown. To solve this problem, one typically assumes a set of observations from i.i.d. random variables \((X_{i}, Y_{i})\), i = 1,...,n, all having distribution P(X, Y) with the corresponding Borel σ-algebra, and a finite sequence of gathered samples \(Z_{n} = \{(X_{1}, Y_{1}),...,(X_{n}, Y_{n})\}\).

The learning goal is to build a predictor \(f\: : \:X\: \rightarrow \mathbb {R}\) on the basis of these observations such that f(x) is a good approximation of y. To formalize this aim, one assumes a continuous loss function \(\ell \: :\: \mathbb {R} \times \mathbb {R}\: \rightarrow \:[0,\infty )\) assessing the quality of a prediction f(x) for an observed output y. It is commonly assumed that the smaller ℓ(y, f(x)) is, the better the prediction. The quality of a predictor f is then measured by the expected risk

$$\begin{array}{@{}rcl@{}} R_{\ell,P}(f)=\mathbb{E}[\ell(y,f(x))]={\int}_{X,Y} \ell(y,f(x)) d\nu_{x} (x) dP(y| x) \end{array} $$
(91)

which is defined as the average loss obtained by predicting with f. Following the interpretation that a small loss is desired, one tries to find a predictor with expected risk close to the optimal risk

$$\begin{array}{@{}rcl@{}} R^{*}_{\ell,P}=\inf\{\: R_{\ell,P}(f) \: | \: f\: : \: X \: \rightarrow \: \mathbb{R}\}. \end{array} $$
(92)

Moreover, one may also define the inner risk as

$$\begin{array}{@{}rcl@{}} \mathcal{C}_{\ell,P}(f(x))={\int}_{Y} \ell(y,f(x)) dP(y |x) \end{array} $$
(93)

and we are interested in loss functions having the property:

$$\begin{array}{@{}rcl@{}} R_{\ell,P}(f)-R_{\ell,P}^{*}={\int}_{X} (\mathcal{C}_{\ell,P}(f(x))-\mathcal{C}_{\ell,P}^{*}) d\nu_{x}(x) \end{array} $$
(94)

Consequently one can analyze \(R_{\ell ,P}(f)-R_{\ell ,P}^{*}\) by looking at the inner risk excess \(\mathcal {C}_{\ell ,P}(f(x))-\mathcal {C}_{\ell ,P}^{*}\), investigating its measure with respect to ν x . In regression problems the loss function is typically distance-based, meaning that if we set t = f(x) the loss depends on the residual η = y − t through a mapping η ↦ ℓ(η) which is convex \(\forall \: \eta \in \mathbb {R}\), implying that ℓ is a Lipschitz continuous function with the following properties

$$\begin{array}{@{}rcl@{}} &&\forall \: N>0 \quad \exists \: L_{N} \: s.t. \: |\ell(y,t_{1})-\ell(y,t_{2})| \: \leq \: L_{N} |t_{1}-t_{2} | \\ && \forall \: t_{1},t_{2} \in [-N,N]\;, \forall \: y\in [-M,M] \\ && \exists \: C_{0} \: s.t. \: \forall \: y\in [-M,M] \quad \ell(y,0)\leq C_{0} \end{array} $$
(95)

with \((L_{N}, C_{0})\) depending on the specific form of the loss function. Using the inequality \(||a|^{p}-|b|^{p}| \leq p\:|a-b|\: |\max (a,b)|^{p-1}\) we introduce the following losses:

$$\begin{array}{@{}rcl@{}} && \text{squared:} \quad\quad \ell_{2}(y,t)=|y-t|^{2} \quad [ \:L_{N}=2N+M, \; C_{0}=M^{2} \:] \\ && \text{absolute value:} \quad\quad \ell_{1}(y,t)=|y-t| \quad [ \: L_{N}=1, \; C_{0}=M \:] \\ && \epsilon\text{-insensitive:} \quad\quad \ell_{\epsilon}(y,t)=\max(0,|y-t|-\epsilon) \quad [ \: L_{N}=1, \; C_{0}=M \:] \end{array} $$
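The three losses translate directly into code; a short sketch (the function names are ours):

```python
import numpy as np

def l2(y, t):
    # squared loss: l2(y,t) = |y - t|^2
    return (y - t) ** 2

def l1(y, t):
    # absolute-value loss: l1(y,t) = |y - t|
    return np.abs(y - t)

def l_eps(y, t, eps=0.1):
    # eps-insensitive loss: l_eps(y,t) = max(0, |y - t| - eps); residuals inside
    # the eps-tube cost nothing, which is the source of sparsity in SVR.
    return np.maximum(0.0, np.abs(y - t) - eps)
```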

In the supervised setting, when we put t = f(x), with a slight abuse of notation we may write the loss as a function of f(x) alone, ℓ(f(x)) = ℓ(y, f(x)). The classical loss function for regression is the squared loss \(\ell _{2}(y,t)\), mainly because it simplifies the mathematical treatment. Besides, it is well-known that the expected risk using the squared loss is minimized by the conditional mean of Y given x (see [19] for details), i.e.

$$\begin{array}{@{}rcl@{}} f_{\ell_{2},P}^{*}(x)=\arg \min\limits_{f\: : X \: \rightarrow \: \mathbb{R}} \: R_{\ell_{2},P}(f)={\int}_{Y} ydP(y|x)=\mathbb{E}[Y|X=x] \end{array} $$
(96)

Though this loss is mathematically rather easy to handle and the corresponding learning algorithms are often computationally feasible, it is well-known that minimizing an empirical risk based on the squared loss is quite sensitive to outliers. Moreover, as the number of samples increases, the complexity of the algorithm increases as well, and with large data sets it may become unfeasible. Therefore, a number of more robust surrogate loss functions have been proposed and, from a practical point of view, there are situations in which a different kind of loss is more appropriate.

Empirical results show that the 𝜖-insensitive loss function has algorithmic advantages in terms of sparseness compared to the absolute and squared losses when used in kernel-based regression. Robustness and prediction accuracy of the 𝜖-insensitive loss with respect to the squared loss are especially evident for high-dimensional sparse data sets. However, for large data samples the squared loss might provide better prediction accuracy, at least in the linear regression case, while the situation could be different using non-linear kernels. Moreover, results obtained in the asymptotic setting, when the number of training samples is large, might not hold for finite sample settings (see [9] for an empirical evaluation of this issue). On the other hand, the 𝜖-insensitive loss is more computationally efficient than the squared loss, resulting in sparse solutions with the 𝜖 parameter controlling the number of support vectors, the model complexity, and generalization.

Theoretical asymptotic upper bounds on the number of support vectors have been investigated for regularized regression using the 𝜖-insensitive loss function in [35], and can be used to set a trade-off between sparsity and estimation accuracy. In particular, if the conditional distribution P(⋅|x) of Y given x is known to be symmetric, it can be shown (see section 9.5 of [36] and [35] for details) that using the 𝜖-insensitive loss the only minimizer of the expected risk \( R_{\ell _{\epsilon },P}(f)\) is the conditional median of Y given x

$$f_{\ell_{\epsilon},P}^{*}(x)=\arg \min\limits_{f\: : X \: \rightarrow \: \mathbb{R}} \: R_{\ell_{\epsilon},P}(f)=f_{1/2,P}^{*}(x)$$

where \(f_{1/2,P}^{*}(x)\) is the unique median of the distribution P(⋅|x) defined as

$$\begin{array}{@{}rcl@{}} f_{1/2,P}^{*}(x)=\{\: t\in \mathbb{R} \: | \: \quad P((-\infty,t]| x)\geq 1/2 \quad and\\ \quad P((t,\infty] |x)\geq 1/2 \} \end{array} $$
(97)
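A quick numeric illustration of the contrast with (96): under a hedged, purely illustrative setup (a skewed distribution chosen only so that mean and median differ, whereas the result above is stated for symmetric P(⋅|x)), minimizing the empirical squared risk over constant predictors recovers the sample mean, while minimizing the empirical 𝜖-insensitive risk with small 𝜖 lands near the sample median.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=1.0, size=2000)   # skewed: mean ~ 1.0, median ~ 0.69
t_grid = np.linspace(0.0, 5.0, 501)

# Empirical risk of the constant predictor t under each loss.
risk_sq = ((y[:, None] - t_grid[None, :]) ** 2).mean(axis=0)
risk_eps = np.maximum(0.0, np.abs(y[:, None] - t_grid[None, :]) - 0.05).mean(axis=0)

print("squared minimizer  :", t_grid[risk_sq.argmin()], "  mean  :", y.mean())
print("eps-insensitive min:", t_grid[risk_eps.argmin()], "  median:", np.median(y))
```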

Empirical methods estimating the conditional median using the 𝜖-insensitive loss may obtain regularization functions \(f_{D}\) for which \(R_{\ell _{\epsilon },P}(f_{D})\) is close to \(R_{\ell _{\epsilon },P}^{*}\) with high probability. In general, this only implies that \(f_{D}\) is close to \(f^{*}_{\epsilon ,P}\) in a weak sense, meaning that we may obtain \(f_{D_{n}} \: \rightarrow \: f^{*}_{\epsilon ,P}\) in probability for all sequences \(\{f_{D_{n}}\}\) with \(R_{\ell _{\epsilon },P} (f_{D_{n}}) \: \rightarrow \: R_{\ell _{\epsilon },P}^{*}\). As a result, the approximate risk minimizer is close to the unique risk minimizer in probability. Nevertheless, for functions \(f_{D_{n}}\) satisfying the condition \(R_{\ell _{\epsilon },P} (f_{D_{n}}) - R_{\ell _{\epsilon },P}^{*}\leq \epsilon ^{2}/2\), a self-calibration inequality holds

$$\begin{array}{@{}rcl@{}} \| f_{D_{n}} - f^{*}_{\epsilon,P} \|_{1,\nu} \leq \sqrt{c_{p} [R_{\ell_{\epsilon},P}(f_{D}) - R_{\ell_{\epsilon},P}^{*}]} \end{array} $$
(98)

which, using condition (8), can also be written as

$$\begin{array}{@{}rcl@{}} \| f_{D_{n}} - f^{*}_{\epsilon,P} \|_{2,\nu} \leq \| f_{D_{n}} - f^{*}_{\epsilon,P} \|^{2}_{1,\nu} \leq c_{p} [R_{\ell_{\epsilon},P}(f_{D}) - R_{\ell_{\epsilon},P}^{*}] \end{array} $$
(99)

The factor \(c_{P}\) depends on 𝜖 and on the symmetric distribution P(⋅|x) (see [35] for details). Conditions (98) and (99) can be used to control the error of \(f_{D_{n}}\) in a wider sense, meaning that \(f_{D_{n}} \: \rightarrow \: f^{*}_{\epsilon ,P}\) in probability as n → ∞ and thus \(f_{D_{n}}\) approximates the conditional median \(f^{*}_{\epsilon ,P}\). The approximation worsens the smaller \(c_{P}\) becomes (i.e. increasing the parameter 𝜖). As a result, sparsity of the decision function is paid for through a less accurate estimation of the conditional median. Assuming moderate values of 𝜖 leads to reasonable estimates of the conditional median and relatively sparse decision functions.

Since the distribution P generating the input/output pairs is unknown, the risk \(R_{\ell ,P}\) is unknown as well and consequently we cannot directly find f. To resolve this problem, it is tempting to replace the risk \(R_{\ell ,P}(f)\) by its empirical counterpart, the empirical risk defined as

$$\begin{array}{@{}rcl@{}} R_{\ell,D}(f)=\mathbb{E}_{n}[\ell(y,f(x))]=\frac{1}{n}\sum\limits_{t=1}^{n}\ell(y_{t},f(x_{t})) \end{array} $$
(100)

Unfortunately, even though the law of large numbers shows that \(R_{\ell ,D}\) is an approximation of \(R_{\ell ,P}\) for each single f, solving \(\inf _{f\: : X \: \rightarrow \: \mathbb {R}} {\: R_{\ell ,D}(f)}\) does not in general lead to an approximate minimizer of \(R_{\ell ,P}(\cdot )\). This is an example of overfitting, in which the learning method produces a function that models too closely the output values in \(Z_{n}\), leading to poor performance on future data. One common way to avoid overfitting is to choose a small set of functions \(f\in \mathcal {H}\) that is assumed to contain a reasonably good approximation of the solution. Then, instead of minimizing \(R_{\ell ,D}\) over all functions, one minimizes only over \(\mathcal {H}\), solving \(\inf _{f\in \mathcal {H}} {\: R_{\ell ,D}(f)}\).

In regression problems, the main issues to be addressed are model selection and the choice of the estimation parameters. Building a good non-parametric predictor f can be achieved using kernel-based regression, which finds a minimizer \(f_{P,\lambda }\) of the regularized risk

$$\begin{array}{@{}rcl@{}} {f}_{P,\lambda}= \arg \min\limits_{f\in \mathcal{H}} \{\: R_{\ell,P}(f)+\lambda \|f\|^{2}_{\mathcal{H}} \} \end{array} $$
(101)

where λ > 0 is a regularization parameter to reduce the danger of overfitting, \(\mathcal {H}\) is the RKHS of a kernel \(\kappa \: : \: X \times X \: \rightarrow \: \mathbb {R}\), and \(\ell (y,\cdot )\: :\: \mathbb {R}\: \rightarrow \:[0,\infty )\) is convex for all y ∈ Y. Because problem (101) is strictly convex in f, the minimizer \(f_{P,\lambda }\) is uniquely determined and a simple gradient descent algorithm can be used to find it. However, for specific losses such as the 𝜖-insensitive one, more efficient algorithmic approaches are used in practice. Using the sequence of collected samples \(Z_{n}\), the corresponding empirical problem can be formulated as

$$\begin{array}{@{}rcl@{}} {f}_{D,\lambda_{n}}= \arg \min\limits_{f\in \mathcal{H}} \{\: R_{\ell,D}(f)+\lambda_{n} \|f\|^{2}_{\mathcal{H}} \} \end{array} $$
(102)
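For the squared loss, problem (102) has a closed-form solution via the Representer Theorem; below is a minimal numpy sketch of this special case (the RBF kernel and the variable names are illustrative assumptions).

```python
import numpy as np

def fit_krr(X, y, lam, gamma=1.0):
    # Minimizer of (102) for the squared loss: by the Representer Theorem
    # f = sum_t beta_t kappa(., x_t), and setting the gradient of
    # (1/n)||K beta - y||^2 + lam * beta^T K beta to zero gives
    # beta = (K + lam * n * I)^{-1} y.
    n = len(X)
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dist)                   # RBF Gram matrix
    beta = np.linalg.solve(K + lam * n * np.eye(n), y)
    return beta

def predict_krr(X_train, beta, x, gamma=1.0):
    # Evaluate f(x) = sum_t beta_t kappa(x, x_t).
    k = np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))
    return k @ beta
```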

Following the interpretation that a small risk is desired, one tries to find a predictor whose risk is close to the optimal risk \(R_{\ell ,P}^{*}\), which is a much stronger requirement than convergence in probability of \(R_{\ell ,P}(f)\) to \(R_{\ell ,P,\mathcal {H}}=\inf _{f\in \mathcal {H}} \: R_{\ell ,P}(f)\), since it is not obvious whether \(R_{\ell ,P,\mathcal {H}}=R_{\ell ,P}^{*}\) or not, even for large Hilbert spaces \(\mathcal {H}\). Throughout this work we assume that for some M ≥ 0 the distribution P(⋅|x) is almost everywhere supported on [−M, M], that is |y| ≤ M almost surely with respect to P, which means that for the loss functions we are taking into account we always have \(|f^{*}_{\ell ,P}|\leq M\) (truncation assumption). Efficiency of the optimization problem (102) can be measured by the difference between \({f}_{D,\lambda _{n}}\) and the regression function \(f^{*}_{\ell ,P}\). Using the squared loss function the approximation error can be expressed as (see [19] for details)

$$\begin{array}{@{}rcl@{}} &&\|{f}_{D,\lambda_{n}}-{f}^{*}_{\ell_{2},P} \|_{2,\nu_{x}}\\ &&=\sqrt{\mathbb{E}[\ell_{2}(y-f_{D,\lambda_{n}}(x))\: |\: Z_{n}]-\mathbb{E}[\ell_{2}(y-f^{*}_{\ell_{2},P}(x))]} \end{array} $$
(103)

holding for any distribution P.

Using the 𝜖-insensitive loss function, a lower bound on the approximation error can be found using the Lipschitz continuity property (95) as (see [36] for details)

$$\begin{array}{@{}rcl@{}} |\:\mathbb{E}[\ell_{\epsilon}(y,f_{D,\lambda_{n}}(x))|\: Z_{n}]-\mathbb{E}[\ell_{\epsilon}(y,f^{*}_{\ell_{\epsilon},P}(x))] \:| \leq \|{f}_{D,\lambda_{n}}-{f}^{*}_{\ell_{\epsilon},P} \|_{1,\nu_{x}} \end{array} $$
(104)

An upper bound on the approximation error can also be identified for functions \(f_{D,\lambda _{n}}\) satisfying the condition \(R_{\ell _{\epsilon },P} (f_{D,\lambda _{n}}) - R_{\ell _{\epsilon },P}^{*}\leq \epsilon ^{2}/2\) and a symmetric distribution P(⋅|x), where conditions (98) and (99) hold

$$\begin{array}{@{}rcl@{}} \|{f}_{D,\lambda_{n}}&-&{f}^{*}_{\ell_{\epsilon},P} \|_{2,\nu_{x}} \leq c_{P}\left[\:\mathbb{E}[\ell_{\epsilon}(y,f_{D,\lambda_{n}}(x))|\: Z_{n}]\right.\\&-&\left.\mathbb{E}[\ell_{\epsilon}(y,f^{*}_{\ell_{\epsilon},P}(x))] \:\right] \end{array} $$
(105)

with c P depending on the parameter 𝜖 and the distribution P(⋅|x).

In general, the estimation of the error for a given loss function depends on P and \(\mathcal {H}\). We expect the minimizer of the regularized empirical error \({f}_{D,\lambda _{n}}\) to be a good approximation of the minimizer \(f^{*}_{\ell ,P}\) of \(R_{\ell ,P}(f)\) as n → ∞ and λ = λ(n) → 0, which is actually true if \(f^{*}_{\ell ,P}\) can be approximated using functions from \(\mathcal {H}\), as measured by the regularization error defined as

$$\begin{array}{@{}rcl@{}} \mathcal{A}(\lambda)= \inf_{f\in \mathcal{H}} \{\: \|f-{f}^{*}_{\ell,P} \|^{q}_{p,\nu_{x}}+\lambda \|f \|^{2}_{\mathcal{H}}\} \end{array} $$
(106)

where p = q = 2 for the \(\ell _{2}\) loss, and q = 1 and p = 2 for the \(\ell _{1}\) or \(\ell _{\epsilon }\) losses. Now the regularization function can be written as

$$\begin{array}{@{}rcl@{}} f_{\lambda}= \arg \min\limits_{f\in \mathcal{H}} \{\: \|f-{f}^{*}_{\ell,P} \|^{q}_{p,\nu_{x}}+\lambda \|f \|^{2}_{\mathcal{H}}\} \end{array} $$
(107)

Since the minimization in problem (102) relies on the discrete quantity \(R_{\ell ,D}(f)\), the approximation of \(f^{*}_{\ell ,P}\) by \({f}_{D,\lambda _{n}}\) involves the capacity of the function space \(\mathcal {H}\), which can be measured using the covering numbers. If we define the loss space as \(\mathcal {L}_{\mathcal {H}}=\{\: \ell (y,f(x)) \quad x\in X, \: y\in Y, \: f\in \mathcal {H} \}\), the covering numbers of \(\mathcal {L}_{\mathcal {H}}\) and \(\mathcal {H}\) can be easily related using the Lipschitz condition \(|\ell (y,f(x_{1}))-\ell (y,f(x_{2}))| \leq L_{M} |f(x_{1})-f(x_{2})|\) and it can be shown that the following condition holds (see [29]):

$$\mathcal{N}_{p}(\epsilon, \mathcal{L}_{\mathcal{H}}(Z^{n}),\|\cdot\|^{q}_{p,n})\leq \mathcal{N}_{p}(\epsilon/L_{M}, \mathcal{H}(X^{n}),\|\cdot\|^{q}_{p,n})$$

where for the \(\ell _{\epsilon }\) and \(\ell _{1}\) losses \(L_{M}=1\).

We may finally present the following proposition, extended from [39] to take into account both the \(\ell _{2}\) and \(\ell _{\epsilon }\) losses, giving an estimation of the excess error for the regularization function \({f}_{D,\lambda _{n}}\) by decomposition

Proposition 3 (Extended from [39])

Let \({f}_{\lambda }\in \mathcal {H}\) and \({f}_{D,\lambda _{n}}\) defined as in (102) and define the regularization error as

$$\mathcal{D}(\lambda)= c_{P}[R_{\ell,P}(f_{\lambda})+\lambda \|{f}_{\lambda}\|^{2}_{\mathcal{H}}-R_{\ell,P}^{*}]$$

and the estimation error as

$$\mathcal{E}_{D,\mathcal{H}}=c_{P}[R_{\ell,P}({f}_{D,\lambda_{n}})-R_{\ell,D}({f}_{D,\lambda_{n}})+R_{\ell,D}({f}_{\lambda})-R_{\ell,P}({f}_{\lambda})]$$

(\(c_{P} \geq 1\) for \(\ell _{\epsilon }\) losses and \(c_{P} = 1\) for \(\ell _{2}\)). Then it results that

$$c_{P}[R_{\ell,P}({f}_{D,\lambda_{n}})-R_{\ell,P}^{*}]\leq c_{P}[R_{\ell,P}(f_{D,\lambda_{n}})+\lambda \|{f}_{D,\lambda_{n}}\|^{2}_{\mathcal{H}}-R_{\ell,P}^{*}]$$

which can be bounded by \(\mathcal {E}_{D,\mathcal {H}}+\mathcal {D}(\lambda )\) if condition (105) holds. Considering the two variables

$$\zeta_{1}(x,y)=\ell(y,{f}_{D,\lambda_{n}}(x))-\ell(y,{f}^{*}_{\ell,P}(x))$$

and

$$\zeta_{2}(x,y)=\ell(y,{f}_{\lambda}(x))-\ell(y,{f}^{*}_{\ell,P}(x))$$

the sample error can be written as

$$\mathcal{E}_{D,\mathcal{H}}=c_{P} [\mathbb{E}[\zeta_{1}]-\mathbb{E}_{n}[\zeta_{1}]]+c_{P}[\mathbb{E}_{n}[\zeta_{2}]-\mathbb{E}[\zeta_{2}]]$$

The proof is omitted for brevity, but can be easily obtained for \(\ell _{\epsilon }\) following the same method presented in [39] for the \(\ell _{2}\) loss.

The rate of the regularization error \(\mathcal {D}(\lambda )\) is important for bounding both the estimation and the regularization error. The decay of λ as n → ∞ determines the size of the hypothesis space and hence the sample error estimate. If the RKHS is large enough, we may consider that the approximation error is \(\mathcal {D}(\lambda )=0\) as λ → 0 and n → ∞, as shown in [19] for the squared loss and bounded X and Y spaces, so one has to bound only the estimation error \(\mathcal {E}_{D,\mathcal {H}}\).

A.3 Regularization and support vector regression

Theoretical results presented in [10] show that kernel regression methods using a loss function with bounded first derivative, in combination with a bounded and rich enough continuous kernel (like the RBF kernel), are not only consistent and computationally tractable, but also offer attractive robustness properties. SVR shows good performance in practical applications and has a robust theoretical justification in terms of universal consistency and learning rates when training samples come from an i.i.d. process (see [37]). Sometimes the i.i.d. assumption might not be strictly justified in real-world problems: in some statistical learning applications (like system diagnosis, market prediction, speech recognition) the processes involved are not strictly i.i.d. Moreover, samples are often gathered from different sources, which poses the problem that they might not be identically distributed. In such non-i.i.d. scenarios SVR has no theoretical justification, but it is sometimes applied successfully. Nevertheless, for any process that satisfies the law of large numbers, a sequence of regularization parameters exists such that the corresponding SVR can be considered at least consistent. For universal consistency over stationary processes there are general negative results on this kind of sequence: the regularization parameters cannot be chosen adaptively relying on the stochastic properties of the process. However, if the process satisfies certain mixing properties, an adequate regularization sequence can be identified.

Whenever 𝜖-insensitive loss functions are used in SVR, the geometrical formulation can be obtained by looking at the regularization problem as

$$\begin{array}{@{}rcl@{}} \frac{1}{n}\sum\limits_{t=1}^{n}\ell_{\epsilon}(y_{t},f(x_{t}))+\lambda \| f\|^{2}_{\mathcal{H}} \end{array} $$
(108)

which is equivalent (meaning that the same function minimizes both functionals) to solving the following optimization problem, where an additional set of parameters is introduced (see [15] for details)

$$\begin{array}{@{}rcl@{}} \min\limits_{\mathbf{w},b,\mathbf{\xi},\mathbf{\xi^{*}}} &\quad C \sum\limits_{t=1}^{n}(\xi_{t}+\xi_{t}^{*}) +\frac{1}{2}\|\mathbf{w}\|^{2}_{\mathcal{H}_{0}} \\ s.t. &\quad f_{\mathbf{w},b}(x_{t})-y_{t} \leq \epsilon +\xi_{t} \\ &\quad y_{t}-f_{\mathbf{w},b}(x_{t}) \leq \epsilon +\xi_{t}^{*} \\ &\quad \xi_{t},\xi_{t}^{*}\geq0 \quad \forall \: (x_{t},y_{t}) \in Z_{n} \end{array} $$
(109)

The model function is expressed by \(f_{\mathbf {w},b}(x_{t})=\langle {\Phi }(x_{t}),\mathbf {w}\rangle + b\), with \(\lambda =\frac {1}{2nC}\), and ξ, ξ∗ are slack variables. The optimization problem can be solved through the technique of Lagrange multipliers. The main difference between the regularized and the geometric formulation of SVR lies in the meaning of the constant b (see section 1.3 of [36] for details). The geometrical approach considers a Hilbert space \(\mathcal {H}_{0}\), defining a function \(f_{\mathbf {w},b}\) in terms of an affine hyperplane specified by (w, b). The regularized formulation of SVR directly considers a RKHS \(\mathcal {H}\) with the functions contained in it. The two approaches are equivalent whenever we fix b and the functions 〈Φ(⋅),w〉 with \(\mathbf {w} \in \mathcal {H}_{0}\) form a RKHS \(\mathcal {H}\) whose norm can be computed by

$$\begin{array}{@{}rcl@{}} \| f \|_{\mathcal{H}}=\inf \{\: \| \mathbf{w}\|_{\mathcal{H}_{0}} \: | \: \mathbf{w} \in \mathcal{H}_{0} \quad with \quad f=\langle {\Phi}(\cdot),\mathbf{w}\rangle\} \end{array} $$
(110)

The offset term makes a real difference and, in general, the decision functions produced by both approaches might be different.

As final remarks about SVR, its algorithms are known to rest on solid theoretical guarantees. The returned solutions are sparse, and the natural use of positive definite symmetric kernels extends the algorithms to non-linear regression. SVR also admits favorable stability properties. A possible drawback of SVR is that it requires the selection of the two parameters C and 𝜖. These can be selected empirically via cross-validation, but this requires a relatively large validation set; some heuristics are often used to guide the search for the regression parameters, which in general requires model selection techniques.
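As an illustration of this parameter selection, here is a short, hedged sketch using scikit-learn's SVR (one standard implementation of problem (109)), tuning C and 𝜖 by cross-validation on synthetic data; the data, grids, and kernel width are arbitrary choices for the example.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# Grid-search the two SVR parameters C and epsilon by cross-validation;
# lambda = 1 / (2 n C) links C back to the regularized formulation (108).
search = GridSearchCV(
    SVR(kernel="rbf", gamma=1.0),
    param_grid={"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 0.5]},
    cv=5,
)
search.fit(X, y)

# The eps-insensitive loss yields sparse solutions: only points outside
# the epsilon-tube become support vectors.
print(search.best_params_,
      "support vectors:", len(search.best_estimator_.support_))
```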

A.4 Mixing processes

A sequence of random variables is called a time series in the statistics literature and a (discrete time) stochastic process in the probability literature. Let X be a measurable space and \(Y \subseteq \mathbb {R}\) a closed space. Furthermore, let (Ω, σ Ω, ν) be a probability space with \(\mathcal {Z}=(Z)_{i\geq 1}\) indicating a stochastic process such that \(Z_{i} \: : \: {\Omega } \rightarrow X \times Y\) for all i ≥ 1. For n ≥ 1 we write \(Z_{n} = \{(X_{1}, Y_{1}),...,(X_{n}, Y_{n})\}=\{Z_{1},...,Z_{n}\}\) for a training set of length n, distributed according to the first n components of \(\mathcal {Z}\). Moreover, assume \(\mathcal {Z}\) to be a stationary process, meaning that the (X × Y)n-valued random variables \((Z_{i_{1}},...,Z_{i_{n}})\) and \((Z_{i_{1}+i},...,Z_{i_{n}+i})\) have the same distribution for all n, i, i 1,...,i n ≥ 1. We further write P for the distribution of the Z i , with \(P(A) = \nu (\{\omega \in {\Omega } \: : \: Z_{i}(\omega )\in A\})\) for all measurable A ⊂ X × Y. Knowing a set of samples \(Z_{n}\) drawn from the probability distribution ν, our goal is to find a good approximation of a function f. Approximating a function from sparse samples is an ill-posed problem which can be managed using regularization. Learning rates measure the quality of the function approximation, playing an important role in learning theory. In learning theory one mostly makes the assumption that samples are drawn independently from the unknown probability P(X, Y). However, independence can be a restrictive concept in some contexts, and to learn from stationary processes whose components are not independent it is necessary to replace the independence assumption by a notion that still guarantees certain concentration inequalities. The mixing condition allows one to put the subject on a firm mathematical foundation, and the notion of near independence permits deriving a more robust theory. In RL, mixing processes are easily encountered in the context of the online policy scenario, as the actual distribution may depend on some of the previous observations. Unlike the i.i.d. case, the learning performance of algorithms with mixing sequences is not measured directly by the sample size (or the number of observations); instead, performance can be related to the effective number of observations.

To learn from stationary processes whose components are not independent, [37] suggests that it is necessary to replace the independence assumption by a notion that still guarantees certain concentration inequalities. In a stationary sequence, the time index t does not affect the distribution of a variable \(Z_{t}\), but this does not imply independence. In particular, for i < j < k, \(P(Z_{j}|Z_{i})\) may not equal \(P(Z_{k}|Z_{i})\), meaning that conditional probabilities may vary at different points in time. Standard mixing coefficients and their basic properties, taken from [40], can now be introduced. The α-mixing process is based on the α-mixing coefficients defined as:

$$\begin{array}{@{}rcl@{}} \alpha_{n}&=& \sup \{\:|\mu(A\cap B)-\mu(A)\mu(B)| \: i\geq1, \: A\in \mathcal{A}_{1}^{i} \\&&\quad and \: B\in \mathcal{A}_{i+n}^{\infty}\}, \quad n\geq 1 \end{array} $$
(111)

where \(\mathcal {A}_{1}^{i}\) and \(\mathcal {A}_{i+n}^{\infty }\) are the σ-algebras generated by \((Z_{1},...,Z_{i})\) and \((Z_{i+n}, Z_{i+n+1},...)\) respectively. The process \(\mathcal {Z}\) is called strong α-mixing if for some \(\bar {\alpha }_{0},\bar {\alpha }_{1},\bar {\alpha }_{2}>0\) we have \(\alpha _{n}\leq \bar {\alpha }_{0} \exp (-\bar {\alpha }_{1}n^{\bar {\alpha }_{2}})\). A β-mixing process can be introduced using the β-mixing coefficients, defined as

$$\begin{array}{@{}rcl@{}} \beta_{n}&=& \frac{1}{2}\sup \left\{\sum\limits_{i,j=1}^{\infty}|\mu(A_{i}\cap B_{j})-\mu(A_{i})\mu(B_{j})| \: \right.\\&&\left.: A_{i}\subset \mathcal{A}_{1}^{i} \: B_{j}\subset \mathcal{A}_{i+n}^{\infty} \: partitions \vphantom{\sum\limits_{i,j=1}^{\infty}}\right\} \end{array} $$
(112)

with β n → 0 as n → ∞, and the process is called exponentially β-mixing if for some constants \(\bar {\beta }_{0},\bar {\beta }_{1}>0\) we have \(\beta _{n}\leq \bar {\beta }_{0} \exp (-\bar {\beta }_{1} n)\). From the definitions it results that all mixing coefficients are equal to zero if \(\mathcal {A}\) and \(\mathcal {B}\) are independent. Both α n and β n measure the dependence of an event on those that occurred more than n units of time in the past. It can be shown that β-mixing implies α-mixing in the non-i.i.d. scenario. In the most general version, future samples depend on the training sample \(Z_{n}\), and thus the generalization error on \(Z_{n}\) must be measured by its expected error conditioned on the sample \(Z_{n}\). The stability of a learning algorithm generally assumes the i.i.d. property: replacing an element in a sequence with another has no effect on the expected value of a random variable defined over that sequence. Hence, the following equality holds: \(\mathbb {E}[F(Z_{1},...,Z_{i},..,Z_{n}) |Z_{n}] = \mathbb {E}[F(Z_{1},...,Z^{\prime },...,Z_{n})|Z_{n},Z^{\prime }]\) for a random variable F that is a function of the sequence of random variables \(Z_{n} = (Z_{1},...,Z_{n})\). However, if the points in the sequence \(Z_{n}\) are dependent, this equality may no longer hold. The main technique for working with this problem is based on independent block sequences. This consists of eliminating several blocks of contiguous points from the original dependent sequence, leaving some remaining blocks of points. Instead of the original dependent blocks, one then considers independent blocks of points, each with the same size and the same distribution (within each block) as the dependent ones. Several lemmas can be proved showing that, for α- or β-mixing distributions, the expected value of a random variable defined over the dependent blocks is close to the one based on these independent blocks. As a result, using independent blocks brings us back to a situation similar to the i.i.d. case, with i.i.d. blocks replacing i.i.d. points (see [40] and [30] for more details on the independent block technique).

Using the blocking device introduced in [40], we may partition the set {1,...,n} according to the choice of an integral block length \(a_{n}\) and \(\mu _{n} \in \mathbb {N}\). The partition has \(2\mu _{n}\) blocks of integral length \(a_{n}\), chosen such that \(n-2a_{n} \leq 2\mu _{n}a_{n} \leq n\), plus a residual block; for 1 ≤ j ≤ μ n we have

$$\begin{array}{@{}rcl@{}} H_{j} &=& \{i:\: 2(j-1)a_{n}+1\leq i\leq(2j-1)a_{n} \} \quad head \\ T_{j} &=& \{i:\: (2j-1)a_{n}+1\leq i\leq 2ja_{n}\} \quad tail \\ R &=& \{2\mu_{n}a_{n}+1,...,n\} \quad residual \end{array} $$
(113)

The structure of the blocks goes as \(H_{1},T_{1},H_{2},T_{2},...,H_{\mu _{n}},T_{\mu _{n}},R\), where we define \(H=\cup _{1\leq j\leq \mu _{n}} H_{j}\). The samples in every second block are replaced by ghost samples whose joint marginal distribution is kept the same as that of the original samples. For the new random variables the new blocks are independent of each other. Consider the sequence of random variables \(\bar {Z}^{\prime }(H)=\{Z^{\prime }_{i} \: i\in H\}\) such that \(\bar {Z}^{\prime }(H)\) is independent of \(\bar {Z}_{n}\) and the blocks \(\bar {Z}^{\prime }(H_{j})\) with j = 1,..,μ n are i.i.d., each block having the same distribution as a block from the original sequence. Let \(\mathcal {F}\) be some space of measurable real-valued functions with domain \(\mathcal {Z}\). For any \(f\in \mathcal {F}\) we may derive the function space \(\mathcal {\bar {F}}=\{\bar {f} \: : \: f\in \mathcal {F} \}\) with \(\bar {f} \: : \: \mathcal {Z}^{a_{n}} \: \rightarrow \: \mathbb {R}\).
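The index bookkeeping of (113) is easy to get wrong, so here is a small, hedged Python sketch that materializes the head, tail, and residual blocks (the function name is ours):

```python
def make_blocks(n, a_n):
    # Partition {1,...,n} into mu_n head blocks H_j and mu_n tail blocks T_j,
    # each of length a_n, plus a residual block R, following (113).
    mu_n = n // (2 * a_n)
    H = [list(range(2 * (j - 1) * a_n + 1, (2 * j - 1) * a_n + 1))
         for j in range(1, mu_n + 1)]
    T = [list(range((2 * j - 1) * a_n + 1, 2 * j * a_n + 1))
         for j in range(1, mu_n + 1)]
    R = list(range(2 * mu_n * a_n + 1, n + 1))
    return H, T, R

# Example: n = 23, a_n = 4 gives mu_n = 2, blocks H1,T1,H2,T2 of length 4,
# and residual R = {17,...,23}.
H, T, R = make_blocks(23, 4)
```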

We quote the following general result:

Lemma 1

(Lemma 4.1 in [40]) Suppose \((Z_{1},...,Z_{n})\in \mathcal {Z}\) is a stationary β-mixing process with mixing coefficients β n and \(\bar {Z}^{\prime }\in \mathcal {Z}\) the block-independent samples. For any measurable function \(f\: :\: \mathcal {Z}^{\mu _{n}a_{n}} \: \rightarrow \: \mathbb {R}\) we have

$$\begin{array}{@{}rcl@{}} \mathbb{E}[\: f(\bar{Z}(H))-f(\bar{Z}^{\prime}(H))]\leq \| f\|_{\infty} (\mu_{n}-1) \beta_{a_{n}} \end{array} $$
(114)

We also report a technical lemma which applies to β-mixing processes, stated as Lemma 2 in [1], which will come in handy for the analysis of statistical errors involving β-mixing processes:

Lemma 2

(Lemma 2 in [1]) (Relative Deviation Inequality) Consider a \(\mathcal {Z}-\) valued stationary β-mixing sequence \(Z=\{Z_{t}\}_{t=1}^{n}\) and a permissible class \(\mathcal {F}\) of real functions f with domain Z. Assume that for some M > 0 we have \(\sup _{f\in \mathcal {F}} \|f\|_{\infty }\leq M\). Then fix \(n\in \mathbb {N}\) and 𝜖, η > 0. Let \(\bar {Z}^{\prime }(H)\) be a (μ n ,a n )-independent block sequence with a residual block R satisfying \(|R|\leq \frac {\epsilon \eta }{6M}\). Then

$$\begin{array}{@{}rcl@{}} &&\mathbb{P} \left\{ \sup\limits_{f \in \mathcal{F}} \left|\frac{\frac{1}{n} {\sum}_{i=1}^{n} f(Z_{i})- \mathbb{E}[f(Z)] }{\eta+|\mathbb{E}[f(Z)] |}\right|>\epsilon \right\} \\ &&\leq 2\mathbb{P} \left\{ \sup\limits_{f \in \mathcal{F}} \left|\frac{\frac{1}{\mu_{n}} {\sum}_{j=1}^{\mu_{n}} \bar{f}(H^{\prime}_{j})\,-\, \mathbb{E}[\bar{f}(H_{1})]} {a_{n}\eta\,+\,|\mathbb{E}[\bar{f}(H_{1})] |}\right|\!>\!\frac{2}{3}\epsilon \right\}\,+\,2\beta_{a_{n}}\mu_{n} \end{array} $$
(115)

The following lemma generalizes Lemma 3 in [1] to p, q = 1, 2 and relates the covering number \(\mathcal {N}_{p}(\epsilon ,\mathcal {F},\|z\|^{q}_{p,n})\) to \(\mathcal {N}_{p}(\bar {\epsilon },\bar {\mathcal {F}},\|\bar {z}(H)\|^{q}_{p,\mu _{n}})\):

Lemma 3

(Extension of Lemma 3 in [1]) (Covering Number) For any \((z_{1},....,z_{n})\in \mathcal {Z}^{n}\) and for p = 1, 2 we have

$$\begin{array}{@{}rcl@{}} \mathcal{N}_{p}(\epsilon,\bar{\mathcal{F}},\|\bar{z}(H)\|^{p}_{p,\mu_{n}})\leq \mathcal{N}_{p} \left(\left[\frac{2(1-\frac{|R|}{n})}{4{a_{n}^{p}}}\right]^{\frac{1}{p}}\epsilon,\mathcal{F},\|z\|^{p}_{p,n} \right) \end{array} $$
(116)

Proof

Consider that for any function \(f \: : \mathcal {Z} \: \rightarrow \mathbb {R}\) we may bound \(\| \bar {f} \|^{p}_{p,\bar {z}(H)_{\mu _{n}}}\) in terms of \(\| f \|^{p}_{p,z_{n}}\). In fact, considering that \(2a_{n}\mu _{n} = n-|R|\) and using Jensen's inequality and the fact that \(H\subset \{1,...,n\}\), we have

$$\begin{array}{@{}rcl@{}} \| \bar{f} \|^{p}_{p,\bar{z}(H)_{\mu_{n}}}=\frac{1}{\mu_{n}}\sum\limits_{j=1}^{\mu_{n}} \Big| \sum\limits_{i\in H_{j}} f(z_{i})\Big|^{p}\leq \\ \frac{{a_{n}^{p}}}{\mu_{n} a_{n}} \sum\limits_{i\in H} |f(z_{i})|^{p}\leq \frac{4{a_{n}^{p}}}{2\left(1-\frac{|R|}{n}\right)} \|f(z)\|^{p}_{p,{z}_{n}} \end{array} $$
(117)

Now, considering that for any \(f_{1},f_{2} \in \mathcal {F}\) we have \(\overline {f_{1}-f_{2}}=\bar {f}_{1}-\bar {f}_{2}\), using the previous inequality we get \(\| \bar {f}_{1} -\bar {f}_{2} \|^{p}_{p,\bar {z}(H)_{\mu _{n}}} \leq \left (\frac {4{a_{n}^{p}}}{2(1-\frac {|R|}{n})}\right ) \| f_{1}-f_{2} \|^{p}_{p,z_{n}}\); therefore any \(\left [\frac {2(1-\frac {|R|}{n})}{4{a_{n}^{p}}}\right ]^{\frac {1}{p}}\epsilon \)-cover of \(\mathcal {F}\) induces an 𝜖-cover of \(\bar {\mathcal {F}}\). □

Finally, we introduce a slightly modified version of Theorem 4 in [1] working for p = 1, 2, which generalizes to exponentially β-mixing processes Theorem 19.3 in [19], stated only for p = 2 and i.i.d. processes:

Theorem 5

(Extension of Theorem 4 in [1]) (Relative Deviation Concentration Inequality) Consider a \(\mathcal {Z}-\) valued stationary β-mixing sequence \(\bar {Z}=\{Z_{t}\}_{t=1}^{n}\) and a permissible class \(\mathcal {F}\) of real-valued functions f with domain \(\mathcal {Z}\). Let \(n \in \mathbb {N}\) and \(K_{1},K_{2} \geq 1\), and choose η > 0 and 0 < 𝜖 < 1. Assume that for any \(f\in \mathcal {F}\) the following conditions hold:

  • C 1 - \(\|f\|_{\infty } \leq K_{1}\)

  • C 2 - \(\mathbb {E}[f^{2}(Z)]\leq K_{2}\mathbb {E}[f(Z)]\)

further consider the (μ n ,a n )-independent blocks with residual block R, assuming that the following conditions also hold:

  • C 3 - \(\sqrt {n} \sqrt {1-\epsilon } \sqrt {\eta } \geq 576 \max \{2K_{1}a_{n},\sqrt {2a_{n}K_{2}}\}\)

  • C 4 - \(\frac {|R|}{n}\leq \frac {\epsilon \eta }{6K_{1}}\) and |R| ≤ n/2

  • C 5 - For all \(z_{1},...,z_{n} \in \mathcal {Z}\) and all \(\delta >\frac {\eta a_{n}}{8}\) we have

$$\begin{array}{@{}rcl@{}} &&\frac{\sqrt{\mu_{n}}\epsilon (1-\epsilon) \delta }{96 \sqrt{2}a_{n} \max\{K_{1},2K_{2}\}} \geq {\int}_{\frac{\epsilon(1-\epsilon)\delta}{16a_{n}\max\{K_{1},2K_{2}\}}}^{\sqrt{\delta}} \\&&\quad\times\left(\log \: \mathcal{N}_{p} \left(\frac{u} { (4{a_{n}^{p}})^{\frac{1}{p}} }, \mathcal{F}, \|z\|_{p,n} \right) \right)^{\frac{1}{p}} du \end{array} $$

Then there exists universal constants c 1, c 2 > 0 such that

$$\begin{array}{@{}rcl@{}} &\mathbb{P} \left\{ \sup\limits_{f\in \mathcal{F}}\Big|\frac{\frac{1}{n}{\sum}_{t=1}^{n}f(Z_{t})-\mathbb{E}[f(Z)]}{\eta+\mathbb{E}[f(Z)]}\Big|>\epsilon \right\} \\ &\leq c_{1}\exp \left(-c_{2}\frac{\mu_{n} a_{n} \eta \epsilon^{2}(1-\frac{2}{3}\epsilon)}{\max\{{a_{n}^{2}}{K_{1}^{2}},a_{n} K_{2}\}}\right)+2\beta_{a_{n}}\mu_{n} \end{array} $$
(118)

The constants can be set to \(c_{1} = 120\) and \(c_{2}=\frac {1}{2^{12} 3^{4}}\).

The proof of the theorem can be found in [19] for p = 2 and i.i.d. data samples. The proof of the generalization to β-mixing data samples for p = 2 is reported in [1]. The difference in using p = 1, 2 essentially reflects in the use of the logarithm of the covering number in condition C 5. The proof of this theorem is rather long and technical, as can be seen in the book [19]. However, one may believe that this result holds for both the p = 1, 2 cases following the chain of arguments presented in the proof of Theorem 19.3 in [19], which essentially relies on Theorem 19.2 in [19]. This last theorem is in turn supported by Theorem 11.6 in [19], giving essentially the same result but stated for the case p = 1 and using the logarithm of the covering number. Hence, the extension to p = 1, 2 follows the same argument used in the proof of Theorem 19.3 for i.i.d. data samples and the relative extension of Theorem 4 in [1] for stationary β-mixing data samples, and is omitted for brevity.

Lastly, as some proofs presented in this work rely on it, we introduce in this section a useful tool called the peeling device, discussed in [18]. Intuitively, the peeling device considers some function over larger and larger balls centered around a given one; it allows us to peel off sets of increasingly smaller probability. To introduce the peeling, let us define a function \(\tau \: : \mathcal {F} \rightarrow [\rho ,\infty )\) with ρ > 0 and a strictly increasing sequence \(\{\sigma _{l}\}_{l \geq 0}\) starting with σ 0 = 0 and growing to infinity. We can peel the function space \(\mathcal {F}\) into \(\mathcal {F}=\cup _{l\geq 1} \mathcal {F}_{\sigma _{l}}\) where \(\mathcal {F}_{\sigma _{l}}=\{ f\in \mathcal {F} \: : \sigma _{l-1}\leq \tau (f)\leq \sigma _{l}, \: l=1,2,..\}\). Then for any ρ > 0 and a stochastic process \(X_{n}(f)\) indexed by \(\mathcal {F}\) (meaning that it implicitly depends on the functions f) we have:

Definition 9

(Peeling Device [18]) Consider the function space \(\mathcal {F}\) and let \(X_{n}(f)\) be a stochastic process indexed by \(\mathcal {F}\). Now consider the function \(\tau \: : \mathcal {F} \rightarrow [\rho ,\infty )\) with ρ > 0, with the goal of obtaining a probability upper bound on the weighted process \(|X_{n}(f)|/\tau (f)\). Let \(\{\sigma _{l}\}_{l \geq 0}\) be a strictly increasing sequence with σ 0 = 0 and \(\lim _{l\rightarrow \infty } \: \sigma _{l}=\infty \). The function space \(\mathcal {F}\) can be peeled off into smaller function spaces \(\mathcal {F}=\cup _{l\geq 1} \mathcal {F}_{\sigma _{l}}\) with \(\mathcal {F}_{\sigma _{l}}=\{ f\in \mathcal {F} \: : \sigma _{l-1}\leq \tau (f)\leq \sigma _{l}, \: l=1,2,..\}\). For any a > 0 it results that

$$\begin{array}{@{}rcl@{}} \mathbb{P} \left\{ \sup\limits_{f\in \mathcal{F}} \frac{|X_{n}(f)|}{\tau(f)}>a \right\} &\leq& \sum\limits_{l\geq 1} \mathbb{P}\left\{\sup\limits_{f\in \mathcal{F}_{\sigma_{l}}} \frac{|X_{n}(f)|}{\tau(f)}>a \right\}\\&\leq& \sum\limits_{l\geq 1} \mathbb{P}\left\{ \sup\limits_{f\in \mathcal{F},\tau(f)<\sigma_{l}} |X_{n}(f)|> a \sigma_{l-1}\right\} \end{array} $$

which is called the peeling device, and each l = 1,2,... denotes the level of peeling.

This result produces probability inequalities for the weighted process starting from probabilities of the original process and can be used to simplify the evaluation of statistical bounds.
