Machine Learning, Volume 107, Issue 5, pp 887–902

Simpler PAC-Bayesian bounds for hostile data

  • Pierre Alquier
  • Benjamin Guedj


PAC-Bayesian learning bounds are of the utmost interest to the learning community. Their role is to connect the generalization ability of an aggregation distribution \(\rho \) to its empirical risk and to its Kullback-Leibler divergence with respect to some prior distribution \(\pi \). Unfortunately, most of the available bounds typically rely on heavy assumptions such as boundedness and independence of the observations. This paper aims at relaxing these constraints and provides PAC-Bayesian learning bounds that hold for dependent, heavy-tailed observations (hereafter referred to as hostile data). In these bounds the Kullback-Leibler divergence is replaced with a general version of Csiszár’s f-divergence. We prove a general PAC-Bayesian bound, and show how to use it in various hostile settings.


Keywords: PAC-Bayesian theory · Dependent and unbounded data · Oracle inequalities · f-divergence

1 Introduction

Learning theory can be traced back to the late 60s and has attracted great attention since. We refer to the monographs Devroye et al. (1996) and Vapnik (2000) for a survey. Most of the literature addresses the simplified case of i.i.d. observations coupled with bounded loss functions. Many bounds on the excess risk holding with large probability were provided; these bounds have been referred to as PAC learning bounds since Valiant (1984).

In the late 90s, the PAC-Bayesian approach was pioneered by Shawe-Taylor and Williamson (1997) and McAllester (1998, 1999). It consists of producing PAC bounds for a specific class of Bayesian-flavored estimators. Similar to classical PAC results, most PAC-Bayesian bounds have been obtained with bounded loss functions (see Catoni (2007), for some of the most accurate results). Note that Catoni (2004) provides bounds for unbounded loss, but still under very strong exponential moment assumptions. Different types of PAC-Bayesian bounds were proved in a wide variety of models (Seeger 2002; Langford and Shawe-Taylor 2002; Seldin and Tishby 2010; Seldin et al. 2012, 2011; Guedj and Alquier 2013; Bégin et al. 2016; Alquier et al. 2016; Oneto et al. 2016) but the boundedness or exponential moment assumptions were essentially not improved in these papers.

The relaxation of the exponential moment assumption is however a theoretical challenge, with huge practical implications: in many applications of regression, there is no reason to believe that the noise is bounded or sub-exponential. Actually, the belief that the noise is sub-exponential leads to an overconfidence in the prediction that is very harmful in practice, see for example the discussion in Taleb (2007) on finance. Still, thanks to the aforementioned works, the road to obtain PAC bounds for bounded observations has now become so nice and comfortable that it might discourage the exploration of different settings.

Regarding PAC bounds for heavy-tailed random variables, let us mention three recent approaches.
  • Using the so-called small-ball property, Mendelson and several co-authors developed in a striking series of papers tools to study the Empirical Risk Minimizer (ERM) and penalized variants without an exponential moment assumption: we refer to their most recent works (Mendelson 2015; Lecuè and Mendelson 2016). Under a quite similar assumption, Grünwald and Mehta (2016) derived PAC-Bayesian learning bounds (“empirical witness of badness” assumption). Other assumptions were introduced in order to derive fast rates for unbounded losses, like the multiscale Bernstein assumption (Dinh et al. 2016).

  • Another idea consists in using robust loss functions. This leads to better confidence bounds than the previous approach, but at the price of replacing the ERM by a more complex estimator, usually building on PAC-Bayesian approaches (Audibert and Catoni 2011; Catoni 2012; Oliveira 2013; Giulini 2015; Catoni 2016).

  • Finally, Devroye et al. (2015), using median-of-means, provide bounds in probability for the estimation of the mean without exponential moment assumption. It is possible to extend this technique to more general learning problems (Minsker 2015; Hsu and Sabato 2016; Lugosi and Mendelson 2016; Guillaume and Matthieu 2017; Lugosi and Mendelson 2017).

Leaving the well-marked path of bounded variables led the authors to sophisticated and technical mathematics, but in the end they obtained rates of convergence similar to the ones in bounded cases: this is highly valuable for the statistical and machine learning community.

Regarding dependent observations, like time series or random fields, PAC and/or PAC-Bayesian bounds were provided in various settings (Modha and Masry 1998; Steinwart and Christmann 2009; Mohri and Rostamizadeh 2010; Ralaivola et al. 2010; Seldin et al. 2012; Alquier and Wintenberger 2012; Alquier and Li 2012; Agarwal and Duchi 2013; Alquier et al. 2013; Kuznetsov and Mohri 2014; Giraud et al. 2015; Zimin and Lampert 2015; London et al. 2016). However these works massively relied on concentration inequalities or limit theorems for time series (Yu 1994; Doukhan 1994; Rio 2000; Kontorovich and Ramanan 2008), for which boundedness or exponential moments are crucial.

This paper shows that a proof scheme of PAC-Bayesian bounds proposed by Bégin et al. (2016) can be extended to a very general setting, without independence or exponential moment assumptions. We would like to stress that this approach is not comparable to the aforementioned work, and in particular it is technically far less sophisticated. However, while it leads to sub-optimal rates in many cases, it allows us to derive PAC-Bayesian bounds in settings where no PAC learning bounds were available before: for example heavy-tailed time series.

Given the simplicity of the main result, we state it in the remainder of this section. The other sections are devoted to refinements and applications. Let \(\ell \) denote a generic loss function. The observations are denoted \((X_1,Y_1),\ldots ,(X_n,Y_n)\). Note that we do not require the observations to be independent, nor identically distributed. We assume that a family of predictors \((f_{\theta },\theta \in \Theta )\) is chosen. Let \(\ell _i(\theta )=\ell [f_{\theta }(X_i),Y_i]\), and define the (empirical) risk as
$$\begin{aligned} r_n(\theta )&= \frac{1}{n}\sum _{i=1}^n \ell _i(\theta ), \\ R(\theta )&= \mathbb {E}\bigl [r_n(\theta )\bigr ]. \end{aligned}$$
Based on the observations, the objective is to build procedures with a small risk R. While PAC bounds focus on estimators \(\hat{\theta }_n\) that are obtained as functionals of the sample, the PAC-Bayesian approach studies an aggregation distribution \(\hat{\rho }_n\) that depends on the sample. In this case, the objective is to choose \(\hat{\rho }_n\) such that \(\int R(\theta ) \hat{\rho }_n(\mathrm{d}\theta )\) is small. In order to do so, a crucial point is to choose a reference probability measure \(\pi \), often referred to as the prior. In Catoni (2007), the role of \(\pi \) is discussed in depth: rather than reflecting a prior knowledge on the parameter space \(\Theta \), it should serve as a tool to measure the complexity of \(\Theta \).

Let us now introduce the two following key quantities.

Definition 1

For any function g, let
$$\begin{aligned} \mathcal {M}_{g,n} = \int \mathbb {E}\bigl [g \left( |r_n(\theta )-R(\theta )| \right) \bigr ] \pi (\mathrm{d}\theta ). \end{aligned}$$

Definition 2

Let f be a convex function with \(f(1)=0\). The f-divergence between two distributions \(\rho \) and \(\pi \) is defined by
$$\begin{aligned} D_{f}(\rho ,\pi ) = \int f\left( \frac{\mathrm{d}\rho }{\mathrm{d}\pi }\right) \mathrm{d}\pi \end{aligned}$$
when \(\rho \) is absolutely continuous with respect to \(\pi \), and \( D_{f}(\rho ,\pi ) = +\infty \) otherwise.

Csiszár introduced f-divergences in the 60s, see his recent monograph Csiszár and Shields (2004, Chapter 4) for a survey.

We use the following notation for recurring functions: \(\phi _p(x) = x^p\). Consequently \(\mathcal {M}_{\phi _p,n} = \int \mathbb {E}\left( |r_n(\theta )-R(\theta )|^{p} \right) \pi (\mathrm{d}\theta )\). Thus \(\mathcal {M}_{\phi _p,n}\) is a moment of order p. As for divergences, we denote the Kullback-Leibler divergence by \(\mathcal {K}(\rho ,\pi )=D_{f}(\rho ,\pi )\) when \(f(x)=x\log (x)\), and the chi-square divergence \(\chi ^2(\rho ,\pi )=D_{\phi _2-1}(\rho ,\pi )\).
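To make Definition 2 and the notation above concrete, here is a minimal sketch on a finite parameter space, where the integral against \(\pi \) becomes a weighted sum. The function names and the two example distributions are illustrative assumptions, not part of the paper.

```python
import numpy as np

def f_divergence(rho, pi, f):
    """D_f(rho, pi) = sum_theta pi(theta) * f(rho(theta) / pi(theta)) on a finite set.

    Returns +inf when rho is not absolutely continuous with respect to pi.
    """
    rho = np.asarray(rho, dtype=float)
    pi = np.asarray(pi, dtype=float)
    if np.any((pi == 0) & (rho > 0)):
        return np.inf
    support = pi > 0
    ratio = rho[support] / pi[support]
    return float(np.sum(pi[support] * f(ratio)))

def phi_p_minus_1(p):
    """The function x -> x**p - 1, which generates D_{phi_p - 1}."""
    return lambda x: x**p - 1.0

def x_log_x(x):
    """x -> x * log(x), extended by continuity with value 0 at x = 0."""
    x = np.asarray(x, dtype=float)
    safe = np.where(x > 0, x, 1.0)      # log(1) = 0, so zeros contribute 0
    return x * np.log(safe)

# Illustrative distributions on a 4-point parameter space (assumed values).
pi = np.array([0.25, 0.25, 0.25, 0.25])
rho = np.array([0.70, 0.10, 0.10, 0.10])

kl = f_divergence(rho, pi, x_log_x)               # Kullback-Leibler divergence
chi2 = f_divergence(rho, pi, phi_p_minus_1(2))    # chi-square divergence
```

Both divergences vanish when \(\rho =\pi \), and the chi-square divergence reduces to \(\sum _{\theta } \rho (\theta )^2/\pi (\theta ) - 1\) in this discrete setting.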

Theorem 1

Fix \(p>1\), put \(q=\frac{p}{p-1}\) and fix \(\delta \in (0,1)\). With probability at least \(1-\delta \) we have for any aggregation distribution \(\rho \)
$$\begin{aligned} \left| \int R \mathrm{d}\rho - \int r_n \mathrm{d}\rho \right| \le \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}}. \end{aligned} \tag{1}$$
The main message of Theorem 1 is that we can compare \(\int r_n \mathrm{d}\rho \) (observable) to \(\int R \mathrm{d}\rho \) (unknown, the objective) in terms of two quantities: the moment \(\mathcal {M}_{\phi _{q},n}\) (which depends on the distribution of the data) and the divergence \(D_{\phi _p-1}(\rho ,\pi )\) (which will reveal itself as a measure of the complexity of the set \(\Theta \)). The most important practical consequence is that we have, with probability at least \(1-\delta \), for any probability measure \(\rho \),
$$\begin{aligned} \int R \mathrm{d}\rho \le \int r_n \mathrm{d}\rho + \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}}. \end{aligned} \tag{2}$$
This is a strong incentive to define our aggregation distribution \(\hat{\rho }_n\) as the minimizer of the right-hand side of (2). The core of the paper will discuss in detail this strategy and other consequences of Theorem 1.

Proof of Theorem 1

Introduce \(\Delta _n(\theta ):= |r_n(\theta )-R(\theta )|\). We follow a scheme of proof introduced by Bégin et al. (2016) in the bounded setting. We adapt the proof to the general case:
$$\begin{aligned} \left| \int R \mathrm{d}\rho - \int r_n \mathrm{d}\rho \right|&\le \int \Delta _{n} \mathrm{d}\rho = \int \Delta _{n} \frac{\mathrm{d}\rho }{\mathrm{d}\pi } \mathrm{d}\pi \\&\le \left( \int \Delta _{n}^{q}\mathrm{d}\pi \right) ^{\frac{1}{q}} \left( \int \left( \frac{\mathrm{d}\rho }{\mathrm{d}\pi }\right) ^{p} \mathrm{d}\pi \right) ^{\frac{1}{p}} (\text {H}\ddot{\mathrm{o}}\text {lder ineq}.) \\&\le \left( \frac{\mathbb {E} \int \Delta _{n}^{q}\mathrm{d}\pi }{\delta } \right) ^{\frac{1}{q}} \left( \int \left( \frac{\mathrm{d}\rho }{\mathrm{d}\pi }\right) ^{p} \mathrm{d}\pi \right) ^{\frac{1}{p}} \text { (Markov ineq., w. prob. } 1-\delta ) \\&=\left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}} . \end{aligned}$$
\(\square \)
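The Hölder step of the proof can be checked numerically on a finite parameter space. The sketch below is only an illustration under assumed inputs: the deviations \(\Delta _n(\theta )\) are replaced by arbitrary heavy-tailed draws, \(\pi \) is uniform and \(\rho \) is a random distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 50, 3.0
q = p / (p - 1.0)

pi = np.full(K, 1.0 / K)                 # uniform prior on a finite Theta
rho = rng.dirichlet(np.ones(K))          # an arbitrary aggregation distribution
delta_n = rng.pareto(2.5, size=K)        # stand-in for |r_n(theta) - R(theta)|

density = rho / pi                       # d rho / d pi
lhs = float(np.sum(delta_n * rho))       # int Delta_n d rho

# (int Delta_n^q d pi)^(1/q)
moment_term = float(np.sum(delta_n**q * pi)) ** (1.0 / q)
# (int (d rho / d pi)^p d pi)^(1/p) = (D_{phi_p - 1}(rho, pi) + 1)^(1/p)
divergence_term = float(np.sum(density**p * pi)) ** (1.0 / p)

holder_rhs = moment_term * divergence_term
```

Whatever the values of `delta_n`, `lhs` never exceeds `holder_rhs`. The probabilistic part of the theorem only enters through Markov's inequality applied to \(\int \Delta _n^q \mathrm{d}\pi \).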

In Sect. 2 we discuss the divergence term \(D_{\phi _p-1}(\rho ,\pi )\). In particular, we derive an explicit bound on this term when \(\rho \) is chosen in order to concentrate around the ERM (empirical risk minimizer) \(\hat{\theta }_{\mathrm{ERM}}=\arg \min _{\theta \in \Theta }\ r_n(\theta )\). This is meant to provide the reader some intuition on the order of magnitude of the divergence term. In Sect. 3 we discuss how to control the moment \(\mathcal {M}_{\phi _{q},n}\). We derive explicit bounds in various examples: bounded and unbounded losses, i.i.d. and dependent observations. The most important result of the section is a risk bound for auto-regression with heavy-tailed time series, which is new to the best of our knowledge. In Sect. 4 we come back to the general case. We show that it is possible to explicitly minimize the right-hand side in (2). We then show that Theorem 1 leads to powerful oracle inequalities in the various statistical settings discussed above, exhibiting explicit rates of convergence.

2 Calculation of the divergence term

The aim of this section is to provide some hints on the order of magnitude of the divergence term \(D_{\phi _p-1}(\rho ,\pi )\). We start with the example of a finite parameter space \(\Theta \). The following proposition results from straightforward calculations.

Proposition 1

Assume that \(\mathrm{Card}(\Theta )=K<\infty \) and that \(\pi \) is uniform on \(\Theta \). Then
$$\begin{aligned} D_{\phi _p-1}(\rho ,\pi ) +1= K^{p-1} \sum _{\theta \in \Theta } \rho (\theta )^p. \end{aligned}$$
A special case of interest is when \(\rho =\delta _{\hat{\theta }_{\mathrm{ERM}}}\), the Dirac mass concentrated on the ERM. Then
$$\begin{aligned} D_{\phi _p-1}(\delta _{\hat{\theta }_{\mathrm{ERM}}},\pi )+1 = K^{p-1}. \end{aligned}$$
Then (1) in Theorem 1 yields the following result.

Proposition 2

Fix \(p>1\), \(q=\frac{p}{p-1}\) and \(\delta \in (0,1)\). With probability at least \(1-\delta \) we have
$$\begin{aligned} R(\hat{\theta }_{\mathrm{ERM}}) \le \inf _{\theta \in \Theta }\ \bigl \{r_n(\theta )\bigr \} + K^{1-\frac{1}{p}} \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}}. \end{aligned}$$
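The identity of Proposition 1 and the resulting term in Proposition 2 are easy to verify numerically. The following sketch uses hypothetical helper names and arbitrary values of \(K\) and \(p\); it is meant as a sanity check, not as part of the argument.

```python
import numpy as np

def divergence_plus_one(rho, pi, p):
    """D_{phi_p - 1}(rho, pi) + 1 = int (d rho / d pi)^p d pi on a finite set (pi > 0)."""
    rho = np.asarray(rho, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(pi * (rho / pi) ** p))

def erm_bound_term(K, p, moment, delta):
    """Second term of Proposition 2: K^(1 - 1/p) * (moment / delta)^(1/q)."""
    q = p / (p - 1.0)
    return K ** (1.0 - 1.0 / p) * (moment / delta) ** (1.0 / q)

K, p = 8, 2.5                                     # assumed values for illustration
pi = np.full(K, 1.0 / K)                          # uniform prior
rho = np.random.default_rng(1).dirichlet(np.ones(K))

dirac = np.zeros(K)                               # Dirac mass on one parameter
dirac[3] = 1.0

general_case = divergence_plus_one(rho, pi, p)    # equals K^(p-1) * sum rho^p
dirac_case = divergence_plus_one(dirac, pi, p)    # equals K^(p-1)
```

Since \((K^{p-1})^{1/p}=K^{1-1/p}\), plugging the Dirac case into Theorem 1 gives exactly the factor returned by `erm_bound_term`.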

Remark that \(D_{\phi _p-1}(\rho ,\pi )\) seems to be related to the complexity K of the parameter space \(\Theta \). This intuition can be extended to an infinite parameter space, for example using the empirical complexity parameter introduced in Catoni (2007).

Assumption 1

There exists \(d>0\) such that, for any \(\gamma >0\),
$$\begin{aligned} \pi \Bigl \{ \theta \in \Theta : \bigl \{ r_n(\theta ) \bigr \} \le \inf _{\theta '\in \Theta }\ r_n(\theta ') + \gamma \Bigr \} \ge \gamma ^d . \end{aligned}$$
In many examples, d corresponds to the ambient dimension (see Catoni (2007) for a thorough discussion). In this case, a sensible choice for \(\rho \), as suggested by Catoni, is \(\pi _{\gamma } (\mathrm{d}\theta ) \propto \pi (\mathrm{d}\theta ) \mathbf {1}\left[ r_n(\theta ) - r_n(\hat{\theta }_{\mathrm{ERM}}) \le \gamma \right] \) for \(\gamma \) small enough (in Sect. 4, we derive the consequences of Assumption 1 for other aggregation distributions). We have
$$\begin{aligned} D_{\phi _p-1}(\pi _{\gamma },\pi ) +1 \le \gamma ^{-d(p-1)} \end{aligned}$$
and
$$\begin{aligned} \int r_n (\theta ) \mathrm{d}\pi _{\gamma } \le r_n(\hat{\theta }_{\mathrm{ERM}}) + \gamma \end{aligned}$$
so Theorem 1 leads to
$$\begin{aligned} \int R \mathrm{d}\pi _{\gamma } \le r_n(\hat{\theta }_{\mathrm{ERM}}) + \gamma + \gamma ^{-\frac{d}{q}} \left( \frac{ \mathcal {M}_{\phi _{q},n}}{\delta } \right) ^{\frac{1}{q}}. \end{aligned}$$
An explicit optimization with respect to \(\gamma \) leads to the choice
$$\begin{aligned} \gamma =\left( \frac{d}{q} \left( \frac{\mathcal {M}_{\phi _{q},n}}{\delta }\right) ^{\frac{1}{q}}\right) ^{\frac{1}{1+\frac{d}{q}}} \end{aligned}$$
and consequently to the following result.

Proposition 3

Fix \(p>1\), \(q=\frac{p}{p-1}\) and \(\delta \in (0,1)\). Under Assumption 1, with probability at least \(1-\delta \) we have, for this choice of \(\gamma \),
$$\begin{aligned} \int R \mathrm{d}\pi _{\gamma } \le \inf _{\theta \in \Theta }\ \Bigl \{r_n(\theta )\Bigr \} + \left( \frac{\mathcal {M}_{\phi _{q},n}}{\delta } \right) ^{\frac{1}{q+d}} \left\{ \left( \frac{d}{q}\right) ^\frac{1}{1+\frac{d}{q}} + \left( \frac{d}{q}\right) ^{\frac{-\frac{d}{q}}{1+\frac{d}{q}}} \right\} . \end{aligned}$$

So the bound is in \(\mathcal {O}((\mathcal {M}_{\phi _{q},n}/\delta )^{1/(q+d)})\): the factor \((\mathcal {M}_{\phi _{q},n}/\delta )^{1/q}\) multiplying \(\gamma ^{-d/q}\) is raised to the power \(1/(1+d/q)\) at the optimum, and \(\frac{1}{q}\cdot \frac{1}{1+d/q}=\frac{1}{q+d}\). In order to understand the order of magnitude of the bound, it is now crucial to understand the moment term \(\mathcal {M}_{\phi _{q},n}\). This is the object of the next section.
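The optimization over \(\gamma \) is a one-dimensional convex problem: one minimizes \(\gamma \mapsto \gamma + B\gamma ^{-a}\) with \(a=d/q\), where \(B\) denotes the constant multiplying \(\gamma ^{-d/q}\) in the bound on \(\int R\,\mathrm{d}\pi _{\gamma }\). The sketch below compares the first-order-condition solution with a grid search; the numerical values of \(d\), \(q\) and of the moment term are assumptions chosen for illustration.

```python
import numpy as np

def objective(gamma, a, B):
    """gamma + B * gamma**(-a), the gamma-dependent part of the bound."""
    return gamma + B * gamma ** (-a)

def gamma_star(a, B):
    """Root of the first-order condition 1 - a * B * gamma**(-a - 1) = 0."""
    return (a * B) ** (1.0 / (1.0 + a))

d, q = 3.0, 2.0
moment_over_delta = 1e-3            # hypothetical value of M_{phi_q, n} / delta
a = d / q
B = moment_over_delta ** (1.0 / q)  # constant in front of gamma**(-d/q)

g_opt = gamma_star(a, B)

# Brute-force check on a fine grid around the closed-form minimizer.
grid = np.linspace(0.2 * g_opt, 5.0 * g_opt, 20001)
g_grid = grid[np.argmin(objective(grid, a, B))]

# Value of the objective at the optimum, in closed form.
closed_form_value = B ** (1.0 / (1.0 + a)) * (
    a ** (1.0 / (1.0 + a)) + a ** (-a / (1.0 + a))
)
```

The grid minimizer agrees with `g_opt` up to the grid resolution, and `objective(g_opt, a, B)` coincides with `closed_form_value`, which exhibits the two powers of \(d/q\) appearing in Proposition 3.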

3 Bounding the moments

In this section, we show how to control \(\mathcal {M}_{\phi _{q},n}\) depending on the assumptions on the data.

3.1 The i.i.d setting

First, let us assume that the observations \((X_i, Y_i)\) are independent and identically distributed. In general, when the observations are possibly heavy-tailed, we recommend using Theorem 1 with \(q\le 2\) (which implies \(p\ge 2\)).

Proposition 4

Assume that
$$\begin{aligned} s^2 = \int \mathrm{Var}[\ell _1(\theta )] \pi (\mathrm{d}\theta )< +\infty . \end{aligned}$$
Then, for any \(q\le 2\),
$$\begin{aligned} \mathcal {M}_{\phi _{q},n} \le \left( \frac{ s^2 }{n}\right) ^{\frac{q}{2}}. \end{aligned}$$
As a conclusion for the case \(q \le 2 \le p\), (1) in Theorem 1 becomes:
$$\begin{aligned} \int R \mathrm{d}\rho \le \int r_n \mathrm{d}\rho + \frac{\left( D_{\phi _{p}-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}}}{\delta ^{\frac{1}{q}}} \sqrt{\frac{s^2}{n}}. \end{aligned}$$
Without further assumptions, this bound cannot be improved as a function of n (as can be seen in the simplest case where \(\mathrm {card}(\Theta )= 1\), by using the CLT).

Proof of Proposition 4

$$\begin{aligned} \mathcal {M}_{\phi _{q},n}&= \int \mathbb {E}\left( |r_n(\theta )-\mathbb {E}[r_n(\theta )]|^{2\frac{q}{2}} \right) \pi (\mathrm{d}\theta ) \\&\le \left( \int \mathbb {E}\left( |r_n(\theta )-\mathbb {E}[r_n(\theta )]|^{2} \right) \pi (\mathrm{d}\theta ) \right) ^{\frac{q}{2}} \\&\le \left( \int \frac{1}{n}\mathrm{Var}[\ell _1(\theta )] \pi (\mathrm{d}\theta ) \right) ^{\frac{q}{2}} = \left( \frac{ s^2 }{n}\right) ^{\frac{q}{2}}. \end{aligned}$$
\(\square \)
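Proposition 4 can be illustrated by simulation in the simplest case of a single parameter (so that \(\pi \) is a Dirac mass). The sketch below assumes, purely for illustration, losses of the form \(\ell _i = Y_i^2\) with \(Y_i\) following a Student distribution with 5 degrees of freedom, a heavy-tailed law for which the variance of the loss is finite and known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep, df = 200, 4000, 5

# Hypothetical heavy-tailed losses: l_i = Y_i^2 with Y_i ~ Student t(5).
# For nu = 5: E[Y^2] = nu/(nu-2) = 5/3 and E[Y^4] = 3 nu^2/((nu-2)(nu-4)) = 25.
risk = 5.0 / 3.0
s2 = 25.0 - risk**2                      # Var[l_1]; single theta, so pi is a Dirac mass

losses = rng.standard_t(df, size=(n_rep, n)) ** 2
r_n = losses.mean(axis=1)                # one empirical risk per replication

def moment_estimate(q):
    """Monte Carlo estimate of M_{phi_q, n} = E|r_n - R|^q."""
    return float(np.mean(np.abs(r_n - risk) ** q))

def moment_bound(q):
    """Right-hand side of Proposition 4, (s^2 / n)^(q/2), valid for q <= 2."""
    return (s2 / n) ** (q / 2.0)
```

For instance, `moment_estimate(1.0)` can be compared with `moment_bound(1.0)`. Monte Carlo estimates for \(q\) close to 2 are noisier here, since higher moments of the loss are infinite for this distribution.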
As an example, consider the regression setting with quadratic loss, where we use linear predictors: \(X_i\in \mathbb {R}^k\), \(\Theta =\mathbb {R}^k\) and \(f_{\theta }(\cdot )=\left<\cdot ,\theta \right>\). Define a prior \(\pi \) on \(\Theta \) such that
$$\begin{aligned} \tau := \int \Vert \theta \Vert ^4 \pi (\mathrm{d}\theta ) < \infty \end{aligned} \tag{3}$$
and assume that
$$\begin{aligned} \kappa := 8[\mathbb {E}(Y_i^4) + \tau \mathbb {E}(\Vert X_i\Vert ^4)] < \infty . \end{aligned} \tag{4}$$
Then
$$\begin{aligned} \ell _i(\theta ) = (Y_i - \left<\theta ,X_i\right>)^2 \le 2 \left[ Y_i^2 + \Vert \theta \Vert ^2 \Vert X_i\Vert ^2 \right] \end{aligned}$$
and so
$$\begin{aligned} \mathrm{Var}(\ell _i(\theta )) \le \mathbb {E}(\ell _i(\theta )^2) \le 8 \mathbb {E}\left[ Y_i^4 + \Vert \theta \Vert ^4 \Vert X_i\Vert ^4 \right] . \end{aligned}$$
Finally,
$$\begin{aligned} s^2 = \int \mathrm{Var}(\ell _i(\theta )) \pi (\mathrm{d}\theta ) \le \kappa <+\infty . \end{aligned}$$
We obtain the following corollary of (1) in Theorem 1 with \(p=q=2\).

Corollary 1

Fix \(\delta \in (0,1)\). Assume that \(\pi \) is chosen such that (3) holds, and assume that (4) also holds. With probability at least \(1-\delta \) we have for any \(\rho \)
$$\begin{aligned} \int R \mathrm{d}\rho \le \int r_n \mathrm{d}\rho + \sqrt{ \frac{\kappa [1+\chi ^2(\rho ,\pi )]}{n \delta }}. \end{aligned}$$
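As an illustration of how the right-hand side of Corollary 1 could be evaluated, the sketch below assumes a prior supported on finitely many parameter values, so that \(\chi ^2(\rho ,\pi )\) is a finite sum. The function names are hypothetical, and \(\kappa \) is treated as a known constant, which is an assumption since in practice it depends on unknown fourth moments.

```python
import numpy as np

def chi2_divergence(rho, pi):
    """chi^2(rho, pi) = sum_theta rho(theta)^2 / pi(theta) - 1 on a finite set (pi > 0)."""
    rho = np.asarray(rho, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(rho**2 / pi) - 1.0)

def chi2_pac_bayes_bound(emp_risks, rho, pi, kappa, n, delta):
    """int r_n d rho + sqrt(kappa * (1 + chi^2(rho, pi)) / (n * delta))."""
    emp_term = float(np.dot(rho, emp_risks))
    slack = np.sqrt(kappa * (1.0 + chi2_divergence(rho, pi)) / (n * delta))
    return emp_term + float(slack)

# Assumed toy values: two candidate parameters, uniform prior.
pi = np.array([0.5, 0.5])
emp_risks = np.array([1.0, 2.0])

bound_prior = chi2_pac_bayes_bound(emp_risks, pi, pi, kappa=4.0, n=100, delta=0.04)
bound_dirac = chi2_pac_bayes_bound(
    emp_risks, np.array([1.0, 0.0]), pi, kappa=4.0, n=100, delta=0.04
)
```

Concentrating \(\rho \) on the better parameter lowers the empirical term but raises \(\chi ^2(\rho ,\pi )\); the bound makes this trade-off explicit.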

Note that a similar upper bound was proved in Honorio and Jaakkola (2014), yet only in the case of the 0–1 loss (which is bounded). Also, note that the assumption on the moments of order 4 is comparable to the one in Audibert and Catoni (2011) and allows heavy-tailed distributions. Still, in our result, the dependence in \(\delta \) is worse than in Audibert and Catoni (2011). So, we end this subsection with a study of the sub-Gaussian case (which also includes the bounded case). In this case, we can use any \(q\ge 2\) in Theorem 1. The larger q, the better the dependence with respect to \(\delta \).

Definition 3

A random variable U is said to be sub-Gaussian with parameter \(\sigma ^2\) if for any \(\lambda >0\),
$$\begin{aligned} \mathbb {E}\Bigl \{\exp \bigl [ \lambda (U-\mathbb {E}(U)) \bigr ]\Bigr \} \le \exp \left[ \frac{\lambda ^2 \sigma ^2}{2} \right] . \end{aligned}$$

Proposition 5

(Theorem 2.1 page 25 in Boucheron et al. (2013)) When U is sub-Gaussian with parameter \(\sigma ^2\) then for any \(q\ge 2\),
$$\begin{aligned} \mathbb {E}\bigl [(U-\mathbb {E}(U))^q\bigr ] \le 2 \left( \frac{q}{2}\right) ! (2\sigma ^2)^{\frac{q}{2}} \le 2 (q \sigma ^2)^{\frac{q}{2}}. \end{aligned}$$

A straightforward consequence is the following result.

Proposition 6

Assume that, for any \(\theta \), \(\ell _i(\theta )\) is sub-Gaussian with parameter \(\sigma ^2\) (that does not depend on \(\theta \)), then \(\frac{1}{n}\sum _{i=1}^n \ell _i(\theta )\) is sub-Gaussian with parameter \(\sigma ^2/n\) and then, for any \(q\ge 2\),
$$\begin{aligned} \mathcal {M}_{\phi _{q},n} \le 2 \left( \frac{q \sigma ^2}{n} \right) ^{\frac{q}{2}}. \end{aligned}$$
As an illustration, consider the case of a finite parameter space, that is \(\mathrm{card}(\Theta )=K<+\infty \). Following Propositions 2 and 6, we obtain for any \(q\ge 2\) and \(\delta \in (0,1)\), with probability at least \(1-\delta \),
$$\begin{aligned} R(\hat{\theta }_{\mathrm{ERM}}) \le \inf _{\theta \in \Theta }\ \bigl \{r_n(\theta )\bigr \} + \sigma \sqrt{\frac{q}{n}} \left( \frac{2K}{\delta }\right) ^{\frac{1}{q}}. \end{aligned}$$
Optimization with respect to q leads to \(q=2\log (2K/\delta )\) and consequently
$$\begin{aligned} R(\hat{\theta }_{\mathrm{ERM}}) \le \inf _{\theta \in \Theta }\ \bigl \{r_n(\theta )\bigr \}+ \sqrt{\frac{2 \mathrm{e}\sigma ^2 \log \left( \frac{2K}{\delta }\right) }{n}}. \end{aligned}$$
Without any additional assumption on the loss \(\ell \), the rate on the right-hand side is optimal. This is for example proven by Audibert (2009) for the absolute loss.
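The optimization with respect to \(q\) above can be reproduced numerically. Writing \(L=\log (2K/\delta )\), the logarithm of the slack term is \(\frac{1}{2}\log q + L/q\) up to a constant, whose derivative vanishes at \(q=2L\). The values of \(K\), \(\sigma \), \(n\) and \(\delta \) below are assumptions chosen for illustration.

```python
import numpy as np

def erm_bound_slack(q, K, sigma, n, delta):
    """sigma * sqrt(q / n) * (2K / delta)^(1/q), the slack term for a given q >= 2."""
    return sigma * np.sqrt(q / n) * (2.0 * K / delta) ** (1.0 / q)

K, sigma, n, delta = 100, 1.0, 1000, 0.05      # hypothetical values
log_term = np.log(2.0 * K / delta)

q_star = 2.0 * log_term                         # stationary point of the slack in q
optimized = np.sqrt(2.0 * np.e * sigma**2 * log_term / n)

# Brute-force check on a grid of admissible values q >= 2.
q_grid = np.linspace(2.0, 60.0, 58001)
q_best = q_grid[np.argmin(erm_bound_slack(q_grid, K, sigma, n, delta))]
```

At `q_star` the factor \((2K/\delta )^{1/q}\) equals \(\mathrm{e}^{1/2}\), which is where the constant \(\mathrm{e}\) under the square root comes from.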

3.2 Dependent observations

Here we propose to analyze the harder and more realistic case where the observations \((X_i,Y_i)\) are possibly dependent. It includes the autoregressive case where \(X_i=Y_{i-1}\) or \(X_i=(Y_{i-1},\ldots ,Y_{i-p})\). Note that in this setting, different notions of risks were used in the literature. The risk \(R(\theta )\) considered in this paper is the same as the one used in many references given in the introduction, Modha and Masry (1998), Steinwart and Christmann (2009), Alquier and Wintenberger (2012) and Alquier and Li (2012) among others. Alternative notions of risk were proposed, for example by Zimin and Lampert (2015).

We recall the following definition.

Definition 4

The \(\alpha \)-mixing coefficients between two \(\sigma \)-algebras \(\mathcal {F}\) and \(\mathcal {G}\) are defined by
$$\begin{aligned} \alpha (\mathcal {F},\mathcal {G}) = \sup _{A\in \mathcal {F},B\in \mathcal {G}} \bigl |\mathbb {P}(A \cap B) - \mathbb {P}(A) \mathbb {P}(B)\bigr |. \end{aligned}$$

We refer the reader to Doukhan (1994) and Rio (2000) (among others) for more details. We still provide a basic interpretation of this definition. First, when \(\mathcal {F}\) and \(\mathcal {G}\) are independent, then for all \(A\in \mathcal {F}\) and \(B\in \mathcal {G}\), \(\mathbb {P}(A \cap B) = \mathbb {P}(A) \mathbb {P}(B)\) by definition of independence, and so \(\alpha (\mathcal {F},\mathcal {G})=0\). On the other hand, when \(\mathcal {F}=\mathcal {G}\), as soon as these \(\sigma \)-algebras contain an event A with \(\mathbb {P}(A)=1/2\) then \(\alpha (\mathcal {F},\mathcal {G})= |\mathbb {P}(A \cap A) - \mathbb {P}(A) \mathbb {P}(A)| = |1/2-1/4|=1/4\). More generally, \(\alpha (\mathcal {F},\mathcal {G})\) is a measure of the dependence of the information provided by \(\mathcal {F}\) and \(\mathcal {G}\), ranging from 0 (independence) to 1/4 (maximal dependence). We provide another interpretation in terms of covariances.

Proposition 7

(Classical, see Doukhan (1994) for a proof) We have
$$\begin{aligned} \alpha (\mathcal {F},\mathcal {G}) = \sup \Bigl \{\mathrm{Cov}(U,V), 0\le U \le 1, 0\le V\le 1, \\ U \text { is } \mathcal {F}\text {-measurable, }V \text { is } \mathcal {G}\text {-measurable}\Bigr \}. \end{aligned}$$
For short, define
$$\begin{aligned} \alpha _j=\alpha [\sigma (X_0,Y_0),\sigma (X_j,Y_j)] \end{aligned}$$
where we recall that for any random variable Z, \(\sigma (Z)\) is the \(\sigma \)-algebra generated by Z. The idea is that, when the future of the series is strongly dependent on the past, \(\alpha _j\) will remain constant, or decay very slowly. On the other hand, when the near future is almost independent of the past, then the \(\alpha _j\) decay very fast to 0 [examples of both kinds can be found in Doukhan (1994) and Rio (2000)]. And, indeed, we will see below that when the rate of convergence of the \(\alpha _j\)’s to 0 is fast enough it is possible to derive results rather similar to the ones in the independent case.
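For random variables taking finitely many values, the supremum in Definition 4 is a maximum over finitely many pairs of events and can be computed by enumeration. The sketch below does so from a joint probability table; the function names are hypothetical and the approach is only practical for very small state spaces, since the number of events grows exponentially.

```python
from itertools import chain, combinations
import numpy as np

def nonempty_subsets(n):
    """All non-empty subsets of {0, ..., n-1}, as tuples of indices."""
    return chain.from_iterable(combinations(range(n), k) for k in range(1, n + 1))

def alpha_coefficient(joint):
    """alpha(sigma(U), sigma(V)) for discrete U, V with joint pmf joint[u, v].

    Events of sigma(U) are {U in A} for subsets A of the values of U (same for V).
    Empty events contribute 0 and are skipped.
    """
    joint = np.asarray(joint, dtype=float)
    p_u, p_v = joint.sum(axis=1), joint.sum(axis=0)
    best = 0.0
    for A in nonempty_subsets(joint.shape[0]):
        for B in nonempty_subsets(joint.shape[1]):
            p_ab = joint[np.ix_(A, B)].sum()              # P(U in A, V in B)
            gap = abs(p_ab - p_u[list(A)].sum() * p_v[list(B)].sum())
            best = max(best, gap)
    return best

# Independent pair: the joint pmf is the product of its marginals.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
# Fully dependent pair: V = U, with P(U = 0) = 1/2.
identical = np.diag([0.5, 0.5])

alpha_indep = alpha_coefficient(independent)
alpha_ident = alpha_coefficient(identical)
```

The two examples reproduce the extreme cases discussed after Definition 4: the coefficient vanishes under independence and reaches 1/4 when the two variables coincide and take a value with probability 1/2.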

Let us first consider the bounded case.

Proposition 8

Assume that \(0\le \ell \le 1\). Assume that \((X_i,Y_i)_{i\in \mathbb {Z}}\) is a stationary process, and that it satisfies \( \sum _{j\in \mathbb {Z}} \alpha _j < \infty \). Then
$$\begin{aligned} \mathcal {M}_{\phi _{2},n} \le \frac{1}{n}\sum _{j\in \mathbb {Z}} \alpha _{j} . \end{aligned}$$
Examples of processes satisfying this assumption are discussed in Doukhan (1994) and Rio (2000). For example, if the \((X_i,Y_i)\)’s form a geometrically ergodic Markov chain then there exist some \(c_1, c_2>0\) such that \(\alpha _j \le c_1 \mathrm{e}^{-c_2 |j|}\). Thus
$$\begin{aligned} \mathcal {M}_{\phi _{2},n}\le \frac{1}{n} \frac{2c_1}{1-\mathrm{e}^{-c_2}} . \end{aligned}$$
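The constant in the last display comes from summing a two-sided geometric series: \(\sum _{j\in \mathbb {Z}} c_1 \mathrm{e}^{-c_2|j|} = c_1 (1+\mathrm{e}^{-c_2})/(1-\mathrm{e}^{-c_2}) \le 2c_1/(1-\mathrm{e}^{-c_2})\). A short numerical check, with assumed values of \(c_1\), \(c_2\) and \(n\), could look as follows.

```python
import numpy as np

def geometric_alpha_sum(c1, c2):
    """Exact value of sum_{j in Z} c1 * exp(-c2 * |j|) for c2 > 0."""
    r = np.exp(-c2)
    return c1 * (1.0 + r) / (1.0 - r)

def moment_bound_geometric(c1, c2, n):
    """The displayed bound (1/n) * 2 c1 / (1 - exp(-c2))."""
    return 2.0 * c1 / (n * (1.0 - np.exp(-c2)))

c1, c2, n = 0.25, 0.3, 500                      # hypothetical mixing constants
j = np.arange(-2000, 2001)
truncated = float(np.sum(c1 * np.exp(-c2 * np.abs(j))))   # truncated series
```

The truncated sum matches the closed form, and dividing it by \(n\) gives a value below `moment_bound_geometric(c1, c2, n)`, as expected.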

Proof of Proposition 8

We have:
$$\begin{aligned} \mathbb {E} \left[ \left( \frac{1}{n}\sum _{i=1}^n \ell _{i}(\theta ) - \mathbb {E}[\ell _{i}(\theta )]\right) ^2\right]&= \frac{1}{n^2} \sum _{i=1}^n \sum _{j=1}^n \mathrm{Cov}[\ell _{i}(\theta ),\ell _{j}(\theta )] \\&\le \frac{1}{n^2} \sum _{i=1}^n \sum _{j\in \mathbb {Z}} \alpha _{j-i} = \frac{\sum _{j\in \mathbb {Z}} \alpha _{j}}{n} \end{aligned}$$
that does not depend on \(\theta \), and so
$$\begin{aligned} \mathcal {M}_{\phi _{2},n} = \int \mathbb {E} \left[ \left( \frac{1}{n}\sum _{i=1}^n \ell _{i}(\theta ) - R(\theta )\right) ^2\right] \pi (\mathrm{d}\theta ) \le \frac{\sum _{j\in \mathbb {Z}} \alpha _{j}}{n} . \end{aligned}$$
\(\square \)

Remark 1

Other assumptions than \(\alpha \)-mixing can be used. Actually, we see from the proof that the only requirement to get a bound on \(\mathcal {M}_{\phi _{2},n}\) is to control the covariance \(\mathrm{Cov}[\ell _{i}(\theta ),\ell _{j}(\theta )]\); \(\alpha \)-mixing is very stringent as it imposes that we can control this for any function \(\ell _i(\theta )\). In the case of a Lipschitz loss, we could actually consider more general conditions like the weak dependence conditions in Dedecker et al. (2007) and Alquier and Wintenberger (2012).

We now turn to the unbounded case.

Proposition 9

Assume that \((X_i,Y_i)_{i\in \mathbb {Z}}\) is a stationary process. Let \(r \ge 1\) and \(s \ge 2\) be any numbers with \(1/r+2/s=1\) and assume that
$$\begin{aligned} \sum _{j\in \mathbb {Z}} \alpha _j^{1/r} < \infty \end{aligned}$$
and
$$\begin{aligned} \int \left\{ \mathbb {E}\left[ \ell _{i}^s(\theta )\right] \right\} ^{\frac{2}{s}} \pi (\mathrm{d}\theta ) < \infty . \end{aligned}$$
Then
$$\begin{aligned} \mathcal {M}_{\phi _{2},n} \le \frac{1}{n} \left( \int \left\{ \mathbb {E}\left[ \ell _{i}^s(\theta )\right] \right\} ^{\frac{2}{s}} \pi (\mathrm{d}\theta ) \right) \left( \sum _{j\in \mathbb {Z}} \alpha _{j}^{\frac{1}{r}}\right) . \end{aligned}$$

Proof of Proposition 9

The proof relies on the following property.

Proposition 10

(Doukhan 1994) For any random variables U and V, resp. \(\mathcal {F}\) and \(\mathcal {G}\)-measurable, we have
$$\begin{aligned} |\mathrm{Cov}(U,V)| \le 8 \alpha ^{\frac{1}{r}}(\mathcal {F},\mathcal {G}) \Vert U\Vert _s \Vert V\Vert _t \end{aligned}$$
where \(1/r + 1/s + 1/t = 1\).
We use this with \(U=\ell _{i}(\theta ) \), \(V=\ell _{j}(\theta )\) and \(s=t\). Then
$$\begin{aligned} \mathbb {E} \Biggl [ \Biggl (\frac{1}{n}\sum _{i=1}^n \ell _{i}(\theta ) - \mathbb {E}[\ell _{i}(\theta )]\Biggr )^2\Biggr ]&= \frac{1}{n^2} \sum _{i=1}^n \sum _{j=1}^n \mathrm{Cov}[\ell _{i}(\theta ),\ell _{j}(\theta )] \\&\le \frac{8}{n^2} \sum _{i=1}^n \sum _{j\in \mathbb {Z}} \alpha _{j-i}^{\frac{1}{r}} \Vert \ell _{i}(\theta )\Vert _s \Vert \ell _{j}(\theta )\Vert _s \\&\le \frac{ 8\left\{ \mathbb {E}\left[ \ell _{i}^s(\theta )\right] \right\} ^{\frac{2}{s}} \sum _{j\in \mathbb {Z}} \alpha _{j}^{\frac{1}{r}}}{n}. \end{aligned}$$
\(\square \)
As an example, consider auto-regression with quadratic loss, where we use linear predictors: \(X_i=(1,Y_{i-1})\in \mathbb {R}^2\), \(\Theta =\mathbb {R}^2\) and \(f_{\theta }(\cdot )=\left<\theta ,\cdot \right>\). Then
$$\begin{aligned} |\ell _{i}(\theta )|^3 \le 32 [Y_i^6 + 4\Vert \theta \Vert ^6(1+ Y_{i-1}^6) ] \end{aligned}$$
and so
$$\begin{aligned} \mathbb {E}\left( |\ell _{i}(\theta )|^3\right) \le 32(1+4\Vert \theta \Vert ^6 )\mathbb {E}\left( Y_i^6\right) . \end{aligned}$$
Taking \(s=r=3\) in Proposition 9 leads to the following result.

Corollary 2

Fix \(\delta \in (0,1)\). Assume that \(\pi \) is chosen such that
$$\begin{aligned} \int \Vert \theta \Vert ^6 \pi (\mathrm{d}\theta ) < + \infty , \end{aligned}$$
\(\mathbb {E}\left( Y_i^6\right) <\infty \) and \(\sum _{j\in \mathbb {Z}} \alpha _j^{\frac{1}{3}} < +\infty \). Put
$$\begin{aligned} \nu = 32 \mathbb {E}\left( Y_i^6\right) ^{\frac{2}{3}} \sum _{j\in \mathbb {Z}} \alpha _j^{\frac{1}{3}} \left( 1+ 4 \int \Vert \theta \Vert ^6 \pi (\mathrm{d}\theta )\right) . \end{aligned}$$
With probability at least \(1-\delta \) we have for any \(\rho \)
$$\begin{aligned} \int R \mathrm{d}\rho \le \int r_n \mathrm{d}\rho + \sqrt{ \frac{\nu [1+\chi ^2(\rho ,\pi )]}{n \delta }}. \end{aligned}$$

This is, to the best of our knowledge, the first PAC(-Bayesian) bound in the case of a time series without any boundedness or exponential moment assumption.

4 Optimal aggregation distribution and oracle inequalities

We have now seen how to control the different terms in our PAC-Bayesian inequality (Theorem 1). We now come back to this result to derive which predictor minimizes the bound, and which statistical guarantees can be achieved by this predictor.

We start with a reminder of two consequences of Theorem 1: for \(p>1\), and \(q=p/(p-1)\), with probability at least \(1-\delta \) we have for any \(\rho \)
$$\begin{aligned} \int R \mathrm{d}\rho \le \int r_n \mathrm{d}\rho + \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}} \end{aligned} \tag{5}$$
and
$$\begin{aligned} \int r_n \mathrm{d}\rho \le \int R \mathrm{d}\rho + \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}}. \end{aligned} \tag{6}$$
In this section we focus on the minimizer \(\hat{\rho }_n\) of the right-hand side of (5), and on its statistical properties.

Definition 5

We define \(\overline{r}_n=\overline{r}_n(\delta ,p)\) as
$$\begin{aligned} \overline{r}_n = \min \left\{ u\in \mathbb {R}, \int \left[ u -r_n(\theta )\right] _+^{q} \pi (\mathrm{d}\theta ) = \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right\} . \end{aligned}$$
Note that such a minimum always exists as the integral is a continuous function of u, is equal to 0 when \(u=0\) and tends to \(\infty \) when \(u\rightarrow \infty \). We then define
$$\begin{aligned} \frac{\mathrm{d}\hat{\rho }_n}{\mathrm{d}\pi }(\theta ) = \frac{ \left[ \overline{r}_n -r_n(\theta )\right] _+^{\frac{1}{p-1}} }{ \int \left[ \overline{r}_n -r_n\right] _+^{\frac{1}{p-1}} \mathrm{d}\pi }. \end{aligned} \tag{7}$$

The following proposition states that \(\hat{\rho }_n\) is actually the minimizer of the right-hand side in inequality (5).

Proposition 11

Under the assumptions of Theorem 1, with probability at least \(1-\delta \),
$$\begin{aligned} \overline{r}_n&= \int r_n \mathrm{d}\hat{\rho }_n + \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\hat{\rho }_n,\pi ) +1 \right) ^{\frac{1}{p}} \\&= \min _{\rho } \left\{ \int r_n \mathrm{d}\rho + \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}} \right\} \end{aligned}$$
where the minimum is taken over all probability distributions \(\rho \) over \(\Theta \).

Proof of Proposition 11

For any \(\rho \) we have
$$\begin{aligned} \overline{r}_n - \int r_n \mathrm{d}\rho&= \int \left[ \overline{r}_n - r_n \right] \mathrm{d}\rho \\&= \int \left[ \overline{r}_n - r_n \right] _+ \mathrm{d}\rho - \int \left[ \overline{r}_n - r_n \right] _- \mathrm{d}\rho \\&\le \int \left[ \overline{r}_n - r_n \right] _+ \mathrm{d}\rho = \int \left[ \overline{r}_n - r_n \right] _+ \frac{\mathrm{d}\rho }{\mathrm{d}\pi } \mathrm{d}\pi \\&\le \left( \int \left[ \overline{r}_n - r_n \right] _+^q \mathrm{d}\pi \right) ^{\frac{1}{q}} \left( \int \left( \frac{\mathrm{d}\rho }{\mathrm{d}\pi }\right) ^p \mathrm{d}\pi \right) ^{\frac{1}{p}} \\&\le \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}} \end{aligned}$$
where we used Hölder’s inequality and then the definition of \(\overline{r}_n\) in the last line. Moreover, we can check that the two inequalities above become equalities when \(\rho =\hat{\rho }_n\): from (7),
$$\begin{aligned} \overline{r}_n - \int r_n \mathrm{d}\hat{\rho }_n&= \int \left[ \overline{r}_n - r_n \right] \mathrm{d}\hat{\rho }_n = \int \left[ \overline{r}_n - r_n \right] _+ \mathrm{d}\hat{\rho }_n \\&= \frac{ \int \left[ \overline{r}_n - r_n \right] _+ \left[ \overline{r}_n -r_n\right] _+^{\frac{1}{p-1}} \mathrm{d}\pi }{ \int \left[ \overline{r}_n -r_n\right] _+^{\frac{1}{p-1}} \mathrm{d}\pi } = \frac{ \int \left[ \overline{r}_n -r_n\right] _+^{q} \mathrm{d}\pi }{ \int \left[ \overline{r}_n -r_n\right] _+^{\frac{1}{p-1}} \mathrm{d}\pi } \\&= \frac{ \left( \int \left[ \overline{r}_n -r_n\right] _+^{q} \mathrm{d}\pi \right) ^{\frac{1}{p}+\frac{1}{q}} }{ \int \left[ \overline{r}_n -r_n\right] _+^{\frac{1}{p-1}} \mathrm{d}\pi } = \left( \int \left[ \overline{r}_n - r_n \right] _+^q \mathrm{d}\pi \right) ^{\frac{1}{q}} \frac{ \left( \int \left[ \overline{r}_n -r_n\right] _+^{\frac{p}{p-1}} \mathrm{d}\pi \right) ^{\frac{1}{p}}}{\int \left[ \overline{r}_n -r_n\right] _+^{\frac{1}{p-1}} \mathrm{d}\pi } \\&= \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( \int \left( \frac{\mathrm{d}\hat{\rho }_n}{\mathrm{d}\pi }\right) ^p \mathrm{d}\pi \right) ^{\frac{1}{p}} = \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\hat{\rho }_n,\pi ) +1 \right) ^{\frac{1}{p}}. \end{aligned}$$
\(\square \)
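The statement of Proposition 11 can be checked numerically on a small case. The sketch below assumes a hypothetical four-atom prior with \(p=q=2\) and a hypothetical value 0.01 for \(\mathcal {M}_{\phi _{q},n}/\delta \); for these numbers \(\overline{r}_n\) has a closed form, which is hard-coded. It evaluates the right-hand side of the bound at \(\hat{\rho }_n\) and at randomly drawn competitors \(\rho \).

```python
import random


def objective(rho, risks, prior, level, p, q):
    """int r_n d rho + level**(1/q) * (int (d rho / d pi)**p d pi)**(1/p) on a finite support."""
    emp_risk = sum(r * m for r, m in zip(risks, rho))
    moment = sum(w * (m / w) ** p for m, w in zip(rho, prior))
    return emp_risk + level ** (1.0 / q) * moment ** (1.0 / p)


# Hypothetical toy data (not from the paper).
risks = [0.1, 0.2, 0.5, 0.9]
prior = [0.25, 0.25, 0.25, 0.25]
level, p, q = 0.01, 2.0, 2.0   # level stands for M_{phi_q,n} / delta

# Here r_bar solves 0.25 * ((u - 0.1)**2 + (u - 0.2)**2) = 0.01 with 0.2 <= u <= 0.5.
r_bar = 0.15 + 0.0175 ** 0.5

unnorm = [w * max(r_bar - r, 0.0) ** (1.0 / (p - 1.0)) for r, w in zip(risks, prior)]
rho_hat = [x / sum(unnorm) for x in unnorm]
value_at_rho_hat = objective(rho_hat, risks, prior, level, p, q)

# Draw 1000 competitors rho at random on the same support.
rng = random.Random(0)
competitors = []
for _ in range(1000):
    raw = [rng.random() for _ in risks]
    total = sum(raw)
    competitors.append(objective([x / total for x in raw], risks, prior, level, p, q))

print(abs(value_at_rho_hat - r_bar) < 1e-9)          # first line of Proposition 11
print(min(competitors) >= value_at_rho_hat - 1e-12)  # no competitor does better
```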

A direct consequence of (5) and (6) is the following result, which provides theoretical guarantees for \(\hat{\rho }_n\).

Proposition 12

Under the assumptions of Theorem 1, with probability at least \(1-\delta \),
$$\begin{aligned} \int R \mathrm{d}\hat{\rho }_n \le \overline{r}_n \le \inf _{\rho } \left\{ \int R \mathrm{d}\rho + 2 \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}} \right\} . \end{aligned}$$

Proof of Proposition 12

First, (5) gives:
$$\begin{aligned} \int R \mathrm{d}\hat{\rho }_n&\quad \le \int r_n \mathrm{d}\hat{\rho }_n + \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\hat{\rho }_n,\pi ) +1 \right) ^{\frac{1}{p}} \nonumber \\&\quad = \inf _{\rho }\left\{ \int r_n \mathrm{d}\rho + \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta }\right) ^{\frac{1}{q}} \left( D_{\phi _p-1}(\rho ,\pi ) +1 \right) ^{\frac{1}{p}} \right\} \end{aligned}$$
by definition of \(\hat{\rho }_n\), and Proposition 11 shows that the right-hand side is \(\overline{r}_n\). Plugging (6) into (9) gives the desired result. \(\square \)

Example 1

As an application of Proposition 12, we come back to the setting of a possibly heavy-tailed time series. More precisely, we work under the assumptions of Corollary 2. In particular, \(X_i=(1,Y_{i-1})\in \mathbb {R}^2\), \(f_{\theta }(\cdot )=\left<\theta ,\cdot \right>\) and \(p=q=2\). For the sake of simplicity, assume that the parameter space is \(\Theta = [-1,1]^2\), and let \(\pi \) be the uniform distribution on \([-1,1]^2\). The empirical bound states that, with probability at least \(1-\delta \), for any \(\rho \),
$$\begin{aligned} \int R \mathrm{d}\rho \le \int r_n \mathrm{d}\rho + \sqrt{ \frac{\nu [1+\chi ^2(\rho ,\pi )]}{n \delta }} \end{aligned}$$
where we recall that
$$\begin{aligned} \nu&= 32 \mathbb {E}\left( Y_i^6\right) ^{\frac{2}{3}} \sum _{j\in \mathbb {Z}} \alpha _j^{\frac{1}{3}} \left( 1+ 4 \int \Vert \theta \Vert ^6 \pi (\mathrm{d}\theta )\right) \\&\le 1056 \mathbb {E}\left( Y_i^6\right) ^{\frac{2}{3}} \sum _{j\in \mathbb {Z}} \alpha _j^{\frac{1}{3}}. \end{aligned}$$
In this context the minimizer of the right-hand side is
$$\begin{aligned} \frac{\mathrm{d}\hat{\rho }_n}{\mathrm{d}\pi }(\theta ) = \frac{ \left[ \overline{r}_n -r_n(\theta )\right] _+ }{ \int \left[ \overline{r}_n -r_n\right] _+ \mathrm{d}\pi } \end{aligned}$$
where
$$\begin{aligned} \overline{r}_n = \min \left\{ u\in \mathbb {R}, \int \left[ u -r_n(\theta )\right] _+^{2} \pi (\mathrm{d}\theta ) = \frac{\nu }{n \delta } \right\} . \end{aligned}$$
The application of Proposition 12 leads to, with probability at least \(1-\delta \),
$$\begin{aligned} \int R \mathrm{d}\hat{\rho }_n \le \inf _{\rho } \left\{ \int R \mathrm{d}\rho + 2 \sqrt{ \frac{\nu [1+\chi ^2(\rho ,\pi )]}{n \delta }} \right\} . \end{aligned}$$
Note that it is possible to derive an oracle inequality from this. Let us denote by \(\bar{\theta }=(\bar{\theta }_1,\bar{\theta }_2)\) the minimizer of R. Consider the following posteriors, for \(1\le i,j \le N\):
$$\begin{aligned} \rho _{(i,j),N} \text { is uniform on } \left[ -1 + \frac{2(i-1)}{N},-1 + \frac{2i}{N} \right] \times \left[ -1 + \frac{2(j-1)}{N},-1 + \frac{2j}{N} \right] . \end{aligned}$$
For N fixed, there is always a pair \((i,j)\) such that \(\bar{\theta }\) belongs to the support of \( \rho _{(i,j),N}\). Elementary calculus shows that, for any \(\theta \) in the support of \( \rho _{(i,j),N}\),
$$\begin{aligned} R(\theta ) - R(\bar{\theta }) \le \frac{2}{N} \left( 1+4 \mathbb {E}(|Y_i|) + 3 \mathbb {E}(Y_i^2) \right) =: \frac{\nu '}{N} . \end{aligned}$$
Moreover,
$$\begin{aligned} 1+\chi ^2(\rho _{(i,j),N},\pi ) = \frac{N^2}{2} . \end{aligned}$$
So the bound becomes
$$\begin{aligned} \int R \mathrm{d}\hat{\rho }_n \le \inf _{\theta \in [-1,1]^2} R(\theta ) + \inf _{N \in \mathbb {N}^*} \left\{ \frac{\nu '}{N} + N \sqrt{\frac{2 \nu }{ n \delta }} \right\} \end{aligned}$$
and in particular, the choice \(N=\left\lceil \sqrt{\nu '\sqrt{ n \delta /(2\nu )}} \right\rceil \) leads to
$$\begin{aligned} \int R \mathrm{d}\hat{\rho }_n \le \inf _{\theta \in [-1,1]^2} R(\theta ) + 3 \left( \frac{2\nu \nu '^2}{n\delta } \right) ^{\frac{1}{4}} \end{aligned}$$
at least for n large enough to ensure \(\left( \frac{n\delta }{2\nu }\right) ^{\frac{1}{4}} \ge \sqrt{\nu '}\).
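The choice of N above can be illustrated numerically. The sketch below is only an illustration: the values of \(\nu \), \(\nu '\), n and \(\delta \) are hypothetical, and the function names are not from the paper. It evaluates the remainder term \(\nu '/N + N\sqrt{2\nu /(n\delta )}\) at the suggested N and compares it with the closed-form bound \(3\left( 2\nu \nu '^2/(n\delta )\right) ^{1/4}\).

```python
import math


def remainder(N, nu, nu_prime, n, delta):
    """Remainder term nu'/N + N * sqrt(2 nu / (n delta)) for a grid of size N."""
    return nu_prime / N + N * math.sqrt(2.0 * nu / (n * delta))


def suggested_N(nu, nu_prime, n, delta):
    """N = ceil( sqrt( nu' * sqrt(n delta / (2 nu)) ) ), as in the text."""
    return math.ceil(math.sqrt(nu_prime * math.sqrt(n * delta / (2.0 * nu))))


def closed_form(nu, nu_prime, n, delta):
    """Closed-form bound 3 * (2 nu nu'^2 / (n delta))**(1/4)."""
    return 3.0 * (2.0 * nu * nu_prime ** 2 / (n * delta)) ** 0.25


# Hypothetical constants, chosen only for illustration.
nu, nu_prime, n, delta = 50.0, 4.0, 10 ** 6, 0.05

# Sample-size condition stated in the text.
condition_holds = (n * delta / (2.0 * nu)) ** 0.25 >= math.sqrt(nu_prime)

N = suggested_N(nu, nu_prime, n, delta)
print(condition_holds, N)
print(remainder(N, nu, nu_prime, n, delta), closed_form(nu, nu_prime, n, delta))
```

The rounding up to an integer N is what turns the constant 2 of the unconstrained minimization into the constant 3 of the final bound.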

The last example shows that it is possible in some cases to deduce from Proposition 12 an oracle inequality, that is, a comparison to the performance of the optimal parameter. The end of this section is devoted to a systematic derivation of such oracle inequalities, using the complexity parameter introduced in Sect. 3, first in its empirical version, and then in its theoretical form.

Theorem 2

Under the assumptions of Theorem 1 together with Assumption 1, with probability at least \(1-\delta \),
$$\begin{aligned} \int R \mathrm{d}\hat{\rho }_n \le \overline{r}_n \le \inf _{\theta \in \Theta }\ \bigl \{ r_n(\theta ) \bigr \} + 2 \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta } \right) ^{\frac{1}{q+d}}. \end{aligned}$$

Proof of Theorem 2

Put
$$\begin{aligned} \gamma = \overline{r}_n - \inf _{\theta \in \Theta }\ \bigl \{ r_n(\theta )\bigr \}. \end{aligned}$$
Note that \(\gamma \ge 0\). Then:
$$\begin{aligned} \left( \frac{\gamma }{2}\right) ^{q} \pi \left\{ r_n(\theta ) \le \frac{\gamma }{2} + \inf \ r_n \right\} \le \underbrace{\int \left[ \overline{r}_n -r_n\right] _+^{q} \mathrm{d}\pi }_{= \frac{\mathcal {M}_{\phi _{q},n} }{\delta }} \le \gamma ^{q} \pi \bigl \{r_n(\theta ) \le \gamma + \inf \ r_n \bigr \}. \end{aligned}$$
In particular,
$$\begin{aligned} \left( \frac{\gamma }{2}\right) ^{q} \pi \left\{ r_n(\theta ) \le \frac{\gamma }{2} + \inf \ r_n \right\} \le \frac{\mathcal {M}_{\phi _{q},n} }{\delta } \end{aligned}$$
and, using Assumption 1,
$$\begin{aligned} \left( \frac{\gamma }{2}\right) ^{q} \left( \frac{\gamma }{2}\right) ^d \le \frac{\mathcal {M}_{\phi _{q},n} }{\delta } \end{aligned}$$
which yields:
$$\begin{aligned} \gamma \le 2 \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta } \right) ^{\frac{1}{q+d}}. \end{aligned}$$
\(\square \)
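As a sanity check of Theorem 2, consider a toy case that is not taken from the paper: assume \(\Theta =[0,1]\), \(\pi \) uniform on \([0,1]\) and \(r_n(\theta )=\theta \), and assume that Assumption 1 reads \(\pi \{ r_n(\theta ) \le \inf r_n + \gamma \} \ge \gamma ^d\), which is how it is used in the proof above. Here \(\inf r_n = 0\) and \(\pi \{\theta \le \gamma \} = \gamma \) for \(\gamma \in (0,1]\), so that \(d=1\) on this range. Writing \(c = \mathcal {M}_{\phi _{q},n}/\delta \) and assuming \((q+1)c\le 1\),
$$\begin{aligned} \int _0^1 \left[ u-\theta \right] _+^{q} \mathrm{d}\theta = \frac{u^{q+1}}{q+1} \quad (0\le u\le 1), \qquad \text {so}\qquad \gamma = \overline{r}_n = \left( (q+1)\, c\right) ^{\frac{1}{q+1}} \le 2\, c^{\frac{1}{q+1}}, \end{aligned}$$
where the last inequality holds because \(q+1\le 2^{q+1}\). For \(q=2\) the exact value is \(3^{1/3} c^{1/3}\approx 1.44\, c^{1/3}\), to be compared with the bound \(2c^{1/3}\) given by the theorem.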

We can also perform an explicit minimization of the oracle-type bound (8), which leads to a variant of Theorem 2 under a non-empirical complexity assumption.

Definition 6

Put
$$\begin{aligned} \overline{R}_n = \min \left\{ u \in \mathbb {R}: \int \left[ u -R(\theta )\right] _+^{q} \pi (\mathrm{d}\theta ) = \frac{2^q \mathcal {M}_{\phi _{q},n} }{\delta } \right\} . \end{aligned}$$

Assumption 2

There exists \(d>0\) such that, for any \(\gamma >0\),
$$\begin{aligned} \pi \Bigl \{ \theta \in \Theta : R(\theta ) \le \inf _{\theta '\in \Theta }\ \bigl \{ R(\theta ') \bigr \} + \gamma \Bigr \} \ge \gamma ^d . \end{aligned}$$

Theorem 3

Under the assumptions of Theorem 1 together with Assumption 2, with probability at least \(1-\delta \),
$$\begin{aligned} \int R \mathrm{d}\hat{\rho }_n \le \overline{R}_n \le \inf _{\theta \in \Theta }\ R(\theta ) + 2^{\frac{q}{q+d}} \left( \frac{\mathcal {M}_{\phi _{q},n} }{\delta } \right) ^{\frac{1}{q+d}}. \end{aligned}$$

The proof is a direct adaptation of the proofs of Proposition 11 and Theorem 2.

5 Discussion and perspectives

We proposed a new type of PAC-Bayesian bound, which makes use of Csiszár’s f-divergence to generalize the Kullback-Leibler divergence. This is an extension of the results in Bégin et al. (2016). In favourable contexts, there exist sophisticated approaches to get better bounds, as discussed in the introduction. However, the major contribution of our work is that our bounds hold in hostile situations where no PAC bounds at all were available, such as heavy-tailed time series. We plan to study the connections between our PAC-Bayesian bounds and the aforementioned approaches by Mendelson (2015) and Grünwald and Mehta (2016) in future work.


  1. PAC stands for Probably Approximately Correct.



We would like to thank Pascal Germain for fruitful discussions, along with two anonymous Referees and the Editor for insightful comments. This author gratefully acknowledges financial support from the research programme New Challenges for New Data from LCL and GENES, hosted by the Fondation du Risque, from Labex ECODEC (ANR-11-LABEX-0047) and from Labex CEMPI (ANR-11-LABX-0007-01).


  1. Agarwal, A., & Duchi, J. C. (2013). The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1), 573–587.
  2. Alquier, P., & Li, X. (2012). Prediction of quantiles by statistical learning and application to GDP forecasting. In 15th International Conference on Discovery Science 2012 (pp. 23–36). Springer.
  3. Alquier, P., & Wintenberger, O. (2012). Model selection for weakly dependent time series forecasting. Bernoulli, 18(3), 883–913.
  4. Alquier, P., Li, X., & Wintenberger, O. (2013). Prediction of time series by statistical learning: General losses and fast rates. Dependence Modeling, 1, 65–93.
  5. Alquier, P., Ridgway, J., & Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(239), 1–41.
  6. Audibert, J.-Y. (2009). Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4), 1591–1646.
  7. Audibert, J.-Y., & Catoni, O. (2011). Robust linear least squares regression. The Annals of Statistics, 39, 2766–2794.
  8. Bégin, L., Germain, P., Laviolette, F., & Roy, J.-F. (2016). PAC-Bayesian bounds based on the Rényi divergence. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (pp. 435–444).
  9. Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford: Oxford University Press.
  10. Catoni, O. (2004). Statistical learning theory and stochastic optimization. In J. Picard (Ed.), Saint-Flour Summer School on Probability Theory 2001. Lecture notes in mathematics. Berlin: Springer.
  11. Catoni, O. (2007). PAC-Bayesian supervised classification: The thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes–Monograph Series (Vol. 56). Beachwood, OH: Institute of Mathematical Statistics.
  12. Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques (Vol. 48, pp. 1148–1185). Paris: Institut Henri Poincaré.
  13. Catoni, O. (2016). PAC-Bayesian bounds for the Gram matrix and least squares regression with a random design. arXiv:1603.05229.
  14. Csiszár, I., & Shields, P. C. (2004). Information theory and statistics: A tutorial. Breda: Now Publishers Inc.
  15. Dedecker, J., Doukhan, P., Lang, G., Rafael, L. R. J., Louhichi, S., & Prieur, C. (2007). Weak dependence. In Weak dependence: With examples and applications (pp. 9–20). Berlin: Springer.
  16. Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Berlin: Springer.
  17. Devroye, L., Lerasle, M., Lugosi, G., & Oliveira, R. I. (2015). Sub-Gaussian mean estimators. arXiv:1509.05845.
  18. Dinh, V. C., Ho, L. S., Nguyen, B., & Nguyen, D. (2016). Fast learning rates with heavy-tailed losses. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 29, pp. 505–513). Curran Associates, Inc.
  19. Doukhan, P. (1994). Mixing: Properties and examples. Lecture notes in statistics. New York: Springer.
  20. Giraud, C., Roueff, F., & Sanchez-Pèrez, A. (2015). Aggregation of predictors for nonstationary sub-linear processes and online adaptive forecasting of time varying autoregressive processes. The Annals of Statistics, 43(6), 2412–2450.
  21. Giulini, I. (2015). PAC-Bayesian bounds for principal component analysis in Hilbert spaces. arXiv:1511.06263.
  22. Grünwald, P. D., & Mehta, N. A. (2016). Fast rates with unbounded losses. arXiv:1605.00252.
  23. Guedj, B., & Alquier, P. (2013). PAC-Bayesian estimation and prediction in sparse additive models. Electronic Journal of Statistics, 7, 264–291.
  24. Guillaume, L., & Matthieu, L. (2017). Learning from mom’s principles. arXiv:1701.01961.
  25. Honorio, J., & Jaakkola, T. (2014). Tight bounds for the expected risk of linear classifiers and PAC-Bayes finite-sample guarantees. In Proceedings of the 17th international conference on artificial intelligence and statistics (pp. 384–392).
  26. Hsu, D., & Sabato, S. (2016). Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18), 1–40.
  27. Kontorovich, L. A., Ramanan, K., et al. (2008). Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability, 36(6), 2126–2158.
  28. Kuznetsov, V., & Mohri, M. (2014). Generalization bounds for time series prediction with non-stationary processes. In International conference on algorithmic learning theory (pp. 260–274). Springer.
  29. Langford, J., & Shawe-Taylor, J. (2002). PAC-Bayes & margins. In Proceedings of the 15th international conference on neural information processing systems (pp. 439–446). MIT Press.
  30. Lecué, G., & Mendelson, S. (2016). Regularization and the small-ball method I: Sparse recovery. arXiv:1601.05584.
  31. London, B., Huang, B., & Getoor, L. (2016). Stability and generalization in structured prediction. Journal of Machine Learning Research, 17(222), 1–52.
  32. Lugosi, G., & Mendelson, S. (2016). Risk minimization by median-of-means tournaments. arXiv:1608.00757.
  33. Lugosi, G., & Mendelson, S. (2017). Regularization, sparse recovery, and median-of-means tournaments. arXiv:1701.04112.
  34. McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the eleventh annual conference on computational learning theory (pp. 230–234). New York: ACM.
  35. McAllester, D. A. (1999). PAC-Bayesian model averaging. In Proceedings of the twelfth annual conference on computational learning theory (pp. 164–170). ACM.
  36. Mendelson, S. (2015). Learning without concentration. Journal of the ACM, 62(3), 21:1–21:25. ISSN: 0004-5411.
  37. Minsker, S. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4), 2308–2335.
  38. Modha, D. S., & Masry, E. (1998). Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory, 44(1), 117–133.
  39. Mohri, M., & Rostamizadeh, A. (2010). Stability bounds for stationary \(\varphi \)-mixing and \(\beta \)-mixing processes. Journal of Machine Learning Research, 11, 789–814.
  40. Oliveira, R. I. (2013). The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv:1312.2903. (To appear in Probability Theory and Related Fields).
  41. Oneto, L., Anguita, D., & Ridella, S. (2016). PAC-Bayesian analysis of distribution dependent priors: Tighter risk bounds and stability analysis. Pattern Recognition Letters, 80, 200–207.
  42. Ralaivola, L., Szafranski, M., & Stempfel, G. (2010). Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary \(\beta \)-mixing processes. Journal of Machine Learning Research, 11, 1927–1956.
  43. Rio, E. (2000). Théorie asymptotique des processus aléatoires faiblement dépendants (Vol. 31). Berlin: Mathématiques & Applications.
  44. Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3, 233–269.
  45. Seldin, Y., & Tishby, N. (2010). PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11, 3595–3646.
  46. Seldin, Y., Auer, P., Shawe-Taylor, J., Ortner, R., & Laviolette, F. (2011). PAC-Bayesian analysis of contextual bandits. In Advances in neural information processing systems (pp. 1683–1691).
  47. Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., & Auer, P. (2012). PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12), 7086–7093.
  48. Shawe-Taylor, J., & Williamson, R. (1997). A PAC analysis of a Bayes estimator. In Proceedings of the 10th annual conference on computational learning theory (pp. 2–9). New York: ACM.
  49. Steinwart, I., & Christmann, A. (2009). Fast learning from non-iid observations. In Advances in neural information processing systems (pp. 1768–1776).
  50. Taleb, N. N. (2007). The black swan: The impact of the highly improbable. New York: Random House.
  51. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
  52. Vapnik, V. N. (2000). The nature of statistical learning theory. Berlin: Springer.
  53. Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1), 94–116.
  54. Zimin, A., & Lampert, C. H. (2015). Conditional risk minimization for stochastic processes. arXiv:1510.02706.

Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. CREST, ENSAE, Université Paris Saclay, Paris, France
  2. Modal Project-Team, Inria, Lille - Nord Europe research center, France
