1 Introduction

In statistical inference, we often rely on ideal assumptions about the data distribution. For instance, samples are assumed to be independent and identically distributed from a probability distribution that belongs to a pre-specified statistical model. However, such assumptions are often violated in practice. The data set may contain unexpected samples due to, say, the failure of observation equipment. Even a single extreme sample, called an outlier, can have a large impact on classical estimators. Robust statistics studies how to construct reliable statistical methods that work efficiently even when ideal assumptions on the observations are violated.

1.1 Background

Robust statistics has a long history. The theoretical foundation of robust statistics was established by many works, including [1,2,3,4]. A typical problem setting in robust statistics is that the data is observed from the contaminated model \((1-\varepsilon )P+\varepsilon Q\), where Q is a contamination distribution and \(\varepsilon \) is a contamination ratio. The purpose is to estimate the target distribution P or its statistics, such as the location parameter or the variance, from the data. Samples from Q are regarded as outliers. Sensitivity measures such as the influence function or the breakdown point are used to evaluate the robustness of an estimator against outliers. The influence function is defined as the Gâteaux derivative of the estimation functional in the direction of the point mass at an outlier. The most B-robust estimator is the estimator that minimizes the supremum of the modulus of the influence function [3]. The median is the most B-robust estimator for the mean value of the one-dimensional Gaussian distribution. The breakdown point is defined as the maximum contamination ratio such that the estimator still gives meaningful information about the target distribution P. It is well known that the median asymptotically attains the maximum breakdown point under some assumptions. For these reasons, the median is regarded as the most robust estimator of the mean value, especially when the data is distributed according to a contaminated Gaussian distribution.

Numerous works have studied multi-dimensional extensions of the median. The componentwise median and the geometric median are straightforward extensions of the median to multivariate data. These estimators are, however, suboptimal as estimators of the location parameter under the contaminated model [5,6,7]. The authors of [2] proposed the concept of data depth, which measures how deeply a point is embedded in the scattered data. The deepest point for the observed data set is called Tukey’s median. Numerous works have proved that Tukey’s median has desirable properties as an estimator of the multivariate location parameter of the target distribution: a high breakdown point [8], the redescending property of the influence function [9], the min–max optimal convergence rate [5], and possession of general notions of statistical depth [10].

As another approach to robust statistics, minimum distance methods have been explored by many authors, including [11,12,13,14,15,16]. The robustness of statistical inference is closely related to the topological structure induced by the loss function over the set of probability distributions [17]. For instance, the negative log-likelihood loss used in the maximum likelihood estimator (MLE) corresponds to the minimization of the Kullback–Leibler (KL) divergence from the empirical distribution to the statistical model. A deep insight by [17] revealed that the KL divergence does not yield a relevant topological structure on the set of probability distributions. This is one reason that the MLE does not possess a robustness property against contamination. To construct a robust estimator, the KL divergence is replaced with other divergences that induce a relevant topology, such as the density-power divergence or the pseudo-spherical divergence [13,14,15]. These divergences are included in the class of Bregman divergences [18]. The robustness property of Bregman divergences has been investigated in many works, including [19, 20].

Besides Bregman divergences, f-divergences [17, 21] and integral probability metrics (IPMs) [22] have been widely applied in statistics and machine learning. In statistical learning with deep neural networks, generative adversarial networks (GANs) have been proposed to estimate complex data distributions, such as the image-generating process [23]. GAN is regarded as a learning algorithm that minimizes the Jensen–Shannon (JS) divergence. The original GAN has been extended to f-GAN, which is a learning method using an f-divergence as the loss function [24]. This extension has enabled us to use other f-divergences for the estimation of generative models. Note that the variational formulation of the f-divergence accepts the direct substitution of the empirical distribution. IPMs such as the Wasserstein distance or the maximum mean discrepancy (MMD), too, are used to learn generative models [25,26,27]. Learning methods using f-divergences and IPMs are formulated as min–max optimization problems. The inner maximization yields the variational expression of the divergence.

Recently, [28, 29] found a relationship between f-GANs and robust estimators based on the data depth. Roughly, their works showed that depth-based estimators of the location parameter and the covariance matrix are approximately represented by the f-GAN using the total variation (TV) distance, i.e., TV-GAN. Furthermore, [28] proved that the approximate Tukey’s median obtained by the TV-GAN attains the min–max optimal convergence rate, as does the original Tukey’s median [5]. The approximation of the data depth by GAN-based estimators is advantageous for computation. [6, 7] proposed polynomial-time computable robust estimators of the Gaussian mean and covariance. However, the computation algorithm often requires knowledge such as the contamination ratio, which is usually not available in practice. On the other hand, GAN-based robust estimators are computationally efficient, though a rigorous bound on the computation cost has not been theoretically guaranteed. Inspired by [28, 29], some authors have investigated depth-based estimators from the standpoint of GANs [30,31,32,33,34].

In this paper, we explore the application of GAN-based robust estimators to a general class of statistical models. In most theoretical analyses of Tukey’s median and its GAN-based approximations, the Gaussian distribution or an elliptical distribution is assumed. We propose a class of IPMs called the smoothed total variation distance to construct a robust estimator. As the statistical model, we consider the kernel exponential family [35, 36]. The kernel method is commonly used in statistics [37,38,39]. A kernel function corresponds to a reproducing kernel Hilbert space (RKHS). The kernel exponential family with a finite-dimensional RKHS produces the standard exponential family, and an infinite-dimensional RKHS provides an infinite-dimensional exponential family. We propose a statistical learning method using the smoothed TV distance for the kernel exponential family and analyze its statistical properties, such as the convergence rate under contaminated models. Often, the computation of the normalization constant is infeasible [40,41,42,43]. In this paper, we use a Monte Carlo approximation to overcome the computational difficulty of the normalization constant. Then, we analyze the robustness property of the approximate estimators.

1.2 Related works

This section discusses works related to our paper, highlighting GAN-based robust estimators, the kernel exponential family, and unnormalized models.

1.2.1 Relation between Tukey’s Median and GAN

Recent studies revealed some statistical properties of depth-based estimators, including Tukey’s median [5, 28,29,30,31,32,33,34]. A connection between depth-based robust estimators and GANs was first studied by [28]. Originally, GAN was proposed as a learning algorithm for generative models. There are mainly two approaches to constructing learning algorithms for generative models.

One is to use f-divergences, as in f-GANs, including the vanilla GAN [23, 24]. In learning with f-GAN, the estimator is obtained by minimizing an approximate f-divergence between the empirical distribution and the generative model. In [28, 29, 31], the robustness of f-GANs implemented by DNNs has been revealed mainly for the location parameter and variance–covariance matrix of the normal distribution.

The other approach is to use IPM-based methods such as [25, 44]. In IPM-based learning, the estimator is computed by minimizing the gap between generalized moments of the empirical distribution and those of the estimated distribution. Hence, IPM-based learning is similar to the generalized method of moments [45]. Gao et al. [28] proved that Tukey’s median and depth-based robust estimators are expressed as the minimizers of a modified IPM between the empirical distribution and the model. Some works, including [28, 30, 32], show that IPM-based estimators also provide robust estimators of the parameters of the normal distribution. [30] studied the estimation accuracy of the robust estimator defined by the Wasserstein distance or the neural net distance [46], which is an example of an IPM.

The connection between the robust estimator and GAN allows us to apply recent developments in deep neural networks (DNNs) to robust statistics. However, most theoretical works on GAN-based robust statistics have focused on estimating the mean vector or variance–covariance matrix in the multivariate normal distribution, as shown above.

In our paper, we propose a smoothed variant of the total variation (TV) distance called the smoothed TV (STV) distance and investigate its convergence property for general statistical models. The class of STV distances is contained in the class of IPMs. The work most closely related to ours is [32], in which the smoothed Kolmogorov–Smirnov (KS) distance over the set of probability distributions is proposed. In their paper, the smoothed KS distance is employed to estimate the mean value or the second-order moment of the population distribution under contamination. On the other hand, we use the STV distance for the kernel exponential family, including infinite-dimensional models, to estimate the probability density. Furthermore, the estimation accuracy of the estimator using the STV distance is evaluated by the TV distance. In many theoretical analyses, the loss function used to compute the estimator is again used to evaluate estimation accuracy. For example, the estimator based on the f-GAN is assessed by the same f-divergence. Such an assessment does not allow a fair comparison of estimation accuracy. We use the TV distance to assess estimation accuracy for any STV-based learning in a unified manner. For that purpose, the difference between the TV distance and the STV distance is analyzed to evaluate the estimation bias induced by the STV distance.

1.2.2 Statistical inference with kernel exponential family

The exponential family plays a central role in mathematical statistics [47, 48]. Indeed, the exponential family has rich mathematical properties from various viewpoints: minimal sufficiency, statistical efficiency, attainment of maximum entropy under moment constraints on sufficient statistics, geometric flatness in the sense of information geometry, and so on.

The kernel exponential family is an infinite-dimensional extension of the exponential family [35]. Due to its computational tractability, the kernel exponential family is used as a statistical model for non-parametric density estimation [36, 49, 50].

Some authors have studied robustness for infinite-dimensional non-parametric estimators. In [51], the authors study an estimator based on a robust test, which is computationally infeasible. The TV distance is used as the discrepancy measure. In [52], the estimation accuracy of the wavelet thresholding estimator is evaluated using IPM loss defined by the Besov space.

On the other hand, we study the robustness property of the STV-based estimator with the kernel exponential family. The estimation accuracy is evaluated by the TV distance, the STV distance, and the norm in the parameter space. For the finite-dimensional multivariate normal distribution, we derive the minimax convergence rate of the robust estimator of the covariance matrix in terms of the Frobenius norm, while existing works mainly focus on the convergence rate under the operator norm.

1.2.3 Unnormalized models

For complex probability density models, including the kernel exponential family, the computation of the normalization constant is often infeasible. In such a case, the direct application of the MLE is not possible. There are numerous works that mitigate this computational difficulty in statistical inference [40,41,42,43, 53,54,55,56,57,58]. The kernel exponential family has the same obstacle. To avoid the computation of the normalization constant, [36] investigated an estimation method using the Fisher divergence. Another approach is to use the dual expression of the MLE for the kernel exponential family [36, 49, 50]. The IPM-based robust estimator considered in this paper also suffers from this computational issue.

Here, we employ the Monte Carlo method. For the infinite-dimensional kernel exponential family, the representer theorem does not work to reduce the infinite-dimensional optimization problem to a finite-dimensional one. However, finite sampling by the Monte Carlo method enables us to use the standard representer theorem to compute the estimator. We analyze the relation between the number of Monte Carlo samples and the convergence rate under the contaminated distribution.

1.2.4 Organization

The paper is organized as follows. In Sect. 2, we introduce the notation used in this paper and discrepancy measures used in statistical inference. We also introduce the relationship between the data depth and GAN-based robust estimators. In Sect. 3, we define the smoothed TV distance. Some theoretical properties, such as its difference from the TV distance, are investigated. In Sect. 4, the smoothed TV distance with regularization is applied to the estimation of the kernel exponential family. We derive the convergence rate of the estimator in terms of the TV distance. In Sect. 5, we investigate the convergence rate of the estimator in the parameter space. For the estimation of the covariance matrix of the multivariate normal distribution, we prove that our method attains the minimax optimal rate in terms of the Frobenius norm under contamination, while most past works derived the convergence rate in terms of the operator norm. In Sect. 6, we investigate the convergence rate of the learning algorithm using a Monte Carlo approximation of the normalization constant. Section 7 is devoted to the conclusion and future works. Detailed proofs of theorems are presented in the Appendix.

2 Discrepancy measures for statistical inference

In this section, let us define the notations used throughout this paper. Then, we introduce some discrepancy measures for statistical inference.

2.1 Notations and definitions

We use the following notations throughout the paper. Let \({\mathcal {P}}\) be the set of all probability measures on a Borel space \(({\mathcal {X}},{\mathcal {B}})\), where \({\mathcal {B}}\) is a Borel \(\sigma \)-algebra. For the Borel space, \(L_0\) denotes the set of all measurable functions and \(L_1(\subset L_0)\) denotes the set of all integrable functions. Here, functions in \(L_0\) are allowed to take \(\pm \infty \). For a function set \({\mathcal {U}}\subset {L_0}\) and \(c\in {\mathbb {R}}\), let \(c\,{\mathcal {U}}\) be \(\{c u : u\in {\mathcal {U}}\}\) and \(-{\mathcal {U}}\) be \((-1){\mathcal {U}}\). The expectation of \(f\in L_0\) for the probability distribution \(P\in {\mathcal {P}}\) is denoted by \({\mathbb {E}}_P[f]=\int _{{\mathcal {X}}}f\textrm{d}P\). We also write \({\mathbb {E}}_{X\sim {P}}[f(X)]\) or \({\mathbb {E}}_{{P}}[f(X)]\) to specify the random variable X. The range of the integration, \({\mathcal {X}}\), is dropped if there is no confusion. For a Borel measure \(\mu \), \(P\ll \mu \) denotes that P is dominated by \(\mu \), i.e., \(\mu (A)=0\) for \(A\in {\mathcal {B}}\) implies \(P(A)=0\). When \(P\ll \mu \) holds, P has the probability density function p such that the expectation is computed by \(\int _{{\mathcal {X}}}f(x)p(x)\textrm{d}\mu (x)\) or \(\int _{{\mathcal {X}}}f p\textrm{d}\mu \). The function p is denoted by \(\frac{\textrm{d}{P}}{\textrm{d}{\mu }}\). The indicator function of \(A\in {\mathcal {B}}\) is denoted by \(\varvec{1}_A(x)\); it takes 1 for \(x\in {A}\) and 0 for \(x\not \in {A}\). In particular, the step function on the set of non-negative real numbers, \({\mathbb {R}}_+\), is denoted by \(\varvec{1}\) for simplicity, i.e., \(\varvec{1}(x)=1\) for \(x\ge 0\) and 0 otherwise. The indicator function \({\mathbb {I}}[A]\) takes 1 (resp. 0) when the proposition A is true (resp. false). The function \(\textrm{id}:{\mathbb {R}}\rightarrow {\mathbb {R}}\) stands for the identity function, i.e., \(\textrm{id}(x)=x\) for \(x\in {\mathbb {R}}\). The Euclidean norm of \({\varvec{x}}\in {\mathbb {R}}^d\) is denoted by \(\Vert {\varvec{x}}\Vert _2=\sqrt{{\varvec{x}}^T{\varvec{x}}}\).

Let us define \({\mathcal {H}}\) as a reproducing kernel Hilbert space (RKHS) on \({\mathcal {X}}\). For \(f,g\in {\mathcal {H}}\), the inner product on \({\mathcal {H}}\) is expressed by \(\langle f,g\rangle \) and the norm of f is defined by \(\Vert f\Vert =\sqrt{\langle f,f\rangle }\). For a positive number U, let \({\mathcal {H}}_U\) be the subset of the RKHS \({\mathcal {H}}\) defined by \(\{f\in {\mathcal {H}}:\Vert f\Vert \le U\}\). See [59] for details of RKHS.

In statistical inference, discrepancy measures for probability distributions play an important role. One of the most popular discrepancy measures in statistics is the Kullback–Leibler (KL) divergence

$$\begin{aligned} \textrm{KL}(P,Q):={\mathbb {E}}_P\!\left[ \log \frac{\textrm{d}{P}}{\textrm{d}{Q}}\right] . \end{aligned}$$

When \(P\ll \mu \) and \(Q\ll \mu \) hold, P (resp. Q) has the probability density p (resp. q) with respect to \(\mu \). Then, the KL divergence is expressed by

$$\begin{aligned} \textrm{KL}(P,Q)=\int p(x) \log \frac{p(x)}{q(x)} \textrm{d}\mu (x). \end{aligned}$$

Note that the KL divergence does not satisfy the definition of a distance; indeed, it is not symmetric. Another important discrepancy measure is the total variation (TV) distance for \(P,Q\in {\mathcal {P}}\),

$$\begin{aligned} \textrm{TV}(P,Q) := \sup _{A\in {\mathcal {B}}} |{{\mathbb {E}}_{P}[\varvec{1}_A]-{\mathbb {E}}_{Q}[\varvec{1}_A]}| = \sup _{\begin{array}{c} f\in L_0\\ 0\le f\le 1 \end{array}} {\mathbb {E}}_{P}[f]-{\mathbb {E}}_{Q}[f]. \end{aligned}$$
(1)

When P and Q have the probability density functions p and q, respectively, with respect to the Borel measure \(\mu \), we have

$$\begin{aligned} \textrm{TV}(P,Q)=\frac{1}{2}\int |{p-q}|\textrm{d}{\mu }. \end{aligned}$$
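
As a concrete illustration of this formula (an illustration only, not part of the estimators studied later), the following Python snippet numerically evaluates \(\frac{1}{2}\int |{p-q}|\textrm{d}{\mu }\) for two univariate Gaussian densities with \(\mu \) the Lebesgue measure; the means 0 and 1 are arbitrary example choices.

```python
import numpy as np
from scipy.stats import norm

# TV(P, Q) = (1/2) * integral of |p - q| with respect to the Lebesgue measure,
# approximated on a fine grid for P = N(0, 1) and Q = N(1, 1).
x = np.linspace(-10.0, 11.0, 200001)
p = norm.pdf(x, loc=0.0, scale=1.0)
q = norm.pdf(x, loc=1.0, scale=1.0)
tv = 0.5 * np.trapz(np.abs(p - q), x)

# For two unit-variance Gaussians, the exact value is 2 * Phi(|m1 - m2| / 2) - 1.
exact = 2.0 * norm.cdf(0.5) - 1.0
print(tv, exact)   # both are approximately 0.3829
```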

As a generalization of the total variation distance, the integral probability metric (IPM) is defined by

$$\begin{aligned} \textrm{IPM}(P,Q;{\mathcal {F}}):=\sup _{f\in {\mathcal {F}}} |{{\mathbb {E}}_P[f]-{\mathbb {E}}_Q[f]}|, \end{aligned}$$

where \({\mathcal {F}}\subset L_0\) is a uniformly bounded function set, i.e., \(\sup _{f\in {\mathcal {F}}, x\in {\mathcal {X}}}|{f(x)}|<\infty \) [22]. If \({\mathcal {F}}\) satisfies \({\mathcal {F}}=-{\mathcal {F}}\), one can clearly drop the modulus sign in the definition of the IPM. The same property holds when \({\mathcal {F}}=\{c-f:\,f\in {\mathcal {F}}\}\) holds for a fixed real number c; see (1). The class of IPMs includes not only the total variation distance but also the Wasserstein distance, the Dudley distance, the maximum mean discrepancy (MMD) [60], etc. The MMD is used for non-parametric two-sample tests [61], and the Wasserstein distance is used for transfer learning and estimation with generative models [44, 62,63,64,65]. In the following sections, we study robust statistical inference using a class of IPMs.

2.2 Depth-based methods and IPMs

Tukey’s median is a multivariate analog of the median. Given d-dimensional i.i.d. samples \(X_1,\ldots ,X_n\), Tukey’s median is defined as the minimum solution of

$$\begin{aligned} \min _{\varvec{\mu }\in {\mathbb {R}}^d}\max _{\begin{array}{c} {\varvec{u}}\in {\mathbb {R}}^d\\ \Vert {\varvec{u}}\Vert _2=1 \end{array}} \frac{1}{n}\sum _{i=1}^{n}{\mathbb {I}}[{\varvec{u}}^T(X_i-\varvec{\mu })\ge 0]. \end{aligned}$$

Let \({\widehat{P}}_{\varvec{\mu }}\) be the probability distribution having the uniform point mass on \(X_1-\varvec{\mu },\ldots ,X_n-\varvec{\mu }\) and \({\mathcal {F}}\) be the function set \({\mathcal {F}}=\{{\varvec{x}}\mapsto {\mathbb {I}}[{\varvec{u}}^T{\varvec{x}}\ge 0]\,:\,{\varvec{u}}\in {\mathbb {R}}^d,\,\Vert {\varvec{u}}\Vert _2=1\} \cup \{{\varvec{x}}\mapsto {\mathbb {I}}[{\varvec{u}}^T{\varvec{x}}>0]\,:\,{\varvec{u}}\in {\mathbb {R}}^d,\,\Vert {\varvec{u}}\Vert _2=1\}\). Then, we can confirm that the IPM between \({\widehat{P}}_{\varvec{\mu }}\) and the multivariate standard normal distribution \(N_d(\varvec{0},I_d)\) with the above \({\mathcal {F}}\) yields

$$\begin{aligned} \textrm{IPM}({\widehat{P}}_{\varvec{\mu }},N_d(\varvec{0},I_d);{\mathcal {F}})&= \max _{\begin{array}{c} {\varvec{u}}\in {\mathbb {R}}^d\\ \Vert {\varvec{u}}\Vert _2=1 \end{array}} \frac{1}{n}\sum _{i=1}^{n}{\mathbb {I}}[{\varvec{u}}^T(X_i-\varvec{\mu })\ge 0]-\frac{1}{2}, \end{aligned}$$

i.e., one can drop the modulus in the definition of the IPM. Tukey’s median is obtained by minimizing the above IPM with respect to \(\varvec{\mu }\).
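
The objective above can also be evaluated numerically. The sketch below is a crude illustration, assuming that a finite set of random unit directions is an acceptable surrogate for the supremum over the unit sphere; it is not an exact or efficient algorithm for Tukey's median, and the sample and candidate centers are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def tukey_objective(X, mu, n_dirs=2000):
    """Approximate max_u (1/n) sum_i 1[u^T (X_i - mu) >= 0] by sampling
    random unit directions u instead of maximizing over the whole sphere."""
    d = X.shape[1]
    U = rng.standard_normal((n_dirs, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit directions
    signs = ((X - mu) @ U.T) >= 0.0                 # shape (n, n_dirs)
    return signs.mean(axis=0).max()

# Contaminated Gaussian sample: 180 clean points around the origin, 20 outliers.
X = np.vstack([rng.standard_normal((180, 2)),
               rng.standard_normal((20, 2)) + 8.0])

# Evaluate the objective at a few candidate centers; smaller means deeper.
candidates = {"mean": X.mean(axis=0),
              "componentwise median": np.median(X, axis=0)}
for name, mu in candidates.items():
    print(name, tukey_objective(X, mu))
# The contaminated mean typically yields a larger objective value than the median.
```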

Likewise, the covariance matrix depth is expressed by the IPM between the probability distribution \({\widehat{P}}_\Sigma \) having the uniform point mass on \(\Sigma ^{-1/2}X_1, \ldots , \Sigma ^{-1/2}X_n\) for a positive definite matrix \(\Sigma \) and \(N_d(\varvec{0},I_d)\). For the function set \({\mathcal {F}}=\{{\varvec{x}}\mapsto {\mathbb {I}}[{\varvec{u}}^T({\varvec{x}}{\varvec{x}}^T-I_d){\varvec{u}}\le 0]\,:\,{\varvec{u}}\in {\mathbb {R}}^d{\setminus }\{\varvec{0}\}\}\), we can confirm that

$$\begin{aligned}&\textrm{IPM}({\widehat{P}}_{\Sigma },N_d(\varvec{0},I_d);{\mathcal {F}})\\&\quad = \max _{{\varvec{u}}\in {\mathbb {R}}^d\setminus \{\varvec{0}\}} \bigg |\frac{1}{n}\sum _{i=1}^{n}{\mathbb {I}}[{\varvec{u}}^T\Sigma ^{-1/2}(X_iX_i^T-\Sigma )\Sigma ^{-1/2}{\varvec{u}}\le 0]\\&\qquad -{\mathbb {P}}_{Z\sim N(0,1)}(\Vert {\varvec{u}}\Vert ^2Z^2-\Vert {\varvec{u}}\Vert ^2\le 0) \bigg |\\&\quad = \max _{\begin{array}{c} {\varvec{u}}\in {\mathbb {R}}^d\\ \Vert {\varvec{u}}\Vert =1 \end{array}} \bigg |\frac{1}{n}\sum _{i=1}^{n}{\mathbb {I}}[({\varvec{u}}^T X_i)^2\le {\varvec{u}}^T\Sigma {\varvec{u}}] - {\mathbb {P}}_{Z\sim N(0,1)}(Z^2\le 1) \bigg |. \end{aligned}$$

The last line is nothing but the covariance matrix depth. The minimizer of \(\textrm{IPM}({\widehat{P}}_{\Sigma },N_d(\varvec{0},I_d);{\mathcal {F}})\) over the positive definite matrix \(\Sigma \) is equal to the covariance matrix estimator with the data depth.

Another IPM-based expression of robust estimators is presented by [28]. In order to express the estimator as the minimizer of an IPM from the empirical distribution \(P_n\) of the data to the statistical model, a variant of the IPM loss is introduced. For instance, the depth-based robust covariance estimator is expressed as the minimum solution of \(\lim _{r\rightarrow 0}\textrm{IPM}(P_n, N_d(\varvec{0},\Sigma );{\mathcal {F}}_{N_d(\varvec{0},\Sigma ),r})\) with respect to the covariance matrix parameter \(\Sigma \), where \({\mathcal {F}}_{Q,r}\) is a function set depending on the distribution Q and a positive real parameter r. Details are shown in Proposition 2.1 of [28]. Though the connection between the data depth and IPMs is not straightforward, the GAN-based estimator is thought to be a promising method for robust density estimation.

3 Smoothed total variation distance

We define the smoothed total variation distance as a class of the IPM, and investigate its theoretical properties. All the proofs of theorems in this section are deferred to Appendix A.

3.1 Definition and some properties

As an extension of the TV distance, let us define the smoothed TV distance, which is a class of IPMs.

Definition 1

Let \(\sigma :{\mathbb {R}}\rightarrow {\mathbb {R}}\) be a measurable function and \({\mathcal {U}}\subset L_0\) be a function set including the zero function. For \(P,Q\in {\mathcal {P}}\), the smoothed total variation (STV) distance, \(\textrm{STV}_{{\mathcal {U}},\sigma }(P,Q)\), is defined by

$$\begin{aligned} \textrm{STV}_{{\mathcal {U}},\sigma }(P,Q) := \sup _{\begin{array}{c} u\in {\mathcal {U}} \end{array}, b\in {\mathbb {R}}} \big |{\mathbb {E}}_{X\sim P}[\sigma (u(X)-b)]-{\mathbb {E}}_{X\sim Q}[\sigma (u(X)-b)]\big |. \end{aligned}$$

For the bias b, one can impose a constraint such as \(|{b}|\le R\) with a positive constant R. Regardless of the constraint on the bias b, we write \(\textrm{STV}(P,Q)\) or \(\textrm{STV}_\sigma (P,Q)\) for \(\textrm{STV}_{{\mathcal {U}},\sigma }(P,Q)\) if there is no confusion.

The STV distance is nothing but the IPM with the function class \(\{\sigma (u-b):\,u\in {\mathcal {U}}, b\in {\mathbb {R}}\}\). We show that the STV distance is a smoothed variant of the TV distance and shares some of its statistical properties. When the function set \({\mathcal {U}}\) is properly defined and \(\sigma \) is smooth, it is possible to construct a computationally tractable learning algorithm using the STV distance. On the other hand, learning with the TV distance is often computationally intractable, as the non-differentiable indicator function prevents efficient optimization. In our paper, we focus on the STV distance for which \({\mathcal {U}}\) is a ball in an RKHS. The details are shown in Sect. 3.3.
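
As a minimal computational sketch (not the estimator analyzed below), the snippet approximates \(\textrm{STV}_{{\mathcal {U}},\sigma }\) between two empirical distributions by taking \({\mathcal {U}}\) to be a ball of the linear-kernel RKHS, i.e., \(u({\varvec{x}})={\varvec{w}}^T{\varvec{x}}\) with \(\Vert u\Vert =\Vert {\varvec{w}}\Vert _2\le U\), \(\sigma \) the sigmoid, and maximizing over \(({\varvec{w}},b)\) by projected gradient ascent; the step size and iteration count are ad hoc choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stv_linear(X, Y, U=10.0, steps=500, lr=0.5, seed=0):
    """Approximate STV(P_n, Q_n) with u(x) = w^T x, ||w||_2 <= U, sigma = sigmoid.
    Since -u is also feasible, the modulus can be dropped and we maximize
    E_X[sigma(u - b)] - E_Y[sigma(u - b)] directly."""
    rng = np.random.default_rng(seed)
    w, b = rng.standard_normal(X.shape[1]) * 0.1, 0.0
    for _ in range(steps):
        sx, sy = sigmoid(X @ w - b), sigmoid(Y @ w - b)
        gx, gy = sx * (1 - sx), sy * (1 - sy)        # sigma'(z) = sigma * (1 - sigma)
        w += lr * ((gx[:, None] * X).mean(0) - (gy[:, None] * Y).mean(0))
        b += lr * (-gx.mean() + gy.mean())
        norm = np.linalg.norm(w)
        if norm > U:                                  # project onto the RKHS ball
            w *= U / norm
    return sigmoid(X @ w - b).mean() - sigmoid(Y @ w - b).mean()

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 2))            # sample from N(0, I)
Y = rng.standard_normal((1000, 2)) + 1.0      # sample from N((1, 1), I)
print(stv_linear(X, Y))   # roughly approximates TV(N(0, I), N((1, 1), I)) = 0.52
```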

Some examples of STV distance are shown below.

Example 1

The total variation distance is expressed by \(\textrm{STV}_{L_0,\varvec{1}}\).

Example 2

The STV distance with the identity function \(\sigma =\textrm{id}\) reduces to the IPM defined by the function set \({\mathcal {U}}\). The MMD [61] is expressed by the STV distance with \({\mathcal {U}}={\mathcal {H}}_1\) and \(\sigma =\textrm{id}\). The first-order Wasserstein distance corresponds to the STV distance with \({\mathcal {U}}=\{f:{\mathcal {X}}\rightarrow {\mathbb {R}}\,:\,\text {f is 1-Lipschitz continuous}\}\) and \(\sigma =\textrm{id}\).

The authors of [30] revealed the robustness of the Wasserstein-based estimator called the Wasserstein GAN. As for the robustness of the MMD-based method, i.e., MMD-GAN, the authors of [31] obtained negative results, based on a theoretical analysis under a strong assumption and on numerical experiments. Though the STV distance with RKHSs is similar to the MMD, a significant difference is that a non-linear function \(\sigma \) and the RKHS ball \({\mathcal {H}}_U\) with a variable radius are used in the STV distance. As a result, the STV-based method recovers robustness. Section 4 and thereafter will discuss this in more detail.

Example 3

Let \({\mathcal {U}}\subset {L_0}\) be a function set. The STV distance with \({\mathcal {U}}\) and \(\sigma =\varvec{1}\) is the generalized Kolmogorov–Smirnov distance [32],

$$\begin{aligned} \textrm{STV}_{{\mathcal {U}},\varvec{1}}(P,Q)&= \sup _{u\in {\mathcal {U}}, b\in {\mathbb {R}}} {\mathbb {E}}_P[\varvec{1}(u(X)-b\ge 0)]-{\mathbb {E}}_Q[\varvec{1}(u(X)-b\ge 0)] \\&= \sup _{u\in {\mathcal {U}}, b\in {\mathbb {R}}}P(u(X)\ge b)-Q(u(X)\ge b). \end{aligned}$$

When \(\sigma \) is a cumulative distribution function of a probability distribution, the STV distance, \(\textrm{STV}_\sigma \), is the smoothed generalized Kolmogorov–Smirnov distance [32].

Let us consider some basic properties of the STV distance.

Lemma 1

For the STV distance, the non-negativity, \(\textrm{STV}(P,Q)\ge 0\), and the triangle inequality, \(\textrm{STV}(P,Q)\le \textrm{STV}(P,R)+\textrm{STV}(R,Q)\) hold for \(P,Q,R\in {\mathcal {P}}\). When \(\sigma \) satisfies \(0\le \sigma \le 1\), the inequality \(\textrm{STV}_{\sigma }(P,Q)\le \textrm{TV}(P,Q)\) holds.

We omit the proof, since it is straightforward.

Let us consider the following assumptions.

  1. Assumption (A): The function set \({\mathcal {U}}\subset L_0\) satisfies \({\mathcal {U}}=-{\mathcal {U}}\), i.e., \({\mathcal {U}}\) is closed under negation.

  2. Assumption (B): The function \(\sigma (z)\) is continuous and strictly monotone increasing. In addition, \(\displaystyle \lim _{z\rightarrow -\infty }\sigma (z)=0, \lim _{z\rightarrow \infty }\sigma (z)=1\) and \(\sigma (z)+\sigma (-z)=1, z\in {\mathbb {R}}\) hold.

Under Assumption (B), the STV distance is regarded as a class of the smoothed generalized Kolmogorov–Smirnov distance [32].

We show some properties of the STV distance under the above assumptions.

Lemma 2

Under Assumption (A), the following equality holds,

$$\begin{aligned} \textrm{STV}_{{\mathcal {U}},\varvec{1}}(P,Q) = \sup _{\begin{array}{c} u\in {\mathcal {U}}, b\in {\mathbb {R}} \end{array}} {\mathbb {E}}_{P}[\varvec{1}[u(X)-b\ge 0]]-{\mathbb {E}}_{Q}[\varvec{1}[u(X)-b\ge 0]] \end{aligned}$$

for \(P,Q\in {\mathcal {P}}\), i.e., one can drop the modulus sign.

Lemma 3

Under Assumptions (A) and (B), it holds that

$$\begin{aligned} \textrm{STV}_{{\mathcal {U}},\sigma }(P,Q) = \sup _{\begin{array}{c} u\in {\mathcal {U}}, b\in {\mathbb {R}} \end{array}} {\mathbb {E}}_{P}[\sigma (u(X)-b)]-{\mathbb {E}}_{Q}[\sigma (u(X)-b)] \end{aligned}$$

for \(P,Q\in {\mathcal {P}}\).

Let us consider the STV distance such that \({\mathcal {U}}\) is given by the RKHS \({\mathcal {H}}\) or its subset \({\mathcal {H}}_U\). When the RKHS is dense in the set of all continuous functions on \({\mathcal {X}}\) with respect to the supremum norm, the RKHS is called a universal RKHS [66]. It is well known that the Gaussian kernel induces a universal RKHS.

Lemma 4

Suppose that \({\mathcal {H}}\) is a universal RKHS. Under Assumptions (A) and (B), \(\textrm{STV}_{{\mathcal {H}},\sigma }\) equals the TV distance. Furthermore, for \({\mathcal {H}}_U=\{f\in {\mathcal {H}}:\,\Vert f\Vert \le U\}\), the equality

$$\begin{aligned} \lim _{U\rightarrow \infty }\textrm{STV}_{{\mathcal {H}}_U,\sigma }(P,Q) = \textrm{STV}_{{\mathcal {H}},\sigma }(P,Q) =\textrm{TV}(P,Q) \end{aligned}$$

holds for \(P,Q\in {\mathcal {P}}\).

3.2 Gap between STV distance and TV distance

One can quantitatively evaluate the difference between the TV distance and STV distance. First of all, let us define the decay rate of the function \(\sigma \).

Definition 2

Let \(\sigma \) be the function satisfying Assumption (B). If there exists a function \(\lambda (c)\) such that \(\lim _{c\rightarrow \infty }\lambda (c)=0\) and

$$\begin{aligned} \sup _{t\ge 1}\sigma (-c\log t)(t-1) \le \lambda (c) \end{aligned}$$
(2)

hold for arbitrary \(c> C_0>0\), then \(\lambda (c)\) is called the decay rate of \(\sigma \), where \(C_0\) is a positive constant.

Proposition 5

Assume the Assumptions (A) and (B). Suppose that the decay rate of \(\sigma \) is \(\lambda (c)\) for \(c>C_0>0\), i.e., (2) holds. For \(P,Q\in {\mathcal {P}}\), let \(\mu \) be a Borel measure such that \(P\ll \mu \) and \(Q\ll \mu \) hold. Let us define the function s(x) on \({\mathcal {X}}\) by \(s(x) = \log \frac{\frac{\textrm{d}{P}}{\textrm{d}{\mu }}(x)}{\frac{\textrm{d}{Q}}{\textrm{d}{\mu }}(x)}\), where \(\log \frac{a}{0}=\infty \) and \(\log \frac{0}{a}=-\infty \) for \(a>0\) and \(\frac{0}{0}=1\) by convention. Suppose \(s\in {\mathcal {U}}\oplus {\mathbb {R}}\). Then, for the STV distance with \(c\,{\mathcal {U}}, c>0\) and \(\sigma \), the inequality

$$\begin{aligned} 0\le \textrm{TV}(P,Q)-\textrm{STV}_{c\,{\mathcal {U}},\sigma }(P,Q) \le \lambda (c)(1-\textrm{TV}(P,Q)) \end{aligned}$$

holds for \(c> C_0\).

Note that for any pair of probability distributions, P and Q, there exists a measure \(\mu \) such that \(P\ll \mu \) and \(Q\ll \mu \). A simple example is \(\mu =P+Q\). The above proposition holds only for P and Q such that \(s\in {\mathcal {U}}\oplus {\mathbb {R}}\). Under mild assumptions, any pair of probability distributions in a statistical model satisfies the condition \(s\in {\mathcal {U}}\oplus {\mathbb {R}}\).

Remark 1

Let us consider the case of \(\textrm{TV}(P,Q)=1\). A typical example is the pair of P and Q for which there exists a subset A such that \(P(A)=1\) and \(Q(A)=0\). In this case, \(s(x)=\infty \) (resp. \(s(x)=-\infty \)) for \(x\in {A}\) (resp. \(x\not \in {A}\)). If \({\mathcal {U}}\) includes such a function, we have \(\sigma (s(x)-b)=+1\) for \(x\in A\) and otherwise \(\sigma (s(x)-b)=0\). Hence, the STV distance matches with the TV distance for P and Q.

Below, we show the lower bound of the decay rate and some examples.

Lemma 6

Under Assumption (B), the decay rate satisfies

$$\begin{aligned} \displaystyle \liminf _{c\rightarrow \infty }c \lambda (c)>0. \end{aligned}$$

The above lemma means that the decay rate \(\lambda (c)\) cannot be of smaller order than 1/c.

Example 4

For the sigmoid function \(\sigma (z)=1/(1+e^{-z}), z\in {\mathbb {R}}\), the decay rate is given by \(\lambda (c)=1/c\) for \(c>1\). Indeed, the inequality

$$\begin{aligned} \sigma (-c \log t)(t-1) = \frac{t-1}{1+t^c} \le \frac{t-1}{t^c} \le \frac{1}{c}\left( \frac{c-1}{c}\right) ^{c-1} \le \frac{1}{c} \end{aligned}$$

holds for \(t>1\) and \(c>1\). We see that the sigmoid function attains the lowest order of the decay rate. Likewise, we find that the decay rate \(\lambda (c)\) of the function \(\sigma (-z) \asymp e^{-z}\) for \(z>0\) is of the order 1/c.
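
The bound in this example is easy to verify numerically; the short check below (a sanity check only, not part of the analysis) evaluates \(\sup _{t\ge 1}\sigma (-c\log t)(t-1)=\sup _{t\ge 1}(t-1)/(1+t^c)\) on a grid and compares it with 1/c.

```python
import numpy as np

# For the sigmoid, sigma(-c * log t) * (t - 1) = (t - 1) / (1 + t**c).
t = np.linspace(1.0, 50.0, 1_000_000)
for c in (1.5, 2.0, 5.0, 10.0):
    sup_val = np.max((t - 1.0) / (1.0 + t ** c))
    print(c, sup_val, 1.0 / c)   # sup_val stays below 1/c in every case
```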

Example 5

For the function \(\sigma (-z)\asymp z^{-\beta } (z\rightarrow \infty )\) with a constant \(\beta >0\), we can prove that there is no finite decay rate. Indeed, for \(\sigma (-z)\asymp z^{-\beta }\), we have

$$\begin{aligned} \sup _{t\ge 1}\sigma (-c\log {t})(t-1) \asymp \frac{1}{c^\beta }\sup _{t\ge 1}\frac{t-1}{(\log {t})^\beta }=\infty . \end{aligned}$$

In the proof of Proposition 5, the density ratios, p/q and q/p, are replaced with the variable t. When the density ratios p/q and q/p are both bounded above by a constant \(T_0>0\), the range of the supremum is restricted to \(1\le t\le T_0\). In such a case, the decay rate is \(\lambda (c)=\frac{1}{c^\beta }\sup _{1\le t\le T_0}\frac{t-1}{(\log {t})^\beta }\asymp 1/c^\beta \). Under this additional boundedness assumption, the restricted decay rate can be of smaller order than the lower bound in Lemma 6.

3.3 STV distances on kernel exponential family

We use the STV distance for the probability density estimation. There are numerous models of probability densities. In this paper, we focus on the exponential family and its kernel-based extension called kernel exponential family [35, 36]. The exponential family includes important statistical models. The kernel exponential family is a natural extension of the finite-dimensional exponential family to an infinite-dimensional one while preserving computational feasibility. We consider the robust estimator based on the STV distance for the kernel exponential family.

Let \({\mathcal {H}}\) be the RKHS endowed with the kernel function k. The kernel exponential family \({\mathcal {P}}_{\mathcal {H}}\) is defined by

$$\begin{aligned} {\mathcal {P}}_{\mathcal {H}} := \left\{ P_f = p_f\textrm{d}\mu =\exp (f-A(f))\textrm{d}\mu \,:\,f\in {\mathcal {H}},\, \int _{{\mathcal {X}}}e^{f(x)}\textrm{d}\mu <\infty \right\} , \end{aligned}$$
(3)

where A(f) is the log-normalization constant (cumulant function), \(A(f)=\log \int _{{\mathcal {X}}}e^{f(x)}\textrm{d}\mu \), and \(\mu \) is a Borel measure on \({\mathcal {X}}\).
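
For concreteness, the following sketch (an illustration under simplifying assumptions, not the estimator studied in this paper) represents \(f\) by a finite Gaussian-kernel expansion on the real line and approximates the log-normalization constant A(f) by Monte Carlo sampling from the base measure, here taken to be the Lebesgue measure restricted to \([-5,5]\); the centers, coefficients, and bandwidth are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(x, z, bandwidth=1.0):
    """Gaussian kernel k(x, z) evaluated for all pairs of points in x and z."""
    return np.exp(-0.5 * ((x[:, None] - z[None, :]) / bandwidth) ** 2)

# f(x) = sum_j alpha_j k(z_j, x): an element of the RKHS spanned by k(z_j, .).
centers = np.array([-2.0, 0.0, 2.0])
alpha = np.array([0.5, 1.5, 0.5])

def f(x):
    return gauss_kernel(np.atleast_1d(x), centers) @ alpha

# Base measure mu: Lebesgue measure on [-5, 5] (total mass 10).  A(f) is
# approximated by uniform draws, reweighted by the volume of the interval.
m = 100_000
u = rng.uniform(-5.0, 5.0, size=m)
A_f = np.log(10.0 * np.mean(np.exp(f(u))))

def log_density(x):
    """log p_f(x) = f(x) - A(f) with respect to mu."""
    return f(x) - A_f

# Sanity check: the density integrates to one over [-5, 5] up to Monte Carlo error.
grid = np.linspace(-5.0, 5.0, 10_001)
print(np.trapz(np.exp(log_density(grid)), grid))   # approximately 1.0
```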

The following lemmas indicate the basic properties of the STV distance for the kernel exponential family. The proofs are shown in Appendix A.

Lemma 7

For \(P_f, P_g\in {\mathcal {P}}_{\mathcal {H}}\), \(\textrm{STV}_{{\mathcal {H}}_{U},\varvec{1}}(P_f, P_g)\) equals the TV distance \(\textrm{TV}(P_f, P_g)\) for any \(U>0\).

Lemma 8

Suppose Assumption (B). For \(P_f, P_g\in {\mathcal {P}}_{{\mathcal {H}}}\), we have

$$\begin{aligned} \lim _{U\rightarrow \infty }\textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_f,P_g) = \textrm{TV}(P_f,P_g). \end{aligned}$$

The convergence rate is uniformly given by

$$\begin{aligned} 0\le \textrm{TV}(P_f,P_g) - \textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_f,P_g)\le \lambda \left( \frac{U}{\Vert f-g\Vert }\right) (1-\textrm{TV}(P_f,P_g)) \end{aligned}$$

for any \(P_f,P_g\in {\mathcal {P}}_{{\mathcal {H}}}\) as long as \(U/\Vert f-g\Vert >C_0\), where \(\lambda (c),c>C_0\) is the decay rate of \(\sigma \).

Lemma 4 shows a similar result to the former part of the above lemma. In Lemma 8, the RKHS is not necessarily universal.

When \(\sigma \) is the sigmoid function, we have \(\lambda (c)=1/c\) for \(c>1\). Thus,

$$\begin{aligned} 0\le \textrm{TV}(P_f,P_g) - \textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_f,P_g) \le \frac{\Vert f-g\Vert }{U} (1-\textrm{TV}(P_f,P_g)) \le \frac{\Vert f-g\Vert }{U} \end{aligned}$$
(4)

holds for \(U>\Vert f-g\Vert \).

Remark 2

One can use distinct RKHSs for the STV and the probability model. Let \({\mathcal {H}}\) and \(\widetilde{{\mathcal {H}}}\) be RKHSs such that \({\mathcal {H}}\subset \widetilde{{\mathcal {H}}}\). Then, for \(P_f, P_g\in {\mathcal {P}}_{{\mathcal {H}}}\), it holds that \(\textrm{STV}_{\widetilde{{\mathcal {H}}}_{U},\varvec{1}}(P_f,P_g) = \textrm{TV}(P_f,P_g)\) and \(\lim _{U\rightarrow \infty }\textrm{STV}_{\widetilde{{\mathcal {H}}}_{U},\sigma }(P_f,P_g)= \textrm{TV}(P_f,P_g)\) under Assumption (B) for \(\sigma \).

Using the above lemma, one can approximate learning with the TV distance by learning with the STV distance, whose loss function does not involve the non-differentiable indicator function.

4 Estimation with STV distance for kernel exponential family

We consider the estimation of the probability density using the model \({\mathcal {P}}_{{\mathcal {H}}}\). We assume that i.i.d. samples are generated from a contaminated version of \(P_{f_0}\). For example, the Huber contamination is expressed by the mixture of \(P_{f_0}\) and an outlier distribution Q:

$$\begin{aligned} X_1,\ldots ,X_n\sim (1-\varepsilon )P_{f_0}+\varepsilon Q, \end{aligned}$$
(5)

where \(Q\in {\mathcal {P}}\) is an arbitrary distribution. Our target is to estimate \(P_{f_0}\in {\mathcal {P}}_{{\mathcal {H}}}\) from the data. For that purpose, numerous estimators have been proposed [2, 28,29,30,31, 33, 34, 67]. In our paper, we assume the following condition on the contaminated distribution.

  1. Assumption (C): For the target distribution \(P_{f_0}\in {\mathcal {P}}_{{\mathcal {H}}}\) and the contamination rate \(\varepsilon \in (0,1)\), the contaminated distribution \(P_\varepsilon \) satisfies \(\textrm{TV}(P_{f_0}, P_\varepsilon )<\varepsilon \), and i.i.d. samples, \(X_1,\ldots ,X_n\), are generated from \(P_\varepsilon \).

The Huber contamination (5) satisfies Assumption (C).
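
For illustration, the snippet below draws a sample from the Huber contamination model (5); the target \(P_{f_0}=N_2(\varvec{0},I_2)\) and the contamination \(Q=N_2((10,10)^T,I_2)\) are arbitrary example choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def huber_sample(n, eps, d=2):
    """Draw n i.i.d. points from (1 - eps) * P_{f0} + eps * Q, where
    P_{f0} = N_d(0, I_d) and Q = N_d(10 * 1_d, I_d) serve as examples."""
    is_outlier = rng.random(n) < eps                 # Bernoulli(eps) labels
    clean = rng.standard_normal((n, d))              # draws from P_{f0}
    outlier = rng.standard_normal((n, d)) + 10.0     # draws from Q
    return np.where(is_outlier[:, None], outlier, clean)

X = huber_sample(n=500, eps=0.1)
print(X.mean(axis=0))   # the sample mean is pulled toward the outliers
```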

The model \({\mathcal {P}}_{{\mathcal {H}}}\) is regarded as a working hypothesis that explains the data generation process. Suppose that, under an ideal situation, the data is generated from the “true distribution” \(P_0\), which may not be included in \({\mathcal {P}}_{{\mathcal {H}}}\). If data contamination occurs, \(P_0\) may shift to the contaminated distribution \(P_{\varepsilon }\). On the assumption that \(\min _{P\in {\mathcal {P}}_{{\mathcal {H}}}}\textrm{TV}(P,P_0)<\varepsilon \) and \(\textrm{TV}(P_0,P_{\varepsilon }) < \varepsilon \), all the theoretical findings from this section hold for the distribution \(P_{f_0}\in {\mathcal {P}}_{{\mathcal {H}}}\) satisfying \(\textrm{TV}(P_{f_0},P_0) < \varepsilon \). The above discussion means that we do not need to assume that the model \({\mathcal {P}}_{{\mathcal {H}}}\) should exactly include the true distribution. As a working hypothesis, however, we still need to select an appropriate model \({\mathcal {P}}_{{\mathcal {H}}}\). For finite-dimensional models, robust model-selection methods have been studied by [68, 69]. This paper does not address developing practical methods for robust model selection.

In the sequel, we consider the estimator using the STV distance, which is called STV learning. All proofs of theorems in this section are deferred to Appendix B.

4.1 Minimum STV distance estimators

For the RKHSs \({\mathcal {H}}\) and \(\widetilde{{\mathcal {H}}}\) such that \({\mathcal {H}}\subset \widetilde{{\mathcal {H}}}\) and \(\dim {\mathcal {H}}=d\le {\widetilde{d}}=\dim \widetilde{{\mathcal {H}}}<\infty \), let us consider the STV distance to obtain a robust estimator,

$$\begin{aligned} \inf _{P_f\in {\mathcal {P}}_{{\mathcal {H}}}}\textrm{STV}_{\widetilde{{\mathcal {H}}}_U, \varvec{1}}(P_f, P_n)\ \longrightarrow \ {\widehat{f}}, \end{aligned}$$
(6)

where \(P_n\) is the empirical distribution of the data. Note that \(\textrm{STV}_{\widetilde{{\mathcal {H}}}_U,\varvec{1}}(P_f, P_n)\) does not necessarily match the TV distance between \(P_f\) and \(P_n\), since usually \(P_n\in {\mathcal {P}}_{{\mathcal {H}}}\) does not hold. Finding \({\widehat{f}}\) is not computationally feasible because of the non-differentiable indicator function \(\varvec{1}\); nevertheless, let us consider the estimation accuracy of \({\widehat{f}}\). The TV distance between the target distribution and the estimated one is evaluated as follows.

Theorem 9

Assume Assumption (C). Suppose that the dimension \({\widetilde{d}}=\dim \widetilde{{\mathcal {H}}}\) is finite. The TV distance between the target distribution and the estimator \(P_{{\widehat{f}}}\) given by (6) is bounded above by

$$\begin{aligned} \textrm{TV}(P_{f_0},P_{{\widehat{f}}}) \lesssim \varepsilon + \sqrt{\frac{{\widetilde{d}}}{n}} + \sqrt{\frac{\log (1/\delta )}{n}} \end{aligned}$$

with probability greater than \(1-\delta \).

The STV learning with \(\widetilde{{\mathcal {H}}}={\mathcal {H}}\) attains a tighter error bound, with \(\sqrt{d/n}\) in place of \(\sqrt{{\widetilde{d}}/n}\). In what follows, we assume \({\mathcal {H}}=\widetilde{{\mathcal {H}}}\).

Next, let us consider the estimator using \(\textrm{STV}_{{\mathcal {H}}_U,\sigma }\), where \(\sigma \) is the function satisfying Assumption (B):

$$\begin{aligned} \min _{f\in {\mathcal {H}}} \textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_f,P_n) \longrightarrow P_{{\widehat{f}}}. \end{aligned}$$
(7)

A typical example of \(\sigma \) is the sigmoid function, which leads to a computationally tractable learning algorithm. The optimization problem is written as the min–max problem,

$$\begin{aligned} \min _{f\in {\mathcal {H}}}\max _{u\in {\mathcal {H}}_U, b\in {\mathbb {R}}} {\mathbb {E}}_{P_f}[\sigma (u(X)-b)]-{\mathbb {E}}_{P_n}[\sigma (u(X)-b)]. \end{aligned}$$

When \({\mathcal {H}}\) is a finite-dimensional RKHS, the objective function \({\mathbb {E}}_{P_f}[\sigma (u(X)-b)]-{\mathbb {E}}_{P_n}[\sigma (u(X)-b)]\) is differentiable with respect to the finite-dimensional parameters of the model under mild assumptions.
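
To make the min–max formulation concrete, the sketch below is a rough illustration under strong simplifications, not the procedure analyzed in this paper: it estimates the location of a Gaussian model \(N_d(\theta ,I_d)\) (cf. Example 6), uses the linear discriminator \(u({\varvec{x}})={\varvec{w}}^T{\varvec{x}}\) with \(\Vert {\varvec{w}}\Vert _2\le U\), approximates the model expectation \({\mathbb {E}}_{P_\theta }[\sigma (u(X)-b)]\) by reparameterized Monte Carlo samples \(X=\theta +Z\) with \(Z\sim N_d(\varvec{0},I_d)\), and uses alternating gradient steps with ad hoc step sizes and iteration counts.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def stv_gaussian_location(X, U=20.0, outer=1500, inner=5, m=2000,
                          lr_disc=0.2, lr_gen=0.05):
    """Alternating gradient sketch of
       min_theta max_{||w|| <= U, b}  E_{N(theta, I)}[sigmoid(w^T X - b)]
                                      - E_{P_n}[sigmoid(w^T X - b)],
    with E_{N(theta, I)} approximated by X = theta + Z, Z ~ N(0, I)."""
    n, d = X.shape
    theta = X.mean(axis=0)                 # start from the (non-robust) mean
    w = rng.standard_normal(d)
    b = float(w @ theta)
    Z = rng.standard_normal((m, d))        # fixed Monte Carlo draws
    for _ in range(outer):
        model = theta + Z
        for _ in range(inner):             # discriminator ascent on (w, b)
            gm, gd = dsigmoid(model @ w - b), dsigmoid(X @ w - b)
            w += lr_disc * ((gm[:, None] * model).mean(0) - (gd[:, None] * X).mean(0))
            b += lr_disc * (-gm.mean() + gd.mean())
            norm = np.linalg.norm(w)
            if norm > U:                   # projection onto the ball of radius U
                w *= U / norm
        # generator descent on theta; only the model term depends on theta
        theta -= lr_gen * dsigmoid(model @ w - b).mean() * w
    return theta

# 10% Huber contamination of N((1, 1), I_2).
X = np.vstack([rng.standard_normal((450, 2)) + 1.0,
               rng.standard_normal((50, 2)) + 10.0])
print("sample mean :", X.mean(axis=0))            # pulled toward the outliers
print("STV estimate:", stv_gaussian_location(X))  # intended to stay near (1, 1)
```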

In the same way as Theorem 9, we can see that the estimator (7) satisfies

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\widehat{f}}})&= \underbrace{ \textrm{TV}(P_{f_0}, P_{{\widehat{f}}}) - \textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_{f_0}, P_{{\widehat{f}}})}_{\text {bias}} + \textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_{f_0}, P_{{\widehat{f}}}) \nonumber \\&\lesssim \text {bias} + \varepsilon +\sqrt{\frac{d}{n}} \end{aligned}$$
(8)

with a high probability. We need to control the bias term to guarantee the convergence of \(\textrm{TV}(P_{f_0}, P_{{\widehat{f}}})\). For that purpose, we introduce regularization into the estimator (7).

4.2 Regularized STV learning

When we use learning with the STV distance, a bias term appears in the upper bound of the estimation error. In order to control the bias term, let us consider learning with regularization,

$$\begin{aligned} \min _{f\in {\mathcal {H}}_r}\textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_f,P_n)\ \ \longrightarrow {\widehat{f}}_r, \end{aligned}$$
(9)

where f is restricted to \({\mathcal {H}}_r\) with a positive constant r that possibly depends on the sample size n. When the volume of \({\mathcal {X}}\), i.e., \(\int _{{\mathcal {X}}}\textrm{d}\mu \), is finite, the regularized \(\textrm{STV}\)-based learning,

$$\begin{aligned} \min _{f\in {\mathcal {H}}} \textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_f,P_n) +\frac{1}{r^2} \Vert f\Vert ^2 \ \longrightarrow \ {\widehat{f}}_{\textrm{reg},r}, \end{aligned}$$
(10)

leads to a solution that is similar to \({\widehat{f}}_r\). Indeed, under Assumption (B) for \(\sigma \), we have

$$\begin{aligned} \frac{1}{r^2} \Vert {\widehat{f}}_{\textrm{reg},r}\Vert ^2&\le \textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_{{\widehat{f}}_{\textrm{reg},r}},P_n) +\frac{1}{r^2} \Vert {\widehat{f}}_{\textrm{reg},r}\Vert ^2\\&\le \textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_{0},P_n) +\frac{1}{r^2}\Vert 0\Vert ^2\le 1, \end{aligned}$$

where \(P_0=p_0\textrm{d}\mu \) is the uniform distribution on \({\mathcal {X}}\) with respect to \(\mu \). Therefore, we have \(\Vert {\widehat{f}}_{\textrm{reg},r}\Vert \le r\). The following theorem shows the estimation accuracy of regularized STV learning.

Theorem 10

Assume Assumption (C). Let \(\sigma \) be the sigmoid function, which satisfies Assumption (B). Suppose that the dimension \(d=\dim {\mathcal {H}}\) is finite. We assume that \(\Vert f_0\Vert \le r\) and \(U\ge 2r\). Then, the TV distance between the target distribution and the above regularized estimators is bounded above by

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\widehat{f}}_r})&\lesssim \frac{r}{U}+ \varepsilon + \sqrt{\frac{d}{n}} + \sqrt{\frac{\log (1/\delta )}{n}},\\ \textrm{TV}(P_{f_0}, P_{{\widehat{f}}_{\textrm{reg},r}})&\lesssim \frac{r}{U}+ \varepsilon + \sqrt{\frac{d}{n}} + \frac{\Vert f_0\Vert ^2}{r^2}+\sqrt{\frac{\log (1/\delta )}{n}} \end{aligned}$$

with probability greater than \(1-\delta \).

Let us consider the choice of r and U. If r/U is of the order \(O(\sqrt{d/n})\), the convergence rate of \(\textrm{TV}(P_{f_0}, P_{{\widehat{f}}_r})\) is \(O_p(\varepsilon +\sqrt{d/n})\). This is easily realized by setting \(U=r\sqrt{n/d}\) and taking \(r=r_n\) to be any sequence increasing to infinity as \(n\rightarrow \infty \). On the other hand, the order of \(1/r^2\) appears in the upper bound of \(\textrm{TV}(P_{f_0}, P_{{\widehat{f}}_{\textrm{reg},r}})\). By setting \(r\ge n^{1/4}\) and \(U\ge r\sqrt{n}\), we find that \(\textrm{TV}(P_{f_0}, P_{{\widehat{f}}_{\textrm{reg},r}}) = O_p(\varepsilon +\sqrt{d/n})\). Therefore, with an appropriate setting of U and r, both \(\textrm{TV}(P_{f_0}, P_{{\widehat{f}}_r})\) and \(\textrm{TV}(P_{f_0}, P_{{\widehat{f}}_{\textrm{reg},r}})\) attain the order of \(\varepsilon +\sqrt{d/n}\).

Corollary 11

Assume the assumptions of Theorem 10, and let \(d=\dim {\mathcal {H}}\). By setting \(r\ge n^{1/4}\) and \(U\ge r\sqrt{n}\), the inequalities

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\widehat{f}}_r}) \lesssim \varepsilon +\sqrt{\frac{d}{n}},\quad \text {and}\quad \textrm{TV}(P_{f_0}, P_{{\widehat{f}}_{\textrm{reg},r}}) \lesssim \varepsilon +\sqrt{\frac{d}{n}} \end{aligned}$$

hold with a high probability.

Furthermore, we consider the STV learning in which the constraint \(u\in {\mathcal {H}}_U\) in \(\textrm{STV}_{{\mathcal {H}}_U,\sigma }\) is replaced with the regularization term,

$$\begin{aligned}&\min _{f\in {\mathcal {H}}} \max _{u\in {\mathcal {H}},b\in {\mathbb {R}}} {\mathbb {E}}_{P_f}[\sigma (u(X)-b)]- {\mathbb {E}}_{P_n}[\sigma (u(X)-b)] -\frac{1}{U^2} \Vert u\Vert ^2 +\frac{1}{r^2} \Vert f\Vert ^2\nonumber \\&\longrightarrow \ {\check{f}}_{\textrm{reg},r}. \end{aligned}$$
(11)

Let \({\check{u}}\in {\mathcal {H}}\) and \({\check{b}}\in {\mathbb {R}}\) be the optimal solution in the inner maximum problem for a fixed f. Then, we have

$$\begin{aligned} \frac{1}{U^2}\Vert {\check{u}}\Vert ^2 \le {\mathbb {E}}_{P_f} [\sigma ({\check{u}}(X)-{\check{b}})] - {\mathbb {E}}_{P_n}[\sigma ({\check{u}}(X)-{\check{b}})] \le 1. \end{aligned}$$

The first inequality is obtained by comparing \(u={\check{u}}\) and \(u=0\). In the same way as (10), the norm of \({\check{f}}_{\textrm{reg},r}\) is bounded above as follows.

$$\begin{aligned}&\frac{1}{r^2}\Vert {\check{f}}_{\textrm{reg},r}\Vert ^2\\&\quad \le \frac{1}{r^2}\Vert {\check{f}}_{\textrm{reg},r}\Vert ^2 +\max _{u,b}{\mathbb {E}}_{P_{{\check{f}}_{\textrm{reg},r}}}[\sigma (u(X)-b)]- {\mathbb {E}}_{P_n}[\sigma (u(X)-b)]-\frac{1}{U^2}\Vert u\Vert ^2\\&\quad \le \max _{u,b}{\mathbb {E}}_{P_{0}}[\sigma (u(X)-b)]- {\mathbb {E}}_{P_n}[\sigma (u(X)-b)]-\frac{1}{U^2}\Vert u\Vert ^2\qquad (\text {setting }f=0)\\&\quad \le 1. \end{aligned}$$

Thus, we have \(\Vert {\check{f}}_{\textrm{reg},r}\Vert \le r\) and \(\Vert {\check{u}}\Vert \le U\).

Theorem 12

Assume Assumption (C). Let \(\sigma \) be the sigmoid function, which satisfies Assumption (B). Suppose that the dimension \(d=\dim {\mathcal {H}}\) is finite. We assume that \(\Vert f_0\Vert \le r\) and \(U\ge 2r\). The TV distance between the target distribution and the regularized estimator \({\check{f}}_{\textrm{reg},r}\) in (11) is bounded above as follows,

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\check{f}}_{\textrm{reg},r}})&\lesssim \left( \frac{r}{U}\right) ^{2/3} +\frac{\Vert f_0\Vert ^2}{r^2} +\varepsilon + \sqrt{\frac{d}{n}}+\sqrt{\frac{\log (1/\delta )}{n}}. \end{aligned}$$

with probability greater than \(1-\delta \).

Corollary 13

Assume the assumptions in Theorem 12. The estimator \({\check{f}}_{\textrm{reg},r}\) with \(r\ge n^{1/4}\) and \(U=rn^{3/4}\) satisfies

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\check{f}}_{\textrm{reg},r}}) \lesssim \varepsilon +\sqrt{\frac{d}{n}} \end{aligned}$$

with a high probability for a large n.

The regularization parameters \(r=n^{1/4}\) and \(U=n\) satisfy the assumptions of the above corollary.

The above convergence analysis shows that the regularization for the “discriminator” \(u\in {\mathcal {H}}_U\) should be weaker than that for the “generator” \(f\in {\mathcal {H}}_r\), i.e., \(r<U\) is preferable to achieve a high prediction accuracy. This is because the bias term induced by STV distance is bounded above by r/U up to a constant factor. The standard theoretical analysis of the GAN does not take the bias of the surrogate loss into account. That is, the loss function used in the probability density estimation is again used to evaluate the estimation error.

In our setting, we assume that the expectation of \(\sigma (u(X)-b)\) for \(X\sim P_f\) is exactly computed. In Sect. 6, we consider the approximation of the expectation by the Monte Carlo method. We can evaluate the required sample size of the Monte Carlo method.

4.3 Regularized STV learning for infinite-dimensional kernel exponential family

In this section, we consider the regularized STV learning for infinite dimensional RKHS. In the definition of the STV distance, suppose that the range of the bias term b is restricted to \(|{b}|\le U\) when we deal with infinite-dimensional models. For the kernel exponential family \({\mathcal {P}}_{{\mathcal {H}}}\) defined from the infinite-dimensional RKHS \({\mathcal {H}}\), we employ the regularized STV learning, (9), (10), and (11) to estimate the probability density \(p_{f_0}\) using contaminated samples.

Theorem 14

Assume Assumption (C). Let \(\sigma \) be the sigmoid function, which satisfies Assumption (B). Suppose \(\sup _{x}k(x,x)\le 1\). Then, for \(\Vert f_0\Vert \le r\), the estimation error bounds of the regularized STV learning, (9), (10), and (11), are given by

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\widehat{f}}_r})&\lesssim \frac{r}{U} + \varepsilon + \frac{U}{\sqrt{n}} + \sqrt{\frac{\log (1/\delta )}{n}}, \\ \textrm{TV}(P_{f_0}, P_{{\widehat{f}}_{\textrm{reg},r}})&\lesssim \frac{r}{U} + \varepsilon + \frac{U}{\sqrt{n}} + \sqrt{\frac{\log (1/\delta )}{n}}+ \frac{\Vert f_0\Vert ^2}{r^2}, \\ \textrm{TV}(P_{f_0}, P_{{\check{f}}_{\textrm{reg},r}})&\lesssim \left( \frac{r}{U}\right) ^{2/3} + \varepsilon + \frac{U}{\sqrt{n}} + \sqrt{\frac{\log (1/\delta )}{n}}+ \frac{\Vert f_0\Vert ^2}{r^2} \end{aligned}$$

with probability greater than \(1-\delta \).

The proof is shown in Appendix B.4.

The model complexity of the infinite-dimensional model is bounded above by \(U/\sqrt{n}\) in the upper bound in Theorem 14, while that of the d-dimensional model is bounded above by \(\sqrt{d/n}\). Because of that, we can find the optimal order of U.

Corollary 15

Assume the assumption of Theorem 14. Then,

$$\begin{aligned}&\textrm{TV}(P_{f_0}, P_{{\widehat{f}}_r}) \lesssim \varepsilon +\frac{1}{n^{1/4}},\quad \textrm{TV}(P_{f_0}, P_{{\widehat{f}}_{\textrm{reg},r}})\lesssim \varepsilon +\frac{1}{n^{1/5}},\\&\text {and}\ \ \textrm{TV}(P_{f_0}, P_{{\check{f}}_{\textrm{reg},r}})\lesssim \varepsilon +\frac{1}{n^{1/6}} \end{aligned}$$

hold with a high probability, where the poly-log order is omitted in \(\textrm{TV}(P_{f_0}, P_{{\widehat{f}}_r})\).

Remark 3

Using the localization technique introduced in Chapter 14 of [70], one can obtain a detailed convergence rate for infinite-dimensional exponential families. Suppose that the kernel function \(k(x,x')\) is expanded as \(k(x,x')=\sum _{i=1}^{\infty }\mu _i \phi _i(x)\phi _i(x')\), where \(\phi _i,i=1,\ldots ,\) form an orthonormal basis of the space of square-integrable functions \(L^2(P_{f_0})\). Suppose that \(\mu _i\) is of the order \(1/i^p,\,i=1,2,\ldots \). Using Theorem 14.20 in [70] with a minor modification, we find that

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\widehat{f}}_r})&\lesssim \varepsilon + \frac{r}{U}+\delta _n U, \ \ \delta _n = U^{\frac{1}{p+1}}n^{-\frac{p}{2(p+1)}} \end{aligned}$$

with probability greater than \(1-\exp \{-cn\delta _n^2\}\), where c is a positive constant. Setting \(U=n^{\frac{p}{4p+6}}\) for a fixed r, we have \(\textrm{TV}(P_{f_0}, P_{{\widehat{f}}_r})\lesssim \varepsilon + n^{-\frac{p}{4p+6}}\) with probability greater than \(1-\exp \{-cn^{\frac{3}{2p+3}}\}\).

5 Accuracy of parameter estimation

This section is devoted to studying the estimation accuracy in the parameter space. The proofs are deferred to Appendix C.

5.1 Estimation error in RKHS

So far, we have considered the estimation error in terms of the TV distance. In this section, we derive the estimation error in the RKHS norm. Such an evaluation corresponds to the estimation error of a finite-dimensional parameter in the Euclidean space. A lower bound of the TV distance is shown in the following lemma.

Lemma 16

For the RKHS \({\mathcal {H}}\) with the kernel function \(k:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}\), we assume \(\sup _{x\in {\mathcal {X}}}k(x,x)\le 1\). Suppose that \(\int _{{\mathcal {X}}}\textrm{d}\mu =1\). Then, for \(f,g\in {\mathcal {H}}_r\), it holds that

$$\begin{aligned} \textrm{TV}(P_f,P_g) \ge \frac{\Vert f-g\Vert }{32r e^{2r}}\inf _{\begin{array}{c} s\in {\mathcal {H}},\Vert s\Vert =1 \end{array}}\int |s(x)-{\mathbb {E}}_{P_0}[s]|^2\textrm{d}\mu , \end{aligned}$$

where \(P_0\) is the probability measure \(p_f\textrm{d}\mu \) with \(f=0\), i.e., the uniform distribution \(p_0=1\) on \({\mathcal {X}}\) w.r.t. the measure \(\mu \).

For the RKHS \({\mathcal {H}}\), define

$$\begin{aligned} \xi ({\mathcal {H}}):=\inf _{\begin{array}{c} s\in {\mathcal {H}},\Vert s\Vert =1 \end{array}}\int |s(x)-{\mathbb {E}}_{P_0}[s]|^2\textrm{d}\mu . \end{aligned}$$

As shown in Lemma 16, the RKHS norm \(\Vert f-g\Vert \) is bounded above by the total variation distance divided by \(\xi ({\mathcal {H}})\), up to a constant factor depending on r.

We consider the estimation error in the RKHS norm. Let \({\mathcal {H}}_1,{\mathcal {H}}_2,\ldots ,{\mathcal {H}}_d,\ldots \) be a sequence of RKHSs such that \(\dim {\mathcal {H}}_d=d\). Our goal is to prove that the estimation error of the estimator \({\widehat{f}}_r\) in (9) for the statistical model \({\mathcal {P}}_{{\mathcal {H}}_{d,r}}\) is of the order \(\varepsilon +\sqrt{d/n}\), where \({\mathcal {H}}_{d,r}=({\mathcal {H}}_{d})_r=\{f\in {\mathcal {H}}_d\,:\,\Vert f\Vert \le r\}\).

Corollary 17

For a positive constant r, let us consider the finite-dimensional statistical model \({\mathcal {P}}_{{\mathcal {H}}_{d,r}},\, d=1,2,\ldots \) such that \(\inf _{d\in {\mathbb {N}}}\xi ({\mathcal {H}}_d)>0\). Assume Assumption (C) with \({\mathcal {H}}={\mathcal {H}}_{d,r}\) and Assumption (B) with the sigmoid function \(\sigma \). Suppose that \(f_0\in {\mathcal {H}}_{d,r}\). Then, the estimation error of the estimator \({\widehat{f}}_r\) in (9) with \(U\gtrsim \sqrt{n}\) is

$$\begin{aligned} \Vert {\widehat{f}}_r-f_0\Vert \lesssim \varepsilon +\sqrt{\frac{d}{n}} \end{aligned}$$

with a high probability.

The proof is shown in Appendix C.2.

When \(\xi ({\mathcal {H}}_d)\) is not bounded below by a positive constant, the estimation accuracy is of the order \(\varepsilon +\frac{1}{\xi ({\mathcal {H}}_d)}\sqrt{\frac{d}{n}}\), meaning that the dependence on the dimension d is \(\sqrt{d}/\xi ({\mathcal {H}}_d)\), which is greater than \(\sqrt{d}\).

If an upper bound of \(\Vert f_0\Vert \) is unknown, r is treated as a regularization parameter depending on n. In such a case, the coefficient of \(\varepsilon \) depends on n and diverges as \(n\rightarrow \infty \). In theoretical analyses, an upper bound of \(\Vert f_0\Vert \) is often assumed to be known, especially for covariance matrix estimation [5, 29, 30, 51].

Let us construct an example of a sequence \({\mathcal {H}}_1,{\mathcal {H}}_2,\ldots \) satisfying the assumption of Corollary 17. Let \(\{t_j(x)\}_{j=1}^{\infty }\) be orthonormal functions that are orthogonal to the constant function under \(\mu \), i.e.,

$$\begin{aligned} {\mathbb {E}}_{P_0}[t_i(x)]=0, \ \ {\mathbb {E}}_{P_0}[t_i(x)t_j(x)]= {\left\{ \begin{array}{ll} 1,\quad i=j,\\ 0,\quad i\ne {j}. \end{array}\right. } \end{aligned}$$

for \(i,j=1,2,\ldots \). For a positive decreasing sequence \(\{\lambda _k\}_{k=1}^{\infty }\) such that \(\lambda _\infty :=\lim _{k\rightarrow \infty }\lambda _k>0\), define the sequence of finite-dimensional RKHSs, \({\mathcal {H}}_1, {\mathcal {H}}_2,\ldots ,{\mathcal {H}}_d,\ldots \) by

$$\begin{aligned} {\mathcal {H}}_d=\bigg \{ f(x)=\sum _{j=1}^{d} \alpha _j t_j(x)\,:\,\alpha _1,\ldots ,\alpha _d\in {\mathbb {R}}\bigg \}, \end{aligned}$$
(12)

where the inner product \(\langle f,g\rangle \) for \(f=\sum _j\alpha _j t_j\) and \(g=\sum _j\beta _j t_j\) is defined by \(\langle f,g\rangle =\sum _{i}\alpha _i\beta _i/\lambda _i\). Here, \(t_1,\ldots ,t_d\) are the sufficient statistics of \({\mathcal {P}}_{{\mathcal {H}}_d}\). Then,

$$\begin{aligned} \xi ({\mathcal {H}}_d) =\inf _{\sum _{i=1}^{d}\alpha _i^2/\lambda _i=1}\sum _{i=1}^{d}\alpha _i^2 \ge \inf _{\sum _{i=1}^{d}\alpha _i^2/\lambda _i=1}\sum _{i=1}^{d}\alpha _i^2\frac{\lambda _\infty }{\lambda _i}=\lambda _\infty > 0. \end{aligned}$$

Example 6

The construction of the RKHS sequence allows a distinct sample space for each d. Let \(\mu _d\) be the d-dimensional standard normal distribution on \({\mathbb {R}}^d\). For \({\varvec{x}}=(x_1,\ldots ,x_d)\), define \(t_i({\varvec{x}})=x_i\), and let \(\lambda _i=1\) for all i. Then, \({\mathcal {H}}_d\) is the RKHS with the linear kernel \(k({\varvec{x}},{\varvec{z}})=\sum _{i=1}^{d}\lambda _i t_i({\varvec{x}})t_i({\varvec{z}})={\varvec{x}}^T{\varvec{z}}\) for \({\varvec{x}},{\varvec{z}}\in {\mathbb {R}}^d\), and the corresponding statistical model is the multivariate normal distribution \(N_d({\varvec{f}},I_d)\) with parameter \({\varvec{f}}\in {\mathbb {R}}^d\).
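In this example, the bound \(\xi ({\mathcal {H}}_d)\ge \lambda _\infty \) is attained with equality. Indeed, for \(s({\varvec{x}})={\varvec{\alpha }}^T{\varvec{x}}\) the RKHS norm is \(\Vert s\Vert =\Vert {\varvec{\alpha }}\Vert _2\) since \(\lambda _i=1\), and \({\mathbb {E}}_{P_0}[s]=0\), so that

$$\begin{aligned} \xi ({\mathcal {H}}_d) = \inf _{\Vert {\varvec{\alpha }}\Vert _2=1}\int |{\varvec{\alpha }}^T{\varvec{x}}|^2\textrm{d}\mu _d({\varvec{x}}) = \inf _{\Vert {\varvec{\alpha }}\Vert _2=1}\Vert {\varvec{\alpha }}\Vert _2^2 = 1, \end{aligned}$$

and the assumption \(\inf _{d\in {\mathbb {N}}}\xi ({\mathcal {H}}_d)>0\) of Corollary 17 holds.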

Remark 4

For an infinite-dimensional RKHS \({\mathcal {H}}\), typically \(\xi ({\mathcal {H}})=0\) holds. For example, let the kernel function be \(k(x,x')=\sum _{i=1}^{\infty }\lambda _i e_i(x)e_i(x')\) in the same way as above, where \(\{\lambda _i\}\) is a positive decreasing sequence such that \(\lambda _i\rightarrow 0\) as \(i\rightarrow \infty \). Since \(\sqrt{\lambda _i}e_i\) has unit RKHS norm, we have \(\xi ({\mathcal {H}})\le \int |\sqrt{\lambda _i}e_i(x)|^2\textrm{d}\mu =\lambda _i\rightarrow 0\) as \(i\rightarrow \infty \). Hence, Lemma 16 does not provide a meaningful upper bound of the estimation error for such an infinite-dimensional RKHS.

5.2 Robust estimation of normal distribution

Let us consider the robust estimation of parameters for the normal distribution.

First, we prove that regularized STV learning provides a robust estimator of the mean vector in the multi-dimensional normal distribution with the identity covariance matrix. For \({\mathcal {X}}={\mathbb {R}}^d\), let us define \(\textrm{d}\mu ({\varvec{x}})=C_d e^{-\Vert {\varvec{x}}\Vert _2^2/2}\textrm{d}{{\varvec{x}}}\) for \({\varvec{x}}\in {\mathbb {R}}^d\), where \(C_d\) is the normalizing constant depending on the dimension d such that \(\int _{{\mathbb {R}}^d}\textrm{d}\mu ({\varvec{x}})=1\). Let \({\mathcal {H}}\) be the RKHS with the kernel function \(k({\varvec{x}},{\varvec{z}})={\varvec{x}}^T{\varvec{z}}\) for \({\varvec{x}},{\varvec{z}}\in {\mathbb {R}}^d\). For the functions \(f({\varvec{x}})={\varvec{x}}^T{\varvec{f}}\) and \(g({\varvec{x}})={\varvec{x}}^T{\varvec{g}}\), the inner product in the RKHS is \(\langle f, g\rangle ={\varvec{f}}^T{\varvec{g}}\). The probability density of the d-dimensional multivariate normal distribution with the identity covariance matrix, \(N_d({\varvec{f}},I)\), with respect to the measure \(\mu \) is given by \(p_f({\varvec{x}})=\exp \{{\varvec{x}}^T{\varvec{f}}-\Vert {\varvec{f}}\Vert _2^2/2\}\). Corollary 11 guarantees that the mean vector \(\widehat{{\varvec{f}}}_{\textrm{reg},r}\) estimated by regularized STV learning satisfies

$$\begin{aligned} \textrm{TV}(N_d({\varvec{f}}_0,I), N_d(\widehat{{\varvec{f}}}_{\textrm{reg},r},I)) \lesssim \varepsilon + \sqrt{\frac{d}{n}}. \end{aligned}$$

A lower bound of the TV distance between the d-dimensional normal distributions, \(N_d({\varvec{\mu }}_1,I)\) and \(N_d({\varvec{\mu }}_2,I)\), is presented in [71],

$$\begin{aligned} \frac{\min \{1,\Vert {\varvec{\mu }}_1-{\varvec{\mu }}_2\Vert _2\}}{200}\le \textrm{TV}(N_d({\varvec{\mu }}_1,I), N_d({\varvec{\mu }}_2,I)). \end{aligned}$$

Therefore, noting that the minimum in the above lower bound is attained by \(\Vert {\varvec{\mu }}_1-{\varvec{\mu }}_2\Vert _2\) once the TV distance is bounded away from one, we have

$$\begin{aligned} \Vert {\varvec{f}}_0-\widehat{{\varvec{f}}}_{\textrm{reg},r}\Vert _2\lesssim \varepsilon + \sqrt{\frac{d}{n}} \end{aligned}$$
(13)

for a small \(\varepsilon \) and a large n. Though Corollary 17 provides the same conclusion, the above argument does not require the boundedness of \(\Vert {\varvec{f}}_0\Vert _2\). As shown in [5], a lower bound on the estimation error under the contaminated distribution is

$$\begin{aligned} \inf _{\widehat{\varvec{f}}} \sup _{ Q:\inf _{{\varvec{f}}}\textrm{TV}(N({\varvec{f}},I),Q)\le \varepsilon } Q \bigg (\Vert \widehat{\varvec{f}}-{\varvec{f}}\Vert _2\gtrsim \varepsilon + \sqrt{\frac{d}{n}}\bigg )>0. \end{aligned}$$

Therefore, the estimator \(\widehat{{\varvec{f}}}_{\textrm{reg},r}\) attains the minimax optimal rate.

Next, let us consider the estimation of the covariance matrix for the multivariate normal distribution with mean zero, \(N_d(\varvec{0},\Sigma )\). For \({\mathcal {X}}={\mathbb {R}}^d\), let us define \(\textrm{d}\mu ({\varvec{x}})=C_d e^{-\Vert {\varvec{x}}\Vert _2^2/2}\textrm{d}{{\varvec{x}}}\) for \({\varvec{x}}\in {\mathbb {R}}^d\) as above. Let \({\mathcal {H}}\) be the RKHS with the kernel function \(k({\varvec{x}},{\varvec{z}})=({\varvec{x}}^T{\varvec{z}})^2\) for \({\varvec{x}},{\varvec{z}}\in {\mathbb {R}}^d\). Let \(f({\varvec{x}})={\varvec{x}}^T F {\varvec{x}}\) and \(g({\varvec{x}})={\varvec{x}}^T G{\varvec{x}}\) be quadratic functions defined by symmetric matrices F and G. Their inner product in the RKHS is given by \(\langle f,g\rangle =\textrm{Tr}F^TG\). The probability density w.r.t. \(\mu \) is defined by \(p_f({\varvec{x}})=\exp \{-{\varvec{x}}^T F {\varvec{x}}/2-A(F)\}\) for \(f({\varvec{x}})={\varvec{x}}^T F{\varvec{x}}\), which leads to the statistical model of the normal distribution \(N_d(\varvec{0},(I+F)^{-1})\). Let \(P_{f_0}\) be the normal distribution \(N_d(\varvec{0}, \Sigma _0)\) and let \(P_{{\widehat{f}}}\) be the distribution \(N_d(\varvec{0}, \widehat{\Sigma })\) estimated by regularized STV learning. Here, the estimated function \({\widehat{f}}({\varvec{x}})={\varvec{x}}^T {\widehat{F}}{\varvec{x}}\) in the RKHS provides the estimator of the covariance matrix \(\widehat{\Sigma } = (I+{\widehat{F}})^{-1}\). Then,

$$\begin{aligned} \textrm{TV}(N_d(\varvec{0},\Sigma _0),\,N_d(\varvec{0},\widehat{\Sigma }))\lesssim \varepsilon + \sqrt{\frac{d^2}{n}} \end{aligned}$$

holds with high probability. As shown in [71], a lower bound of the TV distance between \(N_d(\varvec{0},\Sigma _0)\) and \(N_d(\varvec{0},\Sigma _1)\) is given by

$$\begin{aligned} \frac{\min \{1,\Vert \Sigma _0\Vert _{\textrm{op}}^{-1}\Vert \Sigma _1-\Sigma _0\Vert _{\textrm{F}}\}}{100}\le \textrm{TV}(N_d(\varvec{0},\Sigma _0), N_d(\varvec{0},\Sigma _1)), \end{aligned}$$

where \(\Vert \cdot \Vert _{\textrm{op}}\) is the operator norm defined as the maximum singular value and \(\Vert \cdot \Vert _{\textrm{F}}\) is the Frobenius norm. If \(\varepsilon +\sqrt{d^2/n}\lesssim 1\), we have

$$\begin{aligned} \Vert \Sigma _0-\widehat{\Sigma }\Vert _{\textrm{F}} \lesssim \varepsilon + \sqrt{\frac{d^2}{n}}, \end{aligned}$$

where \(\Vert \Sigma _0\Vert _{\textrm{op}}\) is regarded as a constant. The authors of [29] proved that the estimator \(\widehat{\Sigma }\) based on the GAN method attains the error bound in terms of the operator norm,

$$\begin{aligned} \Vert \Sigma _0-\widehat{\Sigma }\Vert _{\textrm{op}} \lesssim \varepsilon + \sqrt{\frac{d}{n}}. \end{aligned}$$

The relationship between the Frobenius norm and the operator norm leads to the inequality,

$$\begin{aligned} \Vert \Sigma _0-\widehat{\Sigma }\Vert _{\textrm{F}}\le \sqrt{d}\Vert \Sigma _0-\widehat{\Sigma }\Vert _{\textrm{op}} \lesssim \sqrt{d} \varepsilon + \sqrt{\frac{d^2}{n}}. \end{aligned}$$

A naive application of the result in [29] thus incurs an extra \(\sqrt{d}\) factor in the \(O(\varepsilon )\) term.

Table 1 Robust estimators of the covariance matrix for the multivariate normal distribution are compared: the matrix depth [5], f-GAN [29], W-GAN [30], and STV (our method)

As with the estimation of the mean vector, the estimator \(\widehat{\Sigma }\) attains minimax optimality. Indeed, the following theorem holds. The proof is shown in Appendix C.3.

Theorem 18

Let us consider the statistical model \({\mathcal {N}}_R=\{N_d(\varvec{0}, \Sigma ):\,\Sigma \in {\mathfrak {T}}_R\}\), where \({\mathfrak {T}}_R=\{\Sigma \in {\mathbb {R}}^{d\times d}:\,\Sigma \succ {O},\, \Vert \Sigma \Vert _{\textrm{op}}\le 1+R\}\) for a positive constant \(R>1/2\). Then,

$$\begin{aligned} \inf _{\widehat{\Sigma }}\sup _{Q:{\displaystyle \inf _{\Sigma \in {\mathfrak {T}}_R}}\!\! \textrm{TV}(N_d(\varvec{0},\Sigma ),Q)<\varepsilon } Q\bigg (\Vert \widehat{\Sigma }-\Sigma \Vert _{\textrm{F}} \gtrsim \varepsilon +\sqrt{\frac{d^2}{n}}\bigg )>0 \end{aligned}$$

holds for \(\varepsilon \le 1/2\). The above lower bound is valid even for the unconstrained model, i.e., \(R=\infty \).

The covariance estimator based on STV learning attains the minimax optimal rate for the Frobenius norm, while optimality in the operator norm has been studied in several works [5, 29, 30]. Table 1 summarizes the theoretical results of these works.

We present simple numerical results to confirm the feasibility of our method. We compare robust estimators for the mean vector and the full covariance matrix under Huber contamination. The expectation in the loss function is approximated by sampling from the normal distribution.

For the mean vector of the multivariate normal distribution, the model \(p_f({\varvec{x}})=\exp \{{\varvec{x}}^T{\varvec{f}}-\Vert {\varvec{f}}\Vert _2^2/2\}\) and the RKHS \({\mathcal {H}}_U\) endowed with the linear kernel are used. The data is generated from \(0.9 N_d(\varvec{0}, I)+0.1 N_d(5\cdot {\varvec{e}},I)\), and the target is to estimate the mean vector of \(N_d(\varvec{0}, I)\), where \({\varvec{e}}\) is the d-dimensional vector \((1,1,\ldots ,1)^T\in {\mathbb {R}}^d\). We examine the estimation accuracy of the component-wise median, the GAN-based estimator using the Jensen–Shannon loss, and the minimum TV distance estimator. We used the regularized STV-based estimator (11) with the gradient descent/ascent method to solve the min–max optimization problem. The regularization parameters are set to \(1/U^2=10^{-4}\) and \(1/r^2=3\times 10^{-5}\). The results are presented in Tables 2 and 3. As the theoretical analysis in [5] shows, the component-wise median is sub-optimal. The TV-based estimator is not efficient, partially because of the difficulty of solving the optimization problem. As smoothed variants of the TV-based estimator, the GAN-based method and our method provide reasonable results.
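For concreteness, the following PyTorch sketch illustrates a gradient descent/ascent scheme of the type used in the mean-vector experiment. It assumes that the regularized objective adds ridge penalties \(\Vert u\Vert ^2/U^2\) and \(\Vert f\Vert ^2/r^2\) to the smoothed TV objective and approximates the model expectation by reparameterized sampling from \(N_d({\varvec{f}},I)\); the exact form of (11), the learning rates, and the iteration counts may differ from those used in our experiments.

```python
# Minimal sketch of regularized STV learning for the mean vector of N_d(f, I),
# assuming ridge penalties (1/U^2)||u||^2 and (1/r^2)||f||^2 on the smoothed TV
# objective; learning rates and iteration counts are illustrative.
import torch

torch.manual_seed(0)
d, n, eps, m = 50, 1000, 0.1, 1000

# Huber-contaminated data: 0.9 N(0, I) + 0.1 N(5*e, I); the target mean is 0.
X = torch.randn(n, d)
X[: int(eps * n)] += 5.0

inv_U2, inv_r2 = 1e-4, 3e-5                     # 1/U^2 and 1/r^2 as reported in the text
f = torch.zeros(d, requires_grad=True)          # location parameter to estimate
u = (0.01 * torch.randn(d)).requires_grad_()    # linear discriminator u(x) = u^T x
b = torch.zeros(1, requires_grad=True)
opt_f = torch.optim.Adam([f], lr=0.05)
opt_ub = torch.optim.Adam([u, b], lr=0.05)

def smoothed_tv(f_vec):
    """Monte Carlo estimate of E_{P_n}[sigma(u^T X - b)] - E_{N(f,I)}[sigma(u^T X - b)],
    using the reparameterization X = f + Z with Z ~ N(0, I)."""
    Z = f_vec + torch.randn(m, d)
    return torch.sigmoid(X @ u - b).mean() - torch.sigmoid(Z @ u - b).mean()

for step in range(3000):
    # ascent step on the discriminator (u, b)
    opt_ub.zero_grad()
    (-(smoothed_tv(f.detach()) - inv_U2 * u.dot(u))).backward()
    opt_ub.step()
    # descent step on the location parameter f
    opt_f.zero_grad()
    (smoothed_tv(f) + inv_r2 * f.dot(f)).backward()
    opt_f.step()

print("estimation error:", f.detach().norm().item())   # ||f_hat - f_0||_2 with f_0 = 0
```

The covariance-matrix experiment below follows the same scheme with the quadratic kernel and a matrix-valued parameter.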

Table 2 Estimation of the mean vector for \(N_d({\varvec{f}},I)\) with \(d=50\) and the contamination ratio \(\varepsilon =0.1\)
Table 3 Estimation of the mean vector for \(N_d({\varvec{f}},I)\) with the sample size \(n=1000\) and the contamination ratio \(\varepsilon =0.1\)

For the estimation of the full covariance matrix of the multivariate normal distribution, the model \(p_f({\varvec{x}})=\exp \{{\varvec{x}}^T F {\varvec{x}}-A(F)\}\) and the RKHS \({\mathcal {H}}_U\) with the quadratic kernel are used. The data is generated from \(0.8 N_d(\varvec{0},\Sigma )+0.2 N_d(6\cdot {\varvec{e}}, \Sigma )\), where \(\Sigma _{ij}=2^{-|{i-j}|}\). We examined the STV-based estimator, the estimator based on Kendall's rank correlation coefficient [72], and the GAN-based method [29]. We used the regularized STV-based estimator (11) with the gradient descent/ascent method to solve the min–max optimization problem. The regularization parameters are set to \(1/U^2=10^{-4}\) and \(1/r^2=10^{-4}\). The results are presented in Table 4. Overall, the GAN-based estimator outperforms the other methods. This is partly because the optimization techniques for GAN-based estimators using deep neural networks are highly developed in comparison with those for the STV-based estimator. In the STV-based method, the optimization algorithm sometimes gets stuck in a sub-optimal solution even though the objective function is smooth. Developing an efficient algorithm for the STV-based method is an important direction for future work.

Table 4 Median of the Frobenius norm with median absolute deviation

6 Approximation of expectation and learning algorithm

Let us consider how to solve the min–max problem of regularized STV learning. The optimization problem is given by

$$\begin{aligned} \sup _{f\in {\mathcal {H}}_r} \inf _{u\in {\mathcal {H}}_U,|{b}|\le U} {\mathbb {E}}_{P_f}[\sigma (u(X)-b)] - {\mathbb {E}}_{P_n}[\sigma (u(X)-b)], \end{aligned}$$

where \(\sigma \) is the sigmoid function. For optimization problems over an infinite-dimensional RKHS, the representer theorem usually applies to reduce the infinite-dimensional optimization problem to a finite-dimensional one. However, the representer theorem does not apply to the above optimization problem, since the expectation w.r.t. \(P_f, f\in {\mathcal {H}}_r\), appears; that is, the loss function may depend on f(x) for all \(x\in {\mathcal {X}}\). Even for a finite-dimensional RKHS, the computation of the expectation is often intractable. Here, we use importance sampling to overcome this difficulty.

We approximate the expectation w.r.t. \(P_f\) by the sample mean over \(Z_1,\ldots ,Z_\ell \sim q\), where q is an arbitrary density such that sampling from q and the computation of the density value q(z) are feasible. A simple example is \(q=p_0\), i.e., the uniform distribution on \({\mathcal {X}}\) w.r.t. the base measure \(\mu \). Let us define \(\sigma _X:=\sigma (u(X)-b)\). The expectation \({\mathbb {E}}_{X\sim P_{f}}[\sigma (u(X)-b)]\) is approximated by \({\bar{\sigma }}_\ell \),

$$\begin{aligned} {\bar{\sigma }}_\ell := \frac{\frac{1}{\ell }\sum _{j=1}^{\ell } \frac{e^{f(Z_j)}}{q(Z_j)}\sigma _{Z_j}}{\frac{1}{\ell }\sum _{j=1}^{\ell } \frac{e^{f(Z_j)}}{q(Z_j)}} \approx {\mathbb {E}}_{Z\sim q}\bigg [\frac{p_f(Z)}{q(Z)}\sigma _Z\bigg ] = {\mathbb {E}}_{X\sim P_{f}}[\sigma (u(X)-b)]. \end{aligned}$$
(14)

As an approximation of \(P_f\), we employ the discrete probability distribution supported on \(Z_1,\ldots ,Z_\ell \) defined by

$$\begin{aligned} {\widehat{P}}_f(Z=Z_i)=\frac{e^{f(Z_i)}/q(Z_i)}{\sum _{j=1}^{\ell }e^{f(Z_j)}/q(Z_j)},\ \ Z_i\sim _{i.i.d.} q, \quad i=1,\ldots ,\ell . \end{aligned}$$
(15)
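To make (14) and (15) concrete, the following NumPy sketch computes the self-normalized estimate \({\bar{\sigma }}_\ell \) for the linear-kernel model \(p_f({\varvec{x}})=\exp \{{\varvec{x}}^T{\varvec{f}}-\Vert {\varvec{f}}\Vert _2^2/2\}\) with the proposal \(q=p_0\), i.e., sampling from the Gaussian base measure \(\mu \); the function names are ours, and any tractable q can be substituted.

```python
# Self-normalized importance-sampling approximation (14)-(15) of
# E_{X ~ P_f}[sigma(u(X) - b)] for the linear-kernel model P_f = N(f, I),
# with q = p_0, i.e., Z_j drawn from the Gaussian base measure mu = N(0, I).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigma_bar(f, u, b, ell=10_000):
    d = len(f)
    Z = rng.standard_normal((ell, d))        # Z_j ~ q (density 1 w.r.t. mu)
    # weights e^{f(Z_j)}/q(Z_j) = exp(Z_j^T f); the normalizing constant of p_f
    # cancels in the self-normalization; subtract the max for numerical stability
    log_w = Z @ f
    w = np.exp(log_w - log_w.max())
    sigma_Z = sigmoid(Z @ u - b)
    return np.sum(w * sigma_Z) / np.sum(w)   # the estimate \bar{sigma}_ell in (14)

# sanity check against direct sampling from P_f = N(f, I)
d = 5
f, u, b = np.full(d, 0.3), rng.standard_normal(d), 0.1
direct = sigmoid(rng.standard_normal((10_000, d)) @ u + f @ u - b).mean()
print(sigma_bar(f, u, b), direct)            # the two values should be close
```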

Then, an approximation of the estimator \({\widehat{f}}_r\) is given by the minimizer of \(\textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_n,{\widehat{P}}_f)\), i.e.,

$$\begin{aligned} \min _{f\in {\mathcal {H}}_r}\textrm{STV}_{{\mathcal {H}}_U,\sigma }(P_n,{\widehat{P}}_f)\ \longrightarrow \ {\widetilde{f}}_r. \end{aligned}$$

The approximate estimator \({\widetilde{f}}_r\) is usable regardless of the dimensionality of the model \({\mathcal {H}}\). Let us evaluate the error bound of the corresponding distribution \(P_{{\widetilde{f}}_r}\).

Theorem 19

Assume Assumption (C). Suppose that \(\sup _{x\in {\mathcal {X}}}k(x,x)\le K^2\). Let us consider the approximate estimator \(P_{{\widetilde{f}}_r}\) obtained by regularized STV learning using \(\textrm{STV}_{{\mathcal {H}}_U,\sigma }\), where \(\sigma \) is the sigmoid function satisfying Assumption (B) and the constraint \(|{b}|\le U\) is imposed on the STV distance. Then, the estimation error of \(P_{{\widetilde{f}}_r}\) is

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\widetilde{f}}_r}) \lesssim \varepsilon + \frac{r}{U} + \frac{C_{{{\mathcal {H}}_U}}}{\sqrt{n}} + \frac{e^{Kr}(r+U)}{\sqrt{\ell }} + \sqrt{\log \frac{1}{\delta }}\left( \frac{1}{\sqrt{n}}+\frac{e^{Kr}}{\sqrt{\ell }}\right) \end{aligned}$$

with probability greater than \(1-\delta \), where \(C_{{\mathcal {H}}_U}=UK\) for the infinite-dimensional \({\mathcal {H}}_U\) and \(C_{{\mathcal {H}}_U}=\sqrt{d}\) for the finite-dimensional \({\mathcal {H}}_U\) with \(d=\dim {\mathcal {H}}\).

For a finite-dimensional RKHS \({\mathcal {H}}\) of dimension d endowed with a bounded kernel, the estimator \(P_{{\widetilde{f}}_r}\) with \(U=r\sqrt{n}\) and \(\ell =n^2 r^2 e^{2Kr}\) attains the convergence rate

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\widetilde{f}}_r})\lesssim \varepsilon +\sqrt{\frac{d}{n}}. \end{aligned}$$
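Indeed, with \(C_{{\mathcal {H}}_U}=\sqrt{d}\) and treating r and K as constants, these choices give

$$\begin{aligned} \frac{r}{U}=\frac{1}{\sqrt{n}},\qquad \frac{e^{Kr}(r+U)}{\sqrt{\ell }}=\frac{r+r\sqrt{n}}{nr}\lesssim \frac{1}{\sqrt{n}},\qquad \frac{e^{Kr}}{\sqrt{\ell }}=\frac{1}{nr}\lesssim \frac{1}{\sqrt{n}}, \end{aligned}$$

so every term in the bound of Theorem 19 other than \(\varepsilon \) is at most of the order \(\sqrt{d/n}\).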

We expect that a tighter bound for the approximation will lead to a more reasonable sample size such as \(\ell =O(n)\).

For an infinite-dimensional RKHS, the estimator \(P_{{\widetilde{f}}_r}\) with \(\ell =n,\, U=n^{1/4}\), and \(r=O(\log \log {n})\) attains the bound

$$\begin{aligned} \textrm{TV}(P_{f_0}, P_{{\widetilde{f}}_r})\lesssim \varepsilon +\frac{1}{n^{1/4}} \end{aligned}$$

with high probability, where poly-logarithmic factors are omitted. Since the loss function of the approximate estimator depends on f only via \(f(Z_j),j=1,\ldots ,\ell \), the representer theorem applies to the computation of the estimator \({\widetilde{f}}_r\). Indeed, the optimal \({\widetilde{f}}_r\) is expressed as a linear combination of \(k(Z_j,\cdot ),j=1,\ldots ,\ell \), and the optimal u is expressed as a linear combination of \(k(Z_j,\cdot ),j=1,\ldots ,\ell \), and \(k(X_i,\cdot ),i=1,\ldots ,n\). Hence, the problem reduces to a finite-dimensional optimization, as displayed below.
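Concretely, the optimizers admit the finite expansions

$$\begin{aligned} {\widetilde{f}}_r(\cdot )=\sum _{j=1}^{\ell }\beta _j k(Z_j,\cdot ),\qquad u(\cdot )=\sum _{j=1}^{\ell }\gamma _j k(Z_j,\cdot )+\sum _{i=1}^{n}\eta _i k(X_i,\cdot ), \end{aligned}$$

where \(\beta _j,\gamma _j,\eta _i\in {\mathbb {R}}\); the coefficient symbols are introduced here only for concreteness.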

According to Theorem 19, the exponential term \(e^{Kr}\) appears in the upper bound, meaning that strong regularization is needed to suppress that term. Recently, [49, 50] proposed approximation methods for the expectation in the penalized MLE for the kernel exponential family. It is an important issue to verify whether similar methods are also effective for STV learning.

7 Concluding remarks

Since [28] connected classical robust estimation with GANs, computationally efficient robust estimators using deep neural networks have been proposed. Most existing works focus on the theoretical analysis of estimators of the normal mean and covariance matrix under various conditions on the contaminated distribution.

In this paper, we studied the IPM-based robust estimator for the kernel exponential family, including infinite-dimensional models of probability densities. As a class of IPMs, we defined the STV distance. The relationship between the TV distance and the STV distance plays an important role in evaluating the estimation accuracy of the STV-based robust estimator. For the covariance matrix estimation of the multivariate normal distribution, we proved that the STV-based estimator attains the minimax optimal rate for the Frobenius norm, while existing estimators are minimax optimal for the operator norm. Furthermore, we proposed an approximate STV-based estimator using importance sampling to mitigate the computational difficulty. The framework studied in this paper is regarded as a natural extension of existing IPM-based robust estimators for the multivariate normal distribution to more general statistical models with theoretical guarantees.

The computational efficiency and stability of STV learning have room for improvement. Recent progress in robust statistics has also been made from the viewpoint of computational algorithms; see [6, 73]. Incorporating these results to improve our algorithms is an important research direction for practical applications.