Introduction and earlier work
A random variable (RV) is a measurable function mapping possible outcomes of an underlying random experiment to a set \({\mathcal {Z}}\) (often, \({\mathcal {Z}}\subset {\mathbb {R}}^d\), but our approach will be more general). The probability measure of the random experiment induces the distribution of the random variable. Below, we do not deal with the underlying probability space explicitly, but instead start directly from random variables X, Y with distributions \(p_X,p_Y\) and values in \({\fancyscript{X}},{\fancyscript{Y}}\). Suppose we have access to (data from) \(p_X\) and \(p_Y\), and we want to compute the distribution of the random variable f(X, Y), where f is a measurable function defined on \({\fancyscript{X}}\times {\fancyscript{Y}}\).
For instance, if our operation is addition \(f(X,Y)=X+Y\), and the distributions \(p_X\) and \(p_Y\) have densities, we can compute the density of the distribution of f(X, Y) by convolving those densities. If the distributions of X and Y belong to some parametric class, such as a class of distributions with Gaussian density functions, and if the arithmetic expression is elementary, then closed-form solutions for certain favorable combinations exist. At the other end of the spectrum, we can resort to numerical integration or sampling to approximate f(X, Y).
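As a minimal illustration of the sampling route (our own sketch, not code from the paper; it assumes NumPy, and the Gamma and Gaussian inputs are arbitrary choices), the distribution of \(f(X,Y)=X+Y\) can be approximated by drawing independent samples and applying f elementwise:

```python
# Minimal sketch (our illustration): approximating the distribution of
# f(X, Y) = X + Y by sampling, when no closed form is available.
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
x = rng.gamma(shape=2.0, scale=1.0, size=m)   # samples from p_X (chosen for illustration)
y = rng.normal(loc=0.0, scale=0.5, size=m)    # samples from p_Y (chosen for illustration)

z = x + y                                     # samples from the distribution of f(X, Y)
print("mean ≈", z.mean())                     # should be close to E[X] + E[Y] = 2.0
print("var  ≈", z.var())                      # should be close to Var[X] + Var[Y] = 2.25
```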
Arithmetic operations on random variables are abundant in science and engineering. Measurements in real-world systems are subject to uncertainty, and thus subsequent arithmetic operations on these measurements are operations on random variables. An example due to Springer (1979) is signal amplification. Consider a set of n amplifiers connected together in a serial fashion. If the amplification of the i-th amplifier is denoted by \(X_i\), then the total amplification, denoted by Y, is \(Y=X_1\cdot X_2\cdots X_n\), i.e., a product of n random variables.
A well-established framework for arithmetic operations on independent random variables (iRVs) relies on integral transform methods (Springer 1979). The above example of addition already suggests that Fourier transforms may help, and indeed, transforms such as those due to Fourier and Mellin have been used to derive the distribution function of the sum, difference, product, or quotient of iRVs (Epstein 1948; Springer and Thompson 1966; Prasad 1970; Springer 1979). Williamson (1989) proposes an approximation using Laguerre polynomials and a notion of envelopes bounding the cumulative distribution function. This framework also allows for the treatment of dependent random variables, but the bounds can become very loose after repeated operations. Milios (2009) approximates the probability distributions of the input random variables as mixture models (using uniform and Gaussian distributions) and applies the computations to all mixture components.
Jaroszewicz and Korzen (2012) consider a numerical approach to implement arithmetic operations on iRVs, representing the distributions using piecewise Chebyshev approximations. This representation lends itself to approximation methods that perform well as long as the functions involved are well-behaved. Finally, Monte Carlo approaches can be used as well and are popular in scientific applications [see e.g., Ferson (1996)].
The goal of the present paper is to develop a derived data type representing a distribution over another data type and to generalize the available computational operations to this data type, at least approximately. This would allow us to conveniently handle error propagation as in the example discussed earlier. It would also help us perform inference involving conditional distributions of such variables given observed data. The latter is the main topic of a subject area that has recently begun to attract attention, probabilistic programming (Gordon et al. 2014). A variety of probabilistic programming languages has been proposed (Wood et al. 2014; Paige and Wood 2014; Cassel 2014). To emphasize the central role that kernel maps play in our approach, we refer to it as kernel probabilistic programming (KPP).
Computing functions of independent random variables using kernel maps
The key idea of KPP is to provide a consistent estimator of the kernel map of an expression involving operations on random variables. This is done by applying the expression to the sample points and showing that the resulting kernel expansion has the desired property. Operations involving more than one RV will increase the size of the expansion, but we can resort to existing RKHS approximation methods to keep the complexity of the resulting expansion limited, which is advisable in particular if we want to use it as a basis for further operations. The benefits of KPP are threefold. First, we do not make parametric assumptions on the distributions associated with the random variables. Second, our approach applies not only to real-valued random variables, but also to multivariate random variables, structured data, functional data, and other domains, as long as positive definite kernels can be defined on the data. Finally, it does not require explicit density estimation as an intermediate step, which is difficult in high dimensions.
We begin by describing the basic idea. Let f be a function of two independent RVs X, Y taking values in the sets \({\fancyscript{X}},{\fancyscript{Y}}\), and suppose we are given i.i.d. m-samples \(x_1,\dots ,x_m\) and \(y_1,\dots ,y_m\). We are interested in the distribution of f(X, Y), and seek to estimate its representation \(\mu \big [f(X,Y)\big ] := {\mathbb {E}}\Big [\varPhi \big (f(X,Y)\big )\Big ]\) in the RKHS as
$$\begin{aligned} \frac{1}{m^2} \sum _{i,j=1}^m \varPhi \left( f (x_i,y_j)\right) . \end{aligned}$$
(22)
Although \(x_1,\dots ,x_m \sim p_X\) and \(y_1,\dots ,y_m \sim p_Y\) are i.i.d. observations, this does not imply that the \(\big \{ f(x_i,y_j) | i,j=1,\dots ,m\big \}\) form an i.i.d. \(m^2\)-sample from f(X, Y), since—loosely speaking—each \(x_i\) (and each \(y_j\)) leaves a footprint in m of the observations, leading to a (possibly weak) dependency. Therefore, Theorem 1 does not imply that (22) is consistent. We need to do some additional work:
Theorem 2
Given two independent random variables X, Y with values in \({\fancyscript{X}},{\fancyscript{Y}}\), mutually independent i.i.d. samples \(x_1,\ldots ,x_m\) and \(y_1,\ldots ,y_n\), a measurable function \(f:{\fancyscript{X}}\times {\fancyscript{Y}}\rightarrow {\fancyscript{Z}}\), and a positive definite kernel on \({\fancyscript{Z}}\times {\fancyscript{Z}}\) with RKHS map \(\varPhi \), the expression
$$\begin{aligned} \frac{1}{mn}\sum _{i=1}^m\sum _{j=1}^n \varPhi \big (f(x_i,y_j)\big ) \end{aligned}$$
(23)
is an unbiased and consistent estimator of \(\mu \big [f(X,Y)\big ]\).
Moreover, we have convergence in probability
$$\begin{aligned}&\left\| \frac{1}{mn}\sum _{i=1}^m\sum _{j=1}^n \varPhi \big (f(x_i,y_j)\big ) - {\mathbb {E}}\Big [\varPhi \big (f(X,Y)\big )\Big ]\right\| \nonumber \\&\quad =O_p\left( \frac{1}{\sqrt{m}}+\frac{1}{\sqrt{n}}\right) , \quad (m,n\rightarrow \infty ). \end{aligned}$$
(24)
As an aside, note that (23) is an RKHS-valued two-sample U-statistic.
Proof
For any i, j, we have \({\mathbb {E}}\big [\varPhi \big (f(x_i,y_j)\big )\big ] = {\mathbb {E}}\big [\varPhi \big (f(X,Y)\big )\big ]\); hence, (23) is unbiased.
The convergence (24) can be obtained as a corollary to Theorem 3, and the proof is omitted here. \(\square \)
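To make (23) concrete, here is a small sketch (our own illustration, assuming NumPy; the kernel, the operation f, and the sample sizes are arbitrary choices). It represents the estimated kernel mean by its expansion points with uniform weights, and evaluates its RKHS distance to an expansion built from fresh i.i.d. draws of f(X, Y):

```python
# Sketch (our illustration): the estimator (23) for mu[f(X, Y)] with a Gaussian
# kernel on Z = R, stored as expansion points plus weights.
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def rkhs_distance(z1, w1, z2, w2, sigma=1.0):
    """RKHS norm of sum_i w1_i Phi(z1_i) - sum_j w2_j Phi(z2_j)."""
    sq = (w1 @ gauss_kernel(z1, z1, sigma) @ w1
          + w2 @ gauss_kernel(z2, z2, sigma) @ w2
          - 2.0 * w1 @ gauss_kernel(z1, z2, sigma) @ w2)
    return np.sqrt(max(sq, 0.0))

rng = np.random.default_rng(0)
m, n = 50, 50
x, y = rng.normal(size=m), rng.normal(size=n)
f = lambda a, b: a * b                        # the operation applied to X and Y

z = f(x[:, None], y[None, :]).ravel()         # all f(x_i, y_j): m*n expansion points
w = np.full(z.size, 1.0 / z.size)             # uniform weights 1/(mn), cf. (23)

# crude check against an expansion built from fresh i.i.d. draws of f(X, Y)
z_ref = f(rng.normal(size=2000), rng.normal(size=2000))
w_ref = np.full(z_ref.size, 1.0 / z_ref.size)
print("RKHS error ≈", rkhs_distance(z, w, z_ref, w_ref))
```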
Approximating expansions
To keep computational cost limited, we need to use approximations when performing multi-step operations. If, for instance, the outcome of the first step takes the form (23), then we already have \(m\,\times \, n\) terms, and subsequent steps would further increase the number of terms, thus quickly becoming computationally prohibitive.
We can do so by using the methods described in Chap. 18 of Schölkopf and Smola (2002). They fall into two categories. In reduced set selection methods, we provide a set of expansion points [e.g., all points \(f(x_i,y_j)\) in (23)], and the approximation method sparsifies the vector of expansion coefficients. This can be done, for instance, by solving eigenvalue problems or linear programs. Reduced set construction methods, on the other hand, construct new expansion points. In the simplest case, they proceed by sequentially finding approximate pre-images of RKHS elements. They tend to be computationally more demanding and suffer from local minima; however, they can lead to sparser expansions.
Either way, we will end up with an approximation
$$\begin{aligned} \sum _{k=1}^p\gamma _k\varPhi (z_k) \end{aligned}$$
(25)
of (23), where usually \(p\ll m\,\times \, n\). Here, the \(z_k\) are either a subset of the \(f(x_i,y_j)\), or other points from \({\fancyscript{Z}}\).
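As a simple illustration of reduced set selection (a sketch under our own assumptions, not one of the specific algorithms of Schölkopf and Smola (2002)): keep a random subset of the expansion points and refit the coefficients \(\gamma \) by minimizing the RKHS distance to the full expansion, which amounts to a linear system in the kernel matrix.

```python
# Sketch (our illustration) of a simple reduced set selection: retain a random subset
# of expansion points and recompute the coefficients by RKHS least squares.
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def reduce_expansion(points, weights, p, sigma=1.0, rng=None):
    """Approximate sum_l weights_l Phi(points_l) by p terms sum_k gamma_k Phi(z_k)."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.choice(points.size, size=p, replace=False)
    z = points[idx]
    K_zz = gauss_kernel(z, z, sigma) + 1e-8 * np.eye(p)   # regularized for stability
    K_zp = gauss_kernel(z, points, sigma)
    gamma = np.linalg.solve(K_zz, K_zp @ weights)         # minimizes the RKHS error
    return z, gamma
```

More sophisticated selection or construction schemes would additionally optimize the choice of the points \(z_k\) themselves to further reduce the RKHS error.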
It is instructive to consider some special cases. For simplicity, assume that \({\fancyscript{Z}}={\mathbb {R}}^d\). If we use a Gaussian kernel
$$\begin{aligned} k(x,x')=\exp \big (-\Vert x-x'\Vert ^2/(2\sigma ^2)\big ) \end{aligned}$$
(26)
whose bandwidth \(\sigma \) is much smaller than the distance between the closest pair of sample points, then the points mapped into the RKHS will be almost orthogonal, and there is no way to sparsify a kernel expansion such as (23) without incurring a large RKHS error. In this case, we can identify the estimator with the sample itself, and KPP reduces to a Monte Carlo method. If, on the other hand, we use a linear kernel \(k(z,z')=\langle z,z'\rangle \) on \({\fancyscript{Z}}={\mathbb {R}}^d\), then \(\varPhi \) is the identity map and the expansion (23) collapses to a single element of \({\mathbb {R}}^d\), i.e., we would effectively represent f(X, Y) by its mean for any further processing. By choosing kernels that lie ‘in between’ these two extremes, we retain a varying amount of information, which we can tune as desired; see Table 1.
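The two extremes can be seen directly in a few lines (our own sketch, assuming NumPy; the sample and bandwidth values are arbitrary):

```python
# Sketch (our illustration) of the two extreme kernel choices discussed above, on Z = R.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)
w = np.full(5, 0.2)

# (a) Gaussian kernel with bandwidth far below the minimal pairwise distance:
sigma = 1e-3
K = np.exp(-(z[:, None] - z[None, :])**2 / (2 * sigma**2))
print(np.round(K, 3))    # ≈ identity matrix: the mapped points are almost orthogonal,
                         # so the expansion cannot be sparsified without large RKHS error.

# (b) Linear kernel: Phi is the identity, so the expansion collapses to the mean.
print(w @ z, z.mean())   # the only information retained about the sample
```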
Computing functions of RKHS approximations
More generally, consider approximations of kernel means \(\mu [X]\) and \(\mu [Y]\)
$$\begin{aligned} \hat{\mu }[X] := \sum _{i=1}^{m'}\alpha _i\varPhi _x(x'_i), \qquad \hat{\mu }[Y] := \sum _{j=1}^{n'}\beta _j\varPhi _y(y'_j). \end{aligned}$$
(27)
In our case, we think of (27) as RKHS-norm approximations of the outcome of previous operations performed on random variables. Such approximations typically have coefficients \({\varvec{\alpha }}\in {\mathbb {R}}^{m'}\) and \({\varvec{\beta }}\in {\mathbb {R}}^{n'}\) that are not uniform, that may not sum to one, and that may take negative values (Schölkopf and Smola 2002), e.g., for conditional mean maps (Song et al. 2009; Fukumizu et al. 2013).
We propose to approximate the kernel mean \(\mu \big [f(X,Y)\big ]\) by the estimator
$$\begin{aligned} \hat{\mu }\big [f(X,Y)\big ] := \frac{1}{\sum _{i=1}^{m'}\alpha _i \sum _{j=1}^{n'}\beta _j} \sum _{i=1}^{m'}\sum _{j=1}^{n'} \alpha _i\beta _j\, \varPhi _z\big (f(x'_i,y'_j)\big ), \end{aligned}$$
(28)
where the feature map \(\varPhi _z\), defined on \({\fancyscript{Z}}\) (the range of f), may be different from both \(\varPhi _x\) and \(\varPhi _y\). The expansion has \(m'\times n'\) terms, which we can subsequently approximate more compactly in the form (25), ready for the next operation. Note that (28) contains (23) as a special case.
One of the advantages of our approach is that (23) and (28) apply for general data types. In other words, \({\fancyscript{X}},{\fancyscript{Y}},{\fancyscript{Z}}\) need not be vector spaces—they may be arbitrary nonempty sets, as long as positive definite kernels can be defined on them.
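A small sketch of (28) (our own illustration, assuming NumPy; the expansion points, coefficients, and operation are arbitrary choices) combines two expansions pairwise and normalizes by the product of the coefficient sums:

```python
# Sketch (our illustration) of the weighted estimator (28): pairwise combination of
# two RKHS expansions with (possibly non-uniform) coefficients alpha and beta.
import numpy as np

def kpp_apply(f, x_pts, alpha, y_pts, beta):
    """Expansion points and weights representing the estimate of mu[f(X, Y)], cf. (28)."""
    z = f(x_pts[:, None], y_pts[None, :]).ravel()   # all f(x'_i, y'_j): m'*n' points
    w = np.outer(alpha, beta).ravel()
    w = w / (alpha.sum() * beta.sum())              # normalization factor in (28)
    return z, w

# Example: combine two small expansions. In practice one would compress the result
# back to the form (25), e.g., with a reduced set method, before the next operation.
x_pts, alpha = np.array([0.1, 1.2, -0.4]), np.array([0.5, 0.3, 0.2])
y_pts, beta = np.array([2.0, -1.0]), np.array([0.7, 0.3])
z, w = kpp_apply(lambda a, b: a + b, x_pts, alpha, y_pts, beta)
print(z.shape, w.sum())                             # (6,), 1.0
```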
Convergence analysis in an idealized setting
We analyze the convergence of (28) under the assumption that the expansion points are actually samples \(x_1,\dots ,x_m\) from X and \(y_1,\dots ,y_n\) from Y, which is for instance the case if the expansions (27) are the result of reduced set selection methods (cf. Sect. 3.3). Moreover, we assume that the expansion coefficients \(\alpha _1,\dots ,\alpha _m\) and \(\beta _1,\dots ,\beta _n\) are constants, i.e., independent of the samples.
The following proposition gives a sufficient condition for the approximations in (27) to converge. Note that below, the coefficients \(\alpha _1,\dots ,\alpha _m\) depend on the sample size m, but for simplicity we refrain from writing them as \(\alpha _{1,m},\dots ,\alpha _{m,m}\); likewise for \(\beta _1,\dots ,\beta _n\). We make this remark so that readers are not puzzled by the statement below that \(\sum _{i=1}^m \alpha _i^2 \rightarrow 0\) as \(m\rightarrow \infty \).
Proposition 1
Let \(x_1,\ldots ,x_m\) be an i.i.d. sample and \((\alpha _i)_{i=1}^m\) be constants with \(\sum _{i=1}^m \alpha _i=1\). Assume \({\mathbb {E}}\big [k(X,X)\big ]>{\mathbb {E}}\big [k(X,\tilde{X})\big ]\), where X and \(\tilde{X}\) are independent random variables with the common distribution of the \(x_i\). Then, the convergence
$$\begin{aligned} {\mathbb {E}}\left\| \sum _{i=1}^m\alpha _i\varPhi (x_i)-\mu [X]\right\| ^2 \rightarrow 0\qquad (m\rightarrow \infty ) \end{aligned}$$
holds true if and only if \(\sum _{i=1}^m \alpha _i^2 \rightarrow 0\) as \(m\rightarrow \infty \).
Proof
From the expansion
$$\begin{aligned} {\mathbb {E}}\left\| \sum _{i=1}^m\alpha _i\varPhi (x_i)-\mu [X]\right\| ^2&= \sum _{i,s=1}^m \alpha _i\alpha _s {\mathbb {E}}\big [k(x_i,x_s)\big ]-2\sum _{i=1}^m \alpha _i {\mathbb {E}}\big [k(x_i,X)\big ] + {\mathbb {E}}\big [k(X,\tilde{X})\big ] \\&= \Bigl (1-\sum _i\alpha _i\Bigr )^2 {\mathbb {E}}\big [k(X,\tilde{X})\big ] + \Bigl (\sum _i\alpha _i^2\Bigr )\Bigl \{ {\mathbb {E}}\big [k(X,X)\big ] -{\mathbb {E}}\big [k(X,\tilde{X})\big ] \Bigr \}, \end{aligned}$$
the assertion is straightforward. \(\square \)
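For example, uniform coefficients \(\alpha _i=1/m\) give \(\sum _i \alpha _i^2 = 1/m \rightarrow 0\), recovering the usual consistency of the empirical kernel mean; in contrast, exponentially decaying coefficients \(\alpha _i\propto 2^{-i}\) (normalized to sum to one) give \(\sum _i \alpha _i^2 \rightarrow 1/3\), so the corresponding expansion does not converge to \(\mu [X]\) in the sense above.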
The next result shows that if our approximations (27) converge in the sense of Proposition 1, then the estimator (28) (with expansion coefficients summing to 1) is consistent.
Theorem 3
Let \(x_1,\ldots ,x_m\) and \(y_1,\ldots ,y_n\) be mutually independent i.i.d. samples, and the constants \((\alpha _i)_{i=1}^m,(\beta _j)_{j=1}^n\) satisfy \(\sum _{i=1}^m\alpha _i=\sum _{j=1}^n\beta _j=1\). Assume \(\sum _{i=1}^m \alpha _i^2\) and \(\sum _{j=1}^n \beta _j^2\) converge to zero as \(n,m\rightarrow \infty \). Then
$$\begin{aligned}&\left\| \sum _{i=1}^m\sum _{j=1}^n\alpha _i\beta _j \varPhi \big (f(x_i,y_j)\big )-\mu \big [f(X,Y)\big ]\right\| \nonumber \\&\quad = O_p\left( \sqrt{\sum _i \alpha _i^2}+\sqrt{\sum _j\beta _j^2}\right) \end{aligned}$$
as \(m,n\rightarrow \infty \).
Proof
By expanding and taking expectations, one can see that
$$\begin{aligned} {\mathbb {E}}\left\| \sum _{i=1}^m\sum _{j=1}^n \alpha _i\beta _j \varPhi \big (f(x_i,y_j)\big ) - {\mathbb {E}}\Big [\varPhi \big (f(X,Y)\big )\Big ]\right\| ^2 \end{aligned}$$
equals
$$\begin{aligned}&\sum _{i=1}^m\sum _{j=1}^n \alpha _i^2 \beta _j^2\, {\mathbb {E}}\left[ k\big (f(X,Y), f(X,Y)\big )\right] \\&\qquad +\sum _{s\ne i}\sum _j\alpha _i\alpha _s\beta _j^2\,{\mathbb {E}}\left[ k\big (f(X,Y), f(\tilde{X},Y)\big )\right] \\&\qquad +\sum _{i}\sum _{t\ne j}\alpha _i^2\beta _j\beta _t\,{\mathbb {E}}\left[ k\big (f(X,Y), f(X,\tilde{Y})\big )\right] \\&\qquad +\sum _{s\ne i}\sum _{t\ne j}\alpha _i\alpha _s\beta _j\beta _t\,{\mathbb {E}}\left[ k\big (f(X,Y), f(\tilde{X},\tilde{Y})\big )\right] \\&\qquad -2 \sum _i\sum _j\alpha _i\beta _j\, {\mathbb {E}}\left[ k\big (f(X,Y), f(\tilde{X},\tilde{Y})\big )\right] +{\mathbb {E}}\left[ k\big (f(X,Y), f(\tilde{X},\tilde{Y})\big )\right] \\&\quad =\Bigl (\sum _i\alpha _i^2\Bigr )\Bigl (\sum _j\beta _j^2\Bigr ){\mathbb {E}}\left[ k\big (f(X,Y), f(X,Y)\big )\right] \\&\qquad + \Bigl \{ \Bigl (1-\sum _i\alpha _i\sum _j\beta _j\Bigr )^2 + \sum _i\alpha _i^2\sum _j\beta _j^2 - \sum _i\alpha _i^2\Bigl (\sum _j\beta _j\Bigr )^2 - \Bigl (\sum _i\alpha _i\Bigr )^2\sum _j\beta _j^2 \Bigr \}\, {\mathbb {E}}\left[ k\big (f(X,Y), f(\tilde{X},\tilde{Y})\big )\right] \\&\qquad + \Bigl (\Bigl (\sum _i\alpha _i\Bigr )^2-\sum _i\alpha _i^2\Bigr ) \Bigl (\sum _j\beta _j^2\Bigr )\,{\mathbb {E}}\left[ k\big (f(X,Y), f(\tilde{X},Y)\big )\right] \\&\qquad + \Bigl (\sum _i\alpha _i^2\Bigr ) \Bigl (\Bigl (\sum _j\beta _j\Bigr )^2-\sum _j\beta _j^2\Bigr )\,{\mathbb {E}}\left[ k\big (f(X,Y), f(X,\tilde{Y})\big )\right] \\&\quad =\Bigl (\sum _i\alpha _i^2\Bigr ) \Bigl (\sum _j\beta _j^2\Bigr ){\mathbb {E}}\left[ k\big (f(X,Y), f(X,Y)\big )\right] \\&\qquad + \Bigl \{ \sum _i\alpha _i^2\sum _j\beta _j^2 - \sum _i\alpha _i^2 - \sum _j\beta _j^2 \Bigr \}\, {\mathbb {E}}\left[ k\big (f(X,Y), f(\tilde{X},\tilde{Y})\big )\right] \\&\qquad + \Bigl (1-\sum _i\alpha _i^2\Bigr )\Bigl (\sum _j\beta _j^2\Bigr ){\mathbb {E}}\left[ k\big (f(X,Y), f(\tilde{X},Y)\big )\right] \\&\qquad + \Bigl (\sum _i\alpha _i^2\Bigr )\Bigl (1-\sum _j\beta _j^2\Bigr ){\mathbb {E}}\left[ k\big (f(X,Y), f(X,\tilde{Y})\big )\right] , \end{aligned}$$
which implies that the norm in the assertion of the theorem converges to zero at \(O_p\left( \sqrt{\sum _i \alpha _i^2}+\sqrt{\sum _j\beta _j^2}\right) \) under the assumptions on \(\alpha _i\) and \(\beta _j\). Here, \((\tilde{X},\tilde{Y})\) is an independent copy of (X, Y). This concludes the proof. \(\square \)
Note that in the simplest case, where \(\alpha _i=1/m\) and \(\beta _j=1/n\), we have \(\sum _i\alpha _i^2=1/m\) and \(\sum _j\beta _j^2 = 1/n\), which proves Theorem 2. It is also easy to see from the proof that we do not strictly need \(\sum _i\alpha _i=\sum _j\beta _j=1\)—for the estimator to be consistent, it suffices if the sums converge to 1. For a sufficient condition for this convergence, see Kanagawa and Fukumizu (2014).
More general expansion sets
To conclude our discussion of the estimator (28), we turn to the case where the expansions (27) are computed by reduced set construction, i.e., they are not necessarily expressed in terms of samples from X and Y. This is more difficult, and we do not provide a formal result, but just a qualitative discussion.
To this end, suppose the approximations (27) satisfy
$$\begin{aligned} \sum _{i=1}^{m'}\alpha _i = 1 \quad \text {and}\quad \alpha _i>0 \text { for all } i,\end{aligned}$$
(29)
$$\begin{aligned} \sum _{j=1}^{n'}\beta _j = 1 \quad \text {and}\quad \beta _j>0 \text { for all } j, \end{aligned}$$
(30)
and we approximate \(\mu \big [f(X,Y)\big ]\) by the quantity (28).
We assume that (27) are good approximations of the kernel means of two unknown random variables X and Y; we also assume that f and the kernel mean map along with its inverse are continuous. We have no samples from X and Y, but we can turn (27) into sample estimates based on artificial samples \(\mathbf{X}\) and \(\mathbf{Y}\), for which we can then appeal to our estimator from Theorem 2.
To this end, denote by \(\mathbf{X}' =( x'_1,\ldots ,x'_{m'})\) and \(\mathbf{Y}'=(y'_1,\ldots ,y'_{n'})\) the expansion points in (27). We construct a sample \(\mathbf{X}=( x_1,x_2,\dots )\) whose kernel mean is close to \(\sum _{i=1}^{m'}\alpha _i\varPhi _x(x'_i)\) as follows: for each i, the point \(x'_i\) appears in \(\mathbf{X}\) with multiplicity \(\lfloor m\cdot \alpha _i \rfloor \), i.e., the largest integer not exceeding \(m\cdot \alpha _i\). This leads to a sample of size at most m. Note, moreover, that the multiplicity of \(x'_i\), divided by m, differs from \(\alpha _i\) by at most \(1/m\), so effectively we have quantized the \(\alpha _i\) coefficients to this accuracy.
Since \(m'\) is constant, this implies that for any \(\varepsilon >0\), we can choose \(m \in {\mathbb {N}}\) large enough to ensure that
$$\begin{aligned} \left\| \frac{1}{m}\sum _{i=1}^m \varPhi _x(x_i) - \sum _{i=1}^{m'}\alpha _i\varPhi _x(x'_i)\right\| ^2 < \varepsilon . \end{aligned}$$
(31)
We may thus work with \(\frac{1}{m}\sum _{i=1}^m \varPhi _x(x_i)\), which for strictly positive definite kernels corresponds uniquely to the sample \(\mathbf{X}=( x_1,\ldots ,x_m)\). By the same argument, we obtain a sample \(\mathbf{Y}=(y_1,\ldots ,y_n)\) approximating the second expansion. Substituting both samples into the estimator from Theorem 2 leads to
$$\begin{aligned} \hat{\mu }\big [f(X,Y)\big ] = \frac{1}{\sum _{i=1}^{m'}\hat{\alpha }_i\sum _{j=1}^{n'}\hat{\beta }_j} \sum _{i=1}^{m'}\sum _{j=1}^{n'} \hat{\alpha }_i\hat{\beta }_j\, \varPhi _z\big (f(x'_i,y'_j)\big ), \end{aligned}$$
(32)
where \(\hat{\alpha }_i = \lfloor m\cdot \alpha _i \rfloor / m\) and \(\hat{\beta }_j = \lfloor n\cdot \beta _j \rfloor / n\). By choosing sufficiently large m, n, this becomes an arbitrarily good approximation (in the RKHS norm) of the proposed estimator (28). Note, however, that we cannot claim based on this argument that this estimator is consistent, not least since Theorem 2 in the stated form requires i.i.d. samples.
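The quantization step is easy to spell out in code (our own sketch, assuming NumPy; the weights and expansion points are arbitrary illustrative values):

```python
# Sketch (our illustration): turning an expansion with positive, normalized weights
# into an approximate sample via the multiplicities floor(m * alpha_i), which yields
# the quantized weights hat_alpha_i = floor(m * alpha_i) / m appearing in (32).
import numpy as np

def quantize_expansion(points, weights, m):
    """Replicate each expansion point floor(m * weight) times; return the sample
    and the quantized weights."""
    counts = np.floor(m * weights).astype(int)
    sample = np.repeat(points, counts)          # at most m points in total
    return sample, counts / m

alpha = np.array([0.42, 0.33, 0.25])            # positive weights summing to 1, cf. (29)
x_prime = np.array([-1.0, 0.3, 2.5])            # expansion points x'_i
sample, alpha_hat = quantize_expansion(x_prime, alpha, m=100)
print(len(sample), alpha_hat)                   # 100, [0.42 0.33 0.25]
```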
Larger sets of random variables
Without analysis, we include the estimator for the case of more than two variables: Let g be a measurable function of jointly independent RVs \(U_j (j=1,\dots ,p)\). Given i.i.d. observations \(u^j_1,\dots ,u^j_m \sim U_j\), we have
$$\begin{aligned}&\frac{1}{m^p} \sum _{m_1,\dots ,m_p=1}^m \varPhi \left( g \big (u^1_{m_1},\dots ,u^p_{m_p}\big )\right) \nonumber \\&\quad \xrightarrow {m\rightarrow \infty } \mu \left[ g\big (U_1,\dots ,U_p \big )\right] \end{aligned}$$
(33)
in probability. Here, in order to keep notation simple, we have assumed that the sample sizes for each RV are identical.
As above, we note that (i) g need not be real-valued; it can take values in some set \({\fancyscript{Z}}\) for which we have a (possibly characteristic) positive definite kernel; (ii) we can extend this to general kernel expansions like (28); and (iii) if we use Gaussian kernels with width tending to 0, we can think of the above as a sampling method.
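For completeness, a sketch of (33) for p variables (our own illustration, assuming NumPy; the operation g and sample sizes are arbitrary). The number of expansion terms grows as \(m^p\), so in practice one would compress the expansion to the form (25) between operations:

```python
# Sketch (our illustration) of the estimator (33) for p jointly independent RVs:
# apply g to every combination of one sample point per variable, with uniform weights.
import numpy as np
from itertools import product

def kpp_apply_many(g, samples):
    """samples: list of p one-dimensional arrays of equal length m."""
    points = np.array([g(*vals) for vals in product(*samples)])  # m**p expansion points
    weights = np.full(points.size, 1.0 / points.size)            # uniform weights 1/m**p
    return points, weights

rng = np.random.default_rng(0)
u = [rng.normal(size=20) for _ in range(3)]                      # p = 3, m = 20
z, w = kpp_apply_many(lambda a, b, c: a * b + c, u)              # 8000 expansion terms
print(z.size, w.sum())
```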