1 Introduction

In this section, we introduce background material on the target problem addressed in this study.

1.1 Density-ratio estimation

Recently, methods of directly estimating the ratio of two probability densities without going through density estimation have been developed. These methods can be used to solve various machine learning tasks such as importance sampling, divergence estimation, mutual information estimation, and conditional probability estimation (Sugiyama et al. 2009, 2012b).

The kernel mean matching (KMM) method (Gretton et al. 2009) directly yields density ratio estimates by efficiently matching the two distributions using a special property of the universal reproducing kernel Hilbert spaces (RKHSs) (Steinwart 2001). Another approach is the M-estimator (Nguyen et al. 2010), which is based on the non-asymptotic variational characterization of the φ-divergence (Ali and Silvey 1966; Csiszár 1967). See Sugiyama et al. (2008a) for a similar algorithm that uses the Kullback-Leibler divergence. Non-parametric convergence properties of the M-estimator in RKHSs have been elucidated under the Kullback-Leibler divergence (Nguyen et al. 2010; Sugiyama et al. 2008b). A squared-loss version of the M-estimator for linear density-ratio models called unconstrained Least-Squares Importance Fitting (uLSIF) has also been developed (Kanamori et al. 2009). The squared-loss version was also shown to possess useful computational properties, e.g., a closed-form solution is available, and the leave-one-out cross-validation score can be computed analytically. A kernelized variant of uLSIF was recently proposed, and its statistical consistency was studied (Kanamori et al. 2012).

In this paper, we study loss functions of M-estimators. A general framework for density-ratio estimation was established in Nguyen et al. (2010) (see also Sugiyama et al. 2012a). However, when we estimate the density ratio for real-world data analysis, it becomes necessary to choose an M-estimator from infinitely many candidates. Hence, it is important to study which M-estimator should be chosen in practice. The suitability of an estimator depends on the chosen criterion. In learning problems, there are mainly two criteria for choosing an estimator: (1) the estimation accuracy and (2) the computational cost. Kanamori et al. (2012) studied the choice of loss functions in density-ratio estimation from the viewpoint of estimation accuracy. In the present paper, we focus on the computational cost associated with density-ratio estimators.

1.2 Condition numbers

In numerical analysis, the computational cost is closely related to the so-called condition number (von Neumann and Goldstine 1947; Turing 1948; Eckart and Young 1936). Indeed, the condition number appears as a parameter in complexity bounds for a variety of efficient iterative algorithms in linear algebra, linear and convex optimization, and homotopy methods for solving systems of polynomial equations (Luenberger and Ye 2008; Nocedal and Wright 1999; Renegar 1987, 1995; Smale 1981; Demmel 1997).

The definition of the condition number depends on the problem. In computational tasks involving matrix manipulations, a typical definition of the condition number is the ratio of the maximum and minimum singular values of the matrix given as the input of the problem under consideration. For example, consider solving the linear equation Ax=b. The input of the problem is the matrix A, and the computational cost of finding the solution can be evaluated by the condition number of A, denoted hereafter by κ(A). Specifically, when an iterative algorithm is applied to solve Ax=b, the number of iterations required to converge to a solution is evaluated using κ(A). In general, a problem with a larger condition number results in a higher computational cost. Since the condition number is independent of the algorithm, it is expected to represent the essential difficulty of the problem.
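
As a small illustration of this point, the following R snippet (a minimal sketch with arbitrary synthetic matrices, not part of the original analysis) computes κ(A) as the ratio of the extreme singular values and counts the iterations that a plain steepest-descent solver for Ax=b needs to reach a fixed residual tolerance; the iteration count grows with κ(A).

```r
set.seed(1)

# Build a symmetric positive definite matrix with a prescribed condition number.
make_spd <- function(n, cond) {
  Q <- qr.Q(qr(matrix(rnorm(n * n), n, n)))   # random orthogonal matrix
  Q %*% diag(seq(1, cond, length.out = n)) %*% t(Q)
}

# Condition number as the ratio of the maximum and minimum singular values.
cond_number <- function(A) {
  s <- svd(A)$d
  max(s) / min(s)
}

# Steepest descent with exact line search for Ax = b; returns the iteration count.
solve_sd <- function(A, b, tol = 1e-8, maxit = 1e5) {
  x <- rep(0, length(b))
  for (it in 1:maxit) {
    r <- b - A %*% x                                       # residual
    if (sqrt(sum(r^2)) < tol) return(it)
    eta <- drop(crossprod(r, r) / crossprod(r, A %*% r))   # exact step size
    x <- x + eta * r
  }
  maxit
}

for (cond in c(10, 100, 1000)) {
  A <- make_spd(50, cond)
  b <- rnorm(50)
  cat(sprintf("kappa(A) = %8.1f   iterations = %6d\n",
              cond_number(A), solve_sd(A, b)))
}
```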

To evaluate the efficiency of numerical algorithms, a two-stage approach is frequently used. In the first stage, the relation between the computational cost c(A) of an algorithm with input A and the condition number κ(A) of the problem is studied; a formula such as \(c(A)=O(\kappa(A)^{\alpha})\) is obtained, where α is a constant depending on the algorithm. In the second stage, the probability distribution of κ(A) is estimated, for example in the form \(\Pr(\kappa(A)\geq x)\leq x^{-\beta}\), where the probability is designed to represent a “practical” input distribution. As a result, the average computational cost of the algorithm can be evaluated. For details of this approach, see Blum and Shub (1986), Renegar (1987), Demmel (1988), Kostlan (1988), Edelman (1988, 1992), Shub (1993), Shub and Smale (1994, 1996), Cheung and Cucker (2002), Cucker and Wschebor (2002), Beltran and Pardo (2006), Bürgisser et al. (2010).

1.3 Smoothed analysis

The “average” performance is often controversial, because it is hard to identify the input probability distribution in real-world problems. Spielman and Teng (2004) proposed the smoothed analysis to refine the second stage of the above scheme and obtain more meaningful probabilistic upper bounds on the complexity. Smoothed analysis is a hybrid of the worst-case and average-case analyses. Consider the averaged computational cost \(\mathrm{E}_P[c(A)]\), where c(A) is the cost of an algorithm for input A and \(\mathrm{E}_P[\,\cdot\,]\) denotes the expectation with respect to the probability P over the input space. Let \(\mathcal{P}\) be a set of probability distributions on the input space. Then, in the smoothed analysis, the performance of the algorithm is measured by \(\max_{P\in\mathcal{P}}\,\mathrm{E}_{P}[c(A)]\), i.e., the worst-case evaluation of the expected computational cost over a set of probability distributions.

The smoothed analysis was successfully employed in understanding the practical efficiency of the simplex algorithm for linear programming problems (Spielman and Teng 2004; Bürgisser et al. 2006a). In the context of machine learning, the smoothed analysis was applied to elucidate the complexity of learning algorithms such as the perceptron algorithm and the k-means method; see Vershynin (2006), Blum and Dunagan (2002), Becchetti et al. (2006), Röglin and Vöcking (2007), Manthey and Röglin (2009), Bürgisser et al. (2006b), Bürgisser and Cucker (2010), Sankar et al. (2006) for more applications of the smoothed analysis technique.

The concept of the smoothed analysis, i.e., the worst-case evaluation of the expected computational cost over a set of probability distributions, is compatible with many problem setups in machine learning and statistics. A typical assumption in statistical inference is that training samples are distributed according to a probability distribution belonging to some set of distributions. This set may be specified by a finite-dimensional parameter, or an infinite-dimensional space may be introduced to describe it.

1.4 Our contributions

In this study, we apply the concept of smoothed analysis to study the computational cost of density-ratio estimation algorithms. In our analysis, we define the probability distribution on the basis of training samples, and study the optimal choice of the loss function for M-estimators.

More specifically, we consider the optimization problems associated with the M-estimators. There are some definitions of condition numbers to measure the complexity of optimization problems (Bürgisser et al. 2006c; Renegar 1995; Todd et al. 2001). In unconstrained non-linear optimization problems, the condition number defined from the Hessian matrix of the loss function plays a crucial role, because it determines the convergence rate of optimization and the numerical stability (Luenberger and Ye 2008; Nocedal and Wright 1999). When a loss function to be optimized depends on random samples, the computational cost will be affected by the distribution of the condition number. Therefore, we study the distribution of condition numbers for randomly perturbed matrices. Next, we derive the loss function that has the smallest condition number among all M-estimators in the min-max sense. We also give a probabilistic evaluation of the condition number. Finally, we verify these theoretical findings through numerical experiments.

There are many important aspects to the computational cost of numerical algorithms such as memory requirements, the role of stopping conditions, and the scalability to large data sets. In this study, we evaluate the computational cost and stability of learning problems on the basis of the condition number of the loss function, because the condition number is a major parameter to quantify the difficulty of the numerical computation as explained above.

1.5 Structure of the paper

The remainder of this paper is structured as follows. In Sect. 2, we formulate the problem of density-ratio estimation and briefly review existing methods. In Sect. 3, a kernel-based density-ratio estimator is introduced. Section 4 presents the main contribution of this paper, namely condition-number analyses of density-ratio estimation methods. In Sect. 5, we further investigate the possibility of reducing the condition number of loss functions. In Sect. 6, we experimentally investigate the behavior of condition numbers, confirming the validity of our theoretical analysis. In Sect. 7, we conclude by summarizing our contributions and indicating possible future research directions. Technical details are presented in Appendices A–E.

2 Estimation of density ratios

In this section, we formulate the problem of density-ratio estimation and briefly review existing methods.

2.1 Formulation and notations

Consider two probability distributions P and Q on a probability space \(\mathcal{Z}\). Let the distributions P and Q have the probability densities p and q, respectively. We assume p(x)>0 for all \(x\in\mathcal{Z}\). Suppose that we are given two sets of independent and identically distributed (i.i.d.) samples,

\(X_1,\ldots,X_n \overset{\mathrm{i.i.d.}}{\sim} P, \qquad Y_1,\ldots,Y_m \overset{\mathrm{i.i.d.}}{\sim} Q.\)   (1)

Our goal is to estimate the density ratio

\(w_0(x)=\dfrac{q(x)}{p(x)}\)

based on the observed samples.

We summarize some notations to be used throughout the paper. For a vector a in the Euclidean space, ∥a∥ denotes the Euclidean norm. Given a probability distribution P and a random variable h(X), we denote the expectation of h(X) under P by ∫h dP or ∫h(x)P(dx). Let \(\|\cdot\|_{\infty}\) be the infinity norm. For a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}\) (Aronszajn 1950), the inner product and the norm on \(\mathcal{H}\) are denoted as \(\langle\cdot,\cdot\rangle_{\mathcal{H}}\) and \(\|\cdot\|_{\mathcal{H}}\), respectively.

2.2 M-estimator based on φ-divergence

An estimator of the density ratio based on the φ-divergence (Ali and Silvey 1966; Csiszár 1967) has been proposed by Nguyen et al. (2010). Let φ:ℜ→ℜ be a convex function, and suppose that φ(1)=0. Then, the φ-divergence between P and Q is defined by the integral

\(\displaystyle\int \varphi\!\left(\frac{q(x)}{p(x)}\right)p(x)\,dx.\)

Setting φ(z)=−log z, we obtain the Kullback-Leibler divergence as an example of the φ-divergence. Let ψ be the conjugate dual function of φ, i.e.,

\(\psi(z)=\sup_{u\in\Re}\,\{zu-\varphi(u)\}.\)

When φ is a convex function, we also have

\(\varphi(u)=\sup_{z\in\Re}\,\{uz-\psi(z)\}.\)   (2)

We assume ψ is differentiable. See Sects. 12 and 26 of Rockafellar (1970) for details on the conjugate dual function. Substituting (2) into the φ-divergence, we obtain the expression

\(\displaystyle\int \varphi\!\left(\frac{q}{p}\right)dP \;=\; -\inf_{w}\left[\int\psi(w)\,dP-\int w\,dQ\right],\)   (3)

where the infimum is taken over all measurable functions \(w:\mathcal{Z}\rightarrow\Re\). The infimum is attained at the function w satisfying

\(\dfrac{q(x)}{p(x)}=\psi'(w(x)),\)   (4)

where ψ′ is the derivative of ψ.

Approximating (3) with the empirical distribution, we obtain the empirical loss function

\(\dfrac{1}{n}\sum_{i=1}^{n}\psi(w(X_i)) - \dfrac{1}{m}\sum_{j=1}^{m} w(Y_j).\)

A parametric or non-parametric model is assumed for the function w. This estimator is referred to as the M-estimator of the density ratio (Nguyen et al. 2010). The M-estimator based on the Kullback-Leibler divergence is derived from ψ(z)=−1−log(−z). Sugiyama et al. (2008a) studied the estimator with the Kullback-Leibler divergence in detail and proposed a practical method that includes basis function selection by cross-validation. Kanamori et al. (2009) proposed unconstrained Least-Squares Importance Fitting (uLSIF), which is derived from the quadratic function ψ(z)=z^2/2.
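
As a minimal sketch of the empirical loss above (using a hypothetical linear-in-parameter model with Gaussian basis functions centered at the denominator samples, which is an illustrative choice and not the kernel formulation introduced in Sect. 3), the loss can be written down directly; ψ(z)=z^2/2 gives uLSIF and ψ(z)=−1−log(−z) gives the KL-based M-estimator.

```r
# Empirical M-estimator loss: (1/n) sum_i psi(w(X_i)) - (1/m) sum_j w(Y_j).
psi_quad <- function(z) z^2 / 2            # uLSIF
psi_kl   <- function(z) -1 - log(-z)       # KL-based estimator (defined for z < 0)

# Gaussian basis functions phi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2)).
gauss_basis <- function(x, centers, sigma) {
  exp(-(outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * x %*% t(centers)) /
        (2 * sigma^2))
}

# w(x) = sum_l theta_l * phi_l(x); empirical loss as a function of theta.
emp_loss <- function(theta, psi, PhiX, PhiY) {
  mean(psi(drop(PhiX %*% theta))) - mean(drop(PhiY %*% theta))
}

set.seed(2)
X <- matrix(rnorm(100 * 2), 100, 2)               # samples from P (denominator)
Y <- matrix(rnorm(80 * 2, mean = 0.3), 80, 2)     # samples from Q (numerator)
PhiX <- gauss_basis(X, X, sigma = 1)
PhiY <- gauss_basis(Y, X, sigma = 1)
emp_loss(rep(0.1, nrow(X)), psi_quad, PhiX, PhiY)  # quadratic-loss value at theta = 0.1
```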

3 Kernel-based M-estimator

In this study, we consider kernel-based estimators of density ratios because kernel methods provide a powerful and unified framework for statistical inference (Schölkopf and Smola 2002). Let \(\mathcal{H}\) be an RKHS endowed with the kernel function k defined on \(\mathcal{Z}\times\mathcal{Z}\). Then, based on (3), we minimize the following loss function over \(\mathcal{H}\):

\(\dfrac{1}{n}\sum_{i=1}^{n}\psi(w(X_i)) - \dfrac{1}{m}\sum_{j=1}^{m} w(Y_j) + \dfrac{\lambda}{2}\|w\|_{\mathcal{H}}^{2},\)   (5)

where the regularization term \(\frac{\lambda}{2}\|w\|_{\mathcal{H}}^{2}\) with the regularization parameter λ is introduced to avoid overfitting. Then, an estimator of the density ratio \(w_0\) is given by \(\psi'(\widehat{w}(x))\), where \(\widehat{w}\) is the minimizer of (5). Statistical convergence properties of the kernel estimator using the Kullback-Leibler divergence have been investigated in Nguyen et al. (2010) and Sugiyama et al. (2008b), and a similar analysis for the squared loss was given in Kanamori et al. (2012).

In the RKHS \(\mathcal{H}\), the representer theorem (Kimeldorf and Wahba 1971) is applicable, and the optimization problem on \(\mathcal{H}\) is reduced to a finite-dimensional optimization problem. A detailed analysis leads us to a specific form of the solution as follows.

Lemma 1

Suppose the samples (1) are observed and assume that the function ψ in (5) is a differentiable convex function, and that λ>0. Let \(v(\alpha,\beta)\in\Re^n\) be the vector-valued function defined by

\(v(\alpha,\beta)_i = \psi'\!\left(\sum_{j=1}^{n}\alpha_j k(X_i,X_j) + \sum_{l=1}^{m}\beta_l k(X_i,Y_l)\right),\quad i=1,\ldots,n,\)

for \(\alpha\in\Re^n\) and \(\beta\in\Re^m\), where ψ′ denotes the derivative of ψ. Let \(\mathbf{1}_m=(1,\ldots,1)^{\top}\in\Re^m\) for a positive integer m, and suppose that there exists a vector \(\bar{\alpha}=(\bar{\alpha}_1,\ldots,\bar{\alpha}_n)\in\Re^n\) such that

\(\lambda n\,\bar{\alpha} + v\!\left(\bar{\alpha},\,\dfrac{1}{\lambda m}\mathbf{1}_m\right) = 0.\)   (6)

Then, the estimator \(\widehat{w}\), an optimal solution of (5), has the form

\(\widehat{w}(x) = \sum_{i=1}^{n}\bar{\alpha}_i k(x,X_i) + \dfrac{1}{\lambda m}\sum_{j=1}^{m}k(x,Y_j).\)   (7)

The proof, which can be regarded as an extension of the proof for the least-squares estimator (Kanamori et al. 2012) to general M-estimators, is deferred to Appendix A. This lemma implies that it is sufficient to find the n variables \(\bar{\alpha}_1,\ldots,\bar{\alpha}_n\) to obtain the estimator \(\widehat{w}\).

Using Lemma 1, we can obtain the estimator based on the φ-divergence by solving the following optimization problem

\(\displaystyle\min_{\alpha\in\Re^{n}}\;\frac{1}{n}\sum_{i=1}^{n}\psi\!\left(\sum_{j=1}^{n}\alpha_{j}k(X_i,X_j)+\frac{1}{\lambda m}\sum_{l=1}^{m}k(X_i,Y_l)\right)+\frac{\lambda}{2}\sum_{i,j=1}^{n}\alpha_{i}\alpha_{j}k(X_i,X_j)\)   (8)
\(\text{subject to}\quad \sum_{j=1}^{n}\alpha_{j}k(X_i,X_j)+\frac{1}{\lambda m}\sum_{l=1}^{m}k(X_i,Y_l)\in\operatorname{dom}\psi,\quad i=1,\ldots,n.\)

Though the problem (8) is a constrained optimization problem with respect to the parameter α=(α 1,…,α n ), it can be easily rewritten as an unconstrained one. In this paper, our main concern is to study which ψ we should use as the loss function of the M-estimator. In Sects. 4 and 5, we will show that the quadratic function is a preferable choice from a computational efficiency viewpoint.

Consider the condition (6) for the quadratic function ψ(z)=z^2/2. Let \(K_{11}\), \(K_{12}\), and \(K_{21}\) be the sub-matrices of the Gram matrix

\(K=\begin{pmatrix}K_{11} & K_{12}\\ K_{21} & K_{22}\end{pmatrix},\qquad (K_{11})_{ii'}=k(X_i,X_{i'}),\quad (K_{12})_{ij}=(K_{21})_{ji}=k(X_i,Y_j),\quad (K_{22})_{jj'}=k(Y_j,Y_{j'}),\)

where i,i'=1,…,n, j,j'=1,…,m. Then, for the quadratic loss ψ(z)=z^2/2, we have

\(v\!\left(\alpha,\dfrac{1}{\lambda m}\mathbf{1}_m\right) = K_{11}\alpha + \dfrac{1}{\lambda m}K_{12}\mathbf{1}_m,\)

and thus, there exists a vector \(\bar{\alpha}\) that satisfies the equation (6). For ψ(z)=z^2/2, the problem (8) is reduced to

\(\min_{\alpha\in\Re^n}\ \dfrac{1}{2n}\alpha^{\top}K_{11}^{2}\alpha + \dfrac{1}{n\lambda m}\alpha^{\top}K_{11}K_{12}\mathbf{1}_m + \dfrac{\lambda}{2}\alpha^{\top}K_{11}\alpha\)   (9)

by ignoring the term that is independent of the parameter α. The density-ratio estimator obtained by solving (9) is referred to as the kernelized uLSIF (KuLSIF) (Kanamori et al. 2012).

When the matrix \(K_{11}\) is non-degenerate, the optimal solution of (9) is equal to

\(\widehat{\alpha} = -\dfrac{1}{\lambda m}\left(K_{11} + n\lambda I_n\right)^{-1}K_{12}\mathbf{1}_m.\)   (10)

It is straightforward to confirm that the optimal solution of the problem

\(\min_{\alpha\in\Re^n}\ \dfrac{1}{2n}\alpha^{\top}K_{11}\alpha + \dfrac{1}{n\lambda m}\alpha^{\top}K_{12}\mathbf{1}_m + \dfrac{\lambda}{2}\|\alpha\|^{2}\)   (11)

is the same as (10). The estimator given by solving the optimization problem (11) is referred to as Reduced-KuLSIF (R-KuLSIF). Though the objective functions of KuLSIF and R-KuLSIF are different, their optimal solutions are the same. In Sect. 5, we show that R-KuLSIF is preferable to the other M-estimators (including KuLSIF) from the viewpoint of numerical computation.
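
The closed-form solution (10) is easy to compute directly. The following R sketch (synthetic Gaussian data, a Gaussian kernel of the form exp(−∥x−x′∥²/(2σ²)) with the median-distance heuristic, and a regularization parameter chosen as in Sect. 6; all of these are illustrative choices) forms the KuLSIF/R-KuLSIF solution and evaluates the resulting density-ratio estimate ψ′(ŵ(x))=ŵ(x) from (7).

```r
set.seed(3)
d <- 5; n <- 200; m <- 150
X <- matrix(rnorm(n * d), n, d)                  # X_1,...,X_n ~ P
Y <- matrix(rnorm(m * d, mean = 0.3), m, d)      # Y_1,...,Y_m ~ Q

sqdist <- function(A, B) outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
sigma  <- median(sqrt(pmax(sqdist(X, X), 0))[upper.tri(diag(n))])  # median heuristic
gauss  <- function(A, B) exp(-sqdist(A, B) / (2 * sigma^2))

K11 <- gauss(X, X)
K12 <- gauss(X, Y)
lambda <- 1 / min(n, m)^0.9                      # same order as in Sect. 6

# Closed-form solution (10): alpha = -(1/(lambda m)) (K11 + n lambda I)^{-1} K12 1_m.
alpha <- -solve(K11 + n * lambda * diag(n), K12 %*% rep(1, m)) / (lambda * m)

# Estimator (7); for the quadratic loss the ratio estimate is psi'(w) = w itself.
w_hat <- function(Z) {
  drop(gauss(Z, X) %*% alpha + gauss(Z, Y) %*% rep(1 / (lambda * m), m))
}
summary(w_hat(Y))                                # estimated ratio q/p at the Y samples
```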

4 Condition number analysis for density-ratio estimation

In this section, we study the condition number of loss functions for density-ratio estimation. Through the analysis of condition numbers, we elucidate the computational efficiency of the M-estimator, which is the main contribution of this study.

4.1 Condition number in numerical analysis and optimization

The condition number plays a crucial role in numerical analysis and optimization (Demmel 1997; Luenberger and Ye 2008; Sankar et al. 2006). The main concepts are briefly reviewed here.

Let A be a symmetric positive definite matrix. Then, the condition number of A is defined by \(\lambda_{\max}/\lambda_{\min}\) (≥1), where \(\lambda_{\max}\) and \(\lambda_{\min}\) are the maximum and minimum eigenvalues of A, respectively. The condition number of A is denoted by κ(A).

In numerical analysis, the condition number governs the round-off error of the solution of a linear equation Ax=b. A matrix A with a large condition number results in a large upper bound on the relative error of the solution x. More precisely, in the perturbed linear equation

\((A+\delta A)(x+\delta x) = b+\delta b,\)

the relative error of the solution is bounded, up to higher-order terms, as (Demmel 1997, Sect. 2.2)

\(\dfrac{\|\delta x\|}{\|x\|} \;\leq\; \kappa(A)\left(\dfrac{\|\delta A\|}{\|A\|} + \dfrac{\|\delta b\|}{\|b\|}\right),\)

where \(\|A\|\) is the operator norm of the matrix A defined by

\(\|A\| = \max_{x\neq 0}\dfrac{\|Ax\|}{\|x\|}.\)

Hence, a small condition number is preferred in numerical computation.
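
A short numerical illustration of this error amplification (with illustrative matrices: a nearly diagonal, well-conditioned matrix versus a Hilbert matrix, which is notoriously ill-conditioned) is given below; a tiny relative perturbation of b is magnified in the solution by a factor of up to κ(A).

```r
set.seed(4)
rel_err <- function(A) {
  n  <- nrow(A)
  x  <- rnorm(n)
  b  <- A %*% x
  db <- rnorm(n)
  db <- db / sqrt(sum(db^2)) * 1e-10 * sqrt(sum(b^2))   # ||db|| / ||b|| = 1e-10
  x_pert <- solve(A, b + db)
  c(kappa = kappa(A, exact = TRUE),
    relative_error = sqrt(sum((x_pert - x)^2)) / sqrt(sum(x^2)))
}

hilbert <- function(n) outer(1:n, 1:n, function(i, j) 1 / (i + j - 1))
rel_err(diag(10) + 0.01 * crossprod(matrix(rnorm(100), 10)))  # well conditioned
rel_err(hilbert(10))                                          # ill conditioned
```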

In optimization problems, the condition number provides an upper bound of the convergence rate for optimization algorithms. Let us consider a minimization problem \(\min_{x\in\Re^n} f(x)\), where \(f:\Re^n\rightarrow\Re\) is a differentiable function, and let \(x_0\) be a local optimal solution. We consider an iterative algorithm that generates a sequence \(\{x_i\}_{i=1}^{\infty}\). Let ∇f be the gradient vector of f. In various iterative algorithms, the sequence is generated in the following form

\(x_{i+1} = x_i - \eta_i H_i^{-1}\nabla f(x_i),\)   (12)

where \(\eta_i\) is a non-negative number appropriately determined and \(H_i\) is a symmetric positive definite matrix which approximates the Hessian matrix of f at \(x_0\), i.e., \(\nabla^2 f(x_0)\). Then, under a mild assumption, the sequence \(\{x_i\}_{i=1}^{\infty}\) converges to a local minimizer \(x_0\).

We introduce convergence rates of some optimization methods. According to the ‘modified Newton method’ theorem (Luenberger and Ye 2008, Sect. 10.1), the convergence rate of (12) is given by

\(f(x_{i+1}) - f(x_0) \;\leq\; \left(\dfrac{\kappa_i - 1}{\kappa_i + 1}\right)^{2}\bigl(f(x_i) - f(x_0)\bigr),\)   (13)

where \(\kappa_i\) is the condition number of \(H_i^{-1/2}(\nabla^2 f(x_0))H_i^{-1/2}\). Though the modified Newton method theorem is shown only for convex quadratic functions (Luenberger and Ye 2008), the rate-of-convergence behavior is essentially the same for general nonlinear objective functions; details for non-quadratic functions are presented in Sect. 8.6 of Luenberger and Ye (2008). Equation (13) implies that the convergence of the sequence \(\{x_i\}\) is fast if the \(\kappa_i\), i=1,2,…, are small. In the conjugate gradient method, the convergence rate is expressed by (13) with \(\sqrt{\kappa(\nabla^2 f(x_0))}\) instead of \(\kappa_i\) (Nocedal and Wright 1999, Sect. 5.1). Even in proximal-type methods, the convergence rate is described by a quantity similar to the condition number when the objective function is strongly convex; see Propositions 3 and 4 in Schmidt et al. (2011) for details.
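
The rate (13) can be checked numerically for the steepest-descent case H_i = I on a convex quadratic objective (a sketch with an arbitrary 30-dimensional quadratic whose Hessian has condition number 50): the observed per-iteration reduction factors of f(x_i) − f(x_0) stay close to, and below, ((κ−1)/(κ+1))².

```r
set.seed(5)
n <- 30
Q <- qr.Q(qr(matrix(rnorm(n^2), n)))
A <- Q %*% diag(seq(1, 50, length.out = n)) %*% t(Q)   # Hessian, kappa(A) = 50
b <- rnorm(n)

f      <- function(x) 0.5 * sum(x * (A %*% x)) - sum(b * x)
gradf  <- function(x) drop(A %*% x - b)
f_star <- f(drop(solve(A, b)))                         # optimal value

kap   <- kappa(A, exact = TRUE)
bound <- ((kap - 1) / (kap + 1))^2                     # factor appearing in (13)

x <- rep(0, n); gaps <- numeric(40)
for (i in 1:40) {
  g   <- gradf(x)
  eta <- sum(g^2) / sum(g * (A %*% g))                 # exact line search
  x   <- x - eta * g
  gaps[i] <- f(x) - f_star
}
c(max_observed_factor = max(gaps[-1] / gaps[-length(gaps)]), bound = bound)
```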

A pre-conditioning technique is often applied to speed up the convergence rate of the optimization algorithm. The idea behind pre-conditioning is to perform a change of variables \(x=S\bar{x}\), where S is an invertible matrix. An iterative algorithm is applied to the function \(\bar{f}(\bar {x})=f(S\bar{x})\) in the coordinate system \(\bar{x}\). Then a local optimal solution \(\bar{x}_{0}\) of \(\bar{f}(\bar{x})\) is pulled back to \(x_{0}=S\bar{x}_{0}\).
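
The following R sketch illustrates the effect of such a change of variables on a quadratic objective, using the idealized choice S = (∇²f(x_0))^{−1/2} (available here only because the toy Hessian is known explicitly): in the transformed coordinates the Hessian becomes the identity and its condition number drops to one.

```r
set.seed(6)
n <- 20
A <- crossprod(matrix(rnorm(n^2), n)) + diag(n)   # Hessian of f(x) = 0.5 x'Ax - b'x

eig <- eigen(A, symmetric = TRUE)
S   <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)  # A^{-1/2}

A_bar <- t(S) %*% A %*% S                         # Hessian in coordinates x = S xbar
c(kappa_original    = kappa(A, exact = TRUE),
  kappa_transformed = kappa(A_bar, exact = TRUE)) # approximately 1 after transformation
```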

The pre-conditioning technique is useful if the conditioning of \(\bar{f}(\bar{x})\) is better than that of f(x). In general, however, there are some difficulties in obtaining a suitable pre-conditioner. Consider the iterative algorithm (12) with \(H_i=I\) in the coordinate system \(\bar{x}\), i.e., \(\bar{x}_{i+1}=\bar{x}_i-\eta_i\nabla\bar{f}(\bar{x}_i)\). The Hessian matrix is given as \(\nabla^2\bar{f}(\bar{x}_0)=S^{\top}\nabla^2 f(x_0)S\). Then, the best change of variables is given by \(S=(\nabla^2 f(x_0))^{-1/2}\). This is also confirmed by the fact that the gradient descent method with respect to \(\bar{x}\) is represented as \(x_{i+1}=x_i-\eta_i SS^{\top}\nabla f(x_i)\) in the coordinate system x. In this case, there are at least two drawbacks:

  1. There is no unified strategy to find a good change of variables \(x=S\bar{x}\).

  2. Under the best change of variables \(S=(\nabla^2 f(x_0))^{-1/2}\), the computation of the variable change can be expensive and unstable when the condition number of \(\nabla^2 f(x_0)\) is large.

Similar drawbacks appear in the conjugate gradient methods (Hager and Zhang 2006; Nocedal and Wright 1999).

The first drawback is obvious. To find a good change of variables, it is necessary to estimate the shape of the function f around the local optimal solution x 0 before solving the problem. Except for a specific type of problems such as discretized partial differential equations, finding a good change of variables is difficult (Benzi et al. 2011; Axelsson and Neytcheva 2002; Badia et al. 2009). Though there are some general-purpose pre-conditioners such as the incomplete Cholesky decomposition and banded pre-conditioners, their degree of success varies from problem to problem (Nocedal and Wright 1999, Chap. 5).

To remedy the second drawback, one can use a matrix S with a moderate condition number. When κ(S) is moderate, the computation of the variable change is stable. In the optimization toolbox in MATLAB®, gradient descent methods are implemented by the function fminunc. The default method in fminunc is the BFGS quasi-Newton method, and the Cholesky factorization of the approximate Hessian is used as the transformation matrix S at each step of the algorithm. When the modified Cholesky factorization is used, the condition number of S is guaranteed to be bounded from above by some constant C. See Moré and Sorensen (1984) for more details.

When the variable change \(x=S\bar{x}\) with a bounded condition number is used, there is a trade-off between the numerical accuracy and convergence rate. The trade-off is summarized as

\(\min_{S:\,\kappa(S)\leq C}\ \kappa\bigl(S^{\top}(\nabla^2 f(x_0))S\bigr) \;=\; \max\left\{\dfrac{\kappa(\nabla^2 f(x_0))}{C^{2}},\,1\right\}.\)   (14)

The proof of this equality is given in Appendix B. When C in (14) is small, the computation of the variable change is stable. Conversely, the convergence speed will be slow because the right-hand side of (14) is large. Thus, the formula (14) expresses the trade-off between numerical stability and convergence speed. This implies that a fast convergence rate and stable computation cannot be achieved simultaneously when the condition number of the original problem is large. If \(\kappa(\nabla^2 f(x_0))\) is small, however, the right-hand side of (14) will not be too large. In this case, the trade-off is not significant, and thus numerical stability and convergence speed can both be attained.

Therefore, it is preferable that the condition number of the original problem is kept as small as possible, despite the fact that some scaling or pre-conditioning techniques are available. In the following section, we pursue a loss function of the density-ratio estimator whose Hessian matrix has a small condition number.

4.2 Condition number analysis of M-estimators

In this section, we study the condition number of the Hessian matrix associated with the minimization problem in the φ-divergence approach, and show that KuLSIF is optimal among all M-estimators. More specifically, we provide two kinds of condition-number analyses: a min-max evaluation (Sect. 4.2.1) and a probabilistic evaluation (Sect. 4.2.2).

4.2.1 Min-max evaluation

We assume that a universal RKHS \(\mathcal{H}\) (Steinwart 2001) endowed with a kernel function k on a compact set \(\mathcal{Z}\) is used for density-ratio estimation. The M-estimator is obtained by solving the problem (8). The Hessian matrix of the loss function (8) is equal to

\(\dfrac{1}{n}K_{11}D_{\psi,w}K_{11} + \lambda K_{11},\)   (15)

where \(D_{\psi,w}\) is the n-by-n diagonal matrix defined as

\(D_{\psi,w} = \operatorname{diag}\bigl(\psi''(w(X_1)),\ldots,\psi''(w(X_n))\bigr),\)   (16)

and ψ″ denotes the second-order derivative of ψ. The condition number of the above Hessian matrix is denoted by \(\kappa_0(D_{\psi,w})\):

\(\kappa_0(D_{\psi,w}) = \kappa\!\left(\dfrac{1}{n}K_{11}D_{\psi,w}K_{11} + \lambda K_{11}\right).\)

In KuLSIF, the equality ψ″=1 holds, and thus the condition number is equal to \(\kappa_0(I_n)\). Now we analyze the relation between \(\kappa_0(I_n)\) and \(\kappa_0(D_{\psi,w})\).

Theorem 1

(Min-max Evaluation)

Suppose that \(\mathcal{H}\) is a universal RKHS, and that \(K_{11}\) is non-singular. Let c be a positive constant. Then, the equality

\(\inf_{\psi}\ \sup_{w\in\mathcal{H}}\ \kappa_0(D_{\psi,w}) \;=\; \kappa_0(cI_n)\)   (17)

holds, where the infimum is taken over all convex, second-order continuously differentiable functions ψ satisfying \(\psi''((\psi')^{-1}(1))=c\).

The proof is deferred to Appendix C.

Both ψ(z)=z^2/2 and ψ(z)=−1−log(−z) satisfy the constraint \(\psi''((\psi')^{-1}(1))=1\), and KuLSIF, which uses ψ(z)=z^2/2, minimizes the worst-case condition number because the condition number of KuLSIF does not depend on the optimal solution. Note that, because both sides of (17) depend on the samples \(X_1,\ldots,X_n\), KuLSIF achieves the min-max solution for each observation.

By introducing the constraint \(\psi''((\psi')^{-1}(1))=c\), the balance between the loss term and the regularization term in the objective function of (8) is adjusted. Suppose that q(x)=p(x), i.e., the density ratio is a constant. Then, according to the equality (4), the optimal \(w\in\mathcal{H}\) satisfies \(1=\psi'(w(x))\), if the constant \((\psi')^{-1}(1)\) is included in \(\mathcal{H}\). In this case, the diagonal of \(D_{\psi,w}\) is equal to \(\psi''(w(X_i))=\psi''((\psi')^{-1}(1))=c\). Thus, the Hessian matrix (15) is equal to \(\frac{c}{n}K_{11}^{2}+\lambda K_{11}\), which is independent of ψ as long as ψ satisfies \(\psi''((\psi')^{-1}(1))=c\). Then, the constraint \(\psi''((\psi')^{-1}(1))=c\) adjusts the scaling of the loss term at the constant density ratio. Under this adjustment, the quadratic function \(\psi(z)=cz^2/2\) is optimal up to a linear term in the min-max sense.

4.2.2 Probabilistic evaluation

Next, we present a probabilistic evaluation of condition numbers. As shown in (15), the Hessian matrix at the estimated function \(\widehat{w}\) (which is the minimizer of (8)) is given as

\(H = \dfrac{1}{n}K_{11}D_{\psi,\widehat{w}}K_{11} + \lambda K_{11}.\)

Let us define the random variable \(T_n\) as

\(T_n = \max_{1\leq i\leq n}\psi''\bigl(\widehat{w}(X_i)\bigr).\)   (18)

Since ψ is convex, \(T_n\) is a non-negative random variable. Let \(F_n\) be the distribution function of \(T_n\). The notation indicates that \(T_n\) and \(F_n\) depend on n; to be precise, they depend on both n and m. Here we suppose that either m is fixed to a natural number (possibly infinity), or m is a function of n, \(m=m_n\). Then, \(T_n\) and \(F_n\) depend only on n.

Below, we first consider the distribution of the condition number κ(H), and then investigate the relation between the function ψ and this distribution. To this end, we need to study eigenvalues and condition numbers of random matrices. For the Wishart distribution, the probability distribution of condition numbers has been investigated by Edelman (1988) and Edelman and Sutton (2005). Recently, the condition number of matrices perturbed by additive Gaussian noise has been investigated under the name of smoothed analysis (Sankar et al. 2006; Spielman and Teng 2004; Tao and Vu 2007). However, the statistical properties of the above-defined matrix H are more complicated than those studied in the existing literature: in our problem, the probability distribution of each element does not have a standard form, and the elements are correlated with each other through the kernel function.

Now, we briefly introduce the core idea of the smoothed analysis (Spielman and Teng 2004) and discuss its relation to our study. Consider the averaged computational cost \(\mathrm{E}_P[c(X)]\), where c(X) is the cost of an algorithm for input X, and \(\mathrm{E}_P[\,\cdot\,]\) denotes the expectation with respect to the probability P over the input space. Let \(\mathcal{P}\) be a set of probabilities on the input space. In the smoothed analysis, the performance of the algorithm is measured by \(\max_{P\in\mathcal{P}}\,\mathrm{E}_{P}[c(X)]\). The set of Gaussian distributions is a popular choice for \(\mathcal{P}\).

In contrast, in our theoretical analysis, we consider the probabilistic order of the condition number, \(O_p(\kappa(H))\), as a measure of computational cost. The worst-case evaluation of the computational complexity is measured by \(\max_{P,Q}O_p(\kappa(H))\), where the sample distributions P and Q vary in an appropriate set of distributions. The quantity \(\max_{P,Q}O_p(\kappa(H))\) is the counterpart of the worst-case evaluation of the averaged computational cost \(\mathrm{E}_P[c(X)]\) in the smoothed analysis. The probabilistic order of κ(H) depends on the loss function ψ. We therefore suggest that the loss function achieving the optimal solution of the min-max problem, \(\min_{\psi}\max_{P,Q}O_p(\kappa(H))\), is the optimal choice. The details are given below, where our concern is not only to provide the worst-case computational cost, but also to find the optimal loss function for the M-estimator.

Theorem 2

(Probabilistic Evaluation)

Let \(\mathcal{H}\) be an RKHS endowed with a kernel function \(k:\mathcal{Z}\times\mathcal{Z}\rightarrow\Re\) satisfying the boundedness condition \(\sup_{x,x'\in\mathcal{Z}}k(x,x')<\infty\). Assume that the Gram matrix \(K_{11}\) is almost surely positive definite in terms of the probability measure P. Suppose that, for the regularization parameter \(\lambda_{n,m}\), the boundedness condition \(\limsup_{n\rightarrow\infty}\lambda_{n,m}<\infty\) is satisfied. Let \(U=\sup_{x,x'\in\mathcal{Z}}k(x,x')\) and let \(t_n\) be a sequence such that

\(\lim_{n\rightarrow\infty}F_n(t_n)=1,\)   (19)

where \(F_n\) is the probability distribution of \(T_n\) defined in (18). Then, we have

\(\lim_{n\rightarrow\infty}\Pr\!\left(\kappa(H)\;\leq\;\kappa(K_{11})\left(1+\dfrac{U t_n}{\lambda_{n,m}}\right)\right)=1,\)   (20)

where H is defined as \(H=\frac{1}{n}K_{11}D_{\psi,\widehat{w}} K_{11} + \lambda K_{11}\). The probability Pr(⋅) is defined from the distribution of the samples \(X_1,\ldots,X_n,\ Y_1,\ldots,Y_m\).

The proof of Theorem 2 is deferred to Appendix D.

Remark 1

The Gaussian kernel on a compact set \(\mathcal{Z}\) meets the condition of Theorem 2 under a mild assumption on the probability P. Suppose that \(\mathcal{Z}\) is included in the ball \(\{x\in\Re^d\,|\,\|x\|\leq R\}\). Then, for \(k(x,x')=\exp\{-\gamma\|x-x'\|^2\}\) with \(x,x'\in\mathcal{Z}\) and γ>0, we have \(e^{-4\gamma R^{2}}\leq k(x,x')\leq 1\). If the distribution P of the samples \(X_1,\ldots,X_n\) is absolutely continuous with respect to the Lebesgue measure, the Gram matrix of the Gaussian kernel is almost surely positive definite because \(K_{11}\) is positive definite if \(X_i\neq X_j\) for i≠j.
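
A quick numerical check of Remark 1 (with arbitrary points and an arbitrary kernel parameter γ=10): the Gram matrix of a Gaussian kernel evaluated at distinct points has strictly positive eigenvalues.

```r
set.seed(10)
Z <- matrix(runif(30 * 2, -1, 1), 30, 2)       # 30 distinct points in [-1, 1]^2
K <- exp(-10 * as.matrix(dist(Z))^2)           # Gaussian kernel with gamma = 10
min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)   # positive
```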

When ψ is the quadratic function ψ(z)=z^2/2, the distribution function \(F_n\) is given by \(F_n(t)=\mathbf{1}[t\geq 1]\), where \(\mathbf{1}[\,\cdot\,]\) is the indicator function. By choosing \(t_n=1\) in Theorem 2, an upper bound of κ(H) for ψ(z)=z^2/2 is asymptotically given as \(\kappa(K_{11})(1+\lambda_{n,m}^{-1})\). In contrast, for the M-estimator with the Kullback-Leibler divergence (Nguyen et al. 2010), the function ψ is defined as ψ(z)=−1−log(−z), z<0, and thus ψ″(z)=1/z^2 holds. Then we have \(T_{n}=\max_{1\leq i\leq n}(\widehat{w}(X_{i}))^{-2}\). Note that \((\widehat{w}(X_{i}))^{2}\) may take a very small value, and thus \(T_n\) is expected to be of larger than constant order. As a result, \(t_n\) would diverge to infinity for ψ(z)=−1−log(−z). The results of the above theoretical analysis are confirmed by numerical studies in Sect. 6.

Using the above argument, we show that the quadratic loss is approximately an optimal loss function in the sense of the probabilistic upper bound in Theorem 2. Suppose that the true density ratio q(z)/p(z) is well approximated by the estimator \(\psi'(\widehat{w}(z))\). Instead of \(T_n\), we study its approximation \(\sup_{z\in\mathcal{Z}}\psi''\bigl((\psi')^{-1}(q(z)/p(z))\bigr)\). Then, for any loss function ψ such that \(\psi''((\psi')^{-1}(1))=1\), the inequality

\(\sup_{p,q}\ \sup_{z\in\mathcal{Z}}\ \psi''\bigl((\psi')^{-1}(q(z)/p(z))\bigr)\ \geq\ 1\)

holds, where p and q range over probability densities such that \((\psi')^{-1}(q/p)\in\mathcal{H}\). The equality holds for the quadratic loss. The meaning of the constraint \(\psi''((\psi')^{-1}(1))=1\) is presented in Sect. 4.2.1. Thus, \(t_n=1\), which is provided by the quadratic loss function, is expected to approximately attain the minimum of the upper bound in (20). The quantity \(\sup_{p,q}\sup_{z\in\mathcal{Z}}\psi''((\psi')^{-1}(q(z)/p(z)))\) is the counterpart of \(\max_{P\in\mathcal{P}}\mathrm{E}_P[c(X)]\) in the smoothed analysis. We expect that the loss function attaining the infimum of this quantity provides a computationally efficient learning algorithm.

5 Reduction of condition numbers

In the previous section, we showed that KuLSIF is preferable in terms of computational efficiency and numerical stability. In this section, we study the reduction of condition numbers.

Let \(L_{\mathrm{KuLSIF}}(\alpha)\) and \(L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) be the loss functions of KuLSIF (9) and R-KuLSIF (11), respectively. The Hessian matrices of \(L_{\mathrm{KuLSIF}}(\alpha)\) and \(L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) are given by

\(H_{\mathrm{KuLSIF}} = \dfrac{1}{n}K_{11}^{2} + \lambda K_{11},\)   (21)
\(H_{\mathrm{R\mbox{-}KuLSIF}} = \dfrac{1}{n}K_{11} + \lambda I_n.\)   (22)

Because of the equality \(\kappa(H_{\mathrm{KuLSIF}}) = \kappa(K_{11})\,\kappa(H_{\mathrm{R\mbox{-}KuLSIF}})\), we have the inequality

\(\kappa(H_{\mathrm{R\mbox{-}KuLSIF}}) \;\leq\; \kappa(H_{\mathrm{KuLSIF}}).\)

This inequality implies that the loss function \(L_{\mathrm{KuLSIF}}(\alpha)\) can be transformed to \(L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) without changing the optimal solution, while the condition number is reduced. Hence, R-KuLSIF will be preferable to KuLSIF in the sense of both convergence speed and numerical stability, as explained in Sect. 4.1. Though the loss function of R-KuLSIF is not a member of the regularized M-estimators (8), KuLSIF can be transformed to R-KuLSIF without any computational effort.

Below, we study whether the same reduction of condition numbers is possible in the general φ-divergence approach. If there are M-estimators other than KuLSIF whose condition numbers are reducible, we should compare them with R-KuLSIF and pursue more computationally efficient density-ratio estimators. Our conclusion is that among all of the φ-divergence approaches, the condition number is reducible only for KuLSIF. Thus, the reduction of condition numbers by R-KuLSIF is a special property that makes R-KuLSIF particularly attractive for practical use.

We now show why the condition number of KuLSIF is reducible from \(\kappa(H_{\mathrm{KuLSIF}})\) to \(\kappa(H_{\mathrm{R\mbox{-}KuLSIF}})\) without changing the optimal solution. Solving an unconstrained optimization problem is equivalent to finding a zero of the gradient vector of the loss function. For the loss functions \(L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) and \(L_{\mathrm{KuLSIF}}(\alpha)\), the equality

\(\nabla L_{\mathrm{KuLSIF}}(\alpha) = K_{11}\,\nabla L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\)

holds for any α. Hence, for non-degenerate \(K_{11}\), the zeros of \(\nabla L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) and \(\nabla L_{\mathrm{KuLSIF}}(\alpha)\) coincide. In general, for quadratic convex loss functions \(L_1(\alpha)\) and \(L_2(\alpha)\) that share the same optimal solution, there exists a matrix C such that \(\nabla L_1 = C\nabla L_2\). Indeed, for \(L_1(\alpha)=(\alpha-\alpha^{*})^{\top}A_1(\alpha-\alpha^{*})\) and \(L_2(\alpha)=(\alpha-\alpha^{*})^{\top}A_2(\alpha-\alpha^{*})\), the matrix \(C=A_1A_2^{-1}\) yields the equality \(\nabla L_1 = C\nabla L_2\). Based on this fact, one can obtain a quadratic loss function that shares the same optimal solution but has a smaller condition number, without further computational cost.
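
The relations above are easy to verify numerically. The sketch below (synthetic data, a Gaussian kernel, and the objectives (9) and (11) as written above) checks that ∇L_KuLSIF(α) = K_11∇L_R-KuLSIF(α) holds for an arbitrary α, and compares the condition numbers of the Hessians (21) and (22).

```r
set.seed(7)
n <- 100; m <- 80; d <- 3
X <- matrix(rnorm(n * d), n, d)
Y <- matrix(rnorm(m * d, mean = 0.3), m, d)
sqdist <- function(A, B) outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
K11 <- exp(-sqdist(X, X) / 2); K12 <- exp(-sqdist(X, Y) / 2)
lambda <- 1 / min(n, m)^0.9
b <- K12 %*% rep(1, m) / (n * lambda * m)      # common linear term in (9) and (11)

grad_kulsif  <- function(a) K11 %*% K11 %*% a / n + K11 %*% b + lambda * K11 %*% a
grad_rkulsif <- function(a) K11 %*% a / n + b + lambda * a

a <- rnorm(n)
max(abs(grad_kulsif(a) - K11 %*% grad_rkulsif(a)))   # ~ 0 up to rounding error

H_kulsif  <- K11 %*% K11 / n + lambda * K11          # Hessian (21)
H_rkulsif <- K11 / n + lambda * diag(n)              # Hessian (22)
c(kulsif = kappa(H_kulsif, exact = TRUE), rkulsif = kappa(H_rkulsif, exact = TRUE))
```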

Now, we study loss functions of general M-estimators. Let \(L_{\psi}(\alpha)\) be the loss function of the M-estimator (8), and let L(α) be any other function. Suppose that \(\nabla L(\alpha^{*})=0\) holds if and only if \(\nabla L_{\psi}(\alpha^{*})=0\). This implies that the extremal points of \(L_{\psi}(\alpha)\) and L(α) are the same. Then, there exists a matrix-valued function \(C(\alpha)\in\Re^{n\times n}\) such that

\(\nabla L(\alpha) = C(\alpha)\,\nabla L_{\psi}(\alpha),\)   (23)

where C(α) is non-degenerate for any α. Suppose that C(α) is differentiable. Then, differentiating the above equation at the extremal point \(\alpha^{*}\) leads to the equality

\(\nabla^{2}L(\alpha^{*}) = C(\alpha^{*})\,\nabla^{2}L_{\psi}(\alpha^{*}).\)

When \(\kappa(\nabla^{2}L(\alpha^{*}))\leq\kappa(\nabla^{2}L_{\psi}(\alpha^{*}))\), L(α) will be preferable to \(L_{\psi}(\alpha)\) for numerical computation.

Careful treatment is required for the choice of the matrix C(α) or the loss function L(α). If there is no restriction on the matrix-valued function C(α), the most preferable choice of \(C(\alpha^{*})\) is given by \(C(\alpha^{*})=(\nabla^{2}L_{\psi}(\alpha^{*}))^{-1}\). However, this is clearly meaningless for the purpose of numerical computation, because the transformation requires knowledge of the optimal solution. Even if the function \(L_{\psi}(\alpha)\) is quadratic, finding \((\nabla^{2}L_{\psi}(\alpha^{*}))^{-1}\) is computationally equivalent to solving the optimization problem. To obtain a suitable loss function L(α) without additional computational effort, we need to impose a meaningful constraint on C(α). Below, we assume that the matrix-valued function C(α) is a constant function.

As shown in the proof of Lemma 1, the gradient of the loss function \(L_{\psi}(\alpha)\) is equal to

\(\nabla L_{\psi}(\alpha) = \dfrac{1}{n}K_{11}\,v\!\left(\alpha,\dfrac{1}{\lambda m}\mathbf{1}_m\right) + \lambda K_{11}\alpha,\)

where the function v is defined in Lemma 1. Let \(C\in\Re^{n\times n}\) be a constant matrix, and suppose that the \(\Re^n\)-valued function \(C\nabla L_{\psi}(\alpha)\) is represented as the gradient of a function L, i.e., there exists an L such that \(\nabla L = C\nabla L_{\psi}\). Then, the function \(C\nabla L_{\psi}\) is called integrable (Nakahara 2003). We now require a ψ for which there exists a non-identity matrix C such that \(C\nabla L_{\psi}(\alpha)\) is integrable. According to the Poincaré lemma (Nakahara 2003; Spivak 1979), a necessary and sufficient condition for integrability is that the Jacobian matrix of \(C\nabla L_{\psi}(\alpha)\) is symmetric. The Jacobian matrix of \(C\nabla L_{\psi}(\alpha)\) is given by

\(J_{\psi,C}(\alpha) = C\left(\dfrac{1}{n}K_{11}D_{\psi,\alpha}K_{11} + \lambda K_{11}\right),\)

where \(D_{\psi,\alpha}\) is the n-by-n diagonal matrix with diagonal elements

\(\psi''\!\left(\sum_{j=1}^{n}\alpha_j k(X_i,X_j) + \dfrac{1}{\lambda m}\sum_{l=1}^{m}k(X_i,Y_l)\right),\quad i=1,\ldots,n.\)

In terms of the Jacobian matrix \(J_{\psi,C}(\alpha)\), we have the following theorem.

Theorem 3

Let c be a constant value in ℜ, and let the function ψ be second-order continuously differentiable. Suppose that the Gram matrix \(K_{11}\) is non-singular and that \(K_{11}\) does not have any zero elements. If there exists a non-singular matrix \(C\neq cI_n\) such that \(J_{\psi,C}(\alpha)\) is symmetric for any \(\alpha\in\Re^n\), then ψ″ is a constant function.

The proof is provided in Appendix E.

Theorem 3 implies that for a non-quadratic function ψ, the gradient \(C\nabla L_{\psi}(\alpha)\) cannot be integrable unless \(C=cI_n\), c∈ℜ. As a result, the condition number of the loss function is reducible only when ψ is a quadratic function. The same procedure works for kernel ridge regression (Chapelle 2007; Ratliff and Bagnell 2007) and kernel PCA (Mika et al. 1999). However, there exists no similar procedure for M-estimators with non-quadratic loss functions.

In general, a change of variables is a standard and useful approach to reducing the condition number of a loss function. However, we need a good prediction of the Hessian matrix at the optimal solution to obtain good conditioning. Moreover, additional computation, including matrix manipulation, is required for the coordinate transformation. In contrast, an advantage of the transformation considered in this section is that it requires no effort to predict the Hessian matrix or to manipulate the matrix.

Remark 2

We summarize our theoretical results on condition numbers. Let \(H_{\psi\text{-div}}\) be the Hessian matrix of the loss function (8). Then, the following inequalities hold:

\(\kappa(H_{\mathrm{R\mbox{-}KuLSIF}}) \;\leq\; \kappa(H_{\mathrm{KuLSIF}}) \;\leq\; \sup_{w\in\mathcal{H}}\kappa(H_{\psi\text{-div}}),\)

for any loss function ψ satisfying \(\psi''((\psi')^{-1}(1))=1\). Based on the probabilistic evaluation, the inequality

\(\kappa(H_{\mathrm{KuLSIF}}) \;\leq\; \kappa(H_{\psi\text{-div}})\)

will also hold with high probability.

6 Simulation study

In this section, we experimentally investigate the relation between the condition number and the convergence rate. All computations are conducted using a Xeon X5482 (3.20 GHz) with 32 GB of physical memory, running CentOS Linux release 5.2. For the optimization problems, we apply the gradient descent method and quasi-Newton methods instead of the Newton method, since the Newton method does not work efficiently for high-dimensional problems (Luenberger and Ye 2008, introduction of Chap. 10).

6.1 Synthetic data

In the M-estimator based on the φ-divergence, the Hessian matrix involved in the optimization problem (8) is given as

\(H_{\psi\text{-div}} = \dfrac{1}{n}K_{11}D_{\psi,w}K_{11} + \lambda K_{11}.\)   (24)

For the estimator using the Kullback-Leibler divergence (Nguyen et al. 2010; Sugiyama et al. 2008a), the function φ(z) is given as φ(z)=−log z, and thus ψ(z)=−1−log(−z), z<0. Then, ψ′(z)=−1/z and ψ″(z)=1/z^2 for z<0. Thus, for the optimal solution \(w_{\psi}(x)\) under the population distribution, we have \(\psi''(w_{\psi}(x)) = \psi''((\psi')^{-1}(w_0(x))) = w_0(x)^2\), where \(w_0\) is the true density ratio q/p. Then the Hessian matrix at the target function \(w_{\psi}\) is given as

\(H_{\mathrm{KL}} = \dfrac{1}{n}K_{11}\,\operatorname{diag}\bigl(w_0(X_1)^2,\ldots,w_0(X_n)^2\bigr)\,K_{11} + \lambda K_{11}.\)

In contrast, in KuLSIF, the Hessian matrix is given by \(H_{\mathrm{KuLSIF}}\) defined in (21), and the Hessian matrix of R-KuLSIF, \(H_{\mathrm{R\mbox{-}KuLSIF}}\), is shown in (22).

The condition numbers of the Hessian matrices \(H_{\mathrm{KL}}\), \(H_{\mathrm{KuLSIF}}\), and \(H_{\mathrm{R\mbox{-}KuLSIF}}\) are numerically compared. In addition, the condition number of \(K_{11}\) is computed. The probability distributions P and Q are set to normal distributions on the 10-dimensional Euclidean space with the identity variance-covariance matrix \(I_{10}\). The mean vectors of P and Q are set to \(0\cdot\mathbf{1}_{10}\) and \(\mu\,\mathbf{1}_{10}\) with μ=0.2 or μ=0.5, respectively. Note that the mean value μ only affects the condition number of the KL method, not those of R-KuLSIF and KuLSIF. The true density ratio \(w_0\) is determined by P and Q. In the kernel-based estimators, we use the Gaussian kernel with width σ=4, which is close to the median of the distance \(\|X_i-X_j\|\) between samples. Using the median distance as the kernel width is a popular heuristic (Caputo et al. 2002; Schölkopf and Smola 2002). We study two setups: in the first setup, the sample size from P is equal to that from Q, that is, n=m; in the second setup, the sample size from Q is fixed to m=50 and n is varied from 20 to 500. The regularization parameter λ is set to \(\lambda_{n,m}=1/(n\wedge m)^{0.9}\), where \(n\wedge m=\min\{n,m\}\).
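
A condensed R sketch of this comparison (a single run with n=m=200, a kernel of the form exp(−∥x−x′∥²/(2σ²)) with σ=4, and the closed-form density ratio of the two Gaussians used above; these choices follow the description in this subsection but are not the authors' original code) is given below.

```r
set.seed(8)
d <- 10; n <- 200; m <- 200; mu <- 0.5; sigma <- 4
X <- matrix(rnorm(n * d), n, d)                      # samples from P = N(0, I_10)
lambda <- 1 / min(n, m)^0.9
sqdist <- function(A, B) outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
K11 <- exp(-sqdist(X, X) / (2 * sigma^2))            # Gaussian kernel with width sigma

w0 <- exp(mu * rowSums(X) - d * mu^2 / 2)            # true ratio q/p for N(mu 1, I) vs N(0, I)
H_KL       <- K11 %*% diag(w0^2) %*% K11 / n + lambda * K11
H_KuLSIF   <- K11 %*% K11 / n + lambda * K11
H_R_KuLSIF <- K11 / n + lambda * diag(n)

sapply(list(K11 = K11, KL = H_KL, KuLSIF = H_KuLSIF, R_KuLSIF = H_R_KuLSIF),
       kappa, exact = TRUE)
```

In a typical run, the condition number of the R-KuLSIF Hessian comes out much smaller than those of the KuLSIF and KL Hessians, consistent with Fig. 1.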

In each setup, the samples \(X_1,\ldots,X_n\) are randomly generated and the condition numbers are computed. Figure 1 shows the condition numbers averaged over 1000 runs. We see that in all cases the condition number of R-KuLSIF is significantly smaller than those of the other methods. Thus, R-KuLSIF is expected to converge faster than the other methods and to be robust against numerical degeneracy.

Fig. 1  Average condition number of the Hessian matrix over 1000 runs. The left panel shows the condition number for the case n=m and σ=4, and the right panel shows the result when the sample size from Q is fixed to m=50 and σ is set to 4. KL(μ) denotes the condition number of \(H_{\mathrm{KL}}\) when the mean vector of the probability distribution Q is specified by μ. Note that the condition numbers of R-KuLSIF and KuLSIF do not depend on μ

Figure 2 and Table 1 show the average number of iterations and the average computation time for solving the optimization problems over 50 runs. The probability distributions P and Q are the same as those in the above experiments, and the mean vector of Q is set to \(0.5\cdot\mathbf{1}_{10}\). The number of samples from each probability distribution is set to n=m=100,…,6000, and the regularization parameter is set to \(\lambda=1/(n\wedge m)^{0.9}\). Note that n is equal to the number of parameters to be optimized. R-KuLSIF, KuLSIF, and the method based on the Kullback-Leibler divergence (KL) are compared. In addition, the computation time for solving the linear equation

\(\left(\dfrac{1}{n}K_{11} + \lambda I_n\right)\alpha = -\dfrac{1}{n\lambda m}K_{12}\mathbf{1}_m\)   (25)

instead of optimizing (11) is also shown as “direct” in the plot. The kernel parameter σ is determined based on the median of \(\|X_i-X_j\|\). To solve the optimization problems for the M-estimators, we use two optimization methods: one is the BFGS quasi-Newton method implemented in the optim function in R (R Development Core Team 2009), and the other is the steepest descent method. Furthermore, for the “direct” method, we use the solve function in R. Figure 2 shows the result for the BFGS method, and Table 1 shows the result for the steepest descent method. In the numerical experiments for the steepest descent method, the maximum number of iterations is limited to 4000, and the KL method reaches this limit. The numerical results indicate that the number of iterations in the optimization procedure is highly correlated with the condition number of the Hessian matrices.
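
A compact version of this optimization experiment (single run, n=m=500, BFGS only, and the objectives (9) and (11) as written above; not the authors' original script) can be run with optim as follows; the reported counts are the numbers of function and gradient evaluations.

```r
set.seed(9)
d <- 10; n <- 500; m <- 500
X <- matrix(rnorm(n * d), n, d)
Y <- matrix(rnorm(m * d, mean = 0.5), m, d)
sqdist <- function(A, B) outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
sigma  <- median(sqrt(pmax(sqdist(X, X), 0)))
K11 <- exp(-sqdist(X, X) / (2 * sigma^2))
K12 <- exp(-sqdist(X, Y) / (2 * sigma^2))
lambda <- 1 / min(n, m)^0.9
b <- drop(K12 %*% rep(1, m)) / (n * lambda * m)

# R-KuLSIF objective (11) and its gradient.
f_rk <- function(a) sum(a * (K11 %*% a)) / (2 * n) + sum(a * b) + lambda * sum(a^2) / 2
g_rk <- function(a) drop(K11 %*% a) / n + b + lambda * a

# KuLSIF objective (9) and its gradient.
f_k <- function(a) {
  Ka <- drop(K11 %*% a)
  sum(Ka^2) / (2 * n) + sum(Ka * b) + lambda * sum(a * Ka) / 2
}
g_k <- function(a) {
  Ka <- drop(K11 %*% a)
  drop(K11 %*% Ka) / n + drop(K11 %*% b) + lambda * Ka
}

a0 <- rep(0, n)
res_rk <- optim(a0, f_rk, g_rk, method = "BFGS", control = list(maxit = 1000))
res_k  <- optim(a0, f_k,  g_k,  method = "BFGS", control = list(maxit = 1000))
rbind(R_KuLSIF = res_rk$counts, KuLSIF = res_k$counts)
```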

Fig. 2  Average computation time and average number of iterations in the BFGS method over 50 runs

Table 1 Average computation time and average number of iterations in the steepest descent method over 50 runs. “>” indicates that the actual computation time is longer than the number described in the table

Although the practical computational time would depend on various issues such as stopping rules, our theoretical results in Sect. 4 are shown to be in good agreement with the empirical results for the synthetic data. We observed that numerical optimization methods such as the quasi-Newton method are competitive with numerical algorithms for solving linear equations using LU decomposition or Cholesky decomposition, especially when the sample size n (which is equal to the number of optimization parameters in the current setup) is large. This implies that the theoretical result obtained in this study will be useful in large sample cases, which is common in practical applications.

6.2 Benchmark data

Next, we apply the density-ratio estimation to benchmark data sets, and compare the computational cost. The statistical performance of each estimator for a linear model has been extensively compared on benchmark data sets in Kanamori et al. (2009, 2012), and Hido et al. (2011). Therefore, here, we focus on the numerical efficiency of each method.

Let us consider an outlier detection problem of finding irregular samples in a data set (the “evaluation data set”) based on another data set (the “model data set”) that contains only regular samples (Hido et al. 2011). Defining the density ratio over the two sets of samples, we can see that the density-ratio values for regular samples are close to one, while those for outliers tend to deviate significantly from one. Since the evaluation data set usually has wider support than the model data set, we regard the evaluation data set as samples corresponding to the denominator of the density ratio and the model data set as samples corresponding to the numerator. Then the target density ratio \(w_0(x)\) is approximately equal to one over a wide range of the data domain, and takes small values around outliers.

The data sets provided by IDA (Rätsch et al. 2001) are used. These are binary classification data sets consisting of positive/negative and training/test samples. We allocate all positive training samples to the “model” set, while all positive test samples and 5 % of negative test samples are assigned to the “evaluation set.” Thus, we regard the positive samples as inliers and negative samples as outliers.

Table 2 shows the average computation time and the average number of iterations over 20 runs for image and splice and over 50 runs for the other data sets. In the same way as in the simulations in Sect. 6.1, we compare R-KuLSIF, KuLSIF, and the M-estimator with the Kullback-Leibler divergence (KL). In addition, the computation time of solving the linear equation (25) is shown as “direct” in the table. For the optimization, we use the BFGS method implemented in the optim function in R (R Development Core Team 2009), and we use the solve function in R for the “direct” method. The kernel parameter σ is determined based on the median of \(\|X_i-X_j\|\), which is computed by the function sigest in the kernlab library (Karatzoglou et al. 2004). The average number of samples is shown in the second column, and the regularization parameter is set to \(\lambda=1/(n\wedge m)^{0.9}\).

Table 2 Average computation time (s) and average number of iterations for the benchmark data sets. The BFGS quasi-Newton method in the optim function of the R environment is used to obtain the numerical solutions. Data sets are arranged in ascending order of the sample size n. Results of the method with the lowest mean are shown in bold face

The numerical results show that, when the sample size is balanced (i.e., n and m are comparable to each other), the number of iterations for R-KuLSIF is the smallest, which agrees well with our theoretical analysis. On the other hand, for titanic, waveform, banana, ringnorm, and twonorm, the number of iterations for each method is almost the same. In these data sets, m is much smaller than n, and thus the second term \(\lambda K_{11}\) in the Hessian matrix (24) for the M-estimator will govern the convergence property, since the order of \(\lambda_{n,m}\) is larger than O(1/n). This tendency is explained by the result in Theorem 2: based on (20), we see that a large \(\lambda_{n,m}\) provides a smaller upper bound on κ(H).

Next, we investigate the number of iterations when n and m are comparable to each other. The data sets titanic, waveform, banana, ringnorm, and twonorm are used. We consider two setups. In the first series of experiments, the evaluation data set consists of all positive test samples, and the model data set is defined by all negative test samples. Therefore, the target density ratio may be far from the constant function \(w_0(x)=1\). Table 3 shows the average computation time and the average number of iterations over 20 runs. In this case, the number of iterations for optimization agrees with our theoretical result, that is, R-KuLSIF yields low computational costs in all experiments. In the second series of experiments, both model samples and evaluation samples are randomly chosen from all (i.e., both positive and negative) test samples. Thus, the target density ratio is almost equal to the constant function \(w_0(x)=1\). Table 4 shows the average computation time and the average number of iterations over 20 runs. The number of iterations for “KL” is much smaller than in the first setup shown in Table 3. This is because the condition number of the Hessian matrix (24) is likely to be small when the true density ratio \(w_0\) is close to the constant function. R-KuLSIF is, however, still the preferable approach. Furthermore, the computation time of R-KuLSIF is comparable to that of a direct method such as the Cholesky decomposition when the sample size (i.e., the number of variables) is large.

Table 3 Average computation time (s) and average number of iterations for the benchmark data sets with balanced sample sizes. Titanic, waveform, banana, ringnorm, and twonorm are used as data sets. The evaluation data set consists of all positive test samples, and the model data set is defined by all negative test samples, i.e., the density ratio will be far from the constant function. The BFGS quasi-Newton method in the optim function of the R environment is used to obtain the numerical solutions. Data sets are arranged in ascending order of the sample size n. Results of the method with the lowest mean are shown in bold face
Table 4 Average computation time (s) and average number of iterations for the benchmark data sets with balanced sample sizes. Titanic, waveform, banana, ringnorm, and twonorm are used as data sets. The evaluation data set and the model data set are randomly generated from all (i.e., both positive and negative) test samples, i.e., the density ratio is close to the constant function. The BFGS quasi-Newton method in the optim function of the R environment is used to obtain the numerical solutions. Data sets are arranged in ascending order of the sample size n. Results of the method with the lowest mean are shown in bold face

In summary, the numerical experiments showed that the convergence rate of the optimization is well explained by the condition number of the Hessian matrix. The relation between the loss function ψ and the condition number was discussed in Sect. 4, and our theoretical results imply that R-KuLSIF is a computationally efficient way to estimate density ratios. The numerical results in this section also indicate that our theoretical results are useful for obtaining practical and computationally efficient estimators.

7 Conclusions

We considered the problem of estimating the ratio of two probability densities and investigated theoretical properties of the kernel least-squares estimator called KuLSIF. More specifically, we theoretically studied the condition number of Hessian matrices, because the condition number is closely related to the convergence rate of optimization and the numerical stability. We found that KuLSIF has a smaller condition number than the other methods. Therefore, KuLSIF will have preferable computational properties. We further showed that R-KuLSIF, which is an alternative formulation of KuLSIF, possesses an even smaller condition number. Numerical experiments showed that practical numerical properties of optimization algorithms could be well explained by our theoretical analysis of condition numbers, even though the condition number only provides an upper bound of the rate of convergence. A theoretical issue to be further investigated is the derivation of a tighter probabilistic order of the condition number.

Density-ratio estimation was shown to provide new approaches to solving various machine learning problems (Sugiyama et al. 2009, 2012b), including covariate shift adaptation (Shimodaira 2000; Zadrozny 2004; Sugiyama and Müller 2005; Gretton et al. 2009; Sugiyama et al. 2007; Bickel et al. 2009; Quiñonero Candela et al. 2009; Sugiyama and Kawanabe 2012), multi-task learning (Bickel et al. 2008; Simm et al. 2011), inlier-based outlier detection (Hido et al. 2008, 2011; Smola et al. 2009), change detection in time-series (Kawahara and Sugiyama 2011), divergence estimation (Nguyen et al. 2010), two-sample testing (Sugiyama et al. 2011), mutual information estimation (Suzuki et al. 2008, 2009b), feature selection (Suzuki et al. 2009a), sufficient dimension reduction (Sugiyama et al. 2010a), independence testing (Sugiyama and Suzuki 2011), independent component analysis (Suzuki and Sugiyama 2011), causal inference (Yamada and Sugiyama 2010), object matching (Yamada and Sugiyama 2011), clustering (Kimura and Sugiyama 2011), conditional density estimation (Sugiyama et al. 2010b), and probabilistic classification (Sugiyama 2010). In future work, we will develop practical algorithms for a wide range of applications on the basis of theoretical guidance provided in this study.