1 Introduction

In this section, we introduce background material on the target problem addressed in this study.

1.1 Density-ratio estimation

Recently, methods of directly estimating the ratio of two probability densities without going through density estimation have been developed. These methods can be used to solve various machine learning tasks such as importance sampling, divergence estimation, mutual information estimation, and conditional probability estimation (Sugiyama et al. 2009, 2012b).

The kernel mean matching (KMM) method (Gretton et al. 2009) directly yields density ratio estimates by efficiently matching the two distributions using a special property of the universal reproducing kernel Hilbert spaces (RKHSs) (Steinwart 2001). Another approach is the M-estimator (Nguyen et al. 2010), which is based on the non-asymptotic variational characterization of the φ-divergence (Ali and Silvey 1966; Csiszár 1967). See Sugiyama et al. (2008a) for a similar algorithm that uses the Kullback-Leibler divergence. Non-parametric convergence properties of the M-estimator in RKHSs have been elucidated under the Kullback-Leibler divergence (Nguyen et al. 2010; Sugiyama et al. 2008b). A squared-loss version of the M-estimator for linear density-ratio models called unconstrained Least-Squares Importance Fitting (uLSIF) has also been developed (Kanamori et al. 2009). The squared-loss version was also shown to possess useful computational properties, e.g., a closed-form solution is available, and the leave-one-out cross-validation score can be computed analytically. A kernelized variant of uLSIF was recently proposed, and its statistical consistency was studied (Kanamori et al. 2012).

In this paper, we study loss functions of M-estimators. A general framework for density-ratio estimation was established in Nguyen et al. (2010) (see also Sugiyama et al. 2012a). However, when we estimate the density ratio for real-world data analysis, it becomes necessary to choose an M-estimator from infinitely many candidates. Hence, it is important to study which M-estimator should be chosen in practice. The suitability of an estimator depends on the chosen criterion. In learning problems, there are mainly two criteria for choosing an estimator: (1) the estimation accuracy and (2) the computational cost. Kanamori et al. (2012) studied the choice of loss functions in density-ratio estimation from the viewpoint of estimation accuracy. In the present paper, we focus on the computational cost associated with density-ratio estimators.

1.2 Condition numbers

In numerical analysis, the computational cost is closely related to the so-called condition number (von Neumann and Goldstine 1947; Turing 1948; Eckart and Young 1936). Indeed, the condition number appears as a parameter in complexity bounds for a variety of efficient iterative algorithms in linear algebra, linear and convex optimization, and homotopy methods for solving systems of polynomial equations (Luenberger and Ye 2008; Nocedal and Wright 1999; Renegar 1987, 1995; Smale 1981; Demmel 1997).

The definition of the condition number depends on the problem. In computational tasks involving matrix manipulations, a typical definition of the condition number is the ratio of the maximum and minimum singular values of the matrix given as the input of the problem under consideration. For example, consider solving the linear equation Ax=b. The input of the problem is the matrix A, and the computational cost of finding the solution can be evaluated by the condition number of A, denoted hereafter by κ(A). Specifically, when an iterative algorithm is applied to solve Ax=b, the number of iterations required to converge to a solution is evaluated using κ(A). In general, a problem with a larger condition number results in a higher computational cost. Since the condition number is independent of the algorithm, it is expected to represent the essential difficulty of the problem.
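
As a small illustration of this point, the following R snippet (a minimal sketch with arbitrary synthetic matrices, not part of the original analysis) computes κ(A) as the ratio of the extreme singular values and counts the iterations that a plain steepest-descent solver for Ax=b needs to reach a fixed residual tolerance; the iteration count grows with κ(A).

```r
set.seed(1)

# Build a symmetric positive definite matrix with a prescribed condition number.
make_spd <- function(n, cond) {
  Q <- qr.Q(qr(matrix(rnorm(n * n), n, n)))   # random orthogonal matrix
  Q %*% diag(seq(1, cond, length.out = n)) %*% t(Q)
}

# Condition number as the ratio of the maximum and minimum singular values.
cond_number <- function(A) {
  s <- svd(A)$d
  max(s) / min(s)
}

# Steepest descent with exact line search for Ax = b; returns the iteration count.
solve_sd <- function(A, b, tol = 1e-8, maxit = 1e5) {
  x <- rep(0, length(b))
  for (it in 1:maxit) {
    r <- b - A %*% x                                       # residual
    if (sqrt(sum(r^2)) < tol) return(it)
    eta <- drop(crossprod(r, r) / crossprod(r, A %*% r))   # exact step size
    x <- x + eta * r
  }
  maxit
}

for (cond in c(10, 100, 1000)) {
  A <- make_spd(50, cond)
  b <- rnorm(50)
  cat(sprintf("kappa(A) = %8.1f   iterations = %6d\n",
              cond_number(A), solve_sd(A, b)))
}
```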

To evaluate the efficiency of numerical algorithms, a two-stage approach is frequently used. In the first stage, the relation between the computational cost c(A) of an algorithm with input A and the condition number κ(A) of the problem is studied; a formula such as \(c(A)=O(\kappa(A)^{\alpha})\) is obtained, where α is a constant depending on the algorithm. In the second stage, the probability distribution of κ(A) is estimated, for example in the form \(\Pr(\kappa(A)\geq x)\leq x^{-\beta}\), where the probability is designed to represent a “practical” input distribution. As a result, the average computational cost of the algorithm can be evaluated. For details of this approach, see Blum and Shub (1986), Renegar (1987), Demmel (1988), Kostlan (1988), Edelman (1988, 1992), Shub (1993), Shub and Smale (1994, 1996), Cheung and Cucker (2002), Cucker and Wschebor (2002), Beltran and Pardo (2006), Bürgisser et al. (2010).

1.3 Smoothed analysis

The “average” performance is often controversial, because it is hard to identify the input probability distribution in real-world problems. Spielman and Teng (2004) proposed the smoothed analysis to refine the second stage of the above scheme and obtain more meaningful probabilistic upper bounds on the complexity. Smoothed analysis is a hybrid of the worst-case and average-case analyses. Consider the averaged computational cost \(\mathrm{E}_P[c(A)]\), where c(A) is the cost of an algorithm for input A and \(\mathrm{E}_P[\,\cdot\,]\) denotes the expectation with respect to the probability P over the input space. Let \(\mathcal{P}\) be a set of probability distributions on the input space. Then, in the smoothed analysis, the performance of the algorithm is measured by \(\max_{P\in\mathcal{P}}\,\mathrm{E}_{P}[c(A)]\), i.e., the worst-case evaluation of the expected computational cost over a set of probability distributions.

The smoothed analysis was successfully employed in understanding the practical efficiency of the simplex algorithm for linear programming problems (Spielman and Teng 2004; Bürgisser et al. 2006a). In the context of machine learning, the smoothed analysis was applied to elucidate the complexity of learning algorithms such as the perceptron algorithm and the k-means method; see Vershynin (2006), Blum and Dunagan (2002), Becchetti et al. (2006), Röglin and Vöcking (2007), Manthey and Röglin (2009), Bürgisser et al. (2006b), Bürgisser and Cucker (2010), Sankar et al. (2006) for more applications of the smoothed analysis technique.

The concept of the smoothed analysis, i.e., the worst-case evaluation of the expected computational cost over a set of probability distributions, is compatible with many problem setups in machine learning and statistics. A typical assumption in statistical inference is that training samples are distributed according to a probability distribution belonging to some set of distributions. This set may be specified by a finite-dimensional parameter, or an infinite-dimensional space may be introduced to describe it.

1.4 Our contributions

In this study, we apply the concept of smoothed analysis to study the computational cost of density-ratio estimation algorithms. In our analysis, we define the probability distribution on the basis of training samples, and study the optimal choice of the loss function for M-estimators.

More specifically, we consider the optimization problems associated with the M-estimators. There are some definitions of condition numbers to measure the complexity of optimization problems (Bürgisser et al. 2006c; Renegar 1995; Todd et al. 2001). In unconstrained non-linear optimization problems, the condition number defined from the Hessian matrix of the loss function plays a crucial role, because it determines the convergence rate of optimization and the numerical stability (Luenberger and Ye 2008; Nocedal and Wright 1999). When a loss function to be optimized depends on random samples, the computational cost will be affected by the distribution of the condition number. Therefore, we study the distribution of condition numbers for randomly perturbed matrices. Next, we derive the loss function that has the smallest condition number among all M-estimators in the min-max sense. We also give a probabilistic evaluation of the condition number. Finally, we verify these theoretical findings through numerical experiments.

There are many important aspects to the computational cost of numerical algorithms such as memory requirements, the role of stopping conditions, and the scalability to large data sets. In this study, we evaluate the computational cost and stability of learning problems on the basis of the condition number of the loss function, because the condition number is a major parameter to quantify the difficulty of the numerical computation as explained above.

1.5 Structure of the paper

The remainder of this paper is structured as follows. In Sect. 2, we formulate the problem of density-ratio estimation and briefly review existing methods. In Sect. 3, a kernel-based density-ratio estimator is introduced. Section 4 presents the main contribution of this paper, namely condition-number analyses of density-ratio estimation methods. In Sect. 5, we further investigate the possibility of reducing the condition number of loss functions. In Sect. 6, we experimentally investigate the behavior of condition numbers, confirming the validity of our theoretical analysis. In Sect. 7, we conclude by summarizing our contributions and indicating possible future research directions. Technical details are presented in Appendices A–E.

2 Estimation of density ratios

In this section, we formulate the problem of density-ratio estimation and briefly review existing methods.

2.1 Formulation and notations

Consider two probability distributions P and Q on a probability space \(\mathcal{Z}\). Let the distributions P and Q have the probability densities p and q, respectively. We assume p(x)>0 for all \(x\in\mathcal{Z}\). Suppose that we are given two sets of independent and identically distributed (i.i.d.) samples,

\(X_1,\ldots,X_n \overset{\mathrm{i.i.d.}}{\sim} P, \qquad Y_1,\ldots,Y_m \overset{\mathrm{i.i.d.}}{\sim} Q.\)   (1)

Our goal is to estimate the density ratio

\(w_0(x)=\dfrac{q(x)}{p(x)}\)

based on the observed samples.

We summarize some notations to be used throughout the paper. For a vector a in the Euclidean space, ∥a∥ denotes the Euclidean norm. Given a probability distribution P and a random variable h(X), we denote the expectation of h(X) under P by ∫h dP or ∫h(x)P(dx). Let \(\|\cdot\|_{\infty}\) be the infinity norm. For a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}\) (Aronszajn 1950), the inner product and the norm on \(\mathcal{H}\) are denoted as \(\langle\cdot,\cdot\rangle_{\mathcal{H}}\) and \(\|\cdot\|_{\mathcal{H}}\), respectively.

2.2 M-estimator based on φ-divergence

An estimator of the density ratio based on the φ-divergence (Ali and Silvey 1966; Csiszár 1967) has been proposed by Nguyen et al. (2010). Let φ:ℜ→ℜ be a convex function, and suppose that φ(1)=0. Then, the φ-divergence between P and Q is defined by the integral

\(\displaystyle\int \varphi\!\left(\frac{q(x)}{p(x)}\right)p(x)\,dx.\)

Setting φ(z)=−log z, we obtain the Kullback-Leibler divergence as an example of the φ-divergence. Let ψ be the conjugate dual function of φ, i.e.,

\(\psi(z)=\sup_{u\in\Re}\,\{zu-\varphi(u)\}.\)

When φ is a convex function, we also have

\(\varphi(u)=\sup_{z\in\Re}\,\{uz-\psi(z)\}.\)   (2)

We assume ψ is differentiable. See Sects. 12 and 26 of Rockafellar (1970) for details on the conjugate dual function. Substituting (2) into the φ-divergence, we obtain the expression

\(\displaystyle\int \varphi\!\left(\frac{q}{p}\right)dP \;=\; -\inf_{w}\left[\int\psi(w)\,dP-\int w\,dQ\right],\)   (3)

where the infimum is taken over all measurable functions \(w:\mathcal{Z}\rightarrow\Re\). The infimum is attained at the function w satisfying

\(\dfrac{q(x)}{p(x)}=\psi'(w(x)),\)   (4)

where ψ′ is the derivative of ψ.

Approximating (3) with the empirical distribution, we obtain the empirical loss function

\(\dfrac{1}{n}\sum_{i=1}^{n}\psi(w(X_i)) - \dfrac{1}{m}\sum_{j=1}^{m} w(Y_j).\)

A parametric or non-parametric model is assumed for the function w. This estimator is referred to as the M-estimator of the density ratio (Nguyen et al. 2010). The M-estimator based on the Kullback-Leibler divergence is derived from ψ(z)=−1−log(−z). Sugiyama et al. (2008a) studied the estimator with the Kullback-Leibler divergence in detail and proposed a practical method that includes basis function selection by cross-validation. Kanamori et al. (2009) proposed unconstrained Least-Squares Importance Fitting (uLSIF), which is derived from the quadratic function ψ(z)=z^2/2.
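
As a minimal sketch of the empirical loss above (using a hypothetical linear-in-parameter model with Gaussian basis functions centered at the denominator samples, which is an illustrative choice and not the kernel formulation introduced in Sect. 3), the loss can be written down directly; ψ(z)=z^2/2 gives uLSIF and ψ(z)=−1−log(−z) gives the KL-based M-estimator.

```r
# Empirical M-estimator loss: (1/n) sum_i psi(w(X_i)) - (1/m) sum_j w(Y_j).
psi_quad <- function(z) z^2 / 2            # uLSIF
psi_kl   <- function(z) -1 - log(-z)       # KL-based estimator (defined for z < 0)

# Gaussian basis functions phi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2)).
gauss_basis <- function(x, centers, sigma) {
  exp(-(outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * x %*% t(centers)) /
        (2 * sigma^2))
}

# w(x) = sum_l theta_l * phi_l(x); empirical loss as a function of theta.
emp_loss <- function(theta, psi, PhiX, PhiY) {
  mean(psi(drop(PhiX %*% theta))) - mean(drop(PhiY %*% theta))
}

set.seed(2)
X <- matrix(rnorm(100 * 2), 100, 2)               # samples from P (denominator)
Y <- matrix(rnorm(80 * 2, mean = 0.3), 80, 2)     # samples from Q (numerator)
PhiX <- gauss_basis(X, X, sigma = 1)
PhiY <- gauss_basis(Y, X, sigma = 1)
emp_loss(rep(0.1, nrow(X)), psi_quad, PhiX, PhiY)  # quadratic-loss value at theta = 0.1
```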

3 Kernel-based M-estimator

In this study, we consider kernel-based estimators of density ratios because kernel methods provide a powerful and unified framework for statistical inference (Schölkopf and Smola 2002). Let \(\mathcal{H}\) be an RKHS endowed with the kernel function k defined on \(\mathcal{Z}\times\mathcal{Z}\). Then, based on (3), we minimize the following loss function over \(\mathcal{H}\):

\(\dfrac{1}{n}\sum_{i=1}^{n}\psi(w(X_i)) - \dfrac{1}{m}\sum_{j=1}^{m} w(Y_j) + \dfrac{\lambda}{2}\|w\|_{\mathcal{H}}^{2},\)   (5)

where the regularization term \(\frac{\lambda}{2}\|w\|_{\mathcal{H}}^{2}\) with the regularization parameter λ is introduced to avoid overfitting. Then, an estimator of the density ratio \(w_0\) is given by \(\psi'(\widehat{w}(x))\), where \(\widehat{w}\) is the minimizer of (5). Statistical convergence properties of the kernel estimator using the Kullback-Leibler divergence have been investigated in Nguyen et al. (2010) and Sugiyama et al. (2008b), and a similar analysis for the squared loss was given in Kanamori et al. (2012).

In the RKHS \(\mathcal{H}\), the representer theorem (Kimeldorf and Wahba 1971) is applicable, and the optimization problem on \(\mathcal{H}\) is reduced to a finite-dimensional optimization problem. A detailed analysis leads us to a specific form of the solution as follows.

Lemma 1

Suppose the samples (1) are observed and assume that the function ψ in (5) is a differentiable convex function, and that λ>0. Let \(v(\alpha,\beta)\in\Re^n\) be the vector-valued function defined by

\(v(\alpha,\beta)_i = \psi'\!\left(\sum_{j=1}^{n}\alpha_j k(X_i,X_j) + \sum_{l=1}^{m}\beta_l k(X_i,Y_l)\right),\quad i=1,\ldots,n,\)

for \(\alpha\in\Re^n\) and \(\beta\in\Re^m\), where ψ′ denotes the derivative of ψ. Let \(\mathbf{1}_m=(1,\ldots,1)^{\top}\in\Re^m\) for a positive integer m, and suppose that there exists a vector \(\bar{\alpha}=(\bar{\alpha}_1,\ldots,\bar{\alpha}_n)\in\Re^n\) such that

\(\lambda n\,\bar{\alpha} + v\!\left(\bar{\alpha},\,\dfrac{1}{\lambda m}\mathbf{1}_m\right) = 0.\)   (6)

Then, the estimator \(\widehat{w}\), an optimal solution of (5), has the form

\(\widehat{w}(x) = \sum_{i=1}^{n}\bar{\alpha}_i k(x,X_i) + \dfrac{1}{\lambda m}\sum_{j=1}^{m}k(x,Y_j).\)   (7)

The proof, which can be regarded as an extension of the proof for the least-squares estimator (Kanamori et al. 2012) to general M-estimators, is deferred to Appendix A. This lemma implies that it is sufficient to find the n variables \(\bar{\alpha}_1,\ldots,\bar{\alpha}_n\) to obtain the estimator \(\widehat{w}\).

Using Lemma 1, we can obtain the estimator based on the φ-divergence by solving the following optimization problem

\(\displaystyle\min_{\alpha\in\Re^{n}}\;\frac{1}{n}\sum_{i=1}^{n}\psi\!\left(\sum_{j=1}^{n}\alpha_{j}k(X_i,X_j)+\frac{1}{\lambda m}\sum_{l=1}^{m}k(X_i,Y_l)\right)+\frac{\lambda}{2}\sum_{i,j=1}^{n}\alpha_{i}\alpha_{j}k(X_i,X_j)\)   (8)
\(\text{subject to}\quad \sum_{j=1}^{n}\alpha_{j}k(X_i,X_j)+\frac{1}{\lambda m}\sum_{l=1}^{m}k(X_i,Y_l)\in\operatorname{dom}\psi,\quad i=1,\ldots,n.\)

Though the problem (8) is a constrained optimization problem with respect to the parameter α=(α 1,…,α n ), it can be easily rewritten as an unconstrained one. In this paper, our main concern is to study which ψ we should use as the loss function of the M-estimator. In Sects. 4 and 5, we will show that the quadratic function is a preferable choice from a computational efficiency viewpoint.

Consider the condition (6) for the quadratic function ψ(z)=z^2/2. Let \(K_{11}\), \(K_{12}\), and \(K_{21}\) be the sub-matrices of the Gram matrix

\(K=\begin{pmatrix}K_{11} & K_{12}\\ K_{21} & K_{22}\end{pmatrix},\qquad (K_{11})_{ii'}=k(X_i,X_{i'}),\quad (K_{12})_{ij}=(K_{21})_{ji}=k(X_i,Y_j),\quad (K_{22})_{jj'}=k(Y_j,Y_{j'}),\)

where i,i'=1,…,n, j,j'=1,…,m. Then, for the quadratic loss ψ(z)=z^2/2, we have

\(v\!\left(\alpha,\dfrac{1}{\lambda m}\mathbf{1}_m\right) = K_{11}\alpha + \dfrac{1}{\lambda m}K_{12}\mathbf{1}_m,\)

and thus, there exists a vector \(\bar{\alpha}\) that satisfies the equation (6). For ψ(z)=z^2/2, the problem (8) is reduced to

\(\min_{\alpha\in\Re^n}\ \dfrac{1}{2n}\alpha^{\top}K_{11}^{2}\alpha + \dfrac{1}{n\lambda m}\alpha^{\top}K_{11}K_{12}\mathbf{1}_m + \dfrac{\lambda}{2}\alpha^{\top}K_{11}\alpha\)   (9)

by ignoring the term that is independent of the parameter α. The density-ratio estimator obtained by solving (9) is referred to as the kernelized uLSIF (KuLSIF) (Kanamori et al. 2012).

When the matrix \(K_{11}\) is non-degenerate, the optimal solution of (9) is equal to

\(\widehat{\alpha} = -\dfrac{1}{\lambda m}\left(K_{11} + n\lambda I_n\right)^{-1}K_{12}\mathbf{1}_m.\)   (10)

It is straightforward to confirm that the optimal solution of the problem

\(\min_{\alpha\in\Re^n}\ \dfrac{1}{2n}\alpha^{\top}K_{11}\alpha + \dfrac{1}{n\lambda m}\alpha^{\top}K_{12}\mathbf{1}_m + \dfrac{\lambda}{2}\|\alpha\|^{2}\)   (11)

is the same as (10). The estimator given by solving the optimization problem (11) is referred to as Reduced-KuLSIF (R-KuLSIF). Though the objective functions of KuLSIF and R-KuLSIF are different, their optimal solutions are the same. In Sect. 5, we show that R-KuLSIF is preferable to the other M-estimators (including KuLSIF) from the viewpoint of numerical computation.
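
The closed-form solution (10) is easy to compute directly. The following R sketch (synthetic Gaussian data, a Gaussian kernel of the form exp(−∥x−x′∥²/(2σ²)) with the median-distance heuristic, and a regularization parameter chosen as in Sect. 6; all of these are illustrative choices) forms the KuLSIF/R-KuLSIF solution and evaluates the resulting density-ratio estimate ψ′(ŵ(x))=ŵ(x) from (7).

```r
set.seed(3)
d <- 5; n <- 200; m <- 150
X <- matrix(rnorm(n * d), n, d)                  # X_1,...,X_n ~ P
Y <- matrix(rnorm(m * d, mean = 0.3), m, d)      # Y_1,...,Y_m ~ Q

sqdist <- function(A, B) outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
sigma  <- median(sqrt(pmax(sqdist(X, X), 0))[upper.tri(diag(n))])  # median heuristic
gauss  <- function(A, B) exp(-sqdist(A, B) / (2 * sigma^2))

K11 <- gauss(X, X)
K12 <- gauss(X, Y)
lambda <- 1 / min(n, m)^0.9                      # same order as in Sect. 6

# Closed-form solution (10): alpha = -(1/(lambda m)) (K11 + n lambda I)^{-1} K12 1_m.
alpha <- -solve(K11 + n * lambda * diag(n), K12 %*% rep(1, m)) / (lambda * m)

# Estimator (7); for the quadratic loss the ratio estimate is psi'(w) = w itself.
w_hat <- function(Z) {
  drop(gauss(Z, X) %*% alpha + gauss(Z, Y) %*% rep(1 / (lambda * m), m))
}
summary(w_hat(Y))                                # estimated ratio q/p at the Y samples
```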

4 Condition number analysis for density-ratio estimation

In this section, we study the condition number of loss functions for density-ratio estimation. Through the analysis of condition numbers, we elucidate the computational efficiency of the M-estimator, which is the main contribution of this study.

4.1 Condition number in numerical analysis and optimization

The condition number plays a crucial role in numerical analysis and optimization (Demmel 1997; Luenberger and Ye 2008; Sankar et al. 2006). The main concepts are briefly reviewed here.

Let A be a symmetric positive definite matrix. Then, the condition number of A is defined by \(\lambda_{\max}/\lambda_{\min}\) (≥1), where \(\lambda_{\max}\) and \(\lambda_{\min}\) are the maximum and minimum eigenvalues of A, respectively. The condition number of A is denoted by κ(A).

In numerical analysis, the condition number governs the round-off error of the solution of a linear equation Ax=b. A matrix A with a large condition number results in a large upper bound on the relative error of the solution x. More precisely, in the perturbed linear equation

\((A+\delta A)(x+\delta x) = b+\delta b,\)

the relative error of the solution is bounded, up to higher-order terms, as (Demmel 1997, Sect. 2.2)

\(\dfrac{\|\delta x\|}{\|x\|} \;\leq\; \kappa(A)\left(\dfrac{\|\delta A\|}{\|A\|} + \dfrac{\|\delta b\|}{\|b\|}\right),\)

where \(\|A\|\) is the operator norm of the matrix A defined by

\(\|A\| = \max_{x\neq 0}\dfrac{\|Ax\|}{\|x\|}.\)

Hence, a small condition number is preferred in numerical computation.
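
A short numerical illustration of this error amplification (with illustrative matrices: a nearly diagonal, well-conditioned matrix versus a Hilbert matrix, which is notoriously ill-conditioned) is given below; a tiny relative perturbation of b is magnified in the solution by a factor of up to κ(A).

```r
set.seed(4)
rel_err <- function(A) {
  n  <- nrow(A)
  x  <- rnorm(n)
  b  <- A %*% x
  db <- rnorm(n)
  db <- db / sqrt(sum(db^2)) * 1e-10 * sqrt(sum(b^2))   # ||db|| / ||b|| = 1e-10
  x_pert <- solve(A, b + db)
  c(kappa = kappa(A, exact = TRUE),
    relative_error = sqrt(sum((x_pert - x)^2)) / sqrt(sum(x^2)))
}

hilbert <- function(n) outer(1:n, 1:n, function(i, j) 1 / (i + j - 1))
rel_err(diag(10) + 0.01 * crossprod(matrix(rnorm(100), 10)))  # well conditioned
rel_err(hilbert(10))                                          # ill conditioned
```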

In optimization problems, the condition number provides an upper bound of the convergence rate for optimization algorithms. Let us consider a minimization problem \(\min_{x\in\Re^n} f(x)\), where \(f:\Re^n\rightarrow\Re\) is a differentiable function, and let \(x_0\) be a local optimal solution. We consider an iterative algorithm that generates a sequence \(\{x_i\}_{i=1}^{\infty}\). Let ∇f be the gradient vector of f. In various iterative algorithms, the sequence is generated in the following form

\(x_{i+1} = x_i - \eta_i H_i^{-1}\nabla f(x_i),\)   (12)

where \(\eta_i\) is a non-negative number appropriately determined and \(H_i\) is a symmetric positive definite matrix which approximates the Hessian matrix of f at \(x_0\), i.e., \(\nabla^2 f(x_0)\). Then, under a mild assumption, the sequence \(\{x_i\}_{i=1}^{\infty}\) converges to a local minimizer \(x_0\).

We introduce convergence rates of some optimization methods. According to the ‘modified Newton method’ theorem (Luenberger and Ye 2008, Sect. 10.1), the convergence rate of (12) is given by

\(f(x_{i+1}) - f(x_0) \;\leq\; \left(\dfrac{\kappa_i - 1}{\kappa_i + 1}\right)^{2}\bigl(f(x_i) - f(x_0)\bigr),\)   (13)

where \(\kappa_i\) is the condition number of \(H_i^{-1/2}(\nabla^2 f(x_0))H_i^{-1/2}\). Though the modified Newton method theorem is shown only for convex quadratic functions (Luenberger and Ye 2008), the rate-of-convergence behavior is essentially the same for general nonlinear objective functions; details for non-quadratic functions are presented in Sect. 8.6 of Luenberger and Ye (2008). Equation (13) implies that the convergence of the sequence \(\{x_i\}\) is fast if the \(\kappa_i\), i=1,2,…, are small. In the conjugate gradient method, the convergence rate is expressed by (13) with \(\sqrt{\kappa(\nabla^2 f(x_0))}\) instead of \(\kappa_i\) (Nocedal and Wright 1999, Sect. 5.1). Even in proximal-type methods, the convergence rate is described by a quantity similar to the condition number when the objective function is strongly convex; see Propositions 3 and 4 in Schmidt et al. (2011) for details.
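
The rate (13) can be checked numerically for the steepest-descent case H_i = I on a convex quadratic objective (a sketch with an arbitrary 30-dimensional quadratic whose Hessian has condition number 50): the observed per-iteration reduction factors of f(x_i) − f(x_0) stay close to, and below, ((κ−1)/(κ+1))².

```r
set.seed(5)
n <- 30
Q <- qr.Q(qr(matrix(rnorm(n^2), n)))
A <- Q %*% diag(seq(1, 50, length.out = n)) %*% t(Q)   # Hessian, kappa(A) = 50
b <- rnorm(n)

f      <- function(x) 0.5 * sum(x * (A %*% x)) - sum(b * x)
gradf  <- function(x) drop(A %*% x - b)
f_star <- f(drop(solve(A, b)))                         # optimal value

kap   <- kappa(A, exact = TRUE)
bound <- ((kap - 1) / (kap + 1))^2                     # factor appearing in (13)

x <- rep(0, n); gaps <- numeric(40)
for (i in 1:40) {
  g   <- gradf(x)
  eta <- sum(g^2) / sum(g * (A %*% g))                 # exact line search
  x   <- x - eta * g
  gaps[i] <- f(x) - f_star
}
c(max_observed_factor = max(gaps[-1] / gaps[-length(gaps)]), bound = bound)
```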

A pre-conditioning technique is often applied to speed up the convergence rate of the optimization algorithm. The idea behind pre-conditioning is to perform a change of variables \(x=S\bar{x}\), where S is an invertible matrix. An iterative algorithm is applied to the function \(\bar{f}(\bar {x})=f(S\bar{x})\) in the coordinate system \(\bar{x}\). Then a local optimal solution \(\bar{x}_{0}\) of \(\bar{f}(\bar{x})\) is pulled back to \(x_{0}=S\bar{x}_{0}\).
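
The following R sketch illustrates the effect of such a change of variables on a quadratic objective, using the idealized choice S = (∇²f(x_0))^{−1/2} (available here only because the toy Hessian is known explicitly): in the transformed coordinates the Hessian becomes the identity and its condition number drops to one.

```r
set.seed(6)
n <- 20
A <- crossprod(matrix(rnorm(n^2), n)) + diag(n)   # Hessian of f(x) = 0.5 x'Ax - b'x

eig <- eigen(A, symmetric = TRUE)
S   <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)  # A^{-1/2}

A_bar <- t(S) %*% A %*% S                         # Hessian in coordinates x = S xbar
c(kappa_original    = kappa(A, exact = TRUE),
  kappa_transformed = kappa(A_bar, exact = TRUE)) # approximately 1 after transformation
```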

The pre-conditioning technique is useful if the conditioning of \(\bar{f}(\bar{x})\) is better than that of f(x). In general, however, there are some difficulties in obtaining a suitable pre-conditioner. Consider the iterative algorithm (12) with \(H_i=I\) in the coordinate system \(\bar{x}\), i.e., \(\bar{x}_{i+1}=\bar{x}_i-\eta_i\nabla\bar{f}(\bar{x}_i)\). The Hessian matrix is given as \(\nabla^2\bar{f}(\bar{x}_0)=S^{\top}\nabla^2 f(x_0)S\). Then, the best change of variables is given by \(S=(\nabla^2 f(x_0))^{-1/2}\). This is also confirmed by the fact that the gradient descent method with respect to \(\bar{x}\) is represented as \(x_{i+1}=x_i-\eta_i SS^{\top}\nabla f(x_i)\) in the coordinate system x. In this case, there are at least two drawbacks:

  1. There is no unified strategy to find a good change of variables \(x=S\bar{x}\).

  2. Under the best change of variables \(S=(\nabla^2 f(x_0))^{-1/2}\), the computation of the variable change can be expensive and unstable when the condition number of \(\nabla^2 f(x_0)\) is large.

Similar drawbacks appear in the conjugate gradient methods (Hager and Zhang 2006; Nocedal and Wright 1999).

The first drawback is obvious. To find a good change of variables, it is necessary to estimate the shape of the function f around the local optimal solution x 0 before solving the problem. Except for a specific type of problems such as discretized partial differential equations, finding a good change of variables is difficult (Benzi et al. 2011; Axelsson and Neytcheva 2002; Badia et al. 2009). Though there are some general-purpose pre-conditioners such as the incomplete Cholesky decomposition and banded pre-conditioners, their degree of success varies from problem to problem (Nocedal and Wright 1999, Chap. 5).

To remedy the second drawback, one can use a matrix S with a moderate condition number. When κ(S) is moderate, the computation of the variable change is stable. In the optimization toolbox in MATLAB®, gradient descent methods are implemented by the function fminunc. The default method in fminunc is the BFGS quasi-Newton method, and the Cholesky factorization of the approximate Hessian is used as the transformation matrix S at each step of the algorithm. When the modified Cholesky factorization is used, the condition number of S is guaranteed to be bounded from above by some constant C. See Moré and Sorensen (1984) for more details.

When the variable change \(x=S\bar{x}\) with a bounded condition number is used, there is a trade-off between the numerical accuracy and convergence rate. The trade-off is summarized as

\(\min_{S:\,\kappa(S)\leq C}\ \kappa\bigl(S^{\top}(\nabla^2 f(x_0))S\bigr) \;=\; \max\left\{\dfrac{\kappa(\nabla^2 f(x_0))}{C^{2}},\,1\right\}.\)   (14)

The proof of this equality is given in Appendix B. When C in (14) is small, the computation of the variable change is stable. Conversely, the convergence speed will be slow because the right-hand side of (14) is large. Thus, the formula (14) expresses the trade-off between numerical stability and convergence speed. This implies that a fast convergence rate and stable computation cannot be achieved simultaneously when the condition number of the original problem is large. If \(\kappa(\nabla^2 f(x_0))\) is small, however, the right-hand side of (14) will not be too large. In this case, the trade-off is not significant, and thus numerical stability and convergence speed can both be attained.

Therefore, it is preferable that the condition number of the original problem is kept as small as possible, despite the fact that some scaling or pre-conditioning techniques are available. In the following section, we pursue a loss function of the density-ratio estimator whose Hessian matrix has a small condition number.

4.2 Condition number analysis of M-estimators

In this section, we study the condition number of the Hessian matrix associated with the minimization problem in the φ-divergence approach, and show that KuLSIF is optimal among all M-estimators. More specifically, we provide two kinds of condition-number analyses: a min-max evaluation (Sect. 4.2.1) and a probabilistic evaluation (Sect. 4.2.2).

4.2.1 Min-max evaluation

We assume that a universal RKHS \(\mathcal{H}\) (Steinwart 2001) endowed with a kernel function k on a compact set \(\mathcal{Z}\) is used for density-ratio estimation. The M-estimator is obtained by solving the problem (8). The Hessian matrix of the loss function (8) is equal to

\(\dfrac{1}{n}K_{11}D_{\psi,w}K_{11} + \lambda K_{11},\)   (15)

where \(D_{\psi,w}\) is the n-by-n diagonal matrix defined as

\(D_{\psi,w} = \operatorname{diag}\bigl(\psi''(w(X_1)),\ldots,\psi''(w(X_n))\bigr),\)   (16)

and ψ″ denotes the second-order derivative of ψ. The condition number of the above Hessian matrix is denoted by \(\kappa_0(D_{\psi,w})\):

\(\kappa_0(D_{\psi,w}) = \kappa\!\left(\dfrac{1}{n}K_{11}D_{\psi,w}K_{11} + \lambda K_{11}\right).\)

In KuLSIF, the equality ψ″=1 holds, and thus the condition number is equal to \(\kappa_0(I_n)\). Now we analyze the relation between \(\kappa_0(I_n)\) and \(\kappa_0(D_{\psi,w})\).

Theorem 1

(Min-max Evaluation)

Suppose that \(\mathcal{H}\) is a universal RKHS, and that \(K_{11}\) is non-singular. Let c be a positive constant. Then, the equality

\(\inf_{\psi}\ \sup_{w\in\mathcal{H}}\ \kappa_0(D_{\psi,w}) \;=\; \kappa_0(cI_n)\)   (17)

holds, where the infimum is taken over all convex, second-order continuously differentiable functions ψ satisfying \(\psi''((\psi')^{-1}(1))=c\).

The proof is deferred to Appendix C.

Both ψ(z)=z^2/2 and ψ(z)=−1−log(−z) satisfy the constraint \(\psi''((\psi')^{-1}(1))=1\), and KuLSIF, which uses ψ(z)=z^2/2, minimizes the worst-case condition number because the condition number of KuLSIF does not depend on the optimal solution. Note that, because both sides of (17) depend on the samples \(X_1,\ldots,X_n\), KuLSIF achieves the min-max solution for each observation.

By introducing the constraint \(\psi''((\psi')^{-1}(1))=c\), the balance between the loss term and the regularization term in the objective function of (8) is adjusted. Suppose that q(x)=p(x), i.e., the density ratio is a constant. Then, according to the equality (4), the optimal \(w\in\mathcal{H}\) satisfies \(1=\psi'(w(x))\), if the constant \((\psi')^{-1}(1)\) is included in \(\mathcal{H}\). In this case, the diagonal of \(D_{\psi,w}\) is equal to \(\psi''(w(X_i))=\psi''((\psi')^{-1}(1))=c\). Thus, the Hessian matrix (15) is equal to \(\frac{c}{n}K_{11}^{2}+\lambda K_{11}\), which is independent of ψ as long as ψ satisfies \(\psi''((\psi')^{-1}(1))=c\). Then, the constraint \(\psi''((\psi')^{-1}(1))=c\) adjusts the scaling of the loss term at the constant density ratio. Under this adjustment, the quadratic function \(\psi(z)=cz^2/2\) is optimal up to a linear term in the min-max sense.

4.2.2 Probabilistic evaluation

Next, we present a probabilistic evaluation of condition numbers. As shown in (15), the Hessian matrix at the estimated function \(\widehat{w}\) (which is the minimizer of (8)) is given as

\(H = \dfrac{1}{n}K_{11}D_{\psi,\widehat{w}}K_{11} + \lambda K_{11}.\)

Let us define the random variable \(T_n\) as

\(T_n = \max_{1\leq i\leq n}\psi''\bigl(\widehat{w}(X_i)\bigr).\)   (18)

Since ψ is convex, \(T_n\) is a non-negative random variable. Let \(F_n\) be the distribution function of \(T_n\). The notation indicates that \(T_n\) and \(F_n\) depend on n; to be precise, they depend on both n and m. Here we suppose that either m is fixed to a natural number (possibly infinity), or m is a function of n, \(m=m_n\). Then, \(T_n\) and \(F_n\) depend only on n.

Below, we first consider the distribution of the condition number κ(H), and then investigate the relation between the function ψ and this distribution. To this end, we need to study eigenvalues and condition numbers of random matrices. For the Wishart distribution, the probability distribution of condition numbers has been investigated by Edelman (1988) and Edelman and Sutton (2005). Recently, the condition number of matrices perturbed by additive Gaussian noise has been investigated under the name of smoothed analysis (Sankar et al. 2006; Spielman and Teng 2004; Tao and Vu 2007). However, the statistical properties of the above-defined matrix H are more complicated than those studied in the existing literature: in our problem, the probability distribution of each element does not have a standard form, and the elements are correlated with each other through the kernel function.

Now, we briefly introduce the core idea of the smoothed analysis (Spielman and Teng 2004) and discuss its relation to our study. Consider the averaged computational cost \(\mathrm{E}_P[c(X)]\), where c(X) is the cost of an algorithm for input X, and \(\mathrm{E}_P[\,\cdot\,]\) denotes the expectation with respect to the probability P over the input space. Let \(\mathcal{P}\) be a set of probabilities on the input space. In the smoothed analysis, the performance of the algorithm is measured by \(\max_{P\in\mathcal{P}}\,\mathrm{E}_{P}[c(X)]\). The set of Gaussian distributions is a popular choice for \(\mathcal{P}\).

In contrast, in our theoretical analysis, we consider the probabilistic order of the condition number, \(O_p(\kappa(H))\), as a measure of computational cost. The worst-case evaluation of the computational complexity is measured by \(\max_{P,Q}O_p(\kappa(H))\), where the sample distributions P and Q vary in an appropriate set of distributions. The quantity \(\max_{P,Q}O_p(\kappa(H))\) is the counterpart of the worst-case evaluation of the averaged computational cost \(\mathrm{E}_P[c(X)]\) in the smoothed analysis. The probabilistic order of κ(H) depends on the loss function ψ. We therefore suggest that the loss function achieving the optimal solution of the min-max problem, \(\min_{\psi}\max_{P,Q}O_p(\kappa(H))\), is the optimal choice. The details are given below, where our concern is not only to provide the worst-case computational cost, but also to find the optimal loss function for the M-estimator.

Theorem 2

(Probabilistic Evaluation)

Let \(\mathcal{H}\) be an RKHS endowed with a kernel function \(k:\mathcal{Z}\times\mathcal{Z}\rightarrow\Re\) satisfying the boundedness condition \(\sup_{x,x'\in\mathcal{Z}}k(x,x')<\infty\). Assume that the Gram matrix \(K_{11}\) is almost surely positive definite in terms of the probability measure P. Suppose that, for the regularization parameter \(\lambda_{n,m}\), the boundedness condition \(\limsup_{n\rightarrow\infty}\lambda_{n,m}<\infty\) is satisfied. Let \(U=\sup_{x,x'\in\mathcal{Z}}k(x,x')\) and let \(t_n\) be a sequence such that

\(\lim_{n\rightarrow\infty}F_n(t_n)=1,\)   (19)

where \(F_n\) is the probability distribution of \(T_n\) defined in (18). Then, we have

\(\lim_{n\rightarrow\infty}\Pr\!\left(\kappa(H)\;\leq\;\kappa(K_{11})\left(1+\dfrac{U t_n}{\lambda_{n,m}}\right)\right)=1,\)   (20)

where H is defined as \(H=\frac{1}{n}K_{11}D_{\psi,\widehat{w}} K_{11} + \lambda K_{11}\). The probability Pr(⋅) is defined from the distribution of the samples \(X_1,\ldots,X_n,\ Y_1,\ldots,Y_m\).

The proof of Theorem 2 is deferred to Appendix D.

Remark 1

The Gaussian kernel on a compact set \(\mathcal{Z}\) meets the condition of Theorem 2 under a mild assumption on the probability P. Suppose that \(\mathcal{Z}\) is included in the ball \(\{x\in\Re^d\,|\,\|x\|\leq R\}\). Then, for \(k(x,x')=\exp\{-\gamma\|x-x'\|^2\}\) with \(x,x'\in\mathcal{Z}\) and γ>0, we have \(e^{-4\gamma R^{2}}\leq k(x,x')\leq 1\). If the distribution P of the samples \(X_1,\ldots,X_n\) is absolutely continuous with respect to the Lebesgue measure, the Gram matrix of the Gaussian kernel is almost surely positive definite because \(K_{11}\) is positive definite if \(X_i\neq X_j\) for i≠j.
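
A quick numerical check of Remark 1 (with arbitrary points and an arbitrary kernel parameter γ=10): the Gram matrix of a Gaussian kernel evaluated at distinct points has strictly positive eigenvalues.

```r
set.seed(10)
Z <- matrix(runif(30 * 2, -1, 1), 30, 2)       # 30 distinct points in [-1, 1]^2
K <- exp(-10 * as.matrix(dist(Z))^2)           # Gaussian kernel with gamma = 10
min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)   # positive
```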

When ψ is the quadratic function ψ(z)=z^2/2, the distribution function \(F_n\) is given by \(F_n(t)=\mathbf{1}[t\geq 1]\), where \(\mathbf{1}[\,\cdot\,]\) is the indicator function. By choosing \(t_n=1\) in Theorem 2, an upper bound of κ(H) for ψ(z)=z^2/2 is asymptotically given as \(\kappa(K_{11})(1+\lambda_{n,m}^{-1})\). In contrast, for the M-estimator with the Kullback-Leibler divergence (Nguyen et al. 2010), the function ψ is defined as ψ(z)=−1−log(−z), z<0, and thus ψ″(z)=1/z^2 holds. Then we have \(T_{n}=\max_{1\leq i\leq n}(\widehat{w}(X_{i}))^{-2}\). Note that \((\widehat{w}(X_{i}))^{2}\) may take a very small value, and thus \(T_n\) is expected to be of larger than constant order. As a result, \(t_n\) would diverge to infinity for ψ(z)=−1−log(−z). The results of the above theoretical analysis are confirmed by numerical studies in Sect. 6.

Using the above argument, we show that the quadratic loss is approximately an optimal loss function in the sense of the probabilistic upper bound in Theorem 2. Suppose that the true density ratio q(z)/p(z) is well approximated by the estimator \(\psi'(\widehat{w}(z))\). Instead of \(T_n\), we study its approximation \(\sup_{z\in\mathcal{Z}}\psi''\bigl((\psi')^{-1}(q(z)/p(z))\bigr)\). Then, for any loss function ψ such that \(\psi''((\psi')^{-1}(1))=1\), the inequality

\(\sup_{p,q}\ \sup_{z\in\mathcal{Z}}\ \psi''\bigl((\psi')^{-1}(q(z)/p(z))\bigr)\ \geq\ 1\)

holds, where p and q range over probability densities such that \((\psi')^{-1}(q/p)\in\mathcal{H}\). The equality holds for the quadratic loss. The meaning of the constraint \(\psi''((\psi')^{-1}(1))=1\) is presented in Sect. 4.2.1. Thus, \(t_n=1\), which is provided by the quadratic loss function, is expected to approximately attain the minimum of the upper bound in (20). The quantity \(\sup_{p,q}\sup_{z\in\mathcal{Z}}\psi''((\psi')^{-1}(q(z)/p(z)))\) is the counterpart of \(\max_{P\in\mathcal{P}}\mathrm{E}_P[c(X)]\) in the smoothed analysis. We expect that the loss function attaining the infimum of this quantity provides a computationally efficient learning algorithm.

5 Reduction of condition numbers

In the previous section, we showed that KuLSIF is preferable in terms of computational efficiency and numerical stability. In this section, we study the reduction of condition numbers.

Let \(L_{\mathrm{KuLSIF}}(\alpha)\) and \(L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) be the loss functions of KuLSIF (9) and R-KuLSIF (11), respectively. The Hessian matrices of \(L_{\mathrm{KuLSIF}}(\alpha)\) and \(L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) are given by

\(H_{\mathrm{KuLSIF}} = \dfrac{1}{n}K_{11}^{2} + \lambda K_{11},\)   (21)
\(H_{\mathrm{R\mbox{-}KuLSIF}} = \dfrac{1}{n}K_{11} + \lambda I_n.\)   (22)

Because of the equality \(\kappa(H_{\mathrm{KuLSIF}}) = \kappa(K_{11})\,\kappa(H_{\mathrm{R\mbox{-}KuLSIF}})\), we have the inequality

\(\kappa(H_{\mathrm{R\mbox{-}KuLSIF}}) \;\leq\; \kappa(H_{\mathrm{KuLSIF}}).\)

This inequality implies that the loss function \(L_{\mathrm{KuLSIF}}(\alpha)\) can be transformed to \(L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) without changing the optimal solution, while the condition number is reduced. Hence, R-KuLSIF will be preferable to KuLSIF in the sense of both convergence speed and numerical stability, as explained in Sect. 4.1. Though the loss function of R-KuLSIF is not a member of the regularized M-estimators (8), KuLSIF can be transformed to R-KuLSIF without any computational effort.

Below, we study whether the same reduction of condition numbers is possible in the general φ-divergence approach. If there are M-estimators other than KuLSIF whose condition numbers are reducible, we should compare them with R-KuLSIF and pursue more computationally efficient density-ratio estimators. Our conclusion is that among all of the φ-divergence approaches, the condition number is reducible only for KuLSIF. Thus, the reduction of condition numbers by R-KuLSIF is a special property that makes R-KuLSIF particularly attractive for practical use.

We now show why the condition number of KuLSIF is reducible from \(\kappa(H_{\mathrm{KuLSIF}})\) to \(\kappa(H_{\mathrm{R\mbox{-}KuLSIF}})\) without changing the optimal solution. Solving an unconstrained optimization problem is equivalent to finding a zero of the gradient vector of the loss function. For the loss functions \(L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) and \(L_{\mathrm{KuLSIF}}(\alpha)\), the equality

\(\nabla L_{\mathrm{KuLSIF}}(\alpha) = K_{11}\,\nabla L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\)

holds for any α. Hence, for non-degenerate \(K_{11}\), the zeros of \(\nabla L_{\mathrm{R\mbox{-}KuLSIF}}(\alpha)\) and \(\nabla L_{\mathrm{KuLSIF}}(\alpha)\) coincide. In general, for quadratic convex loss functions \(L_1(\alpha)\) and \(L_2(\alpha)\) that share the same optimal solution, there exists a matrix C such that \(\nabla L_1 = C\nabla L_2\). Indeed, for \(L_1(\alpha)=(\alpha-\alpha^{*})^{\top}A_1(\alpha-\alpha^{*})\) and \(L_2(\alpha)=(\alpha-\alpha^{*})^{\top}A_2(\alpha-\alpha^{*})\), the matrix \(C=A_1A_2^{-1}\) yields the equality \(\nabla L_1 = C\nabla L_2\). Based on this fact, one can obtain a quadratic loss function that shares the same optimal solution but has a smaller condition number, without further computational cost.
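
The relations above are easy to verify numerically. The sketch below (synthetic data, a Gaussian kernel, and the objectives (9) and (11) as written above) checks that ∇L_KuLSIF(α) = K_11∇L_R-KuLSIF(α) holds for an arbitrary α, and compares the condition numbers of the Hessians (21) and (22).

```r
set.seed(7)
n <- 100; m <- 80; d <- 3
X <- matrix(rnorm(n * d), n, d)
Y <- matrix(rnorm(m * d, mean = 0.3), m, d)
sqdist <- function(A, B) outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
K11 <- exp(-sqdist(X, X) / 2); K12 <- exp(-sqdist(X, Y) / 2)
lambda <- 1 / min(n, m)^0.9
b <- K12 %*% rep(1, m) / (n * lambda * m)      # common linear term in (9) and (11)

grad_kulsif  <- function(a) K11 %*% K11 %*% a / n + K11 %*% b + lambda * K11 %*% a
grad_rkulsif <- function(a) K11 %*% a / n + b + lambda * a

a <- rnorm(n)
max(abs(grad_kulsif(a) - K11 %*% grad_rkulsif(a)))   # ~ 0 up to rounding error

H_kulsif  <- K11 %*% K11 / n + lambda * K11          # Hessian (21)
H_rkulsif <- K11 / n + lambda * diag(n)              # Hessian (22)
c(kulsif = kappa(H_kulsif, exact = TRUE), rkulsif = kappa(H_rkulsif, exact = TRUE))
```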

Now, we study loss functions of general M-estimators. Let \(L_{\psi}(\alpha)\) be the loss function of the M-estimator (8), and let L(α) be any other function. Suppose that \(\nabla L(\alpha^{*})=0\) holds if and only if \(\nabla L_{\psi}(\alpha^{*})=0\). This implies that the extremal points of \(L_{\psi}(\alpha)\) and L(α) are the same. Then, there exists a matrix-valued function \(C(\alpha)\in\Re^{n\times n}\) such that

\(\nabla L(\alpha) = C(\alpha)\,\nabla L_{\psi}(\alpha),\)   (23)

where C(α) is non-degenerate for any α. Suppose that C(α) is differentiable. Then, differentiating the above equation at the extremal point \(\alpha^{*}\) leads to the equality

\(\nabla^{2}L(\alpha^{*}) = C(\alpha^{*})\,\nabla^{2}L_{\psi}(\alpha^{*}).\)

When \(\kappa(\nabla^{2}L(\alpha^{*}))\leq\kappa(\nabla^{2}L_{\psi}(\alpha^{*}))\), L(α) will be preferable to \(L_{\psi}(\alpha)\) for numerical computation.

Careful treatment is required for the choice of the matrix C(α) or the loss function L(α). If there is no restriction on the matrix-valued function C(α), the most preferable choice of \(C(\alpha^{*})\) is given by \(C(\alpha^{*})=(\nabla^{2}L_{\psi}(\alpha^{*}))^{-1}\). However, this is clearly meaningless for the purpose of numerical computation, because the transformation requires knowledge of the optimal solution. Even if the function \(L_{\psi}(\alpha)\) is quadratic, finding \((\nabla^{2}L_{\psi}(\alpha^{*}))^{-1}\) is computationally equivalent to solving the optimization problem. To obtain a suitable loss function L(α) without additional computational effort, we need to impose a meaningful constraint on C(α). Below, we assume that the matrix-valued function C(α) is a constant function.

As shown in the proof of Lemma 1, the gradient of the loss function \(L_{\psi}(\alpha)\) is equal to

\(\nabla L_{\psi}(\alpha) = \dfrac{1}{n}K_{11}\,v\!\left(\alpha,\dfrac{1}{\lambda m}\mathbf{1}_m\right) + \lambda K_{11}\alpha,\)

where the function v is defined in Lemma 1. Let \(C\in\Re^{n\times n}\) be a constant matrix, and suppose that the \(\Re^n\)-valued function \(C\nabla L_{\psi}(\alpha)\) is represented as the gradient of a function L, i.e., there exists an L such that \(\nabla L = C\nabla L_{\psi}\). Then, the function \(C\nabla L_{\psi}\) is called integrable (Nakahara 2003). We now require a ψ for which there exists a non-identity matrix C such that \(C\nabla L_{\psi}(\alpha)\) is integrable. According to the Poincaré lemma (Nakahara 2003; Spivak 1979), a necessary and sufficient condition for integrability is that the Jacobian matrix of \(C\nabla L_{\psi}(\alpha)\) is symmetric. The Jacobian matrix of \(C\nabla L_{\psi}(\alpha)\) is given by

\(J_{\psi,C}(\alpha) = C\left(\dfrac{1}{n}K_{11}D_{\psi,\alpha}K_{11} + \lambda K_{11}\right),\)

where \(D_{\psi,\alpha}\) is the n-by-n diagonal matrix with diagonal elements

\(\psi''\!\left(\sum_{j=1}^{n}\alpha_j k(X_i,X_j) + \dfrac{1}{\lambda m}\sum_{l=1}^{m}k(X_i,Y_l)\right),\quad i=1,\ldots,n.\)

In terms of the Jacobian matrix \(J_{\psi,C}(\alpha)\), we have the following theorem.

Theorem 3

Let c be a constant value in ℜ, and let the function ψ be second-order continuously differentiable. Suppose that the Gram matrix \(K_{11}\) is non-singular and that \(K_{11}\) does not have any zero elements. If there exists a non-singular matrix \(C\neq cI_n\) such that \(J_{\psi,C}(\alpha)\) is symmetric for any \(\alpha\in\Re^n\), then ψ″ is a constant function.

The proof is provided in Appendix E.

Theorem 3 implies that for a non-quadratic function ψ, the gradient \(C\nabla L_{\psi}(\alpha)\) cannot be integrable unless \(C=cI_n\), c∈ℜ. As a result, the condition number of the loss function is reducible only when ψ is a quadratic function. The same procedure works for kernel ridge regression (Chapelle 2007; Ratliff and Bagnell 2007) and kernel PCA (Mika et al. 1999). However, there exists no similar procedure for M-estimators with non-quadratic loss functions.

In general, a change of variables is a standard and useful approach to reducing the condition number of a loss function. However, we need a good prediction of the Hessian matrix at the optimal solution to obtain good conditioning. Moreover, additional computation, including matrix manipulation, is required for the coordinate transformation. In contrast, an advantage of the transformation considered in this section is that it requires no effort to predict the Hessian matrix or to manipulate the matrix.

Remark 2

We summarize our theoretical results on condition numbers. Let \(H_{\psi\text{-div}}\) be the Hessian matrix of the loss function (8). Then, the following inequalities hold:

\(\kappa(H_{\mathrm{R\mbox{-}KuLSIF}}) \;\leq\; \kappa(H_{\mathrm{KuLSIF}}) \;\leq\; \sup_{w\in\mathcal{H}}\kappa(H_{\psi\text{-div}}),\)

for any loss function ψ satisfying \(\psi''((\psi')^{-1}(1))=1\). Based on the probabilistic evaluation, the inequality

\(\kappa(H_{\mathrm{KuLSIF}}) \;\leq\; \kappa(H_{\psi\text{-div}})\)

will also hold with high probability.

6 Simulation study

In this section, we experimentally investigate the relation between the condition number and the convergence rate. All computations are conducted using a Xeon X5482 (3.20 GHz) with 32 GB of physical memory, running CentOS Linux release 5.2. For the optimization problems, we apply the gradient descent method and quasi-Newton methods instead of the Newton method, since the Newton method does not work efficiently for high-dimensional problems (Luenberger and Ye 2008, introduction of Chap. 10).

6.1 Synthetic data

In the M-estimator based on the φ-divergence, the Hessian matrix involved in the optimization problem (8) is given as

\(H_{\psi\text{-div}} = \dfrac{1}{n}K_{11}D_{\psi,w}K_{11} + \lambda K_{11}.\)   (24)

For the estimator using the Kullback-Leibler divergence (Nguyen et al. 2010; Sugiyama et al. 2008a), the function φ(z) is given as φ(z)=−log z, and thus ψ(z)=−1−log(−z), z<0. Then, ψ′(z)=−1/z and ψ″(z)=1/z^2 for z<0. Thus, for the optimal solution \(w_{\psi}(x)\) under the population distribution, we have \(\psi''(w_{\psi}(x)) = \psi''((\psi')^{-1}(w_0(x))) = w_0(x)^2\), where \(w_0\) is the true density ratio q/p. Then the Hessian matrix at the target function \(w_{\psi}\) is given as

\(H_{\mathrm{KL}} = \dfrac{1}{n}K_{11}\,\operatorname{diag}\bigl(w_0(X_1)^2,\ldots,w_0(X_n)^2\bigr)\,K_{11} + \lambda K_{11}.\)

In contrast, in KuLSIF, the Hessian matrix is given by \(H_{\mathrm{KuLSIF}}\) defined in (21), and the Hessian matrix of R-KuLSIF, \(H_{\mathrm{R\mbox{-}KuLSIF}}\), is shown in (22).

The condition numbers of the Hessian matrices \(H_{\mathrm{KL}}\), \(H_{\mathrm{KuLSIF}}\), and \(H_{\mathrm{R\mbox{-}KuLSIF}}\) are numerically compared. In addition, the condition number of \(K_{11}\) is computed. The probability distributions P and Q are set to normal distributions on the 10-dimensional Euclidean space with the identity variance-covariance matrix \(I_{10}\). The mean vectors of P and Q are set to \(0\cdot\mathbf{1}_{10}\) and \(\mu\,\mathbf{1}_{10}\) with μ=0.2 or μ=0.5, respectively. Note that the mean value μ only affects the condition number of the KL method, not those of R-KuLSIF and KuLSIF. The true density ratio \(w_0\) is determined by P and Q. In the kernel-based estimators, we use the Gaussian kernel with width σ=4, which is close to the median of the distance \(\|X_i-X_j\|\) between samples. Using the median distance as the kernel width is a popular heuristic (Caputo et al. 2002; Schölkopf and Smola 2002). We study two setups: in the first setup, the sample size from P is equal to that from Q, that is, n=m; in the second setup, the sample size from Q is fixed to m=50 and n is varied from 20 to 500. The regularization parameter λ is set to \(\lambda_{n,m}=1/(n\wedge m)^{0.9}\), where \(n\wedge m=\min\{n,m\}\).
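
A condensed R sketch of this comparison (a single run with n=m=200, a kernel of the form exp(−∥x−x′∥²/(2σ²)) with σ=4, and the closed-form density ratio of the two Gaussians used above; these choices follow the description in this subsection but are not the authors' original code) is given below.

```r
set.seed(8)
d <- 10; n <- 200; m <- 200; mu <- 0.5; sigma <- 4
X <- matrix(rnorm(n * d), n, d)                      # samples from P = N(0, I_10)
lambda <- 1 / min(n, m)^0.9
sqdist <- function(A, B) outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
K11 <- exp(-sqdist(X, X) / (2 * sigma^2))            # Gaussian kernel with width sigma

w0 <- exp(mu * rowSums(X) - d * mu^2 / 2)            # true ratio q/p for N(mu 1, I) vs N(0, I)
H_KL       <- K11 %*% diag(w0^2) %*% K11 / n + lambda * K11
H_KuLSIF   <- K11 %*% K11 / n + lambda * K11
H_R_KuLSIF <- K11 / n + lambda * diag(n)

sapply(list(K11 = K11, KL = H_KL, KuLSIF = H_KuLSIF, R_KuLSIF = H_R_KuLSIF),
       kappa, exact = TRUE)
```

In a typical run, the condition number of the R-KuLSIF Hessian comes out much smaller than those of the KuLSIF and KL Hessians, consistent with Fig. 1.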

In each setup, the samples \(X_1,\ldots,X_n\) are randomly generated and the condition numbers are computed. Figure 1 shows the condition numbers averaged over 1000 runs. We see that in all cases the condition number of R-KuLSIF is significantly smaller than those of the other methods. Thus, R-KuLSIF is expected to converge faster than the other methods and to be robust against numerical degeneracy.

Fig. 1  Average condition number of the Hessian matrix over 1000 runs. The left panel shows the condition number for the case n=m and σ=4, and the right panel shows the result when the sample size from Q is fixed to m=50 and σ is set to 4. KL(μ) denotes the condition number of \(H_{\mathrm{KL}}\) when the mean vector of the probability distribution Q is specified by μ. Note that the condition numbers of R-KuLSIF and KuLSIF do not depend on μ

Figure 2 and Table 1 show the average number of iterations and the average computation time for solving the optimization problems over 50 runs. The probability distributions P and Q are the same as those in the above experiments, and the mean vector of Q is set to \(0.5\cdot\mathbf{1}_{10}\). The number of samples from each probability distribution is set to n=m=100,…,6000, and the regularization parameter is set to \(\lambda=1/(n\wedge m)^{0.9}\). Note that n is equal to the number of parameters to be optimized. R-KuLSIF, KuLSIF, and the method based on the Kullback-Leibler divergence (KL) are compared. In addition, the computation time for solving the linear equation

\(\left(\dfrac{1}{n}K_{11} + \lambda I_n\right)\alpha = -\dfrac{1}{n\lambda m}K_{12}\mathbf{1}_m\)   (25)

instead of optimizing (11) is also shown as “direct” in the plot. The kernel parameter σ is determined based on the median of \(\|X_i-X_j\|\). To solve the optimization problems for the M-estimators, we use two optimization methods: one is the BFGS quasi-Newton method implemented in the optim function in R (R Development Core Team 2009), and the other is the steepest descent method. Furthermore, for the “direct” method, we use the solve function in R. Figure 2 shows the result for the BFGS method, and Table 1 shows the result for the steepest descent method. In the numerical experiments for the steepest descent method, the maximum number of iterations is limited to 4000, and the KL method reaches this limit. The numerical results indicate that the number of iterations in the optimization procedure is highly correlated with the condition number of the Hessian matrices.
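
A compact version of this optimization experiment (single run, n=m=500, BFGS only, and the objectives (9) and (11) as written above; not the authors' original script) can be run with optim as follows; the reported counts are the numbers of function and gradient evaluations.

```r
set.seed(9)
d <- 10; n <- 500; m <- 500
X <- matrix(rnorm(n * d), n, d)
Y <- matrix(rnorm(m * d, mean = 0.5), m, d)
sqdist <- function(A, B) outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
sigma  <- median(sqrt(pmax(sqdist(X, X), 0)))
K11 <- exp(-sqdist(X, X) / (2 * sigma^2))
K12 <- exp(-sqdist(X, Y) / (2 * sigma^2))
lambda <- 1 / min(n, m)^0.9
b <- drop(K12 %*% rep(1, m)) / (n * lambda * m)

# R-KuLSIF objective (11) and its gradient.
f_rk <- function(a) sum(a * (K11 %*% a)) / (2 * n) + sum(a * b) + lambda * sum(a^2) / 2
g_rk <- function(a) drop(K11 %*% a) / n + b + lambda * a

# KuLSIF objective (9) and its gradient.
f_k <- function(a) {
  Ka <- drop(K11 %*% a)
  sum(Ka^2) / (2 * n) + sum(Ka * b) + lambda * sum(a * Ka) / 2
}
g_k <- function(a) {
  Ka <- drop(K11 %*% a)
  drop(K11 %*% Ka) / n + drop(K11 %*% b) + lambda * Ka
}

a0 <- rep(0, n)
res_rk <- optim(a0, f_rk, g_rk, method = "BFGS", control = list(maxit = 1000))
res_k  <- optim(a0, f_k,  g_k,  method = "BFGS", control = list(maxit = 1000))
rbind(R_KuLSIF = res_rk$counts, KuLSIF = res_k$counts)
```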

Fig. 2  Average computation time and average number of iterations in the BFGS method over 50 runs

Table 1 Average computation time and average number of iterations in the steepest descent method over 50 runs. “>” indicates that the actual computation time is longer than the number described in the table

Although the practical computational time would depend on various issues such as stopping rules, our theoretical results in Sect. 4 are shown to be in good agreement with the empirical results for the synthetic data. We observed that numerical optimization methods such as the quasi-Newton method are competitive with numerical algorithms for solving linear equations using LU decomposition or Cholesky decomposition, especially when the sample size n (which is equal to the number of optimization parameters in the current setup) is large. This implies that the theoretical result obtained in this study will be useful in large sample cases, which is common in practical applications.

6.2 Benchmark data

Next, we apply the density-ratio estimation to benchmark data sets, and compare the computational cost. The statistical performance of each estimator for a linear model has been extensively compared on benchmark data sets in Kanamori et al. (2009, 2012), and Hido et al. (2011). Therefore, here, we focus on the numerical efficiency of each method.

Let us consider an outlier detection problem of finding irregular samples in a data set (the “evaluation data set”) based on another data set (the “model data set”) that contains only regular samples (Hido et al. 2011). Defining the density ratio over the two sets of samples, we can see that the density-ratio values for regular samples are close to one, while those for outliers tend to deviate significantly from one. Since the evaluation data set usually has wider support than the model data set, we regard the evaluation data set as samples corresponding to the denominator of the density ratio and the model data set as samples corresponding to the numerator. Then the target density ratio \(w_0(x)\) is approximately equal to one over a wide range of the data domain, and takes small values around outliers.

The data sets provided by IDA (Rätsch et al. 2001) are used. These are binary classification data sets consisting of positive/negative and training/test samples. We allocate all positive training samples to the “model” set, while all positive test samples and 5 % of negative test samples are assigned to the “evaluation set.” Thus, we regard the positive samples as inliers and negative samples as outliers.

Table 2 shows the average computation time and the average number of iterations over 20 runs for image and splice and over 50 runs for the other data sets. In the same way as in the simulations in Sect. 6.1, we compare R-KuLSIF, KuLSIF, and the M-estimator with the Kullback-Leibler divergence (KL). In addition, the computation time of solving the linear equation (25) is shown as “direct” in the table. For the optimization, we use the BFGS method implemented in the optim function in R (R Development Core Team 2009), and we use the solve function in R for the “direct” method. The kernel parameter σ is determined based on the median of \(\|X_i-X_j\|\), which is computed by the function sigest in the kernlab library (Karatzoglou et al. 2004). The average number of samples is shown in the second column, and the regularization parameter is set to \(\lambda=1/(n\wedge m)^{0.9}\).

Table 2 Average computation time (s) and average number of iterations for the benchmark data sets. The BFGS quasi-Newton method in the optim function of the R environment is used to obtain the numerical solutions. Data sets are arranged in ascending order of the sample size n. Results of the method with the lowest mean are shown in bold face

The numerical results show that, when the sample size is balanced (i.e., n and m are comparable to each other), the number of iterations for R-KuLSIF is the smallest, which agrees well with our theoretical analysis. On the other hand, for titanic, waveform, banana, ringnorm, and twonorm, the number of iterations for each method is almost the same. In these data sets, m is much smaller than n, and thus the second term \(\lambda K_{11}\) in the Hessian matrix (24) for the M-estimator will govern the convergence property, since the order of \(\lambda_{n,m}\) is larger than O(1/n). This tendency is explained by the result in Theorem 2: based on (20), we see that a large \(\lambda_{n,m}\) provides a smaller upper bound on κ(H).

Next, we investigate the number of iterations when n and m are comparable to each other. The data sets titanic, waveform, banana, ringnorm, and twonorm are used. We consider two setups. In the first series of experiments, the evaluation data set consists of all positive test samples, and the model data set is defined by all negative test samples. Therefore, the target density ratio may be far from the constant function \(w_0(x)=1\). Table 3 shows the average computation time and the average number of iterations over 20 runs. In this case, the number of iterations for optimization agrees with our theoretical result, that is, R-KuLSIF yields low computational costs in all experiments. In the second series of experiments, both model samples and evaluation samples are randomly chosen from all (i.e., both positive and negative) test samples. Thus, the target density ratio is almost equal to the constant function \(w_0(x)=1\). Table 4 shows the average computation time and the average number of iterations over 20 runs. The number of iterations for “KL” is much smaller than in the first setup shown in Table 3. This is because the condition number of the Hessian matrix (24) is likely to be small when the true density ratio \(w_0\) is close to the constant function. R-KuLSIF is, however, still the preferable approach. Furthermore, the computation time of R-KuLSIF is comparable to that of a direct method such as the Cholesky decomposition when the sample size (i.e., the number of variables) is large.

Table 3 Average computation time (s) and average number of iterations for the benchmark data sets with balanced sample sizes. Titanic, waveform, banana, ringnorm, and twonorm are used as data sets. The evaluation data set consists of all positive test samples, and the model data set is defined by all negative test samples, i.e., the density ratio will be far from the constant function. The BFGS quasi-Newton method in the optim function of the R environment is used to obtain the numerical solutions. Data sets are arranged in ascending order of the sample size n. Results of the method with the lowest mean are shown in bold face
Table 4 Average computation time (s) and average number of iterations for the benchmark data sets with balanced sample sizes. Titanic, waveform, banana, ringnorm, and twonorm are used as data sets. The evaluation data set and the model data set are randomly generated from all (i.e., both positive and negative) test samples, i.e., the density ratio is close to the constant function. The BFGS quasi-Newton method in the optim function of the R environment is used to obtain the numerical solutions. Data sets are arranged in ascending order of the sample size n. Results of the method with the lowest mean are shown in bold face

In summary, the numerical experiments showed that the convergence rate of the optimization is well explained by the condition number of the Hessian matrix. The relation between the loss function ψ and the condition number was discussed in Sect. 4, and our theoretical results imply that R-KuLSIF is a computationally efficient way to estimate density ratios. The numerical results in this section also indicate that our theoretical results are useful for obtaining practical and computationally efficient estimators.

7 Conclusions

We considered the problem of estimating the ratio of two probability densities and investigated theoretical properties of the kernel least-squares estimator called KuLSIF. More specifically, we theoretically studied the condition number of Hessian matrices, because the condition number is closely related to the convergence rate of optimization and the numerical stability. We found that KuLSIF has a smaller condition number than the other methods. Therefore, KuLSIF will have preferable computational properties. We further showed that R-KuLSIF, which is an alternative formulation of KuLSIF, possesses an even smaller condition number. Numerical experiments showed that practical numerical properties of optimization algorithms could be well explained by our theoretical analysis of condition numbers, even though the condition number only provides an upper bound of the rate of convergence. A theoretical issue to be further investigated is the derivation of a tighter probabilistic order of the condition number.

Density-ratio estimation was shown to provide new approaches to solving various machine learning problems (Sugiyama et al. 2009, 2012b), including covariate shift adaptation (Shimodaira 2000; Zadrozny 2004; Sugiyama and Müller 2005; Gretton et al. 2009; Sugiyama et al. 2007; Bickel et al. 2009; Quiñonero Candela et al. 2009; Sugiyama and Kawanabe 2012), multi-task learning (Bickel et al. 2008; Simm et al. 2011), inlier-based outlier detection (Hido et al. 2008, 2011; Smola et al. 2009), change detection in time-series (Kawahara and Sugiyama 2011), divergence estimation (Nguyen et al. 2010), two-sample testing (Sugiyama et al. 2011), mutual information estimation (Suzuki et al. 2008, 2009b), feature selection (Suzuki et al. 2009a), sufficient dimension reduction (Sugiyama et al. 2010a), independence testing (Sugiyama and Suzuki 2011), independent component analysis (Suzuki and Sugiyama 2011), causal inference (Yamada and Sugiyama 2010), object matching (Yamada and Sugiyama 2011), clustering (Kimura and Sugiyama 2011), conditional density estimation (Sugiyama et al. 2010b), and probabilistic classification (Sugiyama 2010). In future work, we will develop practical algorithms for a wide range of applications on the basis of theoretical guidance provided in this study.