Direct conditional probability density estimation with sparse feature selection
Abstract
Regression is a fundamental problem in statistical data analysis, which aims at estimating the conditional mean of output given input. However, regression is not informative enough if the conditional probability density is multimodal, asymmetric, or heteroscedastic. To overcome this limitation, various estimators of conditional densities themselves have been developed, and a kernel-based approach called least-squares conditional density estimation (LSCDE) was demonstrated to be promising. However, LSCDE still suffers from large estimation error if input contains many irrelevant features. In this paper, we therefore propose an extension of LSCDE called sparse additive CDE (SACDE), which allows automatic feature selection in CDE. SACDE applies kernel LSCDE to each input feature in an additive manner and penalizes the whole solution by a group-sparse regularizer. We also give a subgradient-based optimization method for SACDE training that scales well to high-dimensional large data sets. Through experiments with benchmark and humanoid robot transition datasets, we demonstrate the usefulness of SACDE in noisy CDE problems.
Keywords
Conditional density estimation, Feature selection, Sparse structured norm
1 Introduction
Estimating the statistical dependency between input \(\varvec{x}\) and output \(\varvec{y}\) plays a crucial role in various real-world applications. For example, in robot transition estimation, which is highly useful in model-based reinforcement learning (Sutton and Barto 1998), input \(\varvec{x}\) corresponds to the pair of the current state of a robot and an action the robot takes, and output \(\varvec{y}\) corresponds to the destination state after taking the action. Another application is disease diagnosis, in which input \(\varvec{x}\) corresponds to measurements of biomarkers and/or clinical images and output \(\varvec{y}\) corresponds to the presence (or the progression level) of a disease. Thus, accurately estimating the statistical dependency is an important and fundamental problem in statistical data analysis. The most basic approach to this problem is regression, which estimates the conditional mean of output \(\varvec{y}\) given input \(\varvec{x}\). Regression gives the optimal estimation of output \(\varvec{y}\) for additive Gaussian output noise. However, if the conditional probability density of output \(\varvec{y}\) given input \(\varvec{x}\), denoted by \(p(\varvec{y}|\varvec{x})\), possesses more complex structure such as multimodality, asymmetry, and heteroscedasticity, estimating the conditional mean by regression is not necessarily informative.
To overcome the limitation of regression, estimation of conditional densities from paired samples \(\{(\varvec{x}^{(n)},\varvec{y}^{(n)})\}_{n=1}^N\) has been investigated. The most naive approach to estimating \(p(\varvec{y}|\varvec{x}=\widetilde{\varvec{x}})\), the conditional density of output \(\varvec{y}\) at test input point \(\varvec{x}=\widetilde{\varvec{x}}\), is to use the kernel density estimator (KDE) (Silverman 1986) with samples such that \(\Vert \varvec{x}^{(n)}-\widetilde{\varvec{x}}\Vert _2^2\le \epsilon \). However, this naive method does not work well in high-dimensional problems. Slightly more sophisticated variants have been proposed that use weighted KDE (Fan et al. 1996; Wolff et al. 1999), but they still share the same weakness.
The mixture density network (MDN) (Bishop 2006) uses a mixture of parametric densities for modeling the conditional density, and the parameters are estimated by a neural network as functions of input \(\varvec{x}\). MDN was demonstrated to work well, but its training is time-consuming and only a local optimal solution may be found due to the non-convexity of neural network training. A similar method based on a mixture of Gaussian processes was developed (Tresp 2001), which can be trained in a computationally more efficient way by the expectation-maximization algorithm (Dempster et al. 1977). However, due to the non-convexity of the optimization problem, it is difficult to find the global optimal solution.
Kernel quantile regression (KQR) (Takeuchi et al. 2006; Li et al. 2007) gives nonparametric percentile estimates of conditional distributions through convex optimization. KQR can be used for estimating the entire conditional cumulative distribution by solving KQR for all percentiles. It was shown that the regularization path tracking technique (Hastie et al. 2004) can be employed for efficiently computing the entire conditional cumulative distribution (Takeuchi et al. 2009). However, KQR is applicable only to one-dimensional output, which limits the range of applications significantly.
Least-squares conditional density estimation (LSCDE) allows estimation of multiple-input multiple-output conditional densities by directly learning a conditional density model with least-squares estimation (Sugiyama et al. 2010). For linear-in-parameter models such as a linear combination of Gaussian kernels, LSCDE is formulated as a convex optimization problem and its solution can be obtained efficiently and analytically just by solving a system of linear equations. Furthermore, kernel LSCDE was proved to achieve the optimal nonparametric convergence rate to the true conditional density in the minimax sense, meaning that no method can achieve a faster rate than LSCDE asymptotically. Through extensive experiments, LSCDE was demonstrated to compare favorably with competing approaches.
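To make the "system of linear equations" concrete, the following is a minimal NumPy sketch of kernel LSCDE with an \(\ell_2\) penalty. It reflects our reading of the formulation in Sugiyama et al. (2010) and is not taken from this text: the function names, the shared bandwidth `sigma`, the clipping of negative weights, and the closed-form Gaussian integrals are all assumptions made for illustration.

```python
import numpy as np

def gauss_kernel(a, c, sigma):
    """Gaussian kernel matrix between rows of a (n, d) and centers c (B, d)."""
    d2 = ((a[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lscde_fit(x, y, centers_x, centers_y, sigma, lam):
    """Solve (H + lam * I) alpha = h and clip negative weights (assumed post-processing)."""
    n, dy = y.shape
    kx = gauss_kernel(x, centers_x, sigma)   # phi_b(x_i), shape (n, B)
    ky = gauss_kernel(y, centers_y, sigma)   # eta_b(y_i), shape (n, B)
    h = (kx * ky).mean(axis=0)
    # Closed form of the integral over y of eta_b(y) * eta_b'(y) for Gaussian kernels.
    cy2 = ((centers_y[:, None, :] - centers_y[None, :, :]) ** 2).sum(axis=2)
    eta_int = (np.sqrt(np.pi) * sigma) ** dy * np.exp(-cy2 / (4.0 * sigma ** 2))
    H = eta_int * (kx.T @ kx) / n
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
    return np.maximum(alpha, 0.0)

def lscde_predict(alpha, x_query, y_query, centers_x, centers_y, sigma):
    """Evaluate the normalized conditional density at paired rows of x_query and y_query."""
    dy = centers_y.shape[1]
    kx = gauss_kernel(x_query, centers_x, sigma)
    ky = gauss_kernel(y_query, centers_y, sigma)
    num = (kx * ky) @ alpha
    # Integral of each eta_b over y is (sqrt(2 * pi) * sigma) ** dy.
    den = (np.sqrt(2.0 * np.pi) * sigma) ** dy * (kx @ alpha)
    return num / np.maximum(den, 1e-300)
```

The only training cost beyond building the kernel matrices is one `B x B` linear solve, which is why the basis size `B` is the quantity that matters for scalability in this sketch.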
However, LSCDE still suffers from large estimation error when many irrelevant features exist in input \(\varvec{x}\). Such irrelevant features are conceivable in many real-world problems. For example, in gene expression analysis for diseased cells, only a small subset of biomarker genes (input) affects the disease status (output). A standard way to cope with high input dimensionality is to select relevant features with forward selection or backward elimination (Guyon and Elisseeff 2003), but this often leads to a local optimal set of features.
In this paper, we propose extending LSCDE to allow simultaneous feature selection during conditional density estimation. More specifically, we apply kernel LSCDE to each input feature in an additive manner and penalize the whole solution by a group-sparse regularizer (Yuan and Lin 2006). Our subgradient-based optimization solver allows computationally efficient selection of relevant features that are even nonlinearly correlated with output \(\varvec{y}\). Numerical experiments on noisy conditional density estimation demonstrate that our proposed method, which we call sparse additive CDE (SACDE), compares favorably with baseline approaches in estimation accuracy and computational efficiency.
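As an illustration of why a group-sparse regularizer removes whole features, the sketch below implements the standard proximal operator of the penalty \(\tau \sum_g \Vert \varvec{\alpha}_g\Vert_2\) (block soft-thresholding). It is a generic building block, not the solver used in this paper; the function name and the assumption that each group holds the coefficients of one input feature are ours.

```python
import numpy as np

def group_soft_threshold(alpha, groups, tau):
    """Proximal operator of tau * sum_g ||alpha_g||_2 (block soft-thresholding).

    alpha  : 1-D float array of coefficients
    groups : list of index lists, one per group (here: one group per feature, assumed)
    tau    : non-negative threshold
    """
    out = np.zeros_like(alpha, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(alpha[idx])
        if norm > tau:
            # Shrink the whole group toward zero by a common factor.
            out[idx] = (1.0 - tau / norm) * alpha[idx]
        # Otherwise the entire group stays exactly zero: the feature is dropped.
    return out
```

A group whose norm falls below the threshold is set to zero as a block, which is what makes the selected feature set readable directly from the nonzero groups.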
The remainder of this paper is structured as follows. In Sect. 2, we formulate the problem of conditional density estimation and describe our proposed SACDE method. We experimentally evaluate the performance of SACDE in Sect. 3, and we summarize our contribution in Sect. 4.
2 Conditional density estimation with sparse feature selection
In this section, we formulate the problem of conditional density estimation and describe our proposed SACDE method.
2.1 Problem formulation
Our goal is to estimate the conditional density \(p(\varvec{y}|\varvec{x})\) from the training samples (1) via sufficient feature selection (2).
2.2 Sparse additive conditional density estimation
2.3 Optimization algorithm
2.4 Post-processing
2.5 Cross-validation for model selection
Performance of SACDE depends on the choice of model parameters such as the Gaussian width \(\sigma \) and the regularization parameter \(\lambda \). Cross-validation (CV) can be used to choose these model parameters systematically. Throughout this paper, we use five-fold CV: we first divide the samples into five subsets, then learn the parameters using four subsets, and evaluate the test error using the held-out subset. This procedure is repeated five times, each time holding out a different subset, and the errors are averaged.
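The five-fold procedure can be sketched generically as follows. This is a hedged illustration rather than the authors' code: `cross_validate`, `fit_kde`, and `nll_kde` are hypothetical names, the error is taken to be the average negative log-likelihood on the held-out fold, and a one-dimensional Gaussian KDE stands in for the conditional density estimator to keep the example self-contained.

```python
import numpy as np

def kfold_splits(n, k, rng):
    """Yield (train_idx, test_idx) pairs for k roughly equal random folds."""
    folds = np.array_split(rng.permutation(n), k)
    for j in range(k):
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        yield train, folds[j]

def cross_validate(fit, score, data, candidates, k=5, seed=0):
    """Return the candidate with the lowest mean held-out score, plus all mean scores."""
    rng = np.random.default_rng(seed)
    splits = list(kfold_splits(len(data), k, rng))  # same folds for every candidate
    mean_scores = []
    for c in candidates:
        errs = [score(fit(data[tr], c), data[te]) for tr, te in splits]
        mean_scores.append(float(np.mean(errs)))
    best = int(np.argmin(mean_scores))
    return candidates[best], np.array(mean_scores)

# Stand-in estimator (assumption): 1-D Gaussian KDE with bandwidth sigma.
def fit_kde(train, sigma):
    return train, sigma

def nll_kde(model, test):
    """Average negative log-likelihood of test points under the KDE."""
    train, sigma = model
    diff = test[:, None] - train[None, :]
    dens = np.exp(-diff ** 2 / (2 * sigma ** 2)).mean(axis=1) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(np.maximum(dens, 1e-300)).mean()
```

Reusing the same folds for every candidate value keeps the comparison between candidates paired, so differences in the averaged error are not driven by different random splits.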
3 Numerical experiments
In this section, we experimentally evaluate the performance of our proposed method, SACDE. Throughout the experiments, the number of basis functions is fixed to \(B = \min (100, N)\). The model parameters \(\sigma \) and \(\lambda \) were chosen by five-fold CV from twenty values between \(10^{-2}\) and \(2\), equally spaced on a logarithmic scale. We use NLL (22) as the performance measure of conditional density estimation. NLL is computed from test samples, which are not used for learning parameters and hyperparameters. All experiments were implemented in Matlab 2013b and run on an HP DL360p Gen8 E5 v2 server with two Xeon E5-2650 v2 2.60 GHz (8-core) CPUs and 96 GB of main memory.
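The experimental setup above can be sketched in a few lines. The helper names are hypothetical, the lower end of the grid is assumed to be \(10^{-2}\), and drawing the \(B\) kernel centers as a random subset of the training samples is our assumption (the text only fixes the number of basis functions).

```python
import numpy as np

def log_grid(low, high, num=20):
    """Candidate values equally spaced on a logarithmic scale between low and high."""
    return np.logspace(np.log10(low), np.log10(high), num)

def choose_centers(x, y, max_basis=100, seed=0):
    """Pick B = min(max_basis, N) paired kernel centers from the samples (assumed: at random)."""
    n = x.shape[0]
    b = min(max_basis, n)
    idx = np.random.default_rng(seed).choice(n, size=b, replace=False)
    return x[idx], y[idx]

# Illustrative candidate grid for sigma and lambda.
candidates = log_grid(1e-2, 2.0, num=20)
```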
3.1 Compared methods

Sparse additive feature selection LSCDE (SALSCDE): SALSCDE is a variation of the proposed SACDE, which first runs SACDE for feature selection and then estimates the conditional density by LSCDE with only the selected features.
 \(\epsilon \)-neighbor kernel density estimation (eKDE): eKDE estimates a conditional density by standard kernel density estimation using neighborhood samples in the domain of input \(\varvec{x}\), denoted by \(\mathcal {I}_{\varvec{x},\epsilon } := \{ \varvec{x}^{(i)}: \Vert \varvec{x}^{(i)} - \varvec{x} \Vert _2^2 \le \epsilon \}\) for threshold \(\epsilon \). In the case of Gaussian kernels, eKDE is given as
$$\begin{aligned} \hat{p}(\varvec{y}|\varvec{x}) = \frac{1}{|\mathcal {I}_{\varvec{x},\epsilon }|} \sum _{i \in \mathcal {I}_{\varvec{x},\epsilon }} \mathcal {N}(\varvec{y},\varvec{y}^{(i)},\sigma ^2 \varvec{I}_{D_y}), \end{aligned}$$(23)
where \(\mathcal {N}(\varvec{y},\varvec{\mu }, \varvec{\Sigma })\) denotes the Gaussian density function with respect to \(\varvec{y}\) with mean \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\), and \(\varvec{I}_{D_y}\) is the identity matrix of size \(D_y\). In experiments, threshold \(\epsilon \) and bandwidth \(\sigma \) were chosen based on five-fold CV with respect to NLL, where the candidate values of \(\epsilon \) are twenty values between \(10^{-2}\) and \(5\), equally spaced on a logarithmic scale.
 Least-squares conditional density estimation (LSCDE): The original LSCDE method. This corresponds to a multi-dimensional non-sparse version of SACDE where, instead of the group-sparse penalty and an additive model, an \(\ell _2\)-penalty \(\lambda \Vert \varvec{\alpha } \Vert _2^2 \) and a multi-dimensional linear-in-parameter model,
$$\begin{aligned} \hat{p}(\varvec{y}|\varvec{x} )&:= \hat{r}(\varvec{y}, \varvec{x})\nonumber \\&= \sum _{b=1}^B \alpha _{b} \Big \{ \eta _b\big ( \varvec{y}\big ) \cdot \varphi _{b}\big ( \varvec{x} \big ) \Big \}, \end{aligned}$$(24)
is used. We use the Gaussian kernels for both \(\eta _b( \cdot )\) and \(\varphi _b( \cdot )\), where the bandwidth \(\sigma \) and the regularization parameter \(\lambda \) are chosen based on five-fold CV with respect to NLL.
 Nadaraya-Watson CDE (NWCDE): This corresponds to a simple version of LSCDE, which fixes the weights of basis functions to \(\frac{1}{B}\):
$$\begin{aligned} \hat{p}(\varvec{y}|\varvec{x} )&:= \frac{\hat{p}(\varvec{y}, \varvec{x} )}{\hat{p}(\varvec{x} )}\nonumber \\&= \frac{ \sum _{b=1}^B \eta _b\big ( \varvec{y}\big ) \cdot \varphi _{b}\big ( \varvec{x} \big ) }{ \sum _{b=1}^B \varphi _{b}\big ( \varvec{x} \big ) }. \end{aligned}$$(25)
We use the Gaussian kernels for both \(\eta _b( \cdot )\) and \(\varphi _b( \cdot )\), where the bandwidth \(\sigma \) is chosen based on leave-one-out CV for the exact likelihood formulated in Holmes et al. (2007). To directly employ the method in Holmes et al. (2007), we only use \(B\) samples in this CV procedure.

Forward feature selection + eKDE (FWeKDE): Forward feature selection is performed based on five-fold CV with respect to NLL. That is, the most useful feature that maximally reduces the cross-validated NLL by eKDE is selected one by one until the cross-validated NLL no longer decreases.

Forward feature selection + LSCDE (FWLSCDE): Similarly, forward feature selection is performed for LSCDE.

Forward feature selection + NWCDE (FWNWCDE): Similarly, forward feature selection is performed for NWCDE.
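For readers who want to see the two kernel baselines in executable form, here is a minimal sketch of the eKDE estimate (23) and the Nadaraya-Watson estimate (25) evaluated at a single point. The function names are ours, the neighborhood is treated as a set of sample indices, and \(\eta_b\) is taken to be a normalized Gaussian density so that the NWCDE estimate integrates to one over \(\varvec{y}\); that normalization is an assumption, since the text only says that Gaussian kernels are used.

```python
import numpy as np

def gaussian_pdf(y, centers, sigma):
    """Densities N(y; c, sigma^2 I) for every row c of centers (m, d); y has shape (d,)."""
    d = centers.shape[1]
    sq = ((centers - y) ** 2).sum(axis=1)
    return np.exp(-sq / (2 * sigma ** 2)) / ((2 * np.pi * sigma ** 2) ** (d / 2))

def ekde(y, x, X, Y, eps, sigma):
    """Epsilon-neighbor KDE: average Gaussian density over samples with ||x_i - x||^2 <= eps."""
    mask = ((X - x) ** 2).sum(axis=1) <= eps
    if not mask.any():
        return 0.0  # no neighbors: the estimate is undefined, reported here as zero
    return gaussian_pdf(y, Y[mask], sigma).mean()

def nwcde(y, x, X, Y, sigma):
    """Nadaraya-Watson CDE: kernel-weighted average of Gaussian densities in y."""
    w = np.exp(-((X - x) ** 2).sum(axis=1) / (2 * sigma ** 2))
    # The weights can underflow to zero far from all samples; callers should guard for that.
    return (w * gaussian_pdf(y, Y, sigma)).sum() / w.sum()
```

Both estimators need no training beyond storing the samples, which is consistent with their low cost relative to the methods that solve for basis weights.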
Table 1 Computational complexities of our methods and existing CDEs
| Method | SACDE | SALSCDE | LSCDE | eKDE | NWCDE |
| --- | --- | --- | --- | --- | --- |
| Time | \(O(N^3D_x^3)\) | \(O(N^3D_x^3)\) | \(O(N^3)\) | \(O(N^2D_x)\) | \(O(N^2D_x)\) |
| Space | \(O(N^2D_x^2)\) | \(O(N^2D_x^2)\) | \(O(N^2)\) | \(O(N^2)\) | \(O(N^2)\) |
Table 2 Computational complexities of our methods and existing CDEs with forward feature selection
| Method | SACDE | SALSCDE | FWLSCDE | FWeKDE | FWNWCDE |
| --- | --- | --- | --- | --- | --- |
| Time | \(O(N^3D_x^3)\) | \(O(N^3D_x^3)\) | \(O(D_x!N^3)\) | \(O(D_x!N^2)\) | \(O(D_x!N^2)\) |
| Space | \(O(N^2D_x^2)\) | \(O(N^2D_x^2)\) | \(O(N^2)\) | \(O(N^2)\) | \(O(N^2)\) |
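The forward feature selection used by the FW baselines can be sketched as a greedy loop around any cross-validated score. This is an illustration under our own naming: `cv_score` is a hypothetical callable that returns the cross-validated NLL for a given list of feature indices, and the stopping rule follows the description above (stop when the score no longer decreases).

```python
def forward_select(n_features, cv_score):
    """Greedy forward selection: add the feature that most reduces cv_score, stop when none does."""
    selected = []
    best = float("inf")
    remaining = set(range(n_features))
    while remaining:
        # Score every candidate extension of the current feature set.
        scores = {j: cv_score(selected + [j]) for j in sorted(remaining)}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no candidate improves the cross-validated score
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best
```

Because each step commits to the single best feature given the previous choices, the result depends on the order of additions and may be only a locally optimal feature set.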
3.2 Illustrative examples
 Toy data 1: \(x_1\) is independently generated following the uniform distribution on \([-1,1]\), while each of \(x_2,\dots , x_6\) is generated by \(x_1+ \epsilon _c\), where \(\epsilon _c\) is a noise variable following the normal distribution with mean 0 and standard deviation \(3\hat{\sigma }\), and \(\hat{\sigma }\) is the standard deviation of \(x_1\). Output \(y\) is generated as a function of \(x_1\) as
$$\begin{aligned} y | x_{1}&\sim \hbox {sinc}\left( \frac{3}{4}\pi x_1 \right) + \frac{1}{8} \exp \big ( 1 - x_1 \big ) \cdot \varepsilon , \end{aligned}$$(26)
where \(\varepsilon \) is standard normal noise. We generate \(N=300\) samples for estimating the conditional density.

Old Faithful Geyser: A benchmark dataset with \(D_x=D_y=1\) that consists of durations of \(N=299\) eruptions of the Old Faithful Geyser (Weisberg 1985). We add five irrelevant features \(x_2,\dots ,x_6\) in a similar manner to Toy data 1.

Bone Mineral Density: A benchmark dataset with \(D_x=D_y=1\) that consists of relative spinal bone mineral density measurements on \(N=485\) North American adolescents (Hastie et al. 2001). We add five irrelevant features \(x_2,\dots ,x_6\) in a similar manner to Toy data 1.
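The construction of Toy data 1, including the five noisy copies of \(x_1\), can be sketched as below. Several details are assumptions on our part: the input interval is taken to be \([-1,1]\), the noise scale in (26) is read as \(\exp(1 - x_1)/8\), \(\hbox{sinc}(t)\) is taken to mean \(\sin(t)/t\), and \(\hat{\sigma}\) is computed as the sample standard deviation of \(x_1\). The same noisy-feature step would apply to the two benchmark datasets.

```python
import numpy as np

def make_toy1(n=300, n_noisy=5, seed=0):
    """Generate Toy data 1: one relevant input, n_noisy noisy copies, heteroscedastic output."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-1.0, 1.0, size=n)
    sigma_hat = x1.std()
    # Noisy features: x1 plus Gaussian noise with standard deviation 3 * sigma_hat.
    noisy = x1[:, None] + 3.0 * sigma_hat * rng.standard_normal((n, n_noisy))
    x = np.column_stack([x1, noisy])
    # np.sinc(u) = sin(pi * u) / (pi * u), so np.sinc(0.75 * x1) = sin(t) / t with t = 0.75 * pi * x1.
    mean = np.sinc(0.75 * x1)
    y = mean + 0.125 * np.exp(1.0 - x1) * rng.standard_normal(n)
    return x, y
```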
The regularization paths in Fig. 4 show that, in (a) Toy data 1 and (b) Old Faithful Geyser, the parameters corresponding to the irrelevant features are zero and those corresponding to the relevant feature are nonzero at the cross-validated solution, which means that SACDE selects exactly the relevant feature. In Fig. 4c (Bone Mineral Density), some of the irrelevant features are nonzero because features with a skewed distribution remain strongly correlated with the relevant feature despite the additive Gaussian noise. Thus, these features may still contain some information on the output value.
The estimation results in Fig. 3a show that SACDE gives more accurate estimates than the plain LSCDE. In Fig. 3b, c, SACDE tends to give sharper conditional density estimates than the plain LSCDE. This is because relatively large Gaussian kernel widths are chosen in LSCDE to accommodate the irrelevant noisy features. This indicates that LSCDE with many irrelevant features tends to produce overly flat, uninformative conditional densities, while SACDE can avoid this problem by automatically eliminating irrelevant features.
3.3 Comparison of performance and computation time for different numbers of samples

Toy data 1: The generation procedure is the same as in the previous section, in which the relevant feature (input) and the output are both one-dimensional.
 Toy data 2: Each irrelevant feature \(x_1, x_2, x_4, x_5, \dots , x_{D_x-2}, x_{D_x-1}\) is independently generated following the uniform distribution on \([-1,1]\). Relevant features are generated by \(x_{3d} = x_{3d-2} + x_{3d-1}\), and the \(d\)-th dimension of output \(\varvec{y}\) is generated as a function of \(x_{3d},~d=1,2,\dots , D_x/3\) as
$$\begin{aligned} y_d | x_{3d}&\sim \hbox {sinc}\left( \frac{3}{4}\pi x_{3d} \right) + \frac{1}{8} \exp \big ( 1 - x_{3d} \big ) \cdot \varepsilon , \end{aligned}$$(27)
where \(\varepsilon \) is standard normal noise. This dataset has multi-dimensional relevant features and outputs.
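A sketch of the Toy data 2 generator follows, under the same reading of (27) as for Toy data 1: inputs uniform on \([-1,1]\), every third feature defined as the sum of the two preceding ones, noise scale \(\exp(1 - x_{3d})/8\), and \(\hbox{sinc}(t) = \sin(t)/t\). These readings, and the function name, are assumptions.

```python
import numpy as np

def make_toy2(n=300, d_x=6, seed=0):
    """Generate Toy data 2 with d_x inputs (a multiple of 3) and d_x / 3 outputs."""
    if d_x % 3 != 0:
        raise ValueError("d_x must be a multiple of 3")
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n, d_x))
    rel = np.arange(2, d_x, 3)  # 0-based positions of x_3, x_6, ...
    # Relevant features: x_{3d} = x_{3d-2} + x_{3d-1}.
    x[:, rel] = x[:, rel - 2] + x[:, rel - 1]
    xr = x[:, rel]
    # np.sinc(0.75 * xr) equals sin(t) / t with t = 0.75 * pi * xr.
    y = np.sinc(0.75 * xr) + 0.125 * np.exp(1.0 - xr) * rng.standard_normal(xr.shape)
    return x, y
```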
3.4 Hyperparameter selection
3.5 Performance comparison for different numbers of irrelevant features
In all three cases, the NLL values of LSCDE, NWCDE, and eKDE (without feature selection) grow as the number of irrelevant features increases. On the other hand, the NLL values of SACDE, SALSCDE, FWLSCDE, FWNWCDE, and FWeKDE do not grow that much when the number of irrelevant features increases. This clearly demonstrates an advantage of performing feature selection.
3.6 Benchmark datasets
Table 3 NLL for benchmark datasets with five-dimensional irrelevant features
| Name | \(\mathcal {F}\) | \(N\) | SACDE | SALSCDE | LSCDE | eKDE | NWCDE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| caution | 2 | 100 | \(1.34 \pm 0.6\) | \(\varvec{1.24} \pm \varvec{0.4}\) | \(1.38 \pm 0.3\) | \(24.25 \pm 3.4\) | \(1.36 \pm 0.2\) |
| CobarOre | 2 | 38 | \(1.71 \pm 0.5\) | \(1.70 \pm 0.4\) | \(\varvec{1.62} \pm \varvec{0.2}\) | \(31.81 \pm 2.9\) | \(\varvec{1.62} \pm \varvec{0.2}\) |
| snowgeese | 2 | 45 | \(\varvec{1.80} \pm \varvec{2.0}\) | \(\varvec{1.76} \pm \varvec{1.8}\) | \(1.85 \pm 1.3\) | \(22.04 \pm 6.1\) | \(\varvec{1.59} \pm \varvec{1.0}\) |
| topo | 2 | 52 | \(1.17 \pm 0.3\) | \(1.14 \pm 0.3\) | \(1.22 \pm 0.2\) | \(29.30 \pm 3.0\) | \(1.21 \pm 0.1\) |
| sniffer | 4 | 125 | \(0.70 \pm 0.6\) | \(\varvec{0.60} \pm \varvec{0.7}\) | \(0.85 \pm 0.2\) | \(16.91 \pm 3.2\) | \(0.83 \pm 0.2\) |
| crabs | 6 | 200 | \(0.44 \pm 0.1\) | \(\varvec{0.47} \pm \varvec{0.3}\) | \(0.53 \pm 0.1\) | \(26.03 \pm 3.1\) | \(0.58 \pm 0.1\) |
| UN3 | 6 | 125 | \(\varvec{1.27} \pm \varvec{0.2}\) | \(\varvec{1.35} \pm \varvec{0.4}\) | \(1.57 \pm 0.6\) | \(33.36 \pm 1.6\) | \(1.54 \pm 0.6\) |
| birthwt | 7 | 189 | \(\varvec{1.49} \pm \varvec{0.2}\) | \(\varvec{1.52} \pm \varvec{0.1}\) | \(\varvec{1.51} \pm \varvec{0.1}\) | \(31.77 \pm 1.6\) | \(1.67 \pm 0.2\) |
| cpus | 7 | 209 | \(\varvec{0.36} \pm \varvec{0.6}\) | \(0.80 \pm 0.7\) | \(1.19 \pm 0.5\) | \(22.29 \pm 3.4\) | \(1.17 \pm 0.6\) |
| gilgais | 8 | 365 | \(\varvec{0.70} \pm \varvec{0.2}\) | \(0.89 \pm 0.2\) | \(1.16 \pm 0.2\) | \(27.77 \pm 2.2\) | \(1.11 \pm 0.2\) |
| BigMac | 9 | 69 | \(1.33 \pm 0.8\) | \(1.37 \pm 0.7\) | \(1.42 \pm 0.7\) | \(35.79 \pm 0.5\) | \(1.34 \pm 0.5\) |
| highway | 11 | 39 | \(\varvec{1.38} \pm \varvec{0.7}\) | \(1.60 \pm 0.7\) | \(1.71 \pm 0.8\) | \(36.04 \pm 0.0\) | \(1.74 \pm 0.7\) |
| Time | | | \(1.00\) | \(0.00\) | \(0.06\) | \(0.02\) | \(0.00\) |

| Name | FWLSCDE | FWeKDE | FWNWCDE |
| --- | --- | --- | --- |
| caution | \(1.33 \pm 0.6\) | \(1.35 \pm 0.6\) | \(1.30 \pm 0.5\) |
| CobarOre | \(1.95 \pm 0.6\) | \(2.45 \pm 1.9\) | \(\varvec{1.65} \pm \varvec{0.4}\) |
| snowgeese | \(2.09 \pm 1.9\) | \(3.03 \pm 2.4\) | \(\varvec{1.82} \pm \varvec{1.8}\) |
| topo | \(1.19 \pm 0.4\) | \(1.73 \pm 1.2\) | \(\varvec{1.07} \pm \varvec{0.2}\) |
| sniffer | \(0.74 \pm 0.8\) | \(0.96 \pm 0.8\) | \(0.96 \pm 1.1\) |
| crabs | \(0.37 \pm 0.3\) | \(0.08 \pm 0.6\) | \(0.12 \pm 0.8\) |
| UN3 | \(\varvec{1.27} \pm \varvec{0.3}\) | \(1.60 \pm 0.6\) | \(\varvec{1.34} \pm \varvec{0.3}\) |
| birthwt | \(1.67 \pm 0.2\) | \(1.75 \pm 0.5\) | \(3.85 \pm 2.1\) |
| cpus | \(0.70 \pm 0.8\) | \(1.00 \pm 0.9\) | \(0.76 \pm 0.9\) |
| gilgais | \(0.76 \pm 0.2\) | \(0.97 \pm 0.3\) | \(1.20 \pm 0.3\) |
| BigMac | \(1.45 \pm 0.9\) | \(2.54 \pm 1.7\) | \(\varvec{1.23} \pm \varvec{0.8}\) |
| highway | \(2.06 \pm 1.0\) | \(3.17 \pm 1.9\) | \(2.18 \pm 1.8\) |
| Time | \(2.73\) | \(0.54\) | \(0.01\) |
Table 3 shows that our methods (SACDE and SALSCDE) perform best on nine datasets. For high-dimensional datasets, especially when \(\mathcal {F}\) is seven or more, SACDE tends to outperform the other methods with statistical significance. For low-dimensional datasets with large \(N\), SALSCDE outperforms SACDE because of the difference in the expressive power of their models. For low-dimensional datasets with small \(N\), FWNWCDE performs best because the other methods, which optimize the weights of basis functions, tend to overfit. LSCDE, eKDE, NWCDE, and FWeKDE are computationally much more efficient than SACDE and SALSCDE, but these methods tend to perform poorly for high-dimensional relevant features with noisy dimensions.
3.7 Humanoid robot transition dataset
Finally, we evaluate the performance of the proposed method on humanoid robot transition estimation with multiple inputs and multiple outputs. The dataset was generated from a simulator of the upper-body part of the humanoid robot CB-i (Cheng et al. 2007). The robot has 9 controllable joints: shoulder pitch, shoulder roll and elbow pitch of the right arm, shoulder pitch, shoulder roll and elbow pitch of the left arm, waist yaw, torso roll, and torso pitch joints.
To generate transition samples, we first generated the initial posture of the robot \(\varvec{s}^{(1)}\) at random and then simulated a trajectory with 100 steps, i.e. \(\varvec{s}^{(2)}, \dots , \varvec{s}^{(100)}\). For each step, we additionally generated \(m\) irrelevant input features \(\varvec{z}^{(n)} \in \mathbb {R}^m\) by copying a relevant variable or by linearly combining two relevant variables contaminated with Gaussian noise in a similar manner to the previous experiments. By iterating these procedures, we obtained the transition samples \(\{ (\varvec{s}^{(n)}, \varvec{a}^{(n)}, \varvec{z}^{(n)}, \varvec{s}'^{(n)}) \}_{n=1}^{10000}\).
Our goal is to learn the system dynamics as state transition probability \(p(\varvec{s}' | \varvec{s}, \varvec{a}, \varvec{z})\) from these samples. Thus, as the conditional density estimation problem, the state-action pair \((\varvec{s}^{\mathrm {T}}, \varvec{a}^{\mathrm {T}}, \varvec{z}^{\mathrm {T}})^{\mathrm {T}}\) is regarded as input variable \(\varvec{x}\), while the next state \(\varvec{s}'\) is regarded as output variable \(\varvec{y}\). Note that an accurate estimate of the state transition probability is highly useful in model-based reinforcement learning (Sutton and Barto 1998).
From the transition samples, we randomly selected 5000 samples as training data and used the other 5000 samples as test data to calculate NLL. We compare our proposed method SACDE with LSCDE, NWCDE, FWLSCDE, and FWNWCDE, as well as parametric conditional density estimation by Gaussian process regression (GPCDE) (Rasmussen and Williams 2005). In this experiment, the candidate values of regularization parameter \(\lambda \) are twenty values between \(10^{3}\) and \(10^{1}\), equally spaced on a logarithmic scale, while the candidate values of the other parameters are the same as in the previous setting. We consider three datasets with \(J=2, 4, 9\) joints, and vary the number of irrelevant features as \(m=0, 5, 10, 15, 20\). Thus, the input dimensionality is \(3J+m\), while the output dimensionality is \(2J\). For each \(J\) and \(m\), we evaluated the performance of conditional density estimation methods by averaged NLL and averaged computational time over 20 runs.
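As a small worked illustration of this setup, the sketch below performs the random 5000/5000 split and computes the input and output dimensionalities for a given number of joints and irrelevant features. The function names and the fixed seed are hypothetical.

```python
import numpy as np

def split_train_test(n_total=10000, n_train=5000, seed=0):
    """Randomly partition sample indices into training and test sets."""
    perm = np.random.default_rng(seed).permutation(n_total)
    return perm[:n_train], perm[n_train:]

def problem_dims(n_joints, n_irrelevant):
    """Input dimensionality 3J + m and output dimensionality 2J."""
    return 3 * n_joints + n_irrelevant, 2 * n_joints

# Largest setting considered: J = 9 joints with m = 20 irrelevant features.
d_in, d_out = problem_dims(9, 20)  # 47 inputs, 18 outputs
```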
Overall, in this challenging task of robot transition estimation, SALSCDE, the combination of SACDE and LSCDE, was shown to be the most promising approach.
4 Conclusions
We proposed a direct estimator of conditional probability densities that is equipped with feature selection. Our feature selection strategy is based on the \(\ell _1 / \ell _2\) mixed norm, which tends to produce a group-sparse solution. An optimization algorithm based on a proximal method was presented that is guaranteed to possess fast convergence. The numerical experiments on benchmark and robot transition datasets demonstrated that the proposed method is promising.
SACDE assumes an additive structure for feature selection. However, this causes a linear increase in the time and space complexities, resulting in high computation costs for datasets with a large number of features. Improving scalability is left for future work.
Acknowledgments
Motoki Shiga was supported by JSPS KAKENHI 25870322. Masashi Sugiyama was supported by JSPS KAKENHI 23120004 and AOARD. The authors thank Dr. Ichiro Takeuchi, Nagoya Institute of Technology, for kindly providing his source code.
References
 Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.
 Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY: Springer.
 Cheng, C., Hyon, S. H., Morimoto, J., Ude, A., Hale, J. G., Colvin, G., et al. (2007). CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21(10), 1097–1114.
 Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100(470), 410–428.
 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
 Fan, J., Yao, Q., & Tong, H. (1996). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika, 83(1), 189–206.
 Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar.), 1157–1182.
 Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.
 Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York, NY: Springer.
 Holmes, M. P., Gray, A. G., & Isbell, C. L. (2007). Fast nonparametric conditional density estimation. In Proceedings of the twenty-third annual conference on uncertainty in artificial intelligence (pp. 175–182).
 Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–342.
 Li, Y., Liu, Y., & Zhu, J. (2007). Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102(477), 255–268.
 Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian processes for machine learning (adaptive computation and machine learning). Cambridge: MIT Press.
 Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
 Sra, S., Nowozin, S., & Wright, S. (2012). Optimization for machine learning. Neural information processing series. Cambridge: MIT Press.
 Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., & Okanohara, D. (2010). Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D(3), 583–594.
 Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning (1st ed.). Cambridge, MA: MIT Press.
 Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. J. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7, 1231–1264.
 Takeuchi, I., Nomura, K., & Kanamori, T. (2009). Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation, 21(2), 533–559.
 Tresp, V. (2001). Mixtures of Gaussian processes. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, vol. 13 (pp. 654–660). Cambridge, MA: MIT Press.
 Weisberg, S. (1985). Applied linear regression. New York, NY: Wiley.
 Wolff, R. C. L., Yao, Q., & Hall, P. (1999). Methods for estimating a conditional distribution function. Journal of the American Statistical Association, 94(445), 154–163.
 Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1), 49–67.