# Direct conditional probability density estimation with sparse feature selection


## Abstract

Regression is a fundamental problem in statistical data analysis, which aims at estimating the conditional mean of output given input. However, regression is not informative enough if the conditional probability density is multi-modal, asymmetric, and heteroscedastic. To overcome this limitation, various estimators of conditional densities themselves have been developed, and a kernel-based approach called *least-squares conditional density estimation* (LS-CDE) was demonstrated to be promising. However, LS-CDE still suffers from large estimation error if input contains many irrelevant features. In this paper, we therefore propose an extension of LS-CDE called *sparse additive CDE* (SA-CDE), which allows automatic feature selection in CDE. SA-CDE applies kernel LS-CDE to each input feature in an additive manner and penalizes the whole solution by a group-sparse regularizer. We also give a subgradient-based optimization method for SA-CDE training that scales well to high-dimensional large data sets. Through experiments with benchmark and humanoid robot transition datasets, we demonstrate the usefulness of SA-CDE in noisy CDE problems.

### Keywords

Conditional density estimation, Feature selection, Sparse structured norm

## 1 Introduction

Estimating the statistical dependency between input \(\varvec{x}\) and output \(\varvec{y}\) plays a crucial role in various real-world applications. For example, in robot transition estimation which is highly useful in *model-based reinforcement learning* (Sutton and Barto 1998), input \(\varvec{x}\) corresponds to the pair of the current state of a robot and an action the robot takes, and output \(\varvec{y}\) corresponds to the destination state after taking the action. Another application is disease diagnosis, in which input \(\varvec{x}\) corresponds to measurements of biomarkers and/or clinical images and output \(\varvec{y}\) corresponds to the presence (or the progression level) of a disease. Thus, accurately estimating the statistical dependency is an important and fundamental problem in statistical data analysis. The most basic approach to this problem is regression, which estimates the conditional *mean* of output \(\varvec{y}\) given input \(\varvec{x}\). Regression gives the optimal estimation of output \(\varvec{y}\) for additive Gaussian output noise. However, if the conditional probability density of output \(\varvec{y}\) given input \(\varvec{x}\), denoted by \(p(\varvec{y}|\varvec{x})\), possesses more complex structure such as multi-modality, asymmetry, and heteroscedasticity, estimating the conditional mean by regression is not necessarily informative.

To overcome the limitation of regression, estimation of conditional densities from paired samples \(\{(\varvec{x}^{(n)},\varvec{y}^{(n)})\}_{n=1}^N\) has been investigated. The most naive approach to estimating \(p(\varvec{y}|\varvec{x}=\widetilde{\varvec{x}})\), the conditional density of output \(\varvec{y}\) at test input point \(\varvec{x}=\widetilde{\varvec{x}}\), is to use the *kernel density estimator* (KDE) (Silverman 1986) with samples such that \(\Vert \varvec{x}^{(n)}-\widetilde{\varvec{x}}\Vert _2^2\le \epsilon \). However, this naive method does not work well in high-dimensional problems. Slightly more sophisticated variants have been proposed that use weighted KDE (Fan et al. 1996; Wolff et al. 1999), but they still share the same weakness.

The *mixture density network* (MDN) (Bishop 2006) uses a mixture of parametric densities for modeling the conditional density, and the parameters are estimated by a neural network as functions of input \(\varvec{x}\). MDN was demonstrated to work well, but its training is time-consuming and only a local optimal solution may be found due to the non-convexity of neural network training. A similar method based on a mixture of Gaussian processes was developed (Tresp 2001), which can be trained in a computationally more efficient way by the expectation-maximization algorithm (Dempster et al. 1977). However, due to the non-convexity of the optimization problem, it is difficult to find the global optimal solution.

*Kernel quantile regression* (KQR) (Takeuchi et al. 2006; Li et al. 2007) gives non-parametric percentile estimates of conditional distributions through convex optimization. KQR can be used for estimating the entire conditional cumulative distribution by solving KQR for all percentiles. It was shown that the regularization path tracking technique (Hastie et al. 2004) can be employed for efficiently computing the entire conditional cumulative distribution (Takeuchi et al. 2009). However, KQR is applicable only to one-dimensional output, which limits the range of applications significantly.

*Least-squares conditional density estimation* (LS-CDE) allows estimation of multiple-input-multiple-output conditional densities by directly learning a conditional density model with least-squares estimation (Sugiyama et al. 2010). For linear-in-parameter models such as a linear combination of Gaussian kernels, LS-CDE is formulated as a convex optimization problem and its solution can be obtained efficiently and analytically just by solving a system of linear equations. Furthermore, kernel LS-CDE was proved to achieve the optimal non-parametric convergence rate to the true conditional density in the mini-max sense, meaning that no method can be better than LS-CDE asymptotically. Through extensive experiments, LS-CDE was demonstrated to compare favorably with competing approaches.

However, LS-CDE still suffers from large estimation error when many irrelevant features exist in input \(\varvec{x}\). Such irrelevant features are conceivable in many real-world problems. For example, in gene expression analysis for diseased cells, only a small subset of biomarker genes (input) affects the disease status (output). A standard way to cope with high input dimensionality is to select relevant features with forward selection or backward elimination (Guyon and Elisseeff 2003), but this often leads to a local optimal set of features.

In this paper, we propose extending LS-CDE to allow simultaneous feature selection during conditional density estimation. More specifically, we apply kernel LS-CDE to each input feature in an additive manner and penalize the whole solution by a group-sparse regularizer (Yuan and Lin 2006). Our subgradient-based optimization solver allows computationally efficient selection of relevant features that are even non-linearly correlated with output \(\varvec{y}\). Numerical experiments on noisy conditional density estimation demonstrate that our proposed method, which we call *sparse additive CDE* (SA-CDE), compares favorably with baseline approaches in estimation accuracy and computational efficiency.

The remainder of this paper is structured as follows. In Sect. 2, we formulate the problem of conditional density estimation and describe our proposed SA-CDE method. We experimentally evaluate the performance of SA-CDE in Sect. 3, and we summarize our contribution in Sect. 4.

## 2 Conditional density estimation with sparse feature selection

In this section, we formulate the problem of conditional density estimation and describe our proposed SA-CDE method.

### 2.1 Problem formulation

Our goal is to estimate the conditional density \(p(\varvec{y}|\varvec{x})\) from the training samples (1) via sufficient feature selection (2).

### 2.2 Sparse additive conditional density estimation

### 2.3 Optimization algorithm

*proximal method* (Sra et al. 2012; Beck and Teboulle 2009) to solve the optimization problem (16). More specifically, we consider a linear approximation to function \(\hat{J}_0\) at the current solution \(\varvec{\alpha }^{(t)}\), penalized by a proximal term to keep the update confined in the neighborhood:

^{1}which is given by the maximum eigenvalue of \(\varvec{H}\) in the current setup. We can describe our update rule analytically as follows (the detailed derivation is described in “Appendix 2”):
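As a generic illustration of a proximal step with a group-sparse penalty (a sketch, not necessarily the exact update rule of SA-CDE), the code below assumes the smooth part of the objective is the quadratic \(\frac{1}{2}\varvec{\alpha }^{\mathrm {T}}\varvec{H}\varvec{\alpha } - \varvec{h}^{\mathrm {T}}\varvec{\alpha }\) and the penalty is \(\lambda \sum _g \Vert \varvec{\alpha }_g\Vert _2\). The names `group_soft_threshold`, `proximal_gradient`, `h`, and `groups` are hypothetical, and the step size is the inverse of the maximum eigenvalue of \(\varvec{H}\).

```python
import numpy as np

def group_soft_threshold(v, groups, tau):
    """Shrink each group of v toward zero; groups with norm <= tau become exactly zero."""
    out = np.zeros_like(v)
    for idx in groups:
        norm = np.linalg.norm(v[idx])
        if norm > tau:
            out[idx] = (1.0 - tau / norm) * v[idx]
    return out

def proximal_gradient(H, h, groups, lam, n_iter=500):
    """Minimize 0.5 a^T H a - h^T a + lam * sum_g ||a_g||_2 (assumed objective form)."""
    lipschitz = np.linalg.eigvalsh(H).max()  # maximum eigenvalue of the symmetric matrix H
    a = np.zeros_like(h)
    for _ in range(n_iter):
        grad = H @ a - h  # gradient of the smooth quadratic part
        a = group_soft_threshold(a - grad / lipschitz, groups, lam / lipschitz)
    return a

# Hypothetical usage: three groups of two coefficients, only the first group is active.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 6))
H = A.T @ A / 50
h = H @ np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
alpha = proximal_gradient(H, h, groups, lam=0.1)
```

The thresholding step is what sets whole groups to zero at once, which is how a group-sparse penalty can discard all coefficients tied to one input feature.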

### 2.4 Post processing

*sparse additive CDE* (SA-CDE).

### 2.5 Cross-validation for model selection

The performance of SA-CDE depends on the choice of model parameters such as the Gaussian width \(\sigma \) and the regularization parameter \(\lambda \). Cross-validation (CV) can be used to choose these model parameters systematically. Throughout this paper, we use fivefold CV: we first divide the samples into five subsets, then learn the parameter using four subsets, and evaluate the test error using the held-out subset. This procedure is repeated five times, each time holding out a different subset, and the error is averaged.
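The fivefold procedure can be sketched as follows. This is a minimal illustration, not the authors' Matlab code: `five_fold_cv`, `fit`, and `score` are hypothetical names, and a one-dimensional Gaussian kernel density estimator with held-out negative log-likelihood stands in for SA-CDE.

```python
import numpy as np

def five_fold_cv(n_samples, fit, score, candidates, n_folds=5, seed=0):
    """Return the candidate with the smallest fold-averaged error, plus all averaged errors."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    mean_errors = []
    for params in candidates:
        errors = []
        for k in range(n_folds):
            test_idx = folds[k]  # held-out subset
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            model = fit(train_idx, params)
            errors.append(score(model, test_idx))
        mean_errors.append(float(np.mean(errors)))
    best = int(np.argmin(mean_errors))
    return candidates[best], mean_errors

# Stand-in model (assumption): Gaussian KDE on 1-D data, scored by held-out NLL.
data_rng = np.random.default_rng(1)
y = data_rng.normal(size=200)

def fit(train_idx, sigma):
    return y[train_idx], sigma

def score(model, test_idx):
    centers, sigma = model
    diff = y[test_idx][:, None] - centers[None, :]
    dens = np.exp(-0.5 * (diff / sigma) ** 2).mean(axis=1) / (np.sqrt(2 * np.pi) * sigma)
    return -np.mean(np.log(dens + 1e-300))  # small constant guards against log(0)

sigmas = [0.01, 0.1, 0.3, 1.0, 5.0]
best_sigma, errs = five_fold_cv(len(y), fit, score, sigmas)
```

Passing `fit` and `score` as callables keeps the splitting logic independent of the estimator, so the same loop could wrap any of the methods compared later.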

## 3 Numerical experiments

In this section, we experimentally evaluate the performance of our proposed method, SA-CDE. Throughout the experiments, the number of basis functions is fixed to \(B = \min (100, N)\). The model parameters \(\sigma \) and \(\lambda \) were chosen by fivefold CV from twenty values between \(10^{-2}\) and \(2\), equally spaced on a logarithmic scale. We use NLL (22) as the performance measure of conditional density estimation. NLL is computed from test samples, which are not used for learning parameters and hyper-parameters. All experiments were implemented in Matlab 2013b and run on an HP DL360p Gen8 E5 v2 server with two Xeon E5-2650 v2 2.60 GHz (8-core) CPUs and 96 GB of main memory.
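For concreteness, the candidate grid described above could be built as in the following sketch. It assumes that both endpoints are included and that \(\sigma \) and \(\lambda \) are searched jointly over all pairs; neither detail is stated explicitly in the text.

```python
import itertools
import numpy as np

# Twenty values from 1e-2 to 2, equally spaced on a logarithmic scale (endpoints assumed included).
candidates = np.geomspace(1e-2, 2.0, num=20)

# Joint (sigma, lambda) grid, assuming both parameters share the same candidate values.
grid = list(itertools.product(candidates, candidates))

# Number of basis functions B = min(100, N) for a hypothetical sample size N.
N = 300
B = min(100, N)
```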

### 3.1 Compared methods

- **Sparse additive feature selection LSCDE (SA-LSCDE):** SA-LSCDE is a variation of the proposed SA-CDE, which first runs SA-CDE for feature selection and then estimates the conditional density by LS-CDE with only the selected features.
- **\(\epsilon \)-neighbor kernel density estimation (eKDE):** eKDE estimates a conditional density by standard kernel density estimation using neighborhood samples in the domain of input \(\varvec{x}\), denoted by \(\mathcal {I}_{\varvec{x},\epsilon } := \{ i: \Vert \varvec{x}^{(i)} - \varvec{x} \Vert _2^2 \le \epsilon \}\) for threshold \(\epsilon \). In the case of Gaussian kernels, eKDE is given as

  $$\begin{aligned} \hat{p}(\varvec{y}|\varvec{x}) = \frac{1}{|\mathcal {I}_{\varvec{x},\epsilon }|} \sum _{i \in \mathcal {I}_{\varvec{x},\epsilon }} \mathcal {N}(\varvec{y},\varvec{y}^{(i)},\sigma ^2 \varvec{I}_{D_y}), \end{aligned} \tag{23}$$

  where \(\mathcal {N}(\varvec{y},\varvec{\mu }, \varvec{\Sigma })\) denotes the Gaussian density function with respect to \(\varvec{y}\) with mean \(\varvec{\mu }\) and covariance matrix \(\varvec{\Sigma }\), and \(\varvec{I}_{D_y}\) is the identity matrix of size \(D_y\). In experiments, threshold \(\epsilon \) and bandwidth \(\sigma \) were chosen based on fivefold CV with respect to NLL, where the candidate values of \(\epsilon \) are twenty values between \(10^{-2}\) and \(5\), equally spaced on a logarithmic scale.
- **Least-squares conditional density estimation (LS-CDE):** The original LS-CDE method. This corresponds to a multi-dimensional non-sparse version of SA-CDE where, instead of the group-sparse penalty and an additive model, an \(\ell _2\)-penalty \(\lambda \Vert \varvec{\alpha } \Vert _2^2 \) and a multi-dimensional linear-in-parameter model,

  $$\begin{aligned} \hat{p}(\varvec{y}|\varvec{x} )&:= \hat{r}(\varvec{y}, \varvec{x}) \\&= \sum _{b=1}^B \alpha _{b} \Big \{ \eta _b\big ( \varvec{y}\big ) \cdot \varphi _{b}\big ( \varvec{x} \big ) \Big \}, \end{aligned} \tag{24}$$

  is used. We use the Gaussian kernels for both \(\eta _b( \cdot )\) and \(\varphi _b( \cdot )\), where the bandwidth \(\sigma \) and the regularization parameter \(\lambda \) are chosen based on fivefold CV with respect to NLL.
- **Nadaraya-Watson CDE (NW-CDE):** This corresponds to a simple version of LS-CDE, which fixes weights of basis functions to \(\frac{1}{B}\):

  $$\begin{aligned} \hat{p}(\varvec{y}|\varvec{x} )&:= \frac{\hat{p}(\varvec{y}, \varvec{x} )}{\hat{p}(\varvec{x} )} \\&= \frac{ \sum _{b=1}^B \eta _b\big ( \varvec{y}\big ) \cdot \varphi _{b}\big ( \varvec{x} \big ) }{ \sum _{b=1}^B \varphi _{b}\big ( \varvec{x} \big ) }. \end{aligned} \tag{25}$$

  We use the Gaussian kernels for both \(\eta _b( \cdot )\) and \(\varphi _b( \cdot )\), where the bandwidth \(\sigma \) is chosen based on leave-one-out CV for the exact likelihood formulated in Holmes et al. (2007). To directly employ the method in Holmes et al. (2007), we only use \(B\) samples in this CV procedure.
- **Forward feature selection + eKDE (FW-eKDE):** Forward feature selection is performed based on fivefold CV with respect to NLL. That is, the feature that most reduces the cross-validated NLL of eKDE is added one at a time until the cross-validated NLL no longer decreases.
- **Forward feature selection + LS-CDE (FW-LSCDE):** Similarly, forward feature selection is performed for LS-CDE.
- **Forward feature selection + NW-CDE (FW-NWCDE):** Similarly, forward feature selection is performed for NW-CDE.
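The two simplest baselines, eKDE in Eq. (23) and NW-CDE in Eq. (25), can be sketched in a few lines. This is an illustrative Python version rather than the Matlab code used in the experiments. It assumes that kernel centers are the samples themselves, that \(\eta _b\) is a normalized Gaussian density so the estimate integrates to one in \(\varvec{y}\), and that a single bandwidth is shared by input and output kernels. The function names are hypothetical.

```python
import numpy as np

def gaussian_density(y, means, sigma):
    """Isotropic Gaussian density N(y; mean, sigma^2 I) for every row of `means`."""
    d = means.shape[1]
    sq_dist = np.sum((y[None, :] - means) ** 2, axis=1)
    return np.exp(-sq_dist / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (d / 2)

def ekde(y, x, X, Y, eps, sigma):
    """Eq. (23): average Gaussian kernels over samples whose squared input distance is <= eps."""
    mask = np.sum((X - x[None, :]) ** 2, axis=1) <= eps
    if not mask.any():
        return 0.0  # convention chosen here for an empty neighborhood (not from the paper)
    return float(gaussian_density(y, Y[mask], sigma).mean())

def nw_cde(y, x, Xc, Yc, sigma):
    """Eq. (25): output kernels weighted by normalized input-kernel values."""
    w = np.exp(-np.sum((Xc - x[None, :]) ** 2, axis=1) / (2 * sigma**2))
    return float(np.sum(w * gaussian_density(y, Yc, sigma)) / np.sum(w))

# Hypothetical 1-D data for a quick check.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))
Y = np.sin(3 * X) + 0.1 * rng.normal(size=(300, 1))
x0 = np.array([0.2])
y0 = np.array([np.sin(0.6)])
p_ekde = ekde(y0, x0, X, Y, eps=0.05, sigma=0.1)
p_nw = nw_cde(y0, x0, X, Y, sigma=0.1)
```

Both estimators use every input dimension in the distance computation, so noisy irrelevant features directly distort the neighborhood or the weights. If all input weights underflow to zero, `nw_cde` returns `nan`, so a practical version would need a guard for that case.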

Table 1 Computational complexities of our methods and existing CDEs

Method | SA-CDE | SA-LSCDE | LS-CDE | eKDE | NW-CDE |
---|---|---|---|---|---|
Time | \(O(N^3D_x^3)\) | \(O(N^3D_x^3)\) | \(O(N^3)\) | \(O(N^2D_x)\) | \(O(N^2D_x)\) |
Space | \(O(N^2D_x^2)\) | \(O(N^2D_x^2)\) | \(O(N^2)\) | \(O(N^2)\) | \(O(N^2)\) |

Table 2 Computational complexities of our methods and existing CDEs with forward feature selection

Method | SA-CDE | SA-LSCDE | FW-LSCDE | FW-eKDE | FW-NWCDE |
---|---|---|---|---|---|
Time | \(O(N^3D_x^3)\) | \(O(N^3D_x^3)\) | \(O(D_x!N^3)\) | \(O(D_x!N^2)\) | \(O(D_x!N^2)\) |
Space | \(O(N^2D_x^2)\) | \(O(N^2D_x^2)\) | \(O(N^2)\) | \(O(N^2)\) | \(O(N^2)\) |

### 3.2 Illustrative examples

- **Toy data 1:** \(x_1\) is independently generated following the uniform distribution on \([-1,1]\), while each of \(x_2,\dots , x_6\) is generated by \(x_1+ \epsilon _c\) where \(\epsilon _c\) is a noise variable following the normal distribution with mean 0 and standard deviation \(3\hat{\sigma }\), and \(\hat{\sigma }\) is the standard deviation of \(x_1\). Output \(y\) is generated as a function of \(x_1\) as

  $$\begin{aligned} y | x_{1}&\sim \hbox {sinc}\left( \frac{3}{4}\pi x_1 \right) + \frac{1}{8} \exp \big ( 1 - x_1 \big ) \cdot \varepsilon , \end{aligned} \tag{26}$$

  where \(\varepsilon \) is standard normal noise. We generate \(N=300\) samples for estimating the conditional density.
- **Old Faithful Geyser:** A benchmark dataset with \(D_x=D_y=1\) that consists of durations of \(N=299\) eruptions of the Old Faithful Geyser (Weisberg 1985). We add five irrelevant features \(x_2,\dots ,x_6\) in a similar manner to Toy data 1.
- **Bone Mineral Density:** A benchmark dataset with \(D_x=D_y=1\) that consists of relative spinal bone mineral density measurements on \(N=485\) North American adolescents (Hastie et al. 2001). We add five irrelevant features \(x_2,\dots ,x_6\) in a similar manner to Toy data 1.
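The generation of Toy data 1 in Eq. (26) can be sketched as below. Two points are assumptions rather than statements from the text: \(\hbox {sinc}(t)\) is taken to be the unnormalized \(\sin (t)/t\), and \(\hat{\sigma }\) is taken to be the sample standard deviation of \(x_1\). The function name `make_toy_data_1` is hypothetical.

```python
import numpy as np

def make_toy_data_1(n=300, n_noisy=5, seed=0):
    """Generate one relevant feature x_1, noisy copies x_2..x_6, and a heteroscedastic output y."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-1.0, 1.0, size=n)
    sigma_hat = x1.std()  # assumption: sample standard deviation of x_1
    noisy = x1[:, None] + rng.normal(0.0, 3.0 * sigma_hat, size=(n, n_noisy))
    X = np.column_stack([x1, noisy])
    # np.sinc(t) = sin(pi t) / (pi t), so np.sinc(0.75 * x1) equals sin(0.75 pi x1) / (0.75 pi x1).
    mean = np.sinc(0.75 * x1)
    scale = np.exp(1.0 - x1) / 8.0  # noise level depends on x_1 (heteroscedastic)
    y = mean + scale * rng.normal(size=n)
    return X, y

X, y = make_toy_data_1()
```

Because the noise scale \(\frac{1}{8}\exp (1-x_1)\) shrinks as \(x_1\) grows, the conditional density is wide near \(x_1=-1\) and narrow near \(x_1=1\), which a conditional mean alone cannot capture.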

The regularization paths in Fig. 4 show that, in (a) Toy data 1 and (b) Old Faithful Geyser, the parameters corresponding to the irrelevant features are zero and those corresponding to the relevant feature are non-zero at the cross-validated solution, which means that SA-CDE correctly selects the relevant feature. In Fig. 4c (Bone Mineral Density), the parameters of some irrelevant features are non-zero because features with a skewed distribution remain strongly correlated with the relevant feature despite the additive Gaussian noise. Thus, these features may still contain some information on the output value.

The estimation results in Fig. 3a show that SA-CDE gives more accurate estimates than the plain LS-CDE. In Fig. 3b, c, SA-CDE tends to give sharper conditional density estimates than the plain LS-CDE. This is because relatively large Gaussian kernel widths are chosen in LS-CDE to accommodate the irrelevant noisy features. This indicates that LS-CDE with many irrelevant features tends to produce overly flat, uninformative conditional densities, while SA-CDE can avoid this problem by automatically eliminating irrelevant features.

### 3.3 Comparison of performance and computation time for different numbers of samples

- **Toy data 1:** The generation procedure is the same as the one in the previous section, in which the relevant feature (input) and the output are both one-dimensional.
- **Toy data 2:** Each irrelevant feature \(x_1, x_2, x_4, x_5, \dots , x_{D_x-2}, x_{D_x-1}\) is independently generated following the uniform distribution on \([-1,1]\). Relevant features are generated by \(x_{3d} = x_{3d-2} + x_{3d-1}\), and the \(d\)th dimension of output \(\varvec{y}\) is generated as a function of \(x_{3d},~d=1,2,\dots , D_x/3\) as

  $$\begin{aligned} y_d | x_{3d}&\sim \hbox {sinc}\left( \frac{3}{4}\pi x_{3d} \right) + \frac{1}{8} \exp \big ( 1 - x_{3d} \big ) \cdot \varepsilon , \end{aligned} \tag{27}$$

  where \(\varepsilon \) is standard normal noise. This dataset has multi-dimensional relevant features and outputs.

### 3.4 Hyper-parameter selection

### 3.5 Performance comparison for different numbers of irrelevant features

*Old Faithful Geyser* benchmark dataset (\(N=299\) and \(D_x=1\)), and (c) the *crabs* benchmark dataset (\(N=200\) and \(D_x=6\)) taken from the R package.^{2} For each dataset, we add \(m~(=0,1,\dots ,10)\) irrelevant features by copying \(x_1\) and adding Gaussian noise in a similar manner to the previous experiments. We randomly choose half of the samples as training samples to estimate conditional densities and use the rest as test samples to compute the test NLL. This procedure is repeated 100 times and the averaged test NLL is computed. The experimental results are summarized in Fig. 17.

In all three cases, the NLL values of LS-CDE, NW-CDE, and eKDE (without feature selection) grow as the number of irrelevant features increases. On the other hand, the NLL values of SA-CDE, SA-LSCDE, FW-LSCDE, FW-NWCDE, and FW-eKDE do not grow that much when the number of irrelevant features increases. This clearly demonstrates an advantage of performing feature selection.

### 3.6 Benchmark datasets

Table 3 NLL for benchmark datasets with five-dimensional irrelevant features

Name | \(\vert \mathcal {F}\vert \) | \(N\) | SA-CDE | SA-LSCDE | LS-CDE | eKDE | NW-CDE |
---|---|---|---|---|---|---|---|
caution | 2 | 100 | \(1.34 \pm 0.6\) | \(\varvec{1.24} \pm \varvec{0.4}\) | \(1.38 \pm 0.3\) | \(24.25 \pm 3.4\) | \(1.36 \pm 0.2\) |
CobarOre | 2 | 38 | \(1.71 \pm 0.5\) | \(1.70 \pm 0.4\) | \(\varvec{1.62} \pm \varvec{0.2}\) | \(31.81 \pm 2.9\) | \(\varvec{1.62} \pm \varvec{0.2}\) |
snowgeese | 2 | 45 | \(\varvec{1.80} \pm \varvec{2.0}\) | \(\varvec{1.76} \pm \varvec{1.8}\) | \(1.85 \pm 1.3\) | \(22.04 \pm 6.1\) | \(\varvec{1.59} \pm \varvec{1.0}\) |
topo | 2 | 52 | \(1.17 \pm 0.3\) | \(1.14 \pm 0.3\) | \(1.22 \pm 0.2\) | \(29.30 \pm 3.0\) | \(1.21 \pm 0.1\) |
sniffer | 4 | 125 | \(0.70 \pm 0.6\) | \(\varvec{0.60} \pm \varvec{0.7}\) | \(0.85 \pm 0.2\) | \(16.91 \pm 3.2\) | \(0.83 \pm 0.2\) |
crabs | 6 | 200 | \(-0.44 \pm 0.1\) | \(\varvec{-0.47} \pm \varvec{0.3}\) | \(0.53 \pm 0.1\) | \(26.03 \pm 3.1\) | \(0.58 \pm 0.1\) |
UN3 | 6 | 125 | \(\varvec{1.27} \pm \varvec{0.2}\) | \(\varvec{1.35} \pm \varvec{0.4}\) | \(1.57 \pm 0.6\) | \(33.36 \pm 1.6\) | \(1.54 \pm 0.6\) |
birthwt | 7 | 189 | \(\varvec{1.49} \pm \varvec{0.2}\) | \(\varvec{1.52} \pm \varvec{0.1}\) | \(\varvec{1.51} \pm \varvec{0.1}\) | \(31.77 \pm 1.6\) | \(1.67 \pm 0.2\) |
cpus | 7 | 209 | \(\varvec{0.36} \pm \varvec{0.6}\) | \(0.80 \pm 0.7\) | \(1.19 \pm 0.5\) | \(22.29 \pm 3.4\) | \(1.17 \pm 0.6\) |
gilgais | 8 | 365 | \(\varvec{0.70} \pm \varvec{0.2}\) | \(0.89 \pm 0.2\) | \(1.16 \pm 0.2\) | \(27.77 \pm 2.2\) | \(1.11 \pm 0.2\) |
BigMac | 9 | 69 | \(1.33 \pm 0.8\) | \(1.37 \pm 0.7\) | \(1.42 \pm 0.7\) | \(35.79 \pm 0.5\) | \(1.34 \pm 0.5\) |
highway | 11 | 39 | \(\varvec{1.38} \pm \varvec{0.7}\) | \(1.60 \pm 0.7\) | \(1.71 \pm 0.8\) | \(36.04 \pm 0.0\) | \(1.74 \pm 0.7\) |
Time | | | \(1.00\) | \(0.00\) | \(0.06\) | \(0.02\) | \(0.00\) |

Name | FW-LSCDE | FW-eKDE | FW-NWCDE |
---|---|---|---|
caution | \(1.33 \pm 0.6\) | \(1.35 \pm 0.6\) | \(1.30 \pm 0.5\) |
CobarOre | \(1.95 \pm 0.6\) | \(2.45 \pm 1.9\) | \(\varvec{1.65} \pm \varvec{0.4}\) |
snowgeese | \(2.09 \pm 1.9\) | \(3.03 \pm 2.4\) | \(\varvec{1.82} \pm \varvec{1.8}\) |
topo | \(1.19 \pm 0.4\) | \(1.73 \pm 1.2\) | \(\varvec{1.07} \pm \varvec{0.2}\) |
sniffer | \(0.74 \pm 0.8\) | \(0.96 \pm 0.8\) | \(0.96 \pm 1.1\) |
crabs | \(-0.37 \pm 0.3\) | \(0.08 \pm 0.6\) | \(-0.12 \pm 0.8\) |
UN3 | \(\varvec{1.27} \pm \varvec{0.3}\) | \(1.60 \pm 0.6\) | \(\varvec{1.34} \pm \varvec{0.3}\) |
birthwt | \(1.67 \pm 0.2\) | \(1.75 \pm 0.5\) | \(3.85 \pm 2.1\) |
cpus | \(0.70 \pm 0.8\) | \(1.00 \pm 0.9\) | \(0.76 \pm 0.9\) |
gilgais | \(0.76 \pm 0.2\) | \(0.97 \pm 0.3\) | \(1.20 \pm 0.3\) |
BigMac | \(1.45 \pm 0.9\) | \(2.54 \pm 1.7\) | \(\varvec{1.23} \pm \varvec{0.8}\) |
highway | \(2.06 \pm 1.0\) | \(3.17 \pm 1.9\) | \(2.18 \pm 1.8\) |
Time | \(2.73\) | \(0.54\) | \(0.01\) |

Table 3 shows that our methods (SA-CDE and SA-LSCDE) perform best on nine datasets. For high-dimensional datasets, especially when \(|\mathcal {F}|\) is seven or more, SA-CDE tends to outperform other methods with statistical significance. For low-dimensional datasets with large \(N\), SA-LSCDE outperforms SA-CDE because of the difference in the expressive power of their models. For low-dimensional datasets with small \(N\), FW-NWCDE performs the best because all other methods, which optimize the weights of basis functions, overfit. LS-CDE, eKDE, NW-CDE, and FW-eKDE are computationally much more efficient than SA-CDE and SA-LSCDE, but these methods tend to perform poorly for high-dimensional relevant features with noisy dimensions.

### 3.7 Humanoid robot transition dataset

Finally, we evaluate the performance of the proposed method on humanoid robot transition estimation with multiple inputs and multiple outputs. The dataset was generated from a simulator of the upper-body part of the humanoid robot *CB-i* (Cheng et al. 2007). The robot has 9 controllable joints: shoulder pitch, shoulder roll and elbow pitch of the right arm, shoulder pitch, shoulder roll and elbow pitch of the left arm, waist yaw, torso roll, and torso pitch joints.

*Proportional-Derivative* (PD) controller as

To generate transition samples, we first generated the initial posture of the robot \(\varvec{s}^{(1)}\) at random and then simulated a trajectory with 100 steps, i.e. \(\varvec{s}^{(2)}, \dots , \varvec{s}^{(100)}\). For each step, we additionally generated \(m\) irrelevant input features \(\varvec{z}^{(n)} \in \mathbb {R}^m\) by copying a relevant variable or by linearly combining two relevant variables contaminated with Gaussian noise in a similar manner to the previous experiments. By iterating these procedures, we obtained the transition samples \(\{ (\varvec{s}^{(n)}, \varvec{a}^{(n)}, \varvec{z}^{(n)}, \varvec{s}'^{(n)}) \}_{n=1}^{10000}\).

Our goal is to learn the system dynamics as state transition probability \(p(\varvec{s}' | \varvec{s}, \varvec{a}, \varvec{z})\) from these samples. Thus, as the conditional density estimation problem, the state-action pair \((\varvec{s}^{\mathrm {T}}, \varvec{a}^{\mathrm {T}}, \varvec{z}^{\mathrm {T}})^{\mathrm {T}}\) is regarded as input variable \(\varvec{x}\), while the next state \(\varvec{s}'\) is regarded as output variable \(\varvec{y}\). Note that an accurate estimate of the state transition probability is highly useful in *model-based reinforcement learning* (Sutton and Barto 1998).

From the transition samples, we randomly picked 5000 samples as training data and used the other 5000 samples as test data to calculate NLL. We compare our proposed method SA-CDE with LS-CDE, NW-CDE, FW-LSCDE, and FW-NWCDE, as well as parametric conditional density estimation by *Gaussian process regression* (GP-CDE) (Rasmussen and Williams 2005). In this experiment, the candidate values of regularization parameter \(\lambda \) are twenty values between \(10^{-3}\) and \(10^{-1}\), equally spaced on a logarithmic scale, while the candidate values of other parameters are the same as in the previous setting. We consider three datasets with \(J=2, 4, 9\) joints, and change the number of irrelevant features as \(m=0, 5, 10, 15, 20\). Thus, the input dimensionality is \(3J+m\), while the output dimensionality is \(2J\). For each \(J\) and \(m\), we evaluated the performance of conditional density estimation methods by averaged NLL and averaged computational time over 20 runs.

Overall, in this challenging task of robot transition estimation, SA-LSCDE, the combination of SA-CDE and LS-CDE, was shown to be the most promising approach.

## 4 Conclusions

We proposed a direct estimator of conditional probability densities that is equipped with feature selection. Our feature selection strategy is based on the \(\ell _1 / \ell _2\) mixed-norm, which tends to produce a group-sparse solution. An optimization algorithm based on a proximal method was presented that is guaranteed to possess fast convergence. The numerical experiments on benchmark and robot transition datasets demonstrated that the proposed method is promising.

SA-CDE assumes an additive structure for feature selection. However, this causes a linear increase in the time and space complexities, resulting in high computation costs for datasets with a large number of features. Improving scalability is left for future work.

## Notes

### Acknowledgments

Motoki Shiga was supported by JSPS KAKENHI 25870322. Masashi Sugiyama was supported by JSPS KAKENHI 23120004 and AOARD. Authors thank Dr. Ichiro Takeuchi, Nagoya Institute of Technology, for kindly providing his source codes.

### References

- Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. *SIAM Journal on Imaging Sciences*, *2*(1), 183–202.
- Bishop, C. M. (2006). *Pattern recognition and machine learning*. New York, NY: Springer.
- Cheng, C., Hyon, S. H., Morimoto, J., Ude, A., Hale, J. G., Colvin, G., et al. (2007). Cb: A humanoid research platform for exploring neuroscience. *Advanced Robotics*, *21*(10), 1097–1114.
- Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression. *Journal of the American Statistical Association*, *100*(470), 410–428.
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. *Journal of the Royal Statistical Society, Series B*, *39*(1), 1–38.
- Fan, J., Yao, Q., & Tong, H. (1996). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. *Biometrika*, *83*(1), 189–206.
- Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. *Journal of Machine Learning Research*, *3*(Mar.), 1157–1182.
- Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. *Journal of Machine Learning Research*, *5*, 1391–1415.
- Hastie, T., Tibshirani, R., & Friedman, J. (2001). *The elements of statistical learning: Data mining, inference, and prediction*. New York, NY: Springer.
- Holmes, M. P., Gray, A. G., & Isbell, C. L. (2007). Fast nonparametric conditional density estimation. In *Proceedings of the twenty-third conference annual conference on uncertainty in artificial intelligence* (pp. 175–182).
- Li, K. (1991). Sliced inverse regression for dimension reduction. *Journal of the American Statistical Association*, *86*(414), 316–342.
- Li, Y., Liu, Y., & Zhu, J. (2007). Quantile regression in reproducing kernel Hilbert spaces. *Journal of the American Statistical Association*, *102*(477), 255–268.
- Rasmussen, C. E., & Williams, C. K. I. (2005). *Gaussian processes for machine learning (adaptive computation and machine learning)*. Cambridge: MIT Press.
- Silverman, B. W. (1986). *Density estimation for statistics and data analysis*. London: Chapman and Hall.
- Sra, S., Nowozin, S., & Wright, S. (2012). *Optimization for machine learning. Neural information processing series*. Cambridge: MIT Press.
- Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., & Okanohara, D. (2010). Least-squares conditional density estimation. *IEICE Transactions on Information and Systems*, *E93-D*(3), 583–594.
- Sutton, R. S., & Barto, A. G. (1998). *Introduction to reinforcement learning* (1st ed.). Cambridge, MA: MIT Press.
- Takeuchi, I., Le, Q. V., Sears, T. D., & Smola, A. J. (2006). Nonparametric quantile estimation. *Journal of Machine Learning Research*, *7*, 1231–1264.
- Takeuchi, I., Nomura, K., & Kanamori, T. (2009). Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. *Neural Computation*, *21*(2), 533–559.
- Tresp, V. (2001). Mixtures of Gaussian processes. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), *Advances in neural information processing systems, vol. 13* (pp. 654–660). Cambridge, MA: MIT Press.
- Weisberg, S. (1985). *Applied linear regression*. New York, NY: Wiley.
- Wolff, R. C. L., Yao, Q., & Hall, P. (1999). Methods for estimating a conditional distribution function. *Journal of the American Statistical Association*, *94*(445), 154–163.
- Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. *Journal of the Royal Statistical Society, Series B*, *68*(1), 49–67.