Analysis and Insights into the Variable Selection Problem
In many large applications, many input variables are initially available, and a subset selection step is needed to choose the best few for use in the subsequent classification or regression step. The designer initially screens the inputs for those that have good predictive ability and are not too highly correlated with the other selected inputs. In this paper, we study how the predictive ability of the inputs, viewed individually, reflects on the performance of the group (i.e., what the chances are that they perform well as a group). We also study the effect of “irrelevant” inputs and develop a formula for the distribution of the change in error due to adding an irrelevant input, which can serve as a useful reference. We further study the role of correlations and their effect on group performance. To study these issues, we first perform a theoretical analysis for the case of linear regression problems. We then follow with an empirical study for nonlinear regression models such as neural networks.
Keywords: Mean Square Error, Group Performance, Benchmark Problem, Scatter Diagram, Random Input
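The effect of an irrelevant input mentioned in the abstract can be illustrated with a small simulation for the linear regression case. This is only a minimal sketch under stated assumptions, not the paper's analysis: the function names (`rss`, `irrelevant_input_drops`), the sample sizes, and the Gaussian data model are all hypothetical choices made for illustration, and the sketch looks at training error only. For ordinary least squares with Gaussian noise, a standard result is that the drop in the training residual sum of squares caused by one irrelevant input, divided by the noise variance, follows a chi-square distribution with one degree of freedom; the abstract does not say whether this coincides with the formula derived in the paper.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


def irrelevant_input_drops(n=50, p=3, sigma=1.0, trials=2000, seed=0):
    """Simulate the drop in training RSS from adding one irrelevant input.

    Assumed (hypothetical) setup: p relevant Gaussian inputs, a linear
    target with Gaussian noise of standard deviation sigma, and one extra
    input drawn independently of everything else.
    """
    rng = np.random.default_rng(seed)
    drops = np.empty(trials)
    for t in range(trials):
        X = rng.normal(size=(n, p))            # relevant inputs
        beta = rng.normal(size=p)              # true coefficients
        y = X @ beta + sigma * rng.normal(size=n)
        z = rng.normal(size=(n, 1))            # irrelevant input
        drops[t] = rss(X, y) - rss(np.hstack([X, z]), y)
    return drops


drops = irrelevant_input_drops()
print("mean drop in training RSS:", drops.mean())
print("fraction of trials where training RSS increased:",
      (drops < -1e-8).mean())
```

In this setup the training error never increases when the irrelevant input is added, and the average drop should be close to the noise variance. That small, purely spurious improvement is the kind of baseline against which a candidate input's apparent benefit can be compared.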