Iteratively Selecting Feature Subsets for Mining from High-Dimensional Databases
We propose a new data mining method that is effective for mining extremely high-dimensional databases. The method iteratively selects a subset of features from a database and builds a hypothesis with that subset. Feature subset selection proceeds in two steps: first, selecting the instances on which the predictions of the previously obtained hypotheses are least reliable; second, selecting the features whose value distributions over those instances deviate most from their distributions over all instances in the database. We empirically evaluate the proposed method against two other methods, including Xing et al.'s method, one of the latest feature subset selection methods. The evaluation was performed on a real-world data set with approximately 140,000 features. Our results show that the proposed method outperforms the others, both in final predictive accuracy and in the precision attained at the recall reached by Xing et al.'s method. We also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced as the noise level increases.
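The two-step selection described above can be sketched in outline. This is a minimal illustration, not the authors' implementation: the function name, the use of committee disagreement as the "unreliability" measure, and the standardized mean shift as the distribution-divergence measure are all assumptions, since the abstract does not fix these choices.

```python
import numpy as np

def select_feature_subset(X, committee_preds, n_instances, n_features):
    """One iteration of the two-step selection (hypothetical sketch).

    X: (n_samples, n_dims) data matrix.
    committee_preds: (n_hypotheses, n_samples) class predictions from
        the previously obtained hypotheses.
    """
    # Step 1: pick the instances the committee disagrees on the most,
    # using disagreement as a stand-in for "unreliable predictions".
    def disagreement(col):
        _, counts = np.unique(col, return_counts=True)
        return 1.0 - counts.max() / col.size

    scores = np.apply_along_axis(disagreement, 0, committee_preds)
    unreliable = np.argsort(scores)[-n_instances:]

    # Step 2: rank features by how far their value distribution on the
    # selected instances drifts from that over the whole database; here
    # a mean shift in units of the global std (an assumed measure).
    mu_all = X.mean(axis=0)
    sd_all = X.std(axis=0) + 1e-12  # guard against zero-variance features
    drift = np.abs(X[unreliable].mean(axis=0) - mu_all) / sd_all
    return np.argsort(drift)[-n_features:]
```

In an iterative loop, one would build a new hypothesis on the returned feature subset, add it to the committee, and repeat.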
- 2. Joachims, T.: Making Large-Scale SVM Learning Practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.): Advances in Kernel Methods – Support Vector Learning. MIT Press, Cambridge (1999)
- 4. Koller, D., Sahami, M.: Toward Optimal Feature Selection. In: Saitta, L. (ed.): Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, Bari, Italy (1996) 284–292
- 7. Mamitsuka, H., Abe, N.: Efficient Mining from Large Databases by Query Learning. In: Langley, P. (ed.): Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, Stanford, CA (2000) 575–582
- 8. Ng, A.: On Feature Selection: Learning with Exponentially Many Irrelevant Features as Training Examples. In: Shavlik, J. (ed.): Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, Madison, WI (1998) 404–412
- 10. Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
- 11. Seung, H. S., Opper, M., Sompolinsky, H.: Query by Committee. In: Haussler, D. (ed.): Proceedings of the Fifth International Conference on Computational Learning Theory. Morgan Kaufmann, NY (1992) 287–294
- 12. Xing, E. P., Jordan, M. I., Karp, R. M.: Feature Selection for High-Dimensional Genomic Microarray Data. In: Brodley, C. E., Danyluk, A. P. (eds.): Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann, Williamstown, MA (2001) 601–608