Query-learning-based iterative feature-subset selection for learning from high-dimensional data sets

Mamitsuka, Hiroshi

doi:10.1007/s10115-005-0199-4

Regular Paper
Published: 20 April 2005

Knowledge and Information Systems, Volume 9, pages 91–108 (2006)

Abstract

We propose a new data-mining method that is effective for learning from extremely high-dimensional data sets. Our proposed method selects a subset of features from a high-dimensional data set by a process of iterative refinement. Each selection of a feature subset has two steps. The first step selects, from the data set, a subset of instances for which the predictions of previously obtained hypotheses are most unreliable. The second step selects a subset of features whose values in the selected instances vary the most from those in all instances of the database. We empirically evaluate the effectiveness of the proposed method by comparing its performance with those of four other methods, including one of the latest feature-subset selection methods. The evaluation was performed on a real-world data set with approximately 140,000 features. Our results show that the performance of the proposed method exceeds those of the other methods in terms of prediction accuracy, precision at a certain recall value, and computation time to reach a certain prediction accuracy. We have also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced for larger noise levels. Extended abstracts of parts of the work presented in this paper have appeared in Mamitsuka [14] and Mamitsuka [15].
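The two steps described in the abstract can be illustrated with a small sketch of a single refinement round. This is not the paper's implementation: the abstract does not say how unreliability or variation is measured, so the sketch assumes that unreliability is the vote margin of a committee of earlier hypotheses, and that a feature "varies the most" when its mean over the selected instances is farthest from its mean over all instances. The names `vote_margin`, `select_uncertain_instances` and `select_features`, and the toy data, are hypothetical.

```python
from statistics import mean


def vote_margin(hypotheses, x):
    """Absolute gap between positive and negative votes for instance x.

    A small margin means the committee is divided, which this sketch
    treats as an unreliable prediction (an assumption, not the paper's
    stated measure).
    """
    positive = sum(1 for h in hypotheses if h(x) == 1)
    return abs(2 * positive - len(hypotheses))


def select_uncertain_instances(hypotheses, data, n_instances):
    """Step 1: indices of the instances with the smallest vote margin."""
    ranked = sorted(range(len(data)),
                    key=lambda i: vote_margin(hypotheses, data[i]))
    return ranked[:n_instances]


def select_features(data, instance_idx, n_features):
    """Step 2: features whose mean over the selected instances differs
    most from their mean over all instances (assumed reading of
    'vary the most')."""
    def shift(j):
        overall = mean(row[j] for row in data)
        selected = mean(data[i][j] for i in instance_idx)
        return abs(selected - overall)

    ranked = sorted(range(len(data[0])), key=shift, reverse=True)
    return sorted(ranked[:n_features])


# Toy data: six instances with four binary features each.
data = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
# Hypothetical committee: hypothesis j predicts the value of feature j.
hypotheses = [lambda x, j=j: x[j] for j in range(4)]

instances = select_uncertain_instances(hypotheses, data, n_instances=2)
features = select_features(data, instances, n_features=2)
print(instances, features)
```

In an iterative setting, a new hypothesis would be trained on the selected instances restricted to the selected features and added to the committee before the next round; that loop is omitted here because the abstract gives no details of the learner.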

References

[1] Breiman L (1999) Pasting small votes for classification in large databases and on-line. Mach Learn 36(1–2):85–103
[2] Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3:1289–1305
[3] Freund Y, Schapire R (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139
[4] Freund Y, Seung H, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm. Mach Learn 28(2–3):133–168
[5] Hagmann M (2000) Computers aid vaccine design. Science 290(5489):80–82
[6] Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
[7] Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods–support vector learning. MIT Press, Cambridge, MA, pp 41–56
[8] Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324
[9] Koller D, Sahami M (1996) Toward optimal feature selection. In: Saitta L (ed) Proceedings of the thirteenth international conference on machine learning. Morgan Kaufmann, Bari, Italy, pp 284–292
[10] Kononenko I, Hong SJ (1997) Attribute selection for modelling. Future Gener Comput Syst 13(2–3):181–195
[11] Lewis D, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: Cohen W, Hirsh H (eds) Proceedings of the eleventh international conference on machine learning. Morgan Kaufmann, New Brunswick, pp 148–156
[12] Lewis D, Gale W (1994) Training text classifiers by uncertainty sampling. In: Smeaton AF (ed) Proceedings of the seventeenth annual international ACM SIGIR conference on research and development in information retrieval. ACM, Dublin, Ireland, pp 3–12
[13] Liu H, Motoda H (1998) Feature selection for knowledge discovery and data mining. Kluwer, Boston
[14] Mamitsuka H (2002) Iteratively selecting feature subsets for mining from high-dimensional databases. In: Elomaa T, Mannila H, Toivonen H (eds) Proceedings of the 6th European conference on principles and practice of knowledge discovery in databases. Springer, Berlin Heidelberg New York, pp 361–372
[15] Mamitsuka H (2003) Empirical evaluation of ensemble feature subset selection methods for learning from a high-dimensional database in drug design. In: Bourbakis N (ed) Proceedings of the third IEEE international symposium on bioinformatics and bioengineering. IEEE Computer Society Press, Bethesda, MD, pp 253–257
[16] Mamitsuka H, Abe N (2000) Efficient mining from large databases by query learning. In: Langley P (ed) Proceedings of the seventeenth international conference on machine learning. Morgan Kaufmann, Stanford University, Stanford, pp 575–582
[17] Miller MA (2002) Chemical database techniques in drug discovery. Nat Rev Drug Discovery 1:220–227
[18] Ng A (1998) On feature selection: learning with exponentially many irrelevant features as training examples. In: Shavlik J (ed) Proceedings of the fifteenth international conference on machine learning. Morgan Kaufmann, Madison, WI, pp 404–412
[19] Provost F, Kolluri V (1999) A survey of methods for scaling up inductive algorithms. Data Min Knowl Discov 3(2):131–169
[20] Quinlan J (1983) Learning efficient classification procedures and their applications to chess endgames. In: Michalski RS, Carbonell JG, Mitchell TM (eds) Machine learning: an artificial intelligence approach. Morgan Kaufmann, Palo Alto, CA, pp 463–482
[21] Quinlan J (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo, CA
[22] Rätsch G, Onoda T, Müller KR (2001) Soft margins for AdaBoost. Mach Learn 42(3):287–320
[23] Seung HS, Opper M, Sompolinsky H (1992) Query by committee. In: Haussler D (ed) Proceedings of the 5th international conference on computational learning theory. Morgan Kaufmann, New York, pp 287–294
[24] Xing EP, Karp RM (2001) CLIFF: clustering of high-dimensional microarray data via feature filtering using normalized cuts. Bioinformatics 17(Suppl 1):S306–S315
[25] Xing EP, Jordan MI, Karp RM (2001) Feature selection for high-dimensional genomic microarray data. In: Brodley CE, Danyluk AP (eds) Proceedings of the eighteenth international conference on machine learning. Morgan Kaufmann, Madison, WI, pp 601–608

Author information

Authors and Affiliations

Institute for Chemical Research, Kyoto University, Gokasho, Uji, 611-0011, Japan
Hiroshi Mamitsuka

Corresponding author

Correspondence to Hiroshi Mamitsuka.

Additional information

Hiroshi Mamitsuka is currently Associate Professor in the Institute for Chemical Research at Kyoto University. He received his B.S. in Biochemistry and Biophysics, M.E. in Information Engineering and Ph.D. in Information Sciences from the University of Tokyo in 1988, 1991 and 1999, respectively. He worked in NEC Research Laboratories in Japan from 1991 to 2002. His current research interests are in bioinformatics, computational molecular biology, chemical genomics, medicinal chemistry, machine learning and data mining.

Hiroshi Mamitsuka, Institute for Chemical Research, Kyoto University, Gokasho, Uji 611-0011, Japan. E-mail: mami@kuicr.kyoto-u.ac.jp

About this article

Cite this article

Mamitsuka, H. Query-learning-based iterative feature-subset selection for learning from high-dimensional data sets. Knowl Inf Syst 9, 91–108 (2006). https://doi.org/10.1007/s10115-005-0199-4

Received: 01 August 2003
Revised: 20 December 2004
Accepted: 27 January 2005
Published: 20 April 2005
Issue Date: January 2006
DOI: https://doi.org/10.1007/s10115-005-0199-4
