Prediction of Molecular Bioactivity for Drug Design Using a Decision Tree Algorithm

  • Sanghoon Lee
  • Jihoon Yang
  • Kyung-whan Oh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2843)


A machine learning-based approach to predicting the molecular bioactivity of new drugs is proposed. Two aspects of the task are considered: feature subset selection and cost-sensitive classification. These address the huge number of features and the unbalanced samples in a dataset of drug candidates. We designed a pattern classifier with these capabilities based on information theory and re-sampling techniques. Experimental results demonstrate the feasibility of the proposed approach. In particular, the classification accuracy of our approach was higher than that of the winner of the KDD Cup 2001 competition.
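
The abstract names the two ingredients (information-theoretic feature selection and re-sampling for unbalanced classes) without giving details. The following is a minimal sketch of one way such a pipeline could look, not the authors' implementation. It assumes discrete-valued features, information gain as the ranking criterion, and random oversampling of the minority class as the re-sampling step; all function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature-value group.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(rows, labels, k):
    """Indices of the k features with the highest information gain (assumed criterion)."""
    n_features = len(rows[0])
    gains = [information_gain([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of the rarest class until it matches the largest class.

    Assumption: simple random oversampling stands in for the unspecified
    re-sampling technique mentioned in the abstract.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    majority_size = max(counts.values())
    pool = [r for r, y in zip(rows, labels) if y == minority]
    extra = [rng.choice(pool) for _ in range(majority_size - counts[minority])]
    return rows + extra, labels + [minority] * len(extra)
```

In this sketch, the selected feature indices and the rebalanced sample would then be passed to a decision tree learner; ranking features individually ignores interactions between them, which is a simplification.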


Keywords: Feature Selection, Information Gain, Feature Subset, GINI Index, Feature Subset Selection



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Sanghoon Lee (1)
  • Jihoon Yang (1)
  • Kyung-whan Oh (1)
  1. Department of Computer Science, Sogang University, Seoul, Korea
