A Feature Generation Algorithm for Sequences with Application to Splice-Site Prediction

  • Rezarta Islamaj
  • Lise Getoor
  • W. John Wilbur
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4213)


In this paper we present a new approach to feature selection for sequence data. We identify general feature categories and give construction algorithms for each of them. We show how they can be integrated in a system that tightly couples feature construction and feature selection. This integrated process, which we refer to as feature generation, allows us to systematically search a large space of potential features. We demonstrate the effectiveness of our approach for an important component of the gene finding problem, splice-site prediction. We show that predictive models built using our feature generation algorithm achieve a significant improvement in accuracy over existing, state-of-the-art approaches.
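The coupling of feature construction and feature selection described above can be illustrated with a minimal sketch. The abstract does not specify the feature categories or the selection criterion, so everything here is an assumption made for illustration: the only feature category is k-mer presence, the selection score is information gain, and the names `kmer_features`, `information_gain`, and `generate_features` are hypothetical. It is not the authors' algorithm.

```python
import math
from collections import Counter


def kmer_features(seq, k):
    """Construction step (assumed category): the set of k-mers present in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(present, labels):
    """Entropy reduction obtained by splitting labels on a binary feature."""
    with_f = [y for p, y in zip(present, labels) if p]
    without_f = [y for p, y in zip(present, labels) if not p]
    n = len(labels)
    remainder = (len(with_f) / n) * entropy(with_f) \
        + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - remainder


def generate_features(seqs, labels, max_k=3, keep=5):
    """Alternate construction and selection for k = 1..max_k.

    For each k, build all candidate k-mers, score each one, and keep
    only the top `keep` candidates that carry a positive gain.
    """
    selected = []
    for k in range(1, max_k + 1):
        per_seq = [kmer_features(s, k) for s in seqs]
        candidates = set().union(*per_seq)
        scored = [
            (information_gain([c in f for f in per_seq], labels), c)
            for c in candidates
        ]
        # Highest gain first; ties broken alphabetically for determinism.
        scored.sort(key=lambda t: (-t[0], t[1]))
        selected.extend(c for gain, c in scored[:keep] if gain > 0)
    return selected
```

Pruning after each construction round is what keeps the search tractable in this sketch: the candidate pool at each k is scored and reduced before longer features are built, rather than enumerating every feature up front and filtering once at the end.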


Keywords: Feature Selection; False Positive Rate; Splice Site; Information Gain; Feature Selection Method



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Rezarta Islamaj (1)
  • Lise Getoor (1)
  • W. John Wilbur (2)
  1. Computer Science Department, University of Maryland, College Park, USA
  2. National Center for Biotechnology Information, NLM, NIH, Bethesda, USA
