Functional Annotation of Proteins by a Novel Method Using Weight and Feature Selection

  • Jaehee Jung
  • Heung Ki Lee
  • Gangman Yi
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 301)


Automatic protein function annotation means assigning functions to proteins of unknown function by computational means, using data from proteins whose functions are already known. Demand for analysis of the genomes produced by sequencing technologies such as next-generation sequencing (NGS) is rising, and with it the need for methods that predict protein function automatically. Existing methods have mainly transferred function annotations between similar species on the basis of sequence similarity. This paper instead aims to predict function automatically using InterPro (IPR), which represents the properties of protein families that group proteins with similar functions. The gene ontology (GO), a controlled vocabulary for describing protein function comprehensively, is used as well. The data used in the experiment are sparse and skewed toward one side. To overcome these properties, this paper analyzes automatic protein function prediction through feature selection, data preprocessing, weight assignment, and the application of a variety of classification methods.
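The feature selection and weighting steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a selection criterion or a weighting scheme, so the chi-square score and the inverse-frequency class weights used here are assumptions, and the IPR identifiers and labels are made-up toy data. Each protein is represented as a set of IPR terms, and each label says whether the protein carries one particular GO term.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency (assumed 'balanced' heuristic)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def chi2_score(features, labels, term):
    """Chi-square statistic of the 2x2 table: term present/absent vs. label 1/0."""
    n = len(labels)
    a = sum(1 for f, y in zip(features, labels) if term in f and y == 1)
    b = sum(1 for f, y in zip(features, labels) if term in f and y == 0)
    c = sum(1 for f, y in zip(features, labels) if term not in f and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # term or label is constant, so it carries no signal
    return n * (a * d - b * c) ** 2 / denom

def select_features(features, labels, k):
    """Return the k IPR terms with the highest chi-square score (ties by name)."""
    terms = sorted(set().union(*features))
    ranked = sorted(terms, key=lambda t: (-chi2_score(features, labels, t), t))
    return ranked[:k]

# Toy data (hypothetical): sparse IPR term sets and an imbalanced binary GO label.
proteins = [
    {"IPR_A", "IPR_B"},
    {"IPR_A"},
    {"IPR_B", "IPR_C"},
    {"IPR_C"},
    {"IPR_B", "IPR_C"},
    {"IPR_C"},
]
labels = [1, 1, 0, 0, 0, 0]

print(class_weights(labels))
print(select_features(proteins, labels, 2))
```

The selected terms and the class weights could then be passed to a classifier that accepts per-class weights, such as a support vector machine, so that the minority class is not ignored during training.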


Gene ontology, GO, InterPro, IPR, Functional annotation, Gene annotation, SVM, SMO, Adaboosting



This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2013R1A1A2063006).



Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  1. Samsung Electronics, Suwon, South Korea
  2. Gangneung-Wonju National University, Gangwon, South Korea
