Functional Annotation of Proteins by a Novel Method Using Weight and Feature Selection
Automatic protein function annotation means assigning functions to proteins of unknown function computationally, using data on proteins whose functions have already been determined. Demand for sequencing technologies such as next-generation sequencing (NGS), and for the genome analysis that follows, is rising, so the need for methods that predict protein function automatically is increasingly emphasized. Existing studies have mainly assigned function across similar species on the basis of sequence similarity. This paper instead predicts and assigns function automatically using InterPro (IPR), which represents the properties of protein families, that is, groups of proteins with similar function. The Gene Ontology (GO), a controlled vocabulary for describing protein function comprehensively, is also used. An analysis of the experimental data showed that it is sparse and skewed toward one side. To overcome these properties, this paper analyzes automatic protein function prediction that combines feature selection, data processing, weight assignment, and a variety of classification methods.
Keywords: Gene ontology, GO, InterPro, IPR, Functional annotation, Gene annotation, SVM, SMO, Adaboosting
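The feature-selection and weighting steps named in the abstract can be illustrated with a minimal sketch. It assumes each protein is represented as a set of InterPro identifiers and labeled 1 or 0 for a single GO term. The chi-square score, the inverse-frequency class weights, the function names, and the toy identifiers (`IPR_A` to `IPR_D`) are assumptions chosen for illustration, not the method or data of the paper.

```python
from collections import Counter


def chi_square(proteins, labels, feature):
    """Chi-square statistic of the 2x2 table: feature present/absent vs. label 1/0."""
    n = len(proteins)
    a = sum(1 for p, y in zip(proteins, labels) if feature in p and y == 1)
    b = sum(1 for p, y in zip(proteins, labels) if feature in p and y == 0)
    c = sum(1 for p, y in zip(proteins, labels) if feature not in p and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # feature or label is constant, so it carries no signal
    return n * (a * d - b * c) ** 2 / denom


def select_features(proteins, labels, k):
    """Return the k features with the highest chi-square score (ties broken by name)."""
    features = sorted(set().union(*proteins))
    ranked = sorted(features, key=lambda f: (-chi_square(proteins, labels, f), f))
    return ranked[:k]


def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * m) for cls, m in counts.items()}


# Hypothetical toy data: sparse feature sets and a skewed label distribution.
proteins = [
    {"IPR_A", "IPR_B"},
    {"IPR_A"},
    {"IPR_B", "IPR_C"},
    {"IPR_C"},
    {"IPR_B", "IPR_C"},
    {"IPR_C", "IPR_D"},
]
labels = [1, 1, 0, 0, 0, 0]  # 1 = annotated with the GO term of interest

selected = select_features(proteins, labels, k=2)
weights = balanced_class_weights(labels)
print(selected)
print(weights)
```

The selected features and class weights would then be passed to a classifier such as an SVM or a boosted ensemble; that step is omitted here because it depends on the library in use.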
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2013R1A1A2063006).