Abstract
Classifying genes into biologically related groups facilitates inference of their functions. Codon usage bias has previously been described as a potential feature for gene classification. In this paper, we demonstrate that di-codon usage can further improve gene classification. By using both codon and di-codon features, we achieve near-perfect accuracy in classifying HLA molecules into major classes and sub-classes. The method is illustrated on 1,841 HLA sequences, which are classified into two major classes, HLA-I and HLA-II, and then further into sub-classes within each major class. A binary support vector machine (SVM) using di-codon usage patterns achieved 99.95% accuracy in classifying HLA genes into the major HLA classes, and a multi-class SVM achieved accuracies of 99.82% and 99.03% for sub-class classification of HLA-I and HLA-II genes, respectively. Furthermore, by combining codon and di-codon usage, the prediction accuracies reached 100%, 99.82%, and 99.84% for HLA major class classification and for sub-class classification of HLA-I and HLA-II genes, respectively.
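The codon and di-codon usage features named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a di-codon is a pair of adjacent in-frame codons, that pairs are counted with a sliding step of one codon, and that counts are normalized to relative frequencies. All function names here are hypothetical.

```python
from itertools import product

BASES = "ACGT"
CODONS = ["".join(p) for p in product(BASES, repeat=3)]   # 64 codons
DICODONS = [a + b for a in CODONS for b in CODONS]        # 4096 codon pairs


def split_codons(seq):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def usage_vector(tokens, vocabulary):
    """Relative frequency of each vocabulary item among the tokens."""
    index = {v: i for i, v in enumerate(vocabulary)}
    counts = [0.0] * len(vocabulary)
    total = 0
    for token in tokens:
        if token in index:  # skip tokens containing ambiguous bases such as N
            counts[index[token]] += 1
            total += 1
    return [c / total for c in counts] if total else counts


def codon_usage(seq):
    """64-dimensional codon usage vector."""
    return usage_vector(split_codons(seq), CODONS)


def dicodon_usage(seq):
    """4096-dimensional di-codon usage vector (assumed: overlapping adjacent codon pairs)."""
    codons = split_codons(seq)
    pairs = [codons[i] + codons[i + 1] for i in range(len(codons) - 1)]
    return usage_vector(pairs, DICODONS)


def combined_features(seq):
    """Concatenate codon and di-codon usage into one feature vector."""
    return codon_usage(seq) + dicodon_usage(seq)


example = "ATGGCCGTCATGGCC"  # toy sequence, not an HLA gene
features = combined_features(example)
```

A vector built this way, one per gene, is the kind of input a binary or multi-class SVM could be trained on. The kernel and parameter choices are not stated in the abstract and are left out here.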
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nguyen, M.N., Ma, J., Fogel, G.B., Rajapakse, J.C. (2009). Di-codon Usage for Gene Classification. In: Kadirkamanathan, V., Sanguinetti, G., Girolami, M., Niranjan, M., Noirel, J. (eds) Pattern Recognition in Bioinformatics. PRIB 2009. Lecture Notes in Computer Science, vol 5780. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04031-3_19
DOI: https://doi.org/10.1007/978-3-642-04031-3_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04030-6
Online ISBN: 978-3-642-04031-3
eBook Packages: Computer Science, Computer Science (R0)