Di-codon Usage for Gene Classification

  • Minh N. Nguyen
  • Jianmin Ma
  • Gary B. Fogel
  • Jagath C. Rajapakse
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5780)


Classifying genes into biologically related groups facilitates inference of their functions. Codon usage bias has previously been described as a potential feature for gene classification. In this paper, we demonstrate that di-codon usage can further improve gene classification. By using both codon and di-codon features, we achieve near-perfect accuracies for the classification of HLA molecules into major classes and sub-classes. The method is illustrated on 1,841 HLA sequences, which are first classified into two major classes, HLA-I and HLA-II, and then into sub-classes within each major class. A binary SVM using di-codon usage patterns achieved 99.95% accuracy in classifying HLA genes into the major classes, and a multi-class SVM achieved accuracies of 99.82% and 99.03% for sub-class classification of HLA-I and HLA-II genes, respectively. Furthermore, by combining codon and di-codon usage, the prediction accuracies reached 100% for major class classification, and 99.82% and 99.84% for sub-class classification of HLA-I and HLA-II genes, respectively.
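As a rough illustration of the kind of features described above, the following sketch turns a coding sequence into codon and di-codon usage vectors. It is a minimal sketch under stated assumptions, not the authors' implementation: a di-codon is taken here to mean a pair of adjacent in-frame codons, counts are normalised to relative frequencies, and all function names (`split_codons`, `usage_vector`, `codon_usage`, `dicodon_usage`, `combined_features`) are hypothetical. The paper's exact definition and normalisation may differ.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
CODONS = ["".join(p) for p in product(BASES, repeat=3)]  # 64 codons
DICODONS = [a + b for a in CODONS for b in CODONS]       # 4096 codon pairs


def split_codons(seq):
    """Split an in-frame sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def usage_vector(tokens, vocabulary):
    """Relative frequency of each vocabulary item; tokens outside it are ignored."""
    counts = Counter(tokens)
    total = sum(counts[v] for v in vocabulary)
    return [counts[v] / total if total else 0.0 for v in vocabulary]


def codon_usage(seq):
    """64-dimensional codon usage vector."""
    return usage_vector(split_codons(seq), CODONS)


def dicodon_usage(seq):
    """4096-dimensional usage vector over adjacent codon pairs (assumed definition)."""
    codons = split_codons(seq)
    pairs = [a + b for a, b in zip(codons, codons[1:])]
    return usage_vector(pairs, DICODONS)


def combined_features(seq):
    """Concatenate codon and di-codon usage into one feature vector."""
    return codon_usage(seq) + dicodon_usage(seq)


if __name__ == "__main__":
    toy = "ATGGCCATGGCC"  # toy sequence, not real HLA data
    features = combined_features(toy)
    print(len(features))
```

Vectors of this form could then be passed to any SVM implementation for binary or multi-class training; that step is omitted here because the abstract does not specify the kernel or parameters used.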


Keywords: Human Leukocyte Antigen, Codon Usage, Codon Usage Bias, Codon Usage Pattern, Human Leukocyte Antigen Gene



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Minh N. Nguyen (1)
  • Jianmin Ma (1)
  • Gary B. Fogel (2)
  • Jagath C. Rajapakse (3, 4, 5)

  1. Bioinformatics Institute, Singapore
  2. Natural Selection Inc., San Diego, USA
  3. BioInformatics Research Centre, Nanyang Technological University, Singapore
  4. Singapore-MIT Alliance, Singapore
  5. Department of Biological Engineering, Massachusetts Institute of Technology, USA
