
Application of data mining techniques on diabetes related proteins

  • Original Article
International Journal of Diabetes in Developing Countries

Abstract

Genomic data are growing rapidly as the genomes of various forms of life are sequenced. To make sense of this volume of data and obtain meaningful information, data mining techniques such as Principal Component Analysis and Discriminant Analysis are used. Data mining is applied when the data are vast and hidden knowledge needs to be extracted in the form of useful patterns. The data set considered here is protein data pertaining to diabetes mellitus, obtained from a database. The task was to find out in which species most of the diabetes related proteins occur. Most of these proteins were found in human beings, house mice and Norway rats, all of which are mammals, and human beings have orthologs in house mice and Norway rats. Both techniques indicate that human beings differ from house mice and Norway rats, which are similar to each other in the variation of their protein attributes. The same can be inferred from statistical analysis using histograms and bivariate plots. Other data mining techniques such as regression and clustering can be used to explore this inference further.

References

  1. Ewens WJ, Grant GR. Statistical methods in bioinformatics: an introduction. New Delhi: Springer Verlag; 2004.

  2. Nagabhushana S. Datawarehousing OLAP and data mining. New Delhi: New Age International Publishers; 2006. p. 252.

  3. Mount DW. Bioinformatics: sequence and genome analysis. CBS Publishers and Distributors; 2003. p. 534.

  4. Sridhar GR, Murali G. Technology Spectrum. 2008;2:69–72.

  5. Allam AR, Ravi B, Sridhar GR. Mathematical analysis of diabetes related proteins having high sequence complexity. In: Proc. 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI); 2006. pp. 810–21.

  6. Rubin EM, Barsh GS. Biological insights through genomics: mouse to man. J Clin Invest. 1996;97:275–80.

  7. Huynen MA, Nimwegen EV. The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol. 1998;15:583–9.

  8. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–62.

  9. Zhang L, Pavlovic V, Cantor CR, et al. Human-mouse gene identification by comparative evidence integration and evolutionary analysis. Genome Res. 2003;13:1190–202.

  10. Foucan L, Vaillant J. Hypertension in the metabolic syndrome among Caribbean non diabetic subjects. Arch Mal Coeur Vaiss. 2007;100:649–53.

  11. Nandi T, B-Rao C, Ramachandran S. Comparative genomics using data mining tools. J Biosci. 2002;27 Suppl 1:15–25.

Funding

None

Conflict of interest

None

Author information

Corresponding author

Correspondence to G. R. Sridhar.

Appendix 1

The protein variates are:

  1. Variate 1 is the length (L) of the protein in number of amino acids.

  2. Variate 2 is the percent of basic amino acids in a given protein. The basic amino acids are H, K, and R. Percent basic is given by

    $$ \frac{\text{Number of basic amino acids}}{\text{Total number of amino acids}} \times 100 $$
  3. Variate 3 is the percent of acidic/amide amino acids in a given protein. The acidic/amide amino acids are D, E, N, and Q. Percent acidic/amide is given by

    $$ \frac{\text{Number of acidic/amide amino acids}}{\text{Total number of amino acids}} \times 100 $$
  4. Variate 4 is the percent of small and medium hydrophobic amino acids in a given protein. The small and medium hydrophobic amino acids are V, L, I, and M. Percent hydrophobicity is given by

    $$ \frac{\text{Number of hydrophobic amino acids}}{\text{Total number of amino acids}} \times 100 $$
  5. Variate 5 is the percent of aromatic amino acids in a given protein. The aromatic amino acids are F, Y, and W. Percent aromatic is given by

    $$ \frac{\text{Number of aromatic amino acids}}{\text{Total number of amino acids}} \times 100 $$
  6. Variate 6 is the percent of small/polar amino acids in a given protein. The small/polar amino acids are A, G, S, T, and P [32] (Attwood et al. 2004). Percent small/polar is given by

    $$ \frac{\text{Number of small/polar amino acids}}{\text{Total number of amino acids}} \times 100 $$
  7. Variate 7 is a measure of the distance of a protein sequence from a fixed reference point.

    The distance is measured according to the formula:

    $$ D_{\text{fixed}} = \sqrt{\sum_{i=1}^{20} \left( O_i - E_i \right)^2} $$

    where O_i is the observed number of amino acids of type i in the protein concerned and E_i is the expected number of amino acids of type i in the same protein. Assuming all 20 amino acids are uniformly distributed in the protein, E_i = L/20. We refer to this point as the fixed reference point, because E_i = L/20 is the same constant for every amino acid.

  8. Variate 8 is the distance of a protein sequence from a variable reference point. The distance D_var,globular has the same formula as that in variate 7, but E_i is calculated according to the formula:

    $$ E_i = f_i \times L $$

    where L is the length of the protein concerned in amino acids and f_i is the average frequency of occurrence of the ith amino acid in the set of proteins that are of high sequence complexity [11]. This is considered a variable reference point because f_i differs from one amino acid to another, and hence so does E_i.
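The eight variates defined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and variable names (`protein_variates`, `ref_freqs`, `GROUPS`) are hypothetical, the sequence is assumed to be a string of one-letter amino acid codes, and the reference frequencies f_i for variate 8 must be supplied by the caller because the appendix does not list them. Residues outside the 20 standard codes count toward the length but not toward the distances.

```python
import math
from collections import Counter

# The 20 standard one-letter amino acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Residue groups exactly as listed in the appendix (variates 2 to 6).
# Note that C (cysteine) belongs to none of these groups.
GROUPS = {
    "basic": "HKR",
    "acidic_amide": "DENQ",
    "hydrophobic": "VLIM",
    "aromatic": "FYW",
    "small_polar": "AGSTP",
}


def distance(counts, expected):
    """Square root of the summed squared (observed - expected) counts."""
    return math.sqrt(
        sum((counts.get(a, 0) - expected[a]) ** 2 for a in AMINO_ACIDS)
    )


def protein_variates(seq, ref_freqs=None):
    """Compute variates 1 to 7, and variate 8 when ref_freqs is given.

    ref_freqs is a hypothetical mapping from each amino acid to f_i,
    its average frequency in a reference protein set (an assumption:
    the source does not provide these values).
    """
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)

    variates = {"length": length}  # variate 1
    for name, residues in GROUPS.items():  # variates 2 to 6
        group_count = sum(counts[r] for r in residues)
        variates["pct_" + name] = 100.0 * group_count / length

    # Variate 7: fixed reference point, E_i = L / 20 for every residue.
    uniform = {a: length / 20 for a in AMINO_ACIDS}
    variates["d_fixed"] = distance(counts, uniform)

    # Variate 8: variable reference point, E_i = f_i * L.
    if ref_freqs is not None:
        expected = {a: ref_freqs[a] * length for a in AMINO_ACIDS}
        variates["d_var"] = distance(counts, expected)

    return variates
```

Calling `protein_variates(seq)` on a sequence string returns a dictionary keyed by variate name. Passing a uniform `ref_freqs` (0.05 for every amino acid) makes `d_var` equal to `d_fixed`, which is a quick way to check that the two distances share the same formula and differ only in E_i.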

About this article

Cite this article

Bhramaramba, R., Allam, A.R., Kumar, V.V. et al. Application of data mining techniques on diabetes related proteins. Int J Diabetes Dev Ctries 31, 22–25 (2011). https://doi.org/10.1007/s13410-010-0001-3

  • DOI: https://doi.org/10.1007/s13410-010-0001-3
