Comparative genomics using data mining tools

Journal of Biosciences

Abstract

We have analysed the genomes of representatives of the three kingdoms of life, namely archaea, eubacteria and eukaryota, using data mining tools based on compositional analyses of the protein sequences. The representatives chosen for this analysis were Methanococcus jannaschii, Haemophilus influenzae and Saccharomyces cerevisiae. We have identified the features that the three genomes share, and those in which they differ, in their protein evolution patterns. M. jannaschii has a greater number of proteins with more charged amino acids, whereas S. cerevisiae has a greater number of hydrophilic proteins. Despite these differences in intrinsic compositional characteristics between the proteins of the different genomes, we have also identified certain common characteristics. We carried out exploratory Principal Component Analysis of the multivariate data on the proteins of each organism in an effort to classify the proteins into clusters. Interestingly, we found that most of the proteins in each organism cluster closely together, but there are a few ‘outliers’. We focus on the outliers for functional investigation, which may help reveal unique features of the biology of the respective organisms.
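The workflow outlined in the abstract (compositional description of each protein, exploratory Principal Component Analysis, then attention to the outliers) can be illustrated with a minimal sketch. The abstract does not list the variables actually used, so this example assumes the 20 amino acid fractions as the multivariate description. The function names `composition`, `pca_scores` and `flag_outliers`, the median-plus-MAD outlier rule and the synthetic sequences are all hypothetical choices made for illustration; they are not the authors' method or data.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the fraction of each of the 20 amino acids in a sequence."""
    seq = seq.upper()
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def pca_scores(X, n_components=2):
    """Project the rows of X onto the leading principal components (via SVD)."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def flag_outliers(scores, threshold=3.0):
    """Assumed rule: flag points farther from the centroid than
    median + threshold * MAD of the distances."""
    dist = np.linalg.norm(scores - scores.mean(axis=0), axis=1)
    median = np.median(dist)
    mad = np.median(np.abs(dist - median))
    return dist > median + threshold * mad

# Synthetic stand-in for a proteome: 30 random sequences of "typical"
# composition plus one highly charged sequence (lysine/glutamate repeat).
rng = np.random.default_rng(0)
proteins = ["".join(rng.choice(list(AMINO_ACIDS), size=300)) for _ in range(30)]
proteins.append("KE" * 150)

X = np.array([composition(p) for p in proteins])  # shape (31, 20)
scores = pca_scores(X, n_components=2)            # shape (31, 2)
flags = flag_outliers(scores)

print("Indices flagged as outliers:", np.flatnonzero(flags))
```

On real data, `proteins` would be the predicted protein sequences of one organism, and the flagged entries would be the candidates for closer functional investigation. The threshold is a tunable assumption and should be checked against a plot of the scores.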

About this article

Cite this article

Nandi, T., B-Rao, C. & Ramachandran, S. Comparative genomics using data mining tools. J Biosci 27, 15–25 (2002). https://doi.org/10.1007/BF02703680

  • DOI: https://doi.org/10.1007/BF02703680
