PCA and K-Means Decipher Genome

  • Conference paper

Part of the book series: Lecture Notes in Computational Science and Engineering (LNCSE, volume 58)

This paper is a tutorial for undergraduate students of statistical methods and/or bioinformatics. It shows how data visualization can help in genomic sequence analysis. Students start with a fragment of the genetic text of a bacterial genome and analyze its structure. By means of principal component analysis they “discover” that the information in the genome is encoded by non-overlapping triplets. Next, they learn how to find gene positions. This exercise on PCA and K-Means clustering enables active study of basic bioinformatics notions. The Appendix contains program listings that accompany the exercise.
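
The core idea of the exercise can be illustrated with a minimal sketch. This is not the Appendix listing: the function names, the window width and step, and the toy periodic text are all assumptions made for illustration. The sketch cuts a text into sliding windows, counts the 64 non-overlapping triplets in each window, and groups the resulting frequency vectors with a plain K-Means loop. The PCA visualization step is left out here; it would be applied to the same 64-dimensional vectors.

```python
from itertools import product

# All 64 triplets over the alphabet A, C, G, T, in a fixed order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(window):
    """Frequencies of non-overlapping triplets read from the window start."""
    counts = [0] * 64
    for i in range(0, len(window) - 2, 3):
        idx = INDEX.get(window[i:i + 3])
        if idx is not None:  # skip triplets with other letters, e.g. N
            counts[idx] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def sliding_windows(text, width, step):
    """Fragments of the given width, taken every `step` letters."""
    return [text[s:s + width] for s in range(0, len(text) - width + 1, step)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[j])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Toy "coding" text (an assumption, not real genome data): two codons repeated.
text = "ATGGCC" * 50
points = [triplet_frequencies(w) for w in sliding_windows(text, width=30, step=10)]
labels, _ = kmeans(points, k=3)
print(labels)
```

In this toy text, a window's cluster is determined by the position of its start relative to the codon grid, which is the kind of structure the exercise asks students to find. A real bacterial genome also contains non-coding regions and genes on both strands, so more clusters are to be expected there.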

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gorban, A.N., Zinovyev, A.Y. (2008). PCA and K-Means Decipher Genome. In: Gorban, A.N., Kégl, B., Wunsch, D.C., Zinovyev, A.Y. (eds) Principal Manifolds for Data Visualization and Dimension Reduction. Lecture Notes in Computational Science and Engineering, vol 58. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73750-6_14