This paper is a tutorial for undergraduate students of statistical methods and/or bioinformatics. It shows how data visualization can help in genomic sequence analysis. Students start with a fragment of the genetic text of a bacterial genome and analyze its structure. Using principal component analysis (PCA), they “discover” that the information in the genome is encoded by non-overlapping triplets. Next, they learn how to find gene positions. This exercise on PCA and K-Means clustering enables active study of basic bioinformatics notions. The Appendix contains the program listings that accompany the exercise.
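The data preparation behind such an exercise can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the listings from the Appendix. The window width of 300, the step of 12, and the function names `triplet_frequencies`, `window_table`, and `kmeans` are assumptions made for this sketch. Each sliding window of the sequence is described by the frequencies of its 64 possible non-overlapping triplets, and the resulting table is grouped with a plain Lloyd-style K-Means.

```python
import random
from itertools import product

ALPHABET = "ACGT"
# All 64 possible triplets, in a fixed order, and their column positions.
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment):
    """Frequencies of non-overlapping triplets read from the fragment start."""
    counts = [0] * len(TRIPLETS)
    for i in range(0, len(fragment) - 2, 3):
        word = fragment[i:i + 3]
        if word in INDEX:  # skip words with letters outside A, C, G, T
            counts[INDEX[word]] += 1
    total = sum(counts) or 1  # avoid division by zero for empty fragments
    return [c / total for c in counts]


def window_table(sequence, width=300, step=12):
    """One 64-dimensional frequency vector per sliding window (assumed sizes)."""
    return [triplet_frequencies(sequence[s:s + width])
            for s in range(0, len(sequence) - width + 1, step)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for n, p in enumerate(points):
            labels[n] = min(range(k),
                            key=lambda c: squared_distance(p, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids
```

Given a genome fragment as a string `sequence`, `window_table(sequence)` yields the data table, and `kmeans(table, k)` assigns each window to a cluster. Projecting the same table onto its first principal components is the visualization step the tutorial describes; it is left out here to keep the sketch free of external dependencies.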
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gorban, A.N., Zinovyev, A.Y. (2008). PCA and K-Means Decipher Genome. In: Gorban, A.N., Kégl, B., Wunsch, D.C., Zinovyev, A.Y. (eds) Principal Manifolds for Data Visualization and Dimension Reduction. Lecture Notes in Computational Science and Engineering, vol 58. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73750-6_14
DOI: https://doi.org/10.1007/978-3-540-73750-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73749-0
Online ISBN: 978-3-540-73750-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)