Abstract
Next generation sequencing (NGS) technologies have generated an enormous amount of shotgun read data, and assembling the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, \(D_2\), \(D_2^*\), and \(D_2^S\), both theoretically and by simulation. We derive theoretical formulas for the power of detecting the relationship between two sequences related through a common motif model. We show that both \(D_2^*\) and \(D_2^S\) outperform \(D_2\) for detecting the relationship between two sequences based on NGS data. We then study the effects of tuple length, read length, coverage, and sequencing error on the power of \(D_2^*\) and \(D_2^S\). Finally, the corresponding variations of these statistics, \(d_2\), \(d_2^*\), and \(d_2^S\), are applied to NGS shotgun reads, first to cluster 5 mammalian species with known phylogenetic relationships and then to cluster 13 tree species whose complete genome sequences are not available. The clustering results using \(d_2^S\) are consistent with biological knowledge for both the 5 mammalian and the 13 tree species. Thus, the statistic \(d_2^S\) provides a powerful alignment-free comparison tool for studying the relationships among different organisms based on NGS read data without assembly.
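The abstract does not define the statistics, so the following Python sketch is only an illustration under stated assumptions. It takes \(D_2\) to be the inner product of the k-tuple count vectors of two read sets, and \(D_2^S\) to be the sum over words of \(\tilde{X}_w \tilde{Y}_w / \sqrt{\tilde{X}_w^2 + \tilde{Y}_w^2}\), where \(\tilde{X}_w\) is the count of word \(w\) minus its expected count. The i.i.d. background model estimated from each read set, the pooling of counts across reads, and all function names are assumptions made here; the sketch ignores reverse-complement strands and sequencing errors and is not the authors' implementation.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"

def kmer_counts(reads, k):
    """Pool k-tuple counts over all reads, skipping words with non-ACGT symbols."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if all(ch in ALPHABET for ch in word):
                counts[word] += 1
    return counts

def base_freqs(reads):
    """Estimate nucleotide frequencies from the reads (assumed i.i.d. background).

    Assumes the read set contains at least one A, C, G, or T.
    """
    counts = Counter(ch for read in reads for ch in read if ch in ALPHABET)
    total = sum(counts.values())
    return {ch: counts[ch] / total for ch in ALPHABET}

def d2(x, y):
    """D2: inner product of two k-tuple count vectors (Counter objects)."""
    return sum(c * y[w] for w, c in x.items())

def centered(counts, freqs, k):
    """Subtract the expected count n * p_w from every one of the 4^k words."""
    n = sum(counts.values())
    out = {}
    for letters in product(ALPHABET, repeat=k):
        p = 1.0
        for ch in letters:
            p *= freqs[ch]
        word = "".join(letters)
        out[word] = counts[word] - n * p
    return out

def d2s(reads_a, reads_b, k):
    """D2S under the assumptions stated in the lead-in paragraph."""
    xa = centered(kmer_counts(reads_a, k), base_freqs(reads_a), k)
    xb = centered(kmer_counts(reads_b, k), base_freqs(reads_b), k)
    total = 0.0
    for w in xa:
        denom = sqrt(xa[w] ** 2 + xb[w] ** 2)
        if denom > 0:  # skip words whose centered counts are both zero
            total += xa[w] * xb[w] / denom
    return total

reads_a = ["ACGTACGT"]
reads_b = ["ACGTTTGA"]
print(d2(kmer_counts(reads_a, 2), kmer_counts(reads_b, 2)))
print(d2s(reads_a, reads_b, 2))
```

Enumerating all \(4^k\) words keeps the sketch short but is only practical for small tuple lengths; a sparse representation would be needed for larger k.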
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M., Sun, F. (2012). Alignment-Free Sequence Comparison Based on Next Generation Sequencing Reads: Extended Abstract. In: Chor, B. (ed.) Research in Computational Molecular Biology. RECOMB 2012. Lecture Notes in Computer Science, vol 7262. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29627-7_29
DOI: https://doi.org/10.1007/978-3-642-29627-7_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29626-0
Online ISBN: 978-3-642-29627-7
eBook Packages: Computer Science, Computer Science (R0)