Alignment-Free Sequence Comparison Based on Next Generation Sequencing Reads: Extended Abstract

Song, Kai; Ren, Jie; Zhai, Zhiyuan; Liu, Xuemei; Deng, Minghua; Sun, Fengzhu

doi:10.1007/978-3-642-29627-7_29

Conference paper


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7262)

Abstract

Next generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, \(D_2\), \(D_2^*\), and \(D_2^S\), both theoretically and by simulations. Theoretical formulas for the power of detecting the relationship between two sequences related through a common motif model are derived. It is shown that both \(D_2^*\) and \(D_2^S\) outperform \(D_2\) for detecting the relationship between two sequences based on NGS data. We then study the effects of tuple length, read length, coverage, and sequencing error on the power of \(D_2^*\) and \(D_2^S\). Finally, variations of these statistics, \(d_2\), \(d_2^*\), and \(d_2^S\), respectively, are used first to cluster 5 mammalian species with known phylogenetic relationships and then to cluster 13 tree species whose complete genome sequences are not available, using NGS shotgun reads. The clustering results using \(d_2^S\) are consistent with biological knowledge for both the 5 mammalian and the 13 tree species. Thus, the statistic \(d_2^S\) provides a powerful alignment-free comparison tool to study the relationships among different organisms based on NGS read data without assembly.
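
As a rough illustration of how the three statistics in the abstract can be computed directly from reads, the sketch below counts k-tuples over two read sets and evaluates \(D_2\), \(D_2^*\), and \(D_2^S\). It is a minimal sketch and not the authors' implementation. It assumes the usual definitions from the alignment-free literature, with centred counts \(\tilde{X}_w = X_w - \bar{n} p_w\), an i.i.d. background model whose letter probabilities are estimated from each read set, and no reverse-complement counting or sequencing-error handling. The function names (`count_ktuples`, `letter_freqs`, `word_prob`, `d2_family`) are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def count_ktuples(reads, k):
    """Count every k-tuple over all reads; tuples with non-ACGT letters are skipped."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if all(c in ALPHABET for c in word):
                counts[word] += 1
    return counts


def letter_freqs(reads):
    """Estimate i.i.d. letter probabilities from the reads (a modelling assumption)."""
    letters = Counter(c for read in reads for c in read.upper() if c in ALPHABET)
    total = sum(letters.values())
    return {c: letters[c] / total for c in ALPHABET}


def word_prob(word, freqs):
    """Probability of a word under the i.i.d. background model."""
    p = 1.0
    for c in word:
        p *= freqs[c]
    return p


def d2_family(reads_x, reads_y, k):
    """Return D2, D2* and D2S for two read sets, under the assumptions stated above."""
    x, y = count_ktuples(reads_x, k), count_ktuples(reads_y, k)
    nx, ny = sum(x.values()), sum(y.values())      # total k-tuple occurrences
    fx, fy = letter_freqs(reads_x), letter_freqs(reads_y)
    d2 = d2_star = d2_s = 0.0
    for letters in product(ALPHABET, repeat=k):
        w = "".join(letters)
        ex, ey = nx * word_prob(w, fx), ny * word_prob(w, fy)  # expected counts
        xc, yc = x[w] - ex, y[w] - ey                          # centred counts
        d2 += x[w] * y[w]
        if ex > 0 and ey > 0:
            d2_star += xc * yc / sqrt(ex * ey)
        denom = sqrt(xc * xc + yc * yc)
        if denom > 0:
            d2_s += xc * yc / denom
    return {"D2": d2, "D2star": d2_star, "D2S": d2_s}


if __name__ == "__main__":
    sample_a = ["ACGTACGTAC", "GTACGTACGT"]
    sample_b = ["ACGTTCGTAC", "TTACGAACGT"]
    print(d2_family(sample_a, sample_b, k=3))
```

The loop enumerates all \(4^k\) words, which is fine for the small k typically used with such statistics but would need a sparse formulation for large k.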

Author information

Authors and Affiliations

School of Mathematics, Peking University, Beijing, P.R. China
Kai Song, Jie Ren & Minghua Deng
School of Mathematics, Shandong University, P.R. China
Zhiyuan Zhai
School of Physics, South China University of Technology, Guangzhou, P.R. China
Xuemei Liu
TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China
Fengzhu Sun
Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
Fengzhu Sun

Editor information

Editors and Affiliations

Computer Science Department, Tel-Aviv University, 69978, Tel-Aviv, Israel
Benny Chor

About this paper

Cite this paper

Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M., Sun, F. (2012). Alignment-Free Sequence Comparison Based on Next Generation Sequencing Reads: Extended Abstract. In: Chor, B. (ed.) Research in Computational Molecular Biology. RECOMB 2012. Lecture Notes in Computer Science, vol 7262. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29627-7_29

DOI: https://doi.org/10.1007/978-3-642-29627-7_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29626-0
Online ISBN: 978-3-642-29627-7
eBook Packages: Computer Science, Computer Science (R0)
