Resolving Multicopy Duplications de novo Using Polyploid Phasing

Chaisson, Mark J.; Mukherjee, Sudipto; Kannan, Sreeram; Eichler, Evan E.

doi:10.1007/978-3-319-56970-3_8

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

While the rise of single-molecule sequencing systems has brought an unprecedented improvement in the ability to assemble complex regions of the genome, long segmental duplications remain a challenging frontier in assembly. Segmental duplications are both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy, long segmental duplications by developing and applying algorithms for polyploid phasing. We develop two algorithms. The first maximizes the likelihood of observing the reads given the underlying haplotypes using discrete matrix completion. The second is based on correlation clustering and exploits an assumption, often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., the fraction of reads that are clustered correctly. On both performance metrics, our algorithms dominate existing algorithms on more than 93% of the datasets. While discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy because of the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets, whereas existing algorithms reconstruct fewer than one copy on average.

M.J. Chaisson and S. Mukherjee—Joint first authorship.

S. Kannan and E.E. Eichler—Joint last authorship.

Acknowledgements

This work was supported, in part, by U.S. National Institutes of Health (NIH) grants 5R01HG002385-15 (E.E.E. and M.J.C.) and 5R01HG008164-02 (S.K. and S.M.). E.E.E. is an investigator of the Howard Hughes Medical Institute.

Author information

Authors and Affiliations

Department of Genome Sciences, University of Washington, Seattle, Washington, 98195, USA
Mark J. Chaisson & Evan E. Eichler
Department of Electrical Engineering, University of Washington, Seattle, Washington, 98195, USA
Sudipto Mukherjee & Sreeram Kannan
Howard Hughes Medical Institute, University of Washington, Seattle, Washington, 98195, USA
Evan E. Eichler

Corresponding author

Correspondence to Evan E. Eichler.

Editor information

Editors and Affiliations

Indiana University Bloomington, Bloomington, Indiana, USA
S. Cenk Sahinalp

A Appendix

After each gradient step, the resultant matrix is projected onto the box. The updates for A and B are as follows:

$$ \tilde{A}^{(t+1)} \leftarrow A^{(t)} - \alpha _{A} \nabla _{A} f(A^{(t)}) $$

Then $ A_{ij}^{(t+1)} = \left\{ \begin{array}{ll} 0, & \text {if } \tilde{A}^{(t+1)}_{ij} < 0 \\ \tilde{A}^{(t+1)}_{ij}, & \text {if } 0 \le \tilde{A}^{(t+1)}_{ij} \le 1 \\ 1, & \text {if } \tilde{A}^{(t+1)}_{ij} > 1 \end{array}\right. $

$$ \tilde{B}^{(t+1)} \leftarrow B^{(t)} - \alpha _{B} \nabla _{B} f(B^{(t)}) $$

Then $ B_{ij}^{(t+1)} = \left\{ \begin{array}{ll} -1, & \text {if } \tilde{B}^{(t+1)}_{ij} < -1 \\ \tilde{B}^{(t+1)}_{ij}, & \text {if } -1 \le \tilde{B}^{(t+1)}_{ij} \le 1 \\ 1, & \text {if } \tilde{B}^{(t+1)}_{ij} > 1 \end{array}\right. $

where $f(\cdot )$ is the objective function. The projected gradient descent allows us to incorporate additional constraints on the problem as well. If we further enforce that the sum of each row of A equals 1, then we would have the projection as $A_{ij}^{(t+1)} = \max \lbrace 0, \tilde{A}_{ij}^{(t+1)} - \nu _i \rbrace $ where $\nu _i$ can be computed for each row i using the equality

$$ \sum _{j=1}^S \max \lbrace 0, \tilde{A}_{ij}^{(t+1)} - \nu _i \rbrace =1 $$
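The two projections described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names `project_box` and `project_row_simplex` are hypothetical, and the appendix does not say how $\nu _i$ is computed, so the sketch assumes the common sort-based search for the threshold that makes the clipped row sum to 1.

```python
def project_box(M, lo, hi):
    """Clip every entry of matrix M (a list of rows) to the interval [lo, hi]."""
    return [[min(hi, max(lo, v)) for v in row] for row in M]


def project_row_simplex(row):
    """Return max(0, row_j - nu) with nu chosen so that the result sums to 1.

    Assumption: nu is found by the standard sort-based method; the source
    only states the equality that nu must satisfy.
    """
    u = sorted(row, reverse=True)
    cumulative = 0.0
    nu = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        # Keep the candidate from the largest j whose entry stays positive.
        if u_j - candidate > 0:
            nu = candidate
    return [max(0.0, v - nu) for v in row]


# Box projection for A uses [0, 1]; for B it uses [-1, 1].
A_tilde = [[-0.3, 0.4, 1.7]]
print(project_box(A_tilde, 0.0, 1.0))
print(project_row_simplex([0.2, 0.2, 0.9]))
```

Applying `project_row_simplex` to each row of $\tilde{A}^{(t+1)}$ enforces both non-negativity and the row-sum constraint in one step, so the separate clipping at 1 is no longer needed for A in that variant.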

We allow a maximum of 50 iteration steps for minimizing each of A and B, and 100 iteration steps for alternating minimization. We exit the iterations if the change in norm is insignificant (below $10^{-2}$) or if the change in objective value is below a tolerance ($10^{-4}$). The learning rates must be chosen so that the gradient steps do not diverge. We use

$$ \alpha _A = C \frac{\Vert \nabla f(A^{(t)})\Vert _F^2}{\Vert \mathcal {P}_\varOmega (\nabla f (A^{(t)}) \cdot B^{(t)}) \Vert _F^2} $$

and

$$ \alpha _B = C \frac{\Vert \nabla f(B^{(t)})\Vert _F^2}{\Vert \mathcal {P}_\varOmega ( A^{(t)} \cdot \nabla f (B^{(t)}) ) \Vert _F^2} $$

where $C \in (0,1)$.
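The following sketch puts the step sizes, the box projections, and the iteration limits together. It rests on assumptions that are not stated in this appendix: the objective is taken to be a squared-error loss over observed entries, $f(A,B) = \frac{1}{2}\Vert \mathcal {P}_\varOmega (AB - R)\Vert _F^2$, where `R` is a hypothetical name for the read-by-variant matrix and `mask` marks the observed set $\varOmega$. The assignment of the norm tolerance to the inner loops and the objective tolerance to the outer loop is also a guess. All function names are illustrative.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frob_sq(X):
    """Squared Frobenius norm."""
    return sum(v * v for row in X for v in row)

def masked(X, mask):
    """P_Omega: keep observed entries, zero out the rest."""
    return [[v if m else 0.0 for v, m in zip(rx, rm)] for rx, rm in zip(X, mask)]

def residual(A, B, R, mask):
    """P_Omega(A B - R) for the ASSUMED squared-error objective."""
    AB = matmul(A, B)
    diff = [[ab - r for ab, r in zip(rab, rr)] for rab, rr in zip(AB, R)]
    return masked(diff, mask)

def objective(A, B, R, mask):
    return 0.5 * frob_sq(residual(A, B, R, mask))

def projected_step(X, G, alpha, lo, hi):
    """Gradient step followed by projection onto the box [lo, hi]."""
    return [[min(hi, max(lo, x - alpha * g)) for x, g in zip(rx, rg)]
            for rx, rg in zip(X, G)]

def step_A(A, B, R, mask, C=0.5):
    G = matmul(residual(A, B, R, mask), transpose(B))  # gradient w.r.t. A
    denom = frob_sq(masked(matmul(G, B), mask))
    if denom == 0.0:
        return A  # step size undefined; leave A unchanged
    return projected_step(A, G, C * frob_sq(G) / denom, 0.0, 1.0)

def step_B(A, B, R, mask, C=0.5):
    G = matmul(transpose(A), residual(A, B, R, mask))  # gradient w.r.t. B
    denom = frob_sq(masked(matmul(A, G), mask))
    if denom == 0.0:
        return B  # step size undefined; leave B unchanged
    return projected_step(B, G, C * frob_sq(G) / denom, -1.0, 1.0)

def change(X, Y):
    """Frobenius norm of X - Y."""
    return sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry)) ** 0.5

def alternate(A, B, R, mask, C=0.5, outer=100, inner=50,
              norm_tol=1e-2, obj_tol=1e-4):
    prev = objective(A, B, R, mask)
    for _ in range(outer):
        for _ in range(inner):  # minimize over A with B fixed
            A_new = step_A(A, B, R, mask, C)
            moved, A = change(A_new, A), A_new
            if moved < norm_tol:
                break
        for _ in range(inner):  # minimize over B with A fixed
            B_new = step_B(A, B, R, mask, C)
            moved, B = change(B_new, B), B_new
            if moved < norm_tol:
                break
        cur = objective(A, B, R, mask)
        if abs(prev - cur) < obj_tol:
            break
        prev = cur
    return A, B
```

Under the assumed squared-error loss, the ratio in $\alpha _A$ is the exact line-search step along the negative gradient, so choosing $C \in (0,1)$ takes a fraction of that step before the projection is applied.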

Cite this paper

Chaisson, M.J., Mukherjee, S., Kannan, S., Eichler, E.E. (2017). Resolving Multicopy Duplications de novo Using Polyploid Phasing. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_8

DOI: https://doi.org/10.1007/978-3-319-56970-3_8
Published: 12 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56969-7
Online ISBN: 978-3-319-56970-3
eBook Packages: Computer Science (R0)
