Resolving Multicopy Duplications de novo Using Polyploid Phasing
Although single-molecule sequencing has greatly improved our ability to assemble complex regions of the genome, long segmental duplications remain a challenging frontier in assembly. Segmental duplications are both gene rich and prone to large structural rearrangements, which makes resolving their sequences important for medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variation in multicopy, long segmental duplications by developing and applying algorithms for polyploid phasing. We develop two algorithms. The first maximizes the likelihood of observing the reads given the underlying haplotypes, using discrete matrix completion. The second is based on correlation clustering and exploits an assumption, often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., the fraction of reads that are clustered correctly. On both performance metrics, our algorithms dominate existing algorithms on more than 93% of the datasets. While discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy, owing to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm reconstructs on average 7.0 haplotypes in 10-copy duplication datasets, whereas existing algorithms reconstruct fewer than one copy on average.
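To make the clustering idea concrete, the following toy sketch groups reads by their agreement at variant sites and then scores the grouping. It is not the authors' algorithm: it uses the simple greedy pivot heuristic for correlation clustering, and the read encoding ('1' = variant allele, '0' = reference allele, '-' = site not covered), the function names, the six-site example, and the purity-style accuracy measure are all assumptions made for illustration.

```python
from collections import Counter

def agreement(a, b):
    """Matches minus mismatches over sites covered by both reads ('-' = uncovered)."""
    score = 0
    for x, y in zip(a, b):
        if x == '-' or y == '-':
            continue
        score += 1 if x == y else -1
    return score

def pivot_cluster(reads):
    """Greedy pivot heuristic: the first unassigned read becomes the pivot and
    absorbs every unassigned read that agrees with it more often than not."""
    labels = [None] * len(reads)
    unassigned = list(range(len(reads)))
    next_label = 0
    while unassigned:
        pivot = unassigned[0]
        for i in unassigned:
            if i == pivot or agreement(reads[pivot], reads[i]) > 0:
                labels[i] = next_label
        unassigned = [i for i in unassigned if labels[i] is None]
        next_label += 1
    return labels

def purity(pred, truth):
    """Fraction of reads that carry the majority true label of their cluster
    (a simple stand-in for reconstruction accuracy)."""
    groups = {}
    for p, t in zip(pred, truth):
        groups.setdefault(p, Counter())[t] += 1
    return sum(c.most_common(1)[0][1] for c in groups.values()) / len(pred)

# Hypothetical example: three paralogs, each with two paralog-specific variants.
# A = 110000, B = 001100, C = 000011; some reads miss one site.
reads = ["110000", "11000-", "001100", "-01100", "000011", "00001-"]
truth = ["A", "A", "B", "B", "C", "C"]

labels = pivot_cluster(reads)
print(labels)
print(purity(labels, truth))
```

In this toy input every pair of reads from different paralogs disagrees at more shared sites than it agrees, which is the kind of separation the paralog-specific-variant assumption provides. With fewer distinguishing variants or shorter reads, the pairwise scores become ambiguous and a heuristic this simple would be expected to merge paralogs.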
This work was supported, in part, by U.S. National Institutes of Health (NIH) grants 5R01HG002385-15 (E.E.E. and M.J.C.) and 5R01HG008164-02 (S.K. and S.M.). E.E.E. is an investigator of the Howard Hughes Medical Institute.