Abstract
Both the weighted and unweighted UniFrac distances have been used very successfully to assess whether two communities differ, but they give no information about how the communities differ. We take advantage of recent observations that the UniFrac metric is equivalent to the so-called earth mover’s distance (also known as the Kantorovich–Rubinstein metric) to develop an algorithm that not only computes the UniFrac distance in linear time and space, but also simultaneously finds which operational taxonomic units are responsible for the observed differences between samples. This allows the algorithm, called EMDUniFrac, to determine why given samples are different, not just whether they are different, with no added computational burden. EMDUniFrac can be applied to any distribution on a tree, and so is well suited to analyzing both operational taxonomic units derived from amplicon sequencing and community profiles resulting from classifying whole genome shotgun metagenomes. The EMDUniFrac source code (written in Python) is freely available at: https://github.com/dkoslicki/EMDUniFrac.
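To make the idea of a linear-time earth mover’s distance on a tree concrete, the following is a minimal sketch and not the EMDUniFrac implementation itself. It relies on the standard fact that, on a tree, the earth mover’s distance equals the sum over edges of the edge length times the absolute difference in mass lying below that edge. The function name `emd_on_tree`, the parent-array tree encoding, and the requirement that every child has a smaller index than its parent are all assumptions made for this illustration.

```python
def emd_on_tree(parent, length, p, q):
    """Earth mover's distance between distributions p and q on a rooted tree.

    Hypothetical encoding (an assumption of this sketch):
      parent[i] : index of the parent of node i, or -1 for the root
      length[i] : length of the edge from node i to its parent
      p[i], q[i]: mass placed on node i by each distribution
    Nodes must be numbered so that every child precedes its parent.
    Returns (distance, edge_flow), where edge_flow[i] is the signed
    excess of p over q in the subtree below the edge above node i.
    """
    n = len(parent)
    # Start with the pointwise difference; it accumulates up the tree.
    subtree_diff = [p[i] - q[i] for i in range(n)]
    edge_flow = {}
    distance = 0.0
    for node in range(n):  # a single pass, so linear in the number of nodes
        par = parent[node]
        if par == -1:
            continue  # the root has no edge above it
        if par <= node:
            raise ValueError("children must be numbered before their parents")
        flow = subtree_diff[node]
        if flow != 0:
            edge_flow[node] = flow  # this edge carries a nonzero difference
        distance += length[node] * abs(flow)
        subtree_diff[par] += flow  # push the unmatched mass toward the root
    return distance, edge_flow
```

For example, with `parent = [2, 2, 4, 4, -1]`, `length = [1.0, 1.0, 0.5, 2.0, 0.0]`, `p = [0.5, 0.5, 0, 0, 0]` and `q = [0, 0.5, 0, 0.5, 0]`, the function returns a distance of 1.75: half a unit of mass travels from leaf 0 to leaf 3 along a path of length 3.5. The returned `edge_flow` records which edges carry that difference, which is the kind of information the abstract describes as identifying where two samples differ.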
Cite this article
McClelland, J., Koslicki, D. EMDUniFrac: exact linear time computation of the UniFrac metric and identification of differentially abundant organisms. J. Math. Biol. 77, 935–949 (2018). https://doi.org/10.1007/s00285-018-1235-9
DOI: https://doi.org/10.1007/s00285-018-1235-9