Journal of Mathematical Biology, Volume 77, Issue 4, pp 935–949

EMDUniFrac: exact linear time computation of the UniFrac metric and identification of differentially abundant organisms

  • Jason McClelland
  • David Koslicki


Both the weighted and unweighted UniFrac distances have been successfully employed to assess whether two communities differ, but they give no information about how the communities differ. We take advantage of recent observations that the UniFrac metric is equivalent to the so-called earth mover’s distance (also known as the Kantorovich–Rubinstein metric) to develop an algorithm that not only computes the UniFrac distance in linear time and space, but also simultaneously finds which operational taxonomic units are responsible for the observed differences between samples. This allows the algorithm, called EMDUniFrac, to determine why given samples are different, not just whether they are different, with no added computational burden. EMDUniFrac can be used on any distribution on a tree, and so is particularly suited to analyzing both operational taxonomic units derived from amplicon sequencing and community profiles resulting from classifying whole genome shotgun metagenomes. The EMDUniFrac source code (written in Python) is freely available at:


Keywords

Earth mover’s distance, UniFrac, Kantorovich–Rubinstein metric, Optimization, Linear time, Linear space

Mathematics Subject Classification

92B05 05C85 
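
The idea stated in the abstract, that an earth mover’s distance on a tree can be computed in a single linear pass while also recording which branches carry the difference, can be illustrated with a minimal sketch. This is not the authors' EMDUniFrac code: the function name `tree_emd`, its arguments, and the dictionary-based tree encoding are assumptions made for illustration. The sketch assumes both distributions have the same total mass and that `postorder` lists every node with children before their parents.

```python
from typing import Dict, List, Tuple


def tree_emd(
    parent: Dict[str, str],
    branch_length: Dict[str, float],
    postorder: List[str],
    p: Dict[str, float],
    q: Dict[str, float],
) -> Tuple[float, Dict[str, float]]:
    """Earth mover's distance between p and q on a rooted tree (illustrative).

    parent[node]        -> parent of node (the root has no entry)
    branch_length[node] -> length of the edge from node up to its parent
    postorder           -> all nodes, children listed before their parents
    p, q                -> mass at each node (missing nodes count as 0)
    """
    # Net surplus of p over q sitting at each node.
    surplus = {node: p.get(node, 0.0) - q.get(node, 0.0) for node in postorder}

    distance = 0.0
    edge_contribution: Dict[str, float] = {}  # signed, keyed by child node

    for node in postorder:
        if node not in parent:  # root: nothing left to push upward
            continue
        s = surplus[node]
        if s != 0.0:
            # All unmatched mass below this edge must cross it exactly once.
            edge_contribution[node] = s * branch_length[node]
            distance += abs(s) * branch_length[node]
            surplus[parent[node]] += s

    return distance, edge_contribution


# Hypothetical five-node tree: R is the root, I is an internal node.
parent = {"A": "I", "B": "I", "I": "R", "C": "R"}
branch_length = {"A": 1.0, "B": 1.0, "I": 0.5, "C": 2.0}
postorder = ["A", "B", "I", "C", "R"]
p = {"A": 0.5, "B": 0.5}
q = {"A": 0.5, "C": 0.5}

distance, edge_contribution = tree_emd(parent, branch_length, postorder, p, q)
print(distance)           # 1.75
print(edge_contribution)  # {'B': 0.5, 'I': 0.25, 'C': -1.0}
```

Each node is visited once, so the work is linear in the number of nodes. In the toy example, half the mass moves from B to C along a path of total length 3.5, giving 1.75; the signed per-edge values show that the branches above B and I carry a surplus in the first sample, while the branch above C carries a surplus in the second.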



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Mathematics Department, Oregon State University, Corvallis, USA
