Journal of Molecular Evolution, Volume 67, Issue 5, pp 465–487

PCA and Clustering Reveal Alternate mtDNA Phylogeny of N and M Clades

  • G. Alexe
  • R. Vijaya Satya
  • M. Seiler
  • D. Platt
  • T. Bhanot
  • S. Hui
  • M. Tanaka
  • A. J. Levine
  • G. Bhanot


Phylogenetic trees based on mtDNA polymorphisms are often used to infer the history of recent human migrations, but there is no consensus on which method to use. Most methods make strong assumptions that may bias the choice of polymorphisms, and their computational complexity limits the analysis to a few samples or polymorphisms. For example, parsimony minimizes the number of mutations, which biases the results toward minimizing homoplasy events. Such biases may miss the global structure of the polymorphisms altogether, with the risk of identifying a “common” polymorphism as ancient without an internal check on whether it is homoplasic or appears ancient only because of sampling bias (from oversampling the population that carries the polymorphism). A signature of this problem is that different methods applied to the same data, or the same method applied to different datasets, produce different tree topologies. When the results of such analyses are combined, the consensus trees have low internal branch consensus. We determine human mtDNA phylogeny from 1737 complete sequences using a new, direct method based on principal component analysis (PCA) and unsupervised consensus ensemble clustering. PCA identifies polymorphisms representing robust variations in the data, and consensus ensemble clustering creates stable haplogroup clusters. The tree is derived from the bifurcating network obtained when the data are split into k = 2, 3, 4, …, k_max clusters, with equal sampling from each haplogroup. Our method assumes only that the data can be clustered into groups based on mutations; it is fast, is stable to sample perturbation, uses all significant polymorphisms in the data, works for arbitrary sample sizes, and avoids sample choice and haplogroup size bias. The internal branches of our tree have a 90% consensus accuracy.
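The abstract does not spell out the clustering algorithm, so the following is only a minimal sketch of the general idea of resampling-based consensus clustering, not the authors' implementation. It assumes binary polymorphism vectors, swaps in average-linkage agglomerative clustering on Hamming distance as the base clusterer, and omits the PCA step entirely. The toy data, function names, and parameters (`n_runs`, `frac`, `seed`) are all hypothetical.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two binary polymorphism vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def average_linkage(rows, k):
    """Merge the two closest clusters (average Hamming distance) until k remain.

    Returns a list of clusters, each a list of indices into rows.
    """
    clusters = [[i] for i in range(len(rows))]

    def dist(c1, c2):
        total = sum(hamming(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # combinations yields index pairs with a < b, so deleting b is safe.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def consensus_matrix(rows, k, n_runs=50, frac=0.75, seed=0):
    """Fraction of runs in which two samples co-cluster, among runs that drew both."""
    rng = random.Random(seed)
    n = len(rows)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, round(frac * n))
    for _ in range(n_runs):
        idx = sorted(rng.sample(range(n), size))  # random subsample of samples
        sub = [rows[i] for i in idx]
        for cluster in average_linkage(sub, k):
            members = [idx[c] for c in cluster]  # map back to original indices
            for i in members:
                for j in members:
                    together[i][j] += 1
        for i in idx:
            for j in idx:
                sampled[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Hypothetical toy data: sites 0-2 mark one group, sites 3-5 the other,
# and sites 6-7 carry private variation within each group.
samples = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
]
consensus = consensus_matrix(samples, k=2)
```

Pairs whose consensus value stays near 1 across resampling runs belong to a stable cluster, while values near 0 indicate samples that are consistently separated. Repeating the call for k = 2, 3, …, k_max and tracking which cluster splits at each step is one way a bifurcating network of the kind described above could be assembled.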
In conclusion, our tree recreates the standard phylogeny of the N, M, L0/L1, L2, and L3 clades, confirming the African origin of modern humans and showing that the M and N clades arose in almost coincident migrations. However, the N clade haplogroups split along an East-West geographic divide, with a “European R clade” containing the haplogroups H, V, H/V, J, T, and U and a “Eurasian N subclade” including haplogroups B, R5, F, A, N9, I, W, and X. The haplogroup pairs (N9a, N9b) and (M7a, M7b) within N and M are placed in non-nearest locations, in agreement with their expected large TMRCA from studies of their migrations into Japan. For comparison, we also construct consensus maximum likelihood, parsimony, neighbor joining, and UPGMA-based trees using the same polymorphisms and show that these methods give consistent results only for the clade tree. For recent branches, the consensus accuracy of these methods is in the range of 1–20%. From a comparison of our haplogroups to two chimpanzee sequences and one bonobo sequence, and assuming a chimp-human coalescent time of 5 million years before present, we find a human mtDNA TMRCA of 206,000 ± 14,000 years before present.
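The abstract does not state how the TMRCA is computed from the calibration. One common reading, offered here only as an assumption, is a strict molecular clock in which sequence divergence grows linearly with time, so the human TMRCA is the calibration time scaled by a ratio of divergences. The symbols d_human and d_chimp-human are hypothetical names for the within-human and chimp-human divergences.

```latex
T_{\text{human}} \approx T_{\text{chimp-human}} \cdot
  \frac{d_{\text{human}}}{d_{\text{chimp-human}}},
\qquad
\frac{d_{\text{human}}}{d_{\text{chimp-human}}}
  \approx \frac{206{,}000}{5{,}000{,}000} \approx 0.041
```

Under that assumption, the reported values imply that human mtDNA diversity is roughly 4% of the chimp-human divergence, and the quoted uncertainty of ± 14,000 years corresponds to about 7% of the estimate.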


Keywords: mtDNA phylogeny; Principal component analysis; Unsupervised consensus ensemble clustering; Clade tree; Homoplasy; Time to most recent common ancestor



G.B. and M.T. thank Dr. K. Shinoda for insight into the Eastern migrations of the N and M haplogroups and Dr. Cabrera for a critical reading of an early version of the manuscript. G.A. and G.B. acknowledge discussions with many colleagues at IBM Research, where this study was initiated in 2005, and at the Aspen Center for Physics in 2007, where it was concluded.

Supplementary material

  • 239_2008_9148_MOESM1_ESM.xls (332 KB)
  • 239_2008_9148_MOESM2_ESM.pdf (421 KB)
  • 239_2008_9148_MOESM3_ESM.xls (5.1 MB)
  • 239_2008_9148_MOESM4_ESM.xls (294 KB)
  • 239_2008_9148_MOESM5_ESM.xls (40 KB)
  • 239_2008_9148_MOESM6_ESM.xls (694 KB)
  • 239_2008_9148_MOESM7_ESM.xls (171 KB)
  • 239_2008_9148_MOESM8_ESM.xls (84 KB)
  • 239_2008_9148_MOESM9_ESM.xls (4.8 MB)
  • 239_2008_9148_MOESM10_ESM.txt (65 KB)
  • 239_2008_9148_MOESM11_ESM.xls (288 KB)



Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • G. Alexe (1, 2)
  • R. Vijaya Satya (3)
  • M. Seiler (4)
  • D. Platt (5)
  • T. Bhanot (6)
  • S. Hui (4)
  • M. Tanaka (7)
  • A. J. Levine (2)
  • G. Bhanot (2, 4, 8, 9), corresponding author

  1. The Broad Institute of MIT and Harvard, Cambridge, USA
  2. Simons Center for Systems Biology, Institute for Advanced Study, Princeton, USA
  3. School of Computer Science, University of Central Florida, Orlando, USA
  4. BioMaPS Institute, Rutgers University, Piscataway, USA
  5. IBM Thomas J. Watson Research Center, Yorktown Heights, USA
  6. Graduate Program in Microbiology & Molecular Genetics, Rutgers University, Piscataway, USA
  7. Tokyo Metropolitan Institute of Gerontology, Tokyo, Japan
  8. Department of Physics and Department of Molecular Biology & Biochemistry, Rutgers University, Piscataway, USA
  9. Cancer Institute of New Jersey, New Brunswick, USA
