Exploring Population Structure with Admixture Models and Principal Component Analysis

Liu, Chi-Chun; Shringarpure, Suyash; Lange, Kenneth; Novembre, John

doi:10.1007/978-1-0716-0199-0_4

Part of the book series: Methods in Molecular Biology (MIMB, volume 2090)

The original version of this chapter was revised. The correction to this chapter is available at https://doi.org/10.1007/978-1-0716-0199-0_20

Abstract

Population structure is a commonplace feature of genetic variation data, and it has importance in numerous application areas, including evolutionary genetics, conservation genetics, and human genetics. Understanding the structure in a sample is necessary before more sophisticated analyses are undertaken. Here we provide a protocol for running principal component analysis (PCA) and admixture proportion inference, two of the most commonly used approaches in describing population structure. Using hands-on examples from the CEPH-Human Genome Diversity Panel together with pragmatic caveats, we show readers how to analyze and visualize population structure in their own data.

1 Introduction

Population structure is a commonplace feature of genetic variation data, and it has importance in numerous application areas, including evolutionary genetics, conservation genetics, and human genetics. At a broad level, population structure is the existence of differing levels of genetic relatedness among subgroups within a sample. This may arise for a variety of reasons, but a common cause is that samples have been drawn from geographically isolated groups or from different locales across a geographic continuum. Regardless of the cause, understanding the structure in a sample is necessary before more sophisticated analyses are undertaken. For example, inferring the divergence time between two populations requires knowing that two populations exist and which individuals belong to each.

Two of the most commonly used approaches to describe population structure in a sample are principal component analysis [5, 16, 23, 25] and admixture proportion inference [19, 26]. In brief, principal component analysis reduces a multi-dimensional dataset to a much smaller number of dimensions that allows for visual exploration and compact quantitative summaries. In its application to genetic data, the numerous genotypes observed per individual are reduced to a few summary coordinates. With admixture proportion inference, individuals in a sample are modeled as having a proportion of their genome derived from each of several source populations. The goal is to infer the proportions of ancestry from each source population, and these proportions can be used to produce compact visual summaries that reveal the existence of population structure in a sample.

The history and basic behaviors of both these approaches have been written about extensively, including by some of us, and so we refer readers to several previous publications to learn the basic background and interpretative nuances of these approaches and their derivatives [1, 2, 9, 10, 12, 17, 18, 20, 21, 23, 25, 26, 27, 29]. Here, in the spirit of this volume, we provide a protocol for running these analyses and share some pragmatic caveats that do not always arise in more abstract discussions regarding these methods.

2 Materials

The protocol we present is based on two pieces of software: (1) the ADMIXTURE software that our team developed [2] for efficiently estimating admixture proportions in the “Pritchard-Stephens-Donnelly” model of admixture [19, 26], and (2) the smartpca software developed by Nick Patterson and colleagues for carrying out PCA [25]. Both are widely used. We pair them with downstream tools for visualization: pong [3], for visualizing the output of admixture proportion inference, and PCAviz [31], a novel R package for plotting PCA outputs. We also use PLINK [6, 24] to perform some basic manipulations of the data (see Chapter 3 for more background on PLINK).

The example data we use is derived from publicly available single-nucleotide polymorphism (SNP) genotype data from the CEPH-Human Genome Diversity Panel [4]. Specifically, we will look at Illumina 650Y genotyping array data as first described by Li et al. [15]. This sample is a global-scale sampling of human diversity with 52 populations in total, and the raw files are available from the following link: http://hagsc.org/hgdp/files.html. These data have been used in numerous subsequent publications and are an important reference set.

A few technical details: the genotypes were filtered with a cutoff of 0.25 for the Illumina GenCall score [13] (a quality score generated by the basic genotype calling software). Further, individuals with a genotype call rate <98.5% were removed, the logic being that many missing genotypes may indicate poor quality of the source DNA, in which case none of the genotypes from that individual should be trusted. Beyond this, to prepare the data, we have filtered the individuals down to a set of 938 unrelated individuals. We exclude related individuals because we are not interested in population structure that is due to family relationships, and methods such as PCA and ADMIXTURE can inadvertently mistake family structure for population structure. The starting data are available as plink-formatted files H938.bed, H938.fam, and H938.bim, with an accompanying set of population identifiers H938.clst.txt, in the raw_input sub-directory of the companion data.

As a pragmatic side note, it is common (and recommended) when carrying out analyses of population structure to merge one’s data with other datasets that contain populations which may be representative sources of admixing individuals. For example, in analyzing a dataset with African American individuals, it can be helpful to include datasets containing African and European individuals in the analysis. These datasets can be merged with your dataset using software such as plink. However, when merging several datasets, one should be aware of potential biases that can be introduced due to strand flips (i.e., one dataset reports genotypes on the “+” strand of the reference human genome, and another on the “−” strand). One precautionary step to detect strand flips is to group individuals by the dataset they derive from and then produce a scatterplot of allele frequencies for one pair of groups at a time. If strand flips are not being controlled correctly, one will observe numerous variants on the y = 1 − x line, where x is the frequency in one dataset and y is the frequency in a second dataset. (Note: This rule of thumb assumes levels of differentiation are low between datasets, as is the case in human datasets in general, but one should still keep this assumption in mind when interpreting results.)
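The strand-flip check described above can be sketched numerically as well as graphically. The following Python sketch is only an illustration: the function names and the two thresholds (`tol`, `min_gap`) are hypothetical choices, and the frequencies in the example are made up. It flags variants that lie close to the y = 1 − x line while being far from the y = x line; variants with frequencies near 0.5 are ambiguous under this rule and are deliberately left unflagged.

```python
def allele_freq(genotypes):
    """Frequency of the counted allele from 0/1/2 genotype codes; None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def flag_possible_strand_flips(freqs_a, freqs_b, tol=0.05, min_gap=0.2):
    """Return SNP ids whose frequencies sit near y = 1 - x but far from y = x.

    freqs_a, freqs_b: dicts mapping SNP id -> allele frequency in each dataset.
    tol and min_gap are illustrative thresholds, not recommended values.
    """
    flagged = []
    for snp in sorted(set(freqs_a) & set(freqs_b)):
        x, y = freqs_a[snp], freqs_b[snp]
        if x is None or y is None:
            continue
        near_flip_line = abs(y - (1 - x)) <= tol
        far_from_identity = abs(y - x) >= min_gap
        if near_flip_line and far_from_identity:
            flagged.append(snp)
    return flagged


# Made-up frequencies for three SNPs in two datasets.
dataset_a = {"rs1": 0.10, "rs2": 0.45, "rs3": 0.80}
dataset_b = {"rs1": 0.12, "rs2": 0.52, "rs3": 0.21}
print(flag_possible_strand_flips(dataset_a, dataset_b))  # ['rs3']
```

A flagged variant is a candidate for inspection rather than proof of a flip, since strongly differentiated variants can also land near the y = 1 − x line.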

3 Methods

In this section we walk you through an example analysis using ADMIXTURE and smartpca. We assume the raw data files are in a directory raw_input that is below our working directory and that a second directory out exists in which outputs can be placed. If following along in an R console, you should use the setwd() command to set the working directory correctly.

3.1 Subsetting Data

For running some simple examples below, we will first create a subset of the HGDP sample that is restricted to only European populations. The European populations in the HGDP have the labels “Adygei,” “Basque,” “French,” “Italian,” “Orcadian,” “Russian,” “Sardinian,” and “Tuscan,” so we create a list of individuals matching these labels using an awk command, and then use plink's --keep option to make a new dataset with output prefix “H938_Euro.”
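To make the logic of this subsetting step concrete, here is a minimal Python sketch of the same filter (it is a stand-in for the awk one-liner, not the awk command itself). It assumes the cluster file has three whitespace-separated columns (family ID, individual ID, population label); the helper names and the example rows are hypothetical. The second helper only assembles a plink argument list using the --bfile, --keep, --make-bed, and --out options and does not run anything.

```python
EURO_LABELS = {"Adygei", "Basque", "French", "Italian",
               "Orcadian", "Russian", "Sardinian", "Tuscan"}


def select_individuals(clst_lines, labels=EURO_LABELS):
    """Keep rows whose third whitespace-separated field is a wanted label.

    Assumed columns: family ID, individual ID, population label.
    """
    kept = []
    for line in clst_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] in labels:
            kept.append(" ".join(fields[:3]))
    return kept


def plink_keep_command(bfile, keep_file, out_prefix):
    """Argument list for a plink run that writes the kept individuals to a new fileset."""
    return ["plink", "--bfile", bfile, "--keep", keep_file,
            "--make-bed", "--out", out_prefix]


# Made-up rows in the assumed three-column layout.
demo_rows = ["F1 I1 Yoruba", "F2 I2 French", "F3 I3 Sardinian"]
print(select_individuals(demo_rows))
print(" ".join(plink_keep_command("raw_input/H938", "out/Euro.clst.txt", "out/H938_Euro")))
```

Writing the selected rows to out/Euro.clst.txt and passing that file to plink would then yield the H938_Euro fileset described above.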

3.2 Filter Out SNPs to Remove Linkage Disequilibrium (LD)

SNPs in high LD with each other contain redundant information. More worrisome is the potential for some regions of the genome to have a disproportionate influence on the results and thus distort the representation of genome-wide structure. A nice empirical example of the problem is in figure 5 of Tian et al. [30], where PC2 of the genome-wide data is shown to be reflecting the variation in a 3.8 Mb region of chromosome 8 that is known to harbor an inversion. A standard approach to address this issue is to filter out SNPs based on pairwise LD to produce a reduced set of more independent markers. Here we use plink’s commands to produce a new LD-pruned dataset with output prefix H938_Euro.LDprune. The approach considers a chromosomal window of 50 SNPs at a time, and for any pair whose genotypes have an association r² value greater than 0.1, it removes a SNP from the pair. Then the window is shifted by 10 SNPs and the procedure is repeated:

(Advanced note: For particularly sensitive results, we recommend additional rounds of SNP filtering based on observed principal component loadings and/or population differentiation statistics. For example, a robust approach is to filter out large windows around any SNP with a high PCA loading, see ref. 22.)
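The windowed pruning rule described in this section (a window of 50 SNPs, an r² threshold of 0.1, and a shift of 10 SNPs) can be illustrated with a short Python sketch. It is a simplification under stated assumptions: genotypes are complete 0/1/2 codes, r² is the squared Pearson correlation of genotype codes, and the later SNP of each offending pair is the one dropped, which may differ from plink's own choice. The function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNPs carry no LD information
    return cov * cov / (var_x * var_y)


def prune_by_ld(genotypes, window=50, step=10, threshold=0.1):
    """Return indices of SNPs retained after windowed pairwise pruning.

    genotypes: list of per-SNP genotype vectors in chromosomal order.
    Simplification: the later SNP of each offending pair is dropped.
    """
    removed = set()
    start = 0
    while start < len(genotypes):
        end = min(start + window, len(genotypes))
        idx = [i for i in range(start, end) if i not in removed]
        for pos, i in enumerate(idx):
            if i in removed:
                continue  # dropped earlier in this same window
            for j in idx[pos + 1:]:
                if j not in removed and r_squared(genotypes[i], genotypes[j]) > threshold:
                    removed.add(j)
        start += step
    return [i for i in range(len(genotypes)) if i not in removed]


# Made-up data: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated with SNP 0.
snps = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2], [2, 0, 1, 0, 2, 1]]
print(prune_by_ld(snps))  # [0, 2]
```

In practice the pruning is done by plink on the full dataset; the sketch is only meant to show why redundant markers disappear while uncorrelated ones survive.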

3.3 Running ADMIXTURE

3.3.1 An Example Run with Visualization

The ADMIXTURE software (v 1.3.0 here) comes as a pre-compiled binary executable file for either Linux or Mac operating systems. To install, simply download the package and move the executable into your standard execution path (e.g. “/usr/local/bin” on many Linux systems). Once installed, it is straightforward to run ADMIXTURE with a fixed number of source populations, commonly denoted by K. For example, to get started let’s run ADMIXTURE with K=6:

ADMIXTURE is a maximum-likelihood based method, so as the method runs, you will see updates to the log-likelihood as it converges on a solution for the ancestry proportions and allele frequencies that maximize the likelihood function. The algorithm will stop when the difference between successive iterations is small (the “delta” value takes a small value). A final output is an estimated F_ST value [11] between each of the source populations, based on the inferred allele frequencies. These estimates reflect how differentiated the source populations are, which is important for understanding whether the population structure observed in a sample is substantial or not (values closer to 0 reflect less population differentiation).
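For readers who want to see what is being maximized, the following Python sketch evaluates a binomial log-likelihood of the kind used in the admixture model [2], up to an additive constant. It is an illustration under stated assumptions (biallelic SNPs, no missing genotypes, all modeled frequencies strictly between 0 and 1), not ADMIXTURE's implementation; the function name and the toy values are hypothetical.

```python
import math


def log_likelihood(genotypes, q, f):
    """Binomial log-likelihood of the admixture model, up to an additive constant.

    genotypes[i][j]: copies (0, 1, 2) of the counted allele for individual i, SNP j.
    q[i][k]: ancestry proportion of individual i from source population k.
    f[k][j]: frequency of the counted allele at SNP j in source population k.
    """
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            # Expected allele frequency for this individual at this SNP.
            p = sum(q[i][k] * f[k][j] for k in range(len(f)))
            total += g * math.log(p) + (2 - g) * math.log(1 - p)
    return total


# Toy example: two individuals, two SNPs, two source populations (made-up values).
geno = [[2, 0], [0, 2]]
freqs = [[0.9, 0.1], [0.1, 0.9]]
q_separate = [[1.0, 0.0], [0.0, 1.0]]
q_mixed = [[0.5, 0.5], [0.5, 0.5]]
print(log_likelihood(geno, q_separate, freqs))
print(log_likelihood(geno, q_mixed, freqs))
```

In this toy case, assigning each individual to its own source population gives a higher log-likelihood than a 50/50 mixture, which is the kind of comparison the optimizer makes repeatedly as it updates the ancestry proportions and allele frequencies.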

After running, ADMIXTURE produces two major output files. The file with suffix .P contains an L × K table of the allele frequencies inferred for each SNP in each population. The file with suffix .Q contains an N × K table of inferred individual ancestry proportions from the K ancestral populations, with one row per individual.
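Before plotting, it can be worth checking that a .Q table has the expected N × K shape and that each row sums to approximately 1. The Python sketch below does this for whitespace-delimited text; the function names and the tolerance are hypothetical, and the example rows are made up.

```python
def parse_q(lines, tol=1e-3):
    """Parse .Q text (one individual per line, K columns) into rows of floats."""
    rows = [[float(v) for v in line.split()] for line in lines if line.strip()]
    if not rows:
        raise ValueError("no ancestry rows found")
    k = len(rows[0])
    for n, row in enumerate(rows, start=1):
        if len(row) != k:
            raise ValueError(f"row {n} has {len(row)} columns, expected {k}")
        if abs(sum(row) - 1.0) > tol:
            raise ValueError(f"row {n} sums to {sum(row):.4f}, expected 1")
    return rows


def major_component(rows):
    """Zero-based index of the largest ancestry component for each individual."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Made-up example with N = 3 individuals and K = 2 source populations.
demo_q = ["0.900000 0.100000", "0.250000 0.750000", "0.500000 0.500000"]
q_rows = parse_q(demo_q)
print(len(q_rows), "individuals,", len(q_rows[0]), "components")
print(major_component(q_rows))  # [0, 1, 0]
```

Because `parse_q` accepts any iterable of lines, an open file handle for a real .Q file can be passed to it directly.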

For our example dataset with K=6, this will be a file called H938_Euro.LDprune.6.Q. This file can be used to generate a plot showing individual ancestry (see Fig. 1). In R, this can be done using the following commands:

Each thin vertical line in the barplot represents one individual and each color represents one inferred ancestral population. The length of each color in a vertical bar represents the proportion of that individual’s ancestry that is derived from the inferred ancestral population corresponding to that color. The above image suggests there are some genetic clusters in the data, but it’s not a well-organized data display.

To improve the visualization, one can use a package dedicated to plotting ancestry proportions [3, 14, 28]. Here we use a post-processing tool, pong [3], which visualizes individual ancestry with similarity between individuals within clusters. You will most likely want to install pong on a local machine as it initializes a local web server to display the results.

Running pong requires setting up a few files: (1) an ind2pop file that maps individuals to populations, and (2) a Qfilemap file that tells pong which “.Q” files to display. Both are easy to build from the command line using the Euro.clst.txt file we built above and an awk command that writes tab-separated text to a file with the Qfilemap suffix added to whatever file prefix we’re using to organize our runs:

Note when building the .Qfilemap one needs to use tabs to separate the columns for pong to read the file correctly.
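As an illustration of the two input files, the Python sketch below builds their contents as lists of lines. It assumes that the ind2pop file holds one population label per individual in the same order as the rows of the .Q file, and that each Qfilemap row holds a run ID, the value of K, and the path to the .Q file, separated by tabs. The helper names, the run ID scheme, and the example rows are hypothetical.

```python
def ind2pop_lines(clst_lines):
    """One population label per individual, taken from the third column."""
    return [line.split()[2] for line in clst_lines if line.strip()]


def qfilemap_lines(prefix, k_values, run_label="r1"):
    """Tab-separated rows: run ID, K, and path to the matching .Q file."""
    return [f"k{k}{run_label}\t{k}\t{prefix}.{k}.Q" for k in k_values]


# Made-up cluster rows in the assumed layout (family ID, individual ID, label).
clst = ["F2 I2 French", "F3 I3 Sardinian"]
print("\n".join(ind2pop_lines(clst)))
print("\n".join(qfilemap_lines("H938_Euro.LDprune", [6])))
```

Joining the fields with an explicit "\t" avoids the easy mistake of separating the Qfilemap columns with spaces.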

Then to run pong, we use the following command:

We open a web browser to http://localhost:4000/ to view the results. Figure 2 shows an example of what you should see. From this visualization, we can see that the admixture model fits most individuals of the Adygei, Sardinian, Russian, and French Basque samples as each being derived from a single source population (represented by purple, red, green, and yellow, respectively). The French, Tuscan, and North Italian samples are generally estimated to have a majority component of ancestry from a single source population (blue), though with admixture from other sources. A first conclusion is that the population labels do not capture the complexity of the population structure. There is apparent cryptic structure within some samples (e.g., Orcadian) and minimal differentiation between other samples (North Italian and Tuscan samples, for instance).

Because ADMIXTURE is a “greedy,” “hill-climbing” optimization algorithm it is good practice to do multiple runs from different initial random starting points. We can do this by using the -s flag to specify the random seed for each ADMIXTURE run.
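One way to organize such replicate runs is sketched below in Python. The sketch only plans the runs: it draws a distinct seed per run and pairs each with its own working directory, on the assumption that ADMIXTURE writes its output files into the current working directory, so that replicates launched in the same directory would overwrite one another. The directory names, the master seed, and the function name are hypothetical, and nothing is executed.

```python
import random


def admixture_runs(bed_file, k, n_runs=10, master_seed=2020):
    """Plan n_runs ADMIXTURE invocations, each with its own seed and directory.

    Returns (run_directory, argv) pairs; nothing is executed here.
    """
    rng = random.Random(master_seed)  # fixed master seed keeps the plan reproducible
    seeds = rng.sample(range(1, 10**6), n_runs)  # sampling guarantees distinct seeds
    return [(f"run{run}", ["admixture", "-s", str(seed), bed_file, str(k)])
            for run, seed in enumerate(seeds, start=1)]


plan = admixture_runs("../H938_Euro.LDprune.bed", 6)
for run_dir, argv in plan[:2]:
    print(run_dir, " ".join(argv))
```

Each command would be launched from inside its run directory, which is why the example passes the .bed file by a relative path one level up.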

Pong has nice functionality for summarizing the output of multiple ADMIXTURE runs. It can collect similar solutions into “modes” and display them in ranked order of the number of runs supporting each. In the interactive version, ticking the “check to highlight multimodality” checkbox whitens populations whose ancestry matrices agree with the major mode. One can also click on a single cluster to visualize it alone. Here we set up the pong input files and show an example output.

The resulting figure (Fig. 3) shows that six out of ten runs converged to the same mode, which appears equivalent to our initial run above. We observe the appearance of structure within Sardinia in the second and third modes. The original run modeled the North Italian and Tuscan samples as mostly unadmixed, while all three minor modes model the two populations as highly admixed. The fourth mode (supported by just one run) inferred substructure within the French Basque sample. This instability in the solution is a hint that the ADMIXTURE model with K = 6 is not a perfect fit to these data.

3.3.2 Considering Different Values of K

In a typical analysis, one wants to explore the sensitivity of the results to the choice of K. One approach is to run ADMIXTURE with various plausible values of K and compare the performance of results visually and using cross-validation error rates. Here is a piece of bash command-line code that will run ADMIXTURE for values of K from 2 to 12, and that will build a file with a table of cross-validation error rates per value of K.

Then let’s compile results on cross-validation error across values of K:

Now let’s inspect the outputs. First let’s make a plot of the cross-validation error as a function of K (Fig. 4):

The cross-validation error suggests a single source population can model the data adequately and larger values of K lead to over-fitting.
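The compilation of cross-validation errors into a table can be sketched in Python as follows. The sketch assumes that each ADMIXTURE log produced with cross-validation enabled contains a line of the form `CV error (K=2): 0.52311`; the function names are hypothetical and the error values in the example are made up rather than taken from this analysis.

```python
import re

CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def cv_table(log_text):
    """Map each K to its cross-validation error, parsed from ADMIXTURE log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(table):
    """K with the lowest cross-validation error."""
    return min(table, key=table.get)


# Made-up log excerpts, concatenated across runs with different K.
demo_log = """
CV error (K=1): 0.50
CV error (K=2): 0.51
CV error (K=3): 0.53
"""
table = cv_table(demo_log)
for k in sorted(table):
    print(k, table[k])
print("lowest CV error at K =", best_k(table))
```

A table like this is what gets plotted as cross-validation error against K; the minimum is a guide to K rather than a verdict.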

To inspect further, we can use the pong software to visualize the ancestry components inferred at different K across several runs. We need to set up some of the results and input files first.

Here we find that as K increases through K = 6, the Sardinian, Basque, Adygei, and Russian samples are typically modeled as descended from unique sources. At K = 7, 8, and 9, structure within the Sardinian, Russian, Orcadian, and Basque samples is revealed, though in each case the substructure is not very stable (Fig. 5). Values of K = 10 and above make increasingly fine-scale divisions that are difficult to interpret, and the major modes for K = 7 and up consist of only one to three runs, suggesting a highly multimodal likelihood surface and poor resolution of the population structure.

Overall, it is interesting to note that the visual inspection of the results suggests several “real” clusters in the data, supported by an alignment of the clustering with known population labels, even though the cross-validation supports a value of K = 1. This highlights a long-standing known issue with admixture modeling: the selection of K is a difficult problem to automate in a way that is robust.

3.3.3 Some Advanced Options

Running ADMIXTURE with the -B option provides estimates of standard errors on the ancestry proportion inferences. The -l flag runs ADMIXTURE with a penalized likelihood that favors more sparse solutions (i.e., ancestry proportions that are closer to zero). This is useful in settings where small, possibly erroneous ancestry proportions may be overinterpreted. By using the -P option, the population allele frequencies inferred from one dataset can be provided as input for inference of admixture proportions in a second dataset. This is useful when individuals of unknown ancestry are being analyzed against the background of a reference sample set. Please see the ADMIXTURE manual for a complete listing of options and more detail, and we encourage testing these options in test datasets such as the one provided here.

3.4 PCA with SMARTPCA

3.4.1 Running PCA

Comparing ADMIXTURE and PCA results often helps give insight and confirmation regarding population structure in a sample. To run PCA, a standard package that is well-suited for SNP data is the smartpca package maintained by Nick Patterson and Alkes Price (at http://data.broadinstitute.org/alkesgroup/EIGENSOFT/). To run it, we first set up a basic smartpca parameter file from the command-line of a bash shell:

This input parameter file runs smartpca in its most basic mode (i.e., no automatic outlier removal or adjustments for LD—features which you might want to explore later).
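As an illustration of what such a basic parameter file might contain, the Python sketch below assembles one as text. It relies on our reading of the EIGENSOFT documentation: the keys genotypename, snpname, indivname, evecoutname, evaloutname, snpweightoutname, numoutevec, and numoutlieriter, with numoutlieriter set to 0 to switch off outlier removal. The output file suffixes, the number of PCs, and the function name are hypothetical choices, and the exact settings used for the figures in this chapter may differ.

```python
def smartpca_parameters(bed, bim, fam, out_prefix, n_pcs=10):
    """Text of a minimal smartpca parameter file (no outlier removal)."""
    lines = [
        f"genotypename: {bed}",
        f"snpname: {bim}",
        f"indivname: {fam}",
        f"evecoutname: {out_prefix}.evec",            # hypothetical output suffix
        f"evaloutname: {out_prefix}.eval",            # hypothetical output suffix
        f"snpweightoutname: {out_prefix}.snpweight",  # hypothetical output suffix
        f"numoutevec: {n_pcs}",
        "numoutlieriter: 0",  # 0 iterations = automatic outlier removal off
    ]
    return "\n".join(lines) + "\n"


par_text = smartpca_parameters(
    "out/H938_Euro.LDprune.bed",
    "out/H938_Euro.LDprune.bim",
    "out/H938_Euro.LDprune.fam",
    "out/H938_Euro.LDprune",
)
print(par_text, end="")
```

Keeping the parameter file as generated text makes it easy to produce variants later, for example with outlier removal switched back on.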

As a minor issue, smartpca ignores individuals in the .fam file if they are marked as missing in the phenotypes column. This awk command provides a new .fam file that will automatically include all individuals.
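The effect of that awk command can be sketched in Python as follows (a stand-in for the awk one-liner, with hypothetical names and made-up rows). It relies on the standard six-column .fam layout (family ID, individual ID, father, mother, sex, phenotype) and overwrites the phenotype column with a non-missing placeholder value.

```python
def set_phenotype(fam_lines, value="1"):
    """Return .fam rows with the sixth (phenotype) column replaced by value."""
    fixed = []
    for line in fam_lines:
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, found {len(fields)}: {line!r}")
        fields[5] = value
        fixed.append(" ".join(fields))
    return fixed


# Made-up rows; -9 is plink's usual code for a missing phenotype.
demo_fam = ["F2 I2 0 0 1 -9", "F3 I3 0 0 2 -9"]
print("\n".join(set_phenotype(demo_fam)))
```

The placeholder carries no biological meaning; its only purpose is to keep every individual in the PCA.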

Now run smartpca with the following command.

You will find the output files in the out sub-directory as specified in the parameter file.

3.4.2 Plotting PCA Results with PCAviz

The PCAviz package can be found at https://github.com/NovembreLab/PCAviz. It provides a simple interface for quickly creating plots from PCA results. It encodes several of our favored best practices for plotting PCA (such as using abbreviations for point characters and plotting median positions of each labelled group). To install the package use:

The following command in R generates plots showing each individual sample’s position in the PCA space and the median position of each labelled group in PCA space:

First one may notice several populations are separated with PC1 and PC2, with the more isolated populations being those that were most distinguished from the others by ADMIXTURE (Fig. 6). PC4 distinguishes a subset of Orcadian individuals and PC5 distinguishes two Adygei individuals. PC6 corresponds to the cryptic structure observed within Sardinians in the ADMIXTURE analysis.

As an alternative visualization, it can be helpful to see the distribution of PC coordinates per population for each labeled group in the data (see Fig. 7):

As mentioned above in the section on LD, it is useful to inspect the PC loadings to ensure that they broadly represent variation across the genome, rather than one or a small number of genomic regions [7] (see Fig. 8). SNPs that are selected in the same direction as genome-wide structure can show high loadings, but what is particularly pathological is if the only SNPs that show high loadings are all concentrated in a single region of the genome, as might occur if the PCA is explaining local genomic structure (such as an inversion) rather than population structure.
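A simple numerical companion to the loadings plot is to ask where the highest-loading SNPs of a given PC sit in the genome. The Python sketch below tabulates the share of the top-loading SNPs per chromosome; the function name, the `top_n` cutoff, and the example records are hypothetical, and the (chromosome, position, loading) tuple layout is an assumption about how the loadings have been read in.

```python
from collections import Counter


def top_loading_share(loadings, top_n=100):
    """Share of the top_n highest-|loading| SNPs found on each chromosome.

    loadings: list of (chromosome, position, loading) tuples for one PC.
    """
    ranked = sorted(loadings, key=lambda rec: abs(rec[2]), reverse=True)[:top_n]
    if not ranked:
        return {}
    counts = Counter(chrom for chrom, _, _ in ranked)
    return {chrom: n / len(ranked) for chrom, n in counts.items()}


# Made-up loadings in which the three strongest SNPs all lie on chromosome 8.
demo_loadings = [("8", 100, 0.9), ("8", 200, -0.8), ("8", 300, 0.7),
                 ("1", 50, 0.1), ("2", 60, -0.05)]
print(top_loading_share(demo_loadings, top_n=3))  # {'8': 1.0}
```

A share close to 1 for a single chromosome, as in this made-up example, is the kind of pattern that would prompt a closer look at the region before interpreting the PC as population structure.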

The proportion of total variance explained by each PC is a useful metric for understanding structure in a sample and for evaluating how many PCs one might want to include in downstream analyses (see Fig. 9). This can be computed as λ_i / ∑_k λ_k, with λ_i being the eigenvalues in decreasing order, and is plotted below:

The results show that the top PCs each explain only a small fraction of the variance (<1.5%) and that after about the sixth PC the variance explained per PC becomes relatively constant, roughly in line with the visual inspection of the admixture results, which suggested that K = 6 may be reasonable for this dataset.
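The ratio λ_i / ∑_k λ_k is straightforward to compute once the eigenvalues are available as a list. The following Python sketch shows the arithmetic; the function name is hypothetical and the eigenvalues in the example are made up, not those of the HGDP analysis.

```python
def variance_explained(eigenvalues):
    """Proportion of total variance explained by each PC: lambda_i / sum_k lambda_k."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return [lam / total for lam in eigenvalues]


# Made-up eigenvalues in decreasing order.
demo_eigenvalues = [4.0, 3.0, 2.0, 1.0]
proportions = variance_explained(demo_eigenvalues)
print([round(p, 3) for p in proportions])  # [0.4, 0.3, 0.2, 0.1]
```

Note that the denominator should run over all eigenvalues, not only those of the PCs being plotted, otherwise the proportions are overstated.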

4 Discussion

Our protocol above is relatively straightforward and presents the most basic implementation of these analyses. Each analysis software (ADMIXTURE and smartpca) and each visualization package (pong and PCAviz) contain numerous other options that may be suitable for specific analyses and we encourage the readers to spend time in the manuals of each. Nonetheless, what we have presented is a useful start and a standard pipeline that we use in our research.

Two broad perspectives we find helpful to keep in mind are: (1) How the admixture model and PCA framework are related to each other indirectly as different forms of sparse factor analysis [8]; (2) How the PCA framework in particular can be considered as a form of efficient data compression. Both of these perspectives can be helpful in interpreting the outputs of the methods and for appreciating how these approaches best serve as helpful visual exploratory tools for analyzing structure in genetic data. These methods are ultimately relatively simple statistical tools being used to summarize complex realities. They are part of the toolkit for analysis, and often are extremely useful for framing specific models of population structure that can be further investigated using more detailed and explicit approaches (such as those based on coalescent or diffusion theory, Chapters 7 on MSMC and 8 on CoalHMM).

Change history

27 February 2021
Chapter 2, “Processing and Analyzing Multiple Genomes Alignments with MafFilter,” was previously published without including the Electronic Supplementary Material. This has now been included in the revised version of this book.

References

1. Alexander DH, Lange K (2011) Enhancements to the admixture algorithm for individual ancestry estimation. BMC Bioinformatics 12:246. https://doi.org/10.1186/1471-2105-12-246
2. Alexander DH, Novembre J, Lange K (2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19(9):1655–1664. https://doi.org/10.1101/gr.094052.109
3. Behr AA, Liu KZ, Liu-Fang G, Nakka P, Ramachandran S (2016) Pong: fast analysis and visualization of latent clusters in population genetic data. Bioinformatics 32(18):2817–2823. https://doi.org/10.1093/bioinformatics/btw327
4. Cann HM, de Toma C, Cazes L, Legrand MF, Morel V, Piouffre L, Bodmer J, Bodmer WF, Bonne-Tamir B, Cambon-Thomsen A, Chen Z, Chu J, Carcassi C, Contu L, Du R, Excoffier L, Ferrara GB, Friedlaender JS, Groot H, Gurwitz D, Jenkins T, Herrera RJ, Huang X, Kidd J, Kidd KK, Langaney A, Lin AA, Mehdi SQ, Parham P, Piazza A, Pistillo MP, Qian Y, Shu Q, Xu J, Zhu S, Weber JL, Greely HT, Feldman MW, Thomas G, Dausset J, Cavalli-Sforza LL (2002) A human genome diversity cell line panel. Science 296(5566):261–262. https://doi.org/10.1126/science.296.5566.261b
5. Cavalli-Sforza LL, Menozzi P, Piazza A (1994) The history and geography of human genes. Princeton University Press, Princeton. https://doi.org/10.2307/2058750
6. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation plink: rising to the challenge of larger and richer datasets. GigaScience 4(1):s13742-015-0047-8. https://doi.org/10.1186/s13742-015-0047-8
7. Duforet-Frebourg N, Luu K, Laval G, Bazin E, Blum MG (2016) Detecting genomic signatures of natural selection with principal component analysis: application to the 1000 genomes data. Mol Biol Evol 33(4):1082–1093. https://doi.org/10.1093/molbev/msv334
8. Engelhardt BE, Stephens M (2010) Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLOS Genet 6(9):1–12. https://doi.org/10.1371/journal.pgen.1001117
9. Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164(4):471–492
10. Falush D, van Dorp L, Lawson D (2016) A tutorial on how (not) to over-interpret structure/admixture bar plots. Nat Commun 9:3258. https://doi.org/10.1101/066431
11. Holsinger K, Weir B (2009) Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat Rev Genet 10:639–650
12. Hubisz MJ, Falush D, Stephens M, Pritchard JK (2009) Inferring weak population structure with the assistance of sample group information. Mol Ecol Resour 9(5):1322–1332. https://doi.org/10.1111/j.1755-0998.2009.02591.x
13. Kermani BG (2006) Artificial intelligence and global normalization methods for genotyping. U.S. Patent No. 7,035,740. Washington, DC: U.S. Patent and Trademark Office
14. Kopelman NM, Mayzel J, Jakobsson M, Rosenberg NA, Mayrose I (2015) Clumpak: a program for identifying clustering modes and packaging population structure inferences across K. Mol Ecol Resour 15(5):1179–1191. https://doi.org/10.1111/1755-0998.12387
15. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319(5866):1100–1104. https://doi.org/10.1126/science.1153717
16. Menozzi P, Piazza A, Cavalli-Sforza LL (1978) Synthetic maps of human gene frequencies in Europeans. Science 201(4358):786–792
17. McVean G (2009) A genealogical interpretation of principal components analysis. PLoS Genet 5(10):e1000686. https://doi.org/10.1371/journal.pgen.1000686
18. Novembre J (2014) Variations on a common structure: new algorithms for a valuable model. Genetics 197(3):809–811. https://doi.org/10.1534/genetics.114.166264
19. Novembre J (2016) Pritchard, Stephens, and Donnelly on population structure. Genetics 204(2):391–393. https://doi.org/10.1534/genetics.116.195164
20. Novembre J, Peter BM (2016) Recent advances in the study of fine-scale population structure in humans. Curr Opin Genet Dev 41:98–105. https://doi.org/10.1016/j.gde.2016.08.007
21. Novembre J, Stephens M (2008) Interpreting principal component analyses of spatial population genetic variation. Nat Genet 40(5):646–649. https://doi.org/10.1038/ng.139
22. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King K, Bergmann S, Nelson M, Stephens M, Bustamante C (2008) Genes mirror geography within Europe. Nature 456:274
23. Patterson NJ, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genet 2(12):2074–2093. https://doi.org/10.1371/journal.pgen.0020190
24. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–575
25. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904–909. https://doi.org/10.1038/ng1847
26. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155(2):945–959
27. Raj A, Stephens M, Pritchard JK (2014) fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197(2):573–589. https://doi.org/10.1534/genetics.114.164350
28. Rosenberg NA (2004) Distruct: a program for the graphical display of population structure. Mol Ecol Notes 4(1):137–138. https://doi.org/10.1046/j.1471-8286.2003.00566.x
29. Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, Feldman MW (2005) Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet 1(6):e70. https://doi.org/10.1371/journal.pgen.0010070
30. Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, Pulver AE, Qi L, Gregersen PK, Seldin MF (2008) Analysis and application of european genetic substructure using 300 K SNP information. PLoS Genet 4(1):e4. https://doi.org/10.1371/journal.pgen.0040004
31. Williams R, Pourreza H, Wang Y, Carbonetto P, Novembre J (2017) PCAviz: visualizing principal components analysis. http://github.com/NovembreLab/PCAviz

Author information

Authors and Affiliations

Department of Human Genetics, University of Chicago, Chicago, IL, USA
Chi-Chun Liu
23andMe Inc., Sunnyvale, CA, USA
Suyash Shringarpure
Department of Computational Medicine, University of California, Los Angeles, CA, USA
Kenneth Lange
Department of Human Genetics, University of California, Los Angeles, CA, USA
Kenneth Lange
Department of Statistics, University of California, Los Angeles, CA, USA
Kenneth Lange
Department of Human Genetics, Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
John Novembre

Corresponding author

Correspondence to John Novembre.

Editor information

Editors and Affiliations

Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany
Julien Y. Dutheil

1 Electronic supplementary material

Supplementary Material (ZIP 88.6MB)

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this protocol

Cite this protocol

Liu, CC., Shringarpure, S., Lange, K., Novembre, J. (2020). Exploring Population Structure with Admixture Models and Principal Component Analysis. In: Dutheil, J.Y. (eds) Statistical Population Genomics. Methods in Molecular Biology, vol 2090. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0199-0_4

DOI: https://doi.org/10.1007/978-1-0716-0199-0_4
Published: 24 January 2020
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-0198-3
Online ISBN: 978-1-0716-0199-0
eBook Packages: Springer Protocols
