Exploring Population Structure with Admixture Models and Principal Component Analysis

Population structure is a commonplace feature of genetic variation data, and it has importance in numerous application areas, including evolutionary genetics, conservation genetics, and human genetics. Understanding the structure in a sample is necessary before more sophisticated analyses are undertaken. Here we provide a protocol for running principal component analysis (PCA) and admixture proportion inference, two of the most commonly used approaches for describing population structure. Along with hands-on examples using data from the CEPH-Human Genome Diversity Panel and pragmatic caveats, readers will learn to analyze and visualize population structure in their own data.


Introduction
Population structure is a commonplace feature of genetic variation data, and it has importance in numerous application areas, including evolutionary genetics, conservation genetics, and human genetics. At a broad level, population structure is the existence of differing levels of genetic relatedness among some subgroups within a sample. This may arise for a variety of reasons, but a common cause is that samples have been drawn from geographically isolated groups or different locales across a geographic continuum. Regardless of the cause, understanding the structure in a sample is necessary before more sophisticated analyses are undertaken. For example, inferring divergence times between two populations requires knowing that the two populations exist and which individuals belong to each.
Two of the most commonly used approaches to describe population structure in a sample are principal component analysis [5,16,23,25] and admixture proportion inference [19,26]. In brief, principal component analysis reduces a multi-dimensional dataset to a much smaller number of dimensions that allows for visual exploration and compact quantitative summaries. In its application to genetic data, the numerous genotypes observed per individual are reduced to a few summary coordinates. With admixture proportion inference, individuals in a sample are modeled as having a proportion of their genome derived from each of several source populations. The goal is to infer the proportions of ancestry from each source population, and these proportions can be used to produce compact visual summaries that reveal the existence of population structure in a sample.
The history and basic behaviors of both these approaches have been written about extensively, including by some of us, and so we refer readers to several previous publications to learn the basic background and interpretative nuances of these approaches and their derivatives [1,2,9,10,12,17,18,20,21,23,25-27,29]. Here, in the spirit of this volume, we provide a protocol for running these analyses and share some pragmatic caveats that do not always arise in more abstract discussions regarding these methods.

Materials
The protocol we present is based on two pieces of software: (1) the ADMIXTURE software that our team developed [2] for efficiently estimating admixture proportions under the "Pritchard-Stephens-Donnelly" model of admixture [19,26], and (2) the smartpca software developed by Nick Patterson and colleagues for carrying out PCA [25]. Both pieces of software are used widely. We also pair them with downstream tools for visualization, in particular pong [3], for visualizing the output of admixture proportion inferences, and PCAviz [31], a novel R package for plotting PCA outputs. We also use PLINK [6,24] as a tool to perform some basic manipulations of the data (see Chapter 3 for more background on PLINK).
The example data we use is derived from publicly available single-nucleotide polymorphism (SNP) genotype data from the CEPH-Human Genome Diversity Panel [4]. Specifically, we will look at Illumina 650Y genotyping array data as first described by Li et al. [15]. This sample is a global-scale sampling of human diversity with 52 populations in total, and the raw files are available from the following link: http://hagsc.org/hgdp/files.html. These data have been used in numerous subsequent publications and are an important reference set.
A few technical details are that the genotypes were filtered with a cutoff of 0.25 for the Illumina GenCall score [13] (a quality score generated by the basic genotype calling software). Further, individuals with a genotype call rate <98.5% were removed, the logic being that if a sample has many missing genotypes, it may be due to poor quality of the source DNA, and so none of the genotypes from that individual should be trusted. Beyond this, to prepare the data, we have filtered down the individuals to a set of 938 unrelated individuals. We exclude related individuals because we are not interested in population structure that is due to family relationships, and methods such as PCA and ADMIXTURE can inadvertently mistake family structure for population structure. The starting data are available as PLINK-formatted files H938.bed, H938.bim, and H938.fam, with an accompanying set of population identifiers H938.clst.txt, in the raw_input sub-directory of the companion data.
As a pragmatic side note, it is common (and recommended) when carrying out analyses of population structure to merge one's data with other datasets that contain populations which may be representative sources of admixing individuals. For example, in analyzing a dataset with African American individuals, it can be helpful to include datasets containing African and European individuals in the analysis. These datasets can be merged with your dataset using software such as plink. However, when merging several datasets, one should be aware of potential biases that can be introduced due to strand flips (i.e., one dataset reports genotypes on the "+" strand of the reference human genome, and another on the "−" strand). One precautionary step to detect strand flips is to group individuals by the dataset they derive from and then produce a scatterplot of allele frequencies for one pair of groups at a time. If strand flips are not being controlled correctly, one will observe numerous variants on the y = 1 − x line, where x is the frequency in one dataset and y is the frequency in a second dataset. (Note: This rule of thumb assumes levels of differentiation between datasets are low, as is generally the case for human datasets, but one should still keep this in mind when interpreting results.)
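As a concrete illustration of this check, the following sketch flags variants whose paired frequencies fall near the y = 1 − x line. The file names, format (SNP ID and A1 allele frequency per line), and frequency values are all made up for illustration; in a real analysis these tables could come from per-dataset runs of plink --freq.

```shell
# Toy frequency tables: one line per SNP with its A1 allele frequency.
cat > freq1.txt <<'EOF'
rs1 0.10
rs2 0.40
rs3 0.25
EOF
cat > freq2.txt <<'EOF'
rs1 0.88
rs2 0.42
rs3 0.26
EOF
# Flag SNPs lying within 0.05 of the y = 1 - x line (candidate strand flips).
awk 'NR==FNR {x[$1]=$2; next}
     ($1 in x) {d = $2 - (1 - x[$1]); if (d < 0) d = -d;
                if (d < 0.05) print $1, x[$1], $2}' freq1.txt freq2.txt
```

Here only rs1 is flagged (0.88 ≈ 1 − 0.10); a long list of such SNPs in real data would suggest uncorrected strand flips.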

Methods
In this section we walk you through an example analysis using ADMIXTURE and smartpca. We assume the raw data files are in a directory raw_input that is below our working directory and that a second directory out exists in which outputs can be placed. If following along in an R console, you should use the setwd( ) command to set the working directory correctly.

Subsetting Data
For running some simple examples below, we will first create a subset of the HGDP sample that is restricted to only European populations. The European populations in the HGDP have the labels "Adygei," "Basque," "French," "Italian," "Orcadian," "Russian," "Sardinian," and "Tuscan," so we create a list of individuals matching these labels using an awk command, and then use plink's --keep option to make a new dataset with output prefix "H938_Euro."

Next, we thin the data by linkage disequilibrium (LD). SNPs in high LD with each other contain redundant information. More worrisome is the potential for some regions of the genome to have a disproportionate influence on the results and thus distort the representation of genome-wide structure. A nice empirical example of the problem is in figure 5 of Tian et al. [30], where PC2 of the genome-wide data is shown to be reflecting the variation in a 3.8 Mb region of chromosome 8 that is known to harbor an inversion. A standard approach to address this issue is to filter out SNPs based on pairwise LD to produce a reduced set of more independent markers. Here we use plink's LD-pruning commands to produce a new dataset with output prefix H938_Euro.LDprune. The approach considers a chromosomal window of 50 SNPs at a time and, for any pair whose genotypes have an association r² value greater than 0.1, removes one SNP of the pair. The window is then shifted by 10 SNPs and the procedure is repeated. (Advanced note: For particularly sensitive results, we recommend additional rounds of SNP filtering based on observed principal component loadings and/or population differentiation statistics. For example, a robust approach is to filter out large windows around any SNP with a high PCA loading; see ref. 22.)
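These two steps can be sketched as follows. The plink flags shown (--keep, --indep-pairwise 50 10 0.1, --extract) are standard, but the file layout and the three-column format of the cluster file (family ID, individual ID, population label) are assumptions based on the description above; the plink steps are skipped if plink is not on the path.

```shell
# Build a keep-list of European individuals from the cluster file.
if [ -f raw_input/H938.clst.txt ]; then
  awk '$3=="Adygei" || $3=="Basque" || $3=="French" || $3=="Italian" ||
       $3=="Orcadian" || $3=="Russian" || $3=="Sardinian" || $3=="Tuscan" {print $1, $2}' \
      raw_input/H938.clst.txt > Euro.clst.txt
fi
if command -v plink >/dev/null; then
  # Subset to the European individuals.
  plink --bfile raw_input/H938 --keep Euro.clst.txt --make-bed --out H938_Euro
  # LD pruning: 50-SNP windows, shifting by 10 SNPs, removing one SNP of
  # any pair with r^2 > 0.1; then extract the retained SNPs.
  plink --bfile H938_Euro --indep-pairwise 50 10 0.1 --out H938_Euro
  plink --bfile H938_Euro --extract H938_Euro.prune.in \
        --make-bed --out H938_Euro.LDprune
fi
```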

An Example Run with Visualization
The ADMIXTURE software (v 1.3.0 here) comes as a pre-compiled binary executable file for either Linux or Mac operating systems. To install, simply download the package and move the executable into your standard execution path (e.g., "/usr/local/bin" on many Linux systems). Once installed, it is straightforward to run ADMIXTURE with a fixed number of source populations, commonly denoted by K; to get started, let's run ADMIXTURE with K = 6. ADMIXTURE is a maximum-likelihood-based method, so as the method runs, you will see updates to the log-likelihood as it converges on a solution for the ancestry proportions and allele frequencies that maximize the likelihood function. The algorithm stops when the difference between successive iterations is small (the "delta" value takes a small value). A final output is an estimated FST value [11] between each pair of the source populations, based on the inferred allele frequencies. These estimates reflect how differentiated the source populations are, which is important for understanding whether the population structure observed in a sample is substantial or not (values closer to 0 reflect less population differentiation).
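The run itself is a single command; a minimal sketch (skipped when the admixture binary or the pruned files are absent) is:

```shell
# Run ADMIXTURE with K = 6 on the LD-pruned European subset.
# The matching .bim and .fam files are found automatically alongside the .bed.
if command -v admixture >/dev/null && [ -f H938_Euro.LDprune.bed ]; then
  admixture H938_Euro.LDprune.bed 6 | tee H938_Euro.LDprune.6.log
fi
```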
After running, ADMIXTURE produces two major output files. The file with suffix .P contains an L × K table of the allele frequencies inferred for each of the L SNPs in each of the K populations. The file with suffix .Q contains an N × K table of inferred individual ancestry proportions from the K ancestral populations, with one row per individual.
For our example dataset with K = 6, this will be a file called H938_Euro.LDprune.6.Q. This file can be used to generate a plot showing individual ancestry (see Fig. 1); in R, this amounts to reading the Q matrix and drawing a stacked barplot. Each thin vertical line in the barplot represents one individual, and each color represents one inferred ancestral population. The length of each color in a vertical bar represents the proportion of that individual's ancestry that is derived from the inferred ancestral population corresponding to that color. The image suggests there are some genetic clusters in the data, but it is not a well-organized data display. To improve the visualization, one can use a package dedicated to plotting ancestry proportions [3,14,28]. Here we use a postprocessing tool, pong [3], which visualizes individual ancestry proportions while grouping similar individuals within clusters. You will most likely want to install pong on a local machine, as it initializes a local web server to display the results.
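A minimal sketch of that R plotting step, wrapped here in Rscript so it can be run from the same shell session (and skipped when R or the .Q file is unavailable):

```shell
if command -v Rscript >/dev/null && [ -f H938_Euro.LDprune.6.Q ]; then
  Rscript -e '
    Q <- read.table("H938_Euro.LDprune.6.Q")
    # One thin bar per individual, one color per inferred ancestral
    # population; bar heights are the ancestry proportions.
    barplot(t(as.matrix(Q)), col = rainbow(6), border = NA, space = 0,
            xlab = "Individual", ylab = "Ancestry proportion")
  '
fi
```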
Running pong requires setting up a few files: (1) an ind2pop file that maps individuals to populations; and (2) a Qfilemap file that points pong towards which ".Q" files to display. These are easy to build from the command line using the Euro.clst.txt file we built above, and an awk command to output tab-separated text to a file with the Qfilemap suffix added to whatever file prefix we are using to organize our runs. Note that when building the .Qfilemap, one needs to use tabs to separate the columns for pong to read the file correctly.
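A sketch of building both files follows. The run IDs and run-directory paths are illustrative, and the three-column filemap layout (run ID, K, path to the .Q file) reflects our reading of the pong documentation; check the version you install.

```shell
# ind2pop: one population label per individual, in the same order as the
# .fam file (here taken from column 3 of the keep-list built earlier).
if [ -f Euro.clst.txt ]; then
  awk '{print $3}' Euro.clst.txt > H938_Euro.ind2pop
fi
# Qfilemap: one tab-separated line per run -- a unique run ID, the K value,
# and the path to that run's .Q file.
for s in 1 2 3; do
  printf 'K6r%s\t6\trun%s/H938_Euro.LDprune.6.Q\n' "$s" "$s"
done > H938_Euro.Qfilemap
```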
We then run pong and open a web browser to http://localhost:4000/ to view the results. Figure 2 shows an example of what you should see. From this visualization, we can see the admixture model fits most individuals of the Adygei, Sardinian, Russian, and French Basque samples as each being derived from a single source population (represented by purple, red, green, and yellow, respectively). The French, Tuscan, and North Italian samples are generally estimated to have a majority component of ancestry from a single source population (blue), though with admixture from other sources. A first conclusion is that the population labels do not capture the complexity of the population structure (Fig. 2). There is apparent cryptic structure within some samples (e.g., Orcadian) and minimal differentiation between other samples (the North Italian and Tuscan samples, for instance). Because ADMIXTURE uses a "greedy," "hill-climbing" optimization algorithm, it is good practice to do multiple runs from different initial random starting points. We can do this by using the -s flag to specify the random seed for each ADMIXTURE run.
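The pong invocation and the multi-seed ADMIXTURE runs can be sketched as below. The -m and -i flags take the filemap and ind2pop files; the run-directory layout is an assumption of this sketch, and each step is skipped when the corresponding program is not installed.

```shell
# Launch pong on the filemap (serves results at http://localhost:4000/;
# this runs until interrupted).
if command -v pong >/dev/null; then
  pong -m H938_Euro.Qfilemap -i H938_Euro.ind2pop
fi
# Ten ADMIXTURE runs at K = 6, each from a different random seed (-s),
# each writing its output into its own sub-directory.
if command -v admixture >/dev/null && [ -f H938_Euro.LDprune.bed ]; then
  for s in $(seq 1 10); do
    mkdir -p run$s
    ( cd run$s &&
      admixture -s $s ../H938_Euro.LDprune.bed 6 > H938_Euro.LDprune.6.log )
  done
fi
```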
Pong has nice functionality for summarizing the output of multiple ADMIXTURE runs. It can collect similar solutions into "modes" and display them in ranked order of the number of runs supporting each. In the interactive version, checking the highlight multimodality checkbox whitens populations whose ancestry matrices agree with the major mode, and one can also click on a single cluster to visualize it alone. Here we set up the pong input files and show an example output.
The resulting figure (Fig. 3) shows that six out of ten runs converged to the same mode, which appears equivalent to our initial run above. We observe the appearance of structure within Sardinia in the second and third modes. The original run had the North Italian and Tuscan samples as mostly unadmixed, while all three minor modes model the two populations as highly admixed. The fourth mode (supported by just one run) inferred sub-structure within the French Basque sample. This instability in the solutions is a hint that the ADMIXTURE model with K = 6 is not a perfect fit to this data.

Considering Different Values of K
In a typical analysis, one wants to explore the sensitivity of the results to the choice of K. One approach is to run ADMIXTURE with various plausible values of K and compare the performance of results visually and using cross-validation error rates. A short piece of bash code will run ADMIXTURE for values of K from 2 to 12 and build a file with a table of cross-validation error rates per value of K. Inspecting the outputs, we first make a plot of the cross-validation error as a function of K (Fig. 4). The cross-validation error suggests a single source population can model the data adequately and that larger values of K lead to overfitting.
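A sketch of that loop follows (skipped without the admixture binary). The --cv flag performs cross-validation (5-fold by default), printing lines of the form "CV error (K=6): ...", which we collect into a two-column K-versus-error table; the log file names are assumptions of this sketch.

```shell
mkdir -p out
if command -v admixture >/dev/null && [ -f H938_Euro.LDprune.bed ]; then
  for K in $(seq 2 12); do
    admixture --cv H938_Euro.LDprune.bed $K > out/log.K$K.out
  done
fi
# Collect the "CV error (K=...): ..." lines into a "K error" table.
grep -h "CV error" out/log.K*.out 2>/dev/null |
  sed 's/CV error (K=\([0-9]*\)): /\1 /' > out/cv_errors.txt || true
```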
To inspect further, we can use the pong software to visualize the ancestry components inferred at different values of K across several runs, after setting up the relevant results and input files. Here we find that as K increases through to K = 6, the Sardinian, Basque, Adygei, and Russian samples are typically modeled as descended from unique sources, and at K = 7, 8, and 9 structure within the Sardinian, Russian, Orcadian, and Basque samples is revealed, though for each, the substructure is not very stable (Fig. 5). Values of K = 10 and above make increasingly finer-scale divisions that are difficult to interpret, and the major modes for K = 7 and up consist of only one to three runs, suggesting a very multi-modal likelihood surface and poor resolution of the population structure.
Overall, it is interesting to note that the visual inspection of the results suggests several "real" clusters in the data, supported by an alignment of the clustering with known population labels, even though the cross-validation supports a value of K ¼ 1. This highlights a long-standing known issue with admixture modeling: the selection of K is a difficult problem to automate in a way that is robust.

Some Advanced Options
Running ADMIXTURE with the -B option provides estimates of standard errors on the ancestry proportion inferences. The -l flag runs ADMIXTURE with a penalized likelihood that favors sparser solutions (i.e., ancestry proportions that are closer to zero). This is useful in settings where small, possibly erroneous ancestry proportions might be overinterpreted. By using the -P option, the population allele frequencies inferred from one dataset can be provided as input for inference of admixture proportions in a second dataset. This is useful when individuals of unknown ancestry are being analyzed against the background of a reference sample set. Please see the ADMIXTURE manual for a complete listing of options and more detail, and we encourage trying these options on test datasets such as the one provided here.

Running PCA
Comparing ADMIXTURE and PCA results often helps give insight and confirmation regarding population structure in a sample. To run PCA, a standard package that is well suited for SNP data is the smartpca package maintained by Nick Patterson and Alkes Price (at http://data.broadinstitute.org/alkesgroup/EIGENSOFT/). To run it, we first set up a basic smartpca parameter file from the command line of a bash shell. This parameter file runs smartpca in its most basic mode (i.e., no automatic outlier removal or adjustments for LD; features which you might want to explore later).
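A minimal parameter file can be written with a heredoc. The parameter keywords (genotypename, snpname, indivname, evecoutname, evaloutname, numoutlieriter) are standard smartpca options, while the file paths and the parameter-file name H938_Euro.par are the assumed layout of this protocol; setting numoutlieriter to 0 disables automatic outlier removal.

```shell
mkdir -p out
cat > H938_Euro.par <<'EOF'
genotypename:   H938_Euro.LDprune.bed
snpname:        H938_Euro.LDprune.bim
indivname:      H938_Euro.LDprune.fam
evecoutname:    out/H938_Euro.LDprune.evec
evaloutname:    out/H938_Euro.LDprune.eval
numoutlieriter: 0
EOF
```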
As a minor issue, smartpca ignores individuals in the .fam file if they are marked as missing in the phenotype column. A short awk command can produce a new .fam file that automatically includes all individuals. We then run smartpca, passing the parameter file as input.
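Both steps can be sketched as follows. The awk one-liner simply rewrites the phenotype column (field 6) to 1 for every individual; the parameter-file name H938_Euro.par is an assumed name, and smartpca is skipped when not installed.

```shell
mkdir -p out
# Rewrite the .fam so every individual has phenotype 1; smartpca would
# otherwise ignore individuals whose phenotype column is missing (-9).
if [ -f H938_Euro.LDprune.fam ]; then
  awk '{print $1, $2, $3, $4, $5, 1}' H938_Euro.LDprune.fam \
      > H938_Euro.LDprune.fam.tmp &&
  mv H938_Euro.LDprune.fam.tmp H938_Euro.LDprune.fam
fi
# Run smartpca on the parameter file set up earlier.
if command -v smartpca >/dev/null && [ -f H938_Euro.par ]; then
  smartpca -p H938_Euro.par > out/H938_Euro.LDprune.smartpca.log
fi
```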
You will find the output files in the out sub-directory as specified in the parameter file.

Plotting PCA Results with PCAviz
The PCAviz package can be found at https://github.com/NovembreLab/PCAviz. It provides a simple interface for quickly creating plots from PCA results. It encodes several of our favored best practices for plotting PCA (such as using abbreviations for point characters and plotting median positions of each labelled group). After installing the package, a few commands in R generate plots showing each individual sample's position in PCA space and the median position of each labelled group. First, one may notice several populations are separated with PC1 and PC2, with the more isolated populations being those that were most distinguished from the others by ADMIXTURE (Fig. 6). PC4 distinguishes a subset of Orcadian individuals, and PC5 distinguishes two Adygei individuals. PC6 corresponds to the cryptic structure observed within Sardinians in the ADMIXTURE analysis.
As an alternative visualization, it can be helpful to see the distribution of PC coordinates per population for each labelled group in the data (see Fig. 7). As mentioned above in the section on LD, it is useful to inspect the PC loadings to ensure that they broadly represent variation across the genome, rather than one or a small number of genomic regions [7] (see Fig. 8). SNPs that are selected in the same direction as genome-wide structure can show high loadings, but what is particularly pathological is if the only SNPs that show high loadings are all concentrated in a single region of the genome, as might occur if the PCA is explaining local genomic structure (such as an inversion) rather than population structure. The proportion of total variance explained by each PC is a useful metric for understanding structure in a sample and for evaluating how many PCs one might want to include in downstream analyses (see Fig. 9). This can be computed as λ_i / Σ_k λ_k, with the λ_i being the eigenvalues in decreasing order. The results show that the top PCs only explain a small fraction of the variance (<1.5%) and that after roughly the first six PCs the variance explained per PC becomes relatively constant; this is roughly in line with the visual inspection of the admixture results, which suggested K = 6 may be reasonable for this dataset.
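The proportion-of-variance computation is simple enough to sketch directly from the .eval output file (assumed here to contain one eigenvalue per line, in decreasing order, at the path named in the parameter file):

```shell
# Print lambda_i / sum_k(lambda_k) for each PC.
if [ -f out/H938_Euro.LDprune.eval ]; then
  awk '{v[NR] = $1; s += $1}
       END {for (i = 1; i <= NR; i++) printf "PC%d %.4f\n", i, v[i] / s}' \
      out/H938_Euro.LDprune.eval
fi
```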

Discussion
Our protocol above is relatively straightforward and presents the most basic implementation of these analyses. Each analysis software (ADMIXTURE and smartpca) and each visualization package (pong and PCAviz) contain numerous other options that may be suitable for specific analyses and we encourage the readers to spend time in the manuals of each. Nonetheless, what we have presented is a useful start and a standard pipeline that we use in our research. Two broad perspectives we find helpful to keep in mind are: (1) How the admixture model and PCA framework are related to each other indirectly as different forms of sparse factor analysis [8]; (2) How the PCA framework in particular can be considered as a form of efficient data compression. Both of these perspectives can be helpful in interpreting the outputs of the methods and for appreciating how these approaches best serve as helpful visual exploratory tools for analyzing structure in genetic data. These methods are ultimately relatively simple statistical tools being used to summarize complex realities. They are part of the toolkit for analysis, and often are extremely useful for framing specific models of population structure that can be further investigated using more detailed and explicit approaches (such as those based on coalescent or diffusion theory, Chapters 7 on MSMC and 8 on CoalHMM).