Background

Advances in genomics during the past decade have accelerated research in molecular ecology by significantly increasing the capacity of researchers to generate vast quantities of data at relatively low cost. These advances largely represent the development of reduced representation genomic libraries [1,2,3] that identify tens of thousands of SNPs for non-model organisms, coupled with high-throughput sequencing methods that efficiently genotype fewer SNPs for thousands of individuals [4]. However, data generation, particularly through these novel and affordable marker-discovery methods [5], has greatly outpaced analytical capabilities, and especially so with regard to evolutionary and conservation genomics.

Technological advances have also precipitated a suite of new analytical issues. The thousands of SNPs generated in a typical RADseq project may exhibit biases that impact the inferences drawn from these data [6], necessitating careful data filtering [7]. Yet the manner by which data are filtered represents a double-edged sword. While filtering is certainly necessary, the procedures involved must be carefully evaluated in the context of each study, as downstream analyses can be seriously impacted [8, 9], including the derivation of population structure [10].

For example, the analysis of multilocus codominant markers in evaluation of population structure is frequently accomplished using methods that make no a priori assumptions about underlying population structure. One of the most popular methods in this regard is the program Structure [11,12,13]. However, it necessitates that users test specific clustering values (K), and conduct post hoc evaluation of results so as to determine an optimal K [14]. This typically involves searching a complicated parameter space using heuristic algorithms for Maximum Likelihood (ML) and Bayesian (BA) methods that, in turn, provide additional complications such as a tendency to sample local optima [15].

A common mitigation strategy is to sample multiple independent replicates at each K, using different random number seeds for initialization. These results are subsequently collated and evaluated to assess confidence that global rather than local optima have indeed been sampled. Clearly, this procedure must be automated so as to alleviate the onerous task of testing multiple replicates across a range of K-values. Pipelines to do so are available for Structure, and have been deployed on high-performance computing systems via integrated parallelization (StrAuto, ParallelStructure) [16, 17]. Multiple programs have likewise been developed for handling Structure output (i.e., Clumpp, Distruct) [18, 19]; and pipelines constructed to assess the most appropriate K-values (i.e., structureHarvester, Clumpak) [20, 21].

Despite the considerable focus on Structure, few such resources have been developed for a popular alternative program (i.e., Admixture [22]). The Web of Science indexing service indicates that (as of January, 2020) Admixture has been cited 1812 times since initial publication (September, 2009). This includes 479 (26.4%) in 2019 alone. Despite its popularity, only a single pipeline incorporates the program (i.e., SNiPlay3 [23]), and it unfortunately requires a reference genome. Its applicability is thus limited for laboratories that employ non-model organisms as study species.

Options for post-processing of Admixture results are similarly limited, but some packages do exist. One positive is that Clumpak is flexible enough in its implementation to allow for the incorporation of Admixture output, as well as that of Structure. Alternatively, pong provides options for processing and visualizing Admixture outputs [24]. However, no available software currently exists to summarize variation in cross-validation (CV) values, the preferred method for selecting an optimal K-value in Admixture [25].

Here we describe a novel software package that integrates Admixture as the primary component of an analytical pipeline that also incorporates the filtering of data as part of its procedure. This, in turn, provides a high-throughput capability that not only generates input for Admixture but also evaluates the impact of filtering on population structure. AdmixPipe also automates the process of testing multiple K-values, conducts replicates at each K, and automatically formats these results as input for the Clumpak pipeline. Optional post-processing scripts are also provided as a part of the toolkit to process Clumpak output, and to visualize the variability among CV values for independent Admixture runs. Sections of the pipeline are specifically designed for use with non-model organisms, as these are the dominant study species in evolutionary and conservation genomic investigations.

Implementation

The workflow for AdmixPipe is presented in Fig. 1. The pipeline requires two input files: a population map and a standard VCF file. The population map is a tab-delimited text file with each row containing a sample name/population pair. The VCF file is filtered according to user-specified command line options that include the following: minor allele frequency (MAF) filtering, biallelic filtering, data thinning measured in basepairs (bp), and missing data filtering (for both individuals and loci). Users may also remove specific samples from their analysis by designating a file of sample names to be ignored. All filtering, and the initial conversion to Plink (PED/MAP) format [26], is handled by VCFtools [27].
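The population map format described above can be sketched as follows. The sample and population names are hypothetical, and the parser is an illustration of the expected layout rather than AdmixPipe's own code.

```python
# Minimal sketch: parse a tab-delimited population map of the form
# AdmixPipe expects (one "sample<TAB>population" pair per row).
# Sample/population names below are hypothetical examples.

def parse_popmap(lines):
    """Return {sample: population} from tab-delimited popmap lines."""
    popmap = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        sample, population = line.split("\t")
        popmap[sample] = population
    return popmap

example = ["fish_001\tPopA", "fish_002\tPopA", "fish_003\tPopB"]
print(parse_popmap(example))
```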

Fig. 1

The workflow for AdmixPipe involves two files as Input: 1) a VCF-formatted file of genotypes, and 2) a tab-delimited population map. These proceed through admixturePipeline.py which handles filtering, file conversion, and execution of Admixture according to user-specified parameters. After completion, the user can submit their output to Clumpak for analysis. The resulting files can then be visualized using distructRerun.py, and variability in cross validation (CV) values is assessed using cvSum.py

An important consideration in filtering is the mitigation of linkage disequilibrium. VCFtools can calculate linkage disequilibrium statistics; however, these do not consider population information, thereby increasing the potential for type I error [28]. Plink not only suffers from these limitations, but also requires a “window size” input that specifies the lengths of genomic regions within which statistical comparisons among loci are conducted. This is typically inappropriate for non-model organisms due to a lack of whole-genome resources. Non-overlapping contigs produced via reduced-representation methods can be short (e.g., 100 bp), making it a reasonable assumption that all SNPs within a contig are linked. Therefore, we suggest specifying a thinning interval in excess of the longest contig length to ensure that AdmixPipe samples a single SNP per contig. This method is analogous to solutions implemented in popular RADseq assembly pipelines such as Stacks and ipyrad to minimize linkage disequilibrium in datasets [29, 30].
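The one-SNP-per-contig outcome that this thinning strategy targets can be sketched as below, assuming (as described above) that each RAD locus appears as its own "chromosome" in the VCF. The function is illustrative, not AdmixPipe's implementation, which delegates thinning to VCFtools.

```python
# Sketch of per-contig SNP thinning. With a thinning interval longer
# than any contig, at most one SNP survives per contig; keeping the
# first SNP seen on each "chromosome" reproduces that effect.

def thin_one_snp_per_contig(records):
    """records: iterable of (chrom, pos) tuples in file order.
    Returns the retained (chrom, pos) tuples, one per chromosome."""
    seen = set()
    kept = []
    for chrom, pos in records:
        if chrom not in seen:  # first SNP on this contig
            seen.add(chrom)
            kept.append((chrom, pos))
    return kept
```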

Additional conversions following the filtering and initial conversion via VCFtools are required before the Plink-formatted files will be accepted by Admixture. Popular software packages for de novo assembly of RADseq data, such as pyRAD [29, 31], produce VCF files with each locus represented as an individual “chromosome.” As a consequence, these pipelines produce outputs in which the number of “chromosomes” exceeds the number present in the model organisms for which Plink was originally designed. The initial MAP file is therefore modified by appending a letter to the start of each “chromosome” number. Plink is then executed using the “--allow-extra-chr 0” option, which treats loci as unplaced contigs in the final PED/MAP files submitted to Admixture.
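The MAP-file adjustment described above amounts to a one-column rewrite, sketched below. The column layout follows the standard Plink MAP format (chromosome, SNP identifier, genetic distance, position); the prefix letter and function name are illustrative choices, not AdmixPipe's actual code.

```python
# Sketch of the "chromosome" relabeling step: prefix a letter to the
# first MAP column so Plink (run with --allow-extra-chr 0) treats each
# RAD locus as an unplaced contig rather than a numbered chromosome.

def relabel_map_line(line, prefix="c"):
    """Rewrite one Plink MAP line, e.g. '12345 ...' -> 'c12345 ...'."""
    fields = line.split()
    fields[0] = prefix + fields[0]  # letter makes the name non-numeric
    return "\t".join(fields)
```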

The main element of the pipeline executes Admixture on the filtered data. The assessment of multiple K values and multiple replicates is automated, based upon user-specified command line input. The user defines minimum and maximum K values to be tested, in addition to the number of replicates for each K. Users may also specify the number of processor cores to be utilized by Admixture, and the cross-validation number that is utilized in determining optimal K. The final outputs of the pipeline include a compressed results file and a population file that are ready for direct submission to Clumpak for processing and visualization.
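The replicate loop described above can be sketched as follows. The command-line flags (--cv, -s, -j) follow the Admixture manual; the function and output file names are illustrative assumptions, not AdmixPipe's actual naming scheme.

```python
# Hedged sketch of the core loop: one Admixture run per replicate for
# every K in a user-specified range, each with an independent random
# seed so that replicates can sample different optima.

import random
import subprocess

def admixture_cmd(bed_prefix, k, seed, threads=4, cv=20):
    """Build the Admixture command line for one replicate at one K."""
    return ["admixture", f"--cv={cv}", "-s", str(seed),
            f"-j{threads}", f"{bed_prefix}.bed", str(k)]

def run_all(bed_prefix, k_min, k_max, reps, threads=4, cv=20):
    for k in range(k_min, k_max + 1):
        for rep in range(1, reps + 1):
            seed = random.randint(1, 10**6)  # distinct seed per replicate
            log = f"{bed_prefix}_K{k}_rep{rep}.stdout"
            with open(log, "w") as fh:  # CV error is printed to stdout
                subprocess.run(admixture_cmd(bed_prefix, k, seed,
                                             threads, cv),
                               stdout=fh, check=True)
```

Capturing each run's standard output is what later makes the cross-validation errors recoverable for summary across replicates.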

The pipeline also offers two accessory scripts for processing of Clumpak output. The first (i.e., distructRerun.py) compiles the major clusters identified by Clumpak, generates Distruct input files, executes Distruct, and extracts CV-values for all major cluster runs. The second script (i.e., cvSum.py) plots the boxplots of CV-values against each K so as to summarize the distribution of CV-values for multiple Admixture runs. This permits the user to make an informed decision on the optimal K by graphing how these values vary according to independent Admixture runs.
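The idea behind the CV summary can be sketched as below: Admixture prints its cross-validation error to standard output as lines of the form "CV error (K=3): 0.52817", so summarizing variability amounts to parsing those lines from each run's log, grouping by K, and boxplotting. The helper name is hypothetical, not cvSum.py's actual code.

```python
# Sketch of CV-value collection for a cvSum.py-style summary: pull the
# per-run cross-validation error out of Admixture log text and group
# the values by K.

import re
from collections import defaultdict

CV_RE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def collect_cv(log_texts):
    """Return {K: [cv_error, ...]} parsed from Admixture log contents."""
    cv = defaultdict(list)
    for text in log_texts:
        match = CV_RE.search(text)
        if match:
            cv[int(match.group(1))].append(float(match.group(2)))
    return dict(cv)

# With matplotlib (a stated requirement of the pipeline), the per-K
# distributions could then be drawn as boxplots, e.g.:
#   data = collect_cv(logs)
#   plt.boxplot([data[k] for k in sorted(data)], labels=sorted(data))
```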

Admixture is the only component of the pipeline that is natively parallelized. Therefore, we performed benchmarking to confirm that processing steps did not significantly increase runtime relative to that expected for Admixture. Data for benchmarking were selected from a recently published paper that utilized AdmixPipe for data processing [32]. The test data contained 343 individuals and 61,910 SNPs. Four data thinning intervals (i.e., 1, 25, 50, and 100) yielded SNP datasets of variable size for performance testing. All filtering intervals were repeated with variable numbers of processor cores (i.e., 1, 2, 4, 8, and 16). Sixteen replicates of Admixture were first conducted for each K = 1–8 at each combination of thinning interval and number of processor cores, for a total of 20 executions of the pipeline. The process was then repeated for each K = 9–16, for an additional 20 runs of the pipeline. As a final test of performance, memory profiling was conducted through the python3 ‘mprof’ package at K = 16, with a thinning interval of 1. All tests were completed on a computer equipped with dual Intel Xeon E5-4627 3.30 GHz processors, 256 GB RAM, and a 64-bit Linux environment.

Results

The filtering intervals resulted in datasets containing 61,910 (interval = 1 bp), 25,851 (interval = 25 bp), 19,140 (interval = 50 bp), and 12,527 SNPs (interval = 100 bp). Runtime increased linearly with the number of SNPs analyzed, regardless of the number of processors utilized (Fig. 2a: R2 = 0.975, df = 58). For example, increasing the number of SNPs from 12,527 to 61,910 (494% increase) produced an average increase of 519% in AdmixPipe runtime (SD = 41.6%).

Fig. 2

Benchmarking results for AdmixPipe. a The percent increase in runtime for AdmixPipe exhibits a nearly 1:1 ratio with respect to the percent increase in the number of SNPs. Data are based upon pairwise comparisons (% increase) of runtime and input size for four datasets of varying size (61,910, 25,851, 19,140, and 12,527 SNPs; R2 = 0.975, degrees of freedom = 58). b Benchmarking results for a range of K values (K = 1–8; 16 replicates at each K). c Equivalent results for K = 9–16 (16 replicates at each K). Time for b and c is presented in hours on the Y-axis. The number of processor cores (CPU = 1, 2, 4, 8, and 16) was varied across runs. Four data thinning intervals (1, 25, 50, and 100) produced variable numbers of SNPs (61,910, 25,851, 19,140, and 12,527, respectively)

Little change was observed in response to increasing the number of processor cores for K = 1–8 (Fig. 2b). A slight decrease in performance was observed in some cases, particularly for the largest dataset. This trend changed at higher K-values, with substantial gains observed for K = 9–16 (Fig. 2c) when processors were increased from 1 to 4. The most dramatic performance increase was observed for the 61,910 SNP dataset, where a 24.3-h (34.5%) reduction in computation time occurred when processors were increased from 1 to 4. However, only marginal additional improvements occurred when processors were instead increased from 1 to 8 (24.5 h; 34.7%) or from 1 to 16 (26.2 h; 37.7%).

Profiling also revealed efficient and consistent memory usage by AdmixPipe. The greatest memory spike occurred during the initial filtering steps, when peak memory usage reached approximately 120 MB. All subsequent usage held constant at ~60 MB as Admixture runs progressed.

Discussion

The performance of AdmixPipe improved with the number of processor cores utilized at higher K-values. However, it did not scale at the rate suggested in the original Admixture publication. We were unable to attribute the difference in performance to any inherent property of our pipeline. The filtering and file conversion steps at the start of AdmixPipe are not parallelized, but reported completion times for these steps were approximately constant across runs, with a maximum of 8 seconds. This indicates that Admixture itself is the main driver of performance, as it comprises the vast majority of system calls made by AdmixPipe.

The original performance increase documented for Admixture was 392% at K = 3, utilizing four processor cores [25]. Unfortunately, we could not replicate this result with our benchmarking data [32], or with the original test data (i.e., 324 samples; 13,928 SNPs) [25], which parallels our own in size. When we attempted to replicate the original benchmark scores, we found that performance also failed to scale as the number of processor cores increased (1-core \( \overline{x} \) = 40.63 s, σ = 0.90; 4-core \( \overline{x} \) = 47.46 s, σ = 4.71). However, we verified that performance did increase with up to four processor cores at higher K values (K ≥ 9). We therefore view this as ‘expected behavior’ for Admixture, and find no reason to believe that AdmixPipe has negatively impacted the performance of any individual program.

Results of AdmixPipe were similar to those estimated by Structure for the test dataset, as evaluated in an earlier publication [32], and gauged for the optimum K = 8. This is not surprising, given that Admixture implements the same likelihood model as does Structure [22]. However, minor differences have previously been noted for both programs in the assignment probabilities [32, 33].

Memory usage was efficient and constant, with the greatest increase occurring when Plink was executed. Thus, users will be able to execute AdmixPipe on their desktop machines for datasets sized similarly to those evaluated herein. Performance gains were minimal with > 4 processors, and this (again) reduces the necessity for supercomputer access, since desktop computers with ≥ 4 processor cores are now commonplace. However, given the built-in parallelization capabilities of Admixture, its application on dedicated high-performance computing clusters will be beneficial when runtime is a concern, such as when evaluating K > 8 or datasets ≥ 20,000 SNPs.

Finally, our integration of common SNP filtering options provides the flexibility to quickly filter data and assess the manner by which various filtering decisions impact results. A byproduct of the filtering process is the production of a Structure-formatted file that will facilitate comparisons with other popular algorithms that assess population structure. These options are important tools, particularly given recent documentation regarding the impacts of filtering on downstream analyses. We thus suggest that users implement existing recommendations on filtering RAD data, and use these to investigate subsequent impacts on their own data [7,8,9,10].

Conclusions

Benchmarking has demonstrated that the benefits of AdmixPipe (e.g., low memory usage and performance scaling with low numbers of processor cores at high K-values) will prove useful for researchers with limited access to advanced computing resources. AdmixPipe also couples data filtering directly with the inference of population structure, allowing the effects of common filtering options on the results to be assessed. Integration with Clumpak, together with our custom options for plotting data, including variability in CV-values and customization of population-assignment plots, will facilitate the selection of appropriate K-values and allow variability to be assessed across runs. These benefits will allow researchers to implement recommendations regarding the assignment of population structure in their studies, and to accurately report the variability found in their results [34]. In conclusion, AdmixPipe is a new tool that fills a contemporary gap in pipelines that assess population structure. We anticipate that AdmixPipe, and its subsequent improvements, will greatly facilitate the analysis of SNP data in non-model organisms.

Availability and requirements

Project name: AdmixPipe: A Method for Parsing and Filtering VCF Files for Admixture Analysis.

Project home page: https://github.com/stevemussmann/admixturePipeline

Operating system(s): Linux, Mac OSX.

Programming language: Python.

Other requirements: Python 2.7+ or Python 3.5+; the Python argparse and matplotlib libraries; additional software dependencies (Admixture v1.3, Distruct v1.1, Plink 1.9 beta 4.5 or higher, and VCFtools v0.1.16).

License: GNU General Public License v3.0.

Any restrictions to use by non-academics: None.