Background

Plasmodium vivax is the most widely distributed human malaria parasite outside sub-Saharan Africa, accounting for approximately 7.4 million clinical cases per year [1]. Despite previously being categorized as a benign infection, studies have revealed that P. vivax can cause severe and life-threatening malaria and some cases may be drug resistant [2,3,4,5]. Features of P. vivax biology such as relapse, low-density infections and the appearance of transmission forms (gametocytes) prior to detectable clinical symptoms [6, 7], are a challenge for controlling and eliminating this disease. These characteristics, in addition to the high proportion of asymptomatic P. vivax infections in combination with increasing human movement also pose a significant challenge to malaria elimination [8, 9]. Establishing a strong malaria surveillance system is essential to monitor the changing malaria landscape and to achieve elimination goals.

Currently endemic countries rely on traditional malaria surveillance approaches such as Light Microscopy (LM) and rapid diagnostic test (RDTs) [10, 11], or more recently, through molecular diagnosis (PCR) of infection [12] to measure infection prevalence and provide an estimate of transmission intensity. Population genetics however can measure parasite genetic diversity, population structure, gene flow and relatedness of genotypes to help define transmission “zones”, local transmission dynamics [13, 14], to identify the source(s) of outbreaks [15], distinguish between local or imported cases [16, 17], and track imported infections [18,19,20]. In combination with epidemiological data this can help to guide control strategies and to monitor the effect of interventions facilitate malaria elimination [21, 22].

Capturing accurate population genetic signatures and tailoring molecular tools to the local malaria transmission scenario depends on the appropriate selection and use of informative molecular markers [23]. For over a decade, malaria population genetists have used panels of 6-12 microsatellites for assessment of malaria parasite transmission dynamics, its origins and dispersal [24,25,26]. Microsatellites have several advantages including being neutrally evolving [27, 28], multiplexing can be easily done for ten or more markers in a single PCR cocktail and they are abundant in the genome [29]. Despite these advantages, microsatellites have high PCR amplification biases that may cause incorrect classification of dominant and minor haplotypes [30]. Furthermore the small number of markers, high mutation rate and difficulty of scoring alleles accurately decreases the resolution of microsatellite markers to identify related parasites [26]. Due to their high diversity, parasite population sub-structure may be missed in areas of high transmission since existing panels of 10 to 14 markers may provide inaccurate estimates of relatedness. Furthermore, microsatellite genotyping is difficult to standardize across laboratories, reproducibility is lacking, and amplicons require fragment analysis at core sequencing facilities [28, 31]. Whole genome sequencing (WGS) is not widely used in malaria-endemic countries since it is costly and data analysis needs advanced bioinformatic tools and expertise [18].

As countries intensify their control programmes and approach malaria elimination, a robust, cost effective, rapid and easy-to-use set of molecular markers is urgently required as an alternative genotyping tool to rapidly track disease spread and imported cases. So-called ‘barcodes’ composed of a panel of single nucleotide polymorphisms (SNPs) can be used to profile each parasite isolate and to generate population genetic insights [32,33,34]. SNP barcoding is more easily standardized across studies and offers more rapid and highly automated genotyping options compared to microsatellites [23, 35]. Putatively universal (global) barcodes of 42 P. vivax SNPs have been developed and tested for their utility to determine parasite population structure [33]. However, there has been limited validation of these markers in different endemic settings to determine whether they are informative for all P. vivax populations. Informative barcodes need to distinguish between populations circulating in distinct geographic areas at sub-national resolution, as this is essential to inform malaria control programmes.

Universal barcodes may not provide an accurate estimate of local population structure due to ascertainment bias occuring if the SNP panel used was developed for populations other than those to be studied. That is, SNPs may be polymorphic (informative) in some populations, but not in others [31, 36, 37]. Ascertainment bias is a limitation to measuring the true allele frequency [38] resulting in minor allele frequency (MAF) biases in populations not included in ascertainment group [38, 39]. Thus, global P. vivax SNP marker selection [33] may limit the resolution of this barcode to genotype parasite populations not closely related to the ascertainment group. Moreover, recent intensive malaria control activities may lead to changes in MAF with a loss of rare variants [19, 20] that could reduce the power of the existing SNP barcode to distinguish between different genotypes. The use of population genomic data to aid malaria control relies on having this insight at regional or sub-national resolution, and will vary for different endemic settings and stages in the elimination pipeline [23]. Therefore, validating available barcodes or developing a new barcode that accurately captures the diversity of a country’s parasite population will facilitate characterization of the local malaria transmission scenario.

Here, we describe the development of a SNP barcode designed to capture the diversity of P. vivax populations of Papua New Guinea (PNG), which has the highest transmission of P. vivax in the world [40, 41]. SNP barcodes have been employed in several studies in recent years [15, 16, 23, 33], however their performance has not been compared to microsatellites, a commonly used genotyping tool [42,43,44] and used in our previous studies [9, 45,46,47]. We thus compared parasite population genetics using the newly developed SNP barcode with that for microsatellite markers to address the following research questions: (1) do these two marker panels capture parasite diversity at the sub-provincial scale at the same resolution, and (2) which marker panel has higher resolution to capture parasite geographic connectedness and population structure? Comparisons of SNP and microsatellite data demonstrates the superiority of this locally validated SNP barcode for monitoring parasite populations and tracking the source of infections.

Methods

This study aimed to develop a SNP barcode for high-resolution genomic surveillance of P. vivax in PNG (potentially applicable to other P. vivax endemic areas) and to benchmark it against an existing microsatellite marker panel that has been used in previous population genetic surveys. The study was performed using P. vivax isolates from four locations within a contiguous highly endemic region of the north coast of PNG, specifically in East Sepik (n = 1) and Madang Provinces (n = 3) (Fig. 1), where cross-sectional surveys and population genetic analyses were previously conducted using microsatellite markers [47]. Previous microsatellite analysis of these populations in 2005 and 2006 failed to identify any geographic population structure [46, 47].

Fig. 1
figure 1

Map of the study area. Map of Papua New Guinea showing the location of four distinct catchment areas from which P. vivax isolates were obtained (colored dots). A total of 94 P. vivax isolates were selected from 2012/13 East Sepik and 2014 Madang cross-sectional surveys. Samples were genotyped using both SNP barcodes and microsatellites. n = the number of genotyped P. vivax isolates. Black dots indicate three provinces of PNG where 20 WGS P. vivax obtained for the assay development

Plasmodium vivax isolates

For the genotyping, we selected a total of 94 P. vivax positive isolates collected during two cross-sectional surveys on the north coast of PNG including one population from East Sepik, collected in 2012/13 and three populations from Madang Provinces (Malala, Mugil and Utu) collected in 2014 (Fig. 1) [48]. These P. vivax isolates were selected due to the availability of published microsatellite data [49], which was used to compare with SNP barcode data.

Selecting candidate SNPs

The software package, Genome Analysis Toolkit (GATK) [50] was used for selection of informative SNPs from published WGS data for 40 P. vivax isolates from three regions of PNG (Madang, East Sepik/Maprik and Milne Bay/Alotau) [14, 51] (Additional file 1: Table S1). WGS data quality was checked using Fastqc and coverage checked using Bam coverage to identify high quality genomes. To identify informative SNP variants, paired-end raw reads were aligned to P. vivax Salvador I strain (PvSal1) [52] using the bwa-mem mapping algorithm [53]. SNPs were called and filtered using GATK HaplotypeCaller [50]. To ensure uniform coverage of the whole parasite genome, variants present on 14 nuclear chromosomes were included after excluding all indel calls and ‘blacklisted’ highly polymorphic regions including the telomeres. Additional ‘hard filtering’ included retaining only biallelic single nucleotide variants (SNVs), high coverage (at least 90% of their bases covered up to 5x) SNPs, with a minor allele frequency (MAF) > 10% (0.10), with pairwise LD < 0.2 throughout a window of (0.5 kb), low positive or negative Tajima’s D values (|Tajima’s D| < 0.5) throughout a window of 0.5 kb, high heterozygosity (> 0.4) within PNG, and if they were relatively uniformly spaced across the P. vivax genome. From a total of 24,283 SNPs with MAF > 10%, 4006 remained after filtering and 220 relatively evenly spaced SNPs were selected for assay development (Additional file 2: Figure S1).

SNP barcoding assay development

The assay consists of a series of 20 × 8-10-plex PCRs, with multiplexed amplification of the target regions (PCR#1) using Locus-Specific Primers (LSP) containing universal Illumina overhang adaptors (OH), attaching to all amplicons from each sample (PCR#2) a short sequence tag (multiplex identifier, MID) unique to each sample. This was followed by pooling of all amplicons after indexing each sample and sequencing on an Illumina MiSeq (Additional file 2: Figure S2).

The major problem with multiplex PCR is primer dimer formation and melting temperature (Tm) variation between primers. To minimize these challenges, PrimerPlex software [54] was used to design multiplex PCRs using LSP pools for target 400bp genomic regions which contain SNPs of interest. A total of 22 multiplex PCRs were designed, with each containing 8–12 LSP pairs (220 SNPs total).

Primary multiplex PCRs were performed and optimized for each pool using published guidelines [55]. From a total of 220 SNPs, 42 were negative and were not amplified in the primary multiplex PCR. The remaining 178 SNPs were used for the assay development (Table S2). The optimized conditions for the primary PCR (PCR#1) required 2 μl of sample (20–40 ng DNA template) in a 20 μl reaction consisting of 0.3 mM each dNTP, 3 mM MgCl2, 1X buffer (B1), 0.2 μM primer pool, and 1unit of Hotstart DNA polymerase. The PCR conditions consisted of an initial denaturation step of 12 min at 95 °C, followed by 30 cycles of 15 s at 95 °C, 30 s at 60 °C, and 30 s at 72 °C, and a final 5 min extension at 72 °C.

SNP barcoding assay optimization

To optimize the assay, a total of six (three monoclonal and three polyclonal infections) P. vivax positive field samples were genotyped using 20 randomly selected SNP markers (2 multiplex sets) from a total of 178 described in the above section. The amplicon sequencing approach was used to amplify all 20 P. vivax genomic loci with target SNPs in multiple samples at a time. Following the optimization, the assay was applied to a total of 94 P. vivax positive field samples.

Due to the small size of the Plasmodium genome in comparison to the human genome, the presence of just a few nucleated human cells in field specimens can impede genotyping or sequencing sensitivity and specificity by contributing a large proportion of unwanted human DNA to the DNA sample. Thus, to obtain enough parasite DNA from field samples and to minimize contaminating human DNA, digestion with the McrBC enzyme, which is a DNA methylation-dependent restriction enzyme (MDRE), followed by random whole genome amplification (rWGA) [56, 57] was used. In brief, digestion of human gDNA was done using the McrBC (methylation dependent) enzyme (New England Biolabs, United States) followed by whole genome amplification (WGA) of Plasmodium DNA by very high fidelity Phi29 DNA polymerase proofreading enzyme using the V2 DNA Amplification Kit (GE Lifesciences, Australia). The main aim of this protocol is to deplete contaminating human DNA in malaria field isolates by selectively digesting highly methylated DNA (human) followed by WGA of the remaining high molecular weight DNA (predominantly parasite). This enriches Plasmodium DNA for further use (e.g. SNP genotyping or whole genome sequencing). The protocol uses a minimum starting volume of 6ul of DNA extracted from human blood samples and works well even with low-density samples.

Primary PCR was performed to amplify target loci using optimized PCR#1 conditions as described in above. Then, primary PCR amplicons of each sample from two multiplex reactions were combined and purified using a QIAquick PCR Purification Kit (Qiagen) as per the manufacturer’s protocol. The amount of DNA in the primary PCR was measured using the Qubit DsDNA High Sensitivity (HS) Assay Kit (Thermo Fisher Scientific, Scoresby, Victoria, Australia) and normalized by diluting over-represented amplicons in PCR grade water. The secondary PCR reaction (PCR#2) was performed using cleaned primary PCR products as a template. Illumina adapters and a six-nucleotide sequence specific to each individual sample (MID index) was added to the template. The optimized conditions for PCR#2 requires 2 μl of combined primary PCR product in a 20 μl reaction consisting of 0.3 mM each dNTP, 3 mM MgCl2, 1X buffer (B1), 1 μM forward index primer, 1 μM reverse index primer and 1.5 unit of Hot Start DNA polymerase (QIAGEN). The PCR conditions consisted of an initial denaturation of 3 min at 95 °C followed by 25 cycles of 15 s at 95 °C, 30 s at 60 °C, and 30 s at 72 °C, and a final 7 min extension at 72 °C.

Equimolar amounts of each amplicon pool from all samples were combined into a single tube and purified using AMPure XP magnetic beads (Beckman Coulter) for library preparation. Standard sequencing libraries were prepared following the manufacturer’s recommended protocol and sequenced using in an Illumina MiSeq platform to generate (2X300) paired end reads. The TruSeq Custom Amplicon Sequencing Kit (Illumina, Inc) was used to allow 96 or more samples with integrated barcodes to be pooled prior to sequencing on an Illumina MiSeq.

Data analysis

Bioinformatic analysis

The raw FASTQ files were demultiplexed by binning based on the MID index, the read quality was checked using FastQC (Version 0.8.0) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and combined FastQC output for all samples were visualized using MultiQC [58]. Low-quality reads (< Q30), adaptors, primers, and reads shorter or longer than expected size of amplicon were trimmed using Trimmomatic [59]. Only reads that passed stringent quality filters progressed for alignment and variant calling.

Unmapped BAM files were generated from quality filtered and trimmed FASTQ files using the FastqtoSam function (http://broadinstitute.github.io/picard/). The combined pipeline was then used to generate indexed, mapped BAM files. This pipeline consists of SamToFastq, bwa-mem and MergeBamAlignment to map reads, and generated a clean and indexed mapped BAM file. In brief, the sequenced reads were mapped to the P. vivax Salvador I strain reference genome bwa mem [50] (Additional file 2: Figure S3). Overall quality and genome coverage of mapped bam files were checked using QualiMap v.2.2.1 [60]. We set a minimum cutoff of 50-fold coverage and successful genotyping of loci in at least 60% of sequenced samples to avoid inclusion of PCR and sequencing errors. After removing unsuccessfully amplified loci, the coverage and the frequencies of the reference and alternative alleles were determined using the samtools mpileup function [61] for each sample and SNP. A VCF file from the samtools mpileup analysis output was further filtered using vcftools to remove sites containing insertions and deletions [62]. Finally, all selected SNPs were further confirmed by visually inspecting the individual mapped reads using IGV software [63].

Population genetic analyses

Alternative allele frequency (AAF) was computed as the proportion of genotyped samples whose genotype was not the reference allele for target loci. MAF was computed as the proportion of genotyped samples carrying the genotype that was least common (i.e. MAF = AAF if AAF < 0.5; MAF = (1-AAF) if not). To estimate the actual number of clones per sample the VCF file containing SNP data was converted to The Real McCOIL categorical method format: heterozygous call (0.5), homozygous minor allele (0), homozygous major allele (1) and no call (−1) and used as an input file for analysis of multiplicity of infection (MOI) using The Real McCOIL R package [63]. Input files for genetic analysis were created using the and PGDSpider (version 2.0.0.3) [65]. Genetic diversity was calculated as SNPπ using DnaSP Version 5.0 [64] for SNP data, and expected heterozygosity (He) and allelic richness, using the FSTAT software, version 2.9.4 for microsatellite data [65]), were calculated. Genetic differentiation (FST) was determined using DnaSP Version 5.0 [66] for SNP data and FSTAT 2.9.4 [67] for microsatellites. The Mantel Test was performed to measure associations between genetic distance and spatial geographical distance between catchments, using the R “Vegan” package [68]. Phylogenetic analysis was done based on the distance metric 1-PS using the “Ape” R package and the ‘dist.gene’ function for SNP and microsatellite data and visualized using the FigTree software, version 1.4.3.

The Bayesian clustering software, STRUCTURE version 2.3.4 [69] was used to determine the number of discrete genetic clusters (K) and whether haplotypes cluster according to geographical origin. STRUCTURE runs were performed with a burn-in period of 100,000 followed by 100,000 Monte Carlo steps. The simulations were replicated 20 times with different seeds for K values ranging from 1 to 20. The optimal K value was calculated based on Evanno’s method of ΔK statistics. The CLUMPAK web-based server was used for summation and graphical representation of the STRUCTURE results. The assumptions underlying the population genetics model in STRUCTURE software may limit its use to detect malaria parasite population structure with declining transmission. Unlike natural populations, malaria parasites undergo inbreeding, clonal propagation, and there will be an absence of panmictic conditions when transmission declines. Therefore, to further explore parasite clustering the discriminant analysis of principal components (DAPC) was performed using the R package “Adegenet” [70]. DAPC is robust to Hardy–Weinberg disequilibrium or linkage disequilibrium [71].

Statistical analysis

The Mann–Whitney U test or a one-way analysis of variance were used to measure differences among two groups or more than two groups, respectively. To assess the concordance between genotype allele sharing by SNPs and microsatellite markers we performed correlation analysis using Kendall’s Tau. Statistical analyses were performed using GraphPad Prism Software version 7.0 and a p value of ≤ 0.05 was considered statistically significant.

Results

Identification of SNP candidates

From a total of 40 P. vivax isolates from PNG, 23 were sequenced at the Broad Institute (BI) in Boston, MA, USA [14] and the remaining 17 were sequenced at the Wellcome Trust Sanger Institute (WTSI), Cambridge UK as part of the MalariaGEN Plasmodium vivax Community Project [51]. To include the highest quality samples for SNP selection, 16 isolates were excluded due to low quality and poor coverage (less than 90% of their bases covered up to 5x) or to remove the lowest quality genome of any duplicated samples (4 isolates). Additionally, four samples derived from sequencing pooled isolates were excluded from analysis since pooling could affect variant calling. The remaining 20 P. vivax genomes, originating from three hyper-endemic provinces of PNG (Madang = 17, East Sepik = 2, Milne Bay = 1, Fig. 1, Additional file 1: Table S1), were used to select informative SNPs. From a total of 405,825 variants present on 14 nuclear chromosomes, 144,517 were included after excluding all indel calls and ‘blacklisted’ highly polymorphic regions including the telomeres. Finally, after additional ‘hard filtering’ (see “Methods”), 220 SNPs with MAF greater than 10% and relatively uniformly spaced across P. vivax genome were selected for assay development (Additional file 2: Figure S1).

Assay development

Six P. vivax field isolates were used to develop and optimize the new SNP genotyping assays. To evaluate the amplification of each target locus in the multiplex PCR, single-plex PCR was performed using primary multiplexed PCR products as the template. Of the six samples, three contained single clones (multiplicity of infection, MOI = 1) and the remaining three samples had two clones (MOI = 2) based on Pvmsp1F3 and Pvms16 genotyping [48]. Polyclonal infections were included to assess amplification bias of SNP alleles in complex infections due to multiple amplification steps. Of the 220 primer pairs tested, 178 produced a single clear band of the expected size. These 178 SNP loci were then used to develop a multiplex PCR, for further genotyping of P. vivax isolates from PNG (Additional file 3: Table S2). Only two SNPs from the previously developed barcode [33] met the inclusion criteria (Additional file 3: Table S2). To assess amplification bias that may occur due to preliminary amplification steps such as whole genome amplification (WGA), amplicon deep sequencing was performed for six samples with, and without, WGA for a test set of 20 SNP markers (2 × 10-plex PCRs). A total of 17 of 20 (85%) SNPs were successfully genotyped in these samples. There were no significant amplification differences before and after WGA (p = 0.13 for MOI = 1 samples and p = 0.92 for MOI = 2 (Mann–Whitney U test)) (Additional file 2: Figure S4a). There were also no observed discrepancies in the genotype calls between the WGA and non-WGA samples. There was an average read depth of 133 (range = 50–567) for test run samples (Additional file 2: Figure S4b). This read depth of approximately 133X is suitable to call variants however deeper sequencing is required to do downstream population genetic analysis with high confidence.

Data summary and validation of the barcode

A total of 94 low complexity (MOI \(\le\) 2) P. vivax samples from a cross-sectional survey conducted in four catchment areas of PNG (Madang Province: Mugil, Malala and Utu; East Sepik Province: Ilaita area) (Fig. 1) were then genotyped for all 178 SNPs using the parallel targeted amplicon sequencing assay. These samples had already been genotyped with ten microsatellite markers in a previous study [49]. A total of 28,934,460 reads were generated from the MiSeq Illumina run with a variable sequencing coverage across samples per locus (median = 563, range 56 –7586). Of the 178 SNPs, five were not amplified at all (no reads detected) and 34 had high missingness (no reads for > 40% of samples). Of the 94 genotyped samples, 83 were successfully genotyped for the remaining 146 SNP markers (Additional file 2: Figure S4c) indicating the genotyping success rate amongst samples was 88.2% (83/94) with an 82.1% (146/178) marker positivity rate. There were no identical genotypes, suggesting that the barcode is a unique identifier for P. vivax isolates from PNG. The SNPs generally had moderate minor allele frequencies (MAF) with 98% of SNP loci showing greater than 10% MAF (Additional file 2: Figure S5a). There were no private SNPs (unique to any one population). Published genotyping data for MOI [72] and microsatellite data for these P. vivax isolates [49] was then used to compare the population genetic metrics with the new SNP barcode data.

The SNP barcode detects more multiple clone infections than classical genotyping for MOI

Despite previous ‘classical’ genotyping for MOI using msp1F3 and ms16 indicating the majority of samples were single clone infections (i.e. one allele at both markers [48]), all 83 samples showed at least one heterozygous call (two alleles found amongst the reads at a particular SNP locus), which is evidence of polyclonality (genotyping error is filtered out by the variant calling algorithm). The Real McCOIL analysis indicated that out of the 83 successfully genotyped P. vivax samples, 69 samples (83.2%) have at least two clones and 24 samples (16.8%) were confirmed monoclonal infections. For the population genetic analyses, the dominant allele was used to reconstruct dominant haplotypes (Additional file 4: Table S3).

The SNP barcode captures variable genetic diversity amongst parasite populations

Variable levels of within-population genetic diversity were observed in the four parasite populations with an average nucleotide diversity (π) of 0.33 per SNP site (range = 0.24–0.45) (Fig. 2a). Nucleotide diversity was lowest in the inland Utu population and highest in the coastal Malala parasite population. The genetic diversity (Heterozygosity, He) of the same parasite populations using the microsatellite panel showed uniformly high genetic diversity among populations (mean He = 0.82, range = 0.78–0.85) (Fig. 2b). Note that the different diversity measures necessary for these different markers may also impact this result. Therefore, we used the alternative metric 1-PS (1-pairwise allele sharing) (Fig. 2c, d) to measure genetic diversity within each parasite population for both markers. Genetic diversity by both marker panels was significantly different among the four parasite populations (p value < 0.001, Kruskal–Wallis test). In general, microsatellite genotypes had higher genetic diversity compared to SNP genotypes. However, microsatellites show a wider range of values and more closely related genotype pairs (outliers 1-Ps < 0.4) (Fig. 2d).

Fig. 2
figure 2

Genetic diversity of P. vivax populations from the north coast of Papua New Guinea. Genetic diversity was measured for four catchment areas on the north coast of Papua New Guinea using a SNP nucleotide diversity (π), which was measured by calculating the average number of pairwise differences at assayed SNPs between all members of sample using DnaSP Version 5.0 [65]; b Microsatellite Expected Heterozygosity (He = [n/(n-1)] [(1- Σp 2i )] where n is the number of isolates sampled and pi is the allele frequency at the ith loci) using as FSTAT software version 2.9.4 [67]; c SNP barcode diversity and d microsatellite haplotype diversity. For c and d, box plots show the results from another genetic diversity metric, 1-mean pairwise allele sharing. The variation in the box and median distribution indicates variability in genotype relatedness amongst pairs of genotypes. The analysis was done using genetic distance matrix for 1-PS generated by the ‘dist. gene’ command in “Ape” R package

The SNP barcode detects parasite population divergence that is associated with geographic distance

Bayesian cluster analysis of SNP genotypes using STRUCTURE software [66] identified that three genetic clusters (K = 3) provided the best fit for the SNP data and two genetic clusters (K = 2) for microsatellites (Additional file 2: Figure S6).

Less population structure and genotype clustering according to their geographic origin was observed by microsatellite markers compared to SNPs (Fig. 3). Discriminant Analysis of Principal Components (DAPC) detected higher levels of genotype assignment to different geographic origins and higher differentiation between distant compared to neighbouring populations for SNPs (Fig. 4a, top) than microsatellite markers (Fig. 4b, top). Microsatellites revealed limited differentiation of distant parasite populations such as East Sepik and Utu (Fig. 4b, top).

Fig. 3
figure 3

Bayesian cluster analysis of P. vivax genotypes from the north coast of Papua New Guinea. Cluster analysis was done using a SNP barcodes or b microsatellite haplotypes for 86 P. vivax isolates from four geographic regions of Papua New Guinea using STRUCTURE software version 2.3.4 [68]. STRUCTURE bar plots representing Individual ancestry coefficients are shown for K = 3, each vertical bar represents an individual haplotype and the membership coefficient (Q) within each of the genetic populations, as defined by the different colours

Fig. 4
figure 4

Discriminant analysis of principal component (DAPC) of P. vivax isolates from the north coast of Papua New Guinea. DAPC was used to identify clustering amongst isolates from the four catchment areas for a SNP barcodes and b microsatellite haplotypes. On the top of the figure scatterplots of DAPC (Bottom) are shown. Clusters are defined by ellipses and indicate the variance within the clusters whereas dots indicate the positions of individual parasite genotypes within the cluster. Eigenvalues represent the amount of genetic variation captured by the discriminant factors plotted as the x- and y-axis. On the bottom, individual density plots are shown for the first discriminant function. The data was analysed using DAPC function in “Adegenet” R package [69]

To further explore the patterns of gene flow in different geographic areas we measured genetic differentiation (FST) and observed very low to moderate genetic differentiation (FST = 0.02–0.12) between parasite populations using either marker panel (Table 1). However, values were higher for SNPs and there was greater differentiation of distant populations e.g. East Sepik vs Utu for the SNP marker data.

Table 1 Pairwise population differentiation among P. vivax populations in four different geographic clusters in North Coast of Papua New Guinea

For SNP barcodes, the DAPC individual density plot also supports the FST result, where distant parasite populations were more distinctly clustered (Fig. 4a, bottom) than the nearby populations. However, the microsatellite marker data showed an unusual clustering of distant parasite populations together (East Sepik and Utu) (Fig. 4b, bottom). To assess whether the geographic distance between geographic clusters affects gene flow, a Mantel correlation test was conducted. The analysis showed a significant association between genetic distance and geographic distance in km for SNP markers (Fig. 5a), but not for microsatellite markers (Fig. 5b).

Fig. 5
figure 5

Association between geographic and genetic distance between P. vivax populations of Papua New Guinea. Genetic differentiation between population pairs was measuring using FST for a SNP barcodes and b microsatellites and measured in association with geographic distance based on geographic co-ordinates of villages. A Mantel test was used to measure the association using R “Vegan” package [67]

No association between microsatellite and SNP haplotype relatedness

To further explore clustering patterns and investigate the relatedness of individual SNP haplotypes, phylogenetic analysis was conducted using Neighbour Joining trees. This identified clusters of closely related isolates from the same province and village with moderate population structure and geographic clustering of genotypes (Fig. 6). More clustering of genotypes was found in the East Sepik population compared to the three parasite populations from Madang (Malala, Mugil and Utu for the SNP markers) (Fig. 6a). Overall, phylogenetic analysis supported the STRUCTURE and DAPC results, with higher parasite clustering between East Sepik versus Madang by SNP barcode compared to microsatellites.

Fig. 6
figure 6

Phylogenetic relationships among individual P. vivax isolates from the north coast of Papua New Guinea. Neighbour joining trees are shown for a SNP barcodes and b microsatellite haplotypes. Branches are coloured according to the four catchment areas. Labels indicate a unique sample ID and village of origin. Distance was measured using the genetic distance matrix for 1-PS calculated by “Ape” R package, ‘dist. gene’ command

Unless there is overall high relatedness among genotypes, it is difficult to identify population structure using phylogenetic analysis due to high recombination between distinct clones. To further infer relatedness between parasites within and between populations, a simple pairwise allele-sharing (PS) measure was used. Relatedness analysis using the SNP markers (Additional file 2: Figure S6) showed that the majority of genotypes share alleles at 50–70% of markers suggesting parasites are unrelated [73]. Only a few genotypes showed high relatedness, with 70–90% of alleles shared (Additional file 2: Figure S7) within the population, but no identical genotypes were detected. The allele sharing analysis of the same P. vivax isolates using ten microsatellite markers was consistent with SNP data where the majority of microsatellite genotypes are unique and only a few genotypes shared a high proportion of alleles.

Concordance analysis of allele sharing between genotypes using the SNP barcode and microsatellite haplotypes showed no statistically significant association (r = 0.032, p-value = 0.56) (Additional file 2: Figure S8). Thus, high outcrossing in the PNG populations due to high transmission removes any association between these markers.

Discussion

Genomic surveillance of malaria parasite populations is a useful tool to assess changing transmission patterns, identify imported cases and track the spread of infections [23, 74, 75]. Reliable, cheap and high-resolution genotyping assays are therefore needed to support malaria control programmes. SNP barcodes have been developed to study the complexity of infection [76, 77], parasite population structure and the origins of outbreaks [15, 33, 34, 78]. However, ascertainment bias can reduce the sensitivity of detecting distinct clones and population genetic analyses to detect and track discrete parasite populations. SNP barcodes need to be validated and/or tailored to specific geographic areas to reflect the SNP diversity in local parasite populations [23]. Here, the development of a SNP barcode comprising 178 locally-validated biallelic SNP loci is described. This barcode was tailored specifically to PNG, one of the world’s hotspots for P. vivax malaria infection. Genetic diversity and population structure amongst four distinct catchment areas on the north coast of PNG was compared for the SNP barcode and panel of ten polyallelic microsatellite markers that many groups have previously employed for population genetic analyses [24,25,26]. The results demonstrate the greater sensitivity of these large biallelic SNP barcodes for malaria genomic epidemiology and potential to provide useful data to guide malaria control strategies.

The SNP barcode detected a higher number of clones compared to two highly polymorphic microsatellite markers ms16 and msp1F3, which have been used previously to measure multiplicity of infection [79, 80]. This indicates that the SNP barcode has higher resolution to identify multiple infections, most likely due to the much larger number of loci genotyped. However, there is an upper limit to clone detection due to the biallelic nature of the SNP loci. It also suggests that the complexity of P. vivax infections (based on a small number of loci) is currently underappreciated, and that the barcode will be more useful than small numbers of microsatellites in very low endemic settings to distinguish between very closely related parasites. Samples were pre-genotyped and selected for monoclonal infection, which limits the direct comparison of the ability of these two microsatellites and SNP barcode to identify polyclonal infections. Further evaluation of the barcode by genotyping a large set of randomly selected field samples is needed to fully assess its utility for estimating complexity of infection.

Population genetic analyses using the SNP barcode elucidated genetic diversity, relatedness, population structure and connectivity of circulating parasite populations at higher resolution relative to the larger panel of ten microsatellite markers. More variable genetic diversity among populations is captured by the SNP barcode than microsatellite markers. Previous work with microsatellite markers in eight locations of PNG revealed geographic population structure between the mainland, islands and highland areas [9]. However, microsatellites were unable to differentiate populations at a finer spatial scale between the mainland north coast provinces of East Sepik and Madang [46, 47, 49]. Indeed, microsatellite performance has not previously been compared to SNPs in terms of their ability to differentiate between P. vivax populations. The results suggest that large numbers of SNPs have higher resolution to detect differences in transmission dynamics between populations. Moreover, SNP barcodes detected substantial geographic population structure between the four catchment areas with clustering of haplotypes according to their geographic origin, whereas microsatellites did not achieve this. There was also a significant association between genetic distance and geographic distance for SNPs but not for microsatellite markers suggesting that SNPs can accurately pinpoint geographic origins of infections, whereas microsatellites cannot. This also implies that large numbers of SNP markers can capture population connectivity at fine spatial scales in high transmission areas. The ten microsatellite markers identify some population structure, but it does not fit the expected isolation by distance pattern expected of this contiguous endemic area. The findings are consistent with another study on the malaria vector Anopheles darlingi where SNP markers showed higher discrimination among genetic clusters with more than 4–35 fold higher FST estimates than microsatellite markers [81]. Other studies in different fish species have also shown that biallelic SNP markers have greater accuracy and finer population structure than microsatellite markers [82, 83]. The SNP barcode is more sensitive because markers are more densely covering the chromosomes than the microsatellite panel with less than one marker per chromosome, and thus will more accurately detect relatedness among parasites through inherited segments of the genome, using Identity by Descent (IBD) approaches [84, 85]. Also, in high transmission areas, genetic differentiation (i.e. the difference between the diversity of the subpopulations compared to the metapopulation [86]) is typically estimated as being low when using polyallelic microsatellite markers because their diversity is at a maximum, making it difficult to identify low levels of population structure. Other measures such as Jost’s D have been used to overcome these limitations of microsatellites [47]. While our results suggest that the currently used ten microsatellite panel may have lower resolution to identify distinct genotypes and correctly identify related parasites, larger microsatellite panels will undoubtedly be more sensitive and may also deconvolute complex mixtures of clones within an infection. Selection of informative markers is important to track gene flow and quantify parasite connectivity using IBD measures [87]. SNP barcoding using an adequate number and density of SNPs will be important for the characterization of these population genetic signals, and to identify patterns of parasite migration [23, 32, 87]. The generation of additional WGS data for the surveyed populations would help to verify the performance of different marker panels.

Related isolates have a higher probability of identical alleles at a given locus than unrelated genotypes [73, 74]. In this study, a significant difference in pairwise allele sharing was not detected either within and between populations using SNP or microsatellite markers. The SNP barcode identified a narrow range of allele sharing with > 90% genotypes with alleles shared amongst 50–90% of the markers. This finding is consistent with the previous study by Nkhoma et al. [73] where a SNP barcode comprising 96 SNPs detected high allele sharing in diverse and unrelated parasites (PS of up to 0.74 observed in two unrelated parasites (0% IBD)). Thus, only genotypes with PS values greater than 0.8 (distributed at tail end of the histogram) for either SNP or microsatellite markers are truly related. This finding indicates that simple pairwise allele sharing (IBS) values do not accurately represent the actual percentage of the parasite genome that is IBD or that this measure is not as sensitive to measure the actual genotype relatedness. IBD values were not compared between the different markers in this study since the ten microsatellite markers will have limited sensitivity to estimate the proportion of the genome that is IBD. Studies using large SNP barcodes (> 100 loci) such as that described here, are recommended in order to apply IBD measures to calculate parasite relatedness and between population connectivity [87].

A minor allele frequency criteria was applied to select informative SNPs (> 0.1 MAF) to capture diverse parasites in hyperendemic regions of PNG. It is recommended to use these validated (n = 146) bi-allelic SNPs for future genotyping of P. vivax parasites if informative for the studied parasite population. A similar approach has been used to select a ‘universal’ 42 SNP barcode from hundreds of thousands of SNPs from genomic sequences of globally diverse parasite isolates [33]. A recent study focused on Plasmodium falciparum revealed that barcodes of 93 or even just 24 SNP markers are adequate in a low transmission area to capture parasite connectivity, allow stratification of closely related parasite populations and identify source and sink populations—with a high sample size required for a small number of markers and vice versa [85]. Before applying the developed markers to genotype parasite isolates from a given population, it is recommended to validate markers by evaluating allele frequencies within a subset of samples. Moreover, SNP barcodes need to be continually evaluated and validated against local WGS data to ensure they provide similar insights into parasite population genetics.

In conclusion, the locally-validated SNP barcoding assay showed higher resolution to measure variations in P. vivax diversity and population structure at local (sub-provincial) scale compared to the currently used panel of ten microsatellite markers. As countries approach malaria elimination, SNP barcoding will help to identify transmission zones and their dynamics and routes of parasite migration, and hence how to contain infections and to monitor whether control efforts are having an impact. The findings from this approach in combination with epidemiological data are essential to policy makers. The developed amplicon sequencing assay requires only a small amount of starting DNA (2 mL) and can be done relatively easily using available Next Generation Sequencing technology platforms at low cost (less than $18USD per isolate). This technology allows the “plug and play” incorporation of other markers such as SNPs informative for a given country/region or those associated with drug resistance that could help to concurrently genotype circulating parasites and resistance genes to give timely information for malaria control strategies. Overall, the findings suggest that SNPs may be better suited than the currently used microsatellite markers due to their higher resolution. SNP barcodes would also be more suitable in a control program setting, given the availability of cost-effective and robust high-throughput sequencing and the relative lack of technical issues.