Background

Malaria remains a major public health problem worldwide. Plasmodium falciparum is the parasite species causing the lethal form of the disease whilst Plasmodium vivax has long been considered a parasite causing mild disease, thereby diverting attention away from this species regarding research; however, recent studies have reported that this species also causes severe clinical syndromes[1, 2]. Even though both species infect humans, they both emerged from different evolutionary lineages; whilst P. vivax shares a common ancestor with Asian non-human primate malaria, P. falciparum has diverged from parasites infecting great apes[3].

The different evolutionary paths leading to the appearance of P. vivax and P. falciparum have also led to important differences regarding hosts being invaded by both species[4, 5]. In spite of such differences, initial interaction between the parasite and red blood cells (RBC) seems to be directed by the MSP-1 protein[6–8] which is present in all species from the genus. MSP-1 forms a complex with MSP-6 and MSP-7 in P. falciparum[9–11]; the latter protein is encoded by a gene forming part of a multigene family which has been differentially expanded amongst Plasmodium species[12]. Studies involving msp-7 family members have shown that the resulting protein products are located on the parasite membrane and that a 22 kDa C-terminal fragment (derived from proteolytic processing during parasite development)[10] has regions interacting with RBC[13]. The msp-7 knockout in P. falciparum (pfmsp-7I) and Plasmodium berghei (pbmsp-7B) has shown that even though its absence is not lethal, it does reduce mutant parasite invasion ability[14, 15]. These results, together with prior in silico analysis, have suggested that the members of this family could have functional redundancy[12, 15, 16] and their protein products (or some of them) could thus be involved in invasion. On the other hand, antigenicity studies have shown that some of these genes’ protein products are recognized by sera from infected patients[17, 18]. Antibodies directed against these proteins can inhibit parasite invasion of RBC[19], whilst immunization with members of the Plasmodium yoelii msp-7 family has shown that they can confer protection in vaccinated mice following experimental challenge[20].

The genetic variability patterns observed in msp-7 family members have been different between P. falciparum and P. vivax[21–24]; whilst members of the former species have low polymorphism[23, 24], some members of P. vivax (pvmsp-7C, pvmsp-7H and pvmsp-7I) are highly polymorphic[21]. However, other members, such as pvmsp-7A and pvmsp-7K, are amongst the most conserved P. vivax antigens[22]. There are thirteen msp-7 genes in this species’ chromosome 12; these have been named in alphabetical order according to their location regarding the PVX_082640 gene[12]. Eleven of these genes are transcribed, but only seven of them are transcribed during the last hours of the intra-erythrocyte stage[25]. The genetic diversity of four of these seven genes has already been evaluated[21, 22]; this study was therefore aimed at evaluating the genetic variability of the three remaining members (pvmsp-7E, pvmsp-7F and pvmsp-7L) which are expressed during the intra-erythrocyte stage. pvmsp-7E displayed high polymorphism and its central region had undergone rapid evolution whilst pvmsp-7F and pvmsp-7L were seen to be highly conserved. The genes’ 3′-ends tended to be conserved by negative selection, suggesting that they encode the functional region for these proteins. Similar to what happened with the msp-1 gene[26, 27], msp-7 genes seem to have diverged due to positive selection, which could have resulted from malaria parasites’ adaptation to different hosts.

Methods

Ethics statement

All P. vivax-infected patients who provided us with the blood samples were informed about the purpose of the study and all gave their written consent. All procedures carried out in this study were approved by the ethics committee of the Fundación Instituto de Inmunología de Colombia.

Parasite DNA and genotyping

Thirty-six peripheral blood samples from patients proving positive for P. vivax malaria by microscope examination were collected from some of Colombia’s departments (Chocó and Nariño in the south-west, Guainía, Guaviare and Meta in the south-east, Tolima in the Midwest, and Atlántico, Antioquia and Córdoba in the north-west) between 2007 and 2010 (nine isolates in 2007, seven isolates in 2008, eight isolates in 2009 and twelve isolates in 2010). DNA was obtained using a Wizard Genomic DNA Purification kit (Promega), following the manufacturer’s instructions, and stored at −20°C until use. The parasite samples were genotyped by PCR-RFLP of the pvmsp-1 gene as previously described[28]. Samples having single P. vivax msp-1 allele infection were used for PCR amplification.

PCR amplification and sequencing

Primers were designed for amplifying pvmsp-7E, pvmsp-7F and pvmsp-7L DNA fragments, based on Sal-I sequences (PlasmoDB IDs: PVX_082665, PVX_082670 and PVX_082700, respectively). The pvmsp-7E gene fragment was amplified with 7Edto 5′ GCCGATCTGTTGTCTTTTCC 3′ and 7Erev 5′ CCTTACGACACGTCAAATGG 3′ primers. pvmsp-7F was amplified by using 7Fdto 5′ TCCTCTCCTTGCTGATACTCC 3′ and 7Frev 5′ CAGCCGCTTAAATCACTTC 3′ primers whilst pvmsp-7L was amplified with 7Ldto 5′ AGTACTATTCTTCTTGCCGTCC 3′ and 7Lrev 5′ TCCCCTCAGTAGTAAAACATCG 3′ primers. All PCR reactions were performed using KAPA HiFi HotStart Readymix containing 0.3 μM of each primer in a final 25 μL volume. Thermal conditions were set as follows: one cycle of 5 min at 95°C, 30 cycles of 20 sec at 98°C, 15 sec at 63°C, 30 sec at 72°C, followed by a 5 min final extension at 72°C. PCR products were purified using the UltraClean PCR Clean-up (MO BIO) kit, and then sequenced with a BigDye Terminator kit (Macrogen, Seoul, South Korea) in both directions. Three PCR products obtained from independent PCR amplifications were sequenced per isolate to discard errors. Sequences having a different haplotype to the previously reported ones were deposited in the GenBank database (accession numbers KM212276 - KM212302).

Phylogenetic analysis for Plasmodium cynomolgi msp-7 orthologous identification

An approach similar to that used for msp-7 identification in other Plasmodium species[12] was adopted for identifying msp-7 genes in Plasmodium cynomolgi (pc) and establishing their orthologous relationships. The genomic region (obtained from GenBank, accession number: NC_020405) flanked by the PCYB_122860 and PCYB_122720 genes (homologues to PVX_082640 and PVX_082715 which circumscribed the msp-7 region in P. vivax) was analysed using ORF Finder[29] and Gene Runner software for identifying open reading frames (ORFs) encoding proteins larger than 300 amino acids. Deduced amino acid sequences obtained with Gene Runner were aligned with P. vivax (12 proteins) and Plasmodium knowlesi (5 proteins) MSP-7 sequences using the MUSCLE algorithm[30]. The best model for amino acid substitutions was selected by Akaike’s information criterion using the ProtTest program[31]. Phylogenetic trees were inferred through Maximum Likelihood (ML) and Bayesian (BY) methods using the JTT+G model. The observed amino acid frequencies (JTT+G+F) were also considered in Bayesian phylogenetic analysis and the analysis was run for one million generations. ML topology reliability was evaluated by bootstrap, using 1,000 iterations, whilst the sump and sumt commands in Bayesian analysis were used for tabulating posterior probability and building consensus trees. MEGA v.5 software was used for ML analysis and MrBayes v.3.2 software for assessing Bayesian inference. The P. falciparum MSP-7H (PfMSP-7H) sequence was used as outgroup in both methods.

DNA diversity and evolutionary analysis in pvmsp-7 genes

CLC Main workbench (CLC bio, Cambridge, MA, USA) software was used to assemble forward and reverse sequences from three independent PCR fragments per isolate. Deduced amino acids from Colombian isolates’ pvmsp-7 sequences and those obtained from several sequencing projects (Sal-I, Brazil-I, Mauritania-I, India-VII and North Korean reference sequences)[4, 32] were aligned using the MUSCLE algorithm[30], followed by manual editing. PAL2NAL software[33] was then used for inferring codon alignments from the aligned amino acid sequences. The T-REKS algorithm[34] was used for searching for repeats having 90% similarity with the deduced msp-7 amino acid sequences.
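The codon-alignment step can be illustrated with a minimal sketch. It is not the PAL2NAL implementation; it only assumes that each aligned residue corresponds to one codon of the unaligned coding sequence and that each protein gap becomes three nucleotide gaps. The function name thread_codons and the toy sequences are hypothetical and are not msp-7 data.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread an unaligned coding sequence onto its aligned protein sequence.

    Each residue consumes one codon; each protein gap becomes three gap
    characters. Trailing nucleotides (e.g. a stop codon) are ignored.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than 3 x protein length")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example (made-up sequences)
print(thread_codons("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```

Threading codons in this way keeps the reading frame intact, which is what the codon-based selection tests described below require.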

DnaSP v.5 software[35] was used for calculating the number of polymorphic segregating sites (Ss), the number of singleton sites (s), the number of parsimony-informative sites (Ps), the number of haplotypes (H), the Watterson estimator (θw) and nucleotide diversity per site (π) for all available sequences (reference sequences and Colombian ones), as well as for the Colombian population alone. Departure from the neutral model was assessed in the Colombian population by frequency spectrum-based tests (Tajima’s D, Fu and Li’s D* and F* statistics, and Fay and Wu’s H) and by tests based on the distribution of haplotypes (Fu’s Fs, the K-test and the H-test; for the latter test, the haplotype diversity obtained from DnaSP software was multiplied by (n-1)/n according to Depaulis and Veuille[35, 36]). DnaSP v.5 and/or ALLELIX software were used for these tests, coalescent simulations being used for obtaining confidence intervals[35]. Positions containing gaps or repeats in the alignment were not taken into account.
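For readers unfamiliar with these estimators, the following sketch shows how the number of segregating sites, π, θw and Tajima’s D could be computed from a set of aligned sequences using the standard textbook formulas. It is an illustration only and not the DnaSP implementation; the function name diversity_stats is hypothetical, columns containing gaps are simply skipped, and DnaSP options (such as using the total number of mutations rather than segregating sites) are not reproduced.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Return S, pi and Watterson's theta (both per site) and Tajima's D."""
    n = len(seqs)
    length = len(seqs[0])
    if n < 4 or any(len(s) != length for s in seqs):
        raise ValueError("need at least 4 aligned sequences of equal length")
    # Keep only gap-free alignment columns
    columns = [col for col in zip(*seqs) if "-" not in col]
    sites = len(columns)
    seg = [col for col in columns if len(set(col)) > 1]
    S = len(seg)
    # k: mean number of pairwise nucleotide differences
    n_pairs = n * (n - 1) / 2
    k = sum(
        sum(1 for a, b in combinations(col, 2) if a != b) for col in seg
    ) / n_pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    pi = k / sites
    theta_w = S / a1 / sites
    # Constants for the variance term of Tajima's D
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    if S > 0:
        D = (k - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))
    else:
        D = float("nan")
    return {"S": S, "pi": pi, "theta_w": theta_w, "tajima_D": D}


# Toy alignment (made-up sequences, not msp-7 data)
toy = ["ACGTAC", "ACGTAT", "ACCTAT", "ACCTAC"]
print(diversity_stats(toy))
```

In this toy alignment both polymorphisms are at intermediate frequency, so π exceeds θw and D is positive; an excess of rare variants would make D negative.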

Natural selection was assessed by using the modified Nei-Gojobori method[37] to calculate non-synonymous (dN) and synonymous (dS) substitution rates. Differences between dN and dS were assessed by applying Fisher’s exact test (suitable for dN and dS < 10[38]) and the Z-test available in MEGA software v.5[39]. The Datamonkey web server[40] was used for assessing codon sites under positive or negative selection at population level, along with the IFEL codon-based maximum likelihood method[41]. Positively or negatively selected sites were also assessed by FEL, SLAC, REL[42], MEME[43] and FUBAR[44] methods. A <0.1 p-value was considered significant for IFEL, FEL, SLAC and MEME methods, a >50 Bayes factor for REL and a >0.9 posterior probability for FUBAR. Recombination was considered before running these tests. The branch-site REL method was used for identifying branches (lineages) in which a percentage of sites had evolved under positive selection, thereby exploring the long-term effect of selection. Non-synonymous divergence (KN) and synonymous divergence (KS) substitution rates were also calculated using the modified Nei-Gojobori method[37] with Jukes-Cantor correction[45]. Positive and negative selection at every codon for P. vivax/P. cynomolgi msp-7 alignments were also evaluated by FEL, SLAC, MEME, REL and FUBAR methods.
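As an illustration of the Jukes-Cantor correction mentioned above, the sketch below converts an observed proportion of differences per site (p) into a corrected distance, d = −(3/4) ln(1 − 4p/3), and forms the ratio ω = KN/KS. The counting of synonymous and non-synonymous sites by the modified Nei-Gojobori method is not reproduced here; the function names and the input proportions are hypothetical.

```python
from math import log


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * log(1 - 4 * p / 3)


def omega(p_n: float, p_s: float) -> float:
    """Ratio of corrected non-synonymous to synonymous divergence (KN/KS)."""
    k_s = jukes_cantor(p_s)
    if k_s == 0:
        raise ValueError("KS is zero; the ratio is undefined")
    return jukes_cantor(p_n) / k_s


# Hypothetical proportions of differences, for illustration only
print(jukes_cantor(0.10))
print(omega(0.20, 0.10))
```

A ratio above 1 is read as a signal of positive selection and a ratio below 1 as a signal of negative selection, which is how the ω values are interpreted in the Results.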

Linkage disequilibrium (LD) was evaluated by calculating the ZnS statistic[46]. Linear regression between LD and nucleotide distances was evaluated to ascertain whether recombination was taking place in pvmsp-7 genes. Recombination was also assessed by the GARD method[47] and by ZZ[48] and RM tests[49]. RDP3 v3.4 software was used for detecting recombinant fragments in pvmsp-7 genes[50].
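The ZnS statistic is the average squared allelic correlation (r²) over all pairs of segregating sites. The sketch below is a simplified illustration rather than the DnaSP implementation: it assumes haploid sequences, uses only gap-free biallelic sites, and the function name zns is hypothetical.

```python
from itertools import combinations


def zns(seqs):
    """Mean r^2 over all pairs of biallelic, gap-free segregating sites."""
    n = len(seqs)
    # Encode each usable column as 0/1 relative to the first sequence's allele
    sites = []
    for col in zip(*seqs):
        alleles = set(col)
        if "-" in alleles or len(alleles) != 2:
            continue
        sites.append([1 if base == col[0] else 0 for base in col])
    if len(sites) < 2:
        raise ValueError("ZnS needs at least two biallelic segregating sites")
    r2_values = []
    for x, y in combinations(sites, 2):
        p_a = sum(x) / n
        p_b = sum(y) / n
        p_ab = sum(1 for i, j in zip(x, y) if i and j) / n
        d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
        r2_values.append(d * d / (p_a * (1 - p_a) * p_b * (1 - p_b)))
    return sum(r2_values) / len(r2_values)


# Toy data: two fully linked sites, then two independent sites
print(zns(["AA", "AA", "TT", "TT"]))
print(zns(["AA", "AT", "TA", "TT"]))
```

Values close to 1 indicate strong non-random association between polymorphic sites, whereas values close to 0 indicate that the sites segregate independently.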

Results and discussion

Genotyping natural isolates

The thirty-six samples used in this study were genotyped by PCR-RFLP of the pvmsp-1 marker. All samples were infected by a single strain (a single P. vivax msp-1 allele was detected) and considered for PCR amplification of pvmsp-7 genes. The RFLP pattern confirmed the presence of different genotypes in the isolates so obtained. In spite of all samples amplifying the pvmsp-1 fragment, no amplimers were detected in some samples for some of the msp-7 genes (pvmsp-7E n = 31, pvmsp-7F n = 36 and pvmsp-7L n = 31).

The msp-7 family structure in Plasmodium cynomolgi and phylogenetic analysis

Prior analysis has suggested that there are several msp-7 species-specific duplications in P. vivax[12]. The recent sequencing of the P. cynomolgi genome[51], a species phylogenetically close to P. vivax[3], has meant that new sequences from this multigene family are now available. The P. cynomolgi genomic region flanked by the PCYB_122860 and PCYB_122720 genes contained eleven ORFs 0.9 to 1.4 kb in length having the same transcription orientation. A shorter 0.5 kb fragment having 30% similarity with the identified ORFs was also observed. A 314 bp region having 75.8% identity with the 285 bp fragment in P. vivax between the pvmsp-7I and pvmsp-7K genes was also found (Figure 1). The P. cynomolgi msp-7 genes (and/or fragments) were named in alphabetical order, according to their location regarding the PCYB_122860 gene (Figure 1). Contrasting with PlasmoDB annotation, our group found that PcMSP-7C, −7F, −7H, −7I, −7K proteins might be encoded by a single exon like P. vivax MSP-7K[52] (Additional file 1). The deduced amino acid sequences from these ORFs had a signal peptide, but no membrane-anchoring regions. The domain characteristic of the MSP-7 family (MSP7_C, Pfam domain ID: PF12948) was absent in the deduced PcMSP-7L protein sequence due to a premature stop codon.

Figure 1

Schematic representation of the msp-7 family in Plasmodium vivax, P. cynomolgi and P. knowlesi. The genes flanking the msp-7 chromosome region in these three species are represented by purple boxes. The blue boxes represent P. vivax msp-7 genes, the red ones represent P. cynomolgi msp-7 genes and the yellow ones represent P. knowlesi msp-7 genes. The genes are given in alphabetical order from left to right. The dashed lines connect orthologous genes. All genes are represented to scale, but in P. cynomolgi and P. knowlesi the distance between them is not representative.

Orthologous relationships were established for P. cynomolgi MSP-7 (PcMSP-7) sequences by inferring phylogenies, using these sequences together with P. vivax (Pv) and P. knowlesi (Pk) MSP-7 sequences (Figure 2) with the P. falciparum MSP-7H (PfMSP-7H) as outgroup. The topologies revealed eleven clades having good statistical support (Figure 2); each P. cynomolgi MSP-7 had a counterpart in P. vivax or P. knowlesi. However, clustering for PcMSP-7B and PcMSP-7E differed from that of the other MSP-7s. PcMSP-7E formed a group with PvMSP-7E (Figure 2A) in ML topology, even though without very high statistical support (72%). The BY topology (Figure 2B) gave a subgroup formed by PcMSP-7B and PcMSP-7E which appeared as a group external to PvMSP-7B/PvMSP-7E, suggesting that these genes are inparalogous and, therefore, that the duplication events occurred after the divergence of P. cynomolgi and P. vivax. Even though pcmsp-7B and pcmsp-7E did not form a group with pvmsp-7B and pvmsp-7E, respectively, their location regarding PVX_082640 and PCYB_122860 genes was the same (Figure 1). Moreover, genetic distances between pcmsp-7B and pcmsp-7E (and/or pvmsp-7B and pvmsp-7E) were similar to those between pcmsp-7B and pvmsp-7B (or pcmsp-7E and pvmsp-7E). Furthermore, genetic distance between pcmsp-7B and pvmsp-7B was smaller than that between pcmsp-7B and pvmsp-7E and the distance between pcmsp-7E and pvmsp-7E was less than the distance between pcmsp-7E and pvmsp-7B (Additional file 2). Consequently, there is little probability that duplication events following the divergence of these two species independently led to the same order of msp-7B and msp-7E genes. Evidence of gene conversion (a mechanism leading to paralogous homogenization) between P. vivax msp-7 members has been reported previously[21]; if such a mechanism occurs between msp-7B and msp-7E genes it would be expected that they would become clustered with each other in the phylogeny and not with their counterparts from a sister taxon; however, further analysis is needed for confirming this hypothesis.

Figure 2

msp-7 gene family phylogeny inferred by Maximum Likelihood (A) and Bayesian (B) methods for P. vivax and two closely related species. The coloured branches show the MSP-7E (red), MSP-7F (green) and MSP-7L (blue) sequences. Sequence clustering reflects orthologous relationships between the msp-7 genes from these three species. The P. falciparum MSP-7H (PfalH) sequence was used as outgroup. Numbers over the branches are bootstrap and posterior probability values.

The aforementioned results have shown that P. vivax and P. cynomolgi share the whole msp-7 repertoire described to date, suggesting that the duplications which gave rise to the msp-7B, −7C, −7E, −7F, −7I, −7L and -7M genes occurred before the divergence between P. vivax and P. cynomolgi and after their divergence from P. knowlesi, and thus msp-7B, −7C, −7E, −7F, −7I, −7L and -7M are not exclusive to P. vivax.

pvmsp-7E genetic diversity

A total of 168 segregating sites were found in the pvmsp-7E gene (Table 1), showing that this gene is highly polymorphic. The nucleotide diversity (π) estimated for this gene (Table 1) was comparable to that found in other genes encoding surface proteins (pvmsp-1[53], pvmsp-3[54] and pvmsp-5[55, 56]) as well as other members of the pvmsp-7 family[21]. Even though genes such as pvmsp-1, pvmsp-3 and pvmsp-5 are highly polymorphic, their diversity at protein level is usually located in particular regions. These regions are usually immune response targets and have thus tended to evolve more rapidly, accumulating mutations which alter the protein sequence and thereby evading the host’s immune system. Around 60% of the polymorphism found in pvmsp-7E was located in the gene’s central region whilst the gene’s ends were relatively highly conserved (Additional files 3 and 4). This pattern has been previously observed in other pvmsp-7 genes[21].

Table 1 DNA polymorphism measurements for pvmsp-7 genes

At DNA level, 23 haplotypes have been found worldwide in pvmsp-7E (Table 1 and Additional file 4); fourteen different haplotypes have been found in this gene’s 3′-end, whilst eleven haplotypes have been found in the central region and five in the 5′-end (Additional file 3). Nineteen of these 23 haplotypes were found in the Colombian population (Table 1 and Additional file 4; haplotype 10: 26%; haplotypes 5, 9: 15%; haplotypes 7, 11, 13, 18: 10%; and haplotypes 6, 8, 12, 14–17, 19–23: 5%) over the course of a 3-year period (2007–2010) without any longitudinal or spatial trends. Thirteen haplotypes were found in Colombia at amino acid level (Additional file 5); ten haplotypes having similar frequencies were observed in the pvmsp-7E central region in the Colombian population whilst three and four haplotypes were distinguished towards the N- and C-terminals, respectively (Additional file 5).

The T-REKS algorithm did not find repeats in the deduced protein sequences. Of the thirty-six sequences analysed for this gene, the North Korean sequence had a premature stop codon, suggesting that pvmsp-7E is a pseudogene in this strain; however, the annotation in the Broad Institute for the North Korean pvmsp-7E gene (accession: PVNG_00513.1) suggests that it could have an intron. Further cDNA analysis is needed to clarify this issue for this strain.

pvmsp-7E neutrality and selection tests

Several tests based on the neutral model of molecular evolution were used with pvmsp-7E sequences from the Colombian population for evaluating whether this gene deviated from neutral expectations (Table 2 and Additional file 6). Tests based on the polymorphism frequency spectrum had significant values (Table 2). Fu and Li’s D* and F* estimators had values greater than zero whilst Fay and Wu’s H estimator had statistically significant negative values (Table 2), indicating deviation from the neutral model of evolution. In addition, the haplotype distribution-based tests also gave statistically significant values; Fu’s Fs test gave values greater than zero and the haplotype number (18) and haplotype diversity (0.922) were lower than those expected under neutrality (Table 2).

Table 2 Neutrality, linkage disequilibrium and recombination tests for pvmsp-7 genes for the Colombian population

A sliding window for D, D*, F* and H statistics gave values greater than zero (D, D* and F*) in the gene’s central region and lower than zero in the 3′-end, this being the region where the most negative value for H was located (Additional file 7). The gene was divided into three fragments: 5′-end (nucleotide 1 to 390), central (nucleotide 391 to 747) and 3′-end (nucleotide 748 to 1158); the aforementioned neutrality estimators were calculated for each of them (Additional file 6). Fu and Li’s D* and Fu’s Fs tests gave values greater than zero at the 5′-end whilst Tajima’s D, Fu and Li’s D*, F* and Fu’s Fs test scores were greater than zero in the central region. The haplotype number and haplotype diversity were lower than expected under neutrality in the 5′-end and central region. Only Fay and Wu’s H tests gave statistically significant values at the 3′-end (Additional file 6).
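The region-wise and sliding-window analyses can be sketched as follows. The three regions use the nucleotide coordinates given above; the statistic shown is simply the number of segregating sites (any of the neutrality estimators could be substituted), and the window length and step size are assumptions chosen for illustration, since they are not stated in the text. The function names are hypothetical.

```python
# Region boundaries (1-based, inclusive) as given in the text for pvmsp-7E
REGIONS = {"5'-end": (1, 390), "central": (391, 747), "3'-end": (748, 1158)}


def count_segregating(seqs, start, end):
    """Count segregating sites in alignment columns start..end (1-based, inclusive)."""
    columns = zip(*(s[start - 1:end] for s in seqs))
    return sum(1 for col in columns if "-" not in col and len(set(col)) > 1)


def per_region(seqs, regions=REGIONS):
    """Apply the statistic to each named gene region."""
    return {name: count_segregating(seqs, a, b) for name, (a, b) in regions.items()}


def sliding_window(seqs, window=100, step=25):
    """Yield (window midpoint, statistic); window and step are assumed values."""
    length = len(seqs[0])
    for start in range(1, length - window + 2, step):
        end = start + window - 1
        yield (start + end) / 2, count_segregating(seqs, start, end)


# Toy alignment of gene length with four artificial differences
ref = "A" * 1158
alt = list(ref)
for pos in (10, 400, 500, 1000):
    alt[pos - 1] = "T"
toy = [ref, "".join(alt)]
print(per_region(toy))
```

Plotting the sliding-window values against the window midpoints gives profiles of the kind referred to in Additional file 7.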

Natural selection seemed to act in different ways within the pvmsp-7E gene. A sliding window for the ratio of non-synonymous substitutions per non-synonymous site to synonymous substitutions per synonymous site (dN/dS = ω) gave values greater than 1 (ω > 1) in the central region and a peak at the 3′-end (Figure 3). The dS rate at the 5′ and 3′-ends was significantly greater than dN, whilst dN was significantly greater than dS in this gene’s central region (Table 3). Eight positively selected sites were identified in the central region and one in the 3′-end. Twenty-six negatively selected sites were identified, mainly in the 5′-end (9 sites) and 3′-end (14 sites) (Additional files 4 and 8).

Figure 3

Sliding window for ω rate. Plasmodium vivax msp-7E, msp-7F and msp-7L genes’ ω (dN/dS) values are represented in blue, whilst the divergence omega (ω: KN/KS) between P. vivax and P. cynomolgi is shown in purple. A diagram of each gene is given below the sliding window. Intra-species non-synonymous substitutions (red) and synonymous substitutions (green) are shown by vertical lines above each gene.

Table 3 Average number of pvmsp-7 gene synonymous substitutions per synonymous site (dS) and non-synonymous substitutions per non-synonymous site (dN)

These results suggested that the central region was under positive natural selection. According to Tajima and Fu and Li tests (which had significant values higher than zero) pvmsp-7E seemed to be under balancing selection favouring the existence of different alleles in the population. This type of selection frequently occurs in antigens exposed to the immune system; immune responses therefore seemed to be directed towards the pvmsp-7E central region in which the mutations were accumulated at a greater rate by positive selection (Additional files 4 and 8).

Interestingly, as Fay and Wu’s H tests gave significant negative values and as K-test and H-test values were lower than expected under neutrality (Table 2), a selective sweep would be probable and strong LD and low genetic diversity would thus be expected. ZnS values suggested that pvmsp-7E had non-random polymorphism association, as expected in a selective sweep (Table 2). However, contrary to what was expected, pvmsp-7E had high genetic diversity (Table 1). When recombination is present in a locus under selective sweep, genetic diversity would be expected to become reduced only close to the selection site[57]. There was evidence of recombination throughout the pvmsp-7E gene (Table 1 and Figure 4, see below) and since the deepest H “valley” was located at the 3′-end (Additional file 7), the selective sweep may not have affected the gene completely but just this region. The 5′-end and central regions showed no evidence of selective sweep (Additional file 6) and π was high in the central region and low at the 5′-end (Additional file 3). The 3′-end had significant values in H and ZnS tests (Additional file 6), suggesting that the selection site should have been located in this region. The π value in the 3′-end was considerably reduced relative to that for the central region (Additional file 3). However, the number of SNPs seems to be higher than that expected in a selective sweep. Our results suggested that a selective sweep affected the pvmsp-7E gene, that the selection site was located in the gene’s 3′-end and that this seemed to be an incomplete selective sweep due to the presence of recombination, since not all variability had been lost. However, if the selective sweep has not been recent, new mutations could have become fixed following such sweep[57].

Figure 4

Schematic representation of recombination segments identified in pvmsp-7E by RDP3. Recombination fragments were only found in pvmsp-7E. Each variant is represented by a colour bar; a segment having a different colour below the bar represents the recombination event so identified.

Fu’s Fs test gave values greater than zero (Table 2 and Additional file 6) which may have resulted from a reduction in haplotypes due to a recent bottleneck. Were this the case, low genetic diversity would be expected throughout the gene pool of the Colombian P. vivax population. Prior studies have shown high genetic diversity in parasite antigens in the Colombian population[21, 55], meaning that such a demographic event is highly unlikely. This test’s result may have been due to a reduction of pvmsp-7E haplotypes by the selective sweep, causing the number of haplotypes to be lower than that expected.

pvmsp-7F and pvmsp-7L genetic diversity

In contrast to pvmsp-7E, pvmsp-7F and pvmsp-7L had low genetic diversity (Table 1). This genetic diversity was similar to that observed in pvmsp-4[58, 59], pvmsp-8[60], pvmsp-10[22, 60], pv12, pv38[61], pv41[62], as well as in pvmsp-7A and pvmsp-7K[22]. Aligning the Colombian sequences with those obtained from the databases (reference sequences, Additional files 9 and 10) revealed that these genes only had four and six segregating sites, respectively. The π values and the number of haplotypes for these genes were low (Table 1). The most frequently occurring pvmsp-7F allele in Colombia was haplotype 2 (61%), followed by haplotype 1 (19%), haplotype 3 (17%) and haplotype 4 (3%), whilst haplotype 1 (61%) was the most frequent for pvmsp-7L, followed by haplotype 3 (16%), haplotype 2 (14%) and haplotypes 4, 5 and 6 (3%).

pvmsp-7F and pvmsp-7L neutrality and selection tests

Neutrality for pvmsp-7F and -7L genes in the Colombian population could not be ruled out as no statistically significant values were found for the tests based on the neutral model of molecular evolution (Table 2 and Additional file 6). Likewise, no natural selection signals were found to be acting on these genes when dN and dS rates were calculated (Table 3). However, when the effect of selection on each codon was evaluated, codon 424 in pvmsp-7F was seen to be under positive selection. Concerning pvmsp-7L, codons 159, 260 and 357 showed positive selection signals (Additional files 8, 9 and 10).

Intragene linkage disequilibrium (LD) and recombination in pvmsp-7 genes

As mentioned above, there were non-random associations regarding polymorphism for pvmsp-7E according to the ZnS test (Table 2 and Additional file 6). No evidence of LD was found in pvmsp-7F or pvmsp-7L (Table 2 and Additional file 6), indicating that polymorphism within these genes was not associated. A linear regression between LD and nucleotide distance for pvmsp-7s gave a line sloping downwards as nucleotide distance increased in pvmsp-7E, suggesting intragene recombination. Twelve minimal recombination (RM) events were found for pvmsp-7E whilst only one RM was found in pvmsp-7F and pvmsp-7L (Table 2). The ZZ test and GARD method suggested recombination in pvmsp-7E (ZZ: p < 0.05; GARD: 2 breakpoints, p < 0.0004) but not in pvmsp-7F or pvmsp-7L. Figure 4 shows the fragments produced by recombination in pvmsp-7E.

pcmsp-7 and pvmsp-7 genes appear to have diverged by positive selection

Natural selection’s long-term effect on evolutionary history can be evaluated by comparing the orthologous genes from phylogenetically-related species[60–64]. msp-7E was highly divergent when the pvmsp-7E and pcmsp-7E genes were compared (Figure 3). In spite of msp-7F and msp-7L being highly conserved in P. vivax, they have also been shown to be highly divergent when compared to P. cynomolgi orthologous genes (Figure 3). The random effects branch-site model (Branch-site REL) was used for determining how natural selection had acted during P. vivax and P. cynomolgi evolutionary history. This test displayed lineage-specific diversifying selection signals in msp-7E and msp-7L (ω > 1, Figure 5). Moreover, the sliding window for the ratio of the non-synonymous divergence per non-synonymous site rate (KN) to the synonymous divergence per synonymous site rate (KS) (divergent omega, KN/KS = ω) gave highly divergent areas in the central region and 3′-end of these three genes (Figure 3). A statistically significant KN > KS was found in the central region (p < 0.001) in pvmsp-7E (Table 4). No significant values were found in msp-7F, but in msp-7L, KS was significantly higher than KN (Table 4). However, when intraspecific polymorphism was compared to interspecific divergence using the McDonald-Kreitman test (MKT) no statistically significant values were found for these genes. The methods for estimating ω values for each codon (SLAC, FEL, REL, MEME and FUBAR) identified twenty-five (for msp-7E), four (for msp-7F) and seven (for msp-7L) codons under positive selection between pvmsp-7 and pcmsp-7 sequences (Additional files 4, 9, 10 and 11).

Figure 5

Positive lineage-specific selection in msp-7E (A) and msp-7L (B) genes. Branches under diversifying selection were identified by Branch-site REL method. ω values, the percentage of selected sites (Pr [ω = ω+]) and p-values are shown.

Table 4 Average number of msp-7 gene synonymous divergence per synonymous site (KS) and non-synonymous divergence per non-synonymous site (KN)

These results suggested that these genes have become diversified by positive selection, a pattern similar to that reported for the pvmsp-1 gene[26, 27]. Divergence due to positive selection in msp-1 coincided with Asian macaque radiation[26, 65] 3 to 6 million years ago, suggesting that such divergence resulted from adaptation to newly available hosts[26, 65]. P. falciparum MSP-1 and MSP-7 form a protein complex involved in invasion[9, 10]. Assuming the formation of a protein complex between MSP-1 and MSP-7 in P. vivax, MSP-7s would be under the same selective pressures and may thus have evolved in a similar way. It has been suggested on theoretical grounds[66, 67] that a strong selective sweep may result in population differentiation at the hitchhiking locus, provided that the gene flow between these populations is low. Since malarial parasites could become diversified by sympatric events[68, 69], msp-7 (similar to msp-1) may have become diversified by positive selection (Figure 5) as a mechanism for adapting the ancestral P. vivax population to a new host during the switch to humans[70] and thus the selective sweep detected in msp-7E might have been an effect of such adaptation.

Negative selection within and between species supports the idea that the 3′-end encodes the functional region in MSP-7 proteins

In spite of divergence by positive selection, msp-7 functional regions could have evolved more slowly due to their role during invasion and thus the accumulation of substitutions would have been mainly synonymous. KS > KN was revealed in msp-7E and msp-7L when comparing P. vivax and P. cynomolgi sequences (Table 4). Fifty-seven sites were revealed to be under negative selection in msp-7E, twenty-four in msp-7F and thirty-six in msp-7L (Additional files 4, 9, 10 and 11). A large percentage of negatively selected sites were located in the gene’s 3′-end encoding the msp-7 family’s characteristic domain (MSP7_C, Pfam domain ID: PF12948). The protein’s C-terminal region encoded by these genes was highly conserved in pvmsp-7A, −7C, −7H, −7I, −7K[21, 22], −7E, −7F and -7L; furthermore, this region has been conserved for a long period of time (since the two species diverged 2.6 to 5.2 million years ago[3]), at least in msp-7E (84.8% similarity between P. vivax and P. cynomolgi), −7F (86.8%) and -7L (95.4%). The negative selection signals identified at the 3′-end of these three genes (Additional files 8 and 11) suggested that the biological structure encoded by this region has been stable, evolving slowly since divergence between P. vivax and P. cynomolgi due to its functional importance. These results support the idea that this region encodes this family’s functional domain[21].

The pvmsp-7 and pcmsp-7 sequences have different gene structures

Marked differences were observed between P. vivax and P. cynomolgi msp-7 genes. pcmsp-7F had a long insertion (192 nucleotides) compared to pvmsp-7F (Additional file 9); however, the ORF remained open. pcmsp-7L had a premature stop codon caused by the deletion of one or two nucleotides from the sequence (Additional file 10). The protein encoded by this gene thus lacked the domain characteristic of this family (MSP7_C, Pfam domain ID: PF12948); however, many synonymous substitutions between species were observed in the region encoding this domain (the gene’s 3′-end) when P. vivax and P. cynomolgi sequences were compared. Thirteen sites in this region were under negative selection in msp-7L (Additional files 10 and 11). The GeneScan algorithm[71] was then used to search for exon/intron splice sites in the pcmsp-7F and pcmsp-7L sequences. GeneScan analysis revealed regions which could act as donor and acceptor sequences in pcmsp-7L but not in pcmsp-7F. There was a thymine at pcmsp-7L nucleotide 609, whilst there was a cytosine at the homologous position in its orthologue in P. vivax (nucleotide 615 in the Sal-I sequence). Such a change may have produced a putative donor (GT) site in pcmsp-7L, whilst a putative acceptor site was located at position 1,030/1,031 (Additional file 12); an intron was thus located in pcmsp-7L between nucleotides 608 and 1,031. This exon-intron-exon structure in pcmsp-7L can be observed in the annotation of the P. cynomolgi genome available from PlasmoDB; however, the intron predicted in PlasmoDB was shorter than that predicted by GeneScan. This exon-intron-exon structure allowed pcmsp-7L to encode a protein having the MSP7_C domain.
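The reasoning above (a premature stop codon in the unspliced sequence, and an open reading frame restored once a GT...AG intron is removed) can be sketched as follows. This is only an illustration on a hypothetical toy sequence with hypothetical coordinates; it is not the GeneScan algorithm and does not use the pcmsp-7L sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_inframe_stop(seq):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def splice(seq, intron_start, intron_end):
    """Remove the intron seq[intron_start:intron_end] (0-based, end exclusive).

    The intron must have canonical GT (donor) and AG (acceptor) boundaries.
    """
    intron = seq[intron_start:intron_end]
    if not (intron.startswith("GT") and intron.endswith("AG")):
        raise ValueError("intron lacks canonical GT...AG boundaries")
    return seq[:intron_start] + seq[intron_end:]


# Hypothetical toy gene: exon 1 + intron (GT...AG) + exon 2.
exon1, intron, exon2 = "ATGGCA", "GTATAAACAG", "GAATGGTAA"
unspliced = exon1 + intron + exon2

# Read without splicing, the frame hits a stop codon inside the intron.
print(first_inframe_stop(unspliced))

# After removing the intron, the only stop codon is the terminal one.
spliced = splice(unspliced, len(exon1), len(exon1) + len(intron))
print(first_inframe_stop(spliced) == len(spliced) - 3)
```

In this toy case the stop codon lies inside the intron itself, which is a simplification; in a real sequence the intron may instead restore a reading frame disrupted by an upstream deletion.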

Conclusions

Our results confirmed that the P. vivax msp-7 family has a heterogeneous genetic diversity pattern. Some members were seen to be highly conserved whilst others had high genetic diversity. Consequently, P. vivax msp-7 genes must have evolved differently from those in P. falciparum, which have low polymorphism[23, 24]. The PvMSP-7 C-terminal region (encoded by the gene’s 3′-end) tended to be conserved within and between genes[21]. This region’s conservation tended to be maintained by negative selection in msp-7E, msp-7F and msp-7L, suggesting that this is the functional region for this group of proteins. On the other hand, the highly diverse PvMSP-7 members (pvmsp-7C, -7H, -7I[21] and -7E) were seen to have undergone rapid evolution in the protein’s central region; immune responses would thus have been directed towards this portion of the protein. New alleles have consequently arisen in the population and been maintained by balancing selection as a mechanism for evading the immune response. In addition to this type of evasion, the P. vivax msp-7 family (similar to that suggested for the pvmsp-3 family[72]) would follow a model of multi-allele diversifying selection where functionally redundant paralogues[12] would increase evasion of immune responses through antigenic diversity.

Our results have shown that P. vivax and P. cynomolgi share the whole msp-7 repertoire described to date and have revealed lineage-specific positive selection signals which are similar to those reported for pvmsp-1. Mutations occurring in msp-7 genes during the host switch may thus have succeeded in adapting the ancestral P. vivax parasite to humans.