Background

The eukaryotic parasite Plasmodium falciparum encodes the repetitive interspersed family (RIFIN) and subtelomeric variable open reading frame (STEVOR) multigene families, variant surface antigens that are associated with severe malaria pathogenesis and immune evasion [1,2,3]. RIFINs and STEVORs share a domain architecture, although RIFINs can be further subtyped into RIFIN-As and -Bs based on a 25 amino acid indel in the semi-conserved domain and differences in subcellular localization suggestive of distinct functions (Fig. 1) [4, 5]. A subset of RIFIN-As harboring a seven amino acid FHEYDER motif in the semi-conserved domain has been shown to inhibit B- and NK-cell activation, weakening host defenses against malaria infection [6]. Both protein families are also targets of natural immunity [7].

Fig. 1

General structure of RIFINs and STEVORs. RIFINs and STEVORs are expressed on the surface of an erythrocyte infected with P. falciparum. Protein domains are illustrated as green (signal peptide), grey (variable domains), red (transmembrane domains), blue (25 amino acid insertion), and orange and purple (semi-conserved domains). There are approximately 160 rif genes in the 3D7 reference genome, separated into two subtypes, RIFIN-A and RIFIN-B, depending on sequence and subcellular localization. The FHEYDER motif (in blue) is present in the semi-conserved domain of 36 RIFIN-As in the 3D7 reference strain. STEVORs encompass ~ 30 genes per genome and are structurally similar to RIFIN-Bs

RIFINs and STEVORs pose challenges in genomic analyses due to their immense genetic diversity and numerous paralogs, which cause difficulties in reference-based assembly and identification. There are limited bioinformatic approaches to distinguish between RIFINs and STEVORs and to further classify RIFINs to the subtype level. Apart from laborious sequence alignment and phylogenetic analyses, BLAST is one of the few available tools [8]. However, BLAST requires a comprehensive reference index, lacks the sensitivity to detect highly divergent sequences, and cannot readily distinguish between RIFIN subtypes. In contrast, profile Hidden Markov Models (HMMs) offer not only better accuracy and speed, but also sensitivity in detecting remote homologs [9]. Three HMM-based tools have been used to categorize RIFIN and STEVOR sequences: RSpred [4], TIGRFAM [10], and PFAM [11]; however, each is built using limited sets of reference RIFIN and/or STEVOR sequences. The more recent tools TIGRFAM and PFAM, as part of the InterPro database [11], do not subtype RIFINs or automatically assign annotations. While RSpred addressed these concerns, it was web-based and could only evaluate ten sequences per job, and its web interface is no longer responsive.

Here, we introduce an improved HMM-based, command-line program called STRIDE (STevor and RIfin iDEntifier). STRIDE has better sensitivity than available HMM tools to detect both RIFINs and STEVORs, and also features RIFIN subtyping, automated annotations, and adjustable thresholds for sensitivity and specificity. Importantly, STRIDE allows for the determination of the number of RIFIN-A sequences with a FHEYDER motif, providing insight into mechanisms to weaken host defenses. STRIDE will have particular value for malaria genomic epidemiologists, as next-generation sequencing of clinical field isolates increases in prevalence and the contributions of RIFIN and STEVOR multigene families to severe malaria pathogenesis and the acquisition of natural immunity to malaria become clearer.

Implementation

STRIDE consists of a merged HMM generated from three different refined multiple sequence alignments of full-length publicly available RIFIN and STEVOR protein sequences (Additional file 1: Figs. S1 and S2). A total of 3536 RIFIN and STEVOR sequences were downloaded from PlasmoDB (Release 45; August 28, 2019, keyword: “RIFIN/STEVOR”). Redundant sequences were clustered with CD-HIT v4.6 (option: -c 1.0). RIFIN-A, RIFIN-B, and STEVOR proteins were first identified via BLAST. For each set of protein sequences, a multiple sequence alignment was created, and a corresponding HMM was generated with hmmbuild (default parameters) as part of the HMMER3 v3.2.1 package. In an iterative process (Additional file 1: Fig. S1), we used each HMM profile to search for homologous sequences in other datasets. Sequences with the highest scores were incorporated into a new seed alignment, from which a new HMM profile was built. Training concluded for each HMM profile when no additional sequences could be extracted.
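The iterative refinement described above can be sketched in a few lines of Python. This is a minimal illustration and not the STRIDE training code: `build_profile` and `score` are hypothetical callables standing in for hmmbuild and an HMM search, the fixed `min_score` cutoff is an assumption (the text states only that the highest-scoring sequences were added), and the 3-mer "profile" in the usage example is a toy stand-in for a real HMM.

```python
from typing import Callable, Iterable, Set, TypeVar

P = TypeVar("P")  # profile type; a real pipeline would hold an HMM here


def iterative_training(
    seed: Iterable[str],
    pool: Iterable[str],
    build_profile: Callable[[Set[str]], P],
    score: Callable[[P, str], float],
    min_score: float,
    max_rounds: int = 50,
) -> Set[str]:
    """Grow a seed set until no further sequence in the pool passes the cutoff."""
    members = set(seed)
    candidates = set(pool) - members
    for _ in range(max_rounds):
        profile = build_profile(members)  # rebuild the profile from the current seed
        hits = {s for s in candidates if score(profile, s) >= min_score}
        if not hits:  # stop when nothing new can be extracted
            break
        members |= hits
        candidates -= hits
    return members


# Toy stand-ins (hypothetical): a "profile" is the set of 3-mers in its members,
# and the score is the fraction of a query's 3-mers found in that set.
def kmers(seq: str, k: int = 3) -> Set[str]:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_build(members: Set[str]) -> Set[str]:
    return set().union(*(kmers(m) for m in members))


def toy_score(profile: Set[str], seq: str) -> float:
    query = kmers(seq)
    return len(query & profile) / len(query) if query else 0.0


trained = iterative_training(
    seed={"ABCDEF"},
    pool={"BCDEFG", "CDEFGH", "XYZXYZ"},
    build_profile=toy_build,
    score=toy_score,
    min_score=0.7,
)
```

In the toy run, "CDEFGH" is recruited only in the second round, after "BCDEFG" has widened the profile, which is the behavior that motivates retraining on each new seed set; the unrelated "XYZXYZ" is never added.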

STRIDE uses a FASTA file as input and scores the query sequences against the HMM profile. A subprogram written in Perl v5.24 parses these scores and outputs the sequence classifications as a tab-delimited text file (Additional file 2). The main classifications are “RIFIN-A”, “RIFIN-B”, and “STEVOR.” STRIDE outputs the number of RIFIN-As with a FHEYDER amino acid motif as an exact pattern match. Truncated or highly divergent sequences are designated as “likely” RIFIN or STEVOR, and those that are unable to meet RIFIN subtyping criteria due to insufficient discriminatory characteristics (e.g. missing the protein segment containing the defining 25 amino acid indel) are called simply “RIFIN.”
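The exact-match motif count can be illustrated with a short Python sketch. It is not STRIDE's Perl subprogram: the function names are hypothetical, the two example sequences are invented, and for simplicity the sketch counts the motif in every record, whereas STRIDE reports it for sequences classified as RIFIN-A.

```python
from typing import Dict, Iterable

MOTIF = "FHEYDER"  # seven amino acid motif named in the text


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_id: sequence}; the ID is the first header token."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = ""
        elif current is not None:
            # Concatenate wrapped lines so a motif split across lines is still found.
            records[current] += line.upper()
    return records


def count_motif(records: Dict[str, str], motif: str = MOTIF) -> int:
    """Number of sequences containing the motif as an exact substring."""
    return sum(motif in seq for seq in records.values())


# Invented example input; the motif in seq1 spans a line break.
example = """>seq1 hypothetical
MKVHYINFHE
YDERAPQTT
>seq2 hypothetical
MKLHYSKILL
""".splitlines()

records = read_fasta(example)
n_with_motif = count_motif(records)
```

Joining wrapped sequence lines before searching matters here, because a line-by-line search would miss a motif that straddles a line break.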

To determine sensitivities and specificities, we created a “validation” dataset that spanned a range of variant surface antigen sequence sizes, including 3888 presumed RIFINs and STEVORs from sequenced clinical isolates and publicly available assemblies (Table 1, Additional file 1: Fig. S2) [12]. In addition, we downloaded annotated protein FASTA files from several Plasmodium reference genomes: P. falciparum 3D7 (5548 sequences), P. vivax (6667 sequences), P. berghei strain ANKA (5076 sequences), P. reichenowi (5644 sequences), and P. chabaudi (5217 sequences) to test our profiles for false positives and negatives.

Table 1 Comparison of STRIDE to PFAM and TIGRFAM, using the same parameter values

Results

Generation of HMM profiles

From the 3536 RIFIN and STEVOR sequences downloaded from PlasmoDB, 967 RIFIN-A, 495 RIFIN-B, and 229 STEVOR sequences comprised the final datasets at the conclusion of HMM training (Fig. 2, Additional file 1: Fig. S2). This included representation of sequences from all sampled genomes. The Malian (ML01) and Togo (TG01) strains were polyclonal and had higher overall numbers of representative sequences. Of the 228 total RIFINs and STEVORs annotated in the 3D7 reference genome, STRIDE incorporated 122 of these sequences.

Fig. 2

Stacked bar graphs of the sequence distribution from all available P. falciparum genomes from PlasmoDB v45 at the conclusion of training each HMM profile. A total of 3536 RIFIN and STEVOR sequences were downloaded from PlasmoDB (Release 45; August 28, 2019). Redundant sequences were clustered with CD-HIT v4.6. HMM (Hidden Markov Model) profiles specific for RIFIN-A, RIFIN-B, and STEVOR proteins were created and iteratively trained against subsets of sequences that were not present in the initial seeding. 967 RIFIN-A, 495 RIFIN-B, and 229 STEVOR sequences comprised the final datasets, providing representation of sequences from all genomes. The Malian (ML01) and Togo (TG01) strains were polyclonal and had overall higher numbers of representative sequences. Of the total of 228 RIFINs and STEVORs annotated in the 3D7 reference genome, STRIDE used 122 3D7 sequences

Performance evaluation

The sensitivity and specificity of STRIDE are adjustable, although default parameters have been optimized to produce the most conservative designations (Fig. 3, Additional file 2). Datasets of 404 RIFIN-A, 476 RIFIN-B, and 40 STEVOR sequences that were randomly selected and excluded from the HMM training were used to test and define the limits of detection for each profile (Fig. 3, Additional file 1: Figs. S1 and S2). All RIFIN-A and -B sequences had low concordance to the STEVOR profile, failing to meet the STEVOR threshold score of 145. The 404 RIFIN-A sequences had whole sequence (represented in blue) and domain (represented in red) scores that exceeded the thresholds for the RIFIN-A profile. In contrast, none of the 404 RIFIN-A sequences met classification criteria for RIFIN-Bs, as their domain scores (red) were below the threshold score of 250. In the same manner, none of the 476 RIFIN-B sequences met the domain threshold score of 250 (red) required for classification as RIFIN-A. A set of positive control sequences from 3D7 demonstrated high concordance to each profile, as illustrated by their respective Circos plots (Additional file 1: Fig. S3).

Fig. 3

Depicting relationships of HMM Scores. Whole sequence scores are represented in blue and HMM domain scores are represented in red. Sets of sequences excluded from the creation of each HMM profile were used to define the limits of detection, represented by a grey line. Datasets of 404 RIFIN-A, 476 RIFIN-B, and 40 STEVOR sequences excluded from the HMM training were used to test and define the limits of detection for each profile. All RIFIN-A and -B sequences had low concordance to the STEVOR profile, failing to meet its threshold score of 145. The 404 RIFIN-A sequences had whole sequence (blue) and domain (red) scores that exceeded the thresholds for the RIFIN-A profile. In contrast, none of the 404 RIFIN-A sequences met classification criteria for RIFIN-Bs, as their domain scores (red) were below the threshold score of 250. The Y-axis represents HMM scores, and the X-axis represents the ordered numerical label for each sequence

Based on these findings, we developed an algorithm to specify the type and subtype of a queried sequence based on whole sequence and domain scores (Additional file 2). The first limit of detection determines which of the three profiles (RIFIN-A, RIFIN-B, or STEVOR) registered the greatest whole sequence score. For a queried sequence to be considered a RIFIN, the whole sequence score must surpass a threshold of 200 against either the RIFIN-A or RIFIN-B profile. Queries with whole sequence scores between 100 and 200 to a RIFIN profile are considered “likely RIFINs” and scores ≤ 100 are considered “unlikely RIFINs”. RIFIN subtyping requires a domain score ≥ 250 against the respective RIFIN profile; otherwise, a query receives only a “RIFIN” annotation. Similarly, for the STEVOR HMM profile, scores between 100 and 145 are considered “likely STEVORs,” and scores ≤ 100 are considered “unlikely STEVORs.” STRIDE does not report queries that are vastly different from all of the profiles.
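The decision rules above can be written as a small Python function. This is a hedged sketch and not the STRIDE parser itself: the function name, the dictionary input format, the treatment of missing scores as 0, and the handling of values that fall exactly on a threshold (for example, a RIFIN whole sequence score of exactly 200 is treated as "likely RIFIN", and a STEVOR score of exactly 145 as "STEVOR") are assumptions, and all scores in the usage example are invented.

```python
from typing import Dict

# Thresholds quoted in the text.
RIFIN_WHOLE = 200.0     # whole sequence score must surpass this for a RIFIN call
SUBTYPE_DOMAIN = 250.0  # domain score needed to assign RIFIN-A or RIFIN-B
STEVOR_WHOLE = 145.0    # STEVOR threshold score
LIKELY_FLOOR = 100.0    # scores at or below this are "unlikely"

PROFILES = ("RIFIN-A", "RIFIN-B", "STEVOR")


def classify(whole: Dict[str, float], domain: Dict[str, float]) -> str:
    """Assign a label from per-profile whole sequence and domain scores."""
    # Step 1: pick the profile with the greatest whole sequence score
    # (ties resolve in PROFILES order, an assumption of this sketch).
    best = max(PROFILES, key=lambda p: whole.get(p, 0.0))
    score = whole.get(best, 0.0)

    if best == "STEVOR":
        if score >= STEVOR_WHOLE:
            return "STEVOR"
        return "likely STEVOR" if score > LIKELY_FLOOR else "unlikely STEVOR"

    # Step 2: RIFIN call, then subtype only if the domain score is high enough.
    if score > RIFIN_WHOLE:
        if domain.get(best, 0.0) >= SUBTYPE_DOMAIN:
            return best
        return "RIFIN"
    return "likely RIFIN" if score > LIKELY_FLOOR else "unlikely RIFIN"


# Invented scores for illustration only.
full_length = classify(
    {"RIFIN-A": 320.0, "RIFIN-B": 180.0, "STEVOR": 40.0},
    {"RIFIN-A": 280.0, "RIFIN-B": 120.0},
)
truncated = classify(
    {"RIFIN-A": 230.0, "RIFIN-B": 210.0, "STEVOR": 30.0},
    {"RIFIN-A": 190.0},
)
```

Here `full_length` receives a subtype because both its whole sequence and domain scores clear their thresholds, while `truncated` passes the whole sequence threshold but not the domain threshold and so is labeled only "RIFIN".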

Discussion

To compare sensitivity and specificity between tools, we adjusted the parameters of PFAM and TIGRFAM to match those of STRIDE. STRIDE detected STEVORs in the curated 3D7 reference genome with similar sensitivity to PFAM and TIGRFAM. Its sensitivity for RIFINs was higher, but the difference was not statistically significant (p = 0.30; χ2 = 2.41, DF = 2, Table 2). Specificity to 3D7 sequences was equivalent across all tools. Unlike PFAM and TIGRFAM, STRIDE was not trained using the entirety of RIFINs and STEVORs from the 3D7 repertoire (Fig. 2, Additional file 1: Fig. S4).

Table 2 Depicting the sensitivity and specificity analyses of STRIDE compared to PFAM# and TIGRFAM# using 3D7^

The “validation” dataset spanned a range of variant surface antigen sequence sizes, which included 3888 presumed RIFINs and STEVORs from sequenced clinical isolates and publicly available assemblies (Table 1). STRIDE detected a total of 3540 RIFIN and STEVOR sequences (91.0%), more than the counts for PFAM (2707, 69.6%; p < 0.00001, χ2 = 31.30, DF = 1) or for TIGRFAM (3394, 87.3%; p = 0.31716, χ2 = 1.00, DF = 1). We also used other Plasmodium reference genomes to further test for specificity. STRIDE appropriately detected RIFINs and STEVORs in gorilla- and chimpanzee-infecting parasites (e.g. P. reichenowi) but did not register any hits to the genomes of P. vivax, P. berghei, or P. chabaudi, three species that lack RIFIN and STEVOR orthologs (Table 1).
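The reported percentages and p-values can be checked from the numbers quoted in the text. The sketch below is an illustration with hypothetical function names: it converts a given χ2 statistic to an upper-tail p-value using the closed forms that hold for 1 and 2 degrees of freedom, and it does not recompute the χ2 statistics themselves, because the underlying contingency tables are not given here.

```python
import math


def chi2_pvalue(stat: float, df: int) -> float:
    """Upper-tail p-value of a chi-square statistic for df = 1 or df = 2."""
    if stat < 0:
        raise ValueError("chi-square statistic must be non-negative")
    if df == 1:
        # A chi-square variable with 1 DF is a squared standard normal.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # A chi-square variable with 2 DF is exponential with mean 2.
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")


def detection_rate(detected: int, total: int) -> float:
    """Percentage of validation sequences detected."""
    return 100.0 * detected / total


TOTAL = 3888  # presumed RIFINs and STEVORs in the validation dataset
rates = {
    "STRIDE": detection_rate(3540, TOTAL),
    "PFAM": detection_rate(2707, TOTAL),
    "TIGRFAM": detection_rate(3394, TOTAL),
}

p_3d7_rifin = chi2_pvalue(2.41, 2)   # 3D7 RIFIN sensitivity comparison
p_tigrfam = chi2_pvalue(1.00, 1)     # STRIDE vs TIGRFAM on the validation set
p_pfam = chi2_pvalue(31.30, 1)       # STRIDE vs PFAM on the validation set
```

Small differences in the last digits relative to the reported p-values are expected, since the χ2 statistics are quoted to two decimal places.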

Using STRIDE, we reevaluated a subset of 320 sequences from PlasmoDB that had received a broad, overlapping “RIFIN/STEVOR family, putative” designation (Additional file 3). These sequences originated from long read-based assemblies of several parasite strains [13]. Among the 312 sequences that met or exceeded identification thresholds, 176 were determined to be RIFIN-As, including 52 with FHEYDER motifs; 80 were RIFIN-Bs; and 56 were STEVORs. Eight sequences did not meet the designated limits of detection for exact classifications. These were mostly truncated copies and thus classified by STRIDE as “RIFIN” or “likely RIFIN.”

We also applied STRIDE to predict the number and classification of RIFINs and STEVORs from 15 unannotated long read-based de novo assemblies of clinical field isolates (Additional file 3) [12]. Initial classification using BLASTp led to mixed results and overlapping annotations. The number of STRIDE-predicted RIFINs and STEVORs from the NF54 de novo assembly mirrored that of 3D7, which was expected given that 3D7 is a clone of the NF54 isolate [14]. STRIDE also consistently identified comparable numbers of RIFINs, STEVORs, and FHEYDER motifs across most clinical samples from diverse geographies. Several “likely RIFIN” sequences in each assembly are encoded by short, truncated contigs and could not be precisely classified. There were proportionally greater numbers of sequences found in the Myanmar samples, which are likely polyclonal (Additional file 3).

Conclusions

We present STRIDE, an HMM-based, command-line program that automates RIFIN and STEVOR prediction, differentiates RIFIN-As from RIFIN-Bs, and identifies the number of sequences with the known pathogenic FHEYDER motif. STRIDE eliminates the need to perform multiple sequence alignments and phylogenetic analyses or to use specialized knowledge of these two protein families to sort RIFINs and STEVORs. STRIDE has better sensitivity to detect RIFINs than other available HMM-based tools and supports adjustable thresholds to customize desired levels of sensitivity and specificity. This HMM-based approach for variant surface antigen classification may be useful for other Plasmodium species and organisms with multigene families, such as Trypanosoma.