Background

Malaria caused by Plasmodium vivax has recently entered the global health agenda in the context of global malaria elimination. This follows a re-evaluation of the long-held opinion that this parasite causes limited morbidity and essentially no mortality; a range of recent studies suggest that it is a major contributor to both in the widespread regions where it is endemic [1, 2]. Furthermore, the presence of dormant liver forms (hypnozoites), which can re-activate infection, is an important barrier to disease control and global malaria elimination [3, 4].

Mathematical and statistical models are an important area of research in malaria given the complex dynamics of the parasite-host-vector system [5, 6]. The majority of malaria models have focused on the species common in sub-Saharan Africa, P. falciparum; only recently have efforts been directed towards P. vivax [7-10]. The distributions of event times such as the incubation period play an important role in modeling infectious disease [11], and realistic assumptions about these distributions are crucial for accurate models. Published P. vivax models have made a range of implicit and explicit assumptions about the functional form of incubation periods and relapse intervals with limited empirical justification. Earlier work has focused on statistically and clinically significant differences in the epidemiology of sub-populations of this parasite [12, 13], but these epidemiological models do not provide well-defined parametric distributions for application within mathematical or statistical models. The purpose of this study is to use data synthesis to provide accurate, realistic and readily implementable parameters for modeling P. vivax infection event times using several historical human infection datasets.

Methods

We have utilized data from our earlier study of historical human challenge studies in two populations: patients receiving pre-antibiotic era neurosyphilis treatments, and prison volunteers in experiments for malaria prophylaxis. These two groups of institutionalized patients had mosquito-transmitted infections with defined exposure dates and complete follow-up [12]. Data for the infections with long-latency (extended incubation periods) were extracted from three published studies that involved two unrelated temperate strains; one involved drug prophylaxis [14], two were observational studies with inferred exposure dates [15, 16], and all include extensive interval censoring in reported event times.

The composition of the two populations for the incubation period analysis can be found in Tables 1 and 2; the population for relapses is in Table 3. CONSORT diagrams for the two experimental studies can be found in Figure 1.

Table 1 Study population for incubation period analysis (experimental studies)
Table 2 Study population for incubation period analysis (observational studies)
Table 3 Study population for relapse period analysis (experimental studies)
Figure 1

CONSORT diagram for study populations, Plasmodium vivax malaria (A) experimental incubation period study (B) experimental time-to-first relapse study.

Individuals without a recorded incubation or relapse (censored observations) have not been included in this analysis. ‘Failed’ primary infections were generally not reported within the original studies and may represent experimental difficulties. In the analysis of relapses, our primary consideration was to determine the underlying distribution of events for modeling; non-parametric analyses, including the proportion with relapses, can be found in Additional file 1 (section III).

Case-patients were exposed to parasites from a range of geographic locations, which were characterized by hemisphere and latitude. Following prior studies and historical precedent, the sub-populations from the Western hemisphere are referred to as the New World, and the Old World region consists of the Eastern hemisphere and Pacific regions [17]; temperate and tropical regions have been split at ± 27.5° N/S. Many of these data include interval censoring; that is, the event was reported as occurring within a specified time interval, but the exact time is unknown [18].
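The quadrant assignment described above can be sketched as a small helper. This is only an illustration: the function name, the hemisphere labels and the handling of a latitude of exactly 27.5° are assumptions, not details given in the original studies.

```python
TEMPERATE_CUTOFF_DEG = 27.5  # temperate/tropical split stated in the text


def classify_origin(hemisphere, latitude_deg):
    """Return (world, zone) for a parasite origin.

    hemisphere: "western", "eastern" or "pacific" (hypothetical labels).
    latitude_deg: signed latitude in degrees, north positive.
    """
    if hemisphere not in ("western", "eastern", "pacific"):
        raise ValueError("unknown hemisphere label: %r" % (hemisphere,))
    # Western hemisphere = New World; Eastern hemisphere and Pacific = Old World.
    world = "New World" if hemisphere == "western" else "Old World"
    # Assumption: a latitude of exactly 27.5 degrees is counted as tropical.
    zone = "temperate" if abs(latitude_deg) > TEMPERATE_CUTOFF_DEG else "tropical"
    return world, zone


print(classify_origin("western", 38.9))
print(classify_origin("pacific", -9.4))
```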

This study analyzes de-identified, secondary data published in the open literature (in the public domain); no ethics review was required. Analysis of data from patients at the same neurosyphilis treatment centers has been published with an extensive discussion of the ethical issues [19]; the issues inherent to the prison volunteers in these studies have also received extensive attention [20-22].

The incubation period refers to the time from parasite exposure to onset of clinical symptoms; prepatent periods, which refer to the identification of blood-stage parasites, were not included in this analysis. For the experimental studies, all patients received only symptomatic treatments; all cases with malaria prophylaxis or radical cure were excluded. Relapses were measured from the primary infection as reported by the original study authors, and correspond to the onset of new clinical symptoms after parasites are no longer visible in the peripheral blood following the primary infection [23]. These data were examined using survival models, to specifically address the non-normal distribution of event times. In this analysis, we examined a range of distributions including exponential, gamma, Gompertz, log-logistic, log-normal and Weibull; time-shifted distributions from these respective families; and mixture distributions from each distribution family. The general forms of these distributions are shown in Figure 2.

Figure 2

Comparison of general probability distributions included within this study [shape = 0.5, rate = 1; shift = 0.5 (shifted log-logistic only)].
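As a concrete illustration of a time-shifted family, the sketch below implements a shifted log-logistic distribution. The parameterization is an assumption made for this example (scale = 1/rate, with the whole distribution displaced by the shift); the function names are hypothetical, and the demonstration values are the illustrative ones quoted in the Figure 2 caption rather than fitted estimates.

```python
import random


def shifted_loglogistic_cdf(t, shape, scale, shift=0.0):
    """P(T <= t) for a log-logistic distribution displaced to start at `shift`."""
    if t <= shift:
        return 0.0
    z = (t - shift) / scale
    return 1.0 / (1.0 + z ** (-shape))


def shifted_loglogistic_pdf(t, shape, scale, shift=0.0):
    """Density; zero at or before the shift."""
    if t <= shift:
        return 0.0
    z = (t - shift) / scale
    return (shape / scale) * z ** (shape - 1.0) / (1.0 + z ** shape) ** 2


def shifted_loglogistic_quantile(p, shape, scale, shift=0.0):
    """Inverse CDF for 0 <= p < 1."""
    return shift + scale * (p / (1.0 - p)) ** (1.0 / shape)


def sample_shifted_loglogistic(n, shape, scale, shift=0.0, rng=None):
    """Draw n event times by inverse-transform sampling."""
    rng = rng or random.Random()
    return [shifted_loglogistic_quantile(rng.random(), shape, scale, shift)
            for _ in range(n)]


# Figure 2 values: shape = 0.5, rate = 1 (so scale = 1 under our assumption), shift = 0.5.
median = shifted_loglogistic_quantile(0.5, shape=0.5, scale=1.0, shift=0.5)
print("median:", median)
print("CDF at median:", shifted_loglogistic_cdf(median, 0.5, 1.0, 0.5))
```

Under this parameterization the median is simply shift + scale, which makes the shift easy to interpret as a minimum event time.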

Model fitting and parameter estimation utilized the Markov-Chain Monte-Carlo algorithm [24], and interval censored data were addressed using data augmentation methods [25]. Two complementary sets of analyses were performed for each of the experimental incubation and relapse datasets. In the first set, the best-fit distribution was found using the aggregate data, and then parameters for this optimal distribution were determined for each of the subregions of interest. In the second analysis, the best-fit distribution was found for each of the subregions independently.
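The combination of MCMC and data augmentation can be illustrated with a minimal sketch; it is not the custom code used in this study. It assumes an unshifted log-logistic model, flat priors on the log-parameters, and a random-walk Metropolis update. Each iteration first imputes every interval-censored time from the current model truncated to its reported interval, then updates the parameters given the completed data. All function names and the example intervals are hypothetical.

```python
import math
import random


def ll_cdf(t, shape, scale):
    return 1.0 / (1.0 + (t / scale) ** (-shape))


def ll_quantile(p, shape, scale):
    return scale * (p / (1.0 - p)) ** (1.0 / shape)


def ll_logpdf(t, shape, scale):
    z = t / scale
    return (math.log(shape / scale) + (shape - 1.0) * math.log(z)
            - 2.0 * math.log1p(z ** shape))


def augment(intervals, shape, scale, rng):
    """Impute censored times from the model truncated to each (lo, hi) interval."""
    times = []
    for lo, hi in intervals:
        if lo == hi:  # exactly observed event time
            times.append(lo)
        else:
            p_lo, p_hi = ll_cdf(lo, shape, scale), ll_cdf(hi, shape, scale)
            u = p_lo + rng.random() * (p_hi - p_lo)
            times.append(ll_quantile(u, shape, scale))
    return times


def log_lik(times, log_shape, log_scale):
    shape, scale = math.exp(log_shape), math.exp(log_scale)
    return sum(ll_logpdf(t, shape, scale) for t in times)


def run_mcmc(intervals, n_iter=2000, step=0.05, seed=1):
    """Return a chain of (shape, scale) draws; interval bounds must be positive."""
    rng = random.Random(seed)
    log_shape = 0.0
    log_scale = math.log(sum(hi for _, hi in intervals) / len(intervals))
    chain = []
    for _ in range(n_iter):
        # Data augmentation step: complete the data given current parameters.
        times = augment(intervals, math.exp(log_shape), math.exp(log_scale), rng)
        # Metropolis step on the log scale (symmetric proposal, flat prior).
        prop = (log_shape + rng.gauss(0.0, step), log_scale + rng.gauss(0.0, step))
        log_ratio = log_lik(times, *prop) - log_lik(times, log_shape, log_scale)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            log_shape, log_scale = prop
        chain.append((math.exp(log_shape), math.exp(log_scale)))
    return chain


# Hypothetical incubation times in days; (lo, hi) with lo < hi is interval censored.
example = [(12, 12), (13, 14), (14, 14), (15, 16), (11, 12), (14, 15), (17, 17), (13, 13)]
draws = run_mcmc(example, n_iter=500)
print("last draw (shape, scale):", draws[-1])
```

Working on the log scale keeps both parameters positive without boundary checks; in practice the proposal step would be tuned from pilot runs, as described for the actual analysis.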

Deviance Information Criterion (DIC) was used for model comparison [26], with standard thresholds to determine strength of evidence. That is, an absolute difference between models of < 2 DIC units was taken as indicating little difference; from 2-7 units indicating large differences; and > 7 DIC units indicating clear evidence of superiority.
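For readers implementing a similar comparison, the sketch below shows the usual DIC calculation (mean posterior deviance plus the effective number of parameters) together with the thresholds stated above. The function names are hypothetical, and treating a difference of exactly 2 or 7 units as belonging to the middle category is an assumption.

```python
def dic(deviance_samples, deviance_at_posterior_mean):
    """DIC = mean deviance + pD, where pD = mean deviance - deviance at the posterior mean."""
    mean_deviance = sum(deviance_samples) / len(deviance_samples)
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d


def strength_of_evidence(delta_dic):
    """Map an absolute DIC difference onto the categories used in this study."""
    d = abs(delta_dic)
    if d < 2:
        return "little difference"
    if d <= 7:
        return "large difference"
    return "clear evidence of superiority"


# DIC differences of the sizes reported in the Results.
for delta in (0.4, 2.5, 18.3):
    print(delta, "->", strength_of_evidence(delta))
```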

To examine the sensitivity of the model selection procedure, we multiplied all time points by log-normal noise with a mean of 0 and a standard deviation of 0.01 on the log scale, i.e. each time was randomly scaled up or down by approximately ± 2%. Model sensitivity was then assessed by comparing the DIC from the best-fit model with the DIC value from fitting the same distributional model to the generated pseudodata.
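The perturbation step can be sketched as follows. The function name and the example times are hypothetical; the noise model (multiplicative log-normal with log-scale mean 0 and standard deviation 0.01) follows the description above.

```python
import math
import random


def perturb_times(times, sd_log=0.01, rng=None):
    """Multiply each event time by log-normal noise with log-scale mean 0."""
    rng = rng or random.Random()
    return [t * math.exp(rng.gauss(0.0, sd_log)) for t in times]


# Hypothetical event times in days.
observed = [12.0, 14.0, 15.0, 240.0]
pseudodata = perturb_times(observed, rng=random.Random(2014))
for original, noisy in zip(observed, pseudodata):
    print(original, "->", round(noisy, 3))
```

With a log-scale standard deviation of 0.01, roughly 95% of the scaling factors fall within about 2% of 1, which matches the ± 2% description.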

To assess the epidemiological and practical impacts of the identified distributions, we ran a series of stochastic compartmental (SIR) simulations at fixed R0 values while varying the underlying distributions. The distributions were implemented using the best-fit parameters from our data augmentation process. For these epidemic simulations, we utilized the R0 package in R [27] for discrete-time models, running 10,000 stochastic simulations and reporting the mean values for each of the resulting sets of epidemics. We have not incorporated uncertainty in the extrinsic incubation period due to lack of reliable data, and we have made the assumption that the incubation period distribution is proportional to the generation interval, as P. vivax infections produce infective gametocytes rapidly upon onset of clinical symptoms [28].
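To show how the choice of generation-interval distribution enters such a simulation, the sketch below implements a simple discrete-time chain-binomial epidemic in a closed population. It is not the algorithm of the R0 package. The function names, population size, number of seed cases and all distribution parameters are hypothetical choices for illustration.

```python
import math
import random


def discretize(cdf, max_lag):
    """Daily generation-interval weights for lags 1..max_lag, renormalized to sum to 1."""
    raw = [cdf(s) - cdf(s - 1) for s in range(1, max_lag + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def simulate_epidemic(r0, weights, population=500, seeds=5, days=200, rng=None):
    """Return daily incidence; index 0 holds the seed cases."""
    rng = rng or random.Random()
    incidence = [seeds]
    susceptible = population - seeds
    for day in range(1, days + 1):
        # Infectious pressure: past incidence weighted by the generation interval.
        pressure = r0 * sum(w * incidence[day - lag]
                            for lag, w in enumerate(weights, start=1)
                            if day - lag >= 0)
        p_infect = 1.0 - math.exp(-pressure / population)
        # Binomial draw over the remaining susceptibles.
        new_cases = sum(1 for _ in range(susceptible) if rng.random() < p_infect)
        susceptible -= new_cases
        incidence.append(new_cases)
    return incidence


def exponential_cdf(t, mean=14.0):
    return 1.0 - math.exp(-t / mean) if t > 0 else 0.0


def shifted_loglogistic_cdf(t, shape=4.0, scale=7.0, shift=7.0):
    if t <= shift:
        return 0.0
    return 1.0 / (1.0 + ((t - shift) / scale) ** (-shape))


for name, cdf in (("exponential", exponential_cdf),
                  ("shifted log-logistic", shifted_loglogistic_cdf)):
    run = simulate_epidemic(5.0, discretize(cdf, 60), rng=random.Random(1))
    print(name, "peak day:", run.index(max(run)), "total cases:", sum(run))
```

A single run per distribution is shown for brevity; comparing distributions in earnest would require averaging many runs, as in the 10,000 simulations described above.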

We used non-informative priors for all parameter estimations. Proposal distributions were adjusted using estimated means and covariances from pilot runs in an iterative process to accelerate convergence; assessment of convergence was performed using Geweke’s diagnostic [29]. All statistical analyses were performed in R (version 2.15.2) [30], using the packages fitdistrplus, flexsurv, grid, MASS, seqinr, FAdist, stats4, R0 and custom-built code for the MCMC algorithm.
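A convergence check in the spirit of Geweke's diagnostic compares the mean of an early segment of a chain with the mean of a late segment. The sketch below is a simplified version: it replaces the spectral-density variance estimate of the original diagnostic with a batch-means estimate, which is a substitution made here for brevity. The function names, the 10%/50% segment fractions and the number of batches are assumptions of this example.

```python
import math
import random
import statistics


def batch_means_variance(xs, n_batches=20):
    """Monte Carlo variance of the mean of xs, estimated from batch means."""
    size = len(xs) // n_batches
    means = [statistics.fmean(xs[i * size:(i + 1) * size]) for i in range(n_batches)]
    return statistics.variance(means) / n_batches


def geweke_z(chain, first=0.1, last=0.5):
    """Z-score comparing the first and last segments of one parameter's chain."""
    n = len(chain)
    early = chain[: int(first * n)]
    late = chain[int((1.0 - last) * n):]
    diff = statistics.fmean(early) - statistics.fmean(late)
    return diff / math.sqrt(batch_means_variance(early) + batch_means_variance(late))


rng = random.Random(0)
stationary = [rng.gauss(0.0, 1.0) for _ in range(5000)]
drifting = [i / 5000 + rng.gauss(0.0, 0.1) for i in range(5000)]
print("stationary chain z:", round(geweke_z(stationary), 2))
print("drifting chain z:", round(geweke_z(drifting), 2))
```

Values of |z| well above 2 suggest that the early and late parts of the chain differ, i.e. that the chain has not yet converged or needs a longer burn-in.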

Results

Incubation periods

The study of experimental incubation periods included 461 case-patients; overlaid distributions can be found in Figure 3, and DIC comparisons for these distributional families, for both the aggregate and the subregion-specific incubation period data, are shown in Table 4. It should be noted for all the results that ‘flattening’ of the fitted curves relative to the data-based histograms arises from the data augmentation processes. The DIC indicates that the shifted log-logistic distribution has a marginally better fit than the second-best shifted log-normal distribution (Δ DIC = 0.4, within the < 2 unit range taken as indicating little difference). Mixture distributions of two gammas had limited levels of support (Δ DIC < 3), while all other distributions were not supported by DIC. The increased complexity of the remaining mixture distributions does not explain any greater variation in these data, and these are also not supported by DIC. The subregion-specific distributions show some differences from the best-fit distribution (shifted log-logistic) for the aggregate data. Among the New World tropical parasites there is support for a shifted Gompertz distribution (Δ DIC = 2.5), and in the New World temperate strains there is very strong evidence for a shifted Weibull distribution (Δ DIC = 18.3). Quadrant-specific plots of Kaplan-Meier curves with best-fit distributions can be found in Additional file 1 (section II).

Figure 3

Comparison of crude (non-data augmented) data and estimated parametric models of experimental incubation times, Plasmodium vivax malaria (N = 454). Experimental data are in black outlines, and parametric model fits are shown with 95% confidence intervals, along with the overall best parametric fit.

Table 4 Fitted distributions for experimental incubation times, Plasmodium vivax malaria

The distribution of the 529 cases in confounded and observational studies with longer-term incubations is shown in Table 5 and Figure 4; a bimodal distribution is evident. The studies with the St. Elizabeth strain involved a range of chemoprophylaxis regimens, and the Korean strain infections were all from observational studies that largely included chemoprophylaxis. These results also overwhelmingly support a log-logistic distribution; in this case a mixture of two log-logistic distributions accurately captures the bimodal distribution commonly observed in temperate zone epidemiology. Shifted distributions showed extremely poor fit and are not reported. A Kaplan-Meier curve for these data can be found in Additional file 1 (section II).

Table 5 Fitted distributions for observational incubation time studies, Plasmodium vivax malaria
Figure 4

Comparison of crude (non-data augmented) data and estimated parametric model of observational incubation times, Plasmodium vivax malaria (N = 529). Observational data are in black outlines, and parametric model fits are shown with 95% confidence intervals, along with the overall best parametric fit.

Times to first relapse

The results of the time-to-relapse analysis (primary infection to the first relapse) are shown in Table 6. We find that mixture distributions provide a better fit for the total dataset than standard families; specifically, we find the best fit with a Gompertz mixture, followed by the log-logistic mixture (Δ DIC = 4.1), log-normal mixture (Δ DIC = 4.7) and Weibull mixture (Δ DIC = 5.4). While the differences among these three distributions are very minor, all capture the event times poorly relative to the Gompertz mixture. The gamma and exponential mixtures both fit poorly (Δ DIC > 7). Figure 5 shows these distributions compared with the experimental data; the distinct bimodal peak is captured by the Gompertz mixture. There is some limited support for a shifted Gompertz in the New World Tropical region (Δ DIC = 2.9), but the remaining regions and the global fit to aggregate data all show strong statistical support for a mixture Gompertz distribution.

Table 6 Fitted distributions for experimental relapse times, Plasmodium vivax malaria
Figure 5

Comparison of crude (non-data augmented) data and estimated parametric model of first relapse times, Plasmodium vivax malaria (N = 222). Experimental data are in black outlines, and parametric model fits are shown with 95% confidence intervals, along with the best overall parametric fit.
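To make the form of a two-component Gompertz mixture concrete, the sketch below evaluates its density and distribution function. The parameterization (hazard = rate × exp(shape × t), shape > 0) is an assumption of this example, and the parameter values are chosen only to produce an early and a late mode; they are not the fitted estimates from Table 6.

```python
import math


def gompertz_cdf(t, shape, rate):
    """CDF for a Gompertz distribution with hazard rate * exp(shape * t)."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-(rate / shape) * math.expm1(shape * t))


def gompertz_pdf(t, shape, rate):
    if t < 0:
        return 0.0
    survival = math.exp(-(rate / shape) * math.expm1(shape * t))
    return rate * math.exp(shape * t) * survival


def mixture_pdf(t, weight, early, late):
    """Density of weight * Gompertz(early) + (1 - weight) * Gompertz(late)."""
    return weight * gompertz_pdf(t, *early) + (1.0 - weight) * gompertz_pdf(t, *late)


def mixture_cdf(t, weight, early, late):
    return weight * gompertz_cdf(t, *early) + (1.0 - weight) * gompertz_cdf(t, *late)


# Illustrative (shape, rate) pairs per day: one early and one late relapse component.
EARLY, LATE, WEIGHT = (0.15, 0.002), (0.03, 0.00002), 0.5
for day in (29, 120, 244):
    print(day, round(mixture_pdf(day, WEIGHT, EARLY, LATE), 5))
```

The density is high near each component's mode and low in between, which is the bimodal pattern a single standard family cannot reproduce.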

Sensitivity analyses

A sensitivity analysis was performed for all three datasets and for each of the subregions individually; comparison of the DIC values shows strong evidence that the models also provide good fits for the pseudodata. Detailed results, including estimated posterior distributions for model parameters overall and by quadrant, are presented in Additional file 1.

Epidemic simulations

The results of stochastic epidemic simulations can be found in Figure 6. These results suggest that at a reproductive number (R0) of 5, the time scale of a modeled epidemic varies dramatically based upon the distribution of the incubation period. Use of an exponential distribution, as is extremely common in SIR compartmental simulations, shows a much more rapid epidemic, with Gompertz and Weibull distributions showing more gradual epidemic evolution. Finally, gamma, log-logistic, log-normal, and shifted log-logistic have the latest epidemic peaks, and are virtually indistinguishable from one another. The mean total cases for each set of 10,000 simulations by underlying distributions are shown in Table 7. Comparison of these totals shows that within the 95% confidence intervals, the total number of cases within the epidemic is greater for the best-fitting log-logistic and shifted log-logistic distributions relative to exponential- and gamma-distributed incubation periods. Simulations with R0 = 50 and 75 produced consistent results but with greater separation of the gamma, log-logistic, log-normal, and shifted-log-logistic epidemic curves (results not shown).

Figure 6

Comparison of simulated Plasmodium vivax malaria epidemics with R0 = 5 (mean values from 10,000 simulations for each standard distribution).

Table 7 Total case counts from epidemic simulations, Plasmodium vivax malaria (mean values and 95% CIs from 10,000 simulations for each distribution)

Discussion

Although some of the earliest simulation models of malaria were directed towards P. vivax in epidemics, this parasite has received limited attention from modelers [31].

The models that have appeared have used a range of distributions for the model parameters of incubation period and time-to-relapse. Some of the earliest comprehensive mathematical models for P. vivax did not consider distributional assumptions and relied on point estimates [32]; other mathematical models used a log-normal distribution for relapses and a single estimate of 15 days for incubation period [7], implying an exponential distribution. A stochastic model of potential P. vivax transmission within Japan used a gamma distribution for the incubation period, an exponential distribution for short relapse periods, and a log-normal for longer relapses [33].

Two other comprehensive mathematical models implicitly assume exponential distributions for both incubation and times-to-relapse [9, 34]. A recent comprehensive model including multiple relapse states used a 15-day incubation period in simulations to produce a mean relapse interval of 7.1 months for cases in India, with incubation as an exponential distribution, and relapses modeled using a gamma distribution [10].

Several studies have found that results from infectious disease models can be highly sensitive to accurate distributional assumptions [35, 36]; our study reinforces these conclusions in finding that the ‘default’ exponential and gamma distributions, used for mathematical tractability, inadequately capture the complexity of experimental data [37, 38]. The results from our simulations concur with these statements and suggest that use of best-fitting distributions can lead to larger overall case counts and slower epidemic evolution than would be predicted based upon exponential or gamma distributed incubation periods. As the underlying distributional assumptions have large and important impacts upon both the time-scale of epidemic evolution and total case counts in P. vivax epidemics, these parameters are a critical component of accurate models.

A range of entomological, molecular, genetic, and epidemiological evidence suggests the existence of subspecies within P. vivax [12, 17, 39]; however, there has been limited consideration of this aspect of parasite biology in published models [9]. Little empirical data exists to support models that include explicit consideration of this aspect of the epidemiology; this study provides parameterization for subpopulations by climatic zone (temperate and tropical) as well as the postulated subspecies P. vivax vivax (E. hemisphere) and P. vivax collinsi (W. hemisphere) to inform region-based models towards global malaria elimination. Our results show that shifted log-logistic distributions adequately capture the incubation period for all regions except for the New World temperate parasites, which show strong support for a shifted Weibull. However, as these parasite populations were eliminated in the early 20th century, they have limited relevance for modern modeling studies [40].

The results from the observational long-latent infections have several important implications. First, although the biological underpinnings of relapse remain obscure [41], there has been considerable debate over whether long latencies may in fact be relapses after a sub-clinical primary infection [42]. The results from this study show that relapses exhibit quantitatively different behavior at a population level from these long incubation periods, and this in turn suggests a closer biological link to ‘normal’ incubations than to relapses.

Secondly, the congruence of the distributions from experimental and observational studies suggests that results from observational studies, while inherently limited, may still adequately capture the natural history of infection with P. vivax. This finding may greatly expand the utility of available datasets to more completely explore the epidemiology of P. vivax.

In modeling the time-to-relapse, there is strong support for a mixture Gompertz for all event times except in the New World Tropical region, where a shifted Gompertz is supported. In addition to simplifying modeling, this concordance of distributions in different parasite populations suggests that hypnozoite activation may have a common underlying biological trigger, regardless of parasite genetics [43]. While these results for temperate zone parasites are based primarily on now-eliminated Russian strains, the parasites currently circulating on the Korean peninsula have been reported to have similar relapse patterns [3].

However, this study has several limitations. The times we have analyzed are from adult, non-immune and mostly Caucasian subjects with uncertain inclusion or exclusion criteria, and may not represent the experience in high transmission settings due to the influence of immunological factors, as well as the poorly understood impact of mixed-species malaria infections [44]. While malariotherapy for neurosyphilis treatment has been shown to have minimal impacts on incubation periods, larger impacts were found for relapse periods in some sub-populations of P. vivax [12].

A related study that examined the length of P. falciparum infections found that the total duration of infections was best modeled using a Gompertz distribution [37]. However, the existence of relapses makes defining a duration of infection with P. vivax difficult; multiple lines of evidence suggest that relapses within a single infection may be genetically distinct from the primary infection [45].

Conclusions

Our results suggest that the ‘default’ distributions used in many modeling studies (exponential and gamma distributions) may be inadequate to fully capture the natural variability and complexity of event times in human infections with Plasmodium vivax malaria. Future modeling studies should consider the use of log-logistic and Gompertz distributions for incubation periods and relapse times respectively, and the region-specific distributions included in this work should be considered to accurately model regional variations in the epidemiology of this parasite. Future statistical and mathematical models of P. vivax transmission should incorporate the more complex distributions identified in this study to maximize the congruence with the true natural history and epidemiology of this important human pathogen.