Background

Malaria is a life-threatening disease caused by Plasmodium parasites, which are transmitted to humans through the bites of infected female Anopheles mosquitoes [1]. Different species of Plasmodium cause malaria in humans, with Plasmodium falciparum (Pf) being the most lethal and prevalent in Africa [2]. Other species that cause malaria in humans include Plasmodium vivax, Plasmodium malariae, Plasmodium ovale, and Plasmodium knowlesi [2]. One of the most devastating complications associated with Pf infection is cerebral malaria [3,4,5], a severe disease characterised by vascular leakage and cerebral swelling that can lead to coma and death [5,6,7]. This complication can be difficult to diagnose and treat, and it contributes significantly to the high malaria mortality rate in sub-Saharan Africa [5, 8].

Malaria is endemic to sub-Saharan Africa [9,10,11], where 29 countries account for 96% of global malaria cases [12]. Nigeria, in particular, has one of the highest malaria burdens globally and is a significant contributor to the global malaria mortality rate [13]. Approximately 100 million malaria cases are reported annually in Nigeria, resulting in over 300,000 deaths [13]. Along with the Republic of Congo, Nigeria accounts for 36% of global malaria cases [13]. Given the health implications of malaria, Nigeria has joined other African countries to eradicate the disease between 2025 and 2030 [14]. In addition to the Federal Ministry of Health’s National Malaria Elimination Programme (NMEP), the President established the “Nigeria End Malaria Council” in August 2022 to reduce the malaria burden in the country and serve as a platform to solicit funds to promote malaria elimination [3, 15,16,17]. Several control measures, including the distribution of long-lasting insecticide-treated mosquito nets, provision of malaria chemopreventive drugs, and utilisation of indoor residual insecticide spray, among other strategies to eradicate malaria, have been implemented by various African governments [18,19,20,21,22]. However, despite ongoing efforts by African governments to combat malaria, it remains a significant public health challenge and continues to affect the continent’s population and economy [23].

The World Health Organization (WHO) recommends prompt malaria diagnosis, either by microscopy or rapid diagnostic tests (RDTs), for all suspected malaria cases before treatment [24]. Microscopy is still considered the “gold standard” for malaria diagnosis in endemic countries. This method has a detection limit of 50–500 parasites/µL [25], is cost-effective, and enables species identification and parasite density estimation [26, 27]. However, multiple fields must be examined to detect infection, which requires the expertise of at least two microscopists [6]. Hence, the diagnostic accuracy of microscopy is often lacking in practice [6]. Other limitations of microscopic diagnosis include a high number of false negatives, a shortage of skilled microscopists, inadequate quality control, and the possibility of misdiagnosis due to low parasitaemia or mixed infections [28,29,30].

RDTs are recommended by the WHO as a good alternative to microscopy in remote areas of sub-Saharan Africa, with histidine-rich protein II (HRP2)-based RDTs being the most widely used. Some studies have shown that RDTs are more sensitive than microscopy [31, 32]. However, false positives are a significant limitation of RDTs because HRP2 remains in the blood for several days after infection clearance. Furthermore, false negatives can occur because of deletions in the parasite gene encoding HRP2. These shortcomings call for an improved and complementary diagnostic approach.

Accurate and prompt diagnosis of malaria is crucial for effective decision-making, better patient care, and illness management. Correctly identifying which patient needs to take malaria drug(s) and should undergo additional examinations will prevent the overuse of malaria medications and significantly reduce deaths attributable to malaria [33, 34]. Numerous studies have demonstrated machine-learning benefits for different healthcare systems [35,36,37,38]. Recently, several studies have used supervised learning algorithms to identify malaria [39,40,41,42]. However, despite the success of machine learning in managing malaria, most of its applications concentrate on microscopic image analysis to diagnose malaria, while ignoring the fact that most healthcare institutions in the rural areas of most malaria-endemic countries lack basic facilities to make accurate diagnoses.

Given the widespread practice of self-medication with anti-malarial drugs and the difficulties facing Africa’s health system, a machine learning-based diagnosis model is essential. Additionally, for individuals who cannot obtain a laboratory-based diagnosis, the model can help in accurately diagnosing malaria. Machine learning-based diagnostic tools may provide a simple yet reliable method for assessing the potential malaria status. Hence, this study used patient symptoms, demographic and environmental features to develop a clinical tool for prompt and accurate malaria diagnosis.

Methods

Study area, design, and participants

Cross-sectional sampling was conducted in Osogbo, the capital of Osun State, Southwest Nigeria, between June and November 2022 (rainy to dry season). Osun State (latitude 7.5876° N, longitude 4.5624° E) lies in the tropical rainforest zone of southwest Nigeria. Average rainfall ranges from 1,125 mm in the derived savannah to 1,475 mm in the rainforest belt, and temperature ranges from 27.2 °C in June to 39.0 °C in December [43]. Consequently, after rain, water collects in potholes and hollow objects around human dwellings and workplaces, which results in bushy surroundings and stagnant water around homes and workplaces. The majority of the participants in the study were Yoruba residents of Osogbo who sought medical attention at the four Primary Healthcare Centres (PHCs) chosen in the town. The Osun State University Health Research Ethics Committee (HREC) granted ethical approval for this study.

Outcomes

The data were split into two categories, Pf-positive and Pf-negative, indicating those with and without malaria, respectively.

Features

Participants were given a detailed explanation of the study protocol by the medical staff of the four PHCs, and only those who provided written informed consent were recruited. Data on the socio-demographic, behavioural, environmental, and clinical characteristics of the subjects were gathered through questionnaires. Each participant’s body temperature, weight, and height were measured at the respective facilities. The exclusion criteria were age less than 18 years and a lack of interest in participating in the study. Information on age, sex, body weight, height, body mass index, body temperature, fever, diarrhoea, vomiting, headache, cough, sore throat, dizziness, muscle pain, presence of stagnant water at home, presence of stagnant water in the workplace, presence of bushes in the surroundings, and use of mosquito repellents was collected from the patients. This information was collected because these variables are commonly associated with malaria risk [44].

To ensure high-quality data, we adhered to specific guidelines and definitions. Fever was defined as an axillary temperature of ≥ 37.5 °C, in line with the World Health Organization’s standards [45]. Bush proximity and density were determined using GPS coordinates: ‘close proximity’ was defined as bushes within 100 m of a participant’s residence, and high bush density as > 50% area coverage [46, 47]. All Pf malaria diagnoses were RDT-confirmed, in line with WHO best practices [48].

Statistical analysis

Patient baseline characteristics

Patients’ baseline characteristics were summarised using frequencies and proportions for categorical variables and medians and ranges for continuous variables. The characteristics were compared between Pf-positive and Pf-negative patients using the Wilcoxon rank-sum test for continuous variables and Pearson’s chi-square test for categorical variables, with Yates’ continuity correction when appropriate. Statistical significance was set at P < 0.05.
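As an illustration of the categorical comparison described above, the following minimal Python sketch computes Pearson’s chi-square statistic with Yates’ continuity correction for a 2 × 2 table. It is not the code used in this study (the analyses were run in R); the function name `chi2_yates_2x2` and the cell counts are hypothetical and chosen only for demonstration.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Pearson chi-square test with Yates' correction for the 2x2 table
    [[a, b], [c, d]]. Returns (statistic, p_value) on 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    # Yates' correction subtracts n/2 from |ad - bc|; clamp at zero so the
    # correction can never increase the statistic.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    stat = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = symptom present/absent, columns = Pf+/Pf-.
stat, p_value = chi2_yates_2x2(10, 10, 5, 35)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

With these illustrative counts the p-value falls below the 0.05 threshold, so the symptom would be reported as associated with Pf status.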

Multivariable model development

Multivariable penalised logistic regression [49, 50], Bayesian generalised model [51, 52], and decision tree model [53,54,55] with nested cross-validation [56, 57] for parameter optimisation and wrapper-based sequential backward feature selection [58] were employed to classify malaria status (Pf-positive or Pf-negative). A randomly selected 80% of the samples (160 samples: 28 Pf-positive and 132 Pf-negative) was used for model training. The remaining 20% (40 samples: 7 Pf-positive and 33 Pf-negative) was used for testing.
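The split sizes reported above can be reproduced with a class-wise (stratified) random split. The sketch below is an assumption on our part: the text states only that the 80% was randomly selected, although the reported class counts are consistent with stratification. The helper name `stratified_split` and the seed are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately so that
    both sets keep the same class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 35 Pf-positive (1) and 165 Pf-negative (0) samples, i.e. the totals implied
# by the reported training and test counts.
labels = [1] * 35 + [0] * 165
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), sum(labels[i] for i in train_idx))  # 160 28
print(len(test_idx), sum(labels[i] for i in test_idx))    # 40 7
```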

Data scaling

Continuous variables in the training set were scaled to have a mean of 0 and a standard deviation of 1 (z-score standardisation), and the corresponding variables in the test set were scaled using the means and standard deviations estimated from the training set.
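A minimal Python sketch of this scaling step is given below. The key point is that the mean and standard deviation are estimated on the training set only and then reused for the test set, so no test-set information leaks into training. The function names and the body-weight values are hypothetical.

```python
import statistics


def fit_scaler(train_values):
    """Estimate the mean and sample standard deviation from training data."""
    mean = statistics.mean(train_values)
    sd = statistics.stdev(train_values)  # sample SD (divides by n - 1)
    if sd == 0:
        raise ValueError("cannot scale a constant variable")
    return mean, sd


def apply_scaler(values, mean, sd):
    """Standardise values with previously estimated parameters."""
    return [(v - mean) / sd for v in values]


# Hypothetical body weights in kg.
train_weight = [55.0, 60.0, 65.0, 70.0, 75.0]
test_weight = [65.0, 80.0]

mean, sd = fit_scaler(train_weight)
train_scaled = apply_scaler(train_weight, mean, sd)
test_scaled = apply_scaler(test_weight, mean, sd)  # training parameters reused
print(test_scaled)
```

Because the test values are scaled with training parameters, the scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1.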

Nested cross-validation

Nested cross-validation (CV), which involves two layers of cross-validation (inner and outer folds), was performed on the training dataset to obtain reliable estimates of classification accuracy and to avoid overfitting [56, 57]. The inner folds were used to optimise the model parameters and select useful feature subsets, and the performance of the best (inner) model was then evaluated in the outer fold. For the outer loop, we split the training dataset into 30 folds; one fold was held out as a test set, while the remaining 29 folds (the outer training set) were, in turn, split into 20 stratified inner folds, with 19 folds used for model training and the remaining fold for validation. This provided an unbiased evaluation of the model fit on the inner training set while tuning the model’s hyperparameters and selecting optimal features. The outer and inner folds were repeated 20 times to obtain a robust model. In addition, to address the imbalance in our dataset, we employed stratified k-folds in the outer and inner folds.
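The fold structure described above can be sketched in Python as follows. This only generates the index sets for one repeat of the nested scheme; model fitting, tuning, and the 20 repeats are omitted. The function names and the round-robin stratification are our own illustrative choices, not the routine used in the R analysis.

```python
import random


def stratified_folds(indices, labels, k, rng):
    """Split indices into k folds, dealing each class out round-robin so that
    class proportions are as even as possible across folds."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    pos = 0
    for y in sorted(by_class):
        members = by_class[y][:]
        rng.shuffle(members)
        for i in members:
            folds[pos % k].append(i)
            pos += 1
    return folds


def nested_cv_splits(labels, outer_k, inner_k, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for one repeat, where
    inner_splits is a list of (inner_train, inner_validation) pairs."""
    rng = random.Random(seed)
    outer = stratified_folds(list(range(len(labels))), labels, outer_k, rng)
    for o in range(outer_k):
        outer_test = outer[o]
        outer_train = [i for f in range(outer_k) if f != o for i in outer[f]]
        inner = stratified_folds(outer_train, labels, inner_k, rng)
        inner_splits = []
        for v in range(inner_k):
            inner_train = [i for f in range(inner_k) if f != v for i in inner[f]]
            inner_splits.append((inner_train, inner[v]))
        yield outer_train, outer_test, inner_splits


# Training set as reported: 28 Pf-positive (1) and 132 Pf-negative (0).
labels = [1] * 28 + [0] * 132
splits = list(nested_cv_splits(labels, outer_k=30, inner_k=20))
print(len(splits), len(splits[0][2]))  # 30 outer folds, 20 inner folds each
```

Hyperparameters and features would be chosen using only the inner splits, and each outer test fold would be used once to score the model refitted on its outer training set.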

Optimal feature selection and hyperparameters

Feature selection was performed using sequential backward search selection (SBSS) for each inner training set [58]. The SBSS started with all features and dropped the non-informative features at each iteration, improving the model’s performance. This process was continued until no improvement was observed. Once the best combination of hyperparameters and feature subsets that maximised the performance metrics in the validation set was found, the model with the combination of hyperparameters and feature subsets was re-trained on the outer training set and tested on the test set kept out from the outer CV. The feature subsets from all outer folds were then combined using a voting strategy that retained features with more than 50% occurrences in all outer folds as informative; hence, they were chosen as the final feature subset [59]. The median of the best hyperparameters from the outer CV folds was used to fit the final model.
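The selection and aggregation steps above can be sketched in Python as follows. This is an illustrative sketch rather than the study’s R implementation: `score_fn` is a hypothetical callback standing in for the inner-fold validation score, and the toy scoring rule, fold-level feature subsets, and λ values are invented for demonstration.

```python
import statistics
from collections import Counter


def backward_selection(features, score_fn):
    """Greedy sequential backward selection: repeatedly drop the single
    feature whose removal gives the best score, until nothing improves."""
    current = tuple(features)
    best = score_fn(current)
    while len(current) > 1:
        candidates = [tuple(f for f in current if f != drop) for drop in current]
        top_score, top_subset = max((score_fn(c), c) for c in candidates)
        if top_score <= best:
            break  # no single removal improves the score
        best, current = top_score, top_subset
    return current, best


def vote_features(selected_per_fold, threshold=0.5):
    """Keep features selected in more than `threshold` of the outer folds."""
    counts = Counter(f for subset in selected_per_fold for f in set(subset))
    n = len(selected_per_fold)
    return sorted(f for f, c in counts.items() if c / n > threshold)


# Toy score: reward three "informative" features, penalise the others.
informative = {"weight", "bmi", "bushes"}


def toy_score(subset):
    hits = sum(f in informative for f in subset)
    return 10 * hits - (len(subset) - hits)


subset, score = backward_selection(
    ["weight", "bmi", "bushes", "cough", "sex"], toy_score
)
print(subset, score)

# Hypothetical per-fold selections and per-fold best lambda values.
fold_subsets = [
    ["weight", "bmi", "bushes"],
    ["weight", "bushes"],
    ["weight", "bmi", "cough"],
    ["weight", "bushes", "sex"],
]
final_features = vote_features(fold_subsets)
final_lambda = statistics.median([0.001, 0.002, 0.002, 0.004])
print(final_features, final_lambda)
```

In this toy run, `bmi` appears in exactly half of the folds and is therefore not retained, because the rule requires strictly more than 50% occurrence.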

Performance evaluations

To generate summary performance estimates, we averaged the area under the curve (AUC) of the receiver operating characteristic (ROC) curve and other performance measures, namely sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), across the cross-validation folds [60, 61]. The sensitivity \(\left(\frac{TP}{TP+FN}\right)\), specificity \(\left(\frac{TN}{TN+FP}\right)\), PPV \(\left(\frac{TP}{TP+FP}\right)\), and NPV \(\left(\frac{TN}{TN+FN}\right)\), where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively, were calculated using the default cutoff value (0.5) for the Pf-positive or Pf-negative classes for each model. We chose the model parameter values that led to the highest specificity values.
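For reference, the following Python sketch computes these measures from predicted probabilities using the standard confusion-matrix definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), PPV = TP/(TP+FP), NPV = TN/(TN+FN)) and a rank-based AUC. Treating a probability equal to the cutoff as positive is an assumption, and the labels and probabilities are hypothetical.

```python
def confusion_counts(y_true, y_prob, cutoff=0.5):
    """Count TP, FP, TN, FN, calling a sample positive when p >= cutoff
    (the handling of p == cutoff is an assumption)."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV; NaN when a denominator is 0."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def auc(y_true, y_prob):
    """Probability that a random positive scores above a random negative,
    counting ties as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = Pf-positive) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

counts = confusion_counts(y_true, y_prob)
metrics = diagnostic_metrics(*counts)
auc_value = auc(y_true, y_prob)
print(counts, metrics, auc_value)
```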

Package and software

All statistical analyses were performed using R. The machine-learning models were fitted using the caret package (version 6.0.93). The receiver operating characteristic (ROC) curves of the models were drawn using the pROC package (version 1.18.0). We examined the association between model-selected predictors and the odds of malaria. The predictors and their corresponding adjusted odds ratios (AOR), confidence intervals (CI), and p-values are presented. The AOR estimates the multiplicative change in the odds of having malaria per unit increase in the predictor, adjusted for the other predictors in the model; values above 1 indicate increased odds and values below 1 indicate decreased odds. The 95% CI provides a range of values that is likely to contain the true AOR with 95% confidence.
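As a worked illustration of how an odds ratio and its interval relate to a logistic regression coefficient, the Python sketch below exponentiates a coefficient and a Wald-type interval. This is an assumption for illustration only: the text does not state how the intervals were obtained for the penalised model, and the coefficient and standard error used here are hypothetical.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error into an odds
    ratio with a Wald-type confidence interval (z = 1.96 for about 95%)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient: log(2), i.e. the odds double per unit increase.
aor, lower, upper = odds_ratio_ci(beta=math.log(2.0), se=0.25)
print(f"AOR = {aor:.2f}, 95% CI: {lower:.2f} to {upper:.2f}")
```

An interval that excludes 1 corresponds to a coefficient that differs from zero at the chosen confidence level.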

Results

Patients’ characteristics

The training set included samples from 160 patients (28 Pf-positive and 132 Pf-negative; Table 1). The median age of the patients was 41 years. Pf-negative patients tended to be older than Pf-positive patients (p = 0.025). In contrast, Pf-negative patients had lower body weight (p < 0.001), lower height (p = 0.03), and lower body mass index (p = 0.033) than Pf-positive patients. Pf positivity was associated with fever (p = 0.004), headache (p = 0.003), stagnant water at the workplace (p = 0.039), and bushes in the surroundings (p = 0.003). However, no association was observed between Pf positivity and sex, diarrhoea, cough, sore throat, dizziness, muscle pain, stagnant water at home, or mosquito repellent use. The baseline characteristics of patients in the training and test sets were similar (Table 2).

Machine learning models for predicting malaria status

We trained and tested each model and calculated the performance metrics for the training and test sets. The multivariable penalised logistic regression (Fig. 1a) and Bayesian generalised (Fig. 2a) models included patients’ body weight, headache, fever, body mass index, bushes in surroundings, age, vomiting, muscle pain, mosquito repellent, body temperature, sore throat, stagnant water at home, sex, and dizziness as informative features; the Bayesian generalised model additionally included height. The penalised logistic regression model achieved AUC values of 84% on the training set and 83% on the test set (Fig. 1b), and the Bayesian generalised model achieved 84% and 76%, respectively (Fig. 2b). In contrast, the decision tree model included only body weight, body mass index, and bushes in the surroundings as informative features (Fig. 3a), with AUC values of 66% and 69% for the training and test datasets, respectively (Fig. 3b).

Table 1 Patients’ characteristics
Table 2 Baseline characteristics of patients in the training and test sets
Fig. 1 Feature importance plot (a) and ROC curve (b) from the multivariable penalised logistic regression model

Fig. 2 Feature importance plot (a) and ROC curve (b) from the Bayesian generalised model

Fig. 3 Feature importance plot (a) and ROC curve (b) from the decision tree model

Table 3 Performance of penalised logistic regression, Bayesian generalised, and decision tree models for training and test sets

The sensitivity, specificity, PPV, and NPV proportions from the models for the training and test datasets are presented in Table 3. The penalised logistic regression and Bayesian generalised models achieved similar sensitivity, specificity, PPV, and NPV values, outperforming the decision tree model. Comparisons of the AUC and other performance parameters revealed the advantage of the penalised regression model over other models in predicting the malaria class. The optimal parameters of the penalised logistic model were α = 0.025 and λ = 0.002.
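To give a sense of what these two parameters control, the Python sketch below evaluates an elastic-net penalty under the parameterisation commonly used by the glmnet package, λ[α‖β‖₁ + ((1 − α)/2)‖β‖₂²]. That this parameterisation applies here is an assumption, since the text does not name the underlying fitting routine, and the coefficient vector is hypothetical.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Elastic-net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).
    alpha = 1 gives the lasso and alpha = 0 gives ridge regression."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)


# Hypothetical coefficients with the reported alpha and lambda.
penalty = elastic_net_penalty([1.0, -2.0, 0.5], alpha=0.025, lam=0.002)
print(penalty)
```

Under this parameterisation, an α close to 0 means the penalty is dominated by the ridge (squared) term, which shrinks coefficients without forcing many of them to exactly zero.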

Table 4 Adjusted odds ratios (AOR) from the multivariable penalised logistic regression model

Relationships between patient features and malaria

Table 4 presents the adjusted odds ratios (AOR), their 95% confidence intervals (CI), and p-values for the predictors from the penalised logistic regression model. Increased odds of Pf antigen positivity (malaria) were associated with higher body weight (AOR = 4.50, 95% CI: 2.27 to 8.01, p < 0.0001) and higher body temperature (AOR = 1.40, 95% CI: 0.99 to 1.91, p = 0.054). In contrast, decreased odds of Pf antigen positivity were associated with age (AOR = 0.62, 95% CI: 0.41 to 0.90, p = 0.012) and BMI (AOR = 0.47, 95% CI: 0.26 to 0.80, p = 0.006). Patients with bushes in their surroundings (AOR = 2.60, 95% CI: 1.30 to 4.66, p = 0.006) and those who experienced fever (AOR = 2.10, 95% CI: 0.88 to 4.24, p = 0.099), headache (AOR = 2.07, 95% CI: 0.95 to 3.95, p = 0.068), muscle pain (AOR = 1.49, 95% CI: 0.66 to 3.39, p = 0.333), or vomiting (AOR = 2.32, 95% CI: 0.85 to 6.82, p = 0.097) were more likely to test positive for the Pf antigen than those without the respective feature, although the associations with fever, headache, muscle pain, and vomiting did not reach statistical significance. In contrast, male patients (AOR = 0.72, 95% CI: 0.24 to 1.71, p = 0.373) and patients who reported dizziness (AOR = 0.30, 95% CI: 0.05 to 0.94, p = 0.042), stagnant water at home (AOR = 0.26, 95% CI: 0.11 to 0.53, p < 0.0001), or sore throat (AOR = 0.26, 95% CI: 0.01 to 0.55, p < 0.0001) were less likely to test positive for the Pf antigen than female patients or those without dizziness, stagnant water at home, or sore throat, respectively. Surprisingly, patients who used mosquito repellents had higher odds of testing positive for the Pf antigen than those who did not (AOR = 1.78, 95% CI: 0.86 to 3.27, p = 0.128), although this association was not statistically significant.

Discussion

This study used routinely collected sociodemographic, environmental, and clinical data to predict Pf infection. Among the tested models, the penalised logistic regression model exhibited the best performance in predicting malaria status, with training and test AUCs of 84% and 83%, respectively. Our results revealed associations between the presence of Pf (determined by RDT) and body mass index (BMI) (AOR = 0.47, 95% CI: 0.26 to 0.80, p = 0.006), body weight (AOR = 4.50, 95% CI: 2.27 to 8.01, p < 0.0001), dizziness (AOR = 0.30, 95% CI: 0.05 to 0.94, p = 0.042), and sore throat (AOR = 0.26, 95% CI: 0.01 to 0.55, p < 0.0001).

Body weight and BMI have been shown to affect the incidence of Pf malaria [62], which is consistent with our findings. Our results support considering patient BMI and weight when diagnosing Pf malaria, as these factors were strongly associated with the presence of the disease. There have been only a few reports of dizziness and sore throat as clinical signs of Pf malaria [63, 64]. However, changes in antioxidant marker levels and in the activities of several enzymes have been observed in patients with Pf malaria, suggesting that oxidative stress may play a significant role in the disease [65].

Our results also demonstrate a relationship between age and the prevalence of Pf infection, which is consistent with earlier research showing that younger people are more susceptible to malaria [66,67,68,69]. Thus, special interventions should be implemented for younger individuals because they are more vulnerable to Pf infections. In contrast, none of the other demographic features considered in this study was associated with the incidence of Pf infection.

Our findings also revealed associations between Pf antigen positivity (malaria) and some environmental features, such as bushes in the surroundings (AOR = 2.60, 95% CI: 1.30 to 4.66, p = 0.006) and the presence of stagnant water at home (AOR = 0.26, 95% CI: 0.11 to 0.53, p < 0.0001). These findings are in line with previous research demonstrating how environmental elements, such as vegetation and water bodies, might affect malaria transmission [70, 71]. Bushes can serve as breeding grounds for mosquitoes, which are the main carriers of malaria, and can also offer shade and humidity, both of which are conducive to mosquito survival and reproduction. Thus, clearing bushes and other vegetation from the areas surrounding homes and communities can be a useful tactic for lowering the risk of malaria transmission. However, the use of mosquito repellents was not significantly associated with a reduced likelihood of malaria, which is not particularly surprising, as reports have emerged that mosquitoes and other pests have become resistant to some routinely used repellents [72,73,74].

Previous studies [75, 76] reported associations between clinical symptoms, such as fever, vomiting, and headache, and the incidence of falciparum infection. In contrast, our multivariable model revealed no significant associations between the occurrence of Pf and fever, vomiting, or headache, even though each of these symptoms had an adjusted odds ratio above 1. This suggests that, although Pf typically causes symptoms such as fever, vomiting, and headache, these signs or symptoms are non-specific and can be mistaken for other illnesses [77].

In addition to the established factors previously identified in malaria prediction, our study introduces novel features that contribute to the accuracy and utility of the model. By incorporating environmental factors such as the presence of bushes in the surroundings and stagnant water in the home, the model acknowledges the role of the immediate environment in malaria transmission. This recognition of local ecological factors enhances the ability of the model to predict malaria occurrence in specific settings, thus tailoring the results to the unique risks faced by individuals in various regions. Furthermore, our model’s integration of these novel features highlights the importance of a holistic approach to understanding and addressing malaria transmission, which could ultimately lead to more effective intervention strategies.

Another innovative aspect of our study is the application of machine learning techniques to predict malaria occurrence using routinely collected data. By employing penalised logistic regression implemented under nested cross-validation with sequential backward feature selection, our model optimised its predictive power while minimising the risk of overfitting. This data-driven approach facilitates the identification of key predictors of malaria and provides a more precise prediction of malaria risk at an individual level. The use of machine-learning techniques in this context is not novel. Nevertheless, it demonstrates the potential of such models to enhance clinical decision making and resource allocation, particularly in resource-limited settings. This diligent application of machine learning has the potential to transform the way healthcare professionals approach malaria prevention and treatment, ultimately improving patient outcomes and the efficiency of the healthcare system.

Despite the relatively small sample size, which may limit the generalisability of our findings, we employed robust methodologies to strengthen the reliability and validity of our results. Specifically, our use of nested cross-validation for hyperparameter search and sequential backward feature selection mitigated the risk of overfitting, which is a common pitfall in studies with limited data. Consequently, while acknowledging the potential limitations imposed by the sample size, we maintain that our approach and analytical rigour provide a sound foundation for the results of this study.

Our study relied on self-reported symptoms and environmental factors, and we acknowledge that this method can introduce recall bias or misclassification. However, we implemented stringent measures to mitigate these issues and to ensure the accuracy of our data. We addressed recall bias using shorter recall periods. This approach minimised the chances of participants forgetting or misremembering the information, thereby increasing the reliability of their responses. We employed precise and accurate diagnostic techniques to minimise the risk of misclassification, particularly for malaria diagnoses. Routine rapid diagnostic testing, a highly sensitive and specific method for identifying Plasmodium species, was used for all the suspected malaria cases. This strategy greatly reduced the likelihood of misclassifying cases and thus increased the accuracy of our data. Furthermore, we ensured that all the health workers involved in this study were highly experienced and thoroughly trained, which was critical to the robustness of our data collection process. Their expertise minimised potential errors during data collection. Despite these mitigation measures, we recognise that there is always potential for some level of bias in self-reported data. Future studies could consider incorporating additional methods to further reduce bias, such as triangulation of data through multiple data collection methods and sources or using more objective measurements where feasible. Notwithstanding these limitations, our study demonstrates the potential utility of machine learning models using sociodemographic, environmental, and clinical features to predict malaria occurrence.

Conclusion

In conclusion, this study effectively employed penalised logistic regression to classify malaria status as either Pf-positive or Pf-negative. Our findings emphasise the significance of patient characteristics such as age, body weight, and symptoms in malaria diagnosis and management. In addition, stagnant water has been identified as a critical challenge in malaria control, necessitating interventions to address this issue. Implementing strategies such as regular cleaning and removal of stagnant water, community engagement, and promoting the use of insecticide-treated bed nets can help reduce the incidence of malaria. Educating people about risk factors and the need to seek medical attention for symptoms such as fever and headache can further contribute to the decline in malaria cases.

These findings enrich our understanding of the epidemiology of the disease and could potentially help prioritise preventive measures, particularly in resource-limited settings. However, it is crucial to reiterate that this predictive model is not intended to replace laboratory diagnosis. Instead, it is designed to augment laboratory diagnosis by providing an early indicator of potential disease, particularly when resources for comprehensive laboratory testing are limited. Laboratory diagnosis remains the gold standard for identifying malarial infections, and our research aimed to complement this method by providing additional clues that could enhance its predictive power.

We focused on Pf because of its prevalence and severe impact in Nigeria and acknowledge that other malarial species are also relevant. Future studies should consider a more inclusive approach, investigate other Plasmodium species, and include more variables. This could further refine our understanding of the complex epidemiology of malaria in Nigeria and other similar contexts, ultimately leading to more effective strategies for malaria prediction and control. Our study underscores the need for and potential benefits of an integrated, multifaceted approach to predict and control malaria. Our findings support ongoing efforts to combat this disease, enhance the effectiveness of existing strategies, and offer new avenues for future research.

These findings may inform targeted interventions and contribute to the development of more accurate and efficient strategies for malaria prevention and control. In particular, this study may aid in clinical decision-making and resource allocation, particularly in resource-limited settings where traditional diagnostic methods are either unavailable or limited in accuracy. Finally, further research is needed to validate the model in larger and more diverse populations, and to assess its impact on patient outcomes and healthcare system efficiency.