Background

In literature, human microbiome studies have received increasing attention. This domain is considered a potential source for the diagnosis and development of new medical treatments [1]. Several studies aim to identify variations in the gut microbiome and potential biomarkers to diagnose diseases and disorders such as inflammatory bowel disease (IBD) [2,3,4,5], type 2 diabetes (T2D) [6,7,8,9], autism spectrum disorder (ASD) [10,11,12,13], and some types of cancer [14,15,16,17], among others. Microbiome studies have also been used to develop medical treatments and to analyze the responses from patients [18,19,20,21]. Microbiome analysis consists in sequencing the gene encoding 16s ribosomal RNA (rRNA) and compare it with known bacteria sequence databases to identify bacterial members of a microbial population [22]. Several software tools and pipelines are available for this process, such as QIIME2 [23], VSEARCH [24], DADA2 [25], Trimmomatic [26], mothur [27], and FLASH [28]. These software tools allow performing the quality analysis of 16s rRNA raw data (filtering, trimming, chimera removal, merge sequences, taxonomy assignment) to generate Operational Taxonomy Units (OTUs) or Amplicon Sequence Variants (ASVs) and performing statistical analysis on the resulting bacterial taxonomy and abundance.

With the advancements in omics technologies and AI, research focused on the search for potential biomarkers in the human microbiome using machine learning tools has increased, where the use of taxonomy-based feature selection is one of the most common approach [29]. Nowadays, it is common to find research works that aims to find relevant taxonomy-based features and use them as potential biomarkers to apply them in medical conditions such as ASD [30, 31], cardiovascular disease [32], T2D [33, 34], IBD [35,36,37,38], Parkinson [39], and also to analyze the effect of medical treatments [40, 41]. Despite the promising results, several issues can still be found in these studies:

  • Datasets: high dimensional data with a small number of samples are common, usually because of the costs (time and money) associated with data collection from human participants. This causes machine learning models prone to overfitting and biased performance [42].

  • Inconsistent results: Most of the studies use Operational Taxonomy Units (OTUs) in their experiments, and due to the limitations and their inability to be used in independent studies [43, 44], may be the reason for obtaining inconsistent results [29, 45, 46].

  • Reproducibility: Several factors such as the lack of uniform processing methodologies, incomplete or erroneous descriptions of the simulations, incomplete or erroneous dataset documentation, which software version was used, incomplete documentation, or not having the code available for use are responsible for a lack of reproducibility in microbial research [45, 47].

The main objective in this work, is to address the lack of reproducibility by providing a methodology, that considers more than one dataset, that combines a DADA2-based pipeline for 16s rRNA sequences processing and the Recursive Ensemble Feature Selection (REFS) algorithm, previously used in [48]. This methodology also provides an approach to deal with high dimensional data with a small number of samples, inconsistent results, and the lack of uniform processing and analysis methodologies. The effectiveness of the proposed methodology was tested by comparing its results with different feature selection methods. Three experiments were performed analyzing microbiome data related to: Inflammatory Bowel Disease (IBD), Autism Spectrum Disorder (ASD), and Type 2 Diabetes (T2D). The results of these experiments provide valuable insights about the performance of the proposed methodology and its potential application in microbiome research. Further research is needed to confirm these findings and to explore their potential clinical applications.

Results

Autism spectrum disorder (ASD)

Raw data processing

The trimming parameters used for DADA2-based srcipt filtering process and the number of Amplicon Sequence Variants (ASVs) generated for each dataset were:

  • David et al – parameter: trimLeft = 10, 2040 ASVs generated.

  • PRJNA589343 – parameter: truncLen = c(250), 2040 ASVs generated.

  • PRJNA578223 – parameter: truncLen = c(290,220), 18,758 ASVs generated.

Feature selection phase

We selected David et al [49] for discovery following the eligibility criteria. 26 out of the 2040 features resulted after applying the Recursive Ensemble Feature Selection (REFS) algorithm. This means, REFS achieved its highest accuracy (> 0.8) with 26 features, Fig. 1a. The result of the validation module for the selected 26 features was an average AUC of 0.816, which is considered “very good” diagnostic accuracy [50]. The Multilayer Perceptron (MLP) algorithm had the best performance, Fig. 1b.

Fig. 1
figure 1

a The minimum number of features to obtain the higher accuracy, b Plot of the classifier with the best performance in the validation process for discovery dataset David et al, c Plot of the classifier with the best performance in the validation process for PRJNA589343, and d Plot of the classifier with the best performance in the validation process for PRJNA578223

In comparison, we applied the same validation module to the complete 2040 features, the resulting average AUC was 0.41. For feature selection using K-Best, with k = 26, the average AUC was 0.706. The detailed validation results are presented in Table 1. Using the Matthews correlation coefficient (MCC) as additional metric to evaluate the performance of the methodology, REFS achieved better average MCC (0.649) compared with the other feature selection methods, see Table 1.

Table 1 Individual and average AUCs and MCCs from the validation phase and the additional validation approaches applied to the ASD datasets. The standard deviation of each result was excluded to keep the table simple and avoid complexity

Testing phase

We searched the 26 features selected by REFS in the testing datasets, the result was 22 out of 26 for PRJNA589343 and 20 out of 26 for PRJNA578223. We applied the validation module to the features found in both testing datasets. For PRJNA589343 we obtained an average AUC of 0.748 and for PRJNA578223 we obtained an average AUC of 0.74. Both average AUCs corresponds to a “good” diagnostic accuracy [50]. In both cases, the classifier with the best performance was Extra Trees, Fig. 1c,d.

For the comparative analysis, we searched for the 26 features selected by K-Best on each testing dataset, the result was 20 out of 26 for PRJNA589343 and 17 out of 26 for PRJNA578223. We applied the validation module to the features found in both testing datasets. The resulting average AUCs were 0.704 for PRJNA589343 and 0.678 for PRJNA578223. For the 10-time random selection the resulting average AUCs were 0.6278 for PRJNA589343 and 0.6352 for PRJNA578223. The detailed validation results are presented in Table 1. Using the MCC as additional metric for this phase, REFS achieved better performance in both testing datasets with average MCC values of 0.4794 for PRJNA589343 and 0.5071 for PRJNA578223, see Table 1.

Inflammatory bowel disease (IBD)

Raw data processing

The trimming parameters used for DADA2-based srcipt filtering process and the number of Amplicon Sequence Variants (ASVs) generated for each dataset were:

  • PRJEB21504 – parameter: trim = 20 and truncLen = c(160), 1793 ASVs generated.

  • DRA006094 – parameter: trim = 20 and truncLen = c(200), 375 ASVs generated.

  • PRJNA684584 – parameter: trim = 20, 1621 ASVs generated.

Feature selection phase

We selected PRJEB21504 for discovery following the eligibility criteria. 53 out of the 1793 features resulted after applying the Recursive Ensemble Feature Selection (REFS) algorithm. This means, REFS achieved its highest accuracy (> 0.95) with 53 features, Fig. 2a. The result of the validation module for the selected 53 features was an average AUC of 0.936, considered “excellent” diagnostic accuracy [50]. The Multilayer Perceptron (MLP) algorithm had the best performance, Fig. 2b.

Fig. 2
figure 2

a The minimum number of features to obtain the higher accuracy, b Plot of the classifier with the best performance in the validation process for discovery dataset PRJEB2150, c Plot of the classifier with the best performance in the validation process for DRA00609, and d Plot of the classifier with the best performance in the validation process for PRJNA684584

In contrast, we applied the same validation module to the complete 1793 features, the resulting average AUC was 0.718. For feature selection using K-Best, with k = 53, the average AUC was 0.902. The detailed validation results are presented in Table 2. Considering the Matthews correlation coefficient (MCC) as additional metric to evaluate the performance of the methodology, REFS achieved an average MCC value of 0.8715 which is higher than MCC values achieved by the other feature selection methods, see Table 2.

Table 2 Individual and average AUCs and MCCs from the validation phase and the additional validation approaches applied to the IBD datasets. The standard deviation of each result was excluded to keep the table simple and avoid complexity

Testing phase

We searched the 53 features selected by REFS in each testing dataset, the result was 22 out of 53 for DRA006094 and 48 out of 53 for PRJNA684584. After applying the validation module, we obtained an average AUC of 0.778 for DRA006094 and for PRJNA684584 we obtained an average AUC of 0.71. Both average AUCs correspond to a “good” diagnostic accuracy [50]. In this case, the classifier with the best performance was KNeighbors for DRA006094 and Extra Trees for PRJNA684584, Fig. 2c,d.

For the comparative analysis, we searched for the 53 features selected by K-Best on the testing datasets. The result was 21 out of 53 for DRA006094 and 52 out of 53 for PRJNA684584. We applied the validation module to the features found in both testing datasets. The resulting average AUCs were 0.732 for DRA006094 and 0.652 for PRJNA684584. For the 10-time random selection the resulting average AUCs were 0.528 for DRA006094 and 0.5582 for PRJNA684584. The detailed validation results are presented in Table 2. Using the MCC as additional metric for this phase, REFS achieved better performance in both testing datasets with average MCC values of 0.4057 for DRA006094 and 0.3567 for PRJNA684584, see Table 2.

Type 2 diabetes (T2D)

Raw data processing

The trimming parameters used for DADA2-based srcipt filtering process and the number of Amplicon Sequence Variants (ASVs) generated for each dataset were:

  • PRJNA3259311 – parameter: trimLeft = 15, 3316 ASVs generated.

  • PRJNA5545355 – parameter: truncLen = c(400), 3201 ASVs generated.

  • PRJEB53017 - no parameter used, 3672 ASVs generated.

Feature selection phase

We selected PRJNA3259311 for discovery according to the eligibility criteria. 9 out of the 3316 features resulted by using the Recursive Ensemble Feature Selection (REFS) algorithm. Thus, REFS achieved its highest accuracy (> 0.90) with 9 features, Fig. 3a. The result of the validation module for the selected 9 features was an average AUC of 0.79, which is considered “good” diagnostic accuracy [50]. In this case, the Multilayer Perceptron (MLP) the algorithm had the best performance, Fig. 3b.

Fig. 3
figure 3

a The minimum number of features to obtain the higher accuracy, b Plot of the classifier with the best performance in the validation process for discovery dataset PRJNA325931, c Plot of the classifier with the best performance in the validation process for PRJNA554535, and d Plot of the classifier with the best performance in the validation process for PRJEB53017

In comparison, we applied the same validation module to the total 3316 features, the resulting average AUC was 0.494. For feature selection using K-Best, with k = 9, the average AUC was 0.75. The detailed validation results are presented in Table 3. Using the Matthews correlation coefficient (MCC) as additional metric to evaluate the performance of the methodology, REFS achieved better performance, compared with the other feature selection methods, with an average MCC of 0.79, see Table 3.

Testing phase

We searched the 9 features selected by REFS in each testing dataset, the result was 5 out of 9 for both testing datasets. We applied the validation module to the features found in both testing datasets. For PRJNA5545355 we obtained an average AUC of 0.714 and for PRJEB53017 we obtained an average AUC of 0.662. The average AUC for PRJNA5545355 corresponds to a “good” diagnostic accuracy and for PRJEB53017 the average AUC corresponds to a “sufficient” [50]. For both testing datasets, the classifier with the best performance was Extra Trees, Fig. 3c, d.

For the comparative analysis, we searched for the 9 features selected by K-Best on each testing dataset, the result was 4 out of 9 for both testing datasets. We applied the validation module to the features found in both testing datasets. The resulting average AUCs were 0.668 for PRJNA5545355 and 0.582 for PRJEB53017. For the 10-time random selection the resulting average AUCs were 0.5238 for PRJNA5545355 and 0.5154 for PRJEB53017. The detailed validation results are presented in Table 3. Using the MCC as additional metric for this phase, REFS achieved better performance in both testing datasets with average MCC values of 0.4210 for PRJNA5545355 and 0.3429 for PRJEB53017, see Table 3.

Table 3 Individual and average AUCs and MCCs from the validation phase and the additional validation approaches applied to the T2D datasets. The standard deviation of each result was excluded to keep the table simple and avoid complexity

Discussion

In traditional analyses, groups of taxa called Operational Taxonomy Units (OTUs) are generated with sequences that are similar with a percentage of error, usually 3% [43, 44]. Considering this error, it is possible to miss variations (possible mutations) making a specific taxa that could be important in medical applications unable to be analyzed. Using Amplicon Sequence Variants (ASVs) this potential loss can be avoided due to all their properties such as ASVs inferred independently from different studies or different samples can be comparable across studies, reduced need for computation power, and are not limited by incomplete reference databases to mention some of them [43, 44]. ASVs allow individual experiment and the results could be tested and validated in separate datasets in contrast to merging datasets as in pooling analysis [29]. Using our methodology, we are able to achieve a signature of taxa across different datasets. In contrast with [51], where a signature of taxa between the microbiome and the diagnosis of ASD was not found through the analysis of various datasets. To the best of our knowledge, these type of experiments are not reported in the literature. The complete resulting taxa for each experiment is in Tables 1-3 of Additional file 1. Visualization of difference abundance of the results is in Supplementary figures 1-12 of Additional file 2. Finally, for individual AUC and MCC obtained in the Random Selection is in Additional file 3.

Despite the promising results and findings, more research and experimentation should be done with microbiome sequencing because counterexamples can be found that make this methodology ineffective. Such is the case with datasets related to asthma: PRJEB44044 [52], PRJNA601757 [53], and PRJNA913468 [54], where the feature selection and testing phase were inefficient. This was due to the lack of datasets with samples from the same source, the quality of the sequences, the lack of documentation, variations in the technical sequencing equipment used, also known as the batch effect [55, 56]. Thus, this methodology is dependent of the batch effect. Additionally, the experiments must be extended to study the relationship taxa-disease or taxa-disorder for possible medical applications.

Furthermore, from all experiments, it is easy to notice that the classification performance on the discovery dataset is considerably higher than those on the validation datasets. There are two possible explanations for this result. First of all, not all ASV features selected by the proposed methodology on the discovery dataset are found in the validation datasets: thus, the classifiers do not have access to all the information that led to the better performance on the discovery data, resulting in an decreased AUC and MCC. Secondly, the datasets could present differences due to the batch effect. We intentionally did not apply any batch correction methodology in this work, to better isolate and study the results of the proposed methodology.

Conclusion

We developed a methodology for reproducible biomarker discovery for 16s rRNA microbiome sequence analysis, addressing the issues related with high dimensional data with a small number of samples, inconsistent results, the lack of uniform processing and analysis methodologies, and to achieve validations in separate databases. The results from the three experiments show that the proposed methodology achieved better performance (AUC and MCC) compared to K-Best and 10-time random selection methods. This methodology is a first approach to increase reproducibility, to provide robust and reliable results, and further testing needs to be done, as shown by the experiment in Asthma (PRJEB44044, PRJNA601757 and PRJNA913468) described in the discussion section. Nevertheless, the approach to the individual study of ASVs makes possible to identify small variations that can have a positive impact on medical applications. This methodology provides results that hopefully will allow pharmacologists, biologists, and health researchers to direct their efforts to the analysis of a list with a smaller number of individual taxa, instead of thousands of taxa grouped in clusters.

Methods

Methodology

The proposed methodology consists of four phases: (1) dataset selection criteria, (2) raw data processing, (3) feature selection, and (4) testing. In contrast to other methodologies, such as pooling analysis [29, 51], we do not combine more than two datasets to produce a single one to be analyzed. The proposed methodology is oriented to work with Amplicon Sequence Variants (ASVs) because they can be used in independent studies [43, 44]. Using ASVs provide a possible solution to avoid inconsistent results, at the same time, they can help achieve external validation in separate datasets which [29]. For external validations, we recommend working with at least three datasets: one for discovery and the rest for testing. We address the issues of overfitting and biased performance associated with the datasets implementing a nested cross-validation scheme [42]. To provide a reproducible approach in the microbiome research, we document software versions, description about each phase and the necessary code/scripts to perform experiments are available on Github (https://github.com/steppenwolf0/MicrobiomeREFS). An overview of the proposed methodology is illustrated in Fig. 4.

Fig. 4
figure 4

Overview of the proposed methodology. The upper shows the workflow for the dataset selection criteria, raw data processing and feature selection phases. The lower part shows the testing phase workflow

The dataset selection criteria phase involves the selection, download, and extraction of relevant information from metadata (e.g., samples labels). The datasets must meet the following conditions:

  • The databases should be 16s ribosomal RNA (rRNA) amplicon sequencing and belong to the same domain such as disease, disorder, or medication.

  • There should be a minimum of two groups such as a control group and a case group.

  • Each group should have a minimum of 10 samples.

  • The documentation, whether it be metadata or a scientific paper, should clearly specify which group each sample belongs to.

  • Datasets should have the same source of samples such as tissue, feces, or mucosa.

The raw data processing phase involves performing amplicon workflow on the raw data in the selected datasets and generate ASVs (features). We selected the DADA2 pipelineFootnote 1 [25] due to its clear documentation. The DADA2 open-source R package allows to implement the full amplicon workflow on 16s rRNA sequences: filtering, dereplication, sample inference, chimera identification, and merging of paired-end reads [25]. We developed a DADA2-based script in R version 4.1.2, the code editor was RStudio version 2022.07.2 build 576, the DADA2 library version was 1.22.0, the DECIPHER library version was 2.22.0, the BiocManager library version was 1.30.19, and the taxonomy assignment was performed based on the SILVA_SSU_r138_2019Footnote 2 reference database.

The feature selection phase aims to identify features, since we are working with the sequence as feature instead of the taxa because sequence is unique on the dataset, the feature selected should be contained in testing, so one dataset must be selected for discovery. The eligibility criteria for the discovery dataset is the one that contains the shortest sequence length after the raw data processing phase. Once the discovery dataset is selected, we have to perform two processes:

  1. 1.

    The Recursive Ensemble Feature Selection (REFS),which is an algorithm for identifying biomarkers by determining the features that are most effective in differentiating between groups in datasets achieving the highest accuracy with the fewest number of features [48, 57,58,59,60,61,62]. The ensemble is composed by 8 classifiers from the scikit-learn toolbox [63]: Stochastic Gradient Descent (SGD) on linear models, Support Vector Machine classifier (SVC), Gradient Boosting, Random Forest, Logistic Regression, Passive Aggressive classifier, Ridge Classifier and Bagging. To minimize overfitting and biased performance, REFS employs a nested approach within a 10-fold cross-validation scheme, which is a proven solution to yield more accurate and unbiased results, even with a small sample size [42]. REFS was built on python version 3.10.8 using the scikit-learn toolkit version 1.1.3.

  2. 2.

    Validation, to minimize bias selection, we developed a validation module with 5 different classifiers from the scikit-learn toolkit [63]: AdaBoost, Extra Trees, KNeighbors, Multilayer Perceptron (MLP), and LassoCV. This validation module also employs a nested approach within a 10-fold cross-validation scheme. This module must be executed two times: (1) using samples labels, the selected features, and the corresponding abundance, and (2) using samples labels, all features, and the corresponding abundance. The 5 classifiers provides an average value for the area under curve (AUC), that evaluates the effectiveness of a discriminant test. Values approaching to 1.0 indicate excellent performance [50].

These processes should be executed at least 10 times concurrently, to compensate for the stochasticity of some of the classifiers used in the study (e.g. Random Forest) and the internal cross-validation process.

The testing phase involves testing the features selected by using REFS in a minimum of two separate datasets. The selected features must be searched on each testing dataset. Features can be repeated in the testing datasets, so we must follow the next process: if Feature x is present n-times in the testing dataset, the final abundance of Feature x will be the sum of the abundance of those n-occurrences. To validate the features found in each testing dataset, the validation module must be executed one time on each testing dataset using as input the samples labels, the found features, and the corresponding abundance. The AUC is employed as a measure of diagnostic accuracy.

Additionally, we conducted a comparative analysis with two different feature selection methods:

  • K-Best with F-score. This selection method will be applied to the discovery dataset instead of REFS. We used the SelectKbest algorithm from the scikit-learn toolbox which selects the K top-scoring features based on a user-defined metric, with the F-score [63]. The value assigned to K is determined by the number of features obtained using REFS. For instance, if REFS selected 10 features, the value of K would be set to 10.

  • 10-time random selection. This method consists in randomly select a given number of features from all features in each testing dataset. This given number is determined by the number of features found in each testing dataset. For instance, if 8 out of 10 features selected by using REFS were identified in the testing dataset, then 8 features will be randomly selected each time.

The AUC provided by the validation module is used as a metric for comparing the results of the proposed methodology with these two feature selection methods. Additionally, we use the Matthews Correlation Coefficient (MCC) [64] as a metric to evaluate the performance of the methodology as well as for comparison with other feature selection methods.

Datasets

We used a total of nine datasets, with three datasets for each experiment: Autism Spectrum Disorder (ASD), Inflammatory Bowel disease (IBD), and Type 2 Diabetes (T2D), see Fig. 5. Each dataset adhered to the data selection criteria phase. We considered only two groups within each dataset: control and cases. The control group is made up of healthy people or people in remission, the case group is made up of people diagnosed with the medical condition.

Fig. 5
figure 5

Overview of the datasets used for each experiment

The datasets related with ASD are: (1) David et al [49]Footnote 3 it has 117 samples of which 57 belong to the control group and 60 to the case group, (2) PRJNA589343 [65] downloaded from the NCBI public repository,Footnote 4 it has 127 samples of which 50 belong to the control group and 77 to the case group, and (3) PRJNA578223 [66] downloaded from the NCBI public repository, it has 96 samples of which 48 belong to the control group and 48 to the case group.

The datasets related with IBD are: (1) PRJEB21504 [67] downloaded from the NCBI public repository, it has 95 samples of which 66 belong to the control group and 29 to the case group, (2) DRA006094 [68] downloaded from the NCBI public repository, it has 70 samples of which 15 belong to the control group and 55 to the case group, and (3) PRJNA68458  [69] downloaded from the NCBI public repository, it has 103 samples of which 45 belong to the control group and 58 to the case group.

The datasets related with T2D are: (1) PRJNA3259311 [70] downloaded from the NCBI public repository, it has 112 samples of which 84 belong to the control group and 28 to the case group, (2) PRJNA5545355 [71] downloaded from the NCBI public repository, it has 60 samples of which 20 belong to the control group and 40 to the case group, and (3) PRJEB53017 [72] downloaded from the NCBI public repository, it has 94 samples of which 46 belong to the control group and 48 to the case group.