A review of machine learning methods for cancer characterization from microbiome data

Teixeira, Marco; Silva, Francisco; Ferreira, Rui M.; Pereira, Tania; Figueiredo, Ceu; Oliveira, Hélder P.

doi:10.1038/s41698-024-00617-7

Review Article
Open access
Published: 30 May 2024

Volume 8, article number 123, (2024)

npj Precision Oncology

Abstract

Recent studies have shown that the microbiome can impact cancer development, progression, and response to therapies, suggesting microbiome-based approaches for cancer characterization. As cancer-related signatures are complex and implicate many taxa, their discovery often requires Machine Learning approaches. This review discusses Machine Learning methods for cancer characterization from microbiome data. It focuses on the implications of choices undertaken during sample collection, feature selection and pre-processing. It also discusses ML model selection, guiding how to choose an ML model, and model validation. Finally, it enumerates current limitations and how these may be surpassed. Proposed methods, often based on Random Forests, show promising results that are nevertheless insufficient for widespread clinical usage. Studies often report conflicting results, mainly due to ML models with poor generalizability. We expect that evaluating models with expanded, hold-out datasets, removing technical artifacts, exploring representations of the microbiome other than taxonomical profiles, leveraging advances in deep learning, and developing ML models better adapted to the characteristics of microbiome data will improve the performance and generalizability of models and enable their usage in the clinic.

Introduction

Cancer is the second most common cause of death worldwide, with 10M estimated deaths in 2020^1,2. It is a heterogeneous group of diseases that result from complex gene-environment interactions. There is increasing evidence that in addition to the series of hallmarks featured by cancer, the microbiome can have an important impact on cancer phenotypes³. The human microbiome is the community of all microorganisms that colonize the human body, including bacteria, viruses, and fungi⁴. The advent of next-generation sequencing (NGS)⁵ technologies made it possible to study the composition of the human microbiome. The widespread use of omics approaches^6,7, together with efforts such as the Human Microbiome Project⁸, has largely improved our understanding of the impact of the microbiome in health and disease. Knowledge about human-microbial interactions is greater for the microbiome of the gut, since it is easy to non-invasively characterize the microbial communities in fecal samples⁹. Changes in the gut microbiome have been associated with various diseases, including inflammatory bowel disease (IBD)¹⁰, obesity¹⁰, colorectal cancer¹¹, and even anxiety and depression¹².

Considering the relationship between the microbiome and cancer, a growing number of clinical and preclinical studies have revealed that the microbiome can impact cancer development, cancer progression, and response to cancer therapies¹³, suggesting microbiome-based diagnostic, prognostic, and therapeutic approaches. It is now clear that the microbial communities found in the tumor tissues and in the oral and gut microbiomes of patients with various types of cancer are distinct from those found in healthy control individuals^14,15,16. Studies that indirectly derived microbial profiles from human genome and transcriptome samples of large databases, such as The Cancer Genome Atlas (TCGA)¹⁷, also identified cancer type-specific microbiota signatures^18,19. Furthermore, the microbiome found in cancer patients is predictive of their response to chemotherapy and to cancer immune checkpoint therapy^20,21,22.

Machine Learning (ML) models are algorithms that make predictions given a dataset of learnable examples. These models can often uncover complex patterns more easily than human experts²³. Methods based on ML are now ubiquitous in healthcare, with models used in imaging, to manage electronic health records, and for genetic analysis²⁴. As microbiome-based signatures are complex and involve multiple species, ML approaches are an optimal tool for the investigation of relationships between the microbiome and host phenotypes⁹. As such, various studies have used ML models for disease prediction based on the gut microbiome^25,26. Until recently, most of the applications of ML in the field of microbiome and cancer have focused on colorectal cancer and the gut microbiome^27,28,29. However, recent studies using ML approaches have suggested biomarkers for cancer detection, disease prognosis, and response to treatment in different cancer types based on the tumor, gut, and oral microbiomes^{18,30,31,32,33,34,35}. Despite their promising results, these approaches require further improvement in the accuracy of models before widespread use and translation of the findings into the clinic.

This review characterizes the state-of-the-art ML approaches for cancer characterization from microbiome data, analyzing all development steps from sample extraction to inference. It tries to bridge the gap between those familiar with ML and researchers experienced in the microbiome field, and to help the development of robust approaches by highlighting the strengths and pitfalls of methods. Zhou and Gallins³⁶ previously reviewed ML models for host trait prediction, without focusing on cancer, while Cheung and Yu³⁷ focused only on gastrointestinal cancer; however, due to their complexity and recency, cancer-related applications raise challenges that warrant a specific discussion. Therefore, this review focuses its analysis on applications linked to the characterization of multiple types of cancer. Furthermore, we aim to analyze ML methods and their performance in greater detail.

We searched PubMed, Scopus, and IEEE Xplore for relevant publications from 1 January 2015 up to 1 May 2023. Details on the criteria for inclusion are available in Supplementary Methods 1; Supplementary Fig. 1 contains the number of articles included in each stage of selection. Supplementary Table I holds all full-text articles assessed, characterized by the ML method used. In this manuscript, we start by discussing the sample collection and processing strategies for microbiome characterization, and how to address contamination. Then, we assess the possible input microbiome-derived features for an ML model and the feature selection and transformation methods used in this domain. We analyze the most popular ML models and promising deep learning approaches, as well as how to evaluate and validate the ML models developed; we address current limitations, discussing why ML-based findings regarding the association between cancer and the microbiome are often contradictory. Finally, we present future research directions and discuss how to circumvent some of the current pitfalls. Figure 1 contains an overview of all the steps needed to develop and validate an ML model using microbiome data for cancer characterization.

**Fig. 1: Steps and decisions undertaken when developing an ML model for cancer characterization with microbiome data.**

Sample collection, processing, and decontamination

There are various methods for microbiome characterization, but the most widely used are based on nucleic acids. In these methods, the molecular analysis of the microbiome starts with the acquisition of genetic material from a sample. DNA or RNA isolation can be achieved from several sources, such as fecal samples, mucosal swabs, tissue biopsies, and blood. The selection of sample type is important, as different sources can provide different information about the microbiome and its interactions with the human host. For example, tumor biopsies enable the characterization of the local microbiome, which can provide information on how the microbiome directly influences tumor development or progression. On the other hand, sampling the microbiome at a distant location from the tumor can be indicative of dysbiosis in cancer patients but may not provide information on the interaction between the microbiome and the tumor. Some sample types are easier to acquire than others. Fecal, oral, and blood samples are easy to collect, as they can be sampled with non-invasive or minimally invasive methods⁹, whereas tissues or biopsies require invasive procedures, making sample collection more difficult and demanding.

All types of samples are prone to external contamination with microorganisms, which alters the microbiome composition of the sample, leading to biased results. Two main types of contamination can influence microbiome studies: contamination by external DNA (due to contaminants from the local environment where samples are collected, and during nucleic acid isolation and amplification) and cross-contamination during sample processing. Contamination can be controlled by acting on two levels: during sample processing and during sequencing data analysis. To account for potential external contamination, a series of control samples can be prepared by adding all reagents but no genetic material. During sequencing data analysis, microbial contaminants can be removed by comparing the microbial content of each sample with that observed in the control samples. Extensive decontamination after sequencing has been performed when mining sequencing datasets of the TCGA project for reads of microbial origin. For example, The Cancer Microbiome Atlas (TCMA) provided curated and decontaminated microbiota profiles for oropharyngeal, esophageal, gastrointestinal, and colorectal cancer tissues, after decontamination of species equiprevalent across sample types³⁸. In other cases, contaminants were identified by comparing lists of commonly found microorganisms in laboratory reagents and kits with those identified in samples¹⁸. Other contaminants may be identified by assuming that the abundance of a contaminant is inversely correlated with sample concentration: if the sample concentration is small but the abundance of an organism is unusually high, that organism is likely to be a contaminant³⁹.
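The concentration-based heuristic described above can be illustrated with a minimal sketch. The function names, the toy values, and the correlation threshold of -0.7 are assumptions made for illustration only; published decontamination tools fit more careful statistical models than a plain Pearson correlation.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def flag_contaminants(rel_abundance, dna_concentration, threshold=-0.7):
    """Return taxa whose relative abundance falls as DNA concentration rises.

    The threshold is a hypothetical cut-off, not a recommended value.
    """
    flagged = []
    for taxon, abundances in rel_abundance.items():
        if pearson(abundances, dna_concentration) <= threshold:
            flagged.append(taxon)
    return flagged

# Toy data: four samples with increasing DNA concentration (hypothetical values).
dna_concentration = [1.0, 2.0, 4.0, 8.0]
rel_abundance = {
    "taxon_A": [0.40, 0.20, 0.10, 0.05],  # shrinks as input DNA grows
    "taxon_B": [0.10, 0.12, 0.09, 0.11],  # roughly independent of input DNA
}
flagged = flag_contaminants(rel_abundance, dna_concentration)
```

In this toy table only `taxon_A` is flagged, since its abundance is highest in the samples with the least input DNA.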

These in silico approaches do not replace well-designed datasets for microbiome analysis, with adequate controls and minimal handling of samples. One study implementing such measures in breast cancer biopsies demonstrated their efficacy, as only low and moderate levels of contamination were found³⁰. Stringent decontamination may also improve the separability of breast tumor and normal tissue samples, decreasing the similarity in the feature space between samples from the same patient³⁰. This sampling procedure, specific for microbiome analysis of cancerous tissue, is the exception rather than the rule: for most cancer types, such bespoke datasets are not available, even though their usage may provide more reliable results. In conclusion, when choosing the type of sample to use in cancer-related microbiome analysis, one must consider the research question and the information that the collected sample may provide. Furthermore, care should be taken to avoid contamination during sample collection and processing. Studies generating microbiome data from clinical samples should clearly describe all steps from sample processing to analysis, as well as the types of controls included to handle possible contaminants, so that models for diagnosis and characterization can be properly developed and validated.

Types of data for analyzing the microbiome

One of the basic aspects when characterizing the microbiome is its taxonomic composition. For that, it is necessary to assign DNA/RNA sequences to taxonomic units. A taxonomic unit is a set of related organisms, whose degree of relatedness depends on the type of unit chosen. It is useful to think of taxonomic units at the species level: in this case, relative abundance features are the proportion of organisms of each species in a sample⁴⁰. A pipeline for cancer characterization using ML models based on abundance profiles, from sample collection to model validation, is shown in Fig. 2. Although some studies have analyzed microbiome profiles at the species level⁴¹, the broader genus level^18,42,43, operational taxonomic units (OTUs), or the amplicon sequence variants (ASVs) are more frequently used^33,44. OTUs group bacteria based on sequence similarity (usually at 97%), while ASVs identify true biological sequences (100% similarity) allowing the discrimination of closely related taxa. On one hand, it is less computationally demanding to determine OTUs or ASVs than to produce taxonomic classifications for each sequencing read⁴⁵. On the other hand, the widely used target amplicon-based sequencing makes a species-level characterization of the microbiome difficult, leading to a preference for profiles at the genus level⁴⁶. Alternatively, some authors have suggested using the abundances of multiple taxonomic levels as input features to an ML model^47,48, or incorporating abundance profiles into phylogenetic trees⁴⁹.

**Fig. 2: Pipeline for ML-based cancer identification from microbial abundance profiles.**

The microbial abundance data most often used when studying the links between the human microbiome and cancer has a set of characteristics that make the development of ML models challenging. Firstly, it has high dimensionality, as every taxon found in at least one sample constitutes a feature. Machine Learning models using a large number of features tend to overfit the training data, showing poor generalization to unseen samples⁵⁰. As such, methods for dimensionality reduction are often advantageous when pre-processing microbiome data, prior to training ML models⁵¹. As a consequence of including one feature per taxon found, microbiome abundance profiles are sparse, with the abundances of a taxonomical unit typically following zero-inflated negative binomial distributions. This may impact the performance of ML and statistical models, leading to overdispersion⁵². The microbial abundance profiles of different samples are also usually obtained with variable sequencing depths. Thus, these data are compositional, as samples should not be characterized by the absolute abundance of each taxon, but rather by their relative counts⁵³. However, it is unclear how this compositional framework impacts the performance of ML models⁵¹.
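As a minimal sketch of two of the properties discussed above, the snippet below converts raw read counts into relative abundances, so that samples sequenced at different depths become comparable, and measures the sparsity of the count table. The function names and the toy counts are hypothetical.

```python
def to_relative_abundance(counts):
    """Scale each sample (row) of raw read counts so that it sums to one."""
    profiles = []
    for row in counts:
        depth = sum(row)  # sequencing depth of this sample
        profiles.append([c / depth if depth else 0.0 for c in row])
    return profiles

def sparsity(counts):
    """Fraction of zero entries in the count table."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for c in row if c == 0)
    return zeros / total

# Two hypothetical samples with the same composition but a tenfold depth difference.
counts = [
    [50, 30, 20, 0, 0],
    [500, 300, 200, 0, 0],
]
profiles = to_relative_abundance(counts)
```

Both rows map to the same relative profile, which is the sense in which only the proportions, and not the absolute counts, carry information.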

Current technologies used to characterize the microbiome allow the quantification of other variables besides the microbial composition profiles. These types of data can improve the performance of ML models over abundance profiles alone. As an example, in ref. ⁵⁴, ML models trained with functional profiles to distinguish fecal samples from colorectal cancer and adenoma patients achieved greater accuracy than classifiers using taxonomical profiles. Similar results have been reported when predicting cachexia in lung cancer patients⁵⁵ or the survival time of patients with neuroblastoma³¹. These functional profiles result from the quantification of the relative abundance of groups of organisms, genes, proteins, and metabolites^46,54. Other types of microbiome data used as input for ML models in cancer-related applications include microbial single nucleotide polymorphisms⁵⁶.

Pre-processing microbial data

The data described in the previous section should be pre-processed to aid ML models in learning robust decision rules. Pre-processing steps may include transformations, which decrease the influence of outliers, normalization⁵⁷, and feature selection; these will be discussed in this section. These pre-processing steps may greatly improve the performance of ML models. However, many ML models proposed for cancer characterization using microbiome data do not use any pre-processing other than data standardization^18,47,58, as it is difficult to predict if such pre-processing steps will improve the performance of ML models before testing them.

Methods for feature transformation

As previously mentioned, taxonomic profiles are heterogeneous, with an abundance of zero-valued features⁵⁹. Thus, these are usually transformed to decrease the importance of dominant features and outliers, improving the performance of classifiers⁶⁰. Log transformations (for which the transformed abundances $x'$ are obtained from the original values $x$ according to $x' = \log(x + c)$, with $c$ known as the pseudocount, taking the value of one for most cases) are commonly used⁶¹; however, the chosen pseudocount value can impact the downstream analysis of the data, and it is unclear how to best choose this parameter⁶². Other transformations include a cube-root normalization proposed in Mulenga et al.⁶³ and rescaling feature vectors to a unitary scale⁴⁸. Mulenga et al.⁶³ further proposed applying two normalization approaches to taxonomic profiles and concatenating the resulting features with the original values. This approach improved classification performance when combined with data augmentation with a variational autoencoder and a deep neural network⁶³.
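The sensitivity to the pseudocount can be seen in a small sketch. The helper names are hypothetical, and the cube-root function below is a plain elementwise cube root, which may differ from the exact normalization used in ref. ⁶³.

```python
import math

def log_transform(values, pseudocount=1.0):
    """Apply x' = log(x + c) elementwise; c is the pseudocount."""
    return [math.log(x + pseudocount) for x in values]

def cube_root_transform(values):
    """Apply x' = x ** (1/3) elementwise to non-negative abundances."""
    return [x ** (1.0 / 3.0) for x in values]

counts = [0, 9, 99, 999]
logged_c1 = log_transform(counts, pseudocount=1.0)
logged_small = log_transform(counts, pseudocount=0.01)

# Distance between an absent taxon (0 reads) and a rare one (9 reads).
gap_c1 = logged_c1[1] - logged_c1[0]
gap_small = logged_small[1] - logged_small[0]
```

With a pseudocount of one, zero counts map exactly to zero; with a smaller pseudocount, the gap between absent and rare taxa widens considerably, which is one way the choice of $c$ can affect downstream analysis.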

HARMONIES⁵⁹ is a more complex transformation method developed for microbiome taxonomic data. This method uses Bayesian models to transform the read counts per taxon, taking different sequencing depths (the number of reads produced in each position in the sample genome) and data sparsity into account. It also considers that one feature may have zero abundance either because that taxon is absent from the sample or because sequencing failed to capture it. However, HARMONIES⁵⁹ has not yet been applied in conjunction with ML approaches. Despite the claims that these methods can better reflect the true microbiome or improve the accuracy of models, the most suitable normalization method is dependent on the original dataset: some transformations, such as those based on feature variance, are better at decreasing the influence of outliers, while normalization approaches preserving feature variance, such as the cube root, may be more effective when outliers are absent^60,63.

Methods for dimensionality reduction

Microbiome datasets have a large number of features, usually greater than the number of samples. Machine Learning models trained on data with a high number of features relative to the number of samples tend to overfit and show high variance, due to the large number of possible relationships exploited to create decision boundaries, most of them spurious⁵⁰. Furthermore, training ML models in these settings is computationally expensive⁶⁴. In these cases, it is recommended to use simpler models that are less likely to overfit the training data, with heavy regularization⁵⁰; alternatively, methods for dimensionality reduction can be used. These include feature extraction methods, which transform the original data into a dataset with fewer features, and feature selection approaches, which select a subset of the original features for training^64,65.

Most approaches to studying the link between cancer and the human microbiome using ML focus on the interpretability of models. Models are used to find biomarkers, such as species linked to a certain cancer type. Because of this, feature extraction methods such as principal component analysis (PCA) or principal coordinates analysis (PCoA), which transform the original feature space, are seldom used. Rather, most of the applied dimensionality reduction methods are filters for feature selection. These are independent of the model and act as a pre-processing step, selecting a subset of features based on the data alone. Typically, filter methods rank features according to a given criterion and select the highest-ranked⁶⁶. These methods only consider the predictive power of one feature at a time, and not the discriminative capability of a set of features, as is the case for most ML models. As such, the features selected by filter methods may not form the optimal feature subset for the performance of the ML model.

Simple filter feature selectors use statistical tests to rank features according to their relevance. One such test is the Student’s t test for independent samples, which tests if a set of samples from two groups originated from distributions with the same mean. In feature selection for binary classification, samples are divided by class. The sample means for each feature and class are compared using a t test and the features with the highest statistic, reflecting a larger difference in means, are kept⁵⁰. However, the t test should only be used when the features are approximately normally distributed. When this is not the case, it is possible to use a Mann-Whitney U test, assessing if the values of a feature are larger in one class than in the other. The Mann-Whitney U test is non-parametric and thus potentially more robust to outliers than the t test. Both tests were used in ref. ⁴⁴ to identify epithelial ovarian cancer from peritoneal fluid microbiota. For multiclass classification, one can resort to a one-way analysis of variance (ANOVA) test. One-way ANOVA tests compare the means of a feature between all classes using the F-distribution. In ref. ⁵⁴, ANOVA was used to select features for colorectal cancer prediction from stool samples.
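A filter of this kind can be sketched in a few lines. The example below ranks features by the absolute pooled-variance t statistic between two classes and keeps the top k; the function names and the toy abundance table are hypothetical, and a rank-based statistic such as Mann-Whitney U could be substituted for the scoring function when normality is doubtful.

```python
from statistics import mean, variance

def t_statistic(a, b):
    """Student's t statistic for two independent samples (pooled variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5

def select_top_k(X, y, k):
    """Rank features by |t| between the two classes and keep the top k indices."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(t_statistic(pos, neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Hypothetical transformed abundances: rows are samples, columns are taxa.
X = [
    [5.1, 0.2, 1.0],
    [4.9, 0.1, 1.2],
    [5.3, 0.3, 0.9],
    [1.0, 0.2, 1.1],
    [1.2, 0.1, 1.0],
    [0.8, 0.3, 1.2],
]
y = [1, 1, 1, 0, 0, 0]
selected = select_top_k(X, y, k=1)
```

Each feature is scored on its own, which is exactly why such filters cannot account for redundancy or for features that are only informative in combination.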

One downside of selectors based on statistical tests is their tendency to select many correlated features: if multiple features are correlated and have a large discriminative power, they are likely to all be selected despite being redundant⁶⁷. Minimum Redundancy - Maximum Relevance (mRMR)⁶⁸ was developed for microarray gene expression data. It selects uncorrelated and informative features. Reflecting its popularity, mRMR has been widely used in cancer characterization from microbiome data^69,70,71. In mRMR, the feature relevance is estimated using information gain (IG) or an F-test between a feature and the target classes, for categorical/discrete and continuous variables, respectively. The first selected feature is that with the highest relevance criterion. For each subsequent iteration, the redundancy of the remaining unselected features is calculated through the sum of Pearson’s correlation coefficients or IG between the tested feature and all those previously selected. The algorithm then selects the feature with the maximum ratio or difference of relevance to redundancy⁶⁸.
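The greedy procedure can be sketched as follows. This is a simplified illustration rather than the reference mRMR implementation: relevance is approximated by the absolute Pearson correlation with the label instead of an F-test or IG, redundancy is the mean (not the sum) of absolute correlations with the already selected features, and the difference criterion is used. The toy data are hypothetical.

```python
import numpy as np

def mrmr(X, y, k):
    """Greedy mRMR with the difference criterion (relevance minus redundancy)."""
    n_features = X.shape[1]
    # Relevance: |Pearson correlation| between each feature and the label.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Pairwise |correlation| between features, used as the redundancy measure.
    feature_corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        remaining = [j for j in range(n_features) if j not in selected]
        scores = [
            relevance[j] - feature_corr[j, selected].mean() for j in remaining
        ]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
f0 = np.array([0.1, 0.2, 0.0, 0.3, 1.0, 0.9, 1.1, 0.8])  # strongly relevant
f1 = np.array([0.2, 0.4, 0.0, 0.6, 2.0, 1.8, 2.2, 1.7])  # near-copy of f0
f2 = np.array([0.5, 0.1, 0.4, 0.2, 0.6, 0.3, 0.7, 0.4])  # weaker, less redundant
X = np.column_stack([f0, f1, f2])
selected = mrmr(X, y, k=2)
```

A pure relevance ranking would keep both near-identical features `f0` and `f1`; the redundancy penalty instead makes the second pick the weaker but less correlated `f2`.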

Linear Discriminant Analysis Effect Size (LEfSe) is another popular method developed for feature selection on metagenomic data⁷². Besides statistical significance, LEfSe also takes biological relevance into account. It starts by discarding all features with p values greater than a threshold according to a Kruskal–Wallis test, which extends the Mann-Whitney U test to a multiclass setting. Thus, it selects the most relevant features for further analysis. Then, the dataset is further divided into sub-classes. These should reflect a biological characteristic of a sample (e.g. the sex of a patient, as done in ref. ⁴³, or other clinical covariates in ref. ³³). For each previously selected feature, all sub-classes in different classes are pairwise compared using an unpaired Wilcoxon rank-sum test. If any comparison results in a p value greater than a threshold, or if there is a difference in sign among the comparisons, the feature is deemed as not biologically robust and excluded. Lastly, Linear Discriminant Analysis (LDA) is used to estimate the effect size of the resulting features and rank them⁷². LDA is trained by preserving classes as targets and using the original features along with covariates as input variables. The effect size is quantified by averaging the difference in class means with the difference in class means along the first linear discriminant axis⁷².

Autoencoders, which are discussed in greater detail in section “Autoencoders”, can also be used as filter feature selectors, despite having rarely been used in cancer-related host trait prediction from microbiome data.

Although used less frequently than filters due to the higher computational cost, feature selection can also be performed using embedded or wrapper methods and surrogate models. Tree-based models, such as Random Forests and Gradient-Boosted Trees, can inexpensively give an importance score (such as Gini importance) to the input features⁷³. The best-ranked features can be used as input data for a more powerful downstream model, responsible for the final classification⁷⁰. However, the features selected by these surrogate models may not be the most adequate for the predictor, as these operate under different constraints and optimization approaches. Wrapper and embedded feature selection methods are dependent on the predictor model⁷⁴. These include sparsity-aware approaches that set some of the model coefficients to zero, such as Least Absolute Shrinkage and Selection Operator (LASSO) regression⁷⁵, as used in refs. ^41,47,61,76, or Recursive Feature Elimination Support Vector Machines (SVM-RFE)⁷⁷, which has been applied in ref. ⁷⁸. Both approaches will be discussed in greater detail in the next section.

The most effective feature selection approach is dependent on the characteristics of the dataset, namely its sparsity and feature correlation, and the model used. One study compared the efficacy of statistical test-based methods, mRMR, a Fast Correlation Based Filter⁷⁹, Conditional Mutual Information, and prior selection with Gradient Boosted Trees for dimensionality reduction in colorectal cancer prediction from metagenomic samples⁷⁰. Statistical test-based methods, along with Gradient Boosted Trees selection achieved the best results when coupled with a Random Forest classifier; however, this may not hold true for other models or data: a feature selection approach should be selected on a case-by-case basis or with cross-validation. The authors also applied an ensemble selection, resulting in an 11% increase of AUC when compared to no feature selection⁷⁰. Therefore, despite being computationally more expensive, ensemble approaches may provide more robust features selected by multiple different methods, and therefore suitable for a wider range of models.

Table 1 contains an overview of the methods discussed in this section, grouped by type (filter, wrapper, or embedded) along with the major advantages and disadvantages of each. In conclusion, despite being frequently used, it is unclear what feature selection method, if any, is best for each learning task. As such, feature selection approaches should be selected according to cross-validation. Alternatively, ensembles of selectors, or ML models capable of performing feature selection, such as deep learning approaches, may be used.
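Choosing a selector by cross-validation can be sketched with scikit-learn, under the assumption that a univariate F-test filter and a Logistic Regression classifier are acceptable stand-ins; the synthetic data, the candidate values of `k`, and the AUC scoring are illustrative choices, not recommendations from the reviewed studies. Placing the selector inside the pipeline means it is refit on the training portion of every fold, so held-out samples never influence which features are kept.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a microbiome table: 80 samples, 200 features, few informative.
X, y = make_classification(
    n_samples=80, n_features=200, n_informative=5, n_redundant=0, random_state=0
)

# Selection lives inside the pipeline, so it is refit on each training fold only.
pipeline = Pipeline(
    [
        ("select", SelectKBest(score_func=f_classif)),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)
search = GridSearchCV(
    pipeline,
    param_grid={"select__k": [5, 20, "all"]},  # "all" means no feature selection
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
best_k = search.best_params_["select__k"]
```

Including `"all"` in the grid lets the same search also answer whether feature selection helps at all for the chosen model.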

Table 1 Methods for microbiome feature selection

Machine Learning models

Machine Learning models take a feature vector $\mathbf{x} \in \mathbb{R}^{M}$ as input, where $M$ is the number of features, and predict a target value $\hat{y}$. This is done by applying a decision function $\phi$ such that $\hat{y} = \phi(\mathbf{x})$. Most ML models try to minimize a loss function that quantifies the difference between the predicted and ground-truth variables⁸⁰. Learning tasks can be divided into classification or regression, when the target variable is categorical or numerical, respectively⁷⁵. When studying the cancer-related microbiome, classification is the most common, mainly encompassing cancer diagnosis or tumor type identification. However, regression tasks can also be found in literature, as in survival time prediction³¹ or predicting the number of tumors formed in murine models⁸¹.

In this section, we will discuss the most popular and promising ML methods applied to cancer identification and characterization from the microbiome. Table 2 shows some examples of these applications and lists the ML models used in each.

Table 2 Examples of applications of ML methods for cancer characterization from microbiome data

Support Vector Machines

Support Vector Machines (SVMs)⁸² are commonly used with microarray data⁸³. Therefore, these are also often employed to identify microbiome-related biomarkers for cancer. In their simplest form, SVMs are binary classification models that define the decision boundary as a hyperplane. When the data is linearly separable, there are infinitely many hyperplanes that perfectly separate the samples according to their class⁷⁵. SVMs select the hyperplane with the largest margin to the training data⁸⁴ and only some of the training samples define the decision boundary; these are called the support vectors⁸⁰.

Linear SVMs were found to achieve comparable performance to non-linear models in colorectal cancer identification from microbial abundance profiles⁸⁵. Nevertheless, SVMs can be adapted to non-linear decision boundaries by extending the original feature space with feature transformations⁵⁰, which results in non-linear decision boundaries in the original feature space and improves separability. Feature extension can be done without increasing the computational load with the so-called kernel trick⁸⁶. The most used kernel in tumor-associated microbiome analysis is the radial basis function (RBF)^85,87, which has achieved good results in colorectal cancer prediction^48,71. However, when kernels are used, the prediction for a sample is based on its kernel distance to the support vectors⁸⁸. Therefore, the decisions of SVMs are more difficult to interpret than with Random Forests or Logistic Regression⁸⁸.
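As a minimal sketch of what the RBF kernel computes, the snippet below evaluates $K(\mathbf{a}, \mathbf{b}) = \exp(-\gamma \lVert \mathbf{a} - \mathbf{b} \rVert^{2})$ for every pair of rows in two matrices. The function name, the value of `gamma`, and the toy points are assumptions for illustration; in practice `gamma` is a hyperparameter that must be tuned.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel(X, X, gamma=0.5)
```

Kernel values decay from one for identical samples towards zero for distant ones, so a kernelized SVM weighs each support vector by its similarity to the sample being classified rather than by individual feature values, which is what makes its decisions harder to interpret.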

SVMs have been used to predict colorectal cancer patient survival time^89,90, cancer prognosis, and drug responses⁷⁸ from microbiome and gene expression data, and to identify colorectal cancer patients using taxonomic profiles^71,85. Furthermore, SVMs are frequently used as benchmarks when evaluating other methods^47,87,91. The popularity of SVMs for cancer-related host trait prediction from the microbiome is due to their widespread availability in ML libraries and the good performance demonstrated by these models in multiple works. In ref. ⁴¹, SVMs outperformed Logistic Regression when predicting various diseases using taxonomic profiles. In a few cases, SVMs achieved better or comparable results to Random Forest classifiers^41,48. However, in most of the experimental settings, SVMs failed to surpass the performance of Random Forests, namely in survival time prediction⁹⁰, identification of cancer types⁶⁹, and colorectal polyp identification⁹².

Because SVMs are binary classifiers, extending them to multiclass classification comes with multiple problems. Usually, a classifier is trained for each class, in a one-versus-rest approach, with the final prediction taken as the one yielding the highest value for the decision function. Because each classifier is trained with different target labels, there is no assurance that these values are scaled equally⁸⁰. Furthermore, this approach is computationally expensive, as it requires training a potentially large number of classifiers. These issues make SVMs rarely used for microbiome-based multicancer prediction. In such an application, an SVM achieved a much worse performance than a Random Forest, a k-Nearest Neighbors, and a Decision Tree classifier⁶⁹, all models that can be directly used in multiclass settings.

Despite all the aforementioned shortcomings, SVMs may prove useful when only a small training set is available. These models are less prone to overfit the training data and can show good generalization performance when the number of features far exceeds that of training samples⁹³. This has motivated their usage in oral carcinoma identification from taxonomic and functional profiles in a study comprising only 38 samples⁹⁴. However, it is unclear if other models could have achieved comparable performance, as no other approaches were tested. Furthermore, this good generalization performance is dependent on an adequate choice of the regularization parameter controlling the width of the decision boundary margins, which requires costly cross-validation⁷⁵.

Using a wrapper, SVMs can be adapted to perform feature selection. As previously mentioned, this is known as Recursive Feature Elimination Support Vector Machine (SVM-RFE)⁷⁷. In SVM-RFE, an SVM is initially trained on all the features. In each iteration, the feature with the smallest contribution according to the model parameters is removed, and the SVM is retrained⁷⁷.
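
The elimination loop described above can be sketched with scikit-learn's generic `RFE` wrapper around a linear-kernel SVM. This is a minimal illustration, not the pipeline of any cited study: the data are synthetic, and the number of retained features (5) and the regularization value `C=1.0` are arbitrary assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for an abundance table: 60 samples x 20 taxa (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
# Assumption for illustration: only the first two features carry signal.
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# A linear kernel exposes per-feature weights (coef_), which RFE uses for ranking.
svm = SVC(kernel="linear", C=1.0)
# step=1 removes a single feature per iteration and retrains the SVM each time.
selector = RFE(estimator=svm, n_features_to_select=5, step=1)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of the retained features
ranking = selector.ranking_                   # 1 = retained; larger = removed earlier
```

Removing one feature per iteration follows the description above but scales poorly with thousands of taxa; a larger `step` trades ranking resolution for speed.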

Decision Tree-based models

Models based on Decision Trees - Random Forests in particular - are the most frequently used ML models when studying the link between cancer and the human microbiome. Decision Trees are sequential models that apply successive rules to yield a final prediction. At each level of the tree, starting from the root, the value of a feature is compared to a threshold learned from the training data. The decision path followed by the model is defined by the result of each comparison and the final prediction is given by the class associated with the leaf reached⁹⁵. When splitting a node, Decision Trees exhaustively search for the threshold and feature achieving the best purity in the child nodes⁵⁰. Purity, expressing the homogeneity in target classes of the samples in a leaf, is commonly quantified using the Gini index or entropy⁹⁶.
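
As a small worked example of the two purity measures, the sketch below computes the Gini index and the entropy of the class labels in a node, and the size-weighted impurity of a candidate split. The function names and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Return the fraction of samples belonging to each class in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [c / total for c in counts.values()]

def gini_index(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node."""
    return -sum(p * log2(p) for p in class_proportions(labels))

def weighted_impurity(left, right, impurity=gini_index):
    """Impurity of a candidate split, weighting each child node by its size."""
    n = len(left) + len(right)
    return (len(left) * impurity(left) + len(right) * impurity(right)) / n

# A perfectly mixed two-class node versus a split that separates the classes.
mixed = ["cancer", "cancer", "healthy", "healthy"]
print(gini_index(mixed), entropy(mixed))
print(weighted_impurity(["cancer", "cancer"], ["healthy", "healthy"]))
```

A tree-growing procedure would evaluate `weighted_impurity` for every candidate feature and threshold and keep the split with the lowest value.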

Random Forests

Decision Trees are prone to exhibit high variability, as small changes in the training data may result in different splitting thresholds. Alterations in the first nodes propagate into changes in the structure of the subsequent levels, greatly impacting the overall classification rules⁵⁰. This leads to high variance and helps explain why Decision Trees are rarely used when studying the link between cancer and the microbiome. Bagging decreases the variance shown by Decision Trees: multiple models are trained using samples and features randomly sampled with replacement from the training set^50,97. Bagging Decision Trees results in a Random Forest; the different trees have slightly different structures, as they were trained on different data⁹⁵. The final prediction is the average of the predictions of the individual trees⁹⁵.

In a work on multicancer identification from microbial abundances, a Random Forest classifier achieved improved generalization performance over a Decision Tree⁶⁹. A similar result was found in ref. ⁸⁵ for colorectal cancer identification. Unlike Decision Trees, Random Forest classifiers are highly popular when studying the association between cancer and the microbiome. This is supported by several published benchmarking experiments in which this model outperformed all other tested algorithms in tasks including identifying colorectal cancer⁸⁵, melanomas in mice⁸⁷, cancer subtypes⁶⁹, and other host traits⁴¹. Random Forests have been applied to predicting the survival time of colorectal cancer patients from gene expression and microbiome taxonomic profiles⁹⁰ and identifying several tumor types such as epithelial ovarian cancer⁴⁴, tonsillar squamous cell carcinoma⁵⁸, lung adenocarcinoma³³, colorectal cancer²⁸, oral squamous cell carcinoma⁹⁸, and in a multiclass classification setting⁶⁹.

Boosting

Boosting is another approach to decrease variance and operates on a similar idea to bagging; however, instead of training the standalone estimators independently, it builds models sequentially⁸⁰ and these are kept shallow⁹⁹. When boosting Decision Trees, each tree is trained with samples drawn from the dataset, weighted according to the performance of the previous classifier on that data point. This way, the next individual model is trained to improve the performance on data that the previous tree failed to predict⁸⁰. The final prediction is taken as a weighted average of the predictions of all the trees, or by majority voting⁸⁰. The improved generalization performance of boosting methods over Decision Trees was also exemplified in ref. ⁸⁵. However, it is unclear whether boosting can achieve better performance than Random Forests in cancer-related host trait prediction from the microbiome. In this same publication, a boosting approach and a Random Forest demonstrated comparable performance⁸⁵, while in ref. ⁸⁷ a boosting model showed a decreased AUC but improved precision and recall over Random Forests. Boosting methods have also been proposed to predict tissue malignancy in breast cancer using bacterial taxonomic profiles from biopsies³⁰ and to identify several tumor subtypes from microbiome data¹⁸.
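
The sample reweighting step can be illustrated with a single AdaBoost-style round. AdaBoost is used here only as one well-known instance of boosting and is an assumption of this sketch; gradient boosting variants fit residuals instead of reweighting samples. Labels are assumed to be coded as -1 and +1, and sample weights are assumed to sum to 1.

```python
from math import exp, log

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost-style round: return (tree_weight, updated_sample_weights)."""
    # Weighted error of the current tree.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    # Weight of this tree in the final vote: larger for more accurate trees.
    alpha = 0.5 * log((1 - err) / err)
    # Increase the weights of misclassified samples and decrease the others.
    new = [w * exp(alpha if t != p else -alpha)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]

# Five equally weighted samples; the current tree misclassifies only the last one.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
weights = [0.2] * 5
alpha, weights = adaboost_round(weights, y_true, y_pred)
```

After this round, the misclassified sample holds half of the total weight, so the next tree is pushed to classify it correctly.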

Interpretability and Explainable Boosting Machines

Besides the excellent performance demonstrated by Random Forests and boosting models in cancer-related microbiome host trait prediction, these models are often used due to their improved interpretability over other black box algorithms such as non-linear SVMs and Neural Networks. For sufficiently shallow Decision Trees, decision rules can be easily analyzed and understood, making finding putative biomarkers easier. Decision tree flowcharts are already widely used in medicine for diagnosis, so they are familiar to doctors and health professionals¹⁰⁰. As Random Forests and boosting combine the predictions of several trees, this direct interpretation of feature contribution is lost. However, it is possible to quantify the importance of each feature by analyzing the number of times a feature is used in a node or the average increase in purity achieved with it⁷³. In ref. ⁴², Random Forests were used to identify acute myeloid leukemia from taxonomic profiles, and the contribution of each taxon for classification was assessed using the average increase in purity. However, these scores do not fully reflect the complex decision rules of the classifier, which depend on the association of multiple features.
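
Impurity-based importance scores of this kind are exposed by scikit-learn's `RandomForestClassifier` through the `feature_importances_` attribute. The sketch below uses synthetic data in which, by construction, only one feature (index 3) determines the label; the data, the number of trees, and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for taxonomic profiles: 200 samples x 10 taxa (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 3] > 0).astype(int)  # assumption: only feature 3 is informative

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Mean decrease in impurity per feature, normalized to sum to 1.
importances = forest.feature_importances_
top_feature = int(np.argmax(importances))
print(top_feature, importances.round(3))
```

As noted above, such scores summarize each feature in isolation and do not describe how features interact in the decision rules.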

With boosting, the link between importance scores and decision rules is even more unclear than for Random Forests, as different trees have a different impact on the final prediction. Explainable Boosting Machines (EBM)¹⁰¹ try to make this association clearer so that the decision of a model is better understood through feature importance scores. EBM is an implementation of Generalized Additive Models¹⁰² that also models pairwise feature interactions, and in which the feature transformations are modeled using a boosting algorithm. When training EBMs, each tree can only use one feature. The set of M features is iterated multiple times, with each tree contributing by a small amount to the final classification. Because of this, a large number of trees is needed in EBMs, making training times longer than for other boosting approaches^103,104. The importance of each feature can be easily assessed by analyzing the importance scores and using heatmaps for pairwise interactions¹⁰⁴. Because each tree can use only one feature and is a weak classifier, these scores provide better insight into the classification rules¹⁰¹. Explainable Boosting Machines have been used to identify colorectal cancer and adenoma from microbial gut functional profiles⁵⁴, allowing greater interpretability without sacrificing accuracy. Glass-box methods such as EBM may help identify novel biomarkers for cancer, based on the human microbiome.

Logistic Regression

Logistic Regression is a linear model for classification that, due to its simplicity, has been widely used in studies investigating the tumor-specific microbiome, mainly for feature selection and benchmarking^47,61. This model assumes that the microbiome data originated from class-conditional Gaussian distributions of equal covariance matrices. The decision function of Logistic Regression, shown in Eq. (1) for binary classification, is the estimated posterior probability of a sample x being of a given class⁷⁵. The model parameters β and β₀ are learned using iterative processes, usually gradient descent with binary cross entropy as a loss function⁷⁵.

$$\phi(\mathbf{x})=\frac{1}{1+e^{\mathbf{x}^{T}\beta+\beta_{0}}} \tag{1}$$

Despite having linear decision boundaries, Logistic Regression classifiers can, in some cases, achieve similar results to non-linear counterparts in microbiome cancer-related host trait prediction: in ref. ⁸⁵, L2 regularized Logistic Regression achieved a comparable AUC to Random Forest and Boosted Trees for colorectal cancer identification. However, this is seldom observed^49,87,91; as examples, Logistic Regression failed to improve colorectal polyp identification over a Multilayer Perceptron and Naïve Bayes classifier⁹², and colorectal cancer identification over a Multimodal Neural Network⁴⁷. Because of this decreased performance, Logistic Regression has mainly been used for feature selection.

L1 and L2 regularization are approaches to decrease model variance by constraining its parameters¹⁰⁵. L2 regularization penalizes the squared value of the model parameters (β and β₀), decreasing their absolute values¹⁰⁶. L1 regularization, also known as Least Absolute Shrinkage and Selection Operator (LASSO), constrains the model parameters so that the sum of their absolute values does not exceed a chosen bound¹⁰⁵. LASSO regression forces some of the parameters to take null values, effectively disregarding the corresponding features during inference⁵⁰. Combining L1 and L2 regularization results in a model known as ElasticNet¹⁰⁷. Therefore, LASSO and ElasticNet are embedded forms of feature selection and have been used to select microbiome features for a more powerful downstream classifier. As the decision boundary is linear, feature contribution for classification can more intuitively be quantified by the corresponding model parameter. The most relevant features are chosen accordingly^44,61. This approach was used for colorectal cancer identification coupled with a Multilayer Perceptron⁵⁷ and General Regression Neural Networks⁶¹. LASSO was also included in the ensemble used for feature selection in ref. ⁴⁴ for ovarian cancer identification from taxonomic profiles and serum tumor marker levels.
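
A minimal sketch of embedded feature selection with an L1-regularized Logistic Regression is given below, using scikit-learn. The synthetic data, the regularization strength `C=0.1` (smaller values give sparser models), and the choice of the `liblinear` solver are assumptions for illustration, not settings taken from the cited works.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for an abundance table: 200 samples x 30 taxa (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
# Assumption for illustration: only features 0 and 1 are related to the label.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The L1 penalty drives the coefficients of uninformative features to exactly zero.
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_lr.fit(X, y)

coefs = lasso_lr.coef_.ravel()
selected = np.flatnonzero(coefs != 0)  # features kept by the model
X_reduced = X[:, selected]             # input for a downstream classifier
```

To avoid data leakage, the selection step should be fitted on training samples only before `X_reduced` is passed to another model.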

The same properties that make Logistic Regression adequate for feature selection also explain its improved interpretability over methods such as Random Forests and Artificial Neural Networks: the contribution of each feature can be quantified by the parameters of the model⁸⁵. This allows researchers to more easily identify potential bacterial biomarkers associated with specific types of cancer⁸⁵. This strategy was used to identify seven genera potentially associated with invasive cervical cancer¹⁰⁸ and to assess feature contributions when identifying colorectal cancer using taxonomic profiles⁸⁵.

Artificial Neural Networks

Artificial Neural Networks (NNs) are complex non-linear ML models that usually achieve better performance than their simpler counterparts¹⁰⁹. To build an NN, multiple activation units are stacked into layers. Activation units take a vector as input and output a scalar, applying an activation function on the linear combination of inputs and a bias parameter. The outputs of a layer are concatenated and fed into the following layer until the last is reached. This last layer is the output layer, as it outputs the final prediction of the model. The layers between the input and output are called hidden layers¹¹⁰. Architectures with more than one hidden layer are considered Deep Neural Networks¹¹⁰. The model parameters are learned using stochastic gradient descent. Because the gradients for the parameters of each activation unit are computed using the chain rule from the output to the input layer, this computation is referred to as back-propagation¹¹⁰.
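
The layer-by-layer computation can be sketched as a forward pass in NumPy. The layer sizes, the ReLU and sigmoid activation functions, and the random (untrained) weights are assumptions chosen for illustration; training by back-propagation is omitted.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, layers):
    """Propagate x through a list of (W, b) pairs; the last pair is the output layer."""
    a = x
    for W, b in layers[:-1]:
        # Each hidden unit applies an activation to a linear combination plus a bias.
        a = relu(a @ W + b)
    W_out, b_out = layers[-1]
    # A single sigmoid output unit yields a probability-like score per sample.
    return sigmoid(a @ W_out + b_out)

rng = np.random.default_rng(0)
n_features = 50                  # e.g., number of taxa (hypothetical)
sizes = [n_features, 16, 8, 1]   # two hidden layers, one output unit
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

X = rng.random(size=(4, n_features))  # four synthetic samples
probs = mlp_forward(X, layers)
```

With two hidden layers, this network already counts as deep under the definition given above.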

Unlike other ML methods, Deep NNs do not require extensive previous feature selection and transformation. These models are complex enough to do the necessary processing of input features¹¹¹. The output vectors of hidden layers can be perceived as mappings of a sample to a transformed feature space, in which class separability is enhanced¹¹⁰. This transformation can be exploited for dimensionality reduction and data augmentation, as we will discuss briefly. Nevertheless, multiple works on the tumor-specific microbiome have used Deep NNs trained with previously selected features^48,87,92. Doing so may hinder performance, as these features may not be ideal given the model.

Because of their complexity and non-linearity, the decision rules of NNs are difficult to understand. Thus, NNs are seen as black box models that make biomarker identification more difficult than Random Forests or linear models. In host trait prediction from microbiome data, works assessing feature importance usually resort to model-agnostic methods: Mulenga et al.⁶³ used L2 regularization, while Arabameri et al.⁶¹ employed Derivative-based Feature Selection. These methods do not provide a complete understanding of the model decisions, nor of the interdependencies between features, as provided by Decision Trees, for instance.

Multilayer perceptrons

The Multilayer Perceptron (MLP) is an NN in which consecutive layers are fully connected, as seen in Fig. 3. Because of their large number of parameters, Multilayer Perceptrons can easily overfit the training data unless regularization techniques, such as dropout and batch normalization, are adopted¹¹². In ref. ⁴⁸, an MLP failed to improve the performance of an RBF-kernel SVM for colorectal cancer prediction using microbiome taxonomic profiles. However, the MLP used features pre-selected using Linear Regression, which may not be ideal for NN classification. Similar findings have also been reported for melanoma identification in mice while employing feature selection⁸⁷. On the other hand, the MLP with prior feature selection used in ref. ⁶¹ achieved higher accuracy but worse sensitivity when compared to a k-Nearest Neighbors classifier and an SVM. Therefore, it is unclear if MLPs can achieve better predictions than those provided by simpler models such as Random Forests and non-linear SVMs.

**Fig. 3: Diagram of an MLP for cancer prediction from microbial abundance profiles.**

One notable approach using MLPs is that of DeepEn-Phy, a model operating on relative abundances incorporated into a phylogenetic tree⁴⁹. This model uses a set of cascaded MLPs trained end-to-end. The phylogenetic trees are divided into levels, from leaves to roots. Each MLP operates in one level, with a level having multiple MLPs. The first MLPs use relative abundances in the leaf nodes and their outputs are fed into downstream MLPs, until reaching the root MLP. This model achieved better results when predicting smoking status and body mass index using the gut microbiome than Random Forest, Gradient Boosted Trees, and PopPhy-CNN, another deep learning model operating on phylogenetic trees¹¹³.

General Regression Neural Networks

General Regression Neural Networks (GRNN)¹¹⁴ are variations of kernel regression networks¹¹⁵. In a GRNN, each of the N activation units applies an RBF centered on one of the N training samples during inference. The output of the model is a weighted sum of the activation units¹¹⁴. One advantage of GRNN over MLPs is the smaller number of hyperparameters: only the spread parameter needs to be specified, while in MLPs one must choose the number of layers and activation units, and the optimization parameters⁶¹. However, the pattern layer grows with the training set, making learning and inference computationally expensive¹¹⁶.
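
The inference step can be sketched in a few lines of NumPy. The sketch assumes the common formulation in which the output is the kernel-weighted average of the training targets with a Gaussian RBF; the function name, the toy data, and the spread value are illustrative assumptions.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_new, spread):
    """Kernel-weighted average of training targets (one RBF unit per training sample)."""
    # Squared Euclidean distances between new and training samples: (n_new, n_train).
    diff = X_new[:, None, :] - X_train[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Pattern layer: one Gaussian RBF activation per training sample.
    weights = np.exp(-sq_dist / (2.0 * spread ** 2))
    # Summation and output layers: normalized weighted sum of the targets.
    return weights @ y_train / weights.sum(axis=1)

# Two training samples with targets 0 (e.g., healthy) and 1 (e.g., cancer).
X_train = np.array([[0.0, 0.0], [2.0, 0.0]])
y_train = np.array([0.0, 1.0])
# One query halfway between both samples, one identical to the first sample.
X_new = np.array([[1.0, 0.0], [0.0, 0.0]])
pred = grnn_predict(X_train, y_train, X_new, spread=0.5)
```

The distance matrix has one column per training sample, which makes the growth in cost with the size of the training set explicit.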

This model was used to detect colorectal cancer from taxonomic profiles of the gut microbiome, having outperformed an MLP and an RBF kernel SVM⁶¹. According to the authors, GRNNs can achieve better accuracy than MLPs, even with a small number of training samples.

Multimodal Neural Networks

Multimodal NNs combine MLPs to integrate features from different modalities. In one cancer and microbiome-related application, a multimodal NN was used to identify several diseases, including colorectal cancer, using taxonomic and functional profiles from the gut microbiome⁴⁷. MLPs were trained using features from each modality alone. Embeddings were derived by concatenating the outputs from the last hidden layers of each modality. The multimodal NN achieved improved accuracy over MLPs that used just one of the feature types; it also outperformed Random Forest, an SVM, Gradient Boosted Trees, and LASSO regression for colorectal cancer prediction⁴⁷. This work shows the potential of combining multiple types of data in increasing the accuracy of tumor diagnosis and introduces multimodal NNs as a promising method to integrate various modalities of features that complementarily capture the complex mechanisms that underlie cancer development.

Autoencoders

The outputs of hidden layers of NNs can be interpreted as mappings of the input vectors into a latent space. If a layer has fewer than M activation units, with M the number of input features, its outputs can be used as compressed representations of samples⁷⁵. Autoencoders are NNs that try to reconstruct the input data. Their goal is not to achieve useful predictions; rather, autoencoders learn representations of the input vectors in a latent space¹¹⁰.

Due to the high dimensionality of microbiome data, autoencoders can be useful in avoiding the curse of dimensionality. However, despite having been proposed for dimensionality reduction in metagenomic studies¹¹⁷, autoencoders are rarely used in cancer-related host trait prediction from the microbiome. In ref. ¹¹⁸, autoencoders coupled with an MLP failed to improve the AUC in identifying colorectal cancer from the gut microbiome over MetAML, which uses Random Forests, SVMs, LASSO regression, or an ElasticNet¹¹⁸. However, it is unclear if this is due to the classifier used or the dimensionality reduction strategy.

Autoencoders can be adapted to generate synthetic samples. Because data augmentation techniques increase the number of training samples, they decrease the variance of a model, assuming that the synthetic samples are derived from a similar distribution to the original data¹¹⁹. This is particularly important for models that are prone to overfit the training data, such as deep NNs. While studying the cancer-related microbiome, this issue is made worse due to the scarcity of data. Therefore, data augmentation can be a powerful aid in the development of more accurate ML models acting on microbiome data. In ref. ⁶³, data augmentation using Variational Autoencoders (VAE) improved the predictions of an NN classifier for colorectal cancer using gut microbial abundances, with an increase in the AUC of up to 30%.

Choosing the right ML model for the task

The best choice for a microbiome-based ML model for cancer-related applications depends on the diversity, quality, and quantity of the available data and on the learning task. Table 3 lists the main strengths and weaknesses of ML models discussed in this review and is intended to aid researchers in selecting an ML model for their particular use cases. Nevertheless, it is difficult to know a priori which model will achieve the best performance, and testing multiple models is recommended.

Table 3 Advantages and disadvantages of ML methods for cancer characterization from microbiome data

As previously mentioned, Random Forests are highly popular, as they achieve a good balance between interpretability and performance for most applications. When inference performance is of the utmost importance (for instance, in cancer screening), more complex methods such as Boosted Trees or NNs may be needed. However, this implies a higher computational cost for training when compared to simpler models. The decision rules of Boosted Trees and NNs are also difficult to interpret, making it difficult to validate the decisions of the models and to identify characteristics of the microbiome associated with a given prediction. Furthermore, in the case of deep NNs, any increase in performance and generalizability over Random Forests and other approaches requires large amounts of training data, which are usually not available in cancer-associated microbiome studies. When the datasets available for training are small, SVMs are likely a good choice^93,94; however, these models require costly hyperparameter tuning and do not natively support multiclass classification, hindering their application for multicancer identification⁸⁰.

When the goal is to identify microbiome-based cancer biomarkers, being able to derive meaning from the decisions of the ML models is more important than achieving the best performance possible. Therefore, in these use cases, simpler and more interpretable models (e.g. Decision Trees and Logistic Regressors) may be the best choice. Both Decision Trees and Logistic Regressors have widely available implementations and their decision rules are easy to analyze. However, Decision Trees tend to show high variance and poor generalizability⁵⁰, while Logistic Regressors are limited to linear decision boundaries, which may be unsuited for some learning tasks⁹².

Model validation

After training an ML model, its performance should be properly evaluated to ensure its generalizability and to avoid biases that are neither related to the microbiome nor clinically relevant, such as those caused by technical artifacts. Models should be tested with a hold-out dataset, independent of that used for training. Including information about the training data in the test dataset, also known as data leakage, prevents an accurate assessment of how the model will behave with previously unseen data. This can lead to an overestimation of model performance and generalizability¹²⁰. Thus, when studying the links between cancer and the human microbiome, ML model validation should ideally include evaluating the model performance using data generated in a slightly different manner (e.g., from another institution) than that used for training, and excluding patients in the training set. When such data is not available, cross-validation can be used as an alternative. In cross-validation, the training dataset is split into k subsets of the same size. The ML model is trained and evaluated k times, each time using a different subset for validation and the remaining k − 1 for training. The performance metrics are then averaged to yield a final estimate of the performance of the model.
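
The cross-validation procedure described above can be sketched in plain Python. The helper names and the `fit_and_score` callback (which would train a model on the training indices and return a metric on the validation indices) are hypothetical and only illustrate the bookkeeping.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def cross_validate(fit_and_score, n_samples, k=5):
    """Average a user-supplied score over the k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Each sample is used for validation exactly once across the k iterations.
for train_idx, val_idx in k_fold_indices(10, k=5):
    print(sorted(val_idx), len(train_idx))
```

When several samples come from the same patient, the split should be made at the patient level so that no patient contributes to both training and validation folds.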

The same care should be taken when selecting hyperparameters. To ensure that the selected values are those that maximize the performance of a model with previously unseen data, these should be selected through cross-validation¹²¹.

It is key to also avoid data leakage in the feature transformation and dimensionality reduction steps, as any inference about the structure of the data based on the test set or the entire dataset may compromise generalizability. As an example, if using LEfSe with microbial abundance profiles, only the abundances of samples in the training dataset should be used to evaluate statistical significance and effect sizes. However, promising bacterial biomarkers are often chosen through statistical tests on the whole dataset, before being used as input features to an ML model. This feature selection based on all samples may lead to the overfitting of the dataset and poor generalizability. In other cases, the publications are not clear about the validation strategies employed and the possibility of data leakage, making it difficult to assess the generalization of their conclusions.
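
The effect of selecting features on the whole dataset can be illustrated with pure noise data, where no real association exists. The sketch below uses scikit-learn and swaps in a univariate ANOVA F-test (`SelectKBest` with `f_classif`) as a simple stand-in for tools such as LEfSe; the data dimensions, the number of selected features, and the classifier are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))  # pure noise: 60 samples x 2000 features
y = np.array([0, 1] * 30)        # labels unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are chosen using all samples, including those later used for validation.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv).mean()

# Leak-free: the selection step is refitted inside each training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_acc = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"leaky: {leaky_acc:.2f}  leak-free: {honest_acc:.2f}")
```

On data of this kind, the leaky estimate typically looks far better than chance even though there is nothing to learn, while the leak-free estimate should stay close to 50%.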

Current limitations and future perspectives

In this section, we discuss the current limitations of research on ML methods for microbiome-based cancer characterization, along with how to address them. Future perspectives in this area are also provided here. Figure 4 contains an overview of the limitations discussed in this section.

**Fig. 4: Overview of needed improvements in microbiome-based identification of cancer.**

Many publications on ML for cancer characterization from microbiome data have contradictory results, with questionable generalization. As an example, in refs. ⁷¹ and ²⁸, the genus Fusobacterium was found to be more abundant in the gut microbiome of colorectal cancer patients; however, this genus was not identified as a possible biomarker in refs. ^18,27 nor ⁶¹. Zhou et al.²⁹ aggregated the proposed microbial biomarkers for colorectal cancer from various publications and found inconsistent results. Nevertheless, the overall findings show a difference between colorectal cancer and the healthy microbiome, although the microorganisms responsible may differ across studies. The findings in ref. ¹⁸ have also been questioned, with the models said to rely on bacteria unlikely to be present in humans to predict cancer types^122,123; however, the authors have since reproduced their findings with a revised methodology^124,125. These conflicting results are caused by three major factors: an inadequate validation of ML models; the usage of small datasets to train and evaluate ML models; and a lack of correction of technical variations.

Besides the lack of agreement in microbial biomarkers, these shortcomings make it difficult to compare the impact of different ML algorithms and pre-processing approaches in model performance - an improvement may not be a result of a different methodology, but due to overfitting. Recently, some studies have focused on improving the generalizability of ML models to detect colorectal cancer, with multiple datasets from different regions and a consistent processing pipeline^18,27. However, this should be extended to all works using ML.

Inadequate ML model validation and the need for larger datasets

In section “Model validation”, we have discussed how to properly evaluate the performance of an ML model, with a focus on ensuring its generalizability. However, as previously mentioned, not all research adheres to these recommendations. Most notably, data leakage is frequent during the feature transformation and dimensionality reduction steps, leading to an overestimation of model performance with unseen data and potentially to non-generalizable results.

Furthermore, to achieve sufficient model generalizability, researchers should ensure that the dataset used during development reflects the characteristics of the population the model will be used on. For instance, any model trained with fecal microbiome data of patients of a specific geographic region is likely to perform poorly for patients of a different area, as geography is known to impact the composition of the gut microbiome¹²⁶. Using large datasets from diverse populations can therefore lead to more generalizable models and in turn to approaches that can be used with a greater number of patients.

However, the creation of these large datasets is costly and time-consuming. Therefore, it is crucial to implement data-sharing protocols between researchers, ensuring that data from patients across the world are integrated into the development and validation of ML models. When possible, collected microbiome data should be made publicly available in online repositories such as NCBI’s SRA¹²⁷. Alternatively, processed data can also be hosted online, although this limits the ways in which these data can be repurposed. Furthermore, researchers should be transparent about how the microbiome data used in their studies was generated and processed, and how models were developed and validated. All code used should be made publicly available, facilitating future analyses and allowing for a more thorough assessment of model validation. Together, these approaches can improve the generalizability of ML models and accelerate the discovery of further links between the human microbiome and cancer.

Generative ML models can also aid in expanding the available datasets through the generation of synthetic data. These models are often based on NNs and include the Autoencoders discussed in section “Autoencoders” and Generative Adversarial Networks (GANs)¹²⁸. However, and despite recent breakthroughs in generative ML models for genomics¹²⁹, these have seldom been used to uncover the association between cancer and the human microbiome.

The effects of technical variations and clinical covariates

The lack of correction for batch effects and data leakage further raises doubts about how the findings of most works can be generalized. Not all differences in the data are linked to biologically relevant processes: rather, some may be due to technical variation or clinical covariates. As an example, the TCGA dataset is known to show strong batch effects linked to sequencing centers^18,130,131,132. In these cases, ML models may be relying on these non-biologically relevant patterns for their predictions. This leads to poor generalizability and puts the findings derived from these models into question.

When it is possible to normalize the data to remove technical variations, this should be done. However, in some cases, the biological variable of interest is perfectly correlated with another covariate. Under these conditions, it is almost impossible to determine which changes are biologically relevant. Therefore, this ambiguity should be avoided, or, at least, control samples enabling the normalization of the data should be acquired. Testing models with hold-out datasets, obtained under different conditions than the ones used to train the models can also avoid spurious associations between cancer and traits of the microbiome by flagging ML models with poor generalizability.

Knowing when an ML model is relying on technical artifacts to make accurate predictions is often difficult due to the black box nature of most models. For this reason, more interpretable models, such as Decision Tree-based approaches, should be preferred. Recent advances in post hoc explainability^133,134,135 may also enable the assessment of more complex decision rules, such as those from deep NNs. With these advances, researchers can analyze how ML models make decisions, detecting when these are relying on patterns that are unlikely to be related to biological phenomena.

Unclear predictive capability of ML models based on microbiome data

The predictive capability of ML models applied to the tumor-specific microbiome is still unclear. For colorectal cancer identification from gut taxonomic profiles, classifiers can currently achieve a promising performance: the GRNN in ref. ⁶¹ had an AUC of up to 85% in an independent test set, while the SVM in ref. ⁷¹ achieved 81% accuracy while using microbial abundance features. These metrics are indicative of good performance; however, they are unlikely to be robust enough for widespread use yet. Care should also be taken regarding the evaluation metrics used, to provide representative estimates of model performance. Analyzing the performance of models is difficult due to the small datasets used for development^18,43, especially for cancer types other than colorectal, which in some cases have as few as 23 patients³⁰. Furthermore, models are often evaluated using a one-versus-rest approach, which exacerbates class imbalance. Evaluation metrics such as accuracy are not adequate for highly imbalanced datasets. Even the AUC may hide the low precision or recall of a classifier when the class distribution is heavily skewed¹³⁶. For instance, despite having an AUC of over 82%, in many cases reaching 100%, the classifiers to distinguish between cancer subtypes using the tumor microbiome in ref. ¹⁸ have an Area Under the Precision-Recall Curve (AUPR) of as little as 36%. To better understand the potential of ML approaches in identifying tumors based on the microbiome, these should be evaluated using metrics adequate for imbalanced datasets such as AUPR and the F-score¹³⁶.
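
The gap between AUC and AUPR under class imbalance can be reproduced with a toy example in plain Python. The scores below are constructed by hand (1000 negatives, 10 positives, with 49 negatives ranked above every positive) and do not come from any cited study; average precision is used as a step-wise estimate of the AUPR.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical classifier output on a heavily imbalanced test set.
neg_scores = [i / 1000 for i in range(1000)]        # negatives spread over [0, 1)
pos_scores = [0.9505 + j * 1e-5 for j in range(10)] # positives ranked below 49 negatives
labels = [0] * 1000 + [1] * 10
scores = neg_scores + pos_scores

auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
print(f"AUC = {auc:.3f}, AUPR = {aupr:.3f}")
```

In this constructed case the AUC is about 0.95 while the AUPR is about 0.10: most of the samples flagged with the highest scores are false positives, which the AUC alone does not reveal.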

Improving the predictive power of ML models applied to cancer microbiome analysis will likely require the implementation of state-of-the-art methods such as NNs, increased regularization, and the aforementioned development of larger bespoke datasets. Despite the increasing popularity of NNs for this purpose, many advances in deep ML architectures are still to be applied. For instance, Autoencoders or attention mechanisms¹³⁷ are seldom used, despite having achieved improved performance over other methods in biological tasks^138,139. We expect these approaches to better model the complex relations between cancer and the human microbiome. However, when using these methods, special care should be taken to avoid overfitting. As such, the current preference for simpler ML models may be explained by a lack of data, which further highlights the need for larger datasets. Furthermore, new ways to constrain the latent representations of the input vectors are needed to avoid overfitting.

Unclear specificity of microbial biomarkers with confounding diseases

It is still unclear whether bacterial biomarkers for cancers found using ML models remain specific in the presence of other diseases. Most of the published research on the tumor-specific microbiome compares the microbiome profiles of cancer patients to those of healthy individuals, excluding those with other diseases^71,98. Works developing models capable of identifying other diseases, such as type 2 diabetes or IBD, only include colorectal cancer^41,47. Excluding confounding diseases that may affect the microbiome is likely to make it simpler to find alterations linked to cancer. However, such biomarkers will not be adequate for widespread clinical usage if they lose their predictive power in the presence of comorbidities. Type 2 diabetes increases the risk of cancer^140,141, so employing biomarkers that cannot distinguish between the two diseases may yield incorrect results for a large fraction of the human population. Several publications have also found that type 2 diabetes is linked to disruptions in the gut microbiome¹⁴². Therefore, it is unlikely that the patterns differentiating the tumor and healthy microbiomes will be similar to those found between cancer and type 2 diabetes. Further research into the differences between the human microbiota in the presence of tumors and other confounding diseases is required to better understand the limitations of using microbiome-derived biomarkers for cancer diagnosis and analysis.

Microbiome features other than taxonomic profiles are largely unused

The vast majority of ML approaches take taxonomic abundance profiles as input data. However, these may not be the most informative characteristics of the microbiome: for instance, gut microbiome functional profiles were shown to provide an improvement in performance over their taxonomic counterparts when distinguishing colorectal cancer from adenoma⁵⁴. Data integration (for instance, using both taxonomic and functional features, as done in ref. ⁴⁷, or coupling the taxonomic profiles with other methods for cancer diagnosis such as fecal occult-blood testing⁷¹) may also improve the classification accuracy in cancer diagnosis, but few publications exploit this for tumors other than colorectal.

The emergence of next-generation sequencing (NGS), allowing high-throughput sequencing and shotgun metagenomics, has enabled the cost-effective functional profiling of the microbiome through technologies such as mRNA sequencing (RNA-Seq)^143,144. Epigenomics analyses were also made possible by NGS¹⁴⁴. Therefore, we expect further work to be undertaken to better understand how these different types of data and biological characteristics relate to one another, and if data integration can yield more powerful methods for cancer diagnosis and characterization.

The need for ML models adapted to the characteristics of microbiome data

The relationships between the host and its microbiome are highly complex, and microbiome data has characteristics that make ML modeling challenging. For one, taxonomic abundance profiles have many zero values and small variations linked to technical phenomena. This leads to the most common ML approaches overfitting the training data. Thus, the next steps in ML for cancer-related applications using microbiome data likely include the development and application of models that are more robust under zero-inflated data with high variability. Several models have already been proposed for this kind of data^145,146,147, some specifically designed for microbial count data¹⁴⁸. Such models should find microbial signatures associated with cancer characteristics while filtering out sporadic associations resulting from technical variations. In conjunction with this, we expect further work on feature selection and dimensionality reduction approaches for microbial count data to improve the performance of ML models for cancer characterization from microbiome data. Again, by preserving only the most biologically relevant features, these approaches could reduce the variability of models.
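As a simple illustration of how zero inflation can be quantified and partly mitigated before modeling, the sketch below measures the fraction of zero entries in a count table and drops taxa detected in too few samples (prevalence filtering). This heuristic is an assumption chosen for illustration, not one of the zero-inflated models cited above; the count table, the taxon names, and the 0.5 threshold are all hypothetical.

```python
def zero_fraction(counts):
    """Fraction of zero entries in a samples-by-taxa count table."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total


def prevalence_filter(counts, taxa, min_prevalence):
    """Keep taxa detected (count > 0) in at least `min_prevalence` of the samples."""
    n_samples = len(counts)
    keep = [
        j for j in range(len(taxa))
        if sum(1 for row in counts if row[j] > 0) / n_samples >= min_prevalence
    ]
    filtered = [[row[j] for j in keep] for row in counts]
    return filtered, [taxa[j] for j in keep]


# Hypothetical count table: rows are samples, columns are taxa.
taxa = ["taxon_A", "taxon_B", "taxon_C", "taxon_D"]
counts = [
    [120, 0, 3, 0],
    [95, 0, 0, 0],
    [210, 1, 7, 0],
    [60, 0, 5, 0],
]

sparsity = zero_fraction(counts)
filtered, kept = prevalence_filter(counts, taxa, min_prevalence=0.5)
print(f"zero fraction = {sparsity:.2f}, kept taxa = {kept}")
```

If such a filter is used within a predictive pipeline, the retained taxa should be determined from the training samples only, so that the held-out samples do not influence feature selection.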

Translation to clinical practice and ethical concerns

Although the performance of ML methods for cancer characterization using microbiome data is currently insufficient for clinical use, it is important to delineate how future models can be incorporated into clinical workflows. So far, most works have focused on discovering microbial biomarkers through the use of ML. These biomarkers, after careful validation, could potentially be the target of diagnostic assays. Blood- or stool-derived microbial signatures, in particular, are promising, as these would enable a non-invasive screening of high-risk patients¹⁸. Patients exhibiting these biomarkers could then be further examined through imaging, histologic, and cytologic exams to arrive at a diagnosis. Microbiome-based biomarkers could also be used to guide treatment by predicting the efficacy of different approaches^32,35,149,150,151. All decisions and diagnoses derived from microbial biomarkers should be discussed and carefully explained to patients, using a patient-centered approach. To this end, it is crucial that patients are aware of the shortcomings of these biomarkers and the uncertainties surrounding the links between cancer and the microbiome.

Although these non-invasive methods could allow for better care and improved survival rates, clinicians are often wary of diagnostic tools derived from ML models due to their black box nature¹⁵². This highlights the need for clinicians to be involved in the development of these tools and for ML models with more understandable decision rules. Furthermore, we should avoid increasing the workload of clinicians when integrating these approaches into the clinical workflow¹⁵².

Finally, the ethical use of ML in cancer diagnosis and treatment requires stringent measures to safeguard patient privacy and ensure data security. These should include anonymization, encryption, and transparency to guarantee that the identities and health information of patients remain protected¹⁵³. This is also important for the regulatory approval of tools integrating ML algorithms using microbiome data. Therefore, balancing patient confidentiality and safety with innovation will require adapting existing frameworks to accommodate the dynamic nature of these technologies and their evolving clinical implications.

The need for interdisciplinary approaches

As we have seen, ensuring adequate ML model development and translating these models or their findings to clinical practice is a complex and interdisciplinary endeavor: proper model training and testing requires familiarity with computer science, as expert knowledge is needed to avoid data leakage and to extract understandable insights into the decision rules of ML models. An inadequate understanding of these concepts may lead to overly optimistic estimates of model accuracy and of the generalizability of findings. Likewise, biologists should be involved in model validation (especially in the assessment of decision rules), ensuring that models make reasonable predictions and do not rely on clinically irrelevant artifacts. Furthermore, clinical and microbial field knowledge may also aid in the development of the ML model itself. Finally, clinicians should be a part of model development and validation, ensuring its applicability. This interdisciplinary approach should be fostered and become the norm rather than the exception.
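One common source of data leakage is selecting features on the full dataset before cross-validation. The minimal sketch below shows the alternative, in which feature selection is repeated inside each fold using the training samples only. The helper names (`k_fold_indices`, `select_top_features`, `leakage_free_folds`), the ranking criterion (absolute difference of class means), and the toy data are hypothetical and serve only to illustrate the principle.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, disjoint test folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def select_top_features(X, y, n_features):
    """Rank features by the absolute difference of class means (illustrative criterion)."""
    n_cols = len(X[0])
    scores = []
    for j in range(n_cols):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    ranked = sorted(range(n_cols), key=lambda j: -scores[j])
    return sorted(ranked[:n_features])


def leakage_free_folds(X, y, k, n_features):
    """Yield (train_idx, test_idx, selected); selection never sees the test fold."""
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        selected = select_top_features(X_train, y_train, n_features)
        yield train_idx, test_idx, selected


# Toy data (assumption): six samples, three features, only feature 0 is informative.
X = [
    [1.0, 5.0, 0.2],
    [9.0, 5.1, 0.1],
    [1.2, 4.9, 0.3],
    [8.8, 5.0, 0.2],
    [0.9, 5.2, 0.1],
    [9.1, 4.8, 0.3],
]
y = [0, 1, 0, 1, 0, 1]

results = list(leakage_free_folds(X, y, k=3, n_features=1))
for train_idx, test_idx, selected in results:
    print(f"test fold {test_idx}: features selected on training data = {selected}")
```

The same principle applies to any step fitted on data, including normalization, contaminant removal, and dimensionality reduction: each should be fitted on the training portion and then applied unchanged to the held-out portion.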

Conclusions

The human microbiome holds great potential for cancer diagnosis, prognosis, and therapies, offering a promising avenue for research. However, the biomarkers derived from this complex system present a challenge in their analysis. To address this, ML methods have emerged as crucial tools and have been widely used in recent breakthroughs to unravel cancer-microbiome relationships. In this review, we described and compared current ML approaches from sample collection to the final prediction. Most approaches have focused on gut-derived taxonomic profiles, mainly for colorectal cancer prediction. Tree-based models are the most used and are recommended when there is a need to understand the decision rules of models; however, recent works suggest that deep learning architectures such as VAEs and multimodal NNs can achieve better performance and offer a great tool for applications requiring excellent inference performance, such as diagnosis. Many published works do not ensure the generalization of their models by avoiding data leakage and confounding technical variables, which leads to conflicting results. Therefore, future research should implement adequate generalization assessments with expanded hold-out datasets and careful removal of technical variations.

Although promising, models developed for cancer types other than colorectal do not yet achieve sufficient performance for clinical usage. Improving this performance will likely require integrating microbial taxonomy profiles with other data types, including gene expression and host characteristics, leveraging advances in sequencing technology. This approach will be critical for a better understanding of the complex interplay between the microbiome and the human host in the context of cancer. We also expect the recent advances in deep learning approaches, along with the application of ML methods better suited to the characteristics of microbiome data, to aid in this endeavor. Finally, data-sharing and interdisciplinary efforts are instrumental in ensuring the development of accurate and generalizable models.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The full-text articles assessed in this review are listed in Supplementary Table 1.

References

Ferlay, J. et al. Estimating the global cancer incidence and mortality in 2018: GLOBOCAN sources and methods. Int. J. Cancer 144, 1941–1953 (2019).
WHO. WHO Methods and Data Sources for Country-Level Causes of Death: 2000-2019 (World Health Organization, 2020).
Hanahan, D. Hallmarks of cancer: new dimensions. Cancer Discov. 12, 31–46 (2022).
Gilbert, J. A. et al. Current understanding of the human microbiome. Nat. Med. 24, 392–400 (2018).
Behjati, S. & Tarpey, P. S. What is next generation sequencing? Arch. Dis. Child. Educ. Pract. Ed. 98, 236–238 (2013).
Jiang, D. et al. Microbiome multi-omics network analysis: statistical considerations, limitations, and opportunities. Front. Genet. 10, 995 (2019).
Jovel, J. et al. Characterization of the gut microbiome using 16S or shotgun metagenomics. Front. Microbiol. 7, 459 (2016).
Turnbaugh, P. J. et al. The Human Microbiome Project. Nature 449, 804–810 (2007).
Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
Glassner, K. L., Abraham, B. P. & Quigley, E. M. M. The microbiome and inflammatory bowel disease. J. Allergy Clin. Immunol. 145, 16–27 (2020).
Chen, W., Liu, F., Ling, Z., Tong, X. & Xiang, C. Human intestinal lumen and mucosa-associated microbiota in patients with colorectal cancer. PLoS ONE 7, e39743 (2012).
Carabotti, M., Scirocco, A., Maselli, M. A. & Severi, C. The gut-brain axis: interactions between enteric microbiota, central and enteric nervous systems. Ann. Gastroenterol. Hepatol. 28, 203–209 (2015).
Helmink, B. A., Khan, M. A. W., Hermann, A., Gopalakrishnan, V. & Wargo, J. A. The microbiome, cancer, and cancer therapy. Nat. Med. 25, 377–388 (2019).
Ferreira, R. M. et al. Gastric microbial community profiling reveals a dysbiotic cancer-associated microbiota. Gut 67, 226–236 (2018).
Flemer, B. et al. The oral microbiota in colorectal cancer is distinctive and predictive. Gut 67, 1454–1463 (2018).
Kartal, E. et al. A faecal microbiota signature with high specificity for pancreatic cancer. Gut 71, 1359–1372 (2022).
Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Poore, G. D. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567–574 (2020).
Rodriguez, R. M., Hernandez, B. Y., Menor, M., Deng, Y. & Khadka, V. S. The landscape of bacterial presence in tumor and adjacent normal tissue across 9 major cancer types using TCGA exome sequencing. Comput. Struct. Biotechnol. J. 18, 631–641 (2020).
Geller, L. T. et al. Potential role of intratumor bacteria in mediating tumor resistance to the chemotherapeutic drug gemcitabine. Science 357, 1156–1160 (2017).
Matson, V. et al. The commensal microbiome is associated with anti–PD-1 efficacy in metastatic melanoma patients. Science 359, 104–108 (2018).
Routy, B. et al. Gut microbiome influences efficacy of PD-1–based immunotherapy against epithelial tumors. Science 359, 91–97 (2018).
Nichols, J. A., Herbert Chan, H. W. & Baker, M. A. B. Machine learning: applications of artificial intelligence to imaging and diagnosis. Biophys. Rev. 11, 111–118 (2019).
Miotto, R., Wang, F., Wang, S., Jiang, X. & Dudley, J. T. Deep learning for healthcare: review, opportunities and challenges. Brief. Bioinf. 19, 1236–1246 (2018).
Liu, W., Fang, X., Zhou, Y., Dou, L. & Dou, T. Machine learning-based investigation of the relationship between gut microbiome and obesity status. Microbes Infect. 24, 104892 (2022).
Radjabzadeh, D. et al. Gut microbiome-wide association study of depressive symptoms. Nat. Commun. 13, 7128 (2022).
Konishi, Y. et al. Development and evaluation of a colorectal cancer screening method using machine learning-based gut microbiota analysis. Cancer Med. 11, 3194–3206 (2022).
Shah, M. S. et al. Leveraging sequence-based faecal microbial community survey data to identify a composite biomarker for colorectal cancer. Gut 67, 882–891 (2018).
Zhou, Z. et al. Human gut microbiome-based knowledgebase as a biomarker screening tool to improve the predicted probability for colorectal cancer. Front. Microbiol. 11, 596027 (2020).
Hogan, G. et al. Biopsy bacterial signature can predict patient tissue malignancy. Sci. Rep. 11, 18535 (2021).
Li, X. et al. The machine-learning-mediated interface of microbiome and genetic risk stratification in neuroblastoma reveals molecular pathways related to patient survival. Cancers 14, 2874 (2022).
Liang, H. et al. Predicting cancer immunotherapy response from gut microbiomes using machine learning models. Oncotarget 13, 876–889 (2022).
Ma, Y. et al. Distinct tumor bacterial microbiome in lung adenocarcinomas manifested as radiological subsolid nodules. Transl. Oncol. 14, 101050 (2021).
Mao, X.-Y. et al. iCEMIGE: integration of CEll-morphometrics, MIcrobiome, and GEne biomarker signatures for risk stratification in breast cancers. World J. Clin. Oncol. 13, 616–629 (2022).
Montassier, E. et al. Pretreatment gut microbiome predicts chemotherapy-related bloodstream infection. Genome Med. 8, 49 (2016).
Zhou, Y.-H. & Gallins, P. A review and tutorial of machine learning methods for microbiome host trait prediction. Front. Genet. 10, 579 (2019).
Cheung, H. & Yu, J. Machine learning on microbiome research in gastrointestinal cancer. J. Gastroenterol. Hepatol. 36, 817–822 (2021).
Dohlman, A. B. et al. The cancer microbiome atlas: a pan-cancer comparative analysis to distinguish tissue-resident microbiota from contaminants. Cell Host Microbe 29, 281–298.e5 (2021).
Davis, N. M., Proctor, D. M., Holmes, S. P., Relman, D. A. & Callahan, B. J. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 6, 226 (2018).
Noecker, C., McNally, C. P., Eng, A. & Borenstein, E. High-resolution characterization of the human microbiome. Transl. Res. 179, 7–23 (2017).
Pasolli, E., Truong, D. T., Malik, F., Waldron, L. & Segata, N. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput. Biol. 12, e1004977 (2016).
Woerner, J. et al. Circulating microbial content in myeloid malignancy patients is associated with disease subtypes and patient outcomes. Nat. Commun. 13, 1038 (2022).
Yang, J. et al. Brain tumor diagnostic model and dietary effect based on extracellular vesicle microbiome data in serum. Exp. Mol. Med. 52, 1602–1613 (2020).
Miao, R. et al. Assessment of peritoneal microbial features and tumor marker levels as potential diagnostic tools for ovarian cancer. PLoS ONE 15, e0227707 (2020).
He, Y. et al. Stability of operational taxonomic units: an important but neglected property for analyzing microbial diversity. Microbiome 3, 20 (2015).
Breitwieser, F. P., Lu, J. & Salzberg, S. L. A review of methods and databases for metagenomic classification and assembly. Brief. Bioinform. 20, 1125–1136 (2019).
Lee, S. J. & Rho, M. Multimodal deep learning applied to classify healthy and disease states of human microbiome. Sci. Rep. 12, 824 (2022).
Zhao, D. et al. A reliable method for colorectal cancer prediction based on feature selection and support vector machine. Med. Biol. Eng. Comput. 57, 901–912 (2019).
Ling, W., Qi, Y., Hua, X. & Wu, M. C. Deep ensemble learning over the microbial phylogenetic tree (DeepEn-Phy). In 2021 IEEE International Conference on Bioinformatics and Biomedicine (IEEE, 2021).
Hastie, T., Tibshirani, R. & Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009).
D’Elia, D. et al. Advancing microbiome research with machine learning: key findings from the ML4Microbiome COST action. Front. Microbiol. 14, 1257002 (2023).
Corsini, N. & Viroli, C. Dealing with overdispersion in multivariate count data. Comput. Stat. Data Anal. 170, 107447 (2022).
Greenacre, M., Martínez-Álvaro, M. & Blasco, A. Compositional data analysis of microbiome and any-omics datasets: a validation of the additive logratio transformation. Front. Microbiol. 12, 727398 (2021).
Casimiro-Soriguer, C. S., Loucera, C., Peña-Chilet, M. & Dopazo, J. Towards a metagenomics machine learning interpretable model for understanding the transition from adenoma to colorectal cancer. Sci. Rep. 12, 450 (2022).
Ni, Y. et al. Distinct composition and metabolic functions of human gut microbiota are associated with cachexia in lung cancer patients. ISME J. 15, 3207–3220 (2021).
Han, S., Zhuang, J., Pan, Y., Wu, W. & Ding, K. Different characteristics in gut microbiome between advanced adenoma patients and colorectal cancer patients by metagenomic analysis. Microbiol. Spectr. 10, e01593–22 (2022).
Mulenga, M., Kareem, S. A., Sabri, A. Q. M. & Seera, M. Stacking and chaining of normalization methods in deep learning-based classification of colorectal cancer using gut microbiome data. IEEE Access 9, 97296–97319 (2021).
De Martin, A. et al. Distinct microbial communities colonize tonsillar squamous cell carcinoma. Oncoimmunology 10, 1945202 (2021).
Jiang, S. et al. HARMONIES: a hybrid approach for microbiome networks inference via exploiting sparsity. Front. Genet. 11, 445 (2020).
Singh, D. & Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 97, 105524 (2020).
Arabameri, A., Asemani, D. & Teymourpour, P. Detection of colorectal carcinoma based on microbiota analysis using generalized regression neural networks and nonlinear feature selection. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 547–557 (2020).
Weiss, S. et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 5, 27 (2017).
Mulenga, M. et al. Feature extension of gut microbiome data for deep neural network-based colorectal cancer classification. IEEE Access 9, 23565–23578 (2021).
Jović, A., Brkić, K. & Bogunović, N. A review of feature selection methods with applications. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 1200–1205 (IEEE, 2015).
Nogales, R. E. & Benalcázar, M. E. Analysis and evaluation of feature selection and feature extraction methods. Int. J. Comput. Intell. Syst. 16, 153 (2023).
Miao, J. & Niu, L. A survey on feature selection. Proc. Comput. Sci. 91, 919–926 (2016).
Jaeger, J., Sengupta, R. & Ruzzo, W. L. Improved gene selection for classification of microarrays. In Pacific Symposium on Biocomputing 2003 (Lihue, 2003).
Ding, C. & Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 3, 185–205 (2005).
Chen, L. et al. Identifying robust microbiota signatures and interpretable rules to distinguish cancer subtypes. Front. Mol. Biosci. 7, 604794 (2020).
Jabeer, A. et al. Identifying taxonomic biomarkers of colorectal cancer in human intestinal microbiota using multiple feature selection methods. In 2022 Innovations in Intelligent Systems and Applications Conference (IEEE, 2022).
Yuan, B. et al. Fecal bacteria as non-invasive biomarkers for colorectal adenocarcinoma. Front. Oncol. 11, 664321 (2021).
Segata, N. et al. Metagenomic biomarker discovery and explanation. Genome Biol. 12, R60 (2011).
Menze, B. H. et al. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinform. 10, 213 (2009).
Venkatesh, B. & Anuradha, J. A review of Feature Selection and its methods. Cybern. Inf. Technol. 19, 3–26 (2019).
Theodoridis, S. Machine Learning: A Bayesian and Optimization Perspective (Academic Press, 2015).
Chen, F. et al. Meta-analysis of fecal viromes demonstrates high diagnostic potential of the gut viral signatures for colorectal cancer and adenoma risk assessment. J. Adv. Res. 49, 103–114 (2022).
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).
Hermida, L. C., Gertz, E. M. & Ruppin, E. Predicting cancer prognosis and drug response from the tumor microbiome. Nat. Commun. 13, 2896 (2022).
Senliol, B., Gulgezen, G., Yu, L. & Cataltepe, Z. Fast Correlation Based Filter (FCBF) with a different search strategy. In 2008 23rd International Symposium on Computer and Information Sciences (IEEE, 2008).
Bishop, C. M. Pattern Recognition and Machine Learning (Springer Verlag, 2006).
Zackular, J. P., Baxter, N. T., Chen, G. Y. & Schloss, P. D. Manipulation of the gut microbiota reveals role in colon tumorigenesis. mSphere 1, e00001–15 (2016).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
Noble, W. S. What is a support vector machine? Nat. Biotechnol. 24, 1565–1567 (2006).
Schuldt, C., Laptev, I. & Caputo, B. Recognizing human actions: a local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004 (IEEE, 2004).
Topçuoğlu, B. D., Lesniak, N. A., Ruffin 4th, M. T., Wiens, J. & Schloss, P. D. A framework for effective application of machine learning to microbiome-based classification problems. MBio 11, e00434–20 (2020).
Camps-Valls, G., Gomez-Chova, L., Munoz-Mari, J., Vila-Frances, J. & Calpe-Maravilla, J. Composite kernels for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 3, 93–97 (2006).
Rossi, M. et al. Gut microbial shifts indicate melanoma presence and bacterial interactions in a murine model. Diagnostics 12, 958 (2022).
Karamizadeh, S., Abdullah, S. M., Halimi, M., Shayan, J. & Rajabi, M. J. Advantage and drawback of support vector machine functionality. In 2014 International Conference on Computer, Communications, and Control Technology (IEEE, 2014).
Kishk, A. et al. A hybrid machine learning approach for the phenotypic classification of metagenomic colon cancer reads based on kmer frequency and biomarker profiling. In 2018 9th Cairo International Biomedical Engineering Conference (IEEE, 2018).
Yang, M. et al. A multi-omics machine learning framework in predicting the survival of colorectal cancer patients. Comput. Biol. Med. 146, 105516 (2022).
Ashraf, F. B., Shafi, M. S. R. & Kabir, M. R. Host trait prediction from human microbiome data for Colorectal Cancer. In 2020 23rd International Conference on Computer and Information Technology (IEEE, 2020).
Dadkhah, E. et al. Gut microbiome identifies risk for colorectal polyps. BMJ Open Gastroenterol. 6, e000297 (2019).
Kotsiantis, S. B., Zaharakis, I. D. & Pintelas, P. E. Machine learning: a review of classification and combining techniques. Artif. Intell. Rev. 26, 159–190 (2006).
Warnke-Sommer, J. D. & Ali, H. H. Evaluation of the oral microbiome as a biomarker for early detection of human oral carcinomas. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2069–2076 (IEEE, 2017).
Kingsford, C. & Salzberg, S. L. What are decision trees? Nat. Biotechnol. 26, 1011–1013 (2008).
Kotsiantis, S. B. Decision trees: a recent overview. Artif. Intell. Rev. 39, 261–283 (2013).
Chen, X. & Ishwaran, H. Random forests for genomic data analysis. Genomics 99, 323–329 (2012).
Zhou, X. et al. The clinical potential of oral microbiota as a screening tool for oral squamous cell carcinomas. Front. Cell. Infect. Microbiol. 11, 728933 (2021).
Ferreira, A. J. & Figueiredo, M. A. T. Boosting algorithms: a review of methods, theory, and applications. In Ensemble Machine Learning, 35–85 (Springer US, 2012).
Podgorelec, V., Kokol, P., Stiglic, B. & Rozman, I. Decision trees: an overview and their use in medicine. J. Med. Syst. 26, 445–463 (2002).
Lou, Y., Caruana, R., Gehrke, J. & Hooker, G. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2013).
Hastie, T. & Tibshirani, R. Generalized additive models: some applications. J. Am. Stat. Assoc. 82, 371–386 (1987).
Lou, Y., Caruana, R. & Gehrke, J. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2012).
Maxwell, A. E., Sharma, M. & Donaldson, K. A. Explainable boosting machines for slope failure spatial predictive modeling. Remote Sens. 13, 4991 (2021).
Ranstam, J. & Cook, J. A. LASSO regression. Br. J. Surg. 105, 1348 (2018).
Ng, A. Y. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Twenty-First International Conference on Machine Learning - ICML ’04 (ACM Press, 2004).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Kang, G.-U. et al. Dynamics of fecal microbiota with and without invasive cervical cancer and its application in early diagnosis. Cancers 12, 3800 (2020).
Goldberg, Y. A primer on neural network models for natural language processing. J. Artif. Intell. Res. 57, 345–420 (2016).
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
Mahmud, M., Kaiser, M. S., Hussain, A. & Vassanelli, S. Applications of deep learning and reinforcement learning to biological data. IEEE Trans. Neural Netw. Learn. Syst. 29, 2063–2079 (2018).
Alzubaidi, L. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data 8, 53 (2021).
Reiman, D., Metwally, A. A., Sun, J. & Dai, Y. PopPhy-CNN: a phylogenetic tree embedded architecture for convolutional neural networks to predict host phenotype from metagenomic data. IEEE J. Biomed. Health Inf. 24, 2993–3001 (2020).
Specht, D. F. A general regression neural network. IEEE Trans. Neural Netw. 2, 568–576 (1991).
Hannan, S. A., Manza, R. R. & Ramteke, R. J. Generalized regression neural network and radial basis function for heart disease diagnosis. Int. J. Comput. Appl. 7, 7–13 (2010).
Al-Mahasneh, A. J., Anavatti, S. G. & Garratt, M. A. Review of applications of Generalized Regression Neural Networks in identification and control of dynamic systems. arXiv https://doi.org/10.48550/arXiv.1805.11236 (2018).
García-Jiménez, B., Muñoz, J., Cabello, S., Medina, J. & Wilkinson, M. D. Predicting microbiomes through a deep latent space. Bioinformatics 37, 1444–1451 (2021).
Oh, M. & Zhang, L. DeepMicro: deep representation learning for disease prediction based on microbiome data. Sci. Rep. 10, 1–9 (2020).
Shorten, C. & Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J. Big Data 6, 60 (2019).
Rosenblatt, M., Tejavibulya, L., Jiang, R., Noble, S. & Scheinost, D. Data leakage inflates prediction performance in connectome-based machine learning models. Nat. Commun. 15, 1829 (2024).
Refaeilzadeh, P., Tang, L. & Liu, H. Cross-validation. In Encyclopedia of Database Systems (eds. Liu, L. & Özsu, M. T.) 532–538 (Springer US, 2009).
Gihawi, A. et al. Major data analysis errors invalidate cancer microbiome findings. mBio 14, e01607–23 (2023).
Gihawi, A., Cooper, C. S. & Brewer, D. S. Caution regarding the specificities of pan-cancer microbial structure. Microb. Genomics 9, 001088 (2023).
Sepich-Poore, G. D. et al. Robustness of cancer microbiome signals over a broad range of methodological variation. Oncogene 43, 1127–1148 (2024).
Sepich-Poore, G. D. et al. Reply to: caution regarding the specificities of pan-cancer microbial structure. Preprint at: https://www.biorxiv.org/content/10.1101/2023.02.10.528049v1 (2023).
Gaulke, C. A. & Sharpton, T. J. The influence of ethnicity and geography on human gut microbiome composition. Nat. Med. 24, 1495–1496 (2018).
Leinonen, R., Sugawara, H., Shumway, M. & on behalf of the International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011).
Yelmen, B. & Jay, F. An overview of deep generative models in functional and evolutionary genomics. Annu. Rev. Biomed. Data Sci. 6, 173–189 (2023).
Yelmen, B. et al. Creating artificial human genomes using generative neural networks. PLOS Genet. 17, e1009303 (2021).
Cavadas, B. et al. Gastric microbiome diversities in gastric cancer patients from Europe and Asia mimic the human population structure and are partly driven by microbiome quantitative trait loci. Microorganisms 8, 1196 (2020).
Lauss, M. et al. Monitoring of technical variation in quantitative high-throughput datasets. Cancer Inf. 12, 193–201 (2013).
Rasnic, R., Brandes, N., Zuk, O. & Linial, M. Substantial batch effects in TCGA exome sequences undermine pan-cancer analysis of germline variants. BMC Cancer 19, 783 (2019).
Ribeiro, M. T., Singh, S. & Guestrin, C. “Why Should I Trust You?”: Explaining the predictions of any classifier. arXiv https://doi.org/10.48550/arXiv.1602.04938 (2016).
Lundberg, S. & Lee, S.-I. A unified approach to interpreting model predictions. arXiv https://doi.org/10.48550/arXiv.1705.07874 (2017).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. arXiv https://doi.org/10.48550/arXiv.1704.02685 (2019).
Japkowicz, N. Imbalanced Learning, 187–206 (John Wiley & Sons, Inc., 2013).
Vaswani, A. et al. Attention is all you need. arXiv https://doi.org/10.48550/arXiv.1706.03762 (2017).
Feng, C. et al. A deep-learning model with the attention mechanism could rigorously predict survivals in neuroblastoma. Front. Oncol. 11, 653863 (2021).
Lin, M. et al. Application of Deep Learning on predicting prognosis of acute myeloid leukemia with cytogenetics, age, and mutations. arXiv https://doi.org/10.48550/arXiv.1810.13247 (2018).
Larsson, S. C., Orsini, N. & Wolk, A. Diabetes mellitus and risk of colorectal cancer: a meta-analysis. J. Natl. Cancer Inst. 97, 1679–1687 (2005).
Tsilidis, K. K., Kasimis, J. C., Lopez, D. S., Ntzani, E. E. & Ioannidis, J. P. A. Type 2 diabetes and cancer: Umbrella review of meta-analyses of observational studies. BMJ 350, g7607–g7607 (2015).
Li, W.-Z., Stirling, K., Yang, J.-J. & Zhang, L. Gut microbiota and diabetes: from correlation to causality and mechanism. World J. Diabetes 11, 293–308 (2020).
Wensel, C. R., Pluznick, J. L., Salzberg, S. L. & Sears, C. L. Next-generation sequencing: Insights to advance clinical investigations of the microbiome. J. Clin. Investig. 132, e154944 (2022).
Satam, H. et al. Next-generation sequencing technology: current trends and advancements. Biology 12, 997 (2023).
Kong, S. et al. Deep hurdle networks for zero-inflated multi-target regression: application to multiple species abundance estimation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI’20 (2021).
Lu, Y. & Liao, Y. STS: A novel deep learning method for zero-inflated crime prediction. In Proceedings of the 2022 4th International Conference on Robotics, Intelligent Control and Artificial Intelligence, RICAI ’22, 1097–1103 (Association for Computing Machinery, 2023).
Wei, M., Liu, R., Wang, Y. J. & Huang, C. SoutheastCon 2023, 901–905 (IEEE, 2023).
Osawa, T., Mitsuhashi, H., Uematsu, Y. & Ushimaru, A. Bagging GLM: improved generalized linear model for the analysis of zero-inflated data. Ecol. Inf. 6, 270–275 (2011).
Liu, B., Chau, J., Dai, Q., Zhong, C. & Zhang, J. Exploring gut microbiome in predicting the efficacy of immunotherapy in non-small cell lung cancer. Cancers 14, 5401 (2022).
Heshiki, Y. et al. Predictable modulation of cancer treatment outcomes by the gut microbiota. Microbiome 8, 28 (2020).
Stein-Thoeringer, C. K. et al. A non-antibiotic-disrupted gut microbiome is associated with clinical responses to CD19-CAR-T cell cancer immunotherapy. Nat. Med. 29, 906–916 (2023).
Shamszare, H. & Choudhury, A. Clinicians’ perceptions of artificial intelligence: focus on workload, risk, trust, clinical decision making, and clinical integration. Healthcare 11, 2308 (2023).
Doherty, M., Metcalfe, T., Guardino, E., Peters, E. & Ramage, L. Precision medicine and oncology: an overview of the opportunities presented by next-generation sequencing and big data and the challenges posed to conventional drug development and regulatory approval pathways. Ann. Oncol. 27, 1644–1646 (2016).
Qu, K., Gao, F., Guo, F. & Zou, Q. Taxonomy dimension reduction for colorectal cancer prediction. Comput. Biol. Chem. 83, 107160 (2019).
Zheng, Y. et al. Specific gut microbiome signature predicts the early-stage lung cancer. Gut Microbes 11, 1030–1042 (2020).
Chen, M. et al. Carcinogenesis of male oral submucous fibrosis alters salivary microbiomes. J. Dent. Res. 100, 397–405 (2021).
Chen, J.-W. et al. Taxonomic and functional dysregulation in salivary microbiomes during oral carcinogenesis. Front. Cell. Infect. Microbiol. 11, 663068 (2021).
Shrode, R. L. et al. Breast cancer patients from the Midwest region of the United States have reduced levels of short-chain fatty acid-producing gut bacteria. Sci. Rep. 13, 526 (2023).
Wang, N. et al. Identifying distinctive tissue and fecal microbial signatures and the tumor-promoting effects of deoxycholic acid on breast cancer. Front. Cell. Infect. Microbiol. 12, 1029905 (2022).
An, J. et al. Prediction of breast cancer using blood microbiome and identification of foods for breast cancer prevention. Sci. Rep. 13, 5110 (2023).
Uzelac, M., Li, Y., Chakladar, J., Li, W. T. & Ongkeko, W. M. Archaea microbiome dysregulated genes and pathways as molecular targets for lung adenocarcinoma and squamous cell carcinoma. Int. J. Mol. Sci. 23, 11566 (2022).
Banavar, G. et al. The salivary metatranscriptome as an accurate diagnostic indicator of oral cancer. npj Genom. Med. 6, 105 (2021).
Bukavina, L. et al. Global meta-analysis of urine microbiome: colonization of polycyclic aromatic hydrocarbon–degrading bacteria among bladder cancer patients. Eur. Urol. Oncol. 6, 190–203 (2023).
Bang, S. et al. Establishment and evaluation of prediction model for multiple disease classification based on gut microbial data. Sci. Rep. 9, 10189 (2019).
Su, Q. et al. Faecal microbiome-based machine learning for multi-class disease diagnosis. Nat. Commun. 13, 6818 (2022).
Wickramaratne, D., Wijesinghe, R. & Weerasinghe, R. Human gut microbiome data analysis for disease likelihood prediction using autoencoders. In 2021 21st International Conference on Advances in ICT for Emerging Regions (ICter), 49–54 (IEEE, 2021).
Jiang, P., Lai, S., Wu, S., Zhao, X.-M. & Chen, W.-H. Host DNA contents in fecal metagenomics as a biomarker for intestinal diseases and effective treatment. BMC Genomics 21, 348 (2020).
Jiang, P., Wu, S., Luo, Q., Zhao, X.-M. & Chen, W.-H. Metagenomic analysis of common intestinal diseases reveals relationships among microbial signatures and powers multidisease diagnostic models. mSystems 6, e00112–21 (2021).
McDowell, A. et al. Machine-learning algorithms for asthma, COPD, and lung cancer risk assessment using circulating microbial extracellular vesicle data and their application to assess dietary effects. Exp. Mol. Med. 54, 1586–1595 (2022).

Acknowledgements

This work was supported by the Fundação para a Ciência e a Tecnologia (FCT) through national funds: project reference PTDC/BTM-TEC/0367/2021, and grants 2021.05767.BD (F. Silva) and CEECIND/01854/2017 (R.M.F.).

Author information

Authors and Affiliations

Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
Marco Teixeira, Francisco Silva, Tania Pereira & Hélder P. Oliveira
Faculty of Engineering, University of Porto, Porto, Portugal
Marco Teixeira
Faculty of Science, University of Porto, Porto, Portugal
Francisco Silva & Hélder P. Oliveira
Ipatimup - Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal
Rui M. Ferreira & Ceu Figueiredo
Instituto de Investigação e Inovação em Saúde, University of Porto, Porto, Portugal
Rui M. Ferreira & Ceu Figueiredo
Faculty of Sciences and Technology, University of Coimbra, Coimbra, Portugal
Tania Pereira
Faculty of Medicine, University of Porto, Porto, Portugal
Ceu Figueiredo

Authors

Marco Teixeira
Francisco Silva
Rui M. Ferreira
Tania Pereira
Ceu Figueiredo
Hélder P. Oliveira

Contributions

H.P.O., T.P., R.M.F., and C.F. conceived and designed the study. M.T. performed data curation and analysis. F.S., R.M.F., and C.F. provided technical and biological insights. All authors contributed to the discussion. M.T., R.M.F., and C.F. wrote the manuscript. All authors have read and agreed to the final version of the manuscript.

Corresponding author

Correspondence to Marco Teixeira.

Ethics declarations

Competing interests

R.M.F. and C.F. own patent WO/2018/169423 on microbiome markers for gastric cancer. The remaining authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Material

Reporting summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Teixeira, M., Silva, F., Ferreira, R.M. et al. A review of machine learning methods for cancer characterization from microbiome data. npj Precis. Onc. 8, 123 (2024). https://doi.org/10.1038/s41698-024-00617-7

Received: 15 January 2024
Accepted: 17 May 2024
Published: 30 May 2024
DOI: https://doi.org/10.1038/s41698-024-00617-7
Springer Nature Limited
