1 Introduction

The idea for this work came from the authors’ recent involvement in transnational scientific networks and research projects on the microbiome. In the last few years, the concept of applying computer-based algorithms to assess medical problems has become a trending topic. The availability of large amounts of data, often referred to as big data, is a crucial enabling factor for this approach.

This article strives to provide a thorough overview of the use of Artificial Intelligence (AI) techniques in studying the gut microbiota and its role in the diagnosis and treatment of some important diseases.

The term microbiota refers to all the microorganisms living in a given place, while the eukaryotic organism in which the microbiota resides (its habitat) is termed the host [1]. In animals, the site hosting the largest number of microorganisms is the gastro-digestive tract (mainly the large intestine) [2].

The human microbiota is a complex ecosystem consisting of bacteria, viruses, fungi, and protozoa. Its collective gene content, the microbiome, exceeds the human genome more than 100-fold and provides functional properties that we do not possess. According to a recent estimate, the number of bacteria it contains could be higher than the number of eukaryotic cells in the human body [3]: some 30 to 400 trillion microorganisms live in the gastrointestinal tract [4, 5]. Any surface exposed to the external environment, such as the skin and the mucosae (gastrointestinal, respiratory, and urogenital), is populated by commensal microbiota, with the colon containing over 70% of all the bacteria in our body. The entire organism thus assumes an ecological connotation and can be redefined as a network of interactions and connections between various organisms, both eukaryotes and prokaryotes [6].

The gut microbiota has essential dietary and metabolic functions (such as fermentation and digestion of carbohydrates, xenobiotic metabolism, and vitamin synthesis) [7, 8]. It helps to safeguard against pathogens and is also important for the development of the gut-associated lymphoid tissue (GALT) and the maturation of the innate and adaptive immune systems [9, 10]. Commensal bacteria are symbiotic, but they can cause a pathological state after translocating through the mucosa or under specific conditions such as immunodeficiency. The microbiota composition varies substantially between individuals and is also dynamic and susceptible to change. Moreover, although the composition of the human microbiota is strictly personal, the diversity in the structure of the bacterial population between body sites is greater than that between individuals. To date, although more than 50 bacterial phyla have been described, only two of them dominate the normal human gut flora, the Bacteroidetes and the Firmicutes, whereas Actinobacteria, Proteobacteria, Fusobacteria, Verrucomicrobia, and Cyanobacteria appear in minor proportions [11].

Interestingly, a large proportion of the human microbiota (about 70%) is composed of microbes that cannot be cultivated with common microbiological methods. Today, the advent of new molecular microbiota profiling tools, such as Next-Generation Sequencing (NGS) and shotgun metagenomic sequencing, allows us to obtain more information about the impact of the microbiota in both healthy and pathological conditions [12]. As these techniques generate large amounts of data, they have spurred the development of bioinformatics techniques. However, linking the microbiota to diseases and establishing its clinical relevance is still a challenge. In this scenario, advances in AI techniques, such as Machine Learning (ML) and Deep Learning (DL), can help clinicians process and interpret the information that can be extracted from these data sets.

ML and DL are two approaches to AI. However, defining the relation between them is not an easy or unambiguous task. The field of AI is flourishing and its techniques are continuously evolving, resulting in a very dynamic scenario in which the boundaries between one approach and another are blurred. In this paper, for example, DL is considered a subset of ML techniques [13], as Fig. 1 shows. This is, however, not the only possible representation of the field of AI, and the reader can find other interpretations in the scientific literature.

Fig. 1

The figure shows the relation between AI, ML and DL through a Venn diagram

1.1 Machine learning

Since the dawn of the technological era, computers’ capabilities have been exploited for computer gaming and AI. The expression “machine learning” dates back to 1959, when Arthur Samuel first used it in the IBM Journal of Research and Development [14].

The ML approach is grounded in algorithms for solving problems of classification or prediction of patterns from data (regression models); it is common to talk about “learning from data” [15]. A typical distinction is made between supervised and unsupervised ML algorithms. In the former approach, a number of input measures are used to predict the value of the output or to select one of the output classes. Unsupervised learning, instead, aims to “describe associations and patterns among a set of input measures” [15].

We can also say that in supervised algorithms we need some labels used during the training phase to instruct the machine on how to properly interpret the input. A typical label, in medicine, could be the diagnosis (healthy vs. ill, disease A vs. disease B, etc.). Conversely, the unsupervised approach is used to examine the data, searching for structures and patterns previously unknown (no labeled data needed).

As highlighted in a recent review article on machine learning methods for microbiome host trait prediction [17], several ML methods have been applied in recent years to microbiome prediction. Microbiome data are often arranged into Operational Taxonomic Units (OTUs) [16], each collecting similar sequences representing a specific bacterial taxon [17]. A taxonomy is used to represent the relationship among the microbes and each OTU, and this relationship can be exploited in ML for taxonomy-informed feature selection, which facilitates the choice of features to be used as input for ML algorithms. A detailed description of these methods and techniques is beyond the scope of this work, but the reader can find a comprehensive overview in [17].
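
As an illustration of taxonomy-informed dimensionality reduction, the following is a minimal Python sketch that collapses an OTU count table into genus-level relative abundances before it is fed to an ML algorithm. The OTU identifiers and genus assignments are hypothetical, and this is not the specific procedure of any reviewed study.

```python
import pandas as pd

# Hypothetical OTU count table: rows = samples, columns = OTU identifiers.
otu_counts = pd.DataFrame(
    {"OTU_1": [120, 3, 45], "OTU_2": [0, 98, 12], "OTU_3": [7, 7, 60]},
    index=["sample_A", "sample_B", "sample_C"],
)

# Hypothetical taxonomy assignments: each OTU mapped to a genus.
otu_to_genus = {"OTU_1": "Bacteroides", "OTU_2": "Faecalibacterium", "OTU_3": "Bacteroides"}

# Taxonomy-informed aggregation: collapse OTU columns into genus-level features,
# reducing dimensionality before training an ML model.
genus_counts = otu_counts.T.groupby(otu_to_genus).sum().T

# Convert raw counts into relative abundances so samples are comparable.
genus_profile = genus_counts.div(genus_counts.sum(axis=1), axis=0)
print(genus_profile)
```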

An example of the use of taxonomy can be found in the paper by Vangay et al. (2019), who created a publicly available repository comprising 33 curated regression and classification tasks involving human microbiome data from 15 public datasets [18]. The repository was developed as a powerful tool for two different types of users: ML algorithm developers with limited knowledge of the microbiome and, on the other hand, microbiome researchers looking for new datasets on which to perform a meta-analysis.

Many different algorithms can be used, alone or in combination, to perform automated data analyses. Some are statistical models, such as Logistic Regression (LR), and are often used in ML to predict the risk of developing a certain disease. Strictly speaking, LR is not a classifier, since it models the probability of an output given an input, but it can be used as one by setting a cutoff threshold [15]. Another algorithm for performing regression analysis is the Least Absolute Shrinkage and Selection Operator (LASSO) [15]. Other algorithms can be used both for regression and for classification; this is the case for Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Decision Trees (DT), and Gradient Boosting Decision Trees (GBDT), which are among the most popular supervised algorithms.
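
As a concrete illustration, the following is a minimal scikit-learn sketch of LR turned into a classifier via a cutoff threshold and of LASSO used for feature selection. The data are synthetic stand-ins for a taxa-abundance matrix, and all variable names and parameter values are illustrative assumptions, not taken from any reviewed study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 50))          # synthetic stand-in for 200 samples x 50 taxa abundances
y = rng.integers(0, 2, size=200)   # synthetic labels: 0 = healthy, 1 = ill

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LR models the probability of the positive class; a cutoff turns it into a classifier.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = lr.predict_proba(X_test)[:, 1]
threshold = 0.5                     # the cutoff can be tuned, e.g. to favor sensitivity
predicted_class = (proba >= threshold).astype(int)

# LASSO adds an L1 penalty that shrinks uninformative coefficients to exactly zero,
# implicitly selecting a subset of features.
lasso = Lasso(alpha=0.01).fit(X_train, y_train)
selected_features = np.flatnonzero(lasso.coef_)
print(len(selected_features), "features retained by LASSO")
```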

In DTs, each internal node is associated with a variable, also known as a property, and each branch linking a node to its children represents a decision (e.g. a possible value of that variable). DTs are often used in ensemble methods [19], techniques that combine multiple models or algorithms to achieve better predictive performance. The concept of “boosting” is also common and is used to combine weak learners into a strong one (e.g. LogitBoost) [20]. A recent evolution is the C5.0 algorithm, which improves DTs with feature selection and reduced pruning error [21, 22].

GBDT combines the predictions from a series of decision trees used as base learners, with a new decision tree trained at each step. eXtreme Gradient Boosting (XGBoost) is an open-source implementation of the gradient boosting algorithm that uses second-order gradients to guide the boosting process. Adaptive Boosting (ADA) is another boosting algorithm, used for both classification and regression problems, in which the “weak learners” are decision trees with a single split [20].
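
The following is a minimal sketch of boosted tree ensembles using scikit-learn’s GradientBoostingClassifier and AdaBoostClassifier (the XGBoost library itself is not used here); the synthetic data and the hyperparameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 50))          # synthetic abundance matrix
y = rng.integers(0, 2, size=200)   # synthetic binary labels

# GBDT: each new tree is fitted to correct the errors of the current ensemble.
gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)

# AdaBoost: scikit-learn's default weak learner is a single-split decision tree (a "stump").
ada = AdaBoostClassifier(n_estimators=100)

for name, model in [("GBDT", gbdt), ("AdaBoost", ada)]:
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```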

The Logistic Model Tree (LMT) can also be considered an ensemble method, since it combines LR and DT [15]. Some very simple classifiers are the Naïve Bayes (NB) probabilistic classifiers, grounded in the well-known Bayes’ theorem and used in ML since its earliest days. Despite their simplicity, they are still often used in clinical decision support systems and can provide good performance. A variant of NB is the Multinomial Naïve Bayes (MNB) classifier, in which the features represent the frequencies with which events are generated by a multinomial distribution. Both classifiers are also widely used for text classification [23].
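
A minimal sketch contrasting Gaussian NB with MNB follows: the multinomial variant expects non-negative, count-like features (e.g. raw sequence counts). The data are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 50))              # synthetic relative abundances in [0, 1]
counts = (X * 1000).astype(int)        # synthetic non-negative counts, as MNB expects
y = rng.integers(0, 2, size=200)

# Gaussian NB assumes each feature is normally distributed within each class.
print("GaussianNB accuracy:", cross_val_score(GaussianNB(), X, y, cv=5).mean())

# Multinomial NB models feature frequencies (e.g. raw sequence counts per sample).
print("MultinomialNB accuracy:", cross_val_score(MultinomialNB(), counts, y, cv=5).mean())
```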

Another classification method for high-dimensional data, such as microarray data, is Nearest Shrunken Centroids (NSC). It calculates a centroid for each class and shrinks it toward the overall centroid using a thresholding technique, so that features with little class-specific signal are effectively removed [24].
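
scikit-learn exposes this behavior through the shrink_threshold parameter of its NearestCentroid classifier; the following minimal sketch, on synthetic data, is only an illustration of the idea.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# shrink_threshold pulls each class centroid toward the overall centroid;
# feature components that are shrunk to zero no longer influence the prediction.
nsc = NearestCentroid(shrink_threshold=0.2).fit(X_train, y_train)
print("NSC accuracy:", nsc.score(X_test, y_test))
```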

1.1.1 Support vector machines

SVMs are among the most widely adopted and best-performing algorithms used as supervised binary classifiers. They were first introduced by Boser et al. [25]. The feature space is separated into two regions corresponding to the binary classes of the training data, using a linear hyperplane of equation [26]:

$$w^{T}x+b=0$$
(1)

The above equation (Eq. 1) is obtained as the result of a training process that optimizes the geometric margin, i.e. achieves the best separation between the classes. When a simple linear hyperplane is used, it is common to speak of a “linear SVM”.
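
A minimal sketch of a linear SVM on synthetic data is shown below; it illustrates how the fitted hyperplane parameters w and b of Eq. 1 can be read off the model and used to classify a new sample by the sign of the decision function. The data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 50))          # synthetic feature matrix
y = rng.integers(0, 2, size=200)   # synthetic binary labels

# A linear SVM: training finds the hyperplane w^T x + b = 0 (Eq. 1)
# that maximizes the geometric margin between the two classes.
svm = SVC(kernel="linear").fit(X, y)

w = svm.coef_[0]        # normal vector of the separating hyperplane (w in Eq. 1)
b = svm.intercept_[0]   # offset term (b in Eq. 1)

# A new sample is assigned to a class according to the sign of w^T x + b.
new_sample = rng.random(50)
print("predicted class:", int((w @ new_sample + b) > 0))
```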

1.1.2 Artificial Neural Networks

Artificial Neural Networks (ANNs), often simply referred to as Neural Networks (NNs), have been used as automated classifiers for many years [27]. Today, they are enjoying renewed popularity thanks to their wide use in DL.

Inspired by biological neurons, which are connected by synapses and communicate through neurotransmitters, NNs consist of at least two layers of nodes: one hosting the input values and one dedicated to the classifier’s output. Very often they also include one or more intermediate layers between the input and the output, named hidden layers (see Fig. 2).

Fig. 2

Schematic representation of an ANN with a single hidden layer. Source: www.learndatasci.com

The connections between nodes belonging to different layers carry numerical weights. A nonlinear activation function, also named transfer function, represents the action potential firing in the cell [28]. During the supervised training process, these weights are fine-tuned to achieve satisfactory classification performance, i.e. a robust mapping between an input array and the corresponding output (label). NNs are commonly trained with Stochastic Gradient Descent (SGD), an iterative method for optimizing parametric functions [29].
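
A minimal sketch of a small feedforward NN trained with SGD, using scikit-learn’s MLPClassifier, is given below; the layer size, learning rate, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 2, size=200)

# One hidden layer of 32 units; the weights connecting the layers are tuned
# iteratively by SGD so that each input array maps to its label.
# 'relu' plays the role of the nonlinear activation (transfer) function.
nn = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                   solver="sgd", learning_rate_init=0.01, max_iter=500)
nn.fit(X, y)
print("predicted labels for the first five samples:", nn.predict(X[:5]))
```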

1.1.3 Random Forest

Random Forest (RF) is an ensemble method introduced by Breiman in 2001 [30]. As mentioned in Section 1.1, ensemble methods combine several classification or regression algorithms to enhance overall performance. RF thus improves the predictive power of a single DT by training multiple trees and combining their outputs [31]. Each tree is trained on a random subset drawn from the training set, a procedure called bagging (bootstrap aggregating), and the final prediction is obtained as the average or the majority vote of the trees’ estimates. RF splits the samples into groups using features and associated thresholds, and shows very good performance with only a few parameters to tune [32].
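
The following is a minimal sketch of an RF classifier on synthetic data; bagging and random feature subsets are handled internally by scikit-learn, and the number of trees and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 2, size=200)

# Each tree is trained on a bootstrap sample of the data (bagging) and considers
# a random subset of features at each split; predictions are combined by majority vote.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
print("mean cross-validated accuracy:", cross_val_score(rf, X, y, cv=10).mean())

# After fitting, feature importances can suggest which taxa drive the prediction.
rf.fit(X, y)
top_features = np.argsort(rf.feature_importances_)[::-1][:5]
print("most informative feature indices:", top_features)
```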

Thanks to these favorable characteristics, RFs are widely used in bioinformatics, metagenomics, and genomic data analysis [33,34,35].

1.2 Deep Learning

The term Deep Learning identifies a subset of ML algorithms characterized by multiple layers of representation between input and output. DL has been developed to overcome some limitations of classical ML techniques. One issue arising with the use of ML is the so-called “curse of dimensionality” [16]: the complexity of ML algorithms grows rapidly as the number of dimensions of the input data increases. Like ML, DL techniques can be supervised or unsupervised.

The majority of DL algorithms are built upon ANNs, a class of learning algorithms composed of multiple interconnected layers that reproduce the way the brain processes and spreads information, as explained in Section 1.1.2. The term Deep Neural Network (DNN) refers to the high number of layers that compose the network of a DL algorithm. Such a high number of layers and units enables the representation of more complex functions than classical ML.

Convolutional Neural Networks (CNNs) are a wide class of DNNs, often applied to image analysis, that apply convolution in at least one of their layers. Fioravanti et al. [36] developed a DL approach based on CNNs for the classification of metagenomics data, called Phylogenetic CNN (Ph-CNN).
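
For illustration only, the following is a generic Keras sketch of a 1D CNN applied to an abundance profile treated as a one-dimensional signal; it is not the Ph-CNN architecture of [36], and the shapes and layer choices are assumptions.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.random((200, 50, 1)).astype("float32")   # 200 samples, 50 features treated as a 1D signal
y = rng.integers(0, 2, size=200)

# Convolutional filters slide over neighboring features, followed by pooling
# and a dense layer that outputs the class probability.
model = keras.Sequential([
    keras.layers.Input(shape=(50, 1)),
    keras.layers.Conv1D(filters=16, kernel_size=5, activation="relu"),
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```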

Multilayer Perceptron Neural Networks (MLPNNs), also called deep feedforward neural networks or simply feedforward neural networks, are a particular class of ANNs in which information flows only forward, with no feedback connections: the output of the model is not fed back into the network [13].

Deep Belief Networks (DBNs) are a subset of DNN algorithms characterized by connections between different layers but not between the units within each layer.

One last type of DNN is the Autoencoder Neural Network (AutoNN). An autoencoder combines an encoder function and a decoder function: the algorithm reproduces the input from a more compressed representation (i.e. with a lower number of features), thus allowing dimensionality reduction.
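
A minimal Keras sketch of an autoencoder is given below: it compresses a 50-dimensional (hypothetical) abundance profile into 8 latent features and reconstructs it; the dimensions are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.random((200, 50)).astype("float32")   # hypothetical abundance profiles

# Encoder compresses the input into a small code; the decoder reconstructs it.
inputs = keras.layers.Input(shape=(50,))
code = keras.layers.Dense(8, activation="relu")(inputs)        # compressed representation
outputs = keras.layers.Dense(50, activation="sigmoid")(code)   # reconstruction of the input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)     # trained to reproduce its own input

# The encoder alone performs the dimensionality reduction (50 features -> 8).
encoder = keras.Model(inputs, code)
print(encoder.predict(X, verbose=0).shape)   # (200, 8)
```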

Zhou and Feng [37] proposed the multi-Grained Cascade Forest (gcForest), a novel decision tree ensemble approach that combines a traditional ML algorithm with DL. It exhibits excellent performance in a broad range of tasks, comparable to that of a DNN. In particular, gcForest is less sensitive to changes in the network parameters (hyperparameters) and is thus more robust to hyperparameter settings than other DL algorithms.

1.3 Article structure

Following the overview of the main ML and DL algorithms in Section 1, Section 2 of this article provides detailed information on the sources of information and the methods used for analyzing the results. Section 3 provides a synthesis of the results, which are thoroughly discussed in Section 4. Finally, Section 5 offers the reader considerations on the use of ML and DL for microbiome analysis from a clinical standpoint.

2 Materials and methods

2.1 Information sources and literature search

Two research groups have been involved in this scoping review, working in two different areas of Europe: Florence (Italy) and Sarajevo (Bosnia and Herzegovina). The scoping review has been conducted according to the PRISMA Guidelines for Scoping Reviews [38].

The research strategy has been defined jointly by the two teams and the literature search has been performed in parallel on the main biomedical databases: MEDLINE (Ovid) and PubMed.

The keywords used for the search are consistent with the scope of the review and are the following: artificial intelligence, machine learning, deep learning, transfer learning, neural network/s, expert system/s, automatic classifier, deep network/s, classification, clustering, regression, prediction, microbiota, microbiome, gut, colorectal, colon, Crohn.

2.2 Eligibility criteria

The papers to be included in the review had to describe the use of ML or DL methods applied to the study of the human gut microbiota. Moreover, the following limitations have been adopted: publication year from 2004 onwards and English language only. In particular, limiting the publication period allows the review to focus on the techniques and algorithms developed in recent years.

2.3 Selection of sources of evidence

After the literature search, all the retrieved records have been screened against the eligibility criteria. Two reviewers (D.B. and R.F.) independently screened the titles and abstracts of all the records returned by the literature search, and those considered non-pertinent to the scope of the review have been removed. The results obtained separately by the two reviewers have been compared: the articles that both considered eligible for the study have been directly included in the list for full-text download. A third reviewer (A.A.), the immunologist participating as clinical partner, decided on the papers selected by only one of the two reviewers.

The full texts of the papers potentially eligible for the review have been downloaded. Once more, the two reviewers read the full texts and excluded further papers not consistent with the objectives of the review. In this way, the final list of papers to be included in the review was created.

The reviewers made use of a collaborative worksheet in a shared Google Drive folder. The process described above is presented in a dedicated flowchart in Section 3.

2.4 Data Charting Form

The Data Charting Form (DCF) was developed to select the variables to be extracted from the papers and analyzed. The two groups independently created a DCF based on the reading of a small subset of papers; the two forms have then been compared and merged into the final DCF used for the analysis.

The analyzed variables concern article characteristics (i.e. first author name, year and country of publication, publisher), the method’s scope and limitations, the sample (i.e. type, size, age), the application (e.g. microbiota body site, diseases considered), the analysis technique (e.g. algorithms), and the validation and metrics used to assess performance. A detailed list of all the variables can be found in the tables in Section 3. As with the other steps, data collection was performed by the two groups independently: two DCFs were compiled and the results were then discussed to agree on the final data form to be included in the review.

2.5 Synthesis of methods for results handling

The combined DCF was the source of the results that are reported in this paper. We included a number of useful variables that describe the different research results in great detail. For readability purposes, the variables were spread across three tables.

Different metrics can be applied to evaluate the performance of a binary classifier. The metrics reported in the reviewed studies (see Table 3) are outlined below, following the work by Flach [84]. Four outcomes of binary classification serve as the constituents for defining more complex performance metrics: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

Precision specifies the proportion of positive identifications that are correct.

$$Precision=\frac{TP}{TP+FP}$$
(2)

True Positive Rate (TPR), or Sensitivity, or Recall, specifies the proportion of actual positives that were correctly labeled.

$$TPR/Sensitivity/Recall=\frac{TP}{TP+FN}$$
(3)

Accuracy is the proportion of correctly identified samples (both positives and negatives); it is not a good metric if the sizes of the two groups are unbalanced.

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(4)

Specificity measures the classifier’s ability to correctly label negatives.

$$Specificity=\frac{TN}{TN+FP}$$
(5)

The False Positive Rate (FPR) is calculated as the ratio between the number of false positives and the total number of actual negatives.

$$FPR=1-TNR=\frac{FP}{TN+FP}$$
(6)

The Receiver Operating Characteristic (ROC) curve is a plot of FPR (x-axis) vs. TPR (y-axis). The Area Under the ROC Curve (AUC) is a threshold-invariant aggregate measure of binary classifier performance that takes into account all possible threshold values. AUC values range from 0 to 1.

The F1-score is a metric that combines recall and precision through their harmonic mean:

$$F1\text{-}score=\frac{2\cdot Precision\cdot TPR}{Precision+TPR}$$
(7)

F1-macro is a metric used by Lo and Marculescu [48] and described as follows: “We estimate F1macro by calculating the accuracy for each class and then finding their unweighted mean”.

The Matthews Correlation Coefficient (MCC) is, in essence, a correlation coefficient between the actual and predicted binary classifications, and it takes values in [−1, 1]. A value of 1 represents complete agreement between prediction and observation, 0 corresponds to a random prediction, and −1 indicates total disagreement between prediction and observation.

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}$$
(8)
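
For reference, the following is a minimal Python sketch computing the metrics of Eqs. 2–8 directly from the four counts of a confusion matrix; the example counts are arbitrary.

```python
from math import sqrt

def binary_metrics(tp, tn, fp, fn):
    """Compute the performance metrics of Eqs. 2-8 from confusion-matrix counts."""
    precision   = tp / (tp + fp)                                   # Eq. 2
    recall      = tp / (tp + fn)                                   # Eq. 3 (TPR / sensitivity)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)                  # Eq. 4
    specificity = tn / (tn + fp)                                   # Eq. 5
    fpr         = fp / (tn + fp)                                   # Eq. 6 (= 1 - specificity)
    f1          = 2 * precision * recall / (precision + recall)    # Eq. 7
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))             # Eq. 8
    return {"precision": precision, "recall": recall, "accuracy": accuracy,
            "specificity": specificity, "FPR": fpr, "F1": f1, "MCC": mcc}

# Arbitrary example counts, purely for illustration.
print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```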

3 Results

3.1 Selection of sources of evidence

The literature search returned 1109 articles in total. After screening based on title and abstract, 22 papers were assessed as eligible and their full texts were downloaded and read. A further 10 articles were excluded after reading. Among these, the work by Zhou and Gallins [17] is a review; although excluded from the study, it provided four new articles [48, 52, 53, 85] that were added to the list of eligible papers for full-text assessment. The paper [18] was also excluded because it too is a review: it presents a publicly available repository of classification and regression tasks from human microbiome datasets and, as with the previously mentioned review, it was read and some of the studies it covers were considered for assessment.

The remaining excluded articles were eliminated either because they did not apply ML or DL algorithms but only performed statistical analyses on microbiota data [86,87,88], or because the analysis was not focused on the microbiota (e.g. enzyme profiles or metagenomics were analyzed). The reasons for excluding each of the ten articles are discussed in detail in Section 4.

The final set of articles included and examined in this scoping review is made up of 16 articles. The process of selection of sources of evidence is reported in the flowchart in Fig. 3.

Fig. 3

The figure summarizes the process of selection of sources of evidence for the study

3.2 Synthesis of results

Table 1 gives a high-level overview of all the studies: for each study it states its type, cross-validation method, examined taxonomy level, and studied trait (if applicable). Table 2 summarizes all the data sets used in the examined studies, while Table 3 summarizes the algorithms that were applied, together with indicators of their performance.

Table 1 High level information on studies
Table 2 Description of data sets that each study used
Table 3 Studies’ algorithms and their performance

4 Discussion

A total of 26 papers have been reviewed for this study. Of these, 10 papers were identified as not fit for this review; Table 4 lists the excluded papers together with the reasons for their exclusion. The remaining 16 papers were fully examined. The high-level data in Table 1 show that 4 papers reported new methods that can be used to analyze gut microbiota data, while the remaining 12 applied existing methods to analyze the gut microbiota of humans with different traits. All papers performed some form of cross-validation, with 10-fold cross-validation being the most common.

Table 4 Excluded papers with the reason for exclusion

The majority of the examined papers used the species level as the final taxonomic resolution. This is consistent with findings in the literature: species is the last taxonomic level at which the 16S rRNA sequencing technique is accurate. As for the traits examined, there is great variability, but datasets of individuals with either Crohn’s disease or colorectal cancer were examined multiple times.

As reported in Table 2, most of the studies examined multiple datasets, ranging from one to eight. All but two papers (which examined the saliva microbiota) analyzed gut microbiota data. Classification of different body sites is a problem examined in two studies. While three other papers addressed different classification problems and one used a cohort study design, the overwhelming majority were fully or partially based on a case–control design. This is because AI techniques are well suited to building computational classifiers that can distinguish samples from case and control groups with high probability. The total number of samples in a single dataset varied from 40 to 10,101. In the articles using a case–control design, the case groups contained 16 to 500 samples, while the control groups contained 24 to 500 samples.

A diverse range of AI algorithms was applied in the reviewed papers, some from the ML group and others from DL. Eleven papers evaluated only ML algorithms (ranging from one to eight algorithms applied to a single dataset), while the remaining five examined both ML and DL algorithms.

The most frequently applied ML algorithm was Random Forest, which also most often had the best reported performance. For the papers that applied both ML and DL algorithms, it is inconclusive which performed better.

When it comes to reporting the performance of AI algorithms on microbiome data, as summarized in Table 3, the metrics that were reported most frequently were AUC and Accuracy. Additional metrics include Sensitivity, Specificity, MCC, F1-score and F1-macro.

5 Conclusion

It is commonly accepted that the microbiome “ecosystem” plays a central role in health and in the development of disease. The human microbiome provides promising biomarkers for various pathological states, and there is an overflow of metagenomics results; translating these data into clinical practice is now a major challenge for the future. Microorganisms and host cells communicate by producing and sharing metabolites, generating metabolic networks that can be used to develop meta-metabolic network models. Studying network biology using ML represents a great opportunity for exploring the “human health condition”.

Some models could be used for understanding the microbial-host interplay, as well as for predicting and gaining insights into the synergistic and dysbiotic connections. Some models could be used to inspect how the abnormal growth of a specific microbial species might perturb the metabolic balance of the ecosystem by secreting beneficial metabolites that promote health or, conversely, toxic ones that could damage the host tissues. Some models could also be used to foster the development of innovative diagnostic applications. The huge amount of data produced by these models is often referred to as big data.

However, such a large amount of data needs to be reported in an intelligible way. Each prediction allows for more extensive analysis, which in turn may let clinicians make informed and accurate decisions. Using a method for explaining individual classifier decisions in complex microbiota analysis may assist in managing the treatment of every single patient. This approach can also help physicians improve their clinical expertise (with new, fine-grained stratification of patient subtypes), thus opening new perspectives on personalized therapy.