1 Introduction

The microbiome has been linked to many major human diseases, including obesity, diabetes, autism, allergies, inflammatory bowel disease, cardiovascular disease, many types of cancer, and depression [28]. The human microbiome may therefore become a new target for therapeutic intervention and play an important role in the diagnosis, analysis and treatment of these diseases [42]. For example, correlation analysis of the gut microbiome has been successfully applied in the clinical assessment and diagnosis of patients with diabetes [38]. Microbiome research has not only demonstrated its value in clinical and personalized medicine [12], but has also spread to many other fields, such as marine science [27], environmental science [51], agricultural science [51] and earth science [17]. According to the official website of the International Diabetes Federation (IDF) in November 2021, the number of adults with diabetes worldwide has reached 537 million, accounting for 10.5% of the world's adult population; the number of people with diabetes in China has grown to 140 million, ranking first in the world (https://diabetesatlas.org/). Diabetes has become a major public health problem that seriously affects people’s physical and mental health. Therefore, taking the gut microbiome of patients with type 2 diabetes as an example, this paper employs a probabilistic topic model to mine hidden information in the microbiome and to infer the probabilistic topics related to type 2 diabetes. This provides a new perspective for the study of the microbiome and may suggest new targeted microbiological treatments for type 2 diabetes.

2 Related work

Studies of the microbiome usually analyze the composition and diversity of the bacterial community, which addresses one of the basic questions of microbial ecology: how many different taxa or OTUs (operational taxonomic units) are present? Multivariate statistics or pattern recognition methods are usually employed to identify structural patterns in the gut microbiome, such as principal component analysis (PCA) [32, 33, 38], principal coordinate analysis (PCoA) [13, 39, 49] and partitioning around medoids (PAM) clustering [2, 57]. However, microbial metagenomics data are highly diverse but sparse, and these methods have inherent shortcomings that prevent them from handling such data well. The probabilistic topic model is less sensitive to highly sparse and noisy data. It is widely used in document mining and has also been applied to microbiome data to mine hidden topics.

2.1 Traditional methods

Both PCA and PCoA are dimension reduction techniques, as shown in Table 1. Their advantages are that they are simple to use, computationally cheap, easy to interpret, and free of parameters to tune. Their disadvantages are as follows. (1) Without prior knowledge of the data, the information in the data may not be retained well. For example, PCA requires preprocessing and standardization, and the usual way to standardize is to divide by the standard deviation. This can be a problem: if the standard deviation of a variable is very small and close to zero, as is typical when the variable is dominated by noise, dividing by it amplifies that variable strongly, whereas variables with larger standard deviations are amplified much less. (2) The final number of reduced dimensions, that is, the number of latent variables, cannot be estimated well. PAM is a partitioning clustering algorithm, also known as the k-medoids algorithm, in which each cluster is represented by one of its own data points (the medoid). The advantage of PAM is that it is much less sensitive to outliers, because the cluster center it selects is an actual data point rather than a geometric center (as in k-means). Its disadvantages are: (1) the value of K must be specified in advance; (2) it is very effective for small data sets but does not scale well to large ones.
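The standardization issue in point (1) can be illustrated with a small sketch. The Python code below is not taken from any of the cited methods; the function name `standardize` and the toy values are assumptions chosen purely for illustration. It z-scores two variables whose raw ranges differ by four orders of magnitude and shows that both end up on the same scale, so a low-amplitude, noise-like variable gains as much influence as a clear signal.

```python
import statistics

def standardize(values):
    """Center a variable and divide by its population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / std for v in values]

# Hypothetical toy data: a clear signal and a low-amplitude, noise-like variable.
signal = [10.0, 20.0, 30.0]
noise_like = [0.001, 0.002, 0.003]

z_signal = standardize(signal)
z_noise = standardize(noise_like)

# Scale factor applied to each variable during standardization (1 / std).
gain_signal = 1.0 / statistics.pstdev(signal)
gain_noise = 1.0 / statistics.pstdev(noise_like)
```

After standardization the two variables are numerically indistinguishable, although the second one was multiplied by a far larger factor than the first.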

Table 1 The principles, advantages and disadvantages of PCA, PCoA, PAM, and probabilistic topic models

2.2 Probabilistic topic model

Microbial metagenomics data are highly diverse but sparse, and most taxa appear in only a few samples at low abundance. In addition, the samples vary in the number of reads: a small sample is inherently noisier than a larger one. Therefore, PCA, PCoA and PAM do not work well for such data sets. The probabilistic topic model is less sensitive to highly sparse and noisy data, so it is more suitable for microbial metagenomics data. In the probabilistic topic model, a sample is treated not as the community itself but as having been generated by sampling from the community. The most natural assumption is sampling with replacement, so that the likelihood of an observed sample is a multinomial distribution with a parameter vector in which each entry represents the probability that a read comes from a given taxon [25]. The natural prior for the parameters of a multinomial distribution is the Dirichlet distribution. This leads to the widely used probabilistic topic model, Latent Dirichlet Allocation (LDA). The basic idea of LDA is that a document is regarded as a mixture of latent topics, each of which is expressed as a distribution over words. (The terms document, topic and word come from document mining, because the method was first applied in natural language processing; in other fields, “document” and “word” take on different meanings.) LDA employs two Dirichlet-multinomial distributions to model the relationship between documents and topics and the relationship between topics and words, respectively [22, 36]. Approximate methods, such as variational inference [6] and Markov chain Monte Carlo (MCMC) [29], are commonly used in LDA to calculate the posterior probabilities. The resulting probability distributions are then used to make inferences about the topics and documents.

LDA has been widely used in document mining [5, 6, 19, 20, 23] and in image retrieval and annotation [14, 31, 45]. It has also been applied in bioinformatics for various purposes, such as protein structure representation [43], drug labeling [4] and next-generation sequencing [59]. However, studies that apply probabilistic topic models to the gut microbiome are scarce. Zhang et al. exploited LDA to boost the binning of metagenomic reads [58]. Chen et al. showed that the configuration of functional groups in metagenome samples can be inferred by probabilistic topic modeling (LDA) [10]. Holmes et al. applied the Dirichlet multinomial mixtures (DMM) model to the gut microbiome of obese and lean twins [25]. Stewart et al. used DMM to model 16S rRNA gene sequencing and metagenomic sequencing data of the gut microbiome of children [49]. Wang et al. used LDA to study the gut microbiome of patients with mild hepatic encephalopathy and the efficacy of rifampicin combined with probiotics [53, 54]. Woloszynek et al. evaluated a topic model approach for parsing the structure of microbiome data [56]. Abe et al. proposed a new probabilistic model for microbial association analysis, because traditional probabilistic modeling cannot distinguish between bacterial differences derived from enterotype and those related to a specific disease [1]. Okui et al. proposed a Bayesian nonparametric topic model for microbiome data using subject attributes [35]. These studies indicate that probabilistic topic models yield meaningful findings in the analysis of the gut microbiome that differ from those of traditional statistical methods. To the best of our knowledge, however, there is no research that combines supervised information (i.e., the degree of correlation between different microbiome members and patients) with a probabilistic topic model. In addition, previous studies used probabilistic topic models only to cluster the gut microbiome or to find topics; they did not make full use of the characteristics of the gut microbiome itself, nor did they analyze the output of the probabilistic topic model in depth. In this study, gut microbiome data of patients with type 2 diabetes were taken as a case.

Therefore, the contributions of this paper are as follows: (1) a new model based on the probabilistic topic model is proposed to analyze the gut microbiome; it can mine the information hidden in gut microbiome data, and its output topics describe the characteristics of the gut microbiome; (2) the distributions generated by the LDA model can be combined, as new features, with various data mining algorithms, which helps to understand the structural differences of the gut microbiome among different groups.

3 Methods

The weights of the OTUs were calculated first and assigned to the corresponding OTUs, and LDA was then employed to obtain the distribution of the gut microbiome in different groups. Finally, the distributions over topics for the gut microbiome of each patient (i.e., the per-patient topic distributions) generated by LDA were clustered and classified to verify their ability to characterize the gut microbiome. The flow chart of the proposed method is shown in Fig. 1. First, the data set (relative abundance of the gut microbiome, Fig. 1a) is acquired and the weight of each OTU is calculated (Fig. 1b); the procedure for obtaining the weights is shown in Fig. 2. Second, the LDA model is fitted (Fig. 1c); it is represented by two distributions: the distribution over OTUs for each topic (per-topic OTU distributions) and the distribution over topics for the gut microbiome of each patient (per-patient topic distributions). Third, collapsed Gibbs sampling [37] is employed to determine the optimal number of topics (Fig. 1d), and the results are visualized as a tree graph [15] (Fig. 1e). Finally, the per-patient topic distributions generated by LDA are clustered and classified to verify their ability to characterize the gut microbiome data (Fig. 1f).

Fig. 1

A flowchart of the proposed method. a shows the gut microbiome data set, where OTU1, …, OTUN denote the names of the OTUs, S1, …, SS represent the subjects, and RA represents relative abundance. b shows the weight of each OTU, calculated according to formulas (1)-(4). c shows LDA modeling after the weights are calculated, where tK is the k-th topic, ON is the n-th OTU, and P(ON|tK) is the conditional probability of that OTU given the topic; Ss is the s-th subject, and P(tK|Ss) is the conditional probability of the topic given the subject. d shows the determination of the optimal number of topics in the LDA model. e shows the tree graph of the three groups according to the results of the LDA model. Yellow, orange and red circles represent normal healthy subjects (abbreviated as N), T2DM with gastrointestinal autonomic neuropathy (abbreviated as G) and T2DM (abbreviated as D), respectively. f shows the clustering and classification operations based on the results of the LDA model

Fig. 2

 A flowchart of weight calculation

3.1 Calculating weights of OTUs

In document mining, the terms in the documents must first be converted into a DocumentTermMatrix (DTM), which records the frequency of each term (word) in each document. For gut microbiome data, the relative abundance reflects the proportion of different bacteria in each sample and thus plays the role of the DTM. The probabilistic topic model was originally designed for document analysis and assumes that every word in a document is equally important. This assumption does not always hold. Wallach et al. pointed out that high-frequency stopwords have a great influence on the topic inference of probabilistic topic models [52]. In the human intestine, the gut microbiome is likewise unevenly distributed: the dominant phyla are Bacteroidetes and Firmicutes [46], which may affect the inference of disease-related topics. In addition, when the distributions generated by LDA were used directly to construct classifiers, it was found that different OTUs played different roles in different groups. Therefore, the weight of each OTU was calculated according to its importance and multiplied by its relative abundance, so as to adjust the proportions of the different OTUs. The flow chart of weight calculation is shown in Fig. 2.

The importance of an OTU can be measured by the ratio of the recognition rate of the data set after deleting that OTU to the recognition rate of the whole data set, as shown in formula (1):

$$I=\log \frac{p\left({d}^{\prime }\right)}{p\left(d\right)}$$
(1)

where p(d) is the probability of correct identification on the whole data set and p(d′) is the probability of correct identification after one OTU is deleted. The greater the difference between p(d) and p(d′), the greater the absolute value of I. Since p(d) is fixed for a given data set, a larger absolute value of I means a greater change in classification accuracy after deleting the OTU and hence a higher importance of that OTU; conversely, a smaller absolute value of I shows that the classification accuracy changes only slightly after deleting the OTU, so the effect of the OTU on classification is relatively small. In this paper, random forest is employed to calculate the probability of correct identification, as shown in Fig. 2.

To prevent the weights of some OTUs from becoming too large, formula (1) is slightly modified according to references [21, 47]: it is replaced by the absolute value of the difference between \(\log\frac{{p}_{err}(-i,d)}{{p}_{err}\left(d\right)}\) and \(\log\frac{{p}_{cor}(-i,d)}{{p}_{cor}\left(d\right)}\), and a normalization factor is added, as shown in formulas (2)-(4). Here I(i) is the change in identification accuracy after deleting OTU i, \(\overline{I}\) is the average value of I(i), α is the number of OTUs (1551 in this study), and weight(i) is the weight of OTU i:

$$I\left(i\right)=\left|\log \frac{{p}_{err}(-i,d)}{{p}_{err}\left(d\right)}-\log \frac{{p}_{cor}(-i,d)}{{p}_{cor}\left(d\right)}\right|$$
(2)
$$\overline{I}=\frac{1}{\alpha }\sum_{i=1}^{\alpha }\left|\log \frac{{p}_{err}(-i,d)}{{p}_{err}\left(d\right)}-\log \frac{{p}_{cor}(-i,d)}{{p}_{cor}\left(d\right)}\right|$$
(3)
$$weight\left(i\right)=\frac{I\left(i\right)}{\overline{I}}$$
(4)

where perr(-i, d) and pcor(-i, d) are the probabilities of incorrect and correct identification, respectively, after OTU i is deleted, and perr(d) and pcor(d) are the probabilities of incorrect and correct identification on the whole data set.
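Formulas (2)-(4) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `otu_weights` and `apply_weights` are hypothetical, the accuracies are assumed to be supplied by an external classifier (random forest in this paper), and the error probability is assumed to be the complement of the correct-identification probability.

```python
import math

def otu_weights(p_cor_full, p_cor_without):
    """Compute weight(i) for every OTU from formulas (2)-(4).

    p_cor_full    -- p_cor(d), accuracy on the whole data set
    p_cor_without -- list whose i-th entry is p_cor(-i, d), the accuracy
                     after deleting OTU i
    Assumption of this sketch: p_err = 1 - p_cor, and all accuracies lie
    strictly between 0 and 1 so that the logarithms are defined.
    """
    probabilities = [p_cor_full] + list(p_cor_without)
    if any(not 0.0 < p < 1.0 for p in probabilities):
        raise ValueError("accuracies must lie strictly between 0 and 1")
    p_err_full = 1.0 - p_cor_full
    importance = []
    for p_cor_i in p_cor_without:
        p_err_i = 1.0 - p_cor_i
        # Formula (2): absolute difference of the two log ratios.
        importance.append(abs(math.log(p_err_i / p_err_full)
                              - math.log(p_cor_i / p_cor_full)))
    # Formula (3): average importance over all OTUs.
    mean_importance = sum(importance) / len(importance)
    if mean_importance == 0:
        raise ValueError("no OTU changes the accuracy; weights are undefined")
    # Formula (4): normalize by the average.
    return [value / mean_importance for value in importance]

def apply_weights(relative_abundance, weights):
    """Multiply the relative abundance of each OTU by its weight."""
    return [a * w for a, w in zip(relative_abundance, weights)]

# Hypothetical accuracies for three OTUs, for illustration only.
example_weights = otu_weights(0.8, [0.8, 0.7, 0.6])
```

Because of the normalization in formula (4), the weights always average to 1; an OTU whose deletion leaves the accuracy unchanged receives weight 0 in this sketch.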

3.2 Latent Dirichlet Allocation model

In this study, 140 subjects were recruited, and the gut microbiome data of each subject covered 1551 OTUs. In the LDA model, the gut microbiome of each patient was treated as one document and each OTU as one word, so that the data comprised 140 documents over a vocabulary of 1551 words. The algorithm is as follows [6]:

  1. For each topic k, where k in {1, …, K}, pick a distribution over OTUs φk ~ Dir(β);

  2. For each patient Pm, where m in {1, …, M},

    a. pick a distribution over topics θm ~ Dir(α);

    b. for each OTU On with n in {1, …, N},

      (1) pick a topic z ~ Multinomial(θm);

      (2) pick OTU On ~ Multinomial(φz).
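The generative process listed above can be simulated directly. The following Python sketch uses only the standard library; the function names, the toy sizes and the use of symmetric Dirichlet priors (a single concentration value α for the per-patient topic distributions and a single value β for the per-topic OTU distributions) are assumptions of this illustration, not details confirmed by the paper.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet distribution via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow
        return [1.0 / len(draws)] * len(draws)
    return [d / total for d in draws]

def generate_samples(num_topics, num_otus, num_patients,
                     reads_per_patient, alpha, beta, seed=0):
    """Simulate OTU counts for each patient under the LDA generative model."""
    rng = random.Random(seed)
    # Step 1: one distribution over OTUs per topic, phi_k ~ Dir(beta).
    phi = [sample_dirichlet([beta] * num_otus, rng) for _ in range(num_topics)]
    thetas, counts = [], []
    for _ in range(num_patients):
        # Step 2a: distribution over topics for this patient, theta_m ~ Dir(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        row = [0] * num_otus
        for _ in range(reads_per_patient):
            # Step 2b(1): pick a topic z ~ Multinomial(theta_m).
            z = rng.choices(range(num_topics), weights=theta)[0]
            # Step 2b(2): pick an OTU ~ Multinomial(phi_z).
            otu = rng.choices(range(num_otus), weights=phi[z])[0]
            row[otu] += 1
        thetas.append(theta)
        counts.append(row)
    return phi, thetas, counts

# Toy sizes chosen for illustration; the study itself has 140 patients and 1551 OTUs.
phi, thetas, counts = generate_samples(num_topics=3, num_otus=8, num_patients=5,
                                       reads_per_patient=50, alpha=1.0, beta=0.5,
                                       seed=1)
```

Fitting LDA amounts to inverting this process: given only `counts`, the model infers plausible values of `phi` and `thetas`.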

The latent variables θ and φ can be estimated according to Eqs. (5) and (6):

$${\theta }_{m,k}=\frac{{n}_{m}^{\left(k\right)}+{\alpha }_{k}}{{\sum }_{k=1}^{K}({n}_{m}^{\left(k\right)}+{\alpha }_{k})}$$
(5)
$$\varphi_{k,t}=\frac{n_k^{\left(t\right)}+\beta_t}{\sum_{t=1}^V(n_k^{\left(t\right)}+\beta_t)}$$
(6)

where φk is the distribution over OTUs for topic k, θm is the distribution over topics for patient m, \({n}_{m}^{\left(k\right)}\) is the number of OTU occurrences in patient m that are assigned to topic k, \({n}_{k}^{\left(t\right)}\) is the number of times OTU t is assigned to topic k, and V is the total number of distinct OTUs. Dir denotes a Dirichlet distribution and Multinomial denotes a multinomial distribution. The distribution over OTUs for each topic and the distribution over topics for each patient are treated as random variables that follow Dirichlet distributions with parameters β and α, respectively. Note that α here is the Dirichlet hyperparameter of LDA, not the number of OTUs used in formula (3).
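Eqs. (5) and (6) are both smoothed, normalized counts, which the following Python sketch makes explicit. The function names and the toy count matrices are hypothetical, and symmetric priors (the same α for every topic, the same β for every OTU) are assumed for brevity.

```python
def smoothed_proportions(count_rows, prior):
    """Add a symmetric Dirichlet prior to each count and normalize every row."""
    result = []
    for row in count_rows:
        denominator = sum(n + prior for n in row)
        result.append([(n + prior) / denominator for n in row])
    return result

def estimate_theta(topic_counts_per_patient, alpha):
    """Eq. (5): entry [m][k] of the input is n_m^(k)."""
    return smoothed_proportions(topic_counts_per_patient, alpha)

def estimate_phi(otu_counts_per_topic, beta):
    """Eq. (6): entry [k][t] of the input is n_k^(t)."""
    return smoothed_proportions(otu_counts_per_topic, beta)

# Hypothetical assignment counts: 2 patients x 3 topics, and 3 topics x 4 OTUs.
theta = estimate_theta([[3, 1, 0], [0, 0, 5]], alpha=1.0)
phi = estimate_phi([[2, 0, 1, 0], [0, 1, 0, 0], [1, 1, 1, 2]], beta=0.1)
```

The prior keeps every estimated probability strictly positive, even for a topic or OTU with a zero count.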

The initial value of α is 50/k, where k is the number of topics, and the initial value of β is 0.1 [55]. In the original LDA model published by Blei et al. [6, 24], a variational EM algorithm was used to estimate the unknown parameters \({\theta }_{m,k}\) and \(\varphi_{k,t}\); later researchers found that Gibbs sampling is also a good method for inferring them [37].

3.3 Choosing the number of topics

The main parameter of LDA is the number of topics k (optimal values for the other hyper-parameters, i.e., α and β, are picked automatically by the different fitting methods). The generally recommended way to select the number of topics is cross-validation over different values of k, comparing the likelihood for each number of topics [15]. However, the computation time of such a method may be prohibitive for large data sets and a large range of topic numbers. In addition, a large number of topics (and therefore a more complex statistical model) may lead to overfitting. It is therefore preferable to use the smallest number that provides a good explanation of the data. Because the concept of ‘topics’ has only a loose meaning in the context of the gut microbiome, it is difficult to give a reliable estimate of the ideal number based on biological knowledge alone. The cellTree package [15] provides three fitting methods, namely Gibbs, VEM and maptpx. The Gibbs method uses collapsed Gibbs sampling [37] to infer the parameters of the Dirichlet distributions for a given number of topics; it gives high accuracy but is very time-consuming on large data sets. The VEM method uses variational expectation-maximization [24], which tends to converge faster than collapsed Gibbs sampling but with lower accuracy. The maptpx method uses the approach described in [44], which estimates the parameters of the topic model for an increasing number of topics (using previous estimates as a starting point for larger topic numbers). In this study, the maptpx method was tried first, and the optimal number of topics it found was more than 100, which was clearly inappropriate and did not explain our data well. Therefore, the Gibbs method was finally adopted. For more information, please see Section 5 of this paper.

3.4 Clustering analysis

In this study, the distributions generated by LDA (the per-patient topic distributions) were regarded as features of the gut microbiome [59], and a conventional clustering method (k-means) was applied to them to verify clustering performance. Since the data set included three groups (N, G and D), the number of clusters in k-means was set to 3 (N, G and D groups) or 2 (G and D groups). Computing the per-patient topic distributions is equivalent to performing dimensionality reduction on the original data. For comparison, the traditional PCA method was also used to reduce dimensionality, with the number of principal components set equal to the number of topics in the per-patient topic distributions. Clustering results were evaluated by the Adjusted Rand Index (ARI) [26], which equals 1 for perfect agreement with the true groups and is close to 0 (or slightly negative) for random labeling. Generally, the higher the value, the better the clustering performance.
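For readers unfamiliar with the ARI, the following Python sketch computes it from two label vectors using the standard pair-counting definition. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions; the paper does not state which implementation was used.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same subjects."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two label lists of equal length >= 2")
    # Pairs of subjects placed together in both partitions.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in each partition separately.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both partitions
        return 1.0
    return (together_both - expected) / (maximum - expected)

# Cluster ids are arbitrary: a perfect clustering scores 1 whatever the ids are.
perfect = adjusted_rand_index(["N", "N", "G", "G", "D", "D"], [2, 2, 0, 0, 1, 1])
# A clustering that splits every true group scores below zero.
poor = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which subjects are grouped together, so it can compare k-means cluster ids directly with the N, G and D labels.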

3.5 Classification analysis

To further evaluate the ability of the per-patient topic distributions to characterize the gut microbiome, the distributions were used to construct classifiers, namely support vector machine (SVM) [48] and random forest (RF) [7], to identify the 3 or 2 groups of patients. In each group, 70% of the subjects were used as the training set and the remaining 30% as the test set. The weights and the per-patient topic distributions were calculated on the training set (70% of the data set), and the performance was verified on the test set (30% of the data set). The function “svm” in the R package “e1071” (with a polynomial kernel and values of the parameters gamma and cost optimized for each classification task) and the function “randomForest” in the R package “randomForest” (with the number of trees set to 500 and default values for the other parameters) were used to train the classifiers.
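The per-group 70/30 split described above can be sketched as follows. This Python illustration is not the authors' R code: the function name `stratified_split`, the random seed and the rule of rounding the training size within each group are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split subject indices into training and test sets within each group."""
    rng = random.Random(seed)
    members_by_group = defaultdict(list)
    for index, label in enumerate(labels):
        members_by_group[label].append(index)
    train, test = [], []
    for label in sorted(members_by_group):
        members = members_by_group[label]
        rng.shuffle(members)
        # Assumed rounding rule: nearest integer to 70% of the group size.
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)

# Group sizes as reported for this study: 74 D, 27 G and 39 N subjects.
labels = ["D"] * 74 + ["G"] * 27 + ["N"] * 39
train_idx, test_idx = stratified_split(labels)
```

Under this rounding rule the three groups contribute 52, 19 and 27 training subjects, giving 98 training and 42 test subjects in total. Fitting the weights and the topic model on the training indices only, as the text describes, keeps the test subjects out of every fitted quantity.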

4 Experiment results

In this study, gut microbiome data of 140 subjects were collected from the Department of Endocrinology of Yunnan First People’s Hospital, China, from 2015 to 2017: 74 patients with T2DM (abbreviated as D), 27 patients with T2DM with gastrointestinal autonomic neuropathy (abbreviated as G) and 39 normal healthy subjects (abbreviated as N). All subjects signed informed consent, and the experiment was approved by the ethics committee of Kunming University of Science and Technology. None of the subjects had used antibiotics, probiotics or lactose within one month before sampling.

4.1 Experimental environment

The experiments in this paper were run on a computer with an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz and 32 GB of RAM. R 4.0.0 (https://www.r-project.org/) was used for all data processing and plotting. LDA modeling and result visualization were performed with the cellTree software package [15] for R.

4.2 Weights of OTUs

The weight of each OTU was calculated for two cases: 3-classification (N, G and D groups) and 2-classification (G and D groups). The weight calculation process is shown in Fig. 2. The weights of the OTUs for 3- and 2-classification are shown in Fig. 3a and b, respectively. For 3-classification (Fig. 3a), the weights fall into six grades: 3.10, 2.45, 1.75, 0.98, 0.88 and 0.11. Five OTUs have the largest weight of 3.10, namely OTU108 (Ruminiclostridium), OTU365 (Mollicutes), OTU855 (Lachnospiraceae), OTU1586 (Nitrosomonadaceae) and OTU1793 (Clostridiales). The mean relative abundances of these five OTUs in the N, G and D groups are shown in Fig. 4. For 2-classification (Fig. 3b), the weights fall into five grades: 5.21, 3.19, 2.24, 1.70 and 0.06. Only one OTU has the largest weight of 5.21, OTU253 (Gemella), and only one has the second largest weight of 3.19, OTU857 (Prevotella). The mean relative abundances of these two OTUs in the G and D groups are shown in Fig. 5a. For the three smaller grades, the mean relative abundances of OTU7, OTU35 and OTU1 are plotted in Fig. 5b. OTU253 and OTU857, which have larger weights, differ markedly between the G and D groups, whereas OTU7, OTU35 and OTU1, which have smaller weights, differ only slightly.

Fig. 3

The weights of OTUs for 3-classification (N, G and D groups) and 2-classification (G and D groups). N - Normal healthy subjects, D - Patients with T2DM, G - Patients with T2DM with gastrointestinal autonomic neuropathy

Fig. 4

Mean relative abundances of the five OTUs with the largest weight for 3-classification. N - Normal healthy subjects, D - Patients with T2DM, G - Patients with T2DM with gastrointestinal autonomic neuropathy

Fig. 5

Mean relative abundances of OTUs from the five weight grades for 2-classification. N - Normal healthy subjects, D - Patients with T2DM, G - Patients with T2DM with gastrointestinal autonomic neuropathy

4.3 Topic analysis

The gut microbiome data of the three groups input into the LDA model form a 1551 × 140 matrix, and the optimal number of topics is 12, as shown in Fig. 11. The per-patient topic distributions form a 140 × 12 matrix, whose heat map is shown in Fig. 6, in which the three groups are marked in red, blue and green on the right side. The per-topic OTU distributions form a 1551 × 12 matrix, whose heat map is shown in Fig. 7. The top 10 OTUs with the highest probability in each topic are listed in Table 2, with the names of the OTUs given at the genus level; the OTUs in each topic are arranged in descending order of probability. The gut microbiome data of the two groups (G and D) form a 1551 × 101 matrix, and the number of topics is also 12. The per-patient topic distributions form a 101 × 12 matrix, as shown in Fig. 8. The per-topic OTU distributions again form a 1551 × 12 matrix, as shown in Fig. 9, and the top 10 OTUs with the highest probability are listed in Table 3.

Fig. 6

Heat map of the per-patient topic distributions of the N, G and D groups when the number of topics is 12. The color key from blue to red shows the topic probability of each patient, ranging from 0 to 1. On the right side of the graph, the three groups of the 140 subjects are shown in red, blue and green. Topic 6 is mainly spread among the N group, topics 5, 8 and 12 are mainly spread among the D group, and topics 4 and 7 are mainly spread among the G group. N - Normal healthy subjects, D - Patients with T2DM, G - Patients with T2DM with gastrointestinal autonomic neuropathy

Fig. 7

Heat map of the per-topic OTU distributions of the N, G and D groups when the number of topics is 12. The color key from blue to red shows the value of the OTU probability within each topic, ranging from −14 to −4. The first 150 OTUs have high probability among the 12 topics. N - Normal healthy subjects, D - Patients with T2DM, G - Patients with T2DM with gastrointestinal autonomic neuropathy

Table 2 Top 10 OTUs with high probability in topics 4, 5, 6, 7, 8 and 12 of the N, G and D groups

Fig. 8

Heat map of the per-patient topic distributions of the G and D groups when the number of topics is 12. D - Patients with T2DM, G - Patients with T2DM with gastrointestinal autonomic neuropathy

Fig. 9

Heat map of the per-topic OTU distributions of the G and D groups when the number of topics is 12. D - Patients with T2DM, G - Patients with T2DM with gastrointestinal autonomic neuropathy

Table 3 Top 10 OTUs with high probability in topics 1, 3, 5, 9 and 10 of the G and D groups

As shown in Fig. 6, topic 6 is mainly spread among the N group and covers most healthy subjects. Topics 5, 8 and 12 are mainly spread among the D group, and each of them accounts for about one third of the D group. Topics 4 and 7 are mainly spread among the G group; two thirds of topic 7 is spread among the G group and the other third among the D group. Topics 3, 10 and 11 are widely spread among all three groups, whereas topics 2 and 9 are rarely present in any of them.

In Fig. 7, the first 150 OTUs have high probability among the 12 topics. According to Table 2, at the order level, the top 10 OTUs of topic 6 (N group) belong to Clostridiales (Romboutsia, Pseudobutyrivibrio, Faecalibacterium, Lachnospiraceae and Roseburia), Bacteroidales and Burkholderiales (Parasutterella). The top 10 OTUs of topics 5, 8 and 12 (D group) belong to Bacteroidales (Bacteroides, Prevotella), Lactobacillus, Fusobacteriaceae, Clostridiales (Ruminococcus, Romboutsia, Roseburia), Enterobacteriales (Escherichia) and Selenomonadales (Phascolarctobacterium). The top 10 OTUs of topic 4 (G group) belong to Bifidobacteriales, Selenomonadales (Megamonas), Bacteroidales (Bacteroides, Prevotella), Enterobacteriales (Escherichia), Burkholderiales (Parasutterella) and Lactobacillus. The top 10 OTUs of topic 7 belong to Lactobacillus, Enterobacteriales (Escherichia), Bacteroidales, Clostridiales (Romboutsia) and Selenomonadales (Veillonella).

In Fig. 8, topics 3, 9 and 10 are mainly spread among the D group, and each of them accounts for about one third of the D group, similar to topics 5, 8 and 12 in Fig. 6. Topic 5 is mainly spread among the G group. Two thirds of topic 1 is spread among the G group and the other third among the D group. According to Table 3, at the order level, the top 10 OTUs of topics 3, 9 and 10 (D group) belong to Bacteroidales (Bacteroides, Prevotella, Parabacteroides), Clostridiales (Ruminococcus, Pseudobutyrivibrio, Lachnospiraceae), Burkholderiales (Parasutterella) and Selenomonadales (Megamonas). The top 10 OTUs of topic 5 (G group) belong to Lactobacillales (Lactobacillus, Streptococcus), Bacteroidales (Bacteroides, Prevotella, Parabacteroides) and Enterobacteriales (Escherichia). Compared with topic 7 in Table 2, topic 5 contains fewer Clostridiales (Romboutsia) and Selenomonadales (Veillonella). This is because two thirds of topic 7 in Table 2 is spread among the G group and the other third among the D group, whereas topic 5 in Table 3 is spread entirely among the G group. The top 10 OTUs of topic 1 belong to Bacteroidales (Bacteroides, Prevotella), Selenomonadales (Megamonas, Phascolarctobacterium, Veillonella) and Burkholderiales (Parasutterella).

4.4 Topic visualization

To visualize how the topics generated by LDA represent the different groups, the cellTree software package [15] was used to display the generated topics as a tree graph. Extracting a hierarchical structure from the lower-dimensional model follows the same general idea as other dimensionality reduction methods (e.g., PCA or ICA). First, a matrix of pairwise distances is computed, using the chi-square distance [9] to compare the topic histograms. This distance matrix can then be used with various tree-building algorithms to identify the underlying tree structure. One natural way to visualize such a structure is a minimum spanning tree (MST). As shown in Fig. 10, yellow, orange and red nodes represent the N, G and D groups, respectively. The left half of the tree consists of the N group and a small number of G subjects, and the right half consists of the D group and the remaining G subjects. The plot of topics is shown in Supplementary Material S1. Each node in the graph represents one subject, and the colored sectors in the node represent the 12 topics; the proportions of the 12 topics differ between subjects. The backbone trees are shown in Supplementary Material S2 and S3. Large nodes represent the trunk of the tree and small nodes represent its branches. The tree graph and backbone tree of the G and D groups are shown in Supplementary Material S4 and S5.
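The two steps named above, pairwise chi-square distances followed by a minimum spanning tree, can be sketched in plain Python. This is an illustration only and does not reproduce the internals of the visualization package: the symmetric form of the chi-square distance used here, the choice of Prim's algorithm, the function names and the toy topic histograms are all assumptions.

```python
def chi_square_distance(p, q):
    """Symmetric chi-square distance between two topic histograms (assumed form)."""
    total = 0.0
    for a, b in zip(p, q):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

def minimum_spanning_tree(histograms):
    """Prim's algorithm; returns a list of (parent, child, distance) edges."""
    if not histograms:
        return []
    # For every node outside the tree: (distance to tree, closest tree node).
    best = {j: (chi_square_distance(histograms[0], histograms[j]), 0)
            for j in range(1, len(histograms))}
    edges = []
    while best:
        child = min(best, key=lambda node: best[node][0])
        distance, parent = best.pop(child)
        edges.append((parent, child, distance))
        # The newly added node may be closer to the remaining nodes.
        for other in best:
            d = chi_square_distance(histograms[child], histograms[other])
            if d < best[other][0]:
                best[other] = (d, child)
    return edges

# Hypothetical per-patient topic distributions over two topics.
histograms = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
tree_edges = minimum_spanning_tree(histograms)
```

In this toy example the tree links the two subjects dominated by the first topic, links the two dominated by the second topic, and joins the pairs through the closest cross-pair subjects, which is the kind of grouping the tree graph is meant to reveal.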

Fig. 10

Grouping tree graph of the N, G and D groups. Yellow, orange and red nodes represent the N, G and D groups, respectively. N - Normal healthy subjects, D - Patients with T2DM, G - Patients with T2DM with gastrointestinal autonomic neuropathy

4.5 Clustering results

To verify the performance of the proposed method, k-means clustering was performed on four representations: the original data; the per-patient topic distributions (12 topics) generated by LDA, where the LDA model is applied directly so that all OTUs have equal weight; the per-patient topic distributions generated by wLDA, the method proposed in this paper, where the weights of all OTUs are calculated according to formulas (2)-(4) and multiplied by the relative abundance before the LDA model is applied; and the 12 principal components of PCA. Clustering performance was measured by ARI, as shown in Table 4. The ARI of both LDA and wLDA is equal to 1, which is better than that of the original data, and the ARI of PCA for 3-classification is the worst.

Table 4 Comparison of clustering performance ARI.

4.6 Classification results

SVM and RF classifiers were trained to compare the per-patient topic distributions generated by LDA and wLDA, and the classification accuracy is shown in Table 5. The 3-classification accuracy of original data + SVM is low, only 0.5952, whereas the accuracy of LDA + SVM and wLDA + SVM is considerably higher, reaching 0.8571 and 1, respectively. The 2-classification accuracy of LDA + SVM and LDA + RF is the same as that of the original data, whereas the 2- and 3-classification accuracy of wLDA + SVM and wLDA + RF is considerably higher. Note that the weights differ between classification tasks, because they are calculated separately for each task. For comparison with wLDA, the number of topics of LDA is also set to 12.

Table 5 Comparison of classification accuracy

5 Discussion

Using unsupervised learning or clustering methods to determine clusters of communities or envirotypes is a hot issue in the analysis of microbial community data. However, previous studies mostly adopted methods such as PCA [32, 33, 38], PCoA [13, 39, 49] and PAM clustering [2, 57]. Since microbiome data have some inherent problems [25], new methods are needed. In this study, a new method based on the probabilistic topic model was proposed to analyze the gut microbiome of the N, G and D groups.

To study the roles of different OTUs in the three groups, the weight of each OTU was calculated. The results showed that some OTUs varied greatly between groups, leading to a large change in recognition rate, and thus acquired large weights according to formulas (2)~(4). In this way, it could be found that these OTUs have a higher correlation with T2DM, but it cannot be inferred that these OTUs are indicators that distinguish the groups or that they are dominant in a certain group.

When the probabilistic topic model was used, the topic distributions and corresponding OTUs of the different groups were obtained through the per-patient topic distributions and the per-topic OTU distributions. It was easy to find out which OTUs had changed and which OTUs were dominant in each topic. However, our previous experiments found that the highest-probability OTUs in the topics that LDA learned from the original relative abundance data were all Bacteroides. This is not surprising, because the most abundant bacteria in the human gut are Bacteroidetes and Firmicutes [46]. In addition, the per-patient topic distributions of LDA were employed as features for clustering and classification, and it was hoped that the recognition rate could be improved further. Could OTUs with a high probability of being associated with each group be found? Could the recognition rate of LDA be further improved? For these reasons, combining the weight information with the LDA model was considered.

Applying the proposed method to the gut microbiome of the N, G and D groups gives the per-patient topic distributions and the per-topic OTU distributions shown in Figs. 6, 7, 8 and 9. Compared with the N group, Fusobacterium, Enterobacteriales and Selenomonadales increase in the D group, and Clostridiales and Burkholderiales decrease. In the G group, Lactobacillus, Bacteroidales and Enterobacteriales increase, and Clostridiales decrease. This is particularly evident in topic 5 of Table 3, which is spread throughout the G group. This indicates that patients with T2DM have moderate intestinal dysregulation, which is consistent with some previous studies [30, 38, 41]. Qin et al. [38] found that in the intestinal tract of T2DM patients, the abundance of butyric acid-producing bacteria such as Clostridiales (Roseburia and Faecalibacterium) decreased, and the abundance of some opportunistic pathogens such as Bacteroidales and Enterobacteriales increased. Sato et al. [41] found that Lactobacillus increased in the intestinal tract of T2DM patients, and that some Clostridiales that produce butyric acid, a short-chain fatty acid (SCFA), decreased. Karlsson et al. [30] found that the abundance of 4 Lactobacillus increased and that of 5 Clostridium decreased in T2DM. In addition, when the weights of the OTUs were calculated, 7 OTUs with larger changes were found, none of which were Lactobacillus, Bacteroidales or Enterobacteriales. When the weight information is combined with the probabilistic topic model, the situation is different. Using wLDA, topic 5 was mainly distributed in the G group, and the most probable OTUs were Lactobacillus, Bacteroidales and Enterobacteriales. This shows that the probabilistic topic model has advantages in mining hidden information, and that the information mined after adding the supervised information is more targeted.

The concept of “topic” in the probabilistic topic model comes from the field of document mining, where it is used to mine the hidden topics in a document set. It is similar to the concept of “cluster” in clustering methods. A clustering method determines how many clusters are in the gut microbiome data, while the probabilistic topic model determines how many topics there are. The way this number was determined is described in the explanation of Fig. 11 (12 topics). These topics are reflected by two distributions, the “per-patient topic distributions” and the “per-topic OTU distributions”, as shown in Figs. 6 and 7. To further visualize the representation of the topics generated by LDA, the Celltree software package is employed to draw the generated topics as a tree graph [15], as shown in supplement material S1. Grouping tree graphs can be drawn by calculating distances from the topics obtained, and the three groups can be clearly distinguished in these graphs.

Fig. 11

A log likelihood value graph with the number of topics from 2 to 50. The red dot in the graph indicates that when K is 12, the log likelihood reaches the local minimum

The clustering effect of the G group on the tree graphs is not very good (Fig. 10). In fact, from the output of wLDA (per-patient topic distributions, Fig. 6), it can be seen that topic 6 is mainly spread among the N group; topics 5, 8 and 12 are mainly spread among the D group; and topics 4 and 7 are mainly spread among the G group. Three-quarters of topic 4 is in the G group and the other quarter is in the N group, while two-thirds of topic 7 is in the G group and the other third is in the D group. This indicates that no topic belongs entirely to the G group. Topics 4 and 7 are shared with the N and D groups, which explains why the clustering effect of the G group on the tree graph is not as good as that of the N and D groups.

To objectively illustrate the effectiveness of our proposed method, the outputs of wLDA, LDA and PCA were clustered and classified, and their performance was measured by two indicators, ARI and recognition rate. The ARI of LDA and wLDA reaches 1. The performance of wLDA + SVM and wLDA + RF is significantly improved. This indicates that combining weight information with the probabilistic topic model is effective. In addition, data transformations may influence classifier performance. Because microbiome data are zero-inflated [11], subsequent studies will consider preprocessing the gut microbiome data, for example with the centered log-ratio (CLR) transformation, and will further verify its impact on the statistical results and classification performance.
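As an illustration of the CLR transformation mentioned above, the sketch below applies it to one hypothetical relative abundance vector. Because the logarithm of zero is undefined, a small pseudocount is added first. The pseudocount value is an assumption made for this example, not a choice reported in this study, and other zero-handling strategies exist.

```python
import math


def clr(composition, pseudocount=1e-6):
    """Centered log-ratio transform of one compositional sample.

    The pseudocount (an assumed value) replaces exact zeros so that the
    logarithm is defined. Each output is log(x_i) minus the mean of the logs,
    i.e. the log of x_i divided by the geometric mean of the sample.
    """
    shifted = [x + pseudocount for x in composition]
    logs = [math.log(x) for x in shifted]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical relative abundances of four OTUs in one subject (one zero count)
relative_abundance = [0.60, 0.30, 0.10, 0.0]
clr_values = clr(relative_abundance)
print(clr_values)
```

CLR values sum to zero within each sample and preserve the ordering of the OTU abundances.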

Regarding the number of topics, the maptpx method provided by the Celltree software package was used first [44]; it determines the optimal number of topics by checking whether the log likelihood value decreases monotonically over three consecutive iterations. The optimal number of topics given by this method for the gut microbiome of the three groups is greater than 100, which does not meet our needs, because fewer and more representative topics are expected. Therefore, the Gibbs method [37] was employed to draw a graph of the log likelihood value for numbers of topics from 2 to 50, as shown in Fig. 11. As the number of topics increases, the log likelihood value keeps increasing, which is similar to the result of the maptpx method. Theoretically, the higher the likelihood value, the better the model, but a model with a large number of topics is hard to interpret for our data. Therefore, the topic number 12, which corresponds to the local extremum of the curve, is taken as the optimal number of topics, i.e., the red dot in Fig. 11. With the number of topics fixed at 12, the per-patient topic distributions and the per-topic OTU distributions are shown in Figs. 6 and 7. To further validate this choice, the Gibbs method was run again with the number of topics from 13 to 50. The experimental results show that although the number of topics in the heat map of the per-patient topic distributions increases, the unique topics of the three groups do not change, as detailed in the Appendix (K = 13, 15, 20, 30, 40 and 50).
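The selection rule described above, picking a local extremum on a log likelihood curve that otherwise keeps rising, can be sketched as follows. The helper `local_extrema` is hypothetical and the log likelihood values are made up for illustration; they are not the values plotted in Fig. 11.

```python
def local_extrema(ks, values):
    """Return (K, kind) for every interior point that is a strict local max or min."""
    found = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if cur > prev and cur > nxt:
            found.append((ks[i], "max"))
        elif cur < prev and cur < nxt:
            found.append((ks[i], "min"))
    return found


# Made-up log likelihood values for K = 10..15 (illustration only)
ks = [10, 11, 12, 13, 14, 15]
log_lik = [-5200.0, -5150.0, -5175.0, -5120.0, -5080.0, -5040.0]

candidates = local_extrema(ks, log_lik)
print(candidates)
```

On a globally increasing curve, such local extrema mark the few places where adding one more topic changes the fit in the opposite direction, which makes them natural candidates for a small, interpretable number of topics.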

Regarding the computational cost of the LDA model, the time complexity is O(Niter × K × Np × L(L − 1)/2), where Niter is the number of iterations, K is the number of topics, Np is the number of patients, and L is the length of OTUs. The space complexity is Np × K + W × K + Np × L, where Np × K is the size of the per-patient topic distribution and W × K is the size of the per-topic OTU distribution. When the number of topics ranges from 2 to 50 and the number of patients is 140, the running time of the maptpx method of LDA is 100.53 s, while that of the Gibbs method is 2188.04 s.

Regarding T2DM with gastrointestinal autonomic neuropathy, statistics indicate that about 4% ~ 22% of diabetes patients may have diarrhea [16, 40]. Brock et al. found that 50% of patients with long-course diabetes have severe gastrointestinal symptoms [8]. The mechanism by which diabetes induces chronic diarrhea is not well understood. However, some scholars believe that visceral autonomic neuropathy caused by diabetes leads to intestinal dysfunction, which in turn causes irritable bowel, increased secretion, increased stool frequency and stool thinning, and that these constitute the basis for the onset of chronic diarrhea induced by diabetes [3, 18]. There are not many studies on the relationship between diabetic diarrhea and the gut microbiome [34]. Virally-Monod’s studies showed that intestinal bacteria proliferated excessively in patients with diabetes accompanied by chronic diarrhea, with an incidence of 43% [50]. In this paper, it is found that Lactobacillus, Bacteroidales and Enterobacteriales increased and Clostridiales decreased in the intestinal tract of patients with T2DM with gastrointestinal autonomic neuropathy. However, only 27 such patients were included, so more patients and further studies are needed to investigate the relationship between T2DM with gastrointestinal autonomic neuropathy and the gut microbiome.

Finally, a new model based on the probabilistic topic model was proposed in this study to analyze the gut microbiome of T2DM patients. In T2DM, Fusobacterium, Enterobacteriales and Selenomonadales increased, and Clostridiales and Burkholderiales decreased. In T2DM with gastrointestinal autonomic neuropathy, Lactobacillus, Bacteroidales and Enterobacteriales increased, and Clostridiales decreased. This provides a new perspective for studying the gut microbiome, and may provide new targeted microbiological treatment for type 2 diabetes. In addition, the distributions generated by the LDA model can be used as new features in combination with various data mining algorithms, which has great application potential and will help us to better understand the structural differences of the gut microbiome among different populations.