Meta-network: optimized species-species network analysis for microbial communities
The explosive growth of microbiome data provides ample opportunities to gain a better understanding of the microbes and their interactions in microbial communities. Given these massive data, optimized data mining methods become important and necessary to perform deep and comprehensive analysis. Among the various priorities for microbiome data mining, the examination of species-species co-occurrence patterns becomes one of the key themes in urgent need.
Hence, in this work, we propose the Meta-Network framework to lucubrate the microbial communities. Rooted in loose definitions of network (two species co-exist in a certain samples rather than all samples) as well as association rule mining (mining more complex forms of correlations like indirect correlation and mutual information), this framework outperforms other methods in restoring the microbial communities, based on two cohorts of microbial communities: (a) the loose definition strategy is capable to generate more reasonable relationships among species in the species-species co-occurrence network; (b) important species-species co-occurrence patterns could not be identified by other existing approaches, but could successfully generated by association rule mining.
Results have shown that the species-species co-occurrence network we generated are much more informative than those based on traditional methods. Meta-Network has consistently constructed more meaningful networks with biologically important clusters, hubs, and provides a general approach towards deciphering the species-species co-occurrence networks.
KeywordsMicrobial network Data-mining Network analysis Associate-rule mining
Operational Taxonomic Unit
Path Consistency Algorithm
Part Mutual Information
Network-based approaches are gaining momentum as one of the most helpful tools for the analysis of microbial community structure. They offer new methodological and biological insights to investigate species interactions. Many microorganisms co-exist by interacting with each other and effectively exert various functions . In addition, due to currently insufficient understanding of the community structure, the mounting volume of metagenomics data limits the traditional network analysis to recover the real relationships in bacterial community . Hence finding out intricate yet important associations (for example, to explore the cyclical process of a substance or element in a bacterial community ) becomes increasingly challenging for traditional methods.
To address these limitations, we propose Meta-Network to establish the species-species co-occurrence network. The loose definition method is introduced first to recover more correlations before correlation calculation. Then we utilize the FS-Weight and PCA-PMI (Part Mutual Information adjusted by Path Consistency Algorithm) methods to explore the indirect and non-linear correlations, respectively. To investigate the optimized species-species co-occurrence network, systematic evaluation is investigated to discover meaningful biological implications in the species-species co-occurrence network.
Construction workflow of species-species co-occurrence network
Loose definition of species-species relationship (Fig. 1, a.1)
Indirect relationships in the network (Fig. 1, a.2)
While direct relationships have been reported in many existing works in the context of intricate bacterial community, it is possible to detect more complex correlations using association rule mining . In bacterial community, bacteria which do not interact but share interaction partners may play roles in the same functional pathway . Hence, in our context, the FS-Weight method is applied to detect the indirect correlations (Fig. 1, a.2). FS-Weight measures the overlap between the pairwise species, and is originally designed to estimate association between direct and indirect correlations based on network structure .
For pairwise species, FS-Weight value between them was calculated in two steps. First, the direct correlations were calculated using a user-defined threshold for edge value. In this work, Pearson coefficient correlation was selected and the thresholds was set as 0.5 on genus level and 0.7 on OTU level based on previous research . Second, the network constructed by FS-Weight was applied to filter out less reliable correlations and to add meaningful indirect relationships. FS-Weight threshold was set to 0.7 on OTU level and 0.5 on genus level based on the comparison between different thresholds. The benchmark result for FS-Weight on genus level was shown at Additional file 1: Figure S1, A and B.
Nonlinear associations (Fig. 1, a.3)
In order-one PCA-PMI, only one species was considered as indirect neighbor between two species, and the Path Consistency Algorithm (PCA)  was applied to adjust the correlation distribution using a user-defined correlation threshold (0.02 as default). This threshold was set based on the comparison between different thresholds. The benchmark result for FS-Weight on genus level was shown at Additional file 1: Figure S1, C and D. The sparse matrix is defined as a matrix owing a large number of nodes and sparse species distribution, thus it is feasible to use PCA method to construct the species-species co-occurrence network . After all linear and non-linear correlations calculated, the network was updated by detecting higher orders of PMI (more species were set as indirect neighbor for two species). Higher orders of PMI are calculated to check the correlation reliability iteratively until no more edges are changed.
Identification of clusters and hubs in the network
Detecting co-occurrence patterns and their biological significance (Fig. 1, b) supply an important material to understand the microbial network. First, we perform cluster calculation and hub nodes detection based on density clustering. We apply the MCODE cluster algorithm to detect the potential clusters . Aligning cluster members against taxonomical or functional annotation databases can predict potential cluster functions and taxonomical compositions.
To identify hub nodes, we calculate the most connected nodes as candidates for hub nodes. These nodes are selected to perform Kruskal-Wallis test (test differences in network distribution before and after deleting hub node) to decide whether they are hub nodes. Finally, we interpret the clusters and hub nodes based on known taxonomical and functional databases and literatures.
Systematic evaluation methods
In order to compare Pearson, loose definition, FS-Weight, PCA-PMI algorithm, the correlation coefficient calculated by loose definition which is implemented by the loose definition plus Pearson method.
First, we calculate the global network alignment by MAGNA++  based on genetic algorithm. The alignment quality is measured by edge correctness (EC), induced conserved structure (ICS) and symmetric substructure score (S3) for each pair of networks. Then, after all nodes compared based on Jaccard index, we construct a similarity tree measured by the global similarity between any two networks been compared.
Secondly, these four algorithms were compared based on their topological structures including network similarity, global properties and local properties by compNet software . Furthermore, we analyze the local properties for all taxon based on the R package igraph (http://igraph.org/r/), including coreness, degree, eigenvector centrality, eccentricity, betweeness and degree centrality.
Thirdly, we examined the correlation distribution using the same set of species as target. Based on top 100 abundant genera in gut microbiome datasets, four networks are constructed to investigate their edge distributions, which reflect the correlation difference. Then network motif distribution was investigated by mfinder (version 1.2) . The motif size is set to 4, the query random network size set as 100, and other parameters are set as default. Network motif distribution was calculated in all sub networks and ranked by their frequency.
Results and discussions
Network construction based on human gut datasets
We applied the Meta-Network on a dataset of health young Chinese as our representation of human gut (MGP15838 in MG-RAST database) . It consists of 314 healthy young adults, covering 20 rural and urban cohorts from 7 ethnic groups and 9 provinces throughout China. 5,102,015 high-quality sequences were generated and QIIME (version 1.91)  was applied to process this dataset. Finally, we obtained in total 24, 125 microbial OTUs. On genus level, we identified in total 2124 genera in which 102 genera process the relative abundance above 0.1%.
Network comparison based on human gut datasets
We performed a systematic evaluation to compare the network constructed by Pearson, loose definition, FS-Weight and PCA-PMI algorithms. Global network alignment was carried to compare the network constructed by these four algorithms on OTU level (Fig. 3, a). The alignment between networks constructed by Pearson correlation (100% co-occurrence probability) and loose definition method (80% co-occurrence probability) measured the co-occurrence probability optimization. Owing to a low EC and S3 score (0.63, 0.579, respectively), the two networks had a low match. Network constructed by FS-Weight and PCA-PMI methods had the highest quality score (S3 score: 0.982), indicating that the network constructed by FS-Weight and PCA-PMI methods had the highest similarity. Network alignment results indicate that: First, both two association rule mining methods show a low match to network constructed by Pearson method. Second, two different methods show a high similarity.
Based on the similarity tree based on Jaccard index, network constructed by Pearson method had a low similarity with other three networks (Fig. 3, b). Network constructed by loose-definition and FS-Weight had the shortest distance because FS-Weight employed loose-definition abundance as input to find indirect correlations and PCA-PMI based network had a similar structural composition compared to network constructed by FS-Weight method.
Based on their motif distributions, we compared these four networks to reflect homology relationship (Fig. 3, c). Network constructed by loose definition and FS-Weight method present similar network motif distributions, and this result suggested a high homology . Motifs detected in Pearson method had a different distribution with the others (Fig. 3, d).
Correlation distribution and subnetwork comparison based on human gut datasets
In the network constructed by Pearson and Spearman algorithm (Fig. 4, b), Genera Lactocuccus and Granulicatella processed low abundance (average abundance 1.25e-5, 2.58e-4, respectively) and existed in 15 and 14 samples, respectively. In addition, a strong and significant positive correlation were detected between them (correlation coefficient 0.915, 0.823, respectively, both p-value < 0.01). No convincing literature and reported function (Lactocuccus: commonly identified as produce lactic acid, Granulicatella: commonly identified as potential pathogenic bacteria) could prove their relationships and both genera [19, 20]. These two correlations might be noise correlations and do not take part in this non-digestible carbohydrates degradation pathway.
More importantly, FS-Weight and PCA-PMI methods are capable to discover correlations undetected by the Pearson and Spearman algorithms (Fig. 4, d). For example, genus Syntrophomonas is reported as a fatty-acidoxidizing genus which cooperates with methanogens such as genus Methanosphaera and Methanobrevibacter . However, Pearson method failed to detect these correlations, probably due to the large heterogeneity (different kinds of methanogens in gut samples) in human gut dataset. On the contrary, the correlations between Syntrophomonas and methanogens could be clearly identified by FS-Weight and PCA-PMI methods (FS-Weight correlation: 0.873, 0.926; PCA-PMI correlation: 0.912, 0.915). The interaction mechanism was illustrated in Fig. 4, c. First, genera such as Bacteroides and Ruminococcus degrade the non-digestible carbohydrates into the short-chain fatty acid . Second, the Syntrophomonas and methanogens cooperate to transfer the short-chain fatty acid into methane and energy. Combining the network analysis results with literature review, we speculate that Bacteroides and Ruminococcus play the role, as indirect neighbor, between the Syntrophomonas and methanogens.
In network constructed by CCLasso, the indirect correlation pattern could also be found, which could again prove the indirect pattern and advantage of FS-Weight and PCA-PMI. The result is provided in Additional file 1.
Network construction and analysis based on the Tara oceans datasets
Figure 5, b illustrated the network constructed by FS-Weight algorithm, in which 190 nodes and 8135 edges were detected, including 3910 indirect correlations. Ten clusters were calculated and 6 of them were composed with similar taxonomical composition compared to the network constructed by Pearson method. New cluster members were identified as common members in the ocean bacterial community . The network constructed by PCA-PMI method was composed with 237 nodes and 5861 edges, with a clustering coefficient of 0.666 (Fig. 5, c). Nitrosopumilus and Candidatus_Scalindua were detected only in network constructed by PCA-PMI method and were identified as common members in the ocean environment .
These results have again, proved that the Meta-Network workflow has the ability to discover more complex formats of correlations. Moreover, Networks constructed by two different methods (FS-Weight and PCA-PMI) shared many identical network members and clusters, indicating that both methods portrayed the complex formats of correlations in microbial communities.
Species-species co-occurrence network is becoming one of the emerging fronts for microbiome research, largely due to the important ecological patterns it could reveal. However, previous methods are limited by only detecting the direct correlations among pairwise species. This work presents the Meta-Network method that optimized the construction, analysis and interpretation of the species-species co-occurrence network, focusing on three critical processes for species-species correlation analysis: we first employ loose definition to recover correlations missed by strict co-occurrence probability before calculating correlations. Based on these candidate correlations, we also develop two association rule mining methods for the recurrence of the real bacterial community network: a graph-based method FS-Weight to detect indirect correlations, PCA-PMI method to detect indirect correlations. Therefore, we believe that Meta-Network provides a general approach towards deciphering the species-species co-occurrence networks.
This method could be improved in several ways: in line with current gene expression network inferences, we can also improve its high-level associations (i.e., associations between two clusters of species). A complete species-species network, which provides the full picture of the regulation profile in the community, should also include viruses. All of these will be considered in our future work.
We thank Torsten Juelich for linguistic assistance during the preparation of this manuscript.
This work is partially supported by National Science Foundation of China grant 31871334 and 31671374, Ministry of Science and Technology’s high-tech (863) grant 2018YFC0910502, and Sino-German Research Center grant GZ878. Publication of this article was sponsored by Ministry of Science and Technology’s high-tech (863) grant 2018YFC0910502.
Availability of data and materials
The source code and documentation are accessible at http://www.microbioinformatics.org/software/Meta-Network.htm.
About this supplement
This article has been published as part of BMC Genomics Volume 20 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-20-supplement-2.
The whole study was designed by KN. PSY and SJY collected samples and analyzed the data. PSY, CL, KN explained the results, PSY and KN wrote the initial draft of the manuscript, and all revised the manuscript. All of the authors have read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare no competing financial interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 20.Kuczynski J, Stombaugh J, Walters WA, Gonzalez A, Caporaso JG, Knight R. Using QIIME to analyze 16S rRNA gene sequences from microbial communities. Curr Protoc Microbiol. 2012, Chapter 1:Unit;27:1E–5.Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.