Abstract
Metagenomic assembly is challenging because of the huge volume of data produced by next-generation sequencing (NGS). Because clustering strategies can handle large amounts of data, they are well suited to working around memory limitations. SpaRC (Spark Reads Clustering), a scalable sequence clustering tool built on Apache Spark, a distributed big data analysis platform, can cluster hundreds of GBs of sequences from different genomes. However, the Label Propagation Algorithm (LPA) used in SpaRC is often unstable, causing the clustering results to oscillate and to contain too many tiny clusters. In this paper, we propose a method for clustering metagenomic sequences based on the distributed Louvain algorithm to obtain more accurate clustering results. We ran experiments on two different datasets containing millions of genome sequences, using LPA and Louvain, respectively. The experimental results indicate that this approach effectively improves clustering performance. We hope that the method can be widely used in other metagenomic clustering studies.
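To make the modularity-based idea concrete, the following is a minimal single-machine sketch of the first (local moving) phase of the Louvain method on a small weighted graph. It is an illustration only, not the distributed implementation used in this work: the function names (`build_graph`, `modularity`, `louvain_local_moving`) and the toy graph, in which nodes stand for reads and edges for hypothetical read overlaps, are assumptions introduced here. The full Louvain method additionally aggregates communities into super-nodes and repeats, which this sketch omits.

```python
from collections import defaultdict


def build_graph(edges):
    """Return an undirected weighted adjacency map {node: {neighbor: weight}}."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return dict(adj)


def modularity(adj, labels):
    """Newman modularity Q of a partition given as {node: community label}."""
    two_m = sum(sum(nbrs.values()) for nbrs in adj.values())
    internal = defaultdict(float)  # intra-community weight, each edge counted twice
    total = defaultdict(float)     # summed weighted degree per community
    for u, nbrs in adj.items():
        total[labels[u]] += sum(nbrs.values())
        for v, w in nbrs.items():
            if labels[u] == labels[v]:
                internal[labels[u]] += w
    return sum(internal[c] / two_m - (total[c] / two_m) ** 2 for c in total)


def louvain_local_moving(adj):
    """First Louvain phase: move nodes between communities while Q improves."""
    labels = {u: u for u in adj}  # every node starts in its own community
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    total = dict(degree)          # community label -> summed degree
    two_m = sum(degree.values())
    moved = True
    while moved:
        moved = False
        for u in sorted(adj):     # fixed visiting order keeps the sketch deterministic
            old = labels[u]
            total[old] -= degree[u]  # temporarily take u out of its community
            weight_to = defaultdict(float)
            for v, w in adj[u].items():
                if v != u:
                    weight_to[labels[v]] += w

            def gain(c):
                # Modularity gain (up to a constant factor) of placing u in c.
                return weight_to[c] - total[c] * degree[u] / two_m

            best, best_gain = old, gain(old)
            for c in sorted(weight_to):
                g = gain(c)
                if g > best_gain:  # strict improvement only, so ties keep u in place
                    best, best_gain = c, g
            total[best] += degree[u]
            if best != old:
                labels[u] = best
                moved = True
    return labels


# Toy example (hypothetical data): two groups of four reads, joined by one overlap.
edges = [(a, b, 1.0) for a in range(4) for b in range(a + 1, 4)]
edges += [(a, b, 1.0) for a in range(4, 8) for b in range(a + 1, 8)]
edges += [(3, 4, 1.0)]
adj = build_graph(edges)
labels = louvain_local_moving(adj)
print(labels, round(modularity(adj, labels), 3))
```

On this toy graph the local moving phase groups the two densely connected sets of reads into two clusters. Because a node moves only when modularity strictly increases, the loop cannot cycle between the same label assignments, which is the kind of oscillation the abstract attributes to LPA.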
Acknowledgments
This work is supported by the National Natural Science Foundation (NNSF) of China under Grant 61802246.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lu, Y., Deng, L., Wang, L., Li, K., Wu, J. (2020). Improving Metagenome Sequence Clustering Application Performance Using Louvain Algorithm. In: Fei, M., Li, K., Yang, Z., Niu, Q., Li, X. (eds) Recent Featured Applications of Artificial Intelligence Methods. LSMS 2020 and ICSEE 2020 Workshops. LSMS ICSEE 2020 2020. Communications in Computer and Information Science, vol 1303. Springer, Singapore. https://doi.org/10.1007/978-981-33-6378-6_29
DOI: https://doi.org/10.1007/978-981-33-6378-6_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6377-9
Online ISBN: 978-981-33-6378-6
eBook Packages: Computer Science, Computer Science (R0)