Applications of Community Detection Algorithms to Large Biological Datasets

  • Protocol
Deep Sequencing Data Analysis

Part of the book series: Methods in Molecular Biology (MIMB, volume 2243)


Recent advances in data acquisition technologies in biology have made mining relevant information from large datasets a major challenge. For example, single-cell RNA sequencing technologies produce expression and sequence information from tens of thousands of cells in a single experiment. A common task in analyzing biological data is to cluster samples or features (e.g., genes) into groups that share common characteristics. This is an NP-hard problem for which numerous heuristic algorithms have been developed. In many cases, however, the clusters created by these algorithms do not reflect biological reality. To overcome this, a Networks Based Clustering (NBC) approach was recently proposed, in which the samples or genes in the dataset are first mapped to a network, and community detection (CD) algorithms are then used to identify clusters of nodes.

Here, we created an open and flexible Python-based toolkit for NBC that enables easy and accessible network construction and community detection. We then tested the applicability of NBC for identifying clusters of cells or genes from previously published large-scale single-cell and bulk RNA-seq datasets.

We show that NBC can be used to accurately and efficiently analyze large-scale datasets of RNA sequencing experiments.
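
The two NBC steps described above (map samples to a network, then run community detection on it) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' toolkit: the toy two-gene expression values, the choice of a k-nearest-neighbor graph with Euclidean distance, the value k = 3, and the use of label propagation as the community detection step are all assumptions made here for illustration.

```python
import math
from collections import Counter


def knn_graph(points, k):
    """Build an undirected k-nearest-neighbor graph as {node: set(neighbors)}.

    Assumption: Euclidean distance; other metrics could be substituted.
    """
    n = len(points)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        # Sort all other points by distance to point i (ties broken by index).
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)  # symmetrize so the graph is undirected
    return adj


def label_propagation(adj, max_iter=100):
    """Asynchronous label propagation with deterministic tie-breaking.

    Each node repeatedly adopts the most common label among its neighbors;
    ties go to the smallest label so that the result is reproducible.
    """
    labels = {v: v for v in adj}  # every node starts in its own community
    for _ in range(max_iter):
        changed = False
        for v in sorted(adj):
            if not adj[v]:
                continue  # isolated node keeps its own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:
            break
    return labels


# Hypothetical toy data: 8 "cells" measured on 2 "genes", forming two groups.
cells = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),          # group A
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),  # group B
]

graph = knn_graph(cells, k=3)      # step 1: map samples to a network
labels = label_propagation(graph)  # step 2: detect communities of nodes

communities = {}
for node, label in labels.items():
    communities.setdefault(label, []).append(node)
print(sorted(communities.values()))
```

On real data the neighbor count, the distance measure, and the community detection algorithm are all tunable choices, and the quadratic all-pairs distance computation used here would need to be replaced by a faster neighbor search.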

Author information

Corresponding authors

Correspondence to Gur Yaari or Tomer Kalisky.

Copyright information

© 2021 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Kanter, I., Yaari, G., Kalisky, T. (2021). Applications of Community Detection Algorithms to Large Biological Datasets. In: Shomron, N. (ed) Deep Sequencing Data Analysis. Methods in Molecular Biology, vol 2243. Humana, New York, NY.

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-1102-9

  • Online ISBN: 978-1-0716-1103-6

  • eBook Packages: Springer Protocols
