Using Spark and GraphX to Parallelize Large-Scale Simulations of Bacterial Populations over Host Contact Networks

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10393)


Large-scale population genetics studies are fundamental for phylogenetic and epidemiological analysis of pathogens, and the validation of both the evolutionary models and the methods used in such studies depends on the analysis of large datasets. Working with large real datasets is, however, unrealistic, as only rather small samples of the real pathogen population are available. Given model complexity and the required population sizes, large-scale simulations are therefore the only way to address this issue. In this paper we study how to efficiently parallelize such extensive simulations on top of Apache Spark, making use of both the MapReduce programming model and the GraphX API. We propose a simulation framework for large bacterial populations over host contact networks, implementing the Wright-Fisher model. The experimental evaluation shows that we can effectively speed up simulations. We also evaluate inherent parallelism limits, drawing conclusions on the relation between cluster computing power and simulation speedup.
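As a point of orientation only, the core resampling step of a neutral Wright-Fisher model can be sketched in a few lines of plain Python. This is a minimal single-machine sketch under stated assumptions, not the framework proposed in the paper: it ignores host contact networks, Spark, and GraphX entirely, and the function names (`wright_fisher_step`, `simulate`), the integer allele labels, and the "every mutation creates a new allele" rule are all illustrative choices.

```python
import random
from collections import Counter


def wright_fisher_step(population, mutation_rate, rng, next_allele):
    """Build the next generation by sampling parents uniformly with replacement.

    Assumption: alleles are plain integers and each mutation introduces a
    brand-new allele label (hypothetical simplification for this sketch).
    """
    offspring = []
    for _ in range(len(population)):
        allele = rng.choice(population)       # child inherits from a random parent
        if rng.random() < mutation_rate:      # occasional mutation
            allele = next_allele
            next_allele += 1
        offspring.append(allele)
    return offspring, next_allele


def simulate(pop_size, generations, mutation_rate, seed=0):
    """Run a fixed-size population forward for a number of generations."""
    rng = random.Random(seed)                 # seeded for reproducible runs
    population = [0] * pop_size               # start from a single clone
    next_allele = 1
    for _ in range(generations):
        population, next_allele = wright_fisher_step(
            population, mutation_rate, rng, next_allele
        )
    return population


final = simulate(pop_size=1000, generations=50, mutation_rate=0.01)
print(Counter(final).most_common(3))          # the three most frequent alleles
```

Generations must be processed one after another, but within a generation each offspring is drawn independently of the others, which is the kind of per-individual work that a map-style, data-parallel formulation can distribute.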


Population genetics, Large-scale simulations, Graph-parallel computations, Spark, GraphX



This work was partly supported by DEI, IST, Universidade de Lisboa, and national funds through FCT – Fundação para a Ciência e Tecnologia, under projects TUBITACK/0004/2014, LISBOA-01-0145-FEDER-016394, PTDC/EEISII/5081/2014, PTDC/MAT/STA/3358/2014, and UID/CEC/500021/2013.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. INESC-ID Lisboa, Lisboa, Portugal
  2. Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
  3. Faculdade de Medicina, Instituto de Microbiologia and Instituto de Medicina Molecular, Universidade de Lisboa, Lisboa, Portugal
