The Journal of Supercomputing, Volume 73, Issue 4, pp 1467–1483

An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop

  • Giuseppe Cattaneo
  • Umberto Ferraro Petrillo
  • Raffaele Giancarlo
  • Gianluca Roscigno


Alignment-free methods are one of the mainstays of biological sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, a fundamental and routine task in computational biology and bioinformatics. They have gained popularity because, even on standard desktop machines, they are faster than alignment-based methods. However, with the advent of Next-Generation Sequencing technologies, it is now common to encounter datasets whose size, i.e., the number of sequences and their total length, makes running alignment-free methods on such standard machines a challenge. Here, we propose the first paradigm for computing k-mer-based alignment-free methods on Apache Hadoop; it extends the problem sizes that can be processed with respect to a standard sequential machine while also delivering good time performance. Technically, as opposed to a standard Hadoop implementation, its effectiveness is achieved through the incremental management of a persistent hash table during the map phase, a task that is not contemplated by the basic Hadoop functions and that can also be useful in other contexts.
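As a rough illustration of the two ideas named in the abstract, the following sketch counts k-mers, compares two count profiles, and mimics a mapper that updates one hash table across calls. It is not the authors' implementation and does not use the Hadoop API: the names `kmer_counts`, `squared_euclidean`, and `InMapperCounter` are hypothetical, the squared Euclidean distance is only one example of a k-mer-based dissimilarity, and k-mers spanning two input chunks are ignored for brevity.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def squared_euclidean(a, b):
    """Squared Euclidean distance between two k-mer count profiles."""
    # A Counter returns 0 for a k-mer it has never seen.
    return sum((a[w] - b[w]) ** 2 for w in a.keys() | b.keys())


class InMapperCounter:
    """Hypothetical mapper that keeps one hash table across map() calls."""

    def __init__(self, k):
        self.k = k
        self.table = Counter()

    def map(self, chunk):
        # Update the persistent table instead of emitting one pair per k-mer.
        self.table.update(kmer_counts(chunk, self.k))

    def cleanup(self):
        # Emit the aggregated (k-mer, count) pairs once, when the task ends.
        return sorted(self.table.items())


if __name__ == "__main__":
    x = kmer_counts("ACGTACGT", 2)
    y = kmer_counts("ACGTTCGT", 2)
    print(squared_euclidean(x, y))

    mapper = InMapperCounter(2)
    for chunk in ("ACGT", "ACGA"):
        mapper.map(chunk)
    print(mapper.cleanup())
```

Aggregating inside the mapper in this way trades memory for a smaller volume of intermediate pairs, which is one plausible reason a persistent table helps during the map phase.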


Alignment-free sequence comparison and analysis; Distributed computing; MapReduce; Hadoop



We would like to thank the Department of Statistical Sciences of University of Rome-La Sapienza for computing time on the TeraStat cluster and Nicola Segata for providing the meta-genomic dataset. We also would like to thank the referees for comments that helped in the presentation of our results.

Compliance with ethical standards


MIUR PRIN Project: 2010RTFWBH_003 “Data-Centric Genomic Computing (GenData 2020)” and Unipa Progetto di Ateneo 2012-ATE-0298 “Metodi Formali ed Algoritmici per la Bioinformatica su Scala Genomica”.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Giuseppe Cattaneo (1)
  • Umberto Ferraro Petrillo (2)
  • Raffaele Giancarlo (3)
  • Gianluca Roscigno (1)

  1. Dipartimento di Informatica, Università degli Studi di Salerno, Fisciano, Italy
  2. Dipartimento di Scienze Statistiche, Università di Roma “Sapienza”, Rome, Italy
  3. Dipartimento di Matematica ed Informatica, Università degli Studi di Palermo, Palermo, Italy
