Hadooping the genome: The impact of big data tools on biology

Abstract

This essay examines the consequences of the so-called ‘big data’ technologies in biomedicine. Analyzing algorithms and data structures used by biologists can provide insight into how biologists perceive and understand their objects of study. As such, I examine some of the most widely used algorithms in genomics: those used for sequence comparison or sequence mapping. These algorithms are derived from the powerful tools for text searching and indexing that have been developed since the 1950s and now play an important role in online search. In biology, sequence comparison algorithms have been used to assemble genomes, process next-generation sequence data, and, most recently, for ‘precision medicine.’ I argue that the predominance of a specific set of text-matching and pattern-finding tools has influenced problem choice in genomics. It has allowed genomics to continue to think of genomes as textual objects and has increasingly locked the field into ‘big data’-driven text-searching methods. Many ‘big data’ methods are designed for finding patterns in human-written texts. However, genomes and other ‘omic’ data are not human-written and are unlikely to be meaningful in the same way.

Notes

  1. In the 1990s, most of the sequencing machines used to sequence the human genome were reliable up to about 500 base pairs. Later versions of Sanger sequencers were reliable closer to 1000 base pairs. Very early NGS machines had read lengths of 20–30 base pairs; common read lengths on current (2015) models are between 100 and 250 base pairs.

  2. The move from Sanger sequencing to NGS can also be characterized as a move from constructing ‘reference genomes’ to ‘reference populations’ characterized by their specific patterns of variation. On the production and use of ‘reference populations,’ see M'Charek (2005, pp. 44–46).

  3. For a review of these algorithms, see Li and Homer (2010).

  4. The size of the index would depend on the content of the book and the minimum and maximum size of the words in the index. Indexing a genome is harder than indexing a book, since there are no discrete words. For example, the words “wire door” would need to be indexed not only as “wire,” “door,” and “wire door,” but also as “wi,” “ir,” “re,” “ed,” “do,” “oo,” “or,” “wir,” “ire,” “red,” “edo,” “doo,” “oor,” “ired,” “redo,” and “edoo.”

  5. GWAS necessarily had to limit itself to looking for so-called ‘common variants.’ NGS could search for ‘rare variants.’

  6. ENCODE has been criticized for its ‘big science’ approach to biology (Eisen, 2012).
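
The indexing example in note 4 can be illustrated with a minimal sketch. It assumes that spaces are dropped so that the phrase is treated as one unbroken string (as a genome would be), and that index entries range from two to four characters. The function name `build_substring_index` and its parameters are hypothetical, chosen only for this illustration; they do not come from any genomics tool.

```python
from collections import defaultdict


def build_substring_index(text, min_len=2, max_len=4):
    """Map every substring of length min_len..max_len to its start positions.

    Illustrative assumption: spaces are removed first, so the text has no
    word boundaries, much as a genome has no discrete words.
    """
    sequence = text.replace(" ", "")
    index = defaultdict(list)
    for length in range(min_len, max_len + 1):
        # Slide a window of the current length across the whole sequence.
        for start in range(len(sequence) - length + 1):
            index[sequence[start:start + length]].append(start)
    return dict(index)


index = build_substring_index("wire door")

# List the entries, shortest first, then alphabetically.
print(sorted(index, key=lambda s: (len(s), s)))
print(len(index))  # 18 entries for an eight-letter string
```

With these settings the index holds the two- to four-letter entries listed in the note, including “wire” and “door”; the full phrase “wire door” is left out only because the maximum entry length here is four. Even this eight-letter string yields eighteen entries, which suggests why the minimum and maximum entry sizes matter so much for the size of a genome index.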

References

  • Allen, H.L. et al (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, no 7321: 832–838.

  • Altschul, S.F. et al (1990) Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.

  • Anson, E. and Myers, E. (1999) Algorithms for whole genome shotgun sequencing. In: Proceedings of RECOMB’99, Lyon, pp. 1–9.

  • Belzer, J. et al (eds.) (1978) Encyclopedia of Computer Science and Technology. Vol. 10. Linear and Matrix Algebra to Microorganisms. New York: Marcel Dekker.

  • Bisciglia, C. (2009) Analyzing human genomes with Apache Hadoop. Weblog, 15 October, Cloudera. http://blog.cloudera.com/blog/2009/10/analyzing-human-genomes-with-hadoop/, accessed 27 May 2015.

  • Bowker, G. (2006) Memory Practices in the Sciences. Cambridge: MIT Press.

  • Bowker, G. and Star, S.L. (1999) Sorting Things Out: Classification and its Consequences. Cambridge: MIT Press.

  • Boyd, D. and Crawford, K. (2012) Critical questions for big data. Information, Communication & Society 15(5): 662–679.

  • Brin, S. and Page, L. (2000) The anatomy of a large-scale hypertextual web search engine. Computer Science Department, Stanford University. http://infolab.stanford.edu/pub/papers/google.pdf, accessed 27 May 2015.

  • Brust, A. (2012) Cloudera and Mount Sinai: The structure of a big data revolution? ZDNet, 6 July. http://www.zdnet.com/article/cloudera-and-mount-sinai-the-structure-of-a-big-data-revolution/, accessed 27 May 2015.

  • Burrows, M. and Wheeler, D.J. (1994) A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation. http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.html, accessed 27 May 2015.

  • Carr, D.F. (2006) How Google Works: The Google File System. Baseline, 6 July. http://www.baselinemag.com/c/a/Infrastructure/How-Google-Works-1/4, accessed 27 May 2015.

  • Celera (2000) Celera Genomics to Acquire Paracel Inc. Press release, 20 March. https://www.celera.com/celera/pr_1056568938, accessed 18 September 2015.

  • Dalton, C. and Thatcher, J. (2014) What does a critical data studies look like, and why do we care? Seven points for a critical approach to big data. Society and Space. http://societyandspace.com/material/commentaries/craig-dalton-and-jim-thatcher-what-does-a-critical-data-studies-look-like-and-why-do-we-care-seven-points-for-a-critical-approach-to-big-data/#comments, accessed 23 September 2015.

  • Daly, A.K. (2010) Genome-wide association studies in pharmacogenomics. Nature Reviews Genetics 11: 241–246.

  • Dean, J. and Ghemawat, S. (2004) MapReduce: Simplified data processing on large clusters. Google Research Publications (appeared in OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, California, December 2004). http://static.googleusercontent.com/media/research.google.com/es/us/archive/mapreduce-osdi04.pdf, accessed 27 May 2015.

  • Delcher, A.L. et al (1999) Alignment of whole genomes. Nucleic Acids Research 27(11): 2369–76.

  • Dourish, P. (2014) No SQL: The shifting materialities of database technology. Computational Culture: A Journal of Software. http://computationalculture.net/article/no-sql-the-shifting-materialities-of-database-technology, accessed 18 September 2015.

  • Eisen, M. (2012) Blinded by big science. Weblog entry, 10 September. www.michaeleisen.org/blog/?p=1179, accessed 23 September 2015.

  • ENCODE at UCSC (2012) ENCODE experiment matrix, http://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html, accessed 27 May 2015.

  • Ferragina, P. and Manzini, G. (2000) Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 390–398. IEEE.

  • Garland, A. (2015) Ex Machina (film). Writer and director: Alex Garland.

  • Gitelman, L., ed. (2013) Raw Data is an Oxymoron. Cambridge: MIT Press.

  • Gonella, G. and Kurtz, S. (2012) Readjoiner: A fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics 13(1): 1–19.

  • Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge.

  • Harris, D. (2012) Better medicine, brought to you by big data. GigaOm, 15 July. https://gigaom.com/2012/07/15/better-medicine-brought-to-you-by-big-data/, accessed 27 May 2015.

  • Hazelhurst, S. and Lipák, Z. (2011) KABOOM! A new suffix array based algorithm for clustering expression data. Bioinformatics 27(24): 3348–55.

  • Hebbring, S.J. (2014) The challenges, advantages and future of phenome-wide association studies. Immunology 141(2): 157–65.

  • Hernandez, D. (2013) Data crunchers ditch Hadoop for homegrown software. Wired, 20 February. http://www.wired.com/2013/02/genetic-data-glut/, accessed 27 May 2015.

  • Ilie, L. et al (2011) HiTEC: Accurate error correction in high-throughput sequencing data. Bioinformatics 27(3): 295–302.

  • Illumina (2013) An introduction to next-generation sequencing technology. http://res.illumina.com/documents/products/illumina_sequencing_introduction.pdf, accessed 27 May 2015.

  • Kay, L.E. (2000) Who Wrote the Book of Life? A History of the Genetic Code. Stanford University Press.

  • Kielbasa, S.M. et al (2011) Adaptive seeds tame genomic sequence comparison. Genome Research 21: 487–93.

  • Kirschenbaum, M. (2007) Mechanisms: New Media and the Forensic Imagination. Cambridge, MA: MIT Press.

  • Kitchin, R. (2014) The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. SAGE Publications.

  • Knuth, D.E. (1973) The Art of Computer Programming, Volume 3, “Sorting and Searching.” Addison-Wesley, Redwood City.

  • Koboldt, D.C. et al (2013) The next-generation sequencing revolution and its impact on genomics. Cell 155(1): 27–38.

  • Kurtz, S. et al (2008) A new method to compute k-mer frequencies and its application to annotate large plant genomes. BMC Genomics 9(1): 1–18.

  • Langmead, B. et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10: R25.

  • Levy, S. (2011) In the Plex: How Google Thinks, Works, and Shapes Our Lives. Simon & Schuster, New York.

  • Li, H. and Homer, N. (2010) A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11(5): 473–483.

  • Lohr, S. (2015) On the case at Mount Sinai, It’s Dr. Data. New York Times, 7 March, BU1.

  • Luhn, H.P. (1958) A business intelligence system. IBM Journal of Research and Development 2(4): 314.

  • Mackenzie, A. (2012) More parts than elements: How databases multiply. Environment and Planning D: Society and Space 30: 335–350.

  • Mackenzie, A. (2015b) Machine learning and genomic dimensionality. In: S. Richardson and H. Stevens (eds.) Postgenomics: Perspectives on Biology After the Genome. Durham and London: Duke University Press, pp. 73–102.

  • Mackenzie, A. et al (2015) Post-archival genomics and the bulk logistics of DNA sequences. BioSocieties 11(1): 82–105.

  • Manber, U. and Myers, E. (1990) Suffix arrays: a new method of on-line string searches. In: Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 319–327.

  • Manolio, T.A. et al (2009) Finding the missing heritability of complex diseases. Nature 461, no. 7265: 747–753.

  • Manovich, L. (1999) Database as a symbolic form. Millennium Film Journal 34 (Fall).

  • Manovich, L. (2014) Software Takes Command. Bloomsbury Academic, London.

  • M'Charek, A. (2005) The Human Genome Diversity Project: An Ethnography of Scientific Practice. Cambridge, UK: Cambridge University Press.

  • Metz, C. (2011) How Yahoo spawned Hadoop, the future of big data. Wired, 18 October. http://www.wired.com/2011/10/how-yahoo-spawned-hadoop/, accessed 27 May 2015.

  • Myers, E. et al (2000) Whole-genome assembly of Drosophila. Science 287: 2196–2204.

  • NextBio (2012) NextBio and Intel collaborate to optimize the Hadoop stack and advance big data technologies in genomics, Press release, 11 July. http://www.nextbio.com/b/corp/pressReleases.nb#pr40, accessed 27 May 2015.

  • Pasquale, F. (2015) The Black Box Society: The Secret Algorithms That Control Money and Information. Cambridge and London: Harvard University Press.

  • Patel, C.J. et al (2010) An Environment-Wide Association Study (EWAS) on Type 2 Diabetes Mellitus. PLoS One DOI:10.1371/journal.pone.0010746.

  • Pollack, A. (2000) Technology; Supercomputers Track Human Genome. New York Times, 28 August.

  • Rose, N. (2007) The Politics of Life Itself: Biomedicine, Power, and Subjectivity in the Twenty-First Century. Princeton: Princeton University Press.

  • Ruppert, E. et al (2015) Socializing big data: From concept to practice. CRESC Working Paper No. 138, The University of Manchester and Open University.

  • Schatz, M. (2009) Cloudburst: Highly sensitive read mapping with MapReduce. Bioinformatics 25(11): 1363–1369.

  • Schneier, B. (2015) Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World. New York: Norton.

  • Science (2001) Epigenetics. Science, special issue, 293, no. 5532: 1001–1208.

  • Shendure, J. and Ji, H. (2008) Next-generation DNA sequencing. Nature Biotechnology 26: 1135–45.

  • Silverman, J. (2015) Terms of Service: Social Media and the Price of Constant Connection. New York: Harper.

  • Smith, B.C. (1998) On the Origin of Objects. MIT Press, Cambridge.

  • Stein, R. A. (2008) Next-generation sequencing update. Genetic Engineering & Biotechnology News 28(15), 1 September. http://www.genengnews.com/gen-articles/next-generation-sequencing-update/2584/, accessed 27 May 2015.

  • Stevens, H. (2011a) Coding Sequences: A History of Sequence Comparison Algorithms as a Scientific Instrument. Perspectives on Science 19(3): 263–299.

  • Stevens, H. (2011b) On the means of bioproduction: Bioinformatics and how to make knowledge in a high-throughput genomics laboratory. BioSocieties 6(2): 217–242.

  • Stevens, H. (2013) Life Out of Sequence: A Data-Driven History of Bioinformatics. Chicago: University of Chicago Press.

  • Sutton et al (1995) TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Science & Technology 1(1): 9–19.

  • Taylor, R.C. (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics 11(Suppl 12): S1.

  • Thacker, E. (2005) The Global Genome: Biotechnology, Politics, and Culture. Cambridge: MIT Press.

  • Thomas, U.G. (2012) Google works with ISB to evaluate life sciences as application area for new cloud infrastructure. Genomeweb, 20 July. https://www.genomeweb.com/informatics/google-works-isb-evaluate-life-sciences-application-area-new-cloud-infrastructur, accessed 27 May 2015.

  • Vaidhyanathan, S. (2011) The Googlization of Everything (And Why We Should Worry). Berkeley: University of California Press.

  • Venter, J.C. et al (2001) The Sequence of the Human Genome. Science 291, no. 5507: 1304–1351.

  • Visscher, P.M. et al (2012a) Evidence-based psychiatric genetics, AKA the false dichotomy between the common and rare variant hypotheses. Molecular Psychiatry 17, no. 5: 474–485.

  • Visscher, P.M. et al (2012b) Five years of GWAS discovery. American Journal of Human Genetics 90, no. 1: 7–24.

  • Wojcicki, A. et al (2012) Deleterious Me: Whole Genome Sequencing, 23andMe, and the Crowd-Sourced Health Care Revolution. Science and Democracy Lecture Series, Harvard Kennedy School, 18 April. Available at https://vimeo.com/40657814.

  • Zhang, J. et al (2011) The impact of next-generation sequencing on genomics. Journal of Genetics and Genomics 38(3): 95–109.

Acknowledgements

The research for this manuscript included no human subject research and was not subject to IRB approval. The author has no competing intellectual or financial interests in the research described in the manuscript. Research work was supported in part by a Tier 1 grant from the Singapore Ministry of Education. Thanks to Linda Hogle and Rayna Rapp who offered much useful advice and support in the writing of this essay. 

Author information

Corresponding author

Correspondence to Hallam Stevens.

About this article

Cite this article

Stevens, H. Hadooping the genome: The impact of big data tools on biology. BioSocieties 11, 352–371 (2016). https://doi.org/10.1057/s41292-016-0003-6

  • DOI: https://doi.org/10.1057/s41292-016-0003-6

Keywords

  • big data
  • DNA sequence
  • genomics
  • Google
  • Hadoop