Biosequence Similarity Search on the Mercury System

  • Praveen Krishnamurthy
  • Jeremy Buhler
  • Roger Chamberlain
  • Mark Franklin
  • Kwame Gyang
  • Arpith Jacob
  • Joseph Lancaster

Abstract

Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity search tool for biosequences is BLAST, a program designed to compare query sequences to a database. Here, we present the design of BLASTN, the version of BLAST that searches DNA sequences, on the Mercury system, an architecture that supports high-volume, high-throughput data movement off a data store and into reconfigurable hardware. An important component of application deployment on the Mercury system is the functional decomposition of the application onto both the reconfigurable hardware and the traditional processor. Both the Mercury BLASTN application design and its performance analysis are described.

Keywords

DNA sequencing, comparative annotation, biosequence

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Praveen Krishnamurthy (1)
  • Jeremy Buhler (1)
  • Roger Chamberlain (1)
  • Mark Franklin (1)
  • Kwame Gyang (1)
  • Arpith Jacob (1)
  • Joseph Lancaster (1)

  1. Department of Computer Science and Engineering, Washington University, St. Louis, USA
