Protein Similarity Search with Subset Seeds on a Dedicated Reconfigurable Hardware

  • Pierre Peterlongo
  • Laurent Noé
  • Dominique Lavenier
  • Gilles Georges
  • Julien Jacques
  • Gregory Kucherov
  • Mathieu Giraud
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4967)

Abstract

With a sharp increase of available DNA and protein sequence data, new precise and fast similarity search methods are needed for large-scale genome and proteome comparisons. Modern seed-based techniques of similarity search (spaced seeds, multiple seeds, subset seeds) provide a better sensitivity/specificity ratio. We present an implementation of such a seed-based technique on a parallel specialized hardware embedding reconfigurable architecture (FPGA), where the FPGA is tightly connected to large capacity Flash memories. This parallel system allows large databases to be fully indexed and rapidly accessed. Compared to traditional approaches presented by the Blastp software, we obtain both a significant speed-up and better results. To the best of our knowledge, this is the first attempt to exploit efficient seed-based algorithms for parallelizing the sequence similarity search.

Keywords

sequence similarity search spaced seeds subset seeds indexing FPGA reconfigurable architecture dedicated hardware 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Altschul, S., Gish, W., Miller, W., Myers, W., Lipman, D.: Basic local alignment search tool. Journal of Molecular Biology 215(3), 403–410 (1990)Google Scholar
  2. 2.
    Rognes, T.: ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Research 29(7), 1647–1652 (2001)CrossRefGoogle Scholar
  3. 3.
    Farrar, M.: Striped smith–waterman speeds database searches six times over other simd implementations. Bioinformatics 23(2), 156–161 (2007)CrossRefGoogle Scholar
  4. 4.
    Darling, A., Carey, L., Feng, W.: The design, implementation, and evaluation of mpiBLAST. In: ClusterWorld Conference and Expo (CWCE 2003) (2003)Google Scholar
  5. 5.
    Thorsen, O., Smith, B., Sosa, C.P., Jiang, K., Lin, H., Peters, A., Fen, W.: Parallel genomic sequence-search on a massively parallel system. In: Int. Conference on Computing Frontiers (CF 2007), pp. 59–68 (2007)Google Scholar
  6. 6.
    Lavenier, D., Xinchun, L., Georges, G.: Seed-based genomic sequence comparison using a FPGA/FLASH accelerator. In: Field Programmable Technology (FPT 2006), pp. 41–48 (2006)Google Scholar
  7. 7.
    Smith, T., Waterman, M.: Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981)CrossRefGoogle Scholar
  8. 8.
    Crochemore, M., Landau, G., Ziv-Ukelson, M.: A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. In: Symposium On Discrete Algorithms (SODA 2002), pp. 679–688 (2002)Google Scholar
  9. 9.
    Ma, B., Tromp, J., Li, M.: PatternHunter: Faster and more sensitive homology search. Bioinformatics 18(3), 440–445 (2002)CrossRefGoogle Scholar
  10. 10.
    Noé, L., Kucherov, G.: YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Research 33, W540–W543 (2005)CrossRefGoogle Scholar
  11. 11.
    Csürös, M., Ma, B.: Rapid homology search with two-stage extension and daughter seeds. In: Wang, L. (ed.) COCOON 2005. LNCS, vol. 3595, pp. 104–114. Springer, Heidelberg (2005)CrossRefGoogle Scholar
  12. 12.
    Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. Journal of Computer and System Sciences 70(3), 342–363 (2005)CrossRefMathSciNetGoogle Scholar
  13. 13.
    Brejová, B., Brown, D., Vinar, T.: Vector seeds: An extension to spaced seeds. Journal of Computer and System Sciences 70(3), 364–380 (2005)CrossRefMathSciNetMATHGoogle Scholar
  14. 14.
    Li, M., Ma, M., Zhang, L.: Superiority and complexity of the spaced seeds. In: Symp. on Discrete Algorithms (SODA 2006), pp. 444–453 (2006)Google Scholar
  15. 15.
    Mak, D., Gelfand, Y., Benson, G.: Indel seeds for homology search. Bioinformatics 22(14), e341–e349 (2006)CrossRefGoogle Scholar
  16. 16.
    Kisman, D., Li, M., Ma, B., Li, W.: tPatternhunter: gapped, fast and sensitive translated homology search. Bioinformatics 21(4), 542–544 (2005)CrossRefGoogle Scholar
  17. 17.
    Brown, D.: Optimizing multiple seeds for protein homology search. IEEE Transactions on Computational Biology and Bioinformatics 2(1), 29–38 (2005)CrossRefGoogle Scholar
  18. 18.
    Kung, H.T., Leiserson, C.: Algorithms for VLSI processors arrays. Addison-Wesley, Reading (1980)Google Scholar
  19. 19.
    Lipton, R., Lopresti, D.: In: Fuchs, H. (ed.) A systolic array for rapid string comparison, pp. 363–376. Computer Science Press, Rockville, MD (2004)Google Scholar
  20. 20.
    Chow, E., Hunkapiller, T., Peterson, J., Waterman, M.S.: Biological information signal processor. In: International Conference on Application Specific Array Processors (ASAP 1991), pp. 144–160 (1991)Google Scholar
  21. 21.
    Hoang, D.: Searching genetic databases on splash 2. In: IEEE Workshop on FPGAs for Custom Computing Machines (FCCM 1993), Napa, California, pp. 185–191 (1993)Google Scholar
  22. 22.
    Lavenier, D., Giraud, M.: Bioinformatics Applications. In: Gokhale, M.B., Graham, P.S. (eds.) Reconfigurable Computing, Springer, Heidelberg (2005)Google Scholar
  23. 23.
    Dydel, S., Bala, P.: Large scale protein sequence alignment using FPGA reprogrammable logic devices. In: Becker, J., Platzner, M., Vernalde, S. (eds.) FPL 2004. LNCS, vol. 3203, pp. 23–32. Springer, Heidelberg (2004)Google Scholar
  24. 24.
    Court, T.V., Herbordt, M.C.: Families of fpga-based accelerators for approximate string matching. Microprocessors and Microsystems 31(2), 135–145 (2007)CrossRefGoogle Scholar
  25. 25.
    Singh, R.K., Tell, S.G., White, C.T., Hoffman, D., Chi, V.L., Erickson, B.W.: A scalable systolic multiprocessor system for analysis of biological sequences. In: Borrielo, G., Ebeling, C. (eds.) Symposium on Research on Integrated Systems, pp. 168–182 (1993)Google Scholar
  26. 26.
    Knowles, G., Gardner-Stephen, P.: A new hardware architecture for genomic and proteomic sequence alignment. In: IEEE Computational Systems Bioinformatics Conference (CSBC 2004) (2004)Google Scholar
  27. 27.
    Kucherov, G., Noé, L., Roytberg, M.: A unifying framework for seed sensitivity and its application to subset seeds. J. Bioinf. Comp. Biology 4(2), 553–569 (2006)CrossRefGoogle Scholar
  28. 28.
    Henikoff, S., Henikoff, J.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915–10919 (1992)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Pierre Peterlongo
    • 1
  • Laurent Noé
    • 2
  • Dominique Lavenier
    • 1
  • Gilles Georges
    • 1
  • Julien Jacques
    • 1
  • Gregory Kucherov
    • 2
  • Mathieu Giraud
    • 2
  1. 1.Symbiose, IRISA, INRIA, CNRS, Université Rennes 1 
  2. 2.Sequoia/Bioinfo, LIFL, INRIA, CNRS, Université Lille 1 

Personalised recommendations