Skip to main content

Exhaustive Peptide Searching Using Relations

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNISA,volume 4587)

Abstract

We present a new robust solution to short peptide searching, tested on a relational platform, with a set of biological queries. Our algorithm is appropriate for large scale scientific data analysis, and has been tested with 1.4 GB of amino-acids. Protein sequences are indexed as short overlapping string windows, and stored in a relation. To find approximate matches, we use a neighbourhood generation algorithm. The words in the neighbourhood are then fetched and stored in a relation. We measure execution time and compare the matches found to those delivered by BLAST. We report some performance gains in exact matching and searching within edit distance 1, and very significant quality improvements over heuristics, as we guarantee to deliver all relevant matches.

Keywords

  • Exact Match
  • Edit Distance
  • Nucleic Acid Research
  • Approximate Match
  • Blast Match

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (Canada)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (Canada)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (Canada)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)

    Google Scholar 

  2. Altschul, S.F., Madden, T.L., Schaeffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997)

    CrossRef  Google Scholar 

  3. Baeza-Yates, R., Navarro, G.: A Hybrid Indexing Method for Approximate String Matching. JDA 1, 205–239 (2001)

    MathSciNet  Google Scholar 

  4. Burkhardt, S., et al.: q-gram Based Database Searching Using a Suffix Array. In: RECOMB, pp. 77–83. ACM Press, New York (1999)

    CrossRef  Google Scholar 

  5. Eckman, B.A., Kaufmann, A.: Querying BLAST within a Data Federation. IEEE Data Eng. Bull. 27(3), 12–19 (2004)

    Google Scholar 

  6. Eidhammer, I., Jonassen, I., Taylor, W.R.: Protein Bioinformatics. Wiley, Chichester (2003)

    Google Scholar 

  7. Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., McKenney, K., Sutton, G., FitzHugh, W., Fields, C., Gocayne, J.D., Scott, J., Shirley, R., Liu, L.-I., Glodek, A., Kelley, J.M., Weidman, J.F., Phillips, C.A., Spriggs, T., Hedblom, E., Cotton, M.D., Utterback, T.R., Hanna, M.C., Nguyen, D.T., Saudek, D.M., Brandon, R.C., Fine, L.D., Fritchman, J.L., Fuhrmann, J.L., Geoghagen, N.S.M., Gnehm, C.L., McDonald, L.A., Small, K.V., Fraser, C.M., Smith, H.O., Venter, J.C.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269(5223), 496–512 (1995)

    CrossRef  Google Scholar 

  8. Gautier, L., et al.: Alternative mapping of probes to genes for Affymetrix chips. BMC Bioninformatics, p. 111 (2004)

    Google Scholar 

  9. Guccione, S.A., Keller, E.: Gene Matching using JBits. In: Glesner, M., Zipf, P., Renovell, M. (eds.) FPL 2002. LNCS, vol. 2438, pp. 1168–1171. Springer, Heidelberg (2002)

    Google Scholar 

  10. Gusfield, D.: Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, Cambridge (1997)

    MATH  Google Scholar 

  11. Hyyrö, H., Navarro, G.: A Practical Index for genome Searching. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 341–349. Springer, Heidelberg (2003)

    Google Scholar 

  12. Hunt, E.: Indexed Searching on Proteins Using a Suffix Sequoia. IEEE Data Eng. Bulletin 27(3), 24–31 (2004)

    Google Scholar 

  13. Hunt, E., Atkinson, M.P., Irving, R.W.: A database index to large biological sequences. In: VLDB, pp. 139–148. Morgan Kaufmann, San Francisco (2001)

    Google Scholar 

  14. Hunt, E., Atkinson, M.P., Irving, R.W.: Database Indexing for Large DNA and Protein Sequence Collections. The. VLDB Journal 11, 256–271 (2002)

    CrossRef  MATH  Google Scholar 

  15. Kahveci, T., Singh, A.K.: An Efficient Index Structure for String Databases. In: VLDB, pp. 351–360. Morgan and Kaufmann, Washington (2001)

    Google Scholar 

  16. Kahveci, T., Singh, A.K.: Progressive searching of biological sequences. IEEE Data Eng. Bull. 27(3), 32–39 (2004)

    Google Scholar 

  17. Karakoç, E., Özsoyoglu, Z.M., Sahinalp, S.C., Tasan, M., Zhang, X.: Novel approaches to biomolecular sequence indexing. IEEE Data Eng. Bull. 27(3), 40–47 (2004)

    Google Scholar 

  18. Kent, W.J.: BLAT: The BLAST-like Alignment Tool. Genome Res. 12(4), 656–664 (2002)

    MathSciNet  Google Scholar 

  19. Kim, Y.J., Boyd, A., Athey, B.D., Patel, J.M.: miBLAST: Scalable Evaluation of a Batch of Nucleotide Sequence Queries with BLAST. Nucleic Acids Research 33, 4335–4344 (2005)

    CrossRef  Google Scholar 

  20. Levenstein, V.I.: Binary codes capable of correcting insertions and reversals. Sov. Phys. Dokl. 10, 707–710 (1966)

    MathSciNet  Google Scholar 

  21. Meek, C., Patel, J.M., Kasetty, S.: OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences. In: VLDB 2003, pp. 910–921 (2003)

    Google Scholar 

  22. Mewes, H.W., Hani, J., Pfeiffer, F., Frishman, D.: MIPS: a database for protein sequences and complete genomes. Nucleic Acids Research 26, 33–37 (1998)

    CrossRef  Google Scholar 

  23. Miller, C., Gurd, J., Brass, A.: A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases. Bioinformatics 15, 111–121 (1999)

    CrossRef  Google Scholar 

  24. Miranker, D.P., Briggs, W.J., Mao, R., Ni, S., Xu, W.: Biosequence Use Cases in MoBIoS SQL. IEEE Data Eng. Bull. 27(3), 3–11 (2004)

    Google Scholar 

  25. Myers, E.W.: A sublinear algorithm for approximate key word searching. Algorithmica 12(4/5), 345–374 (1994)

    CrossRef  MATH  MathSciNet  Google Scholar 

  26. Navarro. G.: NR-grep: A Fast and Flexible Pattern Matching Tool. Technical report (2000). TR/DCC-2000-3. University of Chile, Departmento de Ciencias de la Computacion, www.dcc.uchile.cl/~gnavarro

  27. Needleman, S.B., Wunsch, C.D.: A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of two Proteins. J. Mol. Biol. 48, 443–453 (1970)

    CrossRef  Google Scholar 

  28. Ning, Z., Cox, A.J., Mullikin, J.C.: SSAHA: A Fast Search Method for Large DNA Databases. Genome Res. 11(10), 1725–1729 (2001)

    CrossRef  Google Scholar 

  29. Sethupathy, P., et al.: A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods 3(11), 881–886 (2006)

    CrossRef  Google Scholar 

  30. Sidhu, S.S.: Phage Display In Biotechnology and Drug Discovery. Taylor and Francis, Abington (2005)

    Google Scholar 

  31. Smith, T.A., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)

    CrossRef  Google Scholar 

  32. Stephens, S., Chen, J.Y., Thomas, S.: ODM BLAST: Sequence Homology Search in the RDBMS. IEEE Data Eng. Bull. 27(3), 20–23 (2004)

    Google Scholar 

  33. Ukkonen, E.: Approximate string matching over suffix trees. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) Combinatorial Pattern Matching. LNCS, vol. 684, pp. 228–242. Springer, Heidelberg (1993)

    CrossRef  Google Scholar 

  34. Wall, L., Schwartz, R.L., Christiansen, T., Potter, S.: Programming Perl. Nutshell Handbook, 2nd edn. O’Reilly & Associates (1996)

    Google Scholar 

  35. Work, L.M., Bining, H., Hunt, E., et al.: Vascular Bed-Targeted in vivo Gene Delivery Using Tropism-Modified Adeno-associated Viruses. Molecular Therapy 13(4), 683–693 (2006)

    CrossRef  Google Scholar 

  36. Yamaguchi, Y., Miyajima, Y., Maruyama, T., Konagaya, A.: High Speed Homology Search Using Run-Time Reconfiguration. In: Glesner, M., Zipf, P., Renovell, M. (eds.) FPL 2002. LNCS, vol. 2438, pp. 281–291. Springer, Heidelberg (2002)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Richard Cooper Jessie Kennedy

Rights and permissions

Reprints and Permissions

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hunt, E. (2007). Exhaustive Peptide Searching Using Relations. In: Cooper, R., Kennedy, J. (eds) Data Management. Data, Data Everywhere. BNCOD 2007. Lecture Notes in Computer Science, vol 4587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73390-4_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-73390-4_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73389-8

  • Online ISBN: 978-3-540-73390-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics