Exhaustive Peptide Searching Using Relations

Hunt, Ela

doi:10.1007/978-3-540-73390-4_3

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4587))

Included in the following conference series:

British National Conference on Databases

Abstract

We present a new, robust solution to short peptide searching, tested on a relational platform with a set of biological queries. Our algorithm is appropriate for large-scale scientific data analysis and has been tested with 1.4 GB of amino acids. Protein sequences are indexed as short overlapping string windows and stored in a relation. To find approximate matches, we use a neighbourhood generation algorithm. The words in the neighbourhood are then fetched and stored in a relation. We measure execution time and compare the matches found to those delivered by BLAST. We report some performance gains in exact matching and in searching within edit distance 1, and very significant quality improvements over heuristics, as we guarantee to deliver all relevant matches.
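The indexing and neighbourhood steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the use of SQLite, the table names `windows` and `neighbours`, the function names, and the choice to index windows of several lengths (so that insertions and deletions can be matched by equality) are all assumptions made for illustration.

```python
import sqlite3

# The 20 standard amino acid letters (assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_index(conn, proteins, lengths):
    """Store every overlapping window of the given lengths in a relation.

    proteins: dict mapping a sequence id to its amino acid string.
    lengths:  window lengths to index (hypothetical parameter).
    """
    conn.execute("CREATE TABLE windows (seq_id TEXT, pos INTEGER, word TEXT)")
    rows = [
        (seq_id, i, seq[i:i + w])
        for seq_id, seq in proteins.items()
        for w in lengths
        for i in range(len(seq) - w + 1)
    ]
    conn.executemany("INSERT INTO windows VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX windows_word ON windows (word)")


def neighbourhood(word, alphabet=AMINO_ACIDS):
    """Return all strings within edit distance 1 of word."""
    result = {word}
    for i in range(len(word)):
        result.add(word[:i] + word[i + 1:])          # deletion
        for c in alphabet:
            result.add(word[:i] + c + word[i + 1:])  # substitution
    for i in range(len(word) + 1):
        for c in alphabet:
            result.add(word[:i] + c + word[i:])      # insertion
    return result


def search(conn, query):
    """Store the query's neighbourhood in a relation and join it to the index."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS neighbours (word TEXT PRIMARY KEY)"
    )
    conn.execute("DELETE FROM neighbours")
    conn.executemany(
        "INSERT INTO neighbours VALUES (?)",
        [(w,) for w in neighbourhood(query)],
    )
    cur = conn.execute(
        "SELECT w.seq_id, w.pos, w.word "
        "FROM windows w JOIN neighbours n ON n.word = w.word "
        "ORDER BY w.seq_id, w.pos, w.word"
    )
    return cur.fetchall()
```

Because every word in the neighbourhood is looked up by equality, any indexed window within edit distance 1 of the query is returned, which is the sense in which such a search is exhaustive rather than heuristic. For a query of length m, this sketch assumes the index was built with window lengths m-1, m and m+1.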

Author information

Authors and Affiliations

Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
Ela Hunt

Editor information

Richard Cooper, Jessie Kennedy

Cite this paper

Hunt, E. (2007). Exhaustive Peptide Searching Using Relations. In: Cooper, R., Kennedy, J. (eds) Data Management. Data, Data Everywhere. BNCOD 2007. Lecture Notes in Computer Science, vol 4587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73390-4_3

DOI: https://doi.org/10.1007/978-3-540-73390-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73389-8
Online ISBN: 978-3-540-73390-4
eBook Packages: Computer Science, Computer Science (R0)
