Query Languages and Evaluation Techniques for Biological Sequence Data

  • Living reference work entry
Encyclopedia of Database Systems

Synonyms

Querying DNA sequences; Querying protein sequences

Definition

Biological sequence data is one of the most common types of data in life science applications. DNA and protein sequence datasets are growing very quickly; for example, the data at GenBank [GB07] has been growing exponentially, doubling roughly every 18 months. These datasets are often queried in complex ways, and the methods required go far beyond the simple string matching used in more traditional string applications. To let users pose sophisticated queries on biological sequences easily, different languages have been designed to support a rich library of functions. In addition, some database systems have been extended to support a rich set of operators on the sequence data type. Compared to the stand-alone approach, the database method brings the power of algebraic query optimization and the use of indexes making it...
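As a rough illustration of why sequence queries go beyond simple string matching, the following Python sketch contrasts an exact substring search with a search that tolerates a bounded number of substitutions. It is not taken from the entry: the function names, the example sequence, and the mismatch threshold are all hypothetical, and the brute-force scan stands in for the indexed, alignment-based methods (such as BLAST or Smith-Waterman) that real systems would more plausibly use.

```python
def exact_matches(sequence: str, pattern: str) -> list[int]:
    """Return every start position where pattern occurs exactly."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping hits are also found.
        start = sequence.find(pattern, start + 1)
    return positions


def approximate_matches(
    sequence: str, pattern: str, max_mismatches: int
) -> list[tuple[int, int]]:
    """Return (position, mismatch_count) for each window of the sequence
    that differs from pattern in at most max_mismatches positions.

    Only substitutions are counted (Hamming distance); insertions and
    deletions are deliberately ignored to keep the sketch short.
    """
    hits = []
    width = len(pattern)
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


# Hypothetical DNA fragment and motif, chosen only for illustration.
dna = "ACGTTGACCGTAACGATGAC"
motif = "ACGT"

print(exact_matches(dna, motif))
print(approximate_matches(dna, motif, max_mismatches=1))
```

On this toy input the exact search reports only position 0, while allowing one substitution also reports positions 7 and 12. The scan costs time proportional to the sequence length times the pattern length, which hints at why indexes and query optimization matter at database scale.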

Recommended Reading

  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.

  2. Eckman BA, Kementsietsidis A. Querying BLAST within a data federation. Q Bull IEEE TC Data Eng. 2004;27(3):12–9.

  3. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. Atlas Protein Seq Struct. 1978;5:345–52.

  4. Hammer J, Schneider M. Genomics algebra: a new, integrating data model, language, and tool for processing and querying genomic information. In: Proceedings 1st biennial conference on innovative data systems research. 2003. p. 176–87.

  5. Henikoff S, Henikoff J. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci. 1992;89(22):10915–9.

  6. Hsiao R-L, Parker DS Jr, Yang H-C. Support for BioIndexing in BLASTgres. In: Data Integration in the Life Sciences (DILS), LNCS, vol. 3615. Berlin: Springer; 2005. p. 284–7.

  7. Mao R, Xu W, Singh N, Miranker DP. An assessment of a metric space database index to support sequence homology. In: Proceedings IEEE 3rd international symposium on bioinformatics and bioengineering. 2003. p. 375–82.

  8. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci. 1988;85(8):2444–8.

  9. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–7.

  10. Stephens S, Chen JY, Davidson MG, Thomas S, Trute BM. Oracle Database 10g: a platform for BLAST search and regular expression pattern matching in life sciences. Nucleic Acids Res. 2005;33(Database-Issue):675–9.

  11. Stephens S, Chen JY, Thomas S. ODM BLAST: sequence homology search in the RDBMS. Q Bull IEEE TC Data Eng. 2004;27(3):20–3.

  12. Tata S, Lang W, Patel JM. Periscope/SQ: interactive exploration of biological sequence databases. In: Proceedings 33rd international conference on very large data bases. 2007. p. 1406–9.

  13. Tata S, Patel JM. PiQA: an algebra for querying protein data sets. In: Proceedings 15th international conference on scientific and statistical database management. 2003. p. 141–50.

  14. Tata S, Patel JM, Friedman JS, Swaroop A. Declarative querying for biological sequences. In: Proceedings 22nd international conference on data engineering. 2006. p. 87.

  15. Weiner P. Linear pattern matching algorithms. In: Proceedings of the 14th annual IEEE symposium on switching and automata theory. 1973. p. 1–11.

Author information

Corresponding author

Correspondence to Sandeep Tata.

Copyright information

© 2016 Springer Science+Business Media LLC

About this entry

Cite this entry

Tata, S., Patel, J.M. (2016). Query Languages and Evaluation Techniques for Biological Sequence Data. In: Liu, L., Özsu, M. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4899-7993-3_630-2

  • DOI: https://doi.org/10.1007/978-1-4899-7993-3_630-2

  • Publisher Name: Springer, New York, NY

  • Online ISBN: 978-1-4899-7993-3

  • eBook Packages: Springer Reference Computer Sciences; Reference Module Computer Science and Engineering
