Querying and Mining Strings Made Easy

  • Conference paper
Advanced Data Mining and Applications (ADMA 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10604)

Abstract

With the advent of large string datasets in scientific and business applications, there is a growing need to perform ad-hoc analysis on strings. Currently, strings are stored, managed, and queried using procedural code, which limits users to the operations supported by existing procedural applications and requires manual query planning with limited tuning opportunities. This paper presents StarQL, a generic and declarative query language for strings. StarQL is based on a native string data model that allows it to support a large variety of string operations and to provide semantics-based query optimization. String analytic queries are too intricate to be solved on a single machine. Therefore, we propose a scalable and efficient data structure that allows StarQL implementations to handle large sets of strings and utilize large computing infrastructures. Our evaluation shows that StarQL can express the workloads of application-specific tools, such as BLAST and KAT in bioinformatics, and can mine Wikipedia text for interesting patterns using declarative queries. Furthermore, the StarQL query optimizer achieves an order-of-magnitude reduction in query execution time.

Author information

Corresponding author

Correspondence to Majed Sahli.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Sahli, M., Mansour, E., Kalnis, P. (2017). Querying and Mining Strings Made Easy. In: Cong, G., Peng, WC., Zhang, W., Li, C., Sun, A. (eds) Advanced Data Mining and Applications. ADMA 2017. Lecture Notes in Computer Science, vol 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_1

  • DOI: https://doi.org/10.1007/978-3-319-69179-4_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69178-7

  • Online ISBN: 978-3-319-69179-4

  • eBook Packages: Computer Science, Computer Science (R0)
