Querying and Mining Strings Made Easy

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10604)

Abstract

With the advent of large string datasets in several scientific and business applications, there is a growing need to perform ad-hoc analysis on strings. Currently, strings are stored, managed, and queried using procedural code. This limits users to the operations supported by existing procedural applications and requires manual query planning with limited tuning opportunities. This paper presents StarQL, a generic and declarative query language for strings. StarQL is based on a native string data model, which allows it to support a large variety of string operations and to provide semantic-based query optimization. String analytic queries are too intricate to be solved on one machine. Therefore, we propose a scalable and efficient data structure that allows StarQL implementations to handle large sets of strings and utilize large computing infrastructures. Our evaluation shows that StarQL is able to express the workloads of application-specific tools, such as BLAST and KAT in bioinformatics, and to mine Wikipedia text for interesting patterns using declarative queries. Furthermore, the StarQL query optimizer achieves an order-of-magnitude reduction in query execution time.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Saudi Aramco, Dhahran, Saudi Arabia
  2. Qatar Computing Research Institute, HBKU, Doha, Qatar
  3. KAUST, Thuwal, Saudi Arabia
