Answering GPSJ Queries in a Polystore: A Dataspace-Based Approach

  • Conference paper
Conceptual Modeling (ER 2019)

Abstract

The discipline of data science is steering analysts away from traditional data warehousing and towards a more flexible and lightweight approach to data analysis. The idea is to perform OLAP analyses in a pay-as-you-go manner across heterogeneous schemas and data models, where the integration is progressively carried out by the user as the available data is explored. In this paper, we propose an approach to support data analysis within a polystore supporting relational, document and column data models by automatically handling both data model and schema heterogeneity through a dataspace layer on top of the underlying databases. The expressiveness we enable corresponds to GPSJ queries, which are the most common class of queries in OLAP applications. We rely on Nested Relational Algebra to define a cross-database execution plan. The plan is composed of several local plans, to be executed on the distinct databases, and a global plan, which combines and possibly aggregates inter-database data. The system has been prototyped on Apache Spark.
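The split between local plans and a global plan can be illustrated with a minimal sketch. It is not the authors' implementation (the prototype runs on Apache Spark and uses Nested Relational Algebra); it is plain Python under stated assumptions: the two toy sources, the `MAPPINGS` table standing in for the dataspace layer, and the function names `local_plan`, `global_plan` and `answer_query` are all hypothetical. It also assumes a SUM aggregation, which is distributive and can therefore be pre-aggregated per database and summed again globally.

```python
from collections import defaultdict

# Hypothetical toy data: the same kind of fact stored under two different schemas.
relational_rows = [
    {"country": "IT", "amount": 10.0},
    {"country": "FR", "amount": 5.0},
    {"country": "IT", "amount": 2.5},
]
document_rows = [
    {"customer": {"nation": "IT"}, "total": 4.0},
    {"customer": {"nation": "FR"}, "total": 1.0},
]

# Hypothetical stand-in for the dataspace layer: for each source, how to read
# a unified feature name ("country", "amount") from a source-specific record.
MAPPINGS = {
    "relational": {
        "country": lambda r: r["country"],
        "amount": lambda r: r["amount"],
    },
    "document": {
        "country": lambda r: r["customer"]["nation"],
        "amount": lambda r: r["total"],
    },
}


def local_plan(rows, mapping, group_by, measure):
    """Run against one source: resolve unified features, then pre-aggregate (SUM)."""
    partial = defaultdict(float)
    for row in rows:
        partial[mapping[group_by](row)] += mapping[measure](row)
    return dict(partial)


def global_plan(partials):
    """Combine the partial results of the local plans and aggregate them again."""
    merged = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


def answer_query(group_by="country", measure="amount"):
    """Total `measure` per `group_by` value across both sources."""
    partials = [
        local_plan(relational_rows, MAPPINGS["relational"], group_by, measure),
        local_plan(document_rows, MAPPINGS["document"], group_by, measure),
    ]
    return global_plan(partials)
```

Calling `answer_query()` groups both sources by the unified `country` feature and sums `amount`. A non-distributive function such as an average could not be merged this way without carrying extra state (for example, a sum and a count per group).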

Notes

  1. We remark that column-based NoSQL systems (e.g., BigTable [4]) are different from column-oriented DBMS (e.g., Vertica).

  2. We define the aggregation with the operator \(\gamma\) declared as \({}_{X}\gamma_{Y}\), where X is the group-by set (i.e., a set of features) and Y is the set of aggregations (where each aggregation is composed of a feature and an aggregation function).
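The operator in note 2 can be read as an ordinary group-by with a list of (feature, function) pairs. The sketch below is one possible reading in plain Python, not the paper's definition over nested relations: the function name `gamma`, the sample rows, and the output column naming scheme (`<function>_<feature>`) are assumptions made for illustration. X is passed as a list rather than a set only to keep the output column order deterministic.

```python
from collections import defaultdict


def gamma(rows, group_by, aggregations):
    """Sketch of X-gamma-Y: `group_by` plays the role of X, `aggregations` of Y.

    `aggregations` is a list of (feature, function) pairs, where each function
    reduces an iterable of values to a single value (e.g. sum, max, min).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[feature] for feature in group_by)
        groups[key].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for feature, func in aggregations:
            # Hypothetical naming scheme for aggregated columns.
            out[f"{func.__name__}_{feature}"] = func(m[feature] for m in members)
        result.append(out)
    return result


# Hypothetical sample rows.
sales = [
    {"country": "IT", "year": 2019, "amount": 10},
    {"country": "IT", "year": 2019, "amount": 5},
    {"country": "FR", "year": 2019, "amount": 7},
]

# X = {country, year}, Y = {(amount, sum), (amount, max)}
totals = gamma(sales, ["country", "year"], [("amount", sum), ("amount", max)])
```

Here `totals` holds one row per (country, year) group, with the columns `sum_amount` and `max_amount` alongside the group-by features.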

References

  1. Ben Hamadou, H., et al.: Schema-independent querying for heterogeneous collections in NoSQL document stores. Inf. Syst. (2019, in press). https://doi.org/10.1016/j.is.2019.04.005

  2. Ben Hamadou, H., Ghozzi, F., Péninou, A., Teste, O.: Towards schema-independent querying on document data stores. In: 20th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data Co-Located with EDBT/ICDT. CEUR-WS.org (2018)

  3. Botoeva, E., Calvanese, D., Cogrel, B., Xiao, G.: Expressivity and complexity of MongoDB queries. In: 21st International Conference on Database Theory, pp. 9:1–9:23. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2018). https://doi.org/10.4230/LIPIcs.ICDT.2018.9

  4. Chang, F., et al.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), 4:1–4:26 (2008)

  5. Corbellini, A., Mateos, C., Zunino, A., Godoy, D., Schiaffino, S.N.: Persisting big-data: the NoSQL landscape. Inf. Syst. 63, 1–23 (2017)

  6. DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: 2016 ACM SIGMOD International Conference on Management of Data, pp. 295–310. ACM (2016). https://doi.org/10.1145/2882903.2882924

  7. Franklin, M.J., Halevy, A.Y., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Rec. 34(4), 27–33 (2005)

  8. Gadepally, V., et al.: The BigDAWG polystore system and architecture. In: 2016 IEEE High Performance Extreme Computing Conference, pp. 1–6. IEEE (2016)

  9. Gallinucci, E., Golfarelli, M., Rizzi, S.: Variety-aware OLAP of document-oriented databases. In: 20th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data Co-Located with EDBT/ICDT. CEUR-WS.org (2018)

  10. Gallinucci, E., Golfarelli, M., Rizzi, S.: Approximate OLAP of document-oriented databases: a variety-aware approach. Inf. Syst. (2019, in press). https://doi.org/10.1016/j.is.2019.02.004

  11. Golfarelli, M., et al.: OLAP query reformulation in peer-to-peer data warehousing. Inf. Syst. 37(5), 393–411 (2012). https://doi.org/10.1016/j.is.2011.06.003

  12. Gupta, A., Harinarayan, V., Quass, D.: Aggregate-query processing in data warehousing environments. In: 21th International Conference on Very Large Data Bases, pp. 358–369. Morgan Kaufmann (1995)

  13. Herrero, V., Abelló, A., Romero, O.: NOSQL design for analytical workloads: variability matters. In: 35th International Conference on Conceptual Modeling, pp. 50–64 (2016). https://doi.org/10.1007/978-3-319-46397-1_4

  14. Jeffery, S.R., Franklin, M.J., Halevy, A.Y.: Pay-as-you-go user feedback for dataspace systems. In: 2008 ACM SIGMOD International Conference on Management of Data, pp. 847–860. ACM (2008). https://doi.org/10.1145/1376616.1376701

  15. LeFevre, J., et al.: MISO: souping up big data query processing with a multistore system. In: 2014 ACM SIGMOD International Conference on Management of Data, pp. 1591–1602. ACM (2014). https://doi.org/10.1145/2588555.2588568

  16. Ong, K.W., Papakonstantinou, Y., Vernoux, R.: The SQL++ semi-structured data model and query language: a capabilities survey of SQL-on-Hadoop, NoSQL and NewSQL databases. CoRR abs/1405.3631 (2014)

  17. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001). https://doi.org/10.1007/s007780100057

  18. Rolls, D., Joslin, C., Scholz, S.: Unibench: a tool for automated and collaborative benchmarking. In: 18th IEEE International Conference on Program Comprehension, pp. 50–51. IEEE Computer Society (2010). https://doi.org/10.1109/ICPC.2010.36

  19. Rostin, A., et al.: A machine learning approach to foreign key discovery. In: 12th International Workshop on the Web and Databases (2009)

  20. Sadalage, P.J., Fowler, M.: NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Education, London (2013)

  21. Sheth, A.P.: Federated database systems for managing distributed, heterogeneous, and autonomous databases. In: 17th International Conference on Very Large Data Bases, p. 489. Morgan Kaufmann (1991)

  22. Tahara, D., Diamond, T., Abadi, D.J.: Sinew: a SQL system for multi-structured data. In: 2014 ACM SIGMOD International Conference on Management of Data, pp. 815–826. ACM (2014). https://doi.org/10.1145/2588555.2612183

  23. Tan, R., et al.: Enabling query processing across heterogeneous data models: a survey. In: 2017 IEEE International Conference on Big Data, pp. 3211–3220. IEEE Computer Society (2017). https://doi.org/10.1109/BigData.2017.8258302

  24. Thomas, S.J., Fischer, P.C.: Nested relational structures. Adv. Comput. Res. 3, 269–307 (1986)

  25. Wang, L., et al.: Schema management for document stores. PVLDB 8(9), 922–933 (2015). https://doi.org/10.14778/2777598.2777601

  26. Zaharia, M., et al.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). https://doi.org/10.1145/2934664

Author information

Corresponding author

Correspondence to Matteo Golfarelli.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Ben Hamadou, H., Gallinucci, E., Golfarelli, M. (2019). Answering GPSJ Queries in a Polystore: A Dataspace-Based Approach. In: Laender, A., Pernici, B., Lim, EP., de Oliveira, J. (eds) Conceptual Modeling. ER 2019. Lecture Notes in Computer Science, vol 11788. Springer, Cham. https://doi.org/10.1007/978-3-030-33223-5_16

  • DOI: https://doi.org/10.1007/978-3-030-33223-5_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33222-8

  • Online ISBN: 978-3-030-33223-5

  • eBook Packages: Computer Science, Computer Science (R0)
