Answering GPSJ Queries in a Polystore: A Dataspace-Based Approach

Ben Hamadou, Hamdi; Gallinucci, Enrico; Golfarelli, Matteo

doi:10.1007/978-3-030-33223-5_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11788)

Included in the following conference series:

International Conference on Conceptual Modeling

Abstract

The discipline of data science is steering analysts away from traditional data warehousing and towards a more flexible and lightweight approach to data analysis. The idea is to perform OLAP analyses in a pay-as-you-go manner across heterogeneous schemas and data models, where the integration is progressively carried out by the user as the available data is explored. In this paper, we propose an approach to support data analysis within a polystore supporting relational, document and column data models by automatically handling both data model and schema heterogeneity through a dataspace layer on top of the underlying databases. The expressiveness we enable corresponds to GPSJ queries, which are the most common class of queries in OLAP applications. We rely on Nested Relational Algebra to define a cross-database execution plan. The plan is composed of several local plans, to be executed on the distinct databases, and a global plan, which combines and possibly aggregates inter-database data. The system has been prototyped on Apache Spark.
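The split between local plans and a global plan can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python rather than Apache Spark, it assumes the requested measures are sums, counts and averages (which can be computed from per-source partial results), and every name in it (`relational_orders`, `document_orders`, `local_plan`, `global_plan`) is hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-ins for two databases of the polystore, each with its
# own schema for the same kind of fact (an order with a country and a value).
relational_orders = [
    {"country": "IT", "amount": 10.0},
    {"country": "FR", "amount": 5.0},
    {"country": "IT", "amount": 2.5},
]
document_orders = [
    {"customer": {"country": "IT"}, "total": 4.0},
    {"customer": {"country": "DE"}, "total": 7.0},
]


def local_plan(rows, get_group, get_measure):
    """Partially aggregate one source: group key -> (sum, count)."""
    partial = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc = partial[get_group(row)]
        acc[0] += get_measure(row)
        acc[1] += 1
    return {key: tuple(acc) for key, acc in partial.items()}


def global_plan(partials):
    """Combine the partial results of all local plans into final aggregates."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {
        key: {"sum": total, "count": count, "avg": total / count}
        for key, (total, count) in merged.items()
    }


# Each local plan hides the schema differences behind accessor functions
# (an assumption of this sketch, standing in for the dataspace mappings).
local_results = [
    local_plan(relational_orders, lambda r: r["country"], lambda r: r["amount"]),
    local_plan(document_orders, lambda r: r["customer"]["country"], lambda r: r["total"]),
]
result = global_plan(local_results)
```

In this sketch the accessor functions play the role that the dataspace layer plays in the abstract: they map the differently named and differently nested attributes of each source onto one group-by feature and one measure, so that the global step only has to merge compatible partial results.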

Notes

1. We remark that column-based NoSQL systems (e.g., BigTable [4]) are different from column-oriented DBMS (e.g., Vertica).
2. We define the aggregation with the operator \(\gamma\), declared as \({}_{X}\gamma_{Y}\), where X is the group-by set (i.e., a set of features) and Y is the set of aggregations (where each aggregation is composed of a feature and an aggregation function).
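The aggregation operator of Note 2 can be sketched over flat records as follows. This is only an illustration under stated assumptions: the function name `gamma`, the `AGG_FUNCS` table, the output column naming and the sample `sales` data are hypothetical, and the sketch ignores nested structures, which the paper's Nested Relational Algebra would have to handle.

```python
from collections import defaultdict

# Aggregation functions available to the sketch (an assumed, minimal set).
AGG_FUNCS = {
    "sum": sum,
    "min": min,
    "max": max,
    "count": len,
    "avg": lambda values: sum(values) / len(values),
}


def gamma(rows, group_by, aggregations):
    """Group `rows` by the features in X (`group_by`) and compute Y.

    `aggregations` is a list of (feature, function name) pairs, mirroring
    the definition of Y as a set of feature/aggregation-function pairs.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[feature] for feature in group_by)
        groups[key].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for feature, func in aggregations:
            values = [member[feature] for member in members]
            out[f"{func}({feature})"] = AGG_FUNCS[func](values)
        result.append(out)
    return result


sales = [
    {"city": "Bologna", "year": 2019, "qty": 3},
    {"city": "Bologna", "year": 2019, "qty": 4},
    {"city": "Toulouse", "year": 2019, "qty": 5},
]
# X = {city, year}, Y = {(qty, sum), (qty, max)}
totals = gamma(sales, ["city", "year"], [("qty", "sum"), ("qty", "max")])
```

Here `totals` holds one record per distinct (city, year) pair, carrying the group-by features together with one value per requested aggregation.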

References

1. Ben Hamadou, H., et al.: Schema-independent querying for heterogeneous collections in NoSQL document stores. Inf. Syst. (2019, in press). https://doi.org/10.1016/j.is.2019.04.005
2. Ben Hamadou, H., Ghozzi, F., Péninou, A., Teste, O.: Towards schema-independent querying on document data stores. In: 20th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data Co-Located with EDBT/ICDT. CEUR-WS.org (2018)
3. Botoeva, E., Calvanese, D., Cogrel, B., Xiao, G.: Expressivity and complexity of MongoDB queries. In: 21st International Conference on Database Theory, pp. 9:1–9:23. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2018). https://doi.org/10.4230/LIPIcs.ICDT.2018.9
4. Chang, F., et al.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), 4:1–4:26 (2008)
5. Corbellini, A., Mateos, C., Zunino, A., Godoy, D., Schiaffino, S.N.: Persisting big-data: the NoSQL landscape. Inf. Syst. 63, 1–23 (2017)
6. DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: 2016 ACM SIGMOD International Conference on Management of Data, pp. 295–310. ACM (2016). https://doi.org/10.1145/2882903.2882924
7. Franklin, M.J., Halevy, A.Y., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Rec. 34(4), 27–33 (2005)
8. Gadepally, V., et al.: The BigDAWG polystore system and architecture. In: 2016 IEEE High Performance Extreme Computing Conference, pp. 1–6. IEEE (2016)
9. Gallinucci, E., Golfarelli, M., Rizzi, S.: Variety-aware OLAP of document-oriented databases. In: 20th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data Co-Located with EDBT/ICDT. CEUR-WS.org (2018)
10. Gallinucci, E., Golfarelli, M., Rizzi, S.: Approximate OLAP of document-oriented databases: a variety-aware approach. Inf. Syst. (2019, in press). https://doi.org/10.1016/j.is.2019.02.004
11. Golfarelli, M., et al.: OLAP query reformulation in peer-to-peer data warehousing. Inf. Syst. 37(5), 393–411 (2012). https://doi.org/10.1016/j.is.2011.06.003
12. Gupta, A., Harinarayan, V., Quass, D.: Aggregate-query processing in data warehousing environments. In: 21th International Conference on Very Large Data Bases, pp. 358–369. Morgan Kaufmann (1995)
13. Herrero, V., Abelló, A., Romero, O.: NOSQL design for analytical workloads: variability matters. In: 35th International Conference on Conceptual Modeling, pp. 50–64 (2016). https://doi.org/10.1007/978-3-319-46397-1_4
14. Jeffery, S.R., Franklin, M.J., Halevy, A.Y.: Pay-as-you-go user feedback for dataspace systems. In: 2008 ACM SIGMOD International Conference on Management of Data, pp. 847–860. ACM (2008). https://doi.org/10.1145/1376616.1376701
15. LeFevre, J., et al.: MISO: souping up big data query processing with a multistore system. In: 2014 ACM SIGMOD International Conference on Management of Data, pp. 1591–1602. ACM (2014). https://doi.org/10.1145/2588555.2588568
16. Ong, K.W., Papakonstantinou, Y., Vernoux, R.: The SQL++ semi-structured data model and query language: a capabilities survey of SQL-on-Hadoop, NoSQL and NewSQL databases. CoRR abs/1405.3631 (2014)
17. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001). https://doi.org/10.1007/s007780100057
18. Rolls, D., Joslin, C., Scholz, S.: Unibench: a tool for automated and collaborative benchmarking. In: 18th IEEE International Conference on Program Comprehension, pp. 50–51. IEEE Computer Society (2010). https://doi.org/10.1109/ICPC.2010.36
19. Rostin, A., et al.: A machine learning approach to foreign key discovery. In: 12th International Workshop on the Web and Databases (2009)
20. Sadalage, P.J., Fowler, M.: NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Education, London (2013)
21. Sheth, A.P.: Federated database systems for managing distributed, heterogeneous, and autonomous databases. In: 17th International Conference on Very Large Data Bases, p. 489. Morgan Kaufmann (1991)
22. Tahara, D., Diamond, T., Abadi, D.J.: Sinew: a SQL system for multi-structured data. In: 2014 ACM SIGMOD International Conference on Management of Data, pp. 815–826. ACM (2014). https://doi.org/10.1145/2588555.2612183
23. Tan, R., et al.: Enabling query processing across heterogeneous data models: a survey. In: 2017 IEEE International Conference on Big Data, pp. 3211–3220. IEEE Computer Society (2017). https://doi.org/10.1109/BigData.2017.8258302
24. Thomas, S.J., Fischer, P.C.: Nested relational structures. Adv. Comput. Res. 3, 269–307 (1986)
25. Wang, L., et al.: Schema management for document stores. PVLDB 8(9), 922–933 (2015). https://doi.org/10.14778/2777598.2777601
26. Zaharia, M., et al.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). https://doi.org/10.1145/2934664

Author information

Authors and Affiliations

Institut de Recherche en Informatique de Toulouse, Toulouse, France
Hamdi Ben Hamadou
University of Bologna, Cesena, Italy
Enrico Gallinucci & Matteo Golfarelli

Corresponding author

Correspondence to Matteo Golfarelli.

Editor information

Editors and Affiliations

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Alberto H. F. Laender
Politecnico di Milano, Milan, Italy
Barbara Pernici
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Univ Federal do Rio Grande do Sul, Porto Alegre, Brazil
José Palazzo M. de Oliveira

About this paper

Cite this paper

Ben Hamadou, H., Gallinucci, E., Golfarelli, M. (2019). Answering GPSJ Queries in a Polystore: A Dataspace-Based Approach. In: Laender, A., Pernici, B., Lim, EP., de Oliveira, J. (eds) Conceptual Modeling. ER 2019. Lecture Notes in Computer Science, vol 11788. Springer, Cham. https://doi.org/10.1007/978-3-030-33223-5_16

DOI: https://doi.org/10.1007/978-3-030-33223-5_16
Published: 15 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33222-8
Online ISBN: 978-3-030-33223-5
eBook Packages: Computer Science, Computer Science (R0)
