Multistore Big Data Integration with CloudMdsQL

  • Carlyna Bondiombouy
  • Boyan Kolev
  • Oleksandra Levchenko
  • Patrick Valduriez
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9940)

Abstract

Multistore systems have recently been proposed to provide integrated access to multiple, heterogeneous data stores through a single query engine. In particular, much attention is being paid to the integration of unstructured big data, typically stored in HDFS, with relational data. A common solution is to use a relational query engine that allows SQL-like queries to retrieve data from HDFS. This requires the system to provide a relational view of the unstructured data and hence is not always feasible. In this paper, we propose a functional SQL-like query language (based on CloudMdsQL) that can integrate data retrieved from different data stores. It takes full advantage of the functionality of the underlying data processing frameworks by allowing the ad-hoc usage of user-defined map/filter/reduce operators in combination with traditional SQL statements. Furthermore, our solution allows for optimization by enabling subquery rewriting, so that bind join can be used and filter conditions can be pushed down and applied by the data processing framework as early as possible. We validate our approach through an implementation and experiments with three data stores and representative queries. The experimental results demonstrate the usability of the query language and the benefits of query optimization.
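
The bind join and filter pushdown mentioned in the abstract can be illustrated with a minimal sketch. It is not the CloudMdsQL implementation: the two "data stores" are plain in-memory lists, and the names `scan_with_filter`, `bind_join`, `authors`, and `posts` are hypothetical, chosen only for illustration. The idea shown is that the join keys retrieved from one side are shipped to the other side as a filter, so that only matching rows leave the second store.

```python
from typing import Callable

Row = dict


def scan_with_filter(rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
    """Stand-in for a data store that applies a pushed-down filter
    before returning rows (assumption: the store supports this)."""
    return [row for row in rows if predicate(row)]


def bind_join(left_rows: list[Row], right_rows: list[Row], key: str) -> list[Row]:
    """Join two row sets on `key`, retrieving from the right side only
    the rows whose key appears on the left side."""
    # Step 1: collect the distinct join-key values from the left side.
    bound_keys = {row[key] for row in left_rows}

    # Step 2: push the key set down as a filter on the right side,
    # so non-matching rows are never shipped to the query engine.
    matching = scan_with_filter(right_rows, lambda row: row[key] in bound_keys)

    # Step 3: finish with an ordinary hash join on the reduced right side.
    index: dict = {}
    for row in matching:
        index.setdefault(row[key], []).append(row)
    return [
        {**left, **right}
        for left in left_rows
        for right in index.get(left[key], [])
    ]


# Hypothetical sample data: a small relational table and a larger collection.
authors = [
    {"author_id": 1, "name": "A"},
    {"author_id": 2, "name": "B"},
]
posts = [
    {"author_id": 1, "title": "x"},
    {"author_id": 3, "title": "y"},
    {"author_id": 1, "title": "z"},
]

joined = bind_join(authors, posts, "author_id")
```

In this sketch the post with `author_id` 3 is filtered out at the source, which is the benefit a bind join aims for when the left side is small relative to the right side.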

Acknowledgements

This research has been partially funded by the European Commission under project CoherentPaaS (FP7-611068).

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Carlyna Bondiombouy (1)
  • Boyan Kolev (1)
  • Oleksandra Levchenko (1)
  • Patrick Valduriez (1)

  1. Inria and LIRMM, University of Montpellier, Montpellier, France
