Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics

  • Éric Leclercq
  • Annabelle Gillet
  • Thierry Grison
  • Marinette Savonnet
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11860)

Abstract

This paper presents a tensor-based data model (TDM) for polystore systems, meant to address two major, closely related issues in big data analytics architectures: logical data independence and data impedance mismatch. The TDM is an expressive model that subsumes traditional data models, links the different data models of various data stores, and facilitates data transformations through operators with clearly defined semantics. Our contribution is twofold. First, we add the notion of a schema to the tensor mathematical object using typed associative arrays. Second, we define a set of operators to manipulate data through the TDM. To validate our approach, we first show how the TDM fits into a given polystore architecture. We then describe use cases of real analyses that use the TDM and its operators in the context of the 2017 French presidential election.

Keywords

Polystore, Data model, Logical data independence, Impedance mismatch, Tensor

Notes

Acknowledgement

This research was partially supported by the project I-SITE UBFC COCKTAIL. We thank George Becker for comments that have greatly improved the manuscript and Arnaud Da Costa for the maintenance of the server infrastructure.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. LIB EA 7534 - University of Bourgogne, Dijon, France