Learning Models over Relational Data: A Brief Tutorial

  • Maximilian Schleich
  • Dan Olteanu (email author)
  • Mahmoud Abo-Khamis
  • Hung Q. Ngo
  • XuanLong Nguyen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11940)


This tutorial surveys the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research.

Classification and regression models are learned from a training dataset that is defined by feature extraction queries over a relational database. The mainstream approach to learning over relational data is to materialize the training dataset, export it from the database, and then learn over it using a statistical package. This approach can be expensive because the training dataset must first be materialized. An alternative is to cast the machine learning problem as a database problem: the data-intensive component of the learning task is transformed into a batch of aggregates over the feature extraction query, and this batch is computed directly over the input database.
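As a minimal sketch of this alternative, consider least-squares regression with a single feature. The model parameters depend on the data only through a handful of sums, so the database can compute those sums over the feature extraction query and hand back five numbers instead of the training dataset. Everything concrete here is an assumption made for illustration: the relations `Sales` and `Items`, their columns, the toy data, and the use of Python's built-in `sqlite3` as a stand-in database. It is not the implementation of any system discussed in the tutorial.

```python
import sqlite3

# Hypothetical toy database: two relations joined on "item".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales(store INTEGER, item INTEGER, units REAL);
    CREATE TABLE Items(item INTEGER, price REAL);
    INSERT INTO Sales VALUES (1, 1, 5), (1, 2, 7), (2, 1, 5), (2, 3, 9);
    INSERT INTO Items VALUES (1, 1.0), (2, 2.0), (3, 3.0);
""")


def learn_from_aggregates(conn):
    """Fit units ~ theta0 + theta1 * price from aggregates only.

    The feature extraction query is the join of Sales and Items.
    Only the batch of aggregates leaves the database; the joined
    training dataset is never exported to the client.
    """
    n, sx, sxx, sy, sxy = conn.execute("""
        SELECT COUNT(*), SUM(price), SUM(price * price),
               SUM(units), SUM(price * units)
        FROM Sales NATURAL JOIN Items
    """).fetchone()

    # Normal equations for one feature plus intercept:
    #   [ n   sx  ] [theta0]   [ sy  ]
    #   [ sx  sxx ] [theta1] = [ sxy ]
    det = n * sxx - sx * sx
    theta1 = (n * sxy - sx * sy) / det
    theta0 = (sy * sxx - sx * sxy) / det
    return theta0, theta1


theta0, theta1 = learn_from_aggregates(conn)
print(theta0, theta1)
```

The toy data satisfies units = 3 + 2 · price exactly, so the fitted parameters recover that intercept and slope. With more features the batch grows to one aggregate per pair of features, but the principle is the same.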

The tutorial highlights a variety of techniques developed by the database theory and systems communities to improve the performance of the learning task. These techniques rely on structural properties of the relational data and of the feature extraction query, including algebraic (semiring), combinatorial (hypertree width), statistical (sampling), and geometric (distance) structure. They also rely on factorized computation, code specialization, query compilation, and parallelization.
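The algebraic structure mentioned above can be illustrated with a small sketch: because multiplication distributes over addition, sum aggregates can be pushed past a join and computed from per-key partial aggregates, without enumerating the join result. The code below compares this with the materialize-then-aggregate baseline on two hypothetical binary relations `R(a, x)` and `S(a, y)`. The function names and data are assumptions for illustration, and the sketch covers only a single two-way join rather than the general techniques the tutorial covers.

```python
from collections import defaultdict


def aggregates_naive(R, S):
    """Materialize the join of R(a, x) and S(a, y), then aggregate."""
    joined = [(x, y) for (a, x) in R for (b, y) in S if a == b]
    return (len(joined),
            sum(x for x, _ in joined),
            sum(y for _, y in joined),
            sum(x * y for x, y in joined))


def partials(rel):
    """Per join key: [tuple count, sum of the non-key attribute]."""
    acc = defaultdict(lambda: [0, 0.0])
    for key, value in rel:
        acc[key][0] += 1
        acc[key][1] += value
    return acc


def aggregates_pushed(R, S):
    """Compute COUNT, SUM(x), SUM(y), SUM(x*y) over the join of R and S
    from per-key partial aggregates, using distributivity of * over +."""
    pr, ps = partials(R), partials(S)
    count, sum_x, sum_y, sum_xy = 0, 0.0, 0.0, 0.0
    for key in pr.keys() & ps.keys():   # keys that survive the join
        (cr, sx), (cs, sy) = pr[key], ps[key]
        count += cr * cs                # each R-tuple pairs with each S-tuple
        sum_x += sx * cs                # every x is repeated cs times
        sum_y += cr * sy                # every y is repeated cr times
        sum_xy += sx * sy               # (sum of x) * (sum of y) per key
    return count, sum_x, sum_y, sum_xy


R = [(1, 2.0), (1, 3.0), (2, 5.0)]
S = [(1, 10.0), (1, 20.0), (3, 7.0)]
print(aggregates_naive(R, S))
print(aggregates_pushed(R, S))
```

Both functions return the same four aggregates, but `aggregates_pushed` makes one pass over each relation and does work proportional to the input size, whereas the baseline does work proportional to the size of the join result.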


Keywords: Relational learning, Query processing



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Maximilian Schleich (1)
  • Dan Olteanu (1), email author
  • Mahmoud Abo-Khamis (2)
  • Hung Q. Ngo (2)
  • XuanLong Nguyen (3)

  1. University of Oxford, Oxford, UK
  2. RelationalAI, Inc., Berkeley, USA
  3. University of Michigan, Ann Arbor, USA
