Compressed linear algebra for large-scale machine learning


Large-scale machine learning algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory and enable fast matrix-vector operations on in-memory data. General-purpose, heavy- and lightweight compression techniques struggle to achieve both good compression ratios and fast decompression speed to enable block-wise uncompressed operations. Therefore, we initiate work—inspired by database compression and sparse matrix formats—on value-based compressed linear algebra (CLA), in which heterogeneous, lightweight database compression techniques are applied to matrices, and then linear algebra operations such as matrix-vector multiplication are executed directly on the compressed representation. We contribute effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm. Our experiments show that CLA achieves in-memory operations performance close to the uncompressed case and good compression ratios, which enables fitting substantially larger datasets into available memory. We thereby obtain significant end-to-end performance improvements up to \(9.2\mathrm{x}\).
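To make the idea of executing operations directly on a compressed representation concrete, the following is a minimal sketch, not the column compression schemes contributed in the article. It assumes a deliberately simple value-based encoding in which each column stores its distinct nonzero values together with the list of row offsets where each value occurs. The function names `compress_columns` and `compressed_matvec` are hypothetical and chosen for illustration only.

```python
def compress_columns(rows):
    """Encode a dense matrix (list of rows) column by column.

    Each column becomes a dict {distinct nonzero value: [row offsets]}.
    This is an illustrative encoding, not the article's actual format.
    """
    n_cols = len(rows[0])
    cols = []
    for j in range(n_cols):
        groups = {}
        for i, row in enumerate(rows):
            val = row[j]
            if val != 0:  # zeros are implicit, as in sparse formats
                groups.setdefault(val, []).append(i)
        cols.append(groups)
    return cols


def compressed_matvec(cols, n_rows, v):
    """Compute y = X v without decompressing X."""
    y = [0.0] * n_rows
    for j, groups in enumerate(cols):
        for val, offsets in groups.items():
            # One multiplication per distinct value, then a scatter-add
            # to every row in which that value occurs.
            contrib = val * v[j]
            for i in offsets:
                y[i] += contrib
    return y


X = [[7.0, 0.0, 2.0],
     [7.0, 3.0, 2.0],
     [0.0, 3.0, 2.0],
     [7.0, 0.0, 5.0]]
v = [1.0, 2.0, 3.0]

cols = compress_columns(X)
y = compressed_matvec(cols, len(X), v)
print(y)  # [13.0, 19.0, 12.0, 22.0]
```

In this sketch, a column with few distinct values needs only one multiplication per distinct value rather than one per cell, which hints at why value-based compression can stay close to uncompressed performance when columns have low cardinality.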




  1. Dummy coding transforms a categorical feature having \(d\) possible values into \(d\) Boolean features, each indicating the rows in which a given value occurs. The larger the value of \(d\), the greater the sparsity (from adding \(d-1\) zeros per row).

  2. The results with native BLAS libraries would be similar because memory bandwidth and I/O are the bottlenecks.

  3. For consistency with previously published results [32], we use Snappy, which was the default codec in Spark 1.x. However, we also include LZ4, which is the default in Spark 2.x.

  4. For Mnist with its original 10 classes, we created the labels with \(\mathbf{y} \leftarrow (\mathbf{y}==7)\) (i.e., class 7 against the rest), whereas for ImageNet with its 1000 classes, we created the labels with \(\mathbf{y}\leftarrow (\mathbf{y}_0 > (\max(\mathbf{y}_0) - (\max(\mathbf{y}_0)-\min(\mathbf{y}_0))/2))\), where we derived \(\mathbf{y}_0 = \mathbf{X}\mathbf{w}\) from the data \(\mathbf{X}\) and a random model \(\mathbf{w}\).

  5. We enabled code generation for cell-wise operations only because SystemML 0.14 does not yet support operator fusion, i.e., code generation, for compressed matrices.
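The dummy coding and label construction described in the notes above can be sketched as follows. This is a small illustration under stated assumptions, not the authors' preprocessing code: the helper names `dummy_code`, `one_vs_rest`, and `midrange_labels` are hypothetical, the inputs are toy values, and the scores passed to `midrange_labels` stand in for \(\mathbf{y}_0 = \mathbf{X}\mathbf{w}\), which the notes derive from the data and a random model.

```python
def dummy_code(column):
    """Turn a categorical column with d distinct values into d Boolean features."""
    categories = sorted(set(column))
    encoded = [[1 if value == c else 0 for c in categories] for value in column]
    return categories, encoded


def one_vs_rest(labels, positive_class):
    """y <- (y == positive_class), e.g. class 7 against the rest."""
    return [1 if y == positive_class else 0 for y in labels]


def midrange_labels(scores):
    """y <- (y0 > max(y0) - (max(y0) - min(y0)) / 2)."""
    hi, lo = max(scores), min(scores)
    threshold = hi - (hi - lo) / 2
    return [1 if s > threshold else 0 for s in scores]


categories, encoded = dummy_code(["a", "b", "c", "a"])
# With d = 3 categories, every encoded row holds exactly d - 1 = 2 zeros.
zeros_per_row = [row.count(0) for row in encoded]

binary_digits = one_vs_rest([7, 1, 7, 3], positive_class=7)
binary_scores = midrange_labels([0.0, 4.0, 1.0, 3.0])
```

The threshold in `midrange_labels` is the midpoint between the smallest and largest score, so a score counts as positive only when it lies strictly in the upper half of the observed range.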


  1. Abadi, D.J., et al.: Integrating compression and execution in column-oriented database systems. In: SIGMOD (2006)

  2. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. In: CoRR (2016)

  3. Adler, M., Mitzenmacher, M.: Towards compressing web graphs. In: DCC (2001)

  4. Alexandrov, A., et al.: The stratosphere platform for big data analytics. VLDB J. 23(6), 939–964 (2014)

  5. American Statistical Association (ASA). Airline on-time performance dataset.

  6. Ashari, A., et al.: An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs. In: ICS (2014)

  7. Ashari, A., et al.: On optimizing machine learning workloads via kernel fusion. In: PPoPP (2015)

  8. Bandyopadhyay, B., et al.: Topological graph sketching for incremental and scalable analytics. In: CIKM (2016)

  9. Bassiouni, M.A.: Data compression in scientific and statistical databases. Trans. Softw. Eng. (TSE) 11(10), 1047–1058 (1985)

  10. Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: SC (2009)

  11. Bergstra, J., et al.: Theano: a CPU and GPU math expression compiler. In: SciPy (2010)

  12. Beyer, K.S., et al.: On synopses for distinct-value estimation under multiset operations. In: SIGMOD (2007)

  13. Bhattacharjee, B., et al.: Efficient index compression in DB2 LUW. PVLDB 2(2), 1462–1473 (2009)

  14. Bhattacherjee, S., et al.: PStore: an efficient storage framework for managing scientific data. In: SSDBM (2014)

  15. Binnig, C., et al.: Dictionary-based order-preserving string compression for main memory column stores. In: SIGMOD (2009)

  16. Boehm, M., et al.: SystemML: declarative machine learning on Spark. PVLDB 9(13), 1425–1436 (2016)

  17. Boehm, M., et al.: Declarative machine learning—a classification of basic properties and types. In: CoRR (2016)

  18. Bolosky, W.J., Scott, M.L.: False sharing and its effect on shared memory performance. In: SEDMS (1993)

  19. Bottou, L.: The infinite MNIST dataset.

  20. Buehrer, G., Chellapilla, K.: A scalable pattern mining approach to web graph compression with communities. In: WSDM (2008)

  21. Charikar, M., et al.: Towards estimation error guarantees for distinct values. In: SIGMOD (2000)

  22. Chen, L., et al.: Towards linear algebra over normalized data. PVLDB 10(11), 1214–1225 (2017)

  23. Chitta, R., et al.: Approximate kernel k-means: solution to large scale kernel clustering. In: KDD (2011)

  24. Cohen, J., et al.: MAD skills: new analysis practices for big data. PVLDB 2(2), 1481–1492 (2009)

  25. Constantinescu, C., Lu, M.: Quick estimation of data compression and de-duplication for large storage systems. In: CCP (2011)

  26. Cormack, G.V.: Data compression on a database system. Commun. ACM 28(12), 1336–1342 (1985)

  27. Damme, P., et al.: Lightweight data compression algorithms: an experimental survey. In: EDBT (2017)

  28. Das, S., et al.: Ricardo: integrating R and Hadoop. In: SIGMOD (2010)

  29. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI (2004)

  30. Elgamal, T., et al.: sPCA: scalable principal component analysis for big data on distributed platforms. In: SIGMOD (2015)

  31. Elgamal, T., et al.: SPOOF: sum-product optimization and operator fusion for large-scale machine learning. In: CIDR (2017)

  32. Elgohary, A., et al.: Compressed linear algebra for large-scale machine learning. PVLDB 9(12), 960–971 (2016)

  33. Fan, W., et al.: Query preserving graph compression. In: SIGMOD (2012)

  34. Ghoting, A., et al.: SystemML: declarative machine learning on MapReduce. In: ICDE (2011)

  35. Good, I.J.: The population frequencies of species and the estimation of population parameters. Biometrika 40, 237–264 (1953)

  36. Graefe, G., Shapiro, L.D.: Data compression and database performance. In: Applied Computing (1991)

  37. Haas, P.J., Stokes, L.: Estimating the number of classes in a finite population. J. Am. Stat. Assoc. 93(444), 1475–1487 (1998)

  38. Halko, N., et al.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)

  39. Harnik, D., et al.: Estimation of deduplication ratios in large data sets. In: MSST (2012)

  40. Harnik, D., et al.: To zip or not to zip: effective resource usage for real-time compression. In: FAST (2013)

  41. Huang, B., et al.: Cumulon: optimizing statistical data analysis in the cloud. In: SIGMOD (2013)

  42. Huang, B., et al.: Resource elasticity for large-scale machine learning. In: SIGMOD (2015)

  43. Idreos, S., et al.: Estimating the compression fraction of an index using sampling. In: ICDE (2010)

  44. Intel. MKL: Math Kernel Library.

  45. Johnson, D.S., et al.: Worst-case performance bounds for simple one-dimensional packing algorithms. SIAM J. Comput. 3(4), 299–325 (1974)

  46. Johnson, N.L., et al.: Univariate Discrete Distributions, 2nd edn. Wiley, New York (1992)

  47. Kang, D., et al.: NoScope: Optimizing deep CNN-based queries over video streams at scale. PVLDB 10(11), 1586–1597 (2017)

  48. Karakasis, V., et al.: An extended compression format for the optimization of sparse matrix-vector multiplication. Trans. Parallel Distrib. Syst. (TPDS) 24(10), 1930–1940 (2013)

  49. Kernert, D., et al.: SLACID—sparse linear algebra in a column-oriented in-memory database system. In: SSDBM (2014)

  50. Kim, M.: TensorDB and tensor-relational model (TRM) for efficient tensor-relational operations. Ph.D. Thesis, ASU (2014)

  51. Kimura, H., et al.: Compression aware physical database design. PVLDB 4(10), 657–668 (2011)

  52. Kourtis, K., et al.: Optimizing sparse matrix-vector multiplication using index and value compression. In: CF (2008)

  53. Kumar, A., et al.: Demonstration of Santoku: optimizing machine learning over normalized data. PVLDB 8(12), 1864–1867 (2015)

  54. Kumar, A., et al.: Learning generalized linear models over normalized data. In: SIGMOD (2015)

  55. Lang, H., et al.: Data blocks: hybrid OLTP and OLAP on compressed storage using both vectorization and compilation. In: SIGMOD (2016)

  56. Larson, P., et al.: SQL Server column store indexes. In: SIGMOD (2011)

  57. LeCun, Y., et al.: Deep learning. Nature 521, 436–444 (2015)

  58. Li, F., et al.: When Lempel–Ziv–Welch meets machine learning: a case study of accelerating machine learning using coding. In: CoRR (2017)

  59. Lichman, M.: UCI machine learning repository: higgs, covertype, US Census (1990).

  60. Luo, S., et al.: Scalable linear algebra on a relational database system. In: ICDE (2017)

  61. Maccioni, A., Abadi, D.J.: Scalable pattern matching over compressed graphs via dedensification. In: KDD (2016)

  62. Maneth, S., Peternek, F.: A survey on methods and systems for graph compression. In: CoRR (2015)

  63. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1999)

  64. NVIDIA. cuSPARSE: CUDA Sparse Matrix Library.

  65. Olteanu, D., Schleich, M.: F: Regression models over factorized views. PVLDB 9(13), 1573–1576 (2016)

  66. O’Neil, P.E.: Model 204 architecture and performance. In: High Performance Transaction Systems (1989)

  67. Or, A., Rosen, J.: Unified memory management in Spark 1.6, SPARK-10000 design document (2015)

  68. Oracle. Data Warehousing Guide, 11g Release 1 (2007)

  69. Papadopoulos, S., et al.: The TileDB array data storage manager. PVLDB 10(4), 349–360 (2016)

  70. Qin, C., Rusu, F.: Speculative approximations for terascale analytics. In: CoRR (2015)

  71. Raman, V., Swart, G.: How to wring a table dry: entropy compression of relations and querying of compressed relations. In: VLDB (2006)

  72. Raman, V., et al.: DB2 with BLU acceleration: so much more than just a column store. PVLDB 6(11), 1080–1091 (2013)

  73. Raskhodnikova, S., et al.: Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM J. Comput. 39(3), 813–842 (2009)

  74. Rendle, S.: Scaling factorization machines to relational data. PVLDB 6(5), 337–348 (2013)

  75. Rohrmann, T., et al.: Gilbert: declarative sparse linear algebra on massively parallel dataflow systems. In: BTW (2017)

  76. Saad, Y.: SPARSKIT: a basic tool kit for sparse matrix computations—Version 2 (1994)

  77. Satuluri, V., et al.: Local graph sparsification for scalable clustering. In: SIGMOD (2011)

  78. Schelter, S., et al.: Samsara: declarative machine learning on distributed dataflow systems. In: NIPS Workshop MLSystems (2016)

  79. Schlegel, B., et al.: Memory-efficient frequent-itemset mining. In: EDBT (2011)

  80. Schleich, M., et al.: Learning linear regression models over factorized joins. In: SIGMOD (2016)

  81. Stonebraker, M., et al.: C-Store: a column-oriented DBMS. In: VLDB (2005)

  82. Stonebraker, M., et al.: The Architecture of SciDB. In: SSDBM (2011)

  83. Sybase. IQ 15.4 System Administration Guide (2013)

  84. Tabei, Y., et al.: Scalable partial least squares regression on grammar-compressed data matrices. In: KDD (2016)

  85. Tepper, M., Sapiro, G.: Compressed nonnegative matrix factorization is fast and accurate. IEEE Trans. Signal Process. 64(9), 2269–2283 (2016)

  86. Tian, Y., et al.: Scalable and numerically stable descriptive statistics in SystemML. In: ICDE (2012)

  87. Valiant, G., Valiant, P.: Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In: STOC (2011)

  88. Wang, W., et al.: Database meets deep learning: challenges and opportunities. SIGMOD Rec. 45(2), 17–22 (2016)

  89. Westmann, T., et al.: The implementation and performance of compressed databases. SIGMOD Rec. 29(3), 55–67 (2000)

  90. Willhalm, T., et al.: SIMD-Scan: ultra fast in-memory table scan using on-chip vector processing units. PVLDB 2(1), 385–394 (2009)

  91. Williams, S., et al.: Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In: SC (2007)

  92. Wu, K., et al.: Optimizing bitmap indices with efficient compression. TODS 31(1), 1–38 (2006)

  93. Yu, L., et al.: Exploiting matrix dependency for efficient distributed matrix computation. In: SIGMOD (2015)

  94. Zadeh, R.B., et al.: Matrix computations and optimization in Apache Spark. In: KDD (2016)

  95. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI (2012)

  96. Zhang, C., et al.: Materialization optimizations for feature selection workloads. In: SIGMOD (2014)

  97. Zukowski, M., et al.: Super-scalar RAM-CPU cache compression. In: ICDE (2006)



We thank Alexandre Evfimievski and Prithviraj Sen for thoughtful discussions on compressed linear algebra and code generation, Srinivasan Parthasarathy for pointing us to the related work on graph compression, as well as our reviewers for their valuable comments and suggestions.

Author information



Corresponding author

Correspondence to Matthias Boehm.


Cite this article

Elgohary, A., Boehm, M., Haas, P.J. et al. Compressed linear algebra for large-scale machine learning. The VLDB Journal 27, 719–744 (2018).



  • Machine learning
  • Large-scale
  • Declarative
  • Linear algebra
  • Lossless compression