Skip to main content
Log in

PCA for large data sets with parallel data summarization

  • Published:
Distributed and Parallel Databases Aims and scope Submit manuscript

Abstract

Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well known model for dimensionality reduction in statistical analysis, which requires a demanding number of I/O and CPU operations. In this paper, we study how to compute PCA in parallel. We extend a previous sequential method to a highly parallel algorithm that can compute PCA in one pass on a large data set based on summarization matrices. We also study how to integrate our algorithm with a DBMS; our solution is based on a combination of parallel data set summarization via user-defined aggregations and calling the MKL parallel variant of the LAPACK library to solve Singular Value Decomposition (SVD) in RAM. Our algorithm is theoretically shown to achieve linear speedup, linear scalability on data size, quadratic time on dimensionality (but in RAM), spending most of the time on data set summarization, despite the fact that SVD has cubic time complexity on dimensionality. Experiments with large data sets on multicore CPUs show that our solution is much faster than the R statistical package as well as solving PCA with SQL queries. Benchmarking on multicore CPUs and a parallel DBMS running on multiple nodes confirms linear speedup and linear scalability.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

References

  1. Blakeley, J.A., Rao, V., Kunen, I., Prout, A., Henaire, M., Kleinerman, C.: .NET database programmability and extensibility in Microsoft SQL server. In: Proc. ACM SIGMOD Conference, pp. 1087–1098 (2008)

    Google Scholar 

  2. Brown, G.P.: Overview of SciDB: large scale array storage, processing and analysis. In: Proc. ACM SIGMOD Conference (2010)

    Google Scholar 

  3. Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: Parallel tiled QR factorization for multicore architectures. Concurr. Comput. 20(13), 1573–1590 (2008)

    Article  Google Scholar 

  4. Chu, C.T., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G.R., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. In: Proc. NIPS Conference, pp. 281–288 (2006)

    Google Scholar 

  5. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  6. Demmel, J.W.: Applied Numerical Linear Algebra, 1st edn. SIAM, Philadelphia (1997)

    Book  MATH  Google Scholar 

  7. Dongarra, J., Duff, I.S., Sorensen, D.C., van der Vost, H.A.: Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia (1998)

    Book  MATH  Google Scholar 

  8. Eddelbuettel, D.: Benchmarking single-and multi-core BLAS implementations and GPUs for use with R. Mathematica (2010)

  9. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book, 1st edn. Prentice Hall, New York (2001)

    Google Scholar 

  10. Halko, N.H., Martinsson, P.-G., Shkolnisky, Y., Tygert, M.: An algorithm for the principal component analysis of large data sets. SIAM J. Sci. Comput. 33(5), 2580–2594 (2011)

    Article  MATH  MathSciNet  Google Scholar 

  11. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)

    Google Scholar 

  12. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning, 1st edn. Springer, New York (2001)

    Book  MATH  Google Scholar 

  13. Hellerstein, J., Re, C., Schoppmann, F., Wang, D.Z., et al.: The MADlib analytics library or MAD skills, the SQL. Proc. VLDB Endow. 5(12), 1700–1711 (2012)

    Google Scholar 

  14. Kanth, K.V.R., Agrawal, D., Singh, A.: Dimensionality reduction for similarity searching in dynamic databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD ’98), pp. 166–176. ACM, New York (1998)

    Chapter  Google Scholar 

  15. Krycha, K., Wexler, J.: Sas and teradata in-database model development using the business example of churn modeling. In: Proc. SAS Global Forum (2013). Paper id 087

    Google Scholar 

  16. Kyriacou, C., Evripidou, P., Trancoso, P.: Data-driven multithreading using conventional microprocessors. IEEE Trans. Parallel Distrib. Syst. 17(10), 1176–1188 (2006)

    Article  Google Scholar 

  17. Lahabar, S., Narayanan, P.J.: Singular value decomposition on GPU using CUDA. In: Proc. of IEEE International Symposium on Parallel & Distributed Processing, pp. 1–10 (2009)

    Google Scholar 

  18. Luo, C., Thakkar, H., Wang, H., Zaniolo, C.: A native extension of SQL for mining data streams. In: Proc. ACM SIGMOD Conference, pp. 873–875 (2005)

    Google Scholar 

  19. Navas, M., Ordonez, C.: Efficient computation of PCA with SVD in SQL. In: KDD Workshop on Data Mining Using Tensors and Matrices (2009), Article No. 5

    Google Scholar 

  20. Ordonez, C.: Statistical model computation with UDFs. IEEE Trans. Knowl. Data Eng. 22(12), 1752–1765 (2010)

    Article  Google Scholar 

  21. Ordonez, C., Cereghini, P.: SQLEM: fast clustering in SQL using the EM algorithm. In: Proc. ACM SIGMOD Conference, pp. 559–570 (2000)

    Google Scholar 

  22. Ordonez, C., Mohanam, N., Garcia-Alvarado, C., Tosic, P.T., Martinez, E.: Fast PCA computation in a DBMS with aggregate UDFs and LAPACK. In: Proc. ACM CIKM Conference (2012)

    Google Scholar 

  23. Ries, F., DeMarco, T., Guerrieri, R.: Triangular matrix inversion on heterogeneous multicore systems. IEEE Trans. Parallel Distrib. Syst. 23(1), 177–184 (2012)

    Article  Google Scholar 

  24. Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010)

    Article  Google Scholar 

Download references

Acknowledgements

This work was supported by NSF grant IIS 0914861.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Carlos Ordonez.

Additional information

Communicated by Feifei Li and Suman Nath.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ordonez, C., Mohanam, N. & Garcia-Alvarado, C. PCA for large data sets with parallel data summarization. Distrib Parallel Databases 32, 377–403 (2014). https://doi.org/10.1007/s10619-013-7134-6

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10619-013-7134-6

Keywords

Navigation