Hash-tree PCA: accelerating PCA with hash-based grouping

The Journal of Supercomputing

Abstract

Principal component analysis (PCA) is one of the most commonly used feature extraction techniques in data mining and machine learning. However, it scales poorly to large datasets. In this paper, we propose a new method for accelerating conventional PCA, named hash-tree PCA. It samples from groups of mutually similar objects while preserving the original data distribution. First, it finds similar objects and stores them in hash tables. Next, it samples a certain number of objects from each hash table and builds a new dataset with fewer objects. Finally, it runs PCA on the sampled dataset. Experimental results show that our method outperforms conventional PCA and fast PCA methods.
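The three steps in the abstract (group similar objects into hash tables, sample from each table, run PCA on the sample) can be illustrated with a minimal sketch. The abstract does not specify the hash function or the per-table sample size, so everything concrete here is an assumption: a grid-cell key that quantizes each coordinate stands in for the hash, and the names `hash_group_sample`, `pca`, `cell_width`, and `per_bucket` are hypothetical. This is not the authors' implementation.

```python
from collections import defaultdict

import numpy as np


def hash_group_sample(X, cell_width=1.0, per_bucket=5, seed=0):
    """Group rows into buckets by a grid-cell key, then sample from each bucket.

    Assumption: the grid-cell key is a stand-in for the paper's hash function.
    """
    rng = np.random.default_rng(seed)
    buckets = defaultdict(list)
    # Quantize every coordinate; rows that fall in the same cell share a key.
    keys = np.floor(X / cell_width).astype(np.int64)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)
    chosen = []
    for idx in buckets.values():
        # Keep at most `per_bucket` rows from each bucket.
        k = min(per_bucket, len(idx))
        chosen.extend(rng.choice(idx, size=k, replace=False))
    return X[np.sort(np.array(chosen))]


def pca(X, n_components):
    """Conventional PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order]


# Synthetic data with clearly different variance along each axis.
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 3)) * np.array([5.0, 2.0, 0.5])

X_small = hash_group_sample(X, cell_width=1.0, per_bucket=5)
components_full, _ = pca(X, 2)
components_small, _ = pca(X_small, 2)
```

Comparing `components_full` with `components_small` (up to sign) gives a rough check of how well the reduced dataset preserves the principal directions. Note that a fixed per-bucket cap flattens dense regions, so how closely the sample follows the original distribution depends on the choice of `cell_width` and `per_bucket`.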

Acknowledgements

This research was supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10082578, Development of intelligent operation system based on big data for production process efficiency and quality optimization in non-ferrous metal industry).

Author information

Corresponding author

Correspondence to Kwan-Hee Yoo.

Cite this article

Battulga, L., Lee, SH., Nasridinov, A. et al. Hash-tree PCA: accelerating PCA with hash-based grouping. J Supercomput 76, 8248–8264 (2020). https://doi.org/10.1007/s11227-019-02947-x

  • DOI: https://doi.org/10.1007/s11227-019-02947-x