Abstract
Principal component analysis (PCA) is one of the most commonly used feature extraction techniques in data mining and machine learning. However, it becomes slow on large datasets. In this paper, we propose hash-tree PCA, a new method for accelerating conventional PCA. It samples from groups of mutually similar objects so that the original data distribution is preserved. First, it finds similar objects and stores them in hash tables. Next, it samples a certain number of objects from each hash table and creates a new dataset with fewer objects. Finally, it runs PCA on the sampled dataset. Experimental results show that our method outperforms the PCA and fast PCA methods.
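The three steps in the abstract (group similar objects into hash tables, sample from each table, run PCA on the sample) can be illustrated with a minimal Python sketch. The abstract does not specify the hash function or the sampling rule, so everything concrete here is an assumption made for illustration: the hash is a quantized random projection, the per-bucket sample is a fixed fraction with at least one object kept, and the names `hash_bucket_sample`, `pca`, `n_projections`, `bucket_width`, and `keep_fraction` are hypothetical. It should not be read as the authors' hash-tree construction.

```python
from collections import defaultdict

import numpy as np


def hash_bucket_sample(X, n_projections=4, bucket_width=1.0,
                       keep_fraction=0.1, seed=0):
    """Group rows of X into hash buckets, then sample a fraction of each bucket."""
    rng = np.random.default_rng(seed)
    # Assumed hash (not the paper's): project onto random directions and
    # quantize, so nearby rows tend to receive the same integer key.
    directions = rng.standard_normal((X.shape[1], n_projections))
    keys = np.floor(X @ directions / bucket_width).astype(np.int64)

    buckets = defaultdict(list)
    for row_index, key in enumerate(map(tuple, keys)):
        buckets[key].append(row_index)

    sampled = []
    for members in buckets.values():
        # Keep at least one object per bucket so sparse regions stay represented.
        n_keep = max(1, int(round(keep_fraction * len(members))))
        sampled.extend(rng.choice(members, size=n_keep, replace=False))
    return X[np.sort(np.array(sampled))]


def pca(X, n_components):
    """Plain PCA via SVD of the centered data.

    Returns (components, explained_variance), one component per row.
    """
    centered = X - X.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    return vt[:n_components], explained_variance[:n_components]


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic data: one high-variance direction, four low-variance ones.
    data = rng.standard_normal((5000, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
    reduced = hash_bucket_sample(data, n_projections=2, bucket_width=5.0)
    components, variance = pca(reduced, n_components=2)
    print("objects before/after sampling:", data.shape[0], reduced.shape[0])
    print("first principal direction:", np.round(components[0], 3))
```

Sampling a fixed fraction of every bucket is one simple way to shrink the dataset while keeping dense and sparse regions in roughly their original proportions. The SVD then runs on fewer rows, which is where any speed-up in this sketch would come from.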
Acknowledgements
This research was supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10082578, Development of intelligent operation system based on big data for production process efficiency and quality optimization in non-ferrous metal industry).
About this article
Cite this article
Battulga, L., Lee, SH., Nasridinov, A. et al. Hash-tree PCA: accelerating PCA with hash-based grouping. J Supercomput 76, 8248–8264 (2020). https://doi.org/10.1007/s11227-019-02947-x
DOI: https://doi.org/10.1007/s11227-019-02947-x