Abstract
Because high-dimensional data degrade the performance of the k-Nearest Neighbor (k-NN) classification algorithm, dimension reduction for big data classification with k-NN has drawn considerable attention from both industry and academia. Popular dimension reduction techniques such as PCA, LDA, and SVD are data-dependent: the projection matrix is generated from the input data, which makes reusing the projection matrix impractical. In this paper, a Data-Independent Reusable Projection (DIRP) technique is proposed to project high-dimensional data into a low-dimensional space, and it is shown how the projection matrix can be reused for any dataset with the same number of dimensions. The proposed DIRP method approximately preserves the distance between any two points in the dataset, which makes it well suited to distance-based classification algorithms such as k-NN. The DIRP method has been implemented in R, and a new package, "RandPro", for generating projection matrices has been developed and tested on the CIFAR-10, handwritten digit recognition (MNIST), and human activity recognition datasets. Two versions of the RandPro package have been uploaded to the CRAN repository. The running times and classification metrics of PCA and the DIRP method are compared. The results show that the running time of the proposed method is reduced significantly while accuracy remains nearly equivalent to that obtained on the original data.
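The core idea behind a data-independent projection is that the projection matrix is drawn from a random distribution rather than computed from the data, so it can be generated once and applied to any dataset with the same number of features. The following is a minimal illustrative sketch of a Gaussian random projection in this spirit (shown in Python for brevity; the function names `random_projection_matrix` and `project` are ours, not the RandPro API):

```python
import math
import random

def random_projection_matrix(orig_dim, reduced_dim, seed=0):
    # Data-independent: entries are i.i.d. Gaussian, scaled by 1/sqrt(k),
    # so the matrix depends only on the dimensions and the seed -- it can
    # be generated once and reused for any dataset with `orig_dim` features.
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(reduced_dim)
    return [[rng.gauss(0, 1) * scale for _ in range(reduced_dim)]
            for _ in range(orig_dim)]

def project(x, R):
    # Map a vector x of length orig_dim to reduced_dim via x @ R.
    return [sum(x[i] * R[i][j] for i in range(len(x)))
            for j in range(len(R[0]))]

# Reusability: the same (dims, seed) pair always yields the same matrix,
# so two datasets with identical dimensionality share one projection.
R = random_projection_matrix(4, 2, seed=42)
x = [1.0, 2.0, 3.0, 4.0]
y = project(x, R)
```

By the Johnson-Lindenstrauss lemma, such a projection approximately preserves pairwise Euclidean distances with high probability when the reduced dimension is large enough, which is why distance-based classifiers such as k-NN remain accurate on the projected data.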
Ravindran, S., Aghila, G. A Data-Independent Reusable Projection (DIRP) Technique for Dimension Reduction in Big Data Classification Using k-Nearest Neighbor (k-NN). Natl. Acad. Sci. Lett. 43, 13–21 (2020). https://doi.org/10.1007/s40009-018-0771-6