Abstract
Because high-dimensional data degrade the performance of the k-Nearest Neighbor (k-NN) classification algorithm, dimension reduction for big data classification with k-NN has drawn considerable attention from both industry and academia. Popular dimension reduction techniques such as PCA, LDA, and SVD are data-dependent: the projection matrix is generated from the input data, which makes reusing the projection matrix impractical. In this paper, a Data-Independent Reusable Projection (DIRP) technique is proposed to project high-dimensional data into a low-dimensional space, and it is shown how the projection matrix can be reused for any dataset with the same number of dimensions. The proposed DIRP method approximately preserves the distance between any two points in the dataset, which makes it well suited to distance-based classification algorithms such as k-NN. The DIRP method has been implemented in R, and a new package, "RandPro", for generating projection matrices has been developed and tested on the CIFAR-10, handwritten digit recognition (MNIST), and human activity recognition datasets. Two versions of the RandPro package have been uploaded to the CRAN repository. The running times and classification metrics of PCA and the DIRP method are compared. The results show that the running time of the proposed method is reduced significantly while accuracy remains nearly equivalent to that obtained on the original data.
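The core idea behind a data-independent projection is that the projection matrix is drawn from a random distribution rather than computed from the data, so it can be generated once and applied to any dataset with the same number of features. The following is a minimal illustrative sketch of a Gaussian random projection in this spirit (shown in Python for brevity; the function names `random_projection_matrix` and `project` are ours, not the RandPro API):

```python
import math
import random

def random_projection_matrix(orig_dim, reduced_dim, seed=0):
    # Data-independent: entries are i.i.d. Gaussian, scaled by 1/sqrt(k),
    # so the matrix depends only on the dimensions and the seed -- it can
    # be generated once and reused for any dataset with `orig_dim` features.
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(reduced_dim)
    return [[rng.gauss(0, 1) * scale for _ in range(reduced_dim)]
            for _ in range(orig_dim)]

def project(x, R):
    # Map a vector x of length orig_dim to reduced_dim via x @ R.
    return [sum(x[i] * R[i][j] for i in range(len(x)))
            for j in range(len(R[0]))]

# Reusability: the same (dims, seed) pair always yields the same matrix,
# so two datasets with identical dimensionality share one projection.
R = random_projection_matrix(4, 2, seed=42)
x = [1.0, 2.0, 3.0, 4.0]
y = project(x, R)
```

By the Johnson-Lindenstrauss lemma, such a projection approximately preserves pairwise Euclidean distances with high probability when the reduced dimension is large enough, which is why distance-based classifiers such as k-NN remain accurate on the projected data.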
Ravindran, S., Aghila, G. A Data-Independent Reusable Projection (DIRP) Technique for Dimension Reduction in Big Data Classification Using k-Nearest Neighbor (k-NN). Natl. Acad. Sci. Lett. 43, 13–21 (2020). https://doi.org/10.1007/s40009-018-0771-6