A Data-Independent Reusable Projection (DIRP) Technique for Dimension Reduction in Big Data Classification Using k-Nearest Neighbor (k-NN)

  • Short Communication
  • Published in National Academy Science Letters

Abstract

Because high-dimensional data degrade the performance of the k-Nearest Neighbor (k-NN) classification algorithm, dimension reduction for big data classification with k-NN has drawn considerable attention from both industry and academia. Popular dimension reduction techniques such as PCA, LDA, and SVD are data-dependent: the projection matrix is generated from the input data itself, which makes reusing the projection matrix impractical. In this paper, a Data-Independent Reusable Projection (DIRP) technique is proposed that projects high-dimensional data into a low-dimensional space, and it is shown that the projection matrix can be reused for any dataset with the same number of dimensions. The proposed DIRP method approximately preserves the distance between any two points in the dataset, which makes it well suited to distance-based classification algorithms such as k-NN. The DIRP method has been implemented in R: a new package, “RandPro”, for generating projection matrices has been developed and tested on the CIFAR-10, handwritten digit recognition (MNIST), and human activity recognition datasets. Two versions of the RandPro package have been published in the CRAN repository. The running time and classification metrics of the DIRP method have been compared against PCA. The results show that the proposed method reduces running time significantly while achieving accuracy nearly equivalent to that obtained on the original data.
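To illustrate the core idea, the sketch below shows a data-independent projection in base R. It assumes a Gaussian random projection matrix in the style of the Johnson–Lindenstrauss family of methods, which is one standard way to obtain the distance-preservation property described above (each pairwise distance is distorted by at most a factor of roughly 1 ± ε for a suitable reduced dimension); the paper's exact DIRP construction may differ, and the dimensions, variable names, and placeholder data here are illustrative only.

# Minimal sketch of a data-independent, reusable projection,
# assuming a Gaussian random projection matrix (Johnson-Lindenstrauss
# style); the paper's exact DIRP construction may differ.
set.seed(42)

d <- 784   # original dimensionality (e.g., flattened 28 x 28 MNIST images)
m <- 50    # reduced dimensionality

# The projection matrix depends only on (d, m), never on the input data,
# so it can be generated once and reused for any dataset with d columns.
proj <- matrix(rnorm(d * m, mean = 0, sd = 1 / sqrt(m)), nrow = d, ncol = m)

# Placeholder data standing in for a real train/test split.
train_x <- matrix(runif(1000 * d), ncol = d)
test_x  <- matrix(runif(200 * d), ncol = d)
train_y <- factor(sample(0:9, 1000, replace = TRUE))

# Project both sets with the same matrix; pairwise distances are
# approximately preserved, so k-NN can classify in the m-dimensional
# space at much lower cost.
train_lo <- train_x %*% proj
test_lo  <- test_x %*% proj

library(class)
pred <- knn(train = train_lo, test = test_lo, cl = train_y, k = 5)

The same matrix proj could be applied unchanged to any other dataset with 784 columns, which is the reusability property the abstract emphasizes. The published RandPro package wraps this kind of workflow; its CRAN documentation describes the actual API.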

Author information

Corresponding author

Correspondence to Siddharth Ravindran.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Ravindran, S., Aghila, G. A Data-Independent Reusable Projection (DIRP) Technique for Dimension Reduction in Big Data Classification Using k-Nearest Neighbor (k-NN). Natl. Acad. Sci. Lett. 43, 13–21 (2020). https://doi.org/10.1007/s40009-018-0771-6

