
A Data-Independent Reusable Projection (DIRP) Technique for Dimension Reduction in Big Data Classification Using k-Nearest Neighbor (k-NN)

  • Siddharth Ravindran
  • G Aghila
Short Communication

Abstract

Because high-dimensional data degrade the performance of the k-Nearest Neighbor (k-NN) classification algorithm, dimension reduction for big data classification with k-NN has drawn considerable attention from both industry and academia. Popular dimension reduction techniques such as PCA, LDA and SVD are data-dependent methods: the projection matrix is generated from the input data, which makes reusing the projection matrix impractical. In this paper, a Data-Independent Reusable Projection (DIRP) technique is proposed to project high-dimensional data into a low-dimensional space, and it is shown that the projection matrix can be reused for any dataset with the same number of dimensions. The proposed DIRP method approximately preserves the distance between any two points in the dataset, which suits distance-based classification algorithms such as k-NN. The DIRP method has been implemented in R, and a new package, "RandPro", for generating the projection matrix has been developed and tested on the CIFAR-10, handwritten digit recognition (MNIST) and human activity recognition datasets. Two versions of the RandPro package have been uploaded to the CRAN repository. The running time and classification metrics of PCA and the DIRP method have been compared. The results show that the running time of the proposed method is reduced significantly while the classification accuracy remains close to that obtained on the original data.
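To illustrate the idea behind a data-independent projection, the R sketch below builds a random projection matrix without looking at the data and uses it ahead of a k-NN classifier. By the Johnson-Lindenstrauss lemma, such a projection preserves pairwise distances up to a factor of (1 ± ε) with high probability, which is why distance-based classifiers still work in the reduced space. This is a minimal, generic sketch and does not reproduce the RandPro package API; the data sizes, ε = 0.3 and k = 5 are illustrative assumptions.

library(class)   # provides knn()

set.seed(42)

# Illustrative synthetic data: 1,000 training and 200 test points in
# d = 784 dimensions (e.g. MNIST-sized vectors), two classes.
d    <- 784
x_tr <- matrix(rnorm(1000 * d), nrow = 1000)
x_te <- matrix(rnorm(200 * d), nrow = 200)
y_tr <- factor(sample(0:1, 1000, replace = TRUE))

# Johnson-Lindenstrauss bound on the target dimension for n points and
# distortion eps: k >= 4 * log(n) / (eps^2 / 2 - eps^3 / 3).
jl_dim <- function(n, eps = 0.3) ceiling(4 * log(n) / (eps^2 / 2 - eps^3 / 3))
k_dim  <- jl_dim(nrow(x_tr) + nrow(x_te))

# Data-independent projection matrix: Gaussian entries scaled by
# 1/sqrt(k_dim). It never depends on the input data, so the same matrix
# can be stored and reused for any dataset with original dimension d.
R_mat <- matrix(rnorm(d * k_dim), nrow = d) / sqrt(k_dim)

# Project training and test data with the same matrix, then run k-NN
# in the reduced space.
pred <- knn(train = x_tr %*% R_mat,
            test  = x_te %*% R_mat,
            cl    = y_tr,
            k     = 5)
head(pred)

On real data, the projected k-NN run replaces a distance computation in d = 784 dimensions with one in roughly k_dim dimensions, which is the source of the running-time reduction reported in the abstract.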

Keywords

Dimension reduction · Random projection · k-NN classification · Big data


References

  1. Philip Chen CL, Zhang CY (2014) Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inf Sci 275:314–347. https://doi.org/10.1016/j.ins.2014.01.015
  2. Laney D (2001) 3D data management: controlling data volume, velocity, and variety. Tech. rep., META Group
  3. Tsai CW, Lai CF, Chao HC, Vasilakos AV (2015) Big data analytics: a survey. J Big Data 2(1):21. https://doi.org/10.1186/s40537-015-0030-3
  4. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27:1–27:27. https://doi.org/10.1145/1961189.1961199
  5. Mehta K, Tyagi G, Rao A, Kumar P, Chauhan DS (2017) Modified locally linear embedding with affine transformation. Natl Acad Sci Lett 40(3):189–196. https://doi.org/10.1007/s40009-017-0536-7
  6. Li J, Liu H (2017) Challenges of feature selection for big data analytics. IEEE Intell Syst 32(2):9–15. https://doi.org/10.1109/MIS.2017.38
  7. Haris BC, Sinha R (2014) Exploring data-independent dimensionality reduction in sparse representation-based speaker identification. Circuits Syst Signal Process 33(8):2521–2538. https://doi.org/10.1007/s00034-014-9757-x
  8. Guyon I, Gunn S, Nikravesh M, Zadeh LA (2006) Feature extraction: foundations and applications (studies in fuzziness and soft computing). Springer, New York
  9. Pearson K (1901) LIII. On lines and planes of closest fit to systems of points in space. Philos Mag 2(11):559–572. https://doi.org/10.1080/14786440109462720
  10. Martínez AM, Kak AC (2001) PCA versus LDA. IEEE Trans Pattern Anal Mach Intell 23(2):228–233. https://doi.org/10.1109/34.908974
  11. Parsons S (2005) Independent component analysis: a tutorial introduction by James V. Stone, MIT Press, 193 pp, $35.00, ISBN 0-262-69315-1. Knowl Eng Rev 20(2):198–199. https://doi.org/10.1017/S0269888905220519
  12. Johnson W, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. In: Conference in modern analysis and probability (New Haven, Conn., 1982), contemporary mathematics, vol 26, American Mathematical Society, pp 189–206
  13. Kane DM, Nelson J (2014) Sparser Johnson–Lindenstrauss transforms. J ACM 61(1):4:1–4:23. https://doi.org/10.1145/2559902
  14. Dasgupta S, Gupta A (2003) An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct Algorithms 22(1):60–65. https://doi.org/10.1002/rsa.10073
  15. Dasgupta S (2000) Experiments with random projection. In: Proceedings of the 16th conference on uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, UAI '00, pp 143–151
  16. Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD '01, pp 245–250. https://doi.org/10.1145/502512.502546
  17. Paul S, Boutsidis C, Magdon-Ismail M, Drineas P (2014) Random projections for linear support vector machines. ACM Trans Knowl Discov Data 8(4):22:1–22:25. https://doi.org/10.1145/2641760
  18. Wu X, Kumar V (2009) The top ten algorithms in data mining, 1st edn. Chapman & Hall/CRC, London
  19. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27. https://doi.org/10.1109/TIT.1967.1053964
  20. Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-IS: an iterative Spark-based design of the k-nearest neighbors classifier for big data. Knowl Based Syst 117(C):3–15. https://doi.org/10.1016/j.knosys.2016.06.012
  21. Aghila G, Siddharth R (2017) RandPro: random projection. https://CRAN.R-project.org/package=RandPro, R package version 0.1.0
  22. Spark (2017) Apache Spark—lightning-fast cluster computing. http://spark.apache.org/. Accessed 10 Oct 2017
  23. Achlioptas D (2003) Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J Comput Syst Sci 66(4):671–687. https://doi.org/10.1016/S0022-0000(03)00025-4
  24. Li P, Hastie TJ, Church KW (2006) Very sparse random projections. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD '06, pp 287–296. https://doi.org/10.1145/1150402.1150436
  25. Anguita D, Ghio A, Oneto L, Parra X, Reyes-Ortiz JL (2013) A public domain dataset for human activity recognition using smartphones. In: 21st European symposium on artificial neural networks, computational intelligence and machine learning, ESANN, Bruges, Belgium
  26. LeCun Y, Cortes C (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
  27. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech. rep.
  28. R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Copyright information

© The National Academy of Sciences, India 2019

Authors and Affiliations

  1. Department of Computer Science and Engineering, National Institute of Technology Puducherry, Karaikal, India
