Proposition of a Parallel and Distributed Algorithm for the Dimensionality Reduction with Apache Spark

  • Abdelali Zbakh
  • Zoubida Alaoui Mdaghri
  • Mourad El Yadari
  • Abdelillah Benyoussef
  • Abdellah El Kenz
Conference paper
Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 37)

Abstract

In recent years, data storage and processing have undergone a radical evolution, driven by the large volume of data generated every minute. As a result, traditional tools and algorithms can no longer keep pace with this exponential growth or yield results within a reasonable time. One solution to this problem is distributed data storage combined with parallel processing. In our work, we use the distributed platform Spark and a massive data set in the form of a hyperspectral image. Processing a hyperspectral image, for example for visualization or feature extraction, has to deal with the high dimensionality of the image. Several dimensionality reduction techniques exist in the literature. In this paper, we propose a distributed and parallel version of Principal Component Analysis (PCA).
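The abstract does not describe the algorithm itself, so the following is only a minimal sketch of one common way to distribute PCA in a map-reduce style. It is not the authors' method. Plain Python lists stand in for Spark partitions, and all function names (`map_partition`, `combine`, `covariance`, `top_component`) are hypothetical. Each partition is summarized independently, the summaries are added together, and the small covariance matrix is then decomposed on a single node.

```python
from functools import reduce

def map_partition(rows):
    """Map step: summarize one partition as (count, column sums, sum of outer products)."""
    d = len(rows[0])
    n = 0
    s = [0.0] * d
    ss = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for i in range(d):
            s[i] += x[i]
            for j in range(d):
                ss[i][j] += x[i] * x[j]
    return n, s, ss

def combine(a, b):
    """Reduce step: partial summaries add element-wise, in any order."""
    n = a[0] + b[0]
    s = [u + v for u, v in zip(a[1], b[1])]
    ss = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, s, ss

def covariance(summary):
    """Sample covariance matrix recovered from the merged summary."""
    n, s, ss = summary
    d = len(s)
    mean = [v / n for v in s]
    return [[(ss[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=200):
    """Leading eigenvector by power iteration.

    Assumes the all-ones start vector is not orthogonal to that eigenvector.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v

# Toy data: four 2-band "pixels" on the line y = 2x, split into two partitions.
partitions = [[[1.0, 2.0], [2.0, 4.0]], [[3.0, 6.0], [4.0, 8.0]]]
summary = reduce(combine, map(map_partition, partitions))
cov = covariance(summary)
pc = top_component(cov)
```

Because the per-partition summaries only need to be added, the map step can run on each worker without communication. The cost of the final step depends on the number of spectral bands rather than on the number of pixels, which is why this pattern is often a reasonable fit when rows vastly outnumber columns.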

Keywords

Distributed PCA, Big Data, Spark platform, Map-Reduce, Dimension reduction, Hyperspectral data

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Faculty of Sciences, University Mohammed V, Rabat, Morocco
  2. University Moulay Ismail, Meknes, Morocco
