Sparkmach: A Distributed Data Processing System Based on Automated Machine Learning for Big Data

  • Gusseppe Bravo-Rocca
  • Piero Torres-Robatty
  • Jose Fiestas-Iquira
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 898)

Abstract

This work proposes a semi-automated analysis and modeling package for machine learning problems. The library's goal is to reduce the number of steps involved in a traditional data science workflow. To do so, Sparkmach uses machine learning techniques to build base models for both classification and regression problems. Building these models covers exploratory data analysis, data preprocessing, feature engineering, and modeling.

The project is based on Pymach, a similar library that handles those steps for small and medium-sized datasets (about ten million rows and a few columns). Sparkmach's central task is to scale Pymach to large datasets by using Apache Spark, a distributed engine for large-scale data processing that tackles several data science problems in a cluster environment. Although the software is designed for distributed use, Sparkmach can also be used in local environments, where it still benefits from the distributed processing tools.

Keywords

Semi-automated machine learning; Data Science; Data mining; Statistics; Data engineering; Big data

Notes

Acknowledgments

The project would have been impossible without the support of Ciencia Activa and Fondo para la Innovación, la Ciencia y la Tecnología - Innovation, Science and Technology Fund (FINCyT).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Universidad Nacional de Ingeniería, Rimac, Lima, Peru
