Lightweight Multi-language Bindings for Apache Spark

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9833)

Abstract

Apache Spark has emerged as one of the most prominent frameworks for distributed high-performance data analysis. Among Spark’s most appealing features are its bindings for dynamic languages such as Python and R. Despite their great flexibility, such languages often cannot match the performance of statically typed languages such as Java or Scala. However, this limitation is not only due to the intrinsic nature of dynamically typed languages. To a large extent, the performance gap is caused by the way the language runtimes interact with Spark. In this paper, we describe a new approach to integrating Python and R into data-intensive Spark applications. Our approach significantly reduces the performance gap between such languages and their statically typed counterparts, making dynamic languages an attractive alternative for the implementation of big-data applications.

Keywords

Java Virtual Machine; Abstract Syntax Tree; Python Language; Task Runner; Dynamic Language

Notes

Acknowledgments

Our research has been supported by Oracle (ERO project 1332) and by the Swiss National Science Foundation (project 200021 153560). We thank the VM Research Group at Oracle for their support. Oracle, Java, and HotSpot are trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Faculty of Informatics, Università della Svizzera italiana (USI), Lugano, Switzerland
  2. Oracle Labs, VM Research Group, Lugano, Switzerland
