The R Language: A Powerful Tool for Taming Big Data
The R language (R Core Team 2017; Chambers 2008; Matloff 2011) is currently the most popular tool in the general data science field. It features outstanding graphics capabilities and a rich set of more than 10,000 library packages to draw upon, and its interfaces to SQL databases and to C/C++ are first rate. (Other notable languages in data science are Python and Julia. Python is popular among those trained in computer science; Julia, a newer language, makes producing fast code its top priority.) All of this, along with recent developments regarding memory issues, makes R well poised to be a highly effective tool in Big Data applications. This chapter presents the use of R in Big Data settings.
The term Big Data is used here in two senses:
Big-n: A large number of data points.
Big-p: A large number of variables/features.
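To make the two senses concrete, here is a minimal sketch in base R, assuming the usual layout of one row per data point and one column per variable. The matrices `tall` and `wide` and the helper `shape()` are hypothetical names introduced only for this illustration, and the sizes are arbitrary toy values rather than anything taken from the chapter.

```r
# Hypothetical toy data; sizes are illustrative only.
set.seed(1)

# Big-n shape: many data points (rows), few variables (columns)
tall <- matrix(rnorm(100000 * 5), nrow = 100000, ncol = 5)

# Big-p shape: few data points, many variables
wide <- matrix(rnorm(50 * 2000), nrow = 50, ncol = 2000)

# Hypothetical helper: report n (rows) and p (columns) of a matrix or data frame
shape <- function(x) c(n = nrow(x), p = ncol(x))

shape(tall)
shape(wide)
```

In this layout, n is simply the row count and p the column count, so a dataset can be large in either sense or in both.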
Both senses will come into play later. For now, though,...
- Breshears C (2009) The art of concurrency: a thread monkey’s guide to writing parallel applications. O’Reilly Media, Sebastopol
- Bühlmann P, Drineas P, Kane M, van der Laan M (2016) Handbook of big data. Chapman & Hall/CRC handbooks of modern statistical methods. CRC Press, Boca Raton
- Chambers J (2008) Software for data analysis: programming with R. Statistics and computing. Springer, New York
- Chang W (2013) R graphics cookbook. O’Reilly and associate series. O’Reilly Media, Sebastopol, CA
- Dowle M (2017) Data analysis using data.table. https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
- Eddelbuettel D (2013) Seamless R and C++ integration with Rcpp. Use R! Springer, New York
- Inselberg A (2009) Parallel coordinates: visual multidimensional geometry and its applications. Springer, New York
- Kane MJ, Emerson J, Weston S (2013) Scalable strategies for computing with massive data. J Stat Softw 55(14):1–19
- Lichman M (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
- Luraschi J, Ushey K, Allaire J (2017) Sparklyr: R interface to Apache Spark. https://CRAN.R-project.org/package=sparklyr
- Matloff N (2011) The art of R programming: a tour of statistical software design. No starch press series. No Starch Press, San Francisco
- Matloff N (2015) Parallel computing for data science: with examples in R, C++ and CUDA. Chapman & Hall/CRC the R series. CRC Press, Boca Raton
- Matloff N (2016) Software Alchemy: turning complex statistical computations into embarrassingly-parallel ones. J Stat Softw 71(4):1–15
- Matloff N, Fitzgerald C, Davis R, Yancey R, Huang S (2017a) partools: tools for the ‘Parallel’ package. https://github.com/matloff/partools
- Matloff N, Yang V, Nguyen H (2017b) cdparcoord: top frequency-based parallel coordinates. https://CRAN.R-project.org/package=cdparcoord
- Murrell P (2011) R graphics, 2nd edn. Chapman & Hall/CRC the R series. Taylor & Francis, Boca Raton, FL
- Nielsen F (2016) Introduction to HPC with MPI for data science. Undergraduate topics in computer science. Springer International Publishing, Cham
- Plotly Technologies Inc (2015) Collaborative data science. https://plot.ly
- R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
- Reinders J (2007) Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O’Reilly series. O’Reilly Media, Sebastopol
- Sarkar D (2008) Lattice: multivariate data visualization with R. Use R! Springer, New York
- Unwin A, Theus M, Hofmann H (2007) Graphics of large datasets: visualizing a million. Statistics and computing. Springer, New York
- Weston S (2017) foreach: provides foreach looping construct for R. https://CRAN.R-project.org/package=foreach
- Wickham H (2016) ggplot2: elegant graphics for data analysis. Use R! Springer International Publishing, New York
- Yang V, Nguyen H, Matloff N, Xie Y (2017) Top-frequency parallel coordinates plots. arXiv:1709.00665
- Yu H (2014) [Rmpi] news. http://www.stats.uwo.ca/faculty/yu/Rmpi/