Abstract
From Business Intelligence to advanced statistics applications, professionals are expected to access and manipulate large datasets, and R is well suited to the task. In this introductory chapter, we explain the principles of programming and the position of R in data science today. A beginner-level course on R then introduces the main data types of the language. Examples and exercises provide hands-on training, so that users gain control and understanding of R's capabilities. Next, two general-purpose programming tools are introduced: control structures and functions. These allow us to manipulate our datasets and derive values and conclusions from them. In addition, this chapter covers specific R operators that greatly simplify the use of R and enhance its capabilities.
Notes
- 1.
This algorithm is usually called the long division method in US schools and many other places.
- 2.
Assembly languages are often abbreviated asm.
- 3.
Here, by machine we mean both the hardware (the architecture of the computer) and the software (the operating system).
- 4.
FORTRAN is an acronym for FORmula TRANslation.
- 5.
The latest FORTRAN version, known as FORTRAN 2018, was released on November 28, 2018; see https://wg5-fortran.org/f2018.html.
- 6.
Message by Peter Dalgaard announcing the release of the first beta version: https://stat.ethz.ch/pipermail/r-announce/2000/000127.html.
- 7.
GNU is a recursive acronym for “GNU’s Not Unix.”
- 8.
- 9.
Upon completion of this book, the latest stable version is R 3.6.2, called Dark and Stormy Night, released on December 12, 2019.
- 10.
- 11.
Visit https://stat.ethz.ch/pipermail/r-announce/1997/000001.html for the announcement by Kurt Hornik of the opening of the CRAN site.
- 12.
- 13.
To find the exact number of available packages at a given moment, type nrow(available.packages()) in the R console.
- 14.
- 15.
- 16.
- 17.
- 18.
Computational costs are defined as the amount of time and memory needed to run an algorithm.
- 19.
Upon completion of this book, the CRAN package repository features 15,368 available packages, covering many possible extensions of the R core library.
- 20.
- 21.
- 22.
https://julialang.org/blog/2012/02/why-we-created-julia © 2020 JuliaLang.org contributors.
- 23.
- 24.
- 25.
Announcement of the RStudio release on February 28, 2011: https://blog.rstudio.com/2011/02/28/rstudio-new-open-source-ide-for-r/. Upon completion of this book, the latest stable version is RStudio 1.2.5033, released on December 3, 2019.
- 26.
For example, Oracle, ODBC, Spark, and many others.
- 27.
A very useful shortcut is Control+Enter on a PC, or Command+Enter on a Mac, which runs the current line of code.
- 28.
Have a look at the menu Code in RStudio for different run options.
- 29.
The package ggplot2 will be called in Sect. 4.2 to create elaborate plots in R.
- 30.
When calling library(), quotation marks around the package name are not needed.
- 31.
- 32.
We can access this list at https://cran.r-project.org/web/views/.
- 33.
- 34.
- 35.
Learn how to ask good questions by reading https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and https://www.r-project.org/posting-guide.html.
- 36.
The acronym RAM stands for random-access memory.
- 37.
The same result is achieved by typing assign("x", 4).
- 38.
A Boolean is a data type whose only possible values are TRUE and FALSE; a Boolean expression evaluates to one of the two. It is named after the mathematician George Boole.
- 39.
For simplicity, logical values can also be written as T or F. We will use the full word and the initial letter interchangeably throughout the book.
- 40.
In Sect. 5.1 we will see how to use the argument na.rm to remove these NAs when performing calculations over vectors that contain them.
- 41.
The conversion between degrees Celsius (°C) and degrees Fahrenheit (°F) is °F = 1.8 × °C + 32. To go from Celsius to Kelvin we just shift the zero of the scale by 273.15, that is, K = °C + 273.15.
- 42.
See Sect. 5.1.1 for an explanation of the arithmetic mean and other statistical measures.
- 43.
summary() is one of the most robust and powerful commands in R. Almost any kind of structure can be passed as an argument to this command, and it will usually provide plenty of information.
- 44.
Any set of values can be ordered, alphabetically for example, but a nominal scale has no meaningful order that reflects anything intrinsic to the nature of the variable.
- 45.
When the combination command c() receives data of different types, all of them are stored in the most general type that can represent every element in the structure.
- 46.
Unlike with matrices, if the columns to be included in the data frame do not have the same length, the function returns an error instead of creating a data frame with the gaps filled in.
- 47.
In Spain and other countries, two family names are used, preserving the last names of both the father and the mother.
- 48.
When applying as.data.frame(), unless otherwise specified, the default names of the variables are V1, V2, etc., meaning variable 1, variable 2, etc.
- 49.
Some R packages are specially designed for dealing with datasets, such as tibble and data.table; we will explore the latter in Chap. 3.
- 50.
Technically speaking, when using 1:10, R is internally doing a loop, so the previous code could be simplified to print(1:10), but it is valid as a first and easy example.
- 51.
Note that, in the example, the logical expression 3 != 3 evaluates to FALSE, whereas the test of whether it is a logical expression returns TRUE. Try the command is.logical("Hello") to see the difference.
- 52.
Observe that f(1) is undefined, because we are dividing by zero. Despite this, R outputs Inf, recovering the limit of f when x approaches 0.
- 53.
A richer function is already implemented in the R base library under the name mat.or.vec().
- 54.
The computational advantages and disadvantages of using return() or omitting it are beyond the scope of this book.
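Several of the notes above (37, 39, 40, 41, 45, 46, 48, 51, 52, 53 and 54) describe R behaviour that is easy to check in the console. The following is a minimal sketch, not the chapter's own code: all variable and function names, and the sample temperature readings, are assumptions made here for illustration. Since the function f of note 52 is not shown on this page, the sketch uses a plain division by zero instead.

```r
# Note 37: two equivalent ways of binding the value 4 to a name.
x <- 4
assign("y", 4)   # y is a second name, used here only for comparison
same_assignment <- identical(x, y)   # TRUE

# Note 39: T and F are shorthand for TRUE and FALSE.
shorthand_ok <- identical(T, TRUE) && identical(F, FALSE)   # TRUE

# Note 40: na.rm = TRUE drops missing values before the calculation.
readings_c <- c(21.5, NA, 18.0, 25.5)   # made-up Celsius readings
mean_with_na <- mean(readings_c)                   # NA
mean_without_na <- mean(readings_c, na.rm = TRUE)  # mean of the 3 observed values

# Note 41: temperature conversions (illustrative function names).
celsius_to_fahrenheit <- function(celsius) 1.8 * celsius + 32
celsius_to_kelvin <- function(celsius) celsius + 273.15
boiling_f <- celsius_to_fahrenheit(100)   # 212
freezing_k <- celsius_to_kelvin(0)        # 273.15

# Note 45: c() coerces mixed input to the most general type present.
class_num_lgl <- class(c(1, TRUE))        # "numeric"
class_with_chr <- class(c(1, TRUE, "a"))  # "character"

# Note 46: columns of lengths 3 and 2 cannot form a data frame.
df_attempt <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)  # keep the error message as text
)

# Note 48: default variable names after as.data.frame().
default_names <- names(as.data.frame(matrix(1:6, nrow = 2)))  # "V1" "V2" "V3"

# Note 51: the value of an expression versus its type.
value_of_test <- 3 != 3              # FALSE
test_is_logical <- is.logical(3 != 3)     # TRUE
string_is_logical <- is.logical("Hello")  # FALSE

# Note 52: dividing a positive number by zero yields Inf rather than an error.
one_over_zero <- 1 / 0   # Inf

# Note 53: mat.or.vec() builds a zero-filled matrix, or a vector when nc is 1.
zero_matrix <- mat.or.vec(2, 3)   # 2 x 3 matrix of zeros
zero_vector <- mat.or.vec(3, 1)   # numeric vector of three zeros

# Note 54: a function returns its last evaluated expression,
# so return() is optional here.
add_explicit <- function(a, b) return(a + b)
add_implicit <- function(a, b) a + b
same_result <- identical(add_explicit(2, 3), add_implicit(2, 3))  # TRUE
```

One caveat on note 39: TRUE and FALSE are reserved words, whereas T and F are ordinary variables that can be reassigned, so the full words are the safer choice in code meant to be reused.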
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Zamora Saiz, A., Quesada González, C., Hurtado Gil, L., Mondéjar Ruiz, D. (2020). Introduction to R. In: An Introduction to Data Analysis in R. Use R!. Springer, Cham. https://doi.org/10.1007/978-3-030-48997-7_2
DOI: https://doi.org/10.1007/978-3-030-48997-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-48996-0
Online ISBN: 978-3-030-48997-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)