This chapter covers how to install R, R Studio, and required packages for replicating examples in this book. This chapter also covers R basics such as objects, data structures, and data types that are fundamental to working in R. In addition, many common functions will be covered in this chapter, and many more will be introduced throughout later chapters.

Getting Started

This section will cover some fundamental concepts and best practices for working in R.

Installing R

R can be compiled and run on a variety of platforms, including UNIX, Windows, and macOS. R can be downloaded here: https://www.r-project.org/.

When installing R, you will need to select a CRAN mirror. The Comprehensive R Archive Network (CRAN) is a network of servers around the world that store identical, current versions of code and documentation for R. You should select the CRAN mirror nearest you to minimize network load.

Installing R Studio

R Studio is an Integrated Development Environment (IDE) for R. This IDE provides a console with syntax editing that is helpful for debugging code as well as tools for plotting, history, and workspace management. Both open source and commercial editions are available, but the open-source option is sufficient for replicating everything in this book.

R Studio can be downloaded here: https://posit.co/download/rstudio-desktop/#download.

Installing Packages

This book will utilize libraries from many R packages, and all are available on CRAN. The line of code below can be executed within either the R console or IDE to install all at once:

A code snippet that installs R packages available on CRAN, such as peopleanalytics, tidyverse, psych, moments, lavaan, factoextra, and others.
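The listing itself is not reproduced in this text. A minimal sketch of such an installation command, limited to the packages named above (the original list is longer), might look like the following. The object names pkgs and missing_pkgs and the check for already-installed packages are illustrative assumptions rather than the book's code.

```r
# Packages named in the text; the book's full list includes others not shown here
pkgs <- c("peopleanalytics", "tidyverse", "psych", "moments", "lavaan", "factoextra")

# Keep only the packages that are not yet installed (illustrative convenience)
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install from CRAN only in an interactive session, so that sourcing this
# script non-interactively never triggers a download
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Passing a character vector to install.packages() installs every listed package in one call.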

Loading Data

To load the data sets for this book from the peopleanalytics package, we need to load the library using the library() function and then load the data using the data() function.

A code snippet that loads the peopleanalytics library and then loads the data set named employees.

To view a list of available data sets, execute data(package = "peopleanalytics").
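As a rough sketch of these steps, assuming the peopleanalytics package is installed and provides a data set named employees as described above, the calls might look like this. The has_pkg guard is an addition for illustration so that the code also runs where the package is absent.

```r
# Check whether the package is available before attaching it (illustrative guard)
has_pkg <- requireNamespace("peopleanalytics", quietly = TRUE)

if (has_pkg) {
  library(peopleanalytics)   # attach the package
  data("employees")          # load the employees data set into the workspace
  print(dim(employees))      # number of rows and columns

  # Names of all data sets shipped with the package
  print(data(package = "peopleanalytics")$results[, "Item"])
}
```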

Files stored locally, or hosted on an online service such as GitHub, can be imported into R using the read.table() function. For example, the following line of code will import a comma-separated values file named employees.csv containing a header record (row with column names) from a specified GitHub directory, and then store the data in an R object named data:

A code snippet that loads data from the GitHub file.
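The GitHub address used in the original listing is not shown here, so the sketch below demonstrates the same read.table() arguments on a small temporary file. The column names, values, and the placeholder URL in the final comment are hypothetical.

```r
# Create a small stand-in CSV so the example is self-contained; the
# column names and values are hypothetical, not the book's data
csv_path <- tempfile(fileext = ".csv")
writeLines(c("employee_id,dept", "1001,Sales", "1002,Research"), csv_path)

# header = TRUE treats the first row as column names;
# sep = "," marks the file as comma-separated
data <- read.table(csv_path, header = TRUE, sep = ",")
data

# read.table() also accepts a URL in place of a local path, for example
# (placeholder address, left as a comment):
# data <- read.table("https://raw.githubusercontent.com/<user>/<repo>/main/employees.csv",
#                    header = TRUE, sep = ",")
```

The read.csv() function is a wrapper around read.table() that sets header = TRUE and sep = "," by default, so it would be an equivalent choice here.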

Case Sensitivity

It is important to note that everything in R is case-sensitive. When working with functions, be sure to match the case when typing the function name. For example, Mean() is not the same as mean(); since mean() is the correct case for the function, capitalized characters will generate an error when executing the function.
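A short sketch of this behavior follows. The tryCatch() wrapper is an addition so that the error message can be inspected without stopping the script.

```r
x <- c(1, 2, 3)

# Correct case: returns the arithmetic mean
mean(x)

# Incorrect case: base R has no function named Mean, so the call fails.
# tryCatch() captures the error message instead of halting execution.
err <- tryCatch(Mean(x), error = function(e) conditionMessage(e))
err
```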

Help

Documentation for functions and data is available via the ? command or help() function. For example, ?sentiment or help(sentiment) will display the available documentation for the sentiment data set, as shown in Fig. 1.

Fig. 1 R documentation for the sentiment data set (the help page shows the description, usage, format, and examples)

Objects

Objects underpin just about everything we do in R. An object is a container for various types of data. Objects can take many forms, ranging from simple objects holding a single number or character to complex structures that support advanced visualizations. The assignment operator <- is used to assign values to an object, though = serves the same function.

Let us use the assignment operator to assign a number and character to separate objects. Note that non-numeric values must be enveloped in either single ticks '' or double quotes "":

A code snippet that reads obj_1 <- 1 and then obj_1.

## [1] 1

A code snippet that reads obj_2 <- "a" and then obj_2.

## [1] "a"

Several functions are available for returning the type of data in an object, such as typeof() and class():

A code snippet that reads typeof(obj_2).

## [1] "character"

A code snippet that reads class(obj_2).

## [1] "character"
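For a character object, typeof() and class() agree, but the two functions can differ for numbers. The sketch below reuses the object names obj_1 and obj_2 from this section and adds a hypothetical obj_3 to show an integer literal.

```r
obj_1 <- 1      # numeric value
obj_2 <- "a"    # character value
obj_3 <- 1L     # the L suffix creates an integer (obj_3 is illustrative)

typeof(obj_1)   # "double": the internal storage type
class(obj_1)    # "numeric": the class used for method dispatch

typeof(obj_2)   # "character"
class(obj_2)    # "character"

typeof(obj_3)   # "integer"
class(obj_3)    # "integer"
```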

Comments

The # symbol is used for commenting/annotating code. Everything on a line that follows # is treated as commentary rather than code to execute. Commenting is a best practice that makes it quicker and easier to decipher the role of each line or block of code. Without comments, troubleshooting large scripts can be a more challenging task.

A code snippet that assigns a new number to the object named obj_1 and displays the value in obj_1.

## [1] 2

A code snippet that assigns a new character to the object named obj_2 and displays the value in obj_2.

## [1] "b"

In-line code comments can also be added where needed to reduce the number of lines in a script:

A code snippet that assigns a new number to the object named obj_1 and displays the value in obj_1.

## [1] 3

A code snippet that assigns a new character to the object named obj_2 and displays the value in obj_2.

## [1] "c"
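The two commenting styles described above can be sketched as follows. The comment wording is illustrative, and the values match the outputs shown.

```r
# Full-line comments: each comment sits on its own line above the code
# Assign a new number to obj_1
obj_1 <- 2
# Display the value stored in obj_1
obj_1

# In-line comments: the comment follows the code on the same line
obj_1 <- 3    # assign a new number to obj_1
obj_1         # display the value stored in obj_1

obj_2 <- "c"  # assign a new character to obj_2
obj_2         # display the value stored in obj_2
```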

Testing Early and Often

A best practice in coding is to run and test your code frequently. Writing many lines of code before testing will make debugging issues far more difficult and time-intensive than it needs to be.

Vectors

A vector is the most basic data object in R. Vectors contain a collection of data elements of the same data type, which we will denote as x1, x2, …, xn in this book, where n is the number of observations or length of the vector. A vector may contain a series of numbers, set of characters, collection of dates, or logical TRUE or FALSE results.

In this example, we introduce the combine function c(), which allows us to fill an object with more than one value:

A code snippet that creates and fills a numeric vector named vect_num with 2, 4, 6, 8, and 10.

## [1]  2  4  6  8 10

A code snippet that creates and fills a character vector named vect_char with a, b, and c.

## [1] "a" "b" "c"

We can use the as.Date() function to convert character strings containing dates to an actual date data type. By default, anything within single ticks or double quotes is treated as a character, so we must make an explicit type conversion from character to date. Remember that R is case-sensitive. Therefore, as.date() is not a valid function; the D must be capitalized.

A code snippet that creates and fills a date vector named vect_dt.

## [1] "2021-01-01" "2022-01-01"

We can use the sequence generation function seq() to fill values between a start and end point using a specified interval. For example, we can fill vect_dt with the first day of each month between 2021-01-01 and 2022-01-01 via the seq() function and its by = "months" argument:

A code snippet that creates and fills a date vector named vect_dt.

##  [1] "2021-01-01" "2021-02-01" "2021-03-01" "2021-04-01" "2021-05-01"
##  [6] "2021-06-01" "2021-07-01" "2021-08-01" "2021-09-01" "2021-10-01"
## [11] "2021-11-01" "2021-12-01" "2022-01-01"

We can also use the : operator to fill integers between a starting and ending number:

A code snippet that creates and fills a numeric vector with values between 1 and 10.

##  [1]  1  2  3  4  5  6  7  8  9 10
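The vector-creation techniques in this section can be sketched together as follows. The vector names follow the descriptions above. Reusing the name vect_num for the 1-to-10 sequence is an assumption based on the outputs that follow.

```r
# c() combines individual values into a vector
vect_num <- c(2, 4, 6, 8, 10)
vect_char <- c("a", "b", "c")

# as.Date() converts character strings to the Date type
vect_dt <- as.Date(c("2021-01-01", "2022-01-01"))
class(vect_dt)

# seq() fills in the first day of each month between the two dates
vect_dt <- seq(as.Date("2021-01-01"), as.Date("2022-01-01"), by = "months")
vect_dt

# The : operator fills integers between a starting and ending number
vect_num <- 1:10
vect_num
```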

We can access a particular element of a vector using its index. An index represents the position in a set of elements. In R, the first element of a vector has an index of 1, and the final element of a vector has an index equal to the vector’s length. The index is specified using square brackets, such as [5] for the fifth element of a vector.

A code snippet that returns the value in position 5 of vect_num.

## [1] 5

When applied to a vector, the length() function returns the number of elements in the vector, and this can be used to dynamically return the last value of a vector.

A code snippet that returns the last element of vect_num.

## [1] 10
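A minimal sketch of indexing, assuming vect_num holds the integers 1 through 10 (which is consistent with the outputs shown):

```r
vect_num <- 1:10

# Square brackets select an element by position; indexing starts at 1
vect_num[5]

# length() gives the position of the final element, so this
# returns the last value regardless of how long the vector is
vect_num[length(vect_num)]
```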

Vectorized Operations

Vectorized operations (or vectorization) underpin mathematical operations in R and greatly simplify computation. For example, if we need to perform a mathematical operation to each data element in a numeric vector, we do not need to specify each and every element explicitly. We can simply apply the operation at the vector level, and the operation will be applied to each of the vector’s individual elements.

A code snippet that creates a numeric vector x and fills it with values between 1 and 10.
A code snippet that adds 2 to each element of x.

##  [1]  3  4  5  6  7  8  9 10 11 12

A code snippet that multiplies each element of x by 2.

##  [1]  2  4  6  8 10 12 14 16 18 20

A code snippet that squares each element of x.

##  [1]   1   4   9  16  25  36  49  64  81 100
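The element-wise arithmetic described above can be sketched as follows, using the vector name x from the descriptions:

```r
x <- 1:10

x + 2   # adds 2 to every element
x * 2   # multiplies every element by 2
x^2     # squares every element
```

No loop is needed in any of these cases because the scalar is applied to each element of x automatically.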

Many built-in arithmetic functions are available and compatible with vectors:

A code snippet that aggregates the sum of x elements.

## [1] 55

A code snippet that finds the count of x elements.

## [1] 10

A code snippet that finds the square root of x elements.

##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
##  [9] 3.000000 3.162278

A code snippet that finds the natural logarithm of x elements.

##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

A code snippet that finds the exponential of x elements.

##  [1]     2.718282     7.389056    20.085537    54.598150   148.413159
##  [6]   403.428793  1096.633158  2980.957987  8103.083928 22026.465795
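A sketch of the built-in functions applied above, again assuming x holds the integers 1 through 10. Note that sum() and length() reduce the vector to a single value, whereas sqrt(), log(), and exp() return one result per element.

```r
x <- 1:10

sum(x)      # total of all elements
length(x)   # number of elements

sqrt(x)     # square root of each element
log(x)      # natural logarithm of each element
exp(x)      # e raised to the power of each element
```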

We can also perform various operations on multiple vectors. Vectorization will result in an implied ordering of elements, in that the specified operation will be applied to the first elements of the vectors and then the second, then third, etc.

A code snippet that creates vectors x1 and x2 and fills them with integers. It also stores the sum of the vectors in a new vector x3.

##  [1] 12 14 16 18 20 22 24 26 28 30

Using vectorization, we can also evaluate whether a specified condition is true or false for each element in a vector:

A code snippet that evaluates whether each element of x is less than 6, and stores results to a logical vector.

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
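The multi-vector and logical operations above can be sketched as follows. The contents of x1 and x2 are an assumption chosen to reproduce the sums shown (12, 14, ..., 30), and is_small is a hypothetical name for the logical result.

```r
# Assumed inputs: other pairs of vectors could produce the same sums
x1 <- 1:10
x2 <- 11:20

# Elements are paired by position: first with first, second with second, etc.
x3 <- x1 + x2
x3

# A comparison is also applied element by element and yields a logical vector
x <- 1:10
is_small <- x < 6
is_small

# Logical vectors can be used to select elements or to count matches
x[is_small]
sum(is_small)
```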

Matrices

A matrix is like a vector in that it represents a collection of data elements of the same data type; however, the elements of a matrix are arranged into a fixed number of rows and columns.

We can create a matrix using the matrix() function. Per ?matrix, the nrow and ncol arguments can be used to organize like data elements into a specified number of rows and columns.

A code snippet that creates and fills a matrix with numbers.

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10

As long as the argument values are in the correct order per the documentation, the argument names are not required. Per ?matrix, the first argument is data, followed by nrow and then ncol. Therefore, we can achieve the same result using the following:

A code snippet that creates and fills the same matrix by passing the integers 1 to 10, then 5, then 2 as unnamed arguments.

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10

Several functions are available to quickly retrieve the number of rows and columns in a rectangular object like a matrix:

A code snippet that returns the number of rows in a matrix.

## [1] 5

A code snippet that returns the number of columns in a matrix.

## [1] 2

A code snippet that returns the number of rows and columns in a matrix.

## [1] 5 2

The head() and tail() functions return the first and last pieces of data, respectively. For large matrices (or other types of objects), this can be helpful for previewing the data:

A code snippet that returns the first five rows of a matrix containing 1000 rows and 10 columns.

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1 1001 2001 3001 4001 5001 6001 7001 8001  9001
## [2,]    2 1002 2002 3002 4002 5002 6002 7002 8002  9002
## [3,]    3 1003 2003 3003 4003 5003 6003 7003 8003  9003
## [4,]    4 1004 2004 3004 4004 5004 6004 7004 8004  9004
## [5,]    5 1005 2005 3005 4005 5005 6005 7005 8005  9005

A code snippet that returns the last five rows of a matrix containing 1000 rows and 10 columns.

##         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [996,]  996 1996 2996 3996 4996 5996 6996 7996 8996  9996
##  [997,]  997 1997 2997 3997 4997 5997 6997 7997 8997  9997
##  [998,]  998 1998 2998 3998 4998 5998 6998 7998 8998  9998
##  [999,]  999 1999 2999 3999 4999 5999 6999 7999 8999  9999
## [1000,] 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
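The matrix operations in this section can be sketched as follows. The object names mat, mat_pos, and big are hypothetical; the values are chosen to match the outputs shown.

```r
# Named arguments: fill a 5 x 2 matrix column by column with 1 to 10
mat <- matrix(data = 1:10, nrow = 5, ncol = 2)
mat

# Positional arguments in the documented order (data, nrow, ncol)
mat_pos <- matrix(1:10, 5, 2)
identical(mat, mat_pos)

# Dimensions
nrow(mat)   # number of rows
ncol(mat)   # number of columns
dim(mat)    # rows and columns together

# Previewing a large matrix with 1000 rows and 10 columns
big <- matrix(1:10000, nrow = 1000, ncol = 10)
head(big, 5)   # first five rows
tail(big, 5)   # last five rows
```

Naming the arguments is more verbose, but it protects against mistakes if the values are supplied in a different order.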

Using vectorization, we can easily multiply every element of a matrix by a scalar. This is element-wise arithmetic; matrix multiplication in the linear-algebra sense uses the %*% operator instead.

A code snippet that creates a 3 by 3 matrix.

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

A code snippet that multiplies each matrix value by 2.

##      [,1] [,2] [,3]
## [1,]    2    8   14
## [2,]    4   10   16
## [3,]    6   12   18
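To make the distinction between element-wise arithmetic and the matrix product concrete, here is a small sketch. The name m is hypothetical; the matrix matches the 3 by 3 example above.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

# Scalar multiplication: every element is multiplied by 2
m * 2

# Element-wise multiplication: each element is multiplied by the
# element in the same position of the other matrix
m * m

# Matrix product: rows of the first matrix are combined with
# columns of the second, as in linear algebra
m %*% m
```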

Factors

A factor is a data structure containing a finite number of categorical values. Each categorical value of a factor is known as a level, and the levels can be either ordered or unordered. This data structure is a requirement for several statistical models we will cover in later chapters.

We can create a factor using the factor() function:

A code snippet that creates and fills a factor with unordered education categories: undergraduate, post-graduate, and graduate.

## [1] undergraduate post-graduate graduate
## Levels: graduate post-graduate undergraduate

Since education has an inherent ordering, we can use the ordered and levels arguments of the factor() function to order the categories:

A code snippet that creates and fills a factor with ordered education categories: undergraduate, graduate, and post-graduate.

## [1] undergraduate post-graduate graduate
## Levels: undergraduate < graduate < post-graduate
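A sketch of both factor definitions follows. The object names education and education_ord are assumptions; the values and level order match the outputs shown.

```r
# Unordered factor: levels default to alphabetical order
education <- factor(c("undergraduate", "post-graduate", "graduate"))
education
levels(education)

# Ordered factor: levels lists the categories from lowest to highest
education_ord <- factor(
  c("undergraduate", "post-graduate", "graduate"),
  levels = c("undergraduate", "graduate", "post-graduate"),
  ordered = TRUE
)
education_ord

# Ordered factors support comparisons between levels
education_ord[1] < education_ord[2]
```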

The ordering of factors is critical to a correct interpretation of statistical model output as we will cover later.

Data Frames

A data frame is like a matrix in that it organizes elements within rows and columns but unlike matrices, data frames can store multiple types of data. Data frames are often the most appropriate data structures for the data required in people analytics.

A data frame can be created using the data.frame() function:

A code snippet that creates three vectors containing integers x, characters y and dates z. It also creates a data frame with 3 columns, vectors x, y and z, and 10 rows.

##     x y          z
## 1   1 a 2021-01-01
## 2   2 b 2021-02-01
## 3   3 c 2021-03-01
## 4   4 d 2021-04-01
## 5   5 e 2021-05-01
## 6   6 f 2021-06-01
## 7   7 g 2021-07-01
## 8   8 h 2021-08-01
## 9   9 i 2021-09-01
## 10 10 j 2021-10-01

The structure of an object can be viewed using the str() function:

A code snippet that describes the structure of df.

## 'data.frame':    10 obs. of  3 variables:
##  $ x: int  1 2 3 4 5 6 7 8 9 10
##  $ y: chr  "a" "b" "c" "d" ...
##  $ z: Date, format: "2021-01-01" "2021-02-01" ...

Specific columns in the data frame can be referenced using the operator $ between the data frame and column names:

A code snippet that returns data in column x in df.

##  [1]  1  2  3  4  5  6  7  8  9 10

Another method that allows for efficient subsetting of rows and/or columns is the subset() function. The example below illustrates how to subset df using conditions on x and y while only displaying z in the output. The logical operator | is used for OR conditions, while & is the logical operator for AND conditions:

A code snippet that returns z values for rows where x is at least 7 or y is a, b, or c.

##             z
## 1  2021-01-01
## 2  2021-02-01
## 3  2021-03-01
## 7  2021-07-01
## 8  2021-08-01
## 9  2021-09-01
## 10 2021-10-01
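The data frame steps in this section can be sketched as follows. How the vectors are built (letters and seq()) and the name res are assumptions; the resulting values match the outputs shown.

```r
# Three vectors of equal length but different types
x <- 1:10
y <- letters[1:10]   # built-in constant holding "a" through "z"
z <- seq(as.Date("2021-01-01"), by = "months", length.out = 10)

# Each vector becomes a column named after the object
df <- data.frame(x, y, z)
df

str(df)   # structure: column names, types, and leading values
df$x      # the x column as a vector

# Rows where x is at least 7 OR y is a, b, or c; only column z is kept
res <- subset(df, x >= 7 | y %in% c("a", "b", "c"), select = z)
res
```

In R versions before 4.0, data.frame() converts character columns to factors by default, so str(df) would report y as a factor unless stringsAsFactors = FALSE is supplied.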

Lists

Lists are versatile objects that can contain elements with different types and lengths. The elements of a list can be vectors, matrices, data frames, functions, or even other lists.

A list can be created using the list() function:

A code snippet that stores vectors x, y, and z as well as df to a list.

## List of 4
##  $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
##  $ : chr [1:10] "a" "b" "c" "d" ...
##  $ : Date[1:10], format: "2021-01-01" "2021-02-01" ...
##  $ :'data.frame':    10 obs. of  3 variables:
##   ..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
##   ..$ y: chr [1:10] "a" "b" "c" "d" ...
##   ..$ z: Date[1:10], format: "2021-01-01" "2021-02-01" ...

Unlike vectors, extracting an element of a list requires double brackets, such as [[3]] for the third element. Single brackets, such as [3], return a list containing that element rather than the element itself.

A code snippet that returns data from the third element of lst.

##  [1] "2021-01-01" "2021-02-01" "2021-03-01" "2021-04-01" "2021-05-01"
##  [6] "2021-06-01" "2021-07-01" "2021-08-01" "2021-09-01" "2021-10-01"
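A sketch of creating a list and extracting from it follows. The objects are rebuilt here so that the example is self-contained; their construction is an assumption consistent with the outputs shown.

```r
x <- 1:10
y <- letters[1:10]
z <- seq(as.Date("2021-01-01"), by = "months", length.out = 10)
df <- data.frame(x, y, z)

# A list can hold vectors of different types alongside a data frame
lst <- list(x, y, z, df)
str(lst)

# Double brackets extract the element itself (here, a Date vector)
lst[[3]]
class(lst[[3]])

# Single brackets return a list of length 1 that contains the element
class(lst[3])
length(lst[3])
```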

Loops

In many cases, the need arises to perform an operation a variable number of times. Loops are available to avoid the cumbersome task of writing the same statement many times. The two most common types of loops are while and for loops.

Let us use a while loop to square integers between 1 and 5:

A code snippet that uses a while loop to square integers between 1 and 5, and print results to the screen.

## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25

With a while loop, we needed to initialize the variable i as well as increment it by 1 within the loop. With a for loop, we can accomplish the same goal with less code:

A code snippet that uses a for loop to square integers between 1 and 5, and print results to the screen.

## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
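The two loops can be sketched as follows. The counter name i comes from the text, while j is used in the for loop here only to keep the two examples separate.

```r
# while loop: the counter is initialized before the loop and
# incremented inside it; the loop stops once i exceeds 5
i <- 1
while (i <= 5) {
  print(i^2)
  i <- i + 1
}

# for loop: the counter takes each value of 1:5 in turn,
# so no manual initialization or increment is needed
for (j in 1:5) {
  print(j^2)
}
```

Forgetting the increment in the while loop would leave the condition permanently true, which is a common source of infinite loops.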

User-Defined Functions (UDFs)

Though many built-in functions are available, R provides the flexibility to create our own.

Functions can be an effective alternative to loops. For example, here is a basic function that achieves the same goal as our while and for loop examples (i.e., squaring integers 1 through 5):

A code snippet that creates a function named square.val that takes one argument x and returns the square of each value in x. It also passes integers 1 through 5 into the new square.val function and displays the results.

## [1]  1  4  9 16 25

While many projects warrant UDFs and/or loops, we do not actually need either to square a set of integers thanks to vectorization. As you gain experience writing R code, you will naturally discover ways to write more performant and terse code:

A code snippet to square a set of integers.

## [1]  1  4  9 16 25
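A sketch of the user-defined function and its vectorized equivalent, using the function name square.val from the description above:

```r
# A function with one argument; the value of the last expression is returned
square.val <- function(x) {
  x^2
}

# Pass the integers 1 through 5 into the function
square.val(1:5)

# The same result without a function or loop, thanks to vectorization
(1:5)^2
```

The parentheses around 1:5 matter: the ^ operator binds more tightly than :, so 1:5^2 would instead produce the sequence from 1 to 25.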

Graphics

While base R has native plotting capabilities, we will use more flexible and sophisticated visualization libraries such as ggplot2 in this book. ggplot2 is one of the most versatile and popular data visualization libraries in R.

We can load the ggplot2 library by executing the following command:

A code snippet to load the ggplot2 library.

When working with functions beyond what is available in base R, entering :: between the library and function names is a best practice for efficient coding. R Studio will provide a menu of available functions within the specified library upon typing library_name::.

The ggplot2 library contains many types of visualizations. For example, we can build a line chart to show how the values of vector x relate to values of vector y in a data frame named data:

A code snippet to create a data frame containing two related columns, combining columns via the cbind() function, and produce a line chart.

Furthermore, we can use ggplot parameters and themes to adjust the aesthetics of visuals:

A code snippet to produce a line chart, to reduce the line thickness and change color to blue, and to remove the default grey background.
A code snippet to center plot title.
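The plotting steps described in this section might be sketched as follows, assuming ggplot2 is installed. The relationship y = x^2, the plot title, and the object names p and p_styled are illustrative assumptions, not the book's code.

```r
library(ggplot2)

# Two related columns; y = x^2 is an assumed relationship for illustration
x <- 1:10
y <- x^2
data <- as.data.frame(cbind(x, y))

# Basic line chart, using :: to make the source of each function explicit
p <- ggplot2::ggplot(data, ggplot2::aes(x = x, y = y)) +
  ggplot2::geom_line()

# Thinner blue line, no grey panel background, and a centered title.
# The linewidth argument requires ggplot2 3.4.0 or later; earlier
# versions use size for line thickness.
p_styled <- ggplot2::ggplot(data, ggplot2::aes(x = x, y = y)) +
  ggplot2::geom_line(linewidth = 0.5, colour = "blue") +
  ggplot2::labs(title = "y as a function of x") +
  ggplot2::theme(
    panel.background = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(hjust = 0.5)
  )

print(p_styled)
```

Layers and theme adjustments are added with +, so a plot can be built up incrementally and stored in an object before it is printed.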

Review Questions

  1. What is the difference between a factor and a character vector?

  2. What is vectorization?

  3. How do data frames differ from matrices?

  4. Does executing the Sum() function achieve the same result as executing sum()?

  5. Does seq(1,10) return the same output as 1:10?

  6. Is the following command sufficient for creating a vector recognized by R as having three dates: dates <- c("2022-01-01", "2022-01-02", "2022-01-03")?

  7. How are while and for loops different?

  8. If vectors x1 and x2 each hold 100 integers, how can we add each element of one to the respective element of the other using a single line of code?

  9. How are slots in a list object referenced?

  10. What are some examples in which a user-defined function (UDF) is needed?