Structured Query Language (SQL) is the most common language used to extract and wrangle data contained in a relational database. SQL is an essential skill for anyone working in analytics.

Basics

There are three main clauses in a SQL query: (a) SELECT, (b) FROM, and (c) WHERE. The SELECT and FROM clauses are required, though the optional WHERE clause is frequently needed.

  • SELECT: Specifies the columns to include in the output

  • FROM: Specifies the table(s) in which the relevant data are contained

  • WHERE: Specifies the rows to search

Despite the clauses being ordered as shown above (SELECT then FROM then WHERE), the FROM clause is the first to execute since we first need to identify the relevant table(s) before filtering rows and selecting columns. The SELECT clause is the last to execute.

Additional clauses are available for grouping and sorting data.

  • GROUP BY: Specifies the columns by which data should be grouped when using aggregate functions

  • HAVING: Specifies conditions for filtering rows based on aggregate functions

  • ORDER BY: Specifies how data should be sorted in the output

When implementing aggregate functions in a SELECT clause, such as counting, summing, or averaging a numeric field, all other non-aggregated fields must be included in the GROUP BY clause.

When working with large data sets, it is best to filter records on the database side to avoid reading superfluous records into an analytics tool such as R only to then filter data to the relevant subset. For example, if we are performing an analysis on employees in the Research & Development department, we should ideally filter to this subset on the database side rather than loading data on the entire workforce and then paring down to the relevant records within R. Fewer records can help enhance the performance of R scripts—especially when R is running on a local machine, such as a laptop, rather than on a more powerful server.

Though it is important to execute SQL queries directly on the database to minimize the amount of data read into R, we will use the sqldf library within R to demonstrate the mechanics of a SQL query for easily replicable examples. The sqldf library allows us to write SQL to query data frames via an embedded database engine (SQLite by default). In a practical setting, we would pass a string containing the SQL query, execute it directly against the database, and then store the query’s results to an object within R. While the syntax of SQL may vary by database, the core structure of queries is universal.

First, let us load the data sets:

A code that starts with the hashtag load library, followed by library open parenthesis people analytics, close parenthesis. It ends with hashtag return row and column counts, followed by d i m, open parenthesis, employees, close parenthesis

## [1] 1470   36

Next, we will apply the sqldf() function to our data frame to extract specific rows and columns. In addition to the SELECT, FROM, and WHERE clauses, we will use the LIMIT clause to limit the number of rows that are displayed given the data frame’s size (n = 1, 470). In a practical setting, the LIMIT clause is only used for efficient data profiling and troubleshooting, as we would not want to arbitrarily truncate a data set used for analysis.

A best practice in writing SQL is to capitalize the names of clauses and functions and to use separate lines and indentation to make the SQL statements more readable:

A code that starts with the hashtag load library, followed by library open parenthesis, s q l d f, close parenthesis. It ends with the hashtag execute Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##    employee_id ## 1         1002 ## 2         1003 ## 3         1004 ## 4         1005 ## 5         1006 ## 6         1007 ## 7         1008 ## 8         1009 ## 9         1010 ## 10        1011

This query returned a list of employee ids for employees in the Research & Development department.

To optimize query performance, it is important to order conditions in the WHERE clause beginning with the condition that will exclude the largest number of records. Conditions are executed sequentially, and each subsequent condition must evaluate all records that remain following any preceding filtering. Limiting the number of records that must be searched when evaluating each condition will reduce the time it takes the query to return results. For example, if two conditions are needed and one excludes 5000 records while the other excludes 10, the condition that excludes 5000 records should be listed first in the WHERE clause.

Aggregate Functions

Next, let us take a look at average organization tenure by job for those in the Research & Development department:

A code that starts with the hashtag store S Q L query as a character string. It ends with hashtag execute Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##                   job_title AVG(org_tenure) ## 1 Healthcare Representative        8.618321 ## 2     Laboratory Technician        5.019305 ## 3                   Manager       13.673077 ## 4    Manufacturing Director        7.600000 ## 5         Research Director       10.937500 ## 6        Research Scientist        5.113014 ## 7            Vice President        9.500000

There are 7 distinct job titles among employees in the Research & Development department, and the average organization tenure for these ranges from 5 to 13.7 years.

Since there could be a small number of employees in certain jobs, in which case average organization tenure may not be as meaningful, we can use the COUNT(*) function to count the number of rows for each group. In this case, COUNT(*) will return the number of employees in each job in the Research & Development department. We can also assign column aliases via AS in the SELECT clause to assign different names to the output fields:

A code that starts with the hashtag store S Q L query as a character string. It ends with hashtag execute Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##                   job_title employee_cnt avg_org_tenure ## 1 Healthcare Representative          131       8.618321 ## 2     Laboratory Technician          259       5.019305 ## 3                   Manager           52      13.673077 ## 4    Manufacturing Director          145       7.600000 ## 5         Research Director           80      10.937500 ## 6        Research Scientist          292       5.113014 ## 7            Vice President            2       9.500000

The output shows that there are only 2 Vice Presidents in the Research & Development department, while other job titles are much more prevalent.

Since relatively few employees are Vice Presidents, let us use the HAVING clause to only show average organization tenure for Research & Development department jobs with more than 10 employees. We can also use the ROUND() function to truncate average organization tenure to one significant digit:

A code that starts with the hashtag store S Q L query as a character string. It ends with hashtag execute Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##                   job_title employee_cnt avg_org_tenure ## 1 Healthcare Representative          131            8.6 ## 2     Laboratory Technician          259            5.0 ## 3                   Manager           52           13.7 ## 4    Manufacturing Director          145            7.6 ## 5         Research Director           80           10.9 ## 6        Research Scientist          292            5.1

Joins

In a practical setting, the required data are rarely contained within a single table. Therefore, we must query multiple tables and join them together.

Figure 1 illustrates how worker, position, and recruiting schemas may be related. For example, a candidate submits a job application to a posted requisition, which is connected to an open position Finance approved as part of the company’s workforce plan; when the selected candidate is hired, they become a worker with one or many events (hire, promotion, transfer, termination) and activities (learning, performance appraisals, surveys) during their tenure with the company.

Fig. 1
Three diagrams of worker schema, position schema, and recruiting schema. Worker schema starts with learning, position schema starts with org and succession, and recruiting schema starts with the job app.

Related tables organized within schemas

Tables are related using a set of keys. Each table needs a Primary Key (PK), which is a unique identifier for each row in the table. A PK may be a single column or multiple columns; a multi-column PK is known as a composite key. It is generally best to leverage non-recyclable system-generated ids for PKs. A Foreign Key (FK) is a column whose values correspond to the values of a PK in another table. Referential integrity is the logical dependency of a FK on a PK, and this is an important concept in the context of relational data structures. Referential integrity constraints protect against orphaned FK values in child tables by deleting PK values from an associated parent table.

Figure 2 shows an Entity Relationship Diagram (ERD) that depicts PK/FK relationships among the Position, Worker, and Requisition tables. Notice that the PK for each table shown in Fig. 1 is listed as a FK in related tables.

Fig. 2
A table titled position has two columns, P K and position underscore I D. It branches into two more tables titled worker and requisition, the former having two columns labeled P K and worker underscore I D and the latter having two columns labeled P K and R E Q underscore I D.

Entity Relationship Diagram (ERD)

With knowledge of the keys required to connect records across tables, there are several methods of joining the tables. Figure 3 illustrates SQL join types using Venn diagrams. Both the join type and keys for related tables need to be specified in the SQL statement. The structure of SQL queries for each method of joining Table A and Table B is represented in the following code blocks:

Fig. 3
7 diagrams of left inclusive, full outer inclusive, right inclusive, left exclusive, full outer exclusive, right exclusive, and finally inner join. All diagrams have two circles A and B overlapping each other, the certain parts shaded in depending on the type of S Q L joins.

Types of SQL joins

LEFT INCLUSIVE

A code starts with the hashtag SELECT, open block bracket, output field list, close block bracket. It is followed by hashtag FROM A, hashtag LEFT OUTER JOIN B. It ends with hashtag ON A.Key equals B.Key.

LEFT EXCLUSIVE

A code starts with hashtag SELECT, open block bracket, output field list, close block bracket. It is followed by hashtag FROM A, hashtag LEFT OUTER JOIN B, hashtag ON A.Key equals B.Key. It ends with WHERE B.Key IS NULL.

FULL OUTER INCLUSIVE

A code starts with hashtag SELECT, open block bracket, output field list, close block bracket. It is followed by the hashtag FROM A, hashtag FULL OUTER JOIN B. It ends with hashtag ON A.Key equals B.Key.

FULL OUTER EXCLUSIVE

A code starts with hashtag SELECT, open block bracket, output field list, close block bracket. It is followed by hashtag FROM A, hashtag FULL OUTER JOIN B, hashtag ON A.Key equals B.Key. It ends with WHERE A.Key IS NULL OR B.Key IS NULL.

RIGHT INCLUSIVE

A code starts with hashtag SELECT, open block bracket, output field list, close block bracket. It is followed by hashtag FROM A, hashtag RIGHT OUTER JOIN B. It ends with hashtag ON A.Key equals B.Key.

RIGHT EXCLUSIVE

A code starts with hashtag SELECT, open block bracket, output field list, close block bracket. It is followed by hashtag FROM A, hashtag LEFT OUTER JOIN B. It ends with hashtag ON A.Key equals B.Key. It ends with WHERE A.Key IS NULL.

INNER JOIN

A code starts with hashtag SELECT, open block bracket, output field list, close block bracket. It is followed by hashtag FROM A, hashtag INNER JOIN B. It ends with hashtag ON A.Key equals B.Key.

To illustrate how SQL joins work, we will leverage three of the data sets used to produce the consolidated employees data set that will be leveraged throughout this book: job, tenure, and demographics. In a people analytics context, employee id is often the PK since this identifier should not be shared by two or more employees—past, present, or future. Email or network id may also be a suitable PK. We will use the employee_id column in each of the three data frames to facilitate joins.

Let us query these data frames to return the average organization tenure and average commute distance for employees in the Research & Development department, grouped by jobs with more than 10 employees. To accomplish this, we will leverage an INNER JOIN, which will return records only for employee ids which are present in all three data frames. For example, if a record exists in demographics and tenure for a particular employee id, but there is no corresponding record in job, that employee id would not be included in the output.

A code that starts with hashtag store S Q L query as a character string. It ends with hashtag executes Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##                   job_title employee_cnt avg_org_tenure avg_commute_dist ## 1 Healthcare Representative          131            8.6              9.8 ## 2     Laboratory Technician          259            5.0              9.4 ## 3                   Manager           52           13.7              7.2 ## 4    Manufacturing Director          145            7.6              9.5 ## 5         Research Director           80           10.9              8.4 ## 6        Research Scientist          292            5.1              9.0

Note that the INNER JOIN in this SQL query was structured such that both tenure and job were joined to demographics via the employee_id column. We could have instead joined job to tenure since we joined tenure to demographics; this would have achieved the same result since all employee ids exist in each of the three data frames.

If it were possible for all employee ids to exist in demographics but not in either tenure or job, we could leverage a LEFT JOIN to ensure all records from demographics are included in the output irrespective of whether they have matches in tenure or job:

A code that starts with hashtag store S Q L query as a character string. It ends with hashtag executes Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##                   job_title employee_cnt avg_org_tenure avg_commute_dist ## 1 Healthcare Representative          131            8.6              9.8 ## 2     Laboratory Technician          259            5.0              9.4 ## 3                   Manager           52           13.7              7.2 ## 4    Manufacturing Director          145            7.6              9.5 ## 5         Research Director           80           10.9              8.4 ## 6        Research Scientist          292            5.1              9.0

In this case, if demographics is the base data set which contains all employee ids (i.e., the “LEFT” data set), it is important for tenure and job to be joined to it. Joining job to tenure may result in information loss if an employee id exists in demographics and job but not in the intermediate tenure data set.

When integrating data within R, the tidyverse provides a more efficient and parsimonious method of joining many data sets using various join types. Within this framework, components are chained together via the |> operator. Though slightly less efficient, the legacy %>% operator could be used as an alternative. The example below joins nine data sets into a single employees data frame using a left join on the employee_id column:

A code starts with the hashtag load library. It is followed by library, open parenthesis tidyverse, close parenthesis. It is followed by employees with a list of demographics, status, benefits, job, payroll, performance, prior, underscore, employment, and sentiment among others.

Subqueries

Subqueries are queries nested within other queries. Subqueries are often referred to as inner queries, while the main queries are referred to as outer queries.

For example, if we are interested in performing an analysis on employees with more than a year of organization tenure, we can use a subquery to pass a list of employee ids that meet this criterion into the outer query for filtering. In this case, we would not need to include tenure in the join conditions of our main query:

A code that starts with hashtag store S Q L query as a character string. It ends with hashtag executes Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##                   job_title employee_cnt avg_commute_dist ## 1 Healthcare Representative          118              9.7 ## 2     Laboratory Technician          198              9.6 ## 3                   Manager           49              6.9 ## 4    Manufacturing Director          133              9.7 ## 5         Research Director           74              8.3 ## 6        Research Scientist          238              9.2

Virtual Tables

An alternative to a subquery is creating a virtual table in the FROM clause. When using an INNER JOIN to connect demographics to the virtual table ids, which provides a list of employee ids for those with more than a year of organization tenure, any records in demographics or job that do not relate to employees with at least a year of organization tenure will be dropped. This is true even though a LEFT JOIN is used to join job to demographics since records in demographics will be filtered based on employee_id matches in the virtual table. With this approach, our WHERE clause is limited to the department = 'Research & Development' condition:

A code that starts with hashtag store S Q L query as a character string. It ends with hashtag executes Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##                   job_title employee_cnt avg_commute_dist ## 1 Healthcare Representative          118              9.7 ## 2     Laboratory Technician          198              9.6 ## 3                   Manager           49              6.9 ## 4    Manufacturing Director          133              9.7 ## 5         Research Director           74              8.3 ## 6        Research Scientist          238              9.2

As you can see, the output of the query using a virtual table matches the results from the preceding approach that utilized a subquery.

Window Functions

Window functions are used for performing calculations over a set of rows without collapsing the records. Unlike the aggregate functions we have covered, window functions do not collapse rows into a single value; the calculated value is returned for each of the rows over which the calculation is performed.

For example, we can assign an organization tenure rank by Research & Development job using the RANK() and OVER() functions in the SELECT clause. The PARTITION BY argument functions like a GROUP BY clause but without collapsing rows, while the ORDER BY argument sorts the records in ascending (ASC) or descending (DESC) order for proper ranking:

A code that starts with hashtag store S Q L query as a character string, followed by the hashtag limit output to 10 records since query does not collapse. It ends with hashtag executes Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##    employee_id                 job_title commute_dist commute_dist_rank ## 1         2325 Healthcare Representative           29                 1 ## 2         1414 Healthcare Representative           28                 2 ## 3         1010 Healthcare Representative           27                 3 ## 4         1573 Healthcare Representative           27                 3 ## 5         2200 Healthcare Representative           26                 5 ## 6         1730 Healthcare Representative           25                 6 ## 7         1833 Healthcare Representative           25                 6 ## 8         1993 Healthcare Representative           25                 6 ## 9         2415 Healthcare Representative           25                 6 ## 10        1164 Healthcare Representative           24                10

Notice that in the case of commute distance ties, the RANK() function assigns the same rank and then adds the number of ties to that rank to determine the rank for the next highest value of commute distance.

We can also treat this query as a virtual table, and then filter on the derived commute_dist_rank field to return the highest commute distance for each job. We can add a DISTINCT() function in the SELECT clause to collapse jobs for which there are more than one employee with the max commute distance and display the number of ties for each using the COUNT(*) function:

A code that starts with hashtag store S Q L query as a character string. It ends with hashtag executes Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##                   job_title employee_count commute_dist ## 1 Healthcare Representative              1           29 ## 2     Laboratory Technician              6           29 ## 3                   Manager              2           29 ## 4    Manufacturing Director              4           29 ## 5         Research Director              3           28 ## 6        Research Scientist              4           29 ## 7            Vice President              1            8

Common Table Expressions (CTEs)

An alternative to the virtual table approach is to use a common table expression (CTE), which is the result set of a query that exists temporarily and only for use in a larger query. Like the virtual table example, CTEs do not persist data in objects or tables; the data exist only for the duration of the query.

A code that starts with hashtag store S Q L query as a character string. It ends with hashtag executes Q L query, followed by s q d f in proportion to s q l d f, open parenthesis, s q l, underscore, string, close parenthesis.

##                   job_title employee_count commute_dist ## 1 Healthcare Representative              1           29 ## 2     Laboratory Technician              6           29 ## 3                   Manager              2           29 ## 4    Manufacturing Director              4           29 ## 5         Research Director              3           28 ## 6        Research Scientist              4           29 ## 7            Vice President              1            8

Review Questions

  1. 1.

    What two clauses must always be present in a SQL query?

  2. 2.

    What SQL clause is executed first at run time?

  3. 3.

    To optimize the performance of a SQL query, how should conditions in the WHERE clause be ordered?

  4. 4.

    How do aggregate functions differ from window functions in SQL?

  5. 5.

    What is a subquery?

  6. 6.

    What is the difference between an INNER JOIN and LEFT JOIN?

  7. 7.

    What is the purpose of a common table expression (CTE)?

  8. 8.

    What does the PARTITION BY function do?

  9. 9.

    Why is it important for queries to limit records on the database side before reading into R?

  10. 10.

    In which clause are filter conditions applied to aggregate functions (e.g., COUNT(*) > 5, AVG(salary) < 100000)?