Exploring Bivariate and Categorical Data

Wolfe, Douglas A.; Schneider, Grant

doi:10.1007/978-3-319-56072-4_2

Part of the book series: Springer Texts in Statistics (STS)

Abstract

In Chapter 1 we focused on displaying and describing information on one variable at a time. In this chapter we consider graphical and numerical methods that can be used to investigate the relationship between two variables. Section 1 contains methods for exploring the relationship between two quantitative variables. Descriptive statistics for measuring the strength of association are provided in Sect. 2. Section 3 deals with relationships between two categorical variables.

Author information

Authors and Affiliations

Department of Statistics, The Ohio State University, Columbus, OH, USA
Douglas A. Wolfe
Upstart Network, San Carlos, CA, USA
Grant Schneider

Chapter 2 Comprehensive Exercises

2.A. Conceptual

2.A.1. Use basic algebra to show that the correlation coefficient r in (2.1) can also be expressed by

$$ r=\frac{1}{n-1}\frac{\sum_{i=1}^n\left({x}_i-\overline{x}\right)\left({y}_i-\overline{y}\right)}{s_x{s}_y} $$

as noted in (2.2).
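As a numerical companion to the algebra, the form of r shown above can be evaluated directly. The following is a minimal sketch in Python (used here only for illustration; the chapter itself works in R). The function names `sample_sd` and `pearson_r` and the data are hypothetical and do not come from the chapter.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation, using the divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_r(x, y):
    """Correlation coefficient: sum of cross-products of deviations,
    divided by (n - 1) * s_x * s_y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * sample_sd(x) * sample_sd(y))


# Hypothetical data, not taken from the chapter.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x, y)
print(round(r, 4))
```

Comparing this value with the one obtained from whichever form of r you start from in (2.1) is a quick check on the algebra, though it does not replace the derivation.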

2.A.2. An alternative form of Spearman’s rank correlation coefficient $r_s$ (2.3) is given by

$$ {r}_s=\frac{12\sum_{i=1}^n\left\{\left({r}_i-\frac{n+1}{2}\right)\left({s}_i-\frac{n+1}{2}\right)\right\}}{n\left({n}^2-1\right)}. $$

Compute this alternative version of $r_s$ for the first- and second-quarter on-time arrival data in Table 2.9 and verify that it matches the value given in Example 2.8. Compute $ \overline{r}=\frac{1}{n}\sum_{i=1}^n{r}_i $ and $ \overline{s}=\frac{1}{n}\sum_{i=1}^n{s}_i $ and compare them with the constants used to form the deviations in this version of the formula. What did you find?
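The mechanics of this formula can be sketched in a few lines of Python (for illustration only; the chapter itself works in R). The helper names `ranks` and `spearman_centered` are hypothetical, the data are made up rather than taken from Table 2.9, and the ranking helper assumes there are no tied values.

```python
def ranks(values):
    """Ranks 1..n from smallest to largest; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_centered(x, y):
    """Spearman's r_s from ranks centered at (n + 1) / 2."""
    n = len(x)
    r, s = ranks(x), ranks(y)
    center = (n + 1) / 2
    total = sum((ri - center) * (si - center) for ri, si in zip(r, s))
    return 12 * total / (n * (n ** 2 - 1))


# Hypothetical data, not the Table 2.9 arrival data.
x = [10.5, 3.2, 7.7, 8.1, 1.4]
y = [9.0, 2.5, 8.8, 4.1, 3.3]
print(ranks(x), ranks(y))
print(spearman_centered(x, y))
```

Applying the same steps to the Table 2.9 data would require a rule for ties if any occur there; this sketch does not handle them.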

2.A.3. Another alternative form of Spearman’s rank correlation coefficient $r_s$ (2.3) is given by

$$ {r}_s=1-\frac{6\sum_{i=1}^n{d}_i^2}{n\left({n}^2-1\right)}, $$

where $d_i = s_i - r_i$, for $i = 1, \ldots, n$; that is, $d_i$ is the difference in the ranks for the $i$th pair.

(a) Compute the value of this alternative version of $r_s$ for the first- and second-quarter on-time arrival data in Table 2.9 and verify that it is equal to the value of $r_s$ given in Example 2.8 (and computed in Exercise 2.A.2).
(b) What are the $d_i$ values, for $i = 1, \ldots, n$, if there is perfect positive association between x and y?
(c) What are the $d_i$ values, for $i = 1, \ldots, n$, if there is perfect negative association between x and y?
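The rank-difference form in Exercise 2.A.3 can be sketched the same way. This is a minimal Python illustration (the chapter itself works in R); `ranks` and `spearman_from_differences` are hypothetical helper names, the data are made up, and ties are assumed not to occur.

```python
def ranks(values):
    """Ranks 1..n from smallest to largest; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_from_differences(x, y):
    """Spearman's r_s from the rank differences d_i = s_i - r_i."""
    n = len(x)
    r, s = ranks(x), ranks(y)
    d_squared = sum((si - ri) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data, not the Table 2.9 arrival data.
x = [10.5, 3.2, 7.7, 8.1, 1.4]
y = [9.0, 2.5, 8.8, 4.1, 3.3]
print(spearman_from_differences(x, y))
```

Because the formula depends on the data only through the squared differences, inspecting the list of $d_i$ values for a small data set is often the quickest way to see why the coefficient takes the value it does.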

2.A.4. Construct a set of 8 pairs of observations that are positively associated and compute the value of r for your data set.

(a) Apply a linear transformation of your choice to the x values only. Compute the value of r for the transformed data. Did the absolute value of r stay the same? Did it change sign? What would happen to the value of r if you change the sign of the slope parameter in your linear transformation?
(b) Compute $r_s$ for the untransformed data $(x_i, y_i)$ and for the linearly transformed data $(x_i^*, y_i)$. Does the value of $r_s$ remain the same?
(c) Do you think linear transformations will always affect $r_s$ the same way that they affect r? Why or why not?

2.B. Data Analysis/Computational

2.B.1. Quarterly Delinquency Rates and Charge-off Rates. Consider the twenty-five years of quarterly delinquency rates for eight different types of loans, as reported by the Federal Reserve, presented in the R dataset delinquency_rates. Use R to construct smoothed scatterplots to compare the charge-off rates for each of the eight types of loans (see Example 2.5) with the corresponding delinquency rates. Does one set of rates appear more variable than the other? Do the two sets of rates appear to be related to each other over time?

2.B.2. College Scorecard Comparisons. Consider the College Scorecard Data reported by the U.S. Department of Education. A subset of these data for the year 2012 is presented in the R dataset college_rankings_2012. Your task is to prepare a report on U.S. educational institutions for a group of concerned parents. They are particularly interested in:

1. differences between public, private, and for-profit institutions;
2. relationships between faculty salaries and costs;
3. differences between schools of differing sizes.

(a) Compare in-state tuition for each of the three types of institution: public, private, and for-profit.
(b) Examine the relationship between average monthly faculty salaries and in-state tuition. Also examine the relationship between monthly faculty salaries and median graduate debt. Provide both graphical and numerical evidence to demonstrate any association you find.
(c) Examine the completion rates and average SAT scores of each of the three university types.
(d) Use enrollment numbers to group the institutions into “large”, “medium”, and “small” (using whatever thresholds you deem reasonable). Examine the relationship between school size and admission rates and that between school size and completion rates.
(e) Examine any other relationships that you think would be of interest to the concerned parents. Do you notice any patterns as to which types of schools seem more or less willing to provide data? (Missing values are denoted as NA in the dataset.)

2.B.3. Population, Birth Rates, and Migration. In this exercise, you will analyze population data provided by the U.S. Census Bureau at the state level. A subset of these data as of 2015 is presented in the R dataset population_estimates_2015, which contains population estimates, birth rates (per 1000 population), and net migration (per 1000 population) for each year 2011 through 2015.

(a) Begin by selecting the state you live in and any three other states that you would like to analyze. For each of these four states, produce a smoothed scatterplot of population estimates over time.
(b) For the four states you selected in (a), produce bar graphs of birth rates by year for each state. What patterns do you notice over time? What similarities and differences do you notice between the states you have selected?
(c) Produce scatterplots for population estimate and net migration for each of the 20 combinations of the five years and four states. What, if any, association do you see? Would you have expected this? Confirm your visual findings by computing two numerical measures of association.

2.B.4. Population, Birth Rates, and Migration. Repeat Exercise 2.B.3, but analyze the data for each of the four regions (rather than states) this time.

2.B.5. Comparing NBA Teams. In this exercise, you will analyze NBA teams’ performance in the 2015–2016 season. A subset of the data made available at http://stats.nba.com/league/team/ is accessible in the R dataset nba_2015_2016, which contains information on various statistics measuring team performance.

(a) Ultimately, teams care about winning games. Using both graphical and numerical methods, examine the association between win percentage and each of the following statistics: rebounds, assists, turnovers, steals, and blocks. Based on these associations, if you were in charge of an NBA team, which area would you focus on most heavily?
(b) Without looking at the data, write down what you expect the association to be between each pair of the following three statistics: field-goal percent, three-point percent, and free-throw percent. Generate scatterplots and at least one numerical measure of these associations. Do the results agree with what you predicted?
(c) Produce bar graphs for the three percentages discussed in part (b) for each of the following four teams: Cleveland Cavaliers, Golden State Warriors, Los Angeles Lakers, and Philadelphia 76ers. Comment on any patterns you observe when comparing these teams.

2.C. Activities

2.C.1. You will need a standard measuring tape to collect information on a small group of people who are willing to participate in this activity. Collect the information described below for each individual. Construct an appropriate scatterplot for each pair of variables. Are the various pairs of variables associated? If so, explain the nature of the associations. Compute and compare two different measures of association for each pair of variables.

(a) Waist and neck sizes.
(b) Foot and forearm sizes.
(c) Shirt and shoe sizes.
(d) Inseam length and circumference of head.
(e) Height and distance from belly button to the floor.
(f) Height and weight.

2.C.2. Who is better at rolling sixes? Obtain a few standard six-sided dice and split participants into groups along a categorical variable (for example: hair color, height, gender, etc.). Each member of each group should roll a die five times and record the number of sixes that he or she obtains. Does group membership appear to be associated with ability to roll sixes? If so, explain the nature of the association. Repeat the experiment 5 times and construct a bar graph for the proportion of sixes for each group by experiment. Do your conclusions about each group’s ability to roll sixes change when you analyze five repetitions of the experiment rather than one?

2.C.3. Tossing Quarters. Stand approximately 15 feet from a wall. Toss a U.S. quarter toward the wall and try to get it as close to the wall as you can. After some practice trials, record the distance from the edge of the wall to the closest edge of the coin for 15 consecutive tosses.

(a) Create a time series plot for the distance from the edge of the wall to the edge of the quarter.
(b) Describe any patterns or unusual observations you see on the plot. Did you get better or worse over time?
(c) Now try a different method of tossing the coin. After some practice trials, record the distance from the edge of the wall to the edge of the coin for 15 consecutive tosses. Create a time series plot for these distances and compare the plot with the one in part (a).
(d) Is one method clearly better than the other method for you? Justify your answer with appropriate descriptive statistics and graphical displays.

2.C.4. Weather Forecasts. Go online and find the 10-day weather forecasts for your local area, for Flint, Michigan, and for Tempe, Arizona. Record the temperatures for each of the 10 days at each of the three locations.

(a) Create a time series plot for the daily temperature at each location.
(b) What similarities and differences do you notice between the patterns of temperature at these locations?
(c) How do you think your answers to (a) and (b) would change if you were to repeat the exercise for hourly temperatures rather than daily temperatures?
(d) How do you think your answers to (a) and (b) would be affected if you were to repeat this activity in 6 months? (It may be helpful to find monthly average temperatures for each location.)

2.D. Internet Archives

2.D.1. Visit www.guessthecorrelation.com to check your intuition regarding the shape of a scatterplot and a statistic that measures the strength of linear association.

2.D.2. Performance Statistics for the New York Stock Exchange. Find a website that enables you to obtain recent performance statistics for stocks traded on the New York Stock Exchange (NYSE). Enter a ticker symbol for a stock of interest to you and obtain plots of performance statistics for your stock for each of the following time periods: intraday, 1-week, 1-month, 3-month, YTD (year-to-date), 1-year, 3-year, and 5-year. Do all of the plots show the same overall trend? If so, comment on that overall pattern. If not, what does this tell you about the importance of considering an appropriate scale for the horizontal axis?

2.D.3. Kentucky Derby Races. Find a website that provides data for all of the previous Kentucky Derby races (Derby charts, race statistics, etc.). Select a 25-year period of interest to you. Download the data from the 25-year period you selected and load it into R. (Hint: you can use the function to load a .csv file into R.)

(a) Construct a time series plot for the winning times and comment on any obvious patterns.
(b) Construct a time series plot for the net amount of money paid to the winner of the race and comment on the pattern over time.
(c) Add a plotting symbol to the time series plot in part (a) that identifies the condition of the track at the time of the race. Does the condition of the track affect the winning times? Compute appropriate descriptive statistics to justify your response.
(d) Add a plotting symbol to the time series plot in part (b) that identifies the condition of the track at the time of the race. Does the condition of the track affect the net amount of money paid to the winner? Compute appropriate descriptive statistics to justify your response.
(e) Is there any association between winning time and the net amount of money paid to the winner? Justify your response.

2.D.4. Stock Performances. Visit YAHOO! Finance at finance.yahoo.com. Enter a ticker symbol for a company of interest to you and then click Go. (If you don’t know the ticker symbol, you can do a quick search by typing a company name in the Quote Lookup.) After looking at the recent performance statistics, click the chart to see a time series plot of closing prices. Select 1-year under the time period options to get a different chart.

(a) Another method of smoothing time series data to search for trends is to compute moving averages of the prices. At the top of the chart you can add moving averages to your plot by clicking Indicator, then clicking Simple Moving Average, and finally entering the number of days you want to include in the moving average in the Period field. Select 25 and click the chart to add a 25-day moving average to your 1-year time series plot. Does the addition of the 25-day moving average help you see the overall trend?
(b) Now add a 50-day moving average to your 1-year time series plot. Do you see the same overall pattern with the 50-day moving average that you do with the 25-day moving average?
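To see what a simple moving average does to a price series, it can help to compute one by hand. The following is a minimal Python sketch (for illustration only; the chapter itself works in R). The function name `simple_moving_average` and the prices are hypothetical, and the sketch assumes the usual definition in which each value is the mean of the most recent `period` closing prices; the charting site may treat the start of the series or non-trading days differently.

```python
def simple_moving_average(prices, period):
    """Mean of each run of `period` consecutive prices.

    Returns len(prices) - period + 1 values; the first one corresponds
    to the first day on which `period` prices are available.
    """
    if period < 1 or period > len(prices):
        raise ValueError("period must be between 1 and len(prices)")
    window_sum = sum(prices[:period])
    averages = [window_sum / period]
    for i in range(period, len(prices)):
        # Slide the window: add the newest price, drop the oldest one.
        window_sum += prices[i] - prices[i - period]
        averages.append(window_sum / period)
    return averages


# Hypothetical closing prices, not real market data.
closing_prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
sma_3 = simple_moving_average(closing_prices, 3)
print(sma_3)
```

A longer period averages over more days, so each individual price has less influence on the result; this is the trade-off to keep in mind when comparing the 25-day and 50-day curves.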

About this chapter

Cite this chapter

Wolfe, D.A., Schneider, G. (2017). Exploring Bivariate and Categorical Data. In: Intuitive Introductory Statistics. Springer Texts in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-56072-4_2

DOI: https://doi.org/10.1007/978-3-319-56072-4_2
Published: 10 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56070-0
Online ISBN: 978-3-319-56072-4
eBook Packages: Mathematics and Statistics (R0)
