Overview and Descriptive Statistics

Modern Mathematical Statistics with Applications

Part of the book series: Springer Texts in Statistics ((STS))

Abstract

Statistical concepts and methods are not only useful but indeed often indispensable in understanding the world around us. They provide ways of gaining new insights into the behavior of many phenomena that you will encounter in your chosen field of specialization.

Notes

  1. Different software packages calculate the quartiles (and, thus, the iqr) somewhat differently, for example using different interpolation methods between x values. For smaller data sets, the difference can be noticeable; this is typically less of an issue for larger data sets.
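
To make the note concrete, here is a minimal Python sketch that applies the two interpolation conventions offered by the standard library function `statistics.quantiles` to one small sample. The sample is the set of 13 PTSD observations from Exercise 85, borrowed only as a convenient illustration; neither convention is claimed to be the one this book uses.

```python
import statistics

# Small sample (PTSD receptor-binding values from Exercise 85), already sorted.
sample = [10, 20, 25, 28, 31, 35, 37, 38, 38, 39, 39, 42, 46]

def iqr(data, method):
    """Interquartile range under one of the two conventions supported by
    statistics.quantiles: 'exclusive' or 'inclusive'."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method=method)
    return q3 - q1

iqr_exclusive = iqr(sample, "exclusive")
iqr_inclusive = iqr(sample, "inclusive")
print("exclusive:", iqr_exclusive, " inclusive:", iqr_inclusive)
```

With only 13 observations the two conventions give different values for the iqr, which is the kind of discrepancy the note describes.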

Author information

Corresponding author

Correspondence to Jay L. Devore.

Supplementary Exercises: (73–96)

  73. The article “Correlation Analysis of Stenotic Aortic Valve Flow Patterns Using Phase Contrast MRI” (Annals of Biomed. Engr. 2005: 878–887) included the following data on aortic root diameter (cm) for a sample of patients having various degrees of aortic stenosis (i.e., narrowing of the aortic valve):

      Males:    3.7  3.4  3.7  4.0  3.9  3.8  3.4  3.6  3.1  4.0  3.4  3.8  3.5
      Females:  3.8  2.6  3.2  3.0  4.3  3.5  3.1  3.1  3.2  3.0

      a. Create a comparative stem-and-leaf plot.

      b. Calculate an appropriate measure of center for each set of observations.

      c. Compare and contrast the diameter observations for the two sexes.

  74. Consider the following information from a sample of four Wolferman’s cranberry citrus English muffins, which are said on the package label to weigh 116 g: \( \bar{x} \) = 104.4 g, s = 4.1497 g, smallest weighs 98.7 g, largest weighs 108.0 g. Determine the values of the two middle sample observations (and don’t do it by successive guessing!).

  75. Three different \( \mathrm{C_2F_6} \) flow rates (SCCM) were considered in an experiment to investigate the effect of flow rate on the uniformity (%) of the etch on a silicon wafer used in the manufacture of integrated circuits, resulting in the following data:

      Flow rate   Uniformity (%)
      125         2.6  2.7  3.0  3.2  3.8  4.6
      160         3.6  4.2  4.2  4.6  4.9  5.0
      200         2.9  3.4  3.5  4.1  4.6  5.1

      Compare and contrast the uniformity observations resulting from these three different flow rates.

  76. The amount of radiation received at a greenhouse plays an important role in determining the rate of photosynthesis. The accompanying observations on incoming solar radiation were read from a graph in the article “Radiation Components over Bare and Planted Soils in a Greenhouse” (Solar Energy 1990: 1011–1016).

       6.3   6.4   7.7   8.4   8.5   8.8   8.9   9.0   9.1  10.0  10.1  10.2  10.6
      10.6  10.7  10.7  10.8  10.9  11.1  11.2  11.2  11.4  11.9  11.9  12.2  13.1

      Use some of the methods discussed in this chapter to describe and summarize this data.

  77. The article “Motor Vehicle Emissions Variability” (J. Air Waste Manag. Assoc. 1996: 667–675) reported the following hydrocarbon and carbon dioxide measurements using the Federal Testing Procedure for emissions-testing, applied four times each to the same car:

      HC (g/mile):   13.8   18.3   32.2   32.5
      CO (g/mile):    118    149    232    236

      a. Compute the sample standard deviations for the HC and CO observations. Why should it not be surprising that the CO measurements have a larger standard deviation?

      b. The sample coefficient of variation \( s/\bar{x} \) (or \( 100 \cdot s/\bar{x} \)) assesses the extent of variability relative to the mean. Values of this coefficient for several different data sets can be compared to determine which data sets exhibit more or less variation. Carry out such a comparison for the given data.
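
The definition in part (b) translates directly into code. The sketch below is one possible way to check hand calculations for the two samples in this exercise; the function name `coefficient_of_variation` is an illustrative choice, not notation from the text.

```python
import statistics

hc = [13.8, 18.3, 32.2, 32.5]   # HC (g/mile), from the exercise
co = [118, 149, 232, 236]       # CO (g/mile), from the exercise

def coefficient_of_variation(data):
    """Sample standard deviation divided by the sample mean (s / x-bar)."""
    return statistics.stdev(data) / statistics.mean(data)

cv_hc = coefficient_of_variation(hc)
cv_co = coefficient_of_variation(co)
print(f"HC: s = {statistics.stdev(hc):.2f}, s/xbar = {cv_hc:.3f}")
print(f"CO: s = {statistics.stdev(co):.2f}, s/xbar = {cv_co:.3f}")
```

Because the coefficient is unit-free, the two printed ratios can be compared directly even though the two sets of measurements differ greatly in magnitude.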

  78. The cost-to-charge ratio for a hospital is the ratio of the actual cost of care to what the hospital charges for that care. In 2008, the Kentucky Department of Health and Family Services reported the following cost-to-charge ratios, expressed as percents, for 116 Kentucky hospitals:

      52.9  49.7  58.1  41.4  66.5  44.1  53.0  49.1  59.8  47.1
      44.3  52.3  60.5  59.9  47.1  62.4  47.3  62.1  52.1  47.8
      65.1  42.9  38.5  65.9  51.3  52.6  44.9  47.8  60.2  56.4
      67.6  31.9  53.9  50.6  72.5  47.8  50.5  25.1  45.0  86.0
      53.7  61.2  63.4  51.5  48.6  42.1  49.3  50.0  66.4  64.6
      47.4  48.1  45.8  64.7  58.7  56.9  45.9  82.9  46.0  51.0
      67.0  49.3  69.5  56.5  55.0  39.2  85.0  46.7  41.6  45.4
      71.2  42.7  46.9  39.2  55.3  46.1  43.2  67.7  60.6  68.2
      81.6  39.2  54.7  63.5  67.9  50.9  40.4  49.0  54.4  39.2
      43.2  43.2  51.7  48.4  50.7  59.4  49.7  60.2  40.2  62.3
      41.4  48.6  45.6  46.2  51.4  65.3  31.5  50.6  41.4  82.3
      45.2  46.0  58.3  46.3  38.2  59.1

      (For example, a cost-to-charge ratio of 53.0% means the actual cost of care is 53% of what the hospital charges.) Use various techniques discussed in this chapter to organize, summarize, and describe the data.

  79. Fifteen air samples from a certain region were obtained, and for each one the carbon monoxide concentration was determined. The results (in ppm) were

      9.3  10.7  8.5  9.6  12.2  15.6  9.2  10.5  9.0  13.2  11.0  8.8  13.7  12.1  9.8

      Using the interpolation method suggested in Section 1.3, compute the 10% trimmed mean.
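
With n = 15, a 10% trimming proportion corresponds to 1.5 observations at each end, so no whole number of deletions matches it exactly. The sketch below assumes that the interpolation method of Section 1.3 (not reproduced in this excerpt) means linear interpolation between the two nearest achievable trimmed means; the helper names are hypothetical.

```python
import math

co_ppm = [9.3, 10.7, 8.5, 9.6, 12.2, 15.6, 9.2, 10.5, 9.0,
          13.2, 11.0, 8.8, 13.7, 12.1, 9.8]

def trimmed_mean_k(data, k):
    """Mean after deleting the k smallest and k largest observations."""
    ordered = sorted(data)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def trimmed_mean(data, percent):
    """Trimmed mean for a percentage that may not match a whole number of
    observations. ASSUMPTION: interpolate linearly between the trimmed means
    for the two nearest whole numbers of deleted observations."""
    exact = len(data) * percent / 100      # observations to trim per end
    lo, hi = math.floor(exact), math.ceil(exact)
    if lo == hi:
        return trimmed_mean_k(data, lo)
    weight = exact - lo                    # fraction of the way from lo to hi
    return (1 - weight) * trimmed_mean_k(data, lo) + weight * trimmed_mean_k(data, hi)

tm10 = trimmed_mean(co_ppm, 10)
print(round(tm10, 2))
```

Here the two neighboring choices delete one observation per end (a 6.67% trim) and two per end (a 13.33% trim), and 10% lies halfway between them.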

  80. a. For what value of c is the quantity \( \sum {(x_{i} - c)^{2} } \) minimized? [Hint: Take the derivative with respect to c, set equal to 0, and solve.]

      b. Using the result of part (a), which of the two quantities \( \sum {(x_{i} - \bar{x})^{2} } \) and \( \sum {(x_{i} - \mu )^{2} } \) will be smaller than the other (assuming that \( \bar{x} \ne \mu \))?

  81. The article “A Longitudinal Study of the Development of Elementary School Children’s Private Speech” (Merrill-Palmer Q. 1990: 443–463) reported on a study of children talking to themselves (private speech). It was thought that private speech would be related to IQ, because IQ is supposed to measure mental maturity, and it was known that private speech decreases as students progress through the primary grades. The study included 33 students whose first-grade IQ scores are given here:

       82   96   99  102  103  103  106  107  108  108  108
      108  109  110  110  111  113  113  113  113  115  115
      118  118  119  121  122  122  127  132  136  140  146

      Use various techniques discussed in this chapter to organize, summarize, and describe the data.

  82. The accompanying specific gravity values for various wood types used in construction appeared in the article “Bolted Connection Design Values Based on European Yield Model” (J. Struct. Engr. 1993: 2169–2186):

      .31  .35  .36  .36  .37  .38  .40  .40  .40  .41  .41  .42
      .42  .42  .42  .42  .43  .44  .45  .46  .46  .47  .48  .48
      .48  .51  .54  .54  .55  .58  .62  .66  .66  .67  .68  .75

      Construct a stem-and-leaf display using repeated stems, and comment on any interesting features of the display.

  83. In recent years, some evidence suggests that high indoor radon concentration may be linked to the development of childhood cancers, but many health professionals remain unconvinced. The article “Indoor Radon and Childhood Cancer” (Lancet 1991: 1537–1538) presented the accompanying data on radon concentration (Bq/m³) in two different samples of houses. The first sample consisted of houses in which a child diagnosed with cancer had been residing. Houses in the second sample had no recorded cases of childhood cancer.

      Cancer:
         3    5    6    7    8    9    9   10   10   10
        11   11   11   11   12   13   13   15   15   15
        16   16   16   17   18   18   18   20   21   21
        22   22   23   23   27   33   34   38   39   45
        57  210

      No cancer:
         3    3    5    6    6    7    7    7    8    8
         9    9    9    9   11   11   11   11   11   12
        12   13   14   17   17   21   21   24   24   29
        29   29   29   33   38   39   55   55   85

      a. Construct a side-by-side stem-and-leaf display, and comment on any interesting features.

      b. Calculate the standard deviation of each sample. Which sample appears to have greater variability, according to these values?

      c. Calculate the iqr for each sample. Now which sample has greater variability, and why is this different than the result of part (b)?
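
Parts (b) and (c) contrast two measures of spread on the same data. The following sketch is one way to check the calculations. It assumes the 'inclusive' quartile convention of Python's `statistics.quantiles`, which need not coincide with the convention used in this chapter, so iqr values computed by hand may differ slightly.

```python
import statistics

cancer = [3, 5, 6, 7, 8, 9, 9, 10, 10, 10, 11, 11, 11, 11, 12, 13, 13, 15, 15, 15,
          16, 16, 16, 17, 18, 18, 18, 20, 21, 21, 22, 22, 23, 23, 27, 33, 34, 38, 39, 45,
          57, 210]
no_cancer = [3, 3, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 9, 9, 11, 11, 11, 11, 11, 12,
             12, 13, 14, 17, 17, 21, 21, 24, 24, 29, 29, 29, 29, 33, 38, 39, 55, 55, 85]

def spread(data):
    """Return (sample standard deviation, interquartile range)."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return statistics.stdev(data), q3 - q1

sd_cancer, iqr_cancer = spread(cancer)
sd_no_cancer, iqr_no_cancer = spread(no_cancer)
print(f"cancer:    s = {sd_cancer:.1f}, iqr = {iqr_cancer}")
print(f"no cancer: s = {sd_no_cancer:.1f}, iqr = {iqr_no_cancer}")
```

Rerunning `spread` on the cancer sample with the single value 210 removed is a quick way to see how strongly each measure reacts to one extreme observation.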

  84. Elevated energy consumption during exercise continues after the workout ends. Because calories burned after exercise contribute to weight loss and have other consequences, it is important to understand this process. The paper “Effect of Weight Training Exercise and Treadmill Exercise on Post-Exercise Oxygen Consumption” (Med. Sci. Sports Exercise 1998: 518–522) reported the accompanying data from a study in which oxygen consumption (liters) was measured continuously for 30 min for each of 15 subjects both after a weight training exercise and after a treadmill exercise.

      Subject           1     2     3     4     5     6
      Weight (x)     14.6  14.4  19.5  24.3  16.3  22.1
      Treadmill (y)  11.3   5.3   9.1  15.2  10.1  19.6

      Subject           7     8     9    10    11    12
      Weight (x)     23.0  18.7  19.0  17.0  19.1  19.6
      Treadmill (y)  20.8  10.3  10.3   2.6  16.6  22.4

      Subject          13    14    15
      Weight (x)     23.2  18.5  15.9
      Treadmill (y)  23.6  12.6   4.4

      a. Construct side-by-side boxplots of the weight and treadmill observations, and comment on what you see.

      b. Because the data is in the form of (x, y) pairs, with x and y measurements on the same variable under two different conditions, it is natural to focus on the differences within pairs: \( d_1 = x_1 - y_1, \ldots, d_n = x_n - y_n \). Construct a boxplot of the sample differences. What does it suggest?
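
The within-pair differences described in part (b), together with the summary values a boxplot is built from, can be computed as in the sketch below. The quartiles use the 'inclusive' convention of `statistics.quantiles`, which is an assumption and may differ slightly from the fourths used in this chapter.

```python
import statistics

weight = [14.6, 14.4, 19.5, 24.3, 16.3, 22.1, 23.0, 18.7, 19.0, 17.0,
          19.1, 19.6, 23.2, 18.5, 15.9]
treadmill = [11.3, 5.3, 9.1, 15.2, 10.1, 19.6, 20.8, 10.3, 10.3, 2.6,
             16.6, 22.4, 23.6, 12.6, 4.4]

# d_i = x_i - y_i, rounded to one decimal to remove floating-point noise.
differences = [round(x - y, 1) for x, y in zip(weight, treadmill)]

def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

summary = five_number_summary(differences)
print(differences)
print(summary)
```

Counting how many of the differences are negative is a useful first look before drawing the boxplot.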

   
  85. Anxiety disorders and symptoms can often be effectively treated with benzodiazepine medications. It is known that animals exposed to stress exhibit a decrease in benzodiazepine receptor binding in the frontal cortex. The paper “Decreased Benzodiazepine Receptor Binding in Prefrontal Cortex in Combat-Related Posttraumatic Stress Disorder” (Amer. J. Psychiatry 2000: 1120–1126) described the first study of benzodiazepine receptor binding in individuals suffering from PTSD. The accompanying data on a receptor binding measure (adjusted distribution volume) was read from a graph in the paper.

      PTSD:     10  20  25  28  31  35  37  38  38  39  39  42  46
      Healthy:  23  39  40  41  43  47  51  58  63  66  67  69  72

      Use various methods from this chapter to describe and summarize the data.

  86. The article “Can We Really Walk Straight?” (Amer. J. Phys. Anthropol. 1992: 19–27) reported on an experiment in which each of 20 healthy men was asked to walk as straight as possible to a target 60 m away at normal speed. Consider the following observations on cadence (number of strides per second):

      .95   .85   .92   .95   .93   .86  1.00   .92   .85   .81
      .78   .93   .93  1.05   .93  1.06  1.06   .96   .81   .96

      Use the methods developed in this chapter to summarize the data; include an interpretation or discussion wherever appropriate. [Note: The author of the article used a rather sophisticated statistical analysis to conclude that people cannot walk in a straight line and suggested several explanations for this.]

  87. The mode of a numerical data set is the value that occurs most frequently in the set.

      a. Determine the mode for the cadence data given in the previous exercise.

      b. For a categorical sample, how would you define the modal category?
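
The definition of the mode amounts to counting occurrences and keeping the most frequent value or values. A minimal sketch follows; the helper `modes` and the tiny categorical sample are hypothetical illustrations, not part of the exercise.

```python
from collections import Counter

cadence = [.95, .85, .92, .95, .93, .86, 1.00, .92, .85, .81,
           .78, .93, .93, 1.05, .93, 1.06, 1.06, .96, .81, .96]

def modes(values):
    """Return (list of most frequent values, their common frequency).
    Ties are all reported, since a data set can have more than one mode."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top], top

cadence_modes, cadence_count = modes(cadence)
print(cadence_modes, cadence_count)

# The same counting idea applies to category labels (hypothetical sample).
label_modes, label_count = modes(["A", "B", "A", "C"])
print(label_modes, label_count)
```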

  88. Specimens of three different types of rope wire were selected, and the fatigue limit (MPa) was determined for each specimen, resulting in the accompanying data.

      Type 1:  350  350  350  358  370  370  370  371  371  372  372  384  391  391  392
      Type 2:  350  354  359  363  365  368  369  371  373  374  376  380  383  388  392
      Type 3:  350  361  362  364  364  365  366  371  377  377  377  379  380  380  392

      a. Construct a comparative boxplot, and comment on similarities and differences.

      b. Construct a comparative dotplot (a dotplot for each sample with a common scale). Comment on similarities and differences.

      c. Does the comparative boxplot of part (a) give an informative assessment of similarities and differences? Explain your reasoning.

  89. The three measures of center introduced in this chapter are the mean, median, and trimmed mean. Two additional measures of center that are occasionally used are the midrange, which is the average of the smallest and largest observations, and the midquarter, which is the average of the two quartiles. Which of these five measures of center are resistant to the effects of outliers and which are not? Explain your reasoning.
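
One informal way to explore the question is to compute all five measures for a sample, replace one observation by an extreme value, and see which measures move. The sketch below does this for the PTSD sample of Exercise 85. The perturbed value 460 is hypothetical, the trimmed mean deletes one observation from each end, and the quartiles use the 'inclusive' convention of `statistics.quantiles`; all three choices are assumptions made for illustration.

```python
import statistics

def center_measures(data, k=1):
    """Five measures of center. The trimmed mean drops the k smallest and
    k largest observations (k >= 1); quartiles use the 'inclusive' rule."""
    ordered = sorted(data)
    q1, _median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "trimmed mean": statistics.mean(ordered[k:-k]),
        "midrange": (ordered[0] + ordered[-1]) / 2,
        "midquarter": (q1 + q3) / 2,
    }

ptsd = [10, 20, 25, 28, 31, 35, 37, 38, 38, 39, 39, 42, 46]
# Hypothetical perturbation: the largest value is replaced by an extreme one.
perturbed = ptsd[:-1] + [460]

before = center_measures(ptsd)
after = center_measures(perturbed)
changed = {name: before[name] != after[name] for name in before}
for name in before:
    print(f"{name:>12}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

A single perturbed sample is only suggestive; the explanation the exercise asks for should rest on how each measure is defined.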

  90. The authors of the article “Predictive Model for Pitting Corrosion in Buried Oil and Gas Pipelines” (Corrosion 2009: 332–342) provided the data on which their investigation was based.

      a. Consider the following sample of 61 observations on maximum pitting depth (mm) of pipeline specimens buried in clay loam soil.

         0.41   0.41   0.41   0.41   0.43   0.43   0.43   0.48   0.48   0.58
         0.79   0.79   0.81   0.81   0.81   0.91   0.94   0.94   1.02   1.04
         1.04   1.17   1.17   1.17   1.17   1.17   1.17   1.17   1.19   1.19
         1.27   1.40   1.40   1.59   1.59   1.60   1.68   1.91   1.96   1.96
         1.96   2.10   2.21   2.31   2.46   2.49   2.57   2.74   3.10   3.18
         3.30   3.58   3.58   4.15   4.75   5.33   7.65   7.70   8.13  10.41
        13.44

         Construct a stem-and-leaf display in which the two largest values are shown in a last row labeled HI.

      b. Refer back to (a), and create a histogram based on eight classes with 0 as the lower limit of the first class and class widths of .5, .5, .5, .5, 1, 2, 5, and 5, respectively.

      c. The accompanying comparative boxplot shows plots of pitting depth for four different types of soils. Describe its important features.

      [Figure: comparative boxplot of pitting depth for four different types of soils]

  91. Consider a sample \( x_1, x_2, \ldots, x_n \) and suppose that the values of \( \bar{x} \), \( s^2 \), and s have been calculated.

      a. Let \( y_{i} = x_{i} - \bar{x} \) for i = 1, …, n. How do the values of \( s^2 \) and s for the \( y_i \)’s compare to the corresponding values for the \( x_i \)’s? Explain.

      b. Let \( z_{i} = (x_{i} - \bar{x})/s \) for i = 1, …, n. What are the values of the sample variance and sample standard deviation for the \( z_i \)’s?

  92. Let \( \bar{x}_{n} \) and \( s_{n}^{2} \) denote the sample mean and variance for the sample \( x_1, \ldots, x_n \) and let \( \bar{x}_{n + 1} \) and \( s_{n + 1}^{2} \) denote these quantities when an additional observation \( x_{n+1} \) is added to the sample.

      a. Show how \( \bar{x}_{n + 1} \) can be computed from \( \bar{x}_{n} \) and \( x_{n+1} \).

      b. Show that

         $$ ns_{n + 1}^{2} = (n - 1)s_{n}^{2} + \frac{n}{n + 1}(x_{n + 1} - \bar{x}_{n} )^{2} $$

         so that \( s_{n + 1}^{2} \) can be computed from \( x_{n+1} \), \( \bar{x}_{n} \), and \( s_{n}^{2} \).

      c. Suppose that a sample of 15 strands of drapery yarn has resulted in a sample mean thread elongation of 12.58 mm and a sample standard deviation of .512 mm. A 16th strand results in an elongation value of 11.8. What are the values of the sample mean and sample standard deviation for all 16 elongation observations?
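
The identity in part (b) lends itself to a one-pass update, sketched below and applied to the numbers of part (c). The mean update used here is the weighted-average form \( \bar{x}_{n+1} = (n\bar{x}_n + x_{n+1})/(n+1) \); the function name `update_mean_variance` is an illustrative choice.

```python
import math

def update_mean_variance(n, mean_n, var_n, x_new):
    """Return (mean, variance) of n + 1 observations, given the mean and
    sample variance of the first n observations and the new value."""
    mean_next = (n * mean_n + x_new) / (n + 1)
    # Identity from part (b): n * s_{n+1}^2 = (n-1) * s_n^2 + n/(n+1) * (x_new - mean_n)^2
    var_next = ((n - 1) * var_n + n / (n + 1) * (x_new - mean_n) ** 2) / n
    return mean_next, var_next

mean16, var16 = update_mean_variance(15, 12.58, 0.512 ** 2, 11.8)
sd16 = math.sqrt(var16)
print(f"mean = {mean16:.4f}, sd = {sd16:.4f}")
```

The appeal of this form is that the original 15 observations are never needed, only their summary values.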

  93. Lengths of bus routes for any particular transit system will typically vary from one route to another. The article “Planning of City Bus Routes” (J. Institut. Engr. 1995: 211–215) gives the following information on lengths (km) for one particular system:

      Length:      6–8   8–10  10–12  12–14  14–16
      Frequency:     6     23     30     35     32

      Length:    16–18  18–20  20–22  22–24  24–26
      Frequency:    48     42     40     28     27

      Length:    26–28  28–30  30–35  35–40  40–45
      Frequency:    26     14     27     11      2

      a. Draw a histogram corresponding to these frequencies.

      b. What proportion of these route lengths are less than 20? What proportion of these routes have lengths of at least 30?

      c. Roughly what is the value of the 90th percentile of the route length distribution?

      d. Roughly what is the median route length?
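
Parts (b) through (d) can all be read off the cumulative frequencies. The sketch below is one rough approach: proportions are computed at class boundaries, and percentiles are approximated by linear interpolation within the class that contains them, which assumes observations are spread evenly across each class. That assumption and the helper names are not from the text.

```python
# (lower limit, upper limit, frequency) for each class, from the table.
classes = [(6, 8, 6), (8, 10, 23), (10, 12, 30), (12, 14, 35), (14, 16, 32),
           (16, 18, 48), (18, 20, 42), (20, 22, 40), (22, 24, 28), (24, 26, 27),
           (26, 28, 26), (28, 30, 14), (30, 35, 27), (35, 40, 11), (40, 45, 2)]
total = sum(freq for _lo, _hi, freq in classes)

def proportion_below(cutoff):
    """Proportion of routes shorter than cutoff (cutoff must be a class boundary)."""
    return sum(freq for _lo, hi, freq in classes if hi <= cutoff) / total

def approx_percentile(p):
    """Rough p-th percentile, interpolating linearly inside the containing class."""
    target = p / 100 * total
    cumulative = 0
    for lo, hi, freq in classes:
        if cumulative + freq >= target:
            return lo + (target - cumulative) / freq * (hi - lo)
        cumulative += freq
    return classes[-1][1]

print(f"below 20:    {proportion_below(20):.3f}")
print(f"at least 30: {1 - proportion_below(30):.3f}")
print(f"90th percentile ~ {approx_percentile(90):.1f}, median ~ {approx_percentile(50):.1f}")
```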

  94. A study carried out to investigate the distribution of total braking time (reaction time plus accelerator-to-brake movement time, in msec) during real driving conditions at 60 km/h gave the following summary information on the distribution of times (“A Field Study on Braking Responses during Driving,” Ergonomics 1995: 1903–1910):

    $$ \begin{array}{*{20}l} {{\text{mean = 535}}\quad {\text{median = 500}}\quad {\text{mode = 500}}} \hfill \\ {{\text{sd = 96}}\quad {\text{minimum = 220}}\quad {\text{maximum = 925}}} \hfill \\ {5{\text{th}}\,{\text{percentile = 400}}\quad 10{\text{th}}\,{\text{percentile = 430}}} \hfill \\ {90{\text{th}}\;{\text{percentile = 640}}\quad 95{\text{th}}\,{\text{percentile = 720}}} \hfill \\ \end{array} $$

    What can you conclude about the shape of a histogram of this data? Explain your reasoning.

  95. The sample data \( x_1, x_2, \ldots, x_n \) sometimes represents a time series, where \( x_t \) = the observed value of a response variable x at time t. Often the observed series shows a great deal of random variation, which makes it difficult to study longer-term behavior. In such situations, it is desirable to produce a smoothed version of the series. One technique for doing so involves exponential smoothing. The value of a smoothing constant α is chosen (0 < α < 1). Then with \( \bar{x}_{t} \) defined as the smoothed value at time t, we set \( \bar{x}_{1} = x_{1} \), and for t = 2, 3, …, n, \( \bar{x}_{t} = \alpha x_{t} + \left( {1 - \alpha } \right)\bar{x}_{t - 1} \).

      a. Consider the following time series in which \( x_t \) = temperature (°F) of effluent at a sewage treatment plant on day t: 47, 54, 53, 50, 46, 46, 47, 50, 51, 50, 46, 52, 50, 50. Plot each \( x_t \) against t on a two-dimensional coordinate system (a time series plot). Does there appear to be any pattern?

      b. Calculate the \( \bar{x}_{t} \)’s using α = .1. Repeat using α = .5. Which value of α gives a smoother \( \bar{x}_{t} \) series?

      c. Substitute \( \bar{x}_{t - 1} = \alpha x_{t - 1} + \left( {1 - \alpha } \right)\bar{x}_{t - 2} \) on the right-hand side of the expression for \( \bar{x}_{t} \), then substitute \( \bar{x}_{t - 2} \) in terms of \( x_{t - 2} \) and \( \bar{x}_{t - 3} \), and so on. On how many of the values \( x_t, x_{t-1}, \ldots, x_1 \) does \( \bar{x}_{t} \) depend? What happens to the coefficient on \( x_{t-k} \) as k increases?

      d. Refer to part (c). If t is large, how sensitive is \( \bar{x}_{t} \) to the initialization \( \bar{x}_{1} = x_{1} \)? Explain.
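
The recursion that defines exponential smoothing is short enough to code directly. The sketch below applies it to the temperature series of part (a) for the two smoothing constants in part (b); the function name `exponential_smoothing` is an illustrative choice.

```python
def exponential_smoothing(series, alpha):
    """Smoothed series with xbar_1 = x_1 and
    xbar_t = alpha * x_t + (1 - alpha) * xbar_{t-1} for t >= 2."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

temps = [47, 54, 53, 50, 46, 46, 47, 50, 51, 50, 46, 52, 50, 50]
smooth_01 = exponential_smoothing(temps, 0.1)
smooth_05 = exponential_smoothing(temps, 0.5)

for t, (x, a, b) in enumerate(zip(temps, smooth_01, smooth_05), start=1):
    print(f"t={t:2d}  x={x}  alpha=.1: {a:6.2f}  alpha=.5: {b:6.2f}")
```

Plotting the two smoothed columns against t alongside the raw series gives a direct visual comparison for part (b).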

  96. Consider numerical observations \( x_{1} , \ldots ,x_{n} \). It is frequently of interest to know whether the \( x_i \)’s are (at least approximately) symmetrically distributed about some value. If n is at least moderately large, the extent of symmetry can be assessed from a stem-and-leaf display or histogram. However, if n is not very large, such pictures are not particularly informative. Consider the following alternative. Let \( y_1 \) denote the smallest \( x_i \), \( y_2 \) the second-smallest \( x_i \), and so on. Then plot the following pairs as points on a two-dimensional coordinate system: \( (y_{n} - \tilde{x},\tilde{x} - y_{1} ),\; (y_{n - 1} - \tilde{x},\tilde{x} - y_{2} ),\; (y_{n - 2} - \tilde{x},\tilde{x} - y_{3} ), \ldots \). There are n/2 points when n is even and (n − 1)/2 when n is odd.

      a. What does this plot look like when there is perfect symmetry in the data? What does it look like when observations stretch out more above the median than below it (a long upper tail)?

      b. Construct the plot for the nitrogen data presented in Example 1.17, and comment on the extent of symmetry or nature of departure from symmetry.
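
The pairs to be plotted can be generated mechanically from the ordered sample and its median. Because the nitrogen data of Example 1.17 is not part of this excerpt, the sketch below uses the carbon monoxide sample of Exercise 79 purely as a stand-in; the function name `symmetry_pairs` is hypothetical.

```python
import statistics

def symmetry_pairs(data):
    """Pairs (y_{n-i+1} - median, median - y_i) for i = 1, 2, ...;
    n/2 pairs when n is even and (n - 1)/2 when n is odd."""
    y = sorted(data)
    n = len(y)
    med = statistics.median(y)
    return [(y[n - 1 - i] - med, med - y[i]) for i in range(n // 2)]

# Stand-in sample: CO concentrations (ppm) from Exercise 79.
co_ppm = [9.3, 10.7, 8.5, 9.6, 12.2, 15.6, 9.2, 10.5, 9.0,
          13.2, 11.0, 8.8, 13.7, 12.1, 9.8]

pairs = symmetry_pairs(co_ppm)
for upper_gap, lower_gap in pairs:
    print(f"({upper_gap:.1f}, {lower_gap:.1f})")
```

Each printed pair gives the distance of an upper observation above the median and the distance of the matching lower observation below it, which are the two coordinates of one plotted point.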

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Devore, J.L., Berk, K.N., Carlton, M.A. (2021). Overview and Descriptive Statistics. In: Modern Mathematical Statistics with Applications. Springer Texts in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-030-55156-8_1
