
Inferences Based on Two Samples

Modern Mathematical Statistics with Applications

Part of the book series: Springer Texts in Statistics (STS)


Abstract

Chapters 8 and 9 presented confidence intervals (CIs) and hypothesis-testing procedures for single parameters, such as a population mean μ and a population proportion p.



Correspondence to Jay L. Devore.

Supplementary Exercises: (95–124)


  95.

    A group of 115 University of Iowa students was randomly divided into a build-up condition group (m = 56) and a scale-down condition group (n = 59). The task for each subject was to build his or her own pizza from a menu of 12 ingredients. The build-up group was told that a basic cheese pizza costs $5 and that each extra ingredient would cost 50 cents. The scale-down group was told that a pizza with all 12 ingredients (ugh!!!) would cost $11 and that deleting an ingredient would save 50 cents. The article “A Tale of Two Pizzas: Building Up from a Basic Product Versus Scaling Down from a Fully Loaded Product” (Market. Lett. 2002: 335–344) reported that the mean number of ingredients selected by the scale-down group was significantly greater than the mean number for the build-up group: 5.29 versus 2.71. The calculated value of the appropriate t statistic was 6.07. Would you reject the null hypothesis of equality in favor of inequality at a significance level of .05? .01? .001? Can you think of other products aside from pizza where one could build up or scale down? [Note: A separate experiment involved students from the University of Rome, but details were a bit different because there are typically not so many ingredient choices in Italy.]

  96.

    Is the number of export markets in which a firm sells its products related to the firm’s return on sales? The article “Technology Industry Success: Strategic Options for Small and Medium Firms” (Bus. Horizons, Sept.–Oct. 2003: 41–46) gave the accompanying information on the number of export markets for one group of firms whose return on sales was less than 10% and another group whose return was at least 10%.

| Return | Sample size | Sample mean | Sample SD |
|---|---|---|---|
| Less than 10% | 36 | 5.12 | .57 |
| At least 10% | 47 | 8.26 | 1.20 |

The investigators reported that an appropriate test of hypotheses resulted in a P-value between .01 and .05. What hypotheses do you think were tested, and do you agree with the stated P-value information? What assumptions if any are needed in order to carry out the test? Can the plausibility of these assumptions be investigated based just on the foregoing summary data? Explain.
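As a rough aid for checking a reported P-value from summary data like this, the following sketch computes the two-sample t statistic with Welch's approximate df. The helper name `welch_t` and its argument order are illustrative assumptions, not something defined in the chapter, and the choice of hypotheses is still left to the reader.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2, delta0=0.0):
    """Two-sample t statistic and Welch's approximate df from summary data.

    Hypothetical helper for illustration; delta0 is the null value of
    mu1 - mu2.
    """
    v1 = sd1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = sd2 ** 2 / n2  # estimated variance of the second sample mean
    se = math.sqrt(v1 + v2)
    t = (mean1 - mean2 - delta0) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Export-market summary data: "less than 10%" group minus "at least 10%" group
t_stat, df = welch_t(5.12, 0.57, 36, 8.26, 1.20, 47)
print(f"t = {t_stat:.2f}, Welch df = {df:.1f}")
```

The resulting statistic can then be compared with t critical values at the computed df to judge whether a P-value between .01 and .05 is believable.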

  97.

    Suppose when using a two-sample t procedure that m < n, and show that \( \nu \) > m − 1. (This is why some authors suggest using min(m − 1, n − 1) as df in place of Welch’s formula.) What impact does this have on the CI and test procedure?

  98.

    The accompanying summary data on compression strength (lb) for 12 × 10 × 8 in. boxes appeared in the article “Compression of Single-Wall Corrugated Shipping Containers Using Fixed and Floating Test Platens” (J. Testing Eval. 1992: 318–320). The authors stated that “the difference between the compression strength using fixed and floating platen method was found to be small compared to normal variation in compression strength between identical boxes.” Do you agree?

| Method | Sample size | Sample mean | Sample SD |
|---|---|---|---|
| Fixed | 10 | 807 | 27 |
| Floating | 10 | 757 | 41 |

  99.

    The authors of the article “Dynamics of Canopy Structure and Light Interception in Pinus elliotti, North Florida” (Ecol. Monogr. 1991: 33–51) planned an experiment to determine the effect of fertilizer on a measure of leaf area. A number of plots were available for the study, and half were selected at random to be fertilized. To ensure that the plots to receive the fertilizer and the control plots were similar, before beginning the experiment tree density (the number of trees per hectare) was recorded for eight plots to be fertilized and eight control plots, resulting in the given data. Minitab output follows.

Fertilizer plots: 1024, 1216, 1312, 1280, 1216, 1312, 992, 1120

Control plots: 1104, 1072, 1088, 1328, 1376, 1280, 1120, 1200

Two sample T for fertilize vs. control

| | N | Mean | Std. Dev. | SE Mean |
|---|---|---|---|---|
| Fertilize | 8 | 1184 | 126 | 44 |
| Control | 8 | 1196 | 118 | 42 |

95% CI for mu fertilize-mu control: (−144, 120)

  a.

    Construct a comparative boxplot and comment on any interesting features.

  b.

    Would you conclude that there is a significant difference in the mean tree density for fertilizer and control plots? Use α = .05.

  c.

    Interpret the given confidence interval.

  100.

    Is the response rate for questionnaires affected by including some sort of incentive to respond along with the questionnaire? In one experiment, 110 questionnaires with no incentive resulted in 75 being returned, whereas 98 questionnaires that included a chance to win a lottery yielded 66 responses (“Charities, No; Lotteries, No; Cash, Yes,” Public Opinion Q. 1996: 542–562). Does this data suggest that including an incentive increases the likelihood of a response? State and test the relevant hypotheses at significance level .10 by using the P-value method.
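A minimal sketch of the large-sample test for two proportions, applied to the questionnaire counts above, might look as follows. It assumes the pooled-proportion z statistic and an upper-tailed alternative (incentive minus no incentive); the function name `two_proportion_z` is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Large-sample z statistic for H0: p1 = p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # common proportion estimated under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Sample 1: lottery incentive (66 of 98); sample 2: no incentive (75 of 110)
z = two_proportion_z(66, 98, 75, 110)
p_value = 1 - NormalDist().cdf(z)  # upper-tailed: Ha is p1 - p2 > 0
print(f"z = {z:.3f}, P-value = {p_value:.3f}")
```

The P-value is then compared with the stated significance level of .10.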

  101.

    The article “Quantitative MRI and Electrophysiology of Preoperative Carpal Tunnel Syndrome in a Female Population” (Ergonomics 1997: 642–649) reported that (−473.3, 1691.9) was a large-sample 95% confidence interval for the difference between true average thenar muscle volume (mm³) for sufferers of carpal tunnel syndrome and true average volume for nonsufferers. Calculate and interpret a 90% confidence interval for this difference.
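Because a large-sample interval has the form estimate ± z × (standard error), the reported 95% interval can be converted to another confidence level by rescaling its half-width. The sketch below assumes the interval is symmetric about the point estimate and based on a z critical value; `rescale_ci` is an illustrative name.

```python
from statistics import NormalDist

def rescale_ci(lower, upper, old_level, new_level):
    """Convert a symmetric z-based CI from one confidence level to another."""
    center = (lower + upper) / 2  # point estimate of the difference
    half = (upper - lower) / 2    # equals z_old * standard error
    z_old = NormalDist().inv_cdf(0.5 + old_level / 2)
    z_new = NormalDist().inv_cdf(0.5 + new_level / 2)
    new_half = half * z_new / z_old
    return center - new_half, center + new_half

lo90, hi90 = rescale_ci(-473.3, 1691.9, 0.95, 0.90)
print(f"90% CI: ({lo90:.1f}, {hi90:.1f})")
```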

  102.

    The following summary data on bending strength (lb-in/in) of joints is taken from the article “Bending Strength of Corner Joints Constructed with Injection Molded Splines” (Forest Prod. J. April 1997: 89–92). Assume normal distributions.

| Type | Sample size | Sample mean | Sample SD |
|---|---|---|---|
| Without side coating | 10 | 80.95 | 9.59 |
| With side coating | 10 | 63.23 | 5.96 |

  a.

    Calculate a 95% lower confidence bound for true average strength of joints with a side coating.

  b.

    Calculate a 95% lower prediction bound for the strength of a single joint with a side coating.

  c.

    Calculate a 95% confidence interval for the difference between true average strengths for the two types of joints.

  103.

    An experiment was carried out to compare various properties of cotton/polyester spun yarn finished with softener only and yarn finished with softener plus 5% DP-resin (“Properties of a Fabric Made with Tandem Spun Yarns,” Textile Res. J. 1996: 607–611). One particularly important characteristic of fabric is its durability, that is, its ability to resist wear. For a sample of 40 softener-only specimens, the sample mean stoll-flex abrasion resistance (cycles) in the filling direction of the yarn was 3975.0, with a sample standard deviation of 245.1. Another sample of 40 softener-plus specimens gave a sample mean and sample standard deviation of 2795.0 and 293.7, respectively. Calculate a confidence interval with confidence level 99% for the difference between true average abrasion resistances for the two types of fabrics. Does your interval provide convincing evidence that true average resistances differ for the two types of fabrics? Why or why not?
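With 40 specimens per sample, a large-sample z interval is one reasonable choice here. The following sketch computes it from the summary statistics; the helper `large_sample_ci` is a hypothetical name, and a t-based interval with Welch df would be an alternative.

```python
import math
from statistics import NormalDist

def large_sample_ci(mean1, sd1, n1, mean2, sd2, n2, level):
    """Large-sample z confidence interval for mu1 - mu2 from summary data."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - z * se, diff + z * se

# Softener only minus softener plus DP-resin, 99% confidence
ci_lo, ci_hi = large_sample_ci(3975.0, 245.1, 40, 2795.0, 293.7, 40, 0.99)
print(f"99% CI: ({ci_lo:.1f}, {ci_hi:.1f})")
```

Whether the interval contains 0 is what bears on the question of a difference in true average resistance.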

  104.

    The derailment of a freight train due to the catastrophic failure of a traction motor armature bearing provided the impetus for a study reported in the article “Locomotive Traction Motor Armature Bearing Life Study” (Lubricat. Engr. August 1997: 12–19). A sample of 17 high-mileage traction motors was selected, and the amount of cone penetration (mm/10) was determined both for the pinion bearing and for the commutator armature bearing, resulting in the following data:

| Motor | Commutator | Pinion |
|---|---|---|
| 1 | 211 | 226 |
| 2 | 273 | 278 |
| 3 | 305 | 259 |
| 4 | 258 | 244 |
| 5 | 270 | 273 |
| 6 | 209 | 236 |
| 7 | 223 | 290 |
| 8 | 288 | 287 |
| 9 | 296 | 287 |
| 10 | 233 | 242 |
| 11 | 262 | 288 |
| 12 | 291 | 242 |
| 13 | 278 | 278 |
| 14 | 275 | 208 |
| 15 | 210 | 281 |
| 16 | 272 | 274 |
| 17 | 264 | 274 |

Calculate an estimate of the population mean difference between penetration for the commutator armature bearing and penetration for the pinion bearing, and do so in a way that conveys information about the reliability and precision of the estimate. [Note: A normal probability plot validates the necessary normality assumption.] Would you say that the population mean difference has been precisely estimated? Does it look as though population mean penetration differs for the two types of bearings? Explain.
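Since each motor contributes both measurements, the data are paired, and a paired t interval on the differences is the natural tool. The sketch below uses only the standard library and hard-codes the tabled critical value t(.025, 16) = 2.120 for a 95% interval; the 95% level is an assumption, since the exercise does not fix one.

```python
import math
from statistics import mean, stdev

commutator = [211, 273, 305, 258, 270, 209, 223, 288, 296,
              233, 262, 291, 278, 275, 210, 272, 264]
pinion = [226, 278, 259, 244, 273, 236, 290, 287, 287,
          242, 288, 242, 278, 208, 281, 274, 274]

# Within-motor differences: commutator minus pinion
diffs = [c - p for c, p in zip(commutator, pinion)]
n_pairs = len(diffs)
d_bar = mean(diffs)
s_d = stdev(diffs)  # sample SD of the differences (n - 1 divisor)
se_d = s_d / math.sqrt(n_pairs)

T_CRIT = 2.120  # t critical value for 16 df, 95% two-sided (from a t table)
paired_ci = (d_bar - T_CRIT * se_d, d_bar + T_CRIT * se_d)
print(f"mean diff = {d_bar:.2f}, sd = {s_d:.2f}, 95% CI = "
      f"({paired_ci[0]:.1f}, {paired_ci[1]:.1f})")
```

The width of the interval speaks to precision, and its position relative to 0 speaks to whether the two bearing types appear to differ.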

  105.

    The article “Two Parameters Limiting the Sensitivity of Laboratory Tests of Condoms as Viral Barriers” (J. Test. Eval. 1996: 279–286) reported that, in brand A condoms, among 16 tears produced by a puncturing needle, the sample mean tear length was 74.0 μm, whereas for the 14 brand B tears, the sample mean length was 61.0 μm (determined using light microscopy and scanning electron micrographs). Suppose the sample standard deviations are 14.8 and 12.5, respectively (consistent with the sample ranges given in the article). The authors commented that the thicker brand B condom displayed a smaller mean tear length than the thinner brand A condom. Is this difference in fact statistically significant? State the appropriate hypotheses and test at α = .05.

  106.

    Information about hand posture and forces generated by the fingers during manipulation of various daily objects is needed for designing high-tech hand prosthetic devices. The article “Grip Posture and Forces During Holding Cylindrical Objects with Circular Grips” (Ergonomics 1996: 1163–1176) reported that for a sample of 11 females, the sample mean four-finger pinch strength (N) was 98.1 and the sample standard deviation was 14.2. For a sample of 15 males, the sample mean and sample standard deviation were 129.2 and 39.1, respectively.

    a.

      A test carried out to see whether true average strengths for the two genders were different resulted in t = 2.51 and P-value = .019. Does the appropriate test procedure described in this chapter yield this value of t and the stated P-value?

    b.

      Is there substantial evidence for concluding that true average strength for males exceeds that for females by more than 25 N? State and test the relevant hypotheses.

  107.

    After the Enron scandal in the fall of 2001, faculty in accounting began to incorporate ethics more into accounting courses. One study looked at the effectiveness of such educational interventions “pre-Enron” and “post-Enron.” The data below shows students’ improvement in score on the Accounting Ethical Dilemma Instrument (AEDI) across a one-semester accounting class in Spring 2001 (“pre-Enron”) and another in Spring 2002 (“post-Enron”). (From “A Note in Ethics Educational Interventions in an Undergraduate Auditing Course: Is There an ‘Enron Effect’?” Issues Account. Educ. 2004: 53–71.)

  

Improvement in AEDI score

| Class | n | Mean | SD |
|---|---|---|---|
| 2001 (pre-Enron) | 37 | 5.48 | 13.83 |
| 2002 (post-Enron) | 21 | 6.31 | 13.20 |

  a.

    Test to see whether the 2001 class showed a statistically significant improvement in AEDI score across the semester.

  b.

    Test to see whether the 2002 class showed a statistically significant improvement in AEDI score across the semester.

  c.

    Test to see whether the 2002 class showed a statistically significantly greater improvement in AEDI score than the 2001 class. In this respect, does there appear to be an “Enron effect”?

  108.

    Torsion during hip external rotation (ER) and extension may be responsible for certain kinds of injuries in golfers and other athletes. The article “Hip Rotational Velocities during the Full Golf Swing” (J. Sport Sci. Med. 2009: 296–299) reported on a study in which peak ER velocity and peak IR (internal rotation) velocity (both in deg/s) were determined for a sample of 15 female collegiate golfers during their swings. The following data was supplied by the article’s authors.

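| Golfer | ER | IR | Diff. |
|---|---|---|---|
| 1 | −130.6 | −98.9 | −31.7 |
| 2 | −125.1 | −115.9 | −9.2 |
| 3 | −51.7 | −161.6 | 109.9 |
| 4 | −179.7 | −196.9 | 17.2 |
| 5 | −130.5 | −170.7 | 40.2 |
| 6 | −101.0 | −274.9 | 173.9 |
| 7 | −24.4 | −275.0 | 250.6 |
| 8 | −231.1 | −275.7 | 44.6 |
| 9 | −186.8 | −214.6 | 27.8 |
| 10 | −58.5 | −117.8 | 59.3 |
| 11 | −219.3 | −326.7 | 107.4 |
| 12 | −113.1 | −272.9 | 159.8 |
| 13 | −244.3 | −429.1 | 184.8 |
| 14 | −184.4 | −140.6 | −43.8 |
| 15 | −199.2 | −345.6 | 146.4 |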

  a.

    Is it plausible that the differences came from a normally distributed population?

  b.

    The article reported that mean (sd) = −145.3 (68.0) for ER velocity and mean (sd) = −227.8 (96.6) for IR velocity. Based just on this information, could a test of hypotheses about the difference between true average IR velocity and true average ER velocity be carried out? Explain.

  c.

    Do an appropriate hypothesis test about the difference between true average IR velocity and true average ER velocity and interpret the result.

  109.

    The accompanying summary data on the ratio of strength to cross-sectional area for knee extensors is taken from the article “Knee Extensor and Knee Flexor Strength: Cross-Sectional Area Ratios in Young and Elderly Men” (J. Gerontol. 1992: M204–M210).

| Group | Sample size | Sample mean | Standard error |
|---|---|---|---|
| Young | 13 | 7.47 | .22 |
| Elderly men | 12 | 6.71 | .28 |

Does this data suggest that the true average ratio for young men exceeds that for elderly men? Carry out a test of appropriate hypotheses using α = .05. Be sure to state any assumptions necessary for your analysis.

  110.

    The accompanying data on response time appeared in the article “The Extinguishment of Fires Using Low-Flow Water Hose Streams—Part II” (Fire Tech. 1991: 291–320). The samples are independent, not paired.

Good visibility: .43, 1.17, .37, .47, .68, .58, .50, 2.75

Poor visibility: 1.47, .80, 1.58, 1.53, 4.33, 4.23, 3.25, 3.22

The authors analyzed the data with the pooled t test. Does the use of this test appear justified? [Hint: Check for normality.]

  111.

    The accompanying data on the alcohol content of wine is representative of that reported in a study in which wines from the years 1999 and 2000 were randomly selected and the actual content was determined by laboratory analysis (London Times August 5, 2001).

| Wine | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Actual | 14.2 | 14.5 | 14.0 | 14.9 | 13.6 | 12.6 |
| Label | 14.0 | 14.0 | 13.5 | 15.0 | 13.0 | 12.5 |

The two-sample t test gives a test statistic value of .62 and a two-tailed P-value of .55. Does this convince you that there is no significant difference between true average actual alcohol content and true average content stated on the label? Explain.

  112.

    The article “The Accuracy of Stated Energy Contents of Reduced-Energy, Commercially Prepared Foods” (J. Am. Diet. Assoc. 2010: 116–123) presented the accompanying data on vendor-stated gross energy and measured value (both in kcal) for 10 different supermarket convenience meals:

| Meal | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Stated | 180 | 220 | 190 | 230 | 200 | 370 | 250 | 240 | 80 | 180 |
| Measured | 212 | 319 | 231 | 306 | 211 | 431 | 288 | 265 | 145 | 228 |

Obtain a 95% confidence interval for the difference of population means. By roughly what percentage are the actual calories higher than the stated value?

Note that the article calls this a convenience sample and suggests that therefore it should have limited value for inference. However, even if the ten meals were a random sample from their local store, there could still be a problem in drawing conclusions about a purchase at your store.

  113.

    How does energy intake compare to energy expenditure? One aspect of this issue was considered in the article “Measurement of Total Energy Expenditure by the Doubly Labelled Water Method in Professional Soccer Players” (J. Sports Sci. 2002: 391–397), which contained the accompanying data (MJ/day).

| Player | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Expenditure | 14.4 | 12.1 | 14.3 | 14.2 | 15.2 | 15.5 | 17.8 |
| Intake | 14.6 | 9.2 | 11.8 | 11.6 | 12.7 | 15.0 | 16.3 |

Test to see whether there is a significant difference between intake and expenditure. Does the conclusion depend on whether a significance level of .05, .01, or .001 is used?

  114.

    An experimenter wishes to obtain a CI for the difference between true average breaking strength for cables manufactured by company I and by company II. Suppose breaking strength is normally distributed for both types of cable with \( \sigma_{1} = 30 \) psi and \( \sigma_{2} = 20 \) psi.

    a.

      If costs dictate that the sample size for the type I cable should be three times the sample size for the type II cable, how many observations are required if the 99% CI is to be no wider than 20 psi?

    b.

      Suppose a total of 400 observations is to be made. How many of the observations should be made on type I cable samples if the width of the resulting interval is to be a minimum?
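Both parts rest on the width of the known-variance z interval, \( 2 z_{\alpha/2} \sqrt{\sigma_{1}^{2}/m + \sigma_{2}^{2}/n} \). A brute-force numerical sketch is given below; it assumes m denotes the type I sample size and n the type II sample size, and it reuses the 99% level from part (a) in part (b), although the minimizing allocation does not depend on the level.

```python
import math
from statistics import NormalDist

SIGMA1, SIGMA2 = 30.0, 20.0           # known population SDs (psi)
Z_99 = NormalDist().inv_cdf(0.995)    # critical value for a 99% interval

def ci_width(m, n):
    """Width of the z interval for mu1 - mu2 with m and n observations."""
    return 2 * Z_99 * math.sqrt(SIGMA1 ** 2 / m + SIGMA2 ** 2 / n)

# Part (a): m = 3n; find the smallest n giving width <= 20 psi
n_a = 1
while ci_width(3 * n_a, n_a) > 20:
    n_a += 1
m_a = 3 * n_a

# Part (b): m + n = 400; search for the m that minimizes the width
best_m = min(range(1, 400), key=lambda m: ci_width(m, 400 - m))

print(f"(a) n = {n_a}, m = {m_a}; (b) m = {best_m}, n = {400 - best_m}")
```

The search in part (b) can be checked against the calculus result that the width is minimized when m/n equals the ratio of the two standard deviations.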

  115.

    To assess the tendency of people to rationalize poor performance, 246 college students were randomly assigned to one of two groups: a negative feedback group and a positive feedback group. All students took a test which asked them to identify people’s emotions based on photographs of their faces. Those in the negative feedback group were all given D grades, while those in the positive feedback group received A’s (regardless of how they actually performed). A follow-up questionnaire asked students to assess the validity of the test and the importance of being able to read people’s faces. The results of these two follow-up surveys appear below.

  

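| Group | n | Test validity rating: \( \bar{x} \) | Test validity rating: s | Face reading importance rating: \( \bar{x} \) | Face reading importance rating: s |
|---|---|---|---|---|---|
| Positive feedback | 123 | 6.95 | 1.09 | 6.62 | 1.19 |
| Negative feedback | 123 | 5.51 | 0.79 | 5.36 | 1.00 |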

  a.

    Test the hypothesis that negative feedback is associated with a lower average validity rating than positive feedback at the α = .01 level.

  b.

    Test the hypothesis that students receiving positive feedback rate face-reading as more important, on average, than do students receiving negative feedback. Again use a 1% significance level.

  c.

    Is it reasonable to conclude that the results seen in parts (a) and (b) are attributable to the different types of feedback? Why or why not?

  116.

    The insulin-binding capacity (pmol/mg protein) was measured for four different groups of rats: (1) nondiabetic, (2) untreated diabetic, (3) diabetic treated with a low dose of insulin, (4) diabetic treated with a high dose of insulin. The accompanying table gives sample sizes and sample standard deviations. Denote the sample size for the ith treatment by \( n_{i} \) and the sample variance by \( S_{i}^{2} \) (i = 1, 2, 3, 4). Assuming that the true variance for each treatment is \( \sigma^{2} \), construct a pooled estimator of \( \sigma^{2} \) that is unbiased, and verify using rules of expected value that it is indeed unbiased. What is your estimate for the following actual data? [Hint: Modify the pooled estimator \( S_{p}^{2} \) from Section 10.2.]

 

| Treatment | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Sample size | 16 | 18 | 8 | 12 |
| Sample SD | .64 | .81 | .51 | .35 |
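One natural candidate, following the hint, weights each sample variance by its degrees of freedom: \( \sum (n_{i} - 1) S_{i}^{2} / (\sum n_{i} - 4) \). The sketch below evaluates that candidate for the tabled data; the function name is illustrative, and the unbiasedness argument is still left to the reader.

```python
def pooled_variance(sizes, sds):
    """Degrees-of-freedom-weighted average of the sample variances."""
    numerator = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    denominator = sum(sizes) - len(sizes)  # total df across all groups
    return numerator / denominator

sizes = [16, 18, 8, 12]
sds = [0.64, 0.81, 0.51, 0.35]
s2_pooled = pooled_variance(sizes, sds)
print(f"pooled variance estimate = {s2_pooled:.4f}")
```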

  117.

    Suppose a level .05 test of \( H_{0}{:}\;\mu_{1} - \mu_{2} = 0 \) versus \( H_{a}{:}\;\mu_{1} - \mu_{2} > 0 \) is to be performed, assuming \( \sigma_{1} = \sigma_{2} = 10 \) and normality of both distributions, using equal sample sizes (m = n). Evaluate the probability of a type II error when \( \mu_{1} - \mu_{2} = 1 \) and n = 25, 100, 2500, and 10,000. Can you think of real situations in which the difference \( \mu_{1} - \mu_{2} = 1 \) has little practical significance? Would sample sizes of n = 10,000 be desirable in such situations?
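For an upper-tailed z test with known standard deviations, the type II error probability at a true difference Δ′ is \( \beta = \Phi\left( z_{\alpha} - \Delta^{\prime} / (\sigma \sqrt{2/n}) \right) \) when m = n and \( \sigma_{1} = \sigma_{2} = \sigma \). A short sketch evaluating this for the four sample sizes follows; the function name and default arguments are illustrative assumptions.

```python
import math
from statistics import NormalDist

def type2_error(n, delta=1.0, sigma=10.0, alpha=0.05):
    """Beta for the upper-tailed two-sample z test with m = n observations."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    se = sigma * math.sqrt(2 / n)  # SD of X-bar minus Y-bar
    return std_normal.cdf(z_alpha - delta / se)

for n in (25, 100, 2500, 10_000):
    print(f"n = {n:>6}: beta = {type2_error(n):.4f}")
```

The output shows β shrinking toward 0 as n grows, which is the point of the closing questions about practical significance.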

  118.

    Are male college students more easily bored than their female counterparts? This question was examined in the article “Boredom in Young Adults—Gender and Cultural Comparisons” (J. Cross-Cult. Psych. 1991: 209–223). The authors administered a scale called the Boredom Proneness Scale to 97 male and 148 female U.S. college students. Does the accompanying data support the research hypothesis that the mean Boredom Proneness Rating is higher for men than for women? Test the appropriate hypotheses using a .05 significance level.

| Sex | Sample size | Sample mean | Sample SD |
|---|---|---|---|
| Male | 97 | 10.40 | 4.83 |
| Female | 148 | 9.26 | 4.68 |

  119.

    Researchers sent 5000 resumes in response to job ads that appeared in the Boston Globe and Chicago Tribune. The resumes were identical except that 2500 of them had “white sounding” first names, such as Brett and Emily, whereas the other 2500 had “black sounding” names such as Tamika and Rasheed. The resumes of the first type elicited 250 responses and the resumes of the second type only 167 responses (these numbers are consistent with information that appeared in a January 15, 2003, report by the Associated Press). Does this data strongly suggest that a resume with a “black” name is less likely to result in a response than is a resume with a “white” name?

  120.

    Is touching by a coworker sexual harassment? This question was included on a survey given to federal employees, who responded on a scale of 1–5, with 1 meaning a strong negative and 5 indicating a strong yes. The table summarizes the results.

| Sex | Sample size | Sample mean | Sample SD |
|---|---|---|---|
| Female | 4343 | 4.6056 | .8659 |
| Male | 3903 | 4.1709 | 1.2157 |

Of course, with 1–5 being the only possible values, the normal distribution does not apply here, but the sample sizes are sufficient that it does not matter. Obtain a two-sided confidence interval for the difference in population means. Does your interval suggest that females are more likely than males to regard touching as harassment? Explain your reasoning.

  121.

    Let \( X_{1}, \ldots, X_{m} \) be a random sample from a Poisson distribution with parameter \( \mu_{1} \), and let \( Y_{1}, \ldots, Y_{n} \) be a random sample from another Poisson distribution with parameter \( \mu_{2} \). We wish to test \( H_{0}{:}\;\mu_{1} - \mu_{2} = 0 \) against one of the three standard alternatives. When m and n are large, the CLT justifies using a large-sample z test. However, the fact that \( V(\overline{X}) = \mu /n \) suggests that a different denominator should be used in standardizing \( \overline{X} - \overline{Y} \). Develop a large-sample test procedure appropriate to this problem, and then apply it to the following data to test whether the plant densities for a particular species are equal in two different regions (where each observation is the number of plants found in a randomly located square sampling quadrat having area 1 m², so for region 1, there were 40 quadrats in which one plant was observed, etc.):

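Frequency (number of quadrats) by number of plants observed

| Region | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Sample size |
|---|---|---|---|---|---|---|---|---|---|
| Region 1 | 28 | 40 | 28 | 17 | 8 | 2 | 1 | 1 | m = 125 |
| Region 2 | 14 | 25 | 30 | 18 | 49 | 2 | 1 | 1 | n = 140 |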

  122.

    Referring to the previous exercise, develop a large-sample confidence interval formula for \( \mu_{1} - \mu_{2} \). Calculate the interval for the data given there using a confidence level of 95%.
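One possible approach to these two exercises, sketched below, exploits the fact that a Poisson variance equals its mean, so that \( V(\overline{X} - \overline{Y}) \) can be estimated by \( \overline{x}/m + \overline{y}/n \). This is an assumption about the intended procedure rather than the text's stated solution; a pooled estimate of the common mean under \( H_{0} \) would be another defensible denominator for the test.

```python
import math
from statistics import NormalDist

plant_counts = range(8)  # possible numbers of plants per quadrat: 0..7
region1 = [28, 40, 28, 17, 8, 2, 1, 1]   # quadrat frequencies, region 1
region2 = [14, 25, 30, 18, 49, 2, 1, 1]  # quadrat frequencies, region 2

def freq_mean(freqs):
    """Sample mean of a frequency table over plant_counts."""
    return sum(k * f for k, f in zip(plant_counts, freqs)) / sum(freqs)

m, n = sum(region1), sum(region2)
x_bar, y_bar = freq_mean(region1), freq_mean(region2)

# Poisson: variance equals mean, so V(X-bar) is estimated by x_bar / m
se = math.sqrt(x_bar / m + y_bar / n)
z = (x_bar - y_bar) / se

z_crit = NormalDist().inv_cdf(0.975)
poisson_ci = (x_bar - y_bar - z_crit * se, x_bar - y_bar + z_crit * se)
print(f"z = {z:.2f}, 95% CI = ({poisson_ci[0]:.3f}, {poisson_ci[1]:.3f})")
```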

  123.

    Refer back to the pooled t procedures described at the end of Section 10.2. The test statistic for testing \( H_{0} {:}\;\mu_{1} - \mu_{2} = \Delta_{0} \) is

    $$ T_{p} = \frac{{(\overline{X} - \overline{Y}) - \Delta_{0} }}{{\sqrt {\frac{{S_{p}^{2} }}{m} + \frac{{S_{p}^{2} }}{n}} }} = \frac{{(\overline{X} - \overline{Y}) - \Delta_{0} }}{{S_{p} \sqrt {\left( {\frac{1}{m} + \frac{1}{n}} \right)} }} $$

    Show that when \( \mu_{1} - \mu_{2} = \Delta^{\prime} \) (some alternative value for the difference), then \( T_{p} \) has a noncentral t distribution with df = m + n − 2 and noncentrality parameter

    $$ \delta = \frac{{\Delta^{\prime} - \Delta_{0} }}{{\sigma \sqrt {\tfrac{1}{m} + \tfrac{1}{n}} }} $$

    [Hint: Look back at Exercises 39–40, as well as Chapter 9 Exercise 38.]

  124.

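    Let \( R_{1} \) be a rejection region with significance level α for testing \( H_{01}{:}\;\theta \in \Omega_{1} \) versus \( H_{a1}{:}\;\theta \notin \Omega_{1} \), and let \( R_{2} \) be a level α rejection region for testing \( H_{02}{:}\;\theta \in \Omega_{2} \) versus \( H_{a2}{:}\;\theta \notin \Omega_{2} \), where \( \Omega_{1} \) and \( \Omega_{2} \) are two disjoint sets of possible values of θ. Now consider testing \( H_{0}{:}\;\theta \in \Omega_{1} \cup \Omega_{2} \) versus the alternative \( H_{a}{:}\;\theta \notin \Omega_{1} \cup \Omega_{2} \). The proposed rejection region is \( R_{1} \cap R_{2} \). That is, \( H_{0} \) is rejected only if both \( H_{01} \) and \( H_{02} \) can be rejected. This procedure is called a union-intersection test (UIT).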

    a.

      Show that the UIT is a level α test.

    b.

      As an example, let \( \mu_{T} \) denote the mean value of a particular variable for a generic (test) drug, and \( \mu_{R} \) denote the mean value of this variable for a brand-name (reference) drug. In bioequivalence testing, the relevant hypotheses are \( H_{0}{:}\;\mu_{T}/\mu_{R} \le \delta_{L} \) or \( \mu_{T}/\mu_{R} \ge \delta_{U} \) (the two aren’t bioequivalent) versus \( H_{a}{:}\;\delta_{L} < \mu_{T}/\mu_{R} < \delta_{U} \) (bioequivalent). The limits \( \delta_{L} \) and \( \delta_{U} \) are standards set by regulatory agencies; the FDA often uses .80 and 1.25 = 1/.8, respectively. By taking logarithms and letting η = ln(μ), τ = ln(δ), the hypotheses become \( H_{0} \): either \( \eta_{T} - \eta_{R} \le \tau_{L} \) or \( \eta_{T} - \eta_{R} \ge \tau_{U} \) versus \( H_{a}{:}\;\tau_{L} < \eta_{T} - \eta_{R} < \tau_{U} \). With this setup, a type I error involves saying the drugs are bioequivalent when they are not. The FDA mandates α = .05.

      Let D be an estimator of \( \eta_{T} - \eta_{R} \) with standard error \( S_{D} \) such that the standardized variable \( T = [D - (\eta_{T} - \eta_{R})]/S_{D} \) has a t distribution with \( \nu \) df. The standard test procedure is referred to as TOST for “two one-sided tests” and is based on the two test statistics \( T_{U} = (D - \tau_{U})/S_{D} \) and \( T_{L} = (D - \tau_{L})/S_{D} \). If \( \nu = 20 \), state the appropriate conclusion in each of the following cases: (1) \( t_{L} = 2.0 \), \( t_{U} = -1.5 \); (2) \( t_{L} = 1.5 \), \( t_{U} = -2.0 \); (3) \( t_{L} = 2.0 \), \( t_{U} = -2.0 \).
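The TOST decision rule can be sketched in a few lines. The sketch reads the three listed cases as observed values of the two test statistics (an assumption, since the limits themselves must satisfy \( \tau_{L} < \tau_{U} \)), uses the tabled critical value t(.05, 20) = 1.725, and rejects \( H_{0} \) only when both one-sided tests reject. The function name `tost_conclusion` is hypothetical.

```python
T_CRIT = 1.725  # upper-tail t critical value for alpha = .05, 20 df (t table)

def tost_conclusion(t_lower, t_upper, t_crit=T_CRIT):
    """Reject H0 (conclude bioequivalence) only if both one-sided tests reject.

    t_lower tests the lower-limit hypothesis (reject when large);
    t_upper tests the upper-limit hypothesis (reject when very negative).
    """
    if t_lower >= t_crit and t_upper <= -t_crit:
        return "reject H0: conclude bioequivalence"
    return "do not reject H0"

for t_l, t_u in [(2.0, -1.5), (1.5, -2.0), (2.0, -2.0)]:
    print(f"t_L = {t_l}, t_U = {t_u}: {tost_conclusion(t_l, t_u)}")
```

Requiring both rejections is what makes the rejection region the intersection of the two individual regions.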


Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this chapter

Devore, J.L., Berk, K.N., Carlton, M.A. (2021). Inferences Based on Two Samples. In: Modern Mathematical Statistics with Applications. Springer Texts in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-030-55156-8_10
