Environmental Monitoring and Assessment, Volume 184, Issue 10, pp 6367–6375

Statistical evaluation of data from multi-laboratory testing of a measurement method intended to indicate the presence of dust resulting from the collapse of the World Trade Center

  • Henry D. Kahn
  • Jacky A. Rosati
  • Andrew P. Bray

DOI: 10.1007/s10661-011-2426-7

Cite this article as:
Kahn, H.D., Rosati, J.A. & Bray, A.P. Environ Monit Assess (2012) 184: 6367. doi:10.1007/s10661-011-2426-7

Abstract

In this paper we describe a statistical analysis of the inter-laboratory data summarized in Rosati et al. (2008) to assess the performance of an analytical method to detect the presence of dust from the collapse of the World Trade Center (WTC) on September 11, 2001. The focus of the inter-lab study was the measurement of the concentration of slag wool fibers in dust, which was considered to be an indicator of WTC dust. Eight labs were each provided with two blinded samples from each of three batches of dust that varied in slag wool concentration. Analysis of the data revealed that three of the labs, which did not meet measurement quality objectives set forth prior to the experimental work, were statistically distinguishable from the five labs that did meet the quality objectives. The five labs, as a group, demonstrated better measurement capability, although their ability to distinguish between the batches was somewhat mixed. This work provides important insights for the planning and implementation of future studies involving examination of dust samples for physical contaminants. It demonstrates (a) the importance of controlling the amount of dust analyzed, (b) the need to take additional replicates to improve count estimates, and (c) the need to address issues related to the execution of the analytical methodology to ensure all labs meet the measurement quality objectives.

Keywords

Dust Analytical method Inter-lab testing Slag wool World Trade Center Statistical analysis 

Introduction and background

The collapse of the World Trade Center (WTC) on September 11, 2001 was an environmental disaster of significant magnitude. There were three general sources of contamination associated with the collapse that were released into lower Manhattan and surrounding areas: (1) a large amount of dust and debris from the initial collapse; (2) fires that burned in the rubble for an extended period of time following the collapse; and (3) emissions from the large number of vehicles and equipment required to support rescue, clean-up, and construction activities. A number of measures of residual contamination from the WTC disaster were investigated, including PAHs, dioxins, heavy metals, other airborne and re-suspended species, and total suspended particulate matter (see, e.g., Pleil et al. [2004]). Concerns about the possible toxicity and the extent of the distribution of the residual WTC contamination were a focus of the World Trade Center Expert Technical Review Panel (WTCETRP). This panel was convened to advise EPA on a range of issues related to the effects of the collapse. The record of the interaction between the WTCETRP and the EPA may be found at http://www.epa.gov/wtc/panel/. Following guidance provided by the WTCETRP, EPA proceeded with the development of an analytical measurement method that would differentiate dust originating from the WTC collapse from typical urban dust. The method was intended to be used in a proposed field survey (which was not conducted) of the area surrounding the WTC site to estimate the geographic extent of the distribution of possible WTC-related contamination. Based on prior work by Panel member Greg Meeker and co-workers of the USGS (Meeker et al. [2005]), the WTCETRP concluded that slag wool fibers, as well as gypsum and components of concrete present in substantial quantities in samples of dust, should serve as an indicator or “signature” for the presence in the sample of dust from the WTC collapse.
Slag wool is a form of man-made vitreous fiber that was used extensively as an insulating and fireproofing material in the construction of the WTC. As a common building material, slag wool fibers might well be expected to be present in typical urban dust, but not at the extremely high levels found in samples of dust known to have been affected by the WTC collapse. The development of the method for the measurement of slag wool fibers in dust, as well as the analysis of gypsum and components of concrete, is described in Rosati et al. (2008). The results from the initial analysis of gypsum and components of concrete were inconclusive, and that analysis was therefore not pursued as a screening option. Rosati et al. (2008) includes a description of the inter-lab study of the slag wool measurement method and a brief summary of the results. In this paper, we describe in detail the statistical analysis of the data from the inter-lab study of the slag wool method, present the data generated in the study, and discuss the implications of the results.

The measurement method for slag wool

The following is a general overview of the process developed by the US EPA for measuring slag wool fibers in dust. This method is complex and requires a significant number of manual, operator-performed procedures. For a detailed description of the measurement method, see Rosati et al. (2008) and US EPA (2005). Samples of dust were subjected to a number of processing steps (weighing, ashing, and aliquot formation), followed by visual examination with a scanning electron microscope to arrive at a fiber count associated with the dust sample weight. For each sample, the observed fiber count was then converted to a measured value in units of fibers per gram using an analytic parameter that reflects the effective amount of dust examined. For every reported sample measurement in units of fibers per gram of dust, there is a corresponding measurement of the number of fibers counted. Two samples of dust may have the same number of fibers counted but differ in terms of fibers per gram because different amounts of dust were examined. The number of slag wool fibers counted in a sample is the actual “observed value”; thus, measurement uncertainty is directly related to the number of fibers counted and the amount of dust examined.
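The relationship between fiber count, amount of dust examined, and the reported concentration can be illustrated with a minimal sketch. It assumes the "analytic parameter" can be treated simply as the effective mass of dust examined, in grams; the function names and the example masses are hypothetical and are not taken from the study. The relative uncertainty shown assumes Poisson counting statistics, which is an illustrative assumption only.

```python
import math


def fibers_per_gram(fiber_count: int, dust_examined_g: float) -> float:
    """Convert an observed fiber count to fibers per gram of dust.

    dust_examined_g is a stand-in for the effective amount of dust
    examined (hypothetical simplification of the analytic parameter).
    """
    if dust_examined_g <= 0:
        raise ValueError("amount of dust examined must be positive")
    return fiber_count / dust_examined_g


def poisson_relative_sd(fiber_count: int) -> float:
    """Relative standard deviation of a count, assuming Poisson counting."""
    if fiber_count <= 0:
        raise ValueError("fiber count must be positive")
    return 1.0 / math.sqrt(fiber_count)


# Same count, different (hypothetical) amounts of dust examined:
# the reported concentrations differ by a factor of two.
conc_small_aliquot = fibers_per_gram(10, 0.0005)
conc_large_aliquot = fibers_per_gram(10, 0.0010)

# Fewer fibers counted means larger relative uncertainty.
rel_sd_low_count = poisson_relative_sd(4)
rel_sd_high_count = poisson_relative_sd(16)

print(conc_small_aliquot, conc_large_aliquot)
print(rel_sd_low_count, rel_sd_high_count)
```

Under these assumptions, quadrupling the number of fibers counted halves the relative uncertainty of the count, which is why both the count and the amount of dust examined matter for the precision of the reported value.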

Description of the multi-laboratory study and resulting data

Generic background dust was spiked with WTC dust for evaluation by multiple laboratories (Rosati et al., 2008). These dust samples were created to represent the expected lower range of concentration of slag wool in dust at sites affected by the WTC collapse. The focus on the expected lower concentration range of slag wool was an attempt to approximate the region of overlap between the higher levels of slag wool in typical urban dust and the lower levels of slag wool associated with dust affected by the WTC collapse. Test samples were formulated from dust collected in 2004 inside the Deutsche Bank (DB) building at 4 Albany Street (near the WTC site). The DB building was damaged by the collapse of the WTC, and large amounts of dust and debris from the WTC penetrated the building. The dust collected from the DB building was formulated into batches of 1%, 5%, and 10% dilution, and two replicate samples from each batch were sent to eight labs for blind testing. The results reported by the labs are shown in Table 1 in units of fibers per gram and plotted in Fig. 1. The reported fiber counts corresponding to each of the measurement results shown in Table 1 are shown in Table 2.
Table 1

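Slag wool inter-lab data: fibers per gram of dust (unashed) measured in blind test samples, two replicate samples per laboratory per batch

| Batch | Row | Lab A | Lab B | Lab C | Lab D | Lab E | Lab F | Lab G | Lab H | Batch mean |
|---|---|---|---|---|---|---|---|---|---|---|
| 1% | Replicate | 3,241 | 6,151 | 3,074 | 16,859 | 797 | 1,083 | 4,314 | 8,979 | |
| | Replicate | 19,637 | 9,823 | 5,149 | 15,393 | 2,126 | 7,048 | 744 | 9,043 | |
| | Lab batch mean | 11,439 | 7,987 | 4,111 | 16,126 | 1,461 | 4,065 | 2,529 | 9,011 | 7,091 |
| 5% | Replicate | 41,299 | 16,618 | 18,432 | 28,913 | 17,644 | 968 | 3,546 | 60,982 | |
| | Replicate | 38,587 | 14,383 | 19,150 | 20,376 | 3,927 | 8,367 | 7,422 | 40,110 | |
| | Lab batch mean | 39,943 | 15,501 | 18,791 | 24,645 | 10,786 | 4,667 | 5,484 | 50,546 | 21,295 |
| 10% | Replicate | 60,214 | 38,245 | 43,091 | 65,065 | 62,186 | 4,031 | 7,428 | 66,008 | |
| | Replicate | 48,797 | 44,784 | 33,191 | 54,758 | 11,746 | 19,635 | 14,516 | 55,677 | |
| | Lab batch mean | 54,505 | 41,515 | 38,141 | 59,912 | 36,966 | 11,833 | 10,972 | 60,843 | 39,336 |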

Fig. 1

Slag wool inter-lab data: fibers per milligram of dust (unashed) measured in blind test samples

Table 2

Slag wool inter-lab data: fiber counts observed in dust samples, two replicate samples per laboratory per batch

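| Batch | Lab A | Lab B | Lab C | Lab D | Lab E | Lab F | Lab G | Lab H |
|---|---|---|---|---|---|---|---|---|
| 1% | 1 | 3 | 1 | 13 | 0 | 1 | 7 | 2 |
| | 7 | 5 | 2 | 12 | 4 | 2 | 1 | 2 |
| 5% | 8 | 8 | 7 | 22 | 7 | 1 | 6 | 6 |
| | 12 | 7 | 7 | 16 | 8 | 2 | 11 | 10 |
| 10% | 16 | 18 | 12 | 48 | 13 | 2 | 10 | 13 |
| | 15 | 21 | 12 | 42 | 9 | 3 | 25 | 12 |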

Substantial variability among and within labs is apparent in the data shown in the left panel of Fig. 1. Visual inspection of the data shows that three of the labs had difficulty differentiating between the batches. In fact, these three labs (E, F, and G) “did not meet the measurement quality objectives (MQOs) for the spiked samples put forth in the QAPP [Quality Assurance Project Plan] for this study” [U.S. EPA (2005), p. 13]. As shown in the right panel of Fig. 1, without the data from these labs there is less apparent variability overall, both among and within labs. There is also a visual pattern of measured concentration values increasing from the 1% batch to the 10% batch, which is more in keeping with expectations.

After surveying the three labs that did not meet the MQOs, it is believed that there were multiple reasons for this variability. These labs differed from the other five labs in the amount of dust that was examined to arrive at a fiber count for each sample (Fig. 2). Labs A, B, C, D, and H tended to analyze dust samples of similar amount, while the amounts of dust analyzed by labs E, F, and G were much more variable. Other factors believed to affect performance include differences in the consistency of applying analytical procedures and time pressures to complete analyses, and SEM instrumentation and software age.
Fig. 2

Fibers counted versus amount of dust analyzed (milligrams) for each sample, by lab

Statistical methods

The data in Table 1 were analyzed using a one-way analysis of variance (ANOVA) approach for each of three batches. The basic model by batch is as follows:
\( Y_{i,j} \) = measured value at the ith lab, jth replicate, in fibers per gram of dust

$$ Y_{i,j} = \mu + \alpha_i + \varepsilon_{i,j} \quad \text{for } i = 1, \ldots, k;\ j = 1, \ldots, r_i $$

where

\( \mu \) = overall mean

\( \alpha_i \) = effect of the ith lab

\( \varepsilon_{i,j} \) = error term, assumed to be independent and normally distributed with mean 0 and variance \( \sigma^2 \)

\( k \) = the number of labs = 8

\( r_i \) = the number of replicates at the ith lab = 2 for all labs

The null hypothesis in this model is that all the labs produce the same results for a given batch. That is, all the lab means are the same, or H0: \( \alpha_1 = \alpha_2 = \cdots = \alpha_k = 0 \). The alternative hypothesis that two or more of the lab means are not equal is equivalent to the alternative hypothesis that \( \alpha_i \ne 0 \) for at least some \( i \). Differences among the labs can result in findings that provide evidence to reject the null hypothesis, i.e., a calculated value of the F-statistic that is greater than the critical value for a specified significance level.
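As an illustration of the model above, the following sketch computes the one-way ANOVA F statistic for the 1% batch directly from the replicate values in Table 1. The function name `one_way_anova` is a hypothetical helper written for this example, and the critical value is the tabulated upper 5% point of the F distribution with 7 and 8 degrees of freedom from standard F tables rather than a value reported in the study.

```python
def one_way_anova(groups):
    """One-way ANOVA for a list of groups (one list of replicates per lab).

    Returns the between-lab and within-lab sums of squares, their
    degrees of freedom, and the F statistic.
    """
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f": f_stat,
    }


# Table 1, 1% batch: two replicates per lab, in fibers per gram.
batch_1pct = {
    "A": [3241, 19637], "B": [6151, 9823], "C": [3074, 5149], "D": [16859, 15393],
    "E": [797, 2126], "F": [1083, 7048], "G": [4314, 744], "H": [8979, 9043],
}

result = one_way_anova(list(batch_1pct.values()))

# Tabulated upper 5% point of F(7, 8), taken from standard F tables.
F_CRIT_05 = 3.50
reject_null = result["f"] > F_CRIT_05

print(f"F = {result['f']:.3f} on {result['df_between']} and {result['df_within']} df")
print("Reject H0 at the 0.05 level:", reject_null)
```

The sums of squares obtained this way differ slightly from those in Table 3, presumably because the tabulated data are rounded, but the F statistic agrees with the reported value of 2.368 to the precision shown.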

To address the potential for non-normality in the data, analyses using the log and square-root transformations were also performed. These transformations are standard procedures used to improve the extent to which data meet assumptions such as the assumption of normality inherent in ANOVA. The transformed models are, respectively, the following:
$$ \begin{gathered} \log \left( Y_{i,j} \right) = \mu + \alpha_i + \varepsilon_{i,j} \quad \text{for } i = 1, \ldots, k;\ j = 1, \ldots, r_i \hfill \\ \text{and} \hfill \\ \sqrt{Y_{i,j}} = \mu + \alpha_i + \varepsilon_{i,j} \quad \text{for } i = 1, \ldots, k;\ j = 1, \ldots, r_i \hfill \\ \end{gathered} $$
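The transformed models can be sketched by applying each transformation to the replicate values and recomputing the F statistic. The example below uses the 10% batch from Table 1; the helper `f_statistic` is hypothetical, and the natural logarithm is assumed because the base of the log is not stated in the text (the F statistic does not depend on the base).

```python
import math


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of replicates."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Table 1, 10% batch: two replicates per lab, in fibers per gram.
batch_10pct = [
    [60214, 48797], [38245, 44784], [43091, 33191], [65065, 54758],
    [62186, 11746], [4031, 19635], [7428, 14516], [66008, 55677],
]

transforms = {"raw": lambda y: y, "log": math.log, "sqrt": math.sqrt}

# F statistic on each scale; all values are positive, so log is defined.
f_by_scale = {
    name: f_statistic([[fn(y) for y in reps] for reps in batch_10pct])
    for name, fn in transforms.items()
}

for name, f_val in f_by_scale.items():
    print(f"{name:>4}: F = {f_val:.3f}")
```

Comparing the F statistics across scales gives a quick indication of how sensitive the conclusion about lab differences is to the normality assumption.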

The knowledge that three of the labs, i.e., E, F, and G, did not meet the MQOs, prompted the application of the Scheffe method for the examination of contrasts using the results of the ANOVA. See Appendix 1 below and Scheffe [1959] for descriptions of the method.

Since there is a fiber count measurement corresponding to each observation reported in fibers per gram of dust (Table 2), it is also possible to analyze these count data as if they follow a Poisson distribution. The Poisson distribution arises when considering the probability that a given number of objects are observed or counted in a fixed interval when their average occurrence rate is known to be constant. The Poisson has been used to characterize the distribution of fiber counts in a number of applications, e.g., International Standard (2002). In applied problems, however, count data often have greater variance than the Poisson would suggest, since the Poisson variance is equal to its mean. This problem, referred to as overdispersion, is an indication of departure from the Poisson assumption and is commonly addressed by estimating an additional parameter for variance. This yields a model in the form of the negative binomial distribution, which can be a useful alternative to the normal model assumed under the standard analysis of variance.
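A rough screen for overdispersion is the ratio of the sample variance to the sample mean of the counts, which should be near 1 for Poisson data. The sketch below applies this ratio to the fiber counts in Table 2, pooled across labs within each batch. Pooling is a simplifying assumption made for illustration: the labs examined different amounts of dust, so this is not the model fit used in the study and the ratio should be read only as a crude indicator.

```python
from statistics import mean, variance

# Table 2: fiber counts for labs A-H, first replicate then second replicate.
counts = {
    "1%": [1, 3, 1, 13, 0, 1, 7, 2, 7, 5, 2, 12, 4, 2, 1, 2],
    "5%": [8, 8, 7, 22, 7, 1, 6, 6, 12, 7, 7, 16, 8, 2, 11, 10],
    "10%": [16, 18, 12, 48, 13, 2, 10, 13, 15, 21, 12, 42, 9, 3, 25, 12],
}

# Index of dispersion: sample variance divided by sample mean.
# Values well above 1 suggest more variability than a Poisson model allows.
dispersion = {batch: variance(c) / mean(c) for batch, c in counts.items()}

for batch, ratio in dispersion.items():
    print(f"{batch:>3} batch: variance/mean = {ratio:.2f}")
```

In this crude pooled view the ratio exceeds 1 for all three batches, which is in line with the kind of extra-Poisson variability that motivates the negative binomial alternative.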

Results

The ANOVA results using the data from all eight labs are summarized in Table 3. The differences among the labs for the 5% and 10% batches are significant at the 0.01 and 0.05 levels, respectively. The p value for the test on the 1% batch was 0.125 and, although relatively small, would not be considered significant. Quantile plots of the standardized residuals, shown in Fig. 3, support the approximate normality assumed by ANOVA.
Table 3

ANOVA summaries: slag wool inter-lab data, all labs

| Batch | Source | Df | Sum of Sq | Mean Sq | F value | P = Pr(F) |
|---|---|---|---|---|---|---|
| 1% | Labs | 7 | 351,120,262 | 50,160,037 | 2.368 | 0.125 |
| | Residuals | 8 | 169,437,567 | 21,179,696 | | |
| 5% | Labs | 7 | 3,782,712,300 | 540,387,471 | 11.095 | 0.002 |
| | Residuals | 8 | 389,648,045 | 48,706,006 | | |
| 10% | Labs | 7 | 5,377,522,835 | 768,217,548 | 3.700 | 0.043 |
| | Residuals | 8 | 1,660,995,583 | 207,624,448 | | |

Fig. 3

Normal probability plots of residuals for each analysis of variance model

The results of applying the Scheffe method to compare the mean of the group 1 labs (A, B, C, D, and H) with the mean of the group 2 labs (E, F, and G), are shown in Table 4. For the three batches, the 95% confidence intervals for the difference between the mean of the group 1 labs and the mean of the group 2 labs do not contain zero and thereby support the conclusion that the means of the groups are different. This provides statistical evidence for the judgment that observed differences between labs that met the MQOs and those that did not are real and likely reflect meaningful differences in performance of the method.
Table 4

Scheffe contrasts by batch for group 1 (labs A, B, C, D, and H) versus group 2 (labs E, F, and G): data in fibers per gram

| Batch | Group 1 mean | Group 2 mean | Residual standard error (within lab) | 95% confidence interval for contrast: group 1 vs. group 2 |
|---|---|---|---|---|
| 1% | 9,735 | 2,685 | 4,602 | (5,370–8,740) |
| 5% | 29,885 | 6,979 | 6,978 | (20,350–25,450) |
| 10% | 51,000 | 19,925 | 14,409 | (25,800–36,300) |

The results of the ANOVA conducted on the group 1 labs can be seen in Table 5. The p values for all three batches were higher than those in the full analysis (Table 3), and only the 5% batch was found to be significant at the 0.05 level. This demonstrates the aberrant nature of the group 2 labs; without them there is better agreement among the labs. Conducting the same analysis of variance on the log- and square-root-transformed data yielded very similar results.
Table 5

ANOVA summaries: slag wool inter-lab data, group 1 labs (A, B, C, D, and H)

| Batch | Source | Df | Sum Sq | Mean Sq | F value | P = Pr(F) |
|---|---|---|---|---|---|---|
| 1% | Labs | 4 | 157,903,487 | 39,475,872 | 1.367 | 0.364 |
| | Residuals | 5 | 144,391,911 | 28,878,382 | | |
| 5% | Labs | 4 | 1,770,968,216 | 442,742,054 | 8.492 | 0.019 |
| | Residuals | 5 | 260,688,540 | 52,137,708 | | |
| 10% | Labs | 4 | 887,827,847 | 221,956,962 | 4.585 | 0.063 |
| | Residuals | 5 | 242,041,908 | 48,408,382 | | |

The negative binomial model was also a reasonable fit to the data and produced results consistent with the normal models. The Poisson model was found to be a very poor fit due to strong overdispersion in the data.

Discussion and conclusions

Statistical analysis of the slag wool fiber data from the eight labs suggests that further refinement would have been required prior to application of the measurement method in a significant field study. Three of the labs surveyed did not meet the MQOs and were found to have generated results that are statistically different from the other five labs. In addition to several qualitative methodological problems, these labs also had in common a relatively high variability in the amount of dust that they analyzed. By contrast, the five labs that did meet the MQOs analyzed samples of dust that varied in mass by no more than 0.0002 g. The results illustrate the importance of controlling the amount of dust analyzed. Sub-samples from the same batch should be expected to contain the same amount of fibers. However, variation in the amount of dust in the sub-samples introduces variability into the observed fiber counts that is not compensated for by normalizing to obtain a result in units of fibers per gram. This can be seen in the plots in Fig. 2 and the measurement results in Tables 1 and 2. The three labs (E, F, and G) that did not meet the MQOs had greater difficulty in comparison to the other labs in controlling the amount of dust analyzed for each batch sub-sample and in reporting consistent measurement results.

While results from the group 1 ANOVA suggest that labs have the ability to agree on mean batch concentration, it is not clear that this will ensure the ability of the measurement method to support correct classification of a given dust sample. This is due to the considerable within-lab variability. This variability reduces the accuracy of parameter estimates for each batch model and would tend to increase the number of errors made in deciding whether or not a given dust sample contains WTC dust.

Variability also has an impact on “reproducibility” which is defined by Caulcutt and Boddy (1983), page 143, as: “The value below which the absolute difference between two single test results on identical material obtained by operators in different laboratories, using the standardized test method, may be expected to lie with 95% confidence.” Reproducibility should be included in the evaluation of an analytical method. In particular, when a method is intended for use in a study where multiple labs are required to implement the same method, reproducibility is an important indicator of the reliability of overall study results. The reproducibility value calculated from these inter-lab data is 40,285 fibers per gram (details of the formula and calculation are provided in Appendix 3). The utility of the method in the context of a particular study should be assessed in terms of the requirements of the study. In this case, consideration of the reproducibility value in combination with the batch means and confidence intervals shown in Table 6, suggests that, using the above criterion, it would be difficult to reliably associate particular lab measurement results with any of the 1%, 5%, or 10% batch level concentrations. This is an indication of the need to improve precision (i.e., reduce variability) associated with measurements obtained using the method if the objective is to discriminate among lower level concentrations of slag wool in dust.
Table 6

95% Confidence limits for batch means: group 1 labs data

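| Batch | Mean (slag wool fibers/gram of dust) | Lower 95% confidence limit (slag wool fibers/gram of dust) | Upper 95% confidence limit (slag wool fibers/gram of dust) |
|---|---|---|---|
| 1% | 10,000 | 5,400 | 14,000 |
| 5% | 30,000 | 13,000 | 47,000 |
| 10% | 51,000 | 40,000 | 63,000 |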

Review of these results and the experience gained in this inter-lab study suggest a number of actions by labs that would improve control of variability and assist with meeting MQOs. Control of the amount of dust analyzed for each sample is associated with achieving better precision and meeting the MQOs. Careful execution of the measurement protocol should result in consistent control of the amount of dust analyzed from sample to sample which should, in turn, reduce variability and improve overall precision. The better performing labs in this study demonstrated superior control of the amounts of dust analyzed. Improvement in estimates of within-lab variability can be achieved by taking more than two replicates per batch at each lab. Ensuring that all labs stringently follow the protocol, ensuring lab equipment capabilities, as well as providing sufficient time for analysis, proper training of lab personnel, and further method development and testing should support improved control of between-lab variability and overall enhanced results. The work described here provides important insights for the planning and implementation of future studies involving examination of dust samples for physical contaminants.

Acknowledgments

The authors are grateful to Barry Nussbaum, Dennis Santella, and Paul White for helpful comments and suggestions.

Copyright information

© Springer Science+Business Media B.V. (outside the USA) 2011

Authors and Affiliations

  • Henry D. Kahn (1)
  • Jacky A. Rosati (2)
  • Andrew P. Bray (1, 3)
  1. US Environmental Protection Agency, Office of Research and Development, Washington, USA
  2. US Environmental Protection Agency, Office of Research and Development, Research Triangle Park, USA
  3. Department of Statistics, UCLA, Los Angeles, USA
