Journal of Archaeological Method and Theory, Volume 21, Issue 3, pp 563–588

Accurate Measurements of Low Z Elements in Sediments and Archaeological Ceramics Using Portable X-ray Fluorescence (PXRF)

Abstract

This study seeks to demonstrate the ability of portable X-ray fluorescence (PXRF) to estimate concentrations of K, Ca, and Fe in sediments and archaeological ceramics under controlled conditions. After a discussion of the potential confounding factors in PXRF use, a protocol which attempts to address these issues through repeated measurement, calibration, and re-sampling is detailed. Data generated using this protocol are then tested for accuracy and repeatability. PXRF is argued to be able to produce accurate estimates of K provided the suggested protocol is used, and able to produce repeatable estimates of K and Ca under these same conditions. Other experimental conditions tested failed to produce accurate and repeatable results. Fe results are found to be problematic given the calibration standards used here.

Keywords

PXRF, Geochemistry, Sediments, Ceramics

Introduction

The use of portable X-ray fluorescence (PXRF) is on the rise in archaeology. More papers using PXRF are appearing in journals and edited volumes, PXRF data is featured more often at professional conferences, and these papers and presentations are branching out thematically, as they now encompass everything from the sourcing of lithics (Burley et al. 2011; Craig et al. 2007; Davis 2011; Forster and Grave 2012; Millhauser et al. 2011; Nazaroff et al. 2009; Phillips and Speakman 2009; Sheppard et al. 2011), glasses (Polikreti et al. 2011), and ceramics and clays (Frankel and Webb 2012; Goren et al. 2011; Speakman et al. 2011) to chemical characterization of sediments and soils (Abrahams et al. 2010; Davis 2011; Davis et al. 2012) and a range of in situ materials (e.g., Liritzis and Zacharias 2010; Potts and West 2008).

Within this surge lies a body of commentary reminding users to apply the method with a conscious, diligent care and a critical eye towards results. The voices of this undercurrent offer a variety of insights ranging from illumination of key issues in the application of PXRF (Shackley 2010) to evaluations of the accuracy of PXRF (Craig et al. 2007; Goodale et al. 2012; Millhauser et al. 2011; Nazaroff et al. 2009). When the latter type of investigation has been conducted, a common result is that PXRF produces source clusters which match those derived from independent laboratory-based wavelength dispersive XRF, energy dispersive XRF, and instrumental neutron activation analysis but fails to produce elemental concentrations which match these independent measurements (Craig et al. 2007; Goodale et al. 2012; Nazaroff et al. 2009). In other words, published studies show that PXRF often produces inaccurate elemental concentrations which pattern in accurate ways. This fact should give the rapidly expanding user-base of PXRF pause, especially given (1) growing interest in non-sourcing applications for which elemental concentrations accurate in and of themselves are needed, and (2) the obvious superiority of accurate results for comparison between independent datasets and studies. Going forward, it will be necessary to rigorously test which uses of PXRF yield demonstrably accurate results, to document the conditions under which such results can be obtained, and to publicize these conditions as an aid to future users. Further, as Shackley (2010) points out, it is imperative that work begin now, while the PXRF boom is relatively young and the volume of potentially flawed data relatively small.

The primary goal of the analysis presented here is to address this need through the development of protocols for preparation, measurement, and analysis which consistently yield accurate estimates of elemental concentrations. My specific interest is in deriving estimates of potassium (K), calcium (Ca), and iron (Fe) in sediments and soils from archaeological contexts as an aid to larger investigations into geochemistry and radiometric dosimetry. As regards dosimetric measurement, the particular goal of this study is to evaluate the potential of PXRF for measurement of K in conjunction with luminescence dating, as radioactivity from ⁴⁰K, which composes about 0.012 % of all naturally occurring K, is a major contributor to the generation of luminescence signals (Aitken 1985:62). Given this objective, this study is also concerned with the accurate estimation of errors associated with PXRF use, since these are used in the calculation of errors in final luminescence age estimation.

Because some prior work suggests that elements with low atomic number (low Z)—such as K and Ca—may pose a significant problem for PXRF measurement (e.g., Goodale et al. 2012), accurate measurement of these elements is particularly likely to be challenging (see also Buhrke et al. 1998:51–4; Lachance and Claisse 1995:204–5). I will attempt to meet this challenge by first diagnosing the potential sources of inaccuracy and imprecision inherent in PXRF measurement, with specific emphasis on the instrument used here, a Bruker Tracer III–V. Once these sources of error are defined and their potential impacts on resultant data are detailed, a protocol which attempts to address these issues through strategic sample preparation, instrument calibration, and measurement will be set forth. Following this, experimental testing of this protocol’s ability to consistently produce accurate results is discussed, along with the broader implications of test results.

Sources of Error in PXRF

I divide all errors inherent in the use of PXRF to estimate sample elemental concentrations into two primary conceptual types. These are: (1) errors external to instrumentation, and (2) errors internal to instrumentation. Accounting for both types of error is essential to the derivation of accurate data, and both types will be discussed here.

External Errors and Calibration

External errors are defined here as those errors which are not purely due to fluctuations in machine performance; they are instead due to a range of factors, including within-sample heterogeneity, sample preparation, instrument calibration, and user error. The degree to which within-sample heterogeneity affects results will vary as a product of material form, composition, and measurement strategy. The immense diversity of archaeological materials and applications therefore prevents full discussion of this issue here; it must instead be empirically addressed by users on a case-by-case basis through sample preparation, strategic re-measurement, or both. Issues related to sample preparation, while potentially significant, are relatively easily addressed and are fully detailed elsewhere (e.g., Buhrke et al. 1998). User error is perhaps more difficult to address fully, but mistakes of this type can generally be minimized by following sound protocols for measurement and employing redundant checks on data and conclusions. Calibration, on the other hand, is a complex and often misunderstood exercise which is essential to the integrity of results, and as such merits further discussion here.

Calibration

Calibration is a numerical translation from one set of units to another. In the case of XRF, the process translates the intensity of the observed sample response to X-rays, measured in counts, to elemental concentration, measured in parts per million (ppm) or percent mass. XRF results cannot be reported as elemental concentrations without this translation.

XRF calibration can be undertaken in many ways (Lachance and Claisse 1995:259–75), but in general the process by which a calibration is constructed and then used to estimate sample elemental concentrations involves a special type of regression analysis known as inverse prediction (see Seber and Lee 2003:145–8). In this analysis, calibration standards—materials of known elemental concentrations—are measured with XRF to observe the counts these concentrations generate, then a best-fit equation is used to describe the relationship between concentrations (the independent variable) and counts (the dependent variable). In nearly all cases, this relationship is imperfect, causing error to be passed to results, which are accompanied by probabilistic confidence intervals to encompass this error. As a general rule, calibrations have less error (and smaller confidence intervals) when more standards are used, more replicate measurements of each standard are used, and standards are evenly distributed throughout the full range of observed values in all dimensions. Once a sound regression is achieved, PXRF counts can be measured for field samples of unknown concentration and interpolated into the regression equation to infer the concentrations of interest. Error terms in these concentrations are then derived using interpolation into the confidence intervals attached to the regression function.
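To make this concrete, the following is a minimal sketch of inverse prediction in R (the statistical software used later in this study). The concentration values echo the %K standards in Table 1, but the counts and the field-sample value are invented for illustration.

```r
conc   <- c(0.00, 0.19, 0.43, 1.24, 1.49, 2.39, 3.72, 4.48)  # %K of standards (cf. Table 1)
counts <- c(160, 270, 410, 900, 1050, 1550, 2350, 2800)      # hypothetical PXRF counts

fit <- lm(counts ~ conc)            # regress counts (y) on concentration (x)
a <- coef(fit)[["conc"]]            # fitted slope
b <- coef(fit)[["(Intercept)"]]     # fitted intercept

## Inverse prediction: solve y = a*x + b for x given a field sample's counts
field_counts <- 1200                # hypothetical field-sample counts
(field_counts - b) / a              # estimated %K
```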

Importantly, traditional regression analysis relies on assumptions about the nature of the data, and valid estimates of concentrations and error terms can only be obtained if these assumptions are met or if the model is appropriately adjusted to correct for deviations from assumptions. Among these assumptions is the notion that the magnitude of error terms in x-values is independent of the x-values themselves (Seber and Lee 2003:227), but this condition is not met by the calibration standards used in this study. Instead, elemental concentration is a strong predictor of reported error for some elements, such as K (see Fig. 1) and Ca (R2 = 0.958), in these standards. As such, confidence intervals in calibration for these elements cannot be assumed to be adequately modeled by traditional regression analysis, and results based on uncritical calibration may carry grossly inaccurate error terms. If traditional assumptions are applied to the K values for these standards, for example, underestimation of error terms will systematically become more pronounced as K concentration increases.
Fig. 1 Reported 1σ errors in %K concentrations of standards plotted against reported %K concentrations of standards. A value of 0.01 was used for the error in the blank.

There are a number of ways to address this issue (see Seber and Lee 2003:265–92), but they fall into two primary types. The first approach attempts to solve the problem at a more theoretical level and proceeds by modifying the regression model itself as a way to account for the effects of errors in x-values. In practice this approach involves a complex process of identifying trends in variance, estimating the magnitude of these trends, calculating errors in estimation, and then either applying these estimates and errors to the overall regression equation or applying a transformation to data so that they behave normally in a statistical sense. This complicates matters, especially given the fact that trends in errors in x-values may themselves vary in strength (i.e., how well the trend fits all x-values and errors), magnitude (i.e., the rate with which error terms associated with x grow in proportion to x itself), and shape (i.e., the mathematical function describing the trend) depending on the set of standards and the element used. For PXRF measurement, then, this approach can present a barrier to routine use, as it greatly diminishes the ease, and possibly the overall precision, with which calibration can be performed.

As an alternative, the second approach is to use iterative re-sampling to simulate a distribution of expected outcomes of regression analysis given the reported error in each standard concentration. Here, the x-value assigned to each standard during regression is probabilistically drawn from the Gaussian distribution defined by μ and σ, where μ is the reported elemental concentration of the standard and σ is the reported error in this concentration. Regression can then be performed with these sampled x-values as normal, as these values are empirically reflective of reported errors. This process is repeated a number of times to empirically generate a statistical distribution of outcomes of regression parameters such as slope and y-intercept, as well as confidence intervals about these parameters. In this way, patterned errors in x-value inputs are automatically embedded in regression results, avoiding the need to model them theoretically. This method is relatively easy with the aid of widely available statistical software, and is therefore the more practical of the two options.
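A minimal sketch of this re-sampling in R follows, again using the %K values and reported 1σ errors of the Table 1 standards but with invented counts; each iteration draws a fresh set of x-values and re-fits the line, so the resulting distribution of fitted parameters carries the reported errors automatically.

```r
mu    <- c(0.00, 0.19, 0.43, 1.24, 1.49, 2.39, 3.72, 4.48)  # reported %K (Table 1)
sigma <- c(0.01, 0.01, 0.01, 0.05, 0.04, 0.09, 0.11, 0.12)  # reported 1-sigma errors
counts <- c(160, 270, 410, 900, 1050, 1550, 2350, 2800)     # hypothetical PXRF counts

n_iter <- 5000
slopes <- numeric(n_iter)
for (i in seq_len(n_iter)) {
  x_i <- rnorm(length(mu), mean = mu, sd = sigma)  # draw x-values from N(mu, sigma)
  slopes[i] <- coef(lm(counts ~ x_i))[2]           # keep this iteration's slope
}
quantile(slopes, c(0.025, 0.5, 0.975))  # empirical distribution of the slope
```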

In addition to errors in x-values, it is probably also necessary to account for errors in y-values used in calibration, since repeated PXRF measurements of a given sample will generate a distribution of counts due to u-drift (see below). Further, as with errors in x, distributions of y-values tend to enlarge as x-values increase, indicating a positive systematic relationship between x-values and errors in y (see Fig. 2 for K); this relationship generally also holds for Ca (R2 = 0.975) and Fe (R2 = 0.738). This presents a problem similar to that described above for errors in x. As above, there is more than one way to incorporate this error in y-values, but here it is sufficient to note that re-sampling will again adequately address the issue. To ensure quality results, such re-sampling should be used to generate y-values in two phases—first for each standard during calibration, and then for each field sample during interpolation—as both sets are subject to the same source(s) of error in the estimation of their count values. In both cases, repeated PXRF measurement should be used to generate technical replicates, and thus y-value distributions, before bootstrapping is undertaken. If this re-measurement and re-sampling is not done for both standards and field samples, errors in regression and interpolation will not be fully taken into account, and resulting estimates, particularly those of high-concentration field samples, will again be prone to systematic underestimates of error.
Fig. 2 1σ error in observed PXRF counts of K for all standards, averaged over all trials and plotted against reported %K concentrations of standards.

Instrumental Errors

Instrumental errors are defined here as those errors due to changes in machine performance which affect the data collected. Such errors are ubiquitous in complex instruments and cannot be fully prevented, but must rather be taken into account in the process of measurement. In XRF measurement, these fluctuations are often termed “drift,” and three major types of drift have been documented for XRF instruments (Buhrke et al. 1998): ultra short-term drift (hereafter u-drift), short-term drift (hereafter s-drift), and long-term drift (hereafter l-drift). Each has a unique effect on instrument precision and accuracy which is relevant here.

u-Drift

u-Drift occurs when the instrument yields different intensity counts for consecutive measurements using identical measurement parameters. Figure 3 shows u-drift in a Bruker Tracer III–V, recorded by using 15 keV and 20 μA excitation with a 25 μm Ti filter and the vacuum attachment to generate a series of sixty 1-min assays of a standard which was unaltered between assays. This type of drift theoretically results in a Gaussian distribution of observations where the mean value represents the best estimate of real PXRF counts and error terms are randomly distributed. In practice, the mean value can be accurately estimated in two ways. First, the analyst can use longer assays, as collecting data for more time will average out short-term variation in counts. This approach masks short-term error in favor of obtaining an approximation of the mean. Alternatively, a series of short, repeated measurements can be used to characterize the distribution of values caused by u-drift. This method is preferred, since it provides useful additional information relevant to the magnitude of short-term drift, which may itself vary between instruments, instrumental settings, etc. As a general rule, then, a series of short, repeated measurements is probably better than a single, long measurement for a given sample, provided all else is equal.
Fig. 3 u-Drift wiggles on a single standard measured with a series of 1-min assays. The y-axis represents percent difference from each element’s 60-min mean counts.

l-Drift

l-Drift occurs when instrument sensitivity is permanently changed over time, typically through the accumulation of parasite lines or tube degradation as a result of routine use (Buhrke et al. 1998:307). This change in sensitivity results in the gradual accumulation of systematic bias in observed counts. Eventually, measurements from a given instrument will no longer be directly comparable with earlier measurements taken from that instrument. Thus, the clock is ticking on all PXRF instruments currently in use: all are likely to suffer l-drift at some point if used long enough.

The solution to this problem is re-calibration of each instrument whenever l-drift causes it to stray far enough from its prior sensitivity, but it is up to the user to monitor the change and determine when re-calibration is necessary, as little information is available on the timescales over which l-drift can affect the variety of instrument models currently in use. As a general rule, then, all PXRF instruments should be regularly checked for changes in sensitivity to elements of interest due to l-drift. For the user community as a whole, it would also be helpful for individual users to publicize the timescales on which l-drift is significant for specific instruments, instrumental settings, and elements of interest, as this would help set professional standards for the frequency of re-calibration. Publicity would also help users identify which machines are the most stable and durable over the long haul. For the present, however, the degree to which l-drift limits the use-life of a given calibration remains an open question for the instrument used here.

s-Drift

s-Drift occurs outside the typical length of a single measurement window (usually 1–5 min or so) but within the timeframe of a typical use-session of the instrument (say, 3–4 h). Like u-drift, it tends to occur randomly, it can either increase or decrease instrument sensitivity, and its effects are temporary. Unlike u-drift, however, its effects can last for hours, essentially causing the mean of the ultra short-term distribution to systematically increase or decrease for a given period of time. Figure 4 shows an example of s-drift derived from the same experimental parameters used to document u-drift above, although patterned data of this type have been directly observed for both of the Bruker Tracer III–V instruments tested, independent of the filter, vacuum, voltage, amperage, and counting time used. Further, s-drift occurs after variable durations of use, it lasts for variable amounts of time, and none of the real-time feedback provided by the instrument was predictive of its onset or effects during testing conducted for this study. The effect of a single instance of s-drift can also differ according to the elements analyzed; Fig. 4 shows a case in which instrumental sensitivity to K and Ca increased at around the 20th minute, while sensitivity to Fe decreased.
Fig. 4 s-Drift on a single standard measured with a series of 1-min assays. In this case, K and Ca counts increased relative to each element’s 60-min mean, while Fe counts decreased. Wiggles are due to u-drift.

This particular type of drift therefore presents unique challenges to the practical use of PXRF, and its effects on instrument performance and resultant data can be profound. If, for example, an analyst were to measure the same sample with the same instrumental parameters ten times each at the initial and terminal portions of a single 4-h use-session, it is entirely possible that the distributions generated by these two sets of measurements would differ in a statistically significant way. Over the moderate term, then, s-drift produces systematic bias between short-term datasets, which is a problem for the derivation of accurate results. Over the long term, however, errors due to s-drift can be seen as effectively random, as s-drift does not appear to consistently affect individual use-sessions in the same way.

In practice, then, s-drift can be addressed through a series of steps designed to randomize the effects of s-drift on PXRF counts data over the long-term. First, samples measured within a given use-period should be re-measured in such a way that individual measurements of each sample are evenly distributed throughout the use-session. This strategy effectively subsumes the effects of s-drift within the distribution of repeated measurements for each sample, allowing the user to attempt to address s- and u-drift simultaneously. Second, the median of counts distributions should be used as the measure of central tendency instead of the mean, as the median is more robust to secondary modes and outliers in distributions of repeated measurements—both possible products of s-drift—than the mean (Schuenemeyer and Drew 2011:10). Third, samples should be measured in multiple sessions of use, preferably on separate days, so that results of individual sessions can be compared for consistency. This step is critical, as it will allow the user to identify the few cases in which s-drift is potentially responsible for the primary modes of distributions within a given use-session, and by extension medians which are heavily contaminated by s-drift. Further, when results from independent use-sessions are integrated, any effects of s-drift on each of the individual sessions’ results will be randomized, helping to eliminate systematic errors in estimates. Lastly, re-measurements of a set of standards should also be interspersed within each session of use to allow the tracking of s-drift within an individual session, thereby providing diagnostic data which can help inform the user.

As an alternative to the above approach, it is also possible to use data from re-measured standards in an attempt to correct for s-drift by re-scaling s-drift-affected counts values to match counts at the time of calibration. Such correction through re-scaling would use a formula such as \( \frac{S_m }{S_c }=\frac{F_m }{F_c } \), where Sm and Fm are observed s-drift-affected counts for a given standard and a field sample, respectively, Sc is the observed counts for standards used in initial calibration, and Fc represents counts for a given field sample on the scale used in instrument calibration. Importantly, however, this correction will be burdened by error from two main sources. First, it will incur errors in estimating the ratio between Sm and Sc, since (1) each of these terms is subject to error in estimation (especially Sc, which embeds errors in the original regression), and (2) any relationship between these terms must hold for all calibrated standards to be effective. In other words, the definition of Sm:Sc is functionally a mini-calibration—in this case one that translates s-drift-affected counts to counts on the scale used for original calibration—and as such it is subject to all the sample-size effects and errors of any other calibration. Second, it will incur errors in the use of Sm:Sc to infer Fm:Fc, as would interpolation into any other calibration; it will also pass all of these errors on to results. While it is possible to formally account for these errors, it may not be practical to do so each time s-drift asserts itself. The former solution may therefore be more useful in practice, since with this approach (1) the errors s-drift causes do not require extra work to calculate, and (2) its primary inconvenience—the need to measure samples a number of times—is in itself a benefit in generating statistically robust data.
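Setting the error propagation aside, the re-scaling itself is a one-line operation in R; all counts below are invented for illustration.

```r
S_c <- 1500   # a standard's counts at the time of calibration (hypothetical)
S_m <- 1620   # the same standard's s-drift-affected counts in the current session
F_m <- 980    # a field sample's counts in the current session

## Solve Sm/Sc = Fm/Fc for Fc: field counts re-scaled to the calibration scale
F_c <- F_m * (S_c / S_m)
F_c
```

Note that the errors described above (in the ratio Sm:Sc and in its application to Fm) are deliberately omitted here; a full treatment would propagate both.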

Summary of Issues and Experimental Goals

Each of the above sources of error has the potential to impede accurate PXRF results. As such, the most reliable way to derive accurate PXRF results is to control as many of these sources as possible, while accounting for the others. Errors due to sample composition, sample preparation, and user error can be minimized if simple procedures are followed. Errors in calibration and interpolation are inevitable, but they can be fully taken into account through statistical regression models which make use of re-sampling from distributions of known concentrations and observed PXRF counts. Instrumental errors are inherent and pose a significant challenge, but their effects on PXRF data can be minimized through strategic use of the same redundant measurements needed for calibration and interpolation. This strategy should involve re-measurements of individual samples which are evenly distributed throughout a single session of use and repeated over multiple sessions of use in order to account for s-drift. It should also involve repeated measurements of standards as a part of standard practice to help diagnose the onset of s-drift and l-drift and provide feedback about instrument performance. If these steps are taken, it should be possible to produce PXRF results which are both consistent and reproducible by independent methods.

The experiment described below is designed to test this assertion through intensive analysis of a set of calibration standards and field samples. It seeks to do so by creating relatively ideal conditions in sample preparation, calibration, and measurement in an attempt to verify the accuracy of the instrument under these conditions. The protocol used in preparation, measurement, and analysis used to create these conditions is detailed first. Next, the degree to which calibrations accurately described the variation in data derived from standards is discussed, followed by an evaluation of the degree to which instrument sensitivity remained stable through repeated uses. The instrument’s ability to accurately derive K concentrations for field samples is then evaluated. Lastly, the statistical consistency with which the instrument is able to derive results for other elements of interest in field samples is discussed.

Experimental Methods

Field Samples

In total, 71 field samples were selected for analysis. Fifty-one of these were samples of beach and flood sediments collected from a series of sites in north coastal Peru. These are samples for which elemental compositions have not yet been independently measured. Their primary value to this study is in providing data relevant to the evaluation of the consistency of PXRF results throughout repeated measurements. The other 20 samples were archaeological ceramics from a series of sites in North Carolina loaned from the University of Washington Luminescence Laboratory. Potassium concentrations for these ceramics were independently measured by inductively coupled plasma mass spectrometry (ICP-MS) for use in luminescence measurements carried out by Jean-Luc Schwenninger at Oxford (Feathers, unpublished report), and these samples are therefore useful for the evaluation of instrument accuracy.

Sample Preparation

Bulk samples were first dried in a 50 °C oven for 24 h to remove moisture content. Next, samples were pulverized using a sterilized tungsten-carbide orbital rock crusher until the consistency of fine flour was achieved for each sample. This step homogenized each sample and put it in a form similar to that of the calibration standards used, both of which are critical to ensuring PXRF measurements would be maximally comparable between assays and between analytes (see Buhrke et al. 1998:35–49). Archaeologists intending to take advantage of the non-destructive potential of PXRF may want to avoid this step, but in cases where analytes are compositionally heterogeneous (such as sediments and ceramics) it is highly recommended when possible. Cost considerations prevented use of a glass-forming agent or a pellet press in sample preparation, and samples were left in powdered form for analysis. Although not ideal, this shortcut did not appear to have any damaging effect on results.

After pulverization, samples and standards were each placed in sterile Teflon bottles and bottle apertures were covered with 4 μm Ultralene film to create measurement windows. This film was secured in place to seal samples inside bottles for the duration of experimentation, helping minimize potential contamination caused by repeated handling. Sample and standard bottles were stored in a desiccation cabinet when not undergoing measurement, preventing re-hydration of samples over the period of experimental work, as such re-hydration could affect apparent elemental concentrations.

Calibration Standards

Standards, again defined here as materials of known elemental values for use in calibration, are detailed in Table 1. To ensure accuracy, only materials with elemental concentrations verified by multiple independent methods and laboratories were used as standards. Specifically, all non-zero values used for calibration represent “recommended values” reported for geochemical reference materials provided by the U.S. Geological Survey or the National Research Council Canada. Standards were also chosen to provide a broad range of values for relevant elements, thereby allowing for more robust regression fitting. Importantly, one of the standards used was a “blank” composed of the same type of Teflon bottle and Ultralene film described above, but with no sample inside. This blank provides a zero value for calibration and a means of measuring background counts values. In total, eight standards were used. Preliminary testing indicated this number produced sufficient resolution in calibration while minimizing measurement time over the multiple cycles of repeated measurement described below.
Table 1 Standards used for calibration

Calibration standard    %K     1σ     %Ca    1σ     %Fe    1σ
AGV-2 Andesite          2.39   0.09   3.72   0.09   4.68   0.09
BCR-2 Basalt            1.49   0.04   5.09   0.08   9.65   0.15
BHVO-2 Basalt           0.43   0.01   8.17   0.12   8.63   0.14
Blank^a                 0      0.01   0      0.01   0      0.01
DNC-1 Dunite            0.19   0.01   8.21   0.05   6.97   0.10
G-2 Granite             3.72   0.11   1.40   0.06   1.86   0.12
GSP-2 Granodiorite      4.48   0.12   1.50   0.04   3.43   0.11
PACS-2 Marine Sed^a     1.24   0.05   1.96   0.18   4.09   0.06

With the exception of the blank, all values represent recommended values for geochemical reference materials, as reported by the USGS or the NRCC

^a Denotes non-USGS sources; PACS-2 was made available by the NRCC

Measurement Parameters

Measurements were carried out using a Bruker Tracer III–V with a 25 μm Ti filter and the external vacuum pump attachment. The machine was selected for its flexibility in configuration, the filter was selected after some experimentation as the best way to reduce noise in the kiloelectron-volt range of interest for K, Ca, and Fe, and the vacuum pump was used to help prevent attenuation of the extremely low-energy X-rays which characterize this range. To prevent effects of heterogeneity across the measured surface, all samples were presented to the instrument in such a way that the sample completely covered the beam profile. Additionally, all samples measured were at least 6 mm thick across the beam profile to fully ensure “infinite thickness” and thus reliable comparison between samples. Samples were also gently tapped prior to measurement to remove any air pockets which might have affected the penetration and escape of X-rays. The instrument was always turned on at least 0.5 h prior to initial measurement to allow the detector temperature to fully cool and stabilize. Likewise, the instrument’s X-ray tube was always turned on for at least 1 min prior to initial measurement to help eliminate the effects of initial fluctuation in electrical current—and therefore spikes in the data—as exploratory analysis often yielded such spikes within the first few seconds of tube activation. Once measurements had begun, instrument hardware was never manipulated; measurements themselves were initiated solely through software. All measurements used a 15 keV, 20 μA stimulation with a 1-min counting window, and Bruker’s S1PXRF software (version 3.8.22). Extensive preliminary experimentation with these samples showed these parameters generate sufficient signals for all elements of interest to this study.

Measurement Sequences

Each measurement cycle was composed of two tiers. The lower tier, the assay, was a single 1-min interval in which a single sample was exposed to X-ray stimulation and the resultant intensities counted. This is the basic temporal unit of measurement in this study. The higher tier, the trial, was always a sequence of 180 individual assays: 10 assays each for 10 different samples, and 10 assays each for the 8 standards. Assays were interwoven in such a way that two samples would be measured once each, followed by two standards, then two samples, etc. This process was repeated until each sample and standard had been measured ten times, with each of the ten measurements being interspersed throughout the trial to help account for s-drift. As used here, the trial was the basic unit of analysis; all trials carried their own self-sufficient local calibration data, and all trials contained enough sample data to provide a robust estimate of elemental concentrations with error terms. These repeated local calibrations allowed the statistical comparison of the relative stability of the instrument over repeated trials, as well as the degree to which l-drift affected calibration parameters over the course of experimentation.
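One way to generate such an interleaved schedule in R is sketched below. The study alternated pairs of samples and pairs of standards, whereas this sketch simply shuffles each of ten passes through all items, so the ordering differs in detail while preserving the key property: each item's ten assays are spread throughout the trial.

```r
samples   <- paste0("sample_", 1:10)
standards <- paste0("standard_", 1:8)

## Ten passes through all 18 items, shuffled within each pass:
## 10 x 18 = 180 assays, with each item measured once per pass
schedule <- unlist(lapply(1:10, function(pass) sample(c(samples, standards))))
length(schedule)     # 180
head(schedule, 6)    # the first few assays of the trial
```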

Each of the 71 samples examined here underwent no fewer than two individual trials. The majority (n = 50) underwent at least three trials. Again, this provided redundant measurements to help eliminate potential errors due to user oversight or s-drift, while allowing for testing of the repeatability of results and the stability of the instrument over time. Repeated trials also helped improve precision in results due to increased sample size, as most samples were measured 30 times in total. Each of the 20 ceramics used for the evaluation of instrument accuracy was measured 30 times. Altogether, results reported here make use of data from around 2,100 sample assays and exactly 1,680 standard assays derived from 21 individual trials.

Data Selection

Raw data used for analysis were selected by a process designed to incorporate only those data most directly representative of the elemental concentrations of interest. Bruker provides software which will do this task, but this software is a bit of a “black box” in its operation and is also not compatible with the calibration program in R software described below. To circumvent this black box in a way useful to this study, a custom process for identifying and incorporating raw count data was used.

The first step in this process was to examine the visible spectrum for each assay in S1PXRF to identify the most pronounced peak for each element of interest. For purposes of clarity, Ca will be used as an example here, although for all of the elements of interest to this study the relevant peak is always the more intense K-orbital peak, also known as the Kα peak. Next, the range of electrical channels potentially relevant to the Ca Kα peak was identified using S1PXRF. Relevant channels were defined as those which indicated Ca counts distinctly exceeding background noise. Raw data for each trial were then compiled into a table, and a total of within-trial counts was derived for each of the electrical channels previously identified as potentially relevant to the Kα peak for Ca. The relevant channel with the highest total was taken to be the apex of the Ca peak for that trial. This channel, hereafter referred to as the primary channel, was assumed to be the channel most sensitive to Ca, as this channel by definition exhibited the most pronounced XRF response to samples’ Ca concentrations. Next, the correlation of each channel within the identified range with the primary channel was calculated to yield R2 values describing the strength of the relationship between the primary channel and each of the other channels under consideration. Channels highly correlated with the primary channel were also taken to be directly indicative of Ca; channels less correlated with the primary channel were likely responding to other factors, including background noise. Channels which were highly correlated with the primary channel, hereafter referred to as secondary channels, were included in analysis. In practice, it was easy to identify secondary channels, as there was in all cases a large drop-off in R2 between highly correlated channels and less-correlated channels. Additionally, in all cases, the secondary channels identified were adjacent to the primary channel and contiguous with one another, making the breaking-point between secondary channels and noisy channels easily identifiable. Once secondary channels were identified, counts from the primary and secondary channels were totaled together for each assay to calculate total counts for Ca for each assay. This process was repeated for all elements of interest, and channels 80–85 were used for K, 90–96 for Ca, and 155–164 for Fe for all trials to ensure totaled counts and calibrations would be directly comparable from one trial to the next. These channels are likely to vary according to instrumental settings, and users replicating this method are advised to define channels appropriate to their own uses.
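The following R sketch mirrors that channel-selection logic on simulated spectra; the data matrix, the candidate range, and the R2 cutoff are all hypothetical stand-ins rather than values from this study.

```r
set.seed(1)
## Hypothetical data: 60 assays x 120 channels of raw counts; channels 90-96
## carry a simulated peak (a stand-in for Ca K-alpha) whose intensity varies
## by assay, as it would across samples of differing Ca concentration
spectra  <- matrix(rpois(60 * 120, lambda = 20), nrow = 60)
ca_level <- runif(60, 100, 2000)
for (ch in 90:96) spectra[, ch] <- spectra[, ch] + rpois(60, ca_level)

cand    <- 88:98                         # channels judged potentially relevant
totals  <- colSums(spectra[, cand])      # within-trial total per channel
primary <- cand[which.max(totals)]       # apex of the peak: the primary channel

## R^2 between each candidate channel and the primary channel across assays
r2 <- sapply(cand, function(ch) cor(spectra[, ch], spectra[, primary])^2)

keep      <- cand[r2 > 0.9]              # cutoff placed at the R^2 drop-off
ca_counts <- rowSums(spectra[, keep])    # total Ca counts per assay
```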

Calibration and Interpolation

Calibration was performed for each trial in R version 2.13.1 using a three-stage re-sampling regime (see Fig. 5) to help account for error in the following: standard concentration estimates, standard PXRF counts, regression residuals, and field sample PXRF counts.
Fig. 5 Detail of the process used to calibrate, interpolate, and estimate errors for each trial.

The first stage empirically fit a regression function which fully incorporated errors in standard concentrations and observed standard XRF counts. To do this, ten elemental concentration values were sampled (with replacement) for each standard from the Gaussian distribution described by the mean and standard deviation reported for each standard. Next, ten replicates of standard intensities (counts) were sampled (with replacement) from the ten assays for each standard to probabilistically generate ten y-values for each standard. Then a regression function was fitted to eight points, each corresponding to a single calibration standard, and each derived using the mean of the corresponding standard’s simulated distribution of x-values (as this distribution is reported as Gaussian) and the median of its y-values (as the median is more robust to s-drift).
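A single stage-one pass might look as follows in R, with the standards' reported %K values as before and a hypothetical 8 x 10 matrix of assay counts (one row per standard, one column per assay); the slope and intercept used to simulate the counts are invented.

```r
mu    <- c(0.00, 0.19, 0.43, 1.24, 1.49, 2.39, 3.72, 4.48)    # reported %K
sigma <- c(0.01, 0.01, 0.01, 0.05, 0.04, 0.09, 0.11, 0.12)    # reported 1-sigma
set.seed(5)
std_counts <- replicate(10, 600 * mu + 150 + rnorm(8, 0, 30)) # hypothetical 8 x 10 assays

## Ten sampled x-values per standard, summarized by their mean (Gaussian)
x_bar <- sapply(seq_along(mu), function(j) mean(rnorm(10, mu[j], sigma[j])))
## Ten resampled assays per standard, summarized by their median (robust to s-drift)
y_med <- apply(std_counts, 1, function(assays)
  median(sample(assays, 10, replace = TRUE)))
fit1 <- lm(y_med ~ x_bar)   # one stage-one regression through eight points
```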

The second stage involved the modification of first-stage residuals to help ensure variation in residuals was adequately modeled, since constant variance about the regression line is another fundamental assumption of regression (Seber and Lee 2003:227). This is therefore an important step in creating a regression which is statistically robust to differences between standards and in particular to the potential leveraging effects of a small number of standards used for regression fitting. Modification was accomplished by first sampling the residuals from the regression fitted as above to generate bootstrapped residuals. Next, bootstrapped residuals were modified using the formula \( {Y_b}=\hat{Y}+{e_b}\sqrt{{\left( {\frac{n}{n-2 }} \right)}} \) where eb is the bootstrapped residual, Ŷ is the predicted count value, n is the number of observations, and Yb is the resulting modified y-value. Observed y-values were then replaced with these randomly constructed modified values. Residuals at this point had a constant variance that emulated the observed variation in residuals and modeled the effects of this variation across the full range of x-values used in the regression. The regression line was then re-fit to these modified values.
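Continuing from the stage-one sketch (fit1 and x_bar as defined there), the residual modification reduces to a few lines:

```r
n   <- length(resid(fit1))                      # eight standards
e_b <- sample(resid(fit1), n, replace = TRUE)   # bootstrapped residuals
y_b <- fitted(fit1) + e_b * sqrt(n / (n - 2))   # Y_b = Y_hat + e_b * sqrt(n/(n-2))
fit2 <- lm(y_b ~ x_bar)                         # re-fit to the modified y-values
```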

The third stage interpolated observed counts of measured field samples into the regression using realistic representations of counts distributions. This was accomplished by first sampling (with replacement) ten replicates from the ten assays for each field sample. The median of each distribution was then interpolated into the regression function to derive an estimate of the elemental concentration for each field sample measured in the trial.

Each of these three stages was then iteratively repeated 5,000 times per trial to generate a distribution of estimates of elemental concentrations for each element and each field sample. From these distributions, mean elemental concentrations and confidence intervals were derived for each trial. Means were used as best estimates of true field sample concentrations, and are reported with 1σ error terms in this study.
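Putting the three stages together, the per-trial loop is roughly as follows, continuing the objects from the stage-one sketch; field_assays is a hypothetical vector of one field sample's ten assay counts.

```r
field_assays <- c(1180, 1225, 1190, 1210, 1240, 1175, 1205, 1230, 1195, 1215)

estimates <- replicate(5000, {
  ## Stage one: resample x and y for each standard, fit the calibration line
  x_bar <- sapply(seq_along(mu), function(j) mean(rnorm(10, mu[j], sigma[j])))
  y_med <- apply(std_counts, 1, function(a) median(sample(a, 10, replace = TRUE)))
  fit1  <- lm(y_med ~ x_bar)
  ## Stage two: bootstrap and inflate residuals, then re-fit
  n    <- length(resid(fit1))
  y_b  <- fitted(fit1) + sample(resid(fit1), n, replace = TRUE) * sqrt(n / (n - 2))
  fit2 <- lm(y_b ~ x_bar)
  ## Stage three: interpolate the resampled field-sample median
  y_star <- median(sample(field_assays, 10, replace = TRUE))
  (y_star - coef(fit2)[1]) / coef(fit2)[2]
})
mean(estimates)   # best estimate of %K for this field sample in this trial
sd(estimates)     # its 1-sigma error term
```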

Calculation of Final Estimates

Results of individual trials were combined by deriving a single mean value and error terms weighted by precision. As a result, trials with less scatter due to lower u-drift and/or s-drift had greater influence on final estimates than those with more scatter. In cases where single-trial error terms were asymmetrical as a result of the effects of s-drift, error terms were made symmetrical by inflating errors on the smaller “side” of the distribution to match the larger. Final estimates are not listed in their entirety here due to space considerations.
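The text above does not spell out the weighting formula, but inverse-variance weighting is the standard way to weight by precision; a sketch under that assumption, with invented per-trial values:

```r
est <- c(1.21, 1.15, 1.18)   # hypothetical per-trial %K estimates for one sample
err <- c(0.09, 0.07, 0.10)   # corresponding 1-sigma error terms

w <- 1 / err^2                      # precision weights
k_final <- sum(w * est) / sum(w)    # precision-weighted mean
k_error <- sqrt(1 / sum(w))         # error in the weighted mean
c(k_final, k_error)
```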

Results and Discussion

Accuracy of Calibrations

Table 2 provides a summary of data derived from regression fitting for each of the 21 trials, including within-trial R2 values for each element. All values listed are derived from linear regression.
Table 2 K, Ca, and Fe linear regression parameters and limits of detection (LOD) for all trials

K calibration
Trial   R2      a       1σ(a)   b       1σ(b)   LOD
1       0.998   636.7   11.9    127.1   27.7    0.13
2       0.989   652.4   28.1    91.0    65.6    0.30
3       0.995   587.9   17.6    130.6   41.1    0.21
4       0.998   612.6   11.9    107.5   27.8    0.14
5       0.996   621.8   15.7    105.4   36.7    0.18
6       0.997   625.8   14.9    146.9   34.7    0.17
7       0.988   638.9   28.8    131.5   67.2    0.32
8       0.998   589.4   10.1    156.8   23.5    0.12
9       0.996   621.5   15.7    157.9   36.7    0.18
10      0.996   609.8   16.5    160.9   38.5    0.19
11      0.995   596.3   18.0    187.6   42.0    0.21
12      0.997   536.8   11.3    174.5   26.4    0.15
13      0.997   571.1   13.7    167.5   32.0    0.17
14      0.997   501.0   12.2    147.7   28.5    0.17
15      0.997   512.4   10.8    163.3   25.3    0.15
16      0.995   535.6   16.3    193.8   38.0    0.21
17      0.996   608.5   15.8    211.9   36.9    0.18
18      0.997   542.7   12.2    182.0   28.5    0.16
19      0.996   541.1   13.5    187.4   31.5    0.17
20      0.991   526.8   20.6    193.0   48.2    0.27
21      0.993   536.4   19.1    209.0   44.5    0.25
Mean    0.995   581.2   15.9    158.7   37.2    0.19

Ca calibration
Trial   R2      a       1σ(a)   b       1σ(b)   LOD
1       0.998   974.7   16.9    260.1   80.6    0.25
2       0.994   926.5   30.4    295.0   145.0   0.47
3       0.996   929.0   24.0    214.0   114.5   0.37
4       0.996   927.4   25.5    246.1   121.4   0.39
5       0.989   901.5   39.6    338.0   188.7   0.63
6       0.996   1032.6  28.0    178.0   133.7   0.39
7       0.991   1051.9  40.0    116.6   190.7   0.54
8       0.991   1011.2  38.6    104.9   183.9   0.55
9       0.996   1033.3  26.8    169.9   127.8   0.37
10      0.998   1001.6  19.9    244.5   94.7    0.28
11      0.998   1004.3  20.2    258.8   96.5    0.29
12      0.996   940.3   23.5    207.0   111.9   0.36
13      0.998   962.8   17.3    215.7   82.4    0.26
14      0.995   876.9   26.7    172.8   127.5   0.44
15      0.999   895.2   11.9    207.7   56.5    0.19
16      0.998   961.3   16.1    150.2   76.8    0.24
17      0.993   1079.7  36.4    173.5   173.7   0.48
18      0.998   959.2   18.4    159.8   87.8    0.27
19      0.999   945.1   12.8    194.4   61.0    0.19
20      0.998   959.7   18.6    135.8   88.5    0.28
21      0.999   970.2   9.8     189.4   46.8    0.14
Mean    0.996   968.8   23.9    201.5   113.8   0.35

Fe calibration
Trial   R2      a       1σ(a)   b       1σ(b)   LOD
1       0.982   5329.0  295.8   3907.2  1718.7  0.97
2       0.976   5043.8  321.9   4169.8  1870.1  1.11
3       0.973   5057.6  344.3   4088.7  2000.1  1.19
4       0.974   4994.8  333.9   4415.2  1939.7  1.17
5       0.969   4949.9  362.5   4326.3  2106.0  1.28
6       0.982   5368.0  295.4   3719.9  1716.3  0.96
7       0.986   5456.8  262.3   3611.6  1524.2  0.84
8       0.969   5232.5  380.3   4400.2  2209.6  1.27
9       0.983   5528.3  295.6   3796.6  1717.2  0.93
10      0.984   5601.3  294.7   3751.7  1712.3  0.92
11      0.979   5617.5  339.1   3633.6  1970.0  1.05
12      0.974   5084.8  338.9   3764.5  1969.2  1.16
13      0.975   5176.4  336.6   3801.5  1955.7  1.13
14      0.963   5491.4  437.8   4510.1  2543.5  1.39
15      0.971   5532.0  390.3   4568.3  2267.7  1.23
16      0.973   5205.8  357.6   3608.0  2077.4  1.20
17      0.979   5585.6  332.8   3296.4  1933.2  1.04
18      0.973   5172.0  349.0   3638.0  2028.0  1.18
19      0.977   5162.5  523.6   3443.9  1880.2  1.09
20      0.970   5119.6  368.8   3764.1  2142.6  1.26
21      0.974   5289.6  351.1   3334.8  2040.1  1.16
Mean    0.976   5285.7  348.2   3883.4  1967.7  1.12

Where \( y = ax + b \)

Values reflect % elemental concentration. \( \mathrm{LOD}=b+3\sigma \) for the same equation, where σ is the standard error in the estimation of the intercept (Lachance and Claisse 1995:272)

These data suggest excellent linear regression fits for K for all standards, as R2 values indicate nearly perfect fits (mean R2 = 0.995) between known standard concentrations and observed intensities. Examination of R output for residuals (Fig. 6) consistently shows homoscedasticity about the regression line, indicating the relationship between variables is, in fact, linear. Overall, local calibrations for K were highly accurate for all trials, since even the lowest observed R2 value (0.989, Trial 2) provides an excellent predictive relationship and homoscedastic residuals. The average limit of detection for K is about 0.19 % (where LOD = b + 3σ, with b the predicted y-intercept and σ the standard error in the estimation of the intercept; Lachance and Claisse 1995:272), meaning K concentrations below this value are on average considered statistically indistinguishable from background noise for these data. Linear regression fits for Ca are likewise consistently excellent (mean R2 = 0.996, homoscedastic residuals, minimum R2 value of 0.989). Mean LOD for Ca is about 0.35 %, which is slightly higher than for K but still below any of the predicted values for field samples examined here. The methods used have therefore produced calibrations which very accurately describe the relationship between PXRF counts and known elemental concentrations of K and Ca.
Fig. 6 Linear regression and residuals from R output for K values of Trial 12.
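On one reading of the LOD formula (the intercept plus three standard errors on the counts scale, converted to concentration through the slope), the Table 2 means reproduce the reported K figure; a brief R sketch with hypothetical calibration data:

```r
set.seed(3)
x <- c(0, 0.19, 0.43, 1.24, 1.49, 2.39, 3.72, 4.48)   # %K of the standards
y <- 580 * x + 160 + rnorm(8, 0, 25)                  # hypothetical counts
fit <- lm(y ~ x)

ct <- coef(summary(fit))                  # coefficient table: estimates and SEs
b  <- ct["(Intercept)", "Estimate"]
se <- ct["(Intercept)", "Std. Error"]
((b + 3 * se) - b) / ct["x", "Estimate"]  # LOD expressed in %K, i.e. 3*se/slope

## Check against the Table 2 K means: 3 * 37.2 / 581.2 is roughly 0.19 %K
3 * 37.2 / 581.2
```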

Iron results are a slightly different story. While average regression correlation for Fe was generally quite high (mean R2 = 0.976), residuals consistently show evidence of heteroscedasticity (Fig. 7). The relationship between standard concentrations and counts is therefore probably not linear for Fe. As such, if a linear regression is used to calibrate for Fe in this case, systematic bias will be introduced in resulting estimates. Here, this bias would result in overestimation of concentrations at low and high counts and underestimation of concentrations for moderate counts. This deviation from linearity is likely due to a degree of saturation on the part of the instrument, caused by the instrument becoming marginally less sensitive as Fe concentrations increase. Importantly, the particular pattern of the residuals here indicates a quadratic relationship between variables (Schuenemeyer and Drew 2011:121), and indeed the application of a quadratic regression to Fe data yields homoscedastic residuals as well as higher R2 values on a trial-by-trial basis. Unfortunately, quadratic regression requires the estimation of an additional parameter when compared to linear regression, and therefore requires a greater number of standards to resolve robustly. Iron results based on quadratic regression are therefore not reported here, as they are possibly reflective of bias due to an “over-fitted” statistical model (see Seber and Lee 2003:230). Additionally, a number of predicted Fe concentrations in field samples were higher than reported values for standards, and such extrapolated values do not merit detailed reporting or analysis here. Iron results based on these data and linear regression are likely still useful (sans error terms) on an ordinal scale, although ratio-scale results for these samples will require further testing with additional calibration standards.
Fig. 7 Linear regression and residuals from R output for Fe values of Trial 12.
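A sketch of the linear-versus-quadratic comparison for Fe in R follows; the counts are simulated with a mild saturation term of the kind described above, and all coefficients are invented.

```r
set.seed(4)
x <- c(0, 1.86, 3.43, 4.09, 4.68, 6.97, 8.63, 9.65)   # %Fe of the standards
y <- 5200 * x - 120 * x^2 + 3800 + rnorm(8, 0, 400)   # hypothetical saturating counts

fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))      # quadratic term added via I()

plot(fitted(fit_lin), resid(fit_lin))   # residuals show the curved pattern
summary(fit_lin)$r.squared
summary(fit_quad)$r.squared             # higher, at the cost of one more parameter
```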

The fact that PXRF calibration significantly varies by element provides a powerful example of the need for thorough attention to detail in the calibration and use of PXRF instruments. Similar experiments with the calibration of manganese (Mn) using the same standards, settings, and samples, for example, found that the best fit for this element probably requires a logarithmic transformation of each axis (with additional adjustments to accommodate the blank standard, as zero values are nonsensical on a log scale), suggesting that there may be multiple formulae applicable to proper PXRF calibration. Perhaps appropriate regression functions are patterned as a product of relative elemental concentration, element atomic mass, or some other variable, although investigating such patterning is beyond the scope of this study. At present, it is therefore probably best to rigorously check the applicability of each calibration for each element of interest to avoid biased results. In many cases, this will require users to undertake manual calibration rather than rely on manufacturer software, which often does not provide much statistical information relevant to the overall performance of a given calibration.

Repeatability of Calibrations

Given the overall accuracy of the calibrations generated for K and Ca, evaluation of the degree to which these calibrations remain consistent over repeated uses provides a good measure of the overall effects of drift on instrument sensitivity. To this end, regression parameters from each trial’s internal calibration were statistically compared to examine variation in these parameters, and therefore long-term consistency in instrument sensitivity. These parameters include fitted regression slope (a) and y-intercept (b) from the linear equation \( y=ax+b \) and were compared using independent samples t testing (two-tailed, n = infinite, 95 % confidence) to examine the compatibility of one calibration with each of the others. The ability to perform such comparison is another benefit of manual calibration, as regression parameters and accompanying error terms are not always provided by manufacturer-produced calibration software.
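The study does not spell out the exact construction of the test statistic, but a plausible form compares two fitted slopes using their standard errors, with the normal distribution standing in for t at infinite degrees of freedom; all values below are invented for illustration.

```r
a1 <- 600; s1 <- 15   # hypothetical slope and 1-sigma error from one trial
a2 <- 650; s2 <- 20   # hypothetical slope and 1-sigma error from another trial

t_stat <- (a1 - a2) / sqrt(s1^2 + s2^2)
p_val  <- 2 * pnorm(-abs(t_stat))   # two-tailed; normal stands in for t, df = infinity
c(t_stat, p_val)                    # p < 0.05 would mark these slopes as incompatible
```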

On the whole, K calibrations vary significantly between trials. The slope of a given calibration line is on average statistically compatible with only 0.85 calibrations from other trials; y-intercepts are on average compatible with 2.19 other trials. Moreover, incompatible calibrations are very statistically divergent, as the average p value in t testing of K slopes is <0.0001. Interestingly, the few regression lines which are compatible are typically not derived from consecutive trials, as only one regression slope is compatible between consecutive trials (Trials 17 and 18), and only six intercepts are compatible between consecutive trials; this indicates that the observed bimodal distribution of slope values (see Fig. 8) for K regression is at least partly the product of s-drift differentially affecting data between trials. On the other hand, use-time is somewhat predictive of slope (R2 = 0.604, negative predictive relationship) and y-intercept (R2 = 0.776, positive predictive relationship), indicating that systematic error due to l-drift is also a meaningful factor over the period of experimentation (about 80 h total) for K measurement. Specifically, the instrument seems to have more background noise and less sensitivity to K the longer it is used, even within two work-weeks of relatively continuous measurement. By contrast, instrument use-time exhibits no significant power to predict other indices of calibration performance such as R2 of regression fit (R2 < 0.001) and LOD (R2 = 0.005), so the instrument has not lost any baseline ability to measure K over this time period, provided local (within-trial) calibration is used.
Fig. 8 Probability density function of observed slope values for all K and Ca calibration lines.

Calcium results show similar instability between individual trials, as slopes are compatible with an average of 1.33 other trials and y-intercepts with an average of 4.67. Here, three trials have slopes compatible with the following trial, and six have y-intercepts compatible with the following trial. Usage time is a poor predictor of calibration slope (R2 < 0.001) and y-intercept (R2 = 0.225), so l-drift is not a major factor in Ca calibration error over the timescale observed. This leaves random fluctuation in sensitivity as a result of s-drift as the primary cause of the pronounced differences between one trial’s Ca calibration and the next. As with K, within-trial calibration fit bears no relationship to use-time for Ca.

A few implications of these results are significant here. First, even though each of the calibrations generated was quite accurate, there is a very low chance that any PXRF calibration for K or Ca will be valid outside the use-period in which it is generated. In fact, in these experiments there was no observed instance where consecutive trials produced compatible K calibrations and compatible Ca calibrations. Functionally, then, it is almost as if each trial were performed on a different instrument, providing consistent incentive to make use of a calibration local to each trial. In light of this, many users may want to re-think the use of a large number of standards to build a single, permanent calibration for PXRF measurement. In addition, it is also noteworthy that the chance a given calibration will be applicable to future measurements may vary somewhat by element, and this chance is a product of a combination of random and systematic fluctuations in instrument performance over moderate and long timescales of use. The effects of l-drift in particular appear to vary by element, even if instrumental settings are held constant. This variation fits with previous observations of differential performance between elements (e.g., Goodale et al. 2012) and means that a one-size-fits-all remedy may not suffice—even amongst elements of similar atomic number—and as such a great deal of experimentation may be needed in order to test instrument performance for the range of elements of potential interest to archaeologists.

Accuracy of Elemental Estimates

Table 3 summarizes K concentrations for 20 ceramic samples as predicted by seven different analytical methods. The first six methods are variations on PXRF analysis which make use of different approaches to calibration; differences between these approaches provide additional insight into the degree to which accuracy varies as a result of either (1) the software used, or (2) the temporal proximity of calibration data to field sample data. The first method used was PXRF analysis following the protocol described above, including the creation of a calibration for each trial using R software. Results from this method are referred to as “R Local” results here. The second method is a variation on the first, as it adheres to the above protocol with one key exception: it uses a single calibration derived from data generated from the first of the six trials in which the 20 ceramic samples were measured (Trial 16). Results from this method are referred to as “R Single” here. The third method is much like the second, except it uses data from only the first overall trial (Trial 1) to calibrate and derive estimates. This method is called “R Global” here. The fourth method used Bruker’s S1CalProcess (version 2.2.29), which was provided with purchase of the instrument, instead of R to create a local calibration for each trial. This method is referred to here as “S1Cal Local.” The fifth method also makes use of S1CalProcess, but (like the second method) makes use of a single calibration based on Trial 16 to derive all estimates. This method is referred to as “S1Cal Single” here. The sixth method used a single S1CalProcess calibration based on Trial 1 to derive all estimates, and this method most closely represents the way most users currently undertake PXRF analysis. This method is referred to as “S1Cal Global” here. Lastly, ICP-MS results are displayed (no error terms were provided with analysis). These ICP-MS results were assumed to be “true” concentration values, thereby allowing the quantification of the accuracy of the PXRF methods by comparison. Calcium estimates for these samples were not provided by available ICP-MS data, so only potassium results will be discussed here.
Table 3

%K estimates of ceramics from seven different methods. Each PXRF cell gives the %K estimate with its 1σ error in parentheses; the first three PXRF columns are R-calibrated, the next three are S1CalProcess-calibrated

Sample | R Local | R Single | R Global | S1Cal Local | S1Cal Single | S1Cal Global | ICP-MS %Kᵃ
1776 | 1.15 (0.09) | 1.21 (0.09) | 1.12 (0.05) | 1.27 (0.03) | 1.28 (0.03) | 1.48 (0.07) | 1.00
1777 | 0.31 (0.06) | 0.30 (0.11) | 0.36 (0.05) | 0.92 (0.02) | 0.92 (0.02) | 0.68 (0.05) | 0.36
1778 | 1.03 (0.09) | 1.08 (0.09) | 1.00 (0.05) | 1.25 (0.03) | 1.26 (0.04) | 1.45 (0.08) | 1.07
1779 | 1.34 (0.10) | 1.38 (0.10) | 1.26 (0.05) | 1.33 (0.03) | 1.35 (0.03) | 1.62 (0.06) | 1.22
1781 | 1.27 (0.10) | 1.34 (0.11) | 1.23 (0.07) | 1.60 (0.05) | 1.61 (0.06) | 2.14 (0.11) | 1.08
1782 | 1.23 (0.09) | 1.29 (0.10) | 1.19 (0.05) | 1.07 (0.02) | 1.08 (0.03) | 1.04 (0.06) | 1.17
1783 | 1.15 (0.07) | 1.14 (0.08) | 1.07 (0.05) | 1.03 (0.02) | 1.04 (0.01) | 0.94 (0.03) | 1.18
1785 | 1.22 (0.07) | 1.22 (0.09) | 1.13 (0.05) | 1.51 (0.04) | 1.51 (0.04) | 1.95 (0.07) | 1.15
1786 | 0.62 (0.06) | 0.62 (0.10) | 0.63 (0.05) | 1.02 (0.02) | 1.02 (0.02) | 0.90 (0.04) | 0.59
1787 | 0.76 (0.08) | 0.76 (0.09) | 0.74 (0.05) | 1.49 (0.04) | 1.49 (0.04) | 1.91 (0.08) | 0.81
1788 | 1.11 (0.08) | 1.11 (0.09) | 1.04 (0.05) | 1.09 (0.02) | 1.10 (0.02) | 1.08 (0.05) | 1.06
1792 | 1.30 (0.08) | 1.30 (0.09) | 1.21 (0.05) | 1.29 (0.03) | 1.29 (0.03) | 1.50 (0.06) | 1.28
1793 | 1.04 (0.09) | 1.10 (0.09) | 1.03 (0.05) | 1.09 (0.03) | 1.10 (0.04) | 1.08 (0.09) | 0.90
1794 | 0.44 (0.08) | 0.43 (0.10) | 0.46 (0.05) | 0.86 (0.02) | 0.87 (0.01) | 0.53 (0.03) | 0.53
1795 | 1.30 (0.09) | 1.35 (0.10) | 1.23 (0.05) | 1.88 (0.06) | 1.91 (0.07) | 2.65 (0.11) | 1.27
1796 | 1.27 (0.09) | 1.32 (0.10) | 1.20 (0.05) | 1.67 (0.04) | 1.69 (0.06) | 2.28 (0.11) | 1.17
1797 | 1.02 (0.07) | 1.01 (0.09) | 0.96 (0.05) | 0.98 (0.02) | 0.98 (0.01) | 0.81 (0.03) | 1.20
1798 | 0.32 (0.08) | 0.31 (0.11) | 0.36 (0.05) | 0.81 (0.02) | 0.81 (0.01) | 0.40 (0.03) | 0.49
1799 | 1.20 (0.06) | 1.20 (0.10) | 1.11 (0.06) | 1.49 (0.03) | 1.51 (0.04) | 1.94 (0.08) | 1.04
1800 | 1.14 (0.07) | 1.13 (0.09) | 1.06 (0.05) | 1.31 (0.03) | 1.31 (0.03) | 1.56 (0.07) | 1.27

ᵃNo error terms were reported

Where S1CalProcess was used, data selection also deviated from the R protocol, as S1CalProcess automatically selects data for inclusion using its own criteria. Also, when using this software, estimates represent the median and simple standard deviation of all assays rather than the weighted mean used above. This was done because weighted mean calculation for these data resulted in mean errors of about ±0.01 % K, and such low errors are not reasonable given average reported errors in the calibration standards of about ±0.06 %. In part, this is because S1CalProcess neither considers error terms in x-values nor reports error terms in calibration and estimation for each trial, so the program probably tends to underestimate the real errors in results. Using the simple standard deviation inflates these error terms to a slightly more realistic ±0.04 % while encompassing scatter across all trials.
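To make the contrast between the two summary statistics concrete, the sketch below (using hypothetical assay values) computes both the inverse-variance weighted mean with its propagated error and the median with simple standard deviation. With several assays, the weighted-mean error shrinks toward the implausibly small ±0.01 % scale noted above, while the simple standard deviation preserves the scatter between assays:

# Hypothetical repeated %K assays of one sample, with 1-sigma errors.
k_est <- c(1.12, 1.15, 1.10, 1.18, 1.13, 1.16)
k_err <- c(0.03, 0.04, 0.03, 0.05, 0.04, 0.03)

# Inverse-variance weighted mean and its propagated 1-sigma error;
# this error shrinks as assays accumulate.
w        <- 1 / k_err^2
wmean    <- sum(w * k_est) / sum(w)
wmean_se <- sqrt(1 / sum(w))

# Median and simple standard deviation, which retain the scatter across
# assays and therefore give a more conservative error term.
c(weighted = wmean, weighted_se = wmean_se,
  median = median(k_est), sd = sd(k_est))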

Accuracy testing used two statistical measures. First, a paired-samples t test (two-tailed, n = 20 pairs, 19 df) was used to examine whether independent methods produced sets of results which were statistically identical overall. Second, each of the six methods of deriving PXRF estimates was compared with the ICP-MS estimates by calculating z scores on a sample-by-sample basis; this second test highlights the proportion of individual sample estimates which are statistically compatible with the ICP-MS data. The probabilities that the null hypothesis (in all cases, that observed values are statistically identical) is met are listed in Table 4 for the paired t tests and in Table 5 for the z tests of each sample. Tests significant at the 95 % level of confidence are italicized in these tables. As a whole, the t test p values show that only PXRF results generated with the R Local and R Single calibrations are statistically compatible with the ICP-MS estimates overall. All other methods, including R Global, produced results which statistically differ from the ICP-MS data.
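Both tests are straightforward to reproduce in R. The sketch below illustrates them with the Table 3 values for the first four samples only (R Local versus ICP-MS); it is indicative rather than an exact reconstruction, since the published z tests may incorporate error components beyond the 1σ terms shown here:

# %K estimates for the first four samples in Table 3 (R Local vs. ICP-MS).
pxrf     <- c(1.15, 0.31, 1.03, 1.34)
pxrf_err <- c(0.09, 0.06, 0.09, 0.10)
icp      <- c(1.00, 0.36, 1.07, 1.22)

# Overall comparison: two-tailed paired-samples t test
# (the full analysis uses all n = 20 pairs).
t.test(pxrf, icp, paired = TRUE)

# Sample-by-sample comparison: z score of each difference against the
# PXRF error term, converted to a two-tailed p value.
z <- (pxrf - icp) / pxrf_err
round(2 * pnorm(-abs(z)), 3)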
Table 4

Paired t test results comparing %K estimates derived from various methods

 | R Local | R Single | R Global | S1Cal Local | S1Cal Single | S1Cal Global | ICP-MS
R Local | – | 0.019 | <0.001 | 0.001 | <0.001 | 0.001 | 0.445
R Single | 0.019 | – | <0.001 | 0.002 | 0.001 | 0.002 | 0.222
R Global | 0.043 | 0.062 | – | <0.001 | <0.001 | 0.001 | <0.001
S1Cal Local | 0.237 | 0.219 | 0.280 | – | <0.001 | 0.057 | <0.001
S1Cal Single | 0.245 | 0.226 | 0.288 | 0.008 | – | 0.067 | <0.001
S1Cal Global | 0.385 | 0.367 | 0.428 | 0.148 | 0.140 | – | 0.002
ICP-MS | 0.019 | 0.038 | 0.024 | 0.256 | 0.264 | 0.404 | –

Test p values are above and to the right of the diagonal; values significant at 95 % confidence are in italics. Below and to the left of the diagonal are the \( \bar{X}_D \) values for each test, which provide an index of the average difference between methods’ estimates of %K

Table 5

z Testing of multiple PXRF %K estimates against ICP-MS data. Cells are p values

Sample | R Local | R Single | R Global | S1Cal Local | S1Cal Single | S1Cal Global
1776 | 0.086 | 0.019 | <0.001 | <0.001 | <0.001 | <0.001
1777 | 0.410 | 0.613 | <0.001 | <0.001 | <0.001 | <0.001
1778 | 0.704 | 0.938 | <0.001 | <0.001 | <0.001 | <0.001
1779 | 0.234 | 0.117 | 0.201 | <0.001 | <0.001 | <0.001
1781 | 0.070 | 0.016 | <0.001 | <0.001 | <0.001 | <0.001
1782 | 0.552 | 0.195 | <0.001 | <0.001 | <0.001 | 0.020
1783 | 0.653 | 0.633 | <0.001 | <0.001 | <0.001 | <0.001
1785 | 0.350 | 0.441 | <0.001 | <0.001 | <0.001 | <0.001
1786 | 0.564 | 0.770 | 0.335 | <0.001 | <0.001 | <0.001
1787 | 0.562 | 0.570 | 0.002 | <0.001 | <0.001 | <0.001
1788 | 0.532 | 0.570 | <0.001 | 0.177 | 0.100 | 0.721
1792 | 0.804 | 0.856 | <0.001 | 0.770 | 0.775 | <0.001
1793 | 0.148 | 0.032 | <0.001 | <0.001 | <0.001 | 0.045
1794 | 0.278 | 0.329 | <0.001 | <0.001 | <0.001 | 0.920
1795 | 0.774 | 0.428 | <0.001 | <0.001 | <0.001 | <0.001
1796 | 0.266 | 0.114 | 0.081 | <0.001 | <0.001 | <0.001
1797 | 0.013 | 0.024 | <0.001 | <0.001 | <0.001 | <0.001
1798 | 0.043 | 0.092 | <0.001 | <0.001 | <0.001 | 0.008
1799 | 0.006 | 0.087 | <0.001 | <0.001 | <0.001 | <0.001
1800 | 0.078 | 0.097 | <0.001 | 0.182 | 0.184 | <0.001

Probabilities showing significant statistical differences from the ICP-MS data (at 95 % confidence) are in italics

The two types of R calibration which produced results compatible with ICP-MS estimates differed significantly from each other (p = 0.019), and are therefore not completely interchangeable. Of these two methods, R Local was superior for two reasons. First, it produced a slightly higher rate of compatible individual estimates in z testing (17 out of 20, versus 16 out of 20 for R Single). Second, in paired-sample t testing, the \( \bar{X}_D \) value of the R Local results was half that of the R Single results (0.019 versus 0.038) when each was compared to the ICP-MS estimates, indicating that the R Single results were on average twice as divergent from the ICP-MS data. Of all the PXRF methods used, then, R Local produced the best results, R Single produced results which are accurate as a dataset but less consistent on a sample-by-sample basis, and R Global and the S1CalProcess methods produced inaccurate results. Figure 9 contrasts plots of the best and worst methods used here.
Fig. 9

%K estimates using two PXRF calibrations compared with ICP-MS %K estimates. Error terms are 1σ

The relatively poor performance of the S1CalProcess software may stem in part from the fact that it is probably not designed to work well with such a small number of standards; typically, an analyst using this software would include more standards to build a robust regression. On the other hand, these data generally support the idea that local calibration is better, as temporal proximity of calibration data to field sample data yields marginally better correspondence with ICP-MS data regardless of the software used. Not only are R Local results better than R Single results and R Single results better than R Global results (particularly in the number of significant z scores), but S1Cal Local results agree marginally better with ICP-MS data than S1Cal Single results, which in turn are marginally better than S1Cal Global results. Because frequent re-calibration is apparently necessary for the derivation of accurate K estimates, the incentive to use calibration software which needs a large number of standards is reduced, since frequent use of such software will greatly inflate measurement time. Further, because S1CalProcess does not take error in standard concentrations into account, it is reasonable to ask whether its inaccuracy is also due in part to embedded problems with its regression fitting and interpolation. For the present, it is therefore probably best to avoid S1CalProcess in favor of local calibrations which use a small number of standards while effectively modeling errors throughout the process. When this is done, demonstrably accurate results can be achieved, at least under the experimental parameters used here.

Repeatability of Elemental Estimates

As with the calibration data, examination of the consistency with which elemental estimates could be derived provided important information about instrument stability, as well as about the repeatability and reliability of results. Evaluation of this consistency relied on statistical comparison of elemental concentrations from one trial to the next, carried out in two ways. First, paired-sample t testing (n = 63) was carried out as above to compare the first trial with the second trial for all samples. Second, an independent two-sample t test (two-tailed, n = 10 for each sample, reflecting the number of replicate measures per sample per trial) was used to compare estimates of elemental concentration on a sample-by-sample basis. For this test, an F test was first conducted to assess compatibility of variances; Student’s t test (18 df) was used in cases of compatible variances and Welch’s t test (variable df) in cases of incompatible variances. These analyses were performed (at 95 % confidence) on all samples for both K and Ca results; Fe results would have needed to be based on a quadratic calibration to constitute a valid test here and were therefore omitted.
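In R, this two-step comparison amounts to an F test followed by the appropriate t test. A minimal sketch, using hypothetical replicate values, follows:

# Hypothetical replicate %K measurements of one sample from two trials
# (n = 10 per trial, matching the protocol above).
trial1 <- c(1.10, 1.14, 1.08, 1.16, 1.12, 1.09, 1.15, 1.11, 1.13, 1.10)
trial2 <- c(1.18, 1.21, 1.15, 1.24, 1.19, 1.17, 1.22, 1.20, 1.16, 1.23)

# Step 1: F test for compatibility of variances.
vtest <- var.test(trial1, trial2)

# Step 2: Student's t test (pooled variance, 18 df) if variances are
# compatible; otherwise Welch's t test (variable df).
t.test(trial1, trial2, var.equal = vtest$p.value > 0.05)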

Paired-sample t tests of the K results show that K estimates as a whole differ significantly from the first trial to the second (p = 0.004). However, individual sample estimates are statistically compatible from the first trial to the next at a rate of about 70 % (44 out of 63), showing that statistical redundancy is achieved for the majority of individual samples after only two trials. S-drift was again the likely factor when results were not repeated between trials; each of the 21 trials contained samples and/or standards whose count distributions relevant to K showed a degree of bimodality, indicating that s-drift was pervasive throughout experimentation. Of the 19 samples which statistically differed between Trials 1 and 2, however, 15 (79 %) produced a Trial 3 estimate which was statistically compatible with the estimate from either Trial 1 or Trial 2. Thus, after three trials statistical redundancy was achieved in K estimates for 59 of 63 field samples (94 %), indicating that at least three trials are needed to generate a large enough sample of measurements to average out s-drift and achieve repeatable concentrations of K under these experimental conditions.

Paired-sample t tests of the Ca results show that Trials 1 and 2 are statistically compatible (p = 0.06), although just barely. Individual sample estimates are statistically compatible from Trial 1 to Trial 2 at a rate of about 90 % (57 out of 63). Here, s-drift was just as pervasive, but the error it produced was less disruptive to overall consistency than it was for K, possibly because larger average errors in the Ca estimates (±0.14 % versus ±0.08 % for K) drove greater overall statistical compatibility. Of the six samples for which Trials 1 and 2 produced incompatible results, all but one produced a Trial 3 measurement which was statistically compatible with one of the previous trials, meaning statistical redundancy was achieved for over 98 % of the Ca estimates after three trials. Again, three trials appears to be the threshold for deriving repeatable estimates in spite of s-drift.

Importantly, repeatable estimates of K and Ca concentrations were produced even when the calibrations from which they were derived were not themselves repeatable from one trial to the next. This outcome is further testament to the advantages of local calibration as a means of accounting for fluctuations in instrument sensitivity, as the method generally prevented inconsistency in instrument performance from being passed along as inconsistency in results.

Conclusions

The protocol proposed here addressed significant issues in calibration and instrument performance to produce PXRF estimates of K and Ca which are generally consistent over repeated measurements. In the case of K, this protocol also produced PXRF estimates which agree with independent estimates derived from established laboratory methods, despite known difficulties in measuring low Z elements. In fact, these estimates agree for 85 % of the samples independently measured for K, which is a high rate considering that even well-established methods such as ICP-MS often produce estimates which diverge from one another (Liritzis and Zacharias 2010), especially for K and Ca (e.g., Murphy et al.2002). This indicates that the accuracy of K estimates from PXRF can be roughly on par with that of more established instruments, provided measured samples exceed the relatively high minimum concentrations detectable by PXRF. Further, accurate results were accompanied by error terms which fully reflect true errors in PXRF analysis while achieving precision roughly comparable to the average reported errors in the calibration standards themselves, indicating that PXRF using this protocol is a viable alternative for measurement of K and Ca. Measurement of Fe was less successful, but this was largely an issue of calibration, and the inclusion of additional calibration standards with higher known Fe concentrations seems likely to fix the problem.

By contrast, deviations from this protocol failed to produce accurate results. Experimentation showed that the resulting inaccuracies were likely due to a number of factors, including (1) both random and systematic changes in instrument sensitivity over multiple timescales, (2) the calibration software used, and (3) the calibration data used. The portions of the protocol essential to success therefore appear to be (1) repeated measurement of samples and standards within trials, (2) local calibration, (3) empirical simulation of true errors in calibration, and (4) redundant trials.

Unfortunately, at a practical level, this protocol is quite rigorous, and it strips away much of the appeal of PXRF for users interested in results which are both rapid and non-destructive. Further experimentation will reveal to what degree the experimental controls used here can be relaxed without compromising results, as well as the degree to which the results here are applicable to other materials, elements, ranges of concentration values, instruments, settings, etc. For the present, however, it seems that the accuracy of PXRF is limited primarily by the time investment required for re-sampling, even in cases where sample composition is relatively homogeneous. The method may therefore be a valid alternative for bulk 40K dosimetry in a devoted laboratory setting, provided access, cost, and productivity are commensurate with larger laboratory resources and needs. In cases where heterogeneous samples are measured, however, more re-measurement will be necessary to account for additional scatter within observed counts due to micro-compositional differences between sampled locations. As such, in situ applications which require accurate estimates of K, Ca, and possibly other elements may not be well served by PXRF in many cases. In situ measurements of heterogeneous materials may still have some use, however, provided users are interested in ordinal-scale discrimination between samples and are appropriately diligent in assessing errors in repeated measurements.

Acknowledgements

Significant assistance in statistics, calibration, and R software was provided by UW statisticians Soyoung Ryu, Paul Sampson, and Jonathan Gruhl. Sediment samples used for this study were collected during fieldwork supported by National Science Foundation DDIG #0731529. PXRF units and equipment used for this study were purchased with funds awarded by University of Washington STF Award #2008-068-1. Access to field sites in Peru was graciously given by Dr. Santiago Uceda (Universidad Nacional de Trujillo) and by Mr. Francisco Burga (Agroindustrias San Simon S.A.). Ceramic samples and ICP-MS data were made available by Dr. James Feathers (University of Washington). Jim also deserves thanks along with Dr. Donald Grayson (UW) and two anonymous reviewers for providing insightful comments on drafts.

References

1. Abrahams, P. W., Entwistle, J. A., & Dodgshon, R. A. (2010). The Ben Lawers historic landscape project: simultaneous multi-element analysis of former settlement and arable soils by X-ray fluorescence spectrometry. Journal of Archaeological Method and Theory, 17(3), 231–248.
2. Aitken, M. J. (1985). Thermoluminescence dating. London: Academic.
3. Buhrke, V. E., Jenkins, R., & Smith, D. K. (Eds.). (1998). A practical guide for the preparation of specimens for X-ray fluorescence and X-ray diffraction analysis. New York: Wiley.
4. Burley, D. V., Sheppard, P. J., & Simonin, M. (2011). Tongan and Samoan volcanic glass: pXRF analysis and implications for constructs of ancestral Polynesian society. Journal of Archaeological Science, 38(10), 2625–2632.
5. Craig, N., Speakman, R. J., Popelka-Filcoff, R. S., Glascock, M. D., Robertson, J. D., Shackley, S. M., & Aldenderfer, M. S. (2007). Comparison of XRF and PXRF for analysis of archaeological obsidian from southern Perú. Journal of Archaeological Science, 34(12), 2012–2024.
6. Davis, L. G. (2011). Return to Cooper’s Ferry site: studying cultural chronology, geoecology, and foragers in context. Idaho Archaeologist, 34, 1–4.
7. Davis, L. G., Macfarlan, S. J., & Henrickson, C. N. (2012). A PXRF-based chemostratigraphy and provenience system for the Cooper’s Ferry site, Idaho. Journal of Archaeological Science, 39(3), 663–671.
8. Forster, N., & Grave, P. (2012). Non-destructive PXRF analysis of museum-curated obsidian from the Near East. Journal of Archaeological Science, 39(3), 728–736.
9. Frankel, D., & Webb, J. M. (2012). Pottery production and distribution in prehistoric Bronze Age Cyprus: an application of pXRF analysis. Journal of Archaeological Science, 39(5), 1380–1387.
10. Goodale, N., Bailey, D. G., Jones, G. T., Prescott, C., Scholz, E., Stagliano, N., & Lewis, C. (2012). pXRF: a study of inter-instrument performance. Journal of Archaeological Science, 39(4), 875–883.
11. Goren, Y., Mommsen, H., & Klinger, J. (2011). Non-destructive provenance study of cuneiform tablets using portable X-ray fluorescence (pXRF). Journal of Archaeological Science, 38(3), 684–696.
12. Lachance, G. R., & Claisse, F. (1995). Quantitative X-ray fluorescence analysis. Chichester, UK: Wiley.
13. Liritzis, I., & Zacharias, N. (2010). Portable XRF of archaeological artifacts: current research, potentials and limitations. In M. S. Shackley (Ed.), X-ray fluorescence spectrometry (XRF) in geoarchaeology (pp. 109–142). New York: Springer.
14. Millhauser, J. K., Rodriguez-Alegria, E., & Glascock, M. D. (2011). Testing the accuracy of portable X-ray fluorescence to study Aztec and Colonial obsidian supply at Xaltocan, Mexico. Journal of Archaeological Science, 38(11), 3141–3152.
15. Murphy, K. E., Long, S. E., Rearick, M. S., & Ertas, O. S. (2002). The accurate determination of potassium and calcium using isotope dilution inductively coupled “cold” plasma mass spectrometry. Journal of Analytical Atomic Spectrometry, 17(5), 469–477.
16. Nazaroff, A. J., Prufer, K. M., & Drake, B. L. (2009). Assessing the applicability of portable X-ray fluorescence spectrometry for obsidian provenance research in the Maya lowlands. Journal of Archaeological Science, 37(4), 885–895.
17. Phillips, S. C., & Speakman, R. J. (2009). Initial source evaluation of archaeological obsidian from the Kuril Islands of the Russian Far East using portable XRF. Journal of Archaeological Science, 36(6), 1256–1263.
18. Polikreti, K., Murphy, J. M. A., Kantarelou, V., & Germanos Karydas, A. (2011). XRF analysis of glass beads from the Mycenaean palace of Nestor at Pylos, Peloponnesus, Greece: new insight into the LBA glass trade. Journal of Archaeological Science, 38(11), 2889–2896.
19. Potts, P. J., & West, M. (2008). Portable X-ray fluorescence spectrometry: capabilities for in situ analysis. Cambridge, UK: The Royal Society of Chemistry.
20. Schuenemeyer, J. H., & Drew, L. J. (2011). Statistics for earth and environmental scientists. Hoboken, NJ: Wiley.
21. Seber, G. A. F., & Lee, A. J. (2003). Linear regression analysis. Hoboken, NJ: Wiley.
22. Shackley, M. S. (2010). Is there reliability and validity in portable X-ray fluorescence spectrometry (PXRF)? The SAA Archaeological Record, 10(5), 17–20.
23. Sheppard, P. J., Irwin, G. J., Lin, S. C., & McCaffrey, C. P. (2011). Characterization of New Zealand obsidian using PXRF. Journal of Archaeological Science, 38(1), 45–56.
24. Speakman, R. J., Little, N. C., Creel, D., Miller, M. R., & Iñañez, J. G. (2011). Sourcing ceramics with portable XRF spectrometers? A comparison with INAA using Mimbres pottery from the American Southwest. Journal of Archaeological Science, 38(12), 3483–3496.

Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

1. Department of Anthropology, University of Washington, Seattle, USA
