Introduction

Regression and correlation are the most used techniques for investigating the relationship between two quantitative variables [1, 2]. There is, though, a key difference between them; regression is how one variable affects another and correlation measures the degree of a relationship between two independent variables (x and y). A good example of regression analysis is the construction of an analytical calibration curve, where the instrument response (the dependent variable) depends upon the concentration of the analyte (independent variable). In other words, if someone aims to analyze the effect of how an independent variable is numerically associated with a dependent variable, then the use of regression is mandatory [3]. The ultimate goal of calibration involves the prediction of the concentration of an analyte from a single instrumental response after setting up the above relationship between the values of known samples (i.e., standards with known amounts of analyte present) and instrument responses. Calibration curves (or graphs or plots) are the bread and butter of analytical chemistry, and their examination is an important step in any method validation or application.

Although the common name of the resulting plot is “calibration curve,” analytical chemistry researchers, typically, attempt fitting a linear function. Calibration, typically, involves the proper preparation of a set of standards containing a known amount of the analyte of interest, measurement of the instrument response for each standard, and establishment of the relationship between them. Based on a certain number of measurements of standards, two fitting techniques for the linear regression model can be established to set up the calibration curve and estimate the slope and the intercept of a linear calibration model: the ordinary least-squares (OLS) and weighted least-squares (WLS) [4, 5]. The testing of the reported linearity of a calibration curve should be an everyday task in routine analytical operations. While a great deal of effort is put in the selection and calibration of an analytical instrument, the choices behind curve calculation are usually overlooked. However, neglecting the consideration of statistics of calibration can lead to unfortunate conclusions as there is no advanced instrumentation or additional measurements that can rescue sound data from an erroneously built calibration curve. A calibration curve should either confirm the analyte–response relationship or raise an alert of the presence of a problem, which should properly be addressed.

Critical to the success of calibration and correlation is the understanding of the limitations and statistics used to set up a curve. The aim of this tutorial review is to provide a good practice guide in building calibration and correlation experiments and to explain how the results should be evaluated and interpreted. It centers on calibration experiments where the relationship between response and concentration is expected to be linear, although many of the principles of good practice described can be applied to non-linear systems, as well. First, we provide a general-case argument for the minimum number of standards required by regulatory guidance. Then, we examine the poor success rate of simple outlier detection in calibration curves using equidistant (linear) and logarithmic scale standard spacing, as well as we look into the significant risks associated with extrapolating the curve (where appropriate) beyond the linear response region. To enhance the understanding of this significant field, we present, through a practical example, a step-by-step procedure dealing with typical challenges related to regression, outlier assessment, procedures for linearity testing, calculation of the associated error of the predicted concentration and the limits of detection. The utilization of the concepts of correlation and agreement to compare analytical methods is also elaborated. The regression data results are acquired by utilizing the Excel spreadsheet of Microsoft, being perhaps one of the most widely used user-friendly software in educational settings.

Handling calibration data sets

One of the first questions analysts often ask is: “How many calibration standards do we need to measure and what is the number of replicates at each calibration level?” Before answering this question, the purpose of the calibration experiment must be defined. It is necessary to make a distinction between the calibration of a measurement system and the check of the validity of the calibration of a measurement system. To minimize the risk of error associated with improper calibration of a measurement system, international guidance dictates a minimum number of calibrators and the threshold at which a measurement becomes an outlier. Regulatory guidance provides the minimum required number of standards to establish the calibration curve. For an assessment of the calibration function, as part of a method validation, for example, standards with at least seven different concentrations should be included. The EURACHEM Guide “The Fitness for Purpose of Analytical Methods” and the draft guidance from the USFDA mandate a minimum of seven calibration standards—six plus zero concentration standards—to perform the calibration [6, 7]. Other documents lay down a different number of calibration levels. For example, ISO standard 15,302:2007 specifies four calibration levels and Commission Decision 2002/657/EC stipulates at least five concentration levels (including blank) for the construction of a calibration curve [8], as a minimum requirement for an assessment of the calibration function. ISO standard 8466–1:1990 demands ten calibration levels [9]. However, these requirements do not explain why the curve should be drawn with this number of points and not with more or less than that. The sample with zero analyte concentration should definitely be included as it allows us to gain better insight into the region of low analyte concentrations and detection capabilities.

The design of calibration experiments and the number of calibration levels depend very much not only on the purpose of the experiment but also on the existing knowledge. Less knowledge about the shape of the calibration functions requires performing initial assessment measurements on more concentration levels. Ideally, the calibration range should be drawn so that the concentrations of the analyte in the test samples fall in the center of the range, where the uncertainty associated with the predicted concentrations is minimized. It is also useful to make at least triplicate independent measurements at each concentration level, particularly at the method validation stage, as it allows the precision of the calibration process to be evaluated, at each concentration level. Analyte calibration solutions should be prepared from a pure substance with a known purity value or a solution of a substance with a known concentration.

The standard concentrations should not only cover the range of concentrations encountered during the analysis of test samples but also, they should be evenly spaced across the range (for wide calibration ranges, partial arithmetic series could be considered). However, the risk of leverage arises from the introduction of error into the measurement in the calculated curve, even in the absence of an outlier (vide infra). Leverage can be a concern if one or two of the calibration points are far from the others along the x-axis (near the ends of a calibration curve), where any error has a disproportionate effect on the curve. Even if these points are not outliers, they may have a leverage to a certain degree. In other words, a relatively small error in the measured response will have a significant effect on the position of the regression line. Often, this situation arises when calibration standards are prepared by sequential dilution of solutions. The procedure frequently employed in the preparation of calibration standards is to prepare the most concentrated standard and then dilute it by, say, 50%, to obtain the next standard. This standard is subsequently diluted by 50% and so on (e.g., 64 μg L−1, 32 μg L−1, 16 μg L−1, 8 μg L−1, 4 μg L−1, 2 μg L−1). This procedure is not recommended as, in addition to the lack of independence, the standard concentrations will not be evenly spaced across the concentration range, leading to a leverage (e.g., the concentration of 64 μg L−1 in Fig. 1). As a result, the calculated slope and intercept might be disproportionally affected by that data point.

Fig. 1
figure 1

Leverage due to uneven distribution of the calibration standards

Linearity in calibration and its misinterpretation

Linearity is an important feature for any analytical method. If the calibration function is linear, then, the estimation of the equation is easier and errors in estimating the concentrations of unknown samples from the calibration equation are likely to be smaller. The correlation coefficient (r) of a calibration graph or the R-squared (R2) or coefficient of determination is usually employed as an indicator of linearity, by inspecting its closeness to 1. Strictly speaking, r is a measure of the relationship between two variables x and y. Its use in calibration, though, is based on a widespread misunderstanding; if the calibration points are clustered around a straight line (and this is not the unique case), the experimental value of r will be close to unity. However, the opposite is not true. The International Union of Pure and Applied Chemistry (IUPAC) discourages the use of r to assume linearity in the relationship between concentration and analytical response. This is expressed by the excerpt from Ref. [4]: “... the correlation coefficient, which is a measure of the relationship of two random variables, has no meaning in calibration...” Furthermore, when a new analytical method is developed, the guide for authors of the Journal of Chromatography Α explicitly states that “claims of linearity should be supported by regression data that include slope, intercept, standard deviations of the slope and intercept, standard error and the number of data points; correlation coefficients are optional.” That is, the criterion of r to provide a measure of the degree of linear association between concentration and signal is weak.

In view of the above, the perception of linearity based on the criterion of correlation coefficient has been overturned by relevant statistical tests, in the last years. The assessment of the linearity can be carried out by resorting to the analysis of variance (ANOVA) of the calibration data [10]. In this methodology, a comparison of the so-called lack-of-fit (LOF) variance with the squared pure error is made through an F-test (See the “Practical example” and “Procedures for linearity assessment” section); however, it is essential to consider the error components of the regression. The prediction error for each data point can be measured as the difference between the observed response (\({y}_{ij}\)) and the predicted response (\({\widehat{y}}_{ij}\)), i.e., \({y}_{ij}-{\widehat{y}}_{ij}\). To assess the overall prediction error, we calculate this difference for every data point, square it, and then sum all these squared differences, resulting in a term known as the “sum of squares of the residuals (SSRES).” The degrees of freedom (d.f.) associated with SSRES is [p (concentration levels) × n (replicates)] − 2 since we estimate the slope and the intercept (i.e., two parameters).

This residual error, SSRES, can be decomposed into two constituent components: the “sum of squares for lack of fit (SSLOF)” and the “sum of squares for pure error SSPE,” e.g., SSRES = SSLOF + SSPE (Fig. 2).

Fig. 2
figure 2

Error components in regression analysis

When a line effectively fits the data, it implies that the average of observed responses \({\overline{y} }_{i}\) at each x-value closely aligns with the predicted response (\({\widehat{y}}_{ij}\)), for that specific x-value. Consequently, to assess the extent to which the overall error arises from model inadequacy, we gauge the deviation between the average observed response at each x-value and the predicted response for each data point, i.e., \({\overline{y} }_{i}\)\({\widehat{y}}_{ij}\) (Fig. 2). To measure the complete lack of fit of the model, we calculate this distance for every calibration point, square it, and sum all these squared differences to obtain the SSLOF (associated with p-2 degrees of freedom).

To assess the portion of the overall error attributed solely to random fluctuations, we examine the extent to which each observed response (\({y}_{ij}\)) deviates from the average observed response (\({\overline{y} }_{i}\)) at each corresponding concentration (x-value), i.e., \({y}_{ij}-{\overline{y} }_{i}\). Similarly, the total pure error is calculated by summing the squared differences for each calibration point to get the SSPE (p × n − p d.f.). When the respective sum of squares is divided by their associated degrees of freedom, the mean squares (MS) are obtained. It is those mean squares from which we can calculate the F-statistic, as follows:

$$F = \frac{MS\left(LOF\right)}{MS \left(PURE\; ERROR\right)}$$

It is advisable that readers, for further reading, peruse the excellent brief reports edited by the Analytical Methods Committee of the Royal Society of Chemistry [11] and the textbook “Calibration and Validation of Analytical Methods” edited by Mark T. Stauffer [12], which seek to introduce the readers to current methodologies of analytical calibration.

Ηomogeneity and non-homogeneity of variances

Most of the reports in the literature refer to OLS, which practically, should be only used when experimental data have constant variance (homoscedasticity). In contrast, WLS is more appropriate when the variance varies (heteroscedasticity) or in other words, when every calibration point does not have an equal impact on the regression. There are several tests for homogeneity of more than two variances [13]. A simple way for testing homoscedasticity is to plot the residuals calculated from the straight line obtained by using the OLS method. A horizontal band of residuals indicates constant variance and unweighted least squares regression is recommended. A trumpet-shaped opening toward larger values signifies increasing variability as concentration increases.

From a practical viewpoint, when a narrow concentration range is considered, the unweighted linear model is usually adapted while a larger range may require a weighted model. If the weight is estimated incorrectly, the calculated estimators of regression coefficients (slope and intercept), being sensitive to extreme data points will be biased, with a concomitant negative impact on the predicted concentration intervals for real samples. It is noted that ignoring the inhomogeneity of variances will not sacrifice much statistical reliability when working in the mid-range of the calibration curve. Nonetheless, the WLS can reduce the limit of quantification and enable a broader linear calibration range with higher accuracy and precision, especially for bioanalytical methods. Depending on the characteristics of the data set, the weighting factors can be employed in a number of different ways [14]. The incorporation of heteroscedasticity into the calibration procedure is recognized by ISO and USFDA; the latter recommends that “the simplest model that adequately describes the concentration–response relationship should be used. Selection of weighting and use of a complex regression equation should be justified” [15]. Raposo, in his tutorial review, provides an illustrative example selected from the literature that best suits the weighting approach in calibration [5]. A more detailed examination of this topic is beyond the scope of this review.

Non-linearity

It may be the case that several analytical methods exhibit good response over a broad concentration range (i.e., several orders of magnitude). Because of this behavior, it is helpful to construct a double logarithmic plot based on the raw data. Importantly, the resulting plot serves the purpose of proving the method response over this broad concentration range but not of fitting linear function to the data sets. In other cases, non-linear dependence of analytical response on concentration is likely to appear, for instance, in analytical methods based on electrochemistry (Fig. 3). The method that leads to such a calibration data set may have unacceptably low or very low sensitivity (note the low slope in the concentration range above 10 nM in the raw data calibration plot of Fig. 3A). To cope with this situation, some researchers choose to tap into the logarithms of concentration values and the analytical response (Fig. 3B), thus claiming linearity of the method over the extended concentration range. Using raw data or data after logarithmic transformation primarily depends on analytical principles.

Fig. 3
figure 3

Calibration data set in standard plots with linear axes showing response plotted against A concentration and B logarithm of concentration. Graph B is often incorrectly used to prove linear dependence between analytical response and concentration

Arguably, associating the analytical response with the logarithm of concentration in handling calibration data sets may not be the best choice, as it can easily lead to misinterpretation of the analytical performance. A legitimate practice in this case is to consider displaying calibration data sets in plots with linear axes (equidistant scale) fitted with logarithmic or other nonlinear functions. Nonetheless, when the non-linear relationship is fitted for a small number of concentration levels this makes the process unreliable. Any curve that includes a non-linear response should have sufficient additional points between the upper limit of linearity and the upper limit of quantitation to describe the inflection point. Unless a great number of concentration levels are included in the data set, the plateau interval within a fitted nonlinear function cannot be used for quantitative analysis.

In 2020, P.L. Urban published a Perspective on the dependence between analytical response and logarithm of concentration, with the deterring title “Please Avoid Plotting Analytical Response against Logarithm of Concentration” [16]. The author puts forward a well-argued case for proper data treatment, where a non-logarithm of concentration is displayed in the x-axis of the calibration plot, threatening bad results if someone does not follow it. Others, in defense of the application of logarithmically transformed data, claim that enzyme-catalyzed reactions or electrochemical data in logarithmic form are more appropriate for function fitting [17].

Outliers’ assessment in linear regression

An outlier is an experimental measurement that is significantly different from the rest of the entire data set. In the case of calibration, an outlier appears as a point which is well apart from the trend of the other calibration points and introduces a leverage or bias into the position of the line. Once in the middle of the calibration range, the outlier can shift the regression line up or down (Fig. 4A). The slope of the line will approximately be correct but the intercept will be wrong. In this case, a bias is introduced. An outlier at the extremes of the calibration range will change the position of the calibration curve by tilting it upwards or downwards (Fig. 4B). The outlier is said to have a degree of leverage.

Fig. 4
figure 4

A The outlier laid in the middle of the calibration range introduces bias (red line). B Leverage caused by an outlier laid in the uppermost part of the calibration curve

After estimating the regression parameters, the plot needs to be examined to identify any data points that deviate significantly from the remaining data set (considering the assumption of linear calibration). For this purpose, initially, the residuals (yi − \(\widehat{y}\)) are calculated and graphed against their corresponding concentration levels. Two horizontal dotted-dashed lines, representing ± t (0.05, p–2) × Sy/x, are utilized to denote the permissible deviation for each individual point in the residual plot. Any calibration data point(s) that lie outside these lines are considered qualitative outliers. Details are given in the “Practical example” and “Assessment of outliers” section.

With respect to identifying outliers in the case of fitting curves with nonlinear regression, a method has been proposed, which combines robust regression and outlier removal [18]. Although this is not an easy task, the analysis of simulated data demonstrated that this procedure identified outliers from nonlinear curve fits with reasonable power and few false positives.

Standard addition(s) method

The use of pure standard solutions to establish the ordinary calibration graph assumes that there is no reduction or enhancement of signal by other matrix components of real samples. Hence, the so-called direct calibration method of analysis can be applied. However, in many areas of analytical chemistry, this assumption is not valid as matrix effects can occur even with methods that have the reputation of being relatively free from interferences. A solution to this problem is that all the analytical measurements, including the construction of the calibration graph, can be performed using the sample itself, thus applying the standard addition(s) method. Matrix effects should be ascertained previously; this can be done by testing the statistical equality of the slopes of the lines arising from the direct calibration and standard addition methods. If slopes are demonstrated to be statistically different at a certain confidence level (and the adequate number of degrees of freedom), the use of the standard addition method is undoubtedly the most favorable choice [19].

Note that sound evidence of linearity of the direct calibration method is a requisite for the use of the method of standard addition and the extrapolation involved. Also, standard addition cannot be applied if the intercept of the regression equation estimated by pure standard solutions in the direct calibration method is not zero, or close to zero (the zero-intercept assumption is seldom plausible). Unless the intercept is zero, the method will show a positive error compared with the true concentration of the target in the sample.

Based on the above, once it has been established that the linearity of the direct calibration method fits the data, a statistical test should be carried out for zero intercept (see the “Practical example” and “Procedures for linearity assessment” section). This can be judged by inspecting the confidence interval of the intercept a, i.e., a ± t × sa (sa is the standard error of a and t the two-tailed Student’s with n–2 degrees of freedom). If zero value is included within the confidence interval ± t × sa, then, the intercept is statistically zero. Provided that zero-intercept has been demonstrated, the standard addition approach adheres to the following procedure: (i) take several portions of the (treated) solution (six or seven at a minimum) and add a known amount of the analyte to each of them; (ii) the amounts added are (almost) evenly spaced from zero to maximum. Importantly, the total volume of all the treated solutions should be kept constant; (iii) the responses are measured, and the original concentration is estimated by extrapolation of the line to zero response.

Despite its capacity, the implementation of standard addition implies certain limitations. The extrapolation causes the technique to perform poorly in narrow (linear) calibration ranges. When carrying out such an experiment it is, therefore, recommended to add several times the original analyte concentration. Also, extrapolation degrades the precision compared with direct calibration where interpolation is exploited and hence, the uncertainty is generally increased. More details about its application are beyond the scope of this review.

Standard deviation vs standard error

From a properly constructed calibration plot, the analyst expects that a reliable calculation of the concentration of analyte in tested samples can be made. No quantitative result is of any value unless it is accompanied by a realistic estimate of uncertainty, i.e., the range within which the true value of the quantity being measured should lie. At this stage, the residual standard deviation (or error) can be used as an estimate of the uncertainty in predicted concentration values. This is due to the precision of measurements (as represented by the residual standard deviation) being an important factor in assessing the uncertainty. Additionally, the regression model can be employed for estimating the limit of detection of the analytical procedure (see the “Practical example” and “Calculation of a concentration, its error, and the limits of detection” section). Hence, the random errors in the slope (Sb) and intercept (Sa) values hold significance (see the “Practical example” and “Assessing the scatter plot and performing regression analysis” section). “Standard error of mean” (SEM) and “standard deviation” (SD) are employed in different contexts and have different interpretations and calculations; however, they are often confused and misused and for this reason, they deserve discussion [20, 21]. The SD may be a good estimate of the variability of the population from which the data (i.e., the statistical sample) was drawn. For normally distributed data, about 99% of them will have values which lie within 3 × SD of the mean value while the other 1% is scattered above and below these limits. Evidently, in case that widely scattered measurements are to be expressed, the standard deviation should be quoted.

As the sample mean varies from statistical sample to sample in a population, the way this variation occurs is expressed by the “sampling distribution of the mean.” Here, the SEM is a type of standard deviation, which expresses the precision of the sample mean and is calculated by the simple relation:

$${\text{SEM}}={\text{SD}}/\sqrt{\mathrm{sample\; size}}$$

The SEM is always smaller than the SD; as the sample size increases the standard error decreases. From a practical viewpoint, if the aim is to obtain an insight into the uncertainty around the estimate of the mean value of the measurements—this is almost invariably the case in analytical chemistry—reporting the SEM is the most useful and reliable way of calculating a confidence interval. For a large sample, a 95% confidence interval is obtained as values of 1.96 × SEM either side of the mean. To report a 95% confidence interval instead of a 99% one is only a matter of choice, and has become a convention, related to calling statistically significant a p-value lower than 0.05. the analyst/researcher should appreciate that the contrast between these two terms reflects the distinction between description statistics (i.e., SD) and inference statistics (i.e., SEM). Standard deviation is the degree to which individual values within a statistical sample differ from the sample mean. In contrast, SEM gauges how close the sample mean is likely to be to the population mean.

Finally, in many publications/analytical reports, the sign ± is used to join the SD or SEM to an observed mean—for example, 13.4 ± 2.3 or 13 ± 2. However, this notation does not give an indication of whether the second figure is the SD or SEM. Analysts/researchers are advised to indicate clearly whether standard deviation or standard error is being quoted.

Correlation vs agreement

As mentioned above, correlation allows researchers to know the association or the absence of a relationship between two variables. When these variables are correlated, we can measure the strength of their association. Tasks pertinent to the correlation in analytical chemistry often involve demonstrating a degree of association between analytical methods. These may be carried out by investigating a relationship between a new method and an official/reference/alternative interference-free method. When researchers seek to report this association, correlation or agreement is often used. However, the term “correlation” as a synonym of “agreement” can be misleading in any field of research [22]. This part aims to clarify the definition of these two terms in method development.

Correlation coefficient r does not provide any information on the agreement between two variables. Under certain conditions, the magnitude of r only provides information on how close the points lie to the regression line (see above, the “Linearity in calibration and its misinterpretation” section). The correlation analysis assumes that the distribution of one variable does not depend on the other, which is the case when comparing two different analytical methods. Also, r does not assume normality but it does assume homoscedasticity, i.e., constant variance. Inspection of the available data can reveal whether the correlation is linear or non-linear. In this context, a scatterplot of data should always be the first step before interpreting a correlation coefficient, to avoid incorrect assumptions about the relationship between the variables of interest.

When calculating r, statistical software reports a p-value which represents the chance that a significant linear correlation does not exist between the variables (this is the null hypothesis). A significantly non-zero r means that there is a dependence between x and y. Here, “significant” means how far we would expect r to be from zero and depends on both the number of measurements, n, and the distribution of each. A low p-value (≤ 0.05) provides evidence that the measured r represents a significant correlation between two variables. When a correlation exists, linear regression enables the calculation of the equation that minimizes the distance between the fitted line and all the data points in the sample. The R-squared (R2) or coefficient of determination commonly used in analytical chemistry, is another measure often encountered in linear regression analysis.

The term agreement is distinct; it is used to assess whether the measurements by two analysts or two different methods yield similar results. In the case of comparing the concentration of an analyte via two methods, we are measuring the same analyte in the same sample with different methods. Τhe two values may be correlated but do not necessarily agree. In this case, r is not sufficient. Even if the two methods are likely to be highly correlated with an r approaching unity, this does not provide concluding evidence that the methods agree. A high r value may indicate agreement but this remains false without further analysis to assess potential biased results. Again, by inspecting the relationship on a scatter plot it may be easy to see if they demonstrate poor agreement with significant bias.

Ordinary least-square regression could potentially be used to assess agreement; however, this regression assumes that one method is error-free and this is not true in most settings. When agreement between paired measurements is required, a Bland–Altman plot, also referred to as a difference plot, is a straightforward and reliable alternative to a scatter plot [23]. Practically, it is about a plot of the difference between two measurements on the y-axis (Method 1 minus Method 2) against the mean of the two measurements on the x-axis ({Method 1 + Method 2}/2). This simple plot reveals any bias between measurements, which is the difference between Method 1 and Method 2. Limits of agreement (LOA) are plotted as separate lines, demonstrating the range within which 95% of the differences between one method and the other are included. These limits are expressed as: mean observed difference (md) ± 1.96 × standard deviation of the observed differences (sdd). Broad LOA or values that consistently fall outside these bounds is an indication of a lack of agreement between the two methods (see the “Practical example” and “Comparison of analytical methods” section).

As alluded to above, correlation is not synonymous with agreement. Correlation refers to the presence of a relationship between two different variables, whereas agreement refers to the concordance between two measurements of one variable. Two sets of measurements, which are highly correlated, may have poor agreement. However, if the two sets agree, they will surely be highly correlated.

Practical example

To showcase the suitability of the suggested approach, we utilize data from a method that has been internally validated for the determination of Cd in natural water (Table 1).

Table 1 Calibration data for Cd determination in natural water

Assessing the scatter plot and performing regression analysis

The process that allows for a quick identification of any issues with the calibration data is the visual inspection of the calibration plot (Fig. 5).

Fig. 5
figure 5

Calibration plot of response (Absorbance) versus Cd concentration (μg⋅L−1)

Based on the findings, a typical approach would be to assume that the response is linearly correlated to the concentration. Regression determines the optimal values of slope (indicated as “b”) and intercept (indicated as “α”), that best describe the linear relationship between the analyte level (x) and the analytical signal (y). Typically, regression analysis is conducted using specialized software provided with instruments or popular packages like Excel (Table 2).

Table 2 Output of Excel regression

According to the Excel spreadsheet output, the instrumental response is linearly related to the Cd concentration (independent variable-x) based on OLS regression, in the form of y = b (± Sb) x + α (± Sa), as follows (with rounding to an appropriate number of significant digits):

$$y = 0.01115\; (\pm\; 0.00008)\; x - 0.0017\; (\pm\; 0.0003)$$
(1)

Assessment of outliers

The differences between the observed values (yi) and the predicted values (\(\widehat{y})\) (Table 3) are computed or can be generated from spreadsheet software programs.

Table 3 Responses and residuals from data of Table 1 after OLS regression

These differences are then plotted against their respective concentration levels as shown in Fig. 6. In this plot, two horizontal dashed lines, which indicate the acceptable range of deviation for each individual data point, are drawn at ± t (0.05, p–2) × Sy/x.

Fig. 6
figure 6

Plot of residuals vs Cd concentration. No evidence of outliers is observed. Dashed lines represent the ± t (0.05, p–2) × Sy/x.

Another straightforward numerical criterion to identify potential outliers is to check if standardized residuals (a residual divided by Sy/x is commonly known as a standardized residual) greater than 3 are found (last column of Table 3). Similarly, there is also no indication of an outlier in this hypothetical scenario.

Note: Other more sophisticated calculation techniques include the estimation of Cook’s squared distance for each data point. However, before applying these, it is important to identify (based on the above) the suspect calibration point that could potentially be omitted.

Procedures for linearity assessment

As mentioned before, R2 values can be misleading in the context of linearity evaluation. Examining the plot of residuals generated through linear regression of the responses against the concentrations is an option to assess the linearity. The residual plot (Fig. 6) exhibits a random pattern within the range, without any discernible systematic pattern, indicating that the linearity assumption is likely correct. However, there are instances when this visual inspection may be quite personal and open to interpretation.

Note: By plotting the ratio of the individual response values to their corresponding concentrations vs the concentration range could be another option (Huber’s linearity test [5]). The lower and upper limits of tolerance are established by multiplying the median value of the ratio of individual response values to their corresponding concentrations with constant factors of 0.95 and 1.05, respectively. The calibration range falls within the linear range as there are no results that exceed the tolerance limits.

To ensure the accuracy of the chosen model (Eq. 1), it is essential to employ reliable statistical tools that can confirm the assumption of linearity. This necessitates the use of robust statistical methods, such as ANalysis Of VAriance (ANOVA).

If the regression F-value (found in the Excel output, Table 2) is greater than the critical F (0.05, 1, p × np), then the regression model is considered acceptable. This indicates that the variations in the response can be explained by the model.

The ANOVA-LOF, also known as the LOF (lack of fit) test, is the most common and reliable statistical test that is applied to calibration experiments for the acceptance of the model linearity. The ratio of FLOF (MS-LackOfFit/MS-PureError) is compared to critical F (0.05, p–2, p × np). If FLOF is equal or less than critical F, it is possible to accept the hypothesis that the regression model is linear. From Table 3, we obtain the residual error sum of squares \(\sum {\left({y}_{i}-\widehat{y}\right)}^{2}\) or SSRES = 0.0000278516. The pure error sum of squares (SSPE) is equal to 0.0000244000 (see Table 4).

Table 4 Data treatment for the calculation of the terms SSPE and MSPE

Since the sum of squares of the residuals (SSRES) are divided into pure error (SSPE) and lack of fit (SSLOF), the latter term can be calculated by subtraction: SSLOF = SSRES − SSPE = 0.0000278516 − 0.0000244000 = 0.00000345163. Mean squares (MS) for lack of fit (MSLOF, p − 2 d.f.) is calculated as 0.00000345163/5 = 0.000000690325. Accordingly, the mean squares for pure error, (MSPE, p × n − p d.f.) is 0.0000244000/28 = 0.000000871429. So, FLOF = MSLOF/MSPE = 0.000000690325/0.000000871429 = 0.792 < Fcritical (0.05 5, 28) = 2.5581.

If there is no significant lack-of-fit, it is advisable to conduct a final t-test to determine if the intercept significantly deviates from zero. If the calculated t-value (= a/Sa) is equal or lower to the critical t (d.f.: p − 2, confidence limit:  99.9%, although a 95% is more common), the null hypothesis that the intercept is not significantly different from zero is accepted. From Eq. 1, we calculate t = 0.0017/0.0003 = 5.666 > 4.032 tcrit (0.01, 5). Therefore, the calibration curve should not be forced through zero and it is described properly by Eq. 1, as given above.

Note: Arithmetically, the decision to force zero or not can be based on the comparison of y-intercept (α) to its standard error (Sa = 0.000280046, Table 2). If y-intercept > Sa, then b ≠ 0, otherwise b = 0.

Calculation of a concentration, its error, and the limits of detection

Assuming that homoscedasticity is fulfilled, the regression line computed in the preceding section will be utilized for estimating the concentrations of test samples through interpolation as well as its error. Additionally, it may be employed for estimating the limit of detection of the analytical procedure. Hence, the significance lies in the random errors associated with the values of the slope (Sb) and intercept (Sa).

The unknown sample concentration can readily be determined by substituting the sample signal or response of 0.030 (for k = 5 replicates) into the regression equation (Eq. 1). This yields x value of 2.84024 μg⋅L-1. To calculate the overall error in the corresponding concentration we employ the following formula [2]:

$${s}_{x} =\frac{{s}_\frac{y}{x}}{b} \sqrt{\frac{1}{k}+\frac{1}{p}+\frac{{(y-\overline{y} }^{)2}}{{b}^{2}\sum_{i} {{(x}_{i}-\overline{x })}^{2}}}$$
(2)

where k is the number of replicate measurements we use to establish the sample’s average signal y, p is the number of calibration points, \(\overline{y }\) is the average signal for the calibration standards, xi and \(\overline{x }\) are the individual and the mean concentrations for the calibration standards and b is the calculated slope from the regression equation (Eq. 1).

Confidence limits can be calculated as μ = xo ± t × sxo, (p − 2 d.f.). In our case, xo = 2.84024 μg⋅L-1, sxo = 0.04833 μg⋅L-1, and the corresponding 95% confidence limits (t3 = 2.571) are 2.84024 ± 0.12425 μg⋅L-1. [Sy/x = 0.000918689, b = 0.01115, k = 5, p = 7, y(sample signal) = 0.030, \(\overline{y }\) = 0.032, \({(y-\overline{y})}^{2}\) = 0.0000033, \(\overline{x }\) = 3.004, Σi(\({x}_{i}-\overline{x }\))2 = 28.05132143].

As we have seen, the limit of detection (LOD) can be described as the concentration of the analyte that produces a signal equivalent to the blank signal, yB, augmented by three times the standard deviations of the blank, sB: LOD = yB + 3×sB. The calculated intercept (α) can serve as an estimation for the value of yB. The unweighted least-squares method relies on the assumption that every point on the plot—including the point representing the blank or background—exhibits a variation that follows a normal distribution. This variation occurs only in the y-direction and its standard deviation is estimated using Sy/x. Thus, it is suitable to substitute Sy/x for sB, when estimating the limit of detection. From previous calculations Sy/x = 0.000918689 and yBα = 0.0017. Thus, LOD = 0.0017 + (3 × 0.000918689) = 0.00446 μg⋅L−1.

Note: It is feasible to conduct multiple repetitions of the blank experiment to acquire independent values for sB. These two approaches for estimating LOD should exhibit negligible differences.

For the standard additions method, the concentration of the analyte in the test sample can be calculated from the ratio of the intercept (α) and the slope (b) of the regression line (since this is extrapolated to the point on the x-axis at which y = 0). As the concentration of the sample cannot be predicted solely based on a single measured value (multiple standard additions are required) the error of the extrapolated xE value is provided by:

$${s}_{{x}_{\rm E}} =\frac{{S}_{y/x}}{b} \sqrt{\frac{1}{n}+\frac{{\overline{y} }^{2}}{{b}^{2}\sum_{i} {{(x}_{i}-\overline{x })}^{2}}}$$
(3)

The respective confidence limits can be calculated as xE ± t (p − 2) × sxE (see worked example below).

The copper concentration in wastewater was determined by the method of standard additions. The following results were obtained with the aid of flame atomic absorption spectrometry: added Cu (moles L−1): 0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, response (absorbance): 0.4115, 0.647, 0.8134, 1.021, 1.212 1.356, 1.667. Regression analysis provides the values of intercept a = 0.44, slope b = 0.0189. The ratio of these values (α/b) provides the concentration of 0.44/0.0189 = 23 mol⋅L−1. Sy/x = 0.024575484, \(\overline{y }\) = 1.0183, \(\sum_{i}{{(x}_{i}-\overline{x })}^{2}\) = 2800. Therefore, the calculated error of the extrapolated xE value is: \({s}_{{x}_{\rm E}}\) = 1.408 and the confidence limits are 23 ± 2.57 × 1.408, i.e., 23.1 ± 3.6 (mol⋅L−1).

Comparison of analytical methods

Typically, when comparing two methods (e.g., reference and new method, Table 5) across various levels of analyte concentrations, it is customary to follow the procedure depicted in Fig. 7.

Table 5 Data obtained by two analytical methods
Fig. 7
figure 7

Comparison of two analytical methods with regression analysis

If each sample produces the same outcome with both analytical protocols, the regression line depicted will have an intercept of zero, a slope of 1, and a correlation coefficient of 1. Typically, we aim to examine whether the intercept significantly deviates from zero and whether the slope significantly deviates from 1. Confidence limits for the values of constant and intercept are calculated, usually at a significance level of 95%.

Based on the calculation of the regression parameters (as in the previous worked example) the intercept is determined to be 0.103923647, with upper and lower confidence limits of − 0.066613699 and + 0.274460993, respectively, encompassing the desired value of zero. Similarly, the slope is 0.955604606244823, with a 95% confidence interval that falls within 0.880924897601987 − 1.03028431488766 including the value of 1.

Note: The focus is more on establishing the range within which the slope and intercept values are likely to fall, rather than solely relying on the correlation coefficient. The method that offers higher precision is represented on the x-axis of the graph while we assume that the error in the y-values remains constant (homoscedasticity).

Bland–Altman plots serve as a valuable graphical tool, especially in clinical analysis to compare and evaluate the agreement between two sets of data obtained from different measurement techniques. Illustrated in Fig. 8, these plots visually depict the difference between two measurements on the y-axis and the average of those measurements on the x-axis. Additionally, a horizontal line is incorporated to represent the mean difference between the two measurements. Also, these plots, as already mentioned feature lines—the limits of agreement (LOA)—indicating the standard deviation, typically ± 1.96 × standard deviation of the differences (sdd) from the mean difference (md). This enables the identification of any potential outliers within the data [24].

Fig. 8
figure 8

Bland and Altman plot for data from Table 5, with the representation of the limits of agreement (dotted lines), from − 1.96 × sdd to + 1.96 × sdd. md: mean difference, sdd: standard deviation of the differences

From the data of Table 5, sdd = 0.101, so the 95% of differences will be: a) for − 1.96 × sdd: − 0.008 − (1.96 × 0.101) =  − 0.206, and b) for + 1.96 × sdd: − 0.008 + (1.96 × 0.101) = 0.190.

Conclusions

Researchers need to realize the limits and capabilities of conventional statistics and they should bring into their chemical analysis elements of scientific judgement about the plausibility of statistics. This review focuses, mainly, on the regression and correlation to find connections between two variables, measure the connections, and to make predictions of analyte concentrations in a proper way. By way of a practical example, the Excel software package can easily generate a large number of statistics in a form which is digestible and easily applicable. The tutorial review benefits researchers and authors embarking on studies handling analytical measurements.