Introduction

Human thyroglobulin (Tg) is a high molecular weight (660 kDa) soluble glycoprotein, typically stored within the follicular colloid of the thyroid, acting as the substrate for thyroid hormones (triiodothyronine, T3 and thyroxine, T4). As Tg is produced and utilized entirely by benign or differentiated malignant thyroid cells, it is considered a good tumor marker for patients with differentiated thyroid carcinoma (DTC) [1, 2] after removal of benign and malignant thyroid tissue by surgery and I131 ablation. Over the years, advances in assay technologies have led to important improvements in the analytical performances of Tg immunometric assays (IMAs); above all, the functional sensitivity (FS) of Tg IMAs has greatly improved: from 0.5 to 1.0 μg/L of the first generation IMAs to 0.05–0.10 μg/L of the second generation (2G) IMAs [3].

Nevertheless, the major limitation of 2G IMA testing is interference by serum Tg autoantibodies (TgAb), which as a rule causes underestimation of Tg results and may mask disease recurrence [4,5,6]. It has been hypothesized that the complex between free Tg and endogenous TgAb prevents free Tg from binding to the capture and/or signal monoclonal antibody reagents or, alternatively, that endogenous TgAb binding to free Tg masks the epitopes recognized by the monoclonal antibody reagents [5, 7].

Serum TgAb are reported to be present in about 25–30% of DTC patients, depending on the assay used and the cut-off employed to classify samples as positive or negative [1, 7]. They are more frequent in females [8] and are also present in about 60% of patients with autoimmune thyroid disease (AITD) [9]. On the basis of these considerations, the role of TgAb measurement has evolved from a marker of thyroid autoimmunity [10, 11] to a test performed alongside Tg to investigate TgAb interference [12]. Consequently, serum TgAb have come to serve as a surrogate tumor marker, replacing Tg determination by IMAs in cases of analytical interference from TgAb [13, 14].

Of note, the measurement of TgAb can be problematic. Analytical limitations of serum TgAb assays have been reported in the context of thyroid autoimmunity diagnosis [9]. Despite standardization against the International Reference Preparation (IRP) MRC 65/93, several studies have demonstrated high variability in the analytical performance of different TgAb IMAs: large variation in limits of detection (LoD), FS, inter-method results and reference intervals, with poor concordance between TgAb assays in patients with DTC [15,16,17,18,19,20,21,22]. The difficulty in standardization is in part due to heterogeneous Tg immunoreactivity: differential splicing of Tg mRNA, various post-translational modifications, and alterations of biosynthesis regulation in thyroid tumor cells lead to exposure or masking of epitopes, with resulting differences in Tg immunologic structure [23]. Besides Tg heterogeneity, assay discordance has also been attributed to the varying specificity of circulating TgAb in patient sera [6]. As a result, different TgAb values are obtained when the same serum is tested with different methods [15,16,17,18,19,20,21,22]. Finally, differences in assay reagents, above all the preparation of the antigen (Tg), contribute to assay variability [1,2,3,4,5,6,7,8,9,10,11,12].

The manufacturers’ upper reference limit (URL) for TgAb, set up to identify patients with AITD but misleading for the evaluation of TgAb interference in Tg assays, is another aspect to consider. Reference intervals are the most widely used tool for the interpretation of clinical laboratory results. The Clinical and Laboratory Standards Institute (CLSI) Expert Panel on Reference Values has provided guidelines for the determination of reliable reference intervals (EP28-A3c) [24]. These guidelines recommend the direct method, which implies the enrolment of a healthy population of at least 120 individuals and the determination of the 2.5th and 97.5th percentiles as the lower reference limit and the URL, respectively. As regards thyroid antibodies for AITD diagnosis (thyroid peroxidase antibodies, TPOAb, and TgAb), the 2003 proposal of the National Academy of Clinical Biochemistry (NACB) recommends the use of a direct method and a reference group composed of 120 men younger than 30 years, biochemically euthyroid [i.e., with serum thyrotropin (TSH) concentrations between 0.5 and 2.0 mIU/L], and without risk factors (goiter, family history of AITD, or other autoimmune diseases) [25].

However, the definition of the TgAb URL remains a matter of debate, because of the problems in enrolling the appropriate reference group [25] and in the determination of TgAb cut-off suitable for the identification of assay interference and consequently for the use of TgAb as surrogate marker in the follow-up of DTC [12].

Taking into account the above considerations, the main aim of the present study was the determination of TgAb URL, according to the NACB guidelines, by the use of eleven commercial automated IMA platforms. A further aim of the study was to compare the analytical performances of the methods used, in an attempt to evaluate, whenever possible, their effectiveness in detecting TgAb interference.

Materials and methods

One hundred and twenty male subjects were selected from a population survey in the province of Verona, Italy, according to the NACB criteria [25]. All of them gave informed consent for their participation in the study. Their sera were tested for TgAb concentration using eleven IMA methods run on as many automated analyzers: AIA-2000 (AIA) and AIA-CL2400 (CL2), Tosoh Bioscience; Architect (ARC), Abbott Diagnostics; Advia Centaur XP (CEN) and Immulite 2000 XPi (IMM), Siemens Healthineers; Cobas 6000 (COB), Roche Diagnostics; Kryptor (KRY), Thermo Fisher Scientific BRAHMS; Liaison XL (LIA), Diasorin; Lumipulse G (LUM), Fujirebio; Maglumi 2000 Plus (MAG), Snibe; and Phadia 250 (PHA), Phadia AB, Thermo Fisher Scientific. All assays were performed according to the manufacturers’ instructions at six different laboratories in the Friuli-Venezia Giulia and Veneto regions of Italy [Lab 1 (AIA), Lab 2 (CL2), Lab 3 (ARC, COB and LUM), Lab 4 (CEN, IMM, KRY and MAG), Lab 5 (LIA) and Lab 6 (PHA)]. The main features of the eleven methods are summarized in Table 1. All methods are standardized against the reference preparation (IRP MRC 65/93) and use International Units (IU), except for CEN and KRY, whose results were initially expressed in Arbitrary Units and then converted to IU (Table 1). The normality of the distribution was assessed using the Shapiro–Wilk test. Since TgAb values were not normally distributed, the experimental URL (e-URL) was established at the 97.5th percentile according to the non-parametric percentile method (CLSI guideline EP28-A3c) [24]. Moreover, the non-parametric Kruskal–Wallis test and Dunn’s multiple comparison test were used to compare the median values of the eleven groups.
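As an illustration of the non-parametric percentile step, the following minimal Python sketch computes an upper reference limit from ranked values. It assumes one common rank-based convention, rank = 0.975 × (n + 1) with linear interpolation between neighbouring order statistics; the function name and the sample values are hypothetical and are not taken from the study data.

```python
def nonparametric_limit(values, fraction):
    """Rank-based percentile: rank = fraction * (n + 1), with linear
    interpolation between the two neighbouring order statistics."""
    ordered = sorted(values)
    n = len(ordered)
    rank = fraction * (n + 1)      # 1-based rank, possibly fractional
    lower = int(rank)              # integer part of the rank
    weight = rank - lower          # fractional part used for interpolation
    if lower < 1:                  # rank falls below the first observation
        return ordered[0]
    if lower >= n:                 # rank falls at or beyond the last observation
        return ordered[-1]
    return ordered[lower - 1] + weight * (ordered[lower] - ordered[lower - 1])


# Hypothetical right-skewed TgAb-like values (IU/mL) for 120 subjects.
values = [0.5 + 0.1 * i for i in range(118)] + [30.0, 45.0]
e_url = nonparametric_limit(values, 0.975)
print(f"e-URL (97.5th percentile): {e_url:.2f} IU/mL")
```

With 120 subjects the rank is 117.975, so the limit lies between the 117th and 118th ordered values. The two highest results therefore do not enter the estimate directly.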

Table 1 Analytical performance characteristics of the current TgAb automated immunoassays

Inter-method variability was assessed using the interquartile range (25th and 75th percentiles). To compare the eleven methods, ARC was regarded as the reference assay since it showed a satisfactory combination of LoD and assay imprecision (Table 1). Correlation between assays was assessed by the Spearman rank correlation coefficient (rs); Passing-Bablok regression was applied to verify the linear association between methods, while agreement between assays was analyzed by Bland–Altman plots of the difference between ARC and each of the other ten methods (AIA, CEN, CL2, COB, IMM, KRY, LIA, LUM, MAG and PHA). The difference between the manufacturer’s URL (m-URL) and the e-URL was expressed as a percentage of the m-URL (Delta% = |m-URL − e-URL|/m-URL × 100). A two-sided p value < 0.05 was considered statistically significant. Statistical analyses were performed with GraphPad Prism Software, version 4.0 (San Diego, CA, USA) and MedCalc software, version 11.6 (Ostend, Belgium).
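The Delta% formula and the Bland–Altman summary can be sketched in a few lines of Python. This is an illustrative sketch only: the study used GraphPad Prism and MedCalc, the function names are hypothetical, and the choice to express each difference relative to the mean of the pair (comparison minus reference) is an assumption, since the text does not state how percentage differences were computed.

```python
from statistics import mean, stdev


def delta_percent(m_url, e_url):
    """Delta% = |m-URL - e-URL| / m-URL * 100, as defined in the text."""
    return abs(m_url - e_url) / m_url * 100


def bland_altman_percent(reference, comparison):
    """Mean percentage bias and 95% limits of agreement (bias +/- 1.96 SD).

    Assumption: each difference is (comparison - reference) divided by the
    mean of the two results, expressed as a percentage."""
    diffs = [100 * (c - r) / ((c + r) / 2) for r, c in zip(reference, comparison)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


# Hypothetical paired results (IU/mL) from a reference and a comparison assay.
reference = [1.2, 0.8, 2.5, 1.9, 3.1]
comparison = [0.9, 0.7, 1.6, 1.5, 2.0]
bias, lower_loa, upper_loa = bland_altman_percent(reference, comparison)
print(f"mean bias {bias:.1f}%, limits of agreement {lower_loa:.1f}% to {upper_loa:.1f}%")
print(f"Delta% for m-URL 100 and e-URL 40: {delta_percent(100.0, 40.0):.1f}")
```

A negative mean bias under this sign convention means that the comparison assay reads lower than the reference on average.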

Results

TgAb results showed marked inter-method variability with wide interquartile ranges: the 25th percentile differed 48-fold between methods (minimum: 0.24 IU/mL, maximum: 11.5 IU/mL) and the 75th percentile 30-fold (minimum: 0.59 IU/mL, maximum: 17.97 IU/mL) (Fig. 1) (Table 2).
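The fold differences follow directly from the extreme quartile values quoted in this paragraph. The short Python check below uses only those four numbers; the variable names are illustrative.

```python
# Lowest and highest 25th and 75th percentiles across methods (IU/mL),
# as quoted in the text.
q25_min, q25_max = 0.24, 11.5
q75_min, q75_max = 0.59, 17.97

fold_q25 = q25_max / q25_min  # ratio between the extreme 25th percentiles
fold_q75 = q75_max / q75_min  # ratio between the extreme 75th percentiles

print(f"25th percentile: {fold_q25:.1f}-fold, 75th percentile: {fold_q75:.1f}-fold")
```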

Fig. 1

Distribution of TgAb values for each method. AIA AIA-2000, Tosoh Bioscience, ARC Architect, Abbott Diagnostics, CEN Advia Centaur XP, Siemens Healthineers, CI confidence intervals, CL2 AIA CL-2400, Tosoh Bioscience, COB Cobas 6000, Roche Diagnostics, IMM Immulite 2000 XPi, Siemens Healthineers, KRY Kryptor, Thermo Fisher Scientific BRAHMS, LIA Liaison XL, Diasorin, LUM Lumipulse G, Fujirebio, MAG Maglumi 2000 Plus, Snibe, No. number, PHA Phadia 250, Phadia AB, Thermo Fisher Scientific, RSD relative standard deviation. SD standard deviation

Table 2 Summary statistics of TgAb measurements for each method

A statistically significant difference between medians was observed for all methods except for 11 pairs of the 45 combinations analyzed (Fig. 1) (Table 3).
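The Kruskal–Wallis comparison rests on a rank statistic, H. A minimal Python sketch follows, assuming the standard textbook formula with a tie correction; the function name and the toy groups are illustrative and are not study data, and Dunn's post hoc test is not reproduced here.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the usual correction for ties.

    Note: if every pooled value is identical, the correction term is zero
    and this sketch raises ZeroDivisionError."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    rank_of = {}   # value -> mean 1-based rank (tied values share one rank)
    tie_term = 0   # sum of (t^3 - t) over groups of t tied values
    i = 0
    while i < n_total:
        j = i
        while j + 1 < n_total and pooled[j + 1] == pooled[i]:
            j += 1
        t = j - i + 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    return h / correction


# Toy example with three clearly separated groups.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"H = {h_stat:.2f}")
```

For k groups, H is compared with a chi-square distribution with k − 1 degrees of freedom, which would be 10 degrees of freedom for eleven methods.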

Table 3 Kruskal–Wallis test and Dunn’s multiple comparison test of TgAb methods: comparison of all pairs of columns

e-URLs differed from one method to the other. Of note, within the same method, e-URL was much lower than m-URL, except for ARC and MAG, which showed similar values for both (Table 4).

Table 4 Experimental upper reference limit compared to the manufacturer’s upper reference limit for most of the current TgAb automated immunoassays, established from a cohort of 120 euthyroid control subjects

As regards the correlations between methods, rs ranged from 0.17 (ARC vs CEN) to 0.56 (ARC vs CL2) (Table 5). By Passing-Bablok analysis, the TgAb methods showed varying degrees of agreement with the reference method (ARC). Slopes were all far from 1 except for ARC vs AIA (slope = 1.15) and ARC vs CL2 (0.34) (Fig. 2) (Table 5); intercepts varied from −29.92 to 3.7 and were far from 0 except for ARC vs AIA (−0.75) and ARC vs CL2 (−0.15) (Fig. 2) (Table 5). Relevant positive or negative mean biases were also observed by Bland–Altman analysis, ranging from −115.8% (CL2 vs ARC) to 156.4% (MAG vs ARC). The best agreement was between AIA and ARC, with a mean bias of −37% (Fig. 3) (Table 6).
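For readers unfamiliar with the two statistics reported here, the following Python sketch shows how a Spearman rank correlation and a Passing-Bablok slope and intercept can be computed. It is a simplified illustration (no confidence intervals or linearity test) with hypothetical function names and toy data; it is not the MedCalc implementation used in the study.

```python
import math
from itertools import combinations
from statistics import mean, median


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def passing_bablok(x, y):
    """Slope as the shifted median of pairwise slopes; intercept as the
    median of y - slope * x."""
    slopes = []
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue                        # identical points define no slope
        slope = math.copysign(math.inf, dy) if dx == 0 else dy / dx
        if slope != -1:                     # slopes of exactly -1 are discarded
            slopes.append(slope)
    slopes.sort()
    n = len(slopes)
    k = sum(1 for s in slopes if s < -1)    # offset that shifts the median
    if n % 2:
        b = slopes[(n - 1) // 2 + k]
    else:
        b = (slopes[n // 2 - 1 + k] + slopes[n // 2 + k]) / 2
    a = median(yi - b * xi for xi, yi in zip(x, y))
    return b, a


# Toy data lying exactly on y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
slope, intercept = passing_bablok(x, y)
print(f"slope {slope:.2f}, intercept {intercept:.2f}, rs {spearman(x, y):.2f}")
```

Two interchangeable assays would give a slope close to 1 and an intercept close to 0, which is the yardstick applied to the values in Table 5.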

Table 5 Summary of method comparison by Passing-Bablok regression and Spearman’s rank correlation for the TgAb methods
Fig. 2

Passing-Bablok regression of TgAb methods. ARC was chosen as the reference method on the x axis. ARC vs AIA and ARC vs CL2 showed the best relationship in terms of slope and intercept. AIA AIA-2000, Tosoh Bioscience, ARC Architect, Abbott Diagnostics, CL2 AIA CL-2400, Tosoh Bioscience

Fig. 3

Bland-Altman plots showing the difference between ARC and AIA and between ARC and CL2. ARC was chosen as the reference method. An ideal mean difference of 0 is indicated by a dotted line, the mean difference by a solid line and the limits of agreement for the mean difference, as defined by 95% confidence limits, by dashed lines. AIA AIA-2000, Tosoh Bioscience, ARC Architect, Abbott Diagnostics, CL2 AIA CL-2400, Tosoh Bioscience

Table 6 Summary of method agreement (Bland–Altman plot) for the TgAb methods

Discussion

The determination of the cut-off for the definition of TgAb positivity is an important and controversial issue.

In this study, we determined the TgAb URL in a reference group of male individuals, meticulously defined as being free of thyroid disease, using eleven IMA methods currently used in autoimmunology laboratories, and compared the methods with each other. To our knowledge, no similar data are present in the literature: previous studies addressed the same topic but with small numbers of analytical methods, most of which are no longer in use [9, 15,16,17,18,19,20,21,22, 29].

The first relevant result of the present study was the demonstration of differences between the TgAb URLs claimed in the package inserts (m-URL) and those obtained in the male reference sample (e-URL): with the exception of the ARC and MAG methods, e-URLs were lower than those proposed by the manufacturers, the difference ranging from 2.33 to 88.85%. These results were similar to those described in two previous studies dealing with the definition of TPOAb reference limits, determined by several current IMA platforms [30, 31]. In our opinion, these discrepancies could be related to the lack of strict criteria in the selection of subjects for the reference group. Specifically, racial differences could play some role, as most of the studies sponsored by manufacturers were performed in the geographical area of the production line and are consequently difficult to reproduce in other settings. Moreover, the use of non-stringent criteria in the choice of subjects could have led to the enrolment of individuals with subclinical AITD, resulting in relatively high TgAb levels that raise the 97.5th percentile of the reference value distribution [32,33,34,35,36,37].

The second relevant consideration that emerged from the present study was the variation of e-URLs according to the method used. The e-URL ranged from 2.25 (CL2) to 41.15 IU/mL (COB), an approximately 18-fold variation, consistent with a previous paper which reported the same magnitude of variation using five IMA methods distinct from those considered in the present study [18]. The difference between e-URLs supports concerns regarding inter-method variation [38]. Specifically, there were relevant differences between methods in terms of medians (31-fold) (p < 0.05, Kruskal–Wallis test) and interquartile ranges. These discrepancies were not expected and are not easily explained; in fact, in recent decades there have been significant improvements in harmonization between methods [39], resulting from the high level of automation of analytical procedures and the use of the same reference preparation (IRP MRC 65/93). Moreover, analytical imprecision does not seem to contribute to the above differences, as the values declared by the individual manufacturers were essentially overlapping (although obtained with different protocols, some standardized and others not) and in general lower than 10% for both intra- and inter-assay imprecision (Table 1). Such discordance between TgAb assays could be attributed to various factors, including: (1) TgAb heterogeneity, which is often independent of standardization efforts and implies different specificity for the Tg antigen; (2) Tg interference; and (3) differences in assay reagents, including solid phase material and the preparation of the antigen (Tg), which could affect the proper exposure of the immunodominant epitopes. Another important aspect to consider in explaining inter-method variability is the diverse assay design of the eleven IMA methods, leading to different LoDs (Table 1) ranging from 0.005 to 12 IU/mL. In particular, a clear-cut discrepancy was apparent between methods with a LoD lower than 0.2 IU/mL (ARC, AIA and CL2) and methods with a LoD equal to or higher than 2 IU/mL.

To better evaluate the relationship between methods, ARC was chosen as the reference method on the basis of the best combination of LoD and imprecision (Table 1): the correlation of ARC with the other methods was not satisfactory, in line with the variability of the results described above. Passing-Bablok regression did not show satisfactory agreement between assays. Furthermore, consistent with the regression results, Bland–Altman plots highlighted statistically significant positive or negative mean biases.

The lack of acceptable agreement between methods has relevant practical implications: clinicians have to use the same method to monitor TgAb concentration in the follow-up of DTC, while laboratories must promptly inform users of any change in TgAb method to facilitate re-baselining.

Although the analysis of the data showed satisfactory analytical performance for some methods in terms of LoD, which were able to measure even low levels of TgAb with adequate precision, the main limitation of this study is that it contributed only indirectly to the debated question of TgAb analytical interference. In fact, the results did not prove but only suggested the value of choosing the more sensitive and accurate latest generation methods for measuring TgAb, to better detect false negative results even in patients with TgAb levels lower than the cut-off (the so-called “negative patient”). Therefore, according to these considerations, two different cut-offs for TgAb could be proposed, one for the diagnosis of AITD and one for the effects of TgAb on Tg measurement.

Conclusions

Despite attempts at harmonization, quantitative agreement between methods was generally not satisfactory and the methods could not be used interchangeably. Therefore, additional standardization efforts are required to improve analytical performance, and the biomedical industry is strongly invited to re-evaluate its assays taking into consideration CLSI-approved protocols and guidelines.

Finally, as long as the relationship between TgAb concentration and interference in Tg measurement is not clearly defined, TgAb URL must be used with caution, taking into account that it is usually set for the diagnosis of AITD and not for the identification of potential interference in Tg assay.