Low reproducibility of equivocal categories of the Bethesda System for Reporting Thyroid Cytology makes the associated risk of malignancy specific to the diagnostic center

Purpose Equivocal categories (III, IV, V) of the Bethesda System for Reporting Thyroid Cytology (BSRTC) are characterized by high variability of the estimated risk of malignancy. The aim of the study was to analyze the reproducibility of classification of nodules into an equivocal category and the frequency of malignancy (FoM) observed in such categories. Methods Five experienced cytopathologists from three centers (A, B, C) independently performed reclassification of smears obtained from 213 thyroid nodules with equivocal routine cytology and known results of the postoperative histopathological examination. Results The interobserver agreement among all cytopathologists was poor, with a Krippendorff’s alpha coefficient equaling 0.34. The intra-center agreement was higher than the inter-center (fair vs poor). Pathologists of the center A classified smears into categories II and III significantly less often and categories IV and V more often than pathologists of centers B and C. The joint FoM of nodules classified into any of categories IV–VI (regarded as an indication for surgery) was different among centers (A: 40.0%, B: 66.7%, C: 80.6%). The FoM of category III nodules with features of nuclear atypia (AUS) in center B and C was two times higher than that of other nodules of category III (FLUS), while in center A the FoM was similar. Conclusions The use of published data on the risk of malignancy in nodules of particular BSRTC categories without concern for the uniqueness of the diagnostic center may lead to erroneous conclusions.


Introduction
The diagnostics of thyroid nodules is challenging. The basic diagnostic method in use is the fine-needle aspiration biopsy (FNA), but unfortunately up to 30% of FNA outcomes is classified into one of three equivocal categories (III, IV, or V) according to the Bethesda System for Reporting Thyroid Cytology (BSRTC) [1,2]. An additional difficulty is that the risk of malignancy (RoM) of nodules classified into the same equivocal category is highly variable [3,4].
The category V-the suspicion of malignancy (SM)-is commonly regarded as an indication for surgical treatment because of the high RoM: amounting to 45-75%, according to the authors of the BSRTC (45-60% if the diagnosis of non-invasive follicular thyroid neoplasm with papillary-like nuclear features -NIFTP is regarded as a benign lesion and 50-75% if it is regarded as a malignancy), and even reaching 90%, according to some reports [1][2][3][4].
The category IV-the suspicion of follicular neoplasm (SFN)/suspicion of Hürthle cell tumor (SHT)-is related to the lower RoM (10-40 or 25-40% depending on the interpretation of NIFTP as a benign or malignant lesion, respectively), according to the authors of the BSRTC, but it is also regarded as an indication for thyroid surgery [1][2][3][4]. In such cases, diagnostic uncertainties about nodule's malignancy are usually dispelled only by the postoperative histopathological examination. A certain preoperative diagnosis is often not possible even if molecular tests were performed [4,5]. In the original BSRTC, cases that demonstrated the nuclear features of papillary carcinoma (PTC) were excluded from the category IV [1]. Currently, follicular-patterned cases with mild nuclear changes can also be classified into the category IV as long as true papillae and intranuclear pseudoinclusions are absent [2].
The category III: follicular lesion of undetermined significance (FLUS)/atypia of undetermined significance (AUS) is related to the most diverse RoM [2,3,[6][7][8][9][10][11][12]. It embraces specimens in which the cytomorphological findings are not representative of a benign lesion, yet the degree of cellular, nuclear and/or architectural atypia is not sufficient to render a diagnosis of SFN/SHCT or SM. Initially, the category III was meant to include no more than 7% of FNA results, and its RoM was to remain under 15% [1]. Consequently, the usual management was to consist of the repeated FNA with the consideration of molecular testing if possible. Over time it was found that the frequency of category III was significantly higher in some centers and its RoM differed according to the nature of the atypia and it could even reach above 70%, particularly in the case of nodules with nuclear atypia, commonly referred to as AUS [2,3,[6][7][8][9][10][11].
The above-mentioned large variability of the RoM related to particular equivocal categories of the BSRTC may be a result of factors independent of the diagnostic center, e.g., iodine supply (in iodine-deficient areas the RoM of nodules is lower because of the high incidence of non-neoplastic nodules). But that variability may also be caused by centerspecific factors, such as different interpretation of the rules for the classification of nodules into BSRTC categories. Thus, the aim of our study was the assessment of the reproducibility of the classification of nodules into the equivocal category of the BSRTC and the analysis of the frequency of malignancy (FoM) observed in particular categories when several experienced pathologists independently reevaluated the same set of smears.

Materials and methods
The analysis comprised smears obtained from 213 thyroid nodules with equivocal outcome of routine FNA (categories III, IV, or V of the BSRTC: 127, 53, and 33 cases, respectively) and the known result of the postoperative histopathological examination. Biopsies were performed in 213 patients (mean age: 54.2 ± 15.6), including 189 (88.7%) women and 24 (11.3%) men, in years 2010-2018 in a single diagnostic center (denoted throughout the text with the letter B). All patients gave their informed consent.
FNAs were performed following regular procedures on thyroid nodules with a diameter of at least 5 mm (and usually over 1 cm) and at least one malignancy risk factor (US or clinical). In all cases, two aspirations of a nodule were done. Smears were fixed with 95% ethanol solution and stained with hematoxylin and eosin (H&E). The results of FNA were formulated according to the Bethesda classification in the version prior to the modification in 2017. Patients with a cytological outcome of SFN/SHT or SM were routinely referred for surgical treatment. In the case of the diagnosis of FLUS/AUS, the surgical treatment was performed based on the patient's preference or due to the large size of the goiter or the presence of other clinical risk features. The histopathologic examination was performed according to the standard procedure and its results were formulated according to the WHO classification of thyroid tumors that was in effect at the time. The reclassification of the histopathological examination in order to reveal cases of NIFTP was not performed.

Analyses, statistical evaluation
The outcomes of thyroid cytological examinations performed by five pathologists were compared. Pathologists belonged to three different diagnostic centers (labeled as A, B, or C), the center A: pathologists A1 and A2, center B: pathologists B1 and B2, center C: pathologist C1. All participating cytopathologists had at least 15 years' experience in thyroid cytopathology and assessed about 1500-3500 FNAs a year (only one pathologist in the center C had a comparable experience in thyroid cytology-25 years' one with about 3500 FNAs evaluated yearly). The H&E is a routine staining of thyroid biopsy specimens in all participating centers. All the centers serve as a tertiary thyroid referral center. The center A is a part of the Academic Center for Oncology. The center B is an academic pathology center providing pathomorphological services to several academic hospitals with various profiles and to two large endocrine outpatient clinics. The center C is a large, non-academic center providing cytologic and histopathologic diagnostic services for other medical centers. Cytopathologists were informed that all evaluated smears had been classified into the categories III-V of the BSRTC during the routine diagnostics but they did not know the exact category nor the result of postoperative histopathological examination. The pathologists were instructed to apply the BSRTC classification in the modification of 2017. They were asked to identify follicular-patterned cases with mild nuclear changes among nodules classified into the category IV, as there was a possibility of the follicular variant of PTC or NIFTP in such nodules. Those cases were denoted by FP-NC. Mild nuclear changes were defined as the increased nuclear size, nuclear contour irregularity, and/or chromatin clearing. The pathologists were also asked to identify smears with nuclear atypia among the biopsies classified into the category III. In such cases, PTC cannot be excluded and those smears were denoted by AUS. They were defined as smears with the presence of any abnormal nuclear features such as nuclear enlargement, grooves, prominent nucleoli, abnormal chromatin pattern, alteration of nuclear contour and shape, and/or presence of intranuclear cytoplasmic inclusions in an otherwise predominantly benign-appearing sample. In all three centers pathologists used only one term, "FLUS" to denote category III in routine diagnostics until 2018. Recently, following the national guidelines and the practice common in numerous publications, they began to describe the cases of category III presenting nuclear atypia with the term "AUS". The cytopathologists were given all relevant clinical information such as the patient's age, thyroid function status, diagnosis of autoimmune thyroid disease or any malignant tumor and treatment applied.
The distribution of FNA outcomes between particular categories was assessed for each pathologist. The distribution was analyzed for the whole group of nodules and separately for benign and malignant nodules, including the most frequent carcinomas (PTC and FTC). The number and percentage of cases characterized by the complete agreement of diagnoses (in terms of the BSRTC category) between the five pathologists were determined and the coefficient of concordant diagnoses was calculated. Similar analyzes were performed for all possible pairs of participating pathologists with a particular concern for the agreement of diagnoses within and between the diagnostic centers. Then the agreement in diagnosing the following equivocal cytological categories: III (and separately AUS), IV (and separately FP-NC), and V was assessed within the diagnostic centers. For this purpose, the number of concordant diagnoses of each category was divided by the number of all diagnoses of that category at the center. Finally, the FoM was compared for all nodules classified into each equivocal BSTRC category by each pathologist. We did not use the term 'RoM' because the nodules were not randomly included into the study. Those comparisons also included AUS and FP-NC subcategories. For the sake of statistical analysis, the FT-UMP and NIFTP diagnoses were regarded as a "malignancy". In our opinion, as well as the opinion of authors of the Bethesda classification, such an approach is more clinically relevant because tumors of both these types should be treated surgically [2].
The study design was approved by the Local Bioethics Committee at the National Research Institute of Oncologyas a part of the studies carried out for the realization of the grant MILESTONE. The approval code is "13/2015/1/ 2016".
The statistical analysis was performed with Dell Statistica (data analysis software system), version 13, Dell Inc. (2016), Round Rock, TX, USA. The comparison of frequency distributions was performed with chi2 test (with modifications appropriate for the number of analyzed cases). The reports of the cytopathologists were evaluated for the interobserver variability by calculating the percentage of agreement. Statistically, the degree of reproducibility in the formulating of equivocal BSRTC categories was estimated with Krippendorff's alpha coefficient jointly among all pathologists and for all possible pairs of them [13]. The value of alpha coefficient was calculated in two variants: with the Bethesda classification regarded as a nominal scale or as an ordinal scale. Such an approach was adopted because of reasonable doubts about the ordinal nature of that classification, with a number of reports showing the RoM associated with the category III as higher than the category IV, and the category I higher than the category II [3,4]. Krippendorff's alpha coefficient was interpreted using the following criteria: 0-0.4 poor agreement; 0.41-0.75 fair agreement; 0.76-1.0 almost perfect agreement. The value of 0.05 was assumed as the level of significance. Table 1 shows the distribution of FNA outcomes formulated by each pathologist among BSRTC categories. In the whole examined group of nodules, pathologists of the center A classified smears into categories II and III significantly less often and categories IV and V more often than pathologists of other centers. In the case of category III that difference was twofold, and in the case of categories II, IV, and V even severalfold. A similar pattern was observed when malignant neoplasms and benign nodules were analyzed separately. Pathologist C1 formulated the diagnosis of category V conspicuously rarely, assigning only 6% of cancers to that category, while pathologist A1 assigned 53.7% of cancers.  Fig. 1). Supplementary Tables S1-S3 (Online Resource 1) show detailed information on the distribution of diagnoses by all pathologists for nodules assigned to one of equivocal BSRTC categories by at least one of them.

Analysis of the agreement between diagnoses
Full agreement of cytological diagnoses among all five pathologists was found in only 31 cases (14.6%): single cases of category I, II, and VI, two cases of category IV and V each, and 24 cases of category III. There were five malignant neoplasms among those 31 cases (7.5% of all cancers, four PTC and one FTC) and 26 (17.8%) benign lesions. The concordance of diagnoses was lower for malignant neoplasms than benign nodules (see Table 2). In the whole examined group of nodules the interobserver agreement among all the cytopathologists was poor, with a The agreement between particular pathologists of the centers A and B ranged from 32.9 to 48.4%. Pathologist C1 showed agreement of diagnoses with pathologists of the center B (58.2-73.7%) that was twofold higher than with the center A (29.6-32.4%) (see Table 2). When the concordance of diagnoses of two pathologists of different centers was analyzed, fair agreement was found only for a single pair of them: B1 and C1, with the value of Krippendorff's alpha coefficient equal to 0.43 (with the Bethesda classification regarded as a nominal scale) or 0.59 (an ordinal scale). That agreement concerned mainly category III, which was most frequently formulated-73.6% (131/178). In the case of other equivocal categories the concordance of their diagnoses was low-12.5% in category IV (3/24), and 26.7% in category V (4/15), p < 0.0005 vs. category III in both cases. In the case of other pairs of pathologists of different centers the agreement of diagnoses was poor ( Table 3). Table 4 shows the comparison of the FoM in nodules categorized into the equivocal BSTRC category by particular pathologists. The lowest differences in the FoMs were observed in category III, but a detailed analysis of AUS subcategory revealed significant discrepancies (Table 5).  In the case of category IV, there were also differences in the frequency of FP-NC diagnoses. Pathologists A1 and B2 identified FP-NC two times less often than pathologists B1 and C1 (Table 5). But only in the case of pathologist C1 the FoM in FP-NC nodules, or nodules of category IV in general, was significantly higher (with a more than threefold difference in both cases) than for other pathologists (Tables 4 and 5). The diagnoses of category IV by pathologist C1 also showed the highest percentage of PTC among cancers (pathologist C1: 84.6% PTC vs. A1: 23.5%, A2: 40.0%, B1 and B2: 0.0%, p < 0.05 in all cases except for pathologist B1). The same was true for FP-NC subcategory (pathologist C1: 90.0% PTC vs. A1, B1, B2: 0.0% and A2: 50.0%, NS). No significant differences were found in the FoM between nodules diagnosed as FP-NC and other nodules of category IV for any pathologist.

Analysis of FoM in nodules categorized into the equivocal BSTRC category
Among nodules of category V, the FoM was 20 percentage points higher in the case of pathologists of centers B and C than pathologists of the center A, but those differences did not reach the significance threshold ( Table 4).
The joint FoM of nodules classified into any of categories IV-VI (regarded as an indication for the surgical

Discussion
The introduction of the Bethesda classification for thyroid cytological diagnoses was an important step toward better preoperative diagnostics of thyroid nodules. The primary aim of the classification was to make the rules for the categorization of the smears as well as the nomenclature uniform. That aim has been achieved to a high extent, mainly because nowadays the classification is commonly used worldwide. Unfortunately, not all expectations have been met. In particular, the RoM of nodules in each equivocal category has not been successfully established [3,4]. This problem is well illustrated by our study in which experienced cytopathologists, specialized in diagnostics of the thyroid gland, reevaluated over 200 thyroid biopsies classified into an equivocal category in the routine examination. We found that during the reevaluation most of the smears were again classified into the equivocal category, but their distribution among particular BSRTC categories differed significantly between the pathologists. That was the case in spite of the fact that the pathologists followed the same guidelines and had similar long-time experience in the thyroid cytopathology. The latter was probably the reason why the pathologists did not feel bound by the original diagnosis; as experts they were used to the consultation of difficult cases and verification of the primary diagnosis. The analysis of the incidence of particular BSRTC categories showed that pathologists of the centers B and C were more willing to assign smears into categories II and III, while those of the center A had more readiness to use categories IV and V, as well as AUS subcategory of category III. Those differences in the frequency of BSRTC categories were as high as severalfold and were observed in both cancers and benign nodules. That had a significant impact on the differences in the frequency of revealing cancers in nodules classified by particular pathologists into the same BSRTC category. Pathologists of the center A, who were more inclined to use categories of a potentially higher RoM, obtained the lower FoM in those categories than other pathologists. It was probably a consequence of specific characteristics of the center A, which is an oncology center while the others are not. In consequence, pathologists of center A relatively rarely evaluate smears obtained from non-neoplastic lesions, where features of atypia may result from chronic thyroiditis, hyperthyroidism, antithyroid agent use, or radioiodine treatment. On the other hand, they are more vigilant about features of atypia and tend to classify cases with the slightest degree of atypia to higher BSRTC categories. In particular, they are more eager to use categories that are an indication for surgical treatment (IV-VI). Among patients with nodules, which pathologists of the center A classified into such a category, only 40.0% actually had a cancer, while of the center B-66.7%, and the center C-80.6%. Obviously, these percentages do not reflect the joint RoM of three equivocal diagnostic categories in particular centers. But they show how big the differences may be in the FoM in the case of biopsy with an equivocal microscopic image. Taking that into account, endocrinologists need to know how to interpret particular categories of FNA outcomes in the context of the diagnostic center where the examination was made. Only then is it possible to weigh potential benefits and risks related to the surgery in a patient with equivocal and alerting cytological outcome correctly, especially when significant comorbidities occur.
We expected to find poor agreement in diagnosing category III, as its definition is the least precise. That problem was indicated by other investigators [6][7][8][9][10][11], and even the authors of the Bethesda classification [1,2]. Accordingly, in the center A where pathologists assigned smears into each of three equivocal categories with a similar frequency, the reproducibility of category III was markedly lower than categories IV or V. Interestingly, the FoM in nodules classified as AUS by pathologists of center A was similar to that in nodules without features of nuclear atypia (usually denoted as FLUS), which is concordant with the intentions of the authors of the BSRTC classification. On the other hand, the FoM among nodules diagnosed as AUS by other pathologists was twice as high as in other nodules of category III (without nuclear atypia) which was concordant with the majority of reports on that matter. It is probably a consequence of the same factors that led to earlier discussed differences in FoM of nodules of categories IV-VI between these centers (lower threshold of nuclear atypia intensity when categorizing smears into the AUS subgroup in the center A). Because the estimated RoM for nodules with nuclear atypia in some centers approaches the RoM of nodules in category V, the decision about the surgical treatment is made without repeating FNA, particularly if the nodule shows a highly suspicious sonographic pattern [11,14,15]. However, it is hard to formulate decisive recommendations in this area with such large differences between diagnostic centers. In our sample, the FoM in nodules assigned to category V in centers B and C was two times higher than in AUS nodules and four times higher than in FLUS nodules. But in center A the FoM in nodules of category V was 20% lower than in center B and three times higher than in nodules diagnosed as AUS or FLUS that showed similar frequencies of malignancy, as it was already mentioned. That problem could probably be solved to some degree by the wider use of molecular tests in patients with equivocal cytology. Unfortunately their use is limited by high costs and the lack of satisfactory validation in various populations. ATA guidelines recommend the use of such tests in nodules with equivocal cytology, especially of category III, but European guidelines are much more conservative in that area [4,5]. Consequently, multi-gen panels, that are popular in the USA, are used in Europe much less often.
The low agreement of diagnoses of category IV, especially between the centers, was somewhat unexpected. That problem was well shown by the analysis of diagnoses formulated for nodules classified into category IV by pathologist C1: in those cases, pathologists of center B most often diagnosed category III, while pathologist A1 assigned categories IV and V with similar frequency. Pathologist C1 was unique in their attitude to the presence of mild nuclear changes in follicular-patterned cases. The percentage of cancers among nodules diagnosed as FP-NC by pathologist C1 was close to 90% and about 90% of those cancers were PTC, while in the case of other pathologists the FoM in FP-NC cases did not exceed 40%, and the frequency of PTC among those cancers-50%. Those differences resulted in a FoM in category IV nodules three times higher as diagnosed by pathologist C1 than other pathologists. Interestingly, in the case of both pathologists of center B none of the cancers in nodules classified into category IV was PTC, and the majority of them were HTC. Distinguishing HTC cells from PTC cells may be difficult because of similar oxyphilic cytoplasm in both cases. It seems that pathologists of center B avoided assigning smears with cells presenting oncocytelike cytoplasm and features of PTC into the category IV. It could be a consequence of their habit to use the definition of category IV in its primary form (excluding cases with nuclear features of PTC) in routine diagnostics, according to the national guidelines. On the other hand, pathologist C1 made intensive use of the possibility of assigning cases with mild nuclear changes into that category. That is a good example of how much individual interpretations of Bethesda classification guidelines may differ. Interestingly, in the case of pathologist C1 the FoM in FN-NC nodules was twice as high as the FoM in other nodules of category IV, while in the case of pathologist B1 it was the opposite. That differences were not statistically significant but undoubtedly the impact of FN-NC inclusion into category IV should be analyzed on a higher number of cases and in populations with different iodine supply, which is known to modify the relative incidences of FTC and PTC.
In our material the agreement of diagnoses within the centers was higher than between the centers but still rather unsatisfactory. The higher agreement was probably a consequence of local traditions of thyroid cytology assessment and the cooperation of pathologists during the routine diagnostics. This thesis is confirmed by a relatively high agreement between pathologists B1 and C1 (with the exception of category IV), who despite working in different centers used to consult difficult cases between themselves. The significance of the group consensus in lowering interobserver disagreement in the case of smears with equivocal microscopic image was also indicated by others [16,17].
Previous studies on the agreement of diagnoses in the Bethesda classification between several pathologists were usually conducted on smaller groups of nodules comprising all categories of the classification. In such a setting the agreement was satisfactory. But when the unequivocal categories (categories II and VI) were excluded from the analysis the agreement decreased [18,19]. Kocjan et al. also reported poor agreement for categories Thy3a and Thy4 of the UK Royal College of Pathologists' classification system (equivalent to BSRTC III and V, respectively) [20]. Similar conclusions were drawn by Padmanabhan et al. [21] and Bhasin et al. [22] in the case of category III of the Bethesda classification and Cochand-Priollet et al. [17] in the case of categories III and IV. Our study is special not only for the assessment of the agreement of diagnoses of equivocal categories in a large set of nodules but also for the analysis of the frequency of their malignancy based on postoperative histopathological examination. We used the term 'FoM' instead of RoM on purpose. The evaluation of RoM would demand an examination of subsequent nodules with equivocal cytology and known definite diagnosis. Our sample was characterized with the overrepresentation of cancers in the relation to their incidence in the general population that had been exposed to iodine deficiency for a long time. Such a selection of nodules allowed us to address the investigated problem better, but made it impossible to draw any direct conclusions about the RoM of particular BSRTC categories. Such a selection of nodules, as well as the pathologists being aware of it, could be a potential source of bias. Another limitation of our study is the lack of the reassessment of postoperative histopathological examinations. Thus, the actual incidence of NIFTP and other borderline tumors could not be determined.
Despite these limitations, our results indicate a necessity for future studies focused on the standardization of the rules for the classification of smears into particular BSRTC categories, especially equivocal ones. At present, the use of published data on the RoM of nodules in particular BSRTC categories without a concern for the uniqueness of the diagnostic center may lead to erroneous conclusions. That uniqueness is not only a result of specific epidemiological conditions (e.g., iodine supply) but also of variations in the interpretation of diagnostic criteria for each BSRTC category. That interpretation may be modified by a profile of patients who are routinely diagnosed-the profile that is different in oncological and endocrinological centers.

Compliance with ethical standards
Conflict of interest The authors declare no competing interests.
Ethical approval The study design was approved by the Local Bioethics Committee at the National Research Institute of Oncologyas a part of the studies carried out for the realization of the grant MILESTONE. The approval code is "13/2015/1/2016".
Consent to participate All patients subjected to FNA of the thyroid gland gave their informed consent.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/.