Performance of current ultrasound-based malignancy risk stratification systems for thyroid nodules in patients with follicular neoplasms

Objectives To investigate the ability of the currently used ultrasound-based malignancy risk stratification systems for thyroid neoplasms (ATA, AACE/ACE/AME, K-TIRADS, EU-TIRADS, ACR-TIRADS and C-TIRADS) to distinguish follicular thyroid carcinoma (FTC) from follicular thyroid adenoma (FTA). Additionally, we evaluated the ability of these systems to correctly determine the indication for biopsy. Methods Three hundred twenty-nine follicular neoplasms with definitive postoperative histopathology were included. The nodules were categorized according to each of six stratification systems, based on ultrasound findings. We dichotomized nodules into a positive predictive group for FTC (high and intermediate risk) and a negative predictive group for FTC based on the classification results. Missed biopsy was defined as a neoplasm diagnosed as FTC for which biopsy was not indicated based on lesion classification. Unnecessary biopsy was defined as a neoplasm diagnosed as FTA for which biopsy was considered indicated based on classification. The diagnostic performance and the missed and unnecessary biopsy rates were evaluated for each stratification system. Results The area under the curve of each system for distinguishing FTC from FTA was < 0.700 (range, 0.511–0.611). The missed biopsy rates were 9.0–22.4%. The missed biopsy rates for lesions ≤ 4 cm and lesions sized 2–4 cm were 16.2–35.1% and 0–20.0%, respectively. Unnecessary biopsy rates were 65.3–93.1%. In the ≤ 4 cm group, the unnecessary biopsy rates were 62.2–89.7%. Conclusion The malignancy risk stratification systems can select appropriate nodules for biopsy in follicular neoplasms, but they have limitations in distinguishing FTC from FTA and in reducing unnecessary biopsy. Specific stratification systems and recommendations should be established for follicular neoplasms.
Key Points
• Current ultrasound-based malignancy risk stratification systems of thyroid nodules had low efficiency in the characterization of follicular neoplasms.
• The adopted stratification systems showed acceptable performance for selecting FTC for biopsy but unsatisfactory performance for reducing unnecessary biopsy.
Supplementary Information The online version contains supplementary material available at 10.1007/s00330-021-08450-3.


Introduction
Both follicular thyroid adenoma (FTA) and follicular thyroid carcinoma (FTC) originate from follicular cells [1]. FTC is the second most common thyroid malignancy, accounting for 10-15% of all malignant thyroid tumors. FTC shows a high propensity for hematogenous spread, and 15% to 27% of these patients develop distant metastasis [2,3]. Compared with papillary thyroid carcinoma (PTC), which is the most common subtype of thyroid cancer (approximately 80%), patients with FTC have a twofold higher risk of lung metastasis and a tenfold higher risk of bone metastasis, resulting in worse survival outcomes [2,4,5]. However, preoperative differentiation of FTC from its benign counterpart (FTA) is an inherently challenging aspect of the management of thyroid nodules.
Ultrasound is the first-line imaging tool for evaluating the risk of malignancy and for formulating the optimal management strategy, including determining the indication for fine-needle aspiration (FNA) and informing treatment decision-making (surgical resection, monitoring, or no follow-up) [6]. Use of a single parameter for ultrasound evaluation may lead to inter-observer variability, resulting in suboptimal sensitivity and specificity [7]. To standardize the evaluation of thyroid nodules for malignancy, various clinical societies have recently developed ultrasound-based malignancy risk stratification systems [8][9][10][11][12][13]. Based on the different versions of the Thyroid Imaging Reporting and Data System (TIRADS), several "pattern-based" systems and "score-based" systems have been established. The former are represented by 2015 ATA (American Thyroid Association), AACE/ACE/AME (American Association of Clinical Endocrinologists, American College of Endocrinology, and Associazione Medici Endocrinologi), K-TIRADS (Korean Society of Thyroid Radiology), and EU-TIRADS (European Thyroid Association). The latter include ACR-TIRADS (American College of Radiology) and C-TIRADS (2020 Chinese Guidelines for Ultrasound Malignancy Risk Stratification of Thyroid Nodules). All these systems have shown reliable performance for diagnosis and for selecting candidates for FNA [14][15][16][17].
As mentioned above, there are considerable differences in the incidence rates of various thyroid cancer subtypes [2]. In previous studies that evaluated these systems, the vast majority of malignant specimens were PTCs (88.9 to 99.6%) [18], which may have introduced an element of bias. Ultrasonographic features of FTC and PTC are considerably different [3]. The established malignant features, including hypoechogenicity, irregular margins, microcalcifications, and nonparallel orientation, are more common in PTC than in FTC.
Currently, few studies have focused on the value of ultrasound-based malignancy risk stratification systems in patients with follicular neoplasms [19][20][21]. The performance of these systems in distinguishing FTC from FTA and in correctly determining the indications for biopsy is uncertain. Thus, we hypothesized that the relevant conclusions reported in the past could not be simply extrapolated to follicular tumors.
In this study, we aimed to investigate the performance of the currently used systems in the context of follicular neoplasms. For this purpose, we compared the ability of the current systems to distinguish FTC from FTA. Additionally, we assessed whether these systems can help identify the nodules that require a biopsy and can reduce the rate of unnecessary biopsies.

Materials and methods
This retrospective study was approved by the Institutional Review Board, and the requirement for informed consent to review images and medical records was waived.

Patients
From January 2014 to May 2020, 441 consecutive patients (455 nodules) with thyroid follicular neoplasms proven by histopathological examination of the surgical specimens following thyroidectomy at a tertiary referral center were included in this study. Fourteen patients were pathologically confirmed to have two lesions each (4 with FTCs and 10 with FTAs). For these patients, the larger of the two nodules was selected. The exclusion criteria were as follows: absence of preoperative thyroid US images (n = 55) and uncertain match between imaging findings and histopathological results (n = 8). Because hyperfunctioning nodules do not require FNA [8], definitive or suspected hot nodules were also excluded (n = 49).
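The cohort flow described above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text; it assumes that the exclusion categories do not overlap and that the 14 smaller nodules in patients with two lesions count toward the excluded total, which is an inference from the reported totals rather than something the text states explicitly.

```python
# Cohort flow, using the counts reported in the Patients section.
total_nodules = 455    # nodules in 441 consecutive patients
second_lesions = 14    # smaller nodule dropped in patients with two lesions (assumed to count as excluded)
no_us_images = 55      # no preoperative thyroid ultrasound images
uncertain_match = 8    # imaging could not be matched to histopathology
hot_nodules = 49       # definitive or suspected hyperfunctioning nodules

# Assumes the categories are mutually exclusive.
excluded = second_lesions + no_us_images + uncertain_match + hot_nodules
analyzed = total_nodules - excluded
excluded_pct = round(100 * excluded / total_nodules, 1)

print(excluded, analyzed, excluded_pct)  # 126 329 27.7
```

Under these assumptions the numbers agree with the 329 analyzed neoplasms and the 126 excluded nodules (27.7%) reported elsewhere in the paper.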

Ultrasonography
Ultrasonography examinations of thyroid glands and cervical regions were performed with Aplio 500 (Toshiba Medical System), HI Vision Ascendus (Hitachi Medical Corporation), or HI Vision Preirus (Hitachi Medical Corporation) ultrasound instruments equipped with 5-12-MHz linear array transducers. The following ultrasound features of each neoplasm were recorded and reviewed by 2 researchers who were blinded to the diagnosis: maximum diameter (cm); location (left, right, or isthmus); composition (spongiform, cystic, mixed, or solid); echogenicity (anechoic, hyperechoic, isoechoic, hypoechoic, or very hypoechoic); margin (smooth, ill-defined, or irregular); calcification (absent, microcalcification, macrocalcification, or rim calcification); shape (round to oval or irregular); orientation in transverse view (parallel or nonparallel); hypoechoic peripheral halo (absence or presence); vascularization (absent, perinodular, intranodular, or mixed); and the location of the solid component for mixed-content nodules (eccentric or non-eccentric). The presence of other hyperechoic foci (comet-tail artifacts or indeterminate), extrathyroidal extension, and suspicious cervical lymph node was also investigated. Any disagreement between the 2 researchers with respect to these features was resolved by consensus.

Categorization according to the risk stratification systems
Based on retrospective analysis of ultrasound features, each thyroid nodule was categorized using the six stratification systems: 2015 ATA, AACE/ACE/AME, K-TIRADS, EU-TIRADS, ACR-TIRADS, and C-TIRADS [8][9][10][11][12][13]. For statistical analysis, firstly, the nodules were dichotomized into a positive predictive group for FTC (high and intermediate suspicion according to 2015 ATA; high and intermediate risk according to AACE/ACE/AME; high and intermediate suspicion according to K-TIRADS; high and intermediate risk according to EU-TIRADS; category 4 and 5 according to ACR-TIRADS; category 4B to 5 according to C-TIRADS) and a negative predictive group for FTC (benign, very low, and low suspicion according to 2015 ATA; low risk according to AACE/ACE/AME; benign and low suspicion according to K-TIRADS; benign and low risk according to EU-TIRADS; category 1 to 3 according to ACR-TIRADS; category 2 to 4A according to C-TIRADS).
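The dichotomization rule above can be written as a simple lookup. The following is a minimal sketch, not the authors' implementation: the category label strings and the function name `dichotomize` are hypothetical, and C-TIRADS category 4C is assumed to fall within the stated range "4B to 5".

```python
# Categories treated as "predicted FTC" (positive) per system, as listed in the text.
POSITIVE = {
    "2015 ATA": {"intermediate suspicion", "high suspicion"},
    "AACE/ACE/AME": {"intermediate risk", "high risk"},
    "K-TIRADS": {"intermediate suspicion", "high suspicion"},
    "EU-TIRADS": {"intermediate risk", "high risk"},
    "ACR-TIRADS": {"4", "5"},
    "C-TIRADS": {"4B", "4C", "5"},  # "4B to 5"; 4C is assumed to lie in this range
}

# Categories treated as "predicted non-FTC" (negative) per system.
NEGATIVE = {
    "2015 ATA": {"benign", "very low suspicion", "low suspicion"},
    "AACE/ACE/AME": {"low risk"},
    "K-TIRADS": {"benign", "low suspicion"},
    "EU-TIRADS": {"benign", "low risk"},
    "ACR-TIRADS": {"1", "2", "3"},
    "C-TIRADS": {"2", "3", "4A"},
}


def dichotomize(system, category):
    """Return True (predicted FTC), False (predicted non-FTC),
    or None when the category does not fit the system (non-classifiable)."""
    if category in POSITIVE[system]:
        return True
    if category in NEGATIVE[system]:
        return False
    return None


print(dichotomize("ACR-TIRADS", "4"))   # True
print(dichotomize("C-TIRADS", "4A"))    # False
```

Returning `None` for labels outside both sets keeps non-classifiable nodules separate from the two predictive groups.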
Secondly, based on the risk stratification recommendations, the nodules were retrospectively divided into 2 categories: "indication for FNA" and "no indication for FNA" (Supplementary 1). Missed biopsy was defined as any case of FTC nodule for which FNA was not indicated based on the risk stratification system. Unnecessary biopsy was defined as any case of FTA for which FNA was considered indicated based on the risk stratification system. Missed biopsy rate and unnecessary biopsy rate were calculated for each of the six systems.
In addition, nodules that did not conform to any category according to the systems were included in the non-classifiable group. For non-classifiable nodules, FNA was only considered to be recommended if suspicious cervical lymph nodes were present.
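The two rates defined above reduce to simple proportions: missed biopsies over all FTCs, and unnecessary biopsies over all FTAs. The sketch below illustrates the calculation; the function name `biopsy_rates` and the toy data are hypothetical and are not study data.

```python
from typing import Iterable, Tuple


def biopsy_rates(nodules: Iterable[Tuple[str, bool]]) -> Tuple[float, float]:
    """Compute (missed biopsy rate, unnecessary biopsy rate) in percent.

    Each nodule is a (histology, fna_indicated) pair, where histology is
    "FTC" or "FTA" and fna_indicated is the recommendation of one system.
    """
    nodules = list(nodules)
    ftc_flags = [fna for histology, fna in nodules if histology == "FTC"]
    fta_flags = [fna for histology, fna in nodules if histology == "FTA"]
    if not ftc_flags or not fta_flags:
        raise ValueError("need at least one FTC and one FTA")
    missed = sum(1 for fna in ftc_flags if not fna)    # FTC without FNA indication
    unnecessary = sum(1 for fna in fta_flags if fna)   # FTA with FNA indication
    return (100.0 * missed / len(ftc_flags),
            100.0 * unnecessary / len(fta_flags))


# Toy data (illustrative only): 4 FTCs, one not indicated; 5 FTAs, four indicated.
toy = ([("FTC", True)] * 3 + [("FTC", False)]
       + [("FTA", True)] * 4 + [("FTA", False)])
missed_rate, unnecessary_rate = biopsy_rates(toy)
print(missed_rate, unnecessary_rate)  # 25.0 80.0
```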

Data and statistical analysis
Nominal and ordinal variables are expressed as frequencies and proportions, while continuous variables are expressed as mean ± standard deviation (SD) and range. Between-group differences with respect to demographic, clinical, and ultrasound features were assessed using the independent two-sample t test or rank-sum test for continuous variables and the chi-square test or Fisher's exact test for nominal variables. Data pertaining to the distribution of lesions in various categories according to the risk stratification systems were analyzed using the Mann-Whitney U test or Kruskal-Wallis test for ordinal variables. Based on the established cutoff, diagnostic performance of the systems was assessed using receiver operating characteristic (ROC) curve analysis. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the ROC curve (AUC) were calculated with 95% confidence intervals (95% CI). AUCs of the systems were compared using the DeLong method. Missed biopsy rates and unnecessary biopsy rates were each compared across systems using Cochran's Q test. All statistical analyses were performed using SPSS (version 23.0, IBM) or MedCalc (version 19.0.4) software. Two-sided p values < 0.05 were considered indicative of statistical significance.
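For a single dichotomized cutoff, the diagnostic indices listed above follow directly from a 2 × 2 table. The sketch below is illustrative only: the function name and the counts are hypothetical, confidence intervals are not computed, and the AUC shown is the area under the ROC curve of a single-cutoff binary test, which equals the mean of sensitivity and specificity. Whether the study's software computed the AUC in exactly this way is an assumption.

```python
def diagnostic_indices(tp, fp, fn, tn):
    """Diagnostic indices from a 2 x 2 table.

    tp: FTC predicted positive    fn: FTC predicted negative
    fp: FTA predicted positive    tn: FTA predicted negative
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    # ROC curve through (0, 0), (1 - specificity, sensitivity), (1, 1):
    # its trapezoidal area is the mean of sensitivity and specificity.
    auc = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "auc": auc,
    }


# Hypothetical counts, not study data.
indices = diagnostic_indices(tp=30, fp=40, fn=20, tn=60)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

With these hypothetical counts, sensitivity and specificity are both 0.600, so the single-cutoff AUC is also 0.600, which illustrates how a test can have a modest AUC despite flagging most carcinomas.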

Results
The performance of the six systems in correctly determining the indication for biopsy and reducing unnecessary biopsy is shown in Table 5. The lowest missed biopsy rate was found with K-TIRADS (9.0%), and the highest with ACR-TIRADS and AACE/ACE/AME (22.4%). The missed biopsy rates were significantly different among the six systems (p = 0.049), but no pairwise comparison was significant. ACR-TIRADS was associated with the lowest unnecessary biopsy rate (65.3%), while K-TIRADS was associated with the highest unnecessary biopsy rate (93.1%). The unnecessary biopsy rates were significantly different among the six systems (p < 0.001). The unnecessary biopsy rates of ACR-TIRADS (65.3%) and C-TIRADS (67.9%) were lower than those of the other systems.
We further performed sub-group analysis based on nodule size. The missed biopsy rates ranged from 16.2 to 35.1% for lesions ≤ 4 cm and from 0 to 20.0% for lesions sized 2-4 cm. There was no significant difference among the six systems with respect to missed biopsy rate for lesions ≤ 4 cm or lesions sized 2-4 cm (p = 0.135 and p = 0.075, respectively). The systems with the lowest and highest unnecessary biopsy rates for lesions ≤ 4 cm were ACR-TIRADS (62.2%) and K-TIRADS (89.7%), respectively. The unnecessary biopsy rates (≤ 4 cm) were significantly different among the six systems (p < 0.001). ACR-TIRADS (62.2%) and C-TIRADS (70.5%) had lower unnecessary biopsy rates than the other systems.
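The between-system comparisons reported above rely on Cochran's Q test, which compares paired binary outcomes (here, whether each nodule is a missed or unnecessary biopsy under each system). The following is a minimal sketch of the test statistic using the standard formula; the function name and the toy table are hypothetical, and the p value (obtained from a chi-square distribution with k − 1 degrees of freedom) is left to a statistics package.

```python
def cochran_q(table):
    """Cochran's Q statistic for paired binary outcomes.

    table: one row per nodule, one column per system; each entry is
    1 if the event occurred under that system, otherwise 0.
    Returns (Q, degrees_of_freedom).
    """
    k = len(table[0])                                   # number of systems
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)                                 # total number of events
    denominator = k * n - sum(r * r for r in row_totals)
    if denominator == 0:
        raise ValueError("Q is undefined when every row is all 0s or all 1s")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denominator
    return q, k - 1


# Toy table (illustrative only): 6 nodules assessed by 3 systems.
toy_table = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
]
q_stat, dof = cochran_q(toy_table)
print(q_stat, dof)  # 6.0 2
```

Rows in which all systems agree contribute nothing to Q, which is why the statistic is driven by nodules that the systems handle differently.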

Discussion
Our study showed that the stratification systems did not help distinguish FTA from FTC. In addition, while the systems showed acceptable performance for correctly determining the indication for biopsy in FTC, the performance with respect to reducing unnecessary biopsy was unsatisfactory.
In this study, using high or intermediate suspicion as the positive cutoff for FTC, the AUCs of all six systems (0.511 to 0.611) were less than 0.700. A Korean study that focused on the performance of K-TIRADS in classifying follicular neoplasms found low efficiency for K-TIRADS using the same cutoff (AUC = 0.575, p = 0.439) [19]. However, Liu et al reported acceptable performance of 2015 ATA (AUC = 0.744, p < 0.001) and ACR-TIRADS (AUC = 0.744, p < 0.001) for distinguishing follicular neoplasms [20]. In our study, AUCs of K-TIRADS, EU-TIRADS, ACR-TIRADS, and C-TIRADS were disappointing (AUC = 0.573-0.611, p < 0.05). It may be difficult to improve clinical management of nodules based on these systems. The poor performance is attributable to the fact that follicular neoplasms present with substantially overlapping ultrasound features, and FTCs rarely present with features favoring malignancy as described in the current systems, such as nonparallel orientation.
In addition, our study found that "pattern-based" systems (2015 ATA, AACE/ACE/AME, K-TIRADS, and EU-TIRADS) had more limitations in the classification of follicular neoplasms than previously reported in the literature [19][20][21], in contrast to the "score-based" systems (ACR-TIRADS and C-TIRADS). Given the limitations of pattern-based systems in the classification of follicular neoplasms, these systems should incorporate findings such as undetermined composition, irregular shape, and iso/hyperechogenicity with microcalcification and ill-defined margin (Table 4), or switch to "score-based" systems in the future.
In our cohort, the missed biopsy rates ranged from 9.0 to 22.4% for all FTCs and from 16.2 to 35.1% for FTCs ≤ 4 cm, which is concordant with the study by Castellana et al [21], in which 0 to 31% of FTCs were missed. Therefore, we believe that all six systems assessed in this study are effective tools to select FTC for FNA in clinical practice. In fact, previous studies have shown that follicular neoplasms are large [22][23][24][25][26][27]. The mean maximum diameter of FTC (3.9 ± 2.1 cm) in our study was higher than any threshold proposed for ultrasound-based malignancy risk stratification systems. The largest maximum diameter was 12.1 cm. This large size is the main reason for the high performance in selecting FTC for FNA. Besides, the size of FTC at the time of management is important. FTCs larger than 2 cm have a higher risk of distant metastasis and are associated with worse prognosis [28]. In our study, a satisfactory missed biopsy rate (0 to 20.0%) was also seen for FTCs sized 2 to 4 cm. Apart from unclassifiable nodules, missed FTCs were classified as category 2 and 3 by ACR-TIRADS, and as category 3 by C-TIRADS. There were 4 FTC lesions that were missed by both systems, including a solid isoechoic nodule and three mixed (predominantly solid) isoechoic nodules, which are generally considered indicative of benignity. Therefore, meticulous care should be exercised while managing non-high risk nodules as well.
However, correctly determining the indication for biopsy is only a step towards further diagnostic workup, since the definitive distinction between follicular neoplasms is based only on postoperative histopathology [3]. The risk of malignancy associated with an FNA reading of Bethesda IV (follicular neoplasm or suspicious for a follicular neoplasm) is 10-40% and that with Bethesda III (atypia of undetermined significance or follicular lesion of undetermined significance) is 6-18% [29]. The possibility of FTC in lesions with the above suspicious cytological findings is uncertain. In previous studies, 10.8% of follicular neoplasm (FN) and 1.2% of follicular lesion of undetermined significance (FLUS) readings were eventually found to be FTCs [30,31]. As an extreme example, 1379 thyroid nodules with FNA findings consistent with FN were not diagnosed as FTC after surgery [32]. Other studies have shown that the efficacy of TIRADS depends on the incidence of PTC in the study population [33] and that suspicious ultrasound features may not be useful in predicting malignancy of FLUS [34]. Therefore, caution is required given the limitations of cytology in the diagnosis of FTC.
In previous studies involving papillary carcinoma as the primary malignant tumor, unnecessary biopsy rates with ultrasound-based malignancy risk stratification systems were generally lower than 50% [14][15][16][17]. However, our study found higher unnecessary biopsy rates in patients with follicular neoplasm. In our study, FNA was considered indicated for 65.3% to 93.1% of all FTAs and for 62.2% to 89.7% of FTAs sized ≤ 4 cm. We believe that this result is mainly due to the large size of follicular neoplasms [22][23][24][25][26][27]. Furthermore, our study showed that the unnecessary biopsy rates with use of "pattern-based" systems were higher than those with use of "score-based" systems irrespective of the lesion size because of the different criteria for determining the indications for biopsy. According to "pattern-based" systems, once a nodule exceeds the size threshold, FNA is withheld only for nodules with special patterns [8][9][10][11], such as entirely spongiform and purely cystic nodules (3.3% and 0% of FTAs, respectively). In "score-based" systems [12,13], FNA is not indicated for nodules that are classified as category 1 and 2 by ACR-TIRADS or as category 2 and 3 by C-TIRADS, irrespective of the size. In our study, a certain proportion of FTAs was categorized in the "no biopsy" group by ACR-TIRADS and C-TIRADS (23.6% and 26.4% of FTAs, respectively). Currently, there is an impetus to reduce unnecessary biopsy of thyroid nodules, because of the low rate of malignancy and the generally good prognosis of the most common thyroid malignancy (PTC) [8]. Nevertheless, FTC exhibits a more aggressive biological behavior than PTC.
Table 3 Diagnostic indices of the systems for follicular thyroid neoplasms depending on predictive classifications. Classifiable nodules of each system were included. Numbers in parentheses are 95% confidence intervals. a There was no significant difference between the AUCs of the above four systems (p > 0.05)
Although the precise diagnosis of follicular neoplasms requires postoperative histopathology, preoperative biopsy and further molecular testing may provide supplementary information aiding the differential diagnosis of follicular neoplasms [35]. Thus, deliberately avoiding biopsy and further testing for follicular neoplasms requires careful consideration. Recently, some studies have demonstrated the value of molecular testing for follicular neoplasms. For example, a study using next-generation sequencing reported that the presence of FLT3 and TP53 with no RET mutations was consistent with FTC and the absence of FLT3 and TP53 with the presence of RET mutations was consistent with FTA [36]. Another study investigating DNA methylation haplotype block markers identified 70 DNA methylation markers that were significantly different between the FTC and FTA samples [37].
Our study has several limitations. First, this was a single-center retrospective study. Only patients with a confirmed postoperative diagnosis of thyroid follicular neoplasm were included. Patients who did not undergo surgery were not captured, which may have introduced an element of selection bias. Additionally, 126 nodules (27.7%, 126 of 455) were excluded, which may also have resulted in selection bias. Finally, only non-dynamic images were available for recording the ultrasound features, which may have affected the accuracy of the data.
In conclusion, the currently used malignancy risk stratification systems for thyroid nodules showed poor ability in distinguishing FTA from FTC. The performance of these systems in selecting nodules for biopsy was acceptable, but the performance with respect to reducing unnecessary biopsy was unsatisfactory. Our findings indicate the need to develop a specific stratification system and recommendations for follicular neoplasms.
(Table 4): an unclassifiable nodule with unknown size was indicated for FNA due to a suspicious cervical lymph node according to ATA and K-TIRADS
b The missed biopsy rates were significantly different among the six systems (p = 0.049), but not found in pairwise comparisons
c There was no significant difference between the six systems with respect to missed biopsy rate (≤ 4 cm) (p = 0.135)
d There was no significant difference between the six systems with respect to missed biopsy rate (