Introduction

Ultrasonography, a simple and non-invasive diagnostic method, now has a central role in the evaluation of thyroid nodules [1]. Certain ultrasound indices are significantly associated with thyroid cancer. Commonly used real-time ultrasound indices include size, composition, shape, halo sign, echogenicity, and calcification, together with accessory features such as extrathyroidal extension, lymph nodes, blood flow, and elasticity. However, no single ultrasound feature readily achieves both a sensitivity and a specificity above 90% for diagnosing thyroid cancer. Hypoechoic and solid nodules have higher diagnostic sensitivity but lower specificity, while nodules with microcalcifications, infiltrative margins, and taller-than-wide shapes have higher specificity but lower sensitivity [2]. Therefore, an ultrasound model that combines several valid ultrasound features is more helpful for identifying the nature of nodules.

In 2009, Chilean scholars first introduced the concept of TIRADS and defined ten ultrasound patterns to distinguish benign and malignant thyroid nodules [3]. In 2011, Kwak proposed a simplified stratified assessment system containing only five ultrasound indices: shape, echogenicity, structure, calcification, and margin [4]. Subsequently published TIRADS, including the ATA guidelines, EU-TIRADS, ACR-TIRADS, and KTA/KSThR-TIRADS, have also been built on these five ultrasound indices. These TIRADS have been clinically validated and have good diagnostic value. However, the definitions of some features within their ultrasound lexicons (e.g., hypoechoic, solid, spongiform) are not uniform. The number of assessment categories, the specific malignant features involved, and even the way suspicious ultrasound features are used (i.e., counting the number of suspicious features or using ultrasound patterns for risk stratification) also vary, so the malignancy rate of each category from low to high suspicion differs among these systems [5,6,7,8].

There is no perfect TIRADS to date, and each TIRADS has its own advantages. For example, the EU-TIRADS and ATA guidelines are pattern-dependent systems characterized by a high negative predictive value and sensitivity, whereas the ACR-TIRADS is a typical score-based system with a high positive predictive value and specificity [9,10,11]. Based on these facts, we hypothesized that the various TIRADS may complement one another: some TIRADS may correctly classify certain thyroid nodules that other TIRADS cannot. Further, is it possible to explore new methods to improve the diagnostic accuracy based on data from those unmatched nodules?

Thus, in this study, we focused on the nodules with unmatched findings among four TIRADS (including the newly released C-TIRADS) [12]. We then explored potential ways to improve the diagnostic accuracy, such as parallel or serial tests of two TIRADS, or one TIRADS combined with specific ultrasound features.

Methods

This retrospective study was approved by the Institutional Review Board, and the requirement for informed consent to review images and medical records was waived.

Patients

From February 2016 to February 2019, 1001 thyroid nodules in 933 patients were enrolled in the study. Only nodules with a definite diagnosis were included: malignant nodules were confirmed by surgical pathology, and benign nodules were diagnosed by surgical pathology or repeated Bethesda II findings. Based on these criteria, 795 nodules were finally included, comprising 334 surgically confirmed malignant nodules, 63 surgically confirmed benign nodules, and 398 nodules with repeated Bethesda II results. One hundred eighty-eight nodules that could not be clearly diagnosed were excluded: 7 nodules with Bethesda I cytopathology, 28 Bethesda III-V nodules with no further surgical pathology, 132 nodules with a single benign cytopathologic result, and 21 nodules with initial benign pathology but increased nodular size on follow-up ultrasound examination (mean interval of 21 months, range 2 to 35 months) (Fig. 1).

Fig. 1 Study flowchart

Sonography examination and image evaluation

Conventional ultrasound examinations were performed by board-certified radiologists using Aplio 500 (Toshiba Medical System), HI Vision Ascendus (Hitachi Medical Corporation), or HI Vision Preirus (Hitachi Medical Corporation) ultrasound instruments equipped with 5–12-MHz linear array transducers. The ultrasound images were reviewed by one radiologist with more than 20 years of experience in thyroid ultrasound diagnosis and recorded by two experienced endocrinologists with the help of the radiologist. They were all blinded to the patients’ fine-needle aspiration (FNA) results or pathological diagnosis before sonography examination. Disagreements were resolved by consensus. Before assessing nodules, we studied and compared the lexicon and classification of the four TIRADS (Supplementary Tables 1 and 2). The definitions and classifications of the various TIRADS regarding composition, echogenicity, margin, shape, and calcification are substantially similar. However, there are slight differences in the definitions of solid, spongiform, and hypoechoic, and in the section used to evaluate nodular orientation. C-TIRADS and Kwak-TIRADS are both counting-based systems. C-TIRADS includes marked hypo-echogenicity and ill-defined margin in its scoring system and treats the presence of comet tail artifacts as a negative item.

FNA, cytopathology, and histopathology

Thyroid nodules were judged as benign or malignant according to FNA cytopathology or surgical histopathology. The surgical pathological diagnosis was based on the WHO diagnostic criteria [13], and the cytopathological classification was based on the Bethesda system of thyroid FNA cytology proposed by the National Cancer Institute [14]. Informed consent was obtained from all patients before the FNA biopsy. The procedure was performed by an endocrinologist experienced in ultrasound-guided FNA, using a color Doppler ultrasound scanner with an L14-5 high-frequency linear array probe (Ultrasonix Medical Ltd., Sonix SP). Benign pathology was defined by repeated Bethesda II results, following the recommendation of the guidelines on ablation treatment [1, 15]. At our institution, repeated FNA is required in the following situations: (1) the puncture results are Bethesda I, III, or IV, requiring repeated confirmation or further genetic testing; (2) nodules are categorized as intermediate or high suspicion according to TIRADS assessment, but the puncture results are Bethesda II; (3) patients are scheduled to undergo thermal ablation treatment; (4) during the follow-up period, nodules show a rapid increase in diameter or volume, or develop new suspicious features involving margin, echogenicity, calcification, etc. The time interval between two repeated benign FNAs was 2–4 weeks.

Statistical analysis

SPSS 26 software (IBM) and MedCalc 19.0.4 software (MedCalc) were used for statistical analysis. Quantitative data conforming to a normal distribution were presented as mean ± standard deviation and evaluated by an independent samples t-test. Quantitative data that did not conform to a normal distribution were expressed as median and interquartile range and evaluated by a nonparametric test. Qualitative data were expressed as frequencies and evaluated by a chi-squared test. The optimal cut-off point of each TIRADS was determined from ROC analysis as the point with the highest Youden index, taking sensitivity and specificity into account. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the curve (AUC) were calculated. The McNemar test was used to assess the differences in parameters. Stepwise discriminant analysis was performed to determine variables that may discriminate between benign and malignant nodules. Two-sided p values < 0.05 were considered statistically significant.
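As an illustration of the cut-off selection described above, the following minimal Python sketch computes sensitivity, specificity, PPV, NPV, and the Youden index (sensitivity + specificity − 1) for each candidate cut-off and keeps the one with the highest Youden index. It is not the SPSS or MedCalc procedure used in the study; the function names, the numeric coding of categories, and the toy data are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # malignant nodules called malignant
    fp: int  # benign nodules called malignant
    fn: int  # malignant nodules called benign
    tn: int  # benign nodules called benign


def confusion_at(categories, malignant, cutoff):
    """Count outcomes when categories >= cutoff are called malignant."""
    c = Confusion(0, 0, 0, 0)
    for category, is_malignant in zip(categories, malignant):
        called_malignant = category >= cutoff
        if called_malignant and is_malignant:
            c.tp += 1
        elif called_malignant:
            c.fp += 1
        elif is_malignant:
            c.fn += 1
        else:
            c.tn += 1
    return c


def ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def metrics(c):
    """Standard diagnostic metrics derived from a 2x2 confusion table."""
    sensitivity = ratio(c.tp, c.tp + c.fn)
    specificity = ratio(c.tn, c.tn + c.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(c.tp, c.tp + c.fp),
        "npv": ratio(c.tn, c.tn + c.fn),
        "youden": sensitivity + specificity - 1,
    }


def best_cutoff(categories, malignant):
    """Return the candidate cut-off with the highest Youden index."""
    candidates = sorted(set(categories))
    return max(
        candidates,
        key=lambda k: metrics(confusion_at(categories, malignant, k))["youden"],
    )


# Hypothetical toy data (NOT study data): ordinal categories coded 2..5.
toy_categories = [2, 3, 3, 4, 4, 5, 5, 5, 5]
toy_malignant = [False, False, False, False, True, True, True, True, True]

chosen = best_cutoff(toy_categories, toy_malignant)
print(chosen, metrics(confusion_at(toy_categories, toy_malignant, chosen)))
```

In this toy example, cut-off 5 is selected because it gives the highest Youden index among the four candidates.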

Results

Baseline

There were no significant differences in the age and gender of patients with benign and malignant nodules. The maximum size of malignant nodules was significantly smaller than that of benign nodules (median 1.00 [Q1-Q3, 0.80-1.50] vs. 2.40 [Q1-Q3, 1.50-3.20]) (Supplementary Table 3). All 334 surgically confirmed malignant nodules were papillary thyroid carcinoma (PTC). The surgically confirmed benign nodules comprised 52 nodular goiters, three adenomatous goiters, six follicular thyroid adenomas, and two cases of nodular Hashimoto thyroiditis.

As shown in Table 1, the malignancy rates of ACR-TIRADS TR3 and TR4, C-TIRADS CTR 4b, and EU-TIRADS grade 4 were higher than the recommended malignancy rates. There were significant differences between the four TIRADS grades (all p < 0.001).

Table 1 Estimated malignant risk of the four TIRADS according to pathological diagnosis

Supplementary Table 4 shows the diagnostic performance of the four TIRADS. The best diagnostic cut-off values of ACR-TIRADS, Kwak-TIRADS, C-TIRADS, and EU-TIRADS were TR5, 4c, CTR 4b, and grade 5, respectively. C-TIRADS had the highest sensitivity (91.6%) and NPV (93.0%), while ACR-TIRADS had the highest specificity (91.1%) and PPV (86.9%). However, Kwak-TIRADS had the highest AUC, 0.884 (95% CI: 0.860-0.906).

Figure 2 illustrates the diagnostic distribution for the four TIRADS in assessing pathological benign and malignant nodules. The number of cases with inconsistent findings of benign pathology was more than that of malignant pathology (96/795 vs. 65/795). In total, 86.8% (269/310) malignant nodules and 93.6% (365/390) benign cases diagnosed by the four TIRADS simultaneously were pathologically confirmed, whereas 8.3% (66/795) of nodules could not be correctly diagnosed by any of the TIRADS, and 12.0% (95/795) nodules could not be consistently diagnosed by all the four TIRADS.

Fig. 2 Frequency distribution plot of the different TIRADS in assessing pathologically benign and malignant nodules. The malignant and benign results at the optimal cut-off points are defined as follows. For ACR-TIRADS (optimal cut-off point category 5), category 5 indicates diagnostic malignant nodules, whereas categories 1 to 4 indicate diagnostic benign nodules. For EU-TIRADS (optimal cut-off point category 5), category 5 indicates diagnostic malignant nodules, whereas categories 1 to 4 indicate diagnostic benign nodules. For C-TIRADS (optimal cut-off point category 4b), categories 4b, 4c, and 5 indicate diagnostic malignant nodules, whereas categories 1 to 4a indicate diagnostic benign nodules. For Kwak-TIRADS (optimal cut-off point category 4c), categories 4c and 5 indicate diagnostic malignant nodules, whereas categories 1 to 4b indicate diagnostic benign nodules
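The dichotomization given in the legend of Fig. 2, and the check for whether the four TIRADS agree on a nodule, can be sketched in a few lines of Python. This is only an illustrative sketch: the short system keys, the string labels for categories, and the example nodules are assumptions introduced here, not part of any TIRADS lexicon or of the study data.

```python
# Categories counted as a malignant call at each system's optimal cut-off,
# as listed in the legend of Fig. 2. Keys and labels are illustrative only.
MALIGNANT_CATEGORIES = {
    "ACR": {"5"},
    "EU": {"5"},
    "C": {"4b", "4c", "5"},
    "Kwak": {"4c", "5"},
}


def malignant_calls(nodule):
    """Map each TIRADS to True (malignant call) or False (benign call)."""
    return {
        system: nodule[system] in categories
        for system, categories in MALIGNANT_CATEGORIES.items()
    }


def is_consistent(nodule):
    """True when all four TIRADS give the same call for this nodule."""
    return len(set(malignant_calls(nodule).values())) == 1


# Hypothetical nodules (NOT study data).
agreeing = {"ACR": "5", "EU": "5", "C": "5", "Kwak": "5"}
disagreeing = {"ACR": "4", "EU": "5", "C": "4b", "Kwak": "4b"}

print(is_consistent(agreeing), is_consistent(disagreeing))
```

Nodules for which `is_consistent` is false correspond to the inconsistently diagnosed subgroup that the later analysis focuses on.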

Discriminant strategy

Most of the 95 nodules were solid and wider-than-tall, without calcifications, regardless of pathological diagnosis. Only a small percentage of nodules showed taller-than-wide shape (1.1%) or microcalcification (8.4%). As for echogenic features, benign nodules were predominantly iso/hyperechoic (70.9%), while malignant ones were predominantly hypoechoic (85.0%). As for margin features, a well-circumscribed margin was predominant in benign nodules (56.4%), and a lobulated or irregular margin was predominant in malignant cases (77.5%) (Supplementary Table 5).

Stepwise discriminant analysis was used to identify the variables that best distinguished benign from malignant nodules. Five commonly used variables were included as predictor variables for malignant thyroid nodules. The stepwise procedure retained only the echogenicity variable (F = 34.87, p < 0.001). The discriminant function was D = 2.368 × echogenicity − 1.421. This discriminant function was statistically significant (Wilks’ lambda = 0.74, p < 0.001) and correctly predicted the classification of 79.0% of cases.

According to the above discriminant function, the discriminant strategy (DS) based on nodular features was as follows: Iso- or hyper-echogenicity nodules should be considered benign. Hypo- or marked hypo-echogenicity nodules should be regarded as malignant. The diagnostic results of this strategy remained consistent with the above prediction results (Table 2).
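The discriminant strategy stated above reduces to a simple rule on echogenicity, which can be sketched as follows. The label strings and function names are hypothetical. The numeric coding of echogenicity used in the discriminant function is not stated in the text, so the 0/1 coding below (0 for iso- or hyper-echogenicity, 1 for hypo- or marked hypo-echogenicity) is an assumption made purely for illustration.

```python
# Echogenicity labels are illustrative strings, not a formal lexicon.
BENIGN_ECHO = {"isoechoic", "hyperechoic"}
MALIGNANT_ECHO = {"hypoechoic", "markedly hypoechoic"}


def ds_call(echogenicity):
    """Discriminant strategy (DS): classify a nodule from echogenicity alone."""
    if echogenicity in MALIGNANT_ECHO:
        return "malignant"
    if echogenicity in BENIGN_ECHO:
        return "benign"
    raise ValueError(f"unknown echogenicity label: {echogenicity!r}")


def discriminant_score(echo_code):
    """Reported function 2.368 x echogenicity - 1.421.

    ASSUMPTION: echo_code is 0 (iso/hyper) or 1 (hypo/marked hypo);
    the source does not state how echogenicity was coded.
    """
    return 2.368 * echo_code - 1.421


for label in ("isoechoic", "hypoechoic"):
    print(label, ds_call(label))
```

Under the assumed coding, the score is negative for iso- or hyper-echoic nodules and positive for hypoechoic ones, which matches the direction of the stated rule.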

Table 2 Classification results of discriminant analysis and criteria for differentiating nodules subgroups

Performance characteristics

For the 95 inconsistently diagnosed nodules screened by at least two TIRADS, DS performed best with an accuracy of 79.0%, followed by Kwak-TIRADS (72.6%) (Fig. 3). Table 3 examines the combination modes of the various TIRADS and DS for these inconsistently diagnosed nodules. Combining DS and ACR-TIRADS in parallel significantly increased accuracy (from 61.1 to 80.0%) and AUC (0.817 vs. 0.538) compared with ACR-TIRADS alone. A serial test combining DS and C-TIRADS also sharply increased accuracy (from 47.3 to 76.8%) and significantly improved AUC (0.776 vs. 0.535) compared with C-TIRADS alone. With either combined test, the AUC of DS combined with EU-TIRADS was substantially higher than that of EU-TIRADS alone (0.700 vs. 0.637, 0.742 vs. 0.637), but the serial test may be preferred because of its higher AUC and more balanced sensitivity and specificity.
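The parallel and serial combinations, as defined in the legend of Fig. 3, amount to a logical OR and a logical AND of two malignant calls. The sketch below illustrates this on invented data; the function names and the toy calls are assumptions for this example and do not reproduce the study results.

```python
def parallel_test(call_a, call_b):
    """Malignant if either test calls malignant; benign only if both call benign."""
    return call_a or call_b


def serial_test(call_a, call_b):
    """Malignant only if both tests call malignant; benign if either calls benign."""
    return call_a and call_b


def accuracy(calls, truth):
    """Fraction of nodules whose call matches the pathological diagnosis."""
    return sum(c == t for c, t in zip(calls, truth)) / len(truth)


# Hypothetical toy data (NOT study data): True means malignant.
truth = [True, True, False, False, True]
tirads_calls = [False, True, False, True, False]
ds_calls = [True, True, False, False, False]

parallel_calls = [parallel_test(a, b) for a, b in zip(tirads_calls, ds_calls)]
serial_calls = [serial_test(a, b) for a, b in zip(tirads_calls, ds_calls)]

print(parallel_calls, accuracy(parallel_calls, truth))
print(serial_calls, accuracy(serial_calls, truth))
```

By construction, the parallel test can only add malignant calls (favoring sensitivity), and the serial test can only remove them (favoring specificity).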

Fig. 3 Summary of methods and strategies included in the analysis. SP: Screening procedures; DS: Discriminant strategy; SP+DS: evaluation method consisting of the four TIRADS screening procedures, with partially inconsistently diagnosed nodules judged by the discriminant strategy; SP+A/C/K/E+DS: evaluation method consisting of the four TIRADS screening procedures, with partially inconsistently diagnosed nodules judged by ACR-TIRADS/C-TIRADS/Kwak-TIRADS/EU-TIRADS combined with the discriminant strategy. In the parallel test, a nodule is defined as benign only when both tests diagnose it as benign, and as malignant when either test diagnoses it as malignant. In the serial test, a nodule is defined as malignant only when both tests diagnose it as malignant, and as benign when either test diagnoses it as benign. The dotted line indicates that the two longitudinal TIRADS, or a TIRADS and the discriminant strategy (DS), are combined in parallel or serial. The numbers at the bottom of the pie charts represent accuracy

Table 3 Diagnostic performance of different TIRADS combined with discriminant strategy using parallel or serial tests on partially inconsistently diagnosed nodules subgroups

Table 4 shows the diagnostic performance of assessment methods built from the screening process, the DS, and combined tests in the overall sample. The sensitivity and AUC were highest for the SP+DS method compared to the four TIRADS (91.3%, 0.895). The specificity was highest for ACR-TIRADS (91.1%), followed by Kwak-TIRADS and the SP+DS method, with no significant difference between them (88.5% vs. 87.6%, p > 0.05). When evaluating new methods including combined tests, the sensitivity and AUC were highest for the SP+A+DS method (Parallel mode) (91.6%, 0.896), while the specificity and AUC were highest for the SP+C+DS method (Serial mode) (87.9%, 0.891). For a total of 31 initial Bethesda III and IV nodules (3.9%, 31/795), of which 17 (54.8%) were pathologically malignant and 14 (45.2%) benign, the frequency of correct diagnosis was highest for the SP+DS method and C-TIRADS (both were 20/31), followed by Kwak-TIRADS (19/31) (Supplementary Table 6).

Table 4 Diagnostic performance of evaluation methods consisting of the discriminant strategy alone or combined with the four TIRADS using serial or parallel tests after the screening procedures

We further examined the performance of one TIRADS combined with another TIRADS or DS in the overall sample (Table 5). Despite a decrease in specificity (from 91.1% to 88.5%), combining ACR-TIRADS and Kwak-TIRADS via parallel test resulted in significant improvements in the sensitivity and AUC compared to ACR-TIRADS (89.2% vs. 81.4%, 0.889 vs. 0.863). Although the p-value for the sensitivity was at the boundary of statistical significance (from 91.0 to 92.5%, p = 0.053), combining EU-TIRADS and DS via parallel test resulted in a significant improvement in AUC (from 0.875 to 0.882, p = 0.0245). There are three ways to improve the specificity of C-TIRADS: combining it with Kwak-TIRADS, EU-TIRADS, or DS. Combining C-TIRADS and DS resulted in the highest AUC (0.887, p = 0.0013), followed by Kwak-TIRADS (0.884, p = 0.0062), while the lowest AUC was with EU-TIRADS (0.879, p = 0.0064).

Table 5 Diagnostic performance of the discriminant strategy combined with TIRADS or two combined TIRADS using parallel or serial combination strategies

Figure 4 illustrates recommended strategies to improve ultrasound accuracy based on this article’s findings. For suspicious or indeterminate nodules, we recommend using combined tests of two TIRADS or one TIRADS combined with DS, although Kwak-TIRADS can be used alone. If two TIRADS give inconsistent results, we recommend using the DS directly for judgment.

Fig. 4 Summary of recommended strategies to improve ultrasound accuracy based on this article’s findings. ACR: ACR-TIRADS, EU: EU-TIRADS, C: C-TIRADS, Kwak: Kwak-TIRADS, DS: Discriminant strategy, Hpo/M: Nodules with hypo- or marked hypo-echogenicity, non-Hpo/M: Nodules with iso- or hyper-echogenicity. The best diagnostic cut-off values of ACR-TIRADS, Kwak-TIRADS, C-TIRADS, and EU-TIRADS are TR5, 4c, CTR 4b, and grade 5 in this article. The numbers at the bottom of the pie charts represent accuracy

Discussion

Our study compared the diagnostic performance of the four TIRADS and showed that all four have good diagnostic performance. The four TIRADS screening procedures resulted in 12.0% inconsistently diagnosed nodules. We then tested a strategy focusing on this subgroup of nodules to establish methods for improving diagnostic accuracy. The results showed that criteria based on the single variable of echogenicity reproduced the discriminant results with an accuracy of 79.0%, ahead of Kwak-TIRADS (72.6%). The diagnostic performance of the SP+DS method was significantly higher than that of the four TIRADS. In particular, the four TIRADS can substantially improve the diagnostic results of partially inconsistently diagnosed nodules when combined with DS, thus improving the diagnostic performance of the constructed methods (including SP+A+DS, SP+K+DS, SP+E+DS, and SP+C+DS). When the DS was applied to the overall sample, significant improvements in the diagnostic performance of C-TIRADS and EU-TIRADS could be obtained by combining tests. Two-TIRADS parallel or serial tests can also help improve the diagnostic performance of ACR-TIRADS and C-TIRADS by combining them with Kwak-TIRADS.

In this study, the sensitivity of the four TIRADS ranged from 81.4 to 91.6%, specificity from 80.9 to 91.1%, and AUC 0.863 to 0.884, which indicates that all TIRADS have a good diagnostic performance. C-TIRADS has the highest sensitivity, while ACR-TIRADS has the highest specificity, consistent with the results of previous studies [16,17,18]. The classification screening results corroborated our hypothesis that the partially inconsistently diagnosed results of the four TIRADS for some nodules are precisely the reason for their differential diagnostic performance.

Without adding other new indicators, we used discriminant analysis to screen out a single predictor, echogenicity. The resulting SP+DS method achieved better diagnostic performance. One could argue that these malignant indicators depend on the probability distribution of the sample. It must be noted that the nodules remaining after screening by the four TIRADS, those with partially inconsistent diagnoses, are less likely to show highly malignant features such as taller-than-wide shape, microcalcifications, and infiltrative margins. On the contrary, most of them show less highly malignant features such as solid composition, hypo-echogenicity, irregular margin, and macrocalcifications. Moreover, nodules with highly malignant features often have multiple malignant features simultaneously and are more likely to be correctly diagnosed by various TIRADS [4, 6, 7, 12].

The new evaluation methods have their advantages. The consistent diagnosis of the four TIRADS can provide immediate feedback to increase confidence in confirming the diagnosis. What is more, the false-positive or false-negative rate could be effectively controlled and balanced, reducing the rate of unnecessary punctures and improving diagnostic accuracy, which are currently two essential goals in nodular diagnosis [19]. In addition, the DS is simple and easy to master. For ease of use, we recommend that in practice the four TIRADS screening procedures be carried out with the help of structured forms or designed procedures. Most importantly, the diagnostic performance of the SP+DS method and the other SP-based methods was improved compared with the four TIRADS. As for the Bethesda III and IV nodules, the correct diagnostic frequency of the SP+DS method was even the highest. However, due to the limited sample size, the latter conclusion must be confirmed in future studies.

We further explored the clinical applicability of the DS and extended it to the overall sample. The results showed that the combined modes between the DS and the four TIRADS differed in the partially inconsistently diagnosed nodules. Those TIRADS with high sensitivity, such as EU-TIRADS and C-TIRADS, are suited to the serial test to improve specificity, while ACR-TIRADS and Kwak-TIRADS, with better specificity, are suited to the parallel test. These results suggest that the way the DS is applied depends on the different characteristics of each TIRADS [19, 20]. The situation seems to be somewhat different in the overall sample. For example, EU-TIRADS, despite its high sensitivity, seemed able to improve sensitivity further through the parallel test. It should be noted that the various TIRADS combine with the DS in different ways, which may also reflect differences in the weighting of echo characteristics in the various TIRADS. For C-TIRADS, both in partially inconsistently diagnosed nodules and in the overall sample, C-TIRADS and DS are combined using a serial test to improve specificity. This may be attributed to the fact that hypo-echogenicity is not a highly malignant risk feature in its lexicon, and that many nodules with ill-defined margin diagnosed as malignant by C-TIRADS were correctly diagnosed as benign under the serial strategy [12, 17].

Notably, in our study, Kwak-TIRADS has a good balance in terms of sensitivity and specificity, which is consistent with previous studies [20, 21]. For partially inconsistently diagnosed nodules, Kwak-TIRADS also exhibited the best performance besides the SP+DS method. In the overall sample, ACR-TIRADS in parallel combined with Kwak-TIRADS reduced the false negative rate, and C-TIRADS in serial combined with Kwak-TIRADS reduced the false positive rate. However, although C-TIRADS can be combined with Kwak-TIRADS to improve specificity, the accuracy is the same as Kwak-TIRADS (both 88.4%), so it is recommended to use Kwak-TIRADS directly. Taken together, it may not be necessary to combine another strategy to achieve better diagnostic performance for Kwak-TIRADS.

In the clinical setting, two or more evaluation systems are usually considered for suspicious or indeterminate nodules with few highly malignant features [22]. Considering that the customary use of TIRADS may differ among institutions or individuals, diagnostic combinations that are both accurate and clinically significant need to be examined. Based on the results of this study, if the diagnosis of a nodule is suspicious or uncertain, two TIRADS or the combination of one TIRADS with DS, using a parallel or serial test, can be considered to help improve the accuracy, whereas Kwak-TIRADS can be used directly without a combined test. Alternatively, the diagnostic consistency of two TIRADS at the optimal cut point can be examined. If the results of the two selected TIRADS are inconsistent, considering the time cost of the screening process, it is suggested to use the DS directly, which can significantly improve the accuracy.

ACR-TIRADS is a commonly used TIRADS with high specificity, effectively reducing unnecessary FNA rates [17, 18, 23]. But false negatives are a concern. According to the results of this study, we suggest using a parallel test combined with Kwak-TIRADS for judgment to obtain a balance of sensitivity and specificity. However, as with ACR-TIRADS, there seems to be value in the uneven diagnostic performance of TIRADS [19]. Despite using the serial strategy, the specificity of some new methods does not seem to exceed that of ACR-TIRADS. On the contrary, with the combining strategy, this study has obtained multiple sets of assessment methods with a balance of sensitivity and specificity, and even some methods that can enhance sensitivity, e.g., A-DS and K-DS. Whether these methods have clinical application needs to be further investigated.

Our study has some limitations. First, all patients in this study with malignant thyroid tumors were confined to PTC. Whether the conclusion of this study applies to other thyroid malignant tumors needs to be verified. Second, the selection of optimal cut-off points for TIRADS, especially considering the balance of sensitivity and specificity, might affect the results of this study. However, the cut-off point of each TIRADS is relatively stable. The ACR-TIRADS and EU-TIRADS are mostly set at category 4 or 5, while the Kwak-TIRADS is set chiefly at category 4c to maximize the balance of sensitivity and specificity and ensure the accuracy of diagnosis [19, 24,25,26]. As mentioned above, the screening procedures exclude almost all nodules with multiple highly malignant features, so it can be predicted that, even though samples might be different, similar criteria may still be obtained according to this research strategy. Further research with larger samples or other thyroid carcinomas is needed to confirm the above hypothesis. In addition, it must be acknowledged that categories below the optimal cut point also carry a risk of malignancy, and it is still necessary to consider whether to carry out an FNA examination based on the size of nodules, personal or family history of cancer, and changes in nodules during the follow-up period. Third, as mentioned in the “FNA, cytopathology and histopathology” section, repeated FNAs are not routinely performed in our institution, which may cause selection bias. What is more, there is still a 1–2% false-negative rate based on repeated results, which might overestimate benign nodules and affect the diagnostic performance of each TIRADS. Further study could use benign surgical pathology results to exclude this potential bias [1, 15, 27].

In conclusion, the various TIRADS have good diagnostic performance, but how to further improve the diagnostic accuracy is a question worth exploring. This study is the first to analyze and compare in detail the misdiagnosed and missed cases of different TIRADS. We explored new methods without additional diagnostic indicators and achieved an effective improvement in accuracy. The recommended strategies our findings provide may help to improve the diagnostic accuracy for nodules that are uncertain or suspicious on ultrasound.