Introduction

Up to 35% of fine-needle biopsy (FNB) procedures on thyroid nodules produce an indeterminate result [1, 2], which includes Bethesda classification categories III and IV: atypia of undetermined significance/follicular lesion of undetermined significance (AUS/FLUS) and follicular neoplasm/suspicious for follicular neoplasm (FN/SFN) [2]. In the USA, most patients with indeterminate FNB results, especially those with malignant features, are referred for surgery [3]. However, the reported malignancy rate in indeterminate nodules ranges from 15 to 54% [1, 2]. This means that many patients with indeterminate cytological results are exposed to potentially unnecessary surgery, with its associated risks of mortality and complications.

The 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer recommend the use of molecular testing to support malignancy risk assessment in thyroid nodules with indeterminate cytology results [3]. The two most common genetic tests used as preoperative molecular markers in this context are ThyroSeq, version 2 (ThyroSeq), a seven-gene panel of genetic mutations and rearrangements [4], and a gene expression classifier (GEC) testing mRNA expression of 167 genes [5]. Several studies have assessed these panels with various degrees of outcome blinding. The negative predictive value (NPV) and positive predictive value (PPV) appear strictly linked to the pretest prevalence of malignancy in the population being studied [3]. Furthermore, there are considerable discrepancies between studies in the FNB categories of nodules assessed by genetic testing. Moreover, post-surgical follow-up, which allows test reliability to be assessed against histopathological results, is available for only part of the results. Therefore, a wide inter-institutional variation in the performance of a molecular classifier for indeterminate thyroid nodules may be observed, depending on an institution’s practice and patients’ characteristics [6]. As a consequence, the results of single-center prospective trials may not be generalizable.

To provide more reliable data on the real-world performance of the aforementioned tests, the present study aimed to perform a meta-analysis of aggregate study data on the performance of the 167-gene Gene Expression Classifier (GEC) and the ThyroSeq v2 test for the preoperative assessment of thyroid nodules with an indeterminate FNB result.

Materials and Methods

Search Strategy

Our search strategy followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines [7]. We searched PubMed/MEDLINE, Cochrane Library, Scopus, CINAHL, Academic Search Complete, Web of Knowledge, PubMed Central, PubMed Central Canada, and Clinical Key databases from January 2001 up to April 2018 to find all relevant, full-text journal articles written in English. The search strategy included Medical Subject Headings terms and keywords: “thyroid and (“follicular cancer” or “follicular carcinoma” or “follicular neoplasm” or “follicular adenoma” or “follicular nodule”) and (“genetic” or “genetics” or “molecular” or “mutation” or “gene” or “genes”) or (“ThyroSeq v2”) or (“thyroid” and “Gene Expression Classifier”)”. Reference lists of all the selected articles, previous reviews, and meta-analyses were hand-searched for any additional articles.

Data Extraction

Two authors (MB and MO) independently selected trials which fulfilled the inclusion criteria and extracted data for the outcomes using a standardized data extraction form. Articles were included in the meta-analysis if they compared the number of malignant and benign nodules, based on the conclusive histopathological result, among lesions previously assessed as Bethesda category III or IV on thyroid FNB and classified as benign or suspicious by GEC or as positive by ThyroSeq. We excluded cases without a post-surgical diagnosis, as well as papers assessing only the AUS category (without FLUS).

Another author (ESP) rechecked the extracted data. We used data from the analysis of available papers or the accompanying illustrations.

Assessment of Methodological Quality

The risk of bias in the included studies was independently assessed by two authors (MB and MO), in accordance with the Cochrane risk of bias tool [8]. It assessed factors such as sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessment, incomplete outcome data, and selective outcome reporting. As recommended for diagnostic test accuracy studies, the revised Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool was also used [9]. It covers patient selection, the index test, the reference standard, and the flow of patients through the study together with the timing of the index test(s) and the reference standard (“flow and timing”). The domains were assessed in terms of the risk of bias and concerns regarding applicability. A third author (ESP) acted as adjudicator when no consensus was achieved. All included studies were also assessed using the Newcastle-Ottawa Scale [10]. Studies scoring seven stars or more were included.

Statistical Analysis

p values < 0.05 were considered to indicate statistical significance.

The authors referred to the Cochrane Handbook for Systematic Reviews of Interventions [8] and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. For every study, the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) results was noted. Univariate estimates of sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV), with 95% confidence intervals based on the exact binomial Clopper-Pearson method, were calculated and presented as forest plots. In addition, we summarized the obtained results by determining the proportions describing pooled accuracy, assuming the lack of asymmetry according to Egger’s test.
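The per-study estimates described above can be illustrated with a minimal sketch. It assumes a 2 × 2 table of counts per study and computes the exact Clopper-Pearson interval by bisection on the binomial distribution, using only the Python standard library. The function names and the example counts are hypothetical and are not taken from any included study; the analyses in this paper were run in STATA and R, not with this code.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for the proportion x / n."""
    # Lower bound: the p at which P(X >= x | p) equals alpha / 2.
    lower = 0.0 if x == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)))
    # Upper bound: the p at which P(X <= x | p) equals alpha / 2.
    upper = 1.0 if x == n else _bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper


def diagnostic_summary(tp, fp, fn, tn):
    """Point estimates and exact 95% CIs; assumes every denominator is nonzero."""
    pairs = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "PPV": (tp, tp + fp),
        "NPV": (tn, tn + fn),
    }
    return {name: (x / n, *clopper_pearson(x, n)) for name, (x, n) in pairs.items()}


# Hypothetical counts for a single study (illustration only).
for name, (est, lo, hi) in diagnostic_summary(tp=45, fp=30, fn=5, tn=20).items():
    print(f"{name}: {est:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Bisection is used here only to avoid an external dependency; a statistics library that exposes beta-distribution quantiles would give the same bounds more directly.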

The meta-analysis focused on two hierarchical logistic regression models currently recommended by the Cochrane Collaboration [11]: the bivariate model and the hierarchical summary ROC (HSROC) model. Bivariate meta-analyses jointly model both sensitivity and specificity. Based on these, the negative likelihood ratio (LR−), positive likelihood ratio (LR+), and diagnostic odds ratio (DOR) were calculated. The receiver operating characteristic (ROC) curve and estimated summary parameters were presented using the HSROC model. Heterogeneity, which arises from differences between the results achieved by comparable diagnostic methods, was investigated by adding the method of preoperative assessment of indeterminate thyroid nodules (GEC or ThyroSeq) as a covariate to the bivariate meta-regression hierarchical model. Using this model, we compared summary estimates of sensitivity and specificity. Summary estimates of combined NPV and PPV were obtained on the basis of the same bivariate meta-analysis model that was used for sensitivity and specificity [12].
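The relationship between these summary measures can be sketched as follows. This is a minimal illustration of the standard point formulas only, not of the bivariate model itself, and the function name is hypothetical. The inputs are the rounded pooled sensitivity and specificity reported in the Results, so the output will differ somewhat from the model-based values in Table 3, which are derived from unrounded, jointly modelled estimates.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-, DOR) from a sensitivity/specificity pair.

    Assumes 0 < specificity <= 1 and sensitivity < 1 so that no
    denominator is zero.
    """
    lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
    dor = lr_pos / lr_neg                      # diagnostic odds ratio
    return lr_pos, lr_neg, dor


# Rounded pooled estimates as reported in the Results section.
pooled = {"GEC": (0.98, 0.12), "ThyroSeq": (0.84, 0.78)}

for test, (sens, spec) in pooled.items():
    lr_pos, lr_neg, dor = likelihood_ratios(sens, spec)
    print(f"{test}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}, DOR = {dor:.2f}")
```

An LR+ close to 1 means that a positive result barely changes the pretest odds of malignancy, which is why a test with very high sensitivity but low specificity contributes little as a rule-in test.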

The bivariate/HSROC analysis was performed using STATA version 14.2 (StataCorp LP, USA). To add covariates and compare GEC and ThyroSeq in the meta-regression of the diagnostic accuracy model, we used R, version 3.4.2 (CRAN).

Results

After a complete systematic review was performed, 16 studies met the inclusion criteria. The search results and steps of selection are shown in the flowchart (Fig. 1).

Fig. 1

Methodological flow diagram of the meta-analysis for the Gene Expression Classifier test

The specificity for individual studies based on the GEC varied from 0 to 52%, and the sensitivity from 83 to 100%. For nodules screened with the ThyroSeq method, specificity varied from 45 to 96%, while sensitivity ranged from 70 to 91% (Tables 1 and 2).

Table 1 The number of patients with false negative (FN), false positive (FP), true negative (TN), and true positive (TP) results, with forest plots presenting the point estimates of sensitivity and specificity and 95% confidence intervals of each study across two diagnostic tests: GEC and ThyroSeq
Table 2 The results of positive predictive value (PPV) and negative predictive value (NPV), with forest plots presenting the point estimates and 95% confidence intervals of each study across two diagnostic tests: GEC and ThyroSeq

Table 3 presents the pooled estimation of sensitivity, specificity, LR+, LR−, and DOR obtained from the bivariate model for both methods.

Table 3 The results of bivariate summary estimates of sensitivity, specificity, positive likelihood ratio (LR+), negative likelihood ratio (LR−), and diagnostic odds ratio (DOR) for the GEC and ThyroSeq tests

The GEC studies were characterized by a significantly higher sensitivity (98%, 95%CI 96–99% vs. 84%, 95%CI 74–91%, p = 0.003). In contrast, a significantly higher specificity was found for the ThyroSeq method (78%, 95%CI 50–92% vs. 12%, 95%CI 8–20%, p < 0.001). Moreover, the combined specificity and sensitivity, as well as the calculated LR+, confirm the advantage of the ThyroSeq method. For ThyroSeq, a positive test result is almost four times more likely in a patient with malignancy than in a patient without malignancy (LR+ = 3.79, 95%CI 1.40–10.27). For the GEC method, the same ratio was much lower, at 1.12 (95%CI 1.05–1.20). However, the proportion of patients with malignancy who are correctly identified was higher (p = 0.003) for the GEC method (sensitivity 0.98, 95%CI 0.96–0.99) than for ThyroSeq (0.84, 95%CI 0.74–0.91). The value of LR− was similar for both methods, and the DOR was slightly higher for ThyroSeq. With ThyroSeq, the odds of a positive test result are almost 20 times higher in patients with malignancy than in disease-free individuals (DOR = 19.86), whereas the diagnostic odds ratio observed for GEC is lower (DOR = 8.87).

Figure 2 shows the bivariate summary estimates of sensitivity and specificity for both the GEC and the ThyroSeq malignancy screening methods, together with their corresponding 95% credibility ellipses represented in the ROC space. The ellipses indicate the area that may contain the actual values of the average test sensitivity and specificity for each screening method. We observed a clear difference in both sensitivity and specificity between ThyroSeq and the GEC algorithm. The closer the summary point lies to the upper left corner of the plot (where sensitivity and specificity are both close to 100%), the more accurate the diagnostic test. In our study, the summary point of specificity and sensitivity for ThyroSeq lies much closer to this point of maximal specificity and sensitivity than the summary point for GEC.

Fig. 2

Bivariate summary estimates of sensitivity and specificity for diagnostic tests (GEC, ThyroSeq), hierarchical summary receiver operating characteristic, and the corresponding 95% credibility ellipse around the mean values for a ThyroSeq and b GEC

The PPV for individual studies based on the GEC varied from 0.33 to 0.86, and the NPV from 0.00 to 1.00. For nodules screened with the ThyroSeq method, the PPV varied from 0.27 to 0.89, and the NPV from 0.88 to 0.97 (Table 2). PPV and NPV were compared using bivariate meta-analysis (Table 4) and did not differ significantly between GEC and ThyroSeq (PPV 0.45 vs. 0.58 and NPV 0.91 vs. 0.93, respectively; p values > 0.05 in both cases). However, pooled accuracy (Table 5) was significantly higher for ThyroSeq (p = 0.015). The significance of the latter comparison increased after exclusion of the results influencing asymmetry according to Egger’s test (p < 0.0001) [31].

Table 4 Bivariate summary estimates of PPV and NPV and the results of a bivariate meta-regression hierarchical model
Table 5 Calculation of pooled accuracy

Discussion

To the best of our knowledge, the present study is the first to provide (a) an aggregate analysis of performance data of two different genetic analysis–based methods for the assessment of malignancy of thyroid nodules with an indeterminate cytology and (b) a direct comparison of key performance measures of these tests. Although both tests have their advantages, the aggregate performance indicators on balance indicate that ThyroSeq v2 may be preferable for clinical practice.

Many molecular markers have been proposed to improve FNB-based diagnosis of thyroid nodules; however, only a small minority demonstrated the potential to modify clinical decision-making and thus impact care [14]. The ATA guidelines’ authors suggest that an ideal “rule-in” and “rule-out” test for the assessment of malignancy of cytologically indeterminate nodules should have a PPV for histopathologically proven malignancy similar to a malignant cytological diagnosis (98.6%) and an NPV similar to a benign cytological diagnosis (96.3%) [3] with estimates based on a recent meta-analysis of performance of the Bethesda system [1]. Consequently, the “ideal rule-out test” would have a high sensitivity and high NPV, whereas the “ideal rule-in test” would have a high specificity and PPV [16].

Alexander et al. demonstrated that the GEC result of a nodule classified as “benign” had an NPV of 95% for nodules previously categorized cytologically as AUS/FLUS and an NPV of 94% for aspirates that were classified as FN/SFN. In the first clinical study of preoperative use of GEC, sensitivity with regard to malignancy for the classification of indeterminate nodules was 90% [5]. The advantages of the test were therefore both a high NPV and a high sensitivity. This result was also validated in a blinded multicenter prospective trial [14]. Our meta-analysis pooled data from 1086 nodules with histopathologic confirmation from multiple studies and showed that GEC indeed has a high sensitivity of 98% and a high NPV of 91%. However, the limitations of the test were its specificity of 12% and a low PPV of 45%, meaning that on aggregate, a “suspicious” result signifies a 45% risk of malignancy. Considering these performance indicators, the GEC can only be considered valuable as a “rule-out test.”
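Because predictive values depend on the pretest prevalence of malignancy, the same test can yield quite different PPV and NPV in different populations. The following minimal sketch applies Bayes' rule to the rounded pooled sensitivity and specificity of the GEC reported above. The function name is hypothetical, and the three prevalence values are illustrative: 0.15 and 0.54 are the ends of the malignancy rate range cited in the Introduction, and 0.30 is an arbitrary intermediate value. This is not a reanalysis of the included studies.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given pretest prevalence via Bayes' rule."""
    tp = sensitivity * prevalence                # malignant, test positive
    fn = (1 - sensitivity) * prevalence          # malignant, test negative
    tn = specificity * (1 - prevalence)          # benign, test negative
    fp = (1 - specificity) * (1 - prevalence)    # benign, test positive
    return tp / (tp + fp), tn / (tn + fn)


# Rounded pooled GEC estimates from this meta-analysis.
gec_sensitivity, gec_specificity = 0.98, 0.12

for prevalence in (0.15, 0.30, 0.54):
    ppv, npv = predictive_values(gec_sensitivity, gec_specificity, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

As prevalence rises, PPV increases and NPV decreases, which is one reason why predictive values reported by a single center may not transfer to an institution with a different case mix.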

Next-generation sequencing (NGS) technology allows high-throughput genomic analysis. A custom NGS panel called ThyroSeq (initial version) was designed to target 12 cancer genes with 284 mutational hotspots for the detection of mutations in thyroid cancer in fine-needle aspiration and tissue samples [17]. This method was superseded by a second, improved version of the assay (ThyroSeq v2) [27]. This version, analyzed in the present study, was designed to detect mutations in > 1000 hotspots of 14 thyroid cancer–related genes (AKT1, BRAF, CTNNB1, GNAS, HRAS, KRAS, NRAS, PIK3CA, PTEN, RET, TP53, TSHR, TERT, and EIF1AX) and 42 types of gene fusions or rearrangements known to occur in thyroid cancer (RET, PPARG, NTRK1, NTRK3, BRAF, and ALK) [27]. The results of the studies of the ThyroSeq v2.1 multigene NGS panel of molecular markers demonstrate that it provides both high sensitivity and high specificity for cancer detection in thyroid nodules with AUS/FLUS cytology, which should allow improved management for these patients [28]. However, real-world experience with the new test is limited to date.

In our analysis, we pooled data from 459 nodules and found both an acceptable sensitivity and specificity, as well as a higher PPV than the GEC and a marginally higher NPV of 93% (89–97%). This is also reflected in the significantly higher overall accuracy found for ThyroSeq compared with GEC. The combination of all-round acceptable parameters of diagnostic performance suggests that ThyroSeq v2 can be used as both a “rule-in” and a “rule-out” test. However, the interpretation of the test still seems to be underdeveloped, as cutoff points for the recognition of individual mutations and standard operating procedures still leave room for further study. Therefore, the European Thyroid Association thus far suggests using this test for scientific purposes, but not yet in the clinical setting [32].

The main limitation of the present study is the scarce source material of four publications for ThyroSeq v2, which results in a high heterogeneity of research and publication bias. A more reliable summary may be possible as more such studies become available. Possibly as a consequence of the limited number of studies compared with the number available for the GEC, a higher heterogeneity with a larger ellipse representing the confidence region was observed for ThyroSeq.

Numerous precautions were taken to minimize potential bias in the study. Bivariate/HSROC meta-analysis is the most rigorous statistical method in the field of diagnostic meta-analysis. This is due to its ability to model the binomial structure of the data within each study while taking into account the heterogeneity between the studies [33]. According to the guidelines for meta-analysis of diagnostic test accuracy studies described by the European Network for Health Technology Assessment [34] and the recommendations given by the Cochrane Collaboration [11], the bivariate/HSROC method must be used as the standard, together with an analysis of summary ROC curves.

The comparison of the obtained summary results for GEC and ThyroSeq shows a significant difference between the values of sensitivity and specificity of the studies. The GEC studies are characterized by a significantly higher sensitivity, while a significantly higher specificity is observed in the ThyroSeq group. Therefore, the superiority of the GEC test lies in the “ruling-out” of thyroid malignancy in the preoperative assessment of indeterminate thyroid nodules. In contrast, ThyroSeq is characterized by a better performance in the “ruling-in” of malignancy. Nonetheless, pooled sensitivity, specificity, and accuracy as obtained using bivariate/HSROC analysis showed on aggregate a clear dominance of the ThyroSeq method.

Although molecular testing of cytology specimens from thyroid nodules has the potential to play a major role in the evaluation of indeterminate thyroid lesions, to date, its performance is still not fully satisfactory. The ideal test, combining high sensitivity and NPV with high specificity and PPV and thus serving as both a “ruling-in” and a “ruling-out” test, still needs to be developed. Furthermore, such an ideal test would also be able to stratify prognostic risk based on the mutation, thereby providing a direction for therapy.

Recently, a new version of the ThyroSeq panel—ThyroSeq v3—has been proposed [35]. This DNA- and RNA-based next-generation sequencing assay analyzes 112 genes for a variety of genetic alterations, including point mutations, insertions/deletions, gene fusions, copy number alterations, and abnormal gene expression, and it uses a genomic classifier (GC) to separate malignant lesions from benign lesions [35]. It is conceivable that this further development of the ThyroSeq test, now being rolled out in clinical practice, has the potential to improve on the already clinically interesting diagnostic performance of the ThyroSeq v2. However, real-life study data are currently still lacking on this test, precluding its inclusion in the present analysis.

Conclusion

The 167-gene GEC, with its high sensitivity and NPV, may be helpful in ruling out malignancy in cases of indeterminate thyroid nodule cytology. ThyroSeq v2, however, has a significantly higher specificity and accuracy with acceptable sensitivity, giving it the potential for use as an all-round test of malignancy of thyroid nodules. Nevertheless, long-term outcome data are necessary before a clear recommendation can be made.