Malignant melanoma accounts for about 5% of all skin cancers [1]. Depending mainly on the initial stage of disease, survival after diagnosis can range from only a few months to many years [2]. Once distant metastases have been diagnosed, median survival in untreated patients amounts to 6 to 9 months [3]. Accurate primary staging is therefore essential for developing an appropriate treatment strategy. Conventional techniques for staging include computed tomography (CT), magnetic resonance imaging, skeletal scintigraphy, and conventional X-ray [4].

Positron emission tomography (PET) is a nuclear-medical imaging method that provides information on the function and metabolism of tissue (metabolic imaging) and is used as a stand-alone procedure (PET) or in combination with CT (PET/CT). A number of international societies have concluded that PET(/CT) is useful for detection of metastasis in patients with American Joint Committee on Cancer (AJCC) stages III and IV [2–5]. A current UK guideline states that there is good evidence against the use of PET(/CT) for detection of metastasis in patients with AJCC stages I and II [6].

This systematic review formed part of a health technology assessment report that will be used to guide national policy on the reimbursement of PET(/CT) in Germany.

The full protocol and report are available on the website of the responsible health technology assessment agency [7, 8], whose tasks and methodological approach are described in its paper on general methods [7, 8]. The main aim of this systematic review was to assess the potential patient-relevant benefit of PET(/CT) in primary staging of malignant melanoma. Our secondary aim was to assess the diagnostic and prognostic accuracy of PET(/CT) in the same indication (detection of regional lymph node metastases and/or distant metastases).

Materials and methods

Search strategy and study selection

For our primary aim (that is, assessment of patient-relevant benefit) we searched for relevant randomized controlled trials (RCTs) investigating at least one predefined patient-relevant outcome (see below). In this context, the term ‘patient-relevant’ refers to how a patient feels, functions or survives. If insufficient evidence was available to answer the primary question, an assessment of the diagnostic and prognostic accuracy of PET(/CT) was performed (secondary aim). For this purpose, a review of reviews was conducted, and where appropriate, supplemented with additional recent primary studies identified by our own update search.

Primary studies were searched for in MEDLINE (1948 to January 2011) and EMBASE (1980 to January 2011) via Ovid, and in the Cochrane Central Register of Controlled Trials. The Cochrane Database of Systematic Reviews, the Database of Abstracts of Reviews of Effects, and the Health Technology Assessment Database were screened to identify systematic reviews. In addition, reference lists of retrieved articles and conference proceedings were searched by hand. Databases of guideline developers were also searched to identify further systematic reviews. Finally, web-based clinical trial registries and trial results databases were screened. The search strategy included bibliographic index terms on melanoma and PET. The full search strategy, which was developed by one information specialist and checked by another, has been described elsewhere [7, 8]. Two reviewers independently screened titles and abstracts of the retrieved citations to identify potentially eligible primary and secondary publications. The full texts of these articles were obtained and independently evaluated by the same two reviewers applying the full set of inclusion and exclusion criteria. Disagreements were resolved by consensus.

Eligibility criteria

For our primary aim, RCTs with the following characteristics were included. First, the RCT investigated patients with malignant melanoma. Second, the trial evaluated at least one of the following predefined patient-relevant outcomes: health-related quality of life, melanoma-related mortality, all-cause mortality, and other adverse events. Third, in the RCT either PET and/or PET/CT were compared with each other or with another diagnostic test (for example, CT), or PET or PET/CT was used to identify patients and assign them to different treatment regimens based on the test result (for example, enrichment design). Finally, there was a full-text document available (no language restrictions).

For the secondary aim, systematic reviews identified in the literature search were evaluated using Oxman and Guyatt’s quality tool [9, 10]. Eligible systematic reviews had to achieve five or more of seven possible points in order to be included in our review. In addition, we conducted an update search for primary studies published up to January 2011 to cover the period not considered by the systematic reviews.

With regard to diagnostic and prognostic accuracy, we extracted and analyzed data from primary studies (identified either by the published high-quality systematic reviews or by our literature search) if the following criteria were fulfilled:

prospective design; patient-based analysis; index test (PET or PET/CT); valid reference standard (histopathology, clinical follow-up >6 months, or a combination of the two) – if histological analyses were not possible (for example, for M-staging), we also accepted studies that only applied clinical-radiologic follow-up; number of patients (at least 10); sufficient data for calculation of 2×2 tables; and full-text document available.

Where the number of studies that investigated a direct comparison of PET or PET/CT with other diagnostic procedures was insufficient, we also included verification of only positive testers design studies and discordance studies. In the verification of only positive testers design, the reference standard is applied only to those patients in whom a suspicious lesion was found by one or both of the index tests. In the discordance design, the reference standard is applied only to those patients in whom one test, but not both tests, was positive (that is, where there is a discordant result).
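The two partial-verification designs described above differ only in which patients receive the reference standard. A minimal sketch of that selection rule follows; the function name, the design labels, and the four-patient cohort are hypothetical and serve illustration only, and the "complete" label (every patient verified) is added here purely for contrast.

```python
def receives_reference_standard(design, test_a_positive, test_b_positive):
    """Return True if a patient would be verified under the given design.

    design: "complete", "positive_testers", or "discordance"
            (labels are illustrative, not taken from the review).
    """
    if design == "complete":
        # Every patient is verified, regardless of index test results.
        return True
    if design == "positive_testers":
        # Verified if one or both index tests found a suspicious lesion.
        return test_a_positive or test_b_positive
    if design == "discordance":
        # Verified only if exactly one of the two tests was positive.
        return test_a_positive != test_b_positive
    raise ValueError(f"unknown design: {design}")


# Hypothetical cohort: (test A positive, test B positive) for each patient.
patients = [(True, True), (True, False), (False, True), (False, False)]

for design in ("complete", "positive_testers", "discordance"):
    verified = sum(receives_reference_standard(design, a, b) for a, b in patients)
    print(f"{design}: {verified} of {len(patients)} patients verified")
```

Because unverified patients contribute no reference result, neither partial design yields a complete 2×2 table on its own.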

Data extraction

The individual steps of the data extraction and risk-of-bias assessment procedures were conducted by one reviewer and checked by another; disagreements were resolved by consensus. As no RCTs and prognostic accuracy studies were identified, no further details of the planned data extraction and risk-of-bias assessment are provided here.

Using standardized tables, information was extracted from each included diagnostic study on: baseline characteristics of the study participants; characteristics of the index test and reference standard; data for constructing 2×2 contingency tables of diagnostic and prognostic accuracy (that is, numbers of true-positive, false-negative, false-positive and true-negative test results); reported sensitivities and specificities of PET(/CT); and risk-of-bias items (see below).

Assessment of risk of bias

The risk of bias in diagnostic accuracy studies identified in the update search was evaluated using a modified version of QUADAS, an evidence-based tool recommended for assessing the methodological quality of test accuracy studies [11]. We modified this tool because, in our opinion, some QUADAS items referred more to external validity than to internal validity. Since our review was undertaken, a revised version of QUADAS (QUADAS-2) has been released [12]. QUADAS-2 more closely resembles the approach and structure of the Cochrane risk-of-bias tool. Furthermore, it aims to produce an assessment of the risk of bias by methodological domain (participant selection, index test, reference standard, and flow of participants through the study), rather than the more general overall risk of bias.

The following criteria were assessed: item 1, application of a reference standard likely to correctly classify the target condition; item 2, observance of an appropriate time period between application of the reference standard and the index test; item 3, independence of the index test from the reference standard; items 4 to 6, avoidance of partial verification bias, differential verification bias, and incorporation bias; item 7, interpretation of the reference standard without knowledge of the results of the index test; item 8, performance of an intention-to-diagnose analysis; item 9, avoidance of selective reporting; and item 10, no identification of other aspects contributing to a risk of bias.

The QUADAS tool does not recommend the use of algorithms to derive overall ratings of study quality. However, the revised version of the QUADAS tool (QUADAS-2) allows for an overall rating of ‘low risk of bias, where all criteria are met’ [12]. For this evaluation we developed topic-specific scoring guidance. In particular, we modified the requirement for all criteria to be met, to allow for the fact it was not possible for studies to meet item 5 (differential verification bias). The studies were then categorized as follows. If one of the items was evaluated with ‘no’ (except item 5), the study was rated as having a high risk of bias. If at least two of the items were rated as ‘unclear’, the study was also rated as having a high risk of bias – items 7, 8 and 10 were exceptions; here at least two items (plus an additional one from items 1 to 6 or 9) had to be rated as ‘unclear’ in order to conclude a high risk of bias.
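The categorization rule above can be written as a short decision function. The sketch below is one possible reading of the rule, not the authors' implementation: it assumes each item is rated 'yes', 'no' or 'unclear', that a 'no' on item 5 is ignored, and that 'unclear' ratings on items 7, 8 and 10 only lead to a high-risk rating when at least two of them occur together with at least one 'unclear' among items 1 to 6 or 9. The names `overall_risk`, `CORE_ITEMS` and `LENIENT_ITEMS` are hypothetical.

```python
# Items whose 'unclear' ratings count directly (items 1 to 6 and 9).
CORE_ITEMS = {1, 2, 3, 4, 5, 6, 9}
# Items treated as exceptions for 'unclear' ratings (items 7, 8 and 10).
LENIENT_ITEMS = {7, 8, 10}


def overall_risk(ratings):
    """Classify a study as 'high' or 'low' risk of bias.

    ratings: dict mapping item number (1..10) to 'yes', 'no' or 'unclear'.
    This encodes one interpretation of the scoring guidance in the text.
    """
    # Any 'no' leads to high risk, except on item 5 (differential verification).
    if any(r == "no" for item, r in ratings.items() if item != 5):
        return "high"

    core_unclear = sum(ratings.get(i) == "unclear" for i in CORE_ITEMS)
    lenient_unclear = sum(ratings.get(i) == "unclear" for i in LENIENT_ITEMS)

    # Two or more 'unclear' ratings among items 1 to 6 or 9.
    if core_unclear >= 2:
        return "high"
    # Two or more 'unclear' among items 7, 8, 10 plus one among items 1 to 6 or 9.
    if lenient_unclear >= 2 and core_unclear >= 1:
        return "high"
    return "low"


# Hypothetical study: all items met except item 5, which could not be met.
example = {i: "yes" for i in range(1, 11)}
example[5] = "no"
print(overall_risk(example))
```

Other readings of the exception for items 7, 8 and 10 are possible; the thresholds are kept in two separate conditions so that they can be adjusted independently.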

The risk of bias in the diagnostic accuracy studies identified by the systematic reviews was classified according to the ratings given in the respective reviews.

Data analysis

Sensitivity and specificity were calculated from contingency tables as reported in or derived from the studies. The 95% confidence intervals were computed using the Clopper and Pearson method [13]. Owing to great clinical heterogeneity, bivariate meta-analyses did not seem appropriate, and no pooled estimates were calculated. The diagnostic accuracy estimates of the individual studies are presented in forest plots according to analyses of all patients with AJCC stages I to IV and of two subgroups (AJCC stages I and II, and AJCC stages III and IV).
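As an illustration of these calculations, the following sketch derives sensitivity, specificity and exact (Clopper and Pearson) 95% confidence intervals from a single 2×2 table. It is not the software used for the review: the function names are hypothetical, the interval bounds are found by simple bisection on the binomial distribution using only the Python standard library, and the example counts are invented.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, target, increasing):
    """Solve f(p) = target for p in [0, 1], with f monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for k successes in n trials."""
    if n == 0:
        return None  # proportion undefined, so no interval
    # Lower bound: p with P(X >= k | p) = alpha / 2 (increasing in p).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper bound: p with P(X <= k | p) = alpha / 2 (decreasing in p).
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper


def accuracy(tp, fn, fp, tn):
    """Sensitivity and specificity with exact 95% CIs from a 2x2 table."""
    diseased, healthy = tp + fn, tn + fp
    return {
        # Sensitivity is undefined when there are no diseased patients.
        "sensitivity": tp / diseased if diseased else None,
        "sensitivity_ci": clopper_pearson(tp, diseased),
        "specificity": tn / healthy if healthy else None,
        "specificity_ci": clopper_pearson(tn, healthy),
    }


# Hypothetical counts, not taken from any included study.
print(accuracy(tp=8, fn=2, fp=3, tn=27))
```

Returning `None` when a denominator is zero mirrors the situation in which sensitivity cannot be calculated because a study contains no true-positive and no false-negative cases.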


Results

Literature search

The search for primary studies (that is, the search for RCTs investigating the patient-relevant benefit of PET(/CT) as well as the update search for diagnostic and prognostic accuracy studies) retrieved 9,824 references. However, no relevant RCT on the primary aim of our review was identified (Figure 1).

Figure 1 Flow chart of study.

Concerning the secondary aim, four high-quality potentially relevant systematic reviews on the diagnostic accuracy of PET(/CT) in melanoma [14–17] were retrieved from 1,650 citations. After extraction of the study information, the systematic reviews by Mijnhout and colleagues [16] and Xing and colleagues [17] were excluded from our review. Additional studies investigated by Mijnhout and colleagues and by Xing and colleagues that were not already included through Jiménez-Requena and colleagues [14] or Krug and colleagues [15] did not fulfill our inclusion criteria because of retrospective study designs, because only nonpatient-based analyses were reported, or because 2×2 tables could not be reconstructed. However, both of the excluded high-quality reviews are addressed in the discussion section of this paper. We did not find any discordance studies or studies using a verification of only positive testers design.

Our review of reviews was ultimately based on primary studies included in two systematic reviews [14, 15].

Eleven primary studies included in these reviews met our inclusion criteria.

The searches conducted in the two included reviews ended in March 2007. Our update search therefore covered the period from March 2007 to January 2011 and identified six further relevant studies. In total we therefore included 17 primary studies on diagnostic accuracy in our review. No relevant prognostic study was found.

Study characteristics

The characteristics of the 17 diagnostic accuracy studies (1,155 patients) are presented in Additional file 1. Twelve studies investigated PET and five studies investigated PET/CT; the latter studies were exclusively identified by our update search. All included studies used F-18-fluoro-2-deoxyglucose as a tracer.

The age of patients ranged from 18 to 89 years. The mean percentage of males per study was 51.0% and the mean number of participants was 68 (range 17 to 251). The studies were published between 1998 and 2010.

Six studies included patients with primary tumors in AJCC stages I and II [18–23]. Three studies analyzed patients with AJCC stages III and IV. Further studies enrolled patients across several stages: patients with AJCC stages II and III [24], patients with AJCC stages II to IV [25], and patients with AJCC stages I to III [26]. For three further studies, information on AJCC stages was missing [27–29].

Risk of bias

The risk of bias varied across studies. The two systematic reviews used different assessment tools. Jiménez-Requena and colleagues modified items from previous systematic reviews on PET [30–33]. These items covered seven dimensions: ‘description of study design, description of the study population, indications leading to F-18-fluoro-2-deoxyglucose PET use, technical and image interpretation issues, final confirmation, sensitivity and specificity data, and change in management information’ [14]. A score >70% was defined as high quality (low risk of bias); six of the 10 studies included in our review were defined as having a low risk of bias by Jiménez-Requena and colleagues [18–20, 24–26].

Krug and colleagues assessed the risk of bias using the QUADAS tool [11]. Detailed information was provided on the potential sources of bias and the proportion of affected studies. These sources were, among others: the inclusion of an inappropriate range of patients, the insufficient description of inclusion and exclusion criteria, insufficient information on primary tumors, the lack of an independent interpretation of the index test and reference standard, and insufficient explanation of withdrawals. However, only a QUADAS sum score was provided for individual studies. Such a score does not discriminate between what could be a wide variety of quality issues (different sources of bias, and reporting and applicability issues), nor does it distinguish between internal and external validity. Furthermore, studies were not categorized into those with a high risk and those with a low risk of bias. The sum score was devised by Krug and colleagues; the authors of QUADAS do not recommend the use of such scores. Nevertheless, one study fulfilled all QUADAS criteria [20].

The seven studies included in both reviews were rated similarly [18–20, 22, 24, 26, 27], independent of the assessors. All studies except the study by Rinne and colleagues [22] were classified as having a low risk of bias.

Of the six studies published after March 2007 and identified in the update search [23, 28, 34–37], two had a high risk of bias [28, 35]. The main weaknesses of these two studies were: the time period between the reference test and the index test was inappropriate or unclear (Table 1, item 2), incorporation bias was evident (item 6), the results of the reference test were interpreted with knowledge of the results of the index test or the corresponding information was missing (item 7), or no intention-to-diagnose analysis was performed (item 8).

Table 1 Risk-of-bias assessment of studies published after March 2007


Diagnostic accuracy

The results for diagnostic accuracy are displayed in Table 2. We did not pool data in meta-analyses because of differences in indications (N-staging or M-staging, or a mix of both), reference standards and index tests. Across all included studies, the sensitivity of PET(/CT) ranged from 0 to 100% and specificity from 18 to 100%. Eleven analyses (nine studies) on the detection of regional lymph node metastases (N-staging) were included. Seven analyses were based on PET and four on PET/CT. Sensitivity ranged from 0% (specificity 80.6%) to 100% (specificity 77%), and specificity ranged from 18% (sensitivity 40%) to 100% (sensitivity 8 to 38%) [18, 19, 21, 23, 25–27, 34, 37].

Table 2 Results of included studies

Six analyses on the detection of distant metastases (M-staging) were performed in five studies [24, 25, 29, 35, 37]. Five analyses were based on PET and one on PET/CT. The sensitivity of PET(/CT) ranged from 33% (specificity 90.3%) to 97% (specificity 56%) and specificity from 56% (sensitivity 97%) to 98% (sensitivity 92%) [24, 25, 29, 35, 37].

Four studies reported contingency tables for both indications (N-staging and M-staging) [20, 22, 28, 36].

In two studies, differences in diagnostic accuracy between imaging procedures were tested for statistical significance [35, 37]. Veit-Haibach and colleagues analyzed PET versus PET/CT and PET versus CT [37]. Sensitivity was low for all technologies whereas specificity was high. No statistically significant differences were detected between the technologies.

Analyses by Bastiaannet and colleagues showed that PET and CT performed similarly regarding the detection of liver, lung and abdominal metastases (P = 0.81, P = 0.16 and P = 0.62, respectively, no estimates and confidence intervals provided) [35].

Subgroup analysis

In patients with AJCC stages I and II, the sensitivity of PET(/CT) ranged from 0% (specificity 81%) to 100% (specificity 77%) (Figure 2). In four studies of this subgroup, the sensitivity of PET(/CT) ranged from 0% (specificity 81%) to 17% (specificity 93%), while in the other two studies the sensitivity was 67% (specificity 100%) and 100% (specificity 77%). The specificity of PET(/CT) in this subgroup ranged between 77% (sensitivity 100%) and 100% (sensitivity 13 or 67%). Three studies investigated AJCC stages III and IV; in this second subgroup, the sensitivity of PET(/CT) ranged from 68% (specificity 92%) to 87% (specificity 98%), and specificity correspondingly ranged from 92% (sensitivity 68%) to 98% (sensitivity 87%).

Figure 2 Forest plots of diagnostic accuracy.

In the third group, studies including patients in all stages (AJCC stages I to IV) were compared. The sensitivity of PET(/CT) ranged from 17% (specificity 74%) to 97% (specificity 56%) and specificity ranged from 18% (sensitivity 40%) to 100% (sensitivity 38%).

Veit-Haibach and colleagues analyzed M-staging and N-staging for PET(/CT) [37]. The 2×2 tables for PET(/CT) for N-staging showed identical results. Only three 2×2 tables are therefore presented in Figure 2. Sensitivity ranged from 17% (specificity 74%) to 97% (specificity 56%). Sensitivity could not be calculated for Hafner and colleagues as no true-positive and false-negative cases were detected by PET [25]. In six studies, specificity ranged from 56% (sensitivity 97%) to 100% (sensitivity 38%), except for Vereecken and colleagues [26], with a specificity of 18% (sensitivity 40%).


Discussion

Despite a comprehensive search, no RCT investigating the potential patient-relevant benefit of PET(/CT) in primary staging of malignant melanoma was identified. Only diagnostic accuracy studies were found, which were retrieved from two previous systematic reviews and from an update search.

For N-staging and M-staging, the results of the individual studies varied widely in their estimates of sensitivity and specificity. Potential sources of heterogeneity included differences in the risk of bias, in the reference tests and index tests applied and in the spectrum of patients investigated. Stratification by disease stage indicated that the diagnostic accuracy of PET(/CT) varied with AJCC stage. PET(/CT) performed better in patients with AJCC stages III and IV than in those with lower AJCC stages. However, higher sensitivity and specificity of PET(/CT) in AJCC stages III and IV do not necessarily imply that there is a patient-relevant benefit of PET(/CT) in this subgroup. Diagnostic accuracy studies only describe how well a new diagnostic test agrees with the reference standard. One cannot draw conclusions from them as to how variations in the test strategy may ultimately affect a patient’s outcomes. Likewise, robust conclusions cannot be drawn from nonrandomized studies as the potential risk of bias is higher than in randomized ones [38–40]. Moreover, diagnostic accuracy studies are unable to answer effectiveness questions. Owing to the lack of direct comparisons in our review, there is no evidence that the diagnostic accuracy of PET(/CT) is better than that of conventional imaging.

Our search identified four high-quality systematic reviews investigating diagnostic accuracy [14–17], of which two were included in the actual assessment.

All four reviews conducted meta-analyses: Xing and colleagues [17] and Jiménez-Requena and colleagues [14] presented separate analyses for N-staging and M-staging, whereas Krug and colleagues [15] and Mijnhout and colleagues [16] presented combined analyses. However, in most cases Jiménez-Requena and colleagues refrained from pooling data due to heterogeneity. In contrast, we refrained from conducting any meta-analyses at all because, in our view, the studies were too heterogeneous; for example, concerning patient characteristics (AJCC stage, age, primary cancer site, and so forth) or the use of different index or reference tests (for example, newer studies applied PET/CT whereas older ones applied PET).

For information purposes, we present the results of the meta-analyses or ranges for individual studies in Table 3. However, we emphasize that the results lack comparability due to various factors.

Table 3 Results of meta-analyses on diagnostic accuracy

For example, all four of the other reviews included both prospective and retrospective studies whereas we only included prospective ones. The quality of studies with a retrospective design is limited by factors such as spectrum bias, attrition bias, unclear blinding, the frequent lack of standardized implementation of the index and reference tests, and especially the lack of prospectively defined cutoff points [41]. For future reviews we would suggest comparing results between study types (prospective versus retrospective); for example, in the form of sensitivity analyses.

A further reason for the noncomparability of results is the fact that the analyses in the other reviews were largely not exclusively patient-based, but lesion-based. From a statistical point of view, a lesion-based analysis overestimates the precision of results. In our opinion, if not properly accounted for, this type of analysis also ignores the fact that observations on lesions of a given patient are not independent, which may lead to bias.
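The loss of precision hidden by a lesion-based analysis can be illustrated with the design effect commonly used for clustered data, deff = 1 + (m − 1)ρ, where m is the number of lesions per patient and ρ the intracluster correlation. The sketch below is a simplified illustration and not an analysis performed in this review: it assumes the same number of lesions in every patient, and the values of m, ρ and the lesion count are hypothetical.

```python
from math import sqrt


def design_effect(lesions_per_patient, icc):
    """Variance inflation for clustered observations: 1 + (m - 1) * rho."""
    return 1 + (lesions_per_patient - 1) * icc


def effective_sample_size(n_lesions, lesions_per_patient, icc):
    """Number of independent observations the lesions are 'worth'."""
    return n_lesions / design_effect(lesions_per_patient, icc)


# Hypothetical example: 40 patients with 3 lesions each, correlation 0.5.
n_lesions, m, rho = 120, 3, 0.5
deff = design_effect(m, rho)
n_eff = effective_sample_size(n_lesions, m, rho)

print(f"design effect: {deff}")
print(f"effective sample size: {n_eff} instead of {n_lesions} lesions")
# A naive standard error that treats lesions as independent is too small
# by this factor.
print(f"standard error underestimated by a factor of {sqrt(deff):.2f}")
```

With ρ = 0 the lesions behave as independent observations and the design effect is 1; with ρ = 1 every patient contributes only one effective observation, however many lesions were counted.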

As already recommended in the literature [42], we suggest that studies on PET imaging report a patient-based analysis as the primary analysis; other types of analyses should always be adjusted for patients or should only be presented as supplementary information.

Furthermore, when examining the diagnostic accuracy of PET in N-staging, we could not reproduce some of the results presented by Xing and colleagues [17], who, for example, reported a specificity of 100% for the studies by Kokoska and colleagues and Longo and colleagues [43, 44]. However, specificity in these studies was not reported and could not be calculated; therefore, despite the high-quality rating according to the Oxman and Guyatt tool, we have some doubts about the validity of the analysis performed by Xing and colleagues.

Assessment of risk of bias in primary studies

Different reference standards will be employed in patients with positive and negative PET scan results. Most metastases will be confirmed histologically, while apparently disease-free patients will remain under regular clinical observation. This issue should be listed as a possible source of differential verification bias (item 5) for all studies; the item’s possible impact is then a matter of judgment – scoring this item negatively does not necessarily mean that the whole study was biased. We judged that histology results and follow-up results will often have similarly high validity, so that this item was essentially not used to decide about the validity of included studies.


Limitations

One of the major limitations of this systematic review is the low quality of reporting in the studies considered. Studies were often described inadequately. In some studies it was not possible to determine whether they had a retrospective or a prospective design. Furthermore, in some cases it was not clear which cutoff values were used and when the index test PET(/CT) or the reference test was applied.

We performed a review of reviews. Results of the risk-of-bias assessment were therefore taken from the included reviews. A comparison of these results was hampered by the fact that different assessment tools were used and by inadequacies such as a lack of distinction between internal and external validity or a lack of categorization of studies into those with a high or a low risk of bias. In agreement with the authors of the reviews, we also concluded that many studies failed to provide sufficient details to adequately assess risk-of-bias items.


Conclusions

There is currently no evidence for a patient-relevant benefit of PET(/CT) in primary staging of malignant melanoma, indicating that it may be too early for broad clinical use of this technology. In future, stage-adapted RCTs investigating patient-relevant outcomes are needed to determine whether PET(/CT) actually has a benefit for patients.

The diagnostic accuracy of PET(/CT) appears to increase with higher AJCC stages. However, regardless of the indication (N-staging or M-staging), the ranges for sensitivity and specificity were wide. In addition, the 17 studies on diagnostic accuracy were heterogeneous, some were small and many showed methodological deficiencies. Future diagnostic accuracy studies should be prospective, of better quality, and better reported.