Background

Health care policy makers and clinicians increasingly seek evidence-based guidance on how to provide mental health services effectively and efficiently. Systematic reviews have come to be accepted as producing the best estimates of clinical efficacy and effectiveness, and they thus constitute a cornerstone of policy analysis and cost-effectiveness evaluation [1]. Founded in 1999, the London-based National Institute for Health and Clinical Excellence (NICE) has quickly established itself as a leading agency conducting health technology assessments (HTAs), including economic evaluations. A review team of the World Health Organization observed that its "published technology appraisals are already being used as international benchmarks" [2], beyond NICE's primary remit of providing guidance for the National Health Service in England and Wales.

In March 2006, NICE published guidance on the use of methylphenidate, atomoxetine and dexamphetamine for attention-deficit/hyperactivity disorder (ADHD) in children and adolescents (NICE Technology Appraisal No. 98 [3]). This guidance was based on a >600-page technology assessment report, which had been produced by a team of ten experts, including one clinical specialist who provided input and comments [4].

In the United States, NICE guidelines are routinely referenced by the National Guideline Clearinghouse, an initiative of the Agency for Healthcare Research and Quality (AHRQ). NICE technology assessments and appraisals are easily accessible through its website, which receives more than one million hits per month from the United States alone [5]. The NICE assessment of ADHD treatment strategies thus might have great influence on future treatment practices beyond England and Wales, notably in the United States, where ADHD is now the most commonly diagnosed behavioral disorder in children, with approximately 4.4 million children aged 4–17 years diagnosed and 2.5 million taking medication for the disorder [6]. Therefore, a critical appraisal of NICE Technology Appraisal No. 98 will be of interest to mental health care policy makers and clinicians.

Methods

A qualitative study was done of NICE Technology Appraisal No. 98, "Methylphenidate, atomoxetine and dexamfetamine for attention deficit hyperactivity disorder (ADHD) in children and adolescents (Review of Technology Appraisal 13)" [3]. The study focused on policy-relevant aspects and had descriptive, explorative, and explanatory elements.

Its initial phase consisted of defining a theoretical framework for analysis. This included a description of NICE technology appraisal processes, which at that time were undergoing substantial revision, including NICE's definition of a "reference case" for analysis [7, 8]. During this phase, a thematic framework was defined, comprising the use of the "accountability for reasonableness" concept as a process benchmark [9, 10], a critique of the technology assessment report underlying the appraisal, and a review of the clinical and economic literature on attention-deficit/hyperactivity disorder [11].

Its second phase comprised data collection employing a number of closely related strategies, including retrieval and analysis of documents related to the ADHD appraisal that were posted on the NICE website. Scientific articles cited in these documents were obtained for analysis. This was supplemented by literature searches (using PubMed, the Business Source Elite database via EBSCOhost, and Google Scholar) for articles on ADHD diagnosis, treatment, compliance, cost, and cost-effectiveness, complemented by a search for relevant abstracts presented at international meetings in the fields of psychiatry and health economics. Documents were indexed using categories including study type, product tested, and subject matter (e.g., "treatment compliance") for further analysis and interpretation.

The analysis reported here is part of this more comprehensive study of NICE appraisal processes by the same author [11], and it is focused on the underlying Technology Assessment Report [4]. The purpose of the present paper is to shed light on the validity of the conclusions offered by NICE; it should be emphasized that it is not intended to assign responsibility for any identified problems to particular actors (such as NICE, its committees, or the assessment team). Unless specified otherwise, the following citations will refer to the Technology Assessment Report ("TAR" [4]), which was subsequently published as a full paper in the Health Technology Assessment monograph series of the NHS R&D HTA Programme [12], apparently unchanged.

Results

The various products evaluated by NICE are summarized in Table 1. The scope of the assessment [13] and its final protocol [14] specified that these products should be compared with placebo and with usual care. Outcomes were to include the incidence and severity of core symptoms, problem behaviors, educational performance, measures of depression and/or anxiety, measures of conduct/oppositional-defiant-disorder-related outcomes, adverse events, and quality of life. A recommendation was also included to consider the impact of co-morbid disorders, the quality of life of family members, and the optimal duration of treatment, "where the evidence permits". The scope effectively excluded an evaluation of non-drug treatment and of ADHD in adults. Alternative treatments were likewise not reviewed [15].

Table 1 Products evaluated by NICE

The Technology Assessment Report (TAR), comprising 605 pages with 13 appendices, included a systematic review of the evidence and a statistical data synthesis using mixed treatment comparison (MTC) techniques, a review of submissions by manufacturers, and an economic evaluation model. Main conclusions of the TAR were that "(i) drug therapy seems to be superior to no drug therapy; (ii) no significant differences between the various drugs in terms of efficacy or side effects were found – mainly due to lack of evidence; (iii) the additional benefits from behavioural therapy (in combination with drug therapy) are uncertain" and: "Given the lack of evidence for any differences in effectiveness between the drugs, the [economic] model tends to be driven by drug cost, which differ considerably" (TAR, p. 20). More specifically, it was asserted that "for a decision taken now, with current available data, the results of the economic model clearly identify an optimal treatment strategy" (TAR, p. 261; italics added) and that "this analysis showed that a treatment strategy of 1st line dexamphetamine, followed by 2nd line methylphenidate immediate-release for treatment failures, followed by 3rd line atomoxetine for repeat treatment failures was optimal" (TAR, p. 260).
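To illustrate the general idea behind such a data synthesis, the following sketch shows a simple frequentist (Bucher-type) indirect comparison of two drugs via a common placebo comparator. This is only an illustration of the principle; the TAR itself used a more elaborate Bayesian mixed treatment comparison model, and all counts below are hypothetical.

```python
import math

# Minimal sketch of an indirect (Bucher-type) treatment comparison: two drugs
# A and B that were each compared with placebo, but never head to head, are
# compared indirectly on the log-odds-ratio scale via the common placebo arm.
# All counts are hypothetical and serve only to illustrate the mechanics.

def log_or(events_t, n_t, events_c, n_c):
    """Log odds ratio of treatment vs. control, with its variance."""
    a, b = events_t, n_t - events_t
    c, d = events_c, n_c - events_c
    lor = math.log((a * d) / (b * c))
    var = 1 / a + 1 / b + 1 / c + 1 / d
    return lor, var

lor_a, var_a = log_or(30, 50, 15, 50)   # hypothetical trial: drug A vs. placebo
lor_b, var_b = log_or(25, 50, 15, 50)   # hypothetical trial: drug B vs. placebo

# Indirect comparison of A vs. B; variances add because the trials are independent.
lor_ab = lor_a - lor_b
se_ab = math.sqrt(var_a + var_b)
print(round(lor_ab, 2), "+/-", round(1.96 * se_ab, 2))
```

Because the variances of the two placebo-controlled estimates add, indirect comparisons of this kind are considerably less precise than head-to-head trials, which is one reason why the sparse evidence base weighs so heavily on the conclusions.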

Remarkably, the NICE Appraisal Committee did not uphold these "clear conclusions", issuing guidance based on the assumption that it was "not possible to distinguish between the different [treatment] strategies on the grounds of cost-effectiveness" [3].

Analysis of the TAR reveals a number of methodological issues, which collectively leave the assessment open to critique concerning all four essential components of a review question [16]: the population studied (e.g., exclusion of adults and failure to address the impact of coexisting conditions), the choice of interventions (e.g., exclusion of psychosocial treatment), the clinical and economic outcome criteria used, and the study designs and selection criteria. The following critique will concentrate on key issues concerning data selection and synthesis as well as the model developed for economic evaluation.

Data selection for assessment

Departing from the assessment protocol [14], literature searches for the assessment did not include "abstracts, ... conference proceedings, ... and other grey literature etc." (see final protocol, pp. 2f. [14]; cf. also TAR, p. 178), thus excluding relevant cost-effectiveness analyses in the public domain at the time of assessment [17–20]. This notwithstanding, it was claimed that "this review presents a comprehensive overview of existing economic evaluations of methylphenidate, atomoxetine and dexamphetamine for children and adolescents with ADHD" (TAR, p. 266). Further search anomalies include the overlooking of at least two clinical studies meeting the specified inclusion criteria [21, 22].

For assessment, studies had to have a minimum duration of three weeks because "the literature suggests that three weeks is the minimum duration for therapeutic trials" to assess "the impact on the social adjustment of the child" (TAR, p. 44), citing the DSM-IV diagnostic manual (TAR, p. 45). The rationale offered was that "the effect of medication on behaviour is often (not always) apparent immediately, but the impact on the social adjustment of the child may well not be apparent in the first days of therapy" (final assessment protocol [14], pp. 3ff., and TAR, p. 45). To be consistent with this reasoning, one would expect a minimum treatment duration of three weeks. However, a minimum study duration of three weeks was applied as the inclusion criterion. As a consequence, more than one third of the 64 trials included in the clinical effectiveness review were very-short-term crossover studies with treatment periods of one week or less, some of which had been conducted without washout phases between treatment periods (TAR, pp. 51–163) [11].

The observation that the choice of outcome measures reflects a critical design choice for analysis (TAR, p. 178) was not followed by a review of the literature on measurement instruments [23, 24]. Although the social adjustment of patients was implied to be the outcome of interest (TAR, p. 45), clinical effect measures reflecting functional impairment were discarded from the analysis (TAR, p. 46). Instead, clinical global impressions (CGI scores) were used "as a proxy of quality of life" (TAR, pp. 16 and 48), and CGI-I (improvement) subscores were selected as the primary endpoint informing the economic evaluation. In this respect, the economic analysis deviated from the clinical effectiveness review, which had included measures of hyperactivity and comprised a total of 64 randomized clinical trials (RCTs) plus the MTA study [25, 26]. This decision was motivated by the desire to compute quality-adjusted life years (QALYs) as the effectiveness measure for economic modeling (TAR, pp. 224f.). Since this maneuver left only five RCTs with treatment durations of three to eight weeks for modeling, none of which included dexamphetamine, a study previously excluded for inadequate data presentation was secondarily added in order to have any data on dexamphetamine for the economic analysis (TAR, pp. 225f. and p. 338). Given well-documented gender differences in ADHD [27], which include clinical response to methylphenidate [28], and the fact that the disorder is most often diagnosed in boys [29], it is noteworthy that this three-week crossover study had reported on 32 girls [30]. The assessment did not offer a discussion of this peculiarity. On this basis, after eliminating consideration of the role of concomitant psychosocial interventions, 19 alternative treatment "strategies" (in fact, product sequences) were modeled (TAR, p. 221).

Another important gap occurred in relation to atomoxetine. First, one of two state-of-the-art RCTs comparing atomoxetine and long-acting methylphenidate was overlooked [22, 31]. These studies concurred in suggesting lower or at best equal efficacy of atomoxetine [22, 31–33]. Second, two analyses were not considered that provided effect size estimates for long-acting stimulants (0.95–1.02) and nonstimulant medications (0.44–0.62), using core symptom improvement as the effect measure [34–37]. These findings had been interpreted by their authors as "substantial and significant differences in efficacy" [37].

Data synthesis

In an attempt to overcome the limitations of the remaining database, the NICE assessment relied on advanced mixed-treatment comparison techniques for quantitative meta-analysis of response rates, which were subsequently transformed into QALY gains. For the QALY computation, quality weights were derived from utility studies that had used health state descriptions corresponding neither to the CGI criteria nor to any of the other clinical endpoints secondarily added (cf. TAR, pp. 359ff.). This approach was pursued despite explicit recognition that "the validity of these measures depends on the content and style of the vignette used to describe each health state" (TAR, p. 181).
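The following minimal sketch, using hypothetical response rates and utility weights rather than the TAR's actual figures, shows how response rates are typically converted into QALY gains once "responder" and "non-responder" health states have been assigned vignette-based utility weights. The validity of the result clearly hinges on how well those vignettes match the clinical response criterion.

```python
# Illustrative sketch (not the TAR's actual parameters) of how response rates
# from a treatment comparison are converted into QALY gains using utility
# weights attached to "responder" / "non-responder" health states.

def qaly_gain(response_rate, u_responder, u_nonresponder, years):
    """Expected QALY gain versus remaining in the non-responder state."""
    mean_utility = response_rate * u_responder + (1 - response_rate) * u_nonresponder
    return (mean_utility - u_nonresponder) * years

# Hypothetical inputs: 70% vs. 60% response probability for two strategies,
# and vignette-based utility weights of 0.94 (responder) and 0.88 (non-responder).
for name, p in [("strategy A", 0.70), ("strategy B", 0.60)]:
    print(name, round(qaly_gain(p, 0.94, 0.88, years=1.0), 4))
# Even a 10-percentage-point difference in response yields a QALY difference
# in the third decimal place (here 0.042 vs. 0.036), foreshadowing the small
# incremental QALY gains discussed below.
```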

In order to broaden the database of six RCTs, data derived from different clinical effect measures were subsequently pooled for "sensitivity analyses"; data synthesis comprised heterogeneously defined "response rates" based on (in addition to the CGI-I subscores) CGI-S ratings as well as SNAP-IV and ADHD-RS scores (TAR, p. 254), while the most widely used measures of clinical efficacy in ADHD trials, the Conners Rating Scales [23, 24], remained excluded from economic modeling (TAR, p. 224) despite their well-documented psychometric properties [24]. These narrow-band symptom scales (i.e., the SNAP-IV and ADHD-RS [23, 24]) were erroneously regarded as "disease-specific instruments [measuring] health-related quality of life in children" (TAR, p. 176). In total, these secondary extensions resulted in the inclusion of 13 RCTs, four of which were designated "commercial-in-confidence" and not disclosed (TAR, Chapter 6).

In a final step, the MTA study [25, 26] – arguably the most important clinical study completed in ADHD to date – was also added, although it remains enigmatic which data were actually used, as the model drew on information from only three of the four study arms (TAR, p. 254): whereas it was stated that "the nature of the treatment received in the community comparison arm of the MTA trial is still unclear, and as a result this data is omitted from the analysis" (TAR, p. 254), a table on the same page of the assessment report states that "results for behavioral treatment were omitted as not relevant to this review" (TAR, p. 254).

Whereas the assessment team did not explain its implicit assumption that CGI-I subscores – its primary measure of effectiveness used for the calculation of "response rates", which directly refers to a comparison "to the patient's condition before admission to the project" [38] – were independent of baseline, it rejected the Conners Rating Scales – the most widely used group of measurement instruments in ADHD studies [24] – for precisely this reason (TAR, p. 186 and p. 224). As a consequence (apart from the enigmatic use of MTA study data), none of the 14 extended treatment studies reviewed by Schachar et al. (2002) [39] were included in the data synthesis for cost-effectiveness evaluation (cf. TAR, Chapter 6). Insights from an important 24-month RCT involving more than 100 patients treated with methylphenidate [40, 41] were likewise not considered, because the study did not fit the narrowly defined inclusion criteria for the review. Finally, discussion in the assessment report of the 24-month follow-up data from the MTA study, which provide insights into the persistence of treatment effects over the first ten months after trial completion [42], was limited to the clinical review (TAR, p. 168); these data were not addressed in the context of the economic modeling exercise (TAR, Chapter 6).

Although the assessment group correctly observed that MTA subgroup analyses "should be seen as 'exploratory', because of the danger of repeated statistical testing with a sample not designed for this purpose" (TAR, p. 167), it claimed at the same time that its own "model is probabilistic, meaning that relevant input parameters are entered as probabilistic distributions in order to represent the uncertainty around each point estimate" (TAR, p. 220), emphasizing that "the output from the model incorporates the uncertainty around the estimated response rates" (TAR, p. 229). However, in RCTs it is the primary analysis, as defined ex ante, which is most important, and the CGI scores did not represent the primary endpoint in any of the studies selected for synthesis. While it is certainly legitimate to carry out secondary analyses, these should not be represented as fully capturing stochastic uncertainty [43].
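For readers unfamiliar with the terminology, the following sketch illustrates what a "probabilistic" model of this kind typically does: input response rates are represented as distributions (here Beta distributions derived from hypothetical responder counts, not the TAR's data) and the comparison is re-run over many random draws. Such an exercise propagates only the sampling uncertainty around the chosen estimates; it cannot capture the additional uncertainty introduced by selecting a secondary endpoint post hoc.

```python
import random

# Minimal sketch of a probabilistic sensitivity analysis: each response rate
# is represented by a Beta distribution derived from hypothetical responder
# counts, and the incremental response probability is evaluated over many
# draws. This reflects sampling uncertainty around the chosen estimates only.

def beta_sample(successes, failures):
    return random.betavariate(successes + 1, failures + 1)

def psa(n_draws=10_000):
    draws = []
    for _ in range(n_draws):
        p_a = beta_sample(35, 15)   # hypothetical trial: 35/50 responders
        p_b = beta_sample(30, 20)   # hypothetical trial: 30/50 responders
        draws.append(p_a - p_b)     # incremental response probability
    return sum(d > 0 for d in draws) / n_draws

if __name__ == "__main__":
    print("P(strategy A better than strategy B):", psa())
```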

The differences in QALYs gained by each treatment "strategy" derived in this way appeared only in the third or fourth decimal place (TAR, pp. 237ff.), and the primary analysis produced a series of inconsistent rankings (TAR, p. 237), which were left without comment and disappeared only after secondary model extensions ("sensitivity analyses", TAR, pp. 240ff.) comprising the pooling of heterogeneous response criteria mentioned earlier. Although it was claimed that "the issue of heterogeneity was overcome by basing the base case [primary] analysis on trials that are more similar in terms of how they measure the outcome of interest" (TAR, p. 266), internally consistent model results were in fact achieved only after this pooling, and there is no indication that potential confounding between treatment strategies and effect measures was assessed.

Efficacy versus effectiveness

An important issue pervading the NICE assessment is the way the distinction between efficacy and effectiveness was (not) addressed. Whereas RCTs follow an explanatory orientation ("Can the intervention work?"), economic evaluations, to be meaningful, require a pragmatic orientation ("Does the intervention work?") [16, 44, 45]. Efficacy data collected during RCTs deliberately and necessarily exclude naturalistic effects associated with routine clinical practice settings, whereas effectiveness may be influenced by a number of external factors, notably including poor treatment compliance. "Artificially enhanced compliance" in RCTs has come to be recognized as a major threat to their external validity [46]. This is a relevant aspect because the more expensive treatment options evaluated (atomoxetine and modified-release formulations of methylphenidate) differ from their comparators in their simplified dosage regimens, which may be expected to result in improved treatment adherence in practice settings.

There are multiple streams of evidence supporting this expectation. First, there is a statistically significant association between the complexity of dosage regimens and treatment compliance [47]. Second, there are reasons to assume that treatment adherence of patients with ADHD may be impaired by disease-specific factors [48, 49]. Third, owing to their pharmacokinetic and pharmacodynamic properties, the behavioral effects of short-acting stimulants dissipate rapidly after three to four hours, making these drugs prototypical examples of non-forgiving compounds with regard to noncompliance [50–55]. Fourth, three independent retrospective database studies consistently indicate higher treatment persistence rates for patients receiving long-acting stimulants compared with short-acting formulations [56–60]. Although analyses based on administrative data typically do not allow differential analysis of reasons for treatment discontinuation and may be distorted by patient selection bias, data showing a higher number of prior diagnoses and significantly lower rates of accidents, injuries, emergency room visits, and hospitalizations among those treated with long-acting formulations are consistent with the assumption that such distortions were absent [56, 59, 60]. Fifth, mediator analyses of the NIMH MTA study confirm the important role of compliance: acceptance/attendance as intended was found to significantly enhance treatment response for both the medication management and combined treatment strategies [26]. The NICE assessment group overlooked these findings (TAR, pp. 167ff.) and reasoned instead that "we can also incorporate the results of the MTA trial, but only by assuming that the medical management group in that trial represents treatment with immediate-release methylphenidate" (TAR, p. 253).

For economic evaluation, there are two broadly accepted ways to address the impact of compliance [61, 62]: the use of decision-analytic models combined with appropriate sensitivity analyses, and the use of information from randomized "pragmatic trials" with minimal study management [63]. Sixth, modeling studies have indicated an acceptable cost-effectiveness of long-acting methylphenidate, possibly reaching extended dominance over short-acting formulations [17, 18, 64]. Seventh, a randomized open-label study comparing long-acting with short-acting methylphenidate reported a number-needed-to-treat (NNT) of 3.6 to 4.8 to achieve one additional responder (depending on the response criterion applied), consistently below the NNTs synthesized for the assessment (TAR, pp. 226ff.) [11, 65, 66], although this particular study was limited by the absence of teacher-reported outcome ratings.
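As a point of reference, the NNT is simply the reciprocal of the absolute difference in response rates between the formulations compared, so the reported range can be translated back into response-rate differences:

```latex
\mathrm{NNT} = \frac{1}{\Delta p}, \qquad
\Delta p = p_{\text{long-acting}} - p_{\text{short-acting}}
```

An NNT between 3.6 and 4.8 therefore corresponds to an absolute difference in response probability of roughly 1/4.8 ≈ 0.21 to 1/3.6 ≈ 0.28 in favor of the long-acting formulation under the response criteria applied in that study.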

None of these aspects is reflected in the NICE assessment. Instead, it was "assumed that the trial data [referring to double-blind, double-dummy ADHD trials] adequately captures the effect of compliance on response to treatment" (TAR, p. 232). As a consequence, data from highly controlled double-blind RCTs and from randomized pragmatic open-label studies were pooled, necessarily concealing any differentiation on the grounds of treatment compliance. Nevertheless, it was claimed that "the effect of compliance on response rates [...] is reflected in the model" (TAR, p. 250).

Economic model

The economic model (cf. Fig. 1) relied on cost-per-QALY estimates based on response rates, using utility weights taken from one study reporting parent-proxy EQ-5D ratings [67], not from standard gamble experiments as stated in the assessment report (TAR, p. 235). Although the structure of the model implied assumptions about withdrawal rates that caused double-counting of nonresponders (TAR, p. 230), no attention was paid to the uneven effect this modeling approach had on the treatment options evaluated: the fact that extremely low withdrawal rates were assumed for dexamphetamine (TAR, p. 236, based solely on the study by Sharp et al., 1999 [30], with n = 32 girls observed over three weeks, which had initially been excluded: TAR, p. 231; cf. also TAR, pp. 225f. and p. 338) could only bias the modeled "treatment continuation rates" (cf. TAR, pp. 222f.) in favor of dexamphetamine. This source of bias remained unmentioned.

Figure 1

Structure of the economic model. The economic model was composed of modules for each product, which had a common structure and were combined sequentially to reflect "treatment strategies". The structure of the modules implied, inter alia, that the withdrawal rates "due to intolerable side-effects" should be independent from "no response" to treatment. This requirement was violated because intent-to-treat analyses were used to estimate withdrawal rates, which reported "withdrawals" for many reasons, including lack of efficacy, inevitably leading to double-counting of nonresponders (TAR, p. 230). The impact of this phenomenon was unevenly distributed across the treatments evaluated (TAR, p. 231 and p. 236), resulting in a biased assessment [11]. Graphical representation of model reproduced from King et al. 2006, with kind permission.
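A hypothetical numerical sketch may help to clarify the double-counting issue (the figures below are illustrative and not taken from the TAR): if the withdrawal probability is estimated from intent-to-treat withdrawal counts that already include dropouts for lack of efficacy, and is then applied independently of the non-response probability, patients who fail to respond are partly counted twice, and any treatment assigned an unusually low withdrawal estimate benefits correspondingly.

```python
# Illustrative sketch (hypothetical numbers, not the TAR's) of the structural
# issue described above: the module structure assumes withdrawal "due to
# intolerable side-effects" is independent of "no response", but ITT-derived
# withdrawal rates also contain withdrawals for lack of efficacy, so
# nonresponders are partly removed twice.

def continuation_rate(p_response, p_withdrawal):
    """Probability of responding AND not withdrawing, treated as independent."""
    return p_response * (1 - p_withdrawal)

# Suppose a portion of the observed ITT withdrawals was actually due to lack
# of efficacy rather than tolerability.
p_response = 0.65
p_withdrawal_itt = 0.20           # includes efficacy-related dropouts
p_withdrawal_tolerability = 0.14  # the quantity the model structure assumes

print("modelled:", continuation_rate(p_response, p_withdrawal_itt))          # 0.52
print("intended:", continuation_rate(p_response, p_withdrawal_tolerability)) # ~0.56
```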

None of the studies selected for the primary model exceeded an observation period of eight weeks per treatment arm (TAR, p. 226), and the secondary model extensions (referred to as "sensitivity analyses") were likewise informed by trials with observation periods of 12 weeks or less, except for the data extracted from three of the four parallel arms of the MTA study (TAR, p. 254). On this basis, costs and benefits were extrapolated over a time horizon of 12 years. Although the assessment includes a discussion of the sensitivity of findings to the time horizon (TAR, pp. 245ff.), this did not include a review of the long-term sequelae associated with ADHD. No attempt was made to address the impact of the disorder on educational outcomes, injuries and accidents, or other problems such as encounters with the criminal justice system and the burden on caregivers.
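To see why the 12-year horizon matters, consider the following hypothetical illustration (the annual QALY gain of 0.02 and the 3.5% discount rate are assumptions made for the sake of the example, not figures from the TAR): a small per-year gain observed over a few weeks, once assumed to persist, accumulates to roughly ten times as many discounted QALYs over the full horizon as over a single year.

```python
# Hypothetical illustration of the effect of the modeled time horizon: a small
# annual QALY gain, assumed (without long-term data) to persist, is accumulated
# and discounted over 12 years. The inputs are assumptions for illustration.

def discounted_qalys(annual_gain, years, discount_rate):
    return sum(annual_gain / (1 + discount_rate) ** t for t in range(years))

print(round(discounted_qalys(0.02, 12, 0.035), 3))  # ~0.20 QALYs over 12 years
print(round(discounted_qalys(0.02, 1, 0.035), 3))   # 0.02 QALYs within one year
```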

The technology assessment purports to have "clearly identified an optimal treatment strategy" (TAR, pp. 19, 261, 266). The caveats offered relate essentially to a paucity of evidence and poor reporting of studies (e.g., TAR, pp. 18ff., 266), which were blamed for the inability to discriminate between drugs in terms of efficacy or between patients in terms of ADHD subtype, age, gender, or previous treatment (TAR, pp. 266f.).

Conclusion

The NICE ADHD health technology assessment does not provide a complete account of the problem addressed. From an economic perspective, the omission from the analysis of psychosocial interventions, a mainstay of ADHD therapy, is especially disturbing, as estimates of allocative efficiency require all possible options to be considered.

Furthermore, the literature search was incomplete, existing evidence was used in an overly restrictive manner, and neither long-term sequelae nor caregiver burden were addressed. These shortcomings were compounded by outright technical errors, including but not limited to [11] the apparent confusion of efficacy and effectiveness.

Although the assessment may have its uses in listing the existing literature and presenting condensed summaries, it leaves substantial room for improvement. Its main conclusions rest on shaky grounds and are potentially misleading. NICE itself, in its final appraisal determination and in the guidance issued in March 2006, wisely moderated the putatively "clear conclusions" of the technology assessment report [3]. Understandably, the appraisal process following the assessment was unable to overcome the limitations of the latter, which, on the basis of its restricted dataset, could not address the full range of questions specified in its scope (cf. Results, above).

Regarding the technology assessment process, the broad range of observed anomalies can be interpreted as symptoms of specific underlying problems. In the present case, these underlying issues appear likely to have included an insufficient integration of clinical and economic perspectives; the extraordinarily high degree of standardization of NICE technology appraisals, which enforces the computation of clinical outcomes as quality-adjusted life years and thereby requires the clinical problem definition to fit a preconceived solution; and the apparent absence of effective quality assurance systems [11]. Beyond avoiding certain technical errors, a more appropriate evaluation strategy might have made better use of available data on symptomatic improvement such as Conners Rating Scale scores, considered the impact of treatment nonadherence in ADHD, addressed clinical studies and meta-analyses indicating differences in effectiveness between stimulants and nonstimulants, reflected information on the importance of coexisting conditions and functional impairment, and discussed the long-term sequelae associated with ADHD [11].

While there remains a need for more research into the long-term effectiveness and cost-effectiveness of ADHD treatment strategies [68], a new, state-of-the-art systematic review including economic evaluation would also be most welcome.