Spiralling healthcare costs are a major concern for policymakers worldwide. Overuse, underuse and misuse of healthcare are estimated to be responsible for 30% of the total spending on healthcare annually. It has been estimated that 7% of the wasted healthcare spending in the US is due to overtreatment, including test ordering and prescribing [1]. In the years 2004–2011, the average annual growth in the number of prescriptions in the Netherlands was 5.7%; in fact, the growth of the national income of the Netherlands has been smaller than the growth of the healthcare budget year after year [2, 3]. If nothing is done to reduce the growth in healthcare spending, it is feared that Western countries will not be able to pay the healthcare bill in the long term. Therefore, physicians are being targeted by policymakers to contribute to reducing waste in healthcare, and are encouraged to alter their habits.

An unsolved problem with changing professional behaviour is the lack of a clear and solid benchmark for the desired behaviour [4, 5]. This can be overcome by using practice variations as a proxy for quality of care. A certain degree of practice variation is clearly warranted, given the unique profiles of individual patients and practice populations. However, when practice variation is caused by underuse or overuse of care, this results in unwarranted variation and thus inappropriate care [6]. In the Netherlands, general practitioners (GPs) now have access to over 100 evidence-based clinical practice guidelines. These guidelines have been developed by the Dutch College of General Practitioners (NHG) with the aim of reducing unwarranted practice variation and improving the quality of care provided. Although the general adherence to these guidelines seems quite reasonable, viz. approximately 70%, there is considerable practice variation in test ordering and prescribing, indicating room for improvement in poorly performing practices [7–12].

In local quality improvement collaboratives (LQICs), general practitioners meet on a regular basis to discuss current issues and gain new insights concerning test ordering and prescribing behaviour. Healthcare organisations and governments promote these meetings as a means to implement guidelines. LQICs are widely implemented in primary care, mainly in Europe and, to a lesser degree, in North America. Local pharmacists are members of these groups and are well respected for their input and knowledge.

These LQICs are an attractive target for interventions aimed at changing professional behaviour both effectively and efficiently [13–20]. In a robust trial on three clinical topics, Verstappen et al. showed the beneficial effects of a multifaceted strategy involving audit and feedback with peer review in LQICs on test ordering behaviour. They found a reduction in the volumes of tests ordered ranging from 8 to 12% for the various clinical topics [21, 22]. Lagerlov et al. showed that individual feedback embedded in local peer group discussions improved appropriate treatment of asthma patients by 21% and of urinary tract infections by 108%, compared to baseline values [23]. There is also evidence suggesting that the mere provision of information on test fees at the time of order entry reduces the volumes of tests ordered [24].

Most of this evidence, however, stems from trials focussing on a single or limited number of clinical topics, and involving a strong influence of the researcher on the participants, e.g. as moderator during sessions. Moreover, in the Verstappen trial, the included groups were selected by the researcher and can be regarded as innovator groups. We wanted to build on the experiences from the work by Verstappen et al. and undertake a large-scale implementation of the strategy in a pragmatic trial, embedded within the existing network of LQICs, with much room for the LQICs to adapt the strategy to their own needs and without any researchers being present.

We hypothesized that our intervention would reduce inappropriate testing and prescribing behaviour. Our research question was therefore: What is the effect of audit and feedback with peer review on general practitioners’ prescribing and test ordering performance?

We also report the sum scores of volumes of tests and prescriptions in a per-protocol analysis. This analysis was not planned in the study protocol [16], but we decided to add it as the process evaluation of the study revealed that the uptake of the strategy was much lower than expected [25].



We conducted a two-arm cluster-randomised trial with the LQIC as the unit of randomisation and with central allocation. Core elements of the intervention are audit and comparative feedback on test ordering and prescribing volumes, dissemination of guidelines and peer review in quality improvement collaboratives moderated by local opinion leaders [16].

The intervention started in January 2008 and was completed as planned at the end of December 2010. We measured baseline performance during the six months before the intervention, and follow-up performance during the six months after the intervention. The design of this intervention is described in more detail in the trial protocol [16].

Setting and Participants

Recruitment was restricted to the south of the Netherlands, because of our access to prescribing data of GPs working in this area. First, the regional health officers or key laboratory specialists for all 24 primary care diagnostic facilities in the south of the Netherlands were identified and recruited by the first author. They were trained by the researchers in a three-hour session in their region. The objective of this training was to transfer knowledge on effectively discussing test ordering and prescribing behaviour, on setting working agreements, on how to effectively moderate meetings and on how to deal with questions on the validity of the feedback data or other aspects of the intervention. Written and digital materials were also made available to enable them to facilitate recruitment of LQIC groups. The routines of the LQICs were deliberately left unchanged, as they represented normal quality improvement routines in primary care in the Netherlands. Only when test ordering was discussed did a laboratory specialist from the diagnostic facility moderate the group discussion. The strategy was new to all participants in this trial.


In this trial of audit and feedback with peer review in LQICs, we wanted to test the effects of the strategy on test ordering and prescribing behaviour. Aggregated comparative feedback was provided on tests ordered or drugs prescribed in the six months before each meeting in which they were discussed. Feedback was sent to the moderator for that session (the local pharmacist or laboratory specialist). At the start of each meeting, each GP received a feedback report on their own performance, together with an outline of the recommendations from the guidelines, validated by clinical experts (Additional file 1: Appendix 1). The feedback was adjusted for practice size and compared with the aggregated results from their practice, their LQIC group and neighbouring groups (Fig. 1). To mimic the normal situation in self-directing LQIC groups, the groups in both arms were allowed to choose three clinical topics out of a set of five presented to them. The set of five topics differed between the two arms (Table 1). Each group planned two paired meetings for each topic, one on test ordering and one on prescribing, making a total of six meetings. Each meeting lasted between 90 and 120 min, as was usual before the trial, depending on the intensity of the discussion. Groups were encouraged by the trained moderator (see under “Setting and participants”) to establish working agreements to improve their performance, and to discuss barriers to change. The LQICs were allowed to adapt the format of the meetings to their own needs and routines, as long as peer review and working agreements were included. At the end of each meeting, groups were asked to fill out a form stating what working agreements and goals had been set.

Fig. 1
figure 1

Example of the graphical comparative feedback (this image doesn’t reflect actual data)

Table 1 Sets of clinical topics and the number of meetings held for each topic

Feedback reports were generated from two main databases, one on diagnostic tests and one on prescriptions, with data originating from primary care diagnostic facilities and the two dominant insurance companies in the region. The databases contained data on the specific test or drug, the date it was ordered or prescribed, the practice in which the physician who had ordered or prescribed it worked, the date of birth of the patient, their gender and, in the case of prescriptions, the number of defined daily dosages (DDDs) that were prescribed. A more detailed description of the intervention is available in the previously published trial protocol [16].

Data collection and main outcome measures

The primary outcome measures were the volumes of tests ordered and drugs prescribed per practice, per 1000 patients, per 6 months. Although data on a large number of diagnostic tests and prescriptions were available (Additional file 2: Appendix 2), only results on key tests and drugs for each clinical topic are reported in this paper. The identification of these key tests and drugs was based on consensus within the research group and one clinical expert on each topic before the intervention started.
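The primary outcome is a simple normalised rate, which can be made explicit in a few lines; a minimal sketch (the function name and example figures are illustrative, not taken from the study data):

```python
def volume_per_1000(n_events: int, practice_size: int) -> float:
    """Normalise a 6-month count of tests or prescriptions to a
    rate per 1000 registered patients (illustrative helper)."""
    if practice_size <= 0:
        raise ValueError("practice size must be positive")
    return 1000.0 * n_events / practice_size

# e.g. 240 tests ordered over 6 months by a practice of 3000 patients
rate = volume_per_1000(240, 3000)  # 80 tests per 1000 patients per 6 months
```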

Sample size

We calculated that a total of 44 LQICs would be sufficient to detect a standardised effect size (Cohen's d) of 0.5, with a significance level alpha of 0.05, a power of 0.9, an ICC of 0.1 and a mean group size of seven GPs. Anticipating a dropout rate of 10%, we would need to recruit 50 LQICs [10, 16].
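This calculation can be reproduced approximately with the standard two-sample formula inflated by the design effect for clustering; the sketch below illustrates the logic under these assumptions, and small differences from the published figures are possible due to rounding conventions and the exact formula used:

```python
from math import ceil

# standard-normal quantiles for two-sided alpha = 0.05 and power = 0.9
Z_ALPHA = 1.959964   # z_{1 - 0.05/2}
Z_POWER = 1.281552   # z_{0.9}

def clusters_per_arm(d=0.5, icc=0.1, m=7):
    """Clusters (LQICs) per arm needed to detect standardised effect d,
    with the individual-level sample size inflated by the design
    effect 1 + (m - 1) * ICC for clusters of mean size m."""
    n_individuals = 2 * ((Z_ALPHA + Z_POWER) / d) ** 2  # per arm, unclustered
    design_effect = 1 + (m - 1) * icc
    return ceil(n_individuals * design_effect / m)

per_arm = clusters_per_arm()            # roughly 20 LQICs per arm
total = 2 * per_arm                     # roughly 40 LQICs in all
with_dropout = ceil(total / (1 - 0.1))  # allowing for 10% dropout
```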


Prior to the randomisation, the LQICs were stratified by their level of group performance, as assessed by a questionnaire distinguishing four levels of group functioning, from ‘poor’ to ‘good’. The level of group performance may be a confounder for the ability to establish shared working agreements and for the quality of prescribing behaviour [26–29]. By stratifying on this variable, we ensured an equal distribution of these levels over the trial arms. An independent research assistant produced a computer-generated allocation list and allocated the LQICs to arm A or arm B, while the researcher was blinded to this process. Groups in both trial arms were exposed to the same intervention, but on different clinical topics. Each LQIC in one arm served as an unmatched control for the LQICs in the other arm [30, 31]. Groups were blinded to the clinical topics discussed in the other arm. The researcher remained blinded until all data analyses had been completed.

Data analysis

To analyse the intention-to-treat differences between the two arms, we compared performances at the LQIC level for all key tests and drugs during the six months prior to the intervention with performances during the six months after completion of the intervention period. We analysed according to the intention-to-treat principle, regardless of whether a group had chosen a clinical topic or not. In addition, we performed a per-protocol before-and-after analysis to test for effects in the groups that had actually organised a meeting on a specific topic, with all other groups acting as controls in the analysis.

The time intervals for our per-protocol analyses were six months prior to each LQIC meeting, compared with 0–6 months after each LQIC meeting in the case of tests, and 3–9 months after each meeting in the case of prescribing (Fig. 2). By using this washout period we avoided contamination with long-term prescriptions.
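The measurement windows can be expressed as month offsets relative to the meeting date; a minimal sketch (the helper name is illustrative):

```python
def measurement_windows(outcome: str):
    """Return (baseline, follow-up) windows in months relative to the
    LQIC meeting (month 0). Prescribing uses a 3-month washout to
    avoid contamination by long-term (repeat) prescriptions."""
    baseline = (-6, 0)                 # six months before the meeting
    if outcome == "test":
        follow_up = (0, 6)             # 0-6 months after the meeting
    elif outcome == "prescription":
        follow_up = (3, 9)             # 3-9 months after the meeting
    else:
        raise ValueError("outcome must be 'test' or 'prescription'")
    return baseline, follow_up
```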

Fig. 2
figure 2

Graphical display of the periods defined for the baseline and follow-up measurement of tests ordered and drugs prescribed in the per protocol analysis

We used a chi-square test or t-test to check whether the stratification of groups had led to an even distribution of the LQIC group performance levels and the characteristics of the participants over the trial arms and topics. The group effect (intervention versus control) on prescribing rates and test ordering rates after the intervention was assessed using a linear mixed model with the LQIC as a random effect, to account for the clustering of practices within the LQIC. In addition, group (intervention or control), the baseline value of the outcome measure (before the intervention) and the interaction between baseline and group were included as fixed factors. If the interaction term was not statistically significant, it was removed from the model, and only the overall group effect is presented. This method was used for both the intention-to-treat analysis and the per-protocol analysis.

If the interaction term was statistically significant, the overall group effect (obtained from the model without the interaction term) and the group effects for different baseline values, at the 10th and 90th percentiles of the baseline variable, are presented to assess the effects at both ends of the spectrum. We expected that change would mostly be seen in GPs in the 90th percentile, as this is a clear indication of overuse and marks a need to decrease test ordering or prescription volumes. Although it is not clear what the benchmark is for volumes of tests and prescriptions, we did not expect GPs at the other end of the spectrum—the 10th percentile—to clearly fail in terms of underuse of tests and prescriptions.
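The modelling procedure described above can be sketched as follows, here with the statsmodels formula interface on simulated data; variable names are illustrative and the published analysis was run in SPSS, so this is an illustration of the procedure rather than the actual analysis code:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_lqic, n_per = 20, 4                     # 20 LQICs, 4 practices each
df = pd.DataFrame({
    "lqic": np.repeat(np.arange(n_lqic), n_per),
    "group": np.repeat(np.arange(n_lqic) % 2, n_per),  # arm per LQIC
    "baseline": rng.normal(100, 20, n_lqic * n_per),
})
# simulated follow-up: baseline carry-over, a group effect and
# LQIC-level clustering
cluster_eff = rng.normal(0, 5, n_lqic)[df["lqic"]]
df["followup"] = (0.8 * df["baseline"] - 6 * df["group"]
                  + cluster_eff + rng.normal(0, 5, len(df)))

# fixed effects: group, baseline and their interaction;
# random intercept per LQIC to account for clustering of practices
m_full = smf.mixedlm("followup ~ group * baseline", df,
                     groups=df["lqic"]).fit()
if m_full.pvalues["group:baseline"] > 0.05:
    # interaction not significant: drop it and report only the
    # overall group effect, as in the analysis plan
    m_final = smf.mixedlm("followup ~ group + baseline", df,
                          groups=df["lqic"]).fit()
else:
    # interaction significant: keep it and report group effects at
    # the 10th and 90th percentiles of the baseline values
    m_final = m_full
```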

All data analyses were performed using IBM SPSS Statistics for Windows, Version 21.0 (Armonk, NY: IBM Corp). P-values ≤ 0.05 were considered statistically significant.

Exclusion of data before analysis

At the time when the intervention was designed, the recommendations for dyslipidemia and type 2 diabetes mellitus were provided in two separate guidelines. However, at the start of the actual intervention, the guidelines on diabetes management and dyslipidemia treatment were merged into one new, multidisciplinary national guideline on cardiovascular risk management. This was directly followed by a massive government-led intervention to transfer care for diabetics and cardiovascular risk patients from specialist care to GPs. Part of this transfer was the institution of a pay-for-performance model for these two topics, together with the introduction of many outcome indicators tracked by newly introduced and completely integrated software [32]. As part of this campaign, much publicity was created in both the professional and public media. This caused substantial contamination of our intervention contrast, resulting in an inability to interpret the results on these clinical topics. Therefore, we chose to exclude the topics of dyslipidemia and type 2 diabetes mellitus from the analyses. The results on test ordering for the clinical topics of urinary tract infections (UTI) and stomach complaints could not be calculated either, because the laboratories and diagnostic facilities provided insufficient data on the relevant tests (e.g., urine cultures and gastroscopies). Results on prescription rates for chlamydia are not shown because it proved impossible to link the prescribed antibiotics reliably to this condition. This problem did not occur for UTI, as we confined the data to nitrofurantoin and trimethoprim, antibiotics that are only indicated and prescribed for UTI treatment in Dutch primary care.



Out of the 24 primary care diagnostic facilities (laboratories) we approached, 12 actually managed to recruit 21 LQICs for the trial (Fig. 3). The other facilities did not manage to recruit groups for various reasons, which are described in detail in the process evaluation [25]. The 21 groups consisted of 197 GPs working in 88 practices, and 39 community pharmacists. Eight laboratory specialists participated in the groups when test ordering was being discussed. The characteristics of the participating groups and their GP members are described in Table 2.

Fig. 3
figure 3

Flowchart of recruitment of laboratories, laboratory specialists or regional health officers and their recruitment of LQICs with the number of GPs in brackets

Table 2 Characteristics of the participating GPs and groups

Results of the intention-to-treat analysis

The intention-to-treat before-and-after analyses on test ordering did not show any differences between the intervention and control groups, with wide confidence intervals, all including 0, and all with p-values well above 0.05 (Table 3). Interaction with the baseline values was present for rheumatic complaints, showing a difference in the desired direction in the number of tests ordered, but only for practices within the p90 range of test ordering.

Table 3 Intention-to-treat analysis of changes in test ordering rates

The intention-to-treat analysis on drug prescriptions showed a difference in the desired direction for misoprostol only. Interaction with baseline values was present for misoprostol, the triple therapy for Helicobacter pylori eradication (PantoPac®), antithyroid preparations and clonidine (Table 4), showing changes in the desired direction for misoprostol, antithyroid preparations and clonidine. This effect was not present for the prescription rates of triple therapy at either p10 or p90.

Table 4 Intention-to-treat analysis of changes in prescribing rates showing sum scores per clinical topic and scores per drug

Results of the per-protocol analysis

Table 5 shows the results of the per-protocol analyses on test ordering volumes for all groups that covered a specific topic (intervention group) compared to all other groups (controls). We found a difference between both trial arms in the desired direction in test ordering only for thyroid dysfunction and perimenopausal complaints.

Table 5 Per-protocol analysis of change in test ordering rates for each topic

Testing for interaction with baseline measurements showed a difference in test ordering rates in the desired direction for those GP practices with a baseline test-ordering rate at or above p90 for chlamydia infections, rheumatic complaints and perimenopausal complaints.

Table 6 shows the results of the per-protocol analysis on prescribing performance. A difference in the overall volume of all prescribed drugs between intervention groups and their controls was observed for medication prescribed for prostate complaints, stomach complaints and thyroid dysfunction. For each of the clinical topics, we also analysed each Anatomical Therapeutic Chemical Classification System (ATC) group in that topic separately, as changes found for specific ATC groups could represent clinically relevant changes. These results are shown in more detail in Table 6. Testing for interaction with baseline measurements again showed a statistically significant interaction for several topics and specific ATC groups. All showed larger differences in prescribing rates before and after the intervention for the practices in the p90 range than for those in the p10 range (Table 6).

Table 6 Per-protocol analysis of change in prescribing rates, showing sum scores per clinical topic and drug


Summary of the main findings

Our study found that the beneficial results obtained in earlier, well-controlled studies on audit and feedback with peer review in LQICs in primary care were not confirmed when we introduced this intervention in existing primary care LQICs. The per-protocol analyses showed that GPs from practices with the highest baseline volumes of test ordering and prescribing achieved the largest improvements.

Many participant- and context-related factors can be identified as possible explanations for the lack of overall effects of our intervention. Lack of confidence in, and adherence to, the strategy emerged during the trial: the origin and validity of the feedback was questioned by some participants, while others felt that the intervention was too complex and too ambitious. Although we provided complete transparency on the data sources and instructed the moderators in this respect, we learned from the process evaluation that the source of the feedback was often not clear to the participants [25]. We found that many groups failed to set achievable and measurable working agreements. More than half of the meeting reports we received from groups did not contain specific, achievable, realistic or measurable working agreements. What also seems to have occurred is the phenomenon of groups choosing quality improvement topics on which they already showed good performance, the Sibley effect [33].

In a recently published article on how to provide feedback effectively, Brehaut et al. provide 15 suggestions for designing and delivering effective feedback [34]. The feedback we provided meets all these suggestions but three. First, providing the feedback as soon as possible and at optimal intervals was not possible in our trial, as we provided feedback on demand from the LQICs. Secondly, we did not provide the feedback in more than one way, which could indeed have been helpful. The last suggestion we (partially) missed is to provide short key messages with an option to obtain extra detailed information on demand. This was impossible in this trial, given the use of peer review as a means to discuss the feedback. Had we provided key messages on individual feedback, we would have impaired, or at least influenced, the peer review process, an essential part of LQIC work. When we look at the criteria for effective audit and feedback as defined by Ivers et al. in their somewhat older Cochrane review, we conclude that we meet most criteria, except including exclusively practices with poor baseline performance and providing predefined goals [35]. Including only practices with poor baseline performance would have forced us to leave the stable and safe environment of the existing LQICs. The peer review effect, whereby poorly performing GPs can learn from role models, would have been impaired, and the negative effects of an organisational reform would have occurred [8, 36, 37]. The provision of aggregated results from their own and neighbouring groups, together with the recommendations from clinical guidelines, can be regarded as implicit goal setting [21]. However, we acknowledge that this differs to a considerable extent from the definition of predefined goals suggested by Ivers et al. [35].

We clearly underestimated the influence of a healthcare reform that was launched shortly after the start of this trial. As a result, much specialist care was transferred to primary care, with Dutch GPs earning a higher income but at the same time feeling threatened in their autonomy and time management. Last but not least, GPs are more than ever controlled by external parties such as the health inspectorate and healthcare insurers, which may have led to a more defensive attitude among GPs, resulting in higher test ordering rates.

Strengths and limitations of the research methods used

Lack of power: our efforts to implement the strategy widely in the southern part of the Netherlands failed to recruit a sufficient number of groups for the trial, leaving us with an underpowered study. This could, in part, have been caused by the pragmatic character of our trial, with local pharmacists and experts on diagnostics leading the recruitment effort and moderating the groups. A major healthcare reform programme was launched shortly after our recruitment started [38], causing frustration among many GPs due to the resulting high administrative burden. This probably reduced their willingness to participate in our trial.

Choice of outcome and lack of quality indicators: we chose to express the volume of prescribed drugs in DDDs. A risk of this choice is that not all DDDs correspond to the actual dosages physicians prescribe to patients. For diclofenac, for instance, the normal dosage is 1.5 to 2 DDDs per day. However, this did not affect the comparability of the two groups, as both were affected by this form of distortion in the same way. If we had been able to provide feedback on quality indicators as well as volume data, a more valid insight into performance might have resulted, but with more interpretation problems for the GPs. Moreover, feedback on volume data alone has been shown to lower volumes, especially in areas characterised by overuse [22].
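The DDD distortion mentioned above can be illustrated with a small conversion helper; a sketch assuming the WHO DDD for oral diclofenac of 100 mg (the helper name is ours):

```python
def ddds_prescribed(total_mg: float, ddd_mg: float) -> float:
    """Convert a prescribed amount of drug (in mg) to defined
    daily dosages (DDDs)."""
    return total_mg / ddd_mg

# diclofenac: WHO DDD = 100 mg, but a typical prescription is
# 150-200 mg/day, i.e. 1.5-2 DDDs per treatment day
week_at_150mg = ddds_prescribed(7 * 150, 100)  # 10.5 DDDs for a 7-day course
```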

Change of the protocol after its publication: the fact that all groups in the intervention arm were analysed according to the intention-to-treat principle, as if they had been exposed to a topic whether or not they had chosen it, may have diluted the effects of the intervention. We would rather have analysed the effect of the intervention on changes in the direction of the shared working agreements, as stated in the protocol, but as these were hardly ever established, we decided, after the protocol had been published, to use a per-protocol analysis as the second-best option.

Minimization of the Hawthorne effect: a strong point of the design we used is that it minimized the Hawthorne effect. On the downside, it might have caused contamination of the effects if we had not exposed all groups in the trial to the intervention at the same time [30, 39, 40].

Comparison with other studies

Evaluation of the large-scale implementation of a quality improvement strategy of proven effectiveness using a pragmatic design like ours has not often been performed. Earlier and ongoing work has focused mainly on one particular clinical topic (e.g., prescribing antibiotics for respiratory tract infections or X-rays for low back pain patients), while we applied the peer review strategy to a broad range of topics and focussed on both test ordering and prescribing behaviour [41–45]. By investigating whether the results of more fully controlled trials were also found in large-scale implementation, we sought to contribute to the knowledge on ways to improve professional performance.

We are not aware of similar multi-faceted studies using audit and feedback with peer group discussion in this field that would allow direct comparison with our study, although much is known about the individual components we combined in our study.

Much work has been done on evaluating the effects of audit and feedback on both test ordering and prescribing behaviour in well-controlled trials. These interventions show a modest but statistically significant positive effect on changing professional behaviour, but the heterogeneity of the trials prevents solid conclusions from being drawn [35, 41, 46–52]. Although audit and feedback on test ordering behaviour embedded in peer review in small groups has been found to be more effective than audit and feedback alone, it generally remains unclear exactly which factors contribute to this effect [10, 21, 53, 54]. The use of pragmatic designs in quality improvement research contributes to bridging the gap between academia and clinical practice [55, 56].

Multifaceted interventions like ours are complex by nature but seem attractive because the individual effects could add up. It remains unclear, however, whether multifaceted interventions or single interventions are more effective. Mostofian concluded in a review of reviews that multifaceted interventions are most effective in changing professional behaviour [57]. On the other hand, Irwin et al. concluded that there is no evidence for a larger effect of combined interventions, while Johnson and May find it likely that multifaceted interventions are more effective [52, 54].

Studies embedding the discussion of clinical topics in LQICs have reported a modest positive effect on prescribing costs and quality [14, 58–62]. Our finding, based on the per-protocol analysis, that groups with the highest volumes at baseline showed the largest improvement is in line with the results presented by Irwin et al. [52].

Implications for future research

The problems with the fidelity of the feedback and with the uptake of the intervention could best be handled by ensuring that a strong leader takes charge of the group and leads it forward. It may also be helpful to identify GPs with a low-quality baseline performance, representing an unwarranted deviation from the mean, and to target those GPs in this type of quality improvement initiative. Other physicians, who are already doing well, can concentrate on what they are doing already: delivering high-quality care. Further research is needed on whether low baseline performance is consistent behaviour for an individual GP. Further research is also needed on the cut-off point for participants who can benefit from a quality improvement intervention like this, to clarify which population is best targeted. Potential downsides of such an approach, such as the loss of peer learning from best practices, need to be addressed as well.

Further pragmatic research should be performed to confirm our finding that the results of earlier well-controlled trials are not easily replicated. We therefore encourage other researchers to perform rigorous large-scale evaluations of complex implementation strategies, preferably embedded in and owned by the field, as we did.


Our intervention, which aimed to change the test ordering and prescribing behaviour of GPs by means of audit and feedback embedded in LQICs, with academia at a distance, shows that the favourable results of earlier work could not be replicated. Large-scale uptake of evidence-based but complex implementation strategies, with a minimum of influence from external researchers but with the healthcare stakeholders themselves being responsible for the work involved in integrating the intervention into their own groups, appeared not to be feasible. Although our study suffered from a lack of power, we expect that even if a sufficient number of groups had been included, no clinically relevant changes would have been observed.