Highlights

  • We combine latent class and mixed logit models to study heterogeneity in general practitioners’ preferences elicited from a discrete choice experiment

  • We demonstrate that general practitioners exhibit substantive heterogeneity in preferences for quality improvement programs, notably for pay-for-performance

  • We show that the majority of physicians dislike the implemented pay-for-performance program, and would favour non-financial interventions

Background

Quality improvement programs (QIPs) are an increasingly popular approach for enhancing the quality of physician practice in ambulatory care [1–3]. However, available evidence suggests that QIPs, whether they focus on or combine financial, non-financial or organizational components, have modest and variable impacts on quality of care [4–6]. Beyond methodological differences between studies, this observed heterogeneity results from the target and design of the QIPs, as well as from variability in physicians’ responsiveness to the programs [7–10]. Within a single program, differences in physicians’ reactions may be explained by differences in contextual constraints, as well as in knowledge of or attitudes towards the QIP [9, 10].

Physicians’ preferences for QIPs are particularly important given that, in many cases, physicians’ participation is voluntary, so their willingness to enrol is necessary for the program to succeed. From 2009 to 2011, the French Statutory National Health Insurance implemented a voluntary QIP (Contract for Improved Individual Practice – CAPI) aimed at general practitioners (GPs), which combined pay-for-performance (P4P) and quarterly performance feedback. Although the program could only increase their income, only one-third of all French GPs had registered a year and a half after its implementation, and the program was subsequently cancelled due to its unpopularity (Footnote 1). While GPs’ ethical concerns with the program design were one key explanation of the low take-up of the CAPI [11], a QIP better designed to meet physicians’ work-related needs might have been more successful.

Health economists have thoroughly studied physicians’ preferences regarding their job characteristics [12, 13], sometimes accounting for preference heterogeneity [14–16]. Yet no studies, to the best of our knowledge, have specifically examined physicians’ preferences for QIPs and their components. While recent studies have focused on designs of QIPs that would be effective irrespective of the targeted physicians [6, 17, 18], understanding these physicians’ preferences may allow for fine-tuning of the programs and improve acceptance. Moreover, understanding the heterogeneity of physicians’ preferences about QIPs may help policymakers tailor and diversify their programs to better match the needs of their targeted population.

The objectives of this study are to elicit heterogeneity in physicians’ preferences for the components of QIPs and, by policy simulation, to compare the potential and differential impact on physician welfare of various QIPs, including the French CAPI. To do so, we conduct a discrete choice experiment (DCE) on a sample of French GPs.

Methods

Data and the discrete choice experiment

DCE design

Discrete choice experiments are widely used in the health economics literature to assess preferences [19]. Our study followed the recommended steps [20] as described below.

The first step of a DCE is to select the attributes of interest and their levels. We selected attributes based on a literature review on QIPs and on two criteria: supposed efficacy suggested by the literature and credibility of application in the French health care context (see Table 1). For concreteness, we focused on preventive care, a key quality indicator. Following the same two criteria, one level of each attribute was defined to reflect the CAPI. The relevance of the list of attributes, of their number and of their levels was confirmed in a focus group of ten representative GPs [21] (Footnote 2). This led to a final list of eight attributes, presented in Table 2.

Table 1 Interventions used in quality improvement programs for GPs
Table 2 List of attributes and levels

The second step is to combine attributes into choice sets. Most of the time, the combination relies on experimental design theory, since a full factorial design implies proposing too many choices to respondents [22] (864 scenarios in our case). Using JMP software, we generated an orthogonal design [23] that resulted in 24 scenarios and achieved the properties of orthogonality and level balance. All other analyses were done with Stata. In order to facilitate respondents’ choices, we relied on a common comparator selected from these 24 scenarios, ensuring that this reference scenario was not strictly dominant a priori [24]. Choice sets were constructed by pairing the comparator with each of the remaining scenarios, which resulted in 23 choices between pairs of combinations of quality interventions. The 23 choice sets were randomly divided into four blocks so that each respondent made 5 or 6 choices [25] (Footnote 3). To limit non-response and the subsequent loss of statistical efficiency, we did not include an opt-out option. An example of a choice set is provided in Appendix 1.
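The blocking step described above can be illustrated with a short sketch. It only shows how 23 choice sets split into four blocks of 5 or 6; the function name, the seed and the round-robin dealing are assumptions made for illustration, not the procedure actually used in the study.

```python
import random

def assign_blocks(n_choice_sets=23, n_blocks=4, seed=0):
    """Randomly split choice-set ids 1..n_choice_sets into near-equal blocks."""
    rng = random.Random(seed)  # arbitrary seed, for reproducibility only
    ids = list(range(1, n_choice_sets + 1))
    rng.shuffle(ids)
    # Deal the shuffled ids round-robin so block sizes differ by at most one.
    return [sorted(ids[b::n_blocks]) for b in range(n_blocks)]

blocks = assign_blocks()
block_sizes = [len(block) for block in blocks]
```

With 23 choice sets and four blocks, three blocks receive 6 choice sets and one receives 5, which matches the "5 or 6 choices" per respondent.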

Finally, the DCE was pilot tested with a focus group of self-employed GPs to validate the phrasing of the attributes and then pre-tested (n = 100 GPs) to verify that the reference scenario was not strictly dominant.

Data

The DCE questionnaire is composed of three parts. In the first part, questions regarding the GP’s opinion about health care reforms in general practice and the public health role of GPs are used as a warm-up. The second part is the choice experiment. The third part collects sociodemographic and professional information about each GP. The questionnaire was self-administered during the summer of 2009 in a postal survey, with one reminder sent to non-respondents.

The population under study consists of all the GPs in active practice in one French geographic region (Footnote 4; N = 1368). After the pre-test, the questionnaires were sent to the 1268 remaining physicians. A total of 303 questionnaires were returned completed, resulting in a response rate of 22 %. This response rate is consistent with other DCE studies [26–28] and with self-administered postal surveys to French general practitioners [29].

GPs working in a rural setting are slightly overrepresented in our sample (see Table 3). The responding GPs are also more active, with the weekly number of acts being significantly higher than the national mean (Footnote 5). With these exceptions, our sample compares well with the reference population. Of course, our methodology does not allow for national representativeness.

Table 3 Descriptive statistics

With the exception of the level of remuneration, all attributes of the DCE are coded using “effects coding” [30]. We constructed the questionnaire in order to test the symmetry [31], completeness and continuity axioms [32] (Footnote 6) and found that the axioms are largely respected: the first by all respondents, and the other two by 82 % and 65 % of the respondents, respectively. Following current practice, we kept all the responses for the analysis [32–34].
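As a reminder of what effects coding does, the sketch below codes a three-level attribute. The attribute levels shown are hypothetical and are not taken from Table 2; the only difference from dummy coding is that the reference level is coded -1 on every column rather than 0.

```python
def effects_code(level, levels):
    """Return the effects-coded vector for `level`.

    The last entry of `levels` is treated as the reference level and is
    coded -1 on every column (it would be coded 0 under dummy coding).
    """
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    *coded_levels, reference = levels
    if level == reference:
        return [-1] * len(coded_levels)
    return [1 if level == name else 0 for name in coded_levels]

# Hypothetical three-level attribute, for illustration only.
levels = ["low", "medium", "high"]
coded = {level: effects_code(level, levels) for level in levels}
```

Because each column sums to zero across levels, the implied coefficient of the reference level is minus the sum of the estimated coefficients, and the intercept is not confounded with the reference levels of the attributes.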

Econometric framework

Modelling heterogeneity

The analysis of DCE data relies on classical choice models and random utility theory (RUT) [35]. When applying the DCE approach, the utility that individual n derives from choosing alternative i in choice situation t can be written as

$$ {U}_{nit}={V}_{nit}+{\varepsilon}_{nit} $$

where \( V_{nit} = \sum_{k=1}^{K} \beta_k x'_{nitk} \) is the deterministic part of the utility (with K attributes), observable to the researcher and sometimes referred to as the indirect utility, and \( \varepsilon_{nit} \) is the unobservable, stochastic part, which is treated as random (Footnote 7). The individual will choose the alternative yielding the highest utility.

The conditional logit is the most commonly used method to analyse DCE data, but relies on restrictive assumptions on the stochastic terms [23], fails to incorporate the panel structure of most DCE data and does not account for preference heterogeneity. The two principal models that circumvent these limitations are the mixed logit (MXL) [36, 37] and the latent class model (LCM) [38].
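For readers less familiar with the conditional logit, the following sketch computes choice probabilities for one choice set. It assumes the standard logit form that follows from i.i.d. extreme value errors; the coefficient values and attribute vectors are hypothetical and are not estimates from this study.

```python
import math

def indirect_utility(beta, x):
    """Deterministic utility V = sum_k beta_k * x_k for one alternative."""
    return sum(b * xk for b, xk in zip(beta, x))

def logit_probabilities(beta, alternatives):
    """Conditional logit choice probabilities for one choice set."""
    v = [indirect_utility(beta, x) for x in alternatives]
    v_max = max(v)  # subtract the maximum for numerical stability
    exp_v = [math.exp(vi - v_max) for vi in v]
    total = sum(exp_v)
    return [e / total for e in exp_v]

# Hypothetical coefficients and a two-alternative choice set.
beta = [0.5, -1.0]
choice_set = [[1, 0], [0, 1]]
probs = logit_probabilities(beta, choice_set)
```

Every respondent shares the same `beta` here, which is precisely the homogeneity assumption that the mixed logit and latent class models relax.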

The choice between these two models critically depends on expectations about the variation of preferences [39]: if researchers expect preferences to vary greatly between individuals, the MXL is preferred; the LCM is preferred if individuals are thought to be grouped in homogeneous latent groups. However, the information the models provide is complementary: MXL provides information about how heterogeneity is distributed relative to each attribute while LCM informs on the heterogeneity among latent subgroups of physicians. Thus, we elect to run both MXL and LCM.

The unconditional probability of a mixed model that allows for individual-specific variation in tastes and accounts for the panel dimension of choices is as follows [40]:

$$ P_{nI}(\theta) = \int S_{nI}(\beta)\, f(\beta \mid \theta)\, d\beta $$

where \( S_{nI}(\beta) = \prod_{t=1}^{T} \left[ \frac{\exp(\beta' x_{nit})}{\sum_{j=1}^{J} \exp(\beta' x_{njt})} \right] \) is the conditional probability that individual n makes the choice sequence \( I = \{i_1, \ldots, i_T\} \), and \( f(\beta \mid \theta) \) is the density function of the individual-specific β with distribution parameters θ (see [40] for more on the family of mixed models).

Preference heterogeneity is reflected in the density function, f(β|θ), and the distribution of β can be either continuous or discrete, implying MXL or LCM, respectively.
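To make the distinction concrete, the two cases can be written out as follows. The notation b, W, C and \( \pi_c \) is introduced here for illustration only and is an assumption, not notation used elsewhere in the paper. With normally distributed coefficients (mean vector b, covariance W), the MXL probability is

$$ P_{nI}(b, W) = \int S_{nI}(\beta)\, \phi(\beta \mid b, W)\, d\beta $$

whereas with a discrete distribution over C classes, each with coefficient vector \( \beta_c \) and membership probability \( \pi_c \), the LCM probability is

$$ P_{nI}(\theta) = \sum_{c=1}^{C} \pi_c\, S_{nI}(\beta_c), \qquad \sum_{c=1}^{C} \pi_c = 1 $$

In the first expression the integral is taken over a continuum of possible tastes; in the second it collapses to a finite weighted sum.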

The other major difference between the models is the estimation method. Each model relies on log-likelihood maximization, with the log-likelihood given by \( LL(\theta) = \sum_{n=1}^{N} \ln P_{nI}(\theta) \). Unlike in the LCM, the integral in this expression has no closed form in the MXL, so simulation methods are used to approximate it [38, 40].
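A minimal sketch of the two probability computations is given below, assuming independent normal coefficients and plain pseudo-random draws (applied work often uses quasi-random draws such as Halton sequences instead). All function names, the toy panel and the parameter values are hypothetical; this is not the estimation code used for the paper.

```python
import math
import random

def sequence_probability(beta, panel):
    """S_nI(beta): product over choice situations of logit probabilities.

    `panel` is a list of (alternatives, chosen_index) pairs for one respondent.
    """
    prob = 1.0
    for alternatives, chosen in panel:
        v = [sum(b * x for b, x in zip(beta, alt)) for alt in alternatives]
        v_max = max(v)  # stabilise the exponentials
        denom = sum(math.exp(vi - v_max) for vi in v)
        prob *= math.exp(v[chosen] - v_max) / denom
    return prob

def simulated_probability(means, sds, panel, draws=500, seed=0):
    """MXL: average S_nI over draws of beta from independent normals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        beta = [rng.gauss(m, s) for m, s in zip(means, sds)]
        total += sequence_probability(beta, panel)
    return total / draws

def latent_class_probability(class_shares, class_betas, panel):
    """LCM: finite mixture of S_nI over classes, no simulation needed."""
    return sum(share * sequence_probability(beta, panel)
               for share, beta in zip(class_shares, class_betas))

# Toy respondent: two choice situations with two alternatives each.
panel = [([[1, 0], [0, 1]], 0), ([[1, 1], [0, 0]], 1)]
p_mxl = simulated_probability([0.5, -1.0], [1.0, 1.0], panel)
p_lcm = latent_class_probability([0.6, 0.4], [[0.5, -1.0], [-0.5, 1.0]], panel)
```

The log-likelihood is then the sum over respondents of the logarithm of these probabilities; for the MXL this gives a simulated log-likelihood whose accuracy depends on the number of draws.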

Simulating policy

The goal of the policy simulation is to evaluate the effects of changes in the three main components of a QIP (financial, non-financial and organizational). We use the compensating variation (CV) method to measure the relative impact of such changes on GPs’ welfare [41, 42].

The CV is calculated from the utility estimates obtained from the regressions, using the following expression [41]:

$$ CV = -\frac{1}{\beta_w}\left[ \ln \sum_{j=1}^{J} \exp\left(V_j^0\right) - \ln \sum_{j=1}^{J} \exp\left(V_j^1\right) \right] $$

where \( \beta_w \) is the marginal utility of income, \( V_j^0 \) is the indirect utility for each option j before the policy change and \( V_j^1 \) the same after the policy change. In our case, we consider only two policy options at a time: the CAPI versus an alternative. The formula then simplifies to [20]

$$ CV=-\frac{1}{\beta_w}\left[{V}_j^0-{V}_j^1\right] $$
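Both expressions are straightforward to compute once the indirect utilities are known. The sketch below implements them as written above; the numerical values are hypothetical and chosen only to show that the general log-sum formula reduces to the simplified one when a single option is considered before and after the change.

```python
import math

def cv_logsum(beta_w, v_before, v_after):
    """General CV: difference of log-sums over the available options."""
    logsum_before = math.log(sum(math.exp(v) for v in v_before))
    logsum_after = math.log(sum(math.exp(v) for v in v_after))
    return -(1.0 / beta_w) * (logsum_before - logsum_after)

def cv_single(beta_w, v_before, v_after):
    """Simplified CV when one option is compared with one alternative."""
    return -(1.0 / beta_w) * (v_before - v_after)

# Hypothetical values: marginal utility of income and two indirect utilities.
beta_w = 0.5
v_capi, v_alternative = -1.0, 0.5
cv = cv_single(beta_w, v_capi, v_alternative)
```

A positive CV means that the alternative is preferred to the baseline; the result is expressed in the units of the remuneration attribute.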

The question of heterogeneity is evaluated by estimating CV for each latent group of physicians with LCM. For MXL, we compute and compare CV for the specific attributes where GPs exhibit significantly heterogeneous preferences (e.g. those GPs obtaining positive versus negative marginal utility from the attribute).

Model specification

We include an intercept in all models. This alternative-specific constant (ASC) is necessary since choices are made relative to a fixed comparator (the constant scenario) [30, 42]. In our case, this ASC has no natural interpretation and is expected to be statistically insignificant [12].

When specifying a mixed logit, it is critical to choose which parameters are allowed to vary and which distribution they follow. The normal and log-normal distributions are the most commonly used for the random coefficients [39, 40, 43]. As the log-normal distribution is criticised for its long right tail [37, 44], we choose the normal distribution (Footnote 8).

The possibility to specify the coefficients as random is one of the great strengths of the MXL. The ASC is fixed since it has no reason to vary between respondents. Fixing the coefficient of the monetary attribute (the remuneration) has several advantages [45]; in our case, the main one is the ability to calculate CV. However, the possibility of significant preference heterogeneity in terms of remuneration cannot be ruled out and should be considered in order to fully understand physicians’ preferences. That some GPs place less value on payment can be explained, among other ways, within an intrinsic motivation framework. We therefore run two MXL models: one with all coefficients normally distributed except the constant and the remuneration coefficient (MN1), and the other with only the constant term fixed (MN2).

Without an intuitive way to choose the number of latent classes in LCM, the decision is often made on the basis of goodness-of-fit measures [27, 39]. We use the Akaike (AIC), Bayesian (BIC) and consistent Akaike (CAIC) information criteria.
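The three criteria are simple functions of the maximised log-likelihood, the number of estimated parameters and the sample size. The sketch below uses their standard definitions; the input values are hypothetical, and whether the sample size should be the number of respondents or the number of choice observations is a modelling choice not settled here.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, BIC and CAIC; lower values indicate better fit."""
    deviance = -2.0 * log_likelihood
    return {
        "aic": deviance + 2.0 * n_params,
        "bic": deviance + n_params * math.log(n_obs),
        "caic": deviance + n_params * (math.log(n_obs) + 1.0),
    }

# Hypothetical inputs, not values from Table 4.
ic = information_criteria(log_likelihood=-100.0, n_params=5, n_obs=303)
```

The BIC and CAIC penalise additional parameters more heavily than the AIC as the sample grows, so they tend to favour fewer classes.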

The results for the selection of the number of classes are presented in Table 4. The BIC and CAIC show that the best fit is obtained with four classes, a number we retain for the following analyses (Footnote 9).

Table 4 Selection of the number of classes for the LCM

Results

Heterogeneity in GPs’ preferences

The estimation results for the mixed logit are presented in Table 5. The sign, significance and magnitude of the mean coefficients are very stable between the two models (MN1 and MN2), underlining the robustness of the results. The ASC is not significant, indicating that respondents made their choices only on the basis of the listed attributes, which is consistent with a correctly specified model. The estimates reveal the existence of preference heterogeneity among GPs that is concentrated on a few attributes.

Table 5 Estimation of the mixed logit models

The standard deviations are significant for pay-for-performance and assistance by NPP in model MN1. In MN2, this is also the case for the application of guidelines, the type of practice, and the level of remuneration. The heterogeneity in preferences for pay-for-performance is particularly relevant. This remuneration scheme is a source of marginal disutility at the mean but is positively valued by 22 % and 24 % of physicians (in MN1 and MN2, respectively). These figures are consistent with the proportion of French GPs who chose to join the CAPI (around 30 % [11]). It is also worth noting that the indifference to assistance by NPP at the mean masks strong heterogeneity: 60 to 62 % of GPs would like to benefit from this kind of assistance. Finally, even the amount of remuneration is marked by heterogeneity, with 14 % of physicians not valuing an increase in income for the targeted activities (MN2).
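Shares of this kind follow directly from the normality assumption: if a coefficient is distributed N(μ, σ²), the proportion of respondents with a positive coefficient is Φ(μ/σ). The sketch below shows the calculation; the mean and standard deviation used are hypothetical and are not the estimates reported in Table 5.

```python
from statistics import NormalDist

def share_positive(mean, sd):
    """Share of respondents whose normally distributed coefficient is > 0."""
    return 1.0 - NormalDist(mean, sd).cdf(0.0)

# Hypothetical mean and standard deviation of a random coefficient.
share = share_positive(mean=-0.77, sd=1.0)
```

With these illustrative values, about 22 % of respondents would have a positive coefficient even though the mean is negative.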

The latent class model estimates are presented in Table 6. In all classes, the ASCs are insignificant. For the first class, the only significant attributes are continuing education and assistance by NPP. Continuing education has a positive effect on indirect utility while assistance by NPP has a negative one. In the second class, the pattern of significance is slightly different. While continuing education remains significant, this time it has a negative effect. GPs in this class prefer higher payment and prefer to be paid more often, as the sign and significance of the frequency attribute attest. They dislike the flat-rate payment (forfait) but are indifferent to pay-for-performance. They also prefer solo practice. All attributes are significant for classes 3 and 4, but the two classes behave differently. The doctors in these two latent classes place negative value on alternative payment relative to fee-for-service (FFS) while preferring more frequent payment. They also prefer to work in groups. They differ with respect to all the other attributes. In contrast to the third class, an increase in remuneration has a negative effect on indirect utility in the fourth class. Class 3 physicians disvalue all types of clinical guidelines but positively value continuing education and information feedback, contrary to class 4. Physicians in the fourth class value assistance by NPP while those in the third class do not. Combined with the preference for group practice in both classes, this result suggests a preference for physician-only groups in class 3 and for multidisciplinary teams in class 4.

Table 6 Estimation of the latent class logit model – 4 classes

At this point it is worth comparing the results of the two kinds of models. One of the major conclusions, holding in both MXL and LCM, is the negative impact on indirect utility of an increase in remuneration observed for some GPs. This shows that the result is not merely a statistical artefact of using a normal distribution in the MXL [39]. The MXL underlined heterogeneity of preferences for P4P. This heterogeneity is also found in the LCM, with the third and fourth classes disliking this payment while the coefficient is positive in the second class (but significant only at 10 %). The strong difference in preferences for assistance by NPP found in MXL is also seen in LCM: the negative coefficients in classes 1 and 3 contrast with a strong positive preference in class 4. All in all, this suggests that the main conclusions are stable across the different models, with preference heterogeneity remaining among classes.

Regarding goodness of fit, the results in Table 7 indicate very little advantage for the LCM, while the MXL (MN2) has the better BIC. The minimal difference between the best fitting models suggests that each provides relevant information on the heterogeneity of GPs’ preferences.

Table 7 Goodness-of-fit measures of the different specifications

Simulating alternative quality improvement programs

The policy simulation study relies on the calculation of compensating variation. The goal here is to evaluate the relative impact on physicians’ welfare of alternative QIPs to the CAPI. These alternatives were chosen to be consistent with, and believable in, the context of French general practice.

The DCE attributes are used to depict five QIPs: the CAPI and four alternative policies (refer to Appendix 2 for more details). The first is close to the emerging organizational model in French primary care (maisons pluridisciplinaires et pôles de santé) implemented to foster quality of care, and also known in the literature as the “integrated” primary care model [46]. The second introduces a mixed remuneration scheme that can better balance quantity and quality in physicians’ activity [47]. In order to measure only the effect of the payment scheme, we assume an increase in income similar to the CAPI. The third QIP is composed of only non-financial mechanisms that do not require a sharp transformation in physicians’ organization (i.e. no multidisciplinary team). The fourth is designed as a maximal satisfaction policy and is used as a benchmark (Footnote 10). Even if the maximum satisfaction of GPs is not necessarily an objective per se, comparing it to the CAPI gives a sense of the distance separating this QIP from the most desirable one. The details of each policy are presented in Table 8.

Table 8 CAPI and alternative QIPs

The indirect utilities and the corresponding CV are first computed for all GPs on the basis of MN1 estimates. With mixed logit models, we concentrate on the attributes which are consistently heterogeneous in the two models (MN1 and MN2): P4P and assistance by NPP. For each, we identify “inclined” GPs, who obtain positive marginal utility from the attribute, and “adverse” GPs, who obtain negative marginal utility. The LCM provides natural subgroups for the estimation of CV, which is computed for each of the four latent classes. It should be noted that only the significant coefficients enter the computation of CV for each subgroup of interest. As GPs are indifferent to insignificant attributes, using their estimated values would distort the welfare estimates. Results are presented in Table 9.
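The rule of using only significant coefficients can be sketched as follows. The coefficient values, p-values and attribute names are hypothetical, and the 5 % threshold is an assumption made for the example, since the significance level applied in the CV computation is not restated here.

```python
def policy_utility(estimates, policy, alpha=0.05):
    """Indirect utility of a policy using only significant coefficients.

    `estimates` maps attribute name -> (coefficient, p_value);
    `policy` maps attribute name -> coded attribute value.
    `alpha` is an assumed significance threshold.
    """
    return sum(coef * policy.get(name, 0.0)
               for name, (coef, p_value) in estimates.items()
               if p_value < alpha)

# Hypothetical estimates for one subgroup (not taken from Tables 5 or 6).
estimates = {
    "p4p": (-0.8, 0.01),
    "feedback": (0.1, 0.40),       # insignificant, so it is ignored
    "remuneration": (0.05, 0.001),
}
policy = {"p4p": 1, "feedback": 1, "remuneration": 10}
v_policy = policy_utility(estimates, policy)
```

In this toy example the feedback attribute contributes nothing to the utility of the policy, so it cannot affect the resulting compensating variation.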

Table 9 Policy simulation: compensating variation (Euro per year)

The first striking result is that CAPI is a source of indirect disutility in the majority of the subgroups considered (5 out of 8).

The compensating variation indicates the annual benefits for GPs of choosing an alternative QIP rather than the CAPI. As expected, P4P “inclined” GPs obtain positive indirect utility from the CAPI. However, with the exception of the mixed remuneration program, all alternative policies still give a greater benefit than the CAPI (Footnote 11). P4P “adverse” GPs would prefer each of the alternative policies to the CAPI, if they were proposed. The non-financial policy has the greatest CV, but the gap with integrated primary care is reduced. Whether they are “inclined” or “adverse” to assistance by NPP, GPs disvalue the CAPI and prefer all alternatives. We expected the NPP “inclined” GPs to benefit most from P1 because of the multidisciplinary team, but P3 is valued slightly more. The NPP “adverse” GPs have their lowest (though still positive) CV for P1, and their preferred alternative is the non-financial program P3.

The patterns are very different between latent classes. Classes 1 and 4 obtain negative and extremely negative indirect utility from the CAPI, respectively, while the sign is positive in classes 2 and 3. Compared to the other subgroups, CV is very high in class 1 (Footnote 12). The benefit of having the non-financial policy rather than the CAPI is equivalent to 93,705€, almost the same amount as for the maximum satisfaction program. There is no benefit from shifting from the CAPI to the mixed remuneration scheme. This last result also holds for class 2. This class is very specific since it is the only subgroup where the other policies result in losses. This is even the case for P4, designed to be the most desirable for GPs as a whole, underlining again the particularity of this latent group. For class 3, mixed remuneration has the highest CV, with a relative benefit of 18,474€. With the exception of P1, alternative policies still dominate the CAPI. For class 4, integrated primary care offers the highest relative benefit (53,925€) while the CV for the non-financial policy remains important (47,148€).

Discussion and conclusion

Using a discrete choice experiment, we elicited French GPs’ preferences for the different components of QIPs. We showed the strength of heterogeneity in their preferences and demonstrated how this heterogeneity leads physicians to evaluate very differently the same interventions aimed at improving the quality of care. The heterogeneity in preferences is concentrated on some components, especially P4P and assistance by an NPP. There is also variation in preferences by latent groups of GPs, with some physicians valuing only some components of QIPs (continuing education and assistance by NPP in group 1), while other physicians value the same components differently (group 3 versus 4). Given this heterogeneity, the crucial policy lesson is that QIPs could be adapted to meet physicians’ preferences by offering a menu of programs and allowing GPs to self-select. If policymakers were to choose only one QIP, CV indicates that they should implement a program using only non-financial interventions. Yet policymakers continue to rely heavily on the financial dimension to change physician behaviour with QIPs, as is the case in France with the ROSP, the QIP that has replaced the CAPI. Strong beliefs in the power of the financial lever, or perceptions of potential implementation difficulties for non-financial interventions, could explain this policy choice. Another interpretation is that a financial QIP could be seen as a mechanism to address both unavoidable compensation claims from medical unions and concerns for the quality of care.

Some limitations should be noted. First, the limited response rate, though consistent with the DCE literature, may have led to sample selection bias. While we do not have information on the non-responders, the opinions expressed in the first part of the questionnaire are reassuring in the sense that they are quite close to those expressed in other French studies [48–50]. Second, the use of a forced choice design might have biased the estimates if physicians wished to choose neither of the two proposed QIPs. However, three considerations mitigate this concern: physicians who were not willing to choose one of the two options in a given choice set simply did not respond on that choice occasion; forced choice is still used in DCE studies of health professionals [15]; and this “forced choice” strategy is consistent with the new orientation of the French national QIP (the ROSP is mandatory). Finally, we chose to use a common comparator when constructing the choice sets, which does not necessarily maximize the statistical efficiency of the experimental design [22]. Yet a fixed comparator increases “respondent efficiency”, which can be defined as the capacity of a respondent to express his or her “real” preferences in the context of the DCE [51]. Given that private practice physicians are heavily time-constrained, particularly in the French fee-for-service context, we believe this trade-off between statistical and respondent efficiency allowed us to obtain a satisfactory response rate and better quality and completeness of responses relative to other designs.

Despite these limitations, this study adds to the broader literature on the heterogeneity of health professionals’ preferences [13–15, 28] and, for the first time, combines LCM and MXL approaches. Each model contributes to a better understanding of physicians’ preferences, and using such an approach can help policymakers to better design their QIPs.