Background

The use of healthcare-related databases (DBs) for pharmacoepidemiology research has expanded in recent years [1]. A PubMed search found a nearly six-fold increase in the number of publications related to DB studies using administrative claims data between the decades 2001‒2010 and 2011‒2020. A rapid increase has also been reported in the Asia‒Pacific region, where such databases have become widely available in recent years [2]. [† Search query for Title/Abstract: ("database study" OR "database studies") AND ("claims" OR "administrative"); Search date: 06 JAN 2021].

In such times of change, it is important to make renewed efforts to promote good practice in the planning, conduct, analysis, and interpretation of DB studies. Advancing the understanding of outcome validation studies is an essential part of these efforts. Outcome validation studies are particularly important for DB studies based on secondary-use DBs, such as administrative claims DBs. This paper focuses on how to use existing validation studies to inform and evaluate the design of a new claims-based DB study in its planning phase. One possible conclusion from such an evaluation is that there is not enough information to proceed with confidence, leading to a decision to conduct a new validation study. The steps after the conduct of the DB study, which may include bias adjustments using data from the validation studies, are outside the scope of this paper.

In a claims-based DB study, the source information typically includes diagnosis, drug prescription, and medical procedure records from an administrative claims DB. The outcome of interest is defined by a specific combination of these records. When the source DB consists of electronic medical records (EMRs), such a combination of records is sometimes referred to as an “EMR-derived phenotype algorithm” [3]. In this paper, we use the term “phenotype algorithm,” or simply “algorithm,” when there is no risk of confusion. Even a well-considered algorithm is not perfect at identifying the true occurrence (or absence) of an outcome. Thus, an “outcome validation study” is conducted to characterize the degree of imperfection of the algorithm. More specifically, a validation study characterizes the relationship between the proposed algorithm and a “gold standard” evaluation.

In addition to outcomes, the validation target may include exposure variables (e.g., the use of specific drugs), selection variables (e.g., the diagnosis of a specific disease), or confounder variables. To cover these wider usages, the generic term “validation study” is used. Some general references on validation studies are available [4,5,6,7,8,9]. For common outcomes, systematic reviews of validation studies are available [6, 10,11,12,13,14,15,16,17].

The “diagnostic likelihood ratio” (DLR) and its utility are well known in the field of medical diagnostic testing, such as screening tests for specific diseases [18,19,20,21]. However, its use in the context of validation studies appears to be limited: we found only two such examples, Barbhaiya et al. [22] and Shrestha et al. [23], both of which used the DLR as a summary measure to characterize the target phenotype algorithms. In this paper, we explored additional uses of the DLR. Specifically, we examined the use of the DLR in the assessment of bias during the planning of a comparative cohort DB study. We consider that wider recognition of the full utility of the DLR will enhance the practices surrounding DB studies, including the reporting of outcome validation studies and the planning of new DB studies.

Methods

Outcome validation study

Typically, a validation study is conducted on a random sample from an entire population of subjects. For clarity, we refer to the random sample as the “validation study sample” and to the entire population as the “validation study population.” A hypothetical summary of a validation study result is shown in Table 1 (adapted from Fig. 37.1 of Ritchey et al. [6]). The rows represent the outcomes (“positive” or “negative”) as identified by the proposed phenotype algorithm. The columns represent the phenotype, that is, the true disease status (with or without disease) based on the gold standard. For example, N_A represents the number of subjects identified as positive by the algorithm among those who truly have the disease. N_B, N_C, and N_D are defined analogously.

Table 1 Summary of a typical validation study result

Sensitivity and specificity are two fundamental measures of misclassification. Sensitivity is the proportion of subjects identified by the algorithm as positive among those who truly have the disease, i.e., N_A/(N_A + N_C). Specificity is the proportion of subjects identified by the algorithm as negative among those who are truly without the disease, i.e., N_D/(N_B + N_D). The disease prevalence in the validation study sample is (N_A + N_C)/N, where N is the total number of subjects in the sample.

The following equations relate the positive and negative DLRs to these two misclassification measures.

$$\begin{aligned} \text{Positive diagnostic likelihood ratio } (\text{DLR}^{+}) &= \text{sensitivity}/(1 - \text{specificity}) \\ \text{Negative diagnostic likelihood ratio } (\text{DLR}^{-}) &= (1 - \text{sensitivity})/\text{specificity} \end{aligned}$$

If an appropriate sampling design is employed, the validation study sample can be used to estimate the sensitivity, specificity, and DLR of the validation study population. The precision of each point estimate can be quantified by its confidence interval (CI).
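To make these quantities concrete, the following minimal sketch (in R, which was used for all calculations in this paper) computes the point estimates from the four cell counts of a Table 1-style summary. The counts are hypothetical, and the CI shown for the DLR+ uses one common approach, a delta-method approximation on the log scale:

```r
# Hypothetical cell counts from a Table 1-style summary:
# n_a = algorithm-positive with disease,  n_b = algorithm-positive without disease,
# n_c = algorithm-negative with disease,  n_d = algorithm-negative without disease
n_a <- 90; n_b <- 10; n_c <- 30; n_d <- 870

sens    <- n_a / (n_a + n_c)      # sensitivity = 0.75
spec    <- n_d / (n_b + n_d)      # specificity = 0.989
dlr_pos <- sens / (1 - spec)      # DLR+ = 66
dlr_neg <- (1 - sens) / spec      # DLR-

# One common approach to a 95% CI for DLR+: delta method on the log scale
se_log_dlr <- sqrt(1/n_a - 1/(n_a + n_c) + 1/n_b - 1/(n_b + n_d))
ci_dlr_pos <- exp(log(dlr_pos) + c(-1, 1) * 1.96 * se_log_dlr)

round(c(sens = sens, spec = spec, dlr_pos = dlr_pos, dlr_neg = dlr_neg), 3)
round(ci_dlr_pos, 1)
```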

We now introduce the notation shown in Table 2. First, let Pr(D+;S) denote the probability that a subject truly has the disease (D+) in a population of interest S. If we consider a randomly sampled subject from S, then this probability is simply the proportion of subjects with the disease in S. If S is the validation study population S_VS, then Pr(D+;S_VS) is the disease prevalence of the validation study population. Next, let Pr(O+|D+;S) denote the probability that a subject’s outcome is positive (O+) according to the algorithm in the subset of S with the disease. In general, Pr(X|Y;S) denotes the conditional probability of X in the subset of S in which Y is true. Thus, Pr(O+|D+;S_VS) is the probability of a positive outcome in the subset of the validation study population with the disease, which is simply the sensitivity in the validation study population. Analogously, Pr(O−|D−;S_VS) is the specificity in the validation study population.

Table 2 Notations for prevalence, sensitivity, and specificity

Comparative cohort database study

In the following, we envision DB study planning as consisting of four main steps. The first step is to formulate the research question and consider possible study designs and database options for the DB study. We assume this step has been completed: a comparative cohort study based on a claims DB has been chosen, with the risk ratio (test versus control group) as the relative measure. The second step is to search for relevant validation studies and extract usable information, such as sensitivity, specificity, and other performance measures. The third step is to consider possible values, or a range of possible values, for the risk of the outcome event in the control group based on historical information (e.g., clinical trials, observational studies); there is also likely to be a target risk ratio value for the DB study. Such evaluations are commonly conducted in sample size and power calculations. The fourth step is to evaluate the impact of the performance measures on the bias of the risk ratio and other features of the planned DB study.

Positive-predictive values

In a comparative cohort DB study, we wish to infer the true disease status based on the proposed claims-based algorithm. Because the algorithm is imperfect, as characterized by the validation study results, we need to understand how it performs when applied to the DB study. Two such performance measures are the positive-predictive value (PPV) and the negative-predictive value (NPV) [6]. In the developments below, estimates of sensitivity, specificity, and disease prevalence are assumed to be available from past validation studies or other sources. Additionally, as before, we distinguish the terms “DB study sample” and “DB study population.”

The PPV is the probability that a subject identified by the algorithm as positive truly has the disease. Using Bayes’ theorem [21, 24], the PPV of the algorithm when applied to the DB study population (PPV_DB) can be expressed as follows, where P_DB is the disease prevalence of the DB study population:

$$\begin{aligned} \text{PPV}_{\text{DB}} &= \Pr(\text{D}{+} \mid \text{O}{+}; S_{\text{DB}}) \\ &= \frac{\Pr(\text{O}{+} \mid \text{D}{+}; S_{\text{DB}})\,\Pr(\text{D}{+}; S_{\text{DB}})}{\Pr(\text{O}{+} \mid \text{D}{+}; S_{\text{DB}})\,\Pr(\text{D}{+}; S_{\text{DB}}) + \Pr(\text{O}{+} \mid \text{D}{-}; S_{\text{DB}})\,\Pr(\text{D}{-}; S_{\text{DB}})} && \text{(Bayes' theorem)} \\ &= \frac{\text{Sensitivity} \cdot P_{\text{DB}}}{\text{Sensitivity} \cdot P_{\text{DB}} + (1 - \text{Specificity})(1 - P_{\text{DB}})} \end{aligned}$$
(1A)
$$= \frac{\text{DLR}^{+} \cdot P_{\text{DB}}}{\text{DLR}^{+} \cdot P_{\text{DB}} + (1 - P_{\text{DB}})}.$$
(1B)

Equation 1A follows from the previous line because sensitivity and specificity are assumed not to depend on the population, so that Pr(O+|D+;S_DB) = Pr(O+|D+;S_VS) and Pr(O−|D−;S_DB) = Pr(O−|D−;S_VS). In practice, the plausibility of this assumption should be justified [25]. Equation 1B is obtained by dividing the numerator and denominator by the term (1 − Specificity). In many validation studies, an estimate of the PPV for the validation study itself (PPV_VS) is reported. The population value of PPV_VS is obtained by replacing P_DB in Eq. 1A with the disease prevalence of the validation study population (P_VS). Note that the usual estimate of PPV_VS (= N_A/(N_A + N_B)) can be obtained by substituting the estimates of the DLR+ and P_VS from the validation study into Eq. 1B.
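As a minimal sketch of Eqs. 1A and 1B, the two functions below must return the same value whenever DLR+ = sensitivity/(1 − specificity); the input values are hypothetical:

```r
# Eq. 1A: expected PPV in the DB study population from sensitivity,
# specificity, and the DB study disease prevalence p_db
ppv_from_sens_spec <- function(sens, spec, p_db) {
  sens * p_db / (sens * p_db + (1 - spec) * (1 - p_db))
}

# Eq. 1B: the same quantity written in terms of DLR+ alone
ppv_from_dlr <- function(dlr_pos, p_db) {
  dlr_pos * p_db / (dlr_pos * p_db + (1 - p_db))
}

# Hypothetical inputs: sensitivity 0.75, specificity 0.99, prevalence 0.05
ppv_from_sens_spec(0.75, 0.99, 0.05)        # 0.798
ppv_from_dlr(0.75 / (1 - 0.99), 0.05)       # identical, as Eq. 1B implies
```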

By solving Eq. 1B for the DLR+ and by noting that the equation holds for either the validation study or the DB study population, another useful expression for the DLR+ is obtained:

$$\text{DLR}^{+} = \left( \frac{\text{PPV}_{\text{VS}}}{1 - \text{PPV}_{\text{VS}}} \right) \bigg/ \left( \frac{P_{\text{VS}}}{1 - P_{\text{VS}}} \right) = \left( \frac{\text{PPV}_{\text{DB}}}{1 - \text{PPV}_{\text{DB}}} \right) \bigg/ \left( \frac{P_{\text{DB}}}{1 - P_{\text{DB}}} \right) = \frac{\text{Post-test odds}}{\text{Pre-test odds}}.$$
(2)

In the terminology of diagnostic tests, the DLR+ is equal to the ratio of the “post-test odds” to the “pre-test odds” [18, 19]. The pre-test odds is the odds of disease (D+), and the post-test odds is the odds of disease when the test result is positive (in the current case, when the outcome is O+). Under the current assumption, the DLR+ is invariant between the validation study and DB study populations.
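Equation 2 also provides a one-line way to recover the DLR+ from a reported PPV and an assumed prevalence, as in this sketch with hypothetical inputs:

```r
# Eq. 2: DLR+ = post-test odds / pre-test odds,
# i.e., the PPV odds divided by the prevalence odds
dlr_from_ppv <- function(ppv, prev) {
  (ppv / (1 - ppv)) / (prev / (1 - prev))
}

dlr_from_ppv(ppv = 0.8, prev = 0.05)   # 76 for these hypothetical inputs
```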

Analogous developments are possible for the NPV, where the negative DLR (DLR−) plays the corresponding role.

Relative measures of risk

We now examine the impact of misclassification on relative measures of risk, namely the risk ratio (RR). As stated by Ritchey et al., the ultimate criterion for the importance of misclassification is the degree of bias exerted on relative measures of risk [6].

Let N_TES and N_CON denote the sample sizes of the test and control (referent) groups of a hypothetical cohort DB study, respectively. Similarly, let X_TES and X_CON denote the corresponding numbers of subjects with the true disease, which are assumed to be known in this hypothetical situation. The expected numbers of positive outcomes based on the algorithm, together with the corresponding risk expressions, are given in Table 3; a sketch of this expected-count logic follows the table. Table 3 assumes that sensitivity and specificity are invariant between the test and control groups. For applications in actual DB studies, the plausibility of this “non-differential misclassification error” assumption should be justified.

Table 3 True and expected number of positive outcomes, risks, and risk ratio
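The following sketch illustrates the expected-count logic summarized in Table 3 under the non-differential misclassification assumption; the group sizes, true risks, sensitivity, and specificity are all hypothetical:

```r
# Expected number of algorithm-positive outcomes in a group of size n with
# x true cases (Table 3 logic): detected true cases plus false positives
expected_positives <- function(n, x, sens, spec) {
  sens * x + (1 - spec) * (n - x)
}

# Hypothetical groups: true control risk 0.03, true RR 2.0
n_tes <- 10000; x_tes <- 600    # test group, true risk 0.06
n_con <- 10000; x_con <- 300    # control group, true risk 0.03

r_tes_exp <- expected_positives(n_tes, x_tes, sens = 0.75, spec = 0.99) / n_tes
r_con_exp <- expected_positives(n_con, x_con, sens = 0.75, spec = 0.99) / n_con
r_tes_exp / r_con_exp           # expected (biased) RR: about 1.69, not 2.0
```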

Using the risk expressions in Table 3, we can write the expected RR in terms of the true RR, as shown in Eq. 3, where RR_EXP is the expected RR, RR_TRUE is the true RR, and R_CON is the true disease risk of the control group in the DB study:

$$RR_{EXP} = RR_{TRUE} + \frac{{1 - RR_{TRUE} }}{{R_{CON} \cdot \left( {DLR^{ + } - 1} \right) + 1}}.$$
(3)

The details of the derivation are shown in Appendix A (Additional file 1). The term \((1 - RR_{TRUE})/\{R_{CON} \cdot (DLR^{+} - 1) + 1\}\) is the bias of RR_EXP relative to RR_TRUE. If RR_TRUE is greater than 1, then the bias term is always negative in this “ideal” situation (see Appendix B, Additional file 1). In real-life situations, there may be other sources of bias, so the overall bias may not be negative [6, 26].
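Equation 3 can be evaluated directly. As a consistency check, the same hypothetical inputs as in the Table 3 sketch above (sensitivity 0.75 and specificity 0.99, hence DLR+ = 75) reproduce the expected RR obtained there:

```r
# Eq. 3: expected RR from the true RR, the control-group risk, and DLR+
rr_expected <- function(rr_true, r_con, dlr_pos) {
  rr_true + (1 - rr_true) / (r_con * (dlr_pos - 1) + 1)
}

# Matches the ratio of expected risks from the Table 3 sketch (about 1.69)
rr_expected(rr_true = 2.0, r_con = 0.03, dlr_pos = 75)
```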

All calculations were performed and graphs were generated using R version 3.6.1 [27].

Results

Positive-predictive values

Figure 1A displays the expected PPV of the DB study as a function of the DLR+ and the disease prevalence of the DB study population. A hypothetical range (0.025‒0.4) is graphed for the disease prevalence. For each value of the disease prevalence, the expected PPV of the DB study increases with increasing values of the DLR+. Figure 1B gives an alternative display format in which the x-axis is the disease prevalence and each line represents a value of the DLR+. For each DLR+ value, the expected PPV of the DB study increases with increasing disease prevalence. If the disease prevalence of the DB study population equals that of the validation study, then the PPVs are also expected to be equal. It follows that if the disease prevalence of the DB study is likely to be lower than that of the validation study, then the expected PPV of the DB study will be lower than that of the validation study.

Fig. 1 Expected PPV of the DB study as a function of the DLR+ and disease prevalence. A The positive-predictive value (PPV) of the database (DB) study is plotted against the positive diagnostic likelihood ratio (DLR+). Each line represents a fixed value of the disease prevalence of the DB study population; a hypothetical range of prevalence values is shown (0.025–0.4). B The expected PPV is plotted against a hypothetical range of the disease prevalence of the DB study. Each line represents a fixed value of the DLR+; a range of DLR+ values is shown (20–1000). Both plots are based on Eq. 1B

In many validation studies, sensitivity and specificity are not available, and only PPVs are reported. In such cases, the previously mentioned assessment methods are not applicable. However, a plausible range of the DLR+ can still be ascertained using Eq. 2. Figure 2 shows the DLR+ as a function of the disease prevalence of the validation study population (P_VS) for selected values of the validation study PPV (PPV_VS). Suppose a plausible range of P_VS is 0.04‒0.06, based on information from the validation study or other sources, and the PPV_VS is 0.8 according to the validation study. From Fig. 2, the corresponding range of the DLR+ is approximately 63‒96. If desired, a range of values for PPV_VS may be considered to account for the precision of the estimate. Once a value or range for the DLR+ is in hand, one can refer to Fig. 1, as before.
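This reading from Fig. 2 can be verified numerically with the dlr_from_ppv() sketch from the Methods section:

```r
# PPV_VS = 0.8 with a plausible prevalence range of 0.04-0.06 (Eq. 2)
dlr_from_ppv(ppv = 0.8, prev = c(0.06, 0.04))   # approximately 63 and 96
```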

Fig. 2 DLR+ as a function of the disease prevalence of the validation study population. The positive diagnostic likelihood ratio (DLR+) is plotted against the disease prevalence of the validation study population. Lines are drawn for selected values of the positive-predictive value (PPV) of the validation study. The plot is based on Eq. 2 applied to the validation study population

Relative measures of risk

Figure 3A displays RR_EXP as a function of the DLR+ and the true disease risk of the control group of the DB study. For illustrative purposes, RR_TRUE is set to 2.0, and a hypothetical range of values (0.01‒0.1) for the true disease risk of the control group (R_CON) is graphed. For each value of the control group’s risk, the degree of bias decreases with increasing values of the DLR+. Figure 3B gives an alternative display format in which the x-axis is the control group risk. For each value of the DLR+, the degree of bias decreases with increasing values of the control group risk. Figures 3A and B permit a more compact and transparent visualization of the relationship between the expected RR and the control group risk of the DB study than the traditional display format shown in Appendix Figure X1 (Additional file 1).

Fig. 3 Expected RR as a function of the DLR+ and the true disease risk. The expected risk ratio (RR) of the database (DB) study is shown as a function of the positive diagnostic likelihood ratio (DLR+) and the true disease risk of the control (referent) group of the DB study. The true RR is set to 2.0. A The expected RR is plotted against the DLR+; a hypothetical range of values for the true disease risk of the control group is shown (0.01–0.1). B The expected RR is plotted against the true disease risk of the control group; a range of DLR+ values is shown (20–1000). The right axis of each plot displays the scale in terms of % bias relative to the true RR. Both plots are based on Eq. 3. Figure X1 (Additional file 1) gives a more traditional display in which sensitivity and specificity are considered separately

Use examples

Published examples of the DLR in the context of outcome validation studies are rare. Barbhaiya et al. (2017) conducted a validation study of claims-based phenotype algorithms for identifying the diagnosis of avascular necrosis [22]. In their paper, the DLR+ was used as a summary measure, along with the sensitivity, specificity, and PPV. Shrestha et al. (2016) conducted a systematic review of administrative data-based phenotype algorithms for the diagnosis of osteoarthritis [23]. In their review, the DLR+ was included as a summary measure of the phenotype algorithms, along with sensitivity, specificity, and expected PPV values at three hypothetical values of the disease prevalence. We recommend routinely including the DLR in validation study reports whenever it is computable.

As a further illustration of the use of the DLR+, we provide two artificial examples based on data from a systematic review by McCormick et al. [17]. The review identified 30 studies on administrative data-based phenotype algorithms for the diagnosis of acute myocardial infarction (MI). We envision planning a DB study with acute MI as the outcome.

In the first artificial example, we selected three studies that reported sensitivity, specificity, PPV, and NPV: Kennedy et al. [28], Pladevall et al. [29], and Austin et al. [30]. (Many studies in the review reported only PPVs.) Table 4 provides a summary of the three studies. We supplemented these with the DLR+ and its 95% CI, which were not included in either the systematic review or the original reports. In addition, we calculated two features of the planned DB study that would be expected under specific assumptions. The first feature is the expected PPV when the prevalence of acute MI in the planned DB study is assumed to be 0.05. The second feature is the relative bias of the RR when the control group’s risk of acute MI and the true RR are assumed to be 0.03 and 2.0, respectively. The relative bias is defined as the bias divided by the true RR, multiplied by 100%. The assumptions were chosen for illustrative purposes.
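Both features follow directly from Eqs. 1B and 3 once the DLR+ is in hand; the sketch below uses a hypothetical DLR+ value rather than the actual estimates shown in Table 4:

```r
# The two planned-study features under the stated assumptions:
# DB prevalence 0.05, control-group risk 0.03, true RR 2.0
planned_features <- function(dlr_pos, p_db = 0.05, r_con = 0.03, rr_true = 2.0) {
  ppv  <- dlr_pos * p_db / (dlr_pos * p_db + (1 - p_db))   # Eq. 1B
  bias <- (1 - rr_true) / (r_con * (dlr_pos - 1) + 1)      # bias term of Eq. 3
  c(expected_ppv = ppv, relative_bias_pct = 100 * bias / rr_true)
}

planned_features(dlr_pos = 75)   # hypothetical DLR+, not one of the Table 4 studies
```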

Table 4 Use example of DLR+ in validation studies and in planning of a DB study

The DLR+ value for the Kennedy study is nearly 40 times greater than those of the other two studies (Table 4). This translates into large differences in the expected PPV and the bias of the RR between the Kennedy study and the other two. Figure 4A displays the expected PPV in the DB study for a hypothetical range of disease prevalence, set to 0.01‒0.09 for our illustration. For the Kennedy study, the expected PPV at a disease prevalence of 0.05 is 0.959, which contrasts with values below 0.4 for the other two studies (Table 4 and Fig. 4A). Figure 4B displays the expected bias of the RR for a plausible range of the control group’s risk, assumed to be 0.01‒0.05 for our illustration (the true RR is set to 2.0). For the Kennedy study, the relative bias of the RR is −3.49% at a control group risk of 0.03, which contrasts with values below −37% for the other two studies (Table 4 and Fig. 4B). Additionally, the disease prevalence is 0.003 for the Kennedy study, which is notably lower than that of the other two studies (Table 4). Thus, planning for the DB study is greatly affected by the choice of validation studies. In actual applications, one needs to evaluate the various features of the validation studies carefully and select those most relevant for the planned DB study. The features to be scrutinized might include the study population, the “gold standard” criteria, and the outcome definition. Also, in actual applications, the ranges of parameters such as the disease prevalence and control group risk should be judiciously selected by each investigator, based on past information, to cover the relevant scenarios expected in the planned DB study.

Fig. 4 An example of the application of Eqs. 1B and 3 to data from actual validation studies. The validation studies by Austin [30], Pladevall [29], and Kennedy [28] were selected from the systematic review by McCormick et al. [17]. A The expected positive-predictive value (PPV) of the planned database (DB) study is plotted against the disease prevalence of the DB study. B The expected risk ratio (RR) of the planned DB study is plotted against the control group risk of the DB study; the right axis is on the relative bias scale. In each panel, the center, lower, and upper lines for each study correspond to the point estimate and the lower and upper bounds of the 95% confidence interval of the DLR+

The second example involves a case in which only PPVs are reported, so the previous type of assessment is not applicable. McCormick et al. [17] reported a systematic difference in PPV values between studies with and without cardiac troponin measurement as part of the “gold standard.” For this illustration, we considered eight phenotype algorithms from seven studies in Fig. 2A of McCormick et al. [17] whose gold standard criteria included cardiac troponin measurements. Figure 5 plots the DLR+ against the disease prevalence for the reported PPV value of each algorithm. Each line is drawn based on the relationship in Eq. 2. A wide range of disease prevalence is displayed to cover various possibilities.

Fig. 5 An example of the application of Eq. 2 to algorithms from validation studies. The positive diagnostic likelihood ratio (DLR+) is plotted against the disease prevalence of the validation study. Each line corresponds to the reported PPV value of one algorithm. Seven studies (eight algorithms) were selected from Fig. 2A of the systematic review by McCormick et al. [17]. The ordering of the algorithms in the legend corresponds to the order of the lines in the graph. The included algorithms and their reported PPVs are: Merry 2009 (0.9688), Kiyota 2004p (primary diagnosis) (0.9411), Ainla 2006 (0.933), Kiyota 2004s (primary or secondary diagnosis) (0.9245), Barchielli 2010 (0.8602), Hammar 2001 (0.8583), Heckbert 2004 (0.8302), and Varas-Lorenzo 2008 (0.7202). The gray band indicates disease prevalence between 0.1 and 0.3. The solid horizontal blue lines indicate DLR+ values consistent with all eight algorithms; the dotted blue lines indicate DLR+ values consistent with the “median” algorithm

A detailed examination of each validation study and related sources may provide hints toward a narrower plausible range for the disease prevalence. Suppose this plausible range is taken to be 0.1‒0.3 (the shaded region in Fig. 5). Next, consider a horizontal line moving up from the bottom of the figure. This line first crosses an algorithm (Varas-Lorenzo 2008) at a disease prevalence of 0.3 (DLR+ = 6). As it continues to move up, it crosses the remaining algorithms. Analogously, a horizontal line moving down from the top of the figure first crosses an algorithm (Merry 2009) at a disease prevalence of 0.1 (DLR+ = 279). Thus, the range of DLR+ values consistent with all eight algorithms is 6‒279; this range is indicated by the pair of solid horizontal blue lines in Fig. 5. In actual applications, this range may be too wide, and the algorithm selection may need to be refined further. One idea for narrowing the range is to consider DLR+ values consistent with the “median” algorithms, which in this case are the two central algorithms (i.e., Kiyota 2004s and Barchielli 2010). The pair of dotted horizontal blue lines in Fig. 5 indicates this range (note that the Barchielli 2010 and Hammar 2001 algorithms nearly overlap in Fig. 5). Once a plausible range of DLR+ values has been determined by assessments such as these, one can compute the corresponding range for the expected RR using Eq. 3.
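The solid-line range can be reproduced from Eq. 2 using the eight reported PPVs and the 0.1‒0.3 prevalence band:

```r
# Reported PPVs of the eight algorithms shown in Fig. 5
ppvs <- c(Merry2009 = 0.9688, Kiyota2004p = 0.9411, Ainla2006 = 0.933,
          Kiyota2004s = 0.9245, Barchielli2010 = 0.8602, Hammar2001 = 0.8583,
          Heckbert2004 = 0.8302, VarasLorenzo2008 = 0.7202)

dlr <- function(ppv, prev) (ppv / (1 - ppv)) / (prev / (1 - prev))   # Eq. 2

# DLR+ decreases as prevalence increases, so over the 0.1-0.3 band the
# extremes come from the lowest PPV at 0.3 and the highest PPV at 0.1
c(lower = dlr(min(ppvs), 0.3), upper = dlr(max(ppvs), 0.1))   # about 6 and 279
```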

Discussion

In this paper, we investigated the utility of the DLR in the context of an outcome validation study. The positive DLR, equal to sensitivity/(1 − specificity), was identified as a pivotal parameter connecting the expected PPV with the disease prevalence in the planned DB study. Moreover, the positive DLR emerged as the pivotal parameter linking the expected RR with the disease risk of the control group in the planned DB study.

The importance of thorough sensitivity analyses after the completion of a DB study is well established [6, 35,36,37,38]. In contrast, there has been less focus on what can be done to improve the planning of a DB study. During the planning phase, careful assessments of outcome definitions and other elements of the study design should be conducted. Toward this end, the DLR provides a transparent and informative link between the results of a validation study and the PPV that can be expected in the planned DB study (Fig. 1). Additionally, the expected degree of bias of the RR can be characterized clearly (Fig. 3).

There are some limitations to the method described above. As mentioned in the “Methods” section, the derivation of the equations relies on assumptions such as non-differential misclassification error. The invariance of sensitivity and specificity between the validation study and DB study populations is another assumption. If assessments of sensitivity to deviations from these assumptions are desired, an investigator can start with expressions such as those in Table 3 and use computer calculations to evaluate performance under arbitrary settings. In particular, the assumption of non-differential misclassification error requires careful consideration. In addition, extensions to other relative measures, such as the risk difference and odds ratio, as well as to non-binary variables (e.g., continuous, categorical), may be of interest. Finally, although we focused on claims-based DB studies, some features are also relevant for DB studies based on electronic health records.

Conclusions

Wider recognition of the full utility of the DLR in the context of validation studies will make a meaningful contribution to the promotion of good practice in the planning, conduct, analysis, and interpretation of DB studies.