Key Points

We attempted to replicate the published randomized controlled trials (RCTs) on diabetic medications in patients with type 2 diabetes in Japan using a Japanese claims and health checkup database.

We closely designed three observational studies using this real-world data (RWD) source, mimicking the critical study elements of RCTs; however, various design elements could not be precisely emulated, primarily because of the lack of necessary data.

This particular RWD source may not be the best fit for these specific research questions, requiring laboratory data as study outcomes. More RCT replication exercises should be conducted to accumulate knowledge on the opportunities and limitations of real-world evidence studies.

1 Introduction

Real-world evidence (RWE) is “the clinical evidence about the usage and potential benefits and risks of a medical product derived from analysis of real-world data (RWD),” that is, routinely collected health-related data [1]. Although randomized controlled trials (RCTs) are considered the “gold standard” for evaluating treatment effects and safety [2], their highly selective populations and tightly controlled settings limit their generalizability. RWE can supplement the evidence obtained from RCTs by providing information on the effectiveness of treatments in clinical settings [3]. In this sense, RCTs and RWE should be regarded as complementary rather than competing [4]. Furthermore, RWE can be utilized throughout the life cycle of a drug, not only for effectiveness and safety evaluation [5, 6], which is expected to accelerate the drug development process.

Despite its great potential, the use of RWE remains limited, especially in its contribution to regulatory decision-making [7, 8]. RWE studies lack randomization and primary data collection, and there are concerns about drawbacks such as low data quality and improper analytical methods [9, 10]. These factors complicate causal inference in these studies, leading to less confidence in the reliability of RWE [2]. Given this situation, enhancing trust in RWE is crucial for facilitating its use, especially in effectiveness evaluation. For this purpose, two questions should be answered: first, when valid conclusions can be obtained from RWE studies instead of RCTs, and second, how such studies can be implemented [2].

To obtain insights into this “when” and “how,” efforts have been made to replicate RCT results with rigorously designed observational studies using RWD [1]. These are attempts to replicate an RCT by mimicking its critical study elements (e.g., study population, treatments, outcomes) and comparing the results between the RCT and the emulated RWE study. Such replication exercises may provide insights into clinical scenarios (e.g., indications and outcomes), study designs, and analytical approaches for implementing high-quality RWE studies that can produce valid conclusions [7]. One such leading project is the RCT DUPLICATE Initiative, a collaboration of Brigham and Women’s Hospital, Harvard Medical School, the U.S. Food and Drug Administration (FDA), and Aetion, which aims to replicate 30 completed phase III or IV RCTs using health claims data [11, 12]. Other attempts have aimed not only to replicate completed RCTs [13] but also to predict the results of ongoing RCTs [14, 15], albeit mainly in the USA.

However, such attempts to replicate RCTs have not yet been made in Japan. RWD is increasingly being used in Japan for drug safety assessment and epidemiological research. Still, as in other countries, RWE studies have yet to be fully accepted as contributing to decision-making on effectiveness because of concerns about their reliability [5]. Thus, more knowledge on their opportunities and limitations should be accumulated through RCT replication exercises using Japanese RWD. This will enhance confidence in RWE and facilitate its use in Japan. Although such exercises have been conducted overseas, country-specific attempts are essential because healthcare systems, policies, and available RWD sources vary among countries.

Therefore, in this study, we attempted to replicate published RCTs using the JMDC database, one of Japan’s most commonly used commercial databases [16]. We chose diabetes studies for replication because the increasing disease burden of diabetes is a serious public health concern in Japan [17], where one in eight adults has diabetes [18]. The proper management of diabetes is an important clinical mission. The JMDC database contains claims and health checkup data, including blood test results [19]. The availability of health checkup data is a unique characteristic of this database; it enabled us to target diabetes studies with hemoglobin A1c (HbA1c) as the study outcome, which cannot be emulated using RWD sources without laboratory data, such as administrative databases.

2 Methods

2.1 Study Overview

This was a feasibility study to examine whether RWE studies using Japanese RWD can reproduce the results of published, specific RCTs, if closely designed. After selecting RCTs of a particular clinical area for replication, we designed RWE studies to mimic the trials’ critical study elements, such as inclusion/exclusion criteria, interventions, comparators, outcomes, covariates, and follow-up periods, as precisely as possible, and analyzed treatment effectiveness. We then compared the obtained results between the RCTs and the emulated RWE studies. This study was approved by the ethics committee of Juntendo University, Tokyo, Japan (E21-0284).

2.2 Selection of Randomized Controlled Trials (RCTs) for Replication

We targeted RCTs that evaluated the efficacy of diabetic medications on HbA1c levels (not necessarily as primary outcomes) in patients with diabetes in Japan, published in the last 10 years, and potentially replicable with our RWD source. Figure S1 of the Electronic Supplementary Material (ESM) shows the flow chart of the RCT selection process. We searched PubMed on 1 June, 2022, using the following search terms: (“diabetes”[Title/Abstract] AND “Japanese”[Title/Abstract] AND “HbA1c”[Title/Abstract]) AND ((y_10[Filter]) AND (randomized controlled trial [Filter])).

Of the 149 articles obtained, those unsuitable for our target (e.g., non-RCTs, no HbA1c outcomes, and non-Japanese patients included) were excluded after reviewing the titles and abstracts. We also excluded placebo-controlled trials and studies with complicated designs or treatment schemes, making them non-replicable with RWD. Thus, the studies were limited to active-controlled RCTs with simple treatment schemes. Additionally, studies that did not find statistically significant differences in HbA1c outcomes were excluded. This was because, in the case of null results, the agreement between RWE studies and RCTs is more likely to occur because of measurement error, given that misclassification in RWE studies can result in a bias toward the null [11]. The full exclusion criteria are presented in Fig. S1 of the ESM.

This screening process resulted in 13 candidate RCTs, for which we examined their feasibility for replication with RWD. Within the database, we identified patients who had (1) necessary prescription records (study drugs or comparator drugs) and (2) HbA1c data within 90 days before the first prescription and 180 days after the trial follow-up period. If the number of patients who met these two minimum conditions was already less than 100 in either of the groups, the ultimate number of patients meeting all the study’s inclusion/exclusion criteria would be minimal. Therefore, these RCTs were considered unsuitable to replicate using this RWD source; thus, they were excluded from candidates.

Consequently, we obtained three RCTs, all in type 2 diabetes, for replication: Trial 1 compared ipragliflozin (sodium-glucose cotransporter-2 inhibitor [SGLT-2i]) versus metformin (biguanide) [20]; Trial 2 compared sitagliptin (dipeptidyl peptidase-4 inhibitor [DPP-4i]) versus pioglitazone (thiazolidinedione) [21]; and Trial 3 compared insulin degludec/insulin aspart versus insulin glargine [22]. Summaries of these RCTs (Trials 1–3) are presented in Table 1.

Table 1 Summary of three randomized controlled trials (RCTs) for replication

2.3 Data Source

This study used the JMDC database, which consists of claims and health checkup results of insured employees and their dependents, collected from health insurance societies [19]. In Japan, people usually undergo a health checkup annually because employers must provide employees with a yearly health checkup under the Industrial Safety and Health Act, and insurers are obligated to provide an annual health checkup (“specific health checkup” aiming to prevent metabolic syndrome) to their insured persons and their dependents aged ≥ 40 years.

The database includes the following information: patient attributes (age and sex), diagnoses, medical care activities, prescriptions (date, dose, and supply days), and health checkup results (including body mass index (BMI), blood measures, and lifestyle habits). Data have been anonymized, but personal IDs enable tracking the same individuals across different hospitals, as long as the same insurance society covers them. This traceability is one advantage of this database for use in research on chronic diseases, such as diabetes. Indeed, many RWE studies have been conducted in diabetes research using this database [23].

The present study used data of patients with type 2 diabetes (International Statistical Classification of Diseases, 10th Revision [ICD-10] codes E11–E14) between January 2005 and April 2020.

2.4 Replication of Three Randomized Controlled Trials (RCTs) Using RWD

We designed three observational studies using RWD (RWE studies), mirroring the critical study elements of the respective RCTs, to emulate these target RCTs.

2.4.1 Population

In the emulation of each RCT, data for patients with prescriptions for study treatment (study drugs or comparator drugs) were extracted from the database. The cohort entry date (CED) was defined as the first prescription date of the study treatment, that is, treatment initiation. We conditioned patients to have data for at least 180 days before CED to check for previous treatment status and to extract new users of the study treatment. The patients also had to have the necessary data during the baseline and post-treatment assessment windows. Therefore, patients without health-checkup data within 90 days before CED and within 180–360 days after CED were excluded.
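The data-availability conditions above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation (the study used SAS): the function name `is_eligible`, the use of an enrollment start date to measure the 180-day lookback, and the inclusive window boundaries are all assumptions made for the example.

```python
from datetime import date

LOOKBACK_DAYS = 180          # minimum observable period before CED
BASELINE_WINDOW_DAYS = 90    # checkup must fall within 90 days before CED
POST_WINDOW_DAYS = (180, 360)  # checkup must fall 180-360 days after CED


def is_eligible(enroll_start, ced, checkup_dates):
    """Return True if a patient has the data required for one emulation.

    enroll_start  -- first date the patient is observable (assumed input)
    ced           -- cohort entry date (first prescription of study treatment)
    checkup_dates -- dates of health checkups with an HbA1c value
    Boundaries are treated as inclusive, which is an assumption.
    """
    has_lookback = (ced - enroll_start).days >= LOOKBACK_DAYS
    has_baseline = any(
        0 <= (ced - d).days <= BASELINE_WINDOW_DAYS for d in checkup_dates
    )
    has_post = any(
        POST_WINDOW_DAYS[0] <= (d - ced).days <= POST_WINDOW_DAYS[1]
        for d in checkup_dates
    )
    return has_lookback and has_baseline and has_post
```

With annual checkups, a patient passes only when one checkup happens to fall shortly before treatment initiation and the next falls in the post-treatment window, which is why these conditions remove many patients.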

The other inclusion/exclusion criteria, which were defined to mirror those of the corresponding RCT as closely as possible, were applied to these patients, unless the criterion could not be imitated with our RWD. The original patient criteria in the RCTs and the corresponding operational definitions in our emulations are provided in Table S1 of the ESM. However, in two of our emulation studies, applying all imitable patient criteria resulted in almost no patients (0 or 5). In these cases, the patient criterion that most affected the number of patients, that is, the criterion regarding antidiabetic medications before CED, was disregarded to secure the number of patients. The modified definitions are presented in Table S1 of the ESM, and the number of subjects eliminated based on each criterion is presented in Fig. S2 of the ESM.

Table 2 illustrates the specification and emulation of the key components of the target trials. Overall, clinical test values were often measured several months away from baseline or the end of the study period, which could affect the accuracy of the patient background and outcome data. In addition, doses of target drugs were not considered, which could cause misclassification of exposure. All criteria other than one exclusion criterion (Koshizaka et al.) or one inclusion criterion (Onishi et al.) were considered. Ignoring this exclusion or inclusion criterion could cause differences in patient background compared with the RCTs. Background differences between the exposure and comparator groups were minimized by propensity score matching.

Table 2 Specification and emulation of a target trial

2.4.2 Outcomes and Confounding Variables

The study outcome was either the change in HbA1c levels or the percentage change in HbA1c levels from baseline (Table 1). Baseline HbA1c levels were assessed within 90 days before treatment initiation, and post-treatment HbA1c levels were assessed 180–360 days after treatment initiation. The following potential confounders were measured using data at CED or in the month of CED: age, sex, duration of diabetes, and the Charlson Comorbidity Index [24].

2.5 Statistical Analyses

In each emulation, we implemented 1:1 propensity score (PS) nearest-neighbor matching using the above-listed potential confounders, with a caliper of 0.2 on the PS scale, to balance the baseline patient characteristics between the groups. The baseline characteristics of the patients were summarized for both the pre- and post-matching populations, and standardized mean differences were calculated. An intention-to-treat (ITT) analysis was conducted, in which patients who started treatment were included and not censored regardless of discontinuation or change of treatment. The differences in study outcomes between the groups, their 95% confidence intervals (CIs), and p-values were calculated based on the t-distribution. Analyses were performed using SAS release 9.4 (SAS Institute, Cary, NC, USA).
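For readers unfamiliar with the matching step, the following Python sketch illustrates 1:1 nearest-neighbor matching with a caliper and the standardized mean difference used to check balance. It is not the analysis code of this study, which was run in SAS. It assumes that propensity scores have already been estimated, that matching is greedy and without replacement (the text does not specify either), and that the standardized mean difference uses the common pooled-standard-deviation definition. The names `match_nearest` and `smd` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def match_nearest(treated, control, caliper=0.2):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    treated, control -- dicts mapping patient id -> propensity score
    caliper          -- maximum allowed absolute PS difference within a pair
    Returns a list of (treated_id, control_id) pairs. Each control is used
    at most once; treated patients without a control inside the caliper
    are left unmatched.
    """
    available = dict(control)
    pairs = []
    for t_id, t_ps in treated.items():
        if not available:
            break
        # Closest remaining control on the PS scale.
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


def smd(x_treated, x_control):
    """Standardized mean difference of one continuous covariate,
    using the square root of the average of the two sample variances."""
    pooled_sd = sqrt((variance(x_treated) + variance(x_control)) / 2)
    return (mean(x_treated) - mean(x_control)) / pooled_sd
```

Greedy matching depends on the order in which treated patients are processed; an optimal-matching algorithm could produce different pairs from the same scores.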

2.5.1 Assessments of RCT‒RWE Agreement

We used two binary metrics used in the RCT DUPLICATE Initiative to evaluate whether our RWE studies reproduced the same results as RCTs: (1) regulatory agreement and (2) estimate agreement [11]. The “regulatory agreement” refers to the ability of the RWE study to reproduce the direction and statistical significance of the findings of the RCT. The “estimate agreement” is met when the effect estimate obtained by the RWE study lies within the 95% CI for the effect estimate by the RCT. In the case of no 95% CI presented for the effect estimate in the RCT (Trial 2), we calculated the 95% CI using the estimates (mean differences), their standard deviations (SDs), and the number of patients based on the t-distribution.
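The two agreement metrics can be expressed compactly in code. The Python sketch below is an illustration under stated assumptions, not the RCT DUPLICATE Initiative's implementation: statistical significance is judged by whether the 95% CI excludes zero, and regulatory agreement is simplified to "same sign and same significance status", which is adequate here because only RCTs with statistically significant findings were selected. The class and function names are hypothetical, and the numbers in the usage note are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A between-group difference with its 95% confidence interval."""
    point: float
    ci_low: float
    ci_high: float

    def significant(self) -> bool:
        # Significant at the 5% level if the 95% CI excludes zero.
        return self.ci_low > 0 or self.ci_high < 0

    def direction(self) -> int:
        # +1, 0, or -1 according to the sign of the point estimate.
        return (self.point > 0) - (self.point < 0)


def regulatory_agreement(rct: Estimate, rwe: Estimate) -> bool:
    """RWE reproduces both the direction and the significance of the RCT."""
    return (
        rct.direction() == rwe.direction()
        and rct.significant() == rwe.significant()
    )


def estimate_agreement(rct: Estimate, rwe: Estimate) -> bool:
    """RWE point estimate lies within the RCT's 95% CI."""
    return rct.ci_low <= rwe.point <= rct.ci_high
```

For example, with a hypothetical RCT estimate `Estimate(2.0, 0.5, 3.5)`, an RWE estimate of `Estimate(1.0, 0.2, 1.8)` meets both metrics, whereas a significant RWE estimate in the opposite direction meets neither.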

2.5.2 Sensitivity Analyses

Several sensitivity analyses were conducted to explore the factors influencing agreement or disagreement between the results of RWE studies and RCTs. First, the effect estimates were calculated by modifying the time windows for the baseline and post-treatment HbA1c data. Second, summary statistics were calculated for the number and proportion of patients who discontinued RCT-allowed co-antidiabetic medications and those who had prescriptions of any other concomitant antidiabetic medications. A medication was considered discontinued when there was no subsequent prescription by the date of the previous prescription + supply days + 90 days (grace period). Patients were counted as having discontinued the medication if the discontinuation occurred within 180 days from the CED.
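The discontinuation rule can be sketched in Python as follows. This is a hedged illustration, not the study's code: the text does not state which date counts as the discontinuation date, so the sketch assumes it is the end of the last covered supply, and it assumes a known end of data (`data_end`) so that a final prescription is flagged only when the full grace period is observable. The function names are hypothetical.

```python
from datetime import date, timedelta

GRACE_DAYS = 90       # grace period after the end of supply
FOLLOW_UP_DAYS = 180  # discontinuation is counted within this period from CED


def discontinuation_date(prescriptions, data_end):
    """Return the assumed discontinuation date, or None if never discontinued.

    prescriptions -- iterable of (prescription_date, supply_days)
    data_end      -- last date covered by the data (assumed input)
    A gap is flagged when no refill occurs by
    prescription_date + supply_days + GRACE_DAYS.
    """
    rx = sorted(prescriptions)
    for i, (rx_date, supply_days) in enumerate(rx):
        supply_end = rx_date + timedelta(days=supply_days)
        deadline = supply_end + timedelta(days=GRACE_DAYS)
        if i + 1 < len(rx):
            if rx[i + 1][0] > deadline:
                return supply_end
        elif data_end > deadline:
            # Last prescription and the whole grace period was observed.
            return supply_end
    return None


def discontinued_within_follow_up(prescriptions, ced, data_end):
    """True if the medication was discontinued within 180 days of the CED."""
    stop = discontinuation_date(prescriptions, data_end)
    return stop is not None and (stop - ced).days <= FOLLOW_UP_DAYS
```

Overlapping prescriptions are not accumulated in this sketch; each prescription's supply is counted from its own date.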

3 Results

3.1 Patient Characteristics

The baseline characteristics of the patients in our RWE studies are summarized in Table 3 along with the corresponding data in the RCTs. The number of patients in our emulation studies was equal to that in Trial 1 (48 vs. 48 patients in the treatment group), more than that in Trial 2 (126 vs. 58 patients), and fewer than that in Trial 3 (61 vs. 147 patients). In all the RWE studies, the mean age of the patients was lower than that of the corresponding RCTs. Regarding sex distribution, the emulation studies for Trials 1 and 2 included fewer female patients than the RCTs, resulting in predominantly male patients. The mean baseline HbA1c levels in our emulation studies for Trials 1 and 3 were similar to those of the RCTs. However, in the Trial 2 emulation, patients had higher mean ± SD HbA1c levels than in the RCT (treatment group, 7.8 ± 0.7 vs. 7.47 ± 0.66; comparator group, 7.9 ± 0.7 vs. 7.40 ± 0.61).

Table 3 Baseline characteristics of patients in randomized controlled trials (RCTs) and emulated real-world evidence (RWE) studies

After PS matching, the standardized mean differences for each confounding factor were mostly within 0.25 in all emulations, indicating an acceptable balance of covariate distribution between the groups [25] (Table S2 of the ESM).

3.2 Comparison of Results Between RWE Studies and RCTs

The between-group differences in outcome measurements in our emulations and the agreements between the RWE studies and RCTs are summarized in Table 4. In Trial 1 emulation, the percentage changes in HbA1c levels from baseline were larger in the treatment group than in the comparator group (difference [treatment – comparator] −6.21, 95% CI −11.01 to −1.40; p = 0.012). This result was in the opposite direction to that of the RCT. Similarly, emulations of Trials 2 and 3 did not yield the same results as those of the RCTs. Changes in HbA1c levels from baseline were larger in the treatment group than in the comparator group in Trial 2 (difference −0.01; 95% CI −0.25 to 0.23; p = 0.926), and smaller in Trial 3 (difference 0.46; 95% CI −0.01 to 0.94; p = 0.056). In all three emulations, neither regulatory nor estimate agreement was achieved.

Table 4 Effect estimates and RCT–RWE agreements

3.3 Results of Sensitivity Analyses

We modified the time windows for baseline HbA1c data (180 days and 60 days before treatment initiation, instead of 90 days) and post-treatment HbA1c assessments (180–330 days after treatment initiation, instead of 180–360 days). However, these modifications did not alter the conclusions (Table S3 of the ESM).

The proportion of patients who discontinued the RCT-allowed concomitant antidiabetics (i.e., DPP-4i in Trial 1, metformin or sulfonylurea in Trial 2, and any oral antidiabetics in Trial 3) during the follow-up, which was not considered in the primary analysis, was less than 15% in any emulation (Table S4 of the ESM). However, a substantial proportion of patients used antidiabetics other than the group’s treatment and the RCT-allowed co-medications in both the treatment and comparator groups: Trial 1 emulation, 62.5% and 64.1%; Trial 2, 39.7% and 81.0%; and Trial 3, 16.7% and 61.7%, respectively (Table S4 of the ESM).

4 Discussion

This was the first attempt to replicate RCTs with a Japanese database of claims and health checkup data to examine whether RWE studies can produce the same conclusions as RCTs if carefully designed and analyzed. Of the 13 candidate RCTs evaluating the treatment effects of diabetic medications on HbA1c levels, only three were feasible for replication using this RWD source, primarily due to a lack of necessary data. This major challenge limits opportunities for RWE studies, as observed in previous studies [13, 26]. In all three emulation studies with RWD, the obtained results did not meet either “regulatory agreement” or “estimate agreement” with the results from RCTs, demonstrating that this database was not the best fit for these research questions.

As the JMDC database contains health checkup results in addition to claims data, we expected that it could be utilized for effectiveness evaluation using laboratory data as outcomes. However, many patients in the database lacked clinical data to define the population and outcomes, resulting in only a few replicable RCTs. One reason for the lack of HbA1c data was missing values; health checkup results were not collected from all health insurance societies contributing to this database [19]. Another reason is that people in Japan usually undergo health checkups once a year. These infrequent data further reduced the number of patients with HbA1c data within specific time windows. If laboratory data at frequent intervals or with precise timing are essential variables in the study, an RWE study would not be feasible using yearly health checkup data.

As suggested in previous studies, the discrepancies in results between RWE studies and RCTs can arise from differences in design, such as the study population, treatment patterns, and outcome measurement [13, 27], in addition to the lack of randomization. For example, a previous study suggested that heterogeneity in patient characteristics may lead to different results between the emulated RWE study and RCTs, and in that case, evaluation of the agreement between them is not feasible [28]. In our study, patient characteristics, such as age, sex, duration of diabetes, BMI, and baseline HbA1c levels, differed between emulations and RCTs. The authors of Trial 2 argued that BMI might affect the effectiveness of sitagliptin [21]. The mean BMI in our emulation study was higher than that in the RCT, which might be partly responsible for the different conclusions. Examining the clinical reasons behind the RWE–RCT differences was beyond the scope of this study; therefore, we do not elaborate on these details. Moreover, these differences in study populations do not necessarily indicate the drawbacks of RWE studies; instead, they are essential to fill the efficacy–effectiveness gap [29]. Nevertheless, researchers should bear in mind that RWE studies can result in populations that are different from RCTs, even if rigorously designed.

Some of these differences were probably introduced because we could not precisely mirror some inclusion/exclusion criteria due to the lack of data and other constraints of the data source, as in previous attempts [13, 27]. For example, we had to loosen the condition on antidiabetic medications to secure the number of patients, which undoubtedly caused patient selection and treatment patterns in the RWE studies to diverge from those in the RCTs. Indeed, most patients in our emulations used other antidiabetics in addition to the study treatment. This is a typical example of the difficulty of tightly controlling treatment settings in RWE studies. As mentioned earlier, such data reflecting actual clinical practice are essential for filling the efficacy–effectiveness gap [30]. However, this complexity of RWE studies is the very thing that complicates the interpretation of the study results, posing hurdles for their use as valid evidence about the treatment’s effectiveness [2]. Thus, for an RWE study with such an aim, not as a supplement to RCTs, it would still be crucial to simplify the settings as much as possible. In this sense, it is important to understand that RWE studies have limited opportunities depending on the research questions, as demonstrated in this study.

Furthermore, the extended time windows for HbA1c data must also have introduced differences between the RWE studies and RCTs. We had to set broad time windows because patients usually had only one HbA1c data point per year. Therefore, their HbA1c data would not have adequately reflected the glycemic condition at the precise timing of treatment initiation or the end of follow-up. This imprecision is a significant design limitation in our emulations. Our sensitivity analyses using different time windows largely showed the same trends as the primary analysis, suggesting that the choice among these time windows had no significant impact on the outcomes. However, the modified time window (i.e., 180–330 days after treatment initiation) was still far from the end of follow-up; thus, the results may have changed if data exactly at the end of follow-up had been analyzed.

In this study, we found that this RWD source was not feasible for evaluating the effectiveness of diabetic medications on HbA1c levels because of the lack of data critical for designing these studies. However, the disagreement in results between RWE studies and original RCTs, as in the present study and similar attempts [13, 27], does not necessarily indicate the low quality of the data sources or analyses. For example, this database may still be useful for evaluating yearly changes in blood test data or evaluating an outcome that can be defined by a diagnostic record in claims data with high accuracy. Instead, understanding whether a particular RWD source fits the study of interest is a significant finding in RCT replication exercises. Accumulating such knowledge will help to further understand when and how we can implement a high-quality RWE study to produce a valid conclusion. Therefore, RCT replication exercises such as ours should be performed more vigorously in various clinical settings and RWD sources.

This study evaluated the feasibility of claims and health checkup data for RCT replication. Previous RCT replication attempts used data sources such as claims data [15, 27], electronic health records (EHRs) [14], and registry data [13], but not health checkup data. Thus, the findings of this study will add new information to the existing knowledge from replication exercises. However, this study also has limitations. First, we emulated only three RCTs. Thus, the generalizability of our findings may be limited, and similar replication exercises may yield different results. Second, despite using a large database in this study, the number of subjects in the emulated RWE studies was relatively small. Therefore, the characteristics of the sampled population may be biased, which also limits the generalizability of the results. Furthermore, the small sample size in our emulated RWE studies is likely to have increased the variability of estimates and reduced the statistical power. In general, an RWE study requires a sample size at least as large as that calculated for the corresponding RCT. However, in one emulation of our study, there were fewer study subjects than in the corresponding RCT, which made it difficult to draw conclusions because of random error. Third, to obtain comparable results, the distribution of patient characteristics in the emulated RWE study should have been matched with that of the RCTs, as recommended in a previous study [28]. However, this was not implemented in our study because of the small number of study subjects. Fourth, since only subjects with a measured outcome variable were included, there was the possibility of selection bias. Fifth, an ITT analysis was implemented in this study. Since adherence to medications is generally poorer in RWD than in RCTs, exposure misclassification is likely to have occurred. Although a per-protocol approach may reduce this type of misclassification, it raises concerns about informative censoring. Therefore, we adopted the ITT analysis, which is more straightforward. Sixth, our RWE studies did not precisely mimic various study elements, including eligibility criteria and outcome measures, primarily because of the limitations of the data source. A different data source, such as EHRs, might have allowed these RCTs to be replicated. Thus, it should be noted that our results do not necessarily deny the feasibility of all RWE studies for these research questions. Furthermore, the data source used in our study could also be used to replicate RCTs with outcomes that can be emulated using a diagnostic record.

5 Conclusion

In conclusion, our RWE studies using a Japanese claims and health checkup database did not reproduce the same conclusions as the RCTs that evaluated the treatment effects of diabetic medications on HbA1c levels in patients with type 2 diabetes in Japan. The results of this RCT replication attempt suggest that this particular RWD source may not be suitable for evaluating treatment effects using laboratory data as the study outcomes. Further RCT replication attempts should be conducted in various clinical areas using Japanese RWD to accumulate knowledge on the opportunities and limitations of RWE studies in Japan.