Key Points

The real-world data landscape is rapidly evolving in Japan, with more than 20 patient-level data sources available.

The Pharmaceuticals and Medical Devices Agency is organizing two main data repositories, namely the Medical Information Database Network and the Clinical Innovation Network.

A methodological framework of how to conduct real-world evidence studies needs to be developed to ensure transparency, robustness, and reproducibility.

1 Introduction

In Japan, the use of real-world evidence (RWE) for hypothesis generation and decision-making has become an area of focus to overcome the limitations of clinical trials. The Pharmaceuticals and Medical Devices Agency (PMDA) has recognized the importance of using real-world data (RWD) for regulatory purposes, and discussions regarding the acceptability of RWE for regulatory submissions have already been initiated. However, there are multiple challenges that still need to be addressed, including RWD availability, quality, and completeness; accurate definition of clinical outcomes; and analytical approaches used in RWE studies [1]. To address some of these issues and promote the utilization of RWD throughout the drug life cycle, including post-marketing drug safety assessment, relevant guidelines have been published, and multiple initiatives have been launched in Japan [2].

The number of available RWD sources is steadily increasing, with 22 patient-level databases listed in the Japanese Society for Pharmacoepidemiology database in 2022 [3]. In a previous work focusing on RWD in Japan, we described the advantages and challenges of the commercially available databases Medical Data Vision (MDV) and JMDC (formerly known as Japan Medical Data Center) [4]. Another resource, the Japanese National Database of Health Insurance Claims and Specific Health Checkups (NDB), was established in 2009 to generate data and assist in health policy planning and regulation of public healthcare expenditures. NDB includes almost all the administrative claims data collected from the insured population in Japan [5], as well as data from yearly health checkups offered to people who are covered by the National Health Insurance schemes [6]. However, NDB is currently used only by academic researchers and is not accessible to private companies [5].

Most studies using RWD from MDV and JMDC focus on describing patient characteristics and treatment patterns since such RWD sources contain little information on clinical characteristics and outcomes [4]. A recent review that assessed the types of studies conducted between 2015 and 2020 using claims databases in Japan found that descriptive studies were the most common, accounting for 63%, 43%, and 41% of the studies conducted using NDB, MDV, and JMDC data, respectively [7]. To address this issue of missing data and to create a high-quality medical information database for conducting rigorous assessments of drug safety, the Medical Information Database Network (MID-NET) was launched in April 2018, in collaboration with 23 hospitals from 10 healthcare organizations across Japan.

The strength of the MID-NET database lies in the rich clinical data and standardized coding procedures across different contributing hospitals, allowing integrated analysis. For example, an important effort is in progress to improve signal detection using this database [8]. MID-NET is also well recognized for post-authorization safety studies (PASS), especially under Good Post-Marketing Study Practice regulation. The Clinical Innovation Network (CIN) is another initiative that was launched to develop a registry-based infrastructure for improving clinical drug development. Some of the registry databases have been leveraged for multiple applications, including patient enrollment into clinical trials and PASS, though the initiative has not yet been completed. Therefore, both MID-NET and CIN may provide data of high quality and reliability [9].

Although local guidelines refer to data-related issues, including quality, completeness, suitability, and transparency, further guidance is needed to expand the framework of RWD use [10, 11]. Furthermore, inappropriate analytical methodologies may lead to biased evidence generation, undermining the credibility of RWE. Although PMDA offers expertise to pharmaceutical companies, there is no formal process in place to systematically utilize this support, and the choice of inappropriate study design and analytical methods may negatively influence study results [12, 13]. Other challenges that could limit the interpretation of results relate to sensitivity analysis, analytical models, measurement validity, and confounding. These elements are generally acknowledged as study limitations but are often not addressed analytically. Additionally, time-related bias generally receives less attention than confounding bias. Time-related bias is potentially more important than randomization itself in the context of target trial emulation [14]; it is harder to address because it requires data at multiple time points, and it is highly dependent on the study design [15].

This review aims to provide strategies to overcome some of these challenges, and to fill gaps in the current practices in RWE generation in Japan related to pharmacoepidemiology for regulatory purposes, by focusing mainly on data- and methodology-related issues.

2 Challenges and Solutions Relating to Data

This section discusses challenges relating to RWD and how these may be addressed (Table 1). PMDA was the primary source of evidence to assess the most recent initiatives on RWD-related considerations in Japan, and a targeted literature review was conducted to collect information on the most prominent methodological issues associated with RWE generation in Japan.

Table 1 Summary of potential solutions to overcome current challenges in real-world research

2.1 Lack of Data Transparency of RWD Sources

Sufficient transparency of RWE provides information on how findings were derived, allowing for assessment of validity and study reproducibility, but the reliability and transparency of the RWD used to generate RWE in Japan have not been adequately assessed. Several research groups have attempted to develop guidelines and best practices for the assessment of different types of data. For example, an evidence-based guideline (3X3 Data Quality Assessment) has been proposed for the assessment of electronic health record (EHR) data, focusing on three dimensions of the EHR data: patients, variables, and time [16]. A different study explored good practices and challenges, and proposed solutions for the effective use of administrative data, including aspects of data acquisition, approval processes, access and disclosure, data and metadata, research support, and data reuse [17].

Assessment of RWD should cover both non-regulatory and regulatory purposes, but generally the requirement for the latter is more stringent. Considering the future needs of RWE, RWD that meets a minimum standard accepted by regulators, a “regulatory grade,” would be needed to further expand its applications, though no consensus has been reached on the definition of the minimum accepted standard. To assess the appropriateness for “regulatory grade,” a proposed checklist includes the following elements: high quality (provenance of each datapoint must be clear, traceable, and auditable), completeness (predefined rules for data abstraction), and timing of data update [18]. A different checklist for investigators in database research focuses on the following items: population covered, capture of study variables, continuous and consistent data capture, record duration and data latency, and database expertise [19]. In this regard, a flow diagram was developed allowing users to match a study objective with the data quality and quantity in the Japanese context. The process sequentially clarifies the following elements: construction process (reliability of the data collection), data related to endpoints, anonymization, data volume (sample size), and sufficiency of the follow-up period to address the research question [20].

2.2 Lack of Linkage Across Different Care Settings

In Japan, there are hospital-based (e.g., MDV) and insurance-based (e.g., JMDC) administrative claims databases, which can cover different healthcare settings. Japan has a universal medical insurance system consisting of five public insurance programs (based on occupational status), and every resident of Japan must be enrolled in one of these programs. Patients can access clinics and hospitals without a primary care physician gatekeeper system or insurance restrictions [4]. However, the presence of different health insurance plans based on patient occupational status represents a barrier to patient follow-up when individuals change their health insurance plan. In addition, different healthcare settings, such as hospitals and clinics, are not usually linked.

Data linkage of de-identified databases, using either deterministic or probabilistic methods, has been a recent focus. These methods are considered promising, though they generally require a certain degree of personal health information [21, 22]. The deterministic approach treats linkage as binary (i.e., linked or not linked), while probabilistic data linkage is useful when linkage variables are inaccurate or not unique [23]. For instance, an algorithm was developed to perform probabilistic data linkage based solely on diagnosis codes [International Classification of Diseases, 10th Revision (ICD-10)] available in two different de-identified datasets. This method presents three main advantages: (i) it only uses diagnosis code information, (ii) it accounts for discrepancies between the datasets, and (iii) it does not require a subsample for tuning [24].
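To make the contrast between the two approaches concrete, the following is a minimal sketch in Python. It is not the diagnosis-code algorithm of [24]; it swaps in the classical Fellegi-Sunter match weight as a generic example of probabilistic linkage. The field names and the m/u probabilities are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical m/u probabilities per linkage field (illustrative values only):
# m = P(field agrees | the two records belong to the same person)
# u = P(field agrees | the two records belong to different people)
FIELD_PARAMS = {
    "birth_year": (0.98, 0.02),
    "sex": (0.99, 0.50),
    "prefecture": (0.90, 0.05),
}


def deterministic_match(rec_a, rec_b, fields=FIELD_PARAMS):
    """Binary decision: linked only if every linkage field agrees exactly."""
    return all(rec_a[f] == rec_b[f] for f in fields)


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Fellegi-Sunter style score: sum of log2 agreement/disagreement weights."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total


claims_record = {"birth_year": 1960, "sex": "F", "prefecture": "Tokyo"}
emr_record = {"birth_year": 1960, "sex": "F", "prefecture": "Osaka"}

exact = deterministic_match(claims_record, emr_record)
score = match_weight(claims_record, emr_record)
```

In this sketch the deterministic rule rejects the pair because one field disagrees, whereas the probabilistic score remains positive and the pair could still be linked if the score exceeds a chosen threshold. The threshold itself would need to be justified and reported as part of the linkage evaluation.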

A different study performed in the USA demonstrated that a comparative effectiveness study could be reproduced when conducting imputation on missing outcome values based on probabilistic data linkage of two claims databases, using pre-index characteristics, including demographic data, comorbidities, and utilization of healthcare services [25]. Although this method has not yet been applied in Japan, it could prove successful with Japanese databases, for instance, for linking claims data to electronic medical records (EMRs). However, transparent reporting of the data linkage method, including the quality of the data sources, linkage variables, methods, and evaluation, is crucial to inform potential linkage bias [26].

Of note, there are important concerns in terms of regulations of patient privacy. Under the Next-Generation Medical Infrastructure Act, enacted in 2018, Japanese regulation authorities have certified some business operators to conduct data linkage across different data sources. Some stakeholders have suggested using MyNumber, the Japanese national identification number, to link different databases [12].

2.3 Lack of Clinical Outcome Data in RWD

Different types of RWD, such as EHRs, administrative claims data, or disease-specific patient registries, are structured and collect data elements for specific purposes, and may vary in population coverage, variables included, and longitudinality. Previously, we highlighted the limited amount of clinical information available in widely used Japanese RWD sources, namely MDV and JMDC [4]. To address this limitation, chart review studies at single or multiple sites could be conducted. The emergence of the RWD Database (Real World Data Co. Ltd. database), containing both EMR and claims data [12], may help to fill this gap. For instance, a validation study was conducted leveraging this database to use EMR data as a gold standard for the validation of algorithms to identify cardiovascular outcomes [27]. This study developed claims-based algorithms to identify clinical outcomes and validated them through the use of EMRs. Furthermore, another study validated a claims-based (JMDC) algorithm identifying patients with Crohn’s disease by using patients’ medical record data from a single site. Patient identifiers from the claims database were restored to collect data from medical records [28].

The creation of databases rich in clinical outcomes may help address the gaps encountered in traditional claims data sources. As mentioned above, the MID-NET database, which is rich in clinical characteristics, has been developed to respond to the needs of conducting PASS. Multiple case studies, focusing on blood coagulability, thrombocytopenia, and renal dysfunction, identified clinical outcomes based on laboratory data that are collected in this database [2]. CIN encompasses multiple registry databases aiming to collect clinically relevant data that reflect real-world clinical practice, and studies using the Registry of Muscular Dystrophy (REMUDY) and Japanese Registry for Mechanically Assisted Circulatory Support (J-MACS) have been conducted for patient enrollment and PASS, respectively. The J-MACS database also includes quality of life data that are not available in traditional Japanese claims RWD sources. However, the validation of the data quality for each registry remains to be addressed and approved by PMDA [9].

3 Challenges and Solutions Relating to the Methodology

3.1 Lack of Design Transparency in RWE Generation

The growing importance of RWE for regulatory purposes has led to the need to build stakeholders’ trust in the RWE generation process. Transparency is a necessary but not sufficient condition to ensure high-quality RWE. In real-world research the study design precedes the study implementation, but transparency is often lacking for the methodological decisions applied during study conduct and potential deviations from the planned analysis, which may result in cherry picking of the results [29]. There are several publications that provide guidance on a structured process to support the validity, transparency, and reproducibility of real-world research [30, 31, 32]. A published study proposes a sequential process from early design addressing the adequacy and feasibility of the data, analysis considerations, and reporting [31]. During the design phase and prior to analysis, it is advisable to identify the critical temporal anchors (e.g., study entry date, follow-up window) and provide a design diagram [29, 32, 33]. The structured template and reporting tool for real world evidence (STaRT-RWE) has been developed as a result of a collaboration between the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) and the International Society for Pharmacoepidemiology to provide guidance on planning and reporting of RWE studies, including aspects of key study parameters, study design, validity assessment, sensitivity analysis [34, 35], study registration, replicability, and stakeholder involvement to increase decision-makers’ confidence in RWE [30].

Study design is constrained by and strongly related to the structure of the data sources used; therefore, as a first step, we believe that transparent, detailed, structured descriptions of the data sources used, as well as reports of the limitations and advantages of the study design, would help build consistent knowledge about the appropriateness of the existing data sources in studies in Japan (Table 1). In the design phase, this information can be acquired from data dictionaries and pre-analysis investigation of data samples to assess data completeness. In the context of evaluating treatment effectiveness, the following process has been proposed: (i) choose the estimand by specifying the hypothesis, (ii) opt for a trial emulation framework and define the variables for meeting the “exchangeability” assumption, and (iii) specify the analytical method at the design stage [36]. In this regard, examples of trial emulation using observational data were reported in the literature [37, 38]. In Japan, only one trial emulation study has been published, aiming to evaluate the effect of several antidiabetics among Japanese patients using the NDB [39]. Further pilot studies are needed to evaluate challenges and opportunities of conducting emulation trials using Japanese RWD.

3.2 Lack of Clarification on “Time Zero” Definition and Related Bias

We have previously discussed the lack of reliability and reproducibility due to the absence of guidance on the definition of “time zero” of follow-up [4]. Inappropriate definition of “time zero” due to misalignment with study eligibility and treatment assignment may introduce bias in real-world studies, especially selection bias and immortal time bias [14]. On one hand, setting “time zero” either after both eligibility and treatment assignment or at eligibility but after treatment assignment introduces a selection bias in favor of prevalent users. On the other hand, setting “time zero” either before eligibility and treatment assignment or at eligibility and before treatment assignment introduces immortal time bias [14]. A recently published study using a Japanese database showed that, in the new user versus non-user design, different settings of “time zero” for non-users (i.e., at eligibility assignment, by propensity score matching, by random selection, and the cloning method) would generate substantially different results in the parameters evaluated [15]. In RWD studies comparing a treatment with an active comparator, the new user active comparator design can accommodate this misalignment by anchoring “time zero” of follow-up to treatment initiation and eligibility [40]. Moreover, this design allows preservation of the temporality of covariates and evaluation of time-varying hazards [41]. Even when “time zero” is properly defined, it remains necessary to adjust for potential post-“time zero” selection due to censoring [14].

Determining the consistency of a treatment strategy may not always be straightforward, and in that case, two approaches have been proposed: (i) random assignment of each individual to one of the treatments, and (ii) creating exact copies (clones) of each individual [14]. This latter approach has recently gained attention due to the possibility of considering the time lag, for example, between diagnosis and treatment assignment as part of an intervention [42]. The cloning method by design introduces artificial censoring, which occurs when a clone becomes incompatible with the arm to which it was allocated. The method uses weights that account for the resulting selection bias and are estimated using inverse probability of censoring weighting. However, this method inflates sample size, and observations are not independently and identically distributed; hence variance estimation requires a robust method, for example, non-parametric bootstrap, which can be computationally intensive when using large databases [42]. One possibility would be to work on a subsample to address this issue when the sample is large enough.

3.3 Hurdle in Handling Time-Varying Confounding

In longitudinal studies, properly accounting for the relationship between patient characteristics, patients’ medical history, and the treatment is required to avoid bias. However, when time-varying exposure can be influenced by prior time-varying patients’ clinical characteristics and mediate the effect on the outcome, common methods for controlling confounding may not be valid [43]. A marginal structural model using inverse propensity score weighting can address this mediator issue, and is based on estimating inverse probability of treatment weighted estimators [44]. A marginal structural model was applied to account for the time-dependent nature of anemia to assess its effect on renal function using the JMDC database, implementing the inverse propensity score weighting method to balance for time-dependent confounding [45]. Importantly, the abovementioned methods do not address the issue of long-term dependency, and in this regard, advanced methods such as the counterfactual recurrent network were developed to handle time-dependent confounders by using balancing representations [46]. The counterfactual recurrent network implements a model that uses adversarial training and a recurrent network to build a balancing representation without assuming the form of treatment assignment, resulting in removal of the association between patient history and treatment assignment. Other methods and algorithms may help address these challenges. For instance, Medical Deconfounder, which relies on a probabilistic factor model, was developed to account for common unobserved characteristics among multiple medications [47]. The G-method is another approach that can account for time-varying confounders affected by previous treatment, though it requires modeling both covariates and outcome, and may be sensitive to violation of the assumption of no unmeasured confounding and to model misspecification [48]. Further, targeted maximum likelihood estimation would allow for implementation of a variety of machine learning algorithms [49]. However, a significant challenge is the lack of a data-driven procedure to find the most appropriate causal inference method when using RWD.
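As an illustration of the weighting step behind a marginal structural model, the sketch below computes stabilized inverse probability of treatment weights for one patient followed over several visits. It assumes that the treatment probabilities have already been estimated elsewhere (e.g., by two logistic regressions); the function name and the input layout are hypothetical and are not taken from the cited studies.

```python
from itertools import accumulate
from operator import mul


def stabilized_weights(treated, p_num, p_den):
    """Cumulative stabilized IPT weights for one patient across visits.

    treated[t]: 1 if the patient was treated at visit t, else 0
    p_num[t]:   estimated P(treated at t | past treatment only)
    p_den[t]:   estimated P(treated at t | past treatment and
                time-varying covariates)
    """
    ratios = []
    for a, pn, pd in zip(treated, p_num, p_den):
        # Probability of the treatment actually received at this visit.
        numerator = pn if a == 1 else 1.0 - pn
        denominator = pd if a == 1 else 1.0 - pd
        ratios.append(numerator / denominator)
    # The weight at visit t is the product of the ratios up to visit t.
    return list(accumulate(ratios, mul))


# Illustrative values: treated at visit 0, untreated at visit 1.
weights = stabilized_weights([1, 0], [0.5, 0.5], [0.25, 0.5])
```

The weights would then be used in a weighted outcome model. In practice, the distribution of the weights is usually inspected, and extreme values may be truncated, since a few very large weights can dominate the estimate.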

3.4 Sensitivity Analysis and Validation Study to Tackle Uncertainty of Key Parameter Definitions

Another issue is the presence of uncertainty in the operational definitions, which may have a negative impact on decision-makers. Sensitivity analyses on time windows, outcomes, and exposures may help assess the impact of those definitions on the results. Using the MDV database, for instance, sensitivity analysis was conducted by setting a more stringent definition of the treatment/exposure in a study that investigated treatment effectiveness in patients with inflammatory bowel disease [50]. Further, a comparative effectiveness study among patients with non-valvular atrial fibrillation employed varying time horizons in a sensitivity analysis on the same database. We believe that conducting sensitivity analysis on study parameter definitions would be useful for stakeholders to evaluate result robustness in the presence of uncertainty on parameter definitions.

Generally, RWD are collected for purposes other than research (e.g., audit, reimbursement), and diagnoses of medical conditions may be incorrectly captured in the data, resulting in potential misclassification of outcomes and exposures. Misclassification occurs in most studies, and in general, methods for sensitivity analysis consist of testing different assumptions with regard to sensitivity and specificity [51]. There are several methods to evaluate the impact of misclassification. For example, probabilistic sensitivity analysis allows a record-level correction that can handle the correction for multiple sources of bias [51]. Another study focused on outcome misclassification and proposed an adjustment method when misclassification probabilities are known beforehand [52]. One of the important issues when focusing on misclassification is setting the proper range of sensitivity and specificity parameters to conclude on the measurement validity [53].
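A simple deterministic version of such a sensitivity analysis can be sketched as follows. The code back-corrects observed outcome counts for an assumed sensitivity and specificity and recomputes the risk ratio. It assumes non-differential outcome misclassification (the same sensitivity and specificity in both exposure groups) and is a generic textbook-style correction, not the specific methods of [51] or [52]; all function names are illustrative.

```python
def corrected_cases(observed_cases, total, sensitivity, specificity):
    """Expected true number of cases given observed cases in a group of size `total`.

    Uses: observed = Se * true + (1 - Sp) * (total - true), solved for `true`.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    # Note: implausible Se/Sp assumptions can yield negative corrected counts,
    # which signals that the assumed values are incompatible with the data.
    return (observed_cases - (1.0 - specificity) * total) / denom


def corrected_risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp,
                         sensitivity, specificity):
    """Risk ratio after correcting both groups with the same Se and Sp."""
    a = corrected_cases(cases_exp, n_exp, sensitivity, specificity)
    b = corrected_cases(cases_unexp, n_unexp, sensitivity, specificity)
    return (a / n_exp) / (b / n_unexp)


# Scan a grid of assumed sensitivity/specificity values (hypothetical counts).
grid = [(se, sp) for se in (0.7, 0.8, 0.9) for sp in (0.95, 0.99)]
results = {
    (se, sp): corrected_risk_ratio(200, 1000, 100, 1000, se, sp)
    for se, sp in grid
}
```

Reporting the corrected estimate over a justified range of sensitivity and specificity values, rather than a single pair, shows how strongly the conclusion depends on measurement validity.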

Code-based algorithm validation studies are also scarce in Japan [12]. There are multiple considerations when conducting validation studies, such as the selection of the gold standard against which the algorithm validity is assessed. In Japan, positive predictive value (PPV) is considered the main performance indicator for outcome validation [54]. There are multiple Japanese data sources that can be leveraged to conduct a validation study, and guidelines recommend using clinical measurement, registry databases, or chart review as a gold standard [54]. As mentioned above, data sources such as the Real World Data Co. Ltd. database have the potential to be used for validation, since EMR data could be used as a gold standard.

There are only a few examples of code-based validation studies using RWD in Japan. One study using JMDC claims data evaluated the sensitivity, specificity, negative predictive value, and PPV of an algorithm identifying patients with Crohn’s disease using data extracted from a medical chart review [28]. The authors concluded that using ICD-10 codes alone was not sufficient to achieve a suitable PPV, and adding prescription codes was necessary [28]. Another study aimed to validate cardiovascular outcomes among diabetic patients in claims using EMR data from the Real World Data Co. Ltd. database for validation [27]. For the three main outcomes (congestive heart failure, hemorrhagic stroke, and mild or moderate chronic kidney disease), the PPV was estimated to be over 90%, despite the fact that definitions in the EMR data were based solely on ICD-10 codes, and only one of the outcomes was defined based on laboratory data [27]. As an alternative example, a validation study was carried out utilizing JMDC to evaluate distinct algorithms for identifying treated diabetic patients [55]. Hemoglobin A1c measurements from health checkup data were used as reference. PPVs above 80% were found for several algorithms [55]. The study focused on both PPV and sensitivity.
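For reference, the four performance indicators mentioned above can be derived from a 2x2 table comparing the claims-based algorithm with the gold standard. The sketch below uses hypothetical counts that are not taken from any of the cited studies.

```python
def validation_metrics(tp, fp, fn, tn):
    """Performance of a code-based algorithm against a gold standard.

    tp: algorithm positive, gold standard positive
    fp: algorithm positive, gold standard negative
    fn: algorithm negative, gold standard positive
    tn: algorithm negative, gold standard negative
    """
    return {
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical chart-review results for 1000 patients.
metrics = validation_metrics(tp=80, fp=20, fn=10, tn=890)
```

Note that PPV can be estimated by reviewing only algorithm-positive patients, whereas sensitivity, specificity, and negative predictive value also require gold-standard assessment of algorithm-negative patients. PPV and negative predictive value also depend on outcome prevalence in the validation sample.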

3.5 Assumption of No Unmeasured Confounding in Traditional Methods

Methods widely implemented in RWD studies, such as propensity score matching, are based on the assumption of absence of unmeasured confounding, also known as the “no unobserved confounder assumption” (NUCA). Failure to evaluate the impact of deviation from this assumption may reduce the quality of the evidence and lower its credibility among stakeholders. To enhance the quality of evidence, the importance of investigating residual confounding was noted to better clarify the measured effect [56]. Best practices for addressing potential deviation from NUCA when estimating causal treatment effect using RWD have been suggested [57]. As the first step, an initial sensitivity analysis is conducted using the E-value, which expresses the minimum strength of unmeasured confounding needed to nullify the observed treatment effect. The E-value does not require any assumptions, nor prior information on unmeasured confounding, and the robustness of the evidence increases with larger E-values [58]. When small E-values are obtained, that is, the presence of unmeasured confounding is not implausible, further analyses can be performed to examine the deviation from NUCA. However, many of these methods are not implementable in a straightforward manner and lack guidance [57].
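The E-value for a point estimate is straightforward to compute. The sketch below uses the commonly cited closed-form expression for a risk ratio, E = RR + sqrt(RR x (RR - 1)), with protective estimates inverted first. Applying it directly to a hazard ratio or odds ratio is an approximation that is usually considered reasonable only when the outcome is rare, which is an assumption of this sketch.

```python
import math


def e_value(risk_ratio):
    """E-value for an observed risk ratio (point estimate).

    Estimates below 1 are inverted so that the formula is applied
    on the scale where the ratio is at least 1.
    """
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = 1.0 / risk_ratio if risk_ratio < 1.0 else risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


# An observed risk ratio of 2.0 and a protective ratio of 0.5
# lead to the same E-value.
ev_harm = e_value(2.0)
ev_protective = e_value(0.5)
```

The same function can be applied to the confidence limit closest to the null, which indicates how much unmeasured confounding would be needed to move the interval to include 1.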

In Japan, RWD studies evaluating the impact of unmeasured confounding are limited. For instance, a quantitative bias analysis was conducted using E-value to assess the minimum strength of unmeasured confounding to nullify the hazard ratio of sodium-glucose cotransporter-2 inhibitors compared with other glucose-lowering drugs in Japanese patients with diabetes. Because the E-value was above two, the authors concluded that the impact of the unmeasured confounding would be limited [59].

E-value interpretation depends on the adjustment for confounders performed, and there is no clear threshold to decide whether to carry out further sensitivity analyses [60]. Depending on the level of information on confounders available in the data and the study design, methods such as instrumental variable (IV) methods [61], difference-in-difference [62], missing cause [63], trend-in-trend [64], and perturbation variable [65] may be appropriate. Studies using Japanese databases with such quasi-experimental designs accounted for 22.1% (142/643) of all the studies using RWD reported from 2015 to 2020 [7]. Few Japanese studies used quasi-experimental designs other than propensity score methods. For instance, one study examined the effectiveness of tranexamic acid in post-tonsillectomy hemorrhage by defining the use of tranexamic acid in the preceding patient as the IV, using the Diagnosis Procedure Combination (DPC) database. Results between this method and the propensity-score-based method were consistent [66]. Another study had a similar approach, using IVs to evaluate the effect of dexmedetomidine on patients admitted to the intensive care unit using the DPC database. The selected IV was a proxy for hospital preference to prescribe dexmedetomidine [67]. Another quasi-experimental design, the difference-in-difference, has also been implemented using Japanese RWD. A COVID-19 study assessed the difference-in-difference for the change in admissions for ambulatory care sensitive conditions before and during the pandemic by comparing prefectures enforcing a state of emergency with those that did not [68].
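In its simplest two-group, two-period form, the difference-in-difference estimate is the change observed in the exposed group minus the change observed in the comparison group. The sketch below uses hypothetical mean admission counts, not figures from [68], and relies on the usual parallel-trends assumption.

```python
def did_estimate(pre_treated, post_treated, pre_control, post_control):
    """Two-group, two-period difference-in-difference on group means.

    Valid as a causal estimate only if both groups would have followed
    parallel trends in the absence of the intervention.
    """
    change_treated = post_treated - pre_treated
    change_control = post_control - pre_control
    return change_treated - change_control


# Hypothetical mean monthly admissions per prefecture group.
effect = did_estimate(
    pre_treated=100.0, post_treated=80.0,   # state of emergency enforced
    pre_control=100.0, post_control=95.0,   # no state of emergency
)
```

Published analyses typically estimate the same quantity as the interaction term in a regression model, which allows covariate adjustment and appropriate standard errors.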

More recently, a novel method was developed for addressing unobserved confounding when the assumption of the presence of two correlated confounders with a nonlinear condition on the exposure holds [69]. When external data are available, methods leveraging this information can be implemented, such as Bayesian twin regression [70], multiple imputation [71], and propensity score calibration [72]. However, identifying the most suitable model is challenging, especially in a causal inference framework; use of synthetic datasets generated using the “plasmode” simulation based on healthcare claims, or Wasserstein generative adversarial networks, may help assess biases among different methods [73, 74]. Overall, we would recommend testing the robustness to NUCA using the E-value as the initial step (Table 1).

4 Conclusion

Coordinated efforts are being made in Japan toward optimization of RWD as the number of patient-level data sources is constantly increasing, and new databases, such as MID-NET, strive to mitigate the lack of clinical data in the commonly used commercial RWDs. Probabilistic or deterministic linkage and the use of “checklists” for RWD assessment are acknowledged as potential solutions for data-related issues. Transparent reporting of study design is recognized as an important element to increase reproducibility and credibility with respect to stakeholders. Emphasis should be placed on carefully defining different sources of bias, especially time-related bias; addressing it analytically; and comparing different methods when no best methods have been previously defined. Robustness assessment covering the uncertainty of the definitions, misclassification, and hypothesis on unmeasured confounders would be valuable for decision-makers. Future pilot studies would provide more insight into the strengths and limitations of the databases and methodological issues. The knowledge gained can support the development of methodological guidelines for pharmacoepidemiological real-world studies in Japan.