1 Background

Real-world data have become increasingly important in medical science and healthcare (Mellinghoff et al. 2022; Sun et al. 2018). Administrative claims databases in many countries contain ever-larger volumes of data and have great potential for new scientific discovery, for solving problems, and for making decisions that would otherwise be unfeasible (Hernán and Robins 2016). Meanwhile, new, effective, and practically feasible statistical designs are needed to unlock the potential of real-world data, so that decision makers and practitioners can apply the results and conclusions to better meet the medical and healthcare needs of society (Frieden 2017; Baumfeld Andre et al. 2020).

With the target trial emulation (TTE) framework, an increasing amount of real-world data is being utilized (Matthews et al. 2022; Xie et al. 2020; Madenci et al. 2021; Takeuchi et al. 2021). However, TTE is not always properly understood. The idea of TTE is not to bring observational studies closer to randomized controlled trials (RCTs), but to bring clinical studies, including observational studies and RCTs, closer to the “ideal study” that can settle a research question (Hernán and Robins 2016). For example, in an ideal comparison, the same people would receive each of the different interventions, and their outcomes would be compared. This is clearly an ideal study, but an impossible one. Therefore, RCTs use randomization, which is mimicked by propensity score (PS) analysis in observational studies (Liang et al. 2014). Randomization is thus a means of approaching the ideal, not the goal itself. From the TTE perspective, a study has greater comparability when all factors are matched, in all pairs, between the comparison and control groups.

One study suggests that correct estimates require designing the study and analyzing the data on sound principles (Hoffman et al. 2022). The purpose of this study is to verify a new method for real-world data analysis based on the TTE framework.

2 Methods

In the first half of the study, we validated the new method by simulation; in the second half, we used this method to conduct a clinical study on actual real-world data. As the subject of the latter half, we estimated the effect of the specific health guidance provided in Japan on the prevention of diabetes onset and on medical expenditures. This subject was chosen because the effect of lifestyle guidance in preventing the onset of diabetes is almost self-evident (Muramoto et al. 2014), making it difficult to conduct an RCT from an ethical perspective; therefore, real-world evidence is required to estimate this effect. This observational cohort study used administrative claims data and was approved by the Ethics Committee of Nara Medical University (approval no. 1123–7, October 8th, 2015).

2.1 Outline for a new method: exact-matching algorithm using administrative health claims database equivalence factors

This novel method is based on a very simple idea: according to the TTE concept, an ideal study in terms of comparability (= target trial) should have all covariates, except exposure, matched in both groups. The key to this simple approach is to match each of the covariates individually, not just representative values of the covariates, as is done in an RCT or with the PS in an observational study. This differs from traditional statistical methods in that all measurable factors, including interactions of any order, are controlled. It is also characterized by the fact that the differences in the covariates are always zero, not merely balanced in distribution, regardless of the stratification of the analysis. Administrative health claims database equivalence factors (AHCDEFs) are collected from administrative claims databases and can confound exposure and outcome. Among individuals whose AHCDEFs match perfectly, the weighting factor for whichever of the exposed and unexposed groups is smaller is set to 1, and the weights in the larger group are scaled so that the sums of the weighting factors in the two groups are equal. For example, consider a simple case in which the AHCDEFs are age, gender, and systolic and diastolic blood pressure: consider a 42-year-old male with a systolic/diastolic blood pressure of 130/80 mmHg. The control for that case is a 42-year-old male with a blood pressure of 130/80 mmHg. Here, all four items are matched between the case and the control; along with the four items, the 11 interaction terms formed from any combination of the four items are also controlled. The difference of this approach from the conventional method, which assumes agreement of representative values in the population, is that we are able to control for interaction terms of any order in the model.
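The weighting rule can be sketched in a few lines. The following is a minimal illustration under a hypothetical record layout, in which the AHCDEFs are the four items of the example above; field names and data are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical records; 'exposed' is the exposure flag, the other fields are AHCDEFs.
records = [
    {"age": 42, "sex": "M", "sbp": 130, "dbp": 80, "exposed": 1},
    {"age": 42, "sex": "M", "sbp": 130, "dbp": 80, "exposed": 0},
    {"age": 42, "sex": "M", "sbp": 130, "dbp": 80, "exposed": 0},
    {"age": 55, "sex": "F", "sbp": 140, "dbp": 90, "exposed": 1},  # no exact match: dropped
]

def exact_match_weights(records, keys=("age", "sex", "sbp", "dbp")):
    strata = defaultdict(lambda: {0: [], 1: []})
    for r in records:
        strata[tuple(r[k] for k in keys)][r["exposed"]].append(r)
    weighted = []
    for arms in strata.values():
        n0, n1 = len(arms[0]), len(arms[1])
        if n0 == 0 or n1 == 0:
            continue  # stratum lacks a counterpart and drops out of the analysis
        m = min(n0, n1)  # the smaller arm gets weight 1; the larger arm is scaled down
        for flag, n in ((0, n0), (1, n1)):
            for r in arms[flag]:
                weighted.append({**r, "weight": m / n})
    return weighted

weighted = exact_match_weights(records)
# Within every stratum (and hence overall), the weight sums of the two arms are equal.
```

Because matching is exact on the tuple of AHCDEFs, every interaction among the matched items is automatically balanced as well.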

2.2 Simulation procedure

The trials were conducted 500 times independently (\({\text{t}}=\mathrm{1,2}, \cdots , 500\)), considering misclassification and chance errors in all variables, as well as competing events of the outcome. The simulation process followed a sufficient causal model (Liang et al. 2014): an event occurs only when at least one sufficient cause is complete. For each patient (\({\text{n}}=1, 2, \cdots , 50 000\)), we constructed a confounding model of the exposure (\({X}_{t1n}\)) and outcome (\({{\text{Y}}}_{tn}\)) of interest. That is, there was no difference in the true proportions of outcome occurrence between the groups with and without exposure, and the true odds ratio of outcome occurrence for the exposed group relative to the unexposed group was 1. \({X}_{tin}\) are low-to-moderately correlated variables. \({X}_{t1n}\) is observable; \({X}_{t 2 n}, \cdots , {X}_{t 100 n}\) contain both observable and unobservable variables, and any of these 99 variables may cause Y. We estimate the effect of \({X}_{t1n}\) on \({{\text{Y}}}_{tn}\) when only some component causes are observable and known, and when some noncausal factors may be mistaken for causal factors. Thus, we have

$$P\left({{\text{Y}}}_{tn}=1|{X}_{tin},{{\text{K}}}_{ti} \right)=\frac{exp\left({\beta }_{t1}{X}_{t1n}+\sum_{i=2}^{100}{\beta }_{ti}{X}_{tin}{{\text{K}}}_{ti}\right)}{1+{\text{exp}}\left({\beta }_{t1}{X}_{t1n}+\sum_{i=2}^{100}{\beta }_{ti}{X}_{tin}{{\text{K}}}_{ti}\right)}$$

If \({{\text{X}}}_{tin}\) is an observable and known risk factor, \({{\text{K}}}_{ti}=1\); otherwise \({{\text{K}}}_{ti}=0\) (\({\text{i}}=2, \cdots , 100\)). For some \({\text{i}}\), \({{\text{K}}}_{ti}=1\) and \({{\text{X}}}_{tin}\) is a component of a sufficient cause of Y. \({\beta }_{ti}\) is the estimated effect of \({X}_{tin}\) on \({{\text{Y}}}_{tn}\) if \({{\text{K}}}_{ti}=1\). In summary, \({X}_{t1n}\) is a factor that is observed and is not a cause of Y. Among the remaining 99 \({X}_{tin}\), there is at least one that is part of a sufficient cause of Y with \({{\text{K}}}_{ti}=1\). When \({{\text{K}}}_{ti}=0\), \({X}_{tin}\) is either unobservable, or observable but not a risk factor.
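As a concrete reading of the model, the conditional probability above can be computed per patient as follows. This is only a sketch, with `x`, `beta`, and `k` as length-100 lists standing in for \({X}_{tin}\), \({\beta }_{ti}\), and \({{\text{K}}}_{ti}\) (index 0 corresponds to i = 1, the exposure).

```python
import math

def outcome_probability(x, beta, k):
    """P(Y_tn = 1 | X, K) for one patient, following the logistic model above.
    x[0] is the exposure X_t1n; x[1:] are X_t2n..X_t100n; k[i] gates whether
    covariate i+1 enters the linear predictor (K_ti)."""
    eta = beta[0] * x[0] + sum(b * xi * ki for b, xi, ki in zip(beta[1:], x[1:], k[1:]))
    return math.exp(eta) / (1.0 + math.exp(eta))
```

With all coefficients zero the probability is exactly 0.5, matching the null model in which exposure has no true effect.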

2.3 Details of the simulation

To generate the low-to-moderately correlated variables \({X}_{t 1 n},{X}_{t 2 n}, \cdots , {X}_{t 100 n}\), independent random samples of \({{\text{V}}}_{tin}\) and \({{\text{U}}}_{tn}\) were drawn from uniform [0,1) distributions. All \({{\text{V}}}_{tin}\) are 500 × 100 × 50,000 independent samples, and all \({{\text{U}}}_{tn}\) are 500 × 50,000 independent samples. Independent random samples of \({{\text{P}}}_{ti}\) were drawn from uniform [0.3,0.6) distributions. All \({{\text{P}}}_{ti}\) are 500 × 100 independent samples. For \(j\)=\(1, \cdots , 9\), \({{\text{A}}}_{tji}\) takes a random value drawn from a uniform [0.5, 0.9) distribution. All \({{\text{A}}}_{tji}\) are 500 × 9 × 100 independent samples. \({G}_{tjin}\) is the true, unobserved value based on an exact threshold, with no misclassification: \({G}_{tjin}\)=1 if the linear combination \({{\text{V}}}_{tin}{{\text{P}}}_{ti}+{{\text{U}}}_{tn}\left(1-{{\text{P}}}_{ti}\right)\)>\({{\text{A}}}_{tji}\), and \({G}_{tjin}\)=0 if \({{\text{V}}}_{tin}{{\text{P}}}_{ti}+{{\text{U}}}_{tn}\left(1-{{\text{P}}}_{ti}\right)\le {{\text{A}}}_{tji}\). \({X}_{tin}\) is the known, observed value based on the expected threshold of the uniform [0.5, 0.9) distribution, namely 0.7, and therefore contains some misclassifications: \({X}_{tin}\)= 1 if \({{\text{V}}}_{tin}{{\text{P}}}_{ti}+{{\text{U}}}_{tn}\left(1-{{\text{P}}}_{ti}\right)\)>\(0.7\), and \({X}_{tin}\)= 0 if \({{\text{V}}}_{tin}{{\text{P}}}_{ti}+{{\text{U}}}_{tn}\left(1-{{\text{P}}}_{ti}\right)\le 0.7\).
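The covariate-generation step can be illustrated as a reduced-scale sketch (one trial, fewer patients than the 50,000 of the full simulation; names mirror the symbols above).

```python
import random

random.seed(0)
N, M = 2000, 100   # patients and covariates (50,000 x 100 per trial in the paper)

U = [random.random() for _ in range(N)]           # shared per-patient term U_tn
P = [random.uniform(0.3, 0.6) for _ in range(M)]  # mixing weight P_ti per covariate
V = [[random.random() for _ in range(N)] for _ in range(M)]  # idiosyncratic term V_tin

# Sharing U_tn across covariates induces the low-to-moderate correlation;
# X_tin dichotomizes the latent score at the fixed expected threshold 0.7.
X = [[int(V[i][n] * P[i] + U[n] * (1 - P[i]) > 0.7) for n in range(N)]
     for i in range(M)]
# G_tjin is built the same way but against per-(j, i) thresholds A_tji ~ U[0.5, 0.9)
# instead of the fixed 0.7, so X is a misclassified version of the true indicators.
```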

\({C}_{tji}\) takes a random value drawn from the Bernoulli distribution with a probability of success of 0.05. All \({C}_{tji}\) are 500 × 9 × 100 independent samples, which indicate whether the linear combination \({{\text{V}}}_{tin}{{\text{P}}}_{ti}+{{\text{U}}}_{tn}\left(1-{{\text{P}}}_{ti}\right)\) is a component of the \(j\) th possible sufficient cause. Let \({O}_{tjn}=1\) when all components of the \(j\) th possible sufficient cause are active in the \(n\) th observation of the \(t\) th trial, that is, when \(\sum_{i=2}^{100}{G}_{tjin}{C}_{tji} =\sum_{i=2}^{100}{C}_{tji}\). If, for some \({\text{t}}\) and \(j\), \(\sum_{i=2}^{100}{G}_{tjin}{C}_{tji} <2\), all variables are redetermined through the same \(j\) th random process in the \({\text{t}}\) th trial. \({F}_{tj}\) takes a random value drawn from the Bernoulli distribution with a probability of success of 0.5. All \({F}_{tj}\) are 500 × 9 independent samples, which show whether each of the nine possible sufficient causes is a real sufficient cause. If a trial has no real sufficient cause, that is, if \(\sum_{j=1}^{9}{F}_{tj}\)= 0 for some \({\text{t}}\), then all variables are redetermined through the same random process in the \({\text{t}}\) th trial.

\({E}_{tn}\) and \({Q}_{tn}\) take random values drawn from the Bernoulli distribution with a probability of success of 0.001. All \({E}_{tn}\) are 500 × 50,000 independent samples, which indicate competing events for the outcome. All \({Q}_{tn}\) are 500 × 50,000 independent samples, indicating a small error in the outcome. \({{\text{Y}}}_{tn}=1\) where \({Q}_{tn}\)= 1, or where \(\sum_{j=1}^{9}{O}_{tjn}{F}_{tj}\ge 1\) and \({E}_{tn}=0\); otherwise \({{\text{Y}}}_{tn}=0\). \({{\text{K}}}_{ti}\) is a random value, drawn from the Bernoulli distribution, with a probability of success of 0.1 + \(\sum_{j=1}^{9}{C}_{tji}{F}_{tj}\), which indicates whether \({X}_{ti}\) is an observable and known risk factor for the outcome.
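Putting the outcome step together, the following is a reduced-scale sketch for a single trial; `G` is stubbed with random indicators rather than built from the thresholds of Sect. 2.3, and the paper's redrawing of degenerate causes (fewer than two components) is replaced by a simple guard.

```python
import random

random.seed(1)
N, M, J = 2000, 99, 9  # patients, candidate component causes (i = 2..100), causes

# G stands in for the error-free component indicators of Sect. 2.3 (stubbed here).
G = [[[random.random() < 0.4 for _ in range(N)] for _ in range(M)] for _ in range(J)]
C = [[random.random() < 0.05 for _ in range(M)] for _ in range(J)]  # C_tji: membership
F = [random.random() < 0.5 for _ in range(J)]    # F_tj: is cause j real?
E = [random.random() < 0.001 for _ in range(N)]  # E_tn: competing event
Q = [random.random() < 0.001 for _ in range(N)]  # Q_tn: small error in the outcome

def y(n):
    # O_tjn = 1 when every member component of cause j is active for patient n;
    # causes with < 2 components are simply skipped instead of being redrawn.
    fired = any(
        F[j] and sum(C[j]) >= 2 and all(G[j][i][n] for i in range(M) if C[j][i])
        for j in range(J)
    )
    return 1 if Q[n] or (fired and not E[n]) else 0

Y = [y(n) for n in range(N)]
```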

2.4 Adjustment by new methods

In the above model, there was no difference in the true proportion of outcome occurrence between the groups with and without exposure. However, bias in the model resulted in an apparent difference in the proportion of outcomes. To visualize the extent to which this bias was adjusted by our proposed method, we simulated \(P\left({{\text{Y}}}_{tn}=1|{X}_{t1n}=1\right)-P\left({{\text{Y}}}_{tn}=1|{X}_{t1n}=0 \right)\) with and without adjustment.
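The quantity being simulated is the difference in outcome proportions between the exposed and unexposed groups, optionally using the matching weights from the new method; a minimal sketch:

```python
def risk_difference(y, x, w=None):
    """P(Y=1 | X=1) - P(Y=1 | X=0), optionally with matching weights w.
    y, x are parallel 0/1 lists of outcome and exposure."""
    if w is None:
        w = [1.0] * len(y)
    def prop(flag):
        num = sum(wi * yi for yi, xi, wi in zip(y, x, w) if xi == flag)
        den = sum(wi for xi, wi in zip(x, w) if xi == flag)
        return num / den
    return prop(1) - prop(0)

# Identical outcome proportions in both arms give a difference of 0.
d = risk_difference([1, 0, 1, 0], [1, 1, 0, 0])
```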

2.5 Comparison with conventional methods

Two conventional methods, multivariate regression analysis and the PS, were compared with this method (Seeger et al. 2005; Schneeweiss et al. 2009). In the model, the true odds ratio of outcome occurrence for the exposed group relative to the unexposed group was 1. Specifically, the odds ratios of exposure estimated by the three methods were compared. A type I error is the probability of erroneously concluding that there is a difference when, in fact, there is none; the probabilities of type I errors for the three methods were likewise determined. In the PS analysis, the PSs themselves were used as covariates in the regression model, because the variance in the two groups was not very different (Rubin 1979).
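The simulated type I error probability of a method is then simply the share of the 500 null trials in which its test rejects at the 5% level; for example:

```python
def type_one_error_rate(p_values, alpha=0.05):
    """Share of null trials that falsely reject (the true odds ratio is 1),
    i.e. the simulated type I error probability for one adjustment method."""
    return sum(p < alpha for p in p_values) / len(p_values)
```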

2.6 Application to real-world data: data source

To demonstrate the methodology, the study cohort comprised anonymized data of individuals in the Kokuho database (KDB) of Nara prefecture in Japan. These data provided information on personal identifiers, dates, age group, sex, descriptions of the procedures performed, World Health Organization International Classification of Diseases, 10th Revision (ICD-10) diagnosis codes, medical care received, medical examinations conducted (without their results), prescribed drugs, and specific health check-ups, including their results, from 2013 to 2021.

2.7 Study population

We included data on individuals who underwent specific health check-ups between April 2014 and March 2021. Those who had not been followed up in the past year and those who had been prescribed diabetes medications in the past year were excluded.

2.8 Definition of diabetes

A validated algorithm was used to define diabetes based on claims data from Japan. This algorithm for detecting people with diabetes (74.6% sensitivity and 88.4% positive predictive value) had three elements: diagnosis-related codes for diabetes without the “suspected” flag, medication codes for diabetes, and both codes appearing on the same record (Nishioka et al. 2022). This algorithm cannot detect people with diabetes who have not consulted a doctor or who are on diet and exercise therapy only, but it can identify most of those on medication.
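A sketch of the three-element check follows. The code values are placeholders for illustration (the actual validated code lists are given in Nishioka et al. 2022 and are not reproduced here).

```python
DM_ICD10 = {"E10", "E11", "E12", "E13", "E14"}  # illustrative ICD-10 diabetes categories
DM_DRUGS = {"DM_DRUG"}                          # placeholder medication code list

def has_diabetes(claim_records):
    """Flag a patient when a definite (non-'suspected') diabetes diagnosis code
    and a diabetes medication code appear on the same claim record."""
    for rec in claim_records:
        definite_dx = not rec.get("suspected", False) and \
            any(code[:3] in DM_ICD10 for code in rec["dx_codes"])
        on_medication = any(code in DM_DRUGS for code in rec.get("rx_codes", []))
        if definite_dx and on_medication:
            return True
    return False
```

Note that the same-record requirement means a diagnosis on one claim and a prescription on another do not, by themselves, trigger the flag.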

2.9 Effect of the specific health guidance

Among those who received the specific health check-up during the six-month period, those who received health guidance were identified and designated as the health guidance group. Those whose age, sex, body mass index (BMI), abdominal circumference, medical expenses in the past year, number of days of outpatient visits in the past year, and number of days of hospitalization in the past year matched those of the health guidance group were identified and designated as the control group. Odds ratios for new-onset diabetes were calculated for the health guidance and control groups. Generalized estimating equations were used in the analysis, with a binomial distribution assumed for the diabetes outcome and a logit link function. In an additional analysis, average medical expenditures per person per month in both groups were observed over time. All statistical tests were two-tailed, and P-values < 0.05 were considered statistically significant. All statistical analyses were performed using Microsoft SQL Server 2016 Standard (Microsoft Corp., Redmond, WA, USA) and IBM SPSS Statistics for Windows, version 27.0 (IBM Corp., Armonk, NY, USA).
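For orientation, a crude odds ratio with a Wald 95% confidence interval can be computed from the 2 × 2 table of guidance by diabetes onset as below. This is only a back-of-the-envelope stand-in for the GEE actually used in the analysis, and the counts in the example are hypothetical.

```python
import math

def odds_ratio_wald(a, b, c, d):
    """Crude odds ratio with a Wald 95% CI from a 2x2 table:
    a/b = diabetes yes/no with guidance, c/d = diabetes yes/no without."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    return or_, or_ * math.exp(-1.96 * se), or_ * math.exp(1.96 * se)

# Hypothetical counts for illustration only (not the study's actual table):
or_, lo, hi = odds_ratio_wald(100, 3066, 140, 3026)
```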

3 Results

3.1 Adjustment by new methods

Figure 1 shows the estimated \(P\left({{\text{Y}}}_{tn}=1|{X}_{t1n}=1\right)-P\left({{\text{Y}}}_{tn}=1|{X}_{t1n}=0\right)\). The new method moved the estimates closer to the true value of 0 and reduced their variation.

Fig. 1
figure 1

Box plot of the estimated difference in incidence proportions of the outcome with and without exposure. EMA: exact matching algorithm using AHCDEFs; AHCDEFs: administrative health claims database equivalence factors. Trials were conducted 500 times independently (\({\text{t}}=\mathrm{1,2}, \cdots , 500\)). The simulation process followed a sufficient causal model. Box plots depict descriptive statistics of the estimated values: the mean is represented by ×, the median by the horizontal line across the box, and the first (Q1) and third (Q3) quartiles by the bottom and top of the box, respectively. The interquartile range (IQR) is the distance between Q1 and Q3. Whiskers extend to the most extreme observed data points within 1.5 × IQR above Q3 and below Q1; all observations outside the whiskers are plotted as small circles (outliers)

3.2 Comparison with conventional methods

Figure 2 shows the estimated odds ratios of outcome occurrence for the exposed group relative to the unexposed group among the three methods. Compared with the multivariate model and the PS, the estimates of the new method approach the true odds ratio of 1, with smaller scatter. Table 1 shows the probability of a type I error when the odds ratios were estimated using the three methods and the univariate model. The probability of a type I error for the new method was 6.6%, while the univariate, multivariate, and PS models all had probabilities of 97% or higher.

Fig. 2
figure 2

Box plot of the estimated odds ratios of the outcome with and without exposure, by adjustment method. Trials were conducted 500 times independently (\({\text{t}}=\mathrm{1,2}, \cdots , 500\)). The simulation process followed a sufficient causal model. Box plots depict descriptive statistics of the estimated odds ratios: the mean is represented by ×, the median by the horizontal line across the box, and the first (Q1) and third (Q3) quartiles by the bottom and top of the box, respectively. The interquartile range (IQR) is the distance between Q1 and Q3. Whiskers extend to the most extreme observed data points within 1.5 × IQR above Q3 and below Q1; all observations outside the whiskers are plotted as small circles (outliers). MRA: multivariate regression analysis; PS: propensity score; EMA: exact matching algorithm using AHCDEFs; AHCDEFs: administrative health claims database equivalence factors

Table 1 Number of type I errors of each adjustment method

3.3 Effects of specific health guidance in preventing the onset of diabetes mellitus

In total, 6332 individuals (4964 duplicate pairs) were enrolled. Of these, 240 were diagnosed with diabetes. Table 2 shows the characteristics of the study cohort. The odds ratio for type 2 diabetes among participants provided with specific health guidance, compared with those who were not, was 0.75 (95% CI 0.58–0.97).

Table 2 Background after exact-matching algorithm using AHCDEFs

3.4 Effects of specific health guidance in reducing medical expenditures

Figure 3 presents medical expenditures per person per month, with and without specific health guidance. After adjusting for background factors, medical expenditures were lower in the group that received health guidance throughout the study period. For comparison with conventional methods, the results of simple aggregation and PS matching are shown in the Supplementary Figure. These differ from the results of the new method employed in this study, and their baseline adjustment was considered inadequate.

Fig. 3
figure 3

Medical expenditures with and without specific health guidance. Vertical axis: medical expenses per person per month (USD); horizontal axis: time. USD/JPY was 150 (October 21, 2022). Among those who received the specific health check-up, those who received health guidance during the six-month period were identified and designated as the health guidance group. Those whose age, sex, BMI, abdominal circumference, medical expenses in the past year, number of days of outpatient visits in the past year, and number of days of hospitalization in the past year matched those of the health guidance group were identified and designated as the control group. The medical expenditure per person per month in both groups was observed over time

4 Discussion

In this study, we showed that our proposed novel method for real-world data returned improved estimates and fewer type I errors than conventional methods. Using this new method, we also quantitatively demonstrated the effectiveness of specific health guidance in Japan in preventing the onset of diabetes and reducing medical expenditures over five years.

In contrast, most previous studies have not shown the effectiveness of specific health guidance in Japan. Creating reliable evidence from complex longitudinal data is not easy, and many studies may have flaws in their designs (Hoffman et al. 2022; Groenwold 2021). The most important reason we were able to demonstrate the effectiveness of specific health guidance for the first time in this study is that we did not restrict the target trial to a feasible RCT. It is impossible to perfectly match the background factors in an RCT; however, it is only by matching the background factors that we can make the best use of the information in the measured factors. Specifically, even arbitrary-order interactions between background factors can be incorporated into the model and adjusted accordingly. Various types of evidence can be generated by refining the design, constantly pushing toward an ideal study within the TTE framework, in order to take advantage of the strengths of observational studies.

All studies have unmeasured confounding factors. Observational studies have reported a high rate of type I errors (Liang et al. 2014), and the same result was obtained in this study. However, with our proposed exact-matching algorithm using AHCDEFs, the type I error probability was 6.6%, which is much lower than that of conventional methods. Although this is still slightly above the acceptable range, we confirmed that the type I error can be sufficiently reduced by refining the design using our method.

This study is the first to demonstrate the effectiveness of specific health guidance in reducing the incidence of diabetes and medical expenditures in Japan. In the United States, evidence suggests that multicomponent behavioral interventions in adults with obesity can lead to clinically significant improvements in reducing the incidence of type 2 diabetes among such adults and those with elevated plasma glucose levels (US Preventive Services Task Force et al. 2018). In 2008, Japan introduced a screening program to identify individuals with obesity and metabolic syndrome (Tsushita et al. 2018). All adults aged 40–74 years were required by law to participate every year (Fukuma et al. 2020). Therefore, this study presents the impact of this national health guidance intervention with an appropriate design, in a setting where an RCT cannot be assembled.

Although RCTs are designed to answer a single question, they are expensive, time-consuming, and resource-intensive. It is not possible, in principle, to randomize a large sample population that includes the full range of comorbidities and other confounding factors, and because of resource constraints, the patients included are generally younger, with fewer comorbidities. Therefore, the results are not immediately generalizable, and this selection bias compromises representativeness. Despite these major limitations, there is currently worldwide mass production of "evidence" from sub-analyses and stratified analyses of past RCTs, even though each RCT should answer only the single question it was designed for, and this "evidence" is used in daily practice. However, observational studies, when properly designed, have multiple strengths that make them a suitable complement to RCTs for decision-making.

Despite the notable findings of this study, several limitations must be acknowledged. First, control for time-dependent confounding factors was not considered. However, the method has great potential in this respect: based on the TTE concept, A × B pairwise comparisons can be performed between the exposed group (A individuals) and the unexposed group (B individuals) who are perfectly matched on AHCDEFs, each weighted by 1/(A × B). For each comparison, the termination of observation for the comparison patient is added to the end-of-observation requirement (usually the occurrence of an outcome, the end of the study period, or withdrawal from the study), allowing control for time-dependent confounding. Second, this study did not model the administrative claims databases themselves; therefore, compatibility with high-dimensional propensity score adjustment could not be examined (Schneeweiss et al. 2009). In principle, the two methods are expected to be highly compatible, and future studies should apply this method to perfectly match the variables selected by the algorithm used in high-dimensional propensity score adjustment. Finally, we discuss the generalizability of the results. Generalizability is, broadly, compromised in order to increase comparability. However, this method is considered less susceptible to random errors than conventional methods, even when stratified analysis is performed; thus, it is possible to make comparisons based on the circumstances of individual patients. Observational studies have significant limitations in dealing with unmeasured confounding and cannot replace RCTs. Although this method is effective for research questions for which an RCT would be difficult to conduct, it goes without saying that an RCT should be conducted whenever it is feasible.

5 Conclusions

We propose a new method for analyzing real-world data: an exact-matching algorithm using AHCDEFs. The larger the number of patients available for analysis, the more AHCDEFs can be matched, thereby removing the influence of confounding factors. This method is expected to generate significant evidence when applied to real-world data. In this process, it is desirable to clarify in detail the problems that may arise when applying the method.