Background

Routinely collected health data (RCD), including electronic medical records, health administrative data, and registries, are an important resource for observational studies exploring the treatment effects of medicines [1,2,3,4]. These data contain information on drug exposures and outcomes that is essential for pharmacoepidemiology studies [1, 5]. The added support of information technologies that enable the storage of large datasets, including details on drug use, clinical management, laboratory test results, and patient outcomes, has led to a proliferation of observational pharmacoepidemiology studies in recent decades [3, 6, 7].

The increasing use of RCD has motivated the REporting of studies Conducted using Observational Routinely collected Data (RECORD) statement and its extension specific to pharmacoepidemiology (RECORD-PE) [1]. However, there are growing concerns about observational studies that use RCD to assess drug treatment effects [8,9,10,11], including concerns about the quality of the data used [12,13,14,15]. Complete and adequate reporting of data source profiles allows readers to evaluate data quality and understand a study’s strengths and limitations, whereas poor reporting limits the assessment of scientific validity and can lead to misguided translation of research findings. Concerns about the reporting of data source profiles have therefore grown in recent years.

How well data source profiles are reported in RCD studies exploring drug treatment effects remains unclear. Previous studies have shown that underreporting of data sources is common [16,17,18]. However, many of these studies were either dated or used small samples [17,18,19], and none focused on research exploring the effects of drug treatment. For example, a literature review of 25 studies that used RCD for pharmacovigilance found that only 44% reported the type of data source [17]. Another review of 124 RCD studies published in 2012 showed that 28.2% did not report the type of database [18].

Detailed data source profiles are often limited in studies exploring the effects of drug treatment because of space restrictions. A published study or website that includes a data source profile can provide important information on data resources, data linkage, coverage, and the timeframe of the database [20, 21]. Citing a reference with a database profile helps to improve research transparency and strengthen scientific validity. The RECORD and RECORD-PE statements recommend referencing studies on data linkage and database validation [1, 5]. However, how investigators cite database-specific references, and whether citing such references improves reporting and publication, remain uncertain. To date, no study has systematically examined database reporting, so a thorough investigation is warranted.

The current study was part of a larger research project investigating the reporting and methodological quality of observational studies using RCD to explore the effects of drug treatment. The project was conceptualized in early 2019 and has previously published results on the reporting of abstracts [16]. By providing empirical evidence on the current state of reporting quality, this study aims to inform recommendations on data source reporting and improve the transparency of observational studies of the effects of drug treatment using RCD.

Methods

Eligibility of studies

Observational studies that exclusively used RCD to explore the effects of drug treatment, including effectiveness, safety, or both, were included in the analysis. RCD was defined as data that were generated for administrative or clinical purposes without a priori research goals [1, 5]. Typical RCD includes electronic medical records (EMR), administrative claims data, safety surveillance databases, and pharmacy data [1, 7]. Studies were excluded if it could not be confirmed that the data were collected independently of prior research goals, or if at least one data element was actively collected for research purposes.

Literature search and selection process

Studies published between January 1 and December 31, 2018, were identified in PubMed using RCD-specific search terms. The National Library of Medicine’s search criteria regarding electronic health records were also used [22]. The detailed search strategy was described previously [16].

The sample size was calculated based on the number of factors that could be associated with study quality. Seven characteristics comprising eleven categories were considered as independent variables [23,24,25], resulting in a sample of 220 [16]. Journals were stratified into the top five general medical journals and lower-level medical journals according to the impact factor from the Institute for Scientific Information (ISI) Web of Knowledge Journal Citation Reports in 2018. The top five general medical journals, BMJ, Journal of the American Medical Association, JAMA Internal Medicine, Lancet, and New England Journal of Medicine, were those with the highest number of citations in 2018. All studies published in the top five general medical journals and a random sample of studies published in the lower-level medical journals were included. Detailed information on the sample size estimation and selection process was summarized previously [16].
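The arithmetic behind the target sample of 220 can be sketched as follows. Note that the figure of roughly 20 observations per candidate predictor category is an assumption based on a common rule of thumb for multivariable analyses; the text states only the number of categories and the resulting total.

```python
# Hedged sketch of the sample-size arithmetic described above.
# The study reports 7 characteristics comprising 11 categories and a
# target sample of 220; assuming ~20 observations per candidate category
# (a common rule of thumb, not stated explicitly in the text) reproduces
# that figure.
categories = 11
observations_per_category = 20  # assumed rule of thumb

sample_size = categories * observations_per_category
print(sample_size)  # 220
```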

Data extraction

The general study characteristics and the reporting of database characteristics in each eligible study were collected from the full text, including the title and abstract. Information on database characteristics in the titles and abstracts was not documented. Three study investigators (XS, WW, and ML) reviewed the existing literature and guidance documents for routinely collected health data (e.g., RECORD, RECORD-PE, and others released by ISPOR and AHRQ) and developed the initial data extraction forms [1, 5, 26, 27]. The whole research team then brainstormed additional items, and a multidisciplinary panel comprising one pharmacoepidemiology expert, two researchers who routinely conduct health data research, and two clinical epidemiologists determined the importance of each item and reached a consensus on which items to include or exclude.

For general study characteristics, information on the specific disease, source of funding, number of participants, involvement of a methodologist, and type of outcomes was collected. For database characteristics, information on database linkage, type of database (EMR, claims data, or RCD not specified), name of the database, geographic region of the database, data source (e.g., inpatient records, outpatient records), variables collected (e.g., demographics, diagnosis, laboratory and microbiology tests, prescription, surgery), and coverage and time span of the data source was collected. Information on database characteristics was documented based on the database descriptions in the included studies. For example, a study was considered to report data linkage if related words, such as “linked” and “linkage”, were used in the text. Studies citing references relating to the database in the Methods section were also documented. To fully capture information on database characteristics, information from relevant citations was also abstracted and reviewed.
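The keyword rule for documenting data linkage can be sketched as a simple pattern match. The study's screening was performed by human reviewers reading the text, so the regex below is only an illustration of the rule, and the exact pattern is an assumption beyond the two example words ("linked", "linkage") given in the text.

```python
import re

# Illustrative sketch of the keyword rule described above: a study is
# flagged as reporting data linkage if words such as "linked" or
# "linkage" appear in its text. The pattern (also matching "link" and
# "linking") is an assumption for illustration; the actual screening
# was done by human reviewers.
LINKAGE_PATTERN = re.compile(r"\blink(?:ed|age|ing)?\b", re.IGNORECASE)

def reports_linkage(text: str) -> bool:
    return bool(LINKAGE_PATTERN.search(text))

print(reports_linkage("Records were linked to the national death registry."))  # True
print(reports_linkage("We analysed claims data from a single insurer."))       # False
```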

Data analysis

Descriptive analysis was used to describe the reporting characteristics of data sources in the included studies. Reporting characteristics, including data linkage across databases, types of data source, database name, database coverage, geographic region, data resources, data collection information, the time span of the data source, and population coverage were collected. Categorical variables were summarized as numbers (percentages), and continuous variables were summarized as the mean (standard deviation) or median (interquartile range, IQR).

The reporting of database characteristics was compared between studies published in the top five general medical journals and those published in other medical journals. Reporting quality and journal impact factor (IF) (from the 2018 Journal Citation Reports) were also compared between studies that did and did not cite database references. The chi-squared test or Fisher’s exact test was used to compare categorical variables, and the t test or the nonparametric Wilcoxon rank-sum test was used to compare continuous variables. The two-tailed significance level was set at P < 0.05. Data analyses were performed using Stata/SE (version 14.0).
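The comparisons described above can be sketched in Python with SciPy (the study itself used Stata; this is only an illustrative translation, not the authors' code). The 2×2 table is reconstructed from the reported proportions of linkage reporting by journal tier (approximately 12/19 vs 41/203) and the continuous values are simulated, so all numbers here are for illustration only.

```python
import numpy as np
from scipy import stats

# 2x2 table: reported data linkage (yes/no) by journal tier (top five vs
# other). Counts reconstructed from the reported percentages (63.1% of 19
# and 20.2% of 203) -- approximate, for illustration only.
table = np.array([[12, 7],
                  [41, 162]])

# Chi-squared test for categorical comparisons with adequate expected counts.
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Fisher's exact test as the alternative when expected cell counts are small
# (here the smallest expected count is below 5, so Fisher would apply).
odds_ratio, p_fisher = stats.fisher_exact(table)

# Continuous comparisons: t test or Wilcoxon rank-sum (Mann-Whitney U).
# Simulated impact-factor values -- not the study's data.
rng = np.random.default_rng(0)
if_cited = rng.normal(6.1, 2.0, 137)
if_uncited = rng.normal(4.1, 2.0, 85)
t_stat, p_t = stats.ttest_ind(if_cited, if_uncited)
u_stat, p_u = stats.mannwhitneyu(if_cited, if_uncited, alternative="two-sided")

print(f"chi2 p={p_chi2:.4g}, Fisher p={p_fisher:.4g}")
```

Both the chi-squared and Fisher tests on this reconstructed table yield p < 0.001, consistent with the significance reported for this comparison in the Results.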

Results

Study characteristics

A total of 222 studies were included in the analysis (Supplementary Table 1). A flowchart of study identification, screening, and inclusion is shown in Supplementary Fig. 1. The median numbers of participants in all included studies, studies in the top five general medical journals, and studies in other medical journals were 17,961 [interquartile range (IQR), 2,495–92,366], 154,162 [IQR, 58,994–289,469], and 15,597 [IQR, 1,925–80,198], respectively. Of the included studies, 114 (51.3%) received funding from a nonprofit organization, and 41 (18.4%) received industry funding. Detailed information on the study characteristics was described previously [16].

Reporting characteristics of the data sources

Of the 222 included studies, 53 (23.9%) reported the use of data linkage, and of these, 22 (41.5%) reported the linkage methods used (Table 1). Most (202/222; 91.0%) reported the type of databases used, and the majority (211/222; 95.0%) reported coverage of the data source. The database name was reported by 195 (87.8%) studies, of which 89 (40.1%) specified the type of data source in the name. Of the included studies, 130 (58.6%) reported information on the data resource, 151 (68.0%) reported information on the data collected, and 81 (36.5%) included the timeframe of the database. Studies published in the top five general medical journals were more likely than those published in lower-level journals to indicate the use of data linkage (63.1% and 20.2%, respectively; p < 0.001) (Table 1).

Table 1 Reporting characteristics of data sources of the included studies

Of all included studies, the most common database type was claims data (55.9%). The proportions of studies that used national data sources, multiple center or regional data sources, and single center data sources were 55.9%, 33.8%, and 14.4%, respectively. The largest percentage (30.1%) of studies used data from the United States, followed by China and Taiwan (23.8%) and the United Kingdom (UK) (13.6%) (Fig. 1A–C). A larger proportion of studies published in the top five general medical journals than those published in lower-level medical journals used EMR (73.7% and 30.0%, respectively; p < 0.001). Of 19 studies published in the top five general medical journals, 57.9% used data from the UK (Fig. 1A–C). The most common databases used among studies in the top five general medical journals and lower-level journals were the United Kingdom Clinical Practice Research Datalink (CPRD) (47.3%) and the Taiwan National Health Insurance Research Database (NHIRD) (25.7%), respectively.

Fig. 1
figure 1

Characteristics of data sources of the included studies. (A) Type of database; (B) Coverage of data source; (C) Country of origin

EMR: electronic medical record

Citing database references

Of the 222 studies, 137 (61.7%) cited database-specific references. A total of 71 (32.0%) studies exclusively cited studies published in journals, 50 (22.5%) exclusively cited other reference types, such as websites and statistical files, and 16 (7.2%) cited both types of references (Table 2). Of the 87 studies citing studies published in journals, 33 (14.9% of all included studies) referenced database studies with detailed descriptions of the data source profile, 24 (10.8%) referenced validation studies, and 46 (20.7%) referenced case studies (Table 2). Database-specific references were cited by 15 of the 19 studies (78.9%) published in the top five general medical journals and 122 of the 203 studies (60.1%) published in other journals. A larger proportion of studies published in the top five general medical journals than those published in lower-level journals cited references published in journals (63.2% and 36.9%, respectively; p = 0.046) and references specific to the data source profile (36.8% and 12.8%, respectively; p = 0.012) (Table 2).

Table 2 Citing database references among included studies

Reporting quality was generally better among studies that cited database-specific references (Table 3). For example, 105 (76.6%) studies that cited database-specific references reported data resource information, while only 25 (29.4%) of those that did not cite database-specific references reported data resource information. In addition, while 79.6% of studies citing database-specific references reported data collection information, only 49.4% of studies not citing database-specific references reported this information (Table 3).

Table 3 Reporting characteristics of studies citing or not citing database references

Studies that cited database-specific references were more likely than those that did not to be published in high-impact journals (mean IF, 6.08 and 4.09, respectively; p = 0.006). The proportions of studies published in journals with an IF > 10 among studies that did and did not cite database-specific references were 15.3% and 5.9%, respectively (Fig. 2).

Fig. 2
figure 2

Journal impact factor among studies citing or not citing database references. IF: impact factor

Discussion

Main findings and interpretations

RCD have been increasingly used to explore drug treatment effects; however, growing concerns have arisen about the potential risk of bias induced by data quality. Since RCD are generated without a priori research purposes, assessing whether the data elements contained within the data source are sufficient to address the research question is essential. Transparent and detailed reporting of data source profiles may help research users assess the risk of bias and appropriately interpret the research findings. However, our study found deficiencies in the reporting of data sources, even in studies published in the top five general medical journals. For example, only 41.5% of studies that used data linkage approaches reported the linkage methods used, and almost two-thirds of studies did not include the database timeframe. Similar to our findings, a survey of 124 RCD studies published in 2012 found that only 29.3% of studies adequately reported data linkage [18]. Another study of 56 urological manuscripts published in 2014 showed that 48.2% reported the geographic region of the database, and none reported the methods used to link the data [19].

This study found that data source characteristics differed between studies published in the top five general medical journals and those published in lower-level medical journals. Studies published in the top medical journals were more likely to use EMR and national data sources, whereas those published in lower-level medical journals more often used claims data. Administrative claims data often lack important information, such as laboratory results and over-the-counter drug use [28]. The absence of these data can limit the extent to which studies of the effects of drug treatment can address prognostic imbalance.

Our study also found that 61.7% of studies cited references regarding the database profile. Studies that cited database references generally reported data sources more completely and were more likely to be published in high-impact journals than those that did not. A potential reason is that citing a database reference provides important information on the database profile, which helps to increase the credibility of evidence generated from these data [29,30,31]. To our knowledge, however, no previous study has examined this issue.

Implications

Adequate and transparent reporting is key to producing valid and reliable evidence to inform decision-making, and our study highlights potential areas for improvement. Researchers should include information on the database profile to help readers understand and assess the quality of the data sources. Since detailed descriptions of the data source profile may be restricted by word count limitations, citing a database reference can provide important information on the database profile and allow researchers and readers to critically evaluate potential bias relating to data quality. Several organizations have developed searchable repositories of database profiles, including Health Data Research UK and the DARWIN project of the European Medicines Agency [32,33,34]. The RECORD and RECORD-PE statements are important for improving the reporting quality of RCD studies. The adoption of reporting guidelines and education on their use is urgently needed to promote the transparency of studies using RCD to explore the effects of drug treatment.

Strengths and limitations

This study has several strengths. First, it included a representative sample of studies using RCD to explore the effects of drug treatment, whereas previous studies had small sample sizes and were restricted to specific topics and journals. Second, rigorous methods were used to identify eligible studies, and standardized forms were developed to improve the accuracy of data extraction.

Some caveats, however, should be considered. First, this study only included studies published in 2018, so the findings may not be generalizable to other years. However, the practice of using RCD to assess the effects of drug treatment is unlikely to have changed substantially over a relatively short period. To further confirm this, the reporting quality of a sample of studies published in 2021 was investigated: PubMed was searched for RCD studies published in 2021, the reports were placed in chronological order of publication, and the first 20 reports that met the eligibility criteria were selected. Their reporting quality was comparable to that of the studies published in 2018: all reported the type of database, 90.0% reported the data source, and only 25% reported the use of data linkage, of which 50% reported the methods used (Supplementary Table 2). Second, studies that were labeled as a registry without specifying whether the data were collected for administrative or research purposes were excluded. The definition of a registry and the approach used to collect data for registries vary substantially [35, 36]; in this study, registries were defined as those in which the data were collected for administrative purposes. Third, this study only included 19 articles published in the top five general medical journals, which may not be representative of all studies published in these journals. Fourth, the reporting of data sharing or exchange was not investigated in this study.

Conclusions

This study found deficiencies in the reporting of data sources, such as linkage methods and timeframe. The reporting quality and characteristics of the data sources differed between studies published in the top five general medical journals and those published in other journals. Studies citing database-specific references may provide more detailed information on data source characteristics and are more likely to be published in high-impact journals. The adoption of reporting guidelines and education on their use is urgently needed to promote transparency.