Key Points for Decision Makers

The Observational Medical Outcomes Partnership (OMOP) common data model standardises the structure and coding systems of otherwise disparate datasets, enabling the application of standardised and validated analytical code across a federated data network without the need to share patient data.

Common data models have the potential to overcome some of the key operational, methodological, and technical challenges of using observational data in health technology assessment (HTA), particularly by enhancing the interoperability of data and the transparency of analyses.

To ensure the usefulness of the OMOP common data model to HTA, it is imperative that the HTA community engages with this work to develop tools and processes to support reliable, timely, and transparent evidence generation in HTA.

1 Introduction

There is growing interest in the use of observational data (or ‘real-world data’) to assess the safety, effectiveness, and cost effectiveness of medical technologies [1, 2]. However, several barriers limit its more widespread use, including challenges in identifying and accessing relevant data, in ensuring the quality and representativeness of data, and in dealing with differences between datasets in terms of their structure, content, and the coding systems used [1, 3, 4].

Common data models and distributed data networks offer a possible solution to these problems [4,5,6,7,8,9]. A common data model standardises the structure, and sometimes also the coding systems, of otherwise disparate datasets, enabling the application of standardised and validated analytical code across datasets. Datasets conforming to a common data model can be accessed through federated data networks, in which all data reside locally in the secure environment of the data custodian(s). Analytical code is then brought to the data and executed locally, with only aggregated results returned. This puts the data custodian in full control and avoids the need to share patient-level data, thereby at least partially addressing data privacy and governance concerns. In so doing, it may also increase the availability of data for healthcare research. Common data models can enhance the transparency and reliability of medical research and support the efficient and timely generation of evidence for decision making.

Several common data models are in widespread use [9], including the US FDA Sentinel, which is used predominantly for post-marketing drug safety surveillance but increasingly also for effectiveness research [4, 10], and the open-source Observational Medical Outcomes Partnership (OMOP) common data model, which has been used to study treatment pathways, comparative effectiveness, safety, and patient-level prediction [8, 11,12,13,14,15]. The European Medicines Agency (EMA) will use the OMOP common data model to conduct multicentre cohort studies on the use of medicines in patients with coronavirus disease 2019 (COVID-19) [16]. The EMA is also looking to establish a data network for the proactive monitoring of benefit-risk profiles of new medicines over their life cycles, which could use a common data model approach [6, 17, 18]. To date, relatively little attention has been given to the usefulness of these models and data networks for supporting health technology assessment (HTA).

Here, we discuss the potential value of the OMOP common data model for use in HTA, for both evidence generation and healthcare decision making, and identify priority areas for further development to ensure its potential is realised.

2 The Use of Observational Data in Health Technology Assessment (HTA)

HTA is used to inform clinical practice and the reimbursement, coverage, and/or pricing of medical technologies, including drugs. While the exact methods and uses of HTA differ between healthcare systems, substantial commonalities exist [19,20,21]. Most HTA bodies require a relative effectiveness assessment of one or more technologies compared with standard of care and prefer data on final clinical endpoints (such as survival) and patient-reported outcomes (such as health-related quality of life) [21, 22]. Often estimates of relative effectiveness need to be provided over the long term (e.g. patients’ lifetime), and this may necessitate economic modelling. Some HTA bodies also require evidence on (long-term) cost effectiveness, which requires an assessment of the additional cost of achieving additional benefits, and budget impact analysis, i.e. the gross or net budgetary impact of implementing a technology in a health system. Most HTA bodies and payers prefer data pertaining directly to their jurisdiction.

The potential uses of observational data in HTA are numerous. Its use is widely accepted for assessing safety, particularly for rare outcomes and over longer time periods; for describing patient characteristics and treatment patterns in clinical practice; and for estimating epidemiological parameters, including disease incidence, event rates, overall survival, healthcare utilisation and costs, and health-related quality of life [23, 24]. It could also be used to validate modelling decisions, e.g. the extrapolation of overall survival or from surrogate to final clinical endpoints, but its use here has so far been limited [25]. To inform local reimbursement and pricing decisions, it is important that such data reflect the local populations and healthcare settings.

The role of non-randomised data in establishing comparative effectiveness is more controversial [23, 26, 27]. In principle, it could be used to support decisions in the absence of reliable or sufficient randomised controlled trial (RCT) data [28,29,30,31,32], or to supplement RCTs with evidence from routine clinical practice on long-term or immature outcomes, to validate findings, or to translate results to different populations and settings [23, 25, 33,34,35,36]. Increasingly, such data are used as part of managed entry arrangements, including commissioning through evaluation and outcomes-based contracting [37, 38]. However, despite growing calls for increased use of observational data in decision making, its role remains limited [23]. We follow the OPTIMAL framework in categorising barriers to the wider use of such data into operational, technical, and methodological challenges [39], supplemented by additional considerations where necessary [1, 3].

Operational challenges to the use of observational data include issues of feasibility, governance, and sustainability, which complicate access to, and the use of, data. A limited number of European datasets are of sufficient quality for use in decision making, particularly in Eastern and Southern Europe [40]. Furthermore, it can be difficult to identify datasets containing relevant information or to understand the quality of the data with respect to a planned application [41]. Even when relevant, high-quality data are identified, they may not be accessible because of governance restrictions on data sharing, lack of patient consent, or prohibitively high access costs. Beyond the direct costs of data acquisition, substantial investments may be needed in staff and infrastructure to manage, analyse, and interpret such data [42]. These challenges limit the opportunity to generate robust, relevant, and timely information to support local decision making. Finally, transparency is often lacking in the conduct of studies using observational data, which limits the acceptability of results for decision making [43, 44].

Technical constraints relate to the contents and quality of data and impair the ability to generate robust and valid results. Most observational datasets are not designed for research purposes but rather to support clinical care or healthcare administration. The quality of observational data varies, including in the extent of missing data, measurement reliability, coding accuracy, misclassification of exposures and outcomes, or insufficient numbers of patients [1, 45]. Certain types of data are routinely missing from observational databases, including drugs dispensed in secondary care or over the counter [40] and patient-reported outcomes [46]. A further complication is data fragmentation, where information about a patient’s care pathway is stored across disparate datasets. Data linkage is essential to adequately address many research questions, but operational constraints due to varying governance processes may limit the ability to link datasets in a timely fashion.

A final major technical constraint is the substantial variation between datasets in terms of their structure, contents, and the coding systems used to represent clinical and health system data. Datasets can differ in several ways, including in their structure (e.g. single data frame vs. relational database design), contents (i.e. what data are included), and in the representation of data (i.e. how data are coded). For instance, numerous competing coding and classification systems are used to represent clinical diagnoses (e.g. Systematized Nomenclature of Medicine [SNOMED], Medical Dictionary for Regulatory Activities [MedDRA], International Classification of Diseases [ICD], Read), pharmaceuticals (e.g. British National Formulary, RxNorm, Anatomical Therapeutic Chemical [ATC] classification), procedures (e.g. OPCS, ICD-10-PCS), and other types of clinical and health system data. Conventions for any given vocabulary are also subject to change over time. These differences impose a burden on analysts, who are required to understand the idiosyncrasies of each dataset and coding system, and their developments over time, which limits the opportunity to validate analyses in different datasets or to translate results to different populations or settings. It also complicates the interpretation of the results for those who use the evidence, including regulators, HTA bodies, payers, patients, and clinicians.

Finally, methodological challenges arise from the inherent limitations of observational databases, which are not designed to support causal inference [47]. Biases may arise because of poor-quality data or patient selection, whereby the associations observed among those in a database do not apply to the wider population of interest, and because of confounding, whereby patients are allocated differently to exposures based on unobserved or poorly measured characteristics [47, 48]. Detailed consideration must be given to these potential biases in study design and, where appropriate, advanced methodologies must be used to address them.

3 The OMOP Common Data Model

3.1 What is a Common Data Model?

The main purpose of common data models is to address problems caused by poor data interoperability. They do this by imposing some level of standardisation on otherwise disparate data sources. Several open-source common data models are in use, and they differ in a number of important respects: the extent of the standardisation, for instance whether they standardise only the structure (FDA Sentinel) or also the semantic representation of data (OMOP); the coverage of the standardisation, whether restricted to selected types of clinical data (FDA Sentinel) or intended to be comprehensive, covering all clinical and health system data (OMOP); and their applications [9, 11,12,13,14]. These differences may impact the timeliness with which high-quality multidatabase studies can be conducted, the transparency of analyses, and the adaptability of the analysis to specific research questions [4].

An alternative approach to multidatabase studies is to allow local data extraction and analysis following a common protocol [49, 50]. While this has been shown to produce reliable results in some applications [50], differences between datasets may arise because of differences in data curation, implementation of the analysis, and coding errors. Alternatively, aggregated or patient-level datasets can be pooled following a study-specific common data model, e.g. as developed in the European Union Adverse Drug Reaction project [5]. This allows the sharing of common analytics, reducing between-dataset variation in study conduct, but may restrict the analytical choices (e.g. large-scale propensity score matching). In some cases, data pooling will be prohibited by data governance and privacy concerns.

3.2 The OHDSI Community

The OMOP common data model has its origins in the Observational Medical Outcomes Partnership, a programme of work designed to develop methods to inform the FDA’s active safety surveillance activities [8, 51, 52]. Since 2014, the OMOP common data model has been maintained by the open-science Observational Health Data Sciences and Informatics (OHDSI, pronounced ‘Odyssey’) community (https://www.ohdsi.org). OHDSI also develops open-source software to support high-quality research, engages in methodological work to establish best practices, and performs many multidatabase studies across its network. The common data model and open-source tools are shared on OHDSI’s GitHub account (https://github.com/OHDSI/), and discussions take place on a dedicated open forum (https://forums.ohdsi.org/). For more information about OHDSI and its analytical pipelines, see The Book of OHDSI [8].

3.3 OMOP Common Data Model

The OMOP common data model has a ‘person-centric relational database’ design similar to many electronic healthcare record systems. This means that clinical data (e.g. signs, symptoms, and diagnoses; drugs; procedures; devices; measurements; and health surveys) and health system data (e.g. healthcare provider, care site, and costs) are organised into pre-defined tables, which are linked, either directly or indirectly, to patients (Fig. 1) [8]. Each table stores ‘events’ (i.e. clinical or health system data) with defined content, format, and representation. The OMOP common data model contains two standardised health economic tables. The first contains information about a patient’s health insurance arrangements, and the second contains data on costs, charges, or expenditures related to specific episodes of care (e.g. inpatient stay, ambulatory visit, drug prescription). This structure reflects the origins of the common data model in the USA, with its largely insurance-based system of healthcare. The OMOP common data model is designed to be as comprehensive as possible to allow a wide variety of research questions to be addressed.

Fig. 1

Overview of the OMOP common data model version 6.0 [8]. The tables relating to standardised vocabularies provide comprehensive information on mappings between source and standard concepts and hierarchies for standard concepts (e.g. concept_ancestor). CDM common data model, NLP natural language processing
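
To illustrate how this person-centric structure is used in practice, the following is a minimal sketch of a query against a mapped dataset. It assumes an open DBI connection `con` to a database holding the CDM in a schema named "cdm"; the driver, schema name, and credentials are placeholders that will differ locally.

library(DBI)

# Count persons with at least one recorded condition, by gender concept,
# joining the person and condition_occurrence tables of the CDM.
sql <- "
  SELECT p.gender_concept_id,
         COUNT(DISTINCT p.person_id) AS n_persons
  FROM cdm.person p
  JOIN cdm.condition_occurrence co
    ON co.person_id = p.person_id
  GROUP BY p.gender_concept_id;
"
dbGetQuery(con, sql)

Because every mapped dataset exposes the same tables and columns, the same query can be executed unchanged against any other OMOP-mapped source, subject only to SQL dialect differences, which the OHDSI tooling handles (see Sect. 3.4).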

Standard vocabularies are used to normalise the meaning of data within the common data model and are defined separately for different types of data (i.e. data residing in different tables). The SNOMED system is used to represent clinical data, RxNorm to represent drugs, and Logical Observation Identifiers Names and Codes (LOINC) to represent clinical measurements. RxNorm has been extended within OMOP to include all drugs authorised in Europe using the Article 57 database. Other types of data, including procedures, devices, and health surveys, have more than one standard vocabulary because of the absence of a comprehensive standard. These standard vocabularies are hierarchical, allowing users to select a single concept and any of its descendants when defining a cohort or outcome set. Some standard vocabularies can also be linked to hierarchical classification systems such as MedDRA for clinical conditions and ATC for drugs. The codes used in the original data are also retained within the common data model and can be used by analysts. Figure 2 provides a visual illustration of the vocabularies for the condition domain.

Fig. 2

A visual representation of vocabularies and their relationships in the condition domain of the OMOP common data model [8]. ICD International Classification of Diseases, ICD-9 ICD, Ninth Revision, ICD-10-CM ICD, Tenth Revision, Clinical Modification, MedDRA Medical Dictionary for Regulatory Activities, MeSH Medical Subject Headings, SNOMED-CT Systematized Nomenclature of Medicine Clinical Terms
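
As an illustration of how the vocabulary tables are used, the sketch below retrieves a standard concept and all of its descendants from the concept_ancestor table, as one would when building a code list for a cohort definition. The concept identifier shown (255573, the SNOMED concept for chronic obstructive lung disease) is illustrative and should be verified against the current vocabulary release; an open connection `con` is assumed as before.

# Retrieve a standard concept and all of its descendants in the hierarchy,
# as used when defining a cohort or outcome set.
sql <- "
  SELECT c.concept_id, c.concept_name, c.vocabulary_id
  FROM cdm.concept_ancestor ca
  JOIN cdm.concept c
    ON c.concept_id = ca.descendant_concept_id
  WHERE ca.ancestor_concept_id = 255573;  -- illustrative: chronic obstructive lung disease
"
descendants <- dbGetQuery(con, sql)

Source codes are linked to these standard concepts through the concept_relationship table (relationship ‘Maps to’), while the original codes remain available in the *_source_value and *_source_concept_id fields of the clinical tables.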

Mapping to OMOP is performed by a multidisciplinary team involving mapping and vocabulary experts, local data experts, and clinicians, who together use open-source tools to construct an ‘extract, transform, and load’ (ETL) procedure. The mapping of a given dataset to the common data model is intended to be separated from any particular analysis. Bespoke mapping tables may need to be created or updated to represent data in local vocabularies [53,54,55]. It is essential that the ETL is maintained over time, for instance, to respond to changes in the source data, coding errors in the ETL, or the release of new OMOP vocabularies. This requires highly developed and robust quality assurance processes [56]. Finally, it should be noted that OMOP has been used predominantly for claims databases and electronic health records. Further work is ongoing to better support the representation of other data types (including patient registries), specific diseases (including oncology), and health data (including genetic and biomarker data).
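
One simple indicator of mapping completeness that an ETL team might monitor is the share of records whose source code could not be mapped to a standard concept; such records are assigned concept identifier 0, with the original code retained in the corresponding source value field. A sketch of such a check, again assuming a connection `con` and a "cdm" schema:

# List the most frequent unmapped source codes in the condition domain
# (unmapped records carry condition_concept_id = 0).
sql <- "
  SELECT condition_source_value,
         COUNT(*) AS n_records
  FROM cdm.condition_occurrence
  WHERE condition_concept_id = 0
  GROUP BY condition_source_value
  ORDER BY n_records DESC;
"
unmapped <- dbGetQuery(con, sql)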

3.4 Standardised Analytical Tools

The OHDSI collaborative has developed a number of open-source applications and tools that support the mapping of datasets to the OMOP common data model, data quality assessment, data analysis, and the conduct of multidatabase studies across a federated data network [8].

The data quality dashboard is designed to enable evaluation of the data quality of any given observational dataset. It does this by running a series of prespecified data quality checks against the OMOP common data model, following the framework by Kahn et al. [57]. In this framework, data quality is defined in relation to its conformance (including value, relational, and computational conformance), completeness, and plausibility (including uniqueness, atemporal, and temporal plausibility). These are assessed by verifying against organisational data or validating against an accepted gold standard.
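
The dashboard packages a large, prespecified set of such checks. As a hand-rolled illustration in the spirit of the Kahn framework (not the dashboard’s own implementation), a temporal plausibility check might look like the following sketch, assuming the same connection and schema as above.

# Temporal plausibility: condition records dated before the person's year of birth.
sql <- "
  SELECT COUNT(*) AS n_implausible
  FROM cdm.condition_occurrence co
  JOIN cdm.person p
    ON p.person_id = co.person_id
  WHERE EXTRACT(YEAR FROM co.condition_start_date) < p.year_of_birth;
"
dbGetQuery(con, sql)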

Data exploration and analyses can be conducted using the ATLAS user interface (https://atlas.ohdsi.org/) and/or open-source software such as R. Standardised tools have been created to support the characterisation of cohorts in terms of baseline characteristics and treatment and disease pathways; patient-level prediction, e.g. for estimating the risk of an adverse event or patient stratification; and population-level estimation, e.g. for safety surveillance and comparative-effectiveness estimation. Where standardised tools are used, interactive dashboards are available to display key results and outputs of various diagnostic checks. While the tools allow for considerable flexibility in user specifications, they also impose some constraints on the analyst in line with community-defined standards of best practice. Analyses can also be performed without utilising these tools by writing bespoke analysis code and using existing R packages.
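
Cohorts defined in ATLAS or through the OHDSI R packages are materialised in a cohort table with a standard four-column structure (cohort_definition_id, subject_id, cohort_start_date, cohort_end_date), on top of which simple characterisations can be built directly. In the sketch below, the results schema name and the cohort definition identifier are assumptions.

# Age at cohort entry and sex distribution for one cohort.
sql <- "
  SELECT p.gender_concept_id,
         AVG(EXTRACT(YEAR FROM c.cohort_start_date) - p.year_of_birth) AS mean_age,
         COUNT(*) AS n_subjects
  FROM results.cohort c
  JOIN cdm.person p
    ON p.person_id = c.subject_id
  WHERE c.cohort_definition_id = 1   -- assumed cohort identifier
  GROUP BY p.gender_concept_id;
"
dbGetQuery(con, sql)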

The common data model and analytical tools are not fixed but are developed in collaboration with the OHDSI community and according to the priorities of its members. New developments and tools are made available to all users.

3.5 Data Networks

A common data model is most useful when it is part of a large data network. In federated (or distributed) data networks, data ownership is retained by the data custodian (or licensed data holders), and analysis code can be run against the data in the local environment, subject to standard data access approvals, with only aggregated results returned to the analysts [9]. This recognises the governance and infrastructural constraints that limit the ability of external institutions to access individual patient data. This stands in contrast to pooled data networks, where individual patient-level data are collated centrally and made available for analysis.
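
The basic pattern of a federated analysis can be sketched as follows: the same script is executed locally against each site’s CDM, and only aggregate tables, never patient-level rows, are returned to the coordinating analyst. The site names, connection details, and use of a single shared SQL string are placeholders; in practice, distributed study packages also handle dialect translation and diagnostics.

library(DatabaseConnector)

# Placeholder connection details for two sites holding OMOP-mapped data.
sites <- list(
  site_a = createConnectionDetails(dbms = "postgresql", server = "site-a/cdm",
                                   user = "analyst", password = "***"),
  site_b = createConnectionDetails(dbms = "sql server", server = "site-b",
                                   user = "analyst", password = "***")
)

# The same aggregate query is run locally at each site.
sql <- "SELECT gender_concept_id, COUNT(*) AS n
        FROM cdm.person GROUP BY gender_concept_id;"

results <- lapply(sites, function(connectionDetails) {
  conn <- connect(connectionDetails)
  on.exit(disconnect(conn))
  querySql(conn, sql)   # only this aggregated table leaves the site
})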

The OMOP common data model is used (and maintained) by the OHDSI network [51]. As of 2019, the OHDSI network had mapped over 100 datasets to the OMOP common data model, encompassing more than 1 billion patients [8]. There is growing interest in the use of the OMOP common data model, particularly for regulatory purposes [6, 17]. In response to this, the Innovative Medicines Initiative has funded the European Health Data and Evidence Network (EHDEN, https://www.ehden.eu) public–private partnership, which aims to establish a federated network of healthcare datasets across Europe conforming to the OMOP common data model [58]. The EMA has formed a partnership, including EHDEN consortium members, to use OMOP to conduct multicentre cohort studies on the use of medicines in patients with COVID-19 [16]. The OHDSI network and tools are built in alignment with ‘findability, accessibility, interoperability, and reusability’ (FAIR) principles, designed to support good scientific data management and stewardship [59].

4 What is the Role for the OMOP Common Data Model in HTA?

We discuss the role of the OMOP common data model and its associated data networks in overcoming the challenges identified in the OPTIMAL framework and for evidence generation in HTA.

4.1 Operational Constraints

A large and diverse data network facilitates the identification of relevant data sources and allows for an understanding of their quality [60]. To support the timely generation of high-quality and relevant evidence, a well-maintained, high-quality register with detailed and substantial meta-data about each dataset is essential [41].

A key benefit of a federated data network is that it obviates the need for patient-level data to be shared across organisations, which may not be possible because of data governance constraints, cost, or limited infrastructural or technical capacity. This should work to increase the data available for analysis, ensure that the most appropriate dataset(s) are used, enable multidatabase studies and the translation of evidence across jurisdictions to meet the needs of local decision makers, and increase the efficiency and timeliness of evidence generation.

The open-science nature of the OHDSI community means there is an emphasis on transparency in all aspects of study conduct. A comprehensive and computer-readable record of the ETL process used to map source data to the OMOP common data model and of data preparation and analysis for all applications is available. Tools are available to help understand data quality and check the validity of methodological choices (e.g. covariate balance after propensity score matching). The use of standardised analytics reduces the risk of coding errors and imposes community-agreed standards of methodological best practice and reporting. Transparency in study conduct and reporting increases the confidence in the results by HTA bodies and independent reviewers [61]. However, this alone is not sufficient to ensure transparency: it should be combined with other approaches, including pre-registration of study protocols and the use of standard reporting tools [44].

4.2 Technical Constraints

Perhaps the main benefit of common data models is in overcoming problems caused by limited data interoperability due to the diversity in data structures, formats, and terminologies. The common data model allows analysts to develop code on a single mapped dataset, or even synthetic dataset, and then execute that code on other data. The involvement of the data custodian in the study design and execution is still necessary, but the standardisation reduces the extent to which analysts need to be familiar with the idiosyncrasies of many different datasets. It enables a community to collaboratively develop and validate analytical pipelines.
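
As a sketch of this workflow, the OHDSI Eunomia R package provides a small synthetic dataset in CDM format against which code can be developed, and the SqlRender package translates parameterised queries into the SQL dialect of each target database. Function and parameter names reflect the packages at the time of writing and may evolve.

library(DatabaseConnector)
library(SqlRender)
library(Eunomia)

# Develop and test against the synthetic CDM shipped with Eunomia (SQLite).
connectionDetails <- getEunomiaConnectionDetails()
conn <- connect(connectionDetails)

sql      <- "SELECT COUNT(*) AS n_persons FROM @cdm_schema.person;"
rendered <- render(sql, cdm_schema = "main")               # fill in parameters
querySql(conn, translate(rendered, targetDialect = "sqlite"))
disconnect(conn)

# The same parameterised query can later be translated for another site's platform:
translate(render(sql, cdm_schema = "cdm"), targetDialect = "postgresql")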

The usefulness of any mapped dataset largely depends on the quality and contents of the source data from which it was derived. The mapping process itself cannot overcome problems due to missing data items or observations, data fragmentation, misclassification of exposures or outcomes, or selection bias. The OHDSI collaborative does, however, have tools that help characterise such problems, which can guide decisions about database selection and support critical appraisal of evidence.

There is also potential for information loss during the process of mapping from the source data to the common data model and standardised vocabularies [62]. Several validation studies have been published describing both successes and challenges in mapping data to OMOP [18, 53,54,55, 62,63,64,65]. For common data elements such as drugs and conditions, mapping can usually be performed with high fidelity. Most challenges were related to the absence of mapping tables for local vocabularies or of relevant standard concepts, which are more common for other types of healthcare and health system data [53,54,55]. This can lead to a loss of information in some instances [62, 65]. The extent and implications of any information loss will depend on numerous factors, including the quality of the source data, the source vocabulary, and the clinical application of interest. It is important to understand the likely impact of any information loss in each analysis. However, source data concepts are retained within the common data model and can be used in analyses as required. Where needed, vocabularies can be extended by the OHDSI community. Of course, a preferred long-term solution is for high-quality data to be collected at source using global standards.

4.3 Methodological Constraints

The role of the common data model in supporting multidatabase studies across a large data network has numerous benefits. It supports the translation of evidence across populations, time, and settings to meet the needs of local HTA decision making, improving the efficiency and relevance of evidence generation for market access across Europe [66]. The opportunity it affords to enhance statistical power is likely to be particularly valuable in rare diseases, where data may otherwise be insufficient to understand patient characteristics, health outcomes, treatment pathways, or comparative effectiveness [67]. It may also enable the extension of immature evidence on clinical outcomes from RCTs, allow exploration of heterogeneity, and support validation. The ability to produce reliable evidence at speed across a data network has been demonstrated in several applications, including in understanding the safety profile of hydroxychloroquine in the early stages of the COVID-19 pandemic [8, 11, 14, 15].

However, the risks of bias due to poor-quality data, patient selection, and residual confounding cannot be eradicated by the common data model, although best-practice tools for causal estimation are available. The common data model can help characterise some of these problems, and the ability to replicate results in different datasets may increase confidence in using the results in decision making [30, 68].

The mapping of source data to the OMOP common data model is performed independently of any analysis. While this allows faster development of analytical code, an understanding of the source data, and its strengths and limitations, overall and in relation to specific applications, remains vital. Others have argued that this separation adds a layer of complexity and may impede transparency in the absence of detailed reporting [4].

4.4 Types of Evidence and Analytical Challenges

The OMOP common data model and the standardised analytical tools have been built largely with regulatory uses in mind, particularly drug utilisation and comparative safety studies. The models for population-level estimation are currently limited to logistic, Poisson, and Cox proportional hazards models, together with propensity score matching and stratification. This covers only a limited range of potential applications in HTA. For example, in HTA, models for continuous outcomes (e.g. generalised linear models for healthcare utilisation, costs, or quality of life) and parametric survival models (e.g. for the extrapolation of survival) are widely used. Furthermore, the focus of tool development has been on big data analytics rather than smaller curated datasets. Analysts can of course always develop bespoke code to run against the common data model and utilise existing R packages for analyses, as sketched below.
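
As an example of the kind of bespoke analysis HTA would require, the sketch below extracts time-to-event data for a cohort (follow-up from cohort entry to death or end of observation) and fits a parametric model for extrapolation using the flexsurv package. The cohort identifier, schema names, and use of the death table (as in CDM v5.x) are illustrative assumptions; DATEDIFF is OHDSI SQL that SqlRender would translate per dialect, and an open connection `con` is assumed as in the earlier sketches.

library(DBI)
library(survival)
library(flexsurv)

# Time from cohort entry to death or end of observation (a single observation
# period per person is assumed for simplicity).
sql <- "
  SELECT c.subject_id,
         DATEDIFF(DAY, c.cohort_start_date,
                  COALESCE(d.death_date, op.observation_period_end_date)) AS time_days,
         CASE WHEN d.death_date IS NOT NULL THEN 1 ELSE 0 END AS event
  FROM results.cohort c
  JOIN cdm.observation_period op ON op.person_id = c.subject_id
  LEFT JOIN cdm.death d ON d.person_id = c.subject_id
  WHERE c.cohort_definition_id = 1;
"
surv_data <- dbGetQuery(con, sql)

# Weibull model as one candidate for long-term extrapolation of overall survival.
fit <- flexsurvreg(Surv(time_days, event) ~ 1, data = surv_data, dist = "weibull")
summary(fit, t = c(365, 1826), type = "survival")   # survival at ~1 and ~5 years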

The OMOP common data model includes two standardised health economic tables (see Sect. 3). The structure and contents reflect US claims data and are most useful in this setting. In the European setting, many healthcare datasets will not contain information on costs directly, but rather costs must be constructed from measures of healthcare utilisation. Unit costs can be attached to measures of utilisation extracted from the common data model, or, where costs depend on multiple parameters, this can be done prior to mapping with appropriate involvement of health economic experts. Some vocabularies will need to be extended to support HTA applications, for instance, visit concepts should reflect the differences in the delivery of healthcare in different settings, and mapping must appropriately reflect the uses of these data in the HTA context. Work is ongoing to further develop the common data model and vocabularies to better represent oncology treatments and outcomes, genetic and biomarker data, and patient-reported outcomes, all of which are important to HTA.
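
A sketch of this approach: visit counts are extracted from the CDM and multiplied by locally relevant unit costs held outside the database. The visit concept identifiers (9201 inpatient, 9202 outpatient, 9203 emergency room) are standard OMOP visit concepts, but the unit costs shown are placeholders, and a connection `con` is assumed as before.

# Count visits per person and visit type, then attach unit costs in R.
sql <- "
  SELECT person_id, visit_concept_id, COUNT(*) AS n_visits
  FROM cdm.visit_occurrence
  GROUP BY person_id, visit_concept_id;
"
visits <- dbGetQuery(con, sql)

unit_costs <- data.frame(
  visit_concept_id = c(9201, 9202, 9203),   # inpatient, outpatient, emergency room
  unit_cost        = c(850, 120, 310)       # placeholder unit costs
)

costs <- merge(visits, unit_costs, by = "visit_concept_id")
costs$total_cost <- costs$n_visits * costs$unit_cost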

Numerous studies have shown the value of the OMOP common data model in undertaking drug utilisation and characterisation studies [13] and in estimating comparative safety and even effectiveness [11, 14, 15], all of which are important components of HTA. We undertook a case study to further understand some of the additional challenges that arise in the context of HTA. Our objective was to estimate annual measures of primary care visits among patients with chronic obstructive pulmonary disease by disease severity (defined with spirometry measurements) using data from the UK (Clinical Practice Research Datalink) and the Netherlands (Integrated Primary Care Information) and a single script. While this analysis is possible using the common data model framework, we faced several challenges in its implementation. These included inappropriate mapping of source visit concepts to standard concepts, differences in the mapping of measurements and observations in the two databases, and the absence of standard analytical tools directly applicable to this use case. All these challenges can be overcome by improved ETL processes and tool development. See Box A for a fuller description.
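
To indicate the shape of such an analysis, the sketch below computes the central quantity, primary care visits per person-year in a COPD cohort, leaving aside the spirometry-based severity stratification and the mapping issues described in Box A. The concept identifiers (255573 for chronic obstructive lung disease, 9202 for outpatient visits), schema names, and connection `con` are illustrative assumptions, and DATEDIFF is OHDSI SQL that would be translated per dialect.

# Identify the COPD cohort, count primary care (outpatient) visits per person,
# and sum observed person-time, then combine in R.
copd   <- dbGetQuery(con, "SELECT DISTINCT person_id FROM cdm.condition_occurrence
                           WHERE condition_concept_id = 255573;")
visits <- dbGetQuery(con, "SELECT person_id, COUNT(*) AS n_visits
                           FROM cdm.visit_occurrence
                           WHERE visit_concept_id = 9202
                           GROUP BY person_id;")
time   <- dbGetQuery(con, "SELECT person_id,
                                  SUM(DATEDIFF(DAY, observation_period_start_date,
                                               observation_period_end_date)) AS days_observed
                           FROM cdm.observation_period
                           GROUP BY person_id;")

d <- merge(copd, time, by = "person_id")
d <- merge(d, visits, by = "person_id", all.x = TRUE)
d$n_visits[is.na(d$n_visits)] <- 0
sum(d$n_visits) / (sum(d$days_observed) / 365.25)   # visits per person-year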

5 Conclusion

The OMOP common data model and its federated data networks have the potential to improve the efficiency, relevance, robustness, and timeliness of evidence generation for HTA. It supports the identification of, and access to, data; the conduct of multidatabase studies; and the translation of evidence across populations and settings according to local HTA needs. The use of open-source standardised analytics and the possibility for greater model validation and replication should improve confidence in the results and their acceptability for decision making.

To realise this potential, it is essential that the future development of the common data model, vocabularies, and tools support the needs of HTA. We therefore call for the HTA community to engage with OHDSI and EHDEN to undertake use cases to identify development needs, drive priorities, and collaborate to build new tools, for example to support extrapolation and modelling of healthcare utilisation, costs, and quality of life. Finally, we urge those mapping source data to the common data model to collaborate with HTA experts to ensure that the mapping, particularly of healthcare utilisation and cost data, reflects the needs of the HTA community.