Here we describe the Dementias Platform UK (DPUK) Data Portal (https://portal.dementiasplatform.uk/) [1]. The Data Portal is a collaboration between DPUK and a growing number of cohort research teams who wish to make their data globally accessible. DPUK was established by the Medical Research Council (MRC) to accelerate the development of new treatments for dementia. The Data Portal is a component infrastructure of DPUK, designed to exploit the opportunities provided by cohort data to inform experimental medicine and to improve access to cohort data more widely. It is a data repository facilitating access to data for 3 370 929 individuals in 42 cohorts. The Data Portal provides a secure, fully auditable, remote access environment for the analysis of cohort data. The Data Portal supports the FAIR principles (Findability, Accessibility, Interoperability and Reusability) to improve the infrastructure supporting the re-use of data [2].
Arguments for multi-cohort-focused data repositories include: (1) as research questions focus on smaller effect sizes, access is required to data at-scale to achieve statistical purchase, (2) as emerging research questions become more complex, access to diverse multi-modal data is needed for rigorous hypothesis testing, (3) as scientific rigour increases there is growing recognition of the value of triangulation and replication using independent datasets, (4) as cohort datasets increase in size the transfer of large datasets is decreasingly feasible, (5) as cohort datasets become more complex the mastering of bespoke data models for survey, omics (genomics, proteomics and metabolomics), imaging and device data becomes burdensome, (6) as cohort datasets become more sensitive the non-auditable use of data is decreasingly acceptable. Whilst these issues can be addressed individually, the Data Portal provides an integrated solution.
The Data Portal is an end-to-end data management solution designed to support cohort data sharing. All projects utilising the data are by default collaborations with the cohort research teams generating the data. Although it has the three core utilities of data discovery, access, and analysis, to achieve these it operates across seven layers (Fig. 1).
Layer 1: Data ingestion
The data journey begins with upload to the Data Portal. Datasets and data dictionaries are received from cohorts on an ‘as-is’ basis along with other supporting documentation. The Data Portal operates within the UK Secure eResearch Platform (UKSeRP) environment according to ISO 27001 [3] and operates exclusively as a data processor according to the UK Data Protection Act 2018 [4] and EU General Data Protection Regulation 2016 [5]. DPUK facilitates the legal engagements necessary for data transfer into the Data Portal on behalf of data controllers, by ensuring robust contractual arrangements are in place as an overarching mechanism for data governance and use [6]. In practice the data controller is considered to be the principal investigator for the dataset. A dataset is removed from the Data Portal upon receipt of a wet-ink signature request from the data controller.
Layer 2: Data curation
Upon receipt, ‘native’ data are curated to a common data model (C-Surv). The C-Surv ontology is designed to simplify the analytic challenge of working across multiple datasets and multiple modalities by providing standard conventions for structure, variable naming and value labelling. Other data models, such as CDISC [7], OMOP [8] and HPO [9], involve structural complexity that is rarely relevant to cohort-based analyses. An example variable name using the C-Surv model is given below:
GEN05_PAINCHESTEVR_0_1
The cohort is identified by a three-letter alphabetic code (GEN for Generation Scotland), and the category level by a two-digit numeric code (05 for physical health status). The measure is described by an alphanumeric acronym (PAINCHESTEVR for: Do you ever get pain or discomfort in your chest?). This is followed by an integer giving the number of repeat measurements within a study wave (_0 indicates there were no repeat measurements). Finally, an integer suffix indicates the study wave (_1 for recruitment, _2 for the first follow-up, etc.). This data model is a contribution to a wider debate on how global access to cohort data can be achieved.
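The naming convention above can be sketched as a short parser. This is an illustrative reconstruction of the convention as described, not DPUK tooling; the pattern simply splits a name into the cohort, category, measure, repeat and wave components.

```python
import re

# Hypothetical parser for C-Surv variable names, based on the convention
# described above: <COHORT><CATEGORY>_<MEASURE>_<REPEAT>_<WAVE>
CSURV_PATTERN = re.compile(
    r"^(?P<cohort>[A-Z]{3})"    # three-letter cohort code, e.g. GEN
    r"(?P<category>\d{2})"      # two-digit category code, e.g. 05
    r"_(?P<measure>[A-Z0-9]+)"  # alphanumeric measure acronym
    r"_(?P<repeat>\d+)"         # repeats within a wave (0 = none)
    r"_(?P<wave>\d+)$"          # study wave (1 = recruitment)
)

def parse_csurv(name: str) -> dict:
    """Split a C-Surv variable name into its components."""
    m = CSURV_PATTERN.match(name)
    if m is None:
        raise ValueError(f"not a valid C-Surv variable name: {name!r}")
    return m.groupdict()

print(parse_csurv("GEN05_PAINCHESTEVR_0_1"))
# {'cohort': 'GEN', 'category': '05', 'measure': 'PAINCHESTEVR',
#  'repeat': '0', 'wave': '1'}
```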
C-Surv is optimised for the analysis of ‘flat-file’ data. Higher-order data must be pre-processed prior to curation. The XNAT [10] imaging platform is used to receive and process DICOM and NIfTI files. For genetics, variant call format and allele frequency data may be uploaded. C-Surv is also designed to improve the efficiency of data selection and management from the perspectives of applicants and administrators. There are 22 categories and 132 sub-categories, which may be used for data selection as an alternative to individual variable selection. The ontology is machine-readable for mapping to other ontologies. Researchers may request access to either native or curated data. Standardisation does not imply harmonisation, i.e. comparability (equivalence) of values and/or distributions across variables. Data harmonisation is implicitly purpose-specific. DPUK undertakes a very limited harmonisation programme solely to support data discovery (see Layer 4). Data curation is resource intensive and ongoing. To accelerate the process we are developing machine-learning approaches, which currently achieve 80% accuracy.
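Because category codes are embedded in each variable name, category-level selection reduces to a prefix match. The variable names below (other than the chest-pain example) are invented for illustration; the point is the selection mechanism, not the specific variables.

```python
# Illustrative sketch: selecting every variable in a C-Surv category by
# its two-digit code, rather than listing variables individually.
# All names except GEN05_PAINCHESTEVR_0_1 are hypothetical.
variables = [
    "GEN05_PAINCHESTEVR_0_1",
    "GEN05_BLOODPRESSSYS_0_1",
    "GEN12_SMOKEEVR_0_1",
]

def select_by_category(names, cohort, category):
    """Return all variables in a given cohort and category."""
    prefix = f"{cohort}{category}_"
    return [n for n in names if n.startswith(prefix)]

print(select_by_category(variables, "GEN", "05"))
# ['GEN05_PAINCHESTEVR_0_1', 'GEN05_BLOODPRESSSYS_0_1']
```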
Layer 3: Interoperability
We anticipate a global mixed-model data access environment that respects sovereign boundaries and requires the highest levels of security. In practice, this involves the integration of individual participant data across datasets (‘pooled’ analyses) and the integration of summary data across datasets (‘federated’ analyses). Both models require interoperability between data platforms across national boundaries. DPUK supports the development of interoperability in several ways. In partnership with UKSeRP, we provide access to our software solutions to other data platforms to facilitate architectural compatibility. We also engage with other data platforms to develop interoperability across architectures. In collaboration with Gates Ventures and other data platforms, we are working to develop a high-level data-gateway for cross-platform data access.
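The distinction between pooled and federated analyses can be made concrete with a toy example. The cohort names and values below are invented; the sketch uses a mean as the summary statistic to show that a federated analysis combines only per-cohort summaries, while a pooled analysis combines individual participant records.

```python
# Toy data: per-cohort individual-level values (illustrative only).
cohorts = {
    "cohort_a": [62.0, 71.5, 58.2],
    "cohort_b": [66.1, 69.3],
}

# Pooled analysis: individual participant data are integrated across
# datasets, then analysed centrally.
pooled = [v for values in cohorts.values() for v in values]
pooled_mean = sum(pooled) / len(pooled)

# Federated analysis: each cohort releases only summary data (sum, n);
# the summaries are integrated without individual records crossing
# platform or national boundaries.
summaries = [(sum(v), len(v)) for v in cohorts.values()]
total, n = map(sum, zip(*summaries))
federated_mean = total / n

# For a linear statistic such as the mean, the two models agree.
assert abs(pooled_mean - federated_mean) < 1e-9
```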
Layer 4: Data discovery
A tiered data discovery pathway begins with the Cohort Matrix [11] which provides a high-level comparison of data availability for each cohort. The Cohort Directory [12] enables detailed exploration across cohorts using a range of metadata categories. The Cohort Explorer [13] provides access to a set of 30 variables harmonised across cohorts to enable feasibility analysis. Adding cohorts to the Cohort Explorer is ongoing. As the Cohort Explorer provides access to harmonised data, login to the Data Portal is required. Data discovery is a dynamic area and the development of increasingly ergonomic and intuitive tools is anticipated. This is likely to include the identification of high priority variable sets and variable selection algorithms through the analysis of user-activity and the development of application programming interfaces (APIs) to rapidly reproduce data selection (see Layer 7). DPUK welcomes user feedback and collaboration to inform an ongoing programme of discovery tool development.
Layer 5: Data access brokerage
Automated third-party brokerage, allowing submission of a single data access request to multiple data controllers, reduces the administrative burden on researchers. Standard cross-cohort access agreements and streamlined decision-making procedures simplify the application and approvals process for all stakeholders. Access requests can be specified at the level of detail required by the data controller. The application form is a synthesis of the key issues addressed by most, if not all, individual cohort access management procedures. Fail-fast criteria have been developed to facilitate rapid and transparent decision-making by data controllers; these comprise the proposal not being in the public interest, potential identifiability of participants, no clear scientific rationale, no appropriate analysis plan, or a conflict of scientific interest. Upon approval by a cohort data access committee (DAC), the applicant must complete a standard data access agreement before access is granted. Upon receipt of a completed agreement, access is granted within several days for any one cohort [14]. Data access is normally free at the point of use, with a small number of cohorts levying a data access fee (Table 1).
Table 1 Collaborating cohorts
Layer 6: Data analysis
The analysis environment is VMware based, with each researcher allocated a private personal ‘lab’ (desktop) into which approved data are moved, and from which they can access a range of pre-loaded generic and specialist software packages. Bespoke software may also be uploaded upon approval. VMware clients are available for a range of operating systems including Windows, Linux, macOS, iOS and Android. A standard desktop is provided for researchers seeking to access standard phenotypic data on the Portal, with the following specification: Windows 7/10, 8 GB RAM and 4 CPUs. It is pre-loaded with R, RStudio, SPSS, SAS, Stata, Python, Eclipse, MATLAB, SQL Server Management Studio and Microsoft Office. Statistical software such as R and Python can connect to their official package repositories to enable configuration of software on a per-user basis. Larger desktops (32 GB RAM with 8 CPUs, or 128 GB RAM with 16 CPUs) are available on request and are more suitable for large-scale multi-modal analyses such as omics and imaging analyses, and for machine learning.
Storage is scalable according to study requirements, with basic access to data stored on our systems free to all users. Studies needing unusually large amounts of storage to ingress their own at-scale data may incur charges, which will be discussed with researchers as part of desktop set-up. High Performance Computing and non-Windows operating systems are available upon request. The Data Portal also offers consortia-based workspaces, providing an independent and transparent storage, access and analysis solution for use by multiple institutions.
Two-factor authentication is required to access approved datasets: a username and password, plus an authentication code generated by an app on a mobile device of the applicant’s choosing. Data may not be removed from the Data Portal; tables, graphs and scripts intended for export must be submitted to the data export panel for approval. Manuscripts may be prepared within the Data Portal so that collaborators who are registered users can contribute without the need for manuscript download. An import facility is also available, enabling researchers to upload scripts and additional datasets from outside the Data Portal to reside within their approved DPUK datasets.
Layer 7: Knowledge environment
Cohort data continue to grow in quantity and complexity, but ultimately there is a need to move from data to knowledge and insight. In response to this challenge, the Data Portal is developing a Knowledge Hub to organise, and make readily accessible, information on the key activities of the Data Portal. The immediate purpose of the Knowledge Hub is to enable collaborative knowledge building, helping researchers understand what is available, what is known, and where there are knowledge gaps. It also serves as a knowledge preservation repository, enabling the storage of linked datasets and analysis code with persistent identifiers for rapid access and replication. DPUK welcomes collaboration in the development of this hub.
Application for access can be made through the Data Portal: https://portal.dementiasplatform.uk/