There is a plethora of individual-level, routinely recorded data in the UK. These data are recorded to fulfil specific, defined purposes and are regulated for security, confidentiality and disclosure by The Data Protection Act 1998 [1] and The Freedom of Information Act 2000 [2]. Access to routinely recorded data for ‘secondary purposes’, such as clinical research, is permitted providing that there is demonstrable secondary benefit.

The potential for routinely recorded data to inform clinical research and Health Technology Assessment (HTA) has long been recognised [3]. Presently, there are a number of sources of routinely recorded primary and secondary-care clinical data with regional or national coverage. However, limitations with accuracy of coding, confidentiality, ownership and data access have been previously identified as significant barriers to using routinely recorded data in research [4].

There are numerous examples of retrospective, observational, record-linkage population studies where routine sources have proved a valid and efficient method for providing data for clinical research [5]. In the context of prospective research, such as randomised controlled trials (RCTs), routinely recorded data have been used to inform judgements about the feasibility of sample size and recruitment targets [6] and to measure participant outcomes [3, 7]. Pragmatic cluster RCTs, including patient recruitment, randomisation, and administration of the intervention and trial assessments, have been coordinated through routine data sources such as the Clinical Practice Research Datalink (CPRD) [8]. The majority of RCTs incur costs as clinicians assess participants, record outcomes and complete Case Report Forms; hence, using routinely recorded data may provide an efficient alternative method for data collection in addition to reducing the burden on participants. Furthermore, data from non-clinical routine sources may inform outcomes beyond the standard RCT assessments of clinical efficacy and effectiveness. For example, cost data (such as use of health care resources) and socioeconomic data (such as employment and means-tested benefits data) may inform health economic analyses and the assessment of the broader societal impact of health care interventions.

The potential benefits of using routinely recorded data in clinical research have resulted in a political drive to increase implementation, detailed in The Plan for Growth [9] and The NHS Constitution [10], where research is presented as a core activity making the link explicit between the provision of NHS services and research. Consequently, initiatives, such as the Administrative Data Research Network [11], have been established to provide a method of access to individual-level data, linking clinical and non-clinical sources of routinely recorded data.

The objective of this paper is to review relevant sources of routinely recorded data for England, Scotland and Wales, to discuss our experience with the feasibility of accessing individual-level data for a subgroup of participants enrolled into a RCT, and finally to propose recommendations for improving the access and implementation of routinely collected data in RCTs. This is an ongoing study; in a future publication we aim to assess the agreement of routinely recorded data with paired data collected in a RCT using standard prospective methods.


The case study RCT is the Standard and New Antiepileptic Drugs (SANAD) II trial. SANAD II is a pragmatic, UK, multicentre, phase IV RCT funded by the National Institute for Health Research (NIHR) Health Technology Assessment (HTA) programme, assessing the clinical and cost-effectiveness of a number of antiepileptic drugs as first-line treatments for newly diagnosed epilepsy. Data for clinical outcomes, including seizure freedom and adverse events, are recorded on Case Report Forms by the treating clinical team during outpatient appointments. Data to inform cost-effectiveness analyses, including health care resource use and quality of life, are recorded through participant completion of questionnaires. SANAD II is currently recruiting and is expected to report in 2019.

Following research ethics and governance approvals, 470 participants enrolled in SANAD II were invited to provide written consent to permit the request of routinely recorded data for the duration of their participation in SANAD II. Ninety-eight (20.9%) participants provided consent and were included in the study. Relevant sources of routinely recorded data were identified and detailed scoping discussions ensued. Subsequently, where accessible, routinely recorded data for participants recruited into SANAD II were requested through formal applications. The routinely recorded data sources included in this study are as follows:

  • Clinical routine data sources: secondary care:

    ◦ The Health and Social Care Information Centre (HSCIC)

    ◦ The NHS Wales Informatics Service (NWIS)

    ◦ The NHS National Services Scotland; Information Services Division (ISD)

  • Clinical routine data sources: primary care:

    ◦ The Clinical Practice Research Datalink (CPRD)

    ◦ ResearchOne

    ◦ QResearch

    ◦ The Health Improvement Network (THIN) database

    ◦ North West eHealth (NWEH)

  • Non-clinical routine data sources:

    ◦ The Office for National Statistics (ONS)

    ◦ HM Revenue and Customs (HMRC)

    ◦ The Department for Work and Pensions (DWP)

    ◦ The Driver and Vehicle Licensing Agency (DVLA)

  • ‘Linked’ routine data sources:

    ◦ The Secure Anonymised Information Linkage (SAIL) databank

    ◦ The Administrative Data Research Network (ADRN)

In a future publication, the agreement between routinely recorded data and data collected using standard prospective methods will be assessed for baseline variables such as gender, age and date of first seizure, and for outcome measures relevant to SANAD II such as time to 12-month remission from seizures. To assess agreement between paired continuous data, Bland-Altman methods will be employed. Acceptable clinical limits of agreement for each variable or SANAD II outcome will be specified a priori and compared to the 95% confidence limits of agreement. To assess agreement between paired, nominal categorical datasets, cross tabulations will be constructed followed by calculation of Cohen’s Kappa.
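As an illustration of the planned agreement analyses, the following is a minimal Python sketch rather than the study's analysis code. It assumes the common Bland-Altman formulation (mean difference ± 1.96 standard deviations of the paired differences) and unweighted Cohen's kappa. The function names, the example values and the clinical limits are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def bland_altman_limits(x, y, z=1.96):
    """Return (bias, lower, upper) limits of agreement for paired continuous data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired sequences of equal length (at least 2)")
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)   # mean of the paired differences
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd


def within_clinical_limits(lower, upper, clinical_lower, clinical_upper):
    """True if the limits of agreement lie inside limits specified a priori."""
    return clinical_lower <= lower and upper <= clinical_upper


def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa for two paired nominal categorical sequences."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("need two paired, non-empty sequences of equal length")
    n = len(r1)
    # Observed agreement: proportion of pairs on the cross-tabulation diagonal.
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected chance agreement from the marginal totals of each source.
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when both sources use one category")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical paired values: age recorded in the trial vs. the routine source.
    trial_age = [30, 42, 55, 61]
    routine_age = [31, 42, 54, 60]
    bias, lower, upper = bland_altman_limits(trial_age, routine_age)
    print(bias, lower, upper, within_clinical_limits(lower, upper, -2, 2))

    # Hypothetical paired categories: gender recorded in each source.
    print(cohen_kappa(["M", "F", "M", "F"], ["M", "F", "M", "M"]))
```

Specifying the clinical limits before the data are examined, as described above, keeps the comparison from being tuned to the observed differences.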


Clinical routine data sources: secondary care

Electronic medical records of patients’ use of secondary-care services in the UK are routinely managed on a national basis. A number of public service organisations provide national information, data and IT systems for commissioners, analysts and clinicians in health and social care. Data are recorded to inform patient care, provide the data for remuneration for hospital trusts and are subsequently used to monitor and improve clinical services through clinical research. Table 1 summarises the data sources where access to individual-level data is possible.

Table 1 Example sources of routinely recorded secondary-care data

Clinical routine data sources: primary care

Electronic medical records of patients’ use of primary-care services in the UK are recorded routinely by the general practitioner to inform patient care and remuneration, but are not currently available for clinical research on a national basis. A number of organisations represent collaborations between governmental bodies or academic institutions and providers of primary-care IT systems. Access on a regional basis is possible through a number of data sources summarised in Table 2.

Table 2 Example sources of routinely recorded primary-care data

Non-clinical routine data sources

Non-clinical, individual-level data are routinely recorded by a number of UK governmental departments for a variety of purposes. Selected organisations record data that would be informative to prospective clinical research in epilepsy and other diseases, summarised in Table 3.

Table 3 Example sources of routinely recorded non-clinical data

‘Linked’ routine data sources

In order to provide a ‘complete’ dataset of the information required to meet research objectives, data from a number of organisations may need to be accessed. This is typically accomplished by linking data sources using identifiers such as patients’ name, date of birth, National Insurance number or NHS number. In response to the growing recognition of the potential of routinely recorded data, initiatives have been established to assist with the provision of linked, de-identified, aggregate data between data sources:

  • The Secure Anonymised Information Linkage (SAIL) Databank is an initiative developed by Swansea University and funded by the Welsh Government. SAIL provides a method of access to individual-level, routinely recorded, de-identified electronic data for patients across Wales to support research [12]. Access to clinical datasets provided by NWIS is complemented with numerous non-clinical administrative datasets including births, deaths and demographic data. Following the scoping process, a formal application is submitted to the Information Governance Review Panel before access to data is granted. SAIL data have been accessed to measure clinical outcomes in retrospective research [13].

  • The Administrative Data Research Network (ADRN) is a UK-wide partnership between universities, government departments, national statistics authorities, funders and researchers, funded by the Economic and Social Research Council. ADRN provides a method of access to a number of non-clinical administrative routine datasets including employment, socioeconomic, crime and education data [11] in addition to clinical datasets detailed previously such as those recorded by HSCIC. Following development of a project proposal, a formal application is reviewed by the Approvals Panel before access to data is granted.
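The identifier-based linkage described above can be sketched in a few lines of Python. This is a hedged illustration only: it does not reproduce the procedures used by SAIL, ADRN or any other data source, which are not detailed here. It assumes exact (deterministic) matching on NHS number, uses the standard Modulus 11 check digit to reject malformed numbers, and drops the identifier from the linked output. The field names `study_id`, `nhs_number` and `admissions` are hypothetical.

```python
def normalise(nhs_number: str) -> str:
    """Strip the spaces commonly used to format NHS numbers."""
    return nhs_number.replace(" ", "")


def nhs_number_is_valid(nhs_number: str) -> bool:
    """Check a 10-digit NHS number against its Modulus 11 check digit."""
    digits = normalise(nhs_number)
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is invalid.
    return check != 10 and check == int(digits[9])


def link_records(trial_rows, routine_rows, key="nhs_number"):
    """Exact-match linkage; returns (de-identified linked rows, unmatched study IDs)."""
    routine_by_key = {}
    for row in routine_rows:
        routine_by_key.setdefault(normalise(row[key]), []).append(row)

    linked, unmatched = [], []
    for trial_row in trial_rows:
        k = normalise(trial_row[key])
        matches = routine_by_key.get(k, []) if nhs_number_is_valid(k) else []
        if not matches:
            unmatched.append(trial_row["study_id"])
            continue
        for routine_row in matches:
            merged = {**routine_row, **trial_row}
            del merged[key]  # remove the identifier from the linked output
            linked.append(merged)
    return linked, unmatched


if __name__ == "__main__":
    # Hypothetical rows; the second trial NHS number fails the check digit.
    trial = [
        {"study_id": "S001", "nhs_number": "943 476 5919"},
        {"study_id": "S002", "nhs_number": "1234567890"},
    ]
    routine = [{"nhs_number": "9434765919", "admissions": 2}]
    print(link_records(trial, routine))
```

Reporting the unmatched study IDs separately makes linkage failures visible instead of silently shrinking the linked dataset.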

Challenges and feasibility of access

We have requested access to routinely recorded data for individuals enrolled in the SANAD II RCT, resident in England and Wales, who have provided written consent. There were insufficient participants meeting the eligibility criteria resident in Scotland. Data sources were identified and scoping discussions informed the initial assessment of feasibility. Data sources were deemed feasible if individual-level data could be provided for specified individuals providing consent. The resources required, including cost and researcher time, were also important factors in the assessment of feasibility. Overall, the preparation, research ethics and governance approval, and submission of the applications for data access required significant researcher time and a period of 18 months. The feasibility, timeline and key milestones involved for each data source are summarised in Table 4.

Table 4 Summary of key application milestones

Clinical routine data sources

Routinely recorded secondary-care data can be requested on an individual-level, identifiable basis for patients in England and Wales through HSCIC and NWIS (the latter accessed through SAIL). In our experience this process is feasible as part of a RCT, yet there are notable limitations. In England, HSCIC has set a target time to data access of sixty working days following submission for a complex application involving bespoke data linkage from multiple datasets. We were granted access to the data within this timeframe, measured from the date of submission of the Data Access Request Service (DARS) online application. However, this positive experience following submission of the application is countered by limitations in the pre-application process. Even acknowledging the significant update to online application and approval procedures that occurred during this period, a considerable period of time is still required to develop the application. The nature of the request for identifiable data necessitated participant consent as the valid legal basis. HSCIC requires ethical and governance approval to be in place prior to DARS review. To prevent future amendments and delays, it was therefore rational to ensure that the consent materials had been reviewed by HSCIC’s Information Governance Team before submitting the documents for ethical and governance approval. HSCIC provides written guidance regarding the consent materials and advises that documents should be reviewed. However, in our experience there is no formalised process for providing this review. Following significant correspondence the consent materials were reviewed by the Data Access and Information Sharing Team. However, this feedback was provided only after a formal submission and review by the Data Access Request Service. Formalising the process for the review of consent materials would likely improve the time and resource efficiency for both HSCIC and the researcher.
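To make the sixty-working-day target concrete when planning a trial timeline, the target date can be estimated as in the minimal Python sketch below. It assumes that a working day is any Monday to Friday not listed in a caller-supplied set of holidays; how HSCIC itself counts working days is not stated here, and the function name and dates are hypothetical.

```python
from datetime import date, timedelta


def add_working_days(start, n, holidays=frozenset()):
    """Return the date n working days after start (start itself is not counted)."""
    current = start
    remaining = n
    while remaining > 0:
        current += timedelta(days=1)
        # weekday() is 0-4 for Monday-Friday; skip weekends and listed holidays.
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


if __name__ == "__main__":
    submitted = date(2016, 1, 4)  # hypothetical submission date (a Monday)
    print(add_working_days(submitted, 60))
    # Supplying public holidays pushes the target date later.
    print(add_working_days(submitted, 60, holidays={date(2016, 3, 25)}))
```

Because the pre-application phase is not covered by the target, such an estimate describes only the period after formal submission.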

For participants in Wales, we have requested secondary-care data and, for a proportion of participants, primary-care data through SAIL databank. SAIL provided a streamlined pre-application service, including engaging in multiple discussions and completion of a scoping document outlining the study methods and costs involved. Consent materials were also promptly reviewed by a member of the Information Governance Team.

Common to both sources of secondary-care, routinely recorded data, there are stringent information governance requirements that must be in place prior to application. These include information security measures and assessments, specific inclusion regarding the ‘processing of health care data for the subjects of research’ in the institutional Data Protection Act registration and, in the case of HSCIC, an institutional Data Sharing Framework Contract. Adequate guidance is provided by the data sources, but these requirements may cause delay if not addressed by the researcher. Furthermore, there is a time lag of approximately 3–6 months before data become available within each data source. This delay potentially limits the utility of such sources in prospective clinical research, such as drug trials, where prompt reporting is clinically important and a regulatory requirement.

Routinely recorded primary-care data for specific participants in England are less accessible. The majority of providers of primary-care data, such as ResearchOne and QResearch, provide data on a de-identified basis with no facility to re-identify individuals. Therefore, where specific participants need to be identified, as for RCTs such as SANAD II, these sources are not applicable. Following our correspondence, CPRD confirmed it may be possible to retrieve identifiable individual-level data linked to HSCIC data in the future, but the required approvals were not in place and the timescale to resolution was unclear. Furthermore, such primary-care sources provide data for only a proportion of the population and can be expensive.

North West eHealth employs an alternative methodology whereby primary-care data are extracted directly from the GP through a third party. This process requires participant and GP consent and installation of the required software but is an effective data-extraction method [14]. NWEH offers a number of primary-care research tools for the wider research community but does not currently routinely provide a bespoke primary-care data-extraction service for research.

Non-clinical routine data sources

Aggregate economic and societal statistics, provided at Lower Layer Super Output Area (LSOA) level, can be accessed through the ONS and are in the public domain. Such data may add value to the analyses of health and socioeconomic outcomes in RCTs. Individual-level economic data from sources such as the DWP and HMRC would likely be informative to prospective clinical research as such data are often poorly or incompletely recorded using standard methods [15]. However, relevant to this study, there is no previous evidence of access to DWP or HMRC individual-level or aggregate data for clinical research.

During scoping discussions with DWP and HMRC, we were directed to ADRN but this network has not been successful in negotiating data access.

Finally, the outcomes of selected clinical studies may be measured using DVLA data. However, the DVLA declined the request for access, citing insufficient internal resources to process the request and more stringent data protection requirements than those employed in the NHS or academic institutions, without providing explicit details regarding these requirements.


Routinely recorded data are valid for use in retrospective clinical research [3, 4] and have the potential to be used in prospective research including measuring the outcomes of RCTs [7] and providing additional benefits such as a method to address missing RCT data. Limitations, specifically with respect to accuracy and access, have been recognised for some time. Academic, political [9] and health service [10] interest in UK sources of routinely recorded data has resulted in expansion and improvements, notably in the access to linked datasets. However, our experience with accessing individual-level data for specific participants providing written consent, to inform the outcomes of a RCT, highlights persisting limitations.

Clinical routine data sources are numerous and there is comprehensive national coverage of secondary-care data. In our experience, accessing individual-level data is feasible. However, inefficiencies in the application processes persist, particularly during the informal ‘pre-application’ phase. The notable limitation encountered was obtaining feedback on the Patient Information Sheet and Consent Form prior to ethical and governance review. Formalising an explicit review process for consent materials would improve the efficiency for both the data holders and the research team.

Access to routinely recorded, individual-level, primary-care data has not been feasible. Each primary-care data source has limited geographical coverage, often based on GP IT systems, which usually process de-identified data and may incur significant expense. The inception of the HSCIC General Practice Extraction Service, which records primary-care data nationally for England, represents the most promising national source; however, access is currently restricted to Department of Health initiatives such as research involving screening procedures [16].

Access to non-clinical data sources for clinical research has not been possible. ADRN has been established to act on behalf of the researcher in negotiating access to de-identified, linked, routinely recorded data from a number of organisations and the study proposal was promptly directed to ADRN. However, the decision whether to release data remains with the data holder. Ideally, the next step would be the storage of de-identified linked data from participating organisations in a single repository, similar to those established for RCT data [17]. This would create a single point of access and remove the burden for each organisation to consider each study individually. This would, however, require significant information governance and security barriers to be cleared and, in light of recent developments within the research climate, individual consent. Including patients as stakeholders in the development of such data sources is essential [18].

Although there are examples of pragmatic RCTs being coordinated through routine data sources [8], there are likely to be limitations when accessing routinely recorded data to measure the outcomes of RCTs. Quality assurance is unclear and the level of agreement of routinely recorded data with data recorded through standard RCT methods remains uncertain, particularly when measuring clinical outcomes. The time delay before routinely recorded data become available may have implications for RCTs where prompt reporting is both clinically important and a regulatory requirement. Furthermore, the pre-application and application process may introduce further delays. This will have implications for RCTs relying on routinely recorded data. The cost-efficiency of accessing routinely recorded data, compared to standard methods, is unclear. Further research is required to assess the agreement, additional benefits and cost-efficiency of routinely recorded data compared to data collected through standard RCT methods; it may be in the additional benefits, such as addressing missing RCT data, where routinely recorded data are most useful.


The failure to obtain access to routinely recorded data for a purpose with clear secondary benefit to clinical research methodology, such as this study, seems inappropriate when the ‘public purse’ funds the research, the researcher and the public body holding the data. Perhaps a significant cause of, or contributor to, the current limitations is the Care.Data initiative in 2014. The proposal to extract primary-care records from all patients was opposed publicly by a number of groups and, for example, resulted in an internal inquiry within HSCIC. Data applications were suspended during this period and our current experience may be explained by the concurrent revision of the HSCIC application and approval procedures. However, in the medium term, of more concern is the resulting harm to public perception. Currently, more than 1.2 million individuals in the UK have submitted a ‘Type 2 objection’, meaning that their data will not be shared for purposes other than direct care [19]. Although the application procedures may improve, and in time we may be able to access data more efficiently, the loss of 2.2% of the population’s data will have implications for the routinely recorded data that will then be made available for research. Involving patients as important stakeholders and regaining their trust will be an essential factor in realising the individual and population health care benefits of routinely recorded data [20].


We propose recommendations to improve access and implementation of routinely recorded data during a RCT, summarised in Table 5.

Table 5 Recommendations to improve access to routinely recorded data for research