The objective of this paper is to provide the reader with some ideas around developments that have been collectively labelled as ‘big data’ and how they might have an impact on the field of health economics in the years ahead. This piece uses a SWOT analysis approach to understanding the strengths, weaknesses, opportunities and threats of big-data techniques in the fields of health economics (including pharmacoeconomics), epidemiology and public health. As with any SWOT analysis, there is an element of subjectivity behind the ideas in this paper. The paper first defines and introduces big data (with particular attention to open data), then considers the opportunities for health economics and biomonitoring, and finally discusses large data repositories, the opportunities for public health, and some of the risks and challenges around big data.
There are several definitions of big data; most talk about unstructured, large and often time-stamped datasets that would be difficult to process in standard relational databases. Big data involves technologies that allow unprecedented opportunities to store, match up, analyse and visualize datasets in ways that would not have been possible in the past, and which can reveal aspects of human behaviour and processes that were previously difficult to measure. There is a question over whether big data is a genuinely new phenomenon. There are big datasets that have been used for years, such as the National Health and Nutrition Examination Survey (NHANES) in the USA; however, the volumes of digital data surpassed the volumes of non-digital data globally in 2002 and now constitute 94 % of all data. Business intelligence technology has moved quickly, but the real big-data revolution lies in the volumes of data that are collected and stored (with the amount of information stored globally estimated to double every 40 months) and in the increase in the processing power of computers and servers, with people now talking about peta-, exa- or zettabytes where in the past they would refer to kilobytes. We now collect data without necessarily knowing the purpose for collecting them. In the past, developers would spend months trying to get information systems to speak to each other, whereas now the process is usually much quicker and largely automated. In the past, data were used to measure performance against a plan, or to test hypotheses, whereas data mining and automated neural network models are now used to generate new hypotheses and to look for links that may not immediately be logical or apparent—‘unknown unknowns’. In the last 3 years, it has been revealed that the National Security Agency (NSA) in the USA and the Government Communications Headquarters (GCHQ) in the UK have been monitoring people’s phones and internet use on a massive scale—action that is possible only through big-data storage and sifting technology. There is a debate around whether big data will enhance our human potential or will mean that we are over-monitored and manipulated without our knowledge.
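To give a concrete, deliberately simplified sense of what ‘looking for links that may not immediately be logical or apparent’ can mean in practice, the sketch below clusters patients on a few routinely collected variables without any prior hypothesis. The file name and variable names are hypothetical assumptions for illustration only; real data-mining exercises use far richer data and methods.

```python
# Minimal sketch of hypothesis-free pattern discovery: clustering patients
# on routinely collected variables to surface groupings that were not
# specified in advance. File and variable names are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("patients.csv")  # assumed columns: age, bmi, annual_visits, total_cost
X = StandardScaler().fit_transform(df[["age", "bmi", "annual_visits", "total_cost"]])

# Ask for a handful of clusters and inspect them afterwards for
# unexpected combinations of characteristics ('unknown unknowns').
labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(X)
print(df.assign(cluster=labels).groupby("cluster").mean())
```

The point of such an exercise is not to confirm a pre-specified hypothesis but to suggest new ones for formal testing.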
Big data often involves ‘open’ datasets, which are shared in the public domain. In future, amateur or volunteer ‘citizen scientists’ may compete with academic groups; for example, PXE International has found treatments for a rare disease, pseudoxanthoma elasticum (PXE), by sharing data and blood samples with interested parties. The UK Government has had a programme of putting data into the public domain and has a central repository, data.gov.uk, for these data; for example, one part of the UK Government with many potential applications is Ordnance Survey (OS), which has made many of its previously paid-for geographical mapping products openly available. In terms of health, the UK Government has faced negative headlines and a backlash over its plans for care.data, a web interface through which linked general practitioner (GP) and hospital admissions data would be available to researchers, potentially creating one of the world’s biggest linked health datasets. Much of this negative publicity has arisen because of the possibility of data being shared with private companies. In the USA, the Patient-Centered Outcomes Research Institute (PCORI) has funded similar big-data projects. These types of datasets could be crucial in understanding the whole patient journey and in meeting healthcare challenges such as efficiency savings or moving services out of hospitals.
Phil Hammond, a British doctor, broadcaster and campaigner, has called for open data in healthcare and aggressive transparency, and for the UK health service to do more to protect whistleblowers who expose poor standards of care. However, having open, transparent data could put healthcare providers at greater risk of litigation when things go wrong. It is often quoted that around 1 in 10 people is harmed by healthcare, although this estimate varies widely. There is a risk that, with open data, people will miss the nuances and misinterpret the figures; for example, when US Medicare cost data were released, some physicians were identified as the biggest earners when in fact they were managing health budgets for large programmes.
There are also ‘closed’ datasets, which are owned and sold by private companies such as Quintiles. These closed datasets also have increased potential in big-data applications through data linkage. Dr. Foster, a company that was initially set up in partnership with the UK Department of Health, analyses closed datasets to provide hospital data analytics, including the controversial Hospital Standardised Mortality Ratios (HSMRs), which helped to identify the high mortality rates in the Mid Staffordshire NHS [National Health Service] Trust that led to the Francis Inquiry. These mortality ratios are adjusted for case mix, age and co-morbidities so that a fair comparison can be made, although some dispute this, arguing that the measure is over-sensitive to local differences in clinical coding within hospitals. Hospital readmissions are notoriously tricky to predict, despite there already being several algorithms, such as PARR+ (Patients at Risk of Readmission), that estimate a patient’s probability of being readmitted. The Heritage Health Prize was a competition in which analysts were given access to a large hospital training dataset and were asked to produce an algorithm that would predict hospital readmissions as accurately as possible in a test dataset.
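To make the train-and-test idea behind exercises such as the Heritage Health Prize concrete, the following minimal sketch fits a simple readmission model on a training dataset and judges its accuracy on a held-out test dataset. It is an illustration only: the file name, the column names (age, prior_admissions, comorbidity_count, readmitted) and the choice of logistic regression are assumptions for the example, not a description of PARR+ or of any Heritage Health Prize entry.

```python
# Minimal sketch of a train-and-test readmission model.
# Column names and the choice of logistic regression are illustrative
# assumptions, not the PARR+ algorithm or a Heritage Health Prize entry.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical patient-level dataset: one row per discharge.
df = pd.read_csv("admissions.csv")  # assumed file with the columns below
features = ["age", "prior_admissions", "comorbidity_count"]
X, y = df[features], df["readmitted"]  # 1 = readmitted within 30 days

# Hold out a test set so accuracy is judged on unseen patients.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predicted_risk = model.predict_proba(X_test)[:, 1]

# Discrimination on the held-out data (how well the model ranks risk).
print("Test-set AUC:", roc_auc_score(y_test, predicted_risk))
```

Evaluating on data the model has never seen is what makes such competitions meaningful; a model tuned and scored on the same data will overstate its accuracy.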
Table 1 lists the types of datasets, with an example of each. These datasets are not all uniquely linked to the big-data phenomenon; some have been around for many years. However, within each category, big data will mean increased capability.
Most big-data applications use the internet. The influential book Super Crunchers detailed how companies such as Amazon record every click a customer makes and how long it takes, and sometimes test different prices for the same product; this approach has been employed controversially by budget airlines, which increase prices when a customer looks at a flight several times. Healthcare data are among the types of data that individuals are less willing or less likely to share on the internet, so it could be argued that there are fewer applications for acquiring these data routinely. Google Flu Trends was a big-data ‘collective intelligence’ system that recorded web searches for flu symptoms and was postulated as an early warning system for flu outbreaks. However, despite some early success, subsequent analyses found it to be not very accurate in predicting rates of flu cases. Because Google’s algorithms are proprietary and not openly available, they cannot be interrogated by other researchers.
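As a simple illustration of the kind of signal such a system relies on, the sketch below correlates a weekly series of symptom-related search volumes with reported flu case rates. The file name and column names are hypothetical, and the published Google Flu Trends model was far more elaborate (and, as noted, proprietary); this is only a sketch of the underlying idea.

```python
# Illustrative correlation of search volumes with reported flu rates.
# File and column names are hypothetical; Google Flu Trends itself used a
# proprietary model over many search terms, not a single correlation.
import pandas as pd

weekly = pd.read_csv("flu_weekly.csv")  # assumed columns: week, searches, reported_cases

# Same-week association between search volume and reported cases.
print("Correlation:", weekly["searches"].corr(weekly["reported_cases"]))

# Searches may lead reported cases, so check a one-week lag as well.
print("Lag-1 correlation:",
      weekly["searches"].shift(1).corr(weekly["reported_cases"]))
```

A strong correlation in historical data is no guarantee of future accuracy, which is one reason the subsequent performance of Google Flu Trends disappointed.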