Data at Scale
Pre-requisites to better understand the chapter: basic knowledge of major sources of clinical data.
Logical position of the chapter with respect to the previous chapter: in the previous chapter, you learned what the major sources of clinical data are. In this chapter, we will dive into the main characteristics of the data sources presented. In particular, we will learn how to distinguish and classify data according to its scale.
Learning objectives: you will learn the major differences between the data sources presented in previous chapters and how clinical data can be classified according to its scale. You will become familiar with the concept of ‘big’ clinical data, and you will learn which major concerns limit ‘big’ data exchange.
Keywords: Clinical data, Big data
Hospital facilities generate a wide variety of data every day from different sources. Data is usually stored electronically and spread across different locations. For example, electronic reports containing patients’ treatment information are usually stored within the oncology department of a hospital. Conversely, patients’ images are often stored in the radiology department within a different data platform (PACS, Picture Archiving and Communication System). In addition, different departments within the same hospital might use different infrastructures (e.g. software, data formats) to store acquired clinical data. Very often, those systems and/or data formats are not interoperable with each other. No matter what the source of clinical data is, data fragmentation represents one of the biggest issues when dealing with clinical data in general. Data fragmentation occurs when a collection of data in memory is broken up into many pieces that are not close together. The problem becomes even more pronounced in multicenter studies (e.g. developing and validating a model using data from different institutions). In fact, relevant information might be spread across the different institutions and, due to a lack of standardization, data interoperability might be compromised.
In addition, in the last decade we have been facing a continuous, exponential growth in the usage and production of clinical data, for example in the field of radiation oncology. This growth has been affecting all the different sources of clinical data. For example, new technologies and scanners able to acquire images of a patient in less than a second have led to what has been called a ‘data explosion’ for medical imaging data. In general, technological developments associated with healthcare (e.g. new, powerful imaging machines) have on one side improved the general quality of healthcare. On the other side, they have produced much more data than expected. Meanwhile, data mining techniques have developed much more slowly than expected, or at least not as fast as the production of data.
In fact, this data volume has been increasing so rapidly that it now exceeds the capability of humans to process it. This data therefore represents an almost unexplored source of potential information that can be used, for example, to develop clinical prediction models using all the information (e.g. imaging, genetics banks, and electronic reports) available in medical institutions.
Some of the biggest problems associated with this unexplored data are the presence of missing values and the absence of a pre-determined structure.
Missing values occur when no data value is stored for a variable in an observation. Missing data is a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Statistical techniques such as data imputation (explained later in the book) can be used to replace missing values.
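To make the notion of a missing value concrete, the following minimal Python sketch counts missing values for one variable and fills them with the mean of the observed values. The records, the variable names and the choice of mean imputation are illustrative assumptions only; mean imputation is simply the most basic form of imputation and is not necessarily the technique recommended later in the book.

```python
from statistics import mean

# Hypothetical toy records; None marks a value that was never stored.
records = [
    {"age": 63, "tumor_volume_cm3": 12.5},
    {"age": None, "tumor_volume_cm3": 30.1},
    {"age": 71, "tumor_volume_cm3": None},
    {"age": 58, "tumor_volume_cm3": 18.4},
]

def missing_fraction(rows, variable):
    """Fraction of observations with no stored value for `variable`."""
    return sum(row[variable] is None for row in rows) / len(rows)

def impute_mean(rows, variable):
    """Return a copy of `rows` with missing values replaced by the observed mean."""
    observed = [row[variable] for row in rows if row[variable] is not None]
    fill = mean(observed)
    return [
        {**row, variable: fill if row[variable] is None else row[variable]}
        for row in rows
    ]

imputed = impute_mean(records, "age")
print(missing_fraction(records, "age"), imputed[1]["age"])
```

Note that the imputed value is an estimate, not an observation, so any conclusion drawn from the completed dataset inherits the assumptions of the imputation method.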
Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. A data model is an agreement between several institutions on the format and database structure used to store data.
Unstructured information is typically text-heavy, but may also contain dates, numbers, and facts, as well as audiovisual, location, and sensor data.
If we look at clinical data, we can recognize both the presence of missing values and the absence of a predetermined structure. For these reasons, clinical data is still not ready to be mined (i.e. processed) automatically by machines (e.g. artificial intelligence).
Therefore, the term big (clinical) data refers not only to a large volume of data, but to a large volume of complex, unstructured and fragmented data coming from different sources.
We will explain this concept in the next section.
2.2 ‘Big’ Clinical Data: The Four ‘Vs’
As mentioned in the introduction, the problem of clinical data is not only its increased and growing volume, but also that data is collected in different formats and stored in various separate databases (fragmentation), together with the absence of an agreed data format (lack of structure). Now, why do we use the term ‘big’, and what makes big data ‘big’?
We performed a literature search and summarized the most common definitions of big data.
- 1. Volume: the volume of data increases exponentially every day, since not only humans but also, and especially, machines are producing new information faster and faster (see the previous example of the ‘data explosion’ in medical imaging, but also the “Internet of Things”). In the community, data of the order of terabytes and larger is considered ‘big volume’. Volume contributes to the major issue that traditional storage systems, such as traditional databases, are no longer suitable for holding such a huge amount of data.
- 2. Variety: big data comes from different sources and is stored in different formats:
Different types: in the past, the major sources of clinical data were databases or spreadsheets. This type of data is usually structured or, less often, semi-structured (e.g. databases with some missing values or inconsistencies). Now data can also come in the form of free text (electronic reports) or images (patients’ scans).
Different sources: variety is also used to mean that data can come from different sources. These sources do not necessarily belong to the same institution.
Variety affects both data collection and storage. Two major challenges must be faced: (a) storing and retrieving this data in an efficient and cost-effective way, (b) aligning data types from different sources, so that all the data is mined at the same time.
- 3. Velocity: the production of big data (by machines or humans) is a continuous and massive flow.
Data in motion and real-time big data analytics: big data is produced in ‘real time’ and most of the time needs to be analyzed in ‘real time’. Therefore, an architecture for capturing and mining big data flows must support real-time turnaround.
Lifetime of data utility: a second dimension of data velocity is how long data will remain valuable. Understanding this additional ‘temporal’ dimension of velocity makes it possible to discard data that is no longer meaningful once newer, more detailed information has been produced. The “data lifetime” can be long, but in some cases it is short (days). For example, for a specific analysis we might need only the result of a recent lab test (the most recent data). However, for a more detailed analysis we might want to trace the same measurements back into the past (a longer lifetime).
- 4. Veracity: big data, due to its complexity, might present inconsistencies, such as missing values. More generally, big data contains ‘noise’, biases and abnormalities. The data science community usually recognizes veracity as the biggest challenge compared to velocity and volume. For example, if we take three measurements of blood pressure that differ from one another, reporting the average may be common practice, but the average is not itself a real measured value.
Validity: due to the large volume and the veracity issues of big data, we need to make sure the data is accurate for the intended use. However, compared to smaller datasets, in the initial stage of the analysis there is no need to worry about the validity of each single data element. In fact, it is more important to see whether any relationships exist between elements within this massive data source than to ensure that all elements are valid.
Volatility: big data volatility refers to how long data must be available and how long it should be stored, since concerns about increasing storage requirements might be raised.
Viscosity: viscosity measures the resistance to flow in the volume of data. This resistance can come from different data sources, friction from integration flow rates, and the processing required to turn the data into insight.
Virality: defined as the rate at which the data spreads; for example, it measures how often the data is picked up and re-used by users other than the original owner of the data.
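The ‘lifetime of data utility’ described under velocity can be sketched in a few lines of Python. The lab values, dates and the 365-day window below are hypothetical, chosen only to contrast an analysis that needs the most recent result with one that traces measurements further back.

```python
from datetime import date

# Hypothetical lab results for one patient: (date of test, haemoglobin in g/dL).
lab_results = [
    (date(2017, 1, 10), 13.2),
    (date(2017, 6, 2), 12.1),
    (date(2018, 2, 20), 11.4),
]

def most_recent(results):
    """Short lifetime: keep only the latest measurement."""
    return max(results, key=lambda result: result[0])

def within_lifetime(results, today, lifetime_days):
    """Longer lifetime: keep every measurement not older than `lifetime_days`."""
    return [r for r in results if (today - r[0]).days <= lifetime_days]

today = date(2018, 3, 1)
latest = most_recent(lab_results)
last_year = within_lifetime(lab_results, today, lifetime_days=365)
print(latest, len(last_year))
```

The same stored data serves both analyses; only the lifetime chosen for the question at hand decides which measurements are still considered useful.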
Due to improvements in hardware (e.g. scanning machines), a large number of images is produced (and stored) within a short period of time (Volume).
Developments in hardware, and in the imaging healthcare sector in general, are yielding machines able to produce many more images, combining different modalities at the same time. This phenomenon is growing exponentially (Velocity).
Different imaging modalities are combined (Variety).
Although there is a unified standard for storing and transmitting medical images (DICOM, Digital Imaging and Communications in Medicine), there is no agreement on the associated metadata, such as medical annotations of patients’ scans. As a result, metadata associated with imaging data can come in different formats, without a unique agreed data model (Veracity).
The previous considerations apply to clinical data in general. We advise the reader to identify the eight ‘Vs’ in the different sources of data presented in the previous chapter.
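As a small illustration of the blood pressure example given under veracity, the sketch below averages three readings and checks whether the reported average was ever actually measured. The readings are hypothetical values chosen for illustration.

```python
from statistics import mean

# Hypothetical systolic blood pressure readings (mmHg) taken during one visit.
readings = [118, 131, 126]

average = mean(readings)                 # the summary value that is typically reported
spread = max(readings) - min(readings)   # variation hidden by the summary
is_measured_value = average in readings  # was the reported value ever observed?

print(average, spread, is_measured_value)
```

Storing only the average discards the spread between readings, which is one way in which derived values can reduce the veracity of a dataset.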
2.3 Data Landscape
- 1. Data collections such as clinical data registries or clinical trial data cover only a small portion of the data landscape. In fact:
Cancer registries usually contain information about a large number of patients (y-axis) or a population, but the variables (or features, x-axis) collected are limited.
Clinical trial data usually include more information than cancer registries, but for a selected and limited patient population.
Clinical routine data covers the whole data landscape. Unfortunately, the figure shows that the data landscape is not fully covered by points in the clinical routine domain. The missing dots represent ‘missing’ values. ‘Real world’ clinical data is characterized by a large proportion (around 80%) of missing values.
A vast volume of data is produced (large extent on both the x-axis and the y-axis): Velocity + Volume.
Data includes a wide range of information (‘features’) from different sources: Veracity + Variety.
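The data landscape described above can be pictured as a grid of patients (y-axis) by features (x-axis). The following sketch builds a toy grid and computes how much of it each kind of source fills. All patients, features and values are hypothetical; the routine-data example was deliberately constructed to leave 80% of the cells empty, mirroring the proportion quoted in the text.

```python
# Hypothetical toy landscape: 8 patients (y-axis) x 5 features (x-axis).
FEATURES = ["age", "stage", "dose_gy", "ct_scan", "genomics"]
N_PATIENTS = 8

# Each source maps a patient id to the features recorded for that patient.
registry = {p: {"age": 60 + p, "stage": "II"} for p in range(8)}        # many patients, few features
trial = {p: dict.fromkeys(FEATURES[:4], "recorded") for p in range(2)}  # few patients, more features
routine = {  # touches every patient and every feature, but sparsely
    0: {"age": 64}, 1: {"stage": "III"}, 2: {"dose_gy": 60.0}, 3: {"ct_scan": "recorded"},
    4: {"genomics": "recorded"}, 5: {"age": 71}, 6: {"stage": "I"}, 7: {"ct_scan": "recorded"},
}

def coverage(source):
    """Fraction of the patients x features grid for which `source` holds a value."""
    filled = sum(
        1 for record in source.values() for f in FEATURES if record.get(f) is not None
    )
    return filled / (N_PATIENTS * len(FEATURES))

for name, source in [("registry", registry), ("trial", trial), ("routine", routine)]:
    print(f"{name}: {coverage(source):.0%} of the landscape filled")
```

In this toy grid the registry is wide in patients but narrow in features, the trial is the opposite, and routine data reaches every row and column while leaving most cells empty.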
In the last part of this chapter, we will analyze some of the barriers that currently limit the sharing of big data across institutions (or sometimes even between departments of the same institution). We will also provide the reader with some possible advanced data management techniques to address the issues mentioned.
2.4 Barriers to Big Data Exchange
Administrative barriers: mining big clinical data might require additional effort, such as new dedicated roles in the hospital facility, which increases personnel costs.
Ethical barriers: issues are mainly related to data privacy concerns. Several different privacy laws might apply, leading to relevant differences in the interpretation of privacy and in the application of data confidentiality; in addition, legislation differs between countries.
Political barriers: even when technical barriers have been overcome, very often people are not willing to share their data. A joint effort by the community is then required to prove the benefits associated with ‘big’ data exchange.
Technical barriers: technical barriers are mainly related to poor big data interoperability across different institutions. We saw that veracity is one of the causes of poor big data interoperability.
Secondly, the lack of standardization and big data harmonization is still limiting data exchange. More generally, technical barriers are determined by a lack of support for internationally standardized protocols, formats and semantics.
We believe that the whole community should collaborate to face the challenges presented. In fact, the success of effective clinical prediction models based on big clinical data depends much more on the curation of the data used to develop and validate the model than on sophisticated choices in model development (e.g. the use of very complicated machine learning algorithms).
Accelerate progress toward a standardized and agreed data model for the clinical domain by making use of advanced techniques such as ontologies and the Semantic Web. Ontologies provide a common terminology to overcome, for example, language barriers. In an ontology, data is associated with universal concepts (classes), each uniquely identified by a Uniform Resource Identifier (URI). By means of the Semantic Web, data and related metadata are published and made accessible (via queries) using the universal concepts defined by the ontology. In this way, data and metadata can be queried without knowing a priori the structure or data format of the original sources.
Show the advantages of using real world clinical data by focusing on more high-quality, published research articles that fully demonstrate the benefits of data exchange (e.g., efficiency, robustness and security).
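The ontology-based approach described above can be illustrated with a minimal sketch in plain Python. It is a stand-in for the idea only, not an actual Semantic Web stack (which would typically rely on RDF stores and a query language such as SPARQL). The namespace, concept names, hospital records and local field names are all hypothetical.

```python
# Hypothetical shared concepts, each identified by a URI in a made-up namespace.
ONT = "http://example.org/ontology#"
AGE = ONT + "ageAtDiagnosis"
STAGE = ONT + "tumourStage"

# Each institution keeps its own local field names (here in different languages)...
hospital_a = [{"patient": "A1", "age": 63, "stage": "II"}]
hospital_b = [{"patient": "B7", "leeftijd": 58, "stadium": "III"}]

# ...and publishes a mapping from its local fields to the shared concepts.
mapping_a = {"age": AGE, "stage": STAGE}
mapping_b = {"leeftijd": AGE, "stadium": STAGE}

def to_triples(rows, mapping, id_field="patient"):
    """Express local rows as (subject, concept URI, value) triples."""
    return [
        (row[id_field], mapping[field], value)
        for row in rows
        for field, value in row.items()
        if field in mapping
    ]

def query(triples, concept):
    """Return {subject: value} for every triple that uses `concept`."""
    return {subject: value for subject, predicate, value in triples if predicate == concept}

triples = to_triples(hospital_a, mapping_a) + to_triples(hospital_b, mapping_b)
ages = query(triples, AGE)
print(ages)
```

The query refers only to the shared concept URI, so it does not need to know that one source calls the field `age` and the other `leeftijd`.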
Data volume has been increasing so rapidly that it now exceeds the capability of humans. This data therefore represents an almost unexplored source of potential information.
The term big (clinical) data refers not only to a large volume of data, but more to a large volume of complex, unstructured and fragmented data coming from different sources.
Big clinical data is defined by the four ‘Vs’: volume, variety, velocity, and veracity.
Several issues limit the sharing and exchange of big clinical data: administrative, ethical, political, and technical barriers.
- 5. Han J, Kamber M, Pei J. Data mining: concepts and techniques. San Francisco: Morgan Kaufmann; 2011.
- 9. Traverso A, van Soest J, Wee L, Dekker A. The radiation oncology ontology (ROO): publishing linked data in radiation oncology using semantic web and ontology techniques. Med Phys.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.