Background

The usage of the term ‘big data’ has picked up since 2011. This was the year that Gartner introduced “Big Data and Extreme Information Processing and Management” in its hype cycle [1]. Furthermore, increased interest is visible in the ever-growing search traffic shown by Google Trends [2]. Scientific publications in (bio)medicine, which are our main interest in this study, also show a massive increase in the number of papers published yearly that mention big data [3].

Still, in spite of the popularity of this term, there is much debate about the definition of big data. In 2001 Gartner (called “META Group” at the time [4]) published a report which in hindsight is often referred to as the first description of big data. It defines the term through information technology (IT) challenges described by three V’s: volume, velocity, and variety [5].

Over the years this has evolved into many interpretations. Mostly, companies define big data in the light of their prime business, meaning that Google will mention analysis (e.g., Google Flu), while Oracle emphasises volume and storage [6], and IBM or Microsoft focus on computation and usability [7]. In a blog post on the data science sub-domain of the Berkeley School of Information, 43 ‘thought leaders’ from industry were asked for their definition of big data [8]. Few of these leaders agreed with each other, and definitions ranged from “data that cannot fit easily into a standard relational database” to “big data is not all about volume, it is more about combining different data sets and to analyse it in real-time to get insights for your organisation”. On a governmental level, the US National Institute of Standards and Technology (NIST) defined big data in 2014 in terms of the need for scalable technology and four V’s: volume, velocity, variety, and variability. Finally, in the scientific domain, big data is mostly understood as the challenges of working with large volumes of data [9–11].

Possibly due to this great variety of definitions, in practice we have observed many different interpretations of the term big data among (bio)medical scientists. Some understand big data as a positive development, and actively pursue usage of new methods and technology associated with the term [3]. Others, however, view it as a harmful influence on, for example, the strength of research evidence, preferring classical statistical methods [12]. A better understanding of big data would facilitate communication and clarify expectations regarding this overloaded term [13].

Some researchers have attempted to capture comprehensive definitions of big data, such as De Mauro et al. [14], Ward and Barker [15], and Andreu-Perez et al. [3]. The first two do not focus on any particular domain, whereas Andreu-Perez et al. [3] focus on health-oriented applications. Of particular interest is the work by De Mauro et al., which analyses various big data definitions and distils its own from them. Their proposed definition is based on four themes found in the underlying definitions that were gathered, namely information, methods, technology, and impact. Note that all the cases mentioned above are based on qualitative literature studies. Hansmann and Niemeyer [16], however, used text mining to understand the themes included in big data literature. They combined automatic and manual approaches to identify three themes: IT infrastructure, methods, and data. While these efforts have been valuable for a better understanding of the term big data, they do not present systematic evidence of the actual themes used in the scientific literature, in particular for the (bio)medical research domain.

In this paper we present our efforts to answer the following research question: Which themes from various existing big data definitions are expressed in (bio)medical scientific publications? For this purpose, we adopted a data-driven systematic approach. First, existing big data definitions were reviewed and 12 themes were identified. Then, (bio)medical literature was systematically gathered from two scientific databases (i.e., PubMed and PubMed Central) and analysed automatically with text mining. While there are many text mining and clustering methods, we chose topic modelling (TM, [17, 18]) because this method captures two aspects that are important for this dataset: words may have multiple meanings or interpretations, and documents may contain one or more topics. The topics identified through TM were annotated with the 12 themes by a small group of observers. In the following sections we detail the methods, present the results, and discuss our findings.

Methods

In this section the construction of the corpus is described, followed by an explanation of the concepts behind TM. Then the application of TM to the corpus is presented in three steps: pre-processing, model fitting, and post-processing. Finally we present the gathering and summary of existing big data definitions, and the process used to identify them in the topics determined by TM.

Corpus

The corpus of documents was created by querying two literature databases focused on (bio)medical publications: PubMed and PubMed Central (PMC). The search queries were as follows:

  • PubMed: “big data”[TIAB] OR (big[TIAB] AND “health data”[TIAB]) OR “large data” [TI];

  • PMC: “big data”[TI] OR “big data”[AB] OR (big[TI] AND “health data”[TI]) OR (big[AB] AND “health data”[AB]) OR “large data” [TI].

Each query was built to search for literal use of the term ‘big data’, thereby selecting documents that were self-identified with big data. No word spacing was allowed, to minimise the number of irrelevant results. The terms ‘big health data’ and ‘large data’ were added because they also retrieved relevant literature, especially for publications before 2011, when the term big data was not yet popular.

Titles and abstracts were exported from the databases and merged into a local repository for further processing. Based on the title (stripped of all special characters and spaces) or the digital object identifier (DOI, if available), duplicates were removed from the corpus. Lastly, any record with an empty abstract (i.e., not provided in the database) was also removed from the corpus.
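For illustration, the de-duplication and filtering step could be expressed in R roughly as follows (a minimal sketch, not the authors’ code; the records data frame with title, doi, and abstract columns is an assumption):

```r
# 'records' is assumed to be a data frame with columns title, doi, and abstract
# normalise titles: strip special characters and spaces, lower-case
norm_title <- tolower(gsub("[^[:alnum:]]", "", records$title))

# a record is a duplicate when its normalised title or its DOI was seen before
dup <- duplicated(norm_title) | (!is.na(records$doi) & duplicated(records$doi))
records <- records[!dup, ]

# drop records without an abstract
records <- records[!is.na(records$abstract) & nzchar(records$abstract), ]
```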

Topic modelling concepts

A specific type of TM was chosen, namely Latent Dirichlet Allocation (LDA) [17]. Throughout this paper the abbreviations TM and LDA are used interchangeably to indicate topic modelling through the application of LDA. The concept of TM is captured in Fig. 1 using the plate notation [17–19]. Plate D denotes the set of documents, while \(\theta ^{(d)}\) is the multinomial distribution over topics for document d. Plate \(N_{(d)}\) denotes the set of words w for a specific document d, while z is the topic to which word w is assigned. Lastly, plate T denotes the set of topics, where \(\phi ^{(z)}\) is the multinomial distribution over words for topic z.

Fig. 1
figure 1

Plate notation of topic modelling, plates are shown as rectangles and the arrows are conditional dependencies. Shows the relations between known variables (documents D, number of words \(N_{(d)}\), and words w), latent variables (multinomial distributions \(\theta ^{(d)}\) and \(\phi ^{(z)}\), and word to topic assignment z), and hyperparameters (\(\alpha\) and \(\beta\))

In TM, \(\theta\), \(\phi\), and z are the latent variables that have to be estimated. Together with the hyperparameters \(\alpha\) and \(\beta\) of the Dirichlet priors placed on \(\theta\) and \(\phi\), the model is called Latent Dirichlet Allocation [17, 19]. The hyperparameters \(\alpha\) and \(\beta\) should be interpreted as smoothing factors for the topic-to-document (\(\theta\)) and word-to-topic (\(\phi\)) distributions, respectively.
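Concretely, the plate diagram encodes the standard LDA generative process [17], restated here in the notation of Fig. 1:

$$\begin{aligned} &\text {for each topic } z = 1,\ldots ,T: \quad \phi ^{(z)} \sim \text {Dirichlet}(\beta ) \\ &\text {for each document } d \in D: \quad \theta ^{(d)} \sim \text {Dirichlet}(\alpha ) \\ &\qquad \text {for each word position } i = 1,\ldots ,N_{(d)}: \quad z_{i} \sim \text {Multinomial}(\theta ^{(d)}), \quad w_{i} \sim \text {Multinomial}(\phi ^{(z_{i})}) \end{aligned}$$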

Topic modelling implementation

The statistical software R [20] was used to implement the pre-processing, TM fitting, model selection, and post-processing steps.

Pre-processing was executed using the R tm and quanteda packages [21, 22]. Processing consisted of removing stop words taken from the SMART list [23, 24] (e.g., about, the, which). Extra stop words were added, which were either junk words resulting from processing steps, or terms that appeared very often and diluted the TM outcome, such as ‘big data’, ‘introduction’ and ‘discussion’. From the remaining words, bi-grams were created with the function dfm: two words that occur next to each other at least 15 times in the whole corpus are joined by an underscore (e.g., health_care). Furthermore, words were stemmed with the function stemDocument; e.g., ‘develop’, ‘developed’, and ‘development’ were all stemmed to ‘develop’. Lastly, words longer than 26 characters were removed.
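To make this concrete, the pipeline described above could be approximated in R roughly as follows (a minimal sketch, not the authors’ exact code; the object abstracts, the illustrative compound terms, and the use of quanteda token functions in place of the exact tm calls are assumptions):

```r
library(tm)        # SMART stop word list
library(quanteda)  # tokenisation, document-feature matrix, stemming

# 'abstracts' is assumed to be a character vector holding titles plus abstracts
toks <- tokens(abstracts, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, tm::stopwords("SMART"))   # standard stop words

# join frequently adjacent word pairs into bi-grams (illustrative pairs only;
# in the study, pairs co-occurring at least 15 times were compounded)
toks <- tokens_compound(toks, pattern = phrase(c("big data", "health care")))

# extra stop words: junk terms and overly frequent terms that dilute the topics
toks <- tokens_remove(toks, c("big_data", "introduction", "discussion"))

dfm_corpus <- dfm(toks)                 # document-feature matrix
dfm_corpus <- dfm_wordstem(dfm_corpus)  # 'developed', 'development' -> 'develop'
dfm_corpus <- dfm_remove(dfm_corpus, pattern = "^.{27,}$",
                         valuetype = "regex")  # drop words longer than 26 characters
```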

Fitting the model consisted of estimating the latent variables \(\theta\), \(\phi\) and z, which was done with the R topicmodels package [26]. Directly calculating \(\theta\) and \(\phi\) was shown to be suboptimal [19]; therefore we used a Bayesian approach from the topicmodels package, using iterative Gibbs sampling to approximate the distribution of z. In this sampling process the probability of a word occurring in a topic is estimated. This probability of a given word-to-topic assignment is calculated from how often the word already occurs in the topic and how dominant the topic is for the document from which the word was sampled. Once the model fitting converges, \(\theta\) and \(\phi\) can be derived from the approximated distribution of z with the posterior function.

Multiple models were fitted to determine the best TM parameters. We first conducted experiments to find adequate values for \(\alpha\) and \(\beta\). These influence the model as follows: with a small \(\alpha\) (i.e., with many topics, \(\alpha = 50/T\) becomes smaller) documents are likely to contain only a few topics, whereas a bigger \(\alpha\) (i.e., few topics) results in more topics per document. Similarly, a small \(\beta\) makes it likely for a topic to contain a mixture of only a few words, thereby pushing the model to select highly specific words per topic. A range of values was fitted for both \(\alpha\) and \(\beta\) and model outcomes were compared. Within a reasonable range (i.e., \(0.1< \alpha < 1\)) we observed only minor differences between topics. Ultimately, fixed values of 50/T and 0.01 were chosen for \(\alpha\) and \(\beta\), respectively, as suggested in the literature [19, 27].
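Combining the fitting procedure and these parameter choices, a single model fit with the topicmodels package could look as follows (a sketch under the settings reported above; the seed value and object names are ours, and dfm_corpus refers to the pre-processing sketch):

```r
library(topicmodels)
library(quanteda)

# convert the quanteda document-feature matrix to the format expected by topicmodels
dtm_tm <- convert(dfm_corpus, to = "topicmodels")

k <- 25                                   # number of topics T
lda_fit <- LDA(dtm_tm, k = k, method = "Gibbs",
               control = list(alpha = 50 / k,   # theta smoothing
                              delta = 0.01,     # phi smoothing (beta in the text)
                              iter  = 500,      # Gibbs iterations
                              seed  = 1234))    # arbitrary seed for reproducibility

post  <- posterior(lda_fit)   # derive theta and phi from the fitted model
theta <- post$topics          # documents x topics
phi   <- post$terms           # topics x words
```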

For model selection we analysed the likelihood for varying numbers of topics in the range \(T \in \{5, 10, 15,\ldots, 100, 150, 200,\ldots, 500\}\). However, likelihood alone cannot be used to find the best model: a penalising factor has to be added for the model’s complexity (i.e., the number of variables that have to be estimated). In the case of TM, the variables to be estimated are the latent variables \(\phi\) and \(\theta\), which grow with the number of topics. Two information criteria were considered, namely the Bayesian Information Criterion (BIC) [28] and the Akaike Information Criterion (AIC) [29]. When increasing the number of topics in a model, each topic becomes more specific and, therefore, easier to interpret. BIC puts more emphasis on the simplicity (in terms of the number of free parameters) of the model, resulting in a smaller number of topics compared to AIC. We therefore chose to perform model selection using the AIC. The model where the AIC reached its minimum was considered the optimal model. Equation (1) defines the AIC, where T is the number of topics in model \(M_T\), L is the likelihood of model \(M_T\), and W is the number of unique words in the corpus:

$$\begin{aligned} AIC(M_T) = -2 \log (L) + 2 \left( \left( T - 1\right) + T \left( W - 1\right) \right) \end{aligned}$$
(1)
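In R, the AIC of Eq. (1) can be computed directly from a fitted model’s log-likelihood; a possible helper (assuming a topicmodels LDA object and the corpus vocabulary size W; the list of fitted models in the usage comment is hypothetical) is:

```r
# AIC as in Eq. (1); 'model' is a fitted topicmodels LDA object,
# 'W' the number of unique words in the corpus
aic_lda <- function(model, W) {
  T_topics <- model@k
  -2 * as.numeric(logLik(model)) + 2 * ((T_topics - 1) + T_topics * (W - 1))
}

# e.g., choose the model with minimal AIC from a list of fits over different T:
# aics <- sapply(models, aic_lda, W = ncol(dtm_tm))
# best <- models[[which.min(aics)]]
```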

Post-processing consisted of retrieving \(\theta\) and \(\phi\) for the optimal model, and calculating the relevance of words within a topic according to the method described by Sievert et al. [30]. Equation (2) defines how relevance r was calculated for word w in topic t given \(\lambda\):

$$\begin{aligned} r\left( t, w | \lambda \right) = \lambda \log \left( \phi _{tw}\right) + \left( 1 - \lambda \right) \log \left( \frac{\phi _{tw}}{p_{w}}\right) \end{aligned}$$
(2)

The relevance is a convex combination of two measures: the topic-specific distribution (\(\phi _{tw}\)) and ‘lift’ (\(\phi _{tw} / p_w\)), which is a ratio between topic-specific and corpus-wide distributions. These measures can be balanced with \(0 \le \lambda \le 1\), by giving more weight to \(\phi\) (\(\lambda = 1\)) or to the lift (\(\lambda = 0\)). In our experiments a value of 0.6 was chosen for \(\lambda\), as suggested in Sievert et al. [30]. \(T \times W\) relevancies were calculated (i.e., each word had one relevance score per topic) and used to sort the most relevant words per topic.
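A direct translation of Eq. (2) in R, operating on the \(\phi\) matrix returned by posterior() and corpus-wide word probabilities estimated from the document-feature matrix, might look like this (a sketch; variable names are ours and dfm_corpus refers to the earlier sketches):

```r
lambda <- 0.6   # weighting suggested by Sievert et al. [30]

# corpus-wide word probabilities p_w, aligned with the columns (words) of phi
p_w <- colSums(as.matrix(dfm_corpus)) / sum(dfm_corpus)
p_w <- p_w[colnames(phi)]

# Eq. (2): relevance of every word in every topic (a T x W matrix)
relevance <- lambda * log(phi) + (1 - lambda) * log(sweep(phi, 2, p_w, "/"))

# top 20 most relevant words for topic t
top_words <- function(t, n = 20) names(sort(relevance[t, ], decreasing = TRUE))[1:n]
```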

Big data definitions

The definition proposed by De Mauro et al. was used as a starting point for this study. Furthermore, the underlying definitions gathered in De Mauro et al. were reassessed and where necessary updated (e.g., updates in white papers published by industry). Lastly, a publication by Andreu-Perez et al. [3] was added because it defined six big data V’s in the context of (bio)medical research.

All the definitions were analysed. If a definition was given in free text, its major themes were extracted. Themes were then grouped by similarity; for example, volume and size were merged into one theme. For various reasons a few definitions were discarded, as discussed in the “Big data definitions” section.

Topic analysis

Topic model results were analysed manually by inspecting the top 20 most relevant words per topic. The observers received a list of topics and a description of each theme. They were instructed to read all the words in each topic, then consult the big data definition themes, and finally provide their opinion about which themes are associated with that set of words. Each of the topics was assigned zero, one, or more themes by each observer individually. In total seven persons performed the analysis independently: each of the authors and three external health data scientists.

Results

This section reports the results of corpus extraction, TM model fitting and selection, gathering and consolidation of big data definitions, and annotation of topics with the themes.

Corpus

A total of 1659 documents were extracted from PubMed and 543 from PubMed Central. After removing duplicates and records with an empty abstract, 1308 documents were included in the corpus, as shown in Fig. 2.

Fig. 2
figure 2

Corpus generation: documents extracted per literature database, documents removed from the corpus, and total number of included documents

After pre-processing, 136,339 words remained in the corpus, of which 7849 were unique. A large portion (7081 words) had a low frequency (<40 occurrences). Figures 3 and 4 give an impression of the corpus’s contents. Figure 3 shows a frequency plot of the top 2000 words, which appears to follow Zipf’s law [31]. The word cloud (Fig. 4) was created from the top 100 most frequent words (marked by the vertical line in the frequency plot).

Fig. 3
figure 3

Frequency of the top 2000 unique words in the corpus. The vertical line is the cut-off point (\(n=100\)) used for the word cloud

Fig. 4
figure 4

Word cloud of the top-100 unique words in the corpus

Topic modelling and model selection

In total 49 models \(M_T\) were fitted with T ranging between 5 and 500. The AIC curve for all fitted models M is shown in Fig. 5. The minimum of the AIC curve lies at \(T=14\); however, the differences are small up to \(T=25\). We also calculated the distances between topics from different models (\(T\in \{14,\ldots,25\}\)), which showed that the topics are fairly stable (data not shown). When increasing the number of topics, the changes observed include one topic splitting into two topics or a new topic appearing. We saw no major reorganisation of topics or words within topics. We also observed that increasing the number of topics in the model makes the terms in each individual topic more specific. For example, one topic covering both application and big data themes might be split into two separate topics in a larger model. We therefore selected \(M_{25}\) for annotation, as this model has better interpretability than \(M_{14}\) (more specific topics), with comparable quality of model fit (similar AIC).

Fig. 5
figure 5

Left: AIC curve of the 49 fitted TM models (20 models between \(T=5\) and \(T=30\) not plotted, see right). The minimum is marked by the dotted line (\(T=14\)). Right: Close-up of the AIC curve between \(T=5\) and \(T=30\), showing 26 fitted TM models. The minimum is marked by the dotted line (\(T=14\))

To assess the robustness of the model \(M_{25}\), the log-likelihood was tracked for each iteration of Gibbs sampling. This model was fitted three times with fixed input, but with different starting seeds for the sampling. The outcome of these fits is presented in Fig. 6. It shows that the log-likelihood reaches its approximate maximum after 100–150 iterations. Models run with a higher number of iterations (up to 4000, data not shown) showed no major difference in log-likelihood convergence, therefore, final models such as \(M_{14}\) and \(M_{25}\) were run for 500 iterations. The top-20 most relevant words per topic of the \(M_{25}\) model are shown in Table 4.
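In the topicmodels package, the per-iteration log-likelihood can be retained via the keep control option; the three-seed robustness check could be sketched as follows (seed values are arbitrary and ours):

```r
# refit the 25-topic model three times with different seeds, storing the
# log-likelihood at every Gibbs iteration (keep = 1)
seeds <- c(1, 2, 3)
fits <- lapply(seeds, function(s)
  LDA(dtm_tm, k = 25, method = "Gibbs",
      control = list(alpha = 50 / 25, delta = 0.01, iter = 500,
                     seed = s, keep = 1)))

# per-iteration log-likelihood traces, e.g., for a convergence plot as in Fig. 6
loglik_traces <- lapply(fits, function(f) f@logLiks)
```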

Fig. 6
figure 6

Convergence of the log-likelihood for the chosen optimal model \(M_{25}\) for three runs starting from different seeds

Big data definitions

In total 17 definitions of big data were considered, from the following sources [3, 5, 6, 14, 15, 32–43]. Table 1 presents the results of our analysis, listing the identified themes, their descriptions, and the respective sources. Note that we have not attempted to consolidate the names of the themes, leaving the complete description as found in the sources. The themes found in these definitions can be divided into three groups; a fourth group was added by us (see below).

Table 1 Description of themes identified in big data definitions from literature

The first group (I) corresponds to the big data V’s, which occur in various forms in many of the analysed definitions. Some words were merged into one theme because they are essentially synonyms of each other. For example, volume, size, voluminous, and cardinality were found in ten of the definitions and, from their descriptions, refer to the amount of data. Velocity and continuity were likewise combined, as were complexity and variety.

The second group (II) corresponds to the aggregated themes proposed by De Mauro et al., which represent concepts of a higher level of abstraction than the previous group.

The third group (III) consists of a theme identified in three definitions, which describe big data as data that is beyond conventional processing and analysis. The V’s describe data in many different aspects, but none of them defines a hard limit beyond which data becomes big. The theme ‘beyond conventional’ therefore describes big data as something that needs novel, specialised, and scalable solutions. This also means that the types of problems and applications that fall within the scope of big data change over time, as technology and methods evolve and improve.

The fourth group (IV) was not found in the studied definitions, but was added to cope with the reality of our data. Because the body of literature used in this study was obtained from (bio)medical literature databases, we expected application-related themes to be strongly represented in the resulting topics. We therefore included the Application theme to classify those topics that do not fall under big data.

Note that some definitions considered by De Mauro et al. were not used here:

  • The definition by Microsoft [40] was a blog post from 2013, and therefore possibly outdated;

  • Shneiderman et al. [41] does not specifically mention big data, as it was published in 2008, when this term was not yet in use;

  • The definition by Manyika et al. [43] was only described in the executive summary;

  • Mayer-Schönberger et al. [42] propose an abstract definition that was considered too difficult to convert into interpretable themes for topic analysis.

Topic analysis

The list of topics and words and the big data themes were analysed by the seven observers. The observers all worked at the local department of epidemiology, biostatistics and bioinformatics, and were therefore well suited for the annotation task. The big data themes (Table 1) and topic words (Table 4) were well understood, and the task could be finished without further help in a reasonable amount of time (30 minutes to an hour).

The raw annotation results are displayed per observer and per topic in Table 2. Note that some observers did not assign any theme to some topics, and that in many cases more than one theme was assigned to the topics. Table 3 presents the frequency of themes assigned per topic, highlighting high or unanimous agreement among the observers (shown underlined and bold). It also shows the overall themes, i.e., those that were assigned to a topic by at least four observers.

For four topics (i.e., 3, 17, 19 and 25), fewer than four observers assigned the same theme. Of the remaining 21 topics, five had unanimous agreement among the observers on some theme (i.e., 6, 7, 8, 20 and 21). The remaining 16 topics can be split into topics with a single overall theme (i.e., 2, 4, 9, 10, 11, 13, 14, 15, 16, 18, 22, 24) and topics with two overall themes (i.e., 1, 5, 12, 23).

Note that the most frequently assigned theme was Application (66 times), followed by the themes in the second group, proposed by De Mauro et al. Of the themes in the first group, volume and velocity occurred more often than the others. Notably, variability was hardly identified among these topics.

Table 2 Raw annotation results per observer
Table 3 Summed annotations per topic and theme, and overall theme per topic (≥4 counts)

Figure 7 presents the distribution of topics over documents, based on the probability of each topic for each document (i.e., \(\theta\)). The large majority of topics (in black) have a strong presence in only a few hundred documents. However, there are four topics (in red and blue) that deviate from this pattern. The two red topics (topics 1 and 2, see Table 4) have a stronger presence in more documents than the topics pictured in black. The blue topics (topics 3 and 5, see Table 4) have a stronger presence in nearly all documents.

Table 4 Top 20 words for the 25-topic model identified with TM
Fig. 7
figure 7

Distribution of topics over documents (i.e., \(\theta\), y-axis). The documents are sorted on topic-to-document relevance within each topic. The x-axis represents the order of the sorted documents. Each line represents one topic, plotted in black; exceptions are topics 1 and 2, plotted in red, and topics 3 and 5, plotted in blue

Discussion

In this paper we attempted to identify themes related to big data definitions in a large corpus of (bio)medical literature through topic modelling. We have followed a structured and objective approach as much as possible. This process delivered novel and interesting results, which, however, need to be interpreted carefully due to the remaining limitations of our study.

Identification of themes in big data definitions

Due to the lack of a consolidated and widely accepted definition of big data, it was necessary to consult a large number of scientific papers. This work is limited to scientific literature, but obviously there are many other definitions of big data that have not been considered in our work, such as the Berkeley blog mentioned in the introduction [8]. Nevertheless, most of the definitions in [8] can be mapped to the themes identified in this study. Interestingly, the word cloud in [8] highlights words such as size, complex, and techniques, which are also found in the descriptions of the themes consolidated in Table 1. Furthermore, there are qualitative approaches to describing the big data field in publications such as Chen et al. [13] and Tsai et al. [44]. Note that, although these works do not strive to deliver a formal definition, the descriptions of the big data field in both publications include the same aspects found in the definition themes.

We have observed a large overlap among the big data definition literature considered in this study, albeit with variations in the focus applied by each author. Furthermore, certain themes occur more often than others in the definitions (Table 1). The original three V’s (volume, velocity, variety) occur in many definitions, whereas the relatively ‘newer’ V’s (veracity, value, variability) are present in only a few. This is also the case with Technology and Methods, which are found in definitions more often than Information and Impact.

Finally, as the corpus was gathered from (bio)medical literature databases, we expected to find topics describing this domain. Therefore the theme ‘Application’ was introduced, which is obviously not found in the published big data definitions. Indeed, the annotation results presented in Table 3 show that 10 out of 25 topics were annotated with Application by the majority of the observers. Note that the large fraction of application-related words might have overshadowed others that are related to big data themes. Scrubbing the corpus of application-related words could circumvent this problem. This opens the possibility of fitting highly granular models that would be more easily interpretable and better reflect big data instead of the research field topics.

Corpus gathering

By design, in this study we only considered papers that were self-annotated with big data, whatever definition the authors might have used. This led to an interesting observation by one observer, who could not find his research domain in any of the topics. However, the searched databases certainly included this domain, and many of the big data themes could potentially be assigned to its papers. The domain could be missing for various reasons, such as a low frequency of this research domain in the corpus. However, this observer acknowledged considering his domain ‘conventional’; papers published in this research domain therefore most likely do not mention big data and were not captured by the search performed in this study.

Note also that we only considered two databases, whereas many others could be included as well (e.g., Scopus or Ovid). Nevertheless, PubMed and PMC are important sources in medical research and therefore have been considered sufficiently representative for the purposes of our study.

Finally, a potential limitation of our study is that only abstracts were included in the corpus instead of full-text papers. Our assumption is that the abstracts contain the essence of a paper and are therefore representative of the actual themes found in a full paper. Moreover, it is currently still difficult to retrieve and parse full papers in an automated fashion, which would have severely limited the number of papers considered in our study.

Automatic identification of topics

In the course of this research various text mining approaches were attempted to identify relevant topics to characterise the publications. First, we attempted to use AlchemyAPI [45], a natural language processing service that is accessible through the web. However, in a pilot experiment with 100 documents we observed that the number of results produced would be too large for effective analysis (i.e., 3774 results, of which 3006 were unique). Moreover, AlchemyAPI’s method is implemented in proprietary code, so relations between documents and results were difficult to interpret.

We continued searching for a text mining method and considered document clustering to find the definition themes in literature. In principle, document clustering could capture themes but results are often limited to one theme per document. Furthermore, analysing document clusters to find definition themes would be a non-trivial (if not impossible) task.

A seemingly more suitable method was topic modelling, a method that can discover latent semantics in text. The main purpose of topic models has been described as “discovering main themes that pervade large unstructured collections of documents” [18]. Furthermore, TM captures multiple meanings of words, but most importantly, it can identify multiple topics for each observed document. The LDA approach is perhaps the most popular and common topic model: the R package topicmodels, which implements the algorithm, had 22,576 downloads in 2015, and the paper describing the underlying model by Blei et al. [17] has been cited over 16,000 times. We therefore chose to use the LDA implementation of TM because of its appropriateness for our data, the relative ease of use of this approach (i.e., ready-to-use implementations in R), and its extensive use in the literature by our peers.

Various TM approaches were tried to find a model with a manageable number of topics that allowed for manual annotation. The largest challenges were encountered during model selection. Two model evaluation methods (i.e., perplexity and the harmonic mean) are often used in the TM literature [16, 19, 46, 47]. The harmonic mean method calculates an approximation of the marginal likelihood of a fitted model, while perplexity measures how well a fitted model can predict unseen data. These criteria were calculated for multiple models with varying parameters, expecting that the model decision boundary would lie at some optimum of the response curve. For both criteria we were looking for a sudden decrease in the marginal difference between two consecutive data points (i.e., models). Unfortunately, in our case, even when fitting models with up to 1,500 topics (data not shown), the curves did not show an optimum.

Finally we opted for TM with model selection through AIC, a method based on likelihood and model complexity. The AIC curve shows an optimum at \(M_{14}\); however, \(M_{25}\) was chosen for further analysis. While experimenting with the parameter T we noticed that quantitative measures of model fit did not relate to the interpretability of the topics, as also noted in [30, 48]. Comparison between models showed that there was no major reorganisation of topics (data not shown), but increasing the number of topics made them more specific and therefore more interpretable.

Manual annotation of topics

Subjectivity of the manual annotation is one of the limitations of this study. Some research has been done on objectifying the analysis of TM results [27, 30, 49, 50]. However, so far, the results of TM cannot be quantitatively evaluated [16, 48]. For the purpose of this study, a group of seven observers was deemed sufficient for the topic analysis. We also present all the data in the paper, so that readers can assess the topics themselves to confirm or dispute our results.

We took great care to objectify the interpretation of the TM results, but seven is a small number of observers. Ideally more persons should be involved in the assessment of theme assignment; for example, crowdsourcing services such as Mechanical Turk could be used [51]. However, this particular annotation task requires sufficient background knowledge in health data science, which significantly reduces the pool of suitable observers.

All the observers in this study were trained in health data science and are therefore familiar with the terms and concepts that appeared in the topics and the big data themes. Nevertheless, no baseline assessment was performed to understand their individual interpretations more precisely, which might have introduced some noise into our results.

In general, the observers reported some difficulty in associating words with a theme. They also noted that their annotation decisions were mostly based on words that stood out in the topic, which means that not all words were considered equally. This possibly led to the discrepancies between annotators displayed in the results (Tables 2, 3). For example, when asked, annotator F noted that he chose Technology for topic 4 because of the specific word ‘cluster’, while all others chose Methods. Note that cluster could be interpreted as a computer cluster (i.e., Technology) or a cluster used in unsupervised machine learning (i.e., Methods). Furthermore, note that Information is often co-annotated or interchanged with Application. For example, neuroimaging, neuroscience, image, and signal are present in topic 23; the first two words can be associated with Application, and the latter with Information. Also, topics containing words referring to data (e.g., images and age) were annotated as Information and/or Application by some observers. For such reasons some observers said that their annotation might change slightly if they were to analyse the topics again.

Big data themes in biomedical literature

Despite the subjectivity of the annotation, we consider the agreement between the observers sufficient to support our findings, which show how big data themes are identified in biomedical literature (see Table 3).

Technology and Methods are found fairly often in topics. Note that the identification of these themes is facilitated because they can be associated with concrete terms such as device, cloud, and platform for Technology, or model, infer, and simulate for Methods. Of the V’s, volume and velocity were the most frequently identified themes; these are also easily associated with terms such as large scale, performance, and computability. Such terms are frequently used in practice, explaining why these themes have been so strongly identified in topics 4, 5, 6, 7, 12, 13 and 20.

Impact, variety, veracity, value, and beyond conventional were annotated less often. Because these are more abstract concepts, it is likely that they are more difficult to discover within topics. For example, Value was assigned to topic 1, which contains words such as secure, challenged, and protect; compared to concrete themes (e.g., technology and volume), it was more difficult for the annotators to find a fitting theme here. Variability was annotated only twice; however, we do believe that it is an integral part of big data. The fact that variability was not recognised could mean that the observers could not identify the theme properly (due to poor theme description or understanding), or that the topics in the selected model could not capture this theme (due to insufficient representation in the corpus).

Each of the themes from the definition by De Mauro et al. (information, methods, technology, impact) was annotated more often than any other (apart from Application). Note that by design these themes are defined more broadly, which means that they subsume the others. For example, Methods includes a few V’s, such as volume and velocity, as well as beyond conventional. Perhaps due to their broadness, the themes from De Mauro et al. were chosen more readily, indicating that their definition better covers the understanding of big data. However, one might wonder whether these themes are exclusively related to big data or whether they would also pop up in other types of papers. The set-up of our study cannot answer this question.

Related work

Other studies have been performed to discern a definition of big data [3, 14, 15]. These have provided an overview of big data research in different research fields [3]; a literature analysis to discover big data themes and a proposal for their consolidation into one definition [14]; and an analysis of industry statements on big data [15]. Each of these studies used qualitative methods, whereas our work builds upon their findings with a quantitative method. In particular, our study provides evidence that supports the definition proposed by De Mauro et al. [14] and an aggregation of its underlying definitions (see Table 1).

Many researchers have applied TM for text analysis in various fields [52]. Most similar to our approach is a study by Hansmann and Niemeyer [16], which applied TM to a big data corpus to discover its characteristics. Their research identified three themes, namely IT infrastructure, methods, and data, and applied TM in two stages. The first stage separated the corpus of 248 manually selected papers into the three themes mentioned above. Then, in the second stage, TM was applied to the papers which had been grouped by theme. An in-depth word-by-word analysis of big data characteristics was performed on the second stage TM results. The meaning of each word was assessed, finding the important concepts for each of the themes and where research focus lies in the corpus. Our work differs from [16] in three ways. First, their analysis was based on only three big data themes, whereas we used multiple definitions which led to twelve themes. Secondly, we collected a larger corpus resulting from a systematic review of the literature. Lastly, the research goals differ: instead of finding the defining concepts for each of the themes, our approach identifies existing definitions in a biomedical big data corpus.

There are also more sophisticated (and complex) text analysis approaches, such as the method described by Hurtado et al. [53]. Whereas we applied a bag-of-words principle, where each word is considered independently, the method by Hurtado et al. processes whole sentences and preserves context information. In [53] text mining was applied to find trends in topics over time and to predict topic popularity in the future. While this is not applicable in our current case, it might be interesting for further research (e.g., finding trends of big data over time within scientific literature). Lastly, their method to generate topics also gives each topic a concise label built from its keywords. This would partially remove subjectivity from annotation; however, interpretation of the results would still depend on human judgement.

Conclusion

In this work we describe a systematic study that attempted to answer the question: ‘Which themes from various existing big data definitions are expressed in (bio)medical scientific publications?’. A large number of existing definitions were analysed and consolidated into twelve themes. A large corpus of representative biomedical scientific publications was collected and automatically analysed with text mining to identify the 25 most relevant topics based on title and abstract. Manual annotation was performed by seven observers to identify big data themes in the topics. In spite of the limitations of our study, the results show that these themes can be identified in this corpus. Volume, Velocity and Value are recognised frequently, but the results show a particularly strong presence of the themes defined by De Mauro et al. (i.e., Information, Methods, Technology, and Impact). This finding indicates that their definition of big data is supported by the understanding expressed by authors when they use the term in their own (bio)medical publications in this corpus. To our knowledge, this is the first time that this has been shown in a systematic manner for literature in an application field.