Introduction

The purpose of data analytics in healthcare is to find new insights in data, at least partially automate tasks such as diagnosing, and to facilitate clinical decision-making [1, 2]. Higher hardware cost-efficiency and the popularization and advancement of data analysis techniques have led to data analytics gaining increasing scholarly and practical footing in the healthcare sector in recent decades [3]. Some data analytics solutions have also been demonstrated to surpass human effort [4]. As healthcare data is often characterized as diverse and plentiful, especially big data analysis techniques, prospects, and challenges have been discussed in scientific literature [5]. Other related concepts such as data mining, machine learning, and artificial intelligence have also been used either as buzzwords to promote data analytics applications or as genuine novel innovations or combinations of previously tested solutions.

The terms big data, big data analytics, and data analytics are often used interchangeably, which makes the search for related scientific works difficult. Especially, big data is often used as a synonym for analytics [6], a view contested in multiple sources [7,8,9]. In addition, the term data analytics is wide and usually at least partly subsumes concepts such as statistical analyses, machine learning, data mining, and artificial intelligence, many of which overlap with each other as well in terms of, e.g., using similar algorithms for different purposes. Finally, it is not uncommon that scientific works that are not focused on technical details discuss concepts such as machine learning at different levels of specificity. For example, some studies consider merely high-level paradigms such as supervised on unsupervised learning, while some discuss different tasks such as classification or clustering, and others focus on specific modeling techniques such as decision trees, kernel methods, or different types of artificial neural networks. These concerns of nomenclature and terminology apply to healthcare as well, and we adapt the broad view of both healthcare and data analytics in this study. In other words, with data analytics we refer to general data analytics encompassing terms such as data mining, machine learning, and big data analytics, and with healthcare we refer to different fields of medicine such as oncology and cardiology, some closely related concepts such as diagnosis and disease profiling, and diseases in the broad sense of the word, including but not limited to symptoms, injuries, and infections.

Naturally, because of growing interest in the intersection of data analytics and healthcare, the scientific field has seen numerous secondary studies on the applications of different data analysis techniques to different healthcare subfields such as disease profiling, epidemiology, oncology, and mental health. As the purpose of systematic reviews and mapping studies is to summarize and synthesize literature for easier conceptualization and a higher level view [10, 11], when the number of secondary studies renders the subjective point of understanding a phenomenon on a high level arduous, a tertiary study is arguably warranted. In fact, we deemed the number of secondary studies high enough to conduct a tertiary study. In this study, we review systematic secondary studies on healthcare data analytics during 2000–2021, with the research goals to map publication fora, publication years, numbers of primary studies utilized, scientific databases utilized, healthcare subfields, data analytics subfields, and the intersection of healthcare and data analytics. The results indicate that the number of secondary studies is rising steadily, that data analytics is widely applied in a myriad of healthcare subfields, and that machine learning techniques are the most widely utilized data analytics subfield in healthcare. The relatively high number of secondary studies appears to be the consequence of over 6800 primary studies utilized by the secondary studies included in our review. Our results present a high-level overview of healthcare data analytics: specific and general data analytics and healthcare subfields and the intersection thereof, publication trends, as well as synthesis on the challenges and opportunities of healthcare data analytics presented by the secondary studies.

The rest of the study is structured as follows. In the next section, we describe the systematic method behind secondary study search and selection. In Section “Results” we present the results of this tertiary study, and in Section “Discussion” discuss the practical implications of the results as well as threats to validity. Section “Conclusion” concludes the study.

Methods

Search Strategy

We searched for eligible secondary studies using five databases: ACM Digital Library (ACM DL), IEEExplore, ScienceDirect, Scopus, and PubMed. In addition, we utilized Google Scholar, but the search returned too many results to be considered in a feasible timeframe. The search strings and publications returned from the respective databases are detailed in Table 1. Because the relevant terms healthcare, big data and data analytics have been used in an ambiguous manner in the literature, we performed two rounds of backward snowballing, i.e., followed the reference lists of included articles to capture works not found by the database searches. The search and selection processes are detailed in Fig. 1.

Table 1 Search strings—Scopus database search returned 16,135 results which were sorted by relevance, and the first 2,000 papers were selected for further inspection

Study Selection

After the secondary studies were searched for closer eligibility inspection, the first author applied the exclusion criteria listed in Table 2. In case the first author was unsure about a study, the second author was consulted. In case a consensus was not reached, the third author was consulted with the final decision on whether to include or exclude the study. Regarding exclusion criterion E5, we only considered secondary studies, i.e., mapping studies and different types of literature reviews. Furthermore, due to different levels of systematic approaches, we deemed a study systematic if (i) the utilized databases were explicitly stated (i.e., stated with more detail than “we used databases such as...” or “we mainly used Scopus”), (ii) search terms were explicitly stated, and (iii) inclusion or exclusion criteria or both were explicitly stated. Regarding exclusion criteria E6, E7 and E8, several studies considered healthcare in related fields such as healthcare from administrative perspectives [12], healthcare data privacy [13, 14], data quality [15], and comparing human performance with data analytics solutions [4]. Such studies were excluded. Similarly, studies returned by the database searches on data analytics related fields such as big data and its challenges [16], Internet-of-Things [17], and studies with a focus on software or hardware architectures behind analytics platforms [18, 19] rather than on the process of analysis were also excluded.

It is worth noting that we followed the respective secondary study authors’ classification of techniques, e.g., whether a technique is considered machine learning or deep learning. In the case a study considered more than one data analytics or healthcare subfield, we categorized the study according to what was to our understanding the primary focus. This is the reason we have refrained from defining terms such as deep learning in this study—the definitions are numerous and by defining the terms, we might give the reader the impression that we have judged whether a secondary study is concerned with, e.g., machine learning or deep learning.

Fig. 1
figure 1

Study selection process showing the process step by step as well as the numbers of secondary studies in each step—A1, A2 and A3 refer to the authors responsible for each step, E refers to an exclusion criterion described in Table 2, and n indicates the number of included papers after a step was completed

Table 2 Exclusion (E) criteria

Results

Publication Fora and Years

We included 45 secondary studies (abbreviated SE in the figures, cf. 7 for full bibliographic details). A total of 34 (76%) of the selected secondary studies were published in academic journals, nine (20%) in conference proceedings, and two (4%) were book chapters. Most of the studies were published in distinct fora (cf. Table 3), and fora with more than one selected secondary study consisted of Journal of Medical Systems, International Journal of Medical Informatics, Journal of Biomedical Informatics, and IEEE Access. As expected, the publication fora were aimed at either computer science, healthcare, or both. Finally, as can be observed in Fig. 2, the trend of systematic secondary studies in the intersection of data analytics and healthcare is growing.

Table 3 Publication fora
Fig. 2
figure 2

Number of included secondary studies by publication year (bars, left y-axis), and the number of included primary studies by publication year (dots, right y-axis)—the year 2021 was only considered from January to April; the figure shows that the number of secondary studies is rising

Secondary Study Qualities

The selected secondary studies utilized a total of 37 different databases. The most frequently used databases were PubMed, Scopus, IEEExplore, and Web of Science, respectively. Other relatively frequently used databases were ACM Digital Library, Google Scholar, and Springer Link. Most of the secondary studies (33, or 73%) utilized four or fewer databases (M = 3.6, Mdn = 3). However, many bibliographic databases subsume others, and the number of utilized databases should not be taken as a metric for a systematic review quality. For example, a PubMed search implicitly searches MEDLINE records, and Google Scholar indexes works from most other scientific databases. The extended coverage of a wider range of academic works naturally results in numerous studies to further inspect, posing a challenge in the amount of work required. The most popular databases used in the secondary studies are visualized in Fig. 3.

Fig. 3
figure 3

Four most popular databases used by the secondary studies were PubMed, IEEEXplore, Scopus and Web of Science—4 studies did not use any of these four databases, and other databases are not considered, e.g., the secondary study SE14, in addition to IEEExplore, might have also utilized other databases not visualized here

The secondary studies reported an average of 155 selected primary studies (Mdn = 63, SD = 379.2), with a minimum of 6 (SE44) and a maximum of 2,421 primary studies (SE31). Five secondary studies selected more than 200 primary studies (cf. Fig 5). In total, the secondary studies utilized 6,838 primary studies. The number of secondary and primary studies categorized by the data analytics approach is summarized in Fig. 4.

Fig. 4
figure 4

Number of secondary studies included in this tertiary study, and the number of primary studies utilized by the secondary studies, categorized by data analytics approach; DA general data analytics, TA text analytics, INF informatics, NA network analytics, DL deep learning, PM process mining, BDA big data analytics, DM data mining, ML machine learning; the figure shows that the general term data analytics was the most popular in the secondary studies

Some secondary studies reported similar details on their respective primary studies, such as visualizations of publication years (22 studies), research approach summaries such as the number of qualitative and quantitative studies (8 studies), research field summaries (4 studies), and details on the geographic distribution of the primary study authors (5 studies). The use of PRISMA (preferred reporting items for systematic reviews and meta-analyses) [41] guidelines was reported in 15 studies.

Fig. 5
figure 5

Number of primary studies (x-axis) selected for final inclusion in the secondary studies (y-axis), e.g., the chart shows that six secondary studies included 0–24 primary studies—one study (SE6) did not disclose the number of primary studies, and one study (SE15) reported two numbers: 24 primary studies for a quantitative analysis, and 28 primary studies for a qualitative analysis, and we reported that study using the latter number

Subject Areas Identified

Some selected studies considered the relationship between healthcare in general and a specific data analysis technique, while other studies considered the relationship between data analytics in general and a specific healthcare subfield. Most of the studies, however, considered the relationship between a specific data analysis technique and a specific healthcare subfield. These considerations are summarized in Fig. 6. Readers interested mainly in general healthcare in the context of a specific analysis topic should refer to the secondary studies on the left-hand side, readers interested in general data analytics in the context of a specific healthcare topic should refer to secondary studies on the right-hand side, readers interested in a specific analysis topic applied to a specific healthcare topic should consider the studies in the middle, and readers interested in the applicability of analytics techniques in general to healthcare in general should consider the studies in the top row. Additional information on the secondary studies is presented in 6.

Fig. 6
figure 6

Selected secondary studies and whether they consider only specific data analytics techniques (left side), only specific healthcare subfields (right side), both (center), or neither (top); the figure may be utilized in finding relevant secondary studies on desired subfields

Discussion

Implications

Considering the number of primary studies utilized, only 12 studies (27%) used more than a hundred primary studies. Figure 5 seems to indicate that the threshold for conducting a literature review or a mapping study in healthcare data analytics is typically between 25 and 100 studies. Furthermore, and on the basis of the evidence currently available, it seems reasonable to argue that at least 25 primary studies (84% of the secondary studies) warrant a systematic review, and the results of systematic reviews can be seen as valuable synthesizing contributions to the field. This observation arguably also supports the relevance of this study, although this study covers a relatively large intersection of the two research areas.

The earliest included secondary study was published in 2009, which might be explained by the relative novelty of data analysis in healthcare, at least with computerized automation rather than merely applying statistical analyses. In addition, although systematic reviews are relatively common in medicine, they have only recently gained popularity and visibility in information technology [10]. As may be observed in Fig. 2, the trend of secondary studies is growing, which consequently indicates that the number of primary studies in the intersection of data analytics and healthcare is gaining research interest. The rising popularity of machine learning algorithms may be explained by the rising popularity of unstructured data, the growing utilization of graphics processing units, and the development of different machine learning tools and software libraries. Indeed, many of the techniques behind modern machine learning implementations have been around since the 1980s, but only the combination of large amounts of data, and developments in methods and computer hardware in recent years have made such implementations more cost-effective. The development of trends illustrated in, e.g., Fig. 2 propounds the view that machine learning algorithms will gain more and more practical applications in healthcare and related fields, such as molecular biology [42]. Finally, some studies have argued [43] as well as demonstrated [44, 45] that the evolution of machine learning is changing the way research hypotheses are formulated. Instead of theory-driven hypothesis formulation, machine learning can be used to facilitate the formulation of data-driven hypotheses, also in the field of medicine.

Secondary study publication fora were numerous and focused either on information technology, healthcare, or both, without obvious anomalies. The secondary studies utilized dozens of different databases in their primary study searches. It seems that the coverage of these databases is not always understood, or it is disregarded, regardless that utilizing non-overlapping databases results in less work in duplicate publication removal. For example, Scopus indexes some of ACM DL, some of Web of Science, and all of IEEExplore, effectively rendering IEEExplore search redundant if Scopus is utilized—a fact we as well understood only after conducting our searches. In addition, Google Scholar appears to index the bibliographic details of effectively all published research, yet the number of search results returned may be overwhelming for a systematic review. In practice, the selection of databases is balanced by the amount of work needed to examine the results on one end of the scales, and coverage on the other. Backward or forward snowballing may be utilized to limit the amount of work and to extend coverage.

Secondary study topics summarized in Fig. 6 give some implications for subject areas of healthcare data analytics that are mature enough to warrant a secondary study. As the figure shows, these areas are aplenty, and the most frequent data analytics techniques applied seem to be machine learning (13 secondary studies) and data mining (7 secondary studies). It is worth noting that the nomenclature we applied in this study reflects that of the secondary study authors. As explained earlier in this study, attempts at defining, e.g., machine learning and data mining in this study would inevitably contradict the definitions given in some of the included secondary studies. For further reading, Cabatuan and Maguerra [46] provide a high-level overview of machine learning and deep learning, and Shukla, Patel and Sen [47] on data mining. For more technical approaches, both Ahmad, Qamar and Rizvi [30] and Harper [48] review data mining techniques and algorithms in healthcare.

Opportunities and Challenges in Healthcare Data Analytics

Many of the selected secondary studies provided syntheses on the current challenges and opportunities in healthcare data analytics. As the secondary studies inspected over 6800 studies of healthcare data analytics, we have summarized recurring insights here.

It was a generally accepted view in the secondary studies that healthcare data analytics is an opportunity that has already been partly realized, yet needs to be more studied and applied in more diverse contexts and in-depth scenarios [49,50,51]. For example, it has been noted that while big data applications are relatively mature in bio-informatics, this is not necessarily the case in other biomedical fields [52]. In general, healthcare data analytics is rather uniformly perceived as an opportunity for more cost-efficient healthcare [52, 53] through many applications such as automating a specialist’s routine tasks so that they may focus on tasks more crucial in a patient’s treatment. The cost-efficiency is likely to be more concretized by novel deep learning techniques such as large language models [54], which are also offered through implementations that perform tasks faster while consuming less resources [55]. In addition to faster diagnoses, data analytics solutions may also offer more objective diagnoses in, e.g., pathology, if the models are trained with data from multiple pathologists.

Challenges regarding healthcare data analytics are more diverse. Perhaps the most discussed challenge was the nature of the data and how it can be treated. Many secondary studies highlighted problems with missing data [56, 57], low-quality data [54], and datasets stored in various formats which are not interoperable with each other [52, 55, 56]. Furthermore, some studies raised the concern of missing techniques to visualize the outputs given by different data analyses [56, 58]. Rather intuitively, many new implementations and the increases in the amount of data require new computational infrastructure for feasible use [54, 58,59,60]. Some studies raised ethical concerns regarding data collection, merging, and sharing, as data privacy is a multifaceted concept [52, 54, 58, 59], especially when the datasets cover multiple countries with different legislations. Many studies also called for multidisciplinary collaboration between medical and computing experts, stating that it is crucial that the analytics implementations are based on the same vocabulary and rules as medical experts use [49, 57, 61,62,63,64], and that the technical experts understand, e.g., how feasible it is to collect training data for a model to find patterns in medical images. Closely related, many of the more complex analytics solutions operate on a black box principle, meaning that it is not obvious how the implementation reaches the conclusion it reaches [56, 59, 65,66,67]. Open solutions, on the other hand, are typically understandable only for technical experts and may be outperformed by the more complex black box solutions. Finally, it has been observed that the already existing analytics solutions implemented in different environments, e.g., different hospitals [56, 59, 64], are not portable into other environments. In addition, it may be that the existing solutions are not fully integrated into actual day-to-day work [57]. Fleuren et al. [68] summarize the issue aptly, urging “to bridge the gap between bytes and bedside.

Threats to Validity

As is typical for studies involving human judgment, it is possible for another group of researchers to select at least a slightly different group of studies. Furthermore, the categorization of studies into specific healthcare and data analytics topics is a likely candidate for the subject of change. We tried to mitigate the effect of human judgment by following the systematic mapping study guidelines, such as utilizing and reporting explicit exclusion criteria and search terms [11], following the PRISMA flow of information guidelines [41], and discussing discrepancies and disagreements among the authors until consensus was reached. Regarding the challenges related to the wide and rather ambiguous subject areas of data analytics and healthcare, we utilized two rounds of backward snowballing to mitigate the threat of missing relevant studies.

Conclusion

In this study, we systematically mapped systematic secondary studies on healthcare data analytics. The results implicate that the number of secondary—and naturally primary—studies are rising, and the scientific publication fora around the topics are numerous. We also discovered that the number of primary studies included in the secondary studies varies greatly, as do the scientific databases used in primary study search. The results also show that while machine learning and data mining seem to be the most popular data analytics subfields in healthcare, specific healthcare topics are more diverse. This meta-analysis provides researchers with a high-level overview of the intersection of data analytics and healthcare, and an accessible starting point towards specific studies. What was not considered in this study is whether or not and how much the selected secondary studies overlap in their primary study selection, which could indicate the level of either deliberate or unaware overlap of similar work.