Introduction

The rapid development of information and communication technologies (ICT) has brought many achievements to human society and greatly influenced people’s lives [1], adding significant benefits to many of its aspects [2]. Huge quantities of information about people, their daily interactions, and even their vital signs are captured and stored via a variety of digital devices, and can potentially be processed and analyzed by academic researchers, corporations, and governments [3]. Fortunately, the cost of information processing is low today [4], and organizations use information systems to optimize processes in order to increase coordination and interoperability across the organization [5], which helps them increase the integration and standardization of processes [6].

In the same vein, cutting-edge technologies such as Big Data have the potential to leverage the adoption of circular economy concepts by organizations and society, and are becoming more present in our daily lives [7]. Today, the scientific, research, and commercial literature on Big Data corroborates the penetration of its capabilities into all areas [8,9,10,11].

On another front, healthcare plays an important role in our societies. Improving the efficiency, accuracy, and quality of healthcare is a main goal set forth by both governments and researchers [12]. The healthcare industry has historically generated large amounts of data, driven by record keeping, compliance and regulatory requirements, and patient care [13].

The importance of healthcare to individuals and governments, and its growing costs to the economy, have contributed to the emergence of healthcare as an important area of research for scholars in business and other disciplines [14]. By now, the electronic collection, organization, annotation, storage, and distribution of heterogeneous data are essential activities in contemporary biomedical, clinical, and translational discovery processes [15]. Therefore, Big Data in healthcare has become an emerging and remarkable research field: as of mid-2018, a Google Scholar search for “Big Data in Healthcare” returned about 17,000 results for 2018 alone, and Big Data in healthcare has drawn substantial attention in recent years [12]. Big healthcare data has considerable potential to improve patient outcomes, predict outbreaks of epidemics, gain valuable insights, avoid preventable diseases, reduce the cost of healthcare delivery, and improve the quality of life in general [16]. Reflecting this importance, there are numerous current areas of research within the field of Health Informatics, including Bioinformatics, Image Informatics (e.g., Neuroinformatics), Clinical Informatics, Public Health Informatics, and Translational BioInformatics (TBI) [17]. Scientific publications in (bio)medicine show a massive increase in the number of papers published yearly that mention Big Data [18]. However, the identification of the major hot topics and related research methodologies in Big Data in healthcare still lacks a comprehensive quantitative analysis. The tools provided by scientometric approaches are well suited to addressing questions of interdisciplinary integration in research fields [19]. They can help us identify cross-sectional patterns within scientific communities and can explicate how those patterns evolve over the life course of fields [19, 20]. Scientometric studies help us learn which research areas researchers are attentive to, how they prefer to present their results, which journals and publication venues they favor, and how important a research topic is in a specific time period; based on this information, research policies can be made with a lower probability of mistakes.

Scientometrics aims at advancing knowledge on the development of science and technology [21]. Following Van Raan’s view of the relationship between knowledge discovery and scientometrics, knowledge discovery can be regarded as one of its functions [21].

Some scholars and practitioners use the notion of ‘V’s to define Big Data [22,23,24,25]: 3V, 5V, and even 7V. The 3V definition comprises volume, velocity, and variety [22,23,24,25]; value and veracity were then added to form 5V [26]; and more recently 7V added variability and visualization [27, 28]. In the field of Big Data in healthcare, the few researchers who have addressed this issue have mostly considered the 5V definition. Jatrniko et al. mentioned the 5Vs as the defining factors of Big Data [29]. Another study introduced the 5Vs as patient data attributes [30]. Van and Alagar hold that the 5Vs characterize Big Data and motivate their relevance to healthcare data [31]. In a study entitled “Big Data stream computing in healthcare real-time analytics”, the challenges of big data analysis in healthcare are understood through the 5V characteristics [32]. “Big Data, Big Knowledge: Big Data for Personalized Healthcare” describes the 5Vs in healthcare in full. Volume: the community wishes to exploit the vast entirety of clinical data records, but the datasets that support these analyses are often very expensive to acquire, and penetration is currently limited [33]. Variety: the diversity of data sets with respect to structured, semi-structured, and unstructured data [30]; in the healthcare field, variety can be defined as clinical data, data from medical imaging, data from wearable sensors, lab exams, and simulation results [33]. Velocity: expressed in terms of the data arrival rate from patients. Veracity: while data collected as part of clinical studies are generally of good quality, clinical practice tends to generate low-quality data; this is due in part to the extreme pressure medical professionals face, but also to a lack of a “data value” culture, since most medical professionals see the logging of data as a bureaucratic need and a waste of time that distracts them from the care of their patients. Value: the “economic value” that results from saving and analyzing Big Data [31]. For example, healthcare expenditure in most developed countries is astronomical: the 2013/2014 budget for NHS England was £95.6 billion, an increase of 2.6% over the previous year, at a time when all public services in the UK were facing hard cuts; in OECD countries, an average of USD $3395 per inhabitant is spent on healthcare each year [34].

This paper analyzes the most considerable research output on “Big Data in Healthcare” (papers published in seven important databases) to achieve a deep and comprehensive trend study and, based on it, to discover knowledge from the publications. Using Naïve Bayes, the results identify a classification of the methodologies used in the papers published in the journals.

Methods and materials

The statistical population of the research is 82,313 papers, returned as the search results for “Big Data in Healthcare” in the intended databases. The source of data is articles (conference papers, articles, reviews, articles in press, and surveys) published in the selected databases. Since studies of Big Data are still in their infancy and the use of Big Data in healthcare is not more than 10 years old, we chose the period 2008–2018. The databases were chosen on the basis of an evaluation of 20 well-known databases (IEEE, Elsevier, Wiley Online Library, Springer, Nature, Taylor and Francis Online, ACM Digital Library, ASP Publication, JSTOR, AIP, Emerald Insight, ASME, Sage Journals, Oxford Journals, World Scientific, AMS, Annual Reviews, Cambridge University Press, and Royal Society): the largest number of papers in the field of interest had been published in the selected databases. After a first review, it turned out that the number of articles published in some of these sources was very small, so those databases were dropped, and Elsevier, IEEE, Springer, Nature, Science, Oxford Journals, Cambridge University Press, and Wiley Online Library remained. Figure 1 shows the frequency chart of the databases’ publications versus the number of papers selected for study from each database after refining.

Fig. 1. Frequency chart of databases’ publications (left) vs. the number of papers selected for study from each database (right)

Two different databases were prepared for this research. The first contains 265 records (the refined papers). The other contains eight datasets of papers: seven for training the data mining model and one for testing it. The researchers applied VOSviewer 1.6.9 to draw the maps and RapidMiner Studio 8.2 for data mining.

Quality control

Quality control was carried out in several stages. First, in weekly meetings, doubtful papers were discussed and reviews and feedback were given. Second, ten percent of all papers were reviewed by the research coordinator, who provided feedback to colleagues; evaluation of this re-examination process indicated a significant reduction in errors and disagreements. Third, the completed data-collection forms were checked and prepared for data entry; at the end of the data entry stage, the accuracy of 20% of the entered data was reviewed.

Although not all of the returned results were aligned with the researchers’ purpose, all of them were investigated and the appropriate ones were selected for analysis. Therefore, the total number of investigated papers is 82,313 and the number of selected papers is 265. It should be noted that the Science database had no related papers to use in the current research.

Results and discussion

In the first phase of this study, the overall status of research on Big Data in healthcare and related sciences was examined. In total, 265 papers were evaluated over an 11-year period (2008–2018) in seven databases. Most first authors held a Ph.D. or higher degree and were affiliated with universities. Furthermore, each article had an average of 4.1 authors.

The publication trend is shown in Fig. 2. The number of articles published up to 2013 was limited and did not fluctuate significantly, but since 2013 there has been a noticeable upward trend in the number of published papers.

Fig. 2. Number of published papers per year

Figure 3 shows the number of authors per paper. The largest number of authors belongs to one paper, entitled “Making sense of big data in health research: Towards an EU action plan”, written by 57 people in 2016 and published in the journal Genome Medicine in the Springer database. Its authors’ information is available at the end of that paper; they are scientists working at universities and research centers in different European countries. After this paper, the highest numbers of authors are, respectively, 28 (one paper), 21 (one paper), 15 (one paper), 14 (one paper), 11 (three papers), and 10 (four papers).

Fig. 3. Number of authors per paper

The “subject area” of each study was categorized into 17 areas, adopted with minor adaptation from the criteria used in Hermon and Williams’ study [35]. “Healthcare data analysis”, with 74 papers, is the most noteworthy subject area, followed by “E-health” with 50 papers. The other subject areas and the number of their publications are shown in Fig. 4.

Fig. 4. Papers’ subject categories

Data mining techniques are used in most of the studied papers to analyze their data (179 papers). Figure 5 shows the frequency of the data mining techniques used (in percentages). The decision tree is the most used technique, while the Apriori algorithm is the least applied.

Fig. 5. Data mining techniques used in the papers

The journals and the number of papers they published are shown in Fig. 6. The Journal of Medical Systems has published the largest number of papers in the field (seven). The other journals have each published at least two papers.

Fig. 6. The number of published papers in each journal

Based on categories used by other researchers [35] and on recurring terms observed in the papers, 17 categories (as previously mentioned) were considered. The frequency distribution of the published papers on Big Data in healthcare in the period 2008–2018, in terms of these categories, is shown in Fig. 7.

Fig. 7. The frequency distribution of the published papers on Big Data in healthcare in terms of paper categories

Figure 8 shows the conceptual map of the words used as paper keywords (made with VOSviewer) alongside a bar chart of the frequencies of the most repeated keywords (made with RapidMiner).

Fig. 8. Conceptual map of the keywords vs. their frequency chart
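For readers who wish to reproduce a frequency chart like the one in Fig. 8, the counting step can be sketched in a few lines of Python. This is only an illustration under assumptions: the file name keywords.csv and its one-keyword-per-cell layout are hypothetical, and this is not the VOSviewer/RapidMiner workflow actually used in the study.

```python
# Minimal sketch: counting keyword frequencies from an exported keyword list.
# Assumes a hypothetical CSV file "keywords.csv" with one keyword per cell.
import csv
from collections import Counter

def keyword_frequencies(path: str) -> Counter:
    """Count normalized (lower-cased, stripped) keywords in a CSV export."""
    counts: Counter = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            for kw in row:
                kw = kw.strip().lower()
                if kw:
                    counts[kw] += 1
    return counts

if __name__ == "__main__":
    freq = keyword_frequencies("keywords.csv")
    for kw, n in freq.most_common(10):  # the ten most repeated keywords
        print(f"{kw}: {n}")
```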

The conceptual map of the words used in the titles, abstracts, and conclusions of the papers is shown in Fig. 9. As can be seen, these words are located in eight clusters, which are built around the most frequent words.

Fig. 9. The conceptual map of the words used in the titles and abstracts. Each colored point represents the keywords located there; the more words in the neighborhood of a point and the higher their frequency, the closer its color is to red

Based on the content and the methodology applied, the Big Data methodologies used in the papers were classified into nine categories: data quality grading and assurance; identifying “unusual” data segments; machine learning and transactional data; developing methods to evaluate care; meta-analysis and evidence; agent-based modeling; early warning systems; text data mining; and tracking interactions among users. Figure 10 shows the frequency of the Big Data methodologies used in the papers (in percentages). The most used methodology is “meta-analysis and evidence”, used in 99 papers, and the least used is “tracking interactions among users”, used in 37 papers (Fig. 10).

Fig. 10. Frequency of Big Data methodologies used

Based on the descriptive statistics, “meta-analysis and evidence” is the methodology used in the most papers; however, since this research is aimed at knowledge discovery, data mining techniques were applied to predict the methodologies used in the papers published in the various databases.

The second part of the analysis was carried out with RapidMiner Studio 8.2. In the first step, text mining was performed on the data. RapidMiner is the best tool for handling continuous data types [36]. The reason for using RapidMiner rather than other data mining tools such as WEKA, Orange, and R is that it provides fully automatic parameter optimization of machine learning operators and offers good validation and cross-validation; some studies have ranked it first among the best open source data mining tools [36, 37]. Tables 1 and 2 show the results of those studies.

Table 1 Technical overview of the six best open source data mining tools [37]
Table 2 Tool with the best accuracy on the tested datasets [36]

Since naive Bayes is a high-bias, low-variance classifier that can build a good model even with a small data set [38], it was used to achieve this aim. In classification, the goal of a learning algorithm is to construct a classifier given a set of training examples with class labels [39].

Naïve Bayes computes the probability of every possible label value and selects the label value with the maximum calculated probability.

The naive Bayes classifier is among the simplest of such models, in that it assumes that all attributes of the examples are independent of each other given the class. This is the so-called “naive Bayes assumption”. While this assumption is clearly false in most real-world tasks, naive Bayes often performs classification very well. This paradox is explained by the fact that classification is only a function of the sign (in the binary case) of the function estimation; the function approximation can still be poor while classification accuracy remains high [40, 41].
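In standard notation (our addition, not taken from the cited sources), for attribute values x_1, …, x_n and a class label y, the decision rule described above, combined with this independence assumption, reads:

```latex
\hat{y} \;=\; \arg\max_{y}\, P(y \mid x_1, \dots, x_n)
        \;=\; \arg\max_{y}\, P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Training therefore reduces to estimating the class priors P(y) and the per-attribute likelihoods P(x_i | y) from the labeled examples.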

Document classification is just such a domain with a large number of attributes: the attributes of the examples to be classified are words, and the number of distinct words can be quite large. While some simple document classification tasks can be performed accurately with vocabulary sizes of less than one hundred, many complex tasks on real-world data from the Web, UseNet, and newswire articles do best with vocabulary sizes in the thousands. Naive Bayes has been successfully applied to document classification in many research efforts [42,43,44,45].

The process in this research was a little different because the datasets were text files of papers; therefore, we used RapidMiner’s document operators, such as “Process Documents from Files” and “Split Validation”.

As mentioned, the datasets used for knowledge discovery are published papers on Big Data in healthcare, written largely in natural language. Therefore, they were first processed with RapidMiner’s text mining operators, and then the classifiers were trained and tested. The papers had to be classified into nine types of class labels (the previously mentioned methodologies; Fig. 10). These labels are used to train the classifier operator, on the basis of which the classifier predicts the labels of the test dataset. This supervised process was repeated for each of the seven databases. Figure 11 shows the Naïve Bayes predictions for IEEE, and Fig. 12 shows the accuracy of the Naïve Bayes classifier for the same database.
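The following is a minimal sketch of a comparable supervised pipeline in Python with scikit-learn, given as an illustration only: it is not the authors’ RapidMiner workflow, and the folder layout, file names, and 70/30 split ratio are assumptions.

```python
# Sketch of a naive Bayes text-classification pipeline (scikit-learn stands in
# for RapidMiner). Hypothetical layout: one folder per methodology label,
# one .txt file per paper, e.g. papers/ieee/<label>/<paper>.txt
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

corpus_dir = Path("papers/ieee")
texts, labels = [], []
for label_dir in sorted(corpus_dir.iterdir()):
    if label_dir.is_dir():
        for txt in sorted(label_dir.glob("*.txt")):
            texts.append(txt.read_text(encoding="utf-8"))
            labels.append(label_dir.name)  # e.g. "meta-analysis and evidence"

# Bag-of-words features, playing the role of "Process Documents from Files".
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Holdout split, playing the role of RapidMiner's "Split Validation" operator.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels)

# Train the classifier on labeled papers, then predict labels for the test set.
clf = MultinomialNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # per-class precision and recall
```

A multinomial naive Bayes model over word counts is the usual choice for document classification; repeating this script per database folder mirrors the per-database runs reported in Figs. 11–18.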

Fig. 11. Naïve Bayes predictions for the IEEE database

Accuracy is the most important criterion for determining the efficiency of a model; it measures the proportion of correct predictions over all categories. On this basis, Naïve Bayes for Wiley Online Library has the best result. On the other hand, recall measures the completeness, or sensitivity, of a classifier: higher recall means fewer false negatives, while lower recall means more false negatives [46].
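In standard terms (a conventional formulation, not one given explicitly in the paper), writing TP, TN, FP, and FN for true positives, true negatives, false positives, and false negatives, the two measures are:

```latex
\text{Accuracy} \;=\; \frac{TP + TN}{TP + TN + FP + FN},
\qquad
\text{Recall} \;=\; \frac{TP}{TP + FN}
```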

As can be seen in Fig. 12, it is predicted that 71.43% of the papers published in IEEE will use the “machine learning and transactional data” methodology, with an accuracy rate of 41.49%.

Fig. 12. Accuracy of the Naïve Bayes classifier for IEEE

This test was carried out for all of the databases. Based on the results, for Cambridge University Press, with an accuracy rate of 46.67%, it is predicted that 100% of published papers will use “agent-based modeling” as their methodology (Fig. 13).

Fig. 13. Accuracy of the Naïve Bayes classifier for Cambridge University Press

The result of the Naïve Bayes test on Nature papers, with 36.25% accuracy, shows that the probability of using the “developing methods to evaluate care” methodology in this database is 67.86% (Fig. 14).

Fig. 14. Accuracy of the Naïve Bayes classifier for Nature

The result for Elsevier shows that the “developing methods to evaluate care” methodology is predicted for 75% of papers, with an accuracy rate of 47.95% (Fig. 15).

Fig. 15. Accuracy of the Naïve Bayes classifier for Elsevier

“Meta-analysis and evidence” is predicted to be the dominant methodology in Springer, with a probability of 92.86% and an accuracy of 62% (Fig. 16).

Fig. 16. Accuracy of the Naïve Bayes classifier for Springer

It is predicted that 93.02% of the papers in Wiley Online Library will use “developing methods to evaluate care”, with an accuracy rate of 64.06% (Fig. 17).

Fig. 17. Accuracy of the Naïve Bayes classifier for Wiley Online Library

The test results show that the probability of using the “developing methods to evaluate care” methodology in papers published in Oxford Journals is 64.29%, with an accuracy rate of 38.57% (Fig. 18).

Fig. 18. Accuracy of the Naïve Bayes classifier for Oxford Journals

Therefore, on average, “developing methods to evaluate care” is predicted to be the most favored methodology in the published papers, while “agent-based modeling” in Wiley Online Library has the fewest false results.

Conclusion and further direction

In this paper we performed a scientometric study on research papers published over the last 11-year period to characterize “Big Data in Healthcare” research, and used the Naïve Bayes data mining technique to discover knowledge from them.

The results show that the largest number of papers belongs to the Springer database, and that 2016 had the highest frequency of publication. “Big Data” is among the highest-frequency words and keywords, and the results verified this expectation. Applying VOSviewer to the keywords, titles, abstracts, and conclusions of the papers yields eight clusters of words: public health, health informatics, healthcare big data research, data science, association, e-health, encryption, and things. The Journal of Medical Systems published the most papers in the field. The decision tree was the most used technique among the papers that applied data mining. The largest number of authors on a single paper is 57. Healthcare data analysis has the first rank among the subject areas. Males holding Ph.D. degrees with university affiliations made up the dominant share of authors. “Meta-analysis and evidence” was the most used Big Data methodology. In addition to descriptive statistics, in order to perform the scientometric study, a prediction technique (classification) was applied to the Big Data methodologies used in the papers of the various databases, and knowledge was discovered from them. According to the results, the Nature database had the maximum accuracy, and “agent-based modeling” had the maximum recall in Wiley’s database. This shows that the Big Data methodology applied in Nature papers can be better predicted, and that the papers in this database are more consonant in their Big Data methodology. Moreover, among the papers of the Wiley database, there were no papers with the “agent-based modeling” methodology whose methodology was predicted falsely.

Future researchers can refine the strategy further to achieve more precision, address related issues such as regional health systems, or extend the work to more databases and content types (such as books). Additionally, they can increase the size of the testing dataset and apply the classification with more labels. Future studies can also examine the predictions made in this study.