1 Introduction

Big data analytics refers to various techniques for analyzing data, extracting information, and gaining insights from large-scale datasets with complex patterns that conventional data-processing approaches cannot easily handle. Datasets with a huge number of samples offer higher statistical power, whereas overly complicated data with high-dimensional feature spaces can lead to a higher false-discovery rate (Breur 2016). Data capture, data storage, search, visualization, transfer, sharing, query, update, data analysis, information privacy, and data sourcing are the most important challenges in big data (Anagnostopoulos et al. 2016). The recent application of the term big data tends to refer to user behavior analytics, predictive analytics, or other advanced analytic methods that extract value from data, and rarely to a dataset of a particular size. In essence, the quantities of available data are now huge; however, sheer size is not the most relevant feature of this data ecosystem (Rodríguez-Mazahua et al. 2016). Similarly, medical experts, researchers, business people, advertisers, and governments regularly face challenges with large datasets in urban informatics, business informatics (Bhuimali et al. 2018), fintech (Thakuriah et al. 2017), and web search. Scientists face similar constraints in e-science activities, including meteorology (Fathi et al. 2021), biology (Gharajeh 2018), genomics (Wong 2016), complicated physics simulations, and environmental studies. The big data ecosystem is commonly characterized by five Vs: Value, Veracity, Velocity, Variety, and Volume.

Value is the worth of the data being extracted. Data has no use or importance in itself; it must be transformed into something valuable so that information can be extracted. Veracity defines data quality and trustworthiness; it strongly affects the quality of captured data and the exactness of analysis. Velocity specifies the pace at which data is generated and processed to satisfy the needs and challenges of growth and development. Variety elucidates the type and nature of the data and helps analysts use the resulting insights effectively. Lastly, Volume outlines the amount of generated and stored data. The data size determines the value and potential insight, and whether the data can be regarded as big data at all (Fathi et al. 2021).

With the emergence of big data systems, predictive analytics has gained prominence. Enterprises hold larger and richer data pools on big data platforms, which has increased the data mining opportunities for obtaining predictive insights (Mohamed et al. 2020). The commercialization of machine learning tools has also expedited this trend, resulting in emerging demand for predictive analytic services (Casado and Younas 2015). Predictive analytics uses a large set of techniques to help organizations forecast outcomes, techniques that continue to develop with the expanding adoption of big data analytics. Big Data Predictive Analytics (BDPA) denotes frameworks and systems that gather, analyze, and interpret high-variety, high-volume, high-velocity, high-veracity, and high-value data to reveal patterns, trends, and relationships within the data, identify challenges and opportunities, foresee future events, and guide decision-making in application contexts.

Predictive analytics draws on statistical methods from predictive modeling, machine learning, and data mining, which examine current and historical facts to forecast as-yet-unknown future events (Nyce 2007). In the business domain, predictive models apply the patterns found in transactional and historical data to identify opportunities and risks. Models capture relations among many factors to permit assessment of the risks or potentials inherent in a particular set of conditions, guiding decision-making for candidate transactions (Coker and Pulse 2014). In functional terms, predictive analytics provides an estimated score representing a probability for each individual (a customer, an employee, a healthcare patient, a vehicle) to inform, determine, or influence organizational processes that involve large numbers of people, as in healthcare, manufacturing, fraud detection, marketing, and the like. Predictive analytics is used in healthcare (Etemadi, et al. 2023), smart cities (Karimi et al. 2021), marketing (Miles et al. 2017), retail (Huang et al. 2019), actuarial science (Homer et al. 2017), social networking (Failed 2017a), financial services (Ouahilal et al. 2016), insurance (Longhi and Nanni 2019), telecommunication (Failed 2018a), mobility (Moreira-Matias et al. 2016), travel (Amirian et al. 2016), child protection (Russell 2015), pharmaceuticals (Sohrabi 2019), capacity planning (Delfmann et al. 2019), and other fields. This Systematic Literature Review (SLR) has the central aim of identifying, taxonomically classifying, and comparing big data analytic approaches, and of systematically contrasting current articles that concentrate on the planning, execution, and validation of big data analytics. In line with these aims, we have made a determined effort to answer the following research questions: What are the fields of prediction analysis applications in big data? What are the evaluation metrics of predictive analytics using big data? What evaluation methods are used in BDPA? What are the tools and environments in BDPA? And what are the challenges and future issues of BDPA?

We followed the recommendations in Kitchenham (2004) and Brereton et al. (2007). Our main goal is to provide a systematic identification and classification of the current validated achievements on BDPA. To date, according to our investigations, only a minority of SLRs have examined BDPA thoroughly, and none of them has presented a complete and precise review of the field. Furthermore, since BDPA is a critical and sensitive field, a comprehensive study is necessary. For this reason, we have studied 109 articles to provide an exhaustive systematic review of predictive analytic methods that exploit big data for better performance. Although there are some reviews of various big data approaches, the vast majority do not focus on predictive analytic challenges, open issues, benefits, and drawbacks. To overcome this shortcoming, a systematic review of the literature and an overview of predictive analytics using big data are presented. This article helps researchers gain an overall understanding of the various approaches that utilize big data for predictive analytics. The key contributions of this study are summarized as follows:

  • Representing a systematic review of the predictive analytic approaches by utilizing big data

  • Providing a technical and comprehensive taxonomy that categorizes various applications in BDPA as depicted in Fig. 1.

  • Making a detailed comparison of the applied evaluation metrics and methods, tools, and the pros and cons of each study

  • Determining open challenges and future trends of predictive analytics by applying big data

Fig. 1
figure 1

Taxonomy of prediction analysis applications in big data

The remaining parts of this article are structured as follows. Section 2 considers relevant review works. Section 3 presents the research methodology, including the research questions and the article selection process. Section 4 provides a classification of big data predictive analytics. Section 5 presents the comparisons and results of the reviewed articles. Section 6 discusses open issues and future studies, and Section 7 concludes the article.

2 Related work

Different kinds of review articles have been prepared in the field of BDPA. In this section, some of these articles are reviewed and compared with our work.

Philip Chen and Zhang (Philip Chen and Zhang 2014) surveyed big data applications, opportunities, and upcoming state-of-the-art methods and technologies being adopted to tackle the challenges of big data. They presented techniques for dealing with the pitfalls of big data, such as cloud computing, quantum computing, and biological computing. However, the article's selection procedure was not provided, and the review was not systematic.

Kumaresan and Rajakumar (Kumaresan and Rajakumar 2015) provided details on predictive analytics and discussed its scope. The authors introduced several tools and techniques in predictive analytics. The given list of reviews was examined and deliberated upon to ascertain how researchers use predictive analytics in industrial and medical contexts, referencing various techniques and approaches in the process. This research discussed various issues and challenges of predictive analytics, available tools, applications, and modeling techniques in big data. However, no guideline for future research was suggested, recently published articles were not considered, no taxonomy was prepared, the article selection process was not transparent, and it was not an SLR.

Banumathi and Aloysius (Banumathi and Aloysius 2017) reviewed different predictive analytic applications and approaches. Analytic methods were considered from dissimilar perspectives based on applications and data variety. Some of the applications discussed are big data in healthcare, hotel governance, consumer orientation, higher education, and e-governance. The authors presented predictive approaches adapted for different applications, along with challenges and suggestions. The article specifically identified the main applications that depend heavily on BDPA solutions and have already established themselves as big data entities. Nevertheless, this study did not deliver an SLR, and the article selection process was not clear.

Poornima and Pushpalatha (Poornima and Pushpalatha 2018) introduced the idea of applying predictive analytics and data mining methods to various medical datasets to foresee different illnesses, including the advantages, disadvantages, and accuracy levels related to future big data approaches. The review list was discussed and presented to show how different authors applied predictive analytics in medicine and business and how these applications were regarded. The algorithms and techniques applied to big data were also referenced. However, this research did not represent an SLR, no taxonomy was prepared, the article selection process is not clear, and possible future studies were not presented. The survey by Ghani, et al. (Ghani et al. 2019) looked at social media big data analytics from different angles. The authors arranged the survey based on different features and discussed the applications of social media big data analytics by taking methods and quality attributes from various studies. Open research challenges and future works in big data analytics were introduced, but the authors did not note the years covered by the reviewed articles, and it was not an SLR.

Kaffash, et al. (Kaffash et al. 2021) reviewed big data algorithms and applications in intelligent transportation; however, no taxonomy was organized in this study. Mallika and Selvamuthukumaran (Mallika and Selvamuthukumaran 2022) provided a review of prospective precision medicine utilizing big data. They illustrated the most frequently used tools and computational platforms on which big-data-based precision medicine is founded; however, their review had neither a taxonomy nor an article collection process. Nobanee, et al. (Nobanee et al. 2022) reviewed the applications of big data only in the area of credit risk assessment. The authors described the notion of credit risk and its types and then connected big data with credit risk management. However, one weakness of their work was its limited scope and number of reviewed articles.

Ikegwu, et al. (Ikegwu et al. 2022) reviewed big data analytics in data-driven industries, covering tools, data sources, challenges, solutions, and research directions. The authors discussed different classification methods, data characteristics, and real-life applications across sectors. The study reviewed related studies on big data analytics published only between 2013 and 2021, and it is not a systematic literature review. Himeur, et al. (Himeur, et al. 2023) provided an overview of the shortcomings of building automation and management systems (BAMSs) in terms of performance evaluation, energy consumption analysis, and security. The authors reviewed various AI-based tasks, presented existing frameworks, and discussed the performance challenges of BAMSs in intelligent buildings.

Predictive models for screening and managing the long- and short-term complications of diabetes were reviewed by Cichosz, et al. (Cichosz et al. 2016), who also presented a systematic mapping study (SMS). These models have been created to manage diabetes and its related problems, and the number of studies on them has risen tremendously in recent years. Linear or multiple logistic regression was typically applied to develop the prediction models, probably because of their straightforward functionality. Finally, to prove their usefulness, prediction models have to demonstrate their impact; in other words, their application should yield better outcomes in patients. Despite all efforts made to build these predictive models, a considerable scarcity of impact studies was observed. However, this study was not a systematic review, the method of selecting articles was unclear, and newly published articles were excluded.

To reach a high-level comprehension of big data in manufacturing, O’Donovan, et al. (O’Donovan et al. 2015) provided an SMS. Their contributions were reports on the current state of work concerning big data approaches in manufacturing, such as the research methods considered, the manufacturing sectors where big data exploration was concentrated, and the results of big data research projects. The authors classified their study based on different research questions and answers. Nonetheless, their study did not provide any information regarding the years covered by the studied articles.

Different predictive models, applied to monitor and improve the performance of students in educational settings such as schools or universities, were classified by Muthukrishnan, et al. (Muthukrishnan et al. 2017). Within the educational data mining methodology, all areas were analyzed, two databases were selected, and systematic mapping research was conducted. The main aim of this systematic mapping study was to examine the current predictive analytic models within the educational environment of schools and other institutions. Motivated by the need to understand the functional applications of these approaches in healthcare, Mehta, et al. (Mehta et al. 2019) provided an SMS considering artificial intelligence together with big data. To examine the improvements in this field, the authors employed bubble plots to map the distribution of publications. They categorized the reviewed articles into different sub-classes, which led to the creation of a taxonomy; however, they did not review recently published articles.

Rahman and Reza (Rahman and Reza 2020) reviewed the non-functional requirements (NFRs) in big data and then implemented a model to map them. The authors showed that metrics such as performance, scalability, and reliability are the most important factors in data-intensive systems. Biesialska, et al. (Biesialska et al. 2021) conducted an SMS to review agile software development under the impact of big data. Their article collection method combined snowballing and manual search through the databases. In addition, the authors reviewed the articles' applications, company names, countries, and per-industry usage. Montero, et al. (Montero et al. 2021) took a systematic mapping approach to review big data quality models. The authors collected articles by providing an overview of their selection process, but they did not provide directions for future work on big data quality models.

An information systems literature review on BDPA was introduced by Ohiomah, et al. (Failed 2017b) to find BDPA areas that had been investigated before but still needed greater focus. They suggested specific research questions to be studied further and found that the arrival of big data shifted the role of predictive analytics from activities such as theory generation and validation toward the more data-driven discovery of complicated patterns and relations among variables and the evaluation of the probability that such relationships occur in a dataset. In this research, recently published articles were ignored. Mikalef, et al. (Mikalef et al. 2018) arranged an SLR on big data analytics to explain how it should be leveraged to contribute to competitive performance. They reviewed research frameworks based on IT–business value, alongside segments from strategic management. The authors focused on tools, technical methods, network analytics, and the infrastructure of big data analysis. Despite this, their review did not cover recently published articles.

Kolajo, et al. (Kolajo et al. 2019) provided a systematic review of big data stream analysis in order to identify its tools and approaches. However, the authors did not provide a taxonomy for their study, and recently published articles were not included. Al-Sai, et al. (Al-Sai et al. 2020) provided an SLR that divided the schema and framework into five major groups of big data critical success factors, namely individuals, management, approaches, authorities, and companies. By answering three research questions during their survey, the authors tried to provide solutions to the key issues of big data analytics. Nonetheless, they did not cover recently published articles that would have made the SLR more up to date.

Rathore, et al. (Rathore et al. 2021) discussed the factors influencing digital twinning. They identified research challenges and deficiencies that need to be addressed in the future to excel in digital twinning. The authors also divided their article into sections covering the scopes of manufacturing, medicine, transportation, education, business, and other industries in digital twinning. Naghib, et al. (Naghib et al. 2022) provided an SLR on methods for managing big data in the Internet of Things (IoT). They delineated four distinct categories: processes related to big data management (BDM), the BDM framework, quality attributes, and big data analytics. Georgiadis and Poels (Georgiadis and Poels 2022) concluded that although there have been numerous studies on big data security assessment, there is still a need for more pertinent methodological rules to lower data protection risks in systems that host big data analytic algorithms.

Acciarini, et al. (Acciarini et al. 2023) focused on reviewing the benefits of business model innovation with the use of big data, enabling companies to reach a comprehensive understanding of diverse applications. The authors offered guidance on harnessing the potential of big data in industry. Shah, et al. (Shah et al. 2023) provided an SLR on the applications of BDPA in Supply Chain Risk Management (SCRM). The authors analyzed 68 selected articles and categorized them based on publication year, country, journal, application areas, and tools used. Singh, et al. (Singh et al. 2023) reviewed the prospective benefits and challenges of big data analytics (BDA) in the healthcare industry. The authors highlighted the increasing adoption of BDA in healthcare while addressing the associated challenges. The article was based on an SLR and offered possible solutions to healthcare challenges.

The reviewed studies are divided into three categories, survey, systematic mapping study (SMS), and SLR, as depicted in Table 1. Considering the previous points, none of the SLRs (Failed 2017b; Mikalef et al. 2018; Kolajo et al. 2019; Al-Sai et al. 2020; Rathore et al. 2021; Naghib et al. 2022; Georgiadis and Poels 2022) has reviewed BDPA holistically. Ohiomah, et al. (Failed 2017b) only reviewed articles between 2006 and 2017. Kolajo, et al. (Kolajo et al. 2019) concentrated on big data stream analysis, covering the years 2004 to 2018. Al-Sai, et al. (Al-Sai et al. 2020) reviewed big data's influential success elements between 2007 and 2019, omitting studies published after 2019; it would have been helpful to include more recent studies to ensure the findings were up to date. The only article relatively close to our work is Mikalef, et al. (Mikalef et al. 2018), in which the authors only reviewed articles until 2018. For this reason, we state that the SLR presented here is the first that attempts to investigate BDPA thoroughly up to 2023.

Table 1 Related studies in the field of BDPA

3 Research methodology

In the structured progression of this approach, we adhere to a three-step framework outlined as planning, execution, and documentation (Etemadi, et al. 2023; Brereton et al. 2007; Kitchenham et al. 2009), as illustrated in Fig. 2. The assessment is complemented by an external evaluation of the outcomes at each juncture. Initially, we discern the inquiries and motivations underlying this SLR during the planning phase. Subsequently, the selection of pertinent articles within this domain is based on predetermined inclusion/exclusion criteria (see Table 2) during the execution phase. Finally, within the documentation phase, observations are recorded, and the outcomes undergo analysis, comparison, and visualization, culminating in responses to the research queries, followed by the presentation of conclusive reports. This SLR adheres to a three-phase research methodology, detailed subsequently.

Fig. 2
figure 2

Overview of research methodology

Table 2 Inclusion/Exclusion criteria

3.1 Planning the systematic review

Planning initiates with the identification of the research rationale for this SLR and concludes with the formulation of a review protocol, outlined as follows:

Stage 1 - Clarifying the research motivation

The initial stage involves specifying the research motivation, determined based on the contribution of this SLR and justified through a comparative analysis of existing reviews. The need for a systematic review leads to the identification, classification, and comparison of recent studies concerned with BDPA. This work mainly focuses on a detailed comparison and classification of big data applications in the several areas covered in Section 4. To verify that literature studies similar to ours had not yet been conducted, we searched Google Scholar and well-known publishers such as ScienceDirect, Springer, IEEE, ACM, Taylor & Francis, SAGE, World Scientific, Emerald, Wiley, and Hindawi with the following search string.

“Big data” AND (survey OR review OR overview OR trends OR challenges OR “state of the art” OR study)

Initial results for review articles matching the above search terms were extracted from Google Scholar based on article titles. Then, we studied the abstracts of the articles that discussed predictive analytics and related topics and selected our related works accordingly. Finally, we compared the related works to ours. None of the observed reviews substantially answered the research questions (RQs) proposed in Section 3.1. Since BDPA is quite a critical field of study, reinforcing and updating the current evidence on the applications of big data is necessary. The legitimate reason that motivated us to conduct an SLR is therefore to address all the aforementioned weaknesses. Table 1 presents a summary of the studied surveys, depicting parameters such as review type, main topic, publication year, article selection process, taxonomy, future works, and covered years for each study. It is clear that just ten articles used the SLR method, and several articles did not mention their article selection processes. In contrast, in this research, the article selection process is completely clear, a taxonomy is prepared, future works are explained, and recently published articles up to 2023 are included. Therefore, we have conducted a detailed and thorough study to address the following shortcomings:

  • Newly published articles, especially from 2020 to 2023, have been missed.

  • The structure of most articles is not systematic since the article selection mechanism is not obvious.

  • Many articles did not use appropriate classifications.

  • The majority of reviews have ignored the evaluation parameters and tools.

  • A notable number of previously reviewed articles did not provide a fixed structure and taxonomy.

  • Previous reviews covered, at most, a limited number of articles.

Stage 2—Formulating research questions

In the second stage, aligned with the motivation of this article, the research questions are articulated to aid in the development and validation of the review protocol. The subsequent research questions, outlined below, seek to identify gaps in the current understanding of this subject. The resolution of these questions in the documenting phase can unveil new perspectives and ideas.

  • RQ1: What are the fields of prediction analysis applications in big data?

  • RQ2: What are the evaluation metrics of predictive analytics using big data?

  • RQ3: What evaluation methods are used in BDPA?

  • RQ4: What are the tools and environments in BDPA?

  • RQ5: What are the challenges and future issues of BDPA?

Stage 3—Establishing the review protocol

In line with the objectives of this SLR, the preceding stage involved the identification of research questions and the delineation of the review's scope to fine-tune search strings for literature extraction, as per (Brereton et al. 2007). Additionally, a protocol was formulated by drawing inspiration from the approach outlined by Brereton, et al. (Brereton et al. 2007) and our past involvement in SLRs (Bazzaz Abkenar et al. 2021, 2023; Khoshniat et al. 2023; Songhorabadi et al. 2023, 2011; Kashani and Mahdipour 2023; Nikravan and Haghi Kashani 2022; Sheikh Sofla et al. 2022; Haghi Kashani et al. 2021, 2020; Ahmadi et al. 2021; Rahimi et al. 2020; Nemati et al. 2023; Abkenar et al. 2011; Kashani et al. 2011). To evaluate the formulated protocol prior to its implementation, external expertise was enlisted from a specialist with proficiency in conducting SLRs within this specific domain. The feedback received was incorporated into the refined protocol. A pilot study, covering approximately 20% of the included articles, was conducted to mitigate potential biases among researchers and to optimize the data extraction process. Further refinements were made to the review scope, search strategies, and inclusion/exclusion criteria during this pilot stage.

3.2 Conducting the systematic review

The conducting phase of the research methodology, which commences with the selection of articles and concludes with data extraction, constitutes the second stage. This section is dedicated to illustrating the procedures involved in searching and selecting articles undertaken during the second phase of the SLR.

Stage 1 – Selecting primary articles

At this stage, we explored the following search string to collect primary articles:

(“big data” OR “large data” OR Hadoop OR Spark OR Storm) AND (predictive OR forecasting OR prediction OR foresee)

We searched through famous databases, such as Google Scholar, ScienceDirect, Springer, IEEE, ACM, Taylor & Francis, SAGE, World Scientific, Emerald, Wiley, and Hindawi, based on the title, keywords, and the abstraction section of each article.

  • Initial selection: This stage involves screening the titles, abstracts, and keywords of potential primary articles. As a result, 1130 articles were initially recorded from conference papers, journals, book chapters, and notes. The search string was applied to digital databases covering 2014 to 2023.

  • Final selection: After a full-text study, all non-English articles, editorial articles, book chapters, working papers, and short articles of fewer than six pages were omitted because they could not give us enough information. Based on our inclusion and exclusion criteria in Table 2, 109 articles were extracted.

Stages 2 and 3—We acquired data from the specified online search databases and structured the information based on characterization aspects, following the guidelines outlined in Kitchenham (2004) and Brereton et al. (2007). The scrutiny of these 109 articles forms the basis for our proposed classification of BDPA in Section 4, sheds light on the advantages and drawbacks of these approaches in Section 5, and informs the future works and open issues presented in Section 6.

4 A classification for applications of big data predictive analytics

The main target of this section is to present a comprehensible picture of predictive analytics using big data by examining all 109 selected articles. It is not easy to structure the related works on BDPA systematically because the literature is very diverse. The selected articles are classified into seven main groups in this study. According to article domain, this part is categorized into seven application classes: industrial, e-commerce, smart healthcare, smart agriculture, smart city, ICT, and weather. These seven categories are widespread among researchers and authors because they scrutinize the problems and issues from various angles.

Scrutinizing the articles shows that fifteen parameters appear in the evaluation of the obtained results, and each article may consider one or more of them. The parameters are as follows:

  • Accuracy: The measure applied to determine how well a model identifies relations in a dataset based on the input or training data; a short worked example follows this list. The following formula (Fawcett 2006) was used:

    $$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
    (1)
    where $T$, $F$, $P$, and $N$ denote true, false, positive, and negative, respectively.
  • Timeliness: It refers to data accessibility and availability for business decision-making. Clear, well-organized data enables intelligent decisions and leads to a better understanding of future expectations.

  • Cost: The total price the service requester has to pay to attain the best composite service.

  • Scalability: It refers to the measure of whether the algorithm/framework/platform accommodates rapid changes in data growth.

  • Reliability: It refers to the probability that a system will be capable of performing its designed or intended task at a specific time and in a specific environment.

  • Performance: The amount of useful work accomplished in a specified time.

  • Validity: The measure used to verify the suggested model; it checks whether models execute as expected, in line with their design purposes and business applications.

  • Resource Utilization: It refers to the percentage of time that a component is occupied compared with the total time that the component is available for use.

  • Time: Factors relevant to time, namely processing time, the overall time to provide an output, and execution time.

  • Energy: The total energy consumed to perform the applied requests.

  • Throughput: The maximum amount of data a system can process in a particular time.

  • Sustainability: The capability of the model to be maintained at a certain rate or level without the need for future updates.

  • Feasibility: The degree to which a proposed statement or model can be conveniently realized.

  • Security: The degree to which a system is free from threats and avoids irreversible consequences; it is generally critical in smart city and smart healthcare settings.

  • Precision: The quality of being exact in a model's positive predictions while testing it.
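To make Eq. (1) concrete, the short Python sketch below computes accuracy and, for comparison, precision from a confusion matrix; the label vectors are hypothetical toy data, not results from any reviewed study.

```python
# Minimal sketch: accuracy (Eq. 1) and precision from raw predictions.
# The label vectors are hypothetical toy data.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)  # Eq. (1)
precision = tp / (tp + fp)                  # exactness of positive predictions
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}")
```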

4.1 Industrial applications

We can classify these articles into three sub-classes. The first sub-class contains articles discussing business processes (Zhang et al. 2022a; Kong et al. 2022; Yang and Ge 2022; Shafi et al. 2021; Krumeich et al. 2014; Mishra 2019). The second contains articles focusing on the Industrial Internet of Things (IIoT) (Failed 2017c, 2018b; Yu et al. 2020; Wang et al. 2020; Tryapkin and Shurova 2020; Lin et al. 2022; Kodidala et al. 2021; Rosati, et al. 2023). The third deals with articles on supply-chain management (Hazen et al. 2014; Gunasekaran, et al. 2017; Dubey, et al. 2018; Dubey, et al. 2019; Dubey, et al. 2018; Jeble 2018; Nilashi, et al. 2023) and the benefits of big data analysis and precise prediction in this domain. In Section 4.1.1, the selected articles are reviewed; they are then compared in Section 4.1.2.

4.1.1 Overview of industrial articles

Zhang, et al. (Zhang et al. 2022a) took five major factors into account to calculate carbon emissions in the current era. First, they calculated energy infrastructure and energy intensity; then, alongside industrial structure, they considered employee scale and economic prosperity. With these factors, they could deepen the calculation of emissions on a large scale. Kong, et al. (Kong et al. 2022) analyzed the specifications of current industry and introduced some preliminaries using machine learning and deep learning. Afterward, they described latent variable models (LVMs) and advised on important issues, future directions, and LVM concepts. Yang and Ge (Yang and Ge 2022) made three contributions to big data analysis in the industrial area: they first summarized the relationships among learning paradigms, then discussed lifelong learning, and finally shed light on future directions and the potential to advance industrial applications through big data.

Shafi, et al. (Shafi et al. 2021) mathematically modeled an IP system and controlled it using a proportional-integral-derivative controller, from which streaming time-series input–output data were derived. By implementing current neural network algorithms, the authors were able to control IP systems; the outcome of their research was the trainability of neural networks on random input/output data. Focusing on event-based predictions, Krumeich, et al. (Krumeich et al. 2014) exploited the potential of predictive analytics on big data to enable proactive control of business processes. The article concentrates on production processes in the manufacturing industry and outlines its findings based on a case study of Saarstahl AG, a large German steel company. In this company, production data are gathered to form a foundation for exact predictions. Nevertheless, such a company cannot exploit the potential of the available data for proactive process control without dedicated big data analytics approaches.

A model was suggested by Mishra (Mishra 2019) for examining how the deployment of information technology (IT) (i.e., business–BDPA partnership, strategic IT flexibility, and business–BDPA alignment) and human resource (HR) capabilities influence organizational performance (OP) through BDPA. Structural equation modeling was used on survey data obtained from 159 companies in India to test the suggested hypotheses. Based on the results, the diffusion of BDPA mediates the impact of IT deployment and HR capabilities on OP. Moreover, IT deployment and HR capabilities have a direct influence on the diffusion of BDPA, which in turn has a direct relationship with OP. The authors thereby showed that IT deployment and HR capabilities influence OP indirectly through the diffusion of BDPA.

An open-source database was designed by Oyekanlu (Failed 2017c) that was compact enough to fit into the memory of edge analytics devices, using lightweight database software that does not require a server, a client, password schemes, or other requirements typical of traditional database systems. The purpose was to effectively support real-time computing in IIoT systems by sending minimal reports on system conditions to the cloud. With this lightweight, low-memory-footprint database system at the edge, the reliability of the whole IIoT system is improved.

Truong (Failed 2018b) described the design and augmentation of complex IoT and big data cloud systems for integrated analytics in IIoT predictive maintenance. The technique was intended to identify complicated interactions for tackling early system errors related to the results of fundamental analytics about the equipment. Both system incidents and the results of critical analytics were addressed in integrating equipment for IIoT predictive maintenance.

A manufacturing big data ecosystem was suggested by Yu, et al. (Yu et al. 2020) to deal with big data ingestion, management, and analytics for predictive-maintenance fault detection in IoT-based smart factories. Compared with similar studies that developed anomaly detection techniques via simulations, the suggested ECD architecture concentrated on a framework that guarantees data security and real-time data analytics by deploying a data lake, an encryption protocol, a NoSQL database, and the like on the Apache Spark platform. A MapReduce-based DPCA algorithm was introduced and defined for the fault detection model. According to the experimental results, the suggested big data ecosystem is capable of alerting the system several days ahead of the actual fault occurrence.
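To give a flavor of PCA-based fault detection in general (a simplified, single-machine illustration, not Yu et al.'s distributed DPCA implementation), the sketch below fits principal components on normal operating data and flags new samples whose reconstruction error exceeds a threshold; the sensor data and threshold choice are hypothetical.

```python
# Minimal sketch of PCA-based fault detection: fit components on normal
# data, then flag samples with a large squared prediction error (SPE).
# Data and threshold are hypothetical; Yu et al. used a distributed,
# MapReduce-based variant (DPCA) on Apache Spark.
import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 8))          # healthy sensor readings
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:3]                         # keep 3 principal components

def spe(x):
    """Squared prediction error of x under the PCA model."""
    centered = x - mean
    residual = centered - (centered @ components.T) @ components
    return float(residual @ residual)

# Alarm threshold: 99th percentile of SPE on healthy data
threshold = np.quantile([spe(row) for row in normal], 0.99)

new_sample = rng.normal(size=8) + 4.0       # simulated faulty reading
print("fault suspected:", spe(new_sample) > threshold)
```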

The causes and problems of the traditional remanufacturing mode were analyzed by Wang, et al. (Wang et al. 2020). A scalable big-data-driven hierarchical digital twin predictive remanufacturing (BDHDTPREMfg) concept and architecture were suggested, and a detailed big-data-driven control mechanism was presented. Afterward, a detailed paradigm scheme implementation in the application of AGV with SESD was presented. Notably, the application results validated the efficacy and feasibility of BDHDTPREMfg, and the above analysis demonstrated the advantages of applying it.

The main challenges related to big data analytics were defined by Tryapkin and Shurova (Tryapkin and Shurova 2020). To save data storage resources and transportation devices, most of the data has to be processed in real time. The suggested tools provided a rich toolkit for solving transport problems associated with identifying equipment operating modes and deploying predictive analytic tools to evaluate anomalies in equipment status. This article provided good examples of applying big data technologies to evaluate the condition of existing infrastructure. Different information systems, various computerized monitoring systems, and microprocessor systems set up at infrastructural facilities were used as record sources.

Lin, et al. (Lin et al. 2022) proposed a per-job framework to lower the costs and energy of data analysis. To reduce possible losses from spot-instance irregularities, the authors used a checkpointing mechanism. As future work, they noted that they should optimize cost when utilizing cloud infrastructures. Kodidala, et al. (Kodidala et al. 2021) tested their model with .NET Micro Systems to see whether they could improve data protection and privacy. Their main focus was on improving the security, reliability, and quality of the system. The test findings showed that the defined architecture offers valuable insights into IIoT's intelligent social communities.

Rosati, et al. (Rosati, et al. 2023) introduced a Decision Support System (DSS) for predictive maintenance (PdM) in industrial settings, leveraging IoT, big data, and machine learning. The DSS addresses the challenge of obtaining quality labeled data by employing a feature extraction strategy and an ML prediction model based on specific topics collected from the production system. Experimental results demonstrate that this approach offers a good trade-off between predictive performance and computational effort. The data quality problem was introduced by Hazen, et al. (Hazen et al. 2014) in the supply chain management (SCM) context, and they suggested methods for monitoring and controlling data quality. Additionally, the authors highlighted interdisciplinary research topics on the basis of complementary theories. They suggested a need for continual improvement in the SCM data production process and a familiar framework for establishing a quality control mechanism oriented to data quality. The application of statistical process control (SPC) methods was mentioned, but theory-based topics were not.

Gunasekaran, et al. (Gunasekaran, et al. 2017) investigated the influence of BDPA assimilation on supply chain planning (SCP) and organizational performance (OP). This study adopted a resource-based view, and the authors considered three stages of assimilation: acceptance, routinization, and assimilation. They tried to identify the influence of resources, information sharing, and connectivity, under the influence of top management commitment, on big data assimilation capability, OP, and SCP. Based on the findings, connectivity and information sharing under the influence of top management commitment are positively related to BDPA acceptance, which is related to BDPA assimilation through the mediating influence of BDPA routinization and is also positively related to OP and SCP.

Dubey, et al. (Dubey, et al. 2018) tested the role of BDPA in collaborative performance (CP) among participants in a sustainable development program aiming at SCP goals. The contingent influence of organizational fit on the impact of BDPA on CP was investigated in this study. Variance-based structural equation modeling (PLS-SEM) was used to examine the study's theory on a sample of 190 respondents working in Indian auto-components manufacturing organizations, obtained from the ACMA and Dun & Bradstreet databases. The findings suggested a considerable positive impact of BDPA on CP among partners and indicated that resource complementarity and organizational compatibility had a beneficial moderating impact on the link between BDPA and CP. There were several limitations to this study: first, the authors collected cross-sectional data; second, it was confined to dyadic networks. The theoretical structure of the suggested framework was analyzed at the inter-organizational level, yet it was observed only from the focal organization's perspective.

The effects of BDPA on social performance (SP) and environmental performance (EP) were investigated empirically by Dubey, et al. (Dubey, et al. 2019) by applying variance-based structural equation modeling (i.e., PLS). It was found that BDPA had a considerable impact on SP/EP; however, no evidence was found for the moderating influence of flexible and control orientations on the links between BDPA and SP/EP. The findings offered a deeper understanding of the performance implications of BDPA and addressed the vital questions of when and how BDPA can increase environmental/social sustainability in supply chains. This study collected data at a single point in time, which is a limitation. It also concentrated on managers' perceptions rather than actual performance, another limitation. The authors' application of DCV logic to define BDPA adoption and their research sample demographics may further limit the generalizability of the findings.

Dubey, et al. (Dubey et al. 2018) defined how big data and predictive analytics might refine coordination and visibility in humanitarian supply chains. A research model based on a contingent resource-based view was conceptualized by the authors. It was suggested that BDPA capabilities influence coordination and visibility under the influence of swift trust. Based on the results, BDPA has a considerable effect on coordination and visibility, while swift trust does not necessarily affect the relations between coordination, visibility, and BDPA. The use of cross-sectional survey data for testing the research hypotheses appears to be the main limitation of this article.

A theoretical model was developed by Jeble (Jeble 2018) to define the influence of big data and predictive analytics on an organization's sustainable business development goals. This theoretical model was built by applying resource-based view logic and contingency theory, and it was then tested using PLS-SEM (partial least squares structural equation modeling). This study contributed substantially to the operations and supply chain management literature. The authors provided theory-driven, empirically proven results and extended previous studies that focused on single performance measures (i.e., environmental and economic). They tried to answer previously unsolved questions by investigating BDPA's influence on performance measures.

Nilashi, et al. (Nilashi, et al. 2023) focused on a research gap by exploring the influence of BDPA on recycling and waste management in the food industry, particularly its impact on environmental and economic performance. The findings of their research highlighted the importance of employee knowledge and competitive pressure in driving BDPA adoption while emphasizing the significant role of BDPA in enhancing an organization's competitive advantage through improved environmental conditions.

4.1.2 Summary of industrial articles

In almost all articles, the authors attempt to improve environmental and organizational performance in addition to boosting prediction accuracy. These improvements positively affect reliability, resource utilization, and other related metrics. Based on the reviewed industrial articles, a comparison of their specifications is depicted in Table 3; they are divided into three categories: business processes (Zhang et al. 2022a; Kong et al. 2022; Yang and Ge 2022; Shafi et al. 2021; Krumeich et al. 2014; Mishra 2019); IIoT (Failed 2017c, 2018b; Yu et al. 2020; Wang et al. 2020; Tryapkin and Shurova 2020; Lin et al. 2022; Kodidala et al. 2021); and supply chain (Hazen et al. 2014; Gunasekaran, et al. 2017; Dubey, et al. 2018; Dubey, et al. 2019; Dubey, et al. 2018; Jeble 2018; Nilashi, et al. 2023). Table 3 shows the main ideas, evaluation methods, tools, advantages, and disadvantages of the industrial articles. In addition, Table 4 depicts the improvement of different evaluation metrics in each study. These metrics include accuracy, timeliness, cost, scalability, reliability, performance, validity, and resource utilization.

Table 3 Comparison of industrial articles
Table 4 Evaluation metrics in industrial articles

4.2 E-commerce applications

The focus of these studies is mainly on commercial applications of BDPA. They are classified into three sub-classes. The first category covers finance and stock data (Failed 2018c; Haitao 2020; Chen 2018a; Saito and Gupta 2022; Han et al. 2023). The second class contains articles discussing retail (Bradlow et al. 2017; Failed 2014; Lee 2017; Sasidhar and Mallikharjuna Rao 2019; Zheng et al. 2020; Chen 2021; Zhuang 2021; Alrumiah and Hadwan 2021; Li and Li 2022). The third class considers technology in the area of e-commerce (Suguna et al. 2016; Failed 2015a; Xiao 2022). Almost all of these studies relate different big data analytic solutions to retail in order to predict customers' behavior. In Section 4.2.1, the selected e-commerce articles are reviewed; they are then compared in Section 4.2.2.

4.2.1 Overview of e-commerce articles

A machine learning algorithm, specifically a token-based ensemble that used nonlinear and linear estimators, was suggested by Morris, et al. (Failed 2018c) for predicting big financial time-series data. The ensemble is composed of a long short-term memory (LSTM) network, a traditional Kalman filter, and a traditional linear regression model. They demonstrated the ensemble's adaptive behavior and performance in short-term, high-risk trading in the presence of noisy data such as stock prices. In (Haitao 2020), the development of a targeted e-commerce supply chain management information system was proposed. The evaluation results indicated a large-scale reduction of costs and enhancement of the system's efficiency. However, the healthy development of network loans still needs to be investigated by both governments and industries.
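To illustrate the general idea of combining linear and adaptive estimators for noisy series (a hedged, simplified sketch; Morris et al.'s actual ensemble combined an LSTM, a Kalman filter, and linear regression), the code below averages a linear extrapolation with an exponentially weighted moving average on a synthetic random-walk price series; the data, window size, and equal weights are hypothetical.

```python
# Simplified forecasting-ensemble sketch on a synthetic price series.
# Not Morris et al.'s method: their ensemble used an LSTM, a Kalman
# filter, and linear regression. Data and weights are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
prices = 100 + np.cumsum(rng.normal(0.1, 1.0, size=200))  # random walk

def linear_forecast(window):
    """Fit a line to the window and extrapolate one step ahead."""
    t = np.arange(len(window))
    slope, intercept = np.polyfit(t, window, 1)
    return slope * len(window) + intercept

def ema_forecast(window, alpha=0.3):
    """Exponentially weighted moving average: a simple adaptive smoother."""
    level = window[0]
    for x in window[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

window = prices[-30:]
forecast = 0.5 * linear_forecast(window) + 0.5 * ema_forecast(window)
print(f"next-step forecast: {forecast:.2f}")
```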

Chen (Chen 2018a) produced a structural personalization scheme for e-commerce along with text-matching algorithms. He stated that, by using the algorithms, not only did the chances of a transaction increase, but the level of personalized service was also elevated. However, the method lost its precision when it was tested on commodity search. Saito and Gupta (Saito and Gupta 2022) came up with a quantitative model investigating the impacts of social media on finance. They tested three distinct models, namely a revenue management model, a high-frequency equity trading model, and an interest rate framework. However, their models had some deficiencies: COVID-19 had lowered people's travel frequency, which affected their hotel model because adequate up-to-date data could not be found.

The opportunities and possibilities arising from big data in retailing were examined by Bradlow, et al. (Bradlow et al. 2017), especially along five main data dimensions: data related to customers, products, time, channels, and geospatial location. The rise in data quality and application possibilities results from a combination of new data sources, intelligent application of domain knowledge, and statistical tools mixed with theoretical insights. It is important that theory can guide a systematic search to answer retailing questions and can also streamline the analysis while remaining intact. The roles of big data and predictive analytics in retailing are becoming more important, as they are assisted by data sources and large-scale interconnected methods. The statistical issues discussed in this study concentrated on the applications and relevance of Bayesian analysis techniques: data borrowing, hierarchical modeling, augmentation and updating, field experiments, and predictive analytics applying big data in a retailing context.

A retail recommender model based on collaborative filtering was suggested by Sun, et al. (Failed 2014). They also designed a corresponding distributed computing algorithm on MapReduce to execute a big data-based retail recommender system. This big data procedure helped the system perform scalable data processing effortlessly. According to the experimental outcomes, the system was effective in estimating retail sales for every product and store; therefore, this innovative method of precision marketing support benefited non-e-commerce enterprises.
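For readers unfamiliar with the technique, the sketch below shows the core of item-based collaborative filtering (cosine similarity between item rating vectors) on a toy user-item matrix. It is a single-machine illustration under hypothetical data; Sun et al. distributed this kind of computation with MapReduce.

```python
# Single-machine sketch of item-based collaborative filtering.
# The toy user-item matrix is hypothetical; Sun et al. distributed
# the computation with MapReduce.
import numpy as np

# rows = users, columns = items; 0 means "not purchased/rated"
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 4, 4],
], dtype=float)

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

n_items = ratings.shape[1]
sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j])
                 for j in range(n_items)] for i in range(n_items)])

# Predict user 1's score for item 2 as a similarity-weighted average
user, item = 1, 2
rated = [j for j in range(n_items) if ratings[user, j] > 0]
pred = (sum(sim[item, j] * ratings[user, j] for j in rated)
        / (sum(abs(sim[item, j]) for j in rated) or 1.0))
print(f"predicted score for user {user}, item {item}: {pred:.2f}")
```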

To support anticipatory shipping, Lee (Lee 2017) suggested a genetic algorithm (GA)-based optimization model; a rough illustration of the GA idea follows this paragraph. Cloud computing was applied to store the big data generated from every channel. To understand purchase patterns and predict future purchases based on If–Then prediction rules, cluster-based association rule mining was used. Afterward, a modified GA was applied to produce optimal anticipatory shipping plans. Besides transportation costs and shipping distance, the GA considered the confidence of the prediction rules. A large number of numerical experiments were performed to illustrate the trade-offs between various shipping factors, and the optimization model was validated.
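As a rough, hypothetical sketch of GA-based optimization in this spirit (not Lee's actual model), the code below evolves a binary vector choosing which depots pre-stock an item so as to minimize a toy cost combining stocking expense and a penalty for shipping unmet predicted demand; all sizes, costs, and GA parameters are made up for illustration.

```python
# Toy genetic algorithm for an anticipatory-shipping-style choice:
# which depots should pre-stock an item? Cost model, sizes, and GA
# parameters are hypothetical, not Lee's formulation.
import random

random.seed(7)
N_DEPOTS = 10
DEMAND = [random.random() for _ in range(N_DEPOTS)]  # predicted demand
STOCK_COST, SHIP_COST = 1.0, 3.0

def cost(genome):
    # Pay to stock each chosen depot; pay more to ship unmet demand.
    stocked = sum(genome) * STOCK_COST
    unmet = sum(d for g, d in zip(genome, DEMAND) if not g) * SHIP_COST
    return stocked + unmet

def mutate(genome, rate=0.1):
    return [1 - g if random.random() < rate else g for g in genome]

def crossover(a, b):
    cut = random.randrange(1, N_DEPOTS)
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in range(N_DEPOTS)] for _ in range(30)]
for _ in range(100):                       # generations
    pop.sort(key=cost)
    parents = pop[:10]                     # elitist selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(20)]
    pop = parents + children

best = min(pop, key=cost)
print("best plan:", best, "cost:", round(cost(best), 2))
```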

In (Sasidhar and Mallikharjuna Rao 2019), a model was presented to depict the widening area of cloud computing with big data in the retail market. The authors introduced a model that connects big data as a service with the cloud for realistic measurement of customer behavior. However, the model's scalability was limited, and there is still room for further studies to examine real data trends and the network congestion of e-commerce retailers. Zheng, et al. (Zheng et al. 2020) presented a model using two methods, the analytic hierarchy process (AHP) and the technique for order preference by similarity to an ideal solution (TOPSIS), to investigate logical distribution modes for the stores at JD. Their technique captured the outcomes of subjective analysis while giving full play to the advantages of quantitative analysis.

Chen (Chen 2021) proposed an e-commerce method based on network data technology, investigating the game equilibrium among four members of the supply chain to lower costs and enhance the efficiency of customization. The author considered four elements of supply game equilibrium: centralized decision-making, decentralized decision-making, C2B-dominated decision-making, and traditional enterprise-dominated business utilization. Zhuang (Zhuang 2021) analyzed the impacts of big data on e-commerce in the U.S. and China, using Web of Science and CNKI as the two major databases. He mostly tried to clarify doubts about the development of big data in e-commerce in these nations; however, the case study was limited to only two countries.

Alrumiah and Hadwan (Alrumiah and Hadwan 2021) investigated vendors' and customers' views on e-commerce. They concluded that e-commerce has some negative effects on customers, such as addiction, and that it is costly for vendors to take advantage of big data analytics tools. Li and Li (Li and Li 2022) conducted research on big data mining tools for e-commerce via cellphone applications. They divided their work into theoretical and experimental parts. In the experimental part, the authors found that customers appreciated promotional activities and that e-commerce provides a very convenient, high-quality environment for both buyers and sellers.

Suguna, et al. (Suguna et al. 2016) discussed the criticality of log files in e-commerce by analyzing the log files used to identify users' actions. Their model depicted how to process log files using MapReduce and how to utilize the Hadoop framework for parallel computation over log files (a minimal sketch of this map/reduce pattern follows this paragraph). The authors' approach decreased the response time, added to the functionality of the model, and provided accurate results with an appropriate mean response time. Aboutorabi, et al. (Failed 2015a) concentrated on the differences between MongoDB, a NoSQL database, and Microsoft SQL Server. The study highlighted the major differences between these database management systems regarding query processing performance. MongoDB produced better performance, flexibility, and reliability; however, the model still needs to be tested in a real-world environment to evaluate the application of MongoDB and SQL more precisely. Xiao (Xiao 2022) focused on "big data killing" (data-driven price discrimination) occurrences in the emergence of e-commerce. He built an evolutionary four-party game model of government departments, e-commerce companies, platforms, and consumers. Although the study examined big data killing in detail, the author states that there is still room for improvement in this field, as the four-party evolutionary model does not cover all circumstances.
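To illustrate the map/reduce pattern that Suguna et al. applied to log files (a simplified single-process analogue, not their Hadoop deployment), the sketch below counts page hits per URL; the log lines and their format are hypothetical.

```python
# Single-process analogue of MapReduce log analysis; Hadoop would
# shard the map, shuffle, and reduce phases across a cluster.
# Log lines and their format are hypothetical.
from collections import defaultdict

logs = [
    "10.0.0.1 GET /cart",
    "10.0.0.2 GET /home",
    "10.0.0.1 GET /cart",
    "10.0.0.3 GET /checkout",
]

# Map phase: emit a (url, 1) pair for every request line
mapped = [(line.split()[2], 1) for line in logs]

# Shuffle phase: group emitted values by key
groups = defaultdict(list)
for url, count in mapped:
    groups[url].append(count)

# Reduce phase: sum the counts for each url
hits = {url: sum(counts) for url, counts in groups.items()}
print(hits)  # {'/cart': 2, '/home': 1, '/checkout': 1}
```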

4.2.2 Summary of e-commerce articles

Most studies strive to reduce costs and improve prediction accuracy, and tools such as Hadoop and XLMiner have been applied. According to the reviewed and discussed e-commerce articles, a comparison of their specifications is depicted in Table 5. Table 5 indicates the main ideas, evaluation methods, tools, advantages, and disadvantages of each e-commerce article. In addition, Table 6 displays the improvement of different evaluation metrics in each study. These metrics include accuracy, timeliness, cost, scalability, reliability, performance, validity, resource utilization, time, and throughput.

Table 5 Comparison of e-commerce articles
Table 6 Evaluation metrics in e-commerce articles

4.3 Smart healthcare applications

These studies focus on predictions that help protect individuals against diseases with the assistance of big data, and they are classified into two distinct groups. The first category discusses methods for the prediction of disease: (Hendri and Sulaiman 2018; Venkatesh et al. 2019; Khatibi et al. 2019; Failed 2018d, 2017d, 2020; Souza et al. 2020; Gaedke Nomura et al. 2021; Safa et al. 2023). The second category focuses on the economic prosperity of smart healthcare (Weerakkody et al. 2021; Nallathamby et al. 2021; Chen 2018b; Awotunde et al. 2022; Ali et al. 2022; Das and Namasudra 2022; Zang and You 2022; Babar et al. 2022). In Section 4.3.1, the selected smart healthcare articles are reviewed. Finally, in Section 4.3.2, the discussed articles are compared and summarized.

4.3.1 Overview of smart healthcare articles

Findings from Malaysian healthcare facilities were reported by Hendri and Sulaiman (Hendri and Sulaiman 2018), who recorded 9,261 dengue patients in 2014. The research aimed to provide a descriptive analysis and suggested big data analytic modeling techniques to predict and define dengue patients' length of stay (LoS). Demographic data such as age, gender, and the dates of admission and discharge were considered as factors contributing to LoS prediction.

A Naive Bayes (NB) machine learning technique was proposed by Venkatesh, et al. (Venkatesh et al. 2019) for forecasting heart failure with high accuracy. The heart disease data, obtained from the UCI machine learning repository, was used to train the Naive Bayes model, which then classified the test data. The suggested BPANB scheme applied Hadoop-Spark as a big data computing tool to attain important insights into healthcare data. Experiments were conducted to predict the future health conditions of various patients: the training dataset estimated the health parameters required for classification, and the results supported early detection of disease to determine the patients' future health.
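
As a minimal illustration of this kind of pipeline (a sketch, not the BPANB scheme itself), the following PySpark snippet trains a Naive Bayes classifier on a heart-disease CSV; the file name and the "target" label column are assumptions about the dataset layout.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("heart-nb").getOrCreate()
df = spark.read.csv("heart.csv", header=True, inferSchema=True)  # assumed file

# Assemble all non-label columns into a feature vector; multinomial NB
# requires non-negative feature values, which holds for this dataset's fields.
features = [c for c in df.columns if c != "target"]
data = (VectorAssembler(inputCols=features, outputCol="features")
        .transform(df)
        .select("features", df["target"].alias("label")))

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = NaiveBayes(modelType="multinomial").fit(train)

acc = MulticlassClassificationEvaluator(metricName="accuracy") \
    .evaluate(model.transform(test))
print(f"test accuracy: {acc:.3f}")
```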

To predict premature births and rank predictive features, Khatibi, et al. (Khatibi et al. 2019) suggested machine learning models for big data analytics. The model predicts premature births with 81% accuracy and 68% AUC. Findings suggested that pregnancy risk segments, gestational diabetes, heart-related problems, mother's age, underlying maternal diseases, the number of pregnancies, education level, prenatal gender, and city were the highest-ranked predictive features. To reduce the risk of premature births, the authors suggested managing and monitoring the high-ranked features remotely, at regular intervals, with smartphone applications and IoT gadgets.

An architecture for big-data-based predictive maintenance of biomedical devices in the medical domain was suggested by Çoban, et al. (Failed 2018d). Data were categorized in real time from device alarms, the biomedical devices themselves, and their health conditions; the suggested architecture made data management in this predictive maintenance model possible. A big data predictive maintenance model should be applied to increase biomedical device performance, enhance reliance on these devices, and reduce the accidents that healthcare staff encounter.

A clinical decision scheme based on RNNs, called PCD, was suggested by Lin, et al. (Failed 2017d) to predict disease and raise alerts before onset while considering patient privacy. The authors designed a homomorphic encryption scheme so that neither users' nor data providers' private data would be disclosed, allowing PCD to resist various security threats. To improve prediction accuracy, they designed averaged and sequential RNN models for real-time systems. The experimental results show that PCD attains high time efficiency and high accuracy in disease forecasting while preserving the privacy of patients and data providers. Souza, et al. (Souza et al. 2020) presented a framework that used three kernels, namely linear, polynomial, and RBF, in a support vector machine (SVM) ensemble to forecast dengue cases. They illustrated the effectiveness and functionality of the proposed framework through numerous case studies. Although neither flexibility nor performance was taken into account, their method was relatively precise.
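
The three-kernel ensemble idea can be sketched with scikit-learn as follows; the synthetic classification data stands in for the dengue case-study features, and hard majority voting is one plausible combination rule (the paper's exact aggregation is not specified here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the case-study features.
X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One SVM per kernel (linear, polynomial, RBF), combined by majority vote.
ensemble = VotingClassifier([
    (k, make_pipeline(StandardScaler(), SVC(kernel=k)))
    for k in ("linear", "poly", "rbf")
], voting="hard")
ensemble.fit(X_tr, y_tr)
print(f"ensemble accuracy: {ensemble.score(X_te, y_te):.3f}")
```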

Gaedke Nomura, et al. (Gaedke Nomura et al. 2021) described the use of a big data science framework to extend a pain information model and discussed its prospects for predictive modeling. Data for the proposed framework were extracted from a hospital, and the framework was intended to customize remedies, improve health outcomes, and reduce costs. Christobel and Kamalakannan (Failed 2020) discussed the effects of diabetes (type 1, type 2, and GDM). The proposed framework employed the large PIMA dataset, and the presented algorithm used Hadoop/MapReduce, which supported disease prediction and report production. Their method was observed to provide high performance.

Safa, et al. (Safa et al. 2023) proposed a healthcare big data analytics (HCBDA) model for disease prediction. The HCBDA model utilizes wireless sensor networks and IoT devices to monitor patients' biosignals and generate medical assistance. The proposed approach achieved a disease prediction accuracy of up to 96% and employed machine learning algorithms for classification and recommendation generation. The HCBDA paradigm clusters large healthcare data and utilizes a decision support system for analysis and prediction. Weerakkody, et al. (Weerakkody et al. 2021) employed a model for improving subjective well-being. The authors used open data from the national annual population survey, which limited access to detailed personal data. The presented model showed, however, that such data can be collected continuously without exorbitant additional costs.

Nallathamby, et al. (Nallathamby et al. 2021) introduced a model to create an effective appointment-scheduling platform for outpatients. Their model used Hadoop and MapReduce, which led to low healthcare costs and excellent flexibility. Chen (Chen 2018b) established a systematic pathway for applying big data in the medical sector by extracting data and analyzing them in a net relation map, creating a framework for Taiwan's healthcare industry to propose his methods and conduct the research. Awotunde, et al. (Awotunde et al. 2022) proposed a continuous monitoring device using IoT health surveillance to measure body temperature, blood glucose, blood pressure, and other elements affecting a patient's overall health status. Ali, et al. (Ali et al. 2022) proposed a framework simulating the secure release of sensitive patient data for booking appointments via online time slots. They filtered their data with NoSQL and a Redis cache to increase accuracy while improving security.

Das and Namasudra (Das and Namasudra 2022) presented a scheme to make smart healthcare in an IoT-enabled environment more confidential and secure, allowing only authenticated users to access the database. They analyzed the security and performance of the scheme, which satisfied the requirements, although the model's overall overhead was a burden. Zang and You (Zang and You 2022) proposed a framework with three distinct modules, namely data preparation, model training, and data computation, for evaluating a real-time efficiency process. They used machine learning techniques, such as regression and decision trees, to train their model. Babar, et al. (Babar et al. 2022) developed a scheme with data pre-processing, data processing, and data ingestion modules, using Spark to analyze data properly. Pre-processing was used to speed up real-time data processing; however, their scheme lacked well-organized parallel data processing.

4.3.2 Summary of smart healthcare articles

These studies focus on boosting performance, making more accurate predictions, and limiting costs. According to the reviewed and discussed healthcare articles, a comparison of their specifications is depicted in Table 7. Table 7 indicates the main ideas, evaluation methods, tools, advantages, and disadvantages of each healthcare article. In addition, Table 8 displays the improvement of different evaluation metrics in each study. These metrics include accuracy, timeliness, cost, scalability, reliability, performance, validity, resource utilization, time, precision, and energy.

Table 7 Comparison of smart healthcare articles
Table 8 Evaluation metrics in smart healthcare articles

4.4 Smart agriculture applications

These articles specifically aim at enhancing the quality of various parameters in the field of smart agriculture, and they are divided into three sub-classes. The first group's articles focus on crop prediction (Tsouli Fathi et al. 2020; Velmurugan et al. 2021). The second group takes a detailed look at precision agriculture (Bendre et al. 2016; Sabarina and Priya 2015; Keswani et al. 2020; Failed 2015b; Melgar-García, et al. 2022; Wang and Mu 2022). The third discusses smart farming methods (Liu 2021; Roukh et al. 2020; Li and Niu 2020; Osinga et al. 2022; Shrivastava et al. 2023). In Section 4.4.1, the selected agriculture articles are reviewed; they are then compared in Section 4.4.2.

4.4.1 Overview of smart agriculture articles

Using a neural network, Tsouli Fathi, et al. (Tsouli Fathi et al. 2020) offered an algorithm to forecast alterations in the climatic conditions influencing crop yields and production in agriculture for a defined future area. The performance of the proposed ANN-based approach was evaluated on a 30-year meteorological dataset of 54,000 records with features such as humidity, temperature, rainfall, wind velocity, and agro-climatic data obtained from the Köppen climate classification rules. The prediction error was very low, and the learning convergence was robust. Through predictive data mining, this study demonstrated the possibility of extracting useful knowledge and patterns from a considerable amount of agro-climatic and meteorological data.

Velmurugan, et al. (Velmurugan et al. 2021) introduced a technique for consumption and processing strategies for evaluating the dataset. A highly technological methodology, the fuzzy enumeration crop prediction algorithm (FECPA), was used for precise cultivation. The FECPA system was simulated in shared Jupyter notebooks with appropriate datasets, demonstrating fast branch-and-bound, naive Bayes, conventional neural network, and the presented FECPA system. An abstract idea was suggested by Bendre, et al. (Bendre et al. 2016) about big data in precision agriculture and about how to discover insights from big precision agriculture data via ICT resources for the farming of the future. The authors also suggested an e-agriculture model that applies ICT services in the agricultural environment to collect big data. The article sorted out various big data sources in the precision ICT-based e-agriculture model, its challenges, and its future applications. Finally, they discussed rainfall prediction, applying supervised and unsupervised methods to process and forecast data.

Sabarina and Priya (Sabarina and Priya 2015) focused on systematically reducing big data size using a tensor-based feature-reduction model for precision agriculture. The IHOSVD algorithm was used to decompose the data and extract core values; consequently, the total file size was reduced by deleting unnecessary data dimensions. CPU time and data-analysis time decrease markedly when dimensionality-reduced data are used instead of raw, unprocessed data. Keswani, et al. (Keswani et al. 2020) presented a real-time decision support system (DSS) to generate adequate valve control commands. Data from six different sensors were used to test prediction techniques such as deep neural networks (DNNs) and random forests for forecasting soil moisture content. The DNN model proved appropriate for a prediction-based smart irrigation scheme, since a DNN is an artificial neural network with many hidden layers between inputs and outputs.
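
As an illustrative sketch (not the authors' DSS), the snippet below trains a random forest regressor to forecast soil moisture from six assumed sensor channels on synthetic data, then thresholds the forecast into a valve open/close command; the channel names and the threshold value are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Assumed channels: soil_temp, air_temp, humidity, light, rain, prev_moisture.
X = rng.uniform(0, 1, size=(2000, 6))
# Synthetic moisture target loosely tied to humidity and previous moisture.
y = 0.5 * X[:, 5] + 0.3 * X[:, 2] - 0.2 * X[:, 1] + rng.normal(0, 0.02, 2000)

model = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

def valve_command(sensor_row, threshold=0.25):
    """Open the irrigation valve when forecast moisture falls below threshold."""
    moisture = model.predict(sensor_row.reshape(1, -1))[0]
    return "OPEN" if moisture < threshold else "CLOSE"

print(valve_command(rng.uniform(0, 1, size=6)))
```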

Bendre, et al. (Failed 2015b) employed an approach to extract additional insights from precision agriculture data using big data techniques. Their major aim was to improve the accuracy of forecasting various weather parameters for future precision agriculture. Outcomes were predicted using a regression model, and the big data workload was handled by MapReduce. Melgar-García, et al. (Melgar-García, et al. 2022) applied a triclustering approach to a field located in Portugal to examine precision agriculture. The main metrics they emphasized were scalability, performance, and reliability; however, their proposed framework and algorithms could not manage anomalies.

Wang and Mu (Wang and Mu 2022) focused on big data management and risk monitoring in precision agriculture. They created an IoT- and big-data-based platform to analyze how accurately light wavelengths would be transformed and received. Nevertheless, since the authors only considered wheat fields, a more comprehensive study is needed to cover all types of plants and species. Liu (Liu 2021) used a ZigBee wireless sensor network to cover all aspects of crops under the guidance of efficient agricultural technologies. The author first introduced a multi-generation genetic algorithm back-propagation model in the first layer of the model; then, in the application layer, the analytic hierarchy process was proposed as the guidance mechanism for the neural networks. Finally, mathematical statistics were used to test the presented model.

Roukh, et al. (Roukh et al. 2020) investigated state-of-the-art platforms and big data architectures before proposing their solution: an overall big data architecture for smart farming. Their framework used the Lambda architecture to address the acquisition, processing, and storage of real-time big data. They tested their model in a real-time environment; however, it was not tested across various agricultural lands to obtain more precise results. Li and Niu (Li and Niu 2020) used a K-means algorithm based on the farthest distance to study data mining in the agricultural production process. Test results illustrated that the improved K-means clustering method reduced total runtime, and the presented algorithm can be used on real-time data without losing efficiency. However, the value of large amounts of data was not fully realized because of insufficient information to control the mining.
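
The farthest-distance seeding idea behind such improved K-means variants can be sketched as follows: each new centre is the point farthest from the centres chosen so far, which spreads the initial centres and typically speeds convergence. Beyond this seeding rule, the cited algorithm's details are not described here, so the rest is a plain scikit-learn K-means run on stand-in data.

```python
import numpy as np
from sklearn.cluster import KMeans

def farthest_distance_init(X, k, rng):
    """Seed k centres: first at random, each next one farthest from all chosen."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance of every point to its nearest already-chosen centre.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    return np.array(centers)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))              # stand-in for farm sensor records
init = farthest_distance_init(X, k=5, rng=rng)
labels = KMeans(n_clusters=5, init=init, n_init=1).fit_predict(X)
```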

Osinga, et al. (Osinga et al. 2022) surveyed stakeholders (n = 56) across livestock, fishing, and food security. They used a mixed-methods approach and mainly focused on four perspectives, namely the elements changing the initiating drivers, distinguishing big data methods, the maturity status of the technology, and the stakeholders' views. The authors in Shrivastava et al. (2023) adopted smart farming concepts such as hydroponics with IoT platforms, eliminating the need for soil and optimizing resources. This vertical hydroponic system, aided by IoT sensors, allows continuous monitoring of crop health and of the supply of nutrients and water, resulting in increased productivity and reduced costs. The research article focused on the design and implementation of this automated vertical hydroponic farming method.

4.4.2 Summary of smart agriculture articles

This class of articles applied its innovations to real-world cases and mainly concentrated on enhancing performance. The reviewed and discussed smart agriculture articles are divided into three categories, which are depicted in Table 9. Table 9 indicates the main ideas, evaluation methods, tools, advantages, and disadvantages of each smart agriculture article. In addition, Table 10 displays the improvement of different evaluation metrics in each study. These metrics include accuracy, timeliness, cost, scalability, reliability, performance, validity, and resource utilization.

Table 9 Comparison of smart agriculture articles
Table 10 Evaluation metrics in smart agriculture articles

4.5 Smart city applications

These articles particularly target big data for public transit to deter delays in public transportation and to improve accuracy and adapt to changes in future intelligent transportation in smart cities. Accordingly, we have arranged the articles into five sub-classes. The first class covers vehicles in smart cities (Balbin et al. 2020; Cui et al. 2019; Guo and Xu 2022). The second sub-class collects articles on city planning: (Failed 2017e, 2022; Khan et al. 2017; Ng et al. 2017; Li et al. 2022; Chang 2021; Tong et al. 2022; Zhang et al. 2022b; Mortaheb and Jankowski 2023). The third category focuses on surveillance systems (Tian et al. 2018; Ramsahai, et al. 2023; Huang et al. 2023). The fourth group addresses real-time systems: (Nathali Silva et al. 2017; Rathore et al. 2018). The last category concentrates on social network analysis (Azzaoui et al. 2021). In Section 4.5.1, the selected smart city articles are reviewed; they are then compared in Section 4.5.2.

4.5.1 Overview of smart city articles

Balbin, et al. (Balbin et al. 2020) proposed predictive analytics on an open data portal about bus performance, concentrating on big data for public transit to deter delays in public transportation. The data collected from the analysis and real-life tests showed that buses were on time and that analyzing big data was highly feasible. Cui, et al. (Cui et al. 2019) proposed a network calculus model to decrease mean travel time during rush hours. Their experimental results demonstrated that fleet management of autonomous vehicles in a smart city can significantly reduce travel time and energy consumption. Guo and Xu (Guo and Xu 2022) extracted data on traffic congestion, aiming to study the mechanism and diffusion of traffic congestion. They concluded that no single technology or method can resolve the traffic issue or even meet the specifications for researching traffic flow.

Chin, et al. (Failed 2017e) sought to understand the relationship between weather and short-term cycling behavior by utilizing four machine-learning classification algorithms. The outcomes were accurate and reliable, and the results illustrated that the integration of ML, IoT, and big data paves the way for feasible smart city technologies. Khan, et al. (Khan et al. 2017) designed a scheme for energy-aware communications in an IoT environment. Their architecture had four phases: identification of the energy needed by devices, deployment of sensors, scheduling, and information collection. The scheme optimized energy consumption and balanced the load during rush hours.

Ng, et al. (Ng et al. 2017) stated that extracting value from big infrastructure data is a challenge. To tackle it, the authors introduced a master data management (MDM) solution: multi-domain master data objects were built using MDM tools, and the MDM was implemented in the registry style. To change the way smart cities analyze data and to achieve efficient data processing, Li, et al. (Li et al. 2022) suggested a deep learning algorithm that combines big data analysis with a convolutional neural network. The accuracy of their work is estimated to be above 97%; however, the energy consumption and resource utilization were higher than expected.

Chang (Chang 2021) introduced an ethical framework for big data applications in smart cities and city planning, mainly focusing on raising public awareness of how to follow smart city guidelines. However, the ethical discussion neglected security and unauthorized data release. Tong, et al. (Tong et al. 2022) used concurrent government-driven data and national traffic-noise data to investigate sleep deprivation in relation to the structure of cities and homes in city planning. One highlight was that over 62% of people suffered from a lack of sleep due to traffic noise. Zhang, et al. (Zhang et al. 2022b) provided a case study with three phases of development in Wuhu; however, this was the only case from which the authors collected data, which limited the statistical generalizability of their work.

Concerned about the COVID-19 pandemic, Lv, et al. (Failed 2022) tried to make smart city construction much safer and faster by introducing building information modeling, big data processing methods, and digital tools. The authors in Mortaheb and Jankowski (2023) advocated a reimagining of the smart city concept, emphasizing the crucial role of city planning and the integration of Geospatial Artificial Intelligence (GeoAI). By leveraging the synergies between city planning, big data, geographic information science and systems, and data science, the article proposes achieving policy goals such as enhancing the efficiency of urban services, improving quality of life, addressing urban challenges, and generating valuable spatial data and knowledge. Tian, et al. (Tian et al. 2018) presented a block-level background modeling (BBM) algorithm to support a long-term reference structure for efficient surveillance video coding and also developed a rate-distortion optimization for the surveillance source coding algorithm. Ramsahai, et al. (Ramsahai, et al. 2023) employed BDPA techniques, such as exploratory data analysis, geocoding for hotspot mapping, and kernel density estimation, to analyze historical crime data and make crime predictions. The study also confirms the relevance of Twitter data in crime analysis, with the integration of Twitter data improving accuracy by 9%.

Huang, et al. (Huang et al. 2023) focused on a security threshold setting algorithm for a distributed optical fiber monitoring and sensing system based on big data. The components of the system are introduced, factors affecting system performance are analyzed, and methods for enhancing system performance are summarized. The proposed algorithms required less storage while offering strong code sustainability and high productivity. Nathali Silva, et al. (Nathali Silva et al. 2017) presented the design of a smart city based on big data analytics. Their model consisted of three levels: data generation and acquisition, data management and processing, and application. The authors tested their model in the Hadoop environment with datasets obtained from various authentic and reliable sources.

Rathore, et al. (Rathore et al. 2018) organized a system for smart digital cities to pave the way for acquiring information. The gathered data were processed in a real-time environment using Hadoop under Apache Spark, and the authors showed that the system's efficiency was enhanced when big data was processed with Apache Spark over Hadoop. El Azzaoui, et al. (Azzaoui et al. 2021) developed a big data analysis framework based on information shared openly on social network service (SNS) platforms to monitor, comprehend, and forecast virus spread in the immediate future, while controlling the infodemic and preventing biased or manipulated news from broadcasting. The authors applied natural language processing (NLP) to obtain more precise and accurate data for predicting a virus outbreak.

4.5.2 Summary of smart city articles

The main goal of each reviewed article was to enhance the feasibility of smart cities by testing models in a real-time environment. According to the reviewed and discussed smart city articles, a comparison of their specifications is depicted in Table 11. Table 11 indicates the main ideas, evaluation methods, tools, advantages, and disadvantages of each smart city article. In addition, Table 12 displays the improvement of different evaluation metrics in each study. These metrics include accuracy, time, performance, reliability, energy, scalability, throughput, sustainability, feasibility, and security.

Table 11 Comparison of smart city articles
Table 12 Evaluation metrics in smart city articles

4.6 ICT applications

These articles are of different kinds and subjects and can be divided into two classes. The first class targets articles connected with algorithms or machine learning approaches: (Nural et al. 2015; Failed 2017f; Oo and Thein 2019; Kannan et al. 2018; AlFarraj et al. 2019; Khine and Nyunt 2019; Arun Kumar and Venkatesulu 2019). The second class includes articles that point to technology: (Shenoy and Gorinevsky 2015; Su and Huang 2018; Mujeeb et al. 2019). The common feature of all these articles is the combination of big data and data mining techniques. In Section 4.6.1, the selected ICT articles are reviewed. Finally, in Section 4.6.2, the discussed articles are compared and summarized.

4.6.1 Overview of ICT articles

Applying semantic technology was suggested by Nural, et al. (Nural et al. 2015) to help data analysts and scientists select proper modeling techniques, create specific models, and explain the rationale for the selected models and techniques. To describe models, modeling techniques, and results, the authors developed an analytics ontology that supports inferencing for semi-automated model selection. The ScalaTion framework, which supports more than thirty modeling techniques for predictive big data analytics, was applied as a testbed to assess the use of semantic technology.

Putting special emphasis on modeling for predictive analytics through meta-learning with regression-based algorithms, Nural, et al. (Failed 2017f) focused on and discussed the progress made in automated modeling. Besides developing a meta-learning system, the authors introduced a wide range of meta-attributes to capture features related to regression algorithms. An influential predictive analytics system was proposed by Oo and Thein (Oo and Thein 2019) for high-dimensional big data via a scalable random forest (SRF) algorithm on the Apache Spark platform. Hyperparameter optimization enhanced SRF, and dimensionality reduction improved prediction performance. The efficacy of the suggested system was tested on different real-world datasets; based on the outcomes, the suggested approach attains competitive performance compared with the RF algorithm implemented in Spark MLlib.
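
For context, the sketch below shows a plain Spark MLlib random forest baseline of the kind SRF is compared against, with a small hyperparameter grid searched by cross-validation; the toy DataFrame and grid values are illustrative assumptions, not the cited system.

```python
import random
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("rf-baseline").getOrCreate()

# Toy stand-in data: 8 random features, binary label.
random.seed(0)
rows = [(Vectors.dense([random.random() for _ in range(8)]),
         float(random.randint(0, 1))) for _ in range(500)]
train = spark.createDataFrame(rows, ["features", "label"])

rf = RandomForestClassifier()
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100])     # hyperparameters to tune
        .addGrid(rf.maxDepth, [5, 10])
        .build())
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(
                        metricName="accuracy"),
                    numFolds=3)
best = cv.fit(train).bestModel   # forest with the best cross-validated grid point
```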

By applying a support vector machine approach (PAD-SVM), Kannan, et al. (Kannan et al. 2018) suggested a predictive analysis of demonetization data. The suggested PAD-SVM had three steps: preprocessing, descriptive analysis, and prescriptive analysis. In the preprocessing step, the obtained data were cleaned, missing values were treated, and the necessary data were extracted from the tweets. In the descriptive analysis step, the most influential individuals were identified and the analytical functionalities were performed. Sentiment analysis was performed to find users' sentiment values and each tweet's compound polarity. Predictive analysis was done to gauge people's present mindsets and the community's reaction to current issues. The purpose of this analysis was to find society's overall viewpoints, and the views that might change in the near future, concerning the demonetization scheme.
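
The paper's exact sentiment scorer is not specified here, but the mention of compound polarity suggests a VADER-style analyzer; the sketch below shows that step with NLTK on invented stand-in tweets.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# Invented stand-ins for the demonetization tweets.
tweets = [
    "Demonetization will curb black money, great move!",
    "Standing in ATM queues for hours, this policy is a disaster.",
]
for t in tweets:
    c = sia.polarity_scores(t)["compound"]   # compound polarity in [-1, 1]
    label = ("positive" if c >= 0.05
             else "negative" if c <= -0.05 else "neutral")
    print(f"{label:8s} {c:+.3f}  {t}")
```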

An optimized feature selection method and soft computing techniques were introduced by AlFarraj, et al. (AlFarraj et al. 2019) to reduce dataset dimensionality. First, they collected data from different resources; this data contained inconsistencies that reduced system efficiency. They then removed noise and inconsistency using a normalization approach and chose the optimized traits by applying the firefly gravitational ant colony optimization method. This optimized feature selection method could examine the features during the selection process, and the chosen features contained the details needed for the specific predictive analytics. The efficiency of the proposed system was assessed on various datasets.

Khine and Nyunt (Khine and Nyunt 2019) suggested a MapReduce-based multiple linear regression model suitable for distributed and parallel execution, aimed at predictive analytics on massive datasets. This QR-decomposition-based model decomposes large matrices of training data to extract the model coefficients from huge matrix data on the MapReduce framework.
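
The QR-on-partitions idea can be illustrated with a small single-machine sketch: each "map" task reduces its row block of the design matrix to a compact (R, Qᵀb) summary, and the "reduce" step solves the much smaller stacked problem. This is a generic TSQR-style least-squares pattern on assumed synthetic data, not the authors' exact implementation.

```python
import numpy as np

def partition_qr(A_block, b_block):
    """Map step: summarize one row block as (R, Q^T b) via thin QR."""
    Q, R = np.linalg.qr(A_block)          # A_block = Q @ R
    return R, Q.T @ b_block

def mapreduce_lstsq(blocks):
    """Reduce step: stack the per-block summaries and solve the small problem.
    Minimizing ||R_all x - c_all|| gives the same x as the full least squares."""
    Rs, cs = zip(*(partition_qr(A, b) for A, b in blocks))
    coef, *_ = np.linalg.lstsq(np.vstack(Rs), np.concatenate(cs), rcond=None)
    return coef

# Usage: simulate three "mapper" partitions of a tall regression problem.
rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])
blocks = []
for _ in range(3):
    A = rng.normal(size=(1000, 3))
    b = A @ true_coef + rng.normal(scale=0.1, size=1000)
    blocks.append((A, b))
print(mapreduce_lstsq(blocks))            # ~ [ 2.0, -1.0, 0.5]
```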

Arun Kumar and Venkatesulu (Arun Kumar and Venkatesulu 2019) suggested a Gramian symmetric data collection-based random forest bivariate regression and classification method to improve prediction accuracy with less complexity. First, they collected a huge data volume and used the Gramian symmetric matrix to store it in the rows and columns of a matrix. Afterward, they performed classification and regression using random decision forests to locate future outcomes. The relationship between the independent variables (i.e., data) and a dependent variable (i.e., outcomes) was measured by the regression process via bivariate correlation. Random decision forests built several decision trees to classify on the basis of the correlation and finally combined the trees through a voting scheme, identifying the majority classification vote to attain greater accuracy.

An influential computational methodology was presented by Shenoy and Gorinevsky (Shenoy and Gorinevsky 2015) for cross-sectional longitudinal analysis of extreme-event statistics in large data sets. The analyzed data were available across multiple time periods and multiple individuals in a population, some of whom may have no extreme events and some of whom may have no data. They modeled extreme events with an exponential-tail or Pareto distribution. The suggested technique is based on a non-parametric Bayesian formulation.

On the basis of Apache Spark, Su and Huang (Su and Huang 2018) developed a real-time predictive maintenance system for detecting imminent hard disk drive (HDD) failures in data centers. The described real-time prediction system can assist IT teams running extensive storage systems by sending notifications of impending drive failures. The framework performs predictive monitoring of HDD failures by analyzing machine log files rather than applying conventional statistical prediction methods.

Mujeeb and Javaid (Mujeeb et al. 2019) considered load and price relationships while proposing two multiple-input multiple-output deep recurrent neural network models for forecasting load and price. The first suggested model, an efficient sparse autoencoder (ESAE) combined with a nonlinear autoregressive network with exogenous inputs (NARX), is composed of feature engineering and forecasting: ESAE was suggested for feature engineering, and forecasting was performed by applying NARX as an existing method. The second suggested model, the differential evolution recurrent extreme learning machine (DE-RELM), is based on the meta-heuristic DE optimization technique and the RELM model. Predictive and descriptive analyses were conducted on big data from two well-known electricity markets, PJM and ISO NE.

4.6.2 Summary of ICT articles

Most articles aim at more accurate predictions in their related domains. These studies also propose several beneficial tools (such as Apache Spark and Hadoop) to be applied for data mining when facing big data. According to the reviewed and discussed ICT articles, a comparison of their specifications is depicted in Table 13. Table 13 indicates the main ideas, evaluation methods, tools, advantages, and disadvantages of each ICT article. In addition, Table 14 displays the improvement of different evaluation metrics in each study. These metrics include accuracy, timeliness, cost, scalability, reliability, performance, validity, and resource utilization.

Table 13 Comparison of ICT articles
Table 14 Evaluation metrics in ICT Articles

4.7 Weather applications

This section presents the application of big data in weather-related activities, divided into five sub-classes. The first class looks at the influence of weather on sport (Zhao et al. 2018; Abeza et al. 2022). The second sub-class discusses the employment of weather prediction in electric power systems (Kezunovic, et al. 2017). The third category mainly focuses on weather forecasting: (Aljawarneh et al. 2020; Alam and Amjad 2019; Failed 2018e, 2017g; Liu et al. 2015; Simpson and Nagarajan 2021; Reis et al. 2022; Roh 2022). The fourth category discusses the prediction of fire risk (Agarwal et al. 2020). The last sub-class explores wind speed by utilizing weather prediction big data (Xu et al. 2020). In Section 4.7.1, the selected weather articles are reviewed; they are then compared in Section 4.7.2.

4.7.1 Overview of weather articles

Zhao, et al. (Zhao et al. 2018) proposed a real-time model to comprehend which weather alterations influence cycling on an off-road trail and an on-road bridge cycling lane. They tested the model by calculating the proportion of cycling during different weather conditions; however, the model was limited in handling varied weather conditions, such as snowfall. Abeza, et al. (Abeza et al. 2022) conducted semi-structured interviews with practitioners in four top leagues in the U.S., considering four factors: transactional, informational, strategic, and infrastructural. Kezunovic, et al. (Kezunovic, et al. 2017) introduced an application of big data to investigate the effects of climatic conditions on power system operation, outcomes, and administration. Their methodology used the spatio-temporal correlation between diverse data sets to create more appropriate decision-making tactics for smart distribution networks. Their framework, based on Gaussian conditional random fields, was applied to two power system applications: risk assessment and spatio-temporal solar generation.

Aljawarneh, et al. (Aljawarneh et al. 2020) presented a system that could cope with big-data-scale climatic variables. They designed three categories of experiments for testing: first, the execution of standard univariate analysis; second, the performance of multivariate analysis compared to univariate analysis; and third, the productivity of the neighbor-based analysis approach compared to the univariate one. The authors considered a local NoSQL database at different levels to execute predictive analysis using univariate and multivariate solutions, as well as forecasting based on training data from neighboring stations. Alam and Amjad (Alam and Amjad 2019) introduced weather data analysis using Hadoop with a multi-node system. They designed a system architecture for weather prediction using a big-data-based analytic approach in a cloud environment to forecast the highest, lowest, average, and mild temperatures.

Madan, et al. (Failed 2018e) explored statistical linear regression and support vector machine learning over continuous streams of information from equipment groups and weather forecasts. The initial results were accurate, and collecting more information further improved the linear regression and support vector machine models toward a sustainable and productive model. Onal, et al. (Failed 2017g) used a big data IoT framework for climate data analysis, performing weather clustering on a publicly available dataset with scikit-learn-based k-means clustering. To examine their use case, the authors provided the implementation details of each framework layer.

Liu, et al. (Liu et al. 2015) employed a computational intelligence technique named the stacked auto-encoder to model three decades of climate data, introducing a greedy layer-wise unsupervised pre-training approach based on a deep neural network. Their proposed model refined the features of the raw weather data layer by layer, and the test outcomes depicted that the newly acquired features can enhance the performance of classical computational intelligence models. Simpson and Nagarajan (Simpson and Nagarajan 2021) presented a deep learning algorithm based on a stacked sparse autoencoder for forecasting the varied climatic conditions of a specific area. They applied principal component analysis to reduce the extracted features by variance. In addition, the authors presented an algorithm based on the binary butterfly optimization algorithm along with a deep stacked autoencoder to improve precision. Simulation results indicated that execution time and prediction errors decreased.
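
A minimal sketch of greedy layer-wise unsupervised pre-training in the spirit of these stacked-autoencoder approaches is shown below; the random matrix stands in for the weather features, and the layer widths and epoch counts are arbitrary assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def pretrain_layer(x, n_hidden, epochs=10):
    """Train one autoencoder layer to reconstruct x; return its encoder + codes."""
    inp = keras.Input(shape=(x.shape[1],))
    code = layers.Dense(n_hidden, activation="relu")(inp)
    recon = layers.Dense(x.shape[1], activation="linear")(code)
    ae = keras.Model(inp, recon)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(x, x, epochs=epochs, batch_size=256, verbose=0)
    encoder = keras.Model(inp, code)
    return encoder, encoder.predict(x, verbose=0)

# Greedy layer-wise pre-training on synthetic stand-in weather features:
# each layer is trained on the codes produced by the previous one.
x = np.random.default_rng(0).normal(size=(5000, 16)).astype("float32")
encoders, h = [], x
for width in (12, 8, 4):            # progressively narrower codes
    enc, h = pretrain_layer(h, width)
    encoders.append(enc)
# `h` now holds 4-dimensional learned features for a downstream predictor.
```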

Reis, et al. (Reis et al. 2022) studied the urban heat island (UHI) in Lisbon's metropolitan area through local weather types, using a Copernicus dataset collected between 2008 and 2014. The analysis found a positive correlation between the modeled air temperature and the measurements, a high precipitation rate, and a decomposition of the UHI. Roh (Roh 2022) used traffic data collected from five different WIM locations in Canada. The author divided each vehicle type into three road categories and detected that winter traffic groups are more transferrable to homogeneous and heterogeneous road segments.

Agarwal, et al. (Agarwal et al. 2020) used two big data sets, fire incident data from the National Fire Incident Reporting System and climatic data from the National Oceanic and Atmospheric Administration, to build an overall record for forecasting and analyzing fire risk. The authors applied gradient-boosting trees and other machine learning algorithms to precisely predict future fire incidents. Xu, et al. (Xu et al. 2020) applied a framework for wind speed forecasting based on the Apache Spark platform, using the Python API for Spark. The authors proposed a synthesis computing framework applied to wind speed, and the simulation results illustrated that the proposed framework on Spark forecasts wind speed precisely and effectively.

4.7.2 Summary of weather articles

The studied articles mainly focused on the accuracy and costs of the presented models for forecasting climatic conditions by using various precise tools, such as Hadoop, GIS, and MongoDB. According to the reviewed and discussed weather articles, the comparison of their specifications is depicted in Table 15. Table 15 indicates the main ideas, evaluation methods, tools, advantages, and disadvantages of each weather article. In addition, Table 16 displays the improvement of different evaluation metrics in each study. These metrics include accuracy, time, performance, reliability, energy, scalability, throughput, sustainability, feasibility, and cost.

Table 15 Comparison of weather articles
Table 16 Evaluation metrics in weather articles

5 Discussion

Previous sections described the review process of the selected BDPA articles in seven groups. In this section, the authors present a statistical analysis of the reviewed articles on the basis of different attributes to answer the RQs stated in Section 3.1:

5.1 Overview of the selected studies

To examine the current state of research on BDPA, the following supplemental questions are considered:

  • Which years saw the greatest number of published articles in the area of BDPA?

  • In which venues did the researchers publish their results?

The articles were classified by year of publication, from 2014 to 2023, as illustrated in Fig. 3. The highest numbers of published articles appeared in 2021 and 2022. Figure 4 depicts the classification of the studied articles over time per publisher, including ScienceDirect, Emerald, IEEE, Springer, Taylor & Francis, Hindawi, Wiley, SAGE, and ACM. Research on BDPA is currently in a progressive state, with researchers actively exploring methodologies, algorithms, and applications for handling and analyzing large datasets. There is also a focus on generative AI and machine learning techniques while addressing the ethical and privacy concerns associated with BDPA. Figure 5 illustrates the classification among the nine publishers, where 34% of the total articles belong to ScienceDirect, 25% to Springer, 21% to IEEE, 6% to Taylor & Francis, 6% to Hindawi, 3% to Emerald, 3% to Wiley, and the smallest proportions are constituted by SAGE and ACM at 1% each.

Fig. 3
figure 3

Distribution of articles by publication year

Fig. 4
figure 4

Distribution of articles over time for each publisher

Fig. 5
figure 5

Percentage of articles by different publishers

5.2 Research aims, methods, and evaluation metrics

This section aims to provide clear answers to the stated RQs 1 to 4, in Section 3.1, based on the collected statistical data.

  • RQ1: What are the fields of prediction analysis applications in big data?

Figure 6 presents a comparison of the big data predictive analytic applications. The best parameters for categorization are the domains and main topics of the articles, which establish a logical relationship between them. The articles were divided into seven categories based on their application. Industrial articles comprised 19% and smart city 17%. Smart healthcare and e-commerce articles together comprised 32%, and smart agriculture made up 12%. Weather and ICT had the fewest articles, at 11% and 9%, respectively. Thus, the industrial category holds the greatest share of BDPA articles, whereas ICT and weather hold the fewest. Table 17 presents a summary of the benefits and limitations of the discussed groups, indicating that all applications have distinguishing features such as better accuracy and better performance.

Fig. 6
figure 6

Percentage of big data predictive analytic approaches

Table 17 The main pros and cons of the discussed classifications
  • RQ2: What are the evaluation metrics of predictive analytics using big data?

There were several metrics for evaluating predictive analytics, but some of them, including accuracy, timeliness, cost, scalability, reliability, performance, and time, were more popular among authors. According to Fig. 7, most of the articles focused on accuracy, at 18%, while resource utilization was considered in only 5% of the reviewed papers. Performance and the remaining metrics (energy, throughput, feasibility, security, precision, and sustainability) constituted 14% each. Likewise, cost and reliability together were the main consideration of 22% of the articles. Figure 8 specifies the importance of each evaluation metric in different categories. The percentages in each category are calculated by applying Eq. (2): the occurrences of each metric are counted separately and divided by the total count of all metrics.

Fig. 7
figure 7

Percentage of evaluation metrics in BDPA

Fig. 8
figure 8

Percentage of evaluation metrics in each category in BDPA

$$\mathrm{Imp}_{\mathrm{percentage}}(i)=\frac{\mathrm{metric}(i)}{\sum_{j=1}^{n}\mathrm{metric}(j)}$$
(2)
  • RQ3: What evaluation methods are used in BDPA?

Figure 9 depicts the comparison results from Tables 3, 5, 7, 9, 11, 13, and 15. According to Fig. 9, 52% of the case studies employed simulation evaluations, 30% considered a real testbed environment, 11% prototyped their study, 6% only designed an algorithm or framework, and the remaining 1% used formal methods.

Fig. 9
figure 9

Percentage of evaluation methods in BDPA

  • RQ4: What are the tools and environments in BDPA?

Based on Tables 3, 5, 7, 9, 11, 13, and 15, it was observed that various tools and modeling environments were used in case studies, such as Hadoop, Apache Spark, and MATLAB. However, many articles did not mention their tools.

6 Open issues and future trends

In this section, the following research question is addressed:

  • RQ5: What are the challenges and future issues of BDPA?

To answer RQ5, the major challenges that impact BDPA are briefly discussed. The main challenges in exploiting big data for predictive analysis are streaming data sources, user privacy, multiple information sources, vertical domain applications, scalability, structured vs. unstructured data, incompleteness, leadership, adoption, and trust. Each of these challenges is then discussed in turn.

  • Streaming data sources

    The recent explosion of social media services like Facebook, Instagram, and Twitter has led to significant interest in social media predictive analytics and has attracted many researchers to automatically inferring hidden information from the large amount of freely available content. It has a large number of diverse applications, including personalized marketing, real-time healthcare analytics (Balduini et al. 2014), online targeted advertising (Kuang et al. 2018), politics (Bianchi 2019), personalized recommendation systems and searches (Cheung et al. 2018), large-scale passive polling, and real-time live polling. One of the most noteworthy topics in social media predictive analytics is sentiment analytics (Park et al. 2018; Nguyen et al. 2014; You et al. 2015), the technique of figuring out whether a posted item is negative, positive, or neutral. Sentiment analytics helps data analysts within large enterprises conduct nuanced market research, gauge public opinion, understand customer experiences, and monitor brand and product reputation (Hajiali 2020). Despite its value and importance, building predictive models for social media data is a challenging and overwhelming task, as social data is large in volume, fast in pace, and heterogeneous in content. How to tackle all these challenges simultaneously is still an open problem.

  • Privacy of data

    Rigorous ethical authorizations are required to collect data that may include users' personal information and to process the collected data for downstream tasks (Scotti 2017; Leary 2015). However, the big data era now makes such data available for organizations to explore without a crystal-clear ethical policy in place. Even though data anonymization is commonly used to protect users' privacy, there remains a potential risk of eliciting critical information from a big pile of data. In addition, predictive analytics can cause privacy concerns, especially when sensitive information is used (Gong et al. 2016); predicting which employees are likely to quit their jobs and delivering that information to their manager is one example. Hence, despite previous studies, understanding the ethical and privacy implications of BDPA is still a very interesting subject.

  • Multiple information sources

    Different pieces of data are often housed in different systems, which may lead to incomplete or inaccurate analysis (Jiang et al. 2016). Combining data manually is time-consuming and can limit insights to what is easily viewed. Many problems related to incorrect insights can be traced back to the way data is gathered, verified, stored, and used. Whether one works with data-sensitive or data-insensitive industries, the tiniest error can be critical to the success of the overall process (Zhao et al. 2019; Li et al. 2016). Furthermore, it is common for inconsistent information to come from different sources. Thus, how to integrate data from different sources to develop better predictive models remains an open issue.

  • Vertical domain applications

    With regard to applications, predictive analytics on big data has been applied to distinct domains, including e-commerce and marketing intelligence (Das et al. 2017; Tuladhar et al. 2018), healthcare (Harris et al. 2016; Belle et al. 2015), finance (Ravi and Kamaruddin 2017), security (Shao et al. 2018), public safety (Turet and Costa 2018), and utilities. For example, online retailers such as eBay, Amazon, and Alibaba use BDPA to collect insights, predict consumer behavior and operational efficiency, and improve their customer relationship management initiatives, decision-making, and marketing campaigns. Although some studies have been done in various industries, this remains an attractive open issue.

  • Scalability

    One of the significant evaluation metrics in BDPA, according to the reviewed articles, is the scalability of algorithms and platforms (Hu et al. 2014). Possessing a huge amount of data is beneficial, but as social networks have become popular, huge and ever-expanding databases have been developed, and system scalability inevitably imposes limitations. Scaling methods are challenging because communication and synchronization overheads rise. Famous and successful organizations try to scale their capabilities, specifically for predictive analytics and data mining, to improve business performance and to decrease fraud (Sun et al. 2019). However, it is complicated to scale conventional predictive analytic approaches and to execute them in the real-time environment that modern enterprise architectures require. Therefore, scaling data and developing scalable analytic algorithms that generate real-time results can be a direction for future work.

  • Structured data vs. unstructured data

    With recent statistical data mining methods, the analysis of structured data is not difficult; however, for unstructured data, techniques such as natural language processing (Quan et al. 2019), multimedia analytics (Fiadino et al. 2016), and text analytics need to be improved. It is also costly to process unstructured data for analysis when conducting a predictive analytics project (Seng and Ang 2019), and it is time-consuming, challenging, and tedious to select, clean, and transform the relevant data.

  • Incompleteness

    As observed in the reviewed articles, the most important evaluation metric for researchers is the accuracy of the predictive model. The accuracy of models is restricted by the exhaustivity and accuracy of the data being used. Because analytical algorithms build models from the available data, inadequacies in the records may lead to deficiencies in the model (Akbari et al. 2017). Equally, the resulting model may not embody sufficient statistics to spot valuable sentinel predictive patterns. Reducing incompleteness can be considered an interesting open issue for future research.

  • Leadership

    To manage challenges, successful enterprises in the data-driven era have their teams set goals, measure attainments, and ask appropriate questions that data insights can answer. The power of big data, beyond its technical aspects, draws on human sight and vision. With the vision and ability to talk about future opportunities and trends, leaders will be capable of acting on and motivating their teams to attain their goals effectively (Shroff 2017; Courtney 2018). So, considering the role of enterprises, leadership in BDPA is another direction for future work.

  • Adoption

    Clearly, the harder a technology is to apply, the less likely it is to be adopted by end-users. Needless to say, this challenge affects predictive analytic solutions, which are demanding to use as standalone tools. To apply them, users are forced to switch from their initial business applications to the predictive analytic solutions. Besides, scaling and deploying traditional predictive tools is difficult, making updating a painstaking process (Al-Qirim et al. 2017; Raguseo 2018). Therefore, this is both a challenge and an opportunity for future studies.

  • Trust

    In industry, a lack of trust in big data seems to be a critical issue. Trust is the willingness to rely on another party, and its formation involves many factors: subjective reasons, such as an individual's predispositions, and objective reasons. In the big data domain, trust can be related to data quality, which is raised by data quality assurance processes. Predictive analytics based on low-quality data will not be reliable (Rubin et al. 2017). Thus, increasing data quality is a challenge and can be another path for further research.

7 Conclusion and limitation

This article provided a systematic review of BDPA. Predictive analytics and big data were investigated, and the relationship between them was elaborated. The research methodology was described, and 109 principal studies published between 2014 and 2023 were chosen out of the 1130 primary articles returned by our search query. According to this SLR, most articles were published in 2021 and 2022, and the fewest were published in 2016. ScienceDirect, publishing 34% of the articles, outnumbered the other publishers of BDPA articles, whereas SAGE and ACM, with 1% each, comprised the fewest. The 109 articles were sorted into seven categories in accordance with their applications, namely industrial, e-commerce, smart healthcare, smart agriculture, smart city, ICT, and weather, and for each class numerous characteristics were reviewed and compared. Accuracy was the main concern of the researchers, with the highest percentage, 19%, among the evaluation metrics, while time, validity, and resource utilization had the least importance in the reviewed articles. Regarding the evaluation method, 52% of the studies implemented a simulation, 30% used a real testbed environment, 11% built a prototype, 6% designed a new model, and 1% used formal methods. Moreover, to support the design and development of more efficient architectures, frameworks, and algorithms in BDPA in the future, a detailed description of the open issues and challenges of BDPA was presented. Ultimately, considering RQ5, to create a more functional BDPA, open issues and future challenges such as streaming data sources, user privacy, multiple information sources, vertical domain applications, scalability, structured vs. unstructured data, incompleteness, leadership, adoption, and trust ought to be addressed. The practical implications of this SLR extend to its role as a comprehensive guide in the domain of BDPA. This work serves as a stimulus for researchers, markets, and industry to implement BDPA in their plans and improve the accuracy of their work, and we hope that researchers in the BDPA domain will cooperate to make progress and investigate this field further. This SLR provided significant insights into ongoing research trends, industry integration, and policy implications. By delineating key research domains, showcasing potential applications in diverse sectors, and outlining the technological innovations necessary to tackle current obstacles, this review serves as a valuable resource for stakeholders. Furthermore, it emphasized the ethical imperatives and cautious implementation of BDPA tools, aiming to facilitate a shift towards proactive predictive analytics in tandem with adherence to regulatory standards and the protection of individual privacy. Nevertheless, the thorough exploration of BDPA outlined in this study is accompanied by a few limitations, including the following:

  • Language: Non-English articles have been omitted.

  • Research domain: Different sources have covered BDPA. JCR-indexed journals and well-known conferences were included to attain comprehensive coverage, while nationally published articles were excluded. In addition, book chapters, survey articles, and editorial articles were not considered.

  • Study and publication bias: Google Scholar, Springer, IEEE, ScienceDirect, SAGE, ACM, World Scientific, Emerald, Wiley, Hindawi, and Taylor & Francis were selected as electronic databases. The statistics show that these electronic databases supply the most relevant and valid articles. Nevertheless, the selection of all applicable studies cannot be guaranteed, and some suitable articles may have been excluded by the described processes.

  • Taxonomy: The articles are classified into seven categories based on application: industrial, e-commerce, smart healthcare, smart agriculture, smart city, ICT, and weather. However, they could be classified in broader terms.

  • Study queries: To scope this study, five research questions were chosen; however, other questions could be considered.